2020-07-21 04:18:52

by Sergey Senozhatsky

Subject: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

Hello,

RFC

We noticed that in a number of cases, when we wake_up_process()
on an arm64 guest, we end up enqueuing that task on a preempted VCPU.
The culprit appears to be the fact that arm64 guests are not aware of
VCPU preemption as such: when the scheduler picks an idle VCPU, it
always assumes that the VCPU is available:

wake_up_process()
try_to_wake_up()
select_task_rq_fair()
available_idle_cpu()
vcpu_is_preempted() // return false;

Which is, obviously, not the case.
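
For reference, this is roughly how the scheduler consults the hint
(paraphrased from available_idle_cpu() in kernel/sched/core.c); today
the arm64 vcpu_is_preempted() stub hardwires `false':

/* A CPU only counts as "available idle" if its vCPU is not preempted */
int available_idle_cpu(int cpu)
{
	if (!idle_cpu(cpu))
		return 0;

	if (vcpu_is_preempted(cpu))
		return 0;

	return 1;
}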

This RFC patch set adds a simple vcpu_is_preempted() implementation so
that the scheduler can make better decisions when it searches for an
idle (v)CPU.

I ran a number of sched benchmarks; please refer to [0] for more
details.

[0] https://github.com/sergey-senozhatsky/arm64-vcpu_is_preempted

Sergey Senozhatsky (4):
arm64:kvm: define pv_state SMCCC HV calls
arm64: add guest pvstate support
arm64: add host pvstate support
arm64: do not use dummy vcpu_is_preempted() anymore

arch/arm64/include/asm/kvm_host.h | 23 ++++++
arch/arm64/include/asm/paravirt.h | 15 ++++
arch/arm64/include/asm/spinlock.h | 17 +++--
arch/arm64/kernel/Makefile | 2 +-
arch/arm64/kernel/paravirt-state.c | 117 +++++++++++++++++++++++++++++
arch/arm64/kernel/paravirt.c | 4 +-
arch/arm64/kernel/time.c | 1 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 4 +
arch/arm64/kvm/hypercalls.c | 11 +++
arch/arm64/kvm/pvstate.c | 58 ++++++++++++++
include/linux/arm-smccc.h | 18 +++++
12 files changed, 262 insertions(+), 10 deletions(-)
create mode 100644 arch/arm64/kernel/paravirt-state.c
create mode 100644 arch/arm64/kvm/pvstate.c

--
2.27.0


2020-07-21 04:18:52

by Sergey Senozhatsky

Subject: [RFC][PATCH 1/4] arm64:kvm: define pv_state SMCCC HV calls

These will be used later on to configure and enable vCPU
PV-state support, which is needed for vcpu_is_preempted().
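
For reference, with the SMCCC 1.1 encoding (fast-call bit 31, SMC64
bit 30, owner `standard hypervisor service' == 5 in bits 29:24) these
definitions expand to 0xc5000022..0xc5000024, directly following the
existing PV_TIME IDs. A hypothetical compile-time check, not part of
this patch:

#include <linux/arm-smccc.h>
#include <linux/build_bug.h>

static inline void pv_state_id_check(void)
{
	/* Values computed by hand from ARM_SMCCC_CALL_VAL() */
	BUILD_BUG_ON(ARM_SMCCC_HV_PV_STATE_FEATURES != 0xc5000022);
	BUILD_BUG_ON(ARM_SMCCC_HV_PV_STATE_INIT != 0xc5000023);
	BUILD_BUG_ON(ARM_SMCCC_HV_PV_STATE_RELEASE != 0xc5000024);
}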

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
include/linux/arm-smccc.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 15c706fb0a37..cba054662692 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -99,6 +99,24 @@
ARM_SMCCC_OWNER_STANDARD_HYP, \
0x21)

+#define ARM_SMCCC_HV_PV_STATE_FEATURES \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_64, \
+ ARM_SMCCC_OWNER_STANDARD_HYP, \
+ 0x22)
+
+#define ARM_SMCCC_HV_PV_STATE_INIT \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_64, \
+ ARM_SMCCC_OWNER_STANDARD_HYP, \
+ 0x23)
+
+#define ARM_SMCCC_HV_PV_STATE_RELEASE \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_64, \
+ ARM_SMCCC_OWNER_STANDARD_HYP, \
+ 0x24)
+
/*
* Return codes defined in ARM DEN 0070A
* ARM DEN 0070A is now merged/consolidated into ARM DEN 0028 C
--
2.27.0

2020-07-21 04:19:02

by Sergey Senozhatsky

Subject: [RFC][PATCH 2/4] arm64: add guest pvstate support

PV-state is a per-CPU struct which, for the time being, holds a
boolean `preempted' vCPU state. During startup, provided that the
host supports PV-state, each guest vCPU sends a pointer to its
per-CPU variable to the host as the payload of an SMCCC HV call,
so that the host can update the vCPU state when it puts or loads
that vCPU.

This has an impact on the guest's scheduler, which now checks the
state of the vCPU it wants to run a task on:

[..]
wake_up_process()
try_to_wake_up()
select_task_rq_fair()
available_idle_cpu()
vcpu_is_preempted()

Some sched benchmark data is available on the github page [0].

[0] https://github.com/sergey-senozhatsky/arm64-vcpu_is_preempted
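
Note that struct pvstate_vcpu_info below is padded to 64 bytes,
presumably so that each vCPU's state occupies a cache line of its
own and leaves room for future fields. A hypothetical guard for
that invariant (not part of this patch):

#include <linux/build_bug.h>
#include <asm/paravirt.h>

static inline void pvstate_layout_check(void)
{
	/* `preempted' plus 63 reserved bytes == one 64-byte line */
	BUILD_BUG_ON(sizeof(struct pvstate_vcpu_info) != 64);
}
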
Signed-off-by: Sergey Senozhatsky <[email protected]>
---
arch/arm64/include/asm/paravirt.h | 15 ++++
arch/arm64/kernel/Makefile | 2 +-
arch/arm64/kernel/paravirt-state.c | 117 +++++++++++++++++++++++++++++
arch/arm64/kernel/paravirt.c | 4 +-
arch/arm64/kernel/time.c | 1 +
5 files changed, 137 insertions(+), 2 deletions(-)
create mode 100644 arch/arm64/kernel/paravirt-state.c

diff --git a/arch/arm64/include/asm/paravirt.h b/arch/arm64/include/asm/paravirt.h
index cf3a0fd7c1a7..1bf164b2041b 100644
--- a/arch/arm64/include/asm/paravirt.h
+++ b/arch/arm64/include/asm/paravirt.h
@@ -7,12 +7,22 @@ struct static_key;
extern struct static_key paravirt_steal_enabled;
extern struct static_key paravirt_steal_rq_enabled;

+struct pvstate_vcpu_info {
+ bool preempted;
+ u8 reserved[63];
+};
+
+struct pv_state_ops {
+ bool (*vcpu_is_preempted)(int cpu);
+};
+
struct pv_time_ops {
unsigned long long (*steal_clock)(int cpu);
};

struct paravirt_patch_template {
struct pv_time_ops time;
+ struct pv_state_ops state;
};

extern struct paravirt_patch_template pv_ops;
@@ -22,10 +32,15 @@ static inline u64 paravirt_steal_clock(int cpu)
return pv_ops.time.steal_clock(cpu);
}

+bool native_vcpu_is_preempted(int cpu);
+bool paravirt_vcpu_is_preempted(int cpu);
+
+int __init pv_state_init(void);
int __init pv_time_init(void);

#else

+#define pv_state_init() do {} while (0)
#define pv_time_init() do {} while (0)

#endif // CONFIG_PARAVIRT
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 5fb9b728459b..18974d5e798d 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -48,7 +48,7 @@ obj-$(CONFIG_ARMV8_DEPRECATED) += armv8_deprecated.o
obj-$(CONFIG_ACPI) += acpi.o
obj-$(CONFIG_ACPI_NUMA) += acpi_numa.o
obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL) += acpi_parking_protocol.o
-obj-$(CONFIG_PARAVIRT) += paravirt.o
+obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt-state.o
obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o \
diff --git a/arch/arm64/kernel/paravirt-state.c b/arch/arm64/kernel/paravirt-state.c
new file mode 100644
index 000000000000..4ae92a84c73d
--- /dev/null
+++ b/arch/arm64/kernel/paravirt-state.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt) "arm-pvstate: " fmt
+
+#include <linux/arm-smccc.h>
+#include <linux/cpuhotplug.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/jump_label.h>
+#include <linux/printk.h>
+#include <linux/psci.h>
+#include <linux/reboot.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include <asm/paravirt.h>
+#include <asm/smp_plat.h>
+
+static DEFINE_PER_CPU(struct pvstate_vcpu_info, vcpus_states);
+
+bool native_vcpu_is_preempted(int cpu)
+{
+ return false;
+}
+
+static bool pv_vcpu_is_preempted(int cpu)
+{
+ struct pvstate_vcpu_info *st;
+
+ st = &per_cpu(vcpus_states, cpu);
+ return READ_ONCE(st->preempted);
+}
+
+bool paravirt_vcpu_is_preempted(int cpu)
+{
+ return pv_ops.state.vcpu_is_preempted(cpu);
+}
+
+static bool has_pvstate(void)
+{
+ struct arm_smccc_res res;
+
+ /* To detect the presence of PV state support we require SMCCC 1.1+ */
+ if (arm_smccc_1_1_get_conduit() == SMCCC_CONDUIT_NONE)
+ return false;
+
+ arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
+ ARM_SMCCC_HV_PV_STATE_FEATURES,
+ &res);
+
+ if (res.a0 != SMCCC_RET_SUCCESS)
+ return false;
+ return true;
+}
+
+static int __pvstate_cpu_hook(unsigned int cpu, int event)
+{
+ struct arm_smccc_res res;
+ struct pvstate_vcpu_info *st;
+
+ st = &per_cpu(vcpus_states, cpu);
+ arm_smccc_1_1_invoke(event, virt_to_phys(st), &res);
+ if (res.a0 != SMCCC_RET_SUCCESS)
+ return -EINVAL;
+ return 0;
+}
+
+static int pvstate_cpu_init(unsigned int cpu)
+{
+ int ret = __pvstate_cpu_hook(cpu, ARM_SMCCC_HV_PV_STATE_INIT);
+
+ if (ret)
+ pr_warn("Unable to ARM_SMCCC_HV_PV_STATE_INIT\n");
+ return ret;
+}
+
+static int pvstate_cpu_release(unsigned int cpu)
+{
+ int ret = __pvstate_cpu_hook(cpu, ARM_SMCCC_HV_PV_STATE_RELEASE);
+
+ if (ret)
+ pr_warn("Unable to ARM_SMCCC_HV_PV_STATE_RELEASE\n");
+ return ret;
+}
+
+static int pvstate_register_hooks(void)
+{
+ int ret;
+
+ ret = cpuhp_setup_state(CPUHP_AP_ARM_KVMPV_STARTING,
+ "hypervisor/arm/pvstate:starting",
+ pvstate_cpu_init,
+ pvstate_cpu_release);
+ if (ret < 0)
+ pr_warn("Failed to register CPU hooks\n");
+ return ret;
+}
+
+static int __pvstate_init(void)
+{
+ return pvstate_register_hooks();
+}
+
+int __init pv_state_init(void)
+{
+ int ret;
+
+ if (!has_pvstate())
+ return 0;
+
+ ret = __pvstate_init();
+ if (ret)
+ return ret;
+
+ pv_ops.state.vcpu_is_preempted = pv_vcpu_is_preempted;
+ return 0;
+}
diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index 295d66490584..3fec7563ac27 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -26,7 +26,9 @@
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;

-struct paravirt_patch_template pv_ops;
+struct paravirt_patch_template pv_ops = {
+ .state.vcpu_is_preempted = native_vcpu_is_preempted,
+};
EXPORT_SYMBOL_GPL(pv_ops);

struct pv_time_stolen_time_region {
diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c
index eebbc8d7123e..50c55792f72b 100644
--- a/arch/arm64/kernel/time.c
+++ b/arch/arm64/kernel/time.c
@@ -68,4 +68,5 @@ void __init time_init(void)
lpj_fine = arch_timer_rate / HZ;

pv_time_init();
+ pv_state_init();
}
--
2.27.0

2020-07-21 04:20:01

by Sergey Senozhatsky

Subject: [RFC][PATCH 3/4] arm64: add host pvstate support

Add PV-state support bits to the host. The host uses the guest's
per-CPU PV-state pointers to update the VCPU state each time it
runs kvm_arch_vcpu_load() or kvm_arch_vcpu_put() on that VCPU, so
that the guest scheduler can become aware of the fact that not all
VCPUs are always available. Currently, the guest scheduler on
arm64 always assumes that all CPUs are available, because
vcpu_is_preempted() is not implemented on arm64.
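
Condensed from the arm.c hunk below, the host-side hooks boil down
to flipping the flag at the two points where a vCPU gains or loses
a physical CPU (elided bodies marked with comments):

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	/* ... existing load work ... */
	kvm_update_vcpu_preempted(vcpu, false);	/* about to run */
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
	kvm_update_vcpu_preempted(vcpu, true);	/* scheduled out */
	/* ... existing put work ... */
}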

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 23 ++++++++++++
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 4 +++
arch/arm64/kvm/hypercalls.c | 11 ++++++
arch/arm64/kvm/pvstate.c | 58 +++++++++++++++++++++++++++++++
5 files changed, 97 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/pvstate.c

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f6485c1b8c15..e6f325a4ba70 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -372,6 +372,12 @@ struct kvm_vcpu_arch {
u64 last_steal;
gpa_t base;
} steal;
+
+ /* PV state of the VCPU (e.g. vcpu_is_preempted()) */
+ struct {
+ gpa_t base;
+ struct gfn_to_hva_cache ghc;
+ } pv_state;
};

/* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
@@ -556,6 +562,23 @@ static inline bool kvm_arm_is_pvtime_enabled(struct kvm_vcpu_arch *vcpu_arch)
return (vcpu_arch->steal.base != GPA_INVALID);
}

+int kvm_init_vcpu_state(struct kvm_vcpu *vcpu, gpa_t addr);
+int kvm_release_vcpu_state(struct kvm_vcpu *vcpu);
+
+static inline void kvm_arm_vcpu_state_init(struct kvm_vcpu_arch *vcpu_arch)
+{
+ vcpu_arch->pv_state.base = GPA_INVALID;
+ memset(&vcpu_arch->pv_state.ghc, 0x00, sizeof(struct gfn_to_hva_cache));
+}
+
+static inline bool
+kvm_arm_is_vcpu_state_enabled(struct kvm_vcpu_arch *vcpu_arch)
+{
+ return (vcpu_arch->pv_state.base != GPA_INVALID);
+}
+
+void kvm_update_vcpu_preempted(struct kvm_vcpu *vcpu, bool preempted);
+
void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);

struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 99977c1972cc..efc05ac0d781 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_KVM) += hyp/

kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
$(KVM)/vfio.o $(KVM)/irqchip.o \
- arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
+ arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o pvstate.o \
inject_fault.o regmap.o va_layout.o hyp.o handle_exit.o \
guest.o debug.o reset.o sys_regs.o \
vgic-sys-reg-v3.o fpsimd.o pmu.o \
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 691d21e4c717..a1229e3b7117 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -265,6 +265,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)

kvm_arm_pvtime_vcpu_init(&vcpu->arch);

+ kvm_arm_vcpu_state_init(&vcpu->arch);
+
vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;

err = kvm_vgic_vcpu_init(vcpu);
@@ -355,10 +357,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

if (vcpu_has_ptrauth(vcpu))
vcpu_ptrauth_disable(vcpu);
+ kvm_update_vcpu_preempted(vcpu, false);
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
+ kvm_update_vcpu_preempted(vcpu, true);
kvm_arch_vcpu_put_fp(vcpu);
if (has_vhe())
kvm_vcpu_put_sysregs_vhe(vcpu);
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index 550dfa3e53cd..3660552a8e02 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -52,6 +52,9 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
case ARM_SMCCC_HV_PV_TIME_FEATURES:
val = SMCCC_RET_SUCCESS;
break;
+ case ARM_SMCCC_HV_PV_STATE_FEATURES:
+ val = SMCCC_RET_SUCCESS;
+ break;
}
break;
case ARM_SMCCC_HV_PV_TIME_FEATURES:
@@ -62,6 +65,14 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
if (gpa != GPA_INVALID)
val = gpa;
break;
+ case ARM_SMCCC_HV_PV_STATE_INIT:
+ if (kvm_init_vcpu_state(vcpu, smccc_get_arg1(vcpu)) == 0)
+ val = SMCCC_RET_SUCCESS;
+ break;
+ case ARM_SMCCC_HV_PV_STATE_RELEASE:
+ if (kvm_release_vcpu_state(vcpu) == 0)
+ val = SMCCC_RET_SUCCESS;
+ break;
default:
return kvm_psci_call(vcpu);
}
diff --git a/arch/arm64/kvm/pvstate.c b/arch/arm64/kvm/pvstate.c
new file mode 100644
index 000000000000..3152555f3ed2
--- /dev/null
+++ b/arch/arm64/kvm/pvstate.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/arm-smccc.h>
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_mmu.h>
+#include <asm/paravirt.h>
+
+#include <kvm/arm_hypercalls.h>
+
+int kvm_init_vcpu_state(struct kvm_vcpu *vcpu, gpa_t addr)
+{
+ if (kvm_arm_is_vcpu_state_enabled(&vcpu->arch))
+ return 0;
+
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm,
+ &vcpu->arch.pv_state.ghc,
+ addr,
+ sizeof(struct pvstate_vcpu_info)))
+ return -EINVAL;
+
+ vcpu->arch.pv_state.base = addr;
+ return 0;
+}
+
+int kvm_release_vcpu_state(struct kvm_vcpu *vcpu)
+{
+ if (!kvm_arm_is_vcpu_state_enabled(&vcpu->arch))
+ return 0;
+
+ vcpu->arch.pv_state.base = GPA_INVALID;
+ return 0;
+}
+
+void kvm_update_vcpu_preempted(struct kvm_vcpu *vcpu, bool preempted)
+{
+ struct kvm *kvm = vcpu->kvm;
+ int idx;
+
+ if (!kvm_arm_is_vcpu_state_enabled(&vcpu->arch))
+ return;
+
+ /*
+ * This function is called from atomic context, so we need to
+ * disable page faults. kvm_write_guest_cached() will call
+ * might_fault().
+ */
+ pagefault_disable();
+ /*
+ * Need to take the SRCU lock because kvm_write_guest_cached()
+ * calls kvm_memslots();
+ */
+ idx = srcu_read_lock(&kvm->srcu);
+ kvm_write_guest_cached(kvm, &vcpu->arch.pv_state.ghc,
+ &preempted, sizeof(bool));
+ srcu_read_unlock(&kvm->srcu, idx);
+ pagefault_enable();
+}
--
2.27.0

2020-07-21 04:22:21

by Sergey Senozhatsky

Subject: [RFC][PATCH 4/4] arm64: do not use dummy vcpu_is_preempted() anymore

vcpu_is_preempted() can now represent the actual state of the VCPU,
so the scheduler can make better decisions when it picks an idle
CPU to enqueue a task on. I executed a whole bunch of scheduler
tests [0]. One particular test that shows the importance of
vcpu_is_preempted() is the AIO stress-ng test:

x Disabled vcpu_is_preempted()
stress-ng: info: [100] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [100] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [100] aio 222927 10.01 0.89 27.61 22262.04 7822.00
stress-ng: info: [139] aio 217043 10.01 1.00 26.80 21685.46 7807.30
stress-ng: info: [178] aio 217261 10.01 1.08 26.79 21709.36 7795.51

+ Enabled vcpu_is_preempted()
stress-ng: info: [100] aio 432750 10.00 1.14 19.03 43264.33 21455.13
stress-ng: info: [139] aio 426771 10.01 1.09 18.67 42629.13 21597.72
stress-ng: info: [179] aio 533039 10.00 1.42 20.39 53281.70 24440.12

[0] https://github.com/sergey-senozhatsky/arm64-vcpu_is_preempted

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
arch/arm64/include/asm/spinlock.h | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index 9083d6992603..6a390eeabe82 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -11,17 +11,20 @@
/* See include/linux/spinlock.h */
#define smp_mb__after_spinlock() smp_mb()

-/*
- * Changing this will break osq_lock() thanks to the call inside
- * smp_cond_load_relaxed().
- *
- * See:
- * https://lore.kernel.org/lkml/[email protected]
- */
#define vcpu_is_preempted vcpu_is_preempted
+
+#ifdef CONFIG_PARAVIRT
+extern bool paravirt_vcpu_is_preempted(int cpu);
+
+static inline bool vcpu_is_preempted(int cpu)
+{
+ return paravirt_vcpu_is_preempted(cpu);
+}
+#else
static inline bool vcpu_is_preempted(int cpu)
{
return false;
}
+#endif /* CONFIG_PARAVIRT */

#endif /* __ASM_SPINLOCK_H */
--
2.27.0

2020-08-17 02:05:34

by Sergey Senozhatsky

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

On (20/07/21 13:17), Sergey Senozhatsky wrote:
> Hello,
>
> RFC
>
> We noticed that in a number of cases when we wake_up_process()
> on arm64 guest we end up enqueuing that task on a preempted VCPU. The culprit
> appears to be the fact that arm64 guests are not aware of VCPU preemption
> as such, so when sched picks up an idle VCPU it always assumes that VCPU
> is available:
>
> wake_up_process()
> try_to_wake_up()
> select_task_rq_fair()
> available_idle_cpu()
> vcpu_is_preempted() // return false;
>
> Which is, obviously, not the case.
>
> This RFC patch set adds a simple vcpu_is_preempted() implementation so
> that the scheduler can make better decisions when it searches for an
> idle (v)CPU.

Hi,

A gentle ping.

-ss

2020-08-17 12:07:12

by yezengruan

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

On 2020/8/17 10:03, Sergey Senozhatsky wrote:
> On (20/07/21 13:17), Sergey Senozhatsky wrote:
>> Hello,
>>
>> RFC
>>
>> We noticed that in a number of cases when we wake_up_process()
>> on arm64 guest we end up enqueuing that task on a preempted VCPU. The culprit
>> appears to be the fact that arm64 guests are not aware of VCPU preemption
>> as such, so when sched picks up an idle VCPU it always assumes that VCPU
>> is available:
>>
>> wake_up_process()
>> try_to_wake_up()
>> select_task_rq_fair()
>> available_idle_cpu()
>> vcpu_is_preempted() // return false;
>>
>> Which is, obviously, not the case.
>>
>> This RFC patch set adds a simple vcpu_is_preempted() implementation so
>> that the scheduler can make better decisions when it searches for an
>> idle (v)CPU.
> Hi,
>
> A gentle ping.
>
> -ss

Hi Sergey,

I have a set of patches similar to yours.

https://lore.kernel.org/lkml/[email protected]/

2020-08-17 12:29:14

by Marc Zyngier

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

On 2020-08-17 13:03, yezengruan wrote:
> On 2020/8/17 10:03, Sergey Senozhatsky wrote:
>> On (20/07/21 13:17), Sergey Senozhatsky wrote:
>>> Hello,
>>>
>>> RFC
>>>
>>> We noticed that in a number of cases when we wake_up_process()
>>> on arm64 guest we end up enqueuing that task on a preempted VCPU. The
>>> culprit
>>> appears to be the fact that arm64 guests are not aware of VCPU
>>> preemption
>>> as such, so when sched picks up an idle VCPU it always assumes that
>>> VCPU
>>> is available:
>>>
>>> wake_up_process()
>>> try_to_wake_up()
>>> select_task_rq_fair()
>>> available_idle_cpu()
>>> vcpu_is_preempted() // return false;
>>>
>>> Which is, obviously, not the case.
>>>
>>> This RFC patch set adds a simple vcpu_is_preempted() implementation so
>>> that the scheduler can make better decisions when it searches for an
>>> idle (v)CPU.
>> Hi,
>>
>> A gentle ping.
>>
>> -ss
>
> Hi Sergey,
>
> I have a set of patches similar to yours.
>
> https://lore.kernel.org/lkml/[email protected]/

It really isn't the same thing at all. You are exposing PV spinlocks,
while Sergey exposes preemption to vcpus. The former is a massive,
and probably unnecessary, superset of the latter, which only impacts
the scheduler (it doesn't change the way locks are implemented).

You really shouldn't conflate the two (which you have done in your
series).

M.
--
Jazz is not dead. It just smells funny...

2020-08-17 14:17:49

by yezengruan

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

On 2020/8/17 20:25, Marc Zyngier wrote:
> On 2020-08-17 13:03, yezengruan wrote:
>> On 2020/8/17 10:03, Sergey Senozhatsky wrote:
>>> On (20/07/21 13:17), Sergey Senozhatsky wrote:
>>>> Hello,
>>>>
>>>>     RFC
>>>>
>>>>     We noticed that in a number of cases when we wake_up_process()
>>>> on arm64 guest we end up enqueuing that task on a preempted VCPU. The culprit
>>>> appears to be the fact that arm64 guests are not aware of VCPU preemption
>>>> as such, so when sched picks up an idle VCPU it always assumes that VCPU
>>>> is available:
>>>>
>>>>       wake_up_process()
>>>>        try_to_wake_up()
>>>>         select_task_rq_fair()
>>>>          available_idle_cpu()
>>>>           vcpu_is_preempted()    // return false;
>>>>
>>>> Which is, obviously, not the case.
>>>>
>>>> This RFC patch set adds a simple vcpu_is_preempted() implementation so
>>>> that the scheduler can make better decisions when it searches for an
>>>> idle (v)CPU.
>>> Hi,
>>>
>>> A gentle ping.
>>>
>>>     -ss
>>
>> Hi Sergey,
>>
>> I have a set of patches similar to yours.
>>
>> https://lore.kernel.org/lkml/[email protected]/
>
> It really isn't the same thing at all. You are exposing PV spinlocks,
> while Sergey exposes preemption to vcpus. The former is a massive,
> and probably unnecessary superset of the latter, which only impacts
> the scheduler (it doesn't change the way locks are implemented).
>
> You really shouldn't conflate the two (which you have done in your
> series).
>
>         M.


Hi Marc,

Actually, both series support a paravirtualized vcpu_is_preempted().
My series regards this as PV lock support, but only the
vcpu_is_preempted() interface of pv_lock_opt is implemented.

Besides wake_up_process(), vcpu_is_preempted() is used in the current
kernel in the following scenarios (a representative helper is
sketched after the list):

kernel/sched/core.c:                            <---- wake_up_process()
--------------------
available_idle_cpu
    vcpu_is_preempted

kernel/locking/rwsem.c:
-----------------------
rwsem_optimistic_spin
    rwsem_spin_on_owner
        owner_on_cpu
            vcpu_is_preempted

kernel/locking/mutex.c:
-----------------------
mutex_optimistic_spin
    mutex_spin_on_owner
        vcpu_is_preempted

kernel/locking/osq_lock.c:
--------------------------
osq_lock
    vcpu_is_preempted
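
For instance, the rwsem path stops optimistically spinning as soon as
the lock holder's (v)CPU is preempted; paraphrased from owner_on_cpu()
in kernel/locking/rwsem.c:

static inline bool owner_on_cpu(struct task_struct *owner)
{
	/*
	 * Skip spinning if the owner is not running, or if its CPU
	 * has been preempted by the host.
	 */
	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
}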


Thanks,

Zengruan

2020-09-11 08:47:35

by Sergey Senozhatsky

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

Hi,

On (20/08/17 20:03), yezengruan wrote:
> Hi Sergey,
>
> I have a set of patches similar to yours.
>
> https://lore.kernel.org/lkml/[email protected]/

I'm sorry for the belated reply.

Right, quite similar, but not exactly, I believe. I deliberately wanted
to untangle vcpu preemption (which is a characteristic feature) from
pv-lock, which may be somewhat implementation dependent.

Perhaps vcpu_is_preempted() should not even be implemented on a
per-arch basis; instead it could be more of a "core" functionality.

-ss

2020-09-11 09:00:20

by Sergey Senozhatsky

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

My apologies for the slow reply.

On (20/08/17 13:25), Marc Zyngier wrote:
>
> It really isn't the same thing at all. You are exposing PV spinlocks,
> while Sergey exposes preemption to vcpus.
>

Correct, we see vcpu preemption as a "fundamental" feature, with
consequences that affect scheduling, which is a core feature :)

Marc, is there anything in particular that you dislike about this RFC
patch set? Joel has some ideas, which we may discuss offline if that
works for you.

-ss

2020-12-08 21:01:30

by Joel Fernandes

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

On Fri, Sep 11, 2020 at 4:58 AM Sergey Senozhatsky
<[email protected]> wrote:
>
> My apologies for the slow reply.
>
> On (20/08/17 13:25), Marc Zyngier wrote:
> >
> > It really isn't the same thing at all. You are exposing PV spinlocks,
> > while Sergey exposes preemption to vcpus.
> >
>
> Correct, we see vcpu preemption as a "fundamental" feature, with
> consequences that affect scheduling, which is a core feature :)
>
> Marc, is there anything in particular that you dislike about this RFC
> patch set? Joel has some ideas, which we may discuss offline if that
> works for you.

Hi Marc, Sergey, just checking: what is the latest on this series?

About the idea Sergey and I discussed: at a high level, we discussed
being able to share information similar to "Is the vCPU preempted?"
using a more arch-independent infrastructure. I do not believe this
needs to be arch-specific. Maybe the specific mechanism for how to
share a page of information needs to be arch-specific, but the actual
information shared need not be. This could open the door to sharing
more such information in an arch-independent way (for example, if the
scheduler needs to know other information, such as the capacity of
the CPU that the vCPU is on).
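
To illustrate the kind of arch-independent layout I mean (this is a
sketch; every field name below is hypothetical):

/* Hypothetical arch-independent per-vCPU shared page; only the
 * mechanism that registers its address would stay arch-specific. */
struct vcpu_shared_info {
	__le32 version;		/* layout revision */
	__le32 flags;
	__u8   preempted;	/* "is the vCPU preempted?" */
	__u8   reserved[3];
	__le32 cpu_capacity;	/* capacity of the backing pCPU */
};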

Other thoughts?

thanks,

- Joel

2020-12-09 23:06:32

by Marc Zyngier

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

Hi all,

On 2020-12-08 20:02, Joel Fernandes wrote:
> On Fri, Sep 11, 2020 at 4:58 AM Sergey Senozhatsky
> <[email protected]> wrote:
>>
>> My apologies for the slow reply.
>>
>> On (20/08/17 13:25), Marc Zyngier wrote:
>> >
>> > It really isn't the same thing at all. You are exposing PV spinlocks,
>> > while Sergey exposes preemption to vcpus.
>> >
>>
>> Correct, we see vcpu preemption as a "fundamental" feature, with
>> consequences that affect scheduling, which is a core feature :)
>>
>> Marc, is there anything in particular that you dislike about this RFC
>> patch set? Joel has some ideas, which we may discuss offline if that
>> works for you.
>
> Hi Marc, Sergey, Just checking what is the latest on this series?

I was planning to give it a go, but obviously got sidetracked. :-(

>
> About the idea Sergey and I discussed: at a high level, we discussed
> being able to share information similar to "Is the vCPU preempted?"
> using a more arch-independent infrastructure. I do not believe this
> needs to be arch-specific. Maybe the specific mechanism for how to
> share a page of information needs to be arch-specific, but the actual
> information shared need not be.

We already have some information sharing in the form of steal time
accounting, and I believe this "vcpu preempted" falls in the same
bucket. It looks like we could implement the feature as an extension
of the steal-time accounting, as the two concepts are linked
(one describes the accumulation of non-running time, the other is
instantaneous).
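
For example (a sketch, not a proposed ABI): the stolen-time record
from ARM DEN 0057A could grow an instantaneous flag, carved out of
the existing padding so the record stays 64 bytes:

struct pvclock_vcpu_stolen_time {
	__le32 revision;
	__le32 attributes;
	__le64 stolen_time;	/* accumulated, nanoseconds */
	__le32 preempted;	/* hypothetical: set while scheduled out */
	u8 padding[44];		/* was 48 */
};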

> This could open the door to sharing
> more such information in an arch-independent way (for example, if the
> scheduler needs to know other information such as the capacity of the
> CPU that the vCPU is on).

Quentin and I have discussed potential ways of improving guest
scheduling
on terminally broken systems (otherwise known as big-little), in the
form of a capacity request from the guest to the host. I'm not really
keen on the host exposing its own capacity, as that doesn't tell the
host what the guest actually needs.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2020-12-10 13:51:13

by Joel Fernandes

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

Hi Marc, nice to hear from you.

On Wed, Dec 9, 2020 at 4:43 AM Marc Zyngier <[email protected]> wrote:
>
> Hi all,
>
> On 2020-12-08 20:02, Joel Fernandes wrote:
> > On Fri, Sep 11, 2020 at 4:58 AM Sergey Senozhatsky
> > <[email protected]> wrote:
> >>
> >> My apologies for the slow reply.
> >>
> >> On (20/08/17 13:25), Marc Zyngier wrote:
> >> >
> >> > It really isn't the same thing at all. You are exposing PV spinlocks,
> >> > while Sergey exposes preemption to vcpus.
> >> >
> >>
> >> Correct, we see vcpu preemption as a "fundamental" feature, with
> >> consequences that affect scheduling, which is a core feature :)
> >>
> >> Marc, is there anything in particular that you dislike about this RFC
> >> patch set? Joel has some ideas, which we may discuss offline if that
> >> works for you.
> >
> > Hi Marc, Sergey, Just checking what is the latest on this series?
>
> I was planning to give it a go, but obviously got sidetracked. :-(

Ah, that happens.

> > About the idea Sergey and I discussed: at a high level, we discussed
> > being able to share information similar to "Is the vCPU preempted?"
> > using a more arch-independent infrastructure. I do not believe this
> > needs to be arch-specific. Maybe the specific mechanism for how to
> > share a page of information needs to be arch-specific, but the actual
> > information shared need not be.
>
> We already have some information sharing in the form of steal time
> accounting, and I believe this "vcpu preempted" falls in the same
> bucket. It looks like we could implement the feature as an extension
> of the steal-time accounting, as the two concepts are linked
> (one describes the accumulation of non-running time, the other is
> instantaneous).

Yeah I noticed the steal stuff. Will go look more into that.

> > This could open the door to sharing
> > more such information in an arch-independent way (for example, if the
> > scheduler needs to know other information such as the capacity of the
> > CPU that the vCPU is on).
>
> Quentin and I have discussed potential ways of improving guest
> scheduling
> on terminally broken systems (otherwise known as big-little), in the
> form of a capacity request from the guest to the host. I'm not really
> keen on the host exposing its own capacity, as that doesn't tell the
> host what the guest actually needs.

I am not sure how a capacity request could work well. It seems the
cost of a repeated hypercall could be prohibitive. In this case, a
lighter approach might be for KVM to restrict vCPU threads to run on
certain types of cores, and pass the capacity information to the
guest at the guest's boot time. This would be a one-time cost to pay,
and then the guest scheduler could handle the scheduling
appropriately without any more hypercalls. Thoughts?

- Joel

2020-12-10 14:08:55

by Marc Zyngier

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

On 2020-12-10 01:39, Joel Fernandes wrote:

[...]

>> Quentin and I have discussed potential ways of improving guest
>> scheduling
>> on terminally broken systems (otherwise known as big-little), in the
>> form of a capacity request from the guest to the host. I'm not really
>> keen on the host exposing its own capacity, as that doesn't tell the
>> host what the guest actually needs.
>
> I am not sure how a capacity request could work well. It seems the
> cost of a repeated hypercall could be prohibitive. In this case, a
> lighter approach might be for KVM to restrict vCPU threads to run on
> certain types of cores, and pass the capacity information to the guest
> at guest's boot time.

That seems like a very narrow use case. If you actually pin vcpus to
physical CPU classes, DT is the right place to put things, because
it is completely static. This is effectively creating a virtual
big-little, which is in my opinion a userspace job.

> This would be a one-time cost to pay, and then the guest scheduler
> could handle the scheduling appropriately without any more
> hypercalls. Thoughts?

Anything that is a one-off belongs to firmware configuration, IMO.

The case I'm concerned with is when vcpus are allowed to roam across
the system, and hit random physical CPUs because the host has no idea
of the workload the guest deals with (especially as the AMU counters
are either absent or unusable on any available core).

The cost of a hypercall really depends on where you terminate it.
If it is a shallow exit, that's only a few hundred cycles on any half
baked CPU. Go all the way to userspace, and the host scheduler is the
limit. But the frequency of that hypercall obviously matters too.

How often do you expect the capacity request to fire? Probably not
on each and every time slice, right?

Quentin, can you shed some light on this?

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2020-12-11 13:20:35

by Quentin Perret

Subject: Re: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

On Thursday 10 Dec 2020 at 08:45:22 (+0000), Marc Zyngier wrote:
> On 2020-12-10 01:39, Joel Fernandes wrote:
>
> [...]
>
> > > Quentin and I have discussed potential ways of improving guest
> > > scheduling
> > > on terminally broken systems (otherwise known as big-little), in the
> > > form of a capacity request from the guest to the host. I'm not really
> > > keen on the host exposing its own capacity, as that doesn't tell the
> > > host what the guest actually needs.
> >
> > I am not sure how a capacity request could work well. It seems the
> > cost of a repeated hypercall could be prohibitive. In this case, a
> > lighter approach might be for KVM to restrict vCPU threads to run on
> > certain types of cores, and pass the capacity information to the guest
> > at guest's boot time.
>
> That seems like a very narrow use case. If you actually pin vcpus to
> physical CPU classes, DT is the right place to put things, because
> it is completely static. This is effectively creating a virtual
> big-little, which is in my opinion a userspace job.

+1, all you should need for this is to have the VMM pin the vCPUs and
set capacity-dmips-mhz in the guest DT accordingly; see the sketch
below. And if you're worried about sharing the runqueue with host
tasks, could you vacate the host CPUs using cpusets or such?
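
Something along these lines in the guest DT (values and node layout
purely illustrative):

cpus {
	cpu@0 {
		device_type = "cpu";
		compatible = "arm,armv8";
		reg = <0x0>;
		capacity-dmips-mhz = <512>;	/* pinned to a little core */
	};
	cpu@1 {
		device_type = "cpu";
		compatible = "arm,armv8";
		reg = <0x1>;
		capacity-dmips-mhz = <1024>;	/* pinned to a big core */
	};
};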

The last difficult bit is how to drive DVFS. I suppose Marc's suggestion
to relay capacity requests from the guest would help with that.

> > This would be a one-time cost to pay, and then the guest scheduler
> > could handle the scheduling appropriately without any more
> > hypercalls. Thoughts?
>
> Anything that is a one-off belongs to firmware configuration, IMO.
>
> The case I'm concerned with is when vcpus are allowed to roam across
> the system, and hit random physical CPUs because the host has no idea
> of the workload the guest deals with (specially as the AMU counters
> are either absent or unusable on any available core).
>
> The cost of a hypercall really depends on where you terminate it.
> If it is a shallow exit, that's only a few hundred cycles on any half
> baked CPU. Go all the way to userspace, and the host scheduler is the
> limit. But the frequency of that hypercall obviously matters too.
>
> How often do you expect the capacity request to fire? Probably not
> on each and every time slice, right?
>
> Quentin, can you shed some light on this?

Assuming that we change the 'capacity request' (aka uclamp.min of the
vCPU) every time the guest makes a frequency request, then the answer
very much is 'it depends on the workload'. Yes there is an overhead, but
I think it is hard to say how bad that would be before we give it a go.
It's unfortunately not uncommon to have painfully slow frequency changes
on real hardware, so this may be just fine. And there may be ways we
can mitigate this too (with rate limiting and such), so all in all it is
worth a try.

Also, as per the above, this would still help even if the VMM pins
vCPUs and such, so these two things can live alongside and complement
each other, I think.
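
To make the knob concrete: what I have in mind is simply raising the
vCPU thread's uclamp.min from the VMM whenever the guest requests
more capacity. A minimal userspace sketch, assuming a v5.3+ kernel
with util-clamp and a `request' already on the 0..1024 capacity
scale:

#include <linux/sched.h>
#include <linux/sched/types.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Apply a guest capacity request to one vCPU thread. */
static int set_vcpu_uclamp_min(pid_t vcpu_tid, unsigned int request)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_flags	= SCHED_FLAG_KEEP_ALL |
				  SCHED_FLAG_UTIL_CLAMP_MIN,
		.sched_util_min	= request,	/* 0..1024 */
	};

	return syscall(SYS_sched_setattr, vcpu_tid, &attr, 0);
}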

Now, for the patch originally under discussion here: no objection
from me in principle; it looks like a nice improvement to the
stolen-time stuff, and I can see how it could help some use-cases,
so +1 from me.

Thanks,
Quentin