2023-12-14 02:47:39

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

Double scheduling is a concern with virtualization hosts where the host
schedules vcpus without knowing whats run by the vcpu and guest schedules
tasks without knowing where the vcpu is physically running. This causes
issues related to latencies, power consumption, resource utilization
etc. An ideal solution would be to have a cooperative scheduling
framework where the guest and host shares scheduling related information
and makes an educated scheduling decision to optimally handle the
workloads. As a first step, we are taking a stab at reducing latencies
for latency sensitive workloads in the guest.

This series of patches aims to implement a framework for dynamically
managing the priority of vcpu threads based on the needs of the workload
running on the vcpu. Latency sensitive workloads (nmi, irq, softirq,
critcal sections, RT tasks etc) will get a boost from the host so as to
minimize the latency.

The host can proactively boost the vcpu threads when it has enough
information about what is going to run on the vcpu - fo eg: injecting
interrupts. For rest of the case, guest can request boost if the vcpu is
not already boosted. The guest can subsequently request unboost after
the latency sensitive workloads completes. Guest can also request a
boost if needed.

A shared memory region is used to communicate the scheduling information.
Guest shares its needs for priority boosting and host shares the boosting
status of the vcpu. Guest sets a flag when it needs a boost and continues
running. Host reads this on next VMEXIT and boosts the vcpu thread. For
unboosting, it is done synchronously so that host workloads can fairly
compete with guests when guest is not running any latency sensitive
workload.

This RFC is x86 specific. This is mostly feature complete, but more work
needs to be done on the following areas:
- Use of paravirt ops framework.
- Optimizing critical paths for speed, cache efficiency etc
- Extensibility of this idea for sharing more scheduling information to
make better educated scheduling decisions in guest and host.
- Prevent misuse by rogue/buggy guest kernels

Tests
------

Real world workload on chromeos shows considerable improvement. Audio
and video applications running on low end devices experience high
latencies when the system is under load. This patch helps in mitigating
the audio and video glitches caused due to scheduling latencies.

Following are the results from oboetester app on android vm running in
chromeos. This app tests for audio glitches.

-------------------------------------------------------
| | Noload || Busy |
| Buffer Size |----------------------------------------
| | Vanilla | patches || Vanilla | Patches |
-------------------------------------------------------
| 96 (2ms) | 20 | 4 || 1365 | 67 |
-------------------------------------------------------
| 256 (4ms) | 3 | 1 || 524 | 23 |
-------------------------------------------------------
| 512 (10ms) | 0 | 0 || 25 | 24 |
-------------------------------------------------------

Noload: Tests run on idle system
Busy: Busy system simulated by Speedometer benchmark

The test shows considerable reduction in glitches especially with
smaller buffer sizes.

Following are data collected from few micro benchmark tests. cyclictest
was run on a VM to measure the latency with and without the patches. We
also took a baseline of the results with all vcpus statically boosted to
RT(chrt). This is to observe the difference between dynamic and static
boosting and its effect on host as well. Cyclictest on guest is to
observe the effect of the patches on guest and cyclictest on host is to
see if the patch affects workloads on the host.

cyclictest is run on both host and guest.
cyclictest cmdline: "cyclictest -q -D 90s -i 500 -d $INTERVAL"
where $INTERVAL used was 500 and 1000 us.

Host is Intel N4500 4C/4T. Guest also has 4 vcpus.

In the following tables,
Vanilla: baseline: vanilla kernel
Dynamic: the patches applied
Static: baseline: all vcpus statically boosted to RT(chrt)

Idle tests
----------
The Host is idle and cyclictest on host and guest.

-----------------------------------------------------------------------
| | Avg Latency(us): Guest || Avg Latency(us): Host |
-----------------------------------------------------------------------
| Interval | vanilla | dynamic | static || vanilla | dynamic | static |
-----------------------------------------------------------------------
| 500 | 9 | 9 | 10 || 5 | 3 | 3 |
-----------------------------------------------------------------------
| 1000 | 34 | 35 | 35 || 5 | 3 | 3 |
----------------------------------------------------------------------

-----------------------------------------------------------------------
| | Max Latency(us): Guest || Max Latency(us): Host |
-----------------------------------------------------------------------
| Interval | vanilla | dynamic | static || vanilla | dynamic | static |
-----------------------------------------------------------------------
| 500 | 1577 | 1433 | 140 || 1577 | 1526 | 15969 |
-----------------------------------------------------------------------
| 1000 | 6649 | 765 | 204 || 697 | 174 | 2444 |
-----------------------------------------------------------------------

Busy Tests
----------
Here the a busy host was simulated using stress-ng and cyclictest was
run on both host and guest.

-----------------------------------------------------------------------
| | Avg Latency(us): Guest || Avg Latency(us): Host |
-----------------------------------------------------------------------
| Interval | vanilla | dynamic | static || vanilla | dynamic | static |
-----------------------------------------------------------------------
| 500 | 887 | 21 | 25 || 6 | 6 | 7 |
-----------------------------------------------------------------------
| 1000 | 6335 | 45 | 38 || 11 | 11 | 14 |
----------------------------------------------------------------------

-----------------------------------------------------------------------
| | Max Latency(us): Guest || Max Latency(us): Host |
-----------------------------------------------------------------------
| Interval | vanilla | dynamic | static || vanilla | dynamic | static |
-----------------------------------------------------------------------
| 500 | 216835 | 13978 | 1728 || 2075 | 2114 | 2447 |
-----------------------------------------------------------------------
| 1000 | 199575 | 70651 | 1537 || 1886 | 1285 | 27104 |
-----------------------------------------------------------------------

These patches are rebased on 6.5.10.
Patches 1-4: Implementation of the core host side feature
Patch 5: A naive throttling mechanism for limiting boosted duration
for preemption disabled state in the guest. This is a placeholder for
the throttling mechanism for now and would need to be implemented
differently
Patch 6: Enable/disable tunables - global and per-vm
Patches 7-8: Implementation of the code guest side feature

---
Vineeth Pillai (Google) (8):
kvm: x86: MSR for setting up scheduler info shared memory
sched/core: sched_setscheduler_pi_nocheck for interrupt context usage
kvm: x86: vcpu boosting/unboosting framework
kvm: x86: boost vcpu threads on latency sensitive paths
kvm: x86: upper bound for preemption based boost duration
kvm: x86: enable/disable global/per-guest vcpu boost feature
sched/core: boost/unboost in guest scheduler
irq: boost/unboost in irq/nmi entry/exit and softirq

arch/x86/Kconfig | 13 +++
arch/x86/include/asm/kvm_host.h | 69 ++++++++++++
arch/x86/include/asm/kvm_para.h | 7 ++
arch/x86/include/uapi/asm/kvm_para.h | 43 ++++++++
arch/x86/kernel/kvm.c | 16 +++
arch/x86/kvm/Kconfig | 12 +++
arch/x86/kvm/cpuid.c | 2 +
arch/x86/kvm/i8259.c | 2 +-
arch/x86/kvm/lapic.c | 8 +-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 154 +++++++++++++++++++++++++++
include/linux/kvm_host.h | 56 ++++++++++
include/linux/sched.h | 23 ++++
include/uapi/linux/kvm.h | 5 +
kernel/entry/common.c | 39 +++++++
kernel/sched/core.c | 127 +++++++++++++++++++++-
kernel/softirq.c | 11 ++
virt/kvm/kvm_main.c | 150 ++++++++++++++++++++++++++
19 files changed, 730 insertions(+), 11 deletions(-)

--
2.43.0


2023-12-14 02:47:43

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 1/8] kvm: x86: MSR for setting up scheduler info shared memory

Implement a kvm MSR that guest uses to provide the GPA of shared memory
for communicating the scheduling information between host and guest.

wrmsr(0) disables the feature. wrmsr(valid_gpa) enables the feature and
uses the gpa for further communication.

Also add a new cpuid feature flag for the host to advertise the feature
to the guest.

Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 25 ++++++++++++
arch/x86/include/uapi/asm/kvm_para.h | 24 +++++++++++
arch/x86/kvm/Kconfig | 12 ++++++
arch/x86/kvm/cpuid.c | 2 +
arch/x86/kvm/x86.c | 61 ++++++++++++++++++++++++++++
include/linux/kvm_host.h | 5 +++
6 files changed, 129 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f72b30d2238a..f89ba1f07d88 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -987,6 +987,18 @@ struct kvm_vcpu_arch {
/* Protected Guests */
bool guest_state_protected;

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ /*
+ * MSR to setup a shared memory for scheduling
+ * information sharing between host and guest.
+ */
+ struct {
+ enum kvm_vcpu_boost_state boost_status;
+ u64 msr_val;
+ struct gfn_to_hva_cache data;
+ } pv_sched;
+#endif
+
/*
* Set when PDPTS were loaded directly by the userspace without
* reading the guest memory
@@ -2217,4 +2229,17 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
*/
#define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+static inline bool kvm_arch_vcpu_pv_sched_enabled(struct kvm_vcpu_arch *arch)
+{
+ return arch->pv_sched.msr_val;
+}
+
+static inline void kvm_arch_vcpu_set_boost_status(struct kvm_vcpu_arch *arch,
+ enum kvm_vcpu_boost_state boost_status)
+{
+ arch->pv_sched.boost_status = boost_status;
+}
+#endif
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6e64b27b2c1e..6b1dea07a563 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -36,6 +36,7 @@
#define KVM_FEATURE_MSI_EXT_DEST_ID 15
#define KVM_FEATURE_HC_MAP_GPA_RANGE 16
#define KVM_FEATURE_MIGRATION_CONTROL 17
+#define KVM_FEATURE_PV_SCHED 18

#define KVM_HINTS_REALTIME 0

@@ -58,6 +59,7 @@
#define MSR_KVM_ASYNC_PF_INT 0x4b564d06
#define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
#define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
+#define MSR_KVM_PV_SCHED 0x4b564da0

struct kvm_steal_time {
__u64 steal;
@@ -150,4 +152,26 @@ struct kvm_vcpu_pv_apf_data {
#define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
#define KVM_PV_EOI_DISABLED 0x0

+/*
+ * VCPU boost state shared between the host and guest.
+ */
+enum kvm_vcpu_boost_state {
+ /* Priority boosting feature disabled in host */
+ VCPU_BOOST_DISABLED = 0,
+ /*
+ * vcpu is not explicitly boosted by the host.
+ * (Default priority when the guest started)
+ */
+ VCPU_BOOST_NORMAL,
+ /* vcpu is boosted by the host */
+ VCPU_BOOST_BOOSTED
+};
+
+/*
+ * Structure passed in via MSR_KVM_PV_SCHED
+ */
+struct pv_sched_data {
+ __u64 boost_status;
+};
+
#endif /* _UAPI_ASM_X86_KVM_PARA_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 89ca7f4c1464..dbcba73fb508 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -141,4 +141,16 @@ config KVM_XEN
config KVM_EXTERNAL_WRITE_TRACKING
bool

+config PARAVIRT_SCHED_KVM
+ bool "Enable paravirt scheduling capability for kvm"
+ depends on KVM
+ help
+ Paravirtualized scheduling facilitates the exchange of scheduling
+ related information between the host and guest through shared memory,
+ enhancing the efficiency of vCPU thread scheduling by the hypervisor.
+ An illustrative use case involves dynamically boosting the priority of
+ a vCPU thread when the guest is executing a latency-sensitive workload
+ on that specific vCPU.
+ This config enables paravirt scheduling in the kvm hypervisor.
+
endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7bdc66abfc92..960ef6e869f2 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1113,6 +1113,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
(1 << KVM_FEATURE_POLL_CONTROL) |
(1 << KVM_FEATURE_PV_SCHED_YIELD) |
(1 << KVM_FEATURE_ASYNC_PF_INT);
+ if (IS_ENABLED(CONFIG_PARAVIRT_SCHED_KVM))
+ entry->eax |= (1 << KVM_FEATURE_PV_SCHED);

if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7bcf1a76a6ab..0f475b50ac83 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3879,6 +3879,33 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
break;

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ case MSR_KVM_PV_SCHED:
+ if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SCHED))
+ return 1;
+
+ if (!(data & KVM_MSR_ENABLED))
+ break;
+
+ if (!(data & ~KVM_MSR_ENABLED)) {
+ /*
+ * Disable the feature
+ */
+ vcpu->arch.pv_sched.msr_val = 0;
+ kvm_set_vcpu_boosted(vcpu, false);
+ } if (!kvm_gfn_to_hva_cache_init(vcpu->kvm,
+ &vcpu->arch.pv_sched.data, data & ~KVM_MSR_ENABLED,
+ sizeof(struct pv_sched_data))) {
+ vcpu->arch.pv_sched.msr_val = data;
+ kvm_set_vcpu_boosted(vcpu, false);
+ } else {
+ pr_warn("MSR_KVM_PV_SCHED: kvm:%p, vcpu:%p, "
+ "msr value: %llx, kvm_gfn_to_hva_cache_init failed!\n",
+ vcpu->kvm, vcpu, data & ~KVM_MSR_ENABLED);
+ }
+ break;
+#endif
+
case MSR_KVM_POLL_CONTROL:
if (!guest_pv_has(vcpu, KVM_FEATURE_POLL_CONTROL))
return 1;
@@ -4239,6 +4266,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)

msr_info->data = vcpu->arch.pv_eoi.msr_val;
break;
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ case MSR_KVM_PV_SCHED:
+ msr_info->data = vcpu->arch.pv_sched.msr_val;
+ break;
+#endif
case MSR_KVM_POLL_CONTROL:
if (!guest_pv_has(vcpu, KVM_FEATURE_POLL_CONTROL))
return 1;
@@ -9820,6 +9852,29 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+static void record_vcpu_boost_status(struct kvm_vcpu *vcpu)
+{
+ u64 val = vcpu->arch.pv_sched.boost_status;
+
+ if (!kvm_arch_vcpu_pv_sched_enabled(&vcpu->arch))
+ return;
+
+ pagefault_disable();
+ kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.pv_sched.data,
+ &val, offsetof(struct pv_sched_data, boost_status), sizeof(u64));
+ pagefault_enable();
+}
+
+void kvm_set_vcpu_boosted(struct kvm_vcpu *vcpu, bool boosted)
+{
+ kvm_arch_vcpu_set_boost_status(&vcpu->arch,
+ boosted ? VCPU_BOOST_BOOSTED : VCPU_BOOST_NORMAL);
+
+ kvm_make_request(KVM_REQ_VCPU_BOOST_UPDATE, vcpu);
+}
+#endif
+
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
unsigned long nr, a0, a1, a2, a3, ret;
@@ -10593,6 +10648,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
record_steal_time(vcpu);
+
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ if (kvm_check_request(KVM_REQ_VCPU_BOOST_UPDATE, vcpu))
+ record_vcpu_boost_status(vcpu);
+#endif
+
#ifdef CONFIG_KVM_SMM
if (kvm_check_request(KVM_REQ_SMI, vcpu))
process_smi(vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d3ac7720da9..a74aeea55347 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -167,6 +167,7 @@ static inline bool is_error_page(struct page *page)
#define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_UNBLOCK 2
#define KVM_REQ_DIRTY_RING_SOFT_FULL 3
+#define KVM_REQ_VCPU_BOOST_UPDATE 6
#define KVM_REQUEST_ARCH_BASE 8

/*
@@ -2287,4 +2288,8 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+void kvm_set_vcpu_boosted(struct kvm_vcpu *vcpu, bool boosted);
+#endif
+
#endif
--
2.43.0

2023-12-14 02:47:47

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 2/8] sched/core: sched_setscheduler_pi_nocheck for interrupt context usage

__sched_setscheduler takes an argument 'pi' so as to allow its usage in
interrupt context when PI is not used. But this is not exported and
cannot be used outside of scheduler code. sched_setscheduler_nocheck is
exported but it doesn't allow that flexibility.

Introduce sched_setscheduler_pi_nocheck to allow for the flexibility to
call from interrupt context

Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 34 +++++++++++++++++++++++++++++++---
2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 609bde814cb0..de7382f149cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1908,6 +1908,8 @@ extern int idle_cpu(int cpu);
extern int available_idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, const struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int, const struct sched_param *);
+extern int sched_setscheduler_pi_nocheck(struct task_struct *p, int policy,
+ const struct sched_param *sp, bool pi);
extern void sched_set_fifo(struct task_struct *p);
extern void sched_set_fifo_low(struct task_struct *p);
extern void sched_set_normal(struct task_struct *p, int nice);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8f73ff12126..b47f72b6595f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7850,8 +7850,8 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}

-static int _sched_setscheduler(struct task_struct *p, int policy,
- const struct sched_param *param, bool check)
+static int _sched_setscheduler_pi(struct task_struct *p, int policy,
+ const struct sched_param *param, bool check, bool pi)
{
struct sched_attr attr = {
.sched_policy = policy,
@@ -7866,8 +7866,15 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
attr.sched_policy = policy;
}

- return __sched_setscheduler(p, &attr, check, true);
+ return __sched_setscheduler(p, &attr, check, pi);
+}
+
+static inline int _sched_setscheduler(struct task_struct *p, int policy,
+ const struct sched_param *param, bool check)
+{
+ return _sched_setscheduler_pi(p, policy, param, check, true);
}
+
/**
* sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
* @p: the task in question.
@@ -7916,6 +7923,27 @@ int sched_setscheduler_nocheck(struct task_struct *p, int policy,
return _sched_setscheduler(p, policy, param, false);
}

+/**
+ * sched_setscheduler_pi_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
+ * @p: the task in question.
+ * @policy: new policy.
+ * @param: structure containing the new RT priority.
+ * @pi: boolean flag stating if pi validation needs to be performed.
+ *
+ * A flexible version of sched_setcheduler_nocheck which allows for specifying
+ * whether PI context validation needs to be done or not. set_scheduler_nocheck
+ * is not allowed in interrupt context as it assumes that PI is used.
+ * This function allows interrupt context call by specifying pi = false.
+ *
+ * Return: 0 on success. An error code otherwise.
+ */
+int sched_setscheduler_pi_nocheck(struct task_struct *p, int policy,
+ const struct sched_param *param, bool pi)
+{
+ return _sched_setscheduler_pi(p, policy, param, false, pi);
+}
+EXPORT_SYMBOL_GPL(sched_setscheduler_pi_nocheck);
+
/*
* SCHED_FIFO is a broken scheduler model; that is, it is fundamentally
* incapable of resource management, which is the one thing an OS really should
--
2.43.0

2023-12-14 02:57:53

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 4/8] kvm: x86: boost vcpu threads on latency sensitive paths

Proactively boost the vcpu thread when delivering an interrupt so
that the guest vcpu gets to run with minimum latency and service the
interrupt. The host knows that guest vcpu is going to service an irq/nmi
as soon as its delivered and boosting the priority will help the guest
to avoid latencies. Timer interrupt is one common scenario which
benefits from this.

When a vcpu resumes from halt, it would be because of an event like
timer or irq/nmi and is latency sensitive. So, boosting the priority
of vcpu thread when it goes idle makes sense as the wakeup would be
because of a latency sensitive event and this boosting will not hurt
the host as the thread is scheduled out.

Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
arch/x86/kvm/i8259.c | 2 +-
arch/x86/kvm/lapic.c | 8 ++++----
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
include/linux/kvm_host.h | 8 ++++++++
virt/kvm/kvm_main.c | 8 ++++++++
6 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 8dec646e764b..6841ed802f00 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -62,7 +62,7 @@ static void pic_unlock(struct kvm_pic *s)
kvm_for_each_vcpu(i, vcpu, s->kvm) {
if (kvm_apic_accept_pic_intr(vcpu)) {
kvm_make_request(KVM_REQ_EVENT, vcpu);
- kvm_vcpu_kick(vcpu);
+ kvm_vcpu_kick_boost(vcpu);
return;
}
}
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e74e223f46aa..ae25176fddc8 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1309,12 +1309,12 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
result = 1;
vcpu->arch.pv.pv_unhalted = 1;
kvm_make_request(KVM_REQ_EVENT, vcpu);
- kvm_vcpu_kick(vcpu);
+ kvm_vcpu_kick_boost(vcpu);
break;

case APIC_DM_SMI:
if (!kvm_inject_smi(vcpu)) {
- kvm_vcpu_kick(vcpu);
+ kvm_vcpu_kick_boost(vcpu);
result = 1;
}
break;
@@ -1322,7 +1322,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
case APIC_DM_NMI:
result = 1;
kvm_inject_nmi(vcpu);
- kvm_vcpu_kick(vcpu);
+ kvm_vcpu_kick_boost(vcpu);
break;

case APIC_DM_INIT:
@@ -1901,7 +1901,7 @@ static void apic_timer_expired(struct kvm_lapic *apic, bool from_timer_fn)
atomic_inc(&apic->lapic_timer.pending);
kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
if (from_timer_fn)
- kvm_vcpu_kick(vcpu);
+ kvm_vcpu_kick_boost(vcpu);
}

static void start_sw_tscdeadline(struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c8466bc64b87..578c19aeef73 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3566,7 +3566,7 @@ void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
if (!READ_ONCE(vcpu->arch.apic->apicv_active)) {
/* Process the interrupt via kvm_check_and_inject_events(). */
kvm_make_request(KVM_REQ_EVENT, vcpu);
- kvm_vcpu_kick(vcpu);
+ kvm_vcpu_kick_boost(vcpu);
return;
}

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bc6f0fea48b4..b786cb2eb185 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4266,7 +4266,7 @@ static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
if (vmx_deliver_posted_interrupt(vcpu, vector)) {
kvm_lapic_set_irr(vector, apic);
kvm_make_request(KVM_REQ_EVENT, vcpu);
- kvm_vcpu_kick(vcpu);
+ kvm_vcpu_kick_boost(vcpu);
} else {
trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
trig_mode, vector);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c6647f6312c9..f76680fbc60d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2296,11 +2296,19 @@ static inline bool kvm_vcpu_sched_enabled(struct kvm_vcpu *vcpu)
{
return kvm_arch_vcpu_pv_sched_enabled(&vcpu->arch);
}
+
+static inline void kvm_vcpu_kick_boost(struct kvm_vcpu *vcpu)
+{
+ kvm_vcpu_set_sched(vcpu, true);
+ kvm_vcpu_kick(vcpu);
+}
#else
static inline int kvm_vcpu_set_sched(struct kvm_vcpu *vcpu, bool boost)
{
return 0;
}
+
+#define kvm_vcpu_kick_boost kvm_vcpu_kick
#endif

#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 37748e2512e1..0dd8b84ed073 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3460,6 +3460,14 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
if (kvm_vcpu_check_block(vcpu) < 0)
break;

+ /*
+ * Boost before scheduling out. Wakeup happens only on
+ * an event or a signal and hence it is beneficial to
+ * be scheduled ASAP. Ultimately, guest gets to idle loop
+ * and then will request deboost.
+ */
+ kvm_vcpu_set_sched(vcpu, true);
+
waited = true;
schedule();
}
--
2.43.0

2023-12-14 03:01:01

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 6/8] kvm: x86: enable/disable global/per-guest vcpu boost feature

Implement the module parameter for enable/disable of the feature
globally. Also implement the ioctls for enable/disable of the feature
per guest.

TODO: Documentation for the ioctls and kvm module parameters.

Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
arch/x86/kvm/x86.c | 8 +++--
include/linux/kvm_host.h | 34 +++++++++++++++++-
include/uapi/linux/kvm.h | 5 +++
virt/kvm/kvm_main.c | 76 +++++++++++++++++++++++++++++++++++++---
4 files changed, 116 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8c15c6ff352e..4fb73833fc68 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9949,8 +9949,12 @@ static void record_vcpu_boost_status(struct kvm_vcpu *vcpu)

void kvm_set_vcpu_boosted(struct kvm_vcpu *vcpu, bool boosted)
{
- kvm_arch_vcpu_set_boost_status(&vcpu->arch,
- boosted ? VCPU_BOOST_BOOSTED : VCPU_BOOST_NORMAL);
+ enum kvm_vcpu_boost_state boost_status = VCPU_BOOST_DISABLED;
+
+ if (kvm_pv_sched_enabled(vcpu->kvm))
+ boost_status = boosted ? VCPU_BOOST_BOOSTED : VCPU_BOOST_NORMAL;
+
+ kvm_arch_vcpu_set_boost_status(&vcpu->arch, boost_status);

kvm_make_request(KVM_REQ_VCPU_BOOST_UPDATE, vcpu);
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f76680fbc60d..07f60a27025c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -807,6 +807,9 @@ struct kvm {
struct notifier_block pm_notifier;
#endif
char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ bool pv_sched_enabled;
+#endif
};

#define kvm_err(fmt, ...) \
@@ -2292,9 +2295,38 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
void kvm_set_vcpu_boosted(struct kvm_vcpu *vcpu, bool boosted);
int kvm_vcpu_set_sched(struct kvm_vcpu *vcpu, bool boost);

+DECLARE_STATIC_KEY_FALSE(kvm_pv_sched);
+
+static inline bool kvm_pv_sched_enabled(struct kvm *kvm)
+{
+ if (static_branch_unlikely(&kvm_pv_sched))
+ return kvm->pv_sched_enabled;
+
+ return false;
+}
+
+static inline void kvm_set_pv_sched_enabled(struct kvm *kvm, bool enabled)
+{
+ unsigned long i;
+ struct kvm_vcpu *vcpu;
+
+ kvm->pv_sched_enabled = enabled;
+ /*
+ * After setting vcpu_sched_enabled, we need to update each vcpu's
+ * state(VCPU_BOOST_{DISABLED,NORMAL}) so that guest knows about the
+ * update.
+ * When disabling, we would also need to unboost vcpu threads
+ * if already boosted.
+ * XXX: this can race, needs locking!
+ */
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_vcpu_set_sched(vcpu, false);
+}
+
static inline bool kvm_vcpu_sched_enabled(struct kvm_vcpu *vcpu)
{
- return kvm_arch_vcpu_pv_sched_enabled(&vcpu->arch);
+ return kvm_pv_sched_enabled(vcpu->kvm) &&
+ kvm_arch_vcpu_pv_sched_enabled(&vcpu->arch);
}

static inline void kvm_vcpu_kick_boost(struct kvm_vcpu *vcpu)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f089ab290978..4beaeaa3e78f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1192,6 +1192,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_COUNTER_OFFSET 227
#define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
#define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_PV_SCHED 600

#ifdef KVM_CAP_IRQ_ROUTING

@@ -2249,4 +2250,8 @@ struct kvm_s390_zpci_op {
/* flags for kvm_s390_zpci_op->u.reg_aen.flags */
#define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)

+/* Available with KVM_CAP_PV_SCHED */
+#define KVM_SET_PV_SCHED_ENABLED _IOW(KVMIO, 0xe0, int)
+#define KVM_GET_PV_SCHED_ENABLED _IOR(KVMIO, 0xe1, int)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0dd8b84ed073..d17cd28d5a92 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -99,6 +99,52 @@ unsigned int halt_poll_ns_shrink;
module_param(halt_poll_ns_shrink, uint, 0644);
EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+__read_mostly DEFINE_STATIC_KEY_FALSE(kvm_pv_sched);
+EXPORT_SYMBOL_GPL(kvm_pv_sched);
+
+static int set_kvm_pv_sched(const char *val, const struct kernel_param *cp)
+{
+ struct kvm *kvm;
+ char *s = strstrip((char *)val);
+ bool new_val, old_val = static_key_enabled(&kvm_pv_sched);
+
+ if (!strcmp(s, "0"))
+ new_val = 0;
+ else if (!strcmp(s, "1"))
+ new_val = 1;
+ else
+ return -EINVAL;
+
+ if (old_val != new_val) {
+ if (new_val)
+ static_branch_enable(&kvm_pv_sched);
+ else
+ static_branch_disable(&kvm_pv_sched);
+
+ mutex_lock(&kvm_lock);
+ list_for_each_entry(kvm, &vm_list, vm_list)
+ kvm_set_pv_sched_enabled(kvm, !old_val);
+ mutex_unlock(&kvm_lock);
+ }
+
+ return 0;
+}
+
+static int get_kvm_pv_sched(char *buf, const struct kernel_param *cp)
+{
+ return sprintf(buf, "%s\n",
+ static_key_enabled(&kvm_pv_sched) ? "1" : "0");
+}
+
+static const struct kernel_param_ops kvm_pv_sched_ops = {
+ .set = set_kvm_pv_sched,
+ .get = get_kvm_pv_sched
+};
+
+module_param_cb(kvm_pv_sched, &kvm_pv_sched_ops, NULL, 0644);
+#endif
+
/*
* Ordering of locks:
*
@@ -1157,6 +1203,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)

BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ kvm->pv_sched_enabled = true;
+#endif
/*
* Force subsequent debugfs file creations to fail if the VM directory
* is not created (by kvm_create_vm_debugfs()).
@@ -3635,11 +3684,15 @@ int kvm_vcpu_set_sched(struct kvm_vcpu *vcpu, bool boost)
struct task_struct *vcpu_task = NULL;

/*
- * We can ignore the request if a boost request comes
- * when we are already boosted or an unboost request
- * when we are already unboosted.
+ * If the feature is disabled and we receive a boost request,
+ * we can ignore the request and set VCPU_BOOST_DISABLED for the
+ * guest to see(kvm_set_vcpu_boosted).
+ * Similarly, we can ignore the request if a boost request comes
+ * when we are already boosted or an unboost request when we are
+ * already unboosted.
*/
- if (__can_ignore_set_sched(vcpu, boost))
+ if ((!kvm_vcpu_sched_enabled(vcpu) && boost) ||
+ __can_ignore_set_sched(vcpu, boost))
goto set_boost_status;

if (boost) {
@@ -4591,6 +4644,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
case KVM_CAP_CHECK_EXTENSION_VM:
case KVM_CAP_ENABLE_CAP_VM:
case KVM_CAP_HALT_POLL:
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ case KVM_CAP_PV_SCHED:
+#endif
return 1;
#ifdef CONFIG_KVM_MMIO
case KVM_CAP_COALESCED_MMIO:
@@ -5018,6 +5074,18 @@ static long kvm_vm_ioctl(struct file *filp,
case KVM_GET_STATS_FD:
r = kvm_vm_ioctl_get_stats_fd(kvm);
break;
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ case KVM_SET_PV_SCHED_ENABLED:
+ r = -EINVAL;
+ if (arg == 0 || arg == 1) {
+ kvm_set_pv_sched_enabled(kvm, arg);
+ r = 0;
+ }
+ break;
+ case KVM_GET_PV_SCHED_ENABLED:
+ r = kvm->pv_sched_enabled;
+ break;
+#endif
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
}
--
2.43.0

2023-12-14 03:01:06

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 5/8] kvm: x86: upper bound for preemption based boost duration

Guest requests boost on preempt disable but doesn't request unboost on
preempt enable. This may cause the guest vcpu to be boosted for longer
than what it deserves. Also, there are lots of preemption disabled paths
in kernel and some could be quite long.

This patch sets a bound on the maximum time a vcpu is boosted due to
preemption disabled in guest. Default is 3000us, and could be changed
via kvm module parameter.

Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++++++---
2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 474fe2d6d3e0..6a8326baa6a0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -994,6 +994,8 @@ struct kvm_vcpu_arch {
*/
struct {
enum kvm_vcpu_boost_state boost_status;
+ bool preempt_disabled;
+ ktime_t preempt_disabled_ts;
int boost_policy;
int boost_prio;
u64 msr_val;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2577e1083f91..8c15c6ff352e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -199,6 +199,15 @@ module_param(eager_page_split, bool, 0644);
static bool __read_mostly mitigate_smt_rsb;
module_param(mitigate_smt_rsb, bool, 0444);

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+/*
+ * Maximum time in micro seconds a guest vcpu can stay boosted due
+ * to preemption disabled.
+ */
+unsigned int pvsched_max_preempt_disabled_us = 3000;
+module_param(pvsched_max_preempt_disabled_us, uint, 0644);
+#endif
+
/*
* Restoring the host value for MSRs that are only consumed when running in
* usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
@@ -2149,17 +2158,47 @@ static inline bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
}

#ifdef CONFIG_PARAVIRT_SCHED_KVM
+static inline void kvm_vcpu_update_preempt_disabled(struct kvm_vcpu_arch *arch,
+ bool preempt_disabled)
+{
+ if (arch->pv_sched.preempt_disabled != preempt_disabled) {
+ arch->pv_sched.preempt_disabled = preempt_disabled;
+ if (preempt_disabled)
+ arch->pv_sched.preempt_disabled_ts = ktime_get();
+ else
+ arch->pv_sched.preempt_disabled_ts = 0;
+ }
+}
+
+static inline bool kvm_vcpu_exceeds_preempt_disabled_duration(struct kvm_vcpu_arch *arch)
+{
+ s64 max_delta = pvsched_max_preempt_disabled_us * NSEC_PER_USEC;
+
+ if (max_delta && arch->pv_sched.preempt_disabled) {
+ s64 delta;
+
+ WARN_ON_ONCE(arch->pv_sched.preempt_disabled_ts == 0);
+ delta = ktime_to_ns(ktime_sub(ktime_get(),
+ arch->pv_sched.preempt_disabled_ts));
+
+ if (delta >= max_delta)
+ return true;
+ }
+
+ return false;
+}
+
static inline bool __vcpu_needs_boost(struct kvm_vcpu *vcpu, union guest_schedinfo schedinfo)
{
bool pending_event = kvm_cpu_has_pending_timer(vcpu) || kvm_cpu_has_interrupt(vcpu);

/*
* vcpu needs a boost if
- * - A lazy boost request active, or
- * - Pending latency sensitive event, or
- * - Preemption disabled in this vcpu.
+ * - A lazy boost request active or a pending latency sensitive event, and
+ * - Preemption disabled duration on this vcpu has not crossed the threshold.
*/
- return (schedinfo.boost_req == VCPU_REQ_BOOST || pending_event || schedinfo.preempt_disabled);
+ return ((schedinfo.boost_req == VCPU_REQ_BOOST || pending_event) &&
+ !kvm_vcpu_exceeds_preempt_disabled_duration(&vcpu->arch));
}

static inline void kvm_vcpu_do_pv_sched(struct kvm_vcpu *vcpu)
@@ -2173,6 +2212,8 @@ static inline void kvm_vcpu_do_pv_sched(struct kvm_vcpu *vcpu)
&schedinfo, offsetof(struct pv_sched_data, schedinfo), sizeof(schedinfo)))
return;

+ kvm_vcpu_update_preempt_disabled(&vcpu->arch, schedinfo.preempt_disabled);
+
kvm_vcpu_set_sched(vcpu, __vcpu_needs_boost(vcpu, schedinfo));
}
#else
--
2.43.0

2023-12-14 03:01:19

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 8/8] irq: boost/unboost in irq/nmi entry/exit and softirq

The host proactively boosts the VCPU threads during irq/nmi injection.
However, the host is unaware of posted interrupts, and therefore, the
guest should request a boost if it has not already been boosted.

Similarly, guest should request an unboost on irq/nmi/softirq exit if
the vcpu doesn't need the boost any more.

Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
kernel/entry/common.c | 30 ++++++++++++++++++++++++++++++
kernel/softirq.c | 11 +++++++++++
2 files changed, 41 insertions(+)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index fae56faac0b0..c69912b71725 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -327,6 +327,13 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
.exit_rcu = false,
};

+#ifdef CONFIG_PARAVIRT_SCHED
+ instrumentation_begin();
+ if (pv_sched_enabled())
+ pv_sched_boost_vcpu_lazy();
+ instrumentation_end();
+#endif
+
if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
return ret;
@@ -452,6 +459,18 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
if (state.exit_rcu)
ct_irq_exit();
}
+
+#ifdef CONFIG_PARAVIRT_SCHED
+ instrumentation_begin();
+ /*
+ * On irq exit, request a deboost from hypervisor if no softirq pending
+ * and current task is not RT and !need_resched.
+ */
+ if (pv_sched_enabled() && !local_softirq_pending() &&
+ !need_resched() && !task_is_realtime(current))
+ pv_sched_unboost_vcpu();
+ instrumentation_end();
+#endif
}

irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
@@ -469,6 +488,11 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
kmsan_unpoison_entry_regs(regs);
trace_hardirqs_off_finish();
ftrace_nmi_enter();
+
+#ifdef CONFIG_PARAVIRT_SCHED
+ if (pv_sched_enabled())
+ pv_sched_boost_vcpu_lazy();
+#endif
instrumentation_end();

return irq_state;
@@ -482,6 +506,12 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare();
}
+
+#ifdef CONFIG_PARAVIRT_SCHED
+ if (pv_sched_enabled() && !in_hardirq() && !local_softirq_pending() &&
+ !need_resched() && !task_is_realtime(current))
+ pv_sched_unboost_vcpu();
+#endif
instrumentation_end();

ct_nmi_exit();
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 807b34ccd797..90a127615e16 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -530,6 +530,11 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
in_hardirq = lockdep_softirq_start();
account_softirq_enter(current);

+#ifdef CONFIG_PARAVIRT_SCHED
+ if (pv_sched_enabled())
+ pv_sched_boost_vcpu_lazy();
+#endif
+
restart:
/* Reset the pending bitmask before enabling irqs */
set_softirq_pending(0);
@@ -577,6 +582,12 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
wakeup_softirqd();
}

+#ifdef CONFIG_PARAVIRT_SCHED
+ if (pv_sched_enabled() && !need_resched() &&
+ !task_is_realtime(current))
+ pv_sched_unboost_vcpu();
+#endif
+
account_softirq_exit(current);
lockdep_softirq_end(in_hardirq);
softirq_handle_end();
--
2.43.0

2023-12-14 03:09:22

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 7/8] sched/core: boost/unboost in guest scheduler

RT or higher priority tasks in guest is considered a critical workload
and guest scheduler can request boost/unboost on a task switch and/or a
task wakeup. Also share the preempt status of guest vcpu with the host
so that host can take decision on boot/unboost.

CONFIG_TRACE_PREEMPT_TOGGLE is enabled for using the function
equivalent of preempt_count_{add,sub} to update the shared memory.
Another option is to update the preempt_count_{add,sub} macros, but
it will be more code churn and complex.

Boost request is lazy, but unboost request is synchronous.

Detect the feature in guest from cpuid flags and use the MSR to pass the
GPA of memory location for sharing scheduling information.

Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
arch/x86/Kconfig | 13 +++++
arch/x86/include/asm/kvm_para.h | 7 +++
arch/x86/kernel/kvm.c | 16 ++++++
include/linux/sched.h | 21 ++++++++
kernel/entry/common.c | 9 ++++
kernel/sched/core.c | 93 ++++++++++++++++++++++++++++++++-
6 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68ce4f786dcd..556ae2698633 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -827,6 +827,19 @@ config KVM_GUEST
underlying device model, the host provides the guest with
timing infrastructure such as time of day, and system time

+config PARAVIRT_SCHED
+ bool "Enable paravirt scheduling capability for guests"
+ depends on KVM_GUEST
+ select TRACE_PREEMPT_TOGGLE
+ help
+ Paravirtualized scheduling facilitates the exchange of scheduling
+ related information between the host and guest through shared memory,
+ enhancing the efficiency of vCPU thread scheduling by the hypervisor.
+ An illustrative use case involves dynamically boosting the priority of
+ a vCPU thread when the guest is executing a latency-sensitive workload
+ on that specific vCPU.
+ This config enables paravirt scheduling in guest(VM).
+
config ARCH_CPUIDLE_HALTPOLL
def_bool n
prompt "Disable host haltpoll when loading haltpoll driver"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 57bc74e112f2..3473dd2915b5 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -176,4 +176,11 @@ static __always_inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
}
#endif

+#ifdef CONFIG_PARAVIRT_SCHED
+static inline void kvm_pv_sched_notify_host(void)
+{
+ wrmsrl(MSR_KVM_PV_SCHED, ULLONG_MAX);
+}
+#endif
+
#endif /* _ASM_X86_KVM_PARA_H */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 526d4da3dcd4..5f96b228bdd5 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -377,6 +377,14 @@ static void kvm_guest_cpu_init(void)
wrmsrl(MSR_KVM_PV_EOI_EN, pa);
}

+#ifdef CONFIG_PARAVIRT_SCHED
+ if (pv_sched_enabled()) {
+ unsigned long pa = pv_sched_pa() | KVM_MSR_ENABLED;
+
+ wrmsrl(MSR_KVM_PV_SCHED, pa);
+ }
+#endif
+
if (has_steal_clock)
kvm_register_steal_time();
}
@@ -832,6 +840,14 @@ static void __init kvm_guest_init(void)
alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_kvm_asyncpf_interrupt);
}

+#ifdef CONFIG_PARAVIRT_SCHED
+ if (kvm_para_has_feature(KVM_FEATURE_PV_SCHED)) {
+ pr_info("KVM host has PV_SCHED!\n");
+ pv_sched_enable();
+ } else
+ pr_info("KVM host does not support PV_SCHED!\n");
+#endif
+
#ifdef CONFIG_SMP
if (pv_tlb_flush_supported()) {
pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index de7382f149cf..e740b1e8abe3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2442,4 +2442,25 @@ static inline void sched_core_fork(struct task_struct *p) { }

extern void sched_set_stop_task(int cpu, struct task_struct *stop);

+#ifdef CONFIG_PARAVIRT_SCHED
+DECLARE_STATIC_KEY_FALSE(__pv_sched_enabled);
+
+extern unsigned long pv_sched_pa(void);
+
+static inline bool pv_sched_enabled(void)
+{
+ return static_branch_unlikely(&__pv_sched_enabled);
+}
+
+static inline void pv_sched_enable(void)
+{
+ static_branch_enable(&__pv_sched_enabled);
+}
+
+extern bool pv_sched_vcpu_boosted(void);
+extern void pv_sched_boost_vcpu(void);
+extern void pv_sched_unboost_vcpu(void);
+extern void pv_sched_boost_vcpu_lazy(void);
+extern void pv_sched_unboost_vcpu_lazy(void);
+#endif
#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index be61332c66b5..fae56faac0b0 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -210,6 +210,15 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs)
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
lockdep_sys_exit();
+#ifdef CONFIG_PARAVIRT_SCHED
+ /*
+ * Guest requests a boost when preemption is disabled but does not request
+ * an immediate unboost when preemption is enabled back. There is a chance
+ * that we are boosted here. Unboost if needed.
+ */
+ if (pv_sched_enabled() && !task_is_realtime(current))
+ pv_sched_unboost_vcpu();
+#endif
}

/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b47f72b6595f..57f211f1b3d7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -151,6 +151,71 @@ const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;

__read_mostly int scheduler_running;

+#ifdef CONFIG_PARAVIRT_SCHED
+#include <linux/kvm_para.h>
+
+DEFINE_STATIC_KEY_FALSE(__pv_sched_enabled);
+
+DEFINE_PER_CPU_DECRYPTED(struct pv_sched_data, pv_sched) __aligned(64);
+
+unsigned long pv_sched_pa(void)
+{
+ return slow_virt_to_phys(this_cpu_ptr(&pv_sched));
+}
+
+bool pv_sched_vcpu_boosted(void)
+{
+ return (this_cpu_read(pv_sched.boost_status) == VCPU_BOOST_BOOSTED);
+}
+
+void pv_sched_boost_vcpu_lazy(void)
+{
+ this_cpu_write(pv_sched.schedinfo.boost_req, VCPU_REQ_BOOST);
+}
+
+void pv_sched_unboost_vcpu_lazy(void)
+{
+ this_cpu_write(pv_sched.schedinfo.boost_req, VCPU_REQ_UNBOOST);
+}
+
+void pv_sched_boost_vcpu(void)
+{
+ pv_sched_boost_vcpu_lazy();
+ /*
+ * XXX: there could be a race between the boost_status check
+ * and hypercall.
+ */
+ if (this_cpu_read(pv_sched.boost_status) == VCPU_BOOST_NORMAL)
+ kvm_pv_sched_notify_host();
+}
+
+void pv_sched_unboost_vcpu(void)
+{
+ pv_sched_unboost_vcpu_lazy();
+ /*
+ * XXX: there could be a race between the boost_status check
+ * and hypercall.
+ */
+ if (this_cpu_read(pv_sched.boost_status) == VCPU_BOOST_BOOSTED &&
+ !preempt_count())
+ kvm_pv_sched_notify_host();
+}
+
+/*
+ * Share the preemption enabled/disabled status with host. This will not incur a
+ * VMEXIT and acts as a lazy boost/unboost mechanism - host will check this on
+ * the next VMEXIT for boost/unboost decisions.
+ * XXX: Lazy unboosting may allow cfs tasks to run on RT vcpu till next VMEXIT.
+ */
+static inline void pv_sched_update_preempt_status(bool preempt_disabled)
+{
+ if (pv_sched_enabled())
+ this_cpu_write(pv_sched.schedinfo.preempt_disabled, preempt_disabled);
+}
+#else
+static inline void pv_sched_update_preempt_status(bool preempt_disabled) {}
+#endif
+
#ifdef CONFIG_SCHED_CORE

DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
@@ -2070,6 +2135,19 @@ unsigned long get_wchan(struct task_struct *p)

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
+#ifdef CONFIG_PARAVIRT_SCHED
+ /*
+ * TODO: currently request for boosting remote vcpus is not implemented. So
+ * we boost only if this enqueue happens for this cpu.
+ * This is not a big problem though, target cpu gets an IPI and then gets
+ * boosted by the host. Posted interrupts is an exception where target vcpu
+ * will not get boosted immediately, but on the next schedule().
+ */
+ if (pv_sched_enabled() && this_rq() == rq &&
+ sched_class_above(p->sched_class, &fair_sched_class))
+ pv_sched_boost_vcpu_lazy();
+#endif
+
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);

@@ -5835,6 +5913,8 @@ static inline void preempt_latency_start(int val)
#ifdef CONFIG_DEBUG_PREEMPT
current->preempt_disable_ip = ip;
#endif
+ pv_sched_update_preempt_status(true);
+
trace_preempt_off(CALLER_ADDR0, ip);
}
}
@@ -5867,8 +5947,10 @@ NOKPROBE_SYMBOL(preempt_count_add);
*/
static inline void preempt_latency_stop(int val)
{
- if (preempt_count() == val)
+ if (preempt_count() == val) {
+ pv_sched_update_preempt_status(false);
trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+ }
}

void preempt_count_sub(int val)
@@ -6678,6 +6760,15 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq->last_seen_need_resched_ns = 0;
#endif

+#ifdef CONFIG_PARAVIRT_SCHED
+ if (pv_sched_enabled()) {
+ if (sched_class_above(next->sched_class, &fair_sched_class))
+ pv_sched_boost_vcpu_lazy();
+ else if (next->sched_class == &fair_sched_class)
+ pv_sched_unboost_vcpu();
+ }
+#endif
+
if (likely(prev != next)) {
rq->nr_switches++;
/*
--
2.43.0

2023-12-14 03:11:37

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 3/8] kvm: x86: vcpu boosting/unboosting framework

When the guest kernel is about to run a critical or latency sensitive
workload, it can request the hypervisor to boost the priority of the
vcpu thread. Similarly, guest kernel can request to unboost when the
vcpu switches to a normal workload.

When a guest determines that it needs a boost, it need not immediately
request a synchronous boost as it is already running at that moment.
Synchronous request is detrimental because it incurs a VMEXIT. Rather,
let the guest note down its request on a shared memory and the host can
check this request on next VMEXIT and boost if needed.

Preemption disabled state in a vcpu is also considered latency sensitive
and requires boosting. Guest passes its preemption state to host and
host boosts the vcpu thread as needed.

Unboost requests needs to be synchronous because guest running boosted
while really not required might hurt host workload. Implement a
synchronous mechanism to unboost using MSR_KVM_PV_SCHED. Host checks
the shared memory for boost/unboost request on every VMEXIT. So, the
VMEXIT due to wrmsr(ULLONG_MAX) also goes through a fastexit path which
takes care of the boost/unboost.

Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 42 ++++++++++++++++
arch/x86/include/uapi/asm/kvm_para.h | 19 +++++++
arch/x86/kvm/x86.c | 48 ++++++++++++++++++
include/linux/kvm_host.h | 11 +++++
virt/kvm/kvm_main.c | 74 ++++++++++++++++++++++++++++
5 files changed, 194 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f89ba1f07d88..474fe2d6d3e0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -994,6 +994,8 @@ struct kvm_vcpu_arch {
*/
struct {
enum kvm_vcpu_boost_state boost_status;
+ int boost_policy;
+ int boost_prio;
u64 msr_val;
struct gfn_to_hva_cache data;
} pv_sched;
@@ -2230,6 +2232,13 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
#define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)

#ifdef CONFIG_PARAVIRT_SCHED_KVM
+/*
+ * Default policy and priority used for boosting
+ * VCPU threads.
+ */
+#define VCPU_BOOST_DEFAULT_PRIO 8
+#define VCPU_BOOST_DEFAULT_POLICY SCHED_RR
+
static inline bool kvm_arch_vcpu_pv_sched_enabled(struct kvm_vcpu_arch *arch)
{
return arch->pv_sched.msr_val;
@@ -2240,6 +2249,39 @@ static inline void kvm_arch_vcpu_set_boost_status(struct kvm_vcpu_arch *arch,
{
arch->pv_sched.boost_status = boost_status;
}
+
+static inline bool kvm_arch_vcpu_boosted(struct kvm_vcpu_arch *arch)
+{
+ return arch->pv_sched.boost_status == VCPU_BOOST_BOOSTED;
+}
+
+static inline int kvm_arch_vcpu_boost_policy(struct kvm_vcpu_arch *arch)
+{
+ return arch->pv_sched.boost_policy;
+}
+
+static inline int kvm_arch_vcpu_boost_prio(struct kvm_vcpu_arch *arch)
+{
+ return arch->pv_sched.boost_prio;
+}
+
+static inline int kvm_arch_vcpu_set_boost_prio(struct kvm_vcpu_arch *arch, u64 prio)
+{
+ if (prio >= MAX_RT_PRIO)
+ return -EINVAL;
+
+ arch->pv_sched.boost_prio = prio;
+ return 0;
+}
+
+static inline int kvm_arch_vcpu_set_boost_policy(struct kvm_vcpu_arch *arch, u64 policy)
+{
+ if (policy != SCHED_FIFO && policy != SCHED_RR)
+ return -EINVAL;
+
+ arch->pv_sched.boost_policy = policy;
+ return 0;
+}
#endif

#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6b1dea07a563..e53c3f3a88d7 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -167,11 +167,30 @@ enum kvm_vcpu_boost_state {
VCPU_BOOST_BOOSTED
};

+/*
+ * Boost Request from guest to host for lazy boosting.
+ */
+enum kvm_vcpu_boost_request {
+ VCPU_REQ_NONE = 0,
+ VCPU_REQ_UNBOOST,
+ VCPU_REQ_BOOST,
+};
+
+
+union guest_schedinfo {
+ struct {
+ __u8 boost_req;
+ __u8 preempt_disabled;
+ };
+ __u64 pad;
+};
+
/*
* Structure passed in via MSR_KVM_PV_SCHED
*/
struct pv_sched_data {
__u64 boost_status;
+ union guest_schedinfo schedinfo;
};

#endif /* _UAPI_ASM_X86_KVM_PARA_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0f475b50ac83..2577e1083f91 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2148,6 +2148,37 @@ static inline bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
xfer_to_guest_mode_work_pending();
}

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+static inline bool __vcpu_needs_boost(struct kvm_vcpu *vcpu, union guest_schedinfo schedinfo)
+{
+ bool pending_event = kvm_cpu_has_pending_timer(vcpu) || kvm_cpu_has_interrupt(vcpu);
+
+ /*
+ * vcpu needs a boost if
+ * - A lazy boost request active, or
+ * - Pending latency sensitive event, or
+ * - Preemption disabled in this vcpu.
+ */
+ return (schedinfo.boost_req == VCPU_REQ_BOOST || pending_event || schedinfo.preempt_disabled);
+}
+
+static inline void kvm_vcpu_do_pv_sched(struct kvm_vcpu *vcpu)
+{
+ union guest_schedinfo schedinfo;
+
+ if (!kvm_vcpu_sched_enabled(vcpu))
+ return;
+
+ if (kvm_read_guest_offset_cached(vcpu->kvm, &vcpu->arch.pv_sched.data,
+ &schedinfo, offsetof(struct pv_sched_data, schedinfo), sizeof(schedinfo)))
+ return;
+
+ kvm_vcpu_set_sched(vcpu, __vcpu_needs_boost(vcpu, schedinfo));
+}
+#else
+static inline void kvm_vcpu_do_pv_sched(struct kvm_vcpu *vcpu) { }
+#endif
+
/*
* The fast path for frequent and performance sensitive wrmsr emulation,
* i.e. the sending of IPI, sending IPI early in the VM-Exit flow reduces
@@ -2201,6 +2232,15 @@ fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu)
ret = EXIT_FASTPATH_REENTER_GUEST;
}
break;
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ case MSR_KVM_PV_SCHED:
+ data = kvm_read_edx_eax(vcpu);
+ if (data == ULLONG_MAX) {
+ kvm_skip_emulated_instruction(vcpu);
+ ret = EXIT_FASTPATH_EXIT_HANDLED;
+ }
+ break;
+#endif
default:
break;
}
@@ -10919,6 +10959,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
guest_timing_exit_irqoff();

local_irq_enable();
+
+ kvm_vcpu_do_pv_sched(vcpu);
+
preempt_enable();

kvm_vcpu_srcu_read_lock(vcpu);
@@ -11990,6 +12033,11 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
if (r)
goto free_guest_fpu;

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ kvm_arch_vcpu_set_boost_prio(&vcpu->arch, VCPU_BOOST_DEFAULT_PRIO);
+ kvm_arch_vcpu_set_boost_policy(&vcpu->arch, VCPU_BOOST_DEFAULT_POLICY);
+#endif
+
vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
kvm_xen_init_vcpu(vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a74aeea55347..c6647f6312c9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2290,6 +2290,17 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)

#ifdef CONFIG_PARAVIRT_SCHED_KVM
void kvm_set_vcpu_boosted(struct kvm_vcpu *vcpu, bool boosted);
+int kvm_vcpu_set_sched(struct kvm_vcpu *vcpu, bool boost);
+
+static inline bool kvm_vcpu_sched_enabled(struct kvm_vcpu *vcpu)
+{
+ return kvm_arch_vcpu_pv_sched_enabled(&vcpu->arch);
+}
+#else
+static inline int kvm_vcpu_set_sched(struct kvm_vcpu *vcpu, bool boost)
+{
+ return 0;
+}
#endif

#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5bbb5612b207..37748e2512e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -57,6 +57,9 @@
#include <asm/ioctl.h>
#include <linux/uaccess.h>

+#include <linux/sched.h>
+#include <uapi/linux/sched/types.h>
+
#include "coalesced_mmio.h"
#include "async_pf.h"
#include "kvm_mm.h"
@@ -3602,6 +3605,77 @@ bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_vcpu_wake_up);

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+/*
+ * Check if we need to act on the boost/unboost request.
+ * Returns true if:
+ * - caller is requesting boost and vcpu is boosted, or
+ * - caller is requesting unboost and vcpu is not boosted.
+ */
+static inline bool __can_ignore_set_sched(struct kvm_vcpu *vcpu, bool boost)
+{
+ return ((boost && kvm_arch_vcpu_boosted(&vcpu->arch)) ||
+ (!boost && !kvm_arch_vcpu_boosted(&vcpu->arch)));
+}
+
+int kvm_vcpu_set_sched(struct kvm_vcpu *vcpu, bool boost)
+{
+ int policy;
+ int ret = 0;
+ struct pid *pid;
+ struct sched_param param = { 0 };
+ struct task_struct *vcpu_task = NULL;
+
+ /*
+ * We can ignore the request if a boost request comes
+ * when we are already boosted or an unboost request
+ * when we are already unboosted.
+ */
+ if (__can_ignore_set_sched(vcpu, boost))
+ goto set_boost_status;
+
+ if (boost) {
+ policy = kvm_arch_vcpu_boost_policy(&vcpu->arch);
+ param.sched_priority = kvm_arch_vcpu_boost_prio(&vcpu->arch);
+ } else {
+ /*
+ * TODO: here we just unboost to SCHED_NORMAL. Ideally we
+ * should either
+ * - revert to the initial priority before boost, or
+ * - introduce tunables for unboost priority.
+ */
+ policy = SCHED_NORMAL;
+ param.sched_priority = 0;
+ }
+
+ rcu_read_lock();
+ pid = rcu_dereference(vcpu->pid);
+ if (pid)
+ vcpu_task = get_pid_task(pid, PIDTYPE_PID);
+ rcu_read_unlock();
+ if (vcpu_task == NULL)
+ return -KVM_EINVAL;
+
+ /*
+ * This might be called from interrupt context.
+ * Since we do not use rt-mutexes, we can safely call
+ * sched_setscheduler_pi_nocheck with pi = false.
+ * NOTE: If in future, we use rt-mutexes, this should
+ * be modified to use a tasklet to do boost/unboost.
+ */
+ WARN_ON_ONCE(vcpu_task->pi_top_task);
+ ret = sched_setscheduler_pi_nocheck(vcpu_task, policy,
+ &param, false);
+ put_task_struct(vcpu_task);
+set_boost_status:
+ if (!ret)
+ kvm_set_vcpu_boosted(vcpu, boost);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_set_sched);
+#endif
+
#ifndef CONFIG_S390
/*
* Kick a sleeping VCPU, or a guest VCPU in guest mode, into host kernel mode.
--
2.43.0

2023-12-14 10:54:19

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] kvm: x86: MSR for setting up scheduler info shared memory

"Vineeth Pillai (Google)" <[email protected]> writes:

> Implement a kvm MSR that guest uses to provide the GPA of shared memory
> for communicating the scheduling information between host and guest.
>
> wrmsr(0) disables the feature. wrmsr(valid_gpa) enables the feature and
> uses the gpa for further communication.
>
> Also add a new cpuid feature flag for the host to advertise the feature
> to the guest.
>
> Co-developed-by: Joel Fernandes (Google) <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 25 ++++++++++++
> arch/x86/include/uapi/asm/kvm_para.h | 24 +++++++++++
> arch/x86/kvm/Kconfig | 12 ++++++
> arch/x86/kvm/cpuid.c | 2 +
> arch/x86/kvm/x86.c | 61 ++++++++++++++++++++++++++++
> include/linux/kvm_host.h | 5 +++
> 6 files changed, 129 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f72b30d2238a..f89ba1f07d88 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -987,6 +987,18 @@ struct kvm_vcpu_arch {
> /* Protected Guests */
> bool guest_state_protected;
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> + /*
> + * MSR to setup a shared memory for scheduling
> + * information sharing between host and guest.
> + */
> + struct {
> + enum kvm_vcpu_boost_state boost_status;
> + u64 msr_val;
> + struct gfn_to_hva_cache data;
> + } pv_sched;
> +#endif
> +
> /*
> * Set when PDPTS were loaded directly by the userspace without
> * reading the guest memory
> @@ -2217,4 +2229,17 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> */
> #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +static inline bool kvm_arch_vcpu_pv_sched_enabled(struct kvm_vcpu_arch *arch)
> +{
> + return arch->pv_sched.msr_val;
> +}
> +
> +static inline void kvm_arch_vcpu_set_boost_status(struct kvm_vcpu_arch *arch,
> + enum kvm_vcpu_boost_state boost_status)
> +{
> + arch->pv_sched.boost_status = boost_status;
> +}
> +#endif
> +
> #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 6e64b27b2c1e..6b1dea07a563 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -36,6 +36,7 @@
> #define KVM_FEATURE_MSI_EXT_DEST_ID 15
> #define KVM_FEATURE_HC_MAP_GPA_RANGE 16
> #define KVM_FEATURE_MIGRATION_CONTROL 17
> +#define KVM_FEATURE_PV_SCHED 18
>
> #define KVM_HINTS_REALTIME 0
>
> @@ -58,6 +59,7 @@
> #define MSR_KVM_ASYNC_PF_INT 0x4b564d06
> #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
> #define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
> +#define MSR_KVM_PV_SCHED 0x4b564da0
>
> struct kvm_steal_time {
> __u64 steal;
> @@ -150,4 +152,26 @@ struct kvm_vcpu_pv_apf_data {
> #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
> #define KVM_PV_EOI_DISABLED 0x0
>
> +/*
> + * VCPU boost state shared between the host and guest.
> + */
> +enum kvm_vcpu_boost_state {
> + /* Priority boosting feature disabled in host */
> + VCPU_BOOST_DISABLED = 0,
> + /*
> + * vcpu is not explicitly boosted by the host.
> + * (Default priority when the guest started)
> + */
> + VCPU_BOOST_NORMAL,
> + /* vcpu is boosted by the host */
> + VCPU_BOOST_BOOSTED
> +};
> +
> +/*
> + * Structure passed in via MSR_KVM_PV_SCHED
> + */
> +struct pv_sched_data {
> + __u64 boost_status;
> +};
> +
> #endif /* _UAPI_ASM_X86_KVM_PARA_H */
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 89ca7f4c1464..dbcba73fb508 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -141,4 +141,16 @@ config KVM_XEN
> config KVM_EXTERNAL_WRITE_TRACKING
> bool
>
> +config PARAVIRT_SCHED_KVM
> + bool "Enable paravirt scheduling capability for kvm"
> + depends on KVM
> + help
> + Paravirtualized scheduling facilitates the exchange of scheduling
> + related information between the host and guest through shared memory,
> + enhancing the efficiency of vCPU thread scheduling by the hypervisor.
> + An illustrative use case involves dynamically boosting the priority of
> + a vCPU thread when the guest is executing a latency-sensitive workload
> + on that specific vCPU.
> + This config enables paravirt scheduling in the kvm hypervisor.
> +
> endif # VIRTUALIZATION
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 7bdc66abfc92..960ef6e869f2 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1113,6 +1113,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> (1 << KVM_FEATURE_POLL_CONTROL) |
> (1 << KVM_FEATURE_PV_SCHED_YIELD) |
> (1 << KVM_FEATURE_ASYNC_PF_INT);
> + if (IS_ENABLED(CONFIG_PARAVIRT_SCHED_KVM))
> + entry->eax |= (1 << KVM_FEATURE_PV_SCHED);
>
> if (sched_info_on())
> entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7bcf1a76a6ab..0f475b50ac83 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3879,6 +3879,33 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> return 1;
> break;
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> + case MSR_KVM_PV_SCHED:
> + if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SCHED))
> + return 1;
> +
> + if (!(data & KVM_MSR_ENABLED))
> + break;
> +
> + if (!(data & ~KVM_MSR_ENABLED)) {
> + /*
> + * Disable the feature
> + */
> + vcpu->arch.pv_sched.msr_val = 0;
> + kvm_set_vcpu_boosted(vcpu, false);
> + } if (!kvm_gfn_to_hva_cache_init(vcpu->kvm,
> + &vcpu->arch.pv_sched.data, data & ~KVM_MSR_ENABLED,
> + sizeof(struct pv_sched_data))) {
> + vcpu->arch.pv_sched.msr_val = data;
> + kvm_set_vcpu_boosted(vcpu, false);
> + } else {
> + pr_warn("MSR_KVM_PV_SCHED: kvm:%p, vcpu:%p, "
> + "msr value: %llx, kvm_gfn_to_hva_cache_init failed!\n",
> + vcpu->kvm, vcpu, data & ~KVM_MSR_ENABLED);

As this is triggerable by the guest please drop this print (which is not
even ratelimited!). I think it would be better to just 'return 1;' in case
of kvm_gfn_to_hva_cache_init() failure but maybe you also need to
account for 'msr_info->host_initiated' to not fail setting this MSR from
the host upon migration.

> + }
> + break;
> +#endif
> +
> case MSR_KVM_POLL_CONTROL:
> if (!guest_pv_has(vcpu, KVM_FEATURE_POLL_CONTROL))
> return 1;
> @@ -4239,6 +4266,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>
> msr_info->data = vcpu->arch.pv_eoi.msr_val;
> break;
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> + case MSR_KVM_PV_SCHED:
> + msr_info->data = vcpu->arch.pv_sched.msr_val;
> + break;
> +#endif
> case MSR_KVM_POLL_CONTROL:
> if (!guest_pv_has(vcpu, KVM_FEATURE_POLL_CONTROL))
> return 1;
> @@ -9820,6 +9852,29 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
> return kvm_skip_emulated_instruction(vcpu);
> }
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +static void record_vcpu_boost_status(struct kvm_vcpu *vcpu)
> +{
> + u64 val = vcpu->arch.pv_sched.boost_status;
> +
> + if (!kvm_arch_vcpu_pv_sched_enabled(&vcpu->arch))
> + return;
> +
> + pagefault_disable();
> + kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.pv_sched.data,
> + &val, offsetof(struct pv_sched_data, boost_status), sizeof(u64));
> + pagefault_enable();
> +}
> +
> +void kvm_set_vcpu_boosted(struct kvm_vcpu *vcpu, bool boosted)
> +{
> + kvm_arch_vcpu_set_boost_status(&vcpu->arch,
> + boosted ? VCPU_BOOST_BOOSTED : VCPU_BOOST_NORMAL);
> +
> + kvm_make_request(KVM_REQ_VCPU_BOOST_UPDATE, vcpu);
> +}
> +#endif
> +
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> {
> unsigned long nr, a0, a1, a2, a3, ret;
> @@ -10593,6 +10648,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> }
> if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
> record_steal_time(vcpu);
> +
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> + if (kvm_check_request(KVM_REQ_VCPU_BOOST_UPDATE, vcpu))
> + record_vcpu_boost_status(vcpu);
> +#endif
> +
> #ifdef CONFIG_KVM_SMM
> if (kvm_check_request(KVM_REQ_SMI, vcpu))
> process_smi(vcpu);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..a74aeea55347 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -167,6 +167,7 @@ static inline bool is_error_page(struct page *page)
> #define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_UNBLOCK 2
> #define KVM_REQ_DIRTY_RING_SOFT_FULL 3
> +#define KVM_REQ_VCPU_BOOST_UPDATE 6
> #define KVM_REQUEST_ARCH_BASE 8
>
> /*
> @@ -2287,4 +2288,8 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +void kvm_set_vcpu_boosted(struct kvm_vcpu *vcpu, bool boosted);
> +#endif
> +
> #endif

--
Vitaly

2023-12-14 16:39:03

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

+sched_ext folks

On Wed, Dec 13, 2023, Vineeth Pillai (Google) wrote:
> Double scheduling is a concern with virtualization hosts where the host
> schedules vcpus without knowing whats run by the vcpu and guest schedules
> tasks without knowing where the vcpu is physically running. This causes
> issues related to latencies, power consumption, resource utilization
> etc. An ideal solution would be to have a cooperative scheduling
> framework where the guest and host shares scheduling related information
> and makes an educated scheduling decision to optimally handle the
> workloads. As a first step, we are taking a stab at reducing latencies
> for latency sensitive workloads in the guest.
>
> This series of patches aims to implement a framework for dynamically
> managing the priority of vcpu threads based on the needs of the workload
> running on the vcpu. Latency sensitive workloads (nmi, irq, softirq,
> critcal sections, RT tasks etc) will get a boost from the host so as to
> minimize the latency.
>
> The host can proactively boost the vcpu threads when it has enough
> information about what is going to run on the vcpu - fo eg: injecting
> interrupts. For rest of the case, guest can request boost if the vcpu is
> not already boosted. The guest can subsequently request unboost after
> the latency sensitive workloads completes. Guest can also request a
> boost if needed.
>
> A shared memory region is used to communicate the scheduling information.
> Guest shares its needs for priority boosting and host shares the boosting
> status of the vcpu. Guest sets a flag when it needs a boost and continues
> running. Host reads this on next VMEXIT and boosts the vcpu thread. For
> unboosting, it is done synchronously so that host workloads can fairly
> compete with guests when guest is not running any latency sensitive
> workload.

Big thumbs down on my end. Nothing in this RFC explains why this should be done
in KVM. In general, I am very opposed to putting policy of any kind into KVM,
and this puts a _lot_ of unmaintainable policy into KVM by deciding when to
start/stop boosting a vCPU.

Concretely, boosting vCPUs for most events is far too coarse grained. E.g. boosting
a vCPU that is running a low priority workload just because the vCPU triggered
an NMI due to PMU counter overflow doesn't make sense. Ditto for if a guest's
hrtimer expires on a vCPU running a low priority workload.

And as evidenced by patch 8/8, boosting vCPUs based on when an event is _pending_
is not maintainable. As hardware virtualizes more and more functionality, KVM's
visilibity into the guest effectively decreases, e.g. Intel and AMD both support
with IPI virtualization.

Boosting the target of a PV spinlock kick is similarly flawed. In that case, KVM
only gets involved _after_ there is a problem, i.e. after a lock is contended so
heavily that a vCPU stops spinning and instead decided to HLT. It's not hard to
imagine scenarios where a guest would want to communicate to the host that it's
acquiring a spinlock for a latency sensitive path and so shouldn't be scheduled
out. And of course that's predicated on the assumption that all vCPUs are subject
to CPU overcommit.

Initiating a boost from the host is also flawed in the sense that it relies on
the guest to be on the same page as to when it should stop boosting. E.g. if
KVM boosts a vCPU because an IRQ is pending, but the guest doesn't want to boost
IRQs on that vCPU and thus doesn't stop boosting at the end of the IRQ handler,
then the vCPU could end up being boosted long after its done with the IRQ.

Throw nested virtualization into the mix and then all of this becomes nigh
impossible to sort out in KVM. E.g. if an L1 vCPU is a running an L2 vCPU, i.e.
a nested guest, and L2 is spamming interrupts for whatever reason, KVM will end
repeatedly boosting the L1 vCPU regardless of the priority of the L2 workload.

For things that aren't clearly in KVM's domain, I don't think we should implement
KVM-specific functionality until every other option has been tried (and failed).
I don't see any reason why KVM needs to get involved in scheduling, beyond maybe
providing *input* regarding event injection, emphasis on *input* because KVM
providing information to userspace or some other entity is wildly different than
KVM making scheduling decisions based on that information.

Pushing the scheduling policies to host userspace would allow for far more control
and flexibility. E.g. a heavily paravirtualized environment where host userspace
knows *exactly* what workloads are being run could have wildly different policies
than an environment where the guest is a fairly vanilla Linux VM that has received
a small amount of enlightment.

Lastly, if the concern/argument is that userspace doesn't have the right knobs
to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
tailor made for the problems you are trying to solve.

https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org

2023-12-14 19:25:32

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
>
> +sched_ext folks
>
> On Wed, Dec 13, 2023, Vineeth Pillai (Google) wrote:
> > Double scheduling is a concern with virtualization hosts where the host
> > schedules vcpus without knowing whats run by the vcpu and guest schedules
> > tasks without knowing where the vcpu is physically running. This causes
> > issues related to latencies, power consumption, resource utilization
> > etc. An ideal solution would be to have a cooperative scheduling
> > framework where the guest and host shares scheduling related information
> > and makes an educated scheduling decision to optimally handle the
> > workloads. As a first step, we are taking a stab at reducing latencies
> > for latency sensitive workloads in the guest.
> >
> > This series of patches aims to implement a framework for dynamically
> > managing the priority of vcpu threads based on the needs of the workload
> > running on the vcpu. Latency sensitive workloads (nmi, irq, softirq,
> > critcal sections, RT tasks etc) will get a boost from the host so as to
> > minimize the latency.
> >
> > The host can proactively boost the vcpu threads when it has enough
> > information about what is going to run on the vcpu - fo eg: injecting
> > interrupts. For rest of the case, guest can request boost if the vcpu is
> > not already boosted. The guest can subsequently request unboost after
> > the latency sensitive workloads completes. Guest can also request a
> > boost if needed.
> >
> > A shared memory region is used to communicate the scheduling information.
> > Guest shares its needs for priority boosting and host shares the boosting
> > status of the vcpu. Guest sets a flag when it needs a boost and continues
> > running. Host reads this on next VMEXIT and boosts the vcpu thread. For
> > unboosting, it is done synchronously so that host workloads can fairly
> > compete with guests when guest is not running any latency sensitive
> > workload.
>
> Big thumbs down on my end. Nothing in this RFC explains why this should be done
> in KVM. In general, I am very opposed to putting policy of any kind into KVM,
> and this puts a _lot_ of unmaintainable policy into KVM by deciding when to
> start/stop boosting a vCPU.
>
I am sorry for not clearly explaining the goal. The intent was not to
have scheduling policies implemented in kvm, but to have a mechanism
for guest and host schedulers to communicate so that guest workloads
get a fair treatment from host scheduler while competing with host
workloads. Now when I think about it, the implementation seems to
suggest that we are putting policies in kvm. Ideally, the goal is:
- guest scheduler communicates the priority requirements of the workload
- kvm applies the priority to the vcpu task.
- Now that vcpu is appropriately prioritized, host scheduler can make
the right choice of picking the next best task.

We have an exception of proactive boosting for interrupts/nmis. I
don't expect these proactive boosting cases to grow. And I think this
also to be controlled by the guest where the guest can say what
scenarios would it like to be proactive boosted.

That would make kvm just a medium to communicate the scheduler
requirements from guest to host and not house any policies. What do
you think?

> Concretely, boosting vCPUs for most events is far too coarse grained. E.g. boosting
> a vCPU that is running a low priority workload just because the vCPU triggered
> an NMI due to PMU counter overflow doesn't make sense. Ditto for if a guest's
> hrtimer expires on a vCPU running a low priority workload.
>
> And as evidenced by patch 8/8, boosting vCPUs based on when an event is _pending_
> is not maintainable. As hardware virtualizes more and more functionality, KVM's
> visibility into the guest effectively decreases, e.g. Intel and AMD both support
> with IPI virtualization.
>
> Boosting the target of a PV spinlock kick is similarly flawed. In that case, KVM
> only gets involved _after_ there is a problem, i.e. after a lock is contended so
> heavily that a vCPU stops spinning and instead decided to HLT. It's not hard to
> imagine scenarios where a guest would want to communicate to the host that it's
> acquiring a spinlock for a latency sensitive path and so shouldn't be scheduled
> out. And of course that's predicated on the assumption that all vCPUs are subject
> to CPU overcommit.
>
> Initiating a boost from the host is also flawed in the sense that it relies on
> the guest to be on the same page as to when it should stop boosting. E.g. if
> KVM boosts a vCPU because an IRQ is pending, but the guest doesn't want to boost
> IRQs on that vCPU and thus doesn't stop boosting at the end of the IRQ handler,
> then the vCPU could end up being boosted long after its done with the IRQ.
>
> Throw nested virtualization into the mix and then all of this becomes nigh
> impossible to sort out in KVM. E.g. if an L1 vCPU is a running an L2 vCPU, i.e.
> a nested guest, and L2 is spamming interrupts for whatever reason, KVM will end
> repeatedly boosting the L1 vCPU regardless of the priority of the L2 workload.
>
> For things that aren't clearly in KVM's domain, I don't think we should implement
> KVM-specific functionality until every other option has been tried (and failed).
> I don't see any reason why KVM needs to get involved in scheduling, beyond maybe
> providing *input* regarding event injection, emphasis on *input* because KVM
> providing information to userspace or some other entity is wildly different than
> KVM making scheduling decisions based on that information.
>
Agreed with all the points above and it doesn't make sense to have
policies in kvm. But if kvm can act as a medium to communicate
scheduling requirements between guest and host and not make any
decisions, would that be more reasonable?

> Pushing the scheduling policies to host userspace would allow for far more control
> and flexibility. E.g. a heavily paravirtualized environment where host userspace
> knows *exactly* what workloads are being run could have wildly different policies
> than an environment where the guest is a fairly vanilla Linux VM that has received
> a small amount of enlightment.
>
> Lastly, if the concern/argument is that userspace doesn't have the right knobs
> to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> tailor made for the problems you are trying to solve.
>
> https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
>
You are right, sched_ext is a good choice to have policies
implemented. In our case, we would need a communication mechanism as
well and hence we thought kvm would work best to be a medium between
the guest and the host. The policies could be in the guest and the
guest shall communicate its priority requirements(based on policy) to
the host via kvm and then the host scheduler takes action based on
that.

Please let me know.

Thanks,
Vineeth

2023-12-14 19:54:59

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] kvm: x86: MSR for setting up scheduler info shared memory

On Thu, Dec 14, 2023 at 5:53 AM Vitaly Kuznetsov <[email protected]> wrote:
>
> "Vineeth Pillai (Google)" <[email protected]> writes:
>
> > Implement a kvm MSR that guest uses to provide the GPA of shared memory
> > for communicating the scheduling information between host and guest.
> >
> > wrmsr(0) disables the feature. wrmsr(valid_gpa) enables the feature and
> > uses the gpa for further communication.
> >
> > Also add a new cpuid feature flag for the host to advertise the feature
> > to the guest.
> >
> > Co-developed-by: Joel Fernandes (Google) <[email protected]>
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 25 ++++++++++++
> > arch/x86/include/uapi/asm/kvm_para.h | 24 +++++++++++
> > arch/x86/kvm/Kconfig | 12 ++++++
> > arch/x86/kvm/cpuid.c | 2 +
> > arch/x86/kvm/x86.c | 61 ++++++++++++++++++++++++++++
> > include/linux/kvm_host.h | 5 +++
> > 6 files changed, 129 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f72b30d2238a..f89ba1f07d88 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -987,6 +987,18 @@ struct kvm_vcpu_arch {
> > /* Protected Guests */
> > bool guest_state_protected;
> >
> > +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> > + /*
> > + * MSR to setup a shared memory for scheduling
> > + * information sharing between host and guest.
> > + */
> > + struct {
> > + enum kvm_vcpu_boost_state boost_status;
> > + u64 msr_val;
> > + struct gfn_to_hva_cache data;
> > + } pv_sched;
> > +#endif
> > +
> > /*
> > * Set when PDPTS were loaded directly by the userspace without
> > * reading the guest memory
> > @@ -2217,4 +2229,17 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > */
> > #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> >
> > +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> > +static inline bool kvm_arch_vcpu_pv_sched_enabled(struct kvm_vcpu_arch *arch)
> > +{
> > + return arch->pv_sched.msr_val;
> > +}
> > +
> > +static inline void kvm_arch_vcpu_set_boost_status(struct kvm_vcpu_arch *arch,
> > + enum kvm_vcpu_boost_state boost_status)
> > +{
> > + arch->pv_sched.boost_status = boost_status;
> > +}
> > +#endif
> > +
> > #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> > index 6e64b27b2c1e..6b1dea07a563 100644
> > --- a/arch/x86/include/uapi/asm/kvm_para.h
> > +++ b/arch/x86/include/uapi/asm/kvm_para.h
> > @@ -36,6 +36,7 @@
> > #define KVM_FEATURE_MSI_EXT_DEST_ID 15
> > #define KVM_FEATURE_HC_MAP_GPA_RANGE 16
> > #define KVM_FEATURE_MIGRATION_CONTROL 17
> > +#define KVM_FEATURE_PV_SCHED 18
> >
> > #define KVM_HINTS_REALTIME 0
> >
> > @@ -58,6 +59,7 @@
> > #define MSR_KVM_ASYNC_PF_INT 0x4b564d06
> > #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
> > #define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
> > +#define MSR_KVM_PV_SCHED 0x4b564da0
> >
> > struct kvm_steal_time {
> > __u64 steal;
> > @@ -150,4 +152,26 @@ struct kvm_vcpu_pv_apf_data {
> > #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
> > #define KVM_PV_EOI_DISABLED 0x0
> >
> > +/*
> > + * VCPU boost state shared between the host and guest.
> > + */
> > +enum kvm_vcpu_boost_state {
> > + /* Priority boosting feature disabled in host */
> > + VCPU_BOOST_DISABLED = 0,
> > + /*
> > + * vcpu is not explicitly boosted by the host.
> > + * (Default priority when the guest started)
> > + */
> > + VCPU_BOOST_NORMAL,
> > + /* vcpu is boosted by the host */
> > + VCPU_BOOST_BOOSTED
> > +};
> > +
> > +/*
> > + * Structure passed in via MSR_KVM_PV_SCHED
> > + */
> > +struct pv_sched_data {
> > + __u64 boost_status;
> > +};
> > +
> > #endif /* _UAPI_ASM_X86_KVM_PARA_H */
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index 89ca7f4c1464..dbcba73fb508 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -141,4 +141,16 @@ config KVM_XEN
> > config KVM_EXTERNAL_WRITE_TRACKING
> > bool
> >
> > +config PARAVIRT_SCHED_KVM
> > + bool "Enable paravirt scheduling capability for kvm"
> > + depends on KVM
> > + help
> > + Paravirtualized scheduling facilitates the exchange of scheduling
> > + related information between the host and guest through shared memory,
> > + enhancing the efficiency of vCPU thread scheduling by the hypervisor.
> > + An illustrative use case involves dynamically boosting the priority of
> > + a vCPU thread when the guest is executing a latency-sensitive workload
> > + on that specific vCPU.
> > + This config enables paravirt scheduling in the kvm hypervisor.
> > +
> > endif # VIRTUALIZATION
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index 7bdc66abfc92..960ef6e869f2 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1113,6 +1113,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> > (1 << KVM_FEATURE_POLL_CONTROL) |
> > (1 << KVM_FEATURE_PV_SCHED_YIELD) |
> > (1 << KVM_FEATURE_ASYNC_PF_INT);
> > + if (IS_ENABLED(CONFIG_PARAVIRT_SCHED_KVM))
> > + entry->eax |= (1 << KVM_FEATURE_PV_SCHED);
> >
> > if (sched_info_on())
> > entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 7bcf1a76a6ab..0f475b50ac83 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3879,6 +3879,33 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > return 1;
> > break;
> >
> > +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> > + case MSR_KVM_PV_SCHED:
> > + if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SCHED))
> > + return 1;
> > +
> > + if (!(data & KVM_MSR_ENABLED))
> > + break;
> > +
> > + if (!(data & ~KVM_MSR_ENABLED)) {
> > + /*
> > + * Disable the feature
> > + */
> > + vcpu->arch.pv_sched.msr_val = 0;
> > + kvm_set_vcpu_boosted(vcpu, false);
> > + } if (!kvm_gfn_to_hva_cache_init(vcpu->kvm,
> > + &vcpu->arch.pv_sched.data, data & ~KVM_MSR_ENABLED,
> > + sizeof(struct pv_sched_data))) {
> > + vcpu->arch.pv_sched.msr_val = data;
> > + kvm_set_vcpu_boosted(vcpu, false);
> > + } else {
> > + pr_warn("MSR_KVM_PV_SCHED: kvm:%p, vcpu:%p, "
> > + "msr value: %llx, kvm_gfn_to_hva_cache_init failed!\n",
> > + vcpu->kvm, vcpu, data & ~KVM_MSR_ENABLED);
>
> As this is triggerable by the guest please drop this print (which is not
> even ratelimited!). I think it would be better to just 'return 1;' in case
> of kvm_gfn_to_hva_cache_init() failure but maybe you also need to
> account for 'msr_info->host_initiated' to not fail setting this MSR from
> the host upon migration.
>
Makes sense, shall remove the pr_warn.
I hadn't thought about migration, thanks for bringing this up. Will
make modifications to account for migration as well.

Thanks,
Vineeth

2023-12-14 20:13:57

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> Now when I think about it, the implementation seems to
> suggest that we are putting policies in kvm. Ideally, the goal is:
> - guest scheduler communicates the priority requirements of the workload
> - kvm applies the priority to the vcpu task.

Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
is that userspace doesn't have the right knobs to adjust the priority of a task
quickly and efficiently, then wouldn't it be better to solve that problem in a
generic way?

> - Now that vcpu is appropriately prioritized, host scheduler can make
> the right choice of picking the next best task.
>
> We have an exception of proactive boosting for interrupts/nmis. I
> don't expect these proactive boosting cases to grow. And I think this
> also to be controlled by the guest where the guest can say what
> scenarios would it like to be proactive boosted.
>
> That would make kvm just a medium to communicate the scheduler
> requirements from guest to host and not house any policies. What do
> you think?

...

> > Pushing the scheduling policies to host userspace would allow for far more control
> > and flexibility. E.g. a heavily paravirtualized environment where host userspace
> > knows *exactly* what workloads are being run could have wildly different policies
> > than an environment where the guest is a fairly vanilla Linux VM that has received
> > a small amount of enlightment.
> >
> > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > tailor made for the problems you are trying to solve.
> >
> > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> >
> You are right, sched_ext is a good choice to have policies
> implemented. In our case, we would need a communication mechanism as
> well and hence we thought kvm would work best to be a medium between
> the guest and the host.

Making KVM be the medium may be convenient and the quickest way to get a PoC
out the door, but effectively making KVM a middle-man is going to be a huge net
negative in the long term. Userspace can communicate with the guest just as
easily as KVM, and if you make KVM the middle-man, then you effectively *must*
define a relatively rigid guest/host ABI.

If instead the contract is between host userspace and the guest, the ABI can be
much more fluid, e.g. if you (or any setup) can control at least some amount of
code that runs in the guest, then the contract between the guest and host doesn't
even need to be formally defined, it could simply be a matter of bundling host
and guest code appropriately.

If you want to land support for a given contract in upstream repositories, e.g.
to broadly enable paravirt scheduling support across a variety of usersepace VMMs
and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
it's largely just a matter of writing code, and LKML provides a centralized channel
for getting buyin from all parties. But defining an ABI that's independent of the
kernel is absolutely doable, e.g. see the many virtio specs.

I'm not saying KVM can't help, e.g. if there is information that is known only
to KVM, but the vast majority of the contract doesn't need to be defined by KVM.

2023-12-14 21:37:02

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> > Now when I think about it, the implementation seems to
> > suggest that we are putting policies in kvm. Ideally, the goal is:
> > - guest scheduler communicates the priority requirements of the workload
> > - kvm applies the priority to the vcpu task.
>
> Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> is that userspace doesn't have the right knobs to adjust the priority of a task
> quickly and efficiently, then wouldn't it be better to solve that problem in a
> generic way?
>
I get your point. A generic way would have been more preferable, but I
feel the scenario we are tackling is a bit more time critical and kvm
is better equipped to handle this. kvm has control over the VM/vcpu
execution and hence it can take action in the most effective way.

One example is the place where we handle boost/unboost. By the time
you come out of kvm to userspace it would be too late. Currently we
apply the boost soon after VMEXIT before enabling preemption so that
the next scheduler entry will consider the boosted priority. As soon
as you enable preemption, the vcpu could be preempted and boosting
would not help when it is boosted. This timing correctness is very
difficult to achieve if we try to do it in userland or do it
out-of-band.

[...snip...]
> > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > tailor made for the problems you are trying to solve.
> > >
> > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > >
> > You are right, sched_ext is a good choice to have policies
> > implemented. In our case, we would need a communication mechanism as
> > well and hence we thought kvm would work best to be a medium between
> > the guest and the host.
>
> Making KVM be the medium may be convenient and the quickest way to get a PoC
> out the door, but effectively making KVM a middle-man is going to be a huge net
> negative in the long term. Userspace can communicate with the guest just as
> easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> define a relatively rigid guest/host ABI.
>
> If instead the contract is between host userspace and the guest, the ABI can be
> much more fluid, e.g. if you (or any setup) can control at least some amount of
> code that runs in the guest, then the contract between the guest and host doesn't
> even need to be formally defined, it could simply be a matter of bundling host
> and guest code appropriately.
>
> If you want to land support for a given contract in upstream repositories, e.g.
> to broadly enable paravirt scheduling support across a variety of usersepace VMMs
> and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> it's largely just a matter of writing code, and LKML provides a centralized channel
> for getting buyin from all parties. But defining an ABI that's independent of the
> kernel is absolutely doable, e.g. see the many virtio specs.
>
> I'm not saying KVM can't help, e.g. if there is information that is known only
> to KVM, but the vast majority of the contract doesn't need to be defined by KVM.
>
As you mentioned, custom contract between guest and host userspace is
really flexible, but I believe tackling scheduling(especially latency)
issues is a bit more difficult with generic approaches. Here kvm does
have some information known only to kvm(which could be shared - eg:
interrupt injection) but more importantly kvm has some unique
capabilities when it comes to scheduling. kvm and scheduler are
cooperating currently for various cases like, steal time accounting,
vcpu preemption state, spinlock handling etc. We could possibly try to
extend it a little further in a non-intrusive way.

Having a formal paravirt scheduling ABI is something we would want to
pursue (as I mentioned in the cover letter) and this could help not
only with latencies, but optimal task placement for efficiency, power
utilization etc. kvm's role could be to set the stage and share
information with minimum delay and less resource overhead. We could
use schedulers (vanilla, sched_ext, ...) to actually make decisions
based on the information it receives.

Thanks for all your valuable inputs and I understand that a formal ABI
is needed for the above interface. We shall look more into the
feasibility and efforts for this.

Thanks,
Vineeth

2023-12-15 00:54:41

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> > > Now when I think about it, the implementation seems to
> > > suggest that we are putting policies in kvm. Ideally, the goal is:
> > > - guest scheduler communicates the priority requirements of the workload
> > > - kvm applies the priority to the vcpu task.
> >
> > Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> > is that userspace doesn't have the right knobs to adjust the priority of a task
> > quickly and efficiently, then wouldn't it be better to solve that problem in a
> > generic way?
> >
> I get your point. A generic way would have been more preferable, but I
> feel the scenario we are tackling is a bit more time critical and kvm
> is better equipped to handle this. kvm has control over the VM/vcpu
> execution and hence it can take action in the most effective way.

No, KVM most definitely does not. Between sched, KVM, and userspace, I would
rank KVM a very distant third. Userspace controls when to do KVM_RUN, to which
cgroup(s) a vCPU task is assigned, the affinity of the task, etc. sched decides
when and where to run a vCPU task based on input from userspace.

Only in some edge cases that are largely unique to overcommitted CPUs does KVM
have any input on scheduling whatsoever. And even then, KVM's view is largely
limited to a single VM, e.g. teaching KVM to yield to a vCPU running in a different
VM would be interesting, to say the least.

> One example is the place where we handle boost/unboost. By the time
> you come out of kvm to userspace it would be too late.

Making scheduling decisions in userspace doesn't require KVM to exit to userspace.
It doesn't even need to require a VM-Exit to KVM. E.g. if the scheduler (whether
it's in kernel or userspace) is running on a different logical CPU(s), then there's
no need to trigger a VM-Exit because the scheduler can incorporate information
about a vCPU in real time, and interrupt the vCPU if and only if something else
needs to run on that associated CPU. From the sched_ext cover letter:

: Google has also experimented with some promising, novel scheduling policies.
: One example is “central” scheduling, wherein a single CPU makes all
: scheduling decisions for the entire system. This allows most cores on the
: system to be fully dedicated to running workloads, and can have significant
: performance improvements for certain use cases. For example, central
: scheduling with VCPUs can avoid expensive vmexits and cache flushes, by
: instead delegating the responsibility of preemption checks from the tick to
: a single CPU. See scx_central.bpf.c for a simple example of a central
: scheduling policy built in sched_ext.

> Currently we apply the boost soon after VMEXIT before enabling preemption so
> that the next scheduler entry will consider the boosted priority. As soon as
> you enable preemption, the vcpu could be preempted and boosting would not
> help when it is boosted. This timing correctness is very difficult to achieve
> if we try to do it in userland or do it out-of-band.

Hooking VM-Exit isn't necessarily the fastest and/or best time to make scheduling
decisions about vCPUs. Presumably the whole point of this is to allow running
high priority, latency senstive workloads in the guest. As above, the ideal scenario
is that a vCPU running a high priority workload would never exit in the first place.

Is it easy to get there? No. But it's definitely possible.

> [...snip...]
> > > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > > tailor made for the problems you are trying to solve.
> > > >
> > > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > > >
> > > You are right, sched_ext is a good choice to have policies
> > > implemented. In our case, we would need a communication mechanism as
> > > well and hence we thought kvm would work best to be a medium between
> > > the guest and the host.
> >
> > Making KVM be the medium may be convenient and the quickest way to get a PoC
> > out the door, but effectively making KVM a middle-man is going to be a huge net
> > negative in the long term. Userspace can communicate with the guest just as
> > easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> > define a relatively rigid guest/host ABI.
> >
> > If instead the contract is between host userspace and the guest, the ABI can be
> > much more fluid, e.g. if you (or any setup) can control at least some amount of
> > code that runs in the guest, then the contract between the guest and host doesn't
> > even need to be formally defined, it could simply be a matter of bundling host
> > and guest code appropriately.
> >
> > If you want to land support for a given contract in upstream repositories, e.g.
> > to broadly enable paravirt scheduling support across a variety of usersepace VMMs
> > and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> > reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> > it's largely just a matter of writing code, and LKML provides a centralized channel
> > for getting buyin from all parties. But defining an ABI that's independent of the
> > kernel is absolutely doable, e.g. see the many virtio specs.
> >
> > I'm not saying KVM can't help, e.g. if there is information that is known only
> > to KVM, but the vast majority of the contract doesn't need to be defined by KVM.
> >
> As you mentioned, custom contract between guest and host userspace is
> really flexible, but I believe tackling scheduling(especially latency)
> issues is a bit more difficult with generic approaches. Here kvm does
> have some information known only to kvm(which could be shared - eg:
> interrupt injection) but more importantly kvm has some unique
> capabilities when it comes to scheduling. kvm and scheduler are
> cooperating currently for various cases like, steal time accounting,
> vcpu preemption state, spinlock handling etc. We could possibly try to
> extend it a little further in a non-intrusive way.

I'm not too worried about the code being intrusive, I'm worried about the
maintainability, longevity, and applicability of this approach.

IMO, this has a significantly lower ceiling than what is possible with something
like sched_ext, e.g. it requires a host tick to make scheduling decisions, and
because it'd require a kernel-defined ABI, would essentially be limited to knobs
that are broadly useful. I.e. every bit of information that you want to add to
the guest/host ABI will need to get approval from at least the affected subsystems
in the guest, from KVM, and possibly from the host scheduler too. That's going
to make for a very high bar.

> Having a formal paravirt scheduling ABI is something we would want to
> pursue (as I mentioned in the cover letter) and this could help not
> only with latencies, but optimal task placement for efficiency, power
> utilization etc. kvm's role could be to set the stage and share
> information with minimum delay and less resource overhead.

Making KVM middle-man is most definitely not going to provide minimum delay or
overhead. Minimum delay would be the guest directly communicating with the host
scheduler. I get that convincing the sched folks to add a bunch of paravirt
stuff is a tall order (for very good reason), but that's exactly why I Cc'd the
sched_ext folks.

> We could use schedulers (vanilla, sched_ext, ...) to actually make decisions
> based on the information it receives.

2023-12-15 14:34:59

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

> > >
> > I get your point. A generic way would have been more preferable, but I
> > feel the scenario we are tackling is a bit more time critical and kvm
> > is better equipped to handle this. kvm has control over the VM/vcpu
> > execution and hence it can take action in the most effective way.
>
> No, KVM most definitely does not. Between sched, KVM, and userspace, I would
> rank KVM a very distant third. Userspace controls when to do KVM_RUN, to which
> cgroup(s) a vCPU task is assigned, the affinity of the task, etc. sched decides
> when and where to run a vCPU task based on input from userspace.
>
> Only in some edge cases that are largely unique to overcommitted CPUs does KVM
> have any input on scheduling whatsoever. And even then, KVM's view is largely
> limited to a single VM, e.g. teaching KVM to yield to a vCPU running in a different
> VM would be interesting, to say the least.
>
Over committed case is exactly what we are trying to tackle. Sorry for
not making this clear in the cover letter. ChromeOS runs on low-end
devices (eg: 2C/2T cpus) and does not have enough compute capacity to
offload scheduling decisions. In-band scheduling decisions gave the
best results.

> > One example is the place where we handle boost/unboost. By the time
> > you come out of kvm to userspace it would be too late.
>
> Making scheduling decisions in userspace doesn't require KVM to exit to userspace.
> It doesn't even need to require a VM-Exit to KVM. E.g. if the scheduler (whether
> it's in kernel or userspace) is running on a different logical CPU(s), then there's
> no need to trigger a VM-Exit because the scheduler can incorporate information
> about a vCPU in real time, and interrupt the vCPU if and only if something else
> needs to run on that associated CPU. From the sched_ext cover letter:
>
> : Google has also experimented with some promising, novel scheduling policies.
> : One example is “central” scheduling, wherein a single CPU makes all
> : scheduling decisions for the entire system. This allows most cores on the
> : system to be fully dedicated to running workloads, and can have significant
> : performance improvements for certain use cases. For example, central
> : scheduling with VCPUs can avoid expensive vmexits and cache flushes, by
> : instead delegating the responsibility of preemption checks from the tick to
> : a single CPU. See scx_central.bpf.c for a simple example of a central
> : scheduling policy built in sched_ext.
>
This makes sense when the host has enough compute resources for
offloading scheduling decisions. In an over committed system, the
scheduler running out-of-band would need to get cpu time to make
decisions and starvation of scheduler may make the situation worse. We
could probably tune the priorities of the scheduler to have least
latencies, but in our experience this was not scaling due to the
nature of cpu interruptions happening in a consumer devices..

> > Currently we apply the boost soon after VMEXIT before enabling preemption so
> > that the next scheduler entry will consider the boosted priority. As soon as
> > you enable preemption, the vcpu could be preempted and boosting would not
> > help when it is boosted. This timing correctness is very difficult to achieve
> > if we try to do it in userland or do it out-of-band.
>
> Hooking VM-Exit isn't necessarily the fastest and/or best time to make scheduling
> decisions about vCPUs. Presumably the whole point of this is to allow running
> high priority, latency senstive workloads in the guest. As above, the ideal scenario
> is that a vCPU running a high priority workload would never exit in the first place.
>
> Is it easy to get there? No. But it's definitely possible.
>
Low end devices do not have the luxury of dedicating physical cpus to
vcpus and having an out-of-band scheduler also adds to the load of the
system. In this RFC, a boost request doesn't induce an immeidate
VMEXIT, but just sets a shared memory flag and continues to run. On
the very next VMEXIT, kvm checks the shared memory and passes it to
scheduler. This technique allows for avoiding extra VMEXITs for
boosting, but still uses the fast in-band scheduling mechanism to
achieve the desired results.

> > [...snip...]
> > > > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > > > tailor made for the problems you are trying to solve.
> > > > >
> > > > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > > > >
> > > > You are right, sched_ext is a good choice to have policies
> > > > implemented. In our case, we would need a communication mechanism as
> > > > well and hence we thought kvm would work best to be a medium between
> > > > the guest and the host.
> > >
> > > Making KVM be the medium may be convenient and the quickest way to get a PoC
> > > out the door, but effectively making KVM a middle-man is going to be a huge net
> > > negative in the long term. Userspace can communicate with the guest just as
> > > easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> > > define a relatively rigid guest/host ABI.
> > >
> > > If instead the contract is between host userspace and the guest, the ABI can be
> > > much more fluid, e.g. if you (or any setup) can control at least some amount of
> > > code that runs in the guest, then the contract between the guest and host doesn't
> > > even need to be formally defined, it could simply be a matter of bundling host
> > > and guest code appropriately.
> > >
> > > If you want to land support for a given contract in upstream repositories, e.g.
> > > to broadly enable paravirt scheduling support across a variety of usersepace VMMs
> > > and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> > > reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> > > it's largely just a matter of writing code, and LKML provides a centralized channel
> > > for getting buyin from all parties. But defining an ABI that's independent of the
> > > kernel is absolutely doable, e.g. see the many virtio specs.
> > >
> > > I'm not saying KVM can't help, e.g. if there is information that is known only
> > > to KVM, but the vast majority of the contract doesn't need to be defined by KVM.
> > >
> > As you mentioned, custom contract between guest and host userspace is
> > really flexible, but I believe tackling scheduling(especially latency)
> > issues is a bit more difficult with generic approaches. Here kvm does
> > have some information known only to kvm(which could be shared - eg:
> > interrupt injection) but more importantly kvm has some unique
> > capabilities when it comes to scheduling. kvm and scheduler are
> > cooperating currently for various cases like, steal time accounting,
> > vcpu preemption state, spinlock handling etc. We could possibly try to
> > extend it a little further in a non-intrusive way.
>
> I'm not too worried about the code being intrusive, I'm worried about the
> maintainability, longevity, and applicability of this approach.
>
> IMO, this has a significantly lower ceiling than what is possible with something
> like sched_ext, e.g. it requires a host tick to make scheduling decisions, and
> because it'd require a kernel-defined ABI, would essentially be limited to knobs
> that are broadly useful. I.e. every bit of information that you want to add to
> the guest/host ABI will need to get approval from at least the affected subsystems
> in the guest, from KVM, and possibly from the host scheduler too. That's going
> to make for a very high bar.
>
Just thinking out loud, The ABI could be very simple to start with. A
shared page with dedicated guest and host areas. Guest fills details
about its priority requirements, host fills details about the actions
it took(boost/unboost, priority/sched class etc). Passing this
information could be in-band or out-of-band. out-of-band could be used
by dedicated userland schedulers. If both guest and host agrees on
in-band during guest startup, kvm could hand over the data to
scheduler using a scheduler callback. I feel this small addition to
kvm could be maintainable and by leaving the protocol for interpreting
shared memory to guest and host, this would be very generic and cater
to multiple use cases. Something like above could be used both by
low-end devices and high-end server like systems and guest and host
could have custom protocols to interpret the data and make decisions.

In this RFC, we have a miniature form of the above, where we have a
shared memory area and the scheduler callback is basically
sched_setscheduler. But it could be made very generic as part of ABI
design. For out-of-band schedulers, this call back could be setup by
sched_ext, a userland scheduler and any similar out-of-band scheduler.

I agree, getting a consensus and approval is non-trivial. IMHO, this
use case is compelling for such an ABI because out-of-band schedulers
might not give the desired results for low-end devices.

> > Having a formal paravirt scheduling ABI is something we would want to
> > pursue (as I mentioned in the cover letter) and this could help not
> > only with latencies, but optimal task placement for efficiency, power
> > utilization etc. kvm's role could be to set the stage and share
> > information with minimum delay and less resource overhead.
>
> Making KVM middle-man is most definitely not going to provide minimum delay or
> overhead. Minimum delay would be the guest directly communicating with the host
> scheduler. I get that convincing the sched folks to add a bunch of paravirt
> stuff is a tall order (for very good reason), but that's exactly why I Cc'd the
> sched_ext folks.
>
As mentioned above, guest directly talking to host scheduler without
involving kvm would mean an out-of-band scheduler and the
effectiveness depends on how fast the scheduler gets to run. In lowend
compute devices, that would pose a challenge. In such scenarios, kvm
seems to be a better option to provide minimum delay and cpu overhead.

Sorry for not being clear in the cover letter, the goal is to have a
minimal latency and overhead framework that would work for low-end
devices as well where we have constrained cpu capacity. A design with
focus on the constraints of systems with not enough compute capacity
to spare, but caters to generic use cases as well is what we are
striving for. This would be useful for cloud providers whose offerings
are mostly over-committed VMs and we have seen interest from such
crowd.

Thanks,
Vineeth

2023-12-15 15:20:42

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

Hi Sean,
Nice to see your quick response to the RFC, thanks. I wanted to
clarify some points below:

On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> > Now when I think about it, the implementation seems to
> > suggest that we are putting policies in kvm. Ideally, the goal is:
> > - guest scheduler communicates the priority requirements of the workload
> > - kvm applies the priority to the vcpu task.
>
> Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> is that userspace doesn't have the right knobs to adjust the priority of a task
> quickly and efficiently, then wouldn't it be better to solve that problem in a
> generic way?

No, it is not only about tasks. We are boosting anything RT or above
such as softirq, irq etc as well. Could you please see the other
patches? Also, Vineeth please make this clear in the next revision.

> > > Pushing the scheduling policies to host userspace would allow for far more control
> > > and flexibility. E.g. a heavily paravirtualized environment where host userspace
> > > knows *exactly* what workloads are being run could have wildly different policies
> > > than an environment where the guest is a fairly vanilla Linux VM that has received
> > > a small amount of enlightment.
> > >
> > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > tailor made for the problems you are trying to solve.
> > >
> > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > >
> > You are right, sched_ext is a good choice to have policies
> > implemented. In our case, we would need a communication mechanism as
> > well and hence we thought kvm would work best to be a medium between
> > the guest and the host.
>
> Making KVM be the medium may be convenient and the quickest way to get a PoC
> out the door, but effectively making KVM a middle-man is going to be a huge net
> negative in the long term. Userspace can communicate with the guest just as
> easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> define a relatively rigid guest/host ABI.

At the moment, the only ABI is a shared memory structure and a custom
MSR. This is no different from the existing steal time accounting
where a shared structure is similarly shared between host and guest,
we could perhaps augment that structure with other fields instead of
adding a new one? On the ABI point, we have deliberately tried to keep
it simple (for example, a few months ago we had hypercalls and we went
to great lengths to eliminate those).

> If instead the contract is between host userspace and the guest, the ABI can be
> much more fluid, e.g. if you (or any setup) can control at least some amount of
> code that runs in the guest

I see your point of view. One way to achieve this is to have a BPF
program run to implement the boosting part, in the VMEXIT path. KVM
then just calls a hook. Would that alleviate some of your concerns?

> then the contract between the guest and host doesn't
> even need to be formally defined, it could simply be a matter of bundling host
> and guest code appropriately.
>
> If you want to land support for a given contract in upstream repositories, e.g.
> to broadly enable paravirt scheduling support across a variety of usersepace VMMs
> and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> it's largely just a matter of writing code, and LKML provides a centralized channel
> for getting buyin from all parties. But defining an ABI that's independent of the
> kernel is absolutely doable, e.g. see the many virtio specs.
>
> I'm not saying KVM can't help, e.g. if there is information that is known only
> to KVM, but the vast majority of the contract doesn't need to be defined by KVM.

The key to making this working of the patch is VMEXIT path, that is
only available to KVM. If we do anything later, then it might be too
late. We have to intervene *before* the scheduler takes the vCPU
thread off the CPU. Similarly, in the case of an interrupt injected
into the guest, we have to boost the vCPU before the "vCPU run" stage
-- anything later might be too late.

Also you mentioned something about the tick path in the other email,
we have no control over the host tick preempting the vCPU thread. The
guest *will VMEXIT* on the host tick. On ChromeOS, we run multiple VMs
and overcommitting is very common especially on devices with smaller
number of CPUs.

Just to clarify, this isn't a "quick POC". We have been working on
this for many months and it was hard to get working correctly and
handle all corner cases. We are finally at a point where - it just
works (TM) and is roughly half the code size of when we initially
started.

thanks,

- Joel

2023-12-15 16:38:34

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Fri, Dec 15, 2023, Joel Fernandes wrote:
> Hi Sean,
> Nice to see your quick response to the RFC, thanks. I wanted to
> clarify some points below:
>
> On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> > > Now when I think about it, the implementation seems to
> > > suggest that we are putting policies in kvm. Ideally, the goal is:
> > > - guest scheduler communicates the priority requirements of the workload
> > > - kvm applies the priority to the vcpu task.
> >
> > Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> > is that userspace doesn't have the right knobs to adjust the priority of a task
> > quickly and efficiently, then wouldn't it be better to solve that problem in a
> > generic way?
>
> No, it is not only about tasks. We are boosting anything RT or above
> such as softirq, irq etc as well.

I was talking about the host side of things. A vCPU is a task, full stop. KVM
*may* have some information that is useful to the scheduler, but KVM does not
*need* to initiate adjustments to a vCPU's priority.

> Could you please see the other patches?

I already have, see my comments about boosting vCPUs that have received NMIs and
IRQs not necessarily being desirable.

> Also, Vineeth please make this clear in the next revision.
>
> > > > Pushing the scheduling policies to host userspace would allow for far more control
> > > > and flexibility. E.g. a heavily paravirtualized environment where host userspace
> > > > knows *exactly* what workloads are being run could have wildly different policies
> > > > than an environment where the guest is a fairly vanilla Linux VM that has received
> > > > a small amount of enlightment.
> > > >
> > > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > > tailor made for the problems you are trying to solve.
> > > >
> > > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > > >
> > > You are right, sched_ext is a good choice to have policies
> > > implemented. In our case, we would need a communication mechanism as
> > > well and hence we thought kvm would work best to be a medium between
> > > the guest and the host.
> >
> > Making KVM be the medium may be convenient and the quickest way to get a PoC
> > out the door, but effectively making KVM a middle-man is going to be a huge net
> > negative in the long term. Userspace can communicate with the guest just as
> > easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> > define a relatively rigid guest/host ABI.
>
> At the moment, the only ABI is a shared memory structure and a custom
> MSR. This is no different from the existing steal time accounting
> where a shared structure is similarly shared between host and guest,
> we could perhaps augment that structure with other fields instead of
> adding a new one?

I'm not concerned about the number of structures/fields, it's the amount/type of
information and the behavior of KVM that is problematic. E.g. boosting the priority
of a vCPU that has a pending NMI is dubious. This RFC has baked in a large
number of assumptions that (mostly) fit your specific use case, but do not
necessarily apply to all use cases. I'm not even remotely convinced that what's
prosed here is optimal for your use case either.

> On the ABI point, we have deliberately tried to keep it simple (for example,
> a few months ago we had hypercalls and we went to great lengths to eliminate
> those).

Which illustrates one of the points I'm trying to make is kind of my point.
Upstream will never accept anything that's wildly complex or specific because such
a thing is unlikely to be maintainable. And so we'll end up with functionality
in KVM that is moderately helpful, but doesn't fully solve things and doesn't have
legs to go anywhere precisely because the specificity and/or complexity required
to eke out maximum performance will never be accepted.

> > If instead the contract is between host userspace and the guest, the ABI can be
> > much more fluid, e.g. if you (or any setup) can control at least some amount of
> > code that runs in the guest
>
> I see your point of view. One way to achieve this is to have a BPF
> program run to implement the boosting part, in the VMEXIT path. KVM
> then just calls a hook. Would that alleviate some of your concerns?

Yes, it absolutely would! I would *love* to build a rich set of BPF utilities
and whatnot for KVM[1]. I have zero objections to KVM making data available to
BPF programs, i.e. to host userspace, quite the opposite. What I am steadfastedly
against is KVM make decisions that are not obviously the "right" decisions in all
situations. And I do not want to end up in a world where KVM has a big pile of
knobs to let userspace control those decisions points, i.e. I don't want to make
KVM a de facto paravirt scheduler.

I think there is far more potential for this direction. KVM already has hooks
for VM-Exit and VM-Entry, they likely just need to be enhanced to make them more
useful for BPF programs. And adding hooks in other paths shouldn't be too
contentious, e.g. in addition to what you've done in this RFC, adding a hook to
kvm_vcpu_on_spin() could be quite interesting as I would not be at all surprised
if userspace could make far better decisions when selecting the vCPU to yield to.

And there are other use cases for KVM making "interesting" data available to
userspace, e.g. there's (very early) work[2] to allow userspace to poll() on vCPUs,
which likely needs much of the same information that paravirt scheduling would
find useful, e.g. whether or not the vCPU has pending events.

[1] https://lore.kernel.org/all/ZRIf1OPjKV66Y17%[email protected]
[2] https://lore.kernel.org/all/[email protected]

> > then the contract between the guest and host doesn't
> > even need to be formally defined, it could simply be a matter of bundling host
> > and guest code appropriately.
> >
> > If you want to land support for a given contract in upstream repositories, e.g.
> > to broadly enable paravirt scheduling support across a variety of usersepace VMMs
> > and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> > reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> > it's largely just a matter of writing code, and LKML provides a centralized channel
> > for getting buyin from all parties. But defining an ABI that's independent of the
> > kernel is absolutely doable, e.g. see the many virtio specs.
> >
> > I'm not saying KVM can't help, e.g. if there is information that is known only
> > to KVM, but the vast majority of the contract doesn't need to be defined by KVM.
>
> The key to making this working of the patch is VMEXIT path, that is
> only available to KVM. If we do anything later, then it might be too
> late.

Strictly speaking, no, it's not. It's key if and only if *KVM* boosts the priority
of the task/vCPU (or if KVM provides a hook for a BPF program to do its thing).

> We have to intervene *before* the scheduler takes the vCPU thread off the
> CPU.

If the host scheduler is directly involved in the paravirt shenanigans, then
there is no need to hook KVM's VM-Exit path because the scheduler already has the
information needed to make an informed decision.

> Similarly, in the case of an interrupt injected into the guest, we have
> to boost the vCPU before the "vCPU run" stage -- anything later might be too
> late.

Except that this RFC doesn't actually do this. KVM's relevant function names suck
and aren't helping you, but these patches boost vCPUs when events are *pended*,
not when they are actually injected.

Now, maybe boosting vCPUs with pending events is actually desirable, but that is
precisely the type of policy making that I do not want to have in KVM. Take the
existing kvm_vcpu_on_spin() path for example. It's a relatively simple idea, and
has no major downsides since KVM has very high confidence that the current vCPU
is spinning, i.e. waiting on something and thus doing nothing useful.

But even that path has a non-trivial amount of subtle, delicate logic to improve
the probability that KVM yields to a vCPU that can actually make forward progress.

Boosting the priority of vCPUs at semi-arbitrary points is going to be much more
difficult for KVM to "get right". E.g. why boost the priority of a vCPU that has
a pending IRQ, but not a vCPU that is running with IRQs disabled? The potential
for endless twiddling to try and tune KVM's de facto scheduling logic so that it's
universally "correct" is what terrifies me.

> Also you mentioned something about the tick path in the other email,
> we have no control over the host tick preempting the vCPU thread. The
> guest *will VMEXIT* on the host tick. On ChromeOS, we run multiple VMs
> and overcommitting is very common especially on devices with smaller
> number of CPUs.
>
> Just to clarify, this isn't a "quick POC". We have been working on
> this for many months and it was hard to get working correctly and
> handle all corner cases. We are finally at a point where - it just
> works (TM) and is roughly half the code size of when we initially
> started.

2023-12-15 16:57:14

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Fri, Dec 15, 2023, Vineeth Remanan Pillai wrote:
> > > >
> > > I get your point. A generic way would have been more preferable, but I
> > > feel the scenario we are tackling is a bit more time critical and kvm
> > > is better equipped to handle this. kvm has control over the VM/vcpu
> > > execution and hence it can take action in the most effective way.
> >
> > No, KVM most definitely does not. Between sched, KVM, and userspace, I would
> > rank KVM a very distant third. Userspace controls when to do KVM_RUN, to which
> > cgroup(s) a vCPU task is assigned, the affinity of the task, etc. sched decides
> > when and where to run a vCPU task based on input from userspace.
> >
> > Only in some edge cases that are largely unique to overcommitted CPUs does KVM
> > have any input on scheduling whatsoever. And even then, KVM's view is largely
> > limited to a single VM, e.g. teaching KVM to yield to a vCPU running in a different
> > VM would be interesting, to say the least.
> >
> Over committed case is exactly what we are trying to tackle.

Yes, I know. I was objecting to the assertion that "kvm has control over the
VM/vcpu execution and hence it can take action in the most effective way". In
overcommit use cases, KVM has some *influence*, and in non-overcommit use cases,
KVM is essentially not in the picture at all.

> Sorry for not making this clear in the cover letter. ChromeOS runs on low-end
> devices (eg: 2C/2T cpus) and does not have enough compute capacity to
> offload scheduling decisions. In-band scheduling decisions gave the
> best results.
>
> > > One example is the place where we handle boost/unboost. By the time
> > > you come out of kvm to userspace it would be too late.
> >
> > Making scheduling decisions in userspace doesn't require KVM to exit to userspace.
> > It doesn't even need to require a VM-Exit to KVM. E.g. if the scheduler (whether
> > it's in kernel or userspace) is running on a different logical CPU(s), then there's
> > no need to trigger a VM-Exit because the scheduler can incorporate information
> > about a vCPU in real time, and interrupt the vCPU if and only if something else
> > needs to run on that associated CPU. From the sched_ext cover letter:
> >
> > : Google has also experimented with some promising, novel scheduling policies.
> > : One example is “central” scheduling, wherein a single CPU makes all
> > : scheduling decisions for the entire system. This allows most cores on the
> > : system to be fully dedicated to running workloads, and can have significant
> > : performance improvements for certain use cases. For example, central
> > : scheduling with VCPUs can avoid expensive vmexits and cache flushes, by
> > : instead delegating the responsibility of preemption checks from the tick to
> > : a single CPU. See scx_central.bpf.c for a simple example of a central
> > : scheduling policy built in sched_ext.
> >
> This makes sense when the host has enough compute resources for
> offloading scheduling decisions.

Yeah, again, I know. The point I am trying to get across is that this RFC only
benefits/handles one use case, and doesn't have line of sight to being extensible
to other use cases.

> > > As you mentioned, custom contract between guest and host userspace is
> > > really flexible, but I believe tackling scheduling(especially latency)
> > > issues is a bit more difficult with generic approaches. Here kvm does
> > > have some information known only to kvm(which could be shared - eg:
> > > interrupt injection) but more importantly kvm has some unique
> > > capabilities when it comes to scheduling. kvm and scheduler are
> > > cooperating currently for various cases like, steal time accounting,
> > > vcpu preemption state, spinlock handling etc. We could possibly try to
> > > extend it a little further in a non-intrusive way.
> >
> > I'm not too worried about the code being intrusive, I'm worried about the
> > maintainability, longevity, and applicability of this approach.
> >
> > IMO, this has a significantly lower ceiling than what is possible with something
> > like sched_ext, e.g. it requires a host tick to make scheduling decisions, and
> > because it'd require a kernel-defined ABI, would essentially be limited to knobs
> > that are broadly useful. I.e. every bit of information that you want to add to
> > the guest/host ABI will need to get approval from at least the affected subsystems
> > in the guest, from KVM, and possibly from the host scheduler too. That's going
> > to make for a very high bar.
> >
> Just thinking out loud, The ABI could be very simple to start with. A
> shared page with dedicated guest and host areas. Guest fills details
> about its priority requirements, host fills details about the actions
> it took(boost/unboost, priority/sched class etc). Passing this
> information could be in-band or out-of-band. out-of-band could be used
> by dedicated userland schedulers. If both guest and host agrees on
> in-band during guest startup, kvm could hand over the data to
> scheduler using a scheduler callback. I feel this small addition to
> kvm could be maintainable and by leaving the protocol for interpreting
> shared memory to guest and host, this would be very generic and cater
> to multiple use cases. Something like above could be used both by
> low-end devices and high-end server like systems and guest and host
> could have custom protocols to interpret the data and make decisions.
>
> In this RFC, we have a miniature form of the above, where we have a
> shared memory area and the scheduler callback is basically
> sched_setscheduler. But it could be made very generic as part of ABI
> design. For out-of-band schedulers, this call back could be setup by
> sched_ext, a userland scheduler and any similar out-of-band scheduler.
>
> I agree, getting a consensus and approval is non-trivial. IMHO, this
> use case is compelling for such an ABI because out-of-band schedulers
> might not give the desired results for low-end devices.
>
> > > Having a formal paravirt scheduling ABI is something we would want to
> > > pursue (as I mentioned in the cover letter) and this could help not
> > > only with latencies, but optimal task placement for efficiency, power
> > > utilization etc. kvm's role could be to set the stage and share
> > > information with minimum delay and less resource overhead.
> >
> > Making KVM middle-man is most definitely not going to provide minimum delay or
> > overhead. Minimum delay would be the guest directly communicating with the host
> > scheduler. I get that convincing the sched folks to add a bunch of paravirt
> > stuff is a tall order (for very good reason), but that's exactly why I Cc'd the
> > sched_ext folks.
> >
> As mentioned above, guest directly talking to host scheduler without
> involving kvm would mean an out-of-band scheduler and the
> effectiveness depends on how fast the scheduler gets to run.

No, the "host scheduler" could very well be a dedicated in-kernel paravirt
scheduler. It could be a sched_ext BPF program that for all intents and purposes
is in-band.

You are basically proposing that KVM bounce-buffer data between guest and host.
I'm saying there's no _technical_ reason to use a bounce-buffer, just do zero copy.

> In lowend compute devices, that would pose a challenge. In such scenarios, kvm
> seems to be a better option to provide minimum delay and cpu overhead.

2023-12-15 17:30:53

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH 8/8] irq: boost/unboost in irq/nmi entry/exit and softirq

On Wed, Dec 13 2023 at 21:47, Vineeth Pillai (Google) wrote:
> The host proactively boosts the VCPU threads during irq/nmi injection.
> However, the host is unaware of posted interrupts, and therefore, the
> guest should request a boost if it has not already been boosted.
>
> Similarly, guest should request an unboost on irq/nmi/softirq exit if
> the vcpu doesn't need the boost any more.

That's giving a hint but no context for someone who is not familiar with
the problem which is tried to be solved here.

> @@ -327,6 +327,13 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
> .exit_rcu = false,
> };
>
> +#ifdef CONFIG_PARAVIRT_SCHED
> + instrumentation_begin();

Slapping instrumentation_begin() at it silences the objtool checker, but
that does not make it correct in any way.

You _cannot_ call random code _before_ the kernel has established
context. It's clearly documented:

https://www.kernel.org/doc/html/latest/core-api/entry.html

No?

> + if (pv_sched_enabled())
> + pv_sched_boost_vcpu_lazy();
> + instrumentation_end();
> +#endif
> +
> if (user_mode(regs)) {
> irqentry_enter_from_user_mode(regs);
> return ret;
> @@ -452,6 +459,18 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> if (state.exit_rcu)
> ct_irq_exit();
> }
> +
> +#ifdef CONFIG_PARAVIRT_SCHED
> + instrumentation_begin();

Broken too

> + /*
> + * On irq exit, request a deboost from hypervisor if no softirq pending
> + * and current task is not RT and !need_resched.
> + */
> + if (pv_sched_enabled() && !local_softirq_pending() &&
> + !need_resched() && !task_is_realtime(current))
> + pv_sched_unboost_vcpu();
> + instrumentation_end();
> +#endif
> }
>
> irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
> @@ -469,6 +488,11 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
> kmsan_unpoison_entry_regs(regs);
> trace_hardirqs_off_finish();
> ftrace_nmi_enter();
> +
> +#ifdef CONFIG_PARAVIRT_SCHED
> + if (pv_sched_enabled())
> + pv_sched_boost_vcpu_lazy();
> +#endif
> instrumentation_end();
>
> return irq_state;
> @@ -482,6 +506,12 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
> trace_hardirqs_on_prepare();
> lockdep_hardirqs_on_prepare();
> }
> +
> +#ifdef CONFIG_PARAVIRT_SCHED
> + if (pv_sched_enabled() && !in_hardirq() && !local_softirq_pending() &&
> + !need_resched() && !task_is_realtime(current))
> + pv_sched_unboost_vcpu();
> +#endif

Symmetry is overrated. Just pick a spot and slap your hackery in.

Aside of that this whole #ifdeffery is tasteless at best.

> instrumentation_end();

> +#ifdef CONFIG_PARAVIRT_SCHED
> + if (pv_sched_enabled())
> + pv_sched_boost_vcpu_lazy();
> +#endif

But what's worse is that this is just another approach of sprinkling
random hints all over the place and see what sticks.

Abusing KVM as the man in the middle to communicate between guest and
host scheduler is creative, but ill defined.

From the host scheduler POV the vCPU is just a user space thread and
making the guest special is just the wrong approach.

The kernel has no proper general interface to convey information from a
user space task to the scheduler.

So instead of adding some ad-hoc KVM hackery the right thing is to solve
this problem from ground up in a generic way and KVM can just utilize
that instead of having the special snow-flake hackery which is just a
maintainability nightmare.

Thanks,

tglx

2023-12-15 17:40:58

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

[...snip...]
> > > IMO, this has a significantly lower ceiling than what is possible with something
> > > like sched_ext, e.g. it requires a host tick to make scheduling decisions, and
> > > because it'd require a kernel-defined ABI, would essentially be limited to knobs
> > > that are broadly useful. I.e. every bit of information that you want to add to
> > > the guest/host ABI will need to get approval from at least the affected subsystems
> > > in the guest, from KVM, and possibly from the host scheduler too. That's going
> > > to make for a very high bar.
> > >
> > Just thinking out loud, The ABI could be very simple to start with. A
> > shared page with dedicated guest and host areas. Guest fills details
> > about its priority requirements, host fills details about the actions
> > it took(boost/unboost, priority/sched class etc). Passing this
> > information could be in-band or out-of-band. out-of-band could be used
> > by dedicated userland schedulers. If both guest and host agrees on
> > in-band during guest startup, kvm could hand over the data to
> > scheduler using a scheduler callback. I feel this small addition to
> > kvm could be maintainable and by leaving the protocol for interpreting
> > shared memory to guest and host, this would be very generic and cater
> > to multiple use cases. Something like above could be used both by
> > low-end devices and high-end server like systems and guest and host
> > could have custom protocols to interpret the data and make decisions.
> >
> > In this RFC, we have a miniature form of the above, where we have a
> > shared memory area and the scheduler callback is basically
> > sched_setscheduler. But it could be made very generic as part of ABI
> > design. For out-of-band schedulers, this call back could be setup by
> > sched_ext, a userland scheduler and any similar out-of-band scheduler.
> >
> > I agree, getting a consensus and approval is non-trivial. IMHO, this
> > use case is compelling for such an ABI because out-of-band schedulers
> > might not give the desired results for low-end devices.
> >
> > > > Having a formal paravirt scheduling ABI is something we would want to
> > > > pursue (as I mentioned in the cover letter) and this could help not
> > > > only with latencies, but optimal task placement for efficiency, power
> > > > utilization etc. kvm's role could be to set the stage and share
> > > > information with minimum delay and less resource overhead.
> > >
> > > Making KVM middle-man is most definitely not going to provide minimum delay or
> > > overhead. Minimum delay would be the guest directly communicating with the host
> > > scheduler. I get that convincing the sched folks to add a bunch of paravirt
> > > stuff is a tall order (for very good reason), but that's exactly why I Cc'd the
> > > sched_ext folks.
> > >
> > As mentioned above, guest directly talking to host scheduler without
> > involving kvm would mean an out-of-band scheduler and the
> > effectiveness depends on how fast the scheduler gets to run.
>
> No, the "host scheduler" could very well be a dedicated in-kernel paravirt
> scheduler. It could be a sched_ext BPF program that for all intents and purposes
> is in-band.
>
Yes, if the scheduler is on the same physical cpu and acts on events
like VMEXIT/VMENTRY etc, this would work perfectly. Having the VM talk
to a scheduler running on another cpu and making decisions might not
be quick enough when we do not have enough cpu capacity.

> You are basically proposing that KVM bounce-buffer data between guest and host.
> I'm saying there's no _technical_ reason to use a bounce-buffer, just do zero copy.
>
I was also meaning zero copy only. The help required from the kvm side is:
- Pass the address of the shared memory to bpf programs/scheduler once
the guest sets it up.
- Invoke scheduler registered callbacks on events like VMEXIT,
VEMENTRY, interrupt injection etc. Its the job of guest and host
paravirt scheduler to interpret the shared memory contents and take
actions.

I admit current RFC doesn't strictly implement hooks and callbacks -
it calls sched_setscheduler in place of all callbacks that I mentioned
above. I guess this was your strongest objection.

As you mentioned in the reply to Joel, if it is fine for kvm to allow
hooks into events (VMEXIT, VMENTRY, interrupt injection etc) then, it
makes it easier to develop the ABI I was mentioning and have the hooks
implemented by a paravirt scheduler. We shall re-design the
architecture based on this for v2.

Thanks,
Vineeth

2023-12-15 17:55:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Fri, Dec 15, 2023, Vineeth Remanan Pillai wrote:
> > You are basically proposing that KVM bounce-buffer data between guest and host.
> > I'm saying there's no _technical_ reason to use a bounce-buffer, just do zero copy.
> >
> I was also meaning zero copy only. The help required from the kvm side is:
> - Pass the address of the shared memory to bpf programs/scheduler once
> the guest sets it up.
> - Invoke scheduler registered callbacks on events like VMEXIT,
> VEMENTRY, interrupt injection etc. Its the job of guest and host
> paravirt scheduler to interpret the shared memory contents and take
> actions.
>
> I admit current RFC doesn't strictly implement hooks and callbacks -
> it calls sched_setscheduler in place of all callbacks that I mentioned
> above. I guess this was your strongest objection.

Ya, more or less.

> As you mentioned in the reply to Joel, if it is fine for kvm to allow
> hooks into events (VMEXIT, VMENTRY, interrupt injection etc) then, it
> makes it easier to develop the ABI I was mentioning and have the hooks
> implemented by a paravirt scheduler. We shall re-design the
> architecture based on this for v2.

Instead of going straight to a full blown re-design, can you instead post slightly
more incremental RFCs? E.g. flesh out enough code to get a BPF program attached
and receiving information, but do NOT wait until you have fully working setup
before posting the next RFC.

There are essentially four-ish things to sort out:

1. Where to insert/modify hooks in KVM
2. How KVM exposes KVM-internal information through said hooks
3. How a BPF program can influence the host scheduler
4. The guest/host ABI

#1 and #2 are largely KVM-only, and I think/hope we can get a rough idea of how
to address them before moving onto #3 and #4 (assuming #3 isn't already a solved
problem).

2023-12-15 18:11:02

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Thu, Dec 14, 2023 at 08:38:47AM -0800, Sean Christopherson wrote:
> +sched_ext folks

Thanks for the cc.

>
> On Wed, Dec 13, 2023, Vineeth Pillai (Google) wrote:
> > Double scheduling is a concern with virtualization hosts where the host
> > schedules vcpus without knowing whats run by the vcpu and guest schedules
> > tasks without knowing where the vcpu is physically running. This causes
> > issues related to latencies, power consumption, resource utilization
> > etc. An ideal solution would be to have a cooperative scheduling
> > framework where the guest and host shares scheduling related information
> > and makes an educated scheduling decision to optimally handle the
> > workloads. As a first step, we are taking a stab at reducing latencies
> > for latency sensitive workloads in the guest.
> >
> > This series of patches aims to implement a framework for dynamically
> > managing the priority of vcpu threads based on the needs of the workload
> > running on the vcpu. Latency sensitive workloads (nmi, irq, softirq,
> > critcal sections, RT tasks etc) will get a boost from the host so as to
> > minimize the latency.
> >
> > The host can proactively boost the vcpu threads when it has enough
> > information about what is going to run on the vcpu - fo eg: injecting
> > interrupts. For rest of the case, guest can request boost if the vcpu is
> > not already boosted. The guest can subsequently request unboost after
> > the latency sensitive workloads completes. Guest can also request a
> > boost if needed.
> >
> > A shared memory region is used to communicate the scheduling information.
> > Guest shares its needs for priority boosting and host shares the boosting
> > status of the vcpu. Guest sets a flag when it needs a boost and continues
> > running. Host reads this on next VMEXIT and boosts the vcpu thread. For
> > unboosting, it is done synchronously so that host workloads can fairly
> > compete with guests when guest is not running any latency sensitive
> > workload.
>
> Big thumbs down on my end. Nothing in this RFC explains why this should be done
> in KVM. In general, I am very opposed to putting policy of any kind into KVM,
> and this puts a _lot_ of unmaintainable policy into KVM by deciding when to
> start/stop boosting a vCPU.

I have to agree, not least of which is because in addition to imposing a
severe maintenance tax, these policies are far from exhaustive in terms
of what you may want to do for cooperative paravirt scheduling. I think
something like sched_ext would give you the best of all worlds: no
maintenance burden on the KVM maintainers, more options for implementing
various types of policies, performant, safe to run on the host, no need
to reboot when trying a new policy, etc. More on this below.

> Concretely, boosting vCPUs for most events is far too coarse grained. E.g. boosting
> a vCPU that is running a low priority workload just because the vCPU triggered
> an NMI due to PMU counter overflow doesn't make sense. Ditto for if a guest's
> hrtimer expires on a vCPU running a low priority workload.
>
> And as evidenced by patch 8/8, boosting vCPUs based on when an event is _pending_
> is not maintainable. As hardware virtualizes more and more functionality, KVM's
> visilibity into the guest effectively decreases, e.g. Intel and AMD both support
> with IPI virtualization.
>
> Boosting the target of a PV spinlock kick is similarly flawed. In that case, KVM
> only gets involved _after_ there is a problem, i.e. after a lock is contended so
> heavily that a vCPU stops spinning and instead decided to HLT. It's not hard to
> imagine scenarios where a guest would want to communicate to the host that it's
> acquiring a spinlock for a latency sensitive path and so shouldn't be scheduled
> out. And of course that's predicated on the assumption that all vCPUs are subject
> to CPU overcommit.
>
> Initiating a boost from the host is also flawed in the sense that it relies on
> the guest to be on the same page as to when it should stop boosting. E.g. if
> KVM boosts a vCPU because an IRQ is pending, but the guest doesn't want to boost
> IRQs on that vCPU and thus doesn't stop boosting at the end of the IRQ handler,
> then the vCPU could end up being boosted long after its done with the IRQ.
>
> Throw nested virtualization into the mix and then all of this becomes nigh
> impossible to sort out in KVM. E.g. if an L1 vCPU is a running an L2 vCPU, i.e.
> a nested guest, and L2 is spamming interrupts for whatever reason, KVM will end
> repeatedly boosting the L1 vCPU regardless of the priority of the L2 workload.
>
> For things that aren't clearly in KVM's domain, I don't think we should implement
> KVM-specific functionality until every other option has been tried (and failed).
> I don't see any reason why KVM needs to get involved in scheduling, beyond maybe
> providing *input* regarding event injection, emphasis on *input* because KVM
> providing information to userspace or some other entity is wildly different than
> KVM making scheduling decisions based on that information.
>
> Pushing the scheduling policies to host userspace would allow for far more control
> and flexibility. E.g. a heavily paravirtualized environment where host userspace
> knows *exactly* what workloads are being run could have wildly different policies
> than an environment where the guest is a fairly vanilla Linux VM that has received
> a small amount of enlightment.
>
> Lastly, if the concern/argument is that userspace doesn't have the right knobs
> to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> tailor made for the problems you are trying to solve.
>
> https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org

I very much agree. There are some features missing from BPF that we'd
need to add to enable this, but they're on the roadmap, and I don't
think they'd be especially difficult to implement.

The main building block that I was considering is a new kptr [0] and set
of kfuncs [1] that would allow a BPF program to have one or more R/W
shared memory regions with a guest. This could enable a wide swath of
BPF paravirt use cases that are not limited to scheduling, but in terms
of sched_ext, the BPF scheduler could communicate with the guest
scheduler over this shared memory region in whatever manner was required
for that use case.

[0]: https://lwn.net/Articles/900749/
[1]: https://lwn.net/Articles/856005/

For example, the guest could communicate scheduling intention such as:

- "Don't preempt me and/or boost me (because I'm holding a spinlock, in an
NMI region, running some low-latency task, etc)".
- "VCPU x prefers to be on a P core", and then later, "Now it prefers an
E core". Note that this doesn't require pinning or anything like that.
It's just the VCPU requesting some best-effort placement, and allowing
that policy to change dynamically depending on what the guest is
doing.
- "Try to give these VCPUs their own fully idle cores if possible, but
these other VCPUs should ideally be run as hypertwins as they're
expected to have good cache locality", etc.

In general, some of these policies might be silly and not work well,
others might work very well for some workloads / architectures and not
as well on others, etc. sched_ext would make it easy to try things out
and see what works well, without having to worry about rebooting or
crashing the host, and ultimately without having to implement and
maintain some scheduling policy directly in KVM. As Sean said, the host
knows exactly what workloads are being run and could have very targeted
and bespoke policies that wouldn't be appropriate for a vanilla Linux
VM.

Finally, while it's not necessarily related to paravirt scheduling
specifically, I think it's maybe worth mentioning that sched_ext would
have allowed us to implement a core-sched-like policy when L1TF first
hit us. It was inevitable that we'd need a core-sched policy build into
the core kernel as well, but we could have used sched_ext as a solution
until core sched was merged. Tejun implemented something similar in an
example scheduler where only tasks in the same cgroup are ever allowed
to run as hypertwins [3]. The point is, you can do basically anything
you want with sched_ext. For paravirt, I think there are a ton of
interesting possibilities, and I think those possibilities are better
explored and implemented with sched_ext rather than implementing
policies directly in KVM.

[3]: https://lore.kernel.org/lkml/[email protected]/


Attachments:
(No filename) (8.53 kB)
signature.asc (235.00 B)
Download all attachments

2023-12-15 18:53:02

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 8/8] irq: boost/unboost in irq/nmi entry/exit and softirq

[...snip..]
> > +#ifdef CONFIG_PARAVIRT_SCHED
> > + if (pv_sched_enabled() && !in_hardirq() && !local_softirq_pending() &&
> > + !need_resched() && !task_is_realtime(current))
> > + pv_sched_unboost_vcpu();
> > +#endif
>
> Symmetry is overrated. Just pick a spot and slap your hackery in.
>
> Aside of that this whole #ifdeffery is tasteless at best.
>
> > instrumentation_end();
>
> > +#ifdef CONFIG_PARAVIRT_SCHED
> > + if (pv_sched_enabled())
> > + pv_sched_boost_vcpu_lazy();
> > +#endif
>
> But what's worse is that this is just another approach of sprinkling
> random hints all over the place and see what sticks.
>
Understood. WIll have a full re-write of guest side logic for v2.

> Abusing KVM as the man in the middle to communicate between guest and
> host scheduler is creative, but ill defined.
>
> From the host scheduler POV the vCPU is just a user space thread and
> making the guest special is just the wrong approach.
>
> The kernel has no proper general interface to convey information from a
> user space task to the scheduler.
>
> So instead of adding some ad-hoc KVM hackery the right thing is to solve
> this problem from ground up in a generic way and KVM can just utilize
> that instead of having the special snow-flake hackery which is just a
> maintainability nightmare.
>
We had a constructive and productive discussion on the cover letter
thread and reaching a consensus on the kvm side of the design. Will
work towards that and post iterative revisions.

Thanks,
Vineeth

2023-12-15 19:19:10

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Fri, Dec 15, 2023 at 12:54 PM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Dec 15, 2023, Vineeth Remanan Pillai wrote:
> > > You are basically proposing that KVM bounce-buffer data between guest and host.
> > > I'm saying there's no _technical_ reason to use a bounce-buffer, just do zero copy.
> > >
> > I was also meaning zero copy only. The help required from the kvm side is:
> > - Pass the address of the shared memory to bpf programs/scheduler once
> > the guest sets it up.
> > - Invoke scheduler registered callbacks on events like VMEXIT,
> > VEMENTRY, interrupt injection etc. Its the job of guest and host
> > paravirt scheduler to interpret the shared memory contents and take
> > actions.
> >
> > I admit current RFC doesn't strictly implement hooks and callbacks -
> > it calls sched_setscheduler in place of all callbacks that I mentioned
> > above. I guess this was your strongest objection.
>
> Ya, more or less.
>
> > As you mentioned in the reply to Joel, if it is fine for kvm to allow
> > hooks into events (VMEXIT, VMENTRY, interrupt injection etc) then, it
> > makes it easier to develop the ABI I was mentioning and have the hooks
> > implemented by a paravirt scheduler. We shall re-design the
> > architecture based on this for v2.
>
> Instead of going straight to a full blown re-design, can you instead post slightly
> more incremental RFCs? E.g. flesh out enough code to get a BPF program attached
> and receiving information, but do NOT wait until you have fully working setup
> before posting the next RFC.
>
Sure, makes sense.

> There are essentially four-ish things to sort out:
>
> 1. Where to insert/modify hooks in KVM
> 2. How KVM exposes KVM-internal information through said hooks
> 3. How a BPF program can influence the host scheduler
> 4. The guest/host ABI
>
> #1 and #2 are largely KVM-only, and I think/hope we can get a rough idea of how
> to address them before moving onto #3 and #4 (assuming #3 isn't already a solved
> problem).

Agreed. Will start with the kvm side and keep you posted on the progress.

Thanks,
Vineeth

2023-12-15 20:18:52

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Fri, Dec 15, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Dec 15, 2023, Joel Fernandes wrote:
> > Hi Sean,
> > Nice to see your quick response to the RFC, thanks. I wanted to
> > clarify some points below:
> >
> > On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > > > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> > > > Now when I think about it, the implementation seems to
> > > > suggest that we are putting policies in kvm. Ideally, the goal is:
> > > > - guest scheduler communicates the priority requirements of the workload
> > > > - kvm applies the priority to the vcpu task.
> > >
> > > Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> > > is that userspace doesn't have the right knobs to adjust the priority of a task
> > > quickly and efficiently, then wouldn't it be better to solve that problem in a
> > > generic way?
> >
> > No, it is not only about tasks. We are boosting anything RT or above
> > such as softirq, irq etc as well.
>
> I was talking about the host side of things. A vCPU is a task, full stop. KVM
> *may* have some information that is useful to the scheduler, but KVM does not
> *need* to initiate adjustments to a vCPU's priority.

Sorry I thought you were referring to guest tasks. You are right, KVM
does not *need* to change priority. But a vCPU is a container of tasks
who's priority dynamically changes. Still, I see your point of view
and that's also why we offer the capability to be selectively enabled
or disabled per-guest by the VMM (Vineeth will make it default off and
opt-in in the next series).

> > Could you please see the other patches?
>
> I already have, see my comments about boosting vCPUs that have received NMIs and
> IRQs not necessarily being desirable.

Ah, I was not on CC for that email. Seeing it now. I think I don't
fully buy that argument, hard IRQs are always high priority IMHO. If
an hrtimer expires on a CPU running a low priority workload, that
hrtimer might itself wake up a high priority thread. If we don't boost
the hrtimer interrupt handler, then that will delay the wakeup as
well. It is always a chain of events and it has to be boosted from the
first event. If a system does not wish to give an interrupt a high
priority, then the typical way is to use threaded IRQs and lower the
priority of the thread. That will give the interrupt handler lower
priority and the guest is free to do that. We had many POCs before
where we don't boost at all for interrupts and they all fall apart.
This is the only POC that works without any issues as far as we know
(we've been trying to do this for a long time :P).

Regarding perf, I similarly disagree. I think a PMU event is super
important (example, some versions of the kernel watchdog that depend
on PMU fail without it). But if some VM does not want this to be
prioritized, they could just not opt-in for the feature IMO. I can see
your point of view that not all VMs may want this behavior though.

> > > > > Pushing the scheduling policies to host userspace would allow for far more control
> > > > > and flexibility. E.g. a heavily paravirtualized environment where host userspace
> > > > > knows *exactly* what workloads are being run could have wildly different policies
> > > > > than an environment where the guest is a fairly vanilla Linux VM that has received
> > > > > a small amount of enlightment.
> > > > >
> > > > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > > > tailor made for the problems you are trying to solve.
> > > > >
> > > > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > > > >
> > > > You are right, sched_ext is a good choice to have policies
> > > > implemented. In our case, we would need a communication mechanism as
> > > > well and hence we thought kvm would work best to be a medium between
> > > > the guest and the host.
> > >
> > > Making KVM be the medium may be convenient and the quickest way to get a PoC
> > > out the door, but effectively making KVM a middle-man is going to be a huge net
> > > negative in the long term. Userspace can communicate with the guest just as
> > > easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> > > define a relatively rigid guest/host ABI.
> >
> > At the moment, the only ABI is a shared memory structure and a custom
> > MSR. This is no different from the existing steal time accounting
> > where a shared structure is similarly shared between host and guest,
> > we could perhaps augment that structure with other fields instead of
> > adding a new one?
>
> I'm not concerned about the number of structures/fields, it's the amount/type of
> information and the behavior of KVM that is problematic. E.g. boosting the priority
> of a vCPU that has a pending NMI is dubious.

I think NMIs have to be treated as high priority, the perf profiling
interrupt for instance works well on x86 (unlike ARM) because it can
interrupt any context (other than NMI and possibly the machine check
ones). On ARM on the other hand, because the perf interrupt is a
regular non-NMI interrupt, you cannot profile hardirq and IRQ-disable
regions (this could have changed since pseudo-NMI features). So making
the NMI a higher priority than IRQ is not dubious AFAICS, it is a
requirement in many cases IMHO.

> This RFC has baked in a large
> number of assumptions that (mostly) fit your specific use case, but do not
> necessarily apply to all use cases.

Yes, agreed. We'll make it more generic.

> I'm not even remotely convinced that what's prosed here is optimal for your use case either.

We have a data-driven approach. We've tested this with lots of use
cases and collected metrics with both real and synthetic workloads
(not just us but many other teams we work with). It might not be
optimal, but definitely is a step forward IMO. As you can see the
several-fold reduction in latencies, audio glitches etc. We did wait a
long time for RFC however that was because we did not want to push out
something broken. In hind-sight, we should be posting this work
upstream more quickly (but in our defense, we did present this work at
2 other conferences this year ;-)).

> > On the ABI point, we have deliberately tried to keep it simple (for example,
> > a few months ago we had hypercalls and we went to great lengths to eliminate
> > those).
>
> Which illustrates one of the points I'm trying to make is kind of my point.
> Upstream will never accept anything that's wildly complex or specific because such
> a thing is unlikely to be maintainable.

TBH, it is not that complex though. But let us know which parts, if
any, can be further simplified (I saw your suggestions for next steps
in the reply to Vineeth, those look good to me and we'll plan
accordingly).

> > > If instead the contract is between host userspace and the guest, the ABI can be
> > > much more fluid, e.g. if you (or any setup) can control at least some amount of
> > > code that runs in the guest
> >
> > I see your point of view. One way to achieve this is to have a BPF
> > program run to implement the boosting part, in the VMEXIT path. KVM
> > then just calls a hook. Would that alleviate some of your concerns?
>
> Yes, it absolutely would! I would *love* to build a rich set of BPF utilities
> and whatnot for KVM[1].

Nice to see it! Definitely interested in this work.

> I have zero objections to KVM making data available to
> BPF programs, i.e. to host userspace, quite the opposite. What I am steadfastedly
> against is KVM make decisions that are not obviously the "right" decisions in all
> situations. And I do not want to end up in a world where KVM has a big pile of
> knobs to let userspace control those decisions points, i.e. I don't want to make
> KVM a de facto paravirt scheduler.
>
> I think there is far more potential for this direction. KVM already has hooks
> for VM-Exit and VM-Entry, they likely just need to be enhanced to make them more
> useful for BPF programs. And adding hooks in other paths shouldn't be too
> contentious, e.g. in addition to what you've done in this RFC, adding a hook to
> kvm_vcpu_on_spin() could be quite interesting as I would not be at all surprised
> if userspace could make far better decisions when selecting the vCPU to yield to.
>
> And there are other use cases for KVM making "interesting" data available to
> userspace, e.g. there's (very early) work[2] to allow userspace to poll() on vCPUs,
> which likely needs much of the same information that paravirt scheduling would
> find useful, e.g. whether or not the vCPU has pending events.
>
> [1] https://lore.kernel.org/all/ZRIf1OPjKV66Y17%[email protected]
> [2] https://lore.kernel.org/all/[email protected]

The polling work seems interesting too for what we're doing, shall
look further as well. Thank you!

> > > then the contract between the guest and host doesn't
> > > even need to be formally defined, it could simply be a matter of bundling host
> > > and guest code appropriately.
> > >
> > > If you want to land support for a given contract in upstream repositories, e.g.
> > > to broadly enable paravirt scheduling support across a variety of usersepace VMMs
> > > and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> > > reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> > > it's largely just a matter of writing code, and LKML provides a centralized channel
> > > for getting buyin from all parties. But defining an ABI that's independent of the
> > > kernel is absolutely doable, e.g. see the many virtio specs.
> > >
> > > I'm not saying KVM can't help, e.g. if there is information that is known only
> > > to KVM, but the vast majority of the contract doesn't need to be defined by KVM.
> >
> > The key to making this working of the patch is VMEXIT path, that is
> > only available to KVM. If we do anything later, then it might be too
> > late.
>
> Strictly speaking, no, it's not. It's key if and only if *KVM* boosts the priority
> of the task/vCPU (or if KVM provides a hook for a BPF program to do its thing).

Ok, agreed.

> > We have to intervene *before* the scheduler takes the vCPU thread off the
> > CPU.
>
> If the host scheduler is directly involved in the paravirt shenanigans, then
> there is no need to hook KVM's VM-Exit path because the scheduler already has the
> information needed to make an informed decision.

Just to clarify, we're not making any "decisions" in the VM exit path,
we're just giving the scheduler enough information (via the
setscheduler call). The scheduler may just as well "decide" it wants
to still preempt the vCPU thread -- that's Ok in fact required some
times. We're just arming it with more information, specifically that
this is an important thread. We can find another way to pass this
information along (BPF etc) but I just wanted to mention that KVM is
not really replacing the functionality or decision-making of the
scheduler even with this POC.

> > Similarly, in the case of an interrupt injected into the guest, we have
> > to boost the vCPU before the "vCPU run" stage -- anything later might be too
> > late.
>
> Except that this RFC doesn't actually do this. KVM's relevant function names suck
> and aren't helping you, but these patches boost vCPUs when events are *pended*,
> not when they are actually injected.

We are doing the injection bit in:
https://lore.kernel.org/all/[email protected]/

For instance, in:

kvm_set_msi ->
kvm_irq_delivery_to_apic ->
kvm_apic_set_irq ->
__apic_accept_irq ->
kvm_vcpu_kick_boost();

The patch is a bit out of order because patch 4 depends on 3. Patch 3
does what you're referring to, which is checking for pending events.

Did we miss something? If there is some path that we are missing, your
help is much appreciated as you're likely much more versed with this
code than me. But doing the boosting at injection time is what has
made all the difference (for instance with cyclictest latencies).

> Boosting the priority of vCPUs at semi-arbitrary points is going to be much more
> difficult for KVM to "get right". E.g. why boost the priority of a vCPU that has
> a pending IRQ, but not a vCPU that is running with IRQs disabled?

I was actually wanting to boost preempted vCPU threads that were
preempted in IRQ disabled regions as well. In fact, that is on our
TODO.. we just haven't done it yet as we notice that usually IRQ is
disabled while preemption was already disabled and just boosting
preempt-disabled gets us most of the way there.

> The potential
> for endless twiddling to try and tune KVM's de facto scheduling logic so that it's
> universally "correct" is what terrifies me.

Yes, we can certainly look into BPF to make it a bit more generic for
our usecase (while getting enough information from the kernel).

By the way, one other usecase for this patch series is RCU. I am one
of the RCU maintainers and I am looking into improving RCU in VMs. For
instance, boosting preempted RCU readers to RT (CONFIG_RCU_BOOST) does
not usually work because the vCPU thread on the host is not RT.
Similarly, other RCU threads which have RT priority don't get to run
causing issues.

Thanks!

- Joel

2023-12-15 22:08:57

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Fri, Dec 15, 2023, Joel Fernandes wrote:
> On Fri, Dec 15, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Fri, Dec 15, 2023, Joel Fernandes wrote:
> > > Hi Sean,
> > > Nice to see your quick response to the RFC, thanks. I wanted to
> > > clarify some points below:
> > >
> > > On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <[email protected]> wrote:
> > > >
> > > > On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > > > > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> > > > > Now when I think about it, the implementation seems to
> > > > > suggest that we are putting policies in kvm. Ideally, the goal is:
> > > > > - guest scheduler communicates the priority requirements of the workload
> > > > > - kvm applies the priority to the vcpu task.
> > > >
> > > > Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> > > > is that userspace doesn't have the right knobs to adjust the priority of a task
> > > > quickly and efficiently, then wouldn't it be better to solve that problem in a
> > > > generic way?
> > >
> > > No, it is not only about tasks. We are boosting anything RT or above
> > > such as softirq, irq etc as well.
> >
> > I was talking about the host side of things. A vCPU is a task, full stop. KVM
> > *may* have some information that is useful to the scheduler, but KVM does not
> > *need* to initiate adjustments to a vCPU's priority.
>
> Sorry I thought you were referring to guest tasks. You are right, KVM
> does not *need* to change priority. But a vCPU is a container of tasks
> who's priority dynamically changes. Still, I see your point of view
> and that's also why we offer the capability to be selectively enabled
> or disabled per-guest by the VMM (Vineeth will make it default off and
> opt-in in the next series).
>
> > > Could you please see the other patches?
> >
> > I already have, see my comments about boosting vCPUs that have received
> > NMIs and IRQs not necessarily being desirable.
>
> Ah, I was not on CC for that email. Seeing it now. I think I don't
> fully buy that argument, hard IRQs are always high priority IMHO.

They most definitely are not, and there are undoubtedly tiers of priority, e.g.
tiers are part and parcel of the APIC architecture. I agree that *most* IRQs are
high-ish priority, but that is not the same that *all* IRQs are high priority.
It only takes one example to disprove the latter, and I can think of several off
the top of my head.

Nested virtualization is the easy one to demonstrate.

On AMD, which doesn't have an equivalent to the VMX preemption timer, KVM uses a
self-IPI to wrest control back from the guest immediately after VMRUN in some
situations (mostly to inject events into L2 while honoring the architectural
priority of events). If the guest is running a nested workload, the resulting
IRQ in L1 is not at all interesting or high priority, as the L2 workload hasn't
suddenly become high priority just because KVM wants to inject an event.

Anyways, I didn't mean to start a debate over the priority of handling IRQs and
NMIs, quite the opposite actually. The point I'm trying to make is that under
no circumstance do I want KVM to be making decisions about whether or not such
things are high priority. I have no objection to KVM making information available
to whatever entity is making the actual decisions, it's having policy in KVM that
I am staunchly opposed to.

> If an hrtimer expires on a CPU running a low priority workload, that
> hrtimer might itself wake up a high priority thread. If we don't boost
> the hrtimer interrupt handler, then that will delay the wakeup as
> well. It is always a chain of events and it has to be boosted from the
> first event. If a system does not wish to give an interrupt a high
> priority, then the typical way is to use threaded IRQs and lower the
> priority of the thread. That will give the interrupt handler lower
> priority and the guest is free to do that. We had many POCs before
> where we don't boost at all for interrupts and they all fall apart.
> This is the only POC that works without any issues as far as we know
> (we've been trying to do this for a long time :P).

In *your* environment. The fact that it took multiple months to get a stable,
functional set of patches for one use case is *exactly* why I am pushing back on
this. Change any number of things about the setup and odds are good that the
result would need different tuning. E.g. the ratio of vCPUs to pCPUs, the number
of VMs, the number of vCPUs per VM, whether or not the host kernel is preemptible,
whether or not the guest kernel is preemptible, the tick rate of the host and
guest kernels, the workload of the VM, the affinity of tasks within the VM, and
and so on and so forth.

It's a catch-22 of sorts. Anything that is generic enough to land upstream is
likely going to be too coarse grained to be universally applicable.

> Regarding perf, I similarly disagree. I think a PMU event is super
> important (example, some versions of the kernel watchdog that depend
> on PMU fail without it). But if some VM does not want this to be
> prioritized, they could just not opt-in for the feature IMO. I can see
> your point of view that not all VMs may want this behavior though.

Or a VM may want it conditionally, e.g. only for select tasks.

> > > At the moment, the only ABI is a shared memory structure and a custom
> > > MSR. This is no different from the existing steal time accounting
> > > where a shared structure is similarly shared between host and guest,
> > > we could perhaps augment that structure with other fields instead of
> > > adding a new one?
> >
> > I'm not concerned about the number of structures/fields, it's the amount/type of
> > information and the behavior of KVM that is problematic. E.g. boosting the priority
> > of a vCPU that has a pending NMI is dubious.
>
> I think NMIs have to be treated as high priority, the perf profiling
> interrupt for instance works well on x86 (unlike ARM) because it can
> interrupt any context (other than NMI and possibly the machine check
> ones). On ARM on the other hand, because the perf interrupt is a
> regular non-NMI interrupt, you cannot profile hardirq and IRQ-disable
> regions (this could have changed since pseudo-NMI features). So making
> the NMI a higher priority than IRQ is not dubious AFAICS, it is a
> requirement in many cases IMHO.

Again, many, but not all. A large part of KVM's success is that KVM has very few
"opinions" of its own. Outside of the MMU and a few paravirt paths, KVM mostly
just emulates/virtualizes hardware according to the desires of userspace. This
has allowed a fairly large variety of use cases to spring up with relatively few
changes to KVM.

What I want to avoid is (a) adding something that only works for one use case
and (b) turning KVM into a scheduler of any kind.

> > Which illustrates one of the points I'm trying to make is kind of my point.
> > Upstream will never accept anything that's wildly complex or specific because such
> > a thing is unlikely to be maintainable.
>
> TBH, it is not that complex though.

Yet. Your use case is happy with relatively simple, coarse-grained hooks. Use
cases that want to squeeze out maximum performance, e.g. because shaving N% off
the runtime saves $$$, are likely willing to take on far more complexity, or may
just want to make decisions at a slightly different granularity.

> But let us know which parts, if any, can be further simplified (I saw your
> suggestions for next steps in the reply to Vineeth, those look good to me and
> we'll plan accordingly).

It's not a matter of simplifying things, it's a matter of KVM (a) not defining
policy of any kind and (b) KVM not defining a guest/host ABI.

> > > We have to intervene *before* the scheduler takes the vCPU thread off the
> > > CPU.
> >
> > If the host scheduler is directly involved in the paravirt shenanigans, then
> > there is no need to hook KVM's VM-Exit path because the scheduler already has the
> > information needed to make an informed decision.
>
> Just to clarify, we're not making any "decisions" in the VM exit path,

Yes, you are.

> we're just giving the scheduler enough information (via the
> setscheduler call). The scheduler may just as well "decide" it wants
> to still preempt the vCPU thread -- that's Ok in fact required some
> times. We're just arming it with more information, specifically that
> this is an important thread. We can find another way to pass this
> information along (BPF etc) but I just wanted to mention that KVM is
> not really replacing the functionality or decision-making of the
> scheduler even with this POC.

Yes, it is. kvm_vcpu_kick_boost() *directly* adjusts the priority of the task.
KVM is not just passing a message, KVM is defining a scheduling policy of "boost
vCPUs with pending IRQs, NMIs, SMIs, and PV unhalt events".

The VM-Exit path also makes those same decisions by boosting a vCPU if the guest
has requested boost *or* the vCPU has a pending event (but oddly, not pending
NMIs, SMIs, or PV unhalt events):

bool pending_event = kvm_cpu_has_pending_timer(vcpu) || kvm_cpu_has_interrupt(vcpu);

/*
* vcpu needs a boost if
* - A lazy boost request active or a pending latency sensitive event, and
* - Preemption disabled duration on this vcpu has not crossed the threshold.
*/
return ((schedinfo.boost_req == VCPU_REQ_BOOST || pending_event) &&
!kvm_vcpu_exceeds_preempt_disabled_duration(&vcpu->arch));


Which, by the by is suboptimal. Detecting for pending events isn't free, so you
ideally want to check for pending events if and only if the guest hasn't requested
a boost.

> > > Similarly, in the case of an interrupt injected into the guest, we have
> > > to boost the vCPU before the "vCPU run" stage -- anything later might be too
> > > late.
> >
> > Except that this RFC doesn't actually do this. KVM's relevant function names suck
> > and aren't helping you, but these patches boost vCPUs when events are *pended*,
> > not when they are actually injected.
>
> We are doing the injection bit in:
> https://lore.kernel.org/all/[email protected]/
>
> For instance, in:
>
> kvm_set_msi ->
> kvm_irq_delivery_to_apic ->
> kvm_apic_set_irq ->
> __apic_accept_irq ->
> kvm_vcpu_kick_boost();
>
> The patch is a bit out of order because patch 4 depends on 3. Patch 3
> does what you're referring to, which is checking for pending events.
>
> Did we miss something? If there is some path that we are missing, your
> help is much appreciated as you're likely much more versed with this
> code than me. But doing the boosting at injection time is what has
> made all the difference (for instance with cyclictest latencies).

That accepts in IRQ into the vCPU's local APIC, it does not *inject* the IRQ into
the vCPU proper. The actual injection is done by kvm_check_and_inject_events().
A pending IRQ is _usually_ delivered/injected fairly quickly, but not always.

E.g. if the guest has IRQs disabled (RFLAGS.IF=0), KVM can't immediately inject
the IRQ (without violating x86 architecture). In that case, KVM will twiddle
VMCS/VMCB knobs to detect an IRQ window, i.e. to cause a VM-Exit when IRQs are
no longer blocked in the guest.

Your PoC actually (mostly) handles this (see above) by keeping the vCPU boosted
after EXIT_REASON_INTERRUPT_WINDOW (because the IRQ will obviously still be pending).

> > Boosting the priority of vCPUs at semi-arbitrary points is going to be much more
> > difficult for KVM to "get right". E.g. why boost the priority of a vCPU that has
> > a pending IRQ, but not a vCPU that is running with IRQs disabled?
>
> I was actually wanting to boost preempted vCPU threads that were
> preempted in IRQ disabled regions as well. In fact, that is on our
> TODO.. we just haven't done it yet as we notice that usually IRQ is
> disabled while preemption was already disabled and just boosting
> preempt-disabled gets us most of the way there.
>
> > The potential for endless twiddling to try and tune KVM's de facto
> > scheduling logic so that it's universally "correct" is what terrifies me.
>
> Yes, we can certainly look into BPF to make it a bit more generic for
> our usecase (while getting enough information from the kernel).
>
> By the way, one other usecase for this patch series is RCU. I am one
> of the RCU maintainers and I am looking into improving RCU in VMs. For
> instance, boosting preempted RCU readers to RT (CONFIG_RCU_BOOST) does
> not usually work because the vCPU thread on the host is not RT.
> Similarly, other RCU threads which have RT priority don't get to run
> causing issues.
>
> Thanks!
>
> - Joel

2024-01-03 20:11:37

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

Hi David,

On Fri, Dec 15, 2023 at 12:10:14PM -0600, David Vernet wrote:
> On Thu, Dec 14, 2023 at 08:38:47AM -0800, Sean Christopherson wrote:
> > +sched_ext folks
>
> Thanks for the cc.

Just back from holidays, sorry for the late reply. But it was a good break to
go over your email in more detail ;-).

> > On Wed, Dec 13, 2023, Vineeth Pillai (Google) wrote:
> > > Double scheduling is a concern with virtualization hosts where the host
> > > schedules vcpus without knowing whats run by the vcpu and guest schedules
> > > tasks without knowing where the vcpu is physically running. This causes
> > > issues related to latencies, power consumption, resource utilization
> > > etc. An ideal solution would be to have a cooperative scheduling
> > > framework where the guest and host shares scheduling related information
> > > and makes an educated scheduling decision to optimally handle the
> > > workloads. As a first step, we are taking a stab at reducing latencies
> > > for latency sensitive workloads in the guest.
> > >
> > > This series of patches aims to implement a framework for dynamically
> > > managing the priority of vcpu threads based on the needs of the workload
> > > running on the vcpu. Latency sensitive workloads (nmi, irq, softirq,
> > > critcal sections, RT tasks etc) will get a boost from the host so as to
> > > minimize the latency.
> > >
> > > The host can proactively boost the vcpu threads when it has enough
> > > information about what is going to run on the vcpu - fo eg: injecting
> > > interrupts. For rest of the case, guest can request boost if the vcpu is
> > > not already boosted. The guest can subsequently request unboost after
> > > the latency sensitive workloads completes. Guest can also request a
> > > boost if needed.
> > >
> > > A shared memory region is used to communicate the scheduling information.
> > > Guest shares its needs for priority boosting and host shares the boosting
> > > status of the vcpu. Guest sets a flag when it needs a boost and continues
> > > running. Host reads this on next VMEXIT and boosts the vcpu thread. For
> > > unboosting, it is done synchronously so that host workloads can fairly
> > > compete with guests when guest is not running any latency sensitive
> > > workload.
> >
> > Big thumbs down on my end. Nothing in this RFC explains why this should be done
> > in KVM. In general, I am very opposed to putting policy of any kind into KVM,
> > and this puts a _lot_ of unmaintainable policy into KVM by deciding when to
> > start/stop boosting a vCPU.
>
> I have to agree, not least of which is because in addition to imposing a
> severe maintenance tax, these policies are far from exhaustive in terms
> of what you may want to do for cooperative paravirt scheduling.

Just to clarify the 'policy' we are discussing here, it is not about 'how to
schedule' but rather 'how/when to boost/unboost'. We want the existing
scheduler (or whatever it might be in the future) to make the actual decision
about how to schedule.

In that sense, I agree with Sean that we are probably forcing a singular
policy on when and how to boost which might not work for everybody (in theory
anyway). And I am perfectly OK with the BPF route as I mentioned in the other
email. So perhaps we can use a tracepoint in the VMEXIT path to run a BPF
program (?). And we still have to figure out how to run BPF programs in the
interrupt injections patch (I am currently studying those paths more also
thanks to Sean's email describing them).

> I think
> something like sched_ext would give you the best of all worlds: no
> maintenance burden on the KVM maintainers, more options for implementing
> various types of policies, performant, safe to run on the host, no need
> to reboot when trying a new policy, etc. More on this below.

I think switching to sched_ext just for this is overkill, we don't want
to change the scheduler yet which is a much more invasive/involved changed.
For instance, we want the features of this patchset to work for ARM as well
which heavily depends on EAS/cpufreq.

[...]
> > Concretely, boosting vCPUs for most events is far too coarse grained. E.g. boosting
> > a vCPU that is running a low priority workload just because the vCPU triggered
> > an NMI due to PMU counter overflow doesn't make sense. Ditto for if a guest's
> > hrtimer expires on a vCPU running a low priority workload.
> >
> > And as evidenced by patch 8/8, boosting vCPUs based on when an event is _pending_
> > is not maintainable. As hardware virtualizes more and more functionality, KVM's
> > visilibity into the guest effectively decreases, e.g. Intel and AMD both support
> > with IPI virtualization.
> >
> > Boosting the target of a PV spinlock kick is similarly flawed. In that case, KVM
> > only gets involved _after_ there is a problem, i.e. after a lock is contended so
> > heavily that a vCPU stops spinning and instead decided to HLT. It's not hard to
> > imagine scenarios where a guest would want to communicate to the host that it's
> > acquiring a spinlock for a latency sensitive path and so shouldn't be scheduled
> > out. And of course that's predicated on the assumption that all vCPUs are subject
> > to CPU overcommit.
> >
> > Initiating a boost from the host is also flawed in the sense that it relies on
> > the guest to be on the same page as to when it should stop boosting. E.g. if
> > KVM boosts a vCPU because an IRQ is pending, but the guest doesn't want to boost
> > IRQs on that vCPU and thus doesn't stop boosting at the end of the IRQ handler,
> > then the vCPU could end up being boosted long after its done with the IRQ.
> >
> > Throw nested virtualization into the mix and then all of this becomes nigh
> > impossible to sort out in KVM. E.g. if an L1 vCPU is a running an L2 vCPU, i.e.
> > a nested guest, and L2 is spamming interrupts for whatever reason, KVM will end
> > repeatedly boosting the L1 vCPU regardless of the priority of the L2 workload.
> >
> > For things that aren't clearly in KVM's domain, I don't think we should implement
> > KVM-specific functionality until every other option has been tried (and failed).
> > I don't see any reason why KVM needs to get involved in scheduling, beyond maybe
> > providing *input* regarding event injection, emphasis on *input* because KVM
> > providing information to userspace or some other entity is wildly different than
> > KVM making scheduling decisions based on that information.
> >
> > Pushing the scheduling policies to host userspace would allow for far more control
> > and flexibility. E.g. a heavily paravirtualized environment where host userspace
> > knows *exactly* what workloads are being run could have wildly different policies
> > than an environment where the guest is a fairly vanilla Linux VM that has received
> > a small amount of enlightment.
> >
> > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > tailor made for the problems you are trying to solve.
> >
> > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
>
> I very much agree. There are some features missing from BPF that we'd
> need to add to enable this, but they're on the roadmap, and I don't
> think they'd be especially difficult to implement.
>
> The main building block that I was considering is a new kptr [0] and set
> of kfuncs [1] that would allow a BPF program to have one or more R/W
> shared memory regions with a guest.

I really like your ideas around sharing memory across virt boundary using
BPF. The one concern I would have is that, this does require loading a BPF
program from the guest userspace, versus just having a guest kernel that
'does the right thing'.

On the host, I would have no problem loading a BPF program as a one-time
thing, but on the guest it may be more complex as we don't always control the
guest userspace and their BPF loading mechanisms. Example, an Android guest
needs to have its userspace modified to load BPF progs, etc. Not hard
problems, but flexibility comes with more cost. Last I recall, Android does
not use a lot of the BPF features that come with the libbpf library because
they write their own userspace from scratch (due to licensing). OTOH, if this
was an Android kernel-only change, that would simplify a lot.

Still there is a lot of merit to sharing memory with BPF and let BPF decide
the format of the shared memory, than baking it into the kernel... so thanks
for bringing this up! Lets talk more about it... Oh, and there's my LSFMMBPF
invitiation request ;-) ;-).

> This could enable a wide swath of
> BPF paravirt use cases that are not limited to scheduling, but in terms
> of sched_ext, the BPF scheduler could communicate with the guest
> scheduler over this shared memory region in whatever manner was required
> for that use case.
>
> [0]: https://lwn.net/Articles/900749/
> [1]: https://lwn.net/Articles/856005/

Thanks, I had no idea about these. I have a question -- would it be possible
to call the sched_setscheduler() function in core.c via a kfunc? Then we can
trigger the boost from a BPF program on the host side. We could start simple
from there.

I agree on the rest below. I just wanted to emphasize though that this patch
series does not care about what the scheduler does. It merely hints the
scheduler via a priority setting that something is an important task. That's
a far cry from how to actually schedule and the spirit here is to use
whatever scheduler the user has to decide how to actually schedule.

thanks,

- Joel


> For example, the guest could communicate scheduling intention such as:
>
> - "Don't preempt me and/or boost me (because I'm holding a spinlock, in an
> NMI region, running some low-latency task, etc)".
> - "VCPU x prefers to be on a P core", and then later, "Now it prefers an
> E core". Note that this doesn't require pinning or anything like that.
> It's just the VCPU requesting some best-effort placement, and allowing
> that policy to change dynamically depending on what the guest is
> doing.
> - "Try to give these VCPUs their own fully idle cores if possible, but
> these other VCPUs should ideally be run as hypertwins as they're
> expected to have good cache locality", etc.
>
> In general, some of these policies might be silly and not work well,
> others might work very well for some workloads / architectures and not
> as well on others, etc. sched_ext would make it easy to try things out
> and see what works well, without having to worry about rebooting or
> crashing the host, and ultimately without having to implement and
> maintain some scheduling policy directly in KVM. As Sean said, the host
> knows exactly what workloads are being run and could have very targeted
> and bespoke policies that wouldn't be appropriate for a vanilla Linux
> VM.
>
> Finally, while it's not necessarily related to paravirt scheduling
> specifically, I think it's maybe worth mentioning that sched_ext would
> have allowed us to implement a core-sched-like policy when L1TF first
> hit us. It was inevitable that we'd need a core-sched policy build into
> the core kernel as well, but we could have used sched_ext as a solution
> until core sched was merged. Tejun implemented something similar in an
> example scheduler where only tasks in the same cgroup are ever allowed
> to run as hypertwins [3]. The point is, you can do basically anything
> you want with sched_ext. For paravirt, I think there are a ton of
> interesting possibilities, and I think those possibilities are better
> explored and implemented with sched_ext rather than implementing
> policies directly in KVM.
>
> [3]: https://lore.kernel.org/lkml/[email protected]/



2024-01-04 22:34:28

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Wed, Jan 03, 2024 at 03:09:07PM -0500, Joel Fernandes wrote:
> Hi David,
>
> On Fri, Dec 15, 2023 at 12:10:14PM -0600, David Vernet wrote:
> > On Thu, Dec 14, 2023 at 08:38:47AM -0800, Sean Christopherson wrote:
> > > +sched_ext folks
> >
> > Thanks for the cc.
>
> Just back from holidays, sorry for the late reply. But it was a good break to
> go over your email in more detail ;-).

Hi Joel,

No worries at all, and happy new year!

> > > On Wed, Dec 13, 2023, Vineeth Pillai (Google) wrote:
> > > > Double scheduling is a concern with virtualization hosts where the host
> > > > schedules vcpus without knowing whats run by the vcpu and guest schedules
> > > > tasks without knowing where the vcpu is physically running. This causes
> > > > issues related to latencies, power consumption, resource utilization
> > > > etc. An ideal solution would be to have a cooperative scheduling
> > > > framework where the guest and host shares scheduling related information
> > > > and makes an educated scheduling decision to optimally handle the
> > > > workloads. As a first step, we are taking a stab at reducing latencies
> > > > for latency sensitive workloads in the guest.
> > > >
> > > > This series of patches aims to implement a framework for dynamically
> > > > managing the priority of vcpu threads based on the needs of the workload
> > > > running on the vcpu. Latency sensitive workloads (nmi, irq, softirq,
> > > > critcal sections, RT tasks etc) will get a boost from the host so as to
> > > > minimize the latency.
> > > >
> > > > The host can proactively boost the vcpu threads when it has enough
> > > > information about what is going to run on the vcpu - fo eg: injecting
> > > > interrupts. For rest of the case, guest can request boost if the vcpu is
> > > > not already boosted. The guest can subsequently request unboost after
> > > > the latency sensitive workloads completes. Guest can also request a
> > > > boost if needed.
> > > >
> > > > A shared memory region is used to communicate the scheduling information.
> > > > Guest shares its needs for priority boosting and host shares the boosting
> > > > status of the vcpu. Guest sets a flag when it needs a boost and continues
> > > > running. Host reads this on next VMEXIT and boosts the vcpu thread. For
> > > > unboosting, it is done synchronously so that host workloads can fairly
> > > > compete with guests when guest is not running any latency sensitive
> > > > workload.
> > >
> > > Big thumbs down on my end. Nothing in this RFC explains why this should be done
> > > in KVM. In general, I am very opposed to putting policy of any kind into KVM,
> > > and this puts a _lot_ of unmaintainable policy into KVM by deciding when to
> > > start/stop boosting a vCPU.
> >
> > I have to agree, not least of which is because in addition to imposing a
> > severe maintenance tax, these policies are far from exhaustive in terms
> > of what you may want to do for cooperative paravirt scheduling.
>
> Just to clarify the 'policy' we are discussing here, it is not about 'how to
> schedule' but rather 'how/when to boost/unboost'. We want the existing
> scheduler (or whatever it might be in the future) to make the actual decision
> about how to schedule.

Thanks for clarifying. I think we're on the same page. I didn't mean to
imply that KVM is actually in the scheduler making decisions about what
task to run next, but that wasn't really my concern. My concern is that
this patch set makes KVM responsible for all of the possible paravirt
policies by encoding it in KVM UAPI, and is ultimately responsible for
being aware of and communicating those policies between the guest to the
host scheduler.

Say that we wanted to add some other paravirt related policy like "these
VCPUs may benefit from being co-located", or, "this VCPU just grabbed a
critical spinlock so please pin me for a moment". That would require
updating struct guest_schedinfo UAPI in kvm_para.h, adding getters and
setters to kvm_host.h to set those policies for the VCPU (though your
idea to use a BPF hook on VMEXIT may help with that onme), setting the
state from the guest, etc.

KVM isn't making scheduling decisions, but it's the arbiter of what data
is available to the scheduler to consume. As it relates to a VCPU, it
seems like this patch set makes KVM as much invested in the scheduling
decision that's eventually made as the actual scheduler. Also worth
considering is that it ties KVM UAPI to sched/core.c, which seems
undesirable from the perspective of both subsystems.

> In that sense, I agree with Sean that we are probably forcing a singular
> policy on when and how to boost which might not work for everybody (in theory
> anyway). And I am perfectly OK with the BPF route as I mentioned in the other

FWIW, I don't think it's just that boosting may not work well in every
context (though I do think there's an argument to be made for that, as
Sean pointed out r.e. hard IRQs in nested context). The problem is also
that boosting is just one example of a paravirt policy that you may want
to enable, as I alluded to above, and that comes with complexity and
maintainership costs.

> email. So perhaps we can use a tracepoint in the VMEXIT path to run a BPF
> program (?). And we still have to figure out how to run BPF programs in the
> interrupt injections patch (I am currently studying those paths more also
> thanks to Sean's email describing them).

If I understand correctly, based on your question below, the idea is to
call sched_setscheduler() from a kfunc in this VMEXIT BPF tracepoint?
Please let me know if that's correct -- I'll respond to this below where
you ask the question.

As an aside, even if we called a BPF tracepoint prog on the VMEXIT path,
AFAIU it would still require UAPI changes given that we'd still need to
make all the same changes in the guest, right? I see why having a BPF
hook here would be useful to avoid some of the logic on the host that
implements the boosting, and to give more flexibility as to when to
apply that boosting, but in my mind it seems like the UAPI concerns are
the most relevant.

> > I think
> > something like sched_ext would give you the best of all worlds: no
> > maintenance burden on the KVM maintainers, more options for implementing
> > various types of policies, performant, safe to run on the host, no need
> > to reboot when trying a new policy, etc. More on this below.
>
> I think switching to sched_ext just for this is overkill, we don't want
> to change the scheduler yet which is a much more invasive/involved changed.
> For instance, we want the features of this patchset to work for ARM as well
> which heavily depends on EAS/cpufreq.

Fair enough, I see your point in that you just want to adjust prio and
policy. I agree that's not the same thing as writing an entirely new
scheduler, but I think we have to approach this from the perspective of
what's the right long-term design. The cost of having to plumb all of
this through KVM UAPI (as well as hard-code where the paravirt policies
are applied in the guest, such as in sched/core.c) is pretty steep, and
using BPF seems like a way to broadly enable paravirt scheduling without
having to take on the complexity or maintenance burden of adding these
paravirt UAPI changes at all.

> [...]
> > > Concretely, boosting vCPUs for most events is far too coarse grained. E.g. boosting
> > > a vCPU that is running a low priority workload just because the vCPU triggered
> > > an NMI due to PMU counter overflow doesn't make sense. Ditto for if a guest's
> > > hrtimer expires on a vCPU running a low priority workload.
> > >
> > > And as evidenced by patch 8/8, boosting vCPUs based on when an event is _pending_
> > > is not maintainable. As hardware virtualizes more and more functionality, KVM's
> > > visilibity into the guest effectively decreases, e.g. Intel and AMD both support
> > > with IPI virtualization.
> > >
> > > Boosting the target of a PV spinlock kick is similarly flawed. In that case, KVM
> > > only gets involved _after_ there is a problem, i.e. after a lock is contended so
> > > heavily that a vCPU stops spinning and instead decided to HLT. It's not hard to
> > > imagine scenarios where a guest would want to communicate to the host that it's
> > > acquiring a spinlock for a latency sensitive path and so shouldn't be scheduled
> > > out. And of course that's predicated on the assumption that all vCPUs are subject
> > > to CPU overcommit.
> > >
> > > Initiating a boost from the host is also flawed in the sense that it relies on
> > > the guest to be on the same page as to when it should stop boosting. E.g. if
> > > KVM boosts a vCPU because an IRQ is pending, but the guest doesn't want to boost
> > > IRQs on that vCPU and thus doesn't stop boosting at the end of the IRQ handler,
> > > then the vCPU could end up being boosted long after its done with the IRQ.
> > >
> > > Throw nested virtualization into the mix and then all of this becomes nigh
> > > impossible to sort out in KVM. E.g. if an L1 vCPU is a running an L2 vCPU, i.e.
> > > a nested guest, and L2 is spamming interrupts for whatever reason, KVM will end
> > > repeatedly boosting the L1 vCPU regardless of the priority of the L2 workload.
> > >
> > > For things that aren't clearly in KVM's domain, I don't think we should implement
> > > KVM-specific functionality until every other option has been tried (and failed).
> > > I don't see any reason why KVM needs to get involved in scheduling, beyond maybe
> > > providing *input* regarding event injection, emphasis on *input* because KVM
> > > providing information to userspace or some other entity is wildly different than
> > > KVM making scheduling decisions based on that information.
> > >
> > > Pushing the scheduling policies to host userspace would allow for far more control
> > > and flexibility. E.g. a heavily paravirtualized environment where host userspace
> > > knows *exactly* what workloads are being run could have wildly different policies
> > > than an environment where the guest is a fairly vanilla Linux VM that has received
> > > a small amount of enlightment.
> > >
> > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > tailor made for the problems you are trying to solve.
> > >
> > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> >
> > I very much agree. There are some features missing from BPF that we'd
> > need to add to enable this, but they're on the roadmap, and I don't
> > think they'd be especially difficult to implement.
> >
> > The main building block that I was considering is a new kptr [0] and set
> > of kfuncs [1] that would allow a BPF program to have one or more R/W
> > shared memory regions with a guest.
>
> I really like your ideas around sharing memory across virt boundary using
> BPF. The one concern I would have is that, this does require loading a BPF
> program from the guest userspace, versus just having a guest kernel that
> 'does the right thing'.

Yeah, that's a fair concern. The problem is that I'm not sure how we get
around that if we want to avoid tying up every scheduling paravirt
policy into a KVM UAPI. Putting everything into UAPI just really doesn't
seem scalable. I'd be curious to hear Sean's thoughts on this. Note that
I'm not just talking about scheduling paravirt here -- one could imagine
many possible examples, e.g. relating to I/O, MM, etc.

It seems much more scalable to instead have KVM be responsible for
plumbing mappings between guest and host BPF programs (I haven't thought
through the design or interface for that at _all_, just thinking in
high-level terms here), and then those BPF programs could decide on
paravirt interfaces that don't have to be in UAPI. Having KVM be
responsible for setting up opaque communication channels between the
guest and host feels likes a more natural building block than having it
actually be aware of the policies being implemented over those
communication channels.

> On the host, I would have no problem loading a BPF program as a one-time
> thing, but on the guest it may be more complex as we don't always control the
> guest userspace and their BPF loading mechanisms. Example, an Android guest
> needs to have its userspace modified to load BPF progs, etc. Not hard
> problems, but flexibility comes with more cost. Last I recall, Android does
> not use a lot of the BPF features that come with the libbpf library because
> they write their own userspace from scratch (due to licensing). OTOH, if this
> was an Android kernel-only change, that would simplify a lot.

That is true, but the cost is double-sided. On the one hand, we have the
complexity and maintenance costs of plumbing all of this through KVM and
making it UAPI. On the other, we have the cost of needing to update a
user space framework to accommodate loading and managing BPF programs.
At this point BPF is fully supported on aarch64, so Android should have
everything it needs to use it. It sounds like it will require some
(maybe even a lot of) work to accommodate that, but that seems
preferable to compensating for gaps in a user space framework by adding
complexity to the kernel, no?

> Still there is a lot of merit to sharing memory with BPF and let BPF decide
> the format of the shared memory, than baking it into the kernel... so thanks
> for bringing this up! Lets talk more about it... Oh, and there's my LSFMMBPF
> invitiation request ;-) ;-).

Discussing this BPF feature at LSFMMBPF is a great idea -- I'll submit a
proposal for it and cc you. I looked and couldn't seem to find the
thread for your LSFMMBPF proposal. Would you mind please sending a link?

> > This could enable a wide swath of
> > BPF paravirt use cases that are not limited to scheduling, but in terms
> > of sched_ext, the BPF scheduler could communicate with the guest
> > scheduler over this shared memory region in whatever manner was required
> > for that use case.
> >
> > [0]: https://lwn.net/Articles/900749/
> > [1]: https://lwn.net/Articles/856005/
>
> Thanks, I had no idea about these. I have a question -- would it be possible
> to call the sched_setscheduler() function in core.c via a kfunc? Then we can
> trigger the boost from a BPF program on the host side. We could start simple
> from there.

There's nothing stopping us from adding a kfunc that calls
sched_setscheduler(). The questions are how other scheduler folks feel
about that, and whether that's the appropriate general solution to the
problem. It does seem odd to me to have a random KVM tracepoint be
entitled to set a generic task's scheduling policy and priority.

> I agree on the rest below. I just wanted to emphasize though that this patch
> series does not care about what the scheduler does. It merely hints the
> scheduler via a priority setting that something is an important task. That's
> a far cry from how to actually schedule and the spirit here is to use
> whatever scheduler the user has to decide how to actually schedule.

Yep, I don't disagree with you at all on that point. To me this really
comes down to a question of the costs and long-term design choices, as
we discussed above:

1. What is the long-term maintenance cost to KVM and the scheduler to
land paravirt scheduling in this form?

Even assuming that we go with the BPF hook on VMEXIT approach, unless
I'm missing something, I think we'd still need to make UAPI changes to
kvm_para.h, and update the guest to set the relevant paravirt state in
the guest scheduler. That's not a huge amount of code for just boosting
and deboosting, but it sets the precedent that any and all future
scheduling paravirt logic will need to go through UAPI, and that the
core scheduler will need to accommodate that paravirt when and where
appropriate.

I may be being overly pessimistic, but I feel that could open up a
rather scary can of worms; both in terms of the potential long-term
complexity in the kernel itself, and in terms of the maintenance burden
that may eventually be imposed to properly support it. It also imposes a
very high bar on being able to add and experiment with new paravirt
policies.

2. What is the cost we're imposing on users if we force paravirt to be
done through BPF? Is this prohibitively high?

There is certainly a nonzero cost. As you pointed out, right now Android
apparently doesn't use much BPF, and adding the requisite logic to use
and manage BPF programs is not insigificant.

Is that cost prohibitively high? I would say no. BPF should be fully
supported on aarch64 at this point, so it's really a user space problem.
Managing the system is what user space does best, and many other
ecosystems have managed to integrate BPF to great effect. So while the
cost is cetainly nonzero, I think there's a reasonable argument to be
made that it's not prohibitively high.

Thanks,
David


Attachments:
(No filename) (16.89 kB)
signature.asc (235.00 B)
Download all attachments

2024-01-09 17:28:19

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [RFC PATCH 7/8] sched/core: boost/unboost in guest scheduler



On 12/14/23 8:17 AM, Vineeth Pillai (Google) wrote:
> RT or higher priority tasks in guest is considered a critical workload
> and guest scheduler can request boost/unboost on a task switch and/or a
> task wakeup. Also share the preempt status of guest vcpu with the host
> so that host can take decision on boot/unboost.
>
> CONFIG_TRACE_PREEMPT_TOGGLE is enabled for using the function
> equivalent of preempt_count_{add,sub} to update the shared memory.
> Another option is to update the preempt_count_{add,sub} macros, but
> it will be more code churn and complex.
>
> Boost request is lazy, but unboost request is synchronous.
>
> Detect the feature in guest from cpuid flags and use the MSR to pass the
> GPA of memory location for sharing scheduling information.
>
> Co-developed-by: Joel Fernandes (Google) <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> ---
> arch/x86/Kconfig | 13 +++++
> arch/x86/include/asm/kvm_para.h | 7 +++
> arch/x86/kernel/kvm.c | 16 ++++++
> include/linux/sched.h | 21 ++++++++
> kernel/entry/common.c | 9 ++++
> kernel/sched/core.c | 93 ++++++++++++++++++++++++++++++++-
> 6 files changed, 158 insertions(+), 1 deletion(-)
>


Wish you all happy new year!
Sorry for the late reply. Took a while to go through this.

[...]
> /*
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b47f72b6595f..57f211f1b3d7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -151,6 +151,71 @@ const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
>
> __read_mostly int scheduler_running;
>
> +#ifdef CONFIG_PARAVIRT_SCHED
> +#include <linux/kvm_para.h>
> +
> +DEFINE_STATIC_KEY_FALSE(__pv_sched_enabled);
> +
> +DEFINE_PER_CPU_DECRYPTED(struct pv_sched_data, pv_sched) __aligned(64);
> +
> +unsigned long pv_sched_pa(void)
> +{
> + return slow_virt_to_phys(this_cpu_ptr(&pv_sched));
> +}
> +
> +bool pv_sched_vcpu_boosted(void)
> +{
> + return (this_cpu_read(pv_sched.boost_status) == VCPU_BOOST_BOOSTED);
> +}
> +
> +void pv_sched_boost_vcpu_lazy(void)
> +{
> + this_cpu_write(pv_sched.schedinfo.boost_req, VCPU_REQ_BOOST);
> +}
> +
> +void pv_sched_unboost_vcpu_lazy(void)
> +{
> + this_cpu_write(pv_sched.schedinfo.boost_req, VCPU_REQ_UNBOOST);
> +}
> +
> +void pv_sched_boost_vcpu(void)
> +{
> + pv_sched_boost_vcpu_lazy();
> + /*
> + * XXX: there could be a race between the boost_status check
> + * and hypercall.
> + */
> + if (this_cpu_read(pv_sched.boost_status) == VCPU_BOOST_NORMAL)
> + kvm_pv_sched_notify_host();
> +}
> +
> +void pv_sched_unboost_vcpu(void)
> +{
> + pv_sched_unboost_vcpu_lazy();
> + /*
> + * XXX: there could be a race between the boost_status check
> + * and hypercall.
> + */
> + if (this_cpu_read(pv_sched.boost_status) == VCPU_BOOST_BOOSTED &&
> + !preempt_count())
> + kvm_pv_sched_notify_host();
> +}
> +
> +/*
> + * Share the preemption enabled/disabled status with host. This will not incur a
> + * VMEXIT and acts as a lazy boost/unboost mechanism - host will check this on
> + * the next VMEXIT for boost/unboost decisions.
> + * XXX: Lazy unboosting may allow cfs tasks to run on RT vcpu till next VMEXIT.
> + */
> +static inline void pv_sched_update_preempt_status(bool preempt_disabled)
> +{
> + if (pv_sched_enabled())
> + this_cpu_write(pv_sched.schedinfo.preempt_disabled, preempt_disabled);
> +}
> +#else
> +static inline void pv_sched_update_preempt_status(bool preempt_disabled) {}
> +#endif
> +
> #ifdef CONFIG_SCHED_CORE
>

Wouldn't it be better to define a arch hook for this instead? implementation then could
follow depending on the architecture. This boosting for vcpu tasks in host may be of
interest to other hypervisors as well.


Currently I see there are two places where interaction is taking place for paravirt.
1. steal time accounting
2. vcpu_is_preempted
each architecture seems do this in their own way. So one option is an arch hook for
other paravirt interfaces as well. Other option is probably what was discussed. i.e
define a framework for paravirt interfaces from the ground up.


We are working on resource limiting aspect of para virtualization on powerVM.
Interface for that could be done via hypercall or via VPA (he VPA is a memory structure shared
between the hypervisor and OS, defined by PAPR). That would be one more paravirt interface.
As you mentioned in the cover letter, I am curious to know if you have tried the resource limiting,
since you have overcommit of vCPUs.

> DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
> @@ -2070,6 +2135,19 @@ unsigned long get_wchan(struct task_struct *p)
>
> static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> +#ifdef CONFIG_PARAVIRT_SCHED
> + /*
> + * TODO: currently request for boosting remote vcpus is not implemented. So
> + * we boost only if this enqueue happens for this cpu.
> + * This is not a big problem though, target cpu gets an IPI and then gets
> + * boosted by the host. Posted interrupts is an exception where target vcpu
> + * will not get boosted immediately, but on the next schedule().
> + */
> + if (pv_sched_enabled() && this_rq() == rq &&
> + sched_class_above(p->sched_class, &fair_sched_class))
> + pv_sched_boost_vcpu_lazy();
> +#endif
> +
> if (!(flags & ENQUEUE_NOCLOCK))
> update_rq_clock(rq);
>
> @@ -5835,6 +5913,8 @@ static inline void preempt_latency_start(int val)
> #ifdef CONFIG_DEBUG_PREEMPT
> current->preempt_disable_ip = ip;
> #endif
> + pv_sched_update_preempt_status(true);
> +
> trace_preempt_off(CALLER_ADDR0, ip);
> }
> }
> @@ -5867,8 +5947,10 @@ NOKPROBE_SYMBOL(preempt_count_add);
> */
> static inline void preempt_latency_stop(int val)
> {
> - if (preempt_count() == val)
> + if (preempt_count() == val) {
> + pv_sched_update_preempt_status(false);
> trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
> + }
> }
>
> void preempt_count_sub(int val)
> @@ -6678,6 +6760,15 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> rq->last_seen_need_resched_ns = 0;
> #endif
>
> +#ifdef CONFIG_PARAVIRT_SCHED
> + if (pv_sched_enabled()) {
> + if (sched_class_above(next->sched_class, &fair_sched_class))
> + pv_sched_boost_vcpu_lazy();
> + else if (next->sched_class == &fair_sched_class)
> + pv_sched_unboost_vcpu();
> + }
> +#endif
> +
> if (likely(prev != next)) {
> rq->nr_switches++;
> /*

2024-01-12 18:38:04

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

Hi Sean,

Happy new year. Just back from holidays, thanks for the discussions. I
replied below:

On Fri, Dec 15, 2023 at 02:01:14PM -0800, Sean Christopherson wrote:
> On Fri, Dec 15, 2023, Joel Fernandes wrote:
> > On Fri, Dec 15, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Fri, Dec 15, 2023, Joel Fernandes wrote:
> > > > Hi Sean,
> > > > Nice to see your quick response to the RFC, thanks. I wanted to
> > > > clarify some points below:
> > > >
> > > > On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <[email protected]> wrote:
> > > > >
> > > > > On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > > > > > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <[email protected]> wrote:
> > > > > > Now when I think about it, the implementation seems to
> > > > > > suggest that we are putting policies in kvm. Ideally, the goal is:
> > > > > > - guest scheduler communicates the priority requirements of the workload
> > > > > > - kvm applies the priority to the vcpu task.
> > > > >
> > > > > Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> > > > > is that userspace doesn't have the right knobs to adjust the priority of a task
> > > > > quickly and efficiently, then wouldn't it be better to solve that problem in a
> > > > > generic way?
> > > >
> > > > No, it is not only about tasks. We are boosting anything RT or above
> > > > such as softirq, irq etc as well.
> > >
> > > I was talking about the host side of things. A vCPU is a task, full stop. KVM
> > > *may* have some information that is useful to the scheduler, but KVM does not
> > > *need* to initiate adjustments to a vCPU's priority.
> >
> > Sorry I thought you were referring to guest tasks. You are right, KVM
> > does not *need* to change priority. But a vCPU is a container of tasks
> > who's priority dynamically changes. Still, I see your point of view
> > and that's also why we offer the capability to be selectively enabled
> > or disabled per-guest by the VMM (Vineeth will make it default off and
> > opt-in in the next series).
> >
> > > > Could you please see the other patches?
> > >
> > > I already have, see my comments about boosting vCPUs that have received
> > > NMIs and IRQs not necessarily being desirable.
> >
> > Ah, I was not on CC for that email. Seeing it now. I think I don't
> > fully buy that argument, hard IRQs are always high priority IMHO.
>
> They most definitely are not, and there are undoubtedly tiers of priority, e.g.
> tiers are part and parcel of the APIC architecture. I agree that *most* IRQs are
> high-ish priority, but that is not the same that *all* IRQs are high priority.
> It only takes one example to disprove the latter, and I can think of several off
> the top of my head.
>
> Nested virtualization is the easy one to demonstrate.
>
> On AMD, which doesn't have an equivalent to the VMX preemption timer, KVM uses a
> self-IPI to wrest control back from the guest immediately after VMRUN in some
> situations (mostly to inject events into L2 while honoring the architectural
> priority of events). If the guest is running a nested workload, the resulting
> IRQ in L1 is not at all interesting or high priority, as the L2 workload hasn't
> suddenly become high priority just because KVM wants to inject an event.

Ok I see your point of view and tend to agree with you. I think I was mostly
going by my experience which has predominantly been on embedded systems. One
counter argument I would make is that just because an IRQ event turned out
not to be high priority, doesn't mean it would never be. For instance, if the
guest was running something important and not just a nested workload, then
perhaps the self-IPI in your example would be important. As another example,
the scheduling clock interrupt may be a useless a lot of times but those
times that it does become useful make it worth prioritizing highly (example,
preempting a task after it exhausts slice, invoking RCU core to service
callbacks, etc).

But I see your point of view and tend to agree with it too!

>
> Anyways, I didn't mean to start a debate over the priority of handling IRQs and
> NMIs, quite the opposite actually. The point I'm trying to make is that under
> no circumstance do I want KVM to be making decisions about whether or not such
> things are high priority. I have no objection to KVM making information available
> to whatever entity is making the actual decisions, it's having policy in KVM that
> I am staunchly opposed to.

Ok makes sense, I think we have converged on this now and I agree it seems
wrong to bake the type of policy we were attempting into KVM, which may not
work for everyone and makes it inflexible to experiment. So back to the
drawing board on that..

> > If an hrtimer expires on a CPU running a low priority workload, that
> > hrtimer might itself wake up a high priority thread. If we don't boost
> > the hrtimer interrupt handler, then that will delay the wakeup as
> > well. It is always a chain of events and it has to be boosted from the
> > first event. If a system does not wish to give an interrupt a high
> > priority, then the typical way is to use threaded IRQs and lower the
> > priority of the thread. That will give the interrupt handler lower
> > priority and the guest is free to do that. We had many POCs before
> > where we don't boost at all for interrupts and they all fall apart.
> > This is the only POC that works without any issues as far as we know
> > (we've been trying to do this for a long time :P).
>
> In *your* environment. The fact that it took multiple months to get a stable,
> functional set of patches for one use case is *exactly* why I am pushing back on
> this. Change any number of things about the setup and odds are good that the
> result would need different tuning. E.g. the ratio of vCPUs to pCPUs, the number
> of VMs, the number of vCPUs per VM, whether or not the host kernel is preemptible,
> whether or not the guest kernel is preemptible, the tick rate of the host and
> guest kernels, the workload of the VM, the affinity of tasks within the VM, and
> and so on and so forth.
>
> It's a catch-22 of sorts. Anything that is generic enough to land upstream is
> likely going to be too coarse grained to be universally applicable.

Ok, agreed.

> > Regarding perf, I similarly disagree. I think a PMU event is super
> > important (example, some versions of the kernel watchdog that depend
> > on PMU fail without it). But if some VM does not want this to be
> > prioritized, they could just not opt-in for the feature IMO. I can see
> > your point of view that not all VMs may want this behavior though.
>
> Or a VM may want it conditionally, e.g. only for select tasks.

Got it.

> > > > At the moment, the only ABI is a shared memory structure and a custom
> > > > MSR. This is no different from the existing steal time accounting
> > > > where a shared structure is similarly shared between host and guest,
> > > > we could perhaps augment that structure with other fields instead of
> > > > adding a new one?
> > >
> > > I'm not concerned about the number of structures/fields, it's the amount/type of
> > > information and the behavior of KVM that is problematic. E.g. boosting the priority
> > > of a vCPU that has a pending NMI is dubious.
> >
> > I think NMIs have to be treated as high priority, the perf profiling
> > interrupt for instance works well on x86 (unlike ARM) because it can
> > interrupt any context (other than NMI and possibly the machine check
> > ones). On ARM on the other hand, because the perf interrupt is a
> > regular non-NMI interrupt, you cannot profile hardirq and IRQ-disable
> > regions (this could have changed since pseudo-NMI features). So making
> > the NMI a higher priority than IRQ is not dubious AFAICS, it is a
> > requirement in many cases IMHO.
>
> Again, many, but not all. A large part of KVM's success is that KVM has very few
> "opinions" of its own. Outside of the MMU and a few paravirt paths, KVM mostly
> just emulates/virtualizes hardware according to the desires of userspace. This
> has allowed a fairly large variety of use cases to spring up with relatively few
> changes to KVM.
>
> What I want to avoid is (a) adding something that only works for one use case
> and (b) turning KVM into a scheduler of any kind.

Ok. Makes sense.

> > > Which illustrates one of the points I'm trying to make is kind of my point.
> > > Upstream will never accept anything that's wildly complex or specific because such
> > > a thing is unlikely to be maintainable.
> >
> > TBH, it is not that complex though.
>
> Yet. Your use case is happy with relatively simple, coarse-grained hooks. Use
> cases that want to squeeze out maximum performance, e.g. because shaving N% off
> the runtime saves $$$, are likely willing to take on far more complexity, or may
> just want to make decisions at a slightly different granularity.

Ok.

> > But let us know which parts, if any, can be further simplified (I saw your
> > suggestions for next steps in the reply to Vineeth, those look good to me and
> > we'll plan accordingly).
>
> It's not a matter of simplifying things, it's a matter of KVM (a) not defining
> policy of any kind and (b) KVM not defining a guest/host ABI.

Ok.

> > > > We have to intervene *before* the scheduler takes the vCPU thread off the
> > > > CPU.
> > >
> > > If the host scheduler is directly involved in the paravirt shenanigans, then
> > > there is no need to hook KVM's VM-Exit path because the scheduler already has the
> > > information needed to make an informed decision.
> >
> > Just to clarify, we're not making any "decisions" in the VM exit path,
>
> Yes, you are.

Ok sure, I see what you mean. We're making decisions about the "when to
boost" thing. Having worked on the scheduler for a few years, I was more
referring to "decisions that the scheduler makes such as where to place a
task during wakeup, load balance, whether to migrate, etc" due to my bias.
Those are complex decisions that even with these patches, it is upto the
scheduler. But I see you were referring to a different type of decision
making, so agreed with you.

> > we're just giving the scheduler enough information (via the
> > setscheduler call). The scheduler may just as well "decide" it wants
> > to still preempt the vCPU thread -- that's Ok in fact required some
> > times. We're just arming it with more information, specifically that
> > this is an important thread. We can find another way to pass this
> > information along (BPF etc) but I just wanted to mention that KVM is
> > not really replacing the functionality or decision-making of the
> > scheduler even with this POC.
>
> Yes, it is. kvm_vcpu_kick_boost() *directly* adjusts the priority of the task.
> KVM is not just passing a message, KVM is defining a scheduling policy of "boost
> vCPUs with pending IRQs, NMIs, SMIs, and PV unhalt events".

Fair enough.

> The VM-Exit path also makes those same decisions by boosting a vCPU if the guest
> has requested boost *or* the vCPU has a pending event (but oddly, not pending
> NMIs, SMIs, or PV unhalt events):
>
> bool pending_event = kvm_cpu_has_pending_timer(vcpu) || kvm_cpu_has_interrupt(vcpu);
>
> /*
> * vcpu needs a boost if
> * - A lazy boost request active or a pending latency sensitive event, and
> * - Preemption disabled duration on this vcpu has not crossed the threshold.
> */
> return ((schedinfo.boost_req == VCPU_REQ_BOOST || pending_event) &&
> !kvm_vcpu_exceeds_preempt_disabled_duration(&vcpu->arch));
>
>
> Which, by the by is suboptimal. Detecting for pending events isn't free, so you
> ideally want to check for pending events if and only if the guest hasn't requested
> a boost.

That sounds like a worthwhile optimization! Agreed that the whole
boosting-a-pending-event can also be looked at as policy..

> > > > Similarly, in the case of an interrupt injected into the guest, we have
> > > > to boost the vCPU before the "vCPU run" stage -- anything later might be too
> > > > late.
> > >
> > > Except that this RFC doesn't actually do this. KVM's relevant function names suck
> > > and aren't helping you, but these patches boost vCPUs when events are *pended*,
> > > not when they are actually injected.
> >
> > We are doing the injection bit in:
> > https://lore.kernel.org/all/[email protected]/
> >
> > For instance, in:
> >
> > kvm_set_msi ->
> > kvm_irq_delivery_to_apic ->
> > kvm_apic_set_irq ->
> > __apic_accept_irq ->
> > kvm_vcpu_kick_boost();
> >
> > The patch is a bit out of order because patch 4 depends on 3. Patch 3
> > does what you're referring to, which is checking for pending events.
> >
> > Did we miss something? If there is some path that we are missing, your
> > help is much appreciated as you're likely much more versed with this
> > code than me. But doing the boosting at injection time is what has
> > made all the difference (for instance with cyclictest latencies).
>
> That accepts in IRQ into the vCPU's local APIC, it does not *inject* the IRQ into
> the vCPU proper. The actual injection is done by kvm_check_and_inject_events().
> A pending IRQ is _usually_ delivered/injected fairly quickly, but not always.
>
> E.g. if the guest has IRQs disabled (RFLAGS.IF=0), KVM can't immediately inject
> the IRQ (without violating x86 architecture). In that case, KVM will twiddle
> VMCS/VMCB knobs to detect an IRQ window, i.e. to cause a VM-Exit when IRQs are
> no longer blocked in the guest.
>
> Your PoC actually (mostly) handles this (see above) by keeping the vCPU boosted
> after EXIT_REASON_INTERRUPT_WINDOW (because the IRQ will obviously still be pending).

Ah got it now, thanks for the detailed description of the implementation and
what we might be missing. Will dig further.

Thanks Sean!

- Joel


2024-01-24 02:15:37

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

Hi David,
I got again side tracked. I'll now prioritize this thread with quicker
(hopefully daily) replies.

>
>>>> On Wed, Dec 13, 2023, Vineeth Pillai (Google) wrote:
>>>>> Double scheduling is a concern with virtualization hosts where the host
>>>>> schedules vcpus without knowing whats run by the vcpu and guest schedules
>>>>> tasks without knowing where the vcpu is physically running. This causes
>>>>> issues related to latencies, power consumption, resource utilization
>>>>> etc. An ideal solution would be to have a cooperative scheduling
>>>>> framework where the guest and host shares scheduling related information
>>>>> and makes an educated scheduling decision to optimally handle the
>>>>> workloads. As a first step, we are taking a stab at reducing latencies
>>>>> for latency sensitive workloads in the guest.
>>>>>
>>>>> This series of patches aims to implement a framework for dynamically
>>>>> managing the priority of vcpu threads based on the needs of the workload
>>>>> running on the vcpu. Latency sensitive workloads (nmi, irq, softirq,
>>>>> critcal sections, RT tasks etc) will get a boost from the host so as to
>>>>> minimize the latency.
>>>>>
>>>>> The host can proactively boost the vcpu threads when it has enough
>>>>> information about what is going to run on the vcpu - fo eg: injecting
>>>>> interrupts. For rest of the case, guest can request boost if the vcpu is
>>>>> not already boosted. The guest can subsequently request unboost after
>>>>> the latency sensitive workloads completes. Guest can also request a
>>>>> boost if needed.
>>>>>
>>>>> A shared memory region is used to communicate the scheduling information.
>>>>> Guest shares its needs for priority boosting and host shares the boosting
>>>>> status of the vcpu. Guest sets a flag when it needs a boost and continues
>>>>> running. Host reads this on next VMEXIT and boosts the vcpu thread. For
>>>>> unboosting, it is done synchronously so that host workloads can fairly
>>>>> compete with guests when guest is not running any latency sensitive
>>>>> workload.
>>>>
>>>> Big thumbs down on my end. Nothing in this RFC explains why this should be done
>>>> in KVM. In general, I am very opposed to putting policy of any kind into KVM,
>>>> and this puts a _lot_ of unmaintainable policy into KVM by deciding when to
>>>> start/stop boosting a vCPU.
>>>
>>> I have to agree, not least of which is because in addition to imposing a
>>> severe maintenance tax, these policies are far from exhaustive in terms
>>> of what you may want to do for cooperative paravirt scheduling.
>>
>> Just to clarify the 'policy' we are discussing here, it is not about 'how to
>> schedule' but rather 'how/when to boost/unboost'. We want the existing
>> scheduler (or whatever it might be in the future) to make the actual decision
>> about how to schedule.
>
> Thanks for clarifying. I think we're on the same page. I didn't mean to
> imply that KVM is actually in the scheduler making decisions about what
> task to run next, but that wasn't really my concern. My concern is that
> this patch set makes KVM responsible for all of the possible paravirt
> policies by encoding it in KVM UAPI, and is ultimately responsible for
> being aware of and communicating those policies between the guest to the
> host scheduler.
>
> Say that we wanted to add some other paravirt related policy like "these
> VCPUs may benefit from being co-located", or, "this VCPU just grabbed a
> critical spinlock so please pin me for a moment". That would require
> updating struct guest_schedinfo UAPI in kvm_para.h, adding getters and
> setters to kvm_host.h to set those policies for the VCPU (though your
> idea to use a BPF hook on VMEXIT may help with that onme), setting the
> state from the guest, etc.

These are valid points, and I agree!

>
> KVM isn't making scheduling decisions, but it's the arbiter of what data
> is available to the scheduler to consume. As it relates to a VCPU, it
> seems like this patch set makes KVM as much invested in the scheduling
> decision that's eventually made as the actual scheduler. Also worth
> considering is that it ties KVM UAPI to sched/core.c, which seems
> undesirable from the perspective of both subsystems.

Ok, Agreed.

>
>> In that sense, I agree with Sean that we are probably forcing a singular
>> policy on when and how to boost which might not work for everybody (in theory
>> anyway). And I am perfectly OK with the BPF route as I mentioned in the other
>
> FWIW, I don't think it's just that boosting may not work well in every
> context (though I do think there's an argument to be made for that, as
> Sean pointed out r.e. hard IRQs in nested context). The problem is also
> that boosting is just one example of a paravirt policy that you may want
> to enable, as I alluded to above, and that comes with complexity and
> maintainership costs.

Ok, agreed.

>
>> email. So perhaps we can use a tracepoint in the VMEXIT path to run a BPF
>> program (?). And we still have to figure out how to run BPF programs in the
>> interrupt injections patch (I am currently studying those paths more also
>> thanks to Sean's email describing them).
>
> If I understand correctly, based on your question below, the idea is to
> call sched_setscheduler() from a kfunc in this VMEXIT BPF tracepoint?
> Please let me know if that's correct -- I'll respond to this below where
> you ask the question.

Yes that's correct.

>
> As an aside, even if we called a BPF tracepoint prog on the VMEXIT path,
> AFAIU it would still require UAPI changes given that we'd still need to
> make all the same changes in the guest, right?

By UAPI, do you mean hypercall or do you mean shared memory? If hypercall, we
actually don't need hypercall for boosting. We boost during VMEXIT. We only need
to set some state from the guest, in shared memory to hint that it might be
needed at some point in the future. If no preemption-induced VMEXIT happens,
then no scheduler boosting happens (or needs to happen).

There might be a caveat to the unboosting path though needing a hypercall and I
need to check with Vineeth on his latest code whether it needs a hypercall, but
we could probably figure that out. In the latest design, one thing I know is
that we just have to force a VMEXIT for both boosting and unboosting. Well for
boosting, the VMEXIT just happens automatically due to vCPU preemption, but for
unboosting it may not.

In any case, can we not just force a VMEXIT from relevant path within the guest,
again using a BPF program? I don't know what the BPF prog to do that would look
like, but I was envisioning we would call a BPF prog from within a guest if
needed at relevant point (example, return to guest userspace).

Does that make sense?

> I see why having a BPF
> hook here would be useful to avoid some of the logic on the host that
> implements the boosting, and to give more flexibility as to when to
> apply that boosting, but in my mind it seems like the UAPI concerns are
> the most relevant.

Yes, lets address the UAPI. My plan is to start a new design document like a
google doc, and I could share that with you so we can sketch this out. What do
you think? And perhaps also we can discuss it at LSFMM.

>
>> [...]
>>>> Concretely, boosting vCPUs for most events is far too coarse grained. E.g. boosting
>>>> a vCPU that is running a low priority workload just because the vCPU triggered
>>>> an NMI due to PMU counter overflow doesn't make sense. Ditto for if a guest's
>>>> hrtimer expires on a vCPU running a low priority workload.
>>>>
>>>> And as evidenced by patch 8/8, boosting vCPUs based on when an event is _pending_
>>>> is not maintainable. As hardware virtualizes more and more functionality, KVM's
>>>> visilibity into the guest effectively decreases, e.g. Intel and AMD both support
>>>> with IPI virtualization.
>>>>
>>>> Boosting the target of a PV spinlock kick is similarly flawed. In that case, KVM
>>>> only gets involved _after_ there is a problem, i.e. after a lock is contended so
>>>> heavily that a vCPU stops spinning and instead decided to HLT. It's not hard to
>>>> imagine scenarios where a guest would want to communicate to the host that it's
>>>> acquiring a spinlock for a latency sensitive path and so shouldn't be scheduled
>>>> out. And of course that's predicated on the assumption that all vCPUs are subject
>>>> to CPU overcommit.
>>>>
>>>> Initiating a boost from the host is also flawed in the sense that it relies on
>>>> the guest to be on the same page as to when it should stop boosting. E.g. if
>>>> KVM boosts a vCPU because an IRQ is pending, but the guest doesn't want to boost
>>>> IRQs on that vCPU and thus doesn't stop boosting at the end of the IRQ handler,
>>>> then the vCPU could end up being boosted long after its done with the IRQ.
>>>>
>>>> Throw nested virtualization into the mix and then all of this becomes nigh
>>>> impossible to sort out in KVM. E.g. if an L1 vCPU is a running an L2 vCPU, i.e.
>>>> a nested guest, and L2 is spamming interrupts for whatever reason, KVM will end
>>>> repeatedly boosting the L1 vCPU regardless of the priority of the L2 workload.
>>>>
>>>> For things that aren't clearly in KVM's domain, I don't think we should implement
>>>> KVM-specific functionality until every other option has been tried (and failed).
>>>> I don't see any reason why KVM needs to get involved in scheduling, beyond maybe
>>>> providing *input* regarding event injection, emphasis on *input* because KVM
>>>> providing information to userspace or some other entity is wildly different than
>>>> KVM making scheduling decisions based on that information.
>>>>
>>>> Pushing the scheduling policies to host userspace would allow for far more control
>>>> and flexibility. E.g. a heavily paravirtualized environment where host userspace
>>>> knows *exactly* what workloads are being run could have wildly different policies
>>>> than an environment where the guest is a fairly vanilla Linux VM that has received
>>>> a small amount of enlightment.
>>>>
>>>> Lastly, if the concern/argument is that userspace doesn't have the right knobs
>>>> to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
>>>> tailor made for the problems you are trying to solve.
>>>>
>>>> https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
>>>
>>> I very much agree. There are some features missing from BPF that we'd
>>> need to add to enable this, but they're on the roadmap, and I don't
>>> think they'd be especially difficult to implement.
>>>
>>> The main building block that I was considering is a new kptr [0] and set
>>> of kfuncs [1] that would allow a BPF program to have one or more R/W
>>> shared memory regions with a guest.
>>
>> I really like your ideas around sharing memory across virt boundary using
>> BPF. The one concern I would have is that, this does require loading a BPF
>> program from the guest userspace, versus just having a guest kernel that
>> 'does the right thing'.
>
> Yeah, that's a fair concern. The problem is that I'm not sure how we get
> around that if we want to avoid tying up every scheduling paravirt
> policy into a KVM UAPI. Putting everything into UAPI just really doesn't
> seem scalable. I'd be curious to hear Sean's thoughts on this. Note that
> I'm not just talking about scheduling paravirt here -- one could imagine
> many possible examples, e.g. relating to I/O, MM, etc.

We could do same thing in guest, call BPF prog at a certain point, if needed.
But again, since we only need to bother with VMEXIT for scheduler boosting
(which doesn't need hypercall), it should be Ok. For the unboosting part, we
could implement that also using either a BPF prog at appropriate guest hook, or
just having a timer go off to take away boost if boosting has been done too
long. We could award a boost for a bounded time as well, and force a VMEXIT to
unboost if VMEXIT has not already happened yet.. there should be many ways to
avoid an unboost hypercall..

>
> It seems much more scalable to instead have KVM be responsible for
> plumbing mappings between guest and host BPF programs (I haven't thought
> through the design or interface for that at _all_, just thinking in
> high-level terms here), and then those BPF programs could decide on
> paravirt interfaces that don't have to be in UAPI.

It sounds like by UAPI, you mean hypercalls right? The actual shared memory
structure should not be a UAPI concern since that will defined by the BPF
program and how it wants to interpret the fields..

> Having KVM be
> responsible for setting up opaque communication channels between the
> guest and host feels likes a more natural building block than having it
> actually be aware of the policies being implemented over those
> communication channels.

Agreed.

>
>> On the host, I would have no problem loading a BPF program as a one-time
>> thing, but on the guest it may be more complex as we don't always control the
>> guest userspace and their BPF loading mechanisms. Example, an Android guest
>> needs to have its userspace modified to load BPF progs, etc. Not hard
>> problems, but flexibility comes with more cost. Last I recall, Android does
>> not use a lot of the BPF features that come with the libbpf library because
>> they write their own userspace from scratch (due to licensing). OTOH, if this
>> was an Android kernel-only change, that would simplify a lot.
>
> That is true, but the cost is double-sided. On the one hand, we have the
> complexity and maintenance costs of plumbing all of this through KVM and
> making it UAPI. On the other, we have the cost of needing to update a
> user space framework to accommodate loading and managing BPF programs.
> At this point BPF is fully supported on aarch64, so Android should have
> everything it needs to use it. It sounds like it will require some
> (maybe even a lot of) work to accommodate that, but that seems
> preferable to compensating for gaps in a user space framework by adding
> complexity to the kernel, no?

Yes it should be preferable.

>
>> Still there is a lot of merit to sharing memory with BPF and let BPF decide
>> the format of the shared memory, than baking it into the kernel... so thanks
>> for bringing this up! Lets talk more about it... Oh, and there's my LSFMMBPF
>> invitiation request ;-) ;-).
>
> Discussing this BPF feature at LSFMMBPF is a great idea -- I'll submit a
> proposal for it and cc you. I looked and couldn't seem to find the
> thread for your LSFMMBPF proposal. Would you mind please sending a link?

I actually have not even submitted one for LSFMM but my management is supportive
of my visit. Do you want to go ahead and submit one with all of us included in
the proposal? And I am again sorry for the late reply and hopefully we did not
miss any deadlines. Also on related note, there is interest in sched_ext for
more custom scheduling. We could discuss that as well while at LSFMM.

>
>>> This could enable a wide swath of
>>> BPF paravirt use cases that are not limited to scheduling, but in terms
>>> of sched_ext, the BPF scheduler could communicate with the guest
>>> scheduler over this shared memory region in whatever manner was required
>>> for that use case.
>>>
>>> [0]: https://lwn.net/Articles/900749/
>>> [1]: https://lwn.net/Articles/856005/
>>
>> Thanks, I had no idea about these. I have a question -- would it be possible
>> to call the sched_setscheduler() function in core.c via a kfunc? Then we can
>> trigger the boost from a BPF program on the host side. We could start simple
>> from there.
>
> There's nothing stopping us from adding a kfunc that calls
> sched_setscheduler(). The questions are how other scheduler folks feel
> about that, and whether that's the appropriate general solution to the
> problem. It does seem odd to me to have a random KVM tracepoint be
> entitled to set a generic task's scheduling policy and priority.

Such oddities are application specific though. User space can already call
setscheduler arbitrarily, so why not a BPF program?

>
>> I agree on the rest below. I just wanted to emphasize though that this patch
>> series does not care about what the scheduler does. It merely hints the
>> scheduler via a priority setting that something is an important task. That's
>> a far cry from how to actually schedule and the spirit here is to use
>> whatever scheduler the user has to decide how to actually schedule.
>
> Yep, I don't disagree with you at all on that point. To me this really
> comes down to a question of the costs and long-term design choices, as
> we discussed above:
>
> 1. What is the long-term maintenance cost to KVM and the scheduler to
> land paravirt scheduling in this form?
>
> Even assuming that we go with the BPF hook on VMEXIT approach, unless
> I'm missing something, I think we'd still need to make UAPI changes to
> kvm_para.h, and update the guest to set the relevant paravirt state in
> the guest scheduler.

As mentioned above, for boosting, there is no hypercall. The VMEXIT is induced
by host preemption.

> That's not a huge amount of code for just boosting
> and deboosting, but it sets the precedent that any and all future
> scheduling paravirt logic will need to go through UAPI, and that the
> core scheduler will need to accommodate that paravirt when and where
> appropriate.
>
> I may be being overly pessimistic, but I feel that could open up a
> rather scary can of worms; both in terms of the potential long-term
> complexity in the kernel itself, and in terms of the maintenance burden
> that may eventually be imposed to properly support it. It also imposes a
> very high bar on being able to add and experiment with new paravirt
> policies.

Hmm, yeah lets discuss this more. It does appear we need to do *something* than
leaving the performance on the table.

>
> 2. What is the cost we're imposing on users if we force paravirt to be
> done through BPF? Is this prohibitively high?
>
> There is certainly a nonzero cost. As you pointed out, right now Android
> apparently doesn't use much BPF, and adding the requisite logic to use
> and manage BPF programs is not insigificant.
>
> Is that cost prohibitively high? I would say no. BPF should be fully
> supported on aarch64 at this point, so it's really a user space problem.
> Managing the system is what user space does best, and many other
> ecosystems have managed to integrate BPF to great effect. So while the
> cost is cetainly nonzero, I think there's a reasonable argument to be
> made that it's not prohibitively high.

Yes, I think it is doable.

Glad to be able to finally reply, and I shall prioritize this thread more on my
side moving forward.

thanks,

- Joel

2024-01-24 17:09:34

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Tue, Jan 23, 2024 at 09:15:10PM -0500, Joel Fernandes wrote:
> Hi David,
> I got again side tracked. I'll now prioritize this thread with quicker
> (hopefully daily) replies.

Hi Joel,

Thanks for your detailed reply.

[...]

> > Thanks for clarifying. I think we're on the same page. I didn't mean to
> > imply that KVM is actually in the scheduler making decisions about what
> > task to run next, but that wasn't really my concern. My concern is that
> > this patch set makes KVM responsible for all of the possible paravirt
> > policies by encoding it in KVM UAPI, and is ultimately responsible for
> > being aware of and communicating those policies between the guest to the
> > host scheduler.
> >
> > Say that we wanted to add some other paravirt related policy like "these
> > VCPUs may benefit from being co-located", or, "this VCPU just grabbed a
> > critical spinlock so please pin me for a moment". That would require
> > updating struct guest_schedinfo UAPI in kvm_para.h, adding getters and
> > setters to kvm_host.h to set those policies for the VCPU (though your
> > idea to use a BPF hook on VMEXIT may help with that onme), setting the
> > state from the guest, etc.
>
> These are valid points, and I agree!
>
> >
> > KVM isn't making scheduling decisions, but it's the arbiter of what data
> > is available to the scheduler to consume. As it relates to a VCPU, it
> > seems like this patch set makes KVM as much invested in the scheduling
> > decision that's eventually made as the actual scheduler. Also worth
> > considering is that it ties KVM UAPI to sched/core.c, which seems
> > undesirable from the perspective of both subsystems.
>
> Ok, Agreed.
>
> >
> >> In that sense, I agree with Sean that we are probably forcing a singular
> >> policy on when and how to boost which might not work for everybody (in theory
> >> anyway). And I am perfectly OK with the BPF route as I mentioned in the other
> >
> > FWIW, I don't think it's just that boosting may not work well in every
> > context (though I do think there's an argument to be made for that, as
> > Sean pointed out r.e. hard IRQs in nested context). The problem is also
> > that boosting is just one example of a paravirt policy that you may want
> > to enable, as I alluded to above, and that comes with complexity and
> > maintainership costs.
>
> Ok, agreed.
>
> >
> >> email. So perhaps we can use a tracepoint in the VMEXIT path to run a BPF
> >> program (?). And we still have to figure out how to run BPF programs in the
> >> interrupt injections patch (I am currently studying those paths more also
> >> thanks to Sean's email describing them).
> >
> > If I understand correctly, based on your question below, the idea is to
> > call sched_setscheduler() from a kfunc in this VMEXIT BPF tracepoint?
> > Please let me know if that's correct -- I'll respond to this below where
> > you ask the question.
>
> Yes that's correct.
>
> >
> > As an aside, even if we called a BPF tracepoint prog on the VMEXIT path,
> > AFAIU it would still require UAPI changes given that we'd still need to
> > make all the same changes in the guest, right?
>
> By UAPI, do you mean hypercall or do you mean shared memory? If hypercall, we
> actually don't need hypercall for boosting. We boost during VMEXIT. We only need
> to set some state from the guest, in shared memory to hint that it might be
> needed at some point in the future. If no preemption-induced VMEXIT happens,
> then no scheduler boosting happens (or needs to happen).

So I was referring to setting state in shared memory from the guest.
Specifically, the UAPI defined in [0] and set in [1] (also shown below).
There's no hypercall here, right? But we're still adding UAPI for the
shared data structure written by the guest and consumed by the host.
Please let me know if I'm missing something, which seems very possible.

[0]: https://lore.kernel.org/all/[email protected]/
[1]: https://lore.kernel.org/all/[email protected]/

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6b1dea07a563..e53c3f3a88d7 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -167,11 +167,30 @@ enum kvm_vcpu_boost_state {
VCPU_BOOST_BOOSTED
};

+/*
+ * Boost Request from guest to host for lazy boosting.
+ */
+enum kvm_vcpu_boost_request {
+ VCPU_REQ_NONE = 0,
+ VCPU_REQ_UNBOOST,
+ VCPU_REQ_BOOST,
+};
+
+
+union guest_schedinfo {
+ struct {
+ __u8 boost_req;
+ __u8 preempt_disabled;
+ };
+ __u64 pad;
+};
+
/*
* Structure passed in via MSR_KVM_PV_SCHED
*/
struct pv_sched_data {
__u64 boost_status;
+ union guest_schedinfo schedinfo;
};

> There might be a caveat to the unboosting path though needing a hypercall and I
> need to check with Vineeth on his latest code whether it needs a hypercall, but
> we could probably figure that out. In the latest design, one thing I know is
> that we just have to force a VMEXIT for both boosting and unboosting. Well for
> boosting, the VMEXIT just happens automatically due to vCPU preemption, but for
> unboosting it may not.

As mentioned above, I think we'd need to add UAPI for setting state from
the guest scheduler, even if we didn't use a hypercall to induce a
VMEXIT, right?

> In any case, can we not just force a VMEXIT from relevant path within the guest,
> again using a BPF program? I don't know what the BPF prog to do that would look
> like, but I was envisioning we would call a BPF prog from within a guest if
> needed at relevant point (example, return to guest userspace).

I agree it would be useful to have a kfunc that could be used to force a
VMEXIT if we e.g. need to trigger a resched or something. In general
that seems like a pretty reasonable building block for something like
this. I expect there are use cases where doing everything async would be
useful as well. We'll have to see what works well in experimentation.

> Does that make sense?

Yes it does, though I want to make sure I'm not misunderstanding what
you mean by shared memory vs. hypercall as it relates to UAPI.

> > I see why having a BPF
> > hook here would be useful to avoid some of the logic on the host that
> > implements the boosting, and to give more flexibility as to when to
> > apply that boosting, but in my mind it seems like the UAPI concerns are
> > the most relevant.
>
> Yes, lets address the UAPI. My plan is to start a new design document like a
> google doc, and I could share that with you so we can sketch this out. What do
> you think? And perhaps also we can discuss it at LSFMM.

Awesome, thank you for writing that up. I'm happy to take a look when
it's ready for more eyes.

[...]

> >>> The main building block that I was considering is a new kptr [0] and set
> >>> of kfuncs [1] that would allow a BPF program to have one or more R/W
> >>> shared memory regions with a guest.
> >>
> >> I really like your ideas around sharing memory across virt boundary using
> >> BPF. The one concern I would have is that, this does require loading a BPF
> >> program from the guest userspace, versus just having a guest kernel that
> >> 'does the right thing'.
> >
> > Yeah, that's a fair concern. The problem is that I'm not sure how we get
> > around that if we want to avoid tying up every scheduling paravirt
> > policy into a KVM UAPI. Putting everything into UAPI just really doesn't
> > seem scalable. I'd be curious to hear Sean's thoughts on this. Note that
> > I'm not just talking about scheduling paravirt here -- one could imagine
> > many possible examples, e.g. relating to I/O, MM, etc.
>
> We could do same thing in guest, call BPF prog at a certain point, if needed.
> But again, since we only need to bother with VMEXIT for scheduler boosting
> (which doesn't need hypercall), it should be Ok. For the unboosting part, we

Hopefully my explanation above r.e. shared memory makes sense. My worry
isn't just for hypercalls, it's that we also need to make UAPI changes
for the guest to set the shared memory state that's read by the host. If
we instead had the BPF program in the guest setting state via its shared
memory channel with the BPF prog on the host, that's a different story.

> could implement that also using either a BPF prog at appropriate guest hook, or
> just having a timer go off to take away boost if boosting has been done too
> long. We could award a boost for a bounded time as well, and force a VMEXIT to
> unboost if VMEXIT has not already happened yet.. there should be many ways to
> avoid an unboost hypercall..
>
> > It seems much more scalable to instead have KVM be responsible for
> > plumbing mappings between guest and host BPF programs (I haven't thought
> > through the design or interface for that at _all_, just thinking in
> > high-level terms here), and then those BPF programs could decide on
> > paravirt interfaces that don't have to be in UAPI.
>
> It sounds like by UAPI, you mean hypercalls right? The actual shared memory
> structure should not be a UAPI concern since that will defined by the BPF
> program and how it wants to interpret the fields..

See above -- I mean the shared data structure added in patch 3 / 8
("kvm: x86: vcpu boosting/unboosting framework").

> > Having KVM be
> > responsible for setting up opaque communication channels between the
> > guest and host feels likes a more natural building block than having it
> > actually be aware of the policies being implemented over those
> > communication channels.
>
> Agreed.
>
> >
> >> On the host, I would have no problem loading a BPF program as a one-time
> >> thing, but on the guest it may be more complex as we don't always control the
> >> guest userspace and their BPF loading mechanisms. Example, an Android guest
> >> needs to have its userspace modified to load BPF progs, etc. Not hard
> >> problems, but flexibility comes with more cost. Last I recall, Android does
> >> not use a lot of the BPF features that come with the libbpf library because
> >> they write their own userspace from scratch (due to licensing). OTOH, if this
> >> was an Android kernel-only change, that would simplify a lot.
> >
> > That is true, but the cost is double-sided. On the one hand, we have the
> > complexity and maintenance costs of plumbing all of this through KVM and
> > making it UAPI. On the other, we have the cost of needing to update a
> > user space framework to accommodate loading and managing BPF programs.
> > At this point BPF is fully supported on aarch64, so Android should have
> > everything it needs to use it. It sounds like it will require some
> > (maybe even a lot of) work to accommodate that, but that seems
> > preferable to compensating for gaps in a user space framework by adding
> > complexity to the kernel, no?
>
> Yes it should be preferable.
>
> >
> >> Still there is a lot of merit to sharing memory with BPF and let BPF decide
> >> the format of the shared memory, than baking it into the kernel... so thanks
> >> for bringing this up! Lets talk more about it... Oh, and there's my LSFMMBPF
> >> invitiation request ;-) ;-).
> >
> > Discussing this BPF feature at LSFMMBPF is a great idea -- I'll submit a
> > proposal for it and cc you. I looked and couldn't seem to find the
> > thread for your LSFMMBPF proposal. Would you mind please sending a link?
>
> I actually have not even submitted one for LSFMM but my management is supportive
> of my visit. Do you want to go ahead and submit one with all of us included in
> the proposal? And I am again sorry for the late reply and hopefully we did not
> miss any deadlines. Also on related note, there is interest in sched_ext for

I see that you submitted a proposal in [2] yesterday. Thanks for writing
it up, it looks great and I'll comment on that thread adding a +1 for
the discussion.

[2]: https://lore.kernel.org/all/[email protected]/

No worries at all about the reply latency. Thank you for being so open
to discussing different approaches, and for driving the discussion. I
think this could be a very powerful feature for the kernel so I'm
pretty excited to further flesh out the design and figure out what makes
the most sense here.

> more custom scheduling. We could discuss that as well while at LSFMM.

Sure thing. I haven't proposed a topic for that yet as I didn't have
anything new to discuss following last year's discussion, but I'm happy
to continue the discussion this year. I'll follow up with you in a
separate thread.

> >
> >>> This could enable a wide swath of
> >>> BPF paravirt use cases that are not limited to scheduling, but in terms
> >>> of sched_ext, the BPF scheduler could communicate with the guest
> >>> scheduler over this shared memory region in whatever manner was required
> >>> for that use case.
> >>>
> >>> [0]: https://lwn.net/Articles/900749/
> >>> [1]: https://lwn.net/Articles/856005/
> >>
> >> Thanks, I had no idea about these. I have a question -- would it be possible
> >> to call the sched_setscheduler() function in core.c via a kfunc? Then we can
> >> trigger the boost from a BPF program on the host side. We could start simple
> >> from there.
> >
> > There's nothing stopping us from adding a kfunc that calls
> > sched_setscheduler(). The questions are how other scheduler folks feel
> > about that, and whether that's the appropriate general solution to the
> > problem. It does seem odd to me to have a random KVM tracepoint be
> > entitled to set a generic task's scheduling policy and priority.
>
> Such oddities are application specific though. User space can already call
> setscheduler arbitrarily, so why not a BPF program?

Hmm, that's a good point. Perhaps it does make sense. The BPF program
would be no less privileged than a user space application with
sufficient privileges to change a remote task's prio.

> >> I agree on the rest below. I just wanted to emphasize though that this patch
> >> series does not care about what the scheduler does. It merely hints the
> >> scheduler via a priority setting that something is an important task. That's
> >> a far cry from how to actually schedule and the spirit here is to use
> >> whatever scheduler the user has to decide how to actually schedule.
> >
> > Yep, I don't disagree with you at all on that point. To me this really
> > comes down to a question of the costs and long-term design choices, as
> > we discussed above:
> >
> > 1. What is the long-term maintenance cost to KVM and the scheduler to
> > land paravirt scheduling in this form?
> >
> > Even assuming that we go with the BPF hook on VMEXIT approach, unless
> > I'm missing something, I think we'd still need to make UAPI changes to
> > kvm_para.h, and update the guest to set the relevant paravirt state in
> > the guest scheduler.
>
> As mentioned above, for boosting, there is no hypercall. The VMEXIT is induced
> by host preemption.

I expect I am indeed missing something then, as mentioned above. VMEXIT
aside, we still need some UAPI for the shared structure between the
guest and host where the guest indicates its need for boosting, no?

> > That's not a huge amount of code for just boosting
> > and deboosting, but it sets the precedent that any and all future
> > scheduling paravirt logic will need to go through UAPI, and that the
> > core scheduler will need to accommodate that paravirt when and where
> > appropriate.
> >
> > I may be being overly pessimistic, but I feel that could open up a
> > rather scary can of worms; both in terms of the potential long-term
> > complexity in the kernel itself, and in terms of the maintenance burden
> > that may eventually be imposed to properly support it. It also imposes a
> > very high bar on being able to add and experiment with new paravirt
> > policies.
>
> Hmm, yeah lets discuss this more. It does appear we need to do *something* than
> leaving the performance on the table.

Agreed. There is a lot of performance to be gained from this. Doing
nothing is not the answer. Or at least not the answer I'm hoping for.

> >
> > 2. What is the cost we're imposing on users if we force paravirt to be
> > done through BPF? Is this prohibitively high?
> >
> > There is certainly a nonzero cost. As you pointed out, right now Android
> > apparently doesn't use much BPF, and adding the requisite logic to use
> > and manage BPF programs is not insigificant.
> >
> > Is that cost prohibitively high? I would say no. BPF should be fully
> > supported on aarch64 at this point, so it's really a user space problem.
> > Managing the system is what user space does best, and many other
> > ecosystems have managed to integrate BPF to great effect. So while the
> > cost is cetainly nonzero, I think there's a reasonable argument to be
> > made that it's not prohibitively high.
>
> Yes, I think it is doable.
>
> Glad to be able to finally reply, and I shall prioritize this thread more on my
> side moving forward.

Thanks for your detailed reply, and happy belated birthday :-)

- David


Attachments:
(No filename) (17.04 kB)
signature.asc (235.00 B)
Download all attachments

2024-01-25 01:09:41

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

Hi David,

On Wed, Jan 24, 2024 at 12:06 PM David Vernet <[email protected]> wrote:
>
[...]
> > There might be a caveat to the unboosting path though needing a hypercall and I
> > need to check with Vineeth on his latest code whether it needs a hypercall, but
> > we could probably figure that out. In the latest design, one thing I know is
> > that we just have to force a VMEXIT for both boosting and unboosting. Well for
> > boosting, the VMEXIT just happens automatically due to vCPU preemption, but for
> > unboosting it may not.
>
> As mentioned above, I think we'd need to add UAPI for setting state from
> the guest scheduler, even if we didn't use a hypercall to induce a
> VMEXIT, right?

I see what you mean now. I'll think more about it. The immediate
thought is to load BPF programs to trigger at appropriate points in
the guest. For instance, we already have tracepoints for preemption
disabling. I added that upstream like 8 years ago or something. And
sched_switch already knows when we switch to RT, which we could
leverage in the guest. The BPF program would set some shared memory
state in whatever format it desires, when it runs is what I'm
envisioning.

By the way, one crazy idea about loading BPF programs into a guest..
Maybe KVM can pass along the BPF programs to be loaded to the guest?
The VMM can do that. The nice thing there is only the host would be
the only responsible for the BPF programs. I am not sure if that makes
sense, so please let me know what you think. I guess the VMM should
also be passing additional metadata, like which tracepoints to hook
to, in the guest, etc.

> > In any case, can we not just force a VMEXIT from relevant path within the guest,
> > again using a BPF program? I don't know what the BPF prog to do that would look
> > like, but I was envisioning we would call a BPF prog from within a guest if
> > needed at relevant point (example, return to guest userspace).
>
> I agree it would be useful to have a kfunc that could be used to force a
> VMEXIT if we e.g. need to trigger a resched or something. In general
> that seems like a pretty reasonable building block for something like
> this. I expect there are use cases where doing everything async would be
> useful as well. We'll have to see what works well in experimentation.

Sure.

> > >> Still there is a lot of merit to sharing memory with BPF and let BPF decide
> > >> the format of the shared memory, than baking it into the kernel... so thanks
> > >> for bringing this up! Lets talk more about it... Oh, and there's my LSFMMBPF
> > >> invitiation request ;-) ;-).
> > >
> > > Discussing this BPF feature at LSFMMBPF is a great idea -- I'll submit a
> > > proposal for it and cc you. I looked and couldn't seem to find the
> > > thread for your LSFMMBPF proposal. Would you mind please sending a link?
> >
> > I actually have not even submitted one for LSFMM but my management is supportive
> > of my visit. Do you want to go ahead and submit one with all of us included in
> > the proposal? And I am again sorry for the late reply and hopefully we did not
> > miss any deadlines. Also on related note, there is interest in sched_ext for
>
> I see that you submitted a proposal in [2] yesterday. Thanks for writing
> it up, it looks great and I'll comment on that thread adding a +1 for
> the discussion.
>
> [2]: https://lore.kernel.org/all/[email protected]/
>
> No worries at all about the reply latency. Thank you for being so open
> to discussing different approaches, and for driving the discussion. I
> think this could be a very powerful feature for the kernel so I'm
> pretty excited to further flesh out the design and figure out what makes
> the most sense here.

Great!

> > As mentioned above, for boosting, there is no hypercall. The VMEXIT is induced
> > by host preemption.
>
> I expect I am indeed missing something then, as mentioned above. VMEXIT
> aside, we still need some UAPI for the shared structure between the
> guest and host where the guest indicates its need for boosting, no?

Yes you are right, it is more clear now what you were referring to
with UAPI. I think we need figure that issue out. But if we can make
the VMM load BPF programs, then the host can completely decide how to
structure the shared memory.

> > > 2. What is the cost we're imposing on users if we force paravirt to be
> > > done through BPF? Is this prohibitively high?
> > >
> > > There is certainly a nonzero cost. As you pointed out, right now Android
> > > apparently doesn't use much BPF, and adding the requisite logic to use
> > > and manage BPF programs is not insigificant.
> > >
> > > Is that cost prohibitively high? I would say no. BPF should be fully
> > > supported on aarch64 at this point, so it's really a user space problem.
> > > Managing the system is what user space does best, and many other
> > > ecosystems have managed to integrate BPF to great effect. So while the
> > > cost is cetainly nonzero, I think there's a reasonable argument to be
> > > made that it's not prohibitively high.
> >
> > Yes, I think it is doable.
> >
> > Glad to be able to finally reply, and I shall prioritize this thread more on my
> > side moving forward.
>
> Thanks for your detailed reply, and happy belated birthday :-)

Thank you!!! :-)

- Joel

2024-01-26 21:19:44

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

On Wed, Jan 24, 2024 at 08:08:56PM -0500, Joel Fernandes wrote:
> Hi David,

Hi Joel,

>
> On Wed, Jan 24, 2024 at 12:06 PM David Vernet <[email protected]> wrote:
> >
> [...]
> > > There might be a caveat to the unboosting path though needing a hypercall and I
> > > need to check with Vineeth on his latest code whether it needs a hypercall, but
> > > we could probably figure that out. In the latest design, one thing I know is
> > > that we just have to force a VMEXIT for both boosting and unboosting. Well for
> > > boosting, the VMEXIT just happens automatically due to vCPU preemption, but for
> > > unboosting it may not.
> >
> > As mentioned above, I think we'd need to add UAPI for setting state from
> > the guest scheduler, even if we didn't use a hypercall to induce a
> > VMEXIT, right?
>
> I see what you mean now. I'll think more about it. The immediate
> thought is to load BPF programs to trigger at appropriate points in
> the guest. For instance, we already have tracepoints for preemption
> disabling. I added that upstream like 8 years ago or something. And
> sched_switch already knows when we switch to RT, which we could
> leverage in the guest. The BPF program would set some shared memory
> state in whatever format it desires, when it runs is what I'm
> envisioning.

That sounds like it would work perfectly. Tracepoints are really ideal,
both because BPF doesn't allow (almost?) any kfuncs to be called from
fentry/kprobe progs (whereas they do from tracepoints), and because
tracepoint program arguments are trusted so the BPF verifier knows that
it's safe to pass them onto kfuncs, etc as refernced kptrs.

> By the way, one crazy idea about loading BPF programs into a guest..
> Maybe KVM can pass along the BPF programs to be loaded to the guest?
> The VMM can do that. The nice thing there is only the host would be
> the only responsible for the BPF programs. I am not sure if that makes
> sense, so please let me know what you think. I guess the VMM should
> also be passing additional metadata, like which tracepoints to hook
> to, in the guest, etc.

This I'm not sure I can really share an intelligent opinion on. My first
thought was that the guest VM would load some BPF programs at boot using
something like systemd, and then those progs would somehow register with
the VMM -- maybe through a kfunc implemented by KVM that makes a
hypercall. Perhaps what you're suggesting would work as well.

> > > In any case, can we not just force a VMEXIT from relevant path within the guest,
> > > again using a BPF program? I don't know what the BPF prog to do that would look
> > > like, but I was envisioning we would call a BPF prog from within a guest if
> > > needed at relevant point (example, return to guest userspace).
> >
> > I agree it would be useful to have a kfunc that could be used to force a
> > VMEXIT if we e.g. need to trigger a resched or something. In general
> > that seems like a pretty reasonable building block for something like
> > this. I expect there are use cases where doing everything async would be
> > useful as well. We'll have to see what works well in experimentation.
>
> Sure.
>
> > > >> Still there is a lot of merit to sharing memory with BPF and let BPF decide
> > > >> the format of the shared memory, than baking it into the kernel... so thanks
> > > >> for bringing this up! Lets talk more about it... Oh, and there's my LSFMMBPF
> > > >> invitiation request ;-) ;-).
> > > >
> > > > Discussing this BPF feature at LSFMMBPF is a great idea -- I'll submit a
> > > > proposal for it and cc you. I looked and couldn't seem to find the
> > > > thread for your LSFMMBPF proposal. Would you mind please sending a link?
> > >
> > > I actually have not even submitted one for LSFMM but my management is supportive
> > > of my visit. Do you want to go ahead and submit one with all of us included in
> > > the proposal? And I am again sorry for the late reply and hopefully we did not
> > > miss any deadlines. Also on related note, there is interest in sched_ext for
> >
> > I see that you submitted a proposal in [2] yesterday. Thanks for writing
> > it up, it looks great and I'll comment on that thread adding a +1 for
> > the discussion.
> >
> > [2]: https://lore.kernel.org/all/[email protected]/
> >
> > No worries at all about the reply latency. Thank you for being so open
> > to discussing different approaches, and for driving the discussion. I
> > think this could be a very powerful feature for the kernel so I'm
> > pretty excited to further flesh out the design and figure out what makes
> > the most sense here.
>
> Great!
>
> > > As mentioned above, for boosting, there is no hypercall. The VMEXIT is induced
> > > by host preemption.
> >
> > I expect I am indeed missing something then, as mentioned above. VMEXIT
> > aside, we still need some UAPI for the shared structure between the
> > guest and host where the guest indicates its need for boosting, no?
>
> Yes you are right, it is more clear now what you were referring to
> with UAPI. I think we need figure that issue out. But if we can make
> the VMM load BPF programs, then the host can completely decide how to
> structure the shared memory.

Yep -- if the communication channel is from guest BPF -> host BPF, I
think the UAPI concerns are completely addressed. Figuring out how to
actually load the guest BPF progs and setup the communication channels
is another matter.

Thanks,
David


Attachments:
(No filename) (5.46 kB)
signature.asc (235.00 B)
Download all attachments