2024-04-03 14:01:38

by Vineeth Remanan Pillai

Subject: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)

Double scheduling is a concern with virtualization hosts, where the host
schedules vcpus without knowing what is run by each vcpu, and the guest
schedules tasks without knowing where its vcpus physically run. This
causes issues related to latency, power consumption, resource
utilization etc. An ideal solution would be a cooperative scheduling
framework where the guest and host share scheduling-related information
and make educated scheduling decisions to optimally handle the
workloads. As a first step, we are taking a stab at reducing latencies
for latency-sensitive workloads in the guest.

The v1 RFC[1] was posted in December 2023. The main disagreement was
with the implementation: the patch was making scheduling policy
decisions in kvm, and kvm is not the right place to do that. The
suggestion was to move the policy decisions outside of kvm and let kvm
handle only the notifications needed to make those decisions. This
patch series is an iterative step towards implementing the feature as
a layered design, where the policy could be implemented outside of kvm
as a kernel built-in, a kernel module or a bpf program.

This design mainly comprises 4 components:

- pvsched driver: Implements the scheduling policies. Registers with the
host with a set of callbacks that the hypervisor (kvm) can use to notify
the driver of vcpu events it is interested in. The callbacks are passed
the address of the shared memory, so the driver can read the scheduling
information shared by the guest and also update the scheduling policies
set by the driver.
- kvm component: Selects the pvsched driver for a guest and notifies
the driver via callbacks for events that the driver is interested
in. Also interfaces with the guest to retrieve the shared memory
region used for sharing the scheduling information.
- host kernel component: Implements the APIs for:
- the pvsched driver to register/unregister with the host kernel, and
- the hypervisor to assign/unassign a driver for guests.
- guest component: Implements a framework for sharing the scheduling
information with the pvsched driver through kvm.

There is another component that we refer to as the pvsched protocol.
This defines the details of the shared memory layout, the information
shared, and the scheduling policy decisions. The protocol need not be
part of the kernel and can be defined separately based on the use case
and requirements. Both the guest and the selected pvsched driver need
to match the protocol for the feature to work. A protocol shall be
identified by a name and a possible versioning scheme. The guest will
advertise its protocol, and the hypervisor can then assign a driver
implementing that protocol if one is registered in the host kernel.

This patch series implements only the first 3 components. The guest-side
implementation and the protocol framework shall come as a separate
series once we finalize the rest of the design.

This series also implements a sample bpf program and a kernel-builtin
pvsched driver. They don't do anything useful yet; they are just
skeletons to demonstrate the feature.

Rebased on 6.8.2.

[1]: https://lwn.net/Articles/955145/

Vineeth Pillai (Google) (5):
pvsched: paravirt scheduling framework
kvm: Implement the paravirt sched framework for kvm
kvm: interface for managing pvsched driver for guest VMs
pvsched: bpf support for pvsched
selftests/bpf: sample implementation of a bpf pvsched driver.

Kconfig | 2 +
arch/x86/kvm/Kconfig | 13 +
arch/x86/kvm/x86.c | 3 +
include/linux/kvm_host.h | 32 +++
include/linux/pvsched.h | 102 +++++++
include/uapi/linux/kvm.h | 6 +
kernel/bpf/bpf_struct_ops_types.h | 4 +
kernel/sysctl.c | 27 ++
.../testing/selftests/bpf/progs/bpf_pvsched.c | 37 +++
virt/Makefile | 2 +-
virt/kvm/kvm_main.c | 265 ++++++++++++++++++
virt/pvsched/Kconfig | 12 +
virt/pvsched/Makefile | 2 +
virt/pvsched/pvsched.c | 215 ++++++++++++++
virt/pvsched/pvsched_bpf.c | 141 ++++++++++
15 files changed, 862 insertions(+), 1 deletion(-)
create mode 100644 include/linux/pvsched.h
create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c
create mode 100644 virt/pvsched/Kconfig
create mode 100644 virt/pvsched/Makefile
create mode 100644 virt/pvsched/pvsched.c
create mode 100644 virt/pvsched/pvsched_bpf.c

--
2.40.1



2024-04-03 14:02:07

by Vineeth Remanan Pillai

Subject: [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework

Implement a paravirt scheduling framework for the Linux kernel.

The framework allows a pvsched driver to register with the kernel and
receive callbacks from the hypervisor (eg: kvm) for vcpu events it is
interested in, like VMENTER, VMEXIT etc.

The framework also allows the hypervisor to select a pvsched driver
(from the available list of registered drivers) for each guest.

Also implement a sysctl for listing the available pvsched drivers.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
Kconfig | 2 +
include/linux/pvsched.h | 102 +++++++++++++++++++
kernel/sysctl.c | 27 +++++
virt/Makefile | 2 +-
virt/pvsched/Kconfig | 12 +++
virt/pvsched/Makefile | 2 +
virt/pvsched/pvsched.c | 215 ++++++++++++++++++++++++++++++++++++++++
7 files changed, 361 insertions(+), 1 deletion(-)
create mode 100644 include/linux/pvsched.h
create mode 100644 virt/pvsched/Kconfig
create mode 100644 virt/pvsched/Makefile
create mode 100644 virt/pvsched/pvsched.c

diff --git a/Kconfig b/Kconfig
index 745bc773f567..4a52eaa21166 100644
--- a/Kconfig
+++ b/Kconfig
@@ -29,4 +29,6 @@ source "lib/Kconfig"

source "lib/Kconfig.debug"

+source "virt/pvsched/Kconfig"
+
source "Documentation/Kconfig"
diff --git a/include/linux/pvsched.h b/include/linux/pvsched.h
new file mode 100644
index 000000000000..59df6b44aacb
--- /dev/null
+++ b/include/linux/pvsched.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2024 Google */
+
+#ifndef _LINUX_PVSCHED_H
+#define _LINUX_PVSCHED_H 1
+
+/*
+ * List of events for which hypervisor calls back into pvsched driver.
+ * Driver can specify the events it is interested in.
+ */
+enum pvsched_vcpu_events {
+ PVSCHED_VCPU_VMENTER = 0x1,
+ PVSCHED_VCPU_VMEXIT = 0x2,
+ PVSCHED_VCPU_HALT = 0x4,
+ PVSCHED_VCPU_INTR_INJ = 0x8,
+};
+
+#define PVSCHED_NAME_MAX 32
+#define PVSCHED_MAX 8
+#define PVSCHED_DRV_BUF_MAX (PVSCHED_NAME_MAX * PVSCHED_MAX + PVSCHED_MAX)
+
+/*
+ * pvsched driver callbacks.
+ * TODO: versioning support for better compatibility with the guest
+ * component implementing this feature.
+ */
+struct pvsched_vcpu_ops {
+ /*
+ * pvsched_vcpu_register() - Register the vcpu with pvsched driver.
+ * @pid: pid of the vcpu task.
+ *
+ * pvsched driver can store the pid internally and initialize
+ * itself to prepare for receiving callbacks from this vcpu.
+ */
+ int (*pvsched_vcpu_register)(struct pid *pid);
+
+ /*
+ * pvsched_vcpu_unregister() - Un-register the vcpu with pvsched driver.
+ * @pid: pid of the vcpu task.
+ */
+ void (*pvsched_vcpu_unregister)(struct pid *pid);
+
+ /*
+ * pvsched_vcpu_notify_event() - Callback for pvsched events
+ * @addr: Address of the memory region shared with guest
+ * @pid: pid of the vcpu task.
+ * @event: bit mask of the events that the hypervisor wants to notify.
+ */
+ void (*pvsched_vcpu_notify_event)(void *addr, struct pid *pid, u32 event);
+
+ char name[PVSCHED_NAME_MAX];
+ struct module *owner;
+ struct list_head list;
+ u32 events;
+ u32 key;
+};
+
+#ifdef CONFIG_PARAVIRT_SCHED_HOST
+int pvsched_get_available_drivers(char *buf, size_t maxlen);
+
+int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops);
+void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops);
+
+struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name);
+void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops);
+
+static inline int pvsched_validate_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+ /*
+ * All callbacks are mandatory.
+ */
+ if (!ops->pvsched_vcpu_register || !ops->pvsched_vcpu_unregister ||
+ !ops->pvsched_vcpu_notify_event)
+ return -EINVAL;
+
+ return 0;
+}
+#else
+static inline int pvsched_get_available_drivers(char *buf, size_t maxlen)
+{
+ return -ENOTSUPP;
+}
+
+static inline int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+ return -ENOTSUPP;
+}
+
+static inline void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+}
+
+static inline struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name)
+{
+ return NULL;
+}
+
+static inline void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+}
+#endif
+
+#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 157f7ce2942d..10a18a791b4f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -63,6 +63,7 @@
#include <linux/mount.h>
#include <linux/userfaultfd_k.h>
#include <linux/pid.h>
+#include <linux/pvsched.h>

#include "../lib/kstrtox.h"

@@ -1615,6 +1616,24 @@ int proc_do_static_key(struct ctl_table *table, int write,
return ret;
}

+#ifdef CONFIG_PARAVIRT_SCHED_HOST
+static int proc_pvsched_available_drivers(struct ctl_table *ctl,
+ int write, void *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table tbl = { .maxlen = PVSCHED_DRV_BUF_MAX, };
+ int ret;
+
+ tbl.data = kmalloc(tbl.maxlen, GFP_USER);
+ if (!tbl.data)
+ return -ENOMEM;
+ pvsched_get_available_drivers(tbl.data, PVSCHED_DRV_BUF_MAX);
+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
+ kfree(tbl.data);
+ return ret;
+}
+#endif
+
static struct ctl_table kern_table[] = {
{
.procname = "panic",
@@ -2033,6 +2052,14 @@ static struct ctl_table kern_table[] = {
.extra1 = SYSCTL_ONE,
.extra2 = SYSCTL_INT_MAX,
},
+#endif
+#ifdef CONFIG_PARAVIRT_SCHED_HOST
+ {
+ .procname = "pvsched_available_drivers",
+ .maxlen = PVSCHED_DRV_BUF_MAX,
+ .mode = 0444,
+ .proc_handler = proc_pvsched_available_drivers,
+ },
#endif
{ }
};
diff --git a/virt/Makefile b/virt/Makefile
index 1cfea9436af9..9d0f32d775a1 100644
--- a/virt/Makefile
+++ b/virt/Makefile
@@ -1,2 +1,2 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-y += lib/
+obj-y += lib/ pvsched/
diff --git a/virt/pvsched/Kconfig b/virt/pvsched/Kconfig
new file mode 100644
index 000000000000..5ca2669060cb
--- /dev/null
+++ b/virt/pvsched/Kconfig
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config PARAVIRT_SCHED_HOST
+ bool "Paravirt scheduling framework in the host kernel"
+ default n
+ help
+ Paravirtualized scheduling facilitates the exchange of scheduling
+ related information between the host and guest through shared memory,
+ enhancing the efficiency of vCPU thread scheduling by the hypervisor.
+ An illustrative use case involves dynamically boosting the priority of
+ a vCPU thread when the guest is executing a latency-sensitive workload
+ on that specific vCPU.
+ This config enables paravirt scheduling framework in the host kernel.
diff --git a/virt/pvsched/Makefile b/virt/pvsched/Makefile
new file mode 100644
index 000000000000..4ca38e30479b
--- /dev/null
+++ b/virt/pvsched/Makefile
@@ -0,0 +1,2 @@
+
+obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o
diff --git a/virt/pvsched/pvsched.c b/virt/pvsched/pvsched.c
new file mode 100644
index 000000000000..610c85cf90d2
--- /dev/null
+++ b/virt/pvsched/pvsched.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024 Google */
+
+/*
+ * Paravirt scheduling framework
+ *
+ */
+
+/*
+ * Heavily inspired from tcp congestion avoidance implementation.
+ * (net/ipv4/tcp_cong.c)
+ */
+
+#define pr_fmt(fmt) "PVSCHED: " fmt
+
+#include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/jhash.h>
+#include <linux/pvsched.h>
+
+static DEFINE_SPINLOCK(pvsched_drv_list_lock);
+static int nr_pvsched_drivers = 0;
+static LIST_HEAD(pvsched_drv_list);
+
+/*
+ * Retrieve pvsched_vcpu_ops given the name.
+ */
+static struct pvsched_vcpu_ops *pvsched_find_vcpu_ops_name(char *name)
+{
+ struct pvsched_vcpu_ops *ops;
+
+ list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
+ if (strcmp(ops->name, name) == 0)
+ return ops;
+ }
+
+ return NULL;
+}
+
+/*
+ * Retrieve pvsched_vcpu_ops given the hash key.
+ */
+static struct pvsched_vcpu_ops *pvsched_find_vcpu_ops_key(u32 key)
+{
+ struct pvsched_vcpu_ops *ops;
+
+ list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
+ if (ops->key == key)
+ return ops;
+ }
+
+ return NULL;
+}
+
+/*
+ * pvsched_get_available_drivers() - Copy space separated list of pvsched
+ * driver names.
+ * @buf: buffer to store the list of driver names
+ * @maxlen: size of the buffer
+ *
+ * Return: 0 on success, negative value on error.
+ */
+int pvsched_get_available_drivers(char *buf, size_t maxlen)
+{
+ struct pvsched_vcpu_ops *ops;
+ size_t offs = 0;
+
+ if (!buf)
+ return -EINVAL;
+
+ if (maxlen > PVSCHED_DRV_BUF_MAX)
+ maxlen = PVSCHED_DRV_BUF_MAX;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
+ offs += snprintf(buf + offs, maxlen - offs,
+ "%s%s",
+ offs == 0 ? "" : " ", ops->name);
+
+ if (WARN_ON_ONCE(offs >= maxlen))
+ break;
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pvsched_get_available_drivers);
+
+/*
+ * pvsched_register_vcpu_ops() - Register the driver in the kernel.
+ * @ops: Driver data(callbacks)
+ *
+ * After registration, the driver will be exposed to the hypervisor
+ * for assignment to guest VMs.
+ *
+ * Return: 0 on success, negative value on error.
+ */
+int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+ int ret = 0;
+
+ ops->key = jhash(ops->name, sizeof(ops->name), strlen(ops->name));
+ spin_lock(&pvsched_drv_list_lock);
+ if (nr_pvsched_drivers >= PVSCHED_MAX) {
+ ret = -ENOSPC;
+ } else if (pvsched_find_vcpu_ops_key(ops->key)) {
+ ret = -EEXIST;
+ } else if (!(ret = pvsched_validate_vcpu_ops(ops))) {
+ list_add_tail_rcu(&ops->list, &pvsched_drv_list);
+ nr_pvsched_drivers++;
+ }
+ spin_unlock(&pvsched_drv_list_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pvsched_register_vcpu_ops);
+
+/*
+ * pvsched_unregister_vcpu_ops() - Un-register the driver from the kernel.
+ * @ops: Driver data(callbacks)
+ *
+ * After un-registration, the driver is no longer visible to the hypervisor.
+ */
+void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+ spin_lock(&pvsched_drv_list_lock);
+ list_del_rcu(&ops->list);
+ nr_pvsched_drivers--;
+ spin_unlock(&pvsched_drv_list_lock);
+
+ synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(pvsched_unregister_vcpu_ops);
+
+/*
+ * pvsched_get_vcpu_ops() - Acquire the driver.
+ * @name: Name of the driver to be acquired.
+ *
+ * Hypervisor can use this API to get the driver structure for
+ * assigning it to guest VMs. This API takes a reference on the
+ * module/bpf program so that the driver doesn't vanish under the
+ * hypervisor.
+ *
+ * Return: driver structure if found, else NULL.
+ */
+struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name)
+{
+ struct pvsched_vcpu_ops *ops;
+
+ if (!name || (strlen(name) >= PVSCHED_NAME_MAX))
+ return NULL;
+
+ rcu_read_lock();
+ ops = pvsched_find_vcpu_ops_name(name);
+ if (!ops)
+ goto out;
+
+ if (unlikely(!bpf_try_module_get(ops, ops->owner))) {
+ ops = NULL;
+ goto out;
+ }
+
+out:
+ rcu_read_unlock();
+ return ops;
+}
+EXPORT_SYMBOL_GPL(pvsched_get_vcpu_ops);
+
+/*
+ * pvsched_put_vcpu_ops() - Release the driver.
+ * @ops: Driver structure to be released.
+ *
+ * Hypervisor can use this API to release the driver.
+ */
+void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops)
+{
+ bpf_module_put(ops, ops->owner);
+}
+EXPORT_SYMBOL_GPL(pvsched_put_vcpu_ops);
+
+/*
+ * NOP vm_ops Sample implementation.
+ * This driver doesn't do anything other than registering itself.
+ * Placeholder for adding some default logic when the feature is
+ * complete.
+ */
+static int nop_pvsched_vcpu_register(struct pid *pid)
+{
+ return 0;
+}
+static void nop_pvsched_vcpu_unregister(struct pid *pid)
+{
+}
+static void nop_pvsched_notify_event(void *addr, struct pid *pid, u32 event)
+{
+}
+
+struct pvsched_vcpu_ops nop_vcpu_ops = {
+ .events = PVSCHED_VCPU_VMENTER | PVSCHED_VCPU_VMEXIT | PVSCHED_VCPU_HALT,
+ .pvsched_vcpu_register = nop_pvsched_vcpu_register,
+ .pvsched_vcpu_unregister = nop_pvsched_vcpu_unregister,
+ .pvsched_vcpu_notify_event = nop_pvsched_notify_event,
+ .name = "pvsched_nop",
+ .owner = THIS_MODULE,
+};
+
+static int __init pvsched_init(void)
+{
+ return WARN_ON(pvsched_register_vcpu_ops(&nop_vcpu_ops));
+}
+
+late_initcall(pvsched_init);
--
2.40.1


2024-04-03 14:02:08

by Vineeth Remanan Pillai

Subject: [RFC PATCH v2 2/5] kvm: Implement the paravirt sched framework for kvm

kvm uses the kernel's paravirt sched framework to assign an available
pvsched driver to a guest. Guest vcpus are registered with the pvsched
driver, and kvm calls into the driver callbacks to notify it of the
events that the driver is interested in.

This PoC doesn't do the callback on interrupt injection yet. That will
be implemented in subsequent iterations.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
arch/x86/kvm/Kconfig | 13 ++++
arch/x86/kvm/x86.c | 3 +
include/linux/kvm_host.h | 32 +++++++++
virt/kvm/kvm_main.c | 148 +++++++++++++++++++++++++++++++++++++++
4 files changed, 196 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 65ed14b6540b..c1776cdb5b65 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -189,4 +189,17 @@ config KVM_MAX_NR_VCPUS
the memory footprint of each KVM guest, regardless of how many vCPUs are
created for a given VM.

+config PARAVIRT_SCHED_KVM
+ bool "Enable paravirt scheduling capability for kvm"
+ depends on KVM
+ default n
+ help
+ Paravirtualized scheduling facilitates the exchange of scheduling
+ related information between the host and guest through shared memory,
+ enhancing the efficiency of vCPU thread scheduling by the hypervisor.
+ An illustrative use case involves dynamically boosting the priority of
+ a vCPU thread when the guest is executing a latency-sensitive workload
+ on that specific vCPU.
+ This config enables paravirt scheduling in the kvm hypervisor.
+
endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ffe580169c93..d0abc2c64d47 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10896,6 +10896,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

preempt_disable();

+ kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_VMENTER);
+
static_call(kvm_x86_prepare_switch_to_guest)(vcpu);

/*
@@ -11059,6 +11061,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
guest_timing_exit_irqoff();

local_irq_enable();
+ kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_VMEXIT);
preempt_enable();

kvm_vcpu_srcu_read_lock(vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 179df96b20f8..6381569f3de8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -45,6 +45,8 @@
#include <asm/kvm_host.h>
#include <linux/kvm_dirty_ring.h>

+#include <linux/pvsched.h>
+
#ifndef KVM_MAX_VCPU_IDS
#define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
#endif
@@ -832,6 +834,11 @@ struct kvm {
bool vm_bugged;
bool vm_dead;

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ spinlock_t pvsched_ops_lock;
+ struct pvsched_vcpu_ops __rcu *pvsched_ops;
+#endif
+
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
struct notifier_block pm_notifier;
#endif
@@ -2413,4 +2420,29 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
}
#endif /* CONFIG_KVM_PRIVATE_MEM */

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events);
+int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu);
+void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu);
+
+int kvm_replace_pvsched_ops(struct kvm *kvm, char *name);
+#else
+static inline int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events)
+{
+ return 0;
+}
+static inline int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+static inline void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu)
+{
+}
+
+static inline int kvm_replace_pvsched_ops(struct kvm *kvm, char *name)
+{
+ return 0;
+}
+#endif
+
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f50960b0e3a..0546814e4db7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -170,6 +170,142 @@ bool kvm_is_zone_device_page(struct page *page)
return is_zone_device_page(page);
}

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+typedef enum {
+ PVSCHED_CB_REGISTER = 1,
+ PVSCHED_CB_UNREGISTER = 2,
+ PVSCHED_CB_NOTIFY = 3
+} pvsched_vcpu_callback_t;
+
+/*
+ * Helper function to invoke the pvsched driver callback.
+ */
+static int __vcpu_pvsched_callback(struct kvm_vcpu *vcpu, u32 events,
+ pvsched_vcpu_callback_t action)
+{
+ int ret = 0;
+ struct pid *pid;
+ struct pvsched_vcpu_ops *ops;
+
+ rcu_read_lock();
+ ops = rcu_dereference(vcpu->kvm->pvsched_ops);
+ if (!ops) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ pid = rcu_dereference(vcpu->pid);
+ if (WARN_ON_ONCE(!pid)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ get_pid(pid);
+ switch(action) {
+ case PVSCHED_CB_REGISTER:
+ ops->pvsched_vcpu_register(pid);
+ break;
+ case PVSCHED_CB_UNREGISTER:
+ ops->pvsched_vcpu_unregister(pid);
+ break;
+ case PVSCHED_CB_NOTIFY:
+ if (ops->events & events) {
+ ops->pvsched_vcpu_notify_event(
+ NULL, /* TODO: Pass guest allocated sharedmem addr */
+ pid,
+ ops->events & events);
+ }
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ }
+ put_pid(pid);
+
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
+int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events)
+{
+ return __vcpu_pvsched_callback(vcpu, events, PVSCHED_CB_NOTIFY);
+}
+
+int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu)
+{
+ /*
+ * TODO: Action if the registration fails?
+ */
+ return __vcpu_pvsched_callback(vcpu, 0, PVSCHED_CB_REGISTER);
+}
+
+void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu)
+{
+ __vcpu_pvsched_callback(vcpu, 0, PVSCHED_CB_UNREGISTER);
+}
+
+/*
+ * Replaces the VM's current pvsched driver.
+ * if name is NULL or empty string, unassign the
+ * current driver.
+ */
+int kvm_replace_pvsched_ops(struct kvm *kvm, char *name)
+{
+ int ret = 0;
+ unsigned long i;
+ struct kvm_vcpu *vcpu = NULL;
+ struct pvsched_vcpu_ops *ops = NULL, *prev_ops;
+
+
+ spin_lock(&kvm->pvsched_ops_lock);
+
+ prev_ops = rcu_dereference(kvm->pvsched_ops);
+
+ /*
+ * Unassign operation if the passed in value is
+ * NULL or an empty string.
+ */
+ if (name && *name) {
+ ops = pvsched_get_vcpu_ops(name);
+ if (!ops) {
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ if (prev_ops) {
+ /*
+ * Unregister current pvsched driver.
+ */
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ kvm_vcpu_pvsched_unregister(vcpu);
+ }
+
+ pvsched_put_vcpu_ops(prev_ops);
+ }
+
+
+ rcu_assign_pointer(kvm->pvsched_ops, ops);
+ if (ops) {
+ /*
+ * Register new pvsched driver.
+ */
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ WARN_ON_ONCE(kvm_vcpu_pvsched_register(vcpu));
+ }
+ }
+
+out:
+ spin_unlock(&kvm->pvsched_ops_lock);
+
+ if (ret)
+ return ret;
+
+ synchronize_rcu();
+
+ return 0;
+}
+#endif
+
/*
* Returns a 'struct page' if the pfn is "valid" and backed by a refcounted
* page, NULL otherwise. Note, the list of refcounted PG_reserved page types
@@ -508,6 +644,8 @@ static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_destroy(vcpu);
kvm_dirty_ring_free(&vcpu->dirty_ring);

+ kvm_vcpu_pvsched_unregister(vcpu);
+
/*
* No need for rcu_read_lock as VCPU_RUN is the only place that changes
* the vcpu->pid pointer, and at destruction time all file descriptors
@@ -1221,6 +1359,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)

BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ spin_lock_init(&kvm->pvsched_ops_lock);
+#endif
+
/*
* Force subsequent debugfs file creations to fail if the VM directory
* is not created (by kvm_create_vm_debugfs()).
@@ -1343,6 +1485,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
int i;
struct mm_struct *mm = kvm->mm;

+ kvm_replace_pvsched_ops(kvm, NULL);
+
kvm_destroy_pm_notifier(kvm);
kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
kvm_destroy_vm_debugfs(kvm);
@@ -3779,6 +3923,8 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
if (kvm_vcpu_check_block(vcpu) < 0)
break;

+ kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_HALT);
+
waited = true;
schedule();
}
@@ -4434,6 +4580,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
/* The thread running this VCPU changed. */
struct pid *newpid;

+ kvm_vcpu_pvsched_unregister(vcpu);
r = kvm_arch_vcpu_run_pid_change(vcpu);
if (r)
break;
@@ -4442,6 +4589,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
rcu_assign_pointer(vcpu->pid, newpid);
if (oldpid)
synchronize_rcu();
+ kvm_vcpu_pvsched_register(vcpu);
put_pid(oldpid);
}
r = kvm_arch_vcpu_ioctl_run(vcpu);
--
2.40.1


2024-04-03 14:22:25

by Vineeth Remanan Pillai

Subject: [RFC PATCH v2 4/5] pvsched: bpf support for pvsched

Add support for implementing bpf pvsched drivers. bpf programs can use
struct_ops to define the callbacks of a pvsched driver.

This is only a skeleton of the bpf framework for pvsched. Some
verification details are not implemented yet.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/bpf/bpf_struct_ops_types.h | 4 +
virt/pvsched/Makefile | 2 +-
virt/pvsched/pvsched_bpf.c | 141 ++++++++++++++++++++++++++++++
3 files changed, 146 insertions(+), 1 deletion(-)
create mode 100644 virt/pvsched/pvsched_bpf.c

diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
index 5678a9ddf817..9d5e4d1a331a 100644
--- a/kernel/bpf/bpf_struct_ops_types.h
+++ b/kernel/bpf/bpf_struct_ops_types.h
@@ -9,4 +9,8 @@ BPF_STRUCT_OPS_TYPE(bpf_dummy_ops)
#include <net/tcp.h>
BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
#endif
+#ifdef CONFIG_PARAVIRT_SCHED_HOST
+#include <linux/pvsched.h>
+BPF_STRUCT_OPS_TYPE(pvsched_vcpu_ops)
+#endif
#endif
diff --git a/virt/pvsched/Makefile b/virt/pvsched/Makefile
index 4ca38e30479b..02bc072cd806 100644
--- a/virt/pvsched/Makefile
+++ b/virt/pvsched/Makefile
@@ -1,2 +1,2 @@

-obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o
+obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o pvsched_bpf.o
diff --git a/virt/pvsched/pvsched_bpf.c b/virt/pvsched/pvsched_bpf.c
new file mode 100644
index 000000000000..b125089abc3b
--- /dev/null
+++ b/virt/pvsched/pvsched_bpf.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Google */
+
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <linux/pvsched.h>
+
+
+/* "extern" is to avoid sparse warning. It is only used in bpf_struct_ops.c. */
+extern struct bpf_struct_ops bpf_pvsched_vcpu_ops;
+
+static int bpf_pvsched_vcpu_init(struct btf *btf)
+{
+ return 0;
+}
+
+static bool bpf_pvsched_vcpu_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
+ return false;
+ if (type != BPF_READ)
+ return false;
+ if (off % size != 0)
+ return false;
+
+ if (!btf_ctx_access(off, size, type, prog, info))
+ return false;
+
+ return true;
+}
+
+static int bpf_pvsched_vcpu_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, int size)
+{
+ /*
+ * TODO: Enable write access to Guest shared mem.
+ */
+ return -EACCES;
+}
+
+static const struct bpf_func_proto *
+bpf_pvsched_vcpu_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id);
+}
+
+static const struct bpf_verifier_ops bpf_pvsched_vcpu_verifier_ops = {
+ .get_func_proto = bpf_pvsched_vcpu_get_func_proto,
+ .is_valid_access = bpf_pvsched_vcpu_is_valid_access,
+ .btf_struct_access = bpf_pvsched_vcpu_btf_struct_access,
+};
+
+static int bpf_pvsched_vcpu_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct pvsched_vcpu_ops *uvm_ops;
+ struct pvsched_vcpu_ops *vm_ops;
+ u32 moff;
+
+ uvm_ops = (const struct pvsched_vcpu_ops *)udata;
+ vm_ops = (struct pvsched_vcpu_ops *)kdata;
+
+ moff = __btf_member_bit_offset(t, member) / 8;
+ switch (moff) {
+ case offsetof(struct pvsched_vcpu_ops, events):
+ vm_ops->events = *(u32 *)(udata + moff);
+ return 1;
+ case offsetof(struct pvsched_vcpu_ops, name):
+ if (bpf_obj_name_cpy(vm_ops->name, uvm_ops->name,
+ sizeof(vm_ops->name)) <= 0)
+ return -EINVAL;
+ return 1;
+ }
+
+ return 0;
+}
+
+static int bpf_pvsched_vcpu_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ return 0;
+}
+
+static int bpf_pvsched_vcpu_reg(void *kdata)
+{
+ return pvsched_register_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
+}
+
+static void bpf_pvsched_vcpu_unreg(void *kdata)
+{
+ pvsched_unregister_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
+}
+
+static int bpf_pvsched_vcpu_validate(void *kdata)
+{
+ return pvsched_validate_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
+}
+
+static int bpf_pvsched_vcpu_update(void *kdata, void *old_kdata)
+{
+ return -EOPNOTSUPP;
+}
+
+static int __pvsched_vcpu_register(struct pid *pid)
+{
+ return 0;
+}
+static void __pvsched_vcpu_unregister(struct pid *pid)
+{
+}
+static void __pvsched_notify_event(void *addr, struct pid *pid, u32 event)
+{
+}
+
+static struct pvsched_vcpu_ops __bpf_ops_pvsched_vcpu_ops = {
+ .pvsched_vcpu_register = __pvsched_vcpu_register,
+ .pvsched_vcpu_unregister = __pvsched_vcpu_unregister,
+ .pvsched_vcpu_notify_event = __pvsched_notify_event,
+};
+
+struct bpf_struct_ops bpf_pvsched_vcpu_ops = {
+ .init = &bpf_pvsched_vcpu_init,
+ .validate = bpf_pvsched_vcpu_validate,
+ .update = bpf_pvsched_vcpu_update,
+ .verifier_ops = &bpf_pvsched_vcpu_verifier_ops,
+ .reg = bpf_pvsched_vcpu_reg,
+ .unreg = bpf_pvsched_vcpu_unreg,
+ .check_member = bpf_pvsched_vcpu_check_member,
+ .init_member = bpf_pvsched_vcpu_init_member,
+ .name = "pvsched_vcpu_ops",
+ .cfi_stubs = &__bpf_ops_pvsched_vcpu_ops,
+};
--
2.40.1


2024-04-03 14:23:08

by Vineeth Remanan Pillai

Subject: [RFC PATCH v2 5/5] selftests/bpf: sample implementation of a bpf pvsched driver.

A dummy skeleton of a bpf pvsched driver. This is just for demonstration
purposes and would need more work to be included as a test for this
feature.

Not-Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
.../testing/selftests/bpf/progs/bpf_pvsched.c | 37 +++++++++++++++++++
1 file changed, 37 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c

diff --git a/tools/testing/selftests/bpf/progs/bpf_pvsched.c b/tools/testing/selftests/bpf/progs/bpf_pvsched.c
new file mode 100644
index 000000000000..a653baa3034b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_pvsched.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include "vmlinux.h"
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/pvsched_vcpu_reg")
+int BPF_PROG(pvsched_vcpu_reg, struct pid *pid)
+{
+ bpf_printk("pvsched_vcpu_reg: pid: %p", pid);
+ return 0;
+}
+
+SEC("struct_ops/pvsched_vcpu_unreg")
+void BPF_PROG(pvsched_vcpu_unreg, struct pid *pid)
+{
+ bpf_printk("pvsched_vcpu_unreg: pid: %p", pid);
+}
+
+SEC("struct_ops/pvsched_vcpu_notify_event")
+void BPF_PROG(pvsched_vcpu_notify_event, void *addr, struct pid *pid, __u32 event)
+{
+ bpf_printk("pvsched_vcpu_notify: pid: %p, event:%u", pid, event);
+}
+
+SEC(".struct_ops")
+struct pvsched_vcpu_ops pvsched_ops = {
+ .pvsched_vcpu_register = (void *)pvsched_vcpu_reg,
+ .pvsched_vcpu_unregister = (void *)pvsched_vcpu_unreg,
+ .pvsched_vcpu_notify_event = (void *)pvsched_vcpu_notify_event,
+ .events = 0x6, /* PVSCHED_VCPU_VMEXIT | PVSCHED_VCPU_HALT */
+ .name = "bpf_pvsched_ops",
+};
--
2.40.1


2024-04-03 14:32:47

by Vineeth Remanan Pillai

Subject: [RFC PATCH v2 3/5] kvm: interface for managing pvsched driver for guest VMs

Implement ioctls for assigning and unassigning a pvsched driver for a
guest. VMMs would need to adopt these ioctls to support the feature.
Also add a temporary debugfs interface for managing this.

Ideally, the hypervisor would be able to determine the pvsched driver
based on the information received from the guest. Guest VMs with the
feature enabled would request the hypervisor to select a pvsched driver.
The ioctl API is an override mechanism that gives the admin more control.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/uapi/linux/kvm.h | 6 ++
virt/kvm/kvm_main.c | 117 +++++++++++++++++++++++++++++++++++++++
2 files changed, 123 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c3308536482b..4b29bdad4188 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2227,4 +2227,10 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+struct kvm_pvsched_ops {
+ __u8 ops_name[32]; /* PVSCHED_NAME_MAX */
+};
+
+#define KVM_GET_PVSCHED_OPS _IOR(KVMIO, 0xe4, struct kvm_pvsched_ops)
+#define KVM_REPLACE_PVSCHED_OPS _IOWR(KVMIO, 0xe5, struct kvm_pvsched_ops)
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0546814e4db7..b3d9c362d2e3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1223,6 +1223,79 @@ static void kvm_destroy_vm_debugfs(struct kvm *kvm)
}
}

+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+static int pvsched_vcpu_ops_show(struct seq_file *m, void *data)
+{
+ char ops_name[PVSCHED_NAME_MAX] = "";
+ struct pvsched_vcpu_ops *ops;
+ struct kvm *kvm = (struct kvm *) m->private;
+
+ rcu_read_lock();
+ ops = rcu_dereference(kvm->pvsched_ops);
+ if (ops)
+ strncpy(ops_name, ops->name, PVSCHED_NAME_MAX);
+ rcu_read_unlock();
+
+ seq_printf(m, "%s\n", ops_name);
+
+ return 0;
+}
+
+static ssize_t
+pvsched_vcpu_ops_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ int ret;
+ char *cmp;
+ char buf[PVSCHED_NAME_MAX];
+ struct inode *inode;
+ struct kvm *kvm;
+
+ if (cnt >= PVSCHED_NAME_MAX)
+ return -EINVAL;
+
+ if (copy_from_user(&buf, ubuf, cnt))
+ return -EFAULT;
+
+ buf[cnt] = '\0';
+ cmp = strstrip(buf);
+
+ inode = file_inode(filp);
+ inode_lock(inode);
+ kvm = (struct kvm *)inode->i_private;
+ ret = kvm_replace_pvsched_ops(kvm, cmp);
+ inode_unlock(inode);
+
+ if (ret)
+ return ret;
+
+ *ppos += cnt;
+ return cnt;
+}
+
+static int pvsched_vcpu_ops_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, pvsched_vcpu_ops_show, inode->i_private);
+}
+
+static const struct file_operations pvsched_vcpu_ops_fops = {
+ .open = pvsched_vcpu_ops_open,
+ .write = pvsched_vcpu_ops_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static void kvm_create_vm_pvsched_debugfs(struct kvm *kvm)
+{
+ debugfs_create_file("pvsched_vcpu_ops", 0644, kvm->debugfs_dentry, kvm,
+ &pvsched_vcpu_ops_fops);
+}
+#else
+static void kvm_create_vm_pvsched_debugfs(struct kvm *kvm)
+{
+}
+#endif
+
static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
{
static DEFINE_MUTEX(kvm_debugfs_lock);
@@ -1288,6 +1361,8 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
&stat_fops_per_vm);
}

+ kvm_create_vm_pvsched_debugfs(kvm);
+
ret = kvm_arch_create_vm_debugfs(kvm);
if (ret)
goto out_err;
@@ -5474,6 +5549,48 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_gmem_create(kvm, &guest_memfd);
break;
}
+#endif
+#ifdef CONFIG_PARAVIRT_SCHED_KVM
+ case KVM_REPLACE_PVSCHED_OPS: {
+ struct pvsched_vcpu_ops *ops;
+ struct kvm_pvsched_ops in_ops, out_ops;
+
+ r = -EFAULT;
+ if (copy_from_user(&in_ops, argp, sizeof(in_ops)))
+ goto out;
+
+ out_ops.ops_name[0] = 0;
+
+ rcu_read_lock();
+ ops = rcu_dereference(kvm->pvsched_ops);
+ if (ops)
+ strncpy(out_ops.ops_name, ops->name, PVSCHED_NAME_MAX);
+ rcu_read_unlock();
+
+ r = kvm_replace_pvsched_ops(kvm, (char *)in_ops.ops_name);
+ if (r)
+ goto out;
+
+ r = -EFAULT;
+ if (copy_to_user(argp, &out_ops, sizeof(out_ops)))
+ goto out;
+
+ r = 0;
+ break;
+ }
+ case KVM_GET_PVSCHED_OPS: {
+ struct pvsched_vcpu_ops *ops;
+ struct kvm_pvsched_ops out_ops;
+
+ out_ops.ops_name[0] = 0;
+ rcu_read_lock();
+ ops = rcu_dereference(kvm->pvsched_ops);
+ if (ops)
+ strncpy(out_ops.ops_name, ops->name, PVSCHED_NAME_MAX);
+ rcu_read_unlock();
+
+ r = -EFAULT;
+ if (copy_to_user(argp, &out_ops, sizeof(out_ops)))
+ goto out;
+
+ r = 0;
+ break;
+ }
#endif
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
--
2.40.1


2024-04-08 13:58:41

by Vineeth Remanan Pillai

Subject: Re: [RFC PATCH v2 1/5] pvsched: paravirt scheduling framework

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<[email protected]> wrote:
>
> Implement a paravirt scheduling framework for linux kernel.
>
> The framework allows for pvsched driver to register to the kernel and
> receive callbacks from hypervisor(eg: kvm) for interested vcpu events
> like VMENTER, VMEXIT etc.
>
> The framework also allows hypervisor to select a pvsched driver (from
> the available list of registered drivers) for each guest.
>
> Also implement a sysctl for listing the available pvsched drivers.
>
> Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> ---
> Kconfig | 2 +
> include/linux/pvsched.h | 102 +++++++++++++++++++
> kernel/sysctl.c | 27 +++++
> virt/Makefile | 2 +-
> virt/pvsched/Kconfig | 12 +++
> virt/pvsched/Makefile | 2 +
> virt/pvsched/pvsched.c | 215 ++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 361 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/pvsched.h
> create mode 100644 virt/pvsched/Kconfig
> create mode 100644 virt/pvsched/Makefile
> create mode 100644 virt/pvsched/pvsched.c
>
> diff --git a/Kconfig b/Kconfig
> index 745bc773f567..4a52eaa21166 100644
> --- a/Kconfig
> +++ b/Kconfig
> @@ -29,4 +29,6 @@ source "lib/Kconfig"
>
> source "lib/Kconfig.debug"
>
> +source "virt/pvsched/Kconfig"
> +
> source "Documentation/Kconfig"
> diff --git a/include/linux/pvsched.h b/include/linux/pvsched.h
> new file mode 100644
> index 000000000000..59df6b44aacb
> --- /dev/null
> +++ b/include/linux/pvsched.h
> @@ -0,0 +1,102 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2024 Google */
> +
> +#ifndef _LINUX_PVSCHED_H
> +#define _LINUX_PVSCHED_H 1
> +
> +/*
> + * List of events for which hypervisor calls back into pvsched driver.
> + * Driver can specify the events it is interested in.
> + */
> +enum pvsched_vcpu_events {
> + PVSCHED_VCPU_VMENTER = 0x1,
> + PVSCHED_VCPU_VMEXIT = 0x2,
> + PVSCHED_VCPU_HALT = 0x4,
> + PVSCHED_VCPU_INTR_INJ = 0x8,
> +};
> +
> +#define PVSCHED_NAME_MAX 32
> +#define PVSCHED_MAX 8
> +#define PVSCHED_DRV_BUF_MAX (PVSCHED_NAME_MAX * PVSCHED_MAX + PVSCHED_MAX)
> +
> +/*
> + * pvsched driver callbacks.
> + * TODO: versioning support for better compatibility with the guest
> + * component implementing this feature.
> + */
> +struct pvsched_vcpu_ops {
> + /*
> + * pvsched_vcpu_register() - Register the vcpu with pvsched driver.
> + * @pid: pid of the vcpu task.
> + *
> + * pvsched driver can store the pid internally and initialize
> + * itself to prepare for receiving callbacks from this vcpu.
> + */
> + int (*pvsched_vcpu_register)(struct pid *pid);
> +
> + /*
> + * pvsched_vcpu_unregister() - Un-register the vcpu with pvsched driver.
> + * @pid: pid of the vcpu task.
> + */
> + void (*pvsched_vcpu_unregister)(struct pid *pid);
> +
> + /*
> + * pvsched_vcpu_notify_event() - Callback for pvsched events
> + * @addr: Address of the memory region shared with guest
> + * @pid: pid of the vcpu task.
> + * @event: bit mask of the events that the hypervisor wants to notify.
> + */
> + void (*pvsched_vcpu_notify_event)(void *addr, struct pid *pid, u32 event);
> +
> + char name[PVSCHED_NAME_MAX];
> + struct module *owner;
> + struct list_head list;
> + u32 events;
> + u32 key;
> +};
> +
> +#ifdef CONFIG_PARAVIRT_SCHED_HOST
> +int pvsched_get_available_drivers(char *buf, size_t maxlen);
> +
> +int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops);
> +void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops);
> +
> +struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name);
> +void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops);
> +
> +static inline int pvsched_validate_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> + /*
> + * All callbacks are mandatory.
> + */
> + if (!ops->pvsched_vcpu_register || !ops->pvsched_vcpu_unregister ||
> + !ops->pvsched_vcpu_notify_event)
> + return -EINVAL;
> +
> + return 0;
> +}
> +#else
> +static inline int pvsched_get_available_drivers(char *buf, size_t maxlen)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> +}
> +
> +static inline struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name)
> +{
> + return NULL;
> +}
> +
> +static inline void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> +}
> +#endif
> +
> +#endif
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 157f7ce2942d..10a18a791b4f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -63,6 +63,7 @@
> #include <linux/mount.h>
> #include <linux/userfaultfd_k.h>
> #include <linux/pid.h>
> +#include <linux/pvsched.h>
>
> #include "../lib/kstrtox.h"
>
> @@ -1615,6 +1616,24 @@ int proc_do_static_key(struct ctl_table *table, int write,
> return ret;
> }
>
> +#ifdef CONFIG_PARAVIRT_SCHED_HOST
> +static int proc_pvsched_available_drivers(struct ctl_table *ctl,
> + int write, void *buffer,
> + size_t *lenp, loff_t *ppos)
> +{
> + struct ctl_table tbl = { .maxlen = PVSCHED_DRV_BUF_MAX, };
> + int ret;
> +
> + tbl.data = kmalloc(tbl.maxlen, GFP_USER);
> + if (!tbl.data)
> + return -ENOMEM;
> + pvsched_get_available_drivers(tbl.data, PVSCHED_DRV_BUF_MAX);
> + ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
> + kfree(tbl.data);
> + return ret;
> +}
> +#endif
> +
> static struct ctl_table kern_table[] = {
> {
> .procname = "panic",
> @@ -2033,6 +2052,14 @@ static struct ctl_table kern_table[] = {
> .extra1 = SYSCTL_ONE,
> .extra2 = SYSCTL_INT_MAX,
> },
> +#endif
> +#ifdef CONFIG_PARAVIRT_SCHED_HOST
> + {
> + .procname = "pvsched_available_drivers",
> + .maxlen = PVSCHED_DRV_BUF_MAX,
> + .mode = 0444,
> + .proc_handler = proc_pvsched_available_drivers,
> + },
> #endif
> { }
> };
> diff --git a/virt/Makefile b/virt/Makefile
> index 1cfea9436af9..9d0f32d775a1 100644
> --- a/virt/Makefile
> +++ b/virt/Makefile
> @@ -1,2 +1,2 @@
> # SPDX-License-Identifier: GPL-2.0-only
> -obj-y += lib/
> +obj-y += lib/ pvsched/
> diff --git a/virt/pvsched/Kconfig b/virt/pvsched/Kconfig
> new file mode 100644
> index 000000000000..5ca2669060cb
> --- /dev/null
> +++ b/virt/pvsched/Kconfig
> @@ -0,0 +1,12 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config PARAVIRT_SCHED_HOST
> + bool "Paravirt scheduling framework in the host kernel"
> + default n
> + help
> + Paravirtualized scheduling facilitates the exchange of scheduling
> + related information between the host and guest through shared memory,
> + enhancing the efficiency of vCPU thread scheduling by the hypervisor.
> + An illustrative use case involves dynamically boosting the priority of
> + a vCPU thread when the guest is executing a latency-sensitive workload
> + on that specific vCPU.
> + This config enables paravirt scheduling framework in the host kernel.
> diff --git a/virt/pvsched/Makefile b/virt/pvsched/Makefile
> new file mode 100644
> index 000000000000..4ca38e30479b
> --- /dev/null
> +++ b/virt/pvsched/Makefile
> @@ -0,0 +1,2 @@
> +
> +obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o
> diff --git a/virt/pvsched/pvsched.c b/virt/pvsched/pvsched.c
> new file mode 100644
> index 000000000000..610c85cf90d2
> --- /dev/null
> +++ b/virt/pvsched/pvsched.c
> @@ -0,0 +1,215 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024 Google */
> +
> +/*
> + * Paravirt scheduling framework
> + *
> + */
> +
> +/*
> + * Heavily inspired from tcp congestion avoidance implementation.
> + * (net/ipv4/tcp_cong.c)
> + */
> +
> +#define pr_fmt(fmt) "PVSCHED: " fmt
> +
> +#include <linux/module.h>
> +#include <linux/bpf.h>
> +#include <linux/gfp.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/jhash.h>
> +#include <linux/pvsched.h>
> +
> +static DEFINE_SPINLOCK(pvsched_drv_list_lock);
> +static int nr_pvsched_drivers = 0;
> +static LIST_HEAD(pvsched_drv_list);
> +
> +/*
> + * Retrieve pvsched_vcpu_ops given the name.
> + */
> +static struct pvsched_vcpu_ops *pvsched_find_vcpu_ops_name(char *name)
> +{
> + struct pvsched_vcpu_ops *ops;
> +
> + list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
> + if (strcmp(ops->name, name) == 0)
> + return ops;
> + }
> +
> + return NULL;
> +}
> +
> +/*
> + * Retrieve pvsched_vcpu_ops given the hash key.
> + */
> +static struct pvsched_vcpu_ops *pvsched_find_vcpu_ops_key(u32 key)
> +{
> + struct pvsched_vcpu_ops *ops;
> +
> + list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
> + if (ops->key == key)
> + return ops;
> + }
> +
> + return NULL;
> +}
> +
> +/*
> + * pvsched_get_available_drivers() - Copy space separated list of pvsched
> + * driver names.
> + * @buf: buffer to store the list of driver names
> + * @maxlen: size of the buffer
> + *
> + * Return: 0 on success, negative value on error.
> + */
> +int pvsched_get_available_drivers(char *buf, size_t maxlen)
> +{
> + struct pvsched_vcpu_ops *ops;
> + size_t offs = 0;
> +
> + if (!buf)
> + return -EINVAL;
> +
> + if (maxlen > PVSCHED_DRV_BUF_MAX)
> + maxlen = PVSCHED_DRV_BUF_MAX;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(ops, &pvsched_drv_list, list) {
> + offs += snprintf(buf + offs, maxlen - offs,
> + "%s%s",
> + offs == 0 ? "" : " ", ops->name);
> +
> + if (WARN_ON_ONCE(offs >= maxlen))
> + break;
> + }
> + rcu_read_unlock();
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(pvsched_get_available_drivers);
> +
> +/*
> + * pvsched_register_vcpu_ops() - Register the driver in the kernel.
> + * @ops: Driver data (callbacks)
> + *
> + * After the registration, driver will be exposed to the hypervisor
> + * for assignment to the guest VMs.
> + *
> + * Return: 0 on success, negative value on error.
> + */
> +int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> + int ret = 0;
> +
> + ops->key = jhash(ops->name, sizeof(ops->name), strlen(ops->name));
> + spin_lock(&pvsched_drv_list_lock);
> + if (nr_pvsched_drivers >= PVSCHED_MAX) {
> + ret = -ENOSPC;
> + } else if (pvsched_find_vcpu_ops_key(ops->key)) {
> + ret = -EEXIST;
> + } else if (!(ret = pvsched_validate_vcpu_ops(ops))) {
> + list_add_tail_rcu(&ops->list, &pvsched_drv_list);
> + nr_pvsched_drivers++;
> + }
> + spin_unlock(&pvsched_drv_list_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(pvsched_register_vcpu_ops);
> +
> +/*
> + * pvsched_unregister_vcpu_ops() - Un-register the driver from the kernel.
> + * @ops: Driver data (callbacks)
> + *
> + * After un-registration, driver will not be visible to hypervisor.
> + */
> +void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> + spin_lock(&pvsched_drv_list_lock);
> + list_del_rcu(&ops->list);
> + nr_pvsched_drivers--;
> + spin_unlock(&pvsched_drv_list_lock);
> +
> + synchronize_rcu();
> +}
> +EXPORT_SYMBOL_GPL(pvsched_unregister_vcpu_ops);
> +
> +/*
> + * pvsched_get_vcpu_ops: Acquire the driver.
> + * @name: Name of the driver to be acquired.
> + *
> + * Hypervisor can use this API to get the driver structure for
> + * assigning it to guest VMs. This API takes a reference on the
> + * module/bpf program so that driver doesn't vanish under the
> + * hypervisor.
> + *
> + * Return: driver structure if found, else NULL.
> + */
> +struct pvsched_vcpu_ops *pvsched_get_vcpu_ops(char *name)
> +{
> + struct pvsched_vcpu_ops *ops;
> +
> + if (!name || (strlen(name) >= PVSCHED_NAME_MAX))
> + return NULL;
> +
> + rcu_read_lock();
> + ops = pvsched_find_vcpu_ops_name(name);
> + if (!ops)
> + goto out;
> +
> + if (unlikely(!bpf_try_module_get(ops, ops->owner))) {
> + ops = NULL;
> + goto out;
> + }
> +
> +out:
> + rcu_read_unlock();
> + return ops;
> +}
> +EXPORT_SYMBOL_GPL(pvsched_get_vcpu_ops);
> +
> +/*
> + * pvsched_put_vcpu_ops: Release the driver.
> + * @ops: Driver to be released.
> + *
> + * Hypervisor can use this API to release the driver.
> + */
> +void pvsched_put_vcpu_ops(struct pvsched_vcpu_ops *ops)
> +{
> + bpf_module_put(ops, ops->owner);
> +}
> +EXPORT_SYMBOL_GPL(pvsched_put_vcpu_ops);
> +
> +/*
> + * NOP vm_ops Sample implementation.
> + * This driver doesn't do anything other than registering itself.
> + * Placeholder for adding some default logic when the feature is
> + * complete.
> + */
> +static int nop_pvsched_vcpu_register(struct pid *pid)
> +{
> + return 0;
> +}
> +static void nop_pvsched_vcpu_unregister(struct pid *pid)
> +{
> +}
> +static void nop_pvsched_notify_event(void *addr, struct pid *pid, u32 event)
> +{
> +}
> +
> +struct pvsched_vcpu_ops nop_vcpu_ops = {
> + .events = PVSCHED_VCPU_VMENTER | PVSCHED_VCPU_VMEXIT | PVSCHED_VCPU_HALT,
> + .pvsched_vcpu_register = nop_pvsched_vcpu_register,
> + .pvsched_vcpu_unregister = nop_pvsched_vcpu_unregister,
> + .pvsched_vcpu_notify_event = nop_pvsched_notify_event,
> + .name = "pvsched_nop",
> + .owner = THIS_MODULE,
> +};
> +
> +static int __init pvsched_init(void)
> +{
> + return WARN_ON(pvsched_register_vcpu_ops(&nop_vcpu_ops));
> +}
> +
> +late_initcall(pvsched_init);
> --
> 2.40.1
>

2024-04-08 13:59:52

by Vineeth Remanan Pillai

Subject: Re: [RFC PATCH v2 2/5] kvm: Implement the paravirt sched framework for kvm

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<[email protected]> wrote:
>
> kvm uses the kernel's paravirt sched framework to assign an available
> pvsched driver for a guest. Guest vcpus register with the pvsched
> driver, and kvm calls into the driver callbacks to notify it of the
> events that the driver is interested in.
>
> This PoC doesn't do the callback on interrupt injection yet. Will be
> implemented in subsequent iterations.
>
> Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> ---
> arch/x86/kvm/Kconfig | 13 ++++
> arch/x86/kvm/x86.c | 3 +
> include/linux/kvm_host.h | 32 +++++++++
> virt/kvm/kvm_main.c | 148 +++++++++++++++++++++++++++++++++++++++
> 4 files changed, 196 insertions(+)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 65ed14b6540b..c1776cdb5b65 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -189,4 +189,17 @@ config KVM_MAX_NR_VCPUS
> the memory footprint of each KVM guest, regardless of how many vCPUs are
> created for a given VM.
>
> +config PARAVIRT_SCHED_KVM
> + bool "Enable paravirt scheduling capability for kvm"
> + depends on KVM
> + default n
> + help
> + Paravirtualized scheduling facilitates the exchange of scheduling
> + related information between the host and guest through shared memory,
> + enhancing the efficiency of vCPU thread scheduling by the hypervisor.
> + An illustrative use case involves dynamically boosting the priority of
> + a vCPU thread when the guest is executing a latency-sensitive workload
> + on that specific vCPU.
> + This config enables paravirt scheduling in the kvm hypervisor.
> +
> endif # VIRTUALIZATION
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ffe580169c93..d0abc2c64d47 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10896,6 +10896,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
> preempt_disable();
>
> + kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_VMENTER);
> +
> static_call(kvm_x86_prepare_switch_to_guest)(vcpu);
>
> /*
> @@ -11059,6 +11061,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> guest_timing_exit_irqoff();
>
> local_irq_enable();
> + kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_VMEXIT);
> preempt_enable();
>
> kvm_vcpu_srcu_read_lock(vcpu);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 179df96b20f8..6381569f3de8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -45,6 +45,8 @@
> #include <asm/kvm_host.h>
> #include <linux/kvm_dirty_ring.h>
>
> +#include <linux/pvsched.h>
> +
> #ifndef KVM_MAX_VCPU_IDS
> #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> #endif
> @@ -832,6 +834,11 @@ struct kvm {
> bool vm_bugged;
> bool vm_dead;
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> + spinlock_t pvsched_ops_lock;
> + struct pvsched_vcpu_ops __rcu *pvsched_ops;
> +#endif
> +
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> struct notifier_block pm_notifier;
> #endif
> @@ -2413,4 +2420,29 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_PRIVATE_MEM */
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events);
> +int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu);
> +void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu);
> +
> +int kvm_replace_pvsched_ops(struct kvm *kvm, char *name);
> +#else
> +static inline int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events)
> +{
> + return 0;
> +}
> +static inline int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu)
> +{
> + return 0;
> +}
> +static inline void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu)
> +{
> +}
> +
> +static inline int kvm_replace_pvsched_ops(struct kvm *kvm, char *name)
> +{
> + return 0;
> +}
> +#endif
> +
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0f50960b0e3a..0546814e4db7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -170,6 +170,142 @@ bool kvm_is_zone_device_page(struct page *page)
> return is_zone_device_page(page);
> }
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> +typedef enum {
> + PVSCHED_CB_REGISTER = 1,
> + PVSCHED_CB_UNREGISTER = 2,
> + PVSCHED_CB_NOTIFY = 3
> +} pvsched_vcpu_callback_t;
> +
> +/*
> + * Helper function to invoke the pvsched driver callback.
> + */
> +static int __vcpu_pvsched_callback(struct kvm_vcpu *vcpu, u32 events,
> + pvsched_vcpu_callback_t action)
> +{
> + int ret = 0;
> + struct pid *pid;
> + struct pvsched_vcpu_ops *ops;
> +
> + rcu_read_lock();
> + ops = rcu_dereference(vcpu->kvm->pvsched_ops);
> + if (!ops) {
> + ret = -ENOENT;
> + goto out;
> + }
> +
> + pid = rcu_dereference(vcpu->pid);
> + if (WARN_ON_ONCE(!pid)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + get_pid(pid);
> + switch (action) {
> + case PVSCHED_CB_REGISTER:
> + ops->pvsched_vcpu_register(pid);
> + break;
> + case PVSCHED_CB_UNREGISTER:
> + ops->pvsched_vcpu_unregister(pid);
> + break;
> + case PVSCHED_CB_NOTIFY:
> + if (ops->events & events) {
> + ops->pvsched_vcpu_notify_event(
> + NULL, /* TODO: Pass guest allocated sharedmem addr */
> + pid,
> + ops->events & events);
> + }
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + }
> + put_pid(pid);
> +
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +
> +int kvm_vcpu_pvsched_notify(struct kvm_vcpu *vcpu, u32 events)
> +{
> + return __vcpu_pvsched_callback(vcpu, events, PVSCHED_CB_NOTIFY);
> +}
> +
> +int kvm_vcpu_pvsched_register(struct kvm_vcpu *vcpu)
> +{
> + /*
> + * TODO: Action if the registration fails?
> + */
> + return __vcpu_pvsched_callback(vcpu, 0, PVSCHED_CB_REGISTER);
> +}
> +
> +void kvm_vcpu_pvsched_unregister(struct kvm_vcpu *vcpu)
> +{
> + __vcpu_pvsched_callback(vcpu, 0, PVSCHED_CB_UNREGISTER);
> +}
> +
> +/*
> + * Replaces the VM's current pvsched driver.
> + * if name is NULL or empty string, unassign the
> + * current driver.
> + */
> +int kvm_replace_pvsched_ops(struct kvm *kvm, char *name)
> +{
> + int ret = 0;
> + unsigned long i;
> + struct kvm_vcpu *vcpu = NULL;
> + struct pvsched_vcpu_ops *ops = NULL, *prev_ops;
> +
> +
> + spin_lock(&kvm->pvsched_ops_lock);
> +
> + prev_ops = rcu_dereference(kvm->pvsched_ops);
> +
> + /*
> + * Unassign operation if the passed in value is
> + * NULL or an empty string.
> + */
> + if (name && *name) {
> + ops = pvsched_get_vcpu_ops(name);
> + if (!ops) {
> + ret = -EINVAL;
> + goto out;
> + }
> + }
> +
> + if (prev_ops) {
> + /*
> + * Unregister current pvsched driver.
> + */
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + kvm_vcpu_pvsched_unregister(vcpu);
> + }
> +
> + pvsched_put_vcpu_ops(prev_ops);
> + }
> +
> +
> + rcu_assign_pointer(kvm->pvsched_ops, ops);
> + if (ops) {
> + /*
> + * Register new pvsched driver.
> + */
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + WARN_ON_ONCE(kvm_vcpu_pvsched_register(vcpu));
> + }
> + }
> +
> +out:
> + spin_unlock(&kvm->pvsched_ops_lock);
> +
> + if (ret)
> + return ret;
> +
> + synchronize_rcu();
> +
> + return 0;
> +}
> +#endif
> +
> /*
> * Returns a 'struct page' if the pfn is "valid" and backed by a refcounted
> * page, NULL otherwise. Note, the list of refcounted PG_reserved page types
> @@ -508,6 +644,8 @@ static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
> kvm_arch_vcpu_destroy(vcpu);
> kvm_dirty_ring_free(&vcpu->dirty_ring);
>
> + kvm_vcpu_pvsched_unregister(vcpu);
> +
> /*
> * No need for rcu_read_lock as VCPU_RUN is the only place that changes
> * the vcpu->pid pointer, and at destruction time all file descriptors
> @@ -1221,6 +1359,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>
> BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>
> +#ifdef CONFIG_PARAVIRT_SCHED_KVM
> + spin_lock_init(&kvm->pvsched_ops_lock);
> +#endif
> +
> /*
> * Force subsequent debugfs file creations to fail if the VM directory
> * is not created (by kvm_create_vm_debugfs()).
> @@ -1343,6 +1485,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
> int i;
> struct mm_struct *mm = kvm->mm;
>
> + kvm_replace_pvsched_ops(kvm, NULL);
> +
> kvm_destroy_pm_notifier(kvm);
> kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> kvm_destroy_vm_debugfs(kvm);
> @@ -3779,6 +3923,8 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
> if (kvm_vcpu_check_block(vcpu) < 0)
> break;
>
> + kvm_vcpu_pvsched_notify(vcpu, PVSCHED_VCPU_HALT);
> +
> waited = true;
> schedule();
> }
> @@ -4434,6 +4580,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
> /* The thread running this VCPU changed. */
> struct pid *newpid;
>
> + kvm_vcpu_pvsched_unregister(vcpu);
> r = kvm_arch_vcpu_run_pid_change(vcpu);
> if (r)
> break;
> @@ -4442,6 +4589,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
> rcu_assign_pointer(vcpu->pid, newpid);
> if (oldpid)
> synchronize_rcu();
> + kvm_vcpu_pvsched_register(vcpu);
> put_pid(oldpid);
> }
> r = kvm_arch_vcpu_ioctl_run(vcpu);
> --
> 2.40.1
>

2024-04-08 14:00:38

by Vineeth Remanan Pillai

Subject: Re: [RFC PATCH v2 3/5] kvm: interface for managing pvsched driver for guest VMs

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<[email protected]> wrote:
>
> Implement ioctls for assigning and unassigning a pvsched driver for a
> guest. VMMs would need to adopt these ioctls to support the feature.
> Also add a temporary debugfs interface for managing this.
>
> Ideally, the hypervisor would be able to determine the pvsched driver
> based on the information received from the guest. Guest VMs with the
> feature enabled would request the hypervisor to select a pvsched driver.
> The ioctl API is an override mechanism that gives the admin more control.
>
> Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>

2024-04-08 14:01:41

by Vineeth Remanan Pillai

Subject: Re: [RFC PATCH v2 4/5] pvsched: bpf support for pvsched

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<[email protected]> wrote:
>
> Add support for implementing bpf pvsched drivers. bpf programs can use
> struct_ops to define the callbacks of a pvsched driver.
>
> This is only a skeleton of the bpf framework for pvsched. Some
> verification details are not implemented yet.
>
> Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> ---
> kernel/bpf/bpf_struct_ops_types.h | 4 +
> virt/pvsched/Makefile | 2 +-
> virt/pvsched/pvsched_bpf.c | 141 ++++++++++++++++++++++++++++++
> 3 files changed, 146 insertions(+), 1 deletion(-)
> create mode 100644 virt/pvsched/pvsched_bpf.c
>
> diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
> index 5678a9ddf817..9d5e4d1a331a 100644
> --- a/kernel/bpf/bpf_struct_ops_types.h
> +++ b/kernel/bpf/bpf_struct_ops_types.h
> @@ -9,4 +9,8 @@ BPF_STRUCT_OPS_TYPE(bpf_dummy_ops)
> #include <net/tcp.h>
> BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
> #endif
> +#ifdef CONFIG_PARAVIRT_SCHED_HOST
> +#include <linux/pvsched.h>
> +BPF_STRUCT_OPS_TYPE(pvsched_vcpu_ops)
> +#endif
> #endif
> diff --git a/virt/pvsched/Makefile b/virt/pvsched/Makefile
> index 4ca38e30479b..02bc072cd806 100644
> --- a/virt/pvsched/Makefile
> +++ b/virt/pvsched/Makefile
> @@ -1,2 +1,2 @@
>
> -obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o
> +obj-$(CONFIG_PARAVIRT_SCHED_HOST) += pvsched.o pvsched_bpf.o
> diff --git a/virt/pvsched/pvsched_bpf.c b/virt/pvsched/pvsched_bpf.c
> new file mode 100644
> index 000000000000..b125089abc3b
> --- /dev/null
> +++ b/virt/pvsched/pvsched_bpf.c
> @@ -0,0 +1,141 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2024 Google */
> +
> +#include <linux/types.h>
> +#include <linux/bpf_verifier.h>
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/filter.h>
> +#include <linux/pvsched.h>
> +
> +
> +/* "extern" is to avoid sparse warning. It is only used in bpf_struct_ops.c. */
> +extern struct bpf_struct_ops bpf_pvsched_vcpu_ops;
> +
> +static int bpf_pvsched_vcpu_init(struct btf *btf)
> +{
> + return 0;
> +}
> +
> +static bool bpf_pvsched_vcpu_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + const struct bpf_prog *prog,
> + struct bpf_insn_access_aux *info)
> +{
> + if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
> + return false;
> + if (type != BPF_READ)
> + return false;
> + if (off % size != 0)
> + return false;
> +
> + if (!btf_ctx_access(off, size, type, prog, info))
> + return false;
> +
> + return true;
> +}
> +
> +static int bpf_pvsched_vcpu_btf_struct_access(struct bpf_verifier_log *log,
> + const struct bpf_reg_state *reg,
> + int off, int size)
> +{
> + /*
> + * TODO: Enable write access to Guest shared mem.
> + */
> + return -EACCES;
> +}
> +
> +static const struct bpf_func_proto *
> +bpf_pvsched_vcpu_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> + return bpf_base_func_proto(func_id);
> +}
> +
> +static const struct bpf_verifier_ops bpf_pvsched_vcpu_verifier_ops = {
> + .get_func_proto = bpf_pvsched_vcpu_get_func_proto,
> + .is_valid_access = bpf_pvsched_vcpu_is_valid_access,
> + .btf_struct_access = bpf_pvsched_vcpu_btf_struct_access,
> +};
> +
> +static int bpf_pvsched_vcpu_init_member(const struct btf_type *t,
> + const struct btf_member *member,
> + void *kdata, const void *udata)
> +{
> + const struct pvsched_vcpu_ops *uvm_ops;
> + struct pvsched_vcpu_ops *vm_ops;
> + u32 moff;
> +
> + uvm_ops = (const struct pvsched_vcpu_ops *)udata;
> + vm_ops = (struct pvsched_vcpu_ops *)kdata;
> +
> + moff = __btf_member_bit_offset(t, member) / 8;
> + switch (moff) {
> + case offsetof(struct pvsched_vcpu_ops, events):
> + vm_ops->events = *(u32 *)(udata + moff);
> + return 1;
> + case offsetof(struct pvsched_vcpu_ops, name):
> + if (bpf_obj_name_cpy(vm_ops->name, uvm_ops->name,
> + sizeof(vm_ops->name)) <= 0)
> + return -EINVAL;
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> +static int bpf_pvsched_vcpu_check_member(const struct btf_type *t,
> + const struct btf_member *member,
> + const struct bpf_prog *prog)
> +{
> + return 0;
> +}
> +
> +static int bpf_pvsched_vcpu_reg(void *kdata)
> +{
> + return pvsched_register_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
> +}
> +
> +static void bpf_pvsched_vcpu_unreg(void *kdata)
> +{
> + pvsched_unregister_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
> +}
> +
> +static int bpf_pvsched_vcpu_validate(void *kdata)
> +{
> + return pvsched_validate_vcpu_ops((struct pvsched_vcpu_ops *)kdata);
> +}
> +
> +static int bpf_pvsched_vcpu_update(void *kdata, void *old_kdata)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static int __pvsched_vcpu_register(struct pid *pid)
> +{
> + return 0;
> +}
> +static void __pvsched_vcpu_unregister(struct pid *pid)
> +{
> +}
> +static void __pvsched_notify_event(void *addr, struct pid *pid, u32 event)
> +{
> +}
> +
> +static struct pvsched_vcpu_ops __bpf_ops_pvsched_vcpu_ops = {
> + .pvsched_vcpu_register = __pvsched_vcpu_register,
> + .pvsched_vcpu_unregister = __pvsched_vcpu_unregister,
> + .pvsched_vcpu_notify_event = __pvsched_notify_event,
> +};
> +
> +struct bpf_struct_ops bpf_pvsched_vcpu_ops = {
> + .init = &bpf_pvsched_vcpu_init,
> + .validate = bpf_pvsched_vcpu_validate,
> + .update = bpf_pvsched_vcpu_update,
> + .verifier_ops = &bpf_pvsched_vcpu_verifier_ops,
> + .reg = bpf_pvsched_vcpu_reg,
> + .unreg = bpf_pvsched_vcpu_unreg,
> + .check_member = bpf_pvsched_vcpu_check_member,
> + .init_member = bpf_pvsched_vcpu_init_member,
> + .name = "pvsched_vcpu_ops",
> + .cfi_stubs = &__bpf_ops_pvsched_vcpu_ops,
> +};
> --
> 2.40.1
>

2024-04-08 14:25:36

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)

Sorry I missed sched_ext folks, adding them as well.

Thanks,
Vineeth


On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<[email protected]> wrote:
>
> Double scheduling is a concern with virtualization hosts where the host
> schedules vcpus without knowing what's run by the vcpu and the guest
> schedules tasks without knowing where the vcpu is physically running. This
> causes issues related to latencies, power consumption, resource
> utilization, etc. An ideal solution would be a cooperative scheduling
> framework where the guest and host share scheduling-related information
> and make educated scheduling decisions to optimally handle the workloads.
> As a first step, we are taking a stab at reducing latencies for
> latency-sensitive workloads in the guest.
>
> v1 RFC[1] was posted in December 2023. The main disagreement was with the
> implementation: the patch was making scheduling policy decisions in kvm,
> and kvm is not the right place to do that. The suggestion was to move the
> policy decisions outside of kvm and let kvm handle only the notifications
> needed to make the policy decisions. This patch series is an iterative
> step towards implementing the feature as a layered design where the
> policy could be implemented outside of kvm as a kernel built-in, a kernel
> module or a bpf program.
>
> This design mainly comprises four components:
>
> - pvsched driver: Implements the scheduling policies. Registers with the
>   host a set of callbacks that the hypervisor (kvm) can use to notify it
>   of vcpu events the driver is interested in. The callbacks are passed
>   the address of the shared memory so that the driver can read the
>   scheduling information shared by the guest and also update the
>   scheduling policies it sets.
> - kvm component: Selects the pvsched driver for a guest and notifies the
>   driver via callbacks of the events the driver is interested in. Also
>   interfaces with the guest to retrieve the shared memory region used for
>   sharing the scheduling information.
> - host kernel component: Implements the APIs for:
>   - the pvsched driver to register/unregister with the host kernel, and
>   - the hypervisor to assign/unassign a driver for guests.
> - guest component: Implements a framework for sharing the scheduling
>   information with the pvsched driver through kvm.
>
> There is another component that we refer to as the pvsched protocol. This
> defines the details of the shared memory layout, the information shared
> and the scheduling policy decisions. The protocol need not be part of the
> kernel and can be defined separately based on the use case and
> requirements. Both the guest and the selected pvsched driver need to
> match the protocol for the feature to work. A protocol shall be
> identified by a name and possibly a versioning scheme. The guest will
> advertise the protocol, and the hypervisor can then assign a driver
> implementing that protocol if one is registered in the host kernel.
>
> This patch series only implements the first 3 components. Guest side
> implementation and the protocol framework shall come as a separate
> series once we finalize rest of the design.
>
> This series also implements a sample bpf program and a kernel-builtin
> pvsched driver. They do not do anything real yet; they are just skeletons
> to demonstrate the feature.
>
> Rebased on 6.8.2.
>
> [1]: https://lwn.net/Articles/955145/
>
> Vineeth Pillai (Google) (5):
> pvsched: paravirt scheduling framework
> kvm: Implement the paravirt sched framework for kvm
> kvm: interface for managing pvsched driver for guest VMs
> pvsched: bpf support for pvsched
> selftests/bpf: sample implementation of a bpf pvsched driver.
>
> Kconfig | 2 +
> arch/x86/kvm/Kconfig | 13 +
> arch/x86/kvm/x86.c | 3 +
> include/linux/kvm_host.h | 32 +++
> include/linux/pvsched.h | 102 +++++++
> include/uapi/linux/kvm.h | 6 +
> kernel/bpf/bpf_struct_ops_types.h | 4 +
> kernel/sysctl.c | 27 ++
> .../testing/selftests/bpf/progs/bpf_pvsched.c | 37 +++
> virt/Makefile | 2 +-
> virt/kvm/kvm_main.c | 265 ++++++++++++++++++
> virt/pvsched/Kconfig | 12 +
> virt/pvsched/Makefile | 2 +
> virt/pvsched/pvsched.c | 215 ++++++++++++++
> virt/pvsched/pvsched_bpf.c | 141 ++++++++++
> 15 files changed, 862 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/pvsched.h
> create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c
> create mode 100644 virt/pvsched/Kconfig
> create mode 100644 virt/pvsched/Makefile
> create mode 100644 virt/pvsched/pvsched.c
> create mode 100644 virt/pvsched/pvsched_bpf.c
>
> --
> 2.40.1
>

2024-04-08 14:28:05

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v2 5/5] selftests/bpf: sample implementation of a bpf pvsched driver.

Adding sched_ext folks

On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<[email protected]> wrote:
>
> A dummy skeleton of a bpf pvsched driver. This is just for demonstration
> purposes and would need more work to be included as a test for this
> feature.
>
> Not-Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> ---
> .../testing/selftests/bpf/progs/bpf_pvsched.c | 37 +++++++++++++++++++
> 1 file changed, 37 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c
>
> diff --git a/tools/testing/selftests/bpf/progs/bpf_pvsched.c b/tools/testing/selftests/bpf/progs/bpf_pvsched.c
> new file mode 100644
> index 000000000000..a653baa3034b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_pvsched.c
> @@ -0,0 +1,37 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2019 Facebook */
> +
> +#include "vmlinux.h"
> +#include "bpf_tracing_net.h"
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_helpers.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("struct_ops/pvsched_vcpu_reg")
> +int BPF_PROG(pvsched_vcpu_reg, struct pid *pid)
> +{
> + bpf_printk("pvsched_vcpu_reg: pid: %p", pid);
> + return 0;
> +}
> +
> +SEC("struct_ops/pvsched_vcpu_unreg")
> +void BPF_PROG(pvsched_vcpu_unreg, struct pid *pid)
> +{
> + bpf_printk("pvsched_vcpu_unreg: pid: %p", pid);
> +}
> +
> +SEC("struct_ops/pvsched_vcpu_notify_event")
> +void BPF_PROG(pvsched_vcpu_notify_event, void *addr, struct pid *pid, __u32 event)
> +{
> + bpf_printk("pvsched_vcpu_notify: pid: %p, event:%u", pid, event);
> +}
> +
> +SEC(".struct_ops")
> +struct pvsched_vcpu_ops pvsched_ops = {
> + .pvsched_vcpu_register = (void *)pvsched_vcpu_reg,
> + .pvsched_vcpu_unregister = (void *)pvsched_vcpu_unreg,
> + .pvsched_vcpu_notify_event = (void *)pvsched_vcpu_notify_event,
> + .events = 0x6,
> + .name = "bpf_pvsched_ops",
> +};
> --
> 2.40.1
>

2024-05-01 15:29:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)

On Wed, Apr 03, 2024, Vineeth Pillai (Google) wrote:
> Double scheduling is a concern with virtualization hosts where the host
> schedules vcpus without knowing what's run by the vcpu and the guest
> schedules tasks without knowing where the vcpu is physically running. This
> causes issues related to latencies, power consumption, resource
> utilization, etc. An ideal solution would be a cooperative scheduling
> framework where the guest and host share scheduling-related information
> and make educated scheduling decisions to optimally handle the workloads.
> As a first step, we are taking a stab at reducing latencies for
> latency-sensitive workloads in the guest.
>
> v1 RFC[1] was posted in December 2023. The main disagreement was with the
> implementation: the patch was making scheduling policy decisions in kvm,
> and kvm is not the right place to do that. The suggestion was to move the
> policy decisions outside of kvm and let kvm handle only the notifications
> needed to make the policy decisions. This patch series is an iterative
> step towards implementing the feature as a layered design where the
> policy could be implemented outside of kvm as a kernel built-in, a kernel
> module or a bpf program.
>
> This design mainly comprises four components:
>
> - pvsched driver: Implements the scheduling policies. Registers with the
>   host a set of callbacks that the hypervisor (kvm) can use to notify it
>   of vcpu events the driver is interested in. The callbacks are passed
>   the address of the shared memory so that the driver can read the
>   scheduling information shared by the guest and also update the
>   scheduling policies it sets.
> - kvm component: Selects the pvsched driver for a guest and notifies the
>   driver via callbacks of the events the driver is interested in. Also
>   interfaces with the guest to retrieve the shared memory region used for
>   sharing the scheduling information.
> - host kernel component: Implements the APIs for:
>   - the pvsched driver to register/unregister with the host kernel, and
>   - the hypervisor to assign/unassign a driver for guests.
> - guest component: Implements a framework for sharing the scheduling
>   information with the pvsched driver through kvm.

Roughly summarizing an off-list discussion.

- Discovery of schedulers should be handled outside of KVM and the kernel, e.g.
similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.

- "Negotiating" features/hooks should also be handled outside of the kernel,
e.g. similar to how VirtIO devices negotiate features between host and guest.

- Pushing PV scheduler entities to KVM should either be done through an exported
API, e.g. if the scheduler is provided by a separate kernel module, or by a
KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).

I think those were the main takeaways? Vineeth and Joel, please chime in on
anything I've missed or misremembered.

The other reason I'm bringing this discussion back on-list is that I (very) briefly
discussed this with Paolo, and he pointed out the proposed rseq-based mechanism
that would allow userspace to request an extended time slice[*], and that if that
landed it would be easy-ish to reuse the interface for KVM's steal_time PV API.

I see that you're both on that thread, so presumably you're already aware of the
idea, but I wanted to bring it up here to make sure that we aren't trying to
design something that's more complex than is needed.

Specifically, if the guest has a generic way to request an extended time slice
(or boost its priority?), would that address your use cases? Or rather, how close
does it get you? E.g. the guest will have no way of requesting a larger time
slice or boosting priority when an event is _pending_ but not yet received by
the guest, but is that actually problematic in practice?

[*] https://lore.kernel.org/all/[email protected]

2024-05-02 13:43:59

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)

> > This design comprises mainly of 4 components:
> >
> > - pvsched driver: Implements the scheduling policies. Registers with the
> >   host a set of callbacks that the hypervisor (kvm) can use to notify it
> >   of vcpu events the driver is interested in. The callbacks are passed
> >   the address of the shared memory so that the driver can read the
> >   scheduling information shared by the guest and also update the
> >   scheduling policies it sets.
> > - kvm component: Selects the pvsched driver for a guest and notifies the
> >   driver via callbacks of the events the driver is interested in. Also
> >   interfaces with the guest to retrieve the shared memory region used
> >   for sharing the scheduling information.
> > - host kernel component: Implements the APIs for:
> >   - the pvsched driver to register/unregister with the host kernel, and
> >   - the hypervisor to assign/unassign a driver for guests.
> > - guest component: Implements a framework for sharing the scheduling
> >   information with the pvsched driver through kvm.
>
> Roughly summarizing an off-list discussion.
>
> - Discovery of schedulers should be handled outside of KVM and the kernel, e.g.
> similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.
>
> - "Negotiating" features/hooks should also be handled outside of the kernel,
> e.g. similar to how VirtIO devices negotiate features between host and guest.
>
> - Pushing PV scheduler entities to KVM should either be done through an exported
> API, e.g. if the scheduler is provided by a separate kernel module, or by a
> KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).
>
> I think those were the main takeaways? Vineeth and Joel, please chime in on
> anything I've missed or misremembered.
>
Thanks for the brief about the off-list discussion; all the points are
captured, with just some minor additions. The v2 implementation moved the
scheduling policies out of kvm into a separate entity called the pvsched
driver, which could be implemented as a kernel module or a bpf program.
But the handshake between guest and host to decide which pvsched driver
to attach was still going through kvm. So it was suggested to move this
handshake (discovery and negotiation) outside of kvm. The idea is to
have a virtual device exposed by the VMM which would take care of the
handshake. The guest driver for this device would talk to the device to
understand the pvsched details on the host and pass the shared memory
details. Once the handshake is completed, the device is responsible for
loading the pvsched driver (the bpf program or kernel module
implementing the policies). The pvsched driver will register to the
tracepoints exported by kvm and handle the callbacks from then on. The
scheduling will be taken care of by the host scheduler; the pvsched
driver on the host is responsible only for setting the policies
(placement, priorities, etc.).

With the above approach, the only change in kvm would be the internal
tracepoints for pvsched. The host kernel will also be unchanged, and all
the complexity moves to the VMM and the pvsched driver. The guest kernel
will have a new driver to talk to the virtual pvsched device, and this
driver would hook into the guest kernel to pass scheduling information
to the host (via tracepoints).

> The other reason I'm bringing this discussion back on-list is that I (very) briefly
> discussed this with Paolo, and he pointed out the proposed rseq-based mechanism
> that would allow userspace to request an extended time slice[*], and that if that
> landed it would be easy-ish to reuse the interface for KVM's steal_time PV API.
>
> I see that you're both on that thread, so presumably you're already aware of the
> idea, but I wanted to bring it up here to make sure that we aren't trying to
> design something that's more complex than is needed.
>
> Specifically, if the guest has a generic way to request an extended time slice
> (or boost its priority?), would that address your use cases? Or rather, how close
> does it get you? E.g. the guest will have no way of requesting a larger time
> slice or boosting priority when an event is _pending_ but not yet received by
> the guest, but is that actually problematic in practice?
>
> [*] https://lore.kernel.org/all/[email protected]
>
Thanks for bringing this up. We were also very much interested in this
feature and were planning to use the pvmem shared memory instead of the
rseq framework for guests. The motivation for the paravirt scheduling
framework was a bit broader than the latency issues, and hence we were
proposing a somewhat more complex design. Other than the use case of
temporarily extending the time slice of vcpus, we were also looking at
vcpu placement on physical cpus, and at educated decisions the guest
scheduler could make if it had a picture of the host cpu load, etc.
Having a paravirt mechanism to share scheduling information would
benefit such cases. Once we have this framework set up, the policy
implementation on guest and host could be taken care of by other
entities like BPF programs, modules or schedulers like sched_ext.

We are working on a v3 incorporating the above ideas and will be
posting a design RFC shortly. Thanks for all the help and inputs on
this.

Thanks,
Vineeth