2023-03-30 22:46:18

by David Dai

Subject: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

Hi,

This patch series is a continuation of the talk Saravana gave at LPC 2022
titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
of the talk is that workloads running in a guest VM get terrible task
placement and DVFS behavior compared to running the same workload in the
host. Effectively, there is no EAS for threads inside VMs. This makes power
and performance terrible just by running the workload in a VM, even if we
assume zero virtualization overhead.

We have been iterating over different options for communicating between
guest and host, and over ways of applying the information coming from the
guest/host, to figure out the best performance and power improvements we
could get.

The patch series in its current state is NOT meant for landing in the
upstream kernel. We are sending this patch series to share the current
progress and data we have so far. The patch series is meant to be easy to
cherry-pick and test on various devices to see what performance and power
benefits it might give for others.

With this series, a workload running in a VM gets the same task placement
and DVFS treatment as it would when running in the host.

As expected, we see significant performance improvement and a better
performance/power ratio. If anyone else wants to try this out for their VM
workloads and report findings, that'd be very much appreciated.

The idea is to improve VM CPUfreq/sched behavior by:
- Having the guest kernel do accurate load tracking by taking host CPU
  arch/type and frequency into account.
- Sharing vCPU run queue utilization information with the host so that the
  host can do proper frequency scaling and task placement on the host side.

Results:
========

As of right now, the best results have been with using hypercalls (more on
that below) to communicate between host and guest, treating the vCPU run
queue util similarly to util_est on the host-side vCPU thread. So that's
what this patch series does.

Let's look at the results for this series first, and then at the other
options we have tried or are trying out:

Use cases running Android inside a VM on a Chromebook:
======================================================

PCMark (Emulates real world usecases)
Higher is better
+-------------------+----------+------------+--------+
| Test Case (score) | Baseline | Util_guest | %delta |
+-------------------+----------+------------+--------+
| Weighted Total    |     6136 |       7274 |   +19% |
+-------------------+----------+------------+--------+
| Web Browsing      |     5558 |       6273 |   +13% |
+-------------------+----------+------------+--------+
| Video Editing     |     4921 |       5221 |    +6% |
+-------------------+----------+------------+--------+
| Writing           |     6864 |       8825 |   +29% |
+-------------------+----------+------------+--------+
| Photo Editing     |     7983 |      11593 |   +45% |
+-------------------+----------+------------+--------+
| Data Manipulation |     5814 |       6081 |    +5% |
+-------------------+----------+------------+--------+

PCMark Performance/mAh
Higher is better
+-----------+----------+------------+--------+
|           | Baseline | Util_guest | %delta |
+-----------+----------+------------+--------+
| Score/mAh |       79 |         88 |   +11% |
+-----------+----------+------------+--------+

Roblox
Higher is better
+-----+----------+------------+--------+
|     | Baseline | Util_guest | %delta |
+-----+----------+------------+--------+
| FPS |    18.25 |      28.66 |   +57% |
+-----+----------+------------+--------+

Roblox FPS/mAh
Higher is better
+-----+----------+------------+--------+
|     | Baseline | Util_guest | %delta |
+-----+----------+------------+--------+
| FPS |     0.15 |       0.19 |   +26% |
+-----+----------+------------+--------+

Use cases running a minimal system inside a VM on a Pixel 6:
============================================================

FIO
Higher is better
+----------------------+----------+------------+--------+
| Test Case (avg MB/s) | Baseline | Util_guest | %delta |
+----------------------+----------+------------+--------+
| Seq Write            |     9.27 |       12.6 |   +36% |
+----------------------+----------+------------+--------+
| Rand Write           |     9.34 |       11.9 |   +27% |
+----------------------+----------+------------+--------+
| Seq Read             |      106 |        124 |   +17% |
+----------------------+----------+------------+--------+
| Rand Read            |     33.6 |         35 |    +4% |
+----------------------+----------+------------+--------+

CPU-based ML Inference Benchmark
Lower is better
+-------------------------+----------+------------+--------+
| Test Case (ms)          | Baseline | Util_guest | %delta |
+-------------------------+----------+------------+--------+
| Cached Sample Inference |     2.57 |       1.75 |   -32% |
+-------------------------+----------+------------+--------+
| Small Sample Inference  |      6.8 |       5.57 |   -18% |
+-------------------------+----------+------------+--------+
| Large Sample Inference  |     31.2 |      26.58 |   -15% |
+-------------------------+----------+------------+--------+

These patches expect the host to:
- Affine vCPUs to specific clusters.
- Set vCPU capacity to match the host CPU they are running on.

To make this easy to do/try out, we have put up patches[4][5] to do this on
CrosVM. Once you pick up those patches, you can use options
"--host-cpu-topology" and "--virt-cpufreq" to achieve the above.

The patch series can be broken down as:

Patch 1: Add util_guest as an additional PELT signal for host vCPU threads
Patch 2: Hypercall for the guest to get the current pCPU's frequency
Patch 3: Send vCPU run queue util to the host and apply it as util_guest
Patch 4: Query the pCPU freq table from the guest (we'll move this to DT in
         the future)
Patch 5: Virtual cpufreq driver that uses the hypercalls to send util to
         the host and implements frequency invariance in the guest

Alternatives we have implemented and profiled:
==============================================

util_guest vs uclamp_min
========================

One suggestion at LPC was to use uclamp_min to apply the util info coming
from the guest. As we suspected, it doesn't perform as well because
uclamp_min is not additive, whereas the actual workload on the host CPU due
to the vCPU is additive to the existing workloads on the host. Uclamp_min
also has the undesirable side-effect of threads forked from the vCPU thread
inheriting whatever uclamp_min value the vCPU thread had and then getting
stuck with that uclamp_min value.

Below are some additional benchmark results comparing the uclamp_min
prototype (listed as Uclamp) with this series, using the same test
environment as before (including hypercalls).

As before, %delta is always comparing to baseline.

PCMark
Higher is better
+-------------------+----------+------------+--------+--------+--------+
| Test Case (score) | Baseline | Util_guest | %delta | Uclamp | %delta |
+-------------------+----------+------------+--------+--------+--------+
| Weighted Total    |     6136 |       7274 |   +19% |   6848 |   +12% |
+-------------------+----------+------------+--------+--------+--------+
| Web Browsing      |     5558 |       6273 |   +13% |   6050 |    +9% |
+-------------------+----------+------------+--------+--------+--------+
| Video Editing     |     4921 |       5221 |    +6% |   5091 |    +3% |
+-------------------+----------+------------+--------+--------+--------+
| Writing           |     6864 |       8825 |   +29% |   8523 |   +24% |
+-------------------+----------+------------+--------+--------+--------+
| Photo Editing     |     7983 |      11593 |   +45% |   9865 |   +24% |
+-------------------+----------+------------+--------+--------+--------+
| Data Manipulation |     5814 |       6081 |    +5% |   5836 |     0% |
+-------------------+----------+------------+--------+--------+--------+

PCMark Performance/mAh
Higher is better
+-----------+----------+------------+--------+--------+--------+
|           | Baseline | Util_guest | %delta | Uclamp | %delta |
+-----------+----------+------------+--------+--------+--------+
| Score/mAh |       79 |         88 |   +11% |     83 |    +7% |
+-----------+----------+------------+--------+--------+--------+

Hypercalls vs MMIO:
===================
We realize that hypercalls are not the recommended choice for this, and we
have no attachment to any communication method as long as it gives good
results.

We started off with hypercalls to see the best we could achieve if we
didn't have to context switch into host-side userspace.

To see the impact of switching from hypercalls to MMIO, we kept util_guest
and only switched from hypercall to MMIO. So in the results below:
- Hypercall = hypercall + util_guest
- MMIO = MMIO + util_guest

As before, %delta is always comparing to baseline.

PCMark
Higher is better
+-------------------+----------+-----------+--------+-------+--------+
| Test Case (score) | Baseline | Hypercall | %delta | MMIO  | %delta |
+-------------------+----------+-----------+--------+-------+--------+
| Weighted Total    |     6136 |      7274 |   +19% |  6867 |   +12% |
+-------------------+----------+-----------+--------+-------+--------+
| Web Browsing      |     5558 |      6273 |   +13% |  6035 |    +9% |
+-------------------+----------+-----------+--------+-------+--------+
| Video Editing     |     4921 |      5221 |    +6% |  5167 |    +5% |
+-------------------+----------+-----------+--------+-------+--------+
| Writing           |     6864 |      8825 |   +29% |  8529 |   +24% |
+-------------------+----------+-----------+--------+-------+--------+
| Photo Editing     |     7983 |     11593 |   +45% | 10812 |   +35% |
+-------------------+----------+-----------+--------+-------+--------+
| Data Manipulation |     5814 |      6081 |    +5% |  5327 |    -8% |
+-------------------+----------+-----------+--------+-------+--------+

PCMark Performance/mAh
Higher is better
+-----------+----------+-----------+--------+------+--------+
|           | Baseline | Hypercall | %delta | MMIO | %delta |
+-----------+----------+-----------+--------+------+--------+
| Score/mAh |       79 |        88 |   +11% |   83 |    +7% |
+-----------+----------+-----------+--------+------+--------+

Roblox
Higher is better
+-----+----------+-----------+--------+-------+--------+
|     | Baseline | Hypercall | %delta | MMIO  | %delta |
+-----+----------+-----------+--------+-------+--------+
| FPS |    18.25 |     28.66 |   +57% | 24.06 |   +32% |
+-----+----------+-----------+--------+-------+--------+

Roblox Frames/mAh
Higher is better
+------------+----------+-----------+--------+--------+--------+
|            | Baseline | Hypercall | %delta | MMIO   | %delta |
+------------+----------+-----------+--------+--------+--------+
| Frames/mAh |    91.25 |    114.64 |   +26% | 103.11 |   +13% |
+------------+----------+-----------+--------+--------+--------+

Next steps:
===========
We are continuing to look into communication mechanisms other than
hypercalls that are as efficient or more so while avoiding switches into
VMM userspace. Any input in this regard is greatly appreciated.

Thanks,
David & Saravana

[1] - https://lpc.events/event/16/contributions/1195/
[2] - https://lpc.events/event/16/contributions/1195/attachments/970/1893/LPC%202022%20-%20VM%20DVFS.pdf
[3] - https://www.youtube.com/watch?v=hIg_5bg6opU
[4] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4208668
[5] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4288027

David Dai (6):
sched/fair: Add util_guest for tasks
kvm: arm64: Add support for get_cur_cpufreq service
kvm: arm64: Add support for util_hint service
kvm: arm64: Add support for get_freqtbl service
dt-bindings: cpufreq: add bindings for virtual kvm cpufreq
cpufreq: add kvm-cpufreq driver

.../bindings/cpufreq/cpufreq-virtual-kvm.yaml | 39 +++
Documentation/virt/kvm/api.rst | 28 ++
.../virt/kvm/arm/get_cur_cpufreq.rst | 21 ++
Documentation/virt/kvm/arm/get_freqtbl.rst | 23 ++
Documentation/virt/kvm/arm/index.rst | 3 +
Documentation/virt/kvm/arm/util_hint.rst | 22 ++
arch/arm64/include/uapi/asm/kvm.h | 3 +
arch/arm64/kvm/arm.c | 3 +
arch/arm64/kvm/hypercalls.c | 60 +++++
drivers/cpufreq/Kconfig | 13 +
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/kvm-cpufreq.c | 245 ++++++++++++++++++
include/linux/arm-smccc.h | 21 ++
include/linux/sched.h | 12 +
include/uapi/linux/kvm.h | 3 +
kernel/sched/core.c | 24 +-
kernel/sched/fair.c | 15 +-
tools/arch/arm64/include/uapi/asm/kvm.h | 3 +
18 files changed, 536 insertions(+), 3 deletions(-)
create mode 100644 Documentation/devicetree/bindings/cpufreq/cpufreq-virtual-kvm.yaml
create mode 100644 Documentation/virt/kvm/arm/get_cur_cpufreq.rst
create mode 100644 Documentation/virt/kvm/arm/get_freqtbl.rst
create mode 100644 Documentation/virt/kvm/arm/util_hint.rst
create mode 100644 drivers/cpufreq/kvm-cpufreq.c

--
2.40.0.348.gf938b09366-goog


2023-03-30 22:46:44

by David Dai

Subject: [RFC PATCH 3/6] kvm: arm64: Add support for util_hint service

This service allows guests to send the utilization of workloads on their
vCPUs to the host. Utilization is represented as a value in the range
0-1024, where 1024 represents the highest performance point normalized for
frequency and architecture across all CPUs. The host uses this hint for
scheduling vCPU threads and selecting CPU frequency.

Co-developed-by: Saravana Kannan <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Signed-off-by: David Dai <[email protected]>
---
Documentation/virt/kvm/api.rst | 12 ++++++++++++
Documentation/virt/kvm/arm/index.rst | 1 +
Documentation/virt/kvm/arm/util_hint.rst | 22 ++++++++++++++++++++++
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hypercalls.c | 20 ++++++++++++++++++++
include/linux/arm-smccc.h | 7 +++++++
include/uapi/linux/kvm.h | 1 +
tools/arch/arm64/include/uapi/asm/kvm.h | 1 +
9 files changed, 66 insertions(+)
create mode 100644 Documentation/virt/kvm/arm/util_hint.rst

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index b0ff0ad700bf..38ce33564efc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8388,6 +8388,18 @@ must point to a byte where the value will be stored or retrieved from.
This capability indicates that KVM supports getting the
frequency of the current CPU that the vCPU thread is running on.

+8.41 KVM_CAP_UTIL_HINT
+----------------------
+
+:Architectures: arm64
+
+This capability indicates that KVM supports taking utilization
+hints from the guest. Utilization is represented as a value from 0-1024,
+where 1024 represents the highest performance point across all physical CPUs
+after normalizing for architecture. This is useful when guests are tracking
+workload on their vCPUs. Util hints allow the host to make more accurate
+frequency selections and task placement decisions for vCPU threads.
+
9. Known KVM API problems
=========================

diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
index 47afc5c1f24a..f83877663813 100644
--- a/Documentation/virt/kvm/arm/index.rst
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -12,3 +12,4 @@ ARM
pvtime
ptp_kvm
get_cur_cpufreq
+ util_hint
diff --git a/Documentation/virt/kvm/arm/util_hint.rst b/Documentation/virt/kvm/arm/util_hint.rst
new file mode 100644
index 000000000000..262d142d62d9
--- /dev/null
+++ b/Documentation/virt/kvm/arm/util_hint.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Util_hint support for arm64
+============================
+
+Util_hint is used for sharing the utilization value from the guest
+to the host.
+
+* ARM_SMCCC_VENDOR_HYP_KVM_UTIL_HINT_FUNC_ID: 0x86000041
+
+This hypercall uses the SMC32/HVC32 calling convention:
+
+ARM_SMCCC_VENDOR_HYP_KVM_UTIL_HINT_FUNC_ID
+    ============== =========  ============================
+    Function ID:   (uint32)   0x86000041
+    Arguments:     (uint32)   util value (0-1024) where 1024
+                              represents the highest performance
+                              point normalized across all CPUs
+    Return values: (int32)    NOT_SUPPORTED (-1) on error.
+    Endianness:               Must be the same endianness
+                              as the host.
+    ============== =========  ============================
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index ed8b63e91bdc..61309ecb7241 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -368,6 +368,7 @@ enum {
KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT = 0,
KVM_REG_ARM_VENDOR_HYP_BIT_PTP = 1,
KVM_REG_ARM_VENDOR_HYP_BIT_GET_CUR_CPUFREQ = 2,
+ KVM_REG_ARM_VENDOR_HYP_BIT_UTIL_HINT = 3,
#ifdef __KERNEL__
KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
#endif
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index f960b136c611..bf3c4d4b9b67 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -221,6 +221,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_PTP_KVM:
case KVM_CAP_ARM_SYSTEM_SUSPEND:
case KVM_CAP_GET_CUR_CPUFREQ:
+ case KVM_CAP_UTIL_HINT:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index b3f4b90c024b..01dba07b5183 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -28,6 +28,20 @@ static void kvm_sched_get_cur_cpufreq(struct kvm_vcpu *vcpu, u64 *val)
val[0] = ret_freq;
}

+static void kvm_sched_set_util(struct kvm_vcpu *vcpu, u64 *val)
+{
+ struct sched_attr attr = {
+ .sched_flags = SCHED_FLAG_UTIL_GUEST,
+ };
+ int ret;
+
+ attr.sched_util_min = smccc_get_arg1(vcpu);
+
+ ret = sched_setattr_nocheck(current, &attr);
+
+ val[0] = (u64)ret;
+}
+
static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val)
{
struct system_time_snapshot systime_snapshot;
@@ -131,6 +145,9 @@ static bool kvm_hvc_call_allowed(struct kvm_vcpu *vcpu, u32 func_id)
case ARM_SMCCC_VENDOR_HYP_KVM_GET_CUR_CPUFREQ_FUNC_ID:
return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_GET_CUR_CPUFREQ,
&smccc_feat->vendor_hyp_bmap);
+ case ARM_SMCCC_VENDOR_HYP_KVM_UTIL_HINT_FUNC_ID:
+ return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_UTIL_HINT,
+ &smccc_feat->vendor_hyp_bmap);
default:
return kvm_hvc_call_default_allowed(func_id);
}
@@ -231,6 +248,9 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
case ARM_SMCCC_VENDOR_HYP_KVM_GET_CUR_CPUFREQ_FUNC_ID:
kvm_sched_get_cur_cpufreq(vcpu, val);
break;
+ case ARM_SMCCC_VENDOR_HYP_KVM_UTIL_HINT_FUNC_ID:
+ kvm_sched_set_util(vcpu, val);
+ break;
case ARM_SMCCC_TRNG_VERSION:
case ARM_SMCCC_TRNG_FEATURES:
case ARM_SMCCC_TRNG_GET_UUID:
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index e15f1bdcf3f1..9f747e5025b6 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -113,6 +113,7 @@
#define ARM_SMCCC_KVM_FUNC_FEATURES 0
#define ARM_SMCCC_KVM_FUNC_PTP 1
#define ARM_SMCCC_KVM_FUNC_GET_CUR_CPUFREQ 64
+#define ARM_SMCCC_KVM_FUNC_UTIL_HINT 65
#define ARM_SMCCC_KVM_FUNC_FEATURES_2 127
#define ARM_SMCCC_KVM_NUM_FUNCS 128

@@ -145,6 +146,12 @@
ARM_SMCCC_OWNER_VENDOR_HYP, \
ARM_SMCCC_KVM_FUNC_GET_CUR_CPUFREQ)

+#define ARM_SMCCC_VENDOR_HYP_KVM_UTIL_HINT_FUNC_ID \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_32, \
+ ARM_SMCCC_OWNER_VENDOR_HYP, \
+ ARM_SMCCC_KVM_FUNC_UTIL_HINT)
+
/* Paravirtualised time calls (defined by ARM DEN0057A) */
#define ARM_SMCCC_HV_PV_TIME_FEATURES \
ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0a1a260243bf..7f667ab344ae 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1185,6 +1185,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
#define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
#define KVM_CAP_GET_CUR_CPUFREQ 512
+#define KVM_CAP_UTIL_HINT 513

#ifdef KVM_CAP_IRQ_ROUTING

diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h
index ed8b63e91bdc..61309ecb7241 100644
--- a/tools/arch/arm64/include/uapi/asm/kvm.h
+++ b/tools/arch/arm64/include/uapi/asm/kvm.h
@@ -368,6 +368,7 @@ enum {
KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT = 0,
KVM_REG_ARM_VENDOR_HYP_BIT_PTP = 1,
KVM_REG_ARM_VENDOR_HYP_BIT_GET_CUR_CPUFREQ = 2,
+ KVM_REG_ARM_VENDOR_HYP_BIT_UTIL_HINT = 3,
#ifdef __KERNEL__
KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
#endif
--
2.40.0.348.gf938b09366-goog

2023-03-30 22:46:45

by David Dai

Subject: [RFC PATCH 2/6] kvm: arm64: Add support for get_cur_cpufreq service

This service allows guests to query the host for the frequency of the CPU
that the vCPU is currently running on.

Co-developed-by: Saravana Kannan <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Signed-off-by: David Dai <[email protected]>
---
Documentation/virt/kvm/api.rst | 8 +++++++
.../virt/kvm/arm/get_cur_cpufreq.rst | 21 +++++++++++++++++++
Documentation/virt/kvm/arm/index.rst | 1 +
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hypercalls.c | 18 ++++++++++++++++
include/linux/arm-smccc.h | 7 +++++++
include/uapi/linux/kvm.h | 1 +
tools/arch/arm64/include/uapi/asm/kvm.h | 1 +
9 files changed, 59 insertions(+)
create mode 100644 Documentation/virt/kvm/arm/get_cur_cpufreq.rst

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 62de0768d6aa..b0ff0ad700bf 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8380,6 +8380,14 @@ structure.
When getting the Modified Change Topology Report value, the attr->addr
must point to a byte where the value will be stored or retrieved from.

+8.40 KVM_CAP_GET_CUR_CPUFREQ
+----------------------------
+
+:Architectures: arm64
+
+This capability indicates that KVM supports getting the
+frequency of the current CPU that the vCPU thread is running on.
+
9. Known KVM API problems
=========================

diff --git a/Documentation/virt/kvm/arm/get_cur_cpufreq.rst b/Documentation/virt/kvm/arm/get_cur_cpufreq.rst
new file mode 100644
index 000000000000..06e0ed5b3868
--- /dev/null
+++ b/Documentation/virt/kvm/arm/get_cur_cpufreq.rst
@@ -0,0 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+get_cur_cpufreq support for arm/arm64
+=====================================
+
+Get_cur_cpufreq support is used to get the current frequency (in KHz) of
+the CPU that the vCPU thread is running on.
+
+* ARM_SMCCC_VENDOR_HYP_KVM_GET_CUR_CPUFREQ_FUNC_ID: 0x86000040
+
+This hypercall uses the SMC32/HVC32 calling convention:
+
+ARM_SMCCC_VENDOR_HYP_KVM_GET_CUR_CPUFREQ_FUNC_ID
+    ============== ========  =====================================
+    Function ID:   (uint32)  0x86000040
+    Return values: (int32)   NOT_SUPPORTED (-1) on error, or
+                   (uint32)  Frequency in KHz of the current CPU
+                             that the vCPU thread is running on.
+    Endianness:              Must be the same endianness
+                             as the host.
+    ============== ========  =====================================
diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
index e84848432158..47afc5c1f24a 100644
--- a/Documentation/virt/kvm/arm/index.rst
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -11,3 +11,4 @@ ARM
hypercalls
pvtime
ptp_kvm
+ get_cur_cpufreq
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index f8129c624b07..ed8b63e91bdc 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -367,6 +367,7 @@ enum {
enum {
KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT = 0,
KVM_REG_ARM_VENDOR_HYP_BIT_PTP = 1,
+ KVM_REG_ARM_VENDOR_HYP_BIT_GET_CUR_CPUFREQ = 2,
#ifdef __KERNEL__
KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
#endif
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 3bd732eaf087..f960b136c611 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VCPU_ATTRIBUTES:
case KVM_CAP_PTP_KVM:
case KVM_CAP_ARM_SYSTEM_SUSPEND:
+ case KVM_CAP_GET_CUR_CPUFREQ:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index 5da884e11337..b3f4b90c024b 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -3,6 +3,9 @@

#include <linux/arm-smccc.h>
#include <linux/kvm_host.h>
+#include <linux/cpufreq.h>
+#include <linux/sched.h>
+#include <uapi/linux/sched/types.h>

#include <asm/kvm_emulate.h>

@@ -16,6 +19,15 @@
#define KVM_ARM_SMCCC_VENDOR_HYP_FEATURES \
GENMASK(KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT - 1, 0)

+static void kvm_sched_get_cur_cpufreq(struct kvm_vcpu *vcpu, u64 *val)
+{
+ unsigned long ret_freq;
+
+ ret_freq = cpufreq_get(task_cpu(current));
+
+ val[0] = ret_freq;
+}
+
static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val)
{
struct system_time_snapshot systime_snapshot;
@@ -116,6 +128,9 @@ static bool kvm_hvc_call_allowed(struct kvm_vcpu *vcpu, u32 func_id)
case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_PTP,
&smccc_feat->vendor_hyp_bmap);
+ case ARM_SMCCC_VENDOR_HYP_KVM_GET_CUR_CPUFREQ_FUNC_ID:
+ return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_GET_CUR_CPUFREQ,
+ &smccc_feat->vendor_hyp_bmap);
default:
return kvm_hvc_call_default_allowed(func_id);
}
@@ -213,6 +228,9 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
kvm_ptp_get_time(vcpu, val);
break;
+ case ARM_SMCCC_VENDOR_HYP_KVM_GET_CUR_CPUFREQ_FUNC_ID:
+ kvm_sched_get_cur_cpufreq(vcpu, val);
+ break;
case ARM_SMCCC_TRNG_VERSION:
case ARM_SMCCC_TRNG_FEATURES:
case ARM_SMCCC_TRNG_GET_UUID:
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 220c8c60e021..e15f1bdcf3f1 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -112,6 +112,7 @@
/* KVM "vendor specific" services */
#define ARM_SMCCC_KVM_FUNC_FEATURES 0
#define ARM_SMCCC_KVM_FUNC_PTP 1
+#define ARM_SMCCC_KVM_FUNC_GET_CUR_CPUFREQ 64
#define ARM_SMCCC_KVM_FUNC_FEATURES_2 127
#define ARM_SMCCC_KVM_NUM_FUNCS 128

@@ -138,6 +139,12 @@
#define KVM_PTP_VIRT_COUNTER 0
#define KVM_PTP_PHYS_COUNTER 1

+#define ARM_SMCCC_VENDOR_HYP_KVM_GET_CUR_CPUFREQ_FUNC_ID \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_32, \
+ ARM_SMCCC_OWNER_VENDOR_HYP, \
+ ARM_SMCCC_KVM_FUNC_GET_CUR_CPUFREQ)
+
/* Paravirtualised time calls (defined by ARM DEN0057A) */
#define ARM_SMCCC_HV_PV_TIME_FEATURES \
ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d77aef872a0a..0a1a260243bf 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1184,6 +1184,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
#define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
+#define KVM_CAP_GET_CUR_CPUFREQ 512

#ifdef KVM_CAP_IRQ_ROUTING

diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h
index f8129c624b07..ed8b63e91bdc 100644
--- a/tools/arch/arm64/include/uapi/asm/kvm.h
+++ b/tools/arch/arm64/include/uapi/asm/kvm.h
@@ -367,6 +367,7 @@ enum {
enum {
KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT = 0,
KVM_REG_ARM_VENDOR_HYP_BIT_PTP = 1,
+ KVM_REG_ARM_VENDOR_HYP_BIT_GET_CUR_CPUFREQ = 2,
#ifdef __KERNEL__
KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
#endif
--
2.40.0.348.gf938b09366-goog

2023-03-30 22:46:56

by David Dai

Subject: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

For virtualization use cases, the util_est and util_avg signals currently
tracked on the host aren't sufficient to accurately represent the workload
on vCPU threads, which results in poor frequency selection and performance.
For example, when a large workload migrates from a busy vCPU thread to an
idle vCPU thread, it incurs additional DVFS ramp-up latency while util
accumulates.

Introduce a new "util_guest" member as an additional PELT signal that's
independently updated by the guest. When used, it's max aggregated to
provide a boost to both task_util and task_util_est.

Updating task_util and task_util_est ensures:
 - Better task placement decisions for vCPU threads on the host
 - Correct updates to util_est.ewma during dequeue
 - Additive util with other threads on the same runqueue for more
   accurate frequency responses

Co-developed-by: Saravana Kannan <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Signed-off-by: David Dai <[email protected]>
---
include/linux/sched.h | 11 +++++++++++
kernel/sched/core.c | 18 +++++++++++++++++-
kernel/sched/fair.c | 15 +++++++++++++--
3 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..d8c346fcdf52 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -445,6 +445,16 @@ struct util_est {
#define UTIL_AVG_UNCHANGED 0x80000000
} __attribute__((__aligned__(sizeof(u64))));

+/*
+ * For sched_setattr_nocheck() (kernel) only
+ *
+ * Allow vCPU threads to use UTIL_GUEST as a way to hint the scheduler with more
+ * accurate utilization info. This is useful when guest kernels have some way of
+ * tracking their own runqueue's utilization.
+ *
+ */
+#define SCHED_FLAG_UTIL_GUEST 0x20000000
+
/*
* The load/runnable/util_avg accumulates an infinite geometric series
* (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
@@ -499,6 +509,7 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_avg;
unsigned long util_avg;
+ unsigned long util_guest;
struct util_est util_est;
} ____cacheline_aligned;

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..7700ef5610c1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2024,6 +2024,16 @@ static inline void uclamp_post_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */

+static void __setscheduler_task_util(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+
+ if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_GUEST)))
+ return;
+
+ p->se.avg.util_guest = attr->sched_util_min;
+}
+
bool sched_task_on_rq(struct task_struct *p)
{
return task_on_rq_queued(p);
@@ -7561,7 +7571,7 @@ static int __sched_setscheduler(struct task_struct *p,
return -EINVAL;
}

- if (attr->sched_flags & ~(SCHED_FLAG_ALL | SCHED_FLAG_SUGOV))
+ if (attr->sched_flags & ~(SCHED_FLAG_ALL | SCHED_FLAG_SUGOV | SCHED_FLAG_UTIL_GUEST))
return -EINVAL;

/*
@@ -7583,6 +7593,9 @@ static int __sched_setscheduler(struct task_struct *p,
if (attr->sched_flags & SCHED_FLAG_SUGOV)
return -EINVAL;

+ if (attr->sched_flags & SCHED_FLAG_UTIL_GUEST)
+ return -EINVAL;
+
retval = security_task_setscheduler(p);
if (retval)
return retval;
@@ -7629,6 +7642,8 @@ static int __sched_setscheduler(struct task_struct *p,
goto change;
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
goto change;
+ if (attr->sched_flags & SCHED_FLAG_UTIL_GUEST)
+ goto change;

p->sched_reset_on_fork = reset_on_fork;
retval = 0;
@@ -7718,6 +7733,7 @@ static int __sched_setscheduler(struct task_struct *p,
__setscheduler_prio(p, newprio);
}
__setscheduler_uclamp(p, attr);
+ __setscheduler_task_util(p, attr);

if (queued) {
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6986ea31c984..998649554344 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);

static inline unsigned long task_util(struct task_struct *p)
{
- return READ_ONCE(p->se.avg.util_avg);
+ return max(READ_ONCE(p->se.avg.util_avg),
+ READ_ONCE(p->se.avg.util_guest));
}

static inline unsigned long _task_util_est(struct task_struct *p)
{
struct util_est ue = READ_ONCE(p->se.avg.util_est);

- return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
+ return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
+ max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
}

static inline unsigned long task_util_est(struct task_struct *p)
@@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
*/
util_est_enqueue(&rq->cfs, p);

+ /*
+ * The normal code path for host thread enqueue doesn't take into
+ * account guest task migrations when updating cpufreq util.
+ * So, always update the cpufreq when a vCPU thread has a
+ * non-zero util_guest value.
+ */
+ if (READ_ONCE(p->se.avg.util_guest))
+ cpufreq_update_util(rq, 0);
+
/*
* If in_iowait is set, the code below may not trigger any cpufreq
* utilization updates, so do it here explicitly with the IOWAIT flag
--
2.40.0.348.gf938b09366-goog

2023-03-30 22:47:25

by David Dai

Subject: [RFC PATCH 4/6] kvm: arm64: Add support for get_freqtbl service

This service allows guests to query the host for the frequency table
of the CPU that the vCPU is currently running on.

Co-developed-by: Saravana Kannan <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Signed-off-by: David Dai <[email protected]>
---
Documentation/virt/kvm/api.rst | 8 ++++++++
Documentation/virt/kvm/arm/get_freqtbl.rst | 23 ++++++++++++++++++++++
Documentation/virt/kvm/arm/index.rst | 1 +
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hypercalls.c | 22 +++++++++++++++++++++
include/linux/arm-smccc.h | 7 +++++++
include/uapi/linux/kvm.h | 1 +
tools/arch/arm64/include/uapi/asm/kvm.h | 1 +
9 files changed, 65 insertions(+)
create mode 100644 Documentation/virt/kvm/arm/get_freqtbl.rst

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38ce33564efc..8f905456e2b4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8400,6 +8400,14 @@ after normalizing for architecture. This is useful when guests are tracking
workload on its vCPUs. Util hints allow the host to make more accurate
frequency selections and task placement for vCPU threads.

+8.42 KVM_CAP_GET_CPUFREQ_TBL
+----------------------------
+
+:Architectures: arm64
+
+This capability indicates that KVM supports getting the frequency
+table of the CPU that the vCPU thread is currently running on.
+
9. Known KVM API problems
=========================

diff --git a/Documentation/virt/kvm/arm/get_freqtbl.rst b/Documentation/virt/kvm/arm/get_freqtbl.rst
new file mode 100644
index 000000000000..f6832d7566e7
--- /dev/null
+++ b/Documentation/virt/kvm/arm/get_freqtbl.rst
@@ -0,0 +1,23 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+get_freqtbl support for arm/arm64
+=================================
+
+Allows the guest to query the frequency table (in kHz) of the current CPU
+that the vCPU thread is running on.
+
+* ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID: 0x86000042
+
+This hypercall uses the SMC32/HVC32 calling convention:
+
+ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID
+ ============== ======== =====================================
+ Function ID: (uint32) 0x86000042
+ Arguments: (uint32) index into the current CPU's frequency table
+ Return Values: (int32) NOT_SUPPORTED(-1) on error, or
+ (uint32) frequency table entry (in kHz) at the
+ requested index for the current CPU (r1)
+ Endianness: Must be the same endianness
+ as the host.
+ ============== ======== =====================================
diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
index f83877663813..e2e56bb41491 100644
--- a/Documentation/virt/kvm/arm/index.rst
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -13,3 +13,4 @@ ARM
ptp_kvm
get_cur_cpufreq
util_hint
+ get_freqtbl
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 61309ecb7241..ed6f593264bd 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -369,6 +369,7 @@ enum {
KVM_REG_ARM_VENDOR_HYP_BIT_PTP = 1,
KVM_REG_ARM_VENDOR_HYP_BIT_GET_CUR_CPUFREQ = 2,
KVM_REG_ARM_VENDOR_HYP_BIT_UTIL_HINT = 3,
+ KVM_REG_ARM_VENDOR_HYP_BIT_GET_CPUFREQ_TBL = 4,
#ifdef __KERNEL__
KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
#endif
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index bf3c4d4b9b67..cd76128e4af4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -222,6 +222,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ARM_SYSTEM_SUSPEND:
case KVM_CAP_GET_CUR_CPUFREQ:
case KVM_CAP_UTIL_HINT:
+ case KVM_CAP_GET_CPUFREQ_TBL:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index 01dba07b5183..6f96579dda80 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -42,6 +42,22 @@ static void kvm_sched_set_util(struct kvm_vcpu *vcpu, u64 *val)
val[0] = (u64)ret;
}

+static void kvm_sched_get_cpufreq_table(struct kvm_vcpu *vcpu, u64 *val)
+{
+ struct cpufreq_policy *policy;
+ u32 idx = smccc_get_arg1(vcpu);
+
+ policy = cpufreq_cpu_get(task_cpu(current));
+
+ if (!policy)
+ return;
+
+ val[0] = SMCCC_RET_SUCCESS;
+ val[1] = policy->freq_table[idx].frequency;
+
+ cpufreq_cpu_put(policy);
+}
+
static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val)
{
struct system_time_snapshot systime_snapshot;
@@ -148,6 +164,9 @@ static bool kvm_hvc_call_allowed(struct kvm_vcpu *vcpu, u32 func_id)
case ARM_SMCCC_VENDOR_HYP_KVM_UTIL_HINT_FUNC_ID:
return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_UTIL_HINT,
&smccc_feat->vendor_hyp_bmap);
+ case ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID:
+ return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_GET_CPUFREQ_TBL,
+ &smccc_feat->vendor_hyp_bmap);
default:
return kvm_hvc_call_default_allowed(func_id);
}
@@ -251,6 +270,9 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
case ARM_SMCCC_VENDOR_HYP_KVM_UTIL_HINT_FUNC_ID:
kvm_sched_set_util(vcpu, val);
break;
+ case ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID:
+ kvm_sched_get_cpufreq_table(vcpu, val);
+ break;
case ARM_SMCCC_TRNG_VERSION:
case ARM_SMCCC_TRNG_FEATURES:
case ARM_SMCCC_TRNG_GET_UUID:
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 9f747e5025b6..19fefb73a9bd 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -114,6 +114,7 @@
#define ARM_SMCCC_KVM_FUNC_PTP 1
#define ARM_SMCCC_KVM_FUNC_GET_CUR_CPUFREQ 64
#define ARM_SMCCC_KVM_FUNC_UTIL_HINT 65
+#define ARM_SMCCC_KVM_FUNC_GET_CPUFREQ_TBL 66
#define ARM_SMCCC_KVM_FUNC_FEATURES_2 127
#define ARM_SMCCC_KVM_NUM_FUNCS 128

@@ -152,6 +153,12 @@
ARM_SMCCC_OWNER_VENDOR_HYP, \
ARM_SMCCC_KVM_FUNC_UTIL_HINT)

+#define ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_32, \
+ ARM_SMCCC_OWNER_VENDOR_HYP, \
+ ARM_SMCCC_KVM_FUNC_GET_CPUFREQ_TBL)
+
/* Paravirtualised time calls (defined by ARM DEN0057A) */
#define ARM_SMCCC_HV_PV_TIME_FEATURES \
ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7f667ab344ae..90a7f37f046d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1186,6 +1186,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
#define KVM_CAP_GET_CUR_CPUFREQ 512
#define KVM_CAP_UTIL_HINT 513
+#define KVM_CAP_GET_CPUFREQ_TBL 514

#ifdef KVM_CAP_IRQ_ROUTING

diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h
index 61309ecb7241..ebf9a3395c1b 100644
--- a/tools/arch/arm64/include/uapi/asm/kvm.h
+++ b/tools/arch/arm64/include/uapi/asm/kvm.h
@@ -369,6 +369,7 @@ enum {
KVM_REG_ARM_VENDOR_HYP_BIT_PTP = 1,
KVM_REG_ARM_VENDOR_HYP_BIT_GET_CUR_CPUFREQ = 2,
KVM_REG_ARM_VENDOR_HYP_BIT_UTIL_HINT = 3,
+ KVM_REG_ARM_VENDOR_HYP_BIT_CPUFREQ_TBL = 4,
#ifdef __KERNEL__
KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
#endif
--
2.40.0.348.gf938b09366-goog
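The per-index call/return convention documented in get_freqtbl.rst above can be sketched from the guest's side. The names below are hypothetical; the real plumbing goes through arm_smccc_1_1_invoke() and the a0/a1 registers:

```c
#include <assert.h>

/* Hypothetical register-level result of one HVC, mirroring the a0/a1
 * convention documented above. */
struct hvc_result {
	long a0;          /* SMCCC_RET_SUCCESS (0) or NOT_SUPPORTED (-1) */
	unsigned long a1; /* frequency table entry in kHz when a0 == 0 */
};

/* Guest-side view: probe one index of the current CPU's frequency table.
 * Returns the frequency in kHz, or 0 when the host rejects the index. */
static unsigned long get_freqtbl_entry(struct hvc_result (*hvc)(unsigned int),
				       unsigned int idx)
{
	struct hvc_result res = hvc(idx);

	return res.a0 == 0 ? res.a1 : 0;
}

/* A toy three-entry host table, standing in for the real hypercall. */
static struct hvc_result toy_hvc(unsigned int idx)
{
	static const unsigned long tbl[] = { 300000, 600000, 1200000 };
	struct hvc_result res = { .a0 = -1, .a1 = 0 };

	if (idx < 3) {
		res.a0 = 0;
		res.a1 = tbl[idx];
	}
	return res;
}
```

A guest discovers the table length by probing increasing indices until the host reports an error or a sentinel, which is how the kvm-cpufreq driver later in the series builds its cpufreq_frequency_table.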

2023-03-30 22:47:29

by David Dai

Subject: [RFC PATCH 5/6] dt-bindings: cpufreq: add bindings for virtual kvm cpufreq

Add devicetree bindings for a virtual kvm cpufreq driver.

Co-developed-by: Saravana Kannan <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Signed-off-by: David Dai <[email protected]>
---
.../bindings/cpufreq/cpufreq-virtual-kvm.yaml | 39 +++++++++++++++++++
1 file changed, 39 insertions(+)
create mode 100644 Documentation/devicetree/bindings/cpufreq/cpufreq-virtual-kvm.yaml

diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual-kvm.yaml b/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual-kvm.yaml
new file mode 100644
index 000000000000..31e64558a7f1
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual-kvm.yaml
@@ -0,0 +1,39 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/cpufreq-virtual-kvm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Virtual KVM CPUFreq
+
+maintainers:
+ - David Dai <[email protected]>
+
+description: |
+
+ KVM CPUFreq is a virtualized driver in guest kernels that sends the
+ utilization of its vCPUs as a hint to the host. The host uses the hint
+ to schedule vCPU threads and select CPU frequency. It enables accurate
+ Per-Entity Load Tracking for tasks running in the guest by querying the
+ host CPU frequency, unless a virtualized FIE (like AMUs) exists.
+
+properties:
+ compatible:
+ const: virtual,kvm-cpufreq
+
+required:
+ - compatible
+
+additionalProperties: false
+
+examples:
+ - |
+ {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ cpufreq {
+ compatible = "virtual,kvm-cpufreq";
+ };
+
+ };
--
2.40.0.348.gf938b09366-goog

2023-03-30 22:48:10

by David Dai

Subject: [RFC PATCH 6/6] cpufreq: add kvm-cpufreq driver

Introduce a virtualized cpufreq driver for guest kernels to improve
performance and power of workloads within VMs.

This driver does two main things:

1. Sends utilization of vCPUs as a hint to the host. The host uses the
hint to schedule the vCPU threads and decide physical CPU frequency.

2. If a VM does not support a virtualized FIE (like AMUs), it uses
hypercalls to periodically update the guest's frequency scaling factor
by querying the host CPU frequency. This enables accurate
Per-Entity Load Tracking for tasks running in the guest.

Note that because the host already employs a rate_limit_us, we set the
transition_delay_us of the cpufreq policy to a minuscule value (1)
to avoid any additional delay between when the runqueue's util changes
and the frequency response on the host.

Co-developed-by: Saravana Kannan <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Signed-off-by: David Dai <[email protected]>
---
drivers/cpufreq/Kconfig | 13 ++
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/kvm-cpufreq.c | 245 ++++++++++++++++++++++++++++++++++
include/linux/sched.h | 1 +
kernel/sched/core.c | 6 +
5 files changed, 266 insertions(+)
create mode 100644 drivers/cpufreq/kvm-cpufreq.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 2c839bd2b051..0ef9d5be7c4d 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -217,6 +217,19 @@ config CPUFREQ_DT

If in doubt, say N.

+config CPUFREQ_KVM
+ tristate "KVM cpufreq driver"
+ help
+ This adds a virtualized KVM cpufreq driver for guest kernels that
+ uses hypercalls to communicate with the host. It sends utilization
+ updates to the host, which uses them to schedule vCPU threads and
+ select CPU frequency. If a VM does not support a virtualized FIE
+ such as AMUs, it updates the frequency scaling factor by polling
+ the host CPU frequency to enable accurate Per-Entity Load Tracking
+ for tasks running in the guest.
+
+ If in doubt, say N.
+
config CPUFREQ_DT_PLATDEV
bool
help
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index ef8510774913..179ea8d45135 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o

obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o
obj-$(CONFIG_CPUFREQ_DT_PLATDEV) += cpufreq-dt-platdev.o
+obj-$(CONFIG_CPUFREQ_KVM) += kvm-cpufreq.o

# Traces
CFLAGS_amd-pstate-trace.o := -I$(src)
diff --git a/drivers/cpufreq/kvm-cpufreq.c b/drivers/cpufreq/kvm-cpufreq.c
new file mode 100644
index 000000000000..1542c9ac4119
--- /dev/null
+++ b/drivers/cpufreq/kvm-cpufreq.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ */
+
+#include <linux/arch_topology.h>
+#include <linux/arm-smccc.h>
+#include <linux/cpufreq.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of_address.h>
+#include <linux/of_platform.h>
+#include <linux/pm_opp.h>
+#include <linux/slab.h>
+
+static void kvm_scale_freq_tick(void)
+{
+ unsigned long scale, cur_freq, max_freq;
+ struct arm_smccc_res hvc_res;
+
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_GET_CUR_CPUFREQ_FUNC_ID,
+ 0, &hvc_res);
+
+ cur_freq = hvc_res.a0;
+ max_freq = cpufreq_get_hw_max_freq(task_cpu(current));
+ scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
+
+ this_cpu_write(arch_freq_scale, (unsigned long)scale);
+}
+
+static struct scale_freq_data kvm_sfd = {
+ .source = SCALE_FREQ_SOURCE_ARCH,
+ .set_freq_scale = kvm_scale_freq_tick,
+};
+
+struct remote_data {
+ int ret;
+ struct cpufreq_frequency_table *table;
+};
+
+static void remote_get_freqtbl_num_entries(void *data)
+{
+ struct arm_smccc_res hvc_res;
+ u32 freq = 1UL;
+ int *idx = data;
+
+ while (freq != CPUFREQ_TABLE_END) {
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID,
+ *idx, &hvc_res);
+ if (hvc_res.a0) {
+ *idx = -ENODEV;
+ return;
+ }
+ freq = hvc_res.a1;
+ (*idx)++;
+ }
+}
+
+static int kvm_cpufreq_get_freqtbl_num_entries(int cpu)
+{
+ int num_entries = 0;
+
+ smp_call_function_single(cpu, remote_get_freqtbl_num_entries, &num_entries, true);
+ return num_entries;
+}
+
+static void remote_populate_freqtbl(void *data)
+{
+ struct arm_smccc_res hvc_res;
+ struct remote_data *freq_data = data;
+ struct cpufreq_frequency_table *pos;
+ struct cpufreq_frequency_table *table = freq_data->table;
+ int idx;
+
+ cpufreq_for_each_entry_idx(pos, table, idx) {
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID,
+ idx, &hvc_res);
+ if (hvc_res.a0) {
+ freq_data->ret = -ENODEV;
+ return;
+ }
+ pos->frequency = hvc_res.a1;
+ }
+ freq_data->ret = 0;
+}
+
+static int kvm_cpufreq_populate_freqtbl(struct cpufreq_frequency_table *table, int cpu)
+{
+ struct remote_data freq_data;
+
+ freq_data.table = table;
+ smp_call_function_single(cpu, remote_populate_freqtbl, &freq_data, true);
+ return freq_data.ret;
+}
+
+static unsigned int kvm_cpufreq_setutil_hyp(struct cpufreq_policy *policy)
+{
+ struct arm_smccc_res hvc_res;
+ u32 util = sched_cpu_util_freq(policy->cpu);
+ u32 cap = arch_scale_cpu_capacity(policy->cpu);
+ u32 threshold = cap - (cap >> 2);
+
+ if (util > threshold)
+ util = (cap + threshold) >> 1;
+
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_UTIL_HINT_FUNC_ID,
+ util, &hvc_res);
+
+ return hvc_res.a0;
+}
+
+static unsigned int kvm_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ kvm_cpufreq_setutil_hyp(policy);
+ return target_freq;
+}
+
+static int kvm_cpufreq_target_index(struct cpufreq_policy *policy,
+ unsigned int index)
+{
+ return kvm_cpufreq_setutil_hyp(policy);
+}
+
+static const struct of_device_id kvm_cpufreq_match[] = {
+ { .compatible = "virtual,kvm-cpufreq"},
+ {}
+};
+MODULE_DEVICE_TABLE(of, kvm_cpufreq_match);
+
+static int kvm_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+ struct device *cpu_dev;
+ struct cpufreq_frequency_table *table;
+ int num_entries;
+
+ cpu_dev = get_cpu_device(policy->cpu);
+ if (!cpu_dev) {
+ pr_err("%s: failed to get cpu%d device\n", __func__,
+ policy->cpu);
+ return -ENODEV;
+ }
+
+ num_entries = kvm_cpufreq_get_freqtbl_num_entries(policy->cpu);
+ if (num_entries == -ENODEV)
+ return -ENODEV;
+
+ table = kcalloc(num_entries, sizeof(*table), GFP_KERNEL);
+ if (!table)
+ return -ENOMEM;
+
+ table[num_entries-1].frequency = CPUFREQ_TABLE_END;
+
+ if (kvm_cpufreq_populate_freqtbl(table, policy->cpu)) {
+ kfree(table);
+ return -ENODEV;
+ }
+
+ policy->freq_table = table;
+ policy->dvfs_possible_from_any_cpu = false;
+ policy->fast_switch_possible = true;
+ policy->transition_delay_us = 1;
+
+ /*
+ * Only takes effect if another FIE source such as AMUs
+ * have not been registered.
+ */
+ topology_set_scale_freq_source(&kvm_sfd, policy->cpus);
+
+ return 0;
+}
+
+static int kvm_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+ kfree(policy->freq_table);
+ return 0;
+}
+
+static int kvm_cpufreq_online(struct cpufreq_policy *policy)
+{
+ /* Nothing to restore. */
+ return 0;
+}
+
+static int kvm_cpufreq_offline(struct cpufreq_policy *policy)
+{
+ /* Dummy offline() to avoid exit() being called and freeing resources. */
+ return 0;
+}
+
+static struct cpufreq_driver cpufreq_kvm_driver = {
+ .name = "kvm-cpufreq",
+ .init = kvm_cpufreq_cpu_init,
+ .exit = kvm_cpufreq_cpu_exit,
+ .online = kvm_cpufreq_online,
+ .offline = kvm_cpufreq_offline,
+ .verify = cpufreq_generic_frequency_table_verify,
+ .target_index = kvm_cpufreq_target_index,
+ .fast_switch = kvm_cpufreq_fast_switch,
+ .attr = cpufreq_generic_attr,
+};
+
+static int kvm_cpufreq_driver_probe(struct platform_device *pdev)
+{
+ int ret;
+
+ ret = cpufreq_register_driver(&cpufreq_kvm_driver);
+ if (ret) {
+ dev_err(&pdev->dev, "KVM CPUFreq driver failed to register: %d\n", ret);
+ return ret;
+ }
+
+ dev_dbg(&pdev->dev, "KVM CPUFreq driver initialized\n");
+ return 0;
+}
+
+static int kvm_cpufreq_driver_remove(struct platform_device *pdev)
+{
+ cpufreq_unregister_driver(&cpufreq_kvm_driver);
+ return 0;
+}
+
+static struct platform_driver kvm_cpufreq_driver = {
+ .probe = kvm_cpufreq_driver_probe,
+ .remove = kvm_cpufreq_driver_remove,
+ .driver = {
+ .name = "kvm-cpufreq",
+ .of_match_table = kvm_cpufreq_match,
+ },
+};
+
+static int __init kvm_cpufreq_init(void)
+{
+ return platform_driver_register(&kvm_cpufreq_driver);
+}
+postcore_initcall(kvm_cpufreq_init);
+
+static void __exit kvm_cpufreq_exit(void)
+{
+ platform_driver_unregister(&kvm_cpufreq_driver);
+}
+module_exit(kvm_cpufreq_exit);
+
+MODULE_DESCRIPTION("KVM cpufreq driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d8c346fcdf52..bd38aa32a57c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2303,6 +2303,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)

/* Returns effective CPU energy utilization, as seen by the scheduler */
unsigned long sched_cpu_util(int cpu);
+unsigned long sched_cpu_util_freq(int cpu);
#endif /* CONFIG_SMP */

#ifdef CONFIG_RSEQ
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7700ef5610c1..dd46f4cc629b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7421,6 +7421,12 @@ unsigned long sched_cpu_util(int cpu)
{
return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
}
+
+unsigned long sched_cpu_util_freq(int cpu)
+{
+ return effective_cpu_util(cpu, cpu_util_cfs(cpu), FREQUENCY_UTIL, NULL);
+}
+
#endif /* CONFIG_SMP */

/**
--
2.40.0.348.gf938b09366-goog
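The hint clamping in kvm_cpufreq_setutil_hyp() above is a small piece of arithmetic worth spelling out: the threshold sits at cap - cap/4 (75% of capacity), and anything above it is squashed to the midpoint (cap + threshold) / 2, i.e. 87.5%. A sketch of just that step (the saturation rationale is an assumption, not stated in the patch):

```c
#include <assert.h>

/* Mirror of the hint clamping in kvm_cpufreq_setutil_hyp() above:
 * utilization above 75% of capacity is squashed to 87.5%. */
static unsigned int clamp_util_hint(unsigned int util, unsigned int cap)
{
	unsigned int threshold = cap - (cap >> 2);	/* 75% of capacity */

	if (util > threshold)
		util = (cap + threshold) >> 1;		/* midpoint: 87.5% */
	return util;
}
```

With the usual cap of 1024 the hint saturates at 896, presumably so a fully busy vCPU leaves the host governor some headroom rather than pinning the physical CPU at its maximum frequency.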

2023-03-30 23:33:50

by Oliver Upton

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:

[...]

> David Dai (6):
> sched/fair: Add util_guest for tasks
> kvm: arm64: Add support for get_cur_cpufreq service
> kvm: arm64: Add support for util_hint service
> kvm: arm64: Add support for get_freqtbl service
> dt-bindings: cpufreq: add bindings for virtual kvm cpufreq
> cpufreq: add kvm-cpufreq driver

I only received patches 2-4 in my inbox (same goes for the mailing lists
AFAICT). Mind sending the rest? :)

--
Thanks,
Oliver

2023-03-30 23:45:14

by Saravana Kannan

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Thu, Mar 30, 2023 at 4:20 PM Oliver Upton <[email protected]> wrote:
>
> On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
>
> [...]
>
> > David Dai (6):
> > sched/fair: Add util_guest for tasks
> > kvm: arm64: Add support for get_cur_cpufreq service
> > kvm: arm64: Add support for util_hint service
> > kvm: arm64: Add support for get_freqtbl service
> > dt-bindings: cpufreq: add bindings for virtual kvm cpufreq
> > cpufreq: add kvm-cpufreq driver
>
> I only received patches 2-4 in my inbox (same goes for the mailing lists
> AFAICT). Mind sending the rest? :)

Oliver,

Sorry about that. Actually even I'm not cc'ed in the cover letter :)

Is it okay if we fix this when we send the next version? Mainly to
avoid some people responding to this vs other responding to a new
series (where the patches are the same).

We used a script for --to-cmd and --cc-cmd but looks like it needs
some more fixes.

Here is the full series to anyone who's wondering where the rest of
the patches are:
https://lore.kernel.org/lkml/[email protected]/T/#t

Thanks,
Saravana

2023-03-30 23:48:04

by Oliver Upton

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Thu, Mar 30, 2023 at 04:36:52PM -0700, Saravana Kannan wrote:
> On Thu, Mar 30, 2023 at 4:20 PM Oliver Upton <[email protected]> wrote:
> >
> > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> >
> > [...]
> >
> > > David Dai (6):
> > > sched/fair: Add util_guest for tasks
> > > kvm: arm64: Add support for get_cur_cpufreq service
> > > kvm: arm64: Add support for util_hint service
> > > kvm: arm64: Add support for get_freqtbl service
> > > dt-bindings: cpufreq: add bindings for virtual kvm cpufreq
> > > cpufreq: add kvm-cpufreq driver
> >
> > I only received patches 2-4 in my inbox (same goes for the mailing lists
> > AFAICT). Mind sending the rest? :)
>
> Oliver,
>
> Sorry about that. Actually even I'm not cc'ed in the cover letter :)
>
> Is it okay if we fix this when we send the next version? Mainly to
> avoid some people responding to this vs other responding to a new
> series (where the patches are the same).

Fine by me, as long as the full series arrived somewhere.

> We used a script for --to-cmd and --cc-cmd but looks like it needs
> some more fixes.
>
> Here is the full series to anyone who's wondering where the rest of
> the patches are:
> https://lore.kernel.org/lkml/[email protected]/T/#t

Gah, a bit of noodling would've dug up the full series. Thanks for the
link.

--
Thanks,
Oliver

2023-03-31 00:37:07

by Saravana Kannan

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Thu, Mar 30, 2023 at 4:40 PM Oliver Upton <[email protected]> wrote:
>
> On Thu, Mar 30, 2023 at 04:36:52PM -0700, Saravana Kannan wrote:
> > On Thu, Mar 30, 2023 at 4:20 PM Oliver Upton <[email protected]> wrote:
> > >
> > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> > >
> > > [...]
> > >
> > > > David Dai (6):
> > > > sched/fair: Add util_guest for tasks
> > > > kvm: arm64: Add support for get_cur_cpufreq service
> > > > kvm: arm64: Add support for util_hint service
> > > > kvm: arm64: Add support for get_freqtbl service
> > > > dt-bindings: cpufreq: add bindings for virtual kvm cpufreq
> > > > cpufreq: add kvm-cpufreq driver
> > >
> > > I only received patches 2-4 in my inbox (same goes for the mailing lists
> > > AFAICT). Mind sending the rest? :)
> >
> > Oliver,
> >
> > Sorry about that. Actually even I'm not cc'ed in the cover letter :)
> >
> > Is it okay if we fix this when we send the next version? Mainly to
> > avoid some people responding to this vs other responding to a new
> > series (where the patches are the same).
>
> Fine by me, as long as the full series arrived somewhere.
>
> > We used a script for --to-cmd and --cc-cmd but looks like it needs
> > some more fixes.
> >
> > Here is the full series to anyone who's wondering where the rest of
> > the patches are:
> > https://lore.kernel.org/lkml/[email protected]/T/#t
>
> Gah, a bit of noodling would've dug up the full series. Thanks for the
> link.

Actually, we'll send out a new RFC v2 series with the To's and Cc's
fixed with some minor cover letter fixes. So everyone can ignore this
series and just wait for the RFC v2 series later today.


-Saravana

2023-03-31 00:58:23

by Matthew Wilcox

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> Hi,
>
> This patch series is a continuation of the talk Saravana gave at LPC 2022
> titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> of the talk is that workloads running in a guest VM get terrible task
> placement and DVFS behavior when compared to running the same workload in

DVFS? Some new filesystem, perhaps?

> the host. Effectively, no EAS for threads inside VMs. This would make power

EAS?

Two unfamiliar and undefined acronyms in your opening paragraph.
You're not making me want to read the rest of your opus.

2023-04-03 10:22:07

by Mel Gorman

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Fri, Mar 31, 2023 at 01:49:48AM +0100, Matthew Wilcox wrote:
> On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> > Hi,
> >
> > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > of the talk is that workloads running in a guest VM get terrible task
> > placement and DVFS behavior when compared to running the same workload in
>
> DVFS? Some new filesystem, perhaps?
>

Dynamic Voltage and Frequency Scaling (DVFS) -- it's a well known term in
cpufreq/cpuidle/schedutil land.

> > the host. Effectively, no EAS for threads inside VMs. This would make power
>
> EAS?
>

Energy Aware Scheduling (EAS) is mostly a kernel/sched thing that has
an impact on cpufreq and my recollection is that it was discussed at
conferences long before kernel/sched had any EAS awareness. I don't have
the full series in my inbox and didn't dig further but patch 1 at least is
providing additional information to schedutil which impacts CPU frequency
selection on systems to varying degrees. The full impact would depend on
what cpufreq driver is in use and the specific hardware so even if the
series benefits one set of hardware, it's not necessarily a guaranteed win.

> Two unfamiliar and undefined acronyms in your opening paragraph.
> You're not making me want to read the rest of your opus.

It depends on the audience and mm/ is not the audience. VM in the title
refers to Virtual Machine, not Virtual Memory, although I confess I originally
read it as mm/ and wondered initially how mm/ affects DVFS to the extent it
triggered a "wtf happened in mm/ recently that I completely missed?". This
series is mostly of concern to scheduler, cpufreq or KVM depending on your
perspective.

For example, on KVM, I'd immediately wonder if the hypercall
overhead exceeds any benefit from better task placement, although the letter
suggests the answer is "no". However, it didn't comment (or I didn't read
carefully enough) on whether MMIO overhead or alternative communication
methods have constant cost across different hardware or, much more likely,
depend on the hardware that could potentially opt in. Various cpufreq
hardware has very different costs when measuring or altering CPU frequency,
even within different generations of chips from the same vendor.

While the data also shows performance improvements, it doesn't indicate how
close to bare metal the improvement is. Even if it's 50% faster within a
VM, how much slower than bare metal is it? In terms of data presentation,
it might be better to assign bare metal a score of 1 as the best possible
score and show the VM performance as a relative ratio (1.00 for bare metal,
0.5 for a VM with a vanilla kernel, 0.75 using improved task placement).

It would also be preferred to have x86-64 data, as the hazards the series
details also impact x86-64, which has the additional challenge that
cpufreq is often managed by the hardware, so it should be demonstrated that
the series "does no harm" on x86-64 for recent-generation Intel and AMD
chips if possible. The lack of that data doesn't kill the series as a large
improvement is still very interesting even if it's not perfect and possibly
specific to arm64.

If this *was* my area or I happened to be paying close
attention to it at the time, I would likely favour using hypercalls only at
the start because they can be used universally, and suggest adding alternative
communication methods later using the same metric "is an alternative method
of Guest<->Host communication worse, neutral or better at getting close to
bare metal performance?" I'd also push for the ratio tables as it's easier
to see at a glance how close to bare metal performance the series achieves.
Finally, I would look for x86-64 data just in case it causes harm due to
hypercall overhead on chips that manage frequency in firmware.

So while I haven't read the series and only patches 2+6 reached my inbox,
I understand the point in principle. The scheduler on wakeup paths for bare
metal also tries to favour recently used CPUs and avoid spurious CPU
migration, even though this is only tangentially related to EAS. For example,
a recently used CPU may still be polling (drivers/cpuidle/poll_state.c:poll_idle)
or at least not have entered a deep C-state, so the wakeup penalty is lower.

So whatever criticism the series deserves, it's not due to using obscure
terms that no one in kernel/sched/, drivers/cpuidle or drivers/cpufreq
would recognise.

--
Mel Gorman
SUSE Labs
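The ratio presentation Mel suggests amounts to a one-line normalization; a sketch with made-up scores for illustration (scaled by 100 to stay in integer arithmetic, so 100 == bare metal):

```c
#include <assert.h>

/* Pin bare metal at 1.00 and report each VM configuration as a ratio
 * against it; x100 keeps the result in integers (100 == bare metal). */
static unsigned int ratio_vs_bare_metal_x100(unsigned int score,
					     unsigned int bare_metal_score)
{
	return (score * 100u) / bare_metal_score;
}
```

So a hypothetical raw score of 1000 for a vanilla-kernel VM against 2000 on bare metal reads as 50, and 1500 with improved task placement reads as 75, which makes the remaining gap to bare metal visible at a glance.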

2023-04-03 11:43:51

by Dietmar Eggemann

Subject: Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

Hi David,

On 31/03/2023 00:43, David Dai wrote:
> For virtualization usecases, util_est and util_avg currently tracked
> on the host aren't sufficient to accurately represent the workload on
> vCPU threads, which results in poor frequency selection and performance.
> For example, when a large workload migrates from a busy vCPU thread to
> an idle vCPU thread, it incurs additional DVFS ramp-up latencies
> as util accumulates.
>
> Introduce a new "util_guest" member as an additional PELT signal that's
> independently updated by the guest. When used, it's max aggregated to
> provide a boost to both task_util and task_util_est.
>
> Updating task_util and task_util_est will ensure:
> -Better task placement decisions for vCPU threads on the host
> -Correctly updating util_est.ewma during dequeue
> -Additive util with other threads on the same runqueue for more
> accurate frequency responses
>
> Co-developed-by: Saravana Kannan <[email protected]>
> Signed-off-by: Saravana Kannan <[email protected]>
> Signed-off-by: David Dai <[email protected]>
> ---

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6986ea31c984..998649554344 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
>
> static inline unsigned long task_util(struct task_struct *p)
> {
> - return READ_ONCE(p->se.avg.util_avg);
> + return max(READ_ONCE(p->se.avg.util_avg),
> + READ_ONCE(p->se.avg.util_guest));
> }
>
> static inline unsigned long _task_util_est(struct task_struct *p)
> {
> struct util_est ue = READ_ONCE(p->se.avg.util_est);
>
> - return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
> + return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
> + max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
> }

I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
used here instead of p->se.avg.util_guest.

I do understand the issue of inheriting uclamp values at fork but don't
get the not being `additive` thing. We are at task level here.

The fact that you have to max util_avg and util_est directly in
task_util() and _task_util_est() tells me that there are places where
this helps and uclamp_task_util() is not called there.

When you say in the cover letter that you tried uclamp_min, how exactly
did you use it? Did you run the existing mainline or did you use
uclamp_min as a replacement for util_guest in this patch here?

> static inline unsigned long task_util_est(struct task_struct *p)
> @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> */
> util_est_enqueue(&rq->cfs, p);
>
> + /*
> + * The normal code path for host thread enqueue doesn't take into
> + * account guest task migrations when updating cpufreq util.
> + * So, always update the cpufreq when a vCPU thread has a
> + * non-zero util_guest value.
> + */
> + if (READ_ONCE(p->se.avg.util_guest))
> + cpufreq_update_util(rq, 0);


This is because enqueue_entity() -> update_load_avg() ->
attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
(&rq->cfs == cfs_rq) to call cpufreq_update_util()?

[...]

2023-04-04 01:14:58

by David Dai

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

On Mon, Apr 3, 2023 at 4:40 AM Dietmar Eggemann
<[email protected]> wrote:
>
> Hi David,
Hi Dietmar, thanks for your comments.
>
> On 31/03/2023 00:43, David Dai wrote:
> > For virtualization usecases, util_est and util_avg currently tracked
> > on the host aren't sufficient to accurately represent the workload on
> > vCPU threads, which results in poor frequency selection and performance.
> > For example, when a large workload migrates from a busy vCPU thread to
> > an idle vCPU thread, it incurs additional DVFS ramp-up latencies
> > as util accumulates.
> >
> > Introduce a new "util_guest" member as an additional PELT signal that's
> > independently updated by the guest. When used, it's max aggregated to
> > provide a boost to both task_util and task_util_est.
> >
> > Updating task_util and task_util_est will ensure:
> > -Better task placement decisions for vCPU threads on the host
> > -Correctly updating util_est.ewma during dequeue
> > -Additive util with other threads on the same runqueue for more
> > accurate frequency responses
> >
> > Co-developed-by: Saravana Kannan <[email protected]>
> > Signed-off-by: Saravana Kannan <[email protected]>
> > Signed-off-by: David Dai <[email protected]>
> > ---
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6986ea31c984..998649554344 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
> >
> > static inline unsigned long task_util(struct task_struct *p)
> > {
> > - return READ_ONCE(p->se.avg.util_avg);
> > + return max(READ_ONCE(p->se.avg.util_avg),
> > + READ_ONCE(p->se.avg.util_guest));
> > }
> >
> > static inline unsigned long _task_util_est(struct task_struct *p)
> > {
> > struct util_est ue = READ_ONCE(p->se.avg.util_est);
> >
> > - return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
> > + return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
> > + max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
> > }
>
> I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
> used here instead p->se.avg.util_guest.
Using p->uclamp_req[UCLAMP_MIN].value would fold uclamp values into
task_util and task_util_est for all tasks that have uclamp values set.
The intent of these patches isn’t to modify existing uclamp behaviour.
Users would also override the util values coming from the guest
whenever they set uclamp values.
>
> I do understand the issue of inheriting uclamp values at fork but don't
> get the not being `additive` thing. We are at task level here.
Uclamp values are max-aggregated with other tasks at the runqueue
level when deciding CPU frequency. For example, a vCPU runqueue may
have a util of 512, which results in setting uclamp_min to 512 on the
vCPU task. That is insufficient to drive a frequency response if the
vCPU shares the runqueue with another host task running at a util of
512, since the max-aggregated result at the runqueue is still a
clamped util of 512 (e.g. if a guest thread had just migrated onto
this vCPU).
>
> The fact that you have to max util_avg and util_est directly in
> task_util() and _task_util_est() tells me that there are places where
> this helps and uclamp_task_util() is not called there.
Can you clarify this point a bit more?
>
> When you say in the cover letter that you tried uclamp_min, how exactly
> did you use it? Did you run the existing mainline or did you use
> uclamp_min as a replacement for util_guest in this patch here?
I called sched_setattr_nocheck() with .sched_flags =
SCHED_FLAG_UTIL_CLAMP when updating uclamp_min, and uclamp_max was
left at 1024. Uclamp_min was not aggregated with task_util and
task_util_est during my testing. The only caveat is that I added a
change to skip resetting uclamp on fork while testing (I realize there
is specifically a SCHED_FLAG_RESET_ON_FORK, but I didn’t want to reset
other sched attributes).
>
> > static inline unsigned long task_util_est(struct task_struct *p)
> > @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > */
> > util_est_enqueue(&rq->cfs, p);
> >
> > + /*
> > + * The normal code path for host thread enqueue doesn't take into
> > + * account guest task migrations when updating cpufreq util.
> > + * So, always update the cpufreq when a vCPU thread has a
> > + * non-zero util_guest value.
> > + */
> > + if (READ_ONCE(p->se.avg.util_guest))
> > + cpufreq_update_util(rq, 0);
>
>
> This is because enqueue_entity() -> update_load_avg() ->
> attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
> (&rq->cfs == cfs_rq) to call cpufreq_update_util()?
enqueue_entity() would not call into update_load_avg() here because of
the check for !se->avg.last_update_time: se->avg.last_update_time is
non-zero since the vCPU task did not migrate before this enqueue. This
enqueue path is reached when util_guest is updated for the vCPU task
through the sched_setattr_nocheck() call, where we want to ensure a
frequency update occurs.
>
> [...]

2023-04-04 19:55:14

by Oliver Upton

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

Folks,

On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:

<snip>

> PCMark
> Higher is better
> +-------------------+----------+------------+--------+-------+--------+
> | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta |
> +-------------------+----------+------------+--------+-------+--------+
> | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% |
> +-------------------+----------+------------+--------+-------+--------+
> | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% |
> +-------------------+----------+------------+--------+-------+--------+
> | Video Editing | 4921 | 5221 | +6% | 5167 | +5% |
> +-------------------+----------+------------+--------+-------+--------+
> | Writing | 6864 | 8825 | +29% | 8529 | +24% |
> +-------------------+----------+------------+--------+-------+--------+
> | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% |
> +-------------------+----------+------------+--------+-------+--------+
> | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% |
> +-------------------+----------+------------+--------+-------+--------+
>
> PCMark Performance/mAh
> Higher is better
> +-----------+----------+-----------+--------+------+--------+
> | | Baseline | Hypercall | %delta | MMIO | %delta |
> +-----------+----------+-----------+--------+------+--------+
> | Score/mAh | 79 | 88 | +11% | 83 | +7% |
> +-----------+----------+-----------+--------+------+--------+
>
> Roblox
> Higher is better
> +-----+----------+------------+--------+-------+--------+
> | | Baseline | Hypercall | %delta | MMIO | %delta |
> +-----+----------+------------+--------+-------+--------+
> | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% |
> +-----+----------+------------+--------+-------+--------+
>
> Roblox Frames/mAh
> Higher is better
> +------------+----------+------------+--------+--------+--------+
> | | Baseline | Hypercall | %delta | MMIO | %delta |
> +------------+----------+------------+--------+--------+--------+
> | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% |
> +------------+----------+------------+--------+--------+--------+

</snip>

> Next steps:
> ===========
> We are continuing to look into communication mechanisms other than
> hypercalls that are just as/more efficient and avoid switching into the VMM
> userspace. Any inputs in this regard are greatly appreciated.

We're highly unlikely to entertain such an interface in KVM.

The entire feature is dependent on pinning vCPUs to physical cores, for which
userspace is in the driver's seat. That is a well established and documented
policy which can be seen in the way we handle heterogeneous systems and
vPMU.

Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces
that I would not expect to benefit the typical user of KVM.

Based on the data above, it would appear that the userspace implementation is
in the same neighborhood as a KVM-based implementation, which only further
weakens the case for moving this into the kernel.

I certainly can appreciate the motivation for the series, but this feature
should be in userspace as some form of a virtual device.

--
Thanks,
Oliver

2023-04-04 20:49:55

by Marc Zyngier

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Tue, 04 Apr 2023 20:43:40 +0100,
Oliver Upton <[email protected]> wrote:
>
> Folks,
>
> On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
>
> <snip>
>
> > PCMark
> > Higher is better
> > +-------------------+----------+------------+--------+-------+--------+
> > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta |
> > +-------------------+----------+------------+--------+-------+--------+
> > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% |
> > +-------------------+----------+------------+--------+-------+--------+
> > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% |
> > +-------------------+----------+------------+--------+-------+--------+
> > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% |
> > +-------------------+----------+------------+--------+-------+--------+
> > | Writing | 6864 | 8825 | +29% | 8529 | +24% |
> > +-------------------+----------+------------+--------+-------+--------+
> > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% |
> > +-------------------+----------+------------+--------+-------+--------+
> > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% |
> > +-------------------+----------+------------+--------+-------+--------+
> >
> > PCMark Performance/mAh
> > Higher is better
> > +-----------+----------+-----------+--------+------+--------+
> > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > +-----------+----------+-----------+--------+------+--------+
> > | Score/mAh | 79 | 88 | +11% | 83 | +7% |
> > +-----------+----------+-----------+--------+------+--------+
> >
> > Roblox
> > Higher is better
> > +-----+----------+------------+--------+-------+--------+
> > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > +-----+----------+------------+--------+-------+--------+
> > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% |
> > +-----+----------+------------+--------+-------+--------+
> >
> > Roblox Frames/mAh
> > Higher is better
> > +------------+----------+------------+--------+--------+--------+
> > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > +------------+----------+------------+--------+--------+--------+
> > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% |
> > +------------+----------+------------+--------+--------+--------+
>
> </snip>
>
> > Next steps:
> > ===========
> > We are continuing to look into communication mechanisms other than
> > hypercalls that are just as/more efficient and avoid switching into the VMM
> > userspace. Any inputs in this regard are greatly appreciated.
>
> We're highly unlikely to entertain such an interface in KVM.
>
> The entire feature is dependent on pinning vCPUs to physical cores, for which
> userspace is in the driver's seat. That is a well established and documented
> policy which can be seen in the way we handle heterogeneous systems and
> vPMU.
>
> Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces
> that I would not expect to benefit the typical user of KVM.
>
> Based on the data above, it would appear that the userspace implementation is
> in the same neighborhood as a KVM-based implementation, which only further
> weakens the case for moving this into the kernel.
>
> I certainly can appreciate the motivation for the series, but this feature
> should be in userspace as some form of a virtual device.

+1 on all of the above.

The one thing I'd like to understand is why the comment seems to imply
that there is a significant difference in overhead between a hypercall
and an MMIO. In my experience, both are pretty similar in cost for a
given handling location (both in userspace or both in the kernel).
MMIO handling is a tiny bit more expensive due to a guaranteed TLB
miss followed by a walk of the in-kernel device ranges, but that's
all. It should hardly register.

And if you really want some super-low latency, low overhead
signalling, maybe an exception is the wrong tool for the job. Shared
memory communication could be more appropriate.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2023-04-05 07:55:16

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Tuesday 04 Apr 2023 at 21:49:10 (+0100), Marc Zyngier wrote:
> On Tue, 04 Apr 2023 20:43:40 +0100,
> Oliver Upton <[email protected]> wrote:
> >
> > Folks,
> >
> > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> >
> > <snip>
> >
> > > PCMark
> > > Higher is better
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Writing | 6864 | 8825 | +29% | 8529 | +24% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > >
> > > PCMark Performance/mAh
> > > Higher is better
> > > +-----------+----------+-----------+--------+------+--------+
> > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > +-----------+----------+-----------+--------+------+--------+
> > > | Score/mAh | 79 | 88 | +11% | 83 | +7% |
> > > +-----------+----------+-----------+--------+------+--------+
> > >
> > > Roblox
> > > Higher is better
> > > +-----+----------+------------+--------+-------+--------+
> > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > +-----+----------+------------+--------+-------+--------+
> > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% |
> > > +-----+----------+------------+--------+-------+--------+
> > >
> > > Roblox Frames/mAh
> > > Higher is better
> > > +------------+----------+------------+--------+--------+--------+
> > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > +------------+----------+------------+--------+--------+--------+
> > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% |
> > > +------------+----------+------------+--------+--------+--------+
> >
> > </snip>
> >
> > > Next steps:
> > > ===========
> > > We are continuing to look into communication mechanisms other than
> > > hypercalls that are just as/more efficient and avoid switching into the VMM
> > > userspace. Any inputs in this regard are greatly appreciated.
> >
> > We're highly unlikely to entertain such an interface in KVM.
> >
> > The entire feature is dependent on pinning vCPUs to physical cores, for which
> > userspace is in the driver's seat. That is a well established and documented
> > policy which can be seen in the way we handle heterogeneous systems and
> > vPMU.
> >
> > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces
> > that I would not expect to benefit the typical user of KVM.
> >
> > Based on the data above, it would appear that the userspace implementation is
> > in the same neighborhood as a KVM-based implementation, which only further
> > weakens the case for moving this into the kernel.
> >
> > I certainly can appreciate the motivation for the series, but this feature
> > should be in userspace as some form of a virtual device.
>
> +1 on all of the above.

And I concur with all the above as well. Putting this in the kernel is
not an obvious fit at all as that requires a number of assumptions about
the VMM.

As Oliver pointed out, the guest topology, and how it maps to the host
topology (vcpu pinning etc) is very much a VMM policy decision and will
be particularly important to handle guest frequency requests correctly.

In addition to that, the VMM's software architecture may have an impact.
Crosvm for example does device emulation in separate processes for
security reasons, so it is likely that adjusting the scheduling
parameters ('util_guest', uclamp, or else) only for the vCPU thread that
issues frequency requests will be sub-optimal for performance, we may
want to adjust those parameters for all the tasks that are on the
critical path.

And at an even higher level, baking into the kernel an assumed mapping
of vCPU threads to host threads feels kinda wrong; this too is a host
userspace policy decision, I believe. Not that anybody in their right
mind would want to do this, but I _think_ it would technically be
feasible to serialize the execution of multiple vCPUs on the same host
thread, at which point the util_guest thingy becomes entirely bogus.
(I obviously don't want to conflate this use-case; it's just an
example showing that the proposed abstraction in the series is not a
perfect fit for the KVM userspace delegation model.)

So +1 from me to move this as a virtual device of some kind. And if the
extra cost of exiting all the way back to userspace is prohibitive (is
it btw?), then we can try to work on that. Maybe something a la vhost
can be done to optimize, I'll have a think.

> The one thing I'd like to understand that the comment seems to imply
> that there is a significant difference in overhead between a hypercall
> and an MMIO. In my experience, both are pretty similar in cost for a
> handling location (both in userspace or both in the kernel). MMIO
> handling is a tiny bit more expensive due to a guaranteed TLB miss
> followed by a walk of the in-kernel device ranges, but that's all. It
> should hardly register.
>
> And if you really want some super-low latency, low overhead
> signalling, maybe an exception is the wrong tool for the job. Shared
> memory communication could be more appropriate.

I presume some kind of signalling mechanism will be necessary to
synchronously update host scheduling parameters in response to guest
frequency requests, but if the volume of data requires it then a shared
buffer + doorbell type of approach should do.

Thinking about it, using SCMI over virtio would implement exactly that.
Linux-as-a-guest already supports it IIRC, so possibly the problem
being addressed in this series could be 'simply' solved using an SCMI
backend in the VMM...

Thanks,
Quentin

2023-04-05 08:05:23

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] kvm: arm64: Add support for get_cur_cpufreq service

On Thursday 30 Mar 2023 at 15:43:37 (-0700), David Dai wrote:
> This service allows guests to query the host for frequency of the CPU
> that the vCPU is currently running on.

I assume the intention here is to achieve scale invariance in the guest
to ensure its PELT signals represent how much work is actually being
done. If so, it's likely the usage of activity monitors will be superior
for this type of thing as that may allow us to drop the baked-in
assumption about vCPU pinning. IIRC, AMUs v2 (arm64-specific obv) have
extended support for virtualization, so I'd suggest looking into
supporting that first.

And assuming we also want to support this on hardware that doesn't have
AMUs, or doesn't have the right virt extensions, then the only thing I can
think of is to have the VMM expose non-architectural AMUs to the guest,
maybe emulated using PMUs. If the guest uses Linux, it'll need to grow
support for non-architectural AMUs which is its own can of worms though.

Thanks,
Quentin

2023-04-05 08:08:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> Hi,
>
> This patch series is a continuation of the talk Saravana gave at LPC 2022
> titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> of the talk is that workloads running in a guest VM get terrible task
> placement and DVFS behavior when compared to running the same workload in
> the host. Effectively, no EAS for threads inside VMs. This would make power
> and performance terrible just by running the workload in a VM even if we
> assume there is zero virtualization overhead.
>
> We have been iterating over different options for communicating between
> guest and host, ways of applying the information coming from the
> guest/host, etc to figure out the best performance and power improvements
> we could get.
>
> The patch series in its current state is NOT meant for landing in the
> upstream kernel. We are sending this patch series to share the current
> progress and data we have so far. The patch series is meant to be easy to
> cherry-pick and test on various devices to see what performance and power
> benefits this might give for others.
>
> With this series, a workload running in a VM gets the same task placement
> and DVFS treatment as it would when running in the host.
>
> As expected, we see significant performance improvement and better
> performance/power ratio. If anyone else wants to try this out for your VM
> workloads and report findings, that'd be very much appreciated.
>
> The idea is to improve VM CPUfreq/sched behavior by:
> - Having guest kernel to do accurate load tracking by taking host CPU
> arch/type and frequency into account.
> - Sharing vCPU run queue utilization information with the host so that the
> host can do proper frequency scaling and task placement on the host side.

So, not having actually been sent many of the patches, I've no idea what
you've done... Please, eradicate this ridiculous idea of sending random
people a random subset of a patch series. Either send all of it or none;
this is a bloody nuisance.

Having said that; my biggest worry is that you're making scheduler
internals into an ABI. I would hate for this paravirt interface to tie
us down.

2023-04-05 08:31:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

On Thu, Mar 30, 2023 at 03:43:36PM -0700, David Dai wrote:
> @@ -499,6 +509,7 @@ struct sched_avg {
> unsigned long load_avg;
> unsigned long runnable_avg;
> unsigned long util_avg;
> + unsigned long util_guest;
> struct util_est util_est;
> } ____cacheline_aligned;
>

Yeah, no... you'll have to make room first.

struct sched_avg {
/* typedef u64 -> __u64 */ long long unsigned int last_update_time; /* 0 8 */
/* typedef u64 -> __u64 */ long long unsigned int load_sum; /* 8 8 */
/* typedef u64 -> __u64 */ long long unsigned int runnable_sum; /* 16 8 */
/* typedef u32 -> __u32 */ unsigned int util_sum; /* 24 4 */
/* typedef u32 -> __u32 */ unsigned int period_contrib; /* 28 4 */
long unsigned int load_avg; /* 32 8 */
long unsigned int runnable_avg; /* 40 8 */
long unsigned int util_avg; /* 48 8 */
struct util_est {
unsigned int enqueued; /* 56 4 */
unsigned int ewma; /* 60 4 */
} __attribute__((__aligned__(8)))util_est __attribute__((__aligned__(8))); /* 56 8 */

/* size: 64, cachelines: 1, members: 9 */
/* forced alignments: 1 */
} __attribute__((__aligned__(64)));


2023-04-05 08:33:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] cpufreq: add kvm-cpufreq driver

On Thu, Mar 30, 2023 at 03:43:41PM -0700, David Dai wrote:

> +struct remote_data {
> + int ret;
> + struct cpufreq_frequency_table *table;
> +};
> +
> +static void remote_get_freqtbl_num_entries(void *data)
> +{
> + struct arm_smccc_res hvc_res;
> + u32 freq = 1UL;
> + int *idx = data;
> +
> + while (freq != CPUFREQ_TABLE_END) {
> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID,
> + *idx, &hvc_res);
> + if (hvc_res.a0) {
> + *idx = -ENODEV;
> + return;
> + }
> + freq = hvc_res.a1;
> + (*idx)++;
> + }
> +}
> +
> +static int kvm_cpufreq_get_freqtbl_num_entries(int cpu)
> +{
> + int num_entries = 0;
> +
> + smp_call_function_single(cpu, remote_get_freqtbl_num_entries, &num_entries, true);
> + return num_entries;
> +}
> +
> +static void remote_populate_freqtbl(void *data)
> +{
> + struct arm_smccc_res hvc_res;
> + struct remote_data *freq_data = data;
> + struct cpufreq_frequency_table *pos;
> + struct cpufreq_frequency_table *table = freq_data->table;
> + int idx;
> +
> + cpufreq_for_each_entry_idx(pos, table, idx) {
> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID,
> + idx, &hvc_res);
> + if (hvc_res.a0) {
> + freq_data->ret = -ENODEV;
> + return;
> + }
> + pos->frequency = hvc_res.a1;
> + }
> + freq_data->ret = 0;
> +}
> +
> +static int kvm_cpufreq_populate_freqtbl(struct cpufreq_frequency_table *table, int cpu)
> +{
> + struct remote_data freq_data;
> +
> + freq_data.table = table;
> + smp_call_function_single(cpu, remote_populate_freqtbl, &freq_data, true);
> + return freq_data.ret;
> +}

*WHY* are you sending IPIs to do a hypercall ?!?

You can simply have the hypercall tell what vCPU you're doing this for.

2023-04-05 08:36:04

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

On Monday 03 Apr 2023 at 18:11:37 (-0700), David Dai wrote:
> On Mon, Apr 3, 2023 at 4:40 AM Dietmar Eggemann
> <[email protected]> wrote:
> > I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
> > used here instead p->se.avg.util_guest.
> Using p->uclamp_req[UCLAMP_MIN].value would result in folding in
> uclamp values into task_util and task_util_est for all tasks that have
> uclamp values set. The intent of these patches isn’t to modify
> existing uclamp behaviour. Users would also override util values from
> the guest when they set uclamp values.

That shouldn't be a problem if host userspace is responsible for driving
the uclamp values in response to guest frequency requests in the first
place ...

> > I do understand the issue of inheriting uclamp values at fork but don't
> > get the not being `additive` thing. We are at task level here.
> Uclamp values are max aggregated with other tasks at the runqueue
> level when deciding CPU frequency. For example, a vCPU runqueue may
> have an util of 512 that results in setting 512 to uclamp_min on the
> vCPU task. This is insufficient to drive a frequency response if it
> shares the runqueue with another host task running with util of 512 as
> it would result in a clamped util value of 512 at the runqueue(Ex. If
> a guest thread had just migrated onto this vCPU).

Maybe it's a feature rather than a bug?

It's not obvious that giving vCPU threads extra powers that other host
threads don't have is a good idea. The fact that vCPU threads are
limited to what the VMM would be allowed to request for its other
threads is more than desirable. I'd even say it's a requirement.

Thanks,
Quentin

2023-04-05 08:39:41

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Wed, 5 Apr 2023 at 09:48, Quentin Perret <[email protected]> wrote:
>
> On Tuesday 04 Apr 2023 at 21:49:10 (+0100), Marc Zyngier wrote:
> > On Tue, 04 Apr 2023 20:43:40 +0100,
> > Oliver Upton <[email protected]> wrote:
> > >
> > > Folks,
> > >
> > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> > >
> > > <snip>
> > >
> > > > PCMark
> > > > Higher is better
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Writing | 6864 | 8825 | +29% | 8529 | +24% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > >
> > > > PCMark Performance/mAh
> > > > Higher is better
> > > > +-----------+----------+-----------+--------+------+--------+
> > > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +-----------+----------+-----------+--------+------+--------+
> > > > | Score/mAh | 79 | 88 | +11% | 83 | +7% |
> > > > +-----------+----------+-----------+--------+------+--------+
> > > >
> > > > Roblox
> > > > Higher is better
> > > > +-----+----------+------------+--------+-------+--------+
> > > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +-----+----------+------------+--------+-------+--------+
> > > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% |
> > > > +-----+----------+------------+--------+-------+--------+
> > > >
> > > > Roblox Frames/mAh
> > > > Higher is better
> > > > +------------+----------+------------+--------+--------+--------+
> > > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +------------+----------+------------+--------+--------+--------+
> > > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% |
> > > > +------------+----------+------------+--------+--------+--------+
> > >
> > > </snip>
> > >
> > > > Next steps:
> > > > ===========
> > > > We are continuing to look into communication mechanisms other than
> > > > hypercalls that are just as/more efficient and avoid switching into the VMM
> > > > userspace. Any inputs in this regard are greatly appreciated.
> > >
> > > We're highly unlikely to entertain such an interface in KVM.
> > >
> > > The entire feature is dependent on pinning vCPUs to physical cores, for which
> > > userspace is in the driver's seat. That is a well established and documented
> > > policy which can be seen in the way we handle heterogeneous systems and
> > > vPMU.
> > >
> > > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces
> > > that I would not expect to benefit the typical user of KVM.
> > >
> > > Based on the data above, it would appear that the userspace implementation is
> > > in the same neighborhood as a KVM-based implementation, which only further
> > > weakens the case for moving this into the kernel.
> > >
> > > I certainly can appreciate the motivation for the series, but this feature
> > > should be in userspace as some form of a virtual device.
> >
> > +1 on all of the above.
>
> And I concur with all the above as well. Putting this in the kernel is
> not an obvious fit at all as that requires a number of assumptions about
> the VMM.
>
> As Oliver pointed out, the guest topology, and how it maps to the host
> topology (vcpu pinning etc) is very much a VMM policy decision and will
> be particularly important to handle guest frequency requests correctly.
>
> In addition to that, the VMM's software architecture may have an impact.
> Crosvm for example does device emulation in separate processes for
> security reasons, so it is likely that adjusting the scheduling
> parameters ('util_guest', uclamp, or else) only for the vCPU thread that
> issues frequency requests will be sub-optimal for performance, we may
> want to adjust those parameters for all the tasks that are on the
> critical path.
>
> And at an even higher level, assuming in the kernel a certain mapping of
> vCPU threads to host threads feels kinda wrong, this too is a host
> userspace policy decision I believe. Not that anybody in their right
> mind would want to do this, but I _think_ it would technically be
> feasible to serialize the execution of multiple vCPUs on the same host
> thread, at which point the util_guest thingy becomes entirely bogus. (I
> obviously don't want to conflate this use-case, it's just an example
> that shows the proposed abstraction in the series is not a perfect fit
> for the KVM userspace delegation model.)
>
> So +1 from me to move this as a virtual device of some kind. And if the
> extra cost of exiting all the way back to userspace is prohibitive (is
> it btw?), then we can try to work on that. Maybe something a la vhost
> can be done to optimize, I'll have a think.
>
> > The one thing I'd like to understand that the comment seems to imply
> > that there is a significant difference in overhead between a hypercall
> > and an MMIO. In my experience, both are pretty similar in cost for a
> > handling location (both in userspace or both in the kernel). MMIO
> > handling is a tiny bit more expensive due to a guaranteed TLB miss
> > followed by a walk of the in-kernel device ranges, but that's all. It
> > should hardly register.
> >
> > And if you really want some super-low latency, low overhead
> > signalling, maybe an exception is the wrong tool for the job. Shared
> > memory communication could be more appropriate.
>
> I presume some kind of signalling mechanism will be necessary to
> synchronously update host scheduling parameters in response to guest
> frequency requests, but if the volume of data requires it then a shared
> buffer + doorbell type of approach should do.
>
> Thinking about it, using SCMI over virtio would implement exactly that.
> Linux-as-a-guest already supports it IIRC, so possibly the problem
> being addressed in this series could be 'simply' solved using an SCMI
> backend in the VMM...

This is what was suggested at LPC:
- use virtio-scmi and an SCMI performance domain in the guest for the
cpufreq driver
- use a vhost-user SCMI backend in userspace
- have this vhost userspace backend update the uclamp_min of the vCPU
thread, or use another method if this one is not good enough

>
> Thanks,
> Quentin

2023-04-05 10:56:21

by Dietmar Eggemann

Subject: Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

On 04/04/2023 03:11, David Dai wrote:
> On Mon, Apr 3, 2023 at 4:40 AM Dietmar Eggemann
> <[email protected]> wrote:
>>
>> Hi David,
> Hi Dietmar, thanks for your comments.
>>
>> On 31/03/2023 00:43, David Dai wrote:

[...]

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 6986ea31c984..998649554344 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
>>>
>>> static inline unsigned long task_util(struct task_struct *p)
>>> {
>>> - return READ_ONCE(p->se.avg.util_avg);
>>> + return max(READ_ONCE(p->se.avg.util_avg),
>>> + READ_ONCE(p->se.avg.util_guest));
>>> }
>>>
>>> static inline unsigned long _task_util_est(struct task_struct *p)
>>> {
>>> struct util_est ue = READ_ONCE(p->se.avg.util_est);
>>>
>>> - return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
>>> + return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
>>> + max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
>>> }
>>
>> I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
>> used here instead p->se.avg.util_guest.
> Using p->uclamp_req[UCLAMP_MIN].value would result in folding in
> uclamp values into task_util and task_util_est for all tasks that have
> uclamp values set. The intent of these patches isn’t to modify
> existing uclamp behaviour. Users would also override util values from
> the guest when they set uclamp values.
>>
>> I do understand the issue of inheriting uclamp values at fork but don't
>> get the not being `additive` thing. We are at task level here.

> Uclamp values are max aggregated with other tasks at the runqueue
> level when deciding CPU frequency. For example, a vCPU runqueue may
> have an util of 512 that results in setting 512 to uclamp_min on the
> vCPU task. This is insufficient to drive a frequency response if it
> shares the runqueue with another host task running with util of 512 as
> it would result in a clamped util value of 512 at the runqueue(Ex. If
> a guest thread had just migrated onto this vCPU).

OK, see your point now. You want an accurate per-task boost for this
vCPU task on the host run-queue.
And a scenario in which a vCPU can ask for 100% in these moments is not
sufficient I guess? In this case uclamp_min could work.

>> The fact that you have to max util_avg and util_est directly in
>> task_util() and _task_util_est() tells me that there are places where
>> this helps and uclamp_task_util() is not called there.
> Can you clarify on this point a bit more?

Sorry, I meant s/util_est/util_guest/.

The effect of the change in _task_util_est() you see via:

enqueue_task_fair()
util_est_enqueue()
cfs_rq->avg.util_est.enqueued += _task_util_est(p)

so that `sugov_get_util() -> cpu_util_cfs() ->
cfs_rq->avg.util_est.enqueued` can see the effect of util_guest?

Not sure about the change in task_util() yet.

>> When you say in the cover letter that you tried uclamp_min, how exactly
>> did you use it? Did you run the existing mainline or did you use
>> uclamp_min as a replacement for util_guest in this patch here?

> I called sched_setattr_nocheck() with .sched_flags =
> SCHED_FLAG_UTIL_CLAMP when updating uclamp_min and clamp_max is left
> at 1024. Uclamp_min was not aggregated with task_util and
> task_util_est during my testing. The only caveat there is that I added
> a change to only reset uclamp on fork when testing(I realize there is
> specifically a SCHED_FLAG_RESET_ON_FORK, but I didn’t want to reset
> other sched attributes).

OK, understood. It's essentially a util_est v2 for vCPU tasks on host.

>>> static inline unsigned long task_util_est(struct task_struct *p)
>>> @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>> */
>>> util_est_enqueue(&rq->cfs, p);
>>>
>>> + /*
>>> + * The normal code path for host thread enqueue doesn't take into
>>> + * account guest task migrations when updating cpufreq util.
>>> + * So, always update the cpufreq when a vCPU thread has a
>>> + * non-zero util_guest value.
>>> + */
>>> + if (READ_ONCE(p->se.avg.util_guest))
>>> + cpufreq_update_util(rq, 0);
>>
>>
>> This is because enqueue_entity() -> update_load_avg() ->
>> attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
>> (&rq->cfs == cfs_rq) to call cpufreq_update_util()?
> The enqueue_entity() would not call into update_load_avg() due to the
> check for !se->avg.last_update_time. se->avg.last_update_time is
> non-zero because the vCPU task did not migrate before this enqueue.
> This enqueue path is reached when util_guest is updated for the vCPU
> task through the sched_setattr_nocheck call where we want to ensure a
> frequency update occurs.

OK, vCPU tasks are pinned so always !WF_MIGRATED wakeup I guess?

2023-04-05 21:08:07

by Saravana Kannan

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Tue, Apr 4, 2023 at 1:49 PM Marc Zyngier <[email protected]> wrote:
>
> On Tue, 04 Apr 2023 20:43:40 +0100,
> Oliver Upton <[email protected]> wrote:
> >
> > Folks,
> >
> > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> >
> > <snip>
> >
> > > PCMark
> > > Higher is better
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Writing | 6864 | 8825 | +29% | 8529 | +24% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% |
> > > +-------------------+----------+------------+--------+-------+--------+
> > >
> > > PCMark Performance/mAh
> > > Higher is better
> > > +-----------+----------+-----------+--------+------+--------+
> > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > +-----------+----------+-----------+--------+------+--------+
> > > | Score/mAh | 79 | 88 | +11% | 83 | +7% |
> > > +-----------+----------+-----------+--------+------+--------+
> > >
> > > Roblox
> > > Higher is better
> > > +-----+----------+------------+--------+-------+--------+
> > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > +-----+----------+------------+--------+-------+--------+
> > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% |
> > > +-----+----------+------------+--------+-------+--------+
> > >
> > > Roblox Frames/mAh
> > > Higher is better
> > > +------------+----------+------------+--------+--------+--------+
> > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > +------------+----------+------------+--------+--------+--------+
> > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% |
> > > +------------+----------+------------+--------+--------+--------+
> >
> > </snip>
> >
> > > Next steps:
> > > ===========
> > > We are continuing to look into communication mechanisms other than
> > > hypercalls that are just as/more efficient and avoid switching into the VMM
> > > userspace. Any inputs in this regard are greatly appreciated.

Hi Oliver and Marc,

Replying to both of you in this one email.

> >
> > We're highly unlikely to entertain such an interface in KVM.
> >
> > The entire feature is dependent on pinning vCPUs to physical cores, for which
> > userspace is in the driver's seat. That is a well established and documented
> > policy which can be seen in the way we handle heterogeneous systems and
> > vPMU.
> >
> > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces
> > that I would not expect to benefit the typical user of KVM.
> >
> > Based on the data above, it would appear that the userspace implementation is
> > in the same neighborhood as a KVM-based implementation, which only further
> > weakens the case for moving this into the kernel.

Oliver,

Sorry if the tables/data aren't presented in an intuitive way, but
MMIO vs hypercall is definitely not in the same neighborhood. The
hypercall method often gives close to 2x the improvement that the MMIO
method gives. For example:

- Roblox FPS: MMIO improves it by 32% vs hypercall improves it by 57%.
- Frames/mAh: MMIO improves it by 13% vs hypercall improves it by 26%.
- PC Mark Data manipulation: MMIO makes it worse by 8% vs hypercall
improves it by 5%

Hypercall does better for other cases too, just not as good. For example,
- PC Mark Photo editing: Going from MMIO to hypercall gives a 10% improvement.

These are all pretty non-trivial, at least in the mobile world. Heck,
whole teams would spend months for 2% improvement in battery :)

> >
> > I certainly can appreciate the motivation for the series, but this feature
> > should be in userspace as some form of a virtual device.
>
> +1 on all of the above.

Marc and Oliver,

We are not tied to hypercalls. We want to do the right thing here, but
MMIO going all the way to userspace definitely doesn't cut it as is.
This is where we need some guidance. See more below.

> The one thing I'd like to understand that the comment seems to imply
> that there is a significant difference in overhead between a hypercall
> and an MMIO. In my experience, both are pretty similar in cost for a
> handling location (both in userspace or both in the kernel).

I think the main difference really is that in our hypercall vs MMIO
comparison the hypercall is handled in the kernel vs MMIO goes all the
way to userspace. I agree with you that the difference probably won't
be significant if both of them go to the same "depth" in the privilege
levels.

> MMIO
> handling is a tiny bit more expensive due to a guaranteed TLB miss
> followed by a walk of the in-kernel device ranges, but that's all. It
> should hardly register.
>
> And if you really want some super-low latency, low overhead
> signalling, maybe an exception is the wrong tool for the job. Shared
> memory communication could be more appropriate.

Yeah, that's one of our next steps. Ideally, we want to use shared
memory for the host to guest information flow. It's a 32-bit value
representing the current frequency that the host can update whenever
the host CPU frequency changes and the guest can read whenever it
needs it.

For guest to host information flow, we'll need a kick from guest to
host because we need to take action on the host side when threads
migrate between vCPUs and cause a significant change in vCPU util.
Again it can be just a shared memory and some kick. This is what we
are currently trying to figure out how to do.

If there are APIs to do this, can you point us to those please? We'd
also want the shared memory to be accessible by the VMM (so, shared
between guest kernel, host kernel and VMM).

Are the above next steps sane? Or is that a no-go? The main thing we
want to cut out is the need for having to switch to userspace for
every single interaction because, as is, it leaves a lot on the table.

Also, thanks for all the feedback. Glad to receive it.

-Saravana

2023-04-05 21:10:15

by Saravana Kannan

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Wed, Apr 5, 2023 at 12:48 AM 'Quentin Perret' via kernel-team
<[email protected]> wrote:
>
> On Tuesday 04 Apr 2023 at 21:49:10 (+0100), Marc Zyngier wrote:
> > On Tue, 04 Apr 2023 20:43:40 +0100,
> > Oliver Upton <[email protected]> wrote:
> > >
> > > Folks,
> > >
> > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> > >
> > > <snip>
> > >
> > > > PCMark
> > > > Higher is better
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Writing | 6864 | 8825 | +29% | 8529 | +24% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > >
> > > > PCMark Performance/mAh
> > > > Higher is better
> > > > +-----------+----------+-----------+--------+------+--------+
> > > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +-----------+----------+-----------+--------+------+--------+
> > > > | Score/mAh | 79 | 88 | +11% | 83 | +7% |
> > > > +-----------+----------+-----------+--------+------+--------+
> > > >
> > > > Roblox
> > > > Higher is better
> > > > +-----+----------+------------+--------+-------+--------+
> > > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +-----+----------+------------+--------+-------+--------+
> > > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% |
> > > > +-----+----------+------------+--------+-------+--------+
> > > >
> > > > Roblox Frames/mAh
> > > > Higher is better
> > > > +------------+----------+------------+--------+--------+--------+
> > > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +------------+----------+------------+--------+--------+--------+
> > > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% |
> > > > +------------+----------+------------+--------+--------+--------+
> > >
> > > </snip>
> > >
> > > > Next steps:
> > > > ===========
> > > > We are continuing to look into communication mechanisms other than
> > > > hypercalls that are just as/more efficient and avoid switching into the VMM
> > > > userspace. Any inputs in this regard are greatly appreciated.
> > >
> > > We're highly unlikely to entertain such an interface in KVM.
> > >
> > > The entire feature is dependent on pinning vCPUs to physical cores, for which
> > > userspace is in the driver's seat. That is a well established and documented
> > > policy which can be seen in the way we handle heterogeneous systems and
> > > vPMU.
> > >
> > > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces
> > > that I would not expect to benefit the typical user of KVM.
> > >
> > > Based on the data above, it would appear that the userspace implementation is
> > > in the same neighborhood as a KVM-based implementation, which only further
> > > weakens the case for moving this into the kernel.
> > >
> > > I certainly can appreciate the motivation for the series, but this feature
> > > should be in userspace as some form of a virtual device.
> >
> > +1 on all of the above.
>
> And I concur with all the above as well. Putting this in the kernel is
> not an obvious fit at all as that requires a number of assumptions about
> the VMM.
>
> As Oliver pointed out, the guest topology, and how it maps to the host
> topology (vcpu pinning etc) is very much a VMM policy decision and will
> be particularly important to handle guest frequency requests correctly.
>
> In addition to that, the VMM's software architecture may have an impact.
> Crosvm for example does device emulation in separate processes for
> security reasons, so it is likely that adjusting the scheduling
> parameters ('util_guest', uclamp, or else) only for the vCPU thread that
> issues frequency requests will be sub-optimal for performance, we may
> want to adjust those parameters for all the tasks that are on the
> critical path.
>
> And at an even higher level, assuming in the kernel a certain mapping of
> vCPU threads to host threads feels kinda wrong, this too is a host
> userspace policy decision I believe. Not that anybody in their right
> mind would want to do this, but I _think_ it would technically be
> feasible to serialize the execution of multiple vCPUs on the same host
> thread, at which point the util_guest thingy becomes entirely bogus. (I
> obviously don't want to conflate this use-case, it's just an example
> that shows the proposed abstraction in the series is not a perfect fit
> for the KVM userspace delegation model.)

See my reply to Oliver and Marc. To me it looks like we are converging
towards having shared memory between guest, host kernel and VMM and
that should address all our concerns.

The guest will see a MMIO device, writing to it will trigger the host
kernel to do the basic "set util_guest/uclamp for the vCPU thread that
corresponds to the vCPU" and then the VMM can do more on top as/if
needed (because it has access to the shared memory too). Does that
make sense?

Even in the extreme example, the stuff the kernel would do would still
be helpful, but not sufficient. You can aggregate the
util_guest/uclamp and do whatever from the VMM.
Technically in the extreme example, you don't need any of this. The
normal util tracking of the vCPU thread on the host side would be
sufficient.

Actually, any time we have only 1 vCPU host thread per VM, we shouldn't
use anything in this patch series and shouldn't instantiate the guest
device at all.

> So +1 from me to move this as a virtual device of some kind. And if the
> extra cost of exiting all the way back to userspace is prohibitive (is
> it btw?),

I think the "13% increase in battery consumption for games" makes it
pretty clear that going to userspace is prohibitive. And that's just
one example.

> then we can try to work on that. Maybe something a la vhost
> can be done to optimize, I'll have a think.
>
> > The one thing I'd like to understand that the comment seems to imply
> > that there is a significant difference in overhead between a hypercall
> > and an MMIO. In my experience, both are pretty similar in cost for a
> > handling location (both in userspace or both in the kernel). MMIO
> > handling is a tiny bit more expensive due to a guaranteed TLB miss
> > followed by a walk of the in-kernel device ranges, but that's all. It
> > should hardly register.
> >
> > And if you really want some super-low latency, low overhead
> > signalling, maybe an exception is the wrong tool for the job. Shared
> > memory communication could be more appropriate.
>
> I presume some kind of signalling mechanism will be necessary to
> synchronously update host scheduling parameters in response to guest
> frequency requests, but if the volume of data requires it then a shared
> buffer + doorbell type of approach should do.

Part of the communication doesn't need synchronous handling by the
host. So, what I said above.

> Thinking about it, using SCMI over virtio would implement exactly that.
> Linux-as-a-guest already supports it IIRC, so possibly the problem
> being addressed in this series could be 'simply' solved using an SCMI
> backend in the VMM...

This will be worse than all the options we've tried so far because it
has the userspace overhead AND uclamp overhead.

-Saravana

2023-04-05 21:11:38

by Saravana Kannan

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Wed, Apr 5, 2023 at 1:06 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> > Hi,
> >
> > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > of the talk is that workloads running in a guest VM get terrible task
> > placement and DVFS behavior when compared to running the same workload in
> > the host. Effectively, no EAS for threads inside VMs. This would make power
> > and performance terrible just by running the workload in a VM even if we
> > assume there is zero virtualization overhead.
> >
> > We have been iterating over different options for communicating between
> > guest and host, ways of applying the information coming from the
> > guest/host, etc to figure out the best performance and power improvements
> > we could get.
> >
> > The patch series in its current state is NOT meant for landing in the
> > upstream kernel. We are sending this patch series to share the current
> > progress and data we have so far. The patch series is meant to be easy to
> > cherry-pick and test on various devices to see what performance and power
> > benefits this might give for others.
> >
> > With this series, a workload running in a VM gets the same task placement
> > and DVFS treatment as it would when running in the host.
> >
> > As expected, we see significant performance improvement and better
> > performance/power ratio. If anyone else wants to try this out for your VM
> > workloads and report findings, that'd be very much appreciated.
> >
> > The idea is to improve VM CPUfreq/sched behavior by:
> > - Having guest kernel to do accurate load tracking by taking host CPU
> > arch/type and frequency into account.
> > - Sharing vCPU run queue utilization information with the host so that the
> > host can do proper frequency scaling and task placement on the host side.
>
> So, not having actually been send many of the patches I've no idea what
> you've done... Please, eradicate this ridiculous idea of sending random
> people a random subset of a patch series. Either send all of it or none,
> this is a bloody nuisance.

Sorry, that was our intention, but we had a scripting error. It's been fixed.

I have a script to use with git send-email's --to-cmd and --cc-cmd
option. It uses get_maintainers.pl to figure out who to email, but it
gets trickier for a patch series that spans maintainer trees.

v2 and later will have everyone get all the patches.

> Having said that; my biggest worry is that you're making scheduler
> internals into an ABI. I would hate for this paravirt interface to tie
> us down.

The only 2 pieces of information shared between host/guest are:

1. Host CPU frequency -- this isn't really scheduler internals and
will map nicely to a virtual cpufreq driver.

2. A vCPU util value between 0 - 1024 where 1024 corresponds to the
highest performance point across all CPUs (taking freq, arch, etc into
consideration). Yes, this currently matches how the run queue util is
tracked, but we can document the interface as "percentage of max
performance capability", but representing it as 0 - 1024 instead of
0-100. That way, even if the scheduler changes how it tracks util in
the future, we can still keep this interface between guest/host and
map it appropriately on the host end.

In either case, we could even have a Windows guest where they might
track vCPU utilization differently and still have this work with the
Linux host with this interface.

Does that sound reasonable to you?

Another option is to convert (2) into a "CPU frequency" request (but
without latching it to values in the CPUfreq table) but it'll add some
unnecessary math (with division) on the guest and host end. But I'd
rather keep it as 0-1024 unless you really want this 2nd option.

-Saravana

2023-04-05 21:50:05

by Saravana Kannan

Subject: Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

On Wed, Apr 5, 2023 at 3:50 AM Dietmar Eggemann
<[email protected]> wrote:
>
> On 04/04/2023 03:11, David Dai wrote:
> > On Mon, Apr 3, 2023 at 4:40 AM Dietmar Eggemann
> > <[email protected]> wrote:
> >>
> >> Hi David,
> > Hi Dietmar, thanks for your comments.
> >>
> >> On 31/03/2023 00:43, David Dai wrote:
>
> [...]
>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 6986ea31c984..998649554344 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
> >>>
> >>> static inline unsigned long task_util(struct task_struct *p)
> >>> {
> >>> - return READ_ONCE(p->se.avg.util_avg);
> >>> + return max(READ_ONCE(p->se.avg.util_avg),
> >>> + READ_ONCE(p->se.avg.util_guest));
> >>> }
> >>>
> >>> static inline unsigned long _task_util_est(struct task_struct *p)
> >>> {
> >>> struct util_est ue = READ_ONCE(p->se.avg.util_est);
> >>>
> >>> - return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
> >>> + return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
> >>> + max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
> >>> }
> >>
> >> I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
> >> used here instead p->se.avg.util_guest.
> > Using p->uclamp_req[UCLAMP_MIN].value would result in folding in
> > uclamp values into task_util and task_util_est for all tasks that have
> > uclamp values set. The intent of these patches isn’t to modify
> > existing uclamp behaviour. Users would also override util values from
> > the guest when they set uclamp values.
> >>
> >> I do understand the issue of inheriting uclamp values at fork but don't
> >> get the not being `additive` thing. We are at task level here.
>
> > Uclamp values are max aggregated with other tasks at the runqueue
> > level when deciding CPU frequency. For example, a vCPU runqueue may
> > have an util of 512 that results in setting 512 to uclamp_min on the
> > vCPU task. This is insufficient to drive a frequency response if it
> > shares the runqueue with another host task running with util of 512 as
> > it would result in a clamped util value of 512 at the runqueue(Ex. If
> > a guest thread had just migrated onto this vCPU).
>
> OK, see your point now. You want an accurate per-task boost for this
> vCPU task on the host run-queue.
> And a scenario in which a vCPU can ask for 100% in these moments is not
> sufficient I guess? In this case uclamp_min could work.

Right. vCPU can have whatever utilization and there can be random host
threads completely unrelated to the VM. And we need to aggregate both
of their util when deciding CPU freq.

>
> >> The fact that you have to max util_avg and util_est directly in
> >> task_util() and _task_util_est() tells me that there are places where
> >> this helps and uclamp_task_util() is not called there.
> > Can you clarify on this point a bit more?
>
> Sorry, I meant s/util_est/util_guest/.
>
> The effect of the change in _task_util_est() you see via:
>
> enqueue_task_fair()
> util_est_enqueue()
> cfs_rq->avg.util_est.enqueued += _task_util_est(p)
>
> so that `sugov_get_util() -> cpu_util_cfs() ->
> cfs_rq->avg.util_est.enqueued` can see the effect of util_guest?
>
> Not sure about the change in task_util() yet.
>
> >> When you say in the cover letter that you tried uclamp_min, how exactly
> >> did you use it? Did you run the existing mainline or did you use
> >> uclamp_min as a replacement for util_guest in this patch here?
>
> > I called sched_setattr_nocheck() with .sched_flags =
> > SCHED_FLAG_UTIL_CLAMP when updating uclamp_min and clamp_max is left
> > at 1024. Uclamp_min was not aggregated with task_util and
> > task_util_est during my testing. The only caveat there is that I added
> > a change to only reset uclamp on fork when testing(I realize there is
> > specifically a SCHED_FLAG_RESET_ON_FORK, but I didn’t want to reset
> > other sched attributes).
>
> OK, understood. It's essentially a util_est v2 for vCPU tasks on host.

Yup. We initially looked into just overwriting util_est, but didn't
think that'll land well with the community :) as it was a bit messier
because we needed to make sure the current util_est update paths don't
run for vCPU tasks on host (because those values would be wrong).

> >>> static inline unsigned long task_util_est(struct task_struct *p)
> >>> @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>> */
> >>> util_est_enqueue(&rq->cfs, p);
> >>>
> >>> + /*
> >>> + * The normal code path for host thread enqueue doesn't take into
> >>> + * account guest task migrations when updating cpufreq util.
> >>> + * So, always update the cpufreq when a vCPU thread has a
> >>> + * non-zero util_guest value.
> >>> + */
> >>> + if (READ_ONCE(p->se.avg.util_guest))
> >>> + cpufreq_update_util(rq, 0);
> >>
> >>
> >> This is because enqueue_entity() -> update_load_avg() ->
> >> attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
> >> (&rq->cfs == cfs_rq) to call cpufreq_update_util()?
> > The enqueue_entity() would not call into update_load_avg() due to the
> > check for !se->avg.last_update_time. se->avg.last_update_time is
> > non-zero because the vCPU task did not migrate before this enqueue.
> > This enqueue path is reached when util_guest is updated for the vCPU
> > task through the sched_setattr_nocheck call where we want to ensure a
> > frequency update occurs.
>
> OK, vCPU tasks are pinned so always !WF_MIGRATED wakeup I guess?

Even if, say, little-vCPU threads are allowed to migrate within little
CPUs, this will still be an issue. While a vCPU thread is continuously
running on a single CPU, a guest thread can migrate into that vCPU and
cause a huge increase in util_guest. But that won't trigger a cpufreq
update on the host side because the host doesn't see a task migration.
That's what David is trying to address.

-Saravana

2023-04-05 22:48:59

by David Dai

Subject: Re: [RFC PATCH 6/6] cpufreq: add kvm-cpufreq driver

On Wed, Apr 5, 2023 at 1:23 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Mar 30, 2023 at 03:43:41PM -0700, David Dai wrote:
>
> > +struct remote_data {
> > + int ret;
> > + struct cpufreq_frequency_table *table;
> > +};
> > +
> > +static void remote_get_freqtbl_num_entries(void *data)
> > +{
> > + struct arm_smccc_res hvc_res;
> > + u32 freq = 1UL;
> > + int *idx = data;
> > +
> > + while (freq != CPUFREQ_TABLE_END) {
> > + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID,
> > + *idx, &hvc_res);
> > + if (hvc_res.a0) {
> > + *idx = -ENODEV;
> > + return;
> > + }
> > + freq = hvc_res.a1;
> > + (*idx)++;
> > + }
> > +}
> > +
> > +static int kvm_cpufreq_get_freqtbl_num_entries(int cpu)
> > +{
> > + int num_entries = 0;
> > +
> > + smp_call_function_single(cpu, remote_get_freqtbl_num_entries, &num_entries, true);
> > + return num_entries;
> > +}
> > +
> > +static void remote_populate_freqtbl(void *data)
> > +{
> > + struct arm_smccc_res hvc_res;
> > + struct remote_data *freq_data = data;
> > + struct cpufreq_frequency_table *pos;
> > + struct cpufreq_frequency_table *table = freq_data->table;
> > + int idx;
> > +
> > + cpufreq_for_each_entry_idx(pos, table, idx) {
> > + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID,
> > + idx, &hvc_res);
> > + if (hvc_res.a0) {
> > + freq_data->ret = -ENODEV;
> > + return;
> > + }
> > + pos->frequency = hvc_res.a1;
> > + }
> > + freq_data->ret = 0;
> > +}
> > +
> > +static int kvm_cpufreq_populate_freqtbl(struct cpufreq_frequency_table *table, int cpu)
> > +{
> > + struct remote_data freq_data;
> > +
> > + freq_data.table = table;
> > + smp_call_function_single(cpu, remote_populate_freqtbl, &freq_data, true);
> > + return freq_data.ret;
> > +}
>

Hi Peter,

Thanks for taking the time to review!

> *WHY* are you sending IPIs to do a hypercall ?!?
>
> You can simply have the hypercall tell what vCPU you're doing this for.
>

We’ll remove all the code that’s making IPI calls and switch over to
populating the frequency tables in the device tree when we move from
RFC to PATCH. It is here for now to make it easier for others to
cherry-pick and test the series without needing more changes to their
VMM.
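
For readers cherry-picking the series, the two-pass discovery the RFC driver performs (first count entries until CPUFREQ_TABLE_END, then populate the allocated table) can be modeled in a few lines of Python with a mocked hypercall; the table values and helper names below are illustrative, not the real SMCCC interface:

```python
# Mock of ARM_SMCCC_VENDOR_HYP_KVM_GET_CPUFREQ_TBL_FUNC_ID: the host
# exposes a frequency table indexed by the hypercall's idx argument.
CPUFREQ_TABLE_END = 0xFFFFFFFF
HOST_TABLE = [500000, 1000000, 1500000, CPUFREQ_TABLE_END]

def hvc_get_freq(idx):
    """Mocked hypercall: returns (a0 error code, a1 frequency)."""
    if idx >= len(HOST_TABLE):
        return (1, 0)          # a0 != 0 -> error
    return (0, HOST_TABLE[idx])

def count_entries():
    """Pass 1: walk indices until the END marker, as in
    remote_get_freqtbl_num_entries()."""
    idx, freq = 0, None
    while freq != CPUFREQ_TABLE_END:
        err, freq = hvc_get_freq(idx)
        if err:
            return -1
        idx += 1
    return idx                  # count includes the END-marker slot

def populate(n):
    """Pass 2: re-query each index to fill the table, as in
    remote_populate_freqtbl()."""
    table = []
    for idx in range(n):
        err, freq = hvc_get_freq(idx)
        if err:
            return None
        table.append(freq)
    return table

assert count_entries() == 4
assert populate(4) == [500000, 1000000, 1500000, CPUFREQ_TABLE_END]
```

In the real driver both passes run on the target CPU (hence the IPIs Peter objects to); the model only captures the table protocol itself.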

Thanks,
David

2023-04-05 22:58:41

by David Dai

Subject: Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

On Wed, Apr 5, 2023 at 1:14 AM Peter Zijlstra <[email protected]> wrote:
>

Hi Peter,

Appreciate your time,

> On Thu, Mar 30, 2023 at 03:43:36PM -0700, David Dai wrote:
> > @@ -499,6 +509,7 @@ struct sched_avg {
> > unsigned long load_avg;
> > unsigned long runnable_avg;
> > unsigned long util_avg;
> > + unsigned long util_guest;
> > struct util_est util_est;
> > } ____cacheline_aligned;
> >
>
> Yeah, no... you'll have to make room first.
>

I’m not sure what you mean. Do you mean making room by reducing
another member in the same struct? If so, which member would be a good
fit to shorten? Or do you mean something else entirely?

Thanks,
David

> struct sched_avg {
> /* typedef u64 -> __u64 */ long long unsigned int last_update_time; /* 0 8 */
> /* typedef u64 -> __u64 */ long long unsigned int load_sum; /* 8 8 */
> /* typedef u64 -> __u64 */ long long unsigned int runnable_sum; /* 16 8 */
> /* typedef u32 -> __u32 */ unsigned int util_sum; /* 24 4 */
> /* typedef u32 -> __u32 */ unsigned int period_contrib; /* 28 4 */
> long unsigned int load_avg; /* 32 8 */
> long unsigned int runnable_avg; /* 40 8 */
> long unsigned int util_avg; /* 48 8 */
> struct util_est {
> unsigned int enqueued; /* 56 4 */
> unsigned int ewma; /* 60 4 */
> } __attribute__((__aligned__(8)))util_est __attribute__((__aligned__(8))); /* 56 8 */
>
> /* size: 64, cachelines: 1, members: 9 */
> /* forced alignments: 1 */
> } __attribute__((__aligned__(64)));
>
>

2023-04-05 23:47:27

by David Dai

Subject: Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

On Wed, Apr 5, 2023 at 2:43 PM Saravana Kannan <[email protected]> wrote:
>
> On Wed, Apr 5, 2023 at 3:50 AM Dietmar Eggemann
> <[email protected]> wrote:
> >
> > On 04/04/2023 03:11, David Dai wrote:
> > > On Mon, Apr 3, 2023 at 4:40 AM Dietmar Eggemann
> > > <[email protected]> wrote:
> > >>
> > >> Hi David,
> > > Hi Dietmar, thanks for your comments.
> > >>
> > >> On 31/03/2023 00:43, David Dai wrote:
> >
> > [...]
> >
> > >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > >>> index 6986ea31c984..998649554344 100644
> > >>> --- a/kernel/sched/fair.c
> > >>> +++ b/kernel/sched/fair.c
> > >>> @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
> > >>>
> > >>> static inline unsigned long task_util(struct task_struct *p)
> > >>> {
> > >>> - return READ_ONCE(p->se.avg.util_avg);
> > >>> + return max(READ_ONCE(p->se.avg.util_avg),
> > >>> + READ_ONCE(p->se.avg.util_guest));
> > >>> }
> > >>>
> > >>> static inline unsigned long _task_util_est(struct task_struct *p)
> > >>> {
> > >>> struct util_est ue = READ_ONCE(p->se.avg.util_est);
> > >>>
> > >>> - return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
> > >>> + return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
> > >>> + max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
> > >>> }
> > >>
> > >> I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
> > >> used here instead p->se.avg.util_guest.
> > > Using p->uclamp_req[UCLAMP_MIN].value would result in folding in
> > > uclamp values into task_util and task_util_est for all tasks that have
> > > uclamp values set. The intent of these patches isn’t to modify
> > > existing uclamp behaviour. Users would also override util values from
> > > the guest when they set uclamp values.
> > >>
> > >> I do understand the issue of inheriting uclamp values at fork but don't
> > >> get the not being `additive` thing. We are at task level here.
> >
> > > Uclamp values are max aggregated with other tasks at the runqueue
> > > level when deciding CPU frequency. For example, a vCPU runqueue may
> > > have a util of 512 that results in setting 512 to uclamp_min on the
> > > vCPU task. This is insufficient to drive a frequency response if it
> > > shares the runqueue with another host task running with util of 512, as
> > > it would result in a clamped util value of 512 at the runqueue (e.g. if
> > > a guest thread had just migrated onto this vCPU).
> >
> > OK, see your point now. You want an accurate per-task boost for this
> > vCPU task on the host run-queue.
> > And a scenario in which a vCPU can ask for 100% in these moments is not
> > sufficient I guess? In this case uclamp_min could work.
>
> Right. vCPU can have whatever utilization and there can be random host
> threads completely unrelated to the VM. And we need to aggregate both
> of their util when deciding CPU freq.
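The scenario above can be put into numbers (illustrative values only): a host task at util 512 shares a runqueue with a vCPU whose guest-side demand just jumped to 512, while the vCPU thread's own util_avg is still near zero.

```python
SCHED_CAPACITY_SCALE = 1024

def freq_util_with_uclamp(rq_util, rq_uclamp_min):
    # Runqueue-level clamping: uclamp_min only raises the aggregate
    # util to a floor, it does not add to it.
    return max(rq_util, rq_uclamp_min)

def freq_util_with_util_guest(host_task_util, vcpu_util_avg, util_guest):
    # Per-task max of host-observed util and guest-reported util,
    # then summed with the other runqueue tasks: guest demand is additive.
    return host_task_util + max(vcpu_util_avg, util_guest)

# uclamp_min=512 on the vCPU is swallowed: the host task alone already
# puts the runqueue at 512, so no extra frequency headroom is requested.
assert freq_util_with_uclamp(512, 512) == 512

# util_guest drives the request toward the true combined demand.
assert freq_util_with_util_guest(512, 0, 512) == 1024
```

This is only a numeric sketch of the argument, not the sugov aggregation code.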
>
> >
> > >> The fact that you have to max util_avg and util_est directly in
> > >> task_util() and _task_util_est() tells me that there are places where
> > >> this helps and uclamp_task_util() is not called there.
> > > Can you clarify on this point a bit more?
> >
> > Sorry, I meant s/util_est/util_guest/.
> >
> > The effect of the change in _task_util_est() you see via:
> >
> > enqueue_task_fair()
> > util_est_enqueue()
> > cfs_rq->avg.util_est.enqueued += _task_util_est(p)
> >
> > so that `sugov_get_util() -> cpu_util_cfs() ->
> > cfs_rq->avg.util_est.enqueued` can see the effect of util_guest?

That sequence looks correct to me.

> >
> > Not sure about the change in task_util() yet.

task_util() provides some signaling in addition to task_util_est() via:

find_energy_efficient_cpu()
cpu_util_next()
lsub_positive(&util, task_util(p));
...
util += task_util(p);
//Can provide a better signal than util_est.

dequeue_task_fair()
util_est_update()
ue.enqueued = task_util(p);
//Updates ue.ewma

Thanks,
David

> >
> > >> When you say in the cover letter that you tried uclamp_min, how exactly
> > >> did you use it? Did you run the existing mainline or did you use
> > >> uclamp_min as a replacement for util_guest in this patch here?
> >
> > > I called sched_setattr_nocheck() with .sched_flags =
> > > SCHED_FLAG_UTIL_CLAMP when updating uclamp_min and clamp_max is left
> > > at 1024. Uclamp_min was not aggregated with task_util and
> > > task_util_est during my testing. The only caveat there is that I added
> > > a change to only reset uclamp on fork when testing (I realize there is
> > > specifically a SCHED_FLAG_RESET_ON_FORK, but I didn’t want to reset
> > > other sched attributes).
> >
> > OK, understood. It's essentially a util_est v2 for vCPU tasks on host.
>
> Yup. We initially looked into just overwriting util_est, but didn't
> think that'll land well with the community :) as it was a bit messier
> because we needed to make sure the current util_est update paths don't
> run for vCPU tasks on host (because those values would be wrong).
>
> > >>> static inline unsigned long task_util_est(struct task_struct *p)
> > >>> @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >>> */
> > >>> util_est_enqueue(&rq->cfs, p);
> > >>>
> > >>> + /*
> > >>> + * The normal code path for host thread enqueue doesn't take into
> > >>> + * account guest task migrations when updating cpufreq util.
> > >>> + * So, always update the cpufreq when a vCPU thread has a
> > >>> + * non-zero util_guest value.
> > >>> + */
> > >>> + if (READ_ONCE(p->se.avg.util_guest))
> > >>> + cpufreq_update_util(rq, 0);
> > >>
> > >>
> > >> This is because enqueue_entity() -> update_load_avg() ->
> > >> attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
> > >> (&rq->cfs == cfs_rq) to call cpufreq_update_util()?
> > > The enqueue_entity() would not call into update_load_avg() due to the
> > > check for !se->avg.last_update_time. se->avg.last_update_time is
> > > non-zero because the vCPU task did not migrate before this enqueue.
> > > This enqueue path is reached when util_guest is updated for the vCPU
> > > task through the sched_setattr_nocheck call where we want to ensure a
> > > frequency update occurs.
> >
> > OK, vCPU tasks are pinned so always !WF_MIGRATED wakeup I guess?
>
> Even if say little-vCPU threads are allowed to migrate within little
> CPUs, this will still be an issue. While a vCPU thread is continuously
> running on a single CPU, a guest thread can migrate into that vCPU and
> cause a huge increase in util_guest. But that won't trigger a cpufreq
> update on the host side because the host doesn't see a task migration.
> That's what David is trying to address.
>
> -Saravana

2023-04-06 07:40:22

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

On Wed, Apr 05, 2023 at 03:54:08PM -0700, David Dai wrote:
> On Wed, Apr 5, 2023 at 1:14 AM Peter Zijlstra <[email protected]> wrote:
> >
>
> Hi Peter,
>
> Appreciate your time,
>
> > On Thu, Mar 30, 2023 at 03:43:36PM -0700, David Dai wrote:
> > > @@ -499,6 +509,7 @@ struct sched_avg {
> > > unsigned long load_avg;
> > > unsigned long runnable_avg;
> > > unsigned long util_avg;
> > > + unsigned long util_guest;
> > > struct util_est util_est;
> > > } ____cacheline_aligned;
> > >
> >
> > Yeah, no... you'll have to make room first.
> >
>
> I’m not sure what you mean. Do you mean making room by reducing
> another member in the same struct? If so, which member would be a good
> fit to shorten? Or do you mean something else entirely?

Yeah, as you can see below, this structure is completely filling up the
cacheline already so there's no room for another member. I've not looked
at this in detail in a little while so I'm not at all sure what would be
the easiest way to free up space.

Going by the rest of the discussion, it seems this is the least of your
problems though.

> > struct sched_avg {
> > /* typedef u64 -> __u64 */ long long unsigned int last_update_time; /* 0 8 */
> > /* typedef u64 -> __u64 */ long long unsigned int load_sum; /* 8 8 */
> > /* typedef u64 -> __u64 */ long long unsigned int runnable_sum; /* 16 8 */
> > /* typedef u32 -> __u32 */ unsigned int util_sum; /* 24 4 */
> > /* typedef u32 -> __u32 */ unsigned int period_contrib; /* 28 4 */
> > long unsigned int load_avg; /* 32 8 */
> > long unsigned int runnable_avg; /* 40 8 */
> > long unsigned int util_avg; /* 48 8 */
> > struct util_est {
> > unsigned int enqueued; /* 56 4 */
> > unsigned int ewma; /* 60 4 */
> > } __attribute__((__aligned__(8)))util_est __attribute__((__aligned__(8))); /* 56 8 */
> >
> > /* size: 64, cachelines: 1, members: 9 */
> > /* forced alignments: 1 */
> > } __attribute__((__aligned__(64)));
> >
> >

2023-04-06 07:41:16

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Wed, Apr 05, 2023 at 02:08:43PM -0700, Saravana Kannan wrote:

> Sorry, that was our intention, but we had a scripting error. It's been fixed.
>
> I have a script to use with git send-email's --to-cmd and --cc-cmd
> option. It uses get_maintainers.pl to figure out who to email, but it
> gets trickier for a patch series that spans maintainer trees.

What I do is I simply run get_maintainers.pl against the full series
diff and CC everybody the same.

Then again, I don't use git-send-email, so I've no idea how to use that.

2023-04-06 07:41:56

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Wed, Apr 05, 2023 at 02:08:43PM -0700, Saravana Kannan wrote:

> The only 2 pieces of information shared between host/guest are:
>
> 1. Host CPU frequency -- this isn't really scheduler internals and
> will map nicely to a virtual cpufreq driver.
>
> 2. A vCPU util value between 0 - 1024 where 1024 corresponds to the
> highest performance point across all CPUs (taking freq, arch, etc into
> consideration). Yes, this currently matches how the run queue util is
> tracked, but we can document the interface as "percentage of max
> performance capability", but representing it as 0 - 1024 instead of
> 0-100. That way, even if the scheduler changes how it tracks util in
> the future, we can still keep this interface between guest/host and
> map it appropriately on the host end.
>
> In either case, we could even have a Windows guest where they might
> track vCPU utilization differently and still have this work with the
> Linux host with this interface.
>
> Does that sound reasonable to you?

Yeah, I suppose that's managable.

Something that wasn't initially clear to me: all this hard-assumes a 1:1
vCPU:CPU relation, right? Which isn't typical in virt land.
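The 0-1024 encoding Saravana sketches above is just a normalization against the best performance point in the system; a minimal reading of it (helper name made up for illustration) is:

```python
SCHED_CAPACITY_SCALE = 1024

def to_iface_units(perf, max_perf):
    """Encode 'perf' as a fraction of the highest performance point
    across all host CPUs, scaled to 0..1024. The guest and host can
    each track utilization however they like, as long as both ends
    agree on this normalization."""
    if perf >= max_perf:
        return SCHED_CAPACITY_SCALE
    return perf * SCHED_CAPACITY_SCALE // max_perf

# A CPU running at half the system's best performance point reports 512.
assert to_iface_units(1000, 2000) == 512
# Requests beyond the max saturate at the scale.
assert to_iface_units(2500, 2000) == SCHED_CAPACITY_SCALE
```

The point of the abstraction is that the interface survives even if either side changes how it tracks utilization internally.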

2023-04-06 08:43:42

by Marc Zyngier

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Wed, 05 Apr 2023 22:00:59 +0100,
Saravana Kannan <[email protected]> wrote:
>
> On Tue, Apr 4, 2023 at 1:49 PM Marc Zyngier <[email protected]> wrote:
> >
> > On Tue, 04 Apr 2023 20:43:40 +0100,
> > Oliver Upton <[email protected]> wrote:
> > >
> > > Folks,
> > >
> > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> > >
> > > <snip>
> > >
> > > > PCMark
> > > > Higher is better
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Writing | 6864 | 8825 | +29% | 8529 | +24% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% |
> > > > +-------------------+----------+------------+--------+-------+--------+
> > > >
> > > > PCMark Performance/mAh
> > > > Higher is better
> > > > +-----------+----------+-----------+--------+------+--------+
> > > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +-----------+----------+-----------+--------+------+--------+
> > > > | Score/mAh | 79 | 88 | +11% | 83 | +7% |
> > > > +-----------+----------+-----------+--------+------+--------+
> > > >
> > > > Roblox
> > > > Higher is better
> > > > +-----+----------+------------+--------+-------+--------+
> > > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +-----+----------+------------+--------+-------+--------+
> > > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% |
> > > > +-----+----------+------------+--------+-------+--------+
> > > >
> > > > Roblox Frames/mAh
> > > > Higher is better
> > > > +------------+----------+------------+--------+--------+--------+
> > > > | | Baseline | Hypercall | %delta | MMIO | %delta |
> > > > +------------+----------+------------+--------+--------+--------+
> > > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% |
> > > > +------------+----------+------------+--------+--------+--------+
> > >
> > > </snip>
> > >
> > > > Next steps:
> > > > ===========
> > > > We are continuing to look into communication mechanisms other than
> > > > hypercalls that are just as/more efficient and avoid switching into the VMM
> > > > userspace. Any inputs in this regard are greatly appreciated.
>
> Hi Oliver and Marc,
>
> Replying to both of you in this one email.
>
> > >
> > > We're highly unlikely to entertain such an interface in KVM.
> > >
> > > The entire feature is dependent on pinning vCPUs to physical cores, for which
> > > userspace is in the driver's seat. That is a well established and documented
> > > policy which can be seen in the way we handle heterogeneous systems and
> > > vPMU.
> > >
> > > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces
> > > that I would not expect to benefit the typical user of KVM.
> > >
> > > Based on the data above, it would appear that the userspace implementation is
> > > in the same neighborhood as a KVM-based implementation, which only further
> > > weakens the case for moving this into the kernel.
>
> Oliver,
>
> Sorry if the tables/data aren't presented in an intuitive way, but
> MMIO vs hypercall is definitely not in the same neighborhood. The
> hypercall method often gives close to 2x the improvement that the MMIO
> method gives. For example:
>
> - Roblox FPS: MMIO improves it by 32% vs hypercall improves it by 57%.
> - Frames/mAh: MMIO improves it by 13% vs hypercall improves it by 26%.
> - PC Mark Data manipulation: MMIO makes it worse by 8% vs hypercall
> improves it by 5%
>
> Hypercall does better for other cases too, just not as good. For example,
> - PC Mark Photo editing: Going from MMIO to hypercall gives a 10% improvement.
>
> These are all pretty non-trivial, at least in the mobile world. Heck,
> whole teams would spend months for 2% improvement in battery :)
>
> > >
> > > I certainly can appreciate the motivation for the series, but this feature
> > > should be in userspace as some form of a virtual device.
> >
> > +1 on all of the above.
>
> Marc and Oliver,
>
> We are not tied to hypercalls. We want to do the right thing here, but
> MMIO going all the way to userspace definitely doesn't cut it as is.
> This is where we need some guidance. See more below.

I don't buy this assertion at all. An MMIO in userspace is already
much better than nothing. One of my many objections to the whole series
is that it is built as a massively invasive thing that has too many
fingers in too many pies, with unsustainable assumptions such as 1:1
mapping between CPU and vCPUs.

I'd rather you build something simple first (pure userspace using
MMIOs), work out where the bottlenecks are, and work with us to add
what is needed to get to something sensible, and only that. I'm not
willing to sacrifice maintainability for maximum performance (the
whole thing reminds me of the in-kernel http server...).

>
> > The one thing I'd like to understand that the comment seems to imply
> > that there is a significant difference in overhead between a hypercall
> and an MMIO. In my experience, both are pretty similar in cost for a
> given handling location (both in userspace or both in the kernel).
>
> I think the main difference really is that in our hypercall vs MMIO
> comparison the hypercall is handled in the kernel vs MMIO goes all the
> way to userspace. I agree with you that the difference probably won't
> be significant if both of them go to the same "depth" in the privilege
> levels.
>
> > MMIO
> > handling is a tiny bit more expensive due to a guaranteed TLB miss
> > followed by a walk of the in-kernel device ranges, but that's all. It
> > should hardly register.
> >
> > And if you really want some super-low latency, low overhead
> > signalling, maybe an exception is the wrong tool for the job. Shared
> > memory communication could be more appropriate.
>
> Yeah, that's one of our next steps. Ideally, we want to use shared
> memory for the host to guest information flow. It's a 32-bit value
> representing the current frequency that the host can update whenever
> the host CPU frequency changes and the guest can read whenever it
> needs it.

Why should the guest care? Why can't the guest ask for an arbitrary
capacity, and get what it gets? You give no information as to *why*
you are doing what you are doing...

>
> For guest to host information flow, we'll need a kick from guest to
> host because we need to take action on the host side when threads
> migrate between vCPUs and cause a significant change in vCPU util.
> Again it can be just a shared memory and some kick. This is what we
> are currently trying to figure out how to do.

That kick would have to go to userspace. There is no way I'm willing
to introduce scheduling primitives inside KVM (the ones we have are
ridiculously bad anyway), and I very much want to avoid extra PV gunk.

> If there are APIs to do this, can you point us to those please? We'd
> also want the shared memory to be accessible by the VMM (so, shared
> between guest kernel, host kernel and VMM).

By default, *ALL* the memory is shared. Isn't that wonderful?

>
> Are the above next steps sane? Or is that a no-go? The main thing we
> want to cut out is the need for having to switch to userspace for
> every single interaction because, as is, it leaves a lot on the table.

Well, for a start, you could disclose how often you hit this DVFS
"device", and when are the critical state changes that must happen
immediately vs those that can simply be posted without having to take
immediate effect.

This sort of information would be much more interesting than a bunch
of benchmarks I know nothing about.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2023-04-06 12:53:18

by Quentin Perret

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Wednesday 05 Apr 2023 at 14:07:18 (-0700), Saravana Kannan wrote:
> On Wed, Apr 5, 2023 at 12:48 AM 'Quentin Perret' via kernel-team
> > And I concur with all the above as well. Putting this in the kernel is
> > not an obvious fit at all as that requires a number of assumptions about
> > the VMM.
> >
> > As Oliver pointed out, the guest topology, and how it maps to the host
> > topology (vcpu pinning etc) is very much a VMM policy decision and will
> > be particularly important to handle guest frequency requests correctly.
> >
> > In addition to that, the VMM's software architecture may have an impact.
> > Crosvm for example does device emulation in separate processes for
> > security reasons, so it is likely that adjusting the scheduling
> > parameters ('util_guest', uclamp, or else) only for the vCPU thread that
> > issues frequency requests will be sub-optimal for performance, we may
> > want to adjust those parameters for all the tasks that are on the
> > critical path.
> >
> > And at an even higher level, assuming in the kernel a certain mapping of
> > vCPU threads to host threads feels kinda wrong, this too is a host
> > userspace policy decision I believe. Not that anybody in their right
> > mind would want to do this, but I _think_ it would technically be
> > feasible to serialize the execution of multiple vCPUs on the same host
> > thread, at which point the util_guest thingy becomes entirely bogus. (I
> > obviously don't want to conflate this use-case, it's just an example
> > that shows the proposed abstraction in the series is not a perfect fit
> > for the KVM userspace delegation model.)
>
> See my reply to Oliver and Marc. To me it looks like we are converging
> towards having shared memory between guest, host kernel and VMM and
> that should address all our concerns.

Hmm, that is not at all my understanding of what has been the most
important part of the feedback so far: this whole thing belongs to
userspace.

> The guest will see a MMIO device, writing to it will trigger the host
> kernel to do the basic "set util_guest/uclamp for the vCPU thread that
> corresponds to the vCPU" and then the VMM can do more on top as/if
> needed (because it has access to the shared memory too). Does that
> make sense?

Not really no. I've given examples of why this doesn't make sense for
the kernel to do this, which still seems to be the case with what you're
suggesting here.

> Even in the extreme example, the stuff the kernel would do would still
> be helpful, but not sufficient. You can aggregate the
> util_guest/uclamp and do whatever from the VMM.
> Technically in the extreme example, you don't need any of this. The
> normal util tracking of the vCPU thread on the host side would be
> sufficient.
>
> Actually, any time we have only 1 vCPU host thread per VM, we shouldn't
> be using anything in this patch series, and shouldn't instantiate the
> guest device at all.

> > So +1 from me to move this as a virtual device of some kind. And if the
> > extra cost of exiting all the way back to userspace is prohibitive (is
> > it btw?),
>
> I think the "13% increase in battery consumption for games" makes it
> pretty clear that going to userspace is prohibitive. And that's just
> one example.

I beg to differ. We need to understand where these 13% come from in more
details. Is it really the actual cost of the userspace exit? Or is it
just that from userspace the only knob you can play with is uclamp and
that didn't reach the expected level of performance?

If that is the userspace exit, then we can work to optimize that -- it's
a fairly common problem in the virt world, nothing special here.

And if the issue is the lack of expressiveness in uclamp, then that too
is something we should work on, but clearly giving vCPU threads more
'power' than normal host threads is a bit of a red flag IMO. vCPU
threads must be constrained in the same way that userspace threads are,
because they _are_ userspace threads.

> > then we can try to work on that. Maybe something a la vhost
> > can be done to optimize, I'll have a think.
> >
> > > The one thing I'd like to understand that the comment seems to imply
> > > that there is a significant difference in overhead between a hypercall
> > > and an MMIO. In my experience, both are pretty similar in cost for a
> > > given handling location (both in userspace or both in the kernel). MMIO
> > > handling is a tiny bit more expensive due to a guaranteed TLB miss
> > > followed by a walk of the in-kernel device ranges, but that's all. It
> > > should hardly register.
> > >
> > > And if you really want some super-low latency, low overhead
> > > signalling, maybe an exception is the wrong tool for the job. Shared
> > > memory communication could be more appropriate.
> >
> > I presume some kind of signalling mechanism will be necessary to
> > synchronously update host scheduling parameters in response to guest
> > frequency requests, but if the volume of data requires it then a shared
> > buffer + doorbell type of approach should do.
>
> Part of the communication doesn't need synchronous handling by the
> host. So, what I said above.

I've also replied to another message about the scale invariance issue,
and I'm not convinced the frequency based interface proposed here really
makes sense. An AMU-like interface is very likely to be superior.

> > Thinking about it, using SCMI over virtio would implement exactly that.
> > Linux-as-a-guest already supports it IIRC, so possibly the problem
> > being addressed in this series could be 'simply' solved using an SCMI
> > backend in the VMM...
>
> This will be worse than all the options we've tried so far because it
> has the userspace overhead AND uclamp overhead.

But it doesn't violate the whole KVM userspace delegation model, so we
should start from there and then optimize further if need be.

Thanks,
Quentin

2023-04-06 21:39:58

by David Dai

Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Thu, Apr 6, 2023 at 5:52 AM Quentin Perret <[email protected]> wrote:
>
> On Wednesday 05 Apr 2023 at 14:07:18 (-0700), Saravana Kannan wrote:
> > On Wed, Apr 5, 2023 at 12:48 AM 'Quentin Perret' via kernel-team
> > > And I concur with all the above as well. Putting this in the kernel is
> > > not an obvious fit at all as that requires a number of assumptions about
> > > the VMM.
> > >
> > > As Oliver pointed out, the guest topology, and how it maps to the host
> > > topology (vcpu pinning etc) is very much a VMM policy decision and will
> > > be particularly important to handle guest frequency requests correctly.
> > >
> > > In addition to that, the VMM's software architecture may have an impact.
> > > Crosvm for example does device emulation in separate processes for
> > > security reasons, so it is likely that adjusting the scheduling
> > > parameters ('util_guest', uclamp, or else) only for the vCPU thread that
> > > issues frequency requests will be sub-optimal for performance, we may
> > > want to adjust those parameters for all the tasks that are on the
> > > critical path.
> > >
> > > And at an even higher level, assuming in the kernel a certain mapping of
> > > vCPU threads to host threads feels kinda wrong, this too is a host
> > > userspace policy decision I believe. Not that anybody in their right
> > > mind would want to do this, but I _think_ it would technically be
> > > feasible to serialize the execution of multiple vCPUs on the same host
> > > thread, at which point the util_guest thingy becomes entirely bogus. (I
> > > obviously don't want to conflate this use-case, it's just an example
> > > that shows the proposed abstraction in the series is not a perfect fit
> > > for the KVM userspace delegation model.)
> >
> > See my reply to Oliver and Marc. To me it looks like we are converging
> > towards having shared memory between guest, host kernel and VMM and
> > that should address all our concerns.
>
> Hmm, that is not at all my understanding of what has been the most
> important part of the feedback so far: this whole thing belongs to
> userspace.
>
> > The guest will see a MMIO device, writing to it will trigger the host
> > kernel to do the basic "set util_guest/uclamp for the vCPU thread that
> > corresponds to the vCPU" and then the VMM can do more on top as/if
> > needed (because it has access to the shared memory too). Does that
> > make sense?
>
> Not really no. I've given examples of why this doesn't make sense for
> the kernel to do this, which still seems to be the case with what you're
> suggesting here.
>
> > Even in the extreme example, the stuff the kernel would do would still
> > be helpful, but not sufficient. You can aggregate the
> > util_guest/uclamp and do whatever from the VMM.
> > Technically in the extreme example, you don't need any of this. The
> > normal util tracking of the vCPU thread on the host side would be
> > sufficient.
> >
> > Actually any time we have only 1 vCPU host thread per VM, we shouldn't
> > be using anything in this patch series and not instantiate the guest
> > device at all.
>
> > > So +1 from me to move this as a virtual device of some kind. And if the
> > > extra cost of exiting all the way back to userspace is prohibitive (is
> > > it btw?),
> >
> > I think the "13% increase in battery consumption for games" makes it
> > pretty clear that going to userspace is prohibitive. And that's just
> > one example.
>

Hi Quentin,

Appreciate the feedback,

> I beg to differ. We need to understand where these 13% come from in more
> details. Is it really the actual cost of the userspace exit? Or is it
> just that from userspace the only knob you can play with is uclamp and
> that didn't reach the expected level of performance?

To clarify, the MMIO numbers shown in the cover letter were collected
by updating the vCPU task's util_guest as opposed to uclamp_min. In
that configuration, userspace (the VMM) handles the mmio_exit from the
guest and makes an ioctl on the host kernel to update util_guest for
the vCPU task.
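
To make that flow concrete, here is a sketch of the assumed VMM-side
conversion: the VMM decodes the guest's MMIO write (a requested
frequency) into a util value in [0, 1024] before issuing the ioctl that
updates util_guest. The helper name, the linear mapping, and the ioctl
itself are assumptions of this RFC configuration, not upstream API, so
only the conversion step is shown:

```c
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024u

/*
 * Hypothetical VMM helper: map a guest frequency request (kHz), decoded
 * from a KVM_EXIT_MMIO write, onto a util_guest value for the vCPU
 * thread. The host-side ioctl that consumes this value exists only in
 * this RFC series, so it is not shown here.
 */
static uint32_t freq_to_util_hint(uint32_t req_khz, uint32_t max_khz)
{
	if (max_khz == 0)
		return 0;
	if (req_khz > max_khz)
		req_khz = max_khz;	/* clamp to the host's max frequency */
	return (uint32_t)((uint64_t)req_khz * SCHED_CAPACITY_SCALE / max_khz);
}
```

E.g. a guest asking for 1.4 GHz on a 2.8 GHz host CPU would map to a
util hint of 512.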

>
> If that is the userspace exit, then we can work to optimize that -- it's
> a fairly common problem in the virt world, nothing special here.
>

Ok, we're open to suggestions on how to better optimize here.

> And if the issue is the lack of expressiveness in uclamp, then that too
> is something we should work on, but clearly giving vCPU threads more
> 'power' than normal host threads is a bit of a red flag IMO. vCPU
> threads must be constrained in the same way that userspace threads are,
> because they _are_ userspace threads.
>
> > > then we can try to work on that. Maybe something a la vhost
> > > can be done to optimize, I'll have a think.
> > >
> > > > The one thing I'd like to understand that the comment seems to imply
> > > > that there is a significant difference in overhead between a hypercall
> > > > and an MMIO. In my experience, both are pretty similar in cost for a
> > > > handling location (both in userspace or both in the kernel). MMIO
> > > > handling is a tiny bit more expensive due to a guaranteed TLB miss
> > > > followed by a walk of the in-kernel device ranges, but that's all. It
> > > > should hardly register.
> > > >
> > > > And if you really want some super-low latency, low overhead
> > > > signalling, maybe an exception is the wrong tool for the job. Shared
> > > > memory communication could be more appropriate.
> > >
> > > I presume some kind of signalling mechanism will be necessary to
> > > synchronously update host scheduling parameters in response to guest
> > > frequency requests, but if the volume of data requires it then a shared
> > > buffer + doorbell type of approach should do.
> >
> > Part of the communication doesn't need synchronous handling by the
> > host. So, what I said above.
>
> I've also replied to another message about the scale invariance issue,
> and I'm not convinced the frequency based interface proposed here really
> makes sense. An AMU-like interface is very likely to be superior.
>

Some sort of AMU-based interface was discussed offline with Saravana,
but I'm not sure how to best implement that. If you have any pointers
to get started, that would be helpful.
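
For context on what an AMU-like interface would buy: rather than
exchanging frequencies, the guest reads a constant-rate counter and a
core-cycle counter and derives its frequency scale from their ratio,
similar to how arm64 computes frequency invariance from the AMU
registers. A sketch of that arithmetic follows; how such counters would
be exposed to a guest is exactly the open question, so treat the
interface as assumed:

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1u << SCHED_CAPACITY_SHIFT)

/*
 * Derive a frequency scale factor from deltas of a core-cycle counter
 * and a constant-rate counter over the same window, normalized so that
 * running at max frequency yields SCHED_CAPACITY_SCALE (1024). Modeled
 * on the arm64 AMU-based arch_scale_freq_tick(); guest access to such
 * counters is an assumption here.
 */
static uint64_t amu_freq_scale(uint64_t core_delta, uint64_t const_delta,
			       uint64_t max_freq_hz, uint64_t const_rate_hz)
{
	uint64_t scale;

	if (!const_delta || !max_freq_hz)
		return SCHED_CAPACITY_SCALE;	/* fall back to no scaling */

	/* avg_freq / max_freq in fixed point */
	scale = (core_delta << SCHED_CAPACITY_SHIFT) / const_delta;
	scale = scale * const_rate_hz / max_freq_hz;
	return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
}
```

For example, with a 100 MHz constant counter and a 2 GHz max frequency,
a vCPU that averaged 1 GHz over the window gets a scale of 512.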

> > > Thinking about it, using SCMI over virtio would implement exactly that.
> > > Linux-as-a-guest already supports it IIRC, so possibly the problem
> > > being addressed in this series could be 'simply' solved using an SCMI
> > > backend in the VMM...
> >
> > This will be worse than all the options we've tried so far because it
> > has the userspace overhead AND uclamp overhead.
>
> But it doesn't violate the whole KVM userspace delegation model, so we
> should start from there and then optimize further if need be.

Do you have any references we could use to get started experimenting
with SCMI (e.g. SCMI backend support in CrosVM)?

For RFC v3, I'll post a CPUfreq driver implementation that only uses
MMIO and requires no host kernel modifications (i.e. only uses uclamp
as a knob to tune the host), along with performance numbers, and then
work on optimizing from there.
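
For reference, that uclamp-only knob can be driven entirely from the
VMM through the existing sched_setattr(2) interface. Below is a rough
sketch, assuming the host kernel has CONFIG_UCLAMP_TASK enabled; the
struct mirrors the uapi sched_attr layout and no host kernel changes
are involved:

```c
/*
 * Sketch: applying a guest frequency hint from the VMM using only the
 * existing uclamp interface. Raising uclamp_min on the vCPU thread
 * makes schedutil pick at least the corresponding frequency. How the
 * MMIO request maps to a util value is VMM policy, i.e. an assumption.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SCHED_FLAG_KEEP_ALL
#define SCHED_FLAG_KEEP_ALL       0x18	/* keep policy and params */
#endif
#ifndef SCHED_FLAG_UTIL_CLAMP_MIN
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Fill a sched_attr that touches only uclamp_min. */
static void vcpu_uclamp_attr(struct sched_attr *a, uint32_t util_min)
{
	memset(a, 0, sizeof(*a));
	a->size = sizeof(*a);
	a->sched_flags = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MIN;
	a->sched_util_min = util_min;
}

/* Returns 0 on success, -1 with errno set otherwise. */
static int set_vcpu_uclamp_min(pid_t vcpu_tid, uint32_t util_min)
{
	struct sched_attr attr;

	vcpu_uclamp_attr(&attr, util_min);
	return syscall(SYS_sched_setattr, vcpu_tid, &attr, 0);
}
```

A VMM would call set_vcpu_uclamp_min() from its MMIO-exit handler;
dropping the hint is just another call with a lower value.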

Thanks,
David

>
> Thanks,
> Quentin

2023-04-27 07:48:42

by Pavan Kondeti

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> Hi,
>
> This patch series is a continuation of the talk Saravana gave at LPC 2022
> titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> of the talk is that workloads running in a guest VM get terrible task
> placement and DVFS behavior when compared to running the same workload in
> the host. Effectively, no EAS for threads inside VMs. This would make power
> and performance terrible just by running the workload in a VM even if we
> assume there is zero virtualization overhead.
>
> We have been iterating over different options for communicating between
> guest and host, ways of applying the information coming from the
> guest/host, etc to figure out the best performance and power improvements
> we could get.
>
> The patch series in its current state is NOT meant for landing in the
> upstream kernel. We are sending this patch series to share the current
> progress and data we have so far. The patch series is meant to be easy to
> cherry-pick and test on various devices to see what performance and power
> benefits this might give for others.
>
> With this series, a workload running in a VM gets the same task placement
> and DVFS treatment as it would when running in the host.
>
> As expected, we see significant performance improvement and better
> performance/power ratio. If anyone else wants to try this out for your VM
> workloads and report findings, that'd be very much appreciated.
>
> The idea is to improve VM CPUfreq/sched behavior by:
> - Having guest kernel to do accurate load tracking by taking host CPU
> arch/type and frequency into account.
> - Sharing vCPU run queue utilization information with the host so that the
> host can do proper frequency scaling and task placement on the host side.
>

[...]

>
> Next steps:
> ===========
> We are continuing to look into communication mechanisms other than
> hypercalls that are just as/more efficient and avoid switching into the VMM
> userspace. Any inputs in this regard are greatly appreciated.
>

I am trying to understand why a virtio based cpufreq does not work
here. The VMM on the host can process requests from the guest VM such
as reading the freq table, reading the current frequency, and setting
the min_freq. I believe the virtio backend has mechanisms for
acceleration (vhost) so that user space is not involved for every
frequency request from the guest.

It has the advantages that (1) it is hypervisor agnostic (virtio,
basically) and (2) the scheduler does not need additional input; the
aggregated min_freq requests from all guests should be sufficient.

>
> [1] - https://lpc.events/event/16/contributions/1195/
> [2] - https://lpc.events/event/16/contributions/1195/attachments/970/1893/LPC%202022%20-%20VM%20DVFS.pdf
> [3] - https://www.youtube.com/watch?v=hIg_5bg6opU
> [4] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4208668
> [5] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4288027
>
> David Dai (6):
> sched/fair: Add util_guest for tasks
> kvm: arm64: Add support for get_cur_cpufreq service
> kvm: arm64: Add support for util_hint service
> kvm: arm64: Add support for get_freqtbl service
> dt-bindings: cpufreq: add bindings for virtual kvm cpufreq
> cpufreq: add kvm-cpufreq driver
>
> .../bindings/cpufreq/cpufreq-virtual-kvm.yaml | 39 +++
> Documentation/virt/kvm/api.rst | 28 ++
> .../virt/kvm/arm/get_cur_cpufreq.rst | 21 ++
> Documentation/virt/kvm/arm/get_freqtbl.rst | 23 ++
> Documentation/virt/kvm/arm/index.rst | 3 +
> Documentation/virt/kvm/arm/util_hint.rst | 22 ++
> arch/arm64/include/uapi/asm/kvm.h | 3 +
> arch/arm64/kvm/arm.c | 3 +
> arch/arm64/kvm/hypercalls.c | 60 +++++
> drivers/cpufreq/Kconfig | 13 +
> drivers/cpufreq/Makefile | 1 +
> drivers/cpufreq/kvm-cpufreq.c | 245 ++++++++++++++++++
> include/linux/arm-smccc.h | 21 ++
> include/linux/sched.h | 12 +
> include/uapi/linux/kvm.h | 3 +
> kernel/sched/core.c | 24 +-
> kernel/sched/fair.c | 15 +-
> tools/arch/arm64/include/uapi/asm/kvm.h | 3 +
> 18 files changed, 536 insertions(+), 3 deletions(-)
> create mode 100644 Documentation/devicetree/bindings/cpufreq/cpufreq-virtual-kvm.yaml
> create mode 100644 Documentation/virt/kvm/arm/get_cur_cpufreq.rst
> create mode 100644 Documentation/virt/kvm/arm/get_freqtbl.rst
> create mode 100644 Documentation/virt/kvm/arm/util_hint.rst
> create mode 100644 drivers/cpufreq/kvm-cpufreq.c

Thanks,
Pavan

2023-04-27 09:58:46

by Gupta, Pankaj

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior


>> This patch series is a continuation of the talk Saravana gave at LPC 2022
>> titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
>> of the talk is that workloads running in a guest VM get terrible task
>> placement and DVFS behavior when compared to running the same workload in
>> the host. Effectively, no EAS for threads inside VMs. This would make power
>> and performance terrible just by running the workload in a VM even if we
>> assume there is zero virtualization overhead.
>>
>> We have been iterating over different options for communicating between
>> guest and host, ways of applying the information coming from the
>> guest/host, etc to figure out the best performance and power improvements
>> we could get.
>>
>> The patch series in its current state is NOT meant for landing in the
>> upstream kernel. We are sending this patch series to share the current
>> progress and data we have so far. The patch series is meant to be easy to
>> cherry-pick and test on various devices to see what performance and power
>> benefits this might give for others.
>>
>> With this series, a workload running in a VM gets the same task placement
>> and DVFS treatment as it would when running in the host.
>>
>> As expected, we see significant performance improvement and better
>> performance/power ratio. If anyone else wants to try this out for your VM
>> workloads and report findings, that'd be very much appreciated.
>>
>> The idea is to improve VM CPUfreq/sched behavior by:
>> - Having guest kernel to do accurate load tracking by taking host CPU
>> arch/type and frequency into account.
>> - Sharing vCPU run queue utilization information with the host so that the
>> host can do proper frequency scaling and task placement on the host side.
>>
>
> [...]
>
>>
>> Next steps:
>> ===========
>> We are continuing to look into communication mechanisms other than
>> hypercalls that are just as/more efficient and avoid switching into the VMM
>> userspace. Any inputs in this regard are greatly appreciated.
>>
>
> I am trying to understand why a virtio based cpufreq does not work
> here. The VMM on the host can process requests from the guest VM such
> as reading the freq table, reading the current frequency, and setting
> the min_freq. I believe the virtio backend has mechanisms for
> acceleration (vhost) so that user space is not involved for every
> frequency request from the guest.
>
> It has the advantages that (1) it is hypervisor agnostic (virtio,
> basically) and (2) the scheduler does not need additional input; the
> aggregated min_freq requests from all guests should be sufficient.

Also want to add: (3) a virtio based solution would definitely be
better from a performance POV, as it would avoid the expensive vmexits
we have with hypercalls.

Thanks,
Pankaj


2023-04-27 11:49:47

by Pavan Kondeti

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

On Thu, Apr 27, 2023 at 11:52:29AM +0200, Gupta, Pankaj wrote:
>
> > > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > > of the talk is that workloads running in a guest VM get terrible task
> > > placement and DVFS behavior when compared to running the same workload in
> > > the host. Effectively, no EAS for threads inside VMs. This would make power
> > > and performance terrible just by running the workload in a VM even if we
> > > assume there is zero virtualization overhead.
> > >
> > > We have been iterating over different options for communicating between
> > > guest and host, ways of applying the information coming from the
> > > guest/host, etc to figure out the best performance and power improvements
> > > we could get.
> > >
> > > The patch series in its current state is NOT meant for landing in the
> > > upstream kernel. We are sending this patch series to share the current
> > > progress and data we have so far. The patch series is meant to be easy to
> > > cherry-pick and test on various devices to see what performance and power
> > > benefits this might give for others.
> > >
> > > With this series, a workload running in a VM gets the same task placement
> > > and DVFS treatment as it would when running in the host.
> > >
> > > As expected, we see significant performance improvement and better
> > > performance/power ratio. If anyone else wants to try this out for your VM
> > > workloads and report findings, that'd be very much appreciated.
> > >
> > > The idea is to improve VM CPUfreq/sched behavior by:
> > > - Having guest kernel to do accurate load tracking by taking host CPU
> > > arch/type and frequency into account.
> > > - Sharing vCPU run queue utilization information with the host so that the
> > > host can do proper frequency scaling and task placement on the host side.
> > >
> >
> > [...]
> >
> > >
> > > Next steps:
> > > ===========
> > > We are continuing to look into communication mechanisms other than
> > > hypercalls that are just as/more efficient and avoid switching into the VMM
> > > userspace. Any inputs in this regard are greatly appreciated.
> > >
> >
> > I am trying to understand why a virtio based cpufreq does not work
> > here. The VMM on the host can process requests from the guest VM such
> > as reading the freq table, reading the current frequency, and setting
> > the min_freq. I believe the virtio backend has mechanisms for
> > acceleration (vhost) so that user space is not involved for every
> > frequency request from the guest.
> >
> > It has the advantages that (1) it is hypervisor agnostic (virtio,
> > basically) and (2) the scheduler does not need additional input; the
> > aggregated min_freq requests from all guests should be sufficient.
>
> Also want to add: (3) a virtio based solution would definitely be
> better from a performance POV, as it would avoid the expensive vmexits
> we have with hypercalls.
>
>
I just went through the whole discussion; it seems David mentioned he
would rewrite this series with a virtio frontend and the VMM in user
space taking care of the requests. Will wait for that series to land.

Thanks,
Pavan