2024-01-27 00:43:42

by David Dai

[permalink] [raw]
Subject: [PATCH v5 0/2] Improve VM CPUfreq and task placement behavior

Hi,

This patch series is a continuation of the talk Saravana gave at LPC 2022
titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
of the talk is that workloads running in a guest VM get terrible task
placement and CPUfreq behavior when compared to running the same workload
in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
inside VMs. This would make power and performance terrible just by running
the workload in a VM even if we assume there is zero virtualization
overhead.

With this series, a workload running in a VM gets the same task placement
and CPUfreq behavior as it would when running in the host.

The idea is to improve VM CPUfreq/sched behavior by:
- Having guest kernel do accurate load tracking by taking host CPU
arch/type and frequency into account.
- Sharing vCPU frequency requirements with the host so that the
host can do proper frequency scaling and task placement on the host side.

Based on feedback from RFC v1 proposal[4], we've revised our
implementation to using MMIO reads and writes to pass information
from/to host instead of using hypercalls. In our example, the
VMM(Virtual Machine Manager) translates the frequency requests into
Uclamp_min and applies it to the vCPU thread as a hint to the host
kernel.

To achieve the results below, configure the host to:
- Affine vCPUs to specific clusters.
- Set vCPU capacity to match the host CPU they are running on.

To make it easy for folks to try this out with CrosVM, we have put up
userspace patches[5][6]. With those patches, you can configure CrosVM
correctly by adding the options "--host-cpu-topology" and "--virt-cpufreq".

Results:
========

Here are some side-by-side comparisons of RFC v1 proposal vs the current
patch series and are labelled as follows.

- (RFC v1) UtilHyp = hypercall + util_guest
- (current) UClampMMIO = MMIO + UClamp_min

Use cases running a minimal system inside a VM on a Pixel 6:
============================================================

FIO
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Usecase(avg MB/s) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Seq Write | 13.3 | 16.4 | +23% | 13.6 | +2% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Write | 11.2 | 12.9 | +15% | 11.8 | +8% |
+-------------------+----------+---------+--------+------------+--------+
| Seq Read | 100 | 168 | +68% | 138 | +38% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Read | 20.5 | 35.6 | +74% | 31.0 | +51% |
+-------------------+----------+---------+--------+------------+--------+

CPU-based ML Inference Benchmark
Lower is better
+----------------+----------+------------+--------+------------+--------+
| Test Case (ms) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+----------------+----------+------------+--------+------------+--------+
| Cached Sample | | | | | |
| Inference | 3.40 | 2.37 | -30% | 2.99 | -12% |
+----------------+----------+------------+--------+------------+--------+
| Small Sample | | | | | |
| Inference | 9.87 | 6.78 | -31% | 7.65 | -22% |
+----------------+----------+------------+--------+------------+--------+
| Large Sample | | | | | |
| Inference | 33.35 | 26.74 | -20% | 31.05 | -7% |
+----------------+----------+------------+--------+------------+--------+

Use cases running Android inside a VM on a Chromebook:
======================================================

PCMark (Emulates real world usecases)
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Test Case (score) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Weighted Total | 6190 | 7442 | +20% | 7171 | +16% |
+-------------------+----------+---------+--------+------------+--------+
| Web Browsing | 5461 | 6620 | +21% | 6284 | +15% |
+-------------------+----------+---------+--------+------------+--------+
| Video Editing | 4891 | 5376 | +10% | 5344 | +9% |
+-------------------+----------+---------+--------+------------+--------+
| Writing | 6929 | 8791 | +27% | 8457 | +22% |
+-------------------+----------+---------+--------+------------+--------+
| Photo Editing | 7966 | 12057 | +51% | 11881 | +49% |
+-------------------+----------+---------+--------+------------+--------+
| Data Manipulation | 5596 | 6057 | +8% | 5694 | +2% |
+-------------------+----------+---------+--------+------------+--------+

PCMark Performance/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Score/mAh | 87 | 100 | +15% | 92 | +5% |
+-------------------+----------+---------+--------+------------+--------+

Roblox
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| FPS | 17.92 | 21.82 | +22% | 20.02 | +12% |
+-------------------+----------+---------+--------+------------+--------+

Roblox Frames/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Frames/mAh | 77.91 | 84.46 | +8% | 81.71 | 5% |
+-------------------+----------+---------+--------+------------+--------+

We've simplified our implementation based on community feedback to make
it less intrusive and to use a more generic MMIO interface for
communication with the host. The results show that the current design
still has tangible improvements over baseline. We'll continue looking
into ways to reduce the overhead of the MMIO read/writes and submit
separate and generic patches for that if we find any good optimizations.

Thanks,
David & Saravana

Cc: Saravana Kannan <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Pavan Kondeti <[email protected]>
Cc: Gupta Pankaj <[email protected]>
Cc: Mel Gorman <[email protected]>

v4 -> v5:
-Added dt-binding description to allow for normalized frequencies
-Updated dt-binding examples with normalized frequency values
-Updated cpufreq exit to use dev_pm_opp_free_cpufreq_table to free tables
-Updated fast_switch and target_index to use entries from cpufreq tables
-Refreshed benchmark numbers using indexed frequencies
-Added missing header that was indirectly being used

v3 -> v4:
-Fixed dt-binding formatting issues
-Added additional dt-binding descriptions for “HW interfaces”
-Changed dt-binding to “qemu,virtual-cpufreq”
-Fixed Kconfig formatting issues
-Removed frequency downscaling when requesting frequency updates
-Removed ops and cpufreq driver data
-Added check to limit freq_scale to 1024
-Added KHZ in the register offset naming
-Added comments to explain FIE and not allowing dvfs_possible_from_any_cpu

v2 -> v3:
- Dropped patches adding new hypercalls
- Dropped patch adding util_guest in sched/fair
- Cpufreq driver now populates frequency using opp bindings
- Removed transition_delay_us=1 cpufreq setting as it was configured too
agressively and resulted in poor I/O performance
- Modified guest cpufreq driver to read/write MMIO regions instead of
using hypercalls to communicate with the host
- Modified guest cpufreq driver to pass frequency info instead of
utilization of the current vCPU's runqueue which now takes
iowait_boost into account from the schedutil governor
- Updated DT bindings for a virtual CPU frequency device
Userspace changes:
- Updated CrosVM patches to emulate a virtual cpufreq device
- Updated to newer userspace binaries when collecting more recent
benchmark data

v1 -> v2:
- No functional changes.
- Added description for EAS and removed DVFS in coverletter.
- Added a v2 tag to the subject.
- Fixed up the inconsistent "units" between tables.
- Made sure everyone is To/Cc-ed for all the patches in the series.

[1] - https://lpc.events/event/16/contributions/1195/
[2] - https://lpc.events/event/16/contributions/1195/attachments/970/1893/LPC%202022%20-%20VM%20DVFS.pdf
[3] - https://www.youtube.com/watch?v=hIg_5bg6opU
[4] - https://lore.kernel.org/all/[email protected]/
[5] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4208668
[6] - https://chromium-review.googlesource.com/q/topic:%22virtcpufreq-v5%22

David Dai (2):
dt-bindings: cpufreq: add virtual cpufreq device
cpufreq: add virtual-cpufreq driver

.../cpufreq/qemu,cpufreq-virtual.yaml | 110 +++++++++
drivers/cpufreq/Kconfig | 15 ++
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/virtual-cpufreq.c | 209 ++++++++++++++++++
include/linux/arch_topology.h | 1 +
5 files changed, 336 insertions(+)
create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
create mode 100644 drivers/cpufreq/virtual-cpufreq.c

--
2.43.0.429.g432eaa2c6b-goog



2024-01-27 00:44:49

by David Dai

[permalink] [raw]
Subject: [PATCH v5 2/2] cpufreq: add virtual-cpufreq driver

Introduce a virtualized cpufreq driver for guest kernels to improve
performance and power of workloads within VMs.

This driver does two main things:

1. Sends the frequency of vCPUs as a hint to the host. The host uses the
hint to schedule the vCPU threads and decide physical CPU frequency.

2. If a VM does not support a virtualized FIE(like AMUs), it queries the
host CPU frequency by reading a MMIO region of a virtual cpufreq device
to update the guest's frequency scaling factor periodically. This enables
accurate Per-Entity Load Tracking for tasks running in the guest.

Co-developed-by: Saravana Kannan <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Signed-off-by: David Dai <[email protected]>
---
drivers/cpufreq/Kconfig | 15 +++
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/virtual-cpufreq.c | 209 ++++++++++++++++++++++++++++++
include/linux/arch_topology.h | 1 +
4 files changed, 226 insertions(+)
create mode 100644 drivers/cpufreq/virtual-cpufreq.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 35efb53d5492..f2d37075aa10 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -217,6 +217,21 @@ config CPUFREQ_DT

If in doubt, say N.

+config CPUFREQ_VIRT
+ tristate "Virtual cpufreq driver"
+ depends on OF
+ select PM_OPP
+ help
+ This adds a virtualized cpufreq driver for guest kernels that
+ read/writes to a MMIO region for a virtualized cpufreq device to
+ communicate with the host. It sends frequency updates to the host
+ which gets used as a hint to schedule vCPU threads and select CPU
+ frequency. If a VM does not support a virtualized FIE such as AMUs,
+ it updates the frequency scaling factor by polling host CPU frequency
+ to enable accurate Per-Entity Load Tracking for tasks running in the guest.
+
+ If in doubt, say N.
+
config CPUFREQ_DT_PLATDEV
tristate "Generic DT based cpufreq platdev driver"
depends on OF
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 8d141c71b016..eb72ecdc24db 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o

obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o
obj-$(CONFIG_CPUFREQ_DT_PLATDEV) += cpufreq-dt-platdev.o
+obj-$(CONFIG_CPUFREQ_VIRT) += virtual-cpufreq.o

# Traces
CFLAGS_amd-pstate-trace.o := -I$(src)
diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
new file mode 100644
index 000000000000..0132f430a13e
--- /dev/null
+++ b/drivers/cpufreq/virtual-cpufreq.c
@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ */
+
+#include <linux/arch_topology.h>
+#include <linux/cpufreq.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of_address.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/pm_opp.h>
+#include <linux/slab.h>
+
+#define REG_CUR_FREQ_KHZ_OFFSET 0x0
+#define REG_SET_FREQ_KHZ_OFFSET 0x4
+#define PER_CPU_OFFSET 0x8
+
+static void __iomem *base;
+
+static void virt_scale_freq_tick(void)
+{
+ int cpu = smp_processor_id();
+ u32 max_freq = (u32)cpufreq_get_hw_max_freq(cpu);
+ u64 cur_freq;
+ unsigned long scale;
+
+ cur_freq = (u64)readl_relaxed(base + cpu * PER_CPU_OFFSET
+ + REG_CUR_FREQ_KHZ_OFFSET);
+
+ cur_freq <<= SCHED_CAPACITY_SHIFT;
+ scale = (unsigned long)div_u64(cur_freq, max_freq);
+ scale = min(scale, SCHED_CAPACITY_SCALE);
+
+ this_cpu_write(arch_freq_scale, scale);
+}
+
+static struct scale_freq_data virt_sfd = {
+ .source = SCALE_FREQ_SOURCE_VIRT,
+ .set_freq_scale = virt_scale_freq_tick,
+};
+
+static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ writel_relaxed(target_freq,
+ base + policy->cpu * PER_CPU_OFFSET + REG_SET_FREQ_KHZ_OFFSET);
+ return 0;
+}
+
+static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ virt_cpufreq_set_perf(policy, target_freq);
+ return target_freq;
+}
+
+static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
+ unsigned int index)
+{
+ return virt_cpufreq_set_perf(policy,
+ policy->freq_table[index].frequency);
+}
+
+static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+ struct cpufreq_frequency_table *table;
+ struct device *cpu_dev;
+ int ret;
+
+ cpu_dev = get_cpu_device(policy->cpu);
+ if (!cpu_dev)
+ return -ENODEV;
+
+ ret = dev_pm_opp_of_add_table(cpu_dev);
+ if (ret)
+ return ret;
+
+ ret = dev_pm_opp_get_opp_count(cpu_dev);
+ if (ret <= 0) {
+ dev_err(cpu_dev, "OPP table can't be empty\n");
+ return -ENODEV;
+ }
+
+ ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
+ if (ret) {
+ dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
+ return ret;
+ }
+
+ policy->freq_table = table;
+
+ /*
+ * To simplify and improve latency of handling frequency requests on
+ * the host side, this ensures that the vCPU thread triggering the MMIO
+ * abort is the same thread whose performance constraints (Ex. uclamp
+ * settings) need to be updated. This simplifies the VMM (Virtual
+ * Machine Manager) having to find the correct vCPU thread and/or
+ * facing permission issues when configuring other threads.
+ */
+ policy->dvfs_possible_from_any_cpu = false;
+ policy->fast_switch_possible = true;
+
+ /*
+ * Using the default SCALE_FREQ_SOURCE_CPUFREQ is insufficient since
+ * the actual physical CPU frequency may not match requested frequency
+ * from the vCPU thread due to frequency update latencies or other
+ * inputs to the physical CPU frequency selection. This additional FIE
+ * source allows for more accurate freq_scale updates and only takes
+ * effect if another FIE source such as AMUs have not been registered.
+ */
+ topology_set_scale_freq_source(&virt_sfd, policy->cpus);
+
+ return 0;
+}
+
+static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+ struct device *cpu_dev;
+
+ cpu_dev = get_cpu_device(policy->cpu);
+ if (!cpu_dev)
+ return -ENODEV;
+
+ topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
+ dev_pm_opp_free_cpufreq_table(cpu_dev, &policy->freq_table);
+ return 0;
+}
+
+static int virt_cpufreq_online(struct cpufreq_policy *policy)
+{
+ /* Nothing to restore. */
+ return 0;
+}
+
+static int virt_cpufreq_offline(struct cpufreq_policy *policy)
+{
+ /* Dummy offline() to avoid exit() being called and freeing resources. */
+ return 0;
+}
+
+static struct cpufreq_driver cpufreq_virt_driver = {
+ .name = "virt-cpufreq",
+ .init = virt_cpufreq_cpu_init,
+ .exit = virt_cpufreq_cpu_exit,
+ .online = virt_cpufreq_online,
+ .offline = virt_cpufreq_offline,
+ .verify = cpufreq_generic_frequency_table_verify,
+ .target_index = virt_cpufreq_target_index,
+ .fast_switch = virt_cpufreq_fast_switch,
+ .attr = cpufreq_generic_attr,
+};
+
+static int virt_cpufreq_driver_probe(struct platform_device *pdev)
+{
+ int ret;
+
+ base = devm_platform_ioremap_resource(pdev, 0);
+ if (IS_ERR(base))
+ return PTR_ERR(base);
+
+ ret = cpufreq_register_driver(&cpufreq_virt_driver);
+ if (ret) {
+ dev_err(&pdev->dev, "Virtual CPUFreq driver failed to register: %d\n", ret);
+ return ret;
+ }
+
+ dev_dbg(&pdev->dev, "Virtual CPUFreq driver initialized\n");
+ return 0;
+}
+
+static int virt_cpufreq_driver_remove(struct platform_device *pdev)
+{
+ cpufreq_unregister_driver(&cpufreq_virt_driver);
+ return 0;
+}
+
+static const struct of_device_id virt_cpufreq_match[] = {
+ { .compatible = "qemu,virtual-cpufreq", .data = NULL},
+ {}
+};
+MODULE_DEVICE_TABLE(of, virt_cpufreq_match);
+
+static struct platform_driver virt_cpufreq_driver = {
+ .probe = virt_cpufreq_driver_probe,
+ .remove = virt_cpufreq_driver_remove,
+ .driver = {
+ .name = "virt-cpufreq",
+ .of_match_table = virt_cpufreq_match,
+ },
+};
+
+static int __init virt_cpufreq_init(void)
+{
+ return platform_driver_register(&virt_cpufreq_driver);
+}
+postcore_initcall(virt_cpufreq_init);
+
+static void __exit virt_cpufreq_exit(void)
+{
+ platform_driver_unregister(&virt_cpufreq_driver);
+}
+module_exit(virt_cpufreq_exit);
+
+MODULE_DESCRIPTION("Virtual cpufreq driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index a63d61ca55af..fb272b4bf7b1 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -49,6 +49,7 @@ enum scale_freq_source {
SCALE_FREQ_SOURCE_CPUFREQ = 0,
SCALE_FREQ_SOURCE_ARCH,
SCALE_FREQ_SOURCE_CPPC,
+ SCALE_FREQ_SOURCE_VIRT,
};

struct scale_freq_data {
--
2.43.0.429.g432eaa2c6b-goog


2024-01-27 01:03:17

by David Dai

[permalink] [raw]
Subject: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

Adding bindings to represent a virtual cpufreq device.

Virtual machines may expose MMIO regions for a virtual cpufreq device
for guests to read frequency information or to request frequency
selection. The virtual cpufreq device has an individual controller for
each frequency domain. Performance points for a given domain can be
normalized across all domains for ease of allowing for virtual machines
to migrate between hosts.

Co-developed-by: Saravana Kannan <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Signed-off-by: David Dai <[email protected]>
---
.../cpufreq/qemu,cpufreq-virtual.yaml | 110 ++++++++++++++++++
1 file changed, 110 insertions(+)
create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml

diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
new file mode 100644
index 000000000000..cd617baf75e7
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
@@ -0,0 +1,110 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Virtual CPUFreq
+
+maintainers:
+ - David Dai <[email protected]>
+ - Saravana Kannan <[email protected]>
+
+description:
+ Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
+ selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
+ is associated with a frequency domain which can be shared with other vCPUs.
+ Each frequency domain has its own set of registers for frequency controls.
+
+properties:
+ compatible:
+ const: qemu,virtual-cpufreq
+
+ reg:
+ maxItems: 1
+ description:
+ Address and size of region containing frequency controls for each of the
+ frequency domains. Regions for each frequency domain is placed
+ contiguously and contain registers for controlling DVFS(Dynamic Frequency
+ and Voltage) characteristics. The size of the region is proportional to
+ total number of frequency domains. This device also needs the CPUs to
+ list their OPPs using operating-points-v2 tables. The OPP tables for the
+ CPUs should use normalized "frequency" values where the OPP with the
+ highest performance among all the vCPUs is listed as 1024 KHz. The rest
+ of the frequencies of all the vCPUs should be normalized based on their
+ performance relative to that 1024 KHz OPP. This makes it much easier to
+ migrate the VM across systems which might have different physical CPU
+ OPPs.
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ // This example shows a two CPU configuration with a frequency domain
+ // for each CPU showing normalized performance points.
+ cpus {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ cpu@0 {
+ compatible = "arm,armv8";
+ device_type = "cpu";
+ reg = <0x0>;
+ operating-points-v2 = <&opp_table0>;
+ };
+
+ cpu@1 {
+ compatible = "arm,armv8";
+ device_type = "cpu";
+ reg = <0x0>;
+ operating-points-v2 = <&opp_table1>;
+ };
+ };
+
+ opp_table0: opp-table-0 {
+ compatible = "operating-points-v2";
+
+ opp64000 { opp-hz = /bits/ 64 <64000>; };
+ opp128000 { opp-hz = /bits/ 64 <128000>; };
+ opp192000 { opp-hz = /bits/ 64 <192000>; };
+ opp256000 { opp-hz = /bits/ 64 <256000>; };
+ opp320000 { opp-hz = /bits/ 64 <320000>; };
+ opp384000 { opp-hz = /bits/ 64 <384000>; };
+ opp425000 { opp-hz = /bits/ 64 <425000>; };
+ };
+
+ opp_table1: opp-table-1 {
+ compatible = "operating-points-v2";
+
+ opp64000 { opp-hz = /bits/ 64 <64000>; };
+ opp128000 { opp-hz = /bits/ 64 <128000>; };
+ opp192000 { opp-hz = /bits/ 64 <192000>; };
+ opp256000 { opp-hz = /bits/ 64 <256000>; };
+ opp320000 { opp-hz = /bits/ 64 <320000>; };
+ opp384000 { opp-hz = /bits/ 64 <384000>; };
+ opp448000 { opp-hz = /bits/ 64 <448000>; };
+ opp512000 { opp-hz = /bits/ 64 <512000>; };
+ opp576000 { opp-hz = /bits/ 64 <576000>; };
+ opp640000 { opp-hz = /bits/ 64 <640000>; };
+ opp704000 { opp-hz = /bits/ 64 <704000>; };
+ opp768000 { opp-hz = /bits/ 64 <768000>; };
+ opp832000 { opp-hz = /bits/ 64 <832000>; };
+ opp896000 { opp-hz = /bits/ 64 <896000>; };
+ opp960000 { opp-hz = /bits/ 64 <960000>; };
+ opp1024000 { opp-hz = /bits/ 64 <1024000>; };
+
+ };
+
+ soc {
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ cpufreq@1040000 {
+ compatible = "qemu,virtual-cpufreq";
+ reg = <0x1040000 0x10>;
+ };
+ };
--
2.43.0.429.g432eaa2c6b-goog


2024-01-31 01:14:24

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] cpufreq: add virtual-cpufreq driver

Hi David,

kernel test robot noticed the following build errors:

[auto build test ERROR on rafael-pm/linux-next]
[also build test ERROR on robh/for-next linus/master rafael-pm/acpi-bus v6.8-rc2 next-20240130]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/David-Dai/dt-bindings-cpufreq-add-virtual-cpufreq-device/20240127-084645
base: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
patch link: https://lore.kernel.org/r/20240127004321.1902477-3-davidai%40google.com
patch subject: [PATCH v5 2/2] cpufreq: add virtual-cpufreq driver
config: x86_64-randconfig-121-20240130 (https://download.01.org/0day-ci/archive/20240131/[email protected]/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240131/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: topology_set_scale_freq_source
>>> referenced by virtual-cpufreq.c:115 (drivers/cpufreq/virtual-cpufreq.c:115)
>>> drivers/cpufreq/virtual-cpufreq.o:(virt_cpufreq_cpu_init) in archive vmlinux.a
--
>> ld.lld: error: undefined symbol: topology_clear_scale_freq_source
>>> referenced by virtual-cpufreq.c:128 (drivers/cpufreq/virtual-cpufreq.c:128)
>>> drivers/cpufreq/virtual-cpufreq.o:(virt_cpufreq_cpu_exit) in archive vmlinux.a
--
>> ld.lld: error: undefined symbol: arch_freq_scale
>>> referenced by virtual-cpufreq.c:38 (drivers/cpufreq/virtual-cpufreq.c:38)
>>> drivers/cpufreq/virtual-cpufreq.o:(virt_scale_freq_tick) in archive vmlinux.a

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-01-31 17:11:02

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Fri, Jan 26, 2024 at 04:43:15PM -0800, David Dai wrote:
> Adding bindings to represent a virtual cpufreq device.
>
> Virtual machines may expose MMIO regions for a virtual cpufreq device
> for guests to read frequency information or to request frequency
> selection. The virtual cpufreq device has an individual controller for
> each frequency domain. Performance points for a given domain can be
> normalized across all domains for ease of allowing for virtual machines
> to migrate between hosts.
>
> Co-developed-by: Saravana Kannan <[email protected]>
> Signed-off-by: Saravana Kannan <[email protected]>
> Signed-off-by: David Dai <[email protected]>
> ---
> .../cpufreq/qemu,cpufreq-virtual.yaml | 110 ++++++++++++++++++

> + const: qemu,virtual-cpufreq

Well, the filename almost matches the compatible.

> +
> + reg:
> + maxItems: 1
> + description:
> + Address and size of region containing frequency controls for each of the
> + frequency domains. Regions for each frequency domain is placed
> + contiguously and contain registers for controlling DVFS(Dynamic Frequency
> + and Voltage) characteristics. The size of the region is proportional to
> + total number of frequency domains. This device also needs the CPUs to
> + list their OPPs using operating-points-v2 tables. The OPP tables for the
> + CPUs should use normalized "frequency" values where the OPP with the
> + highest performance among all the vCPUs is listed as 1024 KHz. The rest
> + of the frequencies of all the vCPUs should be normalized based on their
> + performance relative to that 1024 KHz OPP. This makes it much easier to
> + migrate the VM across systems which might have different physical CPU
> + OPPs.
> +
> +required:
> + - compatible
> + - reg
> +
> +additionalProperties: false
> +
> +examples:
> + - |
> + // This example shows a two CPU configuration with a frequency domain
> + // for each CPU showing normalized performance points.
> + cpus {
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + cpu@0 {
> + compatible = "arm,armv8";
> + device_type = "cpu";
> + reg = <0x0>;
> + operating-points-v2 = <&opp_table0>;
> + };
> +
> + cpu@1 {
> + compatible = "arm,armv8";
> + device_type = "cpu";
> + reg = <0x0>;
> + operating-points-v2 = <&opp_table1>;
> + };
> + };
> +
> + opp_table0: opp-table-0 {
> + compatible = "operating-points-v2";
> +
> + opp64000 { opp-hz = /bits/ 64 <64000>; };

opp-64000 is the preferred form.

> + opp128000 { opp-hz = /bits/ 64 <128000>; };
> + opp192000 { opp-hz = /bits/ 64 <192000>; };
> + opp256000 { opp-hz = /bits/ 64 <256000>; };
> + opp320000 { opp-hz = /bits/ 64 <320000>; };
> + opp384000 { opp-hz = /bits/ 64 <384000>; };
> + opp425000 { opp-hz = /bits/ 64 <425000>; };
> + };
> +
> + opp_table1: opp-table-1 {
> + compatible = "operating-points-v2";
> +
> + opp64000 { opp-hz = /bits/ 64 <64000>; };
> + opp128000 { opp-hz = /bits/ 64 <128000>; };
> + opp192000 { opp-hz = /bits/ 64 <192000>; };
> + opp256000 { opp-hz = /bits/ 64 <256000>; };
> + opp320000 { opp-hz = /bits/ 64 <320000>; };
> + opp384000 { opp-hz = /bits/ 64 <384000>; };
> + opp448000 { opp-hz = /bits/ 64 <448000>; };
> + opp512000 { opp-hz = /bits/ 64 <512000>; };
> + opp576000 { opp-hz = /bits/ 64 <576000>; };
> + opp640000 { opp-hz = /bits/ 64 <640000>; };
> + opp704000 { opp-hz = /bits/ 64 <704000>; };
> + opp768000 { opp-hz = /bits/ 64 <768000>; };
> + opp832000 { opp-hz = /bits/ 64 <832000>; };
> + opp896000 { opp-hz = /bits/ 64 <896000>; };
> + opp960000 { opp-hz = /bits/ 64 <960000>; };
> + opp1024000 { opp-hz = /bits/ 64 <1024000>; };
> +
> + };

I don't recall your prior versions having an OPP table. Maybe it was
incomplete. You are designing the "h/w" interface. Why don't you make it
discoverable or implicit (fixed for the h/w)? Do you really need it if
the frequency is normalized?

Also, we have "opp-level" for opaque values that aren't Hz.

Rob

2024-01-31 18:58:20

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Wed, Jan 31, 2024 at 9:06 AM Rob Herring <[email protected]> wrote:
>
> On Fri, Jan 26, 2024 at 04:43:15PM -0800, David Dai wrote:
> > Adding bindings to represent a virtual cpufreq device.
> >
> > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > for guests to read frequency information or to request frequency
> > selection. The virtual cpufreq device has an individual controller for
> > each frequency domain. Performance points for a given domain can be
> > normalized across all domains for ease of allowing for virtual machines
> > to migrate between hosts.
> >
> > Co-developed-by: Saravana Kannan <[email protected]>
> > Signed-off-by: Saravana Kannan <[email protected]>
> > Signed-off-by: David Dai <[email protected]>
> > ---
> > .../cpufreq/qemu,cpufreq-virtual.yaml | 110 ++++++++++++++++++
>
> > + const: qemu,virtual-cpufreq
>
> Well, the filename almost matches the compatible.
>
> > +
> > + reg:
> > + maxItems: 1
> > + description:
> > + Address and size of region containing frequency controls for each of the
> > + frequency domains. Regions for each frequency domain is placed
> > + contiguously and contain registers for controlling DVFS(Dynamic Frequency
> > + and Voltage) characteristics. The size of the region is proportional to
> > + total number of frequency domains. This device also needs the CPUs to
> > + list their OPPs using operating-points-v2 tables. The OPP tables for the
> > + CPUs should use normalized "frequency" values where the OPP with the
> > + highest performance among all the vCPUs is listed as 1024 KHz. The rest
> > + of the frequencies of all the vCPUs should be normalized based on their
> > + performance relative to that 1024 KHz OPP. This makes it much easier to
> > + migrate the VM across systems which might have different physical CPU
> > + OPPs.
> > +
> > +required:
> > + - compatible
> > + - reg
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > + - |
> > + // This example shows a two CPU configuration with a frequency domain
> > + // for each CPU showing normalized performance points.
> > + cpus {
> > + #address-cells = <1>;
> > + #size-cells = <0>;
> > +
> > + cpu@0 {
> > + compatible = "arm,armv8";
> > + device_type = "cpu";
> > + reg = <0x0>;
> > + operating-points-v2 = <&opp_table0>;
> > + };
> > +
> > + cpu@1 {
> > + compatible = "arm,armv8";
> > + device_type = "cpu";
> > + reg = <0x0>;
> > + operating-points-v2 = <&opp_table1>;
> > + };
> > + };
> > +
> > + opp_table0: opp-table-0 {
> > + compatible = "operating-points-v2";
> > +
> > + opp64000 { opp-hz = /bits/ 64 <64000>; };
>
> opp-64000 is the preferred form.
>
> > + opp128000 { opp-hz = /bits/ 64 <128000>; };
> > + opp192000 { opp-hz = /bits/ 64 <192000>; };
> > + opp256000 { opp-hz = /bits/ 64 <256000>; };
> > + opp320000 { opp-hz = /bits/ 64 <320000>; };
> > + opp384000 { opp-hz = /bits/ 64 <384000>; };
> > + opp425000 { opp-hz = /bits/ 64 <425000>; };
> > + };
> > +
> > + opp_table1: opp-table-1 {
> > + compatible = "operating-points-v2";
> > +
> > + opp64000 { opp-hz = /bits/ 64 <64000>; };
> > + opp128000 { opp-hz = /bits/ 64 <128000>; };
> > + opp192000 { opp-hz = /bits/ 64 <192000>; };
> > + opp256000 { opp-hz = /bits/ 64 <256000>; };
> > + opp320000 { opp-hz = /bits/ 64 <320000>; };
> > + opp384000 { opp-hz = /bits/ 64 <384000>; };
> > + opp448000 { opp-hz = /bits/ 64 <448000>; };
> > + opp512000 { opp-hz = /bits/ 64 <512000>; };
> > + opp576000 { opp-hz = /bits/ 64 <576000>; };
> > + opp640000 { opp-hz = /bits/ 64 <640000>; };
> > + opp704000 { opp-hz = /bits/ 64 <704000>; };
> > + opp768000 { opp-hz = /bits/ 64 <768000>; };
> > + opp832000 { opp-hz = /bits/ 64 <832000>; };
> > + opp896000 { opp-hz = /bits/ 64 <896000>; };
> > + opp960000 { opp-hz = /bits/ 64 <960000>; };
> > + opp1024000 { opp-hz = /bits/ 64 <1024000>; };
> > +
> > + };
>
> I don't recall your prior versions having an OPP table. Maybe it was
> incomplete. You are designing the "h/w" interface. Why don't you make it
> discoverable or implicit (fixed for the h/w)?

We also need the OPP tables to indicate which CPUs are part of the
same cluster, etc. Don't want to invent a new "protocol" and just use
existing DT bindings.

> Do you really need it if the frequency is normalized?

Yeah, we can have little and big CPUs and want to emulate different
performance levels. So while the Fmax on big is 1024, we still want to
be able to say little is 425. So we definitely need frequency tables.

> Also, we have "opp-level" for opaque values that aren't Hz.

Still want to keep it Hz to be compatible with arch_freq_scale and
when virtualized CPU perf counters are available.

-Saravana

2024-02-02 15:54:06

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Wed, Jan 31, 2024 at 10:23:03AM -0800, Saravana Kannan wrote:
> On Wed, Jan 31, 2024 at 9:06 AM Rob Herring <[email protected]> wrote:
> >
> > On Fri, Jan 26, 2024 at 04:43:15PM -0800, David Dai wrote:
> > > Adding bindings to represent a virtual cpufreq device.
> > >
> > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > for guests to read frequency information or to request frequency
> > > selection. The virtual cpufreq device has an individual controller for
> > > each frequency domain. Performance points for a given domain can be
> > > normalized across all domains for ease of allowing for virtual machines
> > > to migrate between hosts.
> > >
> > > Co-developed-by: Saravana Kannan <[email protected]>
> > > Signed-off-by: Saravana Kannan <[email protected]>
> > > Signed-off-by: David Dai <[email protected]>
> > > ---
> > > .../cpufreq/qemu,cpufreq-virtual.yaml | 110 ++++++++++++++++++
> >
> > > + const: qemu,virtual-cpufreq
> >
> > Well, the filename almost matches the compatible.
> >
> > > +
> > > + reg:
> > > + maxItems: 1
> > > + description:
> > > + Address and size of region containing frequency controls for each of the
> > > + frequency domains. Regions for each frequency domain is placed
> > > + contiguously and contain registers for controlling DVFS(Dynamic Frequency
> > > + and Voltage) characteristics. The size of the region is proportional to
> > > + total number of frequency domains. This device also needs the CPUs to
> > > + list their OPPs using operating-points-v2 tables. The OPP tables for the
> > > + CPUs should use normalized "frequency" values where the OPP with the
> > > + highest performance among all the vCPUs is listed as 1024 KHz. The rest
> > > + of the frequencies of all the vCPUs should be normalized based on their
> > > + performance relative to that 1024 KHz OPP. This makes it much easier to
> > > + migrate the VM across systems which might have different physical CPU
> > > + OPPs.
> > > +
> > > +required:
> > > + - compatible
> > > + - reg
> > > +
> > > +additionalProperties: false
> > > +
> > > +examples:
> > > + - |
> > > + // This example shows a two CPU configuration with a frequency domain
> > > + // for each CPU showing normalized performance points.
> > > + cpus {
> > > + #address-cells = <1>;
> > > + #size-cells = <0>;
> > > +
> > > + cpu@0 {
> > > + compatible = "arm,armv8";
> > > + device_type = "cpu";
> > > + reg = <0x0>;
> > > + operating-points-v2 = <&opp_table0>;
> > > + };
> > > +
> > > + cpu@1 {
> > > + compatible = "arm,armv8";
> > > + device_type = "cpu";
> > > + reg = <0x0>;
> > > + operating-points-v2 = <&opp_table1>;
> > > + };
> > > + };
> > > +
> > > + opp_table0: opp-table-0 {
> > > + compatible = "operating-points-v2";
> > > +
> > > + opp64000 { opp-hz = /bits/ 64 <64000>; };
> >
> > opp-64000 is the preferred form.
> >
> > > + opp128000 { opp-hz = /bits/ 64 <128000>; };
> > > + opp192000 { opp-hz = /bits/ 64 <192000>; };
> > > + opp256000 { opp-hz = /bits/ 64 <256000>; };
> > > + opp320000 { opp-hz = /bits/ 64 <320000>; };
> > > + opp384000 { opp-hz = /bits/ 64 <384000>; };
> > > + opp425000 { opp-hz = /bits/ 64 <425000>; };
> > > + };
> > > +
> > > + opp_table1: opp-table-1 {
> > > + compatible = "operating-points-v2";
> > > +
> > > + opp64000 { opp-hz = /bits/ 64 <64000>; };
> > > + opp128000 { opp-hz = /bits/ 64 <128000>; };
> > > + opp192000 { opp-hz = /bits/ 64 <192000>; };
> > > + opp256000 { opp-hz = /bits/ 64 <256000>; };
> > > + opp320000 { opp-hz = /bits/ 64 <320000>; };
> > > + opp384000 { opp-hz = /bits/ 64 <384000>; };
> > > + opp448000 { opp-hz = /bits/ 64 <448000>; };
> > > + opp512000 { opp-hz = /bits/ 64 <512000>; };
> > > + opp576000 { opp-hz = /bits/ 64 <576000>; };
> > > + opp640000 { opp-hz = /bits/ 64 <640000>; };
> > > + opp704000 { opp-hz = /bits/ 64 <704000>; };
> > > + opp768000 { opp-hz = /bits/ 64 <768000>; };
> > > + opp832000 { opp-hz = /bits/ 64 <832000>; };
> > > + opp896000 { opp-hz = /bits/ 64 <896000>; };
> > > + opp960000 { opp-hz = /bits/ 64 <960000>; };
> > > + opp1024000 { opp-hz = /bits/ 64 <1024000>; };
> > > +
> > > + };
> >
> > I don't recall your prior versions having an OPP table. Maybe it was
> > incomplete. You are designing the "h/w" interface. Why don't you make it
> > discoverable or implicit (fixed for the h/w)?
>
> We also need the OPP tables to indicate which CPUs are part of the
> same cluster, etc. Don't want to invent a new "protocol" and just use
> existing DT bindings.

Topology binding is for that.

What about when x86 and other ACPI systems need to do this too? You
define a discoverable interface, then it works regardless of firmware.
KVM, Virtio, VFIO, etc. are all their own protocols.

> > Do you really need it if the frequency is normalized?
>
> Yeah, we can have little and big CPUs and want to emulate different
> performance levels. So while the Fmax on big is 1024, we still want to
> be able to say little is 425. So we definitely need frequency tables.

You need per CPU Fmax, sure. But all the frequencies? I don't follow why
you don't just have a max available capacity and then request the
desired capacity. Then the host maps that to an underlying OPP. Why have
an intermediate set of fake frequencies?

As these are normalized, I guess you are normalizing for capacity as
well? Or you are using "capacity-dmips-mhz"?

I'm also lost how this would work when you migrate and the underlying
CPU changes. The DT is fixed.

> > Also, we have "opp-level" for opaque values that aren't Hz.
>
> Still want to keep it Hz to be compatible with arch_freq_scale and
> when virtualized CPU perf counters are available.

Seems like no one would want "opp-level" then. Shrug.

Anyway, if Viresh and Marc are fine with all this, I'll shut up.

Rob

2024-02-04 10:23:23

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Fri, 02 Feb 2024 15:53:52 +0000,
Rob Herring <[email protected]> wrote:
>
> On Wed, Jan 31, 2024 at 10:23:03AM -0800, Saravana Kannan wrote:
> > On Wed, Jan 31, 2024 at 9:06 AM Rob Herring <[email protected]> wrote:
> > >
> > > On Fri, Jan 26, 2024 at 04:43:15PM -0800, David Dai wrote:
> > > > Adding bindings to represent a virtual cpufreq device.
> > > >
> > > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > > for guests to read frequency information or to request frequency
> > > > selection. The virtual cpufreq device has an individual controller for
> > > > each frequency domain. Performance points for a given domain can be
> > > > normalized across all domains for ease of allowing for virtual machines
> > > > to migrate between hosts.
> > > >
> > > > Co-developed-by: Saravana Kannan <[email protected]>
> > > > Signed-off-by: Saravana Kannan <[email protected]>
> > > > Signed-off-by: David Dai <[email protected]>
> > > > ---
> > > > .../cpufreq/qemu,cpufreq-virtual.yaml | 110 ++++++++++++++++++
> > >
> > > > + const: qemu,virtual-cpufreq
> > >
> > > Well, the filename almost matches the compatible.
> > >
> > > > +
> > > > + reg:
> > > > + maxItems: 1
> > > > + description:
> > > > + Address and size of region containing frequency controls for each of the
> > > > + frequency domains. Regions for each frequency domain is placed
> > > > + contiguously and contain registers for controlling DVFS(Dynamic Frequency
> > > > + and Voltage) characteristics. The size of the region is proportional to
> > > > + total number of frequency domains. This device also needs the CPUs to
> > > > + list their OPPs using operating-points-v2 tables. The OPP tables for the
> > > > + CPUs should use normalized "frequency" values where the OPP with the
> > > > + highest performance among all the vCPUs is listed as 1024 KHz. The rest
> > > > + of the frequencies of all the vCPUs should be normalized based on their
> > > > + performance relative to that 1024 KHz OPP. This makes it much easier to
> > > > + migrate the VM across systems which might have different physical CPU
> > > > + OPPs.
> > > > +
> > > > +required:
> > > > + - compatible
> > > > + - reg
> > > > +
> > > > +additionalProperties: false
> > > > +
> > > > +examples:
> > > > + - |
> > > > + // This example shows a two CPU configuration with a frequency domain
> > > > + // for each CPU showing normalized performance points.
> > > > + cpus {
> > > > + #address-cells = <1>;
> > > > + #size-cells = <0>;
> > > > +
> > > > + cpu@0 {
> > > > + compatible = "arm,armv8";
> > > > + device_type = "cpu";
> > > > + reg = <0x0>;
> > > > + operating-points-v2 = <&opp_table0>;
> > > > + };
> > > > +
> > > > + cpu@1 {
> > > > + compatible = "arm,armv8";
> > > > + device_type = "cpu";
> > > > + reg = <0x0>;
> > > > + operating-points-v2 = <&opp_table1>;
> > > > + };
> > > > + };
> > > > +
> > > > + opp_table0: opp-table-0 {
> > > > + compatible = "operating-points-v2";
> > > > +
> > > > + opp64000 { opp-hz = /bits/ 64 <64000>; };
> > >
> > > opp-64000 is the preferred form.
> > >
> > > > + opp128000 { opp-hz = /bits/ 64 <128000>; };
> > > > + opp192000 { opp-hz = /bits/ 64 <192000>; };
> > > > + opp256000 { opp-hz = /bits/ 64 <256000>; };
> > > > + opp320000 { opp-hz = /bits/ 64 <320000>; };
> > > > + opp384000 { opp-hz = /bits/ 64 <384000>; };
> > > > + opp425000 { opp-hz = /bits/ 64 <425000>; };
> > > > + };
> > > > +
> > > > + opp_table1: opp-table-1 {
> > > > + compatible = "operating-points-v2";
> > > > +
> > > > + opp64000 { opp-hz = /bits/ 64 <64000>; };
> > > > + opp128000 { opp-hz = /bits/ 64 <128000>; };
> > > > + opp192000 { opp-hz = /bits/ 64 <192000>; };
> > > > + opp256000 { opp-hz = /bits/ 64 <256000>; };
> > > > + opp320000 { opp-hz = /bits/ 64 <320000>; };
> > > > + opp384000 { opp-hz = /bits/ 64 <384000>; };
> > > > + opp448000 { opp-hz = /bits/ 64 <448000>; };
> > > > + opp512000 { opp-hz = /bits/ 64 <512000>; };
> > > > + opp576000 { opp-hz = /bits/ 64 <576000>; };
> > > > + opp640000 { opp-hz = /bits/ 64 <640000>; };
> > > > + opp704000 { opp-hz = /bits/ 64 <704000>; };
> > > > + opp768000 { opp-hz = /bits/ 64 <768000>; };
> > > > + opp832000 { opp-hz = /bits/ 64 <832000>; };
> > > > + opp896000 { opp-hz = /bits/ 64 <896000>; };
> > > > + opp960000 { opp-hz = /bits/ 64 <960000>; };
> > > > + opp1024000 { opp-hz = /bits/ 64 <1024000>; };
> > > > +
> > > > + };
> > >
> > > I don't recall your prior versions having an OPP table. Maybe it was
> > > incomplete. You are designing the "h/w" interface. Why don't you make it
> > > discoverable or implicit (fixed for the h/w)?
> >
> > We also need the OPP tables to indicate which CPUs are part of the
> > same cluster, etc. Don't want to invent a new "protocol" and just use
> > existing DT bindings.
>
> Topology binding is for that.
>
> What about when x86 and other ACPI systems need to do this too? You
> define a discoverable interface, then it works regardless of firmware.
> KVM, Virtio, VFIO, etc. are all their own protocols.

Describing the emulated HW in ACPI would seem appropriate to me. We
are talking about the guest here, so whether this is KVM or not is not
relevant. If this was actually using any soft of data transfer, using
virtio would have been acceptable. But given how simple this is,
piggybacking on virtio is hardly appropriate.

>
> > > Do you really need it if the frequency is normalized?
> >
> > Yeah, we can have little and big CPUs and want to emulate different
> > performance levels. So while the Fmax on big is 1024, we still want to
> > be able to say little is 425. So we definitely need frequency tables.
>
> You need per CPU Fmax, sure. But all the frequencies? I don't follow why
> you don't just have a max available capacity and then request the
> desired capacity. Then the host maps that to an underlying OPP. Why have
> an intermediate set of fake frequencies?
>
> As these are normalized, I guess you are normalizing for capacity as
> well? Or you are using "capacity-dmips-mhz"?
>
> I'm also lost how this would work when you migrate and the underlying
> CPU changes. The DT is fixed.
>
> > > Also, we have "opp-level" for opaque values that aren't Hz.
> >
> > Still want to keep it Hz to be compatible with arch_freq_scale and
> > when virtualized CPU perf counters are available.
>
> Seems like no one would want "opp-level" then. Shrug.
>
> Anyway, if Viresh and Marc are fine with all this, I'll shut up.

Well, I've said it before, and I'll say it again: the use of
*frequencies* makes no sense. It is a lie (it doesn't describe any
hardware, physical nor virtual), and doesn't reflect the way the
emulated cpufreq controller behaves either (since it scales everything
back to what the host can potentially do)

The closest abstraction we have to this is the unit-less capacity. And
*that* reflects the way the emulated cpufreq controller works while
avoiding lying to the guest about some arbitrary frequency.

In practice, this changes nothing to either the code or the behaviour.
But it changes the binding.

M.

--
Without deviation from the norm, progress is not possible.

2024-02-05 08:41:18

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On 02-02-24, 09:53, Rob Herring wrote:
> On Wed, Jan 31, 2024 at 10:23:03AM -0800, Saravana Kannan wrote:
> > We also need the OPP tables to indicate which CPUs are part of the
> > same cluster, etc. Don't want to invent a new "protocol" and just use
> > existing DT bindings.
>
> Topology binding is for that.

This one, right ?

Documentation/devicetree/bindings/dvfs/performance-domain.yaml

> You need per CPU Fmax, sure. But all the frequencies? I don't follow why
> you don't just have a max available capacity and then request the
> desired capacity. Then the host maps that to an underlying OPP. Why have
> an intermediate set of fake frequencies?

+1

> As these are normalized, I guess you are normalizing for capacity as
> well? Or you are using "capacity-dmips-mhz"?
>
> I'm also lost how this would work when you migrate and the underlying
> CPU changes. The DT is fixed.
>
> > > Also, we have "opp-level" for opaque values that aren't Hz.
> >
> > Still want to keep it Hz to be compatible with arch_freq_scale and
> > when virtualized CPU perf counters are available.

These are all specific to a driver only, that can be handled easily I guess. I
don't see a value to using Hz for this to be honest.

--
viresh

2024-02-05 16:52:38

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Mon, Feb 05, 2024 at 02:08:30PM +0530, Viresh Kumar wrote:
> On 02-02-24, 09:53, Rob Herring wrote:
> > On Wed, Jan 31, 2024 at 10:23:03AM -0800, Saravana Kannan wrote:
> > > We also need the OPP tables to indicate which CPUs are part of the
> > > same cluster, etc. Don't want to invent a new "protocol" and just use
> > > existing DT bindings.
> >
> > Topology binding is for that.
>
> This one, right ?
>
> Documentation/devicetree/bindings/dvfs/performance-domain.yaml

No, Documentation/devicetree/bindings/cpu/cpu-topology.txt (or the
schema version of it in dtschema)

Rob


2024-02-15 11:36:23

by Sudeep Holla

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Fri, Feb 02, 2024 at 09:53:52AM -0600, Rob Herring wrote:
> On Wed, Jan 31, 2024 at 10:23:03AM -0800, Saravana Kannan wrote:
> >
> > We also need the OPP tables to indicate which CPUs are part of the
> > same cluster, etc. Don't want to invent a new "protocol" and just use
> > existing DT bindings.
>
> Topology binding is for that.
>
> What about when x86 and other ACPI systems need to do this too? You
> define a discoverable interface, then it works regardless of firmware.
> KVM, Virtio, VFIO, etc. are all their own protocols.
>

+1 for the above. I have mentioned the same couple of times but I am told
it can be taken up later which I fail to understand. Once we define DT
bindings, it must be supported for long time which doesn't provide any
motivation to such a discoverable interface which works on any virtual
platforms irrespective of the firmware.

--
Regards,
Sudeep

2024-03-11 11:40:58

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Sunday 04 Feb 2024 at 10:23:00 (+0000), Marc Zyngier wrote:
> Well, I've said it before, and I'll say it again: the use of
> *frequencies* makes no sense. It is a lie (it doesn't describe any
> hardware, physical nor virtual), and doesn't reflect the way the
> emulated cpufreq controller behaves either (since it scales everything
> back to what the host can potentially do)
>
> The closest abstraction we have to this is the unit-less capacity. And
> *that* reflects the way the emulated cpufreq controller works while
> avoiding lying to the guest about some arbitrary frequency.
>
> In practice, this changes nothing to either the code or the behaviour.
> But it changes the binding.

Apologies all for jumping late into this, but for what it's worth,
regardless of the unit of the binding, Linux will shove that into
cpufreq's 'frequency table' anyway, which as the name suggests is very
much assuming frequencies :/ -- see how struct cpufreq_frequency_table
explicitely requires KHz. The worst part is that this even ends up
being reported to _userspace_ as frequencies in sysfs via cpufreq's
scaling_available_frequencies file, even when they're really not...

In the case of SCMI for example, IIRC the firmware can optionally (and
in practice I think it does for all older implementations of the spec
least) report unit-less operating points to the driver, which will then
happily pretend these are KHz values when reporting that into PM_OPP and
cpufreq -- see how scmi_dvfs_device_opps_add() simply multiplies the
level's 'perf' member by 1000 when populating PM_OPP (which is then
propagated to cpufreq's freq_table'). And a small extract from the SCMI
spec:

"Certain platforms use IMPLEMENTATION DEFINED indices to identify
performance levels. Level Indexing Mode is used to describe such
platform behavior. The level indices associated with performance
levels are neither guaranteed to be contiguous nor required to be
on a linear scale."

Not nice, but unfortunately the core cpufreq framework has way too much
historical dependencies on things being frequencies to really change it
now, so we're pretty much stuck with that :(

So, while I do agree with the sentiment that this is a non-ideal place
to be, 'faking' frequencies is how we've addressed this so far in Linux,
so I'm personally not too fussed about David's usage of a freq-based DT
binding in this particular instance. On the plus side that allows to
re-use all of PM_OPP and cpufreq infrastructure as-is, so that's cool.

I guess we could make the argument that Linux's approach to handling
frequencies shouldn't influence this given that the binding should be OS
agnostic, but I can easily see how another OS could still make use of
that binding (and in fact requiring that this other OS can deal with
unitless frequencies is most likely going to be a bigger problem), so
I'd be inclined to think this isn't a major problem either.

Thanks,
Quentin

2024-05-02 20:20:01

by David Dai

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Thu, Feb 15, 2024 at 3:26 AM Sudeep Holla <[email protected]> wrote:
>
> On Fri, Feb 02, 2024 at 09:53:52AM -0600, Rob Herring wrote:
> > On Wed, Jan 31, 2024 at 10:23:03AM -0800, Saravana Kannan wrote:
> > >
> > > We also need the OPP tables to indicate which CPUs are part of the
> > > same cluster, etc. Don't want to invent a new "protocol" and just use
> > > existing DT bindings.
> >
> > Topology binding is for that.
> >
> > What about when x86 and other ACPI systems need to do this too? You
> > define a discoverable interface, then it works regardless of firmware.
> > KVM, Virtio, VFIO, etc. are all their own protocols.
> >
>
> +1 for the above. I have mentioned the same couple of times but I am told
> it can be taken up later which I fail to understand. Once we define DT
> bindings, it must be supported for long time which doesn't provide any
> motivation to such a discoverable interface which works on any virtual
> platforms irrespective of the firmware.
>

Hi Sudeep,

We are thinking of a discoverable interface like this, where the
performance info and performance domain mappings are discoverable
through the device registers. This should make it more portable across
firmwares. Would this address your concerns? Also, you asked to
document this. Where exactly would you want to document this? AFAIK
the DT bindings documentation is not supposed to include this level of
detail. Would a comment in the driver be sufficient?

CPU0..CPUn
+-------------+-------------------------------+--------+-------+
| Register | Description | Offset | Len |
+-------------+-------------------------------+--------+-------+
| cur_perf | read this register to get | 0x0 | 0x4 |
| | the current perf (integer val | | |
| | representing perf relative to | | |
| | max performance) | | |
| | that vCPU is running at | | |
+-------------+-------------------------------+--------+-------+
| set_perf | write to this register to set | 0x4 | 0x4 |
| | perf value of the vCPU | | |
+-------------+-------------------------------+--------+-------+
| perftbl_len | number of entries in perf | 0x8 | 0x4 |
| | table. A single entry in the | | |
| | perf table denotes no table | | |
| | and the entry contains | | |
| | the maximum perf value | | |
| | that this vCPU supports. | | |
| | The guest can request any | | |
| | value between 1 and max perf. | | |
+---------------------------------------------+--------+-------+
| perftbl_sel | write to this register to | 0xc | 0x4 |
| | select perf table entry to | | |
| | read from | | |
+---------------------------------------------+--------+-------+
| perftbl_rd | read this register to get | 0x10 | 0x4 |
| | perf value of the selected | | |
| | entry based on perftbl_sel | | |
+---------------------------------------------+--------+-------+
| perf_domain | performance domain number | 0x14 | 0x4 |
| | that this vCPU belongs to. | | |
| | vCPUs sharing the same perf | | |
| | domain number are part of the | | |
| | same performance domain. | | |
+-------------+-------------------------------+--------+-------+

Thanks,
David


> --
> Regards,
> Sudeep

2024-05-07 10:22:23

by Sudeep Holla

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Thu, May 02, 2024 at 01:17:57PM -0700, David Dai wrote:
> On Thu, Feb 15, 2024 at 3:26 AM Sudeep Holla <[email protected]> wrote:
> >
> > On Fri, Feb 02, 2024 at 09:53:52AM -0600, Rob Herring wrote:
> > > On Wed, Jan 31, 2024 at 10:23:03AM -0800, Saravana Kannan wrote:
> > > >
> > > > We also need the OPP tables to indicate which CPUs are part of the
> > > > same cluster, etc. Don't want to invent a new "protocol" and just use
> > > > existing DT bindings.
> > >
> > > Topology binding is for that.
> > >
> > > What about when x86 and other ACPI systems need to do this too? You
> > > define a discoverable interface, then it works regardless of firmware.
> > > KVM, Virtio, VFIO, etc. are all their own protocols.
> > >
> >
> > +1 for the above. I have mentioned the same couple of times but I am told
> > it can be taken up later which I fail to understand. Once we define DT
> > bindings, it must be supported for long time which doesn't provide any
> > motivation to such a discoverable interface which works on any virtual
> > platforms irrespective of the firmware.
> >
>
> Hi Sudeep,
>
> We are thinking of a discoverable interface like this, where the
> performance info and performance domain mappings are discoverable
> through the device registers. This should make it more portable across
> firmwares. Would this address your concerns?

Yes.

> Also, you asked to document this.
> Where exactly would you want to document this?

IMO it could go under Documentation/firmware-guide ? Unless someone
has any other suggestions.

> AFAIK the DT bindings documentation is not supposed to include this level of
> detail. Would a comment in the driver be sufficient?

Agree, DT bindings is not the right place. May be even comment in the
driver would be sufficient.

Overall it looks good and on the right path IMO.

>
> CPU0..CPUn
> +-------------+-------------------------------+--------+-------+
> | Register | Description | Offset | Len |
> +-------------+-------------------------------+--------+-------+
> | cur_perf | read this register to get | 0x0 | 0x4 |
> | | the current perf (integer val | | |
> | | representing perf relative to | | |
> | | max performance) | | |
> | | that vCPU is running at | | |
> +-------------+-------------------------------+--------+-------+
> | set_perf | write to this register to set | 0x4 | 0x4 |
> | | perf value of the vCPU | | |
> +-------------+-------------------------------+--------+-------+
> | perftbl_len | number of entries in perf | 0x8 | 0x4 |
> | | table. A single entry in the | | |
> | | perf table denotes no table | | |
> | | and the entry contains | | |
> | | the maximum perf value | | |
> | | that this vCPU supports. | | |
> | | The guest can request any | | |
> | | value between 1 and max perf. | | |

Does this have to be per cpu ? It can be simplified by keeping
just cur_perf, set_perf and perf_domain in per-cpu entries and this
per domain entries separate. But I am not against per cpu entries
as well.

Also why do you need the table if the guest can request any value from
1 to max perf ? The table will have discrete OPPs ? If so, how to they
map to the perf range [1 - maxperf] ?

> +---------------------------------------------+--------+-------+
> | perftbl_sel | write to this register to | 0xc | 0x4 |
> | | select perf table entry to | | |
> | | read from | | |
> +---------------------------------------------+--------+-------+
> | perftbl_rd | read this register to get | 0x10 | 0x4 |
> | | perf value of the selected | | |
> | | entry based on perftbl_sel | | |
> +---------------------------------------------+--------+-------+
> | perf_domain | performance domain number | 0x14 | 0x4 |
> | | that this vCPU belongs to. | | |
> | | vCPUs sharing the same perf | | |
> | | domain number are part of the | | |
> | | same performance domain. | | |
> +-------------+-------------------------------+--------+-------+

The above are couple of high level questions I have ATM.

--
Regards,
Sudeep

2024-05-17 21:00:00

by David Dai

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] dt-bindings: cpufreq: add virtual cpufreq device

On Tue, May 7, 2024 at 3:21 AM Sudeep Holla <[email protected]> wrote:
>
> On Thu, May 02, 2024 at 01:17:57PM -0700, David Dai wrote:
> > On Thu, Feb 15, 2024 at 3:26 AM Sudeep Holla <[email protected]> wrote:
> > >
> > > On Fri, Feb 02, 2024 at 09:53:52AM -0600, Rob Herring wrote:
> > > > On Wed, Jan 31, 2024 at 10:23:03AM -0800, Saravana Kannan wrote:
> > > > >
> > > > > We also need the OPP tables to indicate which CPUs are part of the
> > > > > same cluster, etc. Don't want to invent a new "protocol" and just use
> > > > > existing DT bindings.
> > > >
> > > > Topology binding is for that.
> > > >
> > > > What about when x86 and other ACPI systems need to do this too? You
> > > > define a discoverable interface, then it works regardless of firmware.
> > > > KVM, Virtio, VFIO, etc. are all their own protocols.
> > > >
> > >
> > > +1 for the above. I have mentioned the same couple of times but I am told
> > > it can be taken up later which I fail to understand. Once we define DT
> > > bindings, it must be supported for long time which doesn't provide any
> > > motivation to such a discoverable interface which works on any virtual
> > > platforms irrespective of the firmware.
> > >
> >
> > Hi Sudeep,
> >
> > We are thinking of a discoverable interface like this, where the
> > performance info and performance domain mappings are discoverable
> > through the device registers. This should make it more portable across
> > firmwares. Would this address your concerns?
>
> Yes.
>
> > Also, you asked to document this.
> > Where exactly would you want to document this?
>
> IMO it could go under Documentation/firmware-guide ? Unless someone
> has any other suggestions.
>
> > AFAIK the DT bindings documentation is not supposed to include this level of
> > detail. Would a comment in the driver be sufficient?
>
> Agree, DT bindings is not the right place. May be even comment in the
> driver would be sufficient.

Alright, I’ll make this into a comment in the driver itself.

>
> Overall it looks good and on the right path IMO.
>

Okay, I’ll submit V6 patches and continue from there.

> >
> > CPU0..CPUn
> > +-------------+-------------------------------+--------+-------+
> > | Register | Description | Offset | Len |
> > +-------------+-------------------------------+--------+-------+
> > | cur_perf | read this register to get | 0x0 | 0x4 |
> > | | the current perf (integer val | | |
> > | | representing perf relative to | | |
> > | | max performance) | | |
> > | | that vCPU is running at | | |
> > +-------------+-------------------------------+--------+-------+
> > | set_perf | write to this register to set | 0x4 | 0x4 |
> > | | perf value of the vCPU | | |
> > +-------------+-------------------------------+--------+-------+
> > | perftbl_len | number of entries in perf | 0x8 | 0x4 |
> > | | table. A single entry in the | | |
> > | | perf table denotes no table | | |
> > | | and the entry contains | | |
> > | | the maximum perf value | | |
> > | | that this vCPU supports. | | |
> > | | The guest can request any | | |
> > | | value between 1 and max perf. | | |
>
> Does this have to be per cpu ? It can be simplified by keeping
> just cur_perf, set_perf and perf_domain in per-cpu entries and this
> per domain entries separate. But I am not against per cpu entries
> as well.

I think separating out the perf domain entries may make the device
emulation and the driver slightly more complicated. Emulating the perf
domain regions per CPU is a simpler layout if we need to install eBPF
programs to handle the backend per vCPU. Each vCPU looking up its own
frequency information in its own MMIO region is a bit easier too when
initializing the driver. Also each vCPU will be in its own perf domain
for the majority of the use cases, so it won’t make much of a
difference most of the time.

>
> Also why do you need the table if the guest can request any value from
> 1 to max perf ? The table will have discrete OPPs ? If so, how to they
> map to the perf range [1 - maxperf] ?

Let me clarify this in the comment, the perf range [1 - maxperf] is
only applicable in the case where the frequency table is not
supported. The cpufreq driver will still vote for discrete levels if
tables are used. The VMM(Virtual Machine Manager) may choose to use
tables depending on the use case and the driver will support both
cases.

Thanks,
David

>
> > +---------------------------------------------+--------+-------+
> > | perftbl_sel | write to this register to | 0xc | 0x4 |
> > | | select perf table entry to | | |
> > | | read from | | |
> > +---------------------------------------------+--------+-------+
> > | perftbl_rd | read this register to get | 0x10 | 0x4 |
> > | | perf value of the selected | | |
> > | | entry based on perftbl_sel | | |
> > +---------------------------------------------+--------+-------+
> > | perf_domain | performance domain number | 0x14 | 0x4 |
> > | | that this vCPU belongs to. | | |
> > | | vCPUs sharing the same perf | | |
> > | | domain number are part of the | | |
> > | | same performance domain. | | |
> > +-------------+-------------------------------+--------+-------+
>
> The above are couple of high level questions I have ATM.
>
> --
> Regards,
> Sudeep