2015-11-23 14:28:51

by Juri Lelli

Subject: [RFC PATCH 0/8] CPUs capacity information for heterogeneous systems

Hi all,

ARM systems may be configured to have CPUs with different power/performance
characteristics within the same chip. In this case, additional information has
to be made available to the kernel (the scheduler in particular) for it to be
aware of such differences and take decisions accordingly. This posting stems
from the ongoing discussion about introducing a simple platform energy cost
model to guide scheduling decisions (a.k.a Energy Aware Scheduling [1]), but
also aims to be an independent track to standardise the way we make the
scheduler aware of heterogeneous CPU systems. With these patches, plus the
patches from [1] (that make the scheduler wakeup paths aware of heterogeneous
CPU systems), we enable the scheduler to have good default performance on such
systems. In addition, we get a clearly defined way of providing the scheduler
with needed information about CPU capacity on such systems.

CPU capacity is defined in this context as a number that provides the scheduler
information about CPUs heterogeneity. Such heterogeneity can come from
micro-architectural differences (e.g., ARM big.LITTLE systems) or maximum
frequency at which CPUs can run (e.g., SMP systems with multiple frequency
domains and different max frequencies). Heterogeneity in this context is about
differing performance characteristics; in practice, the binding that we propose
in this RFC tries to capture a first-order approximation of the relative
performance of CPUs.
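
As the binding documentation in patch 02/08 suggests, one simple way to derive
such numbers is to run a known CPU benchmark on each CPU at maximum frequency
and normalize the scores against the best performing CPU. Purely as an
illustration (this helper is not part of the series), the arithmetic boils
down to something like:

/*
 * Illustrative only: turn a per-CPU benchmark score, measured at maximum
 * frequency, into a capacity relative to the best performing CPU.
 */
static unsigned long long score_to_capacity(unsigned long long score,
					    unsigned long long best_score,
					    unsigned long long capacity_scale)
{
	/* e.g., on TC2 with a 1024 scale: A15 -> 1024, A7 -> ~430 */
	return (score * capacity_scale) / best_score;
}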

This RFC proposes a solution to the problem of how we initialize CPUs' original
capacity. The way it works today, and for arm A15/A7 systems only, is that we
rely on cpu_efficiency magic numbers from arch/arm/kernel/topology.c and the
existence of clock-frequency dtb properties; having those values available, we
then do some math to come up with capacities that match values we know from
measurement (e.g., the EAS energy model); for TC2 they are 430 for A7 and 1024
for A15.
Currently, arm64 doesn't have such a feature at all.
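
For reference, a rough sketch of that existing arm-only fallback (paraphrased
from arch/arm/kernel/topology.c; this is not new code in this series):

/*
 * Paraphrased: a per-core cpu_efficiency constant (3891 for A15, 2048 for
 * A7 in mainline) is combined with the DT clock-frequency property; the
 * raw result is later divided by a reference "middle" capacity (see
 * update_cpu_capacity() in patch 03/08) so that values end up in the
 * scheduler's SCHED_CAPACITY_SCALE range.
 */
static unsigned long efficiency_based_capacity(unsigned int clock_frequency,
					       unsigned int efficiency,
					       unsigned long middle_capacity)
{
	unsigned long raw = (clock_frequency >> 20) * efficiency;

	return raw / middle_capacity;
}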

With this patchset we provide CPUs capacity information either from DT or from
sysfs interface. Such information is standardized for both arm and arm64.

Patches high level description:

o 01/08 cleans up how cpu_scale is initialized in arm
o 02/08 introduces documentation for the new optional DT binding
o [03-06]/08 add cpu-capacity attribute to TC2 and Juno DTs and provide
parsing of such information at boot time
o [07-08]/08 introduce sysfs attribute

The patchset is based on top of tip/sched/core as of today.

In case you would like to test this out, I pushed a branch here:

git://linux-arm.org/linux-jl.git upstream/default_caps_dt

This branch contains additional patches, useful to better understand how CPU
capacity information is actually used by the scheduler. Discussion regarding
these additional patches will be started with a different posting in the
future. We just didn't want to make discussion too broad, as we realize that
this set can be controversial already on its own.

Comments, concerns and rants are more than welcome!

Best,

- Juri

Juri Lelli (8):
ARM: initialize cpu_scale to its default
Documentation: arm: define DT cpu capacity bindings
arm: parse cpu capacity from DT
arm, dts: add TC2 cpu capacity information
arm64: parse cpu capacity from DT
arm64, dts: add Juno cpu capacity information
arm: add sysfs cpu_capacity attribute
arm64: add sysfs cpu_capacity attribute

.../devicetree/bindings/arm/cpu-capacity.txt | 227 +++++++++++++++++++++
Documentation/devicetree/bindings/arm/cpus.txt | 17 ++
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 6 +
arch/arm/kernel/topology.c | 122 ++++++++++-
arch/arm64/boot/dts/arm/juno.dts | 7 +
arch/arm64/kernel/topology.c | 114 +++++++++++
6 files changed, 489 insertions(+), 4 deletions(-)
create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt

--
2.2.2


2015-11-23 14:28:54

by Juri Lelli

Subject: [RFC PATCH 1/8] ARM: initialize cpu_scale to its default

Instead of looping through all cpus calling set_capacity_scale, we can
initialise cpu_scale per-cpu variables to SCHED_CAPACITY_SCALE with their
definition.

Cc: Russell King <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
arch/arm/kernel/topology.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..ec279d1 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -40,7 +40,7 @@
* to run the rebalance_domains for all idle cores and the cpu_capacity can be
* updated during this sequence.
*/
-static DEFINE_PER_CPU(unsigned long, cpu_scale);
+static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;

unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
@@ -306,8 +306,6 @@ void __init init_cpu_topology(void)
cpu_topo->socket_id = -1;
cpumask_clear(&cpu_topo->core_sibling);
cpumask_clear(&cpu_topo->thread_sibling);
-
- set_capacity_scale(cpu, SCHED_CAPACITY_SCALE);
}
smp_wmb();

--
2.2.2

2015-11-23 14:29:02

by Juri Lelli

Subject: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

ARM systems may be configured to have cpus with different power/performance
characteristics within the same chip. In this case, additional information
has to be made available to the kernel (the scheduler in particular) for it
to be aware of such differences and take decisions accordingly.

Therefore, this patch aims at standardizing cpu capacities device tree
bindings for ARM platforms. Bindings define cpu capacity parameter, to
allow operating systems to retrieve such information from the device tree
and initialize related kernel structures, paving the way for common code in
the kernel to deal with heterogeneity.

Cc: Rob Herring <[email protected]>
Cc: Pawel Moll <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Ian Campbell <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Olof Johansson <[email protected]>
Cc: Gregory CLEMENT <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: Chen-Yu Tsai <[email protected]>
Cc: Thomas Petazzoni <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
---
.../devicetree/bindings/arm/cpu-capacity.txt | 227 +++++++++++++++++++++
Documentation/devicetree/bindings/arm/cpus.txt | 17 ++
2 files changed, 244 insertions(+)
create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt

diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
new file mode 100644
index 0000000..2a00af0
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
@@ -0,0 +1,227 @@
+==========================================
+ARM CPUs capacity bindings
+==========================================
+
+==========================================
+1 - Introduction
+==========================================
+
+ARM systems may be configured to have cpus with different power/performance
+characteristics within the same chip. In this case, additional information
+has to be made available to the kernel (the scheduler in particular) for
+it to be aware of such differences and take decisions accordingly.
+
+==========================================
+2 - CPU capacity definition
+==========================================
+
+CPU capacity is a number that provides the scheduler information about CPUs
+heterogeneity. Such heterogeneity can come from micro-architectural differences
+(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
+(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
+context is about differing performance characteristics; this binding tries to
+capture a first-order approximation of the relative performance of CPUs.
+
+One simple way to estimate CPU capacities is to iteratively run a well-known
+CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
+maximum frequency and then normalize values w.r.t. the best performing CPU.
+One can also do a statistically significant study of a wide collection of
+benchmarks, but pros of such an approach are not really evident at the time of
+writing.
+
+==========================================
+3 - capacity-scale
+==========================================
+
+CPUs capacities are defined with respect to capacity-scale property in the cpus
+node [1]. The property is optional; if not defined a 1024 capacity-scale is
+assumed. This property defines both the highest CPU capacity present in the
+system and granularity of CPU capacity values.
+
+==========================================
+4 - capacity
+==========================================
+
+capacity is an optional cpu node [1] property: u32 value representing CPU
+capacity, relative to capacity-scale. It is required and enforced that capacity
+<= capacity-scale.
+
+===========================================
+5 - Examples
+===========================================
+
+Example 1 (ARM 64-bit, 6-cpu system, two clusters):
+capacity-scale is not defined, so it is assumed to be 1024
+
+cpus {
+ #address-cells = <2>;
+ #size-cells = <0>;
+
+ cpu-map {
+ cluster0 {
+ core0 {
+ cpu = <&A57_0>;
+ };
+ core1 {
+ cpu = <&A57_1>;
+ };
+ };
+
+ cluster1 {
+ core0 {
+ cpu = <&A53_0>;
+ };
+ core1 {
+ cpu = <&A53_1>;
+ };
+ core2 {
+ cpu = <&A53_2>;
+ };
+ core3 {
+ cpu = <&A53_3>;
+ };
+ };
+ };
+
+ idle-states {
+ entry-method = "arm,psci";
+
+ CPU_SLEEP_0: cpu-sleep-0 {
+ compatible = "arm,idle-state";
+ arm,psci-suspend-param = <0x0010000>;
+ local-timer-stop;
+ entry-latency-us = <100>;
+ exit-latency-us = <250>;
+ min-residency-us = <150>;
+ };
+
+ CLUSTER_SLEEP_0: cluster-sleep-0 {
+ compatible = "arm,idle-state";
+ arm,psci-suspend-param = <0x1010000>;
+ local-timer-stop;
+ entry-latency-us = <800>;
+ exit-latency-us = <700>;
+ min-residency-us = <2500>;
+ };
+ };
+
+ A57_0: cpu@0 {
+ compatible = "arm,cortex-a57","arm,armv8";
+ reg = <0x0 0x0>;
+ device_type = "cpu";
+ enable-method = "psci";
+ next-level-cache = <&A57_L2>;
+ clocks = <&scpi_dvfs 0>;
+ cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+ capacity = <1024>;
+ };
+
+ A57_1: cpu@1 {
+ compatible = "arm,cortex-a57","arm,armv8";
+ reg = <0x0 0x1>;
+ device_type = "cpu";
+ enable-method = "psci";
+ next-level-cache = <&A57_L2>;
+ clocks = <&scpi_dvfs 0>;
+ cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+ capacity = <1024>;
+ };
+
+ A53_0: cpu@100 {
+ compatible = "arm,cortex-a53","arm,armv8";
+ reg = <0x0 0x100>;
+ device_type = "cpu";
+ enable-method = "psci";
+ next-level-cache = <&A53_L2>;
+ clocks = <&scpi_dvfs 1>;
+ cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+ capacity = <447>;
+ };
+
+ A53_1: cpu@101 {
+ compatible = "arm,cortex-a53","arm,armv8";
+ reg = <0x0 0x101>;
+ device_type = "cpu";
+ enable-method = "psci";
+ next-level-cache = <&A53_L2>;
+ clocks = <&scpi_dvfs 1>;
+ cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+ capacity = <447>;
+ };
+
+ A53_2: cpu@102 {
+ compatible = "arm,cortex-a53","arm,armv8";
+ reg = <0x0 0x102>;
+ device_type = "cpu";
+ enable-method = "psci";
+ next-level-cache = <&A53_L2>;
+ clocks = <&scpi_dvfs 1>;
+ cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+ capacity = <447>;
+ };
+
+ A53_3: cpu@103 {
+ compatible = "arm,cortex-a53","arm,armv8";
+ reg = <0x0 0x103>;
+ device_type = "cpu";
+ enable-method = "psci";
+ next-level-cache = <&A53_L2>;
+ clocks = <&scpi_dvfs 1>;
+ cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+ capacity = <447>;
+ };
+
+ A57_L2: l2-cache0 {
+ compatible = "cache";
+ };
+
+ A53_L2: l2-cache1 {
+ compatible = "cache";
+ };
+};
+
+Example 2 (ARM 32-bit, 4-cpu system, two clusters,
+ cpus 0,1@1GHz, cpus 2,3@500MHz):
+capacity-scale is equal to 2, so the first cluster is twice as fast as the
+second cluster (which matches the clock frequencies)
+
+cpus {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ capacity-scale = <2>;
+
+ cpu0: cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ reg = <0>;
+ capacity = <2>;
+ };
+
+ cpu1: cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ reg = <1>;
+ capacity = <2>;
+ };
+
+ cpu2: cpu@100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ reg = <0x100>;
+ capacity = <1>;
+ };
+
+ cpu3: cpu@101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ reg = <0x101>;
+ capacity = <1>;
+ };
+};
+
+===========================================
+6 - References
+===========================================
+
+[1] ARM Linux Kernel documentation - CPUs bindings
+ Documentation/devicetree/bindings/arm/cpus.txt
diff --git a/Documentation/devicetree/bindings/arm/cpus.txt b/Documentation/devicetree/bindings/arm/cpus.txt
index 91e6e5c..7593584 100644
--- a/Documentation/devicetree/bindings/arm/cpus.txt
+++ b/Documentation/devicetree/bindings/arm/cpus.txt
@@ -62,6 +62,14 @@ nodes to be present and contain the properties described below.
Value type: <u32>
Definition: must be set to 0

+ A cpus node may also define the following optional property:
+
+ - capacity-scale
+ Usage: optional
+ Value type: <u32>
+ Definition: value used as a reference for CPU capacity [4]
+ (see below).
+
- cpu node

Description: Describes a CPU in an ARM based system
@@ -231,6 +239,13 @@ nodes to be present and contain the properties described below.
# List of phandles to idle state nodes supported
by this cpu [3].

+ - capacity
+ Usage: Optional
+ Value type: <u32>
+ Definition:
+ # u32 value representing CPU capacity [4], relative to
+ capacity-scale (see above).
+
- rockchip,pmu
Usage: optional for systems that have an "enable-method"
property value of "rockchip,rk3066-smp"
@@ -437,3 +452,5 @@ cpus {
[2] arm/msm/qcom,kpss-acc.txt
[3] ARM Linux kernel documentation - idle states bindings
Documentation/devicetree/bindings/arm/idle-states.txt
+[4] ARM Linux kernel documentation - cpu capacity bindings
+ Documentation/devicetree/bindings/arm/cpu-capacity.txt
--
2.2.2

2015-11-23 14:31:22

by Juri Lelli

Subject: [RFC PATCH 3/8] arm: parse cpu capacity from DT

With the introduction of cpu capacity bindings, CPU capacities can now be
extracted from DT. Add parsing of such information at boot time. We keep
code that can produce the same information, based on different DT properties
and hard-coded values, as a fallback for backward compatibility.

Cc: Russell King <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
arch/arm/kernel/topology.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index ec279d1..ecbff03 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -78,6 +78,35 @@ static unsigned long *__cpu_capacity;
#define cpu_capacity(cpu) __cpu_capacity[cpu]

static unsigned long middle_capacity = 1;
+static bool capacity_from_dt = true;
+static u32 capacity_scale = SCHED_CAPACITY_SCALE;
+
+static int __init parse_cpu_capacity(struct device_node *cpu_node, int cpu)
+{
+ int ret = 1;
+ u32 cpu_capacity;
+
+ ret = of_property_read_u32(cpu_node,
+ "capacity",
+ &cpu_capacity);
+ if (!ret) {
+ u64 capacity;
+
+ /*
+ * Enforce capacity <= capacity-scale.
+ */
+ cpu_capacity = cpu_capacity <= capacity_scale ? cpu_capacity :
+ capacity_scale;
+ capacity = (cpu_capacity << SCHED_CAPACITY_SHIFT) /
+ capacity_scale;
+
+ set_capacity_scale(cpu, capacity);
+ pr_info("CPU%d: DT cpu capacity %lu\n",
+ cpu, arch_scale_cpu_capacity(NULL, cpu));
+ }
+
+ return !ret;
+}

/*
* Iterate all CPUs' descriptor in DT and compute the efficiency
@@ -99,6 +128,18 @@ static void __init parse_dt_topology(void)
__cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity),
GFP_NOWAIT);

+ cn = of_find_node_by_path("/cpus");
+ if (!cn) {
+ pr_err("No CPU information found in DT\n");
+ return;
+ }
+
+ if (!of_property_read_u32(cn, "capacity-scale", &capacity_scale))
+ pr_info("DT cpus capacity-scale %u\n", capacity_scale);
+ else
+ pr_debug("DT cpus capacity-scale not found: assuming %u\n",
+ capacity_scale);
+
for_each_possible_cpu(cpu) {
const u32 *rate;
int len;
@@ -110,6 +151,13 @@ static void __init parse_dt_topology(void)
continue;
}

+ if (parse_cpu_capacity(cn, cpu)) {
+ of_node_put(cn);
+ continue;
+ }
+
+ capacity_from_dt = false;
+
for (cpu_eff = table_efficiency; cpu_eff->compatible; cpu_eff++)
if (of_device_is_compatible(cn, cpu_eff->compatible))
break;
@@ -160,7 +208,7 @@ static void __init parse_dt_topology(void)
*/
static void update_cpu_capacity(unsigned int cpu)
{
- if (!cpu_capacity(cpu))
+ if (!cpu_capacity(cpu) || capacity_from_dt)
return;

set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
--
2.2.2

2015-11-23 14:29:06

by Juri Lelli

Subject: [RFC PATCH 4/8] arm, dts: add TC2 cpu capacity information

Add TC2 cpu capacity binding information.

Cc: Liviu Dudau <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Lorenzo Pieralisi <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Pawel Moll <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Ian Campbell <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Russell King <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
---
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index 17f63f7..4596481 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -32,6 +32,7 @@
cpus {
#address-cells = <1>;
#size-cells = <0>;
+ capacity-scale = <1024>;

cpu0: cpu@0 {
device_type = "cpu";
@@ -39,6 +40,7 @@
reg = <0>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ capacity = <1024>;
};

cpu1: cpu@1 {
@@ -47,6 +49,7 @@
reg = <1>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ capacity = <1024>;
};

cpu2: cpu@2 {
@@ -55,6 +58,7 @@
reg = <0x100>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ capacity = <430>;
};

cpu3: cpu@3 {
@@ -63,6 +67,7 @@
reg = <0x101>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ capacity = <430>;
};

cpu4: cpu@4 {
@@ -71,6 +76,7 @@
reg = <0x102>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ capacity = <430>;
};

idle-states {
--
2.2.2

2015-11-23 14:29:20

by Juri Lelli

Subject: [RFC PATCH 5/8] arm64: parse cpu capacity from DT

With the introduction of cpu capacity bindings, CPU capacities can now be
extracted from DT. Add parsing of such information at boot time. Also,
store such information using per-CPU variables, as we do for arm.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
arch/arm64/kernel/topology.c | 46 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 694f6de..4423cc5 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -23,6 +23,45 @@
#include <asm/cputype.h>
#include <asm/topology.h>

+static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
+
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+ return per_cpu(cpu_scale, cpu);
+}
+
+static void set_capacity_scale(unsigned int cpu, unsigned long capacity)
+{
+ per_cpu(cpu_scale, cpu) = capacity;
+}
+
+static u32 capacity_scale = SCHED_CAPACITY_SCALE;
+
+static void __init parse_cpu_capacity(struct device_node *cpu_node, int cpu)
+{
+ int ret;
+ u32 cpu_capacity;
+
+ ret = of_property_read_u32(cpu_node,
+ "capacity",
+ &cpu_capacity);
+ if (!ret) {
+ u64 capacity;
+
+ /*
+ * Enforce capacity <= capacity-scale.
+ */
+ cpu_capacity = cpu_capacity <= capacity_scale ? cpu_capacity :
+ capacity_scale;
+ capacity = (cpu_capacity << SCHED_CAPACITY_SHIFT) /
+ capacity_scale;
+
+ set_capacity_scale(cpu, capacity);
+ pr_info("CPU%d: DT cpu_capacity %lu\n",
+ cpu, arch_scale_cpu_capacity(NULL, cpu));
+ }
+}
+
static int __init get_cpu_for_node(struct device_node *node)
{
struct device_node *cpu_node;
@@ -34,6 +73,7 @@ static int __init get_cpu_for_node(struct device_node *node)

for_each_possible_cpu(cpu) {
if (of_get_cpu_node(cpu, NULL) == cpu_node) {
+ parse_cpu_capacity(cpu_node, cpu);
of_node_put(cpu_node);
return cpu;
}
@@ -173,6 +213,12 @@ static int __init parse_dt_topology(void)
return 0;
}

+ if (!of_property_read_u32(cn, "capacity-scale", &capacity_scale))
+ pr_info("DT cpus capacity-scale %u\n", capacity_scale);
+ else
+ pr_debug("DT cpus capacity-scale not found: assuming %u\n",
+ capacity_scale);
+
/*
* When topology is provided cpu-map is essentially a root
* cluster with restricted subnodes.
--
2.2.2

2015-11-23 14:29:18

by Juri Lelli

Subject: [RFC PATCH 6/8] arm64, dts: add Juno cpu capacity information

Add Juno cpu capacity bindings information.

Cc: Rob Herring <[email protected]>
Cc: Pawel Moll <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Ian Campbell <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Liviu Dudau <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jon Medhurst <[email protected]>
Cc: Olof Johansson <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
---
arch/arm64/boot/dts/arm/juno.dts | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/arm64/boot/dts/arm/juno.dts b/arch/arm64/boot/dts/arm/juno.dts
index d7cbdd4..06f6d2b 100644
--- a/arch/arm64/boot/dts/arm/juno.dts
+++ b/arch/arm64/boot/dts/arm/juno.dts
@@ -33,6 +33,7 @@
cpus {
#address-cells = <2>;
#size-cells = <0>;
+ capacity-scale = <1024>;

A57_0: cpu@0 {
compatible = "arm,cortex-a57","arm,armv8";
@@ -40,6 +41,7 @@
device_type = "cpu";
enable-method = "psci";
next-level-cache = <&A57_L2>;
+ capacity = <1024>;
};

A57_1: cpu@1 {
@@ -48,6 +50,7 @@
device_type = "cpu";
enable-method = "psci";
next-level-cache = <&A57_L2>;
+ capacity = <1024>;
};

A53_0: cpu@100 {
@@ -56,6 +59,7 @@
device_type = "cpu";
enable-method = "psci";
next-level-cache = <&A53_L2>;
+ capacity = <447>;
};

A53_1: cpu@101 {
@@ -64,6 +68,7 @@
device_type = "cpu";
enable-method = "psci";
next-level-cache = <&A53_L2>;
+ capacity = <447>;
};

A53_2: cpu@102 {
@@ -72,6 +77,7 @@
device_type = "cpu";
enable-method = "psci";
next-level-cache = <&A53_L2>;
+ capacity = <447>;
};

A53_3: cpu@103 {
@@ -80,6 +86,7 @@
device_type = "cpu";
enable-method = "psci";
next-level-cache = <&A53_L2>;
+ capacity = <447>;
};

A57_L2: l2-cache0 {
--
2.2.2

2015-11-23 14:29:16

by Juri Lelli

Subject: [RFC PATCH 7/8] arm: add sysfs cpu_capacity attribute

Add a sysfs cpu_capacity attribute with which it is possible to read and
write (thus over-writing the default values) CPU capacities. This might be
useful in situations where there is no way to get proper default values
at boot time.

The new attribute shows up as:

/sys/devices/system/cpu/cpu*/cpu_capacity

Cc: Russell King <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
arch/arm/kernel/topology.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index ecbff03..a226761 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -52,6 +52,74 @@ static void set_capacity_scale(unsigned int cpu, unsigned long capacity)
per_cpu(cpu_scale, cpu) = capacity;
}

+#ifdef CONFIG_PROC_SYSCTL
+#include <asm/cpu.h>
+#include <linux/string.h>
+static ssize_t show_cpu_capacity(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct cpu *cpu = container_of(dev, struct cpu, dev);
+ ssize_t rc;
+ int cpunum = cpu->dev.id;
+ unsigned long capacity = arch_scale_cpu_capacity(NULL, cpunum);
+
+ rc = sprintf(buf, "%lu\n", capacity);
+
+ return rc;
+}
+
+static ssize_t store_cpu_capacity(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf,
+ size_t count)
+{
+ struct cpu *cpu = container_of(dev, struct cpu, dev);
+ int this_cpu = cpu->dev.id, i;
+ unsigned long new_capacity;
+ ssize_t ret;
+
+ if (count) {
+ char *p = (char *) buf;
+
+ ret = kstrtoul(p, 0, &new_capacity);
+ if (ret)
+ return ret;
+ if (new_capacity > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ for_each_cpu(i, &cpu_topology[this_cpu].core_sibling)
+ set_capacity_scale(i, new_capacity);
+ }
+
+ return count;
+}
+
+static DEVICE_ATTR(cpu_capacity,
+ 0644,
+ show_cpu_capacity,
+ store_cpu_capacity);
+
+static int register_cpu_capacity_sysctl(void)
+{
+ int i;
+ struct device *cpu;
+
+ for_each_possible_cpu(i) {
+ cpu = get_cpu_device(i);
+ if (!cpu) {
+ pr_err("%s: too early to get CPU%d device!\n",
+ __func__, i);
+ continue;
+ }
+ device_create_file(cpu, &dev_attr_cpu_capacity);
+ }
+
+ return 0;
+}
+late_initcall(register_cpu_capacity_sysctl);
+#endif
+
#ifdef CONFIG_OF
struct cpu_efficiency {
const char *compatible;
--
2.2.2

2015-11-23 14:30:06

by Juri Lelli

Subject: [RFC PATCH 8/8] arm64: add sysfs cpu_capacity attribute

Add a sysfs cpu_capacity attribute with which it is possible to read and
write (thus over-writing the default values) CPU capacities. This might be
useful in situations where there is no way to get proper default values
at boot time.

The new attribute shows up as:

/sys/devices/system/cpu/cpu*/cpu_capacity

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Sudeep Holla <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
arch/arm64/kernel/topology.c | 68 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 4423cc5..5c9e477 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -35,6 +35,74 @@ static void set_capacity_scale(unsigned int cpu, unsigned long capacity)
per_cpu(cpu_scale, cpu) = capacity;
}

+#ifdef CONFIG_PROC_SYSCTL
+#include <asm/cpu.h>
+#include <linux/string.h>
+static ssize_t show_cpu_capacity(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct cpu *cpu = container_of(dev, struct cpu, dev);
+ ssize_t rc;
+ int cpunum = cpu->dev.id;
+ unsigned long capacity = arch_scale_cpu_capacity(NULL, cpunum);
+
+ rc = sprintf(buf, "%lu\n", capacity);
+
+ return rc;
+}
+
+static ssize_t store_cpu_capacity(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf,
+ size_t count)
+{
+ struct cpu *cpu = container_of(dev, struct cpu, dev);
+ int this_cpu = cpu->dev.id, i;
+ unsigned long new_capacity;
+ ssize_t ret;
+
+ if (count) {
+ char *p = (char *) buf;
+
+ ret = kstrtoul(p, 0, &new_capacity);
+ if (ret)
+ return ret;
+ if (new_capacity > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ for_each_cpu(i, &cpu_topology[this_cpu].core_sibling)
+ set_capacity_scale(i, new_capacity);
+ }
+
+ return count;
+}
+
+static DEVICE_ATTR(cpu_capacity,
+ 0644,
+ show_cpu_capacity,
+ store_cpu_capacity);
+
+static int register_cpu_capacity_sysctl(void)
+{
+ int i;
+ struct device *cpu;
+
+ for_each_possible_cpu(i) {
+ cpu = get_cpu_device(i);
+ if (!cpu) {
+ pr_err("%s: too early to get CPU%d device!\n",
+ __func__, i);
+ continue;
+ }
+ device_create_file(cpu, &dev_attr_cpu_capacity);
+ }
+
+ return 0;
+}
+late_initcall(register_cpu_capacity_sysctl);
+#endif
+
static u32 capacity_scale = SCHED_CAPACITY_SCALE;

static void __init parse_cpu_capacity(struct device_node *cpu_node, int cpu)
--
2.2.2

2015-11-24 02:06:40

by Rob Herring (Arm)

Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote:
> ARM systems may be configured to have cpus with different power/performance
> characteristics within the same chip. In this case, additional information
> has to be made available to the kernel (the scheduler in particular) for it
> to be aware of such differences and take decisions accordingly.
>
> Therefore, this patch aims at standardizing cpu capacities device tree
> bindings for ARM platforms. Bindings define cpu capacity parameter, to
> allow operating systems to retrieve such information from the device tree
> and initialize related kernel structures, paving the way for common code in
> the kernel to deal with heterogeneity.
>
> Cc: Rob Herring <[email protected]>
> Cc: Pawel Moll <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Ian Campbell <[email protected]>
> Cc: Kumar Gala <[email protected]>
> Cc: Maxime Ripard <[email protected]>
> Cc: Olof Johansson <[email protected]>
> Cc: Gregory CLEMENT <[email protected]>
> Cc: Paul Walmsley <[email protected]>
> Cc: Linus Walleij <[email protected]>
> Cc: Chen-Yu Tsai <[email protected]>
> Cc: Thomas Petazzoni <[email protected]>
> Cc: [email protected]
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> .../devicetree/bindings/arm/cpu-capacity.txt | 227 +++++++++++++++++++++
> Documentation/devicetree/bindings/arm/cpus.txt | 17 ++
> 2 files changed, 244 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt
>
> diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
> new file mode 100644
> index 0000000..2a00af0
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
> @@ -0,0 +1,227 @@
> +==========================================
> +ARM CPUs capacity bindings
> +==========================================
> +
> +==========================================
> +1 - Introduction
> +==========================================
> +
> +ARM systems may be configured to have cpus with different power/performance
> +characteristics within the same chip. In this case, additional information
> +has to be made available to the kernel (the scheduler in particular) for
> +it to be aware of such differences and take decisions accordingly.
> +
> +==========================================
> +2 - CPU capacity definition
> +==========================================
> +
> +CPU capacity is a number that provides the scheduler information about CPUs
> +heterogeneity. Such heterogeneity can come from micro-architectural differences
> +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
> +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
> +context is about differing performance characteristics; this binding tries to
> +capture a first-order approximation of the relative performance of CPUs.
> +
> +One simple way to estimate CPU capacities is to iteratively run a well-known
> +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
> +maximum frequency and then normalize values w.r.t. the best performing CPU.
> +One can also do a statistically significant study of a wide collection of
> +benchmarks, but pros of such an approach are not really evident at the time of
> +writing.
> +
> +==========================================
> +3 - capacity-scale
> +==========================================
> +
> +CPUs capacities are defined with respect to capacity-scale property in the cpus
> +node [1]. The property is optional; if not defined a 1024 capacity-scale is
> +assumed. This property defines both the highest CPU capacity present in the
> +system and granularity of CPU capacity values.

I don't really see the point of this vs. having an absolute scale.

> +
> +==========================================
> +4 - capacity
> +==========================================
> +
> +capacity is an optional cpu node [1] property: u32 value representing CPU
> +capacity, relative to capacity-scale. It is required and enforced that capacity
> +<= capacity-scale.

I think you need something absolute and probably per MHz (like
dynamic-power-coefficient property). Perhaps the IPC (instructions per
clock) value?

In other words, I want to see these numbers have a defined method
of determining them and don't want to see random values from every
vendor. ARM, Ltd. says core X has a value of Y would be good enough for
me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
value of 1024 is not what I want to see. Of course things like cache
sizes can vary the performance, but is a baseline value good enough?

However, no vendor will want to publish their values if these are
absolute values relative to other vendors.

If you expect these to need frequent tuning, then don't put them in DT.

Rob

2015-11-24 10:54:10

by Juri Lelli

Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

Hi,

On 23/11/15 20:06, Rob Herring wrote:
> On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote:
> > ARM systems may be configured to have cpus with different power/performance
> > characteristics within the same chip. In this case, additional information
> > has to be made available to the kernel (the scheduler in particular) for it
> > to be aware of such differences and take decisions accordingly.
> >
> > Therefore, this patch aims at standardizing cpu capacities device tree
> > bindings for ARM platforms. Bindings define cpu capacity parameter, to
> > allow operating systems to retrieve such information from the device tree
> > and initialize related kernel structures, paving the way for common code in
> > the kernel to deal with heterogeneity.
> >
> > Cc: Rob Herring <[email protected]>
> > Cc: Pawel Moll <[email protected]>
> > Cc: Mark Rutland <[email protected]>
> > Cc: Ian Campbell <[email protected]>
> > Cc: Kumar Gala <[email protected]>
> > Cc: Maxime Ripard <[email protected]>
> > Cc: Olof Johansson <[email protected]>
> > Cc: Gregory CLEMENT <[email protected]>
> > Cc: Paul Walmsley <[email protected]>
> > Cc: Linus Walleij <[email protected]>
> > Cc: Chen-Yu Tsai <[email protected]>
> > Cc: Thomas Petazzoni <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Juri Lelli <[email protected]>
> > ---
> > .../devicetree/bindings/arm/cpu-capacity.txt | 227 +++++++++++++++++++++
> > Documentation/devicetree/bindings/arm/cpus.txt | 17 ++
> > 2 files changed, 244 insertions(+)
> > create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt
> >
> > diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
> > new file mode 100644
> > index 0000000..2a00af0
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
> > @@ -0,0 +1,227 @@
> > +==========================================
> > +ARM CPUs capacity bindings
> > +==========================================
> > +
> > +==========================================
> > +1 - Introduction
> > +==========================================
> > +
> > +ARM systems may be configured to have cpus with different power/performance
> > +characteristics within the same chip. In this case, additional information
> > +has to be made available to the kernel (the scheduler in particular) for
> > +it to be aware of such differences and take decisions accordingly.
> > +
> > +==========================================
> > +2 - CPU capacity definition
> > +==========================================
> > +
> > +CPU capacity is a number that provides the scheduler information about CPUs
> > +heterogeneity. Such heterogeneity can come from micro-architectural differences
> > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
> > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
> > +context is about differing performance characteristics; this binding tries to
> > +capture a first-order approximation of the relative performance of CPUs.
> > +
> > +One simple way to estimate CPU capacities is to iteratively run a well-known
> > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
> > +maximum frequency and then normalize values w.r.t. the best performing CPU.
> > +One can also do a statistically significant study of a wide collection of
> > +benchmarks, but pros of such an approach are not really evident at the time of
> > +writing.
> > +
> > +==========================================
> > +3 - capacity-scale
> > +==========================================
> > +
> > +CPUs capacities are defined with respect to capacity-scale property in the cpus
> > +node [1]. The property is optional; if not defined a 1024 capacity-scale is
> > +assumed. This property defines both the highest CPU capacity present in the
> > +system and granularity of CPU capacity values.
>
> I don't really see the point of this vs. having an absolute scale.
>

IMHO, we need this for several reasons, one being to address one of your
concerns below: vendors are free to choose their scale without being
forced to publish absolute data. Another reason is that it might make
life easier in certain cases; for example, someone could implement a
system with a few clusters of, say, A57s, but some run at half the clock
of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
this case I think it is just easier to define capacity-scale as 1200 and
capacities as 1200 and 600. The last reason that I can think of right now is
that we probably don't want to bind ourselves to some particular range
from the beginning, as that range might be enough now, but it could
change in the future (as in, right now [1-1024] looks fine for
scheduling purposes, but that might change).

> > +
> > +==========================================
> > +4 - capacity
> > +==========================================
> > +
> > +capacity is an optional cpu node [1] property: u32 value representing CPU
> > +capacity, relative to capacity-scale. It is required and enforced that capacity
> > +<= capacity-scale.
>
> I think you need something absolute and probably per MHz (like
> dynamic-power-coefficient property). Perhaps the IPC (instructions per
> clock) value?
>
> In other words, I want to see these numbers have a defined method
> of determining them and don't want to see random values from every
> vendor. ARM, Ltd. says core X has a value of Y would be good enough for
> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
> value of 1024 is not what I want to see. Of course things like cache
> sizes can vary the performance, but is a baseline value good enough?
>

A standard reference baseline is what we advocate with this set, but
making this baseline work for every vendor's implementation is hardly
achievable, IMHO. I don't think we can come up with any number that
applies to each and every implementation; you can have different
revisions of the same core and vendors might make implementation choices
that end up with different peak performance.

> However, no vendor will want to publish their values if these are
> absolute values relative to other vendors.
>

Right. That is why I think we need to abstract numbers, as we do with
capacity-scale.

> If you expect these to need frequent tuning, then don't put them in DT.
>

I expect that it is possible to come up with a sensible baseline number
for a specific platform implementation, so there is value in
standardizing how we specify this value and how it is then consumed.
Finer grained tuning might then happen both offline (with changes to the
mainline DT) and online (using the sysfs interface), but that should
only apply to a narrow set of use cases.

Thanks,

- Juri

2015-11-30 09:59:28

by Vincent Guittot

Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

Hi Juri,

On 24 November 2015 at 11:54, Juri Lelli <[email protected]> wrote:
> Hi,
>
> On 23/11/15 20:06, Rob Herring wrote:
>> On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote:
>> > ARM systems may be configured to have cpus with different power/performance
>> > characteristics within the same chip. In this case, additional information
>> > has to be made available to the kernel (the scheduler in particular) for it
>> > to be aware of such differences and take decisions accordingly.
>> >

[snip]

>> > +==========================================
>> > +2 - CPU capacity definition
>> > +==========================================
>> > +
>> > +CPU capacity is a number that provides the scheduler information about CPUs
>> > +heterogeneity. Such heterogeneity can come from micro-architectural differences
>> > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
>> > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
>> > +context is about differing performance characteristics; this binding tries to
>> > +capture a first-order approximation of the relative performance of CPUs.
>> > +
>> > +One simple way to estimate CPU capacities is to iteratively run a well-known
>> > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
>> > +maximum frequency and then normalize values w.r.t. the best performing CPU.
>> > +One can also do a statistically significant study of a wide collection of
>> > +benchmarks, but pros of such an approach are not really evident at the time of
>> > +writing.
>> > +
>> > +==========================================
>> > +3 - capacity-scale
>> > +==========================================
>> > +
>> > +CPUs capacities are defined with respect to capacity-scale property in the cpus
>> > +node [1]. The property is optional; if not defined a 1024 capacity-scale is
>> > +assumed. This property defines both the highest CPU capacity present in the
>> > +system and granularity of CPU capacity values.
>>
>> I don't really see the point of this vs. having an absolute scale.
>>
>
> IMHO, we need this for several reasons, one being to address one of your
> concerns below: vendors are free to choose their scale without being
> forced to publish absolute data. Another reason is that it might make
> life easier in certain cases; for example, someone could implement a
> system with a few clusters of, say, A57s, but some run at half the clock
> of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
> this case I think it is just easier to define capacity-scale as 1200 and
> capacities as 1200 and 600. The last reason that I can think of right now is
> that we probably don't want to bind ourselves to some particular range
> from the beginning, as that range might be enough now, but it could
> change in the future (as in, right now [1-1024] looks fine for
> scheduling purposes, but that might change).

Like Rob, I don't really see the benefit of this optional
capacity-scale property. Parsing the capacity of all cpu nodes should
give you a range as well.
IMHO, this property looks like an optimization of the code that will
parse the dt more than a HW description.

>
>> > +
>> > +==========================================
>> > +4 - capacity
>> > +==========================================
>> > +
>> > +capacity is an optional cpu node [1] property: u32 value representing CPU
>> > +capacity, relative to capacity-scale. It is required and enforced that capacity
>> > +<= capacity-scale.
>>
>> I think you need something absolute and probably per MHz (like
>> dynamic-power-coefficient property). Perhaps the IPC (instructions per
>> clock) value?
>>
>> In other words, I want to see these numbers have a defined method
>> of determining them and don't want to see random values from every
>> vendor. ARM, Ltd. says core X has a value of Y would be good enough for
>> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
>> value of 1024 is not what I want to see. Of course things like cache
>> sizes can vary the performance, but is a baseline value good enough?
>>
>
> A standard reference baseline is what we advocate with this set, but
> making this baseline work for every vendor's implementation is hardly
> achievable, IMHO. I don't think we can come up with any number that
> applies to each and every implementation; you can have different
> revisions of the same core and vendors might make implementation choices
> that end up with different peak performance.
>
>> However, no vendor will want to publish their values if these are
>> absolute values relative to other vendors.
>>
>
> Right. That is why I think we need to abstract numbers, as we do with
> capacity-scale.
>
>> If you expect these to need frequent tuning, then don't put them in DT.
>>
>
> I expect that it is possible to come up with a sensible baseline number
> for a specific platform implementation, so there is value in
> standardizing how we specify this value and how it is then consumed.
> Finer grained tuning might then happen both offline (with changes to the
> mainline DT) and online (using the sysfs interface), but that should
> only apply to a narrow set of use cases.
>
> Thanks,
>
> - Juri

2015-11-30 11:13:27

by Vincent Guittot

Subject: Re: [RFC PATCH 1/8] ARM: initialize cpu_scale to its default

On 23 November 2015 at 15:28, Juri Lelli <[email protected]> wrote:
> Instead of looping through all cpus calling set_capacity_scale, we can
> initialise cpu_scale per-cpu variables to SCHED_CAPACITY_SCALE with their
> definition.
>
> Cc: Russell King <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> arch/arm/kernel/topology.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 08b7847..ec279d1 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -40,7 +40,7 @@
> * to run the rebalance_domains for all idle cores and the cpu_capacity can be
> * updated during this sequence.
> */
> -static DEFINE_PER_CPU(unsigned long, cpu_scale);
> +static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
>
> unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> {
> @@ -306,8 +306,6 @@ void __init init_cpu_topology(void)
> cpu_topo->socket_id = -1;
> cpumask_clear(&cpu_topo->core_sibling);
> cpumask_clear(&cpu_topo->thread_sibling);
> -
> - set_capacity_scale(cpu, SCHED_CAPACITY_SCALE);
> }
> smp_wmb();

FWIW, Acked-by: Vincent Guittot <[email protected]>

>
> --
> 2.2.2
>

2015-12-01 11:20:24

by Juri Lelli

Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

Hi Vincent,

On 30/11/15 10:59, Vincent Guittot wrote:
> Hi Juri,
>
> On 24 November 2015 at 11:54, Juri Lelli <[email protected]> wrote:
> > Hi,
> >
> > On 23/11/15 20:06, Rob Herring wrote:
> >> On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote:
> >> > ARM systems may be configured to have cpus with different power/performance
> >> > characteristics within the same chip. In this case, additional information
> >> > has to be made available to the kernel (the scheduler in particular) for it
> >> > to be aware of such differences and take decisions accordingly.
> >> >
>
> [snip]
>
> >> > +==========================================
> >> > +2 - CPU capacity definition
> >> > +==========================================
> >> > +
> >> > +CPU capacity is a number that provides the scheduler information about CPUs
> >> > +heterogeneity. Such heterogeneity can come from micro-architectural differences
> >> > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
> >> > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
> >> > +context is about differing performance characteristics; this binding tries to
> >> > +capture a first-order approximation of the relative performance of CPUs.
> >> > +
> >> > +One simple way to estimate CPU capacities is to iteratively run a well-known
> >> > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
> >> > +maximum frequency and then normalize values w.r.t. the best performing CPU.
> >> > +One can also do a statistically significant study of a wide collection of
> >> > +benchmarks, but pros of such an approach are not really evident at the time of
> >> > +writing.
> >> > +
> >> > +==========================================
> >> > +3 - capacity-scale
> >> > +==========================================
> >> > +
> >> > +CPUs capacities are defined with respect to capacity-scale property in the cpus
> >> > +node [1]. The property is optional; if not defined a 1024 capacity-scale is
> >> > +assumed. This property defines both the highest CPU capacity present in the
> >> > +system and granularity of CPU capacity values.
> >>
> >> I don't really see the point of this vs. having an absolute scale.
> >>
> >
> > IMHO, we need this for several reasons, one being to address one of your
> > concerns below: vendors are free to choose their scale without being
> > forced to publish absolute data. Another reason is that it might make
> > life easier in certain cases; for example, someone could implement a
> > system with a few clusters of, say, A57s, but some run at half the clock
> > of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
> > this case I think it is just easier to define capacity-scale as 1200 and
> > capacities as 1200 and 600. The last reason that I can think of right now is
> > that we probably don't want to bind ourselves to some particular range
> > from the beginning, as that range might be enough now, but it could
> > change in the future (as in, right now [1-1024] looks fine for
> > scheduling purposes, but that might change).
>
> Like Rob, I don't really see the benefit of this optional
> capacity-scale property. Parsing the capacity of all cpu nodes should
> give you a range as well.
> IMHO, this property looks like an optimization of the code that will
> parse the dt more than a HW description.
>

I agree that we can come up with the same information just looking at
the biggest capacity value of all CPUs and treat that value as
capacity-scale. I just thought that making that explicit made things
clearer, as it might not be easy to immediately see, from a DT with many
CPUs, which is the biggest capacity value. But, yes, we could remove that
anyway.

Thanks,

- Juri

> >
> >> > +
> >> > +==========================================
> >> > +4 - capacity
> >> > +==========================================
> >> > +
> >> > +capacity is an optional cpu node [1] property: u32 value representing CPU
> >> > +capacity, relative to capacity-scale. It is required and enforced that capacity
> >> > +<= capacity-scale.
> >>
> >> I think you need something absolute and probably per MHz (like
> >> dynamic-power-coefficient property). Perhaps the IPC (instructions per
> >> clock) value?
> >>
> >> In other words, I want to see these numbers have a defined method
> >> of determining them and don't want to see random values from every
> >> vendor. ARM, Ltd. says core X has a value of Y would be good enough for
> >> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
> >> value of 1024 is not what I want to see. Of course things like cache
> >> sizes can vary the performance, but is a baseline value good enough?
> >>
> >
> > A standard reference baseline is what we advocate with this set, but
> > making this baseline work for every vendor's implementation is hardly
> > achievable, IMHO. I don't think we can come up with any number that
> > applies to each and every implementation; you can have different
> > revisions of the same core and vendors might make implementation choices
> > that end up with different peak performance.
> >
> >> However, no vendor will want to publish their values if these are
> >> absolute values relative to other vendors.
> >>
> >
> > Right. That is why I think we need to abstract numbers, as we do with
> > capacity-scale.
> >
> >> If you expect these to need frequent tuning, then don't put them in DT.
> >>
> >
> > I expect that it is possible to come up with a sensible baseline number
> > for a specific platform implementation, so there is value in
> > standardizing how we specify this value and how it is then consumed.
> > Finer grained tuning might then happen both offline (with changes to the
> > mainline DT) and online (using the sysfs interface), but that should
> > only apply to a narrow set of use cases.
> >
> > Thanks,
> >
> > - Juri
>

2015-12-07 12:02:19

by Juri Lelli

Subject: Re: [RFC PATCH 0/8] CPUs capacity information for heterogeneous systems

On 23/11/15 14:28, Juri Lelli wrote:
> Hi all,
>

Hi again everybody,

thanks a lot to Rob and Vincent for the feedback, but, IMHO, we need more
discussion on this series to figure out how we move forward; so, this is
a ping :-). It's a simple piece of information that we need to figure
out how to provide, and in which form, but it has the potential to form
a standardized basis for achieving important performance benefits.

Thanks,

- Juri

> ARM systems may be configured to have CPUs with different power/performance
> characteristics within the same chip. In this case, additional information has
> to be made available to the kernel (the scheduler in particular) for it to be
> aware of such differences and take decisions accordingly. This posting stems
> from the ongoing discussion about introducing a simple platform energy cost
> model to guide scheduling decisions (a.k.a Energy Aware Scheduling [1]), but
> also aims to be an independent track to standardise the way we make the
> scheduler aware of heterogeneous CPU systems. With these patches, plus the
> patches from [1] (that make the scheduler wakeup paths aware of heterogeneous
> CPU systems), we enable the scheduler to have good default performance on such
> systems. In addition, we get a clearly defined way of providing the scheduler
> with needed information about CPU capacity on such systems.
>
> CPU capacity is defined in this context as a number that provides the scheduler
> information about CPUs heterogeneity. Such heterogeneity can come from
> micro-architectural differences (e.g., ARM big.LITTLE systems) or maximum
> frequency at which CPUs can run (e.g., SMP systems with multiple frequency
> domains and different max frequencies). Heterogeneity in this context is about
> differing performance characteristics; in practice, the binding that we propose
> in this RFC tries to capture a first-order approximation of the relative
> performance of CPUs.
>
> This RFC proposes a solution to the problem of how we initialize CPUs' original
> capacity. The way it works today, and for arm A15/A7 systems only, is that we
> rely on cpu_efficiency magic numbers from arch/arm/kernel/topology.c and the
> existence of clock-frequency dtb properties; having those values available, we
> then do some math to come up with capacities that match values we know from
> measurement (e.g., the EAS energy model); for TC2 they are 430 for A7 and 1024
> for A15.
> Currently, arm64 doesn't have such a feature at all.
>
> With this patchset we provide CPUs capacity information either from DT or from
> sysfs interface. Such information is standardized for both arm and arm64.
>
> Patches high level description:
>
> o 01/08 cleans up how cpu_scale is initialized in arm
> o 02/08 introduces documentation for the new optional DT binding
> o [03-06]/08 add cpu-capacity attribute to TC2 and Juno DTs and provide
> parsing of such information at boot time
> o [07-08]/08 introduce sysfs attribute
>
> The patchset is based on top of tip/sched/core as of today.
>
> In case you would like to test this out, I pushed a branch here:
>
> git://linux-arm.org/linux-jl.git upstream/default_caps_dt
>
> This branch contains additional patches, useful to better understand how CPU
> capacity information is actually used by the scheduler. Discussion regarding
> these additional patches will be started with a different posting in the
> future. We just didn't want to make discussion too broad, as we realize that
> this set can be controversial already on its own.
>
> Comments, concerns and rants are more than welcome!
>
> Best,
>
> - Juri
>
> Juri Lelli (8):
> ARM: initialize cpu_scale to its default
> Documentation: arm: define DT cpu capacity bindings
> arm: parse cpu capacity from DT
> arm, dts: add TC2 cpu capacity information
> arm64: parse cpu capacity from DT
> arm64, dts: add Juno cpu capacity information
> arm: add sysfs cpu_capacity attribute
> arm64: add sysfs cpu_capacity attribute
>
> .../devicetree/bindings/arm/cpu-capacity.txt | 227 +++++++++++++++++++++
> Documentation/devicetree/bindings/arm/cpus.txt | 17 ++
> arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 6 +
> arch/arm/kernel/topology.c | 122 ++++++++++-
> arch/arm64/boot/dts/arm/juno.dts | 7 +
> arch/arm64/kernel/topology.c | 114 +++++++++++
> 6 files changed, 489 insertions(+), 4 deletions(-)
> create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt
>
> --
> 2.2.2
>

2015-12-07 12:11:46

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] CPUs capacity information for heterogeneous systems

On Mon, Dec 07, 2015 at 12:02:44PM +0000, Juri Lelli wrote:
> On 23/11/15 14:28, Juri Lelli wrote:
> > Hi all,
> >
>
> Hi again everybody,
>
> thanks a lot to Rob and Vincent for feedback, but, IMHO, we'd need more
> discussion happening on this series to figure out how we move forward;
> so, this is a ping :-). It's a simple information that we need to figure
> out how to provide, and in which form, but it has the potential to form
> a standardized basis for achieving important performance benefits.

I'm happy to take the first patch, it looks like a simple clean up.
Please arrange for it to land in my patch system. Thanks.

It looks like the second patch doesn't yet have the backing from the DT
people, which prevents the remainder of the patch set progressing. That's
the sticking point which needs to be resolved.

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-07 12:36:28

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] CPUs capacity information for heterogeneous systems

On 07/12/15 12:11, Russell King - ARM Linux wrote:
> On Mon, Dec 07, 2015 at 12:02:44PM +0000, Juri Lelli wrote:
> > On 23/11/15 14:28, Juri Lelli wrote:
> > > Hi all,
> > >
> >
> > Hi again everybody,
> >
> > thanks a lot to Rob and Vincent for feedback, but, IMHO, we'd need more
> > discussion happening on this series to figure out how we move forward;
> > so, this is a ping :-). It's a simple information that we need to figure
> > out how to provide, and in which form, but it has the potential to form
> > a standardized basis for achieving important performance benefits.
>
> I'm happy to take the first patch, it looks like a simple clean up.
> Please arrange for it to land in my patch system. Thanks.
>

OK, will do. Thanks!

> It looks like the second patch doesn't yet have the backing from the DT
> people, which prevents the remainder of the patch set progressing. That's
> the sticking point which needs to be resolved.
>

Agreed. That is exactly the point for which I was seeking more feedback.
OTOH, 7/8 and 8/8 may potentially work without the DT bindings, as they
simply expose a sysfs attribute. If you, and others, see value in them,
I can easily put them at the top of the series and post a v2. Thoughts?

Best,

- Juri

2015-12-07 13:18:54

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] CPUs capacity information for heterogeneous systems

On Mon, Dec 07, 2015 at 12:36:54PM +0000, Juri Lelli wrote:
> On 07/12/15 12:11, Russell King - ARM Linux wrote:
> > It looks like the second patch doesn't yet have the backing from the DT
> > people, which prevents the remainder of the patch set progressing. That's
> > the sticking point which needs to be resolved.
> >
>
> Agreed. That is exactly the point for which I was seeking more feedback.
> OTOH, 7/8 and 8/8 may potentially work without the DT bindings, as they
> simply expose a sysfs attribute. If you, and others, see value in them,
> I can easily put them at the top of the series and post a v2. Thoughts?

I've no idea, sorry. When it comes to stuff like CPU level power
management, the hardware I have is very limited in its capabilities
(no support even for S2RAM.) I've also never managed to get
big.LITTLE working on my platforms that should support it.

Eg, I gave up with the TC2 tile in Versatile Express. I forget the
details, but getting it booting was exceedingly difficult, and no one
really seemed interested in helping. So my TC2 tile is collecting
dust while the CA9x4 tile lives in the platform - which is now part of
the nightly build/boot test farm.

--
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-07 15:40:49

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] CPUs capacity information for heterogeneous systems

On 07/12/15 13:18, Russell King - ARM Linux wrote:
> On Mon, Dec 07, 2015 at 12:36:54PM +0000, Juri Lelli wrote:
> > On 07/12/15 12:11, Russell King - ARM Linux wrote:
> > > It looks like the second patch doesn't yet have the backing from the DT
> > > people, which prevents the remainder of the patch set progressing. That's
> > > the sticking point which needs to be resolved.
> > >
> >
> > Agreed. That is exactly the point for which I was seeking more feedback.
> > OTOH, 7/8 and 8/8 may potentially work without the DT bindings, as they
> > simply expose a sysfs attribute. If you, and others, see value in them,
> > I can easily put them at the top of the series and post a v2. Thoughts?
>
> I've no idea, sorry. When it comes to stuff like CPU level power
> management, the hardware I have is very limited in its capabilities
> (no support even for S2RAM.) I've also never managed to get
> big.LITTLE working on my platforms that should support it.
>

No problem. Thanks a lot for your feedback anyway.

> Eg, I gave up with the TC2 tile in Versatile Express. I forget the
> details, but getting it booting was exceedingly difficult, and no one
> really seemed interested in helping. So my TC2 tile is collecting
> dust while the CA9x4 tile lives in the platform - which is now part of
> the nightly build/boot test farm.
>

I guess I could provide some support for the TC2 bits, if you haven't
yet given up on it completely :-). Please don't hesitate to let me know,
even privately, about that.

Best,

- Juri

2015-12-10 14:14:51

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On 01/12/15 11:20, Juri Lelli wrote:
> Hi Vincent,
>
> On 30/11/15 10:59, Vincent Guittot wrote:
>> Hi Juri,
>>
>> On 24 November 2015 at 11:54, Juri Lelli <[email protected]> wrote:

[...]

>>>>> +==========================================
>>>>> +3 - capacity-scale
>>>>> +==========================================
>>>>> +
>>>>> +CPUs capacities are defined with respect to capacity-scale property in the cpus
>>>>> +node [1]. The property is optional; if not defined a 1024 capacity-scale is
>>>>> +assumed. This property defines both the highest CPU capacity present in the
>>>>> +system and granularity of CPU capacity values.
>>>>
>>>> I don't really see the point of this vs. having an absolute scale.
>>>>
>>>
>>> IMHO, we need this for several reasons, one being to address one of your
>>> concerns below: vendors are free to choose their scale without being
>>> forced to publish absolute data. Another reason is that it might make
>>> life easier in certain cases; for example, someone could implement a
>>> system with a few clusters of, say, A57s, but some run at half the clock
>>> of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
>>> this case I think it is just easier to define capacity-scale as 1200 and
>>> capacities as 1200 and 600. Last reason that I can think of right now is
>>> that we don't probably want to bound ourself to some particular range
>>> from the beginning, as that range might be enough now, but it could
>>> change in the future (as in, right now [1-1024] looks fine for
>>> scheduling purposes, but that might change).
>>
>> Like Rob, i don't really see the benefit of this optional
>> capacity-scale property. Parsing the capacity of all cpu nodes should
>> give you a range as well.
>> IMHO, this property looks like an optimization of the code that will
>> parse the dt more than a HW description
>>
>
> I agree that we can come up with the same information just looking at
> the biggest capacity value of all CPUs and treat that value as
> capacity-scale. I just thought that having that explicit made things
> clearer, as it could be not easy to immediately see from a DT with many
> CPUs which is the biggest capacity value. But, yes, we could remove that
> anyway.

+1! This capacity-scale complicates things unnecessarily. It was hard
for me to understand the meaning of it. Your second example sets
'capacity-scale = <2>' but also 'capacity = <2>' for cpu[01] and
'capacity = <1>' for cpu[23]. This can easily be replaced by 'capacity =
<1024>' for cpu[01] and 'capacity = <512>' for cpu[23]. Much more
readable, as was already mentioned in this thread.

I understand that we don't want to limit the range of capacity values in
the dt file to [1..1024], nor enforce that the cpu w/ the highest
capacity has to have the value of 1024 in the dt file; so the scheduler
has to scale accordingly if we want to limit capacity to its supported
range (like with EAS, [1..1024]).
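
For what it's worth, both encodings come out identical after the scaling
the posted parse_cpu_capacity() applies; a tiny stand-alone sketch of that
arithmetic (plain user-space C, not the kernel code itself), just to
illustrate:

	#include <stdio.h>

	#define SCHED_CAPACITY_SHIFT	10

	/* Same math as the posted patches: clamp capacity to capacity-scale,
	 * then scale it as (capacity << 10) / capacity-scale. */
	static unsigned long scale_capacity(unsigned int cap, unsigned int scale)
	{
		if (cap > scale)
			cap = scale;
		return ((unsigned long)cap << SCHED_CAPACITY_SHIFT) / scale;
	}

	int main(void)
	{
		/* capacity-scale = <2>, capacity = <2> / <1> ... */
		printf("%lu %lu\n", scale_capacity(2, 2), scale_capacity(1, 2));
		/* ... versus capacity = <1024> / <512> with the default scale. */
		printf("%lu %lu\n", scale_capacity(1024, 1024),
				    scale_capacity(512, 1024));
		return 0;
	}

Both printf calls print "1024 512", so dropping capacity-scale loses no
information in this example.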

[...]

2015-12-10 14:14:59

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 3/8] arm: parse cpu capacity from DT

On 23/11/15 14:28, Juri Lelli wrote:
> With the introduction of cpu capacity bindings, CPU capacities can now be
> extracted from DT. Add parsing of such information at boot time. We keep
> code that can produce same information, based on different DT properties
> and hard-coded values, as fall-back for backward compatibility.

This patch-set should define _one_ way to be able to specify
heterogeneous cpu capacity values for all arm and arm64 systems. So I
would really like us to agree on this solution ('capacity'
property) and delete the old (cortex-a[15,7] only) solution based on
struct cpu_efficiency table_efficiency[] and 'clock-frequency' property
in arch/arm/kernel/topology.c. The appropriate code in
arch/arm[,64]/kernel/topology.c should be the same. Everything else is
highly confusing.

>
> Cc: Russell King <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> arch/arm/kernel/topology.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index ec279d1..ecbff03 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -78,6 +78,35 @@ static unsigned long *__cpu_capacity;
> #define cpu_capacity(cpu) __cpu_capacity[cpu]
>
> static unsigned long middle_capacity = 1;
> +static bool capacity_from_dt = true;
> +static u32 capacity_scale = SCHED_CAPACITY_SCALE;
> +
> +static int __init parse_cpu_capacity(struct device_node *cpu_node, int cpu)
> +{
> + int ret = 1;
> + u32 cpu_capacity;
> +
> + ret = of_property_read_u32(cpu_node,
> + "capacity",
> + &cpu_capacity);
> + if (!ret) {
> + u64 capacity;
> +
> + /*
> + * Enforce capacity <= capacity-scale.
> + */
> + cpu_capacity = cpu_capacity <= capacity_scale ? cpu_capacity :
> + capacity_scale;
> + capacity = (cpu_capacity << SCHED_CAPACITY_SHIFT) /
> + capacity_scale;
> +
> + set_capacity_scale(cpu, capacity);
> + pr_info("CPU%d: DT cpu capacity %lu\n",
> + cpu, arch_scale_cpu_capacity(NULL, cpu));
> + }
> +
> + return !ret;
> +}
>
> /*
> * Iterate all CPUs' descriptor in DT and compute the efficiency
> @@ -99,6 +128,18 @@ static void __init parse_dt_topology(void)
> __cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity),
> GFP_NOWAIT);
>
> + cn = of_find_node_by_path("/cpus");
> + if (!cn) {
> + pr_err("No CPU information found in DT\n");
> + return;
> + }
> +
> + if (!of_property_read_u32(cn, "capacity-scale", &capacity_scale))
> + pr_info("DT cpus capacity-scale %u\n", capacity_scale);
> + else
> + pr_debug("DT cpus capacity-scale not found: assuming %u\n",
> + capacity_scale);
> +
> for_each_possible_cpu(cpu) {
> const u32 *rate;
> int len;
> @@ -110,6 +151,13 @@ static void __init parse_dt_topology(void)
> continue;
> }
>
> + if (parse_cpu_capacity(cn, cpu)) {
> + of_node_put(cn);
> + continue;
> + }
> +
> + capacity_from_dt = false;
> +
> for (cpu_eff = table_efficiency; cpu_eff->compatible; cpu_eff++)
> if (of_device_is_compatible(cn, cpu_eff->compatible))
> break;
> @@ -160,7 +208,7 @@ static void __init parse_dt_topology(void)
> */
> static void update_cpu_capacity(unsigned int cpu)
> {
> - if (!cpu_capacity(cpu))
> + if (!cpu_capacity(cpu) || capacity_from_dt)
> return;
>
> set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>

2015-12-10 14:15:41

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 5/8] arm64: parse cpu capacity from DT

On 23/11/15 14:28, Juri Lelli wrote:
> With the introduction of cpu capacity bindings, CPU capacities can now be
> extracted from DT. Add parsing of such information at boot time. Also,
> store such information using per CPU variables, as we do for arm.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Mark Brown <[email protected]>
> Cc: Sudeep Holla <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> arch/arm64/kernel/topology.c | 46 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 46 insertions(+)

A lot of code disappears if you get rid of capacity-scale.

>
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 694f6de..4423cc5 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -23,6 +23,45 @@
> #include <asm/cputype.h>
> #include <asm/topology.h>
>
> +static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
> +
> +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> +{
> + return per_cpu(cpu_scale, cpu);
> +}
> +
> +static void set_capacity_scale(unsigned int cpu, unsigned long capacity)
> +{
> + per_cpu(cpu_scale, cpu) = capacity;
> +}
> +
> +static u32 capacity_scale = SCHED_CAPACITY_SCALE;
> +
> +static void __init parse_cpu_capacity(struct device_node *cpu_node, int cpu)
> +{
> + int ret;
> + u32 cpu_capacity;
> +
> + ret = of_property_read_u32(cpu_node,
> + "capacity",
> + &cpu_capacity);
> + if (!ret) {
> + u64 capacity;
> +
> + /*
> + * Enforce capacity <= capacity-scale.
> + */
> + cpu_capacity = cpu_capacity <= capacity_scale ? cpu_capacity :
> + capacity_scale;
> + capacity = (cpu_capacity << SCHED_CAPACITY_SHIFT) /
> + capacity_scale;
> +
> + set_capacity_scale(cpu, capacity);
> + pr_info("CPU%d: DT cpu_capacity %lu\n",
> + cpu, arch_scale_cpu_capacity(NULL, cpu));
> + }
> +}
> +
> static int __init get_cpu_for_node(struct device_node *node)
> {
> struct device_node *cpu_node;
> @@ -34,6 +73,7 @@ static int __init get_cpu_for_node(struct device_node *node)
>
> for_each_possible_cpu(cpu) {
> if (of_get_cpu_node(cpu, NULL) == cpu_node) {
> + parse_cpu_capacity(cpu_node, cpu);
> of_node_put(cpu_node);
> return cpu;
> }
> @@ -173,6 +213,12 @@ static int __init parse_dt_topology(void)
> return 0;
> }
>
> + if (!of_property_read_u32(cn, "capacity-scale", &capacity_scale))
> + pr_info("DT cpus capacity-scale %u\n", capacity_scale);
> + else
> + pr_debug("DT cpus capacity-scale not found: assuming %u\n",
> + capacity_scale);
> +
> /*
> * When topology is provided cpu-map is essentially a root
> * cluster with restricted subnodes.
>

2015-12-10 14:15:13

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 8/8] arm64: add sysfs cpu_capacity attribute

On 23/11/15 14:28, Juri Lelli wrote:
> Add a sysfs cpu_capacity attribute with which it is possible to read and
> write (thus over-writing default values) CPUs capacity. This might be
> useful in situation where there is no way to get proper default values
> at boot time.
>
> The new attribute shows up as:
>
> /sys/devices/system/cpu/cpu*/cpu_capacity
>

This sysfs interface is not really needed for arm or arm64. People can
build the dt blob if they want to change the values. Less code to carry
... Let's focus on the core functionality, which is parsing cpu capacity
from the dt file and feeding it to the task scheduler for heterogeneous
systems.

[...]

2015-12-10 15:30:33

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote:

> I think you need something absolute and probably per MHz (like
> dynamic-power-coefficient property). Perhaps the IPC (instructions per
> clock) value?

> In other words, I want to see these numbers have a defined method
> of determining them and don't want to see random values from every
> vendor. ARM, Ltd. says core X has a value of Y would be good enough for
> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
> value of 1024 is not what I want to see. Of course things like cache
> sizes can vary the performance, but is a baseline value good enough?

> However, no vendor will want to publish their values if these are
> absolute values relative to other vendors.

> If you expect these to need frequent tuning, then don't put them in DT.

I agree strongly. Putting what are essentially tuning numbers for the
system into the ABI is going to lead us into a mess long term since if
we change anything related to the performance of the system the numbers
may become invalid and we've no real way of recovering sensible
information.

There is of course also the issue of where people are getting the numbers
from in the first place - were the numbers picked for some particular
use case or to optimise some particular benchmark, what other conditions
existed at the time (cpufreq setup for example), what tuning goals did
the people picking the numbers have and do any of those things
correspond to what a given user wants? If detailed tuning of the numbers
for specific systems matters much, will we get competing users patching
the in kernel DTs over and over, and what do we do about ACPI systems?
Having an absolute definition doesn't really help with this since the
concrete effect DT authors see is that these are tuning numbers.

It would be better to have the DT describe concrete physical properties
of the system which we can then map onto numbers we like, that way if we
get better information in future or just decide that completely
different metrics are appropriate for tuning we can just do that without
having to worry about translating the old metrics into new ones. We can
then expose the tuning knobs to userspace for override if that's needed.
If doing system specific tuning on vertically integrated systems really
is terribly important it's not going to matter too much where the tuning
is but we also have to consider more general purpose systems.

We're not going to get out of having to pick numbers at some point,
pushing them into DT doesn't get us out of that but it does make the
situation harder to manage long term and makes the performance for the
general user less reliable. It's also just more work all round:
everyone doing the DT for a SoC is going to have to do some combination
of cargo culting or repeating the calibration.

I remember Peter remarking at one of the LPC discussions of this idea
that there had been some bad experiences with getting numbers from
firmware on other systems.



2015-12-10 15:59:52

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 8/8] arm64: add sysfs cpu_capacity attribute

On Thu, Dec 10, 2015 at 02:15:04PM +0000, Dietmar Eggemann wrote:
> On 23/11/15 14:28, Juri Lelli wrote:

> > The new attribute shows up as:

> > /sys/devices/system/cpu/cpu*/cpu_capacity

> This sysfs interface is not really needed for arm or arm64. People can
> build the dt blob if they want to change the values. Less code to carry
> ... Let's focus on the core functionality, which is parsing cpu capacity
> from dt file to task scheduler for heterogeneous systems.

That does make the tuning process much more cumbersome - users have to
rebuild and reboot to tweak the numbers rather than just tweaking the
numbers and rerunning the benchmark (which seems like something people
would want to automate).



2015-12-10 17:57:55

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

Hi Mark,

I certainly understand your (and Rob's) concerns, but let me try anyway
to argue a bit more for this approach :-).

On 10/12/15 15:30, Mark Brown wrote:
> On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote:
>
> > I think you need something absolute and probably per MHz (like
> > dynamic-power-coefficient property). Perhaps the IPC (instructions per
> > clock) value?
>
> > In other words, I want to see these numbers have a defined method
> > of determining them and don't want to see random values from every
> > vendor. ARM, Ltd. says core X has a value of Y would be good enough for
> > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
> > value of 1024 is not what I want to see. Of course things like cache
> > sizes can vary the performance, but is a baseline value good enough?
>
> > However, no vendor will want to publish their values if these are
> > absolute values relative to other vendors.
>
> > If you expect these to need frequent tuning, then don't put them in DT.
>
> I agree strongly. Putting what are essentially tuning numbers for the
> system into the ABI is going to lead us into a mess long term since if
> we change anything related to the performance of the system the numbers
> may become invalid and we've no real way of recovering sensible
> information.
>
> There is of course also the issue where people are getting the numbers
> from in the first place - were the numbers picked for some particular
> use case or to optimise some particular benchmark, what other conditions
> existed at the time (cpufreq setup for example), what tuning goals did
> the people picking the numbers have and do any of those things
> correspond to what a given user wants? If detailed tuning the numbers
> for specific systems matters much will we get competing users patching
> the in kernel DTs over and over, and what do we do about ACPI systems?
> Having an absolute definition doesn't really help with this since the
> concrete effect DT authors see is that these are tuning numbers.
>

I'm not entirely getting here why you consider capacity values to be
tunables. As part of the EAS effort, we are proposing ways in which users
should be able to fine tune their system as needed, when required
(don't know if you had a chance to have a look at the SchedTune posting
back in August for example [1]). This patch only tries to standardize
where we get default values from and how we specify them. I'm not
seeing them changing much after an initial benchmarking phase has been
done. Tuning should happen using different methods, not by changing
these values, IMHO.

> It would be better to have the DT describe concrete physical properties
> of the system which we can then map onto numbers we like, that way if we
> get better information in future or just decide that completely
> different metrics are appropriate for tuning we can just do that without
> having to worry about translating the old metrics into new ones. We can
> then expose the tuning knobs to userspace for override if that's needed.
> If doing system specific tuning on vertically integrated systems really
> is terribly important it's not going to matter too much where the tuning
> is but we also have to consider more general purpose systems.
>

As replied to Rob, I'm not sure it is so easy to find any physical
property that expresses what we essentially need (without maybe relying
on a complex mix of hardware details and a model to extract numbers from
them). Instead, we propose to have reasonable, per SoC, default numbers;
and then let users fine tune their platform afterwards, without changing
those default values.

> We're not going to get out of having to pick numbers at some point,
> pushing them into DT doesn't get us out of that but it does make the
> situation harder to manage long term and makes the performance for the
> general user less relaible. It's also just more work all round,
> everyone doing the DT for a SoC is going to have to do some combination
> of cargo culting or repeating the callibration.
>

I'm most probably a bit naive here, but I see the calibration phase
happening only once, after the platform is stable. You get default
capacity values by running a pretty simple benchmark on a fixed
configuration; and you put them somewhere (DTs still seem to be a
sensible place to me). Then you'll be able to suit tuning needs using
different interfaces.
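
The posting doesn't spell out what that benchmark is; purely as an
illustration of the kind of one-off calibration being described here (the
workload and all names below are made up, and it assumes cpufreq has been
pinned to the maximum frequency on every CPU beforehand), something along
these lines would do:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static volatile unsigned long sink;

	/* Fixed, purely CPU-bound workload; any stable single-threaded loop works. */
	static void workload(void)
	{
		unsigned long i;

		for (i = 0; i < 200000000UL; i++)
			sink += i ^ (i >> 3);
	}

	/* Pin the calling thread to one CPU and time the workload there. */
	static double run_on(int cpu)
	{
		cpu_set_t set;
		struct timespec t0, t1;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		if (sched_setaffinity(0, sizeof(set), &set))
			return -1.0;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		workload();
		clock_gettime(CLOCK_MONOTONIC, &t1);
		return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	}

	int main(void)
	{
		long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
		double t[256], best = 0.0;
		int c;

		if (ncpus > 256)
			ncpus = 256;
		for (c = 0; c < ncpus; c++) {
			t[c] = run_on(c);
			if (t[c] > 0.0 && (best == 0.0 || t[c] < best))
				best = t[c];
		}
		/* The fastest CPU is taken as 1024; the others scale with the
		 * ratio of runtimes, i.e. first-order relative performance. */
		for (c = 0; c < ncpus; c++)
			if (t[c] > 0.0)
				printf("cpu%d: capacity ~ %.0f\n",
				       c, 1024.0 * best / t[c]);
		return 0;
	}
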

ABI changes have to be carefully considered, I know. But still, we need
to agree on a way to provide these default capacity values someway. So,
thanks for helping us carry on this discussion.

Best,

- Juri

> I remember Peter remarking at one of the LPC discussions of this idea
> that there had been some bad experiences with getting numbers from
> firmware on other systems.

[1] https://lkml.org/lkml/2015/8/19/419

2015-12-10 18:01:32

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 8/8] arm64: add sysfs cpu_capacity attribute

Hi,

On 10/12/15 15:59, Mark Brown wrote:
> On Thu, Dec 10, 2015 at 02:15:04PM +0000, Dietmar Eggemann wrote:
> > On 23/11/15 14:28, Juri Lelli wrote:
>
> > > The new attribute shows up as:
>
> > > /sys/devices/system/cpu/cpu*/cpu_capacity
>
> > This sysfs interface is not really needed for arm or arm64. People can
> > build the dt blob if they want to change the values. Less code to carry
> > ... Let's focus on the core functionality, which is parsing cpu capacity
> > from dt file to task scheduler for heterogeneous systems.
>
> That does make the tuning process much more cumbersome - users have to
> rebuild and reboot to tweak the numbers rather than just tweaking the
> numbers and rerunning the benchmark (which seems like something people
> would want to automate).

IMHO, this is not a tuning interface. It is an alternative interface,
w.r.t. DTs, that we could use to provide default capacity values to the
kernel. I'm proposing both here as they both make sense to me. Then we
might decide which one to go with (or if we need some other way) or
keep both for flexibility.
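
Roughly speaking (this is only a sketch of what such an attribute
involves, not the code from patches 7-8/8, and it assumes it lives in
arch/*/kernel/topology.c next to the set_capacity_scale() /
arch_scale_cpu_capacity() helpers from the quoted patches), exposing the
attribute looks like:

	static ssize_t cpu_capacity_show(struct device *dev,
					 struct device_attribute *attr,
					 char *buf)
	{
		/* dev->id is the logical cpu number for cpu devices */
		return sprintf(buf, "%lu\n",
			       arch_scale_cpu_capacity(NULL, dev->id));
	}

	static ssize_t cpu_capacity_store(struct device *dev,
					  struct device_attribute *attr,
					  const char *buf, size_t count)
	{
		unsigned long new_capacity;

		if (kstrtoul(buf, 0, &new_capacity) ||
		    new_capacity > SCHED_CAPACITY_SCALE)
			return -EINVAL;

		set_capacity_scale(dev->id, new_capacity);
		return count;
	}
	static DEVICE_ATTR_RW(cpu_capacity);

	static int __init register_cpu_capacity_sysfs(void)
	{
		struct device *dev;
		int cpu;

		for_each_possible_cpu(cpu) {
			dev = get_cpu_device(cpu);
			if (dev)
				device_create_file(dev, &dev_attr_cpu_capacity);
		}
		return 0;
	}
	late_initcall(register_cpu_capacity_sysfs);

After that, reading /sys/devices/system/cpu/cpu*/cpu_capacity returns the
current value and writing to it overrides the default.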

Best,

- Juri

2015-12-11 10:07:09

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 5/8] arm64: parse cpu capacity from DT

Hi,

On 10/12/15 14:15, Dietmar Eggemann wrote:
> On 23/11/15 14:28, Juri Lelli wrote:
> > With the introduction of cpu capacity bindings, CPU capacities can now be
> > extracted from DT. Add parsing of such information at boot time. Also,
> > store such information using per CPU variables, as we do for arm.
> >
> > Cc: Catalin Marinas <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > Cc: Mark Brown <[email protected]>
> > Cc: Sudeep Holla <[email protected]>
> > Signed-off-by: Juri Lelli <[email protected]>
> > ---
> > arch/arm64/kernel/topology.c | 46 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 46 insertions(+)
>
> A lot of code disappears if you get rid of capacity-scale.
>

Some code disappears, yes, but I'll have to add some other code to
normalize capacities w.r.t. the highest capacity amongst CPUs. Not
a problem, though :-).
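
Something along these lines should be enough (sketch only;
raw_capacity[] and normalize_cpu_capacities() are made-up names, and it
assumes the raw DT "capacity" values get stashed per cpu while parsing):

	static u32 raw_capacity[NR_CPUS];	/* raw "capacity" values from DT */

	/*
	 * Once all cpu nodes have been parsed, scale every raw value so that
	 * the biggest CPU ends up at SCHED_CAPACITY_SCALE (1024).
	 */
	static void __init normalize_cpu_capacities(void)
	{
		u32 max_raw = 1;
		u64 capacity;
		int cpu;

		for_each_possible_cpu(cpu)
			max_raw = max(max_raw, raw_capacity[cpu]);

		for_each_possible_cpu(cpu) {
			if (!raw_capacity[cpu])
				continue;	/* no DT value, keep the default */

			capacity = (u64)raw_capacity[cpu] << SCHED_CAPACITY_SHIFT;
			do_div(capacity, max_raw);	/* 64-by-32 divide */
			set_capacity_scale(cpu, capacity);
		}
	}

With that, capacity-scale and the clamping against it simply go away.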

Thanks,

- Juri

> >
> > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> > index 694f6de..4423cc5 100644
> > --- a/arch/arm64/kernel/topology.c
> > +++ b/arch/arm64/kernel/topology.c
> > @@ -23,6 +23,45 @@
> > #include <asm/cputype.h>
> > #include <asm/topology.h>
> >
> > +static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
> > +
> > +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> > +{
> > + return per_cpu(cpu_scale, cpu);
> > +}
> > +
> > +static void set_capacity_scale(unsigned int cpu, unsigned long capacity)
> > +{
> > + per_cpu(cpu_scale, cpu) = capacity;
> > +}
> > +
> > +static u32 capacity_scale = SCHED_CAPACITY_SCALE;
> > +
> > +static void __init parse_cpu_capacity(struct device_node *cpu_node, int cpu)
> > +{
> > + int ret;
> > + u32 cpu_capacity;
> > +
> > + ret = of_property_read_u32(cpu_node,
> > + "capacity",
> > + &cpu_capacity);
> > + if (!ret) {
> > + u64 capacity;
> > +
> > + /*
> > + * Enforce capacity <= capacity-scale.
> > + */
> > + cpu_capacity = cpu_capacity <= capacity_scale ? cpu_capacity :
> > + capacity_scale;
> > + capacity = (cpu_capacity << SCHED_CAPACITY_SHIFT) /
> > + capacity_scale;
> > +
> > + set_capacity_scale(cpu, capacity);
> > + pr_info("CPU%d: DT cpu_capacity %lu\n",
> > + cpu, arch_scale_cpu_capacity(NULL, cpu));
> > + }
> > +}
> > +
> > static int __init get_cpu_for_node(struct device_node *node)
> > {
> > struct device_node *cpu_node;
> > @@ -34,6 +73,7 @@ static int __init get_cpu_for_node(struct device_node *node)
> >
> > for_each_possible_cpu(cpu) {
> > if (of_get_cpu_node(cpu, NULL) == cpu_node) {
> > + parse_cpu_capacity(cpu_node, cpu);
> > of_node_put(cpu_node);
> > return cpu;
> > }
> > @@ -173,6 +213,12 @@ static int __init parse_dt_topology(void)
> > return 0;
> > }
> >
> > + if (!of_property_read_u32(cn, "capacity-scale", &capacity_scale))
> > + pr_info("DT cpus capacity-scale %u\n", capacity_scale);
> > + else
> > + pr_debug("DT cpus capacity-scale not found: assuming %u\n",
> > + capacity_scale);
> > +
> > /*
> > * When topology is provided cpu-map is essentially a root
> > * cluster with restricted subnodes.
> >
>

2015-12-11 10:08:54

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

Hi,

On 10/12/15 14:14, Dietmar Eggemann wrote:
> On 01/12/15 11:20, Juri Lelli wrote:
> > Hi Vincent,
> >
> > On 30/11/15 10:59, Vincent Guittot wrote:
> >> Hi Juri,
> >>
> >> On 24 November 2015 at 11:54, Juri Lelli <[email protected]> wrote:
>
> [...]
>
> >>>>> +==========================================
> >>>>> +3 - capacity-scale
> >>>>> +==========================================
> >>>>> +
> >>>>> +CPUs capacities are defined with respect to capacity-scale property in the cpus
> >>>>> +node [1]. The property is optional; if not defined a 1024 capacity-scale is
> >>>>> +assumed. This property defines both the highest CPU capacity present in the
> >>>>> +system and granularity of CPU capacity values.
> >>>>
> >>>> I don't really see the point of this vs. having an absolute scale.
> >>>>
> >>>
> >>> IMHO, we need this for several reasons, one being to address one of your
> >>> concerns below: vendors are free to choose their scale without being
> >>> forced to publish absolute data. Another reason is that it might make
> >>> life easier in certain cases; for example, someone could implement a
> >>> system with a few clusters of, say, A57s, but some run at half the clock
> >>> of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
> >>> this case I think it is just easier to define capacity-scale as 1200 and
> >>> capacities as 1200 and 600. Last reason that I can think of right now is
> >>> that we don't probably want to bound ourself to some particular range
> >>> from the beginning, as that range might be enough now, but it could
> >>> change in the future (as in, right now [1-1024] looks fine for
> >>> scheduling purposes, but that might change).
> >>
> >> Like Rob, i don't really see the benefit of this optional
> >> capacity-scale property. Parsing the capacity of all cpu nodes should
> >> give you a range as well.
> >> IMHO, this property looks like an optimization of the code that will
> >> parse the dt more than a HW description
> >>
> >
> > I agree that we can come up with the same information just looking at
> > the biggest capacity value of all CPUs and treat that value as
> > capacity-scale. I just thought that having that explicit made things
> > clearer, as it could be not easy to immediately see from a DT with many
> > CPUs which is the biggest capacity value. But, yes, we could remove that
> > anyway.
>
> +1! This capacity-scale complicates things unnecessarily. It was hard
> for me to understand the meaning of it. Your 2. example sets
> 'capacity-scale = <2>' but also 'capacity = <2>' for cpu[01] and
> 'capacity = <1>' for cpu[23]. This can be easily replaced by 'capacity =
> <1024>' for cpu[01] and 'capacity = <512>' for cpu[23]. Much more
> readable, as it was mentioned already in this thread.
>
> I understand that we don't want to limit the range of capacity values in
> the dt file to [1..1024] nor enforce that the cpu w/ the highest
> capacity has to have the value of 1024 in the dt file so the scheduler
> has to scale accordingly if we want to limit capacity to its supported
> capacity range (like with EAS [1..1024]).
>

OK, I guess I can easily remove capacity-scale and simply normalize CPU
capacities w.r.t. the highest capacity in the DT.

Thanks,

- Juri

2015-12-11 10:12:06

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 3/8] arm: parse cpu capacity from DT

Hi,

On 10/12/15 14:14, Dietmar Eggemann wrote:
> On 23/11/15 14:28, Juri Lelli wrote:
> > With the introduction of cpu capacity bindings, CPU capacities can now be
> > extracted from DT. Add parsing of such information at boot time. We keep
> > code that can produce same information, based on different DT properties
> > and hard-coded values, as fall-back for backward compatibility.
>
> This patch-set should define _one_ way to be able to specify
> heterogeneous cpu capacity values for all arm and arm64 systems. So I
> would really like that we can agree on this solution ('capacity'
> property) and delete the old (cortex-a[15,7] only) solution based on
> struct cpu_efficiency table_efficiency[] and 'clock-frequency' property
> in arch/arm/kernel/topology.c. The appropriate code in
> arch/arm[,64]/kernel/topology.c should be the same. Everything else is
> highly confusing.
>

Yeah, I tend to agree on this. I'd say that we could first agree on the
way we specify capacities and then see what cleanups are required to
make things nicer.

Thanks,

- Juri

> >
> > Cc: Russell King <[email protected]>
> > Signed-off-by: Juri Lelli <[email protected]>
> > ---
> > arch/arm/kernel/topology.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> > index ec279d1..ecbff03 100644
> > --- a/arch/arm/kernel/topology.c
> > +++ b/arch/arm/kernel/topology.c
> > @@ -78,6 +78,35 @@ static unsigned long *__cpu_capacity;
> > #define cpu_capacity(cpu) __cpu_capacity[cpu]
> >
> > static unsigned long middle_capacity = 1;
> > +static bool capacity_from_dt = true;
> > +static u32 capacity_scale = SCHED_CAPACITY_SCALE;
> > +
> > +static int __init parse_cpu_capacity(struct device_node *cpu_node, int cpu)
> > +{
> > + int ret = 1;
> > + u32 cpu_capacity;
> > +
> > + ret = of_property_read_u32(cpu_node,
> > + "capacity",
> > + &cpu_capacity);
> > + if (!ret) {
> > + u64 capacity;
> > +
> > + /*
> > + * Enforce capacity <= capacity-scale.
> > + */
> > + cpu_capacity = cpu_capacity <= capacity_scale ? cpu_capacity :
> > + capacity_scale;
> > + capacity = (cpu_capacity << SCHED_CAPACITY_SHIFT) /
> > + capacity_scale;
> > +
> > + set_capacity_scale(cpu, capacity);
> > + pr_info("CPU%d: DT cpu capacity %lu\n",
> > + cpu, arch_scale_cpu_capacity(NULL, cpu));
> > + }
> > +
> > + return !ret;
> > +}
> >
> > /*
> > * Iterate all CPUs' descriptor in DT and compute the efficiency
> > @@ -99,6 +128,18 @@ static void __init parse_dt_topology(void)
> > __cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity),
> > GFP_NOWAIT);
> >
> > + cn = of_find_node_by_path("/cpus");
> > + if (!cn) {
> > + pr_err("No CPU information found in DT\n");
> > + return;
> > + }
> > +
> > + if (!of_property_read_u32(cn, "capacity-scale", &capacity_scale))
> > + pr_info("DT cpus capacity-scale %u\n", capacity_scale);
> > + else
> > + pr_debug("DT cpus capacity-scale not found: assuming %u\n",
> > + capacity_scale);
> > +
> > for_each_possible_cpu(cpu) {
> > const u32 *rate;
> > int len;
> > @@ -110,6 +151,13 @@ static void __init parse_dt_topology(void)
> > continue;
> > }
> >
> > + if (parse_cpu_capacity(cn, cpu)) {
> > + of_node_put(cn);
> > + continue;
> > + }
> > +
> > + capacity_from_dt = false;
> > +
> > for (cpu_eff = table_efficiency; cpu_eff->compatible; cpu_eff++)
> > if (of_device_is_compatible(cn, cpu_eff->compatible))
> > break;
> > @@ -160,7 +208,7 @@ static void __init parse_dt_topology(void)
> > */
> > static void update_cpu_capacity(unsigned int cpu)
> > {
> > - if (!cpu_capacity(cpu))
> > + if (!cpu_capacity(cpu) || capacity_from_dt)
> > return;
> >
> > set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
> >
>

2015-12-11 17:50:03

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Thu, Dec 10, 2015 at 05:58:20PM +0000, Juri Lelli wrote:
> On 10/12/15 15:30, Mark Brown wrote:
> > On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote:

> > > In other words, I want to see these numbers have a defined method
> > > of determining them and don't want to see random values from every
> > > vendor. ARM, Ltd. says core X has a value of Y would be good enough for
> > > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
> > > value of 1024 is not what I want to see. Of course things like cache
> > > sizes can vary the performance, but is a baseline value good enough?

> > > However, no vendor will want to publish their values if these are
> > > absolute values relative to other vendors.

> > > If you expect these to need frequent tuning, then don't put them in DT.

> > I agree strongly. Putting what are essentially tuning numbers for the
> > system into the ABI is going to lead us into a mess long term since if
> > we change anything related to the performance of the system the numbers
> > may become invalid and we've no real way of recovering sensible
> > information.

> I'm not entirely getting here why you consider capacity values to be
> tunables. As part of the EAS effort, we are proposing ways in which users

The purpose of the capacity values is to influence the scheduler
behaviour and hence performance. Without a concrete definition they're
just magic numbers which have meaning only in terms of their effect on
the performance of the system. That is a sufficiently complex outcome
to ensure that there will be an element of taste in what the desired
outcomes are. Sounds like tuneables to me.

> should be able to fine tune their system as needed, when required
> (don't know if you had a chance to have a look at the SchedTune posting
> back in August for example [1]). This patch tries to only standardize
> where do we get default values from and how we specify them. I'm not
> seeing them changing much after an initial benchmarking phase has been
> done. Tuning should happen using different methods, not by changing
> these values, IMHO.

If you are saying people should use other, more sensible, ways of
specifying the final values that actually get used in production then
why take the defaults from direct numbers in DT in the first place? If you
are saying that people should tune and then put the values in here then
that's problematic for the reasons I outlined.

> > It would be better to have the DT describe concrete physical properties
> > of the system which we can then map onto numbers we like, that way if we
> > get better information in future or just decide that completely
> > different metrics are appropriate for tuning we can just do that without
> > having to worry about translating the old metrics into new ones. We can
> > then expose the tuning knobs to userspace for override if that's needed.
> > If doing system specific tuning on vertically integrated systems really
> > is terribly important it's not going to matter too much where the tuning
> > is but we also have to consider more general purpose systems.

> As replied to Rob, I'm not sure it is so easy to find any physical
> property that expresses what we essentially need (without maybe relying
> on a complex mix of hardware details and a model to extract numbers from
> them). Instead, we propose to have reasonable, per SoC, default numbers;
> and then let users fine tune their platform afterwards, without changing
> those default values.

If users are supposed to do fine tuning elsewhere after the fact why
bother with this initial calibration? Something that's ballpark good
enough like just knowing the core used and perhaps some important
options on it should give an adequate starting point and not have the
issues with having the tuning numbers present as magic numbers. Perhaps
we might also feed cache information in at some point. If in future
we're able to improve those default numbers (or just adapt at runtime)
then even better.
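
Concretely, the kind of per-core-type default being suggested could be
modelled on the existing table_efficiency[] in arch/arm/kernel/topology.c;
a sketch only, assuming it sits alongside that parsing code (the numbers
are just the measured TC2 ones quoted in the cover letter, A15 = 1024 and
A7 = 430, and are illustrative, not definitive):

	struct cpu_capacity_default {
		const char *compatible;
		unsigned long capacity;		/* relative to 1024 */
	};

	static const struct cpu_capacity_default capacity_defaults[] __initconst = {
		{ "arm,cortex-a15", 1024 },
		{ "arm,cortex-a7",   430 },
		{ }
	};

	/* Look up a ballpark default capacity from the cpu node's compatible. */
	static unsigned long __init default_capacity(struct device_node *cpu_node)
	{
		const struct cpu_capacity_default *d;

		for (d = capacity_defaults; d->compatible; d++)
			if (of_device_is_compatible(cpu_node, d->compatible))
				return d->capacity;

		return SCHED_CAPACITY_SCALE;	/* unknown core: assume full capacity */
	}

Unknown cores just fall back to full capacity, with no numbers needed in
the DT at all.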

It also seems a bit strange to expect people to do some tuning in one
place initially and then additional tuning somewhere else later; from
a user point of view I'd expect to always do my tuning in the same
place.

> > We're not going to get out of having to pick numbers at some point,
> > pushing them into DT doesn't get us out of that but it does make the
> > situation harder to manage long term and makes the performance for the
> > general user less relaible. It's also just more work all round,
> > everyone doing the DT for a SoC is going to have to do some combination
> > of cargo culting or repeating the callibration.

> I'm most probably a bit naive here, but I see the calibration phase
> happening only once, after the platform is stable. You get default
> capacity values by running a pretty simple benchmark on a fixed
> configuration; and you put them somewhere (DTs still seem to be a
> sensible place to me). Then you'll be able to suit tuning needs using
> different interfaces.

My point is that everyone making any kind of SoC with asymmetries is
expected to go and do some kind of calibration based on some unclear
criteria; if these are just ballpark-accurate starting points that seems
like wasted effort - the kernel should be making a reasonable effort to
do something sensible without this information which is going to be less
effort all round. It doesn't need to wait for real silicon (this seems
like the sort of core bit of DT which will be being written pre-tapeout)
and doesn't have marketing implications.

Doing that and then switching to some other interface for real tuning
seems especially odd and I'm not sure that's something that users are
going to expect or understand.



2015-12-11 17:54:53

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 8/8] arm64: add sysfs cpu_capacity attribute

On Thu, Dec 10, 2015 at 06:01:59PM +0000, Juri Lelli wrote:
> On 10/12/15 15:59, Mark Brown wrote:
> > On Thu, Dec 10, 2015 at 02:15:04PM +0000, Dietmar Eggemann wrote:
> > > On 23/11/15 14:28, Juri Lelli wrote:

> > > > The new attribute shows up as:

> > > > /sys/devices/system/cpu/cpu*/cpu_capacity

> > > This sysfs interface is not really needed for arm or arm64. People can
> > > build the dt blob if they want to change the values. Less code to carry
> > > ... Let's focus on the core functionality, which is parsing cpu capacity
> > > from dt file to task scheduler for heterogeneous systems.

> > That does make the tuning process much more cumbersome - users have to
> > rebuild and reboot to tweak the numbers rather than just tweaking the
> > numbers and rerunning the benchmark (which seems like something people
> > would want to automate).

> IMHO, this is not a tuning interface. It is an alternative interface,
> w.r.t. DTs, that we could use to provide default capacity values to the
> kernel. I'm proposing both here as they make both sense to me. Then we
> might dedice for which one to go (or if we need some other way) or to
> keep both for flexibility.

Kind of repeating what I said in the other mail but I'd say that any
interface which provides a mechanism for setting a magic number that
influences system performance is providing tuning. It's hard to see how
else to describe it to be honest.



2015-12-14 12:36:25

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

Hi Mark,

On 11/12/15 17:49, Mark Brown wrote:
> On Thu, Dec 10, 2015 at 05:58:20PM +0000, Juri Lelli wrote:
> > On 10/12/15 15:30, Mark Brown wrote:
> > > On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote:
>
> > > > In other words, I want to see these numbers have a defined method
> > > > of determining them and don't want to see random values from every
> > > > vendor. ARM, Ltd. says core X has a value of Y would be good enough for
> > > > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
> > > > value of 1024 is not what I want to see. Of course things like cache
> > > > sizes can vary the performance, but is a baseline value good enough?
>
> > > > However, no vendor will want to publish their values if these are
> > > > absolute values relative to other vendors.
>
> > > > If you expect these to need frequent tuning, then don't put them in DT.
>
> > > I agree strongly. Putting what are essentially tuning numbers for the
> > > system into the ABI is going to lead us into a mess long term since if
> > > we change anything related to the performance of the system the numbers
> > > may become invalid and we've no real way of recovering sensible
> > > information.
>
> > I'm not entirely getting here why you consider capacity values to be
> > tunables. As part of the EAS effort, we are proposing ways in which users
>
> The purpose of the capacity values is to influence the scheduler
> behaviour and hence performance. Without a concrete definition they're
> just magic numbers which have meaining only in terms of their effect on
> the performance of the system. That is a sufficiently complex outcome
> to ensure that there will be an element of taste in what the desired
> outcomes are. Sounds like tuneables to me.
>

Capacity values are meant to describe asymmetry (if any) of the system
CPUs to the scheduler. The scheduler can then use this additional bit of
information to try to make better scheduling decisions. Yes, having these
values available will end up giving you better performance, but I guess
this applies to any information we provide to the kernel (and scheduler);
the less dumb a subsystem is, the better we can make it work.

> > should be able to fine tune their system as needed, when required
> > (don't know if you had a chance to have a look at the SchedTune posting
> > back in August for example [1]). This patch tries to only standardize
> > where do we get default values from and how we specify them. I'm not
> > seeing them changing much after an initial benchmarking phase has been
> > done. Tuning should happen using different methods, not by changing
> > these values, IMHO.
>
> If you are saying people should use other, more sensible, ways of
> specifying the final values that actually get used in production then
> why take the defaults from direct numbers DT in the first place? If you
> are saying that people should tune and then put the values in here then
> that's problematic for the reasons I outlined.
>

IMHO, people should come up with default values that describe
heterogeneity in their system. Then use other ways to tune the system at
run time (depending on the workload maybe).

As said, I understand your concerns; but what I still don't get is how
CPU capacity values are so different from, say, idle states
min-residency-us. AFAIK there is a per-SoC benchmarking phase required
to come up with those values as well; you have to pick some benchmark
that stresses worst case entry/exit while measuring energy, then make
calculations that tell you when it is wise to enter a particular idle
state. Ideally we should derive min residency from specs, but I'm not
sure that is how it works in practice.

> > > It would be better to have the DT describe concrete physical properties
> > > of the system which we can then map onto numbers we like, that way if we
> > > get better information in future or just decide that completely
> > > different metrics are appropriate for tuning we can just do that without
> > > having to worry about translating the old metrics into new ones. We can
> > > then expose the tuning knobs to userspace for override if that's needed.
> > > If doing system specific tuning on vertically integrated systems really
> > > is terribly important it's not going to matter too much where the tuning
> > > is but we also have to consider more general purpose systems.
>
> > As replied to Rob, I'm not sure it is so easy to find any physical
> > property that expresses what we essentially need (without maybe relying
> > on a complex mix of hardware details and a model to extract numbers from
> > them). Instead, we propose to have reasonable, per SoC, default numbers;
> > and then let users fine tune their platform afterwards, without changing
> > those default values.
>
> If users are supposed to do fine tuning elsewhere after the fact why
> bother with this initial callibration? Something that's ballpark good
> enough like just knowing the core used and perhaps some important
> options on it should give an adequate starting point and not have the
> issues with having the tuning numbers present as magic numbers. Perhaps
> we might also feed cache information in at some point. If in future
> we're able to improve those default numbers (or just adapt at runtime)
> then even better.
>
> It also seems a bit strange to expect people to do some tuning in one
> place initially and then additional tuning somewhere else later, from
> a user point of view I'd expect to always do my tuning in the same
> place.
>

I think that runtime tuning needs are much more complex and finer
grained than what you can achieve by playing with CPU capacities.
And I agree with you, users should only play with these other methods
I'm referring to; they should not mess around with platform description
bits. They should provide information about runtime needs, then the
scheduler (in this case) will do its best to give them acceptable
performance using improved knowledge about the platform.

> > > We're not going to get out of having to pick numbers at some point,
> > > pushing them into DT doesn't get us out of that but it does make the
> > > situation harder to manage long term and makes the performance for the
> > > general user less relaible. It's also just more work all round,
> > > everyone doing the DT for a SoC is going to have to do some combination
> > > of cargo culting or repeating the callibration.
>
> > I'm most probably a bit naive here, but I see the calibration phase
> > happening only once, after the platform is stable. You get default
> > capacity values by running a pretty simple benchmark on a fixed
> > configuration; and you put them somewhere (DTs still seem to be a
> > sensible place to me). Then you'll be able to suit tuning needs using
> > different interfaces.
>
> My point is that everyone making any kind of SoC with asymmetries is
> expected to go and do some kind of callibration based on some unclear
> criteria, if these are just ballpark accurate starting points that seems
> like wasted effort - the kernel should be making a reasonable effort to
> do something sensible without this information which is going to be less
> effort all round. It doesn't need to wait for real silicon (this seems
> like the sort of core bit of DT which will be being written pre-tapeout)
> and doesn't have marketing implications.
>
> Doing that and then switching to some other interface for real tuning
> seems especially odd and I'm not sure that's something that users are
> going to expect or understand.

As I'm saying above, users should not care about this first step of
platform description; no more than they care about other bits
in DTs that describe their platform.

Thanks,

- Juri

2015-12-14 16:59:51

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
> On 11/12/15 17:49, Mark Brown wrote:

> > The purpose of the capacity values is to influence the scheduler
> > behaviour and hence performance. Without a concrete definition they're
> > just magic numbers which have meaining only in terms of their effect on
> > the performance of the system. That is a sufficiently complex outcome
> > to ensure that there will be an element of taste in what the desired
> > outcomes are. Sounds like tuneables to me.

> Capacity values are meant to describe asymmetry (if any) of the system
> CPUs to the scheduler. The scheduler can then use this additional bit of
> information to try to do better scheduling decisions. Yes, having these
> values available will end up giving you better performance, but I guess
> this apply to any information we provide to the kernel (and scheduler);
> the less dumb a subsystem is, the better we can make it work.

This information is a magic number; there's never going to be a right
answer. If it needs changing it's not like the kernel is modeling a
concrete thing like the relative performance of the A53 and A57 poorly
or whatever, it's just that the relative values of number A and number B
are not what the system integrator desires.

> > If you are saying people should use other, more sensible, ways of
> > specifying the final values that actually get used in production then
> > why take the defaults from direct numbers DT in the first place? If you
> > are saying that people should tune and then put the values in here then
> > that's problematic for the reasons I outlined.

> IMHO, people should come up with default values that describe
> heterogeneity in their system. Then use other ways to tune the system at
> run time (depending on the workload maybe).

My argument is that they should be describing the heterogeneity of their
system by describing concrete properties of their system rather than by
providing magic numbers.

> As said, I understand your concerns; but, what I don't still get is
> where CPU capacity values are so different from, say, idle states
> min-residency-us. AFAIK there is a per-SoC benchmarking phase required
> to come up with that values as well; you have to pick some benchmark
> that stresses worst case entry/exit while measuring energy, then make
> calculations that tells you when it is wise to enter a particular idle
> state. Ideally we should derive min residency from specs, but I'm not
> sure is how it works in practice.

Those at least have a concrete physical value that it is possible to
measure in a describable way that is unlikely to change based on the
internals of the kernel. It would be kind of nice to have the broken
down numbers for entry time, exit time and power burn in suspend but
it's not clear it's worth the bother. It's also one of these things
where we don't have any real proxies that get us anywhere in the
ballpark of where we want to be.

> > It also seems a bit strange to expect people to do some tuning in one
> > place initially and then additional tuning somewhere else later, from
> > a user point of view I'd expect to always do my tuning in the same
> > place.

> I think that runtime tuning needs are much more complex and have finer
> grained needs than what you can achieve by playing with CPU capacities.
> And I agree with you, users should only play with these other methods
> I'm referring to; they should not mess around with platform description
> bits. They should provide information about runtime needs, then the
> scheduler (in this case) will do its best to give them acceptable
> performance using improved knowledge about the platform.

So then why isn't it adequate to just have things like the core types in
there and work from there? Are we really expecting the tuning to be so
much better than what it's possible to come up with that way, at the
level of accuracy we're expecting these values to have, that it's
worth just jumping straight to magic numbers?

> > Doing that and then switching to some other interface for real tuning
> > seems especially odd and I'm not sure that's something that users are
> > going to expect or understand.

> As I'm saying above, users should not care about this first step of
> platform description; not more than how much they care about other bits
> in DTs that describe their platform.

That may be your intention but I don't see how it is realistic to expect
that this is what people will actually understand. It's a number, it
has an effect and it's hard to see that people won't tune it, it's not
like people don't have to edit DTs during system integration. People
won't reliably read documentation or look in mailing list threads and
other than that it has all the properties of a tuning interface.

There's a tension here between what you're saying about people not being
supposed to care much about the numbers for tuning and the very fact
that there's a need for the DT to carry explicit numbers.



2015-12-15 12:22:45

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On 14/12/15 16:59, Mark Brown wrote:
> On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
> > On 11/12/15 17:49, Mark Brown wrote:
>
> > > The purpose of the capacity values is to influence the scheduler
> > > behaviour and hence performance. Without a concrete definition they're
> > > just magic numbers which have meaining only in terms of their effect on
> > > the performance of the system. That is a sufficiently complex outcome
> > > to ensure that there will be an element of taste in what the desired
> > > outcomes are. Sounds like tuneables to me.
>
> > Capacity values are meant to describe asymmetry (if any) of the system
> > CPUs to the scheduler. The scheduler can then use this additional bit of
> > information to try to do better scheduling decisions. Yes, having these
> > values available will end up giving you better performance, but I guess
> > this apply to any information we provide to the kernel (and scheduler);
> > the less dumb a subsystem is, the better we can make it work.
>
> This information is a magic number, there's never going to be a right
> answer. If it needs changing it's not like the kernel is modeling a
> concrete thing like the relative performance of the A53 and A57 poorly
> or whatever, it's just that the relative values of number A and number B
> are not what the system integrator desires.
>
> > > If you are saying people should use other, more sensible, ways of
> > > specifying the final values that actually get used in production then
> > > why take the defaults from direct numbers DT in the first place? If you
> > > are saying that people should tune and then put the values in here then
> > > that's problematic for the reasons I outlined.
>
> > IMHO, people should come up with default values that describe
> > heterogeneity in their system. Then use other ways to tune the system at
> > run time (depending on the workload maybe).
>
> My argument is that they should be describing the hetrogeneity of their
> system by describing concrete properties of their system rather than by
> providing magic numbers.
>
> > As said, I understand your concerns; but, what I don't still get is
> > where CPU capacity values are so different from, say, idle states
> > min-residency-us. AFAIK there is a per-SoC benchmarking phase required
> > to come up with that values as well; you have to pick some benchmark
> > that stresses worst case entry/exit while measuring energy, then make
> > calculations that tells you when it is wise to enter a particular idle
> > state. Ideally we should derive min residency from specs, but I'm not
> > sure is how it works in practice.
>
> Those at least have a concrete physical value that it is possible to
> measure in a describable way that is unlikely to change based on the
> internals of the kernel. It would be kind of nice to have the broken
> down numbers for entry time, exit time and power burn in suspend but
> it's not clear it's worth the bother. It's also one of these things
> where we don't have any real proxies that get us anywhere in the
> ballpark of where we want to be.
>

I'm proposing to add a new value because I couldn't find any proxies in
the current bindings that bring us anywhere close to what we need. If I
failed in looking for them, and they actually exist, I'll personally be
more than happy to just rely on them instead of adding more stuff :-).

Interestingly, to me it sounds like we could actually use your first
paragraph above almost as it is to describe how to come up with capacity
values. In the documentation I put the following:

"One simple way to estimate CPU capacities is to iteratively run a
well-known CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on
each CPU at maximum frequency and then normalize values w.r.t. the best
performing CPU."

I don't see why this should change if we decide that the scheduler has
to change in the future.
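
Just to make the intent concrete, the normalization step described there
boils down to something like the sketch below (illustrative userspace
code only, not part of the series; the raw scores are made-up numbers
and SCHED_CAPACITY_SCALE is assumed as the reference value, even though
the binding leaves the actual scale to the capacity-scale property):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * Normalize per-CPU benchmark scores so that the best performing CPU
 * ends up at SCHED_CAPACITY_SCALE and the others are scaled relative
 * to it. The scores below are hypothetical.
 */
int main(void)
{
        unsigned long score[] = { 4200, 4150, 1800, 1790 };
        int nr_cpus = sizeof(score) / sizeof(score[0]);
        unsigned long best = 0;
        int i;

        for (i = 0; i < nr_cpus; i++)
                if (score[i] > best)
                        best = score[i];

        for (i = 0; i < nr_cpus; i++)
                printf("cpu%d capacity = %lu\n", i,
                       (score[i] * SCHED_CAPACITY_SCALE) / best);

        return 0;
}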

Also, looking again at section 2 of idle-states bindings docs, we have a
nice and accurate description of what min-residency is, but not much
info about how we can actually measure that. Maybe, expanding the docs
section regarding CPU capacity could help?

> > > It also seems a bit strange to expect people to do some tuning in one
> > > place initially and then additional tuning somewhere else later, from
> > > a user point of view I'd expect to always do my tuning in the same
> > > place.
>
> > I think that runtime tuning needs are much more complex and have finer
> > grained needs than what you can achieve by playing with CPU capacities.
> > And I agree with you, users should only play with these other methods
> > I'm referring to; they should not mess around with platform description
> > bits. They should provide information about runtime needs, then the
> > scheduler (in this case) will do its best to give them acceptable
> > performance using improved knowledge about the platform.
>
> So then why isn't it adequate to just have things like the core types in
> there and work from there? Are we really expecting the tuning to be so
> much better than it's possible to come up with something that's so much
> better on the scale that we're expecting this to be accurate that it's
> worth just jumping straight to magic numbers?
>

I take your point here that having fine grained values might not really
give us appreciable differences (that is also why I proposed the
capacity-scale in the first instance), but I'm not sure I'm getting what
you are proposing here.

Today, and for arm only, we have a static table representing CPUs
"efficiency":

/*
* Table of relative efficiency of each processors
* The efficiency value must fit in 20bit and the final
* cpu_scale value must be in the range
* 0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
* in order to return at most 1 when DIV_ROUND_CLOSEST
* is used to compute the capacity of a CPU.
* Processors that are not defined in the table,
* use the default SCHED_CAPACITY_SCALE value for cpu_scale.
*/
static const struct cpu_efficiency table_efficiency[] = {
{"arm,cortex-a15", 3891},
{"arm,cortex-a7", 2048},
{NULL, },
};

When the clock-frequency property is defined in DT, we try to find a
match for the compatible string in the table above and then use the
associated number to compute the capacity. Are you proposing to have
something like this for arm64 as well?
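
For reference, what the arm32 code does with those values boils down to
roughly the following (a simplified, self-contained sketch, not the
actual arch/arm/kernel/topology.c code, which additionally rescales
everything around a "middle" capacity):

#include <stdio.h>
#include <string.h>

#define SCHED_CAPACITY_SCALE 1024

struct cpu_efficiency {
        const char *compatible;
        unsigned long efficiency;
};

static const struct cpu_efficiency table_efficiency[] = {
        {"arm,cortex-a15", 3891},
        {"arm,cortex-a7", 2048},
        {NULL, },
};

/* Combine the per-uarch efficiency with the DT clock-frequency value. */
static unsigned long raw_capacity(const char *compatible,
                                  unsigned long freq_hz)
{
        const struct cpu_efficiency *e;
        unsigned long eff = SCHED_CAPACITY_SCALE; /* default: no match */

        for (e = table_efficiency; e->compatible; e++)
                if (!strcmp(e->compatible, compatible)) {
                        eff = e->efficiency;
                        break;
                }

        return (freq_hz >> 20) * eff; /* later rescaled towards 1024 */
}

int main(void)
{
        printf("A15 @ 1.2GHz: %lu\n",
               raw_capacity("arm,cortex-a15", 1200000000UL));
        printf("A7  @ 1.0GHz: %lu\n",
               raw_capacity("arm,cortex-a7", 1000000000UL));
        return 0;
}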

BTW, the only info I could find about those numbers is from this thread

http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/104072.html

Vincent, do we have more precise information about these numbers
somewhere else?

If I understand how that table was created, how do we think we will
extend it in the future to allow newer core types (say we replicate this
solution for arm64)? It seems that we have to change it, rescaling
values, each time we have a new core on the market. How can we come up
with relative numbers, in the future, comparing newer cores to old ones
(that might be already out of the market by that time)?

> > > Doing that and then switching to some other interface for real tuning
> > > seems especially odd and I'm not sure that's something that users are
> > > going to expect or understand.
>
> > As I'm saying above, users should not care about this first step of
> > platform description; not more than how much they care about other bits
> > in DTs that describe their platform.
>
> That may be your intention but I don't see how it is realistic to expect
> that this is what people will actually understand. It's a number, it
> has an effect and it's hard to see that people won't tune it, it's not
> like people don't have to edit DTs during system integration. People
> won't reliably read documentation or look in mailing list threads and
> other that that it has all the properties of a tuning interface.
>

Eh, sad but true. I guess we can, as we usually do, put more effort
into documenting how things are supposed to be used. Then, if people
think that they can make their system perform better without looking at
the documentation or asking around, I'm not sure there is much we can do
to prevent them from doing things wrong. There are already lots of
things people shouldn't touch if they don't know what they are doing. :-/

> There's a tension here between what you're saying about people not being
> supposed to care much about the numbers for tuning and the very fact
> that there's a need for the DT to carry explicit numbers.

My point is that people with tuning needs shouldn't even look at DTs,
but put all their efforts in describing (using appropriate APIs) their
needs and how they apply to the workload they care about. Our job is to
put together information coming from users and knowledge of system
configuration to provide people the desired outcomes.

Best,

- Juri

2015-12-15 13:40:27

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 12:22:38PM +0000, Juri Lelli wrote:

> I'm proposing to add a new value because I couldn't find any proxies in
> the current bindings that bring us any close to what we need. If I
> failed in looking for them, and they actually exists, I'll personally be
> more then happy to just rely on them instead of adding more stuff :-).

Well, the first pass is going to be the core types (possibly with some
other properties if there's interesting parameters people can tweak in
integration).

> Interestingly, to me it sounds like we could actually use your first
> paragraph above almost as it is to describe how to come up with capacity
> values. In the documentation I put the following:

> "One simple way to estimate CPU capacities is to iteratively run a
> well-known CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on
> each CPU at maximum frequency and then normalize values w.r.t. the best
> performing CPU."

> I don't see why this should change if we decide that the scheduler has
> to change in the future.

You'd at least need to pick a particular benchmark there...

> Also, looking again at section 2 of idle-states bindings docs, we have a
> nice and accurate description of what min-residency is, but not much
> info about how we can actually measure that. Maybe, expanding the docs
> section regarding CPU capacity could help?

I'm dubious about this to be honest, if only on the basis of people
reading the docs in the first place. It also seems that if we're really
just talking about some CPU microbenchmark then forcing every
implementor to do the benchmark and scaling is at best going to burn
people's time and be error prone since it doesn't seem so likely to have
dramatic system integration variation.

> > So then why isn't it adequate to just have things like the core types in
> > there and work from there? Are we really expecting the tuning to be so
> > much better than it's possible to come up with something that's so much
> > better on the scale that we're expecting this to be accurate that it's
> > worth just jumping straight to magic numbers?

> I take your point here that having fine grained values might not really
> give us appreciable differences (that is also why I proposed the
> capacity-scale in the first instance), but I'm not sure I'm getting what
> you are proposing here.

Something like the existing solution for arm32.

> static const struct cpu_efficiency table_efficiency[] = {
> {"arm,cortex-a15", 3891},
> {"arm,cortex-a7", 2048},
> {NULL, },
> };

> When clock-frequency property is defined in DT, we try to find a match
> for the compatibility string in the table above and then use the
> associate number to compute the capacity. Are you proposing to have
> something like this for arm64 as well?

> BTW, the only info I could find about those numbers is from this thread

It was discussed in some other thread when I was sending the equivalent
stuff for arm64 (I never got round to finishing it off due to issues
with Catalin and Will being concerned about the specific numbers).
Vincent confirmed that the numbers came from the (IIRC) DMIPS/MHz
numbers that ARM publish for the cores. I'd independently done the same
thing for arm64. It would probably help to put comments in there with
the base numbers before scaling, or just redo the table in terms of the
raw numbers.

This is, of course, an example of my concerns about magic number
configuration.
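
To make the annotation idea concrete, the table could simply carry its
provenance in comments, something like the sketch below (the base
figures here are only back-computed from the existing entries, not
taken from any datasheet, so treat them as placeholders):

static const struct cpu_efficiency table_efficiency[] = {
        /* ~1.9x the A7 entry, presumably from the published per-core figures */
        {"arm,cortex-a15", 3891},
        /* baseline entry, chosen as 2 * SCHED_CAPACITY_SCALE */
        {"arm,cortex-a7", 2048},
        {NULL, },
};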

> If I understand how that table was created, how do we think we will
> extend it in the future to allow newer core types (say we replicate this
> solution for arm64)? It seems that we have to change it, rescaling
> values, each time we have a new core on the market. How can we come up
> with relative numbers, in the future, comparing newer cores to old ones
> (that might be already out of the market by that time)?

It doesn't seem particularly challenging to add new numbers to the table
(and add additional properties to select on) TBH. We can either rescale
by hand in the table when adding entries, script it as part of the
kernel build or do it at runtime (as the arm32 code already does to an
extent based on the particular set of cores we find). What difficulties
do you see with this?

This is something that seems like an advantage to me - we can just
replace everything at any point, we're not tied to trusting the golden
benchmark someone did (or tweaked) if we come up with a better
methodology later on.

> Eh, sad but true. I guess we can, as we usually do, put more effort in
> documenting how things are supposed to be used. Then, if people think
> that they can make their system perform better without looking at
> documentation or asking around, I'm not sure there is much we could do
> to prevent them to do things wrong. There are already lot of things
> people shouldn't touch if they don't know what they are doing. :-/

The trouble with making people specify this in DT is that it becomes a
parameter that someone *has* to tweak at some point.

> > There's a tension here between what you're saying about people not being
> > supposed to care much about the numbers for tuning and the very fact
> > that there's a need for the DT to carry explicit numbers.

> My point is that people with tuning needs shouldn't even look at DTs,
> but put all their efforts in describing (using appropriate APIs) their
> needs and how they apply to the workload they care about. Our job is to
> put together information coming from users and knowledge of system
> configuration to provide people the desired outcomes.

You can't do a system integration for a smartphone or embedded board
without configuring things in DT, people integrating for those systems
are already looking at DT and they are the main current targets for
heterogeneous systems I'm aware of. The people who don't need to look
at DT are mainly the enterprise type people but a lot of them won't be
able to use this as-is anyway since they'll be using ACPI.



2015-12-15 13:55:36

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On 15 December 2015 at 13:22, Juri Lelli <[email protected]> wrote:
> On 14/12/15 16:59, Mark Brown wrote:
>> On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
>> > On 11/12/15 17:49, Mark Brown wrote:
>>
>> > > The purpose of the capacity values is to influence the scheduler
>> > > behaviour and hence performance. Without a concrete definition they're
>> > > just magic numbers which have meaining only in terms of their effect on
>> > > the performance of the system. That is a sufficiently complex outcome
>> > > to ensure that there will be an element of taste in what the desired
>> > > outcomes are. Sounds like tuneables to me.
>>
>> > Capacity values are meant to describe asymmetry (if any) of the system
>> > CPUs to the scheduler. The scheduler can then use this additional bit of
>> > information to try to do better scheduling decisions. Yes, having these
>> > values available will end up giving you better performance, but I guess
>> > this apply to any information we provide to the kernel (and scheduler);
>> > the less dumb a subsystem is, the better we can make it work.
>>
>> This information is a magic number, there's never going to be a right
>> answer. If it needs changing it's not like the kernel is modeling a
>> concrete thing like the relative performance of the A53 and A57 poorly
>> or whatever, it's just that the relative values of number A and number B
>> are not what the system integrator desires.
>>
>> > > If you are saying people should use other, more sensible, ways of
>> > > specifying the final values that actually get used in production then
>> > > why take the defaults from direct numbers DT in the first place? If you
>> > > are saying that people should tune and then put the values in here then
>> > > that's problematic for the reasons I outlined.
>>
>> > IMHO, people should come up with default values that describe
>> > heterogeneity in their system. Then use other ways to tune the system at
>> > run time (depending on the workload maybe).
>>
>> My argument is that they should be describing the hetrogeneity of their
>> system by describing concrete properties of their system rather than by
>> providing magic numbers.
>>
>> > As said, I understand your concerns; but, what I don't still get is
>> > where CPU capacity values are so different from, say, idle states
>> > min-residency-us. AFAIK there is a per-SoC benchmarking phase required
>> > to come up with that values as well; you have to pick some benchmark
>> > that stresses worst case entry/exit while measuring energy, then make
>> > calculations that tells you when it is wise to enter a particular idle
>> > state. Ideally we should derive min residency from specs, but I'm not
>> > sure is how it works in practice.
>>
>> Those at least have a concrete physical value that it is possible to
>> measure in a describable way that is unlikely to change based on the
>> internals of the kernel. It would be kind of nice to have the broken
>> down numbers for entry time, exit time and power burn in suspend but
>> it's not clear it's worth the bother. It's also one of these things
>> where we don't have any real proxies that get us anywhere in the
>> ballpark of where we want to be.
>>
>
> I'm proposing to add a new value because I couldn't find any proxies in
> the current bindings that bring us any close to what we need. If I
> failed in looking for them, and they actually exists, I'll personally be
> more then happy to just rely on them instead of adding more stuff :-).
>
> Interestingly, to me it sounds like we could actually use your first
> paragraph above almost as it is to describe how to come up with capacity
> values. In the documentation I put the following:
>
> "One simple way to estimate CPU capacities is to iteratively run a
> well-known CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on
> each CPU at maximum frequency and then normalize values w.r.t. the best
> performing CPU."
>
> I don't see why this should change if we decide that the scheduler has
> to change in the future.
>
> Also, looking again at section 2 of idle-states bindings docs, we have a
> nice and accurate description of what min-residency is, but not much
> info about how we can actually measure that. Maybe, expanding the docs
> section regarding CPU capacity could help?
>
>> > > It also seems a bit strange to expect people to do some tuning in one
>> > > place initially and then additional tuning somewhere else later, from
>> > > a user point of view I'd expect to always do my tuning in the same
>> > > place.
>>
>> > I think that runtime tuning needs are much more complex and have finer
>> > grained needs than what you can achieve by playing with CPU capacities.
>> > And I agree with you, users should only play with these other methods
>> > I'm referring to; they should not mess around with platform description
>> > bits. They should provide information about runtime needs, then the
>> > scheduler (in this case) will do its best to give them acceptable
>> > performance using improved knowledge about the platform.
>>
>> So then why isn't it adequate to just have things like the core types in
>> there and work from there? Are we really expecting the tuning to be so
>> much better than it's possible to come up with something that's so much
>> better on the scale that we're expecting this to be accurate that it's
>> worth just jumping straight to magic numbers?
>>
>
> I take your point here that having fine grained values might not really
> give us appreciable differences (that is also why I proposed the
> capacity-scale in the first instance), but I'm not sure I'm getting what
> you are proposing here.
>
> Today, and for arm only, we have a static table representing CPUs
> "efficiency":
>
> /*
> * Table of relative efficiency of each processors
> * The efficiency value must fit in 20bit and the final
> * cpu_scale value must be in the range
> * 0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
> * in order to return at most 1 when DIV_ROUND_CLOSEST
> * is used to compute the capacity of a CPU.
> * Processors that are not defined in the table,
> * use the default SCHED_CAPACITY_SCALE value for cpu_scale.
> */
> static const struct cpu_efficiency table_efficiency[] = {
> {"arm,cortex-a15", 3891},
> {"arm,cortex-a7", 2048},
> {NULL, },
> };
>
> When clock-frequency property is defined in DT, we try to find a match
> for the compatibility string in the table above and then use the
> associate number to compute the capacity. Are you proposing to have
> something like this for arm64 as well?
>
> BTW, the only info I could find about those numbers is from this thread
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/104072.html
>
> Vincent, do we have more precise information about these numbers
> somewhere else?

These numbers come from a document from ARM in which they compared the
A15 and the A7. I just used the numbers provided by that white paper and
scaled them into a more appropriate range than DMIPS/MHz.

>
> If I understand how that table was created, how do we think we will
> extend it in the future to allow newer core types (say we replicate this
> solution for arm64)? It seems that we have to change it, rescaling
> values, each time we have a new core on the market. How can we come up
> with relative numbers, in the future, comparing newer cores to old ones
> (that might be already out of the market by that time)?
>
>> > > Doing that and then switching to some other interface for real tuning
>> > > seems especially odd and I'm not sure that's something that users are
>> > > going to expect or understand.
>>
>> > As I'm saying above, users should not care about this first step of
>> > platform description; not more than how much they care about other bits
>> > in DTs that describe their platform.
>>
>> That may be your intention but I don't see how it is realistic to expect
>> that this is what people will actually understand. It's a number, it
>> has an effect and it's hard to see that people won't tune it, it's not
>> like people don't have to edit DTs during system integration. People
>> won't reliably read documentation or look in mailing list threads and
>> other that that it has all the properties of a tuning interface.
>>
>
> Eh, sad but true. I guess we can, as we usually do, put more effort in
> documenting how things are supposed to be used. Then, if people think
> that they can make their system perform better without looking at
> documentation or asking around, I'm not sure there is much we could do
> to prevent them to do things wrong. There are already lot of things
> people shouldn't touch if they don't know what they are doing. :-/
>
>> There's a tension here between what you're saying about people not being
>> supposed to care much about the numbers for tuning and the very fact
>> that there's a need for the DT to carry explicit numbers.
>
> My point is that people with tuning needs shouldn't even look at DTs,
> but put all their efforts in describing (using appropriate APIs) their
> needs and how they apply to the workload they care about. Our job is to
> put together information coming from users and knowledge of system
> configuration to provide people the desired outcomes.
>
> Best,
>
> - Juri

2015-12-15 14:01:55

by Mark Rutland

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 01:39:51PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 12:22:38PM +0000, Juri Lelli wrote:
> > > So then why isn't it adequate to just have things like the core types in
> > > there and work from there? Are we really expecting the tuning to be so
> > > much better than it's possible to come up with something that's so much
> > > better on the scale that we're expecting this to be accurate that it's
> > > worth just jumping straight to magic numbers?
>
> > I take your point here that having fine grained values might not really
> > give us appreciable differences (that is also why I proposed the
> > capacity-scale in the first instance), but I'm not sure I'm getting what
> > you are proposing here.
>
> Something like the existing solution for arm32.
>
> > static const struct cpu_efficiency table_efficiency[] = {
> > {"arm,cortex-a15", 3891},
> > {"arm,cortex-a7", 2048},
> > {NULL, },
> > };
>
> > When clock-frequency property is defined in DT, we try to find a match
> > for the compatibility string in the table above and then use the
> > associate number to compute the capacity. Are you proposing to have
> > something like this for arm64 as well?
>
> > BTW, the only info I could find about those numbers is from this thread
>
> It was discussed in some other thread when I was sending the equivalent
> stuff for arm64 (I never got round to finishing it off due to issues
> with Catalin and Will being concerned about the specific numbers).
> Vincent confirmed that the numbers came from the (IIRC) DMIPS/MHz
> numbers that ARM publish for the cores. I'd independently done the same
> thing for arm64. It would probably help to put comments in there with
> the base numbers before scaling, or just redo the table in terms of the
> raw numbers.
>
> This is, of course, an example of my concerns about magic number
> configuration.
>
> > If I understand how that table was created, how do we think we will
> > extend it in the future to allow newer core types (say we replicate this
> > solution for arm64)? It seems that we have to change it, rescaling
> > values, each time we have a new core on the market. How can we come up
> > with relative numbers, in the future, comparing newer cores to old ones
> > (that might be already out of the market by that time)?
>
> It doesn't seem particularly challenging to add new numbers to the table
> (and add additional properties to select on) TBH. We can either rescale
> by hand in the table when adding entries, script it as part of the
> kernel build or do it at runtime (as the arm32 code already does to an
> extent based on the particular set of cores we find). What difficulties
> do you see with this?
>
> This is something that seems like an advantage to me - we can just
> replace everything at any point, we're not tied to trusting the golden
> benchmark someone did (or tweaked) if we come up with a better
> methodology later on.

I really don't want to see a table of magic numbers in the kernel.

The relative performance and efficiency of cores will vary depending on
uArch-specific configuration (e.g. sizing of L1/L2 caches) in addition
to general uArch differences, and integration too (e.g. if the memory
system gives priority to one cluster over another for whatever reason).
I've heard of pseudo-heterogeneous platforms with different
configuration of the same uArch across clusters.

We also don't necessarily have the CPU clock frequencies, or the ability
to scale them. Maybe we simply give up in that case, though.

If we cannot rely on external information, and want this information to
be derived by the kernel, then we need to perform some dynamic
benchmark. That would work for future CPUs the kernel knows nothing
about yet, and would cater for the pseudo-heterogeneous cases too.

Thanks,
Mark.

2015-12-15 14:25:07

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

Hi Mark,

On 15/12/15 14:01, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 01:39:51PM +0000, Mark Brown wrote:
> > On Tue, Dec 15, 2015 at 12:22:38PM +0000, Juri Lelli wrote:
> > > > So then why isn't it adequate to just have things like the core types in
> > > > there and work from there? Are we really expecting the tuning to be so
> > > > much better than it's possible to come up with something that's so much
> > > > better on the scale that we're expecting this to be accurate that it's
> > > > worth just jumping straight to magic numbers?
> >
> > > I take your point here that having fine grained values might not really
> > > give us appreciable differences (that is also why I proposed the
> > > capacity-scale in the first instance), but I'm not sure I'm getting what
> > > you are proposing here.
> >
> > Something like the existing solution for arm32.
> >
> > > static const struct cpu_efficiency table_efficiency[] = {
> > > {"arm,cortex-a15", 3891},
> > > {"arm,cortex-a7", 2048},
> > > {NULL, },
> > > };
> >
> > > When clock-frequency property is defined in DT, we try to find a match
> > > for the compatibility string in the table above and then use the
> > > associate number to compute the capacity. Are you proposing to have
> > > something like this for arm64 as well?
> >
> > > BTW, the only info I could find about those numbers is from this thread
> >
> > It was discussed in some other thread when I was sending the equivalent
> > stuff for arm64 (I never got round to finishing it off due to issues
> > with Catalin and Will being concerned about the specific numbers).
> > Vincent confirmed that the numbers came from the (IIRC) DMIPS/MHz
> > numbers that ARM publish for the cores. I'd independently done the same
> > thing for arm64. It would probably help to put comments in there with
> > the base numbers before scaling, or just redo the table in terms of the
> > raw numbers.
> >
> > This is, of course, an example of my concerns about magic number
> > configuration.
> >
> > > If I understand how that table was created, how do we think we will
> > > extend it in the future to allow newer core types (say we replicate this
> > > solution for arm64)? It seems that we have to change it, rescaling
> > > values, each time we have a new core on the market. How can we come up
> > > with relative numbers, in the future, comparing newer cores to old ones
> > > (that might be already out of the market by that time)?
> >
> > It doesn't seem particularly challenging to add new numbers to the table
> > (and add additional properties to select on) TBH. We can either rescale
> > by hand in the table when adding entries, script it as part of the
> > kernel build or do it at runtime (as the arm32 code already does to an
> > extent based on the particular set of cores we find). What difficulties
> > do you see with this?
> >
> > This is something that seems like an advantage to me - we can just
> > replace everything at any point, we're not tied to trusting the golden
> > benchmark someone did (or tweaked) if we come up with a better
> > methodology later on.
>
> I really don't want to see a table of magic numbers in the kernel.
>

It doesn't seem a clean and scalable solution to me either. It is not
easy to reconfigure when new core types come around, as I don't think
relative data is always present or easy to derive, and it exposes some
sort of centralized global information where everyone is compared
against everyone. The DT solution, by contrast, is inherently per
platform: no need to expose absolute values and no problems with knowing
data regarding old core types.

> The relative performance and efficiency of cores will vary depending on
> uArch-specific configuration (e.g. sizing of L1/L2 caches) in addition
> to general uArch differences, and integration too (e.g. if the memory
> system gives priority to one cluster over another for whatever reason).
> I've heard of pseudo-heterogeneous platforms with different
> configuration of the same uArch across clusters.
>
> We also don't necessarily have the CPU clock frequencies, or the ability
> to scale them. Maybe we simply give up in that case, though.
>
> If we cannot rely on external information, and want this information to
> be derived by the kernel, then we need to perform some dynamic
> benchmark. That would work for future CPUs the kernel knows nothing
> about yet, and would cater for the pseudo-heterogeneous cases too.
>

I've actually experimented a bit with this approach already, but I wasn't
convinced of its viability. It is true that we remove the burden of
coming up with default values from the user/integrator, but I'm pretty
sure we will end up endlessly discussing which particular benchmark to
pick, and the fact that it impacts boot time and such.

Best,

- Juri

2015-12-15 14:51:23

by Mark Rutland

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 02:24:58PM +0000, Juri Lelli wrote:
> Hi Mark,

Hi Juri,

> On 15/12/15 14:01, Mark Rutland wrote:
> > I really don't want to see a table of magic numbers in the kernel.
>
> Doesn't seem to be a clean and scalable solution to me either. It is not
> easy to reconfigure when new core types come around, as I don't think
> relative data is always present or easy to derive, and it exposes some
> sort of centralized global information where everyone is compared
> against everyone.

I'm also concerned that it will be difficult to curate this to avoid
deceptive marketing numbers. These may not reflect reality.

> Where the DT solution is inherently per platform: no need to expose
> absolute values and no problems with knowing data regarding old core
> types.

The DT approach certainly avoids tying the kernel to a given idea of
particular microarchitectures.

> > If we cannot rely on external information, and want this information to
> > be derived by the kernel, then we need to perform some dynamic
> > benchmark. That would work for future CPUs the kernel knows nothing
> > about yet, and would cater for the pseudo-heterogeneous cases too.
>
> I've actually experimented a bit with this approch already, but I wasn't
> convinced of its viability. It is true that we remove the burden of
> coming up with default values from user/integrator, but I'm pretty sure
> we will end up discussing endlessly about which particular benchmark to
> pick

Regardless of which direction we go there will be endless discussion as
to the benchmark. As Mark pointed out, that happened in the case of the
table, and it's happening now for the DT approach.

I think we agree that if this is something we can change later (i.e. we
don't rely on an external oracle like DT) the particular benchmark
matters less, as that can be changed given evidence of superiority.

> and the fact that it impacts on boot time and such.

I was under the impression that the kernel already did RAID algorithm
benchmarking as part of the boot process. Maybe we can find a set of
similarly brief benchmarks.

Thanks,
Mark.

2015-12-15 15:08:50

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:

> I really don't want to see a table of magic numbers in the kernel.

Right, there's pitfalls there too although not being part of an ABI
does make them more manageable.

One thing it's probably helpful to establish here is how much the
specific numbers are going to matter in the grand scheme of things. If
the specific numbers *are* super important then nobody is going to want
to touch them as they'll be prone to getting tweaked. If instead the
numbers just need to be ballpark accurate so the scheduler starts off in
roughly the right place and the specific numbers don't matter it's a lot
easier and having a table in the kernel until we think of something
better (if that ever happens) gets a lot easier.

My expectation is that we just need good enough, not perfect, and that
seems to match what Juri is saying about the expectation that most of
the fine tuning is done via other knobs.

> The relative performance and efficiency of cores will vary depending on
> uArch-specific configuration (e.g. sizing of L1/L2 caches) in addition
> to general uArch differences, and integration too (e.g. if the memory
> system gives priority to one cluster over another for whatever reason).
> I've heard of pseudo-heterogeneous platforms with different
> configuration of the same uArch across clusters.

> We also don't necessarily have the CPU clock frequencies, or the ability
> to scale them. Maybe we simply give up in that case, though.

These variables all sound like the sort of thing we can get people to
put in the DT where they matter, and the more we talk about multiple
subtle variables feeding into a single number the more this starts to
sound like tuning (with all the problems that brings, especially when
you're into things like the behaviour with the clusters competing for
resources). Who knows, at some point the scheduler may even want to
directly act on some of these parameters?

> If we cannot rely on external information, and want this information to
> be derived by the kernel, then we need to perform some dynamic
> benchmark. That would work for future CPUs the kernel knows nothing
> about yet, and would cater for the pseudo-heterogeneous cases too.

Runtime benchmarking (or gradual tuning based on observed performance of
the scheduler) would work as well of course. I'd expect something like
that to be part of any end point we get to here, the more the system is
able to self tune the better.



2015-12-15 15:32:43

by Mark Rutland

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:
>
> > I really don't want to see a table of magic numbers in the kernel.
>
> Right, there's pitfalls there too although not being part of an ABI
> does make them more manageable.

I think that people are very likely to treat them exactly like an ABI,
w.r.t. any regressions in performance that result from their addition,
modification, or removal. That becomes really horrible when new CPUs
appear.

> One thing it's probably helpful to establish here is how much the
> specific numbers are going to matter in the grand scheme of things. If
> the specific numbers *are* super important then nobody is going to want
> to touch them as they'll be prone to getting tweaked. If instead the
> numbers just need to be ballpark accurate so the scheduler starts off in
> roughly the right place and the specific numbers don't matter it's a lot
> easier and having a table in the kernel until we think of something
> better (if that ever happens) gets a lot easier.

I agree that we first need to figure out the importance of these
numbers. I disagree that our first step should be to add a table.

> My expectation is that we just need good enough, not perfect, and that
> seems to match what Juri is saying about the expectation that most of
> the fine tuning is done via other knobs.

My expectation is that if a ballpark figure is good enough, it should be
possible to implement something trivial like bogomips / loop_per_jiffy
calculation.
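
Something of roughly that complexity could look like the userspace
sketch below (illustrative only, and note that the result still depends
on the frequency each CPU happens to run at while the loop executes):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Count trivial loop iterations completed in ~10ms on the given CPU. */
static unsigned long measure_cpu(int cpu)
{
        cpu_set_t set;
        struct timespec start, now;
        unsigned long iters = 0;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set))
                return 0;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                /* trivial busy work between clock checks */
                for (volatile unsigned long i = 0; i < 100000; i++)
                        ;
                iters += 100000;
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000000000L +
                 (now.tv_nsec - start.tv_nsec) < 10000000L);

        return iters;
}

int main(void)
{
        long nr = sysconf(_SC_NPROCESSORS_ONLN);
        int cpu;

        for (cpu = 0; cpu < nr; cpu++)
                printf("cpu%d: %lu loops in ~10ms\n", cpu, measure_cpu(cpu));
        return 0;
}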

Thanks,
Mark.

2015-12-15 15:36:45

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On 15/12/15 14:50, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 02:24:58PM +0000, Juri Lelli wrote:
> > Hi Mark,
>
> Hi Juri,
>
> > On 15/12/15 14:01, Mark Rutland wrote:
> > > I really don't want to see a table of magic numbers in the kernel.
> >
> > Doesn't seem to be a clean and scalable solution to me either. It is not
> > easy to reconfigure when new core types come around, as I don't think
> > relative data is always present or easy to derive, and it exposes some
> > sort of centralized global information where everyone is compared
> > against everyone.
>
> I'm also concerned that it will be difficult to curate this to avoid
> deceptive marketing numbers. These may not reflect reality.
>

Right.

> > Where the DT solution is inherently per platform: no need to expose
> > absolute values and no problems with knowing data regarding old core
> > types.
>
> The DT approach certainly avoids tying the kernel to a given idea of
> particular microarchitectures.
>
> > > If we cannot rely on external information, and want this information to
> > > be derived by the kernel, then we need to perform some dynamic
> > > benchmark. That would work for future CPUs the kernel knows nothing
> > > about yet, and would cater for the pseudo-heterogeneous cases too.
> >
> > I've actually experimented a bit with this approch already, but I wasn't
> > convinced of its viability. It is true that we remove the burden of
> > coming up with default values from user/integrator, but I'm pretty sure
> > we will end up discussing endlessly about which particular benchmark to
> > pick
>
> Regardless of which direction we go there will be endless discussion as
> to the benchmark. As Mark pointed out, that happened in the case of the
> table, and it's happening now for the DT approach.
>
> I think we agree that if this is something we can change later (i.e. we
> don't rely on an external oracle like DT) the particular benchmark
> matters less, as that can be changed given evidence of superiority.
>

True, and in fact we already only point out suggestions for sensible
benchmarks. I don't think we want to tie ourselves to any particular
one. And I think not having a particular benchmark implementation as
part of the solution will make it easier to change our minds afterwards.

> > and the fact that it impacts on boot time and such.
>
> I was under the impression that the kernel already did RAID algorithm
> benchmarking as part of the boot process. Maybe we can find a set of
> similarly brief benchmarks.
>

Yeah, it's certainly doable, but I don't yet really see what we gain
from this additional complexity. Also, AFAIK boot time is an important
metric for the mobile market (while it is less crucial for systems with
RAID configurations?), and we would be adding overhead there. Stability
of the default values is probably another factor here, given that you
could be rebooting your phone under many different conditions from time
to time.

Best,

- Juri

2015-12-15 15:46:57

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On 15/12/15 15:32, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:
> >
> > > I really don't want to see a table of magic numbers in the kernel.
> >
> > Right, there's pitfalls there too although not being part of an ABI
> > does make them more manageable.
>
> I think that people are very likely to treat them exactly like an ABI,
> w.r.t. any regressions in performance that result from their addition,
> modification, or removal. That becomes really horrible when new CPUs
> appear.
>

Yeah, and I guess the path towards out-of-tree patches changing these
values for a specific platform (without exposing the same changes
upstream) is not too far away.

> > One thing it's probably helpful to establish here is how much the
> > specific numbers are going to matter in the grand scheme of things. If
> > the specific numbers *are* super important then nobody is going to want
> > to touch them as they'll be prone to getting tweaked. If instead the
> > numbers just need to be ballpark accurate so the scheduler starts off in
> > roughly the right place and the specific numbers don't matter it's a lot
> > easier and having a table in the kernel until we think of something
> > better (if that ever happens) gets a lot easier.
>
> I agree that we first need to figure out the importance of these
> numbers. I disagree that our first step should be to add a table.
>

My take is that ballpark is fine, but it's a per platform/configuration
ballpark that we need. Not a per core-type one.

> > My expectation is that we just need good enough, not perfect, and that
> > seems to match what Juri is saying about the expectation that most of
> > the fine tuning is done via other knobs.
>
> My expectation is that if a ballpark figure is good enough, it should be
> possible to implement something trivial like bogomips / loop_per_jiffy
> calculation.
>

I didn't really follow that, so I might be wrong here, but hasn't there
already been a discussion about how we want to stop exposing bogomips
info or relying on it for anything but in-kernel delay loops?

Thanks,

- Juri

2015-12-15 15:57:56

by Mark Rutland

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> On 15/12/15 15:32, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > > My expectation is that we just need good enough, not perfect, and that
> > > seems to match what Juri is saying about the expectation that most of
> > > the fine tuning is done via other knobs.
> >
> > My expectation is that if a ballpark figure is good enough, it should be
> > possible to implement something trivial like bogomips / loop_per_jiffy
> > calculation.
> >
>
> I didn't really followed that, so I might be wrong here, but isn't
> already happened a discussion about how we want/like to stop exposing
> bogomips info or rely on it for anything but in kernel delay loops?

I meant that we could have a benchmark of that level of complexity,
rather than those specific values.

Mark.

2015-12-15 16:23:27

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> > On 15/12/15 15:32, Mark Rutland wrote:
> > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > > > My expectation is that we just need good enough, not perfect, and that
> > > > seems to match what Juri is saying about the expectation that most of
> > > > the fine tuning is done via other knobs.
> > >
> > > My expectation is that if a ballpark figure is good enough, it should be
> > > possible to implement something trivial like bogomips / loop_per_jiffy
> > > calculation.
> >
> > I didn't really followed that, so I might be wrong here, but isn't
> > already happened a discussion about how we want/like to stop exposing
> > bogomips info or rely on it for anything but in kernel delay loops?
>
> I meant that we could have a benchmark of that level of complexity,
> rather than those specific values.

Or we could simply let user space use whatever benchmarks or hard-coded
values it wants and set the capacity via sysfs (during boot). By
default, the kernel would assume all CPUs equal.
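
For what it's worth, combined with the sysfs attribute from patches
07-08 of this series, that boot-time write could be as simple as the
sketch below (the exact path and the writability of the attribute are
assumptions on my part, not something this thread has settled):

#include <stdio.h>

/* Write a capacity value for one CPU via the proposed sysfs attribute. */
static int set_capacity(int cpu, unsigned long capacity)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpu_capacity", cpu);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%lu\n", capacity);
        return fclose(f);
}

int main(void)
{
        /* e.g. TC2-like values: two big CPUs and three little ones */
        set_capacity(0, 1024);
        set_capacity(1, 1024);
        set_capacity(2, 430);
        set_capacity(3, 430);
        set_capacity(4, 430);
        return 0;
}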

--
Catalin

2015-12-15 16:42:11

by Mark Rutland

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> > > On 15/12/15 15:32, Mark Rutland wrote:
> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > > > > My expectation is that we just need good enough, not perfect, and that
> > > > > seems to match what Juri is saying about the expectation that most of
> > > > > the fine tuning is done via other knobs.
> > > >
> > > > My expectation is that if a ballpark figure is good enough, it should be
> > > > possible to implement something trivial like bogomips / loop_per_jiffy
> > > > calculation.
> > >
> > > I didn't really followed that, so I might be wrong here, but isn't
> > > already happened a discussion about how we want/like to stop exposing
> > > bogomips info or rely on it for anything but in kernel delay loops?
> >
> > I meant that we could have a benchmark of that level of complexity,
> > rather than those specific values.
>
> Or we could simply let user space use whatever benchmarks or hard-coded
> values it wants and set the capacity via sysfs (during boot). By
> default, the kernel would assume all CPUs equal.

I assume that a userspace override would be available regardless of
whatever mechanism the kernel uses to determine relative
performance/efficiency.

I am not opposed to that mechanism being "assume equal".

Mark.

2015-12-15 16:59:59

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On 15 December 2015 at 17:41, Mark Rutland <[email protected]> wrote:
> On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
>> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
>> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
>> > > On 15/12/15 15:32, Mark Rutland wrote:
>> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
>> > > > > My expectation is that we just need good enough, not perfect, and that
>> > > > > seems to match what Juri is saying about the expectation that most of
>> > > > > the fine tuning is done via other knobs.
>> > > >
>> > > > My expectation is that if a ballpark figure is good enough, it should be
>> > > > possible to implement something trivial like bogomips / loop_per_jiffy
>> > > > calculation.
>> > >
>> > > I didn't really followed that, so I might be wrong here, but isn't
>> > > already happened a discussion about how we want/like to stop exposing
>> > > bogomips info or rely on it for anything but in kernel delay loops?
>> >
>> > I meant that we could have a benchmark of that level of complexity,
>> > rather than those specific values.
>>
>> Or we could simply let user space use whatever benchmarks or hard-coded
>> values it wants and set the capacity via sysfs (during boot). By
>> default, the kernel would assume all CPUs equal.
>
> I assume that a userspace override would be available regardless of
> whatever mechanism the kernel uses to determine relative
> performance/effinciency.

Don't you think that if we give userspace complete latitude to set
whatever it wants, it will be used to abuse the kernel (and the
scheduler in particular), and that this will end up being a real mess
when trying to understand why a task is not placed where it should be?
We can probably provide a debug mode to help SoC manufacturers define
their capacity values, but IMHO we should not allow complete latitude
in normal operation.

In normal operation we need to give some methods to tweak the value to
reflect memory-bound or integer-calculation work, or other kinds of work
that currently run on the CPU, but no more than that.

Vincent
>
> I am not opposed to that mechanism being "assume equal".
>
> Mark.

2015-12-15 17:15:38

by Mark Rutland

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 05:59:34PM +0100, Vincent Guittot wrote:
> On 15 December 2015 at 17:41, Mark Rutland <[email protected]> wrote:
> > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
> >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
> >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> >> > > On 15/12/15 15:32, Mark Rutland wrote:
> >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> >> > > > > My expectation is that we just need good enough, not perfect, and that
> >> > > > > seems to match what Juri is saying about the expectation that most of
> >> > > > > the fine tuning is done via other knobs.
> >> > > >
> >> > > > My expectation is that if a ballpark figure is good enough, it should be
> >> > > > possible to implement something trivial like bogomips / loop_per_jiffy
> >> > > > calculation.
> >> > >
> >> > > I didn't really followed that, so I might be wrong here, but isn't
> >> > > already happened a discussion about how we want/like to stop exposing
> >> > > bogomips info or rely on it for anything but in kernel delay loops?
> >> >
> >> > I meant that we could have a benchmark of that level of complexity,
> >> > rather than those specific values.
> >>
> >> Or we could simply let user space use whatever benchmarks or hard-coded
> >> values it wants and set the capacity via sysfs (during boot). By
> >> default, the kernel would assume all CPUs equal.
> >
> > I assume that a userspace override would be available regardless of
> > whatever mechanism the kernel uses to determine relative
> > performance/effinciency.
>
> Don't you think that if we let a complete latitude to the userspace
> to set whatever it wants, it will be used to abuse the kernel (and the
> scheduler in particular ) and that this will finish in a real mess to
> understand what is wrong when a task is not placed where it should be.

I'm not sure I follow what you mean by "abuse" here. Userspace currently
can force the scheduler to make sub-optimal decisions in a number of
ways, e.g.

* Hot-unplugging the preferred CPUs
* Changing a task's affinity mask
* Setting the nice value of a task
* Using rlimits and/or cgroups
* Using a cpufreq governor
* Fork-bombing

Practically all of these are privileged operations. I would envisage the
userspace interface for "capacity" management to be similar.

> We can probably provide a debug mode to help soc manufacturer to
> define their capacity value but IMHO we should not let complete
> latitude in normal operation

In normal operation userspace wouldn't mess with this, as with most of
the cases above. Userspace can already shoot itself in the foot.

> In normal operation we need to give some methods to tweak the value to
> reflect a memory bounded or integer calculation work or other kind of
> work that currently runs on the cpu but not more

You can already do that with the mechanisms above, to some extent. I'm
not sure I follow.

Mark.

2015-12-15 17:17:35

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 03:32:19PM +0000, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:

> > > I really don't want to see a table of magic numbers in the kernel.

> > Right, there's pitfalls there too although not being part of an ABI
> > does make them more manageable.

> I think that people are very likely to treat them exactly like an ABI,
> w.r.t. any regressions in performance that result from their addition,
> modification, or removal. That becomes really horrible when new CPUs
> appear.

Obviously people are going to get upset if we introduce performance
regressions - but that's true always, we can also introduce problems
with numbers people have put in DT. It seems like it'd be harder to
manage regressions due to externally provided magic numbers since
there's inherently less information there.

> > One thing it's probably helpful to establish here is how much the
> > specific numbers are going to matter in the grand scheme of things. If
> > the specific numbers *are* super important then nobody is going to want
> > to touch them as they'll be prone to getting tweaked. If instead the
> > numbers just need to be ballpark accurate so the scheduler starts off in
> > roughly the right place and the specific numbers don't matter it's a lot
> > easier and having a table in the kernel until we think of something
> > better (if that ever happens) gets a lot easier.

> I agree that we first need to figure out the importance of these
> numbers. I disagree that our first step should be to add a table.

My point there is that if we're not that concerned about the specific
number something in kernel is safer.

> > My expectation is that we just need good enough, not perfect, and that
> > seems to match what Juri is saying about the expectation that most of
> > the fine tuning is done via other knobs.

> My expectation is that if a ballpark figure is good enough, it should be
> possible to implement something trivial like bogomips / loop_per_jiffy
> calculation.

That does have the issue that we need to scale with regard to the
frequency the benchmark gets run at. That's not an insurmountable
obstacle but it's not completely trivial either.
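
To make the frequency dependence concrete, here is a small sketch of
the kind of normalisation that would be needed; the helper name, the
kHz units, and the choice of pinning the fastest CPU to
SCHED_CAPACITY_SCALE (1024) are assumptions for illustration, not
something this series defines:

#include <linux/math64.h>
#include <linux/sched.h>

/*
 * 'perf' is a raw benchmark score measured while this CPU ran at
 * 'cur_freq' (kHz); 'max_freq' is this CPU's maximum frequency and
 * 'best_perf_at_max' is the largest frequency-corrected score seen on
 * any CPU. Returns a capacity relative to 1024 for the fastest CPU.
 */
static unsigned long normalise_capacity(u64 perf, u64 cur_freq,
					u64 max_freq, u64 best_perf_at_max)
{
	u64 perf_at_max = div64_u64(perf * max_freq, cur_freq);

	return div64_u64(perf_at_max * SCHED_CAPACITY_SCALE,
			 best_perf_at_max);
}

The fiddly part is exactly the one pointed out above: knowing cur_freq
reliably at the moment the benchmark actually runs.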


2015-12-15 17:28:56

by Mark Rutland

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 03:32:19PM +0000, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > > On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:
>
> > > > I really don't want to see a table of magic numbers in the kernel.
>
> > > Right, there's pitfalls there too although not being part of an ABI
> > > does make them more manageable.
>
> > I think that people are very likely to treat them exactly like an ABI,
> > w.r.t. any regressions in performance that result from their addition,
> > modification, or removal. That becomes really horrible when new CPUs
> > appear.
>
> Obviously people are going to get upset if we introduce performance
> regressions - but that's always true, and we can also introduce problems
> with numbers people have put in DT. It seems like it'd be harder to
> manage regressions due to externally provided magic numbers since
> there's inherently less information there.

It's certainly still possible to have regressions in that case. Those
regressions would be due to code changes in the kernel, given the DT
didn't change.

I'm not sure I follow w.r.t. "inherently less information", unless you
mean trying to debug without access to that DTB?

> > > One thing it's probably helpful to establish here is how much the
> > > specific numbers are going to matter in the grand scheme of things. If
> > > the specific numbers *are* super important then nobody is going to want
> > > to touch them as they'll be prone to getting tweaked. If instead the
> > > numbers just need to be ballpark accurate so the scheduler starts off in
> > > roughly the right place and the specific numbers don't matter it's a lot
> > > easier and having a table in the kernel until we think of something
> > > better (if that ever happens) gets a lot easier.
>
> > I agree that we first need to figure out the importance of these
> > numbers. I disagree that our first step should be to add a table.
>
> My point there is that if we're not that concerned about the specific
> number something in kernel is safer.

I don't entirely disagree there. I think an in-kernel benchmark is
likely safer.

> > > My expectation is that we just need good enough, not perfect, and that
> > > seems to match what Juri is saying about the expectation that most of
> > > the fine tuning is done via other knobs.
>
> > My expectation is that if a ballpark figure is good enough, it should be
> > possible to implement something trivial like bogomips / loop_per_jiffy
> > calculation.
>
> That does have the issue that we need to scale with regard to the
> frequency the benchmark gets run at. That's not an insurmountable
> obstacle but it's not completely trivial either.

If we change clock frequency, then regardless of where the information
comes from we need to perform scaling, no?

One nice thing about doing a benchmark to derive the numbers is that
when the frequency is fixed but the kernel cannot query it, the
numbers will be representative.
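
As a strawman of the kind of trivial probe being discussed, the sketch
below counts iterations of a busy loop on each CPU over a short fixed
window and keeps the per-CPU counts as relative scores. The names, the
1ms window, and running the probe via on_each_cpu() are assumptions for
illustration, not code from this series:

#include <linux/init.h>
#include <linux/ktime.h>
#include <linux/percpu.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(unsigned long, bench_score);

/* Runs on each CPU (IPI context): spin for ~1ms and count iterations. */
static void capacity_bench(void *unused)
{
	ktime_t end = ktime_add_ms(ktime_get(), 1);
	unsigned long loops = 0;

	/* A real probe would run something more representative here. */
	while (ktime_before(ktime_get(), end))
		loops++;

	this_cpu_write(bench_score, loops);
}

static void __init run_capacity_bench(void)
{
	/* Run the probe on every online CPU and wait for all to finish. */
	on_each_cpu(capacity_bench, NULL, 1);
}

The per-CPU scores would then still need to be normalised against the
largest one (and corrected for frequency, as discussed elsewhere in
this thread) before being used as capacities.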

Mark.

2015-12-15 17:45:37

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote:

> > Obviously people are going to get upset if we introduce performance
> > regressions - but that's always true, and we can also introduce problems
> > with numbers people have put in DT. It seems like it'd be harder to
> > manage regressions due to externally provided magic numbers since
> > there's inherently less information there.

> It's certainly still possible to have regressions in that case. Those
> regressions would be due to code changes in the kernel, given the DT
> didn't change.

> I'm not sure I follow w.r.t. "inherently less information", unless you
> mean trying to debug without access to that DTB?

If what the kernel knows about the system is that it's got a bunch of
cores with numbers assigned to them then all it's really got is those
numbers. If something changes that causes problems for some systems
(eg, because the numbers have been picked poorly but in a way that
happened to work well with the old code), that's not a lot to go on. The
more we know about the system, the more likely it is that we'll be able
to adjust the assumptions in whatever new thing we do that causes
problems for any particular systems where we run into trouble.

> > My point there is that if we're not that concerned about the specific
> > number something in kernel is safer.

> I don't entirely disagree there. I think an in-kernel benchmark is
> likely safer.

Yes, I think that something where we just observe the system performance
at runtime is likely one of the best solutions if we can get something
that gives reasonable results.

> > That does have the issue that we need to scale with regard to the
> > frequency the benchmark gets run at. That's not an insurmountable
> > obstacle but it's not completely trivial either.

> If we change clock frequency, then regardless of where the information
> comes from we need to perform scaling, no?

Yes, it's just a question of making the benchmarking bit talk to the
scaling bit so we know where we're at when we do the benchmark. Like I
say it should be doable.

> One nice thing about doing a benchmark to derive the numbers is that
> when the frequency is fixed but the kernel cannot query it, the
> numbers will be representative.

Definitely.


2015-12-15 17:47:44

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On 15 December 2015 at 18:15, Mark Rutland <[email protected]> wrote:
> On Tue, Dec 15, 2015 at 05:59:34PM +0100, Vincent Guittot wrote:
>> On 15 December 2015 at 17:41, Mark Rutland <[email protected]> wrote:
>> > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
>> >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
>> >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
>> >> > > On 15/12/15 15:32, Mark Rutland wrote:
>> >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
>> >> > > > > My expectation is that we just need good enough, not perfect, and that
>> >> > > > > seems to match what Juri is saying about the expectation that most of
>> >> > > > > the fine tuning is done via other knobs.
>> >> > > >
>> >> > > > My expectation is that if a ballpark figure is good enough, it should be
>> >> > > > possible to implement something trivial like bogomips / loop_per_jiffy
>> >> > > > calculation.
>> >> > >
>> >> > > I didn't really follow that, so I might be wrong here, but hasn't
>> >> > > there already been a discussion about how we want/like to stop
>> >> > > exposing bogomips info or relying on it for anything but in-kernel
>> >> > > delay loops?
>> >> >
>> >> > I meant that we could have a benchmark of that level of complexity,
>> >> > rather than those specific values.
>> >>
>> >> Or we could simply let user space use whatever benchmarks or hard-coded
>> >> values it wants and set the capacity via sysfs (during boot). By
>> >> default, the kernel would assume all CPUs equal.
>> >
>> > I assume that a userspace override would be available regardless of
>> > whatever mechanism the kernel uses to determine relative
>> > performance/efficiency.
>>
>> Don't you think that if we give userspace complete latitude to set
>> whatever it wants, it will be used to abuse the kernel (and the
>> scheduler in particular), and that this will end in a real mess when
>> trying to understand what is wrong when a task is not placed where it
>> should be?
>
> I'm not sure I follow what you mean by "abuse" here. Userspace currently
> can force the scheduler to make sub-optimal decisions in a number of
> ways, e.g.
>
> * Hot-unplugging the preferred CPUs
> * Changing a task's affinity mask
> * Setting the nice value of a task
> * Using rlimits and/or cgroups
> * Using a cpufreq governor
> * Fork-bombing

All these parameters have a meaning (except the last one). By
abusing I mean setting the capacity of the most powerful CPU to 1 for
no good reason except to abuse the scheduler, so the latter will not
put as many tasks on it just because the currently running use case
is more efficient if the big core is not used.


>
> Practically all of these are privileged operations. I would envisage the
> userspace interface for "capacity" management to be similar.
>
>> We can probably provide a debug mode to help SoC manufacturers define
>> their capacity values, but IMHO we should not allow complete latitude
>> in normal operation.
>
> In normal operation userspace wouldn't mess with this, as with most of
> the cases above. Userspace can already shoot itself in the foot.
>
>> In normal operation we need to give some methods to tweak the value to
>> reflect memory-bound work, integer-calculation work, or other kinds of
>> work that currently run on the CPU, but no more than that.
>
> You can already do that with the mechanisms above, to some extent. I'm
> not sure I follow.
>
> Mark.

2015-12-15 18:10:31

by Mark Rutland

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 05:45:16PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote:
>
> > > Obviously people are going to get upset if we introduce performance
> > > regressions - but that's always true, and we can also introduce problems
> > > with numbers people have put in DT. It seems like it'd be harder to
> > > manage regressions due to externally provided magic numbers since
> > > there's inherently less information there.
>
> > It's certainly still possible to have regressions in that case. Those
> > regressions would be due to code changes in the kernel, given the DT
> > didn't change.
>
> > I'm not sure I follow w.r.t. "inherently less information", unless you
> > mean trying to debug without access to that DTB?
>
> If what the kernel knows about the system is that it's got a bunch of
> cores with numbers assigned to them then all it's really got is those
> numbers. If something changes that causes problems for some systems
> (eg, because the numbers have been picked poorly but in a way that
> > happened to work well with the old code), that's not a lot to go on. The
> > more we know about the system, the more likely it is that we'll be able
> to adjust the assumptions in whatever new thing we do that causes
> problems for any particular systems where we run into trouble.

Regardless of where the numbers live (DT or kernel), all we have are
numbers. I can see that changing the in-kernel numbers would be possible
when modifyign the DT is not, but I don't see how that gives you more
information.

Mark.

2015-12-15 18:39:53

by Mark Rutland

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 06:47:20PM +0100, Vincent Guittot wrote:
> On 15 December 2015 at 18:15, Mark Rutland <[email protected]> wrote:
> > On Tue, Dec 15, 2015 at 05:59:34PM +0100, Vincent Guittot wrote:
> >> On 15 December 2015 at 17:41, Mark Rutland <[email protected]> wrote:
> >> > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
> >> >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
> >> >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> >> >> > > On 15/12/15 15:32, Mark Rutland wrote:
> >> >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> >> >> > > > > My expectation is that we just need good enough, not perfect, and that
> >> >> > > > > seems to match what Juri is saying about the expectation that most of
> >> >> > > > > the fine tuning is done via other knobs.
> >> >> > > >
> >> >> > > > My expectation is that if a ballpark figure is good enough, it should be
> >> >> > > > possible to implement something trivial like bogomips / loop_per_jiffy
> >> >> > > > calculation.
> >> >> > >
> >> >> > > I didn't really follow that, so I might be wrong here, but hasn't
> >> >> > > there already been a discussion about how we want/like to stop
> >> >> > > exposing bogomips info or relying on it for anything but in-kernel
> >> >> > > delay loops?
> >> >> >
> >> >> > I meant that we could have a benchmark of that level of complexity,
> >> >> > rather than those specific values.
> >> >>
> >> >> Or we could simply let user space use whatever benchmarks or hard-coded
> >> >> values it wants and set the capacity via sysfs (during boot). By
> >> >> default, the kernel would assume all CPUs equal.
> >> >
> >> > I assume that a userspace override would be available regardless of
> >> > whatever mechanism the kernel uses to determine relative
> >> > performance/efficiency.
> >>
> >> Don't you think that if we give userspace complete latitude to set
> >> whatever it wants, it will be used to abuse the kernel (and the
> >> scheduler in particular), and that this will end in a real mess when
> >> trying to understand what is wrong when a task is not placed where it
> >> should be?
> >
> > I'm not sure I follow what you mean by "abuse" here. Userspace currently
> > can force the scheduler to make sub-optimal decisions in a number of
> > ways, e.g.
> >
> > * Hot-unplugging the preferred CPUs
> > * Changing a task's affinity mask
> > * Setting the nice value of a task
> > * Using rlimits and/or cgroups
> > * Using a cpufreq governor
> > * Fork-bombing
>
> All these parameters have a meaning (except the last one). By
> abusing I mean setting the capacity of the most powerful CPU to 1 for
> no good reason except to abuse the scheduler, so the latter will not
> put as many tasks on it just because the currently running use case
> is more efficient if the big core is not used.

Surely it's better to allow them to "abuse" the kernel in that manner
than to place otherwise insane values into a DT? Especially if they can
later change to a sane value?

For that particular case it's easy to hotplug out the big core, or to
set the affinity of tasks to avoid it.

Mark.

2015-12-15 18:45:24

by Mark Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

On Tue, Dec 15, 2015 at 06:10:03PM +0000, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 05:45:16PM +0000, Mark Brown wrote:

> > > I'm not sure I follow w.r.t. "inherently less information", unless you
> > > mean trying to debug without access to that DTB?

> > If what the kernel knows about the system is that it's got a bunch of
> > cores with numbers assigned to them then all it's really got is those
> > numbers. If something changes that causes problems for some systems
> > (eg, because the numbers have been picked poorly but in a way that
> > happened to work well with the old code), that's not a lot to go on. The
> > more we know about the system, the more likely it is that we'll be able
> > to adjust the assumptions in whatever new thing we do that causes
> > problems for any particular systems where we run into trouble.

> Regardless of where the numbers live (DT or kernel), all we have are
> numbers. I can see that changing the in-kernel numbers would be possible
> when modifying the DT is not, but I don't see how that gives you more
> information.

It's mainly the modifying-the-DT case - you're not dealing with some
external misguided number selection method you'd never thought of, and
you're not forcing some third party to redo benchmarks or adjust DTs
they may not want to adjust. You're also able to readjust the numbers
based on feedback if you need to rather than having to adapt algorithms
to handle particular number selections; the algorithm and number
selection are done together rather than separately.


2015-12-17 09:07:53

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

Hi,

On 15/12/15 17:45, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote:
>
> > > Obviously people are going to get upset if we introduce performance
> > > regressions - but that's always true, and we can also introduce problems
> > > with numbers people have put in DT. It seems like it'd be harder to
> > > manage regressions due to externally provided magic numbers since
> > > there's inherently less information there.
>
> > It's certainly still possible to have regressions in that case. Those
> > regressions would be due to code changes in the kernel, given the DT
> > didn't change.
>
> > I'm not sure I follow w.r.t. "inherently less information", unless you
> > mean trying to debug without access to that DTB?
>
> If what the kernel knows about the system is that it's got a bunch of
> cores with numbers assigned to them then all it's really got is those
> numbers. If something changes that causes problems for some systems
> (eg, because the numbers have been picked poorly but in a way that
> happened to work well with the old code), that's not a lot to go on. The
> more we know about the system, the more likely it is that we'll be able
> to adjust the assumptions in whatever new thing we do that causes
> problems for any particular systems where we run into trouble.
>
> > > My point there is that if we're not that concerned about the specific
> > > number something in kernel is safer.
>
> > I don't entirely disagree there. I think an in-kernel benchmark is
> > likely safer.
>
> Yes, I think that something where we just observe the system performance
> at runtime is likely one of the best solutions if we can get something
> that gives reasonable results.
>
> > > That does have the issue that we need to scale with regard to the
> > > frequency the benchmark gets run at. That's not an insurmountable
> > > obstacle but it's not completely trivial either.
>
> > If we change clock frequency, then regardless of where the information
> > comes from we need to perform scaling, no?
>
> Yes, it's just a question of making the benchmarking bit talk to the
> scaling bit so we know where we're at when we do the benchmark. Like I
> say it should be doable.
>
> > One nice thing about doing a benchmark to derive the numbers is that
> > when the frequency is fixed but the kernel cannot query it, the
> > numbers will be representative.
>
> Definitely.

OK, let's see what a dynamic approach could look like. As I said, since
it was actually our first thought too, I already have a possible
implementation of such a thing. I'll be OOO until early Jan, but I'll
try to rebase what I have and post it here as soon as I'm back; then we
can see which solution looks better.
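
Just to illustrate where a dynamically derived value would eventually
land (this is not the implementation mentioned above, only a sketch
assuming the per-CPU cpu_scale variable arm's topology code already has,
and the sched_domain-taking form of the scheduler's arch capacity hook
used around this time):

#include <linux/percpu.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;

/* Called once DT parsing or a boot-time benchmark has produced a value. */
static void set_capacity_scale(unsigned int cpu, unsigned long capacity)
{
	per_cpu(cpu_scale, cpu) = capacity;
}

/* The scheduler picks the value up through its arch override. */
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	return per_cpu(cpu_scale, cpu);
}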

Thanks a lot for the feedback!

Best,

- Juri