Changes in v8:
- Added some tags for reviews and acks.
- Cleaned up the timer patch (patch 6) according to comments from Rafael.
- Rebased the series on top of v4.18-rc1 - it applied cleanly, except for patch 5.
- While adapting patch 5 to the new genpd changes, I took the opportunity to
improve the new function description a bit.
- Corrected a malformed SPDX-License-Identifier in patch 20.
Changes in v7:
- Addressed comments concerning the PSCI changes from Mark Rutland, which move
the PSCI firmware driver to a new firmware subdirectory and force PSCI PC
mode during boot to cope with kexec'ed kernels.
- Added some maintainers on cc for the timer/nohz patches.
- Minor update to the new genpd governor, taking into account the state's
poweroff latency while validating the sleep duration time.
- Addressed a problem pointed out by Geert Uytterhoeven, around calling
pm_runtime_get|put() for CPUs that have not been attached to a CPU PM domain.
- Rebased on Linus' latest master.
Some background:
Overall, this series has been discussed over the years at various Linux
conferences and on LKML. Let me give a brief introduction here; the rest can
be read in each changelog.
For ARM, the PSCI firmware interface may be managing the power to the CPUs.
Depending on the SoC, CPUs may also be arranged in a hierarchical manner, which
can add another level of complexity from a CPU idle management point of view.
PSCI v1.0+ adds support for the so-called OS-initiated CPU suspend mode, which
enables a more fine-grained method, allowing Linux to get more control with
regard to energy efficiency. This is typically useful for this kind of
complex, battery-driven platform.
In principle, this series intends to address what is missing today in CPU idle
management for SoCs that arrange their CPUs in a hierarchical manner.
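To illustrate the hierarchical layout, here is a schematic DT sketch (node and
label names are made up; the real bindings are documented by the psci.txt
patch in this series, and a full example is in the MSM8916 patch):

```dts
cpus {
	cpu@0 {
		...
		/* instead of cpu-idle-states = <&CPU_SLEEP>; */
		power-domains = <&CPU_PD0>;
	};
};

psci {
	compatible = "arm,psci-1.0";

	CPU_PD0: cpu-pd0 {
		#power-domain-cells = <0>;
		power-domains = <&CLUSTER_PD>;
		domain-idle-states = <&CPU_SLEEP>;
	};

	CLUSTER_PD: cluster-pd {
		#power-domain-cells = <0>;
		domain-idle-states = <&CLUSTER_SLEEP>;
	};
};
```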
- Patch 1 -> Patch 12: The first part consists of generic changes to genpd,
cpu_pm, timers, cpuidle and DT. The solution is based on an opt-in method,
so no users should be affected by any of these changes.
- Patch 13 -> Patch 26: The second part consists of changes to PSCI and arm64,
which add support for CPU idle management, based upon the new generic
changes from the first part.
The series is based on v4.18-rc1 and the code has been tested on a QCOM 410c
Dragonboard. You may find the code at:
git.linaro.org/people/ulf.hansson/linux-pm.git next
Kind regards
Ulf Hansson
Lina Iyer (6):
PM / Domains: Add generic data pointer to genpd_power_state struct
timer: Export next wakeup time of a CPU
dt: psci: Update DT bindings to support hierarchical PSCI states
cpuidle: dt: Support hierarchical CPU idle states
drivers: firmware: psci: Support hierarchical CPU idle states
arm64: dts: Convert to the hierarchical CPU topology layout for
MSM8916
Ulf Hansson (20):
PM / Domains: Don't treat zero found compatible idle states as an
error
PM / Domains: Deal with multiple states but no governor in genpd
PM / Domains: Add support for CPU devices to genpd
PM / Domains: Add helper functions to attach/detach CPUs to/from genpd
PM / Domains: Add genpd governor for CPUs
PM / Domains: Extend genpd CPU governor to cope with QoS constraints
kernel/cpu_pm: Manage runtime PM in the idle path for CPUs
of: base: Add of_get_cpu_state_node() to get idle states for a CPU
node
drivers: firmware: psci: Move psci to separate directory
MAINTAINERS: Update files for PSCI
drivers: firmware: psci: Split psci_dt_cpu_init_idle()
drivers: firmware: psci: Simplify error path of psci_dt_init()
drivers: firmware: psci: Announce support for OS initiated suspend
mode
drivers: firmware: psci: Prepare to use OS initiated suspend mode
drivers: firmware: psci: Share a few internal PSCI functions
drivers: firmware: psci: Add support for PM domains using genpd
drivers: firmware: psci: Introduce psci_dt_topology_init()
drivers: firmware: psci: Try to attach CPU devices to their PM domains
drivers: firmware: psci: Deal with CPU hotplug when using OSI mode
arm64: kernel: Respect the hierarchical CPU topology in DT for PSCI
.../devicetree/bindings/arm/psci.txt | 156 +++++++++++++++
MAINTAINERS | 2 +-
arch/arm64/boot/dts/qcom/msm8916.dtsi | 53 +++++-
arch/arm64/kernel/setup.c | 3 +
drivers/base/power/domain.c | 158 ++++++++++++++-
drivers/base/power/domain_governor.c | 67 ++++++-
drivers/cpuidle/dt_idle_states.c | 5 +-
drivers/firmware/Kconfig | 15 +-
drivers/firmware/Makefile | 3 +-
drivers/firmware/psci/Kconfig | 13 ++
drivers/firmware/psci/Makefile | 4 +
drivers/firmware/{ => psci}/psci.c | 174 +++++++++++++----
drivers/firmware/psci/psci.h | 19 ++
drivers/firmware/{ => psci}/psci_checker.c | 0
drivers/firmware/psci/psci_pm_domain.c | 180 ++++++++++++++++++
drivers/of/base.c | 35 ++++
include/linux/of.h | 8 +
include/linux/pm_domain.h | 16 ++
include/linux/psci.h | 2 +
include/linux/tick.h | 8 +
include/uapi/linux/psci.h | 5 +
kernel/cpu_pm.c | 11 ++
kernel/time/tick-sched.c | 10 +
23 files changed, 877 insertions(+), 70 deletions(-)
create mode 100644 drivers/firmware/psci/Kconfig
create mode 100644 drivers/firmware/psci/Makefile
rename drivers/firmware/{ => psci}/psci.c (83%)
create mode 100644 drivers/firmware/psci/psci.h
rename drivers/firmware/{ => psci}/psci_checker.c (100%)
create mode 100644 drivers/firmware/psci/psci_pm_domain.c
--
2.17.1
The CPU's idle state nodes are currently parsed in the common cpuidle DT
library, but also when initializing back-end data for the arch-specific CPU
operations, as in the PSCI driver's case.
To avoid open-coding, let's introduce of_get_cpu_state_node(), which takes
the device node for the CPU and the index of the requested idle state node
as parameters. In case a corresponding idle state node is found, it returns
the node with its refcount incremented; otherwise it returns NULL.
Moreover, for ARM there are two generic methods to describe the CPU's idle
states: either via the flattened description through the "cpu-idle-states"
binding [1], or via the hierarchical layout, using the "power-domains" and
"domain-idle-states" bindings [2]. Hence, let's take both options into
account.
[1] Documentation/devicetree/bindings/arm/idle-states.txt
[2] Documentation/devicetree/bindings/arm/psci.txt
Cc: Rob Herring <[email protected]>
Cc: [email protected]
Cc: Lina Iyer <[email protected]>
Suggested-by: Sudeep Holla <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
---
drivers/of/base.c | 35 +++++++++++++++++++++++++++++++++++
include/linux/of.h | 8 ++++++++
2 files changed, 43 insertions(+)
diff --git a/drivers/of/base.c b/drivers/of/base.c
index 848f549164cd..97350cce1b8e 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -399,6 +399,41 @@ int of_cpu_node_to_id(struct device_node *cpu_node)
}
EXPORT_SYMBOL(of_cpu_node_to_id);
+/**
+ * of_get_cpu_state_node - Get CPU's idle state node at the given index
+ *
+ * @cpu_node: The device node for the CPU
+ * @index: The index in the list of the idle states
+ *
+ * Two generic methods can be used to describe a CPU's idle states, either via
+ * a flattened description through the "cpu-idle-states" binding or via the
+ * hierarchical layout, using the "power-domains" and the "domain-idle-states"
+ * bindings. This function checks for both and returns the idle state node
+ * for the requested index.
+ *
+ * If an idle state node is found at the given index, its refcount is
+ * incremented, so call of_node_put() on it when done. Returns NULL if not found.
+ */
+struct device_node *of_get_cpu_state_node(struct device_node *cpu_node,
+ int index)
+{
+ struct of_phandle_args args;
+ int err;
+
+ err = of_parse_phandle_with_args(cpu_node, "power-domains",
+ "#power-domain-cells", 0, &args);
+ if (!err) {
+ struct device_node *state_node =
+ of_parse_phandle(args.np, "domain-idle-states", index);
+
+ of_node_put(args.np);
+ return state_node;
+ }
+
+ return of_parse_phandle(cpu_node, "cpu-idle-states", index);
+}
+EXPORT_SYMBOL(of_get_cpu_state_node);
+
/**
* __of_device_is_compatible() - Check if the node matches given constraints
* @device: pointer to node
diff --git a/include/linux/of.h b/include/linux/of.h
index 4d25e4f952d9..15072b10ef4d 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -348,6 +348,8 @@ extern const void *of_get_property(const struct device_node *node,
const char *name,
int *lenp);
extern struct device_node *of_get_cpu_node(int cpu, unsigned int *thread);
+extern struct device_node *of_get_cpu_state_node(struct device_node *cpu_node,
+ int index);
#define for_each_property_of_node(dn, pp) \
for (pp = dn->properties; pp != NULL; pp = pp->next)
@@ -733,6 +735,12 @@ static inline struct device_node *of_get_cpu_node(int cpu,
return NULL;
}
+static inline struct device_node *of_get_cpu_state_node(struct device_node *cpu_node,
+ int index)
+{
+ return NULL;
+}
+
static inline int of_n_addr_cells(struct device_node *np)
{
return 0;
--
2.17.1
To let the PSCI driver parse the CPU topology in DT, so it can create CPU PM
domains in case the hierarchical layout is used, let's call
psci_dt_topology_init() from the existing topology_init() subsys_initcall.
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
arch/arm64/kernel/setup.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 30ad2f085d1f..574a5045f2f0 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -331,6 +331,9 @@ static int __init topology_init(void)
{
int i;
+ if (acpi_disabled)
+ psci_dt_topology_init();
+
for_each_online_node(i)
register_one_node(i);
--
2.17.1
To deal with CPU hotplug when OSI mode is used, the CPU device needs to be
detached from its PM domain (genpd) when putting it offline; otherwise the
CPU is still considered as being in use from a genpd and runtime PM point of
view. Obviously, we then also need to re-attach the CPU device when bringing
the CPU back online, so let's do this.
Cc: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/psci/psci.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index 700e0e995871..e649673d71f0 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -190,6 +190,10 @@ static int psci_cpu_off(u32 state)
int err;
u32 fn;
+ /* If running OSI mode, detach the CPU device from its PM domain. */
+ if (psci_osi_mode_enabled)
+ of_genpd_detach_cpu(smp_processor_id());
+
fn = psci_function_id[PSCI_FN_CPU_OFF];
err = invoke_psci_fn(fn, state, 0, 0);
return psci_to_linux_errno(err);
@@ -204,6 +208,10 @@ static int psci_cpu_on(unsigned long cpuid, unsigned long entry_point)
err = invoke_psci_fn(fn, cpuid, entry_point, 0);
/* Clear the domain state to start fresh. */
psci_set_domain_state(0);
+
+ if (!err && psci_osi_mode_enabled)
+ of_genpd_attach_cpu(cpuid);
+
return psci_to_linux_errno(err);
}
--
2.17.1
From: Lina Iyer <[email protected]>
In the hierarchical layout, we create power domains around each CPU and
describe the idle states for them inside the power domain provider nodes.
Note that the CPU's idle states still need to be compatible with
"arm,idle-state".
Furthermore, we represent the CPU cluster as a separate master power domain,
powering the CPU's power domains. The cluster node contains the idle states
for the cluster, and each idle state needs to be compatible with
"domain-idle-state".
If the running platform uses a PSCI FW that supports the OS-initiated CPU
suspend mode, which should likely be the case unless the PSCI FW is very
old, this change makes the PSCI driver enable it.
Cc: Andy Gross <[email protected]>
Cc: David Brown <[email protected]>
Cc: Lina Iyer <[email protected]>
Signed-off-by: Lina Iyer <[email protected]>
Co-developed-by: Ulf Hansson <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
arch/arm64/boot/dts/qcom/msm8916.dtsi | 53 +++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/boot/dts/qcom/msm8916.dtsi b/arch/arm64/boot/dts/qcom/msm8916.dtsi
index 650f356f69ca..d67c51090d0c 100644
--- a/arch/arm64/boot/dts/qcom/msm8916.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8916.dtsi
@@ -113,10 +113,10 @@
reg = <0x0>;
next-level-cache = <&L2_0>;
enable-method = "psci";
- cpu-idle-states = <&CPU_SPC>;
clocks = <&apcs 0>;
operating-points-v2 = <&cpu_opp_table>;
#cooling-cells = <2>;
+ power-domains = <&CPU_PD0>;
};
CPU1: cpu@1 {
@@ -125,10 +125,10 @@
reg = <0x1>;
next-level-cache = <&L2_0>;
enable-method = "psci";
- cpu-idle-states = <&CPU_SPC>;
clocks = <&apcs 0>;
operating-points-v2 = <&cpu_opp_table>;
#cooling-cells = <2>;
+ power-domains = <&CPU_PD1>;
};
CPU2: cpu@2 {
@@ -137,10 +137,10 @@
reg = <0x2>;
next-level-cache = <&L2_0>;
enable-method = "psci";
- cpu-idle-states = <&CPU_SPC>;
clocks = <&apcs 0>;
operating-points-v2 = <&cpu_opp_table>;
#cooling-cells = <2>;
+ power-domains = <&CPU_PD2>;
};
CPU3: cpu@3 {
@@ -149,10 +149,10 @@
reg = <0x3>;
next-level-cache = <&L2_0>;
enable-method = "psci";
- cpu-idle-states = <&CPU_SPC>;
clocks = <&apcs 0>;
operating-points-v2 = <&cpu_opp_table>;
#cooling-cells = <2>;
+ power-domains = <&CPU_PD3>;
};
L2_0: l2-cache {
@@ -169,12 +169,57 @@
min-residency-us = <2000>;
local-timer-stop;
};
+
+ CLUSTER_RET: cluster-retention {
+ compatible = "domain-idle-state";
+ arm,psci-suspend-param = <0x1000010>;
+ entry-latency-us = <500>;
+ exit-latency-us = <500>;
+ min-residency-us = <2000>;
+ };
+
+ CLUSTER_PWRDN: cluster-gdhs {
+ compatible = "domain-idle-state";
+ arm,psci-suspend-param = <0x1000030>;
+ entry-latency-us = <2000>;
+ exit-latency-us = <2000>;
+ min-residency-us = <6000>;
+ };
};
};
psci {
compatible = "arm,psci-1.0";
method = "smc";
+
+ CPU_PD0: cpu-pd0 {
+ #power-domain-cells = <0>;
+ power-domains = <&CLUSTER_PD>;
+ domain-idle-states = <&CPU_SPC>;
+ };
+
+ CPU_PD1: cpu-pd1 {
+ #power-domain-cells = <0>;
+ power-domains = <&CLUSTER_PD>;
+ domain-idle-states = <&CPU_SPC>;
+ };
+
+ CPU_PD2: cpu-pd2 {
+ #power-domain-cells = <0>;
+ power-domains = <&CLUSTER_PD>;
+ domain-idle-states = <&CPU_SPC>;
+ };
+
+ CPU_PD3: cpu-pd3 {
+ #power-domain-cells = <0>;
+ power-domains = <&CLUSTER_PD>;
+ domain-idle-states = <&CPU_SPC>;
+ };
+
+ CLUSTER_PD: cluster-pd {
+ #power-domain-cells = <0>;
+ domain-idle-states = <&CLUSTER_RET>, <&CLUSTER_PWRDN>;
+ };
};
pmu {
--
2.17.1
When the hierarchical layout is used in DT to describe the PM topology for
the CPUs managed by PSCI, we want to be able to initialize and set up the
corresponding PM domain data structures.
Let's make this possible by adding a new file, psci_pm_domain.c, and
implement the needed interface towards the generic PM domain (aka genpd).
Share a helper function, psci_dt_init_pm_domains(), which the regular PSCI
firmware driver may call when it needs to initialize the PM topology using
genpd.
In principle, the implementation consists of allocating/initializing the
genpd data structures, parsing the domain idle states DT bindings via
of_genpd_parse_idle_states(), and calling pm_genpd_init() for the allocated
genpds.
Finally, one genpd OF provider is added per genpd. Via DT, this enables
devices, including CPU devices, to be attached to the created genpds.
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/psci/Makefile | 2 +-
drivers/firmware/psci/psci.h | 6 +
drivers/firmware/psci/psci_pm_domain.c | 180 +++++++++++++++++++++++++
3 files changed, 187 insertions(+), 1 deletion(-)
create mode 100644 drivers/firmware/psci/psci_pm_domain.c
diff --git a/drivers/firmware/psci/Makefile b/drivers/firmware/psci/Makefile
index 1956b882470f..ff300f1fec86 100644
--- a/drivers/firmware/psci/Makefile
+++ b/drivers/firmware/psci/Makefile
@@ -1,4 +1,4 @@
# SPDX-License-Identifier: GPL-2.0
#
-obj-$(CONFIG_ARM_PSCI_FW) += psci.o
+obj-$(CONFIG_ARM_PSCI_FW) += psci.o psci_pm_domain.o
obj-$(CONFIG_ARM_PSCI_CHECKER) += psci_checker.o
diff --git a/drivers/firmware/psci/psci.h b/drivers/firmware/psci/psci.h
index dc7b596daa2b..a22684b24902 100644
--- a/drivers/firmware/psci/psci.h
+++ b/drivers/firmware/psci/psci.h
@@ -10,4 +10,10 @@ void psci_set_domain_state(u32 state);
int psci_dt_parse_state_node(struct device_node *np, u32 *state);
+#ifdef CONFIG_PM_GENERIC_DOMAINS_OF
+int psci_dt_init_pm_domains(struct device_node *np);
+#else
+static inline int psci_dt_init_pm_domains(struct device_node *np) { return 0; }
+#endif
+
#endif /* __PSCI_H */
diff --git a/drivers/firmware/psci/psci_pm_domain.c b/drivers/firmware/psci/psci_pm_domain.c
new file mode 100644
index 000000000000..f54819e7e487
--- /dev/null
+++ b/drivers/firmware/psci/psci_pm_domain.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PM domains for CPUs via genpd - managed by PSCI.
+ *
+ * Copyright (C) 2018 Linaro Ltd.
+ * Author: Ulf Hansson <[email protected]>
+ *
+ */
+
+#define pr_fmt(fmt) "psci: " fmt
+
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/pm_domain.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+#include "psci.h"
+
+#ifdef CONFIG_PM_GENERIC_DOMAINS_OF
+static int psci_pd_power_off(struct generic_pm_domain *pd)
+{
+ struct genpd_power_state *state = &pd->states[pd->state_idx];
+ u32 *pd_state;
+ u32 composite_pd_state;
+
+ if (!state->data)
+ return 0;
+
+ pd_state = state->data;
+ composite_pd_state = *pd_state | psci_get_domain_state();
+ psci_set_domain_state(composite_pd_state);
+
+ return 0;
+}
+
+static int psci_dt_parse_pd_states(struct genpd_power_state *states,
+ int state_count)
+{
+ int i, err;
+ u32 *psci_states;
+
+ if (!state_count)
+ return 0;
+
+ psci_states = kcalloc(state_count, sizeof(*psci_states), GFP_KERNEL);
+ if (!psci_states)
+ return -ENOMEM;
+
+ for (i = 0; i < state_count; i++) {
+ err = psci_dt_parse_state_node(to_of_node(states[i].fwnode),
+ &psci_states[i]);
+ if (err) {
+ kfree(psci_states);
+ return err;
+ }
+ }
+
+ for (i = 0; i < state_count; i++)
+ states[i].data = &psci_states[i];
+
+ return 0;
+}
+
+static int psci_dt_init_genpd(struct device_node *np,
+ struct genpd_power_state *states,
+ unsigned int state_count)
+{
+ struct generic_pm_domain *pd;
+ struct dev_power_governor *pd_gov;
+ char *name;
+ int ret = -ENOMEM;
+
+ pd = kzalloc(sizeof(*pd), GFP_KERNEL);
+ if (!pd)
+ return -ENOMEM;
+
+ name = kasprintf(GFP_KERNEL, "%pOF", np);
+ if (!name)
+ goto free_pd;
+
+ /* kbasename() points into 'name', so free 'name' rather than pd->name. */
+ pd->name = kbasename(name);
+ pd->power_off = psci_pd_power_off;
+ pd->states = states;
+ pd->state_count = state_count;
+ pd->flags |= GENPD_FLAG_IRQ_SAFE | GENPD_FLAG_CPU_DOMAIN;
+
+ /* Use governor for CPU PM domains if it has some states to manage. */
+ pd_gov = state_count > 0 ? &pm_domain_cpu_gov : NULL;
+
+ ret = pm_genpd_init(pd, pd_gov, false);
+ if (ret)
+ goto free_name;
+
+ ret = of_genpd_add_provider_simple(np, pd);
+ if (ret)
+ goto remove_pd;
+
+ pr_info("init PM domain %s\n", pd->name);
+ return 0;
+
+remove_pd:
+ pm_genpd_remove(pd);
+free_name:
+ kfree(name);
+free_pd:
+ kfree(pd);
+ pr_err("failed to init PM domain ret=%d %pOF\n", ret, np);
+ return ret;
+}
+
+static int psci_dt_set_genpd_topology(struct device_node *np)
+{
+ struct device_node *node;
+ struct of_phandle_args child, parent;
+ int ret;
+
+ for_each_child_of_node(np, node) {
+ if (of_parse_phandle_with_args(node, "power-domains",
+ "#power-domain-cells", 0,
+ &parent))
+ continue;
+
+ child.np = node;
+ child.args_count = 0;
+
+ ret = of_genpd_add_subdomain(&parent, &child);
+ of_node_put(parent.np);
+ if (ret) {
+ of_node_put(node);
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+int psci_dt_init_pm_domains(struct device_node *np)
+{
+ struct device_node *node;
+ struct genpd_power_state *states;
+ int state_count;
+ int pd_count = 0;
+ int ret;
+
+ /* Parse child nodes for "#power-domain-cells". */
+ for_each_child_of_node(np, node) {
+ if (!of_find_property(node, "#power-domain-cells", NULL))
+ continue;
+
+ ret = of_genpd_parse_idle_states(node, &states, &state_count);
+ if (ret)
+ goto err_put;
+
+ ret = psci_dt_parse_pd_states(states, state_count);
+ if (ret)
+ goto err_put;
+
+ ret = psci_dt_init_genpd(node, states, state_count);
+ if (ret)
+ goto err_put;
+
+ pd_count++;
+ }
+
+ if (!pd_count)
+ return 0;
+
+ ret = psci_dt_set_genpd_topology(np);
+ if (ret)
+ goto err_msg;
+
+ return pd_count;
+
+err_put:
+ of_node_put(node);
+err_msg:
+ pr_err("failed to create PM domains ret=%d\n", ret);
+ return ret;
+}
+#endif
--
2.17.1
Subsequent changes need to be able to call psci_get|set_domain_state() and
psci_dt_parse_state_node() from a separate file. Let's make that possible
by sharing them via a new internal PSCI header file.
Cc: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/psci/psci.c | 14 ++++++++------
drivers/firmware/psci/psci.h | 13 +++++++++++++
2 files changed, 21 insertions(+), 6 deletions(-)
create mode 100644 drivers/firmware/psci/psci.h
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index 40b2b8945018..463f78cda3be 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -34,6 +34,8 @@
#include <asm/smp_plat.h>
#include <asm/suspend.h>
+#include "psci.h"
+
/*
* While a 64-bit OS can make calls with SMC32 calling conventions, for some
* calls it is necessary to use SMC64 to pass or return 64-bit values.
@@ -90,12 +92,12 @@ static u32 psci_function_id[PSCI_FN_MAX];
static DEFINE_PER_CPU(u32, domain_state);
static u32 psci_cpu_suspend_feature;
-static inline u32 psci_get_domain_state(void)
+u32 psci_get_domain_state(void)
{
return this_cpu_read(domain_state);
}
-static inline void psci_set_domain_state(u32 state)
+void psci_set_domain_state(u32 state)
{
this_cpu_write(domain_state, state);
}
@@ -285,10 +287,7 @@ static int __init psci_features(u32 psci_func_id)
psci_func_id, 0, 0);
}
-#ifdef CONFIG_CPU_IDLE
-static DEFINE_PER_CPU_READ_MOSTLY(u32 *, psci_power_state);
-
-static int psci_dt_parse_state_node(struct device_node *np, u32 *state)
+int psci_dt_parse_state_node(struct device_node *np, u32 *state)
{
int err = of_property_read_u32(np, "arm,psci-suspend-param", state);
@@ -305,6 +304,9 @@ static int psci_dt_parse_state_node(struct device_node *np, u32 *state)
return 0;
}
+#ifdef CONFIG_CPU_IDLE
+static DEFINE_PER_CPU_READ_MOSTLY(u32 *, psci_power_state);
+
static int psci_dt_cpu_init_idle(struct device_node *cpu_node, int cpu)
{
int i, ret = 0, count = 0;
diff --git a/drivers/firmware/psci/psci.h b/drivers/firmware/psci/psci.h
new file mode 100644
index 000000000000..dc7b596daa2b
--- /dev/null
+++ b/drivers/firmware/psci/psci.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __PSCI_H
+#define __PSCI_H
+
+struct device_node;
+
+u32 psci_get_domain_state(void);
+void psci_set_domain_state(u32 state);
+
+int psci_dt_parse_state_node(struct device_node *np, u32 *state);
+
+#endif /* __PSCI_H */
--
2.17.1
In case the OS-initiated CPU suspend mode has been enabled, the PM domain
topology for the CPUs has earlier been created by PSCI. Let's use this
information in psci_dt_cpu_init_idle() as a condition for when it makes
sense to try to attach the CPU to its corresponding PM domain, by calling
of_genpd_attach_cpu().
If the CPU is attached successfully to its PM domain, idle management is
now fully prepared to be controlled through runtime PM for the CPU.
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/psci/psci.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index 80c286d83369..700e0e995871 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -20,6 +20,7 @@
#include <linux/linkage.h>
#include <linux/of.h>
#include <linux/pm.h>
+#include <linux/pm_domain.h>
#include <linux/printk.h>
#include <linux/psci.h>
#include <linux/reboot.h>
@@ -91,6 +92,7 @@ static u32 psci_function_id[PSCI_FN_MAX];
static DEFINE_PER_CPU(u32, domain_state);
static u32 psci_cpu_suspend_feature;
+static bool psci_osi_mode_enabled;
u32 psci_get_domain_state(void)
{
@@ -339,6 +341,14 @@ static int psci_dt_cpu_init_idle(struct device_node *cpu_node, int cpu)
/* Idle states parsed correctly, initialize per-cpu pointer */
per_cpu(psci_power_state, cpu) = psci_states;
+
+ /* If running OSI mode, attach the CPU device to its PM domain. */
+ if (psci_osi_mode_enabled) {
+ ret = of_genpd_attach_cpu(cpu);
+ if (ret)
+ goto free_mem;
+ }
+
return 0;
free_mem:
@@ -753,6 +763,7 @@ int __init psci_dt_topology_init(void)
goto out;
}
+ psci_osi_mode_enabled = true;
pr_info("OSI mode enabled.\n");
out:
of_node_put(np);
--
2.17.1
Instead of having each PSCI init function take care of the of_node_put(),
let's deal with that in psci_dt_init(), as this enables a slightly simpler
error path in each PSCI init function.
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
Acked-by: Mark Rutland <[email protected]>
---
drivers/firmware/psci/psci.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index 256b4edbb20a..38881007584e 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -608,9 +608,9 @@ static int __init psci_0_2_init(struct device_node *np)
int err;
err = get_set_conduit_method(np);
-
if (err)
- goto out_put_node;
+ return err;
+
/*
* Starting with v0.2, the PSCI specification introduced a call
* (PSCI_VERSION) that allows probing the firmware version, so
@@ -618,11 +618,7 @@ static int __init psci_0_2_init(struct device_node *np)
* can be carried out according to the specific version reported
* by firmware
*/
- err = psci_probe();
-
-out_put_node:
- of_node_put(np);
- return err;
+ return psci_probe();
}
/*
@@ -634,9 +630,8 @@ static int __init psci_0_1_init(struct device_node *np)
int err;
err = get_set_conduit_method(np);
-
if (err)
- goto out_put_node;
+ return err;
pr_info("Using PSCI v0.1 Function IDs from DT\n");
@@ -660,9 +655,7 @@ static int __init psci_0_1_init(struct device_node *np)
psci_ops.migrate = psci_migrate;
}
-out_put_node:
- of_node_put(np);
- return err;
+ return 0;
}
static const struct of_device_id psci_of_match[] __initconst = {
@@ -677,6 +670,7 @@ int __init psci_dt_init(void)
struct device_node *np;
const struct of_device_id *matched_np;
psci_initcall_t init_fn;
+ int ret;
np = of_find_matching_node_and_match(NULL, psci_of_match, &matched_np);
@@ -684,7 +678,10 @@ int __init psci_dt_init(void)
return -ENODEV;
init_fn = (psci_initcall_t)matched_np->data;
- return init_fn(np);
+ ret = init_fn(np);
+
+ of_node_put(np);
+ return ret;
}
#ifdef CONFIG_ACPI
--
2.17.1
In case the hierarchical layout is used in DT to describe the PM topology
for the CPUs, we want to initialize the corresponding PM domain topology by
using the generic PM domain (aka genpd) infrastructure.
At first glance, it may seem feasible to hook into the existing
psci_dt_init() function, but because it's called quite early in the boot
sequence, allocating the dynamic data structures for a genpd doesn't work.
Therefore, let's export a new init function for PSCI,
psci_dt_topology_init(), which the ARM machine code should call from a
suitable initcall.
Succeeding in initializing the PM domain topology, which means at least one
instance of a genpd has been created, allows us to continue to enable the
PSCI OS-initiated mode for the platform. If everything turns out fine, let's
print a message to the log to inform the user about the changed mode.
In case of any failures, we stick to the default PSCI Platform Coordinated
mode. Moreover, in case the kernel was started via kexec, let's make sure to
explicitly default to this mode during boot, as the previous kernel may have
changed the mode.
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/psci/psci.c | 38 +++++++++++++++++++++++++++++++++++-
include/linux/psci.h | 2 ++
2 files changed, 39 insertions(+), 1 deletion(-)
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index 463f78cda3be..80c286d83369 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -691,9 +691,14 @@ static int __init psci_1_0_init(struct device_node *np)
if (err)
return err;
- if (psci_has_osi_support())
+ if (psci_has_osi_support()) {
pr_info("OSI mode supported.\n");
+ /* Make sure we default to PC mode. */
+ invoke_psci_fn(PSCI_1_0_FN_SET_SUSPEND_MODE,
+ PSCI_1_0_SUSPEND_MODE_PC, 0, 0);
+ }
+
return 0;
}
@@ -723,6 +728,37 @@ int __init psci_dt_init(void)
return ret;
}
+int __init psci_dt_topology_init(void)
+{
+ struct device_node *np;
+ int ret;
+
+ if (!psci_has_osi_support())
+ return 0;
+
+ np = of_find_matching_node_and_match(NULL, psci_of_match, NULL);
+ if (!np)
+ return -ENODEV;
+
+ /* Initialize the CPU PM domains based on topology described in DT. */
+ ret = psci_dt_init_pm_domains(np);
+ if (ret <= 0)
+ goto out;
+
+ /* Enable OSI mode. */
+ ret = invoke_psci_fn(PSCI_1_0_FN_SET_SUSPEND_MODE,
+ PSCI_1_0_SUSPEND_MODE_OSI, 0, 0);
+ if (ret) {
+ pr_info("failed to enable OSI mode: %d\n", ret);
+ goto out;
+ }
+
+ pr_info("OSI mode enabled.\n");
+out:
+ of_node_put(np);
+ return ret;
+}
+
#ifdef CONFIG_ACPI
/*
* We use PSCI 0.2+ when ACPI is deployed on ARM64 and it's
diff --git a/include/linux/psci.h b/include/linux/psci.h
index 8b1b3b5935ab..298a044407f0 100644
--- a/include/linux/psci.h
+++ b/include/linux/psci.h
@@ -53,8 +53,10 @@ extern struct psci_operations psci_ops;
#if defined(CONFIG_ARM_PSCI_FW)
int __init psci_dt_init(void);
+int __init psci_dt_topology_init(void);
#else
static inline int psci_dt_init(void) { return 0; }
+static inline int psci_dt_topology_init(void) { return 0; }
#endif
#if defined(CONFIG_ARM_PSCI_FW) && defined(CONFIG_ACPI)
--
2.17.1
From: Lina Iyer <[email protected]>
Currently the CPU's idle states are represented in a flattened model, via
the "cpu-idle-states" binding from within the CPU's device node.
Let's support the hierarchical layout as well, simply by converting to the
new OF helper, of_get_cpu_state_node().
Cc: Lina Iyer <[email protected]>
Suggested-by: Sudeep Holla <[email protected]>
Signed-off-by: Lina Iyer <[email protected]>
Co-developed-by: Ulf Hansson <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/psci/psci.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index 9788bfc1cf8b..256b4edbb20a 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -294,8 +294,7 @@ static int psci_dt_cpu_init_idle(struct device_node *cpu_node, int cpu)
struct device_node *state_node;
/* Count idle states */
- while ((state_node = of_parse_phandle(cpu_node, "cpu-idle-states",
- count))) {
+ while ((state_node = of_get_cpu_state_node(cpu_node, count))) {
count++;
of_node_put(state_node);
}
@@ -308,7 +307,7 @@ static int psci_dt_cpu_init_idle(struct device_node *cpu_node, int cpu)
return -ENOMEM;
for (i = 0; i < count; i++) {
- state_node = of_parse_phandle(cpu_node, "cpu-idle-states", i);
+ state_node = of_get_cpu_state_node(cpu_node, i);
ret = psci_dt_parse_state_node(state_node, &psci_states[i]);
of_node_put(state_node);
--
2.17.1
The files for the PSCI firmware driver were moved to a sub-directory, so
let's update MAINTAINERS to reflect that.
Suggested-by: Mark Rutland <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
MAINTAINERS | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index 9d5eeff51b5f..3f28c21d0ad0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11389,7 +11389,7 @@ M: Mark Rutland <[email protected]>
M: Lorenzo Pieralisi <[email protected]>
L: [email protected]
S: Maintained
-F: drivers/firmware/psci*.c
+F: drivers/firmware/psci/
F: include/linux/psci.h
F: include/uapi/linux/psci.h
--
2.17.1
To enable the OS initiated mode, the CPU topology needs to be described
using the hierarchical model in DT. When used, the idle state bits for the
CPU need to be created by ORing the bits for the CPU's selected idle state
with the bits for the CPU's PM domain (the CPU's cluster) idle state.
Let's prepare the PSCI driver to deal with this, by introducing a per-CPU
variable called domain_state and by adding internal helpers to read/write
its value.
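As a sketch of the intended composition (a userspace mimic, not the kernel code; the state values and helper names below are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Mimic of the per-CPU domain_state handling: the composite PSCI power
 * state is the bits for the CPU's selected idle state ORed with the bits
 * for the CPU PM domain (cluster) idle state. A plain variable stands in
 * for the per-CPU variable. */
static uint32_t domain_state;

static void set_domain_state(uint32_t state)
{
	domain_state = state;
}

static uint32_t get_domain_state(void)
{
	return domain_state;
}

/* What psci_suspend_finisher()-like code would hand to PSCI. */
static uint32_t composite_state(uint32_t cpu_state)
{
	return cpu_state | get_domain_state();
}
```

The domain state is expected to be cleared again when coming back from idle, so a CPU that wakes up alone does not re-request a cluster state.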
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/psci/psci.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index e8f4f8444ff1..40b2b8945018 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -87,8 +87,19 @@ static u32 psci_function_id[PSCI_FN_MAX];
(PSCI_1_0_EXT_POWER_STATE_ID_MASK | \
PSCI_1_0_EXT_POWER_STATE_TYPE_MASK)
+static DEFINE_PER_CPU(u32, domain_state);
static u32 psci_cpu_suspend_feature;
+static inline u32 psci_get_domain_state(void)
+{
+ return this_cpu_read(domain_state);
+}
+
+static inline void psci_set_domain_state(u32 state)
+{
+ this_cpu_write(domain_state, state);
+}
+
static inline bool psci_has_ext_power_state(void)
{
return psci_cpu_suspend_feature &
@@ -187,6 +198,8 @@ static int psci_cpu_on(unsigned long cpuid, unsigned long entry_point)
fn = psci_function_id[PSCI_FN_CPU_ON];
err = invoke_psci_fn(fn, cpuid, entry_point, 0);
+ /* Clear the domain state to start fresh. */
+ psci_set_domain_state(0);
return psci_to_linux_errno(err);
}
@@ -408,15 +421,17 @@ int psci_cpu_init_idle(unsigned int cpu)
static int psci_suspend_finisher(unsigned long index)
{
u32 *state = __this_cpu_read(psci_power_state);
+ u32 composite_state = state[index - 1] | psci_get_domain_state();
- return psci_ops.cpu_suspend(state[index - 1],
- __pa_symbol(cpu_resume));
+ return psci_ops.cpu_suspend(composite_state, __pa_symbol(cpu_resume));
}
int psci_cpu_suspend_enter(unsigned long index)
{
int ret;
u32 *state = __this_cpu_read(psci_power_state);
+ u32 composite_state = state[index - 1] | psci_get_domain_state();
+
/*
* idle state index 0 corresponds to wfi, should never be called
* from the cpu_suspend operations
@@ -424,11 +439,14 @@ int psci_cpu_suspend_enter(unsigned long index)
if (WARN_ON_ONCE(!index))
return -EINVAL;
- if (!psci_power_state_loses_context(state[index - 1]))
- ret = psci_ops.cpu_suspend(state[index - 1], 0);
+ if (!psci_power_state_loses_context(composite_state))
+ ret = psci_ops.cpu_suspend(composite_state, 0);
else
ret = cpu_suspend(index, psci_suspend_finisher);
+ /* Clear the domain state to start fresh when back from idle. */
+ psci_set_domain_state(0);
+
return ret;
}
--
2.17.1
Let's split the psci_dt_cpu_init_idle() function into two functions, to
allow subsequent changes to re-use some of the code.
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/psci/psci.c | 42 ++++++++++++++++++++----------------
1 file changed, 23 insertions(+), 19 deletions(-)
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index c80ec1d03274..9788bfc1cf8b 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -270,9 +270,26 @@ static int __init psci_features(u32 psci_func_id)
#ifdef CONFIG_CPU_IDLE
static DEFINE_PER_CPU_READ_MOSTLY(u32 *, psci_power_state);
+static int psci_dt_parse_state_node(struct device_node *np, u32 *state)
+{
+ int err = of_property_read_u32(np, "arm,psci-suspend-param", state);
+
+ if (err) {
+ pr_warn("%pOF missing arm,psci-suspend-param property\n", np);
+ return err;
+ }
+
+ if (!psci_power_state_is_valid(*state)) {
+ pr_warn("Invalid PSCI power state %#x\n", *state);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static int psci_dt_cpu_init_idle(struct device_node *cpu_node, int cpu)
{
- int i, ret, count = 0;
+ int i, ret = 0, count = 0;
u32 *psci_states;
struct device_node *state_node;
@@ -291,29 +308,16 @@ static int psci_dt_cpu_init_idle(struct device_node *cpu_node, int cpu)
return -ENOMEM;
for (i = 0; i < count; i++) {
- u32 state;
-
state_node = of_parse_phandle(cpu_node, "cpu-idle-states", i);
+ ret = psci_dt_parse_state_node(state_node, &psci_states[i]);
+ of_node_put(state_node);
- ret = of_property_read_u32(state_node,
- "arm,psci-suspend-param",
- &state);
- if (ret) {
- pr_warn(" * %pOF missing arm,psci-suspend-param property\n",
- state_node);
- of_node_put(state_node);
+ if (ret)
goto free_mem;
- }
- of_node_put(state_node);
- pr_debug("psci-power-state %#x index %d\n", state, i);
- if (!psci_power_state_is_valid(state)) {
- pr_warn("Invalid PSCI power state %#x\n", state);
- ret = -EINVAL;
- goto free_mem;
- }
- psci_states[i] = state;
+ pr_debug("psci-power-state %#x index %d\n", psci_states[i], i);
}
+
/* Idle states parsed correctly, initialize per-cpu pointer */
per_cpu(psci_power_state, cpu) = psci_states;
return 0;
--
2.17.1
Subsequent changes extend the PSCI driver with some additional new files.
Let's avoid cluttering the toplevel firmware directory any further, by
first moving the PSCI files into a PSCI sub-directory.
Suggested-by: Mark Rutland <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/Kconfig | 15 +--------------
drivers/firmware/Makefile | 3 +--
drivers/firmware/psci/Kconfig | 13 +++++++++++++
drivers/firmware/psci/Makefile | 4 ++++
drivers/firmware/{ => psci}/psci.c | 0
drivers/firmware/{ => psci}/psci_checker.c | 0
6 files changed, 19 insertions(+), 16 deletions(-)
create mode 100644 drivers/firmware/psci/Kconfig
create mode 100644 drivers/firmware/psci/Makefile
rename drivers/firmware/{ => psci}/psci.c (100%)
rename drivers/firmware/{ => psci}/psci_checker.c (100%)
diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index 6e83880046d7..923c42d5a2e6 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -5,20 +5,6 @@
menu "Firmware Drivers"
-config ARM_PSCI_FW
- bool
-
-config ARM_PSCI_CHECKER
- bool "ARM PSCI checker"
- depends on ARM_PSCI_FW && HOTPLUG_CPU && CPU_IDLE && !TORTURE_TEST
- help
- Run the PSCI checker during startup. This checks that hotplug and
- suspend operations work correctly when using PSCI.
-
- The torture tests may interfere with the PSCI checker by turning CPUs
- on and off through hotplug, so for now torture tests and PSCI checker
- are mutually exclusive.
-
config ARM_SCMI_PROTOCOL
bool "ARM System Control and Management Interface (SCMI) Message Protocol"
depends on ARM || ARM64 || COMPILE_TEST
@@ -286,6 +272,7 @@ config TI_SCI_PROTOCOL
config HAVE_ARM_SMCCC
bool
+source "drivers/firmware/psci/Kconfig"
source "drivers/firmware/broadcom/Kconfig"
source "drivers/firmware/google/Kconfig"
source "drivers/firmware/efi/Kconfig"
diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index e18a041cfc53..ea284e551dc8 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -2,8 +2,6 @@
#
# Makefile for the linux kernel.
#
-obj-$(CONFIG_ARM_PSCI_FW) += psci.o
-obj-$(CONFIG_ARM_PSCI_CHECKER) += psci_checker.o
obj-$(CONFIG_ARM_SCPI_PROTOCOL) += arm_scpi.o
obj-$(CONFIG_ARM_SCPI_POWER_DOMAIN) += scpi_pm_domain.o
obj-$(CONFIG_ARM_SDE_INTERFACE) += arm_sdei.o
@@ -26,6 +24,7 @@ CFLAGS_qcom_scm-32.o :=$(call as-instr,.arch armv7-a\n.arch_extension sec,-DREQU
obj-$(CONFIG_TI_SCI_PROTOCOL) += ti_sci.o
obj-$(CONFIG_ARM_SCMI_PROTOCOL) += arm_scmi/
+obj-y += psci/
obj-y += broadcom/
obj-y += meson/
obj-$(CONFIG_GOOGLE_FIRMWARE) += google/
diff --git a/drivers/firmware/psci/Kconfig b/drivers/firmware/psci/Kconfig
new file mode 100644
index 000000000000..26a3b32bf7ab
--- /dev/null
+++ b/drivers/firmware/psci/Kconfig
@@ -0,0 +1,13 @@
+config ARM_PSCI_FW
+ bool
+
+config ARM_PSCI_CHECKER
+ bool "ARM PSCI checker"
+ depends on ARM_PSCI_FW && HOTPLUG_CPU && CPU_IDLE && !TORTURE_TEST
+ help
+ Run the PSCI checker during startup. This checks that hotplug and
+ suspend operations work correctly when using PSCI.
+
+ The torture tests may interfere with the PSCI checker by turning CPUs
+ on and off through hotplug, so for now torture tests and PSCI checker
+ are mutually exclusive.
diff --git a/drivers/firmware/psci/Makefile b/drivers/firmware/psci/Makefile
new file mode 100644
index 000000000000..1956b882470f
--- /dev/null
+++ b/drivers/firmware/psci/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+obj-$(CONFIG_ARM_PSCI_FW) += psci.o
+obj-$(CONFIG_ARM_PSCI_CHECKER) += psci_checker.o
diff --git a/drivers/firmware/psci.c b/drivers/firmware/psci/psci.c
similarity index 100%
rename from drivers/firmware/psci.c
rename to drivers/firmware/psci/psci.c
diff --git a/drivers/firmware/psci_checker.c b/drivers/firmware/psci/psci_checker.c
similarity index 100%
rename from drivers/firmware/psci_checker.c
rename to drivers/firmware/psci/psci_checker.c
--
2.17.1
CPU devices and other regular devices may share the same PM domain and may
also be hierarchically related via subdomains. In either case, all devices
including CPUs, may be attached to a PM domain managed by genpd, that has
an idle state with an enter/exit latency.
Let's take these latencies into account in the state selection process by
genpd's governor for CPUs. This means the governor, pm_domain_cpu_gov, is
extended to satisfy both a state's residency and a potential dev PM QoS
constraint.
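A rough sketch of the combined selection, under the assumption that the dev PM QoS validation has already picked the deepest state index that satisfies the latency constraints (the function name and state layout are made up; this is not the kernel code):

```c
#include <assert.h>

struct pd_state {
	long long residency_ns;
	long long power_off_latency_ns;
};

/* Start the residency scan at qos_state_idx, the deepest state assumed to
 * satisfy the dev PM QoS latency constraints, instead of at the very
 * deepest supported state. Returns the chosen index, or -1 if no state
 * fits within the expected idle duration. */
static int pick_state_with_qos(const struct pd_state *states,
			       int qos_state_idx, long long idle_duration_ns)
{
	int i = qos_state_idx;

	do {
		if (!states[i].residency_ns)
			break;
		if (idle_duration_ns >= 0 &&
		    idle_duration_ns >= states[i].residency_ns +
					states[i].power_off_latency_ns)
			break;
		i--;
	} while (i >= 0);

	return i;
}
```

In other words, the QoS check caps how deep the residency scan may start, rather than being a separate pass over the states.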
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/base/power/domain_governor.c | 15 +++++++++++----
include/linux/pm_domain.h | 1 +
2 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
index 1aad55719537..03d4e9454ce9 100644
--- a/drivers/base/power/domain_governor.c
+++ b/drivers/base/power/domain_governor.c
@@ -214,8 +214,10 @@ static bool default_power_down_ok(struct dev_pm_domain *pd)
struct generic_pm_domain *genpd = pd_to_genpd(pd);
struct gpd_link *link;
- if (!genpd->max_off_time_changed)
+ if (!genpd->max_off_time_changed) {
+ genpd->state_idx = genpd->cached_power_down_state_idx;
return genpd->cached_power_down_ok;
+ }
/*
* We have to invalidate the cached results for the masters, so
@@ -240,6 +242,7 @@ static bool default_power_down_ok(struct dev_pm_domain *pd)
genpd->state_idx--;
}
+ genpd->cached_power_down_state_idx = genpd->state_idx;
return genpd->cached_power_down_ok;
}
@@ -255,6 +258,10 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
s64 idle_duration_ns;
int cpu, i;
+ /* Validate dev PM QoS constraints. */
+ if (!default_power_down_ok(pd))
+ return false;
+
if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
return true;
@@ -276,9 +283,9 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
/*
* Find the deepest idle state that has its residency value satisfied
* and by also taking into account the power off latency for the state.
- * Start at the deepest supported state.
+ * Start at the state picked by the dev PM QoS constraint validation.
*/
- i = genpd->state_count - 1;
+ i = genpd->state_idx;
do {
if (!genpd->states[i].residency_ns)
break;
@@ -312,6 +319,6 @@ struct dev_power_governor pm_domain_always_on_gov = {
};
struct dev_power_governor pm_domain_cpu_gov = {
- .suspend_ok = NULL,
+ .suspend_ok = default_suspend_ok,
.power_down_ok = cpu_power_down_ok,
};
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 97901c833108..dbc69721cad8 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -81,6 +81,7 @@ struct generic_pm_domain {
s64 max_off_time_ns; /* Maximum allowed "suspended" time. */
bool max_off_time_changed;
bool cached_power_down_ok;
+	unsigned int cached_power_down_state_idx;
int (*attach_dev)(struct generic_pm_domain *domain,
struct device *dev);
void (*detach_dev)(struct generic_pm_domain *domain,
--
2.17.1
From: Lina Iyer <[email protected]>
Update DT bindings to represent hierarchical CPU and CPU PM domain idle
states for PSCI. Also update the PSCI examples to clearly show how
flattened and hierarchical idle states can be represented in DT.
Cc: Lina Iyer <[email protected]>
Signed-off-by: Lina Iyer <[email protected]>
Co-developed-by: Ulf Hansson <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Reviewed-by: Sudeep Holla <[email protected]>
---
.../devicetree/bindings/arm/psci.txt | 156 ++++++++++++++++++
1 file changed, 156 insertions(+)
diff --git a/Documentation/devicetree/bindings/arm/psci.txt b/Documentation/devicetree/bindings/arm/psci.txt
index a2c4f1d52492..17aa3d3a1c8e 100644
--- a/Documentation/devicetree/bindings/arm/psci.txt
+++ b/Documentation/devicetree/bindings/arm/psci.txt
@@ -105,7 +105,163 @@ Case 3: PSCI v0.2 and PSCI v0.1.
...
};
+ARM systems can have multiple cores sometimes in hierarchical arrangement.
+This often, but not always, maps directly to the processor power topology of
+the system. Individual nodes in a topology have their own specific power states
+and can be better represented in DT hierarchically.
+
+For these cases, the definitions of the idle states for the CPUs and the CPU
+topology, must conform to the domain idle state specification [3]. The domain
+idle states themselves, must be compatible with the defined 'domain-idle-state'
+binding [1], and also need to specify the arm,psci-suspend-param property for
+each idle state.
+
+DT allows representing CPUs and CPU idle states in two different ways -
+
+The flattened model as given in Example 1, lists CPU's idle states followed by
+the domain idle state that the CPUs may choose. Note that the idle states are
+all compatible with "arm,idle-state".
+
+Example 2 represents the hierarchical model of CPUs and domain idle states.
+CPUs define their domain provider in their psci DT node. The domain controls
+the power to the CPU and possibly other h/w blocks that would enter an idle
+state along with the CPU. The CPU's idle states may therefore be considered as
+the domain's idle states and have the compatible "arm,idle-state". Such domains
+may also be embedded within another domain that may represent common h/w blocks
+between these CPUs. The idle states of the CPU topology shall be represented as
+the domain's idle states.
+
+In PSCI firmware v1.0, the OS-Initiated mode is introduced. In order to use it,
+the hierarchical representation must be used.
+
+Example 1: Flattened representation of CPU and domain idle states
+ cpus {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ CPU0: cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53", "arm,armv8";
+ reg = <0x0>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_PWRDN>, <&CLUSTER_RET>,
+ <&CLUSTER_PWRDN>;
+ };
+
+ CPU1: cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57", "arm,armv8";
+ reg = <0x100>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_PWRDN>, <&CLUSTER_RET>,
+ <&CLUSTER_PWRDN>;
+ };
+
+ idle-states {
+ CPU_PWRDN: cpu-power-down {
+ compatible = "arm,idle-state";
+ arm,psci-suspend-param = <0x000001>;
+ entry-latency-us = <10>;
+ exit-latency-us = <10>;
+ min-residency-us = <100>;
+ };
+
+ CLUSTER_RET: cluster-retention {
+ compatible = "arm,idle-state";
+ arm,psci-suspend-param = <0x1000010>;
+ entry-latency-us = <500>;
+ exit-latency-us = <500>;
+ min-residency-us = <2000>;
+ };
+
+ CLUSTER_PWRDN: cluster-power-down {
+ compatible = "arm,idle-state";
+ arm,psci-suspend-param = <0x1000030>;
+ entry-latency-us = <2000>;
+ exit-latency-us = <2000>;
+ min-residency-us = <6000>;
+ };
+ };
+
+ psci {
+ compatible = "arm,psci-0.2";
+ method = "smc";
+ };
+
+Example 2: Hierarchical representation of CPU and domain idle states
+
+ cpus {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ CPU0: cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53", "arm,armv8";
+ reg = <0x0>;
+ enable-method = "psci";
+ power-domains = <&CPU_PD0>;
+ };
+
+ CPU1: cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57", "arm,armv8";
+ reg = <0x100>;
+ enable-method = "psci";
+ power-domains = <&CPU_PD1>;
+ };
+
+ idle-states {
+ CPU_PWRDN: cpu-power-down {
+ compatible = "arm,idle-state";
+ arm,psci-suspend-param = <0x000001>;
+ entry-latency-us = <10>;
+ exit-latency-us = <10>;
+ min-residency-us = <100>;
+ };
+
+ CLUSTER_RET: cluster-retention {
+ compatible = "domain-idle-state";
+ arm,psci-suspend-param = <0x1000010>;
+ entry-latency-us = <500>;
+ exit-latency-us = <500>;
+ min-residency-us = <2000>;
+ };
+
+ CLUSTER_PWRDN: cluster-power-down {
+ compatible = "domain-idle-state";
+ arm,psci-suspend-param = <0x1000030>;
+ entry-latency-us = <2000>;
+ exit-latency-us = <2000>;
+ min-residency-us = <6000>;
+ };
+ };
+ };
+
+ psci {
+ compatible = "arm,psci-1.0";
+ method = "smc";
+
+ CPU_PD0: cpu-pd0 {
+ #power-domain-cells = <0>;
+ domain-idle-states = <&CPU_PWRDN>;
+ power-domains = <&CLUSTER_PD>;
+ };
+
+ CPU_PD1: cpu-pd1 {
+ #power-domain-cells = <0>;
+ domain-idle-states = <&CPU_PWRDN>;
+ power-domains = <&CLUSTER_PD>;
+ };
+
+ CLUSTER_PD: cluster-pd {
+ #power-domain-cells = <0>;
+ domain-idle-states = <&CLUSTER_RET>, <&CLUSTER_PWRDN>;
+ };
+ };
+
[1] Kernel documentation - ARM idle states bindings
Documentation/devicetree/bindings/arm/idle-states.txt
[2] Power State Coordination Interface (PSCI) specification
http://infocenter.arm.com/help/topic/com.arm.doc.den0022c/DEN0022C_Power_State_Coordination_Interface.pdf
+[3]. PM Domains description
+ Documentation/devicetree/bindings/power/power_domain.txt
--
2.17.1
From: Lina Iyer <[email protected]>
Knowing the sleep duration of CPUs is needed when selecting the most
energy-efficient idle state for a CPU or a group of CPUs.
However, to be able to compute the sleep duration, we need to know at what
time the next expected wakeup is for the CPU. Therefore, let's export this
information via a new function, tick_nohz_get_next_wakeup(). Following
changes make use of it.
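The intended use can be sketched as follows (a userspace mimic working on plain nanosecond values; LLONG_MAX stands in for a KTIME_SEC_MAX-style "no wakeup scheduled" value, and the function name is made up):

```c
#include <assert.h>
#include <limits.h>

/* The expected sleep duration for a group of CPUs is the earliest of the
 * per-CPU next-wakeup times, minus "now". A negative result means a
 * wakeup is already overdue, so powering down would be pointless. */
static long long min_sleep_duration_ns(const long long *next_wakeup_ns,
				       int ncpus, long long now_ns)
{
	long long earliest = LLONG_MAX;
	int i;

	for (i = 0; i < ncpus; i++)
		if (next_wakeup_ns[i] < earliest)
			earliest = next_wakeup_ns[i];

	return earliest - now_ns;
}
```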
Cc: Thomas Gleixner <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Lina Iyer <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Lina Iyer <[email protected]>
Co-developed-by: Ulf Hansson <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
include/linux/tick.h | 8 ++++++++
kernel/time/tick-sched.c | 10 ++++++++++
2 files changed, 18 insertions(+)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 55388ab45fd4..e48f6b26b425 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -125,6 +125,7 @@ extern bool tick_nohz_idle_got_tick(void);
extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
extern unsigned long tick_nohz_get_idle_calls(void);
extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
+extern ktime_t tick_nohz_get_next_wakeup(int cpu);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
@@ -151,6 +152,13 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
*delta_next = TICK_NSEC;
return *delta_next;
}
+
+static inline ktime_t tick_nohz_get_next_wakeup(int cpu)
+{
+ /* Next wake up is the tick period, assume it starts now */
+ return ktime_add(ktime_get(), TICK_NSEC);
+}
+
static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index da9455a6b42b..f380bb4f0744 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1089,6 +1089,16 @@ unsigned long tick_nohz_get_idle_calls(void)
return ts->idle_calls;
}
+/**
+ * tick_nohz_get_next_wakeup - return the next wake up of the CPU
+ */
+ktime_t tick_nohz_get_next_wakeup(int cpu)
+{
+ struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
+
+ return dev->next_event;
+}
+
static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
{
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
--
2.17.1
From: Lina Iyer <[email protected]>
Currently a CPU's idle states are represented in a flattened model, via the
"cpu-idle-states" binding from within the CPU's device nodes.
Support the hierarchical layout during parsing and validation of the CPU's
idle states. This is simply done by calling the new OF helper,
of_get_cpu_state_node().
Cc: Lina Iyer <[email protected]>
Suggested-by: Sudeep Holla <[email protected]>
Signed-off-by: Lina Iyer <[email protected]>
Co-developed-by: Ulf Hansson <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/cpuidle/dt_idle_states.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/cpuidle/dt_idle_states.c b/drivers/cpuidle/dt_idle_states.c
index 53342b7f1010..13f9b7cd32d1 100644
--- a/drivers/cpuidle/dt_idle_states.c
+++ b/drivers/cpuidle/dt_idle_states.c
@@ -118,8 +118,7 @@ static bool idle_state_valid(struct device_node *state_node, unsigned int idx,
for (cpu = cpumask_next(cpumask_first(cpumask), cpumask);
cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpumask)) {
cpu_node = of_cpu_device_node_get(cpu);
- curr_state_node = of_parse_phandle(cpu_node, "cpu-idle-states",
- idx);
+ curr_state_node = of_get_cpu_state_node(cpu_node, idx);
if (state_node != curr_state_node)
valid = false;
@@ -176,7 +175,7 @@ int dt_init_idle_driver(struct cpuidle_driver *drv,
cpu_node = of_cpu_device_node_get(cpumask_first(cpumask));
for (i = 0; ; i++) {
- state_node = of_parse_phandle(cpu_node, "cpu-idle-states", i);
+ state_node = of_get_cpu_state_node(cpu_node, i);
if (!state_node)
break;
--
2.17.1
As it's now perfectly possible that a PM domain managed by genpd contains
devices belonging to CPUs, we should start to take into account the
residency values for the idle states during the state selection process.
The residency value specifies the minimum duration of time that the CPU, or
a group of CPUs, needs to spend in an idle state to not waste energy by
entering it.
To deal with this, let's add a new genpd governor, pm_domain_cpu_gov, that
may be used for a PM domain that has CPU devices attached, or when the CPUs
are attached through subdomains.
The new governor computes the minimum expected idle duration time for the
online CPUs being attached to the PM domain and its subdomains. Then in the
state selection process, trying the deepest state first, it verifies that
the idle duration time satisfies the state's residency value.
It should be noted that, when computing the minimum expected idle duration
time, we use the information from tick_nohz_get_next_wakeup() to find the
next wakeup for the related CPUs. Going forward, this may deserve to be
improved, as there are more reasons why a CPU may be woken up from idle.
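The selection loop itself can be sketched like this (a standalone mimic of the logic, not the kernel code; the state table is illustrative):

```c
#include <assert.h>

struct pd_state {
	long long residency_ns;
	long long power_off_latency_ns;
};

/* Try the deepest state first and pick the first one whose residency plus
 * power-off latency fits within the expected idle duration. A state with
 * a zero residency is considered to always qualify. Returns the chosen
 * state index, or -1 when powering down would waste energy. */
static int pick_deepest_state(const struct pd_state *states, int count,
			      long long idle_duration_ns)
{
	int i = count - 1;

	do {
		if (!states[i].residency_ns)
			break;
		if (idle_duration_ns >= 0 &&
		    idle_duration_ns >= states[i].residency_ns +
					states[i].power_off_latency_ns)
			break;
		i--;
	} while (i >= 0);

	return i;
}
```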
Cc: Thomas Gleixner <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Lina Iyer <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/base/power/domain_governor.c | 58 ++++++++++++++++++++++++++++
include/linux/pm_domain.h | 2 +
2 files changed, 60 insertions(+)
diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
index 99896fbf18e4..1aad55719537 100644
--- a/drivers/base/power/domain_governor.c
+++ b/drivers/base/power/domain_governor.c
@@ -10,6 +10,9 @@
#include <linux/pm_domain.h>
#include <linux/pm_qos.h>
#include <linux/hrtimer.h>
+#include <linux/cpumask.h>
+#include <linux/ktime.h>
+#include <linux/tick.h>
static int dev_update_qos_constraint(struct device *dev, void *data)
{
@@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
return false;
}
+static bool cpu_power_down_ok(struct dev_pm_domain *pd)
+{
+ struct generic_pm_domain *genpd = pd_to_genpd(pd);
+ ktime_t domain_wakeup, cpu_wakeup;
+ s64 idle_duration_ns;
+ int cpu, i;
+
+ if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
+ return true;
+
+ /*
+ * Find the next wakeup for any of the online CPUs within the PM domain
+ * and its subdomains. Note, we only need the genpd->cpus, as it already
+ * contains a mask of all CPUs from subdomains.
+ */
+ domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
+ for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
+ cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
+ if (ktime_before(cpu_wakeup, domain_wakeup))
+ domain_wakeup = cpu_wakeup;
+ }
+
+ /* The minimum idle duration is from now - until the next wakeup. */
+ idle_duration_ns = ktime_to_ns(ktime_sub(domain_wakeup, ktime_get()));
+
+ /*
+ * Find the deepest idle state that has its residency value satisfied
+ * and by also taking into account the power off latency for the state.
+ * Start at the deepest supported state.
+ */
+ i = genpd->state_count - 1;
+ do {
+ if (!genpd->states[i].residency_ns)
+ break;
+
+ /* Check idle_duration_ns >= 0 to compare signed/unsigned. */
+ if (idle_duration_ns >= 0 && idle_duration_ns >=
+ (genpd->states[i].residency_ns +
+ genpd->states[i].power_off_latency_ns))
+ break;
+ i--;
+ } while (i >= 0);
+
+ if (i < 0)
+ return false;
+
+ genpd->state_idx = i;
+ return true;
+}
+
struct dev_power_governor simple_qos_governor = {
.suspend_ok = default_suspend_ok,
.power_down_ok = default_power_down_ok,
@@ -257,3 +310,8 @@ struct dev_power_governor pm_domain_always_on_gov = {
.power_down_ok = always_on_power_down_ok,
.suspend_ok = default_suspend_ok,
};
+
+struct dev_power_governor pm_domain_cpu_gov = {
+ .suspend_ok = NULL,
+ .power_down_ok = cpu_power_down_ok,
+};
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 2c09cf80b285..97901c833108 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -160,6 +160,7 @@ int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state);
extern struct dev_power_governor simple_qos_governor;
extern struct dev_power_governor pm_domain_always_on_gov;
+extern struct dev_power_governor pm_domain_cpu_gov;
#else
static inline struct generic_pm_domain_data *dev_gpd_data(struct device *dev)
@@ -203,6 +204,7 @@ static inline int dev_pm_genpd_set_performance_state(struct device *dev,
#define simple_qos_governor (*(struct dev_power_governor *)(NULL))
#define pm_domain_always_on_gov (*(struct dev_power_governor *)(NULL))
+#define pm_domain_cpu_gov (*(struct dev_power_governor *)(NULL))
#endif
#ifdef CONFIG_PM_GENERIC_DOMAINS_SLEEP
--
2.17.1
A caller of pm_genpd_init() that provides some states for the genpd via the
->states pointer in the struct generic_pm_domain, should also provide a
governor. This is because it's the job of the governor to pick a state that
satisfies the constraints.
Therefore, let's print a warning to inform the user about such a bogus
configuration, but avoid bailing out; instead, pick the shallowest state
before genpd invokes the ->power_off() callback.
Cc: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
Reviewed-by: Lina Iyer <[email protected]>
---
drivers/base/power/domain.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 62969c3d5d04..21d298e1820b 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -467,6 +467,10 @@ static int genpd_power_off(struct generic_pm_domain *genpd, bool one_dev_on,
return -EAGAIN;
}
+ /* Default to shallowest state. */
+ if (!genpd->gov)
+ genpd->state_idx = 0;
+
if (genpd->power_off) {
int ret;
@@ -1687,6 +1691,8 @@ int pm_genpd_init(struct generic_pm_domain *genpd,
ret = genpd_set_default_power_state(genpd);
if (ret)
return ret;
+ } else if (!gov) {
+ pr_warn("%s : no governor for states\n", genpd->name);
}
device_initialize(&genpd->dev);
--
2.17.1
To enable a device belonging to a CPU to be attached to a PM domain managed
by genpd, let's make a few changes to genpd that make it convenient to
manage the specifics around CPUs.
First, to be able to quickly find out which CPUs are attached to a genpd,
which typically becomes useful from a genpd governor, as the following
changes are about to show, let's add a cpumask 'cpus' to the struct
generic_pm_domain.
When a device that belongs to a CPU is attached to or detached from its
corresponding PM domain via genpd_add_device(), let's update the cpumask in
genpd->cpus. Moreover, propagate the update of the cpumask to the master
domains, which makes genpd->cpus contain a cpumask that hierarchically
reflects all CPUs for a genpd, including CPUs attached to subdomains.
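The propagation to master domains can be sketched as follows (a userspace mimic; the fixed-size parent array and the uint64_t mask are illustrative simplifications of the slave_links walk and cpumask_var_t):

```c
#include <assert.h>
#include <stdint.h>

/* Updating a CPU bit in a domain also updates every master (parent)
 * domain, so a master's mask hierarchically reflects all CPUs of its
 * subdomains. */
struct pd {
	uint64_t cpus;		/* stands in for cpumask_var_t */
	struct pd *masters[4];	/* stands in for the slave_links list */
	int nr_masters;
};

static void pd_update_cpu(struct pd *pd, int cpu, int set)
{
	int i;

	for (i = 0; i < pd->nr_masters; i++)
		pd_update_cpu(pd->masters[i], cpu, set);

	if (set)
		pd->cpus |= 1ULL << cpu;
	else
		pd->cpus &= ~(1ULL << cpu);
}
```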
Second, unconditionally managing CPUs and the cpumask in genpd->cpus is
unnecessary when only non-CPU devices are part of a genpd. Let's avoid this
by adding a new configuration bit, GENPD_FLAG_CPU_DOMAIN. Clients must set
the bit before they call pm_genpd_init(), to instruct genpd that it shall
deal with CPUs and thus manage the cpumask in genpd->cpus.
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/base/power/domain.c | 69 ++++++++++++++++++++++++++++++++++++-
include/linux/pm_domain.h | 3 ++
2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 21d298e1820b..6149ce0bfa7b 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -20,6 +20,7 @@
#include <linux/sched.h>
#include <linux/suspend.h>
#include <linux/export.h>
+#include <linux/cpu.h>
#include "power.h"
@@ -126,6 +127,7 @@ static const struct genpd_lock_ops genpd_spin_ops = {
#define genpd_is_irq_safe(genpd) (genpd->flags & GENPD_FLAG_IRQ_SAFE)
#define genpd_is_always_on(genpd) (genpd->flags & GENPD_FLAG_ALWAYS_ON)
#define genpd_is_active_wakeup(genpd) (genpd->flags & GENPD_FLAG_ACTIVE_WAKEUP)
+#define genpd_is_cpu_domain(genpd) (genpd->flags & GENPD_FLAG_CPU_DOMAIN)
static inline bool irq_safe_dev_in_no_sleep_domain(struct device *dev,
const struct generic_pm_domain *genpd)
@@ -1377,6 +1379,62 @@ static void genpd_free_dev_data(struct device *dev,
dev_pm_put_subsys_data(dev);
}
+static void __genpd_update_cpumask(struct generic_pm_domain *genpd,
+ int cpu, bool set, unsigned int depth)
+{
+ struct gpd_link *link;
+
+ if (!genpd_is_cpu_domain(genpd))
+ return;
+
+ list_for_each_entry(link, &genpd->slave_links, slave_node) {
+ struct generic_pm_domain *master = link->master;
+
+ genpd_lock_nested(master, depth + 1);
+ __genpd_update_cpumask(master, cpu, set, depth + 1);
+ genpd_unlock(master);
+ }
+
+ if (set)
+ cpumask_set_cpu(cpu, genpd->cpus);
+ else
+ cpumask_clear_cpu(cpu, genpd->cpus);
+}
+
+static void genpd_update_cpumask(struct generic_pm_domain *genpd,
+ struct device *dev, bool set)
+{
+ bool is_cpu = false;
+ int cpu;
+
+ if (!genpd_is_cpu_domain(genpd))
+ return;
+
+ for_each_possible_cpu(cpu) {
+ if (get_cpu_device(cpu) == dev) {
+ is_cpu = true;
+ break;
+ }
+ }
+
+ if (!is_cpu)
+ return;
+
+ __genpd_update_cpumask(genpd, cpu, set, 0);
+}
+
+static void genpd_set_cpumask(struct generic_pm_domain *genpd,
+ struct device *dev)
+{
+ genpd_update_cpumask(genpd, dev, true);
+}
+
+static void genpd_clear_cpumask(struct generic_pm_domain *genpd,
+ struct device *dev)
+{
+ genpd_update_cpumask(genpd, dev, false);
+}
+
static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
struct gpd_timing_data *td)
{
@@ -1398,6 +1456,8 @@ static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
if (ret)
goto out;
+ genpd_set_cpumask(genpd, dev);
+
dev_pm_domain_set(dev, &genpd->domain);
genpd->device_count++;
@@ -1459,6 +1519,7 @@ static int genpd_remove_device(struct generic_pm_domain *genpd,
if (genpd->detach_dev)
genpd->detach_dev(genpd, dev);
+ genpd_clear_cpumask(genpd, dev);
dev_pm_domain_set(dev, NULL);
list_del_init(&pdd->list_node);
@@ -1686,11 +1747,16 @@ int pm_genpd_init(struct generic_pm_domain *genpd,
if (genpd_is_always_on(genpd) && !genpd_status_on(genpd))
return -EINVAL;
+ if (!zalloc_cpumask_var(&genpd->cpus, GFP_KERNEL))
+ return -ENOMEM;
+
/* Use only one "off" state if there were no states declared */
if (genpd->state_count == 0) {
ret = genpd_set_default_power_state(genpd);
- if (ret)
+ if (ret) {
+ free_cpumask_var(genpd->cpus);
return ret;
+ }
} else if (!gov) {
pr_warn("%s : no governor for states\n", genpd->name);
}
@@ -1736,6 +1802,7 @@ static int genpd_remove(struct generic_pm_domain *genpd)
list_del(&genpd->gpd_list_node);
genpd_unlock(genpd);
cancel_work_sync(&genpd->power_off_work);
+ free_cpumask_var(genpd->cpus);
kfree(genpd->free);
pr_debug("%s: removed %s\n", __func__, genpd->name);
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 27fca748344a..3f67ff0c1c69 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -16,12 +16,14 @@
#include <linux/of.h>
#include <linux/notifier.h>
#include <linux/spinlock.h>
+#include <linux/cpumask.h>
/* Defines used for the flags field in the struct generic_pm_domain */
#define GENPD_FLAG_PM_CLK (1U << 0) /* PM domain uses PM clk */
#define GENPD_FLAG_IRQ_SAFE (1U << 1) /* PM domain operates in atomic */
#define GENPD_FLAG_ALWAYS_ON (1U << 2) /* PM domain is always powered on */
#define GENPD_FLAG_ACTIVE_WAKEUP (1U << 3) /* Keep devices active if wakeup */
+#define GENPD_FLAG_CPU_DOMAIN (1U << 4) /* PM domain manages CPUs */
enum gpd_status {
GPD_STATE_ACTIVE = 0, /* PM domain is active */
@@ -68,6 +70,7 @@ struct generic_pm_domain {
unsigned int suspended_count; /* System suspend device counter */
unsigned int prepared_count; /* Suspend counter of prepared devices */
unsigned int performance_state; /* Aggregated max performance state */
+ cpumask_var_t cpus; /* A cpumask of the attached CPUs */
int (*power_off)(struct generic_pm_domain *domain);
int (*power_on)(struct generic_pm_domain *domain);
unsigned int (*opp_to_performance_state)(struct generic_pm_domain *genpd,
--
2.17.1
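As a side note for reviewers, the recursion this patch adds can be modeled in plain userspace C. The sketch below uses hypothetical simplified types (a plain `unsigned long` bitmask and an explicit masters array instead of `cpumask_var_t` and `slave_links`), and omits the `genpd_is_cpu_domain()` guard and locking — it only illustrates the propagation order:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model (hypothetical types, not the kernel API) of the
 * recursion in __genpd_update_cpumask(): the CPU's bit is propagated
 * to all master domains before being set/cleared in the domain
 * itself, so every domain's cpumask hierarchically covers the CPUs
 * of its subdomains.
 */
struct domain {
	unsigned long cpus;		/* stands in for cpumask_var_t */
	struct domain *masters[4];	/* stands in for slave_links */
	int nr_masters;
};

static void update_cpumask(struct domain *d, int cpu, bool set)
{
	int i;

	for (i = 0; i < d->nr_masters; i++)
		update_cpumask(d->masters[i], cpu, set);

	if (set)
		d->cpus |= 1UL << cpu;
	else
		d->cpus &= ~(1UL << cpu);
}
```

Attaching CPU 2 to a subdomain thus marks the bit in the subdomain and in every master above it; detaching clears it everywhere.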
To allow CPUs to be power managed by PM domains, let's add support for
runtime PM for the CPU's corresponding struct device.
More precisely, at the point when the CPU is about to enter an idle state,
decrease the runtime PM usage count for its corresponding struct device,
via calling pm_runtime_put_sync_suspend(). Then, at the point when the CPU
resumes from idle, let's increase the runtime PM usage count, via calling
pm_runtime_get_sync().
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
kernel/cpu_pm.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c
index 67b02e138a47..492d4a83dca0 100644
--- a/kernel/cpu_pm.c
+++ b/kernel/cpu_pm.c
@@ -16,9 +16,11 @@
*/
#include <linux/kernel.h>
+#include <linux/cpu.h>
#include <linux/cpu_pm.h>
#include <linux/module.h>
#include <linux/notifier.h>
+#include <linux/pm_runtime.h>
#include <linux/spinlock.h>
#include <linux/syscore_ops.h>
@@ -91,6 +93,7 @@ int cpu_pm_enter(void)
{
int nr_calls;
int ret = 0;
+ struct device *dev = get_cpu_device(smp_processor_id());
ret = cpu_pm_notify(CPU_PM_ENTER, -1, &nr_calls);
if (ret)
@@ -100,6 +103,9 @@ int cpu_pm_enter(void)
*/
cpu_pm_notify(CPU_PM_ENTER_FAILED, nr_calls - 1, NULL);
+ if (!ret && dev && dev->pm_domain)
+ pm_runtime_put_sync_suspend(dev);
+
return ret;
}
EXPORT_SYMBOL_GPL(cpu_pm_enter);
@@ -118,6 +124,11 @@ EXPORT_SYMBOL_GPL(cpu_pm_enter);
*/
int cpu_pm_exit(void)
{
+ struct device *dev = get_cpu_device(smp_processor_id());
+
+ if (dev && dev->pm_domain)
+ pm_runtime_get_sync(dev);
+
return cpu_pm_notify(CPU_PM_EXIT, -1, NULL);
}
EXPORT_SYMBOL_GPL(cpu_pm_exit);
--
2.17.1
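For readers unfamiliar with the runtime PM usage-count convention this patch relies on, here is a userspace model (hypothetical names, not the kernel API) of the put/get pairing around idle entry and exit:

```c
#include <assert.h>

/*
 * Userspace model (hypothetical names) of the reference counting this
 * patch adds: cpu_pm_enter() drops the CPU device's runtime PM usage
 * count, and only when it reaches zero may the PM domain be powered
 * off; cpu_pm_exit() takes the count back so the domain is powered on
 * again before the CPU resumes executing.
 */
struct cpu_dev {
	int usage_count;
	int domain_off;		/* 1 while the domain is powered off */
};

static void model_put_sync_suspend(struct cpu_dev *dev)
{
	if (--dev->usage_count == 0)
		dev->domain_off = 1;	/* genpd may now power off */
}

static void model_get_sync(struct cpu_dev *dev)
{
	if (dev->usage_count++ == 0)
		dev->domain_off = 0;	/* domain powered back on */
}
```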
Introduce two new genpd helper functions, of_genpd_attach|detach_cpu(),
which take the CPU number as an in-parameter.
To attach a CPU to a genpd, of_genpd_attach_cpu() starts by fetching the
struct device belonging to the CPU. Then it calls genpd_dev_pm_attach(),
which via DT tries to hook up the CPU device to its corresponding PM
domain. If it succeeds, of_genpd_attach_cpu() continues to prepare/enable
runtime PM of the device.
To detach a CPU from its PM domain, of_genpd_detach_cpu() reverses the
operations made by of_genpd_attach_cpu(). However, first it checks that
the CPU device has a valid PM domain pointer assigned, so as to make sure
it belongs to genpd.
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/base/power/domain.c | 69 +++++++++++++++++++++++++++++++++++++
include/linux/pm_domain.h | 9 +++++
2 files changed, 78 insertions(+)
diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 6149ce0bfa7b..299fa2febbec 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -2445,6 +2445,75 @@ struct device *genpd_dev_pm_attach_by_id(struct device *dev,
}
EXPORT_SYMBOL_GPL(genpd_dev_pm_attach_by_id);
+/**
+ * of_genpd_attach_cpu() - Attach a CPU to its PM domain
+ * @cpu: The CPU to be attached.
+ *
+ * Parses the OF node of the CPU's device, to find a PM domain specifier. If
+ * such is found, attaches the CPU's device to the retrieved pm_domain ops and
+ * enables runtime PM for it. This to allow the CPU to be power managed through
+ * its PM domain.
+ *
+ * Returns zero when successfully attached the CPU's device to its PM domain,
+ * else a negative error code.
+ */
+int of_genpd_attach_cpu(int cpu)
+{
+ struct device *dev = get_cpu_device(cpu);
+ int ret;
+
+ if (!dev) {
+ pr_warn("genpd: no dev for cpu%d\n", cpu);
+ return -ENODEV;
+ }
+
+ ret = genpd_dev_pm_attach(dev);
+ if (ret != 1) {
+ dev_warn(dev, "genpd: attach cpu failed %d\n", ret);
+ return ret < 0 ? ret : -ENODEV;
+ }
+
+ pm_runtime_irq_safe(dev);
+ pm_runtime_get_noresume(dev);
+ pm_runtime_set_active(dev);
+ pm_runtime_enable(dev);
+
+ dev_info(dev, "genpd: attached cpu\n");
+ return 0;
+}
+EXPORT_SYMBOL(of_genpd_attach_cpu);
+
+/**
+ * of_genpd_detach_cpu() - Detach a CPU from its PM domain
+ * @cpu: The CPU to be detached.
+ *
+ * Detach the CPU's device from its corresponding PM domain. If detaching is
+ * completed successfully, disable runtime PM and restore the runtime PM usage
+ * count for the CPU's device.
+ */
+void of_genpd_detach_cpu(int cpu)
+{
+ struct device *dev = get_cpu_device(cpu);
+
+ if (!dev) {
+ pr_warn("genpd: no dev for cpu%d\n", cpu);
+ return;
+ }
+
+ /* Check that the device is attached to a genpd. */
+ if (!(dev->pm_domain && dev->pm_domain->detach == genpd_dev_pm_detach))
+ return;
+
+ genpd_dev_pm_detach(dev, true);
+
+ pm_runtime_disable(dev);
+ pm_runtime_put_noidle(dev);
+ pm_runtime_reinit(dev);
+
+ dev_info(dev, "genpd: detached cpu\n");
+}
+EXPORT_SYMBOL(of_genpd_detach_cpu);
+
static const struct of_device_id idle_state_match[] = {
{ .compatible = "domain-idle-state", },
{ }
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 3f67ff0c1c69..2c09cf80b285 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -243,6 +243,8 @@ unsigned int of_genpd_opp_to_performance_state(struct device *dev,
int genpd_dev_pm_attach(struct device *dev);
struct device *genpd_dev_pm_attach_by_id(struct device *dev,
unsigned int index);
+int of_genpd_attach_cpu(int cpu);
+void of_genpd_detach_cpu(int cpu);
#else /* !CONFIG_PM_GENERIC_DOMAINS_OF */
static inline int of_genpd_add_provider_simple(struct device_node *np,
struct generic_pm_domain *genpd)
@@ -294,6 +296,13 @@ static inline struct device *genpd_dev_pm_attach_by_id(struct device *dev,
return NULL;
}
+static inline int of_genpd_attach_cpu(int cpu)
+{
+ return -ENODEV;
+}
+
+static inline void of_genpd_detach_cpu(int cpu) {}
+
static inline
struct generic_pm_domain *of_genpd_remove_last(struct device_node *np)
{
--
2.17.1
From: Lina Iyer <[email protected]>
Let's add a data pointer to the genpd_power_state struct, to allow
platforms to store per state specific data.
Cc: Lina Iyer <[email protected]>
Signed-off-by: Lina Iyer <[email protected]>
Co-developed-by: Ulf Hansson <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
include/linux/pm_domain.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 9206a4fef9ac..27fca748344a 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -44,6 +44,7 @@ struct genpd_power_state {
s64 residency_ns;
struct fwnode_handle *fwnode;
ktime_t idle_time;
+ void *data;
};
struct genpd_lock_ops;
--
2.17.1
Instead of returning -EINVAL from of_genpd_parse_idle_states() in case no
compatible states were found, let's return 0 to indicate success. Also
assign the out-parameter *states to NULL and *n to 0, to indicate to the
caller that zero states have been found/allocated.
This enables the caller of of_genpd_parse_idle_states() to more easily act
on the returned error code.
Cc: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
Reviewed-by: Lina Iyer <[email protected]>
---
drivers/base/power/domain.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 4925af5c4cf0..62969c3d5d04 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -2452,8 +2452,8 @@ static int genpd_iterate_idle_states(struct device_node *dn,
*
* Returns the device states parsed from the OF node. The memory for the states
* is allocated by this function and is the responsibility of the caller to
- * free the memory after use. If no domain idle states is found it returns
- * -EINVAL and in case of errors, a negative error code.
+ * free the memory after use. If zero or more compatible domain idle states are
+ * found, it returns 0; in case of errors, a negative error code is returned.
*/
int of_genpd_parse_idle_states(struct device_node *dn,
struct genpd_power_state **states, int *n)
@@ -2462,8 +2462,14 @@ int of_genpd_parse_idle_states(struct device_node *dn,
int ret;
ret = genpd_iterate_idle_states(dn, NULL);
- if (ret <= 0)
- return ret < 0 ? ret : -EINVAL;
+ if (ret < 0)
+ return ret;
+
+ if (!ret) {
+ *states = NULL;
+ *n = 0;
+ return 0;
+ }
st = kcalloc(ret, sizeof(*st), GFP_KERNEL);
if (!st)
--
2.17.1
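The new contract can be sketched from a caller's perspective in a userspace model (hypothetical, simplified — the real function iterates DT nodes and `found` stands in for the count returned by genpd_iterate_idle_states()):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Userspace model of the new of_genpd_parse_idle_states() contract:
 * finding zero compatible states is a success, reported as ret == 0
 * with *states == NULL and *n == 0, so callers only need to treat
 * negative returns as errors.
 */
struct state { long long residency_ns; };

static int parse_idle_states(int found, struct state **states, int *n)
{
	if (found < 0)
		return found;		/* propagate errors unchanged */

	if (!found) {
		*states = NULL;		/* new behaviour: success, no states */
		*n = 0;
		return 0;
	}

	*states = calloc(found, sizeof(**states));
	if (!*states)
		return -1;		/* would be -ENOMEM in the kernel */
	*n = found;
	return 0;
}
```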
PSCI firmware v1.0+ supports two different modes for CPU_SUSPEND: the
Platform Coordinated mode, which is the default and mandatory mode, and
the OS initiated mode, for which support is optional.
This change introduces initial support for the OS initiated mode, by
adding the related PSCI bits from the spec and printing a message in
the log to inform whether the mode is supported by the PSCI FW.
Cc: Lina Iyer <[email protected]>
Co-developed-by: Lina Iyer <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/firmware/psci/psci.c | 21 ++++++++++++++++++++-
include/uapi/linux/psci.h | 5 +++++
2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index 38881007584e..e8f4f8444ff1 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -95,6 +95,11 @@ static inline bool psci_has_ext_power_state(void)
PSCI_1_0_FEATURES_CPU_SUSPEND_PF_MASK;
}
+static inline bool psci_has_osi_support(void)
+{
+ return psci_cpu_suspend_feature & PSCI_1_0_OS_INITIATED;
+}
+
static inline bool psci_power_state_loses_context(u32 state)
{
const u32 mask = psci_has_ext_power_state() ?
@@ -658,10 +663,24 @@ static int __init psci_0_1_init(struct device_node *np)
return 0;
}
+static int __init psci_1_0_init(struct device_node *np)
+{
+ int err;
+
+ err = psci_0_2_init(np);
+ if (err)
+ return err;
+
+ if (psci_has_osi_support())
+ pr_info("OSI mode supported.\n");
+
+ return 0;
+}
+
static const struct of_device_id psci_of_match[] __initconst = {
{ .compatible = "arm,psci", .data = psci_0_1_init},
{ .compatible = "arm,psci-0.2", .data = psci_0_2_init},
- { .compatible = "arm,psci-1.0", .data = psci_0_2_init},
+ { .compatible = "arm,psci-1.0", .data = psci_1_0_init},
{},
};
diff --git a/include/uapi/linux/psci.h b/include/uapi/linux/psci.h
index b3bcabe380da..581f72085c33 100644
--- a/include/uapi/linux/psci.h
+++ b/include/uapi/linux/psci.h
@@ -49,6 +49,7 @@
#define PSCI_1_0_FN_PSCI_FEATURES PSCI_0_2_FN(10)
#define PSCI_1_0_FN_SYSTEM_SUSPEND PSCI_0_2_FN(14)
+#define PSCI_1_0_FN_SET_SUSPEND_MODE PSCI_0_2_FN(15)
#define PSCI_1_0_FN64_SYSTEM_SUSPEND PSCI_0_2_FN64(14)
@@ -97,6 +98,10 @@
#define PSCI_1_0_FEATURES_CPU_SUSPEND_PF_MASK \
(0x1 << PSCI_1_0_FEATURES_CPU_SUSPEND_PF_SHIFT)
+#define PSCI_1_0_OS_INITIATED BIT(0)
+#define PSCI_1_0_SUSPEND_MODE_PC 0
+#define PSCI_1_0_SUSPEND_MODE_OSI 1
+
/* PSCI return values (inclusive of all PSCI versions) */
#define PSCI_RET_SUCCESS 0
#define PSCI_RET_NOT_SUPPORTED -1
--
2.17.1
On Wed, Jun 20, 2018 at 7:22 PM, Ulf Hansson <[email protected]> wrote:
> From: Lina Iyer <[email protected]>
>
> Let's add a data pointer to the genpd_power_state struct, to allow
> platforms to store per state specific data.
Can you please fold it into a patch actually using this pointer?
> Cc: Lina Iyer <[email protected]>
> Signed-off-by: Lina Iyer <[email protected]>
> Co-developed-by: Ulf Hansson <[email protected]>
> Signed-off-by: Ulf Hansson <[email protected]>
> ---
> include/linux/pm_domain.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index 9206a4fef9ac..27fca748344a 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -44,6 +44,7 @@ struct genpd_power_state {
> s64 residency_ns;
> struct fwnode_handle *fwnode;
> ktime_t idle_time;
> + void *data;
> };
>
> struct genpd_lock_ops;
> --
> 2.17.1
>
On 24 June 2018 at 23:09, Rafael J. Wysocki <[email protected]> wrote:
> On Wed, Jun 20, 2018 at 7:22 PM, Ulf Hansson <[email protected]> wrote:
>> From: Lina Iyer <[email protected]>
>>
>> Let's add a data pointer to the genpd_power_state struct, to allow
>> platforms to store per state specific data.
>
> Can you please fold it into a patch actually using this pointer?
Yep, no problem.
Anyway, the change that uses the pointer is "[PATCH v8 21/26] drivers:
firmware: psci: Add support for PM domains using genpd".
Let's see if there are further comments; if not, perhaps you can
squash the $subject patch into that change?
>
>> Cc: Lina Iyer <[email protected]>
>> Signed-off-by: Lina Iyer <[email protected]>
>> Co-developed-by: Ulf Hansson <[email protected]>
>> Signed-off-by: Ulf Hansson <[email protected]>
>> ---
>> include/linux/pm_domain.h | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
>> index 9206a4fef9ac..27fca748344a 100644
>> --- a/include/linux/pm_domain.h
>> +++ b/include/linux/pm_domain.h
>> @@ -44,6 +44,7 @@ struct genpd_power_state {
>> s64 residency_ns;
>> struct fwnode_handle *fwnode;
>> ktime_t idle_time;
>> + void *data;
>> };
>>
>> struct genpd_lock_ops;
>> --
>> 2.17.1
>>
Kind regards
Uffe
Rafael,
On 20 June 2018 at 19:22, Ulf Hansson <[email protected]> wrote:
> Changes in v8:
> - Added some tags for reviews and acks.
> - Cleanup timer patch (patch6) according to comments from Rafael.
> - Rebased series on top of v4.18rc1 - it applied cleanly, except for patch 5.
> - While adopting patch 5 to new genpd changes, I took the opportunity to
> improve the new function description a bit.
> - Corrected malformed SPDX-License-Identifier in patch20.
There has only been a minor comment from you on patch3 (about
squashing it with patch21). With that fixed, and assuming there are no
further comments, would you like me to collect the changes and send
them to you as a pull request?
[...]
Kind regards
Uffe
On Tue, Jul 3, 2018 at 7:44 AM, Ulf Hansson <[email protected]> wrote:
> Rafael,
>
> On 20 June 2018 at 19:22, Ulf Hansson <[email protected]> wrote:
>> Changes in v8:
>> - Added some tags for reviews and acks.
>> - Cleanup timer patch (patch6) according to comments from Rafael.
>> - Rebased series on top of v4.18rc1 - it applied cleanly, except for patch 5.
>> - While adopting patch 5 to new genpd changes, I took the opportunity to
>> improve the new function description a bit.
>> - Corrected malformed SPDX-License-Identifier in patch20.
>
> There have only been a minor comment from you at patch3 (about
> squashing it with patch21), with that fixed and assuming there are no
> further comments - would you like me to collect the changes and send
> it to you as a pull request?
I have not looked at them in detail yet, sorry.
In any case, I'll apply the patches by hand.
Thanks,
Rafael
On 3 July 2018 at 09:54, Rafael J. Wysocki <[email protected]> wrote:
> On Tue, Jul 3, 2018 at 7:44 AM, Ulf Hansson <[email protected]> wrote:
>> Rafael,
>>
>> On 20 June 2018 at 19:22, Ulf Hansson <[email protected]> wrote:
>>> Changes in v8:
>>> - Added some tags for reviews and acks.
>>> - Cleanup timer patch (patch6) according to comments from Rafael.
>>> - Rebased series on top of v4.18rc1 - it applied cleanly, except for patch 5.
>>> - While adopting patch 5 to new genpd changes, I took the opportunity to
>>> improve the new function description a bit.
>>> - Corrected malformed SPDX-License-Identifier in patch20.
>>
>> There have only been a minor comment from you at patch3 (about
>> squashing it with patch21), with that fixed and assuming there are no
>> further comments - would you like me to collect the changes and send
>> it to you as a pull request?
>
> I have not looked at them in detail yet, sorry.
>
> In any case, I'll apply the patches by hand.
Okay. If there is anything I can do to make the review easier, please tell!
Kind regards
Uffe
On Wednesday, June 20, 2018 7:22:09 PM CEST Ulf Hansson wrote:
> To allow CPUs being power managed by PM domains, let's deploy support for
> runtime PM for the CPU's corresponding struct device.
>
> More precisely, at the point when the CPU is about to enter an idle state,
> decrease the runtime PM usage count for its corresponding struct device,
> via calling pm_runtime_put_sync_suspend(). Then, at the point when the CPU
> resumes from idle, let's increase the runtime PM usage count, via calling
> pm_runtime_get_sync().
>
> Cc: Lina Iyer <[email protected]>
> Co-developed-by: Lina Iyer <[email protected]>
> Signed-off-by: Ulf Hansson <[email protected]>
I finally got to this one, sorry for the huge delay.
Let me confirm that I understand the code flow correctly.
> ---
> kernel/cpu_pm.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c
> index 67b02e138a47..492d4a83dca0 100644
> --- a/kernel/cpu_pm.c
> +++ b/kernel/cpu_pm.c
> @@ -16,9 +16,11 @@
> */
>
> #include <linux/kernel.h>
> +#include <linux/cpu.h>
> #include <linux/cpu_pm.h>
> #include <linux/module.h>
> #include <linux/notifier.h>
> +#include <linux/pm_runtime.h>
> #include <linux/spinlock.h>
> #include <linux/syscore_ops.h>
>
> @@ -91,6 +93,7 @@ int cpu_pm_enter(void)
This is called from a cpuidle driver's ->enter callback for the target state
selected by the idle governor ->
> {
> int nr_calls;
> int ret = 0;
> + struct device *dev = get_cpu_device(smp_processor_id());
>
> ret = cpu_pm_notify(CPU_PM_ENTER, -1, &nr_calls);
> if (ret)
> @@ -100,6 +103,9 @@ int cpu_pm_enter(void)
> */
> cpu_pm_notify(CPU_PM_ENTER_FAILED, nr_calls - 1, NULL);
>
> + if (!ret && dev && dev->pm_domain)
> + pm_runtime_put_sync_suspend(dev);
-> so this is going to invoke genpd_runtime_suspend() if the usage
counter of dev is 0.
That will cause cpu_power_down_ok() to be called (because this is
a CPU domain) and that will walk the domain cpumask and compute the
estimated idle duration as the minimum of tick_nohz_get_next_wakeup()
values over the CPUs in that cpumask. [Note that the weight of the
cpumask must be seriously limited for that to actually work, as this
happens in the idle path.] Next, it will return "true" if it can
find a domain state with residency within the estimated idle
duration. [Note that this sort of overlaps with the idle governor's
job.]
Next, __genpd_runtime_suspend() will be invoked to run the device-specific
callback if any [Note that this has to be suitable for the idle path if
present.] and genpd_stop_dev() runs (which, again, may invoke a callback)
and genpd_power_off() runs under the domain lock (which must be a spinlock
then).
> +
> return ret;
> }
> EXPORT_SYMBOL_GPL(cpu_pm_enter);
> @@ -118,6 +124,11 @@ EXPORT_SYMBOL_GPL(cpu_pm_enter);
> */
> int cpu_pm_exit(void)
> {
> + struct device *dev = get_cpu_device(smp_processor_id());
> +
> + if (dev && dev->pm_domain)
> + pm_runtime_get_sync(dev);
> +
> return cpu_pm_notify(CPU_PM_EXIT, -1, NULL);
> }
> EXPORT_SYMBOL_GPL(cpu_pm_exit);
>
And this is called on wakeup when the cpuidle driver's ->enter callback
is about to return and it reverses the suspend flow (except that the
governor doesn't need to be called now).
Have I got that right?
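Assuming the description above holds, the governor's estimate can be sketched in userspace C (hypothetical helper names; the actual kernel code walks a cpumask and uses ktime_t):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace sketch of the described cpu_power_down_ok() logic: take
 * the minimum next-wakeup time over the CPUs in the domain's cpumask,
 * and accept a domain state only if its residency fits within the
 * estimated idle duration.
 */
static int64_t min_next_wakeup(const int64_t *next_wakeup, int nr_cpus)
{
	int64_t min = next_wakeup[0];
	int i;

	for (i = 1; i < nr_cpus; i++)
		if (next_wakeup[i] < min)
			min = next_wakeup[i];
	return min;
}

static bool state_residency_ok(int64_t now, int64_t min_wakeup,
			       int64_t residency_ns)
{
	return min_wakeup - now >= residency_ns;
}
```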
On Wednesday, July 18, 2018 12:11:06 PM CEST Rafael J. Wysocki wrote:
> On Wednesday, June 20, 2018 7:22:09 PM CEST Ulf Hansson wrote:
> > To allow CPUs being power managed by PM domains, let's deploy support for
> > runtime PM for the CPU's corresponding struct device.
> >
> > More precisely, at the point when the CPU is about to enter an idle state,
> > decrease the runtime PM usage count for its corresponding struct device,
> > via calling pm_runtime_put_sync_suspend(). Then, at the point when the CPU
> > resumes from idle, let's increase the runtime PM usage count, via calling
> > pm_runtime_get_sync().
> >
> > Cc: Lina Iyer <[email protected]>
> > Co-developed-by: Lina Iyer <[email protected]>
> > Signed-off-by: Ulf Hansson <[email protected]>
>
> I finally got to this one, sorry for the huge delay.
>
> Let me confirm that I understand the code flow correctly.
>
> > ---
> > kernel/cpu_pm.c | 11 +++++++++++
> > 1 file changed, 11 insertions(+)
> >
> > diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c
> > index 67b02e138a47..492d4a83dca0 100644
> > --- a/kernel/cpu_pm.c
> > +++ b/kernel/cpu_pm.c
> > @@ -16,9 +16,11 @@
> > */
> >
> > #include <linux/kernel.h>
> > +#include <linux/cpu.h>
> > #include <linux/cpu_pm.h>
> > #include <linux/module.h>
> > #include <linux/notifier.h>
> > +#include <linux/pm_runtime.h>
> > #include <linux/spinlock.h>
> > #include <linux/syscore_ops.h>
> >
> > @@ -91,6 +93,7 @@ int cpu_pm_enter(void)
>
> This is called from a cpuidle driver's ->enter callback for the target state
> selected by the idle governor ->
>
> > {
> > int nr_calls;
> > int ret = 0;
> > + struct device *dev = get_cpu_device(smp_processor_id());
> >
> > ret = cpu_pm_notify(CPU_PM_ENTER, -1, &nr_calls);
> > if (ret)
> > @@ -100,6 +103,9 @@ int cpu_pm_enter(void)
> > */
> > cpu_pm_notify(CPU_PM_ENTER_FAILED, nr_calls - 1, NULL);
> >
> > + if (!ret && dev && dev->pm_domain)
> > + pm_runtime_put_sync_suspend(dev);
>
> -> so this is going to invoke genpd_runtime_suspend() if the usage
> counter of dev is 0.
>
> That will cause cpu_power_down_ok() to be called (because this is
> a CPU domain) and that will walk the domain cpumask and compute the
> estimated idle duration as the minimum of tick_nohz_get_next_wakeup()
> values over the CPUs in that cpumask. [Note that the weight of the
> cpumask must be seriously limited for that to actually work, as this
> happens in the idle path.] Next, it will return "true" if it can
> find a domain state with residency within the estimated idle
> duration. [Note that this sort of overlaps with the idle governor's
> job.]
>
> Next, __genpd_runtime_suspend() will be invoked to run the device-specific
> callback if any [Note that this has to be suitable for the idle path if
> present.] and genpd_stop_dev() runs (which, again, may invoke a callback)
> and genpd_power_off() runs under the domain lock (which must be a spinlock
> then).
>
> > +
> > return ret;
> > }
> > EXPORT_SYMBOL_GPL(cpu_pm_enter);
> > @@ -118,6 +124,11 @@ EXPORT_SYMBOL_GPL(cpu_pm_enter);
> > */
> > int cpu_pm_exit(void)
> > {
> > + struct device *dev = get_cpu_device(smp_processor_id());
> > +
> > + if (dev && dev->pm_domain)
> > + pm_runtime_get_sync(dev);
> > +
> > return cpu_pm_notify(CPU_PM_EXIT, -1, NULL);
> > }
> > EXPORT_SYMBOL_GPL(cpu_pm_exit);
> >
>
> And this is called on wakeup when the cpuidle driver's ->enter callback
> is about to return and it reverses the suspend flow (except that the
> governor doesn't need to be called now).
>
> Have I got that right?
Assuming that I have got that right, there are concerns, mostly regarding
patch [07/26], but I will reply to that directly.
The $subject patch is fine by me by itself, but it obviously depends on the
previous ones. Patches [01-02/26] are fine too, but they don't seem to be
particularly useful without the rest of the series.
As far as patches [10-26/26] go, I'd like to see some review comments and/or
tags from the people with vested interest in there, in particular from Daniel
on patch [12/26] and from Sudeep on the PSCI ones.
Thanks,
Rafael
On Wednesday, June 20, 2018 7:22:06 PM CEST Ulf Hansson wrote:
> From: Lina Iyer <[email protected]>
>
> Knowing the sleep duration of CPUs is needed while selecting
> the most energy efficient idle state for a CPU or a group of CPUs.
>
> However, to be able to compute the sleep duration, we need to know at what
> time the next expected wakeup is for the CPU. Therefore, let's export this
> information via a new function, tick_nohz_get_next_wakeup(). Following
> changes make use of it.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Daniel Lezcano <[email protected]>
> Cc: Lina Iyer <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Signed-off-by: Lina Iyer <[email protected]>
> Co-developed-by: Ulf Hansson <[email protected]>
> Signed-off-by: Ulf Hansson <[email protected]>
> ---
> include/linux/tick.h | 8 ++++++++
> kernel/time/tick-sched.c | 10 ++++++++++
> 2 files changed, 18 insertions(+)
>
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 55388ab45fd4..e48f6b26b425 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -125,6 +125,7 @@ extern bool tick_nohz_idle_got_tick(void);
> extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
> extern unsigned long tick_nohz_get_idle_calls(void);
> extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
> +extern ktime_t tick_nohz_get_next_wakeup(int cpu);
> extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
> extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
>
> @@ -151,6 +152,13 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
> *delta_next = TICK_NSEC;
> return *delta_next;
> }
> +
> +static inline ktime_t tick_nohz_get_next_wakeup(int cpu)
> +{
> + /* Next wake up is the tick period, assume it starts now */
> + return ktime_add(ktime_get(), TICK_NSEC);
> +}
> +
> static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
> static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index da9455a6b42b..f380bb4f0744 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1089,6 +1089,16 @@ unsigned long tick_nohz_get_idle_calls(void)
> return ts->idle_calls;
> }
>
> +/**
> + * tick_nohz_get_next_wakeup - return the next wake up of the CPU
> + */
I'd add to the comment that this is to be invoked for idle CPUs only.
> +ktime_t tick_nohz_get_next_wakeup(int cpu)
> +{
> + struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
> +
> + return dev->next_event;
> +}
> +
> static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
> {
> #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>
On Wednesday, June 20, 2018 7:22:05 PM CEST Ulf Hansson wrote:
> Introduce two new genpd helper functions, of_genpd_attach|detach_cpu(),
> which takes the CPU-number as an in-parameter.
>
> To attach a CPU to a genpd, of_genpd_attach_cpu() starts by fetching the
> struct device belonging to the CPU. Then it calls genpd_dev_pm_attach(),
> which via DT tries to hook up the CPU device to its corresponding PM
> domain. If it succeeds, of_genpd_attach_cpu() continues to prepare/enable
> runtime PM of the device.
>
> To detach a CPU from its PM domain, of_genpd_detach_cpu() reverses the
> operations made by of_genpd_attach_cpu(). However, first it checks that
> the CPU device has a valid PM domain pointer assigned, so as to make sure
> it belongs to genpd.
>
> Cc: Lina Iyer <[email protected]>
> Co-developed-by: Lina Iyer <[email protected]>
> Signed-off-by: Ulf Hansson <[email protected]>
> ---
> drivers/base/power/domain.c | 69 +++++++++++++++++++++++++++++++++++++
> include/linux/pm_domain.h | 9 +++++
> 2 files changed, 78 insertions(+)
>
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index 6149ce0bfa7b..299fa2febbec 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -2445,6 +2445,75 @@ struct device *genpd_dev_pm_attach_by_id(struct device *dev,
> }
> EXPORT_SYMBOL_GPL(genpd_dev_pm_attach_by_id);
>
> +/*
> + * of_genpd_attach_cpu() - Attach a CPU to its PM domain
> + * @cpu: The CPU to be attached.
> + *
> + * Parses the OF node of the CPU's device, to find a PM domain specifier. If
> + * such is found, attaches the CPU's device to the retrieved pm_domain ops and
> + * enables runtime PM for it. This to allow the CPU to be power managed through
> + * its PM domain.
> + *
> + * Returns zero when successfully attached the CPU's device to its PM domain,
> + * else a negative error code.
> + */
> +int of_genpd_attach_cpu(int cpu)
> +{
> + struct device *dev = get_cpu_device(cpu);
> + int ret;
> +
> + if (!dev) {
> + pr_warn("genpd: no dev for cpu%d\n", cpu);
> + return -ENODEV;
> + }
I'm not sure about the value of the above. Is it even possible?
> +
> + ret = genpd_dev_pm_attach(dev);
> + if (ret != 1) {
> + dev_warn(dev, "genpd: attach cpu failed %d\n", ret);
This looks like a debug message. Do you really want to print it with high prio?
> + return ret < 0 ? ret : -ENODEV;
> + }
> +
> + pm_runtime_irq_safe(dev);
> + pm_runtime_get_noresume(dev);
> + pm_runtime_set_active(dev);
> + pm_runtime_enable(dev);
> +
> + dev_info(dev, "genpd: attached cpu\n");
This definitely is a debug message.
> + return 0;
> +}
> +EXPORT_SYMBOL(of_genpd_attach_cpu);
> +
> +/**
> + * of_genpd_detach_cpu() - Detach a CPU from its PM domain
> + * @cpu: The CPU to be detached.
> + *
> + * Detach the CPU's device from its corresponding PM domain. If detaching is
> + * completed successfully, disable runtime PM and restore the runtime PM usage
> + * count for the CPU's device.
> + */
> +void of_genpd_detach_cpu(int cpu)
> +{
> + struct device *dev = get_cpu_device(cpu);
> +
> + if (!dev) {
> + pr_warn("genpd: no dev for cpu%d\n", cpu);
> + return;
> + }
> +
> + /* Check that the device is attached to a genpd. */
> + if (!(dev->pm_domain && dev->pm_domain->detach == genpd_dev_pm_detach))
> + return;
> +
> + genpd_dev_pm_detach(dev, true);
> +
> + pm_runtime_disable(dev);
> + pm_runtime_put_noidle(dev);
> + pm_runtime_reinit(dev);
> +
> + dev_info(dev, "genpd: detached cpu\n");
> +}
> +EXPORT_SYMBOL(of_genpd_detach_cpu);
> +
> static const struct of_device_id idle_state_match[] = {
> { .compatible = "domain-idle-state", },
> { }
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index 3f67ff0c1c69..2c09cf80b285 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -243,6 +243,8 @@ unsigned int of_genpd_opp_to_performance_state(struct device *dev,
> int genpd_dev_pm_attach(struct device *dev);
> struct device *genpd_dev_pm_attach_by_id(struct device *dev,
> unsigned int index);
> +int of_genpd_attach_cpu(int cpu);
> +void of_genpd_detach_cpu(int cpu);
> #else /* !CONFIG_PM_GENERIC_DOMAINS_OF */
> static inline int of_genpd_add_provider_simple(struct device_node *np,
> struct generic_pm_domain *genpd)
> @@ -294,6 +296,13 @@ static inline struct device *genpd_dev_pm_attach_by_id(struct device *dev,
> return NULL;
> }
>
> +static inline int of_genpd_attach_cpu(int cpu)
> +{
> + return -ENODEV;
> +}
> +
> +static inline void of_genpd_detach_cpu(int cpu) {}
> +
> static inline
> struct generic_pm_domain *of_genpd_remove_last(struct device_node *np)
> {
>
I'd combine this with patch [04/26]. The split here is somewhat artificial IMO.
On Wednesday, June 20, 2018 7:22:04 PM CEST Ulf Hansson wrote:
> To enable a device belonging to a CPU to be attached to a PM domain managed
> by genpd, let's do a few changes to genpd as to make it convenient to
> manage the specifics around CPUs.
>
> First, as to be able to quickly find out what CPUs that are attached to a
> genpd, which typically becomes useful from a genpd governor as following
> changes are about to show, let's add a cpumask 'cpus' to the struct
> generic_pm_domain.
>
> At the point when a device that belongs to a CPU, is attached/detached to
> its corresponding PM domain via genpd_add_device(), let's update the
> cpumask in genpd->cpus. Moreover, propagate the update of the cpumask to
> the master domains, which makes the genpd->cpus to contain a cpumask that
> hierarchically reflect all CPUs for a genpd, including CPUs attached to
> subdomains.
>
> Second, to unconditionally manage CPUs and the cpumask in genpd->cpus, is
> unnecessary for cases when only non-CPU devices are parts of a genpd.
> Let's avoid this by adding a new configuration bit, GENPD_FLAG_CPU_DOMAIN.
> Clients must set the bit before they call pm_genpd_init(), as to instruct
> genpd that it shall deal with CPUs and thus manage the cpumask in
> genpd->cpus.
>
> Cc: Lina Iyer <[email protected]>
> Co-developed-by: Lina Iyer <[email protected]>
> Signed-off-by: Ulf Hansson <[email protected]>
> ---
> drivers/base/power/domain.c | 69 ++++++++++++++++++++++++++++++++++++-
> include/linux/pm_domain.h | 3 ++
> 2 files changed, 71 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index 21d298e1820b..6149ce0bfa7b 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -20,6 +20,7 @@
> #include <linux/sched.h>
> #include <linux/suspend.h>
> #include <linux/export.h>
> +#include <linux/cpu.h>
>
> #include "power.h"
>
> @@ -126,6 +127,7 @@ static const struct genpd_lock_ops genpd_spin_ops = {
> #define genpd_is_irq_safe(genpd) (genpd->flags & GENPD_FLAG_IRQ_SAFE)
> #define genpd_is_always_on(genpd) (genpd->flags & GENPD_FLAG_ALWAYS_ON)
> #define genpd_is_active_wakeup(genpd) (genpd->flags & GENPD_FLAG_ACTIVE_WAKEUP)
> +#define genpd_is_cpu_domain(genpd) (genpd->flags & GENPD_FLAG_CPU_DOMAIN)
>
> static inline bool irq_safe_dev_in_no_sleep_domain(struct device *dev,
> const struct generic_pm_domain *genpd)
> @@ -1377,6 +1379,62 @@ static void genpd_free_dev_data(struct device *dev,
> dev_pm_put_subsys_data(dev);
> }
>
> +static void __genpd_update_cpumask(struct generic_pm_domain *genpd,
> + int cpu, bool set, unsigned int depth)
> +{
> + struct gpd_link *link;
> +
> + if (!genpd_is_cpu_domain(genpd))
> + return;
> +
> + list_for_each_entry(link, &genpd->slave_links, slave_node) {
> + struct generic_pm_domain *master = link->master;
> +
> + genpd_lock_nested(master, depth + 1);
> + __genpd_update_cpumask(master, cpu, set, depth + 1);
> + genpd_unlock(master);
> + }
> +
> + if (set)
> + cpumask_set_cpu(cpu, genpd->cpus);
> + else
> + cpumask_clear_cpu(cpu, genpd->cpus);
> +}
As noted elsewhere, there is a concern about the possible weight of this
cpumask and I think that it would be good to explicitly put a limit on it.
> +
> +static void genpd_update_cpumask(struct generic_pm_domain *genpd,
> + struct device *dev, bool set)
> +{
> + bool is_cpu = false;
> + int cpu;
> +
> + if (!genpd_is_cpu_domain(genpd))
> + return;
> +
> + for_each_possible_cpu(cpu) {
> + if (get_cpu_device(cpu) == dev) {
> + is_cpu = true;
You may call __genpd_update_cpumask() right here and then you won't
need the extra is_cpu variable.
> + break;
> + }
> + }
> +
> + if (!is_cpu)
> + return;
> +
> + __genpd_update_cpumask(genpd, cpu, set, 0);
> +}
> +
> +static void genpd_set_cpumask(struct generic_pm_domain *genpd,
> + struct device *dev)
> +{
> + genpd_update_cpumask(genpd, dev, true);
> +}
> +
> +static void genpd_clear_cpumask(struct generic_pm_domain *genpd,
> + struct device *dev)
> +{
> + genpd_update_cpumask(genpd, dev, false);
> +}
> +
> static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
> struct gpd_timing_data *td)
> {
> @@ -1398,6 +1456,8 @@ static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
> if (ret)
> goto out;
>
> + genpd_set_cpumask(genpd, dev);
> +
> dev_pm_domain_set(dev, &genpd->domain);
>
> genpd->device_count++;
> @@ -1459,6 +1519,7 @@ static int genpd_remove_device(struct generic_pm_domain *genpd,
> if (genpd->detach_dev)
> genpd->detach_dev(genpd, dev);
>
> + genpd_clear_cpumask(genpd, dev);
> dev_pm_domain_set(dev, NULL);
>
> list_del_init(&pdd->list_node);
> @@ -1686,11 +1747,16 @@ int pm_genpd_init(struct generic_pm_domain *genpd,
> if (genpd_is_always_on(genpd) && !genpd_status_on(genpd))
> return -EINVAL;
>
> + if (!zalloc_cpumask_var(&genpd->cpus, GFP_KERNEL))
> + return -ENOMEM;
> +
> /* Use only one "off" state if there were no states declared */
> if (genpd->state_count == 0) {
> ret = genpd_set_default_power_state(genpd);
> - if (ret)
> + if (ret) {
> + free_cpumask_var(genpd->cpus);
> return ret;
> + }
> } else if (!gov) {
> pr_warn("%s : no governor for states\n", genpd->name);
> }
> @@ -1736,6 +1802,7 @@ static int genpd_remove(struct generic_pm_domain *genpd)
> list_del(&genpd->gpd_list_node);
> genpd_unlock(genpd);
> cancel_work_sync(&genpd->power_off_work);
> + free_cpumask_var(genpd->cpus);
> kfree(genpd->free);
> pr_debug("%s: removed %s\n", __func__, genpd->name);
>
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index 27fca748344a..3f67ff0c1c69 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -16,12 +16,14 @@
> #include <linux/of.h>
> #include <linux/notifier.h>
> #include <linux/spinlock.h>
> +#include <linux/cpumask.h>
>
> /* Defines used for the flags field in the struct generic_pm_domain */
> #define GENPD_FLAG_PM_CLK (1U << 0) /* PM domain uses PM clk */
> #define GENPD_FLAG_IRQ_SAFE (1U << 1) /* PM domain operates in atomic */
> #define GENPD_FLAG_ALWAYS_ON (1U << 2) /* PM domain is always powered on */
> #define GENPD_FLAG_ACTIVE_WAKEUP (1U << 3) /* Keep devices active if wakeup */
> +#define GENPD_FLAG_CPU_DOMAIN (1U << 4) /* PM domain manages CPUs */
>
> enum gpd_status {
> GPD_STATE_ACTIVE = 0, /* PM domain is active */
> @@ -68,6 +70,7 @@ struct generic_pm_domain {
> unsigned int suspended_count; /* System suspend device counter */
> unsigned int prepared_count; /* Suspend counter of prepared devices */
> unsigned int performance_state; /* Aggregated max performance state */
> + cpumask_var_t cpus; /* A cpumask of the attached CPUs */
> int (*power_off)(struct generic_pm_domain *domain);
> int (*power_on)(struct generic_pm_domain *domain);
> unsigned int (*opp_to_performance_state)(struct generic_pm_domain *genpd,
>
On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
> As it's now perfectly possible that a PM domain managed by genpd contains
> devices belonging to CPUs, we should start to take into account the
> residency values for the idle states during the state selection process.
> The residency value specifies the minimum duration of time, the CPU or a
> group of CPUs, needs to spend in an idle state to not waste energy entering
> it.
>
> To deal with this, let's add a new genpd governor, pm_domain_cpu_gov, that
> may be used for a PM domain that has CPU devices attached or if the CPUs
> are attached through subdomains.
>
> The new governor computes the minimum expected idle duration time for the
> online CPUs being attached to the PM domain and its subdomains. Then in the
> state selection process, trying the deepest state first, it verifies that
> the idle duration time satisfies the state's residency value.
>
> It should be noted that, when computing the minimum expected idle duration
> time, we use the information from tick_nohz_get_next_wakeup(), to find the
> next wakeup for the related CPUs. Future wise, this may deserve to be
> improved, as there are more reasons to why a CPU may be woken up from idle.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Daniel Lezcano <[email protected]>
> Cc: Lina Iyer <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Co-developed-by: Lina Iyer <[email protected]>
> Signed-off-by: Ulf Hansson <[email protected]>
> ---
> drivers/base/power/domain_governor.c | 58 ++++++++++++++++++++++++++++
> include/linux/pm_domain.h | 2 +
> 2 files changed, 60 insertions(+)
>
> diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
> index 99896fbf18e4..1aad55719537 100644
> --- a/drivers/base/power/domain_governor.c
> +++ b/drivers/base/power/domain_governor.c
> @@ -10,6 +10,9 @@
> #include <linux/pm_domain.h>
> #include <linux/pm_qos.h>
> #include <linux/hrtimer.h>
> +#include <linux/cpumask.h>
> +#include <linux/ktime.h>
> +#include <linux/tick.h>
>
> static int dev_update_qos_constraint(struct device *dev, void *data)
> {
> @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> return false;
> }
>
> +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> +{
> + struct generic_pm_domain *genpd = pd_to_genpd(pd);
> + ktime_t domain_wakeup, cpu_wakeup;
> + s64 idle_duration_ns;
> + int cpu, i;
> +
> + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> + return true;
> +
> + /*
> + * Find the next wakeup for any of the online CPUs within the PM domain
> + * and its subdomains. Note, we only need the genpd->cpus, as it already
> + * contains a mask of all CPUs from subdomains.
> + */
> + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> + if (ktime_before(cpu_wakeup, domain_wakeup))
> + domain_wakeup = cpu_wakeup;
> + }
> +
> + /* The minimum idle duration is from now - until the next wakeup. */
> + idle_duration_ns = ktime_to_ns(ktime_sub(domain_wakeup, ktime_get()));
> +
If idle_duration_ns is negative at this point, you can return false right
away and then you won't need to bother with this case below.
> + /*
> + * Find the deepest idle state that has its residency value satisfied
> + * and by also taking into account the power off latency for the state.
> + * Start at the deepest supported state.
> + */
> + i = genpd->state_count - 1;
> + do {
> + if (!genpd->states[i].residency_ns)
> + break;
> +
> + /* Check idle_duration_ns >= 0 to compare signed/unsigned. */
> + if (idle_duration_ns >= 0 && idle_duration_ns >=
> + (genpd->states[i].residency_ns +
> + genpd->states[i].power_off_latency_ns))
Why don't you set state_idx and return true right here?
Then you'll only need to return false if you haven't found a matching state.
> + break;
> + i--;
> + } while (i >= 0);
> +
> + if (i < 0)
> + return false;
> +
> + genpd->state_idx = i;
> + return true;
> +}
> +
> struct dev_power_governor simple_qos_governor = {
> .suspend_ok = default_suspend_ok,
> .power_down_ok = default_power_down_ok,
> @@ -257,3 +310,8 @@ struct dev_power_governor pm_domain_always_on_gov = {
> .power_down_ok = always_on_power_down_ok,
> .suspend_ok = default_suspend_ok,
> };
> +
> +struct dev_power_governor pm_domain_cpu_gov = {
> + .suspend_ok = NULL,
> + .power_down_ok = cpu_power_down_ok,
I see that I haven't got your code flow right after all. :-)
That means this should work AFAICS.
> +};
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index 2c09cf80b285..97901c833108 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -160,6 +160,7 @@ int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state);
>
> extern struct dev_power_governor simple_qos_governor;
> extern struct dev_power_governor pm_domain_always_on_gov;
> +extern struct dev_power_governor pm_domain_cpu_gov;
> #else
>
> static inline struct generic_pm_domain_data *dev_gpd_data(struct device *dev)
> @@ -203,6 +204,7 @@ static inline int dev_pm_genpd_set_performance_state(struct device *dev,
>
> #define simple_qos_governor (*(struct dev_power_governor *)(NULL))
> #define pm_domain_always_on_gov (*(struct dev_power_governor *)(NULL))
> +#define pm_domain_cpu_gov (*(struct dev_power_governor *)(NULL))
> #endif
>
> #ifdef CONFIG_PM_GENERIC_DOMAINS_SLEEP
>
On Wednesday, June 20, 2018 7:22:08 PM CEST Ulf Hansson wrote:
> CPU devices and other regular devices may share the same PM domain and may
> also be hierarchically related via subdomains. In either case, all devices
> including CPUs, may be attached to a PM domain managed by genpd, that has
> an idle state with an enter/exit latency.
>
> Let's take these latencies into account in the state selection process by
> genpd's governor for CPUs. This means the governor, pm_domain_cpu_gov,
> becomes extended to satisfy both a state's residency and a potential dev PM
> QoS constraint.
>
> Cc: Lina Iyer <[email protected]>
> Co-developed-by: Lina Iyer <[email protected]>
> Signed-off-by: Ulf Hansson <[email protected]>
> ---
> drivers/base/power/domain_governor.c | 15 +++++++++++----
> include/linux/pm_domain.h | 1 +
> 2 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
> index 1aad55719537..03d4e9454ce9 100644
> --- a/drivers/base/power/domain_governor.c
> +++ b/drivers/base/power/domain_governor.c
> @@ -214,8 +214,10 @@ static bool default_power_down_ok(struct dev_pm_domain *pd)
> struct generic_pm_domain *genpd = pd_to_genpd(pd);
> struct gpd_link *link;
>
> - if (!genpd->max_off_time_changed)
> + if (!genpd->max_off_time_changed) {
> + genpd->state_idx = genpd->cached_power_down_state_idx;
> return genpd->cached_power_down_ok;
> + }
>
> /*
> * We have to invalidate the cached results for the masters, so
> @@ -240,6 +242,7 @@ static bool default_power_down_ok(struct dev_pm_domain *pd)
> genpd->state_idx--;
> }
>
> + genpd->cached_power_down_state_idx = genpd->state_idx;
> return genpd->cached_power_down_ok;
> }
>
> @@ -255,6 +258,10 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> s64 idle_duration_ns;
> int cpu, i;
>
> + /* Validate dev PM QoS constraints. */
> + if (!default_power_down_ok(pd))
> + return false;
> +
> if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> return true;
>
> @@ -276,9 +283,9 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> /*
> * Find the deepest idle state that has its residency value satisfied
> * and by also taking into account the power off latency for the state.
> - * Start at the deepest supported state.
> + * Start at the state picked by the dev PM QoS constraint validation.
> */
> - i = genpd->state_count - 1;
> + i = genpd->state_idx;
> do {
> if (!genpd->states[i].residency_ns)
> break;
> @@ -312,6 +319,6 @@ struct dev_power_governor pm_domain_always_on_gov = {
> };
>
> struct dev_power_governor pm_domain_cpu_gov = {
> - .suspend_ok = NULL,
> + .suspend_ok = default_suspend_ok,
> .power_down_ok = cpu_power_down_ok,
> };
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index 97901c833108..dbc69721cad8 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -81,6 +81,7 @@ struct generic_pm_domain {
> s64 max_off_time_ns; /* Maximum allowed "suspended" time. */
> bool max_off_time_changed;
> bool cached_power_down_ok;
> + int cached_power_down_state_idx;
> int (*attach_dev)(struct generic_pm_domain *domain,
> struct device *dev);
> void (*detach_dev)(struct generic_pm_domain *domain,
>
I don't see much value in splitting this patch off [07/26] and it actually
confused me, so it may as well confuse someone else.
On Thursday, July 19, 2018 12:12:55 PM CEST Rafael J. Wysocki wrote:
> On Wednesday, July 18, 2018 12:11:06 PM CEST Rafael J. Wysocki wrote:
> > On Wednesday, June 20, 2018 7:22:09 PM CEST Ulf Hansson wrote:
> > > To allow CPUs being power managed by PM domains, let's deploy support for
> > > runtime PM for the CPU's corresponding struct device.
> > >
> > > More precisely, at the point when the CPU is about to enter an idle state,
> > > decrease the runtime PM usage count for its corresponding struct device,
> > > via calling pm_runtime_put_sync_suspend(). Then, at the point when the CPU
> > > resumes from idle, let's increase the runtime PM usage count, via calling
> > > pm_runtime_get_sync().
> > >
> > > Cc: Lina Iyer <[email protected]>
> > > Co-developed-by: Lina Iyer <[email protected]>
> > > Signed-off-by: Ulf Hansson <[email protected]>
> >
> > I finally got to this one, sorry for the huge delay.
> >
> > Let me confirm that I understand the code flow correctly.
> >
> > > ---
> > > kernel/cpu_pm.c | 11 +++++++++++
> > > 1 file changed, 11 insertions(+)
> > >
> > > diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c
> > > index 67b02e138a47..492d4a83dca0 100644
> > > --- a/kernel/cpu_pm.c
> > > +++ b/kernel/cpu_pm.c
> > > @@ -16,9 +16,11 @@
> > > */
> > >
> > > #include <linux/kernel.h>
> > > +#include <linux/cpu.h>
> > > #include <linux/cpu_pm.h>
> > > #include <linux/module.h>
> > > #include <linux/notifier.h>
> > > +#include <linux/pm_runtime.h>
> > > #include <linux/spinlock.h>
> > > #include <linux/syscore_ops.h>
> > >
> > > @@ -91,6 +93,7 @@ int cpu_pm_enter(void)
> >
> > This is called from a cpuidle driver's ->enter callback for the target state
> > selected by the idle governor ->
> >
> > > {
> > > int nr_calls;
> > > int ret = 0;
> > > + struct device *dev = get_cpu_device(smp_processor_id());
> > >
> > > ret = cpu_pm_notify(CPU_PM_ENTER, -1, &nr_calls);
> > > if (ret)
> > > @@ -100,6 +103,9 @@ int cpu_pm_enter(void)
> > > */
> > > cpu_pm_notify(CPU_PM_ENTER_FAILED, nr_calls - 1, NULL);
> > >
> > > + if (!ret && dev && dev->pm_domain)
> > > + pm_runtime_put_sync_suspend(dev);
> >
> > -> so this is going to invoke genpd_runtime_suspend() if the usage
> > counter of dev is 0.
> >
> > That will cause cpu_power_down_ok() to be called (because this is
> > a CPU domain) and that will walk the domain cpumask and compute the
> > estimated idle duration as the minimum of tick_nohz_get_next_wakeup()
> > values over the CPUs in that cpumask. [Note that the weight of the
> > cpumask must be seriously limited for that to actually work, as this
> > happens in the idle path.] Next, it will return "true" if it can
> > find a domain state with residency within the estimated idle
> > duration. [Note that this sort of overlaps with the idle governor's
> > job.]
> >
> > Next, __genpd_runtime_suspend() will be invoked to run the device-specific
> > callback if any [Note that this has to be suitable for the idle path if
> > present.] and genpd_stop_dev() runs (which, again, may invoke a callback)
> > and genpd_power_off() runs under the domain lock (which must be a spinlock
> > then).
> >
> > > +
> > > return ret;
> > > }
> > > EXPORT_SYMBOL_GPL(cpu_pm_enter);
> > > @@ -118,6 +124,11 @@ EXPORT_SYMBOL_GPL(cpu_pm_enter);
> > > */
> > > int cpu_pm_exit(void)
> > > {
> > > + struct device *dev = get_cpu_device(smp_processor_id());
> > > +
> > > + if (dev && dev->pm_domain)
> > > + pm_runtime_get_sync(dev);
> > > +
> > > return cpu_pm_notify(CPU_PM_EXIT, -1, NULL);
> > > }
> > > EXPORT_SYMBOL_GPL(cpu_pm_exit);
> > >
> >
> > And this is called on wakeup when the cpuidle driver's ->enter callback
> > is about to return and it reverses the suspend flow (except that the
> > governor doesn't need to be called now).
> >
> > Have I got that right?
>
> Assuming that I have got that right, there are concerns, mostly regarding
> patch [07/26], but I will reply to that directly.
Well, I haven't got that right, so never mind.
There are a few minor things to address, but apart from that the general
genpd patches look ready.
> The $subject patch is fine by me by itself, but it obviously depends on the
> previous ones. Patches [01-02/26] are fine too, but they don't seem to be
> particularly useful without the rest of the series.
>
> As far as patches [10-26/26] go, I'd like to see some review comments and/or
> tags from the people with vested interest in there, in particular from Daniel
> on patch [12/26] and from Sudeep on the PSCI ones.
But this still holds.
Thanks,
Rafael
On Thursday, July 19, 2018 12:32:52 PM CEST Rafael J. Wysocki wrote:
> On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
> > As it's now perfectly possible that a PM domain managed by genpd contains
> > devices belonging to CPUs, we should start to take into account the
> > residency values for the idle states during the state selection process.
> > The residency value specifies the minimum duration of time, the CPU or a
> > group of CPUs, needs to spend in an idle state to not waste energy entering
> > it.
> >
> > To deal with this, let's add a new genpd governor, pm_domain_cpu_gov, that
> > may be used for a PM domain that has CPU devices attached or if the CPUs
> > are attached through subdomains.
> >
> > The new governor computes the minimum expected idle duration time for the
> > online CPUs being attached to the PM domain and its subdomains. Then in the
> > state selection process, trying the deepest state first, it verifies that
> > the idle duration time satisfies the state's residency value.
> >
> > It should be noted that, when computing the minimum expected idle duration
> > time, we use the information from tick_nohz_get_next_wakeup(), to find the
> > next wakeup for the related CPUs. Future wise, this may deserve to be
> > improved, as there are more reasons to why a CPU may be woken up from idle.
> >
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Daniel Lezcano <[email protected]>
> > Cc: Lina Iyer <[email protected]>
> > Cc: Frederic Weisbecker <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Co-developed-by: Lina Iyer <[email protected]>
> > Signed-off-by: Ulf Hansson <[email protected]>
> > ---
> > drivers/base/power/domain_governor.c | 58 ++++++++++++++++++++++++++++
> > include/linux/pm_domain.h | 2 +
> > 2 files changed, 60 insertions(+)
> >
> > diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
> > index 99896fbf18e4..1aad55719537 100644
> > --- a/drivers/base/power/domain_governor.c
> > +++ b/drivers/base/power/domain_governor.c
> > @@ -10,6 +10,9 @@
> > #include <linux/pm_domain.h>
> > #include <linux/pm_qos.h>
> > #include <linux/hrtimer.h>
> > +#include <linux/cpumask.h>
> > +#include <linux/ktime.h>
> > +#include <linux/tick.h>
> >
> > static int dev_update_qos_constraint(struct device *dev, void *data)
> > {
> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> > return false;
> > }
> >
> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > +{
> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > + ktime_t domain_wakeup, cpu_wakeup;
> > + s64 idle_duration_ns;
> > + int cpu, i;
> > +
> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> > + return true;
> > +
> > + /*
> > + * Find the next wakeup for any of the online CPUs within the PM domain
> > + * and its subdomains. Note, we only need the genpd->cpus, as it already
> > + * contains a mask of all CPUs from subdomains.
> > + */
> > + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> > + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> > + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> > + if (ktime_before(cpu_wakeup, domain_wakeup))
> > + domain_wakeup = cpu_wakeup;
> > + }
Here's a concern I have missed before. :-/
Say, one of the CPUs you're walking here is woken up in the meantime.
I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it
and use the result to update domain_wakeup in that case. We really should
just avoid powering off the domain at all then, IMO.
Sure enough, if the domain power off is already started and one of the CPUs
in the domain is woken up then, too bad, it will suffer the latency (but in
that case the hardware should be able to help somewhat), but otherwise CPU
wakeup should prevent domain power off from being carried out.
Thanks,
Rafael
[...]
>>
>> Assuming that I have got that right, there are concerns, mostly regarding
>> patch [07/26], but I will reply to that directly.
>
> Well, I haven't got that right, so never mind.
>
> There are a few minor things to address, but apart from that the general
> genpd patches look ready.
Alright, thanks!
I will re-spin the series and post a new version once 4.19 rc1 is out.
Hopefully we can queue it up early in next cycle to get it tested in
next for a while.
>
>> The $subject patch is fine by me by itself, but it obviously depends on the
>> previous ones. Patches [01-02/26] are fine too, but they don't seem to be
>> particularly useful without the rest of the series.
>>
>> As far as patches [10-26/26] go, I'd like to see some review comments and/or
>> tags from the people with vested interest in there, in particular from Daniel
>> on patch [12/26] and from Sudeep on the PSCI ones.
>
> But this still holds.
Actually, patches 10 and 11 are ready to go as well. I'll ping Daniel
on patch 12.
Regarding the rest of the series, some of the PSCI/ARM changes
have been reviewed by Mark Rutland; however, several changes have not
been acked.
On the other hand, one could also interpret the long silence on the
PSCI/ARM changes as meaning they are good to go. :-)
Kind regards
Uffe
On 19 July 2018 at 12:35, Rafael J. Wysocki <[email protected]> wrote:
> On Wednesday, June 20, 2018 7:22:08 PM CEST Ulf Hansson wrote:
>> CPU devices and other regular devices may share the same PM domain and may
>> also be hierarchically related via subdomains. In either case, all devices
>> including CPUs, may be attached to a PM domain managed by genpd, that has
>> an idle state with an enter/exit latency.
>>
>> Let's take these latencies into account in the state selection process by
>> genpd's governor for CPUs. This means the governor, pm_domain_cpu_gov,
>> becomes extended to satisfy both a state's residency and a potential dev PM
>> QoS constraint.
>>
>> Cc: Lina Iyer <[email protected]>
>> Co-developed-by: Lina Iyer <[email protected]>
>> Signed-off-by: Ulf Hansson <[email protected]>
>> ---
>> drivers/base/power/domain_governor.c | 15 +++++++++++----
>> include/linux/pm_domain.h | 1 +
>> 2 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
>> index 1aad55719537..03d4e9454ce9 100644
>> --- a/drivers/base/power/domain_governor.c
>> +++ b/drivers/base/power/domain_governor.c
>> @@ -214,8 +214,10 @@ static bool default_power_down_ok(struct dev_pm_domain *pd)
>> struct generic_pm_domain *genpd = pd_to_genpd(pd);
>> struct gpd_link *link;
>>
>> - if (!genpd->max_off_time_changed)
>> + if (!genpd->max_off_time_changed) {
>> + genpd->state_idx = genpd->cached_power_down_state_idx;
>> return genpd->cached_power_down_ok;
>> + }
>>
>> /*
>> * We have to invalidate the cached results for the masters, so
>> @@ -240,6 +242,7 @@ static bool default_power_down_ok(struct dev_pm_domain *pd)
>> genpd->state_idx--;
>> }
>>
>> + genpd->cached_power_down_state_idx = genpd->state_idx;
>> return genpd->cached_power_down_ok;
>> }
>>
>> @@ -255,6 +258,10 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
>> s64 idle_duration_ns;
>> int cpu, i;
>>
>> + /* Validate dev PM QoS constraints. */
>> + if (!default_power_down_ok(pd))
>> + return false;
>> +
>> if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
>> return true;
>>
>> @@ -276,9 +283,9 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
>> /*
>> * Find the deepest idle state that has its residency value satisfied
>> * and by also taking into account the power off latency for the state.
>> - * Start at the deepest supported state.
>> + * Start at the state picked by the dev PM QoS constraint validation.
>> */
>> - i = genpd->state_count - 1;
>> + i = genpd->state_idx;
>> do {
>> if (!genpd->states[i].residency_ns)
>> break;
>> @@ -312,6 +319,6 @@ struct dev_power_governor pm_domain_always_on_gov = {
>> };
>>
>> struct dev_power_governor pm_domain_cpu_gov = {
>> - .suspend_ok = NULL,
>> + .suspend_ok = default_suspend_ok,
>> .power_down_ok = cpu_power_down_ok,
>> };
>> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
>> index 97901c833108..dbc69721cad8 100644
>> --- a/include/linux/pm_domain.h
>> +++ b/include/linux/pm_domain.h
>> @@ -81,6 +81,7 @@ struct generic_pm_domain {
>> s64 max_off_time_ns; /* Maximum allowed "suspended" time. */
>> bool max_off_time_changed;
>> bool cached_power_down_ok;
>> + int cached_power_down_state_idx;
>> int (*attach_dev)(struct generic_pm_domain *domain,
>> struct device *dev);
>> void (*detach_dev)(struct generic_pm_domain *domain,
>>
>
> I don't see much value in splitting this patch off [07/26] and it actually
> confused me, so it may as well confuse someone else.
>
The idea was to let people comment explicitly on whether dev PM
QoS constraints should be considered by the governor.
However, I get your point; let's combine them!
Kind regards
Uffe
On 19 July 2018 at 12:25, Rafael J. Wysocki <[email protected]> wrote:
> On Wednesday, June 20, 2018 7:22:04 PM CEST Ulf Hansson wrote:
>> To enable a device belonging to a CPU to be attached to a PM domain managed
>> by genpd, let's do a few changes to genpd as to make it convenient to
>> manage the specifics around CPUs.
>>
>> First, as to be able to quickly find out what CPUs that are attached to a
>> genpd, which typically becomes useful from a genpd governor as following
>> changes are about to show, let's add a cpumask 'cpus' to the struct
>> generic_pm_domain.
>>
>> At the point when a device that belongs to a CPU, is attached/detached to
>> its corresponding PM domain via genpd_add_device(), let's update the
>> cpumask in genpd->cpus. Moreover, propagate the update of the cpumask to
>> the master domains, which makes the genpd->cpus to contain a cpumask that
>> hierarchically reflect all CPUs for a genpd, including CPUs attached to
>> subdomains.
>>
>> Second, to unconditionally manage CPUs and the cpumask in genpd->cpus, is
>> unnecessary for cases when only non-CPU devices are parts of a genpd.
>> Let's avoid this by adding a new configuration bit, GENPD_FLAG_CPU_DOMAIN.
>> Clients must set the bit before they call pm_genpd_init(), as to instruct
>> genpd that it shall deal with CPUs and thus manage the cpumask in
>> genpd->cpus.
>>
>> Cc: Lina Iyer <[email protected]>
>> Co-developed-by: Lina Iyer <[email protected]>
>> Signed-off-by: Ulf Hansson <[email protected]>
>> ---
>> drivers/base/power/domain.c | 69 ++++++++++++++++++++++++++++++++++++-
>> include/linux/pm_domain.h | 3 ++
>> 2 files changed, 71 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
>> index 21d298e1820b..6149ce0bfa7b 100644
>> --- a/drivers/base/power/domain.c
>> +++ b/drivers/base/power/domain.c
>> @@ -20,6 +20,7 @@
>> #include <linux/sched.h>
>> #include <linux/suspend.h>
>> #include <linux/export.h>
>> +#include <linux/cpu.h>
>>
>> #include "power.h"
>>
>> @@ -126,6 +127,7 @@ static const struct genpd_lock_ops genpd_spin_ops = {
>> #define genpd_is_irq_safe(genpd) (genpd->flags & GENPD_FLAG_IRQ_SAFE)
>> #define genpd_is_always_on(genpd) (genpd->flags & GENPD_FLAG_ALWAYS_ON)
>> #define genpd_is_active_wakeup(genpd) (genpd->flags & GENPD_FLAG_ACTIVE_WAKEUP)
>> +#define genpd_is_cpu_domain(genpd) (genpd->flags & GENPD_FLAG_CPU_DOMAIN)
>>
>> static inline bool irq_safe_dev_in_no_sleep_domain(struct device *dev,
>> const struct generic_pm_domain *genpd)
>> @@ -1377,6 +1379,62 @@ static void genpd_free_dev_data(struct device *dev,
>> dev_pm_put_subsys_data(dev);
>> }
>>
>> +static void __genpd_update_cpumask(struct generic_pm_domain *genpd,
>> + int cpu, bool set, unsigned int depth)
>> +{
>> + struct gpd_link *link;
>> +
>> + if (!genpd_is_cpu_domain(genpd))
>> + return;
>> +
>> + list_for_each_entry(link, &genpd->slave_links, slave_node) {
>> + struct generic_pm_domain *master = link->master;
>> +
>> + genpd_lock_nested(master, depth + 1);
>> + __genpd_update_cpumask(master, cpu, set, depth + 1);
>> + genpd_unlock(master);
>> + }
>> +
>> + if (set)
>> + cpumask_set_cpu(cpu, genpd->cpus);
>> + else
>> + cpumask_clear_cpu(cpu, genpd->cpus);
>> +}
>
> As noted elsewhere, there is a concern about the possible weight of this
> cpumask and I think that it would be good to explicitly put a limit on it.
I have been digesting your comments on the series, but wonder if this
is still a relevant concern?
Updating the mask is only done when the CPU is attached to its PM
domain. However, of course, I should not allocate the cpumask in
pm_genpd_init() unless GENPD_FLAG_CPU_DOMAIN is set, as that would
just be a waste.
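As a rough illustration of the allocation change described above, here is a small self-contained sketch (hypothetical names and a plain bitmap standing in for the real genpd structures and cpumask API; the flag value is illustrative, not the kernel's): only pay for the mask when the client asked for CPU handling.

```c
#include <stdlib.h>

#define GENPD_FLAG_CPU_DOMAIN (1U << 3) /* illustrative flag value */
#define NR_CPUS 8                       /* stand-in for the kernel config */

/* Simplified stand-in for struct generic_pm_domain. */
struct fake_genpd {
	unsigned int flags;
	unsigned long *cpus; /* bitmap, one bit per possible CPU */
};

/*
 * Sketch of the plan from the reply above: allocate the cpumask in the
 * init path only when GENPD_FLAG_CPU_DOMAIN was set by the client.
 */
static int fake_genpd_init(struct fake_genpd *genpd)
{
	genpd->cpus = NULL;
	if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
		return 0; /* nothing to allocate for non-CPU domains */

	genpd->cpus = calloc((NR_CPUS + 63) / 64, sizeof(unsigned long));
	return genpd->cpus ? 0 : -1;
}
```

In the real code the allocation would of course use the kernel's cpumask helpers rather than calloc(); the point is only the conditional.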
>
>> +
>> +static void genpd_update_cpumask(struct generic_pm_domain *genpd,
>> + struct device *dev, bool set)
>> +{
>> + bool is_cpu = false;
>> + int cpu;
>> +
>> + if (!genpd_is_cpu_domain(genpd))
>> + return;
>> +
>> + for_each_possible_cpu(cpu) {
>> + if (get_cpu_device(cpu) == dev) {
>> + is_cpu = true;
>
> You may call __genpd_update_cpumask() right here and then you won't
> need the extra is_cpu variable.
Yes, indeed this looks weird, thanks for spotting it!
Ah, now I recall: the idea was to store an is_cpu variable per device,
to avoid looking up the CPU device at detach, but that is just
unnecessary. :-)
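A minimal model of what the review comment suggests (hypothetical stand-in names, a plain bitmask and a flat device array instead of the real cpumask and driver-model machinery): update the mask directly when the matching CPU device is found, so no separate is_cpu variable is needed.

```c
#include <stdbool.h>

#define NR_FAKE_CPUS 4

/* Stand-in: which device object each possible CPU maps to. */
static void *fake_cpu_devs[NR_FAKE_CPUS];

static unsigned int fake_domain_mask; /* bit N set => CPU N attached */

/* Stand-in for __genpd_update_cpumask(): just flips the bit here. */
static void fake_update_cpumask(int cpu, bool set)
{
	if (set)
		fake_domain_mask |= 1U << cpu;
	else
		fake_domain_mask &= ~(1U << cpu);
}

/*
 * Per the review comment: call the update helper right where the CPU
 * device is found and return, instead of recording an is_cpu flag.
 */
static void fake_genpd_update_cpumask(void *dev, bool set)
{
	int cpu;

	for (cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
		if (fake_cpu_devs[cpu] == dev) {
			fake_update_cpumask(cpu, set);
			return;
		}
	}
}
```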
[...]
Thanks for reviewing!
Kind regards
Uffe
On 19 July 2018 at 12:22, Rafael J. Wysocki <[email protected]> wrote:
> On Wednesday, June 20, 2018 7:22:05 PM CEST Ulf Hansson wrote:
>> Introduce two new genpd helper functions, of_genpd_attach|detach_cpu(),
>> which take the CPU number as an in-parameter.
>>
>> To attach a CPU to a genpd, of_genpd_attach_cpu() starts by fetching the
>> struct device belonging to the CPU. Then it calls genpd_dev_pm_attach(),
>> which via DT tries to hook up the CPU device to its corresponding PM
>> domain. If it succeeds, of_genpd_attach_cpu() continues by
>> preparing/enabling runtime PM of the device.
>>
>> To detach a CPU from its PM domain, of_genpd_detach_cpu() reverses the
>> operations made by of_genpd_attach_cpu(). However, first it checks that
>> the CPU device has a valid PM domain pointer assigned, to make sure it
>> belongs to genpd.
>>
>> Cc: Lina Iyer <[email protected]>
>> Co-developed-by: Lina Iyer <[email protected]>
>> Signed-off-by: Ulf Hansson <[email protected]>
>> ---
>> drivers/base/power/domain.c | 69 +++++++++++++++++++++++++++++++++++++
>> include/linux/pm_domain.h | 9 +++++
>> 2 files changed, 78 insertions(+)
[...]
> I'd combine this with patch [04/26]. The split here is somewhat artificial IMO.
I wanted to keep one change per patch, hence the split.
The $subject patch introduces helpers to add CPU devices to genpd and
isn't really part of making genpd cope with CPU devices. So the
$subject patch is about avoiding open coding.
Are you still sure you want me to squash the changes?
Kind regards
Uffe
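The attach/detach flow that the quoted changelog describes can be modeled with a very rough, self-contained sketch (all names below are hypothetical stand-ins, not the real driver-model or genpd calls): fetch the CPU's device, try the DT-based attach, then enable runtime PM; detach bails out unless the device actually got a PM domain, then reverses the steps.

```c
#include <stdbool.h>

/* Stand-in for a CPU's struct device. */
struct fake_dev {
	bool has_pm_domain;
	bool runtime_pm_enabled;
};

static struct fake_dev fake_cpu_dev[2];

static struct fake_dev *fake_get_cpu_device(int cpu)
{
	return (cpu >= 0 && cpu < 2) ? &fake_cpu_dev[cpu] : 0;
}

static int fake_genpd_dev_pm_attach(struct fake_dev *dev)
{
	dev->has_pm_domain = true; /* pretend the DT lookup succeeded */
	return 0;
}

/* Sketch of of_genpd_attach_cpu(): fetch device, attach, enable runtime PM. */
static int fake_of_genpd_attach_cpu(int cpu)
{
	struct fake_dev *dev = fake_get_cpu_device(cpu);

	if (!dev)
		return -1;
	if (fake_genpd_dev_pm_attach(dev))
		return -1;
	dev->runtime_pm_enabled = true;
	return 0;
}

/* Sketch of of_genpd_detach_cpu(): only reverse if a PM domain is assigned. */
static void fake_of_genpd_detach_cpu(int cpu)
{
	struct fake_dev *dev = fake_get_cpu_device(cpu);

	if (!dev || !dev->has_pm_domain)
		return;
	dev->runtime_pm_enabled = false;
	dev->has_pm_domain = false;
}
```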
On 26 July 2018 at 11:14, Rafael J. Wysocki <[email protected]> wrote:
> On Thursday, July 19, 2018 12:32:52 PM CEST Rafael J. Wysocki wrote:
>> On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
>> > As it's now perfectly possible that a PM domain managed by genpd contains
>> > devices belonging to CPUs, we should start to take into account the
>> > residency values for the idle states during the state selection process.
>> > The residency value specifies the minimum duration of time the CPU, or a
>> > group of CPUs, needs to spend in an idle state to not waste energy entering
>> > it.
>> >
>> > To deal with this, let's add a new genpd governor, pm_domain_cpu_gov, that
>> > may be used for a PM domain that has CPU devices attached, or whose CPUs
>> > are attached through subdomains.
>> >
>> > The new governor computes the minimum expected idle duration time for the
>> > online CPUs attached to the PM domain and its subdomains. Then, in the
>> > state selection process, trying the deepest state first, it verifies that
>> > the idle duration time satisfies the state's residency value.
>> >
>> > It should be noted that, when computing the minimum expected idle duration
>> > time, we use the information from tick_nohz_get_next_wakeup() to find the
>> > next wakeup for the related CPUs. Going forward, this may deserve to be
>> > improved, as there are more reasons why a CPU may be woken up from idle.
>> >
>> > Cc: Thomas Gleixner <[email protected]>
>> > Cc: Daniel Lezcano <[email protected]>
>> > Cc: Lina Iyer <[email protected]>
>> > Cc: Frederic Weisbecker <[email protected]>
>> > Cc: Ingo Molnar <[email protected]>
>> > Co-developed-by: Lina Iyer <[email protected]>
>> > Signed-off-by: Ulf Hansson <[email protected]>
>> > ---
>> > drivers/base/power/domain_governor.c | 58 ++++++++++++++++++++++++++++
>> > include/linux/pm_domain.h | 2 +
>> > 2 files changed, 60 insertions(+)
>> >
>> > diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
>> > index 99896fbf18e4..1aad55719537 100644
>> > --- a/drivers/base/power/domain_governor.c
>> > +++ b/drivers/base/power/domain_governor.c
>> > @@ -10,6 +10,9 @@
>> > #include <linux/pm_domain.h>
>> > #include <linux/pm_qos.h>
>> > #include <linux/hrtimer.h>
>> > +#include <linux/cpumask.h>
>> > +#include <linux/ktime.h>
>> > +#include <linux/tick.h>
>> >
>> > static int dev_update_qos_constraint(struct device *dev, void *data)
>> > {
>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
>> > return false;
>> > }
>> >
>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
>> > +{
>> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
>> > + ktime_t domain_wakeup, cpu_wakeup;
>> > + s64 idle_duration_ns;
>> > + int cpu, i;
>> > +
>> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
>> > + return true;
>> > +
>> > + /*
>> > + * Find the next wakeup for any of the online CPUs within the PM domain
>> > + * and its subdomains. Note, we only need the genpd->cpus, as it already
>> > + * contains a mask of all CPUs from subdomains.
>> > + */
>> > + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
>> > + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
>> > + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
>> > + if (ktime_before(cpu_wakeup, domain_wakeup))
>> > + domain_wakeup = cpu_wakeup;
>> > + }
>
> Here's a concern I have missed before. :-/
>
> Say, one of the CPUs you're walking here is woken up in the meantime.
Yes, that can happen - when we have mispredicted the "next wakeup".
>
> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> to update domain_wakeup. We really should just avoid the domain power off in
> that case at all IMO.
Correct.
However, we also want to avoid locking contention in the idle path,
which is what this boils down to.
>
> Sure enough, if the domain power off is already started and one of the CPUs
> in the domain is woken up then, too bad, it will suffer the latency (but in
> that case the hardware should be able to help somewhat), but otherwise CPU
> wakeup should prevent domain power off from being carried out.
The CPU is not prevented from waking up, as we rely on the FW to deal with that.
Even if the above computation turns out to wrongly suggest that the
cluster can be powered off, the FW shall, together with the genpd
backend driver, prevent it.
To cover this case for PSCI, we also use a per-CPU variable for the
CPU's power-off state, as can be seen later in the series.
Hope this clarifies your concern; otherwise tell me and I will elaborate a bit more.
Kind regards
Uffe
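The per-CPU power-off state idea mentioned above can be illustrated with a small self-contained model (hypothetical names; the real PSCI code records each CPU's chosen power state in a per-CPU variable and lets the firmware coordinate the cluster): the cluster state is only a candidate when every CPU in it picked a sufficiently deep state.

```c
#include <stdbool.h>

#define NR_FAKE_CPUS 4
#define FAKE_DEEPEST_STATE 2

/* Stand-in for a per-CPU variable holding each CPU's chosen idle state. */
static int fake_cpu_power_state[NR_FAKE_CPUS];

static void fake_cpu_set_state(int cpu, int state)
{
	fake_cpu_power_state[cpu] = state;
}

/*
 * Cluster power-off is only a candidate when every CPU picked a deep
 * enough state; otherwise the FW/backend keeps the cluster on, even if
 * the governor's wakeup prediction suggested powering it off.
 */
static bool fake_cluster_can_power_off(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_FAKE_CPUS; cpu++)
		if (fake_cpu_power_state[cpu] < FAKE_DEEPEST_STATE)
			return false;
	return true;
}
```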
On Fri, Aug 3, 2018 at 4:28 PM, Ulf Hansson <[email protected]> wrote:
> On 26 July 2018 at 11:14, Rafael J. Wysocki <[email protected]> wrote:
>> On Thursday, July 19, 2018 12:32:52 PM CEST Rafael J. Wysocki wrote:
>>> On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
>>> > As it's now perfectly possible that a PM domain managed by genpd contains
>>> > devices belonging to CPUs, we should start to take into account the
>>> > residency values for the idle states during the state selection process.
>>> > The residency value specifies the minimum duration of time the CPU, or a
>>> > group of CPUs, needs to spend in an idle state to not waste energy entering
>>> > it.
>>> >
>>> > To deal with this, let's add a new genpd governor, pm_domain_cpu_gov, that
>>> > may be used for a PM domain that has CPU devices attached, or whose CPUs
>>> > are attached through subdomains.
>>> >
>>> > The new governor computes the minimum expected idle duration time for the
>>> > online CPUs attached to the PM domain and its subdomains. Then, in the
>>> > state selection process, trying the deepest state first, it verifies that
>>> > the idle duration time satisfies the state's residency value.
>>> >
>>> > It should be noted that, when computing the minimum expected idle duration
>>> > time, we use the information from tick_nohz_get_next_wakeup() to find the
>>> > next wakeup for the related CPUs. Going forward, this may deserve to be
>>> > improved, as there are more reasons why a CPU may be woken up from idle.
>>> >
>>> > Cc: Thomas Gleixner <[email protected]>
>>> > Cc: Daniel Lezcano <[email protected]>
>>> > Cc: Lina Iyer <[email protected]>
>>> > Cc: Frederic Weisbecker <[email protected]>
>>> > Cc: Ingo Molnar <[email protected]>
>>> > Co-developed-by: Lina Iyer <[email protected]>
>>> > Signed-off-by: Ulf Hansson <[email protected]>
>>> > ---
>>> > drivers/base/power/domain_governor.c | 58 ++++++++++++++++++++++++++++
>>> > include/linux/pm_domain.h | 2 +
>>> > 2 files changed, 60 insertions(+)
>>> >
>>> > diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
>>> > index 99896fbf18e4..1aad55719537 100644
>>> > --- a/drivers/base/power/domain_governor.c
>>> > +++ b/drivers/base/power/domain_governor.c
>>> > @@ -10,6 +10,9 @@
>>> > #include <linux/pm_domain.h>
>>> > #include <linux/pm_qos.h>
>>> > #include <linux/hrtimer.h>
>>> > +#include <linux/cpumask.h>
>>> > +#include <linux/ktime.h>
>>> > +#include <linux/tick.h>
>>> >
>>> > static int dev_update_qos_constraint(struct device *dev, void *data)
>>> > {
>>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
>>> > return false;
>>> > }
>>> >
>>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
>>> > +{
>>> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
>>> > + ktime_t domain_wakeup, cpu_wakeup;
>>> > + s64 idle_duration_ns;
>>> > + int cpu, i;
>>> > +
>>> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
>>> > + return true;
>>> > +
>>> > + /*
>>> > + * Find the next wakeup for any of the online CPUs within the PM domain
>>> > + * and its subdomains. Note, we only need the genpd->cpus, as it already
>>> > + * contains a mask of all CPUs from subdomains.
>>> > + */
>>> > + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
>>> > + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
>>> > + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
>>> > + if (ktime_before(cpu_wakeup, domain_wakeup))
>>> > + domain_wakeup = cpu_wakeup;
>>> > + }
>>
>> Here's a concern I have missed before. :-/
>>
>> Say, one of the CPUs you're walking here is woken up in the meantime.
>
> Yes, that can happen - when we have mispredicted the "next wakeup".
>
>>
>> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
>> to update domain_wakeup. We really should just avoid the domain power off in
>> that case at all IMO.
>
> Correct.
>
> However, we also want to avoid locking contention in the idle path,
> which is what this boils down to.
This already is done under genpd_lock() AFAICS, so I'm not quite sure
what exactly you mean.
Besides, this is not just about increased latency, which is a concern
by itself but maybe not so much in all environments, but also about
possibility of missing a CPU wakeup, which is a major issue.
If one of the CPUs sharing the domain with the current one is woken up
during cpu_power_down_ok() and the wakeup is an edge-triggered
interrupt and the domain is turned off regardless, the wakeup may be
missed entirely if I'm not mistaken.
It looks like there needs to be a way for the hardware to prevent a
domain poweroff when there's a pending interrupt or I don't quite see
how this can be handled correctly.
>> Sure enough, if the domain power off is already started and one of the CPUs
>> in the domain is woken up then, too bad, it will suffer the latency (but in
>> that case the hardware should be able to help somewhat), but otherwise CPU
>> wakeup should prevent domain power off from being carried out.
>
> The CPU is not prevented from waking up, as we rely on the FW to deal with that.
>
> Even if the above computation turns out to wrongly suggest that the
> cluster can be powered off, the FW shall together with the genpd
> backend driver prevent it.
Fine, but then the solution depends on specific FW/HW behavior, so I'm
not sure how generic it really is. At least, that expectation should
be clearly documented somewhere, preferably in code comments.
> To cover this case for PSCI, we also use a per cpu variable for the
> CPU's power off state, as can be seen later in the series.
Oh great, but the generic part should be independent of the underlying
implementation of the driver. If it isn't, then it also is not
generic.
> Hope this clarifies your concern; otherwise tell me and I will elaborate a bit more.
Not really.
There also is one more problem and that is the interaction between
this code and the idle governor.
Namely, the idle governor may select a shallower state for some
reason, for example due to an additional latency limit derived from
CPU utilization (like in the menu governor). How does the code in
cpu_power_down_ok() know which state has been selected, and how does it
honor the selection made by the idle governor?
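For reference, the deepest-first selection against residency values that the quoted changelog describes can be sketched as a small self-contained model (hypothetical names and a plain array instead of the real genpd state table; this shows only the selection step, not the locking or wakeup-prediction parts under discussion):

```c
#include <stdint.h>

/* Simplified stand-in for a genpd idle-state entry. */
struct fake_state {
	int64_t residency_ns; /* minimum sleep time for the state to pay off */
};

/*
 * Sketch of the selection step in cpu_power_down_ok(): given the
 * predicted idle duration, walk the states from deepest (last index)
 * to shallowest and pick the first one whose residency is satisfied.
 * Returns the chosen index, or -1 if not even the shallowest state
 * fits, i.e. the domain should stay on.
 */
static int fake_pick_state(const struct fake_state *states, int count,
			   int64_t idle_duration_ns)
{
	int i;

	for (i = count - 1; i >= 0; i--)
		if (idle_duration_ns >= states[i].residency_ns)
			return i;
	return -1;
}
```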
On Fri, Aug 3, 2018 at 1:43 PM, Ulf Hansson <[email protected]> wrote:
> On 19 July 2018 at 12:25, Rafael J. Wysocki <[email protected]> wrote:
>> On Wednesday, June 20, 2018 7:22:04 PM CEST Ulf Hansson wrote:
>>> To enable a device belonging to a CPU to be attached to a PM domain managed
>>> by genpd, let's make a few changes to genpd so that it becomes convenient to
>>> manage the specifics around CPUs.
>>>
>>> First, to be able to quickly find out which CPUs are attached to a
>>> genpd, which typically becomes useful from a genpd governor as the
>>> following changes are about to show, let's add a cpumask 'cpus' to the struct
>>> generic_pm_domain.
>>>
>>> At the point when a device that belongs to a CPU is attached to or detached
>>> from its corresponding PM domain via genpd_add_device(), let's update the
>>> cpumask in genpd->cpus. Moreover, propagate the update of the cpumask to
>>> the master domains, which makes genpd->cpus contain a cpumask that
>>> hierarchically reflects all CPUs for a genpd, including CPUs attached to
>>> subdomains.
>>>
>>> Second, unconditionally managing CPUs and the cpumask in genpd->cpus is
>>> unnecessary for cases when only non-CPU devices are part of a genpd.
>>> Let's avoid this by adding a new configuration bit, GENPD_FLAG_CPU_DOMAIN.
>>> Clients must set the bit before they call pm_genpd_init(), to instruct
>>> genpd that it shall deal with CPUs and thus manage the cpumask in
>>> genpd->cpus.
>>>
>>> Cc: Lina Iyer <[email protected]>
>>> Co-developed-by: Lina Iyer <[email protected]>
>>> Signed-off-by: Ulf Hansson <[email protected]>
>>> ---
>>> drivers/base/power/domain.c | 69 ++++++++++++++++++++++++++++++++++++-
>>> include/linux/pm_domain.h | 3 ++
>>> 2 files changed, 71 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
>>> index 21d298e1820b..6149ce0bfa7b 100644
>>> --- a/drivers/base/power/domain.c
>>> +++ b/drivers/base/power/domain.c
>>> @@ -20,6 +20,7 @@
>>> #include <linux/sched.h>
>>> #include <linux/suspend.h>
>>> #include <linux/export.h>
>>> +#include <linux/cpu.h>
>>>
>>> #include "power.h"
>>>
>>> @@ -126,6 +127,7 @@ static const struct genpd_lock_ops genpd_spin_ops = {
>>> #define genpd_is_irq_safe(genpd) (genpd->flags & GENPD_FLAG_IRQ_SAFE)
>>> #define genpd_is_always_on(genpd) (genpd->flags & GENPD_FLAG_ALWAYS_ON)
>>> #define genpd_is_active_wakeup(genpd) (genpd->flags & GENPD_FLAG_ACTIVE_WAKEUP)
>>> +#define genpd_is_cpu_domain(genpd) (genpd->flags & GENPD_FLAG_CPU_DOMAIN)
>>>
>>> static inline bool irq_safe_dev_in_no_sleep_domain(struct device *dev,
>>> const struct generic_pm_domain *genpd)
>>> @@ -1377,6 +1379,62 @@ static void genpd_free_dev_data(struct device *dev,
>>> dev_pm_put_subsys_data(dev);
>>> }
>>>
>>> +static void __genpd_update_cpumask(struct generic_pm_domain *genpd,
>>> + int cpu, bool set, unsigned int depth)
>>> +{
>>> + struct gpd_link *link;
>>> +
>>> + if (!genpd_is_cpu_domain(genpd))
>>> + return;
>>> +
>>> + list_for_each_entry(link, &genpd->slave_links, slave_node) {
>>> + struct generic_pm_domain *master = link->master;
>>> +
>>> + genpd_lock_nested(master, depth + 1);
>>> + __genpd_update_cpumask(master, cpu, set, depth + 1);
>>> + genpd_unlock(master);
>>> + }
>>> +
>>> + if (set)
>>> + cpumask_set_cpu(cpu, genpd->cpus);
>>> + else
>>> + cpumask_clear_cpu(cpu, genpd->cpus);
>>> +}
>>
>> As noted elsewhere, there is a concern about the possible weight of this
>> cpumask and I think that it would be good to explicitly put a limit on it.
>
> I have been digesting your comments on the series, but wonder if this
> is still a relevant concern?
Well, there are systems with very large cpumasks and it is sort of
good to have that in mind when designing any code using them.
On Fri, Aug 3, 2018 at 1:42 PM, Ulf Hansson <[email protected]> wrote:
> [...]
>
>>>
>>> Assuming that I have got that right, there are concerns, mostly regarding
>>> patch [07/26], but I will reply to that directly.
>>
>> Well, I haven't got that right, so never mind.
>>
>> There are a few minor things to address, but apart from that the general
>> genpd patches look ready.
>
> Alright, thanks!
>
> I will re-spin the series and post a new version once 4.19 rc1 is out.
> Hopefully we can queue it up early in next cycle to get it tested in
> next for a while.
>
>>
>>> The $subject patch is fine by me by itself, but it obviously depends on the
>>> previous ones. Patches [01-02/26] are fine too, but they don't seem to be
>>> particularly useful without the rest of the series.
>>>
>>> As far as patches [10-26/26] go, I'd like to see some review comments and/or
>>> tags from the people with vested interest in there, in particular from Daniel
>>> on patch [12/26] and from Sudeep on the PSCI ones.
>>
>> But this still holds.
>
> Actually, patches 10 and 11 are ready to go as well. I will ping Daniel
> on patch 12.
>
> In regards to the rest of the series, some of the PSCI/ARM changes
> have been reviewed by Mark Rutland, however several changes have not
> been acked.
>
> On the other hand, one can also interpret the long silence in regards
> to PSCI/ARM changes as they are good to go. :-)
Well, in that case giving an ACK to them should not be an issue for
the people with a vested interest I suppose.
On Mon, Aug 06, 2018 at 11:37:55AM +0200, Rafael J. Wysocki wrote:
> On Fri, Aug 3, 2018 at 1:42 PM, Ulf Hansson <[email protected]> wrote:
> > [...]
> >
> >>>
> >>> Assuming that I have got that right, there are concerns, mostly regarding
> >>> patch [07/26], but I will reply to that directly.
> >>
> >> Well, I haven't got that right, so never mind.
> >>
> >> There are a few minor things to address, but apart from that the general
> >> genpd patches look ready.
> >
> > Alright, thanks!
> >
> > I will re-spin the series and post a new version once 4.19 rc1 is out.
> > Hopefully we can queue it up early in next cycle to get it tested in
> > next for a while.
> >
> >>
> >>> The $subject patch is fine by me by itself, but it obviously depends on the
> >>> previous ones. Patches [01-02/26] are fine too, but they don't seem to be
> >>> particularly useful without the rest of the series.
> >>>
> >>> As far as patches [10-26/26] go, I'd like to see some review comments and/or
> >>> tags from the people with vested interest in there, in particular from Daniel
> >>> on patch [12/26] and from Sudeep on the PSCI ones.
> >>
> >> But this still holds.
> >
> > Actually, patch 10 and patch11 is ready to go as well. I ping Daniel
> > on patch 12.
> >
> > In regards to the rest of the series, some of the PSCI/ARM changes
> > have been reviewed by Mark Rutland, however several changes have not
> > been acked.
> >
> > On the other hand, one can also interpret the long silence in regards
> > to PSCI/ARM changes as they are good to go. :-)
>
> Well, in that case giving an ACK to them should not be an issue for
> the people with a vested interest I suppose.
Apologies to everyone for the delay in replying.
Side note: cpu_pm_enter()/exit() are also called through syscore ops in
s2RAM/IDLE, you know that but I just wanted to mention it to compound
the discussion.
As for PSCI patches I do not personally think PSCI OSI enablement is
beneficial (and my position has always been the same since PSCI OSI was
added to the specification, I am not even talking about this patchset)
and Arm Trusted Firmware does not currently support it for the same
reason.
We (if Mark and Sudeep agree) will enable PSCI OSI if and when we have a
definitive and constructive answer to *why* we have to do that, one that is
not a dogmatic "the kernel knows better" but rather a comprehensive
power benchmark evaluation. I thought that was the agreement reached
at OSPM, but apparently I was mistaken.
As a reminder - PSCI firmware implementation has to have state machines
and locking to guarantee safe power down operations (and to flush caches
only if necessary - which requires cpu masks for power domains) and
that's true whether we enable PSCI OSI or not, the coordination logic
must be in firmware/hardware _already_ - the cpumasks, the power domain
topology, etc.
I agree with the power-domains representation of idle-states (since
that's the correct HW description), and I thought and hoped that runtime
PM could help _remove_ the CPU PM notifiers (by making the notifier
callbacks runtime PM ones), even though I have to say that's quite
complex, given that only few (ie one instance :)) CPU PM notifier
callbacks are backed by a struct device (eg an ARM PMU is a device, but
the GIC, for instance, is not, so I am not sure its save/restore code
can be implemented with runtime PM callbacks).
Lorenzo
On Wed, Aug 08 2018 at 04:56 -0600, Lorenzo Pieralisi wrote:
>On Mon, Aug 06, 2018 at 11:37:55AM +0200, Rafael J. Wysocki wrote:
>> On Fri, Aug 3, 2018 at 1:42 PM, Ulf Hansson <[email protected]> wrote:
>> > [...]
>> >
>> >>>
>> >>> Assuming that I have got that right, there are concerns, mostly regarding
>> >>> patch [07/26], but I will reply to that directly.
>> >>
>> >> Well, I haven't got that right, so never mind.
>> >>
>> >> There are a few minor things to address, but apart from that the general
>> >> genpd patches look ready.
>> >
>> > Alright, thanks!
>> >
>> > I will re-spin the series and post a new version once 4.19 rc1 is out.
>> > Hopefully we can queue it up early in next cycle to get it tested in
>> > next for a while.
>> >
>> >>
>> >>> The $subject patch is fine by me by itself, but it obviously depends on the
>> >>> previous ones. Patches [01-02/26] are fine too, but they don't seem to be
>> >>> particularly useful without the rest of the series.
>> >>>
>> >>> As far as patches [10-26/26] go, I'd like to see some review comments and/or
>> >>> tags from the people with vested interest in there, in particular from Daniel
>> >>> on patch [12/26] and from Sudeep on the PSCI ones.
>> >>
>> >> But this still holds.
>> >
>> > Actually, patch 10 and patch11 is ready to go as well. I ping Daniel
>> > on patch 12.
>> >
>> > In regards to the rest of the series, some of the PSCI/ARM changes
>> > have been reviewed by Mark Rutland, however several changes have not
>> > been acked.
>> >
>> > On the other hand, one can also interpret the long silence in regards
>> > to PSCI/ARM changes as they are good to go. :-)
>>
>> Well, in that case giving an ACK to them should not be an issue for
>> the people with a vested interest I suppose.
>
>Apologies to everyone for the delay in replying.
>
>Side note: cpu_pm_enter()/exit() are also called through syscore ops in
>s2RAM/IDLE, you know that but I just wanted to mention it to compound
>the discussion.
>
>As for PSCI patches I do not personally think PSCI OSI enablement is
>beneficial (and my position has always been the same since PSCI OSI was
>added to the specification, I am not even talking about this patchset)
>and Arm Trusted Firmware does not currently support it for the same
>reason.
>
>We (if Mark and Sudeep agree) will enable PSCI OSI if and when we have a
>definitive and constructive answer to *why* we have to do that that is
>not a dogmatic "the kernel knows better" but rather a comprehensive
>power benchmark evaluation - I thought that was the agreement reached
>at OSPM but apparently I was mistaken.
>
I will not speak to any comparison of benchmarks between OSI and PC.
AFAIK, there are no platforms supporting both.
But the OSI feature is critical for QCOM mobile platforms. The
last-man activities during cpuidle save quite a lot of power. Powering
off the clocks, buses, regulators and even the oscillator is very
important for a reasonable battery life when using the phone. The
platform-coordinated approach falls quite short of the needs of a
powerful processor with the desired battery efficiency.
-- Lina
>As a reminder - PSCI firmware implementation has to have state machines
>and locking to guarantee safe power down operations (and to flush caches
>only if necessary - which requires cpu masks for power domains) and
>that's true whether we enable PSCI OSI or not, the coordination logic
>must be in firmware/hardware _already_ - the cpumasks, the power domain
>topology, etc.
>
>I agree with the power-domains representation of idle-states (since
>that's the correct HW description) and I thought and hoped that runtime
>PM could help _remove_ the CPU PM notifiers (by making the notifiers
>callbacks a runtime PM one) even though I have to say that's quite
>complex, given that only few (ie one instance :)) CPU PM notifiers
>callbacks are backed by a struct device (eg an ARM PMU is a device but
>for instance the GIC is not a device so its save/restore code I am not
>sure it can be implemented with runtime PM callbacks).
>
>Lorenzo
On Wed, Aug 8, 2018 at 8:02 PM, Lina Iyer <[email protected]> wrote:
> On Wed, Aug 08 2018 at 04:56 -0600, Lorenzo Pieralisi wrote:
>>
>> On Mon, Aug 06, 2018 at 11:37:55AM +0200, Rafael J. Wysocki wrote:
>>>
>>> On Fri, Aug 3, 2018 at 1:42 PM, Ulf Hansson <[email protected]>
>>> wrote:
>>> > [...]
>>> >
>>> >>>
>>> >>> Assuming that I have got that right, there are concerns, mostly
>>> >>> regarding
>>> >>> patch [07/26], but I will reply to that directly.
>>> >>
>>> >> Well, I haven't got that right, so never mind.
>>> >>
>>> >> There are a few minor things to address, but apart from that the
>>> >> general
>>> >> genpd patches look ready.
>>> >
>>> > Alright, thanks!
>>> >
>>> > I will re-spin the series and post a new version once 4.19 rc1 is out.
>>> > Hopefully we can queue it up early in next cycle to get it tested in
>>> > next for a while.
>>> >
>>> >>
>>> >>> The $subject patch is fine by me by itself, but it obviously depends
>>> >>> on the
>>> >>> previous ones. Patches [01-02/26] are fine too, but they don't seem
>>> >>> to be
>>> >>> particularly useful without the rest of the series.
>>> >>>
>>> >>> As far as patches [10-26/26] go, I'd like to see some review comments
>>> >>> and/or
>>> >>> tags from the people with vested interest in there, in particular
>>> >>> from Daniel
>>> >>> on patch [12/26] and from Sudeep on the PSCI ones.
>>> >>
>>> >> But this still holds.
>>> >
>>> > Actually, patch 10 and patch11 is ready to go as well. I ping Daniel
>>> > on patch 12.
>>> >
>>> > In regards to the rest of the series, some of the PSCI/ARM changes
>>> > have been reviewed by Mark Rutland, however several changes have not
>>> > been acked.
>>> >
>>> > On the other hand, one can also interpret the long silence in regards
>>> > to PSCI/ARM changes as they are good to go. :-)
>>>
>>> Well, in that case giving an ACK to them should not be an issue for
>>> the people with a vested interest I suppose.
>>
>>
>> Apologies to everyone for the delay in replying.
>>
>> Side note: cpu_pm_enter()/exit() are also called through syscore ops in
>> s2RAM/IDLE, you know that but I just wanted to mention it to compound
>> the discussion.
>>
>> As for PSCI patches I do not personally think PSCI OSI enablement is
>> beneficial (and my position has always been the same since PSCI OSI was
>> added to the specification, I am not even talking about this patchset)
>> and Arm Trusted Firmware does not currently support it for the same
>> reason.
>>
>> We (if Mark and Sudeep agree) will enable PSCI OSI if and when we have a
>> definitive and constructive answer to *why* we have to do that that is
>> not a dogmatic "the kernel knows better" but rather a comprehensive
>> power benchmark evaluation - I thought that was the agreement reached
>> at OSPM but apparently I was mistaken.
>>
> I will not speak to any comparison of benchmarks between OSI and PC.
> AFAIK, there are no platforms supporting both.
>
> But, the OSI feature is critical for QCOM mobile platforms. The
> last man activities during cpuidle save quite a lot of power. Powering
> off the clocks, busses, regulators and even the oscillator is very
> important to have a reasonable battery life when using the phone.
> Platform coordinated approach falls quite short of the needs of a
> powerful processor with a desired battery efficiency.
Even so, you still need firmware (or hardware) to do the right thing
in the concurrent wakeup via an edge-triggered interrupt case AFAICS.
That is, you need the domain to be prevented from being turned off if
one of the CPUs in it has just been woken up and the interrupt is
still pending.
On Wed, Aug 08, 2018 at 12:02:48PM -0600, Lina Iyer wrote:
[...]
> I will not speak to any comparison of benchmarks between OSI and PC.
> AFAIK, there are no platforms supporting both.
>
That's the fundamental issue here. So we have never ever done a proper
comparison.
> But, the OSI feature is critical for QCOM mobile platforms. The
> last man activities during cpuidle save quite a lot of power. Powering
> off the clocks, busses, regulators and even the oscillator is very
> important to have a reasonable battery life when using the phone.
> Platform coordinated approach falls quite short of the needs of a
> powerful processor with a desired battery efficiency.
>
As mentioned above, without the actual comparison it's hard to justify
that. While there are corner cases where OSI is able to make a better
judgement, maybe we can add ways to deal with those in the firmware
with PC mode; have we explored that before adding complexity to the OSPM?
Since the firmware complexity with OSI remains the same as with PC mode,
isn't it worth checking whether the corner cases we are talking about here
can be handled in the firmware?
--
Regards,
Sudeep
On Wed, Aug 08, 2018 at 12:02:48PM -0600, Lina Iyer wrote:
> On Wed, Aug 08 2018 at 04:56 -0600, Lorenzo Pieralisi wrote:
> >On Mon, Aug 06, 2018 at 11:37:55AM +0200, Rafael J. Wysocki wrote:
> >>On Fri, Aug 3, 2018 at 1:42 PM, Ulf Hansson <[email protected]> wrote:
> >>> [...]
> >>>
> >>>>>
> >>>>> Assuming that I have got that right, there are concerns, mostly regarding
> >>>>> patch [07/26], but I will reply to that directly.
> >>>>
> >>>> Well, I haven't got that right, so never mind.
> >>>>
> >>>> There are a few minor things to address, but apart from that the general
> >>>> genpd patches look ready.
> >>>
> >>> Alright, thanks!
> >>>
> >>> I will re-spin the series and post a new version once 4.19 rc1 is out.
> >>> Hopefully we can queue it up early in next cycle to get it tested in
> >>> next for a while.
> >>>
> >>>>
> >>>>> The $subject patch is fine by me by itself, but it obviously depends on the
> >>>>> previous ones. Patches [01-02/26] are fine too, but they don't seem to be
> >>>>> particularly useful without the rest of the series.
> >>>>>
> >>>>> As far as patches [10-26/26] go, I'd like to see some review comments and/or
> >>>>> tags from the people with vested interest in there, in particular from Daniel
> >>>>> on patch [12/26] and from Sudeep on the PSCI ones.
> >>>>
> >>>> But this still holds.
> >>>
> >>> Actually, patches 10 and 11 are ready to go as well. I will ping Daniel
> >>> on patch 12.
> >>>
> >>> In regards to the rest of the series, some of the PSCI/ARM changes
> >>> have been reviewed by Mark Rutland, however several changes have not
> >>> been acked.
> >>>
> >>> On the other hand, one can also interpret the long silence in regards
> >>> to PSCI/ARM changes as they are good to go. :-)
> >>
> >>Well, in that case giving an ACK to them should not be an issue for
> >>the people with a vested interest I suppose.
> >
> >Apologies to everyone for the delay in replying.
> >
> >Side note: cpu_pm_enter()/exit() are also called through syscore ops in
> >s2RAM/IDLE, you know that but I just wanted to mention it to compound
> >the discussion.
> >
> >As for PSCI patches I do not personally think PSCI OSI enablement is
> >beneficial (and my position has always been the same since PSCI OSI was
> >added to the specification, I am not even talking about this patchset)
> >and Arm Trusted Firmware does not currently support it for the same
> >reason.
> >
> >We (if Mark and Sudeep agree) will enable PSCI OSI if and when we have a
> >definitive and constructive answer to *why* we have to do that that is
> >not a dogmatic "the kernel knows better" but rather a comprehensive
> >power benchmark evaluation - I thought that was the agreement reached
> >at OSPM but apparently I was mistaken.
> >
> I will not speak to any comparison of benchmarks between OSI and PC.
> AFAIK, there are no platforms supporting both.
PSCI specifications, 5.20.1:
"The platform will boot in platform-coordinated mode."
So all platforms implementing OSI have to support both.
> But, the OSI feature is critical for QCOM mobile platforms. The
> last man activities during cpuidle save quite a lot of power.
What I expressed above was that, in PSCI based systems (OSI or PC
alike), it is up to firmware/hardware to detect "the last man" not
the kernel.
I need to understand what you mean by "last man activities" to
provide feedback here.
> Powering off the clocks, busses, regulators and even the oscillator is
> very important to have a reasonable battery life when using the phone.
> Platform coordinated approach falls quite short of the needs of a
> powerful processor with a desired battery efficiency.
I am sorry but if you want us to merge PSCI patches in this series you
will have to back the claim above with a detailed technical explanation
of *why* platform-coordination falls short of QCOM (or whoever else)
needs wrt PSCI OSI.
Thanks,
Lorenzo
On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
[...]
> >>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> >>> > return false;
> >>> > }
> >>> >
> >>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> >>> > +{
> >>> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
> >>> > + ktime_t domain_wakeup, cpu_wakeup;
> >>> > + s64 idle_duration_ns;
> >>> > + int cpu, i;
> >>> > +
> >>> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> >>> > + return true;
> >>> > +
> >>> > + /*
> >>> > + * Find the next wakeup for any of the online CPUs within the PM domain
> >>> > + * and its subdomains. Note, we only need the genpd->cpus, as it already
> >>> > + * contains a mask of all CPUs from subdomains.
> >>> > + */
> >>> > + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> >>> > + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> >>> > + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> >>> > + if (ktime_before(cpu_wakeup, domain_wakeup))
> >>> > + domain_wakeup = cpu_wakeup;
> >>> > + }
> >>
> >> Here's a concern I have missed before. :-/
> >>
> >> Say, one of the CPUs you're walking here is woken up in the meantime.
> >
> > Yes, that can happen - when we mispredicted the "next wakeup".
> >
> >>
> >> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> >> to update domain_wakeup. We really should just avoid the domain power off in
> >> that case at all IMO.
> >
> > Correct.
> >
> > However, we also want to avoid locking contention in the idle path,
> > which is what this boils down to.
>
> This already is done under genpd_lock() AFAICS, so I'm not quite sure
> what exactly you mean.
>
> Besides, this is not just about increased latency, which is a concern
> by itself but maybe not so much in all environments, but also about
> possibility of missing a CPU wakeup, which is a major issue.
>
> If one of the CPUs sharing the domain with the current one is woken up
> during cpu_power_down_ok() and the wakeup is an edge-triggered
> interrupt and the domain is turned off regardless, the wakeup may be
> missed entirely if I'm not mistaken.
>
> It looks like there needs to be a way for the hardware to prevent a
> domain poweroff when there's a pending interrupt or I don't quite see
> how this can be handled correctly.
>
> >> Sure enough, if the domain power off is already started and one of the CPUs
> >> in the domain is woken up then, too bad, it will suffer the latency (but in
> >> that case the hardware should be able to help somewhat), but otherwise CPU
> >> wakeup should prevent domain power off from being carried out.
> >
> > The CPU is not prevented from waking up, as we rely on the FW to deal with that.
> >
> > Even if the above computation turns out to wrongly suggest that the
> > cluster can be powered off, the FW shall together with the genpd
> > backend driver prevent it.
>
> Fine, but then the solution depends on specific FW/HW behavior, so I'm
> not sure how generic it really is. At least, that expectation should
> be clearly documented somewhere, preferably in code comments.
>
> > To cover this case for PSCI, we also use a per cpu variable for the
> > CPU's power off state, as can be seen later in the series.
>
> Oh great, but the generic part should be independent of the underlying
> implementation of the driver. If it isn't, then it also is not
> generic.
>
> > Hope this clarifies your concern; otherwise tell me and I will elaborate a bit more.
>
> Not really.
>
> There also is one more problem and that is the interaction between
> this code and the idle governor.
>
> Namely, the idle governor may select a shallower state for some
> reason, for example due to an additional latency limit derived from
> CPU utilization (like in the menu governor), and how does the code in
> cpu_power_down_ok() know what state has been selected and how does it
> honor the selection made by the idle governor?
That's a good question and maybe it gives a path towards a solution.
AFAICS the GenPD governor only selects the idle state parameter that
determines the idle state at, say, GenPD cpumask level; it does not touch
the CPUidle decision, which works on a subset of idle states (at cpu
level).
That's my understanding, which may be wrong, so please correct me
if that's the case, because it's a bit confusing.
Let's imagine that we flattened out the list of idle states and fed
CPUidle with all of them - cpu, cluster, package, system - as it is
in the mainline _now_. Then the GenPD governor can run through the
CPUidle selection and _demote_ the idle state if necessary, since it
understands that some CPUs in the GenPD will wake up shortly and break
the target-residency hypothesis the CPUidle governor is expecting.
The whole idea of this series is improving the CPUidle decision when
the target idle state is _shared_ among groups of cpus (again, please
do correct me if I am wrong).
It is obvious that a GenPD governor must only demote - never promote - a
CPU idle state selection, given that the hierarchy implies more power
savings and higher required target residencies.
This whole series would become more generic and would not depend on
PSCI OSI at all - actually it would become a hierarchical
CPUidle governor.
I still think that PSCI firmware and most certainly mwait() play the
role the GenPD governor does, since they can detect in FW/HW whether
it is worthwhile to switch off a domain; the information is obviously
there, and the kernel would just add latency to the idle path in that
case, but let's gloss over this for the sake of this discussion.
Lorenzo
On Thu, Aug 09 2018 at 04:25 -0600, Lorenzo Pieralisi wrote:
>On Wed, Aug 08, 2018 at 12:02:48PM -0600, Lina Iyer wrote:
>> On Wed, Aug 08 2018 at 04:56 -0600, Lorenzo Pieralisi wrote:
>> >On Mon, Aug 06, 2018 at 11:37:55AM +0200, Rafael J. Wysocki wrote:
>> >>On Fri, Aug 3, 2018 at 1:42 PM, Ulf Hansson <[email protected]> wrote:
>> >>> [...]
>> >>>
>> >>>>>
>> >>>>> Assuming that I have got that right, there are concerns, mostly regarding
>> >>>>> patch [07/26], but I will reply to that directly.
>> >>>>
>> >>>> Well, I haven't got that right, so never mind.
>> >>>>
>> >>>> There are a few minor things to address, but apart from that the general
>> >>>> genpd patches look ready.
>> >>>
>> >>> Alright, thanks!
>> >>>
>> >>> I will re-spin the series and post a new version once 4.19 rc1 is out.
>> >>> Hopefully we can queue it up early in next cycle to get it tested in
>> >>> next for a while.
>> >>>
>> >>>>
>> >>>>> The $subject patch is fine by me by itself, but it obviously depends on the
>> >>>>> previous ones. Patches [01-02/26] are fine too, but they don't seem to be
>> >>>>> particularly useful without the rest of the series.
>> >>>>>
>> >>>>> As far as patches [10-26/26] go, I'd like to see some review comments and/or
>> >>>>> tags from the people with vested interest in there, in particular from Daniel
>> >>>>> on patch [12/26] and from Sudeep on the PSCI ones.
>> >>>>
>> >>>> But this still holds.
>> >>>
>> >>> Actually, patches 10 and 11 are ready to go as well. I will ping Daniel
>> >>> on patch 12.
>> >>>
>> >>> In regards to the rest of the series, some of the PSCI/ARM changes
>> >>> have been reviewed by Mark Rutland, however several changes have not
>> >>> been acked.
>> >>>
>> >>> On the other hand, one can also interpret the long silence in regards
>> >>> to PSCI/ARM changes as they are good to go. :-)
>> >>
>> >>Well, in that case giving an ACK to them should not be an issue for
>> >>the people with a vested interest I suppose.
>> >
>> >Apologies to everyone for the delay in replying.
>> >
>> >Side note: cpu_pm_enter()/exit() are also called through syscore ops in
>> >s2RAM/IDLE, you know that but I just wanted to mention it to compound
>> >the discussion.
>> >
>> >As for PSCI patches I do not personally think PSCI OSI enablement is
>> >beneficial (and my position has always been the same since PSCI OSI was
>> >added to the specification, I am not even talking about this patchset)
>> >and Arm Trusted Firmware does not currently support it for the same
>> >reason.
>> >
>> >We (if Mark and Sudeep agree) will enable PSCI OSI if and when we have a
>> >definitive and constructive answer to *why* we have to do that that is
>> >not a dogmatic "the kernel knows better" but rather a comprehensive
>> >power benchmark evaluation - I thought that was the agreement reached
>> >at OSPM but apparently I was mistaken.
>> >
>> I will not speak to any comparison of benchmarks between OSI and PC.
>> AFAIK, there are no platforms supporting both.
>
>PSCI specifications, 5.20.1:
>
>"The platform will boot in platform-coordinated mode."
>
>So all platforms implementing OSI have to support both.
>
I understand. But there are no actual platforms out there that support
both. QC platforms do not support platform-coordinated mode in the
firmware. The primary reason for not doing PC is that it did not fit the
requirements of all the high-level OSes running on the AP. Also, having
dead code in a firmware that also handles secure aspects was not desirable
for QC platforms. That said, the decision not to do PC is beyond my pay
grade.
>> But, the OSI feature is critical for QCOM mobile platforms. The
>> last man activities during cpuidle save quite a lot of power.
>
>What I expressed above was that, in PSCI based systems (OSI or PC
>alike), it is up to firmware/hardware to detect "the last man" not
>the kernel.
>
>I need to understand what you mean by "last man activities" to
>provide feedback here.
>
When the last CPU goes down during deep sleep, the following would be
done -
- Lower resource requirements for shared resources such as clocks,
busses and regulators that were used by drivers in the AP. When not
used by other processors in the SoC, these shared resources may be
turned off and put into a low power state by a remote processor. [1][2]
- Enable and set up wakeup-capable interrupts on an always-on interrupt
controller, so the GIC and the GPIO controllers may be put into a low
power state. [3][4]
- Write the next known wakeup value to the timer, so the blocks that
were powered off may be brought back into operation before the wakeup.
[4][5]
These are commonly done during suspend, but to achieve good power
efficiency, we have to do this when all the CPUs are just executing CPU
idle. Also, they cannot be done from the firmware (because the data
required for all this is part of Linux). OSI plays a crucial role in
determining when to do all this.
>> Powering off the clocks, busses, regulators and even the oscillator is
>> very important to have a reasonable battery life when using the phone.
>> Platform coordinated approach falls quite short of the needs of a
>> powerful processor with a desired battery efficiency.
>
>I am sorry but if you want us to merge PSCI patches in this series you
>will have to back the claim above with a detailed technical explanation
>of *why* platform-coordination falls short of QCOM (or whoever else)
>needs wrt PSCI OSI.
These items above add much value: they reduce wakeup latency from idle,
increase cache performance and increase days of use. Even if we had a
platform to test platform-coordinated mode, it would be hard to quantify,
because putting resources into low power states and setting up wakeups
is not easily possible with platform coordination. Not doing it would
leave a lot of power efficiency and performance on the table.
Thanks,
Lina
[1]. https://lkml.org/lkml/2018/6/11/546
[2]. For older production code -
https://source.codeaurora.org/quic/la/kernel/msm-4.4/tree/drivers/soc/qcom/rpm-smd.c?h=LA.HB.1.1.5.c1
Line 1764.
[3]. https://lkml.org/lkml/2018/8/10/437
[4]. Older production code -
https://source.codeaurora.org/quic/la/kernel/msm-4.4/tree/drivers/soc/qcom/mpm-of.c?h=LA.HB.1.1.5.c1
[5]. https://lkml.org/lkml/2018/7/19/218
On Thu, Aug 09 2018 at 02:16 -0600, Rafael J. Wysocki wrote:
>On Wed, Aug 8, 2018 at 8:02 PM, Lina Iyer <[email protected]> wrote:
>> On Wed, Aug 08 2018 at 04:56 -0600, Lorenzo Pieralisi wrote:
>>>
>>> On Mon, Aug 06, 2018 at 11:37:55AM +0200, Rafael J. Wysocki wrote:
>>>>
>>>> On Fri, Aug 3, 2018 at 1:42 PM, Ulf Hansson <[email protected]>
>>>> wrote:
>>>> > [...]
>>>> >
>>>> >>>
>>>> >>> Assuming that I have got that right, there are concerns, mostly
>>>> >>> regarding
>>>> >>> patch [07/26], but I will reply to that directly.
>>>> >>
>>>> >> Well, I haven't got that right, so never mind.
>>>> >>
>>>> >> There are a few minor things to address, but apart from that the
>>>> >> general
>>>> >> genpd patches look ready.
>>>> >
>>>> > Alright, thanks!
>>>> >
>>>> > I will re-spin the series and post a new version once 4.19 rc1 is out.
>>>> > Hopefully we can queue it up early in next cycle to get it tested in
>>>> > next for a while.
>>>> >
>>>> >>
>>>> >>> The $subject patch is fine by me by itself, but it obviously depends
>>>> >>> on the
>>>> >>> previous ones. Patches [01-02/26] are fine too, but they don't seem
>>>> >>> to be
>>>> >>> particularly useful without the rest of the series.
>>>> >>>
>>>> >>> As far as patches [10-26/26] go, I'd like to see some review comments
>>>> >>> and/or
>>>> >>> tags from the people with vested interest in there, in particular
>>>> >>> from Daniel
>>>> >>> on patch [12/26] and from Sudeep on the PSCI ones.
>>>> >>
>>>> >> But this still holds.
>>>> >
>>>> > Actually, patches 10 and 11 are ready to go as well. I will ping Daniel
>>>> > on patch 12.
>>>> >
>>>> > In regards to the rest of the series, some of the PSCI/ARM changes
>>>> > have been reviewed by Mark Rutland, however several changes have not
>>>> > been acked.
>>>> >
>>>> > On the other hand, one can also interpret the long silence in regards
>>>> > to PSCI/ARM changes as they are good to go. :-)
>>>>
>>>> Well, in that case giving an ACK to them should not be an issue for
>>>> the people with a vested interest I suppose.
>>>
>>>
>>> Apologies to everyone for the delay in replying.
>>>
>>> Side note: cpu_pm_enter()/exit() are also called through syscore ops in
>>> s2RAM/IDLE, you know that but I just wanted to mention it to compound
>>> the discussion.
>>>
>>> As for PSCI patches I do not personally think PSCI OSI enablement is
>>> beneficial (and my position has always been the same since PSCI OSI was
>>> added to the specification, I am not even talking about this patchset)
>>> and Arm Trusted Firmware does not currently support it for the same
>>> reason.
>>>
>>> We (if Mark and Sudeep agree) will enable PSCI OSI if and when we have a
>>> definitive and constructive answer to *why* we have to do that that is
>>> not a dogmatic "the kernel knows better" but rather a comprehensive
>>> power benchmark evaluation - I thought that was the agreement reached
>>> at OSPM but apparently I was mistaken.
>>>
>> I will not speak to any comparison of benchmarks between OSI and PC.
>> AFAIK, there are no platforms supporting both.
>>
>> But, the OSI feature is critical for QCOM mobile platforms. The
>> last man activities during cpuidle save quite a lot of power. Powering
>> off the clocks, busses, regulators and even the oscillator is very
>> important to have a reasonable battery life when using the phone.
>> Platform coordinated approach falls quite short of the needs of a
>> powerful processor with a desired battery efficiency.
>
>Even so, you still need firmware (or hardware) to do the right thing
>in the concurrent wakeup via an edge-triggered interrupt case AFAICS.
>That is, you need the domain to be prevented from being turned off if
>one of the CPUs in it has just been woken up and the interrupt is
>still pending.
Yes, that is true and we have been doing this on pretty much every QC
SoC there is, for CPU domains. Generally, there is a handshake of sorts
with the power domain controller when the core executes WFI: it
decrements the reference on the controller when going down and
increments it when coming up. The controller is only turned off when the
reference count is 0, and it is turned back on before the CPU is ready
to exit WFI.
What we are doing here is handing the domain's ->power_off and ->power_on
over to the platform firmware, which needs to make sure the races are
handled correctly, either in h/w, through mechanisms like MCPM, or in
the firmware. I would consider what happens during the power on/off of
the domains to be beyond the realm of genpd, at least for CPU-specific PM
domains.
Thanks,
Lina
On Fri, Aug 10, 2018 at 10:36 PM Lina Iyer <[email protected]> wrote:
>
> On Thu, Aug 09 2018 at 02:16 -0600, Rafael J. Wysocki wrote:
> >On Wed, Aug 8, 2018 at 8:02 PM, Lina Iyer <[email protected]> wrote:
> >> On Wed, Aug 08 2018 at 04:56 -0600, Lorenzo Pieralisi wrote:
> >>>
> >>> On Mon, Aug 06, 2018 at 11:37:55AM +0200, Rafael J. Wysocki wrote:
> >>>>
> >>>> On Fri, Aug 3, 2018 at 1:42 PM, Ulf Hansson <[email protected]>
> >>>> wrote:
> >>>> > [...]
> >>>> >
> >>>> >>>
> >>>> >>> Assuming that I have got that right, there are concerns, mostly
> >>>> >>> regarding
> >>>> >>> patch [07/26], but I will reply to that directly.
> >>>> >>
> >>>> >> Well, I haven't got that right, so never mind.
> >>>> >>
> >>>> >> There are a few minor things to address, but apart from that the
> >>>> >> general
> >>>> >> genpd patches look ready.
> >>>> >
> >>>> > Alright, thanks!
> >>>> >
> >>>> > I will re-spin the series and post a new version once 4.19 rc1 is out.
> >>>> > Hopefully we can queue it up early in next cycle to get it tested in
> >>>> > next for a while.
> >>>> >
> >>>> >>
> >>>> >>> The $subject patch is fine by me by itself, but it obviously depends
> >>>> >>> on the
> >>>> >>> previous ones. Patches [01-02/26] are fine too, but they don't seem
> >>>> >>> to be
> >>>> >>> particularly useful without the rest of the series.
> >>>> >>>
> >>>> >>> As far as patches [10-26/26] go, I'd like to see some review comments
> >>>> >>> and/or
> >>>> >>> tags from the people with vested interest in there, in particular
> >>>> >>> from Daniel
> >>>> >>> on patch [12/26] and from Sudeep on the PSCI ones.
> >>>> >>
> >>>> >> But this still holds.
> >>>> >
> >>>> > Actually, patches 10 and 11 are ready to go as well. I will ping Daniel
> >>>> > on patch 12.
> >>>> >
> >>>> > In regards to the rest of the series, some of the PSCI/ARM changes
> >>>> > have been reviewed by Mark Rutland, however several changes have not
> >>>> > been acked.
> >>>> >
> >>>> > On the other hand, one can also interpret the long silence in regards
> >>>> > to PSCI/ARM changes as they are good to go. :-)
> >>>>
> >>>> Well, in that case giving an ACK to them should not be an issue for
> >>>> the people with a vested interest I suppose.
> >>>
> >>>
> >>> Apologies to everyone for the delay in replying.
> >>>
> >>> Side note: cpu_pm_enter()/exit() are also called through syscore ops in
> >>> s2RAM/IDLE, you know that but I just wanted to mention it to compound
> >>> the discussion.
> >>>
> >>> As for PSCI patches I do not personally think PSCI OSI enablement is
> >>> beneficial (and my position has always been the same since PSCI OSI was
> >>> added to the specification, I am not even talking about this patchset)
> >>> and Arm Trusted Firmware does not currently support it for the same
> >>> reason.
> >>>
> >>> We (if Mark and Sudeep agree) will enable PSCI OSI if and when we have a
> >>> definitive and constructive answer to *why* we have to do that that is
> >>> not a dogmatic "the kernel knows better" but rather a comprehensive
> >>> power benchmark evaluation - I thought that was the agreement reached
> >>> at OSPM but apparently I was mistaken.
> >>>
> >> I will not speak to any comparison of benchmarks between OSI and PC.
> >> AFAIK, there are no platforms supporting both.
> >>
> >> But, the OSI feature is critical for QCOM mobile platforms. The
> >> last man activities during cpuidle save quite a lot of power. Powering
> >> off the clocks, busses, regulators and even the oscillator is very
> >> important to have a reasonable battery life when using the phone.
> >> Platform coordinated approach falls quite short of the needs of a
> >> powerful processor with a desired battery efficiency.
> >
> >Even so, you still need firmware (or hardware) to do the right thing
> >in the concurrent wakeup via an edge-triggered interrupt case AFAICS.
> >That is, you need the domain to be prevented from being turned off if
> >one of the CPUs in it has just been woken up and the interrupt is
> >still pending.
> Yes, that is true and we have been doing this on pretty much every QC
> SoC there is, for CPU domains. Generally, there is a handshake of sorts
> with the power domain controller when the core executes WFI: it
> decrements the reference on the controller when going down and
> increments it when coming up. The controller is only turned off when the
> reference count is 0, and it is turned back on before the CPU is ready
> to exit WFI.
>
> What we are doing here is handing the domain's ->power_off and ->power_on
> over to the platform firmware, which needs to make sure the races are
> handled correctly, either in h/w, through mechanisms like MCPM, or in
> the firmware. I would consider what happens during the power on/off of
> the domains to be beyond the realm of genpd, at least for CPU-specific PM
> domains.
I see.
The dependency on this FW/HW behavior should be clearly documented,
though, preferably next to the code in question, or people will try to
use it on systems where this requirement is not met and will be
wondering what's wrong and/or complaining.
On Fri, Aug 10, 2018 at 02:18:15PM -0600, Lina Iyer wrote:
[...]
> >>But, the OSI feature is critical for QCOM mobile platforms. The
> >>last man activities during cpuidle save quite a lot of power.
> >
> >What I expressed above was that, in PSCI based systems (OSI or PC
> >alike), it is up to firmware/hardware to detect "the last man" not
> >the kernel.
> >
> >I need to understand what you mean by "last man activities" to
> >provide feedback here.
> >
> When the last CPU goes down during deep sleep, the following would be
> done -
> - Lower resource requirements for shared resources such as clocks,
> busses and regulators that were used by drivers in the AP. When not
> used by other processors in the SoC, these shared resources may be
> turned off and put into a low power state by a remote processor. [1][2]
> - Enable and set up wakeup-capable interrupts on an always-on interrupt
> controller, so the GIC and the GPIO controllers may be put into a low
> power state. [3][4]
> - Write the next known wakeup value to the timer, so the blocks that
> were powered off may be brought back into operation before the wakeup.
> [4][5]
>
> These are commonly done during suspend, but to achieve good power
> efficiency, we have to do this when all the CPUs are just executing CPU
> idle. Also, they cannot be done from the firmware (because the data
> required for all this is part of Linux). OSI plays a crucial role in
> determining when to do all this.
No it does not. It is the power domain cpumasks that allow this code to
make an educated guess about the last cpu running (the kernel); PSCI OSI
is not crucial at all (it is crucial on QC platforms because that's the
only mode supported, but that's not a reason I accept as valid, since it
does not comply with the PSCI specifications).
As I mentioned in another thread[1], the generic part of this
series may be applicable in a platform-agnostic way to the
CPUidle framework; whether that's beneficial has to be proven,
and it is benchmark specific anyway.
Lorenzo
[1]: https://marc.info/?l=linux-pm&m=153382916513032&w=2
On 6 August 2018 at 11:36, Rafael J. Wysocki <[email protected]> wrote:
> On Fri, Aug 3, 2018 at 1:43 PM, Ulf Hansson <[email protected]> wrote:
>> On 19 July 2018 at 12:25, Rafael J. Wysocki <[email protected]> wrote:
>>> On Wednesday, June 20, 2018 7:22:04 PM CEST Ulf Hansson wrote:
>>>> To enable a device belonging to a CPU to be attached to a PM domain managed
>>>> by genpd, let's do a few changes to genpd as to make it convenient to
>>>> manage the specifics around CPUs.
>>>>
>>>> First, to be able to quickly find out which CPUs are attached to a
>>>> genpd, which typically becomes useful from a genpd governor as the
>>>> following changes are about to show, let's add a cpumask 'cpus' to the
>>>> struct generic_pm_domain.
>>>>
>>>> At the point when a device that belongs to a CPU is attached to/detached
>>>> from its corresponding PM domain via genpd_add_device(), let's update the
>>>> cpumask in genpd->cpus. Moreover, propagate the update of the cpumask to
>>>> the master domains, which makes genpd->cpus contain a cpumask that
>>>> hierarchically reflects all CPUs for a genpd, including CPUs attached to
>>>> subdomains.
>>>>
>>>> Second, unconditionally managing CPUs and the cpumask in genpd->cpus is
>>>> unnecessary for cases when only non-CPU devices are part of a genpd.
>>>> Let's avoid this by adding a new configuration bit, GENPD_FLAG_CPU_DOMAIN.
>>>> Clients must set the bit before they call pm_genpd_init(), to instruct
>>>> genpd that it shall deal with CPUs and thus manage the cpumask in
>>>> genpd->cpus.
>>>>
>>>> Cc: Lina Iyer <[email protected]>
>>>> Co-developed-by: Lina Iyer <[email protected]>
>>>> Signed-off-by: Ulf Hansson <[email protected]>
>>>> ---
>>>> drivers/base/power/domain.c | 69 ++++++++++++++++++++++++++++++++++++-
>>>> include/linux/pm_domain.h | 3 ++
>>>> 2 files changed, 71 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
>>>> index 21d298e1820b..6149ce0bfa7b 100644
>>>> --- a/drivers/base/power/domain.c
>>>> +++ b/drivers/base/power/domain.c
>>>> @@ -20,6 +20,7 @@
>>>> #include <linux/sched.h>
>>>> #include <linux/suspend.h>
>>>> #include <linux/export.h>
>>>> +#include <linux/cpu.h>
>>>>
>>>> #include "power.h"
>>>>
>>>> @@ -126,6 +127,7 @@ static const struct genpd_lock_ops genpd_spin_ops = {
>>>> #define genpd_is_irq_safe(genpd) (genpd->flags & GENPD_FLAG_IRQ_SAFE)
>>>> #define genpd_is_always_on(genpd) (genpd->flags & GENPD_FLAG_ALWAYS_ON)
>>>> #define genpd_is_active_wakeup(genpd) (genpd->flags & GENPD_FLAG_ACTIVE_WAKEUP)
>>>> +#define genpd_is_cpu_domain(genpd) (genpd->flags & GENPD_FLAG_CPU_DOMAIN)
>>>>
>>>> static inline bool irq_safe_dev_in_no_sleep_domain(struct device *dev,
>>>> const struct generic_pm_domain *genpd)
>>>> @@ -1377,6 +1379,62 @@ static void genpd_free_dev_data(struct device *dev,
>>>> dev_pm_put_subsys_data(dev);
>>>> }
>>>>
>>>> +static void __genpd_update_cpumask(struct generic_pm_domain *genpd,
>>>> + int cpu, bool set, unsigned int depth)
>>>> +{
>>>> + struct gpd_link *link;
>>>> +
>>>> + if (!genpd_is_cpu_domain(genpd))
>>>> + return;
>>>> +
>>>> + list_for_each_entry(link, &genpd->slave_links, slave_node) {
>>>> + struct generic_pm_domain *master = link->master;
>>>> +
>>>> + genpd_lock_nested(master, depth + 1);
>>>> + __genpd_update_cpumask(master, cpu, set, depth + 1);
>>>> + genpd_unlock(master);
>>>> + }
>>>> +
>>>> + if (set)
>>>> + cpumask_set_cpu(cpu, genpd->cpus);
>>>> + else
>>>> + cpumask_clear_cpu(cpu, genpd->cpus);
>>>> +}
>>>
>>> As noted elsewhere, there is a concern about the possible weight of this
>>> cpumask and I think that it would be good to explicitly put a limit on it.
>>
>> I have been digesting your comments on the series, but wonder if this
>> is still a relevant concern?
>
> Well, there are systems with very large cpumasks and it is sort of
> good to have that in mind when designing any code using them.
Right.
So, if I avoid allocating the cpumask for those genpd structures that
don't need it (those not having GENPD_FLAG_CPU_DOMAIN set), would
that be sufficient to address your concern?
Kind regards
Uffe
On 6 August 2018 at 11:20, Rafael J. Wysocki <[email protected]> wrote:
> On Fri, Aug 3, 2018 at 4:28 PM, Ulf Hansson <[email protected]> wrote:
>> On 26 July 2018 at 11:14, Rafael J. Wysocki <[email protected]> wrote:
>>> On Thursday, July 19, 2018 12:32:52 PM CEST Rafael J. Wysocki wrote:
>>>> On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
>>>> > As it's now perfectly possible that a PM domain managed by genpd contains
>>>> > devices belonging to CPUs, we should start to take into account the
>>>> > residency values for the idle states during the state selection process.
>>>> > The residency value specifies the minimum duration of time, the CPU or a
>>>> > group of CPUs, needs to spend in an idle state to not waste energy entering
>>>> > it.
>>>> >
>>>> > To deal with this, let's add a new genpd governor, pm_domain_cpu_gov, that
>>>> > may be used for a PM domain that have CPU devices attached or if the CPUs
>>>> > are attached through subdomains.
>>>> >
>>>> > The new governor computes the minimum expected idle duration time for the
>>>> > online CPUs being attached to the PM domain and its subdomains. Then in the
>>>> > state selection process, trying the deepest state first, it verifies that
>>>> > the idle duration time satisfies the state's residency value.
>>>> >
>>>> > It should be noted that, when computing the minimum expected idle duration
>>>> > time, we use the information from tick_nohz_get_next_wakeup(), to find the
>>>> > next wakeup for the related CPUs. Future wise, this may deserve to be
>>>> > improved, as there are more reasons to why a CPU may be woken up from idle.
>>>> >
>>>> > Cc: Thomas Gleixner <[email protected]>
>>>> > Cc: Daniel Lezcano <[email protected]>
>>>> > Cc: Lina Iyer <[email protected]>
>>>> > Cc: Frederic Weisbecker <[email protected]>
>>>> > Cc: Ingo Molnar <[email protected]>
>>>> > Co-developed-by: Lina Iyer <[email protected]>
>>>> > Signed-off-by: Ulf Hansson <[email protected]>
>>>> > ---
>>>> > drivers/base/power/domain_governor.c | 58 ++++++++++++++++++++++++++++
>>>> > include/linux/pm_domain.h | 2 +
>>>> > 2 files changed, 60 insertions(+)
>>>> >
>>>> > diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
>>>> > index 99896fbf18e4..1aad55719537 100644
>>>> > --- a/drivers/base/power/domain_governor.c
>>>> > +++ b/drivers/base/power/domain_governor.c
>>>> > @@ -10,6 +10,9 @@
>>>> > #include <linux/pm_domain.h>
>>>> > #include <linux/pm_qos.h>
>>>> > #include <linux/hrtimer.h>
>>>> > +#include <linux/cpumask.h>
>>>> > +#include <linux/ktime.h>
>>>> > +#include <linux/tick.h>
>>>> >
>>>> > static int dev_update_qos_constraint(struct device *dev, void *data)
>>>> > {
>>>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
>>>> > return false;
>>>> > }
>>>> >
>>>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
>>>> > +{
>>>> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
>>>> > + ktime_t domain_wakeup, cpu_wakeup;
>>>> > + s64 idle_duration_ns;
>>>> > + int cpu, i;
>>>> > +
>>>> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
>>>> > + return true;
>>>> > +
>>>> > + /*
>>>> > + * Find the next wakeup for any of the online CPUs within the PM domain
>>>> > + * and its subdomains. Note, we only need the genpd->cpus, as it already
>>>> > + * contains a mask of all CPUs from subdomains.
>>>> > + */
>>>> > + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
>>>> > + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
>>>> > + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
>>>> > + if (ktime_before(cpu_wakeup, domain_wakeup))
>>>> > + domain_wakeup = cpu_wakeup;
>>>> > + }
>>>
>>> Here's a concern I have missed before. :-/
>>>
>>> Say, one of the CPUs you're walking here is woken up in the meantime.
>>
>> Yes, that can happen - when we miss-predicted "next wakeup".
>>
>>>
>>> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
>>> to update domain_wakeup. We really should just avoid the domain power off in
>>> that case at all IMO.
>>
>> Correct.
>>
>> However, we also want to avoid locking contentions in the idle path,
>> which is what this boils done to.
>
> This already is done under genpd_lock() AFAICS, so I'm not quite sure
> what exactly you mean.
>
> Besides, this is not just about increased latency, which is a concern
> by itself but maybe not so much in all environments, but also about
> possibility of missing a CPU wakeup, which is a major issue.
>
> If one of the CPUs sharing the domain with the current one is woken up
> during cpu_power_down_ok() and the wakeup is an edge-triggered
> interrupt and the domain is turned off regardless, the wakeup may be
> missed entirely if I'm not mistaken.
>
> It looks like there needs to be a way for the hardware to prevent a
> domain poweroff when there's a pending interrupt or I don't quite see
> how this can be handled correctly.
Well, the job of genpd and its new cpu governor is not directly to
power off the PM domain, but rather to try to select/promote an idle
state for it. Along the lines of what Lorenzo explained in the other
thread.
Then what happens in the genpd backend driver's ->power_off()
callback is platform specific. In other words, it's the job of the
backend driver to understand how its FW works and thus to correctly
deal with the last man standing algorithm.
In regards to the PSCI FW, it handles the race condition you are
referring to in the FW itself (which makes it easier), regardless of
whether it's running in OS-initiated mode or platform-coordinated mode.
>
>>> Sure enough, if the domain power off is already started and one of the CPUs
>>> in the domain is woken up then, too bad, it will suffer the latency (but in
>>> that case the hardware should be able to help somewhat), but otherwise CPU
>>> wakeup should prevent domain power off from being carried out.
>>
>> The CPU is not prevented from waking up, as we rely on the FW to deal with that.
>>
>> Even if the above computation turns out to wrongly suggest that the
>> cluster can be powered off, the FW shall together with the genpd
>> backend driver prevent it.
>
> Fine, but then the solution depends on specific FW/HW behavior, so I'm
> not sure how generic it really is. At least, that expectation should
> be clearly documented somewhere, preferably in code comments.
Alright, let me add some comments somewhere in the code, to explain a
bit about what a genpd backend driver should expect when using the
GENPD_FLAG_CPU_DOMAIN flag.
>
>> To cover this case for PSCI, we also use a per cpu variable for the
>> CPU's power off state, as can be seen later in the series.
>
> Oh great, but the generic part should be independent on the underlying
> implementation of the driver. If it isn't, then it also is not
> generic.
>
>> Hope this clarifies your concern, else tell and will to elaborate a bit more.
>
> Not really.
>
> There also is one more problem and that is the interaction between
> this code and the idle governor.
>
> Namely, the idle governor may select a shallower state for some
> reason, for example due to an additional latency limit derived from
> CPU utilization (like in the menu governor), and how does the code in
> cpu_power_down_ok() know what state has been selected and how does it
> honor the selection made by the idle governor?
This is indeed a valid concern. I must have failed to explain this
during various conferences, but at least I have tried. :-)
Ideally, we need the menu idle governor and genpd's new cpu governor
to share code or exchange information, somehow. I am looking into that
as a next step of improvements, count on it!
The idea at this point was instead to take a simplified approach to
the problem, to at least get some support for cpu cluster idle
management in place, then improve it on top.
This means, for PSCI, we are using the new genpd cpu governor *only*
for the cluster PM domain (master), but not for the genpd subdomains,
each of which contains a single CPU device. So, the subdomains don't
have a genpd governor assigned, but instead rely on the existing menu
idle governor to select an idle state for the CPU. This means that
*most* of the problem disappears, as it's only when the last CPU in the
cluster goes idle that the selection could be "wrong". In the worst
case, genpd will promote an idle state for the cluster PM domain when
it shouldn't.
Moreover, for the QCOM case on 410c, this isn't even a potential
problem, because there is only *one* idle state (besides WFI) for the
menu idle governor to pick for the CPU. Hence, when the genpd cpu
governor runs to pick an idle state, we know that the menu idle
governor has already selected the deepest idle state for each CPU.
Kind regards
Uffe
On 9 August 2018 at 17:39, Lorenzo Pieralisi <[email protected]> wrote:
> On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
>
> [...]
>
>> >>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
>> >>> > return false;
>> >>> > }
>> >>> >
>> >>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
>> >>> > +{
>> >>> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
>> >>> > + ktime_t domain_wakeup, cpu_wakeup;
>> >>> > + s64 idle_duration_ns;
>> >>> > + int cpu, i;
>> >>> > +
>> >>> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
>> >>> > + return true;
>> >>> > +
>> >>> > + /*
>> >>> > + * Find the next wakeup for any of the online CPUs within the PM domain
>> >>> > + * and its subdomains. Note, we only need the genpd->cpus, as it already
>> >>> > + * contains a mask of all CPUs from subdomains.
>> >>> > + */
>> >>> > + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
>> >>> > + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
>> >>> > + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
>> >>> > + if (ktime_before(cpu_wakeup, domain_wakeup))
>> >>> > + domain_wakeup = cpu_wakeup;
>> >>> > + }
>> >>
>> >> Here's a concern I have missed before. :-/
>> >>
>> >> Say, one of the CPUs you're walking here is woken up in the meantime.
>> >
>> > Yes, that can happen - when we miss-predicted "next wakeup".
>> >
>> >>
>> >> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
>> >> to update domain_wakeup. We really should just avoid the domain power off in
>> >> that case at all IMO.
>> >
>> > Correct.
>> >
>> > However, we also want to avoid locking contentions in the idle path,
>> > which is what this boils done to.
>>
>> This already is done under genpd_lock() AFAICS, so I'm not quite sure
>> what exactly you mean.
>>
>> Besides, this is not just about increased latency, which is a concern
>> by itself but maybe not so much in all environments, but also about
>> possibility of missing a CPU wakeup, which is a major issue.
>>
>> If one of the CPUs sharing the domain with the current one is woken up
>> during cpu_power_down_ok() and the wakeup is an edge-triggered
>> interrupt and the domain is turned off regardless, the wakeup may be
>> missed entirely if I'm not mistaken.
>>
>> It looks like there needs to be a way for the hardware to prevent a
>> domain poweroff when there's a pending interrupt or I don't quite see
>> how this can be handled correctly.
>>
>> >> Sure enough, if the domain power off is already started and one of the CPUs
>> >> in the domain is woken up then, too bad, it will suffer the latency (but in
>> >> that case the hardware should be able to help somewhat), but otherwise CPU
>> >> wakeup should prevent domain power off from being carried out.
>> >
>> > The CPU is not prevented from waking up, as we rely on the FW to deal with that.
>> >
>> > Even if the above computation turns out to wrongly suggest that the
>> > cluster can be powered off, the FW shall together with the genpd
>> > backend driver prevent it.
>>
>> Fine, but then the solution depends on specific FW/HW behavior, so I'm
>> not sure how generic it really is. At least, that expectation should
>> be clearly documented somewhere, preferably in code comments.
>>
>> > To cover this case for PSCI, we also use a per cpu variable for the
>> > CPU's power off state, as can be seen later in the series.
>>
>> Oh great, but the generic part should be independent on the underlying
>> implementation of the driver. If it isn't, then it also is not
>> generic.
>>
>> > Hope this clarifies your concern, else tell and will to elaborate a bit more.
>>
>> Not really.
>>
>> There also is one more problem and that is the interaction between
>> this code and the idle governor.
>>
>> Namely, the idle governor may select a shallower state for some
>> reason, for example due to an additional latency limit derived from
>> CPU utilization (like in the menu governor), and how does the code in
>> cpu_power_down_ok() know what state has been selected and how does it
>> honor the selection made by the idle governor?
>
> That's a good question and it maybe gives a path towards a solution.
>
> AFAICS the genPD governor only selects the idle state parameter that
> determines the idle state at, say, GenPD cpumask level it does not touch
> the CPUidle decision, that works on a subset of idle states (at cpu
> level).
>
> That's my understanding, which can be wrong so please correct me
> if that's the case because that's a bit confusing.
>
> Let's imagine that we flattened out the list of idle states and feed
> CPUidle with it (all of them - cpu, cluster, package, system - as it is
> in the mainline _now_). Then the GenPD governor can run-through the
> CPUidle selection and _demote_ the idle state if necessary since it
> understands that some CPUs in the GenPD will wake up shortly and break
> the target residency hyphothesis the CPUidle governor is expecting.
>
> The whole idea about this series is improving CPUidle decision when
> the target idle state is _shared_ among groups of cpus (again, please
> do correct me if I am wrong).
Absolutely, this is one of the main reasons for the series!
>
> It is obvious that a GenPD governor must only demote - never promote a
> CPU idle state selection given that hierarchy implies more power
> savings and higher target residencies required.
Absolutely. I apologize if I have been using the word "promote"
wrongly; I realize it may be a bit confusing.
>
> This whole series would become more generic and won't depend on
> PSCI OSI at all - actually that would become a hierarchical
> CPUidle governor.
Well, to me we need a first user of the new infrastructure code in
genpd, and PSCI is probably the easiest one to start with. An option
would be to start with an old ARM32 platform, but that seems a bit
silly to me.
In regards to OS-initiated mode vs platform-coordinated mode, let's
discuss that in detail in the other email thread instead.
>
> I still think that PSCI firmware and most certainly mwait() play the
> role the GenPD governor does since they can detect in FW/HW whether
> that's worthwhile to switch off a domain, the information is obviously
> there and the kernel would just add latency to the idle path in that
> case but let's gloss over this for the sake of this discussion.
Yep, let's discuss that separately.
That said, can I interpret your comments on the series up until this
change as meaning that you seem rather happy with where the series is going?
Kind regards
Uffe
On Fri, Aug 24, 2018 at 11:26:19AM +0200, Ulf Hansson wrote:
[...]
> > That's a good question and it maybe gives a path towards a solution.
> >
> > AFAICS the genPD governor only selects the idle state parameter that
> > determines the idle state at, say, GenPD cpumask level it does not touch
> > the CPUidle decision, that works on a subset of idle states (at cpu
> > level).
> >
> > That's my understanding, which can be wrong so please correct me
> > if that's the case because that's a bit confusing.
> >
> > Let's imagine that we flattened out the list of idle states and feed
> > CPUidle with it (all of them - cpu, cluster, package, system - as it is
> > in the mainline _now_). Then the GenPD governor can run-through the
> > CPUidle selection and _demote_ the idle state if necessary since it
> > understands that some CPUs in the GenPD will wake up shortly and break
> > the target residency hyphothesis the CPUidle governor is expecting.
> >
> > The whole idea about this series is improving CPUidle decision when
> > the target idle state is _shared_ among groups of cpus (again, please
> > do correct me if I am wrong).
>
> Absolutely, this is one of the main reason for the series!
>
> >
> > It is obvious that a GenPD governor must only demote - never promote a
> > CPU idle state selection given that hierarchy implies more power
> > savings and higher target residencies required.
>
> Absolutely. I apologize if I have been using the word "promote"
> wrongly, I realize it may be a bit confusing.
>
> >
> > This whole series would become more generic and won't depend on
> > PSCI OSI at all - actually that would become a hierarchical
> > CPUidle governor.
>
> Well, to me we need a first user of the new infrastructure code in
> genpd and PSCI is probably the easiest one to start with. An option
> would be to start with an old ARM32 platform, but it seems a bit silly
> to me.
If the code can be structured as described above as a hierarchical
(possibly optional through a Kconfig entry or sysfs tuning) idle
decision you can apply it to _any_ PSCI based platform out there,
provided that the new governor improves power savings.
> In regards to OS-initiated mode vs platform coordinated mode, let's
> discuss that in details in the other email thread instead.
I think it's crystal clear by now that, IMHO, PSCI OS-initiated mode is
a red herring; it has nothing to do with this series. It is there just
because QC firmware does not support PSCI platform-coordinated suspend
mode.
You can apply the concept in this series to _any_ arch provided
the power domains representation is correct (and again, I may sound
like a broken record, but the series must improve power savings over
the vanilla CPUidle menu governor).
> > I still think that PSCI firmware and most certainly mwait() play the
> > role the GenPD governor does since they can detect in FW/HW whether
> > that's worthwhile to switch off a domain, the information is obviously
> > there and the kernel would just add latency to the idle path in that
> > case but let's gloss over this for the sake of this discussion.
>
> Yep, let's discuss that separately.
>
> That said, can I interpret your comments on the series up until this
> change, that you seems rather happy with where the series is going?
It is something we have been discussing with Daniel since generic idle
was merged for Arm a long while back. I have nothing against describing
idle states with power domains but it must improve idle decisions
against the mainline. As I said before, runtime PM can also be used
to get rid of CPU PM notifiers (because with power domains we KNOW
what devices eg PMU are switched off on idle entry, we do not guess
any longer; replacing CPU PM notifiers is challenging and can be
tackled - if required - in a different series).
Bottom line (talk is cheap, I know and apologise about that): this
series (up until this change) adds complexity to the idle path and lots
of code; if its usage is made optional and can be switched on on systems
where it saves power that's fine by me as long as we keep PSCI
OS-initiated idle states out of the equation, that's an orthogonal
discussion as, I hope, I managed to convey.
Thanks,
Lorenzo
Lorenzo, Sudeep, Mark
On 15 August 2018 at 12:44, Lorenzo Pieralisi <[email protected]> wrote:
> On Fri, Aug 10, 2018 at 02:18:15PM -0600, Lina Iyer wrote:
>
> [...]
>
>> >>But, the OSI feature is critical for QCOM mobile platforms. The
>> >>last man activities during cpuidle save quite a lot of power.
>> >
>> >What I expressed above was that, in PSCI based systems (OSI or PC
>> >alike), it is up to firmware/hardware to detect "the last man" not
>> >the kernel.
>> >
>> >I need to understand what you mean by "last man activities" to
>> >provide feedback here.
>> >
>> When the last CPU goes down during deep sleep, the following would be
>> done
>> - Lower resource requirements for shared resources such as clocks,
>> busses and regulators that were used by drivers in AP. These shared
>> resources when not used by other processors in the SoC may be turned
>> off and put in low power state by a remote processor. [1][2]
>> - Enable and setup wakeup capable interrupts on an always-on interrupt
>> controller, so the GIC and the GPIO controllers may be put in low
>> power state. [3][4]
>> - Write next known wakeup value to the timer, so the blocks that were
>> powered off, may be brought back into operational before the wakeup.
>> [4][5]
>>
>> These are commonly done during suspend, but to achieve a good power
>> efficiency, we have to do this when all the CPUs are just executing CPU
>> idle. Also, they cannot be done from the firmware (because the data
>> required for all this is part of Linux). OSI plays a crucial role in
>> determining when to do all this.
>
> No it does not. It is the power domain cpumasks that allow this code to
> make an educated guess on the last cpu running (the kernel), PSCI OSI is
> not crucial at all (it is crucial in QC platforms because that's the
> only mode supported but that's not a reason I accept as valid since it
> does not comply with the PSCI specifications).
We can keep arguing about this back and forth, but it seems to lead
nowhere. As a matter of fact, I am also surprised that this kind of
discussion pops up again. I thought we had sorted this out,
especially since we have also met face to face, discussing this in
detail, several times by now. Well, well, let's try again. :-)
First, in regards to complying with the PSCI spec, to me that sounds
like nonsense, sorry! Is the spec stating that the PSCI FW needs to
support all the idle states in PC mode when the optional OSI mode is
also supported? To me, it looks like the QCOM PSCI FW supports PC
mode, but in that mode only a subset of the idle states can be
reached, so that should be fine, no?
Moving forward, I am wondering if a more detailed technical
description, comparing the benefits of OSI mode vs the benefits of
PC mode, could help? Or is it just a waste of everybody's time, as
you all already know this? Anyway, I am willing to try; just tell me
and I will provide you with the best details I can give about why OSI
is better suited for these kinds of QCOM SoCs. I trust Lina to help
fill in, if/when needed.
Why? Simply because I doubt we will ever see the QCOM FW for the
battery-driven embedded devices support all idle states in PC mode, so
doing a comparison on, for example, the 410c platform just doesn't
seem possible, sorry!
I also have another, quite important, concern. That is, ARM decided to
put the OSI mode into the PSCI spec; I assume there were reasons for
it. Then, when the ARM community wants to implement support for OSI
mode, you are now requiring us to prove the justification for it in
the spec. To me, that is, nicely stated, weird. :-) But it also
worries me; ARM vendors observe this behavior.
That said, in the end we are discussing a quite limited amount of
code to support PSCI OSI (some of which may not even be considered
OSI specific). It's ~200 lines of code, where most of it lives in a
separate new C file (psci_pm_domain.c). Additionally, existing
PC-mode-only platforms should still work as before, without
drawbacks.
Really, why are we arguing about this at all?
>
> As I mentioned in another thread[1] the generic part of this
> series may be applicable in a platform agnostic way to the
> CPUidle framework, whether that's beneficial it has to be proven
> and it is benchmark specific anyway.
I don't think this can be made fully platform agnostic. Or maybe you
are suggesting another helper layer above the new genpd
infrastructure?
Anyway, my point is that the genpd backend driver requires knowledge
about the FW and the last man standing algorithm, hence a
platform-agnostic backend doesn't sound feasible to me.
>
> Lorenzo
>
> [1]: https://marc.info/?l=linux-pm&m=153382916513032&w=2
Kind regards
Uffe
On 24 August 2018 at 12:38, Lorenzo Pieralisi <[email protected]> wrote:
> On Fri, Aug 24, 2018 at 11:26:19AM +0200, Ulf Hansson wrote:
>
> [...]
>
>> > That's a good question and it maybe gives a path towards a solution.
>> >
>> > AFAICS the genPD governor only selects the idle state parameter that
>> > determines the idle state at, say, GenPD cpumask level it does not touch
>> > the CPUidle decision, that works on a subset of idle states (at cpu
>> > level).
>> >
>> > That's my understanding, which can be wrong so please correct me
>> > if that's the case because that's a bit confusing.
>> >
>> > Let's imagine that we flattened out the list of idle states and feed
>> > CPUidle with it (all of them - cpu, cluster, package, system - as it is
>> > in the mainline _now_). Then the GenPD governor can run-through the
>> > CPUidle selection and _demote_ the idle state if necessary since it
>> > understands that some CPUs in the GenPD will wake up shortly and break
>> > the target residency hyphothesis the CPUidle governor is expecting.
>> >
>> > The whole idea about this series is improving CPUidle decision when
>> > the target idle state is _shared_ among groups of cpus (again, please
>> > do correct me if I am wrong).
>>
>> Absolutely, this is one of the main reason for the series!
>>
>> >
>> > It is obvious that a GenPD governor must only demote - never promote a
>> > CPU idle state selection given that hierarchy implies more power
>> > savings and higher target residencies required.
>>
>> Absolutely. I apologize if I have been using the word "promote"
>> wrongly, I realize it may be a bit confusing.
>>
>> >
>> > This whole series would become more generic and won't depend on
>> > PSCI OSI at all - actually that would become a hierarchical
>> > CPUidle governor.
>>
>> Well, to me we need a first user of the new infrastructure code in
>> genpd and PSCI is probably the easiest one to start with. An option
>> would be to start with an old ARM32 platform, but it seems a bit silly
>> to me.
>
> If the code can be structured as described above as a hierarchical
> (possibly optional through a Kconfig entry or sysfs tuning) idle
> decision you can apply it to _any_ PSCI based platform out there,
> provided that the new governor improves power savings.
>
>> In regards to OS-initiated mode vs platform coordinated mode, let's
>> discuss that in details in the other email thread instead.
>
> I think that's crystal clear by now that IMHO PSCI OS-initiated mode is
> a red-herring, it has nothing to do with this series, it is there just
> because QC firmware does not support PSCI platform coordinated suspend
> mode.
I fully agree that the series isn't specific to PSCI OSI mode. On the
other hand, PSCI OSI mode is where I see this series fitting
naturally, and in particular for the QCOM 410c board.
When it comes to the PSCI PC mode, it may under certain circumstances
be useful to deploy this approach for that as well, and I agree that
it seems reasonable to have that configurable as opt-in, somehow.
Although, let's discuss that separately, in a next step. Or at least
let's try to keep PSCI related technical discussions to the other
thread, as that makes it easier to follow.
>
> You can apply the concept in this series to _any_ arch provided
> the power domains representation is correct (and again, I would sound
> like a broken record but the series must improve power savings over
> vanilla CPUidle menu governor).
I agree, but let me elaborate a bit, to hopefully add some clarity
which I may not have been able to communicate earlier.
The goal with the series is to enable platforms to support all of
their available idle states, including those shared among a group of
CPUs. This is the case for QCOM 410c, for example.
To my knowledge, we have other ARM32 based platforms that currently
have disabled some of their cluster idle states. That's because they
can't know when it's safe to power off the cluster "coherency domain"
in cases where the platform also has other shared resources in it.
The point is, to see improved power savings, additional platform
deployment may be needed, and that just takes time. For example,
runtime PM support is needed in those drivers that deal with the
"shared resources", a correctly modeled PM domain topology using
genpd is needed, etc.
>
>> > I still think that PSCI firmware and most certainly mwait() play the
>> > role the GenPD governor does since they can detect in FW/HW whether
>> > that's worthwhile to switch off a domain, the information is obviously
>> > there and the kernel would just add latency to the idle path in that
>> > case but let's gloss over this for the sake of this discussion.
>>
>> Yep, let's discuss that separately.
>>
>> That said, can I interpret your comments on the series up until this
>> change, that you seems rather happy with where the series is going?
>
> It is something we have been discussing with Daniel since generic idle
> was merged for Arm a long while back. I have nothing against describing
> idle states with power domains but it must improve idle decisions
> against the mainline. As I said before, runtime PM can also be used
> to get rid of CPU PM notifiers (because with power domains we KNOW
> what devices eg PMU are switched off on idle entry, we do not guess
> any longer; replacing CPU PM notifiers is challenging and can be
> tackled - if required - in a different series).
Yes, we have been talking about the CPU PM and CPU_CLUSTER_PM
notifiers and I fully agree. It's something that we should look into
in future steps.
>
> Bottom line (talk is cheap, I know and apologise about that): this
> series (up until this change) adds complexity to the idle path and lots
> of code; if its usage is made optional and can be switched on on systems
> where it saves power that's fine by me as long as we keep PSCI
> OS-initiated idle states out of the equation, that's an orthogonal
> discussion as, I hope, I managed to convey.
>
> Thanks,
> Lorenzo
Lorenzo, thanks for your feedback!
Please, when you have time, could you also reply to the other thread
we started? I would like to understand how I should proceed with this
series.
Kind regards
Uffe
On Thu, Aug 30, 2018 at 03:36:02PM +0200, Ulf Hansson wrote:
> On 24 August 2018 at 12:38, Lorenzo Pieralisi <[email protected]> wrote:
> > On Fri, Aug 24, 2018 at 11:26:19AM +0200, Ulf Hansson wrote:
> >
> > [...]
> >
> >> > That's a good question and it maybe gives a path towards a solution.
> >> >
> >> > AFAICS the genPD governor only selects the idle state parameter that
> >> > determines the idle state at, say, GenPD cpumask level it does not touch
> >> > the CPUidle decision, that works on a subset of idle states (at cpu
> >> > level).
> >> >
> >> > That's my understanding, which can be wrong so please correct me
> >> > if that's the case because that's a bit confusing.
> >> >
> >> > Let's imagine that we flattened out the list of idle states and feed
> >> > CPUidle with it (all of them - cpu, cluster, package, system - as it is
> >> > in the mainline _now_). Then the GenPD governor can run-through the
> >> > CPUidle selection and _demote_ the idle state if necessary since it
> >> > understands that some CPUs in the GenPD will wake up shortly and break
> >> > the target residency hyphothesis the CPUidle governor is expecting.
> >> >
> >> > The whole idea about this series is improving CPUidle decision when
> >> > the target idle state is _shared_ among groups of cpus (again, please
> >> > do correct me if I am wrong).
> >>
> >> Absolutely, this is one of the main reason for the series!
> >>
> >> >
> >> > It is obvious that a GenPD governor must only demote - never promote a
> >> > CPU idle state selection given that hierarchy implies more power
> >> > savings and higher target residencies required.
> >>
> >> Absolutely. I apologize if I have been using the word "promote"
> >> wrongly, I realize it may be a bit confusing.
> >>
> >> >
> >> > This whole series would become more generic and won't depend on
> >> > PSCI OSI at all - actually that would become a hierarchical
> >> > CPUidle governor.
> >>
> >> Well, to me we need a first user of the new infrastructure code in
> >> genpd and PSCI is probably the easiest one to start with. An option
> >> would be to start with an old ARM32 platform, but it seems a bit silly
> >> to me.
> >
> > If the code can be structured as described above as a hierarchical
> > (possibly optional through a Kconfig entry or sysfs tuning) idle
> > decision you can apply it to _any_ PSCI based platform out there,
> > provided that the new governor improves power savings.
> >
> >> In regards to OS-initiated mode vs platform coordinated mode, let's
> >> discuss that in details in the other email thread instead.
> >
> > I think that's crystal clear by now that IMHO PSCI OS-initiated mode is
> > a red-herring, it has nothing to do with this series, it is there just
> > because QC firmware does not support PSCI platform coordinated suspend
> > mode.
>
> I fully agree that the series isn't specific to PSCI OSI mode. On the
> other hand, for PSCI OSI mode, that's where I see this series to fit
> naturally. And in particular for the QCOM 410c board.
>
> When it comes to the PSCI PC mode, it may under certain circumstances
> be useful to deploy this approach for that as well, and I agree that
> it seems reasonable to have that configurable as opt-in, somehow.
>
> Although, let's discuss that separately, in a next step. Or at least
> let's try to keep PSCI related technical discussions to the other
> thread, as that makes it easier to follow.
>
> >
> > You can apply the concept in this series to _any_ arch provided
> > the power domains representation is correct (and again, I would sound
> > like a broken record but the series must improve power savings over
> > vanilla CPUidle menu governor).
>
> I agree, but let me elaborate a bit, to hopefully add some clarity,
> which I may not have been able to communicate earlier.
>
> The goal with the series is to enable platforms to support all its
> available idle states, which are shared among a group of CPUs. This is
> the case for QCOM 410c, for example.
>
> To my knowledge, we have other ARM32 based platforms that currently
> have disabled some of their cluster idle states. That's because they
> can't know when it's safe to power off the cluster "coherency domain",
> in cases when the platform also has other shared resources in it.
>
> The point is, to see improved power savings, additional platform
> deployment may be needed and that just takes time. For example runtime
> PM support is needed in those drivers that deal with the "shared
> resources", a correctly modeled PM domain topology using genpd, etc,
> etc.
>
> >
> >> > I still think that PSCI firmware and most certainly mwait() play the
> >> > role the GenPD governor does since they can detect in FW/HW whether
> >> > that's worthwhile to switch off a domain, the information is obviously
> >> > there and the kernel would just add latency to the idle path in that
> >> > case but let's gloss over this for the sake of this discussion.
> >>
> >> Yep, let's discuss that separately.
> >>
> >> That said, can I interpret your comments on the series up until this
> >> change, that you seems rather happy with where the series is going?
> >
> > It is something we have been discussing with Daniel since generic idle
> > was merged for Arm a long while back. I have nothing against describing
> > idle states with power domains but it must improve idle decisions
> > against the mainline. As I said before, runtime PM can also be used
> > to get rid of CPU PM notifiers (because with power domains we KNOW
> > what devices eg PMU are switched off on idle entry, we do not guess
> > any longer; replacing CPU PM notifiers is challenging and can be
> > tackled - if required - in a different series).
>
> Yes, we have been talking about the CPU PM and CPU_CLUSTER_PM notifiers
> and I fully agree. It's something that we should look into in future
> steps.
>
> >
> > Bottom line (talk is cheap, I know and apologise about that): this
> > series (up until this change) adds complexity to the idle path and lots
> > of code; if its usage is made optional and can be switched on on systems
> > where it saves power that's fine by me as long as we keep PSCI
> > OS-initiated idle states out of the equation, that's an orthogonal
> > discussion as, I hope, I managed to convey.
> >
> > Thanks,
> > Lorenzo
>
> Lorenzo, thanks for your feedback!
>
> Please, when you have time, could you also reply to the other thread
> we started? I would like to understand how I should proceed with this
> series.
OK, thanks, I will, sorry for the delay in responding.
Lorenzo
On Friday, August 24, 2018 8:47:21 AM CEST Ulf Hansson wrote:
> On 6 August 2018 at 11:36, Rafael J. Wysocki <[email protected]> wrote:
> > On Fri, Aug 3, 2018 at 1:43 PM, Ulf Hansson <[email protected]> wrote:
> >> On 19 July 2018 at 12:25, Rafael J. Wysocki <[email protected]> wrote:
> >>> On Wednesday, June 20, 2018 7:22:04 PM CEST Ulf Hansson wrote:
> >>>> To enable a device belonging to a CPU to be attached to a PM domain managed
> >>>> by genpd, let's do a few changes to genpd as to make it convenient to
> >>>> manage the specifics around CPUs.
> >>>>
> >>>> First, to be able to quickly find out what CPUs are attached to a
> >>>> genpd, which typically becomes useful from a genpd governor as following
> >>>> changes are about to show, let's add a cpumask 'cpus' to the struct
> >>>> generic_pm_domain.
> >>>>
> >>>> At the point when a device that belongs to a CPU is attached/detached to
> >>>> its corresponding PM domain via genpd_add_device(), let's update the
> >>>> cpumask in genpd->cpus. Moreover, propagate the update of the cpumask to
> >>>> the master domains, which makes genpd->cpus contain a cpumask that
> >>>> hierarchically reflects all CPUs for a genpd, including CPUs attached to
> >>>> subdomains.
> >>>>
> >>>> Second, unconditionally managing CPUs and the cpumask in genpd->cpus is
> >>>> unnecessary for cases when only non-CPU devices are part of a genpd.
> >>>> Let's avoid this by adding a new configuration bit, GENPD_FLAG_CPU_DOMAIN.
> >>>> Clients must set the bit before they call pm_genpd_init(), to instruct
> >>>> genpd that it shall deal with CPUs and thus manage the cpumask in
> >>>> genpd->cpus.
> >>>>
> >>>> Cc: Lina Iyer <[email protected]>
> >>>> Co-developed-by: Lina Iyer <[email protected]>
> >>>> Signed-off-by: Ulf Hansson <[email protected]>
> >>>> ---
> >>>> drivers/base/power/domain.c | 69 ++++++++++++++++++++++++++++++++++++-
> >>>> include/linux/pm_domain.h | 3 ++
> >>>> 2 files changed, 71 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> >>>> index 21d298e1820b..6149ce0bfa7b 100644
> >>>> --- a/drivers/base/power/domain.c
> >>>> +++ b/drivers/base/power/domain.c
> >>>> @@ -20,6 +20,7 @@
> >>>> #include <linux/sched.h>
> >>>> #include <linux/suspend.h>
> >>>> #include <linux/export.h>
> >>>> +#include <linux/cpu.h>
> >>>>
> >>>> #include "power.h"
> >>>>
> >>>> @@ -126,6 +127,7 @@ static const struct genpd_lock_ops genpd_spin_ops = {
> >>>> #define genpd_is_irq_safe(genpd) (genpd->flags & GENPD_FLAG_IRQ_SAFE)
> >>>> #define genpd_is_always_on(genpd) (genpd->flags & GENPD_FLAG_ALWAYS_ON)
> >>>> #define genpd_is_active_wakeup(genpd) (genpd->flags & GENPD_FLAG_ACTIVE_WAKEUP)
> >>>> +#define genpd_is_cpu_domain(genpd) (genpd->flags & GENPD_FLAG_CPU_DOMAIN)
> >>>>
> >>>> static inline bool irq_safe_dev_in_no_sleep_domain(struct device *dev,
> >>>> const struct generic_pm_domain *genpd)
> >>>> @@ -1377,6 +1379,62 @@ static void genpd_free_dev_data(struct device *dev,
> >>>> dev_pm_put_subsys_data(dev);
> >>>> }
> >>>>
> >>>> +static void __genpd_update_cpumask(struct generic_pm_domain *genpd,
> >>>> + int cpu, bool set, unsigned int depth)
> >>>> +{
> >>>> + struct gpd_link *link;
> >>>> +
> >>>> + if (!genpd_is_cpu_domain(genpd))
> >>>> + return;
> >>>> +
> >>>> + list_for_each_entry(link, &genpd->slave_links, slave_node) {
> >>>> + struct generic_pm_domain *master = link->master;
> >>>> +
> >>>> + genpd_lock_nested(master, depth + 1);
> >>>> + __genpd_update_cpumask(master, cpu, set, depth + 1);
> >>>> + genpd_unlock(master);
> >>>> + }
> >>>> +
> >>>> + if (set)
> >>>> + cpumask_set_cpu(cpu, genpd->cpus);
> >>>> + else
> >>>> + cpumask_clear_cpu(cpu, genpd->cpus);
> >>>> +}
> >>>
> >>> As noted elsewhere, there is a concern about the possible weight of this
> >>> cpumask and I think that it would be good to explicitly put a limit on it.
> >>
> >> I have been digesting your comments on the series, but wonder if this
> >> is still a relevant concern?
> >
> > Well, there are systems with very large cpumasks and it is sort of
> > good to have that in mind when designing any code using them.
>
> Right.
>
> So, if I avoid allocating the cpumask for those genpd structures that
> don't need it (those not having GENPD_FLAG_CPU_DOMAIN set), would
> that be sufficient to deal with your concern?
Yes, it would, if I understand you correctly.
Thanks,
Rafael
On Thursday, August 9, 2018 5:39:25 PM CEST Lorenzo Pieralisi wrote:
> On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
>
> [...]
>
> > >>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> > >>> > return false;
> > >>> > }
> > >>> >
> > >>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > >>> > +{
> > >>> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > >>> > + ktime_t domain_wakeup, cpu_wakeup;
> > >>> > + s64 idle_duration_ns;
> > >>> > + int cpu, i;
> > >>> > +
> > >>> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> > >>> > + return true;
> > >>> > +
> > >>> > + /*
> > >>> > + * Find the next wakeup for any of the online CPUs within the PM domain
> > >>> > + * and its subdomains. Note, we only need the genpd->cpus, as it already
> > >>> > + * contains a mask of all CPUs from subdomains.
> > >>> > + */
> > >>> > + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> > >>> > + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> > >>> > + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> > >>> > + if (ktime_before(cpu_wakeup, domain_wakeup))
> > >>> > + domain_wakeup = cpu_wakeup;
> > >>> > + }
> > >>
> > >> Here's a concern I have missed before. :-/
> > >>
> > >> Say, one of the CPUs you're walking here is woken up in the meantime.
> > >
> > > Yes, that can happen - when we mispredicted "next wakeup".
> > >
> > >>
> > >> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> > >> to update domain_wakeup. We really should just avoid the domain power off in
> > >> that case at all IMO.
> > >
> > > Correct.
> > >
> > > However, we also want to avoid locking contention in the idle path,
> > > which is what this boils down to.
> >
> > This already is done under genpd_lock() AFAICS, so I'm not quite sure
> > what exactly you mean.
> >
> > Besides, this is not just about increased latency, which is a concern
> > by itself but maybe not so much in all environments, but also about
> > possibility of missing a CPU wakeup, which is a major issue.
> >
> > If one of the CPUs sharing the domain with the current one is woken up
> > during cpu_power_down_ok() and the wakeup is an edge-triggered
> > interrupt and the domain is turned off regardless, the wakeup may be
> > missed entirely if I'm not mistaken.
> >
> > It looks like there needs to be a way for the hardware to prevent a
> > domain poweroff when there's a pending interrupt or I don't quite see
> > how this can be handled correctly.
> >
> > >> Sure enough, if the domain power off is already started and one of the CPUs
> > >> in the domain is woken up then, too bad, it will suffer the latency (but in
> > >> that case the hardware should be able to help somewhat), but otherwise CPU
> > >> wakeup should prevent domain power off from being carried out.
> > >
> > > The CPU is not prevented from waking up, as we rely on the FW to deal with that.
> > >
> > > Even if the above computation turns out to wrongly suggest that the
> > > cluster can be powered off, the FW shall together with the genpd
> > > backend driver prevent it.
> >
> > Fine, but then the solution depends on specific FW/HW behavior, so I'm
> > not sure how generic it really is. At least, that expectation should
> > be clearly documented somewhere, preferably in code comments.
> >
> > > To cover this case for PSCI, we also use a per cpu variable for the
> > > CPU's power off state, as can be seen later in the series.
> >
> > Oh great, but the generic part should be independent of the underlying
> > implementation of the driver. If it isn't, then it also is not
> > generic.
> >
> > > Hope this clarifies your concern, else tell me and I will elaborate a bit more.
> >
> > Not really.
> >
> > There also is one more problem and that is the interaction between
> > this code and the idle governor.
> >
> > Namely, the idle governor may select a shallower state for some
> > reason, for example due to an additional latency limit derived from
> > CPU utilization (like in the menu governor), and how does the code in
> > cpu_power_down_ok() know what state has been selected and how does it
> > honor the selection made by the idle governor?
>
> That's a good question and it maybe gives a path towards a solution.
>
> AFAICS the genPD governor only selects the idle state parameter that
> determines the idle state at, say, GenPD cpumask level it does not touch
> the CPUidle decision, that works on a subset of idle states (at cpu
> level).
I've deferred responding to this as I wasn't quite sure if I followed you
at that time, but I'm afraid I'm still not following you now. :-)
The idle governor has to take the total worst-case wakeup latency into
account. Not just from the logical CPU itself, but also from whatever
state the SoC may end up in as a result of this particular logical CPU
going idle, this way or another.
So for example, if your logical CPU has an idle state A that may trigger an
idle state X at the cluster level (if the other logical CPUs happen to be in
the right states and so on), then the worst-case exit latency for that
is the one of state X.
> That's my understanding, which can be wrong so please correct me
> if that's the case because that's a bit confusing.
>
> Let's imagine that we flattened out the list of idle states and feed
> CPUidle with it (all of them - cpu, cluster, package, system - as it is
> in the mainline _now_). Then the GenPD governor can run-through the
> CPUidle selection and _demote_ the idle state if necessary since it
> understands that some CPUs in the GenPD will wake up shortly and break
> the target residency hypothesis the CPUidle governor is expecting.
>
> The whole idea about this series is improving CPUidle decision when
> the target idle state is _shared_ among groups of cpus (again, please
> do correct me if I am wrong).
>
> It is obvious that a GenPD governor must only demote - never promote a
> CPU idle state selection given that hierarchy implies more power
> savings and higher target residencies required.
So I see a problem here, because the way patch 9 in this series is done,
the genpd governor for CPUs has no idea what states have been selected by
the idle governor, so how does it know how deep it can go with turning
off domains?
My point is that the selection made by the idle governor need not be
based only on timers which is the only thing that the genpd governor
seems to be looking at. The genpd governor should rather look at what
idle states have been selected for each CPU in the domain by the idle
governor and work within the boundaries of those.
Thanks,
Rafael
On Fri, Sep 14, 2018 at 11:50:15AM +0200, Rafael J. Wysocki wrote:
> On Thursday, August 9, 2018 5:39:25 PM CEST Lorenzo Pieralisi wrote:
> > On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
> >
> > [...]
> >
> > > >>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> > > >>> > return false;
> > > >>> > }
> > > >>> >
> > > >>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > >>> > +{
> > > >>> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > > >>> > + ktime_t domain_wakeup, cpu_wakeup;
> > > >>> > + s64 idle_duration_ns;
> > > >>> > + int cpu, i;
> > > >>> > +
> > > >>> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> > > >>> > + return true;
> > > >>> > +
> > > >>> > + /*
> > > >>> > + * Find the next wakeup for any of the online CPUs within the PM domain
> > > >>> > + * and its subdomains. Note, we only need the genpd->cpus, as it already
> > > >>> > + * contains a mask of all CPUs from subdomains.
> > > >>> > + */
> > > >>> > + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> > > >>> > + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> > > >>> > + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> > > >>> > + if (ktime_before(cpu_wakeup, domain_wakeup))
> > > >>> > + domain_wakeup = cpu_wakeup;
> > > >>> > + }
> > > >>
> > > >> Here's a concern I have missed before. :-/
> > > >>
> > > >> Say, one of the CPUs you're walking here is woken up in the meantime.
> > > >
> > > > Yes, that can happen - when we mispredicted "next wakeup".
> > > >
> > > >>
> > > >> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> > > >> to update domain_wakeup. We really should just avoid the domain power off in
> > > >> that case at all IMO.
> > > >
> > > > Correct.
> > > >
> > > > However, we also want to avoid locking contention in the idle path,
> > > > which is what this boils down to.
> > >
> > > This already is done under genpd_lock() AFAICS, so I'm not quite sure
> > > what exactly you mean.
> > >
> > > Besides, this is not just about increased latency, which is a concern
> > > by itself but maybe not so much in all environments, but also about
> > > possibility of missing a CPU wakeup, which is a major issue.
> > >
> > > If one of the CPUs sharing the domain with the current one is woken up
> > > during cpu_power_down_ok() and the wakeup is an edge-triggered
> > > interrupt and the domain is turned off regardless, the wakeup may be
> > > missed entirely if I'm not mistaken.
> > >
> > > It looks like there needs to be a way for the hardware to prevent a
> > > domain poweroff when there's a pending interrupt or I don't quite see
> > > how this can be handled correctly.
> > >
> > > >> Sure enough, if the domain power off is already started and one of the CPUs
> > > >> in the domain is woken up then, too bad, it will suffer the latency (but in
> > > >> that case the hardware should be able to help somewhat), but otherwise CPU
> > > >> wakeup should prevent domain power off from being carried out.
> > > >
> > > > The CPU is not prevented from waking up, as we rely on the FW to deal with that.
> > > >
> > > > Even if the above computation turns out to wrongly suggest that the
> > > > cluster can be powered off, the FW shall together with the genpd
> > > > backend driver prevent it.
> > >
> > > Fine, but then the solution depends on specific FW/HW behavior, so I'm
> > > not sure how generic it really is. At least, that expectation should
> > > be clearly documented somewhere, preferably in code comments.
> > >
> > > > To cover this case for PSCI, we also use a per cpu variable for the
> > > > CPU's power off state, as can be seen later in the series.
> > >
> > > Oh great, but the generic part should be independent of the underlying
> > > implementation of the driver. If it isn't, then it also is not
> > > generic.
> > >
> > > > Hope this clarifies your concern, else tell me and I will elaborate a bit more.
> > >
> > > Not really.
> > >
> > > There also is one more problem and that is the interaction between
> > > this code and the idle governor.
> > >
> > > Namely, the idle governor may select a shallower state for some
> > > reason, for example due to an additional latency limit derived from
> > > CPU utilization (like in the menu governor), and how does the code in
> > > cpu_power_down_ok() know what state has been selected and how does it
> > > honor the selection made by the idle governor?
> >
> > That's a good question and it maybe gives a path towards a solution.
> >
> > AFAICS the genPD governor only selects the idle state parameter that
> > determines the idle state at, say, GenPD cpumask level it does not touch
> > the CPUidle decision, that works on a subset of idle states (at cpu
> > level).
>
> I've deferred responding to this as I wasn't quite sure if I followed you
> at that time, but I'm afraid I'm still not following you now. :-)
>
> The idle governor has to take the total worst-case wakeup latency into
> account. Not just from the logical CPU itself, but also from whatever
> state the SoC may end up in as a result of this particular logical CPU
> going idle, this way or another.
>
> So for example, if your logical CPU has an idle state A that may trigger an
> idle state X at the cluster level (if the other logical CPUs happen to be in
> the right states and so on), then the worst-case exit latency for that
> is the one of state X.
I will provide an example:
IDLE STATE A (affects CPU {0,1}): exit latency 1ms, min-residency 1.5ms
CPU 0 is about to enter IDLE state A since its "next-event" fulfills the
residency requirements and exit latency constraints.
CPU 1 is in idle state A (given that CPU 0 is ON, some of the common
logic shared between CPU {0,1} is still ON, but, as soon as CPU 0
enters idle state A CPU {0,1} can enter the "full" idle state A
power savings mode).
The current CPUidle governor does not check the "next-event" for CPU 1,
which may wake up in, say, 10us.
Requesting IDLE STATE A is a waste of power (unless firmware or
hardware peeks at CPU 1's next-event and actually demotes CPU 0's
request).
The current flat list of idle states has no notion of CPUs sharing
an idle state request and that's where I think this series kicks in
and that's the reason I say that the genPD governor can only demote
an idle state request.
Linking power domains to idle states is the only sensible way I see
to define what logical cpus are affected by an idle state entry, this
information is missing in the current kernel (whether that's worthwhile
adding it that's another question).
> > That's my understanding, which can be wrong so please correct me
> > if that's the case because that's a bit confusing.
> >
> > Let's imagine that we flattened out the list of idle states and feed
> > CPUidle with it (all of them - cpu, cluster, package, system - as it is
> > in the mainline _now_). Then the GenPD governor can run-through the
> > CPUidle selection and _demote_ the idle state if necessary since it
> > understands that some CPUs in the GenPD will wake up shortly and break
> > the target residency hypothesis the CPUidle governor is expecting.
> >
> > The whole idea about this series is improving CPUidle decision when
> > the target idle state is _shared_ among groups of cpus (again, please
> > do correct me if I am wrong).
> >
> > It is obvious that a GenPD governor must only demote - never promote a
> > CPU idle state selection given that hierarchy implies more power
> > savings and higher target residencies required.
>
> So I see a problem here, because the way patch 9 in this series is done,
> the genpd governor for CPUs has no idea what states have been selected by
> the idle governor, so how does it know how deep it can go with turning
> off domains?
>
> My point is that the selection made by the idle governor need not be
> based only on timers which is the only thing that the genpd governor
> seems to be looking at. The genpd governor should rather look at what
> idle states have been selected for each CPU in the domain by the idle
> governor and work within the boundaries of those.
That's agreed.
Lorenzo
On Fri, Sep 14, 2018 at 12:44 PM Lorenzo Pieralisi
<[email protected]> wrote:
>
> On Fri, Sep 14, 2018 at 11:50:15AM +0200, Rafael J. Wysocki wrote:
> > On Thursday, August 9, 2018 5:39:25 PM CEST Lorenzo Pieralisi wrote:
> > > On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
> > >
> > > [...]
> > >
> > > > >>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> > > > >>> > return false;
> > > > >>> > }
> > > > >>> >
> > > > >>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > > >>> > +{
> > > > >>> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > > > >>> > + ktime_t domain_wakeup, cpu_wakeup;
> > > > >>> > + s64 idle_duration_ns;
> > > > >>> > + int cpu, i;
> > > > >>> > +
> > > > >>> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> > > > >>> > + return true;
> > > > >>> > +
> > > > >>> > + /*
> > > > >>> > + * Find the next wakeup for any of the online CPUs within the PM domain
> > > > >>> > + * and its subdomains. Note, we only need the genpd->cpus, as it already
> > > > >>> > + * contains a mask of all CPUs from subdomains.
> > > > >>> > + */
> > > > >>> > + domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> > > > >>> > + for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> > > > >>> > + cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> > > > >>> > + if (ktime_before(cpu_wakeup, domain_wakeup))
> > > > >>> > + domain_wakeup = cpu_wakeup;
> > > > >>> > + }
> > > > >>
> > > > >> Here's a concern I have missed before. :-/
> > > > >>
> > > > >> Say, one of the CPUs you're walking here is woken up in the meantime.
> > > > >
> > > > > Yes, that can happen - when we mispredicted "next wakeup".
> > > > >
> > > > >>
> > > > >> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> > > > >> to update domain_wakeup. We really should just avoid the domain power off in
> > > > >> that case at all IMO.
> > > > >
> > > > > Correct.
> > > > >
> > > > > However, we also want to avoid locking contention in the idle path,
> > > > > which is what this boils down to.
> > > >
> > > > This already is done under genpd_lock() AFAICS, so I'm not quite sure
> > > > what exactly you mean.
> > > >
> > > > Besides, this is not just about increased latency, which is a concern
> > > > by itself but maybe not so much in all environments, but also about
> > > > possibility of missing a CPU wakeup, which is a major issue.
> > > >
> > > > If one of the CPUs sharing the domain with the current one is woken up
> > > > during cpu_power_down_ok() and the wakeup is an edge-triggered
> > > > interrupt and the domain is turned off regardless, the wakeup may be
> > > > missed entirely if I'm not mistaken.
> > > >
> > > > It looks like there needs to be a way for the hardware to prevent a
> > > > domain poweroff when there's a pending interrupt or I don't quite see
> > > > how this can be handled correctly.
> > > >
> > > > >> Sure enough, if the domain power off is already started and one of the CPUs
> > > > >> in the domain is woken up then, too bad, it will suffer the latency (but in
> > > > >> that case the hardware should be able to help somewhat), but otherwise CPU
> > > > >> wakeup should prevent domain power off from being carried out.
> > > > >
> > > > > The CPU is not prevented from waking up, as we rely on the FW to deal with that.
> > > > >
> > > > > Even if the above computation turns out to wrongly suggest that the
> > > > > cluster can be powered off, the FW shall together with the genpd
> > > > > backend driver prevent it.
> > > >
> > > > Fine, but then the solution depends on specific FW/HW behavior, so I'm
> > > > not sure how generic it really is. At least, that expectation should
> > > > be clearly documented somewhere, preferably in code comments.
> > > >
> > > > > To cover this case for PSCI, we also use a per cpu variable for the
> > > > > CPU's power off state, as can be seen later in the series.
> > > >
> > > > Oh great, but the generic part should be independent of the underlying
> > > > implementation of the driver. If it isn't, then it also is not
> > > > generic.
> > > >
> > > > > Hope this clarifies your concern, else tell me and I will elaborate a bit more.
> > > >
> > > > Not really.
> > > >
> > > > There also is one more problem and that is the interaction between
> > > > this code and the idle governor.
> > > >
> > > > Namely, the idle governor may select a shallower state for some
> > > > reason, for example due to an additional latency limit derived from
> > > > CPU utilization (like in the menu governor), and how does the code in
> > > > cpu_power_down_ok() know what state has been selected and how does it
> > > > honor the selection made by the idle governor?
> > >
> > > That's a good question and it maybe gives a path towards a solution.
> > >
> > > AFAICS the genPD governor only selects the idle state parameter that
> > > determines the idle state at, say, GenPD cpumask level it does not touch
> > > the CPUidle decision, that works on a subset of idle states (at cpu
> > > level).
> >
> > I've deferred responding to this as I wasn't quite sure if I followed you
> > at that time, but I'm afraid I'm still not following you now. :-)
> >
> > The idle governor has to take the total worst-case wakeup latency into
> > account. Not just from the logical CPU itself, but also from whatever
> > state the SoC may end up in as a result of this particular logical CPU
> > going idle, this way or another.
> >
> > So for example, if your logical CPU has an idle state A that may trigger an
> > idle state X at the cluster level (if the other logical CPUs happen to be in
> > the right states and so on), then the worst-case exit latency for that
> > is the one of state X.
>
> I will provide an example:
>
> IDLE STATE A (affects CPU {0,1}): exit latency 1ms, min-residency 1.5ms
>
> CPU 0 is about to enter IDLE state A since its "next-event" fulfills the
> residency requirements and exit latency constraints.
>
> CPU 1 is in idle state A (given that CPU 0 is ON, some of the common
> logic shared between CPU {0,1} is still ON, but, as soon as CPU 0
> enters idle state A CPU {0,1} can enter the "full" idle state A
> power savings mode).
>
> The current CPUidle governor does not check the "next-event" for CPU 1,
> which may wake up in, say, 10us.
Right.
> Requesting IDLE STATE A is a waste of power (unless firmware or
> hardware peeks at CPU 1's next-event and actually demotes CPU 0's
> request).
OK, I see.
That's because the state is "collaborative" so to speak. But wasn't
that supposed to be covered by the "coupled" thing?
> The current flat list of idle states has no notion of CPUs sharing
> an idle state request and that's where I think this series kicks in
> and that's the reason I say that the genPD governor can only demote
> an idle state request.
>
> Linking power domains to idle states is the only sensible way I see
> to define what logical cpus are affected by an idle state entry, this
> information is missing in the current kernel (whether that's worthwhile
> adding it that's another question).
OK, thanks for the clarification!
Cheers,
Rafael
On Fri, Sep 14, 2018 at 01:34:14PM +0200, Rafael J. Wysocki wrote:
[...]
> > > So for example, if your logical CPU has an idle state A that may trigger an
> > > idle state X at the cluster level (if the other logical CPUs happen to be in
> > > the right states and so on), then the worst-case exit latency for that
> > > is the one of state X.
> >
> > I will provide an example:
> >
> > IDLE STATE A (affects CPU {0,1}): exit latency 1ms, min-residency 1.5ms
> >
> > CPU 0 is about to enter IDLE state A since its "next-event" fulfills the
> > residency requirements and exit latency constraints.
> >
> > CPU 1 is in idle state A (given that CPU 0 is ON, some of the common
> > logic shared between CPU {0,1} is still ON, but, as soon as CPU 0
> > enters idle state A CPU {0,1} can enter the "full" idle state A
> > power savings mode).
> >
> > The current CPUidle governor does not check the "next-event" for CPU 1,
> > that it may wake up in, say, 10us.
>
> Right.
>
> > Requesting IDLE STATE A is a waste of power (if firmware or hardware
> > does not demote it since it does peek at CPU 1 next-event and actually
> > demote CPU 0 request).
>
> OK, I see.
>
> That's because the state is "collaborative" so to speak. But wasn't
> that supposed to be covered by the "coupled" thing?
The coupled idle states code was merged because on some early SMP
ARM platforms CPUs had to enter cluster idle states in lockstep,
otherwise the system would break; "coupled" as in "synchronized idle
state entry". Basically, the coupled idle code fixed a HW bug. The code
in this series instead applies to all arches where an idle state may
span multiple CPUs (x86 inclusive, but as I mentioned it is probably
not needed there, since the FW/HW behind mwait is capable of detecting
whether it is worthwhile to shut down, say, a package. PSCI, whether in
OSI or PC mode, can work the same way).
Entering an idle state spanning multiple CPUs need not be synchronized,
but a cpumask-aware governor may help optimize idle state selection.
I hope this makes the whole point clearer.
Cheers,
Lorenzo
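To make the demotion idea above concrete, here is a minimal userspace
sketch (not kernel code; all names such as `cluster_state_ok` are
hypothetical) of the check a cpumask-aware governor would perform:
a cluster-wide state is only worth requesting if every CPU covered by
it is predicted to sleep for at least the state's min-residency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 2 /* CPUs {0,1} sharing the cluster state, as in the example */

/* Hypothetical description of a shared idle state, e.g. "IDLE STATE A". */
struct cluster_state {
	uint64_t exit_latency_us;
	uint64_t min_residency_us;
};

/*
 * next_event_us[i] is the predicted time until CPU i's next wakeup.
 * Return true only if every CPU spanned by the state can stay idle
 * long enough; one early riser makes the deep state a net power loss.
 */
static bool cluster_state_ok(const struct cluster_state *s,
			     const uint64_t next_event_us[NCPUS])
{
	for (int i = 0; i < NCPUS; i++) {
		if (next_event_us[i] < s->min_residency_us)
			return false; /* demote to a shallower state */
	}
	return true;
}
```

With Lorenzo's numbers (min-residency 1.5ms, CPU 1 waking in 10us), the
check fails and the request for state A would be demoted, even though
CPU 0's own next event satisfies the residency requirement.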
Hi Ulf,
I hit an issue in the hotplug path where of_genpd_detach_cpu() calls
dev_pm_qos_remove_notifier(), which may sleep, as per the call stack
below. I think it should apply to the current patch as well, right?
Please let me know what I am missing; why didn't you see this issue
with this patch?
[ 8103.221387] BUG: sleeping function called from invalid context at
/mnt/host/source/src/third_party/kernel/v4.14/kernel/locking/mutex.c:238
[ 8103.221455] in_atomic(): 1, irqs_disabled(): 128, pid: 11, name:
migration/0
[ 8103.221487] Preemption disabled at:
[ 8103.221529] [<ffffff800814dfb0>] cpu_stopper_thread+0x98/0x118
[ 8103.221600] ------------[ cut here ]------------
[ 8103.221636] kernel BUG at
/mnt/host/source/src/third_party/kernel/v4.14/kernel/sched/core.c:6102!
[ 8103.221678] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 8103.222396] CPU: 0 PID: 11 Comm: migration/0 Tainted: G W
4.14.72 #1
[ 8103.222428] Hardware name: Google Cheza (rev1) (DT)
[ 8103.222460] task: ffffffc0f842d580 task.stack: ffffff8009c18000
[ 8103.222504] PC is at ___might_sleep+0x138/0x140
[ 8103.222542] LR is at ___might_sleep+0x138/0x140
[ 8103.222577] pc : [<ffffff80080d8f04>] lr : [<ffffff80080d8f04>]
pstate: 60c001c9
[ 8103.222605] sp : ffffff8009c1bb40
….
[ 8103.223924] [<ffffff80080d8f04>] ___might_sleep+0x138/0x140
[ 8103.223965] [<ffffff80080d8d98>] __might_sleep+0x4c/0x80
[ 8103.224009] [<ffffff80088e4258>] mutex_lock+0x28/0x60
[ 8103.224054] [<ffffff800850fa2c>] dev_pm_qos_remove_notifier+0x1c/0x54
[ 8103.224097] [<ffffff8008517814>] genpd_remove_device+0x3c/0x10c
[ 8103.224140] [<ffffff800851949c>] genpd_dev_pm_detach+0x48/0x108
[ 8103.224183] [<ffffff80085193e0>] of_genpd_detach_cpu+0x48/0xbc
[ 8103.224227] [<ffffff80083edea4>] cpu_pd_dying+0x28/0x38
[ 8103.224268] [<ffffff80080ab2c0>] cpuhp_invoke_callback+0x254/0x5f0
[ 8103.224308] [<ffffff80080acdec>] take_cpu_down+0x60/0x9c
[ 8103.224346] [<ffffff800814d898>] multi_cpu_stop+0xac/0x104
[ 8103.224385] [<ffffff800814dfb8>] cpu_stopper_thread+0xa0/0x118
[ 8103.224427] [<ffffff80080cff74>] smpboot_thread_fn+0x19c/0x278
[ 8103.224472] [<ffffff80080cc0c4>] kthread+0x120/0x130
[ 8103.224513] [<ffffff8008084608>] ret_from_fork+0x10/0x18
Thanks,
Raju
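For context on why the trace above fires: the CPUHP "dying" callbacks
run under stop_machine with preemption disabled, while mutex_lock()
(reached via dev_pm_qos_remove_notifier()) starts with a might_sleep()
check. The following is an illustrative userspace model of that
CONFIG_DEBUG_ATOMIC_SLEEP check, not actual kernel code; the names
`might_sleep_ok` and `mutex_lock_model` are made up for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Crude stand-in for the kernel's per-CPU preempt_count. */
static int preempt_count;

/*
 * Model of might_sleep(): a sleeping primitive is only legal when
 * preemption is enabled. In the kernel this would WARN and dump the
 * call stack instead of returning a value.
 */
static bool might_sleep_ok(void)
{
	return preempt_count == 0;
}

/* mutex_lock() may sleep, so it begins with a might_sleep() check. */
static bool mutex_lock_model(void)
{
	return might_sleep_ok();
}
```

In the reported path, take_cpu_down() has already disabled preemption
(the model's preempt_count > 0), so any mutex-taking genpd call made
from cpu_pd_dying() trips the check, exactly as the stack shows.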
On 6/20/2018 10:52 PM, Ulf Hansson wrote:
> To deal with CPU hotplug when OSI mode is used, the CPU device needs to be
> detached from its PM domain (genpd) when putting it offline, otherwise the
> CPU becomes considered as being in use from genpd and runtime PM point of
> view. Obviously, then we also need to re-attach the CPU device when bring
> the CPU back online, so let's do this.
>
> Cc: Lina Iyer <[email protected]>
> Signed-off-by: Ulf Hansson <[email protected]>
> ---
> drivers/firmware/psci/psci.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
> index 700e0e995871..e649673d71f0 100644
> --- a/drivers/firmware/psci/psci.c
> +++ b/drivers/firmware/psci/psci.c
> @@ -190,6 +190,10 @@ static int psci_cpu_off(u32 state)
> int err;
> u32 fn;
>
> + /* If running OSI mode, detach the CPU device from its PM domain. */
> + if (psci_osi_mode_enabled)
> + of_genpd_detach_cpu(smp_processor_id());
> +
> fn = psci_function_id[PSCI_FN_CPU_OFF];
> err = invoke_psci_fn(fn, state, 0, 0);
> return psci_to_linux_errno(err);
> @@ -204,6 +208,10 @@ static int psci_cpu_on(unsigned long cpuid, unsigned long entry_point)
> err = invoke_psci_fn(fn, cpuid, entry_point, 0);
> /* Clear the domain state to start fresh. */
> psci_set_domain_state(0);
> +
> + if (!err && psci_osi_mode_enabled)
> + of_genpd_attach_cpu(cpuid);
> +
> return psci_to_linux_errno(err);
> }
>
>
Hi Ulf, this is seen on v4.19 as well. Could you please check whether
CONFIG_DEBUG_ATOMIC_SLEEP is enabled in your config? I think the
scenario applies to the current patch too.
On 11/20/2018 3:20 PM, Ulf Hansson wrote:
> On 19 November 2018 at 20:50, Raju P L S S S N <[email protected]> wrote:
>> Hi Ulf,
>>
>> Got one issue in hotplug path where of_genpd_detach_cpu calls
>> dev_pm_qos_remove_notifier which can be sleeping as per below call stack. I
>> think it should be applicable for current patch as well right? Please let me
>> know what am I missing? why didn't you see this issue with this patch?
>
> Weird.
>
>>
>>
>> [ 8103.221387] BUG: sleeping function called from invalid context at
>> /mnt/host/source/src/third_party/kernel/v4.14/kernel/locking/mutex.c:238
>
> Could it be due to some other patch in your v4.14 kernel?
>
>> [ 8103.221455] in_atomic(): 1, irqs_disabled(): 128, pid: 11, name:
>> migration/0
>> [ 8103.221487] Preemption disabled at:
>> [ 8103.221529] [<ffffff800814dfb0>] cpu_stopper_thread+0x98/0x118
>> [ 8103.221600] ------------[ cut here ]------------
>> [ 8103.221636] kernel BUG at
>> /mnt/host/source/src/third_party/kernel/v4.14/kernel/sched/core.c:6102!
>> [ 8103.221678] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>> [ 8103.222396] CPU: 0 PID: 11 Comm: migration/0 Tainted: G W 4.14.72
>> #1
>> [ 8103.222428] Hardware name: Google Cheza (rev1) (DT)
>> [ 8103.222460] task: ffffffc0f842d580 task.stack: ffffff8009c18000
>> [ 8103.222504] PC is at ___might_sleep+0x138/0x140
>> [ 8103.222542] LR is at ___might_sleep+0x138/0x140
>> [ 8103.222577] pc : [<ffffff80080d8f04>] lr : [<ffffff80080d8f04>] pstate:
>> 60c001c9
>> [ 8103.222605] sp : ffffff8009c1bb40
>> ….
>> [ 8103.223924] [<ffffff80080d8f04>] ___might_sleep+0x138/0x140
>> [ 8103.223965] [<ffffff80080d8d98>] __might_sleep+0x4c/0x80
>> [ 8103.224009] [<ffffff80088e4258>] mutex_lock+0x28/0x60
>> [ 8103.224054] [<ffffff800850fa2c>] dev_pm_qos_remove_notifier+0x1c/0x54
>> [ 8103.224097] [<ffffff8008517814>] genpd_remove_device+0x3c/0x10c
>> [ 8103.224140] [<ffffff800851949c>] genpd_dev_pm_detach+0x48/0x108
>> [ 8103.224183] [<ffffff80085193e0>] of_genpd_detach_cpu+0x48/0xbc
>> [ 8103.224227] [<ffffff80083edea4>] cpu_pd_dying+0x28/0x38
>> [ 8103.224268] [<ffffff80080ab2c0>] cpuhp_invoke_callback+0x254/0x5f0
>> [ 8103.224308] [<ffffff80080acdec>] take_cpu_down+0x60/0x9c
>> [ 8103.224346] [<ffffff800814d898>] multi_cpu_stop+0xac/0x104
>> [ 8103.224385] [<ffffff800814dfb8>] cpu_stopper_thread+0xa0/0x118
>> [ 8103.224427] [<ffffff80080cff74>] smpboot_thread_fn+0x19c/0x278
>> [ 8103.224472] [<ffffff80080cc0c4>] kthread+0x120/0x130
>> [ 8103.224513] [<ffffff8008084608>] ret_from_fork+0x10/0x18
>
> Thanks for the report, I will double check my series before I post the
> new version of my series. If nothing unexpected shows up, that should
> be in a couple of days from now.
>
> I keep you cc.
>
> [...]
>
> Kind regards
> Uffe
>
On 19 November 2018 at 20:50, Raju P L S S S N <[email protected]> wrote:
> Hi Ulf,
>
> Got one issue in hotplug path where of_genpd_detach_cpu calls
> dev_pm_qos_remove_notifier which can be sleeping as per below call stack. I
> think it should be applicable for current patch as well right? Please let me
> know what am I missing? why didn't you see this issue with this patch?
Weird.
>
>
> [ 8103.221387] BUG: sleeping function called from invalid context at
> /mnt/host/source/src/third_party/kernel/v4.14/kernel/locking/mutex.c:238
Could it be due to some other patch in your v4.14 kernel?
> [ 8103.221455] in_atomic(): 1, irqs_disabled(): 128, pid: 11, name:
> migration/0
> [ 8103.221487] Preemption disabled at:
> [ 8103.221529] [<ffffff800814dfb0>] cpu_stopper_thread+0x98/0x118
> [ 8103.221600] ------------[ cut here ]------------
> [ 8103.221636] kernel BUG at
> /mnt/host/source/src/third_party/kernel/v4.14/kernel/sched/core.c:6102!
> [ 8103.221678] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [ 8103.222396] CPU: 0 PID: 11 Comm: migration/0 Tainted: G W 4.14.72
> #1
> [ 8103.222428] Hardware name: Google Cheza (rev1) (DT)
> [ 8103.222460] task: ffffffc0f842d580 task.stack: ffffff8009c18000
> [ 8103.222504] PC is at ___might_sleep+0x138/0x140
> [ 8103.222542] LR is at ___might_sleep+0x138/0x140
> [ 8103.222577] pc : [<ffffff80080d8f04>] lr : [<ffffff80080d8f04>] pstate:
> 60c001c9
> [ 8103.222605] sp : ffffff8009c1bb40
> ….
> [ 8103.223924] [<ffffff80080d8f04>] ___might_sleep+0x138/0x140
> [ 8103.223965] [<ffffff80080d8d98>] __might_sleep+0x4c/0x80
> [ 8103.224009] [<ffffff80088e4258>] mutex_lock+0x28/0x60
> [ 8103.224054] [<ffffff800850fa2c>] dev_pm_qos_remove_notifier+0x1c/0x54
> [ 8103.224097] [<ffffff8008517814>] genpd_remove_device+0x3c/0x10c
> [ 8103.224140] [<ffffff800851949c>] genpd_dev_pm_detach+0x48/0x108
> [ 8103.224183] [<ffffff80085193e0>] of_genpd_detach_cpu+0x48/0xbc
> [ 8103.224227] [<ffffff80083edea4>] cpu_pd_dying+0x28/0x38
> [ 8103.224268] [<ffffff80080ab2c0>] cpuhp_invoke_callback+0x254/0x5f0
> [ 8103.224308] [<ffffff80080acdec>] take_cpu_down+0x60/0x9c
> [ 8103.224346] [<ffffff800814d898>] multi_cpu_stop+0xac/0x104
> [ 8103.224385] [<ffffff800814dfb8>] cpu_stopper_thread+0xa0/0x118
> [ 8103.224427] [<ffffff80080cff74>] smpboot_thread_fn+0x19c/0x278
> [ 8103.224472] [<ffffff80080cc0c4>] kthread+0x120/0x130
> [ 8103.224513] [<ffffff8008084608>] ret_from_fork+0x10/0x18
Thanks for the report. I will double check before I post the new
version of my series. If nothing unexpected shows up, that should be
in a couple of days from now.
I'll keep you on cc.
[...]
Kind regards
Uffe