2014-04-24 12:24:39

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 0/3] sched: idle: Provide the basis to integrate cpuidle

This patchset provides three patches for the basis to integrate cpuidle with
the scheduler.

The first patch is a cleanup.
The second one adds the sched balance option as requested by Ingo.
The third one stores the idle state a cpu is and adds a rcu_barrier() to
prevent races when using the pointed object.

This patchset is based on top of v3.15-rc2.

This patchset does not modify the behavior of the scheduler.

Taking into account the cpuidle information from the scheduler will be
posted in a separate patchset in order to keep focused on the right decisions
the scheduler should take regarding the policy vs idle parameters.

Daniel Lezcano (3):
sched: idle: Encapsulate the code to compile it out
sched: idle: Add sched balance option
sched: idle: Store the idle state the cpu is

drivers/cpuidle/cpuidle.c | 6 ++
include/linux/sched/sysctl.h | 14 ++++
kernel/sched/fair.c | 92 ++++++++++++++++++++++-
kernel/sched/idle.c | 169 +++++++++++++++++++++++-------------------
kernel/sched/sched.h | 5 ++
kernel/sysctl.c | 11 +++
6 files changed, 220 insertions(+), 77 deletions(-)

--
1.7.9.5


2014-04-24 12:24:37

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 2/3] sched: idle: Add sched balance option

This patch adds a sysctl schedule balance option to choose against:

* auto (0)
* performance (1)
* power (2)

It relies on the recently added notifier to monitor the power supply changes.
If the scheduler balance option is set to 'auto', then when the system switches
to battery, the balance option change to 'power' and when it goes back to AC, it
switches to 'performance'.

The default value is 'auto'.

If the kernel is compiled without the CONFIG_POWER_SUPPLY option, then any call
to the 'auto' option will fail and the scheduler will use the 'performance'
option as default.

Signed-off-by: Daniel Lezcano <[email protected]>
---
include/linux/sched/sysctl.h | 14 +++++++
kernel/sched/fair.c | 92 +++++++++++++++++++++++++++++++++++++++++-
kernel/sysctl.c | 11 +++++
3 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 8045a55..f8507bf 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -44,6 +44,20 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+#ifdef CONFIG_SMP
+enum sched_balance_option {
+ SCHED_BALANCE_OPTION_PERFORMANCE,
+ SCHED_BALANCE_OPTION_POWER,
+ SCHED_BALANCE_OPTION_AUTO,
+ SCHED_BALANCE_OPTION_END,
+};
+extern enum sched_balance_option sysctl_sched_balance_option;
+
+int sched_proc_balance_option_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length,
+ loff_t *ppos);
+#endif
+
extern unsigned int sysctl_numa_balancing_scan_delay;
extern unsigned int sysctl_numa_balancing_scan_period_min;
extern unsigned int sysctl_numa_balancing_scan_period_max;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..7b8e93d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -29,7 +29,7 @@
#include <linux/mempolicy.h>
#include <linux/migrate.h>
#include <linux/task_work.h>
-
+#include <linux/power_supply.h>
#include <trace/events/sched.h>

#include "sched.h"
@@ -61,6 +61,24 @@ unsigned int normalized_sysctl_sched_latency = 6000000ULL;
enum sched_tunable_scaling sysctl_sched_tunable_scaling
= SCHED_TUNABLESCALING_LOG;

+#ifdef CONFIG_SMP
+/*
+ * Scheduler balancing policy:
+ *
+ * Options are:
+ * SCHED_BALANCE_OPTION_PERFORMANCE - full performance
+ * SCHED_BALANCE_OPTION_POWER - power saving aggressive
+ * SCHED_BALANCE_OPTION_AUTO - switches to 'performance' when plugged
+ * on or 'power' on battery
+ */
+enum sched_balance_option sysctl_sched_balance_option
+ = SCHED_BALANCE_OPTION_AUTO;
+
+static int sched_current_balance_option
+ = SCHED_BALANCE_OPTION_PERFORMANCE;
+
+#endif
+
/*
* Minimal preemption granularity for CPU-bound tasks:
* (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
@@ -555,6 +573,76 @@ static struct sched_entity *__pick_next_entity(struct sched_entity *se)
return rb_entry(next, struct sched_entity, run_node);
}

+#ifdef CONFIG_SMP
+static int sched_balance_option_update(void)
+{
+ int ret;
+
+ /*
+ * Copy the current balance option
+ */
+ if (sysctl_sched_balance_option != SCHED_BALANCE_OPTION_AUTO) {
+ sched_current_balance_option = sysctl_sched_balance_option;
+ return 0;
+ }
+
+ /*
+ * This call may fail if the kernel is not compiled with
+ * the POWER_SUPPLY option.
+ */
+ ret = power_supply_is_system_supplied();
+ if (ret < 0) {
+ sysctl_sched_balance_option = sched_current_balance_option;
+ return ret;
+ }
+
+ /*
+ * When in 'auto' mode, switch to 'performance if the system
+ * is plugged on the wall, to 'power' if we are on battery
+ */
+ sched_current_balance_option = ret ?
+ SCHED_BALANCE_OPTION_PERFORMANCE :
+ SCHED_BALANCE_OPTION_POWER;
+
+ return 0;
+}
+
+int sched_proc_balance_option_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length,
+ loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret)
+ return ret;
+
+ return sched_balance_option_update();
+}
+
+static int sched_power_supply_notifier(struct notifier_block *b,
+ unsigned long l, void *v)
+{
+ sched_balance_option_update();
+ return NOTIFY_OK;
+}
+
+static struct notifier_block power_supply_notifier_nb = {
+ .notifier_call = sched_power_supply_notifier,
+};
+
+static int sched_balance_option_init(void)
+{
+ int ret;
+
+ ret = sched_balance_option_update();
+ if (ret)
+ return ret;
+
+ return power_supply_reg_notifier(&power_supply_notifier_nb);
+}
+#endif
+
#ifdef CONFIG_SCHED_DEBUG
struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
{
@@ -7695,7 +7783,7 @@ __init void init_sched_fair_class(void)
{
#ifdef CONFIG_SMP
open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
-
+ sched_balance_option_init();
#ifdef CONFIG_NO_HZ_COMMON
nohz.next_balance = jiffies;
zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 74f5b58..e4ecc7d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -282,6 +282,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SMP
+ {
+ .procname = "sched_balance_option",
+ .data = &sysctl_sched_balance_option,
+ .maxlen = sizeof(enum sched_balance_option),
+ .mode = 0644,
+ .proc_handler = sched_proc_balance_option_handler,
+ .extra1 = &zero, /* SCHED_BALANCE_OPTION_AUTO */
+ .extra2 = &two, /* SCHED_BALANCE_OPTION_POWER */
+ },
+#endif
#ifdef CONFIG_SCHED_DEBUG
{
.procname = "sched_min_granularity_ns",
--
1.7.9.5

2014-04-24 12:24:35

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 1/3] sched: idle: Encapsulate the code to compile it out

Encapsulate the large portion of cpuidle_idle_call inside another
function so when CONFIG_CPU_IDLE=n, the code will be compiled out.
Also that is benefitial for the clarity of the code as it removes
a nested indentation level.

Signed-off-by: Daniel Lezcano <[email protected]>
---
kernel/sched/idle.c | 161 +++++++++++++++++++++++++++------------------------
1 file changed, 86 insertions(+), 75 deletions(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8f4390a..e877dd4 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -63,6 +63,90 @@ void __weak arch_cpu_idle(void)
local_irq_enable();
}

+#ifdef CONFIG_CPU_IDLE
+static int __cpuidle_idle_call(void)
+{
+ struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
+ struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+ int next_state, entered_state, ret;
+ bool broadcast;
+
+ /*
+ * Check if the cpuidle framework is ready, otherwise fallback
+ * to the default arch specific idle method
+ */
+ ret = cpuidle_enabled(drv, dev);
+ if (ret)
+ return ret;
+
+ /*
+ * Ask the governor to choose an idle state it thinks
+ * it is convenient to go to. There is *always* a
+ * convenient idle state
+ */
+ next_state = cpuidle_select(drv, dev);
+
+ /*
+ * The idle task must be scheduled, it is pointless to
+ * go to idle, just update no idle residency and get
+ * out of this function
+ */
+ if (current_clr_polling_and_test()) {
+ dev->last_residency = 0;
+ entered_state = next_state;
+ local_irq_enable();
+ } else {
+ broadcast = !!(drv->states[next_state].flags &
+ CPUIDLE_FLAG_TIMER_STOP);
+
+ if (broadcast)
+ /*
+ * Tell the time framework to switch to a
+ * broadcast timer because our local timer
+ * will be shutdown. If a local timer is used
+ * from another cpu as a broadcast timer, this
+ * call may fail if it is not available
+ */
+ ret = clockevents_notify(
+ CLOCK_EVT_NOTIFY_BROADCAST_ENTER,
+ &dev->cpu);
+
+ if (!ret) {
+ trace_cpu_idle_rcuidle(next_state, dev->cpu);
+
+ /*
+ * Enter the idle state previously returned by
+ * the governor decision. This function will
+ * block until an interrupt occurs and will
+ * take care of re-enabling the local
+ * interrupts
+ */
+ entered_state = cpuidle_enter(drv, dev, next_state);
+
+ trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
+
+ if (broadcast)
+ clockevents_notify(
+ CLOCK_EVT_NOTIFY_BROADCAST_EXIT,
+ &dev->cpu);
+
+ /*
+ * Give the governor an opportunity to reflect
+ * on the outcome
+ */
+ cpuidle_reflect(dev, entered_state);
+ }
+ }
+
+ return 0;
+}
+#else
+static int inline __cpuidle_idle_call(void)
+{
+ return -ENOSYS;
+}
+#endif
+
/**
* cpuidle_idle_call - the main idle function
*
@@ -71,10 +155,7 @@ void __weak arch_cpu_idle(void)
*/
static int cpuidle_idle_call(void)
{
- struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
- struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
- int next_state, entered_state, ret;
- bool broadcast;
+ int ret;

/*
* Check if the idle task must be rescheduled. If it is the
@@ -101,80 +182,10 @@ static int cpuidle_idle_call(void)
rcu_idle_enter();

/*
- * Check if the cpuidle framework is ready, otherwise fallback
- * to the default arch specific idle method
- */
- ret = cpuidle_enabled(drv, dev);
-
- if (!ret) {
- /*
- * Ask the governor to choose an idle state it thinks
- * it is convenient to go to. There is *always* a
- * convenient idle state
- */
- next_state = cpuidle_select(drv, dev);
-
- /*
- * The idle task must be scheduled, it is pointless to
- * go to idle, just update no idle residency and get
- * out of this function
- */
- if (current_clr_polling_and_test()) {
- dev->last_residency = 0;
- entered_state = next_state;
- local_irq_enable();
- } else {
- broadcast = !!(drv->states[next_state].flags &
- CPUIDLE_FLAG_TIMER_STOP);
-
- if (broadcast)
- /*
- * Tell the time framework to switch
- * to a broadcast timer because our
- * local timer will be shutdown. If a
- * local timer is used from another
- * cpu as a broadcast timer, this call
- * may fail if it is not available
- */
- ret = clockevents_notify(
- CLOCK_EVT_NOTIFY_BROADCAST_ENTER,
- &dev->cpu);
-
- if (!ret) {
- trace_cpu_idle_rcuidle(next_state, dev->cpu);
-
- /*
- * Enter the idle state previously
- * returned by the governor
- * decision. This function will block
- * until an interrupt occurs and will
- * take care of re-enabling the local
- * interrupts
- */
- entered_state = cpuidle_enter(drv, dev,
- next_state);
-
- trace_cpu_idle_rcuidle(PWR_EVENT_EXIT,
- dev->cpu);
-
- if (broadcast)
- clockevents_notify(
- CLOCK_EVT_NOTIFY_BROADCAST_EXIT,
- &dev->cpu);
-
- /*
- * Give the governor an opportunity to reflect on the
- * outcome
- */
- cpuidle_reflect(dev, entered_state);
- }
- }
- }
-
- /*
* We can't use the cpuidle framework, let's use the default
* idle routine
*/
+ ret = __cpuidle_idle_call();
if (ret)
arch_cpu_idle();

--
1.7.9.5

2014-04-24 12:27:15

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 3/3] sched: idle: Store the idle state the cpu is

When the cpu enters idle it stores the cpuidle state pointer in the struct
rq which in turn could be used to take a right decision when balancing
a task.

As soon as the cpu exits the idle state, the structure is filled back with the
NULL pointer.

There are a couple of situations where the idle state pointer could be changed
while looking at it:

1. For x86/acpi with dynamic c-states, when a laptop switches from battery
to AC that could result on removing the deeper idle state. The acpi driver
triggers:
'acpi_processor_cst_has_changed'
'cpuidle_pause_and_lock'
'cpuidle_uninstall_idle_handler'
'kick_all_cpus_sync'.

All cpus will exit their idle state and the pointed object will be set to NULL.

2. The cpuidle driver is unloaded. Logically that could happen but not in
practice because the drivers are always compiled in and 95% of the drivers are
not coded to unregister the driver. In any case, the unloading code must call
'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock' leading to
'kick_all_cpus_sync' as mentioned above.

The race can happen if we use the pointer and then one of these two situations
occurs at the same moment.

In order to be safe, the idle state pointer stored in the rq must be used inside
a rcu_read_lock section where we are protected with the 'rcu_barrier' in the
'cpuidle_uninstall_idle_handler' function.

Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/cpuidle/cpuidle.c | 6 ++++++
kernel/sched/idle.c | 8 ++++++++
kernel/sched/sched.h | 5 +++++
3 files changed, 19 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 8236746..6a13f40 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -190,6 +190,12 @@ void cpuidle_install_idle_handler(void)
*/
void cpuidle_uninstall_idle_handler(void)
{
+ /*
+ * Wait for the scheduler to finish to use the idle state he
+ * may pointing to when looking for the cpu idle states
+ */
+ rcu_barrier();
+
if (enabled_devices) {
initialized = 0;
kick_all_cpus_sync();
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index e877dd4..4c14ec0 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -12,6 +12,8 @@

#include <trace/events/power.h>

+#include "sched.h"
+
static int __read_mostly cpu_idle_force_poll;

void cpu_idle_poll_ctrl(bool enable)
@@ -66,6 +68,8 @@ void __weak arch_cpu_idle(void)
#ifdef CONFIG_CPU_IDLE
static int __cpuidle_idle_call(void)
{
+ struct rq *rq = this_rq();
+ struct cpuidle_state **idle_state = &rq->idle_state;
struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
int next_state, entered_state, ret;
@@ -114,6 +118,8 @@ static int __cpuidle_idle_call(void)
if (!ret) {
trace_cpu_idle_rcuidle(next_state, dev->cpu);

+ *idle_state = &drv->states[next_state];
+
/*
* Enter the idle state previously returned by
* the governor decision. This function will
@@ -123,6 +129,8 @@ static int __cpuidle_idle_call(void)
*/
entered_state = cpuidle_enter(drv, dev, next_state);

+ *idle_state = NULL;
+
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);

if (broadcast)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456e492..bace64a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -14,6 +14,7 @@
#include "cpuacct.h"

struct rq;
+struct cpuidle_state;

extern __read_mostly int scheduler_running;

@@ -643,6 +644,10 @@ struct rq {
#ifdef CONFIG_SMP
struct llist_head wake_list;
#endif
+
+#ifdef CONFIG_CPU_IDLE
+ struct cpuidle_state *idle_state;
+#endif
};

static inline int cpu_of(struct rq *rq)
--
1.7.9.5

2014-04-24 13:29:42

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On 04/24/2014 03:13 PM, Amit Kucheria wrote:
> On Thu, Apr 24, 2014 at 5:54 PM, Daniel Lezcano
> <[email protected] <mailto:[email protected]>> wrote:
>
> This patch adds a sysctl schedule balance option to choose against:
>
> * auto (0)
> * performance (1)
> * power (2)
>
> It relies on the recently added notifier to monitor the power supply
> changes.
> If the scheduler balance option is set to 'auto', then when the
> system switches
> to battery, the balance option change to 'power' and when it goes
> back to AC, it
> switches to 'performance'.
>
> The default value is 'auto'.
>
> If the kernel is compiled without the CONFIG_POWER_SUPPLY option,
> then any call
> to the 'auto' option will fail and the scheduler will use the
> 'performance'
> option as default.
>
> Signed-off-by: Daniel Lezcano <[email protected]
> <mailto:[email protected]>>
> ---
> include/linux/sched/sysctl.h | 14 +++++++
> kernel/sched/fair.c | 92
> +++++++++++++++++++++++++++++++++++++++++-
> kernel/sysctl.c | 11 +++++
> 3 files changed, 115 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 8045a55..f8507bf 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -44,6 +44,20 @@ enum sched_tunable_scaling {
> };
> extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
>
> +#ifdef CONFIG_SMP
> +enum sched_balance_option {
>
>
> What do you think of s/option/bias/g ?
>
> It is essentially biasing the scheduler towards performance or power

Yes, could be more adequate.

> + SCHED_BALANCE_OPTION_PERFORMANCE,
> + SCHED_BALANCE_OPTION_POWER,
> + SCHED_BALANCE_OPTION_AUTO,
> + SCHED_BALANCE_OPTION_END,
> +};
>
>
> +extern enum sched_balance_option sysctl_sched_balance_option;
> +
> +int sched_proc_balance_option_handler(struct ctl_table *table, int
> write,
> + void __user *buffer, size_t *length,
> + loff_t *ppos);
> +#endif
> +
> extern unsigned int sysctl_numa_balancing_scan_delay;
> extern unsigned int sysctl_numa_balancing_scan_period_min;
> extern unsigned int sysctl_numa_balancing_scan_period_max;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7570dd9..7b8e93d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -29,7 +29,7 @@
> #include <linux/mempolicy.h>
> #include <linux/migrate.h>
> #include <linux/task_work.h>
> -
> +#include <linux/power_supply.h>
> #include <trace/events/sched.h>
>
> #include "sched.h"
> @@ -61,6 +61,24 @@ unsigned int normalized_sysctl_sched_latency =
> 6000000ULL;
> enum sched_tunable_scaling sysctl_sched_tunable_scaling
> = SCHED_TUNABLESCALING_LOG;
>
> +#ifdef CONFIG_SMP
> +/*
> + * Scheduler balancing policy:
> + *
> + * Options are:
> + * SCHED_BALANCE_OPTION_PERFORMANCE - full performance
> + * SCHED_BALANCE_OPTION_POWER - power saving aggressive
> + * SCHED_BALANCE_OPTION_AUTO - switches to 'performance' when plugged
> + * on or 'power' on battery
> + */
> +enum sched_balance_option sysctl_sched_balance_option
> + = SCHED_BALANCE_OPTION_AUTO;
> +
> +static int sched_current_balance_option
> + = SCHED_BALANCE_OPTION_PERFORMANCE;
> +
> +#endif
> +
> /*
> * Minimal preemption granularity for CPU-bound tasks:
> * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
> @@ -555,6 +573,76 @@ static struct sched_entity
> *__pick_next_entity(struct sched_entity *se)
> return rb_entry(next, struct sched_entity, run_node);
> }
>
> +#ifdef CONFIG_SMP
> +static int sched_balance_option_update(void)
> +{
> + int ret;
> +
> + /*
> + * Copy the current balance option
> + */
> + if (sysctl_sched_balance_option != SCHED_BALANCE_OPTION_AUTO) {
> + sched_current_balance_option =
> sysctl_sched_balance_option;
> + return 0;
> + }
> +
> + /*
> + * This call may fail if the kernel is not compiled with
> + * the POWER_SUPPLY option.
> + */
> + ret = power_supply_is_system_supplied();
> + if (ret < 0) {
> + sysctl_sched_balance_option =
> sched_current_balance_option;
> + return ret;
> + }
> +
> + /*
> + * When in 'auto' mode, switch to 'performance if the system
> + * is plugged on the wall, to 'power' if we are on battery
> + */
> + sched_current_balance_option = ret ?
> + SCHED_BALANCE_OPTION_PERFORMANCE :
> + SCHED_BALANCE_OPTION_POWER;
> +
> + return 0;
> +}
> +
>
>
> I understand that this is only meant to kick off discussions and other
> criteria besides power being plugged in to bias the scheduler
> performance could be added later. But does it make sense to hardcode the
> power supply assumption into the scheduler?
>
> Can't we instead make sched_balance_option_update() a function pointer
> (with a default implementation that you've provided) that provide
> platforms the ability to override that with their own implementation?

I agree if that really hurts, it could be placed somewhere else, for
example in a new file:
kernel/sched/energy.c

But concerning the callback, I don't see the point to create a specific
platform ops for that as the current code is generic. Do you have any
use case in mind ?

Thanks for the review

-- Daniel

> +int sched_proc_balance_option_handler(struct ctl_table *table, int
> write,
> + void __user *buffer, size_t *length,
> + loff_t *ppos)
> +{
> + int ret;
> +
> + ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> + if (ret)
> + return ret;
> +
> + return sched_balance_option_update();
> +}
> +
> +static int sched_power_supply_notifier(struct notifier_block *b,
> + unsigned long l, void *v)
> +{
> + sched_balance_option_update();
> + return NOTIFY_OK;
> +}
> +
> +static struct notifier_block power_supply_notifier_nb = {
> + .notifier_call = sched_power_supply_notifier,
> +};
> +
> +static int sched_balance_option_init(void)
> +{
> + int ret;
> +
> + ret = sched_balance_option_update();
> + if (ret)
> + return ret;
> +
> + return power_supply_reg_notifier(&power_supply_notifier_nb);
> +}
> +#endif
> +
> #ifdef CONFIG_SCHED_DEBUG
> struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
> {
> @@ -7695,7 +7783,7 @@ __init void init_sched_fair_class(void)
> {
> #ifdef CONFIG_SMP
> open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
> -
> + sched_balance_option_init();
> #ifdef CONFIG_NO_HZ_COMMON
> nohz.next_balance = jiffies;
> zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 74f5b58..e4ecc7d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -282,6 +282,17 @@ static struct ctl_table kern_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec,
> },
> +#ifdef CONFIG_SMP
> + {
> + .procname = "sched_balance_option",
> + .data = &sysctl_sched_balance_option,
> + .maxlen = sizeof(enum sched_balance_option),
> + .mode = 0644,
> + .proc_handler = sched_proc_balance_option_handler,
> + .extra1 = &zero, /* SCHED_BALANCE_OPTION_AUTO */
> + .extra2 = &two, /*
> SCHED_BALANCE_OPTION_POWER */
> + },
> +#endif
> #ifdef CONFIG_SCHED_DEBUG
> {
> .procname = "sched_min_granularity_ns",
> --
> 1.7.9.5
>
>
> _______________________________________________
> linaro-kernel mailing list
> [email protected] <mailto:[email protected]>
> http://lists.linaro.org/mailman/listinfo/linaro-kernel
>
>


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2014-04-24 15:57:35

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 0/3] sched: idle: Provide the basis to integrate cpuidle

On Thursday, April 24, 2014 02:24:48 PM Daniel Lezcano wrote:
> This patchset provides three patches for the basis to integrate cpuidle with
> the scheduler.
>
> The first patch is a cleanup.
> The second one adds the sched balance option as requested by Ingo.
> The third one stores the idle state a cpu is and adds a rcu_barrier() to
> prevent races when using the pointed object.
>
> This patchset is based on top of v3.15-rc2.
>
> This patchset does not modify the behavior of the scheduler.
>
> Taking into account the cpuidle information from the scheduler will be
> posted in a separate patchset in order to keep focused on the right decisions
> the scheduler should take regarding the policy vs idle parameters.
>
> Daniel Lezcano (3):
> sched: idle: Encapsulate the code to compile it out
> sched: idle: Add sched balance option
> sched: idle: Store the idle state the cpu is

Please rebase patch [1/3] on top of https://patchwork.kernel.org/patch/4021831/

I agree with the Amit's comment regarding the power supply assumption in
patch [2/3].

> drivers/cpuidle/cpuidle.c | 6 ++
> include/linux/sched/sysctl.h | 14 ++++
> kernel/sched/fair.c | 92 ++++++++++++++++++++++-
> kernel/sched/idle.c | 169 +++++++++++++++++++++++-------------------
> kernel/sched/sched.h | 5 ++
> kernel/sysctl.c | 11 +++
> 6 files changed, 220 insertions(+), 77 deletions(-)
>
>

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-04-24 16:00:39

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: idle: Store the idle state the cpu is

On Thursday, April 24, 2014 02:24:51 PM Daniel Lezcano wrote:
> When the cpu enters idle it stores the cpuidle state pointer in the struct
> rq which in turn could be used to take a right decision when balancing
> a task.
>
> As soon as the cpu exits the idle state, the structure is filled back with the
> NULL pointer.
>
> There are a couple of situations where the idle state pointer could be changed
> while looking at it:
>
> 1. For x86/acpi with dynamic c-states, when a laptop switches from battery
> to AC that could result on removing the deeper idle state. The acpi driver
> triggers:
> 'acpi_processor_cst_has_changed'
> 'cpuidle_pause_and_lock'
> 'cpuidle_uninstall_idle_handler'
> 'kick_all_cpus_sync'.
>
> All cpus will exit their idle state and the pointed object will be set to NULL.
>
> 2. The cpuidle driver is unloaded. Logically that could happen but not in
> practice because the drivers are always compiled in and 95% of the drivers are
> not coded to unregister the driver. In any case, the unloading code must call
> 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock' leading to
> 'kick_all_cpus_sync' as mentioned above.
>
> The race can happen if we use the pointer and then one of these two situations
> occurs at the same moment.
>
> In order to be safe, the idle state pointer stored in the rq must be used inside
> a rcu_read_lock section where we are protected with the 'rcu_barrier' in the
> 'cpuidle_uninstall_idle_handler' function.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/cpuidle/cpuidle.c | 6 ++++++
> kernel/sched/idle.c | 8 ++++++++
> kernel/sched/sched.h | 5 +++++
> 3 files changed, 19 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 8236746..6a13f40 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -190,6 +190,12 @@ void cpuidle_install_idle_handler(void)
> */
> void cpuidle_uninstall_idle_handler(void)
> {
> + /*
> + * Wait for the scheduler to finish to use the idle state he
> + * may pointing to when looking for the cpu idle states
> + */
> + rcu_barrier();
> +
> if (enabled_devices) {
> initialized = 0;
> kick_all_cpus_sync();
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index e877dd4..4c14ec0 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -12,6 +12,8 @@
>
> #include <trace/events/power.h>
>
> +#include "sched.h"
> +
> static int __read_mostly cpu_idle_force_poll;
>
> void cpu_idle_poll_ctrl(bool enable)
> @@ -66,6 +68,8 @@ void __weak arch_cpu_idle(void)
> #ifdef CONFIG_CPU_IDLE
> static int __cpuidle_idle_call(void)
> {
> + struct rq *rq = this_rq();
> + struct cpuidle_state **idle_state = &rq->idle_state;
> struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
> struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> int next_state, entered_state, ret;
> @@ -114,6 +118,8 @@ static int __cpuidle_idle_call(void)
> if (!ret) {
> trace_cpu_idle_rcuidle(next_state, dev->cpu);
>
> + *idle_state = &drv->states[next_state];

What about using rq->idle_state directly here?

> +
> /*
> * Enter the idle state previously returned by
> * the governor decision. This function will
> @@ -123,6 +129,8 @@ static int __cpuidle_idle_call(void)
> */
> entered_state = cpuidle_enter(drv, dev, next_state);
>
> + *idle_state = NULL;

And here?

That would be simpler it seems.

> +
> trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
>
> if (broadcast)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 456e492..bace64a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -14,6 +14,7 @@
> #include "cpuacct.h"
>
> struct rq;
> +struct cpuidle_state;
>
> extern __read_mostly int scheduler_running;
>
> @@ -643,6 +644,10 @@ struct rq {
> #ifdef CONFIG_SMP
> struct llist_head wake_list;
> #endif
> +
> +#ifdef CONFIG_CPU_IDLE
> + struct cpuidle_state *idle_state;
> +#endif
> };
>
> static inline int cpu_of(struct rq *rq)
>

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-04-25 10:54:54

by Amit Kucheria

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Thu, Apr 24, 2014 at 7:00 PM, Daniel Lezcano
<[email protected]> wrote:
>
> On 04/24/2014 03:13 PM, Amit Kucheria wrote:
>>
>> On Thu, Apr 24, 2014 at 5:54 PM, Daniel Lezcano
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>> This patch adds a sysctl schedule balance option to choose against:
>>
>> * auto (0)
>> * performance (1)
>> * power (2)
>>
>> It relies on the recently added notifier to monitor the power supply
>> changes.
>> If the scheduler balance option is set to 'auto', then when the
>> system switches
>> to battery, the balance option change to 'power' and when it goes
>> back to AC, it
>> switches to 'performance'.
>>
>> The default value is 'auto'.
>>
>> If the kernel is compiled without the CONFIG_POWER_SUPPLY option,
>> then any call
>> to the 'auto' option will fail and the scheduler will use the
>> 'performance'
>> option as default.
>>
>> Signed-off-by: Daniel Lezcano <[email protected]
>> <mailto:[email protected]>>
>>
>> ---
>> include/linux/sched/sysctl.h | 14 +++++++
>> kernel/sched/fair.c | 92
>> +++++++++++++++++++++++++++++++++++++++++-
>> kernel/sysctl.c | 11 +++++
>> 3 files changed, 115 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>> index 8045a55..f8507bf 100644
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -44,6 +44,20 @@ enum sched_tunable_scaling {
>> };
>> extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
>>
>> +#ifdef CONFIG_SMP
>> +enum sched_balance_option {
>>
>>
>> What do you think of s/option/bias/g ?
>>
>> It is essentially biasing the scheduler towards performance or power
>
>
> Yes, could be more adequate.
>
>
>> + SCHED_BALANCE_OPTION_PERFORMANCE,
>> + SCHED_BALANCE_OPTION_POWER,
>> + SCHED_BALANCE_OPTION_AUTO,
>> + SCHED_BALANCE_OPTION_END,
>> +};
>>
>>
>> +extern enum sched_balance_option sysctl_sched_balance_option;
>> +
>> +int sched_proc_balance_option_handler(struct ctl_table *table, int
>> write,
>> + void __user *buffer, size_t *length,
>> + loff_t *ppos);
>> +#endif
>> +
>> extern unsigned int sysctl_numa_balancing_scan_delay;
>> extern unsigned int sysctl_numa_balancing_scan_period_min;
>> extern unsigned int sysctl_numa_balancing_scan_period_max;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7570dd9..7b8e93d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -29,7 +29,7 @@
>> #include <linux/mempolicy.h>
>> #include <linux/migrate.h>
>> #include <linux/task_work.h>
>> -
>> +#include <linux/power_supply.h>
>> #include <trace/events/sched.h>
>>
>> #include "sched.h"
>> @@ -61,6 +61,24 @@ unsigned int normalized_sysctl_sched_latency =
>> 6000000ULL;
>> enum sched_tunable_scaling sysctl_sched_tunable_scaling
>> = SCHED_TUNABLESCALING_LOG;
>>
>> +#ifdef CONFIG_SMP
>> +/*
>> + * Scheduler balancing policy:
>> + *
>> + * Options are:
>> + * SCHED_BALANCE_OPTION_PERFORMANCE - full performance
>> + * SCHED_BALANCE_OPTION_POWER - power saving aggressive
>> + * SCHED_BALANCE_OPTION_AUTO - switches to 'performance' when plugged
>> + * on or 'power' on battery
>> + */
>> +enum sched_balance_option sysctl_sched_balance_option
>> + = SCHED_BALANCE_OPTION_AUTO;
>> +
>> +static int sched_current_balance_option
>> + = SCHED_BALANCE_OPTION_PERFORMANCE;
>> +
>> +#endif
>> +
>> /*
>> * Minimal preemption granularity for CPU-bound tasks:
>> * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
>> @@ -555,6 +573,76 @@ static struct sched_entity
>> *__pick_next_entity(struct sched_entity *se)
>> return rb_entry(next, struct sched_entity, run_node);
>> }
>>
>> +#ifdef CONFIG_SMP
>> +static int sched_balance_option_update(void)
>> +{
>> + int ret;
>> +
>> + /*
>> + * Copy the current balance option
>> + */
>> + if (sysctl_sched_balance_option != SCHED_BALANCE_OPTION_AUTO) {
>> + sched_current_balance_option =
>> sysctl_sched_balance_option;
>> + return 0;
>> + }
>> +
>> + /*
>> + * This call may fail if the kernel is not compiled with
>> + * the POWER_SUPPLY option.
>> + */
>> + ret = power_supply_is_system_supplied();
>> + if (ret < 0) {
>> + sysctl_sched_balance_option =
>> sched_current_balance_option;
>> + return ret;
>> + }
>> +
>> + /*
>> + * When in 'auto' mode, switch to 'performance if the system
>> + * is plugged on the wall, to 'power' if we are on battery
>> + */
>> + sched_current_balance_option = ret ?
>> + SCHED_BALANCE_OPTION_PERFORMANCE :
>> + SCHED_BALANCE_OPTION_POWER;
>> +
>> + return 0;
>> +}
>> +
>>
>>
>> I understand that this is only meant to kick off discussions and other
>> criteria besides power being plugged in to bias the scheduler
>> performance could be added later. But does it make sense to hardcode the
>> power supply assumption into the scheduler?
>>
>> Can't we instead make sched_balance_option_update() a function pointer
>> (with a default implementation that you've provided) that provide
>> platforms the ability to override that with their own implementation?
>
>
> I agree if that really hurts, it could be placed somewhere else, for example in a new file:
> kernel/sched/energy.c
>
> But concerning the callback, I don't see the point to create a specific platform ops for that as the current code is generic. Do you have any use case in mind ?
>

I had a offline conversation with Daniel about this since there are
other triggers - thermal constraints and game-like apps being examples
- that might want to override the system policy. He intended
"performance" mode to mean the existing code paths and "power" mode to
mean *additional* new heuristics for energy-efficiency. The power
supply assumption is just the first one of those heuristics.

In that case, "performance" should be the default so we don't change
existing system behavior. And perhaps we want to keep these energy
heuristics in a separate file, at least until we've gotten them right.

While we're at it, can we attempt to replace power with energy since
that is what we're trying to save in most cases?

2014-04-25 11:30:33

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Friday, April 25, 2014 04:24:49 PM Amit Kucheria wrote:
> On Thu, Apr 24, 2014 at 7:00 PM, Daniel Lezcano
> <[email protected]> wrote:
> >
> > On 04/24/2014 03:13 PM, Amit Kucheria wrote:
> >>
> >> On Thu, Apr 24, 2014 at 5:54 PM, Daniel Lezcano
> >> <[email protected] <mailto:[email protected]>> wrote:
> >>
> >> This patch adds a sysctl schedule balance option to choose against:
> >>
> >> * auto (0)
> >> * performance (1)
> >> * power (2)
> >>
> >> It relies on the recently added notifier to monitor the power supply
> >> changes.
> >> If the scheduler balance option is set to 'auto', then when the
> >> system switches
> >> to battery, the balance option change to 'power' and when it goes
> >> back to AC, it
> >> switches to 'performance'.
> >>
> >> The default value is 'auto'.
> >>
> >> If the kernel is compiled without the CONFIG_POWER_SUPPLY option,
> >> then any call
> >> to the 'auto' option will fail and the scheduler will use the
> >> 'performance'
> >> option as default.
> >>
> >> Signed-off-by: Daniel Lezcano <[email protected]
> >> <mailto:[email protected]>>
> >>
> >> ---
> >> include/linux/sched/sysctl.h | 14 +++++++
> >> kernel/sched/fair.c | 92
> >> +++++++++++++++++++++++++++++++++++++++++-
> >> kernel/sysctl.c | 11 +++++
> >> 3 files changed, 115 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> >> index 8045a55..f8507bf 100644
> >> --- a/include/linux/sched/sysctl.h
> >> +++ b/include/linux/sched/sysctl.h
> >> @@ -44,6 +44,20 @@ enum sched_tunable_scaling {
> >> };
> >> extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
> >>
> >> +#ifdef CONFIG_SMP
> >> +enum sched_balance_option {
> >>
> >>
> >> What do you think of s/option/bias/g ?
> >>
> >> It is essentially biasing the scheduler towards performance or power
> >
> >
> > Yes, could be more adequate.
> >
> >
> >> + SCHED_BALANCE_OPTION_PERFORMANCE,
> >> + SCHED_BALANCE_OPTION_POWER,
> >> + SCHED_BALANCE_OPTION_AUTO,
> >> + SCHED_BALANCE_OPTION_END,
> >> +};
> >>
> >>
> >> +extern enum sched_balance_option sysctl_sched_balance_option;
> >> +
> >> +int sched_proc_balance_option_handler(struct ctl_table *table, int
> >> write,
> >> + void __user *buffer, size_t *length,
> >> + loff_t *ppos);
> >> +#endif
> >> +
> >> extern unsigned int sysctl_numa_balancing_scan_delay;
> >> extern unsigned int sysctl_numa_balancing_scan_period_min;
> >> extern unsigned int sysctl_numa_balancing_scan_period_max;
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 7570dd9..7b8e93d 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -29,7 +29,7 @@
> >> #include <linux/mempolicy.h>
> >> #include <linux/migrate.h>
> >> #include <linux/task_work.h>
> >> -
> >> +#include <linux/power_supply.h>
> >> #include <trace/events/sched.h>
> >>
> >> #include "sched.h"
> >> @@ -61,6 +61,24 @@ unsigned int normalized_sysctl_sched_latency =
> >> 6000000ULL;
> >> enum sched_tunable_scaling sysctl_sched_tunable_scaling
> >> = SCHED_TUNABLESCALING_LOG;
> >>
> >> +#ifdef CONFIG_SMP
> >> +/*
> >> + * Scheduler balancing policy:
> >> + *
> >> + * Options are:
> >> + * SCHED_BALANCE_OPTION_PERFORMANCE - full performance
> >> + * SCHED_BALANCE_OPTION_POWER - power saving aggressive
> >> + * SCHED_BALANCE_OPTION_AUTO - switches to 'performance' when plugged
> >> + * on or 'power' on battery
> >> + */
> >> +enum sched_balance_option sysctl_sched_balance_option
> >> + = SCHED_BALANCE_OPTION_AUTO;
> >> +
> >> +static int sched_current_balance_option
> >> + = SCHED_BALANCE_OPTION_PERFORMANCE;
> >> +
> >> +#endif
> >> +
> >> /*
> >> * Minimal preemption granularity for CPU-bound tasks:
> >> * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
> >> @@ -555,6 +573,76 @@ static struct sched_entity
> >> *__pick_next_entity(struct sched_entity *se)
> >> return rb_entry(next, struct sched_entity, run_node);
> >> }
> >>
> >> +#ifdef CONFIG_SMP
> >> +static int sched_balance_option_update(void)
> >> +{
> >> + int ret;
> >> +
> >> + /*
> >> + * Copy the current balance option
> >> + */
> >> + if (sysctl_sched_balance_option != SCHED_BALANCE_OPTION_AUTO) {
> >> + sched_current_balance_option =
> >> sysctl_sched_balance_option;
> >> + return 0;
> >> + }
> >> +
> >> + /*
> >> + * This call may fail if the kernel is not compiled with
> >> + * the POWER_SUPPLY option.
> >> + */
> >> + ret = power_supply_is_system_supplied();
> >> + if (ret < 0) {
> >> + sysctl_sched_balance_option =
> >> sched_current_balance_option;
> >> + return ret;
> >> + }
> >> +
> >> + /*
> >> + * When in 'auto' mode, switch to 'performance if the system
> >> + * is plugged on the wall, to 'power' if we are on battery
> >> + */
> >> + sched_current_balance_option = ret ?
> >> + SCHED_BALANCE_OPTION_PERFORMANCE :
> >> + SCHED_BALANCE_OPTION_POWER;
> >> +
> >> + return 0;
> >> +}
> >> +
> >>
> >>
> >> I understand that this is only meant to kick off discussions and other
> >> criteria besides power being plugged in to bias the scheduler
> >> performance could be added later. But does it make sense to hardcode the
> >> power supply assumption into the scheduler?
> >>
> >> Can't we instead make sched_balance_option_update() a function pointer
> >> (with a default implementation that you've provided) that provide
> >> platforms the ability to override that with their own implementation?
> >
> >
> > I agree if that really hurts, it could be placed somewhere else, for example in a new file:
> > kernel/sched/energy.c
> >
> > But concerning the callback, I don't see the point to create a specific platform ops for that as the current code is generic. Do you have any use case in mind ?
> >
>
> I had a offline conversation with Daniel about this since there are
> other triggers - thermal constraints and game-like apps being examples
> - that might want to override the system policy. He intended
> "performance" mode to mean the existing code paths and "power" mode to
> mean *additional* new heuristics for energy-efficiency. The power
> supply assumption is just the first one of those heuristics.

Well, so now the question is whether or not we relly want to always
go to the "power" (or "energy efficiency" if you will) mode if the system
is on battery. That arguably may not be a good thing even for energy
efficiency depending on how exactly the modes are defined.

So in my opinion it's too early to add things like that at this point.

> In that case, "performance" should be the default so we don't change
> existing system behavior. And perhaps we want to keep these energy
> heuristics in a separate file, at least until we've gotten them right.
>
> While we're at it, can we attempt to replace power with energy since
> that is what we're trying to save in most cases?

That would be a good thing IMO.

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-04-25 12:17:51

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On 04/25/2014 01:46 PM, Rafael J. Wysocki wrote:
> On Friday, April 25, 2014 04:24:49 PM Amit Kucheria wrote:
>> On Thu, Apr 24, 2014 at 7:00 PM, Daniel Lezcano
>> <[email protected]> wrote:
>>>
>>> On 04/24/2014 03:13 PM, Amit Kucheria wrote:
>>>>
>>>> On Thu, Apr 24, 2014 at 5:54 PM, Daniel Lezcano
>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>
>>>> This patch adds a sysctl schedule balance option to choose against:
>>>>
>>>> * auto (0)
>>>> * performance (1)
>>>> * power (2)
>>>>
>>>> It relies on the recently added notifier to monitor the power supply
>>>> changes.
>>>> If the scheduler balance option is set to 'auto', then when the
>>>> system switches
>>>> to battery, the balance option change to 'power' and when it goes
>>>> back to AC, it
>>>> switches to 'performance'.
>>>>
>>>> The default value is 'auto'.
>>>>
>>>> If the kernel is compiled without the CONFIG_POWER_SUPPLY option,
>>>> then any call
>>>> to the 'auto' option will fail and the scheduler will use the
>>>> 'performance'
>>>> option as default.
>>>>
>>>> Signed-off-by: Daniel Lezcano <[email protected]
>>>> <mailto:[email protected]>>
>>>>
>>>> ---
>>>> include/linux/sched/sysctl.h | 14 +++++++
>>>> kernel/sched/fair.c | 92
>>>> +++++++++++++++++++++++++++++++++++++++++-
>>>> kernel/sysctl.c | 11 +++++
>>>> 3 files changed, 115 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>>>> index 8045a55..f8507bf 100644
>>>> --- a/include/linux/sched/sysctl.h
>>>> +++ b/include/linux/sched/sysctl.h
>>>> @@ -44,6 +44,20 @@ enum sched_tunable_scaling {
>>>> };
>>>> extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
>>>>
>>>> +#ifdef CONFIG_SMP
>>>> +enum sched_balance_option {
>>>>
>>>>
>>>> What do you think of s/option/bias/g ?
>>>>
>>>> It is essentially biasing the scheduler towards performance or power
>>>
>>>
>>> Yes, could be more adequate.
>>>
>>>
>>>> + SCHED_BALANCE_OPTION_PERFORMANCE,
>>>> + SCHED_BALANCE_OPTION_POWER,
>>>> + SCHED_BALANCE_OPTION_AUTO,
>>>> + SCHED_BALANCE_OPTION_END,
>>>> +};
>>>>
>>>>
>>>> +extern enum sched_balance_option sysctl_sched_balance_option;
>>>> +
>>>> +int sched_proc_balance_option_handler(struct ctl_table *table, int
>>>> write,
>>>> + void __user *buffer, size_t *length,
>>>> + loff_t *ppos);
>>>> +#endif
>>>> +
>>>> extern unsigned int sysctl_numa_balancing_scan_delay;
>>>> extern unsigned int sysctl_numa_balancing_scan_period_min;
>>>> extern unsigned int sysctl_numa_balancing_scan_period_max;
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 7570dd9..7b8e93d 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -29,7 +29,7 @@
>>>> #include <linux/mempolicy.h>
>>>> #include <linux/migrate.h>
>>>> #include <linux/task_work.h>
>>>> -
>>>> +#include <linux/power_supply.h>
>>>> #include <trace/events/sched.h>
>>>>
>>>> #include "sched.h"
>>>> @@ -61,6 +61,24 @@ unsigned int normalized_sysctl_sched_latency =
>>>> 6000000ULL;
>>>> enum sched_tunable_scaling sysctl_sched_tunable_scaling
>>>> = SCHED_TUNABLESCALING_LOG;
>>>>
>>>> +#ifdef CONFIG_SMP
>>>> +/*
>>>> + * Scheduler balancing policy:
>>>> + *
>>>> + * Options are:
>>>> + * SCHED_BALANCE_OPTION_PERFORMANCE - full performance
>>>> + * SCHED_BALANCE_OPTION_POWER - power saving aggressive
>>>> + * SCHED_BALANCE_OPTION_AUTO - switches to 'performance' when plugged
>>>> + * on or 'power' on battery
>>>> + */
>>>> +enum sched_balance_option sysctl_sched_balance_option
>>>> + = SCHED_BALANCE_OPTION_AUTO;
>>>> +
>>>> +static int sched_current_balance_option
>>>> + = SCHED_BALANCE_OPTION_PERFORMANCE;
>>>> +
>>>> +#endif
>>>> +
>>>> /*
>>>> * Minimal preemption granularity for CPU-bound tasks:
>>>> * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
>>>> @@ -555,6 +573,76 @@ static struct sched_entity
>>>> *__pick_next_entity(struct sched_entity *se)
>>>> return rb_entry(next, struct sched_entity, run_node);
>>>> }
>>>>
>>>> +#ifdef CONFIG_SMP
>>>> +static int sched_balance_option_update(void)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + /*
>>>> + * Copy the current balance option
>>>> + */
>>>> + if (sysctl_sched_balance_option != SCHED_BALANCE_OPTION_AUTO) {
>>>> + sched_current_balance_option =
>>>> sysctl_sched_balance_option;
>>>> + return 0;
>>>> + }
>>>> +
>>>> + /*
>>>> + * This call may fail if the kernel is not compiled with
>>>> + * the POWER_SUPPLY option.
>>>> + */
>>>> + ret = power_supply_is_system_supplied();
>>>> + if (ret < 0) {
>>>> + sysctl_sched_balance_option =
>>>> sched_current_balance_option;
>>>> + return ret;
>>>> + }
>>>> +
>>>> + /*
>>>> + * When in 'auto' mode, switch to 'performance if the system
>>>> + * is plugged on the wall, to 'power' if we are on battery
>>>> + */
>>>> + sched_current_balance_option = ret ?
>>>> + SCHED_BALANCE_OPTION_PERFORMANCE :
>>>> + SCHED_BALANCE_OPTION_POWER;
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>>
>>>>
>>>> I understand that this is only meant to kick off discussions and other
>>>> criteria besides power being plugged in to bias the scheduler
>>>> performance could be added later. But does it make sense to hardcode the
>>>> power supply assumption into the scheduler?
>>>>
>>>> Can't we instead make sched_balance_option_update() a function pointer
>>>> (with a default implementation that you've provided) that provide
>>>> platforms the ability to override that with their own implementation?
>>>
>>>
>>> I agree if that really hurts, it could be placed somewhere else, for example in a new file:
>>> kernel/sched/energy.c
>>>
>>> But concerning the callback, I don't see the point to create a specific platform ops for that as the current code is generic. Do you have any use case in mind ?
>>>
>>
>> I had a offline conversation with Daniel about this since there are
>> other triggers - thermal constraints and game-like apps being examples
>> - that might want to override the system policy. He intended
>> "performance" mode to mean the existing code paths and "power" mode to
>> mean *additional* new heuristics for energy-efficiency. The power
>> supply assumption is just the first one of those heuristics.
>
> Well, so now the question is whether or not we relly want to always
> go to the "power" (or "energy efficiency" if you will) mode if the system
> is on battery. That arguably may not be a good thing even for energy
> efficiency depending on how exactly the modes are defined.
>
> So in my opinion it's too early to add things like that at this point.

Ok, the point to solve is to have the old code path (performance) and
the new code path (power) to be selectable from userspace.

It is acceptable to keep the power/performance option for now without
the power supply thing ? And we address the policy to switch from one
mode to another mode with another option later ?


>> In that case, "performance" should be the default so we don't change
>> existing system behavior. And perhaps we want to keep these energy
>> heuristics in a separate file, at least until we've gotten them right.
>>
>> While we're at it, can we attempt to replace power with energy since
>> that is what we're trying to save in most cases?
>
> That would be a good thing IMO.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2014-04-25 13:22:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option


The infinite wisdom of the original PC keyboard gifted us with backspace
and delete, use them!

On Fri, Apr 25, 2014 at 04:24:49PM +0530, Amit Kucheria wrote:
> In that case, "performance" should be the default so we don't change
> existing system behavior. And perhaps we want to keep these energy
> heuristics in a separate file, at least until we've gotten them right.

No, we're not going to make a separate file. Its going to be fully
integrated in the normal load balancer, no special add on hacks and
warts.

2014-04-25 13:22:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Fri, Apr 25, 2014 at 01:46:53PM +0200, Rafael J. Wysocki wrote:

_trim_ emails!!! one of these days I'm going to write a bot to flame
your head of if there's excessive quoting.

> > I had a offline conversation with Daniel about this since there are
> > other triggers - thermal constraints and game-like apps being examples
> > - that might want to override the system policy. He intended
> > "performance" mode to mean the existing code paths and "power" mode to
> > mean *additional* new heuristics for energy-efficiency. The power
> > supply assumption is just the first one of those heuristics.
>
> Well, so now the question is whether or not we relly want to always
> go to the "power" (or "energy efficiency" if you will) mode if the system
> is on battery. That arguably may not be a good thing even for energy
> efficiency depending on how exactly the modes are defined.

Nobody is talking about always. But in general it seems a good enough
approach. Hell, many of the AC/BAT switches in todays power management
crap things are not always right.

Do I want it to dim the LCD further when I unplug the laptop -- mostly
no, but still it does. And the most annoying one is that it reduces the
screen blank time to something near 5 seconds or so.

Why would this be any different? If you know what you want you can turn
the knob.

> So in my opinion it's too early to add things like that at this point.

Meh..

2014-04-25 17:01:29

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On 04/25/2014 03:20 PM, Peter Zijlstra wrote:
> On Fri, Apr 25, 2014 at 01:46:53PM +0200, Rafael J. Wysocki wrote:
>
> _trim_ emails!!! one of these days I'm going to write a bot to flame
> your head of if there's excessive quoting.
>
>>> I had a offline conversation with Daniel about this since there are
>>> other triggers - thermal constraints and game-like apps being examples
>>> - that might want to override the system policy. He intended
>>> "performance" mode to mean the existing code paths and "power" mode to
>>> mean *additional* new heuristics for energy-efficiency. The power
>>> supply assumption is just the first one of those heuristics.
>>
>> Well, so now the question is whether or not we relly want to always
>> go to the "power" (or "energy efficiency" if you will) mode if the system
>> is on battery. That arguably may not be a good thing even for energy
>> efficiency depending on how exactly the modes are defined.
>
> Nobody is talking about always. But in general it seems a good enough
> approach. Hell, many of the AC/BAT switches in todays power management
> crap things are not always right.
>
> Do I want it to dim the LCD further when I unplug the laptop -- mostly
> no, but still it does. And the most annoying one is that it reduces the
> screen blank time to something near 5 seconds or so.
>
> Why would this be any different? If you know what you want you can turn
> the knob.
>
>> So in my opinion it's too early to add things like that at this point.
>
> Meh..
>

Peter,

I assume the patchset is correct for you (modulo the few comments about
code), right ?

As the sysctl is some kind of ABI, I would like to make sure we reach a
consensus and discuss a bit about that.

I understand Rafael and Amit could be reluctant with the power supply to
be hardcoded. As mentioned Amit, we may want to switch to 'power' if the
thermal framework tells us we are reaching a threshold. For example, the
system is set to 'auto', we are on battery, thus the current policy is
'power', we plug the device, the current policy will switch to
'performance' but the thermal framework may want to do 'power' because
of some thermal constraint.

IMO, the power supply part could be extended to take into account the
thermal.

The other point is, I guess, what should do the 'power' policy ? pack
tasks or not ? Override the cpuidle governor and choose the deeper idle
states or not ? etc ... So in other words, how can we put a cursor
between 'performance' and 'power'. Knowing on some platform one may be
more efficient than another platform.

IIUC, 'performance' for you means 'max performance' and 'power' means
'max power', and a list of sysctl for each power/performance option give
us the knobs to tune one or another policy ? (Each platform defining its
own options)

Would this approach be ok ?

Thanks

-- Daniel







--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2014-04-25 18:43:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Fri, Apr 25, 2014 at 07:01:23PM +0200, Daniel Lezcano wrote:
> As the sysctl is some kind of ABI, I would like to make sure we reach a
> consensus and discuss a bit about that.

We could make it a sysfs file, like /sys/power/state, which when read
provides the words it takes.

That is more flexible than a numeric sysctl for which we have to keep an
enumeration.

2014-04-26 00:02:37

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Friday, April 25, 2014 03:20:55 PM Peter Zijlstra wrote:
> On Fri, Apr 25, 2014 at 01:46:53PM +0200, Rafael J. Wysocki wrote:
>
> _trim_ emails!!! one of these days I'm going to write a bot to flame
> your head of if there's excessive quoting.

Words ...

> >
> > Well, so now the question is whether or not we relly want to always
> > go to the "power" (or "energy efficiency" if you will) mode if the system
> > is on battery. That arguably may not be a good thing even for energy
> > efficiency depending on how exactly the modes are defined.
>
> Nobody is talking about always. But in general it seems a good enough
> approach. Hell, many of the AC/BAT switches in todays power management
> crap things are not always right.
>
> Do I want it to dim the LCD further when I unplug the laptop -- mostly
> no, but still it does. And the most annoying one is that it reduces the
> screen blank time to something near 5 seconds or so.
>
> Why would this be any different?

And why do we have to do things that we hate it when they are done by others?

> If you know what you want you can turn the knob.
>
> > So in my opinion it's too early to add things like that at this point.
>
> Meh..

Seriously, I'm not even sure if we really need that stuff to be honest.

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-04-26 06:17:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option


* Rafael J. Wysocki <[email protected]> wrote:

> > > Well, so now the question is whether or not we relly want to
> > > always go to the "power" (or "energy efficiency" if you will)
> > > mode if the system is on battery. That arguably may not be a
> > > good thing even for energy efficiency depending on how exactly
> > > the modes are defined.
> >
> > Nobody is talking about always. But in general it seems a good
> > enough approach. Hell, many of the AC/BAT switches in todays power
> > management crap things are not always right.
> >
> > Do I want it to dim the LCD further when I unplug the laptop --
> > mostly no, but still it does. And the most annoying one is that it
> > reduces the screen blank time to something near 5 seconds or so.
> >
> > Why would this be any different?
>
> And why do we have to do things that we hate it when they are done
> by others?

He replied to your question of 'do we want to act on power events'.

The answer is: yes, from the scheduler point of view we want to act on
power events by default, and if a user does not want that default
behavior, it's not an unprecedented request and GUIs offer various
ways to tweak screen dimming and other power saving details.

So "trying to save power" is the default everywhere, and the scheduler
wants to do the same. The main reason we couldn't do this before was
that the scheduler's 'power saving mode' was dysfunctional. That is
being fixed.

So lets try this, it's high time.

Thanks,

Ingo

2014-04-26 10:55:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Sat, Apr 26, 2014 at 02:18:58AM +0200, Rafael J. Wysocki wrote:
> And why do we have to do things that we hate it when they are done by others?

Well, I'm assuming that these things that I dislike are actually useful
and liked by the majority (me not fitting that is nothing new). If
everybody hates these things, WTF are they still the default on pretty
much every OS out there, Windows and Mac OS do the same screen dimming
things when they loose AC last time I touched one of them.

Also, whilst I dislike screen dimming with a passion, I wouldn't
actually dislike the power aware balancer (assuming it would actually
work). Then I'm on BAT I tend to not actually do much more than read
email/browse web and maybe write some code. But I rarely if ever compile
when on battery.

2014-04-27 13:07:00

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Saturday, April 26, 2014 08:17:44 AM Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > > > Well, so now the question is whether or not we relly want to
> > > > always go to the "power" (or "energy efficiency" if you will)
> > > > mode if the system is on battery. That arguably may not be a
> > > > good thing even for energy efficiency depending on how exactly
> > > > the modes are defined.
> > >
> > > Nobody is talking about always. But in general it seems a good
> > > enough approach. Hell, many of the AC/BAT switches in todays power
> > > management crap things are not always right.
> > >
> > > Do I want it to dim the LCD further when I unplug the laptop --
> > > mostly no, but still it does. And the most annoying one is that it
> > > reduces the screen blank time to something near 5 seconds or so.
> > >
> > > Why would this be any different?
> >
> > And why do we have to do things that we hate it when they are done
> > by others?
>
> He replied to your question of 'do we want to act on power events'.
>
> The answer is: yes, from the scheduler point of view we want to act on
> power events by default, and if a user does not want that default
> behavior, it's not an unprecedented request and GUIs offer various
> ways to tweak screen dimming and other power saving details.
>
> So "trying to save power" is the default everywhere, and the scheduler
> wants to do the same. The main reason we couldn't do this before was
> that the scheduler's 'power saving mode' was dysfunctional. That is
> being fixed.
>
> So lets try this, it's high time.

Thanks for stating your perspective clearly, this is good to know.

Still, I have a rather fundamental problem with the notion that performance
and energy efficiency are essentially at odds with each other, because quite
often they aren't. What is good for performance is often good for energy
efficiency as well (like when it may be better to do a certain amount of work
quickly and then go idle instead of slowing down and taking more time to do
the work, because that time may turn out to be so long that energy used for
doing the work will actually be greater). So while I'm all for acting on power
events, I'm also for being rather careful with that.

Moreover, in fact performance and energy efficiency are not the only factors
that need to be taken into account, there also is the maximum power draw (ie.
the rate of using energy) that can be sustained. Arguably, all three of them
always matter, regardless of whether or not the system is on battery.

People often mentally translate energy efficiency into battery life, because
that's probably the most obvious case, but in fact it generally translates into
the cost of doing work (in therms of money or other things). Going to battery
means something like "energy is now more valuable to me", but there may be other
events meaning the same (for example, energy may be cheaper in the night etc).
So it looks like we need a global mechanism to represent the current relative
value of energy to the user and to update it whenever something like "we're on
battery now" happens (other subsystems would benefit from that too IMO).

The same applies to the maximum sustainable power draw limit more or less.
While it generally changes between "on battery" and "on AC power", it may also
depend on other factors like the capacity of the available power supply for a
data center.

Thanks!

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-04-27 13:38:09

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Saturday, April 26, 2014 12:55:37 PM Peter Zijlstra wrote:
> On Sat, Apr 26, 2014 at 02:18:58AM +0200, Rafael J. Wysocki wrote:
> > And why do we have to do things that we hate it when they are done by others?
>
> Well, I'm assuming that these things that I dislike are actually useful
> and liked by the majority (me not fitting that is nothing new). If
> everybody hates these things, WTF are they still the default on pretty
> much every OS out there, Windows and Mac OS do the same screen dimming
> things when they loose AC last time I touched one of them.

Well, vendors have certain battery life targets for their products (you
know, something for the marketing people to brag about) and tend to be
quite aggressive about doing whatever they can to reach them.

We don't have such targets for the mainline kernel, though, do we? ;-)

> Also, whilst I dislike screen dimming with a passion, I wouldn't
> actually dislike the power aware balancer (assuming it would actually
> work). Then I'm on BAT I tend to not actually do much more than read
> email/browse web and maybe write some code. But I rarely if ever compile
> when on battery.

So the relative value of energy grows for you when going from AC to battery.
That may happen in other situations too (like when the cost of energy in
terms of money depends on the time of day which may be the case in some
geographies etc.), so "battery vs AC" is not really special.

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-04-27 19:39:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Sun, Apr 27, 2014 at 03:54:32PM +0200, Rafael J. Wysocki wrote:
> On Saturday, April 26, 2014 12:55:37 PM Peter Zijlstra wrote:
> > On Sat, Apr 26, 2014 at 02:18:58AM +0200, Rafael J. Wysocki wrote:
> > > And why do we have to do things that we hate it when they are done by others?
> >
> > Well, I'm assuming that these things that I dislike are actually useful
> > and liked by the majority (me not fitting that is nothing new). If
> > everybody hates these things, WTF are they still the default on pretty
> > much every OS out there, Windows and Mac OS do the same screen dimming
> > things when they loose AC last time I touched one of them.
>
> Well, vendors have certain battery life targets for their products (you
> know, something for the marketing people to brag about) and tend to be
> quite aggressive about doing whatever they can to reach them.

Well, I was just hoping the industry wasn't _that_ rotten and there was
actual a majority of people liking this.

> We don't have such targets for the mainline kernel, though, do we? ;-)

No, for sure :-)

2014-04-28 10:09:25

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On 04/25/2014 08:43 PM, Peter Zijlstra wrote:
> On Fri, Apr 25, 2014 at 07:01:23PM +0200, Daniel Lezcano wrote:
>> As the sysctl is some kind of ABI, I would like to make sure we reach a
>> consensus and discuss a bit about that.
>
> We could make it a sysfs file, like /sys/power/state, which when read
> provides the words it takes.
>
> That is more flexible than a numeric sysctl for which we have to keep an
> enumeration.

I agree a numerical value is not flexible. But it sounds weird to put a
scheduler option in the sysfs and maybe more options will follow.

I am wondering if we shouldn't create a new cgroup for 'energy' and put
everything in there. So we will have more flexibility for extension and
we will be able to create a group of tasks for performance and a group
of tasks for energy saving.

Does it make sense ?

--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2014-04-28 10:28:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Mon, Apr 28, 2014 at 12:09:20PM +0200, Daniel Lezcano wrote:
> I agree a numerical value is not flexible. But it sounds weird to put a
> scheduler option in the sysfs and maybe more options will follow.
>
> I am wondering if we shouldn't create a new cgroup for 'energy' and put
> everything in there. So we will have more flexibility for extension and we
> will be able to create a group of tasks for performance and a group of tasks
> for energy saving.
>
> Does it make sense ?

The old knobs used to live here:

-What: /sys/devices/system/cpu/sched_mc_power_savings
- /sys/devices/system/cpu/sched_smt_power_savings

Not entirely sure that's a fine place, but it has precedent.

2014-04-28 11:07:35

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On 04/28/2014 12:28 PM, Peter Zijlstra wrote:
> On Mon, Apr 28, 2014 at 12:09:20PM +0200, Daniel Lezcano wrote:
>> I agree a numerical value is not flexible. But it sounds weird to put a
>> scheduler option in the sysfs and maybe more options will follow.
>>
>> I am wondering if we shouldn't create a new cgroup for 'energy' and put
>> everything in there. So we will have more flexibility for extension and we
>> will be able to create a group of tasks for performance and a group of tasks
>> for energy saving.
>>
>> Does it make sense ?
>
> The old knobs used to live here:
>
> -What: /sys/devices/system/cpu/sched_mc_power_savings
> - /sys/devices/system/cpu/sched_smt_power_savings

Ah right.

> Not entirely sure that's a fine place, but it has precedent.

I share your doubts about the right place.

I'm really wondering if the cgroup couldn't be a good solution:

Amit pointed the conflict about the power vs performance with some
applications. We want to have for example a game to run fast performance
and some other application to save power.

The cgroup will allow to:

* eg. create a couple of cgroup one for performance and the other one
for power and assign the different applications to one of these group

* tweak the options just for a group of processes. Depending of the
behavior of the task, the userspace can create and change some options
for a specific application for optimum performance or energy saving

* add more energy options in a place easy to extend

* use string based options

* alternatively use a single configuration for the entire system

* use the event based file to trigger event about power consumption
(threshold reached) and from there we can easily freeze the group of
processes consuming too much energy

The cgroup will provide a highly configurable mechanism based on tasks
and could use with some other cgroup subsys (eg. to force the tasks on a
set of cpus).

IMHO, the cgroup is a good place.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2014-04-28 11:21:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Mon, Apr 28, 2014 at 01:07:31PM +0200, Daniel Lezcano wrote:
> I'm really wondering if the cgroup couldn't be a good solution:

most my kernels don't have no stinking cgroups ;-)

2014-04-28 22:55:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Monday, April 28, 2014 01:07:31 PM Daniel Lezcano wrote:
> On 04/28/2014 12:28 PM, Peter Zijlstra wrote:
> > On Mon, Apr 28, 2014 at 12:09:20PM +0200, Daniel Lezcano wrote:
> >> I agree a numerical value is not flexible. But it sounds weird to put a
> >> scheduler option in the sysfs and maybe more options will follow.
> >>
> >> I am wondering if we shouldn't create a new cgroup for 'energy' and put
> >> everything in there. So we will have more flexibility for extension and we
> >> will be able to create a group of tasks for performance and a group of tasks
> >> for energy saving.
> >>
> >> Does it make sense ?
> >
> > The old knobs used to live here:
> >
> > -What: /sys/devices/system/cpu/sched_mc_power_savings
> > - /sys/devices/system/cpu/sched_smt_power_savings
>
> Ah right.
>
> > Not entirely sure that's a fine place, but it has precedent.
>
> I share your doubts about the right place.
>
> I'm really wondering if the cgroup couldn't be a good solution:
>
> Amit pointed the conflict about the power vs performance with some
> applications. We want to have for example a game to run fast performance
> and some other application to save power.

You can't save power.

Power is the energy flow *rate*. It's like speed, so how can you save it?

If you talk about saving in this context, please always talk about energy as
well, because that's what we want to save.

This means that positioning power against performance doesn't make any sense
whatsoever. You could try to position energy efficiency (that is, the relative
cost of doing work in terms of energy) against performance, but even that is
questionable, because, as I said in one of the previous messages, what is good
for performance is often good for energy efficiency too (think about race to
idle for example).

In other words, you want to have a knob whose both ends may happen to mean
the same thing. Wouldn't that be a little odd?

In my opinion it would be much better to have a knob representing the current
relative value of energy to the user (which may depend on things like whether
or not the system is on battery etc) and meaning how far we need to go with
energy saving efforts.

So if that knob is 0, we'll do things that are known-good for performance.
If it is 1, we'll do some extra effort to save enery as well possibly at
a small expense of performance if that's necessary. If it is 100, we'll do
all we can to save as much energy as possible without caring about performance
at all.

And it doesn't even have to be scheduler-specific, it very well may be global.

Thanks!


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-04-29 09:26:09

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Mon, Apr 28, 2014 at 12:07:31PM +0100, Daniel Lezcano wrote:
> I'm really wondering if the cgroup couldn't be a good solution:
>
> Amit pointed the conflict about the power vs performance with some
> applications. We want to have for example a game to run fast performance
> and some other application to save power.
>
> The cgroup will allow to:
>
> * eg. create a couple of cgroup one for performance and the other one
> for power and assign the different applications to one of these group

If you mean using the cgroups similar to how Android has foreground and
background groups, I think cgroups is a good way to distinguish between
tasks with different performance priorities.

We can't optimize for both energy and performance at the same time, but
it would allow us to only deliver high performance when a high
performance task is runnable. It would give us more fine-grained control
than either energy or performance for all tasks. If necessary, it would
also be possible to have more than two cgroups.

2014-04-29 10:00:25

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Tue, Apr 29, 2014 at 12:11:47AM +0100, Rafael J. Wysocki wrote:
> On Monday, April 28, 2014 01:07:31 PM Daniel Lezcano wrote:

[...]

> > I'm really wondering if the cgroup couldn't be a good solution:
> >
> > Amit pointed the conflict about the power vs performance with some
> > applications. We want to have for example a game to run fast performance
> > and some other application to save power.
>
> You can't save power.
>
> Power is the energy flow *rate*. It's like speed, so how can you save it?
>
> If you talk about saving in this context, please always talk about energy as
> well, because that's what we want to save.
>
> This means that positioning power against performance doesn't make any sense
> whatsoever. You could try to position energy efficiency (that is, the relative
> cost of doing work in terms of energy) against performance, but even that is
> questionable, because, as I said in one of the previous messages, what is good
> for performance is often good for energy efficiency too (think about race to
> idle for example).
>
> In other words, you want to have a knob whose both ends may happen to mean
> the same thing. Wouldn't that be a little odd?
>
> In my opinion it would be much better to have a knob representing the current
> relative value of energy to the user (which may depend on things like whether
> or not the system is on battery etc) and meaning how far we need to go with
> energy saving efforts.

I agree, more or less. I think of performance and energy awareness as
optimization objectives. If we only care about performance we are
dealing with a single-objective optimization problem. We do what it
takes to get the best possible performance without considering energy at
all. Energy awareness is IMO as multi-objective optimization problem. We
want to save energy, but do so without sacrificing too much performance.
What "too much" is depends on the scenario (application).

Energy efficiency and performance are different objectives, but I agree
that the means to get better energy efficiency might be the same as
those to get better performance in some cases. IMO, energy awareness is
not a big switch to turn on specific scheduling mechanics (or other
mechnaics in the kernel) such as task packing and disabling expensive
DVFS states, it is about changing the scheduling optimization objective
to include energy in the scheduling decisions. That is, try to save
energy for the current scenario. That includes race to idle if that is
the best way for the particular application and platform/system
topology.

>
> So if that knob is 0, we'll do things that are known-good for performance.
> If it is 1, we'll do some extra effort to save enery as well possibly at
> a small expense of performance if that's necessary. If it is 100, we'll do
> all we can to save as much energy as possible without caring about performance
> at all.
>
> And it doesn't even have to be scheduler-specific, it very well may be global.

I like the idea of knob representing the weight of energy awareness.
Whether it should be global, per task, or per cgroup I'm not sure about.

2014-04-29 10:25:43

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On 04/29/2014 01:11 AM, Rafael J. Wysocki wrote:
> On Monday, April 28, 2014 01:07:31 PM Daniel Lezcano wrote:
>> On 04/28/2014 12:28 PM, Peter Zijlstra wrote:
>>> On Mon, Apr 28, 2014 at 12:09:20PM +0200, Daniel Lezcano wrote:
>>>> I agree a numerical value is not flexible. But it sounds weird to put a
>>>> scheduler option in the sysfs and maybe more options will follow.
>>>>
>>>> I am wondering if we shouldn't create a new cgroup for 'energy' and put
>>>> everything in there. So we will have more flexibility for extension and we
>>>> will be able to create a group of tasks for performance and a group of tasks
>>>> for energy saving.
>>>>
>>>> Does it make sense ?
>>>
>>> The old knobs used to live here:
>>>
>>> -What: /sys/devices/system/cpu/sched_mc_power_savings
>>> - /sys/devices/system/cpu/sched_smt_power_savings
>>
>> Ah right.
>>
>>> Not entirely sure that's a fine place, but it has precedent.
>>
>> I share your doubts about the right place.
>>
>> I'm really wondering if the cgroup couldn't be a good solution:
>>
>> Amit pointed the conflict about the power vs performance with some
>> applications. We want to have for example a game to run fast performance
>> and some other application to save power.
>
> You can't save power.
>
> Power is the energy flow *rate*. It's like speed, so how can you save it?
>
> If you talk about saving in this context, please always talk about energy as
> well, because that's what we want to save.

Hi Rafael,

yeah, I think there is an abuse when talking about 'power'. I thought I
took care of talking about energy but 'power' comes always in my mind. I
believe the confusion is coming from the meaning of 'power' in French
where one translation is 'energy'.

Anyway, thanks for the clarification, I will try to use the term
'energy' and 'power' conveniently next time.

> This means that positioning power against performance doesn't make any sense
> whatsoever. You could try to position energy efficiency (that is, the relative
> cost of doing work in terms of energy) against performance, but even that is
> questionable, because, as I said in one of the previous messages, what is good
> for performance is often good for energy efficiency too (think about race to
> idle for example).
> In other words, you want to have a knob whose both ends may happen to mean
> the same thing. Wouldn't that be a little odd?

Yes, I share this point of view. I believe we won't care about adding
any knobs for this situation.

> In my opinion it would be much better to have a knob representing the current
> relative value of energy to the user (which may depend on things like whether
> or not the system is on battery etc) and meaning how far we need to go with
> energy saving efforts.
>
> So if that knob is 0, we'll do things that are known-good for performance.
> If it is 1, we'll do some extra effort to save enery as well possibly at
> a small expense of performance if that's necessary. If it is 100, we'll do
> all we can to save as much energy as possible without caring about performance
> at all.
>
> And it doesn't even have to be scheduler-specific, it very well may be global.

That would be very nice but I don't see how we can quantify this energy
and handle that generically from the kernel for all the hardware.

I am pretty sure we will discover for some kind of hardware a specific
option will consume more power, argh ! energy I mean, than another
hardware because of the architecture.

From my personal experience, when we are facing this kind of complexity
and heuristic, it is the sign the userspace has some work to do.

What I am proposing is not in contradiction with your approach, it is
about exporting a lot of knobs to userspace, and the userspace decide
how to map what is '0' <--> '100' regarding these options. Nothing
prevent the different platform to set a default value for these options.

From my POV, the cgroup could be a good solution for that for different
reasons. Especially one good reason is we can stick the energy policy
per task instead of the entire system.

Let's imagine the following scenario:

An user has a laptop running a mailer looking for the email every 5
minutes. The system switched to 'power'. The user wants to play a video
game but due to the 'power' policy, the game is not playable so it
forces the policy to 'performance'. All the tasks will use the
'performance' policy, thus consuming more energy.

If we do per task, the video game will use the 'performance' policy and
the other tasks on the system will use the 'power' policy. The userspace
can take the decision to freeze the application running 'performance' if
we reach a critical battery level.

The cgroup is a good framework to do that and gives a lot of flexibility
to userspace. I understood Peter does not like the cgroup but I did not
give up to convince him, the cgroup could be good solution :)

Looking forward, if the energy policy is tied with the task, in the
future we can normalize the energy consumption and stick to an 'energy
load' per task and reuse the load tracking for energy, do per task
energy accounting, nice per energy, etc ...

Going back to reality, concretely this sysctl patch did not reach a
consensus. So I will resend the two other patches, hoping the discussion
will lead to an agreement.



--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2014-04-29 10:49:57

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Sun, Apr 27, 2014 at 02:23:24PM +0100, Rafael J. Wysocki wrote:
> Still, I have a rather fundamental problem with the notion that performance
> and energy efficiency are essentially at odds with each other, because quite
> often they aren't. What is good for performance is often good for energy
> efficiency as well (like when it may be better to do a certain amount of work
> quickly and then go idle instead of slowing down and taking more time to do
> the work, because that time may turn out to be so long that energy used for
> doing the work will actually be greater). So while I'm all for acting on power
> events, I'm also for being rather careful with that.

As I wrote in my other reply in this thread, IMO we should distinguish
between objectives and methods to optimize for objectives. Performance
and energy efficiency are different objectives, but some methods might
be useful for optimizing for both (such as race to idle in some
scenarios on some platforms). Tuning knobs and power events shouldn't be
assumed to directly control specific scheduling decisions, for example
whether to race to idle or not. They should only mean that the scheduler
should optimize for a specific goal. For example, a specific
energy/performance trade-off.

> Moreover, in fact performance and energy efficiency are not the only factors
> that need to be taken into account, there also is the maximum power draw (ie.
> the rate of using energy) that can be sustained. Arguably, all three of them
> always matter, regardless of whether or not the system is on battery.
>
> People often mentally translate energy efficiency into battery life, because
> that's probably the most obvious case, but in fact it generally translates into
> the cost of doing work (in therms of money or other things). Going to battery
> means something like "energy is now more valuable to me", but there may be other
> events meaning the same (for example, energy may be cheaper in the night etc).
> So it looks like we need a global mechanism to represent the current relative
> value of energy to the user and to update it whenever something like "we're on
> battery now" happens (other subsystems would benefit from that too IMO).
>
> The same applies to the maximum sustainable power draw limit more or less.
> While it generally changes between "on battery" and "on AC power", it may also
> depend on other factors like the capacity of the available power supply for a
> data center.

I haven't given maximum power draw much thought, but it is a third
objective independent of energy efficiency even though methods for
improving energy efficiency (such as disabling expensive DVFS states)
might be useful for this as well.

2014-04-29 21:45:44

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Tuesday, April 29, 2014 11:50:02 AM Morten Rasmussen wrote:
> On Sun, Apr 27, 2014 at 02:23:24PM +0100, Rafael J. Wysocki wrote:

[cut]

> > The same applies to the maximum sustainable power draw limit more or less.
> > While it generally changes between "on battery" and "on AC power", it may also
> > depend on other factors like the capacity of the available power supply for a
> > data center.
>
> I haven't given maximum power draw much thought, but it is a third
> objective independent of energy efficiency even though methods for
> improving energy efficiency (such as disabling expensive DVFS states)
> might be useful for this as well.

Yes, it is an independent objective.

Thanks!

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-04-29 22:03:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2/3] sched: idle: Add sched balance option

On Tuesday, April 29, 2014 11:00:23 AM Morten Rasmussen wrote:
> On Tue, Apr 29, 2014 at 12:11:47AM +0100, Rafael J. Wysocki wrote:
> > On Monday, April 28, 2014 01:07:31 PM Daniel Lezcano wrote:
>
> [...]
>
> > > I'm really wondering if the cgroup couldn't be a good solution:
> > >
> > > Amit pointed the conflict about the power vs performance with some
> > > applications. We want to have for example a game to run fast performance
> > > and some other application to save power.
> >
> > You can't save power.
> >
> > Power is the energy flow *rate*. It's like speed, so how can you save it?
> >
> > If you talk about saving in this context, please always talk about energy as
> > well, because that's what we want to save.
> >
> > This means that positioning power against performance doesn't make any sense
> > whatsoever. You could try to position energy efficiency (that is, the relative
> > cost of doing work in terms of energy) against performance, but even that is
> > questionable, because, as I said in one of the previous messages, what is good
> > for performance is often good for energy efficiency too (think about race to
> > idle for example).
> >
> > In other words, you want to have a knob whose both ends may happen to mean
> > the same thing. Wouldn't that be a little odd?
> >
> > In my opinion it would be much better to have a knob representing the current
> > relative value of energy to the user (which may depend on things like whether
> > or not the system is on battery etc) and meaning how far we need to go with
> > energy saving efforts.
>
> I agree, more or less. I think of performance and energy awareness as
> optimization objectives. If we only care about performance we are
> dealing with a single-objective optimization problem. We do what it
> takes to get the best possible performance without considering energy at
> all. Energy awareness is IMO as multi-objective optimization problem. We
> want to save energy, but do so without sacrificing too much performance.
> What "too much" is depends on the scenario (application).
>
> Energy efficiency and performance are different objectives, but I agree
> that the means to get better energy efficiency might be the same as
> those to get better performance in some cases. IMO, energy awareness is
> not a big switch to turn on specific scheduling mechanics (or other
> mechnaics in the kernel) such as task packing and disabling expensive
> DVFS states, it is about changing the scheduling optimization objective
> to include energy in the scheduling decisions.

Very well stated.

Also it is about to what extent those decisions should take energy into
account.

> That is, try to save energy for the current scenario. That includes race to
> idle if that is the best way for the particular application and platform/system
> topology.
>
> >
> > So if that knob is 0, we'll do things that are known-good for performance.
> > If it is 1, we'll do some extra effort to save enery as well possibly at
> > a small expense of performance if that's necessary. If it is 100, we'll do
> > all we can to save as much energy as possible without caring about performance
> > at all.
> >
> > And it doesn't even have to be scheduler-specific, it very well may be global.
>
> I like the idea of knob representing the weight of energy awareness.
> Whether it should be global, per task, or per cgroup I'm not sure about.

A global knob might be useful to other parts of the kernel too, like device
drivers implementing runtime PM, for example. In that context per task
or per cgroup knobs wouldn't be very useful.

Also there may be a user space component and a kernel component in it,
combined somehow, so that user space can influence its value and so that
the kernel can adjust it in response to power events automatically.

Thanks!

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-04-30 17:39:34

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: idle: Encapsulate the code to compile it out

On Thu, 24 Apr 2014, Daniel Lezcano wrote:

> Encapsulate the large portion of cpuidle_idle_call inside another
> function so when CONFIG_CPU_IDLE=n, the code will be compiled out.
> Also that is benefitial for the clarity of the code as it removes
> a nested indentation level.

I agree with the nesting level concern.

However I dislike the proliferation of #ifdef's in the main code. Those
added in this patch are unnecessary. We want as much compilation
coverage as possible.

With cpuidle_enabled() already hardcoded to return -ENODEV when
CONFIG_CPU_IDLE=n, the compiler should already be smart enough to
optimize away all the redundant code in that case. You may look at the
assembly output from the compiler if you're not sure.

And if you're still not trusting the compiler then the second best
option is to use IS_ENABLED(CONFIG_CPU_IDLE) inside some if statement.


Nicolas