Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of these differences in CPU
capacity.
In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.
The definition of task.requirement is dependent on the scheduling class.
For CFS, utilization is used to select a CPU that has a capacity
value >= the cfs_task.util.
capacity_orig_of(cpu) >= cfs_task.util
DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a manner similar to how CFS uses
utilization.
The following patchset implements that:
https://lore.kernel.org/lkml/[email protected]/
capacity_orig_of(cpu)/SCHED_CAPACITY_SCALE >= dl_runtime/dl_deadline
For RT we don't have a per-task utilization signal and we lack any
information in general about what performance the RT task requires.
But with the introduction of uclamp, RT tasks can now control that by
setting uclamp_min to guarantee a minimum performance point.
ATM the uclamp values are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min, which is what is implemented here.
capacity_orig_of(cpu) >= rt_task.uclamp_min
Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which makes for a better, more
predictable behavior in the default case.
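For reference, a minimal (illustrative, untested) user-space sketch of how an
RT task could request a specific minimum performance point via sched_setattr().
The struct layout and the SCHED_FLAG_UTIL_CLAMP_MIN value below follow the
uclamp UAPI headers; the requested value of 512 and the FIFO priority are
arbitrary examples:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copy of the uclamp-aware sched_attr layout (v5.3+ UAPI) */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_FIFO,
		.sched_priority	= 10,
		/* ask for at least ~half of the biggest CPU's capacity */
		.sched_flags	= SCHED_FLAG_UTIL_CLAMP_MIN,
		.sched_util_min	= 512,
	};

	/*
	 * pid 0 == current task. There is no glibc wrapper for this syscall,
	 * and becoming SCHED_FIFO needs CAP_SYS_NICE or a suitable
	 * RLIMIT_RTPRIO.
	 */
	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	return 0;
}

With such a task, the check above reduces to capacity_orig_of(cpu) >= 512, so
any CPU that meets that capacity is considered a fit.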
I must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks, we can't guarantee that a new RT task will be placed
there.
On non-heterogeneous systems the original behavior of RT should be
retained. The same applies if uclamp is not selected in the config.
Signed-off-by: Qais Yousef <[email protected]>
---
Changes in v2:
- Use cpupri_find() to check the fitness of the task instead of
sprinkling find_lowest_rq() with several checks of
rt_task_fits_capacity().
The selected implementation opted to pass the fitness function as an
argument rather than have cpupri_find() call rt_task_fits_capacity()
directly, which is cleaner and keeps the logical separation of the two
modules; but it means the compiler has less room to optimize
rt_task_fits_capacity() out when it's a constant value.
The logic is not perfect. For example, if a 'small' task is occupying a big CPU
and another big task wakes up, we won't force-migrate the small task to clear
the big CPU for the big task that woke up.
IOW, the logic is best effort and can't give hard guarantees. But it improves
the current situation where a task can randomly end up on any CPU regardless of
what it needs. I.e. without this patch an RT task can wake up on a big or small
CPU, but with this it will always wake up on a big CPU (assuming the big CPUs
aren't overloaded) - hence providing consistent performance.
I'm looking at ways to improve this best effort, but this patch should be
a good start to discuss our Capacity Awareness requirement. There's a trade-off
of complexity to be made here and I'd like to keep things as simple as
possible and build on top as needed.
kernel/sched/cpupri.c | 23 ++++++++++--
kernel/sched/cpupri.h | 4 ++-
kernel/sched/rt.c | 81 +++++++++++++++++++++++++++++++++++--------
3 files changed, 91 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index b7abca987d94..799791c01d60 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -57,7 +57,8 @@ static int convert_prio(int prio)
* Return: (int)bool - CPUs were found
*/
int cpupri_find(struct cpupri *cp, struct task_struct *p,
- struct cpumask *lowest_mask)
+ struct cpumask *lowest_mask,
+ bool (*fitness_fn)(struct task_struct *p, int cpu))
{
int idx = 0;
int task_pri = convert_prio(p->prio);
@@ -98,6 +99,8 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
continue;
if (lowest_mask) {
+ int cpu;
+
cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
/*
@@ -108,7 +111,23 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
* condition, simply act as though we never hit this
* priority level and continue on.
*/
- if (cpumask_any(lowest_mask) >= nr_cpu_ids)
+ if (cpumask_empty(lowest_mask))
+ continue;
+
+ if (!fitness_fn)
+ return 1;
+
+ /* Ensure the capacity of the CPUs fit the task */
+ for_each_cpu(cpu, lowest_mask) {
+ if (!fitness_fn(p, cpu))
+ cpumask_clear_cpu(cpu, lowest_mask);
+ }
+
+ /*
+ * If no CPU at the current priority can fit the task
+ * continue looking
+ */
+ if (cpumask_empty(lowest_mask))
continue;
}
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index 7dc20a3232e7..32dd520db11f 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -18,7 +18,9 @@ struct cpupri {
};
#ifdef CONFIG_SMP
-int cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask);
+int cpupri_find(struct cpupri *cp, struct task_struct *p,
+ struct cpumask *lowest_mask,
+ bool (*fitness_fn)(struct task_struct *p, int cpu));
void cpupri_set(struct cpupri *cp, int cpu, int pri);
int cpupri_init(struct cpupri *cp);
void cpupri_cleanup(struct cpupri *cp);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ebaa4e619684..3a68054e15b3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -437,6 +437,45 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
return rt_se->on_rq;
}
+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the uclamp
+ * settings.
+ *
+ * This check is only important for heterogeneous systems where uclamp_min value
+ * is higher than the capacity of a @cpu. For non-heterogeneous system this
+ * function will always return true.
+ *
+ * The function will return true if the capacity of the @cpu is >= the
+ * uclamp_min and false otherwise.
+ *
+ * Note that uclamp_min will be clamped to uclamp_max if uclamp_min
+ * > uclamp_max.
+ */
+inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned int min_cap;
+ unsigned int max_cap;
+ unsigned int cpu_cap;
+
+ /* Only heterogeneous systems can benefit from this check */
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return true;
+
+ min_cap = uclamp_eff_value(p, UCLAMP_MIN);
+ max_cap = uclamp_eff_value(p, UCLAMP_MAX);
+
+ cpu_cap = capacity_orig_of(cpu);
+
+ return cpu_cap >= min(min_cap, max_cap);
+}
+#else
+static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ return true;
+}
+#endif
+
#ifdef CONFIG_RT_GROUP_SCHED
static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
@@ -1391,6 +1430,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
struct rq *rq;
+ bool test;
/* For anything but wake ups, just return the task_cpu */
if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
@@ -1422,10 +1462,16 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
*
* This test is optimistic, if we get it wrong the load-balancer
* will have to sort it out.
+ *
+ * We take into account the capacity of the cpu to ensure it fits the
+ * requirement of the task - which is only important on heterogeneous
+ * systems like big.LITTLE.
*/
- if (curr && unlikely(rt_task(curr)) &&
- (curr->nr_cpus_allowed < 2 ||
- curr->prio <= p->prio)) {
+ test = curr &&
+ unlikely(rt_task(curr)) &&
+ (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+
+ if (test || !rt_task_fits_capacity(p, cpu)) {
int target = find_lowest_rq(p);
/*
@@ -1449,7 +1495,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+ !cpupri_find(&rq->rd->cpupri, rq->curr, NULL, NULL))
return;
/*
@@ -1457,7 +1503,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
* see if it is pushed or pulled somewhere else.
*/
if (p->nr_cpus_allowed != 1
- && cpupri_find(&rq->rd->cpupri, p, NULL))
+ && cpupri_find(&rq->rd->cpupri, p, NULL, NULL))
return;
/*
@@ -1600,7 +1646,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_fla
static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
{
if (!task_running(rq, p) &&
- cpumask_test_cpu(cpu, p->cpus_ptr))
+ cpumask_test_cpu(cpu, p->cpus_ptr) &&
+ rt_task_fits_capacity(p, cpu))
return 1;
return 0;
@@ -1642,7 +1689,8 @@ static int find_lowest_rq(struct task_struct *task)
if (task->nr_cpus_allowed == 1)
return -1; /* No other targets possible */
- if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
+ if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask,
+ rt_task_fits_capacity))
return -1; /* No targets found */
/*
@@ -2146,12 +2194,14 @@ static void pull_rt_task(struct rq *this_rq)
*/
static void task_woken_rt(struct rq *rq, struct task_struct *p)
{
- if (!task_running(rq, p) &&
- !test_tsk_need_resched(rq->curr) &&
- p->nr_cpus_allowed > 1 &&
- (dl_task(rq->curr) || rt_task(rq->curr)) &&
- (rq->curr->nr_cpus_allowed < 2 ||
- rq->curr->prio <= p->prio))
+ bool need_to_push = !task_running(rq, p) &&
+ !test_tsk_need_resched(rq->curr) &&
+ p->nr_cpus_allowed > 1 &&
+ (dl_task(rq->curr) || rt_task(rq->curr)) &&
+ (rq->curr->nr_cpus_allowed < 2 ||
+ rq->curr->prio <= p->prio);
+
+ if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq)))
push_rt_tasks(rq);
}
@@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
*/
if (task_on_rq_queued(p) && rq->curr != p) {
#ifdef CONFIG_SMP
- if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
+ bool need_to_push = rq->rt.overloaded ||
+ !rt_task_fits_capacity(p, cpu_of(rq));
+
+ if (p->nr_cpus_allowed > 1 && need_to_push)
rt_queue_push_tasks(rq);
#endif /* CONFIG_SMP */
if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
--
2.17.1
Adding some Android folks who might be interested.
Steven/Peter, in case this has dropped off your queue; it'd be great to get
some feedback when you get a chance to look at it.
Thanks
--
Qais Yousef
On 10/09/19 11:46, Qais Yousef wrote:
> [ ... v2 changelog and patch quoted in full; snipped ... ]
On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> [ ... changelog quoted in full; snipped ... ]
Works for me; Steve you OK with this?
On Mon, 28 Oct 2019 15:37:49 +0100
Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> > [ ... changelog quoted in full; snipped ... ]
>
> Works for me; Steve you OK with this?
Nothing against it, but I want to take a deeper look before we accept
it. Are you OK with waiting a week? I'm currently at Open Source Summit
and still have two more talks to write (giving them Thursday). I won't
have time to look till next week.
-- Steve
On Mon, Oct 28, 2019 at 02:01:47PM -0400, Steven Rostedt wrote:
> On Mon, 28 Oct 2019 15:37:49 +0100
> Peter Zijlstra <[email protected]> wrote:
> > Works for me; Steve you OK with this?
>
> Nothing against it, but I want to take a deeper look before we accept
> it. Are you OK in waiting a week? I'm currently at Open Source Summit
> and still have two more talks to write (giving them Thursday). I wont
> have time to look till next week.
Sure, I'll keep it in my queue, but will make sure it doesn't hit -tip
until you've had time.
On Wed, 9 Oct 2019 at 12:46, Qais Yousef <[email protected]> wrote:
>
> [ ... ]
>
> Note that by default uclamp.min is 1024, which means that RT tasks will
> always be biased towards the big CPUs, which make for a better more
> predictable behavior for the default case.
hmm... big cores are not always the best choice for rt tasks; they
generally take more time to wake up or to switch context because of
the pipeline depth, branch predictors and so on
> [ ... rest of the changelog and the patch snipped ... ]
On 10/29/19 09:13, Vincent Guittot wrote:
> On Wed, 9 Oct 2019 at 12:46, Qais Yousef <[email protected]> wrote:
> > [ ... ]
> >
> > Note that by default uclamp.min is 1024, which means that RT tasks will
> > always be biased towards the big CPUs, which make for a better more
> > predictable behavior for the default case.
>
> hmm... big cores are not always the best choices for rt tasks, they
> generally took more time to wake up or to switch context because of
> the pipeline depth and others branch predictions
Can you quantify this into a number? I suspect this latency should be in the
200-300us range. And the difference between little and big should be much
smaller than that, no? We can't give guarantees in Linux of that order in
general, and serious real-time users have to do extra tweaks like disabling
power management, which can introduce latency and hinder determinism, besides
enabling PREEMPT_RT.
For generic systems a few ms is the best we can give, and we can easily fall
out of this without any tweaks.
The choice of going to the maximum performance point in the system for RT tasks
by default goes beyond this patch anyway. I'm just making it consistent here
since we have different performance levels and RT didn't understand this
before.
So what I'm doing here is just to make things consistent rather than change the
default.
What do you suggest?
Thanks
--
Qais Yousef
On Tue, 29 Oct 2019 at 12:02, Qais Yousef <[email protected]> wrote:
>
> On 10/29/19 09:13, Vincent Guittot wrote:
> > On Wed, 9 Oct 2019 at 12:46, Qais Yousef <[email protected]> wrote:
> > > [ ... ]
> > >
> > > Note that by default uclamp.min is 1024, which means that RT tasks will
> > > always be biased towards the big CPUs, which make for a better more
> > > predictable behavior for the default case.
> >
> > hmm... big cores are not always the best choices for rt tasks, they
> > generally took more time to wake up or to switch context because of
> > the pipeline depth and others branch predictions
>
> Can you quantify this into a number? I suspect this latency should be in the
As a general rule, we pinned IRQs on the little cores because of such
responsiveness differences. I don't have numbers in mind as the tests were run
in the early big.LITTLE days, a few years ago.
Then, if you look at some idle-state definitions in DT, you will see that the
exit latency of the cluster-down state is 2900us for the big cores of hikey960
vs 1600us for the little ones.
> [ ... ]
>
> What do you suggest?
Making big cores the default CPUs for all RT tasks is not a minor change, and
IMO locality should stay the default behavior when there is no uclamp
constraint.
On 10/29/19 12:17, Vincent Guittot wrote:
> On Tue, 29 Oct 2019 at 12:02, Qais Yousef <[email protected]> wrote:
> >
> > On 10/29/19 09:13, Vincent Guittot wrote:
> > > On Wed, 9 Oct 2019 at 12:46, Qais Yousef <[email protected]> wrote:
> > > > [ ... ]
> > >
> > > hmm... big cores are not always the best choices for rt tasks, they
> > > generally took more time to wake up or to switch context because of
> > > the pipeline depth and others branch predictions
> >
> > Can you quantify this into a number? I suspect this latency should be in the
>
> As a general rule, we pinned IRQs on little core because of such
> responsiveness difference. I don't have numbers in mind as the tests
> were run at the beg of b.L system.. few years ago
> Then, if you look at some idle states definitions in DT, you will see
> that exit latency of cluster down state of big core of hikey960 is
> 2900us vs 1600us for little
I don't think hikey960 is a good system to use as a reference. SD845 shows more
sensible numbers
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/qcom/sdm845.dtsi?h=v5.4-rc5#n407
> > [ ... ]
> >
> > What do you suggest?
>
> Making big cores the default CPUs for all RT tasks is not a minor
> change and IMO locality should stay the default behavior when there is
> no uclamp constraint
How is this affecting locality? The task will always go to the big core, so it
should be local.
And before introducing uclamp the default was going to the maximum frequency
anyway - which is the highest performance point. So what this does is basically
make sure that if we asked for 1024 capacity, we get that.
Besides, the decision is taken by the setup of the system-wide uclamp.min. We
can change this to be something smaller, but I don't think we can come up with
a better value by default. Admins should tune this to something smaller if the
performance of their little cores is good enough for their needs.
What this patch says is: if the uclamp.min of my RT task is 1024, then we give
better guarantees that it'll get the 1024 performance it asked for. And the
default of 1024 is consistent with what Linux has always done for RT out of the
box.
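As an illustration (not part of this patch), and assuming the
kernel.sched_util_clamp_min sysctl that came with uclamp, something along
these lines would lower the effective uclamp.min of RT tasks that keep the
default request, and hence remove the default bias towards the big CPUs; the
value of 512 is just an example:

#include <stdio.h>

/*
 * Sketch: lower the system-wide uclamp.min so that RT tasks which keep the
 * default requested uclamp.min (1024) get a smaller effective value and are
 * no longer biased towards the big CPUs by the capacity fitness check.
 */
int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_util_clamp_min", "w");

	if (!f) {
		perror("sched_util_clamp_min");
		return 1;
	}

	/* e.g. let RT tasks fit any CPU with capacity >= 512 */
	fprintf(f, "%u\n", 512u);
	fclose(f);
	return 0;
}

This is the same as an admin writing 512 into
/proc/sys/kernel/sched_util_clamp_min from a shell.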
--
Qais Yousef
On Tue, 29 Oct 2019 at 12:48, Qais Yousef <[email protected]> wrote:
>
> On 10/29/19 12:17, Vincent Guittot wrote:
> > On Tue, 29 Oct 2019 at 12:02, Qais Yousef <[email protected]> wrote:
> > >
> > > On 10/29/19 09:13, Vincent Guittot wrote:
> > > > On Wed, 9 Oct 2019 at 12:46, Qais Yousef <[email protected]> wrote:
> > [ ... ]
>
> I don't think hikey960 is a good system to use as a reference. SD845 shows more
> sensible numbers
It is not worse than any other.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/qcom/sdm845.dtsi?h=v5.4-rc5#n407
> > [ ... ]
>
> How this is affecting locality? The task will always go to the big core, so it
> should be local.
Local with the waker.
You will force the RT task to run on the big cluster although the waker, data
and interrupts can be on the little one. So making big cores the default is far
from always being the best choice.
On Tue, 29 Oct 2019 at 13:46, Qais Yousef <[email protected]> wrote:
>
> On 10/29/19 13:20, Vincent Guittot wrote:
> > [ ... ]
>
> First, the value of uclamp.min is outside of the scope of this patch. Unless
> what you're saying is that when uclamp.min is 1024 then we should NOT choose a
> big cpu then there's no disagreement about what this patch do. If that's what
> you're objecting to please be more specific about how do you see this working
> instead.
My point is that this patch makes the big cores the default CPUs for RT tasks,
which is far from being a minor change and far from being an obviously good
default choice.
On 10/29/19 13:20, Vincent Guittot wrote:
> > > Making big cores the default CPUs for all RT tasks is not a minor
> > > change and IMO locality should stay the default behavior when there is
> > > no uclamp constraint
> >
> > How this is affecting locality? The task will always go to the big core, so it
> > should be local.
>
> local with the waker
> You will force rt task to run on big cluster although waker, data and
> interrupts can be on little one.
> So making big core as default is far from always being the best choice
This is loaded with assumptions IMO. AFAICT we don't know what the best choice
is.
First, the value of uclamp.min is outside the scope of this patch. Unless what
you're saying is that when uclamp.min is 1024 we should NOT choose a big CPU,
there's no disagreement about what this patch does. If that's what you're
objecting to, please be more specific about how you see this working instead.
If your objection is purely based on the choice of uclamp.min: while I agree
that on modern systems the little cores are good enough for the majority of RT
tasks in average Android systems, I don't feel confident reaching this
conclusion on low end systems where the little cores don't have enough grunt in
many cases. So I see the current default as adequate, and the responsibility of
further tweaking lies in the hands of the system admin.
--
Qais Yousef
On Tue, Oct 29, 2019 at 01:54:24PM +0100, Vincent Guittot wrote:
> On Tue, 29 Oct 2019 at 13:46, Qais Yousef <[email protected]> wrote:
> >
> > On 10/29/19 13:20, Vincent Guittot wrote:
> > > > > Making big cores the default CPUs for all RT tasks is not a minor
> > > > > change and IMO locality should stay the default behavior when there is
> > > > > no uclamp constraint
> > > >
> > > > How this is affecting locality? The task will always go to the big core, so it
> > > > should be local.
> > >
> > > local with the waker
> > > You will force rt task to run on big cluster although waker, data and
> > > interrupts can be on little one.
> > > So making big core as default is far from always being the best choice
> >
> > This is loaded with assumptions IMO. AFAICT we don't know what's the best
> > choice.
> >
> > First, the value of uclamp.min is outside of the scope of this patch. Unless
> > what you're saying is that when uclamp.min is 1024 then we should NOT choose a
> > big cpu then there's no disagreement about what this patch do. If that's what
> > you're objecting to please be more specific about how do you see this working
> > instead.
>
> My point is that this patch makes the big cores the default CPUs for
> RT tasks which is far from being a minor change and far from being an
> obvious default good choice
FIFO/RR tasks don't have a bandwidth specification (barring uclamp),
therefore we must assume the worst. This is the same principle that has
them select max_freq all the time.
I think it is a very natural extension of that very principle to place
(otherwise unconstrained) RT tasks on big cores.
Hi Vincent, Qais,
On 29-Oct 13:54, Vincent Guittot wrote:
> On Tue, 29 Oct 2019 at 13:46, Qais Yousef <[email protected]> wrote:
> > On 10/29/19 13:20, Vincent Guittot wrote:
> > > > > Making big cores the default CPUs for all RT tasks is not a minor
> > > > > change and IMO locality should stay the default behavior when there is
> > > > > no uclamp constraint
The default value for the system-wide uclamp.min is 1024 because by
default _we want_ RT tasks running at MAX_OPP. This means that by
default RT tasks are running _under constraints_.
The "no uclamp constraints" case you mention requires that you set
uclamp.min=0 for that task. In that case, this patch will do exactly
what you ask for: locality is preserved.
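For reference, a task (or whatever manages it) can drop that constraint from
userspace via sched_setattr(); below is a minimal sketch, with the struct
layout and flag values assumed to match include/uapi/linux/sched.h from
v5.3 onwards (glibc has no wrapper, hence the raw syscall):

/*
 * Hypothetical userspace sketch (not part of this patch): set uclamp.min=0
 * for the calling task so that wakeup locality is preserved, while keeping
 * the current policy and priority untouched.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE fields, unused here */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* Utilization clamps */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_KEEP_POLICY		0x08
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	/* Only update uclamp.min; keep the current policy/priority as-is */
	attr.sched_flags = SCHED_FLAG_KEEP_POLICY |
			   SCHED_FLAG_KEEP_PARAMS |
			   SCHED_FLAG_UTIL_CLAMP_MIN;
	attr.sched_util_min = 0;	/* no minimum performance constraint */

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	return 0;
}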
> > > > How this is affecting locality? The task will always go to the big core, so it
> > > > should be local.
> > >
> > > local with the waker
> > > You will force rt task to run on big cluster although waker, data and
> > > interrupts can be on little one.
> > > So making big core as default is far from always being the best choice
> >
> > This is loaded with assumptions IMO. AFAICT we don't know what's the best
> > choice.
Agree... more on that after...
> > First, the value of uclamp.min is outside of the scope of this patch. Unless
> > what you're saying is that when uclamp.min is 1024 then we should NOT choose a
> > big cpu then there's no disagreement about what this patch do. If that's what
> > you're objecting to please be more specific about how do you see this working
> > instead.
>
> My point is that this patch makes the big cores the default CPUs for
> RT tasks which is far from being a minor change and far from being
> an obvious default good choice
Some time ago we agreed that going to MAX_OPP for RT tasks was
"mandatory". That was definitely a big change, likely much more
impactful than the one proposed by this patch.
On many mobile devices we ended up pinning RT tasks on LITTLE cores
(mainly) to save quite a lot of energy by avoiding the case of big
CPUs randomly spiking to MAX_OPP just because of a small RT task
waking up on them. We also added some heuristics in schedutil as a
"band aid" for the effects of the aforementioned choice.
Could running RT on LITTLEs also bring some wakeup latency
improvement? Yes, maybe... it would be interesting to have a real
HW *and* SW use-case on hand to compare.
However, we know that RT is all about "latency", but what is a bit
more fuzzy is the definition of "latency":
A) wakeup-latency
From a scheduler standpoint it's quite often considered as the
time it takes to "wake up" a task and actually start executing its
instructions.
B) completion-time
From an app standpoint, what quite often matters is the time to
complete the task activation and go back to sleep.
Running at MAX_OPP looks much more related to the need to complete
fast than to wake up fast, especially considering that that decision
was taken with mainly (perhaps only) SMP systems in mind.
On heterogeneous systems, "wakeup-latency" and "completion-time" are
two metrics which *maybe* can be better served by different cores.
However, it's very difficult to argue whether one metric is more
important than the other. It's even more difficult to quantify,
because of the multitude of HW and SW combinations.
Thus, what's proposed here can be far from being an "obvious good
choice", but it's also quite difficult to argue and prove that it's
definitely _not_ a good choice. It's just a default which:
1) keeps things simple for RT tasks by using the same default
policy for both frequency and CPU selection:
we always run (when possible) at the highest available capacity
2) is based on a per-task/system-wide tunable mechanism
Is that not enough to justify it as a default?
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On Tue, 29 Oct 2019 at 21:36, Patrick Bellasi
<[email protected]> wrote:
>
> Hi Vincent, Qais,
>
> On 29-Oct 13:54, Vincent Guittot wrote:
> > On Tue, 29 Oct 2019 at 13:46, Qais Yousef <[email protected]> wrote:
> > > On 10/29/19 13:20, Vincent Guittot wrote:
> > > > > > Making big cores the default CPUs for all RT tasks is not a minor
> > > > > > change and IMO locality should stay the default behavior when there is
> > > > > > no uclamp constraint
>
> The default value for system-wide's uclamp.min is 1024 because by
> default _we want_ RT tasks running at MAX_OPP. This means that by
> default RT tasks are running _under constraints_.
>
> The "no uclamp constraints" case you mention requires that you set
> uclamp.min=0 for that task. In that case, this patch will do exactly
> what you ask for: locality is preserved.
>
> > > > > How this is affecting locality? The task will always go to the big core, so it
> > > > > should be local.
> > > >
> > > > local with the waker
> > > > You will force rt task to run on big cluster although waker, data and
> > > > interrupts can be on little one.
> > > > So making big core as default is far from always being the best choice
> > >
> > > This is loaded with assumptions IMO. AFAICT we don't know what's the best
> > > choice.
>
> Agree... more on that after...
>
> > > First, the value of uclamp.min is outside of the scope of this patch. Unless
> > > what you're saying is that when uclamp.min is 1024 then we should NOT choose a
> > > big cpu then there's no disagreement about what this patch do. If that's what
> > > you're objecting to please be more specific about how do you see this working
> > > instead.
> >
> > My point is that this patch makes the big cores the default CPUs for
> > RT tasks which is far from being a minor change and far from being
> > an obvious default good choice
>
> Some time ago we agreed that going to MAX_OPP for RT tasks was
> "mandatory". That was defenitively a big change, likely much more
> impacting than the one proposed by this patch.
>
> On many mobile devices we ended up pinning RT tasks on LITTLE cores
> (mainly) to save quite a lot of energy by avoiding the case of big
> CPUs randomly spiking to MAX_OPP just because of a small RT task
> waking up on them. We also added some heuristic in schedutil has a
> "band aid" for the effects of the aforementioned choice.
>
> By running RT on LITTLEs there could be also some wakeup latency
> improvement? Yes, maybe... would be interesting to have some real
> HW *and* SW use-case on hand to compare.
>
> However, we know that RT is all about "latency", but what is a bit
> more fuzzy is the definition of "latency":
>
> A) wakeup-latency
> From a scheduler standpoint it's quite often considered as the the
> time it takes to "wakeup" a task and actually start executing its
> instructions.
>
> B) completion-time
> From an app standpoint, it's quite often important the time to
> complete the task activation and go back to sleep.
>
> Running at MAX_OPP looks much more related to the need to complete
> fast than waking up fast, especially considering that that decision
You will wake up faster as well when running at MAX_OPP because
instructions will run faster, or at least as fast. That being said,
running twice as fast doesn't at all mean waking up twice as fast, but
for sure it will be faster, although the gain can be really small.
Whereas running on a big core with more capacity doesn't mean that you
will wake up faster, because of uarch differences.
I agree that a "long" running RT task will most probably benefit from
big cores to complete earlier, but that is not as obvious for short ones.
> was taken looking mainly (perhaps only) to SMP systems.
>
> On heterogeneous systems, "wakeup-latency" and "completion-time" are
> two metrics which *maybe* can be better served by different cores.
> However, it's very difficult to argument if one metric is more
> important than the other. It's even more difficult to quantify it
> because of the multitide of HW and SW combinations.
That's the point of my comment: choosing big cores as the default and
always-best choice is far from obvious.
And this patch changes the default behavior without a study of the
impact, apart from stating that this should be OK.
>
> Thus, what's proposed here can be far from being an "obvious good
> chooce", but it's also quite difficult to argue and proof that's
> defenitively _not_ a good choice. It's just a default which:
>
> 1) keeps things simple for RT tasks by using the same default
> policy for both frequency and CPUs selection
> we always run (when possible) at the highest available capacity
>
> 2) it's based on a per-task/system-wide tunable mechanism
>
> Is that not enought to justfy it as a default?
>
> Best,
> Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On 10/30/19 09:04, Vincent Guittot wrote:
> On Tue, 29 Oct 2019 at 21:36, Patrick Bellasi
> <[email protected]> wrote:
> > Some time ago we agreed that going to MAX_OPP for RT tasks was
> > "mandatory". That was defenitively a big change, likely much more
> > impacting than the one proposed by this patch.
> >
> > On many mobile devices we ended up pinning RT tasks on LITTLE cores
> > (mainly) to save quite a lot of energy by avoiding the case of big
> > CPUs randomly spiking to MAX_OPP just because of a small RT task
> > waking up on them. We also added some heuristic in schedutil has a
> > "band aid" for the effects of the aforementioned choice.
> >
> > By running RT on LITTLEs there could be also some wakeup latency
> > improvement? Yes, maybe... would be interesting to have some real
> > HW *and* SW use-case on hand to compare.
> >
> > However, we know that RT is all about "latency", but what is a bit
> > more fuzzy is the definition of "latency":
> >
> > A) wakeup-latency
> > From a scheduler standpoint it's quite often considered as the the
> > time it takes to "wakeup" a task and actually start executing its
> > instructions.
> >
> > B) completion-time
> > From an app standpoint, it's quite often important the time to
> > complete the task activation and go back to sleep.
> >
> > Running at MAX_OPP looks much more related to the need to complete
> > fast than waking up fast, especially considering that that decision
>
> You will wake up faster as well when running at MAX_OPP because
> instructions will run faster or at least as fast. That being said,
> running twice faster doesn't mean at all waking up twice faster but
> for sure it will be faster although the gain can be really short.
> Whereas running on a big core with more capacity doesn't mean that you
> will wake up faster because of uarch difference.
> I agree that "long" running rt task will most probably benefit from
> big cores to complete earlier but that no more obvious for short one.
Idle states and other power management features are a known source of latency.
This latency changes across hardware all the time and RT people are accustomed
to testing against this.
Android has a wakelock which AFAIR disables deep idle states, because in some
sections the wakeup latency can hinder throughput for some apps. So it's a
known problem outside the RT universe too.
>
> > was taken looking mainly (perhaps only) to SMP systems.
> >
> > On heterogeneous systems, "wakeup-latency" and "completion-time" are
> > two metrics which *maybe* can be better served by different cores.
> > However, it's very difficult to argument if one metric is more
> > important than the other. It's even more difficult to quantify it
> > because of the multitide of HW and SW combinations.
>
> That's the point of my comment, choosing big cores as default and
> always best choice is far from being obvious.
It's consistent and deterministic, unlike the current situation where it
depends on your luck. What you get across boots/runs is completely random,
and this is worse than what this patch offers.
The default for Linux has always been putting the system at the highest
performance point, and this translates to the biggest CPU at the highest
frequency. It's not ideal, but it is consistent. This doesn't prevent
people from tweaking their systems to get what they want.
> And this patch changes the default behavior without study of the
> impact apart from stating that this should be ok
Without this patch there's no way for an RT task to guarantee a minimum
performance requirement.
I don't think there's a change of the default behavior because without this
patch we could still end up on a big CPU.
And you're stating that the difference in wakeup time between big cores and
little cores is a problem. As I stated several times, this is a known source
of latency that changes across systems, and if somebody cares about idle state
latencies then they're probably looking at something beyond generic systems
and need to tune it to guarantee a deterministic low-latency behavior.
I still don't see any proposed alternative to what should be the default
behavior.
--
Qais Yousef
On 09.10.19 12:46, Qais Yousef wrote:
[...]
> Changes in v2:
> - Use cpupri_find() to check the fitness of the task instead of
> sprinkling find_lowest_rq() with several checks of
> rt_task_fits_capacity().
>
> The selected implementation opted to pass the fitness function as an
> argument rather than call rt_task_fits_capacity() capacity which is
> a cleaner to keep the logical separation of the 2 modules; but it
> means the compiler has less room to optimize rt_task_fits_capacity()
> out when it's a constant value.
I would prefer exporting rt_task_fits_capacity() sched-internally via
kernel/sched/sched.h. Fewer code changes, and the indication whether
rt_task_fits_capacity() has to be used in cpupri_find() is already given
by lowest_mask being !NULL or NULL.
[...]
> +inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + unsigned int min_cap;
> + unsigned int max_cap;
> + unsigned int cpu_cap;
Nit picking. Since we discussed it already,
I found this "Also please try to aggregate variables of the same type
into a single line. There is no point in wasting screen space::" ;-)
https://lore.kernel.org/r/[email protected]
[...]
> @@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> */
> if (task_on_rq_queued(p) && rq->curr != p) {
> #ifdef CONFIG_SMP
> - if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> + bool need_to_push = rq->rt.overloaded ||
> + !rt_task_fits_capacity(p, cpu_of(rq));
> +
> + if (p->nr_cpus_allowed > 1 && need_to_push)
> rt_queue_push_tasks(rq);
> #endif /* CONFIG_SMP */
> if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
What happens to an always-running CFS task which switches to RT? Luca
introduced a special migrate callback (next to push and pull)
specifically to deal with this scenario. A lot of new infrastructure for
this one use case, but still, do we care for it in RT as well?
https://lore.kernel.org/r/[email protected]
On Wednesday 30 Oct 2019 at 09:04:44 (+0100), Vincent Guittot wrote:
> That's the point of my comment, choosing big cores as default and
> always best choice is far from being obvious.
> And this patch changes the default behavior without study of the
> impact apart from stating that this should be ok
The current behaviour is totally bogus on big.LITTLE TBH, and nobody
uses it as-is. Vendors hack RT a lot for that exact reason.
Right now an RT task gets a random CPU, which on big.LITTLE is
effectively equivalent to selecting a random OPP on SMP. And since RT is
all about predictability, I'd argue sticking to the big CPUs for
consistency with the frequency selection policy is the only sensible
default.
So +1 from me for Qais' patch :)
Thanks,
Quentin
On 10/30/19 12:57, Dietmar Eggemann wrote:
> On 09.10.19 12:46, Qais Yousef wrote:
>
> [...]
>
> > Changes in v2:
> > - Use cpupri_find() to check the fitness of the task instead of
> > sprinkling find_lowest_rq() with several checks of
> > rt_task_fits_capacity().
> >
> > The selected implementation opted to pass the fitness function as an
> > argument rather than call rt_task_fits_capacity() capacity which is
> > a cleaner to keep the logical separation of the 2 modules; but it
> > means the compiler has less room to optimize rt_task_fits_capacity()
> > out when it's a constant value.
>
> I would prefer exporting rt_task_fits_capacity() sched-internally via
> kernel/sched/sched.h. Less code changes and the indication whether
> rt_task_fits_capacity() has to be used in cpupri_find() is already given
> by lowest_mask being !NULL or NULL.
>
I don't mind if the maintainers agree too. The reason I did it that way is
because it keeps the implementation of cpupri generic and self-contained.
rt_task_fits_capacity() at the moment is a static function in rt.c.
To use it in cpupri_find() I either need to make it public somewhere or have an
extern bool rt_task_fits_capacity(...);
in cpupri.c. Neither of which appealed to me personally.
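For illustration, the exported-helper alternative Dietmar describes would look
roughly like the sketch below (just a sketch of that suggestion, not what this
patch does; rt.c would additionally drop the 'static' from its definition):

/* kernel/sched/sched.h: declare the helper sched-internally */
#ifdef CONFIG_UCLAMP_TASK
bool rt_task_fits_capacity(struct task_struct *p, int cpu);
#else
static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
{
	return true;
}
#endif

/*
 * kernel/sched/cpupri.c: in cpupri_find(), inside the 'if (lowest_mask)'
 * branch, call the helper directly instead of taking a fitness_fn pointer;
 * lowest_mask being non-NULL is the indication that the filtering applies.
 */
for_each_cpu(cpu, lowest_mask) {
	if (!rt_task_fits_capacity(p, cpu))
		cpumask_clear_cpu(cpu, lowest_mask);
}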
> [...]
>
> > +inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > +{
> > + unsigned int min_cap;
> > + unsigned int max_cap;
> > + unsigned int cpu_cap;
>
> Nit picking. Since we discussed it already,
>
> I found this "Also please try to aggregate variables of the same type
> into a single line. There is no point in wasting screen space::" ;-)
>
> https://lore.kernel.org/r/[email protected]
That wasn't merged in the end AFAICT :)
It's not my preferred style in this case if I have the choice - but I promise
to change it if I end up having to spin another version anyway :)
>
> [...]
>
> > @@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> > */
> > if (task_on_rq_queued(p) && rq->curr != p) {
> > #ifdef CONFIG_SMP
> > - if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> > + bool need_to_push = rq->rt.overloaded ||
> > + !rt_task_fits_capacity(p, cpu_of(rq));
> > +
> > + if (p->nr_cpus_allowed > 1 && need_to_push)
> > rt_queue_push_tasks(rq);
> > #endif /* CONFIG_SMP */
> > if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
> What happens to a always running CFS task which switches to RT? Luca
> introduced a special migrate callback (next to push and pull)
> specifically to deal with this scenario. A lot of new infrastructure for
> this one use case, but still, do we care for it in RT as well?
>
> https://lore.kernel.org/r/[email protected]
>
Good question. This scenario and the one where uclamp values are changed while
an RT task is running are similar.
In both cases the migration will happen on the next activation/wakeup AFAICS.
I am not sure an always-running RT/deadline task is something conceivable in
practice that we want to cater for. It is certainly something we can push a fix
for in the future on top of this. IMHO we're better off trying to keep the
complexity low until we have a real scenario/use case that justifies it.
As it stands when the system is overloaded or when there are more RT tasks than
big cores we're stuffed too. And there are probably more ways where we can
shoot ourselves in the foot. Using RT and deadline is hard and that's one of
their unavoidable plagues I suppose.
Thanks
--
Qais Yousef
Hi Steve
On 10/28/19 14:01, Steven Rostedt wrote:
> On Mon, 28 Oct 2019 15:37:49 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> > > Capacity Awareness refers to the fact that on heterogeneous systems
> > > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > > when placing tasks we need to be aware of this difference of CPU
> > > capacities.
> > >
> > > In such scenarios we want to ensure that the selected CPU has enough
> > > capacity to meet the requirement of the running task. Enough capacity
> > > means here that capacity_orig_of(cpu) >= task.requirement.
> > >
> > > The definition of task.requirement is dependent on the scheduling class.
> > >
> > > For CFS, utilization is used to select a CPU that has >= capacity value
> > > than the cfs_task.util.
> > >
> > > capacity_orig_of(cpu) >= cfs_task.util
> > >
> > > DL isn't capacity aware at the moment but can make use of the bandwidth
> > > reservation to implement that in a similar manner CFS uses utilization.
> > > The following patchset implements that:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
> > >
> > > For RT we don't have a per task utilization signal and we lack any
> > > information in general about what performance requirement the RT task
> > > needs. But with the introduction of uclamp, RT tasks can now control
> > > that by setting uclamp_min to guarantee a minimum performance point.
> > >
> > > ATM the uclamp value are only used for frequency selection; but on
> > > heterogeneous systems this is not enough and we need to ensure that the
> > > capacity of the CPU is >= uclamp_min. Which is what implemented here.
> > >
> > > capacity_orig_of(cpu) >= rt_task.uclamp_min
> > >
> > > Note that by default uclamp.min is 1024, which means that RT tasks will
> > > always be biased towards the big CPUs, which make for a better more
> > > predictable behavior for the default case.
> > >
> > > Must stress that the bias acts as a hint rather than a definite
> > > placement strategy. For example, if all big cores are busy executing
> > > other RT tasks we can't guarantee that a new RT task will be placed
> > > there.
> > >
> > > On non-heterogeneous systems the original behavior of RT should be
> > > retained. Similarly if uclamp is not selected in the config.
> > >
> > > Signed-off-by: Qais Yousef <[email protected]>
> >
> > Works for me; Steve you OK with this?
>
> Nothing against it, but I want to take a deeper look before we accept
> it. Are you OK in waiting a week? I'm currently at Open Source Summit
> and still have two more talks to write (giving them Thursday). I wont
> have time to look till next week.
A gentle reminder in case you can spare some time to look at this. It'd be nice
if it could make the 5.5 merge window if there are no major concerns about it.
Thanks!
--
Qais Yousef
Hi Steve
On 10/28/19 14:01, Steven Rostedt wrote:
> On Mon, 28 Oct 2019 15:37:49 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> > > Capacity Awareness refers to the fact that on heterogeneous systems
> > > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > > when placing tasks we need to be aware of this difference of CPU
> > > capacities.
> > >
> > > In such scenarios we want to ensure that the selected CPU has enough
> > > capacity to meet the requirement of the running task. Enough capacity
> > > means here that capacity_orig_of(cpu) >= task.requirement.
> > >
> > > The definition of task.requirement is dependent on the scheduling class.
> > >
> > > For CFS, utilization is used to select a CPU that has >= capacity value
> > > than the cfs_task.util.
> > >
> > > capacity_orig_of(cpu) >= cfs_task.util
> > >
> > > DL isn't capacity aware at the moment but can make use of the bandwidth
> > > reservation to implement that in a similar manner CFS uses utilization.
> > > The following patchset implements that:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
> > >
> > > For RT we don't have a per task utilization signal and we lack any
> > > information in general about what performance requirement the RT task
> > > needs. But with the introduction of uclamp, RT tasks can now control
> > > that by setting uclamp_min to guarantee a minimum performance point.
> > >
> > > ATM the uclamp value are only used for frequency selection; but on
> > > heterogeneous systems this is not enough and we need to ensure that the
> > > capacity of the CPU is >= uclamp_min. Which is what implemented here.
> > >
> > > capacity_orig_of(cpu) >= rt_task.uclamp_min
> > >
> > > Note that by default uclamp.min is 1024, which means that RT tasks will
> > > always be biased towards the big CPUs, which make for a better more
> > > predictable behavior for the default case.
> > >
> > > Must stress that the bias acts as a hint rather than a definite
> > > placement strategy. For example, if all big cores are busy executing
> > > other RT tasks we can't guarantee that a new RT task will be placed
> > > there.
> > >
> > > On non-heterogeneous systems the original behavior of RT should be
> > > retained. Similarly if uclamp is not selected in the config.
> > >
> > > Signed-off-by: Qais Yousef <[email protected]>
> >
> > Works for me; Steve you OK with this?
>
> Nothing against it, but I want to take a deeper look before we accept
> it. Are you OK in waiting a week? I'm currently at Open Source Summit
> and still have two more talks to write (giving them Thursday). I wont
> have time to look till next week.
Apologies if I am being too pushy. But I'm not sure whether this is waiting for
its turn in the queue or has slipped through the cracks, hence another gentle
reminder in case it's the latter :-)
in case it's the latter :-)
Many thanks
--
Qais Yousef
On Mon, 18 Nov 2019 15:43:35 +0000
Qais Yousef <[email protected]> wrote:
>
> >
> > Nothing against it, but I want to take a deeper look before we accept
> > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > and still have two more talks to write (giving them Thursday). I wont
> > have time to look till next week.
>
> Apologies if I am being too pushy. But not sure whether this is waiting for its
> turn in the queue or slipped through the cracks, hence another gentle reminder
> in case it's the latter :-)
No you are not being too pushy, my apologies to you, it's been quite
hectic lately (both from a business and personal stand point). I'll
look at it now.
Thanks, and sorry for the delay :-(
-- Steve
On 11/18/19 10:53, Steven Rostedt wrote:
> On Mon, 18 Nov 2019 15:43:35 +0000
> Qais Yousef <[email protected]> wrote:
>
> >
> > >
> > > Nothing against it, but I want to take a deeper look before we accept
> > > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > > and still have two more talks to write (giving them Thursday). I wont
> > > have time to look till next week.
> >
> > Apologies if I am being too pushy. But not sure whether this is waiting for its
> > turn in the queue or slipped through the cracks, hence another gentle reminder
> > in case it's the latter :-)
>
> No you are not being too pushy, my apologies to you, it's been quite
> hectic lately (both from a business and personal stand point). I'll
> look at it now.
>
> Thanks, and sorry for the delay :-(
No worries! I appreciate that you have to juggle a lot of things at once. As
long as it's in the queue and hasn't been accidentally dropped, then it is what
it is.
Many thanks
--
Qais Yousef
Sorry for the very late response...
On Wed, 9 Oct 2019 11:46:11 +0100
Qais Yousef <[email protected]> wrote:
> ATM the uclamp value are only used for frequency selection; but on
> heterogeneous systems this is not enough and we need to ensure that the
> capacity of the CPU is >= uclamp_min. Which is what implemented here.
Is it possible that the capacity can be fixed, where the process can
just have a CPU mask of CPUs that have the capacity for it?
Not that this will affect this patch now, but just something for the
future.
>
> capacity_orig_of(cpu) >= rt_task.uclamp_min
>
> Note that by default uclamp.min is 1024, which means that RT tasks will
> always be biased towards the big CPUs, which make for a better more
> predictable behavior for the default case.
>
> Must stress that the bias acts as a hint rather than a definite
> placement strategy. For example, if all big cores are busy executing
> other RT tasks we can't guarantee that a new RT task will be placed
> there.
>
> On non-heterogeneous systems the original behavior of RT should be
> retained. Similarly if uclamp is not selected in the config.
>
> Signed-off-by: Qais Yousef <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
-- Steve
On 11/25/19 16:36, Steven Rostedt wrote:
> Sorry for the very late response...
No worries and thanks for taking the time to look at this!
>
> On Wed, 9 Oct 2019 11:46:11 +0100
> Qais Yousef <[email protected]> wrote:
>
> > ATM the uclamp value are only used for frequency selection; but on
> > heterogeneous systems this is not enough and we need to ensure that the
> > capacity of the CPU is >= uclamp_min. Which is what implemented here.
>
> Is it possible that the capacity can be fixed, where the process can
> just have a cpu mask of CPUs that has the capacity for it?
It is possible and I did consider it. But it didn't feel right because:
1. The same can be achieved with regular affinities (see the sketch further
below).
2. On some systems, especially medium and lower-end devices, the number
of big cores might be small (2 bigs and 6 littles is common).
I couldn't justify that pinning to bigs is always better than
letting the system try to balance itself in case it gets overloaded.
The thing with RT is that generally we offer few guarantees if the system is
not designed properly and the user can easily shoot themselves in the foot.
This implementation offered the simplest best effort while still not being too
restrictive.
But the idea is valid and worth looking at in the future.
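For completeness, point 1 above can be approximated today from userspace with a
plain affinity mask; a minimal sketch follows. The CPU numbers are an
assumption (1 and 2 being the big cores, as on Juno); a real tool would
discover them, e.g. from /sys/devices/system/cpu/cpu*/cpu_capacity.

/*
 * Hypothetical sketch (not part of this patch): restrict a task to the big
 * CPUs with a regular affinity mask.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

int pin_to_big_cpus(pid_t pid)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(1, &mask);	/* assumed big core */
	CPU_SET(2, &mask);	/* assumed big core */

	if (sched_setaffinity(pid, sizeof(mask), &mask)) {
		perror("sched_setaffinity");
		return -1;
	}

	return 0;
}

The downside is exactly point 2: with only a couple of bigs, hard pinning
removes any chance of spreading when they get overloaded.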
>
> Not that this will affect this patch now, but just something for the
> future.
>
> >
> > capacity_orig_of(cpu) >= rt_task.uclamp_min
> >
> > Note that by default uclamp.min is 1024, which means that RT tasks will
> > always be biased towards the big CPUs, which make for a better more
> > predictable behavior for the default case.
> >
> > Must stress that the bias acts as a hint rather than a definite
> > placement strategy. For example, if all big cores are busy executing
> > other RT tasks we can't guarantee that a new RT task will be placed
> > there.
> >
> > On non-heterogeneous systems the original behavior of RT should be
> > retained. Similarly if uclamp is not selected in the config.
> >
> > Signed-off-by: Qais Yousef <[email protected]>
>
> Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Thanks for your time :-)
--
Qais Yousef
On 30/10/2019 18:43, Qais Yousef wrote:
> On 10/30/19 12:57, Dietmar Eggemann wrote:
>> On 09.10.19 12:46, Qais Yousef wrote:
>>
>> [...]
>>
>>> Changes in v2:
>>> - Use cpupri_find() to check the fitness of the task instead of
>>> sprinkling find_lowest_rq() with several checks of
>>> rt_task_fits_capacity().
>>>
>>> The selected implementation opted to pass the fitness function as an
>>> argument rather than call rt_task_fits_capacity() capacity which is
>>> a cleaner to keep the logical separation of the 2 modules; but it
>>> means the compiler has less room to optimize rt_task_fits_capacity()
>>> out when it's a constant value.
>>
>> I would prefer exporting rt_task_fits_capacity() sched-internally via
>> kernel/sched/sched.h. Less code changes and the indication whether
>> rt_task_fits_capacity() has to be used in cpupri_find() is already given
>> by lowest_mask being !NULL or NULL.
>>
>
> I don't mind if the maintainers agree too. The reason I did that way is because
> it keeps the implementation of cpupri generic and self contained.
>
> rt_task_fits_capacity() at the moment is a static function in rt.c
> To use it in cpupri_find() I either need to make it public somewhere or have an
>
> extern bool rt_task_fits_capacity(...);
>
> in cpupri.c. Neither of which appealed to me personally.
>
>> [...]
>>
>>> +inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
>>> +{
>>> + unsigned int min_cap;
>>> + unsigned int max_cap;
>>> + unsigned int cpu_cap;
>>
>> Nit picking. Since we discussed it already,
>>
>> I found this "Also please try to aggregate variables of the same type
>> into a single line. There is no point in wasting screen space::" ;-)
>>
>> https://lore.kernel.org/r/[email protected]
>
> That wasn't merged at the end AFAICT :)
>
> It's not my preferred style in this case if I have the choice - but I promise
> to change it if I ended up having to spin another version anyway :)
>
>>
>> [...]
>>
>>> @@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
>>> */
>>> if (task_on_rq_queued(p) && rq->curr != p) {
>>> #ifdef CONFIG_SMP
>>> - if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
>>> + bool need_to_push = rq->rt.overloaded ||
>>> + !rt_task_fits_capacity(p, cpu_of(rq));
>>> +
>>> + if (p->nr_cpus_allowed > 1 && need_to_push)
>>> rt_queue_push_tasks(rq);
>>> #endif /* CONFIG_SMP */
>>> if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
>> What happens to a always running CFS task which switches to RT? Luca
>> introduced a special migrate callback (next to push and pull)
>> specifically to deal with this scenario. A lot of new infrastructure for
>> this one use case, but still, do we care for it in RT as well?
>>
>> https://lore.kernel.org/r/[email protected]
>>
>
> Good question. This scenario and the one where uclamp values are changed while
> an RT task is running are similar.
>
> In both cases the migration will happen on the next activation/wakeup AFAICS.
>
> I am not sure an always running rt/deadline task is something conceivable in
> practice and we want to cater for. It is certainly something we can push a fix
> for in the future on top of this. IMHO we're better off trying to keep the
> complexity low until we have a real scenario/use case that justifies it.
>
> As it stands when the system is overloaded or when there are more RT tasks than
> big cores we're stuffed too. And there are probably more ways where we can
> shoot ourselves in the foot. Using RT and deadline is hard and that's one of
> their unavoidable plagues I suppose.
I'm OK with that. I just wanted to make sure we will apply the same
requirements to the upcoming DL capacity awareness implementation. Not
catering for this use case means that we can skip the migrate callback
from Luca's initial DL capacity awareness patch-set.
I still would prefer to export rt_task_fits_capacity() via
kernel/sched/sched.h but I won't insist on that detail:
Reviewed-by: Dietmar Eggemann <[email protected]>
Hi Peter
On 10/28/19 21:50, Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 02:01:47PM -0400, Steven Rostedt wrote:
> > On Mon, 28 Oct 2019 15:37:49 +0100
> > Peter Zijlstra <[email protected]> wrote:
>
> > > Works for me; Steve you OK with this?
> >
> > Nothing against it, but I want to take a deeper look before we accept
> > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > and still have two more talks to write (giving them Thursday). I wont
> > have time to look till next week.
>
> Sure, I'll keep it in my queue, but will make sure it doesn't hit -tip
> until you've had time.
Reviewers are happy with this now. It'd be nice if you could pick it up again
for the next round into -tip.
Thanks!
--
Qais Yousef
On Fri, Dec 20, 2019 at 04:01:49PM +0000, Qais Yousef wrote:
> On 10/28/19 21:50, Peter Zijlstra wrote:
> > On Mon, Oct 28, 2019 at 02:01:47PM -0400, Steven Rostedt wrote:
> > > On Mon, 28 Oct 2019 15:37:49 +0100
> > > Peter Zijlstra <[email protected]> wrote:
> >
> > > > Works for me; Steve you OK with this?
> > >
> > > Nothing against it, but I want to take a deeper look before we accept
> > > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > > and still have two more talks to write (giving them Thursday). I wont
> > > have time to look till next week.
> >
> > Sure, I'll keep it in my queue, but will make sure it doesn't hit -tip
> > until you've had time.
>
> Reviewers are happy with this now. It'd be nice if you can pick it up again for
> the next round to -tip.
>
Sorry, I missed Steve's and Dietmar's replies. It should shortly appear
in queue.git and I'll try and push to -tip over the weekend (provided
the robots don't come up with something fishy).
On 12/20/19 18:18, Peter Zijlstra wrote:
> On Fri, Dec 20, 2019 at 04:01:49PM +0000, Qais Yousef wrote:
> > On 10/28/19 21:50, Peter Zijlstra wrote:
> > > On Mon, Oct 28, 2019 at 02:01:47PM -0400, Steven Rostedt wrote:
> > > > On Mon, 28 Oct 2019 15:37:49 +0100
> > > > Peter Zijlstra <[email protected]> wrote:
> > >
> > > > > Works for me; Steve you OK with this?
> > > >
> > > > Nothing against it, but I want to take a deeper look before we accept
> > > > it. Are you OK in waiting a week? I'm currently at Open Source Summit
> > > > and still have two more talks to write (giving them Thursday). I wont
> > > > have time to look till next week.
> > >
> > > Sure, I'll keep it in my queue, but will make sure it doesn't hit -tip
> > > until you've had time.
> >
> > Reviewers are happy with this now. It'd be nice if you can pick it up again for
> > the next round to -tip.
> >
>
> Sorry, I missed Steve's and Dietmar's replies. It should shorty appear
> in queue.git and I'll try and push to -tip over the weekend (provided
> the robots don't come up with something fishy).
No worries, thanks! It missed the 5.5 merge window anyway.
We had 2 reports from the buildbot last time; luckily I kept the fixups at the
top of my local branch.
Happy to apply them locally again, or would you prefer I send a v3?
Cheers
---
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 799791c01d60..bdfb802d4c12 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -46,6 +46,8 @@ static int convert_prio(int prio)
* @cp: The cpupri context
* @p: The task
* @lowest_mask: A mask to fill in with selected CPUs (or NULL)
+ * @fitness_fn: A pointer to a function to do custom checks whether the CPU
+ * fits a specific criteria so that we only return those CPUs.
*
* Note: This function returns the recommended CPUs as calculated during the
* current invocation. By the time the call returns, the CPUs may have in
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3a68054e15b3..6afecb5557db 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -452,7 +452,7 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
* Note that uclamp_min will be clamped to uclamp_max if uclamp_min
* > uclamp_max.
*/
-inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
{
unsigned int min_cap;
unsigned int max_cap;
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 804d402fb6f6487b825aae8cf42fda6426c62867
Gitweb: https://git.kernel.org/tip/804d402fb6f6487b825aae8cf42fda6426c62867
Author: Qais Yousef <[email protected]>
AuthorDate: Wed, 09 Oct 2019 11:46:11 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Wed, 25 Dec 2019 10:42:10 +01:00
sched/rt: Make RT capacity-aware
Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of this difference of CPU
capacities.
In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.
The definition of task.requirement is dependent on the scheduling class.
For CFS, utilization is used to select a CPU that has >= capacity value
than the cfs_task.util.
capacity_orig_of(cpu) >= cfs_task.util
DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a similar manner CFS uses utilization.
The following patchset implements that:
https://lore.kernel.org/lkml/[email protected]/
capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
For RT we don't have a per task utilization signal and we lack any
information in general about what performance requirement the RT task
needs. But with the introduction of uclamp, RT tasks can now control
that by setting uclamp_min to guarantee a minimum performance point.
ATM the uclamp value are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min. Which is what implemented here.
capacity_orig_of(cpu) >= rt_task.uclamp_min
Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which make for a better more
predictable behavior for the default case.
Must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks we can't guarantee that a new RT task will be placed
there.
On non-heterogeneous systems the original behavior of RT should be
retained. Similarly if uclamp is not selected in the config.
[ mingo: Minor edits to comments. ]
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpupri.c | 25 +++++++++++--
kernel/sched/cpupri.h | 4 +-
kernel/sched/rt.c | 83 ++++++++++++++++++++++++++++++++++--------
3 files changed, 94 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index b7abca9..1a2719e 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -46,6 +46,8 @@ static int convert_prio(int prio)
* @cp: The cpupri context
* @p: The task
* @lowest_mask: A mask to fill in with selected CPUs (or NULL)
+ * @fitness_fn: A pointer to a function to do custom checks whether the CPU
+ * fits a specific criteria so that we only return those CPUs.
*
* Note: This function returns the recommended CPUs as calculated during the
* current invocation. By the time the call returns, the CPUs may have in
@@ -57,7 +59,8 @@ static int convert_prio(int prio)
* Return: (int)bool - CPUs were found
*/
int cpupri_find(struct cpupri *cp, struct task_struct *p,
- struct cpumask *lowest_mask)
+ struct cpumask *lowest_mask,
+ bool (*fitness_fn)(struct task_struct *p, int cpu))
{
int idx = 0;
int task_pri = convert_prio(p->prio);
@@ -98,6 +101,8 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
continue;
if (lowest_mask) {
+ int cpu;
+
cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
/*
@@ -108,7 +113,23 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
* condition, simply act as though we never hit this
* priority level and continue on.
*/
- if (cpumask_any(lowest_mask) >= nr_cpu_ids)
+ if (cpumask_empty(lowest_mask))
+ continue;
+
+ if (!fitness_fn)
+ return 1;
+
+ /* Ensure the capacity of the CPUs fit the task */
+ for_each_cpu(cpu, lowest_mask) {
+ if (!fitness_fn(p, cpu))
+ cpumask_clear_cpu(cpu, lowest_mask);
+ }
+
+ /*
+ * If no CPU at the current priority can fit the task
+ * continue looking
+ */
+ if (cpumask_empty(lowest_mask))
continue;
}
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index 7dc20a3..32dd520 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -18,7 +18,9 @@ struct cpupri {
};
#ifdef CONFIG_SMP
-int cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask);
+int cpupri_find(struct cpupri *cp, struct task_struct *p,
+ struct cpumask *lowest_mask,
+ bool (*fitness_fn)(struct task_struct *p, int cpu));
void cpupri_set(struct cpupri *cp, int cpu, int pri);
int cpupri_init(struct cpupri *cp);
void cpupri_cleanup(struct cpupri *cp);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e591d40..4043abe 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -437,6 +437,45 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
return rt_se->on_rq;
}
+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the uclamp
+ * settings.
+ *
+ * This check is only important for heterogeneous systems where uclamp_min value
+ * is higher than the capacity of a @cpu. For non-heterogeneous system this
+ * function will always return true.
+ *
+ * The function will return true if the capacity of the @cpu is >= the
+ * uclamp_min and false otherwise.
+ *
+ * Note that uclamp_min will be clamped to uclamp_max if uclamp_min
+ * > uclamp_max.
+ */
+static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned int min_cap;
+ unsigned int max_cap;
+ unsigned int cpu_cap;
+
+ /* Only heterogeneous systems can benefit from this check */
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return true;
+
+ min_cap = uclamp_eff_value(p, UCLAMP_MIN);
+ max_cap = uclamp_eff_value(p, UCLAMP_MAX);
+
+ cpu_cap = capacity_orig_of(cpu);
+
+ return cpu_cap >= min(min_cap, max_cap);
+}
+#else
+static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ return true;
+}
+#endif
+
#ifdef CONFIG_RT_GROUP_SCHED
static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
@@ -1391,6 +1430,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
struct rq *rq;
+ bool test;
/* For anything but wake ups, just return the task_cpu */
if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
@@ -1422,10 +1462,16 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
*
* This test is optimistic, if we get it wrong the load-balancer
* will have to sort it out.
+ *
+ * We take into account the capacity of the CPU to ensure it fits the
+ * requirement of the task - which is only important on heterogeneous
+ * systems like big.LITTLE.
*/
- if (curr && unlikely(rt_task(curr)) &&
- (curr->nr_cpus_allowed < 2 ||
- curr->prio <= p->prio)) {
+ test = curr &&
+ unlikely(rt_task(curr)) &&
+ (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+
+ if (test || !rt_task_fits_capacity(p, cpu)) {
int target = find_lowest_rq(p);
/*
@@ -1449,15 +1495,15 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+ !cpupri_find(&rq->rd->cpupri, rq->curr, NULL, NULL))
return;
/*
* p is migratable, so let's not schedule it and
* see if it is pushed or pulled somewhere else.
*/
- if (p->nr_cpus_allowed != 1
- && cpupri_find(&rq->rd->cpupri, p, NULL))
+ if (p->nr_cpus_allowed != 1 &&
+ cpupri_find(&rq->rd->cpupri, p, NULL, NULL))
return;
/*
@@ -1601,7 +1647,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
{
if (!task_running(rq, p) &&
- cpumask_test_cpu(cpu, p->cpus_ptr))
+ cpumask_test_cpu(cpu, p->cpus_ptr) &&
+ rt_task_fits_capacity(p, cpu))
return 1;
return 0;
@@ -1643,7 +1690,8 @@ static int find_lowest_rq(struct task_struct *task)
if (task->nr_cpus_allowed == 1)
return -1; /* No other targets possible */
- if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
+ if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask,
+ rt_task_fits_capacity))
return -1; /* No targets found */
/*
@@ -2147,12 +2195,14 @@ skip:
*/
static void task_woken_rt(struct rq *rq, struct task_struct *p)
{
- if (!task_running(rq, p) &&
- !test_tsk_need_resched(rq->curr) &&
- p->nr_cpus_allowed > 1 &&
- (dl_task(rq->curr) || rt_task(rq->curr)) &&
- (rq->curr->nr_cpus_allowed < 2 ||
- rq->curr->prio <= p->prio))
+ bool need_to_push = !task_running(rq, p) &&
+ !test_tsk_need_resched(rq->curr) &&
+ p->nr_cpus_allowed > 1 &&
+ (dl_task(rq->curr) || rt_task(rq->curr)) &&
+ (rq->curr->nr_cpus_allowed < 2 ||
+ rq->curr->prio <= p->prio);
+
+ if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq)))
push_rt_tasks(rq);
}
@@ -2224,7 +2274,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
*/
if (task_on_rq_queued(p) && rq->curr != p) {
#ifdef CONFIG_SMP
- if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
+ bool need_to_push = rq->rt.overloaded ||
+ !rt_task_fits_capacity(p, cpu_of(rq));
+
+ if (p->nr_cpus_allowed > 1 && need_to_push)
rt_queue_push_tasks(rq);
#endif /* CONFIG_SMP */
if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
Hi Qais,
On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> Capacity Awareness refers to the fact that on heterogeneous systems
> (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> when placing tasks we need to be aware of this difference of CPU
> capacities.
>
> In such scenarios we want to ensure that the selected CPU has enough
> capacity to meet the requirement of the running task. Enough capacity
> means here that capacity_orig_of(cpu) >= task.requirement.
>
> The definition of task.requirement is dependent on the scheduling class.
>
> For CFS, utilization is used to select a CPU that has >= capacity value
> than the cfs_task.util.
>
> capacity_orig_of(cpu) >= cfs_task.util
>
> DL isn't capacity aware at the moment but can make use of the bandwidth
> reservation to implement that in a similar manner CFS uses utilization.
> The following patchset implements that:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
>
> For RT we don't have a per task utilization signal and we lack any
> information in general about what performance requirement the RT task
> needs. But with the introduction of uclamp, RT tasks can now control
> that by setting uclamp_min to guarantee a minimum performance point.
>
> ATM the uclamp value are only used for frequency selection; but on
> heterogeneous systems this is not enough and we need to ensure that the
> capacity of the CPU is >= uclamp_min. Which is what implemented here.
>
> capacity_orig_of(cpu) >= rt_task.uclamp_min
>
> Note that by default uclamp.min is 1024, which means that RT tasks will
> always be biased towards the big CPUs, which make for a better more
> predictable behavior for the default case.
>
> Must stress that the bias acts as a hint rather than a definite
> placement strategy. For example, if all big cores are busy executing
> other RT tasks we can't guarantee that a new RT task will be placed
> there.
>
> On non-heterogeneous systems the original behavior of RT should be
> retained. Similarly if uclamp is not selected in the config.
>
> Signed-off-by: Qais Yousef <[email protected]>
> ---
>
> Changes in v2:
> - Use cpupri_find() to check the fitness of the task instead of
> sprinkling find_lowest_rq() with several checks of
> rt_task_fits_capacity().
>
> The selected implementation opted to pass the fitness function as an
> argument rather than call rt_task_fits_capacity() capacity which is
> a cleaner to keep the logical separation of the 2 modules; but it
> means the compiler has less room to optimize rt_task_fits_capacity()
> out when it's a constant value.
>
> The logic is not perfect. For example if a 'small' task is occupying a big CPU
> and another big task wakes up; we won't force migrate the small task to clear
> the big cpu for the big task that woke up.
>
> IOW, the logic is best effort and can't give hard guarantees. But improves the
> current situation where a task can randomly end up on any CPU regardless of
> what it needs. ie: without this patch an RT task can wake up on a big or small
> CPU, but with this it will always wake up on a big CPU (assuming the big CPUs
> aren't overloaded) - hence provide a consistent performance.
>
> I'm looking at ways to improve this best effort, but this patch should be
> a good start to discuss our Capacity Awareness requirement. There's a trade-off
> of complexity to be made here and I'd like to keep things as simple as
> possible and build on top as needed.
>
>
> kernel/sched/cpupri.c | 23 ++++++++++--
> kernel/sched/cpupri.h | 4 ++-
> kernel/sched/rt.c | 81 +++++++++++++++++++++++++++++++++++--------
> 3 files changed, 91 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index b7abca987d94..799791c01d60 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -57,7 +57,8 @@ static int convert_prio(int prio)
> * Return: (int)bool - CPUs were found
> */
> int cpupri_find(struct cpupri *cp, struct task_struct *p,
> - struct cpumask *lowest_mask)
> + struct cpumask *lowest_mask,
> + bool (*fitness_fn)(struct task_struct *p, int cpu))
> {
> int idx = 0;
> int task_pri = convert_prio(p->prio);
> @@ -98,6 +99,8 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> continue;
>
> if (lowest_mask) {
> + int cpu;
> +
> cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
>
> /*
> @@ -108,7 +111,23 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> * condition, simply act as though we never hit this
> * priority level and continue on.
> */
> - if (cpumask_any(lowest_mask) >= nr_cpu_ids)
> + if (cpumask_empty(lowest_mask))
> + continue;
> +
> + if (!fitness_fn)
> + return 1;
> +
> + /* Ensure the capacity of the CPUs fit the task */
> + for_each_cpu(cpu, lowest_mask) {
> + if (!fitness_fn(p, cpu))
> + cpumask_clear_cpu(cpu, lowest_mask);
> + }
> +
> + /*
> + * If no CPU at the current priority can fit the task
> + * continue looking
> + */
> + if (cpumask_empty(lowest_mask))
> continue;
> }
>
I understand that RT tasks run on BIG cores by default when uclamp is enabled.
Can you tell what happens when we have more runnable RT tasks than the BIG
CPUs? Do they get packed on the BIG CPUs or eventually silver CPUs pull those
tasks? Since rt_task_fits_capacity() is considered during wakeup, push and
pull, the tasks may get packed on BIG forever. Is my understanding correct?
Also, what happens in the case where RT tasks are pinned to silver CPUs but
with the default uclamp value, i.e. p.uclamp.min=1024? They may all get queued on a
single silver and other silvers may not help since the task does not fit
there. In practice, we may not use this setup. Just wanted to know if this
behavior is intentional or not.
Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
Hi Pavan
On 01/31/20 15:36, Pavan Kondeti wrote:
> Hi Qais,
>
> On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
[...]
> >
> > For RT we don't have a per task utilization signal and we lack any
> > information in general about what performance requirement the RT task
> > needs. But with the introduction of uclamp, RT tasks can now control
> > that by setting uclamp_min to guarantee a minimum performance point.
[...]
> > ---
> >
> > Changes in v2:
> > - Use cpupri_find() to check the fitness of the task instead of
> > sprinkling find_lowest_rq() with several checks of
> > rt_task_fits_capacity().
> >
> > The selected implementation opted to pass the fitness function as an
> > argument rather than call rt_task_fits_capacity() capacity which is
> > a cleaner to keep the logical separation of the 2 modules; but it
> > means the compiler has less room to optimize rt_task_fits_capacity()
> > out when it's a constant value.
> >
> > The logic is not perfect. For example if a 'small' task is occupying a big CPU
> > and another big task wakes up; we won't force migrate the small task to clear
> > the big cpu for the big task that woke up.
> >
> > IOW, the logic is best effort and can't give hard guarantees. But improves the
> > current situation where a task can randomly end up on any CPU regardless of
> > what it needs. ie: without this patch an RT task can wake up on a big or small
> > CPU, but with this it will always wake up on a big CPU (assuming the big CPUs
> > aren't overloaded) - hence provide a consistent performance.
[...]
> I understand that RT tasks run on BIG cores by default when uclamp is enabled.
> Can you tell what happens when we have more runnable RT tasks than BIG CPUs?
> Do they get packed on the BIG CPUs, or do the silver CPUs eventually pull those
> tasks? Since rt_task_fits_capacity() is considered during wakeup, push and
> pull, the tasks may get packed on BIG forever. Is my understanding correct?
I left the relevant parts of the commit message and my 'cover-letter' quoted
above; they should contain answers to your question.
In short, the logic is best effort and isn't a hard guarantee. When the system
is overloaded we'll still spread, and a task that needs a big core might end up
on a little one. But AFAIU with RT, if you really want guarantees you need to
do some planning, otherwise there are no guarantees in general that your task
will get what it needs.
But I understand your question is for the general purpose case. I've hacked my
notebook to run a few tests for you
https://gist.github.com/qais-yousef/cfe7487e3b43c3c06a152da31ae09101
Look at the diagrams in "Test {1, 2, 3} Results". I spawned 6 tasks which match
the 6 cores on the Juno I ran on, based on Linus' master from a couple of days
ago. Note that on Juno cores 1 and 2 are the big cores. 'b_*' and 'l_*' are the
task names, which are remnants from my previous testing where I spawned
different numbers of big and small tasks.
I repeat the same tests 3 times to demonstrate the repeatability. The logic
causes 2 tasks to run on a big CPU, but there's spreading. IMO on a general
purpose system this is a good behavior. On a real time system that needs better
guarantees, there's no alternative to doing proper RT planning.
In the last test I just spawn 2 tasks, which end up on the right CPUs, 1 and 2.
On systems like Android my observation has been that there are very few
concurrent RT tasks active at the same time. So if there are some tasks in the
system that do want to be on the big CPU, they are most likely to get that
guarantee. Without this patch what you get is completely random.
>
> Also, what happens in the case where RT tasks are pinned to silver but with
> the default uclamp value, i.e. p.uclamp.min=1024? They may all get queued on a
> single silver, and the other silvers may not help since the task does not fit
> there. In practice, we may not use this setup. I just wanted to know whether
> this behavior is intentional or not.
I'm not sure I understand your question.
If the RT tasks are affined to a set of CPUs, then we'll only search in these
CPUs. I expect the logic not to change with this patch. If you have a use case
that you think breaks with this patch, can you please share the details
so I can reproduce it?
I just ran several tests spawning 4 tasks affined to the little cores and
I indeed see them spreading on the littles.
Cheers
--
Qais Yousef
Hi Qais,
On Sat, Feb 01, 2020 at 07:08:34AM +0530, Pavan Kondeti wrote:
> Hi Qais,
>
> On Fri, Jan 31, 2020 at 9:04 PM Qais Yousef <[email protected]> wrote:
>
> > Hi Pavan
> >
> > On 01/31/20 15:36, Pavan Kondeti wrote:
> > > Hi Qais,
> > >
> > > On Wed, Oct 09, 2019 at 11:46:11AM +0100, Qais Yousef wrote:
> >
> > [...]
> >
> > https://gist.github.com/qais-yousef/cfe7487e3b43c3c06a152da31ae09101
> >
> > [...]
> >
> > I repeat the same tests 3 times to demonstrate the repeatability. The logic
> > causes 2 tasks to run on a big CPU, but there's spreading. IMO on a general
> > purpose system this is a good behavior. On a real time system that needs
> > better guarantees, there's no alternative to doing proper RT planning.
> >
> > In the last test I just spawn 2 tasks, which end up on the right CPUs, 1 and 2.
> > On systems like Android my observation has been that there are very few
> > concurrent RT tasks active at the same time. So if there are some tasks in
> > the system that do want to be on the big CPU, they are most likely to get
> > that guarantee. Without this patch what you get is completely random.
> >
> Thanks for the results. I see that tasks are indeed spreading onto silver.
> However it is not clear to me from the code how tasks would get spread,
> because cpupri_find() never returns a little CPU in the lowest_mask since
> the RT task does not fit on a little CPU. So from the wakeup path, we place
> the task on the previous CPU or a BIG CPU. The push logic also has the RT
> capacity checks, so an overloaded BIG CPU may not push tasks to an idle
> (RT free) little CPU.
>
I pulled your patch on top of v5.5 and ran the below rt-app test on an SDM845
platform. We expect the 5 RT tasks to spread on different CPUs, which was
happening without the patch. With the patch, I see one of the tasks got woken
up on a CPU which is running another RT task.
{
	"tasks" : {
		"big-task" : {
			"instance" : 5,
			"loop" : 10,
			"sleep" : 100000,
			"runtime" : 100000,
		},
	},
	"global" : {
		"duration" : -1,
		"calibration" : 720,
		"default_policy" : "SCHED_FIFO",
		"pi_enabled" : false,
		"lock_pages" : false,
		"logdir" : "/",
		"log_basename" : "rt-app2",
		"ftrace" : false,
		"gnuplot" : false
	}
}
The full trace is attached. I am copy/pasting the snippet where it shows
packing is happening. The custom trace_printk()s are added in cpupri_find()
before calling fitness_fn(). As you can see, pid=535 is woken on CPU#7 where
the pid=538 RT task is running. Right after waking, the push is tried but it
did not work either.

This is not a serious problem for us since we don't set RT tasks'
uclamp.min=1024. However, it changed the behavior and might introduce latency
for RT tasks on b.L platforms running the upstream kernel as is.
big-task-538 [007] d.h. 403.401633: irq_handler_entry: irq=3 name=arch_timer
big-task-538 [007] d.h2 403.401633: sched_waking: comm=big-task pid=535 prio=89 target_cpu=007
big-task-538 [007] d.h2 403.401635: cpupri_find: before task=big-task-535 lowest_mask=f
big-task-538 [007] d.h2 403.401636: cpupri_find: after task=big-task-535 lowest_mask=0
big-task-538 [007] d.h2 403.401637: cpupri_find: it comes here
big-task-538 [007] d.h2 403.401638: find_lowest_rq: task=big-task-535 ret=0 lowest_mask=0
big-task-538 [007] d.h3 403.401640: sched_wakeup: comm=big-task pid=535 prio=89 target_cpu=007
big-task-538 [007] d.h3 403.401641: cpupri_find: before task=big-task-535 lowest_mask=f
big-task-538 [007] d.h3 403.401642: cpupri_find: after task=big-task-535 lowest_mask=0
big-task-538 [007] d.h3 403.401642: cpupri_find: it comes here
big-task-538 [007] d.h3 403.401643: find_lowest_rq: task=big-task-535 ret=0 lowest_mask=0
big-task-538 [007] d.h. 403.401644: irq_handler_exit: irq=3 ret=handled
big-task-538 [007] d..2 403.402413: sched_switch: prev_comm=big-task prev_pid=538 prev_prio=89 prev_state=S ==> next_comm=big-task next_pid=535 next_prio=89
Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
On 02/01/20 07:08, Pavan Kondeti wrote:
> Thanks for the results. I see that tasks are indeed spreading onto silver.
> However it is not clear to me from the code how tasks would get spread,
> because cpupri_find() never returns a little CPU in the lowest_mask since
> the RT task does not fit on a little CPU. So from the wakeup path, we place
> the task on the previous CPU or a BIG CPU. The push logic also has the RT
> capacity checks, so an overloaded BIG CPU may not push tasks to an idle
> (RT free) little CPU.
Okay, I see what you mean.
Sorry, I had to cache this back in again as it's been a while.
So yes, you're right, we won't force a move to the little CPUs, but if we were
running there already we won't force-overload the big CPU either. IIRC I didn't
see more than 2 tasks packed on a big core, but maybe I needed to try more
combinations of things. The 2 packed tasks can be seen in my results, which
I don't think is necessarily a bad thing.
I tried to call out that in the overloaded case we don't give any guarantees.
I'll expand below, but if a user asks for more RT tasks to run on big cores
than there are big cores available, then there's very little we can do.
If the tasks really want to be on a big core but the user is asking for more
than what the system can provide, then the end result isn't the best whether we
pack or spread. Packing might end up good if the 2 tasks aren't intensive, or
could end up being bad if they are.
Similarly, if the 2 tasks aren't computationally intensive, then spreading to
the little is bad. Most likely the request for running at a certain performance
level is to guarantee the predictability of finishing execution within a
certain window of time. But don't quote me on this :)
I don't see one right answer here. The current mechanism could certainly do
better; but it's not clear what better means without delving into system
specific details. I am open to any suggestions to improve it.
As it stands, it allows the admin to boost RT tasks and guarantee they end up
running on the correct CPU. But it doesn't protect against bad RT planning.
> Yes, we do search within the affined CPUs. However, in cpupri_find(), we
> clear the CPUs on which the task does not fit, so the lowest_mask will
> always be empty and we return -1. There is no fallback.
The question here is: if a task has its uclamp_min set to 1024 but is affined
to the wrong type of cpus, is it a user error or a kernel failure to deal with
this case?
The default p->uclamp_min = 1024 is not the right default to use in these
systems. I am pushing for a patch [1] to allow modifying this default behavior
at runtime. AFAIU Android has always disabled max RT boosting.
The common use case that we are trying to cater for is that most of the tasks
are happy to run anywhere, but there might be a few that need to be boosted,
and this boost value can only be guaranteed by running on a certain set of
CPUs; in that case we give this guarantee.
Again I agree the logic could be improved, but I prefer to see a real use case
first where this improvement helps.
[1] https://lore.kernel.org/lkml/[email protected]/
Cheers
--
Qais Yousef
On 02/03/20 11:02, Pavan Kondeti wrote:
> The full trace is attached. I am copy/pasting the snippet where it shows
> packing is happening. The custom trace_printk()s are added in cpupri_find()
> before calling fitness_fn(). As you can see, pid=535 is woken on CPU#7 where
> the pid=538 RT task is running. Right after waking, the push is tried but it
> did not work either.
>
> This is not a serious problem for us since we don't set RT tasks'
> uclamp.min=1024. However, it changed the behavior and might introduce latency
> for RT tasks on b.L platforms running the upstream kernel as is.
>
> big-task-538 [007] d.h. 403.401633: irq_handler_entry: irq=3 name=arch_timer
> big-task-538 [007] d.h2 403.401633: sched_waking: comm=big-task pid=535 prio=89 target_cpu=007
> big-task-538 [007] d.h2 403.401635: cpupri_find: before task=big-task-535 lowest_mask=f
> big-task-538 [007] d.h2 403.401636: cpupri_find: after task=big-task-535 lowest_mask=0
> big-task-538 [007] d.h2 403.401637: cpupri_find: it comes here
> big-task-538 [007] d.h2 403.401638: find_lowest_rq: task=big-task-535 ret=0 lowest_mask=0
> big-task-538 [007] d.h3 403.401640: sched_wakeup: comm=big-task pid=535 prio=89 target_cpu=007
> big-task-538 [007] d.h3 403.401641: cpupri_find: before task=big-task-535 lowest_mask=f
> big-task-538 [007] d.h3 403.401642: cpupri_find: after task=big-task-535 lowest_mask=0
> big-task-538 [007] d.h3 403.401642: cpupri_find: it comes here
> big-task-538 [007] d.h3 403.401643: find_lowest_rq: task=big-task-535 ret=0 lowest_mask=0
> big-task-538 [007] d.h. 403.401644: irq_handler_exit: irq=3 ret=handled
> big-task-538 [007] d..2 403.402413: sched_switch: prev_comm=big-task prev_pid=538 prev_prio=89 prev_state=S ==> next_comm=big-task next_pid=535 next_prio=89
Thanks for that.
If I read the trace correctly, the 5 tasks all end up on *all* the
big cores (ie: sharing), correct?
The results I posted did show that we can end up with 2 tasks on a single big
core. I don't think we can say whether this is a good or a bad thing, though
for me it is a good thing since it honored a request to be on the big core,
which the system tried its best to provide.
Maybe we do want to cater for a default where all RT tasks are boosted; is
this what you're saying? If yes, what do you propose the logic should look
like? My thought is to provide a runtime knob to tune down most RT tasks to a
sensible default depending on how powerful (and power hungry) the system is,
then use the per task API to boost the few tasks that really need more
performance out of the system.
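To be clear, by "per task API" I mean sched_setattr() with the util clamp
flags. A minimal example with arbitrary values (FIFO prio 10, uclamp_min 512);
the sched_attr layout is declared locally since libc headers may not expose it:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_FLAG_UTIL_CLAMP_MIN
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#endif

/* Local copy of the uapi sched_attr layout (see include/uapi/linux/sched/types.h). */
struct my_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct my_sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_FIFO,	/* needs CAP_SYS_NICE */
		.sched_priority	= 10,
		.sched_flags	= SCHED_FLAG_UTIL_CLAMP_MIN,
		.sched_util_min	= 512,		/* "a medium CPU is enough for me" */
	};

	/* pid 0 == current task; there's no libc wrapper, hence the raw syscall. */
	return syscall(SYS_sched_setattr, 0, &attr, 0);
}

The knob would then only set the default for the tasks that don't do something
like the above explicitly.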
Note from my results, assuming I didn't do anything stupid: if you boot
a system that runs with rt_task->uclamp_min = 1024, then some tasks will race
to the big cores and the rest will stay where they are on the littles.
In my first version things looked slightly different, and I think the handling
of the fallback when no fitting CPU is found was better.
Please have a look and let me know what you think.
https://lore.kernel.org/lkml/[email protected]/
Thanks
--
Qais Yousef
On Mon, 3 Feb 2020 14:27:14 +0000
Qais Yousef <[email protected]> wrote:
> I don't see one right answer here. The current mechanism could certainly do
> better; but it's not clear what better means without delving into system
> specific details. I am open to any suggestions to improve it.
The way I see this is that if there's no big cores available but little
cores are, and the RT task has those cores in its affinity mask then
the task most definitely should consider moving to the little core. The
cpu_find() should return them!
But, what we can do is to mark the little core that's running an RT
task that prefers bigger cores as "rt-overloaded". This will add this
core into the set being looked at when another core schedules out an RT
task. When that happens, the RT task on the little core will get pulled
back to the big core.
Here's what I propose.
1. Scheduling of an RT task that wants big cores, but has little cores
in its affinity.
2. Calls cpu_find() which will look to place it first on a big core, if
there's a core that is running a task that is lower priority than
itself.
3. If all the big cores have RT tasks it can not preempt, look to find
a little core.
4. If a little core is returned, and we schedule an RT task that
prefers big cores on it, we mark it overloaded.
5. An RT task on a big core schedules out. Start looking at the RT
overloaded run queues.
6. See that there's an RT task on the little core, and migrate it over.
Note, this will require a bit more logic as the overloaded code wasn't
designed for migration of running tasks, but that could be added.
-- Steve
On 03/02/2020 16:14, Steven Rostedt wrote:
> On Mon, 3 Feb 2020 14:27:14 +0000
> Qais Yousef <[email protected]> wrote:
>
>> I don't see one right answer here. The current mechanism could certainly do
>> better; but it's not clear what better means without delving into system
>> specific details. I am open to any suggestions to improve it.
>
> The way I see this is that if there's no big cores available but little
> cores are, and the RT task has those cores in its affinity mask then
> the task most definitely should consider moving to the little core. The
> cpu_find() should return them!
>
> But, what we can do is to mark the little core that's running an RT
> task that prefers bigger cores as "rt-overloaded". This will add this
> core into the set being looked at when another core schedules out an RT
> task. When that happens, the RT task on the little core will get pulled
> back to the big core.
>
That sounds sensible enough - it's also very similar to what we have for
CFS, labeled under "misfit tasks" (i.e. tasks that are "too big" for
LITTLEs).
>
> Note, this will require a bit more logic as the overloaded code wasn't
> designed for migration of running tasks, but that could be added.
>
I haven't ventured too much into RT land, but FWIW that's what we use
the CPU stopper for in CFS (see active_load_balance_cpu_stop()).
> -- Steve
>
On 02/03/20 11:14, Steven Rostedt wrote:
> On Mon, 3 Feb 2020 14:27:14 +0000
> Qais Yousef <[email protected]> wrote:
>
> > I don't see one right answer here. The current mechanism could certainly do
> > better; but it's not clear what better means without delving into system
> > specific details. I am open to any suggestions to improve it.
>
> The way I see this is that if there's no big cores available but little
> cores are, and the RT task has those cores in its affinity mask then
> the task most definitely should consider moving to the little core. The
> cpu_find() should return them!
I almost agree. I think cpupri_find() could certainly do better if the task
is already running on a little core. It can fall back to the next best little
core if no bigger core is available.
I already started looking at pushing a patch to do that.
I'm torn about pushing a task already on a big core to a little core if it says
it wants the big core (down migration).
I guess since most tasks are fifo by default, one will starve if the other
one is a long running task (assuming the same priority). But long running RT
tasks are not the common case, hence I wanted to hear about which use cases
this logic hurts. I expect the big cores not to be over-subscribed by default.
Based on some profiles I did, at least running real Android apps, I didn't see
the RT tasks overwhelming the system.
In my view, the performance dip of sharing the big core would be less than
migrating the task to a little core momentarily and then bringing it back to
the big core.
Because the following 2 big ifs must be satisfied first to starve an RT task:
1. We need all the big cores to be overloaded first.
2. The RT tasks on all the big cores are CPU hoggers (however we want
to define this)
And I think this needs more investigation.
>
> But, what we can do is to mark the little core that's running an RT
> task that prefers bigger cores as "rt-overloaded". This will add this
> core into the set being looked at when another core schedules out an RT
> task. When that happens, the RT task on the little core will get pulled
> back to the big core.
I didn't think of using the rt-overloaded flag in this way. That would be
interesting to try.
>
> Here's what I propose.
>
> 1. Scheduling of an RT task that wants big cores, but has little cores
> in its affinity.
>
> 2. Calls cpu_find() which will look to place it first on a big core, if
> there's a core that is running a task that is lower priority than
> itself.
>
> 3. If all the big cores have RT tasks it can not preempt, look to find
> a little core.
I agree with the above.
>
> 4. If a little core is returned, and we schedule an RT task that
> prefers big cores on it, we mark it overloaded.
>
> 5. An RT task on a big core schedules out. Start looking at the RT
> overloaded run queues.
>
> 6. See that there's an RT task on the little core, and migrate it over.
I think the above should depend on the fitness of the CPU we currently run on.
I think we shouldn't down migrate, or at least we should investigate whether
down migration makes more sense than keeping tasks running on the correct CPU
where they are.
> Note, this will require a bit more logic as the overloaded code wasn't
> designed for migration of running tasks, but that could be added.
I'm wary of overloading the meaning of rt.overloaded. Maybe I can convert it to
a bitmap so that we can encode the reason.
Let me see how complicated it is to write something up.
Thanks!
--
Qais Yousef
On Mon, 3 Feb 2020 17:17:46 +0000
Qais Yousef <[email protected]> wrote:
> I'm torn about pushing a task already on a big core to a little core if it says
> it wants it (down migration).
If the "down migration" happens to a process that is lower in priority,
then that stays in line with the policy decisions of scheduling RT
tasks. That is, higher priority tasks take precedence over lower
priority tasks, even if that means "degrading" that lower priority task.
For example, if a high priority task wakes up on a CPU that is running
a lower priority task, and with the exception of that lower priority
task being pinned, it will boot it off the CPU. Even if the lower
priority task is pinned, it may still take over the CPU if it can't
find another CPU.
> >
> > 4. If a little core is returned, and we schedule an RT task that
> > prefers big cores on it, we mark it overloaded.
> >
> > 5. An RT task on a big core schedules out. Start looking at the RT
> > overloaded run queues.
> >
> > 6. See that there's an RT task on the little core, and migrate it over.
>
> I think the above should depend on the fitness of the CPU we currently run on.
> I think we shouldn't down migrate, or at least we should investigate whether
> down migration makes more sense than keeping tasks running on the correct CPU
> where they are.
Note, this only happens when a big core CPU schedules. And if you do
not have HAVE_RT_PUSH_IPI (which sends IPIs to overloaded CPUS and just
schedules), then that "down migration" happens to an RT task that isn't
even running.
You can add to the logic that you do not take over an RT task that is
pinned and can't move itself. Perhaps that may be the only change to
cpu_find(), is that it will only pick a big CPU if little CPUs are
available if the big CPU doesn't have a pinned RT task on it.
Like you said, this is best effort, and I believe this is the best
approach. The policy has always been the higher the priority of a task,
the more likely it will push other tasks away. We don't change that. If
the system administrator is overloading the big cores with RT tasks,
then this is what they get.
>
> > Note, this will require a bit more logic as the overloaded code wasn't
> > designed for migration of running tasks, but that could be added.
>
> I'm wary of overloading the meaning of rt.overloaded. Maybe I can convert it to
> a bitmap so that we can encode the reason.
We can change the name to something like rt.needs_pull or whatever.
-- Steve
On 02/03/20 13:12, Steven Rostedt wrote:
> On Mon, 3 Feb 2020 17:17:46 +0000
> Qais Yousef <[email protected]> wrote:
>
>
> > I'm torn about pushing a task already on a big core to a little core if it says
> > it wants it (down migration).
>
> If the "down migration" happens to a process that is lower in priority,
> then that stays in line with the policy decisions of scheduling RT
> tasks. That is, higher priority tasks take precedence over lower
> priority tasks, even if that means "degrading" that lower priority task.
>
> For example, if a high priority task wakes up on a CPU that is running
> a lower priority task, and with the exception of that lower priority
> task being pinned, it will boot it off the CPU. Even if the lower
> priority task is pinned, it may still take over the CPU if it can't
> find another CPU.
Indeed this makes sense.
>
>
> > >
> > > 4. If a little core is returned, and we schedule an RT task that
> > > prefers big cores on it, we mark it overloaded.
> > >
> > > 5. An RT task on a big core schedules out. Start looking at the RT
> > > overloaded run queues.
> > >
> > > 6. See that there's an RT task on the little core, and migrate it over.
> >
> > I think the above should depend on the fitness of the CPU we currently run on.
> > I think we shouldn't down migrate, or at least we should investigate whether
> > down migration makes more sense than keeping tasks running on the correct CPU
> > where they are.
>
> Note, this only happens when a big core CPU schedules. And if you do
> not have HAVE_RT_PUSH_IPI (which sends IPIs to overloaded CPUS and just
> schedules), then that "down migration" happens to an RT task that isn't
> even running.
In the light of strictly adhering to priority-based scheduling, yes, this makes
sense. I still think the migration will produce worse performance, but I can
appreciate that even if that were true, not migrating would break the strict
priority rule.
>
> You can add to the logic that you do not take over an RT task that is
> pinned and can't move itself. Perhaps that may be the only change to
I get this.
> cpu_find(), is that it will only pick a big CPU if little CPUs are
> available if the big CPU doesn't have a pinned RT task on it.
But not that. Do you mind rephrasing it?
Or let me try first:
1. Search all priority levels for a fitting CPU
2. If failed, return the first lowest mask found
3. If it succeeds, remove any CPU that has a pinned task in it
4. If the lowest_mask is empty, return (2).
5. Else return the lowest_mask with the fitting CPU(s)
Did I get it right?
The idea is not to potentially overload that CPU when this pinned task wakes
up? The task could be sleeping waiting for something interesting to poke it..?
>
> Like you said, this is best effort, and I believe this is the best
> approach. The policy has always been the higher the priority of a task,
> the more likely it will push other tasks away. We don't change that. If
> the system administrator is overloading the big cores with RT tasks,
> then this is what they get.
Yes. I think that has always been the case with RT. It is very easy to shoot
yourself in the foot.
>
> >
> > > Note, this will require a bit more logic as the overloaded code wasn't
> > > designed for migration of running tasks, but that could be added.
> >
> > I'm wary of overloading the meaning of rt.overloaded. Maybe I can convert it to
> > a bitmap so that we can encode the reason.
>
> We can change the name to something like rt.needs_pull or whatever.
Thanks for bringing more clarity to this.
Cheers
--
Qais Yousef
On 03/02/2020 20:03, Qais Yousef wrote:
> On 02/03/20 13:12, Steven Rostedt wrote:
>> On Mon, 3 Feb 2020 17:17:46 +0000
>> Qais Yousef <[email protected]> wrote:
[...]
> In the light of strictly adhering to priority-based scheduling, yes, this makes
> sense. I still think the migration will produce worse performance, but I can
> appreciate that even if that were true, not migrating would break the strict
> priority rule.
>
>>
>> You can add to the logic that you do not take over an RT task that is
>> pinned and can't move itself. Perhaps that may be the only change to
>
> I get this.
>
>> cpu_find(), is that it will only pick a big CPU if little CPUs are
>> available if the big CPU doesn't have a pinned RT task on it.
>
> But not that. Do you mind rephrasing it?
>
> Or let me try first:
>
> 1. Search all priority levels for a fitting CPU
Just so I get this right: All _lower_ prio levels than p->prio, right?
> 2. If failed, return the first lowest mask found
> 3. If it succeeds, remove any CPU that has a pinned task in it
> 4. If the lowest_mask is empty, return (2).
> 5. Else return the lowest_mask with the fitting CPU(s)
Mapping this to the 5 FIFO rt-tasks of Pavan's example (all
p->prio=89 (dflt rt-app prio), dflt min_cap=1024 max_cap=1024) on a 4
big (CPU capacity=1024) 4 little (CPU capacity < 1024) system:

You search from idx 1 to 11 [p->prio=89 eq. idx (task_pri)=12] and since
there are no lower-priority RT tasks, the lowest mask of idx=1 (CFS or Idle)
for the 5th RT task is returned.
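(Spelling out the idx arithmetic; this mirrors my reading of convert_prio()
rather than quoting it:)

#include <stdio.h>

#define MAX_RT_PRIO	100

int main(void)
{
	int prio = 89;				/* dflt rt-app prio */
	int task_pri = MAX_RT_PRIO - prio + 1;	/* RT prio -> cpupri idx */

	printf("task_pri = %d\n", task_pri);	/* 12 */
	return 0;
}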
But that means that CPU capacity trumps priority?
[...]
On 02/04/20 18:23, Dietmar Eggemann wrote:
> On 03/02/2020 20:03, Qais Yousef wrote:
> > On 02/03/20 13:12, Steven Rostedt wrote:
> >> On Mon, 3 Feb 2020 17:17:46 +0000
> >> Qais Yousef <[email protected]> wrote:
>
> [...]
>
> > In the light of strictly adhering to priority-based scheduling, yes, this makes
> > sense. I still think the migration will produce worse performance, but I can
> > appreciate that even if that were true, not migrating would break the strict
> > priority rule.
> >
> >>
> >> You can add to the logic that you do not take over an RT task that is
> >> pinned and can't move itself. Perhaps that may be the only change to
> >
> > I get this.
> >
> >> cpu_find(), is that it will only pick a big CPU if little CPUs are
> >> available if the big CPU doesn't have a pinned RT task on it.
> >
> > But not that. Do you mind rephrasing it?
> >
> > Or let me try first:
> >
> > 1. Search all priority levels for a fitting CPU
>
> Just so I get this right: All _lower_ prio levels than p->prio, right?
Correct, that's what I meant :)
>
> > 2. If failed, return the first lowest mask found
> > 3. If it succeeds, remove any CPU that has a pinned task in it
> > 4. If the lowest_mask is empty, return (2).
> > 5. Else return the lowest_mask with the fitting CPU(s)
>
> Mapping this to the 5 FIFO rt-tasks of Pavan's example (all
> p->prio=89 (dflt rt-app prio), dflt min_cap=1024 max_cap=1024) on a 4
> big (CPU capacity=1024) 4 little (CPU capacity < 1024) system:
>
> You search from idx 1 to 11 [p->prio=89 eq. idx (task_pri)=12] and since
> there are no lower-priority RT tasks, the lowest mask of idx=1 (CFS or Idle)
> for the 5th RT task is returned.
We should basically fall back to whatever was supposed to be returned if this
patch was not applied.
	if (lowest_mask) {
		// record the value of the first valid lowest_mask
		if lowest_mask doesn't contain a fitting CPU:
			continue searching in the next priority level
	}

	if no fitting CPU was found at any lower level:
		return the recorded first valid lowest_mask
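Or in (rough, untested) user-space model form, with made-up capacities and
masks just to illustrate the two-pass idea; this is not a patch:

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS		8
#define NR_LEVELS	12	/* stand-in for idx 0 .. task_pri - 1 */

/* Made-up capacities: 4 littles + 4 bigs. */
static const unsigned long capacity_orig[NR_CPUS] = {
	462, 462, 462, 462, 1024, 1024, 1024, 1024
};

static bool fits(unsigned long uclamp_min, int cpu)
{
	return capacity_orig[cpu] >= uclamp_min;
}

/*
 * candidates[idx][cpu]: CPUs in the vector at this priority level that are
 * also in the task's affinity mask. Return the lowest level that contains a
 * fitting CPU; otherwise fall back to the first level that had any candidate
 * at all (*fit_found tells the caller which case it was).
 */
static int find_lowest_level(bool candidates[][NR_CPUS],
			     unsigned long uclamp_min, bool *fit_found)
{
	int first_valid = -1;
	int idx, cpu;

	for (idx = 0; idx < NR_LEVELS; idx++) {
		bool any = false, any_fit = false;

		for (cpu = 0; cpu < NR_CPUS; cpu++) {
			if (!candidates[idx][cpu])
				continue;
			any = true;
			if (fits(uclamp_min, cpu))
				any_fit = true;
		}

		if (any && first_valid < 0)
			first_valid = idx;	/* remember the fallback level */

		if (any_fit) {
			*fit_found = true;
			return idx;
		}
	}

	*fit_found = false;
	return first_valid;	/* -1 if no level had candidates at all */
}

int main(void)
{
	bool candidates[NR_LEVELS][NR_CPUS] = { { false } };
	bool fit;
	int cpu, idx;

	/* Task affined to the littles only (CPUs 0-3), all found at level 1. */
	for (cpu = 0; cpu < 4; cpu++)
		candidates[1][cpu] = true;

	idx = find_lowest_level(candidates, 1024, &fit);
	printf("level %d, fitting CPU found: %s\n", idx, fit ? "yes" : "no");
	return 0;
}

The real thing would of course operate on the cpupri vectors and cpumasks, but
the shape of the fallback would be the same.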
>
> But that means that CPU capacity trumps priority?
I'm not sure how to translate 'trumps' here.
So priority has precedence over capacity. I think this is not the best option,
but it keeps the rules consistent, which is: if a higher priority task is
runnable, it'd be pushed to another CPU running a lower priority one if we can
find one. We'll attempt to make sure this CPU fits the capacity requirement of
the task, but if there isn't one we'll fall back to the next best thing.
I think this makes sense and will keep this fitness logic generic.
Maybe it's easier to discuss over a patch. I will post one soon hopefully.
Thanks
--
Qais Yousef