2019-08-30 17:56:52

by Subhra Mazumdar

Subject: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
"latency-nice" which is shared by all the threads in that Cgroup.

Signed-off-by: subhra mazumdar <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 40 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 1 +
kernel/sched/sched.h | 8 ++++++++
4 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b4a79c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -631,6 +631,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned int rt_priority;
+ u64 latency_nice;

const struct sched_class *sched_class;
struct sched_entity se;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..47969bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5976,6 +5976,7 @@ void __init sched_init(void)
init_dl_rq(&rq->dl);
#ifdef CONFIG_FAIR_GROUP_SCHED
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
+ root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
/*
@@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
*/
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
struct task_group, css);
+ tsk->latency_nice = tg->latency_nice;
tg = autogroup_task_group(tsk, tg);
tsk->sched_task_group = tg;

@@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
}
#endif /* CONFIG_RT_GROUP_SCHED */

+static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+
+ return tg->latency_nice;
+}
+
+static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 latency_nice)
+{
+ struct task_group *tg = css_tg(css);
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
+ return -ERANGE;
+
+ tg->latency_nice = latency_nice;
+
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ p->latency_nice = latency_nice;
+ css_task_iter_end(&it);
+
+ return 0;
+}
+
static struct cftype cpu_legacy_files[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+ {
+ .name = "latency-nice",
+ .read_u64 = cpu_latency_nice_read_u64,
+ .write_u64 = cpu_latency_nice_write_u64,
+ },
{ } /* Terminate */
};

@@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
.write = cpu_max_write,
},
#endif
+ {
+ .name = "latency-nice",
+ .read_u64 = cpu_latency_nice_read_u64,
+ .write_u64 = cpu_latency_nice_write_u64,
+ },
{ } /* terminate */
};

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b08d00c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
goto err;

tg->shares = NICE_0_LOAD;
+ tg->latency_nice = LATENCY_NICE_DEFAULT;

init_cfs_bandwidth(tg_cfs_bandwidth(tg));

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..365c928 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
#define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT)

/*
+ * Latency-nice default value
+ */
+#define LATENCY_NICE_DEFAULT 5
+#define LATENCY_NICE_MIN 1
+#define LATENCY_NICE_MAX 100
+
+/*
* Single value that decides SCHED_DEADLINE internal math precision.
* 10 -> just above 1us
* 9 -> just above 0.5us
@@ -362,6 +369,7 @@ struct cfs_bandwidth {
/* Task group related information */
struct task_group {
struct cgroup_subsys_state css;
+ u64 latency_nice;

#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
--
2.9.3


2019-09-04 17:33:40

by Tim Chen

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On 8/30/19 10:49 AM, subhra mazumdar wrote:
> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.


Subhra,

Thanks for posting the patchset. Having a latency nice hint
is useful beyond idle load balancing. I can think of other
application scenarios, like scheduling batch machine learning AVX 512
processes with latency sensitive processes. AVX512 limits the frequency
of the CPU and it is best to avoid putting a latency sensitive task on the
same core as AVX512 work. So a latency nice hint allows the scheduler
to have a criterion to determine the latency sensitivity of a task
and arrange latency sensitive tasks away from AVX512 tasks.

You configure the latency hint on a cgroup basis.
But I think not all tasks in a cgroup necessarily have the same
latency sensitivity.

For example, I can see that cgroups can be applied on a per user basis,
and the user could run different tasks that have different latency sensitivity.
We may also need a way to configure latency sensitivity on a per task basis instead of on
a per cgroup basis.

Tim


> @@ -631,6 +631,7 @@ struct task_struct {
> int static_prio;
> int normal_prio;
> unsigned int rt_priority;
> + u64 latency_nice;

Does it need to be 64 bit? Max latency nice is only 100.

>
> const struct sched_class *sched_class;
> struct sched_entity se;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427..47969bc 100644

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
> #define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT)
>
> /*
> + * Latency-nice default value
> + */

It will be useful to add comments to let the reader know
that a higher latency nice number means a task is more
latency tolerant.

Is there a reason for setting the default to be a low
value of 5?

Seems like by default we will only search the
same core for an idle cpu on a smaller system,
as we only search 5% of the cpu span of the target sched domain.

> +#define LATENCY_NICE_DEFAULT 5
> +#define LATENCY_NICE_MIN 1
> +#define LATENCY_NICE_MAX 100
> +

2019-09-05 06:59:18

by Parth Shah

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice



On 9/4/19 11:02 PM, Tim Chen wrote:
> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
>
>
> Subhra,
>
> Thanks for posting the patchset. Having a latency nice hint
> is useful beyond idle load balancing. I can think of other
> application scenarios, like scheduling batch machine learning AVX 512
> processes with latency sensitive processes. AVX512 limits the frequency
> of the CPU and it is best to avoid latency sensitive task on the
> same core with AVX512. So latency nice hint allows the scheduler
> to have a criteria to determine the latency sensitivity of a task
> and arrange latency sensitive tasks away from AVX512 tasks.
>


Hi Tim and Subhra,

This patchset seems to be interesting for my TurboSched patches as well,
where I try to pack jitter tasks on fewer cores to get higher Turbo Frequencies.
Well, the problem I face is that we sometimes end up putting multiple jitter tasks on a core
running some latency sensitive application, which may then see performance degradation.
So my plan was to classify such tasks as latency sensitive, thereby hinting the load
balancer not to put jitter tasks on such cores.

TurboSched: https://lkml.org/lkml/2019/7/25/296

> You configure the latency hint on a cgroup basis.
> But I think not all tasks in a cgroup necessarily have the same
> latency sensitivity.
>
> For example, I can see that cgroup can be applied on a per user basis,
> and the user could run different tasks that have different latency sensitivity.
> We may also need a way to configure latency sensitivity on a per task basis instead on
> a per cgroup basis.
>

AFAIU, the problem defined above intersects with my patches as well, where an interface
is required to classify the jitter tasks. I have already tried a few methods like
a syscall and cgroups to classify such tasks, and maybe something like that can be adopted
with this patchset as well.


Thanks,
Parth

2019-09-05 08:59:56

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.

*sigh*, no. We start with a normal per task attribute, and then later,
if it is needed and makes sense, we add it to cgroups.

Also, your Changelog fails on pretty much every point. It doesn't
explain why, it doesn't describe anything and so on.

From just reading the above, I would expect it to have the range
[-20,19] just like normal nice. Apparently this is not so.

2019-09-05 10:51:11

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:

> > From just reading the above, I would expect it to have the range
> > [-20,19] just like normal nice. Apparently this is not so.
>
> Regarding the range for the latency-nice values, I guess we have two
> options:
>
> - [-20..19], which makes it similar to priorities
> downside: we quite likely end up with a kernel space representation
> which does not match the user-space one, e.g. look at
> task_struct::prio.
>
> - [0..1024], which makes it more similar to a "percentage"
>
> Being latency-nice a new concept, we are not constrained by POSIX and
> IMHO the [0..1024] scale is a better fit.
>
> That will translate into:
>
> latency-nice=0 : default (current mainline) behaviour, all "biasing"
> policies are disabled and we wakeup up as fast as possible
>
> latency-nice=1024 : maximum niceness, where for example we can imaging
> to turn switch a CFS task to be SCHED_IDLE?

There's a few things wrong there; I really feel that if we call it nice,
it should be like nice. Otherwise we should call it latency-bias and not
have the association with nice to confuse people.

Secondly; the default should be in the middle of the range. Naturally
this would be a signed range like nice [-(x+1),x] for some x. But if you
want [0,1024], then the default really should be 512, but personally I
like 0 better as a default, in which case we need negative numbers.

This is important because we want to be able to bias towards less
importance to (tail) latency as well as more importance to (tail)
latency.

Specifically, Oracle wants to sacrifice (some) latency for throughput.
Facebook OTOH seems to want to sacrifice (some) throughput for latency.

2019-09-05 10:52:47

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Thu, Sep 05, 2019 at 11:05:18AM +0100, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b52ed1a..365c928 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
> > #define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT)
> >
> > /*
> > + * Latency-nice default value
> > + */
> > +#define LATENCY_NICE_DEFAULT 5
> > +#define LATENCY_NICE_MIN 1
> > +#define LATENCY_NICE_MAX 100
>
> Values 1 and 5 looks kind of arbitrary.

Yes, and like I just wrote, completely and utterly wrong.

2019-09-05 11:42:56

by Patrick Bellasi

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice


On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...

> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
>
> *sigh*, no. We start with a normal per task attribute, and then later,
> if it is needed and makes sense, we add it to cgroups.

FWIW, to add on top of what Peter says, we used this same approach for
uclamp and it proved to be a very effective way to come up with a good
design. General principles have been:

- a system wide API [1] (under /proc/sys/kernel/sched_*) defines
default values for all tasks affected by that feature.
This interface also has to define upper bounds for task specific
values. Thus, in the case of latency-nice, it should be set by
default to the MIN value, since that's the current mainline
behaviour: all tasks are latency sensitive.

- a per-task API [2] (via the sched_setattr() syscall) can be used to
relax the system wide setting thus implementing a "nice" policy.

- a per-taskgroup API [3] (via cpu controller's attributes) can be used
to relax the system-wide settings and restrict the per-task API.

The above features are worth adding in that exact order.
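To make the per-task bit concrete, here is a rough user-space sketch of
how a sched_setattr() extension could look; the sched_latency_nice field
and the flag below are invented for illustration and are not an existing
ABI:

/* Illustrative only: the latency-nice field and flag are hypothetical. */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr_ext {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* hypothetical new field */
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* hypothetical new flag */

static int set_latency_nice(pid_t pid, int value)
{
	struct sched_attr_ext attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = value;

	/* Relax the system-wide default for this one task. */
	return syscall(SYS_sched_setattr, pid, &attr, 0);
}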

> Also, your Changelog fails on pretty much every point. It doesn't
> explain why, it doesn't describe anything and so on.

On the description side, I guess it's worth mentioning somewhere
which scheduling classes this feature can be useful for. It's worth
mentioning that it can apply only to:

- CFS tasks: for example, at wakeup time a task with a high
latency-nice value should avoid preempting a low latency-nice task.
Maybe by mapping the latency-nice value into a proper vruntime
normalization value?

- RT tasks: for example, at wakeup time a task with a high
latency-nice value could avoid preempting a CFS task.

I'm sure there will be discussion about some of these features, that's
why it's important in the proposal presentation to keep a well defined
distinction between the "mechanisms and API" and how we use the new
concept to "bias" some scheduler policies.

> From just reading the above, I would expect it to have the range
> [-20,19] just like normal nice. Apparently this is not so.

Regarding the range for the latency-nice values, I guess we have two
options:

- [-20..19], which makes it similar to priorities
downside: we quite likely end up with a kernel space representation
which does not match the user-space one, e.g. look at
task_struct::prio.

- [0..1024], which makes it more similar to a "percentage"

Latency-nice being a new concept, we are not constrained by POSIX and
IMHO the [0..1024] scale is a better fit.

That will translate into:

latency-nice=0 : default (current mainline) behaviour, all "biasing"
policies are disabled and we wake up as fast as possible

latency-nice=1024 : maximum niceness, where for example we can imagine
switching a CFS task to be SCHED_IDLE?

Best,
Patrick

[1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
[2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
[3] 5 patches in today's tip/sched/core up to:
commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")

--
#include <best/regards.h>

Patrick Bellasi

2019-09-05 12:13:11

by Patrick Bellasi

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice


We already commented on adding the cgroup API after the per-task API.

However, for the cgroup bits it will be super important to have

[ +tejun ]

in CC, since here we are discussing the idea of adding a new cpu
controller attribute.

There are opinions about which kind of attributes can be added to
cgroups and I'm sure a "latency-nice" attribute will generate an
interesting discussion. :)

LPC is coming up, perhaps we can get the chance to have a chat with
Tejun about the manoeuvring space in this area.

On Fri, Aug 30, 2019 at 18:49:36 +0100, subhra mazumdar wrote...

> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.
>
> Signed-off-by: subhra mazumdar <[email protected]>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 40 ++++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 1 +
> kernel/sched/sched.h | 8 ++++++++
> 4 files changed, 50 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1183741..b4a79c3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -631,6 +631,7 @@ struct task_struct {
> int static_prio;
> int normal_prio;
> unsigned int rt_priority;
> + u64 latency_nice;

I guess we can save some bit here... or, if we are very brave, maybe we
can explore the possibility to pack all prios into a single u64?

( ( (tomatoes target here) ) )

> const struct sched_class *sched_class;
> struct sched_entity se;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427..47969bc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5976,6 +5976,7 @@ void __init sched_init(void)
> init_dl_rq(&rq->dl);
> #ifdef CONFIG_FAIR_GROUP_SCHED
> root_task_group.shares = ROOT_TASK_GROUP_LOAD;
> + root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
> INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
> rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
> /*
> @@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
> */
> tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
> struct task_group, css);
> + tsk->latency_nice = tg->latency_nice;
> tg = autogroup_task_group(tsk, tg);
> tsk->sched_task_group = tg;
>
> @@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
> }
> #endif /* CONFIG_RT_GROUP_SCHED */
>
> +static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + struct task_group *tg = css_tg(css);
> +
> + return tg->latency_nice;
> +}
> +
> +static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft, u64 latency_nice)
> +{
> + struct task_group *tg = css_tg(css);
> + struct css_task_iter it;
> + struct task_struct *p;
> +
> + if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
> + return -ERANGE;
> +
> + tg->latency_nice = latency_nice;
> +
> + css_task_iter_start(css, 0, &it);
> + while ((p = css_task_iter_next(&it)))
> + p->latency_nice = latency_nice;

Once (and if) the cgroup API is added we can avoid this (potentially
massive) "update on write" in favour of an "on demand composition at
wakeup-time".

We don't care about updating the latency-nice of NON RUNNABLE tasks,
do we?

AFAIK, we need that value only (or mostly) at wakeup time. Thus, when a
task wakes up we can easily compose (and eventually cache) its
current latency-nice value by considering, in priority order:

- the system wide upper-bound
- the task group restriction
- the task specific relaxation

Something similar to what we already do for uclamp composition with this
patch currently in tip/sched/core:

commit 3eac870a3247 ("sched/uclamp: Use TG's clamps to restrict TASK's clamps")
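Just to illustrate the kind of composition I have in mind (none of the
symbols below exist, this is only a sketch):

/*
 * Illustrative sketch only: compose the effective latency-nice at
 * wakeup time instead of writing it back into every task on a cgroup
 * update. The symbol names are invented.
 */
static inline u64 effective_latency_nice(struct task_struct *p)
{
	u64 value = p->latency_nice;		/* per-task relaxation */

	/* The task group can only restrict the per-task value... */
	value = min(value, task_group(p)->latency_nice);

	/* ...and the system-wide knob bounds everything. */
	return min(value, (u64)sysctl_sched_latency_nice_max);
}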


> + css_task_iter_end(&it);
> +
> + return 0;
> +}
> +
> static struct cftype cpu_legacy_files[] = {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> {
> @@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
> .write_u64 = cpu_rt_period_write_uint,
> },
> #endif
> + {
> + .name = "latency-nice",
> + .read_u64 = cpu_latency_nice_read_u64,
> + .write_u64 = cpu_latency_nice_write_u64,
> + },
> { } /* Terminate */
> };
>
> @@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
> .write = cpu_max_write,
> },
> #endif
> + {
> + .name = "latency-nice",
> + .read_u64 = cpu_latency_nice_read_u64,
> + .write_u64 = cpu_latency_nice_write_u64,
> + },
> { } /* terminate */
> };
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..b08d00c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> goto err;
>
> tg->shares = NICE_0_LOAD;
> + tg->latency_nice = LATENCY_NICE_DEFAULT;
^^^^^^^^^^^^^^^^^^^^
Maybe better NICE_0_LATENCY to be more consistent?


> init_cfs_bandwidth(tg_cfs_bandwidth(tg));
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
> #define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT)
>
> /*
> + * Latency-nice default value
> + */
> +#define LATENCY_NICE_DEFAULT 5
> +#define LATENCY_NICE_MIN 1
> +#define LATENCY_NICE_MAX 100

Values 1 and 5 look kind of arbitrary.
For the range specifically, I already commented in this other message:

Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/

> +
> +/*
> * Single value that decides SCHED_DEADLINE internal math precision.
> * 10 -> just above 1us
> * 9 -> just above 0.5us
> @@ -362,6 +369,7 @@ struct cfs_bandwidth {
> /* Task group related information */
> struct task_group {
> struct cgroup_subsys_state css;
> + u64 latency_nice;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /* schedulable entities of this group on each CPU */


Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-09-05 12:13:17

by Patrick Bellasi

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice


On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...

> On 9/4/19 11:02 PM, Tim Chen wrote:
>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>>
>>
>> Subhra,
>>
>> Thanks for posting the patchset. Having a latency nice hint
>> is useful beyond idle load balancing. I can think of other
>> application scenarios, like scheduling batch machine learning AVX 512
>> processes with latency sensitive processes. AVX512 limits the frequency
>> of the CPU and it is best to avoid latency sensitive task on the
>> same core with AVX512. So latency nice hint allows the scheduler
>> to have a criteria to determine the latency sensitivity of a task
>> and arrange latency sensitive tasks away from AVX512 tasks.
>>
>
>
> Hi Tim and Subhra,
>
> This patchset seems to be interesting for my TurboSched patches as well
> where I try to pack jitter tasks on fewer cores to get higher Turbo Frequencies.
> Well, the problem I face is that we sometime end up putting multiple jitter tasks on a core
> running some latency sensitive application which may see performance degradation.
> So my plan was to classify such tasks to be latency sensitive thereby hinting the load
> balancer to not put tasks on such cores.
>
> TurboSched: https://lkml.org/lkml/2019/7/25/296
>
>> You configure the latency hint on a cgroup basis.
>> But I think not all tasks in a cgroup necessarily have the same
>> latency sensitivity.
>>
>> For example, I can see that cgroup can be applied on a per user basis,
>> and the user could run different tasks that have different latency sensitivity.
>> We may also need a way to configure latency sensitivity on a per task basis instead on
>> a per cgroup basis.
>>
>
> AFAIU, the problem defined above intersects with my patches as well where the interface
> is required to classify the jitter tasks. I have already tried few methods like
> syscall and cgroup to classify such tasks and maybe something like that can be adopted
> with these patchset as well.

Agree, these two patchsets are definitely overlapping in terms of
mechanisms and APIs to expose to userspace. You two guys seem to target
different goals, but the general approach should be:

- expose a single and abstract concept to user-space,
latency-nice or latency-tolerant as PaulT proposed at OSPM

- map this concept in kernel-space to different kinds of bias, both at
wakeup time and load-balance time, and use both for RT and CFS tasks.

That's my understanding at least ;)

I guess we will have interesting discussions at the upcoming LPC to
figure out a solution fitting all needs.

> Thanks,
> Parth

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-09-05 13:08:08

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> On 09/05/19 12:46, Peter Zijlstra wrote:

> > This is important because we want to be able to bias towards less
> > importance to (tail) latency as well as more importantance to (tail)
> > latency.
> >
> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Another use case I'm considering is using latency-nice to prefer an idle CPU if
> latency-nice is set otherwise go for the most energy efficient CPU.
>
> Ie: sacrifice (some) energy for latency.
>
> The way I see interpreting latency-nice here as a binary switch. But
> maybe we can use the range to select what (some) energy to sacrifice
> mean here. Hmmm.

It cannot be binary, per definition it must be ternary, that is, <0, ==0
and >0 (or a middle value if you're of that persuasion).

In your case, I'm thinking you mean >0, we want to lower the latency.

Anyway; there were a number of things mentioned at OSPM that we could
tie into this thing and finding sensible mappings is going to be a bit
of trial and error I suppose.

But as patrick said; we're very much exporting a BIAS knob, not a set of
behaviours.

2019-09-05 13:12:46

by Patrick Bellasi

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice


On Thu, Sep 05, 2019 at 12:30:02 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>> On 09/05/19 12:46, Peter Zijlstra wrote:
>
>> > This is important because we want to be able to bias towards less
>> > importance to (tail) latency as well as more importantance to (tail)
>> > latency.
>> >
>> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>
>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
>> latency-nice is set otherwise go for the most energy efficient CPU.
>>
>> Ie: sacrifice (some) energy for latency.
>>
>> The way I see interpreting latency-nice here as a binary switch. But
>> maybe we can use the range to select what (some) energy to sacrifice
>> mean here. Hmmm.
>
> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).
>
> In your case, I'm thinking you mean >0, we want to lower the latency.
>
> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
>
> But as patrick said; we're very much exporting a BIAS knob, not a set of
> behaviours.

Right, although I think behaviours could still be exported but via a
different and configurable interface, using thresholds.

Either at compile time or via procfs maybe we can expose and properly
document what happens in the scheduler if/when a task has a "latency
niceness" crossing a given threshold.

For example, by setting something like:

/proc/sys/kernel/sched_cfs_latency_idle = 1000

we state that the task is going to be scheduled according to the
SCHED_IDLE policy.

( ( (tomatoes target here) ) )
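Just to make the idea concrete, a minimal sketch (the sysctl and the
helper below are invented for illustration):

/* Illustrative only: the sysctl and helper names are invented. */
unsigned int sysctl_sched_cfs_latency_idle = 1000;

static inline bool latency_nice_as_idle(struct task_struct *p)
{
	/*
	 * A task whose latency niceness crosses the configured threshold
	 * would be scheduled as if it were SCHED_IDLE.
	 */
	return p->latency_nice >= sysctl_sched_cfs_latency_idle;
}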

Also not sure if we wanna commit to user-space APIs how we internally
map/translate a "latency niceness" value into a scheduler behaviour
bias. Maybe better not, at least at the very beginning.

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-09-05 13:16:31

by Qais Yousef

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On 09/05/19 13:30, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> > On 09/05/19 12:46, Peter Zijlstra wrote:
>
> > > This is important because we want to be able to bias towards less
> > > importance to (tail) latency as well as more importantance to (tail)
> > > latency.
> > >
> > > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> >
> > Another use case I'm considering is using latency-nice to prefer an idle CPU if
> > latency-nice is set otherwise go for the most energy efficient CPU.
> >
> > Ie: sacrifice (some) energy for latency.
> >
> > The way I see interpreting latency-nice here as a binary switch. But
> > maybe we can use the range to select what (some) energy to sacrifice
> > mean here. Hmmm.
>
> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).

I meant I want to use it as a binary.

>
> In your case, I'm thinking you mean >0, we want to lower the latency.

Yes. As long as there's an easy way to say: does this task care about latency
or not I'm good.

>
> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
>
> But as patrick said; we're very much exporting a BIAS knob, not a set of
> behaviours.

Agreed. I just wanted to say that the way this range is going to be
interpreted will differ from path to path and we need to consider that in the
final mapping. Especially from the final user's perspective of what setting
this value ultimately means to them.

--
Qais Yousef

2019-09-05 13:17:05

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Thu, Sep 05, 2019 at 12:40:01PM +0100, Patrick Bellasi wrote:
> Right, although I think behaviours could still be exported but via a
> different and configurable interface, using thresholds.

I would try _really_ hard to avoid pinning down behaviour. The more you
do that, the less you can change.

2019-09-05 13:53:42

by Patrick Bellasi

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice


On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...

> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>> There's a few things wrong there; I really feel that if we call it nice,
>>> it should be like nice. Otherwise we should call it latency-bias and not
>>> have the association with nice to confuse people.
>>>
>>> Secondly; the default should be in the middle of the range. Naturally
>>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>>> want [0,1024], then the default really should be 512, but personally I
>>> like 0 better as a default, in which case we need negative numbers.
>>>
>>> This is important because we want to be able to bias towards less
>>> importance to (tail) latency as well as more importantance to (tail)
>>> latency.
>>>
>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>>
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>>
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>>
>> Give the two extremes above, would not be much simpler and intuitive to
>> have 0 implementing the FB/Android (no latency) case and 1024 the
>> (max latency) Oracle case?
>>
>
> For something like latency-<whatever>, I don't see the point of having
> such a wide range. The nice range is probably more than enough - and before
> even bothering about the range, we should probably agree on what the range
> should represent.
>
> If it's niceness, I read it as: positive latency-nice value means we're
> nice to latency, means we reduce it. So the further up you go, the more you
> restrict your wakeup scan. I think it's quite easy to map that into the
> code: current behaviour at 0, with a decreasing scan mask size as we go
> towards +19. I don't think anyone needs 512 steps to tune this.
>
> I don't know what logic we'd follow for negative values though. Maybe
> latency-nice -20 means always going through the slowpath, but what of the
> intermediate values?

Yep, I think so far we are all converging towards the idea of using a
signed range. Regarding the range itself, yes: 1024 looks very
oversized, but +-20 is still something which leaves room for a bit of
flexibility and it also better matches the idea that we don't want to
"enumerate behaviours" but just expose a knob. To map certain "biases" we
could benefit from a slightly larger range.

> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be

For the wakeup path there is also the TurboSched proposal by Parth:

Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/

we should keep in mind.

> applied elsewhere (e.g. load-balance, something like task_hot() and its
> use of sysctl_sched_migration_cost).

For LB, can you come up with a better description of which usages you
see could benefit from a "per task" or "per task-group" latency niceness?

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-09-05 14:51:15

by Valentin Schneider

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On 05/09/2019 14:07, Patrick Bellasi wrote:
> Yep, I think so fare we are all converging towards the idea to use the
> a signed range. Regarding the range itself, yes: 1024 looks very
> oversized, but +-20 is still something which leave room for a bit of
> flexibility and it also better matches the idea that we don't want to
> "enumerate behaviours" but just expose a knob. To map certain "bias" we
> could benefit from a slightly larger range.
>
>> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
>
> For the wakeup path there is also the TurboSched proposal by Parth:
>
> Message-ID: <[email protected]>
> https://lore.kernel.org/lkml/[email protected]/
>
> we should keep in mind.
>
>> applied elsewhere (e.g. load-balance, something like task_hot() and its
>> use of sysctl_sched_migration_cost).
>
> For LB can you come up with some better description of what usages you
> see could benefit from a "per task" or "per task-group" latency niceness?
>

task_hot() "ratelimits" migrations of tasks that were running up until
very recently (hence "cache hot"), but the knob is system wide. It might
make sense to exploit a per-task attribute to tune this (e.g. go wild if
not latency sensitive, otherwise stay away for longer).
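Something along these lines, purely to illustrate (the scaling is picked
out of thin air, and the real task_hot() has more checks than shown):

/* Illustrative sketch only: scale the cache-hot window per task. */
static int task_hot(struct task_struct *p, struct lb_env *env)
{
	s64 delta = rq_clock_task(env->src_rq) - p->se.exec_start;
	u64 cost = sysctl_sched_migration_cost;

	/*
	 * Latency sensitive tasks keep a longer "hot" window so the load
	 * balancer stays away for longer; latency tolerant tasks can be
	 * migrated freely.
	 */
	if (p->latency_nice <= LATENCY_NICE_MIN)
		cost *= 4;
	else if (p->latency_nice >= LATENCY_NICE_MAX)
		cost = 0;

	return delta < (s64)cost;
}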

We could perhaps even apply it to active load balance to similarly stay
away from latency sensitive tasks. Right now this is gated by a
sched_domain-wide attribute (nr_balance_failed). We could tweak this to
require more (fewer) failed attempts before interrupting latency
(in)sensitive tasks.

I'm sure we can come up with even more creative ways to pour even more
heuristics in there ;)

> Best,
> Patrick
>

2019-09-05 15:14:05

by Valentin Schneider

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On 05/09/2019 12:18, Patrick Bellasi wrote:
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>>
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>>
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importantance to (tail)
>> latency.
>>
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Right, we have this dualism to deal with and current mainline behaviour
> is somehow in the middle.
>
> BTW, the FB requirement is the same we have in Android.
> We want some CFS tasks to have very small latency and a low chance
> to be preempted by the wake-up of less-important "background" tasks.
>
> I'm not totally against the usage of a signed range, but I'm thinking
> that since we are introducing a new (non POSIX) concept we can get the
> chance to make it more human friendly.
>
> Give the two extremes above, would not be much simpler and intuitive to
> have 0 implementing the FB/Android (no latency) case and 1024 the
> (max latency) Oracle case?
>

For something like latency-<whatever>, I don't see the point of having
such a wide range. The nice range is probably more than enough - and before
even bothering about the range, we should probably agree on what the range
should represent.

If it's niceness, I read it as: positive latency-nice value means we're
nice to latency, means we reduce it. So the further up you go, the more you
restrict your wakeup scan. I think it's quite easy to map that into the
code: current behaviour at 0, with a decreasing scan mask size as we go
towards +19. I don't think anyone needs 512 steps to tune this.
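To illustrate the kind of mapping I mean (assuming a signed [-20,19]
range; the arithmetic and the helper are invented):

/* Illustrative only: shrink the idle-CPU scan as latency-nice grows. */
static int sis_scan_size(struct task_struct *p, struct sched_domain *sd)
{
	int nr = sd->span_weight;

	/* Current behaviour at 0 (negative values left aside here). */
	if (p->latency_nice <= 0)
		return nr;

	/* At +19 only the target core is considered. */
	return max(1, nr - nr * (int)p->latency_nice / 19);
}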

I don't know what logic we'd follow for negative values though. Maybe
latency-nice -20 means always going through the slowpath, but what of the
intermediate values?

AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
applied elsewhere (e.g. load-balance, something like task_hot() and its
use of sysctl_sched_migration_cost).

> Moreover, we will never match completely the nice semantic, give that
> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
> usage change for each step?
>
> For latency-nice instead we will likely base our biasing strategies on
> some predefined (maybe system-wide configurable) const thresholds.
>
> Could changing the name to "latency-tolerance" break the tie by marking
> its difference wrt prior/nice levels? AFAIR, that was also the original
> proposal [1] by PaulT during the OSPM discussion.
>
> Best,
> Patrick
>
> [1] https://youtu.be/oz43thSFqmk?t=1302
>

2019-09-05 15:16:33

by Patrick Bellasi

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice


On Thu, Sep 05, 2019 at 12:40:30 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:
>
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>>
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>>
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>
> I'm arguing that signed _is_ more human friendly ;-)

... but you are not human. :)

>> Give the two extremes above, would not be much simpler and intuitive to
>> have 0 implementing the FB/Android (no latency) case and 1024 the
>> (max latency) Oracle case?
>
> See, I find the signed thing more natural, negative is a bias away from
> latency sensitive, positive is a bias towards latency sensitive.
>
> Also; 0 is a good default value ;-)

Yes, that's appealing indeed.

>> Moreover, we will never match completely the nice semantic, give that
>> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
>> usage change for each step?
>
> Only because we were nice when implementing it. Posix leaves it
> unspecified and we could change it at any time. The only real semantics
> is a relative 'weight' (opengroup uses the term 'favourable').

Good to know, I was considering it a POSIX requirement.

>> Could changing the name to "latency-tolerance" break the tie by marking
>> its difference wrt prior/nice levels? AFAIR, that was also the original
>> proposal [1] by PaulT during the OSPM discussion.
>
> latency torrerance could still be a signed entity, positive would
> signify we're more tolerant of latency (ie. less sensitive) while
> negative would be less tolerant (ie. more sensitive).

Right.

>> For latency-nice instead we will likely base our biasing strategies on
>> some predefined (maybe system-wide configurable) const thresholds.
>
> I'm not quite sure; yes, for some of these things, like the idle search
> on wakeup, certainly. But say for wakeup-preemption, we could definitely
> make it a task relative attribute.

--
#include <best/regards.h>

Patrick Bellasi

2019-09-05 15:23:01

by Qais Yousef

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On 09/05/19 12:46, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>
> > > From just reading the above, I would expect it to have the range
> > > [-20,19] just like normal nice. Apparently this is not so.
> >
> > Regarding the range for the latency-nice values, I guess we have two
> > options:
> >
> > - [-20..19], which makes it similar to priorities
> > downside: we quite likely end up with a kernel space representation
> > which does not match the user-space one, e.g. look at
> > task_struct::prio.
> >
> > - [0..1024], which makes it more similar to a "percentage"
> >
> > Being latency-nice a new concept, we are not constrained by POSIX and
> > IMHO the [0..1024] scale is a better fit.
> >
> > That will translate into:
> >
> > latency-nice=0 : default (current mainline) behaviour, all "biasing"
> > policies are disabled and we wakeup up as fast as possible
> >
> > latency-nice=1024 : maximum niceness, where for example we can imaging
> > to turn switch a CFS task to be SCHED_IDLE?
>
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
>
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. but if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
>
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importantance to (tail)
> latency.
>
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.

Another use case I'm considering is using latency-nice to prefer an idle CPU if
latency-nice is set, otherwise go for the most energy efficient CPU.

Ie: sacrifice (some) energy for latency.

The way I see it, latency-nice here is interpreted as a binary switch. But maybe we
can use the range to select what sacrificing (some) energy means here. Hmmm.
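Roughly what I have in mind, as a sketch only (the helper and the check
are invented):

/*
 * Illustrative sketch only: in the EAS wakeup path, a latency sensitive
 * task skips the energy-aware placement and just grabs an idle CPU.
 */
static int select_cpu_sketch(struct task_struct *p, int prev_cpu)
{
	if (task_prefers_low_latency(p))	/* invented helper */
		return select_idle_sibling(p, prev_cpu, prev_cpu);

	/* Otherwise pick the most energy efficient CPU. */
	return find_energy_efficient_cpu(p, prev_cpu);
}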

--
Qais Yousef

2019-09-05 15:27:47

by Patrick Bellasi

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice


On Thu, Sep 05, 2019 at 11:46:16 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>
>> > From just reading the above, I would expect it to have the range
>> > [-20,19] just like normal nice. Apparently this is not so.
>>
>> Regarding the range for the latency-nice values, I guess we have two
>> options:
>>
>> - [-20..19], which makes it similar to priorities
>> downside: we quite likely end up with a kernel space representation
>> which does not match the user-space one, e.g. look at
>> task_struct::prio.
>>
>> - [0..1024], which makes it more similar to a "percentage"
>>
>> Being latency-nice a new concept, we are not constrained by POSIX and
>> IMHO the [0..1024] scale is a better fit.
>>
>> That will translate into:
>>
>> latency-nice=0 : default (current mainline) behaviour, all "biasing"
>> policies are disabled and we wakeup up as fast as possible
>>
>> latency-nice=1024 : maximum niceness, where for example we can imaging
>> to turn switch a CFS task to be SCHED_IDLE?
>
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
>
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. but if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
>
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importantance to (tail)
> latency.
>
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.

Right, we have this dualism to deal with and current mainline behaviour
is somehow in the middle.

BTW, the FB requirement is the same we have in Android.
We want some CFS tasks to have very small latency and a low chance
to be preempted by the wake-up of less-important "background" tasks.

I'm not totally against the usage of a signed range, but I'm thinking
that since we are introducing a new (non POSIX) concept we can get the
chance to make it more human friendly.

Given the two extremes above, would it not be much simpler and more intuitive to
have 0 implementing the FB/Android (no latency) case and 1024 the
(max latency) Oracle case?

Moreover, we will never completely match the nice semantics, given that
a 1 nice unit has a proper mathematical meaning; isn't it something like a 10% CPU
usage change for each step?

For latency-nice instead we will likely base our biasing strategies on
some predefined (maybe system-wide configurable) const thresholds.

Could changing the name to "latency-tolerance" break the tie by marking
its difference wrt prio/nice levels? AFAIR, that was also the original
proposal [1] by PaulT during the OSPM discussion.

Best,
Patrick

[1] https://youtu.be/oz43thSFqmk?t=1302

--
#include <best/regards.h>

Patrick Bellasi

2019-09-05 15:28:51

by Patrick Bellasi

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice


On Thu, Sep 05, 2019 at 12:13:47 +0100, Qais Yousef wrote...

> On 09/05/19 12:46, Peter Zijlstra wrote:
>> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>>
>> > > From just reading the above, I would expect it to have the range
>> > > [-20,19] just like normal nice. Apparently this is not so.
>> >
>> > Regarding the range for the latency-nice values, I guess we have two
>> > options:
>> >
>> > - [-20..19], which makes it similar to priorities
>> > downside: we quite likely end up with a kernel space representation
>> > which does not match the user-space one, e.g. look at
>> > task_struct::prio.
>> >
>> > - [0..1024], which makes it more similar to a "percentage"
>> >
>> > Being latency-nice a new concept, we are not constrained by POSIX and
>> > IMHO the [0..1024] scale is a better fit.
>> >
>> > That will translate into:
>> >
>> > latency-nice=0 : default (current mainline) behaviour, all "biasing"
>> > policies are disabled and we wakeup up as fast as possible
>> >
>> > latency-nice=1024 : maximum niceness, where for example we can imaging
>> > to turn switch a CFS task to be SCHED_IDLE?
>>
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>>
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>>
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importantance to (tail)
>> latency.
>>
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Another use case I'm considering is using latency-nice to prefer an idle CPU if
> latency-nice is set otherwise go for the most energy efficient CPU.
>
> Ie: sacrifice (some) energy for latency.
>
> The way I see interpreting latency-nice here as a binary switch. But maybe we
> can use the range to select what (some) energy to sacrifice mean here. Hmmm.

I see this concept possibly evolving into something more than just a
binary switch. Not yet convinced if it makes sense and/or it's possible
but, in principle, I was thinking about these possible usages for CFS
tasks:

- dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
depending on crossing certain pre-configured threshold of latency
niceness.

- dynamically bias the vruntime updates we do in place_entity()
depending on the actual latency niceness of a task.

- bias the decisions we take in check_preempt_tick() still depending
on a relative comparison of the current and wakeup task latency
niceness values.
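For the last point, a rough sketch of the relative comparison I mean
(the helper is invented, only to illustrate):

/*
 * Illustrative only: bias wakeup preemption by the relative latency
 * niceness of the current and waking tasks.
 */
static inline long latency_nice_delta(struct task_struct *curr,
				      struct task_struct *wakee)
{
	/*
	 * Positive: the wakee is more latency tolerant than current, so
	 * preempting is less urgent. Negative: the wakee is more latency
	 * sensitive and deserves a preemption bonus.
	 */
	return (long)wakee->latency_nice - (long)curr->latency_nice;
}

Such a delta could then be scaled into the wakeup granularity used by the
existing preemption checks.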

--
#include <best/regards.h>

Patrick Bellasi

2019-09-05 15:44:44

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:

> Right, we have this dualism to deal with and current mainline behaviour
> is somehow in the middle.
>
> BTW, the FB requirement is the same we have in Android.
> We want some CFS tasks to have very small latency and a low chance
> to be preempted by the wake-up of less-important "background" tasks.
>
> I'm not totally against the usage of a signed range, but I'm thinking
> that since we are introducing a new (non POSIX) concept we can get the
> chance to make it more human friendly.

I'm arguing that signed _is_ more human friendly ;-)

> Give the two extremes above, would not be much simpler and intuitive to
> have 0 implementing the FB/Android (no latency) case and 1024 the
> (max latency) Oracle case?

See, I find the signed thing more natural, negative is a bias away from
latency sensitive, positive is a bias towards latency sensitive.

Also; 0 is a good default value ;-)

> Moreover, we will never match completely the nice semantic, give that
> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
> usage change for each step?

Only because we were nice when implementing it. Posix leaves it
unspecified and we could change it at any time. The only real semantics
is a relative 'weight' (opengroup uses the term 'favourable').
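(For reference, with the current weight table adjacent nice levels differ
by roughly a factor of 1.25, e.g. 1024 vs. 820 for nice 0 vs. nice 1; for
two otherwise equal competing tasks that works out to about
1024/(1024+820) ~= 55% vs. 45% of the CPU, hence the ~10% per step figure.)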

> Could changing the name to "latency-tolerance" break the tie by marking
> its difference wrt prior/nice levels? AFAIR, that was also the original
> proposal [1] by PaulT during the OSPM discussion.

latency tolerance could still be a signed entity, positive would
signify we're more tolerant of latency (ie. less sensitive) while
negative would be less tolerant (ie. more sensitive).

> For latency-nice instead we will likely base our biasing strategies on
> some predefined (maybe system-wide configurable) const thresholds.

I'm not quite sure; yes, for some of these things, like the idle search
on wakeup, certainly. But say for wakeup-preemption, we could definitely
make it a task relative attribute.

2019-09-05 16:19:20

by Qais Yousef

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On 09/05/19 13:48, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 12:40:01PM +0100, Patrick Bellasi wrote:
> > Right, although I think behaviours could still be exported but via a
> > different and configurable interface, using thresholds.
>
> I would try _really_ hard to avoid pinning down behaviour. The more you
> do that, the less you can change.

While I agree with that, I find there's a contradiction between not
'pinning down behaviour' and having an 'easy and clear way to bias latency
sensitivity from the end user's perspective'.

Maybe we should protect this with a kconfig + experimental tag until trial
and error show the best way forward?

--
Qais Yousef

2019-09-05 16:37:03

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Thu, Sep 05, 2019 at 12:30:52PM +0100, Patrick Bellasi wrote:

> I see this concept possibly evolving into something more then just a
> binary switch. Not yet convinced if it make sense and/or it's possible
> but, in principle, I was thinking about these possible usages for CFS
> tasks:
>
> - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
> depending on crossing certain pre-configured threshold of latency
> niceness.

A big part of BATCH is wakeup preemption (batch doesn't preempt itself),
and wakeup preemption is a task-task property and can thus be completely
relative.

> - dynamically bias the vruntime updates we do in place_entity()
> depending on the actual latency niceness of a task.

That is dangerous; theory says we should keep track of the 0-lag point
and place it back where we found it. BFQ does this correctly IIRC, but
for CFS I've never done that because 'expensive'.

But yes, we could (carefully) fumble a bit there.

> - bias the decisions we take in check_preempt_tick() still depending
> on a relative comparison of the current and wakeup task latency
> niceness values.

Ack.

Placing relative and absolute behaviour on the same scale is going to be
'fun' :-)

2019-09-06 15:25:47

by Parth Shah

Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice



On 9/5/19 3:41 PM, Patrick Bellasi wrote:
>
> On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...
>
>> On 9/4/19 11:02 PM, Tim Chen wrote:
>>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>>> "latency-nice" which is shared by all the threads in that Cgroup.
>>>
>>>
>>> Subhra,
>>>
>>> Thanks for posting the patchset. Having a latency nice hint
>>> is useful beyond idle load balancing. I can think of other
>>> application scenarios, like scheduling batch machine learning AVX 512
>>> processes with latency sensitive processes. AVX512 limits the frequency
>>> of the CPU and it is best to avoid latency sensitive task on the
>>> same core with AVX512. So latency nice hint allows the scheduler
>>> to have a criteria to determine the latency sensitivity of a task
>>> and arrange latency sensitive tasks away from AVX512 tasks.
>>>
>>
>>
>> Hi Tim and Subhra,
>>
>> This patchset seems to be interesting for my TurboSched patches as well
>> where I try to pack jitter tasks on fewer cores to get higher Turbo Frequencies.
>> Well, the problem I face is that we sometime end up putting multiple jitter tasks on a core
>> running some latency sensitive application which may see performance degradation.
>> So my plan was to classify such tasks to be latency sensitive thereby hinting the load
>> balancer to not put tasks on such cores.
>>
>> TurboSched: https://lkml.org/lkml/2019/7/25/296
>>
>>> You configure the latency hint on a cgroup basis.
>>> But I think not all tasks in a cgroup necessarily have the same
>>> latency sensitivity.
>>>
>>> For example, I can see that cgroup can be applied on a per user basis,
>>> and the user could run different tasks that have different latency sensitivity.
>>> We may also need a way to configure latency sensitivity on a per task basis instead on
>>> a per cgroup basis.
>>>
>>
>> AFAIU, the problem defined above intersects with my patches as well where the interface
>> is required to classify the jitter tasks. I have already tried few methods like
>> syscall and cgroup to classify such tasks and maybe something like that can be adopted
>> with these patchset as well.
>
> Agree, these two patchest are definitively overlapping in terms of
> mechanisms and APIs to expose to userspace. You to guys seems to target
> different goals but the general approach should be:
>
> - expose a single and abstract concept to user-space
> latency-nice or latency-tolerant as PaulT proposed at OSPM
>

I agree. Both patchsets try to classify tasks, for different purposes, in the interest of better latency.
TurboSched needs to know whether a task is a jitter task that should not be given
extra resources/frequency, which is a boolean classification.
Latency-nice, on the other hand, is a range. So does that mean that a max-latency-nice task is a jitter task?

I was thinking of not packing jitter tasks on a core that is occupied by a
min-latency-nice (i.e. latency sensitive) task, as long as there are other busier cores to pack onto.

Given this, we could expose a single per-task attribute to user-space through a syscall, right?
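
Purely as a hypothetical sketch of what the kernel side of such a syscall
could reuse from this series (the helper does not exist anywhere; only the
LATENCY_NICE_MIN/MAX bounds are taken from the posted patches):

/*
 * Hypothetical helper behind a per-task interface, e.g. an extended
 * sched_setattr().  Name, locking and the exact field width are left
 * out or invented purely for illustration.
 */
static int __sched_set_latency_nice(struct task_struct *p, long latency_nice)
{
        if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
                return -ERANGE;

        p->latency_nice = latency_nice;
        return 0;
}

A jitter task could then be set to the most latency-tolerant end of the
range, and a latency sensitive one to the opposite end.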

> - map this concept in kernel-space to different kind of bias, both at
> wakeup time and load-balance time, and use both for RT and CFS tasks.
>
> That's my understanding at least ;)
>
> I guess we will have interesting discussions at the upcoming LPC to
> figure out a solution fitting all needs.

Definitely.

>
>> Thanks,
>> Parth
>
> Best,
> Patrick
>

2019-09-06 15:27:46

by Parth Shah

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice



On 9/5/19 3:15 PM, Patrick Bellasi wrote:
>
> On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...
>
>> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>>
>> *sigh*, no. We start with a normal per task attribute, and then later,
>> if it is needed and makes sense, we add it to cgroups.
>
> FWIW, to add on top of what Peter says, we used this same approach for
> uclamp and it proved to be a very effective way to come up with a good
> design. General principles have been:
>
> - a system wide API [1] (under /proc/sys/kernel/sched_*) defines
> default values for all tasks affected by that feature.
> This interface has to define also upper bounds for task specific
> values. Thus, in the case of latency-nice, it should be set by
> default to the MIN value, since that's the current mainline
> behaviour: all tasks are latency sensitive.
>
> - a per-task API [2] (via the sched_setattr() syscall) can be used to
> relax the system wide setting thus implementing a "nice" policy.
>
> - a per-taskgroup API [3] (via cpu controller's attributes) can be used
> to relax the system-wide settings and restrict the per-task API.
>
> The above features are worth to be added in that exact order.
>
>> Also, your Changelog fails on pretty much every point. It doesn't
>> explain why, it doesn't describe anything and so on.
>
> On the description side, I guess it's worth to mention somewhere to
> which scheduling classes this feature can be useful for. It's worth to
> mention that it can apply only to:
>
> - CFS tasks: for example, at wakeup time a task with an high
> latency-nice should avoid to preempt a low latency-nice task.
> Maybe by mapping the latency nice value into proper vruntime
> normalization value?
>

If I got this correctly, does this also mean that a task's latency-nice
will be mapped to prio/nice, i.e. that a task with min-latency-nice will
have the highest priority?

> - RT tasks: for example, at wakeup time a task with an high
> latency-nice value could avoid to preempt a CFS task.
>

So, will this make a CFS task precede an RT task
and cause priority inversion?

> I'm sure there will be discussion about some of these features, that's
> why it's important in the proposal presentation to keep a well defined
> distinction among the "mechanisms and API" and how we use the new
> concept to "bias" some scheduler policies.
>
>> From just reading the above, I would expect it to have the range
>> [-20,19] just like normal nice. Apparently this is not so.
>
> Regarding the range for the latency-nice values, I guess we have two
> options:
>
> - [-20..19], which makes it similar to priorities
> downside: we quite likely end up with a kernel space representation
> which does not match the user-space one, e.g. look at
> task_struct::prio.
>
> - [0..1024], which makes it more similar to a "percentage"
>
> Being latency-nice a new concept, we are not constrained by POSIX and
> IMHO the [0..1024] scale is a better fit.
>
> That will translate into:
>
> latency-nice=0 : default (current mainline) behaviour, all "biasing"
> policies are disabled and we wakeup up as fast as possible
>
> latency-nice=1024 : maximum niceness, where for example we can imaging
> to turn switch a CFS task to be SCHED_IDLE?
>
> Best,
> Patrick
>
> [1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
> [2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
> [3] 5 patches in today's tip/sched/core up to:
> commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
>

2019-09-06 18:31:19

by Parth Shah

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice



On 9/5/19 6:37 PM, Patrick Bellasi wrote:
>
> On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...
>
>> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>>> There's a few things wrong there; I really feel that if we call it nice,
>>>> it should be like nice. Otherwise we should call it latency-bias and not
>>>> have the association with nice to confuse people.
>>>>
>>>> Secondly; the default should be in the middle of the range. Naturally
>>>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>>>> want [0,1024], then the default really should be 512, but personally I
>>>> like 0 better as a default, in which case we need negative numbers.
>>>>
>>>> This is important because we want to be able to bias towards less
>>>> importance to (tail) latency as well as more importantance to (tail)
>>>> latency.
>>>>
>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>
>>> Right, we have this dualism to deal with and current mainline behaviour
>>> is somehow in the middle.
>>>
>>> BTW, the FB requirement is the same we have in Android.
>>> We want some CFS tasks to have very small latency and a low chance
>>> to be preempted by the wake-up of less-important "background" tasks.
>>>
>>> I'm not totally against the usage of a signed range, but I'm thinking
>>> that since we are introducing a new (non POSIX) concept we can get the
>>> chance to make it more human friendly.
>>>
>>> Give the two extremes above, would not be much simpler and intuitive to
>>> have 0 implementing the FB/Android (no latency) case and 1024 the
>>> (max latency) Oracle case?
>>>
>>
>> For something like latency-<whatever>, I don't see the point of having
>> such a wide range. The nice range is probably more than enough - and before
>> even bothering about the range, we should probably agree on what the range
>> should represent.
>>
>> If it's niceness, I read it as: positive latency-nice value means we're
>> nice to latency, means we reduce it. So the further up you go, the more you
>> restrict your wakeup scan. I think it's quite easy to map that into the
>> code: current behaviour at 0, with a decreasing scan mask size as we go
>> towards +19. I don't think anyone needs 512 steps to tune this.
>>
>> I don't know what logic we'd follow for negative values though. Maybe
>> latency-nice -20 means always going through the slowpath, but what of the
>> intermediate values?
>
> Yep, I think so fare we are all converging towards the idea to use the
> a signed range. Regarding the range itself, yes: 1024 looks very
> oversized, but +-20 is still something which leave room for a bit of
> flexibility and it also better matches the idea that we don't want to
> "enumerate behaviours" but just expose a knob. To map certain "bias" we
> could benefit from a slightly larger range.
>
>> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
>
> For the wakeup path there is also the TurboSched proposal by Parth:
>
> Message-ID: <[email protected]>
> https://lore.kernel.org/lkml/[email protected]/
>
> we should keep in mind.
>
>> applied elsewhere (e.g. load-balance, something like task_hot() and its
>> use of sysctl_sched_migration_cost).
>
> For LB can you come up with some better description of what usages you
> see could benefit from a "per task" or "per task-group" latency niceness?
>

I guess there is a use case in the context of thermal throttling.
If a task is heating up the core, then in the usual scenario POWER systems throttle
down to the rated frequency.
In such a case, if the task is latency sensitive (min latency-nice), we can move the
task around the chip to heat the chip up uniformly, allowing us to gain more performance
from a sustained higher frequency.
For this, we would need help from the active load balancer and latency-nice
classification on a per-task and/or per-group basis.

Hopefully this might be useful for other architectures as well, right?

> Best,
> Patrick
>

Thanks,
Parth

2019-09-06 19:18:39

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Fri, 6 Sep 2019 at 16:13, Valentin Schneider
<[email protected]> wrote:
>
> On 06/09/2019 13:45, Parth Shah wrote:
> > I guess there is a use case in the context of thermal throttling.
> > If a task is heating up the core, then in the usual scenario POWER systems throttle
> > down to the rated frequency.
> > In such a case, if the task is latency sensitive (min latency-nice), we can move the
> > task around the chip to heat the chip up uniformly, allowing us to gain more performance
> > from a sustained higher frequency.
> > For this, we would need help from the active load balancer and latency-nice
> > classification on a per-task and/or per-group basis.
> >
> > Hopefully this might be useful for other architectures as well, right?
> >
>
> Most of the functionality is already there, we're only really missing thermal
> pressure awareness. There was [1] but it seems to have died.

Thara is still working on it, but she has been sidetracked by other
activities, and it takes more time than expected to run all the tests with
different averaging windows and process the results.
>
>
> At least with CFS load balancing, if thermal throttling is correctly
> reflected as a CPU capacity reduction you will tend to move things away from
> that CPU, since load is balanced over capacities.
>
>
> For active balance, we actually already have a condition that moves a task
> to a less capacity-pressured CPU (although it is somewhat specific). So if
> thermal pressure follows that task (e.g. it's doing tons of vector/float),
> it will be rotated around.
>
> However there should be a point made on latency vs throughput. If you
> care about latency you probably do not want to active balance your task. If
> you care about throughput, it should be specified in some way (util-clamp
> says hello!).
>
> It sort of feels like you'd want an extension of misfit migration (salesman
> hat goes on from here) - misfit moves tasks that are CPU bound (IOW their
> util is >= 80% of the CPU capacity) to CPUs of higher capacity. It's only
> enabled for systems with asymmetric capacities, but could be enabled globally
> for "dynamically-created asymmetric capacities" (IOW RT/IRQ/thermal pressure
> on SMP systems).
>
> On top of that, if we make misfit consider e.g. uclamp.min (I don't think
> that's already the case), then you have your throughput knob to have *some*
> designated tasks move away from (thermal & else) pressure.
>
>
> [1]: https://lore.kernel.org/lkml/[email protected]/

2019-09-06 23:13:14

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On 06/09/2019 13:45, Parth Shah wrote:
> I guess there is a use case in the context of thermal throttling.
> If a task is heating up the core, then in the usual scenario POWER systems throttle
> down to the rated frequency.
> In such a case, if the task is latency sensitive (min latency-nice), we can move the
> task around the chip to heat the chip up uniformly, allowing us to gain more performance
> from a sustained higher frequency.
> For this, we would need help from the active load balancer and latency-nice
> classification on a per-task and/or per-group basis.
>
> Hopefully this might be useful for other architectures as well, right?
>

Most of the functionality is already there, we're only really missing thermal
pressure awareness. There was [1] but it seems to have died.


At least with CFS load balancing, if thermal throttling is correctly
reflected as a CPU capacity reduction you will tend to move things away from
that CPU, since load is balanced over capacities.


For active balance, we actually already have a condition that moves a task
to a less capacity-pressured CPU (although it is somewhat specific). So if
thermal pressure follows that task (e.g. it's doing tons of vector/float),
it will be rotated around.

However there should be a point made on latency vs throughput. If you
care about latency you probably do not want to active balance your task. If
you care about throughput, it should be specified in some way (util-clamp
says hello!).

It sort of feels like you'd want an extension of misfit migration (salesman
hat goes on from here) - misfit moves tasks that are CPU bound (IOW their
util is >= 80% of the CPU capacity) to CPUs of higher capacity. It's only
enabled for systems with asymmetric capacities, but could be enabled globally
for "dynamically-created asymmetric capacities" (IOW RT/IRQ/thermal pressure
on SMP systems).

On top of that, if we make misfit consider e.g. uclamp.min (I don't think
that's already the case), then you have your throughput knob to have *some*
designated tasks move away from (thermal & else) pressure.


[1]: https://lore.kernel.org/lkml/[email protected]/

2019-09-07 14:17:51

by Parth Shah

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice



On 9/6/19 7:43 PM, Valentin Schneider wrote:
> On 06/09/2019 13:45, Parth Shah wrote:
>> I guess there is a use case in the context of thermal throttling.
>> If a task is heating up the core, then in the usual scenario POWER systems throttle
>> down to the rated frequency.
>> In such a case, if the task is latency sensitive (min latency-nice), we can move the
>> task around the chip to heat the chip up uniformly, allowing us to gain more performance
>> from a sustained higher frequency.
>> For this, we would need help from the active load balancer and latency-nice
>> classification on a per-task and/or per-group basis.
>>
>> Hopefully this might be useful for other architectures as well, right?
>>
>
> Most of the functionality is already there, we're only really missing thermal
> pressure awareness. There was [1] but it seems to have died.
>
>
> At least with CFS load balancing, if thermal throttling is correctly
> reflected as a CPU capacity reduction you will tend to move things away from
> that CPU, since load is balanced over capacities.
>

Right, CPU capacity can solve the problem of indicating the thermal throttle to the scheduler.
AFAIU, the patchset from Thara changes CPU capacity to reflect the thermal headroom of the CPU.
This is a nice mitigation, but:
1. Sometimes a single task is responsible for the thermal heat-up of the core; reducing the
CPU capacity of all the CPUs in the core is not optimal when simply moving that single
task to another core would let us stay within the thermal headroom. This is important
especially for servers, where there are up to 8 threads per core.
2. Given the implementation in the patches and its integration with EAS, it seems difficult
to adapt to servers, where CPU capacity itself is in doubt.
https://lkml.org/lkml/2019/5/15/1402

>
> For active balance, we actually already have a condition that moves a task
> to a less capacity-pressured CPU (although it is somewhat specific). So if
> thermal pressure follows that task (e.g. it's doing tons of vector/float),
> it will be rotated around.

Agreed. But this would break under certain conditions, e.g. when a core has multiple tasks
with almost equal utilization, only one of which is doing vector operations.
If the capacity is reduced here, the load balancer can pick and move any of those tasks with equal probability.

>
> However there should be a point made on latency vs throughput. If you
> care about latency you probably do not want to active balance your task. If

Can you please elaborate on why active balance should not be considered for latency sensitive tasks?
Sometimes finding a thermally cool core is beneficial when the turbo frequency
range is around 20% above the rated one.

> you care about throughput, it should be specified in some way (util-clamp
> says hello!).
>

yes I do care for latency and throughput both. :-)
but I'm wondering how uclamp can solve the problem for throughput.
If I make the thermally hot tasks appear bigger than the other tasks, then reducing
the CPU capacity can allow such tasks to move around the chip.
But this will require their utilization value to be relatively large compared to the other
tasks on the core. Alternatively, the other tasks' uclamp.max could be lowered to make such a task rotate.
If I got it right, this will be a difficult uclamp use case from the user's perspective, right?
I feel like I'm missing something here.

> It sort of feels like you'd want an extension of misfit migration (salesman
> hat goes on from here) - misfit moves tasks that are CPU bound (IOW their
> util is >= 80% of the CPU capacity) to CPUs of higher capacity. It's only
> enabled for systems with asymmetric capacities, but could be enabled globally
> for "dynamically-created asymmetric capacities" (IOW RT/IRQ/thermal pressure
> on SMP systems).
>
> On top of that, if we make misfit consider e.g. uclamp.min (I don't think
> that's already the case), then you have your throughput knob to have *some*
> designated tasks move away from (thermal & else) pressure.
>
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
>

Thanks,
Parth

2019-09-08 08:35:09

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On 06/09/2019 18:10, Parth Shah wrote:
> Right, CPU capacity can solve the problem of indicating the thermal throttle to the scheduler.
> AFAIU, the patchset from Thara changes CPU capacity to reflect the thermal headroom of the CPU.
> This is a nice mitigation, but:
> 1. Sometimes a single task is responsible for the thermal heat-up of the core; reducing the
> CPU capacity of all the CPUs in the core is not optimal when simply moving that single
> task to another core would let us stay within the thermal headroom. This is important
> especially for servers, where there are up to 8 threads per core.
> 2. Given the implementation in the patches and its integration with EAS, it seems difficult
> to adapt to servers, where CPU capacity itself is in doubt.
> https://lkml.org/lkml/2019/5/15/1402
>

I'd nuance this to *SMT* capacity (which isn't just servers). The thing is
that it's difficult to come up with a sensible scheme to describe the base
capacity of a single logical CPU. But yeah, valid point.

>>
>> For active balance, we actually already have a condition that moves a task
>> to a less capacity-pressured CPU (although it is somewhat specific). So if
>> thermal pressure follows that task (e.g. it's doing tons of vector/float),
>> it will be rotated around.
>
> Agreed. But this would break under certain conditions, e.g. when a core has multiple tasks
> with almost equal utilization, only one of which is doing vector operations.
> If the capacity is reduced here, the load balancer can pick and move any of those tasks with equal probability.
>

Right, if/when we get things like per-unit signals (wasn't there something
about tracking AVX a few months back?) then we'll be able to make
more informed decisions; for now we'll need some handholding (read: task
classification).

>>
>> However there should be a point made on latency vs throughput. If you
>> care about latency you probably do not want to active balance your task. If
>
> Can you please elaborate on why active balance should not be considered for latency sensitive tasks?
> Sometimes finding a thermally cool core is beneficial when the turbo frequency
> range is around 20% above the rated one.
>

This goes back to my reply to Patrick further up the thread.

Right now active balance can happen just because we've been imbalanced for
some time and repeatedly failed to migrate anything. After 3 (IIRC) successive
failed attempts, we'll active balance the running task of the remote rq we
decided was busiest.

If that happens to be a latency sensitive task, that's not great - active
balancing means stopping that task's execution, so we're going to add some
latency to this latency-sensitive task. My proposal was to further ratelimit
active balance (e.g. require more failed attempts) when the task that would be
preempted is latency-sensitive.

My point is: if that task is doing fine where it is, why preempt it? That's
just introducing latency IMO (keeping in mind that those balance attempts
could happen despite not having any thermal pressure).
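
As a rough sketch of that ratelimiting idea (the helper and the doubled
threshold are assumptions; today need_active_balance() simply checks
sd->nr_balance_failed against sd->cache_nice_tries + 2):

/*
 * Illustrative only: be more patient before active-balancing away a
 * running task that is latency sensitive.
 */
static inline bool active_balance_allowed(struct lb_env *env)
{
        struct task_struct *curr = env->src_rq->curr;
        unsigned int limit = env->sd->cache_nice_tries + 2;

        /* Require more consecutive failures before stopping a sensitive task. */
        if (curr->latency_nice < LATENCY_NICE_DEFAULT)
                limit *= 2;

        return env->sd->nr_balance_failed > limit;
}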

If you care about performance (e.g. a minimum level of throughput), to me
that is a separate (though perhaps not entirely distinct) property.

>> you care about throughput, it should be specified in some way (util-clamp
>> says hello!).
>>
>
> yes I do care for latency and throughput both. :-)

Don't we all!

> but I'm wondering how uclamp can solve the problem for throughput.
> If I make the thermally hot tasks appear bigger than the other tasks, then reducing
> the CPU capacity can allow such tasks to move around the chip.
> But this will require their utilization value to be relatively large compared to the other
> tasks on the core. Alternatively, the other tasks' uclamp.max could be lowered to make such a task rotate.
> If I got it right, this will be a difficult uclamp use case from the user's perspective, right?
> I feel like I'm missing something here.
>

Hmm perhaps I was jumping the gun here. What I was getting to is if you have
something like misfit that migrates tasks to CPUs of higher capacity than the
one they are on, you could use uclamp to flag them.

You could translate your throughput requirement as a uclamp.min of e.g. 80%,
and if the CPU capacity goes below that (or close within a margin) then you'd
try to migrate the task to a CPU of higher capacity (i.e. not or less
thermally pressured).
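
Roughly, and assuming the uclamp_eff_value()/fits_capacity() helpers from
recent kernels (the wrapper itself is made up):

/*
 * Illustrative only: a task no longer "fits" its CPU once the (thermally
 * pressured) capacity cannot honour its uclamp.min request, using the
 * usual ~20% headroom that fits_capacity() applies.
 */
static inline bool task_needs_more_capacity(struct task_struct *p, int cpu)
{
        unsigned long min_util = uclamp_eff_value(p, UCLAMP_MIN);

        return !fits_capacity(min_util, capacity_of(cpu));
}

Misfit migration could then treat such a task as a candidate to move
towards a less pressured CPU.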

This doesn't have to involve your less throughput-sensitive tasks, since you
would only tag and take action for your throughput-sensitive tasks.

2020-04-16 01:17:15

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
> On 09/05/19 13:30, Peter Zijlstra wrote:
> > On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> > > On 09/05/19 12:46, Peter Zijlstra wrote:
> >
> > > > This is important because we want to be able to bias towards less
> > > > importance to (tail) latency as well as more importantance to (tail)
> > > > latency.
> > > >
> > > > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > > > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> > >
> > > Another use case I'm considering is using latency-nice to prefer an idle CPU if
> > > latency-nice is set otherwise go for the most energy efficient CPU.
> > >
> > > Ie: sacrifice (some) energy for latency.
> > >
> > > The way I see interpreting latency-nice here as a binary switch. But
> > > maybe we can use the range to select what (some) energy to sacrifice
> > > mean here. Hmmm.
> >
> > It cannot be binary, per definition is must be ternary, that is, <0, ==0
> > and >0 (or middle value if you're of that persuasion).
>
> I meant I want to use it as a binary.
>
> >
> > In your case, I'm thinking you mean >0, we want to lower the latency.
>
> Yes. As long as there's an easy way to say: does this task care about latency
> or not I'm good.

Qais, Peter, all,

For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
instead of the out-of-tree schedtune mechanism to provide EAS with the
latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
upstream focussed :)

However, uclamp is missing an attribute to provide this biasing to EAS as we
know.

What was the consensus on adding a per-task attribute to uclamp for providing
this? Happy to collaborate on this front.

thanks,

- Joel


> > Anyway; there were a number of things mentioned at OSPM that we could
> > tie into this thing and finding sensible mappings is going to be a bit
> > of trial and error I suppose.
> >
> > But as patrick said; we're very much exporting a BIAS knob, not a set of
> > behaviours.
>
> Agreed. I just wanted to say that the way this range is going to be
> interpreted will differ from path to path and we need to consider that in the
> final mapping. Especially from the final user's perspective of what setting
> this value ultimately means to them.
>
> --
> Qais Yousef

2020-04-16 20:41:13

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

Hi Joel,

On 16.04.20 02:02, Joel Fernandes wrote:
> On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
>> On 09/05/19 13:30, Peter Zijlstra wrote:
>>> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>>>> On 09/05/19 12:46, Peter Zijlstra wrote:
>>>
>>>>> This is important because we want to be able to bias towards less
>>>>> importance to (tail) latency as well as more importantance to (tail)
>>>>> latency.
>>>>>
>>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>>
>>>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
>>>> latency-nice is set otherwise go for the most energy efficient CPU.
>>>>
>>>> Ie: sacrifice (some) energy for latency.
>>>>
>>>> The way I see interpreting latency-nice here as a binary switch. But
>>>> maybe we can use the range to select what (some) energy to sacrifice
>>>> mean here. Hmmm.
>>>
>>> It cannot be binary, per definition is must be ternary, that is, <0, ==0
>>> and >0 (or middle value if you're of that persuasion).
>>
>> I meant I want to use it as a binary.
>>
>>>
>>> In your case, I'm thinking you mean >0, we want to lower the latency.
>>
>> Yes. As long as there's an easy way to say: does this task care about latency
>> or not I'm good.
>
> Qais, Peter, all,
>
> For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
> instead of the out-of-tree schedtune mechanism to provide EAS with the
> latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
> upstream focussed :)
>
> However, uclamp is missing an attribute to provide this biasing to EAS as we
> know.
>
> What was the consensus on adding a per-task attribute to uclamp for providing
> this? Happy to collaborate on this front.

We're planning to have a session about this topic (latency-nice
attribute per task group) during the virtual Pisa OSPM summit
retis.sssup.it/ospm-summit in May this year.

There are two presentations/discussions planned:

"Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
(EAS) in Android Common Kernel (ACK)"

We'll probably merge those two into one presentation/discussion.

So far we have Parth's per-task implementation

https://lore.kernel.org/lkml/[email protected]

What's missing is the per-taskgroup implementation, at least from the
standpoint of ACK.

The (mainline) EAS use-case for latency nice is already in ACK
(android-5.4):

https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54

2020-04-18 16:03:36

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

Hi Dietmar,

On Thu, Apr 16, 2020 at 1:23 PM Dietmar Eggemann
<[email protected]> wrote:
>
> Hi Joel,
>
> On 16.04.20 02:02, Joel Fernandes wrote:
> > On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
> >> On 09/05/19 13:30, Peter Zijlstra wrote:
> >>> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> >>>> On 09/05/19 12:46, Peter Zijlstra wrote:
> >>>
> >>>>> This is important because we want to be able to bias towards less
> >>>>> importance to (tail) latency as well as more importantance to (tail)
> >>>>> latency.
> >>>>>
> >>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> >>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> >>>>
> >>>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
> >>>> latency-nice is set otherwise go for the most energy efficient CPU.
> >>>>
> >>>> Ie: sacrifice (some) energy for latency.
> >>>>
> >>>> The way I see interpreting latency-nice here as a binary switch. But
> >>>> maybe we can use the range to select what (some) energy to sacrifice
> >>>> mean here. Hmmm.
> >>>
> >>> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> >>> and >0 (or middle value if you're of that persuasion).
> >>
> >> I meant I want to use it as a binary.
> >>
> >>>
> >>> In your case, I'm thinking you mean >0, we want to lower the latency.
> >>
> >> Yes. As long as there's an easy way to say: does this task care about latency
> >> or not I'm good.
> >
> > Qais, Peter, all,
> >
> > For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
> > instead of the out-of-tree schedtune mechanism to provide EAS with the
> > latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
> > upstream focussed :)
> >
> > However, uclamp is missing an attribute to provide this biasing to EAS as we
> > know.
> >
> > What was the consensus on adding a per-task attribute to uclamp for providing
> > this? Happy to collaborate on this front.
>
> We're planning to have a session about this topic (latency-nice
> attribute per task group) during the virtual Pisa OSPM summit
> retis.sssup.it/ospm-summit in May this year.

Cool, I registered as well.

>
> There are two presentations/discussions planned:
>
> "Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
> Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
> (EAS) in Android Common Kernel (ACK)"
>
> We'll probably merge those two into one presentation/discussion.
>
> So far we have Parth's per-task implementation
>
> https://lore.kernel.org/lkml/[email protected]

Cool, I see it has some Reviewed-by tags so that's a good sign. Will
look more into that.

> What's missing is the per-taskgroup implementation, at least from the
> standpoint of ACK.
>
> The (mainline) EAS use-case for latency nice is already in ACK
> (android-5.4):
>
> https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54

Yes, I was aware of this. But if we use task groups, then the
transition from schedtune -> uclamp means now the tasks that use
uclamp would also be subjected to cpu.shares. That's why we were
looking into the per-task interface and glad there's some work on this
already done.

Thanks!

- Joel

2020-04-20 11:29:18

by Parth Shah

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

Hi Joel,

On 4/18/20 9:31 PM, Joel Fernandes wrote:
> Hi Dietmar,
>
> On Thu, Apr 16, 2020 at 1:23 PM Dietmar Eggemann
> <[email protected]> wrote:
>>
>> Hi Joel,
>>
>> On 16.04.20 02:02, Joel Fernandes wrote:
>>> On Thu, Sep 05, 2019 at 12:47:26PM +0100, Qais Yousef wrote:
>>>> On 09/05/19 13:30, Peter Zijlstra wrote:
>>>>> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>>>>>> On 09/05/19 12:46, Peter Zijlstra wrote:
>>>>>
>>>>>>> This is important because we want to be able to bias towards less
>>>>>>> importance to (tail) latency as well as more importantance to (tail)
>>>>>>> latency.
>>>>>>>
>>>>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>>>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>>>>
>>>>>> Another use case I'm considering is using latency-nice to prefer an idle CPU if
>>>>>> latency-nice is set otherwise go for the most energy efficient CPU.
>>>>>>
>>>>>> Ie: sacrifice (some) energy for latency.
>>>>>>
>>>>>> The way I see interpreting latency-nice here as a binary switch. But
>>>>>> maybe we can use the range to select what (some) energy to sacrifice
>>>>>> mean here. Hmmm.
>>>>>
>>>>> It cannot be binary, per definition is must be ternary, that is, <0, ==0
>>>>> and >0 (or middle value if you're of that persuasion).
>>>>
>>>> I meant I want to use it as a binary.
>>>>
>>>>>
>>>>> In your case, I'm thinking you mean >0, we want to lower the latency.
>>>>
>>>> Yes. As long as there's an easy way to say: does this task care about latency
>>>> or not I'm good.
>>>
>>> Qais, Peter, all,
>>>
>>> For ChromeOS (my team), we are planning to use the upstream uclamp mechanism
>>> instead of the out-of-tree schedtune mechanism to provide EAS with the
>>> latency-sensitivity (binary/ternary) hint. ChromeOS is thankfully quite a bit
>>> upstream focussed :)
>>>
>>> However, uclamp is missing an attribute to provide this biasing to EAS as we
>>> know.
>>>
>>> What was the consensus on adding a per-task attribute to uclamp for providing
>>> this? Happy to collaborate on this front.
>>
>> We're planning to have a session about this topic (latency-nice
>> attribute per task group) during the virtual Pisa OSPM summit
>> retis.sssup.it/ospm-summit in May this year.
>
> Cool, I registered as well.
>
>>
>> There are two presentations/discussions planned:
>>
>> "Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
>> Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
>> (EAS) in Android Common Kernel (ACK)"
>>
>> We'll probably merge those two into one presentation/discussion.
>>
>> So far we have Parth's per-task implementation
>>
>> https://lore.kernel.org/lkml/[email protected]
>
> Cool, I see it has some Reviewed-by tags so that's a good sign. Will
> look more into that.
>
>> What's missing is the per-taskgroup implementation, at least from the
>> standpoint of ACK.
>>
>> The (mainline) EAS use-case for latency nice is already in ACK
>> (android-5.4):
>>
>> https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
>
> Yes, I was aware of this. But if we use task groups, then the
> transition from schedtune -> uclamp means now the tasks that use
> uclamp would also be subjected to cpu.shares. That's why we were
> looking into the per-task interface and glad there's some work on this
> already done.
>

Yes, that latency_nice series seems to be in good shape to be used for
any of these use cases. Hopefully OSPM will lead to its upstreaming sooner :-)
But in the end, we aim to have both the per-task and the cgroup-based interface
to set the latency_nice value of a task.
Until then, I'm collecting some generic use cases that show the benefits of such a task
attribute, to increase community interest.


Thanks,
Parth

2020-04-20 11:49:38

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On 04/18/20 12:01, Joel Fernandes wrote:
> > What's missing is the per-taskgroup implementation, at least from the
> > standpoint of ACK.
> >
> > The (mainline) EAS use-case for latency nice is already in ACK
> > (android-5.4):
> >
> > https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
>
> Yes, I was aware of this. But if we use task groups, then the
> transition from schedtune -> uclamp means now the tasks that use
> uclamp would also be subjected to cpu.shares. That's why we were
> looking into the per-task interface and glad there's some work on this
> already done.

Hmm uclamp doesn't do anything with cpu.shares. I assume this is some
implementation detail at your end? IOW, you don't have to use cpu.shares to use
uclamp.

Although there should be few tasks in the system that need the latency-nice, so
I prefer the per-task interface rather than lump everything in a cgroup. Though
there could be valid use cases for the latter.

Thanks

--
Qais Yousef

2020-04-20 19:12:59

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

Hi Qais!

On Mon, Apr 20, 2020 at 12:47:29PM +0100, Qais Yousef wrote:
> On 04/18/20 12:01, Joel Fernandes wrote:
> > > What's missing is the per-taskgroup implementation, at least from the
> > > standpoint of ACK.
> > >
> > > The (mainline) EAS use-case for latency nice is already in ACK
> > > (android-5.4):
> > >
> > > https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
> >
> > Yes, I was aware of this. But if we use task groups, then the
> > transition from schedtune -> uclamp means now the tasks that use
> > uclamp would also be subjected to cpu.shares. That's why we were
> > looking into the per-task interface and glad there's some work on this
> > already done.
>
> Hmm uclamp doesn't do anything with cpu.shares. I assume this is some
> implementation detail at your end? IOW, you don't have to use cpu.shares to use
> uclamp.

Right, it is a ChromeOS-specific issue. We have CONFIG_FAIR_GROUP_SCHED
enabled in the kernel for container workloads. However, there are cgroups of
tasks that previously used the "schedtune" cgroup interface to provide
util-clamping-like behavior. We are now migrating these to the upstream
util clamp.

We can't disable CONFIG_FAIR_GROUP_SCHED because that would break the
container workloads.

So we have to use the per-process interface of util clamp.

If we used the cgroup interface of util clamping, we would get the
cpu.shares as well, since the cgroup interface comes with shares. There's no
way to avoid being subject to cpu.shares (that I'm aware of, anyway).
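
For reference, the per-process interface we'd be using looks roughly like
this (a minimal sketch; glibc has no sched_setattr() wrapper, so the struct
is declared locally, and the flags need a v5.3+ kernel with
CONFIG_UCLAMP_TASK):

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_FLAG_KEEP_ALL
#define SCHED_FLAG_KEEP_ALL             0x18    /* keep policy and params */
#endif
#ifndef SCHED_FLAG_UTIL_CLAMP_MIN
#define SCHED_FLAG_UTIL_CLAMP_MIN       0x20
#endif

/* Layout follows include/uapi/linux/sched/types.h. */
struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        /* SCHED_DEADLINE fields */
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
        /* utilization clamps, 0..1024 */
        uint32_t sched_util_min;
        uint32_t sched_util_max;
};

static int set_task_uclamp_min(pid_t pid, uint32_t util_min)
{
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        /* Only touch the clamp; leave policy/nice/priority alone. */
        attr.sched_flags = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MIN;
        attr.sched_util_min = util_min;

        return syscall(SYS_sched_setattr, pid, &attr, 0);
}

The latency hint itself still needs a separate attribute, which is exactly
the gap being discussed here.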

> Although there should be few tasks in the system that need the latency-nice, so
> I prefer the per-task interface rather than lump everything in a cgroup. Though
> there could be valid use cases for the latter.

Yes, with either interface, we need something like latency_nice to indicate
that the task is low-latency (something we used for a number of years with
the out-of-tree schedtune).

thanks!

- Joel


>
> Thanks
>
> --
> Qais Yousef

2020-04-20 19:15:55

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

On Mon, Apr 20, 2020 at 04:56:55PM +0530, Parth Shah wrote:

> >>
> >> There are two presentations/discussions planned:
> >>
> >> "Introducing Latency Nice for Scheduler Hints and Optimizing Scheduler
> >> Task Wakeup" and "The latency nice use case for Energy-Aware-Scheduling
> >> (EAS) in Android Common Kernel (ACK)"
> >>
> >> We'll probably merge those two into one presentation/discussion.
> >>
> >> So far we have Parth's per-task implementation
> >>
> >> https://lore.kernel.org/lkml/[email protected]
> >
> > Cool, I see it has some Reviewed-by tags so that's a good sign. Will
> > look more into that.
> >
> >> What's missing is the per-taskgroup implementation, at least from the
> >> standpoint of ACK.
> >>
> >> The (mainline) EAS use-case for latency nice is already in ACK
> >> (android-5.4):
> >>
> >> https://android.googlesource.com/kernel/common/+/760b82c9b88d2c8125abfc5f732cc3cd460b2a54
> >
> > Yes, I was aware of this. But if we use task groups, then the
> > transition from schedtune -> uclamp means now the tasks that use
> > uclamp would also be subjected to cpu.shares. That's why we were
> > looking into the per-task interface and glad there's some work on this
> > already done.
> >
>
> Yes, that latency_nice series seems to be in good shape to be used for
> any of these use cases. Hopefully OSPM will lead to its upstreaming sooner :-)

Cool :)

> But in the end, we aim to have both the per-task and the cgroup-based interface
> to set the latency_nice value of a task.

Ok. We'd likely use the per-task interface unless we decide to assign
cpu.shares for the groups as well.

> Until then, I'm collecting some generic use cases that show the benefits of such a task
> attribute, to increase community interest.

Ok. Feel free to add ChromeOS as a use case as well.

thanks,

- Joel