This patchset restarts the work on adding a latency priority to describe
the latency tolerance of cfs tasks.
Patches [1-4] were originally done by Parth:
https://lore.kernel.org/lkml/[email protected]/
I have just rebased them and moved the setting of the latency priority
outside the priority update. I have removed the Reviewed-by tags because
the patches are 2 years old.
Patch [5] uses the latency nice priority to define a latency offset that is
then used to decide whether a cfs task can preempt the currently running
task. The patch gives some test results with cyclictest and hackbench to
highlight the benefit of the latency priority for short interactive tasks
or long intensive tasks.
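To make the idea more concrete, here is a small self-contained model of how
a per-entity latency offset can bias the usual vruntime-based "who runs
first" comparison. This is an illustration only, not the code of patch [5]:
the helper in the series is wakeup_latency_gran() and the exact sign
convention and integration point may differ.

/*
 * Toy model (userspace, not kernel code): a negative latency_offset makes an
 * entity look "earlier" than its vruntime alone says, so it preempts sooner;
 * a positive offset defers its preemption.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct entity {
	int64_t vruntime;	/* virtual runtime, ns */
	int64_t latency_offset;	/* ns, in [-sysctl_sched_latency, sysctl_sched_latency] */
};

/* true if @se should run before @curr */
static bool runs_first(const struct entity *se, const struct entity *curr)
{
	int64_t vdiff = curr->vruntime - se->vruntime;

	/* a latency sensitive @se (negative offset) enlarges vdiff */
	vdiff -= se->latency_offset - curr->latency_offset;

	return vdiff > 0;
}

int main(void)
{
	struct entity curr = { .vruntime = 1000000, .latency_offset = 0 };
	struct entity se   = { .vruntime = 1200000, .latency_offset = -1000000 };

	/* se has a larger vruntime, but its negative offset lets it go first */
	printf("se runs first: %d\n", runs_first(&se, &curr));
	return 0;
}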
Patch [6] adds support for latency_offset in task groups by adding a
cpu.latency field. There were discussions for the last version about
using a real unit for the field, so I decided to expose the latency offset
directly: it reflects the time up to which we can preempt when the value is
negative, or up to which we can defer the preemption when the value is
positive.
The range is [-sysctl_sched_latency:sysctl_sched_latency].
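As a quick illustration, setting the knob boils down to writing a signed
number of nanoseconds into the cgroup file. The group path below is
hypothetical, and the -20ms value assumes sysctl_sched_latency is at least
20ms on the system (later in this thread the interface is reworked into
cpu.latency.nice):

/* Minimal sketch: write a latency offset (ns) into a cgroup v2 cpu.latency file */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical group; assumes cgroup v2 mounted at /sys/fs/cgroup */
	const char *path = "/sys/fs/cgroup/app/cpu.latency";
	const char *val = "-20000000";	/* -20ms: latency sensitive, preempt earlier */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, val, strlen(val)) < 0)
		perror("write");
	close(fd);
	return 0;
}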
Patch [7] makes sched_core take the latency offset into account.
Patch [8] adds an rb tree to cover some corner cases where a latency
sensitive task is preempted by a high priority task (RT/DL) or fails to
preempt one. This patch ensures that such tasks will have at least a
slice of sched_min_granularity of priority at wakeup. The patch gives
results showing the benefit on top of patch 5.
I have also backported the patchset on a Dragonboard RB3 with an Android
mainline kernel based on v5.18 for a quick test. I have used the
TouchLatency app, which is part of AOSP and described as a very good test
for highlighting jitter and jank frame sources of a system [1].
In addition to the app, I have added some short running tasks waking up
regularly (using the 8 cpus for 4 ms every 37777 us) to stress the system
without overloading it (and disabling EAS). The first results show that the
patchset helps to reduce the missed deadline frames from 5% to less than
0.1% when the cpu.latency of the task groups is set.
[1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
Changes since v3:
- Fix 2 compilation issues raised by kernel test robot <[email protected]>
Changes since v2:
- Set a latency_offset field instead of saving a weight and computing it
on the fly.
- Make latency_offset available for task group: cpu.latency
- Fix some corner cases to make latency sensitive tasks schedule first and
add an rb tree for latency sensitive tasks.
Changes since v1:
- fix typo
- move some code into the right patch to make bisect happy
- simplify and fix how the weight is computed
- add support for sched core (patch 7)
Parth Shah (4):
sched: Introduce latency-nice as a per-task attribute
sched/core: Propagate parent task's latency requirements to the child
task
sched: Allow sched_{get,set}attr to change latency_nice of the task
sched/core: Add permission checks for setting the latency_nice value
Vincent Guittot (4):
sched/fair: Take into account latency priority at wakeup
sched/fair: Add sched group latency support
sched/core: support latency priority with sched core
sched/fair: Add latency list
include/linux/sched.h | 5 +
include/uapi/linux/sched.h | 4 +-
include/uapi/linux/sched/types.h | 19 ++++
init/init_task.c | 1 +
kernel/sched/core.c | 81 +++++++++++++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 189 ++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 37 +++++-
tools/include/uapi/linux/sched.h | 4 +-
9 files changed, 333 insertions(+), 8 deletions(-)
--
2.17.1
From: Parth Shah <[email protected]>
Latency-nice indicates the latency requirements of a task with respect
to the other tasks in the system. The value of the attribute can be within
the range [-20, 19], both inclusive, to be in line with task nice values.
latency_nice = -20 indicates that the task should have the least latency,
as compared to tasks having latency_nice = +19.
The latency_nice value only affects the CFS sched class, conveying
latency requirements from userspace.
Additionally, add debugging bits for the newly added latency_nice attribute.
Signed-off-by: Parth Shah <[email protected]>
[rebase]
Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/debug.c | 1 +
kernel/sched/sched.h | 18 ++++++++++++++++++
3 files changed, 20 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88b8817b827d..a0adb55efa1c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -773,6 +773,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned int rt_priority;
+ int latency_nice;
struct sched_entity se;
struct sched_rt_entity rt;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bb3d63bdf4ae..a3f7876217a6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1042,6 +1042,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
#endif
P(policy);
P(prio);
+ P(latency_nice);
if (task_has_dl_policy(p)) {
P(dl.runtime);
P(dl.deadline);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74130a69d365..b927269b84f9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -124,6 +124,24 @@ extern int sched_rr_timeslice;
*/
#define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
+/*
+ * Latency nice is meant to provide scheduler hints about the relative
+ * latency requirements of a task with respect to other tasks.
+ * Thus a task with latency_nice == 19 can be hinted as the task with no
+ * latency requirements, in contrast to the task with latency_nice == -20
+ * which should be given priority in terms of lower latency.
+ */
+#define MAX_LATENCY_NICE 19
+#define MIN_LATENCY_NICE -20
+
+#define LATENCY_NICE_WIDTH \
+ (MAX_LATENCY_NICE - MIN_LATENCY_NICE + 1)
+
+/*
+ * Default tasks should be treated as a task with latency_nice = 0.
+ */
+#define DEFAULT_LATENCY_NICE 0
+
/*
* Increase resolution of nice-level calculations for 64-bit architectures.
* The extra resolution improves shares distribution and load balancing of
--
2.17.1
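As a userspace illustration of how the new attribute is meant to be consumed
once patches [3-4] wire it into sched_{get,set}attr() (those patches are not
shown in this excerpt): the flag value and the sched_latency_nice field below
follow Parth's original series and are assumptions here; check the applied
UAPI headers before relying on them.

/* Sketch: a task lowering its own latency_nice via sched_setattr() */
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_FLAG_LATENCY_NICE
#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed value, from the series */
#endif

/* local copy of the extended sched_attr layout; use the real UAPI header */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* SCHED_DEADLINE fields, unused here */
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;	/* utilization clamps, unused here */
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* new field added by the series */
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = -20;	/* most latency sensitive */

	/* pid 0 means the calling task; sched_setattr() has no glibc wrapper */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}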
Take into account wakeup_latency_gran() when ordering the cfs threads.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6cc4f2a9725d..7563fb16aba1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11444,6 +11444,10 @@ bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
delta = (s64)(sea->vruntime - seb->vruntime) +
(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
+ /* Take into account latency prio */
+ delta -= wakeup_latency_gran(sea, seb);
+
+
return delta > 0;
}
#else
--
2.17.1
Task can set its latency priority, which is then used to decide to preempt
the current running entity of the cfs, but sched group entities still have
the default latency offset.
Add a latency field in task group to set the latency offset of the
sched_eneities of the group, which will be used against other entities in
the parent cfs when deciding which entity to schedule first.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 24 ++++++++++++++++++++++++
kernel/sched/fair.c | 33 +++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 4 ++++
3 files changed, 61 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 13cf794708ee..bfea862a3588 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10860,6 +10860,19 @@ static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
{
return sched_group_set_idle(css_tg(css), idle);
}
+
+static s64 cpu_latency_read_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return css_tg(css)->latency_offset;
+}
+
+static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft, s64 latency)
+{
+ return sched_group_set_latency(css_tg(css), latency);
+}
+
#endif
static struct cftype cpu_legacy_files[] = {
@@ -10874,6 +10887,11 @@ static struct cftype cpu_legacy_files[] = {
.read_s64 = cpu_idle_read_s64,
.write_s64 = cpu_idle_write_s64,
},
+ {
+ .name = "latency",
+ .read_s64 = cpu_latency_read_s64,
+ .write_s64 = cpu_latency_write_s64,
+ },
#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
@@ -11091,6 +11109,12 @@ static struct cftype cpu_files[] = {
.read_s64 = cpu_idle_read_s64,
.write_s64 = cpu_idle_write_s64,
},
+ {
+ .name = "latency",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_s64 = cpu_latency_read_s64,
+ .write_s64 = cpu_latency_write_s64,
+ },
#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a20eadb0af97..6cc4f2a9725d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11769,6 +11769,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
goto err;
tg->shares = NICE_0_LOAD;
+ tg->latency_offset = 0;
init_cfs_bandwidth(tg_cfs_bandwidth(tg));
@@ -11867,6 +11868,9 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
}
se->my_q = cfs_rq;
+
+ se->latency_offset = tg->latency_offset;
+
/* guarantee group entities always have weight */
update_load_set(&se->load, NICE_0_LOAD);
se->parent = parent;
@@ -11997,6 +12001,35 @@ int sched_group_set_idle(struct task_group *tg, long idle)
return 0;
}
+int sched_group_set_latency(struct task_group *tg, long latency)
+{
+ int i;
+
+ if (tg == &root_task_group)
+ return -EINVAL;
+
+ if (abs(latency) > sysctl_sched_latency)
+ return -EINVAL;
+
+ mutex_lock(&shares_mutex);
+
+ if (tg->latency_offset == latency) {
+ mutex_unlock(&shares_mutex);
+ return 0;
+ }
+
+ tg->latency_offset = latency;
+
+ for_each_possible_cpu(i) {
+ struct sched_entity *se = tg->se[i];
+
+ WRITE_ONCE(se->latency_offset, latency);
+ }
+
+ mutex_unlock(&shares_mutex);
+ return 0;
+}
+
#else /* CONFIG_FAIR_GROUP_SCHED */
void free_fair_sched_group(struct task_group *tg) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4cb8058f1b2..619132f4f480 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -406,6 +406,8 @@ struct task_group {
/* A positive value indicates that this is a SCHED_IDLE group. */
int idle;
+ /* latency constraint of the group. */
+ int latency_offset;
#ifdef CONFIG_SMP
/*
@@ -519,6 +521,8 @@ extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
extern int sched_group_set_idle(struct task_group *tg, long idle);
+extern int sched_group_set_latency(struct task_group *tg, long latency);
+
#ifdef CONFIG_SMP
extern void set_task_rq_fair(struct sched_entity *se,
struct cfs_rq *prev, struct cfs_rq *next);
--
2.17.1
s/[email protected]//
On 16/09/2022 10:03, Vincent Guittot wrote:
> Task can set its latency priority, which is then used to decide to preempt
> the current running entity of the cfs, but sched group entities still have
> the default latency offset.
>
> Add a latency field in task group to set the latency offset of the
> sched_eneities of the group, which will be used against other entities in
s/sched_eneities/sched_entity
> the parent cfs when deciding which entity to schedule first.
So latency for cgroups does not follow any (existing) Resource
Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
Latency values are only used to compare sched entities at the same level.
[...]
> +static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
> + struct cftype *cft, s64 latency)
> +{
There is no [MIN, MAX] checking?
min_weight = sched_latency_to_weight[0] = -1024
max_weight = sched_latency_to_weight[39] = 973
[MIN, MAX] = [sysctl_sched_latency * min_weight >> NICE_LATENCY_SHIFT,
sysctl_sched_latency * max_weight >> NICE_LATENCY_SHIFT]
With the `cpu.latency` knob user would have to know for example that the
value is -24,000,000ns to get the same behaviour as for a task latency
nice = -20 (latency prio = 0) (w/ sysctl_sched_latency = 24ms)?
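Plugging in the 24ms example, the two endpoints work out as follows (a
standalone sketch using only the weights quoted above and assuming
NICE_LATENCY_SHIFT is 10, i.e. 1024-based like the load weights):

#include <stdio.h>

#define NICE_LATENCY_SHIFT	10		/* assumed: 1024-based, like load weights */
#define SYSCTL_SCHED_LATENCY	24000000LL	/* the 24ms example above, in ns */

int main(void)
{
	/* endpoints of sched_latency_to_weight[] quoted above */
	long long min_w = -1024, max_w = 973;

	printf("latency_nice -20 -> offset %lld ns\n",
	       SYSCTL_SCHED_LATENCY * min_w >> NICE_LATENCY_SHIFT);
	printf("latency_nice  19 -> offset %lld ns\n",
	       SYSCTL_SCHED_LATENCY * max_w >> NICE_LATENCY_SHIFT);
	return 0;
}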
For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
[...]
On Mon, 19 Sept 2022 at 13:55, Dietmar Eggemann
<[email protected]> wrote:
>
> s/[email protected]//
>
> On 16/09/2022 10:03, Vincent Guittot wrote:
> > Task can set its latency priority, which is then used to decide to preempt
> > the current running entity of the cfs, but sched group entities still have
> > the default latency offset.
> >
> > Add a latency field in task group to set the latency offset of the
> > sched_eneities of the group, which will be used against other entities in
>
> s/sched_eneities/sched_entity
>
> > the parent cfs when deciding which entity to schedule first.
>
> So latency for cgroups does not follow any (existing) Resource
> Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
> Latency values are only used to compare sched entities at the same level.
Just like the shares/cpu.weight value does for time sharing.
>
> [...]
>
> > +static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
> > + struct cftype *cft, s64 latency)
> > +{
>
> There is no [MIN, MAX] checking?
This is done in sched_group_set_latency(), which checks that
abs(latency) < sysctl_sched_latency
>
> min_weight = sched_latency_to_weight[0] = -1024
> max_weight = sched_latency_to_weight[39] = 973
>
> [MIN, MAX] = [sysctl_sched_latency * min_weight >> NICE_LATENCY_SHIFT,
> sysctl_sched_latency * max_weight >> NICE_LATENCY_SHIFT]
>
>
> With the `cpu.latency` knob user would have to know for example that the
> value is -24,000,000ns to get the same behaviour as for a task latency
> nice = -20 (latency prio = 0) (w/ sysctl_sched_latency = 24ms)?
Yes, Tejun raised some concerns about adding an interface like nice in
the task group in v2 so I have removed it.
>
> For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
If everybody is ok, I can add back the cpu.latency.nice interface in
the v5 in addition to the cpu.latency
>
> [...]
On Mon, Sep 19, 2022 at 05:49:27PM +0200, Vincent Guittot wrote:
> > For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
>
> If everybody is ok, I can add back the cpu.latency.nice interface in
> the v5 in addition to the cpu.latency
Yeah, that sounds fine to me.
Thanks.
--
tejun
Hello,
On Mon, Sep 19, 2022 at 01:55:15PM +0200, Dietmar Eggemann wrote:
> s/[email protected]//
>
> On 16/09/2022 10:03, Vincent Guittot wrote:
> > Task can set its latency priority, which is then used to decide to preempt
> > the current running entity of the cfs, but sched group entities still have
> > the default latency offset.
> >
> > Add a latency field in task group to set the latency offset of the
> > sched_eneities of the group, which will be used against other entities in
>
> s/sched_eneities/sched_entity
>
> > the parent cfs when deciding which entity to schedule first.
>
> So latency for cgroups does not follow any (existing) Resource
> Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
> Latency values are only used to compare sched entities at the same level.
I think it'd still result in hierarchical behavior as scheduling is done
recursively top-down. Right, Vincent?
It doesn't fit any of the resource distribution models. But it's rather
difficult to map latencies to existing concepts of resources, and we have a
precedent in the cpu controller (.idle) which behaves similarly, so as long
as the behavior is hierarchical, I think it's okay.
Thanks.
--
tejun
On Mon, 19 Sept 2022 at 19:34, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Mon, Sep 19, 2022 at 01:55:15PM +0200, Dietmar Eggemann wrote:
> > s/[email protected]//
> >
> > On 16/09/2022 10:03, Vincent Guittot wrote:
> > > Task can set its latency priority, which is then used to decide to preempt
> > > the current running entity of the cfs, but sched group entities still have
> > > the default latency offset.
> > >
> > > Add a latency field in task group to set the latency offset of the
> > > sched_eneities of the group, which will be used against other entities in
> >
> > s/sched_eneities/sched_entity
> >
> > > the parent cfs when deciding which entity to schedule first.
> >
> > So latency for cgroups does not follow any (existing) Resource
> > Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
> > Latency values are only used to compare sched entities at the same level.
>
> I think it'd still result in a hierarchical behavior as scheduling is done
> recursively top-down. Right, vincent?
Correct
>
> It doesn't fit any of the resource distribution model. But it's rather
> difficult to map latencies to existing concepts of resources and we have a
> precedence in the cpu controller (.idle) which behaves similarly, so as long
> as the behavior is hierarchical, I think it's okay.
>
> Thanks.
>
> --
> tejun
On Mon, 19 Sept 2022 at 19:34, Tejun Heo <[email protected]> wrote:
>
> On Mon, Sep 19, 2022 at 05:49:27PM +0200, Vincent Guittot wrote:
> > > For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
> >
> > If everybody is ok, I can add back the cpu.latency.nice interface in
> > the v5 in addition to the cpu.latency
>
> Yeah, that sounds fine to me.
Thanks
>
> Thanks.
>
> --
> tejun
On 19/09/2022 17:49, Vincent Guittot wrote:
> On Mon, 19 Sept 2022 at 13:55, Dietmar Eggemann
> <[email protected]> wrote:
>>
>> s/[email protected]//
>>
>> On 16/09/2022 10:03, Vincent Guittot wrote:
>>> Task can set its latency priority, which is then used to decide to preempt
>>> the current running entity of the cfs, but sched group entities still have
>>> the default latency offset.
>>>
>>> Add a latency field in task group to set the latency offset of the
>>> sched_eneities of the group, which will be used against other entities in
>>
>> s/sched_eneities/sched_entity
>>
>>> the parent cfs when deciding which entity to schedule first.
>>
>> So latency for cgroups does not follow any (existing) Resource
>> Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
>> Latency values are only used to compare sched entities at the same level.
>
> Just like share/cpu.weight value does for time sharing
But for this we define it as following the `Weights` scheme. That's why
I was asking.
>> [...]
>>
>>> +static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
>>> + struct cftype *cft, s64 latency)
>>> +{
>>
>> There is no [MIN, MAX] checking?
>
> This is done is sched_group_set_latency() which checks that
> abs(latency) < sysctl_sched_latency
I see. Nit-picking: wouldn't this allow specifying a latency offset
value for the non-existent `nice = 20`? The highest nice value, 19, maps to
`973/1024 * sysctl_sched_latency`.
>
>>
>> min_weight = sched_latency_to_weight[0] = -1024
>> max_weight = sched_latency_to_weight[39] = 973
>>
>> [MIN, MAX] = [sysctl_sched_latency * min_weight >> NICE_LATENCY_SHIFT,
>> sysctl_sched_latency * max_weight >> NICE_LATENCY_SHIFT]
>>
>>
>> With the `cpu.latency` knob user would have to know for example that the
>> value is -24,000,000ns to get the same behaviour as for a task latency
>> nice = -20 (latency prio = 0) (w/ sysctl_sched_latency = 24ms)?
>
> Yes, Tejun raised some concerns about adding an interface like nice in
> the task group in v2 so I have removed it.
>
>>
>> For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
>
> If everybody is ok, I can add back the cpu.latency.nice interface in
> the v5 in addition to the cpu.latency
cpu.weight/cpu.weight.nice interface:

echo X > cpu.weight       tg->shares
      1                       10,240
    100                    1,048,576
  10000                  104,857,600

echo X > cpu.weight.nice  tg->shares
    -20                   90,891,264
      0                    1,048,576
     19                       15,360
Wouldn't then a similar interface for cpu.latency [1..100..10000] and
cpu.latency.nice [-20..0..19] make most sense?
Raw latency_offset values at interface level are not portable.
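For reference, the numbers in the table above follow from the existing
cpu.weight/cpu.weight.nice conversions (cpu_weight_write_u64() and
cpu_weight_nice_write_s64() in kernel/sched/core.c). A standalone sketch,
reproducing only the sched_prio_to_weight[] endpoints and assuming
scale_load() shifts by 10 on 64-bit:

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10	/* scale_load() on 64-bit */

/* cpu.weight [1..10000] -> tg->shares */
static unsigned long weight_to_shares(unsigned long weight)
{
	unsigned long w = (weight * 1024 + 50) / 100;	/* DIV_ROUND_CLOSEST */

	return w << SCHED_FIXEDPOINT_SHIFT;
}

/* cpu.weight.nice [-20..19] -> tg->shares via sched_prio_to_weight[] */
static unsigned long nice_to_shares(int nice)
{
	/* only the endpoints and nice 0 are needed for the table above */
	static const unsigned long prio_to_weight[40] = {
		[0] = 88761, [20] = 1024, [39] = 15,
	};

	return prio_to_weight[nice + 20] << SCHED_FIXEDPOINT_SHIFT;
}

int main(void)
{
	printf("cpu.weight 1/100/10000   -> %lu %lu %lu\n",
	       weight_to_shares(1), weight_to_shares(100), weight_to_shares(10000));
	printf("cpu.weight.nice -20/0/19 -> %lu %lu %lu\n",
	       nice_to_shares(-20), nice_to_shares(0), nice_to_shares(19));
	return 0;
}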
On Tue, 20 Sept 2022 at 20:17, Dietmar Eggemann
<[email protected]> wrote:
>
> On 19/09/2022 17:49, Vincent Guittot wrote:
> > On Mon, 19 Sept 2022 at 13:55, Dietmar Eggemann
> > <[email protected]> wrote:
> >>
> >> s/[email protected]//
> >>
> >> On 16/09/2022 10:03, Vincent Guittot wrote:
> >>> Task can set its latency priority, which is then used to decide to preempt
> >>> the current running entity of the cfs, but sched group entities still have
> >>> the default latency offset.
> >>>
> >>> Add a latency field in task group to set the latency offset of the
> >>> sched_eneities of the group, which will be used against other entities in
> >>
> >> s/sched_eneities/sched_entity
> >>
> >>> the parent cfs when deciding which entity to schedule first.
> >>
> >> So latency for cgroups does not follow any (existing) Resource
> >> Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
> >> Latency values are only used to compare sched entities at the same level.
> >
> > Just like share/cpu.weight value does for time sharing
>
> But for this we define it as following the `Weights` scheme. That's why
> I was asking,
>
> >> [...]
> >>
> >>> +static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
> >>> + struct cftype *cft, s64 latency)
> >>> +{
> >>
> >> There is no [MIN, MAX] checking?
> >
> > This is done is sched_group_set_latency() which checks that
> > abs(latency) < sysctl_sched_latency
>
> I see. Nit-picking: Wouldn't this allow to specify a latency offset
> value for the non-existent `nice = 20`? Highest nice value 19 maps to
> `973/1024 * sysctl_sched_latency`.
Yes, but the same applies to tg->shares and cpu.weight: we can set
a tg->shares of 104,857,600 whereas the max shares for nice -20 is
90,891,264. Furthermore, I don't see a real problem with the ability
to set a latency offset up to sysctl_sched_latency, because it means
being even nicer to other tasks, not the opposite.
>
> >
> >>
> >> min_weight = sched_latency_to_weight[0] = -1024
> >> max_weight = sched_latency_to_weight[39] = 973
> >>
> >> [MIN, MAX] = [sysctl_sched_latency * min_weight >> NICE_LATENCY_SHIFT,
> >> sysctl_sched_latency * max_weight >> NICE_LATENCY_SHIFT]
> >>
> >>
> >> With the `cpu.latency` knob user would have to know for example that the
> >> value is -24,000,000ns to get the same behaviour as for a task latency
> >> nice = -20 (latency prio = 0) (w/ sysctl_sched_latency = 24ms)?
> >
> > Yes, Tejun raised some concerns about adding an interface like nice in
> > the task group in v2 so I have removed it.
> >
> >>
> >> For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
> >
> > If everybody is ok, I can add back the cpu.latency.nice interface in
> > the v5 in addition to the cpu.latency
>
> cpu.weight/cpu.weight.nice interface:
>
> echo X > cpu.weight tg->shares
>
> 1 10,240
> 100 1,048,576
> 10000 104,857,600
>
> echo X > cpu.weight.nice
>
> -20 90,891,264
> 0 1,048,576
> 19 15,360
>
> Wouldn't then a similar interface for cpu.latency [1..100..10000] and
> cpu.latency.nice [-20..0..19] make most sense?
We need at least a signed value for cpu.latency to distinguish between
being sensitive to latency and not caring about it.
>
> Raw latency_offset values at interface level are not portable.
I can use [-1000:1000] but I'm not sure it's better than the raw value in the end.
>
>
On 09/19/22 07:34, Tejun Heo wrote:
> On Mon, Sep 19, 2022 at 05:49:27PM +0200, Vincent Guittot wrote:
> > > For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
> >
> > If everybody is ok, I can add back the cpu.latency.nice interface in
> > the v5 in addition to the cpu.latency
>
> Yeah, that sounds fine to me.
I do have concerns about the cpu.latency knob, as it exposes latency_offset,
which won't be meaningful for all consumers of latency_nice [1].
The current use case proposed by Vincent is not going to be the only consumer
of this interface. The concept of converting this latency_nice value to a
weight, in a similar fashion to the existing nice value, is specific to it.
In previous discussions this conversion was not required, and I'd expect it to
still not be required for those other use cases.
Wouldn't cpu.latency.nice be enough? I think the latency_offset is an
implementation detail that userspace shouldn't be concerned about.
[1] https://lwn.net/Articles/820659/
Cheers
--
Qais Yousef
Hi Vincent
On 09/16/22 10:02, Vincent Guittot wrote:
> This patchset restarts the work about adding a latency priority to describe
> the latency tolerance of cfs tasks.
>
> The patches [1-4] have been done by Parth:
> https://lore.kernel.org/lkml/[email protected]/
>
> I have just rebased and moved the set of latency priority outside the
> priority update. I have removed the reviewed tag because the patches
> are 2 years old.
>
> The patch [5] uses latency nice priority to define a latency offset
> and then to decide if a cfs task can preempt the current running task. The
> patch gives some tests results with cyclictests and hackbench to highlight
> the benefit of latency priority for short interactive task or
> long intensive tasks.
>
> Patch [6] adds the support of latency_offset to task group by adding a
> cpu.latency field. There were discussions for the last version about
> using a real unit for the field so I decided to expose directly the
> latency offset which reflects the time up to which we can preempt when the
> value is negative, or up to which we can defer the preemption when the
> value is positive.
> The range is [-sysctl_sched_latency:sysctl_sched_latency]
>
> Patch [7] makes sched_core taking into account the latency offset.
>
> Patch [8] adds a rb tree to cover some corner cases where the latency
> sensitive task is preempted by high priority task (RT/DL) or fails to
> preempt them. This patch ensures that tasks will have at least a
> slice of sched_min_granularity in priority at wakeup. The patch gives
> results to show the benefit in addition to patch 5
>
> I have also backported the patchset on a dragonboard RB3 with an android
> mainline kernel based on v5.18 for a quick test. I have used
> the TouchLatency app which is part of AOSP and described to be very good
> test to highlight jitter and jank frame sources of a system [1].
> In addition to the app, I have added some short running tasks waking-up
> regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> without overloading it (and disabling EAS). The 1st results shows that the
> patchset helps to reduce the missed deadline frames from 5% to less than
> 0.1% when the cpu.latency of task group are set.
>
> [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
One thing that still confuses me is whether this proposal is supposed to be
the only consumer of this interface, or whether we still expect other users
to be able to use this hint to optimize other sources of latency in the
scheduler. The last discussion [1] raised issues with the interface and
I haven't seen any discussion of its suitability for enabling future consumers.
I might have missed something. What's the current state on this?
[1] https://lwn.net/Articles/820659/
Thanks
--
Qais Yousef
Hello,
On Wed, Sep 21, 2022 at 05:07:38PM +0100, Qais Yousef wrote:
> Wouldn't cpu.latency.nice be enough? I think the latency_offset is
> implementation detail that userspace shouldn't be concerned about.
One option could be just using the same mapping as cpu.weight so that 100
maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
that the value can't be interpreted in any intuitive way (e.g. a time
duration based interface would be a lot easier to grok even if it still is
best effort) but if that's what the per-task interface is gonna be, it'd be
best to keep cgroup interface in line.
As for whether a single value would fit the bill, it's again something which
should be answered for both the task and cgroup based interfaces at the same
time. That said, my not-too-thought-through opinion is that a single value
for the per-task / per-cgroup interface + system level knobs to fine tune how
that actually applies is likely enough, and probably better than exposing
a host of internal details to applications directly.
Thanks.
--
tejun
On Wed, 21 Sept 2022 at 18:48, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Wed, Sep 21, 2022 at 05:07:38PM +0100, Qais Yousef wrote:
> > Wouldn't cpu.latency.nice be enough? I think the latency_offset is
> > implementation detail that userspace shouldn't be concerned about.
>
> One option could be just using the same mapping as cpu.weight so that 100
> maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
> that the value can't be interpreted in any intuitive way (e.g. a time
> duration based interface would be a lot easier to grok even if it still is
> best effort) but if that's what the per-task interface is gonna be, it'd be
> best to keep cgroup interface in line.
I would prefer a signed range like [-1000:1000], as the behavior is
different for sensitive and non-sensitive tasks, unlike cpu.weight
which reflects that a bigger value gets more.
>
> As for whether a single value would fit the bill, it's again something which
> should be answered for both task and cgroup based interface at the same
> time. That said, my not-too-throught-through opinion is that a single value
> for per-task / per-cgroup interface + system level knobs to fine tune how
> that actually applies is likely enough and probably better than exposing
> exposing a host of internal details to applications directly.
>
> Thanks.
>
> --
> tejun
On Wed, Sep 21, 2022 at 07:02:57PM +0200, Vincent Guittot wrote:
> > One option could be just using the same mapping as cpu.weight so that 100
> > maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
> > that the value can't be interpreted in any intuitive way (e.g. a time
> > duration based interface would be a lot easier to grok even if it still is
> > best effort) but if that's what the per-task interface is gonna be, it'd be
> > best to keep cgroup interface in line.
>
> I would prefer a signed range like the [-1000:1000] as the behavior is
> different for sensitive and non sensitive task unlike the cpu.weight
> which is reflect that a bigger value get more
How about just sticking with .nice?
--
tejun
On Wed, 21 Sept 2022 at 19:12, Tejun Heo <[email protected]> wrote:
>
> On Wed, Sep 21, 2022 at 07:02:57PM +0200, Vincent Guittot wrote:
> > > One option could be just using the same mapping as cpu.weight so that 100
> > > maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
> > > that the value can't be interpreted in any intuitive way (e.g. a time
> > > duration based interface would be a lot easier to grok even if it still is
> > > best effort) but if that's what the per-task interface is gonna be, it'd be
> > > best to keep cgroup interface in line.
> >
> > I would prefer a signed range like the [-1000:1000] as the behavior is
> > different for sensitive and non sensitive task unlike the cpu.weight
> > which is reflect that a bigger value get more
>
> How about just sticking with .nice?
Looks good to me. I will just implement the cpu.latency.nice
>
> --
> tejun
On Wed, 21 Sept 2022 at 18:08, Qais Yousef <[email protected]> wrote:
>
> Hi Vincent
>
> On 09/16/22 10:02, Vincent Guittot wrote:
> > This patchset restarts the work about adding a latency priority to describe
> > the latency tolerance of cfs tasks.
> >
> > The patches [1-4] have been done by Parth:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > I have just rebased and moved the set of latency priority outside the
> > priority update. I have removed the reviewed tag because the patches
> > are 2 years old.
> >
> > The patch [5] uses latency nice priority to define a latency offset
> > and then to decide if a cfs task can preempt the current running task. The
> > patch gives some tests results with cyclictests and hackbench to highlight
> > the benefit of latency priority for short interactive task or
> > long intensive tasks.
> >
> > Patch [6] adds the support of latency_offset to task group by adding a
> > cpu.latency field. There were discussions for the last version about
> > using a real unit for the field so I decided to expose directly the
> > latency offset which reflects the time up to which we can preempt when the
> > value is negative, or up to which we can defer the preemption when the
> > value is positive.
> > The range is [-sysctl_sched_latency:sysctl_sched_latency]
> >
> > Patch [7] makes sched_core taking into account the latency offset.
> >
> > Patch [8] adds a rb tree to cover some corner cases where the latency
> > sensitive task is preempted by high priority task (RT/DL) or fails to
> > preempt them. This patch ensures that tasks will have at least a
> > slice of sched_min_granularity in priority at wakeup. The patch gives
> > results to show the benefit in addition to patch 5
> >
> > I have also backported the patchset on a dragonboard RB3 with an android
> > mainline kernel based on v5.18 for a quick test. I have used
> > the TouchLatency app which is part of AOSP and described to be very good
> > test to highlight jitter and jank frame sources of a system [1].
> > In addition to the app, I have added some short running tasks waking-up
> > regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> > without overloading it (and disabling EAS). The 1st results shows that the
> > patchset helps to reduce the missed deadline frames from 5% to less than
> > 0.1% when the cpu.latency of task group are set.
> >
> > [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
>
> One thing that still confuses me is whether this proposal is supposed to be the
> only consumer of this interface or we still expect other users to be able to
> use this hint to optimize other sources of latency in the scheduler? Last
> discussion [1] raised issues on the interface and I haven't seen any
> discussions on the suitability of the interface to enable future consumers.
>
> I might have missed something. What's the current state on this?
Nothing has changed since the discussion in March:
https://lore.kernel.org/lkml/CAKfTPtBCKyqa-472Z7LtiWTq+Yirq6=jSrkzJtNbkdKXnwP-mA@mail.gmail.com/T/
>
>
> [1] https://lwn.net/Articles/820659/
>
>
> Thanks
>
> --
> Qais Yousef
On 09/22/22 08:40, Vincent Guittot wrote:
> On Wed, 21 Sept 2022 at 19:12, Tejun Heo <[email protected]> wrote:
> >
> > On Wed, Sep 21, 2022 at 07:02:57PM +0200, Vincent Guittot wrote:
> > > > One option could be just using the same mapping as cpu.weight so that 100
> > > > maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
> > > > that the value can't be interpreted in any intuitive way (e.g. a time
> > > > duration based interface would be a lot easier to grok even if it still is
> > > > best effort) but if that's what the per-task interface is gonna be, it'd be
> > > > best to keep cgroup interface in line.
> > >
> > > I would prefer a signed range like the [-1000:1000] as the behavior is
> > > different for sensitive and non sensitive task unlike the cpu.weight
> > > which is reflect that a bigger value get more
> >
> > How about just sticking with .nice?
>
> Looks good to me. I will just implement the cpu.latency.nice
+1
Keeping both interfaces exposing the same thing would make the most sense IMHO.
Though that begs the question of how the per-task and cgroup interfaces should
interact.
For example, one of the proposed use cases was to use this knob to control how
hard we search for the best cpu in the load balancer (IIRC). If a task sets
its latency_nice to -19 but is attached to a cgroup that has cpu.latency.nice
set to 20, how should the new consumer (the load balancer) interpret the
effective value for the task?
IIUC, the current use case doesn't care about the effective value, as it
considers the group and the task as separate entities. But other paths, like
the one above, wouldn't see this separation and would want to know which of
the two to consider.
We should update Documentation/admin-guide/cgroup-v2.rst with these details for
cpu.latency.nice.
Thanks!
--
Qais Yousef
On 09/22/22 09:19, Vincent Guittot wrote:
> On Wed, 21 Sept 2022 at 18:08, Qais Yousef <[email protected]> wrote:
> >
> > Hi Vincent
> >
> > On 09/16/22 10:02, Vincent Guittot wrote:
> > > This patchset restarts the work about adding a latency priority to describe
> > > the latency tolerance of cfs tasks.
> > >
> > > The patches [1-4] have been done by Parth:
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > I have just rebased and moved the set of latency priority outside the
> > > priority update. I have removed the reviewed tag because the patches
> > > are 2 years old.
> > >
> > > The patch [5] uses latency nice priority to define a latency offset
> > > and then to decide if a cfs task can preempt the current running task. The
> > > patch gives some tests results with cyclictests and hackbench to highlight
> > > the benefit of latency priority for short interactive task or
> > > long intensive tasks.
> > >
> > > Patch [6] adds the support of latency_offset to task group by adding a
> > > cpu.latency field. There were discussions for the last version about
> > > using a real unit for the field so I decided to expose directly the
> > > latency offset which reflects the time up to which we can preempt when the
> > > value is negative, or up to which we can defer the preemption when the
> > > value is positive.
> > > The range is [-sysctl_sched_latency:sysctl_sched_latency]
> > >
> > > Patch [7] makes sched_core taking into account the latency offset.
> > >
> > > Patch [8] adds a rb tree to cover some corner cases where the latency
> > > sensitive task is preempted by high priority task (RT/DL) or fails to
> > > preempt them. This patch ensures that tasks will have at least a
> > > slice of sched_min_granularity in priority at wakeup. The patch gives
> > > results to show the benefit in addition to patch 5
> > >
> > > I have also backported the patchset on a dragonboard RB3 with an android
> > > mainline kernel based on v5.18 for a quick test. I have used
> > > the TouchLatency app which is part of AOSP and described to be very good
> > > test to highlight jitter and jank frame sources of a system [1].
> > > In addition to the app, I have added some short running tasks waking-up
> > > regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> > > without overloading it (and disabling EAS). The 1st results shows that the
> > > patchset helps to reduce the missed deadline frames from 5% to less than
> > > 0.1% when the cpu.latency of task group are set.
> > >
> > > [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
> >
> > One thing that still confuses me is whether this proposal is supposed to be the
> > only consumer of this interface or we still expect other users to be able to
> > use this hint to optimize other sources of latency in the scheduler? Last
> > discussion [1] raised issues on the interface and I haven't seen any
> > discussions on the suitability of the interface to enable future consumers.
> >
> > I might have missed something. What's the current state on this?
>
> Nothing has changed since the discussion in March:
> https://lore.kernel.org/lkml/CAKfTPtBCKyqa-472Z7LtiWTq+Yirq6=jSrkzJtNbkdKXnwP-mA@mail.gmail.com/T/
Okay. For my peace of mind, could you please update the cover letter to be
more explicit that this is only one use case, with others potentially coming
later? The other two that I remember are improving the load balancer search
and modifying EAS behavior. Parth had a use case to help take advantage of
turbo frequencies, but I'm not sure if this is still being pursued/desired.
Your proposed use case could actually make the EAS one unnecessary. But it
is hard to tell at this stage, and I prefer to continue to consider it as
a potential new consumer of this interface.
Thanks!
--
Qais Yousef
On Thu, 22 Sept 2022 at 13:00, Qais Yousef <[email protected]> wrote:
>
> On 09/22/22 09:19, Vincent Guittot wrote:
> > On Wed, 21 Sept 2022 at 18:08, Qais Yousef <[email protected]> wrote:
> > >
> > > Hi Vincent
> > >
> > > On 09/16/22 10:02, Vincent Guittot wrote:
> > > > This patchset restarts the work about adding a latency priority to describe
> > > > the latency tolerance of cfs tasks.
> > > >
> > > > The patches [1-4] have been done by Parth:
> > > > https://lore.kernel.org/lkml/[email protected]/
> > > >
> > > > I have just rebased and moved the set of latency priority outside the
> > > > priority update. I have removed the reviewed tag because the patches
> > > > are 2 years old.
> > > >
> > > > The patch [5] uses latency nice priority to define a latency offset
> > > > and then to decide if a cfs task can preempt the current running task. The
> > > > patch gives some tests results with cyclictests and hackbench to highlight
> > > > the benefit of latency priority for short interactive task or
> > > > long intensive tasks.
> > > >
> > > > Patch [6] adds the support of latency_offset to task group by adding a
> > > > cpu.latency field. There were discussions for the last version about
> > > > using a real unit for the field so I decided to expose directly the
> > > > latency offset which reflects the time up to which we can preempt when the
> > > > value is negative, or up to which we can defer the preemption when the
> > > > value is positive.
> > > > The range is [-sysctl_sched_latency:sysctl_sched_latency]
> > > >
> > > > Patch [7] makes sched_core taking into account the latency offset.
> > > >
> > > > Patch [8] adds a rb tree to cover some corner cases where the latency
> > > > sensitive task is preempted by high priority task (RT/DL) or fails to
> > > > preempt them. This patch ensures that tasks will have at least a
> > > > slice of sched_min_granularity in priority at wakeup. The patch gives
> > > > results to show the benefit in addition to patch 5
> > > >
> > > > I have also backported the patchset on a dragonboard RB3 with an android
> > > > mainline kernel based on v5.18 for a quick test. I have used
> > > > the TouchLatency app which is part of AOSP and described to be very good
> > > > test to highlight jitter and jank frame sources of a system [1].
> > > > In addition to the app, I have added some short running tasks waking-up
> > > > regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> > > > without overloading it (and disabling EAS). The 1st results shows that the
> > > > patchset helps to reduce the missed deadline frames from 5% to less than
> > > > 0.1% when the cpu.latency of task group are set.
> > > >
> > > > [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
> > >
> > > One thing that still confuses me is whether this proposal is supposed to be the
> > > only consumer of this interface or we still expect other users to be able to
> > > use this hint to optimize other sources of latency in the scheduler? Last
> > > discussion [1] raised issues on the interface and I haven't seen any
> > > discussions on the suitability of the interface to enable future consumers.
> > >
> > > I might have missed something. What's the current state on this?
> >
> > Nothing has changed since the discussion in March:
> > https://lore.kernel.org/lkml/CAKfTPtBCKyqa-472Z7LtiWTq+Yirq6=jSrkzJtNbkdKXnwP-mA@mail.gmail.com/T/
>
> Okay. For my peace of mind, could you update the cover letter please to be more
> explicit that this is only one use case of potential others to come later in
I thought that the reference to Parth's patchset at the beginning of the
cover letter was enough to highlight that there are other possible
uses of the interface. But I can mention that this is not the sole
use that can be made of the interface in the scheduler, and that other
places could also take advantage of this new hint.
> the future? The other 2 that I remember are improve load balancer search and
> modify EAS behavior. Parth had a use case to help take advantage of turbo
> frequencies, but not sure if this is still being pursued/desired.
>
> Your proposed use case could actually help make the EAS one unnecessary. But it
> is hard to tell at this stage and prefer to continue to consider it as
> a potential new consumer of this interface.
>
>
> Thanks!
>
> --
> Qais Yousef