Take into account the latency priority of a thread when deciding whether to
preempt the currently running thread. The goal is not to give such a thread
more CPU bandwidth but to reorder scheduling so that latency-sensitive tasks
run first whenever possible.

As long as a thread has not used up its bandwidth, it will be able to preempt
the current thread.

Conversely, a thread with a low latency priority will only preempt the current
thread at wakeup in order to keep CPU bandwidth sharing fair. Otherwise it
will wait for the tick to get its sched slice.
                            curr vruntime
                                  |
                     sysctl_sched_wakeup_granularity
                                   <-->
----------------------------------|----|-----------------------|---------------
                                  |    |<--------------------->
                                  |    .  sysctl_sched_latency
                                  |    .
default/current latency entity    |    .
                                  |    .
1111111111111111111111111111111111|0000|-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
se preempts curr at wakeup ------>|<- se doesn't preempt curr -----------------
                                  |    .
                                  |    .
                                  |    .
low latency entity                |    .
                                   ----------------------->|
                                 % of sysctl_sched_latency |
1111111111111111111111111111111111111111111111111111111111|0000|-1-1-1-1-1-1-1-
preempt ------------------------------------------------->|<- do not preempt --
                                  |    .
                                  |    .
                                  |    .
high latency entity               |    .
         |<-----------------------|    .
         | % of sysctl_sched_latency   .
111111111|0000|-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
preempt->|<- se doesn't preempt curr ------------------------------------------
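For reference, the latency_offset values behind the figure above can be
reproduced outside the kernel. The sketch below mirrors set_latency_offset()
and the sched_latency_to_weight[] table added by this patch; the 24ms value
for sysctl_sched_latency is an assumption (a typical scaled default), not
something this patch sets, and plain division replaces the kernel's shift so
negative values stay well defined in standard C.

#include <stdio.h>

/* Copy of sched_latency_to_weight[] from this patch, indexed by latency nice + 20. */
static const int latency_to_weight[40] = {
	/* -20 */ -1024, -973, -922, -870, -819,
	/* -15 */  -768, -717, -666, -614, -563,
	/* -10 */  -512, -461, -410, -358, -307,
	/*  -5 */  -256, -205, -154, -102,  -51,
	/*   0 */     0,   51,  102,  154,  205,
	/*   5 */   256,  307,  358,  410,  461,
	/*  10 */   512,  563,  614,  666,  717,
	/*  15 */   768,  819,  870,  922,  973,
};

int main(void)
{
	const long long sched_latency_ns = 24000000LL;	/* assumed sysctl_sched_latency */
	const int samples[] = { -20, -10, -1, 0, 10, 19 };

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		int nice = samples[i];
		/* Same formula as set_latency_offset(): latency * weight >> NICE_LATENCY_SHIFT */
		long long offset = sched_latency_ns * latency_to_weight[nice + 20] / 1024;

		printf("latency nice %3d -> latency_offset %9lld ns\n", nice, offset);
	}
	return 0;
}

With these values, latency nice -20 maps to an offset of about -24ms, 0 maps
to 0 and +19 to about +22.8ms, which are the preemption thresholds sketched
above.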
Test results of nice latency impact on a heavy load like hackbench:

hackbench -l (2560 / group) -g group

group   latency 0           latency 19
1       1.350(+/- 1.10%)    1.287(+/- 1.60%)   + 5%
4       1.392(+/- 1.67%)    1.248(+/- 1.51%)   +11%
8       1.294(+/- 1.56%)    1.254(+/- 1.96%)   + 3%
16      1.315(+/- 1.02%)    1.288(+/- 1.53%)   + 2%
hackbench -p -l (2560 / group) -g group

group   latency 0           latency 19
1       1.768(+/- 13%)      0.805(+/- 2%)      +54%
4       1.634(+/- 13%)      0.765(+/- 1%)      +53%
8       1.305(+/- 4%)       0.745(+/- 2%)      +43%
16      0.786(+/- 4%)       0.705(+/- 2%)      +10%
By decreasing the latency prio, we reduce the number of preemptions at
wakeup and help hackbench make progress.
Test results of nice latency impact on a short-lived load like cyclictest
while competing with a heavy load like hackbench:

hackbench -l 10000 -g $group &
cyclictest --policy other -D 5 -q -n

        latency 0                  latency -20
group   min    avg     max         min    avg    max
0       17     18      28          16     18     28
1       67     319     7369        63     76     2479
4       64     453     51699       45     95     8788
8       65     814     51699       63     116    19535
16      64     1275    37744       63     159    51134
group = 0 means that hackbench is not running.
The avg is significantly improved with nice latency -20, especially with a
large number of groups, but min and max remain quite similar. If we add the
histogram parameter to get details of the latency distribution, we have:
hackbench -l 10000 -g 16 &
cyclictest --policy other -D 5 -q -n -H 20000 --histfile data.txt

                  latency 0    latency -20
Min Latencies:        63            63
Avg Latencies:      1459           214
Max Latencies:     58274         82358
50% latencies:       164            87
75% latencies:       848            91
85% latencies:      1246            94
90% latencies:      2149            96
95% latencies:      8463            99
99% latencies:    >20000           120
With the percentile details, we see the benefit of nice latency -20: only 1%
of the latencies stay above 120us, whereas with the default latency 25% are
above 848us, 10% above 2ms and 1% above 20ms.
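As a usage note (not part of this patch): a task opts in through
sched_setattr(). A minimal sketch follows; the sched_attr layout and the
SCHED_FLAG_LATENCY_NICE value are assumptions taken from the earlier patches
of this series, so check the uapi headers of the tree the series is applied
to before relying on them.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Assumed flag value from the uapi changes earlier in this series. */
#define SCHED_FLAG_LATENCY_NICE	0x80

/* Assumed layout: upstream sched_attr plus the sched_latency_nice field of this series. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* new field added by the series */
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;				/* SCHED_NORMAL */
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = -20;			/* most latency sensitive */

	/* pid 0 == current task; there is no libc wrapper, so use the raw syscall. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}

Lowering the value below the task's current latency nice needs CAP_SYS_NICE,
mirroring the permission check in __sched_setscheduler() touched by this
patch.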
Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 4 +++-
init/init_task.c | 2 +-
kernel/sched/core.c | 39 ++++++++++++++++++++++++------
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 14 ++++++++++-
6 files changed, 104 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a0adb55efa1c..2d2a361d65ee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -557,6 +557,8 @@ struct sched_entity {
/* cached value of my_q->h_nr_running */
unsigned long runnable_weight;
#endif
+ /* preemption offset in ns */
+ long latency_offset;
#ifdef CONFIG_SMP
/*
@@ -773,7 +775,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned int rt_priority;
- int latency_nice;
+ int latency_prio;
struct sched_entity se;
struct sched_rt_entity rt;
diff --git a/init/init_task.c b/init/init_task.c
index 225d11a39bc9..e98c71f24981 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -78,7 +78,7 @@ struct task_struct init_task
.prio = MAX_PRIO - 20,
.static_prio = MAX_PRIO - 20,
.normal_prio = MAX_PRIO - 20,
- .latency_nice = DEFAULT_LATENCY_NICE,
+ .latency_prio = NICE_WIDTH - 20,
.policy = SCHED_NORMAL,
.cpus_ptr = &init_task.cpus_mask,
.user_cpus_ptr = NULL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c79c5419d1b..13cf794708ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1288,6 +1288,13 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}
+static void set_latency_offset(struct task_struct *p)
+{
+ long weight = sched_latency_to_weight[p->latency_prio];
+
+ p->se.latency_offset = (sysctl_sched_latency * weight) >> NICE_LATENCY_SHIFT;
+}
+
#ifdef CONFIG_UCLAMP_TASK
/*
* Serializes updates of utilization clamp values
@@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = current->normal_prio;
/* Propagate the parent's latency requirements to the child as well */
- p->latency_nice = current->latency_nice;
+ p->latency_prio = current->latency_prio;
uclamp_fork(p);
@@ -4530,7 +4537,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = p->static_prio;
set_load_weight(p, false);
- p->latency_nice = DEFAULT_LATENCY_NICE;
+ p->latency_prio = NICE_TO_LATENCY(0);
+ set_latency_offset(p);
+
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
@@ -7295,8 +7304,10 @@ static void __setscheduler_params(struct task_struct *p,
static void __setscheduler_latency(struct task_struct *p,
const struct sched_attr *attr)
{
- if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
- p->latency_nice = attr->sched_latency_nice;
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+ p->latency_prio = NICE_TO_LATENCY(attr->sched_latency_nice);
+ set_latency_offset(p);
+ }
}
/*
@@ -7445,7 +7456,7 @@ static int __sched_setscheduler(struct task_struct *p,
if (attr->sched_latency_nice < MIN_LATENCY_NICE)
return -EINVAL;
/* Use the same security checks as NICE */
- if (attr->sched_latency_nice < p->latency_nice &&
+ if (attr->sched_latency_nice < LATENCY_TO_NICE(p->latency_prio) &&
!capable(CAP_SYS_NICE))
return -EPERM;
}
@@ -7485,7 +7496,7 @@ static int __sched_setscheduler(struct task_struct *p,
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
goto change;
if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
- attr->sched_latency_nice != p->latency_nice)
+ attr->sched_latency_nice != LATENCY_TO_NICE(p->latency_prio))
goto change;
p->sched_reset_on_fork = reset_on_fork;
@@ -8026,7 +8037,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
get_params(p, &kattr);
kattr.sched_flags &= SCHED_FLAG_ALL;
- kattr.sched_latency_nice = p->latency_nice;
+ kattr.sched_latency_nice = LATENCY_TO_NICE(p->latency_prio);
#ifdef CONFIG_UCLAMP_TASK
/*
@@ -11177,6 +11188,20 @@ const u32 sched_prio_to_wmult[40] = {
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
+/*
+ * latency weight for wakeup preemption
+ */
+const int sched_latency_to_weight[40] = {
+ /* -20 */ -1024, -973, -922, -870, -819,
+ /* -15 */ -768, -717, -666, -614, -563,
+ /* -10 */ -512, -461, -410, -358, -307,
+ /* -5 */ -256, -205, -154, -102, -51,
+ /* 0 */ 0, 51, 102, 154, 205,
+ /* 5 */ 256, 307, 358, 410, 461,
+ /* 10 */ 512, 563, 614, 666, 717,
+ /* 15 */ 768, 819, 870, 922, 973,
+};
+
void call_trace_sched_update_nr_running(struct rq *rq, int count)
{
trace_sched_update_nr_running_tp(rq, count);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index a3f7876217a6..06aaa0c81d4b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1042,7 +1042,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
#endif
P(policy);
P(prio);
- P(latency_nice);
+ P(latency_prio);
if (task_has_dl_policy(p)) {
P(dl.runtime);
P(dl.deadline);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8c1b889dcbb..a20eadb0af97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4574,6 +4574,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_idle_cfs_rq_clock_pelt(cfs_rq);
}
+static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se);
+
/*
* Preempt the current task with a newly woken task if needed:
*/
@@ -4606,6 +4608,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
se = __pick_first_entity(cfs_rq);
delta = curr->vruntime - se->vruntime;
+ delta -= wakeup_latency_gran(curr, se);
if (delta < 0)
return;
@@ -5732,6 +5735,35 @@ static int sched_idle_cpu(int cpu)
}
#endif
+static void set_next_buddy(struct sched_entity *se);
+
+static void check_preempt_from_others(struct cfs_rq *cfs, struct sched_entity *se)
+{
+ struct sched_entity *next;
+
+ if (se->latency_offset >= 0)
+ return;
+
+ if (cfs->nr_running <= 1)
+ return;
+ /*
+ * When waking from idle, we don't need to check to preempt at wakeup
+ * the idle thread and don't set next buddy as a candidate for being
+ * picked in priority.
+ * In case of simultaneous wakeup from idle, the latency sensitive tasks
+ * lost opportunity to preempt non sensitive tasks which woke up
+ * simultaneously.
+ */
+
+ if (cfs->next)
+ next = cfs->next;
+ else
+ next = __pick_first_entity(cfs);
+
+ if (next && wakeup_preempt_entity(next, se) == 1)
+ set_next_buddy(se);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5818,14 +5850,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new)
update_overutilized_status(rq);
+ if (rq->curr->sched_class != &fair_sched_class)
+ check_preempt_from_others(cfs_rq_of(&p->se), &p->se);
+
enqueue_throttle:
assert_list_leaf_cfs_rq(rq);
hrtick_update(rq);
}
-static void set_next_buddy(struct sched_entity *se);
-
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -7148,6 +7181,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
#endif /* CONFIG_SMP */
+static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se)
+{
+ long latency_offset = se->latency_offset;
+
+ /*
+ * A negative latency weigth means that the sched_entity has latency
+ * requirement that needs to be evaluated versus other entity.
+ * Otherwise, use the latency weight to evaluate how much scheduling
+ * delay is acceptable by se.
+ */
+ if ((se->latency_offset < 0) || (curr->latency_offset < 0))
+ latency_offset -= curr->latency_offset;
+
+ return latency_offset;
+}
+
static unsigned long wakeup_gran(struct sched_entity *se)
{
unsigned long gran = sysctl_sched_wakeup_granularity;
@@ -7187,6 +7236,9 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
s64 gran, vdiff = curr->vruntime - se->vruntime;
+ /* Take into account latency priority */
+ vdiff -= wakeup_latency_gran(curr, se);
+
if (vdiff <= 0)
return -1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b927269b84f9..a4cb8058f1b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -141,6 +141,17 @@ extern int sched_rr_timeslice;
* Default tasks should be treated as a task with latency_nice = 0.
*/
#define DEFAULT_LATENCY_NICE 0
+#define DEFAULT_LATENCY_PRIO (DEFAULT_LATENCY_NICE + LATENCY_NICE_WIDTH/2)
+
+/*
+ * Convert user-nice values [ -20 ... 0 ... 19 ]
+ * to static latency [ 0..39 ],
+ * and back.
+ */
+#define NICE_TO_LATENCY(nice) ((nice) + DEFAULT_LATENCY_PRIO)
+#define LATENCY_TO_NICE(prio) ((prio) - DEFAULT_LATENCY_PRIO)
+#define NICE_LATENCY_SHIFT (SCHED_FIXEDPOINT_SHIFT)
+#define NICE_LATENCY_WEIGHT_MAX (1L << NICE_LATENCY_SHIFT)
/*
* Increase resolution of nice-level calculations for 64-bit architectures.
@@ -2114,6 +2125,7 @@ static_assert(WF_TTWU == SD_BALANCE_WAKE);
extern const int sched_prio_to_weight[40];
extern const u32 sched_prio_to_wmult[40];
+extern const int sched_latency_to_weight[40];
/*
* {de,en}queue flags:
@@ -2442,8 +2454,8 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;
-#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_latency;
+#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_min_granularity;
extern unsigned int sysctl_sched_idle_min_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
--
2.17.1
Hi Hillf,
On Fri, 16 Sept 2022 at 14:03, Hillf Danton <[email protected]> wrote:
>
> Hello Vincent
>
> On 16 Sep 2022 10:03:02 +0200 Vincent Guittot <[email protected]> wrote:
> >
> > @@ -4606,6 +4608,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >
> > se = __pick_first_entity(cfs_rq);
> > delta = curr->vruntime - se->vruntime;
> > + delta -= wakeup_latency_gran(curr, se);
> >
> > if (delta < 0)
> > return;
>
> What is derived from the latency nice you added is the runtime granulaity
> which has a role in preempting the current task.
>
> Given the same defination of latency nice as the nice, the runtime granularity
> can be computed without introducing the latency nice.
>
> Only for thoughts now.
>
> Hillf
>
> +++ b/kernel/sched/fair.c
> @@ -4569,7 +4569,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> static void
> check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> {
> - unsigned long ideal_runtime, delta_exec;
> + unsigned long ideal_runtime, delta_exec, granu;
> struct sched_entity *se;
> s64 delta;
>
> @@ -4594,6 +4594,14 @@ check_preempt_tick(struct cfs_rq *cfs_rq
> return;
>
> se = __pick_first_entity(cfs_rq);
> +
> + granu = sysctl_sched_min_granularity +
> + (ideal_runtime - sysctl_sched_min_granularity) *
> + (se->latency_nice + 20) / LATENCY_NICE_WIDTH;
There is no latency_nice field in se but a latency_offset instead.
Also, at this step we are sure that curr has run at least
sysctl_sched_min_granularity and we now want to compare curr's vruntime
with the first se's one. We take the latency offset into account to make
sure we will not preempt curr too early
> +
> + if (delta_exec < granu)
> + return;
> +
> delta = curr->vruntime - se->vruntime;
>
> if (delta < 0)
On Sun, 18 Sept 2022 at 00:58, Hillf Danton <[email protected]> wrote:
>
> On 16 Sep 2022 15:36:53 +0200 Vincent Guittot <[email protected]> wrote:
> >
> > Hi Hillf,
> >
> > On Fri, 16 Sept 2022 at 14:03, Hillf Danton <[email protected]> wrote:
> > >
> > > Hello Vincent
> > >
> > > On 16 Sep 2022 10:03:02 +0200 Vincent Guittot <[email protected]> wrote:
> > > >
> > > > @@ -4606,6 +4608,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > >
> > > > se = __pick_first_entity(cfs_rq);
> > > > delta = curr->vruntime - se->vruntime;
> > > > + delta -= wakeup_latency_gran(curr, se);
> > > >
> > > > if (delta < 0)
> > > > return;
> > >
> > > What is derived from the latency nice you added is the runtime granulaity
> > > which has a role in preempting the current task.
> > >
> > > Given the same defination of latency nice as the nice, the runtime granularity
> > > can be computed without introducing the latency nice.
> > >
> > > Only for thoughts now.
> > >
> > > Hillf
> > >
> > > +++ b/kernel/sched/fair.c
> > > @@ -4569,7 +4569,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> > > static void
> > > check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > {
> > > - unsigned long ideal_runtime, delta_exec;
> > > + unsigned long ideal_runtime, delta_exec, granu;
> > > struct sched_entity *se;
> > > s64 delta;
> > >
> > > @@ -4594,6 +4594,14 @@ check_preempt_tick(struct cfs_rq *cfs_rq
> > > return;
> > >
> > > se = __pick_first_entity(cfs_rq);
> > > +
> > > + granu = sysctl_sched_min_granularity +
> > > + (ideal_runtime - sysctl_sched_min_granularity) *
> > > + (se->latency_nice + 20) / LATENCY_NICE_WIDTH;
> >
> > There is no latency_nice field in se but a latency_offset instead
> >
> > Also at this step, we are sure that curr has run at least
> > sysctl_sched_min_granularity and we want now to compare curr vruntime
> > with first se's one. We take the latency offset into account to make
> > sure we will not preempt curr too early
> >
> > > +
> > > + if (delta_exec < granu)
> > > + return;
> > > +
> > > delta = curr->vruntime - se->vruntime;
> > >
> > > if (delta < 0)
> return;
>
> if (delta > ideal_runtime)
> resched_curr(rq_of(cfs_rq));
>
> After another look, curr is not preempted without the gap in vruntime
> between curr and the first entity growing more than ideal runtime, while
Curr can be preempted as it has run more than the ideal time (1st
test). This one is to make sure that the diff does not become too
large. Here we reuse the same comparison as at wakeup to make sure that a
newly running curr will get a chance to run its ideal time after having
preempted the previous current at wakeup
> with latency_offset, since the gap becomes larger, preempt happens later
> than ideal runtime thoughts IMO.
On 16/09/2022 10:03, Vincent Guittot wrote:
[...]
> @@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->prio = current->normal_prio;
>
> /* Propagate the parent's latency requirements to the child as well */
> - p->latency_nice = current->latency_nice;
> + p->latency_prio = current->latency_prio;
Isn't a `set_latency_offset(p)` missing here?
>
> uclamp_fork(p);
>
[...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e8c1b889dcbb..a20eadb0af97 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4574,6 +4574,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> update_idle_cfs_rq_clock_pelt(cfs_rq);
> }
>
> +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se);
minor: `struct sched_entity *curr` ... doesn't have to be current
(cfs_rq->curr). Isn't this more like `struct sched_entity *sea, struct
sched_entity *seb`? Anyway, it's already the case for
`wakeup_preempt_entity`.
[...]
> @@ -5732,6 +5735,35 @@ static int sched_idle_cpu(int cpu)
> }
> #endif
>
> +static void set_next_buddy(struct sched_entity *se);
> +
> +static void check_preempt_from_others(struct cfs_rq *cfs, struct sched_entity *se)
minor: Why `struct cfs_rq *cfs` and not `struct cfs_rq *cfs_rq` ?
Using `cfs_rq` would make it more consistent when looking for things
like `cfs_rq->nr_running` for example.
> +{
> + struct sched_entity *next;
> +
> + if (se->latency_offset >= 0)
> + return;
> +
> + if (cfs->nr_running <= 1)
> + return;
> + /*
> + * When waking from idle, we don't need to check to preempt at wakeup
s/idle/others ?
> + * the idle thread and don't set next buddy as a candidate for being
> + * picked in priority.
> + * In case of simultaneous wakeup from idle, the latency sensitive tasks
> + * lost opportunity to preempt non sensitive tasks which woke up
> + * simultaneously.
> + */
The position of this comment block within this function is somehow
misleading since it describes the reason for the function rather then a
particular condition within this function. Wouldn't it be more readable
when it would be a function header comment instead?
[...]
> @@ -7148,6 +7181,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> }
> #endif /* CONFIG_SMP */
>
> +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se)
> +{
> + long latency_offset = se->latency_offset;
> +
> + /*
> + * A negative latency weigth means that the sched_entity has latency
s/weigth/latency_offset ?
> + * requirement that needs to be evaluated versus other entity.
> + * Otherwise, use the latency weight to evaluate how much scheduling
> + * delay is acceptable by se.
> + */
> + if ((se->latency_offset < 0) || (curr->latency_offset < 0))
> + latency_offset -= curr->latency_offset;
I still don't get the rationale behind why when either one (se or curr)
of the latency_nice values is negative, we use the diff between them but
if not, we only care about se's value. Why don't you always use the diff
between se and curr? Since we have a range [-20 ... 19] why shouldn't we
use the difference between let's say se = 19 and curr = 5?
You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
[...]
On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
<[email protected]> wrote:
>
> On 16/09/2022 10:03, Vincent Guittot wrote:
>
> [...]
>
> > @@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> > p->prio = current->normal_prio;
> >
> > /* Propagate the parent's latency requirements to the child as well */
> > - p->latency_nice = current->latency_nice;
> > + p->latency_prio = current->latency_prio;
>
> Isn't here a `set_latency_offset(p)` missing here?
Hmm, I think it's the opposite and the line above is a nop from the
beginning (i.e. patch 2).
>
> >
> > uclamp_fork(p);
> >
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e8c1b889dcbb..a20eadb0af97 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4574,6 +4574,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > update_idle_cfs_rq_clock_pelt(cfs_rq);
> > }
> >
> > +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se);
>
> minor: `struct sched_entity *curr` ... doesn't have to be current
> (cfs_rq->curr). Isn't this more like `struct sched_entity *sea, struct
> sched_entity *seb`? Anyway, it's already the case for
> `wakeup_preempt_entity`.
>
> [...]
>
> > @@ -5732,6 +5735,35 @@ static int sched_idle_cpu(int cpu)
> > }
> > #endif
> >
> > +static void set_next_buddy(struct sched_entity *se);
> > +
> > +static void check_preempt_from_others(struct cfs_rq *cfs, struct sched_entity *se)
>
> minor: Why `struct cfs_rq *cfs` and not `struct cfs_rq *cfs_rq` ?
>
> Using `cfs_rq` would make it more consistent when looking for things
> like `cfs_rq->nr_running` for example.
>
> > +{
> > + struct sched_entity *next;
> > +
> > + if (se->latency_offset >= 0)
> > + return;
> > +
> > + if (cfs->nr_running <= 1)
> > + return;
> > + /*
> > + * When waking from idle, we don't need to check to preempt at wakeup
>
> s/idle/others ?
yes, I forgot to update the comment
>
> > + * the idle thread and don't set next buddy as a candidate for being
> > + * picked in priority.
> > + * In case of simultaneous wakeup from idle, the latency sensitive tasks
> > + * lost opportunity to preempt non sensitive tasks which woke up
> > + * simultaneously.
> > + */
>
> The position of this comment block within this function is somehow
> misleading since it describes the reason for the function rather then a
> particular condition within this function. Wouldn't it be more readable
> when it would be a function header comment instead?
I put it after the usual early return tests to put the comment close
to the useful part: the use of next buddy and __pick_first_entity()
>
> [...]
>
> > @@ -7148,6 +7181,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > }
> > #endif /* CONFIG_SMP */
> >
> > +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se)
> > +{
> > + long latency_offset = se->latency_offset;
> > +
> > + /*
> > + * A negative latency weigth means that the sched_entity has latency
>
> s/weigth/latency_offset ?
yes
>
>
> > + * requirement that needs to be evaluated versus other entity.
> > + * Otherwise, use the latency weight to evaluate how much scheduling
> > + * delay is acceptable by se.
> > + */
> > + if ((se->latency_offset < 0) || (curr->latency_offset < 0))
> > + latency_offset -= curr->latency_offset;
>
> I still don't get the rationale behind why when either one (se or curr)
> of the latency_nice values is negative, we use the diff between them but
> if not, we only care about se's value. Why don't you always use the diff
> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
> use the difference between let's say se = 19 and curr = 5?
> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
Let's say that current has a latency nice prio of 19 and a task A with a
latency nice of 10 wakes up. Both tasks don't care about scheduling
latency (current even less than task A). If we use the diff, the output of
wakeup_latency_gran() would be negative (-10ms), which reflects the
fact that the waking task is sensitive to the latency and wants to
preempt current even if its vruntime is after. But obviously neither
current nor task A cares about preempting at wakeup.
>
> [...]
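To make this rule concrete, here is a hedged sketch of the
wakeup_latency_gran() decision evaluated with plain numbers for the cases
discussed in this thread. The 24ms sysctl_sched_latency and the linear
weight approximation are assumptions; the patch itself uses the
sched_latency_to_weight[] table.

#include <stdio.h>

#define SCHED_LATENCY_NS 24000000LL	/* assumed sysctl_sched_latency */

/* Rough latency_offset for a latency nice value, mirroring set_latency_offset(). */
static long long offset(int nice)
{
	/* sched_latency_to_weight[] is approximately nice * 1024 / 20. */
	return SCHED_LATENCY_NS * (nice * 1024 / 20) / 1024;
}

/* Same rule as wakeup_latency_gran() in the patch. */
static long long latency_gran(long long curr_off, long long se_off)
{
	long long gran = se_off;

	/* Only use the relative offset when one of the two is latency sensitive. */
	if (se_off < 0 || curr_off < 0)
		gran -= curr_off;
	return gran;
}

int main(void)
{
	/* (1) p = 10, curr = 19: both tolerant, only p's own offset counts (~12ms). */
	printf("(1) %lld ns\n", latency_gran(offset(19), offset(10)));
	/* (2) p = 10, curr = -10: curr is sensitive, so the diff is used (~24ms). */
	printf("(2) %lld ns\n", latency_gran(offset(-10), offset(10)));
	/* (3) p = -19, curr = -19: both sensitive, the diff is ~0, neither gets an edge. */
	printf("(3) %lld ns\n", latency_gran(offset(-19), offset(-19)));
	return 0;
}

A larger positive return value shrinks vdiff in wakeup_preempt_entity() and
so makes preemption of curr by the waking task less likely; a negative one
makes it more likely.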
On 19/09/2022 17:39, Vincent Guittot wrote:
> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
> <[email protected]> wrote:
>>
>> On 16/09/2022 10:03, Vincent Guittot wrote:
>>
>> [...]
>>
>>> @@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>>> p->prio = current->normal_prio;
>>>
>>> /* Propagate the parent's latency requirements to the child as well */
>>> - p->latency_nice = current->latency_nice;
>>> + p->latency_prio = current->latency_prio;
>>
>> Isn't here a `set_latency_offset(p)` missing here?
>
> Hmm, I think it's the opposite and the line above is a nop from the
> beginning (i.e. patch 2).
Yeah, you're right! It looked suspicious ...
[...]
>>> + * the idle thread and don't set next buddy as a candidate for being
>>> + * picked in priority.
>>> + * In case of simultaneous wakeup from idle, the latency sensitive tasks
>>> + * lost opportunity to preempt non sensitive tasks which woke up
>>> + * simultaneously.
>>> + */
>>
>> The position of this comment block within this function is somehow
>> misleading since it describes the reason for the function rather then a
>> particular condition within this function. Wouldn't it be more readable
>> when it would be a function header comment instead?
>
> I put it after the usual early return tests to put the comment close
> to the useful part: the use of next buddy and __pick_first_entity()
So you want to have the `wakeup_preempt_entity(se, pse) == 1` condition
from check_preempt_wakeup() also for cfs_task woken up by others.
[...]
>>> + * requirement that needs to be evaluated versus other entity.
>>> + * Otherwise, use the latency weight to evaluate how much scheduling
>>> + * delay is acceptable by se.
>>> + */
>>> + if ((se->latency_offset < 0) || (curr->latency_offset < 0))
>>> + latency_offset -= curr->latency_offset;
>>
>> I still don't get the rationale behind why when either one (se or curr)
>> of the latency_nice values is negative, we use the diff between them but
>> if not, we only care about se's value. Why don't you always use the diff
>> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
>> use the difference between let's say se = 19 and curr = 5?
>> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
>
> Let say that current has a latency nice prio of 19 and a task A with a
> latency nice of 10 wakes up. Both tasks don't care about scheduling
> latency (current more than task A). If we use the diff, the output of
> wakeup_latency_gran() would be negative (-10ms) which reflects the
> fact that the waking task is sensitive to the latency and wants to
> preempt current even if its vruntime is after. But obviously both
> current and task A don't care to preempt at wakeup.
OK, I understand but there is a certain level of unsteadiness here.
If p has >0 it gets treated differently in case current has >=0 and case
current has <0.
Do we expect that tasks set their value to [1..19] in this case, when
the default 0 already indicates a 'don't care'?
On Tue, 20 Sept 2022 at 13:32, Hillf Danton <[email protected]> wrote:
>
> On 18 Sep 2022 12:46:00 +0200 Vincent Guittot <[email protected]> wrote:
> > On Sun, 18 Sept 2022 at 00:58, Hillf Danton <[email protected]> wrote:
> > >
> > > On 16 Sep 2022 15:36:53 +0200 Vincent Guittot <[email protected]> wrote:
> > > >
> > > > Hi Hillf,
> > > >
> > > > On Fri, 16 Sept 2022 at 14:03, Hillf Danton <[email protected]> wrote:
> > > > >
> > > > > Hello Vincent
> > > > >
> > > > > On 16 Sep 2022 10:03:02 +0200 Vincent Guittot <[email protected]> wrote:
> > > > > >
> > > > > > @@ -4606,6 +4608,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > > > >
> > > > > > se = __pick_first_entity(cfs_rq);
> > > > > > delta = curr->vruntime - se->vruntime;
> > > > > > + delta -= wakeup_latency_gran(curr, se);
> > > > > >
> > > > > > if (delta < 0)
> > > > > > return;
> > > > >
> > > > > What is derived from the latency nice you added is the runtime granulaity
> > > > > which has a role in preempting the current task.
> > > > >
> > > > > Given the same defination of latency nice as the nice, the runtime granularity
> > > > > can be computed without introducing the latency nice.
> > > > >
> > > > > Only for thoughts now.
> > > > >
> > > > > Hillf
> > > > >
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -4569,7 +4569,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> > > > > static void
> > > > > check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > > > {
> > > > > - unsigned long ideal_runtime, delta_exec;
> > > > > + unsigned long ideal_runtime, delta_exec, granu;
> > > > > struct sched_entity *se;
> > > > > s64 delta;
> > > > >
> > > > > @@ -4594,6 +4594,14 @@ check_preempt_tick(struct cfs_rq *cfs_rq
> > > > > return;
> > > > >
> > > > > se = __pick_first_entity(cfs_rq);
> > > > > +
> > > > > + granu = sysctl_sched_min_granularity +
> > > > > + (ideal_runtime - sysctl_sched_min_granularity) *
> > > > > + (se->latency_nice + 20) / LATENCY_NICE_WIDTH;
> > > >
> > > > There is no latency_nice field in se but a latency_offset instead
> > > >
> > > > Also at this step, we are sure that curr has run at least
> > > > sysctl_sched_min_granularity and we want now to compare curr vruntime
> > > > with first se's one. We take the latency offset into account to make
> > > > sure we will not preempt curr too early
> > > >
> > > > > +
> > > > > + if (delta_exec < granu)
> > > > > + return;
> > > > > +
> > > > > delta = curr->vruntime - se->vruntime;
> > > > >
> > > > > if (delta < 0)
> > > return;
> > >
> > > if (delta > ideal_runtime)
> > > resched_curr(rq_of(cfs_rq));
> > >
> > > After another look, curr is not preempted without the gap in vruntime
> > > between curr and the first entity growing more than ideal runtime, while
> >
> > Curr can be preempted as it has run more than the ideal time (1st
> > test). This one is to make sure that the diff does not become too
> > large. Here we reuse the same comparison as wakeup to make sure that a
> > newly curr will get a chance to run its ideal time after having
> > preempted at wakeup the previous current
>
> IIUC it would take two checks to preempt correctly - diff in vruntime
> is checked first to avoid preempting too early, then it is checked again
> with latency_offset taken into account to avoid preempting too late.
The 1st test in check_preempt_tick(): if (delta_exec > ideal_runtime)
ensures that a resched happens once current has run its slice.
The 2nd test: if (delta_exec < sysctl_sched_min_granularity)
ensures that current will run at least 3ms.
The 3rd one: if (delta > ideal_runtime)
is there to make sure that there is not too large a diff between the
vruntimes. But we are comparing virtual runtime with execution time and,
as Peter mentioned in another thread, that's weird. [1] will fix it in
addition to ensuring a max runtime.
Then, current might have preempted first_entity a few ms before thanks
to its latency_offset. If the tick happens quickly after the
preemption, delta might be above ideal_runtime whereas current has not
run its ideal time yet
[1] https://lore.kernel.org/all/[email protected]/
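To illustrate the 1st/2nd/3rd test ordering above with concrete numbers, a
simplified standalone model follows. The helper and the example values are
assumptions for illustration only, not the kernel function.

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified model of check_preempt_tick() with the wakeup_latency_gran()
 * adjustment from this patch. All values are in ns and passed in directly.
 */
static bool tick_resched_curr(long long delta_exec, long long ideal_runtime,
			      long long min_granularity,
			      long long curr_vruntime, long long first_vruntime,
			      long long latency_gran)
{
	/* 1st test: curr has consumed its slice -> resched. */
	if (delta_exec > ideal_runtime)
		return true;

	/* 2nd test: guarantee curr at least sched_min_granularity of runtime. */
	if (delta_exec < min_granularity)
		return false;

	/*
	 * 3rd test: bound the vruntime gap to the first waiting entity. The
	 * latency offset is subtracted so a latency-sensitive curr that
	 * preempted earlier still gets a chance to run its ideal time.
	 */
	long long delta = (curr_vruntime - first_vruntime) - latency_gran;
	if (delta < 0)
		return false;

	return delta > ideal_runtime;
}

int main(void)
{
	/* curr ran 4ms of a 6ms slice but is 8ms ahead of the first entity ... */
	printf("gran 0ms  -> resched: %d\n",
	       tick_resched_curr(4000000, 6000000, 3000000, 8000000, 0, 0));
	/* ... unless wakeup_latency_gran(curr, first) is large, e.g. 12ms when
	 * curr has a -12ms offset and first has the default 0. */
	printf("gran 12ms -> resched: %d\n",
	       tick_resched_curr(4000000, 6000000, 3000000, 8000000, 0, 12000000));
	return 0;
}

The second call shows the situation described above: right after a
latency-driven preemption, the offset keeps delta below ideal_runtime so
curr is not resched'ed immediately.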
>
> +++ b/kernel/sched/fair.c
> @@ -4571,7 +4571,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
> {
> unsigned long ideal_runtime, delta_exec;
> struct sched_entity *se;
> - s64 delta;
> + s64 delta, d2;
>
> ideal_runtime = sched_slice(cfs_rq, curr);
> delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> @@ -4595,11 +4595,9 @@ check_preempt_tick(struct cfs_rq *cfs_rq
>
> se = __pick_first_entity(cfs_rq);
> delta = curr->vruntime - se->vruntime;
> + d2 = delta - wakeup_latency_gran(curr, se);
>
> - if (delta < 0)
> - return;
> -
> - if (delta > ideal_runtime)
> + if (delta > ideal_runtime || d2 > ideal_runtime)
> resched_curr(rq_of(cfs_rq));
> }
>
On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
<[email protected]> wrote:
>
> On 19/09/2022 17:39, Vincent Guittot wrote:
> > On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
> > <[email protected]> wrote:
> >>
> >> On 16/09/2022 10:03, Vincent Guittot wrote:
> >>
> >> [...]
> >>
> >>> @@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> >>> p->prio = current->normal_prio;
> >>>
> >>> /* Propagate the parent's latency requirements to the child as well */
> >>> - p->latency_nice = current->latency_nice;
> >>> + p->latency_prio = current->latency_prio;
> >>
> >> Isn't here a `set_latency_offset(p)` missing here?
> >
> > Hmm, I think it's the opposite and the line above is a nop from the
> > beginning (i.e. patch 2).
>
> Yeah, you're right! It looked suspicious ...
>
> [...]
>
> >>> + * the idle thread and don't set next buddy as a candidate for being
> >>> + * picked in priority.
> >>> + * In case of simultaneous wakeup from idle, the latency sensitive tasks
> >>> + * lost opportunity to preempt non sensitive tasks which woke up
> >>> + * simultaneously.
> >>> + */
> >>
> >> The position of this comment block within this function is somehow
> >> misleading since it describes the reason for the function rather then a
> >> particular condition within this function. Wouldn't it be more readable
> >> when it would be a function header comment instead?
> >
> > I put it after the usual early return tests to put the comment close
> > to the useful part: the use of next buddy and __pick_first_entity()
>
> So you want to have the `wakeup_preempt_entity(se, pse) == 1` condition
> from check_preempt_wakeup() also for cfs_task woken up by others.
I use the wakeup_preempt_entity(cfs_rq->next, left) < 1 check in
pick_next_entity() to pick the task with the highest latency constraint
when another class is running while waking up
>
> [...]
>
> >>> + * requirement that needs to be evaluated versus other entity.
> >>> + * Otherwise, use the latency weight to evaluate how much scheduling
> >>> + * delay is acceptable by se.
> >>> + */
> >>> + if ((se->latency_offset < 0) || (curr->latency_offset < 0))
> >>> + latency_offset -= curr->latency_offset;
> >>
> >> I still don't get the rationale behind why when either one (se or curr)
> >> of the latency_nice values is negative, we use the diff between them but
> >> if not, we only care about se's value. Why don't you always use the diff
> >> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
> >> use the difference between let's say se = 19 and curr = 5?
> >> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
> >
> > Let say that current has a latency nice prio of 19 and a task A with a
> > latency nice of 10 wakes up. Both tasks don't care about scheduling
> > latency (current more than task A). If we use the diff, the output of
> > wakeup_latency_gran() would be negative (-10ms) which reflects the
> > fact that the waking task is sensitive to the latency and wants to
> > preempt current even if its vruntime is after. But obviously both
> > current and task A don't care to preempt at wakeup.
>
> OK, I understand but there is a certain level of unsteadiness here.
>
> If p has >0 it gets treated differently in case current has >=0 and case
"If p >=0"; 0 has same behavior than [1..19]
> current has <0.
>
> Do we expect that tasks set their value to [1..19] in this case, when
> the default 0 already indicates a 'don't care'?
I'm not sure that I understand your concern as [0..19] are treated in
the same way. Only tasks (curr or se) with offset <0 need a relative
comparison to the other. If curr and se both have a latency nice of
-19, se should not blindly preempt curr but only if curr has already run
for its amount of time
On 20/09/2022 17:49, Vincent Guittot wrote:
> On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
> <[email protected]> wrote:
>>
>> On 19/09/2022 17:39, Vincent Guittot wrote:
>>> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
>>> <[email protected]> wrote:
>>>>
>>>> On 16/09/2022 10:03, Vincent Guittot wrote:
[...]
>>>>> + * the idle thread and don't set next buddy as a candidate for being
>>>>> + * picked in priority.
>>>>> + * In case of simultaneous wakeup from idle, the latency sensitive tasks
>>>>> + * lost opportunity to preempt non sensitive tasks which woke up
>>>>> + * simultaneously.
>>>>> + */
>>>>
>>>> The position of this comment block within this function is somehow
>>>> misleading since it describes the reason for the function rather then a
>>>> particular condition within this function. Wouldn't it be more readable
>>>> when it would be a function header comment instead?
>>>
>>> I put it after the usual early return tests to put the comment close
>>> to the useful part: the use of next buddy and __pick_first_entity()
>>
>> So you want to have the `wakeup_preempt_entity(se, pse) == 1` condition
>> from check_preempt_wakeup() also for cfs_task woken up by others.
>
> I wake the wakeup_preempt_entity(cfs_rq->next, left) < 1 in
> pick_next_entity() to pick the task with highest latency constraint
> when another class is running while waking up
That's correct. This is where you potentially pick this task since it is
the next_buddy.
All I wanted to say is that check_preempt_from_others() and its `next &&
wakeup_preempt_entity(next, se) == 1` is the counterpart of the
`wakeup_preempt_entity(se, pse) == 1` in check_preempt_wakeup() to be
able to set next_buddy even when curr is from another class than CFS.
[...]
>>>> I still don't get the rationale behind why when either one (se or curr)
>>>> of the latency_nice values is negative, we use the diff between them but
>>>> if not, we only care about se's value. Why don't you always use the diff
>>>> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
>>>> use the difference between let's say se = 19 and curr = 5?
>>>> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
>>>
>>> Let say that current has a latency nice prio of 19 and a task A with a
>>> latency nice of 10 wakes up. Both tasks don't care about scheduling
>>> latency (current more than task A). If we use the diff, the output of
>>> wakeup_latency_gran() would be negative (-10ms) which reflects the
>>> fact that the waking task is sensitive to the latency and wants to
>>> preempt current even if its vruntime is after. But obviously both
>>> current and task A don't care to preempt at wakeup.
>>
>> OK, I understand but there is a certain level of unsteadiness here.
>>
>> If p has >0 it gets treated differently in case current has >=0 and case
>
> "If p >=0"; 0 has same behavior than [1..19]
>
>> current has <0.
Not quite. It depends on curr. With sysctl_sched_latency = 24ms:
(1) p = 10 curr = 19 -> wakeup_latency_gran() returns 12ms
(2) p = 10 curr = -10 -> wakeup_latency_gran() returns 24ms
In (1) only p's own latency counts whereas in (2) we take the diff,
In (A) we 'punish' p even though it competes against curr which has an
even lower latency requirement than p,
>> Do we expect that tasks set their value to [1..19] in this case, when
>> the default 0 already indicates a 'don't care'?
>
> I'm not sure that I understand your concern as [0..19] are treated in
> the same way. Only tasks (curr or se) with offset <0 need a relative
> comparison to the other. If curr and se has both a latency nice of
> -19, se should not blindly preempt curr but only if curr already run
> for its amount of time
With p = -19 and curr = -19 we would take the diff, so 0ms.
With p = 19 and curr = 19, if we would use `latency_offset -=
curr->latency_offset` wakeup_latency_gran() would return 973/1024*24ms -
973/1024*24ms = 0ms and nothing will shift.
OTOH, in case (1) wakeup_latency_gran() would return 512/1024*24ms -
973/1024*24ms = - 10.80ms. So p would gain an advantage here instead of
a penalty.
Essentially using the full [-20 .. 19] nice scope for `se vs. curr`
comparison.
On Thu, 22 Sept 2022 at 00:41, Dietmar Eggemann
<[email protected]> wrote:
>
> On 20/09/2022 17:49, Vincent Guittot wrote:
> > On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
> > <[email protected]> wrote:
> >>
> >> On 19/09/2022 17:39, Vincent Guittot wrote:
> >>> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
> >>> <[email protected]> wrote:
> >>>>
> >>>> On 16/09/2022 10:03, Vincent Guittot wrote:
>
> [...]
>
> >>>>> + * the idle thread and don't set next buddy as a candidate for being
> >>>>> + * picked in priority.
> >>>>> + * In case of simultaneous wakeup from idle, the latency sensitive tasks
> >>>>> + * lost opportunity to preempt non sensitive tasks which woke up
> >>>>> + * simultaneously.
> >>>>> + */
> >>>>
> >>>> The position of this comment block within this function is somehow
> >>>> misleading since it describes the reason for the function rather then a
> >>>> particular condition within this function. Wouldn't it be more readable
> >>>> when it would be a function header comment instead?
> >>>
> >>> I put it after the usual early return tests to put the comment close
> >>> to the useful part: the use of next buddy and __pick_first_entity()
> >>
> >> So you want to have the `wakeup_preempt_entity(se, pse) == 1` condition
> >> from check_preempt_wakeup() also for cfs_task woken up by others.
> >
> > I wake the wakeup_preempt_entity(cfs_rq->next, left) < 1 in
> > pick_next_entity() to pick the task with highest latency constraint
> > when another class is running while waking up
>
> That's correct. This is where you potentially pick this task since it is
> the next_buddy.
> All I wanted to say is that check_preempt_from_others() and its `next &&
> wakeup_preempt_entity(next, se) == 1` is the counterpart of the
> `wakeup_preempt_entity(se, pse) == 1` in check_preempt_wakeup() to be
> able to set next_buddy even curr is from an other class than CFS.
>
> [...]
>
> >>>> I still don't get the rationale behind why when either one (se or curr)
> >>>> of the latency_nice values is negative, we use the diff between them but
> >>>> if not, we only care about se's value. Why don't you always use the diff
> >>>> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
> >>>> use the difference between let's say se = 19 and curr = 5?
> >>>> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
> >>>
> >>> Let say that current has a latency nice prio of 19 and a task A with a
> >>> latency nice of 10 wakes up. Both tasks don't care about scheduling
> >>> latency (current more than task A). If we use the diff, the output of
> >>> wakeup_latency_gran() would be negative (-10ms) which reflects the
> >>> fact that the waking task is sensitive to the latency and wants to
> >>> preempt current even if its vruntime is after. But obviously both
> >>> current and task A don't care to preempt at wakeup.
> >>
> >> OK, I understand but there is a certain level of unsteadiness here.
> >>
> >> If p has >0 it gets treated differently in case current has >=0 and case
> >
> > "If p >=0"; 0 has same behavior than [1..19]
> >
> >> current has <0.
>
> Not quite. It depends on curr. With sysctl_sched_latency = 24ms:
I thought you were speaking about priority 0 vs [1..19] as you made a
difference in your previous comment below
>
> (1) p = 10 curr = 19 -> wakeup_latency_gran() returns 12ms
>
> (2) p = 10 curr = -10 -> wakeup_latency_gran() returns 24ms
>
> In (1) only p's own latency counts whereas in (2) we take the diff,
Yes because curr is latency sensitive in (2) whereas it's not in (1)
>
> In (A) we 'punish' p even though it competes against curr which has an
> even lower latency requirement than p,
What is (A) ? Assuming you meant (1), having a positive nice latency
means that you don't have a latency requirement but you are tolerant to
scheduling delay, so we don't 'punish' p. P will preempt curr if we are
above the tolerance
>
> >> Do we expect that tasks set their value to [1..19] in this case, when
> >> the default 0 already indicates a 'don't care'?
> >
> > I'm not sure that I understand your concern as [0..19] are treated in
> > the same way. Only tasks (curr or se) with offset <0 need a relative
> > comparison to the other. If curr and se has both a latency nice of
> > -19, se should not blindly preempt curr but only if curr already run
> > for its amount of time
>
> With p = -19 and curr = -19 we would take the diff, so 0ms.
>
> With p = 19 and curr = 19, if we would use `latency_offset -=
> curr->latency_offset` wakeup_latency_gran() would return 973/1024*24ms -
> 973/1024*24ms = 0ms and nothing will shift.
>
> OTHA, in case (1) wakeup_latency_gran() would return 512/1024*24ms -
> 973/1024*24ms = - 10.80ms. So p would gain an advantage here instead of
> a penalty.
And that's the whole point. A priority >= 0 means that you don't care
about scheduling delays so there is no reason to be more aggressive
with a task that is also latency tolerant. We only have to ensure that
the delay stays in the acceptable range
>
> Essentially using the full [-20 .. 19] nice scope for `se vs. curr`
> comparison.
On 22/09/2022 09:12, Vincent Guittot wrote:
> On Thu, 22 Sept 2022 at 00:41, Dietmar Eggemann
> <[email protected]> wrote:
>>
>> On 20/09/2022 17:49, Vincent Guittot wrote:
>>> On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
>>> <[email protected]> wrote:
>>>>
>>>> On 19/09/2022 17:39, Vincent Guittot wrote:
>>>>> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 16/09/2022 10:03, Vincent Guittot wrote:
[...]
> I thought you were speaking about priority 0 vs [1..19] as you made a
> difference in your previous comment below
>
>>
>> (1) p = 10 curr = 19 -> wakeup_latency_gran() returns 12ms
>>
>> (2) p = 10 curr = -10 -> wakeup_latency_gran() returns 24ms
>>
>> In (1) only p's own latency counts whereas in (2) we take the diff,
>
> Yes because curr is latency sensitive in (2) whereas it's not in (1)
>
>>
>> In (A) we 'punish' p even though it competes against curr which has an
>> even lower latency requirement than p,
>
> What is (A) ? Assuming you meant (1), having a positive nice latency
Sorry, yes I meant (1).
> means that you don't have latency requirement but you are tolerant to
> scheduling delay so we don't 'punish' p. P will preempt curr is we are
> above the tolerance
wakeup_preempt_entity() {
vdiff = curr->vruntime - se->vruntime
vdiff -= wakeup_latency_gran(curr, se) <-- (3)
if (vdiff <= 0)
return -1;
...
}
Wouldn't it be more suitable to return 0 from wakeup_latency_gran() if
both have latency_nice >=0 in this case instead of se->latency_offset?
By `punish` I mean that vdiff (3) gets smaller in case we return (the
positive) `se->latency_offset` even when `latency nice of curr` > `latency
nice of p`.
[...]
>> With p = -19 and curr = -19 we would take the diff, so 0ms.
>>
>> With p = 19 and curr = 19, if we would use `latency_offset -=
>> curr->latency_offset` wakeup_latency_gran() would return 973/1024*24ms -
>> 973/1024*24ms = 0ms and nothing will shift.
>>
>> OTHA, in case (1) wakeup_latency_gran() would return 512/1024*24ms -
>> 973/1024*24ms = - 10.80ms. So p would gain an advantage here instead of
>> a penalty.
>
> And that's all the point. A priority >= 0 means that you don't care
> about scheduling delays so there is no reason to be more aggressive
> with a task that is also latency tolerant. We only have to ensure that
> the delay stays in the acceptable range
OK, I understand your model here but I'm still not convinced. Might be
interesting to hear what others think.
On Thu, 22 Sept 2022 at 18:50, Dietmar Eggemann
<[email protected]> wrote:
>
> On 22/09/2022 09:12, Vincent Guittot wrote:
> > On Thu, 22 Sept 2022 at 00:41, Dietmar Eggemann
> > <[email protected]> wrote:
> >>
> >> On 20/09/2022 17:49, Vincent Guittot wrote:
> >>> On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
> >>> <[email protected]> wrote:
> >>>>
> >>>> On 19/09/2022 17:39, Vincent Guittot wrote:
> >>>>> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> On 16/09/2022 10:03, Vincent Guittot wrote:
>
> [...]
>
> > I thought you were speaking about priority 0 vs [1..19] as you made a
> > difference in your previous comment below
> >
> >>
> >> (1) p = 10 curr = 19 -> wakeup_latency_gran() returns 12ms
> >>
> >> (2) p = 10 curr = -10 -> wakeup_latency_gran() returns 24ms
> >>
> >> In (1) only p's own latency counts whereas in (2) we take the diff,
> >
> > Yes because curr is latency sensitive in (2) whereas it's not in (1)
> >
> >>
> >> In (A) we 'punish' p even though it competes against curr which has an
> >> even lower latency requirement than p,
> >
> > What is (A) ? Assuming you meant (1), having a positive nice latency
>
> Sorry, yes I meant (1).
>
> > means that you don't have latency requirement but you are tolerant to
> > scheduling delay so we don't 'punish' p. P will preempt curr is we are
> > above the tolerance
>
> wakeup_preempt_entity() {
>
> vdiff = curr->vruntime - se->vruntime
>
> vdiff -= wakeup_latency_gran(curr, se) <-- (3)
>
> if (vdiff <= 0)
> return -1;
>
> ...
> }
>
> Wouldn't it be more suitable to return 0 from wakeup_latency_gran() if
> both have latency_nice >=0 in this case instead of se->latency_offset?
No, because that defeats the whole purpose of being tolerant to
scheduling delay and not trying to preempt the current as usual at
wakeup. In this case there would be no difference from not setting the
latency nice value.
>
> By `punish` I mean that vdiff (3) gets smaller in case we return (the
> positive) `se->latency_offset` even `latency nice of curr` > `latency
> nice of p`.
>
> [...]
>
> >> With p = -19 and curr = -19 we would take the diff, so 0ms.
> >>
> >> With p = 19 and curr = 19, if we would use `latency_offset -=
> >> curr->latency_offset` wakeup_latency_gran() would return 973/1024*24ms -
> >> 973/1024*24ms = 0ms and nothing will shift.
> >>
> >> OTHA, in case (1) wakeup_latency_gran() would return 512/1024*24ms -
> >> 973/1024*24ms = - 10.80ms. So p would gain an advantage here instead of
> >> a penalty.
> >
> > And that's all the point. A priority >= 0 means that you don't care
> > about scheduling delays so there is no reason to be more aggressive
> > with a task that is also latency tolerant. We only have to ensure that
> > the delay stays in the acceptable range
>
> OK, I understand you model here but I'm still not convinced. Might be
> interesting to hear what others think.