2023-05-08 16:25:35

by Vineeth Remanan Pillai

Subject: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

The current reclaim calculation for GRUB is a bit inaccurate, and the
inaccuracy gets larger as the bandwidth of the tasks becomes smaller.
I have a test program to show the issue - it runs one or more
deadline threads and observes the utilization. The following tests
were run on an isolated cpu (isolcpus=3) in a 4-cpu system, with the
results shown below:

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.33
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.29

RUN 2: runtime=2ms, deadline=period=10ms, RT capacity = 95%
TID[704]: RECLAIM=1, (r=2ms, d=10ms, p=10ms), Util: 79.96
TID[704]: RECLAIM=1, (r=2ms, d=10ms, p=10ms), Util: 80.06
TID[704]: RECLAIM=1, (r=2ms, d=10ms, p=10ms), Util: 80.00

RUN 3: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.70

When running multiple tasks, the reclaimed bandwidth is divided
proportionally among them, but it is not reclaimed up to the max
allowable limit:

RUN 4: 2 SCHED_FLAG_RECLAIM tasks, 1 normal task
Task 1: runtime=1ms, deadline=period=10ms
Task 2: runtime=1ms, deadline=period=10ms
Task 3: runtime=5ms, deadline=period=20ms(normal)
TID[624]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 20.10
TID[625]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 20.10
TID[626]: RECLAIM=0, (r=5ms, d=20ms, p=20ms), Util: 25.07
TID[624]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 20.06
TID[625]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 20.13
TID[626]: RECLAIM=0, (r=5ms, d=20ms, p=20ms), Util: 25.12
TID[624]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 19.95
TID[625]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 19.93
TID[626]: RECLAIM=0, (r=5ms, d=20ms, p=20ms), Util: 25.04

I have also tested multiple tasks on all cpus, allowing tasks to
migrate, and I see the same issue there as well. Running 10 tasks on 3
cpus with 6 SCHED_FLAG_RECLAIM and 4 normal tasks, top shows:
%Cpu0 : 70.1 us, 0.3 sy, 0.0 ni, 29.3 id, 0.0 wa
%Cpu1 : 69.1 us, 0.3 sy, 0.0 ni, 30.3 id, 0.3 wa
%Cpu2 : 70.5 us, 0.3 sy, 0.0 ni, 29.2 id, 0.0 wa

The max{} logic in the existing implementation seems to not fully
capture the GRUB algorithm.

This patch fixes the issue by appropriately capping the maximum
allowed utilization and by slightly adjusting the GRUB algorithm to
account for a mix of normal deadline and SCHED_FLAG_RECLAIM tasks.

According to the GRUB rule, the runtime is depleted as a factor of
the active bandwidth of the runqueue: "dq = -U * dt", where U is the
active bandwidth. Also, we do not allocate the full bandwidth of a
cpu to deadline tasks, but only a portion (Umax) of it, so as to
avoid deadline tasks starving lower-class tasks. The equation can
thus be re-written as "dq = -(U / Umax) * dt".

Since normal deadline and SCHED_FLAG_RECLAIM tasks can share a cpu,
we need to consider the bandwidth of only the SCHED_FLAG_RECLAIM
tasks in the equation:
"dq = -(Ureclaim / Umax_reclaim) * dt"

Following are the results with this patch:

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[616]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 94.98
TID[616]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.04
TID[616]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.01

RUN 2: runtime=2ms, deadline=period=10ms, RT capacity = 95%
TID[625]: RECLAIM=1, (r=2ms, d=10ms, p=10ms), Util: 95.00
TID[625]: RECLAIM=1, (r=2ms, d=10ms, p=10ms), Util: 95.01
TID[625]: RECLAIM=1, (r=2ms, d=10ms, p=10ms), Util: 94.99

RUN 3: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[629]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 94.87
TID[629]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.03
TID[629]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.03

More results with multiple tasks:
RUN 4: 2 SCHED_FLAG_RECLAIM tasks:
Task 1: runtime=1ms, deadline=period=10ms
Task 2: runtime=1ms, deadline=period=10ms
TID[633]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 47.53
TID[634]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 47.64
TID[633]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 47.52
TID[634]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 47.39
TID[633]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 47.59
TID[634]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 47.61

RUN 5: 2 SCHED_FLAG_RECLAIM tasks, 1 normal task
Task 1: runtime=1ms, deadline=period=10ms
Task 2: runtime=1ms, deadline=period=10ms
Task 3: runtime=5ms, deadline=period=20ms(normal)
TID[641]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 35.02
TID[642]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 35.02
TID[643]: RECLAIM=0, (r=5ms, d=20ms, p=20ms), Util: 24.93
TID[641]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 35.00
TID[642]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 34.94
TID[643]: RECLAIM=0, (r=5ms, d=20ms, p=20ms), Util: 24.98
TID[641]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 35.00
TID[642]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 35.03
TID[643]: RECLAIM=0, (r=5ms, d=20ms, p=20ms), Util: 25.03

Running tasks on all cpus allowing for migration also showed that
the utilization is reclaimed to the maximum. Running 10 tasks on
3 cpus with 6 SCHED_FLAG_RECLAIM and 4 normal tasks - top shows:
%Cpu0 : 94.3 us, 0.3 sy, 0.0 ni, 5.4 id, 0.0 wa
%Cpu1 : 95.2 us, 0.0 sy, 0.0 ni, 4.8 id, 0.0 wa
%Cpu2 : 94.9 us, 0.0 sy, 0.0 ni, 5.1 id, 0.0 wa

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
kernel/sched/deadline.c | 95 +++++++++++++++++++++++------------------
kernel/sched/sched.h | 21 +++++++--
2 files changed, 71 insertions(+), 45 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 71b24371a6f7..17b2d87ea6fc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -229,11 +229,13 @@ __dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
}

static inline
-void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq, bool reclaim_bw_se)
{
u64 old = dl_rq->running_bw;

lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
+ if (reclaim_bw_se)
+ dl_rq->reclaim_bw += dl_bw;
dl_rq->running_bw += dl_bw;
SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -242,15 +244,19 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
}

static inline
-void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq, bool reclaim_bw_se)
{
u64 old = dl_rq->running_bw;

lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
+ if (reclaim_bw_se)
+ dl_rq->reclaim_bw -= dl_bw;
dl_rq->running_bw -= dl_bw;
SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
- if (dl_rq->running_bw > old)
+ if (dl_rq->running_bw > old) {
dl_rq->running_bw = 0;
+ dl_rq->reclaim_bw = 0;
+ }
/* kick cpufreq (see the comment in kernel/sched/sched.h). */
cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
}
@@ -296,14 +302,14 @@ static inline
void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
if (!dl_entity_is_special(dl_se))
- __add_running_bw(dl_se->dl_bw, dl_rq);
+ __add_running_bw(dl_se->dl_bw, dl_rq, dl_entity_is_reclaim(dl_se));
}

static inline
void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
if (!dl_entity_is_special(dl_se))
- __sub_running_bw(dl_se->dl_bw, dl_rq);
+ __sub_running_bw(dl_se->dl_bw, dl_rq, dl_entity_is_reclaim(dl_se));
}

static void dl_change_utilization(struct task_struct *p, u64 new_bw)
@@ -522,6 +528,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
#endif

dl_rq->running_bw = 0;
+ dl_rq->reclaim_bw = 0;
dl_rq->this_bw = 0;
init_dl_rq_bw_ratio(dl_rq);
}
@@ -1260,44 +1267,51 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
}

/*
- * This function implements the GRUB accounting rule:
- * according to the GRUB reclaiming algorithm, the runtime is
- * not decreased as "dq = -dt", but as
- * "dq = -max{u / Umax, (1 - Uinact - Uextra)} dt",
- * where u is the utilization of the task, Umax is the maximum reclaimable
- * utilization, Uinact is the (per-runqueue) inactive utilization, computed
- * as the difference between the "total runqueue utilization" and the
- * runqueue active utilization, and Uextra is the (per runqueue) extra
- * reclaimable utilization.
- * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
- * multiplied by 2^BW_SHIFT, the result has to be shifted right by
- * BW_SHIFT.
- * Since rq->dl.bw_ratio contains 1 / Umax multiplied by 2^RATIO_SHIFT,
- * dl_bw is multiped by rq->dl.bw_ratio and shifted right by RATIO_SHIFT.
- * Since delta is a 64 bit variable, to have an overflow its value
- * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
- * So, overflow is not an issue here.
+ * This function implements a slightly modified version of the GRUB accounting
+ * rule to accommodate a mix of normal deadline tasks and SCHED_FLAG_RECLAIM tasks
+ * running together:
+ * As per the GRUB rule, the runtime is not decreased as "dq = -dt", but as a
+ * factor of the running(active) bandwidth for a cpu:
+ * "dq = -U * dt"
+ * In our case, deadline is not allowed to use the whole bandwidth of the cpu,
+ * but only a fraction of it: "Umax". So the equation becomes:
+ * "dq = -(U / Umax) * dt"
+ *
+ * To account for the fact that we have a mix of normal deadline tasks and
+ * SCHED_FLAG_RECLAIM tasks running together, we do not consider the bandwidth
+ * of normal tasks in the equation. So
+ * "dq = -(Ureclaim / Umax_reclaim) * dt"
+ * Where
+ * Ureclaim: Active Bandwidth of SCHED_FLAG_RECLAIM tasks for this rq.
+ * Umax_reclaim: Maximum reclaimable bandwidth for this rq.
+ *
+ * We can calculate Umax_reclaim as:
+ * Umax_reclaim: this_bw + Uinact + Ureclaim
+ * Where:
+ * this_bw: Reserved bandwidth for this runqueue.
+ * Ureclaim: Active Bandwidth of SCHED_FLAG_RECLAIM tasks for this rq.
+ * Uinact: Inactive utilization (this_bw - running_bw)
+ * Uextra: Extra bandwidth(Usually Umax - this_bw)
+ * Umax: Max usable bandwidth. Currently
+ * = sched_rt_runtime_us / sched_rt_period_us
+ *
+ * We use the above formula to scale the runtime down
+ *
+ * dq = -(Ureclaim / Umax_reclaim) * dt
+ * = -(Ureclaim / (Ureclaim + Uextra + Uinact)) * dt
*/
static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
{
+ u64 scaled_delta;
u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
- u64 u_act;
- u64 u_act_min = (dl_se->dl_bw * rq->dl.bw_ratio) >> RATIO_SHIFT;
+ u64 reclaimable_bw = rq->dl.extra_bw + u_inact;

- /*
- * Instead of computing max{u * bw_ratio, (1 - u_inact - u_extra)},
- * we compare u_inact + rq->dl.extra_bw with
- * 1 - (u * rq->dl.bw_ratio >> RATIO_SHIFT), because
- * u_inact + rq->dl.extra_bw can be larger than
- * 1 * (so, 1 - u_inact - rq->dl.extra_bw would be negative
- * leading to wrong results)
- */
- if (u_inact + rq->dl.extra_bw > BW_UNIT - u_act_min)
- u_act = u_act_min;
- else
- u_act = BW_UNIT - u_inact - rq->dl.extra_bw;
+ if (reclaimable_bw > rq->dl.max_bw)
+ reclaimable_bw = rq->dl.max_bw;

- return (delta * u_act) >> BW_SHIFT;
+ scaled_delta = div64_u64(delta * rq->dl.reclaim_bw,
+ (rq->dl.reclaim_bw + reclaimable_bw));
+ return scaled_delta;
}

/*
@@ -2783,12 +2797,9 @@ int sched_dl_global_validate(void)
static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
{
if (global_rt_runtime() == RUNTIME_INF) {
- dl_rq->bw_ratio = 1 << RATIO_SHIFT;
- dl_rq->extra_bw = 1 << BW_SHIFT;
+ dl_rq->max_bw = dl_rq->extra_bw = 1 << BW_SHIFT;
} else {
- dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
- global_rt_period()) >> (BW_SHIFT - RATIO_SHIFT);
- dl_rq->extra_bw = to_ratio(global_rt_period(),
+ dl_rq->max_bw = dl_rq->extra_bw = to_ratio(global_rt_period(),
global_rt_runtime());
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..13d85af0f42b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -257,6 +257,11 @@ static inline bool dl_entity_is_special(const struct sched_dl_entity *dl_se)
#endif
}

+static inline bool dl_entity_is_reclaim(const struct sched_dl_entity *dl_se)
+{
+ return dl_se->flags & SCHED_FLAG_RECLAIM;
+}
+
/*
* Tells if entity @a should preempt entity @b.
*/
@@ -754,10 +759,20 @@ struct dl_rq {
u64 extra_bw;

/*
- * Inverse of the fraction of CPU utilization that can be reclaimed
- * by the GRUB algorithm.
+ * Maximum available bandwidth for this runqueue. This is used to
+ * calculate reclaimable bandwidth for SCHED_FLAG_RECLAIM tasks.
+ * By restricting maximum usable bandwidth, we aim to give other
+ * tasks on lower classes a chance to run, when competing with
+ * SCHED_FLAG_RECLAIM tasks.
*/
- u64 bw_ratio;
+ u64 max_bw;
+
+ /*
+ * Active bandwidth of SCHED_FLAG_RECLAIM tasks on this rq.
+ * This will be a subset of running_bw.
+ */
+ u64 reclaim_bw;
+
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.40.1


2023-05-08 16:29:29

by Vineeth Remanan Pillai

Subject: [PATCH 2/2] Documentation: sched/deadline: Update GRUB description

Update the details of GRUB to reflect the updated logic.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
Documentation/scheduler/sched-deadline.rst | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst
index 9d9be52f221a..0c73f07f712d 100644
--- a/Documentation/scheduler/sched-deadline.rst
+++ b/Documentation/scheduler/sched-deadline.rst
@@ -195,11 +195,15 @@ Deadline Task Scheduling
its utilization is added to the active utilization of the runqueue where
it has been enqueued.

- For each runqueue, the algorithm GRUB keeps track of two different bandwidths:
+ For each runqueue, the algorithm GRUB keeps track of three different bandwidths:

- Active bandwidth (running_bw): this is the sum of the bandwidths of all
tasks in active state (i.e., ActiveContending or ActiveNonContending);

+ - Active bandwidth of SCHED_FLAG_RECLAIM tasks (reclaim_bw): this is the sum of
+ the bandwidths of all tasks in active state which participate in GRUB. This is
+ a subset of running_bw and is needed for the reclaimable bandwidth calculation.
+
- Total bandwidth (this_bw): this is the sum of all tasks "belonging" to the
runqueue, including the tasks in Inactive state.

@@ -209,12 +213,12 @@ Deadline Task Scheduling
to

dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt
+ dq = -(Ureclaim / (Uextra + Uinact + Ureclaim)) dt

where:

- - Ui is the bandwidth of task Ti;
- - Umax is the maximum reclaimable utilization (subjected to RT throttling
- limits);
+ - Ureclaim is the (per runqueue) bandwidth of all SCHED_FLAG_RECLAIM tasks
+ in active state;
- Uinact is the (per runqueue) inactive utilization, computed as
(this_bq - running_bw);
- Uextra is the (per runqueue) extra reclaimable utilization
@@ -222,7 +226,8 @@ Deadline Task Scheduling


Let's now see a trivial example of two deadline tasks with runtime equal
- to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::
+ to 4 and period equal to 8 (i.e., bandwidth equal to 0.5). Tasks are
+ allowed to use the whole cpu(Umax = 1)::

A Task T1
|
@@ -244,7 +249,7 @@ Deadline Task Scheduling
0 1 2 3 4 5 6 7 8


- A running_bw
+ A reclaim_bw
|
1 ----------------- ------
| | |
@@ -272,7 +277,7 @@ Deadline Task Scheduling

This is the 0-lag time for Task T1. Since it didn't woken up in the
meantime, it enters the Inactive state. Its bandwidth is removed from
- running_bw.
+ running_bw and reclaim_bw.
Task T2 continues its execution. However, its runtime is now decreased as
dq = - 0.5 dt because Uinact = 0.5.
Task T2 therefore reclaims the bandwidth unused by Task T1.
@@ -280,7 +285,7 @@ Deadline Task Scheduling
- Time t = 8:

Task T1 wakes up. It enters the ActiveContending state again, and the
- running_bw is incremented.
+ running_bw and reclaim_bw are incremented.


2.3 Energy-aware scheduling
--
2.40.1

2023-05-09 11:38:27

by luca abeni

Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

Hi,

if I understand well, your patch addresses 2 separate issues:
1) The current reclaiming code uses an approximation to avoid using
div64_u64(), which might introduce too much overhead (at least, as
far as I remember :). Your patch changes it to use the exact,
non-approximated, equation
2) Currently, the reclaimable CPU time is divided as if all the
SCHED_DEADLINE tasks (and not only the SCHED_FLAG_RECLAIM tasks)
could use it; your patch changes the code to distribute the
reclaimable CPU time only to tasks having the SCHED_FLAG_RECLAIM
flag set

Is this understanding correct?
If using div64_u64() does not introduce too much overhead, then I agree
with the first change.
The second change also looks good to me.

I have no comments on the code, but there is one thing in the comments
that looks misleading to me (or I am misunderstanding the code or the
comments):

On Mon, 8 May 2023 12:08:28 -0400
Vineeth Pillai <[email protected]> wrote:
[...]

> + * "dq = -(Ureclaim / Umax_reclaim) * dt"
> + * Where
> + * Ureclaim: Active Bandwidth of SCHED_FLAG_RECLAIM tasks for this rq.
> + * Umax_reclaim: Maximum reclaimable bandwidth for this rq.
> + *
> + * We can calculate Umax_reclaim as:
> + * Umax_reclaim: this_bw + Uinact + Ureclaim

I think this looks like a typo (summing this_bw to Uinact
looks wrong). Should "this_bw" be Uextra?

> + * Where:
> + * this_bw: Reserved bandwidth for this runqueue.
> + * Ureclaim: Active Bandwidth of SCHED_FLAG_RECLAIM tasks for this rq.
> + * Uinact: Inactive utilization (this_bw - running_bw)
> + * Uextra: Extra bandwidth(Usually Umax - this_bw)
> + * Umax: Max usable bandwidth. Currently
> + * = sched_rt_runtime_us / sched_rt_period_us
> + *
> + * We use the above formula to scale the runtime down
> + *
> + * dq = -(Ureclaim / Umax_reclaim) * dt
> + * = -(Ureclaim / (Ureclaim + Uextra + Uinact)) * dt

I think this should be the correct equation. BTW, since you are summing
Uextra and Uinact, maybe you could just use "Umax - running_bw"?



Luca

> */
> static u64 grub_reclaim(u64 delta, struct rq *rq, struct
> sched_dl_entity *dl_se) {
> + u64 scaled_delta;
> u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot -
> Uact */
> - u64 u_act;
> - u64 u_act_min = (dl_se->dl_bw * rq->dl.bw_ratio) >>
> RATIO_SHIFT;
> + u64 reclaimable_bw = rq->dl.extra_bw + u_inact;
>
> - /*
> - * Instead of computing max{u * bw_ratio, (1 - u_inact -
> u_extra)},
> - * we compare u_inact + rq->dl.extra_bw with
> - * 1 - (u * rq->dl.bw_ratio >> RATIO_SHIFT), because
> - * u_inact + rq->dl.extra_bw can be larger than
> - * 1 * (so, 1 - u_inact - rq->dl.extra_bw would be negative
> - * leading to wrong results)
> - */
> - if (u_inact + rq->dl.extra_bw > BW_UNIT - u_act_min)
> - u_act = u_act_min;
> - else
> - u_act = BW_UNIT - u_inact - rq->dl.extra_bw;
> + if (reclaimable_bw > rq->dl.max_bw)
> + reclaimable_bw = rq->dl.max_bw;
>
> - return (delta * u_act) >> BW_SHIFT;
> + scaled_delta = div64_u64(delta * rq->dl.reclaim_bw,
> + (rq->dl.reclaim_bw + reclaimable_bw));
> + return scaled_delta;
> }
>
> /*
> @@ -2783,12 +2797,9 @@ int sched_dl_global_validate(void)
> static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
> {
> if (global_rt_runtime() == RUNTIME_INF) {
> - dl_rq->bw_ratio = 1 << RATIO_SHIFT;
> - dl_rq->extra_bw = 1 << BW_SHIFT;
> + dl_rq->max_bw = dl_rq->extra_bw = 1 << BW_SHIFT;
> } else {
> - dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
> - global_rt_period()) >> (BW_SHIFT -
> RATIO_SHIFT);
> - dl_rq->extra_bw = to_ratio(global_rt_period(),
> + dl_rq->max_bw = dl_rq->extra_bw =
> to_ratio(global_rt_period(), global_rt_runtime());
> }
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3e8df6d31c1e..13d85af0f42b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -257,6 +257,11 @@ static inline bool dl_entity_is_special(const
> struct sched_dl_entity *dl_se) #endif
> }
>
> +static inline bool dl_entity_is_reclaim(const struct sched_dl_entity
> *dl_se) +{
> + return dl_se->flags & SCHED_FLAG_RECLAIM;
> +}
> +
> /*
> * Tells if entity @a should preempt entity @b.
> */
> @@ -754,10 +759,20 @@ struct dl_rq {
> u64 extra_bw;
>
> /*
> - * Inverse of the fraction of CPU utilization that can be
> reclaimed
> - * by the GRUB algorithm.
> + * Maximum available bandwidth for this runqueue. This is
> used to
> + * calculate reclaimable bandwidth for SCHED_FLAG_RECLAIM
> tasks.
> + * By restricting maximum usable bandwidth, we aim to give
> other
> + * tasks on lower classes a chance to run, when competing
> with
> + * SCHED_FLAG_RECLAIM tasks.
> */
> - u64 bw_ratio;
> + u64 max_bw;
> +
> + /*
> + * Active bandwidth of SCHED_FLAG_RECLAIM tasks on this rq.
> + * This will be a subset of running_bw.
> + */
> + u64 reclaim_bw;
> +
> };
>
> #ifdef CONFIG_FAIR_GROUP_SCHED

2023-05-09 19:46:29

by Vineeth Remanan Pillai

Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

Hi Luca,

Thanks for reviewing the changes.

On Tue, May 9, 2023 at 7:25 AM luca abeni <[email protected]> wrote:
>
> Hi,
>
> if I understand well, your patch addresses 2 separate issues:
> 1) The current reclaiming code uses an approximation to avoid using
> div64_u64(), which might introduce too much overhead (at least, as
> far as I remember :). Your patch changes it to use the exact,
> non-approximated, equation
> 2) Currently, the reclaimable CPU time is divided as if all the
> SCHED_DEADLINE tasks (and not only the SCHED_FLAG_RECLAIM tasks)
> could use it; your patch changes the code to distribute the
> reclaimable CPU time only to tasks having the SCHED_FLAG_RECLAIM
> flag set
>
> Is this understanding correct?
Yes, the above two details are correct. In addition to that, I think
the existing equation had a small bug:
the GRUB paper says the running time is depleted as
"dq = -U dt", where U is running_bw.
This is assuming that the whole cpu bandwidth could be reclaimed. But
in our case, we cap at Umax. So the equation should be
"dq = -(U/Umax) dt"

And then we have an upper limit of (1 - Uextra - Uinact). I feel we
should be taking the minimum of these values to make sure that we
don't cross the upper bound. I think the equation should be:
"dq = -min{U/Umax, (1 - Uextra - Uinact)} dt"

But the current implementation is
"dq = -max{u/Umax, (1 - Uextra - Uinact)} dt"
Where u = dl_se->dl_bw.

After fixing the above equation, the reclaim logic works well, but
only when SCHED_FLAG_RECLAIM tasks alone are running. With a mix of
normal deadline and SCHED_FLAG_RECLAIM tasks, it breaks the reclaim
logic. As you pointed out, the second part of the fix is for that.

> If using div64_u64() does not introduce too much overhead, then I agree
> with the first change.
In my testing, I did not see a change in the performance of the
grub_reclaim function. Both old and new implementations take 10 to
20 nanoseconds on average. But my observation might not be accurate.

With this change, it is difficult to avoid division as the denominator
is a variable and we would not be able to pre-calculate an inverse. We
could probably calculate inverse during {__add/__sub}_running_bw so as
to reduce the frequency of div64_u64. I shall try this for v2.

> The second change also looks good to me.
>
> I have no comments on the code, but there is one thing in the comments
> that looks misleading to me (or I am misunderstanding the code or the
> comments):
>

> > + * We can calculate Umax_reclaim as:
> > + * Umax_reclaim: this_bw + Uinact + Ureclaim
>
> I think this looks like a typo (summing this_bw to Uinact
> looks wrong). Should "this_bw" be Uextra?
>
Thanks a lot for pointing it out. Yes you are right, I messed up in the
comments. It should be Uextra and I shall fix it in v2.

> > + * dq = -(Ureclaim / Umax_reclaim) * dt
> > + * = -(Ureclaim / (Ureclaim + Uextra + Uinact)) * dt
>
> I think this should be the correct equation. BTW, since you are summing
> Uextra and Uinact, mabe you could just use "Umax - running_bw"?
>
Makes sense, it will avoid an extra variable Uinact. I shall modify this
in v2.

Thanks,
Vineeth

2023-05-09 20:50:58

by luca abeni

Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

Hi,

On Tue, 9 May 2023 15:29:21 -0400
Vineeth Remanan Pillai <[email protected]> wrote:
[...]
> > Is this understanding correct?
> Yes, the above two details are correct. In addition to that, I think
> the existing equation had a small bug:
> GRUB paper says, running time is depreciated as
> "dq = -U dt" where U is running_bw.
> This is assuming that the whole cpu bandwidth could be reclaimed. But
> in our case, we cap at Umax. So the equation should be
> "dq = -(U/Umax) dt"

Yes, this is the approximation I was mentioning... Instead of using a
division, I approximated it with a different equation using a sum.


> And then we have an upper limit of (1 - Uextra - Uinact). I feel we
> should be taking the minimum of these values to make sure that we
> don't cross the upper bound. I think the equation should be:
> "dq = -min{U/Umax, (1 - Uextra - Uinact)} dt"
>
> But the current implementation is
> "dq = -max{u/Umax, (1 - Uextra - Uinact)} dt"
> Where u = dl_se->dl_bw.

Well, here I think we should really use a "max{}", not a "min{}",
otherwise we risk subtracting an amount of time which is too small (the
"min{}" should be on the reclaimed bandwidth - so that we do not
reclaim too much - but this expression is computing the runtime
decrement - so I think this should be a "max{}").

Or am I misunderstanding something?

Did you try using u/Umax, but without changing the "max{}" into "min{}"?


> After fixing the above equation, reclaim logic works well but when
> only SCHED_FLAG_RECLAIM tasks are running. When we have a mix of both
> normal deadline and SCHED_FLAG_RECLAIM, it breaks the reclaim logic.
> As you pointed out, the second part of the fix is for that.

OK


> > If using div64_u64() does not introduce too much overhead, then I
> > agree with the first change.
> In my testing, I did not see a change in the performance of the
> grub_reclaim function. Both old and new implementations take 10 to
> 20 nanoseconds on average. But my observation might not be accurate.

Or maybe my assumption that div64 is bad was wrong :)
Let's see what other people think about this.


Thanks,
Luca




> With this change, it is difficult to avoid division as the denominator
> is a variable and we would not be able to pre-calculate an inverse. We
> could probably calculate inverse during {__add/__sub}_running_bw so as
> to reduce the frequency of div64_u64. I shall try this for v2.
>
> > The second change also looks good to me.
> >
> > I have no comments on the code, but there is one thing in the
> > comments that looks misleading to me (or I am misunderstanding the
> > code or the comments):
> >
>
> > > + * We can calculate Umax_reclaim as:
> > > + * Umax_reclaim: this_bw + Uinact + Ureclaim
> >
> > I think this looks like a typo (summing this_bw to Uinact
> > looks wrong). Should "this_bw" be Uextra?
> >
> Thanks a lot for pointing it out. Yes you are right, I messed up in
> the comments. It should be Uextra and I shall fix it in v2.
>
> > > + * dq = -(Ureclaim / Umax_reclaim) * dt
> > > + * = -(Ureclaim / (Ureclaim + Uextra + Uinact)) * dt
> >
> > I think this should be the correct equation. BTW, since you are
> > summing Uextra and Uinact, mabe you could just use "Umax -
> > running_bw"?
> Makes sense, it will avoid an extra variable Uinact. I shall modify
> this in v2.
>
> Thanks,
> Vineeth

2023-05-09 21:11:24

by luca abeni

Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

On Tue, 9 May 2023 22:48:29 +0200
luca abeni <[email protected]> wrote:

> Hi,
>
> On Tue, 9 May 2023 15:29:21 -0400
> Vineeth Remanan Pillai <[email protected]> wrote:
> [...]
> > > Is this understanding correct?
> > Yes, the above two details are correct. In addition to that, I think
> > the existing equation had a small bug:
> > GRUB paper says, running time is depreciated as
> > "dq = -U dt" where U is running_bw.
> > This is assuming that the whole cpu bandwidth could be reclaimed.
> > But in our case, we cap at Umax. So the equation should be
> > "dq = -(U/Umax) dt"
>
> Yes, this is the approximation I was mentioning... Instead of using a
> division, I approximated it with a different equation using a sum.

Sorry, ignore this comment (and the following); I misinterpreted the
code (and my old notes).

I do not understand why the "max{}" does not work well, I need to
think about it some more.


Luca

2023-05-10 04:10:27

by Vineeth Remanan Pillai

Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

On Tue, May 9, 2023 at 4:54 PM luca abeni <[email protected]> wrote:

> > Yes, this is the approximation I was mentioning... Instead of using a
> > division, I approximated it with a different equation using a sum.
>
> Sorry, ignore this comment (and the following); I misinterpreted the
> code (and my old notes).
>
> I do not understand why the "max{}" doe not work well, I need to double
> think about it.
>
I was thinking more about this and did some more digging. I was also
wrong about min{}. Giving it some more thought, I think (U/Umax) is
indeed the only equation we need, and it will take care of capping
the reclaiming at Umax. The reason it was not working is the loss of
precision when we did the inverse. I tried replacing
(delta * running_bw * bw_ratio) with
div64_u64(delta * running_bw, Umax) and it worked as expected,
reclaiming only up to Umax with only SCHED_FLAG_RECLAIM tasks
running. As an example, for a task with reservation (1, 100), RT
capacity 95%, and delta = 4ms, we get scaled_delta as:
delta * running_bw * bw_ratio ~= .040000 (roughly)
div64_u64(delta * running_bw, Umax) ~= .04210526 (roughly)

This caused the inverse logic to consume ~99% bw, while the other
one consumed ~95% as expected.

I still could not figure out why min{} works. As you mentioned in
the previous thread, it's the loss of precision that's the culprit,
and I think we only need U/Umax if we have enough precision. This,
along with accounting for both types of tasks, will be the solution.

I will look deeper into any performance issues with using div64_u64
over multiplication and shall let you know soon.

Thanks,
Vineeth

2023-05-10 07:30:12

by luca abeni

Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

Hi,

after thinking more about this (and re-reading the code, and your patch
:), I try to comment again:

On Tue, 9 May 2023 23:53:28 -0400
Vineeth Remanan Pillai <[email protected]> wrote:

> On Tue, May 9, 2023 at 4:54 PM luca abeni
> <[email protected]> wrote:
>
> > > Yes, this is the approximation I was mentioning... Instead of
> > > using a division, I approximated it with a different equation
> > > using a sum.
> >
> > Sorry, ignore this comment (and the following); I misinterpreted the
> > code (and my old notes).
> >
> > I do not understand why the "max{}" doe not work well, I need to
> > double think about it.
> >
> I was thinking more about this and was doing some more digging into
> this. I was also wrong about min{}. Giving it some more thought, I
> think (U/Umax) is indeed the only equation we need and it will take
> care of caping the reclaiming at Umax.

Uhm... I am not sure about this: for a single task on one CPU, yes,
u/Umax does the right thing. But when there are multiple tasks on the
CPU, I think it can cause issues (because every task ends up trying to
execute for Umax).

The "max{}" comes from the original multi-processor GRUB algorithm:
https://arxiv.org/pdf/1512.01984.pdf (see Equation (13) - in that
equation, the part we call u_extra is included in Uinact[p])

the "1 - u_inact - u_extra" part is needed to make sure that the
real-time guarantees are not broken by the reclaiming mechanism... But
it can end up with a task trying to consume too much time on a single
CPU, hence the "u/Umax" term in the "max{}" is needed to make sure that
a task will not consume more than Umax of a CPU.

Now, if we have one single task on a CPU u/Umax will always be larger
than the other term... But when we have multiple tasks the other term
is needed too.


> The reason why it was not
> working is because of the loss of precision when we did the inverse.

I agree


> I tried replacing (delta * running_bw * bw_ratio) by
> div64_u64(delta * running_bw, Umax) and it worked as expected and
> reclaimed only till Umax with only SCHED_FLAG_RECLAIM tasks. As an
> example a task with reservation (1, 100) and RT capacity 95%, and
> delta = 4ms, we get scaled_delta as
> delta * running_bw * bw_ratio ~= .040000 (roughly)
> div64_u64(delta * running_bw, Umax) ~= .04210526 (roughly)
>
> This caused the inverse logic to consume ~99% bw, while the other
> one consumed ~95% as expected.
>
> I still could not figure out why min{} works. As you mentioned in
> the previous thread, its the loss of precision thats the culprit and
> I think we only need U/Umax if we have enough precision. This along
> with accounting for both type of tasks will be the solution.

I agree with this.


(BTW, when considering multiple tasks on multiple CPUs, another
potential problem is given by u_extra... Now that I remember all the
details, u_extra is not "Umax - this_bw" - this is true when we consider
only one CPU, but it is "Umax - sum(u_i)/m" (where "sum(u_i)" is the
sum of the bandwidths of all the SCHED_DEADLINE tasks in the root
domain, and "m" is the number of CPUs in the root domain)... So, the
reclaimable CPU time is distributed uniformly on all the CPUs and this
could create some issues. But let's see what happens after the div64
fix and the SCHED_FLAG_RECLAIM fix)


Thanks,
Luca

>
> I will look deeper into any performance issues with using div64_u64
> over multiplication and shall let you know soon.
>
> Thanks,
> Vineeth


2023-05-10 08:45:59

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH 2/2] Documentation: sched/deadline: Update GRUB description

On 5/8/23 23:08, Vineeth Pillai wrote:
> diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst
> index 9d9be52f221a..0c73f07f712d 100644
> --- a/Documentation/scheduler/sched-deadline.rst
> +++ b/Documentation/scheduler/sched-deadline.rst
> @@ -195,11 +195,15 @@ Deadline Task Scheduling
> its utilization is added to the active utilization of the runqueue where
> it has been enqueued.
>
> - For each runqueue, the algorithm GRUB keeps track of two different bandwidths:
> + For each runqueue, the algorithm GRUB keeps track of three different bandwidths:
>
> - Active bandwidth (running_bw): this is the sum of the bandwidths of all
> tasks in active state (i.e., ActiveContending or ActiveNonContending);
>
> + - Active bandwidth of SCHED_FLAG_RECLAIM tasks (reclaim_bw): this is the sum of
> + bandwidths of all tasks in active state which participate in GRUB. This is a
> + subset of running_bw and is needed for reclaimable bandwidth calculation.
> +
> - Total bandwidth (this_bw): this is the sum of all tasks "belonging" to the
> runqueue, including the tasks in Inactive state.
>
> @@ -209,12 +213,12 @@ Deadline Task Scheduling
> to
>
> dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt
> + dq = -(Ureclaim / (Uextra + Uinact + Ureclaim)) dt
>
> where:
>
> - - Ui is the bandwidth of task Ti;
> - - Umax is the maximum reclaimable utilization (subjected to RT throttling
> - limits);
> + - Ureclaim is the (per runqueue) bandwidth of all SCHED_FLAG_RECLAIM tasks
> + in active state;
> - Uinact is the (per runqueue) inactive utilization, computed as
> (this_bq - running_bw);
> - Uextra is the (per runqueue) extra reclaimable utilization
> @@ -222,7 +226,8 @@ Deadline Task Scheduling
>
>
> Let's now see a trivial example of two deadline tasks with runtime equal
> - to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::
> + to 4 and period equal to 8 (i.e., bandwidth equal to 0.5). Tasks are
> + allowed to use the whole cpu(Umax = 1)::
>
> A Task T1
> |
> @@ -244,7 +249,7 @@ Deadline Task Scheduling
> 0 1 2 3 4 5 6 7 8
>
>
> - A running_bw
> + A reclaim_bw
> |
> 1 ----------------- ------
> | | |
> @@ -272,7 +277,7 @@ Deadline Task Scheduling
>
> This is the 0-lag time for Task T1. Since it didn't woken up in the
> meantime, it enters the Inactive state. Its bandwidth is removed from
> - running_bw.
> + running_bw and reclaim_bw.
> Task T2 continues its execution. However, its runtime is now decreased as
> dq = - 0.5 dt because Uinact = 0.5.
> Task T2 therefore reclaims the bandwidth unused by Task T1.
> @@ -280,7 +285,7 @@ Deadline Task Scheduling
> - Time t = 8:
>
> Task T1 wakes up. It enters the ActiveContending state again, and the
> - running_bw is incremented.
> + running_bw and reclaim_bw are incremented.
>
>
> 2.3 Energy-aware scheduling

LGTM, thanks!

Reviewed-by: Bagas Sanjaya <[email protected]>

--
An old man doll... just what I always wanted! - Clara


2023-05-10 16:05:03

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

Hi Luca,

On Wed, May 10, 2023 at 3:07 AM luca abeni <[email protected]> wrote:
>
> > I was thinking more about this and was doing some more digging into
> > this. I was also wrong about min{}. Giving it some more thought, I
> > think (U/Umax) is indeed the only equation we need and it will take
> > care of caping the reclaiming at Umax.
>
> Uhm... I am not sure about this: for 1 single task on 1 CPU, yes,
> u/Umax does the right thing. But when there are multiple tasks on the
> CPU, I think it can cause issues (because every task ends up trying to
> execute for Umax).
>
Ah ok, I understand now. I was going by the single-CPU equation of
U/Umax, where this would not have been an issue since U (running_bw)
takes care of the adjustment and does not allow a single task to use
the whole of Umax. I read the SMP GRUB paper you shared and see that
it uses u_i/Umax, which might cause issues as you mentioned.

> The "max{}" comes from the original multi-processor GRUB algorithm:
> https://arxiv.org/pdf/1512.01984.pdf (see Equation (13) - in that
> equation, the part we call u_extra is included in Uinact[p])
>
Thanks for sharing. I read the paper and got an overall idea and
background of the existing implementation.

> the "1 - u_inact - u_extra" part is needed to make sure that the
> real-time guarantees are not broken by the reclaiming mechanism... But
> it can end up with a task trying to consume too much time on a single
> CPU, hence the "u/Umax" term in the "max{}" is needed to make sure that
> a task will not consume more than Umax of a CPU.
>
> Now, if we have one single task on a CPU u/Umax will always be larger
> than the other term... But when we have multiple tasks the other term
> is needed too.
>
Understood, thanks for explaining.

> (BTW, when considering multiple tasks on multiple CPUs, another
> potential problem is given by u_extra... Now that I remember all the
> details, u_extra is not "Umax - this_bw" - this is true when we consider
> only one CPU, but is is "Umax - sum(u_i)/m" (where "sum(u_i)" is the
> sum of the bandwidths of all the SCHED_DEADLINE tasks in the root
> domain, and "m" is the number of CPUs in the root domain)... So, the
> reclaimable CPU time is distributed uniformly on all the CPUs and this
> could create some issues. But let's see what happens after the div64
> fix and the SCHED_FLAG_RECLAIM fix)
>
This makes sense. This also means that we wouldn't be able to replace
"Uextra + Uinact" with "Umax - running_bw", which explains the problems
I was seeing in SMP testing. So I shall revert to "Uextra + Uinact" in
v2. And I think the potential issue with Uextra would be avoided by the
check for Uextra + Uinact > Umax, to make sure that we don't reclaim
more than Umax for a single cpu.

I have tested the patch with SMP using the stressor mentioned in the
commit message and running cyclicdeadline in parallel. The results
are similar to upstream and GRUB is able to reclaim up to Umax now.

I shall send the v2 soon after a bit more testing. Thanks a lot for
all the valuable inputs and detailed explanation :-)

Thanks,
Vineeth

2023-05-11 07:46:02

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

Hi,

On Wed, 10 May 2023 11:50:00 -0400
Vineeth Remanan Pillai <[email protected]> wrote:
[...]
> > the "1 - u_inact - u_extra" part is needed to make sure that the
> > real-time guarantees are not broken by the reclaiming mechanism...
> > But it can end up with a task trying to consume too much time on a
> > single CPU, hence the "u/Umax" term in the "max{}" is needed to
> > make sure that a task will not consume more than Umax of a CPU.
> >
> > Now, if we have one single task on a CPU u/Umax will always be
> > larger than the other term... But when we have multiple tasks the
> > other term is needed too.
> >
> Understood, thanks for explaining.
>
> > (BTW, when considering multiple tasks on multiple CPUs, another
> > potential problem is given by u_extra... Now that I remember all the
> > details, u_extra is not "Umax - this_bw" - this is true when we
> > consider only one CPU, but it is "Umax - sum(u_i)/m" (where
> > "sum(u_i)" is the sum of the bandwidths of all the SCHED_DEADLINE
> > tasks in the root domain, and "m" is the number of CPUs in the root
> > domain)... So, the reclaimable CPU time is distributed uniformly on
> > all the CPUs and this could create some issues. But let's see what
> > happens after the div64 fix and the SCHED_FLAG_RECLAIM fix)
> >
> This makes sense. This also means that we wouldn't be able to replace
> "Uextra + Uinact" with "Umax - running_bw"

Right. When I suggested it, I was mistaken (I probably mis-read some
comments, and I did not remember how u_extra is exactly computed)


> and I was seeing problems
> with SMP testing. So I shall revert to "Uextra + Uinact" in v2. And I
> think the potential issue with Uextra would be avoided by the check
> for Uextra + Uinact > Umax to make sure that we don't reclaim more
> than Umax for a single cpu.
>
> I have tested the patch with SMP using the stressor mentioned in the
> commit message and running cyclicdeadline in parallel. The results
> are similar to upstream and GRUB able to reclaim upto Umax now.
>
> I shall send the v2 soon after a bit more testing..

I've just seen v2, and (unless I misunderstand something) I see you
removed the max{u_i/u_max, 1 - (u_inact + u_extra)} thing?

I fear this might break the real-time guarantees provided by the
algorithm...


> Thanks a lot for all the valuable inputs and detailed explanation :-)

And thank you for addressing this issue and listening to me :)



Thanks,
Luca

2023-05-11 18:53:32

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

Hi Luca,

On Thu, May 11, 2023 at 3:37 AM luca abeni <[email protected]> wrote:
>
> I've just seen v2, and (unless I misunderstand something) I see you
> removed the max{u_i/u_max, 1 - (u_inact + u_extra)} thing?
>
> I fear this might break the real-time guarantees provided by the
> algorithm...
>
I am sorry I missed sending more details before sending out v2. So, I
think there is another bug in the existing implementation. Let me try
to explain the details.

SMP GRUB paper has the equation for depleting the runtime as:
dq_i = -max{u_i, 1 - (extra_bw + Uinact)} dt

Since we are capping at Umax, the equation would be
dq_i = -(max{u_i, Umax - (extra_bw + Uinact)} / Umax) dt (1)

But the existing implementation is:
dq_i = -max{u_i/Umax, 1 - (extra_bw + Uinact)} dt (2)

Here in (2), we factored Umax only into the first term "u_i"; the
second term in max{} was left as is. What do you think?

Now with normal DL and SCHED_FLAG_RECLAIM tasks, equation (1) can be
re-written as:
dq_i =
-(max{u_i, Ureclaim_max - (extra_bw + Uinact)}/Ureclaim_max)dt (3)

I tested this equation (3) and it works as expected. What do you think
about the correctness of equation (3)?

I felt that, since we are using the sequential reclaim mentioned in
the paper and we isolate all parameters per-cpu (except for extra_bw),
we could use the "dq = -(U/Umax) dt" equation as it was simpler than
equation (3). Sorry that I missed discussing this. I shall send out
v3 with equation (3), if you think it's the right way to go to enforce
deadline guarantees.

Thanks,
Vineeth

2023-05-11 20:30:33

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

Hi,

first of all, thanks for your patience with my comments :)

On Thu, 11 May 2023 14:34:38 -0400
Vineeth Remanan Pillai <[email protected]> wrote:
[...]
> SMP GRUB paper has the equation for depleting the runtime as:
> dq_i = -max{u_i, 1 - (extra_bw + Uinact)} dt
>
> Since we are capping at Umax, the equation would be
> dq_i = -(max{u_i, Umax - (extra_bw + Uinact)} / Umax) dt (1)
>
> But the existing implementation is:
> dq_i = -max{u_i/Umax, 1 - (extra_bw + Uinact)} dt (2)
>
> Here in (2), we factored Umax only into the first term "u_i"; the
> second term in max{} was left as is. What do you think?

I agree with you, (1) looks more correct. I do not know why I
implemented (2), but I agree with (1) now.


> Now with normal DL and SCHED_FLAG_RECLAIM tasks, equation (1) can be
> re-written as:
> dq_i =
> -(max{u_i, Ureclaim_max - (extra_bw + Uinact)}/Ureclaim_max)dt (3)
>
> I tested this equation (3) and it works as expected. What do you think
> about the correctness of equation (3)?

I agree with this too.

>
> I felt that, since we are using the sequential reclaim mentioned in
> the paper and we isolate all parameters per-cpu (except for extra_bw),
> we could use the "dq = -(U/Umax) dt" equation as it was simpler than
> equation (3).

This is the part I am not sure about...

Maybe the best way to go is to split the patch: first you implement (1)
(and use div64 to remove the approximation I used), then you implement
(3) in a second patch.

Finally, if removing the max{} is really needed you can do it in a
third patch (but I would try to go with Equation 3 before removing the
max{})


Thanks,
Luca

2023-05-11 21:06:11

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

> >
> > I felt that, since we are using the sequential reclaim mentioned in
> > the paper and we isolate all parameters per-cpu (except for extra_bw),
> > we could use the "dq = -(U/Umax) dt" equation as it was simpler than
> > equation (3).
>
> This is the part I am not sure about...
>
> Maybe the best way to go is to split the patch: first you implement (1)
> (and use div64 to remove the approximation I used), then you implement
> (3) in a second patch.
>
> Finally, if removing the max{} is really needed you can do it in a
> third patch (but I would try to go with Equation 3 before removing the
> max{})
>
Sure, I shall split the patch. Joel also suggested splitting the patch
and I was probably wrong to think that the patch was simple enough to
be a single patch :-).

Since equation (3) has theoretical backing, I am perfectly fine with
using it for our fix. Will have 2 patches as you suggested.

Will get the v3 out soon..

Thanks,
Vineeth

2023-05-15 03:05:46

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

Hi Luca,

> Sure, I shall split the patch. Joel also suggested splitting the patch
> and I was probably wrong to think that the patch was simple enough to
> be a single patch :-).
>
> Since equation (3) has theoretical backing, I am perfectly fine with
> using it for our fix. Will have 2 patches as you suggested.
>
> Will get the v3 out soon..
>
Sorry for the delay. I was testing the fixes and noticed one other
issue and was working on a fix.

This is with the extra_bw for tracking the unused bandwidth in the
system. As you have shown me:
"extra_bw = Umax - ((Sum of bw of all tasks) / #cpus)"
But I noticed that tracking the extra_bw globally, while the rest of
the values are per-cpu, causes issues with reclaiming, especially when
we have tasks with small bandwidth. This is because GRUB tries to
divide the unused bandwidth proportionately, based on the running bw
on a cpu. So even if there is only a single task on a cpu, it reclaims
less if its bandwidth is less.

I think removing the extra_bw and tracking unused bandwidth using
"max_bw - this_bw" should be okay. Since admission control will
guarantee that we don't admit more than the capacity, we should be
good with allowing tasks on a cpu to use max_bw while reclaiming.

With the above theory, the code becomes simpler and is the same as the
v2 patch:
max{u_i, (max_bw - Uinact - Uextra)}
= max{u_i, (max_bw - (this_bw - running_bw) - (max_bw - this_bw))}
= max{u_i, running_bw} = running_bw (since u_i <= running_bw)
So, "dq = -(running_bw / max_bw) dt"

v2 had passed all my tests. But having Uextra broke a couple of those
when it came to multiprocessors, and after identifying the root cause
and re-writing the equation, everything works now and passes all my
tests. I have more confidence in the above equation as it's derived
from the SMP GRUB rule using our max_bw.

Please have a look and let me know what you think about this.

I have the v3 ready with patches split into 5(including doc patch).
I shall post v3 soon after this so that you can see the code changes
as well to have a better look.

Thanks,
Vineeth