2023-05-15 03:04:04

by Vineeth Remanan Pillai

Subject: [PATCH v3 0/5] GRUB reclaiming fixes

The current GRUB reclaim calculation is somewhat inaccurate, and the
inaccuracy grows as the bandwidth of the tasks becomes smaller.

There are a couple of issues with the existing implementation:
- Loss of precision in division due to rounding off.
- A minor bug in the reclaim calculation.
- In SMP, a task with a smaller reservation tends to reclaim less
even if it is the only task on a cpu.
- Inaccuracy when normal deadline tasks and SCHED_FLAG_RECLAIM
tasks share the cpu.

This patch series aims to fix the above mentioned issues.

Changes in v3
-------------
1. Split the fixes into multiple self-contained patches.
2. Added a cover letter.

Vineeth Pillai (5):
sched/deadline: Fix bandwidth reclaim equation in GRUB
sched/deadline: Fix reclaim inaccuracy with SMP
sched/deadline: Remove unused variable extra_bw
sched/deadline: Account for normal deadline tasks in GRUB
Documentation: sched/deadline: Update GRUB description

Documentation/scheduler/sched-deadline.rst | 28 ++--
kernel/sched/deadline.c | 160 ++++++++++-----------
kernel/sched/sched.h | 18 ++-
3 files changed, 112 insertions(+), 94 deletions(-)

--
2.40.1



2023-05-15 03:04:55

by Vineeth Remanan Pillai

Subject: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

In a multi-processor system, bandwidth usage is divided equally among
all cpus. This causes issues with reclaiming free bandwidth on a cpu:
"Uextra" is the same on all cpus in a root domain, while running_bw
differs based on the reserved bandwidth of the tasks running on each
cpu. This causes disproportionate reclaiming - a task with a smaller
bandwidth reclaims less even if it is the only task running on that
cpu.

Following is a small test with three tasks with reservations (8,10),
(1,10) and (1,100). These three tasks run on different cpus. But
since the reclamation logic calculates the available bandwidth as a
factor of the globally available bandwidth, a task with a smaller
bandwidth reclaims only a little compared to one with a higher
bandwidth, even if its cpu has free bandwidth available to be
reclaimed.

TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16

Fix: use the available bandwidth on each cpu to calculate the
reclaimable bandwidth. Admission control takes care of the total
bandwidth, and hence using the available bandwidth on a specific cpu
does not break the deadline guarantees.

With this fix, the above test behaves as follows:
TID[586]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.24
TID[585]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 95.01
TID[584]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.01
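
As a worked instance of the new rule (my arithmetic, using the test
above): task (1,10) alone on a cpu gives running_bw = 0.1 with
max_bw = 0.95, so

dq = -(running_bw / max_bw) dt = -(0.1 / 0.95) dt ~= -0.105 dt

The 1ms budget is thus depleted at ~10.5% of the wall-clock rate and
lasts ~9.5ms of every 10ms period, i.e. ~95% utilization, matching
TID[585] above.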

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
kernel/sched/deadline.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 91451c1c7e52..85902c4c484b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1272,7 +1272,7 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
* Umax: Max usable bandwidth for DL. Currently
* = sched_rt_runtime_us / sched_rt_period_us
* Uextra: Extra bandwidth not reserved:
- * = Umax - \Sum(u_i / #cpus in the root domain)
+ * = Umax - this_bw
* u_i: Bandwidth of an admitted dl task in the
* root domain.
*
@@ -1286,22 +1286,14 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
*/
static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
{
- u64 u_act;
- u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
-
/*
- * Instead of computing max{u, (rq->dl.max_bw - u_inact - u_extra)},
- * we compare u_inact + rq->dl.extra_bw with
- * rq->dl.max_bw - u, because u_inact + rq->dl.extra_bw can be larger
- * than rq->dl.max_bw (so, rq->dl.max_bw - u_inact - rq->dl.extra_bw
- * would be negative leading to wrong results)
+ * max{u, Umax - Uinact - Uextra}
+ * = max{u, max_bw - (this_bw - running_bw) - (max_bw - this_bw)}
+ * = max{u, running_bw} = running_bw
+ * So dq = -(max{u, Umax - Uinact - Uextra} / Umax) dt
+ * = -(running_bw / max_bw) dt
*/
- if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
- u_act = dl_se->dl_bw;
- else
- u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;
-
- return div64_u64(delta * u_act, rq->dl.max_bw);
+ return div64_u64(delta * rq->dl.running_bw, rq->dl.max_bw);
}

/*
--
2.40.1


2023-05-15 03:05:44

by Vineeth Remanan Pillai

Subject: [PATCH v3 4/5] sched/deadline: Account for normal deadline tasks in GRUB

The GRUB algorithm assumes that all tasks participate in the reclaim.
So when there is a mix of normal deadline and SCHED_FLAG_RECLAIM tasks,
reclaiming the unused bandwidth is not accurate.

Running two deadline tasks on a cpu, where one is SCHED_FLAG_RECLAIM
and the other is a normal deadline task, we see the following
utilization:

Task 1(Normal DL): (5, 10), Task 2(SCHED_FLAG_RECLAIM): (1, 10)
TID[673]: RECLAIM=0, (r=5ms, d=10ms, p=10ms), Util: 50.11
TID[672]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 15.93
TID[673]: RECLAIM=0, (r=5ms, d=10ms, p=10ms), Util: 50.01
TID[672]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 15.83

The GRUB rule says the runtime is depleted as:
"dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt"
where Umax is the maximum allowed bandwidth for DL tasks,
Uinact is the inactive utilization of the runqueue, and
Uextra is the free bandwidth available for reclaim.

To account for a mix of normal deadline and SCHED_FLAG_RECLAIM tasks
running together, we do not consider the bandwidth of normal tasks in
the equation. So the equation becomes:
"dq = -(max{u, (Umax_reclaim - Uinact - Uextra)} / Umax_reclaim) dt"

"Umax_reclaim" is the maximum allowed bandwidth for SCHED_FLAG_RECLAIM
tasks. When only SCHED_FLAG_RECLAIM tasks are running,
"Umax_reclaim = Umax". Otherwise:
"Umax_reclaim = Umax - running_bw + Ureclaim"
where Ureclaim is the total bandwidth of SCHED_FLAG_RECLAIM tasks in
active state for this runqueue.
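
As a worked example (my arithmetic, using the test above): with both
tasks active on one cpu, running_bw = 0.5 + 0.1 = 0.6, Ureclaim = 0.1
and Umax = 0.95, so

"Umax_reclaim = 0.95 - 0.6 + 0.1 = 0.45"

The reclaiming task's runtime then depletes as
dq = -(0.1 / 0.45) dt ~= -0.222 dt, so its 1ms budget lasts ~4.5ms of
every 10ms period, i.e. ~45% utilization, which matches the results
below.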

With this fix, the results of the above test are as follows:
Task 1(Normal DL): (5, 10), Task 2(SCHED_FLAG_RECLAIM): (1, 10)
TID[591]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 45.11
TID[592]: RECLAIM=0, (r=5ms, d=10ms, p=10ms), Util: 50.18
TID[591]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 44.99
TID[592]: RECLAIM=0, (r=5ms, d=10ms, p=10ms), Util: 49.88

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
kernel/sched/deadline.c | 53 ++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 11 +++++++++
2 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 67c1138df43a..66a1b9365429 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -206,11 +206,13 @@ __dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
}

static inline
-void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq, bool reclaim_bw_se)
{
u64 old = dl_rq->running_bw;

lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
+ if (reclaim_bw_se)
+ dl_rq->reclaim_bw += dl_bw;
dl_rq->running_bw += dl_bw;
SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -219,15 +221,19 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
}

static inline
-void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq, bool reclaim_bw_se)
{
u64 old = dl_rq->running_bw;

lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
+ if (reclaim_bw_se)
+ dl_rq->reclaim_bw -= dl_bw;
dl_rq->running_bw -= dl_bw;
SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
- if (dl_rq->running_bw > old)
+ if (dl_rq->running_bw > old) {
+ dl_rq->reclaim_bw = 0;
dl_rq->running_bw = 0;
+ }
/* kick cpufreq (see the comment in kernel/sched/sched.h). */
cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
}
@@ -273,14 +279,14 @@ static inline
void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
if (!dl_entity_is_special(dl_se))
- __add_running_bw(dl_se->dl_bw, dl_rq);
+ __add_running_bw(dl_se->dl_bw, dl_rq, dl_entity_is_reclaim(dl_se));
}

static inline
void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
if (!dl_entity_is_special(dl_se))
- __sub_running_bw(dl_se->dl_bw, dl_rq);
+ __sub_running_bw(dl_se->dl_bw, dl_rq, dl_entity_is_reclaim(dl_se));
}

static void dl_change_utilization(struct task_struct *p, u64 new_bw)
@@ -499,6 +505,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
#endif

dl_rq->running_bw = 0;
+ dl_rq->reclaim_bw = 0;
dl_rq->this_bw = 0;
init_dl_rq_bw(dl_rq);
}
@@ -1257,20 +1264,44 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
* but only a portion of it denoted by "Umax". So the equation becomes:
* "dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt",
*
+ * To account for the fact that we have a mix of normal deadline tasks and
+ * SCHED_FLAG_RECLAIM tasks running together, we do not consider the bandwidth
+ * of normal tasks in the equation. So the equation becomes:
+ * "dq = -(max{u, (Umax_reclaim - Uinact - Uextra)} / Umax_reclaim) dt",
+ * where
+ * Umax_reclaim: Maximum reclaimable bandwidth for this rq.
+ *
+ * We can calculate Umax_reclaim as:
+ * "Umax_reclaim = Uextra + Uinact + Ureclaim"
+ * where:
+ * Ureclaim: Total bandwidth of SCHED_FLAG_RECLAIM tasks in active
+ * state for this rq.
+ *
* Since delta is a 64 bit variable, to have an overflow its value
* should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
* So, overflow is not an issue here.
*/
static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
{
+ u64 u_max_reclaim;
+
+ /*
+ * In SMP, max_bw can be less than running_bw without violating the global
+ * bandwidth limits. If that's the case, we should not reclaim.
+ */
+ if (rq->dl.max_bw < rq->dl.running_bw)
+ return delta;
+
+ u_max_reclaim = rq->dl.max_bw - rq->dl.running_bw + rq->dl.reclaim_bw;
+
/*
- * max{u, Umax - Uinact - Uextra}
- * = max{u, max_bw - (this_bw - running_bw) - (max_bw - this_bw)}
- * = max{u, running_bw} = running_bw
- * So dq = -(max{u, Umax - Uinact - Uextra} / Umax) dt
- * = -(running_bw / max_bw) dt
+ * max{u, Umax_reclaim - Uinact - Uextra}
+ * = max{u, Uextra + Uinact + Ureclaim - Uinact - Uextra}
+ * = max{u, Ureclaim} = Ureclaim = reclaim_bw
+ * So dq = -(max{u, Umax_reclaim - Uinact - Uextra} / Umax_reclaim) dt
+ * = -(reclaim_bw / Umax_reclaim) dt
*/
- return div64_u64(delta * rq->dl.running_bw, rq->dl.max_bw);
+ return div64_u64(delta * rq->dl.reclaim_bw, u_max_reclaim);
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 33db99756624..a6cb891835da 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -257,6 +257,11 @@ static inline bool dl_entity_is_special(const struct sched_dl_entity *dl_se)
#endif
}

+static inline bool dl_entity_is_reclaim(const struct sched_dl_entity *dl_se)
+{
+ return dl_se->flags & SCHED_FLAG_RECLAIM;
+}
+
/*
* Tells if entity @a should preempt entity @b.
*/
@@ -741,6 +746,12 @@ struct dl_rq {
*/
u64 running_bw;

+ /*
+ * Active bandwidth of SCHED_FLAG_RECLAIM tasks on this rq.
+ * This will be a subset of running_bw.
+ */
+ u64 reclaim_bw;
+
/*
* Utilization of the tasks "assigned" to this runqueue (including
* the tasks that are in runqueue and the tasks that executed on this
--
2.40.1


2023-05-15 03:06:49

by Vineeth Remanan Pillai

Subject: [PATCH v3 3/5] sched/deadline: Remove unused variable extra_bw

Since extra_bw is no longer used by GRUB after the previous patch,
remove the variable and the code that maintained it.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
kernel/sched/deadline.c | 53 ++++++++++++-----------------------------
kernel/sched/sched.h | 1 -
2 files changed, 15 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 85902c4c484b..67c1138df43a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -163,20 +163,6 @@ static inline bool dl_bw_visited(int cpu, u64 gen)
return false;
}

-static inline
-void __dl_update(struct dl_bw *dl_b, s64 bw)
-{
- struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
- int i;
-
- RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
- "sched RCU must be held");
- for_each_cpu_and(i, rd->span, cpu_active_mask) {
- struct rq *rq = cpu_rq(i);
-
- rq->dl.extra_bw += bw;
- }
-}
#else
static inline struct dl_bw *dl_bw_of(int i)
{
@@ -198,27 +184,18 @@ static inline bool dl_bw_visited(int cpu, u64 gen)
return false;
}

-static inline
-void __dl_update(struct dl_bw *dl_b, s64 bw)
-{
- struct dl_rq *dl = container_of(dl_b, struct dl_rq, dl_bw);
-
- dl->extra_bw += bw;
-}
#endif

static inline
-void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
+void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw)
{
dl_b->total_bw -= tsk_bw;
- __dl_update(dl_b, (s32)tsk_bw / cpus);
}

static inline
-void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
{
dl_b->total_bw += tsk_bw;
- __dl_update(dl_b, -((s32)tsk_bw / cpus));
}

static inline bool
@@ -430,7 +407,7 @@ static void task_non_contending(struct task_struct *p)
if (READ_ONCE(p->__state) == TASK_DEAD)
sub_rq_bw(&p->dl, &rq->dl);
raw_spin_lock(&dl_b->lock);
- __dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+ __dl_sub(dl_b, p->dl.dl_bw);
raw_spin_unlock(&dl_b->lock);
__dl_clear_params(p);
}
@@ -721,12 +698,12 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
*/
dl_b = &rq->rd->dl_bw;
raw_spin_lock(&dl_b->lock);
- __dl_sub(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+ __dl_sub(dl_b, p->dl.dl_bw);
raw_spin_unlock(&dl_b->lock);

dl_b = &later_rq->rd->dl_bw;
raw_spin_lock(&dl_b->lock);
- __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
+ __dl_add(dl_b, p->dl.dl_bw);
raw_spin_unlock(&dl_b->lock);

set_task_cpu(p, later_rq->cpu);
@@ -1425,7 +1402,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
}

raw_spin_lock(&dl_b->lock);
- __dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+ __dl_sub(dl_b, p->dl.dl_bw);
raw_spin_unlock(&dl_b->lock);
__dl_clear_params(p);

@@ -2506,7 +2483,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
* until we complete the update.
*/
raw_spin_lock(&src_dl_b->lock);
- __dl_sub(src_dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+ __dl_sub(src_dl_b, p->dl.dl_bw);
raw_spin_unlock(&src_dl_b->lock);
}

@@ -2560,7 +2537,7 @@ void dl_add_task_root_domain(struct task_struct *p)
dl_b = &rq->rd->dl_bw;
raw_spin_lock(&dl_b->lock);

- __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+ __dl_add(dl_b, p->dl.dl_bw);

raw_spin_unlock(&dl_b->lock);

@@ -2779,9 +2756,9 @@ int sched_dl_global_validate(void)
static void init_dl_rq_bw(struct dl_rq *dl_rq)
{
if (global_rt_runtime() == RUNTIME_INF)
- dl_rq->max_bw = dl_rq->extra_bw = 1 << BW_SHIFT;
+ dl_rq->max_bw = 1 << BW_SHIFT;
else
- dl_rq->max_bw = dl_rq->extra_bw = to_ratio(global_rt_period(),
+ dl_rq->max_bw = to_ratio(global_rt_period(),
global_rt_runtime());
}

@@ -2852,8 +2829,8 @@ int sched_dl_overflow(struct task_struct *p, int policy,
if (dl_policy(policy) && !task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cap, 0, new_bw)) {
if (hrtimer_active(&p->dl.inactive_timer))
- __dl_sub(dl_b, p->dl.dl_bw, cpus);
- __dl_add(dl_b, new_bw, cpus);
+ __dl_sub(dl_b, p->dl.dl_bw);
+ __dl_add(dl_b, new_bw);
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cap, p->dl.dl_bw, new_bw)) {
@@ -2864,8 +2841,8 @@ int sched_dl_overflow(struct task_struct *p, int policy,
* But this would require to set the task's "inactive
* timer" when the task is not inactive.
*/
- __dl_sub(dl_b, p->dl.dl_bw, cpus);
- __dl_add(dl_b, new_bw, cpus);
+ __dl_sub(dl_b, p->dl.dl_bw);
+ __dl_add(dl_b, new_bw);
dl_change_utilization(p, new_bw);
err = 0;
} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
@@ -3044,7 +3021,7 @@ int dl_cpu_busy(int cpu, struct task_struct *p)
* We will free resources in the source root_domain
* later on (see set_cpus_allowed_dl()).
*/
- __dl_add(dl_b, p->dl.dl_bw, dl_bw_cpus(cpu));
+ __dl_add(dl_b, p->dl.dl_bw);
}

raw_spin_unlock_irqrestore(&dl_b->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bc7ae9ad349..33db99756624 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -751,7 +751,6 @@ struct dl_rq {
* runqueue (inactive utilization = this_bw - running_bw).
*/
u64 this_bw;
- u64 extra_bw;

/*
* Maximum available bandwidth for deadline tasks of this rq. This is
--
2.40.1


2023-05-15 03:07:51

by Vineeth Remanan Pillai

Subject: [PATCH v3 1/5] sched/deadline: Fix bandwidth reclaim equation in GRUB

According to the GRUB[1] rule, the runtime is depleted as:
"dq = -max{u, (1 - Uinact - Uextra)} dt" (1)

To guarantee that deadline tasks don't starve lower-class tasks, we
do not allocate the full bandwidth of the cpu to deadline tasks. The
maximum bandwidth usable by deadline tasks is denoted by "Umax".
Considering Umax, equation (1) becomes:
"dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt" (2)

The current implementation has a minor bug in equation (2). This patch
fixes the bug and also fixes the precision issue by using div64_u64.
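
As a worked instance of equation (2) (my arithmetic): for a single
reclaiming task with u = 0.7 on an isolated cpu with Umax = 0.95,
assuming Uinact = 0 and Uextra = Umax - u = 0.25,

dq = -(max{0.7, 0.95 - 0 - 0.25} / 0.95) dt = -(0.7 / 0.95) dt

so the 7ms budget stretches to 7 / (0.7 / 0.95) = 9.5ms of every 10ms
period, i.e. the 95% cap that RUN 1 below reaches.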

The reclamation logic is verified by a sample program which creates
multiple deadline threads and observes their utilization. The tests
were run on an isolated cpu (isolcpus=3) of a 4 cpu system.

Tests on 6.3.0
==============

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.33
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.29

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.70

RUN 3: 2 tasks
Task 1: runtime=1ms, deadline=period=10ms
Task 2: runtime=1ms, deadline=period=100ms
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.67
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.37
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.38
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.19
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.60
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.23
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.43

As seen above, reclaiming does not reach the maximum allowed
bandwidth, and as the bandwidth of the tasks gets smaller, the
reclaimed bandwidth also comes down.

Tests with this patch applied
=============================

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[667]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.01
TID[667]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.00

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[641]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 94.86
TID[641]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.06

RUN 3: 2 tasks
Task 1: runtime=1ms, deadline=period=10ms
Task 2: runtime=1ms, deadline=period=100ms
TID[636]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.44
TID[637]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.67
TID[636]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.34
TID[637]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.61

Running tasks on all cpus while allowing migration also showed that
the utilization is reclaimed to the maximum. Running 10
SCHED_FLAG_RECLAIM tasks on 3 cpus, top shows:
%Cpu0 : 94.6 us, 0.0 sy, 0.0 ni, 5.4 id, 0.0 wa
%Cpu1 : 95.2 us, 0.0 sy, 0.0 ni, 4.8 id, 0.0 wa
%Cpu2 : 95.8 us, 0.0 sy, 0.0 ni, 4.2 id, 0.0 wa

[1]: Abeni, Luca & Lipari, Giuseppe & Parri, Andrea & Sun, Youcheng.
(2015). Parallel and sequential reclaiming in multicore
real-time global scheduling.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
kernel/sched/deadline.c | 72 ++++++++++++++++++++---------------------
kernel/sched/sched.h | 6 ++--
2 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 71b24371a6f7..91451c1c7e52 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -487,7 +487,7 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
return rb_first_cached(&dl_rq->root) == &dl_se->rb_node;
}

-static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq);
+static void init_dl_rq_bw(struct dl_rq *dl_rq);

void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
{
@@ -523,7 +523,7 @@ void init_dl_rq(struct dl_rq *dl_rq)

dl_rq->running_bw = 0;
dl_rq->this_bw = 0;
- init_dl_rq_bw_ratio(dl_rq);
+ init_dl_rq_bw(dl_rq);
}

#ifdef CONFIG_SMP
@@ -1261,43 +1261,47 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)

/*
* This function implements the GRUB accounting rule:
- * according to the GRUB reclaiming algorithm, the runtime is
- * not decreased as "dq = -dt", but as
- * "dq = -max{u / Umax, (1 - Uinact - Uextra)} dt",
- * where u is the utilization of the task, Umax is the maximum reclaimable
- * utilization, Uinact is the (per-runqueue) inactive utilization, computed
- * as the difference between the "total runqueue utilization" and the
- * runqueue active utilization, and Uextra is the (per runqueue) extra
- * reclaimable utilization.
- * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
- * multiplied by 2^BW_SHIFT, the result has to be shifted right by
- * BW_SHIFT.
- * Since rq->dl.bw_ratio contains 1 / Umax multiplied by 2^RATIO_SHIFT,
- * dl_bw is multiped by rq->dl.bw_ratio and shifted right by RATIO_SHIFT.
+ * As per the GRUB rule, the runtime is not decreased as "dq = -dt", but as
+ * "dq = -max{u, (1 - Uinact - Uextra)} dt",
+ * where:
+ * u: Bandwidth of the task.
+ * running_bw: Total bandwidth of tasks in active state for this rq.
+ * this_bw: Reserved bandwidth for this rq. Includes active and
+ * inactive bandwidth for this rq.
+ * Uinact: Inactive utilization (this_bw - running_bw)
+ * Umax: Max usable bandwidth for DL. Currently
+ * = sched_rt_runtime_us / sched_rt_period_us
+ * Uextra: Extra bandwidth not reserved:
+ * = Umax - \Sum(u_i / #cpus in the root domain)
+ * u_i: Bandwidth of an admitted dl task in the
+ * root domain.
+ *
+ * Deadline tasks are not allowed to use the whole bandwidth of the cpu,
+ * but only a portion of it denoted by "Umax". So the equation becomes:
+ * "dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt",
+ *
* Since delta is a 64 bit variable, to have an overflow its value
* should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
* So, overflow is not an issue here.
*/
static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
{
- u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
u64 u_act;
- u64 u_act_min = (dl_se->dl_bw * rq->dl.bw_ratio) >> RATIO_SHIFT;
+ u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */

/*
- * Instead of computing max{u * bw_ratio, (1 - u_inact - u_extra)},
+ * Instead of computing max{u, (rq->dl.max_bw - u_inact - u_extra)},
* we compare u_inact + rq->dl.extra_bw with
- * 1 - (u * rq->dl.bw_ratio >> RATIO_SHIFT), because
- * u_inact + rq->dl.extra_bw can be larger than
- * 1 * (so, 1 - u_inact - rq->dl.extra_bw would be negative
- * leading to wrong results)
+ * rq->dl.max_bw - u, because u_inact + rq->dl.extra_bw can be larger
+ * than rq->dl.max_bw (so, rq->dl.max_bw - u_inact - rq->dl.extra_bw
+ * would be negative leading to wrong results)
*/
- if (u_inact + rq->dl.extra_bw > BW_UNIT - u_act_min)
- u_act = u_act_min;
+ if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
+ u_act = dl_se->dl_bw;
else
- u_act = BW_UNIT - u_inact - rq->dl.extra_bw;
+ u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;

- return (delta * u_act) >> BW_SHIFT;
+ return div64_u64(delta * u_act, rq->dl.max_bw);
}

/*
@@ -2780,17 +2784,13 @@ int sched_dl_global_validate(void)
return ret;
}

-static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
+static void init_dl_rq_bw(struct dl_rq *dl_rq)
{
- if (global_rt_runtime() == RUNTIME_INF) {
- dl_rq->bw_ratio = 1 << RATIO_SHIFT;
- dl_rq->extra_bw = 1 << BW_SHIFT;
- } else {
- dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
- global_rt_period()) >> (BW_SHIFT - RATIO_SHIFT);
- dl_rq->extra_bw = to_ratio(global_rt_period(),
+ if (global_rt_runtime() == RUNTIME_INF)
+ dl_rq->max_bw = dl_rq->extra_bw = 1 << BW_SHIFT;
+ else
+ dl_rq->max_bw = dl_rq->extra_bw = to_ratio(global_rt_period(),
global_rt_runtime());
- }
}

void sched_dl_do_global(void)
@@ -2819,7 +2819,7 @@ void sched_dl_do_global(void)
raw_spin_unlock_irqrestore(&dl_b->lock, flags);

rcu_read_unlock_sched();
- init_dl_rq_bw_ratio(&cpu_rq(cpu)->dl);
+ init_dl_rq_bw(&cpu_rq(cpu)->dl);
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..1bc7ae9ad349 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -754,10 +754,10 @@ struct dl_rq {
u64 extra_bw;

/*
- * Inverse of the fraction of CPU utilization that can be reclaimed
- * by the GRUB algorithm.
+ * Maximum available bandwidth for deadline tasks of this rq. This is
+ * used in the calculation of the reclaimable bandwidth (GRUB).
*/
- u64 bw_ratio;
+ u64 max_bw;
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.40.1


2023-05-15 08:22:31

by luca abeni

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi,

this patch is giving me some headaches:

On Sun, 14 May 2023 22:57:13 -0400
Vineeth Pillai <[email protected]> wrote:
[...]
> * Uextra: Extra bandwidth not reserved:
> - * = Umax - \Sum(u_i / #cpus in the root domain)
> + * = Umax - this_bw

While I agree that this setting should be OK, it ends up with
dq = -Uact / Umax * dt
which I remember I originally tried, and gave some issues
(I do not remember the details, but I think if you try N
identical reclaiming tasks, with N > M, the reclaimed time
is not distributed equally among them?)

I need to think a little bit more about this...


Luca

> * u_i: Bandwidth of an admitted dl task in the
> * root domain.
> *
> @@ -1286,22 +1286,14 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
> */
> static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
> {
> - u64 u_act;
> - u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
> -
> /*
> - * Instead of computing max{u, (rq->dl.max_bw - u_inact - u_extra)},
> - * we compare u_inact + rq->dl.extra_bw with
> - * rq->dl.max_bw - u, because u_inact + rq->dl.extra_bw can be larger
> - * than rq->dl.max_bw (so, rq->dl.max_bw - u_inact - rq->dl.extra_bw
> - * would be negative leading to wrong results)
> + * max{u, Umax - Uinact - Uextra}
> + * = max{u, max_bw - (this_bw - running_bw) - (max_bw - this_bw)}
> + * = max{u, running_bw} = running_bw
> + * So dq = -(max{u, Umax - Uinact - Uextra} / Umax) dt
> + * = -(running_bw / max_bw) dt
> */
> - if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
> - u_act = dl_se->dl_bw;
> - else
> - u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;
> -
> - return div64_u64(delta * u_act, rq->dl.max_bw);
> + return div64_u64(delta * rq->dl.running_bw, rq->dl.max_bw);
> }
>
> /*


2023-05-16 02:11:06

by Vineeth Remanan Pillai

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Luca,

On Mon, May 15, 2023 at 4:06 AM luca abeni <[email protected]> wrote:

>
> this patch is giving me some headaches:
>
Sorry about that.. I was also stressing out on how to get the
reclaiming done right for the past couple of days ;-)

> Vineeth Pillai <[email protected]> wrote:
> [...]
> > * Uextra: Extra bandwidth not reserved:
> > - * = Umax - \Sum(u_i / #cpus in the root domain)
> > + * = Umax - this_bw
>
> While I agree that this setting should be OK, it ends up with
> dq = -Uact / Umax * dt
> which I remember I originally tried, and gave some issues
> (I do not remember the details, but I think if you try N
> identical reclaiming tasks, with N > M, the reclaimed time
> is not distributed equally among them?)
>
I have noticed this behaviour where the reclaimed time is not equally
distributed when we have more tasks than available processors. But it
depended on where the task was scheduled. Within the same cpu, the
distribution seemed to be proportional. But the tasks migrated often
and then depending on whether the task got a whole cpu for its
runtime or not, the reclaimed bandwidth differed. I thought that
should be okay as it depended upon where the task landed.

One other problem I saw was cpu usage spiking above max_bw, sometimes
leading to a system hang. I thought stopping reclaiming when running_bw
gets larger than max_bw (in the 4th patch) fixed this, but when I ran
the tests long enough, I did see the hang.

> I need to think a little bit more about this...
>
Thanks for looking into this.. I have a basic idea why tasks with less
bandwidth reclaim less in SMP when the number of tasks is less than the
number of cpus, but do not yet have a verifiable fix for it.

If patches 1 and 4 look good to you, we shall drop 2 and 3 and fix the
SMP issue with varying bandwidth separately.. Patch 4 would differ a
bit when I remove 2 and 3 so as to use the formula:

Thanks for your patience with all this brainstorming :-)

Thanks,
Vineeth

2023-05-16 07:48:44

by luca abeni

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

On Mon, 15 May 2023 21:47:03 -0400
Vineeth Remanan Pillai <[email protected]> wrote:

> Hi Luca,
>
> On Mon, May 15, 2023 at 4:06 AM luca abeni
> <[email protected]> wrote:
>
> >
> > this patch is giving me some headaches:
> >
> Sorry about that.. I was also stressing out on how to get the
> reclaiming done right for the past couple of days ;-)

Well, this math is hard... :)

> > Vineeth Pillai <[email protected]> wrote:
> > [...]
> > > * Uextra: Extra bandwidth not reserved:
> > > - * = Umax - \Sum(u_i / #cpus in the root domain)
> > > + * = Umax - this_bw
> >
> > While I agree that this setting should be OK, it ends up with
> > dq = -Uact / Umax * dt
> > which I remember I originally tried, and gave some issues
> > (I do not remember the details, but I think if you try N
> > identical reclaiming tasks, with N > M, the reclaimed time
> > is not distributed equally among them?)
> >
> I have noticed this behaviour where the reclaimed time is not equally
> distributed when we have more tasks than available processors. But it
> depended on where the task was scheduled. Within the same cpu, the
> distribution seemed to be proportional.

Yes, as far as I remember it is due to migrations. IIRC, the problem is
related to the fact that using "dq = -Uact / Umax * dt" a task running
on a core might end up trying to reclaim some idle time from other
cores (which is obviously not possible).
This is why m-GRUB used "1 - Uinact" instead of "Uact".

[...]
> > I need to think a little bit more about this...
> >
> Thanks for looking into this.. I have a basic idea why tasks with less
> bandwidth reclaim less in SMP when number of tasks is less than number
> of cpus, but do not yet have a verifiable fix for it.

I think I can now understand at least part of the problem. In my
understanding, the problem is due to using
dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt

It should really be
dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt

(since we divide by Umax, using "Umax - ..." will lead to reclaiming up
to "Umax / Umax" = 1)

Did you try this equation?

I'll write more about this later... And thanks for coping with all my
comments!


Luca
>
> If patches 1 and 4 looks good to you, we shall drop 2 and 3 and fix
> the SMP issue with varying bandwidth separately.. Patch 4 would
> differ a bit when I remove 2 and 3 so as to use the formula:
> "dq = -(max{u, (Umax_reclaim - Uinact - Uextra)} / Umax_reclaim) dt"
>
> Thanks for your patience with all these brainstorming:-)
>
> Thanks,
> Vineeth


2023-05-16 15:18:05

by Vineeth Remanan Pillai

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Luca,

On Tue, May 16, 2023 at 3:37 AM luca abeni <[email protected]> wrote:
> > I have noticed this behaviour where the reclaimed time is not equally
> > distributed when we have more tasks than available processors. But it
> > depended on where the task was scheduled. Within the same cpu, the
> > distribution seemed to be proportional.
>
> Yes, as far as I remember it is due to migrations. IIRC, the problem is
> related to the fact that using "dq = -Uact / Umax * dt" a task running
> on a core might end up trying to reclaim some idle time from other
> cores (which is obviously not possible).
> This is why m-GRUB used "1 - Uinact" instead of "Uact"
>
This is what I was a little confused about. In "-Uact / Umax", all
the variables are per-cpu, and it should only be reclaiming what is
free on the cpu, right? And when a migration happens, Uact changes
and the reclaiming adapts itself. I was thinking it should probably
be okay for tasks to reclaim differently based on what free bandwidth
is left on the cpu they are running on. For example: if cpu 1 has two
tasks of bw .3 each, each task can reclaim "(.95 - .6) / 2", and
another cpu with only one task (.3 bandwidth) reclaims (.95 - .3). So
both cpus' utilization is .95, and the tasks reclaim what is available
on their cpu.

With "1 - Uinact", where Uinact accounts for a portion of the global
free bandwidth, tasks reclaim proportionally to the global free
bandwidth, and this causes tasks with less bandwidth to reclaim less
compared to higher-bandwidth tasks even if they don't share a cpu.
This is what I was seeing in practice.

But with "Uact / Umax", Uact can be greater than Umax, and this caused
some issues like tasks not getting their reserved bandwidth. For
example: 4 tasks with (7,10) on a 3 cpu system - one cpu can have an
Uact of 1.4 and a scaled_delta greater than delta. This causes the
runtime to deplete faster until one task is migrated. But after the
migration, the target cpu has the same problem. So "Uact / Umax" was
not working in close-to-overcommit situations.
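
To make that concrete (my arithmetic): on the cpu running two of the
(7,10) tasks, Uact = 1.4 and Umax = 0.95, so
scaled_delta = delta * (1.4 / 0.95) ~= 1.47 * delta - the running
task's budget depletes ~47% faster than wall-clock time, so it cannot
get its reserved 70% until the queue is rebalanced.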

In summary, "1 - Uinact" reclaims much less, while "Uact / Umax" has
issues when overcommitting tasks with high bandwidth. This is what I
understood from the experiments and reading.

> [...]
> > > I need to think a little bit more about this...
> > >
> > Thanks for looking into this.. I have a basic idea why tasks with less
> > bandwidth reclaim less in SMP when number of tasks is less than number
> > of cpus, but do not yet have a verifiable fix for it.
>
> I think I can now understand at least part of the problem. In my
> understanding, the problem is due to using
> dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
>
> It should really be
> dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt
>
> (since we divide by Umax, using "Umax - ..." will lead to reclaiming up
> to "Umax / Umax" = 1)
>
> Did you try this equation?
>
I had tested this and it was reclaiming much less compared to the first one.
I had 3 tasks with reservation (3,100) and 3 cpus.

With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06

With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65

As the task bandwidth goes higher, equation (2) reclaims more, while
equation (1) stays at a constant 95% as long as the number of tasks is
less than the number of cpus. If the number of tasks is more than the
number of cpus, eq (2) fares better at reclaiming than eq (1):

eq (1) with 5 tasks (3,100)
TID[627]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
TID[626]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
TID[629]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.62
TID[628]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 29.00
TID[630]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.99

Here top shows 3 cpus in the range ~45 to 50% util

eq (2) with 5 tasks (3,100)
TID[667]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.20
TID[670]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.79
TID[668]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.11
TID[666]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 56.34
TID[669]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 55.82

And here top shows all 3 cpus with 95% util

> I'll write more about this later... And thanks for coping with all my
> comments!
>
Thanks :-)

Vineeth

2023-05-16 16:22:23

by luca abeni

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi,

On Tue, 16 May 2023 11:08:18 -0400
Vineeth Remanan Pillai <[email protected]> wrote:

> Hi Luca,
>
> On Tue, May 16, 2023 at 3:37 AM luca abeni
> <[email protected]> wrote:
> > > I have noticed this behaviour where the reclaimed time is not
> > > equally distributed when we have more tasks than available
> > > processors. But it depended on where the task was scheduled.
> > > Within the same cpu, the distribution seemed to be proportional.
> >
> > Yes, as far as I remember it is due to migrations. IIRC, the
> > problem is related to the fact that using "dq = -Uact / Umax * dt"
> > a task running on a core might end up trying to reclaim some idle
> > time from other cores (which is obviously not possible).
> > This is why m-GRUB used "1 - Uinact" instead of "Uact"
> >
> This is what I was a little confused about. In "-Uact / Umax", all
> the variables are per-cpu and it should only be reclaiming what is
> free on the cpu right? And when migration happens, Uact changes
> and the reclaiming adapts itself.

Sorry, I do not remember the details... But I think the problem is in
the transient when a task migrates from one core to a different one.
I am trying to search through my old notes to see if I can find some
more details.


> I was thinking it should probably
> be okay for tasks to reclaim differently based on what free bw is
> left on the cpu it is running. For eg: if cpu 1 has two tasks of bw
> .3 each, each task can reclaim "(.95 - .6) / 2" and another cpu with
> only one task(.3 bandwidth) reclaims (.95 - .3). So both cpus
> utilization is .95 and tasks reclaim what is available on the cpu.

I suspect (but I am not sure) this only works if tasks do not migrate.


> With "1 - Uinact", where Uinact accounts for a portion of global free
> bandwidth, tasks reclaim proportionately to the global free bandwidth
> and this causes tasks with lesser bandwidth to reclaim lesser when
> compared to higher bandwidth tasks even if they don't share the cpu.
> This is what I was seeing in practice.

Just to be sure: is this with the "original" Uextra setting, or with
your new "Uextra = Umax - this_bw" setting?
(I am not sure, but I suspect that "1 - Uinact - Uextra" with your new
definition of Uextra should work well...)


[...]
> > I think I can now understand at least part of the problem. In my
> > understanding, the problem is due to using
> > dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
> >
> > It should really be
> > dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt
> >
> > (since we divide by Umax, using "Umax - ..." will lead to
> > reclaiming up to "Umax / Umax" = 1)
> >
> > Did you try this equation?
> >
> I had tested this and it was reclaiming much less compared to the
> first one. I had 3 tasks with reservation (3,100) and 3 cpus.
>
> With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
> TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
> TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
> TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06
>
> With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
> TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65

Maybe I am missing something and I am misunderstanding the situation,
but my impression was that this is the effect of setting
Uextra = Umax - \Sum(u_i / #cpus in the root domain).
I was hoping that with your new Umax setting this problem could be
fixed... I am going to double-check my reasoning.


Thanks,
Luca

2023-05-17 02:53:29

by Vineeth Remanan Pillai

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Luca,

On Tue, May 16, 2023 at 12:19 PM luca abeni <[email protected]> wrote:
>
> > I was thinking it should probably
> > be okay for tasks to reclaim differently based on what free bw is
> > left on the cpu it is running. For eg: if cpu 1 has two tasks of bw
> > .3 each, each task can reclaim "(.95 - .6) / 2" and another cpu with
> > only one task(.3 bandwidth) reclaims (.95 - .3). So both cpus
> > utilization is .95 and tasks reclaim what is available on the cpu.
>
> I suspect (but I am not sure) this only works if tasks do not migrate.
>
From what I am seeing, if the reserved bandwidth of all tasks on a cpu
is less than Umax, then this works. Even with migration, if the task
lands on another cpu where the new running_bw < Umax, then it runs and
reclaims the free bandwidth. But this breaks if running_bw > Umax, and
that can happen if total_bw is within limits but a cpu is overloaded.
For example: four tasks with reservation (7,10) on a three cpu system.
Here two cpus will have running_bw = .7, but the third cpu will be at
1.4 even though total_bw = 2.80, which is less than the limit of 2.85.

>
> > With "1 - Uinact", where Uinact accounts for a portion of global free
> > bandwidth, tasks reclaim proportionately to the global free bandwidth
> > and this causes tasks with lesser bandwidth to reclaim lesser when
> > compared to higher bandwidth tasks even if they don't share the cpu.
> > This is what I was seeing in practice.
>
> Just to be sure: is this with the "original" Uextra setting, or with
> your new "Uextra = Umax - this_bw" setting?
> (I am not sure, but I suspect that "1 - Uinact - Uextra" with your new
> definition of Uextra should work well...)
>
I am seeing this with the original Uextra setting, where the global
bandwidth is accounted. With "Uextra = Umax - this_bw", reclaiming
seems to be correct, and I think that is because it considers only the
local bandwidth.

> > With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
> > TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
> > TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
> > TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06
> >
> > With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
> > TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> > TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> > TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
>
> Maybe I am missing something and I am misunderstanding the situation,
> but my impression was that this is the effect of setting
> Uextra = Umax - \Sum(u_i / #cpus in the root domain).
> I was hoping that with your new Umax setting this problem could be
> fixed... I am going to double-check my reasoning.
>
Even with the Umax_reclaim changes, equation (1) is the one which
reclaims up to 95% when the number of tasks is less than the number
of cpus. With more tasks than cpus, eq (1) still reclaims more than
eq (2) and the cpu utilization caps at 95%. I also need to dig more
to understand the reason behind this.

Thanks for looking into this, I will also study more on this and
keep you posted..

Thanks,
Vineeth

2023-05-19 10:03:12

by luca abeni

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi,

sorry for returning to this discussion, but there is something I still
do not understand:

On Tue, 16 May 2023 11:08:18 -0400
Vineeth Remanan Pillai <[email protected]> wrote:
[...]
> I had tested this and it was reclaiming much less compared to the
> first one. I had 3 tasks with reservation (3,100) and 3 cpus.

So, just to confirm: here you have only 3 SCHED_DEADLINE tasks,
scheduled on a root domain containing only 3 CPUs (dl_bw_cpus()
returns 3)... Right?
So, the utilization of each task is 3/100 = 0.03 and Uextra is
1 - (0.03 * 3) / 3 = 0.97.
And since all the tasks are always active, Uinact = 0...
Is this understanding right?

If so:
> With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
> TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
> TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
> TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06
>
> With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
> TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65

Here, we should have
dq = -(max{0.03, (1 - 0 - 0.97)} / Umax) * dt
= -(0.03 / Umax) * dt
which reclaims up to Umax... So, the utilization should be 95%
Since you measured 35.65%, it means that (1-Uextra) is much larger
than 0.97... So, maybe you found some bug in the Uextra computation?

Can you try printing the extra_bw value, to check what happened?



Thanks,
Luca

>
> As the task bandwidth goes higher, equation (2) reclaims more, while
> equation (1) stays at a constant 95% as long as the number of tasks is
> less than the number of cpus. If the number of tasks is more than the
> number of cpus, eq (2) fares better at reclaiming than eq (1):
>
> eq (1) with 5 tasks (3,100)
> TID[627]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
> TID[626]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
> TID[629]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.62
> TID[628]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 29.00
> TID[630]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.99
>
> Here top shows 3 cpus in the range ~45 to 50% util
>
> eq (2) with 5 tasks (3,100)
> TID[667]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.20
> TID[670]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.79
> TID[668]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.11
> TID[666]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 56.34
> TID[669]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 55.82
>
> And here top shows all 3 cpus with 95% util
>
> > I'll write more about this later... And thanks for coping with all
> > my comments!
> >
> Thanks :-)
>
> Vineeth


2023-05-19 10:44:40

by luca abeni

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

On Fri, 19 May 2023 11:56:21 +0200
luca abeni <[email protected]> wrote:

> Hi,
>
> sorry for returning on this discussion, but there is something I still
> do not understand:
>
> On Tue, 16 May 2023 11:08:18 -0400
> Vineeth Remanan Pillai <[email protected]> wrote:
> [...]
> > I had tested this and it was reclaiming much less compared to the
> > first one. I had 3 tasks with reservation (3,100) and 3 cpus.
>
> So, just to confirm: here you have only 3 SCHED_DEADLINE tasks,
> scheduled on a root domain containing only 3 CPUs (dl_bw_cpus() return
> 3)... Right?
> So, the utilization of each task is 3/100 = 0.03 and Uextra is
> 1 - (0.03 * 3) / 3 = 0.97.

OK, sorry again... I found my error immediately after sending the email.
Uextra is computed as "Umax - ...", not "1 - ...".
So, I now understand where the 35% comes from.
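
(Spelling it out: with Uextra = 0.95 - 0.03 = 0.92 and Uinact = 0,
equation (2) gives max{0.03, 1 - 0 - 0.92} = 0.08, so
dq = -(0.08 / 0.95) dt ~= -0.084 dt, and each 3ms budget lasts ~35.6ms
of every 100ms period - the 35.65% measured above.)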

I now _suspect_ the correct equation should be
dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt
but I want to test it before wasting your time again; I'll write more
after performing some more tests.
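
(With the same numbers, this form gives
max{0.03 / 0.95, 0.95 - 0 - 0.92} = max{0.0316, 0.03} = 0.0316, i.e.
dq ~= -0.0316 dt, letting each 3ms budget stretch to ~95ms per 100ms
period - the full 95%.)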


Luca

> And since all the tasks are always active, Uinact = 0...
> Is this understanding right?
>
> If so:
> > With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
> > TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
> > TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
> > TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06
> >
> > With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
> > TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> > TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> > TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
>
> Here, we should have
> dq = -(max{0.03, (1 - 0 - 0.97)} / Umax) * dt
> = -(0.03 / Umax) * dt
> which reclaims up to Umax... So, the utilization should be 95%
> Since you measured 35.65%, it means that (1-Uextra) is much larger
> than 0.97... So, maybe you found some bug in the Uextra computation?
>
> Can you try printing the extra_bw value, to check what happened?
>
>
>
> Thanks,
> Luca
>
> >
> > As the task bandwidth goes higher, equation (2) reclaims more, while
> > equation (1) stays at a constant 95% as long as the number of tasks is
> > less than the number of cpus. If the number of tasks is more than the
> > number of cpus, eq (2) fares better at reclaiming than eq (1):
> >
> > eq (1) with 5 tasks (3,100)
> > TID[627]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
> > TID[626]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
> > TID[629]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.62
> > TID[628]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 29.00
> > TID[630]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.99
> >
> > Here top shows 3 cpus in the range ~45 to 50% util
> >
> > eq (2) with 5 tasks (3,100)
> > TID[667]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.20
> > TID[670]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.79
> > TID[668]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.11
> > TID[666]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 56.34
> > TID[669]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 55.82
> >
> > And here top shows all 3 cpus with 95% util
> >
> > > I'll write more about this later... And thanks for coping with all
> > > my comments!
> > >
> > Thanks :-)
> >
> > Vineeth
>


2023-05-19 16:24:56

by Vineeth Remanan Pillai

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Luca,

On Fri, May 19, 2023 at 6:18 AM luca abeni <[email protected]> wrote:
>
> On Fri, 19 May 2023 11:56:21 +0200
> luca abeni <[email protected]> wrote:
> [...]
>
> OK, sorry again... I found my error immediately after sending the email.
> Uextra is computed as "Umax - ...", not "1 - ...".
> So, I now understand where the 35% comes from.
>
Thanks for debugging this, it makes sense now!

> I now _suspect_ the correct equation should be
> dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt
> but I want to test it before wasting your time again; I'll write more
> after performing some more tests.
>
I tried this equation and it fixes the above issue. But I am a little
confused as to why we should not be limiting the second term to Umax.
In my testing, the issue was solved with this equation as well:
"dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt"

With both of these equations, a couple of other issues we had
discussed before remain unsolved:
- tasks with different bandwidths reclaim differently even when #tasks
is less than #cpus.
- cpu util may go to 100% when we have tasks with large bandwidths
close to Umax.

As an example of issue 1, three tasks - (7,10), (3,10) and (1,10):
TID[590]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.20
TID[591]: RECLAIM=1, (r=3ms, d=10ms, p=10ms), Util: 81.94
TID[592]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 27.19

Regarding issue 2: with four tasks of the same reservation (7,10), the
tasks try to reclaim, leading to 100% cpu usage on all three cpus and
a system hang.

I was trying to understand the issue, and it looks like the static
values of Uextra and Umax are causing the inaccuracies. Uextra is
calculated based on the global bandwidth, but based on the local state
of the cpu, we could reclaim more or less than (u_inact +
rq->dl.extra_bw). Similarly, Umax is a local max for each cpu, and we
should not be reclaiming up to Umax unconditionally. If the global
load is high, reclaiming up to Umax would cause problems when a
migrating task lands on the cpu.

I was trying an idea to dynamically decide Uextra and Umax based on
global and local load.

The crux of the idea is as below:

+ if (rq->dl.running_bw > rq->dl.max_bw)
+ return delta;
+
+ max_bw = rq->dl.this_bw + rq->dl.extra_bw;
+ extra_bw = rq->dl.extra_bw;
+ if (rq->dl.this_bw < rq->dl.extra_bw || max_bw > rq->dl.max_bw) {
+ extra_bw = rq->dl.max_bw - rq->dl.this_bw;
+ max_bw = rq->dl.max_bw;
+ }
+

And use this max_bw and extra_bw in the equation:
"dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt"

The reasoning for the above changes:
- running_bw can be greater than max_bw in SMP, and we should not be
reclaiming if that is the case.
- Initially we assume max_bw and extra_bw based on the global load.
If this_bw < extra_bw, it means that the cpus are not heavily loaded
and incoming migrations can be satisfied with the local cpu's available
bandwidth, so we can reclaim up to rq->dl.max_bw and use extra_bw
as "rq->dl.max_bw - rq->dl.this_bw". Also, we should limit max_bw for
any cpu to a maximum of rq->dl.max_bw.

This does not consider a mix of normal and SCHED_FLAG_RECLAIM tasks,
and minor changes on top of this are needed to cover that scenario.
This seems to fix all the issues that I have encountered.

The above fix doesn't work on equation:
"dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt"

The cpu still spikes to 100% with 4 tasks of (7,10). I am not sure,
but I guess it might be because we are not limiting the second term
to Umax.

Please let me know your thoughts on this.

Thanks again,
Vineeth

2023-05-19 18:28:11

by Dietmar Eggemann

Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Vineeth,

On 15/05/2023 04:57, Vineeth Pillai wrote:
> In a multi-processor system, bandwidth usage is divided equally to
> all cpus. This causes issues with reclaiming free bandwidth on a cpu.
> "Uextra" is same on all cpus in a root domain and running_bw would be
> different based on the reserved bandwidth of tasks running on the cpu.
> This causes disproportionate reclaiming - task with lesser bandwidth
> reclaims less even if its the only task running on that cpu.
>
> Following is a small test with three tasks with reservations (8,10)
> (1,10) and (1, 100). These three tasks run on different cpus. But
> since the reclamation logic calculates available bandwidth as a factor
> of globally available bandwidth, tasks with lesser bandwidth reclaims
> only little compared to higher bandwidth even if cpu has free and
> available bandwidth to be reclaimed.
>
> TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
> TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
> TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16

What does this 'Util: X' value stand for? I assume it's the utilization
of the task? How do you obtain it?

I see that e.g. TID[731] should run 1ms each 10ms w/o grub and with grub
the runtime could be potentially longer since 'scaled_delta_exec < delta'.

I don't get this comment in update_curr_dl():

1325 /*
1326 * For tasks that participate in GRUB, we implement GRUB-PA: the
1327 * spare reclaimed bandwidth is used to clock down frequency.
1328 *

It looks like dl_se->runtime is affected and with 'scaled_delta_exec <
delta' the task runs longer than dl_se->dl_runtime?

> Fix: use the available bandwidth on each cpu to calculate reclaimable
> bandwidth. Admission control takes care of total bandwidth and hence
> using the available bandwidth on a specific cpu would not break the
> deadline guarentees.
>
> With this fix, the above test behaves as follows:
> TID[586]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.24
> TID[585]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 95.01
> TID[584]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.01
>
> Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> ---
> kernel/sched/deadline.c | 22 +++++++---------------
> 1 file changed, 7 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 91451c1c7e52..85902c4c484b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1272,7 +1272,7 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
> * Umax: Max usable bandwidth for DL. Currently
> * = sched_rt_runtime_us / sched_rt_period_us
> * Uextra: Extra bandwidth not reserved:
> - * = Umax - \Sum(u_i / #cpus in the root domain)
> + * = Umax - this_bw
> * u_i: Bandwidth of an admitted dl task in the
> * root domain.
> *
> @@ -1286,22 +1286,14 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
> */
> static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
> {
> - u64 u_act;
> - u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
> -
> /*
> - * Instead of computing max{u, (rq->dl.max_bw - u_inact - u_extra)},
> - * we compare u_inact + rq->dl.extra_bw with
> - * rq->dl.max_bw - u, because u_inact + rq->dl.extra_bw can be larger
> - * than rq->dl.max_bw (so, rq->dl.max_bw - u_inact - rq->dl.extra_bw
> - * would be negative leading to wrong results)
> + * max{u, Umax - Uinact - Uextra}
> + * = max{u, max_bw - (this_bw - running_bw) + (this_bw - running_bw)}
> + * = max{u, running_bw} = running_bw
> + * So dq = -(max{u, Umax - Uinact - Uextra} / Umax) dt
> + * = -(running_bw / max_bw) dt
> */
> - if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
> - u_act = dl_se->dl_bw;
> - else
> - u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;
> -
> - return div64_u64(delta * u_act, rq->dl.max_bw);
> + return div64_u64(delta * rq->dl.running_bw, rq->dl.max_bw);

I did the test discussed later in this thread with:

3 [3/100] tasks (dl_se->dl_bw = (3 << 20)/100 = 31457) on 3 CPUs

factor = scaled_delta_exec/delta

- existing grub

rq->dl.bw_ratio = ( 100 << 8 ) / 95 = 269
rq->dl.extra_bw = ( 95 << 20 ) / 100 = 996147

cpu=2 curr->[thread0-2 1715] delta=2140100 this_bw=31457
running_bw=31457 extra_bw=894788 u_inact=0 u_act_min=33054 u_act=153788
scaled_delta_exec=313874 factor=0.14

- your solution patch [1-2]

cpu=2 curr->[thread0-0 1676] delta=157020 running_bw=31457 max_bw=996147
res=4958 factor=0.03

You say that GRUB calculation is inaccurate and that this inaccuracy
gets larger as the bandwidth of tasks becomes smaller.

Could you explain this inaccuracy on this example?

2023-05-20 02:28:32

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Dietmar,

On Fri, May 19, 2023 at 1:56 PM Dietmar Eggemann
<[email protected]> wrote:

> > TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
> > TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
> > TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16
>
> What does this 'Util: X' value stand for? I assume it's the utilization
> of the task? How do you obtain it?
>
Yes, it is the utilization of the task. I calculate it by dividing the
cputime by the elapsed time (using clock_gettime(2)).
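
For illustration, a minimal sketch of one way to take that measurement
(the exact clock choices here are my assumption):

    #include <stdio.h>
    #include <time.h>

    static double ts_to_s(struct timespec t)
    {
        return t.tv_sec + t.tv_nsec / 1e9;
    }

    /* Utilization (%) of the calling thread over a measurement window
     * that started at cpu0/wall0. */
    static double task_util(struct timespec cpu0, struct timespec wall0)
    {
        struct timespec cpu1, wall1;

        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu1); /* cputime */
        clock_gettime(CLOCK_MONOTONIC, &wall1);        /* elapsed time */

        return 100.0 * (ts_to_s(cpu1) - ts_to_s(cpu0)) /
                       (ts_to_s(wall1) - ts_to_s(wall0));
    }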

> I see that e.g. TID[731] should run 1ms each 10ms w/o grub and with grub
> the runtime could be potentially longer since 'scaled_delta_exec < delta'.
>
Yes, correct. The GRUB (Greedy Reclamation of Unused Bandwidth) algorithm
is used here for deadline tasks that need to run longer than their
runtime when needed. sched_setattr allows a flag, SCHED_FLAG_RECLAIM,
to indicate that the task would like to reclaim unused bandwidth of a
cpu if available. For those tasks, 'runtime' is depleted using the
GRUB formula, which allows them to run longer and reclaim the free
bandwidth of the cpu. The GRUB implementation in linux allows a task
to reclaim up to the RT capacity (95%), depending on the free bandwidth
of the cpu. So TID[731] theoretically should run for 9.5ms every 10ms
(95%) as it is the only task on the cpu, but it doesn't get to run that
long. A sketch of how a task opts in follows below.
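
For reference, a minimal userspace sketch of opting in (the struct
layout mirrors the uapi one; if your toolchain ships
<linux/sched/types.h> with struct sched_attr, include that instead):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/sched.h>   /* SCHED_DEADLINE, SCHED_FLAG_RECLAIM */

    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;    /* ns */
        uint64_t sched_deadline;   /* ns */
        uint64_t sched_period;     /* ns */
    };

    /* Make the calling thread a reclaiming deadline task, e.g.
     * become_dl_reclaim(1000000, 10000000) for a (1,10) reservation. */
    static int become_dl_reclaim(uint64_t runtime_ns, uint64_t period_ns)
    {
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy = SCHED_DEADLINE;
        attr.sched_flags = SCHED_FLAG_RECLAIM; /* opt in to GRUB */
        attr.sched_runtime = runtime_ns;
        attr.sched_deadline = period_ns;
        attr.sched_period = period_ns;

        return syscall(SYS_sched_setattr, 0, &attr, 0);
    }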

> I don't get this comment in update_curr_dl():
>
> 1325 /*
> 1326 * For tasks that participate in GRUB, we implement GRUB-PA: the
> 1327 * spare reclaimed bandwidth is used to clock down frequency.
> 1328 *
>
> It looks like dl_se->runtime is affected and with 'scaled_delta_exec <
> delta' the task runs longer than dl_se->dl_runtime?
>
Yes. As mentioned above, GRUB allows the task to run longer by slowing
down the depletion of "dl_se->runtime". scaled_delta_exec is
calculated by the GRUB formula explained in the papers [1] & [2].

> I did the test discussed later in this thread with:
>
> 3 [3/100] tasks (dl_se->dl_bw = (3 << 20)/100 = 31457) on 3 CPUs
>
> factor = scaled_delta_exec/delta
>
> - existing grub
>
> rq->dl.bw_ratio = ( 100 << 8 ) / 95 = 269
> rq->dl.extra_bw = ( 95 << 20 ) / 100 = 996147
>
> cpu=2 curr->[thread0-2 1715] delta=2140100 this_bw=31457
> running_bw=31457 extra_bw=894788 u_inact=0 u_act_min=33054 u_act=153788
> scaled_delta_exec=313874 factor=0.14
>
> - your solution patch [1-2]
>
> cpu=2 curr->[thread0-0 1676] delta=157020 running_bw=31457 max_bw=996147
> res=4958 factor=0.03
>
> You say that GRUB calculation is inaccurate and that this inaccuracy
> gets larger as the bandwidth of tasks becomes smaller.
>
> Could you explain this inaccuracy on this example?
>
According to GRUB, we should be able to reclaim the unused bandwidth
for the running task up to the RT limit (95%). In this example we have
a task with 3ms runtime and 100ms period on a cpu, so it is supposed
to run for 95ms before it is throttled.
The existing implementation's factor is 0.14, and the 3ms runtime is
depleted by this factor, so the task only gets to run for
"3 / 0.14 ~= 22ms". This is the inaccuracy that the patch is trying to
solve. With the patch, the factor is 0.03166 and
runtime = "3 / 0.03166 ~= 95ms". The sketch below reproduces the
patched calculation with your traced values.
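
A small userspace sketch of that arithmetic, using the kernel's
BW_SHIFT=20 fixed point and the numbers from your trace (only the
patched formula is taken from the patch; the rest is illustrative):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t delta      = 157020; /* ns of execution, from the trace */
        uint64_t running_bw = 31457;  /* (3 << 20) / 100, one 3/100 task */
        uint64_t max_bw     = 996147; /* (95 << 20) / 100 */

        /* patched grub_reclaim(): dq = -(running_bw / max_bw) dt */
        uint64_t scaled = delta * running_bw / max_bw;

        printf("scaled_delta_exec=%llu factor=%.5f\n",
               (unsigned long long)scaled, (double)scaled / delta);
        /* prints scaled_delta_exec=4958 factor=0.03158, i.e. ~3/95,
         * so 3ms of runtime stretches to ~95ms of execution */
        return 0;
    }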

Hope this clarifies.

Thanks,
Vineeth

[1]: Abeni, Luca & Lipari, Giuseppe & Parri, Andrea & Sun, Youcheng.
(2015). Parallel and sequential reclaiming in multicore
real-time global scheduling.

[2]: G. Lipari and S. Baruah, "Greedy reclamation of unused bandwidth
in constant-bandwidth servers," Proceedings 12th Euromicro
Conference on Real-Time Systems. Euromicro RTS 2000, Stockholm,
Sweden, 2000, pp. 193-200, doi: 10.1109/EMRTS.2000.854007.

2023-05-20 09:53:50

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi,

On Fri, 19 May 2023 12:12:50 -0400
Vineeth Remanan Pillai <[email protected]> wrote:
[...]
> With both these equations, it doesn't solve a couple of other issues we
> had discussed before:
> - tasks with different bandwidths reclaim differently even when #tasks
> is less than #cpus.

I think I now understand this issue (see below)


> - cpu util may go to 100% when we have tasks with large bandwidth
> close to Umax

This one is still not clear to me... I'll do some more analysis.


> As an eg. for issue 1, three tasks - (7,10) (3,10) and (1,10):
> TID[590]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.20
> TID[591]: RECLAIM=1, (r=3ms, d=10ms, p=10ms), Util: 81.94
> TID[592]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 27.19

So, the issue here is that GRUB tries to assign the reclaimed
utilization proportionally to the tasks' utilizations... So, 591 should
execute 3 times the amount of time executed by 592, and 590 should
execute 7 times the amount of time executed by 592. Task 592 is then
supposed to execute for 95 / (1 + 3 + 7) = 95 / 11 = 8.63% of the CPU
time; task 591 is supposed to execute for 8.63 * 3 = 25.9% of the CPU
time; task 590 is supposed to execute for 8.63 * 7 = 60.45% of the CPU
time.

So, 592 executes for 8.63 * 3 = 25.9% of the time on one single CPU
(you measured 27.19, which is near this), task 591 executes for
25.9 * 3 = 77.72% of the time on one single CPU (again, this is
close to what you measured) and task 590 should execute for...
60.45 * 3 = 181.3% of the time on one single CPU! Which is clearly not
possible... And the "max{}" rule cuts this to 95%.

So, we are wasting 181.3 - 95 = 86.3% of CPU time, which 590 cannot
reclaim (because it cannot execute simultaneously on 2 CPUs).

And this is close to the amount of CPU time not reclaimed in the test
you cite above (95 - 81 + 95 - 27)
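
A tiny sketch checking this arithmetic (my reconstruction of the
reasoning above, with Umax = 0.95 and 3 cpus):

    #include <stdio.h>

    int main(void)
    {
        double u[] = { 0.7, 0.3, 0.1 };  /* tasks 590, 591, 592 */
        double umax = 0.95, sum = 0.0;
        int i, ncpu = 3;

        for (i = 0; i < 3; i++)
            sum += u[i];                 /* 1.1 */

        for (i = 0; i < 3; i++) {
            /* proportional share of the total reclaimable time... */
            double total = umax * u[i] / sum;
            /* ...concentrated on the single cpu the task runs on */
            double oncpu = total * ncpu;
            printf("u=%.1f: %5.1f%% of one cpu%s\n", u[i], 100 * oncpu,
                   oncpu > umax ? " -> cut to 95% by max{}" : "");
        }
        return 0;
    }

This prints 181.4%, 77.7% and 25.9%, matching the figures above up to
rounding.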


Now that the problem is more clear to me, I am trying to understand a
possible solution (as you mention, moving some extra bandwidth from the
590's CPU will fix this problem... But I am not sure if this dynamic
extra bandwidth migration is feasible in practice without introducing
too much overhead)

I'll look better at your new proposal.


Thanks,
Luca

2023-05-20 10:51:42

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi again,

On Fri, 19 May 2023 12:12:50 -0400
Vineeth Remanan Pillai <[email protected]> wrote:
[...]
> - cpu util may go to 100% when we have tasks with large bandwidth
> close to Umax
>
> As an eg. for issue 1, three tasks - (7,10) (3,10) and (1,10):
> TID[590]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.20
> TID[591]: RECLAIM=1, (r=3ms, d=10ms, p=10ms), Util: 81.94
> TID[592]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 27.19
>
> re. issue 2: four tasks with the same reservation (7,10) all try
> to reclaim, leading to 100% cpu usage on all three cpus and a
> system hang.

I just tried to repeat this test on a VM with 3 CPUs, and I can
reproduce the stall (100% of CPU time reclaimed by SCHED_DEADLINE
tasks, with no possibility for the other tasks to execute) when I use
dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt

But when I use
dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
everything works as expected, the 4 tasks reclaim 95% of the CPU
time and my shell is still active...
(so, I cannot reproduce the starvation issue with this equation)

So, I now think the second one is the correct equation to be used.



Luca

2023-05-22 20:01:48

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Luca,

Merging the last two mails in this reply :-)

> So, we are wasting 181.3 - 95 = 86.3% of CPU time, which 590 cannot
> reclaim (because it cannot execute simultaneously on 2 CPUs).
>
Correct. Thanks for explaining it in detail, I was tracing the scheduler
and verified this pattern you explained.

> Now that the problem is more clear to me, I am trying to understand a
> possible solution (as you mention, moving some extra bandwidth from the
> 590's CPU will fix this problem... But I am not sure if this dynamic
> extra bandwidth migration is feasible in practice without introducing
> too much overhead)
>
> I'll look better at your new proposal.
>
The idea that I mentioned tries to solve this problem in a best-effort
way: if the global load is high, use the global values
"Uextra = rq->dl.extra_bw" and "Umax = rq->dl.this_bw + rq->dl.extra_bw".
Otherwise use the local values "Umax = rq->dl.max_bw" and
"Uextra = rq->dl.max_bw - rq->dl.this_bw". This is still not perfect,
but it reclaims very close to the maximum allowed limit almost always.
A rough sketch of the shape is below.
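
For illustration only, one possible shape of that selection inside
grub_reclaim() — the load predicate and its name are hypothetical, not
the actual patch:

    static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
    {
        u64 u_max, u_extra, u_inact, u_act;

        if (dl_root_domain_load_is_high(rq)) {  /* hypothetical helper */
            /* global view: bandwidth shared across the root domain */
            u_extra = rq->dl.extra_bw;
            u_max   = rq->dl.this_bw + rq->dl.extra_bw;
        } else {
            /* local view: reclaim whatever is free on this cpu */
            u_max   = rq->dl.max_bw;
            u_extra = rq->dl.max_bw - rq->dl.this_bw;
        }

        u_inact = rq->dl.this_bw - rq->dl.running_bw;  /* Utot - Uact */

        /* max{u_i, Umax - Uinact - Uextra}, avoiding u64 underflow */
        if (u_inact + u_extra > u_max - dl_se->dl_bw)
            u_act = dl_se->dl_bw;
        else
            u_act = u_max - u_inact - u_extra;

        /* dq = -(max{u_i, Umax - Uinact - Uextra} / Umax) dt */
        return div64_u64(delta * u_act, u_max);
    }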

Please have a look when you get a chance :-).

>
> I just tried to repeat this test on a VM with 3 CPUs, and I can
> reproduce the stall (100% of CPU time reclaimed by SCHED_DEADLINE
> tasks, with no possibility for the other tasks to execute) when I use
> dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt
>
> But when I use
> dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
> everything works as expected, the 4 tasks reclaim 95% of the CPU
> time and my shell is still active...
> (so, I cannot reproduce the starvation issue with this equation)
>
Sorry about this confusion, yes you are right, there is no stall with
this equation. The only issue is the lesser reclaim when the load is
less and tasks have different bandwidth requirements.

> So, I now think the second one is the correct equation to be used.
>
Thanks for confirming.

I think it probably makes sense to get the fix for the equation to go
in as a first step and then we can investigate more about the second
issue (less reclaiming with less load and different bandwidth) and
fix it separately. What do you think? I shall send the next iteration
with the fix for the equation alone if it's okay with you.

Thanks,
Vineeth

2023-05-23 21:24:23

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi,

sorry for the late reply.

On Mon, 22 May 2023 15:22:52 -0400
Vineeth Remanan Pillai <[email protected]> wrote:
[...]
> > But when I use
> > dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
> > everything works as expected, the 4 tasks reclaim 95% of the CPU
> > time and my shell is still active...
> > (so, I cannot reproduce the starvation issue with this equation)
> >
> Sorry about this confusion, yes you are right, there is no stall with
> this equation. The only issue is the lesser reclaim when the load is
> less and tasks have different bandwidth requirements.
>
> > So, I now think the second one is the correct equation to be used.
> >
> Thanks for confirming.
>
> I think it probably makes sense to get the fix for the equation to go
> in as a first step and then we can investigate more about the second
> issue (less reclaiming with less load and different bandwidth) and
> fix it separately. What do you think?

I fully agree. If you split this change into a first patch, IMHO it
can be applied.

BTW, I tried changing the equation without introducing div64, and it
seems to me that it works well... So, if removing the bw_ratio
approximation is needed, I think you can do it in a second patch (so,
the first patch changes the reclaiming equation, and the second one
introduces div64)


Thanks,
Luca

> I shall send the next iteration
> with the fix for the equation alone if it's okay with you.
>
> Thanks,
> Vineeth


2023-05-24 02:14:07

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Luca,

> >
> > I think it probably makes sense to get the fix for the equation to go
> > in as a first step and then we can investigate more about the second
> > issue (less reclaiming with less load and different bandwidth) and
> > fix it separately. What do you think?
>
> I fully agree. If you split this change into a first patch, IMHO it
> can be applied.
>
Thanks, I shall have it as the first patch in the next iteration.

> BTW, I tried changing the equation without introducing div64, and it
> seems to me that it works well... So, if removing the bw_ratio
> approximation is needed, I think you can do it in a second patch (so,
> the first patch changes the reclaiming equation, and the second one
> introduces div64)
>
There was one more reason for removing the bw_ratio. The second patch,
which fixes the issue with a mix of normal and SCHED_FLAG_RECLAIM
tasks, does not have a constant Umax, so we cannot calculate the
bw_ratio in advance; it has to be calculated in grub_reclaim(). The
sketch below shows the difference.
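
Roughly, the two shapes (from memory of the pre-patch code, so treat
the exact lines as illustrative):

    /* constant Umax folded into a precomputed ratio (bw_ratio ~ 1/Umax) */
    u_act_min = (dl_se->dl_bw * rq->dl.bw_ratio) >> RATIO_SHIFT;
    ...
    return (delta * u_act) >> BW_SHIFT;

    /* per-invocation Umax: divide at reclaim time instead */
    return div64_u64(delta * u_act, u_max);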

As you suggested, I shall remove the bw_ratio as the second patch and
then have a third patch to fix the case of SCHED_FLAG_RECLAIM and
normal deadline tasks. For the rest of the issues, I shall work on them
as a separate patch series once this goes in.

Will send it out tomorrow after a bit more testing.

Thanks again,
Vineeth

2023-05-25 12:12:51

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Vineeth,

On 20/05/2023 04:15, Vineeth Remanan Pillai wrote:
> Hi Dietmar,
>
> On Fri, May 19, 2023 at 1:56 PM Dietmar Eggemann
> <[email protected]> wrote:
>
>>> TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
>>> TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
>>> TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16
>>
>> What does this 'Util: X' value stand for? I assume it's the utilization
>> of the task? How do you obtain it?
>>
> Yes, it is the utilization of the task. I calculate it by dividing the
> cputime with elapsed time(using clock_gettime(2)).

Makes sense. I guess what I missed here in the first place is the fact
that those DL tasks want to run 100%.

>> I see that e.g. TID[731] should run 1ms each 10ms w/o grub and with grub
>> the runtime could be potentially longer since 'scaled_delta_exec < delta'.
>>
> Yes, correct. The GRUB (Greedy Reclamation of Unused Bandwidth) algorithm
> is used here for deadline tasks that need to run longer than their
> runtime when needed. sched_setattr allows a flag, SCHED_FLAG_RECLAIM,
> to indicate that the task would like to reclaim unused bandwidth of a
> cpu if available. For those tasks, 'runtime' is depleted using the
> GRUB formula, which allows them to run longer and reclaim the free
> bandwidth of the cpu. The GRUB implementation in linux allows a task
> to reclaim up to the RT capacity (95%), depending on the free bandwidth
> of the cpu. So TID[731] theoretically should run for 9.5ms every 10ms
> (95%) as it is the only task on the cpu, but it doesn't get to run that
> long.

Correct.

>> I don't get this comment in update_curr_dl():
>>
>> 1325 /*
>> 1326 * For tasks that participate in GRUB, we implement GRUB-PA: the
>> 1327 * spare reclaimed bandwidth is used to clock down frequency.
>> 1328 *
>>
>> It looks like dl_se->runtime is affected and with 'scaled_delta_exec <
>> delta' the task runs longer than dl_se->dl_runtime?
>>
> Yes. As mentioned above, GRUB allows the task to run longer by slowing
> down the depletion of "dl_se->runtime". scaled_delta_exec is
> calculated by the GRUB formula explained in the papers [1] & [2].

What I didn't understand was this `GRUB-PA` and `the spare reclaimed
bandwidth is used to clock down frequency` in relation to GRUB task
runtime depletion.

But now I think I get it. `GRUB-PA` means that in case we run with the
schedutil CPUfreq governor, the CPU frequency is influenced by Uact
(rq->dl.running_bw) via:

sugov_get_util() -> effective_cpu_util() -> cpu_bw_dl() ->

return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT

and on top of this we do GRUB reclaiming for those SCHED_FLAG_RECLAIM
tasks, i.e. task runtime depreciation.

>> I did the test discussed later in this thread with:
>>
>> 3 [3/100] tasks (dl_se->dl_bw = (3 << 20)/100 = 31457) on 3 CPUs
>>
>> factor = scaled_delta_exec/delta
>>
>> - existing grub
>>
>> rq->dl.bw_ratio = ( 100 << 8 ) / 95 = 269
>> rq->dl.extra_bw = ( 95 << 20 ) / 100 = 996147
>>
>> cpu=2 curr->[thread0-2 1715] delta=2140100 this_bw=31457
>> running_bw=31457 extra_bw=894788 u_inact=0 u_act_min=33054 u_act=153788
>> scaled_delta_exec=313874 factor=0.14
>>
>> - your solution patch [1-2]
>>
>> cpu=2 curr->[thread0-0 1676] delta=157020 running_bw=31457 max_bw=996147
>> res=4958 factor=0.03
>>
>> You say that GRUB calculation is inaccurate and that this inaccuracy
>> gets larger as the bandwidth of tasks becomes smaller.
>>
>> Could you explain this inaccuracy on this example?
>>
> According to GRUB, we should be able to reclaim the unused bandwidth
> for the running task up to the RT limit (95%). In this example we have
> a task with 3ms runtime and 100ms period on a cpu, so it is supposed
> to run for 95ms before it is throttled.

Correct.

> The existing implementation's factor is 0.14, and the 3ms runtime is
> depleted by this factor, so the task only gets to run for
> "3 / 0.14 ~= 22ms". This is the inaccuracy that the patch is trying to
> solve. With the patch, the factor is 0.03166 and
> runtime = "3 / 0.03166 ~= 95ms".

My tests were wrong since I was using a DL task with dl_runtime=3ms and
dl_period=100ms with an actual runtime of 3ms, whereas your tasks
probably want to run 100%.

> Hope this clarifies.

yes, it did, thanks!


2023-05-26 15:24:23

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

On Fri, 26 May 2023 10:54:09 -0400
Vineeth Remanan Pillai <[email protected]> wrote:
[...]
> Sorry for the delay. I was trying to fix the inaccuracy when normal
> and SCHED_FLAG_RECLAIM tasks are present. It's a bit more complicated
> than I initially thought, as Umax_reclaim and Uextra need to be
> computed at the root domain level and this would mean more locking. I
> have a patch ready, but I think I need to do more testing to
> understand the performance implications, and it is better to batch it
> as a separate series.
>
> So, I am just sending out the GRUB equation fix, which is well tested
> and which I am confident about. Will send out the rest of the fixes
> later as a separate series.

Looks good to me. And... Thanks again for taking care of this issue!


Luca

2023-05-26 15:24:43

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Luca,

On Tue, May 23, 2023 at 10:11 PM Vineeth Remanan Pillai
<[email protected]> wrote:
>
> Hi Luca,
>
> [...]
>
> Will send it out tomorrow after a bit more testing.
>
Sorry for the delay. I was trying to fix the inaccuracy when normal
and SCHED_FLAG_RECLAIM tasks are present. It's a bit more complicated
than I initially thought, as Umax_reclaim and Uextra need to be
computed at the root domain level and this would mean more locking. I
have a patch ready, but I think I need to do more testing to understand
the performance implications, and it is better to batch it as a
separate series.

So, I am just sending out the GRUB equation fix, which is well tested
and which I am confident about. Will send out the rest of the fixes
later as a separate series.

Thanks,
Vineeth