2020-10-07 15:15:37

by Peng Liu

Subject: [PATCH v5 0/2] sched/deadline: Fix and optimize sched_dl_global_validate()

When changing the global rt bandwidth, we check to make sure that the
new settings can accommodate the already-allocated dl bandwidth.

Under SMP, dl_bw is maintained on a per root domain basis, but currently
we check and update the new settings CPU by CPU rather than per root
domain, which is both redundant and incorrect.

patch 1 removes the superfluous checking and updating
patch 2 fixes the erroneous validation

For details, please see the corresponding patch.

----------------
v5 <-- v4:
- no functional changes, just split the v4 single patch into two to
obey the "one patch does only one thing" rule;
- turn root_domain::visit_gen from u64 to u32;
both suggested by Juri.
- refine changelog;

v4 <-- v3:
- refine changelog;
- eliminate the ugly #ifdefs with Peter's method;

v3 <-- v2:
- fix build error for !CONFIG_SMP, reported by kernel test robot;

v2 <-- v1:
- replace cpumask_weight(cpu_rq(cpu)->rd->span) with dl_bw_cpus(cpu),
suggested by Juri;

Peng Liu (2):
sched/deadline: optimize sched_dl_global_validate()
sched/deadline: Fix sched_dl_global_validate()

kernel/sched/deadline.c | 46 +++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 49 +++++++++++++++++++++--------------------
kernel/sched/topology.c | 1 +
3 files changed, 62 insertions(+), 34 deletions(-)

--
2.20.1


2020-10-07 15:16:39

by Peng Liu

Subject: [PATCH v5 2/2] sched/deadline: Fix sched_dl_global_validate()

When changing sched_rt_{runtime,period}_us, we validate that the new
settings can at least accommodate the currently allocated -dl
bandwidth:

sched_rt_handler()
  --> sched_dl_bandwidth_validate()
	{
		new_bw = global_rt_runtime()/global_rt_period();

		for_each_possible_cpu(cpu) {
			dl_b = dl_bw_of(cpu);
			if (new_bw < dl_b->total_bw)	<-------
				ret = -EBUSY;
		}
	}

But under CONFIG_SMP, dl_bw is per root domain, not per CPU:
dl_b->total_bw is the allocated bandwidth of the whole root domain.
Instead, we should compare dl_b->total_bw against "cpus*new_bw",
where 'cpus' is the number of CPUs of the root domain.

Also, the annotation below (in kernel/sched/sched.h) describes an
implementation that only existed in SCHED_DEADLINE v2[1]; the deadline
scheduler kept evolving until it was merged (v9), but the annotation
remained unchanged, now meaningless and misleading. Update it.

* With respect to SMP, the bandwidth is given on a per-CPU basis,
* meaning that:
* - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
* - dl_total_bw array contains, in the i-eth element, the currently
* allocated bandwidth on the i-eth CPU.

[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
Signed-off-by: Peng Liu <[email protected]>
---
kernel/sched/deadline.c | 5 +++--
kernel/sched/sched.h | 42 ++++++++++++++++++-----------------------
2 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5200e185923f..e32715db7119 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2540,7 +2540,7 @@ int sched_dl_global_validate(void)
u64 new_bw = to_ratio(period, runtime);
struct dl_bw *dl_b;
unsigned long flags;
- int cpu, ret = 0;
+ int cpu, cpus, ret = 0;
u32 gen = ++dl_generation;

/*
@@ -2555,9 +2555,10 @@ int sched_dl_global_validate(void)
goto next;

dl_b = dl_bw_of(cpu);
+ cpus = dl_bw_cpus(cpu);

raw_spin_lock_irqsave(&dl_b->lock, flags);
- if (new_bw < dl_b->total_bw)
+ if (new_bw * cpus < dl_b->total_bw)
ret = -EBUSY;
raw_spin_unlock_irqrestore(&dl_b->lock, flags);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 53477e8b26b0..b407e7af8e09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -257,30 +257,6 @@ struct rt_bandwidth {

void __dl_clear_params(struct task_struct *p);

-/*
- * To keep the bandwidth of -deadline tasks and groups under control
- * we need some place where:
- * - store the maximum -deadline bandwidth of the system (the group);
- * - cache the fraction of that bandwidth that is currently allocated.
- *
- * This is all done in the data structure below. It is similar to the
- * one used for RT-throttling (rt_bandwidth), with the main difference
- * that, since here we are only interested in admission control, we
- * do not decrease any runtime while the group "executes", neither we
- * need a timer to replenish it.
- *
- * With respect to SMP, the bandwidth is given on a per-CPU basis,
- * meaning that:
- * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
- * - dl_total_bw array contains, in the i-eth element, the currently
- * allocated bandwidth on the i-eth CPU.
- * Moreover, groups consume bandwidth on each CPU, while tasks only
- * consume bandwidth on the CPU they're running on.
- * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
- * that will be shown the next time the proc or cgroup controls will
- * be red. It on its turn can be changed by writing on its own
- * control.
- */
struct dl_bandwidth {
raw_spinlock_t dl_runtime_lock;
u64 dl_runtime;
@@ -292,6 +268,24 @@ static inline int dl_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}

+/*
+ * To keep the bandwidth of -deadline tasks under control
+ * we need some place where:
+ * - store the maximum -deadline bandwidth of each cpu;
+ * - cache the fraction of bandwidth that is currently allocated in
+ * each root domain;
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, bandwidth is given on a per root domain basis,
+ * meaning that:
+ * - bw (< 100%) is the deadline bandwidth of each CPU;
+ * - total_bw is the currently allocated bandwidth in each root domain;
+ */
struct dl_bw {
raw_spinlock_t lock;
u64 bw;
--
2.20.1