2020-10-08 15:57:46

by Peng Liu

Subject: [PATCH v6 0/2] sched/deadline: Fix and optimize sched_dl_global_validate()

When changing the global rt bandwidth, we check that the new settings
can accommodate the currently allocated dl bandwidth.

Under SMP, dl_bw is maintained per root domain, but currently we check
and update the new settings CPU by CPU rather than per root domain,
which is both redundant and incorrect.

Patch 1 removes the superfluous checking and updating.
Patch 2 fixes the incorrect validation.

For details, please see the corresponding patches.

----------------
v6 <-- v5:
- no functional changes, just revert visit_gen back to u64;

v5 <-- v4:
- no functional changes, just split the single v4 patch into two to
obey the "one patch does only one thing" rule;
- turn root_domain::visit_gen from u64 to u32;
both suggested by Juri.
- refine changelog;

v4 <-- v3:
- refine changelog;
- eliminate the ugly #ifdefs using Peter's suggested method;

v3 <-- v2:
- fix build error for !CONFIG_SMP, reported by kernel test robot;

v2 <-- v1:
- replace cpumask_weight(cpu_rq(cpu)->rd->span) with dl_bw_cpus(cpu),
suggested by Juri;

Peng Liu (2):
sched/deadline: Optimize sched_dl_global_validate()
sched/deadline: Fix sched_dl_global_validate()

kernel/sched/deadline.c | 44 +++++++++++++++++++++++++++--------
kernel/sched/sched.h | 51 ++++++++++++++++++++++-------------------
kernel/sched/topology.c | 1 +
3 files changed, 63 insertions(+), 33 deletions(-)

--
2.20.1


2020-10-08 15:59:31

by Peng Liu

Subject: [PATCH v6 1/2] sched/deadline: Optimize sched_dl_global_validate()

Under CONFIG_SMP, dl_bw is per root domain, not per CPU. When checking
or updating dl_bw, iterating over every CPU is therefore redundant; it
is enough to visit each root domain once.
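
As a standalone illustration of the idea (a minimal userspace sketch with
made-up names such as 'toy_rd' and 'cpu_rd', not kernel code): tag each
root domain with the generation of the current check and skip any CPU
whose root domain already carries that generation, so the work runs once
per root domain even though the loop is still per CPU.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for root_domain and for cpu_rq(cpu)->rd (hypothetical). */
struct toy_rd { uint64_t visit_gen; };

static struct toy_rd rd0, rd1;
/* Four CPUs, two root domains: CPUs 0-1 in rd0, CPUs 2-3 in rd1. */
static struct toy_rd *cpu_rd[4] = { &rd0, &rd0, &rd1, &rd1 };

static uint64_t generation;

/* Same shape as dl_bw_visited() below: mark the domain and report. */
static bool visited(int cpu, uint64_t gen)
{
	struct toy_rd *rd = cpu_rd[cpu];

	if (rd->visit_gen == gen)
		return true;

	rd->visit_gen = gen;
	return false;
}

int main(void)
{
	uint64_t gen = ++generation;

	for (int cpu = 0; cpu < 4; cpu++) {
		if (visited(cpu, gen))
			continue;
		/* Runs once for rd0 (at CPU 0) and once for rd1 (at CPU 2). */
		printf("checking root domain of CPU %d\n", cpu);
	}
	return 0;
}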

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Peng Liu <[email protected]>
---
kernel/sched/deadline.c | 39 ++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 9 +++++++++
kernel/sched/topology.c | 1 +
3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c19c1883d695..365e5fec8c28 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -97,6 +97,17 @@ static inline unsigned long dl_bw_capacity(int i)
return __dl_bw_capacity(i);
}
}
+
+static inline bool dl_bw_visited(int cpu, u64 gen)
+{
+ struct root_domain *rd = cpu_rq(cpu)->rd;
+
+ if (rd->visit_gen == gen)
+ return true;
+
+ rd->visit_gen = gen;
+ return false;
+}
#else
static inline struct dl_bw *dl_bw_of(int i)
{
@@ -112,6 +123,11 @@ static inline unsigned long dl_bw_capacity(int i)
{
return SCHED_CAPACITY_SCALE;
}
+
+static inline bool dl_bw_visited(int cpu, u64 gen)
+{
+ return false;
+}
#endif

static inline
@@ -2514,11 +2530,15 @@ const struct sched_class dl_sched_class
.update_curr = update_curr_dl,
};

+/* Used for dl_bw check and update. */
+static u64 dl_generation;
+
int sched_dl_global_validate(void)
{
u64 runtime = global_rt_runtime();
u64 period = global_rt_period();
u64 new_bw = to_ratio(period, runtime);
+ u64 gen = ++dl_generation;
struct dl_bw *dl_b;
int cpu, ret = 0;
unsigned long flags;
@@ -2527,13 +2547,13 @@ int sched_dl_global_validate(void)
* Here we want to check the bandwidth not being set to some
* value smaller than the currently allocated bandwidth in
* any of the root_domains.
- *
- * FIXME: Cycling on all the CPUs is overdoing, but simpler than
- * cycling on root_domains... Discussion on different/better
- * solutions is welcome!
*/
for_each_possible_cpu(cpu) {
rcu_read_lock_sched();
+
+ if (dl_bw_visited(cpu, gen))
+ goto next;
+
dl_b = dl_bw_of(cpu);

raw_spin_lock_irqsave(&dl_b->lock, flags);
@@ -2541,6 +2561,7 @@ int sched_dl_global_validate(void)
ret = -EBUSY;
raw_spin_unlock_irqrestore(&dl_b->lock, flags);

+next:
rcu_read_unlock_sched();

if (ret)
@@ -2566,6 +2587,7 @@ static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
void sched_dl_do_global(void)
{
u64 new_bw = -1;
+ u64 gen = ++dl_generation;
struct dl_bw *dl_b;
int cpu;
unsigned long flags;
@@ -2576,11 +2598,14 @@ void sched_dl_do_global(void)
if (global_rt_runtime() != RUNTIME_INF)
new_bw = to_ratio(global_rt_period(), global_rt_runtime());

- /*
- * FIXME: As above...
- */
for_each_possible_cpu(cpu) {
rcu_read_lock_sched();
+
+ if (dl_bw_visited(cpu, gen)) {
+ rcu_read_unlock_sched();
+ continue;
+ }
+
dl_b = dl_bw_of(cpu);

raw_spin_lock_irqsave(&dl_b->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28709f6b0975..ce527b981e61 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -801,6 +801,15 @@ struct root_domain {
struct dl_bw dl_bw;
struct cpudl cpudl;

+ /*
+ * Indicate whether a root_domain's dl_bw has been checked or
+ * updated. It's monotonously increasing.
+ *
+ * Also, some corner cases, like 'wrap around' is dangerous, but given
+ * that u64 is 'big enough'. So that shouldn't be a concern.
+ */
+ u64 visit_gen;
+
#ifdef HAVE_RT_PUSH_IPI
/*
* For IPI pull requests, loop across the rto_mask.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dd7770226086..90f3e5558fa2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -516,6 +516,7 @@ static int init_rootdomain(struct root_domain *rd)
init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
#endif

+ rd->visit_gen = 0;
init_dl_bw(&rd->dl_bw);
if (cpudl_init(&rd->cpudl) != 0)
goto free_rto_mask;
--
2.20.1

2020-10-08 17:35:10

by Peng Liu

Subject: [PATCH v6 2/2] sched/deadline: Fix sched_dl_global_validate()

When changing sched_rt_{runtime,period}_us, we validate that the new
settings can at least accommodate the currently allocated -dl
bandwidth:

sched_rt_handler()
  --> sched_dl_bandwidth_validate()
	{
		new_bw = global_rt_runtime()/global_rt_period();

		for_each_possible_cpu(cpu) {
			dl_b = dl_bw_of(cpu);
			if (new_bw < dl_b->total_bw)	<-------
				ret = -EBUSY;
		}
	}

But under CONFIG_SMP, dl_bw is per root domain, not per CPU, and
dl_b->total_bw is the allocated bandwidth of the whole root domain.
Instead, we should compare dl_b->total_bw against "cpus * new_bw",
where 'cpus' is the number of CPUs in the root domain.
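
To make the mistake concrete, here is a hedged, standalone sketch with
assumed numbers (a 4-CPU root domain carrying three 50% -dl reservations
under the default 95% global limit; to_ratio() is simplified, without the
RUNTIME_INF and period==0 handling of the kernel version):

#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT 20

/* runtime/period scaled to a 2^20 fixed-point fraction, as in the kernel. */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	uint64_t new_bw   = to_ratio(1000000, 950000);     /* per-CPU limit, ~0.95  */
	uint64_t total_bw = 3 * to_ratio(1000000, 500000); /* whole-domain sum, 1.5 */
	int cpus = 4;                                      /* CPUs in the root domain */

	/* Old check: a whole-domain sum against a per-CPU limit -> bogus -EBUSY. */
	printf("per-CPU check rejects: %s\n",
	       new_bw < total_bw ? "yes (wrongly)" : "no");
	/* Fixed check: scale the per-CPU limit by the domain's CPU count first. */
	printf("per-root-domain check rejects: %s\n",
	       new_bw * cpus < total_bw ? "yes" : "no");
	return 0;
}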

Also, the annotation below (in kernel/sched/sched.h) described an
implementation that only existed in SCHED_DEADLINE v2 [1]; the deadline
scheduler kept evolving until it was merged (v9), but the annotation was
never updated and is now meaningless and misleading. Update it.

* With respect to SMP, the bandwidth is given on a per-CPU basis,
* meaning that:
* - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
* - dl_total_bw array contains, in the i-eth element, the currently
* allocated bandwidth on the i-eth CPU.

[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
Signed-off-by: Peng Liu <[email protected]>
---
kernel/sched/deadline.c | 5 +++--
kernel/sched/sched.h | 42 ++++++++++++++++++-----------------------
2 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 365e5fec8c28..ac0c53672e83 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2540,7 +2540,7 @@ int sched_dl_global_validate(void)
u64 new_bw = to_ratio(period, runtime);
u64 gen = ++dl_generation;
struct dl_bw *dl_b;
- int cpu, ret = 0;
+ int cpu, cpus, ret = 0;
unsigned long flags;

/*
@@ -2555,9 +2555,10 @@ int sched_dl_global_validate(void)
goto next;

dl_b = dl_bw_of(cpu);
+ cpus = dl_bw_cpus(cpu);

raw_spin_lock_irqsave(&dl_b->lock, flags);
- if (new_bw < dl_b->total_bw)
+ if (new_bw * cpus < dl_b->total_bw)
ret = -EBUSY;
raw_spin_unlock_irqrestore(&dl_b->lock, flags);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce527b981e61..5076ea05b2e0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -257,30 +257,6 @@ struct rt_bandwidth {

void __dl_clear_params(struct task_struct *p);

-/*
- * To keep the bandwidth of -deadline tasks and groups under control
- * we need some place where:
- * - store the maximum -deadline bandwidth of the system (the group);
- * - cache the fraction of that bandwidth that is currently allocated.
- *
- * This is all done in the data structure below. It is similar to the
- * one used for RT-throttling (rt_bandwidth), with the main difference
- * that, since here we are only interested in admission control, we
- * do not decrease any runtime while the group "executes", neither we
- * need a timer to replenish it.
- *
- * With respect to SMP, the bandwidth is given on a per-CPU basis,
- * meaning that:
- * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
- * - dl_total_bw array contains, in the i-eth element, the currently
- * allocated bandwidth on the i-eth CPU.
- * Moreover, groups consume bandwidth on each CPU, while tasks only
- * consume bandwidth on the CPU they're running on.
- * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
- * that will be shown the next time the proc or cgroup controls will
- * be red. It on its turn can be changed by writing on its own
- * control.
- */
struct dl_bandwidth {
raw_spinlock_t dl_runtime_lock;
u64 dl_runtime;
@@ -292,6 +268,24 @@ static inline int dl_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}

+/*
+ * To keep the bandwidth of -deadline tasks under control
+ * we need some place where:
+ * - store the maximum -deadline bandwidth of each cpu;
+ * - cache the fraction of bandwidth that is currently allocated in
+ * each root domain;
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, bandwidth is given on a per root domain basis,
+ * meaning that:
+ * - bw (< 100%) is the deadline bandwidth of each CPU;
+ * - total_bw is the currently allocated bandwidth in each root domain;
+ */
struct dl_bw {
raw_spinlock_t lock;
u64 bw;
--
2.20.1

2020-10-14 16:45:31

by Juri Lelli

Subject: Re: [PATCH v6 0/2] sched/deadline: Fix and optimize sched_dl_global_validate()

Hi,

On 08/10/20 23:47, Peng Liu wrote:
> When changing the global rt bandwidth, we check that the new settings
> can accommodate the currently allocated dl bandwidth.
>
> Under SMP, dl_bw is maintained per root domain, but currently we check
> and update the new settings CPU by CPU rather than per root domain,
> which is both redundant and incorrect.
>
> Patch 1 removes the superfluous checking and updating.
> Patch 2 fixes the incorrect validation.
>
> For details, please see the corresponding patches.
>
> ----------------
> v6 <-- v5:
> - no functional changes, just revert visit_gen back to u64;
>
> v5 <-- v4:
> - no functional changes, just split the single v4 patch into two to
> obey the "one patch does only one thing" rule;
> - turn root_domain::visit_gen from u64 to u32;
> both suggested by Juri.
> - refine changelog;
>
> v4 <-- v3:
> - refine changelog;
> - eliminate the ugly #ifdefs using Peter's suggested method;
>
> v3 <-- v2:
> - fix build error for !CONFIG_SMP, reported by kernel test robot;
>
> v2 <-- v1:
> - replace cpumask_weight(cpu_rq(cpu)->rd->span) with dl_bw_cpus(cpu),
> suggested by Juri;
>
> Peng Liu (2):
> sched/deadline: Optimize sched_dl_global_validate()
> sched/deadline: Fix sched_dl_global_validate()
>
> kernel/sched/deadline.c | 44 +++++++++++++++++++++++++++--------
> kernel/sched/sched.h | 51 ++++++++++++++++++++++-------------------
> kernel/sched/topology.c | 1 +
> 3 files changed, 63 insertions(+), 33 deletions(-)

These now look good to me. Thanks a lot!

Acked-by: Juri Lelli <[email protected]>

Best,
Juri

Subject: Re: [PATCH v6 0/2] sched/deadline: Fix and optimize sched_dl_global_validate()

Hi,

On 10/14/20 3:32 PM, Juri Lelli wrote:
> Hi,
>
> On 08/10/20 23:47, Peng Liu wrote:
>> When changing the global rt bandwidth, we check that the new settings
>> can accommodate the currently allocated dl bandwidth.
>>
>> Under SMP, dl_bw is maintained per root domain, but currently we check
>> and update the new settings CPU by CPU rather than per root domain,
>> which is both redundant and incorrect.
>>
>> Patch 1 removes the superfluous checking and updating.
>> Patch 2 fixes the incorrect validation.
>>
>> For details, please see the corresponding patches.
>>
>> ----------------
>> v6 <-- v5:
>> - no functional changes, just revert visit_gen back to u64;
>>
>> v5 <-- v4:
>> - no functional changes, just split the single v4 patch into two to
>> obey the "one patch does only one thing" rule;
>> - turn root_domain::visit_gen from u64 to u32;
>> both suggested by Juri.
>> - refine changelog;
>>
>> v4 <-- v3:
>> - refine changelog;
>> - eliminate the ugly #ifdefs using Peter's suggested method;
>>
>> v3 <-- v2:
>> - fix build error for !CONFIG_SMP, reported by kernel test robot;
>>
>> v2 <-- v1:
>> - replace cpumask_weight(cpu_rq(cpu)->rd->span) with dl_bw_cpus(cpu),
>> suggested by Juri;
>>
>> Peng Liu (2):
>> sched/deadline: Optimize sched_dl_global_validate()
>> sched/deadline: Fix sched_dl_global_validate()
>>
>> kernel/sched/deadline.c | 44 +++++++++++++++++++++++++++--------
>> kernel/sched/sched.h | 51 ++++++++++++++++++++++-------------------
>> kernel/sched/topology.c | 1 +
>> 3 files changed, 63 insertions(+), 33 deletions(-)
>
> These now look good to me. Thanks a lot!
>
> Acked-by: Juri Lelli <[email protected]>


I just found a minor thing in a comment:

"It's *monotonously* increasing."

I tried to find the usage of "monotonously increasing" for monotonic functions,
and I did not find it. The (at least most commonly used) term is "monotonically,"
so the sentence would read "It is a monotonically increasing value." But it is
just a minor thing. Anyways:

Reviewed-by: Daniel Bristot de Oliveira <[email protected]>

Thanks!
-- Daniel

Subject: [tip: sched/core] sched/deadline: Fix sched_dl_global_validate()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: a57415f5d1e43c3a5c5d412cd85e2792d7ed9b11
Gitweb: https://git.kernel.org/tip/a57415f5d1e43c3a5c5d412cd85e2792d7ed9b11
Author: Peng Liu <[email protected]>
AuthorDate: Thu, 08 Oct 2020 23:49:42 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 29 Oct 2020 11:00:29 +01:00

sched/deadline: Fix sched_dl_global_validate()

When changing sched_rt_{runtime,period}_us, we validate that the new
settings can at least accommodate the currently allocated -dl
bandwidth:

sched_rt_handler()
  --> sched_dl_bandwidth_validate()
	{
		new_bw = global_rt_runtime()/global_rt_period();

		for_each_possible_cpu(cpu) {
			dl_b = dl_bw_of(cpu);
			if (new_bw < dl_b->total_bw)	<-------
				ret = -EBUSY;
		}
	}

But under CONFIG_SMP, dl_bw is per root domain, not per CPU, and
dl_b->total_bw is the allocated bandwidth of the whole root domain.
Instead, we should compare dl_b->total_bw against "cpus * new_bw",
where 'cpus' is the number of CPUs in the root domain.

Also, the annotation below (in kernel/sched/sched.h) described an
implementation that only existed in SCHED_DEADLINE v2 [1]; the deadline
scheduler kept evolving until it was merged (v9), but the annotation was
never updated and is now meaningless and misleading. Update it.

* With respect to SMP, the bandwidth is given on a per-CPU basis,
* meaning that:
* - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
* - dl_total_bw array contains, in the i-eth element, the currently
* allocated bandwidth on the i-eth CPU.

[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
Signed-off-by: Peng Liu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Daniel Bristot de Oliveira <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
---
kernel/sched/deadline.c | 5 +++--
kernel/sched/sched.h | 42 +++++++++++++++++-----------------------
2 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 98d96d4..0f75e95 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2561,7 +2561,7 @@ int sched_dl_global_validate(void)
u64 new_bw = to_ratio(period, runtime);
u64 gen = ++dl_generation;
struct dl_bw *dl_b;
- int cpu, ret = 0;
+ int cpu, cpus, ret = 0;
unsigned long flags;

/*
@@ -2576,9 +2576,10 @@ int sched_dl_global_validate(void)
goto next;

dl_b = dl_bw_of(cpu);
+ cpus = dl_bw_cpus(cpu);

raw_spin_lock_irqsave(&dl_b->lock, flags);
- if (new_bw < dl_b->total_bw)
+ if (new_bw * cpus < dl_b->total_bw)
ret = -EBUSY;
raw_spin_unlock_irqrestore(&dl_b->lock, flags);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 49a2dae..965b296 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -257,30 +257,6 @@ struct rt_bandwidth {

void __dl_clear_params(struct task_struct *p);

-/*
- * To keep the bandwidth of -deadline tasks and groups under control
- * we need some place where:
- * - store the maximum -deadline bandwidth of the system (the group);
- * - cache the fraction of that bandwidth that is currently allocated.
- *
- * This is all done in the data structure below. It is similar to the
- * one used for RT-throttling (rt_bandwidth), with the main difference
- * that, since here we are only interested in admission control, we
- * do not decrease any runtime while the group "executes", neither we
- * need a timer to replenish it.
- *
- * With respect to SMP, the bandwidth is given on a per-CPU basis,
- * meaning that:
- * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
- * - dl_total_bw array contains, in the i-eth element, the currently
- * allocated bandwidth on the i-eth CPU.
- * Moreover, groups consume bandwidth on each CPU, while tasks only
- * consume bandwidth on the CPU they're running on.
- * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
- * that will be shown the next time the proc or cgroup controls will
- * be red. It on its turn can be changed by writing on its own
- * control.
- */
struct dl_bandwidth {
raw_spinlock_t dl_runtime_lock;
u64 dl_runtime;
@@ -292,6 +268,24 @@ static inline int dl_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}

+/*
+ * To keep the bandwidth of -deadline tasks under control
+ * we need some place where:
+ * - store the maximum -deadline bandwidth of each cpu;
+ * - cache the fraction of bandwidth that is currently allocated in
+ * each root domain;
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, bandwidth is given on a per root domain basis,
+ * meaning that:
+ * - bw (< 100%) is the deadline bandwidth of each CPU;
+ * - total_bw is the currently allocated bandwidth in each root domain;
+ */
struct dl_bw {
raw_spinlock_t lock;
u64 bw;

Subject: [tip: sched/core] sched/deadline: Optimize sched_dl_global_validate()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 26762423a2664692de2bcccc9de684a5ac105e23
Gitweb: https://git.kernel.org/tip/26762423a2664692de2bcccc9de684a5ac105e23
Author: Peng Liu <[email protected]>
AuthorDate: Thu, 08 Oct 2020 23:48:46 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 29 Oct 2020 11:00:28 +01:00

sched/deadline: Optimize sched_dl_global_validate()

Under CONFIG_SMP, dl_bw is per root domain, not per CPU. When checking
or updating dl_bw, iterating over every CPU is therefore redundant; it
is enough to visit each root domain once.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Peng Liu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Daniel Bristot de Oliveira <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lkml.kernel.org/r/78d21ee792cc48ff79e8cd62a5f26208463684d6.1602171061.git.iwtbavbm@gmail.com
---
kernel/sched/deadline.c | 39 ++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 9 +++++++++
kernel/sched/topology.c | 1 +
3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f232305..98d96d4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -97,6 +97,17 @@ static inline unsigned long dl_bw_capacity(int i)
return __dl_bw_capacity(i);
}
}
+
+static inline bool dl_bw_visited(int cpu, u64 gen)
+{
+ struct root_domain *rd = cpu_rq(cpu)->rd;
+
+ if (rd->visit_gen == gen)
+ return true;
+
+ rd->visit_gen = gen;
+ return false;
+}
#else
static inline struct dl_bw *dl_bw_of(int i)
{
@@ -112,6 +123,11 @@ static inline unsigned long dl_bw_capacity(int i)
{
return SCHED_CAPACITY_SCALE;
}
+
+static inline bool dl_bw_visited(int cpu, u64 gen)
+{
+ return false;
+}
#endif

static inline
@@ -2535,11 +2551,15 @@ const struct sched_class dl_sched_class
.update_curr = update_curr_dl,
};

+/* Used for dl_bw check and update, used under sched_rt_handler()::mutex */
+static u64 dl_generation;
+
int sched_dl_global_validate(void)
{
u64 runtime = global_rt_runtime();
u64 period = global_rt_period();
u64 new_bw = to_ratio(period, runtime);
+ u64 gen = ++dl_generation;
struct dl_bw *dl_b;
int cpu, ret = 0;
unsigned long flags;
@@ -2548,13 +2568,13 @@ int sched_dl_global_validate(void)
* Here we want to check the bandwidth not being set to some
* value smaller than the currently allocated bandwidth in
* any of the root_domains.
- *
- * FIXME: Cycling on all the CPUs is overdoing, but simpler than
- * cycling on root_domains... Discussion on different/better
- * solutions is welcome!
*/
for_each_possible_cpu(cpu) {
rcu_read_lock_sched();
+
+ if (dl_bw_visited(cpu, gen))
+ goto next;
+
dl_b = dl_bw_of(cpu);

raw_spin_lock_irqsave(&dl_b->lock, flags);
@@ -2562,6 +2582,7 @@ int sched_dl_global_validate(void)
ret = -EBUSY;
raw_spin_unlock_irqrestore(&dl_b->lock, flags);

+next:
rcu_read_unlock_sched();

if (ret)
@@ -2587,6 +2608,7 @@ static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
void sched_dl_do_global(void)
{
u64 new_bw = -1;
+ u64 gen = ++dl_generation;
struct dl_bw *dl_b;
int cpu;
unsigned long flags;
@@ -2597,11 +2619,14 @@ void sched_dl_do_global(void)
if (global_rt_runtime() != RUNTIME_INF)
new_bw = to_ratio(global_rt_period(), global_rt_runtime());

- /*
- * FIXME: As above...
- */
for_each_possible_cpu(cpu) {
rcu_read_lock_sched();
+
+ if (dl_bw_visited(cpu, gen)) {
+ rcu_read_unlock_sched();
+ continue;
+ }
+
dl_b = dl_bw_of(cpu);

raw_spin_lock_irqsave(&dl_b->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df80bfc..49a2dae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -801,6 +801,15 @@ struct root_domain {
struct dl_bw dl_bw;
struct cpudl cpudl;

+ /*
+ * Indicate whether a root_domain's dl_bw has been checked or
+ * updated. It's monotonously increasing value.
+ *
+ * Also, some corner cases, like 'wrap around' is dangerous, but given
+ * that u64 is 'big enough'. So that shouldn't be a concern.
+ */
+ u64 visit_gen;
+
#ifdef HAVE_RT_PUSH_IPI
/*
* For IPI pull requests, loop across the rto_mask.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dd77702..90f3e55 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -516,6 +516,7 @@ static int init_rootdomain(struct root_domain *rd)
init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
#endif

+ rd->visit_gen = 0;
init_dl_bw(&rd->dl_bw);
if (cpudl_init(&rd->cpudl) != 0)
goto free_rto_mask;