v2 -> v3:
- Wrapped check for value change inside set_rd_overload_status
  as suggested by Qais.
- Added reviewed-by tags.
v1 -> v2:
- dropped Fixes tag.
- Added one of the perf probes in the changelog for reference.
- Added reviewed-by tags.
tl;dr
When running workloads on large systems, it was observed that accesses
to rd->overload were taking time. It is better to check the value before
updating it, since the value changes less often. Patch 1/2 does that.
With the patch, updates happen only when necessary. CPU bus traffic
reduced a bit. No significant gains in workload performance.
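In essence, the write becomes conditional on the value actually changing.
A minimal sketch (the exact change is in patch 1/2 below; new_status
stands for sg_status & SG_OVERLOAD at the root domain):

	/* Reading an unchanged value avoids dirtying the shared cache line. */
	if (READ_ONCE(rd->overload) != new_status)
		WRITE_ONCE(rd->overload, new_status);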
Qais suggested that it would be better to use helper functions to
access rd->overload instead. Patch 2/2 does that.
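The helpers wrap the READ_ONCE()/WRITE_ONCE() accesses in one place, as
implemented in patch 2/2 below:

	static inline int is_rd_overloaded(struct root_domain *rd)
	{
		return READ_ONCE(rd->overload);
	}

	static inline void set_rd_overload_status(struct root_domain *rd, int status)
	{
		if (is_rd_overloaded(rd) != status)
			WRITE_ONCE(rd->overload, status);
	}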
*These patches depend on the series below being applied first*
https://lore.kernel.org/all/[email protected]/
-----------------------------------------------------------------------
Detailed perf annotation and probe stats
-----------------------------------------------------------------------
=======
6.8-rc5
=======
320 CPU system, SMT8
NUMA node(s): 4
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159
NUMA node6 CPU(s): 160-239
NUMA node7 CPU(s): 240-319
Perf annotate while running "schbench -t 320 -i 30 -r 30" shows 18.05%
of the samples on the rd->overload access:
│ if (!READ_ONCE(this_rq->rd->overload) ||
18.05 │ ld r9,2752(r31)
│ sd = rcu_dereference_check_sched_domain(this_rq->sd);
6.97 │ ld r30,2760(r31)
Added some dummy code so the probes can be placed at the required
locations.
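Probes were then attached to those lines, presumably with commands along
these lines (illustrative; the relative line numbers come from the
'perf probe -L' listings below):

	perf probe -a 'add_nr_running_L12=add_nr_running:12'
	perf probe -a 'add_nr_running_L13=add_nr_running:13'
	perf stat -a -e 'probe:*' -- sleep 30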
perf probe -L update_sd_lb_stats
46 if (env->sd->flags & SD_NUMA)
47 env->fbq_type = fbq_classify_group(&sds->busiest_stat);
49 if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
51 WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
perf probe -L sched_balance_newidle
rcu_read_lock();
38 sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
perf probe -L add_nr_running
#ifdef CONFIG_SMP
11 if (prev_nr < 2 && rq->nr_running >= 2) {
12 if (!READ_ONCE(rq->rd->overload)) {
13 a = a +10;
WRITE_ONCE(rq->rd->overload, 1);
}
Probe hits when running different workloads:
idle
320 probe:add_nr_running_L12
260 probe:add_nr_running_L13
1K probe:sched_balance_newidle_L38
596 probe:update_sd_lb_stats_L51
/hackbench 10 process 100000 loops
130K probe:add_nr_running_L12
93 probe:add_nr_running_L13
1M probe:sched_balance_newidle_L38
109K probe:update_sd_lb_stats_L51
/schbench -t 320 -i 30 -r 30
3K probe:add_nr_running_L12
436 probe:add_nr_running_L13
125K probe:sched_balance_newidle_L38
33K probe:update_sd_lb_stats_L51
Modified stress-ng --wait
3K probe:add_nr_running_L12
1K probe:add_nr_running_L13
6M probe:sched_balance_newidle_L38
11K probe:update_sd_lb_stats_L51
stress-ng --cpu=400 -l 20
833 probe:add_nr_running_L12
280 probe:add_nr_running_L13
2K probe:sched_balance_newidle_L38
1K probe:update_sd_lb_stats_L51
stress-ng --cpu=400 -l 100
730 probe:add_nr_running_L12
0 probe:add_nr_running_L13
0 probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L51
stress-ng --cpu=800 -l 50
2K probe:add_nr_running_L12
0 probe:add_nr_running_L13
2K probe:sched_balance_newidle_L38
946 probe:update_sd_lb_stats_L51
stress-ng --cpu=800 -l 100
361 probe:add_nr_running_L12
0 probe:add_nr_running_L13
0 probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L51
The L13 hit counts are quite low compared to L12 (e.g. 93 vs. 130K for
hackbench). This indicates that the value likely does not change often.
------------------------------------------------------------------------------
==========
With Patch:
==========
Perf annotate while running "schbench -t 320 -i 30 -r 30"; the
rd->overload access no longer shows up hot:
│ if (!READ_ONCE(this_rq->rd->overload) ||
│ ld r9,2752(r31)
│ sd = rcu_dereference_check_sched_domain(this_rq->sd);
│ ld r30,2760(r31)
│ if (!READ_ONCE(this_rq->rd->overload) ||
│ lwz r9,536(r9)
│ cmpwi r9,0
│ ↓ beq 2b4
│100: mflr r0
│ cmpdi r30,0
0.38 │ std r0,240(r1)
1.56 │ ↓ beq 120
perf probe -L update_sd_lb_stats
49 if (!env->sd->parent) {
50 int a;
/* update overload indicator if we are at root domain */
if ( READ_ONCE(env->dst_rq->rd->overload) != sg_status & SG_OVERLOAD) {
53 a= a+10;
WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
}
perf probe -L sched_balance_newidle
38 sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
perf probe -L add_nr_running
#ifdef CONFIG_SMP
11 if (prev_nr < 2 && rq->nr_running >= 2) {
12 if (!READ_ONCE(rq->rd->overload)) {
13 a = a +10;
WRITE_ONCE(rq->rd->overload, 1);
}
Perf probe hits when running different workloads. How often the value
actually changes in update_sd_lb_stats is indicated as L53/L50*100
(e.g. for idle: 262/304*100 ~ 86%).
idle
818 probe:sched_balance_newidle_L38
262 probe:update_sd_lb_stats_L53 <-- 86%
321 probe:add_nr_running_L12
261 probe:add_nr_running_L13
304 probe:update_sd_lb_stats_L50
/hackbench 10 process 100000 loops
1M probe:sched_balance_newidle_L38
139 probe:update_sd_lb_stats_L53 <-- 0.25%
129K probe:add_nr_running_L12
74 probe:add_nr_running_L13
54K probe:update_sd_lb_stats_L50
/schbench -t 320 -i 30 -r 30
101K probe:sched_balance_newidle_L38
2K probe:update_sd_lb_stats_L53 <-- 9.09%
5K probe:add_nr_running_L12
1K probe:add_nr_running_L13
22K probe:update_sd_lb_stats_L50
Modified stress-ng --wait
6M probe:sched_balance_newidle_L38
2K probe:update_sd_lb_stats_L53 <-- 25%
4K probe:add_nr_running_L12
2K probe:add_nr_running_L13
8K probe:update_sd_lb_stats_L50
stress-ng --cpu=400 -l 20
2K probe:sched_balance_newidle_L38
286 probe:update_sd_lb_stats_L53 <-- 36.11%
746 probe:add_nr_running_L12
256 probe:add_nr_running_L13
792 probe:update_sd_lb_stats_L50
stress-ng --cpu=400 -l 100
2 probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L53 <-- NA
923 probe:add_nr_running_L12
0 probe:add_nr_running_L13
0 probe:update_sd_lb_stats_L50
stress-ng --cpu=800 -l 50
2K probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L53 <-- 0%
2K probe:add_nr_running_L12
0 probe:add_nr_running_L13
429 probe:update_sd_lb_stats_L50
stress-ng --cpu=800 -l 100
0 probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L53 <-- NA
424 probe:add_nr_running_L12
0 probe:add_nr_running_L13
1 probe:update_sd_lb_stats_L50
This indicates that the value likely changes less often, so adding a
read before the update would help in generic workloads: reading an
unchanged value keeps the cache line in a shared state, whereas an
unconditional write invalidates it on all other CPUs.
-------------------------------------------------------------------------------
Shrikanth Hegde (2):
sched/fair: Check rd->overload value before update
sched/fair: Use helper functions to access rd->overload
kernel/sched/fair.c | 4 ++--
kernel/sched/sched.h | 14 ++++++++++++--
2 files changed, 14 insertions(+), 4 deletions(-)
--
2.39.3
rd->overload is accessed in multiple places. Instead, it could use
helper get/set functions. This would make the code more readable and
easier to maintain.
No change in functionality intended.
Suggested-by: Qais Yousef <[email protected]>
Signed-off-by: Shrikanth Hegde <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 5 ++---
kernel/sched/sched.h | 14 ++++++++++++--
2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eeebadd7d9ae..f00cb66cc479 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10621,8 +10621,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- if (READ_ONCE(env->dst_rq->rd->overload) != (sg_status & SG_OVERLOAD))
- WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
+ set_rd_overload_status(env->dst_rq->rd, sg_status & SG_OVERLOAD);
/* Update over-utilization (tipping point, U >= 0) indicator */
set_rd_overutilized_status(env->dst_rq->rd,
@@ -12344,7 +12343,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
- if (!READ_ONCE(this_rq->rd->overload) ||
+ if (!is_rd_overloaded(this_rq->rd) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
if (sd)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 41024c1c49b4..bed5ad8a1827 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -918,6 +918,17 @@ extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
extern void sched_get_rd(struct root_domain *rd);
extern void sched_put_rd(struct root_domain *rd);
+static inline int is_rd_overloaded(struct root_domain *rd)
+{
+ return READ_ONCE(rd->overload);
+}
+
+static inline void set_rd_overload_status(struct root_domain *rd, int status)
+{
+ if (is_rd_overloaded(rd) != status)
+ WRITE_ONCE(rd->overload, status);
+}
+
#ifdef HAVE_RT_PUSH_IPI
extern void rto_push_irq_work_func(struct irq_work *work);
#endif
@@ -2518,8 +2529,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
#ifdef CONFIG_SMP
if (prev_nr < 2 && rq->nr_running >= 2) {
- if (!READ_ONCE(rq->rd->overload))
- WRITE_ONCE(rq->rd->overload, 1);
+ set_rd_overload_status(rq->rd, SG_OVERLOAD);
}
#endif
--
2.39.3
* Shrikanth Hegde <[email protected]> wrote:
> v2 -> v3:
> - Wrapped check for value change inside set_rd_overload_status
> as suggested by Qais.
> - Added reviewed-by tags.
>
> v1 -> v2:
> - dropped Fixes tag.
> - Added one of the perf probes in the changelog for reference.
> - Added reviewed-by tags.
>
> tl;dr
> When running workloads on large systems, it was observed that accesses
> to rd->overload were taking time. It is better to check the value before
> updating it, since the value changes less often. Patch 1/2 does that.
> With the patch, updates happen only when necessary. CPU bus traffic
> reduced a bit. No significant gains in workload performance.
Could you please post this against the latest scheduler tree, ie. tip:sched/core?
Thanks,
Ingo
On 3/25/24 4:06 PM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <[email protected]> wrote:
>
>> v2 -> v3:
>> - Wrapped check for value change inside set_rd_overload_status
>> as suggested by Qais.
>> - Added reviewed-by tags.
>>
>> v1 -> v2:
>> - dropped Fixes tag.
>> - Added one of the perf probes in the changelog for reference.
>> - Added reviewed-by tags.
>>
>> tl;dr
>> When running workloads on large systems, it was observed that accesses
>> to rd->overload were taking time. It is better to check the value before
>> updating it, since the value changes less often. Patch 1/2 does that.
>> With the patch, updates happen only when necessary. CPU bus traffic
>> reduced a bit. No significant gains in workload performance.
>
> Could you please post this against the latest scheduler tree, ie. tip:sched/core?
>
> Thanks,
>
> Ingo
Hi Ingo. I had mentioned this in the cover letter. Sorry if I didn't
mention it in the right place.
*These patches depend on the series below being applied first*
https://lore.kernel.org/all/[email protected]/
After that, the above patches would apply. I kept the dependency since
these touch a similar code area. Could you please consider the above
patch as well? Is there any other change that needs to be done?
On tip:
1043c003415b (HEAD -> master) sched/fair: Use helper functions to access rd->overload
3049bc16643d sched/fair: Check rd->overload value before update
436d634f2cad sched/fair: Combine EAS check with overutilized access
379aa2cd62e0 sched/fair: Use helper function to access rd->overutilized
19bfeb2d565e sched/fair: Add EAS checks before updating overutilized
71706005072c (origin/master, origin/HEAD) Merge branch into tip/master: 'x86/shstk'
fa63c2c111ea (HEAD -> sched/core) sched/fair: Use helper functions to access rd->overload
9bef486d044b sched/fair: Check rd->overload value before update
21f90cae75c8 sched/fair: Combine EAS check with overutilized access
e835f1fa3654 sched/fair: Use helper function to access rd->overutilized
ddf390f9449d sched/fair: Add EAS checks before updating overutilized
58eeb2d79b54 (origin/sched/core) sched/fair: Don't double balance_interval for migrate_misfit
* Shrikanth Hegde <[email protected]> wrote:
>
>
> On 3/25/24 4:06 PM, Ingo Molnar wrote:
> >
> > * Shrikanth Hegde <[email protected]> wrote:
> >
> >> v2 -> v3:
> >> - Wrapped check for value change inside set_rd_overload_status
> >> as suggested by Qais.
> >> - Added reviewed-by tags.
> >>
> >> v1 -> v2:
> >> - dropped Fixes tag.
> >> - Added one of the perf probes in the changelog for reference.
> >> - Added reviewed-by tags.
> >>
> >> tl;dr
> >> When running workloads on large systems, it was observed that accesses
> >> to rd->overload were taking time. It is better to check the value before
> >> updating it, since the value changes less often. Patch 1/2 does that.
> >> With the patch, updates happen only when necessary. CPU bus traffic
> >> reduced a bit. No significant gains in workload performance.
> >
> > Could you please post this against the latest scheduler tree, ie. tip:sched/core?
> >
> > Thanks,
> >
> > Ingo
>
> Hi Ingo. I had mentioned this in the cover letter. Sorry if I didn't
> mention it in the right place.
>
> *These patches depend on the series below being applied first*
> https://lore.kernel.org/all/[email protected]/
>
> After that, the above patches would apply.
OK, I missed that, but I don't really like patch #3 of that other series,
so we'll need to address that first.
Thanks,
Ingo
On 3/26/24 1:30 PM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <[email protected]> wrote:
>
>>
>>
>> On 3/25/24 4:06 PM, Ingo Molnar wrote:
>>>
>>> * Shrikanth Hegde <[email protected]> wrote:
>>>
>>>> v2 -> v3:
>>>> - Wrapped check for value change inside set_rd_overload_status
>>>> as suggested by Qais.
>>>> - Added reviewed-by tags.
>>>>
>>>> v1 -> v2:
>>>> - dropped Fixes tag.
>>>> - Added one of the perf probes in the changelog for reference.
>>>> - Added reviewed-by tags.
>>>>
>>>> tl;dr
>>>> When running workloads on large systems, it was observed that accesses
>>>> to rd->overload were taking time. It is better to check the value before
>>>> updating it, since the value changes less often. Patch 1/2 does that.
>>>> With the patch, updates happen only when necessary. CPU bus traffic
>>>> reduced a bit. No significant gains in workload performance.
>>>
>>> Could you please post this against the latest scheduler tree, ie. tip:sched/core?
>>>
>>> Thanks,
>>>
>>> Ingo
>>
>> Hi Ingo. I had mentioned this in the cover letter. Sorry if I didn't
>> mention it in the right place.
>>
>> *These patches depend on the series below being applied first*
>> https://lore.kernel.org/all/[email protected]/
>>
>> After that, the above patches would apply.
Hi Ingo.
These two patches apply cleanly now to sched/core.
7a9dd7ef946c (HEAD -> sched/core) sched/fair: Use helper functions to access rd->overload
4f24aa918558 sched/fair: Check rd->overload value before update
c829d6818b60 (origin/sched/core) sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
d0f5d3cefc25 sched/fair: Introduce is_rd_overutilized() helper function to access root_domain::overutilized
>
> OK, I missed that, but I don't really like patch #3 of that other series,
> so we'll need to address that first.
>
That will be a standalone patch now and can go later based on the discussions.
> Thanks,
>
> Ingo
* Shrikanth Hegde <[email protected]> wrote:
>
>
> On 3/26/24 1:30 PM, Ingo Molnar wrote:
> >
> > * Shrikanth Hegde <[email protected]> wrote:
> >
> >>
> >>
> >> On 3/25/24 4:06 PM, Ingo Molnar wrote:
> >>>
> >>> * Shrikanth Hegde <[email protected]> wrote:
> >>>
> >>>> v2 -> v3:
> >>>> - Wrapped check for value change inside set_rd_overload_status
> >>>> as suggested by Qais.
> >>>> - Added reviewed-by tags.
> >>>>
> >>>> v1 -> v2:
> >>>> - dropped Fixes tag.
> >>>> - Added one of the perf probes in the changelog for reference.
> >>>> - Added reviewed-by tags.
> >>>>
> >>>> tl;dr
> >>>> When running workloads on large systems, it was observed that accesses
> >>>> to rd->overload were taking time. It is better to check the value before
> >>>> updating it, since the value changes less often. Patch 1/2 does that.
> >>>> With the patch, updates happen only when necessary. CPU bus traffic
> >>>> reduced a bit. No significant gains in workload performance.
> >>>
> >>> Could you please post this against the latest scheduler tree, ie. tip:sched/core?
> >>>
> >>> Thanks,
> >>>
> >>> Ingo
> >>
> >> Hi Ingo. I had mentioned this in the cover letter. Sorry if I didn't
> >> mention it in the right place.
> >>
> >> *These patches depend on the series below being applied first*
> >> https://lore.kernel.org/all/[email protected]/
> >>
> >> After that, the above patches would apply.
>
> Hi Ingo.
>
> These two patches apply cleanly now to sched/core.
>
> 7a9dd7ef946c (HEAD -> sched/core) sched/fair: Use helper functions to access rd->overload
> 4f24aa918558 sched/fair: Check rd->overload value before update
> c829d6818b60 (origin/sched/core) sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
> d0f5d3cefc25 sched/fair: Introduce is_rd_overutilized() helper function to access root_domain::overutilized
I've applied them, but note that there were quite a few avoidable typos
and grammar mistakes in the changelogs. Please always review what
you've submitted versus what I have applied and try to learn from that:
I almost invariably have to edit some detail to make the changelog more
presentable... Could you please take more care with future patches?
I also renamed the accessor functions in the second patch to:
get_rd_overload()
set_rd_overload()
Plus I've applied a patch to rename ::overload to ::overloaded. It is
silly to use an ambiguous noun instead of a clear adjective when naming
such a flag ...
Thanks,
Ingo
The following commit has been merged into the sched/core branch of tip:
Commit-ID: caac6291728ed5493d8a53f4b086c270849ce0c4
Gitweb: https://git.kernel.org/tip/caac6291728ed5493d8a53f4b086c270849ce0c4
Author: Shrikanth Hegde <[email protected]>
AuthorDate: Mon, 25 Mar 2024 11:15:05 +05:30
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 28 Mar 2024 11:30:13 +01:00
sched/fair: Use helper functions to access root_domain::overload
Introduce two helper functions to access & set the root_domain::overload flag:
get_rd_overload()
set_rd_overload()
To make sure the code always follows the READ_ONCE()/WRITE_ONCE() access methods.
No change in functionality intended.
[ mingo: Renamed the accessors to get_/set_rd_overload(), tidied up the changelog. ]
Suggested-by: Qais Yousef <[email protected]>
Signed-off-by: Shrikanth Hegde <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 5 ++---
kernel/sched/sched.h | 14 ++++++++++++--
2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 600fdde..0cc0582 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10657,8 +10657,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- if (READ_ONCE(env->dst_rq->rd->overload) != (sg_status & SG_OVERLOAD))
- WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
+ set_rd_overload(env->dst_rq->rd, sg_status & SG_OVERLOAD);
/* Update over-utilization (tipping point, U >= 0) indicator */
set_rd_overutilized_status(env->dst_rq->rd,
@@ -12391,7 +12390,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
- if (!READ_ONCE(this_rq->rd->overload) ||
+ if (!get_rd_overload(this_rq->rd) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
if (sd)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4f9e952..e86ee26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -930,6 +930,17 @@ extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
extern void sched_get_rd(struct root_domain *rd);
extern void sched_put_rd(struct root_domain *rd);
+static inline int get_rd_overload(struct root_domain *rd)
+{
+ return READ_ONCE(rd->overload);
+}
+
+static inline void set_rd_overload(struct root_domain *rd, int status)
+{
+ if (get_rd_overload(rd) != status)
+ WRITE_ONCE(rd->overload, status);
+}
+
#ifdef HAVE_RT_PUSH_IPI
extern void rto_push_irq_work_func(struct irq_work *work);
#endif
@@ -2530,8 +2541,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
#ifdef CONFIG_SMP
if (prev_nr < 2 && rq->nr_running >= 2) {
- if (!READ_ONCE(rq->rd->overload))
- WRITE_ONCE(rq->rd->overload, 1);
+ set_rd_overload(rq->rd, SG_OVERLOAD);
}
#endif
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 7bda10ba7f453729f210264dd07d38989fb858d9
Gitweb: https://git.kernel.org/tip/7bda10ba7f453729f210264dd07d38989fb858d9
Author: Ingo Molnar <[email protected]>
AuthorDate: Thu, 28 Mar 2024 11:44:16 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 28 Mar 2024 11:44:44 +01:00
sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
Follow the rename of the root_domain::overloaded flag.
Note that this also matches the SG_OVERUTILIZED flag better.
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 6 +++---
kernel/sched/sched.h | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf10665..839a97a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9961,7 +9961,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_nr_running += nr_running;
if (nr_running > 1)
- *sg_status |= SG_OVERLOAD;
+ *sg_status |= SG_OVERLOADED;
if (cpu_overutilized(i))
*sg_status |= SG_OVERUTILIZED;
@@ -9986,7 +9986,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Check for a misfit task on the cpu */
if (sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
- *sg_status |= SG_OVERLOAD;
+ *sg_status |= SG_OVERLOADED;
}
} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
@@ -10657,7 +10657,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOAD);
+ set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
/* Update over-utilization (tipping point, U >= 0) indicator */
set_rd_overutilized_status(env->dst_rq->rd,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7e7ae1..07c6669 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -851,7 +851,7 @@ struct perf_domain {
};
/* Scheduling group status flags */
-#define SG_OVERLOAD 0x1 /* More than one runnable task on a CPU. */
+#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
/*
@@ -2541,7 +2541,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
#ifdef CONFIG_SMP
if (prev_nr < 2 && rq->nr_running >= 2) {
- set_rd_overloaded(rq->rd, SG_OVERLOAD);
+ set_rd_overloaded(rq->rd, SG_OVERLOADED);
}
#endif
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 76cc4f91488af0a808bec97794bfe434dece7d67
Gitweb: https://git.kernel.org/tip/76cc4f91488af0a808bec97794bfe434dece7d67
Author: Ingo Molnar <[email protected]>
AuthorDate: Thu, 28 Mar 2024 11:41:31 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 28 Mar 2024 11:41:31 +01:00
sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
Follow the rename of the root_domain::overloaded flag.
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 4 ++--
kernel/sched/sched.h | 8 ++++----
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0cc0582..bf10665 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10657,7 +10657,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- set_rd_overload(env->dst_rq->rd, sg_status & SG_OVERLOAD);
+ set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOAD);
/* Update over-utilization (tipping point, U >= 0) indicator */
set_rd_overutilized_status(env->dst_rq->rd,
@@ -12390,7 +12390,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
- if (!get_rd_overload(this_rq->rd) ||
+ if (!get_rd_overloaded(this_rq->rd) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
if (sd)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cddc479..c7e7ae1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -930,14 +930,14 @@ extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
extern void sched_get_rd(struct root_domain *rd);
extern void sched_put_rd(struct root_domain *rd);
-static inline int get_rd_overload(struct root_domain *rd)
+static inline int get_rd_overloaded(struct root_domain *rd)
{
return READ_ONCE(rd->overloaded);
}
-static inline void set_rd_overload(struct root_domain *rd, int status)
+static inline void set_rd_overloaded(struct root_domain *rd, int status)
{
- if (get_rd_overload(rd) != status)
+ if (get_rd_overloaded(rd) != status)
WRITE_ONCE(rd->overloaded, status);
}
@@ -2541,7 +2541,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
#ifdef CONFIG_SMP
if (prev_nr < 2 && rq->nr_running >= 2) {
- set_rd_overload(rq->rd, SG_OVERLOAD);
+ set_rd_overloaded(rq->rd, SG_OVERLOAD);
}
#endif
The following commit has been merged into the sched/core branch of tip:
Commit-ID: dfb83ef7b8b064c15be19cf7fcbde0996712de8f
Gitweb: https://git.kernel.org/tip/dfb83ef7b8b064c15be19cf7fcbde0996712de8f
Author: Ingo Molnar <[email protected]>
AuthorDate: Thu, 28 Mar 2024 11:33:20 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 28 Mar 2024 11:38:58 +01:00
sched/fair: Rename root_domain::overload to ::overloaded
It is silly to use an ambiguous noun instead of a clear adjective when naming
such a flag ...
Note how root_domain::overutilized already used a proper adjective.
rd->overloaded is now set to 1 when the root domain is overloaded.
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/sched.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e86ee26..cddc479 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -874,7 +874,7 @@ struct root_domain {
* - More than one runnable task
* - Running task is misfit
*/
- int overload;
+ int overloaded;
/* Indicate one or more cpus over-utilized (tipping point) */
int overutilized;
@@ -932,13 +932,13 @@ extern void sched_put_rd(struct root_domain *rd);
static inline int get_rd_overload(struct root_domain *rd)
{
- return READ_ONCE(rd->overload);
+ return READ_ONCE(rd->overloaded);
}
static inline void set_rd_overload(struct root_domain *rd, int status)
{
if (get_rd_overload(rd) != status)
- WRITE_ONCE(rd->overload, status);
+ WRITE_ONCE(rd->overloaded, status);
}
#ifdef HAVE_RT_PUSH_IPI
* Ingo Molnar <[email protected]> wrote:
> Plus I've applied a patch to rename ::overload to ::overloaded. It is
> silly to use an ambiguous noun instead of a clear adjective when naming
> such a flag ...
Plus SG_OVERLOAD should be SG_OVERLOADED as well - it now looks in line
with SG_OVERUTILIZED:
/* Scheduling group status flags */
#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
My followup question is: why are these a bitmask, why not separate
flags?
AFAICS we only ever set them separately:
thule:~/tip> git grep SG_OVER kernel/sched/
kernel/sched/fair.c: set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
kernel/sched/fair.c: *sg_status |= SG_OVERUTILIZED;
kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
kernel/sched/fair.c: set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
kernel/sched/fair.c: sg_status & SG_OVERUTILIZED);
kernel/sched/fair.c: } else if (sg_status & SG_OVERUTILIZED) {
kernel/sched/fair.c: set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
kernel/sched/sched.h:#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
kernel/sched/sched.h:#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
kernel/sched/sched.h: set_rd_overloaded(rq->rd, SG_OVERLOADED);
In fact this results in suboptimal code:
/* update overload indicator if we are at root domain */
set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
/* Update over-utilization (tipping point, U >= 0) indicator */
set_rd_overutilized_status(env->dst_rq->rd,
sg_status & SG_OVERUTILIZED);
Note how the bits that got mixed together in sg_status now have to be
masked out individually.
The sg_status bitmask appears to make no sense at all to me.
By turning these into individual bool flags we could also do away with
all the extra SG_OVERLOADED/SG_OVERUTILIZED abstraction.
Ie. something like the patch below? Untested.
Thanks,
Ingo
=================>
From: Ingo Molnar <[email protected]>
Date: Thu, 28 Mar 2024 12:00:14 +0100
Subject: [PATCH] sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
SG_OVERLOADED and SG_OVERUTILIZED flags plus the sg_status bitmask are an
unnecessary complication that only make the code harder to read and slower.
We only ever set them separately:
thule:~/tip> git grep SG_OVER kernel/sched/
kernel/sched/fair.c: set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
kernel/sched/fair.c: *sg_status |= SG_OVERUTILIZED;
kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
kernel/sched/fair.c: set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
kernel/sched/fair.c: sg_status & SG_OVERUTILIZED);
kernel/sched/fair.c: } else if (sg_status & SG_OVERUTILIZED) {
kernel/sched/fair.c: set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
kernel/sched/sched.h:#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
kernel/sched/sched.h:#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
kernel/sched/sched.h: set_rd_overloaded(rq->rd, SG_OVERLOADED);
And use them separately, which results in suboptimal code:
/* update overload indicator if we are at root domain */
set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
/* Update over-utilization (tipping point, U >= 0) indicator */
set_rd_overutilized_status(env->dst_rq->rd,
Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats()
and its lower level functions, and change all of them to 'bool'.
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 33 ++++++++++++++++-----------------
kernel/sched/sched.h | 17 ++++++-----------
2 files changed, 22 insertions(+), 28 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f29efd5f19f6..ebc8d5f855de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6688,19 +6688,18 @@ static inline bool cpu_overutilized(int cpu)
/*
* overutilized value make sense only if EAS is enabled
*/
-static inline int is_rd_overutilized(struct root_domain *rd)
+static inline bool is_rd_overutilized(struct root_domain *rd)
{
return !sched_energy_enabled() || READ_ONCE(rd->overutilized);
}
-static inline void set_rd_overutilized(struct root_domain *rd,
- unsigned int status)
+static inline void set_rd_overutilized(struct root_domain *rd, bool flag)
{
if (!sched_energy_enabled())
return;
- WRITE_ONCE(rd->overutilized, status);
- trace_sched_overutilized_tp(rd, !!status);
+ WRITE_ONCE(rd->overutilized, flag);
+ trace_sched_overutilized_tp(rd, flag);
}
static inline void check_update_overutilized_status(struct rq *rq)
@@ -6711,7 +6710,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
*/
if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
- set_rd_overutilized(rq->rd, SG_OVERUTILIZED);
+ set_rd_overutilized(rq->rd, 1);
}
#else
static inline void check_update_overutilized_status(struct rq *rq) { }
@@ -9940,7 +9939,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
struct sd_lb_stats *sds,
struct sched_group *group,
struct sg_lb_stats *sgs,
- int *sg_status)
+ bool *sg_overloaded,
+ bool *sg_overutilized)
{
int i, nr_running, local_group;
@@ -9961,10 +9961,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_nr_running += nr_running;
if (nr_running > 1)
- *sg_status |= SG_OVERLOADED;
+ *sg_overloaded = 1;
if (cpu_overutilized(i))
- *sg_status |= SG_OVERUTILIZED;
+ *sg_overutilized = 1;
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
@@ -9986,7 +9986,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Check for a misfit task on the cpu */
if (sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
- *sg_status |= SG_OVERLOADED;
+ *sg_overloaded = 1;
}
} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
@@ -10612,7 +10612,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
unsigned long sum_util = 0;
- int sg_status = 0;
+ bool sg_overloaded = 0, sg_overutilized = 0;
do {
struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10628,7 +10628,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
update_group_capacity(env->sd, env->dst_cpu);
}
- update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
+ update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
@@ -10657,13 +10657,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
+ set_rd_overloaded(env->dst_rq->rd, sg_overloaded);
/* Update over-utilization (tipping point, U >= 0) indicator */
- set_rd_overutilized(env->dst_rq->rd,
- sg_status & SG_OVERUTILIZED);
- } else if (sg_status & SG_OVERUTILIZED) {
- set_rd_overutilized(env->dst_rq->rd, SG_OVERUTILIZED);
+		set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
+ } else if (sg_overutilized) {
+ set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
}
update_idle_cpu_scan(env, sum_util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 07c6669b8250..7c39dbf31f75 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -713,7 +713,7 @@ struct rt_rq {
} highest_prio;
#endif
#ifdef CONFIG_SMP
- int overloaded;
+ bool overloaded;
struct plist_head pushable_tasks;
#endif /* CONFIG_SMP */
@@ -757,7 +757,7 @@ struct dl_rq {
u64 next;
} earliest_dl;
- int overloaded;
+ bool overloaded;
/*
* Tasks on this rq that can be pushed away. They are kept in
@@ -850,10 +850,6 @@ struct perf_domain {
struct rcu_head rcu;
};
-/* Scheduling group status flags */
-#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
-#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
-
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
@@ -874,10 +870,10 @@ struct root_domain {
* - More than one runnable task
* - Running task is misfit
*/
- int overloaded;
+ bool overloaded;
/* Indicate one or more cpus over-utilized (tipping point) */
- int overutilized;
+ bool overutilized;
/*
* The bit corresponding to a CPU gets set here if such CPU has more
@@ -2540,9 +2536,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
}
#ifdef CONFIG_SMP
- if (prev_nr < 2 && rq->nr_running >= 2) {
- set_rd_overloaded(rq->rd, SG_OVERLOADED);
- }
+ if (prev_nr < 2 && rq->nr_running >= 2)
+ set_rd_overloaded(rq->rd, 1);
#endif
sched_update_tick_dependency(rq);
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 4d0a63e5b841c759c9a306aff158420421ef016f
Gitweb: https://git.kernel.org/tip/4d0a63e5b841c759c9a306aff158420421ef016f
Author: Ingo Molnar <[email protected]>
AuthorDate: Thu, 28 Mar 2024 11:54:42 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 28 Mar 2024 11:54:42 +01:00
sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
The _status() postfix has no real meaning, simplify the naming
and harmonize it with set_rd_overloaded().
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 839a97a..f29efd5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6693,7 +6693,7 @@ static inline int is_rd_overutilized(struct root_domain *rd)
return !sched_energy_enabled() || READ_ONCE(rd->overutilized);
}
-static inline void set_rd_overutilized_status(struct root_domain *rd,
+static inline void set_rd_overutilized(struct root_domain *rd,
unsigned int status)
{
if (!sched_energy_enabled())
@@ -6711,7 +6711,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
*/
if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
- set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
+ set_rd_overutilized(rq->rd, SG_OVERUTILIZED);
}
#else
static inline void check_update_overutilized_status(struct rq *rq) { }
@@ -10660,10 +10660,10 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
/* Update over-utilization (tipping point, U >= 0) indicator */
- set_rd_overutilized_status(env->dst_rq->rd,
+ set_rd_overutilized(env->dst_rq->rd,
sg_status & SG_OVERUTILIZED);
} else if (sg_status & SG_OVERUTILIZED) {
- set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
+ set_rd_overutilized(env->dst_rq->rd, SG_OVERUTILIZED);
}
update_idle_cpu_scan(env, sum_util);
On 3/28/24 4:04 PM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <[email protected]> wrote:
>
>>
>> Hi Ingo.
>>
>> These two patches apply cleanly now to sched/core.
>>
>> 7a9dd7ef946c (HEAD -> sched/core) sched/fair: Use helper functions to access rd->overload
>> 4f24aa918558 sched/fair: Check rd->overload value before update
>> c829d6818b60 (origin/sched/core) sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
>> d0f5d3cefc25 sched/fair: Introduce is_rd_overutilized() helper function to access root_domain::overutilized
>
> I've applied them, but note that there were quite a few avoidable typos
> and grammar mistakes in the changelogs. Please always review what
> you've submitted versus what I have applied and try to learn from that:
> I almost invariably have to edit some detail to make the changelog more
> presentable... Could you please take more care with future patches?
>
Noted.
I will learn from it, thank you.
> I also renamed the accessor functions in the second patch to:
>
> get_rd_overload()
> set_rd_overload()
>
> Plus I've applied a patch to rename ::overload to ::overloaded. It is
> silly to use an ambiguous noun instead of a clear adjective when naming
> such a flag ...
>
Sure, looks fine.
> Thanks,
>
> Ingo
On 3/28/24 4:37 PM, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> Plus I've applied a patch to rename ::overload to ::overloaded. It is
>> silly to use an ambiguous noun instead of a clear adjective when naming
>> such a flag ...
>
> Plus SG_OVERLOAD should be SG_OVERLOADED as well - it now looks in line
> with SG_OVERUTILIZED:
>
> /* Scheduling group status flags */
> #define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
> #define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
>
> My followup question is: why are these a bitmask, why not separate
> flags?
>
> AFAICS we only ever set them separately:
>
> thule:~/tip> git grep SG_OVER kernel/sched/
> kernel/sched/fair.c: set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
> kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
> kernel/sched/fair.c: *sg_status |= SG_OVERUTILIZED;
> kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
> kernel/sched/fair.c: set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
> kernel/sched/fair.c: sg_status & SG_OVERUTILIZED);
> kernel/sched/fair.c: } else if (sg_status & SG_OVERUTILIZED) {
> kernel/sched/fair.c: set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
> kernel/sched/sched.h:#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
> kernel/sched/sched.h:#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
> kernel/sched/sched.h: set_rd_overloaded(rq->rd, SG_OVERLOADED);
>
> In fact this results in suboptimal code:
>
> /* update overload indicator if we are at root domain */
> set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
>
> /* Update over-utilization (tipping point, U >= 0) indicator */
> set_rd_overutilized_status(env->dst_rq->rd,
> sg_status & SG_OVERUTILIZED);
>
> Note how the bits that got mixed together in sg_status now have to be
> masked out individually.
>
> The sg_status bitmask appears to make no sense at all to me.
>
> By turning these into individual bool flags we could also do away with
> all the extra SG_OVERLOADED/SG_OVERUTILIZED abstraction.
>
> Ie. something like the patch below? Untested.
Looks good. I see it is merged to sched/core.
Did a boot with that patch, and hackbench shows the same results on a
320 CPU system.
>
> Thanks,
>
> Ingo
>
> =================>
> From: Ingo Molnar <[email protected]>
> Date: Thu, 28 Mar 2024 12:00:14 +0100
> Subject: [PATCH] sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
>
> SG_OVERLOADED and SG_OVERUTILIZED flags plus the sg_status bitmask are an
> unnecessary complication that only make the code harder to read and slower.
>
> We only ever set them separately:
>
> thule:~/tip> git grep SG_OVER kernel/sched/
> kernel/sched/fair.c: set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
> kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
> kernel/sched/fair.c: *sg_status |= SG_OVERUTILIZED;
> kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
> kernel/sched/fair.c: set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
> kernel/sched/fair.c: sg_status & SG_OVERUTILIZED);
> kernel/sched/fair.c: } else if (sg_status & SG_OVERUTILIZED) {
> kernel/sched/fair.c: set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
> kernel/sched/sched.h:#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
> kernel/sched/sched.h:#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
> kernel/sched/sched.h: set_rd_overloaded(rq->rd, SG_OVERLOADED);
>
> And use them separately, which results in suboptimal code:
>
> /* update overload indicator if we are at root domain */
> set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
>
> /* Update over-utilization (tipping point, U >= 0) indicator */
> set_rd_overutilized_status(env->dst_rq->rd,
>
> Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats()
> and its lower level functions, and change all of them to 'bool'.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> kernel/sched/fair.c | 33 ++++++++++++++++-----------------
> kernel/sched/sched.h | 17 ++++++-----------
> 2 files changed, 22 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f29efd5f19f6..ebc8d5f855de 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6688,19 +6688,18 @@ static inline bool cpu_overutilized(int cpu)
> /*
> * overutilized value make sense only if EAS is enabled
> */
> -static inline int is_rd_overutilized(struct root_domain *rd)
> +static inline bool is_rd_overutilized(struct root_domain *rd)
> {
> return !sched_energy_enabled() || READ_ONCE(rd->overutilized);
> }
>
> -static inline void set_rd_overutilized(struct root_domain *rd,
> - unsigned int status)
> +static inline void set_rd_overutilized(struct root_domain *rd, bool flag)
> {
> if (!sched_energy_enabled())
> return;
>
> - WRITE_ONCE(rd->overutilized, status);
> - trace_sched_overutilized_tp(rd, !!status);
> + WRITE_ONCE(rd->overutilized, flag);
> + trace_sched_overutilized_tp(rd, flag);
> }
>
> static inline void check_update_overutilized_status(struct rq *rq)
> @@ -6711,7 +6710,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
> */
>
> if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
> - set_rd_overutilized(rq->rd, SG_OVERUTILIZED);
> + set_rd_overutilized(rq->rd, 1);
> }
> #else
> static inline void check_update_overutilized_status(struct rq *rq) { }
> @@ -9940,7 +9939,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> struct sd_lb_stats *sds,
> struct sched_group *group,
> struct sg_lb_stats *sgs,
> - int *sg_status)
> + bool *sg_overloaded,
> + bool *sg_overutilized)
> {
> int i, nr_running, local_group;
>
> @@ -9961,10 +9961,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->sum_nr_running += nr_running;
>
> if (nr_running > 1)
> - *sg_status |= SG_OVERLOADED;
> + *sg_overloaded = 1;
>
> if (cpu_overutilized(i))
> - *sg_status |= SG_OVERUTILIZED;
> + *sg_overutilized = 1;
>
> #ifdef CONFIG_NUMA_BALANCING
> sgs->nr_numa_running += rq->nr_numa_running;
> @@ -9986,7 +9986,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> /* Check for a misfit task on the cpu */
> if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> sgs->group_misfit_task_load = rq->misfit_task_load;
> - *sg_status |= SG_OVERLOADED;
> + *sg_overloaded = 1;
> }
> } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
> /* Check for a task running on a CPU with reduced capacity */
> @@ -10612,7 +10612,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> struct sg_lb_stats *local = &sds->local_stat;
> struct sg_lb_stats tmp_sgs;
> unsigned long sum_util = 0;
> - int sg_status = 0;
> + bool sg_overloaded = 0, sg_overutilized = 0;
>
> do {
> struct sg_lb_stats *sgs = &tmp_sgs;
> @@ -10628,7 +10628,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> update_group_capacity(env->sd, env->dst_cpu);
> }
>
> - update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
> + update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
>
> if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
> sds->busiest = sg;
> @@ -10657,13 +10657,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> if (!env->sd->parent) {
> /* update overload indicator if we are at root domain */
> - set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
> + set_rd_overloaded(env->dst_rq->rd, sg_overloaded);
>
> /* Update over-utilization (tipping point, U >= 0) indicator */
> - set_rd_overutilized(env->dst_rq->rd,
> - sg_status & SG_OVERUTILIZED);
> - } else if (sg_status & SG_OVERUTILIZED) {
> - set_rd_overutilized(env->dst_rq->rd, SG_OVERUTILIZED);
> +		set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
> + } else if (sg_overutilized) {
> + set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
> }
>
> update_idle_cpu_scan(env, sum_util);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 07c6669b8250..7c39dbf31f75 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -713,7 +713,7 @@ struct rt_rq {
> } highest_prio;
> #endif
> #ifdef CONFIG_SMP
> - int overloaded;
> + bool overloaded;
> struct plist_head pushable_tasks;
>
> #endif /* CONFIG_SMP */
> @@ -757,7 +757,7 @@ struct dl_rq {
> u64 next;
> } earliest_dl;
>
> - int overloaded;
> + bool overloaded;
>
> /*
> * Tasks on this rq that can be pushed away. They are kept in
> @@ -850,10 +850,6 @@ struct perf_domain {
> struct rcu_head rcu;
> };
>
> -/* Scheduling group status flags */
> -#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
> -#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
> -
> /*
> * We add the notion of a root-domain which will be used to define per-domain
> * variables. Each exclusive cpuset essentially defines an island domain by
> @@ -874,10 +870,10 @@ struct root_domain {
> * - More than one runnable task
> * - Running task is misfit
> */
> - int overloaded;
> + bool overloaded;
>
> /* Indicate one or more cpus over-utilized (tipping point) */
> - int overutilized;
> + bool overutilized;
>
> /*
> * The bit corresponding to a CPU gets set here if such CPU has more
> @@ -2540,9 +2536,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
> }
>
> #ifdef CONFIG_SMP
> - if (prev_nr < 2 && rq->nr_running >= 2) {
> - set_rd_overloaded(rq->rd, SG_OVERLOADED);
> - }
> + if (prev_nr < 2 && rq->nr_running >= 2)
> + set_rd_overloaded(rq->rd, 1);
> #endif
>
> sched_update_tick_dependency(rq);
* Shrikanth Hegde <[email protected]> wrote:
>
>
> On 3/28/24 4:37 PM, Ingo Molnar wrote:
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> >> Plus I've applied a patch to rename ::overload to ::overloaded. It is
> >> silly to use an ambiguous noun instead of a clear adjective when naming
> >> such a flag ...
> >
> > Plus SG_OVERLOAD should be SG_OVERLOADED as well - it now looks in line
> > with SG_OVERUTILIZED:
> >
> > /* Scheduling group status flags */
> > #define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
> > #define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
> >
> > My followup question is: why are these a bitmask, why not separate
> > flags?
> >
> > AFAICS we only ever set them separately:
> >
> > thule:~/tip> git grep SG_OVER kernel/sched/
> > kernel/sched/fair.c: set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
> > kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
> > kernel/sched/fair.c: *sg_status |= SG_OVERUTILIZED;
> > kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
> > kernel/sched/fair.c: set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
> > kernel/sched/fair.c: sg_status & SG_OVERUTILIZED);
> > kernel/sched/fair.c: } else if (sg_status & SG_OVERUTILIZED) {
> > kernel/sched/fair.c: set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
> > kernel/sched/sched.h:#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
> > kernel/sched/sched.h:#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
> > kernel/sched/sched.h: set_rd_overloaded(rq->rd, SG_OVERLOADED);
> >
> > In fact this results in suboptimal code:
> >
> > /* update overload indicator if we are at root domain */
> > set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
> >
> > /* Update over-utilization (tipping point, U >= 0) indicator */
> > set_rd_overutilized_status(env->dst_rq->rd,
> > sg_status & SG_OVERUTILIZED);
> >
> > Note how the bits that got mixed together in sg_status now have to be
> > masked out individually.
> >
> > The sg_status bitmask appears to make no sense at all to me.
> >
> > By turning these into individual bool flags we could also do away with
> > all the extra SG_OVERLOADED/SG_OVERUTILIZED abstraction.
> >
> > Ie. something like the patch below? Untested.
>
> Looks good. I see it is merged to sched/core.
> Did a boot with that patch, and hackbench shows the same results on a
> 320 CPU system.
Thanks, I've added:
Acked-by: Shrikanth Hegde <[email protected]>
Tested-by: Shrikanth Hegde <[email protected]>
And applied the additional docbook fix below on top as well.
Thanks,
Ingo
=================>
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebc8d5f855de..1dd37168da50 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9933,7 +9933,8 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
* @sds: Load-balancing data with statistics of the local group.
* @group: sched_group whose statistics are to be updated.
* @sgs: variable to hold the statistics for this group.
- * @sg_status: Holds flag indicating the status of the sched_group
+ * @sg_overloaded: sched_group is overloaded
+ * @sg_overutilized: sched_group is overutilized
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sd_lb_stats *sds,