2023-07-18 13:48:51

by Aaron Lu

Subject: [RFC PATCH 0/4] Reduce cost of accessing tg->load_avg

Nitin Tekchandani noticed some scheduler functions have a high cost
according to perf/cycles while running the postgres_sysbench workload.
I perf-annotated the high cost functions, update_cfs_group() and
update_load_avg(), and found ~90% of their cost was due to accessing
tg->load_avg. This series is an attempt to reduce the cost of accessing
tg->load_avg.
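
For reference, the write side and the read side that end up bouncing the
same cacheline are roughly the following two spots in kernel/sched/fair.c
(trimmed excerpts; the full functions show up in patch2's context):

  /* write side: update_tg_load_avg(), reached from update_load_avg() */
  atomic_long_add(delta, &cfs_rq->tg->load_avg);

  /* read side: calc_group_shares(), reached from update_cfs_group()
   * on every en/dequeue */
  tg_weight = atomic_long_read(&tg->load_avg);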

The 1st patch is a bug fix. I probably should try PeterZ's SBRM which
should make things a lot cleaner, but I'm not sure if this fix is stable
material and, if so, writing it in the traditional way is easier to
backport?

The 2nd patch is actually v2; the first version was sent here:
https://lore.kernel.org/lkml/20230327053955.GA570404@ziqianlu-desk2/
Since it is now part of this series, I didn't keep a separate version
number for it. The idea is from PeterZ: by making the hot variable
per node, at least the write side, update_load_avg(), becomes node
local. The sad part is that the read side is still global because it has
to sum up each node's part, and that hurts performance. Also,
update_cfs_group() is called very frequently, e.g. on every
en/dequeue_task_fair(). With this patch, postgres_sysbench on SPR sees
some improvement; hackbench/netperf also see some improvement on SPR and
Cascade Lake. For details, please see patch2.
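
The gist, condensed from patch2 (the write side only touches the local
node's counter while the read side has to walk all nodes):

  struct tg_node_info {
          struct {
                  atomic_long_t load_avg;
          } ____cacheline_aligned_in_smp;
  };

  /* write side, in update_tg_load_avg() */
  atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);

  /* read side, used by update_cfs_group()/calc_group_shares() */
  static inline long tg_load_avg(struct task_group *tg)
  {
          long load_avg = 0;
          int i;

          for_each_node(i)
                  load_avg += atomic_long_read(&tg->node_info[i]->load_avg);

          return load_avg;
  }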

While there are improvements from making tg->load_avg per node, the two
functions' cost is still noticeable while running postgres_sysbench on
SPR: ~7% for update_cfs_group() and ~5% for update_load_avg(). To further
reduce the cost of accessing tg->load_avg, the 3rd patch tries to reduce
the number of updates to tg->load_avg when a task migrates on wakeup:
current code will immediately call update_tg_load_avg() once
update_load_avg() finds the cfs_rq has removed_load pending; patch3
changes this behaviour: it ignores the cfs_rq's pending removed_load and
relies on subsequent events, like a task being attached to this cfs_rq,
to batch the updates to tg->load_avg. For a wakeup heavy workload, this
can roughly reduce the number of calls to update_tg_load_avg() by half.

After patch3, the cost of the two functions while running
postgres_sysbench on SPR dropped to ~1%, but when running netperf/UDP_RR
on SPR the two functions' cost is still ~20% and ~10%. So patch4 tried to
reduce the number of calls to update_cfs_group() on the en/dequeue path,
and that made the two functions' cost drop to ~2% when running
netperf/UDP_RR on SPR.

One issue with patch3 and patch4 is that they both reduce the tracking
accuracy of the task group's load_avg in favor of reducing the cost of
accessing tg->load_avg. patch3 is probably better than patch4 in this
regard, because we already delay processing a cfs_rq's removed_load,
although patch3 makes that delay even longer; patch4 skips some calls to
update_cfs_group() on the en/dequeue path and that may affect things. I
made a test inspired by an old commit and the result seems to suggest
it's not bad, but I may have missed some scenarios.

This series is based on v6.5-rc1 with one more patch applied that makes
tg->load_avg stay in a dedicated cacheline, to avoid the false sharing
issue discovered by Deng Pan:
https://lore.kernel.org/lkml/[email protected]/

For the performance data in each commit's changelog, node means patch2,
delay means patch3 and skip means patch4. All performance change
percentages are against base.

Comments are welcome.

Aaron Lu (4):
sched/fair: free allocated memory on error in alloc_fair_sched_group()
sched/fair: Make tg->load_avg per node
sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
sched/fair: skip some update_cfs_group() on en/dequeue_entity()

kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 80 ++++++++++++++++++++++++++++++++++----------
kernel/sched/sched.h | 44 ++++++++++++++++++------
3 files changed, 97 insertions(+), 29 deletions(-)

--
2.41.0



2023-07-18 13:49:35

by Aaron Lu

Subject: [PATCH 1/4] sched/fair: free allocated memory on error in alloc_fair_sched_group()

There is one struct cfs_rq and one struct se per cpu for a task group,
and when the allocation for tg->cfs_rq[X] fails, the already allocated
tg->cfs_rq[0]..tg->cfs_rq[X-1] should be freed. The same goes for tg->se.

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/fair.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..0f913487928d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12443,10 +12443,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
if (!tg->cfs_rq)
- goto err;
+ return 0;
tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
if (!tg->se)
- goto err;
+ goto err_free_rq_pointer;

tg->shares = NICE_0_LOAD;

@@ -12456,12 +12456,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
GFP_KERNEL, cpu_to_node(i));
if (!cfs_rq)
- goto err;
+ goto err_free;

se = kzalloc_node(sizeof(struct sched_entity_stats),
GFP_KERNEL, cpu_to_node(i));
if (!se)
- goto err_free_rq;
+ goto err_free;

init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
@@ -12470,9 +12470,18 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

return 1;

-err_free_rq:
- kfree(cfs_rq);
-err:
+err_free:
+ for_each_possible_cpu(i) {
+ kfree(tg->cfs_rq[i]);
+ kfree(tg->se[i]);
+
+ if (!tg->cfs_rq[i] && !tg->se[i])
+ break;
+ }
+ kfree(tg->se);
+err_free_rq_pointer:
+ kfree(tg->cfs_rq);
+
return 0;
}

--
2.41.0


2023-07-18 13:57:11

by Aaron Lu

Subject: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on a
2 sockets/112 cores/224 cpus Intel Sapphire Rapids (SPR):

13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group
10.63% 10.04% [kernel.vmlinux] [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side.

Tim Chen told me that PeterZ once mentioned a way to solve a similar
problem by making the counter per node, so do the same for tg->load_avg.
After this change, the cost of the two functions is reduced and sysbench
transactions are increased on SPR. Below are the test results.

===============================================
postgres_sysbench(transaction, higher is better)
nr_thread=100%/75%/50% were tested on 2 sockets SPR and Icelake and
results that have a measuable difference are:

nr_thread=100% on SPR
base: 90569.11±1.15%
node: 104152.26±0.34% +15.0%

nr_thread=75% on SPR
base: 100803.96±0.57%
node: 107333.58±0.44% +6.5%

=======================================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2 sockets SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base: 437163±2.6%
node: 471203±1.2% +7.8%

group=16 on SPR:
base: 468279±1.9%
node: 580385±1.7% +23.9%

=============================================
netperf/TCP_STREAM
nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
Lake and there is no measurable difference.

=============================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=75% on Cascade Lake:
base: 36701±1.7%
node: 39949±1.4% +8.8%

nr_thread=75% on SPR:
base: 14249±3.8%
node: 19890±2.0% +39.6%

nr_thread=100% on Cascade Lake:
base: 52275±0.6%
node: 53827±0.4% +3.0%

nr_thread=100% on SPR:
base: 9560±1.6%
node: 14186±3.9% +48.4%

Reported-by: Nitin Tekchandani <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 29 ++++++++++++++++++++++++++---
kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++----------
3 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 066ff1c8ae4e..3af965a18866 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -691,7 +691,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
cfs_rq->tg_load_avg_contrib);
SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg",
- atomic_long_read(&cfs_rq->tg->load_avg));
+ tg_load_avg(cfs_rq->tg));
#endif
#endif
#ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f913487928d..aceb8f5922cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3496,7 +3496,7 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)

load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);

- tg_weight = atomic_long_read(&tg->load_avg);
+ tg_weight = tg_load_avg(tg);

/* Ensure tg_weight >= load */
tg_weight -= cfs_rq->tg_load_avg_contrib;
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+ int node = cpu_to_node(smp_processor_id());

/*
* No need to update load_avg for root_task_group as it is not used.
@@ -3673,7 +3674,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
return;

if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
- atomic_long_add(delta, &cfs_rq->tg->load_avg);
+ atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}
}
@@ -12439,7 +12440,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
- int i;
+ int i, nodes;

tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
if (!tg->cfs_rq)
@@ -12468,8 +12469,30 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
init_entity_runnable_average(se);
}

+#ifdef CONFIG_SMP
+ nodes = num_possible_nodes();
+ tg->node_info = kcalloc(nodes, sizeof(struct tg_node_info *), GFP_KERNEL);
+ if (!tg->node_info)
+ goto err_free;
+
+ for_each_node(i) {
+ tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
+ if (!tg->node_info[i])
+ goto err_free_node;
+ }
+#endif
+
return 1;

+#ifdef CONFIG_SMP
+err_free_node:
+ for_each_node(i) {
+ kfree(tg->node_info[i]);
+ if (!tg->node_info[i])
+ break;
+ }
+ kfree(tg->node_info);
+#endif
err_free:
for_each_possible_cpu(i) {
kfree(tg->cfs_rq[i]);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14dfaafb3a8f..9cece2dbc95b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -359,6 +359,17 @@ struct cfs_bandwidth {
#endif
};

+struct tg_node_info {
+ /*
+ * load_avg can be heavily contended at clock tick time and task
+ * enqueue/dequeue time, so put it in its own cacheline separated
+ * from other fields.
+ */
+ struct {
+ atomic_long_t load_avg;
+ } ____cacheline_aligned_in_smp;
+};
+
/* Task group related information */
struct task_group {
struct cgroup_subsys_state css;
@@ -373,15 +384,8 @@ struct task_group {
/* A positive value indicates that this is a SCHED_IDLE group. */
int idle;

-#ifdef CONFIG_SMP
- /*
- * load_avg can be heavily contended at clock tick time, so put
- * it in its own cacheline separated from the fields above which
- * will also be accessed at each tick.
- */
- struct {
- atomic_long_t load_avg;
- } ____cacheline_aligned_in_smp;
+#ifdef CONFIG_SMP
+ struct tg_node_info **node_info;
#endif
#endif

@@ -413,9 +417,28 @@ struct task_group {
/* Effective clamp values used for a task group */
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
-
};

+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+static inline long tg_load_avg(struct task_group *tg)
+{
+ long load_avg = 0;
+ int i;
+
+ /*
+ * The only path that can give us a root_task_group
+ * here is from print_cfs_rq() thus unlikely.
+ */
+ if (unlikely(tg == &root_task_group))
+ return 0;
+
+ for_each_node(i)
+ load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
+
+ return load_avg;
+}
+#endif
+
#ifdef CONFIG_FAIR_GROUP_SCHED
#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD

--
2.41.0


2023-07-18 13:57:16

by Aaron Lu

Subject: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

When a workload involves many wake time task migrations, tg->load_avg
can be heavily contended among CPUs because every migration involves
removing the task's load from its src cfs_rq and attaching that load to
its new cfs_rq. Both the remove and the attach require an update to
tg->load_avg as well as propagating the change up the hierarchy.
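
Roughly, the call pattern on a wake time migration that the above refers
to looks like this (simplified; current code, bandwidth/error paths
omitted):

  at migration time (without the src rq lock held):
    migrate_task_rq_fair()
      remove_entity_load_avg()      /* record the load in src cfs_rq->removed */

  later, when the src cfs_rq is next updated:
    update_cfs_rq_load_avg()        /* fold ->removed into the cfs_rq's avgs */
    update_tg_load_avg()            /* atomic update of tg->load_avg */

  on the dst cpu, at enqueue:
    update_load_avg(DO_ATTACH)
      attach_entity_load_avg()
      update_tg_load_avg()          /* another atomic update of tg->load_avg */

  and add_tg_cfs_propagate() queues the same change for each ancestor
  cfs_rq, repeating the tg->load_avg update at every level.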

E.g. when running postgres_sysbench on a 2 sockets/112 cores/224 cpus
Intel Sapphire Rapids (SPR), during a 5s window the wakeup number is 14
million and the migration number is 11 million. Since the workload can
trigger many wakeups and migrations, the accesses (both read and write)
to tg->load_avg can be unbounded. For the above workload, the profile
shows update_cfs_group() costs ~13% and update_load_avg() costs ~10%.
With netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21 million and
the migration number is 14 million; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.

This patch is an attempt to reduce the cost of accessing tg->load_avg.

Current logic will immediately call update_tg_load_avg() if the cfs_rq
has removed load; this patch changes that behavior: if this cfs_rq has
removed load as discovered by update_cfs_rq_load_avg(), it doesn't call
update_tg_load_avg() or propagate the removed load immediately; instead,
the update to tg->load_avg and the propagated load are dealt with by a
subsequent event, like a task being attached to this cfs_rq, or by
update_blocked_averages(). This way, the number of calls to
update_tg_load_avg() for this cfs_rq and its ancestors can be reduced by
about half.

================================================
postgres_sysbench(transaction, higher is better)
nr_thread=100%/75%/50% were tested on 2 sockets SPR and Icelake and
results that have a measurable difference are:

nr_thread=100% on SPR:
base: 90569.11±1.15%
node: 104152.26±0.34% +15.0%
delay: 127309.46±4.25% +40.6%

nr_thread=75% on SPR:
base: 100803.96±0.57%
node: 107333.58±0.44% +6.5%
delay: 124332.39±0.51% +23.3%

nr_thread=75% on ICL:
base: 61961.26±0.41%
node: 61585.45±0.50%
delay: 72420.52±0.14% +16.9%

=======================================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2 sockets SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base: 437163±2.6%
node: 471203±1.2% +7.8%
delay: 490780±0.9% +12.3%

group=16 on SPR:
base: 468279±1.9%
node: 580385±1.7% +23.9%
delay: 664422±0.2% +41.9%

================================================
netperf/TCP_STREAM (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=50% on CSL:
base: 16258±0.7%
node: 16172±2.9%
delay: 17729±0.7% +9.0%

nr_thread=75% on CSL:
base: 12923±1.2%
node: 13011±2.2%
delay: 15452±1.6% +19.6%

nr_thread=75% on SPR:
base: 16232±11.9%
node: 13962±5.1%
delay: 21089±0.8% +29.9%

nr_thread=100% on SPR:
base: 13220±0.6%
node: 13113±0.0%
delay: 18258±11.3% +38.1%

=============================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=1 on CSL:
base: 128521±0.5%
node: 127935±0.6%
delay: 126317±0.4% -1.7%

nr_thread=75% on CSL:
base: 36701±1.7%
node: 39949±1.4% +8.8%
delay: 42516±0.3% +15.8%

nr_thread=75% on SPR:
base: 14249±3.8%
node: 19890±2.0% +39.6%
delay: 31331±0.5% +119.9%

nr_thread=100% on CSL:
base: 52275±0.6%
node: 53827±0.4% +3.0%
delay: 78386±0.7% +49.9%

nr_thread=100% on SPR:
base: 9560±1.6%
node: 14186±3.9% +48.4%
delay: 20779±2.8% +117.4%

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/fair.c | 23 ++++++++++++++++++-----
kernel/sched/sched.h | 1 +
2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aceb8f5922cb..564ffe3e59c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
if (child_cfs_rq_on_list(cfs_rq))
return false;

+ if (cfs_rq->prop_removed_sum)
+ return false;
+
return true;
}

@@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
{
cfs_rq->propagate = 1;
cfs_rq->prop_runnable_sum += runnable_sum;
+
+ if (cfs_rq->prop_removed_sum) {
+ cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
+ cfs_rq->prop_removed_sum = 0;
+ }
}

/* Update task and its cfs_rq load average */
@@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
* removed_runnable is the unweighted version of removed_load so we
* can use it to estimate removed_load_sum.
*/
- add_tg_cfs_propagate(cfs_rq,
- -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
- decayed = 1;
+ cfs_rq->prop_removed_sum +=
+ -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
}

- decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+ decayed = __update_load_avg_cfs_rq(now, cfs_rq);
u64_u32_store_copy(sa->last_update_time,
cfs_rq->last_update_time_copy,
sa->last_update_time);
@@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)

if (cfs_rq == &rq->cfs)
decayed = true;
+
+ /*
+ * If the aggregated removed_sum hasn't been taken care of,
+ * deal with it now before this cfs_rq is removed from the list.
+ */
+ if (cfs_rq->prop_removed_sum)
+ add_tg_cfs_propagate(cfs_rq, 0);
}

/* Propagate pending load changes to the parent, if any: */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9cece2dbc95b..ab540b21d071 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -619,6 +619,7 @@ struct cfs_rq {
unsigned long tg_load_avg_contrib;
long propagate;
long prop_runnable_sum;
+ long prop_removed_sum;

/*
* h_load = weight * f(tg)
--
2.41.0


2023-07-18 14:00:23

by Aaron Lu

Subject: [RFC PATCH 4/4] sched/fair: skip some update_cfs_group() on en/dequeue_entity()

After the previous patch, the cost of update_cfs_group() and
update_load_avg() has dropped to around 1% for postgres_sysbench on SPR,
but netperf/UDP_RR on SPR still shows ~20% for update_cfs_group() and
~10% for update_load_avg(), so this patch is another attempt to further
reduce the two functions' cost, this time from the read side.

The observation is: if an entity is dequeued, updating its weight isn't
useful, except that the current code will also update its cfs_rq's
load_avg using the updated weight... so removing update_cfs_group() from
the dequeue path can reduce the cost of accessing tg->load_avg, but it
will also reduce the tracking accuracy.

Another hint I got from an ancient commit, 17bc14b767cf ("Revert "sched:
Update_cfs_shares at period edge""), is: if an entity is enqueued and it
is the only entity of its cfs_rq, we do not need to immediately update
its weight, since the weight isn't needed to decide if it can preempt
curr.

Commit 17bc14b767cf mentioned a latency problem when reducing the calling
frequency of update_cfs_group(): doing a make -j32 in one terminal window
makes the browsing experience worse. To see how things are now, I did a
test: two cgroups were created under root; in one group I ran "make -j32"
and in the meantime I ran "./schbench -m 1 -t 6 -r 300" in the other
group, on a 6 cores/12 cpus Intel i7-8700T Coffee Lake cpu. The wakeup
latency reported by schbench for base and this series doesn't look much
different:

base:
schbench -m 1 -t 6 -r 300:
Latency percentiles (usec) runtime 300 (s) (18534 total samples)
50.0th: 20 (9491 samples)
75.0th: 25 (4768 samples)
90.0th: 29 (2552 samples)
95.0th: 62 (809 samples)
*99.0th: 20320 (730 samples)
99.5th: 23392 (92 samples)
99.9th: 31392 (74 samples)
min=6, max=32032

make -j32:
real 5m35.950s
user 47m33.814s
sys 4m45.470s

this series:
schbench -m 1 -t 6 -r 300:
Latency percentiles (usec) runtime 300 (s) (18528 total samples)
50.0th: 21 (9920 samples)
75.0th: 26 (4756 samples)
90.0th: 30 (2100 samples)
95.0th: 63 (846 samples)
*99.0th: 19040 (722 samples)
99.5th: 21920 (92 samples)
99.9th: 30048 (81 samples)
min=6, max=34873

make -j32:
real 5m35.185s
user 47m28.528s
sys 4m44.705s

As for netperf/UDP_RR/nr_thread=100% on SPR: after this change, the two
functions' cost dropped to ~2%.

Other test results:
================================================
postgres_sysbench(transaction, higher is better)
nr_thread=100%/75%/50% were tested on 2 sockets SPR and Icelake and
results that have a measurable difference are:

nr_thread=100% on SPR:
base: 90569.11±1.15%
node: 104152.26±0.34% +15.0%
delay: 127309.46±4.25% +40.6%
skip: 125501.96±1.83% +38.6%

nr_thread=75% on SPR:
base: 100803.96±0.57%
node: 107333.58±0.44% +6.5%
delay: 124332.39±0.51% +23.3%
skip: 127676.55±0.03% +26.7%

nr_thread=75% on ICL:
base: 61961.26±0.41%
node: 61585.45±0.50%
delay: 72420.52±0.14% +16.9%
skip: 72413.23±0.30% +16.9%

=======================================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2 sockets SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base: 437163±2.6%
node: 471203±1.2% +7.8%
delay: 490780±0.9% +12.3%
skip: 493062±1.9% +12.8%

group=16 on SPR:
base: 468279±1.9%
node: 580385±1.7% +23.9%
delay: 664422±0.2% +41.9%
skip: 697387±0.2% +48.9%

================================================
netperf/TCP_STREAM (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=50% on SPR:
base: 16258±0.7%
node: 16172±2.9%
delay: 17729±0.7% +9.0%
skip: 17823±1.3% +9.6%

nr_thread=75% on CSL:
base: 12923±1.2%
node: 13011±2.2%
delay: 15452±1.6% +19.6%
skip: 15302±1.7% +18.4%

nr_thread=75% on SPR:
base: 16232±11.9%
node: 13962±5.1%
delay: 21089±0.8% +29.9%
skip: 21251±0.4% +30.9%

nr_thread=100% on SPR:
base: 13220±0.6%
node: 13113±0.0%
delay: 18258±11.3% +38.1%
skip: 16974±12.7% +28.4%

=============================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=25% on CSL:
base: 107269±0.3%
node: 107226±0.2%
delay: 106978±0.3%
skip: 109652±0.3% +2.2%

nr_thread=50% on CSL:
base: 74854±0.1%
node: 74521±0.4%
delay: 74438±0.2%
skip: 76431±0.1% +2.1%

nr_thread=75% on CSL:
base: 36701±1.7%
node: 39949±1.4% +8.8%
delay: 42516±0.3% +15.8%
skip: 45044±0.5% +22.7%

nr_thread=75% on SPR:
base: 14249±3.8%
node: 19890±2.0% +39.6%
delay: 31331±0.5% +119.9%
skip: 33688±3.5% +136.4%

nr_thread=100% on CSL:
base: 52275±0.6%
node: 53827±0.4% +3.0%
delay: 78386±0.7% +49.9%
skip: 76926±2.3% +47.2%

nr_thread=100% on SPR:
base: 9560±1.6%
node: 14186±3.9% +48.4%
delay: 20779±2.8% +117.4%
skip: 32125±2.5% +236.0%

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 564ffe3e59c1..0dbbb92302ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4862,7 +4862,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
se_update_runnable(se);
- update_cfs_group(se);
+ if (cfs_rq->nr_running > 0)
+ update_cfs_group(se);
account_entity_enqueue(cfs_rq, se);

if (flags & ENQUEUE_WAKEUP)
@@ -4978,8 +4979,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
/* return excess runtime on last dequeue */
return_cfs_rq_runtime(cfs_rq);

- update_cfs_group(se);
-
/*
* Now advance min_vruntime if @se was the entity holding it back,
* except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
--
2.41.0


2023-07-18 15:39:27

by Chen Yu

Subject: Re: [PATCH 1/4] sched/fair: free allocated memory on error in alloc_fair_sched_group()

On 2023-07-18 at 21:41:17 +0800, Aaron Lu wrote:
> There is one struct cfs_rq and one struct se on each cpu for a taskgroup
> and when allocation for tg->cfs_rq[X] failed, the already allocated
> tg->cfs_rq[0]..tg->cfs_rq[X-1] should be freed. The same for tg->se.
>
> Signed-off-by: Aaron Lu <[email protected]>
> ---
> kernel/sched/fair.c | 23 ++++++++++++++++-------
> 1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..0f913487928d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12443,10 +12443,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>
> tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
> if (!tg->cfs_rq)
> - goto err;
> + return 0;
> tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
> if (!tg->se)
> - goto err;
> + goto err_free_rq_pointer;
>
> tg->shares = NICE_0_LOAD;
>
> @@ -12456,12 +12456,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> GFP_KERNEL, cpu_to_node(i));
> if (!cfs_rq)
> - goto err;
> + goto err_free;
>
> se = kzalloc_node(sizeof(struct sched_entity_stats),
> GFP_KERNEL, cpu_to_node(i));
> if (!se)
> - goto err_free_rq;
> + goto err_free;
>
> init_cfs_rq(cfs_rq);
> init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
> @@ -12470,9 +12470,18 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>
> return 1;
>
> -err_free_rq:
> - kfree(cfs_rq);
> -err:
> +err_free:
> + for_each_possible_cpu(i) {
> + kfree(tg->cfs_rq[i]);
> + kfree(tg->se[i]);
> +
> + if (!tg->cfs_rq[i] && !tg->se[i])
> + break;
> + }
> + kfree(tg->se);
> +err_free_rq_pointer:
> + kfree(tg->cfs_rq);
> +
Not sure if I overlooked something, but if alloc_fair_sched_group() fails
in sched_create_group(), would sched_free_group()->free_fair_sched_group()
do the cleanup?

thanks,
Chenyu

2023-07-18 16:25:56

by Vincent Guittot

Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Tue, 18 Jul 2023 at 15:41, Aaron Lu <[email protected]> wrote:
>
> When a workload involves many wake time task migrations, tg->load_avg
> can be heavily contended among CPUs because every migration involves
> removing the task's load from its src cfs_rq and attach that load to
> its new cfs_rq. Both the remove and attach requires an update to
> tg->load_avg as well as propagating the change up the hierarchy.
>
> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> Sappire Rapids, during a 5s window, the wakeup number is 14millions and
> migration number is 11millions. Since the workload can trigger many
> wakeups and migrations, the access(both read and write) to tg->load_avg
> can be unbound. For the above said workload, the profile shows
> update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and
> migration number is 14millions; update_cfs_group() costs ~25% and
> update_load_avg() costs ~16%.
>
> This patch is an attempt to reduce the cost of accessing tg->load_avg.

Here you mention tg->load_avg, which is updated with update_tg_load_avg(),

but your patch below modifies the local update of the cfs_rq's util_avg,
runnable_avg and load_avg, which need to be kept up to date.

You would be better off delaying or rate limiting the calls to
update_tg_load_avg() or calc_group_shares()/update_cfs_group(), which
access tg->load_avg and are the culprit, instead of modifying other
places.

Have you tried to remove update_cfs_group() from enqueue/dequeue and
only let the tick update the shares periodically?

Have you tried to make update_tg_load_avg() return early? The change of
the cfs_rq's load_avg is written to tg->load_avg only when the change is
bigger than 1.5%. Maybe increase it to 6%?

Or, like for update_cfs_group(), remove it from attach/detach entity and
let the periodic update propagate the change.

But please don't touch the local updates of *_avg.

>
> Current logic will immediately do a update_tg_load_avg() if cfs_rq has
> removed load; this patch changes this behavior: if this cfs_rq has
> removed load as discovered by update_cfs_rq_load_avg(), it didn't call
> update_tg_load_avg() or propagate the removed load immediately, instead,
> the update to tg->load_avg and propagated load can be dealed with by a
> following event like task attached to this cfs_rq or in
> update_blocked_averages(). This way, the call to update_tg_load_avg()
> for this cfs_rq and its ancestors can be reduced by about half.
>
> ================================================
> postgres_sysbench(transaction, higher is better)
> nr_thread=100%/75%/50% were tested on 2 sockets SPR and Icelake and
> results that have a measuable difference are:
>
> nr_thread=100% on SPR:
> base: 90569.11±1.15%
> node: 104152.26±0.34% +15.0%
> delay: 127309.46±4.25% +40.6%
>
> nr_thread=75% on SPR:
> base: 100803.96±0.57%
> node: 107333.58±0.44% +6.5%
> delay: 124332.39±0.51% +23.3%
>
> nr_thread=75% on ICL:
> base: 61961.26±0.41%
> node: 61585.45±0.50%
> delay: 72420.52±0.14% +16.9%
>
> =======================================================================
> hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
> group=1/4/8/16 were tested on 2 sockets SPR and Cascade lake and the
> results that have a measuable difference are:
>
> group=8 on SPR:
> base: 437163±2.6%
> node: 471203±1.2% +7.8%
> delay: 490780±0.9% +12.3%
>
> group=16 on SPR:
> base: 468279±1.9%
> node: 580385±1.7% +23.9%
> delay: 664422±0.2% +41.9%
>
> ================================================
> netperf/TCP_STRAM (throughput, higher is better)
> nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
> Lake and results that have a measuable difference are:
>
> nr_thread=50% on CSL:
> base: 16258±0.7%
> node: 16172±2.9%
> delay: 17729±0.7% +9.0%
>
> nr_thread=75% on CSL:
> base: 12923±1.2%
> node: 13011±2.2%
> delay: 15452±1.6% +19.6%
>
> nr_thread=75% on SPR:
> base: 16232±11.9%
> node: 13962±5.1%
> delay: 21089±0.8% +29.9%
>
> nr_thread=100% on SPR:
> base: 13220±0.6%
> node: 13113±0.0%
> delay: 18258±11.3% +38.1%
>
> =============================================
> netperf/UDP_RR (throughput, higher is better)
> nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
> Lake and results that have measuable difference are:
>
> nr_thread=1 on CSL:
> base: 128521±0.5%
> node: 127935±0.6%
> delay: 126317±0.4% -1.7%
>
> nr_thread=75% on CSL:
> base: 36701±1.7%
> node: 39949±1.4% +8.8%
> delay: 42516±0.3% +15.8%
>
> nr_thread=75% on SPR:
> base: 14249±3.8%
> node: 19890±2.0% +39.6%
> delay: 31331±0.5% +119.9%
>
> nr_thread=100% on CSL:
> base: 52275±0.6%
> node: 53827±0.4% +3.0%
> delay: 78386±0.7% +49.9%
>
> nr_thread=100% on SPR:
> base: 9560±1.6%
> node: 14186±3.9% +48.4%
> delay: 20779±2.8% +117.4%
>
> Signed-off-by: Aaron Lu <[email protected]>
> ---
> kernel/sched/fair.c | 23 ++++++++++++++++++-----
> kernel/sched/sched.h | 1 +
> 2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aceb8f5922cb..564ffe3e59c1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> if (child_cfs_rq_on_list(cfs_rq))
> return false;
>
> + if (cfs_rq->prop_removed_sum)
> + return false;
> +
> return true;
> }
>
> @@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> {
> cfs_rq->propagate = 1;
> cfs_rq->prop_runnable_sum += runnable_sum;
> +
> + if (cfs_rq->prop_removed_sum) {
> + cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
> + cfs_rq->prop_removed_sum = 0;
> + }
> }
>
> /* Update task and its cfs_rq load average */
> @@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> * removed_runnable is the unweighted version of removed_load so we
> * can use it to estimate removed_load_sum.
> */
> - add_tg_cfs_propagate(cfs_rq,
> - -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> -
> - decayed = 1;
> + cfs_rq->prop_removed_sum +=
> + -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
> }
>
> - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> u64_u32_store_copy(sa->last_update_time,
> cfs_rq->last_update_time_copy,
> sa->last_update_time);
> @@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
>
> if (cfs_rq == &rq->cfs)
> decayed = true;
> +
> + /*
> + * If the aggregated removed_sum hasn't been taken care of,
> + * deal with it now before this cfs_rq is removed from the list.
> + */
> + if (cfs_rq->prop_removed_sum)
> + add_tg_cfs_propagate(cfs_rq, 0);
> }
>
> /* Propagate pending load changes to the parent, if any: */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9cece2dbc95b..ab540b21d071 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -619,6 +619,7 @@ struct cfs_rq {
> unsigned long tg_load_avg_contrib;
> long propagate;
> long prop_runnable_sum;
> + long prop_removed_sum;
>
> /*
> * h_load = weight * f(tg)
> --
> 2.41.0
>

2023-07-19 02:54:31

by Aaron Lu

Subject: Re: [PATCH 1/4] sched/fair: free allocated memory on error in alloc_fair_sched_group()

On Tue, Jul 18, 2023 at 11:13:12PM +0800, Chen Yu wrote:
> On 2023-07-18 at 21:41:17 +0800, Aaron Lu wrote:
> > There is one struct cfs_rq and one struct se on each cpu for a taskgroup
> > and when allocation for tg->cfs_rq[X] failed, the already allocated
> > tg->cfs_rq[0]..tg->cfs_rq[X-1] should be freed. The same for tg->se.
> >
> > Signed-off-by: Aaron Lu <[email protected]>
> > ---
> > kernel/sched/fair.c | 23 ++++++++++++++++-------
> > 1 file changed, 16 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..0f913487928d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12443,10 +12443,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> >
> > tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
> > if (!tg->cfs_rq)
> > - goto err;
> > + return 0;
> > tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
> > if (!tg->se)
> > - goto err;
> > + goto err_free_rq_pointer;
> >
> > tg->shares = NICE_0_LOAD;
> >
> > @@ -12456,12 +12456,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> > cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> > GFP_KERNEL, cpu_to_node(i));
> > if (!cfs_rq)
> > - goto err;
> > + goto err_free;
> >
> > se = kzalloc_node(sizeof(struct sched_entity_stats),
> > GFP_KERNEL, cpu_to_node(i));
> > if (!se)
> > - goto err_free_rq;
> > + goto err_free;
> >
> > init_cfs_rq(cfs_rq);
> > init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
> > @@ -12470,9 +12470,18 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> >
> > return 1;
> >
> > -err_free_rq:
> > - kfree(cfs_rq);
> > -err:
> > +err_free:
> > + for_each_possible_cpu(i) {
> > + kfree(tg->cfs_rq[i]);
> > + kfree(tg->se[i]);
> > +
> > + if (!tg->cfs_rq[i] && !tg->se[i])
> > + break;
> > + }
> > + kfree(tg->se);
> > +err_free_rq_pointer:
> > + kfree(tg->cfs_rq);
> > +
> Not sure if I overlooked, if alloc_fair_sched_group() fails in sched_create_group(),
> would sched_free_group()->free_fair_sched_group() do the cleanup?

You are right, I overlooked... Thanks for pointing this out.

2023-07-19 05:31:56

by Aaron Lu

Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> On Tue, 18 Jul 2023 at 15:41, Aaron Lu <[email protected]> wrote:
> >
> > When a workload involves many wake time task migrations, tg->load_avg
> > can be heavily contended among CPUs because every migration involves
> > removing the task's load from its src cfs_rq and attach that load to
> > its new cfs_rq. Both the remove and attach requires an update to
> > tg->load_avg as well as propagating the change up the hierarchy.
> >
> > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> > Sappire Rapids, during a 5s window, the wakeup number is 14millions and
> > migration number is 11millions. Since the workload can trigger many
> > wakeups and migrations, the access(both read and write) to tg->load_avg
> > can be unbound. For the above said workload, the profile shows
> > update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> > netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and
> > migration number is 14millions; update_cfs_group() costs ~25% and
> > update_load_avg() costs ~16%.
> >
> > This patch is an attempt to reduce the cost of accessing tg->load_avg.
>
> here you mention tg->load_avg which is updated with update_tg_load_avg()
>
> but your patch below modifies the local update of cfs's util_avg,
> runnable_avg and load_avg which need to be maintained up to date

Yes, since it delays propagating the removed load, the upper cfs_rqs'
*_avg could be updated later than in the current code.

> You should be better to delay or rate limit the call to
> update_tg_load_avg() or calc_group_shares()/update_cfs_group() which
> access tg->load_avg and are the culprit instead of modifying other
> place.

Thanks for the suggestion and I think it makes perfect sense.

I tried the below to rate limit the update to tg->load_avg to at most
once per ms in update_tg_load_avg():

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e48fd0e6982d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+ u64 now = cfs_rq_clock_pelt(cfs_rq);

/*
* No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
if (cfs_rq->tg == &root_task_group)
return;

- if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+ if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
+ abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
atomic_long_add(delta, &cfs_rq->tg->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+ cfs_rq->last_update_tg_load_avg = now;
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14dfaafb3a8f..b5201d44d820 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {

#ifdef CONFIG_FAIR_GROUP_SCHED
unsigned long tg_load_avg_contrib;
+ u64 last_update_tg_load_avg;
long propagate;
long prop_runnable_sum;

From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR,
this alone already delivers good results. For postgres_sysbench, the two
functions' cost dropped to 1%-2% each; and for netperf/UDP_RR, the two
functions' cost dropped to ~2% and ~4%. Together with some more rate
limiting on the read side, I think the end result will be better. Does
the above look reasonable to you?

Alternatively, I can remove some callsites of update_tg_load_avg() like
you suggested below and only call update_tg_load_avg() when the cfs_rq is
decayed (really just decayed, not when it has removed load pending or
load propagated from its children). This way it would give a similar
result as above (roughly once per ms).

>
> Have you tried to remove update_cfs_group() from enqueue/dequeue and
> only let the tick update the share periodically ?

patch4 kind of did that :-)

>
> Have you tried to make update_tg_load_avg() return early ? the change
> of cfs' load_avg is written in the tg->load_avg only when the change
> is bigger than 1.5%. maybe increase it to 6% ?

Yes, increasing the delta threshold is also a good way to limit the
updates to tg->load_avg. Will try that too.

>
> Or like for update_cfs_group, remove it from attach/detach entity and
> let the periodic update to propagate the change
>
> But please don't touch local update of *_avg

OK I see.

Thanks again for the comments, they are very helpful.

2023-07-19 08:31:28

by Aaron Lu

Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> > Have you tried to remove update_cfs_group() from enqueue/dequeue and
> > only let the tick update the share periodically ?
>
> patch4 kind of did that :-)
>

More about this.

If I remove update_cfs_group() in dequeue_task_fair() on top of patch4
like this:

From 43d5c12f0b2180c99149e663a71c610e31023d90 Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Wed, 19 Jul 2023 14:51:07 +0800
Subject: [PATCH 1/2] sched/fair: completely remove update_cfs_group() in
dequeue path

---
kernel/sched/fair.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2adb6a6abbce..a21ab72819ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6434,7 +6434,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

update_load_avg(cfs_rq, se, UPDATE_TG);
se_update_runnable(se);
- update_cfs_group(se);

cfs_rq->h_nr_running--;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
--
2.40.1

Then the P95 latency of the schbench workload I described in patch4's
changelog increases to > 1ms (base and patch4's P95 are < 100us):

Latency percentiles (usec) runtime 300 (s) (18504 total samples)
50.0th: 20 (9537 samples)
75.0th: 25 (4869 samples)
90.0th: 29 (2264 samples)
95.0th: 2564 (909 samples)
*99.0th: 20768 (740 samples)
99.5th: 23520 (93 samples)
99.9th: 31520 (74 samples)
min=6, max=40072

If I further remove update_cfs_group() completely from the enqueue path
on top of the last change:

From 4e4cb31590ca2e4080ece9cfa9dfaaf26501c60d Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Wed, 19 Jul 2023 15:36:24 +0800
Subject: [PATCH 2/2] sched/fair: completely remove update_cfs_group() from
enqueue path

---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a21ab72819ce..8fc325112282 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4847,8 +4847,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
se_update_runnable(se);
- if (cfs_rq->nr_running > 0)
- update_cfs_group(se);
account_entity_enqueue(cfs_rq, se);

if (flags & ENQUEUE_WAKEUP)
@@ -6344,7 +6342,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

update_load_avg(cfs_rq, se, UPDATE_TG);
se_update_runnable(se);
- update_cfs_group(se);

cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
--
2.40.1

Then the P50 latency bumps to ~4ms from ~20us:
Latency percentiles (usec) runtime 300 (s) (17940 total samples)
50.0th: 3996 (12092 samples)
75.0th: 4004 (4919 samples)
90.0th: 4004 (0 samples)
95.0th: 4012 (353 samples)
*99.0th: 20000 (487 samples)
99.5th: 20000 (0 samples)
99.9th: 31136 (72 samples)
min=7, max=37402
real 5m36.633s
user 47m33.947s
sys 4m47.097s

So for the read side, maybe just keep what patch4 does?

Thanks,
Aaron

2023-07-19 08:32:08

by Aaron Lu

Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> Alternatively, I can remove some callsites of update_tg_load_avg() like
> you suggested below and only call update_tg_load_avg() when cfs_rq is
> decayed(really just decayed, not when it detected it has removed load
> pending or load propagated from its children). This way it would give us
> similar result as above(roughly once per ms).

Something like this (I think this is better since it removes those
unnecessary calls to update_tg_load_avg(); it is inline, but still):


From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Wed, 19 Jul 2023 13:54:48 +0800
Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed

---
kernel/sched/fair.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..7d5b7352b8b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
}

/* Update task and its cfs_rq load average */
-static inline int propagate_entity_load_avg(struct sched_entity *se)
+static inline void propagate_entity_load_avg(struct sched_entity *se)
{
struct cfs_rq *cfs_rq, *gcfs_rq;

if (entity_is_task(se))
- return 0;
+ return;

gcfs_rq = group_cfs_rq(se);
if (!gcfs_rq->propagate)
- return 0;
+ return;

gcfs_rq->propagate = 0;

@@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)

trace_pelt_cfs_tp(cfs_rq);
trace_pelt_se_tp(se);
-
- return 1;
}

/*
@@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}

-static inline int propagate_entity_load_avg(struct sched_entity *se)
+static inline void propagate_entity_load_avg(struct sched_entity *se)
{
- return 0;
}

static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
@@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
struct sched_avg *sa = &cfs_rq->avg;
- int decayed = 0;
+ int decayed;

if (cfs_rq->removed.nr) {
unsigned long r;
@@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
*/
add_tg_cfs_propagate(cfs_rq,
-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
- decayed = 1;
}

- decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+ decayed = __update_load_avg_cfs_rq(now, cfs_rq);
u64_u32_store_copy(sa->last_update_time,
cfs_rq->last_update_time_copy,
sa->last_update_time);
@@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
__update_load_avg_se(now, cfs_rq, se);

decayed = update_cfs_rq_load_avg(now, cfs_rq);
- decayed |= propagate_entity_load_avg(se);
+ propagate_entity_load_avg(se);

if (!se->avg.last_update_time && (flags & DO_ATTACH)) {

@@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
* IOW we're enqueueing a task on a new CPU.
*/
attach_entity_load_avg(cfs_rq, se);
- update_tg_load_avg(cfs_rq);
-
} else if (flags & DO_DETACH) {
/*
* DO_DETACH means we're here from dequeue_entity()
* and we are migrating task out of the CPU.
*/
detach_entity_load_avg(cfs_rq, se);
- update_tg_load_avg(cfs_rq);
} else if (decayed) {
cfs_rq_util_change(cfs_rq, 0);

--
2.41.0


2023-07-19 09:26:36

by Vincent Guittot

Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Wed, 19 Jul 2023 at 10:11, Aaron Lu <[email protected]> wrote:
>
> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> > > Have you tried to remove update_cfs_group() from enqueue/dequeue and
> > > only let the tick update the share periodically ?
> >
> > patch4 kind of did that :-)
> >
>
> More about this.
>
> If I remove update_cfs_group() in dequeue_task_fair() on top of patch4
> like this:
>
> From 43d5c12f0b2180c99149e663a71c610e31023d90 Mon Sep 17 00:00:00 2001
> From: Aaron Lu <[email protected]>
> Date: Wed, 19 Jul 2023 14:51:07 +0800
> Subject: [PATCH 1/2] sched/fair: completely remove update_cfs_group() in
> dequeue path
>
> ---
> kernel/sched/fair.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2adb6a6abbce..a21ab72819ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6434,7 +6434,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
> se_update_runnable(se);
> - update_cfs_group(se);
>
> cfs_rq->h_nr_running--;
> cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> --
> 2.40.1
>
> Than P95 latency of the schbench workload I described in patch4's
> changelog will increase to > 1ms(base and patch4's P95 < 100us):
>
> Latency percentiles (usec) runtime 300 (s) (18504 total samples)
> 50.0th: 20 (9537 samples)
> 75.0th: 25 (4869 samples)
> 90.0th: 29 (2264 samples)
> 95.0th: 2564 (909 samples)
> *99.0th: 20768 (740 samples)
> 99.5th: 23520 (93 samples)
> 99.9th: 31520 (74 samples)
> min=6, max=40072
>
> If I further remove update_cfs_group() completely in enqueue path on top
> of the last change:
>
> From 4e4cb31590ca2e4080ece9cfa9dfaaf26501c60d Mon Sep 17 00:00:00 2001
> From: Aaron Lu <[email protected]>
> Date: Wed, 19 Jul 2023 15:36:24 +0800
> Subject: [PATCH 2/2] sched/fair: completely remove update_cfs_group() from
> enqueue path
>
> ---
> kernel/sched/fair.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a21ab72819ce..8fc325112282 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4847,8 +4847,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> */
> update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
> se_update_runnable(se);
> - if (cfs_rq->nr_running > 0)
> - update_cfs_group(se);
> account_entity_enqueue(cfs_rq, se);
>
> if (flags & ENQUEUE_WAKEUP)
> @@ -6344,7 +6342,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
> se_update_runnable(se);
> - update_cfs_group(se);
>
> cfs_rq->h_nr_running++;
> cfs_rq->idle_h_nr_running += idle_h_nr_running;
> --
> 2.40.1
>
> Then P50's latency will bump to ~4ms from ~20us:
> Latency percentiles (usec) runtime 300 (s) (17940 total samples)
> 50.0th: 3996 (12092 samples)
> 75.0th: 4004 (4919 samples)
> 90.0th: 4004 (0 samples)
> 95.0th: 4012 (353 samples)
> *99.0th: 20000 (487 samples)
> 99.5th: 20000 (0 samples)
> 99.9th: 31136 (72 samples)
> min=7, max=37402
> real 5m36.633s
> user 47m33.947s
> sys 4m47.097s
>
> So for the read side, maybe just keep what patch4 does?

Yes, skipping update_cfs_group() at enqueue bypasses the opportunity to
increase the share and get more running time for the group until the
update really happens.

>
> Thanks,
> Aaron

2023-07-19 09:29:04

by Vincent Guittot

Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Wed, 19 Jul 2023 at 07:18, Aaron Lu <[email protected]> wrote:
>
> On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> > On Tue, 18 Jul 2023 at 15:41, Aaron Lu <[email protected]> wrote:
> > >
> > > When a workload involves many wake time task migrations, tg->load_avg
> > > can be heavily contended among CPUs because every migration involves
> > > removing the task's load from its src cfs_rq and attach that load to
> > > its new cfs_rq. Both the remove and attach requires an update to
> > > tg->load_avg as well as propagating the change up the hierarchy.
> > >
> > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> > > Sappire Rapids, during a 5s window, the wakeup number is 14millions and
> > > migration number is 11millions. Since the workload can trigger many
> > > wakeups and migrations, the access(both read and write) to tg->load_avg
> > > can be unbound. For the above said workload, the profile shows
> > > update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> > > netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and
> > > migration number is 14millions; update_cfs_group() costs ~25% and
> > > update_load_avg() costs ~16%.
> > >
> > > This patch is an attempt to reduce the cost of accessing tg->load_avg.
> >
> > here you mention tg->load_avg which is updated with update_tg_load_avg()
> >
> > but your patch below modifies the local update of cfs's util_avg,
> > runnable_avg and load_avg which need to be maintained up to date
>
> Yes, since it delayed propagating the removed load, the upper cfs_rqs'
> *_avg could be updated later than current code.
>
> > You should be better to delay or rate limit the call to
> > update_tg_load_avg() or calc_group_shares()/update_cfs_group() which
> > access tg->load_avg and are the culprit instead of modifying other
> > place.
>
> Thanks for the suggestion and I think it makes perfect sense.
>
> I tried below to rate limit the update to tg->load_avg at most once per
> ms in update_tg_load_avg():
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e48fd0e6982d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> {
> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> + u64 now = cfs_rq_clock_pelt(cfs_rq);
>
> /*
> * No need to update load_avg for root_task_group as it is not used.
> @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> if (cfs_rq->tg == &root_task_group)
> return;
>
> - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> atomic_long_add(delta, &cfs_rq->tg->load_avg);
> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> + cfs_rq->last_update_tg_load_avg = now;
> }
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 14dfaafb3a8f..b5201d44d820 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -594,6 +594,7 @@ struct cfs_rq {
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> unsigned long tg_load_avg_contrib;
> + u64 last_update_tg_load_avg;
> long propagate;
> long prop_runnable_sum;
>
> From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR,
> this alone already delivers good results. For postgres_sysbench, the two
> functions cost dropped to 1%-2% each; and for netperf/UDP_RR, the two
> functions cost dropped to ~2% and ~4%. Togerther with some more rate
> limiting on the read side, I think the end result will be better. Does
> the above look reasonable to you?

Yes, that seems a better way to limit write access

>
> Alternatively, I can remove some callsites of update_tg_load_avg() like
> you suggested below and only call update_tg_load_avg() when cfs_rq is
> decayed(really just decayed, not when it detected it has removed load
> pending or load propagated from its children). This way it would give us
> similar result as above(roughly once per ms).
>
> >
> > Have you tried to remove update_cfs_group() from enqueue/dequeue and
> > only let the tick update the share periodically ?
>
> patch4 kind of did that :-)

That's what I have noticed when reading patch 4 :-)

>
> >
> > Have you tried to make update_tg_load_avg() return early ? the change
> > of cfs' load_avg is written in the tg->load_avg only when the change
> > is bigger than 1.5%. maybe increase it to 6% ?
>
> Yes, increase the delta is also a good way to limit the update to
> tg->load_avg. Will try that too.
>
> >
> > Or like for update_cfs_group, remove it from attach/detach entity and
> > let the periodic update to propagate the change
> >
> > But please don't touch local update of *_avg
>
> OK I see.
>
> Thanks again for the comments, they are very helpful.

2023-07-19 10:04:27

by Vincent Guittot

Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Wed, 19 Jul 2023 at 10:01, Aaron Lu <[email protected]> wrote:
>
> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > decayed(really just decayed, not when it detected it has removed load
> > pending or load propagated from its children). This way it would give us
> > similar result as above(roughly once per ms).
>
> Something like this: (I think this is better since it removed those
> unnecessary calls to update_tg_load_avg(), although it is inline but
> still)
>
>
> From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> From: Aaron Lu <[email protected]>
> Date: Wed, 19 Jul 2023 13:54:48 +0800
> Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
>
> ---
> kernel/sched/fair.c | 22 +++++++---------------
> 1 file changed, 7 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..7d5b7352b8b5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> }
>
> /* Update task and its cfs_rq load average */
> -static inline int propagate_entity_load_avg(struct sched_entity *se)
> +static inline void propagate_entity_load_avg(struct sched_entity *se)
> {
> struct cfs_rq *cfs_rq, *gcfs_rq;
>
> if (entity_is_task(se))
> - return 0;
> + return;
>
> gcfs_rq = group_cfs_rq(se);
> if (!gcfs_rq->propagate)
> - return 0;
> + return;
>
> gcfs_rq->propagate = 0;
>
> @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
>
> trace_pelt_cfs_tp(cfs_rq);
> trace_pelt_se_tp(se);
> -
> - return 1;
> }
>
> /*
> @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
>
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
>
> -static inline int propagate_entity_load_avg(struct sched_entity *se)
> +static inline void propagate_entity_load_avg(struct sched_entity *se)
> {
> - return 0;
> }
>
> static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> {
> unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> struct sched_avg *sa = &cfs_rq->avg;
> - int decayed = 0;
> + int decayed;
>
> if (cfs_rq->removed.nr) {
> unsigned long r;
> @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> */
> add_tg_cfs_propagate(cfs_rq,
> -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> -
> - decayed = 1;
> }
>
> - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> u64_u32_store_copy(sa->last_update_time,
> cfs_rq->last_update_time_copy,
> sa->last_update_time);
> @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> __update_load_avg_se(now, cfs_rq, se);
>
> decayed = update_cfs_rq_load_avg(now, cfs_rq);
> - decayed |= propagate_entity_load_avg(se);
> + propagate_entity_load_avg(se);

but then you also skip the call to cfs_rq_util_change()
>
> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>
> @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> * IOW we're enqueueing a task on a new CPU.
> */
> attach_entity_load_avg(cfs_rq, se);
> - update_tg_load_avg(cfs_rq);
> -
> } else if (flags & DO_DETACH) {
> /*
> * DO_DETACH means we're here from dequeue_entity()
> * and we are migrating task out of the CPU.
> */
> detach_entity_load_avg(cfs_rq, se);
> - update_tg_load_avg(cfs_rq);
> } else if (decayed) {
> cfs_rq_util_change(cfs_rq, 0);
>
> --
> 2.41.0
>

2023-07-19 12:18:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node

On Tue, Jul 18, 2023 at 09:41:18PM +0800, Aaron Lu wrote:
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> +static inline long tg_load_avg(struct task_group *tg)
> +{
> + long load_avg = 0;
> + int i;
> +
> + /*
> + * The only path that can give us a root_task_group
> + * here is from print_cfs_rq() thus unlikely.
> + */
> + if (unlikely(tg == &root_task_group))
> + return 0;
> +
> + for_each_node(i)
> + load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
> +
> + return load_avg;
> +}
> +#endif

So I was working on something else NUMA-related and noticed that
for_each_node() (and most of the nodemask stuff) is quite moronic;
afaict we should do something like the below.

I now see Mike added the nr_node_ids thing fairly recently, but given
distros have NODES_SHIFT=10 and actual machines typically only have <=4
nodes, this would save a factor of 256 in scanning.

Specifically, your for_each_node() would scan the full 1024 bit bitmap
looking for more bits that would never be there.
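
For reference, a condensed sketch of what for_each_node() expands to today
(simplified from include/linux/nodemask.h around this series' base, so the
exact definitions may differ slightly in other trees):

	/* walks set bits in a nodemask, scanning up to MAX_NUMNODES bits */
	#define for_each_node_mask(node, mask)				\
		for ((node) = first_node(mask);				\
		     (node) < MAX_NUMNODES;				\
		     (node) = next_node((node), (mask)))

	#define for_each_node_state(node, __state)			\
		for_each_node_mask((node), node_states[__state])

	/* the iterator used by tg_load_avg() above */
	#define for_each_node(node)	for_each_node_state(node, N_POSSIBLE)

With NODES_SHIFT=10 the underlying bitmap is 1024 bits, so first_node() and
next_node() keep searching word after word long past the handful of bits
that can actually be set.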

---

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 8d07116caaf1..c23c0889b8cf 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -109,7 +109,7 @@ extern nodemask_t _unused_nodemask_arg_;
__nodemask_pr_bits(maskp)
static inline unsigned int __nodemask_pr_numnodes(const nodemask_t *m)
{
- return m ? MAX_NUMNODES : 0;
+ return m ? nr_node_ids : 0;
}
static inline const unsigned long *__nodemask_pr_bits(const nodemask_t *m)
{
@@ -137,13 +137,13 @@ static inline void __node_clear(int node, volatile nodemask_t *dstp)
clear_bit(node, dstp->bits);
}

-#define nodes_setall(dst) __nodes_setall(&(dst), MAX_NUMNODES)
+#define nodes_setall(dst) __nodes_setall(&(dst), nr_node_ids)
static inline void __nodes_setall(nodemask_t *dstp, unsigned int nbits)
{
bitmap_fill(dstp->bits, nbits);
}

-#define nodes_clear(dst) __nodes_clear(&(dst), MAX_NUMNODES)
+#define nodes_clear(dst) __nodes_clear(&(dst), nr_node_ids)
static inline void __nodes_clear(nodemask_t *dstp, unsigned int nbits)
{
bitmap_zero(dstp->bits, nbits);
@@ -160,7 +160,7 @@ static inline bool __node_test_and_set(int node, nodemask_t *addr)
}

#define nodes_and(dst, src1, src2) \
- __nodes_and(&(dst), &(src1), &(src2), MAX_NUMNODES)
+ __nodes_and(&(dst), &(src1), &(src2), nr_node_ids)
static inline void __nodes_and(nodemask_t *dstp, const nodemask_t *src1p,
const nodemask_t *src2p, unsigned int nbits)
{
@@ -168,7 +168,7 @@ static inline void __nodes_and(nodemask_t *dstp, const nodemask_t *src1p,
}

#define nodes_or(dst, src1, src2) \
- __nodes_or(&(dst), &(src1), &(src2), MAX_NUMNODES)
+ __nodes_or(&(dst), &(src1), &(src2), nr_node_ids)
static inline void __nodes_or(nodemask_t *dstp, const nodemask_t *src1p,
const nodemask_t *src2p, unsigned int nbits)
{
@@ -176,7 +176,7 @@ static inline void __nodes_or(nodemask_t *dstp, const nodemask_t *src1p,
}

#define nodes_xor(dst, src1, src2) \
- __nodes_xor(&(dst), &(src1), &(src2), MAX_NUMNODES)
+ __nodes_xor(&(dst), &(src1), &(src2), nr_node_ids)
static inline void __nodes_xor(nodemask_t *dstp, const nodemask_t *src1p,
const nodemask_t *src2p, unsigned int nbits)
{
@@ -184,7 +184,7 @@ static inline void __nodes_xor(nodemask_t *dstp, const nodemask_t *src1p,
}

#define nodes_andnot(dst, src1, src2) \
- __nodes_andnot(&(dst), &(src1), &(src2), MAX_NUMNODES)
+ __nodes_andnot(&(dst), &(src1), &(src2), nr_node_ids)
static inline void __nodes_andnot(nodemask_t *dstp, const nodemask_t *src1p,
const nodemask_t *src2p, unsigned int nbits)
{
@@ -192,7 +192,7 @@ static inline void __nodes_andnot(nodemask_t *dstp, const nodemask_t *src1p,
}

#define nodes_complement(dst, src) \
- __nodes_complement(&(dst), &(src), MAX_NUMNODES)
+ __nodes_complement(&(dst), &(src), nr_node_ids)
static inline void __nodes_complement(nodemask_t *dstp,
const nodemask_t *srcp, unsigned int nbits)
{
@@ -200,7 +200,7 @@ static inline void __nodes_complement(nodemask_t *dstp,
}

#define nodes_equal(src1, src2) \
- __nodes_equal(&(src1), &(src2), MAX_NUMNODES)
+ __nodes_equal(&(src1), &(src2), nr_node_ids)
static inline bool __nodes_equal(const nodemask_t *src1p,
const nodemask_t *src2p, unsigned int nbits)
{
@@ -208,7 +208,7 @@ static inline bool __nodes_equal(const nodemask_t *src1p,
}

#define nodes_intersects(src1, src2) \
- __nodes_intersects(&(src1), &(src2), MAX_NUMNODES)
+ __nodes_intersects(&(src1), &(src2), nr_node_ids)
static inline bool __nodes_intersects(const nodemask_t *src1p,
const nodemask_t *src2p, unsigned int nbits)
{
@@ -216,33 +216,33 @@ static inline bool __nodes_intersects(const nodemask_t *src1p,
}

#define nodes_subset(src1, src2) \
- __nodes_subset(&(src1), &(src2), MAX_NUMNODES)
+ __nodes_subset(&(src1), &(src2), nr_node_ids)
static inline bool __nodes_subset(const nodemask_t *src1p,
const nodemask_t *src2p, unsigned int nbits)
{
return bitmap_subset(src1p->bits, src2p->bits, nbits);
}

-#define nodes_empty(src) __nodes_empty(&(src), MAX_NUMNODES)
+#define nodes_empty(src) __nodes_empty(&(src), nr_node_ids)
static inline bool __nodes_empty(const nodemask_t *srcp, unsigned int nbits)
{
return bitmap_empty(srcp->bits, nbits);
}

-#define nodes_full(nodemask) __nodes_full(&(nodemask), MAX_NUMNODES)
+#define nodes_full(nodemask) __nodes_full(&(nodemask), nr_node_ids)
static inline bool __nodes_full(const nodemask_t *srcp, unsigned int nbits)
{
return bitmap_full(srcp->bits, nbits);
}

-#define nodes_weight(nodemask) __nodes_weight(&(nodemask), MAX_NUMNODES)
+#define nodes_weight(nodemask) __nodes_weight(&(nodemask), nr_node_ids)
static inline int __nodes_weight(const nodemask_t *srcp, unsigned int nbits)
{
return bitmap_weight(srcp->bits, nbits);
}

#define nodes_shift_right(dst, src, n) \
- __nodes_shift_right(&(dst), &(src), (n), MAX_NUMNODES)
+ __nodes_shift_right(&(dst), &(src), (n), nr_node_ids)
static inline void __nodes_shift_right(nodemask_t *dstp,
const nodemask_t *srcp, int n, int nbits)
{
@@ -250,7 +250,7 @@ static inline void __nodes_shift_right(nodemask_t *dstp,
}

#define nodes_shift_left(dst, src, n) \
- __nodes_shift_left(&(dst), &(src), (n), MAX_NUMNODES)
+ __nodes_shift_left(&(dst), &(src), (n), nr_node_ids)
static inline void __nodes_shift_left(nodemask_t *dstp,
const nodemask_t *srcp, int n, int nbits)
{
@@ -385,7 +385,7 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
#if MAX_NUMNODES > 1
#define for_each_node_mask(node, mask) \
for ((node) = first_node(mask); \
- (node) < MAX_NUMNODES; \
+ (node) < nr_node_ids; \
(node) = next_node((node), (mask)))
#else /* MAX_NUMNODES == 1 */
#define for_each_node_mask(node, mask) \

2023-07-19 13:55:51

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <[email protected]> wrote:
> >
> > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > decayed(really just decayed, not when it detected it has removed load
> > > pending or load propagated from its children). This way it would give us
> > > similar result as above(roughly once per ms).
> >
> > Something like this: (I think this is better since it removed those
> > unnecessary calls to update_tg_load_avg(), although it is inline but
> > still)
> >
> >
> > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > From: Aaron Lu <[email protected]>
> > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> >
> > ---
> > kernel/sched/fair.c | 22 +++++++---------------
> > 1 file changed, 7 insertions(+), 15 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..7d5b7352b8b5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > }
> >
> > /* Update task and its cfs_rq load average */
> > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > {
> > struct cfs_rq *cfs_rq, *gcfs_rq;
> >
> > if (entity_is_task(se))
> > - return 0;
> > + return;
> >
> > gcfs_rq = group_cfs_rq(se);
> > if (!gcfs_rq->propagate)
> > - return 0;
> > + return;
> >
> > gcfs_rq->propagate = 0;
> >
> > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> >
> > trace_pelt_cfs_tp(cfs_rq);
> > trace_pelt_se_tp(se);
> > -
> > - return 1;
> > }
> >
> > /*
> > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> >
> > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> >
> > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > {
> > - return 0;
> > }
> >
> > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > {
> > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > struct sched_avg *sa = &cfs_rq->avg;
> > - int decayed = 0;
> > + int decayed;
> >
> > if (cfs_rq->removed.nr) {
> > unsigned long r;
> > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > */
> > add_tg_cfs_propagate(cfs_rq,
> > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > -
> > - decayed = 1;
> > }
> >
> > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > u64_u32_store_copy(sa->last_update_time,
> > cfs_rq->last_update_time_copy,
> > sa->last_update_time);
> > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > __update_load_avg_se(now, cfs_rq, se);
> >
> > decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > - decayed |= propagate_entity_load_avg(se);
> > + propagate_entity_load_avg(se);
>
> but then you also skip the call to cfs_rq_util_change()

Ah right, I missed that, thanks for catching this.

Updated:

From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Wed, 19 Jul 2023 13:54:48 +0800
Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed

---
kernel/sched/fair.c | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..8d4b9e0a19b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
struct sched_avg *sa = &cfs_rq->avg;
- int decayed = 0;
+ int decayed;

if (cfs_rq->removed.nr) {
unsigned long r;
@@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
*/
add_tg_cfs_propagate(cfs_rq,
-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
- decayed = 1;
}

- decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+ decayed = __update_load_avg_cfs_rq(now, cfs_rq);
u64_u32_store_copy(sa->last_update_time,
cfs_rq->last_update_time_copy,
sa->last_update_time);
@@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 now = cfs_rq_clock_pelt(cfs_rq);
- int decayed;
+ int decayed, propagated;

/*
* Track task load average for carrying it to new CPU after migrated, and
@@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
__update_load_avg_se(now, cfs_rq, se);

decayed = update_cfs_rq_load_avg(now, cfs_rq);
- decayed |= propagate_entity_load_avg(se);
+ propagated = propagate_entity_load_avg(se);

if (!se->avg.last_update_time && (flags & DO_ATTACH)) {

@@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
* IOW we're enqueueing a task on a new CPU.
*/
attach_entity_load_avg(cfs_rq, se);
- update_tg_load_avg(cfs_rq);
-
} else if (flags & DO_DETACH) {
/*
* DO_DETACH means we're here from dequeue_entity()
* and we are migrating task out of the CPU.
*/
detach_entity_load_avg(cfs_rq, se);
- update_tg_load_avg(cfs_rq);
- } else if (decayed) {
+ } else if (decayed || propagated) {
cfs_rq_util_change(cfs_rq, 0);

- if (flags & UPDATE_TG)
+ if (decayed && (flags & UPDATE_TG))
update_tg_load_avg(cfs_rq);
}
}
--
2.41.0


> >
> > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> >
> > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > * IOW we're enqueueing a task on a new CPU.
> > */
> > attach_entity_load_avg(cfs_rq, se);
> > - update_tg_load_avg(cfs_rq);
> > -
> > } else if (flags & DO_DETACH) {
> > /*
> > * DO_DETACH means we're here from dequeue_entity()
> > * and we are migrating task out of the CPU.
> > */
> > detach_entity_load_avg(cfs_rq, se);
> > - update_tg_load_avg(cfs_rq);
> > } else if (decayed) {
> > cfs_rq_util_change(cfs_rq, 0);
> >
> > --
> > 2.41.0
> >

2023-07-19 14:08:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node

On Wed, Jul 19, 2023 at 09:45:00PM +0800, Aaron Lu wrote:
> I'll rebase this per-node patch on top of below diff.

Oh, please double check I didn't wreck anything. I skipped 'converting'
the find_*_bit() functions because the users of those iterators might be
expecting MAX_NUMNODES when not found.

But perhaps there's more I overlooked.
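
For context -- an illustrative pattern only, not pointing at any specific
caller: next_node() and friends hand back MAX_NUMNODES as the "nothing
left" sentinel, and some users test for exactly that, e.g.

	node = next_node(node, node_possible_map);
	if (node == MAX_NUMNODES)			/* sentinel: no more set bits */
		node = first_node(node_possible_map);	/* wrap around */

so converting those helpers to stop at nr_node_ids would change the value
such callers compare against.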

2023-07-19 14:34:53

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node

On Wed, Jul 19, 2023 at 03:53:05PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 19, 2023 at 09:45:00PM +0800, Aaron Lu wrote:
> > I'll rebase this per-node patch on top of below diff.
>
> Oh, please double check I didn't wreck anything. I skipped 'converting'
> the find_*_bit() functions because the users of those iterators might be
> expecting MAX_NUMNODES when not found.
>

Sounds like more work than I had expected :)

> But perhaps there's more I overlooked.

Got it, will see if I can make it work.

2023-07-19 14:42:17

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node

On Wed, Jul 19, 2023 at 01:53:58PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 18, 2023 at 09:41:18PM +0800, Aaron Lu wrote:
> > +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> > +static inline long tg_load_avg(struct task_group *tg)
> > +{
> > + long load_avg = 0;
> > + int i;
> > +
> > + /*
> > + * The only path that can give us a root_task_group
> > + * here is from print_cfs_rq() thus unlikely.
> > + */
> > + if (unlikely(tg == &root_task_group))
> > + return 0;
> > +
> > + for_each_node(i)
> > + load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
> > +
> > + return load_avg;
> > +}
> > +#endif
>
> So I was working on something else numa and noticed that for_each_node()
> (and most of the nodemask stuff) is quite moronic, afaict we should do
> something like the below.
>
> I now see Mike added the nr_node_ids thing fairly recent, but given
> distros have NODES_SHIFT=10 and actual machines typically only have <=4
> nodes, this would save a factor of 256 scanning.

Nice :-)

>
> Specifically, your for_each_node() would scan the full 1024 bit bitmap
> looking for more bits that would never be there.

Yes indeed.
I'll rebase this per-node patch on top of below diff.

Thanks for the info.

>
> ---
>
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 8d07116caaf1..c23c0889b8cf 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -109,7 +109,7 @@ extern nodemask_t _unused_nodemask_arg_;
> __nodemask_pr_bits(maskp)
> static inline unsigned int __nodemask_pr_numnodes(const nodemask_t *m)
> {
> - return m ? MAX_NUMNODES : 0;
> + return m ? nr_node_ids : 0;
> }
> static inline const unsigned long *__nodemask_pr_bits(const nodemask_t *m)
> {
> @@ -137,13 +137,13 @@ static inline void __node_clear(int node, volatile nodemask_t *dstp)
> clear_bit(node, dstp->bits);
> }
>
> -#define nodes_setall(dst) __nodes_setall(&(dst), MAX_NUMNODES)
> +#define nodes_setall(dst) __nodes_setall(&(dst), nr_node_ids)
> static inline void __nodes_setall(nodemask_t *dstp, unsigned int nbits)
> {
> bitmap_fill(dstp->bits, nbits);
> }
>
> -#define nodes_clear(dst) __nodes_clear(&(dst), MAX_NUMNODES)
> +#define nodes_clear(dst) __nodes_clear(&(dst), nr_node_ids)
> static inline void __nodes_clear(nodemask_t *dstp, unsigned int nbits)
> {
> bitmap_zero(dstp->bits, nbits);
> @@ -160,7 +160,7 @@ static inline bool __node_test_and_set(int node, nodemask_t *addr)
> }
>
> #define nodes_and(dst, src1, src2) \
> - __nodes_and(&(dst), &(src1), &(src2), MAX_NUMNODES)
> + __nodes_and(&(dst), &(src1), &(src2), nr_node_ids)
> static inline void __nodes_and(nodemask_t *dstp, const nodemask_t *src1p,
> const nodemask_t *src2p, unsigned int nbits)
> {
> @@ -168,7 +168,7 @@ static inline void __nodes_and(nodemask_t *dstp, const nodemask_t *src1p,
> }
>
> #define nodes_or(dst, src1, src2) \
> - __nodes_or(&(dst), &(src1), &(src2), MAX_NUMNODES)
> + __nodes_or(&(dst), &(src1), &(src2), nr_node_ids)
> static inline void __nodes_or(nodemask_t *dstp, const nodemask_t *src1p,
> const nodemask_t *src2p, unsigned int nbits)
> {
> @@ -176,7 +176,7 @@ static inline void __nodes_or(nodemask_t *dstp, const nodemask_t *src1p,
> }
>
> #define nodes_xor(dst, src1, src2) \
> - __nodes_xor(&(dst), &(src1), &(src2), MAX_NUMNODES)
> + __nodes_xor(&(dst), &(src1), &(src2), nr_node_ids)
> static inline void __nodes_xor(nodemask_t *dstp, const nodemask_t *src1p,
> const nodemask_t *src2p, unsigned int nbits)
> {
> @@ -184,7 +184,7 @@ static inline void __nodes_xor(nodemask_t *dstp, const nodemask_t *src1p,
> }
>
> #define nodes_andnot(dst, src1, src2) \
> - __nodes_andnot(&(dst), &(src1), &(src2), MAX_NUMNODES)
> + __nodes_andnot(&(dst), &(src1), &(src2), nr_node_ids)
> static inline void __nodes_andnot(nodemask_t *dstp, const nodemask_t *src1p,
> const nodemask_t *src2p, unsigned int nbits)
> {
> @@ -192,7 +192,7 @@ static inline void __nodes_andnot(nodemask_t *dstp, const nodemask_t *src1p,
> }
>
> #define nodes_complement(dst, src) \
> - __nodes_complement(&(dst), &(src), MAX_NUMNODES)
> + __nodes_complement(&(dst), &(src), nr_node_ids)
> static inline void __nodes_complement(nodemask_t *dstp,
> const nodemask_t *srcp, unsigned int nbits)
> {
> @@ -200,7 +200,7 @@ static inline void __nodes_complement(nodemask_t *dstp,
> }
>
> #define nodes_equal(src1, src2) \
> - __nodes_equal(&(src1), &(src2), MAX_NUMNODES)
> + __nodes_equal(&(src1), &(src2), nr_node_ids)
> static inline bool __nodes_equal(const nodemask_t *src1p,
> const nodemask_t *src2p, unsigned int nbits)
> {
> @@ -208,7 +208,7 @@ static inline bool __nodes_equal(const nodemask_t *src1p,
> }
>
> #define nodes_intersects(src1, src2) \
> - __nodes_intersects(&(src1), &(src2), MAX_NUMNODES)
> + __nodes_intersects(&(src1), &(src2), nr_node_ids)
> static inline bool __nodes_intersects(const nodemask_t *src1p,
> const nodemask_t *src2p, unsigned int nbits)
> {
> @@ -216,33 +216,33 @@ static inline bool __nodes_intersects(const nodemask_t *src1p,
> }
>
> #define nodes_subset(src1, src2) \
> - __nodes_subset(&(src1), &(src2), MAX_NUMNODES)
> + __nodes_subset(&(src1), &(src2), nr_node_ids)
> static inline bool __nodes_subset(const nodemask_t *src1p,
> const nodemask_t *src2p, unsigned int nbits)
> {
> return bitmap_subset(src1p->bits, src2p->bits, nbits);
> }
>
> -#define nodes_empty(src) __nodes_empty(&(src), MAX_NUMNODES)
> +#define nodes_empty(src) __nodes_empty(&(src), nr_node_ids)
> static inline bool __nodes_empty(const nodemask_t *srcp, unsigned int nbits)
> {
> return bitmap_empty(srcp->bits, nbits);
> }
>
> -#define nodes_full(nodemask) __nodes_full(&(nodemask), MAX_NUMNODES)
> +#define nodes_full(nodemask) __nodes_full(&(nodemask), nr_node_ids)
> static inline bool __nodes_full(const nodemask_t *srcp, unsigned int nbits)
> {
> return bitmap_full(srcp->bits, nbits);
> }
>
> -#define nodes_weight(nodemask) __nodes_weight(&(nodemask), MAX_NUMNODES)
> +#define nodes_weight(nodemask) __nodes_weight(&(nodemask), nr_node_ids)
> static inline int __nodes_weight(const nodemask_t *srcp, unsigned int nbits)
> {
> return bitmap_weight(srcp->bits, nbits);
> }
>
> #define nodes_shift_right(dst, src, n) \
> - __nodes_shift_right(&(dst), &(src), (n), MAX_NUMNODES)
> + __nodes_shift_right(&(dst), &(src), (n), nr_node_ids)
> static inline void __nodes_shift_right(nodemask_t *dstp,
> const nodemask_t *srcp, int n, int nbits)
> {
> @@ -250,7 +250,7 @@ static inline void __nodes_shift_right(nodemask_t *dstp,
> }
>
> #define nodes_shift_left(dst, src, n) \
> - __nodes_shift_left(&(dst), &(src), (n), MAX_NUMNODES)
> + __nodes_shift_left(&(dst), &(src), (n), nr_node_ids)
> static inline void __nodes_shift_left(nodemask_t *dstp,
> const nodemask_t *srcp, int n, int nbits)
> {
> @@ -385,7 +385,7 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
> #if MAX_NUMNODES > 1
> #define for_each_node_mask(node, mask) \
> for ((node) = first_node(mask); \
> - (node) < MAX_NUMNODES; \
> + (node) < nr_node_ids; \
> (node) = next_node((node), (mask)))
> #else /* MAX_NUMNODES == 1 */
> #define for_each_node_mask(node, mask) \

2023-07-19 16:23:33

by Yury Norov

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node

On Wed, Jul 19, 2023 at 01:53:58PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 18, 2023 at 09:41:18PM +0800, Aaron Lu wrote:
> > +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> > +static inline long tg_load_avg(struct task_group *tg)
> > +{
> > + long load_avg = 0;
> > + int i;
> > +
> > + /*
> > + * The only path that can give us a root_task_group
> > + * here is from print_cfs_rq() thus unlikely.
> > + */
> > + if (unlikely(tg == &root_task_group))
> > + return 0;
> > +
> > + for_each_node(i)
> > + load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
> > +
> > + return load_avg;
> > +}
> > +#endif
>
> So I was working on something else numa and noticed that for_each_node()
> (and most of the nodemask stuff) is quite moronic, afaict we should do
> something like the below.
>
> I now see Mike added the nr_node_ids thing fairly recent, but given
> distros have NODES_SHIFT=10 and actual machines typically only have <=4
> nodes, this would save a factor of 256 scanning.
>
> Specifically, your for_each_node() would scan the full 1024 bit bitmap
> looking for more bits that would never be there.
>
> ---
>
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 8d07116caaf1..c23c0889b8cf 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -109,7 +109,7 @@ extern nodemask_t _unused_nodemask_arg_;
> __nodemask_pr_bits(maskp)
> static inline unsigned int __nodemask_pr_numnodes(const nodemask_t *m)
> {
> - return m ? MAX_NUMNODES : 0;
> + return m ? nr_node_ids : 0;
> }
> static inline const unsigned long *__nodemask_pr_bits(const nodemask_t *m)
> {
> @@ -137,13 +137,13 @@ static inline void __node_clear(int node, volatile nodemask_t *dstp)
> clear_bit(node, dstp->bits);
> }
>
> -#define nodes_setall(dst) __nodes_setall(&(dst), MAX_NUMNODES)
> +#define nodes_setall(dst) __nodes_setall(&(dst), nr_node_ids)
> static inline void __nodes_setall(nodemask_t *dstp, unsigned int nbits)
> {
> bitmap_fill(dstp->bits, nbits);
> }
>
> -#define nodes_clear(dst) __nodes_clear(&(dst), MAX_NUMNODES)
> +#define nodes_clear(dst) __nodes_clear(&(dst), nr_node_ids)
> static inline void __nodes_clear(nodemask_t *dstp, unsigned int nbits)
> {
> bitmap_zero(dstp->bits, nbits);
> @@ -160,7 +160,7 @@ static inline bool __node_test_and_set(int node, nodemask_t *addr)
> }
>
> #define nodes_and(dst, src1, src2) \
> - __nodes_and(&(dst), &(src1), &(src2), MAX_NUMNODES)
> + __nodes_and(&(dst), &(src1), &(src2), nr_node_ids)
> static inline void __nodes_and(nodemask_t *dstp, const nodemask_t *src1p,
> const nodemask_t *src2p, unsigned int nbits)
> {

This would break the small_const_nbits() optimization for those configuring
their kernels properly. This is very similar to the cpumasks and nr_cpu_ids
problem.

See 596ff4a09b8 ("cpumask: re-introduce constant-sized cpumask optimizations")
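
For reference, a condensed sketch of the optimization being referred to
(simplified from the bitmap headers, so details may vary by tree):

	/*
	 * True only for a compile-time-constant size of at most one word;
	 * the bitmap helpers then collapse to a single load/mask instead of
	 * calling into the generic bitmap routines.
	 */
	#define small_const_nbits(nbits) \
		(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)

	static inline bool bitmap_empty(const unsigned long *src, unsigned int nbits)
	{
		if (small_const_nbits(nbits))
			return !(*src & BITMAP_LAST_WORD_MASK(nbits));

		return find_first_bit(src, nbits) == nbits;
	}

Passing the runtime variable nr_node_ids instead of the constant MAX_NUMNODES
makes __builtin_constant_p() false, so a NODES_SHIFT <= 6 config would lose
that single-word fast path.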

Thanks,
Yury

2023-07-20 13:11:51

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Wed, 19 Jul 2023 at 15:29, Aaron Lu <[email protected]> wrote:
>
> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <[email protected]> wrote:
> > >
> > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > > decayed(really just decayed, not when it detected it has removed load
> > > > pending or load propagated from its children). This way it would give us
> > > > similar result as above(roughly once per ms).
> > >
> > > Something like this: (I think this is better since it removed those
> > > unnecessary calls to update_tg_load_avg(), although it is inline but
> > > still)
> > >
> > >
> > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > > From: Aaron Lu <[email protected]>
> > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > >
> > > ---
> > > kernel/sched/fair.c | 22 +++++++---------------
> > > 1 file changed, 7 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a80a73909dc2..7d5b7352b8b5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > > }
> > >
> > > /* Update task and its cfs_rq load average */
> > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > {
> > > struct cfs_rq *cfs_rq, *gcfs_rq;
> > >
> > > if (entity_is_task(se))
> > > - return 0;
> > > + return;
> > >
> > > gcfs_rq = group_cfs_rq(se);
> > > if (!gcfs_rq->propagate)
> > > - return 0;
> > > + return;
> > >
> > > gcfs_rq->propagate = 0;
> > >
> > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> > >
> > > trace_pelt_cfs_tp(cfs_rq);
> > > trace_pelt_se_tp(se);
> > > -
> > > - return 1;
> > > }
> > >
> > > /*
> > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> > >
> > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> > >
> > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > {
> > > - return 0;
> > > }
> > >
> > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > {
> > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > struct sched_avg *sa = &cfs_rq->avg;
> > > - int decayed = 0;
> > > + int decayed;
> > >
> > > if (cfs_rq->removed.nr) {
> > > unsigned long r;
> > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > */
> > > add_tg_cfs_propagate(cfs_rq,
> > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > -
> > > - decayed = 1;
> > > }
> > >
> > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > u64_u32_store_copy(sa->last_update_time,
> > > cfs_rq->last_update_time_copy,
> > > sa->last_update_time);
> > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > __update_load_avg_se(now, cfs_rq, se);
> > >
> > > decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > > - decayed |= propagate_entity_load_avg(se);
> > > + propagate_entity_load_avg(se);
> >
> > but then you also skip the call to cfs_rq_util_change()
>
> Ah right, I missed that, thanks for catching this.
>
> Updated:
>
> From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
> From: Aaron Lu <[email protected]>
> Date: Wed, 19 Jul 2023 13:54:48 +0800
> Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
>
> ---
> kernel/sched/fair.c | 17 ++++++-----------
> 1 file changed, 6 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..8d4b9e0a19b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> {
> unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> struct sched_avg *sa = &cfs_rq->avg;
> - int decayed = 0;
> + int decayed;
>
> if (cfs_rq->removed.nr) {
> unsigned long r;
> @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> */
> add_tg_cfs_propagate(cfs_rq,
> -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> -
> - decayed = 1;

We need this to propagate the change in other places like cpufreq

> }
>
> - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> u64_u32_store_copy(sa->last_update_time,
> cfs_rq->last_update_time_copy,
> sa->last_update_time);
> @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> u64 now = cfs_rq_clock_pelt(cfs_rq);
> - int decayed;
> + int decayed, propagated;
>
> /*
> * Track task load average for carrying it to new CPU after migrated, and
> @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> __update_load_avg_se(now, cfs_rq, se);
>
> decayed = update_cfs_rq_load_avg(now, cfs_rq);
> - decayed |= propagate_entity_load_avg(se);
> + propagated = propagate_entity_load_avg(se);
>
> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>
> @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> * IOW we're enqueueing a task on a new CPU.
> */
> attach_entity_load_avg(cfs_rq, se);
> - update_tg_load_avg(cfs_rq);
> -
> } else if (flags & DO_DETACH) {
> /*
> * DO_DETACH means we're here from dequeue_entity()
> * and we are migrating task out of the CPU.
> */
> detach_entity_load_avg(cfs_rq, se);
> - update_tg_load_avg(cfs_rq);
> - } else if (decayed) {
> + } else if (decayed || propagated) {
> cfs_rq_util_change(cfs_rq, 0);
>
> - if (flags & UPDATE_TG)
> + if (decayed && (flags & UPDATE_TG))

It would be simpler and more readable to clear UPDATE_TG or not set it
from the beginning

IIUC, you rely on the fact that a decay happens every 1024 us of the
cfs_rq_clock_pelt() which is scaled by frequency and cpu compute
capacity. So you can end up with a cfs_rq_clock_pelt() that is far
slower than the real clock and the 1ms can easily be extended to dozens
of ms.
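
To put a rough number on it: rq->clock_pelt advances as wall time scaled by
both the frequency and the compute-capacity factors, so on a CPU running at,
say, half its max frequency with half the max capacity, PELT time only moves
at about 0.5 * 0.5 = 25% of wall time, and a "1 ms" window there is really
about 4 ms of real time -- smaller factors stretch it further.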

> update_tg_load_avg(cfs_rq);
> }
> }
> --
> 2.41.0
>
>
> > >
> > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > >
> > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > * IOW we're enqueueing a task on a new CPU.
> > > */
> > > attach_entity_load_avg(cfs_rq, se);
> > > - update_tg_load_avg(cfs_rq);
> > > -
> > > } else if (flags & DO_DETACH) {
> > > /*
> > > * DO_DETACH means we're here from dequeue_entity()
> > > * and we are migrating task out of the CPU.
> > > */
> > > detach_entity_load_avg(cfs_rq, se);
> > > - update_tg_load_avg(cfs_rq);
> > > } else if (decayed) {
> > > cfs_rq_util_change(cfs_rq, 0);
> > >
> > > --
> > > 2.41.0
> > >

2023-07-20 15:18:20

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <[email protected]> wrote:
> >
> > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <[email protected]> wrote:
> > > >
> > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > > > decayed(really just decayed, not when it detected it has removed load
> > > > > pending or load propagated from its children). This way it would give us
> > > > > similar result as above(roughly once per ms).
> > > >
> > > > Something like this: (I think this is better since it removed those
> > > > unnecessary calls to update_tg_load_avg(), although it is inline but
> > > > still)
> > > >
> > > >
> > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > > > From: Aaron Lu <[email protected]>
> > > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > > >
> > > > ---
> > > > kernel/sched/fair.c | 22 +++++++---------------
> > > > 1 file changed, 7 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index a80a73909dc2..7d5b7352b8b5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > > > }
> > > >
> > > > /* Update task and its cfs_rq load average */
> > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > {
> > > > struct cfs_rq *cfs_rq, *gcfs_rq;
> > > >
> > > > if (entity_is_task(se))
> > > > - return 0;
> > > > + return;
> > > >
> > > > gcfs_rq = group_cfs_rq(se);
> > > > if (!gcfs_rq->propagate)
> > > > - return 0;
> > > > + return;
> > > >
> > > > gcfs_rq->propagate = 0;
> > > >
> > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > >
> > > > trace_pelt_cfs_tp(cfs_rq);
> > > > trace_pelt_se_tp(se);
> > > > -
> > > > - return 1;
> > > > }
> > > >
> > > > /*
> > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> > > >
> > > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> > > >
> > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > {
> > > > - return 0;
> > > > }
> > > >
> > > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > {
> > > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > > struct sched_avg *sa = &cfs_rq->avg;
> > > > - int decayed = 0;
> > > > + int decayed;
> > > >
> > > > if (cfs_rq->removed.nr) {
> > > > unsigned long r;
> > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > */
> > > > add_tg_cfs_propagate(cfs_rq,
> > > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > > -
> > > > - decayed = 1;
> > > > }
> > > >
> > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > > u64_u32_store_copy(sa->last_update_time,
> > > > cfs_rq->last_update_time_copy,
> > > > sa->last_update_time);
> > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > __update_load_avg_se(now, cfs_rq, se);
> > > >
> > > > decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > > > - decayed |= propagate_entity_load_avg(se);
> > > > + propagate_entity_load_avg(se);
> > >
> > > but then you also skip the call to cfs_rq_util_change()
> >
> > Ah right, I missed that, thanks for catching this.
> >
> > Updated:
> >
> > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
> > From: Aaron Lu <[email protected]>
> > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> >
> > ---
> > kernel/sched/fair.c | 17 ++++++-----------
> > 1 file changed, 6 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..8d4b9e0a19b6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > {
> > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > struct sched_avg *sa = &cfs_rq->avg;
> > - int decayed = 0;
> > + int decayed;
> >
> > if (cfs_rq->removed.nr) {
> > unsigned long r;
> > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > */
> > add_tg_cfs_propagate(cfs_rq,
> > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > -
> > - decayed = 1;
>
> We need this to propagate the change in other places like cpufreq

Ah, I just made the same mistake again, sorry.

So there are three cases for a cfs_rq: load decayed, load removed and
load propagated. For all three cases, cfs_rq_util_change() needs to be
called, but only for the decayed case does update_tg_load_avg() need to
be called.

I'll update the patch accordingly.
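
Roughly the direction I have in mind -- an untested sketch, and the extra
"removed" out-parameter of update_cfs_rq_load_avg() is only provisional
naming, not existing code:

	decayed    = update_cfs_rq_load_avg(now, cfs_rq, &removed);
	propagated = propagate_entity_load_avg(se);
	...
	} else if (decayed || removed || propagated) {
		/* all three cases must reach cpufreq */
		cfs_rq_util_change(cfs_rq, 0);

		/* but only a real decay needs to refresh tg->load_avg */
		if (decayed && (flags & UPDATE_TG))
			update_tg_load_avg(cfs_rq);
	}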

> > }
> >
> > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > u64_u32_store_copy(sa->last_update_time,
> > cfs_rq->last_update_time_copy,
> > sa->last_update_time);
> > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > {
> > u64 now = cfs_rq_clock_pelt(cfs_rq);
> > - int decayed;
> > + int decayed, propagated;
> >
> > /*
> > * Track task load average for carrying it to new CPU after migrated, and
> > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > __update_load_avg_se(now, cfs_rq, se);
> >
> > decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > - decayed |= propagate_entity_load_avg(se);
> > + propagated = propagate_entity_load_avg(se);
> >
> > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> >
> > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > * IOW we're enqueueing a task on a new CPU.
> > */
> > attach_entity_load_avg(cfs_rq, se);
> > - update_tg_load_avg(cfs_rq);
> > -
> > } else if (flags & DO_DETACH) {
> > /*
> > * DO_DETACH means we're here from dequeue_entity()
> > * and we are migrating task out of the CPU.
> > */
> > detach_entity_load_avg(cfs_rq, se);
> > - update_tg_load_avg(cfs_rq);
> > - } else if (decayed) {
> > + } else if (decayed || propagated) {
> > cfs_rq_util_change(cfs_rq, 0);
> >
> > - if (flags & UPDATE_TG)
> > + if (decayed && (flags & UPDATE_TG))
>
> It would be simpler and more readable to clear UPDATE_TG or not set it
> from the beginning

Do you mean something like this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d4b9e0a19b6..084d63371355 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
__update_load_avg_se(now, cfs_rq, se);

- decayed = update_cfs_rq_load_avg(now, cfs_rq);
+ decayed = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG);
propagated = propagate_entity_load_avg(se);

if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
@@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
} else if (decayed || propagated) {
cfs_rq_util_change(cfs_rq, 0);

- if (decayed && (flags & UPDATE_TG))
+ if (decayed)
update_tg_load_avg(cfs_rq);
}
}

> IIUC, you rely on the fact that a decay happens every 1024 us of the
> cfs_rq_clock_pelt() which is scaled by frequency and cpu compute
> capacity. So you can end up with a cfs_rq_clock_pelt() that is far
> slower than real clock and the 1ms can easily be extended to dozens of
> ms

Thanks for the info. I'm not familiar with this clock scale part and will
need to take a closer look.

As you know, the intent is to make the unbound updates to tg->load_avg
become bounded for those wakeup-migration-heavy workloads, and the way
this change achieves that is by removing the update to tg->load_avg on
the attach and detach paths, leaving only the update on the load decay
path. And if the current implementation makes load decay take longer
than 1024us, that shouldn't be a problem for this change. I don't see an
immediate problem if updates to tg->load_avg happen less often than once
per ms, but please let me know if I missed something, thanks.

>
> > update_tg_load_avg(cfs_rq);
> > }
> > }
> > --
> > 2.41.0
> >
> >
> > > >
> > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > > >
> > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > * IOW we're enqueueing a task on a new CPU.
> > > > */
> > > > attach_entity_load_avg(cfs_rq, se);
> > > > - update_tg_load_avg(cfs_rq);
> > > > -
> > > > } else if (flags & DO_DETACH) {
> > > > /*
> > > > * DO_DETACH means we're here from dequeue_entity()
> > > > * and we are migrating task out of the CPU.
> > > > */
> > > > detach_entity_load_avg(cfs_rq, se);
> > > > - update_tg_load_avg(cfs_rq);
> > > > } else if (decayed) {
> > > > cfs_rq_util_change(cfs_rq, 0);
> > > >
> > > > --
> > > > 2.41.0
> > > >

2023-07-20 15:32:13

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On 20/07/2023 17:02, Vincent Guittot wrote:
> On Thu, 20 Jul 2023 at 16:42, Aaron Lu <[email protected]> wrote:
>>
>> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
>>> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <[email protected]> wrote:
>>>>
>>>> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
>>>>> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <[email protected]> wrote:
>>>>>>
>>>>>> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:

[...]

> What was wrong with your proposal to limit the update inside
> update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and
> computing delta after testing the time since last update
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e48fd0e6982d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> cfs_rq *cfs_rq)
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> {
> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> + u64 now = cfs_rq_clock_pelt(cfs_rq);

Could this be `u64 now = sched_clock_cpu()` like in
migrate_se_pelt_lag() or newidle_balance() to avoid the time morphing
due to PELT's frequency and uArch invariance?
>
> /*
> * No need to update load_avg for root_task_group as it is not used.

[...]


2023-07-20 15:49:36

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Thu, 20 Jul 2023 at 16:42, Aaron Lu <[email protected]> wrote:
>
> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> > On Wed, 19 Jul 2023 at 15:29, Aaron Lu <[email protected]> wrote:
> > >
> > > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> > > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <[email protected]> wrote:
> > > > >
> > > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > > > > decayed(really just decayed, not when it detected it has removed load
> > > > > > pending or load propagated from its children). This way it would give us
> > > > > > similar result as above(roughly once per ms).
> > > > >
> > > > > Something like this: (I think this is better since it removed those
> > > > > unnecessary calls to update_tg_load_avg(), although it is inline but
> > > > > still)
> > > > >
> > > > >
> > > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > > > > From: Aaron Lu <[email protected]>
> > > > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > > > >
> > > > > ---
> > > > > kernel/sched/fair.c | 22 +++++++---------------
> > > > > 1 file changed, 7 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index a80a73909dc2..7d5b7352b8b5 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > > > > }
> > > > >
> > > > > /* Update task and its cfs_rq load average */
> > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > > {
> > > > > struct cfs_rq *cfs_rq, *gcfs_rq;
> > > > >
> > > > > if (entity_is_task(se))
> > > > > - return 0;
> > > > > + return;
> > > > >
> > > > > gcfs_rq = group_cfs_rq(se);
> > > > > if (!gcfs_rq->propagate)
> > > > > - return 0;
> > > > > + return;
> > > > >
> > > > > gcfs_rq->propagate = 0;
> > > > >
> > > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > >
> > > > > trace_pelt_cfs_tp(cfs_rq);
> > > > > trace_pelt_se_tp(se);
> > > > > -
> > > > > - return 1;
> > > > > }
> > > > >
> > > > > /*
> > > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> > > > >
> > > > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> > > > >
> > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > > {
> > > > > - return 0;
> > > > > }
> > > > >
> > > > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > > {
> > > > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > > > struct sched_avg *sa = &cfs_rq->avg;
> > > > > - int decayed = 0;
> > > > > + int decayed;
> > > > >
> > > > > if (cfs_rq->removed.nr) {
> > > > > unsigned long r;
> > > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > > */
> > > > > add_tg_cfs_propagate(cfs_rq,
> > > > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > > > -
> > > > > - decayed = 1;
> > > > > }
> > > > >
> > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > > > u64_u32_store_copy(sa->last_update_time,
> > > > > cfs_rq->last_update_time_copy,
> > > > > sa->last_update_time);
> > > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > > __update_load_avg_se(now, cfs_rq, se);
> > > > >
> > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > > > > - decayed |= propagate_entity_load_avg(se);
> > > > > + propagate_entity_load_avg(se);
> > > >
> > > > but then you also skip the call to cfs_rq_util_change()
> > >
> > > Ah right, I missed that, thanks for catching this.
> > >
> > > Updated:
> > >
> > > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
> > > From: Aaron Lu <[email protected]>
> > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > >
> > > ---
> > > kernel/sched/fair.c | 17 ++++++-----------
> > > 1 file changed, 6 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a80a73909dc2..8d4b9e0a19b6 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > {
> > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > struct sched_avg *sa = &cfs_rq->avg;
> > > - int decayed = 0;
> > > + int decayed;
> > >
> > > if (cfs_rq->removed.nr) {
> > > unsigned long r;
> > > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > */
> > > add_tg_cfs_propagate(cfs_rq,
> > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > -
> > > - decayed = 1;
> >
> > We need this to propagate the change in other places like cpufreq
>
> Ah, I just made the same mistake again, sorry.
>
> So there are three cases for a cfs_rq: load decayed, load removed and
> load propagated. For all three cases, cfs_rq_util_change() needs to be
> called and only for decayed, update_tg_load_avg() needs to be called.
>
> I'll update the patch accordingly.
>
> > > }
> > >
> > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > u64_u32_store_copy(sa->last_update_time,
> > > cfs_rq->last_update_time_copy,
> > > sa->last_update_time);
> > > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > {
> > > u64 now = cfs_rq_clock_pelt(cfs_rq);
> > > - int decayed;
> > > + int decayed, propagated;
> > >
> > > /*
> > > * Track task load average for carrying it to new CPU after migrated, and
> > > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > __update_load_avg_se(now, cfs_rq, se);
> > >
> > > decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > > - decayed |= propagate_entity_load_avg(se);
> > > + propagated = propagate_entity_load_avg(se);
> > >
> > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > >
> > > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > * IOW we're enqueueing a task on a new CPU.
> > > */
> > > attach_entity_load_avg(cfs_rq, se);
> > > - update_tg_load_avg(cfs_rq);
> > > -
> > > } else if (flags & DO_DETACH) {
> > > /*
> > > * DO_DETACH means we're here from dequeue_entity()
> > > * and we are migrating task out of the CPU.
> > > */
> > > detach_entity_load_avg(cfs_rq, se);
> > > - update_tg_load_avg(cfs_rq);
> > > - } else if (decayed) {
> > > + } else if (decayed || propagated) {
> > > cfs_rq_util_change(cfs_rq, 0);
> > >
> > > - if (flags & UPDATE_TG)
> > > + if (decayed && (flags & UPDATE_TG))
> >
> > It would be simpler and more readable to clear UPDATE_TG or not set it
> > from the beginning
>
> Do you mean something like this?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8d4b9e0a19b6..084d63371355 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> __update_load_avg_se(now, cfs_rq, se);
>
> - decayed = update_cfs_rq_load_avg(now, cfs_rq);
> + decayed = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG);
> propagated = propagate_entity_load_avg(se);
>
> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> @@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> } else if (decayed || propagated) {
> cfs_rq_util_change(cfs_rq, 0);
>
> - if (decayed && (flags & UPDATE_TG))
> + if (decayed)
> update_tg_load_avg(cfs_rq);
> }
> }
>
> > IIUC, you rely on the fact that a decay happens every 1024 us of the
> > cfs_rq_clock_pelt() which is scaled by frequency and cpu compute
> > capacity. So you can end up with a cfs_rq_clock_pelt() that is far
> > slower than real clock and the 1ms can easily be extended to dozens of
> > ms
>
> Thanks for the info. I'm not familiar with this clock scale part and will
> need to take a closer look.
>
> As you know, the intent is to make the unbound update to tg->load_avg
> become bound for those wakeup migration heavy workloads and the way this
> change does to achieve it is to remove the update to tg->load_avg on
> attach and detach path, just leave the update on load decay path. And if
> the current implementation makes load decay longer than 1024us, that
> shouldn't be a problem for this change. I don't see an immediate problem
> if update to tg->load_avg happens less often than once per ms but please
> let me know if I missed something, thanks.

What was wrong with your proposal to limit the update inside
update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and
computing delta after testing the time since last update

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e48fd0e6982d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+ u64 now = cfs_rq_clock_pelt(cfs_rq);

/*
* No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
if (cfs_rq->tg == &root_task_group)
return;

- if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+ if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
+ abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
atomic_long_add(delta, &cfs_rq->tg->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+ cfs_rq->last_update_tg_load_avg = now;
}
}
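
The diff above relies on a cfs_rq->last_update_tg_load_avg field that does not
exist in the base tree at this point; a minimal sketch of the addition it
assumes (the exact placement inside struct cfs_rq is a guess, not part of the
posted diff):

/* kernel/sched/sched.h -- sketch only, placement is an assumption */
struct cfs_rq {
	/* ... existing members ... */
#ifdef CONFIG_FAIR_GROUP_SCHED
	unsigned long		tg_load_avg_contrib;
	u64			last_update_tg_load_avg;	/* ns timestamp of last tg->load_avg update */
#endif
	/* ... */
};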


>
> >
> > > update_tg_load_avg(cfs_rq);
> > > }
> > > }
> > > --
> > > 2.41.0
> > >
> > >
> > > > >
> > > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > > > >
> > > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > > * IOW we're enqueueing a task on a new CPU.
> > > > > */
> > > > > attach_entity_load_avg(cfs_rq, se);
> > > > > - update_tg_load_avg(cfs_rq);
> > > > > -
> > > > > } else if (flags & DO_DETACH) {
> > > > > /*
> > > > > * DO_DETACH means we're here from dequeue_entity()
> > > > > * and we are migrating task out of the CPU.
> > > > > */
> > > > > detach_entity_load_avg(cfs_rq, se);
> > > > > - update_tg_load_avg(cfs_rq);
> > > > > } else if (decayed) {
> > > > > cfs_rq_util_change(cfs_rq, 0);
> > > > >
> > > > > --
> > > > > 2.41.0
> > > > >

2023-07-20 15:53:38

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Thu, 20 Jul 2023 at 16:42, Aaron Lu <[email protected]> wrote:
>
> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> > On Wed, 19 Jul 2023 at 15:29, Aaron Lu <[email protected]> wrote:
> > >
> > > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> > > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <[email protected]> wrote:
> > > > >
> > > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > > > > decayed(really just decayed, not when it detected it has removed load
> > > > > > pending or load propagated from its children). This way it would give us
> > > > > > similar result as above(roughly once per ms).
> > > > >
> > > > > Something like this: (I think this is better since it removed those
> > > > > unnecessary calls to update_tg_load_avg(), although it is inline but
> > > > > still)
> > > > >
> > > > >
> > > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > > > > From: Aaron Lu <[email protected]>
> > > > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > > > >
> > > > > ---
> > > > > kernel/sched/fair.c | 22 +++++++---------------
> > > > > 1 file changed, 7 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index a80a73909dc2..7d5b7352b8b5 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > > > > }
> > > > >
> > > > > /* Update task and its cfs_rq load average */
> > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > > {
> > > > > struct cfs_rq *cfs_rq, *gcfs_rq;
> > > > >
> > > > > if (entity_is_task(se))
> > > > > - return 0;
> > > > > + return;
> > > > >
> > > > > gcfs_rq = group_cfs_rq(se);
> > > > > if (!gcfs_rq->propagate)
> > > > > - return 0;
> > > > > + return;
> > > > >
> > > > > gcfs_rq->propagate = 0;
> > > > >
> > > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > >
> > > > > trace_pelt_cfs_tp(cfs_rq);
> > > > > trace_pelt_se_tp(se);
> > > > > -
> > > > > - return 1;
> > > > > }
> > > > >
> > > > > /*
> > > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> > > > >
> > > > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> > > > >
> > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > > {
> > > > > - return 0;
> > > > > }
> > > > >
> > > > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > > {
> > > > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > > > struct sched_avg *sa = &cfs_rq->avg;
> > > > > - int decayed = 0;
> > > > > + int decayed;
> > > > >
> > > > > if (cfs_rq->removed.nr) {
> > > > > unsigned long r;
> > > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > > */
> > > > > add_tg_cfs_propagate(cfs_rq,
> > > > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > > > -
> > > > > - decayed = 1;
> > > > > }
> > > > >
> > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > > > u64_u32_store_copy(sa->last_update_time,
> > > > > cfs_rq->last_update_time_copy,
> > > > > sa->last_update_time);
> > > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > > __update_load_avg_se(now, cfs_rq, se);
> > > > >
> > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > > > > - decayed |= propagate_entity_load_avg(se);
> > > > > + propagate_entity_load_avg(se);
> > > >
> > > > but then you also skip the call to cfs_rq_util_change()
> > >
> > > Ah right, I missed that, thanks for catching this.
> > >
> > > Updated:
> > >
> > > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
> > > From: Aaron Lu <[email protected]>
> > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > >
> > > ---
> > > kernel/sched/fair.c | 17 ++++++-----------
> > > 1 file changed, 6 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a80a73909dc2..8d4b9e0a19b6 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > {
> > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > struct sched_avg *sa = &cfs_rq->avg;
> > > - int decayed = 0;
> > > + int decayed;
> > >
> > > if (cfs_rq->removed.nr) {
> > > unsigned long r;
> > > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > */
> > > add_tg_cfs_propagate(cfs_rq,
> > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > -
> > > - decayed = 1;
> >
> > We need this to propagate the change in other place like cpufreq
>
> Ah, I just made the same mistake again, sorry.
>
> So there are three cases for a cfs_rq: load decayed, load removed and
> load propagated. For all three cases, cfs_rq_util_change() needs to be
> called and only for decayed, update_tg_load_avg() needs to be called.
>
> I'll update the patch accordingly.
>
> > > }
> > >
> > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > u64_u32_store_copy(sa->last_update_time,
> > > cfs_rq->last_update_time_copy,
> > > sa->last_update_time);
> > > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > {
> > > u64 now = cfs_rq_clock_pelt(cfs_rq);
> > > - int decayed;
> > > + int decayed, propagated;
> > >
> > > /*
> > > * Track task load average for carrying it to new CPU after migrated, and
> > > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > __update_load_avg_se(now, cfs_rq, se);
> > >
> > > decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > > - decayed |= propagate_entity_load_avg(se);
> > > + propagated = propagate_entity_load_avg(se);
> > >
> > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > >
> > > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > * IOW we're enqueueing a task on a new CPU.
> > > */
> > > attach_entity_load_avg(cfs_rq, se);
> > > - update_tg_load_avg(cfs_rq);
> > > -
> > > } else if (flags & DO_DETACH) {
> > > /*
> > > * DO_DETACH means we're here from dequeue_entity()
> > > * and we are migrating task out of the CPU.
> > > */
> > > detach_entity_load_avg(cfs_rq, se);
> > > - update_tg_load_avg(cfs_rq);
> > > - } else if (decayed) {
> > > + } else if (decayed || propagated) {
> > > cfs_rq_util_change(cfs_rq, 0);
> > >
> > > - if (flags & UPDATE_TG)
> > > + if (decayed && (flags & UPDATE_TG))



> >
> > It would be simpler and more readable to clear UPDATE_TG or not set it
> > from the beginning
>
> Do you mean something like this?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8d4b9e0a19b6..084d63371355 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> __update_load_avg_se(now, cfs_rq, se);
>
> - decayed = update_cfs_rq_load_avg(now, cfs_rq);
> + decayed = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG);
> propagated = propagate_entity_load_avg(se);
>
> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> @@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> } else if (decayed || propagated) {
> cfs_rq_util_change(cfs_rq, 0);
>
> - if (decayed && (flags & UPDATE_TG))
> + if (decayed)
> update_tg_load_avg(cfs_rq);
> }

something like below:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4252,6 +4252,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
__update_load_avg_se(now, cfs_rq, se);

decayed = update_cfs_rq_load_avg(now, cfs_rq);
+ if (!decayed)
+ flags &= ~UPDATE_TG;
decayed |= propagate_entity_load_avg(se);

if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
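
Read together with Aaron's diff earlier in the thread, the point is that the
tail of update_load_avg() can then keep its original shape and still only call
update_tg_load_avg() when the cfs_rq really decayed; roughly (a sketch of the
combined effect, not a posted patch):

	} else if (decayed) {
		cfs_rq_util_change(cfs_rq, 0);

		/* UPDATE_TG was cleared above when update_cfs_rq_load_avg() reported no decay */
		if (flags & UPDATE_TG)
			update_tg_load_avg(cfs_rq);
	}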


> }
>
> > IIUC, you rely on the fact that a decay happens every 1024 us of the
> > cfs_rq_clock_pelt() which is scaled by frequency and cpu compute
> > capacity. So you can end up with a cfs_rq_clock_pelt() that is far
> > slower than real clock and the 1ms can easily be extended to dozens of
> > ms
>
> Thanks for the info. I'm not familiar with this clock scale part and will
> need to take a closer look.
>
> As you know, the intent is to make the unbound update to tg->load_avg
> become bound for those wakeup migration heavy workloads and the way this
> change does to achieve it is to remove the update to tg->load_avg on
> attach and detach path, just leave the update on load decay path. And if
> the current implementation makes load decay longer than 1024us, that
> shouldn't be a problem for this change. I don't see an immediate problem
> if update to tg->load_avg happens less often than once per ms but please
> let me know if I missed something, thanks.
>
> >
> > > update_tg_load_avg(cfs_rq);
> > > }
> > > }
> > > --
> > > 2.41.0
> > >
> > >
> > > > >
> > > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > > > >
> > > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > > * IOW we're enqueueing a task on a new CPU.
> > > > > */
> > > > > attach_entity_load_avg(cfs_rq, se);
> > > > > - update_tg_load_avg(cfs_rq);
> > > > > -
> > > > > } else if (flags & DO_DETACH) {
> > > > > /*
> > > > > * DO_DETACH means we're here from dequeue_entity()
> > > > > * and we are migrating task out of the CPU.
> > > > > */
> > > > > detach_entity_load_avg(cfs_rq, se);
> > > > > - update_tg_load_avg(cfs_rq);
> > > > > } else if (decayed) {
> > > > > cfs_rq_util_change(cfs_rq, 0);
> > > > >
> > > > > --
> > > > > 2.41.0
> > > > >

2023-07-20 16:04:13

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Thu, 20 Jul 2023 at 17:22, Dietmar Eggemann <[email protected]> wrote:
>
> On 20/07/2023 17:02, Vincent Guittot wrote:
> > On Thu, 20 Jul 2023 at 16:42, Aaron Lu <[email protected]> wrote:
> >>
> >> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> >>> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <[email protected]> wrote:
> >>>>
> >>>> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> >>>>> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <[email protected]> wrote:
> >>>>>>
> >>>>>> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
>
> [...]
>
> > What was wrong with your proposal to limit the update inside
> > update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and
> > computing delta after testing the time since last update
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..e48fd0e6982d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> > cfs_rq *cfs_rq)
> > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > {
> > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > + u64 now = cfs_rq_clock_pelt(cfs_rq);
>
> Could this be `u64 now = sched_clock_cpu()` like in
> migrate_se_pelt_lag() or newidle_balance() to avoid the time morphing
> due to PELT's frequency and uArch invariance?

Yes, that's a good point. I missed this.

> >
> > /*
> > * No need to update load_avg for root_task_group as it is not used.
>
> [...]
>

2023-07-21 02:20:26

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Thu, Jul 20, 2023 at 05:02:32PM +0200, Vincent Guittot wrote:
>
> What was wrong with your proposal to limit the update inside
> update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and
> computing delta after testing the time since last update

I was thinking it might be better to align the update_tg_load_avg() with
cfs_rq's decay point but that's just my feeling.

Absolutely nothing wrong with the below approach, it's straightforward
and effective. I'll fix the use of cfs_rq_clock_pelt() and collect
some data and then send out v2.

Thank you Vincent for all your comments, they're very useful to me.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e48fd0e6982d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> cfs_rq *cfs_rq)
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> {
> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> + u64 now = cfs_rq_clock_pelt(cfs_rq);
>
> /*
> * No need to update load_avg for root_task_group as it is not used.
> @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct
> cfs_rq *cfs_rq)
> if (cfs_rq->tg == &root_task_group)
> return;
>
> - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> atomic_long_add(delta, &cfs_rq->tg->load_avg);
> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> + cfs_rq->last_update_tg_load_avg = now;
> }
> }

2023-07-21 06:49:31

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Thu, Jul 20, 2023 at 05:22:26PM +0200, Dietmar Eggemann wrote:
> On 20/07/2023 17:02, Vincent Guittot wrote:
> > On Thu, 20 Jul 2023 at 16:42, Aaron Lu <[email protected]> wrote:
> >>
> >> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> >>> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <[email protected]> wrote:
> >>>>
> >>>> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> >>>>> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <[email protected]> wrote:
> >>>>>>
> >>>>>> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
>
> [...]
>
> > What was wrong with your proposal to limit the update inside
> > update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and
> > computing delta after testing the time since last update
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..e48fd0e6982d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> > cfs_rq *cfs_rq)
> > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > {
> > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > + u64 now = cfs_rq_clock_pelt(cfs_rq);
>
> Could this be `u64 now = sched_clock_cpu()` like in
> migrate_se_pelt_lag() or newidle_balance() to avoid the time morphing
> due to PELT's frequency and uArch invariance?

Yes, will use sched_clock_cpu() instead of cfs_rq_clock_pelt().
Thanks.

> >
> > /*
> > * No need to update load_avg for root_task_group as it is not used.
>
> [...]
>
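
Putting the pieces of this sub-thread together (sched_clock_cpu() instead of
cfs_rq_clock_pelt(), NSEC_PER_MSEC instead of a bare 1000000, and computing
delta only after the time check), the rate-limited helper could look roughly
like the sketch below. This is an illustration of the direction discussed
here, not the posted v2; last_update_tg_load_avg is the field assumed by
Vincent's diff.

/* Sketch only: rate limit tg->load_avg updates to at most once per ms per cfs_rq. */
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta;
	u64 now;

	/*
	 * No need to update load_avg for root_task_group as it is not used.
	 */
	if (cfs_rq->tg == &root_task_group)
		return;

	/* rq clock, not PELT time, so frequency/capacity scaling cannot stretch the window */
	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;

	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		cfs_rq->last_update_tg_load_avg = now;
	}
}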

2023-08-02 08:00:26

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: free allocated memory on error in alloc_fair_sched_group()

On Wed, Jul 19, 2023 at 10:13:55AM +0800, Aaron Lu wrote:
> On Tue, Jul 18, 2023 at 11:13:12PM +0800, Chen Yu wrote:
> > On 2023-07-18 at 21:41:17 +0800, Aaron Lu wrote:
> > > There is one struct cfs_rq and one struct se on each cpu for a taskgroup
> > > and when allocation for tg->cfs_rq[X] failed, the already allocated
> > > tg->cfs_rq[0]..tg->cfs_rq[X-1] should be freed. The same for tg->se.
> > >
> > > Signed-off-by: Aaron Lu <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 23 ++++++++++++++++-------
> > > 1 file changed, 16 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a80a73909dc2..0f913487928d 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -12443,10 +12443,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> > >
> > > tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
> > > if (!tg->cfs_rq)
> > > - goto err;
> > > + return 0;
> > > tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
> > > if (!tg->se)
> > > - goto err;
> > > + goto err_free_rq_pointer;
> > >
> > > tg->shares = NICE_0_LOAD;
> > >
> > > @@ -12456,12 +12456,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> > > cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> > > GFP_KERNEL, cpu_to_node(i));
> > > if (!cfs_rq)
> > > - goto err;
> > > + goto err_free;
> > >
> > > se = kzalloc_node(sizeof(struct sched_entity_stats),
> > > GFP_KERNEL, cpu_to_node(i));
> > > if (!se)
> > > - goto err_free_rq;
> > > + goto err_free;
> > >
> > > init_cfs_rq(cfs_rq);
> > > init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
> > > @@ -12470,9 +12470,18 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> > >
> > > return 1;
> > >
> > > -err_free_rq:
> > > - kfree(cfs_rq);
> > > -err:
> > > +err_free:
> > > + for_each_possible_cpu(i) {
> > > + kfree(tg->cfs_rq[i]);
> > > + kfree(tg->se[i]);
> > > +
> > > + if (!tg->cfs_rq[i] && !tg->se[i])
> > > + break;
> > > + }
> > > + kfree(tg->se);
> > > +err_free_rq_pointer:
> > > + kfree(tg->cfs_rq);
> > > +
> > Not sure if I overlooked, if alloc_fair_sched_group() fails in sched_create_group(),
> > would sched_free_group()->free_fair_sched_group() do the cleanup?
>
> You are right, I overlooked... Thanks for pointing this out.

While preparing v2, one thing still looks strange in the existing code:

int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
... ...

for_each_possible_cpu(i) {
cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
GFP_KERNEL, cpu_to_node(i));
if (!cfs_rq)
goto err;

se = kzalloc_node(sizeof(struct sched_entity_stats),
GFP_KERNEL, cpu_to_node(i));
if (!se)
goto err_free_rq;
~~~~~~~~~~~~~~~~

Since free_fair_sched_group() will take care freeing these memories on
error path, it looks unnecessary to do this free for a random cfs_rq
here. Or do I miss something again...?

init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
init_entity_runnable_average(se);
}

return 1;

err_free_rq:
kfree(cfs_rq);
err:
return 0;
}

I plan to change it to this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..ef2618ad26eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12461,7 +12461,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
se = kzalloc_node(sizeof(struct sched_entity_stats),
GFP_KERNEL, cpu_to_node(i));
if (!se)
- goto err_free_rq;
+ goto err;

init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
@@ -12470,8 +12470,6 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

return 1;

-err_free_rq:
- kfree(cfs_rq);
err:
return 0;
}

2023-08-02 08:55:53

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: free allocated memory on error in alloc_fair_sched_group()

On 2023-08-02 at 15:01:05 +0800, Aaron Lu wrote:
> On Wed, Jul 19, 2023 at 10:13:55AM +0800, Aaron Lu wrote:
> > On Tue, Jul 18, 2023 at 11:13:12PM +0800, Chen Yu wrote:
> > > On 2023-07-18 at 21:41:17 +0800, Aaron Lu wrote:
> > > > There is one struct cfs_rq and one struct se on each cpu for a taskgroup
> > > > and when allocation for tg->cfs_rq[X] failed, the already allocated
> > > > tg->cfs_rq[0]..tg->cfs_rq[X-1] should be freed. The same for tg->se.
> > > >
> > > > Signed-off-by: Aaron Lu <[email protected]>
> > > > ---
> > > > kernel/sched/fair.c | 23 ++++++++++++++++-------
> > > > 1 file changed, 16 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index a80a73909dc2..0f913487928d 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -12443,10 +12443,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> > > >
> > > > tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
> > > > if (!tg->cfs_rq)
> > > > - goto err;
> > > > + return 0;
> > > > tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
> > > > if (!tg->se)
> > > > - goto err;
> > > > + goto err_free_rq_pointer;
> > > >
> > > > tg->shares = NICE_0_LOAD;
> > > >
> > > > @@ -12456,12 +12456,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> > > > cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> > > > GFP_KERNEL, cpu_to_node(i));
> > > > if (!cfs_rq)
> > > > - goto err;
> > > > + goto err_free;
> > > >
> > > > se = kzalloc_node(sizeof(struct sched_entity_stats),
> > > > GFP_KERNEL, cpu_to_node(i));
> > > > if (!se)
> > > > - goto err_free_rq;
> > > > + goto err_free;
> > > >
> > > > init_cfs_rq(cfs_rq);
> > > > init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
> > > > @@ -12470,9 +12470,18 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> > > >
> > > > return 1;
> > > >
> > > > -err_free_rq:
> > > > - kfree(cfs_rq);
> > > > -err:
> > > > +err_free:
> > > > + for_each_possible_cpu(i) {
> > > > + kfree(tg->cfs_rq[i]);
> > > > + kfree(tg->se[i]);
> > > > +
> > > > + if (!tg->cfs_rq[i] && !tg->se[i])
> > > > + break;
> > > > + }
> > > > + kfree(tg->se);
> > > > +err_free_rq_pointer:
> > > > + kfree(tg->cfs_rq);
> > > > +
> > > Not sure if I overlooked, if alloc_fair_sched_group() fails in sched_create_group(),
> > > would sched_free_group()->free_fair_sched_group() do the cleanup?
> >
> > You are right, I overlooked... Thanks for pointing this out.
>
> While preparing v2, one thing still looks strange in the existing code:
>
> int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> {
> ... ...
>
> for_each_possible_cpu(i) {
> cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> GFP_KERNEL, cpu_to_node(i));
> if (!cfs_rq)
> goto err;
>
> se = kzalloc_node(sizeof(struct sched_entity_stats),
> GFP_KERNEL, cpu_to_node(i));
> if (!se)
> goto err_free_rq;
> ~~~~~~~~~~~~~~~~
>
> Since free_fair_sched_group() will take care freeing these memories on
> error path, it looks unnecessary to do this free for a random cfs_rq
> here. Or do I miss something again...?
>
> init_cfs_rq(cfs_rq);
> init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
> init_entity_runnable_average(se);
> }
>
> return 1;
>
> err_free_rq:
> kfree(cfs_rq);
> err:
> return 0;
> }
>
> I plan to change it to this:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..ef2618ad26eb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12461,7 +12461,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> se = kzalloc_node(sizeof(struct sched_entity_stats),
> GFP_KERNEL, cpu_to_node(i));
> if (!se)
> - goto err_free_rq;
> + goto err;
>
> init_cfs_rq(cfs_rq);
> init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
> @@ -12470,8 +12470,6 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>
> return 1;
>
> -err_free_rq:
> - kfree(cfs_rq);
> err:
> return 0;
> }

It seems that the err_free_rq label was introduced in
commit dfc12eb26a28 ("sched: Fix memory leak in two error corner cases").
The memory leak was detected by a static tool, while that tool is
unaware of free_fair_sched_group(), which will take care of this.
Not sure if having the function clean up after itself is preferred, or
letting the other function take care of that. For now it seems to be a
duplicated free. And alloc_rt_sched_group() has the same issue.

thanks,
Chenyu

2023-08-02 13:31:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node

On Wed, Jul 19, 2023 at 09:45:00PM +0800, Aaron Lu wrote:
> On Wed, Jul 19, 2023 at 01:53:58PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 18, 2023 at 09:41:18PM +0800, Aaron Lu wrote:
> > > +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> > > +static inline long tg_load_avg(struct task_group *tg)
> > > +{
> > > + long load_avg = 0;
> > > + int i;
> > > +
> > > + /*
> > > + * The only path that can give us a root_task_group
> > > + * here is from print_cfs_rq() thus unlikely.
> > > + */
> > > + if (unlikely(tg == &root_task_group))
> > > + return 0;
> > > +
> > > + for_each_node(i)
> > > + load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
> > > +
> > > + return load_avg;
> > > +}
> > > +#endif
> >
> > So I was working on something else numa and noticed that for_each_node()
> > (and most of the nodemask stuff) is quite moronic, afaict we should do
> > something like the below.
> >
> > I now see Mike added the nr_node_ids thing fairly recent, but given
> > distros have NODES_SHIFT=10 and actual machines typically only have <=4
> > nodes, this would save a factor of 256 scanning.

More complete nodemask patch here:

https://lkml.kernel.org/r/20230802112458.230221601%40infradead.org
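
One way to picture the saving: the read side that sums the per-node
contributions only needs to walk nodes that can actually exist. Peter's patch
(linked above) changes the for_each_node() iterators themselves; purely as an
illustration, a loop over the helper quoted above bounded by nr_node_ids would
look like this (tg->node_info comes from patch 2 of this series):

/* Illustration only: bound the walk by nr_node_ids instead of MAX_NUMNODES. */
static inline long tg_load_avg(struct task_group *tg)
{
	long load_avg = 0;
	int i;

	/*
	 * The only path that can give us a root_task_group
	 * here is from print_cfs_rq() thus unlikely.
	 */
	if (unlikely(tg == &root_task_group))
		return 0;

	/* nr_node_ids is typically <= 4 on real machines, while NODES_SHIFT=10 allows up to 1024 */
	for (i = 0; i < nr_node_ids; i++)
		load_avg += atomic_long_read(&tg->node_info[i]->load_avg);

	return load_avg;
}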

2023-08-11 10:21:19

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

On Fri, Jul 21, 2023 at 09:57:04AM +0800, Aaron Lu wrote:
> On Thu, Jul 20, 2023 at 05:02:32PM +0200, Vincent Guittot wrote:
> >
> > What was wrong with your proposal to limit the update inside
> > update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and
> > computing delta after testing the time since last update
>
> I was thinking it might be better to align the update_tg_load_avg() with
> cfs_rq's decay point but that's just my feeling.
>
> Absolutely nothing wrong with the below approach, it's straightforward
> and effective. I'll fix the use of cfs_rq_clock_pelt() and collect
> some data and then send out v2.
>
> Thank you Vincent for all your comments, they're very useful to me.
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..e48fd0e6982d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> > cfs_rq *cfs_rq)
> > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > {
> > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > + u64 now = cfs_rq_clock_pelt(cfs_rq);
> >
> > /*
> > * No need to update load_avg for root_task_group as it is not used.
> > @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct
> > cfs_rq *cfs_rq)
> > if (cfs_rq->tg == &root_task_group)
> > return;
> >
> > - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> > + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > + cfs_rq->last_update_tg_load_avg = now;
> > }
> > }

While collecting data for v2, I found that with this "rate limit updates
to tg->load_avg to at most once per ms" change, the other two
optimizations, "skip some update_cfs_group() on en/dequeue_entity()" and
"Make tg->load_avg per node", don't matter anymore, i.e. as long as the
"ratelimit" patch is there, adding those two other optimizations doesn't
improve performance further.

I think this is reasonable: since the "ratelimit" patch reduces updates
to tg->load_avg so much (from some 10 million during a 5s window to
~25k), the overhead of accessing tg->load_avg drops greatly, and further
improving it by reducing the read side calls or making the counter
per-node doesn't matter anymore.

So I think for v2, I'll go with this "ratelimit" patch alone, but let me
know if you think otherwise.

The tests I've run are listed below.
1) postgres_sysbench with nr_thread=25%/50%/75%/100% on a 2 sockets, 112
cores, 224 threads Intel Sapphire Rapids (SPR);
2) hackbench with group=1/4/8/16, pipe, thread mode on a 2 sockets, 64
cores, 128 threads Intel Skylake (ICL) and another 2 sockets, 112 cores,
224 threads Intel SPR;
3) netperf with tcp_stream and udp_rr modes. For each mode, nr_client is
set to 25%/50%/75%/100% of nr_cpu. Tested on the same ICL and SPR as in 2).

The test summary is:

postgres_sysbench on SPR:
- when nr_thread=25% and 50%, results are in noise range;
- when nr_thread=75% and 100%, the "ratelimit" patch improves transactions
by 12.2% and 21.2%.

hackbench:
- when nr_group=1 and 4, results are in noise range for both ICL and SPR;
- when nr_group=8 and 16, the "ratelimit" patch improves throughput by
6.6% and 22.4% on ICL. For SPR, the improvements are 12.5% and 48.8%.

netperf:
- for the tcp_stream mode test, all test results are in noise range for ICL
and SPR;
- for the udp_rr mode test, when nr_client=25% and 50% of nr_cpu, test results
are in noise range; when nr_client=75% and 100% of nr_cpu, the "ratelimit"
patch improves throughput by 38.5% and 13.4% on ICL; on SPR,
the improvements are 162% and 189%.

Full test results are listed below.

base: peterz's sched/core branch with commit e8d5ff8258bf7 ("sched/fair:
Block nohz tick_stop when cfs bandwidth in use") as head.
rate: the patch that rate limits updates to tg->load_avg to at most once
per ms; applied on top of "base".
skip: the patch "sched/fair: skip some update_cfs_group() on
en/dequeue_entity()" as in this v1 series; applied on top of "rate".
node: the patch "sched/fair: Make tg->load_avg per node" as in this v1
series + peterz's numa optimizations; applied on top of "skip".

All improvement percents are against "base".

postgres_sysbench on SPR:

25% (all in noise range)
base: 42382±19.8%
rate: 50174±9.5%
skip: 46311±12.0%
node: 48690±13.9%

50% (all in noise range)
base: 67626±1.3%
rate: 67365±3.1%
skip: 68218±4.9%
node: 66878±3.3%

75%
base: 100216±1.2%
rate: 112470±0.1% +12.2%
skip: 113345±0.3% +13.1%
node: 112962±0.1% +12.7%

100%
base: 93671±0.4%
rate: 113563±0.2% +21.2%
skip: 113741±0.2% +21.4%
node: 112793±0.2% +20.4%

hackbench on ICL:
group=1 (all in noise range)
base: 114912±5.2%
rate: 117857±2.5%
skip: 118501±4.3%
node: 119240±7.0%

group=4 (all in noise range)
base: 359902±1.6%
rate: 361685±2.7%
skip: 356354±1.6%
node: 358744±1.7%

group=8
base: 461070±0.8%
rate: 491713±0.3% +6.6%
skip: 490183±1.4% +6.3%
node: 490864±0.5% +6.5%

group=16
base: 309032±5.0%
rate: 378337±1.3% +22.4%
skip: 385637±1.4% +24.8%
node: 375087±0.9% +21.4%

hackbench on SPR:
group=1 (all in noise range)
base: 100768±2.9%
rate: 103134±2.9%
skip: 99642±1.7%
node: 104251±2.5%

group=4 (all in noise range)
base: 413830±12.5%
rate: 378660±16.6%
skip: 367234±13.1%
node: 372122±14.4%

group=8
base: 436124±0.6%
rate: 490787±3.2% +12.5%
skip: 493097±1.0% +13.0%
node: 497336±1.8% +14.0%

group=16
base: 457730±3.2%
rate: 680452±1.3% +48.8%
skip: 679271±0.9% +48.4%
node: 672365±0.5% +46.9%

netperf/udp_rr on ICL
25%
base: 114413±0.1%
rate: 115111±0.0% +0.6%
skip: 115546±0.2% +1.0%
node: 115214±0.3% +0.7%

50% (all in noise range)
base: 86803±0.5%
rate: 86611±0.0%
skip: 86986±0.5%
node: 87466±0.8%

75%
base: 35959±5.3%
rate: 49801±0.6% +38.5%
skip: 50745±1.5% +41.1%
node: 50457±1.5% +40.3%

100%
base: 61951±6.4%
rate: 70224±0.8% +13.4%
skip: 67321±4.3% +8.7%
node: 70591±2.3% +13.9%

netperf/udp_rr on SPR
25% (all in noise range)
base: 104954±1.3%
rate: 107312±2.8%
skip: 107688±1.7%
node: 106024±1.8%

50% (all in noise range)
base: 55394±4.6%
rate: 54940±7.4%
skip: 57829±1.8%
node: 57144±6.4%

75%
base: 13779±3.1%
rate: 36105±1.1% +162%
skip: 35319±6.3% +156%
node: 36764±3.5% +167%

100%
base: 9703±3.7%
rate: 28011±0.2% +189%
skip: 27087±1.3% +179%
node: 27337±2.2% +182%

netperf/tcp_stream on ICL
25% (all in noise range)
base: 43092±0.1%
rate: 42891±0.5%
skip: 43208±0.4%
node: 43044±0.6%

50% (all in noise range)
base: 19278±14.9%
rate: 22369±7.2%
skip: 19375±16.0%
node: 21312±8.2%

75% (all in noise range)
base: 16822±3.0%
rate: 17086±2.3%
skip: 17571±1.8%
node: 16801±2.4%

100% (all in noise range)
base: 18216±0.6%
rate: 18078±2.9%
skip: 18040±1.0%
node: 18064±1.6%

netperf/tcp_stream on SPR
25% (all in noise range)
base: 34491±0.3%
rate: 34886±0.5%
skip: 34894±0.9%
node: 34692±0.6%

50% (all in noise range)
base: 19278±14.9%
rate: 22369±7.2%
skip: 19375±16.0%
node: 21312±8.2%

75% (all in noise range)
base: 16822±3.0%
rate: 17086±2.3%
skip: 17571±1.8%
node: 16801±2.4%

100% (all in noise range)
base: 18216±0.6%
rate: 18078±2.9%
skip: 18040±0.1%
node: 18064±1.6%

Thanks,
Aaron

2023-08-11 11:02:07

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node

On Wed, Aug 02, 2023 at 01:28:36PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 19, 2023 at 09:45:00PM +0800, Aaron Lu wrote:
> > On Wed, Jul 19, 2023 at 01:53:58PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 18, 2023 at 09:41:18PM +0800, Aaron Lu wrote:
> > > > +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> > > > +static inline long tg_load_avg(struct task_group *tg)
> > > > +{
> > > > + long load_avg = 0;
> > > > + int i;
> > > > +
> > > > + /*
> > > > + * The only path that can give us a root_task_group
> > > > + * here is from print_cfs_rq() thus unlikely.
> > > > + */
> > > > + if (unlikely(tg == &root_task_group))
> > > > + return 0;
> > > > +
> > > > + for_each_node(i)
> > > > + load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
> > > > +
> > > > + return load_avg;
> > > > +}
> > > > +#endif
> > >
> > > So I was working on something else numa and noticed that for_each_node()
> > > (and most of the nodemask stuff) is quite moronic, afaict we should do
> > > something like the below.
> > >
> > > I now see Mike added the nr_node_ids thing fairly recent, but given
> > > distros have NODES_SHIFT=10 and actual machines typically only have <=4
> > > nodes, this would save a factor of 256 scanning.
>
> More complete nodemask patch here:
>
> https://lkml.kernel.org/r/20230802112458.230221601%40infradead.org

Thanks for the update.

I incorporated this numa change and collected some data and found that
with the newly proposed approach of rate limiting updates to tg->load_avg
to at most once per ms, the cost of accessing tg->load_avg drops
so much that adding the other optimizations doesn't make much difference.

So I was thinking maybe I just need that one ratelimit patch to reduce
the cost of accessing tg->load_avg. The detailed data is here:
https://lore.kernel.org/lkml/20230811092811.GA399195@ziqianlu-dell/