2021-08-19 17:52:44

by Michal Koutný

Subject: [RFC PATCH v2 0/5] leaf_cfs_rq_list cleanups and fix

Hello.

This is an extension of the simplification patch posted previously. The
original goal was to make the code a bit more readable, but I noticed a
bug in the meantime and included a fix for it in the series.

Reason for RFC:
- I noticed an asymmetry in the update_cfs_group(se) calls between
unthrottle_cfs_rq() and enqueue_task_fair() and I'm not sure whether
the omission is warranted. It's marked XXX in the patch "sched/fair:
Simplify ancestor enqueue loops".

Patch 1
- fix of load_cfs_rq_list handling

Patches 2, 3
- just (re)naming things

Patches 4, 5
- simplifications of load_cfs_rq_list and ancestor processing loops
- no functional changes intended



RFC v1:
https://lore.kernel.org/r/[email protected]

Michal Koutný (5):
sched/fair: Add ancestors of unthrottled undecayed cfs_rq
sched: Add group_se() helper
sched/fair: Rename leaf_list to more fitting load_list
sched/fair: Simplify load_cfs_rq_list maintenance
sched/fair: Simplify ancestor enqueue loops

kernel/sched/core.c | 4 +-
kernel/sched/fair.c | 220 +++++++++++++++++--------------------------
kernel/sched/sched.h | 25 ++---
3 files changed, 104 insertions(+), 145 deletions(-)

--
2.32.0


2021-08-19 17:52:50

by Michal Koutný

Subject: [RFC PATCH v2 4/5] sched/fair: Simplify load_cfs_rq_list maintenance

load_cfs_rq_list should contain cfs_rqs that have runnable entities in
them. When we're operating under a throttled hierarchy, we currently
always update the load_cfs_rq_list so as not to break the
list_add_load_cfs_rq() invariant of adding complete branches.

This caused trouble when an entity became runnable (enqueue_entity)
under a throttled hierarchy (see commit b34cb07dde7c ("sched/fair: Fix
enqueue_task_fair() warning some more")). (Basically, when we add to the
load list we have to ensure all the ancestors are added, and when
deleting we have to delete the whole subtree.)

This patch simplifies the code by not updating load_cfs_rq_list when
we're operating under a throttled hierarchy and instead defers the
load_cfs_rq_list update to the point when the whole hierarchy is
unthrottled (tg_unthrottle_up()). This works because subtrees of a
throttled cfs_rq do not decay their PELT while they're being throttled
(only the parent of the throttled cfs_rq is decayed), so they don't need
to be on the list in the meantime.

The code is now simpler and load_cfs_rq_list contains only cfs_rqs that
have load to decay.
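
To restate the invariant referenced above, a minimal sketch (a
hypothetical helper shown for illustration only; it relies on group_se()
from patch 2/5 and the list primitives renamed in patch 3/5 and is not
part of this patch): once a cfs_rq is put on the load list, its ancestors
have to be added bottom-up until the branch reconnects to the main list.

static void load_list_add_branch(struct rq *rq, struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = group_se(cfs_rq);

	/* Returns true once the branch is connected back to the main list. */
	if (!list_add_load_cfs_rq(cfs_rq)) {
		/* Not connected yet: keep adding ancestors, bottom-up. */
		for_each_sched_entity(se) {
			if (list_add_load_cfs_rq(cfs_rq_of(se)))
				break;
		}
	}

	/* The branch must be complete now. */
	assert_list_load_cfs_rq(rq);
}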

Signed-off-by: Michal Koutný <[email protected]>
---
kernel/sched/fair.c | 58 ++++++++++-----------------------------------
1 file changed, 12 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f4d5d4dcdd9..9978485334ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3086,6 +3086,8 @@ void reweight_task(struct task_struct *p, int prio)
load->inv_weight = sched_prio_to_wmult[prio];
}

+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+
#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_SMP
/*
@@ -3196,8 +3198,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_SMP */

-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
-
/*
* Recomputes the group entity based on the current state of its group
* runqueue.
@@ -4294,10 +4294,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

/*
* When bandwidth control is enabled, cfs might have been removed
- * because of a parent been throttled but cfs->nr_running > 1. Try to
- * add it unconditionally.
+ * because of a parent being throttled. We'll add it later (with a
+ * complete branch up to se->on_rq/cfs_rq->on_list) in
+ * tg_unthrottle_up() and unthrottle_cfs_rq().
*/
- if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
+ if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
list_add_load_cfs_rq(cfs_rq);

if (cfs_rq->nr_running == 1)
@@ -4936,31 +4937,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto unthrottle_throttle;
-
- /*
- * One parent has been throttled and cfs_rq removed from the
- * list. Add it back to not break the load list.
- */
- if (throttled_hierarchy(cfs_rq))
- list_add_load_cfs_rq(cfs_rq);
}

/* At this point se is NULL and we are at root level*/
add_nr_running(rq, task_delta);

unthrottle_throttle:
- /*
- * The cfs_rq_throttled() breaks in the above iteration can result in
- * incomplete load list maintenance, resulting in triggering the
- * assertion below.
- */
- for_each_sched_entity(se) {
- cfs_rq = cfs_rq_of(se);
-
- if (list_add_load_cfs_rq(cfs_rq))
- break;
- }
-
+ /* See enqueue_task_fair:enqueue_throttle */
assert_list_load_cfs_rq(rq);

/* Determine whether we need to wake up potentially idle CPU: */
@@ -5600,13 +5583,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto enqueue_throttle;
-
- /*
- * One parent has been throttled and cfs_rq removed from the
- * list. Add it back to not break the load list.
- */
- if (throttled_hierarchy(cfs_rq))
- list_add_load_cfs_rq(cfs_rq);
}

/* At this point se is NULL and we are at root level*/
@@ -5630,21 +5606,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_overutilized_status(rq);

enqueue_throttle:
- if (cfs_bandwidth_used()) {
- /*
- * When bandwidth control is enabled; the cfs_rq_throttled()
- * breaks in the above iteration can result in incomplete
- * load list maintenance, resulting in triggering the assertion
- * below.
- */
- for_each_sched_entity(se) {
- cfs_rq = cfs_rq_of(se);
-
- if (list_add_load_cfs_rq(cfs_rq))
- break;
- }
- }
-
+ /*
+ * If we got here, subtree of a cfs_rq must have been throttled and
+ * therefore we did not modify load list or we climbed up to root (or
+ * joined to an ancestor cfs_rq with on_rq == 1 => on_list).
+ */
assert_list_load_cfs_rq(rq);

hrtick_update(rq);
--
2.32.0

2021-08-19 17:53:37

by Michal Koutný

Subject: [RFC PATCH v2 3/5] sched/fair: Rename leaf_list to more fitting load_list

The leaf_list name is obsolete and misleading. The list is nowadays used
to hold cfs_rqs with non-zero PELT load that has to be decayed to properly
account for blocked tasks. Those can be inner nodes of the task_group
tree as well.

Signed-off-by: Michal Koutný <[email protected]>
---
kernel/sched/core.c | 4 +-
kernel/sched/fair.c | 114 +++++++++++++++++++++----------------------
kernel/sched/sched.h | 17 +++----
3 files changed, 65 insertions(+), 70 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20ffcc044134..e55a7c898cd9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8961,8 +8961,8 @@ void __init sched_init(void)
init_rt_rq(&rq->rt);
init_dl_rq(&rq->dl);
#ifdef CONFIG_FAIR_GROUP_SCHED
- INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
- rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+ INIT_LIST_HEAD(&rq->load_cfs_rq_list);
+ rq->tmp_alone_branch = &rq->load_cfs_rq_list;
/*
* How much CPU bandwidth does root_task_group get?
*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 905f95b91a7a..6f4d5d4dcdd9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -286,13 +286,13 @@ static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
strlcpy(path, "(null)", len);
}

-static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline bool list_add_load_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
int cpu = cpu_of(rq);

if (cfs_rq->on_list)
- return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list;
+ return rq->tmp_alone_branch == &rq->load_cfs_rq_list;

cfs_rq->on_list = 1;

@@ -313,14 +313,14 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
* the list, this means to put the child at the tail
* of the list that starts by parent.
*/
- list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
- &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
+ list_add_tail_rcu(&cfs_rq->load_cfs_rq_list,
+ &(cfs_rq->tg->parent->cfs_rq[cpu]->load_cfs_rq_list));
/*
* The branch is now connected to its tree so we can
* reset tmp_alone_branch to the beginning of the
* list.
*/
- rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+ rq->tmp_alone_branch = &rq->load_cfs_rq_list;
return true;
}

@@ -329,13 +329,13 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
* cfs rq without parent should be put
* at the tail of the list.
*/
- list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
- &rq->leaf_cfs_rq_list);
+ list_add_tail_rcu(&cfs_rq->load_cfs_rq_list,
+ &rq->load_cfs_rq_list);
/*
* We have reach the top of a tree so we can reset
* tmp_alone_branch to the beginning of the list.
*/
- rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+ rq->tmp_alone_branch = &rq->load_cfs_rq_list;
return true;
}

@@ -345,44 +345,44 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
* tmp_alone_branch points to the begin of the branch
* where we will add parent.
*/
- list_add_rcu(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch);
+ list_add_rcu(&cfs_rq->load_cfs_rq_list, rq->tmp_alone_branch);
/*
* update tmp_alone_branch to points to the new begin
* of the branch
*/
- rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
+ rq->tmp_alone_branch = &cfs_rq->load_cfs_rq_list;
return false;
}

-static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_del_load_cfs_rq(struct cfs_rq *cfs_rq)
{
if (cfs_rq->on_list) {
struct rq *rq = rq_of(cfs_rq);

/*
* With cfs_rq being unthrottled/throttled during an enqueue,
- * it can happen the tmp_alone_branch points the a leaf that
+ * it can happen the tmp_alone_branch points the cfs_rq that
* we finally want to del. In this case, tmp_alone_branch moves
- * to the prev element but it will point to rq->leaf_cfs_rq_list
+ * to the prev element but it will point to rq->load_cfs_rq_list
* at the end of the enqueue.
*/
- if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
- rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
+ if (rq->tmp_alone_branch == &cfs_rq->load_cfs_rq_list)
+ rq->tmp_alone_branch = cfs_rq->load_cfs_rq_list.prev;

- list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
+ list_del_rcu(&cfs_rq->load_cfs_rq_list);
cfs_rq->on_list = 0;
}
}

-static inline void assert_list_leaf_cfs_rq(struct rq *rq)
+static inline void assert_list_load_cfs_rq(struct rq *rq)
{
- SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
+ SCHED_WARN_ON(rq->tmp_alone_branch != &rq->load_cfs_rq_list);
}

-/* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
- list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
- leaf_cfs_rq_list)
+/* Iterate thr' all loaded cfs_rq's on a runqueue */
+#define for_each_load_cfs_rq_safe(rq, cfs_rq, pos) \
+ list_for_each_entry_safe(cfs_rq, pos, &rq->load_cfs_rq_list, \
+ load_cfs_rq_list)

/* Do the two (enqueued) entities belong to the same group ? */
static inline struct cfs_rq *
@@ -442,20 +442,20 @@ static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
strlcpy(path, "(null)", len);
}

-static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline bool list_add_load_cfs_rq(struct cfs_rq *cfs_rq)
{
return true;
}

-static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_del_load_cfs_rq(struct cfs_rq *cfs_rq)
{
}

-static inline void assert_list_leaf_cfs_rq(struct rq *rq)
+static inline void assert_list_load_cfs_rq(struct rq *rq)
{
}

-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+#define for_each_load_cfs_rq_safe(rq, cfs_rq, pos) \
for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)

static inline struct sched_entity *parent_entity(struct sched_entity *se)
@@ -3257,12 +3257,12 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
#ifdef CONFIG_SMP
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
- * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
+ * Because list_add_load_cfs_rq always places a child cfs_rq on the list
* immediately before a parent cfs_rq, and cfs_rqs are removed from the list
* bottom-up, we only have to test whether the cfs_rq before us on the list
* is our child.
* If cfs_rq is not on the list, test whether a child needs its to be added to
- * connect a branch to the tree * (see list_add_leaf_cfs_rq() for details).
+ * connect a branch to the tree (see list_add_load_cfs_rq() for details).
*/
static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
{
@@ -3270,14 +3270,14 @@ static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
struct list_head *prev;

if (cfs_rq->on_list) {
- prev = cfs_rq->leaf_cfs_rq_list.prev;
+ prev = cfs_rq->load_cfs_rq_list.prev;
} else {
struct rq *rq = rq_of(cfs_rq);

prev = rq->tmp_alone_branch;
}

- prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
+ prev_cfs_rq = container_of(prev, struct cfs_rq, load_cfs_rq_list);

return (prev_cfs_rq->tg->parent == cfs_rq->tg);
}
@@ -4298,7 +4298,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* add it unconditionally.
*/
if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
- list_add_leaf_cfs_rq(cfs_rq);
+ list_add_load_cfs_rq(cfs_rq);

if (cfs_rq->nr_running == 1)
check_enqueue_throttle(cfs_rq);
@@ -4775,7 +4775,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)

/* Add cfs_rq with load or one or more already running entities to the list */
if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
- list_add_leaf_cfs_rq(cfs_rq);
+ list_add_load_cfs_rq(cfs_rq);
}

return 0;
@@ -4789,7 +4789,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
/* group is entering throttled state, stop time */
if (!cfs_rq->throttle_count) {
cfs_rq->throttled_clock_task = rq_clock_task(rq);
- list_del_leaf_cfs_rq(cfs_rq);
+ list_del_load_cfs_rq(cfs_rq);
}
cfs_rq->throttle_count++;

@@ -4900,10 +4900,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
/* Nothing to run but something to decay? Complete the branch */
if (cfs_rq->on_list)
for_each_sched_entity(se) {
- if (list_add_leaf_cfs_rq(group_cfs_rq(se)))
+ if (list_add_load_cfs_rq(group_cfs_rq(se)))
break;
}
- assert_list_leaf_cfs_rq(rq);
+ assert_list_load_cfs_rq(rq);
return;
}

@@ -4939,10 +4939,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

/*
* One parent has been throttled and cfs_rq removed from the
- * list. Add it back to not break the leaf list.
+ * list. Add it back to not break the load list.
*/
if (throttled_hierarchy(cfs_rq))
- list_add_leaf_cfs_rq(cfs_rq);
+ list_add_load_cfs_rq(cfs_rq);
}

/* At this point se is NULL and we are at root level*/
@@ -4951,17 +4951,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
unthrottle_throttle:
/*
* The cfs_rq_throttled() breaks in the above iteration can result in
- * incomplete leaf list maintenance, resulting in triggering the
+ * incomplete load list maintenance, resulting in triggering the
* assertion below.
*/
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);

- if (list_add_leaf_cfs_rq(cfs_rq))
+ if (list_add_load_cfs_rq(cfs_rq))
break;
}

- assert_list_leaf_cfs_rq(rq);
+ assert_list_load_cfs_rq(rq);

/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
@@ -5601,12 +5601,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
goto enqueue_throttle;

- /*
- * One parent has been throttled and cfs_rq removed from the
- * list. Add it back to not break the leaf list.
- */
- if (throttled_hierarchy(cfs_rq))
- list_add_leaf_cfs_rq(cfs_rq);
+ /*
+ * One parent has been throttled and cfs_rq removed from the
+ * list. Add it back to not break the load list.
+ */
+ if (throttled_hierarchy(cfs_rq))
+ list_add_load_cfs_rq(cfs_rq);
}

/* At this point se is NULL and we are at root level*/
@@ -5634,18 +5634,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/*
* When bandwidth control is enabled; the cfs_rq_throttled()
* breaks in the above iteration can result in incomplete
- * leaf list maintenance, resulting in triggering the assertion
+ * load list maintenance, resulting in triggering the assertion
* below.
*/
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);

- if (list_add_leaf_cfs_rq(cfs_rq))
+ if (list_add_load_cfs_rq(cfs_rq))
break;
}
}

- assert_list_leaf_cfs_rq(rq);
+ assert_list_load_cfs_rq(rq);

hrtick_update(rq);
}
@@ -8122,9 +8122,9 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)

/*
* Iterates the task_group tree in a bottom up fashion, see
- * list_add_leaf_cfs_rq() for details.
+ * list_add_load_cfs_rq() for details.
*/
- for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+ for_each_load_cfs_rq_safe(rq, cfs_rq, pos) {
struct sched_entity *se;

if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
@@ -8144,7 +8144,7 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
* decayed cfs_rqs linger on the list.
*/
if (cfs_rq_is_decayed(cfs_rq))
- list_del_leaf_cfs_rq(cfs_rq);
+ list_del_load_cfs_rq(cfs_rq);

/* Don't need periodic decay once load/util_avg are null */
if (cfs_rq_has_blocked(cfs_rq))
@@ -11111,7 +11111,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq;

- list_add_leaf_cfs_rq(cfs_rq_of(se));
+ list_add_load_cfs_rq(cfs_rq_of(se));

/* Start to propagate at parent */
se = se->parent;
@@ -11121,11 +11121,11 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)

if (!cfs_rq_throttled(cfs_rq)){
update_load_avg(cfs_rq, se, UPDATE_TG);
- list_add_leaf_cfs_rq(cfs_rq);
+ list_add_load_cfs_rq(cfs_rq);
continue;
}

- if (list_add_leaf_cfs_rq(cfs_rq))
+ if (list_add_load_cfs_rq(cfs_rq))
break;
}
}
@@ -11383,7 +11383,7 @@ void unregister_fair_sched_group(struct task_group *tg)
rq = cpu_rq(cpu);

raw_spin_rq_lock_irqsave(rq, flags);
- list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
+ list_del_load_cfs_rq(tg->cfs_rq[cpu]);
raw_spin_rq_unlock_irqrestore(rq, flags);
}
}
@@ -11543,7 +11543,7 @@ void print_cfs_stats(struct seq_file *m, int cpu)
struct cfs_rq *cfs_rq, *pos;

rcu_read_lock();
- for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
+ for_each_load_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
print_cfs_rq(m, cpu, cfs_rq);
rcu_read_unlock();
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 219ee463fe64..dc9382295ec9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -587,16 +587,8 @@ struct cfs_rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */

- /*
- * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
- * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
- * (like users, containers etc.)
- *
- * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.
- * This list is used during load balance.
- */
int on_list;
- struct list_head leaf_cfs_rq_list;
+ struct list_head load_cfs_rq_list;
struct task_group *tg; /* group that "owns" this runqueue */

#ifdef CONFIG_CFS_BANDWIDTH
@@ -950,8 +942,11 @@ struct rq {
struct dl_rq dl;

#ifdef CONFIG_FAIR_GROUP_SCHED
- /* list of leaf cfs_rq on this CPU: */
- struct list_head leaf_cfs_rq_list;
+ /* Bottom up ordered list of cfs_rqs with load (see
+ * cfs_rq_is_decayed()) on this CPU.
+ * This list is used during load balance.
+ */
+ struct list_head load_cfs_rq_list;
struct list_head *tmp_alone_branch;
#endif /* CONFIG_FAIR_GROUP_SCHED */

--
2.32.0

2021-08-19 17:55:54

by Michal Koutný

Subject: [RFC PATCH v2 5/5] sched/fair: Simplify ancestor enqueue loops

When a task is enqueued or a cfs_rq is unthrottled, we have work to do
from the cfs_rq in question possibly up to the root. The important nodes
on the path are throttled cfs_rqs or cfs_rqs already enqueued to their
parent.

Instead of multiple (interrupted) loops, do all the work in a single loop
and decide inside it what needs to be done. This undoes parts of
commit 39f23ce07b93 ("sched/fair: Fix unthrottle_cfs_rq() for
leaf_cfs_rq list") but it should not bring any functional changes.

Note that some PELT stats update code is duplicated in both enqueue_entity
and the ancestor loop (update_load_avg, se_update_runnable,
update_cfs_group). It'd be nice to factor these out; however, the later
parts of enqueue_entity rely on the updates, so stick with the current
repetition.

Signed-off-by: Michal Koutný <[email protected]>
---
kernel/sched/fair.c | 57 +++++++++++++++++----------------------------
1 file changed, 21 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9978485334ec..79f183336fa8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4883,6 +4883,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
long task_delta, idle_task_delta;
+ int enqueue = 1;

cfs_rq->throttled = 0;

@@ -4911,29 +4912,21 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
for_each_sched_entity(se) {
- if (se->on_rq)
- break;
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
-
- cfs_rq->h_nr_running += task_delta;
- cfs_rq->idle_h_nr_running += idle_task_delta;

- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto unthrottle_throttle;
- }
-
- for_each_sched_entity(se) {
- cfs_rq = cfs_rq_of(se);
-
- update_load_avg(cfs_rq, se, UPDATE_TG);
- se_update_runnable(se);
+ if (se->on_rq)
+ enqueue = 0;
+ if (enqueue)
+ enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+ else {
+ update_load_avg(cfs_rq, se, UPDATE_TG);
+ se_update_runnable(se);
+ /* XXX: no update_cfs_group(se); */
+ }

cfs_rq->h_nr_running += task_delta;
cfs_rq->idle_h_nr_running += idle_task_delta;

-
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto unthrottle_throttle;
@@ -5537,6 +5530,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
+ int enqueue = 1;

/*
* The code below (indirectly) updates schedutil which looks at
@@ -5555,27 +5549,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

for_each_sched_entity(se) {
- if (se->on_rq)
- break;
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se, flags);

- cfs_rq->h_nr_running++;
- cfs_rq->idle_h_nr_running += idle_h_nr_running;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto enqueue_throttle;
-
- flags = ENQUEUE_WAKEUP;
- }
-
- for_each_sched_entity(se) {
- cfs_rq = cfs_rq_of(se);
-
- update_load_avg(cfs_rq, se, UPDATE_TG);
- se_update_runnable(se);
- update_cfs_group(se);
+ if (se->on_rq)
+ enqueue = 0;
+ if (enqueue) {
+ enqueue_entity(cfs_rq, se, flags);
+ flags = ENQUEUE_WAKEUP;
+ } else {
+ update_load_avg(cfs_rq, se, UPDATE_TG);
+ se_update_runnable(se);
+ update_cfs_group(se);
+ }

cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
--
2.32.0

2021-08-19 18:33:56

by Michal Koutný

Subject: [RFC PATCH v2 1/5] sched/fair: Add ancestors of unthrottled undecayed cfs_rq

Since commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to
list on unthrottle") we add cfs_rqs that have no runnable tasks but are
not fully decayed to the load (leaf) list. We may skip adding some
ancestors and thereby break the tmp_alone_branch invariant. This broke
the LTP test cfs_bandwidth01 and was partially fixed in commit
fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after
unthrottling").

I noticed the named test still fails even with the fix (but with low
probability, 1 in ~1000 executions of the test). The reason is that when
bailing out of unthrottle_cfs_rq() early, we may miss adding ancestors of
the unthrottled cfs_rq and thus not rejoin tmp_alone_branch properly.

Fix this by adding ancestors if we notice the unthrottled cfs_rq was
added to the load list.
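
For reference, the invariant in question is checked at the end of the
enqueue/unthrottle paths by the existing helper (mainline code quoted
here only for context, with an added explanatory comment):

static inline void assert_list_leaf_cfs_rq(struct rq *rq)
{
	/* A complete branch leaves tmp_alone_branch back at the list head. */
	SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
}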

Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
Signed-off-by: Michal Koutný <[email protected]>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c452072a1b..2c41a9007928 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4898,8 +4898,16 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
/* update hierarchical throttle state */
walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);

- if (!cfs_rq->load.weight)
+ if (!cfs_rq->load.weight) {
+ /* Nothing to run but something to decay? Complete the branch */
+ if (cfs_rq->on_list)
+ for_each_sched_entity(se) {
+ if (list_add_leaf_cfs_rq(group_cfs_rq(se)))
+ break;
+ }
+ assert_list_leaf_cfs_rq(rq);
return;
+ }

task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
--
2.32.0

2021-08-19 18:34:34

by Michal Koutný

Subject: [RFC PATCH v2 2/5] sched: Add group_se() helper

No functional change; unify the cfs_rq to sched_entity conversion (and
move it closer to its use where possible). The helper is used only by
CONFIG_FAIR_GROUP_SCHED code, i.e. no dummy variant is defined.
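
For illustration, the open-coded conversion versus the helper (both
taken from the hunks below):

	/* before: open-coded cfs_rq -> sched_entity conversion */
	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

	/* after: */
	se = group_se(cfs_rq);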

Signed-off-by: Michal Koutný <[email protected]>
---
kernel/sched/fair.c | 9 +++------
kernel/sched/sched.h | 8 ++++++++
2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c41a9007928..905f95b91a7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4824,8 +4824,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (!dequeue)
return false; /* Throttle no longer required. */

- se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
@@ -4833,6 +4831,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)

task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ se = group_se(cfs_rq);
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
/* throttled entity or throttle-on-deactivate */
@@ -4884,8 +4883,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct sched_entity *se;
long task_delta, idle_task_delta;

- se = cfs_rq->tg->se[cpu_of(rq)];
-
cfs_rq->throttled = 0;

update_rq_clock(rq);
@@ -4898,6 +4895,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
/* update hierarchical throttle state */
walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);

+ se = group_se(cfs_rq);
if (!cfs_rq->load.weight) {
/* Nothing to run but something to decay? Complete the branch */
if (cfs_rq->on_list)
@@ -8163,8 +8161,7 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
*/
static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
{
- struct rq *rq = rq_of(cfs_rq);
- struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+ struct sched_entity *se = group_se(cfs_rq);
unsigned long now = jiffies;
unsigned long load;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14a41a243f7b..219ee463fe64 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1383,6 +1383,14 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
return grp->my_q;
}

+/* sched entity representing the cfs_rq, NULL for root */
+static inline struct sched_entity *group_se(struct cfs_rq *cfs_rq)
+{
+ int cpu = cpu_of(rq_of(cfs_rq));
+
+ return cfs_rq->tg->se[cpu];
+}
+
#else

static inline struct task_struct *task_of(struct sched_entity *se)
--
2.32.0

2021-09-09 14:41:47

by Vincent Guittot

Subject: Re: [RFC PATCH v2 1/5] sched/fair: Add ancestors of unthrottled undecayed cfs_rq

On Thu, 19 Aug 2021 at 19:50, Michal Koutný <[email protected]> wrote:
>
> Since commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to
> list on unthrottle") we add cfs_rqs with no runnable tasks but not fully
> decayed into the load (leaf) list. We may ignore adding some ancestors
> and therefore breaking tmp_alone_branch invariant. This broke LTP test
> cfs_bandwidth01 and it was partially fixed in commit fdaba61ef8a2
> ("sched/fair: Ensure that the CFS parent is added after unthrottling").
>
> I noticed the named test still fails even with the fix (but with low
> probability, 1 in ~1000 executions of the test). The reason is when
> bailing out of unthrottle_cfs_rq early, we may miss adding ancestors of
> the unthrottled cfs_rq, thus, not joining tmp_alone_branch properly.
>
> Fix this by adding ancestors if we notice the unthrottled cfs_rq was
> added to the load list.
>
> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> Signed-off-by: Michal Koutný <[email protected]>
> ---
> kernel/sched/fair.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 44c452072a1b..2c41a9007928 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4898,8 +4898,16 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> /* update hierarchical throttle state */
> walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
>
> - if (!cfs_rq->load.weight)
> + if (!cfs_rq->load.weight) {
> + /* Nothing to run but something to decay? Complete the branch */
> + if (cfs_rq->on_list)

Could you use !cfs_rq_is_decayed(cfs_rq)?

> + for_each_sched_entity(se) {
> + if (list_add_leaf_cfs_rq(group_cfs_rq(se)))
> + break;
> + }
> + assert_list_leaf_cfs_rq(rq);

Instead of adding a loop here, wouldn't it be better to jump to unthrottle_throttle?

> return;
> + }
>
> task_delta = cfs_rq->h_nr_running;
> idle_task_delta = cfs_rq->idle_h_nr_running;
> --
> 2.32.0
>

2021-09-09 14:42:08

by Vincent Guittot

Subject: Re: [RFC PATCH v2 5/5] sched/fair: Simplify ancestor enqueue loops

On Thu, 19 Aug 2021 at 19:50, Michal Koutný <[email protected]> wrote:
>
> When a task is enqueued or cfs_rq is unthrottled we have work to do from
> the cfs_rq in question possibly up to root. The important nodes on the
> path are throttled cfs_rqs or cfs_rqs already enqueud to their parent.
>
> Instead of multiple (interrupted) loops make all work in a single loop
> and decide what all needs to be done inside it. This undoes parts of

These multiple break loops have been done to make unthrottle_cfs_rq,
throttle_cfs_rq, enqueue_task_fair and dequeue_task_fair follow the same
pattern, and I don't see any good reason to break this.

> commit 39f23ce07b93 ("sched/fair: Fix unthrottle_cfs_rq() for
> leaf_cfs_rq list") but it should not bring any functional changes.
>
> Note some PELT stats update code is duplicated both in enqueue_entity
> and the ancestor loop (update_load_avg, se_update_runnable,
> update_cfs_group). It'd be nice to factor these out, however, the later
> parts of enqueue_entity rely on the updates, so stick with the current
> repetition.
>
> Signed-off-by: Michal Koutný <[email protected]>
> ---
> kernel/sched/fair.c | 57 +++++++++++++++++----------------------------
> 1 file changed, 21 insertions(+), 36 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9978485334ec..79f183336fa8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4883,6 +4883,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> struct sched_entity *se;
> long task_delta, idle_task_delta;
> + int enqueue = 1;
>
> cfs_rq->throttled = 0;
>
> @@ -4911,29 +4912,21 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> task_delta = cfs_rq->h_nr_running;
> idle_task_delta = cfs_rq->idle_h_nr_running;
> for_each_sched_entity(se) {
> - if (se->on_rq)
> - break;
> cfs_rq = cfs_rq_of(se);
> - enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> -
> - cfs_rq->h_nr_running += task_delta;
> - cfs_rq->idle_h_nr_running += idle_task_delta;
>
> - /* end evaluation on encountering a throttled cfs_rq */
> - if (cfs_rq_throttled(cfs_rq))
> - goto unthrottle_throttle;
> - }
> -
> - for_each_sched_entity(se) {
> - cfs_rq = cfs_rq_of(se);
> -
> - update_load_avg(cfs_rq, se, UPDATE_TG);
> - se_update_runnable(se);
> + if (se->on_rq)
> + enqueue = 0;
> + if (enqueue)
> + enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> + else {
> + update_load_avg(cfs_rq, se, UPDATE_TG);
> + se_update_runnable(se);
> + /* XXX: no update_cfs_group(se); */
> + }
>
> cfs_rq->h_nr_running += task_delta;
> cfs_rq->idle_h_nr_running += idle_task_delta;
>
> -
> /* end evaluation on encountering a throttled cfs_rq */
> if (cfs_rq_throttled(cfs_rq))
> goto unthrottle_throttle;
> @@ -5537,6 +5530,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct sched_entity *se = &p->se;
> int idle_h_nr_running = task_has_idle_policy(p);
> int task_new = !(flags & ENQUEUE_WAKEUP);
> + int enqueue = 1;
>
> /*
> * The code below (indirectly) updates schedutil which looks at
> @@ -5555,27 +5549,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
>
> for_each_sched_entity(se) {
> - if (se->on_rq)
> - break;
> cfs_rq = cfs_rq_of(se);
> - enqueue_entity(cfs_rq, se, flags);
>
> - cfs_rq->h_nr_running++;
> - cfs_rq->idle_h_nr_running += idle_h_nr_running;
> -
> - /* end evaluation on encountering a throttled cfs_rq */
> - if (cfs_rq_throttled(cfs_rq))
> - goto enqueue_throttle;
> -
> - flags = ENQUEUE_WAKEUP;
> - }
> -
> - for_each_sched_entity(se) {
> - cfs_rq = cfs_rq_of(se);
> -
> - update_load_avg(cfs_rq, se, UPDATE_TG);
> - se_update_runnable(se);
> - update_cfs_group(se);
> + if (se->on_rq)
> + enqueue = 0;
> + if (enqueue) {
> + enqueue_entity(cfs_rq, se, flags);
> + flags = ENQUEUE_WAKEUP;
> + } else {
> + update_load_avg(cfs_rq, se, UPDATE_TG);
> + se_update_runnable(se);
> + update_cfs_group(se);
> + }
>
> cfs_rq->h_nr_running++;
> cfs_rq->idle_h_nr_running += idle_h_nr_running;
> --
> 2.32.0
>

2021-09-10 11:36:39

by Michal Koutný

Subject: Re: [RFC PATCH v2 1/5] sched/fair: Add ancestors of unthrottled undecayed cfs_rq

Hello Vincent.

Thank you for looking into this!

On Thu, Sep 09, 2021 at 03:57:37PM +0200, Vincent Guittot <[email protected]> wrote:
> > + /* Nothing to run but something to decay? Complete the branch */
> > + if (cfs_rq->on_list)
>
> Could you use !cfs_rq_is decayed(cfs_rq) ?

What needs to be checked here is whether the list was modified by adding
the cfs_rq (and the branch needs closing).

It would appear that the same condition as in tg_unthrottle_up() would
do, i.e.
!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running
but unthrottle_cfs_rq() can be called under a still throttled ancestor
(i.e. throttle_count not dropping to zero), and in such a case the cfs_rq
should not be added to the list yet.

Therefore, a mere !cfs_rq_is_decayed(cfs_rq) doesn't seem correct to me.

> > + for_each_sched_entity(se) {
> > + if (list_add_leaf_cfs_rq(group_cfs_rq(se)))
> > + break;
> > + }
> > + assert_list_leaf_cfs_rq(rq);
>
> Instead of adding a loop here you should better jump to unthrottle_throttle ?

Oh, that looks a bit clumsy now (it's an artifact I left behind when
reordering the patches in the series to be backport-friendly).
A jump to unthrottle_throttle indeed seems easier; there would just be
the additional check
if (rq->curr == rq->idle && rq->cfs.nr_running)
which, besides being unnecessary work, should be harmless.

Is the jump the preferred form?

Michal
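
For context, a minimal sketch of the jump variant being discussed
(illustrative only, not a posted patch; it assumes the
unthrottle_throttle tail as it exists before patch 4/5, which completes
the branch and runs the assertion):

	if (!cfs_rq->load.weight) {
		/*
		 * Nothing to run but something to decay? Let the existing
		 * unthrottle_throttle tail complete the branch; the extra
		 * idle-CPU wakeup check there is unnecessary but harmless.
		 */
		if (cfs_rq->on_list)
			goto unthrottle_throttle;
		return;
	}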

2021-09-10 11:39:45

by Michal Koutný

Subject: Re: [RFC PATCH v2 5/5] sched/fair: Simplify ancestor enqueue loops

On Thu, Sep 09, 2021 at 04:04:02PM +0200, Vincent Guittot <[email protected]> wrote:
> These multiple break loops have been done to make unthrottle_cfs_rq,
> throttle_cfs_rq, enqueue_task_fair and dequeue_task_fair to follow the
> same pattern

Ah, I looked only at the unthrottle_cfs_rq and enqueue_task_fair pair and
missed the consistency with the other two.

> and I don't see any good reason to break this

Isn't this a good reason
> 21 insertions(+), 36 deletions(-)
?

(Take the stats with a grain of salt; I'd need to recheck how they would
hold if throttle_cfs_rq and dequeue_task_fair were modified too, plus they
look a bit better because of the loop from 1/5.)

Michal

2021-09-10 11:45:37

by Vincent Guittot

Subject: Re: [RFC PATCH v2 5/5] sched/fair: Simplify ancestor enqueue loops

On Fri, 10 Sept 2021 at 13:35, Michal Koutný <[email protected]> wrote:
>
> On Thu, Sep 09, 2021 at 04:04:02PM +0200, Vincent Guittot <[email protected]> wrote:
> > These multiple break loops have been done to make unthrottle_cfs_rq,
> > throttle_cfs_rq, enqueue_task_fair and dequeue_task_fair to follow the
> > same pattern
>
> Ah, I watched only the unthrottle_cfs_rq and enqueue_task_fair pair and
> missed the consistency with the other two.
>
> > and I don't see any good reason to break this
>
> Isn't this a good reason
> > 21 insertions(+), 36 deletions(-)
> ?

Not if it makes the code less readable, and I prefer the current implementation.

>
> (The stats are with a grain of salt, I'd need to recheck how these stats
> would hold if throttle_cfs_rq, dequeue_task_fair would be modified too +
> they look a bit better because of the loop from 1/5.)
>
> Michal

2021-09-10 13:19:59

by Vincent Guittot

Subject: Re: [RFC PATCH v2 1/5] sched/fair: Add ancestors of unthrottled undecayed cfs_rq

On Fri, 10 Sept 2021 at 13:35, Michal Koutný <[email protected]> wrote:
>
> Hello Vincent.
>
> Thank you for looking into this!
>
> On Thu, Sep 09, 2021 at 03:57:37PM +0200, Vincent Guittot <[email protected]> wrote:
> > > + /* Nothing to run but something to decay? Complete the branch */
> > > + if (cfs_rq->on_list)
> >
> > Could you use !cfs_rq_is decayed(cfs_rq) ?
>
> What needs to be checked here is whether the list was modified by adding
> the cfs_rq (and branch needs closing).
>
> It'd appear that the equal condition like in tg_unthrottle_up() would
> make do, i.e.
> !cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running
> but the unthrottle_cfs_rq() can be called under a still throttled
> ancestor (i.e. throttle_count not dropping to zero) and in such a case
> cfs_rq should not be added to the list yet.
>
> Therefore, mere !cfs_rq_is_decayed(cfs_rq) doesn't seem correct to me.

fair enough

>
> > > + for_each_sched_entity(se) {
> > > + if (list_add_leaf_cfs_rq(group_cfs_rq(se)))
> > > + break;
> > > + }
> > > + assert_list_leaf_cfs_rq(rq);
> >
> > Instead of adding a loop here you should better jump to unthrottle_throttle ?
>
> Oh, that looks a bit clumsy now (it's an artifact I've left when
> reordering the patch in the series to be backport-friendly).
> Jump to unthrottle_throttle seems easier indeed, there would be just the
> additional check
> if (rq->curr == rq->idle && rq->cfs.nr_running)
> . Besides unnecessary work, it should be harmless.

Yes, the condition should always be false.

>
> Is the jump the preferred form?

Yes, compared to adding the exact same loop.

>
> Michal

2021-09-10 14:21:59

by Vincent Guittot

Subject: Re: [RFC PATCH v2 4/5] sched/fair: Simplify load_cfs_rq_list maintenance

On Thu, 19 Aug 2021 at 19:50, Michal Koutný <[email protected]> wrote:
>
> load_cfs_rq_list should contain cfs_rqs that have runnable entities in
> them. When we're operating under a throttled hierarchy we always
> update the load_cfs_rq_list in order not to break list_add_load_cfs_rq
> invariant of adding complete branches.
>
> This caused troubles when an entity became runnable (enqueue_entity)
> under a throttled hierarchy (see commit b34cb07dde7c ("sched/fair: Fix
> enqueue_task_fair() warning some more")). (Basically when we add to the
> load list, we have to ensure all the ancestors are added and when
> deleting we have to delete whole subtree.)
>
> This patch simplifies the code by no updates of load_cfs_rq_list when
> we're operating under the throttled hierarchy and defers the
> load_cfs_rq_list update to the point when whole hierarchy is unthrottled
> (tg_unthrottle_up). Specifically, subtrees of a throttled cfs_rq are not
> decaying their PELT when they're being throttled (but the parent of the
> throttled cfs_rq is decayed).

Your proposal looks interesting, but I need more time to make sure that
all cases are covered. We have faced several odd corner cases and
sequences in the past, and I need time to check that you don't
reintroduce some of them.

>
> The code is now simpler and load_cfs_rq_list contains only cfs_rqs that
> have load to decay.
>
> Signed-off-by: Michal Koutný <[email protected]>
> ---
> kernel/sched/fair.c | 58 ++++++++++-----------------------------------
> 1 file changed, 12 insertions(+), 46 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6f4d5d4dcdd9..9978485334ec 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3086,6 +3086,8 @@ void reweight_task(struct task_struct *p, int prio)
> load->inv_weight = sched_prio_to_wmult[prio];
> }
>
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> #ifdef CONFIG_SMP
> /*
> @@ -3196,8 +3198,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
> }
> #endif /* CONFIG_SMP */
>
> -static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
> -
> /*
> * Recomputes the group entity based on the current state of its group
> * runqueue.
> @@ -4294,10 +4294,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>
> /*
> * When bandwidth control is enabled, cfs might have been removed
> - * because of a parent been throttled but cfs->nr_running > 1. Try to
> - * add it unconditionally.
> + * because of a parent being throttled. We'll add it later (with a
> + * complete branch up to se->on_rq/cfs_rq->on_list) in
> + * tg_unthrottle_up() and unthrottle_cfs_rq().
> */
> - if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
> + if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
> list_add_load_cfs_rq(cfs_rq);
>
> if (cfs_rq->nr_running == 1)
> @@ -4936,31 +4937,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> /* end evaluation on encountering a throttled cfs_rq */
> if (cfs_rq_throttled(cfs_rq))
> goto unthrottle_throttle;
> -
> - /*
> - * One parent has been throttled and cfs_rq removed from the
> - * list. Add it back to not break the load list.
> - */
> - if (throttled_hierarchy(cfs_rq))
> - list_add_load_cfs_rq(cfs_rq);
> }
>
> /* At this point se is NULL and we are at root level*/
> add_nr_running(rq, task_delta);
>
> unthrottle_throttle:
> - /*
> - * The cfs_rq_throttled() breaks in the above iteration can result in
> - * incomplete load list maintenance, resulting in triggering the
> - * assertion below.
> - */
> - for_each_sched_entity(se) {
> - cfs_rq = cfs_rq_of(se);
> -
> - if (list_add_load_cfs_rq(cfs_rq))
> - break;
> - }
> -
> + /* See enqueue_task_fair:enqueue_throttle */
> assert_list_load_cfs_rq(rq);
>
> /* Determine whether we need to wake up potentially idle CPU: */
> @@ -5600,13 +5583,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* end evaluation on encountering a throttled cfs_rq */
> if (cfs_rq_throttled(cfs_rq))
> goto enqueue_throttle;
> -
> - /*
> - * One parent has been throttled and cfs_rq removed from the
> - * list. Add it back to not break the load list.
> - */
> - if (throttled_hierarchy(cfs_rq))
> - list_add_load_cfs_rq(cfs_rq);
> }
>
> /* At this point se is NULL and we are at root level*/
> @@ -5630,21 +5606,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> update_overutilized_status(rq);
>
> enqueue_throttle:
> - if (cfs_bandwidth_used()) {
> - /*
> - * When bandwidth control is enabled; the cfs_rq_throttled()
> - * breaks in the above iteration can result in incomplete
> - * load list maintenance, resulting in triggering the assertion
> - * below.
> - */
> - for_each_sched_entity(se) {
> - cfs_rq = cfs_rq_of(se);
> -
> - if (list_add_load_cfs_rq(cfs_rq))
> - break;
> - }
> - }
> -
> + /*
> + * If we got here, subtree of a cfs_rq must have been throttled and
> + * therefore we did not modify load list or we climbed up to root (or
> + * joined to an ancestor cfs_rq with on_rq == 1 => on_list).
> + */
> assert_list_load_cfs_rq(rq);
>
> hrtick_update(rq);
> --
> 2.32.0
>

2021-09-14 09:23:03

by Michal Koutný

Subject: Re: [RFC PATCH v2 4/5] sched/fair: Simplify load_cfs_rq_list maintenance

On Fri, Sep 10, 2021 at 04:19:27PM +0200, Vincent Guittot <[email protected]> wrote:
> Your proposal looks interesting but I need more time to make sure that
> all cases are covered. We have faced several odd corner cases and
> sequences in the past that I need time to check that you don't put
> some back

Do you have any pointers to the cases that come to mind? I wonder
whether those could be reproduced with a simple setup.
(FTR, I used the LTP test (at b673f49ae) cfs_bandwidth01 to check this change.)

Thanks,
Michal

2021-09-14 09:48:51

by Vincent Guittot

Subject: Re: [RFC PATCH v2 4/5] sched/fair: Simplify load_cfs_rq_list maintenance

On Tue, 14 Sept 2021 at 11:22, Michal Koutný <[email protected]> wrote:
>
> On Fri, Sep 10, 2021 at 04:19:27PM +0200, Vincent Guittot <[email protected]> wrote:
> > Your proposal looks interesting but I need more time to make sure that
> > all cases are covered. We have faced several odd corner cases and
> > sequences in the past that I need time to check that you don't put
> > some back
>
> Do you have any pointers to the cases that come to your mind? I wonder
> if those could be reproduced with a simple setup.

I don't have a strict list, but several warnings about the leaf_list have
already been reported on lkml in the past. The use cases were quite
complicated, and I want to go through them to make sure they are still
covered.

Vincent

> (FTR, I used the LTP test (at b673f49ae) cfs_bandwidth01 to check this change.)
>
> Thanks,
> Michal

2021-09-15 12:32:05

by Vincent Guittot

Subject: Re: [RFC PATCH v2 4/5] sched/fair: Simplify load_cfs_rq_list maintenance

On Tue, 14 Sept 2021 at 11:45, Vincent Guittot
<[email protected]> wrote:
>
> On Tue, 14 Sept 2021 at 11:22, Michal Koutný <[email protected]> wrote:
> >
> > On Fri, Sep 10, 2021 at 04:19:27PM +0200, Vincent Guittot <[email protected]> wrote:
> > > Your proposal looks interesting but I need more time to make sure that
> > > all cases are covered. We have faced several odd corner cases and
> > > sequences in the past that I need time to check that you don't put
> > > some back
> >
> > Do you have any pointers to the cases that come to your mind? I wonder
> > if those could be reproduced with a simple setup.
>
> I don't have a strict list but several warnings for leaf_list have
> been already reported on lkml in the past and the use cases were quite
> complicated and I want to go through them to make sure they are still
> covered.

The corner cases that I wanted to check are covered by
a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
and
fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after
unthrottling").

This patch looks OK to me.
Also, propagate_entity_cfs_rq() could take advantage of the same kind
of change.

>
> Vincent
>
> > (FTR, I used the LTP test (at b673f49ae) cfs_bandwidth01 to check this change.)
> >
> > Thanks,
> > Michal

2021-09-15 19:00:36

by Odin Ugedal

Subject: Re: [RFC PATCH v2 4/5] sched/fair: Simplify load_cfs_rq_list maintenance

On Wed, 15 Sept 2021 at 13:31, Vincent Guittot <[email protected]> wrote:
> The corner cases that I wanted to check, are covered by
> a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> and
> fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after
> unthrottling")

I still have some of my testing setup from when I discovered the issues
behind a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
unthrottle"), so I will take a look at this patch later this week.

I need to wrap my head around this logic again, but the idea looks good
at first sight.

Thanks
Odin

2021-09-21 19:58:00

by Odin Ugedal

Subject: Re: [RFC PATCH v2 3/5] sched/fair: Rename leaf_list to more fitting load_list

On Thu, 19 Aug 2021 at 18:50, Michal Koutný <[email protected]> wrote:
>
> The leaf_list name is obsolete and misleading. The list is nowadays used
> to hold cfs_rqs with non-zero PELT that has to be decayed to properly
> account for blocked tasks. Those can be inner nodes of the task_group
> tree as well.
>
> Signed-off-by: Michal Koutný <[email protected]>
> ---
> kernel/sched/core.c | 4 +-
> kernel/sched/fair.c | 114 +++++++++++++++++++++----------------------
> kernel/sched/sched.h | 17 +++----
> 3 files changed, 65 insertions(+), 70 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 20ffcc044134..e55a7c898cd9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8961,8 +8961,8 @@ void __init sched_init(void)
> init_rt_rq(&rq->rt);
> init_dl_rq(&rq->dl);
> #ifdef CONFIG_FAIR_GROUP_SCHED
> - INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
> - rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
> + INIT_LIST_HEAD(&rq->load_cfs_rq_list);
> + rq->tmp_alone_branch = &rq->load_cfs_rq_list;
> /*
> * How much CPU bandwidth does root_task_group get?
> *
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 905f95b91a7a..6f4d5d4dcdd9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -286,13 +286,13 @@ static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
> strlcpy(path, "(null)", len);
> }
>
> -static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +static inline bool list_add_load_cfs_rq(struct cfs_rq *cfs_rq)
> {
> struct rq *rq = rq_of(cfs_rq);
> int cpu = cpu_of(rq);
>
> if (cfs_rq->on_list)
> - return rq->tmp_alone_branch == &rq->leaf_cfs_rq_list;
> + return rq->tmp_alone_branch == &rq->load_cfs_rq_list;
>
> cfs_rq->on_list = 1;
>
> @@ -313,14 +313,14 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> * the list, this means to put the child at the tail
> * of the list that starts by parent.
> */
> - list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
> - &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
> + list_add_tail_rcu(&cfs_rq->load_cfs_rq_list,
> + &(cfs_rq->tg->parent->cfs_rq[cpu]->load_cfs_rq_list));
> /*
> * The branch is now connected to its tree so we can
> * reset tmp_alone_branch to the beginning of the
> * list.
> */
> - rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
> + rq->tmp_alone_branch = &rq->load_cfs_rq_list;
> return true;
> }
>
> @@ -329,13 +329,13 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> * cfs rq without parent should be put
> * at the tail of the list.
> */
> - list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
> - &rq->leaf_cfs_rq_list);
> + list_add_tail_rcu(&cfs_rq->load_cfs_rq_list,
> + &rq->load_cfs_rq_list);
> /*
> * We have reach the top of a tree so we can reset
> * tmp_alone_branch to the beginning of the list.
> */
> - rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
> + rq->tmp_alone_branch = &rq->load_cfs_rq_list;
> return true;
> }
>
> @@ -345,44 +345,44 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> * tmp_alone_branch points to the begin of the branch
> * where we will add parent.
> */
> - list_add_rcu(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch);
> + list_add_rcu(&cfs_rq->load_cfs_rq_list, rq->tmp_alone_branch);
> /*
> * update tmp_alone_branch to points to the new begin
> * of the branch
> */
> - rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
> + rq->tmp_alone_branch = &cfs_rq->load_cfs_rq_list;
> return false;
> }
>
> -static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +static inline void list_del_load_cfs_rq(struct cfs_rq *cfs_rq)
> {
> if (cfs_rq->on_list) {
> struct rq *rq = rq_of(cfs_rq);
>
> /*
> * With cfs_rq being unthrottled/throttled during an enqueue,
> - * it can happen the tmp_alone_branch points the a leaf that
> + * it can happen the tmp_alone_branch points the cfs_rq that
> * we finally want to del. In this case, tmp_alone_branch moves
> - * to the prev element but it will point to rq->leaf_cfs_rq_list
> + * to the prev element but it will point to rq->load_cfs_rq_list
> * at the end of the enqueue.
> */
> - if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> - rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> + if (rq->tmp_alone_branch == &cfs_rq->load_cfs_rq_list)
> + rq->tmp_alone_branch = cfs_rq->load_cfs_rq_list.prev;
>
> - list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> + list_del_rcu(&cfs_rq->load_cfs_rq_list);
> cfs_rq->on_list = 0;
> }
> }
>
> -static inline void assert_list_leaf_cfs_rq(struct rq *rq)
> +static inline void assert_list_load_cfs_rq(struct rq *rq)
> {
> - SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
> + SCHED_WARN_ON(rq->tmp_alone_branch != &rq->load_cfs_rq_list);
> }
>
> -/* Iterate thr' all leaf cfs_rq's on a runqueue */
> -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
> - list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
> - leaf_cfs_rq_list)
> +/* Iterate thr' all loaded cfs_rq's on a runqueue */
> +#define for_each_load_cfs_rq_safe(rq, cfs_rq, pos) \
> + list_for_each_entry_safe(cfs_rq, pos, &rq->load_cfs_rq_list, \
> + load_cfs_rq_list)
>
> /* Do the two (enqueued) entities belong to the same group ? */
> static inline struct cfs_rq *
> @@ -442,20 +442,20 @@ static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
> strlcpy(path, "(null)", len);
> }
>
> -static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +static inline bool list_add_load_cfs_rq(struct cfs_rq *cfs_rq)
> {
> return true;
> }
>
> -static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +static inline void list_del_load_cfs_rq(struct cfs_rq *cfs_rq)
> {
> }
>
> -static inline void assert_list_leaf_cfs_rq(struct rq *rq)
> +static inline void assert_list_load_cfs_rq(struct rq *rq)
> {
> }
>
> -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
> +#define for_each_load_cfs_rq_safe(rq, cfs_rq, pos) \
> for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
>
> static inline struct sched_entity *parent_entity(struct sched_entity *se)
> @@ -3257,12 +3257,12 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> #ifdef CONFIG_SMP
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /*
> - * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
> + * Because list_add_load_cfs_rq always places a child cfs_rq on the list
> * immediately before a parent cfs_rq, and cfs_rqs are removed from the list
> * bottom-up, we only have to test whether the cfs_rq before us on the list
> * is our child.
> * If cfs_rq is not on the list, test whether a child needs its to be added to
> - * connect a branch to the tree * (see list_add_leaf_cfs_rq() for details).
> + * connect a branch to the tree (see list_add_load_cfs_rq() for details).
> */
> static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
> {
> @@ -3270,14 +3270,14 @@ static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
> struct list_head *prev;
>
> if (cfs_rq->on_list) {
> - prev = cfs_rq->leaf_cfs_rq_list.prev;
> + prev = cfs_rq->load_cfs_rq_list.prev;
> } else {
> struct rq *rq = rq_of(cfs_rq);
>
> prev = rq->tmp_alone_branch;
> }
>
> - prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
> + prev_cfs_rq = container_of(prev, struct cfs_rq, load_cfs_rq_list);
>
> return (prev_cfs_rq->tg->parent == cfs_rq->tg);
> }
> @@ -4298,7 +4298,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> * add it unconditionally.
> */
> if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
> - list_add_leaf_cfs_rq(cfs_rq);
> + list_add_load_cfs_rq(cfs_rq);
>
> if (cfs_rq->nr_running == 1)
> check_enqueue_throttle(cfs_rq);
> @@ -4775,7 +4775,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>
> /* Add cfs_rq with load or one or more already running entities to the list */
> if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> - list_add_leaf_cfs_rq(cfs_rq);
> + list_add_load_cfs_rq(cfs_rq);
> }
>
> return 0;
> @@ -4789,7 +4789,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
> /* group is entering throttled state, stop time */
> if (!cfs_rq->throttle_count) {
> cfs_rq->throttled_clock_task = rq_clock_task(rq);
> - list_del_leaf_cfs_rq(cfs_rq);
> + list_del_load_cfs_rq(cfs_rq);
> }
> cfs_rq->throttle_count++;
>
> @@ -4900,10 +4900,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> /* Nothing to run but something to decay? Complete the branch */
> if (cfs_rq->on_list)
> for_each_sched_entity(se) {
> - if (list_add_leaf_cfs_rq(group_cfs_rq(se)))
> + if (list_add_load_cfs_rq(group_cfs_rq(se)))
> break;
> }
> - assert_list_leaf_cfs_rq(rq);
> + assert_list_load_cfs_rq(rq);
> return;
> }
>
> @@ -4939,10 +4939,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>
> /*
> * One parent has been throttled and cfs_rq removed from the
> - * list. Add it back to not break the leaf list.
> + * list. Add it back to not break the load list.
> */
> if (throttled_hierarchy(cfs_rq))
> - list_add_leaf_cfs_rq(cfs_rq);
> + list_add_load_cfs_rq(cfs_rq);
> }
>
> /* At this point se is NULL and we are at root level*/
> @@ -4951,17 +4951,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> unthrottle_throttle:
> /*
> * The cfs_rq_throttled() breaks in the above iteration can result in
> - * incomplete leaf list maintenance, resulting in triggering the
> + * incomplete load list maintenance, resulting in triggering the
> * assertion below.
> */
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
>
> - if (list_add_leaf_cfs_rq(cfs_rq))
> + if (list_add_load_cfs_rq(cfs_rq))
> break;
> }
>
> - assert_list_leaf_cfs_rq(rq);
> + assert_list_load_cfs_rq(rq);
>
> /* Determine whether we need to wake up potentially idle CPU: */
> if (rq->curr == rq->idle && rq->cfs.nr_running)
> @@ -5601,12 +5601,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (cfs_rq_throttled(cfs_rq))
> goto enqueue_throttle;
>
> - /*
> - * One parent has been throttled and cfs_rq removed from the
> - * list. Add it back to not break the leaf list.
> - */
> - if (throttled_hierarchy(cfs_rq))
> - list_add_leaf_cfs_rq(cfs_rq);
> + /*
> + * One parent has been throttled and cfs_rq removed from the
> + * list. Add it back to not break the load list.
> + */
> + if (throttled_hierarchy(cfs_rq))
> + list_add_load_cfs_rq(cfs_rq);
> }
>
> /* At this point se is NULL and we are at root level*/
> @@ -5634,18 +5634,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /*
> * When bandwidth control is enabled; the cfs_rq_throttled()
> * breaks in the above iteration can result in incomplete
> - * leaf list maintenance, resulting in triggering the assertion
> + * load list maintenance, resulting in triggering the assertion
> * below.
> */
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
>
> - if (list_add_leaf_cfs_rq(cfs_rq))
> + if (list_add_load_cfs_rq(cfs_rq))
> break;
> }
> }
>
> - assert_list_leaf_cfs_rq(rq);
> + assert_list_load_cfs_rq(rq);
>
> hrtick_update(rq);
> }
> @@ -8122,9 +8122,9 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
>
> /*
> * Iterates the task_group tree in a bottom up fashion, see
> - * list_add_leaf_cfs_rq() for details.
> + * list_add_load_cfs_rq() for details.
> */
> - for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> + for_each_load_cfs_rq_safe(rq, cfs_rq, pos) {
> struct sched_entity *se;
>
> if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> @@ -8144,7 +8144,7 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
> * decayed cfs_rqs linger on the list.
> */
> if (cfs_rq_is_decayed(cfs_rq))
> - list_del_leaf_cfs_rq(cfs_rq);
> + list_del_load_cfs_rq(cfs_rq);
>
> /* Don't need periodic decay once load/util_avg are null */
> if (cfs_rq_has_blocked(cfs_rq))
> @@ -11111,7 +11111,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> {
> struct cfs_rq *cfs_rq;
>
> - list_add_leaf_cfs_rq(cfs_rq_of(se));
> + list_add_load_cfs_rq(cfs_rq_of(se));
>
> /* Start to propagate at parent */
> se = se->parent;
> @@ -11121,11 +11121,11 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>
> if (!cfs_rq_throttled(cfs_rq)){
> update_load_avg(cfs_rq, se, UPDATE_TG);
> - list_add_leaf_cfs_rq(cfs_rq);
> + list_add_load_cfs_rq(cfs_rq);
> continue;
> }
>
> - if (list_add_leaf_cfs_rq(cfs_rq))
> + if (list_add_load_cfs_rq(cfs_rq))
> break;
> }
> }
> @@ -11383,7 +11383,7 @@ void unregister_fair_sched_group(struct task_group *tg)
> rq = cpu_rq(cpu);
>
> raw_spin_rq_lock_irqsave(rq, flags);
> - list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
> + list_del_load_cfs_rq(tg->cfs_rq[cpu]);
> raw_spin_rq_unlock_irqrestore(rq, flags);
> }
> }
> @@ -11543,7 +11543,7 @@ void print_cfs_stats(struct seq_file *m, int cpu)
> struct cfs_rq *cfs_rq, *pos;
>
> rcu_read_lock();
> - for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
> + for_each_load_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
> print_cfs_rq(m, cpu, cfs_rq);
> rcu_read_unlock();
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 219ee463fe64..dc9382295ec9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -587,16 +587,8 @@ struct cfs_rq {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */
>
> - /*
> - * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
> - * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
> - * (like users, containers etc.)
> - *
> - * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.
> - * This list is used during load balance.
> - */
> int on_list;
> - struct list_head leaf_cfs_rq_list;
> + struct list_head load_cfs_rq_list;
> struct task_group *tg; /* group that "owns" this runqueue */
>
> #ifdef CONFIG_CFS_BANDWIDTH
> @@ -950,8 +942,11 @@ struct rq {
> struct dl_rq dl;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> - /* list of leaf cfs_rq on this CPU: */
> - struct list_head leaf_cfs_rq_list;
> + /* Bottom up ordered list of cfs_rqs with load (see
> + * cfs_rq_is_decayed()) on this CPU.
> + * This list is used during load balance.
> + */

Fully agree that a rename of some sort is good, as I struggled to understand it
in the beginning as well. However, I am not sure "load_list" is the best name
(it is definitely better, though). A cfs_rq will be in the list even if it has
no load of its own, as long as some of its descendants have. Also, "load->weight"
is not _really_ load, but that all depends on definition, I guess.

I have no new suggestions right now, other than maybe "active" (but I am not
100% sure), so "load list" might be a good one after all.
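
To make the criterion explicit, here is roughly the per-cfs_rq condition I read
from the series for when a cfs_rq wants to be on the list (the helper name below
is made up for illustration; the condition is the one tg_unthrottle_up() uses,
and ancestors are still added on top of it to keep the branch complete):

/*
 * Illustration only, not part of the patch: a cfs_rq belongs on the list
 * while it still has PELT sums to decay or runnable entities, regardless
 * of its own load->weight.
 */
static inline bool wants_load_cfs_rq_list(struct cfs_rq *cfs_rq)
{
        return !cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running;
}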

> + struct list_head load_cfs_rq_list;
> struct list_head *tmp_alone_branch;
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> --
> 2.32.0
>

Thanks
Odin

2021-09-21 20:31:19

by Odin Ugedal

[permalink] [raw]
Subject: Re: [RFC PATCH v2 4/5] sched/fair: Simplify load_cfs_rq_list maintenance

Hi,

With the changes in PATCH 1 of this series [0], I think this logic needs to be
updated as well (or correct me if I am wrong). I did a quick look now, and it
looks like there are some conflicts with Linus' tree when applying the series
as well, but I didn't look that deep into what caused them (for reference, I
tested on v5.15-rc2).

I am not sure how you want to structure this patch series, as all the patches
kinda depend on each other, since you sent the updated one separately (and I am
fairly new to kernel development, so I have no idea), while patch 1 is fixing a
"real" issue that we probably want to get fixed as soon as possible.


[0]: https://lore.kernel.org/lkml/[email protected]/

Thanks
Odin

2021-09-22 02:00:12

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC PATCH v2 4/5] sched/fair: Simplify load_cfs_rq_list maintenance

On Tue, Sep 21, 2021 at 08:21:50PM +0100, Odin Ugedal <[email protected]> wrote:
> With the changes in PATCH 1 of this series [0],

That patch is an independent fix of the described bug.

> I think this logic needs to be updated as well (or correct me if I am
> wrong).

I might be dense, but what update are you referring to? (The adding to the
list in tg_unthrottle_up() is already guarded by cfs_rq->throttle_count.)
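
For illustration, the guard I mean is the throttle_count one in
tg_unthrottle_up(). A simplified sketch (with the rename from this series
applied and the clock accounting elided):

static int tg_unthrottle_up(struct task_group *tg, void *data)
{
        struct rq *rq = data;
        struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

        cfs_rq->throttle_count--;
        if (!cfs_rq->throttle_count) {
                /* ... account the throttled_clock_task time here ... */

                /* Add cfs_rq with load or one or more already running entities to the list */
                if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
                        list_add_load_cfs_rq(cfs_rq);
        }

        return 0;
}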

> Did a quick look now, and it looks like there are some conflicts with
> the linus' tree when applying the series as well, but didn't look that
> deep into what caused it (for ref I tested on v5.15-rc2).

This v2 was based on v5.14-rc6 back then, and yes, when I was rebasing
(locally) I had to resolve conflicts because of the SCHED_IDLE support for
cgroups.

> Not sure how you want to structure this patch series as all the
> patches kinda depend on each other, since you sent the updated one
> separately (and I am fairly new to kernel development, so I have no
> idea), while patch 1 is fixing a "real" issue that we probably want to
> get fixed as soon as possible.

I put the patch first in the series to be backport friendly, but then I
decided to send a v3 of just that one patch, exactly so that the fix can land
earlier.
I may only get down to addressing the feedback on the remaining patches
(v2, 2--5) later. Thanks for your comments and patience :-)

Michal