2023-11-30 16:13:20

by Valentin Schneider

Subject: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

As reported in [1], CFS bandwidth throttling is a source of headaches in
PREEMPT_RT - generally speaking, a throttled CFS task can hold locks that
prevent ksoftirqd from running, which prevents replenishing & unthrottling
the cfs_rq of said CFS task.

Peter mentioned that there have been discussions on changing /when/ the
throttling happens: rather than have it be done immediately upon updating
the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
for the task to be about to return to userspace.

I'm not aware of the arguments in favour of this for !PREEMPT_RT, but given [1]
I had a look into it. Using this approach means no task can be throttled while
holding a kernel lock, which solves the locking dependency issue.

Concept & headaches
===================

The original idea was:
o cfs_rq runs out of runtime, flag it
o Take the tasks out of the cfs_rq one by one as they exit the kernel
o Dequeue the cfs_rq once it is empty of tasks (as if doing "normal" throttling)
o Put the tasks back onto it
o At unthrottle, simply re-enqueue the cfs_rq sched_entity and all is well
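
To make that sequence concrete, here is a toy userspace model of it (everything
below is made up purely for illustration; none of these structures or names are
kernel code):

	#include <stdbool.h>
	#include <stdio.h>

	/* Toy stand-ins, not the real kernel structures. */
	struct toy_task {
		const char *name;
		bool on_rq;
	};

	struct toy_cfs_rq {
		bool quota_depleted;	/* flagged when ->runtime runs out      */
		bool dequeued;		/* the "real" throttle: se off the tree */
		struct toy_task *tasks[2];
		int nr_running;
	};

	/* A task is about to return to userspace. */
	static void exit_to_user(struct toy_cfs_rq *rq, struct toy_task *p)
	{
		if (!rq->quota_depleted || !p->on_rq)
			return;
		p->on_rq = false;
		rq->nr_running--;
		if (!rq->nr_running)
			rq->dequeued = true;	/* throttle once empty */
	}

	/* Replenish: put everything back. */
	static void unthrottle(struct toy_cfs_rq *rq)
	{
		for (int i = 0; i < 2; i++) {
			rq->tasks[i]->on_rq = true;
			rq->nr_running++;
		}
		rq->quota_depleted = false;
		rq->dequeued = false;
	}

	int main(void)
	{
		struct toy_task a = { "a", true }, b = { "b", true };
		struct toy_cfs_rq rq = { .quota_depleted = true,
					 .tasks = { &a, &b }, .nr_running = 2 };

		exit_to_user(&rq, &a);
		exit_to_user(&rq, &b);
		printf("dequeued once empty: %d\n", rq.dequeued);
		unthrottle(&rq);
		printf("runnable again: %d tasks\n", rq.nr_running);
		return 0;
	}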

Cgroup migration kinda breaks all of that. Consider a task moving to a
now-fully-throttled cfs_rq: if we just enqueue it onto the throttled se, then
it's the same as throttling it, and if it is holding a problematic lock then
we're back to square one.
This means we can never fully throttle a cfs_rq, which makes me not so
confident in this approach.

Should/can we consider deferring such migrations to the unthrottle (with
something like what affine_move_task() does)? I think failing the migration
is not an option due to e.g. autogroups, and IIUC returning from a write to
cgroup/tasks means the migration has happened (as with updating a task's
affinity), so this would need to be a blocking wait.

CPU migrations within the same taskgroup are similarly
problematic (e.g. moving from a cfs_rq that still has a bit of ->runtime
left to one that has run out).

Throttle clocks are also kinda busted in this approach, but they aren't my
greatest concern at this point :-)

Implementation
==============

cfs_rq's that deplete their ->runtime get flagged for later throttling via
->in_throttle_limbo, and this is propagated down the hierarchy.

Tasks picked for execution out of a cfs_rq with ->in_throttle_limbo set
get a task_work added, which lets the actual throttling happen in
exit_to_user_mode().

Upon unthrottling, tasks are enqueued back onto their respective
cfs_rq. Unlike the previous throttling implementation, cfs_rq's can be
unthrottled while in a half-throttled state (i.e. some tasks have been
removed from them, while others are still enqueued and runnable), so the
unthrottling process is a bit more involved.
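
For reference, the task_work plumbing this relies on boils down to the below
(condensed from the diff that follows; not a standalone snippet, just the shape
of the mechanism):

	/* Runs in task context on the way back to userspace (TWA_RESUME). */
	static void throttle_cfs_rq_work(struct callback_head *work)
	{
		struct task_struct *p = container_of(work, struct task_struct,
						     sched_throttle_work);
		/* ... move p onto its cfs_rq's throttled_limbo_list ... */
	}

	void init_cfs_throttle_work(struct task_struct *p)
	{
		/* ->next pointing at itself == "not queued", see task_has_throttle_work() */
		p->sched_throttle_work.next = &p->sched_throttle_work;
		init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
	}

	/* in pick_next_task_fair(), when picking out of an in_throttle_limbo cfs_rq: */
	if (!task_has_throttle_work(p))
		task_work_add(p, &p->sched_throttle_work, TWA_RESUME);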

[1]: https://lore.kernel.org/all/[email protected]/
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched/core.c | 6 +-
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 257 +++++++++++++++++++++++++++++++-----------
kernel/sched/sched.h | 7 ++
5 files changed, 206 insertions(+), 70 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e4235bbfad776..20194e4c8b0b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -805,6 +805,8 @@ struct task_struct {

#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
+ struct callback_head sched_throttle_work;
+ struct list_head throttle_node;
#endif

#ifdef CONFIG_UCLAMP_TASK
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7a0c16115b79f..a63152072554b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4504,6 +4504,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.cfs_rq = NULL;
#endif

+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_throttle_work(p);
+#endif
+
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->stats, 0, sizeof(p->stats));
@@ -10823,7 +10827,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
cfs_rq->runtime_enabled = runtime_enabled;
cfs_rq->runtime_remaining = 0;

- if (cfs_rq->throttled)
+ if (cfs_rq->in_throttle_limbo)
unthrottle_cfs_rq(cfs_rq);
}

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4580a450700ec..8867904cf7eee 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -695,8 +695,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#endif
#endif
#ifdef CONFIG_CFS_BANDWIDTH
- SEQ_printf(m, " .%-30s: %d\n", "throttled",
- cfs_rq->throttled);
+ SEQ_printf(m, " .%-30s: %d\n", "in_throttle_limbo",
+ cfs_rq->in_throttle_limbo);
SEQ_printf(m, " .%-30s: %d\n", "throttle_count",
cfs_rq->throttle_count);
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8767988242ee3..27aee13e7ccd9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5497,7 +5497,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
if (likely(cfs_rq->runtime_remaining > 0))
return;

- if (cfs_rq->throttled)
+ if (cfs_rq->throttled || cfs_rq->in_throttle_limbo)
return;
/*
* if we're unable to extend our runtime we resched so that the active
@@ -5548,7 +5548,61 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct sched_entity *se = tg->se[cpu_of(rq)];
+ struct cfs_rq *pcfs_rq = cfs_rq_of(se);
+ long task_delta = 0, idle_task_delta = 0;
+ struct task_struct *p, *tmp;

+ /*
+ * Re-enqueue the tasks that have been throttled at this level.
+ *
+ * The task count is up-propagated via ->unthrottled_*h_nr_running,
+ * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
+ * might happen when a cfs_rq still has some tasks enqueued, either still
+ * making their way to userspace, or freshly migrated to it.
+ */
+ list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+ struct sched_entity *pse = &p->se;
+
+ list_del_init(&p->throttle_node);
+ enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
+ task_delta++;
+ idle_task_delta += task_has_idle_policy(p);
+ }
+
+ /*
+ * Account tasks woken up in children; by this point all direct children
+ * have been visited.
+ */
+ task_delta += cfs_rq->unthrottled_h_nr_running;
+ idle_task_delta += cfs_rq->unthrottled_idle_h_nr_running;
+
+ /*
+ * unthrottle_cfs_rq() needs a value to up-propagate above the
+ * freshly unthrottled cfs_rq.
+ */
+ cfs_rq->unthrottled_h_nr_running = task_delta;
+ cfs_rq->unthrottled_idle_h_nr_running = idle_task_delta;
+
+ cfs_rq->h_nr_running += task_delta;
+ cfs_rq->idle_h_nr_running += idle_task_delta;
+
+ /* Accumulate the delta in the parent's stash. Once all its children
+ * (i.e. all of this cfs_rq's siblings) have been visited, this value
+ * will be stable and used for its own count update.
+ */
+ pcfs_rq->unthrottled_h_nr_running += task_delta;
+ pcfs_rq->unthrottled_idle_h_nr_running += idle_task_delta;
+
+ /*
+ * If the cfs_rq became empty during throttling, then we dequeued
+ * it. It needs to be put back in the hierarchy if it or any of
+ * its children have now-unthrottled tasks.
+ */
+ if (!se->on_rq && (task_delta || idle_task_delta))
+ enqueue_entity(pcfs_rq, se, ENQUEUE_WAKEUP);
+
+ cfs_rq->in_throttle_limbo = false;
cfs_rq->throttle_count--;
if (!cfs_rq->throttle_count) {
cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
@@ -5573,6 +5627,17 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
return 0;
}

+static int tg_unthrottle_clear_up(struct task_group *tg, void *data)
+{
+ struct rq *rq = data;
+ struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+ cfs_rq->unthrottled_h_nr_running = 0;
+ cfs_rq->unthrottled_idle_h_nr_running = 0;
+
+ return 0;
+}
+
static int tg_throttle_down(struct task_group *tg, void *data)
{
struct rq *rq = data;
@@ -5588,61 +5653,59 @@ static int tg_throttle_down(struct task_group *tg, void *data)
cfs_rq->throttled_clock_self = rq_clock(rq);
}
cfs_rq->throttle_count++;
+ cfs_rq->in_throttle_limbo = true;

return 0;
}

-static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
+static void throttle_cfs_rq_work(struct callback_head *work)
{
- struct rq *rq = rq_of(cfs_rq);
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+ struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
struct sched_entity *se;
- long task_delta, idle_task_delta, dequeue = 1;
+ struct rq *rq;
+ struct cfs_rq *qcfs_rq, *top_cfs_rq = NULL;
+ long task_delta, idle_task_delta;

- raw_spin_lock(&cfs_b->lock);
- /* This will start the period timer if necessary */
- if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)) {
- /*
- * We have raced with bandwidth becoming available, and if we
- * actually throttled the timer might not unthrottle us for an
- * entire period. We additionally needed to make sure that any
- * subsequent check_cfs_rq_runtime calls agree not to throttle
- * us, as we may commit to do cfs put_prev+pick_next, so we ask
- * for 1ns of runtime rather than just check cfs_b.
- */
- dequeue = 0;
- } else {
- list_add_tail_rcu(&cfs_rq->throttled_list,
- &cfs_b->throttled_cfs_rq);
- }
- raw_spin_unlock(&cfs_b->lock);
+ WARN_ON_ONCE(p != current);

- if (!dequeue)
- return false; /* Throttle no longer required. */
+ /*
+ * If task is exiting, then there won't be a return to userspace, so we
+ * don't have to bother with any of this.
+ */
+ if ((p->flags & PF_EXITING))
+ return;

- se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+ CLASS(task_rq_lock, rq_guard)(p);
+ rq = rq_guard.rq;
+ se = &p->se;
+ qcfs_rq = cfs_rq_of(se);

- /* freeze hierarchy runnable averages while throttled */
- rcu_read_lock();
- walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
- rcu_read_unlock();
+ /*
+ * If not in limbo, then either replenish has happened or this task got
+ * migrated out of the throttled cfs_rq, move along
+ */
+ if (!qcfs_rq->in_throttle_limbo)
+ goto done;
+
+ update_rq_clock(rq);
+
+ list_add(&p->throttle_node, &qcfs_rq->throttled_limbo_list);
+ task_delta = 1;
+ idle_task_delta = cfs_rq_is_idle(qcfs_rq) ? 1 : 0;

- task_delta = cfs_rq->h_nr_running;
- idle_task_delta = cfs_rq->idle_h_nr_running;
for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- /* throttled entity or throttle-on-deactivate */
+ qcfs_rq = cfs_rq_of(se);
+
if (!se->on_rq)
goto done;

dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_task_delta = cfs_rq->h_nr_running;
-
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;

+ if (qcfs_rq->in_throttle_limbo)
+ top_cfs_rq = qcfs_rq;
+
if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
se = parent_entity(se);
@@ -5651,34 +5714,84 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
}

for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+ qcfs_rq = cfs_rq_of(se);
/* throttled entity or throttle-on-deactivate */
if (!se->on_rq)
- goto done;
+ goto throttle_done;

update_load_avg(qcfs_rq, se, 0);
se_update_runnable(se);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_task_delta = cfs_rq->h_nr_running;
-
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
}

+throttle_done:
/* At this point se is NULL and we are at root level*/
- sub_nr_running(rq, task_delta);
-
+ sub_nr_running(rq, 1);
done:
+ p->sched_throttle_work.next = &p->sched_throttle_work;
/*
- * Note: distribution will already see us throttled via the
- * throttled-list. rq->lock protects completion.
+ * XXX: this assumes the highest in_throttle_limbo cfs_rq in the
+ * hierarchy is the one which got throttled, which might not always
+ * be the case.
*/
- cfs_rq->throttled = 1;
- SCHED_WARN_ON(cfs_rq->throttled_clock);
- if (cfs_rq->nr_running)
- cfs_rq->throttled_clock = rq_clock(rq);
- return true;
+ if (top_cfs_rq && !top_cfs_rq->throttled_clock)
+ top_cfs_rq->throttled_clock = rq_clock(rq);
+
+ resched_curr(rq);
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+ /* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+ p->sched_throttle_work.next = &p->sched_throttle_work;
+ init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+}
+
+static inline bool task_has_throttle_work(struct task_struct *p)
+{
+ return p->sched_throttle_work.next != &p->sched_throttle_work;
+}
+
+static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ struct rq *rq = rq_of(cfs_rq);
+ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+ long dequeue = 1;
+
+ if (cfs_rq->in_throttle_limbo)
+ return false;
+
+ raw_spin_lock(&cfs_b->lock);
+ /* This will start the period timer if necessary */
+ if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)) {
+ /*
+ * We have raced with bandwidth becoming available, and if we
+ * actually throttled the timer might not unthrottle us for an
+ * entire period. We additionally needed to make sure that any
+ * subsequent check_cfs_rq_runtime calls agree not to throttle
+ * us, as we may commit to do cfs put_prev+pick_next, so we ask
+ * for 1ns of runtime rather than just check cfs_b.
+ */
+ dequeue = 0;
+ } else {
+ list_add_tail_rcu(&cfs_rq->throttled_list,
+ &cfs_b->throttled_cfs_rq);
+ }
+
+ raw_spin_unlock(&cfs_b->lock);
+
+ if (!dequeue)
+ return false; /* Throttle no longer required. */
+
+ rcu_read_lock();
+ walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
+ rcu_read_unlock();
+
+ if (!task_has_throttle_work(current))
+ task_work_add(current, &current->sched_throttle_work, TWA_RESUME);
+
+ return false;
}

void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
@@ -5690,8 +5803,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

se = cfs_rq->tg->se[cpu_of(rq)];

- cfs_rq->throttled = 0;
-
update_rq_clock(rq);

raw_spin_lock(&cfs_b->lock);
@@ -5719,8 +5830,15 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
goto unthrottle_throttle;
}

- task_delta = cfs_rq->h_nr_running;
- idle_task_delta = cfs_rq->idle_h_nr_running;
+ /*
+ * cfs_rq's below us may not have been fully emptied out, so we can't rely
+ * directly on ->h_nr_running. Use the stash instead.
+ */
+ task_delta = cfs_rq->unthrottled_h_nr_running;
+ idle_task_delta = cfs_rq->unthrottled_idle_h_nr_running;
+
+ walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_clear_up, (void *)rq);
+
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);

@@ -5728,9 +5846,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
break;
enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);

- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_task_delta = cfs_rq->h_nr_running;
-
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;

@@ -5745,9 +5860,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
update_load_avg(qcfs_rq, se, UPDATE_TG);
se_update_runnable(se);

- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_task_delta = cfs_rq->h_nr_running;
-
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;

@@ -5797,7 +5909,7 @@ static void __cfsb_csd_unthrottle(void *arg)
throttled_csd_list) {
list_del_init(&cursor->throttled_csd_list);

- if (cfs_rq_throttled(cursor))
+ if (cfs_rq_throttled(cursor) || cursor->in_throttle_limbo)
unthrottle_cfs_rq(cursor);
}

@@ -5837,7 +5949,7 @@ static void unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
{
lockdep_assert_rq_held(rq_of(cfs_rq));

- if (SCHED_WARN_ON(!cfs_rq_throttled(cfs_rq) ||
+ if (SCHED_WARN_ON(!(cfs_rq_throttled(cfs_rq) || cfs_rq->in_throttle_limbo) ||
cfs_rq->runtime_remaining <= 0))
return;

@@ -5865,7 +5977,12 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
}

rq_lock_irqsave(rq, &rf);
- if (!cfs_rq_throttled(cfs_rq))
+ /*
+ * A cfs_rq can be in_throttle_limbo but not throttled if it is
+ * waiting for tasks to exit the kernel. In this case we still
+ * want to replenish.
+ */
+ if (!cfs_rq_throttled(cfs_rq) && !cfs_rq->in_throttle_limbo)
goto next;

/* Already queued for async unthrottle */
@@ -5914,7 +6031,7 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)

list_del_init(&cfs_rq->throttled_csd_list);

- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_throttled(cfs_rq) || cfs_rq->in_throttle_limbo)
unthrottle_cfs_rq(cfs_rq);

rq_unlock_irqrestore(rq, &rf);
@@ -6252,6 +6369,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_rq->runtime_enabled = 0;
INIT_LIST_HEAD(&cfs_rq->throttled_list);
INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+ INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
}

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
@@ -8274,6 +8392,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf

p = task_of(se);

+ if (unlikely(cfs_rq_of(se)->in_throttle_limbo && !task_has_throttle_work(p)))
+ task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
/*
* Since we haven't yet done put_prev_entity and if the selected task
* is a different task than we started out with, try and touch the
@@ -8314,6 +8434,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf

p = task_of(se);

+ if (unlikely(cfs_rq_of(se)->in_throttle_limbo && !task_has_throttle_work(p)))
+ task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
+
done: __maybe_unused;
#ifdef CONFIG_SMP
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a422..be29154d93898 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -633,8 +633,13 @@ struct cfs_rq {
u64 throttled_clock_self_time;
int throttled;
int throttle_count;
+ int in_throttle_limbo;
+ /* Temp storage for updating the counts during unthrottling */
+ unsigned int unthrottled_h_nr_running;
+ unsigned int unthrottled_idle_h_nr_running;
struct list_head throttled_list;
struct list_head throttled_csd_list;
+ struct list_head throttled_limbo_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
@@ -2416,6 +2421,8 @@ extern void init_sched_dl_class(void);
extern void init_sched_rt_class(void);
extern void init_sched_fair_class(void);

+extern void init_cfs_throttle_work(struct task_struct *p);
+
extern void reweight_task(struct task_struct *p, int prio);

extern void resched_curr(struct rq *rq);
--
2.41.0


2023-11-30 21:26:57

by Benjamin Segall

Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

Valentin Schneider <[email protected]> writes:

> As reported in [1], CFS bandwidth throttling is a source of headaches in
> PREEMPT_RT - generally speaking, a throttled CFS task can hold locks that
> prevent ksoftirqd from running, which prevents replenishing & unthrottling
> the cfs_rq of said CFS task.
>
> Peter mentioned that there have been discussions on changing /when/ the
> throttling happens: rather than have it be done immediately upon updating
> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
> for the task to be about to return to userspace.
>
> I'm not aware of the arguments in favour of this for !PREEMPT_RT, but given [1]
> I had a look into it. Using this approach means no task can be throttled while
> holding a kernel lock, which solves the locking dependency issue.

The alternative we've been experimenting with (and still running into
other issues that have made it hard to tell if they work) is to still
leave the tasks on their cfs_rqs, but instead have two task_timelines or
similar per cfs_rq, one of which only has unthrottleable tasks (or
partially-throttled child cgroups) on it. Then when picking into a
partially-unthrottled cfs_rq you only look at that alternate task_timeline.
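
Roughly, the pick would then look something like this (just a sketch, with a
plain list standing in for the second structure; the real thing would want a
proper rbtree and a pick_eevdf() equivalent on it):

	static struct sched_entity *
	pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
	{
		/*
		 * Past its quota but still has children running in the
		 * kernel: restrict the pick to the duplicate structure.
		 */
		if (throttled && !list_empty(&cfs_rq->kernel_children))
			return list_first_entry(&cfs_rq->kernel_children,
						struct sched_entity, kernel_node);

		/* otherwise, the usual EEVDF pick */
		return pick_eevdf(cfs_rq);
	}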

This means that we get to skip this per-actually-throttled-task loop:

> @@ -5548,7 +5548,61 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> {
> struct rq *rq = data;
> struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> + struct sched_entity *se = tg->se[cpu_of(rq)];
> + struct cfs_rq *pcfs_rq = cfs_rq_of(se);
> + long task_delta = 0, idle_task_delta = 0;
> + struct task_struct *p, *tmp;
>
> + /*
> + * Re-enqueue the tasks that have been throttled at this level.
> + *
> + * The task count is up-propagated via ->unthrottled_*h_nr_running,
> + * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
> + * might happen when a cfs_rq still has some tasks enqueued, either still
> + * making their way to userspace, or freshly migrated to it.
> + */
> + list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
> + struct sched_entity *pse = &p->se;
> +
> + list_del_init(&p->throttle_node);
> + enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
> + task_delta++;
> + idle_task_delta += task_has_idle_policy(p);
> + }

The downsides are that you instead have extra operations per
enqueue/dequeue/pick (but just an extra list/rbtree operation or check),
and that it doesn't do *any* accounting change for a partially dequeued
cfs_rq.

I'm going to try putting together a cleaner variant of our version that
works via task_work instead of bracketing every relevant entry point.
(That design came from when we were trying instead to only do it for
tasks holding actual locks)

2023-12-01 13:30:53

by Valentin Schneider

Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

On 30/11/23 13:26, Benjamin Segall wrote:
> Valentin Schneider <[email protected]> writes:
>
>> As reported in [1], CFS bandwidth throttling is a source of headaches in
>> PREEMPT_RT - generally speaking, a throttled CFS task can hold locks that
>> prevent ksoftirqd from running, which prevents replenishing & unthrottling
>> the cfs_rq of said CFS task.
>>
>> Peter mentioned that there have been discussions on changing /when/ the
>> throttling happens: rather than have it be done immediately upon updating
>> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
>> for the task to be about to return to userspace.
>>
>> I'm not aware of the arguments in favour of this for !PREEMPT_RT, but given [1]
>> I had a look into it. Using this approach means no task can be throttled while
>> holding a kernel lock, which solves the locking dependency issue.
>
> The alternative we've been experimenting with (and still running into
> other issues that have made it hard to tell if they work) is to still
> leave the tasks on their cfs_rqs, but instead have two task_timelines or
> similar per cfs_rq, one of which only has unthrottleable tasks (or
> partially-throttled child cgroups) on it. Then when picking into a
> partially-unthrottled cfs_rq you only look at that alternate task_timeline.
>

IIUC then you don't dequeue the cfs_rq's se, but instead rely on the RB
tree switch to only have unthrottable tasks running.

> This means that we get to skip this per-actually-throttled-task loop:
>
>> @@ -5548,7 +5548,61 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>> {
>> struct rq *rq = data;
>> struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>> + struct sched_entity *se = tg->se[cpu_of(rq)];
>> + struct cfs_rq *pcfs_rq = cfs_rq_of(se);
>> + long task_delta = 0, idle_task_delta = 0;
>> + struct task_struct *p, *tmp;
>>
>> + /*
>> + * Re-enqueue the tasks that have been throttled at this level.
>> + *
>> + * The task count is up-propagated via ->unthrottled_*h_nr_running,
>> + * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
>> + * might happen when a cfs_rq still has some tasks enqueued, either still
>> + * making their way to userspace, or freshly migrated to it.
>> + */
>> + list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>> + struct sched_entity *pse = &p->se;
>> +
>> + list_del_init(&p->throttle_node);
>> + enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
>> + task_delta++;
>> + idle_task_delta += task_has_idle_policy(p);
>> + }
>
> The downsides are that you instead have extra operations per
> enqueue/dequeue/pick (but just an extra list/rbtree operation or check),
> and that it doesn't do *any* accounting change for a partially dequeued
> cfs_rq.
>

I would assume we want to keep the *nr_running as close to reality as
possible, given their impact on pick_next_task_fair() & load_balance().

> I'm going to try putting together a cleaner variant of our version that
> works via task_work instead of bracketing every relevant entry point.
> (That design came from when we were trying instead to only do it for
> tasks holding actual locks)

Interesting, thank you for sharing! I assume then the motivation for this
is to reduce latencies caused by throttling lock holders?

2023-12-01 22:11:36

by Benjamin Segall

Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

Valentin Schneider <[email protected]> writes:

> On 30/11/23 13:26, Benjamin Segall wrote:
>> Valentin Schneider <[email protected]> writes:
>>
>> The alternative we've been experimenting with (and still running into
>> other issues that have made it hard to tell if they work) is to still
>> leave the tasks on their cfs_rqs, but instead have two task_timelines or
>> similar per cfs_rq, one of which only has unthrottleable tasks (or
>> partially-throttled child cgroups) on it. Then when picking into a
>> partially-unthrottled cfs_rq you only look at that alternate task_timeline.
>>
>
> IIUC then you don't dequeue the cfs_rq's se, but instead rely on the RB
> tree switch to only have unthrottable tasks running.

Correct.

>
>> This means that we get to skip this per-actually-throttled-task loop:
>>
>>> @@ -5548,7 +5548,61 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>>> {
>>> struct rq *rq = data;
>>> struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>>> + struct sched_entity *se = tg->se[cpu_of(rq)];
>>> + struct cfs_rq *pcfs_rq = cfs_rq_of(se);
>>> + long task_delta = 0, idle_task_delta = 0;
>>> + struct task_struct *p, *tmp;
>>>
>>> + /*
>>> + * Re-enqueue the tasks that have been throttled at this level.
>>> + *
>>> + * The task count is up-propagated via ->unthrottled_*h_nr_running,
>>> + * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
>>> + * might happen when a cfs_rq still has some tasks enqueued, either still
>>> + * making their way to userspace, or freshly migrated to it.
>>> + */
>>> + list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>>> + struct sched_entity *pse = &p->se;
>>> +
>>> + list_del_init(&p->throttle_node);
>>> + enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
>>> + task_delta++;
>>> + idle_task_delta += task_has_idle_policy(p);
>>> + }
>>
>> The downsides are that you instead have extra operations per
>> enqueue/dequeue/pick (but just an extra list/rbtree operation or check),
>> and that it doesn't do *any* accounting change for a partially dequeued
>> cfs_rq.
>>
>
> I would assume we want to keep the *nr_running as close to reality as
> possible, given their impact on pick_next_task_fair() & load_balance().
>

Yeah, and while it's maybe ok for the longer-period load_balance, it's
definitely sketchy for the shorter term things like select_idle_sibling.
In theory we could duplicate more and more of the accounting... (though
that also then becomes questionable amounts of increased overhead on
enqueue/dequeue).

>> I'm going to try putting together a cleaner variant of our version that
>> works via task_work instead of bracketing every relevant entry point.
>> (That design came from when we were trying instead to only do it for
>> tasks holding actual locks)
>
> Interesting, thank you for sharing! I assume then the motivation for this
> is to reduce latencies caused by throttling lock holders?

Yeah, and then we ran into things like percpu-rwsem where lock/unlock
accounting doesn't work as well, but kept the basic design.

2023-12-07 23:47:43

by Benjamin Segall

Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

Alright, this took longer than I intended, but here's the basic version
with the duplicate "runqueue" just being a list rather than actual EEVDF
or anything sensible.

I also wasn't actually able to adapt it to work via task_work like I
thought I could; flagging the entry to the relevant schedule() calls is
too critical.

--------

Subject: [RFC PATCH] sched/fair: only throttle CFS tasks in userspace

The basic idea of this implementation is to maintain duplicate runqueues
in each cfs_rq that contain duplicate pointers to sched_entitys which
should bypass throttling. Then we can skip throttling cfs_rqs that have
any such children, and when we pick inside any not-actually-throttled
cfs_rq, we only look at this duplicated list.

"Which tasks should bypass throttling" here is "all schedule() calls
that don't set a special flag", but could instead involve the lockdep
markers (except for the problem of percpu-rwsem and similar) or explicit
flags around syscalls and faults, or something else.

This approach avoids any O(tasks) loops, but leaves partially-throttled
cfs_rqs still contributing their full h_nr_running to their parents,
which might result in worse balancing. Also it adds more (generally
still small) overhead to the common enqueue/dequeue/pick paths.

The very basic debug test added is to run a cpusoaker and "cat
/sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
cgroup with a quota < 1 cpu.
---
include/linux/sched.h | 7 ++
kernel/entry/common.c | 2 +-
kernel/entry/kvm.c | 2 +-
kernel/sched/core.c | 20 ++++
kernel/sched/debug.c | 28 +++++
kernel/sched/fair.c | 232 ++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 3 +
7 files changed, 281 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d258162deb0a..3e9e0e308d66d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -301,10 +301,11 @@ extern long schedule_timeout(long timeout);
extern long schedule_timeout_interruptible(long timeout);
extern long schedule_timeout_killable(long timeout);
extern long schedule_timeout_uninterruptible(long timeout);
extern long schedule_timeout_idle(long timeout);
asmlinkage void schedule(void);
+asmlinkage void schedule_usermode(void);
extern void schedule_preempt_disabled(void);
asmlinkage void preempt_schedule_irq(void);
#ifdef CONFIG_PREEMPT_RT
extern void schedule_rtlock(void);
#endif
@@ -576,10 +577,13 @@ struct sched_entity {
struct cfs_rq *cfs_rq;
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
/* cached value of my_q->h_nr_running */
unsigned long runnable_weight;
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct list_head kernel_node;
+#endif
#endif

#ifdef CONFIG_SMP
/*
* Per entity load average tracking.
@@ -1562,10 +1566,13 @@ struct task_struct {

#ifdef CONFIG_USER_EVENTS
struct user_event_mm *user_event_mm;
#endif

+#ifdef CONFIG_CFS_BANDWIDTH
+ atomic_t in_return_to_user;
+#endif
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
*/
randomized_struct_fields_end
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d7ee4bc3f2ba3..008c4cf7c09a1 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -154,11 +154,11 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
while (ti_work & EXIT_TO_USER_MODE_WORK) {

local_irq_enable_exit_to_user(ti_work);

if (ti_work & _TIF_NEED_RESCHED)
- schedule();
+ schedule_usermode(); /* TODO: also all of the arch/* loops that don't use this yet */

if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);

if (ti_work & _TIF_PATCH_PENDING)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd1..fc4b73de07539 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -12,11 +12,11 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
kvm_handle_signal_exit(vcpu);
return -EINTR;
}

if (ti_work & _TIF_NEED_RESCHED)
- schedule();
+ schedule_usermode();

if (ti_work & _TIF_NOTIFY_RESUME)
resume_user_mode_work(NULL);

ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f0..a7c028fad5a89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4527,10 +4527,14 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ INIT_LIST_HEAD(&p->se.kernel_node);
+ atomic_set(&p->in_return_to_user, 0);
+#endif

#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->stats, 0, sizeof(p->stats));
#endif
@@ -6816,10 +6820,26 @@ asmlinkage __visible void __sched schedule(void)
__schedule_loop(SM_NONE);
sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);

+asmlinkage __visible void __sched schedule_usermode(void)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+ /*
+ * This is only atomic because of this simple implementation. We could
+ * do something with an SM_USER to avoid other-cpu scheduler operations
+ * racing against these writes.
+ */
+ atomic_set(&current->in_return_to_user, true);
+ schedule();
+ atomic_set(&current->in_return_to_user, false);
+#else
+ schedule();
+#endif
+}
+
/*
* synchronize_rcu_tasks() makes sure that no task is stuck in preempted
* state (have scheduled out non-voluntarily) by making sure that all
* tasks have either left the run queue or have gone into user space.
* As idle tasks do not do either, they must not ever be preempted
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 168eecc209b49..5d9d011a0e6fb 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -317,10 +317,36 @@ static const struct file_operations sched_verbose_fops = {
.write = sched_verbose_write,
.open = simple_open,
.llseek = default_llseek,
};

+static DEFINE_MUTEX(sched_debug_spin_mutex);
+static int sched_debug_spin_show(struct seq_file *m, void *v) {
+ int count;
+ mutex_lock(&sched_debug_spin_mutex);
+ for (count = 0; count < 1000; count++) {
+ u64 start2;
+ start2 = jiffies;
+ while (jiffies == start2)
+ cpu_relax();
+ schedule();
+ }
+ mutex_unlock(&sched_debug_spin_mutex);
+ return 0;
+}
+static int sched_debug_spin_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, sched_debug_spin_show, NULL);
+}
+
+static const struct file_operations sched_debug_spin_fops = {
+ .open = sched_debug_spin_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
static const struct seq_operations sched_debug_sops;

static int sched_debug_open(struct inode *inode, struct file *filp)
{
return seq_open(filp, &sched_debug_sops);
@@ -372,10 +398,12 @@ static __init int sched_init_debug(void)
debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
#endif

debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);

+ debugfs_create_file("sched_locked_spin", 0444, NULL, NULL,
+ &sched_debug_spin_fops);
return 0;
}
late_initcall(sched_init_debug);

#ifdef CONFIG_SMP
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d3..15e5e358c91e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -126,10 +126,11 @@ int __weak arch_asym_cpu_priority(int cpu)
* we will always only issue the remaining available time.
*
* (default: 5 msec, units: microseconds)
*/
static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+static unsigned int sysctl_sched_cfs_bandwidth_kernel_bypass = 1;
#endif

#ifdef CONFIG_NUMA_BALANCING
/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
@@ -144,10 +145,19 @@ static struct ctl_table sched_fair_sysctls[] = {
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ONE,
},
+ {
+ .procname = "sched_cfs_bandwidth_kernel_bypass",
+ .data = &sysctl_sched_cfs_bandwidth_kernel_bypass,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
#endif
#ifdef CONFIG_NUMA_BALANCING
{
.procname = "numa_balancing_promote_rate_limit_MBps",
.data = &sysctl_numa_balancing_promote_rate_limit,
@@ -5416,18 +5426,38 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

/*
* Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
+ * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel
+ * 2) keep things fair between processes/task groups
+ * 3) pick the "next" process, since someone really wants that to run
+ * 4) pick the "last" process, for cache locality
+ * 5) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq)
+pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
{
+#ifdef CONFIG_CFS_BANDWIDTH
+ /*
+ * TODO: This might trigger, I'm not sure/don't remember. Regardless,
+ * while we do not explicitly handle the case where h_kernel_running
+ * goes to 0, we will call account/check_cfs_rq_runtime at worst in
+ * entity_tick and notice that we can now properly do the full
+ * throttle_cfs_rq.
+ */
+ WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
+ if (throttled && !list_empty(&cfs_rq->kernel_children)) {
+ /*
+ * TODO: you'd want to factor out pick_eevdf to just take
+ * tasks_timeline, and replace this list with a second rbtree
+ * and a call to pick_eevdf.
+ */
+ return list_first_entry(&cfs_rq->kernel_children,
+ struct sched_entity, kernel_node);
+ }
+#endif
/*
* Enabling NEXT_BUDDY will affect latency but not fairness.
*/
if (sched_feat(NEXT_BUDDY) &&
cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
@@ -5622,12 +5652,18 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
if (cfs_rq->throttled)
return;
/*
* if we're unable to extend our runtime we resched so that the active
* hierarchy can be throttled
+ *
+ * Don't resched_curr() if curr is in the kernel. We won't throttle the
+ * cfs_rq if any task is in the kernel, and if curr in particular is we
+ * don't need to preempt it in favor of whatever other task is in the
+ * kernel.
*/
- if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+ if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr) &&
+ list_empty(&rq_of(cfs_rq)->curr->se.kernel_node))
resched_curr(rq_of(cfs_rq));
}

static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
@@ -5712,16 +5748,26 @@ static int tg_throttle_down(struct task_group *tg, void *data)
cfs_rq->throttle_count++;

return 0;
}

+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+
static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta, dequeue = 1;
+ long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
+
+ /*
+ * We don't actually throttle, though account() will have made sure to
+ * resched us so that we pick into a kernel task.
+ */
+ if (cfs_rq->h_kernel_running)
+ return false;

raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)) {
/*
@@ -5749,10 +5795,11 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
rcu_read_unlock();

task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ kernel_delta = cfs_rq->h_kernel_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
/* throttled entity or throttle-on-deactivate */
if (!se->on_rq)
goto done;
@@ -5762,10 +5809,11 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (cfs_rq_is_idle(group_cfs_rq(se)))
idle_task_delta = cfs_rq->h_nr_running;

qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ dequeue_kernel(qcfs_rq, se, kernel_delta);

if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
se = parent_entity(se);
break;
@@ -5784,10 +5832,11 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (cfs_rq_is_idle(group_cfs_rq(se)))
idle_task_delta = cfs_rq->h_nr_running;

qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ dequeue_kernel(qcfs_rq, se, kernel_delta);
}

/* At this point se is NULL and we are at root level*/
sub_nr_running(rq, task_delta);

@@ -5806,11 +5855,11 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta;
+ long task_delta, idle_task_delta, kernel_delta;

se = cfs_rq->tg->se[cpu_of(rq)];

cfs_rq->throttled = 0;

@@ -5841,10 +5890,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
goto unthrottle_throttle;
}

task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ kernel_delta = cfs_rq->h_kernel_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);

if (se->on_rq)
break;
@@ -5853,10 +5903,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
if (cfs_rq_is_idle(group_cfs_rq(se)))
idle_task_delta = cfs_rq->h_nr_running;

qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ enqueue_kernel(qcfs_rq, se, kernel_delta);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
goto unthrottle_throttle;
}
@@ -5870,10 +5921,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
if (cfs_rq_is_idle(group_cfs_rq(se)))
idle_task_delta = cfs_rq->h_nr_running;

qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ enqueue_kernel(qcfs_rq, se, kernel_delta);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
goto unthrottle_throttle;
}
@@ -6528,10 +6580,90 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
if (cfs_task_bw_constrained(p))
tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
}
#endif

+/*
+ * We keep track of all children that are runnable in the kernel with a count of
+ * all descendants. The state is checked on enqueue and put_prev (and hard
+ * cleared on dequeue), and is stored just as the filled/empty state of the
+ * kernel_node list entry.
+ *
+ * These are simple helpers that do both parts, and should be called bottom-up
+ * until hitting a throttled cfs_rq whenever a task changes state (or a cfs_rq
+ * is (un)throttled).
+ */
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+ if (count == 0)
+ return;
+
+ if (list_empty(&se->kernel_node))
+ list_add(&se->kernel_node, &cfs_rq->kernel_children);
+ cfs_rq->h_kernel_running += count;
+}
+
+static bool is_kernel_task(struct task_struct *p)
+{
+ return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_return_to_user);
+}
+
+/*
+ * When called on a task this always transitions it to a !kernel state.
+ *
+ * When called on a group it is just synchronizing the state with the new
+ * h_kernel_waiters, unless this it has been throttled and is !on_rq
+ */
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+ if (count == 0)
+ return;
+
+ if (!se->on_rq || entity_is_task(se) ||
+ !group_cfs_rq(se)->h_kernel_running)
+ list_del_init(&se->kernel_node);
+ cfs_rq->h_kernel_running -= count;
+}
+
+/*
+ * Returns if the cfs_rq "should" be throttled but might not be because of
+ * kernel threads bypassing throttle.
+ */
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+ if (!cfs_bandwidth_used())
+ return false;
+
+ if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+ return false;
+ return true;
+}
+
+static void unthrottle_on_enqueue(struct task_struct *p)
+{
+ struct sched_entity *se = &p->se;
+
+ if (!cfs_bandwidth_used() || !sysctl_sched_cfs_bandwidth_kernel_bypass)
+ return;
+ if (!cfs_rq_of(&p->se)->throttle_count)
+ return;
+
+ /*
+ * MAYBE TODO: doing it this simple way is O(throttle_count *
+ * cgroup_depth). We could optimize that into a single pass, but making
+ * a mostly-copy of unthrottle_cfs_rq that does that is a pain and easy
+ * to get wrong. (And even without unthrottle_on_enqueue it's O(nm),
+ * just not while holding rq->lock the whole time)
+ */
+
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ if (cfs_rq->throttled)
+ unthrottle_cfs_rq(cfs_rq);
+ }
+}
+
#else /* CONFIG_CFS_BANDWIDTH */

static inline bool cfs_bandwidth_used(void)
{
return false;
@@ -6575,10 +6707,20 @@ static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
bool cfs_task_bw_constrained(struct task_struct *p)
{
return false;
}
#endif
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static inline bool is_kernel_task(struct task_struct *p)
+{
+ return false;
+}
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+ return false;
+}
#endif /* CONFIG_CFS_BANDWIDTH */

#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p) {}
#endif
@@ -6678,10 +6820,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
+ bool kernel_task = is_kernel_task(p);

/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
* Let's add the task's estimated utilization to the cfs_rq's
@@ -6706,10 +6849,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;

if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ enqueue_kernel(cfs_rq, se, 1);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto enqueue_throttle;

@@ -6726,10 +6871,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;

if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ enqueue_kernel(cfs_rq, se, 1);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto enqueue_throttle;
}
@@ -6756,10 +6903,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

enqueue_throttle:
assert_list_leaf_cfs_rq(rq);

hrtick_update(rq);
+
+ if (kernel_task)
+ unthrottle_on_enqueue(p);
}

static void set_next_buddy(struct sched_entity *se);

/*
@@ -6772,10 +6922,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int task_sleep = flags & DEQUEUE_SLEEP;
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
+ bool kernel_task = !list_empty(&p->se.kernel_node);

util_est_dequeue(&rq->cfs, p);

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -6784,10 +6935,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running--;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;

if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ dequeue_kernel(cfs_rq, se, 1);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto dequeue_throttle;

@@ -6816,10 +6969,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running--;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;

if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ dequeue_kernel(cfs_rq, se, 1);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto dequeue_throttle;

@@ -8316,15 +8471,44 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int

preempt:
resched_curr(rq);
}

+static void handle_kernel_task_prev(struct task_struct *prev)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct sched_entity *se = &prev->se;
+ bool new_kernel = is_kernel_task(prev);
+ bool prev_kernel = !list_empty(&se->kernel_node);
+ /*
+ * These extra loops are bad and against the whole point of the merged
+ * PNT, but it's a pain to merge, particularly since we want it to occur
+ * before check_cfs_runtime.
+ */
+ if (prev_kernel && !new_kernel) {
+ WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
+ for_each_sched_entity(se) {
+ dequeue_kernel(cfs_rq_of(se), se, 1);
+ if (cfs_rq_throttled(cfs_rq_of(se)))
+ break;
+ }
+ } else if (!prev_kernel && new_kernel && se->on_rq) {
+ for_each_sched_entity(se) {
+ enqueue_kernel(cfs_rq_of(se), se, 1);
+ if (cfs_rq_throttled(cfs_rq_of(se)))
+ break;
+ }
+ }
+#endif
+}
+
#ifdef CONFIG_SMP
static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
+ bool throttled = false;

again:
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_running)
return NULL;
@@ -8341,11 +8525,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)

if (unlikely(check_cfs_rq_runtime(cfs_rq)))
goto again;
}

- se = pick_next_entity(cfs_rq);
+ if (cfs_rq_throttled_loose(cfs_rq))
+ throttled = true;
+
+ se = pick_next_entity(cfs_rq, throttled);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

return task_of(se);
}
@@ -8356,10 +8543,18 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
{
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
+ bool throttled;
+
+ /*
+ * We want to handle this before check_cfs_runtime(prev). We'll
+ * duplicate a little work in the goto simple case, but that's fine
+ */
+ if (prev)
+ handle_kernel_task_prev(prev);

again:
if (!sched_fair_runnable(rq))
goto idle;

@@ -8373,10 +8568,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
*
* Therefore attempt to avoid putting and setting the entire cgroup
* hierarchy, only change the part that actually changes.
*/

+ throttled = false;
do {
struct sched_entity *curr = cfs_rq->curr;

/*
* Since we got here without doing put_prev_entity() we also
@@ -8404,11 +8600,14 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf

goto simple;
}
}

- se = pick_next_entity(cfs_rq);
+ if (cfs_rq_throttled_loose(cfs_rq))
+ throttled = true;
+
+ se = pick_next_entity(cfs_rq, throttled);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

p = task_of(se);

@@ -8442,12 +8641,15 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
simple:
#endif
if (prev)
put_prev_task(rq, prev);

+ throttled = false;
do {
- se = pick_next_entity(cfs_rq);
+ if (cfs_rq_throttled_loose(cfs_rq))
+ throttled = true;
+ se = pick_next_entity(cfs_rq, throttled);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

p = task_of(se);
@@ -8507,10 +8709,12 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq)
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;

+ handle_kernel_task_prev(prev);
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
}
@@ -12788,10 +12992,13 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
cfs_rq->tasks_timeline = RB_ROOT_CACHED;
u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
#ifdef CONFIG_SMP
raw_spin_lock_init(&cfs_rq->removed.lock);
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ INIT_LIST_HEAD(&cfs_rq->kernel_children);
+#endif
}

#ifdef CONFIG_FAIR_GROUP_SCHED
static void task_change_group_fair(struct task_struct *p)
{
@@ -12940,10 +13147,13 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,

se->my_q = cfs_rq;
/* guarantee group entities always have weight */
update_load_set(&se->load, NICE_0_LOAD);
se->parent = parent;
+#ifdef CONFIG_CFS_BANDWIDTH
+ INIT_LIST_HEAD(&se->kernel_node);
+#endif
}

static DEFINE_MUTEX(shares_mutex);

static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77de..5fee806cc4538 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -578,10 +578,11 @@ struct cfs_rq {
u64 min_vruntime_copy;
#endif

struct rb_root_cached tasks_timeline;

+
/*
* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
struct sched_entity *curr;
@@ -656,12 +657,14 @@ struct cfs_rq {
u64 throttled_clock_pelt_time;
u64 throttled_clock_self;
u64 throttled_clock_self_time;
int throttled;
int throttle_count;
+ int h_kernel_running;
struct list_head throttled_list;
struct list_head throttled_csd_list;
+ struct list_head kernel_children;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};

static inline int rt_bandwidth_enabled(void)
--
2.43.0.472.g3155946c3a-goog

2023-12-18 18:07:29

by Valentin Schneider

Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

On 07/12/23 15:47, Benjamin Segall wrote:
> Alright, this took longer than I intended, but here's the basic version
> with the duplicate "runqueue" just being a list rather than actual EEVDF
> or anything sensible.
>
> I also wasn't actually able to adapt it to work via task_work like I
> thought I could; flagging the entry to the relevant schedule() calls is
> too critical.
>

Thanks for sharing this! Have a first set of comments.

> --------
>
> Subject: [RFC PATCH] sched/fair: only throttle CFS tasks in userspace
>
> The basic idea of this implementation is to maintain duplicate runqueues
> in each cfs_rq that contain duplicate pointers to sched_entitys which
> should bypass throttling. Then we can skip throttling cfs_rqs that have
> any such children, and when we pick inside any not-actually-throttled
> cfs_rq, we only look at this duplicated list.
>
> "Which tasks should bypass throttling" here is "all schedule() calls
> that don't set a special flag", but could instead involve the lockdep
> markers (except for the problem of percpu-rwsem and similar) or explicit
> flags around syscalls and faults, or something else.
>
> This approach avoids any O(tasks) loops, but leaves partially-throttled
> cfs_rqs still contributing their full h_nr_running to their parents,
> which might result in worse balancing. Also it adds more (generally
> still small) overhead to the common enqueue/dequeue/pick paths.
>
> The very basic debug test added is to run a cpusoaker and "cat
> /sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
> cgroup with a quota < 1 cpu.

So if I'm trying to compare notes:

The dual tree:
- Can actually throttle cfs_rq's once all tasks are in userspace
- Adds (light-ish) overhead to enqueue/dequeue
- Doesn't keep *nr_running updated when not fully throttled

Whereas dequeue via task_work:
- Can never fully throttle a cfs_rq
- Adds (not so light) overhead to unthrottle
- Keeps *nr_running updated

My thoughts right now are that there might be a way to mix both worlds: go
with the dual tree, but subtract from h_nr_running at dequeue_kernel()
(sort of doing the count update in throttle_cfs_rq() early) and keep track
of in-user tasks via handle_kernel_task_prev() to add this back when
unthrottling a not-fully-throttled cfs_rq. I need to actually think about
it though :-)
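
Something like the below, maybe (pure pseudo-code of the idea, not something
I've tried against either patch; in_user_tasks is a made-up counter standing
in for wherever the to-be-re-added count would get stashed):

	static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
	{
		if (count == 0)
			return;

		if (!se->on_rq || entity_is_task(se) ||
		    !group_cfs_rq(se)->h_kernel_running)
			list_del_init(&se->kernel_node);
		cfs_rq->h_kernel_running -= count;

		/*
		 * Do the h_nr_running part of throttle_cfs_rq() early: once
		 * the cfs_rq is past its quota, tasks heading to userspace
		 * stop being advertised, and in_user_tasks (made up here)
		 * remembers how much to add back at unthrottle.
		 */
		if (cfs_rq_throttled_loose(cfs_rq)) {
			cfs_rq->h_nr_running -= count;
			cfs_rq->in_user_tasks += count;
		}
	}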

> @@ -154,11 +154,11 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> while (ti_work & EXIT_TO_USER_MODE_WORK) {
>
> local_irq_enable_exit_to_user(ti_work);
>
> if (ti_work & _TIF_NEED_RESCHED)
> - schedule();
> + schedule_usermode(); /* TODO: also all of the arch/* loops that don't use this yet */
^^
Interestingly -Werror=comment trips on this

>
> if (ti_work & _TIF_UPROBE)
> uprobe_notify_resume(regs);
>
> if (ti_work & _TIF_PATCH_PENDING)
> @@ -5416,18 +5426,38 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> se->prev_sum_exec_runtime = se->sum_exec_runtime;
> }
>
> /*
> * Pick the next process, keeping these things in mind, in this order:
> - * 1) keep things fair between processes/task groups
> - * 2) pick the "next" process, since someone really wants that to run
> - * 3) pick the "last" process, for cache locality
> - * 4) do not run the "skip" process, if something else is available
> + * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel
> + * 2) keep things fair between processes/task groups
> + * 3) pick the "next" process, since someone really wants that to run
> + * 4) pick the "last" process, for cache locality
> + * 5) do not run the "skip" process, if something else is available
> */
> static struct sched_entity *
> -pick_next_entity(struct cfs_rq *cfs_rq)
> +pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
> {
> +#ifdef CONFIG_CFS_BANDWIDTH
> + /*
> + * TODO: This might trigger, I'm not sure/don't remember. Regardless,
> + * while we do not explicitly handle the case where h_kernel_running
> + * goes to 0, we will call account/check_cfs_rq_runtime at worst in
> + * entity_tick and notice that we can now properly do the full
> + * throttle_cfs_rq.
> + */
> + WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));

FWIW this triggers pretty much immediately in my environment (buildroot
over QEMU).

> @@ -5712,16 +5748,26 @@ static int tg_throttle_down(struct task_group *tg, void *data)
> cfs_rq->throttle_count++;
>
> return 0;
> }
>
> +static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
> +static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
> +
> static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> {
> struct rq *rq = rq_of(cfs_rq);
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> struct sched_entity *se;
> - long task_delta, idle_task_delta, dequeue = 1;
> + long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
> +
> + /*
> + * We don't actually throttle, though account() will have made sure to
> + * resched us so that we pick into a kernel task.
> + */
> + if (cfs_rq->h_kernel_running)
> + return false;
>

So as long as a cfs_rq has at least one task executing in the kernel, we
keep its *nr_running intact. We do throttle the whole thing once all the
tasks are in userspace (or about to be anyway), and unthrottle_on_enqueue()
undoes that.
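
IOW the resulting pick would roughly look like the below - that's me
paraphrasing the changelog, not quoting your patch, and the
list_first_entry() is just a placeholder for however the kernel list ends
up being walked:

	if (throttled)
		/* quota depleted but in-kernel tasks remain: only they are candidates */
		se = list_first_entry(&cfs_rq->kernel_children,
				      struct sched_entity, kernel_node);
	else
		/* quota left: the usual EEVDF pick */
		se = pick_eevdf(cfs_rq);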

> @@ -8316,15 +8471,44 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>
> preempt:
> resched_curr(rq);
> }
>
> +static void handle_kernel_task_prev(struct task_struct *prev)
> +{
> +#ifdef CONFIG_CFS_BANDWIDTH
> + struct sched_entity *se = &prev->se;
> + bool new_kernel = is_kernel_task(prev);
> + bool prev_kernel = !list_empty(&se->kernel_node);
> + /*
> + * These extra loops are bad and against the whole point of the merged
> + * PNT, but it's a pain to merge, particularly since we want it to occur
> + * before check_cfs_runtime.
> + */
> + if (prev_kernel && !new_kernel) {
> + WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
> + for_each_sched_entity(se) {
> + dequeue_kernel(cfs_rq_of(se), se, 1);
> + if (cfs_rq_throttled(cfs_rq_of(se)))
> + break;
> + }
> + } else if (!prev_kernel && new_kernel && se->on_rq) {
> + for_each_sched_entity(se) {
> + enqueue_kernel(cfs_rq_of(se), se, 1);
> + if (cfs_rq_throttled(cfs_rq_of(se)))
> + break;
> + }
> + }
> +#endif
> +}


I'm trying to grok what the implications of a second tasks_timeline would
be on picking - we'd want an RBtree update equivalent to put_prev_entity()'s
__enqueue_entity(), but that second tree doesn't follow the same
enqueue/dequeue cycle as the first one. Would a __dequeue_entity() +
__enqueue_entity() to force a rebalance make sense? That'd also take care
of updating the min_vruntime.
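
i.e., if kernel_node ever becomes an actual rbtree node, something like
the below in the put_prev path - entirely hypothetical:
__dequeue/__enqueue_kernel_entity() would be kernel-tree twins of
__dequeue_entity()/__enqueue_entity(), and se_on_kernel_tree() whatever
"is it on the kernel tree" test we end up with:

	/*
	 * prev's vruntime moved while it was running, so its position in
	 * the kernel tree is stale: drop it and re-insert it, which also
	 * refreshes the tree's min_vruntime tracking.
	 */
	if (se_on_kernel_tree(se)) {
		__dequeue_kernel_entity(cfs_rq, se);
		__enqueue_kernel_entity(cfs_rq, se);
	}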


2023-12-18 22:34:54

by Benjamin Segall

[permalink] [raw]
Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

Valentin Schneider <[email protected]> writes:

> On 07/12/23 15:47, Benjamin Segall wrote:
>> Alright, this took longer than I intended, but here's the basic version
>> with the duplicate "runqueue" just being a list rather than actual EEVDF
>> or anything sensible.
>>
>> I also wasn't actually able to adapt it to work via task_work like I
>> thought I could; flagging the entry to the relevant schedule() calls is
>> too critical.
>>
>
> Thanks for sharing this! Have a first set of comments.
>
>> --------
>>
>> Subject: [RFC PATCH] sched/fair: only throttle CFS tasks in userspace
>>
>> The basic idea of this implementation is to maintain duplicate runqueues
>> in each cfs_rq that contain duplicate pointers to sched_entitys which
>> should bypass throttling. Then we can skip throttling cfs_rqs that have
>> any such children, and when we pick inside any not-actually-throttled
>> cfs_rq, we only look at this duplicated list.
>>
>> "Which tasks should bypass throttling" here is "all schedule() calls
>> that don't set a special flag", but could instead involve the lockdep
>> markers (except for the problem of percpu-rwsem and similar) or explicit
>> flags around syscalls and faults, or something else.
>>
>> This approach avoids any O(tasks) loops, but leaves partially-throttled
>> cfs_rqs still contributing their full h_nr_running to their parents,
>> which might result in worse balancing. Also it adds more (generally
>> still small) overhead to the common enqueue/dequeue/pick paths.
>>
>> The very basic debug test added is to run a cpusoaker and "cat
>> /sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
>> cgroup with a quota < 1 cpu.
>
> So if I'm trying to compare notes:
>
> The dual tree:
> - Can actually throttle cfs_rq's once all tasks are in userspace
> - Adds (light-ish) overhead to enqueue/dequeue
> - Doesn't keep *nr_running updated when not fully throttled
>
> Whereas dequeue via task_work:
> - Can never fully throttle a cfs_rq
> - Adds (not so light) overhead to unthrottle
> - Keeps *nr_running updated

Yeah, this sounds right. (Though for the task_work version once all the
tasks are throttled, the cfs_rq is dequeued like normal, so it doesn't
seem like a big deal to me)

>
> My thoughts right now are that there might be a way to mix both worlds: go
> with the dual tree, but subtract from h_nr_running at dequeue_kernel()
> (sort of doing the count update in throttle_cfs_rq() early) and keep track
> of in-user tasks via handle_kernel_task_prev() to add this back when
> unthrottling a not-fully-throttled cfs_rq. I need to actually think about
> it though :-)

Probably, but I had tons of trouble getting the accounting correct as it
was :P

>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + /*
>> + * TODO: This might trigger, I'm not sure/don't remember. Regardless,
>> + * while we do not explicitly handle the case where h_kernel_running
>> + * goes to 0, we will call account/check_cfs_rq_runtime at worst in
>> + * entity_tick and notice that we can now properly do the full
>> + * throttle_cfs_rq.
>> + */
>> + WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
>
> FWIW this triggers pretty much immediately in my environment (buildroot
> over QEMU).

Good to know; I feel like this should be avoidable (and definitely
should be avoided if it can be), but I might be forgetting some case
that's mostly unavoidable. I'll have to poke it when I have more time.

>> @@ -8316,15 +8471,44 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>
>> preempt:
>> resched_curr(rq);
>> }
>>
>> +static void handle_kernel_task_prev(struct task_struct *prev)
>> +{
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + struct sched_entity *se = &prev->se;
>> + bool new_kernel = is_kernel_task(prev);
>> + bool prev_kernel = !list_empty(&se->kernel_node);
>> + /*
>> + * These extra loops are bad and against the whole point of the merged
>> + * PNT, but it's a pain to merge, particularly since we want it to occur
>> + * before check_cfs_runtime.
>> + */
>> + if (prev_kernel && !new_kernel) {
>> + WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
>> + for_each_sched_entity(se) {
>> + dequeue_kernel(cfs_rq_of(se), se, 1);
>> + if (cfs_rq_throttled(cfs_rq_of(se)))
>> + break;
>> + }
>> + } else if (!prev_kernel && new_kernel && se->on_rq) {
>> + for_each_sched_entity(se) {
>> + enqueue_kernel(cfs_rq_of(se), se, 1);
>> + if (cfs_rq_throttled(cfs_rq_of(se)))
>> + break;
>> + }
>> + }
>> +#endif
>> +}
>
>
> I'm trying to grok what the implications of a second tasks_timeline would
> be on picking - we'd want an RBtree update equivalent to put_prev_entity()'s
> __enqueue_entity(), but that second tree doesn't follow the same
> enqueue/dequeue cycle as the first one. Would a __dequeue_entity() +
> __enqueue_entity() to force a rebalance make sense? That'd also take care
> of updating the min_vruntime.

Yeah, the most sensible thing to do would probably be to just have curr
not be in the queue, the same as the normal rbtree, and save the "on
kernel list" as an on_rq-equivalent bool instead of as !list_empty(kernel_node).

2023-12-19 11:51:10

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

On 18/12/23 14:34, Benjamin Segall wrote:
> Valentin Schneider <[email protected]> writes:
>>
>> So if I'm trying to compare notes:
>>
>> The dual tree:
>> - Can actually throttle cfs_rq's once all tasks are in userspace
>> - Adds (light-ish) overhead to enqueue/dequeue
>> - Doesn't keep *nr_running updated when not fully throttled
>>
>> Whereas dequeue via task_work:
>> - Can never fully throttle a cfs_rq
>> - Adds (not so light) overhead to unthrottle
>> - Keeps *nr_running updated
>
> Yeah, this sounds right. (Though for the task_work version once all the
> tasks are throttled, the cfs_rq is dequeued like normal, so it doesn't
> seem like a big deal to me)
>

In my implementation the cfs_rq's can be dequeued once all of their tasks
have been throttled, but the tasks aren't migrated back onto the cfs_rq's
until the unthrottle (rather than at the "final" throttle).

That is done because if a random task gets migrated to a throttled & empty
cfs_rq, we still need to schedule it until it exits the kernel, and we
don't want the tasks that are already back in userspace (the ones we
previously throttled) to suddenly become eligible for picking.

That adds those O(ntasks) loops in the unthrottle to re-shape the
cfs_rq's. The dual tree approach saves us from doing all that; IIUC
unthrottle_on_enqueue() only puts the se's in the task's direct hierarchy
back onto the second tree for them to be picked.
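
For reference, the loop I mean on my side is shaped like the below - not a
verbatim excerpt from my patch; the limbo_list/throttle_node names and the
enqueue flags are approximate, and the throttle-clock fixups are left out:

	struct task_struct *p, *tmp;

	/*
	 * Put every task we pulled out at exit_to_user_mode() back onto
	 * its cfs_rq - this is the O(ntasks) part the dual tree avoids.
	 */
	list_for_each_entry_safe(p, tmp, &cfs_rq->limbo_list, throttle_node) {
		list_del_init(&p->throttle_node);
		enqueue_task_fair(rq, p, ENQUEUE_WAKEUP);
	}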


>>
>> My thoughts right now are that there might be a way to mix both worlds: go
>> with the dual tree, but subtract from h_nr_running at dequeue_kernel()
>> (sort of doing the count update in throttle_cfs_rq() early) and keep track
>> of in-user tasks via handle_kernel_task_prev() to add this back when
>> unthrottling a not-fully-throttled cfs_rq. I need to actually think about
>> it though :-)
>
> Probably, but I had tons of trouble getting the accounting correct as it
> was :P
>

Yeah so did I, cgroups are "fun".

>>
>> I'm trying to grok what the implications of a second tasks_timeline would
>> be on picking - we'd want an RBtree update equivalent to put_prev_entity()'s
>> __enqueue_entity(), but that second tree doesn't follow the same
>> enqueue/dequeue cycle as the first one. Would a __dequeue_entity() +
>> __enqueue_entity() to force a rebalance make sense? That'd also take care
>> of updating the min_vruntime.
>
> Yeah, the most sensible thing to do would probably be to just have curr
> not be in the queue, the same as the normal rbtree, and save the "on
> kernel list" as an on_rq-equivalent bool instead of as !list_empty(kernel_node).

Makes sense, I'll have a go at this.
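
i.e. something along these lines, reusing the hypothetical
__{en,de}queue_kernel_entity() helpers from my previous email plus a
made-up se->on_kernel_rq flag:

	/* set_next_entity(): curr leaves the kernel queue but stays flagged */
	if (se->on_kernel_rq)
		__dequeue_kernel_entity(cfs_rq, se);

	/* put_prev_entity(): re-insert prev with its updated vruntime */
	if (prev->on_kernel_rq)
		__enqueue_kernel_entity(cfs_rq, prev);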