2024-02-02 08:16:38

by Valentin Schneider

Subject: [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry

Hi folks,

Problem statement
=================

CFS tasks can end up throttled while holding locks that other, non-throttled
tasks are blocking on.

For !PREEMPT_RT, this can be a source of latency: tasks blocking on the lock
must wait for the throttled holder to get its runtime replenished, run again,
and release the lock.

For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter the
slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer that would replenish p0's runtime is enqueued
on the same CPU as the one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency: the replenishment is
handled by ktimers/ksoftirqd, which is blocked behind p0's read_lock(), and p0
cannot run (and release the lock) until its runtime is replenished.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.
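
As a purely illustrative sketch (hypothetical names, nothing lifted from the
eventpoll/netlink reports above), the troublesome pattern is simply a rwlock
that is read-acquired both from a bandwidth-controlled process context and
from softirq context:

/*
 * Illustration only: a rwlock shared between process and softirq context.
 */
#include <linux/spinlock.h>

static DEFINE_RWLOCK(demo_lock);
static int demo_state;

/* Process context, task in a CFS bandwidth-controlled cgroup: on PREEMPT_RT
 * the task (p0) can get throttled while holding the read lock. */
static int demo_read_from_task(void)
{
	int val;

	read_lock(&demo_lock);
	val = demo_state;
	read_unlock(&demo_lock);

	return val;
}

/* Writer (p1): once it blocks waiting for p0, further readers take the
 * slowpath. */
static void demo_update(int val)
{
	write_lock(&demo_lock);
	demo_state = val;
	write_unlock(&demo_lock);
}

/* Softirq context (serviced by the ktimers/ksoftirqd threads on PREEMPT_RT):
 * this reader now queues up behind p0, and if the bandwidth replenishment for
 * p0's cfs_rq is handled by that same thread, the cycle is closed. */
static void demo_read_from_softirq(void)
{
	read_lock(&demo_lock);
	(void)demo_state;	/* consume the protected state */
	read_unlock(&demo_lock);
}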

The linux-rt tree has had
1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.

Proposed approach
=================

Peter mentioned [1] that there have been discussions on changing /when/ the
throttling happens: rather than have it be done immediately upon updating
the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
for the task to be about to return to userspace: if it's in userspace, it can't
hold any in-kernel lock.
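
Condensed, the hook looks like the sketch below (this mirrors patch 1's simple
version; the sysctl knob is omitted, and patch 2 later reworks the flag update
to happen under the rq lock via an SM_USER schedule mode):

/* Tag the task when it schedules on its way out to userspace: such a task
 * cannot be holding any in-kernel lock, so it is safe to throttle. */
asmlinkage __visible void __sched schedule_usermode(void)
{
	atomic_set(&current->in_return_to_user, true);
	schedule();
	atomic_set(&current->in_return_to_user, false);
}

/* Tasks not carrying the tag are "in the kernel" and bypass the throttle. */
static bool is_kernel_task(struct task_struct *p)
{
	return !atomic_read(&p->in_return_to_user);
}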

I submitted an initial jab at this [2] and Ben Segall added his own version to
the conversation [3]. This series contains Ben's patch plus my additions. The
main change here is updating the .h_nr_running counts throughout the cfs_rq
hierarchies to improve the picture given to load_balance().

The main thing that remains to be done for this series is making the second
cfs_rq tree an actual RB tree (it's just a plain list ATM).

This also doesn't touch rq.nr_running yet; I'm not entirely sure whether we want
to expose this outside of CFS, but it is another field used by load balancing.

Testing
=======

Tested on QEMU via the following script (run from /root so that the relative
cpu/ paths resolve under the mount point):

mount -t cgroup -o cpu none /root/cpu

mkdir /root/cpu/cg0
echo 10000 > /root/cpu/cg0/cpu.cfs_period_us
echo 1000 > /root/cpu/cg0/cpu.cfs_quota_us

mkdir /root/cpu/cg0/cg00
mkdir /root/cpu/cg0/cg01

mkdir /root/cpu/cg0/cg00/cg000
mkdir /root/cpu/cg0/cg00/cg001

spawn() {
    while true; do cat /sys/devices/system/cpu/smt/active &>/dev/null; done &
    PID=$!
    echo "Starting PID${PID}"
    echo $PID > $1
}

spawn cpu/cg0/tasks
spawn cpu/cg0/tasks
spawn cpu/cg0/tasks
spawn cpu/cg0/tasks

spawn cpu/cg0/cg01/tasks

spawn cpu/cg0/cg00/cg000/tasks
spawn cpu/cg0/cg00/cg001/tasks

sleep 120

kill $(jobs -p)

Links
=====

[1]: https://lore.kernel.org/all/[email protected]/
[2]: http://lore.kernel.org/r/[email protected]
[3]: http://lore.kernel.org/r/[email protected]

Benjamin Segall (1):
sched/fair: Only throttle CFS tasks on return to userspace

Valentin Schneider (4):
sched: Note schedule() invocations at return-to-user with SM_USER
sched/fair: Delete cfs_rq_throttled_loose(), use
cfs_rq->throttle_pending instead
sched/fair: Track count of tasks running in userspace
sched/fair: Assert user/kernel/total nr invariants

include/linux/sched.h | 7 +
kernel/entry/common.c | 2 +-
kernel/entry/kvm.c | 2 +-
kernel/sched/core.c | 45 ++++-
kernel/sched/debug.c | 28 +++
kernel/sched/fair.c | 399 ++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 5 +
7 files changed, 466 insertions(+), 22 deletions(-)

--
2.43.0



2024-02-02 08:16:54

by Valentin Schneider

Subject: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace

From: Benjamin Segall <[email protected]>

The basic idea of this implementation is to maintain a duplicate runqueue
in each cfs_rq that contains pointers to the sched_entities which
should bypass throttling. Then we can skip throttling cfs_rqs that have
any such children, and when we pick inside any not-actually-throttled
cfs_rq, we only look at this duplicated list.

"Which tasks should bypass throttling" here is "all schedule() calls
that don't set a special flag", but could instead involve the lockdep
markers (except for the problem of percpu-rwsem and similar) or explicit
flags around syscalls and faults, or something else.

This approach avoids any O(tasks) loops, but leaves partially-throttled
cfs_rqs still contributing their full h_nr_running to their parents,
which might result in worse balancing. Also it adds more (generally
still small) overhead to the common enqueue/dequeue/pick paths.

The very basic debug test added is to run a cpusoaker and "cat
/sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
cgroup with a quota < 1 cpu.

Not-signed-off-by: Benjamin Segall <[email protected]>
[Slight comment / naming changes]
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched.h | 7 ++
kernel/entry/common.c | 2 +-
kernel/entry/kvm.c | 2 +-
kernel/sched/core.c | 20 ++++
kernel/sched/debug.c | 28 +++++
kernel/sched/fair.c | 232 ++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 3 +
7 files changed, 281 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03bfe9ab29511..4a0105d1eaa21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -303,6 +303,7 @@ extern long schedule_timeout_killable(long timeout);
extern long schedule_timeout_uninterruptible(long timeout);
extern long schedule_timeout_idle(long timeout);
asmlinkage void schedule(void);
+asmlinkage void schedule_usermode(void);
extern void schedule_preempt_disabled(void);
asmlinkage void preempt_schedule_irq(void);
#ifdef CONFIG_PREEMPT_RT
@@ -553,6 +554,9 @@ struct sched_entity {
struct cfs_rq *my_q;
/* cached value of my_q->h_nr_running */
unsigned long runnable_weight;
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct list_head kernel_node;
+#endif
#endif

#ifdef CONFIG_SMP
@@ -1539,6 +1543,9 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif

+#ifdef CONFIG_CFS_BANDWIDTH
+ atomic_t in_return_to_user;
+#endif
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d7ee4bc3f2ba3..16b5432a62c6f 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -156,7 +156,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
local_irq_enable_exit_to_user(ti_work);

if (ti_work & _TIF_NEED_RESCHED)
- schedule();
+ schedule_usermode(); /* TODO: also all of the arch/ loops that don't use this yet */

if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd1..fc4b73de07539 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -14,7 +14,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
}

if (ti_work & _TIF_NEED_RESCHED)
- schedule();
+ schedule_usermode();

if (ti_work & _TIF_NOTIFY_RESUME)
resume_user_mode_work(NULL);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f0..a7c028fad5a89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4529,6 +4529,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ INIT_LIST_HEAD(&p->se.kernel_node);
+ atomic_set(&p->in_return_to_user, 0);
+#endif

#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
@@ -6818,6 +6822,22 @@ asmlinkage __visible void __sched schedule(void)
}
EXPORT_SYMBOL(schedule);

+asmlinkage __visible void __sched schedule_usermode(void)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+ /*
+ * This is only atomic because of this simple implementation. We could
+ * do something with an SM_USER to avoid other-cpu scheduler operations
+ * racing against these writes.
+ */
+ atomic_set(&current->in_return_to_user, true);
+ schedule();
+ atomic_set(&current->in_return_to_user, false);
+#else
+ schedule();
+#endif
+}
+
/*
* synchronize_rcu_tasks() makes sure that no task is stuck in preempted
* state (have scheduled out non-voluntarily) by making sure that all
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834df..4a89dbc3ddfcd 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -319,6 +319,32 @@ static const struct file_operations sched_verbose_fops = {
.llseek = default_llseek,
};

+static DEFINE_MUTEX(sched_debug_spin_mutex);
+static int sched_debug_spin_show(struct seq_file *m, void *v) {
+ int count;
+ mutex_lock(&sched_debug_spin_mutex);
+ for (count = 0; count < 1000; count++) {
+ u64 start2;
+ start2 = jiffies;
+ while (jiffies == start2)
+ cpu_relax();
+ schedule();
+ }
+ mutex_unlock(&sched_debug_spin_mutex);
+ return 0;
+}
+static int sched_debug_spin_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, sched_debug_spin_show, NULL);
+}
+
+static const struct file_operations sched_debug_spin_fops = {
+ .open = sched_debug_spin_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
static const struct seq_operations sched_debug_sops;

static int sched_debug_open(struct inode *inode, struct file *filp)
@@ -374,6 +400,8 @@ static __init int sched_init_debug(void)

debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);

+ debugfs_create_file("sched_locked_spin", 0444, NULL, NULL,
+ &sched_debug_spin_fops);
return 0;
}
late_initcall(sched_init_debug);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a037..a1808459a5acc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -128,6 +128,7 @@ int __weak arch_asym_cpu_priority(int cpu)
* (default: 5 msec, units: microseconds)
*/
static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+static unsigned int sysctl_sched_cfs_bandwidth_kernel_bypass = 1;
#endif

#ifdef CONFIG_NUMA_BALANCING
@@ -146,6 +147,15 @@ static struct ctl_table sched_fair_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ONE,
},
+ {
+ .procname = "sched_cfs_bandwidth_kernel_bypass",
+ .data = &sysctl_sched_cfs_bandwidth_kernel_bypass,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
#endif
#ifdef CONFIG_NUMA_BALANCING
{
@@ -5445,14 +5455,34 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

/*
* Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
+ * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel
+ * 2) keep things fair between processes/task groups
+ * 3) pick the "next" process, since someone really wants that to run
+ * 4) pick the "last" process, for cache locality
+ * 5) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq)
+pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
{
+#ifdef CONFIG_CFS_BANDWIDTH
+ /*
+ * TODO: This might trigger, I'm not sure/don't remember. Regardless,
+ * while we do not explicitly handle the case where h_kernel_running
+ * goes to 0, we will call account/check_cfs_rq_runtime at worst in
+ * entity_tick and notice that we can now properly do the full
+ * throttle_cfs_rq.
+ */
+ WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
+ if (throttled && !list_empty(&cfs_rq->kernel_children)) {
+ /*
+ * TODO: you'd want to factor out pick_eevdf to just take
+ * tasks_timeline, and replace this list with a second rbtree
+ * and a call to pick_eevdf.
+ */
+ return list_first_entry(&cfs_rq->kernel_children,
+ struct sched_entity, kernel_node);
+ }
+#endif
/*
* Enabling NEXT_BUDDY will affect latency but not fairness.
*/
@@ -5651,8 +5681,14 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
/*
* if we're unable to extend our runtime we resched so that the active
* hierarchy can be throttled
+ *
+ * Don't resched_curr() if curr is in the kernel. We won't throttle the
+ * cfs_rq if any task is in the kernel, and if curr in particular is we
+ * don't need to preempt it in favor of whatever other task is in the
+ * kernel.
*/
- if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+ if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr) &&
+ list_empty(&rq_of(cfs_rq)->curr->se.kernel_node))
resched_curr(rq_of(cfs_rq));
}

@@ -5741,12 +5777,22 @@ static int tg_throttle_down(struct task_group *tg, void *data)
return 0;
}

+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+
static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta, dequeue = 1;
+ long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
+
+ /*
+ * We don't actually throttle, though account() will have made sure to
+ * resched us so that we pick into a kernel task.
+ */
+ if (cfs_rq->h_kernel_running)
+ return false;

raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -5778,6 +5824,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)

task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ kernel_delta = cfs_rq->h_kernel_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
/* throttled entity or throttle-on-deactivate */
@@ -5791,6 +5838,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)

qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ dequeue_kernel(qcfs_rq, se, kernel_delta);

if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
@@ -5813,6 +5861,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)

qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ dequeue_kernel(qcfs_rq, se, kernel_delta);
}

/* At this point se is NULL and we are at root level*/
@@ -5835,7 +5884,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta;
+ long task_delta, idle_task_delta, kernel_delta;

se = cfs_rq->tg->se[cpu_of(rq)];

@@ -5870,6 +5919,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ kernel_delta = cfs_rq->h_kernel_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);

@@ -5882,6 +5932,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ enqueue_kernel(qcfs_rq, se, kernel_delta);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -5899,6 +5950,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ enqueue_kernel(qcfs_rq, se, kernel_delta);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -6557,6 +6609,86 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
}
#endif

+/*
+ * We keep track of all children that are runnable in the kernel with a count of
+ * all descendants. The state is checked on enqueue and put_prev (and hard
+ * cleared on dequeue), and is stored just as the filled/empty state of the
+ * kernel_node list entry.
+ *
+ * These are simple helpers that do both parts, and should be called bottom-up
+ * until hitting a throttled cfs_rq whenever a task changes state (or a cfs_rq
+ * is (un)throttled).
+ */
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+ if (count == 0)
+ return;
+
+ if (list_empty(&se->kernel_node))
+ list_add(&se->kernel_node, &cfs_rq->kernel_children);
+ cfs_rq->h_kernel_running += count;
+}
+
+static bool is_kernel_task(struct task_struct *p)
+{
+ return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_return_to_user);
+}
+
+/*
+ * When called on a task this always transitions it to a !kernel state.
+ *
+ * When called on a group it is just synchronizing the state with the new
+ * h_kernel_running, unless it has been throttled and is !on_rq
+ */
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+ if (count == 0)
+ return;
+
+ if (!se->on_rq || entity_is_task(se) ||
+ !group_cfs_rq(se)->h_kernel_running)
+ list_del_init(&se->kernel_node);
+ cfs_rq->h_kernel_running -= count;
+}
+
+/*
+ * Returns if the cfs_rq "should" be throttled but might not be because of
+ * kernel threads bypassing throttle.
+ */
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+ if (!cfs_bandwidth_used())
+ return false;
+
+ if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+ return false;
+ return true;
+}
+
+static void unthrottle_on_enqueue(struct task_struct *p)
+{
+ struct sched_entity *se = &p->se;
+
+ if (!cfs_bandwidth_used() || !sysctl_sched_cfs_bandwidth_kernel_bypass)
+ return;
+ if (!cfs_rq_of(&p->se)->throttle_count)
+ return;
+
+ /*
+ * MAYBE TODO: doing it this simple way is O(throttle_count *
+ * cgroup_depth). We could optimize that into a single pass, but making
+ * a mostly-copy of unthrottle_cfs_rq that does that is a pain and easy
+ * to get wrong. (And even without unthrottle_on_enqueue it's O(nm),
+ * just not while holding rq->lock the whole time)
+ */
+
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ if (cfs_rq->throttled)
+ unthrottle_cfs_rq(cfs_rq);
+ }
+}
+
#else /* CONFIG_CFS_BANDWIDTH */

static inline bool cfs_bandwidth_used(void)
@@ -6604,6 +6736,16 @@ bool cfs_task_bw_constrained(struct task_struct *p)
return false;
}
#endif
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static inline bool is_kernel_task(struct task_struct *p)
+{
+ return false;
+}
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+ return false;
+}
#endif /* CONFIG_CFS_BANDWIDTH */

#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
@@ -6707,6 +6849,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
+ bool kernel_task = is_kernel_task(p);

/*
* The code below (indirectly) updates schedutil which looks at
@@ -6735,6 +6878,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ enqueue_kernel(cfs_rq, se, 1);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6755,6 +6900,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ enqueue_kernel(cfs_rq, se, 1);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6785,6 +6932,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
assert_list_leaf_cfs_rq(rq);

hrtick_update(rq);
+
+ if (kernel_task)
+ unthrottle_on_enqueue(p);
}

static void set_next_buddy(struct sched_entity *se);
@@ -6801,6 +6951,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int task_sleep = flags & DEQUEUE_SLEEP;
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
+ bool kernel_task = !list_empty(&p->se.kernel_node);

util_est_dequeue(&rq->cfs, p);

@@ -6813,6 +6964,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ dequeue_kernel(cfs_rq, se, 1);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6845,6 +6998,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
+ if (kernel_task)
+ dequeue_kernel(cfs_rq, se, 1);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -8343,11 +8498,40 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
resched_curr(rq);
}

+static void handle_kernel_task_prev(struct task_struct *prev)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct sched_entity *se = &prev->se;
+ bool p_in_kernel = is_kernel_task(prev);
+ bool p_in_kernel_tree = !list_empty(&se->kernel_node);
+ /*
+ * These extra loops are bad and against the whole point of the merged
+ * PNT, but it's a pain to merge, particularly since we want it to occur
+ * before check_cfs_runtime().
+ */
+ if (p_in_kernel_tree && !p_in_kernel) {
+ WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
+ for_each_sched_entity(se) {
+ dequeue_kernel(cfs_rq_of(se), se, 1);
+ if (cfs_rq_throttled(cfs_rq_of(se)))
+ break;
+ }
+ } else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) {
+ for_each_sched_entity(se) {
+ enqueue_kernel(cfs_rq_of(se), se, 1);
+ if (cfs_rq_throttled(cfs_rq_of(se)))
+ break;
+ }
+ }
+#endif
+}
+
#ifdef CONFIG_SMP
static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
+ bool throttled = false;

again:
cfs_rq = &rq->cfs;
@@ -8368,7 +8552,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
goto again;
}

- se = pick_next_entity(cfs_rq);
+ if (cfs_rq_throttled_loose(cfs_rq))
+ throttled = true;
+
+ se = pick_next_entity(cfs_rq, throttled);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

@@ -8383,6 +8570,14 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
+ bool throttled;
+
+ /*
+ * We want to handle this before check_cfs_runtime(prev). We'll
+ * duplicate a little work in the goto simple case, but that's fine
+ */
+ if (prev)
+ handle_kernel_task_prev(prev);

again:
if (!sched_fair_runnable(rq))
@@ -8400,6 +8595,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
* hierarchy, only change the part that actually changes.
*/

+ throttled = false;
do {
struct sched_entity *curr = cfs_rq->curr;

@@ -8431,7 +8627,10 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
}
}

- se = pick_next_entity(cfs_rq);
+ if (cfs_rq_throttled_loose(cfs_rq))
+ throttled = true;
+
+ se = pick_next_entity(cfs_rq, throttled);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

@@ -8469,8 +8668,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
if (prev)
put_prev_task(rq, prev);

+ throttled = false;
do {
- se = pick_next_entity(cfs_rq);
+ if (cfs_rq_throttled_loose(cfs_rq))
+ throttled = true;
+ se = pick_next_entity(cfs_rq, throttled);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
@@ -8534,6 +8736,8 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;

+ handle_kernel_task_prev(prev);
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
@@ -12818,6 +13022,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
raw_spin_lock_init(&cfs_rq->removed.lock);
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ INIT_LIST_HEAD(&cfs_rq->kernel_children);
+#endif
}

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -12970,6 +13177,9 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
/* guarantee group entities always have weight */
update_load_set(&se->load, NICE_0_LOAD);
se->parent = parent;
+#ifdef CONFIG_CFS_BANDWIDTH
+ INIT_LIST_HEAD(&se->kernel_node);
+#endif
}

static DEFINE_MUTEX(shares_mutex);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77de..0b33ce2e60555 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -580,6 +580,7 @@ struct cfs_rq {

struct rb_root_cached tasks_timeline;

+
/*
* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
@@ -658,8 +659,10 @@ struct cfs_rq {
u64 throttled_clock_self_time;
int throttled;
int throttle_count;
+ int h_kernel_running;
struct list_head throttled_list;
struct list_head throttled_csd_list;
+ struct list_head kernel_children;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
--
2.43.0


2024-02-02 08:17:01

by Valentin Schneider

Subject: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead

cfs_rq_throttled_loose() does not check if there is runtime remaining in
the cfs_b, and thus relies on check_cfs_rq_runtime() having been run
previously for that to be checked.

Cache the throttle attempt in throttle_cfs_rq and reuse that where
needed.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 44 ++++++++++----------------------------------
1 file changed, 10 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96504be6ee14a..60778afbff207 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* 5) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
+pick_next_entity(struct cfs_rq *cfs_rq)
{
#ifdef CONFIG_CFS_BANDWIDTH
/*
@@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
* throttle_cfs_rq.
*/
WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
- if (throttled && !list_empty(&cfs_rq->kernel_children)) {
+ if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) {
/*
* TODO: you'd want to factor out pick_eevdf to just take
* tasks_timeline, and replace this list with a second rbtree
@@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
* We don't actually throttle, though account() will have made sure to
* resched us so that we pick into a kernel task.
*/
- if (cfs_rq->h_kernel_running)
+ if (cfs_rq->h_kernel_running) {
+ cfs_rq->throttle_pending = true;
return false;
+ }
+
+ cfs_rq->throttle_pending = false;

raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -6666,20 +6670,6 @@ static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int c
cfs_rq->h_kernel_running -= count;
}

-/*
- * Returns if the cfs_rq "should" be throttled but might not be because of
- * kernel threads bypassing throttle.
- */
-static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
-{
- if (!cfs_bandwidth_used())
- return false;
-
- if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
- return false;
- return true;
-}
-
static void unthrottle_on_enqueue(struct task_struct *p)
{
struct sched_entity *se = &p->se;
@@ -8546,7 +8536,6 @@ static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
- bool throttled = false;

again:
cfs_rq = &rq->cfs;
@@ -8567,10 +8556,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
goto again;
}

- if (cfs_rq_throttled_loose(cfs_rq))
- throttled = true;
-
- se = pick_next_entity(cfs_rq, throttled);
+ se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

@@ -8585,7 +8571,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
- bool throttled;

/*
* We want to handle this before check_cfs_runtime(prev). We'll
@@ -8609,8 +8594,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
* Therefore attempt to avoid putting and setting the entire cgroup
* hierarchy, only change the part that actually changes.
*/
-
- throttled = false;
do {
struct sched_entity *curr = cfs_rq->curr;

@@ -8641,11 +8624,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
goto simple;
}
}
-
- if (cfs_rq_throttled_loose(cfs_rq))
- throttled = true;
-
- se = pick_next_entity(cfs_rq, throttled);
+ se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

@@ -8683,11 +8662,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
if (prev)
put_prev_task(rq, prev);

- throttled = false;
do {
- if (cfs_rq_throttled_loose(cfs_rq))
- throttled = true;
- se = pick_next_entity(cfs_rq, throttled);
+ se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
--
2.43.0


2024-02-02 08:17:12

by Valentin Schneider

Subject: [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace

While having a second tree to pick from solves the throttling aspect of things,
it also requires modification of the task count at the cfs_rq level.

h_nr_running is used throughout load_balance(), and it needs to accurately
reflect the number of pickable tasks: a cfs_rq with .throttle_pending=1 may have
many tasks in userspace (thus effectively throttled), and this "excess" of tasks
shouldn't cause find_busiest_group() / find_busiest_queue() to pick that
cfs_rq's CPU to pull load from when there are other CPUs with more pickable
tasks available.

The approach taken here is to track both the count of tasks in kernelspace and
the count of tasks in userspace (technically tasks-just-about-to-enter-userspace).

When a cfs_rq runs out of runtime, it gets marked as .throttle_pending=1. From
this point on, only tasks executing in kernelspace are pickable, and this is
reflected up the hierarchy by removing that cfs_rq.h_user_running from its
parents' .h_nr_running.

To aid in validating the proper behaviour of the implementation, we assert the
following invariants:
o For any cfs_rq with .throttle_pending == 0:
.h_kernel_running + .h_user_running == .h_nr_running
o For any cfs_rq with .throttle_pending == 1:
.h_kernel_running == .h_nr_running

This means the .h_user_running also needs to be updated as cfs_rqs become
throttle_pending=1. When a cfs_rq becomes .throttle_pending=1, its
h_user_running remains untouched, but it is subtracted from its parents'
h_user_running.

Another way to look at it is that the .h_user_running is "stored" at the level
of the .throttle_pending cfs_rq, and restored to the upper part of the hierarchy
at unthrottle.

An overview of the count logic is:

Consider:
cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
cfs_rq.user := count of user *tasks* enqueued on this cfs_rq

Then, the following logic is implemented:
cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
cfs_rq.h_user_running = Sum(child.user) for all child cfs_rq with !child.throttle_pending
cfs_rq.h_nr_running = Sum(child.kernel) for all child cfs_rq
+ Sum(child.user) for all child cfs_rq with !child.throttle_pending

An application of that logic to an A/B/C cgroup hierarchy:

Initial condition, no throttling

+------+ .h_kernel_running = C.kernel + B.kernel + A.kernel
A |cfs_rq| .h_user_running = C.user + B.user + A.user
+------+ .h_nr_running = C.{kernel+user} + B.{kernel+user} + A.{kernel+user}
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel + B.kernel
B |cfs_rq| .h_user_running = C.user + B.user
+------+ .h_nr_running = C.{kernel+user} + B.{kernel+user}
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel
C |cfs_rq| .h_user_running = C.user
+------+ .h_nr_running = C.{kernel+user}
.throttle_pending = 0

C becomes .throttle_pending

+------+ .h_kernel_running = C.kernel + B.kernel + A.kernel <- Untouched
A |cfs_rq| .h_user_running = B.user + A.user <- Decremented by C.user
+------+ .h_nr_running = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel + B.kernel <- Untouched
B |cfs_rq| .h_user_running = B.user <- Decremented by C.user
+------+ .h_nr_running = C.kernel + B.{kernel+user} <- Decremented by C.user
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel
C |cfs_rq| .h_user_running = C.user <- Untouched, the count is "stored" at this level
+------+ .h_nr_running = C.kernel <- Decremented by C.user
.throttle_pending = 1

C becomes throttled

+------+ .h_kernel_running = B.kernel + A.kernel <- Decremented by C.kernel
A |cfs_rq| .h_user_running = B.user + A.user
+------+ .h_nr_running = B.{kernel+user} + A.{kernel+user} <- Decremented by C.kernel
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = B.kernel <- Decremented by C.kernel
B |cfs_rq| .h_user_running = B.user
+------+ .h_nr_running = B.{kernel+user} <- Decremented by C.kernel
^ .throttle_pending = 0
|
| parent
|
+------+ .h_kernel_running = C.kernel
C |cfs_rq| .h_user_running = C.user
+------+ .h_nr_running = C.{kernel+user} <- Incremented by C.user
.throttle_pending = 0

Could we get away with just one count, e.g. the user count and not the kernel
count? Technically yes, we could follow this scheme:
  if (throttle_pending) => kernel count := h_nr_running - h_user_running
  else                  => kernel count := h_nr_running
This however prevents any sort of assertion or sanity checking on the counts,
which I am not the biggest fan of - CFS group scheduling is enough of a headache
as it is.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 174 ++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 2 +
2 files changed, 151 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60778afbff207..2b54d3813d18d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5785,17 +5785,48 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
+ long task_delta, idle_task_delta, kernel_delta, user_delta, dequeue = 1;
+ bool was_pending;

/*
- * We don't actually throttle, though account() will have made sure to
- * resched us so that we pick into a kernel task.
+ * We don't actually throttle just yet, though account_cfs_rq_runtime()
+ * will have made sure to resched us so that we pick into a kernel task.
*/
if (cfs_rq->h_kernel_running) {
+ if (cfs_rq->throttle_pending)
+ return false;
+
+ /*
+ * From now on we're only going to pick tasks that are in the
+ * second tree. Reflect this by discounting tasks that aren't going
+ * to be pickable from the ->h_nr_running counts.
+ */
cfs_rq->throttle_pending = true;
+
+ se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+ user_delta = cfs_rq->h_user_running;
+ cfs_rq->h_nr_running -= user_delta;
+
+ for_each_sched_entity(se) {
+ struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+
+ if (!se->on_rq)
+ goto done;
+
+ qcfs_rq->h_nr_running -= user_delta;
+ qcfs_rq->h_user_running -= user_delta;
+
+ assert_cfs_rq_counts(qcfs_rq);
+ }
return false;
}

+ /*
+ * Unlikely as it may be, we may only have user tasks as we hit the
+ * throttle, in which case we won't have discounted them from the
+ * h_nr_running, and we need to be aware of that.
+ */
+ was_pending = cfs_rq->throttle_pending;
cfs_rq->throttle_pending = false;

raw_spin_lock(&cfs_b->lock);
@@ -5826,9 +5857,27 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
rcu_read_unlock();

- task_delta = cfs_rq->h_nr_running;
+ /*
+ * At this point, h_nr_running == h_kernel_running. We add back the
+ * h_user_running to the throttled cfs_rq, and only remove the difference
+ * to the upper cfs_rq's.
+ */
+ if (was_pending) {
+ WARN_ON_ONCE(cfs_rq->h_nr_running != cfs_rq->h_kernel_running);
+ cfs_rq->h_nr_running += cfs_rq->h_user_running;
+ } else {
+ WARN_ON_ONCE(cfs_rq->h_nr_running != cfs_rq->h_user_running);
+ }
+
+ /*
+ * We always discount user tasks from h_nr_running when throttle_pending
+ * so only h_kernel_running remains to be removed
+ */
+ task_delta = was_pending ? cfs_rq->h_kernel_running : cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
kernel_delta = cfs_rq->h_kernel_running;
+ user_delta = was_pending ? 0 : cfs_rq->h_user_running;
+
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
/* throttled entity or throttle-on-deactivate */
@@ -5843,6 +5892,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
dequeue_kernel(qcfs_rq, se, kernel_delta);
+ qcfs_rq->h_user_running -= user_delta;
+

if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
@@ -5866,6 +5917,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
dequeue_kernel(qcfs_rq, se, kernel_delta);
+ qcfs_rq->h_user_running -= user_delta;
}

/* At this point se is NULL and we are at root level*/
@@ -5888,7 +5940,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta, kernel_delta;
+ long task_delta, idle_task_delta, kernel_delta, user_delta;

se = cfs_rq->tg->se[cpu_of(rq)];

@@ -5924,6 +5976,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
kernel_delta = cfs_rq->h_kernel_running;
+ user_delta = cfs_rq->h_user_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);

@@ -5937,6 +5990,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
enqueue_kernel(qcfs_rq, se, kernel_delta);
+ qcfs_rq->h_user_running += user_delta;
+
+ assert_cfs_rq_counts(qcfs_rq);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -5955,6 +6011,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
enqueue_kernel(qcfs_rq, se, kernel_delta);
+ qcfs_rq->h_user_running += user_delta;

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -6855,6 +6912,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
bool kernel_task = is_kernel_task(p);
+ bool throttle_pending = false;

/*
* The code below (indirectly) updates schedutil which looks at
@@ -6878,13 +6936,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);

- cfs_rq->h_nr_running++;
- cfs_rq->idle_h_nr_running += idle_h_nr_running;

- if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+ cfs_rq->h_nr_running++;
if (kernel_task)
enqueue_kernel(cfs_rq, se, 1);
+ else if (!throttle_pending)
+ cfs_rq->h_user_running++;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ cfs_rq->idle_h_nr_running += idle_h_nr_running;
+ if (cfs_rq_is_idle(cfs_rq))
+ idle_h_nr_running = 1;
+

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6900,13 +6965,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se_update_runnable(se);
update_cfs_group(se);

- cfs_rq->h_nr_running++;
- cfs_rq->idle_h_nr_running += idle_h_nr_running;

- if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+ cfs_rq->h_nr_running++;
if (kernel_task)
enqueue_kernel(cfs_rq, se, 1);
+ else if (!throttle_pending)
+ cfs_rq->h_user_running++;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ cfs_rq->idle_h_nr_running += idle_h_nr_running;
+ if (cfs_rq_is_idle(cfs_rq))
+ idle_h_nr_running = 1;
+

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6957,6 +7029,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
bool kernel_task = !list_empty(&p->se.kernel_node);
+ bool throttle_pending = false;

util_est_dequeue(&rq->cfs, p);

@@ -6964,13 +7037,20 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);

- cfs_rq->h_nr_running--;
- cfs_rq->idle_h_nr_running -= idle_h_nr_running;

- if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+ cfs_rq->h_nr_running--;
if (kernel_task)
dequeue_kernel(cfs_rq, se, 1);
+ else if (!throttle_pending)
+ cfs_rq->h_user_running--;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+ if (cfs_rq_is_idle(cfs_rq))
+ idle_h_nr_running = 1;
+

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6998,13 +7078,20 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se_update_runnable(se);
update_cfs_group(se);

- cfs_rq->h_nr_running--;
- cfs_rq->idle_h_nr_running -= idle_h_nr_running;

- if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+ cfs_rq->h_nr_running--;
if (kernel_task)
dequeue_kernel(cfs_rq, se, 1);
+ else if (!throttle_pending)
+ cfs_rq->h_user_running--;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+ if (cfs_rq_is_idle(cfs_rq))
+ idle_h_nr_running = 1;
+

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -8503,28 +8590,65 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
resched_curr(rq);
}

+/*
+ * Consider:
+ * cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
+ * cfs_rq.user := count of user *tasks* enqueued on this cfs_rq
+ *
+ * Then, the following logic is implemented:
+ * cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
+ * cfs_rq.h_user_running = Sum(child.user) for all child cfs_rq with !child.throttle_pending
+ * cfs_rq.h_nr_running = Sum(child.kernel) for all child cfs_rq
+ * + Sum(child.user) for all child cfs_rq with !child.throttle_pending
+ *
+ * IOW, count of kernel tasks is always propagated up the hierarchy, and count
+ * of user tasks is only propagated up if the cfs_rq isn't .throttle_pending.
+ */
static void handle_kernel_task_prev(struct task_struct *prev)
{
#ifdef CONFIG_CFS_BANDWIDTH
struct sched_entity *se = &prev->se;
bool p_in_kernel = is_kernel_task(prev);
bool p_in_kernel_tree = !list_empty(&se->kernel_node);
+ bool throttle_pending = false;
/*
* These extra loops are bad and against the whole point of the merged
* PNT, but it's a pain to merge, particularly since we want it to occur
* before check_cfs_runtime().
*/
if (p_in_kernel_tree && !p_in_kernel) {
+ /* Switch from KERNEL -> USER */
WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
+
for_each_sched_entity(se) {
- dequeue_kernel(cfs_rq_of(se), se, 1);
- if (cfs_rq_throttled(cfs_rq_of(se)))
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ if (throttle_pending || cfs_rq->throttle_pending)
+ cfs_rq->h_nr_running--;
+ dequeue_kernel(cfs_rq, se, 1);
+ if (!throttle_pending)
+ cfs_rq->h_user_running++;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ if (cfs_rq_throttled(cfs_rq))
break;
}
} else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) {
+ /* Switch from USER -> KERNEL */
+
for_each_sched_entity(se) {
- enqueue_kernel(cfs_rq_of(se), se, 1);
- if (cfs_rq_throttled(cfs_rq_of(se)))
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ if (throttle_pending || cfs_rq->throttle_pending)
+ cfs_rq->h_nr_running++;
+ enqueue_kernel(cfs_rq, se, 1);
+ if (!throttle_pending)
+ cfs_rq->h_user_running--;
+
+ throttle_pending |= cfs_rq->throttle_pending;
+
+ if (cfs_rq_throttled(cfs_rq))
break;
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b33ce2e60555..e8860e0d6fbc7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -660,6 +660,8 @@ struct cfs_rq {
int throttled;
int throttle_count;
int h_kernel_running;
+ int h_user_running;
+ int throttle_pending;
struct list_head throttled_list;
struct list_head throttled_csd_list;
struct list_head kernel_children;
--
2.43.0


2024-02-02 08:17:19

by Valentin Schneider

Subject: [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants

Previous commits have added .h_kernel_running and .h_user_running to struct
cfs_rq, and are using them to play games with the hierarchical
h_nr_running.

Assert some count invariants under SCHED_DEBUG to improve debugging.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b54d3813d18d..52d0ee0e4d47c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5780,6 +5780,30 @@ static int tg_throttle_down(struct task_group *tg, void *data)
static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);

+#ifdef CONFIG_CFS_BANDWIDTH
+static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq)
+{
+ lockdep_assert_rq_held(rq_of(cfs_rq));
+
+ /*
+ * When !throttle_pending, this is the normal operating mode, all tasks
+ * are pickable, so:
+ * nr_kernel_tasks + nr_user_tasks == nr_pickable_tasks
+ */
+ SCHED_WARN_ON(!cfs_rq->throttle_pending &&
+ (cfs_rq->h_kernel_running + cfs_rq->h_user_running !=
+ cfs_rq->h_nr_running));
+ /*
+ * When throttle_pending, only kernel tasks are pickable, so:
+ * nr_kernel_tasks == nr_pickable_tasks
+ */
+ SCHED_WARN_ON(cfs_rq->throttle_pending &&
+ (cfs_rq->h_kernel_running != cfs_rq->h_nr_running));
+}
+#else
+static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq) { }
+#endif
+
static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
@@ -5894,6 +5918,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
dequeue_kernel(qcfs_rq, se, kernel_delta);
qcfs_rq->h_user_running -= user_delta;

+ assert_cfs_rq_counts(qcfs_rq);

if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
@@ -5918,6 +5943,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->idle_h_nr_running -= idle_task_delta;
dequeue_kernel(qcfs_rq, se, kernel_delta);
qcfs_rq->h_user_running -= user_delta;
+
+ assert_cfs_rq_counts(qcfs_rq);
}

/* At this point se is NULL and we are at root level*/
@@ -6013,6 +6040,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
enqueue_kernel(qcfs_rq, se, kernel_delta);
qcfs_rq->h_user_running += user_delta;

+ assert_cfs_rq_counts(qcfs_rq);
+
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
goto unthrottle_throttle;
@@ -6950,6 +6979,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;

+ assert_cfs_rq_counts(cfs_rq);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -6965,6 +6995,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se_update_runnable(se);
update_cfs_group(se);

+ assert_cfs_rq_counts(cfs_rq);

if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
cfs_rq->h_nr_running++;
@@ -6979,6 +7010,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;

+ assert_cfs_rq_counts(cfs_rq);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -7051,6 +7083,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;

+ assert_cfs_rq_counts(cfs_rq);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -7092,6 +7125,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;

+ assert_cfs_rq_counts(cfs_rq);

/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
@@ -8631,6 +8665,8 @@ static void handle_kernel_task_prev(struct task_struct *prev)

throttle_pending |= cfs_rq->throttle_pending;

+ assert_cfs_rq_counts(cfs_rq);
+
if (cfs_rq_throttled(cfs_rq))
break;
}
@@ -8648,6 +8684,8 @@ static void handle_kernel_task_prev(struct task_struct *prev)

throttle_pending |= cfs_rq->throttle_pending;

+ assert_cfs_rq_counts(cfs_rq);
+
if (cfs_rq_throttled(cfs_rq))
break;
}
--
2.43.0


2024-02-02 08:27:28

by Valentin Schneider

Subject: [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER

task_struct.in_return_to_user is currently updated via atomic operations in
schedule_usermode().

However, one can note:
o .in_return_to_user is only updated for the current task
o There are no remote (smp_processor_id() != task_cpu(p)) accesses to
.in_return_to_user

Add schedule_with_mode() to factorize schedule() with different flags to
pass down to __schedule_loop().

Add SM_USER to denote schedule() calls from return-to-userspace points.

Update .in_return_to_user from within the preemption-disabled, rq_lock-held
part of __schedule().

Suggested-by: Benjamin Segall <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/core.c | 43 ++++++++++++++++++++++++++++++++-----------
kernel/sched/fair.c | 17 ++++++++++++++++-
3 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4a0105d1eaa21..1b6f17b2150a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1544,7 +1544,7 @@ struct task_struct {
#endif

#ifdef CONFIG_CFS_BANDWIDTH
- atomic_t in_return_to_user;
+ int in_return_to_user;
#endif
/*
* New fields for task_struct should be added above here, so that
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7c028fad5a89..54e6690626b13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4531,7 +4531,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif
#ifdef CONFIG_CFS_BANDWIDTH
INIT_LIST_HEAD(&p->se.kernel_node);
- atomic_set(&p->in_return_to_user, 0);
+ p->in_return_to_user = false;
#endif

#ifdef CONFIG_SCHEDSTATS
@@ -5147,6 +5147,9 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf

static inline void finish_lock_switch(struct rq *rq)
{
+#ifdef CONFIG_CFS_BANDWIDTH
+ current->in_return_to_user = false;
+#endif
/*
* If we are tracking spinlock dependencies then we have to
* fix up the runqueue lock - which gets 'carried over' from
@@ -6562,6 +6565,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
#define SM_PREEMPT 0x1
#define SM_RTLOCK_WAIT 0x2

+/*
+ * Special case for CFS_BANDWIDTH where we need to know if the call to
+ * __schedule() is directely preceding an entry into userspace.
+ * It is removed from the mode argument as soon as it is used to not go against
+ * the SM_MASK_PREEMPT optimisation below.
+ */
+#ifdef CONFIG_CFS_BANDWIDTH
+# define SM_USER 0x4
+#else
+# define SM_USER SM_NONE
+#endif
+
#ifndef CONFIG_PREEMPT_RT
# define SM_MASK_PREEMPT (~0U)
#else
@@ -6646,6 +6661,14 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq_lock(rq, &rf);
smp_mb__after_spinlock();

+#ifdef CONFIG_CFS_BANDWIDTH
+ if (sched_mode & SM_USER) {
+ prev->in_return_to_user = true;
+ sched_mode &= ~SM_USER;
+ }
+#endif
+ SCHED_WARN_ON(sched_mode & SM_USER);
+
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
@@ -6807,7 +6830,7 @@ static __always_inline void __schedule_loop(unsigned int sched_mode)
} while (need_resched());
}

-asmlinkage __visible void __sched schedule(void)
+static __always_inline void schedule_with_mode(unsigned int sched_mode)
{
struct task_struct *tsk = current;

@@ -6817,22 +6840,20 @@ asmlinkage __visible void __sched schedule(void)

if (!task_is_running(tsk))
sched_submit_work(tsk);
- __schedule_loop(SM_NONE);
+ __schedule_loop(sched_mode);
sched_update_worker(tsk);
}
+
+asmlinkage __visible void __sched schedule(void)
+{
+ schedule_with_mode(SM_NONE);
+}
EXPORT_SYMBOL(schedule);

asmlinkage __visible void __sched schedule_usermode(void)
{
#ifdef CONFIG_CFS_BANDWIDTH
- /*
- * This is only atomic because of this simple implementation. We could
- * do something with an SM_USER to avoid other-cpu scheduler operations
- * racing against these writes.
- */
- atomic_set(&current->in_return_to_user, true);
- schedule();
- atomic_set(&current->in_return_to_user, false);
+ schedule_with_mode(SM_USER);
#else
schedule();
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1808459a5acc..96504be6ee14a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6631,7 +6631,22 @@ static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int c

static bool is_kernel_task(struct task_struct *p)
{
- return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_return_to_user);
+ /*
+ * The flag is updated within __schedule() with preemption disabled,
+ * under the rq lock, and only when the task is current.
+ *
+ * Holding the rq lock for that task's CPU is thus sufficient for the
+ * value to be stable, if the task is enqueued.
+ *
+ * If the task is dequeued, then task_cpu(p) *can* change, but this
+ * so far only happens in enqueue_task_fair() which means either:
+ * - the task is being activated, its CPU has been set previously in ttwu()
+ * - the task is going through a "change" cycle (e.g. sched_move_task()),
+ * the pi_lock is also held so the CPU is stable.
+ */
+ lockdep_assert_rq_held(cpu_rq(task_cpu(p)));
+
+ return sysctl_sched_cfs_bandwidth_kernel_bypass && !p->in_return_to_user;
}

/*
--
2.43.0


2024-02-06 21:55:15

by Benjamin Segall

Subject: Re: [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry

Valentin Schneider <[email protected]> writes:


> Proposed approach
> =================
>
> Peter mentioned [1] that there have been discussions on changing /when/ the
> throttling happens: rather than have it be done immediately upon updating
> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
> for the task to be about to return to userspace: if it's in userspace, it can't
> hold any in-kernel lock.
>
> I submitted an initial jab at this [2] and Ben Segall added his own version to
> the conversation [3]. This series contains Ben's patch plus my additions. The
> main change here is updating the .h_nr_running counts throughout the cfs_rq
> hierachies to improve the picture given to load_balance().
>
> The main thing that remains doing for this series is making the second cfs_rq
> tree an actual RB tree (it's just a plain list ATM).
>
> This also doesn't touch rq.nr_running yet, I'm not entirely sure whether we want
> to expose this outside of CFS, but it is another field that's used by load balance.

Then there's also all the load values as well; I don't know the load
balance code well, but it looks like the main thing would be
runnable_avg and that it isn't doing anything that would particularly
care about h_nr_running and runnable_avg being out of sync.

Maybe pulling a pending-throttle user task and then not seeing the
update in h_nr_running could be a bit of trouble?

2024-02-07 13:35:08

by Valentin Schneider

Subject: Re: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead

On 06/02/24 13:36, Benjamin Segall wrote:
> Valentin Schneider <[email protected]> writes:
>
>> cfs_rq_throttled_loose() does not check if there is runtime remaining in
>> the cfs_b, and thus relies on check_cfs_rq_runtime() being ran previously
>> for that to be checked.
>>
>> Cache the throttle attempt in throttle_cfs_rq and reuse that where
>> needed.
>
> The general idea of throttle_pending rather than constantly checking
> runtime_remaining seems reasonable...
>
>>
>> Signed-off-by: Valentin Schneider <[email protected]>
>> ---
>> kernel/sched/fair.c | 44 ++++++++++----------------------------------
>> 1 file changed, 10 insertions(+), 34 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 96504be6ee14a..60778afbff207 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> * 5) do not run the "skip" process, if something else is available
>> */
>> static struct sched_entity *
>> -pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
>> +pick_next_entity(struct cfs_rq *cfs_rq)
>> {
>> #ifdef CONFIG_CFS_BANDWIDTH
>> /*
>> @@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
>> * throttle_cfs_rq.
>> */
>> WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
>> - if (throttled && !list_empty(&cfs_rq->kernel_children)) {
>> + if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) {
>
> ... but we still need to know here if any of our parents are throttled
> as well, ie a "throttled_pending_count", or to keep the "throttled"
> parameter tracking in pnt_fair. (ie just replace the implementation of
> cfs_rq_throttled_loose).
>

Hm, good point. We should be good with reinstating the throttled parameter
and feeding it a ->throttle_pending accumulator.

>> /*
>> * TODO: you'd want to factor out pick_eevdf to just take
>> * tasks_timeline, and replace this list with a second rbtree
>> @@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> * We don't actually throttle, though account() will have made sure to
>> * resched us so that we pick into a kernel task.
>> */
>> - if (cfs_rq->h_kernel_running)
>> + if (cfs_rq->h_kernel_running) {
>> + cfs_rq->throttle_pending = true;
>> return false;
>> + }
>> +
>> + cfs_rq->throttle_pending = false;
>
> We also need to clear throttle_pending if quota refills and our
> runtime_remaining goes positive. (And do the appropriate h_* accounting in
> patch 4/5)

Right, so we could move the throttle_pending logic to after
__assign_cfs_rq_runtime(), and then modify distribute_cfs_runtime() to
catch the !throttled but throttle_pending case.


2024-02-07 13:35:48

by Valentin Schneider

Subject: Re: [RFC PATCH v2 0/5] sched/fair: Defer CFS throttle to user entry

On 06/02/24 13:55, Benjamin Segall wrote:
> Valentin Schneider <[email protected]> writes:
>
>
>> Proposed approach
>> =================
>>
>> Peter mentioned [1] that there have been discussions on changing /when/ the
>> throttling happens: rather than have it be done immediately upon updating
>> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
>> for the task to be about to return to userspace: if it's in userspace, it can't
>> hold any in-kernel lock.
>>
>> I submitted an initial jab at this [2] and Ben Segall added his own version to
>> the conversation [3]. This series contains Ben's patch plus my additions. The
>> main change here is updating the .h_nr_running counts throughout the cfs_rq
>> hierachies to improve the picture given to load_balance().
>>
>> The main thing that remains doing for this series is making the second cfs_rq
>> tree an actual RB tree (it's just a plain list ATM).
>>
>> This also doesn't touch rq.nr_running yet, I'm not entirely sure whether we want
>> to expose this outside of CFS, but it is another field that's used by load balance.
>
> Then there's also all the load values as well; I don't know the load
> balance code well, but it looks like the main thing would be
> runnable_avg and that it isn't doing anything that would particularly
> care about h_nr_running and runnable_avg being out of sync.
>

Yes, all of the runnable, load and util averages are still going to be an
issue unfortunately. AFAICT tackling this would imply pretty much dequeuing
the throttle_pending user tasks, which was my earlier attempt.

> Maybe pulling a pending-throttle user task and then not seeing the
> update in h_nr_running could be a bit of trouble?

That too is something I hadn't considered. Given the h_nr_running count is
updated accordingly, we could change can_migrate_task() to only allow
kernel tasks to be pulled if the hierarchy is ->throttle_pending. That
would probably require implementing a throttle_pending_count (as you
suggested in the other email) so we don't waste too much time checking up
the hierarchy for every task.