2023-12-20 00:19:13

by John Stultz

Subject: [PATCH v7 00/23] Proxy Execution: A generalized form of Priority Inheritance v7

Looking for patches to review before closing out the year?
Have I got something for you! :)

Since sending v6 out and after my presentation at Linux
Plumbers[1], I got some nice feedback from a number of
developers. I particularly want to thank Metin Kaya, who provided
a ton of detailed off-list review comments, questions and cleanup
suggestions. Thanks also to Xuewen Yan, who pointed out an issue
with placement in my return migration logic that motivated me to
rework return migration back into the ttwu path. This helped to
resolve the performance regression I had been seeing from the
v4->v5 rework!

The other focus of this revision has been to properly
conditionalize and re-integrate the chain-migration feature that
Connor had implemented to address concerns about preserving the
RT load-balancing invariant (always running the top N priority
tasks across N available cpus) when we're migrating tasks around
for proxy-execution. I've done a fair amount of rework on these
patches, but they aren't quite where I'd like them yet, so
consider the versions included here a preview.

Validating this invariant isn't trivial. Long ago I wrote a
userland test case, sched_football (now in LTP), to check it:
https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c

So I've reimplemented this idea as an in-kernel test case,
extended to have lock chains across priority levels. The good
news is that the test case does not show any issues with the RT
load-balancing invariant when the chain-migration feature is
added, but I'm not actually seeing any issues without
chain-migration either, so I need to further extend the test to
manufacture the specific type of invariant breakage I suspect we
don't handle properly, i.e.:
CPU0: P99, P98(boP2), P2
CPU1: P50

Which chain-migration should adjust to become:
CPU0: P99
CPU1: P98(boP2), P50, P2

On the stability front, the series has continued to fare much
better than the pre-v6 patches. The only stability issue I had
seen with v6 (workqueue lockups when stressing a 64-core qemu
environment with KVM disabled) has so far not reproduced against
6.7-rc. At Plumbers, Thomas mentioned there had been a workqueue
issue in 6.6 that has since been fixed, so I'm optimistic that is
what I was tripping on. If you run into any stability issues in
testing, please do let me know.

This patch series is actually coarser than what I've been
developing with, as my working tree contains a number of small
"test" steps that help validate a behavior change and are then
replaced by the real logic afterwards. Including those here would
just create more work for reviewers, so I've folded them
together, but if you're interested you can find the fine-grained
tree here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v7-6.7-rc6-fine-grained

Credit/Disclaimer:
--------------------
As mentioned previously, this Proxy Execution series has a long
history:

First described in a paper[2] by Watkins, Straub, and Niehaus,
it was then prototyped in patches from Peter Zijlstra and
extended with lots of work by Juri Lelli, Valentin Schneider, and
Connor O'Brien. (And thank you to Steven Rostedt for providing
additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Overview:
-----------
Proxy Execution is a generalized form of priority inheritance.
Classic priority inheritance works well for real-time tasks where
there is a straightforward priority order to how things are run.
But it breaks down when used between CFS or DEADLINE tasks, as
there are lots of parameters involved outside of just the task’s
nice value when selecting the next task to run (via
pick_next_task()). So ideally we want to imbue the mutex holder
with all the scheduler attributes of the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked
on, and which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so
that we instead run the mutex holder.

By leaving blocked tasks on the runqueue, we allow
pick_next_task() to choose the task that should run next (even if
it’s blocked waiting on a mutex). If we do select a blocked task,
we look at the task’s blocked_on mutex and from there look at the
mutex’s owner task. And in the simple case, the task which owns
the mutex is what we then choose to run, allowing it to release
the mutex.
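
To make the simple case concrete, here is a highly simplified
sketch of the idea. The types and names below are illustrative
stand-ins rather than the actual structures and helpers from the
series, and all of the locking, migration and edge-case handling
discussed later is ignored:

  /*
   * Toy model of "pick the blocked task, run its mutex owner".
   * Illustrative only: these are not the kernel's types or helpers.
   */
  struct mock_task;

  struct mock_mutex {
          struct mock_task *owner;
  };

  struct mock_task {
          struct mock_mutex *blocked_on;  /* NULL when runnable */
  };

  /* Follow the blocked_on chain from the picked task to its owner. */
  static struct mock_task *proxy_of(struct mock_task *picked)
  {
          struct mock_task *owner = picked;

          while (owner->blocked_on)
                  owner = owner->blocked_on->owner;

          return owner;   /* run this task on behalf of 'picked' */
  }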

This means that instead of just tracking “curr”, the scheduler
needs to track both the scheduler context (what was picked and
all the state used for scheduling decisions), and the execution
context (what we’re actually running).

In this way, the mutex owner is run “on behalf” of the blocked
task that was picked to run, essentially inheriting the scheduler
context of the waiting blocked task.
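
As a rough mental model only (the field names below are my own
shorthand, not the actual rq members from the series, which are
accessed via the rq_selected() helper discussed further down):

  struct rq_model {
          struct task_struct *curr;     /* execution context: what is running */
          struct task_struct *selected; /* scheduler context: what was picked */
  };

Pick and scheduling decisions are made against the selected task,
while the CPU actually executes curr.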

As Connor outlined in a previous submission of this patch series,
this raises a number of complicated situations: The mutex owner
might itself be blocked on another mutex, or it could be
sleeping, running on a different CPU, in the process of migrating
between CPUs, etc.

The complexity of these edge cases is imposing, but currently
in Android we have a large number of cases where we are seeing
priority inversion (not unbounded, but much longer than we’d
like) between “foreground” and “background” SCHED_NORMAL
applications. As a result, currently we cannot usefully limit
background activity without introducing inconsistent behavior. So
Proxy Execution is a very attractive approach to resolving this
critical issue.

New in v7:
---------
* Extended blocked_on state tracking to use a tri-state so we
can avoid running tasks before they are return-migrated.

* Switched return migration back to the ttwu path to avoid extra
lock handling and resolve performance regression seen since v5

* _Tons_ of typos, cleanups, runtime conditionalization
improvements, build fixes for different config options, and
clarifications suggested by Metin Kaya. *Many* *many* thanks
for all the review and help here!

* Split up and reworked Connor’s chain-migration patches to be
properly conditionalized on the config value, and hopefully a
little easier to review.

* Added first stab at RT load-balancing invariant test
(sched_football)

* Fixed wake_cpu not being preserved during
activate_blocked_entities

* Fix build warnings Reported-by: kernel test robot <[email protected]>

* Other minor cleanups

Performance:
-----------
v7 of the series improves over v6 and v5 by moving proxy return
migration back into the ttwu path, which avoids a lot of extra
locking. This gets performance back to where we were in v4.

K Prateek Nayak did some benchmarking and performance analysis
with the v6 series. While he found "little to no difference" with
a number of benchmarks (schbench, tbench, netperf, ycsb-mongodb,
DeathStarBench) when running with proxy-execution[3], he did find
a large regression with the "perf bench sched messaging" test.

I've reproduced this issue, and the majority of the regression
seems to come from the fact that this patch series switches
mutexes to use handoff mode rather than optimistic spinning. This
has been a concern for cases where locks are under high
contention, so I need to spend some time finding a solution that
restores some ability to spin optimistically. Many thanks to
Prateek for raising this issue!

Previously Chenyu also reported a regression[4], which seems
similar, but I need to look into it further.

Issues still to address:
-----------
* The chain migration patches still need further iterations and
better validation to ensure they preserve the RT/DL load
balancing invariants.

* Xuewen Yan earlier pointed out that we may see task
mis-placement on EAS systems if we do return migration based
strictly on cpu allowability. I tried an optimization to
always try to return migrate to the wake_cpu (which was saved
on proxy-migration), but this seemed to undo a chunk of the
benefit I saw in moving return migration back to ttwu, at
least with my prio-inversion-demo microbenchmark. Need to do
some broader performance analysis with the variations there.

* Optimization to avoid migrating blocked tasks (to preserve
optimistic mutex spinning) if the runnable lock-owner at the
end of the blocked_on chain is already running (though this is
difficult due to the limitations from locking rules needed to
traverse the blocked on chain across run queues).

* Similarly, since we're often dealing with lists of tasks or
chains of tasks and mutexes, iterating across these chains of
objects can be done safely while holding the rq lock, but as
these chains can cross runqueues our ability to traverse them
safely is somewhat limited.

* CFS load balancing. Blocked tasks may carry forward load
(PELT) to the lock owner's CPU, so the CPU may look like it is
overloaded.

* The sleeping owner handling (where we deactivate waiting tasks
and enqueue them onto a list, then reactivate them when the
owner wakes up) doesn’t feel great. This is in part because
when we want to activate tasks, we’re already holding a
task.pi_lock and a rq_lock, just not the locks for the task
we’re activating, nor the rq we’re enqueuing it onto. So there
has to be a bit of lock juggling to drop and acquire the right
locks (in the right order). It feels like there’s got to be a
better way.

* “rq_selected()” naming. Peter doesn’t like it, but I’ve not
thought of a better name. Open to suggestions.

* As discussed at OSPM[5], I'd like to split pick_next_task() up
into two phases, selecting and setting the next task, as
pick_next_task() currently assumes the returned task will be run,
which results in various side effects in sched class logic when
it's called. I took a pass at this earlier, but it's hairy and
lower on the priority list for now.


If folks find it easier to test/tinker with, this patch series
can also be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v7-6.7-rc6

Feedback would be very welcome!

Thanks so much!
-john

[1] https://lwn.net/Articles/953438/
[2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
[5] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)


Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]

Connor O'Brien (5):
sched: Add push_task_chain helper
sched: Consolidate pick_*_task to task_is_pushable helper
sched: Push execution and scheduler context split into deadline and rt
paths
sched: Add find_exec_ctx helper
sched: Fix rt/dl load balancing via chain level balance

John Stultz (8):
locking/mutex: Remove wakeups from under mutex::wait_lock
sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
sched: Fix runtime accounting w/ split exec & sched contexts
sched: Split out __sched() deactivate task logic into a helper
sched: Add a initial sketch of the find_proxy_task() function
sched: Handle blocked-waiter migration (and return migration)
sched: Initial sched_football test implementation
sched: Refactor dl/rt find_lowest/latest_rq logic

Juri Lelli (2):
locking/mutex: Make mutex::wait_lock irq safe
locking/mutex: Expose __mutex_owner()

Peter Zijlstra (7):
sched: Unify runtime accounting across classes
locking/mutex: Rework task_struct::blocked_on
locking/mutex: Switch to mutex handoffs for CONFIG_SCHED_PROXY_EXEC
sched: Split scheduler and execution contexts
sched: Start blocked_on chain processing in find_proxy_task()
sched: Add blocked_donor link to task for smarter mutex handoffs
sched: Add deactivated (sleeping) owner handling to find_proxy_task()

Valentin Schneider (1):
sched: Fix proxy/current (push,pull)ability

.../admin-guide/kernel-parameters.txt | 5 +
Documentation/locking/mutex-design.rst | 3 +
include/linux/sched.h | 59 +-
init/Kconfig | 7 +
init/init_task.c | 1 +
kernel/fork.c | 9 +-
kernel/locking/mutex-debug.c | 9 +-
kernel/locking/mutex.c | 166 ++--
kernel/locking/mutex.h | 25 +
kernel/locking/rtmutex.c | 26 +-
kernel/locking/rwbase_rt.c | 4 +-
kernel/locking/rwsem.c | 4 +-
kernel/locking/spinlock_rt.c | 3 +-
kernel/locking/ww_mutex.h | 73 +-
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 822 ++++++++++++++++--
kernel/sched/cpudeadline.c | 12 +-
kernel/sched/cpudeadline.h | 3 +-
kernel/sched/cpupri.c | 31 +-
kernel/sched/cpupri.h | 6 +-
kernel/sched/deadline.c | 218 +++--
kernel/sched/fair.c | 102 ++-
kernel/sched/rt.c | 298 +++++--
kernel/sched/sched.h | 105 ++-
kernel/sched/stop_task.c | 13 +-
kernel/sched/test_sched_football.c | 242 ++++++
lib/Kconfig.debug | 14 +
27 files changed, 1861 insertions(+), 400 deletions(-)
create mode 100644 kernel/sched/test_sched_football.c

--
2.43.0.472.g3155946c3a-goog



2023-12-20 00:19:23

by John Stultz

Subject: [PATCH v7 01/23] sched: Unify runtime accounting across classes

From: Peter Zijlstra <[email protected]>

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[fix conflicts, fold in update_current_exec_runtime]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: rebased, resolving minor conflicts]
Signed-off-by: John Stultz <[email protected]>
---
v7:
* Minor typo fixup suggested by Metin Kaya
---
include/linux/sched.h | 2 +-
kernel/sched/deadline.c | 13 +++-------
kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++----------
kernel/sched/rt.c | 13 +++-------
kernel/sched/sched.h | 12 ++-------
kernel/sched/stop_task.c | 13 +---------
6 files changed, 52 insertions(+), 57 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..1e80c330f755 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -523,7 +523,7 @@ struct sched_statistics {
u64 block_max;
s64 sum_block_runtime;

- u64 exec_max;
+ s64 exec_max;
u64 slice_max;

u64 nr_migrations_cold;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b28114478b82..6140f1f51da1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1275,9 +1275,8 @@ static void update_curr_dl(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_dl_entity *dl_se = &curr->dl;
- u64 delta_exec, scaled_delta_exec;
+ s64 delta_exec, scaled_delta_exec;
int cpu = cpu_of(rq);
- u64 now;

if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
@@ -1290,21 +1289,15 @@ static void update_curr_dl(struct rq *rq)
* natural solution, but the full ramifications of this
* approach need further study.
*/
- now = rq_clock_task(rq);
- delta_exec = now - curr->se.exec_start;
- if (unlikely((s64)delta_exec <= 0)) {
+ delta_exec = update_curr_common(rq);
+ if (unlikely(delta_exec <= 0)) {
if (unlikely(dl_se->dl_yielded))
goto throttle;
return;
}

- schedstat_set(curr->stats.exec_max,
- max(curr->stats.exec_max, delta_exec));
-
trace_sched_stat_runtime(curr, delta_exec, 0);

- update_current_exec_runtime(curr, now, delta_exec);
-
if (dl_entity_is_special(dl_se))
return;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..1251fd01a555 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1129,23 +1129,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_SMP */

-/*
- * Update the current task's runtime statistics.
- */
-static void update_curr(struct cfs_rq *cfs_rq)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
{
- struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_clock_task(rq_of(cfs_rq));
- u64 delta_exec;
-
- if (unlikely(!curr))
- return;
+ u64 now = rq_clock_task(rq);
+ s64 delta_exec;

delta_exec = now - curr->exec_start;
- if (unlikely((s64)delta_exec <= 0))
- return;
+ if (unlikely(delta_exec <= 0))
+ return delta_exec;

curr->exec_start = now;
+ curr->sum_exec_runtime += delta_exec;

if (schedstat_enabled()) {
struct sched_statistics *stats;
@@ -1155,9 +1149,43 @@ static void update_curr(struct cfs_rq *cfs_rq)
max(delta_exec, stats->exec_max));
}

- curr->sum_exec_runtime += delta_exec;
- schedstat_add(cfs_rq->exec_clock, delta_exec);
+ return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ s64 delta_exec;

+ delta_exec = update_curr_se(rq, &curr->se);
+ if (unlikely(delta_exec <= 0))
+ return delta_exec;
+
+ account_group_exec_runtime(curr, delta_exec);
+ cgroup_account_cputime(curr, delta_exec);
+
+ return delta_exec;
+}
+
+/*
+ * Update the current task's runtime statistics.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ s64 delta_exec;
+
+ if (unlikely(!curr))
+ return;
+
+ delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+ if (unlikely(delta_exec <= 0))
+ return;
+
+ schedstat_add(cfs_rq->exec_clock, delta_exec);
curr->vruntime += calc_delta_fair(delta_exec, curr);
update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6aaf0a3d6081..9cdea3ea47da 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1002,24 +1002,17 @@ static void update_curr_rt(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_rt_entity *rt_se = &curr->rt;
- u64 delta_exec;
- u64 now;
+ s64 delta_exec;

if (curr->sched_class != &rt_sched_class)
return;

- now = rq_clock_task(rq);
- delta_exec = now - curr->se.exec_start;
- if (unlikely((s64)delta_exec <= 0))
+ delta_exec = update_curr_common(rq);
+ if (unlikely(delta_exec < 0))
return;

- schedstat_set(curr->stats.exec_max,
- max(curr->stats.exec_max, delta_exec));
-
trace_sched_stat_runtime(curr, delta_exec, 0);

- update_current_exec_runtime(curr, now, delta_exec);
-
if (!rt_bandwidth_enabled())
return;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..3e0e4fc8734b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2212,6 +2212,8 @@ struct affinity_context {
unsigned int flags;
};

+extern s64 update_curr_common(struct rq *rq);
+
struct sched_class {

#ifdef CONFIG_UCLAMP_TASK
@@ -3261,16 +3263,6 @@ extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
#endif

-static inline void update_current_exec_runtime(struct task_struct *curr,
- u64 now, u64 delta_exec)
-{
- curr->se.sum_exec_runtime += delta_exec;
- account_group_exec_runtime(curr, delta_exec);
-
- curr->se.exec_start = now;
- cgroup_account_cputime(curr, delta_exec);
-}
-
#ifdef CONFIG_SCHED_MM_CID

#define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 6cf7304e6449..b1b8fe61c532 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)

static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
{
- struct task_struct *curr = rq->curr;
- u64 now, delta_exec;
-
- now = rq_clock_task(rq);
- delta_exec = now - curr->se.exec_start;
- if (unlikely((s64)delta_exec < 0))
- delta_exec = 0;
-
- schedstat_set(curr->stats.exec_max,
- max(curr->stats.exec_max, delta_exec));
-
- update_current_exec_runtime(curr, now, delta_exec);
+ update_curr_common(rq);
}

/*
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:19:54

by John Stultz

Subject: [PATCH v7 03/23] locking/mutex: Make mutex::wait_lock irq safe

From: Juri Lelli <[email protected]>

mutex::wait_lock might be nested under rq->lock.

Make it irq safe then.
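
Since rq->lock is acquired with interrupts disabled, a lock that
can nest inside it needs to be irq safe as well. The wait_lock
users are therefore converted to the irqsave/irqrestore variants,
roughly following this pattern (full details in the diff below):

  unsigned long flags;

  raw_spin_lock_irqsave(&lock->wait_lock, flags);
  /* ... wait-list manipulation ... */
  raw_spin_unlock_irqrestore(&lock->wait_lock, flags);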

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
Signed-off-by: Connor O'Brien <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
v3:
* Re-added this patch after it was dropped in v2 which
caused lockdep warnings to trip.
v7:
* Fix function definition for PREEMPT_RT case, as pointed out
by Metin Kaya.
* Fix incorrect flags handling in PREEMPT_RT case as found by
Metin Kaya
---
kernel/locking/mutex.c | 18 ++++++++++--------
kernel/locking/ww_mutex.h | 22 +++++++++++-----------
2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 8337ed0dbf81..73d98dd23eec 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -573,6 +573,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
+ unsigned long flags;
int ret;

if (!use_ww_ctx)
@@ -615,7 +616,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;
}

- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -676,7 +677,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err;
}

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* Make sure we do wakeups before calling schedule */
if (!wake_q_empty(&wake_q)) {
wake_up_q(&wake_q);
@@ -702,9 +703,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
trace_contention_begin(lock, LCB_F_MUTEX);
}

- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
__set_current_state(TASK_RUNNING);

@@ -730,7 +731,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
if (ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
return 0;
@@ -740,7 +741,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
trace_contention_end(lock, ret);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
wake_up_q(&wake_q);
@@ -911,6 +912,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
struct task_struct *next = NULL;
DEFINE_WAKE_Q(wake_q);
unsigned long owner;
+ unsigned long flags;

mutex_release(&lock->dep_map, ip);

@@ -938,7 +940,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}

preempt_disable();
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
@@ -955,7 +957,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 7189c6631d90..9facc0ddfdd3 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
}

-static inline void lock_wait_lock(struct mutex *lock)
+static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, *flags);
}

-static inline void unlock_wait_lock(struct mutex *lock)
+static inline void unlock_wait_lock(struct mutex *lock, unsigned long *flags)
{
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
}

static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
@@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
return rt_mutex_has_waiters(&lock->rtmutex);
}

-static inline void lock_wait_lock(struct rt_mutex *lock)
+static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->rtmutex.wait_lock);
+ raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
}

-static inline void unlock_wait_lock(struct rt_mutex *lock)
+static inline void unlock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{
- raw_spin_unlock(&lock->rtmutex.wait_lock);
+ raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, *flags);
}

static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
@@ -380,6 +380,7 @@ static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
DEFINE_WAKE_Q(wake_q);
+ unsigned long flags;

ww_mutex_lock_acquired(lock, ctx);

@@ -408,10 +409,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* Uh oh, we raced in fastpath, check if any of the waiters need to
* die or wound us.
*/
- lock_wait_lock(&lock->base);
+ lock_wait_lock(&lock->base, &flags);
__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
- unlock_wait_lock(&lock->base);
-
+ unlock_wait_lock(&lock->base, &flags);
wake_up_q(&wake_q);
}

--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:20:28

by John Stultz

Subject: [PATCH v7 06/23] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable

Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
sched_proxy_exec= that can be used to disable the feature at boot
time if CONFIG_SCHED_PROXY_EXEC was enabled.
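
For example, on a kernel built with CONFIG_SCHED_PROXY_EXEC=y
(where the feature defaults to enabled via the static key below),
it can be turned off on the kernel command line with:

  sched_proxy_exec=0

and explicitly turned back on with sched_proxy_exec=1.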

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
v7:
* Switch to CONFIG_SCHED_PROXY_EXEC/sched_proxy_exec= as
suggested by Metin Kaya.
* Switch boot arg from =disable/enable to use kstrtobool(),
which supports =yes|no|1|0|true|false|on|off, as also
suggested by Metin Kaya, and print a message when a boot
argument is used.
---
.../admin-guide/kernel-parameters.txt | 5 ++++
include/linux/sched.h | 13 +++++++++
init/Kconfig | 7 +++++
kernel/sched/core.c | 29 +++++++++++++++++++
4 files changed, 54 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 65731b060e3f..cc64393b913f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5714,6 +5714,11 @@
sa1100ir [NET]
See drivers/net/irda/sa1100_ir.c.

+ sched_proxy_exec= [KNL]
+ Enables or disables "proxy execution" style
+ solution to mutex based priority inversion.
+ Format: <bool>
+
sched_verbose [KNL] Enables verbose scheduler debug messages.

schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bfe8670f99a1..880af1c3097d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1566,6 +1566,19 @@ struct task_struct {
*/
};

+#ifdef CONFIG_SCHED_PROXY_EXEC
+DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static inline bool sched_proxy_exec(void)
+{
+ return static_branch_likely(&__sched_proxy_exec);
+}
+#else
+static inline bool sched_proxy_exec(void)
+{
+ return false;
+}
+#endif
+
static inline struct pid *task_pid(struct task_struct *task)
{
return task->thread_pid;
diff --git a/init/Kconfig b/init/Kconfig
index 9ffb103fc927..c5a759b6366a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -908,6 +908,13 @@ config NUMA_BALANCING_DEFAULT_ENABLED
If set, automatic NUMA balancing will be enabled if running on a NUMA
machine.

+config SCHED_PROXY_EXEC
+ bool "Proxy Execution"
+ default n
+ help
+ This option enables proxy execution, a mechanism for mutex owning
+ tasks to inherit the scheduling context of higher priority waiters.
+
menuconfig CGROUPS
bool "Control Group support"
select KERNFS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4e46189d545d..e06558fb08aa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -117,6 +117,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

+#ifdef CONFIG_SCHED_PROXY_EXEC
+DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static int __init setup_proxy_exec(char *str)
+{
+ bool proxy_enable;
+
+ if (kstrtobool(str, &proxy_enable)) {
+ pr_warn("Unable to parse sched_proxy_exec=\n");
+ return 0;
+ }
+
+ if (proxy_enable) {
+ pr_info("sched_proxy_exec enabled via boot arg\n");
+ static_branch_enable(&__sched_proxy_exec);
+ } else {
+ pr_info("sched_proxy_exec disabled via boot arg\n");
+ static_branch_disable(&__sched_proxy_exec);
+ }
+ return 1;
+}
+#else
+static int __init setup_proxy_exec(char *str)
+{
+ pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so it cannot be enabled or disabled at boottime\n");
+ return 0;
+}
+#endif
+__setup("sched_proxy_exec=", setup_proxy_exec);
+
#ifdef CONFIG_SCHED_DEBUG
/*
* Debugging: various feature bits
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:20:28

by John Stultz

Subject: [PATCH v7 05/23] locking/mutex: Rework task_struct::blocked_on

From: Peter Zijlstra <[email protected]>

Track the blocked-on relation for mutexes, to allow following this
relation at schedule time.

task
| blocked-on
v
mutex
| owner
v
task

Also adds a blocked_on_state value so we can distinguish when a
task is blocked_on a mutex, but is either blocked, waking up, or
runnable (such that it can try to acquire the lock it is blocked
on).

This avoids some of the subtle & racy games where the blocked_on
state gets cleared, only to have it re-added by the
mutex_lock_slowpath call when it tries to acquire the lock on
wakeup.

Also adds blocked_lock to the task_struct so we can safely
serialize the blocked-on state.

Finally adds wrappers that are useful to provide correctness
checks. Folded in from a patch by:
Valentin Schneider <[email protected]>

This all will be used for tracking blocked-task/mutex chains
with the proxy-execution patches, in a similar fashion to how
priority inheritance is done with rt_mutexes.
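
Roughly, the intended blocked_on_state flow (a reader's summary
of the changes below, not a guarantee of every transition) is:

  /*
   * BO_RUNNABLE --(blocks on a mutex in __mutex_lock_common)--> BO_BLOCKED
   * BO_BLOCKED  --(owner unlocks, or ww die/wound wakeup)-----> BO_WAKING
   * BO_WAKING   --(try_to_wake_up() completes the wakeup)-----> BO_RUNNABLE
   *
   * If the woken waiter fails to acquire the lock, it is set
   * back to BO_BLOCKED before it sleeps again.
   */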

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[minor changes while rebasing]
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
This is critical so we don't get circular blocked_on relationships
that can't be resolved.
v5:
* Fix potential bug where the skip_wait path might clear blocked_on
when that path never set it
* Slight tweaks to where we set blocked_on to make it consistent,
along with extra WARN_ON correctness checking
* Minor comment changes
v7:
* Minor commit message change suggested by Metin Kaya
* Fix WARN_ON conditionals in unlock path (as blocked_on might
already be cleared), found while looking at issue Metin Kaya
raised.
* Minor tweaks to be consistent in what we do under the
blocked_on lock, also tweaked variable name to avoid confusion
with label, and comment typos, as suggested by Metin Kaya
* Minor tweak for CONFIG_SCHED_PROXY_EXEC name change
* Moved unused block of code to later in the series, as suggested
by Metin Kaya
* Switch to a tri-state to be able to distinguish from waking and
runnable so we can later safely do return migration from ttwu
* Folded together with related blocked_on changes
---
include/linux/sched.h | 40 ++++++++++++++++++++++++++++++++----
init/init_task.c | 1 +
kernel/fork.c | 4 ++--
kernel/locking/mutex-debug.c | 9 ++++----
kernel/locking/mutex.c | 35 +++++++++++++++++++++++++++----
kernel/locking/ww_mutex.h | 24 ++++++++++++++++++++--
kernel/sched/core.c | 6 ++++++
7 files changed, 103 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e80c330f755..bfe8670f99a1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -743,6 +743,12 @@ struct kmap_ctrl {
#endif
};

+enum blocked_on_state {
+ BO_RUNNABLE,
+ BO_BLOCKED,
+ BO_WAKING,
+};
+
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -1149,10 +1155,9 @@ struct task_struct {
struct rt_mutex_waiter *pi_blocked_on;
#endif

-#ifdef CONFIG_DEBUG_MUTEXES
- /* Mutex deadlock detection: */
- struct mutex_waiter *blocked_on;
-#endif
+ enum blocked_on_state blocked_on_state;
+ struct mutex *blocked_on; /* lock we're blocked on */
+ raw_spinlock_t blocked_lock;

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
int non_block_count;
@@ -2258,6 +2263,33 @@ static inline int rwlock_needbreak(rwlock_t *lock)
#endif
}

+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+ lockdep_assert_held(&p->blocked_lock);
+
+ /*
+ * Check we are clearing values to NULL or setting NULL
+ * to values to ensure we don't overwrite exisiting mutex
+ * values or clear already cleared values
+ */
+ WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
+
+ p->blocked_on = m;
+ p->blocked_on_state = m ? BO_BLOCKED : BO_RUNNABLE;
+}
+
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+ lockdep_assert_held(&p->blocked_lock);
+
+ return p->blocked_on;
+}
+
+static inline struct mutex *get_task_blocked_on_once(struct task_struct *p)
+{
+ return READ_ONCE(p->blocked_on);
+}
+
static __always_inline bool need_resched(void)
{
return unlikely(tif_need_resched());
diff --git a/init/init_task.c b/init/init_task.c
index 5727d42149c3..0c31d7d7c7a9 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -131,6 +131,7 @@ struct task_struct init_task
.journal_info = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+ .blocked_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
.timer_slack_ns = 50000, /* 50 usec default slack */
.thread_pid = &init_struct_pid,
.thread_node = LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..b3ba3d22d8b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2358,6 +2358,7 @@ __latent_entropy struct task_struct *copy_process(
ftrace_graph_init_task(p);

rt_mutex_init_task(p);
+ raw_spin_lock_init(&p->blocked_lock);

lockdep_assert_irqs_enabled();
#ifdef CONFIG_PROVE_LOCKING
@@ -2456,9 +2457,8 @@ __latent_entropy struct task_struct *copy_process(
lockdep_init_task(p);
#endif

-#ifdef CONFIG_DEBUG_MUTEXES
+ p->blocked_on_state = BO_RUNNABLE;
p->blocked_on = NULL; /* not blocked yet */
-#endif
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
p->sequential_io_avg = 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index bc8abb8549d2..1eedf7c60c00 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -52,17 +52,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
{
lockdep_assert_held(&lock->wait_lock);

- /* Mark the current thread as blocked on the lock: */
- task->blocked_on = waiter;
+ /* Current thread can't be already blocked (since it's executing!) */
+ DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
}

void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
{
+ struct mutex *blocked_on = get_task_blocked_on_once(task);
+
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
- DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
- task->blocked_on = NULL;
+ DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);

INIT_LIST_HEAD(&waiter->list);
waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 543774506fdb..6084470773f6 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -592,6 +592,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}

raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -622,6 +623,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err_early_kill;
}

+ set_task_blocked_on(current, lock);
set_current_state(state);
trace_contention_begin(lock, LCB_F_MUTEX);
for (;;) {
@@ -652,6 +654,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err;
}

+ raw_spin_unlock(&current->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* Make sure we do wakeups before calling schedule */
if (!wake_q_empty(&wake_q)) {
@@ -662,6 +665,13 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas

first = __mutex_waiter_is_first(lock, &waiter);

+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
+
+ /*
+ * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
+ */
+ current->blocked_on_state = BO_BLOCKED;
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -672,16 +682,25 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
break;

if (first) {
+ bool opt_acquired;
+
+ /*
+ * mutex_optimistic_spin() can schedule, so we need to
+ * release these locks before calling it.
+ */
+ raw_spin_unlock(&current->blocked_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
- if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+ opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
+ if (opt_acquired)
break;
trace_contention_begin(lock, LCB_F_MUTEX);
}
-
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
+ set_task_blocked_on(current, NULL);
__set_current_state(TASK_RUNNING);

if (ww_ctx) {
@@ -706,16 +725,20 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
if (ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);

+ raw_spin_unlock(&current->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
return 0;

err:
+ set_task_blocked_on(current, NULL);
__set_current_state(TASK_RUNNING);
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
+ WARN_ON(current->blocked_on);
trace_contention_end(lock, ret);
+ raw_spin_unlock(&current->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
@@ -925,8 +948,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne

next = waiter->task;

+ raw_spin_lock(&next->blocked_lock);
debug_mutex_wake_waiter(lock, waiter);
+ WARN_ON(next->blocked_on != lock);
+ next->blocked_on_state = BO_WAKING;
wake_q_add(&wake_q, next);
+ raw_spin_unlock(&next->blocked_lock);
}

if (owner & MUTEX_FLAG_HANDOFF)
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 9facc0ddfdd3..8dd21ff5eee0 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -281,10 +281,21 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
return false;

if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
+ /* nested as we should hold current->blocked_lock already */
+ raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
+ /*
+ * When waking up the task to die, be sure to set the
+ * blocked_on_state to WAKING. Otherwise we can see
+ * circular blocked_on relationships that can't
+ * resolve.
+ */
+ WARN_ON(waiter->task->blocked_on != lock);
+ waiter->task->blocked_on_state = BO_WAKING;
wake_q_add(wake_q, waiter->task);
+ raw_spin_unlock(&waiter->task->blocked_lock);
}

return true;
@@ -331,9 +342,18 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* it's wounded in __ww_mutex_check_kill() or has a
* wakeup pending to re-read the wounded state.
*/
- if (owner != current)
+ if (owner != current) {
+ /* nested as we should hold current->blocked_lock already */
+ raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
+ /*
+ * When waking up the task to wound, be sure to set the
+ * blocked_on_state flag. Otherwise we can see circular
+ * blocked_on relationships that can't resolve.
+ */
+ owner->blocked_on_state = BO_WAKING;
wake_q_add(wake_q, owner);
-
+ raw_spin_unlock(&owner->blocked_lock);
+ }
return true;
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a708d225c28e..4e46189d545d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4195,6 +4195,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
guard(preempt)();
+ unsigned long flags;
int cpu, success = 0;

if (p == current) {
@@ -4341,6 +4342,11 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

ttwu_queue(p, cpu, wake_flags);
}
+ /* XXX can we do something better here for !CONFIG_SCHED_PROXY_EXEC case */
+ raw_spin_lock_irqsave(&p->blocked_lock, flags);
+ if (p->blocked_on_state == BO_WAKING)
+ p->blocked_on_state = BO_RUNNABLE;
+ raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
out:
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:20:32

by John Stultz

Subject: [PATCH v7 02/23] locking/mutex: Remove wakeups from under mutex::wait_lock

In preparation for nesting mutex::wait_lock under rq::lock, we need
to remove wakeups from under it.
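
This uses the standard wake_q pattern: while holding wait_lock,
record the tasks that need waking instead of waking them
directly, then issue the wakeups after the lock is dropped. A
sketch of the pattern (the real changes are in the diff below):

  DEFINE_WAKE_Q(wake_q);

  raw_spin_lock(&lock->wait_lock);
  /* decide who needs waking, but defer the actual wakeup */
  wake_q_add(&wake_q, waiter->task);
  raw_spin_unlock(&lock->wait_lock);

  wake_up_q(&wake_q);     /* wakeups now happen outside wait_lock */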

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
mutexes")]
Signed-off-by: Juri Lelli <[email protected]>
[jstultz: rebased to mainline, added extra wake_up_q & init
to avoid hangs, similar to Connor's rework of this patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Reverted back to an earlier version of this patch to undo
the change that kept the wake_q in the ctx structure, as
that broke the rule that the wake_q must always be on the
stack, as its not safe for concurrency.
v6:
* Made tweaks suggested by Waiman Long
v7:
* Fixups to pass wake_qs down for PREEMPT_RT logic
---
kernel/locking/mutex.c | 17 +++++++++++++----
kernel/locking/rtmutex.c | 26 +++++++++++++++++---------
kernel/locking/rwbase_rt.c | 4 +++-
kernel/locking/rwsem.c | 4 ++--
kernel/locking/spinlock_rt.c | 3 ++-
kernel/locking/ww_mutex.h | 29 ++++++++++++++++++-----------
6 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2deeeca3e71b..8337ed0dbf81 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -570,6 +570,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
struct lockdep_map *nest_lock, unsigned long ip,
struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
+ DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
int ret;
@@ -620,7 +621,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (__mutex_trylock(lock)) {
if (ww_ctx)
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);

goto skip_wait;
}
@@ -640,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* Add in stamp order, waking up waiters that must kill
* themselves.
*/
- ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
+ ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
if (ret)
goto err_early_kill;
}
@@ -676,6 +677,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}

raw_spin_unlock(&lock->wait_lock);
+ /* Make sure we do wakeups before calling schedule */
+ if (!wake_q_empty(&wake_q)) {
+ wake_up_q(&wake_q);
+ wake_q_init(&wake_q);
+ }
schedule_preempt_disabled();

first = __mutex_waiter_is_first(lock, &waiter);
@@ -709,7 +715,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (!ww_ctx->is_wait_die &&
!__mutex_waiter_is_first(lock, &waiter))
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
}

__mutex_remove_waiter(lock, &waiter);
@@ -725,6 +731,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
ww_mutex_lock_acquired(ww, ww_ctx);

raw_spin_unlock(&lock->wait_lock);
+ wake_up_q(&wake_q);
preempt_enable();
return 0;

@@ -736,6 +743,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
raw_spin_unlock(&lock->wait_lock);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
+ wake_up_q(&wake_q);
preempt_enable();
return ret;
}
@@ -929,6 +937,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}
}

+ preempt_disable();
raw_spin_lock(&lock->wait_lock);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
@@ -947,8 +956,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
__mutex_handoff(lock, next);

raw_spin_unlock(&lock->wait_lock);
-
wake_up_q(&wake_q);
+ preempt_enable();
}

#ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 4a10e8c16fd2..eaac8b196a69 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -34,13 +34,15 @@

static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter,
struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
return 0;
}

static inline void __ww_mutex_check_waiters(struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
}

@@ -1206,6 +1208,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
struct rt_mutex_waiter *top_waiter = waiter;
struct rt_mutex_base *next_lock;
int chain_walk = 0, res;
+ DEFINE_WAKE_Q(wake_q);

lockdep_assert_held(&lock->wait_lock);

@@ -1244,7 +1247,8 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,

/* Check whether the waiter should back out immediately */
rtm = container_of(lock, struct rt_mutex, rtmutex);
- res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
+ res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, &wake_q);
+ wake_up_q(&wake_q);
if (res) {
raw_spin_lock(&task->pi_lock);
rt_mutex_dequeue(lock, waiter);
@@ -1677,7 +1681,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state,
enum rtmutex_chainwalk chwalk,
- struct rt_mutex_waiter *waiter)
+ struct rt_mutex_waiter *waiter,
+ struct wake_q_head *wake_q)
{
struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
struct ww_mutex *ww = ww_container_of(rtm);
@@ -1688,7 +1693,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* Try to acquire the lock again: */
if (try_to_take_rt_mutex(lock, current, NULL)) {
if (build_ww_mutex() && ww_ctx) {
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
return 0;
@@ -1706,7 +1711,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* acquired the lock */
if (build_ww_mutex() && ww_ctx) {
if (!ww_ctx->is_wait_die)
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
} else {
@@ -1728,7 +1733,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,

static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
- unsigned int state)
+ unsigned int state,
+ struct wake_q_head *wake_q)
{
struct rt_mutex_waiter waiter;
int ret;
@@ -1737,7 +1743,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
waiter.ww_ctx = ww_ctx;

ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
- &waiter);
+ &waiter, wake_q);

debug_rt_mutex_free_waiter(&waiter);
return ret;
@@ -1753,6 +1759,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state)
{
+ DEFINE_WAKE_Q(wake_q);
unsigned long flags;
int ret;

@@ -1774,8 +1781,9 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
* irqsave/restore variants.
*/
raw_spin_lock_irqsave(&lock->wait_lock, flags);
- ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
+ ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+ wake_up_q(&wake_q);
rt_mutex_post_schedule();

return ret;
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 34a59569db6b..e9d2f38b70f3 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -69,6 +69,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
unsigned int state)
{
struct rt_mutex_base *rtm = &rwb->rtmutex;
+ DEFINE_WAKE_Q(wake_q);
int ret;

rwbase_pre_schedule();
@@ -110,7 +111,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
* For rwlocks this returns 0 unconditionally, so the below
* !ret conditionals are optimized out.
*/
- ret = rwbase_rtmutex_slowlock_locked(rtm, state);
+ ret = rwbase_rtmutex_slowlock_locked(rtm, state, &wake_q);

/*
* On success the rtmutex is held, so there can't be a writer
@@ -122,6 +123,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
if (!ret)
atomic_inc(&rwb->readers);
raw_spin_unlock_irq(&rtm->wait_lock);
+ wake_up_q(&wake_q);
if (!ret)
rwbase_rtmutex_unlock(rtm);

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..74ebb2915d63 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1415,8 +1415,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
#define rwbase_rtmutex_lock_state(rtm, state) \
__rt_mutex_lock(rtm, state)

-#define rwbase_rtmutex_slowlock_locked(rtm, state) \
- __rt_mutex_slowlock_locked(rtm, NULL, state)
+#define rwbase_rtmutex_slowlock_locked(rtm, state, wq) \
+ __rt_mutex_slowlock_locked(rtm, NULL, state, wq)

#define rwbase_rtmutex_unlock(rtm) \
__rt_mutex_unlock(rtm)
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 38e292454fcc..fb1810a14c9d 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -162,7 +162,8 @@ rwbase_rtmutex_lock_state(struct rt_mutex_base *rtm, unsigned int state)
}

static __always_inline int
-rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state)
+rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state,
+ struct wake_q_head *wake_q)
{
rtlock_slowlock_locked(rtm);
return 0;
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 3ad2cc4823e5..7189c6631d90 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
*/
static bool
__ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
{
if (!ww_ctx->is_wait_die)
return false;
@@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}

return true;
@@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
*/
static bool __ww_mutex_wound(struct MUTEX *lock,
struct ww_acquire_ctx *ww_ctx,
- struct ww_acquire_ctx *hold_ctx)
+ struct ww_acquire_ctx *hold_ctx,
+ struct wake_q_head *wake_q)
{
struct task_struct *owner = __ww_mutex_owner(lock);

@@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* wakeup pending to re-read the wounded state.
*/
if (owner != current)
- wake_up_process(owner);
+ wake_q_add(wake_q, owner);

return true;
}
@@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* The current task must not be on the wait list.
*/
static void
-__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
+__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur;

@@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
if (!cur->ww_ctx)
continue;

- if (__ww_mutex_die(lock, cur, ww_ctx) ||
- __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+ if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
+ __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
break;
}
}
@@ -377,6 +379,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
+ DEFINE_WAKE_Q(wake_q);
+
ww_mutex_lock_acquired(lock, ctx);

/*
@@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* die or wound us.
*/
lock_wait_lock(&lock->base);
- __ww_mutex_check_waiters(&lock->base, ctx);
+ __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
unlock_wait_lock(&lock->base);
+
+ wake_up_q(&wake_q);
}

static __always_inline int
@@ -488,7 +494,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
static inline int
__ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
struct MUTEX *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur, *pos = NULL;
bool is_wait_die;
@@ -532,7 +539,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
pos = cur;

/* Wait-Die: ensure younger waiters die. */
- __ww_mutex_die(lock, cur, ww_ctx);
+ __ww_mutex_die(lock, cur, ww_ctx, wake_q);
}

__ww_waiter_add(lock, waiter, pos);
@@ -550,7 +557,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
* such that either we or the fastpath will wound @ww->ctx.
*/
smp_mb();
- __ww_mutex_wound(lock, ww_ctx, ww->ctx);
+ __ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
}

return 0;
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:20:54

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 04/23] locking/mutex: Expose __mutex_owner()

From: Juri Lelli <[email protected]>

Implementing proxy execution requires that scheduler code be able to
identify the current owner of a mutex. Expose __mutex_owner() for
this purpose (alone!).
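
For readers less familiar with the trick __mutex_owner() relies on:
task_struct pointers are at least L1_CACHE_BYTES aligned, so the owner
word can carry the WAITERS/HANDOFF/PICKUP flags in its low bits and the
accessor just masks them off. A stand-alone user-space sketch of that
pattern (the struct and names below are illustrative, not the kernel's):

  #include <stdint.h>
  #include <stdio.h>

  #define FLAG_WAITERS 0x01UL
  #define FLAG_HANDOFF 0x02UL
  #define FLAG_PICKUP  0x04UL
  #define FLAG_MASK    0x07UL

  struct fake_task {
          _Alignas(8) char comm[16];      /* force >= 8-byte alignment */
  };

  /* Pack a task pointer together with the low-bit flags. */
  static uintptr_t pack_owner(struct fake_task *t, uintptr_t flags)
  {
          return (uintptr_t)t | (flags & FLAG_MASK);
  }

  /* Mirrors the idea behind __mutex_owner(): mask off the flag bits. */
  static struct fake_task *unpack_owner(uintptr_t owner)
  {
          return (struct fake_task *)(owner & ~FLAG_MASK);
  }

  int main(void)
  {
          struct fake_task t = { "worker" };
          uintptr_t owner = pack_owner(&t, FLAG_WAITERS | FLAG_HANDOFF);

          printf("owner=%s waiters=%d handoff=%d\n",
                 unpack_owner(owner)->comm,
                 !!(owner & FLAG_WAITERS), !!(owner & FLAG_HANDOFF));
          return 0;
  }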

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
[Removed the EXPORT_SYMBOL]
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Reworked per Peter's suggestions]
Signed-off-by: John Stultz <[email protected]>
---
v4:
* Move __mutex_owner() to kernel/locking/mutex.h instead of
adding a new globally available accessor function to keep
the exposure of this low, along with keeping it an inline
function, as suggested by PeterZ
---
kernel/locking/mutex.c | 25 -------------------------
kernel/locking/mutex.h | 25 +++++++++++++++++++++++++
2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 73d98dd23eec..543774506fdb 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -56,31 +56,6 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
}
EXPORT_SYMBOL(__mutex_init);

-/*
- * @owner: contains: 'struct task_struct *' to the current lock owner,
- * NULL means not owned. Since task_struct pointers are aligned at
- * at least L1_CACHE_BYTES, we have low bits to store extra state.
- *
- * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
- * Bit1 indicates unlock needs to hand the lock to the top-waiter
- * Bit2 indicates handoff has been done and we're waiting for pickup.
- */
-#define MUTEX_FLAG_WAITERS 0x01
-#define MUTEX_FLAG_HANDOFF 0x02
-#define MUTEX_FLAG_PICKUP 0x04
-
-#define MUTEX_FLAGS 0x07
-
-/*
- * Internal helper function; C doesn't allow us to hide it :/
- *
- * DO NOT USE (outside of mutex code).
- */
-static inline struct task_struct *__mutex_owner(struct mutex *lock)
-{
- return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
-}
-
static inline struct task_struct *__owner_task(unsigned long owner)
{
return (struct task_struct *)(owner & ~MUTEX_FLAGS);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 0b2a79c4013b..1c7d3d32def8 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -20,6 +20,31 @@ struct mutex_waiter {
#endif
};

+/*
+ * @owner: contains: 'struct task_struct *' to the current lock owner,
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
+ *
+ * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ * Bit1 indicates unlock needs to hand the lock to the top-waiter
+ * Bit2 indicates handoff has been done and we're waiting for pickup.
+ */
+#define MUTEX_FLAG_WAITERS 0x01
+#define MUTEX_FLAG_HANDOFF 0x02
+#define MUTEX_FLAG_PICKUP 0x04
+
+#define MUTEX_FLAGS 0x07
+
+/*
+ * Internal helper function; C doesn't allow us to hide it :/
+ *
+ * DO NOT USE (outside of mutex & scheduler code).
+ */
+static inline struct task_struct *__mutex_owner(struct mutex *lock)
+{
+ return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
+}
+
#ifdef CONFIG_DEBUG_MUTEXES
extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:20:56

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 07/23] locking/mutex: Switch to mutex handoffs for CONFIG_SCHED_PROXY_EXEC

From: Peter Zijlstra <[email protected]>

Since with SCHED_PROXY_EXEC, we will want to hand off locks to
the tasks we are running on behalf of, switch to using mutex
handoffs.
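
Roughly, the unlock-side decision being changed looks like the sketch
below (stand-alone simplified C; the real code does this with atomic
cmpxchg on lock->owner). With proxy execution enabled we behave as if
MUTEX_FLAG_HANDOFF were always set, so the lock is handed directly to
the top waiter instead of being released for anyone to grab:

  #include <stdbool.h>
  #include <stdio.h>

  #define FLAG_WAITERS 0x01UL
  #define FLAG_HANDOFF 0x02UL

  enum unlock_action {
          UNLOCK_DONE,    /* lock released, nobody to wake */
          UNLOCK_WAKE,    /* lock released, wake the top waiter */
          UNLOCK_HANDOFF, /* ownership kept reserved for the top waiter */
  };

  static enum unlock_action unlock_slowpath(unsigned long owner, bool proxy)
  {
          if (proxy)
                  return UNLOCK_HANDOFF;  /* always hand off under proxy exec */
          if (owner & FLAG_HANDOFF)
                  return UNLOCK_HANDOFF;  /* a waiter requested the handoff */
          if (owner & FLAG_WAITERS)
                  return UNLOCK_WAKE;
          return UNLOCK_DONE;
  }

  int main(void)
  {
          printf("waiters, !proxy -> %d\n", unlock_slowpath(FLAG_WAITERS, false));
          printf("waiters,  proxy -> %d\n", unlock_slowpath(FLAG_WAITERS, true));
          return 0;
  }

Once every unlock is a handoff the lock can't be stolen by a spinner
anyway, which lines up with the mutex-design.rst note added below about
skipping optimistic spinning when sched_proxy_exec() is true.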

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebased, added comments and changelog]
Signed-off-by: Juri Lelli <[email protected]>
[Fixed rebase conflicts]
[squashed sched: Ensure blocked_on is always guarded by blocked_lock]
Signed-off-by: Valentin Schneider <[email protected]>
[fix rebase conflicts, various fixes & tweaks commented inline]
[squashed sched: Use rq->curr vs rq->proxy checks]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Split out only the very basic initial framework
for proxy logic from a larger patch.]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split out from core proxy patch
v6:
* Rework to use sched_proxy_exec() instead of #ifdef CONFIG_PROXY_EXEC
v7:
* Avoid disabling optimistic spinning at compile time so booting
with sched_proxy_exec=off matches prior performance
* Add comment in mutex-design.rst as suggested by Metin Kaya
---
Documentation/locking/mutex-design.rst | 3 ++
kernel/locking/mutex.c | 42 +++++++++++++++-----------
2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/Documentation/locking/mutex-design.rst b/Documentation/locking/mutex-design.rst
index 78540cd7f54b..57a5cb03f409 100644
--- a/Documentation/locking/mutex-design.rst
+++ b/Documentation/locking/mutex-design.rst
@@ -61,6 +61,9 @@ taken, depending on the state of the lock:
waiting to spin on mutex owner, only to go directly to slowpath upon
obtaining the MCS lock.

+ NOTE: Optimistic spinning will be avoided when using proxy execution
+ (SCHED_PROXY_EXEC) as we want to hand the lock off to the task that was
+ boosting the current owner.

(iii) slowpath: last resort, if the lock is still unable to be acquired,
the task is added to the wait-queue and sleeps until woken up by the
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 6084470773f6..11dc5cb7a5a3 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -416,6 +416,9 @@ static __always_inline bool
mutex_optimistic_spin(struct mutex *lock, struct ww_acquire_ctx *ww_ctx,
struct mutex_waiter *waiter)
{
+ if (sched_proxy_exec())
+ return false;
+
if (!waiter) {
/*
* The purpose of the mutex_can_spin_on_owner() function is
@@ -914,26 +917,31 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne

mutex_release(&lock->dep_map, ip);

- /*
- * Release the lock before (potentially) taking the spinlock such that
- * other contenders can get on with things ASAP.
- *
- * Except when HANDOFF, in that case we must not clear the owner field,
- * but instead set it to the top waiter.
- */
- owner = atomic_long_read(&lock->owner);
- for (;;) {
- MUTEX_WARN_ON(__owner_task(owner) != current);
- MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);
-
- if (owner & MUTEX_FLAG_HANDOFF)
- break;
+ if (sched_proxy_exec()) {
+ /* Always force HANDOFF for Proxy Exec for now. Revisit. */
+ owner = MUTEX_FLAG_HANDOFF;
+ } else {
+ /*
+ * Release the lock before (potentially) taking the spinlock
+ * such that other contenders can get on with things ASAP.
+ *
+ * Except when HANDOFF, in that case we must not clear the
+ * owner field, but instead set it to the top waiter.
+ */
+ owner = atomic_long_read(&lock->owner);
+ for (;;) {
+ MUTEX_WARN_ON(__owner_task(owner) != current);
+ MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);

- if (atomic_long_try_cmpxchg_release(&lock->owner, &owner, __owner_flags(owner))) {
- if (owner & MUTEX_FLAG_WAITERS)
+ if (owner & MUTEX_FLAG_HANDOFF)
break;

- return;
+ if (atomic_long_try_cmpxchg_release(&lock->owner, &owner,
+ __owner_flags(owner))) {
+ if (owner & MUTEX_FLAG_WAITERS)
+ break;
+ return;
+ }
}
}

--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:21:25

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 08/23] sched: Split scheduler and execution contexts

From: Peter Zijlstra <[email protected]>

Let's define the scheduling context as all the scheduler state
in task_struct for the task selected to run, and the execution
context as all state required to actually run the task.

Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the task selected to be scheduled, but use the
execution context of a different task to actually be run.

To this end, introduce the rq_selected() macro to point to the
task_struct selected from the runqueue by the scheduler. It will
be used for scheduler state, while rq->curr is preserved to
indicate the execution context of the task that will actually be
run.

NOTE: Peter previously mentioned he didn't like the name
"rq_selected()", but I've not come up with a better alternative.
I'm very open to other name proposals.

Question for Peter: Dietmar suggested you'd prefer I drop the
conditionalization of the scheduler context pointer on the rq
(so rq_selected() would be open coded as rq->curr_selected or
whatever we agree on for a name), but I'd think in the
!CONFIG_PROXY_EXEC case we'd want to avoid the wasted pointer
and its use (since curr_selected would always be == curr)?
If I'm wrong I'm fine switching this, but would appreciate
clarification.
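
For reviewers who want the shape of the split without walking the whole
diff, here is a stand-alone sketch (illustrative types and names, not
the kernel's struct rq): the CPU runs curr, while accounting and policy
decisions go through rq_selected(), which collapses back to curr when
the option is compiled out:

  #include <stdio.h>

  #define CONFIG_SCHED_PROXY_EXEC 1      /* flip to 0 to see the fallback */

  struct task { const char *comm; int prio; };

  struct runq {
          struct task *curr;             /* execution context: what the CPU runs */
  #if CONFIG_SCHED_PROXY_EXEC
          struct task *curr_selected;    /* scheduling context: what was picked */
  #endif
  };

  #if CONFIG_SCHED_PROXY_EXEC
  #define rq_selected(rq) ((rq)->curr_selected)
  static void rq_set_selected(struct runq *rq, struct task *t)
  {
          rq->curr_selected = t;
  }
  #else
  #define rq_selected(rq) ((rq)->curr)
  static void rq_set_selected(struct runq *rq, struct task *t)
  {
          (void)rq; (void)t;             /* no second pointer to maintain */
  }
  #endif

  int main(void)
  {
          struct task owner  = { "lock-owner", 120 };  /* holds the mutex */
          struct task waiter = { "rt-waiter", 10 };    /* blocked, but picked */
          struct runq rq = { .curr = &owner };

          rq_set_selected(&rq, &waiter); /* scheduler picked the blocked waiter */

          printf("running %s, accounting/policy against %s (prio %d)\n",
                 rq.curr->comm, rq_selected(&rq)->comm, rq_selected(&rq)->prio);
          return 0;
  }

Whether the extra pointer should exist unconditionally (as Dietmar
suggested) is exactly the open question above; the sketch only shows
what the conditional variant saves in the !PROXY_EXEC case.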

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[add additional comments and update more sched_class code to use
rq::proxy]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Rebased and resolved minor collisions, reworked to use
accessors, tweaked update_curr_common to use rq_proxy fixing rt
scheduling issues]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Reworked to use accessors
* Fixed update_curr_common to use proxy instead of curr
v3:
* Tweaked wrapper names
* Swapped proxy for selected for clarity
v4:
* Minor variable name tweaks for readability
* Use a macro instead of a inline function and drop
other helper functions as suggested by Peter.
* Remove verbose comments/questions to avoid review
distractions, as suggested by Dietmar
v5:
* Add CONFIG_PROXY_EXEC option to this patch so the
new logic can be tested with this change
* Minor fix to grab rq_selected when holding the rq lock
v7:
* Minor spelling fix and unused argument fixes suggested by
Metin Kaya
* Switch to curr_selected for consistency, and minor rewording
of commit message for clarity
* Rename variables selected instead of curr when we're using
rq_selected()
* Reduce macros in CONFIG_SCHED_PROXY_EXEC ifdef sections,
as suggested by Metin Kaya
---
kernel/sched/core.c | 46 ++++++++++++++++++++++++++---------------
kernel/sched/deadline.c | 35 ++++++++++++++++---------------
kernel/sched/fair.c | 18 ++++++++--------
kernel/sched/rt.c | 40 +++++++++++++++++------------------
kernel/sched/sched.h | 35 +++++++++++++++++++++++++++++--
5 files changed, 109 insertions(+), 65 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e06558fb08aa..0ce34f5c0e0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -822,7 +822,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)

rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+ rq_selected(rq)->sched_class->task_tick(rq, rq_selected(rq), 1);
rq_unlock(rq, &rf);

return HRTIMER_NORESTART;
@@ -2242,16 +2242,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,

void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->sched_class == rq->curr->sched_class)
- rq->curr->sched_class->wakeup_preempt(rq, p, flags);
- else if (sched_class_above(p->sched_class, rq->curr->sched_class))
+ struct task_struct *selected = rq_selected(rq);
+
+ if (p->sched_class == selected->sched_class)
+ selected->sched_class->wakeup_preempt(rq, p, flags);
+ else if (sched_class_above(p->sched_class, selected->sched_class))
resched_curr(rq);

/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
- if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+ if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq);
}

@@ -2780,7 +2782,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
lockdep_assert_held(&p->pi_lock);

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);

if (queued) {
/*
@@ -5600,7 +5602,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
* project cycles that may never be accounted to this
* thread, breaking clock_gettime().
*/
- if (task_current(rq, p) && task_on_rq_queued(p)) {
+ if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
prefetch_curr_exec_start(p);
update_rq_clock(rq);
p->sched_class->update_curr(rq);
@@ -5668,7 +5670,8 @@ void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
- struct task_struct *curr = rq->curr;
+ /* accounting goes to the selected task */
+ struct task_struct *selected;
struct rq_flags rf;
unsigned long thermal_pressure;
u64 resched_latency;
@@ -5679,16 +5682,17 @@ void scheduler_tick(void)
sched_clock_tick();

rq_lock(rq, &rf);
+ selected = rq_selected(rq);

update_rq_clock(rq);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
- curr->sched_class->task_tick(rq, curr, 0);
+ selected->sched_class->task_tick(rq, selected, 0);
if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
sched_core_tick(rq);
- task_tick_mm_cid(rq, curr);
+ task_tick_mm_cid(rq, selected);

rq_unlock(rq, &rf);

@@ -5697,8 +5701,8 @@ void scheduler_tick(void)

perf_event_task_tick();

- if (curr->flags & PF_WQ_WORKER)
- wq_worker_tick(curr);
+ if (selected->flags & PF_WQ_WORKER)
+ wq_worker_tick(selected);

#ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
@@ -5763,6 +5767,12 @@ static void sched_tick_remote(struct work_struct *work)
struct task_struct *curr = rq->curr;

if (cpu_online(cpu)) {
+ /*
+ * Since this is a remote tick for full dynticks mode,
+ * we are always sure that there is no proxy (only a
+ * single task is running).
+ */
+ SCHED_WARN_ON(rq->curr != rq_selected(rq));
update_rq_clock(rq);

if (!is_idle_task(curr)) {
@@ -6685,6 +6695,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}

next = pick_next_task(rq, prev, &rf);
+ rq_set_selected(rq, next);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
@@ -7185,7 +7196,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)

prev_class = p->sched_class;
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flag);
if (running)
@@ -7275,7 +7286,7 @@ void set_user_nice(struct task_struct *p, long nice)
}

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
if (running)
@@ -7868,7 +7879,7 @@ static int __sched_setscheduler(struct task_struct *p,
}

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flags);
if (running)
@@ -9295,6 +9306,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
rcu_read_unlock();

rq->idle = idle;
+ rq_set_selected(rq, idle);
rcu_assign_pointer(rq->curr, idle);
idle->on_rq = TASK_ON_RQ_QUEUED;
#ifdef CONFIG_SMP
@@ -9384,7 +9396,7 @@ void sched_setnuma(struct task_struct *p, int nid)

rq = task_rq_lock(p, &rf);
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);

if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -10489,7 +10501,7 @@ void sched_move_task(struct task_struct *tsk)

update_rq_clock(rq);

- running = task_current(rq, tsk);
+ running = task_current_selected(rq, tsk);
queued = task_on_rq_queued(tsk);

if (queued)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6140f1f51da1..9cf20f4ac5f9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1150,7 +1150,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
#endif

enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
wakeup_preempt_dl(rq, p, 0);
else
resched_curr(rq);
@@ -1273,7 +1273,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
*/
static void update_curr_dl(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_dl_entity *dl_se = &curr->dl;
s64 delta_exec, scaled_delta_exec;
int cpu = cpu_of(rq);
@@ -1784,7 +1784,7 @@ static int find_later_rq(struct task_struct *task);
static int
select_task_rq_dl(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
bool select_rq;
struct rq *rq;

@@ -1795,6 +1795,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)

rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));

/*
* If we are dealing with a -deadline task, we must
@@ -1805,9 +1806,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- select_rq = unlikely(dl_task(curr)) &&
+ select_rq = unlikely(dl_task(selected)) &&
(curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ !dl_entity_preempt(&p->dl, &selected->dl)) &&
p->nr_cpus_allowed > 1;

/*
@@ -1870,7 +1871,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
+ !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
return;

/*
@@ -1909,7 +1910,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
int flags)
{
- if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
+ if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
resched_curr(rq);
return;
}
@@ -1919,7 +1920,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
* In the unlikely case current and p have the same deadline
* let us try to decide what's the best thing to do...
*/
- if ((p->dl.deadline == rq->curr->dl.deadline) &&
+ if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
!test_tsk_need_resched(rq->curr))
check_preempt_equal_dl(rq, p);
#endif /* CONFIG_SMP */
@@ -1954,7 +1955,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (hrtick_enabled_dl(rq))
start_hrtick_dl(rq, p);

- if (rq->curr->sched_class != &dl_sched_class)
+ if (rq_selected(rq)->sched_class != &dl_sched_class)
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);

deadline_queue_push_tasks(rq);
@@ -2268,8 +2269,8 @@ static int push_dl_task(struct rq *rq)
* can move away, it makes sense to just reschedule
* without going further in pushing next_task.
*/
- if (dl_task(rq->curr) &&
- dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+ if (dl_task(rq_selected(rq)) &&
+ dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
rq->curr->nr_cpus_allowed > 1) {
resched_curr(rq);
return 0;
@@ -2394,7 +2395,7 @@ static void pull_dl_task(struct rq *this_rq)
* deadline than the current task of its runqueue.
*/
if (dl_time_before(p->dl.deadline,
- src_rq->curr->dl.deadline))
+ rq_selected(src_rq)->dl.deadline))
goto skip;

if (is_migration_disabled(p)) {
@@ -2435,9 +2436,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
if (!task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- dl_task(rq->curr) &&
+ dl_task(rq_selected(rq)) &&
(rq->curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
+ !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
push_dl_tasks(rq);
}
}
@@ -2612,12 +2613,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
return;
}

- if (rq->curr != p) {
+ if (rq_selected(rq) != p) {
#ifdef CONFIG_SMP
if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
deadline_queue_push_tasks(rq);
#endif
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
wakeup_preempt_dl(rq, p, 0);
else
resched_curr(rq);
@@ -2646,7 +2647,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
if (!rq->dl.overloaded)
deadline_queue_pull_task(rq);

- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
/*
* If we now have a earlier deadline task than p,
* then reschedule, provided p is still on this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1251fd01a555..07216ea3ed53 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1157,7 +1157,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
*/
s64 update_curr_common(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
s64 delta_exec;

delta_exec = update_curr_se(rq, &curr->se);
@@ -1203,7 +1203,7 @@ static void update_curr(struct cfs_rq *cfs_rq)

static void update_curr_fair(struct rq *rq)
{
- update_curr(cfs_rq_of(&rq->curr->se));
+ update_curr(cfs_rq_of(&rq_selected(rq)->se));
}

static inline void
@@ -6611,7 +6611,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
s64 delta = slice - ran;

if (delta < 0) {
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
return;
}
@@ -6626,7 +6626,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
*/
static void hrtick_update(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);

if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
return;
@@ -8235,7 +8235,7 @@ static void set_next_buddy(struct sched_entity *se)
*/
static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int next_buddy_marked = 0;
@@ -8268,7 +8268,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
- if (test_tsk_need_resched(curr))
+ if (test_tsk_need_resched(rq->curr))
return;

/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -9252,7 +9252,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
* update_load_avg() can call cpufreq_update_util(). Make sure that RT,
* DL and IRQ signals have been updated before updating CFS.
*/
- curr_class = rq->curr->sched_class;
+ curr_class = rq_selected(rq)->sched_class;

thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));

@@ -12640,7 +12640,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
* our priority decreased, or if we are not currently running on
* this runqueue and our priority is higher than the current's
*/
- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
if (p->prio > oldprio)
resched_curr(rq);
} else
@@ -12743,7 +12743,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
* kick off the schedule if running, otherwise just see
* if we can still preempt the current task.
*/
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
else
wakeup_preempt(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9cdea3ea47da..2682cec45aaa 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -530,7 +530,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)

static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
{
- struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
+ struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
struct rq *rq = rq_of_rt_rq(rt_rq);
struct sched_rt_entity *rt_se;

@@ -1000,7 +1000,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
*/
static void update_curr_rt(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_rt_entity *rt_se = &curr->rt;
s64 delta_exec;

@@ -1545,7 +1545,7 @@ static int find_lowest_rq(struct task_struct *task);
static int
select_task_rq_rt(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
struct rq *rq;
bool test;

@@ -1557,6 +1557,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)

rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));

/*
* If the current task on @p's runqueue is an RT task, then
@@ -1585,8 +1586,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
* systems like big.LITTLE.
*/
test = curr &&
- unlikely(rt_task(curr)) &&
- (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+ unlikely(rt_task(selected)) &&
+ (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);

if (test || !rt_task_fits_capacity(p, cpu)) {
int target = find_lowest_rq(p);
@@ -1616,12 +1617,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)

static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
{
- /*
- * Current can't be migrated, useless to reschedule,
- * let's hope p can move out.
- */
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+ !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
return;

/*
@@ -1664,7 +1661,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
*/
static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->prio < rq->curr->prio) {
+ struct task_struct *curr = rq_selected(rq);
+
+ if (p->prio < curr->prio) {
resched_curr(rq);
return;
}
@@ -1682,7 +1681,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
* to move current somewhere else, making room for our non-migratable
* task.
*/
- if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+ if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
check_preempt_equal_prio(rq, p);
#endif
}
@@ -1707,7 +1706,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
* utilization. We only care of the case where we start to schedule a
* rt task
*/
- if (rq->curr->sched_class != &rt_sched_class)
+ if (rq_selected(rq)->sched_class != &rt_sched_class)
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);

rt_queue_push_tasks(rq);
@@ -1988,6 +1987,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)

BUG_ON(rq->cpu != task_cpu(p));
BUG_ON(task_current(rq, p));
+ BUG_ON(task_current_selected(rq, p));
BUG_ON(p->nr_cpus_allowed <= 1);

BUG_ON(!task_on_rq_queued(p));
@@ -2020,7 +2020,7 @@ static int push_rt_task(struct rq *rq, bool pull)
* higher priority than current. If that's the case
* just reschedule current.
*/
- if (unlikely(next_task->prio < rq->curr->prio)) {
+ if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
resched_curr(rq);
return 0;
}
@@ -2375,7 +2375,7 @@ static void pull_rt_task(struct rq *this_rq)
* p if it is lower in priority than the
* current task on the run queue
*/
- if (p->prio < src_rq->curr->prio)
+ if (p->prio < rq_selected(src_rq)->prio)
goto skip;

if (is_migration_disabled(p)) {
@@ -2419,9 +2419,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
bool need_to_push = !task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- (dl_task(rq->curr) || rt_task(rq->curr)) &&
+ (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
(rq->curr->nr_cpus_allowed < 2 ||
- rq->curr->prio <= p->prio);
+ rq_selected(rq)->prio <= p->prio);

if (need_to_push)
push_rt_tasks(rq);
@@ -2505,7 +2505,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
rt_queue_push_tasks(rq);
#endif /* CONFIG_SMP */
- if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+ if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
resched_curr(rq);
}
}
@@ -2520,7 +2520,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
if (!task_on_rq_queued(p))
return;

- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
#ifdef CONFIG_SMP
/*
* If our priority decreases while running, we
@@ -2546,7 +2546,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
* greater than the current running task
* then reschedule.
*/
- if (p->prio < rq->curr->prio)
+ if (p->prio < rq_selected(rq)->prio)
resched_curr(rq);
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e0e4fc8734b..6ea1dfbe502a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -994,7 +994,10 @@ struct rq {
*/
unsigned int nr_uninterruptible;

- struct task_struct __rcu *curr;
+ struct task_struct __rcu *curr; /* Execution context */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ struct task_struct __rcu *curr_selected; /* Scheduling context (policy) */
+#endif
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
@@ -1189,6 +1192,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)

+#ifdef CONFIG_SCHED_PROXY_EXEC
+#define rq_selected(rq) ((rq)->curr_selected)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+ rcu_assign_pointer(rq->curr_selected, t);
+}
+#else
+#define rq_selected(rq) ((rq)->curr)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+ /* Do nothing */
+}
+#endif
+
struct sched_group;
#ifdef CONFIG_SCHED_CORE
static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2112,11 +2129,25 @@ static inline u64 global_rt_runtime(void)
return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
}

+/*
+ * Is p the current execution context?
+ */
static inline int task_current(struct rq *rq, struct task_struct *p)
{
return rq->curr == p;
}

+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq_selected() == p.
+ */
+static inline int task_current_selected(struct rq *rq, struct task_struct *p)
+{
+ return rq_selected(rq) == p;
+}
+
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
@@ -2280,7 +2311,7 @@ struct sched_class {

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
- WARN_ON_ONCE(rq->curr != prev);
+ WARN_ON_ONCE(rq_selected(rq) != prev);
prev->sched_class->put_prev_task(rq, prev);
}

--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:21:30

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 09/23] sched: Fix runtime accounting w/ split exec & sched contexts

The idea here is we want to charge the scheduler-context task's
vruntime but charge the execution-context task's sum_exec_runtime.

This way cputime accounting goes against the task actually running
but vruntime accounting goes against the selected task so we get
proper fairness.
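
As a stand-alone illustration (illustrative types, not the fair.c
code): for a given delta_exec, sum_exec_runtime is charged to the task
that actually ran (the execution context), while vruntime advances for
the selected task (the scheduling context), scaled by the selected
task's weight:

  #include <stdio.h>

  struct entity {
          const char *comm;
          unsigned long long sum_exec_runtime;  /* ns actually spent on a CPU */
          unsigned long long vruntime;          /* weighted fairness clock */
          unsigned int weight;
  };

  /* Charge one slice: cputime to @running, weighted vruntime to @selected. */
  static void charge(struct entity *running, struct entity *selected,
                     unsigned long long delta_exec)
  {
          running->sum_exec_runtime += delta_exec;
          /* stand-in for calc_delta_fair(): crude NICE_0 weight scaling */
          selected->vruntime += delta_exec * 1024 / selected->weight;
  }

  int main(void)
  {
          struct entity owner    = { "lock-owner", 0, 0, 1024 };
          struct entity selected = { "boosted-waiter", 0, 0, 2048 };

          charge(&owner, &selected, 4000000ULL);  /* 4ms slice run by the owner */

          printf("%s: sum_exec=%lluns vruntime=%llu\n",
                 owner.comm, owner.sum_exec_runtime, owner.vruntime);
          printf("%s: sum_exec=%lluns vruntime=%llu\n",
                 selected.comm, selected.sum_exec_runtime, selected.vruntime);
          return 0;
  }

So the selected task still pays for the time in vruntime and can't bank
unfair CPU, while the runtime visible in /proc stays with whoever
actually burned the cycles.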

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched/fair.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07216ea3ed53..085941db5bf1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1129,22 +1129,35 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_SMP */

-static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
{
u64 now = rq_clock_task(rq);
s64 delta_exec;

- delta_exec = now - curr->exec_start;
+ /* Calculate the delta from selected se */
+ delta_exec = now - se->exec_start;
if (unlikely(delta_exec <= 0))
return delta_exec;

- curr->exec_start = now;
- curr->sum_exec_runtime += delta_exec;
+ /* Update selected se's exec_start */
+ se->exec_start = now;
+ if (entity_is_task(se)) {
+ struct task_struct *running = rq->curr;
+ /*
+ * If se is a task, we account the time against the running
+ * task, as w/ proxy-exec they may not be the same.
+ */
+ running->se.exec_start = now;
+ running->se.sum_exec_runtime += delta_exec;
+ } else {
+ /* If not task, account the time against se */
+ se->sum_exec_runtime += delta_exec;
+ }

if (schedstat_enabled()) {
struct sched_statistics *stats;

- stats = __schedstats_from_se(curr);
+ stats = __schedstats_from_se(se);
__schedstat_set(stats->exec_max,
max(delta_exec, stats->exec_max));
}
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:21:52

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 10/23] sched: Split out __sched() deactivate task logic into a helper

As we're going to re-use the deactivation logic,
split it into a helper.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
v6:
* Define function as static to avoid "no previous prototype"
warnings as Reported-by: kernel test robot <[email protected]>
v7:
* Rename state task_state to be more clear, as suggested by
Metin Kaya
---
kernel/sched/core.c | 66 +++++++++++++++++++++++++--------------------
1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0ce34f5c0e0c..34acd80b6bd0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6571,6 +6571,42 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
# define SM_MASK_PREEMPT SM_PREEMPT
#endif

+static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
+ unsigned long task_state)
+{
+ if (signal_pending_state(task_state, p)) {
+ WRITE_ONCE(p->__state, TASK_RUNNING);
+ } else {
+ p->sched_contributes_to_load =
+ (task_state & TASK_UNINTERRUPTIBLE) &&
+ !(task_state & TASK_NOLOAD) &&
+ !(task_state & TASK_FROZEN);
+
+ if (p->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (p->on_rq && ...)
+ * if (prev_state) goto out;
+ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
+ * p->state = TASK_WAKING
+ *
+ * Where __schedule() and ttwu() have matching control dependencies.
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
+ deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+ if (p->in_iowait) {
+ atomic_inc(&rq->nr_iowait);
+ delayacct_blkio_start();
+ }
+ return true;
+ }
+ return false;
+}
+
/*
* __schedule() is the main scheduler function.
*
@@ -6662,35 +6698,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
- if (signal_pending_state(prev_state, prev)) {
- WRITE_ONCE(prev->__state, TASK_RUNNING);
- } else {
- prev->sched_contributes_to_load =
- (prev_state & TASK_UNINTERRUPTIBLE) &&
- !(prev_state & TASK_NOLOAD) &&
- !(prev_state & TASK_FROZEN);
-
- if (prev->sched_contributes_to_load)
- rq->nr_uninterruptible++;
-
- /*
- * __schedule() ttwu()
- * prev_state = prev->state; if (p->on_rq && ...)
- * if (prev_state) goto out;
- * p->on_rq = 0; smp_acquire__after_ctrl_dep();
- * p->state = TASK_WAKING
- *
- * Where __schedule() and ttwu() have matching control dependencies.
- *
- * After this, schedule() must not care about p->state any more.
- */
- deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
- if (prev->in_iowait) {
- atomic_inc(&rq->nr_iowait);
- delayacct_blkio_start();
- }
- }
+ try_to_deactivate_task(rq, prev, prev_state);
switch_count = &prev->nvcsw;
}

--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:22:06

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 11/23] sched: Add an initial sketch of the find_proxy_task() function

Add a find_proxy_task() function which doesn't do much.

When we select a blocked task to run, we will just deactivate it
and pick again. The exception is if the task has become unblocked
after find_proxy_task() was called.
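
The resulting control flow is roughly the stand-alone sketch below
(illustrative types, no locking): pick the best runnable task, and if
it turns out to be blocked on a mutex, take it off the runqueue and
pick again until something runnable (or idle) is left:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  struct task {
          const char *comm;
          int prio;        /* lower is better, RT-style */
          bool on_rq;
          bool blocked;    /* stand-in for blocked_on != NULL */
  };

  static struct task *pick_next(struct task *tasks, size_t n)
  {
          struct task *best = NULL;

          for (size_t i = 0; i < n; i++)
                  if (tasks[i].on_rq && (!best || tasks[i].prio < best->prio))
                          best = &tasks[i];
          return best;
  }

  int main(void)
  {
          struct task tasks[] = {
                  { "rt-blocked", 10, true, true },   /* best prio but blocked */
                  { "cfs-task",  120, true, false },
          };
          size_t n = sizeof(tasks) / sizeof(tasks[0]);
          struct task *next;

          /* rough analogue of the initial find_proxy_task(): deactivate, re-pick */
          while ((next = pick_next(tasks, n)) && next->blocked) {
                  printf("%s is blocked, deactivating and picking again\n",
                         next->comm);
                  next->on_rq = false;
          }
          printf("running %s\n", next ? next->comm : "idle");
          return 0;
  }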

Greatly simplified from patch by:
Peter Zijlstra (Intel) <[email protected]>
Juri Lelli <[email protected]>
Valentin Schneider <[email protected]>
Connor O'Brien <[email protected]>

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
[jstultz: Split out from larger proxy patch and simplified
for review and testing.]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split out from larger proxy patch
v7:
* Fixed unused function arguments, spelling nits, and tweaks for
clarity, pointed out by Metin Kaya
* Moved task_is_blocked() implementation to this patch where it is
first used. Also drop unused arguments. Suggested by Metin Kaya.
* Tweaks to make things easier to read, as suggested by Metin Kaya.
* Rename proxy() to find_proxy_task() for clarity, and typo
fixes suggested by Metin Kaya
* Fix build warning Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
---
kernel/sched/core.c | 87 ++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/rt.c | 19 +++++++++-
kernel/sched/sched.h | 15 ++++++++
3 files changed, 115 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 34acd80b6bd0..12f5a0618328 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6572,11 +6572,11 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
#endif

static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
- unsigned long task_state)
+ unsigned long task_state, bool deactivate_cond)
{
if (signal_pending_state(task_state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
- } else {
+ } else if (deactivate_cond) {
p->sched_contributes_to_load =
(task_state & TASK_UNINTERRUPTIBLE) &&
!(task_state & TASK_NOLOAD) &&
@@ -6607,6 +6607,73 @@ static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
return false;
}

+#ifdef CONFIG_SCHED_PROXY_EXEC
+
+static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
+{
+ unsigned long state = READ_ONCE(next->__state);
+
+ /* Don't deactivate if the state has been changed to TASK_RUNNING */
+ if (state == TASK_RUNNING)
+ return false;
+ if (!try_to_deactivate_task(rq, next, state, true))
+ return false;
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ resched_curr(rq);
+ return true;
+}
+
+/*
+ * Initial simple proxy that just returns the task if it's waking
+ * or deactivates the blocked task so we can pick something that
+ * isn't blocked.
+ */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
+{
+ struct task_struct *ret = NULL;
+ struct task_struct *p = next;
+ struct mutex *mutex;
+
+ mutex = p->blocked_on;
+ /* Something changed in the chain, so pick again */
+ if (!mutex)
+ return NULL;
+ /*
+ * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+ * and ensure @owner sticks around.
+ */
+ raw_spin_lock(&mutex->wait_lock);
+ raw_spin_lock(&p->blocked_lock);
+
+ /* Check again that p is blocked with blocked_lock held */
+ if (!task_is_blocked(p) || mutex != p->blocked_on) {
+ /*
+ * Something changed in the blocked_on chain and
+ * we don't know if only at this level. So, let's
+ * just bail out completely and let __schedule
+ * figure things out (pick_again loop).
+ */
+ goto out;
+ }
+
+ if (!proxy_deactivate(rq, next))
+ ret = p;
+out:
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return ret;
+}
+#else /* SCHED_PROXY_EXEC */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
+{
+ BUG(); // This should never be called in the !PROXY case
+ return next;
+}
+#endif /* SCHED_PROXY_EXEC */
+
/*
* __schedule() is the main scheduler function.
*
@@ -6698,12 +6765,24 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
- try_to_deactivate_task(rq, prev, prev_state);
+ try_to_deactivate_task(rq, prev, prev_state,
+ !task_is_blocked(prev));
switch_count = &prev->nvcsw;
}

- next = pick_next_task(rq, prev, &rf);
+pick_again:
+ next = pick_next_task(rq, rq_selected(rq), &rf);
rq_set_selected(rq, next);
+ if (unlikely(task_is_blocked(next))) {
+ next = find_proxy_task(rq, next, &rf);
+ if (!next) {
+ rq_unpin_lock(rq, &rf);
+ __balance_callbacks(rq);
+ rq_repin_lock(rq, &rf);
+ goto pick_again;
+ }
+ }
+
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2682cec45aaa..81cd22eaa6dc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1491,8 +1491,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)

enqueue_rt_entity(rt_se, flags);

- if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
- enqueue_pushable_task(rq, p);
+ /*
+ * Current can't be pushed away. Selected is tied to current,
+ * so don't push it either.
+ */
+ if (task_current(rq, p) || task_current_selected(rq, p))
+ return;
+ /*
+ * Pinned tasks can't be pushed.
+ */
+ if (p->nr_cpus_allowed == 1)
+ return;
+
+ enqueue_pushable_task(rq, p);
}

static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -1779,6 +1790,10 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)

update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);

+ /* Avoid marking selected as pushable */
+ if (task_current_selected(rq, p))
+ return;
+
/*
* The previous task needs to be made eligible for pushing
* if it is still active
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6ea1dfbe502a..765ba10661de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2148,6 +2148,21 @@ static inline int task_current_selected(struct rq *rq, struct task_struct *p)
return rq_selected(rq) == p;
}

+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline bool task_is_blocked(struct task_struct *p)
+{
+ if (!sched_proxy_exec())
+ return false;
+
+ return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
+}
+#else /* !SCHED_PROXY_EXEC */
+static inline bool task_is_blocked(struct task_struct *p)
+{
+ return false;
+}
+#endif /* SCHED_PROXY_EXEC */
+
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:22:25

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 12/23] sched: Fix proxy/current (push,pull)ability

From: Valentin Schneider <[email protected]>

Proxy execution forms atomic pairs of tasks: The selected task
(scheduling context) and a proxy (execution context). The
selected task, along with the rest of the blocked chain,
follows the proxy wrt CPU placement.

They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however,
FIFO1 & FIFO42:

              ,-> RT42
              |     | blocked-on
              |     v
blocked_donor |   mutex
              |     | owner
              |     v
              `-- RT1

          RT1
          RT42

         CPU0            CPU1
          ^                ^
          |                |
      overloaded      !overloaded
      rq prio = 42    rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as
push/pullable.

Unfortunately, only the selected task is usually dequeued from the
rq, and the proxy'ed execution context (rq->curr) remains on the rq.
This can cause RT1 to be selected for migration from logic like the
rt pushable_list.

This patch adds a dequeue/enqueue cycle on the proxy task before
__schedule returns, which allows the sched class logic to avoid
adding the now current task to the pushable_list.

Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().

XXX: pinned tasks becoming unblocked should be removed from the push/pull
lists, but those don't get to see __schedule() straight away.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
v3:
* Tweaked comments & commit message
v5:
* Minor simplifications to utilize the fix earlier
in the patch series.
* Rework the wording of the commit message to match selected/
proxy terminology and expand a bit to make it more clear how
it works.
v6:
* Droped now-unused proxied value, to be re-added later in the
series when it is used, as caught by Dietmar
v7:
* Unused function argument fixup
* Commit message nit pointed out by Metin Kaya
* Droped unproven unlikely() and use sched_proxy_exec()
in proxy_tag_curr, suggested by Metin Kaya
---
kernel/sched/core.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12f5a0618328..f6bf3b62194c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6674,6 +6674,23 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
}
#endif /* SCHED_PROXY_EXEC */

+static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
+{
+ if (sched_proxy_exec()) {
+ /*
+ * pick_next_task() calls set_next_task() on the selected task
+ * at some point, which ensures it is not push/pullable.
+ * However, the selected task *and* the mutex owner form an
+ * atomic pair wrt push/pull.
+ *
+ * Make sure owner is not pushable. Unfortunately we can only
+ * deal with that by means of a dequeue/enqueue cycle. :-/
+ */
+ dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
+ enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+ }
+}
+
/*
* __schedule() is the main scheduler function.
*
@@ -6796,6 +6813,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
* changes to task_struct made by pick_next_task().
*/
RCU_INIT_POINTER(rq->curr, next);
+
+ if (!task_current_selected(rq, next))
+ proxy_tag_curr(rq, next);
+
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
@@ -6820,6 +6841,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
+ /* In case next was already curr but just got blocked_donor */
+ if (!task_current_selected(rq, next))
+ proxy_tag_curr(rq, next);
+
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq);
raw_spin_rq_unlock_irq(rq);
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:22:58

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 13/23] sched: Start blocked_on chain processing in find_proxy_task()

From: Peter Zijlstra <[email protected]>

Start to flesh out the real find_proxy_task() implementation,
but avoid the migration cases for now; in those cases just
deactivate the selected task and pick again.

To ensure the selected task or other blocked tasks in the chain
aren't migrated away while we're running the proxy, this patch
also tweaks CFS logic to avoid migrating selected or mutex
blocked tasks.
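
The chain walk at the core of this change can be pictured with the
stand-alone sketch below (illustrative types, no locking and no
deadlock detection, matching the TODO in the patch): follow
task->blocked_on->owner until reaching a task that isn't blocked, and
use that task as the execution context for the selected task:

  #include <stddef.h>
  #include <stdio.h>

  struct task;

  struct mutex_s {
          struct task *owner;
  };

  struct task {
          const char *comm;
          struct mutex_s *blocked_on;  /* mutex this task waits on, or NULL */
  };

  /* Follow task->blocked_on->owner->... to find who can actually run. */
  static struct task *find_proxy(struct task *selected)
  {
          struct task *p = selected;

          while (p->blocked_on && p->blocked_on->owner)
                  p = p->blocked_on->owner;
          return p;
  }

  int main(void)
  {
          struct task owner   = { "owner", NULL };
          struct mutex_s m1   = { &owner };
          struct task middle  = { "middle", &m1 };
          struct mutex_s m2   = { &middle };
          struct task rt      = { "rt-waiter", &m2 };

          printf("selected %s, run %s on its behalf\n",
                 rt.comm, find_proxy(&rt)->comm);
          return 0;
  }

In the real code each step of that loop is taken under mutex->wait_lock
and p->blocked_lock, and the walk bails back to pick_again whenever the
chain changes underneath it.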

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: This change was split out from the larger proxy patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split this out from larger proxy patch
v7:
* Minor refactoring of core find_proxy_task() function
* Minor spelling and wording corrections suggested by Metin Kaya
* Dropped an added BUG_ON that was frequently tripped
* Minor commit message tweaks from Metin Kaya
---
kernel/sched/core.c | 154 +++++++++++++++++++++++++++++++++++++-------
kernel/sched/fair.c | 9 ++-
2 files changed, 137 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f6bf3b62194c..42e25bbdfe6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -94,6 +94,7 @@
#include "../workqueue_internal.h"
#include "../../io_uring/io-wq.h"
#include "../smpboot.h"
+#include "../locking/mutex.h"

EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
@@ -6609,6 +6610,15 @@ static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,

#ifdef CONFIG_SCHED_PROXY_EXEC

+static inline struct task_struct *
+proxy_resched_idle(struct rq *rq, struct task_struct *next)
+{
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ set_tsk_need_resched(rq->idle);
+ return rq->idle;
+}
+
static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
{
unsigned long state = READ_ONCE(next->__state);
@@ -6618,48 +6628,138 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
return false;
if (!try_to_deactivate_task(rq, next, state, true))
return false;
- put_prev_task(rq, next);
- rq_set_selected(rq, rq->idle);
- resched_curr(rq);
+ proxy_resched_idle(rq, next);
return true;
}

/*
- * Initial simple proxy that just returns the task if it's waking
- * or deactivates the blocked task so we can pick something that
- * isn't blocked.
+ * Find who @next (currently blocked on a mutex) can proxy for.
+ *
+ * Follow the blocked-on relation:
+ * task->blocked_on -> mutex->owner -> task...
+ *
+ * Lock order:
+ *
+ * p->pi_lock
+ * rq->lock
+ * mutex->wait_lock
+ * p->blocked_lock
+ *
+ * Returns the task that is going to be used as execution context (the one
+ * that is actually going to be put to run on cpu_of(rq)).
*/
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
{
+ struct task_struct *owner = NULL;
struct task_struct *ret = NULL;
- struct task_struct *p = next;
+ struct task_struct *p;
struct mutex *mutex;
+ int this_cpu = cpu_of(rq);

- mutex = p->blocked_on;
- /* Something changed in the chain, so pick again */
- if (!mutex)
- return NULL;
/*
- * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
- * and ensure @owner sticks around.
+ * Follow blocked_on chain.
+ *
+ * TODO: deadlock detection
*/
- raw_spin_lock(&mutex->wait_lock);
- raw_spin_lock(&p->blocked_lock);
+ for (p = next; task_is_blocked(p); p = owner) {
+ mutex = p->blocked_on;
+ /* Something changed in the chain, so pick again */
+ if (!mutex)
+ return NULL;

- /* Check again that p is blocked with blocked_lock held */
- if (!task_is_blocked(p) || mutex != p->blocked_on) {
/*
- * Something changed in the blocked_on chain and
- * we don't know if only at this level. So, let's
- * just bail out completely and let __schedule
- * figure things out (pick_again loop).
+ * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+ * and ensure @owner sticks around.
*/
- goto out;
+ raw_spin_lock(&mutex->wait_lock);
+ raw_spin_lock(&p->blocked_lock);
+
+ /* Check again that p is blocked with blocked_lock held */
+ if (mutex != p->blocked_on) {
+ /*
+ * Something changed in the blocked_on chain and
+ * we don't know if only at this level. So, let's
+ * just bail out completely and let __schedule
+ * figure things out (pick_again loop).
+ */
+ goto out;
+ }
+
+ owner = __mutex_owner(mutex);
+ if (!owner) {
+ ret = p;
+ goto out;
+ }
+
+ if (task_cpu(owner) != this_cpu) {
+ /* XXX Don't handle migrations yet */
+ if (!proxy_deactivate(rq, next))
+ ret = next;
+ goto out;
+ }
+
+ if (task_on_rq_migrating(owner)) {
+ /*
+ * One of the chain of mutex owners is currently migrating to this
+ * CPU, but has not yet been enqueued because we are holding the
+ * rq lock. As a simple solution, just schedule rq->idle to give
+ * the migration a chance to complete. Much like the migrate_task
+ * case we should end up back in proxy(), this time hopefully with
+ * all relevant tasks already enqueued.
+ */
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return proxy_resched_idle(rq, next);
+ }
+
+ if (!owner->on_rq) {
+ /* XXX Don't handle blocked owners yet */
+ if (!proxy_deactivate(rq, next))
+ ret = next;
+ goto out;
+ }
+
+ if (owner == p) {
+ /*
+ * It's possible we interleave with mutex_unlock like:
+ *
+ * lock(&rq->lock);
+ * find_proxy_task()
+ * mutex_unlock()
+ * lock(&wait_lock);
+ * next(owner) = current->blocked_donor;
+ * unlock(&wait_lock);
+ *
+ * wake_up_q();
+ * ...
+ * ttwu_runnable()
+ * __task_rq_lock()
+ * lock(&wait_lock);
+ * owner == p
+ *
+ * Which leaves us to finish the ttwu_runnable() and make it go.
+ *
+ * So schedule rq->idle so that ttwu_runnable can get the rq lock
+ * and mark owner as running.
+ */
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return proxy_resched_idle(rq, next);
+ }
+
+ /*
+ * OK, now we're absolutely sure @owner is not blocked _and_
+ * on this rq, therefore holding @rq->lock is sufficient to
+ * guarantee its existence, as per ttwu_remote().
+ */
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
}

- if (!proxy_deactivate(rq, next))
- ret = p;
+ WARN_ON_ONCE(owner && !owner->on_rq);
+ return owner;
+
out:
raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
@@ -6738,6 +6838,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
struct rq_flags rf;
struct rq *rq;
int cpu;
+ bool preserve_need_resched = false;

cpu = smp_processor_id();
rq = cpu_rq(cpu);
@@ -6798,9 +6899,12 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq_repin_lock(rq, &rf);
goto pick_again;
}
+ if (next == rq->idle && prev == rq->idle)
+ preserve_need_resched = true;
}

- clear_tsk_need_resched(prev);
+ if (!preserve_need_resched)
+ clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
rq->last_seen_need_resched_ns = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 085941db5bf1..954b41e5b7df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8905,6 +8905,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (kthread_is_per_cpu(p))
return 0;

+ if (task_is_blocked(p))
+ return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;

@@ -8941,7 +8944,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/* Record that we found at least one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;

- if (task_on_cpu(env->src_rq, p)) {
+ if (task_on_cpu(env->src_rq, p) ||
+ task_current_selected(env->src_rq, p)) {
schedstat_inc(p->stats.nr_failed_migrations_running);
return 0;
}
@@ -8980,6 +8984,9 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
{
lockdep_assert_rq_held(env->src_rq);

+ BUG_ON(task_current(env->src_rq, p));
+ BUG_ON(task_current_selected(env->src_rq, p));
+
deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, env->dst_cpu);
}
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:23:10

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 14/23] sched: Handle blocked-waiter migration (and return migration)

Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run,
this patch also modifies the scheduling classes to avoid
sched class migrations on mutex blocked tasks, leaving
proxy() to do the migrations and return migrations.

This was split out from the larger proxy patch, and
significantly reworked.
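
As a rough illustration of the two migrations described above, here is
a standalone toy model (the names and the bitmask affinity
representation are made up for the example; the real code of course
operates on rqs and cpumasks under the proper locks):

    #include <stdio.h>

    /* Toy model: affinity is a bitmask of allowed CPUs. */
    struct toy_task {
            const char *name;
            unsigned int cpus_allowed;      /* bit i set => may run on CPU i */
            int cpu;                        /* CPU the task currently sits on */
    };

    /* Proxy migration: follow the blocked waiter to the lock owner's CPU,
     * even if that CPU is outside the waiter's affinity mask. */
    static void proxy_migrate(struct toy_task *waiter, int owner_cpu)
    {
            waiter->cpu = owner_cpu;
    }

    /* Return migration at wakeup: if the waiter was parked on a CPU it may
     * not run on, move it back to some allowed CPU before running it. */
    static void return_migrate(struct toy_task *waiter)
    {
            if (waiter->cpus_allowed & (1u << waiter->cpu))
                    return;                 /* current CPU is fine */
            for (int cpu = 0; cpu < 32; cpu++) {
                    if (waiter->cpus_allowed & (1u << cpu)) {
                            waiter->cpu = cpu;
                            return;
                    }
            }
    }

    int main(void)
    {
            struct toy_task waiter = { "waiter", 0x1 /* CPU0 only */, 0 };

            proxy_migrate(&waiter, 3);      /* owner runs on CPU3 */
            printf("boosting on CPU%d\n", waiter.cpu);
            return_migrate(&waiter);        /* got the mutex, go home */
            printf("running on CPU%d\n", waiter.cpu);
            return 0;
    }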

Credits for the original patch go to:
Peter Zijlstra (Intel) <[email protected]>
Juri Lelli <[email protected]>
Valentin Schneider <[email protected]>
Connor O'Brien <[email protected]>

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migrate to migrate deeper task in chain, to avoid
tasks ping-ponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo
fixes suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which
avoids nasty lock juggling and performance issues
* Fixes for UP builds
---
kernel/sched/core.c | 161 ++++++++++++++++++++++++++++++++++++++--
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 4 +-
kernel/sched/rt.c | 9 ++-
4 files changed, 164 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42e25bbdfe6b..55dc2a3b7e46 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2981,8 +2981,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
struct set_affinity_pending my_pending = { }, *pending = NULL;
bool stop_pending, complete = false;

- /* Can the task run on the task's current CPU? If so, we're done */
- if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
+ /*
+ * Can the task run on the task's current CPU? If so, we're done
+ *
+ * We are also done if the task is the currently selected donor,
+ * boosting a lock-holding proxy (and thus it may have been migrated
+ * outside its current or previous affinity mask).
+ */
+ if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
+ (task_current_selected(rq, p) && !task_current(rq, p))) {
struct task_struct *push_task = NULL;

if ((flags & SCA_MIGRATE_ENABLE) &&
@@ -3778,6 +3785,39 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
trace_sched_wakeup(p);
}

+#ifdef CONFIG_SMP
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+ if (!sched_proxy_exec())
+ return false;
+
+ if (task_current(rq, p))
+ return false;
+
+ if (p->blocked_on && p->blocked_on_state == BO_WAKING) {
+ raw_spin_lock(&p->blocked_lock);
+ if (!is_cpu_allowed(p, cpu_of(rq))) {
+ if (task_current_selected(rq, p)) {
+ put_prev_task(rq, p);
+ rq_set_selected(rq, rq->idle);
+ }
+ deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+ resched_curr(rq);
+ raw_spin_unlock(&p->blocked_lock);
+ return true;
+ }
+ resched_curr(rq);
+ raw_spin_unlock(&p->blocked_lock);
+ }
+ return false;
+}
+#else /* !CONFIG_SMP */
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+ return false;
+}
+#endif /* CONFIG_SMP */
+
static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
@@ -3870,9 +3910,12 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
update_rq_clock(rq);
wakeup_preempt(rq, p, wake_flags);
}
+ if (proxy_needs_return(rq, p))
+ goto out;
ttwu_do_wakeup(p);
ret = 1;
}
+out:
__task_rq_unlock(rq, &rf);

return ret;
@@ -4231,6 +4274,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
int cpu, success = 0;

if (p == current) {
+ WARN_ON(task_is_blocked(p));
/*
* We're waking current, this means 'p->on_rq' and 'task_cpu(p)
* == smp_processor_id()'. Together this means we can special
@@ -6632,6 +6676,91 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
return true;
}

+#ifdef CONFIG_SMP
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static struct task_struct *
+proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int target_cpu)
+{
+ struct rq *target_rq;
+ int wake_cpu;
+
+ lockdep_assert_rq_held(rq);
+ target_rq = cpu_rq(target_cpu);
+
+ /*
+ * Since we're going to drop @rq, we have to put(@rq_selected) first,
+ * otherwise we have a reference that no longer belongs to us. Use
+ * @rq->idle to fill the void and make the next pick_next_task()
+ * invocation happy.
+ *
+ * CPU0 CPU1
+ *
+ * B mutex_lock(X)
+ *
+ * A mutex_lock(X) <- B
+ * A __schedule()
+ * A pick->A
+ * A proxy->B
+ * A migrate A to CPU1
+ * B mutex_unlock(X) -> A
+ * B __schedule()
+ * B pick->A
+ * B switch_to (A)
+ * A ... does stuff
+ * A ... is still running here
+ *
+ * * BOOM *
+ */
+ put_prev_task(rq, rq_selected(rq));
+ rq_set_selected(rq, rq->idle);
+ set_next_task(rq, rq_selected(rq));
+ WARN_ON(p == rq->curr);
+
+ wake_cpu = p->wake_cpu;
+ deactivate_task(rq, p, 0);
+ set_task_cpu(p, target_cpu);
+ /*
+ * Preserve p->wake_cpu, such that we can tell where it
+ * used to run later.
+ */
+ p->wake_cpu = wake_cpu;
+
+ rq_unpin_lock(rq, rf);
+ __balance_callbacks(rq);
+
+ raw_spin_rq_unlock(rq);
+ raw_spin_rq_lock(target_rq);
+
+ activate_task(target_rq, p, 0);
+ wakeup_preempt(target_rq, p, 0);
+
+ raw_spin_rq_unlock(target_rq);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+
+ return NULL; /* Retry task selection on _this_ CPU. */
+}
+#else /* !CONFIG_SMP */
+static struct task_struct *
+proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int target_cpu)
+{
+ return NULL;
+}
+#endif /* CONFIG_SMP */
+
/*
* Find who @next (currently blocked on a mutex) can proxy for.
*
@@ -6654,8 +6783,11 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
struct task_struct *owner = NULL;
struct task_struct *ret = NULL;
struct task_struct *p;
+ int cur_cpu, target_cpu;
struct mutex *mutex;
- int this_cpu = cpu_of(rq);
+ bool curr_in_chain = false;
+
+ cur_cpu = cpu_of(rq);

/*
* Follow blocked_on chain.
@@ -6686,17 +6818,27 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
goto out;
}

+ if (task_current(rq, p))
+ curr_in_chain = true;
+
owner = __mutex_owner(mutex);
if (!owner) {
ret = p;
goto out;
}

- if (task_cpu(owner) != this_cpu) {
- /* XXX Don't handle migrations yet */
- if (!proxy_deactivate(rq, next))
- ret = next;
- goto out;
+ if (task_cpu(owner) != cur_cpu) {
+ target_cpu = task_cpu(owner);
+ /*
+ * @owner can disappear, simply migrate to @target_cpu and leave that CPU
+ * to sort things out.
+ */
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ if (curr_in_chain)
+ return proxy_resched_idle(rq, next);
+
+ return proxy_migrate_task(rq, rf, p, target_cpu);
}

if (task_on_rq_migrating(owner)) {
@@ -6999,6 +7141,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
*/
SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);

+ if (task_is_blocked(tsk))
+ return;
+
/*
* If we are going to sleep and we have plugged IO queued,
* make sure to submit it to avoid deadlocks.
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9cf20f4ac5f9..4f998549ea74 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1705,7 +1705,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)

enqueue_dl_entity(&p->dl, flags);

- if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+ if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p))
enqueue_pushable_dl_task(rq, p);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 954b41e5b7df..8e3f118f6d6e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8372,7 +8372,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
goto idle;

#ifdef CONFIG_FAIR_GROUP_SCHED
- if (!prev || prev->sched_class != &fair_sched_class)
+ if (!prev ||
+ prev->sched_class != &fair_sched_class ||
+ rq->curr != rq_selected(rq))
goto simple;

/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 81cd22eaa6dc..a7b51a021111 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1503,6 +1503,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
if (p->nr_cpus_allowed == 1)
return;

+ if (task_is_blocked(p))
+ return;
+
enqueue_pushable_task(rq, p);
}

@@ -1790,10 +1793,12 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)

update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);

- /* Avoid marking selected as pushable */
- if (task_current_selected(rq, p))
+ /* Avoid marking current or selected as pushable */
+ if (task_current(rq, p) || task_current_selected(rq, p))
return;

+ if (task_is_blocked(p))
+ return;
/*
* The previous task needs to be made eligible for pushing
* if it is still active
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:23:21

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 15/23] sched: Add blocked_donor link to task for smarter mutex handoffs

From: Peter Zijlstra <[email protected]>

Add link to the task this task is proxying for, and use it so we
do intelligent hand-off of the owned mutex to the task we're
running on behalf of.
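
The hand-off policy is easiest to see in a small standalone sketch
(toy types, not the kernel's; the real code does this under the mutex
wait_lock and the tasks' blocked_locks):

    #include <stdio.h>
    #include <stddef.h>

    struct toy_mutex;
    struct toy_task {
            const char *name;
            struct toy_mutex *blocked_on;
            struct toy_task *blocked_donor; /* who has been proxying for us */
    };
    struct toy_mutex {
            struct toy_task *first_waiter;  /* head of the wait list */
    };

    /* On unlock, prefer handing the lock to the donor that boosted us
     * through this lock; otherwise fall back to the first waiter. */
    static struct toy_task *pick_next_owner(struct toy_task *curr,
                                            struct toy_mutex *lock)
    {
            struct toy_task *donor = curr->blocked_donor;

            if (donor && donor->blocked_on == lock) {
                    curr->blocked_donor = NULL;
                    return donor;
            }
            return lock->first_waiter;
    }

    int main(void)
    {
            struct toy_mutex lock = { NULL };
            struct toy_task waiter = { "plain waiter", &lock, NULL };
            struct toy_task donor  = { "donor",        &lock, NULL };
            struct toy_task owner  = { "owner",        NULL,  &donor };

            lock.first_waiter = &waiter;
            printf("hand off to: %s\n", pick_next_owner(&owner, &lock)->name);
            return 0;
    }

This prints "hand off to: donor", matching the behavior the patch adds
to __mutex_unlock_slowpath() below.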

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split out from larger proxy patch
v6:
* Moved proxied value from earlier patch to this one where it
is actually used
* Rework logic to check sched_proxy_exec() instead of using ifdefs
* Moved comment change to this patch where it makes sense
v7:
* Use more descriptive term than "us" in comments, as suggested
by Metin Kaya.
* Minor typo fixup from Metin Kaya
* Reworked proxied variable to prev_not_proxied to simplify usage
---
include/linux/sched.h | 1 +
kernel/fork.c | 1 +
kernel/locking/mutex.c | 35 ++++++++++++++++++++++++++++++++---
kernel/sched/core.c | 19 +++++++++++++++++--
4 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 880af1c3097d..8020e224e057 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1157,6 +1157,7 @@ struct task_struct {

enum blocked_on_state blocked_on_state;
struct mutex *blocked_on; /* lock we're blocked on */
+ struct task_struct *blocked_donor; /* task that is boosting this task */
raw_spinlock_t blocked_lock;

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
diff --git a/kernel/fork.c b/kernel/fork.c
index b3ba3d22d8b2..138fc23cad43 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2459,6 +2459,7 @@ __latent_entropy struct task_struct *copy_process(

p->blocked_on_state = BO_RUNNABLE;
p->blocked_on = NULL; /* not blocked yet */
+ p->blocked_donor = NULL; /* nobody is boosting p yet */
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
p->sequential_io_avg = 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 11dc5cb7a5a3..2711af8c0052 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -910,7 +910,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
*/
static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
{
- struct task_struct *next = NULL;
+ struct task_struct *donor, *next = NULL;
DEFINE_WAKE_Q(wake_q);
unsigned long owner;
unsigned long flags;
@@ -948,7 +948,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
preempt_disable();
raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
- if (!list_empty(&lock->wait_list)) {
+
+ if (sched_proxy_exec()) {
+ raw_spin_lock(&current->blocked_lock);
+ /*
+ * If we have a task boosting current, and that task was boosting
+ * current through this lock, hand the lock to that task, as that
+ * is the highest waiter, as selected by the scheduling function.
+ */
+ donor = current->blocked_donor;
+ if (donor) {
+ struct mutex *next_lock;
+
+ raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
+ next_lock = get_task_blocked_on(donor);
+ if (next_lock == lock) {
+ next = donor;
+ donor->blocked_on_state = BO_WAKING;
+ wake_q_add(&wake_q, donor);
+ current->blocked_donor = NULL;
+ }
+ raw_spin_unlock(&donor->blocked_lock);
+ }
+ }
+
+ /*
+ * Failing that, pick any on the wait list.
+ */
+ if (!next && !list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
struct mutex_waiter *waiter =
list_first_entry(&lock->wait_list,
@@ -956,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne

next = waiter->task;

- raw_spin_lock(&next->blocked_lock);
+ raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING);
debug_mutex_wake_waiter(lock, waiter);
WARN_ON(next->blocked_on != lock);
next->blocked_on_state = BO_WAKING;
@@ -967,6 +994,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);

+ if (sched_proxy_exec())
+ raw_spin_unlock(&current->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 55dc2a3b7e46..e0afa228bc9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6765,7 +6765,17 @@ proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
* Find who @next (currently blocked on a mutex) can proxy for.
*
* Follow the blocked-on relation:
- * task->blocked_on -> mutex->owner -> task...
+ *
+ * ,-> task
+ * | | blocked-on
+ * | v
+ * blocked_donor | mutex
+ * | | owner
+ * | v
+ * `-- task
+ *
+ * and set the blocked_donor relation; the latter is used by the mutex
+ * code to find which (blocked) task to hand off to.
*
* Lock order:
*
@@ -6897,6 +6907,8 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
*/
raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
+
+ owner->blocked_donor = p;
}

WARN_ON_ONCE(owner && !owner->on_rq);
@@ -6979,6 +6991,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
+ bool prev_not_proxied;
int cpu;
bool preserve_need_resched = false;

@@ -7030,9 +7043,11 @@ static void __sched notrace __schedule(unsigned int sched_mode)
switch_count = &prev->nvcsw;
}

+ prev_not_proxied = !prev->blocked_donor;
pick_again:
next = pick_next_task(rq, rq_selected(rq), &rf);
rq_set_selected(rq, next);
+ next->blocked_donor = NULL;
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
if (!next) {
@@ -7088,7 +7103,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq = context_switch(rq, prev, next, &rf);
} else {
/* In case next was already curr but just got blocked_donor */
- if (!task_current_selected(rq, next))
+ if (prev_not_proxied && next->blocked_donor)
proxy_tag_curr(rq, next);

rq_unpin_lock(rq, &rf);
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:23:40

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 16/23] sched: Add deactivated (sleeping) owner handling to find_proxy_task()

From: Peter Zijlstra <[email protected]>

If the blocked_on chain resolves to a sleeping owner, deactivate
the selected task and enqueue it on the sleeping owner task.
Then re-activate it later when the owner is woken up.

NOTE: This has been particularly challenging to get working
properly, and some of the locking is particularly awkward. I'd
very much appreciate review and feedback for ways to simplify
this.
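
At its core the bookkeeping is: park the blocked task on the sleeping
owner, then re-activate everything parked there once the owner wakes.
A minimal standalone sketch of just that idea (toy types, no locking,
unlike the real code below):

    #include <stdio.h>

    #define MAX_BLOCKED 4

    /* Toy model of parking blocked tasks on a sleeping owner and waking
     * them all back up once the owner becomes runnable again. */
    struct toy_task {
            const char *name;
            int on_rq;
            struct toy_task *blocked[MAX_BLOCKED];  /* stand-in for blocked_head */
            int nr_blocked;
    };

    static void enqueue_on_sleeping_owner(struct toy_task *owner,
                                          struct toy_task *waiter)
    {
            waiter->on_rq = 0;                      /* deactivated */
            owner->blocked[owner->nr_blocked++] = waiter;
    }

    static void wake_owner(struct toy_task *owner)
    {
            owner->on_rq = 1;
            /* Re-activate everyone who was parked on this owner. */
            for (int i = 0; i < owner->nr_blocked; i++) {
                    owner->blocked[i]->on_rq = 1;
                    printf("re-activated %s\n", owner->blocked[i]->name);
            }
            owner->nr_blocked = 0;
    }

    int main(void)
    {
            struct toy_task owner  = { "owner",  0, { NULL }, 0 };
            struct toy_task waiter = { "waiter", 1, { NULL }, 0 };

            enqueue_on_sleeping_owner(&owner, &waiter);
            wake_owner(&owner);
            return 0;
    }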

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: This was broken out from the larger proxy() patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split out from larger proxy patch
v6:
* Major rework, replacing the single list head per task with
per-task list head and nodes, creating a tree structure so
we only wake up descendants of the task woken.
* Reworked the locking to take the task->pi_lock, so we can
avoid mid-chain wakeup races from try_to_wake_up() called by
the ww_mutex logic.
v7:
* Drop unnecessary __nested lock annotation, as we already drop
the lock prior.
* Add comments on #else & #endif lines, clearer function
names, and commit message tweaks as suggested by Metin Kaya
* Move activate_blocked_entities() call from ttwu_queue to
try_to_wake_up() to simplify locking. Thanks to questions from
Metin Kaya
* Fix irqsave/irqrestore usage now that we call this outside where
the pi_lock is held
* Fix activate_blocked_entities not preserving wake_cpu
* Fix for UP builds
---
include/linux/sched.h | 3 +
kernel/fork.c | 4 +
kernel/sched/core.c | 214 ++++++++++++++++++++++++++++++++++++++----
3 files changed, 202 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8020e224e057..6f982948a105 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1158,6 +1158,9 @@ struct task_struct {
enum blocked_on_state blocked_on_state;
struct mutex *blocked_on; /* lock we're blocked on */
struct task_struct *blocked_donor; /* task that is boosting this task */
+ struct list_head blocked_head; /* tasks blocked on this task */
+ struct list_head blocked_node; /* our entry on someone else's blocked_head */
+ struct task_struct *sleeping_owner; /* task our blocked_node is enqueued on */
raw_spinlock_t blocked_lock;

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
diff --git a/kernel/fork.c b/kernel/fork.c
index 138fc23cad43..56f5e19c268e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2460,6 +2460,10 @@ __latent_entropy struct task_struct *copy_process(
p->blocked_on_state = BO_RUNNABLE;
p->blocked_on = NULL; /* not blocked yet */
p->blocked_donor = NULL; /* nobody is boosting p yet */
+
+ INIT_LIST_HEAD(&p->blocked_head);
+ INIT_LIST_HEAD(&p->blocked_node);
+ p->sleeping_owner = NULL;
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
p->sequential_io_avg = 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e0afa228bc9d..0cd63bd0bdcd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3785,6 +3785,133 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
trace_sched_wakeup(p);
}

+#ifdef CONFIG_SCHED_PROXY_EXEC
+static void do_activate_task(struct rq *rq, struct task_struct *p, int en_flags)
+{
+ lockdep_assert_rq_held(rq);
+
+ if (!sched_proxy_exec()) {
+ activate_task(rq, p, en_flags);
+ return;
+ }
+
+ if (p->sleeping_owner) {
+ struct task_struct *owner = p->sleeping_owner;
+
+ raw_spin_lock(&owner->blocked_lock);
+ list_del_init(&p->blocked_node);
+ p->sleeping_owner = NULL;
+ raw_spin_unlock(&owner->blocked_lock);
+ }
+
+ /*
+ * By calling activate_task with blocked_lock held, we
+ * order against the find_proxy_task() blocked_task case
+ * such that no more blocked tasks will be enqueued on p
+ * once we release p->blocked_lock.
+ */
+ raw_spin_lock(&p->blocked_lock);
+ WARN_ON(task_cpu(p) != cpu_of(rq));
+ activate_task(rq, p, en_flags);
+ raw_spin_unlock(&p->blocked_lock);
+}
+
+#ifdef CONFIG_SMP
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+ unsigned int wake_cpu;
+
+ /* Preserve wake_cpu */
+ wake_cpu = p->wake_cpu;
+ __set_task_cpu(p, cpu);
+ p->wake_cpu = wake_cpu;
+}
+#else /* !CONFIG_SMP */
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+ __set_task_cpu(p, cpu);
+}
+#endif /* CONFIG_SMP */
+
+static void activate_blocked_entities(struct rq *target_rq,
+ struct task_struct *owner,
+ int wake_flags)
+{
+ unsigned long flags;
+ struct rq_flags rf;
+ int target_cpu = cpu_of(target_rq);
+ int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
+
+ if (wake_flags & WF_MIGRATED)
+ en_flags |= ENQUEUE_MIGRATED;
+ /*
+ * A whole bunch of 'proxy' tasks back this blocked task, wake
+ * them all up to give this task its 'fair' share.
+ */
+ raw_spin_lock_irqsave(&owner->blocked_lock, flags);
+ while (!list_empty(&owner->blocked_head)) {
+ struct task_struct *pp;
+ unsigned int state;
+
+ pp = list_first_entry(&owner->blocked_head,
+ struct task_struct,
+ blocked_node);
+ BUG_ON(pp == owner);
+ list_del_init(&pp->blocked_node);
+ WARN_ON(!pp->sleeping_owner);
+ pp->sleeping_owner = NULL;
+ raw_spin_unlock_irqrestore(&owner->blocked_lock, flags);
+
+ raw_spin_lock_irqsave(&pp->pi_lock, flags);
+ state = READ_ONCE(pp->__state);
+ /* Avoid racing with ttwu */
+ if (state == TASK_WAKING) {
+ raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
+ raw_spin_lock_irqsave(&owner->blocked_lock, flags);
+ continue;
+ }
+ if (READ_ONCE(pp->on_rq)) {
+ /*
+ * We raced with a non mutex handoff activation of pp.
+ * That activation will also take care of activating
+ * all of the tasks after pp in the blocked_entry list,
+ * so we're done here.
+ */
+ raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
+ raw_spin_lock_irqsave(&owner->blocked_lock, flags);
+ continue;
+ }
+
+ proxy_set_task_cpu(pp, target_cpu);
+
+ rq_lock_irqsave(target_rq, &rf);
+ update_rq_clock(target_rq);
+ do_activate_task(target_rq, pp, en_flags);
+ resched_curr(target_rq);
+ rq_unlock_irqrestore(target_rq, &rf);
+ raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
+
+ /* recurse - XXX This needs to be reworked to avoid recursing */
+ activate_blocked_entities(target_rq, pp, wake_flags);
+
+ raw_spin_lock_irqsave(&owner->blocked_lock, flags);
+ }
+ raw_spin_unlock_irqrestore(&owner->blocked_lock, flags);
+}
+#else /* !CONFIG_SCHED_PROXY_EXEC */
+static inline void do_activate_task(struct rq *rq, struct task_struct *p,
+ int en_flags)
+{
+ activate_task(rq, p, en_flags);
+}
+
+static inline void activate_blocked_entities(struct rq *target_rq,
+ struct task_struct *owner,
+ int wake_flags)
+{
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
#ifdef CONFIG_SMP
static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
{
@@ -3839,7 +3966,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
atomic_dec(&task_rq(p)->nr_iowait);
}

- activate_task(rq, p, en_flags);
+ do_activate_task(rq, p, en_flags);
wakeup_preempt(rq, p, wake_flags);

ttwu_do_wakeup(p);
@@ -3936,13 +4063,19 @@ void sched_ttwu_pending(void *arg)
update_rq_clock(rq);

llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
+ int wake_flags;
if (WARN_ON_ONCE(p->on_cpu))
smp_cond_load_acquire(&p->on_cpu, !VAL);

if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
set_task_cpu(p, cpu_of(rq));

- ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+ wake_flags = p->sched_remote_wakeup ? WF_MIGRATED : 0;
+ ttwu_do_activate(rq, p, wake_flags, &rf);
+ rq_unlock(rq, &rf);
+ activate_blocked_entities(rq, p, wake_flags);
+ rq_lock(rq, &rf);
+ update_rq_clock(rq);
}

/*
@@ -4423,6 +4556,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
if (p->blocked_on_state == BO_WAKING)
p->blocked_on_state = BO_RUNNABLE;
raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
+ activate_blocked_entities(cpu_rq(cpu), p, wake_flags);
out:
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
@@ -6663,19 +6797,6 @@ proxy_resched_idle(struct rq *rq, struct task_struct *next)
return rq->idle;
}

-static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
-{
- unsigned long state = READ_ONCE(next->__state);
-
- /* Don't deactivate if the state has been changed to TASK_RUNNING */
- if (state == TASK_RUNNING)
- return false;
- if (!try_to_deactivate_task(rq, next, state, true))
- return false;
- proxy_resched_idle(rq, next);
- return true;
-}
-
#ifdef CONFIG_SMP
/*
* If the blocked-on relationship crosses CPUs, migrate @p to the
@@ -6761,6 +6882,31 @@ proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
}
#endif /* CONFIG_SMP */

+static void proxy_enqueue_on_owner(struct rq *rq, struct task_struct *owner,
+ struct task_struct *next)
+{
+ /*
+ * ttwu_activate() will pick them up and place them on whatever rq
+ * @owner will run next.
+ */
+ if (!owner->on_rq) {
+ BUG_ON(!next->on_rq);
+ deactivate_task(rq, next, DEQUEUE_SLEEP);
+ if (task_current_selected(rq, next)) {
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ }
+ /*
+ * ttwu_do_activate must not have a chance to activate p
+ * elsewhere before it's fully extricated from its old rq.
+ */
+ WARN_ON(next->sleeping_owner);
+ next->sleeping_owner = owner;
+ smp_mb();
+ list_add(&next->blocked_node, &owner->blocked_head);
+ }
+}
+
/*
* Find who @next (currently blocked on a mutex) can proxy for.
*
@@ -6866,10 +7012,40 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
}

if (!owner->on_rq) {
- /* XXX Don't handle blocked owners yet */
- if (!proxy_deactivate(rq, next))
- ret = next;
- goto out;
+ /*
+ * rq->curr must not be added to the blocked_head list or else
+ * ttwu_do_activate could enqueue it elsewhere before it switches
+ * out here. The approach to avoid this is the same as in the
+ * migrate_task case.
+ */
+ if (curr_in_chain) {
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return proxy_resched_idle(rq, next);
+ }
+
+ /*
+ * If !@owner->on_rq, holding @rq->lock will not pin the task,
+ * so we cannot drop @mutex->wait_lock until we're sure it's a blocked
+ * task on this rq.
+ *
+ * We use @owner->blocked_lock to serialize against ttwu_activate().
+ * Either we see its new owner->on_rq or it will see our list_add().
+ */
+ if (owner != p) {
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_lock(&owner->blocked_lock);
+ }
+
+ proxy_enqueue_on_owner(rq, owner, next);
+
+ if (task_current_selected(rq, next)) {
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ }
+ raw_spin_unlock(&owner->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return NULL; /* retry task selection */
}

if (owner == p) {
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:24:22

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 18/23] sched: Add push_task_chain helper

From: Connor O'Brien <[email protected]>

Switch logic that deactivates, sets the task cpu,
and reactivates a task on a different rq to use a
helper that will be later extended to push entire
blocked task chains.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.
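
For a preview of where this is headed, a purely hypothetical sketch of
how the helper might later walk a chain via blocked_donor (this is
*not* the code in this patch, which only moves a single task):

    /*
     * Hypothetical sketch only: a chain-aware version could walk the
     * blocked_donor links so the waiters boosting @task follow it to
     * the destination rq.
     */
    static inline
    void push_task_chain_sketch(struct rq *rq, struct rq *dst_rq,
                                struct task_struct *task)
    {
            while (task) {
                    deactivate_task(rq, task, 0);
                    set_task_cpu(task, dst_rq->cpu);
                    activate_task(dst_rq, task, 0);
                    task = task->blocked_donor; /* next waiter boosting this one */
            }
    }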

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch]
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched/core.c | 4 +---
kernel/sched/deadline.c | 8 ++------
kernel/sched/rt.c | 8 ++------
kernel/sched/sched.h | 9 +++++++++
4 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0cd63bd0bdcd..0c212dcd4b7a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2721,9 +2721,7 @@ int push_cpu_stop(void *arg)

// XXX validate p is still the highest prio task
if (task_rq(p) == rq) {
- deactivate_task(rq, p, 0);
- set_task_cpu(p, lowest_rq->cpu);
- activate_task(lowest_rq, p, 0);
+ push_task_chain(rq, lowest_rq, p);
resched_curr(lowest_rq);
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4f998549ea74..def1eb23318b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2313,9 +2313,7 @@ static int push_dl_task(struct rq *rq)
goto retry;
}

- deactivate_task(rq, next_task, 0);
- set_task_cpu(next_task, later_rq->cpu);
- activate_task(later_rq, next_task, 0);
+ push_task_chain(rq, later_rq, next_task);
ret = 1;

resched_curr(later_rq);
@@ -2401,9 +2399,7 @@ static void pull_dl_task(struct rq *this_rq)
if (is_migration_disabled(p)) {
push_task = get_push_task(src_rq);
} else {
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ push_task_chain(src_rq, this_rq, p);
dmin = p->dl.deadline;
resched = true;
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a7b51a021111..cf0eb4aac613 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2128,9 +2128,7 @@ static int push_rt_task(struct rq *rq, bool pull)
goto retry;
}

- deactivate_task(rq, next_task, 0);
- set_task_cpu(next_task, lowest_rq->cpu);
- activate_task(lowest_rq, next_task, 0);
+ push_task_chain(rq, lowest_rq, next_task);
resched_curr(lowest_rq);
ret = 1;

@@ -2401,9 +2399,7 @@ static void pull_rt_task(struct rq *this_rq)
if (is_migration_disabled(p)) {
push_task = get_push_task(src_rq);
} else {
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ push_task_chain(src_rq, this_rq, p);
resched = true;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 765ba10661de..19afe532771f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3546,5 +3546,14 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }

extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+#ifdef CONFIG_SMP
+static inline
+void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
+{
+ deactivate_task(rq, task, 0);
+ set_task_cpu(task, dst_rq->cpu);
+ activate_task(dst_rq, task, 0);
+}
+#endif

#endif /* _KERNEL_SCHED_SCHED_H */
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:24:31

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 19/23] sched: Consolidate pick_*_task to task_is_pushable helper

From: Connor O'Brien <[email protected]>

This patch consolidates the rt and deadline pick_*_task functions
into a task_is_pushable() helper.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch,
renamed helper function]
Signed-off-by: John Stultz <[email protected]>
---
v7:
* Split from chain migration patch
* Renamed function
---
kernel/sched/deadline.c | 10 +---------
kernel/sched/rt.c | 11 +----------
kernel/sched/sched.h | 10 ++++++++++
3 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index def1eb23318b..1f3bc50de678 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2049,14 +2049,6 @@ static void task_fork_dl(struct task_struct *p)
/* Only try algorithms three times */
#define DL_MAX_TRIES 3

-static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
-{
- if (!task_on_cpu(rq, p) &&
- cpumask_test_cpu(cpu, &p->cpus_mask))
- return 1;
- return 0;
-}
-
/*
* Return the earliest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise:
@@ -2075,7 +2067,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
if (next_node) {
p = __node_2_pdl(next_node);

- if (pick_dl_task(rq, p, cpu))
+ if (task_is_pushable(rq, p, cpu) == 1)
return p;

next_node = rb_next(next_node);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index cf0eb4aac613..15161de88753 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1812,15 +1812,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
/* Only try algorithms three times */
#define RT_MAX_TRIES 3

-static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
-{
- if (!task_on_cpu(rq, p) &&
- cpumask_test_cpu(cpu, &p->cpus_mask))
- return 1;
-
- return 0;
-}
-
/*
* Return the highest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise
@@ -1834,7 +1825,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
return NULL;

plist_for_each_entry(p, head, pushable_tasks) {
- if (pick_rt_task(rq, p, cpu))
+ if (task_is_pushable(rq, p, cpu) == 1)
return p;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 19afe532771f..ef3d327e267c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3554,6 +3554,16 @@ void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
set_task_cpu(task, dst_rq->cpu);
activate_task(dst_rq, task, 0);
}
+
+static inline
+int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
+{
+ if (!task_on_cpu(rq, p) &&
+ cpumask_test_cpu(cpu, &p->cpus_mask))
+ return 1;
+
+ return 0;
+}
#endif

#endif /* _KERNEL_SCHED_SCHED_H */
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:24:40

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 17/23] sched: Initial sched_football test implementation

Reimplementation of the sched_football test from LTP:
https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c

But reworked to run in the kernel and utilize mutexes
to illustrate proper boosting of low priority mutex
holders.

TODO:
* Need a rt_mutex version so it can work w/o proxy-execution
* Need a better place to put it

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched/Makefile | 1 +
kernel/sched/test_sched_football.c | 242 +++++++++++++++++++++++++++++
lib/Kconfig.debug | 14 ++
3 files changed, 257 insertions(+)
create mode 100644 kernel/sched/test_sched_football.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 976092b7bd45..2729d565dfd7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -32,3 +32,4 @@ obj-y += core.o
obj-y += fair.o
obj-y += build_policy.o
obj-y += build_utility.o
+obj-$(CONFIG_SCHED_RT_INVARIENT_TEST) += test_sched_football.o
diff --git a/kernel/sched/test_sched_football.c b/kernel/sched/test_sched_football.c
new file mode 100644
index 000000000000..9742c45c0fe0
--- /dev/null
+++ b/kernel/sched/test_sched_football.c
@@ -0,0 +1,242 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Module-based test case for RT scheduling invariant
+ *
+ * A reimplementation of my old sched_football test
+ * found in LTP:
+ * https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
+ *
+ * Similar to that test, this tries to validate the RT
+ * scheduling invariant: that across N available cpus, the
+ * top N priority tasks are always running.
+ *
+ * This is done via having N offensive players that are
+ * medium priority, which are constantly trying to increment the
+ * ball_pos counter.
+ *
+ * Blocking this are N defensive players that are higher
+ * priority, which just spin on the cpu, preventing the medium
+ * priority tasks from running.
+ *
+ * To complicate this, there are also N defensive low priority
+ * tasks. These start first and each acquire one of N mutexes.
+ * The high priority defense tasks will later try to grab the
+ * mutexes and block, opening a window for the offensive tasks
+ * to run and increment the ball. If priority inheritance or
+ * proxy execution is used, the low priority defense players
+ * should be boosted to the high priority levels, and will
+ * prevent the mid priority offensive tasks from running.
+ *
+ * Copyright © International Business Machines Corp., 2007, 2008
+ * Copyright (C) Google, 2023
+ *
+ * Authors: John Stultz <[email protected]>
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/delay.h>
+#include <linux/sched/rt.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/smp.h>
+#include <linux/slab.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <uapi/linux/sched/types.h>
+#include <linux/rtmutex.h>
+
+atomic_t players_ready;
+atomic_t ball_pos;
+int players_per_team;
+bool game_over;
+
+struct mutex *mutex_low_list;
+struct mutex *mutex_mid_list;
+
+static inline
+struct task_struct *create_fifo_thread(int (*threadfn)(void *data), void *data,
+ char *name, int prio)
+{
+ struct task_struct *kth;
+ struct sched_attr attr = {
+ .size = sizeof(struct sched_attr),
+ .sched_policy = SCHED_FIFO,
+ .sched_nice = 0,
+ .sched_priority = prio,
+ };
+ int ret;
+
+ kth = kthread_create(threadfn, data, name);
+ if (IS_ERR(kth)) {
+ pr_warn("%s: error, kthread_create failed\n", __func__);
+ return kth;
+ }
+ ret = sched_setattr_nocheck(kth, &attr);
+ if (ret) {
+ kthread_stop(kth);
+ pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+ return ERR_PTR(ret);
+ }
+
+ wake_up_process(kth);
+ return kth;
+}
+
+int defense_low_thread(void *arg)
+{
+ long tnum = (long)arg;
+
+ atomic_inc(&players_ready);
+ mutex_lock(&mutex_low_list[tnum]);
+ while (!READ_ONCE(game_over)) {
+ if (kthread_should_stop())
+ break;
+ schedule();
+ }
+ mutex_unlock(&mutex_low_list[tnum]);
+ return 0;
+}
+
+int defense_mid_thread(void *arg)
+{
+ long tnum = (long)arg;
+
+ atomic_inc(&players_ready);
+ mutex_lock(&mutex_mid_list[tnum]);
+ mutex_lock(&mutex_low_list[tnum]);
+ while (!READ_ONCE(game_over)) {
+ if (kthread_should_stop())
+ break;
+ schedule();
+ }
+ mutex_unlock(&mutex_low_list[tnum]);
+ mutex_unlock(&mutex_mid_list[tnum]);
+ return 0;
+}
+
+int offense_thread(void *)
+{
+ atomic_inc(&players_ready);
+ while (!READ_ONCE(game_over)) {
+ if (kthread_should_stop())
+ break;
+ schedule();
+ atomic_inc(&ball_pos);
+ }
+ return 0;
+}
+
+int defense_hi_thread(void *arg)
+{
+ long tnum = (long)arg;
+
+ atomic_inc(&players_ready);
+ mutex_lock(&mutex_mid_list[tnum]);
+ while (!READ_ONCE(game_over)) {
+ if (kthread_should_stop())
+ break;
+ schedule();
+ }
+ mutex_unlock(&mutex_mid_list[tnum]);
+ return 0;
+}
+
+int crazy_fan_thread(void *)
+{
+ int count = 0;
+
+ atomic_inc(&players_ready);
+ while (!READ_ONCE(game_over)) {
+ if (kthread_should_stop())
+ break;
+ schedule();
+ udelay(1000);
+ msleep(2);
+ count++;
+ }
+ return 0;
+}
+
+int ref_thread(void *arg)
+{
+ struct task_struct *kth;
+ long game_time = (long)arg;
+ unsigned long final_pos;
+ long i;
+
+ pr_info("%s: started ref, game_time: %ld secs !\n", __func__,
+ game_time);
+
+ /* Create low priority defensive team */
+ for (i = 0; i < players_per_team; i++)
+ kth = create_fifo_thread(defense_low_thread, (void *)i,
+ "defense-low-thread", 2);
+ /* Wait for the defense threads to start */
+ while (atomic_read(&players_ready) < players_per_team)
+ msleep(1);
+
+ for (i = 0; i < players_per_team; i++)
+ kth = create_fifo_thread(defense_mid_thread,
+ (void *)(players_per_team - i - 1),
+ "defense-mid-thread", 3);
+ /* Wait for the defense threads to start */
+ while (atomic_read(&players_ready) < players_per_team * 2)
+ msleep(1);
+
+ /* Create mid priority offensive team */
+ for (i = 0; i < players_per_team; i++)
+ kth = create_fifo_thread(offense_thread, NULL,
+ "offense-thread", 5);
+ /* Wait for the offense threads to start */
+ while (atomic_read(&players_ready) < players_per_team * 3)
+ msleep(1);
+
+ /* Create high priority defensive team */
+ for (i = 0; i < players_per_team; i++)
+ kth = create_fifo_thread(defense_hi_thread, (void *)i,
+ "defense-hi-thread", 10);
+ /* Wait for the defense threads to start */
+ while (atomic_read(&players_ready) < players_per_team * 4)
+ msleep(1);
+
+ /* Create very high priority crazy fan threads */
+ for (i = 0; i < players_per_team; i++)
+ kth = create_fifo_thread(crazy_fan_thread, NULL,
+ "crazy-fan-thread", 15);
+ /* Wait for the crazy fan threads to start */
+ while (atomic_read(&players_ready) < players_per_team * 5)
+ msleep(1);
+
+ pr_info("%s: all players checked in! Starting game.\n", __func__);
+ atomic_set(&ball_pos, 0);
+ msleep(game_time * 1000);
+ final_pos = atomic_read(&ball_pos);
+ pr_info("%s: final ball_pos: %lu\n", __func__, final_pos);
+ WARN_ON(final_pos != 0);
+ game_over = true;
+ return 0;
+}
+
+static int __init test_sched_football_init(void)
+{
+ struct task_struct *kth;
+ int i;
+
+ players_per_team = num_online_cpus();
+
+ mutex_low_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
+ mutex_mid_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
+
+ for (i = 0; i < players_per_team; i++) {
+ mutex_init(&mutex_low_list[i]);
+ mutex_init(&mutex_mid_list[i]);
+ }
+
+ kth = create_fifo_thread(ref_thread, (void *)10, "ref-thread", 20);
+
+ return 0;
+}
+module_init(test_sched_football_init);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 4405f81248fb..1d90059d190f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1238,6 +1238,20 @@ config SCHED_DEBUG
that can help debug the scheduler. The runtime overhead of this
option is minimal.

+config SCHED_RT_INVARIENT_TEST
+ tristate "RT invariant scheduling tester"
+ depends on DEBUG_KERNEL
+ help
+ This option provides a kernel module that runs tests to make
+ sure the RT invariant holds (top N priority tasks run on N
+ available cpus).
+
+ Say Y here if you want kernel rt scheduling tests
+ to be built into the kernel.
+ Say M if you want this test to build as a module.
+ Say N if you are unsure.
+
+
config SCHED_INFO
bool
default n
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:24:47

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 20/23] sched: Push execution and scheduler context split into deadline and rt paths

From: Connor O'Brien <[email protected]>

In preparation for chain migration, push the awareness
of the split between execution and scheduler context
down into some of the rt/deadline code paths that deal
with load balancing.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.
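
To spell out the split: under proxy execution the scheduling context
(priority/deadline) comes from the selected donor task, while the
execution context (affinity, actually occupying the CPU) comes from
the task running on its behalf. A hypothetical pair of accessors
(illustration only, not part of this patch) would look like:

    /*
     * Illustration only, not part of this patch: which task each side
     * of the split would come from on a proxy-executing rq.
     */
    static inline struct task_struct *rq_sched_ctx(struct rq *rq)
    {
            return rq_selected(rq);         /* donor: supplies ->prio / ->dl */
    }

    static inline struct task_struct *rq_exec_ctx(struct rq *rq)
    {
            return rq->curr;                /* proxy: supplies ->cpus_mask */
    }

The callers changed below simply take both tasks as explicit
parameters instead of a single @p.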

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch]
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched/cpudeadline.c | 12 ++++++------
kernel/sched/cpudeadline.h | 3 ++-
kernel/sched/cpupri.c | 20 +++++++++++---------
kernel/sched/cpupri.h | 6 ++++--
kernel/sched/deadline.c | 18 +++++++++---------
kernel/sched/rt.c | 31 ++++++++++++++++++-------------
6 files changed, 50 insertions(+), 40 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 95baa12a1029..6ac59dcdf068 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -113,13 +113,13 @@ static inline int cpudl_maximum(struct cpudl *cp)
*
* Returns: int - CPUs were found
*/
-int cpudl_find(struct cpudl *cp, struct task_struct *p,
+int cpudl_find(struct cpudl *cp, struct task_struct *sched_ctx, struct task_struct *exec_ctx,
struct cpumask *later_mask)
{
- const struct sched_dl_entity *dl_se = &p->dl;
+ const struct sched_dl_entity *dl_se = &sched_ctx->dl;

if (later_mask &&
- cpumask_and(later_mask, cp->free_cpus, &p->cpus_mask)) {
+ cpumask_and(later_mask, cp->free_cpus, &exec_ctx->cpus_mask)) {
unsigned long cap, max_cap = 0;
int cpu, max_cpu = -1;

@@ -128,13 +128,13 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

/* Ensure the capacity of the CPUs fits the task. */
for_each_cpu(cpu, later_mask) {
- if (!dl_task_fits_capacity(p, cpu)) {
+ if (!dl_task_fits_capacity(sched_ctx, cpu)) {
cpumask_clear_cpu(cpu, later_mask);

cap = arch_scale_cpu_capacity(cpu);

if (cap > max_cap ||
- (cpu == task_cpu(p) && cap == max_cap)) {
+ (cpu == task_cpu(exec_ctx) && cap == max_cap)) {
max_cap = cap;
max_cpu = cpu;
}
@@ -150,7 +150,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

WARN_ON(best_cpu != -1 && !cpu_present(best_cpu));

- if (cpumask_test_cpu(best_cpu, &p->cpus_mask) &&
+ if (cpumask_test_cpu(best_cpu, &exec_ctx->cpus_mask) &&
dl_time_before(dl_se->deadline, cp->elements[0].dl)) {
if (later_mask)
cpumask_set_cpu(best_cpu, later_mask);
diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
index 0adeda93b5fb..6bb27f70e9d2 100644
--- a/kernel/sched/cpudeadline.h
+++ b/kernel/sched/cpudeadline.h
@@ -16,7 +16,8 @@ struct cpudl {
};

#ifdef CONFIG_SMP
-int cpudl_find(struct cpudl *cp, struct task_struct *p, struct cpumask *later_mask);
+int cpudl_find(struct cpudl *cp, struct task_struct *sched_ctx,
+ struct task_struct *exec_ctx, struct cpumask *later_mask);
void cpudl_set(struct cpudl *cp, int cpu, u64 dl);
void cpudl_clear(struct cpudl *cp, int cpu);
int cpudl_init(struct cpudl *cp);
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 42c40cfdf836..15e947a3ded7 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -118,10 +118,11 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
return 1;
}

-int cpupri_find(struct cpupri *cp, struct task_struct *p,
+int cpupri_find(struct cpupri *cp, struct task_struct *sched_ctx,
+ struct task_struct *exec_ctx,
struct cpumask *lowest_mask)
{
- return cpupri_find_fitness(cp, p, lowest_mask, NULL);
+ return cpupri_find_fitness(cp, sched_ctx, exec_ctx, lowest_mask, NULL);
}

/**
@@ -141,18 +142,19 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
*
* Return: (int)bool - CPUs were found
*/
-int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
- struct cpumask *lowest_mask,
- bool (*fitness_fn)(struct task_struct *p, int cpu))
+int cpupri_find_fitness(struct cpupri *cp, struct task_struct *sched_ctx,
+ struct task_struct *exec_ctx,
+ struct cpumask *lowest_mask,
+ bool (*fitness_fn)(struct task_struct *p, int cpu))
{
- int task_pri = convert_prio(p->prio);
+ int task_pri = convert_prio(sched_ctx->prio);
int idx, cpu;

WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES);

for (idx = 0; idx < task_pri; idx++) {

- if (!__cpupri_find(cp, p, lowest_mask, idx))
+ if (!__cpupri_find(cp, exec_ctx, lowest_mask, idx))
continue;

if (!lowest_mask || !fitness_fn)
@@ -160,7 +162,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,

/* Ensure the capacity of the CPUs fit the task */
for_each_cpu(cpu, lowest_mask) {
- if (!fitness_fn(p, cpu))
+ if (!fitness_fn(sched_ctx, cpu))
cpumask_clear_cpu(cpu, lowest_mask);
}

@@ -192,7 +194,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
* really care.
*/
if (fitness_fn)
- return cpupri_find(cp, p, lowest_mask);
+ return cpupri_find(cp, sched_ctx, exec_ctx, lowest_mask);

return 0;
}
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..bde7243cec2e 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -18,9 +18,11 @@ struct cpupri {
};

#ifdef CONFIG_SMP
-int cpupri_find(struct cpupri *cp, struct task_struct *p,
+int cpupri_find(struct cpupri *cp, struct task_struct *sched_ctx,
+ struct task_struct *exec_ctx,
struct cpumask *lowest_mask);
-int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
+int cpupri_find_fitness(struct cpupri *cp, struct task_struct *sched_ctx,
+ struct task_struct *exec_ctx,
struct cpumask *lowest_mask,
bool (*fitness_fn)(struct task_struct *p, int cpu));
void cpupri_set(struct cpupri *cp, int cpu, int pri);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1f3bc50de678..999bd17f11c4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1779,7 +1779,7 @@ static inline bool dl_task_is_earliest_deadline(struct task_struct *p,
rq->dl.earliest_dl.curr));
}

-static int find_later_rq(struct task_struct *task);
+static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx);

static int
select_task_rq_dl(struct task_struct *p, int cpu, int flags)
@@ -1819,7 +1819,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
select_rq |= !dl_task_fits_capacity(p, cpu);

if (select_rq) {
- int target = find_later_rq(p);
+ int target = find_later_rq(p, p);

if (target != -1 &&
dl_task_is_earliest_deadline(p, cpu_rq(target)))
@@ -1871,7 +1871,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
+ !cpudl_find(&rq->rd->cpudl, rq_selected(rq), rq->curr, NULL))
return;

/*
@@ -1879,7 +1879,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* see if it is pushed or pulled somewhere else.
*/
if (p->nr_cpus_allowed != 1 &&
- cpudl_find(&rq->rd->cpudl, p, NULL))
+ cpudl_find(&rq->rd->cpudl, p, p, NULL))
return;

resched_curr(rq);
@@ -2079,25 +2079,25 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu

static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);

-static int find_later_rq(struct task_struct *task)
+static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx)
{
struct sched_domain *sd;
struct cpumask *later_mask = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);
int this_cpu = smp_processor_id();
- int cpu = task_cpu(task);
+ int cpu = task_cpu(sched_ctx);

/* Make sure the mask is initialized first */
if (unlikely(!later_mask))
return -1;

- if (task->nr_cpus_allowed == 1)
+ if (exec_ctx && exec_ctx->nr_cpus_allowed == 1)
return -1;

/*
* We have to consider system topology and task affinity
* first, then we can look for a suitable CPU.
*/
- if (!cpudl_find(&task_rq(task)->rd->cpudl, task, later_mask))
+ if (!cpudl_find(&task_rq(exec_ctx)->rd->cpudl, sched_ctx, exec_ctx, later_mask))
return -1;

/*
@@ -2174,7 +2174,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
int cpu;

for (tries = 0; tries < DL_MAX_TRIES; tries++) {
- cpu = find_later_rq(task);
+ cpu = find_later_rq(task, task);

if ((cpu == -1) || (cpu == rq->cpu))
break;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 15161de88753..6371b0fca4ad 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1554,7 +1554,7 @@ static void yield_task_rt(struct rq *rq)
}

#ifdef CONFIG_SMP
-static int find_lowest_rq(struct task_struct *task);
+static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx);

static int
select_task_rq_rt(struct task_struct *p, int cpu, int flags)
@@ -1604,7 +1604,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
(curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);

if (test || !rt_task_fits_capacity(p, cpu)) {
- int target = find_lowest_rq(p);
+ int target = find_lowest_rq(p, p);

/*
* Bail out if we were forcing a migration to find a better
@@ -1631,8 +1631,13 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)

static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
{
+ struct task_struct *exec_ctx = p;
+ /*
+ * Current can't be migrated, useless to reschedule,
+ * let's hope p can move out.
+ */
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
+ !cpupri_find(&rq->rd->cpupri, rq_selected(rq), rq->curr, NULL))
return;

/*
@@ -1640,7 +1645,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
* see if it is pushed or pulled somewhere else.
*/
if (p->nr_cpus_allowed != 1 &&
- cpupri_find(&rq->rd->cpupri, p, NULL))
+ cpupri_find(&rq->rd->cpupri, p, exec_ctx, NULL))
return;

/*
@@ -1834,19 +1839,19 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)

static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);

-static int find_lowest_rq(struct task_struct *task)
+static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx)
{
struct sched_domain *sd;
struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask);
int this_cpu = smp_processor_id();
- int cpu = task_cpu(task);
+ int cpu = task_cpu(sched_ctx);
int ret;

/* Make sure the mask is initialized first */
if (unlikely(!lowest_mask))
return -1;

- if (task->nr_cpus_allowed == 1)
+ if (exec_ctx && exec_ctx->nr_cpus_allowed == 1)
return -1; /* No other targets possible */

/*
@@ -1855,13 +1860,13 @@ static int find_lowest_rq(struct task_struct *task)
*/
if (sched_asym_cpucap_active()) {

- ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
- task, lowest_mask,
+ ret = cpupri_find_fitness(&task_rq(sched_ctx)->rd->cpupri,
+ sched_ctx, exec_ctx, lowest_mask,
rt_task_fits_capacity);
} else {

- ret = cpupri_find(&task_rq(task)->rd->cpupri,
- task, lowest_mask);
+ ret = cpupri_find(&task_rq(sched_ctx)->rd->cpupri,
+ sched_ctx, exec_ctx, lowest_mask);
}

if (!ret)
@@ -1933,7 +1938,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
int cpu;

for (tries = 0; tries < RT_MAX_TRIES; tries++) {
- cpu = find_lowest_rq(task);
+ cpu = find_lowest_rq(task, task);

if ((cpu == -1) || (cpu == rq->cpu))
break;
@@ -2055,7 +2060,7 @@ static int push_rt_task(struct rq *rq, bool pull)
if (rq->curr->sched_class != &rt_sched_class)
return 0;

- cpu = find_lowest_rq(rq->curr);
+ cpu = find_lowest_rq(rq_selected(rq), rq->curr);
if (cpu == -1 || cpu == rq->cpu)
return 0;

--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:25:00

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 22/23] sched: Refactor dl/rt find_lowest/latest_rq logic

This pulls the re-validation logic done in find_lock_lowest_rq
and find_lock_later_rq after re-acquiring the rq locks out into
dedicated helper functions.

This allows us to later use a more complicated validation
check for chain-migration when using proxy-execution.

TODO: It seems likely we could consolidate these two functions
further, leaving only the rt_task()/dl_task() checks external.
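
One possible shape for such a consolidation (a sketch only, not part
of this patch; the helper and parameter names are illustrative) would
keep only the common checks in the helper and leave the class test
at the call site:

  static inline bool revalidate_rq_state(struct task_struct *task,
                                         struct rq *rq, struct rq *target)
  {
          if (task_rq(task) != rq)
                  return false;
          if (!cpumask_test_cpu(target->cpu, &task->cpus_mask))
                  return false;
          if (task_on_cpu(rq, task))
                  return false;
          if (is_migration_disabled(task))
                  return false;
          if (!task_on_rq_queued(task))
                  return false;
          return true;
  }

  /* callers would then do, e.g.:
   *      if (unlikely(!rt_task(task) ||
   *                   !revalidate_rq_state(task, rq, lowest_rq)))
   */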

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched/deadline.c | 31 ++++++++++++++++++++-----
kernel/sched/rt.c | 50 ++++++++++++++++++++++++++++-------------
2 files changed, 59 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 21e56ac58e32..8b5701727342 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2172,6 +2172,30 @@ static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec
return -1;
}

+static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *rq,
+ struct rq *later)
+{
+ if (task_rq(task) != rq)
+ return false;
+
+ if (!cpumask_test_cpu(later->cpu, &task->cpus_mask))
+ return false;
+
+ if (task_on_cpu(rq, task))
+ return false;
+
+ if (!dl_task(task))
+ return false;
+
+ if (is_migration_disabled(task))
+ return false;
+
+ if (!task_on_rq_queued(task))
+ return false;
+
+ return true;
+}
+
/* Locks the rq it finds */
static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
{
@@ -2204,12 +2228,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)

/* Retry if something changed. */
if (double_lock_balance(rq, later_rq)) {
- if (unlikely(task_rq(task) != rq ||
- !cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) ||
- task_on_cpu(rq, task) ||
- !dl_task(task) ||
- is_migration_disabled(task) ||
- !task_on_rq_queued(task))) {
+ if (unlikely(!dl_revalidate_rq_state(task, rq, later_rq))) {
double_unlock_balance(rq, later_rq);
later_rq = NULL;
break;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f8134d062fa3..fabb19891e95 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1935,6 +1935,39 @@ static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exe
return -1;
}

+static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *rq,
+ struct rq *lowest)
+{
+ /*
+ * We had to unlock the run queue. In
+ * the mean time, task could have
+ * migrated already or had its affinity changed.
+ * Also make sure that it wasn't scheduled on its rq.
+ * It is possible the task was scheduled, set
+ * "migrate_disabled" and then got preempted, so we must
+ * check the task migration disable flag here too.
+ */
+ if (task_rq(task) != rq)
+ return false;
+
+ if (!cpumask_test_cpu(lowest->cpu, &task->cpus_mask))
+ return false;
+
+ if (task_on_cpu(rq, task))
+ return false;
+
+ if (!rt_task(task))
+ return false;
+
+ if (is_migration_disabled(task))
+ return false;
+
+ if (!task_on_rq_queued(task))
+ return false;
+
+ return true;
+}
+
/* Will lock the rq it finds */
static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
{
@@ -1964,22 +1997,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)

/* if the prio of this runqueue changed, try again */
if (double_lock_balance(rq, lowest_rq)) {
- /*
- * We had to unlock the run queue. In
- * the mean time, task could have
- * migrated already or had its affinity changed.
- * Also make sure that it wasn't scheduled on its rq.
- * It is possible the task was scheduled, set
- * "migrate_disabled" and then got preempted, so we must
- * check the task migration disable flag here too.
- */
- if (unlikely(task_rq(task) != rq ||
- !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
- task_on_cpu(rq, task) ||
- !rt_task(task) ||
- is_migration_disabled(task) ||
- !task_on_rq_queued(task))) {
-
+ if (unlikely(!rt_revalidate_rq_state(task, rq, lowest_rq))) {
double_unlock_balance(rq, lowest_rq);
lowest_rq = NULL;
break;
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:25:30

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 23/23] sched: Fix rt/dl load balancing via chain level balance

From: Connor O'Brien <[email protected]>

RT/DL balancing is supposed to guarantee that with N cpus available &
CPU affinity permitting, the top N RT/DL tasks will get spread across
the CPUs and all get to run. Proxy exec greatly complicates this as
blocked tasks remain on the rq but cannot be usefully migrated away
from their lock-owning tasks. This has two major consequences:
1. In order to get the desired properties we need to migrate a blocked
task, its would-be proxy, and everything in between, all together -
i.e., we need to push/pull "blocked chains" rather than individual
tasks.
2. Tasks that are part of rq->curr's "blocked tree" therefore should
not be pushed or pulled. Options for enforcing this seem to include
a) create some sort of complex data structure for tracking
pushability, updating it whenever the blocked tree for rq->curr
changes (e.g. on mutex handoffs, migrations, etc.) as well as on
context switches.
b) give up on O(1) pushability checks, and search through the pushable
list every push/pull until we find a pushable "chain"
c) Extend option "b" with some sort of caching to avoid repeated work.

For the sake of simplicity & separating the "chain level balancing"
concerns from complicated optimizations, this patch focuses on trying
to implement option "b" correctly. This can then hopefully provide a
baseline for "correct load balancing behavior" that optimizations can
try to implement more efficiently.

Note:
The inability to atomically check "is task enqueued on a specific rq"
creates 2 possible races when following a blocked chain:
- If we check task_rq() first on a task that is dequeued from its rq,
it can be woken and enqueued on another rq before the call to
task_on_rq_queued()
- If we call task_on_rq_queued() first on a task that is on another
rq, it can be dequeued (since we don't hold its rq's lock) and then
be set to the current rq before we check task_rq().

Maybe there's a more elegant solution that would work, but for now,
just sandwich the task_rq() check between two task_on_rq_queued()
checks, all separated by smp_rmb() calls. Since we hold rq's lock,
task can't be enqueued or dequeued from rq, so neither race should be
possible.
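
For reference, the sandwiched check described above takes roughly
this shape (it is added as the task_queued_on_rq() helper by the
find_exec_ctx patch in this series):

  static inline bool task_queued_on_rq(struct rq *rq, struct task_struct *task)
  {
          if (!task_on_rq_queued(task))
                  return false;
          smp_rmb();
          if (task_rq(task) != rq)
                  return false;
          smp_rmb();
          if (!task_on_rq_queued(task))
                  return false;
          return true;
  }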

Extensive comments on various pitfalls, races, etc. are included inline.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch,
majorly refactored for runtime conditionalization]
Signed-off-by: John Stultz <[email protected]>
---
v7:
* Split out from larger chain-migration patch in earlier
versions of this series
* Larger rework to allow proper conditionalization of the
logic when running with CONFIG_SCHED_PROXY_EXEC
---
kernel/sched/core.c | 77 +++++++++++++++++++++++-
kernel/sched/deadline.c | 98 +++++++++++++++++++++++-------
kernel/sched/rt.c | 130 ++++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 18 +++++-
4 files changed, 273 insertions(+), 50 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 77a79d5f829a..30dfb6f14f2b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3923,7 +3923,6 @@ struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
return p;

lockdep_assert_rq_held(rq);
-
for (exec_ctx = p; task_is_blocked(exec_ctx) && !task_on_cpu(rq, exec_ctx);
exec_ctx = owner) {
mutex = exec_ctx->blocked_on;
@@ -3938,6 +3937,82 @@ struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
}
return exec_ctx;
}
+
+#ifdef CONFIG_SMP
+void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
+{
+ struct task_struct *owner;
+
+ if (!sched_proxy_exec()) {
+ __push_task_chain(rq, dst_rq, task);
+ return;
+ }
+
+ lockdep_assert_rq_held(rq);
+ lockdep_assert_rq_held(dst_rq);
+
+ BUG_ON(!task_queued_on_rq(rq, task));
+ BUG_ON(task_current_selected(rq, task));
+
+ while (task) {
+ if (!task_queued_on_rq(rq, task) || task_current_selected(rq, task))
+ break;
+
+ if (task_is_blocked(task))
+ owner = __mutex_owner(task->blocked_on);
+ else
+ owner = NULL;
+ __push_task_chain(rq, dst_rq, task);
+ if (task == owner)
+ break;
+ task = owner;
+ }
+}
+
+/*
+ * Returns:
+ * 1 if chain is pushable and affinity does not prevent pushing to cpu
+ * 0 if chain is unpushable
+ * -1 if chain is pushable but affinity blocks running on cpu.
+ */
+int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
+{
+ struct task_struct *exec_ctx;
+
+ if (!sched_proxy_exec())
+ return __task_is_pushable(rq, p, cpu);
+
+ lockdep_assert_rq_held(rq);
+
+ if (task_rq(p) != rq || !task_on_rq_queued(p))
+ return 0;
+
+ exec_ctx = find_exec_ctx(rq, p);
+ /*
+ * Chain leads off the rq, we're free to push it anywhere.
+ *
+ * One wrinkle with relying on find_exec_ctx is that when the chain
+ * leads to a task currently migrating to rq, we see the chain as
+ * pushable & push everything prior to the migrating task. Even if
+ * we checked explicitly for this case, we could still race with a
+ * migration after the check.
+ * This shouldn't permanently produce a bad state though, as proxy()
+ * will send the chain back to rq and by that point the migration
+ * should be complete & a proper push can occur.
+ */
+ if (!exec_ctx)
+ return 1;
+
+ if (task_on_cpu(rq, exec_ctx) || exec_ctx->nr_cpus_allowed <= 1)
+ return 0;
+
+ return cpumask_test_cpu(cpu, &exec_ctx->cpus_mask) ? 1 : -1;
+}
+#else /* !CONFIG_SMP */
+void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
+{
+}
+#endif /* CONFIG_SMP */
#else /* !CONFIG_SCHED_PROXY_EXEC */
static inline void do_activate_task(struct rq *rq, struct task_struct *p,
int en_flags)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8b5701727342..b7be888c1635 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2172,8 +2172,77 @@ static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec
return -1;
}

+static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
+{
+ struct task_struct *p = NULL;
+ struct rb_node *next_node;
+
+ if (!has_pushable_dl_tasks(rq))
+ return NULL;
+
+ next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
+
+next_node:
+ if (next_node) {
+ p = __node_2_pdl(next_node);
+
+ /*
+ * cpu argument doesn't matter because we treat a -1 result
+ * (pushable but can't go to cpu0) the same as a 1 result
+ * (pushable to cpu0). All we care about here is general
+ * pushability.
+ */
+ if (task_is_pushable(rq, p, 0))
+ return p;
+
+ next_node = rb_next(next_node);
+ goto next_node;
+ }
+
+ if (!p)
+ return NULL;
+
+ WARN_ON_ONCE(rq->cpu != task_cpu(p));
+ WARN_ON_ONCE(task_current(rq, p));
+ WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+
+ WARN_ON_ONCE(!task_on_rq_queued(p));
+ WARN_ON_ONCE(!dl_task(p));
+
+ return p;
+}
+
+#ifdef CONFIG_SCHED_PROXY_EXEC
static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *rq,
- struct rq *later)
+ struct rq *later, bool *retry)
+{
+ if (!dl_task(task) || is_migration_disabled(task))
+ return false;
+
+ if (rq != this_rq()) {
+ struct task_struct *next_task = pick_next_pushable_dl_task(rq);
+
+ if (next_task == task) {
+ struct task_struct *exec_ctx;
+
+ exec_ctx = find_exec_ctx(rq, next_task);
+ *retry = (exec_ctx && !cpumask_test_cpu(later->cpu,
+ &exec_ctx->cpus_mask));
+ } else {
+ return false;
+ }
+ } else {
+ int pushable = task_is_pushable(rq, task, later->cpu);
+
+ *retry = pushable == -1;
+ if (!pushable)
+ return false;
+ }
+ return true;
+}
+#else
+static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *rq,
+ struct rq *later, bool *retry)
{
if (task_rq(task) != rq)
return false;
@@ -2195,16 +2264,18 @@ static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *r

return true;
}
-
+#endif
/* Locks the rq it finds */
static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
{
struct task_struct *exec_ctx;
struct rq *later_rq = NULL;
+ bool retry;
int tries;
int cpu;

for (tries = 0; tries < DL_MAX_TRIES; tries++) {
+ retry = false;
exec_ctx = find_exec_ctx(rq, task);
if (!exec_ctx)
break;
@@ -2228,7 +2299,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)

/* Retry if something changed. */
if (double_lock_balance(rq, later_rq)) {
- if (unlikely(!dl_revalidate_rq_state(task, rq, later_rq))) {
+ if (unlikely(!dl_revalidate_rq_state(task, rq, later_rq, &retry))) {
double_unlock_balance(rq, later_rq);
later_rq = NULL;
break;
@@ -2240,7 +2311,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
* its earliest one has a later deadline than our
* task, the rq is a good one.
*/
- if (dl_task_is_earliest_deadline(task, later_rq))
+ if (!retry && dl_task_is_earliest_deadline(task, later_rq))
break;

/* Otherwise we try again. */
@@ -2251,25 +2322,6 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
return later_rq;
}

-static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
-{
- struct task_struct *p;
-
- if (!has_pushable_dl_tasks(rq))
- return NULL;
-
- p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
-
- WARN_ON_ONCE(rq->cpu != task_cpu(p));
- WARN_ON_ONCE(task_current(rq, p));
- WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
-
- WARN_ON_ONCE(!task_on_rq_queued(p));
- WARN_ON_ONCE(!dl_task(p));
-
- return p;
-}
-
/*
* See if the non running -deadline tasks on this rq
* can be sent to some other CPU where they can preempt
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index fabb19891e95..d5ce95dc5c09 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1935,8 +1935,108 @@ static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exe
return -1;
}

+static struct task_struct *pick_next_pushable_task(struct rq *rq)
+{
+ struct plist_head *head = &rq->rt.pushable_tasks;
+ struct task_struct *p, *push_task = NULL;
+
+ if (!has_pushable_tasks(rq))
+ return NULL;
+
+ plist_for_each_entry(p, head, pushable_tasks) {
+ if (task_is_pushable(rq, p, 0)) {
+ push_task = p;
+ break;
+ }
+ }
+
+ if (!push_task)
+ return NULL;
+
+ BUG_ON(rq->cpu != task_cpu(push_task));
+ BUG_ON(task_current(rq, push_task) || task_current_selected(rq, push_task));
+ BUG_ON(!task_on_rq_queued(push_task));
+ BUG_ON(!rt_task(push_task));
+
+ return p;
+}
+
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *rq,
+ struct rq *lowest, bool *retry)
+{
+ /*
+ * Releasing the rq lock means we need to re-check pushability.
+ * Some scenarios:
+ * 1) If a migration from another CPU sent a task/chain to rq
+ * that made task newly unpushable by completing a chain
+ * from task to rq->curr, then we need to bail out and push something
+ * else.
+ * 2) If our chain led off this CPU or to a dequeued task, the last waiter
+ * on this CPU might have acquired the lock and woken (or even migrated
+ * & run, handed off the lock it held, etc...). This can invalidate the
+ * result of find_lowest_rq() if our chain previously ended in a blocked
+ * task whose affinity we could ignore, but now ends in an unblocked
+ * task that can't run on lowest_rq.
+ * 3) Race described at https://lore.kernel.org/all/[email protected]/
+ *
+ * Notes on these:
+ * - Scenario #2 is properly handled by rerunning find_lowest_rq
+ * - Scenario #1 requires that we fail
+ * - Scenario #3 can AFAICT only occur when rq is not this_rq(). And the
+ * suggested fix is not universally correct now that push_cpu_stop() can
+ * call this function.
+ */
+ if (!rt_task(task) || is_migration_disabled(task)) {
+ return false;
+ } else if (rq != this_rq()) {
+ /*
+ * If we are dealing with a remote rq, then all bets are off
+ * because task might have run & then been dequeued since we
+ * released the lock, at which point our normal checks can race
+ * with migration, as described in
+ * https://lore.kernel.org/all/[email protected]/
+ * Need to repick to ensure we avoid a race.
+ * But re-picking would be unnecessary & incorrect in the
+ * push_cpu_stop() path.
+ */
+ struct task_struct *next_task = pick_next_pushable_task(rq);
+
+ if (next_task == task) {
+ struct task_struct *exec_ctx;
+
+ exec_ctx = find_exec_ctx(rq, next_task);
+ *retry = (exec_ctx &&
+ !cpumask_test_cpu(lowest->cpu,
+ &exec_ctx->cpus_mask));
+ } else {
+ return false;
+ }
+ } else {
+ /*
+ * Chain level balancing introduces new ways for our choice of
+ * task & rq to become invalid when we release the rq lock, e.g.:
+ * 1) Migration to rq from another CPU makes task newly unpushable
+ * by completing a "blocked chain" from task to rq->curr.
+ * Fail so a different task can be chosen for push.
+ * 2) In cases where task's blocked chain led to a dequeued task
+ * or one on another rq, the last waiter in the chain on this
+ * rq might have acquired the lock and woken, meaning we must
+ * pick a different rq if its affinity prevents running on
+ * lowest rq.
+ */
+ int pushable = task_is_pushable(rq, task, lowest->cpu);
+
+ *retry = pushable == -1;
+ if (!pushable)
+ return false;
+ }
+
+ return true;
+}
+#else /* !CONFIG_SCHED_PROXY_EXEC */
static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *rq,
- struct rq *lowest)
+ struct rq *lowest, bool *retry)
{
/*
* We had to unlock the run queue. In
@@ -1967,16 +2067,19 @@ static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *r

return true;
}
+#endif

/* Will lock the rq it finds */
static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
{
struct task_struct *exec_ctx;
struct rq *lowest_rq = NULL;
+ bool retry;
int tries;
int cpu;

for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+ retry = false;
exec_ctx = find_exec_ctx(rq, task);
cpu = find_lowest_rq(task, exec_ctx);

@@ -1997,7 +2100,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)

/* if the prio of this runqueue changed, try again */
if (double_lock_balance(rq, lowest_rq)) {
- if (unlikely(!rt_revalidate_rq_state(task, rq, lowest_rq))) {
+ if (unlikely(!rt_revalidate_rq_state(task, rq, lowest_rq, &retry))) {
double_unlock_balance(rq, lowest_rq);
lowest_rq = NULL;
break;
@@ -2005,7 +2108,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
}

/* If this rq is still suitable use it. */
- if (lowest_rq->rt.highest_prio.curr > task->prio)
+ if (lowest_rq->rt.highest_prio.curr > task->prio && !retry)
break;

/* try again */
@@ -2016,27 +2119,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
return lowest_rq;
}

-static struct task_struct *pick_next_pushable_task(struct rq *rq)
-{
- struct task_struct *p;
-
- if (!has_pushable_tasks(rq))
- return NULL;
-
- p = plist_first_entry(&rq->rt.pushable_tasks,
- struct task_struct, pushable_tasks);
-
- BUG_ON(rq->cpu != task_cpu(p));
- BUG_ON(task_current(rq, p));
- BUG_ON(task_current_selected(rq, p));
- BUG_ON(p->nr_cpus_allowed <= 1);
-
- BUG_ON(!task_on_rq_queued(p));
- BUG_ON(!rt_task(p));
-
- return p;
-}
-
/*
* If the current CPU has more than one RT task, see if the non
* running task can migrate over to a CPU that is running a task
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6cd473224cfe..4b97b36be691 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3548,7 +3548,7 @@ extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
#ifdef CONFIG_SMP
static inline
-void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
+void __push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
{
deactivate_task(rq, task, 0);
set_task_cpu(task, dst_rq->cpu);
@@ -3556,7 +3556,7 @@ void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
}

static inline
-int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
+int __task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
{
if (!task_on_cpu(rq, p) &&
cpumask_test_cpu(cpu, &p->cpus_mask))
@@ -3566,8 +3566,22 @@ int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
}

#ifdef CONFIG_SCHED_PROXY_EXEC
+void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task);
+int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu);
struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p);
#else /* !CONFIG_SCHED_PROXY_EXEC */
+static inline
+void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
+{
+ __push_task_chain(rq, dst_rq, task);
+}
+
+static inline
+int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
+{
+ return __task_is_pushable(rq, p, cpu);
+}
+
static inline
struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
{
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:27:36

by John Stultz

[permalink] [raw]
Subject: [PATCH v7 21/23] sched: Add find_exec_ctx helper

From: Connor O'Brien <[email protected]>

Add a helper to find the runnable owner down a chain of blocked waiters

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.
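
For illustration, the push paths then pair the blocked task's
scheduler context with the runnable owner's execution context,
roughly:

  /* the blocked task supplies prio/deadline, while the runnable
   * owner at the end of its chain supplies the affinity constraint */
  exec_ctx = find_exec_ctx(rq, task);
  cpu = find_lowest_rq(task, exec_ctx);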

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch]
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched/core.c | 42 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/cpupri.c | 11 ++++++++---
kernel/sched/deadline.c | 15 +++++++++++++--
kernel/sched/rt.c | 9 ++++++++-
kernel/sched/sched.h | 10 ++++++++++
5 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c212dcd4b7a..77a79d5f829a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3896,6 +3896,48 @@ static void activate_blocked_entities(struct rq *target_rq,
}
raw_spin_unlock_irqrestore(&owner->blocked_lock, flags);
}
+
+static inline bool task_queued_on_rq(struct rq *rq, struct task_struct *task)
+{
+ if (!task_on_rq_queued(task))
+ return false;
+ smp_rmb();
+ if (task_rq(task) != rq)
+ return false;
+ smp_rmb();
+ if (!task_on_rq_queued(task))
+ return false;
+ return true;
+}
+
+/*
+ * Returns the unblocked task at the end of the blocked chain starting with p
+ * if that chain is composed entirely of tasks enqueued on rq, or NULL otherwise.
+ */
+struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
+{
+ struct task_struct *exec_ctx, *owner;
+ struct mutex *mutex;
+
+ if (!sched_proxy_exec())
+ return p;
+
+ lockdep_assert_rq_held(rq);
+
+ for (exec_ctx = p; task_is_blocked(exec_ctx) && !task_on_cpu(rq, exec_ctx);
+ exec_ctx = owner) {
+ mutex = exec_ctx->blocked_on;
+ owner = __mutex_owner(mutex);
+ if (owner == exec_ctx)
+ break;
+
+ if (!task_queued_on_rq(rq, owner) || task_current_selected(rq, owner)) {
+ exec_ctx = NULL;
+ break;
+ }
+ }
+ return exec_ctx;
+}
#else /* !CONFIG_SCHED_PROXY_EXEC */
static inline void do_activate_task(struct rq *rq, struct task_struct *p,
int en_flags)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 15e947a3ded7..53be78afdd07 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -96,12 +96,17 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
if (skip)
return 0;

- if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+ if ((p && cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids) ||
+ (!p && cpumask_any(vec->mask) >= nr_cpu_ids))
return 0;

if (lowest_mask) {
- cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
- cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
+ if (p) {
+ cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+ cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
+ } else {
+ cpumask_copy(lowest_mask, vec->mask);
+ }

/*
* We have to ensure that we have at least one bit
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 999bd17f11c4..21e56ac58e32 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1866,6 +1866,8 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused

static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
{
+ struct task_struct *exec_ctx;
+
/*
* Current can't be migrated, useless to reschedule,
* let's hope p can move out.
@@ -1874,12 +1876,16 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
!cpudl_find(&rq->rd->cpudl, rq_selected(rq), rq->curr, NULL))
return;

+ exec_ctx = find_exec_ctx(rq, p);
+ if (task_current(rq, exec_ctx))
+ return;
+
/*
* p is migratable, so let's not schedule it and
* see if it is pushed or pulled somewhere else.
*/
if (p->nr_cpus_allowed != 1 &&
- cpudl_find(&rq->rd->cpudl, p, p, NULL))
+ cpudl_find(&rq->rd->cpudl, p, exec_ctx, NULL))
return;

resched_curr(rq);
@@ -2169,12 +2175,17 @@ static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec
/* Locks the rq it finds */
static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
{
+ struct task_struct *exec_ctx;
struct rq *later_rq = NULL;
int tries;
int cpu;

for (tries = 0; tries < DL_MAX_TRIES; tries++) {
- cpu = find_later_rq(task, task);
+ exec_ctx = find_exec_ctx(rq, task);
+ if (!exec_ctx)
+ break;
+
+ cpu = find_later_rq(task, exec_ctx);

if ((cpu == -1) || (cpu == rq->cpu))
break;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6371b0fca4ad..f8134d062fa3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1640,6 +1640,11 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
!cpupri_find(&rq->rd->cpupri, rq_selected(rq), rq->curr, NULL))
return;

+ /* No reason to preempt since rq->curr wouldn't change anyway */
+ exec_ctx = find_exec_ctx(rq, p);
+ if (task_current(rq, exec_ctx))
+ return;
+
/*
* p is migratable, so let's not schedule it and
* see if it is pushed or pulled somewhere else.
@@ -1933,12 +1938,14 @@ static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exe
/* Will lock the rq it finds */
static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
{
+ struct task_struct *exec_ctx;
struct rq *lowest_rq = NULL;
int tries;
int cpu;

for (tries = 0; tries < RT_MAX_TRIES; tries++) {
- cpu = find_lowest_rq(task, task);
+ exec_ctx = find_exec_ctx(rq, task);
+ cpu = find_lowest_rq(task, exec_ctx);

if ((cpu == -1) || (cpu == rq->cpu))
break;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef3d327e267c..6cd473224cfe 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3564,6 +3564,16 @@ int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)

return 0;
}
+
+#ifdef CONFIG_SCHED_PROXY_EXEC
+struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p);
+#else /* !CONFIG_SCHED_PROXY_EXEC */
+static inline
+struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
+{
+ return p;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
#endif

#endif /* _KERNEL_SCHED_SCHED_H */
--
2.43.0.472.g3155946c3a-goog


2023-12-20 00:59:23

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v7 17/23] sched: Initial sched_football test implementation

Hi John,

On 12/19/23 16:18, John Stultz wrote:
> Reimplementation of the sched_football test from LTP:
> https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
>
> But reworked to run in the kernel and utilize mutexes
> to illustrate proper boosting of low priority mutex
> holders.
>
> TODO:
> * Need a rt_mutex version so it can work w/o proxy-execution
> * Need a better place to put it
>
> Cc: [email protected]
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched/Makefile | 1 +
> kernel/sched/test_sched_football.c | 242 +++++++++++++++++++++++++++++
> lib/Kconfig.debug | 14 ++
> 3 files changed, 257 insertions(+)
> create mode 100644 kernel/sched/test_sched_football.c
>


> diff --git a/kernel/sched/test_sched_football.c b/kernel/sched/test_sched_football.c
> new file mode 100644
> index 000000000000..9742c45c0fe0
> --- /dev/null
> +++ b/kernel/sched/test_sched_football.c
> @@ -0,0 +1,242 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Module-based test case for RT scheduling invariant
> + *
> + * A reimplementation of my old sched_football test
> + * found in LTP:
> + * https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
> + *
> + * Similar to that test, this tries to validate the RT
> + * scheduling invariant, that the across N available cpus, the
> + * top N priority tasks always running.
> + *
> + * This is done via having N offsensive players that are
> + * medium priority, which constantly are trying to increment the
> + * ball_pos counter.
> + *
> + * Blocking this, are N defensive players that are higher

no comma ^

> + * priority which just spin on the cpu, preventing the medium
> + * priroity tasks from running.
> + *
> + * To complicate this, there are also N defensive low priority
> + * tasks. These start first and each aquire one of N mutexes.

acquire

> + * The high priority defense tasks will later try to grab the
> + * mutexes and block, opening a window for the offsensive tasks

offensive

> + * to run and increment the ball. If priority inheritance or
> + * proxy execution is used, the low priority defense players
> + * should be boosted to the high priority levels, and will
> + * prevent the mid priority offensive tasks from running.
> + *
> + * Copyright © International Business Machines Corp., 2007, 2008
> + * Copyright (C) Google, 2023
> + *
> + * Authors: John Stultz <[email protected]>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/delay.h>
> +#include <linux/sched/rt.h>
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/rwsem.h>
> +#include <linux/smp.h>
> +#include <linux/slab.h>
> +#include <linux/interrupt.h>
> +#include <linux/sched.h>
> +#include <uapi/linux/sched/types.h>
> +#include <linux/rtmutex.h>
> +
> +atomic_t players_ready;
> +atomic_t ball_pos;
> +int players_per_team;
> +bool game_over;
> +
> +struct mutex *mutex_low_list;
> +struct mutex *mutex_mid_list;
> +

[]

Is this the referee?

> +int ref_thread(void *arg)
> +{
> + struct task_struct *kth;
> + long game_time = (long)arg;
> + unsigned long final_pos;
> + long i;
> +
> + pr_info("%s: started ref, game_time: %ld secs !\n", __func__,
> + game_time);
> +
> + /* Create low priority defensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_low_thread, (void *)i,
> + "defese-low-thread", 2);

defense

> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team)
> + msleep(1);
> +
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_mid_thread,
> + (void *)(players_per_team - i - 1),
> + "defese-mid-thread", 3);

ditto

> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 2)
> + msleep(1);
> +
> + /* Create mid priority offensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(offense_thread, NULL,
> + "offense-thread", 5);
> + /* Wait for the offense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 3)
> + msleep(1);
> +
> + /* Create high priority defensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_hi_thread, (void *)i,
> + "defese-hi-thread", 10);

ditto

> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 4)
> + msleep(1);
> +
> + /* Create high priority defensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(crazy_fan_thread, NULL,
> + "crazy-fan-thread", 15);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 5)
> + msleep(1);
> +
> + pr_info("%s: all players checked in! Starting game.\n", __func__);
> + atomic_set(&ball_pos, 0);
> + msleep(game_time * 1000);
> + final_pos = atomic_read(&ball_pos);
> + pr_info("%s: final ball_pos: %ld\n", __func__, final_pos);
> + WARN_ON(final_pos != 0);
> + game_over = true;
> + return 0;
> +}
> +


> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 4405f81248fb..1d90059d190f 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1238,6 +1238,20 @@ config SCHED_DEBUG
> that can help debug the scheduler. The runtime overhead of this
> option is minimal.
>
> +config SCHED_RT_INVARIENT_TEST

INVARIANT

> + tristate "RT invarient scheduling tester"

invariant

> + depends on DEBUG_KERNEL
> + help
> + This option provides a kernel module that runs tests to make
> + sure the RT invarient holds (top N priority tasks run on N

invariant

> + available cpus).
> +
> + Say Y here if you want kernel rt scheduling tests

RT

> + to be built into the kernel.
> + Say M if you want this test to build as a module.
> + Say N if you are unsure.
> +
> +
> config SCHED_INFO
> bool
> default n

--
#Randy
https://people.kernel.org/tglx/notes-about-netiquette
https://subspace.kernel.org/etiquette.html

2023-12-20 01:04:21

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v7 06/23] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable



On 12/19/23 16:18, John Stultz wrote:
> Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
> sched_prox_exec= that can be used to disable the feature at boot

sched_proxy_exec=

> time if CONFIG_SCHED_PROXY_EXEC was enabled.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v7:
> * Switch to CONFIG_SCHED_PROXY_EXEC/sched_proxy_exec= as
> suggested by Metin Kaya.
> * Switch boot arg from =disable/enable to use kstrtobool(),
> which supports =yes|no|1|0|true|false|on|off, as also
> suggested by Metin Kaya, and print a message when a boot
> argument is used.
> ---
> .../admin-guide/kernel-parameters.txt | 5 ++++
> include/linux/sched.h | 13 +++++++++
> init/Kconfig | 7 +++++
> kernel/sched/core.c | 29 +++++++++++++++++++
> 4 files changed, 54 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 65731b060e3f..cc64393b913f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5714,6 +5714,11 @@
> sa1100ir [NET]
> See drivers/net/irda/sa1100_ir.c.
>
> + sched_proxy_exec= [KNL]
> + Enables or disables "proxy execution" style
> + solution to mutex based priority inversion.

mutex-based

> + Format: <bool>
> +
> sched_verbose [KNL] Enables verbose scheduler debug messages.
>
> schedstats= [KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index bfe8670f99a1..880af1c3097d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1566,6 +1566,19 @@ struct task_struct {
> */
> };
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec);
> +static inline bool sched_proxy_exec(void)
> +{
> + return static_branch_likely(&__sched_proxy_exec);
> +}
> +#else
> +static inline bool sched_proxy_exec(void)
> +{
> + return false;
> +}
> +#endif
> +
> static inline struct pid *task_pid(struct task_struct *task)
> {
> return task->thread_pid;
> diff --git a/init/Kconfig b/init/Kconfig
> index 9ffb103fc927..c5a759b6366a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -908,6 +908,13 @@ config NUMA_BALANCING_DEFAULT_ENABLED
> If set, automatic NUMA balancing will be enabled if running on a NUMA
> machine.
>
> +config SCHED_PROXY_EXEC
> + bool "Proxy Execution"
> + default n
> + help
> + This option enables proxy execution, a mechanism for mutex owning

mutex-owning

> + tasks to inherit the scheduling context of higher priority waiters.
> +
> menuconfig CGROUPS
> bool "Control Group support"
> select KERNFS
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4e46189d545d..e06558fb08aa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -117,6 +117,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
> +static int __init setup_proxy_exec(char *str)
> +{
> + bool proxy_enable;
> +
> + if (kstrtobool(str, &proxy_enable)) {
> + pr_warn("Unable to parse sched_proxy_exec=\n");
> + return 0;
> + }
> +
> + if (proxy_enable) {
> + pr_info("sched_proxy_exec enabled via boot arg\n");
> + static_branch_enable(&__sched_proxy_exec);
> + } else {
> + pr_info("sched_proxy_exec disabled via boot arg\n");
> + static_branch_disable(&__sched_proxy_exec);
> + }
> + return 1;
> +}
> +#else
> +static int __init setup_proxy_exec(char *str)
> +{
> + pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so it cannot be enabled or disabled at boottime\n");

Preferably s/boottime/boot time/.

> + return 0;
> +}
> +#endif
> +__setup("sched_proxy_exec=", setup_proxy_exec);
> +
> #ifdef CONFIG_SCHED_DEBUG
> /*
> * Debugging: various feature bits

--
#Randy
https://people.kernel.org/tglx/notes-about-netiquette
https://subspace.kernel.org/etiquette.html

2023-12-20 02:37:51

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 17/23] sched: Initial sched_football test implementation

On Tue, Dec 19, 2023 at 4:59 PM Randy Dunlap <[email protected]> wrote:
> On 12/19/23 16:18, John Stultz wrote:
> []
>
> Is this the referee?

Yea, good point. "ref" is an overloaded shorthand. Will fix.

Thanks also for all the spelling corrections! Much appreciated.
-john

2023-12-21 08:36:02

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 00/23] Proxy Execution: A generalized form of Priority Inheritance v7

On 20/12/2023 12:18 am, John Stultz wrote:
> Looking for patches to review before closing out the year?
> Have I got something for you! :)
>
> Since sending v6 out and after my presentation at Linux
> Plumbers[1], I got some nice feedback from a number of
> developers. I particularly want to thank Metin Kaya, who provided
> a ton of detailed off-list review comments, questions and cleanup
> suggestions. Additionally Xuewen Yan, who pointed out an issue
> with placement in my return migration logic that motivated me to
> rework the return migration back into the ttwu path. This helped
> to resolve the performance regression I had been seeing from the
> v4->v5 rework!
>
> The other focus of this revision has been to properly
> conditionalize and re-integrate the chain-migration feature that
> Connor had implemented to address concerns about the preserving
> the RT load balancing invariant (always running top N priority
> tasks across N available cpus) when we’re migrating tasks around
> for proxy-execution. I’ve done a fair amount of reworking the
> patches, but they aren’t quite yet to where I’d like them, but
> I’ve included them here which you can consider as a preview.
>
> Validating this invariant isn’t trivial. Long ago I wrote a
> userland test case that’s now in LTP to check this called
> sched_football:
> https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
>
> So I’ve reimplemented this idea as an in-kernel test case,
> extended to have lock chains across priority levels. The good
> news is that the test case does not show any issues with RT load
> balancing invariant when we add the chain-migration feature, but
> I’m not actually seeing any issues with the test when
> chain-migration isn’t added either, so I need to further extend
> the test to try to manufacture the specific type of invariant
> breakage I can imagine we don’t handle properly:
> ie:
> CPU0: P99, P98(boP2), P2
> CPU1: P50
>
> Which chain-migration should adjust to become:
> CPU0: P99
> CPU1: P98(boP2), P50, P2
>
> On the stability front, the series has continued to fare much
> better than pre-v6 patches. The only stability issue I had seen
> with v6 (workqueue lockups when stressing under 64 core qemu
> environment with KVM disabled) has so far not reproduced against
> 6.7-rc. At plumbers Thomas mentioned there had been a workqueue
> issue in 6.6 that was since fixed, so I’m optimistic that was
> what I was tripping on. If you run into any stability issues in
> testing, please do let me know.
>
> This patch series is actually coarser than what I’ve been
> developing with, as there are a number of small “test” steps to
> help validate behavior I changed, which would then be replaced by
> the real logic afterwards. Including those here would just cause
> more work for reviewers, so I’ve folded them together, but if
> you’re interested you can find the fine-grained tree here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v7-6.7-rc6-fine-grained
>
> Credit/Disclaimer:
> —--------------------
> As mentioned previously, this Proxy Execution series has a long
> history:
>
> First described in a paper[2] by Watkins, Straub, Niehaus, then
> from patches from Peter Zijlstra, extended with lots of work by
> Juri Lelli, Valentin Schneider, and Connor O'Brien. (and thank
> you to Steven Rostedt for providing additional details here!)
>
> So again, many thanks to those above, as all the credit for this
> series really is due to them - while the mistakes are likely
> mine.
>
> Overview:
> —----------
> Proxy Execution is a generalized form of priority inheritance.
> Classic priority inheritance works well for real-time tasks where
> there is a straight forward priority order to how things are run.
> But it breaks down when used between CFS or DEADLINE tasks, as
> there are lots of parameters involved outside of just the task’s
> nice value when selecting the next task to run (via
> pick_next_task()). So ideally we want to imbue the mutex holder
> with all the scheduler attributes of the blocked waiting task.
>
> Proxy Execution does this via a few changes:
> * Keeping tasks that are blocked on a mutex *on* the runqueue
> * Keeping additional tracking of which mutex a task is blocked
> on, and which task holds a specific mutex.
> * Special handling for when we select a blocked task to run, so
> that we instead run the mutex holder.
>
> By leaving blocked tasks on the runqueue, we allow
> pick_next_task() to choose the task that should run next (even if
> it’s blocked waiting on a mutex). If we do select a blocked task,
> we look at the task’s blocked_on mutex and from there look at the
> mutex’s owner task. And in the simple case, the task which owns
> the mutex is what we then choose to run, allowing it to release
> the mutex.
>
> This means that instead of just tracking “curr”, the scheduler
> needs to track both the scheduler context (what was picked and
> all the state used for scheduling decisions), and the execution
> context (what we’re actually running).
>
> In this way, the mutex owner is run “on behalf” of the blocked
> task that was picked to run, essentially inheriting the scheduler
> context of the waiting blocked task.
>
> As Connor outlined in a previous submission of this patch series,

Nit: Better to have a reference to Connor's patch series (i.e.,
https://lore.kernel.org/lkml/[email protected]/)
here?

> this raises a number of complicated situations: The mutex owner
> might itself be blocked on another mutex, or it could be
> sleeping, running on a different CPU, in the process of migrating
> between CPUs, etc.
>
> The complexity from these edge cases is imposing, but currently
> in Android we have a large number of cases where we are seeing
> priority inversion (not unbounded, but much longer than we’d
> like) between “foreground” and “background” SCHED_NORMAL
> applications. As a result, currently we cannot usefully limit
> background activity without introducing inconsistent behavior. So
> Proxy Execution is a very attractive approach to resolving this
> critical issue.
>
> New in v7:
> ---------
> * Extended blocked_on state tracking to use a tri-state so we
> can avoid running tasks before they are return-migrated.
>
> * Switched return migration back to the ttwu path to avoid extra
> lock handling and resolve the performance regression seen since v5.
>
> * _Tons_ of typos, cleanups, runtime conditionalization
> improvements, build fixes for different config options, and
> clarifications suggested by Metin Kaya. *Many* *many* thanks
> for all the review and help here!
>
> * Split up and reworked Connor’s chain-migration patches to be
> properly conditionalized on the config value, and hopefully a
> little easier to review.
>
> * Added first stab at RT load-balancing invariant test
> (sched_football)
>
> * Fixed wake_cpu not being preserved during
> activate_blocked_entities
>
> * Fix build warnings Reported-by: kernel test robot <[email protected]>
>
> * Other minor cleanups
>
> Performance:
> —----------
> v7 of the series improves over v6 and v5 by moving proxy return
> migration back into the ttwu path, which avoids a lot of extra
> locking. This gets performance back to where we were in v4.
>
> K Prateek Nayak did some benchmarking and performance analysis
> with the v6 series, and while he noted “little to no difference”
> with a number of benchmarks (schbench, tbench, netperf,
> ycsb-mongodb, DeathStarBench) when running with
> proxy-execution[3], he did find a large regression with the
> “perf bench sched messaging” test.
>
> I’ve reproduced this issue, and the majority of the regression
> seems to come from the fact that this patch series switches
> mutexes to use handoff mode rather than optimistic spinning. This
> has been a concern for cases where locks are under high
> contention, so I need to spend some time on finding a solution to
> restore some ability for optimistic spinning. Many thanks to
> Prateek for raising this issue!
>
> Previously Chenyu also reported a regression[4], which seems
> similar, but I need to look into it further.
>
> Issues still to address:
> —----------
> * The chain migration patches still need further iterations and
> better validation to ensure they preserve the RT/DL load
> balancing invariants.
>
> * Xuewen Yan earlier pointed out that we may see task
> mis-placement on EAS systems if we do return migration based
> strictly on cpu allowability. I tried an optimization to
> always try to return migrate to the wake_cpu (which was saved
> on proxy-migration), but this seemed to undo a chunk of the
> benefit I saw in moving return migration back to ttwu, at
> least with my prio-inversion-demo microbenchmark. Need to do
> some broader performance analysis with the variations there.
>
> * Optimization to avoid migrating blocked tasks (to preserve
> optimistic mutex spinning) if the runnable lock-owner at the
> end of the blocked_on chain is already running (though this is
> difficult due to the limitations of the locking rules needed to
> traverse the blocked_on chain across run queues).
>
> * Similarly, as we’re often dealing with lists of tasks or chains
> of tasks and mutexes, iterating across these chains of objects
> can be done safely while holding the rq lock, but as these
> chains can cross runqueues our ability to traverse the chains
> safely is somewhat limited.
>
> * CFS load balancing. Blocked tasks may carry forward load
> (PELT) to the lock owner's CPU, so that CPU may look like it
> is overloaded.
>
> * The sleeping owner handling (where we deactivate waiting tasks
> and enqueue them onto a list, then reactivate them when the
> owner wakes up) doesn’t feel great. This is in part because
> when we want to activate tasks, we’re already holding a
> task.pi_lock and a rq_lock, just not the locks for the task
> we’re activating, nor the rq we’re enqueuing it onto. So there
> has to be a bit of lock juggling to drop and acquire the right
> locks (in the right order). It feels like there’s got to be a
> better way.
>
> * “rq_selected()” naming. Peter doesn’t like it, but I’ve not
> thought of a better name. Open to suggestions.
>
> * As discussed at OSPM[5], I’d like to split pick_next_task() up
> into two phases: selecting and setting the next task. Currently
> pick_next_task() assumes the returned task will be run, which
> results in various side effects in sched class logic when it’s
> run. I tried to take a pass at this earlier, but it’s hairy and
> lower on the priority list for now.
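
As a rough illustration of the lock juggling mentioned in the
sleeping-owner item above (a simplified sketch, not the series'
actual code; it assumes IRQs are already disabled as in __schedule()
and glosses over the pi_lock we may already hold on another task):

static void activate_blocked_task(struct rq *this_rq, struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *target_rq;

	raw_spin_rq_unlock(this_rq);		/* drop our rq lock first */
	raw_spin_lock(&p->pi_lock);		/* pi_lock before rq lock */
	target_rq = __task_rq_lock(p, &rf);	/* then p's rq lock */

	activate_task(target_rq, p, ENQUEUE_NOCLOCK);

	__task_rq_unlock(target_rq, &rf);
	raw_spin_unlock(&p->pi_lock);
	raw_spin_rq_lock(this_rq);		/* finally re-take our rq lock */
}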

Do you think we should mention the virtual runqueue idea and adding
trace points to measure task migration times? They are not "open
issues", but rather to-do items on the agenda.

>
>
> If folks find it easier to test/tinker with, this patch series
> can also be found here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v7-6.7-rc6
>
> Feedback would be very welcome!
>
> Thanks so much!
> -john
>
> [1] https://lwn.net/Articles/953438/
> [2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
> [3] https://lore.kernel.org/lkml/[email protected]/
> [4] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
> [5] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
>
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
>
> Connor O'Brien (5):
> sched: Add push_task_chain helper
> sched: Consolidate pick_*_task to task_is_pushable helper
> sched: Push execution and scheduler context split into deadline and rt
> paths
> sched: Add find_exec_ctx helper
> sched: Fix rt/dl load balancing via chain level balance
>
> John Stultz (8):
> locking/mutex: Remove wakeups from under mutex::wait_lock
> sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
> sched: Fix runtime accounting w/ split exec & sched contexts
> sched: Split out __sched() deactivate task logic into a helper
> sched: Add a initial sketch of the find_proxy_task() function
> sched: Handle blocked-waiter migration (and return migration)
> sched: Initial sched_football test implementation
> sched: Refactor dl/rt find_lowest/latest_rq logic
>
> Juri Lelli (2):
> locking/mutex: Make mutex::wait_lock irq safe
> locking/mutex: Expose __mutex_owner()
>
> Peter Zijlstra (7):
> sched: Unify runtime accounting across classes
> locking/mutex: Rework task_struct::blocked_on
> locking/mutex: Switch to mutex handoffs for CONFIG_SCHED_PROXY_EXEC
> sched: Split scheduler and execution contexts
> sched: Start blocked_on chain processing in find_proxy_task()
> sched: Add blocked_donor link to task for smarter mutex handoffs
> sched: Add deactivated (sleeping) owner handling to find_proxy_task()
>
> Valentin Schneider (1):
> sched: Fix proxy/current (push,pull)ability
>
> .../admin-guide/kernel-parameters.txt | 5 +
> Documentation/locking/mutex-design.rst | 3 +
> include/linux/sched.h | 59 +-
> init/Kconfig | 7 +
> init/init_task.c | 1 +
> kernel/fork.c | 9 +-
> kernel/locking/mutex-debug.c | 9 +-
> kernel/locking/mutex.c | 166 ++--
> kernel/locking/mutex.h | 25 +
> kernel/locking/rtmutex.c | 26 +-
> kernel/locking/rwbase_rt.c | 4 +-
> kernel/locking/rwsem.c | 4 +-
> kernel/locking/spinlock_rt.c | 3 +-
> kernel/locking/ww_mutex.h | 73 +-
> kernel/sched/Makefile | 1 +
> kernel/sched/core.c | 822 ++++++++++++++++--
> kernel/sched/cpudeadline.c | 12 +-
> kernel/sched/cpudeadline.h | 3 +-
> kernel/sched/cpupri.c | 31 +-
> kernel/sched/cpupri.h | 6 +-
> kernel/sched/deadline.c | 218 +++--
> kernel/sched/fair.c | 102 ++-
> kernel/sched/rt.c | 298 +++++--
> kernel/sched/sched.h | 105 ++-
> kernel/sched/stop_task.c | 13 +-
> kernel/sched/test_sched_football.c | 242 ++++++
> lib/Kconfig.debug | 14 +
> 27 files changed, 1861 insertions(+), 400 deletions(-)
> create mode 100644 kernel/sched/test_sched_football.c
>


2023-12-21 10:17:45

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 05/23] locking/mutex: Rework task_struct::blocked_on

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Peter Zijlstra <[email protected]>
>
> Track the blocked-on relation for mutexes, to allow following this
> relation at schedule time.
>
> task
>   | blocked-on
>   v
> mutex
>   | owner
>   v
> task
>
> Also adds a blocked_on_state value so we can distinguqish when a

What about "Also add a blocked_on_state enum to task_struct to
distinguish ..." to use an imperative language?
The same for "Also adds" and "Finally adds" below.

> task is blocked_on a mutex, but is either blocked, waking up, or
> runnable (such that it can try to aquire the lock its blocked

acquire

> on).
>
> This avoids some of the subtle & racy games where the blocked_on
> state gets cleared, only to have it re-added by the
> mutex_lock_slowpath call when it tries to aquire the lock on

ditto

> wakeup

wakeup.

>
> Also adds blocked_lock to the task_struct so we can safely
> serialize the blocked-on state.
>
> Finally adds wrappers that are useful to provide correctness
> checks. Folded in from a patch by:
> Valentin Schneider <[email protected]>
>
> This all will be used for tracking blocked-task/mutex chains
> with the prox-execution patch in a similar fashion to how
proxy

> priority inheritence is done with rt_mutexes.
inheritance

>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> [minor changes while rebasing]
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v2:
> * Fixed blocked_on tracking in error paths that was causing crashes
> v4:
> * Ensure we clear blocked_on when waking ww_mutexes to die or wound.
> This is critical so we don't get ciruclar blocked_on relationships

circular

> that can't be resolved.
> v5:
> * Fix potential bug where the skip_wait path might clear blocked_on
> when that path never set it
> * Slight tweaks to where we set blocked_on to make it consistent,
> along with extra WARN_ON correctness checking
> * Minor comment changes
> v7:
> * Minor commit message change suggested by Metin Kaya
> * Fix WARN_ON conditionals in unlock path (as blocked_on might
> already be cleared), found while looking at issue Metin Kaya
> raised.
> * Minor tweaks to be consistent in what we do under the
> blocked_on lock, also tweaked variable name to avoid confusion
> with label, and comment typos, as suggested by Metin Kaya
> * Minor tweak for CONFIG_SCHED_PROXY_EXEC name change
> * Moved unused block of code to later in the series, as suggested
> by Metin Kaya
> * Switch to a tri-state to be able to distinguish from waking and
> runnable so we can later safely do return migration from ttwu
> * Folded together with related blocked_on changes
> ---
> include/linux/sched.h | 40 ++++++++++++++++++++++++++++++++----
> init/init_task.c | 1 +
> kernel/fork.c | 4 ++--
> kernel/locking/mutex-debug.c | 9 ++++----
> kernel/locking/mutex.c | 35 +++++++++++++++++++++++++++----
> kernel/locking/ww_mutex.h | 24 ++++++++++++++++++++--
> kernel/sched/core.c | 6 ++++++
> 7 files changed, 103 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1e80c330f755..bfe8670f99a1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -743,6 +743,12 @@ struct kmap_ctrl {
> #endif
> };
>
> +enum blocked_on_state {
> + BO_RUNNABLE,
> + BO_BLOCKED,
> + BO_WAKING,
> +};
> +
> struct task_struct {
> #ifdef CONFIG_THREAD_INFO_IN_TASK
> /*
> @@ -1149,10 +1155,9 @@ struct task_struct {
> struct rt_mutex_waiter *pi_blocked_on;
> #endif
>
> -#ifdef CONFIG_DEBUG_MUTEXES
> - /* Mutex deadlock detection: */
> - struct mutex_waiter *blocked_on;
> -#endif
> + enum blocked_on_state blocked_on_state;
> + struct mutex *blocked_on; /* lock we're blocked on */
> + raw_spinlock_t blocked_lock;
>
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> int non_block_count;
> @@ -2258,6 +2263,33 @@ static inline int rwlock_needbreak(rwlock_t *lock)
> #endif
> }
>
> +static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
> +{
> + lockdep_assert_held(&p->blocked_lock);
> +
> + /*
> + * Check we are clearing values to NULL or setting NULL
> + * to values to ensure we don't overwrite exisiting mutex

existing

> + * values or clear already cleared values
> + */
> + WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
> +
> + p->blocked_on = m;
> + p->blocked_on_state = m ? BO_BLOCKED : BO_RUNNABLE;
> +}
> +
> +static inline struct mutex *get_task_blocked_on(struct task_struct *p)
> +{
> + lockdep_assert_held(&p->blocked_lock);
> +
> + return p->blocked_on;
> +}
> +
> +static inline struct mutex *get_task_blocked_on_once(struct task_struct *p)
> +{
> + return READ_ONCE(p->blocked_on);
> +}

These functions make me think we should use [get, set]_task_blocked_on()
for accessing the blocked_on & blocked_on_state fields, but there are
some places in this patch where we access the aforementioned fields
directly. Is this OK?
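
For illustration, the kind of helper such a consolidation could add
(hypothetical, not part of this patch):

static inline void set_task_blocked_on_state(struct task_struct *p,
					     enum blocked_on_state state)
{
	/* Hypothetical: keep state writes behind the same lockdep check */
	lockdep_assert_held(&p->blocked_lock);
	p->blocked_on_state = state;
}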

> +
> static __always_inline bool need_resched(void)
> {
> return unlikely(tif_need_resched());
> diff --git a/init/init_task.c b/init/init_task.c
> index 5727d42149c3..0c31d7d7c7a9 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -131,6 +131,7 @@ struct task_struct init_task
> .journal_info = NULL,
> INIT_CPU_TIMERS(init_task)
> .pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
> + .blocked_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
> .timer_slack_ns = 50000, /* 50 usec default slack */
> .thread_pid = &init_struct_pid,
> .thread_node = LIST_HEAD_INIT(init_signals.thread_head),
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 10917c3e1f03..b3ba3d22d8b2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2358,6 +2358,7 @@ __latent_entropy struct task_struct *copy_process(
> ftrace_graph_init_task(p);
>
> rt_mutex_init_task(p);
> + raw_spin_lock_init(&p->blocked_lock);
>
> lockdep_assert_irqs_enabled();
> #ifdef CONFIG_PROVE_LOCKING
> @@ -2456,9 +2457,8 @@ __latent_entropy struct task_struct *copy_process(
> lockdep_init_task(p);
> #endif
>
> -#ifdef CONFIG_DEBUG_MUTEXES
> + p->blocked_on_state = BO_RUNNABLE;
> p->blocked_on = NULL; /* not blocked yet */
> -#endif
> #ifdef CONFIG_BCACHE
> p->sequential_io = 0;
> p->sequential_io_avg = 0;
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index bc8abb8549d2..1eedf7c60c00 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -52,17 +52,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
> {
> lockdep_assert_held(&lock->wait_lock);
>
> - /* Mark the current thread as blocked on the lock: */
> - task->blocked_on = waiter;
> + /* Current thread can't be already blocked (since it's executing!) */
> + DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
> }
>
> void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
> struct task_struct *task)
> {
> + struct mutex *blocked_on = get_task_blocked_on_once(task);
> +
> DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
> DEBUG_LOCKS_WARN_ON(waiter->task != task);
> - DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
> - task->blocked_on = NULL;
> + DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
>
> INIT_LIST_HEAD(&waiter->list);
> waiter->task = NULL;
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 543774506fdb..6084470773f6 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -592,6 +592,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> }
>
> raw_spin_lock_irqsave(&lock->wait_lock, flags);
> + raw_spin_lock(&current->blocked_lock);
> /*
> * After waiting to acquire the wait_lock, try again.
> */
> @@ -622,6 +623,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> goto err_early_kill;
> }
>
> + set_task_blocked_on(current, lock);
> set_current_state(state);
> trace_contention_begin(lock, LCB_F_MUTEX);
> for (;;) {
> @@ -652,6 +654,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> goto err;
> }
>
> + raw_spin_unlock(&current->blocked_lock);
> raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> /* Make sure we do wakeups before calling schedule */
> if (!wake_q_empty(&wake_q)) {
> @@ -662,6 +665,13 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>
> first = __mutex_waiter_is_first(lock, &waiter);
>
> + raw_spin_lock_irqsave(&lock->wait_lock, flags);
> + raw_spin_lock(&current->blocked_lock);
> +
> + /*
> + * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
> + */
> + current->blocked_on_state = BO_BLOCKED;
> set_current_state(state);
> /*
> * Here we order against unlock; we must either see it change
> @@ -672,16 +682,25 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> break;
>
> if (first) {
> + bool opt_acquired;
> +
> + /*
> + * mutex_optimistic_spin() can schedule, so we need to
> + * release these locks before calling it.
> + */
> + raw_spin_unlock(&current->blocked_lock);
> + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> - if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
> + opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
> + raw_spin_lock_irqsave(&lock->wait_lock, flags);
> + raw_spin_lock(&current->blocked_lock);
> + if (opt_acquired)
> break;
> trace_contention_begin(lock, LCB_F_MUTEX);
> }
> -
> - raw_spin_lock_irqsave(&lock->wait_lock, flags);
> }
> - raw_spin_lock_irqsave(&lock->wait_lock, flags);
> acquired:
> + set_task_blocked_on(current, NULL);
> __set_current_state(TASK_RUNNING);
>
> if (ww_ctx) {
> @@ -706,16 +725,20 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> if (ww_ctx)
> ww_mutex_lock_acquired(ww, ww_ctx);
>
> + raw_spin_unlock(&current->blocked_lock);
> raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> wake_up_q(&wake_q);
> preempt_enable();
> return 0;
>
> err:
> + set_task_blocked_on(current, NULL);
> __set_current_state(TASK_RUNNING);
> __mutex_remove_waiter(lock, &waiter);
> err_early_kill:
> + WARN_ON(current->blocked_on);
> trace_contention_end(lock, ret);
> + raw_spin_unlock(&current->blocked_lock);
> raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> debug_mutex_free_waiter(&waiter);
> mutex_release(&lock->dep_map, ip);
> @@ -925,8 +948,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>
> next = waiter->task;
>
> + raw_spin_lock(&next->blocked_lock);
> debug_mutex_wake_waiter(lock, waiter);
> + WARN_ON(next->blocked_on != lock);
> + next->blocked_on_state = BO_WAKING;
> wake_q_add(&wake_q, next);
> + raw_spin_unlock(&next->blocked_lock);
> }
>
> if (owner & MUTEX_FLAG_HANDOFF)
> diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> index 9facc0ddfdd3..8dd21ff5eee0 100644
> --- a/kernel/locking/ww_mutex.h
> +++ b/kernel/locking/ww_mutex.h
> @@ -281,10 +281,21 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
> return false;
>
> if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
> + /* nested as we should hold current->blocked_lock already */
> + raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
> #ifndef WW_RT
> debug_mutex_wake_waiter(lock, waiter);
> #endif
> + /*
> + * When waking up the task to die, be sure to set the
> + * blocked_on_state to WAKING. Otherwise we can see
> + * circular blocked_on relationships that can't
> + * resolve.
> + */
> + WARN_ON(waiter->task->blocked_on != lock);
> + waiter->task->blocked_on_state = BO_WAKING;
> wake_q_add(wake_q, waiter->task);
> + raw_spin_unlock(&waiter->task->blocked_lock);
> }
>
> return true;
> @@ -331,9 +342,18 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
> * it's wounded in __ww_mutex_check_kill() or has a
> * wakeup pending to re-read the wounded state.
> */
> - if (owner != current)
> + if (owner != current) {
> + /* nested as we should hold current->blocked_lock already */
> + raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
> + /*
> + * When waking up the task to wound, be sure to set the
> + * blocked_on_state flag. Otherwise we can see circular
> + * blocked_on relationships that can't resolve.
> + */
> + owner->blocked_on_state = BO_WAKING;
> wake_q_add(wake_q, owner);
> -
> + raw_spin_unlock(&owner->blocked_lock);
> + }
> return true;
> }
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a708d225c28e..4e46189d545d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4195,6 +4195,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
> int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> {
> guard(preempt)();
> + unsigned long flags;
> int cpu, success = 0;
>
> if (p == current) {
> @@ -4341,6 +4342,11 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>
> ttwu_queue(p, cpu, wake_flags);
> }
> + /* XXX can we do something better here for !CONFIG_SCHED_PROXY_EXEC case */

The blocked_on* fields are now used even in the !CONFIG_SCHED_PROXY_EXEC
case. I'm unsure if we can get rid of the lock & unlock lines or the
entire hunk, but would something like this be too ugly? I wish we could
convert blocked_on_state to an atomic variable.

	if (p->blocked_on_state == BO_WAKING) {
		raw_spin_lock_irqsave(&p->blocked_lock, flags);
		p->blocked_on_state = BO_RUNNABLE;
		raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
	}
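
And if blocked_on_state ever did become an atomic as wished above, the
hunk could in principle shrink to a single lockless transition
(hypothetical fragment, only valid after such a conversion):

	atomic_cmpxchg(&p->blocked_on_state, BO_WAKING, BO_RUNNABLE);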

> + raw_spin_lock_irqsave(&p->blocked_lock, flags);
> + if (p->blocked_on_state == BO_WAKING)
> + p->blocked_on_state = BO_RUNNABLE;
> + raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
> out:
> if (success)
> ttwu_stat(p, task_cpu(p), wake_flags);


2023-12-21 10:45:03

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 08/23] sched: Split scheduler and execution contexts

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Peter Zijlstra <[email protected]>
>
> Let's define the scheduling context as all the scheduler state
> in task_struct for the task selected to run, and the execution
> context as all state required to actually run the task.
>
> Currently both are intertwined in task_struct. We want to
> logically split these such that we can use the scheduling
> context of the task selected to be scheduled, but use the
> execution context of a different task to actually be run.

Should we update Documentation/kernel-hacking/hacking.rst (line #348:
:c:macro:`current`) or another appropriate doc to announce the
separation of scheduling & execution contexts?

>
> To this purpose, introduce rq_selected() macro to point to the
> task_struct selected from the runqueue by the scheduler, and
> will be used for scheduler state, and preserve rq->curr to
> indicate the execution context of the task that will actually be
> run.
>
> NOTE: Peter previously mentioned he didn't like the name
> "rq_selected()", but I've not come up with a better alternative.
> I'm very open to other name proposals.
>
> Question for Peter: Dietmar suggested you'd prefer I drop the
> conditionalization of the scheduler context pointer on the rq
> (so rq_selected() would be open coded as rq->curr_selected or
> whatever we agree on for a name), but I'd think in the
> !CONFIG_PROXY_EXEC case we'd want to avoid the wasted pointer
> and its use (since it curr_selected would always be == curr)?
> If I'm wrong I'm fine switching this, but would appreciate
> clarification.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
> [add additional comments and update more sched_class code to use
> rq::proxy]
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: Rebased and resolved minor collisions, reworked to use
> accessors, tweaked update_curr_common to use rq_proxy fixing rt
> scheduling issues]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v2:
> * Reworked to use accessors
> * Fixed update_curr_common to use proxy instead of curr
> v3:
> * Tweaked wrapper names
> * Swapped proxy for selected for clarity
> v4:
> * Minor variable name tweaks for readability
> * Use a macro instead of a inline function and drop
> other helper functions as suggested by Peter.
> * Remove verbose comments/questions to avoid review
> distractions, as suggested by Dietmar
> v5:
> * Add CONFIG_PROXY_EXEC option to this patch so the
> new logic can be tested with this change
> * Minor fix to grab rq_selected when holding the rq lock
> v7:
> * Minor spelling fix and unused argument fixes suggested by
> Metin Kaya
> * Switch to curr_selected for consistency, and minor rewording
> of commit message for clarity
> * Rename variables selected instead of curr when we're using
> rq_selected()
> * Reduce macros in CONFIG_SCHED_PROXY_EXEC ifdef sections,
> as suggested by Metin Kaya
> ---
> kernel/sched/core.c | 46 ++++++++++++++++++++++++++---------------
> kernel/sched/deadline.c | 35 ++++++++++++++++---------------
> kernel/sched/fair.c | 18 ++++++++--------
> kernel/sched/rt.c | 40 +++++++++++++++++------------------
> kernel/sched/sched.h | 35 +++++++++++++++++++++++++++++--
> 5 files changed, 109 insertions(+), 65 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e06558fb08aa..0ce34f5c0e0c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -822,7 +822,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
>
> rq_lock(rq, &rf);
> update_rq_clock(rq);
> - rq->curr->sched_class->task_tick(rq, rq->curr, 1);
> + rq_selected(rq)->sched_class->task_tick(rq, rq_selected(rq), 1);
> rq_unlock(rq, &rf);
>
> return HRTIMER_NORESTART;
> @@ -2242,16 +2242,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
>
> void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
> {
> - if (p->sched_class == rq->curr->sched_class)
> - rq->curr->sched_class->wakeup_preempt(rq, p, flags);
> - else if (sched_class_above(p->sched_class, rq->curr->sched_class))
> + struct task_struct *selected = rq_selected(rq);
> +
> + if (p->sched_class == selected->sched_class)
> + selected->sched_class->wakeup_preempt(rq, p, flags);
> + else if (sched_class_above(p->sched_class, selected->sched_class))
> resched_curr(rq);
>
> /*
> * A queue event has occurred, and we're going to schedule. In
> * this case, we can save a useless back to back clock update.
> */
> - if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
> + if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
> rq_clock_skip_update(rq);
> }
>
> @@ -2780,7 +2782,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
> lockdep_assert_held(&p->pi_lock);
>
> queued = task_on_rq_queued(p);
> - running = task_current(rq, p);
> + running = task_current_selected(rq, p);
>
> if (queued) {
> /*
> @@ -5600,7 +5602,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> * project cycles that may never be accounted to this
> * thread, breaking clock_gettime().
> */
> - if (task_current(rq, p) && task_on_rq_queued(p)) {
> + if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
> prefetch_curr_exec_start(p);
> update_rq_clock(rq);
> p->sched_class->update_curr(rq);
> @@ -5668,7 +5670,8 @@ void scheduler_tick(void)
> {
> int cpu = smp_processor_id();
> struct rq *rq = cpu_rq(cpu);
> - struct task_struct *curr = rq->curr;
> + /* accounting goes to the selected task */
> + struct task_struct *selected;
> struct rq_flags rf;
> unsigned long thermal_pressure;
> u64 resched_latency;
> @@ -5679,16 +5682,17 @@ void scheduler_tick(void)
> sched_clock_tick();
>
> rq_lock(rq, &rf);
> + selected = rq_selected(rq);
>
> update_rq_clock(rq);
> thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
> - curr->sched_class->task_tick(rq, curr, 0);
> + selected->sched_class->task_tick(rq, selected, 0);
> if (sched_feat(LATENCY_WARN))
> resched_latency = cpu_resched_latency(rq);
> calc_global_load_tick(rq);
> sched_core_tick(rq);
> - task_tick_mm_cid(rq, curr);
> + task_tick_mm_cid(rq, selected);
>
> rq_unlock(rq, &rf);
>
> @@ -5697,8 +5701,8 @@ void scheduler_tick(void)
>
> perf_event_task_tick();
>
> - if (curr->flags & PF_WQ_WORKER)
> - wq_worker_tick(curr);
> + if (selected->flags & PF_WQ_WORKER)
> + wq_worker_tick(selected);
>
> #ifdef CONFIG_SMP
> rq->idle_balance = idle_cpu(cpu);
> @@ -5763,6 +5767,12 @@ static void sched_tick_remote(struct work_struct *work)
> struct task_struct *curr = rq->curr;
>
> if (cpu_online(cpu)) {
> + /*
> + * Since this is a remote tick for full dynticks mode,
> + * we are always sure that there is no proxy (only a
> + * single task is running).
> + */
> + SCHED_WARN_ON(rq->curr != rq_selected(rq));
> update_rq_clock(rq);
>
> if (!is_idle_task(curr)) {
> @@ -6685,6 +6695,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> }
>
> next = pick_next_task(rq, prev, &rf);
> + rq_set_selected(rq, next);
> clear_tsk_need_resched(prev);
> clear_preempt_need_resched();
> #ifdef CONFIG_SCHED_DEBUG
> @@ -7185,7 +7196,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
>
> prev_class = p->sched_class;
> queued = task_on_rq_queued(p);
> - running = task_current(rq, p);
> + running = task_current_selected(rq, p);
> if (queued)
> dequeue_task(rq, p, queue_flag);
> if (running)
> @@ -7275,7 +7286,7 @@ void set_user_nice(struct task_struct *p, long nice)
> }
>
> queued = task_on_rq_queued(p);
> - running = task_current(rq, p);
> + running = task_current_selected(rq, p);
> if (queued)
> dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
> if (running)
> @@ -7868,7 +7879,7 @@ static int __sched_setscheduler(struct task_struct *p,
> }
>
> queued = task_on_rq_queued(p);
> - running = task_current(rq, p);
> + running = task_current_selected(rq, p);
> if (queued)
> dequeue_task(rq, p, queue_flags);
> if (running)
> @@ -9295,6 +9306,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
> rcu_read_unlock();
>
> rq->idle = idle;
> + rq_set_selected(rq, idle);
> rcu_assign_pointer(rq->curr, idle);
> idle->on_rq = TASK_ON_RQ_QUEUED;
> #ifdef CONFIG_SMP
> @@ -9384,7 +9396,7 @@ void sched_setnuma(struct task_struct *p, int nid)
>
> rq = task_rq_lock(p, &rf);
> queued = task_on_rq_queued(p);
> - running = task_current(rq, p);
> + running = task_current_selected(rq, p);
>
> if (queued)
> dequeue_task(rq, p, DEQUEUE_SAVE);
> @@ -10489,7 +10501,7 @@ void sched_move_task(struct task_struct *tsk)
>
> update_rq_clock(rq);
>
> - running = task_current(rq, tsk);
> + running = task_current_selected(rq, tsk);
> queued = task_on_rq_queued(tsk);
>
> if (queued)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 6140f1f51da1..9cf20f4ac5f9 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1150,7 +1150,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> #endif
>
> enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
> - if (dl_task(rq->curr))
> + if (dl_task(rq_selected(rq)))
> wakeup_preempt_dl(rq, p, 0);
> else
> resched_curr(rq);
> @@ -1273,7 +1273,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
> */
> static void update_curr_dl(struct rq *rq)
> {
> - struct task_struct *curr = rq->curr;
> + struct task_struct *curr = rq_selected(rq);
> struct sched_dl_entity *dl_se = &curr->dl;
> s64 delta_exec, scaled_delta_exec;
> int cpu = cpu_of(rq);
> @@ -1784,7 +1784,7 @@ static int find_later_rq(struct task_struct *task);
> static int
> select_task_rq_dl(struct task_struct *p, int cpu, int flags)
> {
> - struct task_struct *curr;
> + struct task_struct *curr, *selected;
> bool select_rq;
> struct rq *rq;
>
> @@ -1795,6 +1795,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
>
> rcu_read_lock();
> curr = READ_ONCE(rq->curr); /* unlocked access */
> + selected = READ_ONCE(rq_selected(rq));
>
> /*
> * If we are dealing with a -deadline task, we must
> @@ -1805,9 +1806,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
> * other hand, if it has a shorter deadline, we
> * try to make it stay here, it might be important.
> */
> - select_rq = unlikely(dl_task(curr)) &&
> + select_rq = unlikely(dl_task(selected)) &&
> (curr->nr_cpus_allowed < 2 ||
> - !dl_entity_preempt(&p->dl, &curr->dl)) &&
> + !dl_entity_preempt(&p->dl, &selected->dl)) &&
> p->nr_cpus_allowed > 1;
>
> /*
> @@ -1870,7 +1871,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> * let's hope p can move out.
> */
> if (rq->curr->nr_cpus_allowed == 1 ||
> - !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
> + !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
> return;
>
> /*
> @@ -1909,7 +1910,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
> int flags)
> {
> - if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
> + if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
> resched_curr(rq);
> return;
> }
> @@ -1919,7 +1920,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
> * In the unlikely case current and p have the same deadline
> * let us try to decide what's the best thing to do...
> */
> - if ((p->dl.deadline == rq->curr->dl.deadline) &&
> + if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
> !test_tsk_need_resched(rq->curr))
> check_preempt_equal_dl(rq, p);
> #endif /* CONFIG_SMP */
> @@ -1954,7 +1955,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
> if (hrtick_enabled_dl(rq))
> start_hrtick_dl(rq, p);
>
> - if (rq->curr->sched_class != &dl_sched_class)
> + if (rq_selected(rq)->sched_class != &dl_sched_class)
> update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
>
> deadline_queue_push_tasks(rq);
> @@ -2268,8 +2269,8 @@ static int push_dl_task(struct rq *rq)
> * can move away, it makes sense to just reschedule
> * without going further in pushing next_task.
> */
> - if (dl_task(rq->curr) &&
> - dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
> + if (dl_task(rq_selected(rq)) &&
> + dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
> rq->curr->nr_cpus_allowed > 1) {
> resched_curr(rq);
> return 0;
> @@ -2394,7 +2395,7 @@ static void pull_dl_task(struct rq *this_rq)
> * deadline than the current task of its runqueue.
> */
> if (dl_time_before(p->dl.deadline,
> - src_rq->curr->dl.deadline))
> + rq_selected(src_rq)->dl.deadline))
> goto skip;
>
> if (is_migration_disabled(p)) {
> @@ -2435,9 +2436,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
> if (!task_on_cpu(rq, p) &&
> !test_tsk_need_resched(rq->curr) &&
> p->nr_cpus_allowed > 1 &&
> - dl_task(rq->curr) &&
> + dl_task(rq_selected(rq)) &&
> (rq->curr->nr_cpus_allowed < 2 ||
> - !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
> + !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
> push_dl_tasks(rq);
> }
> }
> @@ -2612,12 +2613,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
> return;
> }
>
> - if (rq->curr != p) {
> + if (rq_selected(rq) != p) {
> #ifdef CONFIG_SMP
> if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
> deadline_queue_push_tasks(rq);
> #endif
> - if (dl_task(rq->curr))
> + if (dl_task(rq_selected(rq)))
> wakeup_preempt_dl(rq, p, 0);
> else
> resched_curr(rq);
> @@ -2646,7 +2647,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
> if (!rq->dl.overloaded)
> deadline_queue_pull_task(rq);
>
> - if (task_current(rq, p)) {
> + if (task_current_selected(rq, p)) {
> /*
> * If we now have a earlier deadline task than p,
> * then reschedule, provided p is still on this
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1251fd01a555..07216ea3ed53 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1157,7 +1157,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> */
> s64 update_curr_common(struct rq *rq)
> {
> - struct task_struct *curr = rq->curr;
> + struct task_struct *curr = rq_selected(rq);
> s64 delta_exec;
>
> delta_exec = update_curr_se(rq, &curr->se);
> @@ -1203,7 +1203,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>
> static void update_curr_fair(struct rq *rq)
> {
> - update_curr(cfs_rq_of(&rq->curr->se));
> + update_curr(cfs_rq_of(&rq_selected(rq)->se));
> }
>
> static inline void
> @@ -6611,7 +6611,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> s64 delta = slice - ran;
>
> if (delta < 0) {
> - if (task_current(rq, p))
> + if (task_current_selected(rq, p))
> resched_curr(rq);
> return;
> }
> @@ -6626,7 +6626,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> */
> static void hrtick_update(struct rq *rq)
> {
> - struct task_struct *curr = rq->curr;
> + struct task_struct *curr = rq_selected(rq);
>
> if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
> return;
> @@ -8235,7 +8235,7 @@ static void set_next_buddy(struct sched_entity *se)
> */
> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> {
> - struct task_struct *curr = rq->curr;
> + struct task_struct *curr = rq_selected(rq);
> struct sched_entity *se = &curr->se, *pse = &p->se;
> struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> int next_buddy_marked = 0;
> @@ -8268,7 +8268,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * prevents us from potentially nominating it as a false LAST_BUDDY
> * below.
> */
> - if (test_tsk_need_resched(curr))
> + if (test_tsk_need_resched(rq->curr))
> return;
>
> /* Idle tasks are by definition preempted by non-idle tasks. */
> @@ -9252,7 +9252,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> * DL and IRQ signals have been updated before updating CFS.
> */
> - curr_class = rq->curr->sched_class;
> + curr_class = rq_selected(rq)->sched_class;
>
> thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
>
> @@ -12640,7 +12640,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
> * our priority decreased, or if we are not currently running on
> * this runqueue and our priority is higher than the current's
> */
> - if (task_current(rq, p)) {
> + if (task_current_selected(rq, p)) {
> if (p->prio > oldprio)
> resched_curr(rq);
> } else
> @@ -12743,7 +12743,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
> * kick off the schedule if running, otherwise just see
> * if we can still preempt the current task.
> */
> - if (task_current(rq, p))
> + if (task_current_selected(rq, p))
> resched_curr(rq);
> else
> wakeup_preempt(rq, p, 0);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 9cdea3ea47da..2682cec45aaa 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -530,7 +530,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
>
> static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
> {
> - struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
> + struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
> struct rq *rq = rq_of_rt_rq(rt_rq);
> struct sched_rt_entity *rt_se;
>
> @@ -1000,7 +1000,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
> */
> static void update_curr_rt(struct rq *rq)
> {
> - struct task_struct *curr = rq->curr;
> + struct task_struct *curr = rq_selected(rq);
> struct sched_rt_entity *rt_se = &curr->rt;
> s64 delta_exec;
>
> @@ -1545,7 +1545,7 @@ static int find_lowest_rq(struct task_struct *task);
> static int
> select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> {
> - struct task_struct *curr;
> + struct task_struct *curr, *selected;
> struct rq *rq;
> bool test;
>
> @@ -1557,6 +1557,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>
> rcu_read_lock();
> curr = READ_ONCE(rq->curr); /* unlocked access */
> + selected = READ_ONCE(rq_selected(rq));
>
> /*
> * If the current task on @p's runqueue is an RT task, then
> @@ -1585,8 +1586,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> * systems like big.LITTLE.
> */
> test = curr &&
> - unlikely(rt_task(curr)) &&
> - (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> + unlikely(rt_task(selected)) &&
> + (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);
>
> if (test || !rt_task_fits_capacity(p, cpu)) {
> int target = find_lowest_rq(p);
> @@ -1616,12 +1617,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>
> static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> {
> - /*
> - * Current can't be migrated, useless to reschedule,
> - * let's hope p can move out.
> - */
> if (rq->curr->nr_cpus_allowed == 1 ||
> - !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
> + !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
> return;
>
> /*
> @@ -1664,7 +1661,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> */
> static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
> {
> - if (p->prio < rq->curr->prio) {
> + struct task_struct *curr = rq_selected(rq);
> +
> + if (p->prio < curr->prio) {
> resched_curr(rq);
> return;
> }
> @@ -1682,7 +1681,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
> * to move current somewhere else, making room for our non-migratable
> * task.
> */
> - if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
> + if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
> check_preempt_equal_prio(rq, p);
> #endif
> }
> @@ -1707,7 +1706,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
> * utilization. We only care of the case where we start to schedule a
> * rt task
> */
> - if (rq->curr->sched_class != &rt_sched_class)
> + if (rq_selected(rq)->sched_class != &rt_sched_class)
> update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
>
> rt_queue_push_tasks(rq);
> @@ -1988,6 +1987,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)
>
> BUG_ON(rq->cpu != task_cpu(p));
> BUG_ON(task_current(rq, p));
> + BUG_ON(task_current_selected(rq, p));
> BUG_ON(p->nr_cpus_allowed <= 1);
>
> BUG_ON(!task_on_rq_queued(p));
> @@ -2020,7 +2020,7 @@ static int push_rt_task(struct rq *rq, bool pull)
> * higher priority than current. If that's the case
> * just reschedule current.
> */
> - if (unlikely(next_task->prio < rq->curr->prio)) {
> + if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
> resched_curr(rq);
> return 0;
> }
> @@ -2375,7 +2375,7 @@ static void pull_rt_task(struct rq *this_rq)
> * p if it is lower in priority than the
> * current task on the run queue
> */
> - if (p->prio < src_rq->curr->prio)
> + if (p->prio < rq_selected(src_rq)->prio)
> goto skip;
>
> if (is_migration_disabled(p)) {
> @@ -2419,9 +2419,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
> bool need_to_push = !task_on_cpu(rq, p) &&
> !test_tsk_need_resched(rq->curr) &&
> p->nr_cpus_allowed > 1 &&
> - (dl_task(rq->curr) || rt_task(rq->curr)) &&
> + (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
> (rq->curr->nr_cpus_allowed < 2 ||
> - rq->curr->prio <= p->prio);
> + rq_selected(rq)->prio <= p->prio);
>
> if (need_to_push)
> push_rt_tasks(rq);
> @@ -2505,7 +2505,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> rt_queue_push_tasks(rq);
> #endif /* CONFIG_SMP */
> - if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
> + if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
> resched_curr(rq);
> }
> }
> @@ -2520,7 +2520,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
> if (!task_on_rq_queued(p))
> return;
>
> - if (task_current(rq, p)) {
> + if (task_current_selected(rq, p)) {
> #ifdef CONFIG_SMP
> /*
> * If our priority decreases while running, we
> @@ -2546,7 +2546,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
> * greater than the current running task
> * then reschedule.
> */
> - if (p->prio < rq->curr->prio)
> + if (p->prio < rq_selected(rq)->prio)
> resched_curr(rq);
> }
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3e0e4fc8734b..6ea1dfbe502a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -994,7 +994,10 @@ struct rq {
> */
> unsigned int nr_uninterruptible;
>
> - struct task_struct __rcu *curr;
> + struct task_struct __rcu *curr; /* Execution context */
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> + struct task_struct __rcu *curr_selected; /* Scheduling context (policy) */
> +#endif
> struct task_struct *idle;
> struct task_struct *stop;
> unsigned long next_balance;
> @@ -1189,6 +1192,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> #define raw_rq() raw_cpu_ptr(&runqueues)
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +#define rq_selected(rq) ((rq)->curr_selected)
> +static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
> +{
> + rcu_assign_pointer(rq->curr_selected, t);
> +}
> +#else
> +#define rq_selected(rq) ((rq)->curr)
> +static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
> +{
> + /* Do nothing */
> +}
> +#endif
> +
> struct sched_group;
> #ifdef CONFIG_SCHED_CORE
> static inline struct cpumask *sched_group_span(struct sched_group *sg);
> @@ -2112,11 +2129,25 @@ static inline u64 global_rt_runtime(void)
> return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
> }
>
> +/*
> + * Is p the current execution context?
> + */
> static inline int task_current(struct rq *rq, struct task_struct *p)
> {
> return rq->curr == p;
> }
>
> +/*
> + * Is p the current scheduling context?
> + *
> + * Note that it might be the current execution context at the same time if
> + * rq->curr == rq_selected() == p.
> + */
> +static inline int task_current_selected(struct rq *rq, struct task_struct *p)
> +{
> + return rq_selected(rq) == p;
> +}
> +
> static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
> {
> #ifdef CONFIG_SMP
> @@ -2280,7 +2311,7 @@ struct sched_class {
>
> static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> {
> - WARN_ON_ONCE(rq->curr != prev);
> + WARN_ON_ONCE(rq_selected(rq) != prev);
> prev->sched_class->put_prev_task(rq, prev);
> }
>


2023-12-21 12:32:35

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 10/23] sched: Split out __sched() deactivate task logic into a helper

On 20/12/2023 12:18 am, John Stultz wrote:
> As we're going to re-use the deactivation logic,
> split it into a helper.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v6:
> * Define function as static to avoid "no previous prototype"
> warnings as Reported-by: kernel test robot <[email protected]>
> v7:
> * Rename state task_state to be more clear, as suggested by
> Metin Kaya
> ---
> kernel/sched/core.c | 66 +++++++++++++++++++++++++--------------------
> 1 file changed, 37 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0ce34f5c0e0c..34acd80b6bd0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6571,6 +6571,42 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> # define SM_MASK_PREEMPT SM_PREEMPT
> #endif
>

* Better to have a comment here (e.g., under which conditions
try_to_deactivate_task() returns true or false).

* try_to_deactivate_task() is used by 2 commits in the patch set only
temporarily (i.e., at the end of the series it is only called by
__schedule(), just like in this patch). However, it's nice to make that
big __schedule() function a bit more modular, as we discussed off-list.
So, should we move this function out of the Proxy Execution patch set to
get it merged independently?
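
For instance, something along these lines (wording illustrative only,
describing the function as it stands in this patch):

/*
 * try_to_deactivate_task - dequeue @p ahead of it going to sleep
 *
 * Returns true if @p was deactivated (dequeued with DEQUEUE_SLEEP),
 * false if a pending signal forced it back to TASK_RUNNING and it
 * stays on the runqueue.
 */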

> +static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
> + unsigned long task_state)
> +{
> + if (signal_pending_state(task_state, p)) {
> + WRITE_ONCE(p->__state, TASK_RUNNING);
> + } else {
> + p->sched_contributes_to_load =
> + (task_state & TASK_UNINTERRUPTIBLE) &&
> + !(task_state & TASK_NOLOAD) &&
> + !(task_state & TASK_FROZEN);
> +
> + if (p->sched_contributes_to_load)
> + rq->nr_uninterruptible++;
> +
> + /*
> + * __schedule() ttwu()
> + * prev_state = prev->state; if (p->on_rq && ...)
> + * if (prev_state) goto out;
> + * p->on_rq = 0; smp_acquire__after_ctrl_dep();
> + * p->state = TASK_WAKING
> + *
> + * Where __schedule() and ttwu() have matching control dependencies.
> + *
> + * After this, schedule() must not care about p->state any more.
> + */
> + deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> +
> + if (p->in_iowait) {
> + atomic_inc(&rq->nr_iowait);
> + delayacct_blkio_start();
> + }
> + return true;
> + }
> + return false;
> +}
> +
> /*
> * __schedule() is the main scheduler function.
> *
> @@ -6662,35 +6698,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> */
> prev_state = READ_ONCE(prev->__state);
> if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
> - if (signal_pending_state(prev_state, prev)) {
> - WRITE_ONCE(prev->__state, TASK_RUNNING);
> - } else {
> - prev->sched_contributes_to_load =
> - (prev_state & TASK_UNINTERRUPTIBLE) &&
> - !(prev_state & TASK_NOLOAD) &&
> - !(prev_state & TASK_FROZEN);
> -
> - if (prev->sched_contributes_to_load)
> - rq->nr_uninterruptible++;
> -
> - /*
> - * __schedule() ttwu()
> - * prev_state = prev->state; if (p->on_rq && ...)
> - * if (prev_state) goto out;
> - * p->on_rq = 0; smp_acquire__after_ctrl_dep();
> - * p->state = TASK_WAKING
> - *
> - * Where __schedule() and ttwu() have matching control dependencies.
> - *
> - * After this, schedule() must not care about p->state any more.
> - */
> - deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> -
> - if (prev->in_iowait) {
> - atomic_inc(&rq->nr_iowait);
> - delayacct_blkio_start();
> - }
> - }
> + try_to_deactivate_task(rq, prev, prev_state);
> switch_count = &prev->nvcsw;
> }
>


2023-12-21 13:01:01

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 11/23] sched: Add a initial sketch of the find_proxy_task() function

On 20/12/2023 12:18 am, John Stultz wrote:
> Add a find_proxy_task() function which doesn't do much.
>
> When we select a blocked task to run, we will just deactivate it
> and pick again. The exception being if it has become unblocked
> after find_proxy_task() was called.
>
> Greatly simplified from patch by:
> Peter Zijlstra (Intel) <[email protected]>
> Juri Lelli <[email protected]>
> Valentin Schneider <[email protected]>
> Connor O'Brien <[email protected]>
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> [jstultz: Split out from larger proxy patch and simplified
> for review and testing.]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v5:
> * Split out from larger proxy patch
> v7:
> * Fixed unused function arguments, spelling nits, and tweaks for
> clarity, pointed out by Metin Kaya
> * Moved task_is_blocked() implementation to this patch where it is
> first used. Also drop unused arguments. Suggested by Metin Kaya.
> * Tweaks to make things easier to read, as suggested by Metin Kaya.
> * Rename proxy() to find_proxy_task() for clarity, and typo
> fixes suggested by Metin Kaya
> * Fix build warning Reported-by: kernel test robot <[email protected]>
> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

Super-nit: s/Add a/Add an/ in commit header.

> ---
> kernel/sched/core.c | 87 ++++++++++++++++++++++++++++++++++++++++++--
> kernel/sched/rt.c | 19 +++++++++-
> kernel/sched/sched.h | 15 ++++++++
> 3 files changed, 115 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 34acd80b6bd0..12f5a0618328 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6572,11 +6572,11 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> #endif
>
> static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
> - unsigned long task_state)
> + unsigned long task_state, bool deactivate_cond)
> {
> if (signal_pending_state(task_state, p)) {
> WRITE_ONCE(p->__state, TASK_RUNNING);
> - } else {
> + } else if (deactivate_cond) {
> p->sched_contributes_to_load =
> (task_state & TASK_UNINTERRUPTIBLE) &&
> !(task_state & TASK_NOLOAD) &&
> @@ -6607,6 +6607,73 @@ static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
> return false;
> }
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +
> +static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
> +{
> + unsigned long state = READ_ONCE(next->__state);
> +
> + /* Don't deactivate if the state has been changed to TASK_RUNNING */
> + if (state == TASK_RUNNING)
> + return false;
> + if (!try_to_deactivate_task(rq, next, state, true))
> + return false;
> + put_prev_task(rq, next);
> + rq_set_selected(rq, rq->idle);
> + resched_curr(rq);
> + return true;
> +}
> +
> +/*
> + * Initial simple proxy that just returns the task if it's waking
> + * or deactivates the blocked task so we can pick something that
> + * isn't blocked.
> + */
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> +{
> + struct task_struct *ret = NULL;
> + struct task_struct *p = next;
> + struct mutex *mutex;
> +
> + mutex = p->blocked_on;
> + /* Something changed in the chain, so pick again */
> + if (!mutex)
> + return NULL;
> + /*
> + * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> + * and ensure @owner sticks around.
> + */
> + raw_spin_lock(&mutex->wait_lock);
> + raw_spin_lock(&p->blocked_lock);
> +
> + /* Check again that p is blocked with blocked_lock held */
> + if (!task_is_blocked(p) || mutex != p->blocked_on) {
> + /*
> + * Something changed in the blocked_on chain and
> + * we don't know if only at this level. So, let's
> + * just bail out completely and let __schedule
> + * figure things out (pick_again loop).
> + */
> + goto out;
> + }
> +
> + if (!proxy_deactivate(rq, next))
> + ret = p;
> +out:
> + raw_spin_unlock(&p->blocked_lock);
> + raw_spin_unlock(&mutex->wait_lock);
> + return ret;
> +}
> +#else /* SCHED_PROXY_EXEC */
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> +{
> + BUG(); // This should never be called in the !PROXY case
> + return next;
> +}
> +#endif /* SCHED_PROXY_EXEC */
> +
> /*
> * __schedule() is the main scheduler function.
> *
> @@ -6698,12 +6765,24 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> */
> prev_state = READ_ONCE(prev->__state);
> if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
> - try_to_deactivate_task(rq, prev, prev_state);
> + try_to_deactivate_task(rq, prev, prev_state,
> + !task_is_blocked(prev));
> switch_count = &prev->nvcsw;
> }
>
> - next = pick_next_task(rq, prev, &rf);
> +pick_again:
> + next = pick_next_task(rq, rq_selected(rq), &rf);
> rq_set_selected(rq, next);
> + if (unlikely(task_is_blocked(next))) {
> + next = find_proxy_task(rq, next, &rf);
> + if (!next) {
> + rq_unpin_lock(rq, &rf);
> + __balance_callbacks(rq);
> + rq_repin_lock(rq, &rf);
> + goto pick_again;
> + }
> + }
> +
> clear_tsk_need_resched(prev);
> clear_preempt_need_resched();
> #ifdef CONFIG_SCHED_DEBUG
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 2682cec45aaa..81cd22eaa6dc 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1491,8 +1491,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>
> enqueue_rt_entity(rt_se, flags);
>
> - if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> - enqueue_pushable_task(rq, p);
> + /*
> + * Current can't be pushed away. Selected is tied to current,
> + * so don't push it either.
> + */
> + if (task_current(rq, p) || task_current_selected(rq, p))
> + return;
> + /*
> + * Pinned tasks can't be pushed.
> + */
> + if (p->nr_cpus_allowed == 1)
> + return;
> +
> + enqueue_pushable_task(rq, p);
> }
>
> static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> @@ -1779,6 +1790,10 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>
> update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
>
> + /* Avoid marking selected as pushable */
> + if (task_current_selected(rq, p))
> + return;
> +
> /*
> * The previous task needs to be made eligible for pushing
> * if it is still active
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6ea1dfbe502a..765ba10661de 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2148,6 +2148,21 @@ static inline int task_current_selected(struct rq *rq, struct task_struct *p)
> return rq_selected(rq) == p;
> }
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline bool task_is_blocked(struct task_struct *p)
> +{
> + if (!sched_proxy_exec())
> + return false;
> +
> + return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
> +}
> +#else /* !SCHED_PROXY_EXEC */
> +static inline bool task_is_blocked(struct task_struct *p)
> +{
> + return false;
> +}
> +#endif /* SCHED_PROXY_EXEC */
> +

We can drop the #else part, IMHO, because sched_proxy_exec() already
returns false in the !CONFIG_SCHED_PROXY_EXEC case.
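
For illustration, one way this could look is a single unconditional
definition, a sketch only, assuming the blocked_on / blocked_on_state
fields are available in both configs as they are in the quoted patch:

static inline bool task_is_blocked(struct task_struct *p)
{
        if (!sched_proxy_exec())
                return false;

        return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
}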

> static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
> {
> #ifdef CONFIG_SMP


2023-12-21 15:30:58

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 13/23] sched: Start blocked_on chain processing in find_proxy_task()

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Peter Zijlstra <[email protected]>
>
> Start to flesh out the real find_proxy_task() implementation,
> but avoid the migration cases for now, in those cases just
> deactivate the selected task and pick again.
>
> To ensure the selected task or other blocked tasks in the chain
> aren't migrated away while we're running the proxy, this patch
> also tweaks CFS logic to avoid migrating selected or mutex
> blocked tasks.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: This change was split out from the larger proxy patch]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v5:
> * Split this out from larger proxy patch
> v7:
> * Minor refactoring of core find_proxy_task() function
> * Minor spelling and corrections suggested by Metin Kaya
> * Dropped an added BUG_ON that was frequently tripped
> * Minor commit message tweaks from Metin Kaya
> ---
> kernel/sched/core.c | 154 +++++++++++++++++++++++++++++++++++++-------
> kernel/sched/fair.c | 9 ++-
> 2 files changed, 137 insertions(+), 26 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f6bf3b62194c..42e25bbdfe6b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -94,6 +94,7 @@
> #include "../workqueue_internal.h"
> #include "../../io_uring/io-wq.h"
> #include "../smpboot.h"
> +#include "../locking/mutex.h"
>
> EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
> EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
> @@ -6609,6 +6610,15 @@ static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
>
> #ifdef CONFIG_SCHED_PROXY_EXEC
>
> +static inline struct task_struct *
> +proxy_resched_idle(struct rq *rq, struct task_struct *next)
> +{
> + put_prev_task(rq, next);
> + rq_set_selected(rq, rq->idle);
> + set_tsk_need_resched(rq->idle);
> + return rq->idle;
> +}
> +
> static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
> {
> unsigned long state = READ_ONCE(next->__state);
> @@ -6618,48 +6628,138 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
> return false;
> if (!try_to_deactivate_task(rq, next, state, true))
> return false;
> - put_prev_task(rq, next);
> - rq_set_selected(rq, rq->idle);
> - resched_curr(rq);
> + proxy_resched_idle(rq, next);
> return true;
> }
>
> /*
> - * Initial simple proxy that just returns the task if it's waking
> - * or deactivates the blocked task so we can pick something that
> - * isn't blocked.
> + * Find who @next (currently blocked on a mutex) can proxy for.
> + *
> + * Follow the blocked-on relation:
> + * task->blocked_on -> mutex->owner -> task...
> + *
> + * Lock order:
> + *
> + * p->pi_lock
> + * rq->lock
> + * mutex->wait_lock
> + * p->blocked_lock
> + *
> + * Returns the task that is going to be used as execution context (the one
> + * that is actually going to be put to run on cpu_of(rq)).
> */
> static struct task_struct *
> find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> {
> + struct task_struct *owner = NULL;
> struct task_struct *ret = NULL;
> - struct task_struct *p = next;
> + struct task_struct *p;
> struct mutex *mutex;
> + int this_cpu = cpu_of(rq);
>
> - mutex = p->blocked_on;
> - /* Something changed in the chain, so pick again */
> - if (!mutex)
> - return NULL;
> /*
> - * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> - * and ensure @owner sticks around.
> + * Follow blocked_on chain.
> + *
> + * TODO: deadlock detection
> */
> - raw_spin_lock(&mutex->wait_lock);
> - raw_spin_lock(&p->blocked_lock);
> + for (p = next; task_is_blocked(p); p = owner) {
> + mutex = p->blocked_on;
> + /* Something changed in the chain, so pick again */
> + if (!mutex)
> + return NULL;
>
> - /* Check again that p is blocked with blocked_lock held */
> - if (!task_is_blocked(p) || mutex != p->blocked_on) {
> /*
> - * Something changed in the blocked_on chain and
> - * we don't know if only at this level. So, let's
> - * just bail out completely and let __schedule
> - * figure things out (pick_again loop).
> + * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> + * and ensure @owner sticks around.
> */
> - goto out;
> + raw_spin_lock(&mutex->wait_lock);
> + raw_spin_lock(&p->blocked_lock);
> +
> + /* Check again that p is blocked with blocked_lock held */

Is this comment still valid?

> + if (mutex != p->blocked_on) {
> + /*
> + * Something changed in the blocked_on chain and
> + * we don't know if only at this level. So, let's
> + * just bail out completely and let __schedule
> + * figure things out (pick_again loop).
> + */
> + goto out;
> + }
> +
> + owner = __mutex_owner(mutex);
> + if (!owner) {
> + ret = p;
> + goto out;
> + }
> +
> + if (task_cpu(owner) != this_cpu) {
> + /* XXX Don't handle migrations yet */
> + if (!proxy_deactivate(rq, next))
> + ret = next;
> + goto out;
> + }
> +
> + if (task_on_rq_migrating(owner)) {
> + /*
> + * One of the chain of mutex owners is currently migrating to this
> + * CPU, but has not yet been enqueued because we are holding the
> + * rq lock. As a simple solution, just schedule rq->idle to give
> + * the migration a chance to complete. Much like the migrate_task
> + * case we should end up back in proxy(), this time hopefully with

s/proxy/find_proxy_task/

> + * all relevant tasks already enqueued.
> + */
> + raw_spin_unlock(&p->blocked_lock);
> + raw_spin_unlock(&mutex->wait_lock);
> + return proxy_resched_idle(rq, next);
> + }
> +
> + if (!owner->on_rq) {
> + /* XXX Don't handle blocked owners yet */
> + if (!proxy_deactivate(rq, next))
> + ret = next;
> + goto out;
> + }
> +
> + if (owner == p) {
> + /*
> + * It's possible we interleave with mutex_unlock like:
> + *
> + * lock(&rq->lock);
> + * find_proxy_task()
> + * mutex_unlock()
> + * lock(&wait_lock);
> + * next(owner) = current->blocked_donor;
> + * unlock(&wait_lock);
> + *
> + * wake_up_q();
> + * ...
> + * ttwu_runnable()
> + * __task_rq_lock()
> + * lock(&wait_lock);
> + * owner == p
> + *
> + * Which leaves us to finish the ttwu_runnable() and make it go.
> + *
> + * So schedule rq->idle so that ttwu_runnable can get the rq lock
> + * and mark owner as running.
> + */
> + raw_spin_unlock(&p->blocked_lock);
> + raw_spin_unlock(&mutex->wait_lock);
> + return proxy_resched_idle(rq, next);
> + }
> +
> + /*
> + * OK, now we're absolutely sure @owner is not blocked _and_
> + * on this rq, therefore holding @rq->lock is sufficient to
> + * guarantee its existence, as per ttwu_remote().
> + */
> + raw_spin_unlock(&p->blocked_lock);
> + raw_spin_unlock(&mutex->wait_lock);
> }
>
> - if (!proxy_deactivate(rq, next))
> - ret = p;
> + WARN_ON_ONCE(owner && !owner->on_rq);
> + return owner;
> +
> out:
> raw_spin_unlock(&p->blocked_lock);
> raw_spin_unlock(&mutex->wait_lock);
> @@ -6738,6 +6838,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> struct rq_flags rf;
> struct rq *rq;
> int cpu;
> + bool preserve_need_resched = false;
>
> cpu = smp_processor_id();
> rq = cpu_rq(cpu);
> @@ -6798,9 +6899,12 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> rq_repin_lock(rq, &rf);
> goto pick_again;
> }
> + if (next == rq->idle && prev == rq->idle)
> + preserve_need_resched = true;
> }
>
> - clear_tsk_need_resched(prev);
> + if (!preserve_need_resched)
> + clear_tsk_need_resched(prev);
> clear_preempt_need_resched();
> #ifdef CONFIG_SCHED_DEBUG
> rq->last_seen_need_resched_ns = 0;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 085941db5bf1..954b41e5b7df 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8905,6 +8905,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if (kthread_is_per_cpu(p))
> return 0;
>
> + if (task_is_blocked(p))
> + return 0;

I think "We do not migrate tasks that are: ..."
(kernel/sched/fair.c:8897) comment needs to be updated for this change.
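
For example, the list could gain an extra item along these lines
(items 1-4 are the existing comment text in can_migrate_task(); the
wording of item 5 is only a suggestion):

        /*
         * We do not migrate tasks that are:
         * 1) throttled_lb_pair, or
         * 2) cannot be migrated to this CPU due to cpus_ptr, or
         * 3) running (obviously), or
         * 4) are cache-hot on their current CPU, or
         * 5) blocked on a mutex (when proxy execution is enabled).
         */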

> +
> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
> int cpu;
>
> @@ -8941,7 +8944,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> /* Record that we found at least one task that could run on dst_cpu */
> env->flags &= ~LBF_ALL_PINNED;
>
> - if (task_on_cpu(env->src_rq, p)) {
> + if (task_on_cpu(env->src_rq, p) ||
> + task_current_selected(env->src_rq, p)) {
> schedstat_inc(p->stats.nr_failed_migrations_running);
> return 0;
> }
> @@ -8980,6 +8984,9 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
> {
> lockdep_assert_rq_held(env->src_rq);
>
> + BUG_ON(task_current(env->src_rq, p));
> + BUG_ON(task_current_selected(env->src_rq, p));
> +
> deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
> set_task_cpu(p, env->dst_cpu);
> }


2023-12-21 16:14:58

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 14/23] sched: Handle blocked-waiter migration (and return migration)

On 20/12/2023 12:18 am, John Stultz wrote:
> Add logic to handle migrating a blocked waiter to a remote
> cpu where the lock owner is runnable.
>
> Additionally, as the blocked task may not be able to run
> on the remote cpu, add logic to handle return migration once
> the waiting task is given the mutex.
>
> Because tasks may get migrated to where they cannot run,
> this patch also modifies the scheduling classes to avoid
> sched class migrations on mutex blocked tasks, leaving
> proxy() to do the migrations and return migrations.

s/proxy/find_proxy_task/

>
> This was split out from the larger proxy patch, and
> significantly reworked.
>
> Credits for the original patch go to:
> Peter Zijlstra (Intel) <[email protected]>
> Juri Lelli <[email protected]>
> Valentin Schneider <[email protected]>
> Connor O'Brien <[email protected]>
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v6:
> * Integrated sched_proxy_exec() check in proxy_return_migration()
> * Minor cleanups to diff
> * Unpin the rq before calling __balance_callbacks()
> * Tweak proxy migrate to migrate deeper task in chain, to avoid
> tasks pingponging between rqs
> v7:
> * Fixup for unused function arguments
> * Switch from that_rq -> target_rq, other minor tweaks, and typo
> fixes suggested by Metin Kaya
> * Switch back to doing return migration in the ttwu path, which
> avoids nasty lock juggling and performance issues
> * Fixes for UP builds
> ---
> kernel/sched/core.c | 161 ++++++++++++++++++++++++++++++++++++++--
> kernel/sched/deadline.c | 2 +-
> kernel/sched/fair.c | 4 +-
> kernel/sched/rt.c | 9 ++-
> 4 files changed, 164 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 42e25bbdfe6b..55dc2a3b7e46 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2981,8 +2981,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
> struct set_affinity_pending my_pending = { }, *pending = NULL;
> bool stop_pending, complete = false;
>
> - /* Can the task run on the task's current CPU? If so, we're done */
> - if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
> + /*
> + * Can the task run on the task's current CPU? If so, we're done
> + *
> + * We are also done if the task is selected, boosting a lock-
> + * holding proxy, (and potentially has been migrated outside its
> + * current or previous affinity mask)
> + */
> + if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
> + (task_current_selected(rq, p) && !task_current(rq, p))) {
> struct task_struct *push_task = NULL;
>
> if ((flags & SCA_MIGRATE_ENABLE) &&
> @@ -3778,6 +3785,39 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
> trace_sched_wakeup(p);
> }
>
> +#ifdef CONFIG_SMP
> +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> +{
> + if (!sched_proxy_exec())
> + return false;
> +
> + if (task_current(rq, p))
> + return false;
> +
> + if (p->blocked_on && p->blocked_on_state == BO_WAKING) {
> + raw_spin_lock(&p->blocked_lock);
> + if (!is_cpu_allowed(p, cpu_of(rq))) {
> + if (task_current_selected(rq, p)) {
> + put_prev_task(rq, p);
> + rq_set_selected(rq, rq->idle);
> + }
> + deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> + resched_curr(rq);
> + raw_spin_unlock(&p->blocked_lock);
> + return true;
> + }
> + resched_curr(rq);
> + raw_spin_unlock(&p->blocked_lock);

Do we need to hold blocked_lock while checking allowed CPUs? Would
moving raw_spin_lock(&p->blocked_lock); inside if (!is_cpu_allowed())
block be silly? i.e.,:

if (!is_cpu_allowed(p, cpu_of(rq))) {
raw_spin_lock(&p->blocked_lock);
if (task_current_selected(rq, p)) {
put_prev_task(rq, p);
rq_set_selected(rq, rq->idle);
}
deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
resched_curr(rq);
raw_spin_unlock(&p->blocked_lock);
return true;
}
resched_curr(rq);

> + }
> + return false;
> +}
> +#else /* !CONFIG_SMP */
> +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> +{
> + return false;
> +}
> +#endif /*CONFIG_SMP */

Nit: what about this?

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30dfb6f14f2b..60b542a6faa5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4027,9 +4027,11 @@ static inline void activate_blocked_entities(struct rq *target_rq,
}
#endif /* CONFIG_SCHED_PROXY_EXEC */

-#ifdef CONFIG_SMP
static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
{
+ if (!IS_ENABLED(CONFIG_SMP))
+ return false;
+
if (!sched_proxy_exec())
return false;

@@ -4053,12 +4055,6 @@ static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
}
return false;
}
-#else /* !CONFIG_SMP */
-static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
-{
- return false;
-}
-#endif /*CONFIG_SMP */

static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,

> +
> static void
> ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> struct rq_flags *rf)
> @@ -3870,9 +3910,12 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
> update_rq_clock(rq);
> wakeup_preempt(rq, p, wake_flags);
> }
> + if (proxy_needs_return(rq, p))
> + goto out;
> ttwu_do_wakeup(p);
> ret = 1;
> }
> +out:
> __task_rq_unlock(rq, &rf);
>
> return ret;
> @@ -4231,6 +4274,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> int cpu, success = 0;
>
> if (p == current) {
> + WARN_ON(task_is_blocked(p));
> /*
> * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> * == smp_processor_id()'. Together this means we can special
> @@ -6632,6 +6676,91 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
> return true;
> }
>
> +#ifdef CONFIG_SMP
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static struct task_struct *

proxy_migrate_task() always returns NULL. Is return type really needed?

> +proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p, int target_cpu)
> +{
> + struct rq *target_rq;
> + int wake_cpu;
> +

Having a "if (!IS_ENABLED(CONFIG_SMP))" check here would help in
dropping #else part. i.e.,

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30dfb6f14f2b..685ba6f2d4cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6912,7 +6912,6 @@ proxy_resched_idle(struct rq *rq, struct task_struct *next)
return rq->idle;
}

-#ifdef CONFIG_SMP
/*
* If the blocked-on relationship crosses CPUs, migrate @p to the
* owner's CPU.
@@ -6932,6 +6931,9 @@ proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
struct rq *target_rq;
int wake_cpu;

+ if (!IS_ENABLED(CONFIG_SMP))
+ return NULL;
+
lockdep_assert_rq_held(rq);
target_rq = cpu_rq(target_cpu);

@@ -6988,14 +6990,6 @@ proxy_migrate_task(struct rq *rq, struct rq_flags *rf,

return NULL; /* Retry task selection on _this_ CPU. */
}
-#else /* !CONFIG_SMP */
-static struct task_struct *
-proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
- struct task_struct *p, int target_cpu)
-{
- return NULL;
-}
-#endif /* CONFIG_SMP */

static void proxy_enqueue_on_owner(struct rq *rq, struct task_struct *owner,
struct task_struct *next)

> + lockdep_assert_rq_held(rq);
> + target_rq = cpu_rq(target_cpu);
> +
> + /*
> + * Since we're going to drop @rq, we have to put(@rq_selected) first,
> + * otherwise we have a reference that no longer belongs to us. Use
> + * @rq->idle to fill the void and make the next pick_next_task()
> + * invocation happy.
> + *
> + * CPU0 CPU1
> + *
> + * B mutex_lock(X)
> + *
> + * A mutex_lock(X) <- B
> + * A __schedule()
> + * A pick->A
> + * A proxy->B
> + * A migrate A to CPU1
> + * B mutex_unlock(X) -> A
> + * B __schedule()
> + * B pick->A
> + * B switch_to (A)
> + * A ... does stuff
> + * A ... is still running here
> + *
> + * * BOOM *
> + */
> + put_prev_task(rq, rq_selected(rq));
> + rq_set_selected(rq, rq->idle);
> + set_next_task(rq, rq_selected(rq));
> + WARN_ON(p == rq->curr);
> +
> + wake_cpu = p->wake_cpu;
> + deactivate_task(rq, p, 0);
> + set_task_cpu(p, target_cpu);
> + /*
> + * Preserve p->wake_cpu, such that we can tell where it
> + * used to run later.
> + */
> + p->wake_cpu = wake_cpu;
> +
> + rq_unpin_lock(rq, rf);
> + __balance_callbacks(rq);
> +
> + raw_spin_rq_unlock(rq);
> + raw_spin_rq_lock(target_rq);
> +
> + activate_task(target_rq, p, 0);
> + wakeup_preempt(target_rq, p, 0);
> +
> + raw_spin_rq_unlock(target_rq);
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> +
> + return NULL; /* Retry task selection on _this_ CPU. */
> +}
> +#else /* !CONFIG_SMP */
> +static struct task_struct *
> +proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p, int target_cpu)
> +{
> + return NULL;
> +}
> +#endif /* CONFIG_SMP */
> +
> /*
> * Find who @next (currently blocked on a mutex) can proxy for.
> *
> @@ -6654,8 +6783,11 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> struct task_struct *owner = NULL;
> struct task_struct *ret = NULL;
> struct task_struct *p;
> + int cur_cpu, target_cpu;
> struct mutex *mutex;
> - int this_cpu = cpu_of(rq);
> + bool curr_in_chain = false;
> +
> + cur_cpu = cpu_of(rq);
>
> /*
> * Follow blocked_on chain.
> @@ -6686,17 +6818,27 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> goto out;
> }
>
> + if (task_current(rq, p))
> + curr_in_chain = true;
> +
> owner = __mutex_owner(mutex);
> if (!owner) {
> ret = p;
> goto out;
> }
>
> - if (task_cpu(owner) != this_cpu) {
> - /* XXX Don't handle migrations yet */
> - if (!proxy_deactivate(rq, next))
> - ret = next;
> - goto out;
> + if (task_cpu(owner) != cur_cpu) {
> + target_cpu = task_cpu(owner);
> + /*
> + * @owner can disappear, simply migrate to @target_cpu and leave that CPU
> + * to sort things out.
> + */
> + raw_spin_unlock(&p->blocked_lock);
> + raw_spin_unlock(&mutex->wait_lock);
> + if (curr_in_chain)
> + return proxy_resched_idle(rq, next);
> +
> + return proxy_migrate_task(rq, rf, p, target_cpu);
> }
>
> if (task_on_rq_migrating(owner)) {
> @@ -6999,6 +7141,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
> */
> SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
>
> + if (task_is_blocked(tsk))
> + return;
> +
> /*
> * If we are going to sleep and we have plugged IO queued,
> * make sure to submit it to avoid deadlocks.
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 9cf20f4ac5f9..4f998549ea74 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1705,7 +1705,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>
> enqueue_dl_entity(&p->dl, flags);
>
> - if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> + if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p))
> enqueue_pushable_dl_task(rq, p);
> }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 954b41e5b7df..8e3f118f6d6e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8372,7 +8372,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> goto idle;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> - if (!prev || prev->sched_class != &fair_sched_class)
> + if (!prev ||
> + prev->sched_class != &fair_sched_class ||
> + rq->curr != rq_selected(rq))
> goto simple;
>
> /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 81cd22eaa6dc..a7b51a021111 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1503,6 +1503,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> if (p->nr_cpus_allowed == 1)
> return;
>
> + if (task_is_blocked(p))
> + return;
> +
> enqueue_pushable_task(rq, p);
> }
>
> @@ -1790,10 +1793,12 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>
> update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
>
> - /* Avoid marking selected as pushable */
> - if (task_current_selected(rq, p))
> + /* Avoid marking current or selected as pushable */
> + if (task_current(rq, p) || task_current_selected(rq, p))
> return;
>
> + if (task_is_blocked(p))
> + return;
> /*
> * The previous task needs to be made eligible for pushing
> * if it is still active


2023-12-21 17:06:11

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 06/23] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable

On Tue, Dec 19, 2023 at 5:04 PM Randy Dunlap <[email protected]> wrote:
> On 12/19/23 16:18, John Stultz wrote:
> > Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
> > sched_prox_exec= that can be used to disable the feature at boot
>
> sched_proxy_exec=
>

Ah, thank you for your careful review! I applied all your suggestions!
thanks
-john

2023-12-21 17:14:17

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 00/23] Proxy Execution: A generalized form of Priority Inheritance v7

On Thu, Dec 21, 2023 at 12:35 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> >
> > As Connor outlined in a previous submission of this patch series,
>
> Nit: Better to have a reference to Connor's patch series (i.e.,
> https://lore.kernel.org/lkml/[email protected]/)
> here?

Yes, thank you for providing the link!


> > * As discussed at OSPM[5], I like to split pick_next_task() up
> > into two phases selecting and setting the next tasks, as
> > currently pick_next_task() assumes the returned task will be
> > run which results in various side-effects in sched class logic
> > when it’s run. I tried to take a pass at this earlier, but
> > it’s hairy and lower on the priority list for now.
>
> Do you think we should mention the virtual runqueue idea and adding trace
> points to measure task migration times? They are not "open issues", but
> more like to-do items on the agenda.
>

I appreciate you bringing those up. The virtual runqueue idea is still
a bit handwavy, but the trace points are a good item for the TODO.
Apologies for missing it, as you suggested it just the other day as I
was prepping these patches, and I didn't go back to add it here in the
cover letter.

thanks
-john

2023-12-21 17:53:19

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 05/23] locking/mutex: Rework task_struct::blocked_on

On Thu, Dec 21, 2023 at 2:13 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > +static inline struct mutex *get_task_blocked_on(struct task_struct *p)
> > +{
> > + lockdep_assert_held(&p->blocked_lock);
> > +
> > + return p->blocked_on;
> > +}
> > +
> > +static inline struct mutex *get_task_blocked_on_once(struct task_struct *p)
> > +{
> > + return READ_ONCE(p->blocked_on);
> > +}
>
> These functions make me think we should use [get, set]_task_blocked_on()
> for accessing the blocked_on & blocked_on_state fields, but there are some
> places in this patch where we directly access the aforementioned fields.
> Is this OK?

Yeah. In the reworks I've probably added some subtle uses that should
be switched to the accessors or better commented.
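
For illustration, a setter in the same style as the quoted getters might
look roughly like this (the name and the exact locking rule are
assumptions for the sketch, not taken from the posted series):

static inline void set_task_blocked_on_state(struct task_struct *p,
                                             enum blocked_on_state state)
{
        /* Hypothetical accessor: callers would hold p->blocked_lock */
        lockdep_assert_held(&p->blocked_lock);
        p->blocked_on_state = state;
}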


> @@ -4341,6 +4342,11 @@ int try_to_wake_up(struct task_struct *p,
unsigned int state, int wake_flags)
> >
> > ttwu_queue(p, cpu, wake_flags);
> > }
> > + /* XXX can we do something better here for !CONFIG_SCHED_PROXY_EXEC case */
>
> blocked_on* fields are now used even in !CONFIG_SCHED_PROXY_EXEC case.
> I'm unsure if we can get rid of lock & unlock lines or entire hunk, but
> would this be too ugly? I wish we could convert blocked_on_state to an
> atomic variable.

Well, atomics have their own costs, but it's something I'll think
about. In the comment above, the idea I'm pondering is that in the
!PROXY_EXEC case the blocked_on_state doesn't provide much utility, so
maybe there's a way to opt out of that portion (while keeping the
blocked_on for debug checking). Even in the PROXY_EXEC case, we might
be able to move this check to proxy_needs_return(), but I need to
think the logic out to make sure we'd always hit that path when we
need to make the transition.

I've also wondered if the blocked_on_state might be able to be merged
into the task->__state, but the rules there are more subtle so for my
sanity I've kept it separate here for now.

thanks
-john

2023-12-21 18:23:47

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 08/23] sched: Split scheduler and execution contexts

On Thu, Dec 21, 2023 at 2:44 AM Metin Kaya <[email protected]> wrote:
>
> On 20/12/2023 12:18 am, John Stultz wrote:
> > From: Peter Zijlstra <[email protected]>
> >
> > Let's define the scheduling context as all the scheduler state
> > in task_struct for the task selected to run, and the execution
> > context as all state required to actually run the task.
> >
> > Currently both are intertwined in task_struct. We want to
> > logically split these such that we can use the scheduling
> > context of the task selected to be scheduled, but use the
> > execution context of a different task to actually be run.
>
> Should we update Documentation/kernel-hacking/hacking.rst (line #348:
> :c:macro:`current`) or another appropriate doc to announce separation of
> scheduling & execution contexts?

So I like this suggestion, but the hacking.rst file feels a little too
general to be getting into the subtleties of scheduler internals.
The splitting of the scheduler context and the execution context
really is just a scheduler detail, as everything else will still deal
just with the execution context as before. So it's really only for
scheduler accounting that we utilize the "rq_selected" scheduler
context.

Maybe something under Documentation/scheduler/ would be more
appropriate? Though the documents there are all pretty focused on
particular sched classes, and not much on the core logic that is most
affected by this conceptual change. I guess maybe adding
sched-core.txt document might be useful to have for this sort of
detail (though a bit daunting to write from scratch).

thanks
-john

2023-12-21 18:50:07

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 10/23] sched: Split out __sched() deactivate task logic into a helper

On Thu, Dec 21, 2023 at 4:30 AM Metin Kaya <[email protected]> wrote:
>
> On 20/12/2023 12:18 am, John Stultz wrote:
> > As we're going to re-use the deactivation logic,
> > split it into a helper.
> >
>
> * Better to have a comment (e.g., in which conditions
> try_to_deactivate_task() returns true or false) here.

Ah, good point. I've added a comment to address this.

> * try_to_deactivate_task() is temporarily used by 2 commits in the patch
> set (i.e., it's only called by __schedule() just like in this patch at
> the end of the series). However, it's nice to make that big
> __schedule() function a bit modular as we discussed off-list. So,
> should we move this function out of the Proxy Execution patch set to get
> it merged independently?

Yeah. I add and later remove the proxy_deactivate() function as it
goes unused, but I am thinking about keeping it to handle the case
where the blocked_on chains are too long and we're spending too
much time re-running find_proxy_task(), in which case we should
probably just deactivate the selected task and move on. But for now,
you're right.

The suggestion to take just the initial step of splitting the logic out
(probably making it a void function, since the remaining usage in
__schedule() doesn't care about the result) is a good one, so I'll
rework it this way and send it separately (along with the other early
patches Qais suggested).
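
For illustration, that split-out helper might end up looking roughly like
the sketch below (the void form and the comment wording are assumptions
based on the discussion; the body is the deactivation logic from the
quoted patch):

/*
 * Deactivate a task that is giving up the CPU in __schedule():
 * if a signal is pending, leave it TASK_RUNNING; otherwise account
 * it as contributing to load (when uninterruptible) and dequeue it.
 */
static void try_to_deactivate_task(struct rq *rq, struct task_struct *p,
                                   unsigned long task_state)
{
        if (signal_pending_state(task_state, p)) {
                WRITE_ONCE(p->__state, TASK_RUNNING);
                return;
        }

        p->sched_contributes_to_load =
                (task_state & TASK_UNINTERRUPTIBLE) &&
                !(task_state & TASK_NOLOAD) &&
                !(task_state & TASK_FROZEN);

        if (p->sched_contributes_to_load)
                rq->nr_uninterruptible++;

        deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

        if (p->in_iowait) {
                atomic_inc(&rq->nr_iowait);
                delayacct_blkio_start();
        }
}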

thanks
-john

2023-12-21 19:16:36

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 11/23] sched: Add a initial sketch of the find_proxy_task() function

On Thu, Dec 21, 2023 at 4:55 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> >
> Super-nit: s/Add a/Add an/ in commit header.

Thanks for catching that!

> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline bool task_is_blocked(struct task_struct *p)
> > +{
> > + if (!sched_proxy_exec())
> > + return false;
> > +
> > + return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
> > +}
> > +#else /* !SCHED_PROXY_EXEC */
> > +static inline bool task_is_blocked(struct task_struct *p)
> > +{
> > + return false;
> > +}
> > +#endif /* SCHED_PROXY_EXEC */
> > +
>
> We can drop the #else part, IMHO, because sched_proxy_exec() already
> returns false in the !CONFIG_SCHED_PROXY_EXEC case.

Oh, good point. That is a nice simplification.

thanks
-john

2023-12-21 19:46:34

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 14/23] sched: Handle blocked-waiter migration (and return migration)

On Thu, Dec 21, 2023 at 8:13 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > Because tasks may get migrated to where they cannot run,
> > this patch also modifies the scheduling classes to avoid
> > sched class migrations on mutex blocked tasks, leaving
> > proxy() to do the migrations and return migrations.
>
> s/proxy/find_proxy_task/

Thanks, fixed.

> > + if (p->blocked_on && p->blocked_on_state == BO_WAKING) {
> > + raw_spin_lock(&p->blocked_lock);
> > + if (!is_cpu_allowed(p, cpu_of(rq))) {
> > + if (task_current_selected(rq, p)) {
> > + put_prev_task(rq, p);
> > + rq_set_selected(rq, rq->idle);
> > + }
> > + deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> > + resched_curr(rq);
> > + raw_spin_unlock(&p->blocked_lock);
> > + return true;
> > + }
> > + resched_curr(rq);
> > + raw_spin_unlock(&p->blocked_lock);
>
> Do we need to hold blocked_lock while checking allowed CPUs? Would
> moving raw_spin_lock(&p->blocked_lock); inside if (!is_cpu_allowed())
> block be silly? i.e.,:

That's an interesting idea. I'll take a shot at reworking it. Thanks!

> Nit: what about this
> -#ifdef CONFIG_SMP
> static inline bool proxy_needs_return(struct rq *rq, struct
> task_struct *p)
> {
> + if (!IS_ENABLED(CONFIG_SMP))
> + return false;
> +

It would be nice, but the trouble is is_cpu_allowed() isn't defined
for !CONFIG_SMP, so that won't build.


> > + * Note: The owner can disappear, but simply migrate to @target_cpu
> > + * and leave that CPU to sort things out.
> > + */
> > +static struct task_struct *
>
> proxy_migrate_task() always returns NULL. Is return type really needed?

Good point. Reworked to clean that up.
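
For illustration, the cleanup might be as simple as dropping the return
type (an assumption about the direction, not the actual reworked code):

static void
proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
                   struct task_struct *p, int target_cpu);

with the call site in find_proxy_task() becoming:

        proxy_migrate_task(rq, rf, p, target_cpu);
        return NULL; /* Retry task selection on _this_ CPU. */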

> > +proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p, int target_cpu)
> > +{
> > + struct rq *target_rq;
> > + int wake_cpu;
> > +
>
> Having a "if (!IS_ENABLED(CONFIG_SMP))" check here would help in
> dropping #else part. i.e.,

Sadly same problem as before, as wake_cpu isn't defined when !CONFIG_SMP :(

thanks again for the detailed review!
-john

2023-12-21 21:02:34

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 12/23] sched: Fix proxy/current (push,pull)ability

On Thu, Dec 21, 2023 at 7:03 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
> > +{
> > + if (sched_proxy_exec()) {
>
> Should we immediately return in !sched_proxy_exec() case to save one
> level of indentation?

Sure.
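
For reference, the early-return form would look roughly like this (the
body is taken from the hunk quoted below, with the comma nit also
applied):

static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
{
        if (!sched_proxy_exec())
                return;

        /*
         * pick_next_task() calls set_next_task() on the selected task
         * at some point, which ensures it is not push/pullable.
         * However, the selected task *and* the mutex owner form an
         * atomic pair wrt push/pull.
         *
         * Make sure owner is not pushable. Unfortunately we can only
         * deal with that by means of a dequeue/enqueue cycle. :-/
         */
        dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
        enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
}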

> > + /*
> > + * pick_next_task() calls set_next_task() on the selected task
> > + * at some point, which ensures it is not push/pullable.
> > + * However, the selected task *and* the ,mutex owner form an
>
> Super-nit: , before mutex should be dropped.
>
> > + * atomic pair wrt push/pull.
> > + *
> > + * Make sure owner is not pushable. Unfortunately we can only
> > + * deal with that by means of a dequeue/enqueue cycle. :-/
> > + */
> > + dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> > + enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> > + }
> > +}
> > +
> > /*
> > * __schedule() is the main scheduler function.
> > *
> > @@ -6796,6 +6813,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> > * changes to task_struct made by pick_next_task().
> > */
> > RCU_INIT_POINTER(rq->curr, next);
> > +
> > + if (!task_current_selected(rq, next))
> > + proxy_tag_curr(rq, next);
> > +
> > /*
> > * The membarrier system call requires each architecture
> > * to have a full memory barrier after updating
> > @@ -6820,6 +6841,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> > /* Also unlocks the rq: */
> > rq = context_switch(rq, prev, next, &rf);
> > } else {
> > + /* In case next was already curr but just got blocked_donor*/
>
> Super-nit: please keep a space before */.

Fixed up.

Thanks for continuing to provide so much detailed feedback!
-john

2023-12-22 08:33:58

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 16/23] sched: Add deactivated (sleeping) owner handling to find_proxy_task()

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Peter Zijlstra <[email protected]>
>
> If the blocked_on chain resolves to a sleeping owner, deactivate
> selected task, and enqueue it on the sleeping owner task.
> Then re-activate it later when the owner is woken up.
>
> NOTE: This has been particularly challenging to get working
> properly, and some of the locking is particularly ackward. I'd

awkward?

> very much appreciate review and feedback for ways to simplify
> this.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: This was broken out from the larger proxy() patch]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v5:
> * Split out from larger proxy patch
> v6:
> * Major rework, replacing the single list head per task with
> per-task list head and nodes, creating a tree structure so
> we only wake up decendents of the task woken.

descendants

> * Reworked the locking to take the task->pi_lock, so we can
> avoid mid-chain wakeup races from try_to_wake_up() called by
> the ww_mutex logic.
> v7:
> * Drop ununessary __nested lock annotation, as we already drop

unnecessary

> the lock prior.
> * Add comments on #else & #endif lines, and clearer function
> names, and commit message tweaks as suggested by Metin Kaya
> * Move activate_blocked_entities() call from ttwu_queue to
> try_to_wake_up() to simplify locking. Thanks to questions from
> Metin Kaya
> * Fix irqsave/irqrestore usage now we call this outside where
> the pi_lock is held
> * Fix activate_blocked_entitites not preserving wake_cpu
> * Fix for UP builds
> ---
> include/linux/sched.h | 3 +
> kernel/fork.c | 4 +
> kernel/sched/core.c | 214 ++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 202 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8020e224e057..6f982948a105 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1158,6 +1158,9 @@ struct task_struct {
> enum blocked_on_state blocked_on_state;
> struct mutex *blocked_on; /* lock we're blocked on */
> struct task_struct *blocked_donor; /* task that is boosting this task */
> + struct list_head blocked_head; /* tasks blocked on this task */
> + struct list_head blocked_node; /* our entry on someone elses blocked_head */
> + struct task_struct *sleeping_owner; /* task our blocked_node is enqueued on */
> raw_spinlock_t blocked_lock;

As we discussed off-list, there are now 7 fields introduced in
task_struct because of Proxy Execution support. I recommended moving
them into a container struct to keep task_struct from getting slightly
polluted (similarly, the initialization of these fields in copy_process()
could be moved into a separate helper), but you mentioned fair points
about different locks protecting different fields.

I wonder what other folks think about this minor concern.
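
For illustration, such a container might look roughly like the sketch
below (the struct name and grouping are hypothetical, purely to show the
idea; as noted above, the fields being protected by different locks may
argue against it):

/* Hypothetical grouping of the proxy-execution fields in task_struct */
struct proxy_exec_state {
        enum blocked_on_state    blocked_on_state;
        struct mutex            *blocked_on;     /* lock we're blocked on */
        struct task_struct      *blocked_donor;  /* task that is boosting us */
        struct list_head         blocked_head;   /* tasks blocked on us */
        struct list_head         blocked_node;   /* our entry on someone else's blocked_head */
        struct task_struct      *sleeping_owner; /* task our blocked_node is enqueued on */
        raw_spinlock_t           blocked_lock;
};

with copy_process() then calling a small init helper for the whole struct.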

>
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 138fc23cad43..56f5e19c268e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2460,6 +2460,10 @@ __latent_entropy struct task_struct *copy_process(
> p->blocked_on_state = BO_RUNNABLE;
> p->blocked_on = NULL; /* not blocked yet */
> p->blocked_donor = NULL; /* nobody is boosting p yet */
> +
> + INIT_LIST_HEAD(&p->blocked_head);
> + INIT_LIST_HEAD(&p->blocked_node);
> + p->sleeping_owner = NULL;
> #ifdef CONFIG_BCACHE
> p->sequential_io = 0;
> p->sequential_io_avg = 0;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e0afa228bc9d..0cd63bd0bdcd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3785,6 +3785,133 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
> trace_sched_wakeup(p);
> }
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static void do_activate_task(struct rq *rq, struct task_struct *p, int en_flags)
> +{
> + lockdep_assert_rq_held(rq);
> +
> + if (!sched_proxy_exec()) {
> + activate_task(rq, p, en_flags);
> + return;
> + }
> +
> + if (p->sleeping_owner) {
> + struct task_struct *owner = p->sleeping_owner;
> +
> + raw_spin_lock(&owner->blocked_lock);
> + list_del_init(&p->blocked_node);
> + p->sleeping_owner = NULL;
> + raw_spin_unlock(&owner->blocked_lock);
> + }
> +
> + /*
> + * By calling activate_task with blocked_lock held, we
> + * order against the find_proxy_task() blocked_task case
> + * such that no more blocked tasks will be enqueued on p
> + * once we release p->blocked_lock.
> + */
> + raw_spin_lock(&p->blocked_lock);
> + WARN_ON(task_cpu(p) != cpu_of(rq));
> + activate_task(rq, p, en_flags);
> + raw_spin_unlock(&p->blocked_lock);
> +}
> +
> +#ifdef CONFIG_SMP
> +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned int wake_cpu;
> +
> + /* Preserve wake_cpu */
> + wake_cpu = p->wake_cpu;
> + __set_task_cpu(p, cpu);
> + p->wake_cpu = wake_cpu;
> +}
> +#else /* !CONFIG_SMP */
> +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> +{
> + __set_task_cpu(p, cpu);
> +}
> +#endif /* CONFIG_SMP */
> +
> +static void activate_blocked_entities(struct rq *target_rq,
> + struct task_struct *owner,
> + int wake_flags)
> +{
> + unsigned long flags;
> + struct rq_flags rf;
> + int target_cpu = cpu_of(target_rq);
> + int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
> +
> + if (wake_flags & WF_MIGRATED)
> + en_flags |= ENQUEUE_MIGRATED;
> + /*
> + * A whole bunch of 'proxy' tasks back this blocked task, wake
> + * them all up to give this task its 'fair' share.
> + */
> + raw_spin_lock_irqsave(&owner->blocked_lock, flags);
> + while (!list_empty(&owner->blocked_head)) {
> + struct task_struct *pp;
> + unsigned int state;
> +
> + pp = list_first_entry(&owner->blocked_head,
> + struct task_struct,
> + blocked_node);
> + BUG_ON(pp == owner);
> + list_del_init(&pp->blocked_node);
> + WARN_ON(!pp->sleeping_owner);
> + pp->sleeping_owner = NULL;
> + raw_spin_unlock_irqrestore(&owner->blocked_lock, flags);
> +
> + raw_spin_lock_irqsave(&pp->pi_lock, flags);
> + state = READ_ONCE(pp->__state);
> + /* Avoid racing with ttwu */
> + if (state == TASK_WAKING) {
> + raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
> + raw_spin_lock_irqsave(&owner->blocked_lock, flags);
> + continue;
> + }
> + if (READ_ONCE(pp->on_rq)) {
> + /*
> + * We raced with a non mutex handoff activation of pp.
> + * That activation will also take care of activating
> + * all of the tasks after pp in the blocked_entry list,
> + * so we're done here.
> + */
> + raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
> + raw_spin_lock_irqsave(&owner->blocked_lock, flags);
> + continue;
> + }
> +
> + proxy_set_task_cpu(pp, target_cpu);
> +
> + rq_lock_irqsave(target_rq, &rf);
> + update_rq_clock(target_rq);
> + do_activate_task(target_rq, pp, en_flags);
> + resched_curr(target_rq);
> + rq_unlock_irqrestore(target_rq, &rf);
> + raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
> +
> + /* recurse - XXX This needs to be reworked to avoid recursing */
> + activate_blocked_entities(target_rq, pp, wake_flags);
> +
> + raw_spin_lock_irqsave(&owner->blocked_lock, flags);
> + }
> + raw_spin_unlock_irqrestore(&owner->blocked_lock, flags);
> +}
> +#else /* !CONFIG_SCHED_PROXY_EXEC */
> +static inline void do_activate_task(struct rq *rq, struct task_struct *p,
> + int en_flags)
> +{
> + activate_task(rq, p, en_flags);
> +}
> +
> +static inline void activate_blocked_entities(struct rq *target_rq,
> + struct task_struct *owner,
> + int wake_flags)
> +{
> +}
> +#endif /* CONFIG_SCHED_PROXY_EXEC */
> +
> #ifdef CONFIG_SMP
> static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> {
> @@ -3839,7 +3966,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> atomic_dec(&task_rq(p)->nr_iowait);
> }
>
> - activate_task(rq, p, en_flags);
> + do_activate_task(rq, p, en_flags);
> wakeup_preempt(rq, p, wake_flags);
>
> ttwu_do_wakeup(p);
> @@ -3936,13 +4063,19 @@ void sched_ttwu_pending(void *arg)
> update_rq_clock(rq);
>
> llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> + int wake_flags;
> if (WARN_ON_ONCE(p->on_cpu))
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
> set_task_cpu(p, cpu_of(rq));
>
> - ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> + wake_flags = p->sched_remote_wakeup ? WF_MIGRATED : 0;
> + ttwu_do_activate(rq, p, wake_flags, &rf);
> + rq_unlock(rq, &rf);
> + activate_blocked_entities(rq, p, wake_flags);

I'm unsure if it's a big deal, but IRQs are disabled here and
activate_blocked_entities() disables them again.

> + rq_lock(rq, &rf);
> + update_rq_clock(rq);
> }
>
> /*
> @@ -4423,6 +4556,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> if (p->blocked_on_state == BO_WAKING)
> p->blocked_on_state = BO_RUNNABLE;
> raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
> + activate_blocked_entities(cpu_rq(cpu), p, wake_flags);
> out:
> if (success)
> ttwu_stat(p, task_cpu(p), wake_flags);
> @@ -6663,19 +6797,6 @@ proxy_resched_idle(struct rq *rq, struct task_struct *next)
> return rq->idle;
> }
>
> -static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
> -{
> - unsigned long state = READ_ONCE(next->__state);
> -
> - /* Don't deactivate if the state has been changed to TASK_RUNNING */
> - if (state == TASK_RUNNING)
> - return false;
> - if (!try_to_deactivate_task(rq, next, state, true))

Now we can drop the last argument (deactivate_cond) of
try_to_deactivate_task() since it can be determined via
!task_is_blocked(p) I think. IOW:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30dfb6f14f2b..5b3b4ebca65d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6866,11 +6866,11 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
#endif

static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
- unsigned long task_state, bool deactivate_cond)
+ unsigned long task_state)
{
if (signal_pending_state(task_state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
- } else if (deactivate_cond) {
+ } else if (!task_is_blocked(p)) {
p->sched_contributes_to_load =
(task_state & TASK_UNINTERRUPTIBLE) &&
!(task_state & TASK_NOLOAD) &&
@@ -7329,8 +7329,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
- try_to_deactivate_task(rq, prev, prev_state,
- !task_is_blocked(prev));
+ try_to_deactivate_task(rq, prev, prev_state);
switch_count = &prev->nvcsw;
}


> - return false;
> - proxy_resched_idle(rq, next);
> - return true;
> -}
> -
> #ifdef CONFIG_SMP
> /*
> * If the blocked-on relationship crosses CPUs, migrate @p to the
> @@ -6761,6 +6882,31 @@ proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> }
> #endif /* CONFIG_SMP */
>
> +static void proxy_enqueue_on_owner(struct rq *rq, struct task_struct *owner,
> + struct task_struct *next)
> +{
> + /*
> + * ttwu_activate() will pick them up and place them on whatever rq
> + * @owner will run next.
> + */
> + if (!owner->on_rq) {
> + BUG_ON(!next->on_rq);
> + deactivate_task(rq, next, DEQUEUE_SLEEP);
> + if (task_current_selected(rq, next)) {
> + put_prev_task(rq, next);
> + rq_set_selected(rq, rq->idle);
> + }
> + /*
> + * ttwu_do_activate must not have a chance to activate p
> + * elsewhere before it's fully extricated from its old rq.
> + */
> + WARN_ON(next->sleeping_owner);
> + next->sleeping_owner = owner;
> + smp_mb();
> + list_add(&next->blocked_node, &owner->blocked_head);
> + }
> +}
> +
> /*
> * Find who @next (currently blocked on a mutex) can proxy for.
> *
> @@ -6866,10 +7012,40 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> }
>
> if (!owner->on_rq) {
> - /* XXX Don't handle blocked owners yet */
> - if (!proxy_deactivate(rq, next))
> - ret = next;
> - goto out;
> + /*
> + * rq->curr must not be added to the blocked_head list or else
> + * ttwu_do_activate could enqueue it elsewhere before it switches
> + * out here. The approach to avoid this is the same as in the
> + * migrate_task case.
> + */
> + if (curr_in_chain) {
> + raw_spin_unlock(&p->blocked_lock);
> + raw_spin_unlock(&mutex->wait_lock);
> + return proxy_resched_idle(rq, next);
> + }
> +
> + /*
> + * If !@owner->on_rq, holding @rq->lock will not pin the task,
> + * so we cannot drop @mutex->wait_lock until we're sure its a blocked
> + * task on this rq.
> + *
> + * We use @owner->blocked_lock to serialize against ttwu_activate().
> + * Either we see its new owner->on_rq or it will see our list_add().
> + */
> + if (owner != p) {
> + raw_spin_unlock(&p->blocked_lock);
> + raw_spin_lock(&owner->blocked_lock);
> + }
> +
> + proxy_enqueue_on_owner(rq, owner, next);
> +
> + if (task_current_selected(rq, next)) {
> + put_prev_task(rq, next);
> + rq_set_selected(rq, rq->idle);
> + }
> + raw_spin_unlock(&owner->blocked_lock);
> + raw_spin_unlock(&mutex->wait_lock);
> + return NULL; /* retry task selection */
> }
>
> if (owner == p) {


2023-12-22 09:32:33

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 17/23] sched: Initial sched_football test implementation

On 20/12/2023 12:18 am, John Stultz wrote:
> Reimplementation of the sched_football test from LTP:
> https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
>
> But reworked to run in the kernel and utilize mutexes
> to illustrate proper boosting of low priority mutex
> holders.
>
> TODO:
> * Need a rt_mutex version so it can work w/o proxy-execution
> * Need a better place to put it

I think this patch can also be upstreamed independently of the other
Proxy Execution patches, right?

>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched/Makefile | 1 +
> kernel/sched/test_sched_football.c | 242 +++++++++++++++++++++++++++++
> lib/Kconfig.debug | 14 ++
> 3 files changed, 257 insertions(+)
> create mode 100644 kernel/sched/test_sched_football.c
>
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 976092b7bd45..2729d565dfd7 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -32,3 +32,4 @@ obj-y += core.o
> obj-y += fair.o
> obj-y += build_policy.o
> obj-y += build_utility.o
> +obj-$(CONFIG_SCHED_RT_INVARIENT_TEST) += test_sched_football.o
> diff --git a/kernel/sched/test_sched_football.c b/kernel/sched/test_sched_football.c
> new file mode 100644
> index 000000000000..9742c45c0fe0
> --- /dev/null
> +++ b/kernel/sched/test_sched_football.c
> @@ -0,0 +1,242 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Module-based test case for RT scheduling invariant
> + *
> + * A reimplementation of my old sched_football test
> + * found in LTP:
> + * https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
> + *
> + * Similar to that test, this tries to validate the RT
> + * scheduling invariant, that the across N available cpus, the
> + * top N priority tasks always running.
> + *
> + * This is done via having N offsensive players that are

offensive

> + * medium priority, which constantly are trying to increment the
> + * ball_pos counter.
> + *
> + * Blocking this, are N defensive players that are higher
> + * priority which just spin on the cpu, preventing the medium
> + * priroity tasks from running.

priority

> + *
> + * To complicate this, there are also N defensive low priority
> + * tasks. These start first and each aquire one of N mutexes.
> + * The high priority defense tasks will later try to grab the
> + * mutexes and block, opening a window for the offsensive tasks
> + * to run and increment the ball. If priority inheritance or
> + * proxy execution is used, the low priority defense players
> + * should be boosted to the high priority levels, and will
> + * prevent the mid priority offensive tasks from running.
> + *
> + * Copyright © International Business Machines Corp., 2007, 2008
> + * Copyright (C) Google, 2023
> + *
> + * Authors: John Stultz <[email protected]>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/delay.h>
> +#include <linux/sched/rt.h>
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/rwsem.h>
> +#include <linux/smp.h>
> +#include <linux/slab.h>
> +#include <linux/interrupt.h>
> +#include <linux/sched.h>
> +#include <uapi/linux/sched/types.h>
> +#include <linux/rtmutex.h>
> +
> +atomic_t players_ready;
> +atomic_t ball_pos;
> +int players_per_team;

Nit: The number of players cannot be lower than 0. Should it be unsigned then?

> +bool game_over;
> +
> +struct mutex *mutex_low_list;
> +struct mutex *mutex_mid_list;
> +
> +static inline
> +struct task_struct *create_fifo_thread(int (*threadfn)(void *data), void *data,
> + char *name, int prio)
> +{
> + struct task_struct *kth;
> + struct sched_attr attr = {
> + .size = sizeof(struct sched_attr),
> + .sched_policy = SCHED_FIFO,
> + .sched_nice = 0,
> + .sched_priority = prio,
> + };
> + int ret;
> +
> + kth = kthread_create(threadfn, data, name);
> + if (IS_ERR(kth)) {
> + pr_warn("%s eerr, kthread_create failed\n", __func__);

Extra `e` in `eerr`?

> + return kth;
> + }
> + ret = sched_setattr_nocheck(kth, &attr);
> + if (ret) {
> + kthread_stop(kth);
> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> + return ERR_PTR(ret);
> + }
> +
> + wake_up_process(kth);
> + return kth;

I think the return value of this function is actually unused at the call
sites. So, create_fifo_thread()'s return type could be void?
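
i.e., a rough (untested) sketch, assuming nothing will ever need the returned
task pointer:

        static inline
        void create_fifo_thread(int (*threadfn)(void *data), void *data,
                                char *name, int prio)
        {
                struct task_struct *kth;
                struct sched_attr attr = {
                        .size           = sizeof(struct sched_attr),
                        .sched_policy   = SCHED_FIFO,
                        .sched_nice     = 0,
                        .sched_priority = prio,
                };

                kth = kthread_create(threadfn, data, name);
                if (IS_ERR(kth)) {
                        pr_warn("%s: kthread_create failed\n", __func__);
                        return;
                }
                /* The callers never checked for errors anyway, so just warn. */
                if (sched_setattr_nocheck(kth, &attr)) {
                        kthread_stop(kth);
                        pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
                        return;
                }

                wake_up_process(kth);
        }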

> +}
> +
> +int defense_low_thread(void *arg)
> +{
> + long tnum = (long)arg;
> +
> + atomic_inc(&players_ready);
> + mutex_lock(&mutex_low_list[tnum]);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + }
> + mutex_unlock(&mutex_low_list[tnum]);
> + return 0;
> +}
> +
> +int defense_mid_thread(void *arg)
> +{
> + long tnum = (long)arg;
> +
> + atomic_inc(&players_ready);
> + mutex_lock(&mutex_mid_list[tnum]);
> + mutex_lock(&mutex_low_list[tnum]);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + }
> + mutex_unlock(&mutex_low_list[tnum]);
> + mutex_unlock(&mutex_mid_list[tnum]);
> + return 0;
> +}
> +
> +int offense_thread(void *)

Does this (no param name) build fine on Android env?

> +{
> + atomic_inc(&players_ready);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + atomic_inc(&ball_pos);
> + }
> + return 0;
> +}
> +
> +int defense_hi_thread(void *arg)
> +{
> + long tnum = (long)arg;
> +
> + atomic_inc(&players_ready);
> + mutex_lock(&mutex_mid_list[tnum]);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + }
> + mutex_unlock(&mutex_mid_list[tnum]);
> + return 0;
> +}
> +
> +int crazy_fan_thread(void *)

Same (no param name) question here.

> +{
> + int count = 0;
> +
> + atomic_inc(&players_ready);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + udelay(1000);
> + msleep(2);
> + count++;
> + }
> + return 0;
> +}
> +
> +int ref_thread(void *arg)
> +{
> + struct task_struct *kth;
> + long game_time = (long)arg;
> + unsigned long final_pos;
> + long i;
> +
> + pr_info("%s: started ref, game_time: %ld secs !\n", __func__,
> + game_time);
> +
> + /* Create low priority defensive team */

Sorry: extra space after `low`.

> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_low_thread, (void *)i,
> + "defese-low-thread", 2);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team)
> + msleep(1);
> +
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_mid_thread,
> + (void *)(players_per_team - i - 1),
> + "defese-mid-thread", 3);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 2)
> + msleep(1);
> +
> + /* Create mid priority offensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(offense_thread, NULL,
> + "offense-thread", 5);
> + /* Wait for the offense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 3)
> + msleep(1);
> +
> + /* Create high priority defensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_hi_thread, (void *)i,
> + "defese-hi-thread", 10);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 4)
> + msleep(1);
> +
> + /* Create high priority defensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(crazy_fan_thread, NULL,
> + "crazy-fan-thread", 15);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 5)
> + msleep(1);
> +
> + pr_info("%s: all players checked in! Starting game.\n", __func__);
> + atomic_set(&ball_pos, 0);
> + msleep(game_time * 1000);
> + final_pos = atomic_read(&ball_pos);
> + pr_info("%s: final ball_pos: %ld\n", __func__, final_pos);
> + WARN_ON(final_pos != 0);
> + game_over = true;
> + return 0;
> +}
> +
> +static int __init test_sched_football_init(void)
> +{
> + struct task_struct *kth;
> + int i;
> +
> + players_per_team = num_online_cpus();
> +
> + mutex_low_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
> + mutex_mid_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);

* Extra space after `players_per_team,`.
* Shouldn't we check the result of `kmalloc_array()`?

The same comments apply to the `mutex_low_list` line above.
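
Something like this (untested) would cover the allocation failure path:

        mutex_low_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
        if (!mutex_low_list)
                return -ENOMEM;

        mutex_mid_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
        if (!mutex_mid_list) {
                kfree(mutex_low_list);
                return -ENOMEM;
        }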

> +
> + for (i = 0; i < players_per_team; i++) {
> + mutex_init(&mutex_low_list[i]);
> + mutex_init(&mutex_mid_list[i]);
> + }
> +
> + kth = create_fifo_thread(ref_thread, (void *)10, "ref-thread", 20);
> +
> + return 0;
> +}
> +module_init(test_sched_football_init);
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 4405f81248fb..1d90059d190f 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1238,6 +1238,20 @@ config SCHED_DEBUG
> that can help debug the scheduler. The runtime overhead of this
> option is minimal.
>
> +config SCHED_RT_INVARIENT_TEST
> + tristate "RT invarient scheduling tester"
> + depends on DEBUG_KERNEL
> + help
> + This option provides a kernel module that runs tests to make
> + sure the RT invarient holds (top N priority tasks run on N
> + available cpus).
> +
> + Say Y here if you want kernel rt scheduling tests
> + to be built into the kernel.
> + Say M if you want this test to build as a module.
> + Say N if you are unsure.
> +
> +
> config SCHED_INFO
> bool
> default n


2023-12-22 10:24:16

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 19/23] sched: Consolidate pick_*_task to task_is_pushable helper

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Connor O'Brien <[email protected]>
>
> This patch consolidates rt and deadline pick_*_task functions to
> a task_is_pushable() helper
>
> This patch was broken out from a larger chain migration
> patch originally by Connor O'Brien.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: split out from larger chain migration patch,
> renamed helper function]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v7:
> * Split from chain migration patch
> * Renamed function
> ---
> kernel/sched/deadline.c | 10 +---------
> kernel/sched/rt.c | 11 +----------
> kernel/sched/sched.h | 10 ++++++++++
> 3 files changed, 12 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index def1eb23318b..1f3bc50de678 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2049,14 +2049,6 @@ static void task_fork_dl(struct task_struct *p)
> /* Only try algorithms three times */
> #define DL_MAX_TRIES 3
>
> -static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
> -{
> - if (!task_on_cpu(rq, p) &&
> - cpumask_test_cpu(cpu, &p->cpus_mask))
> - return 1;
> - return 0;
> -}
> -
> /*
> * Return the earliest pushable rq's task, which is suitable to be executed
> * on the CPU, NULL otherwise:
> @@ -2075,7 +2067,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
> if (next_node) {
> p = __node_2_pdl(next_node);
>
> - if (pick_dl_task(rq, p, cpu))
> + if (task_is_pushable(rq, p, cpu) == 1)

Nit: ` == 1` part is redundant, IMHO.

> return p;
>
> next_node = rb_next(next_node);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index cf0eb4aac613..15161de88753 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1812,15 +1812,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> /* Only try algorithms three times */
> #define RT_MAX_TRIES 3
>
> -static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
> -{
> - if (!task_on_cpu(rq, p) &&
> - cpumask_test_cpu(cpu, &p->cpus_mask))
> - return 1;
> -
> - return 0;
> -}
> -
> /*
> * Return the highest pushable rq's task, which is suitable to be executed
> * on the CPU, NULL otherwise
> @@ -1834,7 +1825,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
> return NULL;
>
> plist_for_each_entry(p, head, pushable_tasks) {
> - if (pick_rt_task(rq, p, cpu))
> + if (task_is_pushable(rq, p, cpu) == 1)

Ditto.

> return p;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 19afe532771f..ef3d327e267c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3554,6 +3554,16 @@ void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> set_task_cpu(task, dst_rq->cpu);
> activate_task(dst_rq, task, 0);
> }
> +
> +static inline
> +int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)

Nit: I know the function is just renamed in this patch, but should we
change the return type to bool while we are at it?
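
i.e., something like (untested):

        static inline
        bool task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
        {
                return !task_on_cpu(rq, p) &&
                       cpumask_test_cpu(cpu, &p->cpus_mask);
        }

That would also make the ` == 1` comparisons at the call sites unnecessary.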

> +{
> + if (!task_on_cpu(rq, p) &&
> + cpumask_test_cpu(cpu, &p->cpus_mask))
> + return 1;
> +
> + return 0;
> +}
> #endif
>
> #endif /* _KERNEL_SCHED_SCHED_H */


2023-12-22 10:33:01

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 18/23] sched: Add push_task_chain helper

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Connor O'Brien <[email protected]>
>
> Switch logic that deactivates, sets the task cpu,
> and reactivates a task on a different rq to use a
> helper that will be later extended to push entire
> blocked task chains.
>
> This patch was broken out from a larger chain migration
> patch originally by Connor O'Brien.

I think patches #18, #19 and #22 can be upstreamed regardless of the other
Proxy Execution patches.

>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: split out from larger chain migration patch]
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched/core.c | 4 +---
> kernel/sched/deadline.c | 8 ++------
> kernel/sched/rt.c | 8 ++------
> kernel/sched/sched.h | 9 +++++++++
> 4 files changed, 14 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0cd63bd0bdcd..0c212dcd4b7a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2721,9 +2721,7 @@ int push_cpu_stop(void *arg)
>
> // XXX validate p is still the highest prio task
> if (task_rq(p) == rq) {
> - deactivate_task(rq, p, 0);
> - set_task_cpu(p, lowest_rq->cpu);
> - activate_task(lowest_rq, p, 0);
> + push_task_chain(rq, lowest_rq, p);
> resched_curr(lowest_rq);
> }
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 4f998549ea74..def1eb23318b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2313,9 +2313,7 @@ static int push_dl_task(struct rq *rq)
> goto retry;
> }
>
> - deactivate_task(rq, next_task, 0);
> - set_task_cpu(next_task, later_rq->cpu);
> - activate_task(later_rq, next_task, 0);
> + push_task_chain(rq, later_rq, next_task);
> ret = 1;
>
> resched_curr(later_rq);
> @@ -2401,9 +2399,7 @@ static void pull_dl_task(struct rq *this_rq)
> if (is_migration_disabled(p)) {
> push_task = get_push_task(src_rq);
> } else {
> - deactivate_task(src_rq, p, 0);
> - set_task_cpu(p, this_cpu);
> - activate_task(this_rq, p, 0);
> + push_task_chain(src_rq, this_rq, p);
> dmin = p->dl.deadline;
> resched = true;
> }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index a7b51a021111..cf0eb4aac613 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2128,9 +2128,7 @@ static int push_rt_task(struct rq *rq, bool pull)
> goto retry;
> }
>
> - deactivate_task(rq, next_task, 0);
> - set_task_cpu(next_task, lowest_rq->cpu);
> - activate_task(lowest_rq, next_task, 0);
> + push_task_chain(rq, lowest_rq, next_task);
> resched_curr(lowest_rq);
> ret = 1;
>
> @@ -2401,9 +2399,7 @@ static void pull_rt_task(struct rq *this_rq)
> if (is_migration_disabled(p)) {
> push_task = get_push_task(src_rq);
> } else {
> - deactivate_task(src_rq, p, 0);
> - set_task_cpu(p, this_cpu);
> - activate_task(this_rq, p, 0);
> + push_task_chain(src_rq, this_rq, p);
> resched = true;
> }
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 765ba10661de..19afe532771f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3546,5 +3546,14 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
>
> extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
> +#ifdef CONFIG_SMP
> +static inline
> +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +{
> + deactivate_task(rq, task, 0);
> + set_task_cpu(task, dst_rq->cpu);
> + activate_task(dst_rq, task, 0);
> +}
> +#endif
>
> #endif /* _KERNEL_SCHED_SCHED_H */


2023-12-22 11:40:54

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 20/23] sched: Push execution and scheduler context split into deadline and rt paths

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Connor O'Brien <[email protected]>
>
> In preparation for chain migration, push the awareness
> of the split between execution and scheduler context
> down into some of the rt/deadline code paths that deal
> with load balancing.
>
> This patch was broken out from a larger chain migration
> patch originally by Connor O'Brien.
>

Nit: Commit header is too long. ` paths` can be dropped.

> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: split out from larger chain migration patch]
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched/cpudeadline.c | 12 ++++++------
> kernel/sched/cpudeadline.h | 3 ++-
> kernel/sched/cpupri.c | 20 +++++++++++---------
> kernel/sched/cpupri.h | 6 ++++--
> kernel/sched/deadline.c | 18 +++++++++---------
> kernel/sched/rt.c | 31 ++++++++++++++++++-------------
> 6 files changed, 50 insertions(+), 40 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 95baa12a1029..6ac59dcdf068 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -113,13 +113,13 @@ static inline int cpudl_maximum(struct cpudl *cp)
> *
> * Returns: int - CPUs were found
> */
> -int cpudl_find(struct cpudl *cp, struct task_struct *p,
> +int cpudl_find(struct cpudl *cp, struct task_struct *sched_ctx, struct task_struct *exec_ctx,
> struct cpumask *later_mask)
> {
> - const struct sched_dl_entity *dl_se = &p->dl;
> + const struct sched_dl_entity *dl_se = &sched_ctx->dl;
>
> if (later_mask &&
> - cpumask_and(later_mask, cp->free_cpus, &p->cpus_mask)) {
> + cpumask_and(later_mask, cp->free_cpus, &exec_ctx->cpus_mask)) {
> unsigned long cap, max_cap = 0;
> int cpu, max_cpu = -1;
>
> @@ -128,13 +128,13 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>
> /* Ensure the capacity of the CPUs fits the task. */
> for_each_cpu(cpu, later_mask) {
> - if (!dl_task_fits_capacity(p, cpu)) {
> + if (!dl_task_fits_capacity(sched_ctx, cpu)) {
> cpumask_clear_cpu(cpu, later_mask);
>
> cap = arch_scale_cpu_capacity(cpu);
>
> if (cap > max_cap ||
> - (cpu == task_cpu(p) && cap == max_cap)) {
> + (cpu == task_cpu(exec_ctx) && cap == max_cap)) {
> max_cap = cap;
> max_cpu = cpu;
> }
> @@ -150,7 +150,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>
> WARN_ON(best_cpu != -1 && !cpu_present(best_cpu));
>
> - if (cpumask_test_cpu(best_cpu, &p->cpus_mask) &&
> + if (cpumask_test_cpu(best_cpu, &exec_ctx->cpus_mask) &&
> dl_time_before(dl_se->deadline, cp->elements[0].dl)) {
> if (later_mask)
> cpumask_set_cpu(best_cpu, later_mask);
> diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
> index 0adeda93b5fb..6bb27f70e9d2 100644
> --- a/kernel/sched/cpudeadline.h
> +++ b/kernel/sched/cpudeadline.h
> @@ -16,7 +16,8 @@ struct cpudl {
> };
>
> #ifdef CONFIG_SMP
> -int cpudl_find(struct cpudl *cp, struct task_struct *p, struct cpumask *later_mask);
> +int cpudl_find(struct cpudl *cp, struct task_struct *sched_ctx,
> + struct task_struct *exec_ctx, struct cpumask *later_mask);
> void cpudl_set(struct cpudl *cp, int cpu, u64 dl);
> void cpudl_clear(struct cpudl *cp, int cpu);
> int cpudl_init(struct cpudl *cp);
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index 42c40cfdf836..15e947a3ded7 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -118,10 +118,11 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
> return 1;
> }
>
> -int cpupri_find(struct cpupri *cp, struct task_struct *p,
> +int cpupri_find(struct cpupri *cp, struct task_struct *sched_ctx,
> + struct task_struct *exec_ctx,
> struct cpumask *lowest_mask)
> {
> - return cpupri_find_fitness(cp, p, lowest_mask, NULL);
> + return cpupri_find_fitness(cp, sched_ctx, exec_ctx, lowest_mask, NULL);
> }
>
> /**
> @@ -141,18 +142,19 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> *
> * Return: (int)bool - CPUs were found
> */
> -int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
> - struct cpumask *lowest_mask,
> - bool (*fitness_fn)(struct task_struct *p, int cpu))
> +int cpupri_find_fitness(struct cpupri *cp, struct task_struct *sched_ctx,
> + struct task_struct *exec_ctx,
> + struct cpumask *lowest_mask,
> + bool (*fitness_fn)(struct task_struct *p, int cpu))
> {
> - int task_pri = convert_prio(p->prio);
> + int task_pri = convert_prio(sched_ctx->prio);
> int idx, cpu;
>
> WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES);
>
> for (idx = 0; idx < task_pri; idx++) {
>
> - if (!__cpupri_find(cp, p, lowest_mask, idx))
> + if (!__cpupri_find(cp, exec_ctx, lowest_mask, idx))
> continue;
>
> if (!lowest_mask || !fitness_fn)
> @@ -160,7 +162,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
>
> /* Ensure the capacity of the CPUs fit the task */
> for_each_cpu(cpu, lowest_mask) {
> - if (!fitness_fn(p, cpu))
> + if (!fitness_fn(sched_ctx, cpu))
> cpumask_clear_cpu(cpu, lowest_mask);
> }
>
> @@ -192,7 +194,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
> * really care.
> */
> if (fitness_fn)
> - return cpupri_find(cp, p, lowest_mask);
> + return cpupri_find(cp, sched_ctx, exec_ctx, lowest_mask);
>
> return 0;
> }
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..bde7243cec2e 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -18,9 +18,11 @@ struct cpupri {
> };
>
> #ifdef CONFIG_SMP
> -int cpupri_find(struct cpupri *cp, struct task_struct *p,
> +int cpupri_find(struct cpupri *cp, struct task_struct *sched_ctx,
> + struct task_struct *exec_ctx,
> struct cpumask *lowest_mask);
> -int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
> +int cpupri_find_fitness(struct cpupri *cp, struct task_struct *sched_ctx,
> + struct task_struct *exec_ctx,
> struct cpumask *lowest_mask,
> bool (*fitness_fn)(struct task_struct *p, int cpu));
> void cpupri_set(struct cpupri *cp, int cpu, int pri);
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1f3bc50de678..999bd17f11c4 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1779,7 +1779,7 @@ static inline bool dl_task_is_earliest_deadline(struct task_struct *p,
> rq->dl.earliest_dl.curr));
> }
>
> -static int find_later_rq(struct task_struct *task);
> +static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx);
>
> static int
> select_task_rq_dl(struct task_struct *p, int cpu, int flags)
> @@ -1819,7 +1819,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
> select_rq |= !dl_task_fits_capacity(p, cpu);
>
> if (select_rq) {
> - int target = find_later_rq(p);
> + int target = find_later_rq(p, p);
>
> if (target != -1 &&
> dl_task_is_earliest_deadline(p, cpu_rq(target)))
> @@ -1871,7 +1871,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> * let's hope p can move out.
> */
> if (rq->curr->nr_cpus_allowed == 1 ||
> - !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
> + !cpudl_find(&rq->rd->cpudl, rq_selected(rq), rq->curr, NULL))
> return;
>
> /*
> @@ -1879,7 +1879,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> * see if it is pushed or pulled somewhere else.
> */
> if (p->nr_cpus_allowed != 1 &&
> - cpudl_find(&rq->rd->cpudl, p, NULL))
> + cpudl_find(&rq->rd->cpudl, p, p, NULL))
> return;
>
> resched_curr(rq);
> @@ -2079,25 +2079,25 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
>
> static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
>
> -static int find_later_rq(struct task_struct *task)
> +static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx)

Nit: the line becomes too long. The same applies to find_later_rq()'s
signature above, as well as to find_lowest_rq() in rt.c.

> {
> struct sched_domain *sd;
> struct cpumask *later_mask = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);
> int this_cpu = smp_processor_id();
> - int cpu = task_cpu(task);
> + int cpu = task_cpu(sched_ctx);
>
> /* Make sure the mask is initialized first */
> if (unlikely(!later_mask))
> return -1;
>
> - if (task->nr_cpus_allowed == 1)
> + if (exec_ctx && exec_ctx->nr_cpus_allowed == 1)

Can exec_ctx be NULL here? If so, we may hit a NULL pointer dereference at
task_rq(exec_ctx) below.
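
If it can be NULL, maybe an early bail-out like this (untested) would be
safer:

        if (!exec_ctx)
                return -1;

        if (exec_ctx->nr_cpus_allowed == 1)
                return -1;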

> return -1;
>
> /*
> * We have to consider system topology and task affinity
> * first, then we can look for a suitable CPU.
> */
> - if (!cpudl_find(&task_rq(task)->rd->cpudl, task, later_mask))
> + if (!cpudl_find(&task_rq(exec_ctx)->rd->cpudl, sched_ctx, exec_ctx, later_mask))
> return -1;
>
> /*
> @@ -2174,7 +2174,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> int cpu;
>
> for (tries = 0; tries < DL_MAX_TRIES; tries++) {
> - cpu = find_later_rq(task);
> + cpu = find_later_rq(task, task);
>
> if ((cpu == -1) || (cpu == rq->cpu))
> break;
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 15161de88753..6371b0fca4ad 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1554,7 +1554,7 @@ static void yield_task_rt(struct rq *rq)
> }
>
> #ifdef CONFIG_SMP
> -static int find_lowest_rq(struct task_struct *task);
> +static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx);
>
> static int
> select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> @@ -1604,7 +1604,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);
>
> if (test || !rt_task_fits_capacity(p, cpu)) {
> - int target = find_lowest_rq(p);
> + int target = find_lowest_rq(p, p);
>
> /*
> * Bail out if we were forcing a migration to find a better
> @@ -1631,8 +1631,13 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>
> static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> {
> + struct task_struct *exec_ctx = p;
> + /*
> + * Current can't be migrated, useless to reschedule,
> + * let's hope p can move out.
> + */
> if (rq->curr->nr_cpus_allowed == 1 ||
> - !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
> + !cpupri_find(&rq->rd->cpupri, rq_selected(rq), rq->curr, NULL))
> return;
>
> /*
> @@ -1640,7 +1645,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> * see if it is pushed or pulled somewhere else.
> */
> if (p->nr_cpus_allowed != 1 &&
> - cpupri_find(&rq->rd->cpupri, p, NULL))
> + cpupri_find(&rq->rd->cpupri, p, exec_ctx, NULL))
> return;
>
> /*
> @@ -1834,19 +1839,19 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
>
> static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);
>
> -static int find_lowest_rq(struct task_struct *task)
> +static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx)
> {
> struct sched_domain *sd;
> struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask);
> int this_cpu = smp_processor_id();
> - int cpu = task_cpu(task);
> + int cpu = task_cpu(sched_ctx);
> int ret;
>
> /* Make sure the mask is initialized first */
> if (unlikely(!lowest_mask))
> return -1;
>
> - if (task->nr_cpus_allowed == 1)
> + if (exec_ctx && exec_ctx->nr_cpus_allowed == 1)
> return -1; /* No other targets possible */
>
> /*
> @@ -1855,13 +1860,13 @@ static int find_lowest_rq(struct task_struct *task)
> */
> if (sched_asym_cpucap_active()) {
>
> - ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
> - task, lowest_mask,
> + ret = cpupri_find_fitness(&task_rq(sched_ctx)->rd->cpupri,
> + sched_ctx, exec_ctx, lowest_mask,
> rt_task_fits_capacity);
> } else {
>
> - ret = cpupri_find(&task_rq(task)->rd->cpupri,
> - task, lowest_mask);
> + ret = cpupri_find(&task_rq(sched_ctx)->rd->cpupri,
> + sched_ctx, exec_ctx, lowest_mask);
> }
>
> if (!ret)
> @@ -1933,7 +1938,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
> int cpu;
>
> for (tries = 0; tries < RT_MAX_TRIES; tries++) {
> - cpu = find_lowest_rq(task);
> + cpu = find_lowest_rq(task, task);
>
> if ((cpu == -1) || (cpu == rq->cpu))
> break;
> @@ -2055,7 +2060,7 @@ static int push_rt_task(struct rq *rq, bool pull)
> if (rq->curr->sched_class != &rt_sched_class)
> return 0;
>
> - cpu = find_lowest_rq(rq->curr);
> + cpu = find_lowest_rq(rq_selected(rq), rq->curr);
> if (cpu == -1 || cpu == rq->cpu)
> return 0;
>


2023-12-22 11:59:13

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 21/23] sched: Add find_exec_ctx helper

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Connor O'Brien <[email protected]>
>
> Add a helper to find the runnable owner down a chain of blocked waiters
>
> This patch was broken out from a larger chain migration
> patch originally by Connor O'Brien.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: split out from larger chain migration patch]
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched/core.c | 42 +++++++++++++++++++++++++++++++++++++++++
> kernel/sched/cpupri.c | 11 ++++++++---
> kernel/sched/deadline.c | 15 +++++++++++++--
> kernel/sched/rt.c | 9 ++++++++-
> kernel/sched/sched.h | 10 ++++++++++
> 5 files changed, 81 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0c212dcd4b7a..77a79d5f829a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3896,6 +3896,48 @@ static void activate_blocked_entities(struct rq *target_rq,
> }
> raw_spin_unlock_irqrestore(&owner->blocked_lock, flags);
> }
> +
> +static inline bool task_queued_on_rq(struct rq *rq, struct task_struct *task)
> +{
> + if (!task_on_rq_queued(task))
> + return false;
> + smp_rmb();
> + if (task_rq(task) != rq)
> + return false;
> + smp_rmb();
> + if (!task_on_rq_queued(task))
> + return false;

* Super-nit: we may want to have empty lines between the `if` blocks and
before/after the `smp_rmb()` calls.

* I did not understand why we call `task_on_rq_queued(task)` twice.
Should we add an explanatory comment before the function definition?
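
Maybe something along the lines of the race note in the last patch's commit
message, e.g.:

        /*
         * We cannot atomically check "is task enqueued on this rq" without
         * holding task's rq lock, so sandwich the task_rq() check between
         * two task_on_rq_queued() checks, separated by smp_rmb(). Since we
         * hold rq's lock, task cannot be enqueued on or dequeued from rq
         * in the meantime, so neither race is possible.
         */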

> + return true;
> +}
> +
> +/*
> + * Returns the unblocked task at the end of the blocked chain starting with p
> + * if that chain is composed entirely of tasks enqueued on rq, or NULL otherwise.
> + */
> +struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
> +{
> + struct task_struct *exec_ctx, *owner;
> + struct mutex *mutex;
> +
> + if (!sched_proxy_exec())
> + return p;
> +
> + lockdep_assert_rq_held(rq);
> +
> + for (exec_ctx = p; task_is_blocked(exec_ctx) && !task_on_cpu(rq, exec_ctx);
> + exec_ctx = owner) {
> + mutex = exec_ctx->blocked_on;
> + owner = __mutex_owner(mutex);
> + if (owner == exec_ctx)
> + break;
> +
> + if (!task_queued_on_rq(rq, owner) || task_current_selected(rq, owner)) {
> + exec_ctx = NULL;
> + break;
> + }
> + }
> + return exec_ctx;
> +}
> #else /* !CONFIG_SCHED_PROXY_EXEC */
> static inline void do_activate_task(struct rq *rq, struct task_struct *p,
> int en_flags)
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index 15e947a3ded7..53be78afdd07 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -96,12 +96,17 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
> if (skip)
> return 0;
>
> - if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
> + if ((p && cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids) ||
> + (!p && cpumask_any(vec->mask) >= nr_cpu_ids))
> return 0;
>
> if (lowest_mask) {
> - cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
> - cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
> + if (p) {
> + cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
> + cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
> + } else {
> + cpumask_copy(lowest_mask, vec->mask);
> + }

I think the changes in `cpupri.c` should be part of the previous patch
(`sched: Push execution and scheduler context split into deadline and rt
paths`), because they don't seem to be related to find_exec_ctx()?

>
> /*
> * We have to ensure that we have at least one bit
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 999bd17f11c4..21e56ac58e32 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1866,6 +1866,8 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
>
> static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> {
> + struct task_struct *exec_ctx;
> +
> /*
> * Current can't be migrated, useless to reschedule,
> * let's hope p can move out.
> @@ -1874,12 +1876,16 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> !cpudl_find(&rq->rd->cpudl, rq_selected(rq), rq->curr, NULL))
> return;
>
> + exec_ctx = find_exec_ctx(rq, p);
> + if (task_current(rq, exec_ctx))
> + return;
> +
> /*
> * p is migratable, so let's not schedule it and
> * see if it is pushed or pulled somewhere else.
> */
> if (p->nr_cpus_allowed != 1 &&
> - cpudl_find(&rq->rd->cpudl, p, p, NULL))
> + cpudl_find(&rq->rd->cpudl, p, exec_ctx, NULL))
> return;
>
> resched_curr(rq);
> @@ -2169,12 +2175,17 @@ static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec
> /* Locks the rq it finds */
> static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> {
> + struct task_struct *exec_ctx;
> struct rq *later_rq = NULL;
> int tries;
> int cpu;
>
> for (tries = 0; tries < DL_MAX_TRIES; tries++) {
> - cpu = find_later_rq(task, task);
> + exec_ctx = find_exec_ctx(rq, task);
> + if (!exec_ctx)
> + break;
> +
> + cpu = find_later_rq(task, exec_ctx);
>

Super-nit: this empty line should be removed to keep logically connected
lines closer.
The same for find_lock_lowest_rq().

> if ((cpu == -1) || (cpu == rq->cpu))
> break;
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 6371b0fca4ad..f8134d062fa3 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1640,6 +1640,11 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> !cpupri_find(&rq->rd->cpupri, rq_selected(rq), rq->curr, NULL))
> return;
>
> + /* No reason to preempt since rq->curr wouldn't change anyway */
> + exec_ctx = find_exec_ctx(rq, p);
> + if (task_current(rq, exec_ctx))
> + return;
> +
> /*
> * p is migratable, so let's not schedule it and
> * see if it is pushed or pulled somewhere else.
> @@ -1933,12 +1938,14 @@ static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exe
> /* Will lock the rq it finds */
> static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
> {
> + struct task_struct *exec_ctx;
> struct rq *lowest_rq = NULL;
> int tries;
> int cpu;
>
> for (tries = 0; tries < RT_MAX_TRIES; tries++) {
> - cpu = find_lowest_rq(task, task);
> + exec_ctx = find_exec_ctx(rq, task);
> + cpu = find_lowest_rq(task, exec_ctx);
>
> if ((cpu == -1) || (cpu == rq->cpu))
> break;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ef3d327e267c..6cd473224cfe 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3564,6 +3564,16 @@ int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
>
> return 0;
> }
> +
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p);
> +#else /* !CONFIG_SCHED_PROXY_EXEC */
> +static inline
> +struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
> +{
> + return p;
> +}
> +#endif /* CONFIG_SCHED_PROXY_EXEC */
> #endif

Nit: the `#ifdef CONFIG_SMP` block becomes bigger after this hunk. We should
append a `/* CONFIG_SMP */` comment to this `#endif`, IMHO.

>
> #endif /* _KERNEL_SCHED_SCHED_H */


2023-12-22 13:53:02

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 22/23] sched: Refactor dl/rt find_lowest/latest_rq logic

On 20/12/2023 12:18 am, John Stultz wrote:
> This pulls re-validation logic done in find_lowest_rq
> and find_latest_rq after re-acquiring the rq locks out into its
> own function.
>
> This allows us to later use a more complicated validation
> check for chain-migration when using proxy-exectuion.

execution

>
> TODO: It seems likely we could consolidate this two functions
> further and leave the task_is_rt()/task_is_dl() checks externally?

Agreed.
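
A rough (untested) sketch of what that could look like, with a made-up
helper name and the dl_task()/rt_task() check left to the callers:

        static inline bool revalidate_rq_state(struct task_struct *task, struct rq *rq,
                                               struct rq *target)
        {
                if (task_rq(task) != rq)
                        return false;

                if (!cpumask_test_cpu(target->cpu, &task->cpus_mask))
                        return false;

                if (task_on_cpu(rq, task))
                        return false;

                if (is_migration_disabled(task))
                        return false;

                return task_on_rq_queued(task);
        }

find_lock_later_rq()/find_lock_lowest_rq() would then keep the
dl_task(task)/rt_task(task) check next to the call site.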

>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched/deadline.c | 31 ++++++++++++++++++++-----
> kernel/sched/rt.c | 50 ++++++++++++++++++++++++++++-------------
> 2 files changed, 59 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 21e56ac58e32..8b5701727342 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2172,6 +2172,30 @@ static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec
> return -1;
> }
>
> +static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> + struct rq *later)
> +{
> + if (task_rq(task) != rq)
> + return false;
> +
> + if (!cpumask_test_cpu(later->cpu, &task->cpus_mask))
> + return false;
> +
> + if (task_on_cpu(rq, task))
> + return false;
> +
> + if (!dl_task(task))
> + return false;
> +
> + if (is_migration_disabled(task))
> + return false;
> +
> + if (!task_on_rq_queued(task))
> + return false;
> +
> + return true;
> +}
> +
> /* Locks the rq it finds */
> static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> {
> @@ -2204,12 +2228,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
>
> /* Retry if something changed. */
> if (double_lock_balance(rq, later_rq)) {
> - if (unlikely(task_rq(task) != rq ||
> - !cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) ||
> - task_on_cpu(rq, task) ||
> - !dl_task(task) ||
> - is_migration_disabled(task) ||
> - !task_on_rq_queued(task))) {
> + if (unlikely(!dl_revalidate_rq_state(task, rq, later_rq))) {
> double_unlock_balance(rq, later_rq);
> later_rq = NULL;
> break;
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index f8134d062fa3..fabb19891e95 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1935,6 +1935,39 @@ static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exe
> return -1;
> }
>
> +static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> + struct rq *lowest)
> +{
> + /*
> + * We had to unlock the run queue. In
> + * the mean time, task could have
> + * migrated already or had its affinity changed.
> + * Also make sure that it wasn't scheduled on its rq.
> + * It is possible the task was scheduled, set
> + * "migrate_disabled" and then got preempted, so we must
> + * check the task migration disable flag here too.
> + */
> + if (task_rq(task) != rq)
> + return false;
> +
> + if (!cpumask_test_cpu(lowest->cpu, &task->cpus_mask))
> + return false;
> +
> + if (task_on_cpu(rq, task))
> + return false;
> +
> + if (!rt_task(task))
> + return false;
> +
> + if (is_migration_disabled(task))
> + return false;
> +
> + if (!task_on_rq_queued(task))
> + return false;
> +
> + return true;
> +}
> +
> /* Will lock the rq it finds */
> static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
> {
> @@ -1964,22 +1997,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>
> /* if the prio of this runqueue changed, try again */
> if (double_lock_balance(rq, lowest_rq)) {
> - /*
> - * We had to unlock the run queue. In
> - * the mean time, task could have
> - * migrated already or had its affinity changed.
> - * Also make sure that it wasn't scheduled on its rq.
> - * It is possible the task was scheduled, set
> - * "migrate_disabled" and then got preempted, so we must
> - * check the task migration disable flag here too.
> - */
> - if (unlikely(task_rq(task) != rq ||
> - !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
> - task_on_cpu(rq, task) ||
> - !rt_task(task) ||
> - is_migration_disabled(task) ||
> - !task_on_rq_queued(task))) {
> -
> + if (unlikely(!rt_revalidate_rq_state(task, rq, lowest_rq))) {
> double_unlock_balance(rq, lowest_rq);
> lowest_rq = NULL;
> break;


2023-12-22 14:51:59

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 23/23] sched: Fix rt/dl load balancing via chain level balance

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Connor O'Brien <[email protected]>
>
> RT/DL balancing is supposed to guarantee that with N cpus available &
> CPU affinity permitting, the top N RT/DL tasks will get spread across
> the CPUs and all get to run. Proxy exec greatly complicates this as
> blocked tasks remain on the rq but cannot be usefully migrated away
> from their lock owning tasks. This has two major consequences:
> 1. In order to get the desired properties we need to migrate a blocked
> task, its would-be proxy, and everything in between, all together -
> i.e., we need to push/pull "blocked chains" rather than individual
> tasks.
> 2. Tasks that are part of rq->curr's "blocked tree" therefore should
> not be pushed or pulled. Options for enforcing this seem to include
> a) create some sort of complex data structure for tracking
> pushability, updating it whenever the blocked tree for rq->curr
> changes (e.g. on mutex handoffs, migrations, etc.) as well as on
> context switches.
> b) give up on O(1) pushability checks, and search through the pushable
> list every push/pull until we find a pushable "chain"
> c) Extend option "b" with some sort of caching to avoid repeated work.
>
> For the sake of simplicity & separating the "chain level balancing"
> concerns from complicated optimizations, this patch focuses on trying
> to implement option "b" correctly. This can then hopefully provide a
> baseline for "correct load balancing behavior" that optimizations can
> try to implement more efficiently.
>
> Note:
> The inability to atomically check "is task enqueued on a specific rq"
> creates 2 possible races when following a blocked chain:
> - If we check task_rq() first on a task that is dequeued from its rq,
> it can be woken and enqueued on another rq before the call to
> task_on_rq_queued()
> - If we call task_on_rq_queued() first on a task that is on another
> rq, it can be dequeued (since we don't hold its rq's lock) and then
> be set to the current rq before we check task_rq().
>
> Maybe there's a more elegant solution that would work, but for now,
> just sandwich the task_rq() check between two task_on_rq_queued()
> checks, all separated by smp_rmb() calls. Since we hold rq's lock,
> task can't be enqueued or dequeued from rq, so neither race should be
> possible.
>
> extensive comments on various pitfalls, races, etc. included inline.
>
> This patch was broken out from a larger chain migration
> patch originally by Connor O'Brien.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: split out from larger chain migration patch,
> majorly refactored for runtime conditionalization]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v7:
> * Split out from larger chain-migration patch in earlier
> versions of this series
> * Larger rework to allow proper conditionalization of the
> logic when running with CONFIG_SCHED_PROXY_EXEC
> ---
> kernel/sched/core.c | 77 +++++++++++++++++++++++-
> kernel/sched/deadline.c | 98 +++++++++++++++++++++++-------
> kernel/sched/rt.c | 130 ++++++++++++++++++++++++++++++++--------
> kernel/sched/sched.h | 18 +++++-
> 4 files changed, 273 insertions(+), 50 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 77a79d5f829a..30dfb6f14f2b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3923,7 +3923,6 @@ struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
> return p;
>
> lockdep_assert_rq_held(rq);
> -
> for (exec_ctx = p; task_is_blocked(exec_ctx) && !task_on_cpu(rq, exec_ctx);
> exec_ctx = owner) {
> mutex = exec_ctx->blocked_on;
> @@ -3938,6 +3937,82 @@ struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
> }
> return exec_ctx;
> }
> +
> +#ifdef CONFIG_SMP
> +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +{
> + struct task_struct *owner;
> +
> + if (!sched_proxy_exec()) {
> + __push_task_chain(rq, dst_rq, task);
> + return;
> + }
> +
> + lockdep_assert_rq_held(rq);
> + lockdep_assert_rq_held(dst_rq);
> +
> + BUG_ON(!task_queued_on_rq(rq, task));
> + BUG_ON(task_current_selected(rq, task));
> +
> + while (task) {
> + if (!task_queued_on_rq(rq, task) || task_current_selected(rq, task))
> + break;
> +
> + if (task_is_blocked(task))
> + owner = __mutex_owner(task->blocked_on);
> + else
> + owner = NULL;
> + __push_task_chain(rq, dst_rq, task);
> + if (task == owner)
> + break;
> + task = owner;
> + }
> +}
> +
> +/*
> + * Returns:
> + * 1 if chain is pushable and affinity does not prevent pushing to cpu
> + * 0 if chain is unpushable
> + * -1 if chain is pushable but affinity blocks running on cpu.
> + */
> +int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> +{
> + struct task_struct *exec_ctx;
> +
> + if (!sched_proxy_exec())
> + return __task_is_pushable(rq, p, cpu);
> +
> + lockdep_assert_rq_held(rq);
> +
> + if (task_rq(p) != rq || !task_on_rq_queued(p))
> + return 0;
> +
> + exec_ctx = find_exec_ctx(rq, p);
> + /*
> + * Chain leads off the rq, we're free to push it anywhere.
> + *
> + * One wrinkle with relying on find_exec_ctx is that when the chain
> + * leads to a task currently migrating to rq, we see the chain as
> + * pushable & push everything prior to the migrating task. Even if
> + * we checked explicitly for this case, we could still race with a
> + * migration after the check.
> + * This shouldn't permanently produce a bad state though, as proxy()

find_proxy_task()

> + * will send the chain back to rq and by that point the migration
> + * should be complete & a proper push can occur.
> + */
> + if (!exec_ctx)
> + return 1;
> +
> + if (task_on_cpu(rq, exec_ctx) || exec_ctx->nr_cpus_allowed <= 1)
> + return 0;
> +
> + return cpumask_test_cpu(cpu, &exec_ctx->cpus_mask) ? 1 : -1;
> +}
> +#else /* !CONFIG_SMP */
> +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +{
> +}
> +#endif /* CONFIG_SMP */
> #else /* !CONFIG_SCHED_PROXY_EXEC */
> static inline void do_activate_task(struct rq *rq, struct task_struct *p,
> int en_flags)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 8b5701727342..b7be888c1635 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2172,8 +2172,77 @@ static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec
> return -1;
> }
>
> +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
> +{
> + struct task_struct *p = NULL;
> + struct rb_node *next_node;
> +
> + if (!has_pushable_dl_tasks(rq))
> + return NULL;
> +
> + next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
> +
> +next_node:
> + if (next_node) {
> + p = __node_2_pdl(next_node);
> +
> + /*
> + * cpu argument doesn't matter because we treat a -1 result
> + * (pushable but can't go to cpu0) the same as a 1 result
> + * (pushable to cpu0). All we care about here is general
> + * pushability.
> + */
> + if (task_is_pushable(rq, p, 0))
> + return p;
> +
> + next_node = rb_next(next_node);
> + goto next_node;
> + }
> +
> + if (!p)
> + return NULL;
> +
> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> + WARN_ON_ONCE(task_current(rq, p));
> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> +
> + WARN_ON_ONCE(!task_on_rq_queued(p));
> + WARN_ON_ONCE(!dl_task(p));
> +
> + return p;
> +}
> +
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> - struct rq *later)
> + struct rq *later, bool *retry)
> +{
> + if (!dl_task(task) || is_migration_disabled(task))
> + return false;
> +
> + if (rq != this_rq()) {
> + struct task_struct *next_task = pick_next_pushable_dl_task(rq);
> +
> + if (next_task == task) {

Nit: We can `return false;` if next_task != task and save one level of
indentation.
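
i.e. (untested):

        if (rq != this_rq()) {
                struct task_struct *next_task = pick_next_pushable_dl_task(rq);
                struct task_struct *exec_ctx;

                if (next_task != task)
                        return false;

                exec_ctx = find_exec_ctx(rq, next_task);
                *retry = (exec_ctx &&
                          !cpumask_test_cpu(later->cpu, &exec_ctx->cpus_mask));
        }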

> + struct task_struct *exec_ctx;
> +
> + exec_ctx = find_exec_ctx(rq, next_task);
> + *retry = (exec_ctx && !cpumask_test_cpu(later->cpu,
> + &exec_ctx->cpus_mask));
> + } else {
> + return false;
> + }
> + } else {
> + int pushable = task_is_pushable(rq, task, later->cpu);
> +
> + *retry = pushable == -1;
> + if (!pushable)
> + return false;

`return pushable;` can replace the above 2 lines.
The same applies to rt_revalidate_rq_state().
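
i.e. (untested) -- the implicit int-to-bool conversion keeps the -1
("pushable, but affinity blocks this cpu") case returning true:

        } else {
                int pushable = task_is_pushable(rq, task, later->cpu);

                *retry = pushable == -1;
                return pushable;
        }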

> + }
> + return true;
> +}
> +#else
> +static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> + struct rq *later, bool *retry)
> {
> if (task_rq(task) != rq)
> return false;
> @@ -2195,16 +2264,18 @@ static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *r
>
> return true;
> }
> -
> +#endif
> /* Locks the rq it finds */
> static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> {
> struct task_struct *exec_ctx;
> struct rq *later_rq = NULL;
> + bool retry;
> int tries;
> int cpu;
>
> for (tries = 0; tries < DL_MAX_TRIES; tries++) {
> + retry = false;
> exec_ctx = find_exec_ctx(rq, task);
> if (!exec_ctx)
> break;
> @@ -2228,7 +2299,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
>
> /* Retry if something changed. */
> if (double_lock_balance(rq, later_rq)) {
> - if (unlikely(!dl_revalidate_rq_state(task, rq, later_rq))) {
> + if (unlikely(!dl_revalidate_rq_state(task, rq, later_rq, &retry))) {
> double_unlock_balance(rq, later_rq);
> later_rq = NULL;
> break;
> @@ -2240,7 +2311,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> * its earliest one has a later deadline than our
> * task, the rq is a good one.
> */
> - if (dl_task_is_earliest_deadline(task, later_rq))
> + if (!retry && dl_task_is_earliest_deadline(task, later_rq))
> break;
>
> /* Otherwise we try again. */
> @@ -2251,25 +2322,6 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> return later_rq;
> }
>
> -static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
> -{
> - struct task_struct *p;
> -
> - if (!has_pushable_dl_tasks(rq))
> - return NULL;
> -
> - p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
> -
> - WARN_ON_ONCE(rq->cpu != task_cpu(p));
> - WARN_ON_ONCE(task_current(rq, p));
> - WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> -
> - WARN_ON_ONCE(!task_on_rq_queued(p));
> - WARN_ON_ONCE(!dl_task(p));
> -
> - return p;
> -}
> -
> /*
> * See if the non running -deadline tasks on this rq
> * can be sent to some other CPU where they can preempt
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index fabb19891e95..d5ce95dc5c09 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1935,8 +1935,108 @@ static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exe
> return -1;
> }
>
> +static struct task_struct *pick_next_pushable_task(struct rq *rq)
> +{
> + struct plist_head *head = &rq->rt.pushable_tasks;
> + struct task_struct *p, *push_task = NULL;
> +
> + if (!has_pushable_tasks(rq))
> + return NULL;
> +
> + plist_for_each_entry(p, head, pushable_tasks) {
> + if (task_is_pushable(rq, p, 0)) {
> + push_task = p;
> + break;
> + }
> + }
> +
> + if (!push_task)
> + return NULL;
> +
> + BUG_ON(rq->cpu != task_cpu(push_task));
> + BUG_ON(task_current(rq, push_task) || task_current_selected(rq, push_task));
> + BUG_ON(!task_on_rq_queued(push_task));
> + BUG_ON(!rt_task(push_task));
> +
> + return p;
> +}
> +
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> + struct rq *lowest, bool *retry)

This function can be consolidated with dl_revalidate_rq_state() as you
noted in the previous patch, although rt_revalidate_rq_state() has a few
extra comments; see the sketch below.
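For reference, a rough sketch of the consolidated shape (untested, and
glossing over the fact that the pick_next_pushable_*() helpers currently
live in different files):

	static inline bool revalidate_rq_state(struct task_struct *task, struct rq *rq,
					       struct rq *target, bool *retry, bool is_dl)
	{
		struct task_struct *next_task, *exec_ctx;
		int pushable;

		if (is_migration_disabled(task))
			return false;
		if (is_dl ? !dl_task(task) : !rt_task(task))
			return false;

		if (rq != this_rq()) {
			next_task = is_dl ? pick_next_pushable_dl_task(rq) :
					    pick_next_pushable_task(rq);
			if (next_task != task)
				return false;

			exec_ctx = find_exec_ctx(rq, next_task);
			*retry = (exec_ctx &&
				  !cpumask_test_cpu(target->cpu, &exec_ctx->cpus_mask));
			return true;
		}

		pushable = task_is_pushable(rq, task, target->cpu);
		*retry = pushable == -1;
		return pushable;
	}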

> +{
> + /*
> + * Releasing the rq lock means we need to re-check pushability.
> + * Some scenarios:
> + * 1) If a migration from another CPU sent a task/chain to rq
> + * that made task newly unpushable by completing a chain
> + * from task to rq->curr, then we need to bail out and push something
> + * else.
> + * 2) If our chain led off this CPU or to a dequeued task, the last waiter
> + * on this CPU might have acquired the lock and woken (or even migrated
> + * & run, handed off the lock it held, etc...). This can invalidate the
> + * result of find_lowest_rq() if our chain previously ended in a blocked
> + * task whose affinity we could ignore, but now ends in an unblocked
> + * task that can't run on lowest_rq.
> + * 3) Race described at https://lore.kernel.org/all/[email protected]/
> + *
> + * Notes on these:
> + * - Scenario #2 is properly handled by rerunning find_lowest_rq
> + * - Scenario #1 requires that we fail
> + * - Scenario #3 can AFAICT only occur when rq is not this_rq(). And the
> + * suggested fix is not universally correct now that push_cpu_stop() can
> + * call this function.
> + */
> + if (!rt_task(task) || is_migration_disabled(task)) {
> + return false;
> + } else if (rq != this_rq()) {

Nit: `else` can be dropped as in dl_revalidate_rq_state().

> + /*
> + * If we are dealing with a remote rq, then all bets are off
> + * because task might have run & then been dequeued since we
> + * released the lock, at which point our normal checks can race
> + * with migration, as described in
> + * https://lore.kernel.org/all/[email protected]/
> + * Need to repick to ensure we avoid a race.
> + * But re-picking would be unnecessary & incorrect in the
> + * push_cpu_stop() path.
> + */
> + struct task_struct *next_task = pick_next_pushable_task(rq);
> +
> + if (next_task == task) {
> + struct task_struct *exec_ctx;
> +
> + exec_ctx = find_exec_ctx(rq, next_task);
> + *retry = (exec_ctx &&
> + !cpumask_test_cpu(lowest->cpu,
> + &exec_ctx->cpus_mask));
> + } else {
> + return false;
> + }
> + } else {
> + /*
> + * Chain level balancing introduces new ways for our choice of
> + * task & rq to become invalid when we release the rq lock, e.g.:
> + * 1) Migration to rq from another CPU makes task newly unpushable
> + * by completing a "blocked chain" from task to rq->curr.
> + * Fail so a different task can be chosen for push.
> + * 2) In cases where task's blocked chain led to a dequeued task
> + * or one on another rq, the last waiter in the chain on this
> + * rq might have acquired the lock and woken, meaning we must
> + * pick a different rq if its affinity prevents running on
> + * lowest rq.
> + */
> + int pushable = task_is_pushable(rq, task, lowest->cpu);
> +
> + *retry = pushable == -1;
> + if (!pushable)
> + return false;
> + }
> +
> + return true;
> +}
> +#else /* !CONFIG_SCHED_PROXY_EXEC */
> static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> - struct rq *lowest)
> + struct rq *lowest, bool *retry)
> {
> /*
> * We had to unlock the run queue. In
> @@ -1967,16 +2067,19 @@ static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *r
>
> return true;
> }
> +#endif
>
> /* Will lock the rq it finds */
> static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
> {
> struct task_struct *exec_ctx;
> struct rq *lowest_rq = NULL;
> + bool retry;
> int tries;
> int cpu;
>
> for (tries = 0; tries < RT_MAX_TRIES; tries++) {
> + retry = false;
> exec_ctx = find_exec_ctx(rq, task);
> cpu = find_lowest_rq(task, exec_ctx);
>
> @@ -1997,7 +2100,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>
> /* if the prio of this runqueue changed, try again */
> if (double_lock_balance(rq, lowest_rq)) {
> - if (unlikely(!rt_revalidate_rq_state(task, rq, lowest_rq))) {
> + if (unlikely(!rt_revalidate_rq_state(task, rq, lowest_rq, &retry))) {
> double_unlock_balance(rq, lowest_rq);
> lowest_rq = NULL;
> break;
> @@ -2005,7 +2108,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
> }
>
> /* If this rq is still suitable use it. */
> - if (lowest_rq->rt.highest_prio.curr > task->prio)
> + if (lowest_rq->rt.highest_prio.curr > task->prio && !retry)
> break;
>
> /* try again */
> @@ -2016,27 +2119,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
> return lowest_rq;
> }
>
> -static struct task_struct *pick_next_pushable_task(struct rq *rq)
> -{
> - struct task_struct *p;
> -
> - if (!has_pushable_tasks(rq))
> - return NULL;
> -
> - p = plist_first_entry(&rq->rt.pushable_tasks,
> - struct task_struct, pushable_tasks);
> -
> - BUG_ON(rq->cpu != task_cpu(p));
> - BUG_ON(task_current(rq, p));
> - BUG_ON(task_current_selected(rq, p));
> - BUG_ON(p->nr_cpus_allowed <= 1);
> -
> - BUG_ON(!task_on_rq_queued(p));
> - BUG_ON(!rt_task(p));
> -
> - return p;
> -}
> -
> /*
> * If the current CPU has more than one RT task, see if the non
> * running task can migrate over to a CPU that is running a task
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6cd473224cfe..4b97b36be691 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3548,7 +3548,7 @@ extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
> #ifdef CONFIG_SMP
> static inline
> -void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +void __push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> {
> deactivate_task(rq, task, 0);
> set_task_cpu(task, dst_rq->cpu);
> @@ -3556,7 +3556,7 @@ void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> }
>
> static inline
> -int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> +int __task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> {
> if (!task_on_cpu(rq, p) &&
> cpumask_test_cpu(cpu, &p->cpus_mask))
> @@ -3566,8 +3566,22 @@ int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> }
>
> #ifdef CONFIG_SCHED_PROXY_EXEC
> +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task);
> +int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu);
> struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p);
> #else /* !CONFIG_SCHED_PROXY_EXEC */
> +static inline
> +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +{
> + __push_task_chain(rq, dst_rq, task);
> +}
> +
> +static inline
> +int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> +{
> + return __task_is_pushable(rq, p, cpu);
> +}
> +
> static inline
> struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
> {


2023-12-28 15:06:28

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 06/23] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable

On 20/12/2023 12:18 am, John Stultz wrote:
> Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
> sched_proxy_exec= that can be used to disable the feature at boot
> time if CONFIG_SCHED_PROXY_EXEC is enabled.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v7:
> * Switch to CONFIG_SCHED_PROXY_EXEC/sched_proxy_exec= as
> suggested by Metin Kaya.
> * Switch boot arg from =disable/enable to use kstrtobool(),
> which supports =yes|no|1|0|true|false|on|off, as also
> suggested by Metin Kaya, and print a message when a boot
> argument is used.
> ---
> .../admin-guide/kernel-parameters.txt | 5 ++++
> include/linux/sched.h | 13 +++++++++
> init/Kconfig | 7 +++++
> kernel/sched/core.c | 29 +++++++++++++++++++
> 4 files changed, 54 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 65731b060e3f..cc64393b913f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5714,6 +5714,11 @@
> sa1100ir [NET]
> See drivers/net/irda/sa1100_ir.c.
>
> + sched_proxy_exec= [KNL]
> + Enables or disables "proxy execution" style
> + solution to mutex based priority inversion.
> + Format: <bool>
> +
> sched_verbose [KNL] Enables verbose scheduler debug messages.
>
> schedstats= [KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index bfe8670f99a1..880af1c3097d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1566,6 +1566,19 @@ struct task_struct {
> */
> };
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec);
> +static inline bool sched_proxy_exec(void)
> +{
> + return static_branch_likely(&__sched_proxy_exec);
> +}
> +#else
> +static inline bool sched_proxy_exec(void)
> +{
> + return false;
> +}
> +#endif
> +
> static inline struct pid *task_pid(struct task_struct *task)
> {
> return task->thread_pid;
> diff --git a/init/Kconfig b/init/Kconfig
> index 9ffb103fc927..c5a759b6366a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -908,6 +908,13 @@ config NUMA_BALANCING_DEFAULT_ENABLED
> If set, automatic NUMA balancing will be enabled if running on a NUMA
> machine.
>
> +config SCHED_PROXY_EXEC
> + bool "Proxy Execution"
> + default n
> + help
> + This option enables proxy execution, a mechanism for mutex owning
> + tasks to inherit the scheduling context of higher priority waiters.
> +

Should the `SCHED_PROXY_EXEC` config option be under the `Scheduler features` menu?
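e.g. (sketch only, assuming the existing "Scheduler features" menu block in
init/Kconfig is the intended spot; the entry would simply move inside it):

menu "Scheduler features"

...

config SCHED_PROXY_EXEC
	bool "Proxy Execution"
	default n
	help
	  This option enables proxy execution, a mechanism for mutex owning
	  tasks to inherit the scheduling context of higher priority waiters.

endmenu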

> menuconfig CGROUPS
> bool "Control Group support"
> select KERNFS
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4e46189d545d..e06558fb08aa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -117,6 +117,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
> +static int __init setup_proxy_exec(char *str)
> +{
> + bool proxy_enable;
> +
> + if (kstrtobool(str, &proxy_enable)) {
> + pr_warn("Unable to parse sched_proxy_exec=\n");
> + return 0;
> + }
> +
> + if (proxy_enable) {
> + pr_info("sched_proxy_exec enabled via boot arg\n");
> + static_branch_enable(&__sched_proxy_exec);
> + } else {
> + pr_info("sched_proxy_exec disabled via boot arg\n");
> + static_branch_disable(&__sched_proxy_exec);
> + }
> + return 1;
> +}
> +#else
> +static int __init setup_proxy_exec(char *str)
> +{
> + pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so it cannot be enabled or disabled at boottime\n");
> + return 0;
> +}
> +#endif
> +__setup("sched_proxy_exec=", setup_proxy_exec);
> +
> #ifdef CONFIG_SCHED_DEBUG
> /*
> * Debugging: various feature bits


2023-12-28 15:19:16

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 17/23] sched: Initial sched_football test implementation

On 20/12/2023 12:18 am, John Stultz wrote:
> Reimplementation of the sched_football test from LTP:
> https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
>
> But reworked to run in the kernel and utilize mutexes
> to illustrate proper boosting of low priority mutex
> holders.
>
> TODO:
> * Need a rt_mutex version so it can work w/o proxy-execution
> * Need a better place to put it
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched/Makefile | 1 +
> kernel/sched/test_sched_football.c | 242 +++++++++++++++++++++++++++++
> lib/Kconfig.debug | 14 ++
> 3 files changed, 257 insertions(+)
> create mode 100644 kernel/sched/test_sched_football.c
>
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 976092b7bd45..2729d565dfd7 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -32,3 +32,4 @@ obj-y += core.o
> obj-y += fair.o
> obj-y += build_policy.o
> obj-y += build_utility.o
> +obj-$(CONFIG_SCHED_RT_INVARIENT_TEST) += test_sched_football.o
> diff --git a/kernel/sched/test_sched_football.c b/kernel/sched/test_sched_football.c
> new file mode 100644
> index 000000000000..9742c45c0fe0
> --- /dev/null
> +++ b/kernel/sched/test_sched_football.c
> @@ -0,0 +1,242 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Module-based test case for RT scheduling invariant
> + *
> + * A reimplementation of my old sched_football test
> + * found in LTP:
> + * https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
> + *
> + * Similar to that test, this tries to validate the RT
> + * scheduling invariant, that across N available cpus, the
> + * top N priority tasks are always running.
> + *
> + * This is done by having N offensive players at medium
> + * priority, which are constantly trying to increment the
> + * ball_pos counter.
> + *
> + * Blocking them are N defensive players at higher priority,
> + * which just spin on the cpu, preventing the medium
> + * priority tasks from running.
> + *
> + * To complicate this, there are also N low priority defensive
> + * tasks. These start first and each acquire one of N mutexes.
> + * The high priority defense tasks will later try to grab the
> + * mutexes and block, opening a window for the offensive tasks
> + * to run and increment the ball. If priority inheritance or
> + * proxy execution is used, the low priority defense players
> + * should be boosted to the high priority levels, and will
> + * prevent the mid priority offensive tasks from running.
> + *
> + * Copyright © International Business Machines Corp., 2007, 2008
> + * Copyright (C) Google, 2023
> + *
> + * Authors: John Stultz <[email protected]>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/delay.h>
> +#include <linux/sched/rt.h>
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/rwsem.h>
> +#include <linux/smp.h>
> +#include <linux/slab.h>
> +#include <linux/interrupt.h>
> +#include <linux/sched.h>
> +#include <uapi/linux/sched/types.h>
> +#include <linux/rtmutex.h>
> +
> +atomic_t players_ready;
> +atomic_t ball_pos;
> +int players_per_team;
> +bool game_over;
> +
> +struct mutex *mutex_low_list;
> +struct mutex *mutex_mid_list;
> +
> +static inline
> +struct task_struct *create_fifo_thread(int (*threadfn)(void *data), void *data,
> + char *name, int prio)
> +{
> + struct task_struct *kth;
> + struct sched_attr attr = {
> + .size = sizeof(struct sched_attr),
> + .sched_policy = SCHED_FIFO,
> + .sched_nice = 0,
> + .sched_priority = prio,
> + };
> + int ret;
> +
> + kth = kthread_create(threadfn, data, name);
> + if (IS_ERR(kth)) {
> + pr_warn("%s eerr, kthread_create failed\n", __func__);
> + return kth;
> + }
> + ret = sched_setattr_nocheck(kth, &attr);
> + if (ret) {
> + kthread_stop(kth);
> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> + return ERR_PTR(ret);
> + }
> +
> + wake_up_process(kth);
> + return kth;
> +}
> +
> +int defense_low_thread(void *arg)
> +{
> + long tnum = (long)arg;
> +
> + atomic_inc(&players_ready);
> + mutex_lock(&mutex_low_list[tnum]);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + }
> + mutex_unlock(&mutex_low_list[tnum]);
> + return 0;
> +}
> +
> +int defense_mid_thread(void *arg)
> +{
> + long tnum = (long)arg;
> +
> + atomic_inc(&players_ready);
> + mutex_lock(&mutex_mid_list[tnum]);
> + mutex_lock(&mutex_low_list[tnum]);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + }
> + mutex_unlock(&mutex_low_list[tnum]);
> + mutex_unlock(&mutex_mid_list[tnum]);
> + return 0;
> +}
> +
> +int offense_thread(void *)
> +{
> + atomic_inc(&players_ready);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + atomic_inc(&ball_pos);
> + }
> + return 0;
> +}
> +
> +int defense_hi_thread(void *arg)
> +{
> + long tnum = (long)arg;
> +
> + atomic_inc(&players_ready);
> + mutex_lock(&mutex_mid_list[tnum]);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + }
> + mutex_unlock(&mutex_mid_list[tnum]);
> + return 0;
> +}
> +
> +int crazy_fan_thread(void *)
> +{
> + int count = 0;
> +
> + atomic_inc(&players_ready);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + udelay(1000);
> + msleep(2);
> + count++;
> + }
> + return 0;
> +}
> +
> +int ref_thread(void *arg)
> +{
> + struct task_struct *kth;
> + long game_time = (long)arg;
> + unsigned long final_pos;
> + long i;
> +
> + pr_info("%s: started ref, game_time: %ld secs !\n", __func__,
> + game_time);
> +
> + /* Create low priority defensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_low_thread, (void *)i,
> + "defese-low-thread", 2);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team)
> + msleep(1);
> +
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_mid_thread,
> + (void *)(players_per_team - i - 1),
> + "defese-mid-thread", 3);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 2)
> + msleep(1);
> +
> + /* Create mid priority offensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(offense_thread, NULL,
> + "offense-thread", 5);
> + /* Wait for the offense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 3)
> + msleep(1);
> +
> + /* Create high priority defensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_hi_thread, (void *)i,
> + "defese-hi-thread", 10);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 4)
> + msleep(1);
> +
> + /* Create very high priority crazy fan threads */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(crazy_fan_thread, NULL,
> + "crazy-fan-thread", 15);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 5)
> + msleep(1);
> +
> + pr_info("%s: all players checked in! Starting game.\n", __func__);
> + atomic_set(&ball_pos, 0);
> + msleep(game_time * 1000);
> + final_pos = atomic_read(&ball_pos);
> + pr_info("%s: final ball_pos: %ld\n", __func__, final_pos);
> + WARN_ON(final_pos != 0);
> + game_over = true;
> + return 0;
> +}
> +
> +static int __init test_sched_football_init(void)
> +{
> + struct task_struct *kth;
> + int i;
> +
> + players_per_team = num_online_cpus();
> +
> + mutex_low_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
> + mutex_mid_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
> +
> + for (i = 0; i < players_per_team; i++) {
> + mutex_init(&mutex_low_list[i]);
> + mutex_init(&mutex_mid_list[i]);
> + }
> +
> + kth = create_fifo_thread(ref_thread, (void *)10, "ref-thread", 20);
> +
> + return 0;
> +}
> +module_init(test_sched_football_init);

I hit a `modpost: missing MODULE_LICENSE() in
kernel/sched/test_sched_football.o` error when building this module.

JFYI: the module also lacks MODULE_NAME(), MODULE_DESCRIPTION(),
MODULE_AUTHOR(), module_exit(), etc.
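A minimal sketch of the metadata that could be added (the license string
matches the GPL-2.0+ SPDX tag; the description/author strings are just
placeholders):

	MODULE_LICENSE("GPL");
	MODULE_AUTHOR("John Stultz <[email protected]>");
	MODULE_DESCRIPTION("RT scheduling invariant test (sched_football)");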

> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 4405f81248fb..1d90059d190f 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1238,6 +1238,20 @@ config SCHED_DEBUG
> that can help debug the scheduler. The runtime overhead of this
> option is minimal.
>
> +config SCHED_RT_INVARIENT_TEST
> + tristate "RT invarient scheduling tester"
> + depends on DEBUG_KERNEL
> + help
> + This option provides a kernel module that runs tests to make
> + sure the RT invarient holds (top N priority tasks run on N
> + available cpus).
> +
> + Say Y here if you want kernel rt scheduling tests
> + to be built into the kernel.
> + Say M if you want this test to build as a module.
> + Say N if you are unsure.
> +
> +
> config SCHED_INFO
> bool
> default n


2023-12-28 16:36:46

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v7 17/23] sched: Initial sched_football test implementation

On 20/12/2023 12:18 am, John Stultz wrote:
> Reimplementation of the sched_football test from LTP:
> https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
>
> But reworked to run in the kernel and utilize mutexes
> to illustrate proper boosting of low priority mutex
> holders.
>
> TODO:
> * Need a rt_mutex version so it can work w/o proxy-execution
> * Need a better place to put it
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched/Makefile | 1 +
> kernel/sched/test_sched_football.c | 242 +++++++++++++++++++++++++++++
> lib/Kconfig.debug | 14 ++
> 3 files changed, 257 insertions(+)
> create mode 100644 kernel/sched/test_sched_football.c
>
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 976092b7bd45..2729d565dfd7 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -32,3 +32,4 @@ obj-y += core.o
> obj-y += fair.o
> obj-y += build_policy.o
> obj-y += build_utility.o
> +obj-$(CONFIG_SCHED_RT_INVARIENT_TEST) += test_sched_football.o
> diff --git a/kernel/sched/test_sched_football.c b/kernel/sched/test_sched_football.c
> new file mode 100644
> index 000000000000..9742c45c0fe0
> --- /dev/null
> +++ b/kernel/sched/test_sched_football.c
> @@ -0,0 +1,242 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Module-based test case for RT scheduling invariant
> + *
> + * A reimplementation of my old sched_football test
> + * found in LTP:
> + * https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
> + *
> + * Similar to that test, this tries to validate the RT
> + * scheduling invariant, that across N available cpus, the
> + * top N priority tasks are always running.
> + *
> + * This is done by having N offensive players at medium
> + * priority, which are constantly trying to increment the
> + * ball_pos counter.
> + *
> + * Blocking them are N defensive players at higher priority,
> + * which just spin on the cpu, preventing the medium
> + * priority tasks from running.
> + *
> + * To complicate this, there are also N low priority defensive
> + * tasks. These start first and each acquire one of N mutexes.
> + * The high priority defense tasks will later try to grab the
> + * mutexes and block, opening a window for the offensive tasks
> + * to run and increment the ball. If priority inheritance or
> + * proxy execution is used, the low priority defense players
> + * should be boosted to the high priority levels, and will
> + * prevent the mid priority offensive tasks from running.
> + *
> + * Copyright © International Business Machines Corp., 2007, 2008
> + * Copyright (C) Google, 2023
> + *
> + * Authors: John Stultz <[email protected]>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/delay.h>
> +#include <linux/sched/rt.h>
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/rwsem.h>
> +#include <linux/smp.h>
> +#include <linux/slab.h>
> +#include <linux/interrupt.h>
> +#include <linux/sched.h>
> +#include <uapi/linux/sched/types.h>
> +#include <linux/rtmutex.h>
> +
> +atomic_t players_ready;
> +atomic_t ball_pos;
> +int players_per_team;
> +bool game_over;
> +
> +struct mutex *mutex_low_list;
> +struct mutex *mutex_mid_list;
> +
> +static inline
> +struct task_struct *create_fifo_thread(int (*threadfn)(void *data), void *data,
> + char *name, int prio)
> +{
> + struct task_struct *kth;
> + struct sched_attr attr = {
> + .size = sizeof(struct sched_attr),
> + .sched_policy = SCHED_FIFO,
> + .sched_nice = 0,
> + .sched_priority = prio,
> + };
> + int ret;
> +
> + kth = kthread_create(threadfn, data, name);
> + if (IS_ERR(kth)) {
> + pr_warn("%s eerr, kthread_create failed\n", __func__);
> + return kth;
> + }
> + ret = sched_setattr_nocheck(kth, &attr);
> + if (ret) {
> + kthread_stop(kth);
> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> + return ERR_PTR(ret);
> + }
> +
> + wake_up_process(kth);
> + return kth;
> +}
> +
> +int defense_low_thread(void *arg)
> +{
> + long tnum = (long)arg;
> +
> + atomic_inc(&players_ready);
> + mutex_lock(&mutex_low_list[tnum]);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + }
> + mutex_unlock(&mutex_low_list[tnum]);
> + return 0;
> +}
> +
> +int defense_mid_thread(void *arg)
> +{
> + long tnum = (long)arg;
> +
> + atomic_inc(&players_ready);
> + mutex_lock(&mutex_mid_list[tnum]);
> + mutex_lock(&mutex_low_list[tnum]);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + }
> + mutex_unlock(&mutex_low_list[tnum]);
> + mutex_unlock(&mutex_mid_list[tnum]);
> + return 0;
> +}
> +
> +int offense_thread(void *)
> +{
> + atomic_inc(&players_ready);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + atomic_inc(&ball_pos);
> + }
> + return 0;
> +}
> +
> +int defense_hi_thread(void *arg)
> +{
> + long tnum = (long)arg;
> +
> + atomic_inc(&players_ready);
> + mutex_lock(&mutex_mid_list[tnum]);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + }
> + mutex_unlock(&mutex_mid_list[tnum]);
> + return 0;
> +}
> +
> +int crazy_fan_thread(void *)
> +{
> + int count = 0;
> +
> + atomic_inc(&players_ready);
> + while (!READ_ONCE(game_over)) {
> + if (kthread_should_stop())
> + break;
> + schedule();
> + udelay(1000);
> + msleep(2);
> + count++;

@count is only increased. Is it really necessary?

> + }
> + return 0;
> +}
> +
> +int ref_thread(void *arg)
> +{
> + struct task_struct *kth;
> + long game_time = (long)arg;
> + unsigned long final_pos;
> + long i;
> +
> + pr_info("%s: started ref, game_time: %ld secs !\n", __func__,
> + game_time);
> +
> + /* Create low priority defensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_low_thread, (void *)i,
> + "defese-low-thread", 2);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team)
> + msleep(1);
> +
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_mid_thread,
> + (void *)(players_per_team - i - 1),
> + "defese-mid-thread", 3);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 2)
> + msleep(1);
> +
> + /* Create mid priority offensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(offense_thread, NULL,
> + "offense-thread", 5);
> + /* Wait for the offense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 3)
> + msleep(1);
> +
> + /* Create high priority defensive team */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(defense_hi_thread, (void *)i,
> + "defese-hi-thread", 10);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 4)
> + msleep(1);
> +
> + /* Create very high priority crazy fan threads */
> + for (i = 0; i < players_per_team; i++)
> + kth = create_fifo_thread(crazy_fan_thread, NULL,
> + "crazy-fan-thread", 15);
> + /* Wait for the defense threads to start */
> + while (atomic_read(&players_ready) < players_per_team * 5)
> + msleep(1);
> +
> + pr_info("%s: all players checked in! Starting game.\n", __func__);
> + atomic_set(&ball_pos, 0);
> + msleep(game_time * 1000);
> + final_pos = atomic_read(&ball_pos);
> + pr_info("%s: final ball_pos: %ld\n", __func__, final_pos);
> + WARN_ON(final_pos != 0);
> + game_over = true;
> + return 0;
> +}
> +
> +static int __init test_sched_football_init(void)
> +{
> + struct task_struct *kth;
> + int i;
> +
> + players_per_team = num_online_cpus();
> +
> + mutex_low_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
> + mutex_mid_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
> +
> + for (i = 0; i < players_per_team; i++) {
> + mutex_init(&mutex_low_list[i]);
> + mutex_init(&mutex_mid_list[i]);
> + }
> +
> + kth = create_fifo_thread(ref_thread, (void *)10, "ref-thread", 20);
> +
> + return 0;
> +}

* Please prepend a prefix to the prints to ease capturing the module logs.
* I think `rmmod test_sched_football` throws a `Device or resource busy`
error and fails to remove the module because of the missing module_exit().
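A rough sketch of both (untested; KBUILD_MODNAME provides the
"test_sched_football" prefix, and the exit hook here only signals the end of
the game -- a complete version would also wait for the kthreads and free the
mutex arrays):

	/* Put this above the #includes so every pr_*() in the file gets tagged: */
	#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

	static void __exit test_sched_football_exit(void)
	{
		/* Minimal teardown: let the player threads wind down. */
		WRITE_ONCE(game_over, true);
	}
	module_exit(test_sched_football_exit);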

> +module_init(test_sched_football_init);
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 4405f81248fb..1d90059d190f 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1238,6 +1238,20 @@ config SCHED_DEBUG
> that can help debug the scheduler. The runtime overhead of this
> option is minimal.
>
> +config SCHED_RT_INVARIENT_TEST
> + tristate "RT invarient scheduling tester"
> + depends on DEBUG_KERNEL
> + help
> + This option provides a kernel module that runs tests to make
> + sure the RT invarient holds (top N priority tasks run on N
> + available cpus).
> +
> + Say Y here if you want kernel rt scheduling tests
> + to be built into the kernel.
> + Say M if you want this test to build as a module.
> + Say N if you are unsure.
> +
> +
> config SCHED_INFO
> bool
> default n


2024-01-02 15:34:17

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH v7 14/23] sched: Handle blocked-waiter migration (and return migration)

On Thu, Dec 21, 2023 at 04:12:57PM +0000 Metin Kaya wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > Add logic to handle migrating a blocked waiter to a remote
> > cpu where the lock owner is runnable.
> >
> > Additionally, as the blocked task may not be able to run
> > on the remote cpu, add logic to handle return migration once
> > the waiting task is given the mutex.
> >
> > Because tasks may get migrated to where they cannot run,
> > this patch also modifies the scheduling classes to avoid
> > sched class migrations on mutex blocked tasks, leaving
> > proxy() to do the migrations and return migrations.
>
> s/proxy/find_proxy_task/
>

While fixing that paragraph, probably:

s/this patch also modifies/also modify/


Cheers,
Phil

> >
> > This was split out from the larger proxy patch, and
> > significantly reworked.
> >
> > Credits for the original patch go to:
> > Peter Zijlstra (Intel) <[email protected]>
> > Juri Lelli <[email protected]>
> > Valentin Schneider <[email protected]>
> > Connor O'Brien <[email protected]>
> >
> > Cc: Joel Fernandes <[email protected]>
> > Cc: Qais Yousef <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Ben Segall <[email protected]>
> > Cc: Zimuzo Ezeozue <[email protected]>
> > Cc: Youssef Esmat <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > Cc: Waiman Long <[email protected]>
> > Cc: Boqun Feng <[email protected]>
> > Cc: "Paul E. McKenney" <[email protected]>
> > Cc: Metin Kaya <[email protected]>
> > Cc: Xuewen Yan <[email protected]>
> > Cc: K Prateek Nayak <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: John Stultz <[email protected]>
> > ---
> > v6:
> > * Integrated sched_proxy_exec() check in proxy_return_migration()
> > * Minor cleanups to diff
> > * Unpin the rq before calling __balance_callbacks()
> > * Tweak proxy migrate to migrate deeper task in chain, to avoid
> > tasks pingponging between rqs
> > v7:
> > * Fixup for unused function arguments
> > * Switch from that_rq -> target_rq, other minor tweaks, and typo
> > fixes suggested by Metin Kaya
> > * Switch back to doing return migration in the ttwu path, which
> > avoids nasty lock juggling and performance issues
> > * Fixes for UP builds
> > ---
> > kernel/sched/core.c | 161 ++++++++++++++++++++++++++++++++++++++--
> > kernel/sched/deadline.c | 2 +-
> > kernel/sched/fair.c | 4 +-
> > kernel/sched/rt.c | 9 ++-
> > 4 files changed, 164 insertions(+), 12 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 42e25bbdfe6b..55dc2a3b7e46 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2981,8 +2981,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
> > struct set_affinity_pending my_pending = { }, *pending = NULL;
> > bool stop_pending, complete = false;
> > - /* Can the task run on the task's current CPU? If so, we're done */
> > - if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
> > + /*
> > + * Can the task run on the task's current CPU? If so, we're done
> > + *
> > + * We are also done if the task is selected, boosting a lock-
> > + * holding proxy, (and potentially has been migrated outside its
> > + * current or previous affinity mask)
> > + */
> > + if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
> > + (task_current_selected(rq, p) && !task_current(rq, p))) {
> > struct task_struct *push_task = NULL;
> > if ((flags & SCA_MIGRATE_ENABLE) &&
> > @@ -3778,6 +3785,39 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
> > trace_sched_wakeup(p);
> > }
> > +#ifdef CONFIG_SMP
> > +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> > +{
> > + if (!sched_proxy_exec())
> > + return false;
> > +
> > + if (task_current(rq, p))
> > + return false;
> > +
> > + if (p->blocked_on && p->blocked_on_state == BO_WAKING) {
> > + raw_spin_lock(&p->blocked_lock);
> > + if (!is_cpu_allowed(p, cpu_of(rq))) {
> > + if (task_current_selected(rq, p)) {
> > + put_prev_task(rq, p);
> > + rq_set_selected(rq, rq->idle);
> > + }
> > + deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> > + resched_curr(rq);
> > + raw_spin_unlock(&p->blocked_lock);
> > + return true;
> > + }
> > + resched_curr(rq);
> > + raw_spin_unlock(&p->blocked_lock);
>
> Do we need to hold blocked_lock while checking allowed CPUs? Would it be
> silly to move raw_spin_lock(&p->blocked_lock) inside the
> if (!is_cpu_allowed()) block? i.e.:
>
> if (!is_cpu_allowed(p, cpu_of(rq))) {
> raw_spin_lock(&p->blocked_lock);
> if (task_current_selected(rq, p)) {
> put_prev_task(rq, p);
> rq_set_selected(rq, rq->idle);
> }
> deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> resched_curr(rq);
> raw_spin_unlock(&p->blocked_lock);
> return true;
> }
> resched_curr(rq);
>
> > + }
> > + return false;
> > +}
> > +#else /* !CONFIG_SMP */
> > +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> > +{
> > + return false;
> > +}
> > +#endif /*CONFIG_SMP */
>
> Nit: what about this?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 30dfb6f14f2b..60b542a6faa5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4027,9 +4027,11 @@ static inline void activate_blocked_entities(struct
> rq *target_rq,
> }
> #endif /* CONFIG_SCHED_PROXY_EXEC */
>
> -#ifdef CONFIG_SMP
> static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> {
> + if (!IS_ENABLED(CONFIG_SMP))
> + return false;
> +
> if (!sched_proxy_exec())
> return false;
>
> @@ -4053,12 +4055,6 @@ static inline bool proxy_needs_return(struct rq *rq,
> struct task_struct *p)
> }
> return false;
> }
> -#else /* !CONFIG_SMP */
> -static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> -{
> - return false;
> -}
> -#endif /*CONFIG_SMP */
>
> static void
> ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>
> > +
> > static void
> > ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > struct rq_flags *rf)
> > @@ -3870,9 +3910,12 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > update_rq_clock(rq);
> > wakeup_preempt(rq, p, wake_flags);
> > }
> > + if (proxy_needs_return(rq, p))
> > + goto out;
> > ttwu_do_wakeup(p);
> > ret = 1;
> > }
> > +out:
> > __task_rq_unlock(rq, &rf);
> > return ret;
> > @@ -4231,6 +4274,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > int cpu, success = 0;
> > if (p == current) {
> > + WARN_ON(task_is_blocked(p));
> > /*
> > * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> > * == smp_processor_id()'. Together this means we can special
> > @@ -6632,6 +6676,91 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
> > return true;
> > }
> > +#ifdef CONFIG_SMP
> > +/*
> > + * If the blocked-on relationship crosses CPUs, migrate @p to the
> > + * owner's CPU.
> > + *
> > + * This is because we must respect the CPU affinity of execution
> > + * contexts (owner) but we can ignore affinity for scheduling
> > + * contexts (@p). So we have to move scheduling contexts towards
> > + * potential execution contexts.
> > + *
> > + * Note: The owner can disappear, but simply migrate to @target_cpu
> > + * and leave that CPU to sort things out.
> > + */
> > +static struct task_struct *
>
> proxy_migrate_task() always returns NULL. Is the return type really needed?
>
> > +proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p, int target_cpu)
> > +{
> > + struct rq *target_rq;
> > + int wake_cpu;
> > +
>
> Having a "if (!IS_ENABLED(CONFIG_SMP))" check here would help in dropping
> #else part. i.e.,
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 30dfb6f14f2b..685ba6f2d4cd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6912,7 +6912,6 @@ proxy_resched_idle(struct rq *rq, struct task_struct
> *next)
> return rq->idle;
> }
>
> -#ifdef CONFIG_SMP
> /*
> * If the blocked-on relationship crosses CPUs, migrate @p to the
> * owner's CPU.
> @@ -6932,6 +6931,9 @@ proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> struct rq *target_rq;
> int wake_cpu;
>
> + if (!IS_ENABLED(CONFIG_SMP))
> + return NULL;
> +
> lockdep_assert_rq_held(rq);
> target_rq = cpu_rq(target_cpu);
>
> @@ -6988,14 +6990,6 @@ proxy_migrate_task(struct rq *rq, struct rq_flags
> *rf,
>
> return NULL; /* Retry task selection on _this_ CPU. */
> }
> -#else /* !CONFIG_SMP */
> -static struct task_struct *
> -proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> - struct task_struct *p, int target_cpu)
> -{
> - return NULL;
> -}
> -#endif /* CONFIG_SMP */
>
> static void proxy_enqueue_on_owner(struct rq *rq, struct task_struct
> *owner,
> struct task_struct *next)
>
> > + lockdep_assert_rq_held(rq);
> > + target_rq = cpu_rq(target_cpu);
> > +
> > + /*
> > + * Since we're going to drop @rq, we have to put(@rq_selected) first,
> > + * otherwise we have a reference that no longer belongs to us. Use
> > + * @rq->idle to fill the void and make the next pick_next_task()
> > + * invocation happy.
> > + *
> > + * CPU0 CPU1
> > + *
> > + * B mutex_lock(X)
> > + *
> > + * A mutex_lock(X) <- B
> > + * A __schedule()
> > + * A pick->A
> > + * A proxy->B
> > + * A migrate A to CPU1
> > + * B mutex_unlock(X) -> A
> > + * B __schedule()
> > + * B pick->A
> > + * B switch_to (A)
> > + * A ... does stuff
> > + * A ... is still running here
> > + *
> > + * * BOOM *
> > + */
> > + put_prev_task(rq, rq_selected(rq));
> > + rq_set_selected(rq, rq->idle);
> > + set_next_task(rq, rq_selected(rq));
> > + WARN_ON(p == rq->curr);
> > +
> > + wake_cpu = p->wake_cpu;
> > + deactivate_task(rq, p, 0);
> > + set_task_cpu(p, target_cpu);
> > + /*
> > + * Preserve p->wake_cpu, such that we can tell where it
> > + * used to run later.
> > + */
> > + p->wake_cpu = wake_cpu;
> > +
> > + rq_unpin_lock(rq, rf);
> > + __balance_callbacks(rq);
> > +
> > + raw_spin_rq_unlock(rq);
> > + raw_spin_rq_lock(target_rq);
> > +
> > + activate_task(target_rq, p, 0);
> > + wakeup_preempt(target_rq, p, 0);
> > +
> > + raw_spin_rq_unlock(target_rq);
> > + raw_spin_rq_lock(rq);
> > + rq_repin_lock(rq, rf);
> > +
> > + return NULL; /* Retry task selection on _this_ CPU. */
> > +}
> > +#else /* !CONFIG_SMP */
> > +static struct task_struct *
> > +proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p, int target_cpu)
> > +{
> > + return NULL;
> > +}
> > +#endif /* CONFIG_SMP */ > +
> > /*
> > * Find who @next (currently blocked on a mutex) can proxy for.
> > *
> > @@ -6654,8 +6783,11 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> > struct task_struct *owner = NULL;
> > struct task_struct *ret = NULL;
> > struct task_struct *p;
> > + int cur_cpu, target_cpu;
> > struct mutex *mutex;
> > - int this_cpu = cpu_of(rq);
> > + bool curr_in_chain = false;
> > +
> > + cur_cpu = cpu_of(rq);
> > /*
> > * Follow blocked_on chain.
> > @@ -6686,17 +6818,27 @@ find_proxy_task(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> > goto out;
> > }
> > + if (task_current(rq, p))
> > + curr_in_chain = true;
> > +
> > owner = __mutex_owner(mutex);
> > if (!owner) {
> > ret = p;
> > goto out;
> > }
> > - if (task_cpu(owner) != this_cpu) {
> > - /* XXX Don't handle migrations yet */
> > - if (!proxy_deactivate(rq, next))
> > - ret = next;
> > - goto out;
> > + if (task_cpu(owner) != cur_cpu) {
> > + target_cpu = task_cpu(owner);
> > + /*
> > + * @owner can disappear, simply migrate to @target_cpu and leave that CPU
> > + * to sort things out.
> > + */
> > + raw_spin_unlock(&p->blocked_lock);
> > + raw_spin_unlock(&mutex->wait_lock);
> > + if (curr_in_chain)
> > + return proxy_resched_idle(rq, next);
> > +
> > + return proxy_migrate_task(rq, rf, p, target_cpu);
> > }
> > if (task_on_rq_migrating(owner)) {
> > @@ -6999,6 +7141,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
> > */
> > SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
> > + if (task_is_blocked(tsk))
> > + return;
> > +
> > /*
> > * If we are going to sleep and we have plugged IO queued,
> > * make sure to submit it to avoid deadlocks.
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 9cf20f4ac5f9..4f998549ea74 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -1705,7 +1705,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> > enqueue_dl_entity(&p->dl, flags);
> > - if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> > + if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p))
> > enqueue_pushable_dl_task(rq, p);
> > }
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 954b41e5b7df..8e3f118f6d6e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8372,7 +8372,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > goto idle;
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > - if (!prev || prev->sched_class != &fair_sched_class)
> > + if (!prev ||
> > + prev->sched_class != &fair_sched_class ||
> > + rq->curr != rq_selected(rq))
> > goto simple;
> > /*
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 81cd22eaa6dc..a7b51a021111 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1503,6 +1503,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> > if (p->nr_cpus_allowed == 1)
> > return;
> > + if (task_is_blocked(p))
> > + return;
> > +
> > enqueue_pushable_task(rq, p);
> > }
> > @@ -1790,10 +1793,12 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> > update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
> > - /* Avoid marking selected as pushable */
> > - if (task_current_selected(rq, p))
> > + /* Avoid marking current or selected as pushable */
> > + if (task_current(rq, p) || task_current_selected(rq, p))
> > return;
> > + if (task_is_blocked(p))
> > + return;
> > /*
> > * The previous task needs to be made eligible for pushing
> > * if it is still active
>
>

--


2024-01-03 13:48:12

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v7 09/23] sched: Fix runtime accounting w/ split exec & sched contexts

(I did a reply instead of a reply-all, sorry John you're getting this one twice!)

On 19/12/23 16:18, John Stultz wrote:
> The idea here is we want to charge the scheduler-context task's
> vruntime but charge the execution-context task's sum_exec_runtime.
>
> This way cputime accounting goes against the task actually running
> but vruntime accounting goes against the selected task so we get
> proper fairness.

This looks like the right approach, especially when it comes to exposing
data to userspace as with e.g. top.

I did however get curious as to what would be the impact of not updating
the donor's sum_exec_runtime. A quick look through fair.c shows these
function using it:
- numa_get_avg_runtime()
- task_numa_work()
- task_tick_numa()
- set_next_entity()
- hrtick_start_fair()

The NUMA ones shouldn't matter too much, as they care about the actually
running task, which is the one that gets its sum_exec_runtime increased.
task_tick_numa() needs to be changed though, as it should be passed the
currently running task, not the selected (donor) one, but shouldn't need
any other change (famous last words).

Generally I think all of the NUMA balancing stats stuff shouldn't care
about the donor task, as the pages being accessed are part of the execution
context.

The hrtick one is tricky. AFAICT since we don't update the donor's
sum_exec_runtime, in proxy scenarios we'll end up always programming the
hrtimer to the entire extent of the donor's slice, which might not be
correct. Considering the HRTICK SCHED_FEAT defaults to disabled, that could
be left as a TODO.


2024-01-03 14:50:09

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v7 08/23] sched: Split scheduler and execution contexts

On 19/12/23 16:18, John Stultz wrote:
> NOTE: Peter previously mentioned he didn't like the name
> "rq_selected()", but I've not come up with a better alternative.
> I'm very open to other name proposals.
>

I got used to the naming relatively quickly. It "should" be rq_pick()
(i.e. what did the last pick_next_task() return for that rq), but that
naming is unfortunately ambiguous (is it doing a pick itself?), so I think
"selected" works.


2024-01-04 23:25:51

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 16/23] sched: Add deactivated (sleeping) owner handling to find_proxy_task()

On Fri, Dec 22, 2023 at 12:33 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > @@ -3936,13 +4063,19 @@ void sched_ttwu_pending(void *arg)
> > update_rq_clock(rq);
> >
> > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > + int wake_flags;
> > if (WARN_ON_ONCE(p->on_cpu))
> > smp_cond_load_acquire(&p->on_cpu, !VAL);
> >
> > if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
> > set_task_cpu(p, cpu_of(rq));
> >
> > - ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> > + wake_flags = p->sched_remote_wakeup ? WF_MIGRATED : 0;
> > + ttwu_do_activate(rq, p, wake_flags, &rf);
> > + rq_unlock(rq, &rf);
> > + activate_blocked_entities(rq, p, wake_flags);
>
> I'm unsure if it's a big deal, but IRQs are disabled here and
> activate_blocked_entities() disables them again.

Yeah. activate_blocked_entities() is also called from try_to_wake_up()
where we don't have irqs disabled, so we still need to make sure irqs
are off there.
But activate_blocked_entities does irqsave/restore so it should safely
put us back to the proper state. We could rq_unlock_irqrestore() in
the above, but it seems silly to enable irqs here just to disable them
shortly after.

But let me know if it seems I didn't quite get your concern here.


> > @@ -6663,19 +6797,6 @@ proxy_resched_idle(struct rq *rq, struct task_struct *next)
> > return rq->idle;
> > }
> >
> > -static bool proxy_deactivate(struct rq *rq, struct task_struct *next)
> > -{
> > - unsigned long state = READ_ONCE(next->__state);
> > -
> > - /* Don't deactivate if the state has been changed to TASK_RUNNING */
> > - if (state == TASK_RUNNING)
> > - return false;
> > - if (!try_to_deactivate_task(rq, next, state, true))
>
> Now we can drop the last argument (deactivate_cond) of
> try_to_deactivate_task() since it can be determined via
> !task_is_blocked(p) I think. IOW:

True. I'll add that into the "Drop proxy_deactivate" patch in my
fine-grained series.

Thanks again for the feedback!
-john

2024-01-04 23:33:28

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 14/23] sched: Handle blocked-waiter migration (and return migration)

On Tue, Jan 2, 2024 at 7:34 AM Phil Auld <[email protected]> wrote:
>
> On Thu, Dec 21, 2023 at 04:12:57PM +0000 Metin Kaya wrote:
> > On 20/12/2023 12:18 am, John Stultz wrote:
> > > Add logic to handle migrating a blocked waiter to a remote
> > > cpu where the lock owner is runnable.
> > >
> > > Additionally, as the blocked task may not be able to run
> > > on the remote cpu, add logic to handle return migration once
> > > the waiting task is given the mutex.
> > >
> > > Because tasks may get migrated to where they cannot run,
> > > this patch also modifies the scheduling classes to avoid
> > > sched class migrations on mutex blocked tasks, leaving
> > > proxy() to do the migrations and return migrations.
> >
> > s/proxy/find_proxy_task/
> >
>
> While fixing that paragraph, probably:
>
> s/this patch also modifies/also modify/

Fixed as well. Thanks!
-john

2024-01-04 23:44:31

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 19/23] sched: Consolidate pick_*_task to task_is_pushable helper

On Fri, Dec 22, 2023 at 2:23 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > From: Connor O'Brien <[email protected]>
> >
> > This patch consolidates rt and deadline pick_*_task functions to
> > a task_is_pushable() helper
> >
> > This patch was broken out from a larger chain migration
> > patch originally by Connor O'Brien.
> >
> > Cc: Joel Fernandes <[email protected]>
> > Cc: Qais Yousef <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Ben Segall <[email protected]>
> > Cc: Zimuzo Ezeozue <[email protected]>
> > Cc: Youssef Esmat <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > Cc: Waiman Long <[email protected]>
> > Cc: Boqun Feng <[email protected]>
> > Cc: "Paul E. McKenney" <[email protected]>
> > Cc: Metin Kaya <[email protected]>
> > Cc: Xuewen Yan <[email protected]>
> > Cc: K Prateek Nayak <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Connor O'Brien <[email protected]>
> > [jstultz: split out from larger chain migration patch,
> > renamed helper function]
> > Signed-off-by: John Stultz <[email protected]>
> > ---
> > v7:
> > * Split from chain migration patch
> > * Renamed function
> > ---
> > kernel/sched/deadline.c | 10 +---------
> > kernel/sched/rt.c | 11 +----------
> > kernel/sched/sched.h | 10 ++++++++++
> > 3 files changed, 12 insertions(+), 19 deletions(-)
> >
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index def1eb23318b..1f3bc50de678 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -2049,14 +2049,6 @@ static void task_fork_dl(struct task_struct *p)
> > /* Only try algorithms three times */
> > #define DL_MAX_TRIES 3
> >
> > -static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
> > -{
> > - if (!task_on_cpu(rq, p) &&
> > - cpumask_test_cpu(cpu, &p->cpus_mask))
> > - return 1;
> > - return 0;
> > -}
> > -
> > /*
> > * Return the earliest pushable rq's task, which is suitable to be executed
> > * on the CPU, NULL otherwise:
> > @@ -2075,7 +2067,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
> > if (next_node) {
> > p = __node_2_pdl(next_node);
> >
> > - if (pick_dl_task(rq, p, cpu))
> > + if (task_is_pushable(rq, p, cpu) == 1)
>
> Nit: ` == 1` part is redundant, IMHO.

Indeed, at this step it seems silly, but later task_is_pushable() can
return one of three states:
https://github.com/johnstultz-work/linux-dev/commit/1ebaf1b186f0cae8a4a26708776b347fa47decef#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR3973

I'm not a huge fan of this sort of magic tri-state return value, as
it's not very intuitive, so I need to spend some time to see if I can
find a better approach.
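For reference, the eventual semantics (per the comment quoted in the later
chain-migration patch review) are roughly:

	/*
	 * task_is_pushable() tri-state (sketch):
	 *   1: pushable to the requested cpu
	 *  -1: pushable somewhere, just not to that cpu
	 *   0: not pushable at all
	 */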

Thanks for pointing this out.
-john

2024-01-05 00:02:20

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 20/23] sched: Push execution and scheduler context split into deadline and rt paths

On Fri, Dec 22, 2023 at 3:33 AM Metin Kaya <[email protected]> wrote:
>
> On 20/12/2023 12:18 am, John Stultz wrote:
> > From: Connor O'Brien <[email protected]>
> >
> > In preparation for chain migration, push the awareness
> > of the split between execution and scheduler context
> > down into some of the rt/deadline code paths that deal
> > with load balancing.
> >
> > This patch was broken out from a larger chain migration
> > patch originally by Connor O'Brien.
> >
>
> Nit: Commit header is too long. ` paths` can be dropped.

Done.

> > @@ -2079,25 +2079,25 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
> >
> > static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
> >
> > -static int find_later_rq(struct task_struct *task)
> > +static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx)
>
> Nit: line becomes too long. Same for find_later_rq()'s signature above
> as well as find_lowest_rq() in rt.c.

While I do try to keep things under 80 where I can, it's no longer the standard:
https://lore.kernel.org/lkml/[email protected]/

> >
> > - if (task->nr_cpus_allowed == 1)
> > + if (exec_ctx && exec_ctx->nr_cpus_allowed == 1)
>
> Can exec_ctx be NULL? If so, we may hit a seg. fault at
> task_rq(exec_ctx) below.

Oh, this is a bad split on my part. Only after find_exec_ctx() is
introduced can the exec_ctx be null.
I'll move that change to later in the series.

thanks
-john

2024-01-05 03:13:00

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 21/23] sched: Add find_exec_ctx helper

On Fri, Dec 22, 2023 at 3:57 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0c212dcd4b7a..77a79d5f829a 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3896,6 +3896,48 @@ static void activate_blocked_entities(struct rq *target_rq,
> > }
> > raw_spin_unlock_irqrestore(&owner->blocked_lock, flags);
> > }
> > +
> > +static inline bool task_queued_on_rq(struct rq *rq, struct task_struct *task)
> > +{
> > + if (!task_on_rq_queued(task))
> > + return false;
> > + smp_rmb();
> > + if (task_rq(task) != rq)
> > + return false;
> > + smp_rmb();
> > + if (!task_on_rq_queued(task))
> > + return false;
>
> * Super-nit: we may want to have empty lines between `if` blocks and
> before/after `smp_rmb()` calls.

Done.

> * I did not understand why we call `task_on_rq_queued(task)` twice.
> Should we have an explanatory comment before the function definition?

Yeah. Adding a better comment there is on my todo.

> > diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> > index 15e947a3ded7..53be78afdd07 100644
> > --- a/kernel/sched/cpupri.c
> > +++ b/kernel/sched/cpupri.c
> > @@ -96,12 +96,17 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
> > if (skip)
> > return 0;
> >
> > - if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
> > + if ((p && cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids) ||
> > + (!p && cpumask_any(vec->mask) >= nr_cpu_ids))
> > return 0;
> >
> > if (lowest_mask) {
> > - cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
> > - cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
> > + if (p) {
> > + cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
> > + cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
> > + } else {
> > + cpumask_copy(lowest_mask, vec->mask);
> > + }
>
> I think changes in `cpupri.c` should be part of previous (`sched: Push
> execution and scheduler context split into deadline and rt paths`)
> patch. Because they don't seem to be related with find_exec_ctx()?

It's only here because find_exec_ctx() can return NULL, which means we
need the NULL p checks.

I'll think a bit if we can avoid it here.

> > @@ -2169,12 +2175,17 @@ static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec
> > /* Locks the rq it finds */
> > static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> > {
> > + struct task_struct *exec_ctx;
> > struct rq *later_rq = NULL;
> > int tries;
> > int cpu;
> >
> > for (tries = 0; tries < DL_MAX_TRIES; tries++) {
> > - cpu = find_later_rq(task, task);
> > + exec_ctx = find_exec_ctx(rq, task);
> > + if (!exec_ctx)
> > + break;
> > +
> > + cpu = find_later_rq(task, exec_ctx);
> >
>
> Super-nit: this empty line should be removed to keep logically connected
> lines closer.

Done.


> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p);
> > +#else /* !CONFIG_SCHED_PROXY_EXEC */
> > +static inline
> > +struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
> > +{
> > + return p;
> > +}
> > +#endif /* CONFIG_SCHED_PROXY_EXEC */
> > #endif
>
> Nit: `#ifdef CONFIG_SMP` block becomes bigger after this hunk. We should
> append `/* CONFIG_SMP */` to this line, IMHO.
>

Done.

Thanks for the feedback!
-john

2024-01-05 03:42:58

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 23/23] sched: Fix rt/dl load balancing via chain level balance

On Fri, Dec 22, 2023 at 6:51 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > + /*
> > + * Chain leads off the rq, we're free to push it anywhere.
> > + *
> > + * One wrinkle with relying on find_exec_ctx is that when the chain
> > + * leads to a task currently migrating to rq, we see the chain as
> > + * pushable & push everything prior to the migrating task. Even if
> > + * we checked explicitly for this case, we could still race with a
> > + * migration after the check.
> > + * This shouldn't permanently produce a bad state though, as proxy()
>
> find_proxy_task()

Fixed.

> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> > - struct rq *later)
> > + struct rq *later, bool *retry)
> > +{
> > + if (!dl_task(task) || is_migration_disabled(task))
> > + return false;
> > +
> > + if (rq != this_rq()) {
> > + struct task_struct *next_task = pick_next_pushable_dl_task(rq);
> > +
> > + if (next_task == task) {
>
> Nit: We can `return false;` if next_task != task and save one level of
> indentation.

Ah, good point. Fixed.

> > + struct task_struct *exec_ctx;
> > +
> > + exec_ctx = find_exec_ctx(rq, next_task);
> > + *retry = (exec_ctx && !cpumask_test_cpu(later->cpu,
> > + &exec_ctx->cpus_mask));
> > + } else {
> > + return false;
> > + }
> > + } else {
> > + int pushable = task_is_pushable(rq, task, later->cpu);
> > +
> > + *retry = pushable == -1;
> > + if (!pushable)
> > + return false;
>
> `return pushable;` can replace above 2 lines.
> The same for rt_revalidate_rq_state().

Hrm. It does save lines, but I worry it makes the code a touch
harder to read. I might hold off on this for the moment unless
someone else pushes for it.

> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index fabb19891e95..d5ce95dc5c09 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
...
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> > + struct rq *lowest, bool *retry)
>
> This function can be consolidated with dl_revalidate_rq_state() as you
> noted in the previous patch, although rt_revalidate_rq_state() has few
> comments.

Yeah. I need to stare at it a bit to try to figure out what might be
the best name to use for the common chunk.
I'd also like to figure out a better way to do the retry stuff, as it feels messy.

> > +{
> > + /*
> > + * Releasing the rq lock means we need to re-check pushability.
> > + * Some scenarios:
> > + * 1) If a migration from another CPU sent a task/chain to rq
> > + * that made task newly unpushable by completing a chain
> > + * from task to rq->curr, then we need to bail out and push something
> > + * else.
> > + * 2) If our chain led off this CPU or to a dequeued task, the last waiter
> > + * on this CPU might have acquired the lock and woken (or even migrated
> > + * & run, handed off the lock it held, etc...). This can invalidate the
> > + * result of find_lowest_rq() if our chain previously ended in a blocked
> > + * task whose affinity we could ignore, but now ends in an unblocked
> > + * task that can't run on lowest_rq.
> > + * 3) Race described at https://lore.kernel.org/all/[email protected]/
> > + *
> > + * Notes on these:
> > + * - Scenario #2 is properly handled by rerunning find_lowest_rq
> > + * - Scenario #1 requires that we fail
> > + * - Scenario #3 can AFAICT only occur when rq is not this_rq(). And the
> > + * suggested fix is not universally correct now that push_cpu_stop() can
> > + * call this function.
> > + */
> > + if (!rt_task(task) || is_migration_disabled(task)) {
> > + return false;
> > + } else if (rq != this_rq()) {
>
> Nit: `else` can be dropped as in dl_revalidate_rq_state().

Ack. Done.

Thanks again for all the feedback!
-john

2024-01-05 05:20:27

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 17/23] sched: Initial sched_football test implementation

On Fri, Dec 22, 2023 at 1:32 AM Metin Kaya <[email protected]> wrote:
>
> On 20/12/2023 12:18 am, John Stultz wrote:
> > Reimplementation of the sched_football test from LTP:
> > https://github.com/linux-test-project/ltp/blob/master/testcases/realtime/func/sched_football/sched_football.c
> >
> > But reworked to run in the kernel and utilize mutexes
> > to illustrate proper boosting of low priority mutex
> > holders.
> >
> > TODO:
> > * Need a rt_mutex version so it can work w/o proxy-execution
> > * Need a better place to put it
>
> I think also this patch can be upstreamed regardless of other Proxy
> Execution patches, right?

Well, we would need to use rt_mutexes for the !PROXY case to validate
inheritance.
But something like it could be included before PROXY lands.
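
For example, one option would be a small wrapper so the same test logic
can fall back to rt_mutexes when proxy execution isn't available (just a
sketch with made-up names, not what the patch currently does):

#ifdef CONFIG_SCHED_PROXY_EXEC
/* Proxy execution boosts regular mutex owners, so test those directly */
#define test_lock_t		struct mutex
#define test_lock_init(l)	mutex_init(l)
#define test_lock(l)		mutex_lock(l)
#define test_unlock(l)		mutex_unlock(l)
#else
/* Without proxy exec, use rt_mutexes so we still get PI boosting */
#define test_lock_t		struct rt_mutex
#define test_lock_init(l)	rt_mutex_init(l)
#define test_lock(l)		rt_mutex_lock(l)
#define test_unlock(l)		rt_mutex_unlock(l)
#endif

The mutex_low_list/mutex_mid_list arrays would then just become
test_lock_t arrays.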

> > + *
> > + * This is done via having N offsensive players that are
>
> offensive

Fixed.

> > + * medium priority, which constantly are trying to increment the
> > + * ball_pos counter.
> > + *
> > + * Blocking this, are N defensive players that are higher
> > + * priority which just spin on the cpu, preventing the medium
> > + * priroity tasks from running.
>
> priority

Fixed.

> > +atomic_t players_ready;
> > +atomic_t ball_pos;
> > +int players_per_team;
>
> Nit: Number of players cannot be lower than 0. Should it be unsigned then?

Fixed.

> > +bool game_over;
> > +
> > +struct mutex *mutex_low_list;
> > +struct mutex *mutex_mid_list;
> > +
> > +static inline
> > +struct task_struct *create_fifo_thread(int (*threadfn)(void *data), void *data,
> > + char *name, int prio)
> > +{
> > + struct task_struct *kth;
> > + struct sched_attr attr = {
> > + .size = sizeof(struct sched_attr),
> > + .sched_policy = SCHED_FIFO,
> > + .sched_nice = 0,
> > + .sched_priority = prio,
> > + };
> > + int ret;
> > +
> > + kth = kthread_create(threadfn, data, name);
> > + if (IS_ERR(kth)) {
> > + pr_warn("%s eerr, kthread_create failed\n", __func__);
>
> Extra e at eerr?

Fixed.


> > + return kth;
> > + }
> > + ret = sched_setattr_nocheck(kth, &attr);
> > + if (ret) {
> > + kthread_stop(kth);
> > + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> > + return ERR_PTR(ret);
> > + }
> > +
> > + wake_up_process(kth);
> > + return kth;
>
> I think the result of this function is actually unused. So,
> create_fifo_thread()'s return type can be void?

It's not used, but it probably should be. At least I should be
checking for the failure cases. I'll rework to fix this.
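
e.g. something like this in the init path (minimal sketch, with a
hypothetical log prefix):

	kth = create_fifo_thread(ref_thread, (void *)10, "ref-thread", 20);
	if (IS_ERR(kth)) {
		pr_warn("sched_football: failed to create ref thread\n");
		return PTR_ERR(kth);
	}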



> > +
> > +int offense_thread(void *)
>
> Does this (no param name) build fine on Android env?

Good point, I've only been testing this bit with qemu. I'll fix it up.

> > +int ref_thread(void *arg)
> > +{
> > + struct task_struct *kth;
> > + long game_time = (long)arg;
> > + unsigned long final_pos;
> > + long i;
> > +
> > + pr_info("%s: started ref, game_time: %ld secs !\n", __func__,
> > + game_time);
> > +
> > + /* Create low priority defensive team */
>
> Sorry: extra space after `low`.

Fixed.

> > + mutex_low_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
> > + mutex_mid_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
>
> * Extra space after `players_per_team,`.
> * Shouldn't we check result of `kmalloc_array()`?
>
> Same comments for `mutex_low_list` (previous) line.

Yep.
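
For the allocations, something along these lines (sketch; the exact
error path depends on where the allocation ends up living):

	mutex_low_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
	if (!mutex_low_list)
		return -ENOMEM;

	mutex_mid_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
	if (!mutex_mid_list) {
		kfree(mutex_low_list);
		return -ENOMEM;
	}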

Thanks for all the suggestions!
-john

2024-01-05 05:22:37

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 17/23] sched: Initial sched_football test implementation

On Thu, Dec 28, 2023 at 7:19 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > +static int __init test_sched_football_init(void)
> > +{
> > + struct task_struct *kth;
> > + int i;
> > +
> > + players_per_team = num_online_cpus();
> > +
> > + mutex_low_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
> > + mutex_mid_list = kmalloc_array(players_per_team, sizeof(struct mutex), GFP_ATOMIC);
> > +
> > + for (i = 0; i < players_per_team; i++) {
> > + mutex_init(&mutex_low_list[i]);
> > + mutex_init(&mutex_mid_list[i]);
> > + }
> > +
> > + kth = create_fifo_thread(ref_thread, (void *)10, "ref-thread", 20);
> > +
> > + return 0;
> > +}
> > +module_init(test_sched_football_init);
>
> Hit `modpost: missing MODULE_LICENSE() in
> kernel/sched/test_sched_football.o` error when I build this module.
>
> JFYI: the module does not have MODULE_NAME(), MODULE_DESCRIPTION(),
> MODULE_AUTHOR(), module_exit(), ... as well.

Good point. I've only been using it as a built-in.
Added all of those except for module_exit() for now, as I don't want
it to be unloaded while the kthreads are running.
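
i.e. roughly the following (exact strings are placeholders):

MODULE_LICENSE("GPL");
MODULE_AUTHOR("John Stultz");
MODULE_DESCRIPTION("sched_football priority inheritance test");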

thanks
-john

2024-01-05 05:25:28

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 17/23] sched: Initial sched_football test implementation

On Thu, Dec 28, 2023 at 8:36 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > +int crazy_fan_thread(void *)
> > +{
> > + int count = 0;
> > +
> > + atomic_inc(&players_ready);
> > + while (!READ_ONCE(game_over)) {
> > + if (kthread_should_stop())
> > + break;
> > + schedule();
> > + udelay(1000);
> > + msleep(2);
> > + count++;
>
> @count is only increased. Is it really necessary?

Nope. Just remnants of earlier debug code.

>
> * Please prepend a prefix to prints to ease capturing the module logs.

Done.

> * I think `rmmod test_sched_football` throws `Device or resource busy`
> error and fails to remove the module because of missing module_exit().

Yep. I'm skipping this for now, but I'll see about adding it later,
once I figure out the changes needed to manufacture the problematic
load-balancing condition I'm worried about, as it doesn't seem to
appear on its own so far.

thanks
-john

2024-01-10 22:25:02

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 08/23] sched: Split scheduler and execution contexts

On Wed, Jan 3, 2024 at 6:49 AM Valentin Schneider <[email protected]> wrote:
> On 19/12/23 16:18, John Stultz wrote:
> > NOTE: Peter previously mentioned he didn't like the name
> > "rq_selected()", but I've not come up with a better alternative.
> > I'm very open to other name proposals.
> >
>
> I got used to the naming relatively quickly. It "should" be rq_pick()
> (i.e. what did the last pick_next_task() return for that rq), but that
> naming is unfortunately ambiguous (is it doing a pick itself?), so I think
> "selected" works.

Thanks for that feedback! I guess rq_picked() might be an alternative
to your suggestion of rq_pick(), but selected still sounds better to
me.

thanks
-john

2024-01-10 22:37:07

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v7 06/23] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable

On Thu, Dec 28, 2023 at 7:06 AM Metin Kaya <[email protected]> wrote:
> On 20/12/2023 12:18 am, John Stultz wrote:
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -908,6 +908,13 @@ config NUMA_BALANCING_DEFAULT_ENABLED
> > If set, automatic NUMA balancing will be enabled if running on a NUMA
> > machine.
> >
> > +config SCHED_PROXY_EXEC
> > + bool "Proxy Execution"
> > + default n
> > + help
> > + This option enables proxy execution, a mechanism for mutex owning
> > + tasks to inherit the scheduling context of higher priority waiters.
> > +
>
> Should `SCHED_PROXY_EXEC` config option be under `Scheduler features` menu?

Yeah, that sounds like a nice idea. Done.

Thanks again for the suggestion!
-john