2023-11-06 19:37:06

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

Stabilizing this Proxy Execution series has unfortunately
continued to be a challenging task. Since the v5 release, I’ve
been focused on getting the deactivated/sleeping owner enqueuing
functionality, which I left out of v5, stabilized. I’ve managed
to rework enough to avoid the crashes previously tripped with the
locktorture & ww_mutex selftests, so I feel it’s much improved,
but I do still see some issues (blocked waitqueues and hung task
watchdogs firing) after stressing the system for many hours in a
64 cpu qemu environment (though this issue seems to be introduced
earlier in the series with proxy-migration/return-migration).

I still haven’t had time to focus on testing and debugging the
chain migration patches. So I’ve left that patch out of this
submission for now, but will try to get it included in the next
revision.

This patch series is actually coarser than what I’ve been
developing with, as my working tree has a number of small “test”
steps to help validate the behavior I changed, which are then
replaced by the real logic afterwards. Including those here would
just create more work for reviewers, so I’ve folded them together,
but if you’re interested you can find the fine-grained tree here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6-fine-grained

As mentioned previously, this Proxy Execution series has a long
history: it was first described in a paper[1] by Watkins, Straub,
and Niehaus, then came patches from Peter Zijlstra, which were
extended with lots of work by Juri Lelli, Valentin Schneider, and
Connor O'Brien. (And thank you to Steven Rostedt for providing
additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Overview:
-----------
Proxy Execution is a generalized form of priority inheritance.
Classic priority inheritance works well for real-time tasks where
there is a straightforward priority order to how things are run.
But it breaks down when used between CFS or DEADLINE tasks, as
there are lots of parameters involved outside of just the task’s
nice value when selecting the next task to run (via
pick_next_task()). So ideally we want to imbue the mutex holder
with all the scheduler attributes of the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked
on, and which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so
that we instead run the mutex holder.

By leaving blocked tasks on the runqueue, we allow
pick_next_task() to choose the task that should run next (even if
it’s blocked waiting on a mutex). If we do select a blocked task,
we look at the task’s blocked_on mutex and from there look at the
mutex’s owner task. And in the simple case, the task which owns
the mutex is what we then choose to run, allowing it to release
the mutex.
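
As a rough sketch (illustrative only; the real proxy() code later
in the series takes mutex->wait_lock and the task's blocked_lock,
and handles many more cases), the simple case amounts to:

  struct task_struct *next = pick_next_task(rq, prev, &rf);

  if (task_is_blocked(next)) {
          struct mutex *m = next->blocked_on;
          struct task_struct *owner = __mutex_owner(m);

          /* Run the owner, using next's scheduler context */
          if (owner)
                  next = owner;
  }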

This means that instead of just tracking “curr”, the scheduler
needs to track both the scheduler context (what was picked and
all the state used for scheduling decisions), and the execution
context (what we’re actually running).

In this way, the mutex owner is run “on behalf” of the blocked
task that was picked to run, essentially inheriting the scheduler
context of the waiting blocked task.
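
In rq terms (illustrative; rq_selected() is the accessor this
series adds, discussed more below):

  rq->curr          /* execution context: the task actually running */
  rq_selected(rq)   /* scheduler context: the task pick_next_task()
                       chose, whose attributes drive scheduling */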

As Connor outlined in a previous submission of this patch series,
this raises a number of complicated situations: The mutex owner
might itself be blocked on another mutex, or it could be
sleeping, running on a different CPU, in the process of migrating
between CPUs, etc.

The complexity from this is imposing, but currently in Android we
have a large number of cases where we are seeing priority
inversion (not unbounded, but much longer than we’d like) between
“foreground” and “background” SCHED_NORMAL applications. As a
result, currently we cannot usefully limit background activity
without introducing inconsistent behavior. So Proxy Execution is
a very attractive approach to resolving this critical issue.


New in v6:
---------
* Big rework of blocked owner enqueuing, re-adding the
functionality to the series.

* Added a boot option proxy_exec= and additional logic to allow
the functionality to be toggled at boot time (see the example
after this list)

* Minor tweaks to wake_q logic suggested by Waiman Long

* Minor fixups Reported-by: kernel test robot <[email protected]>

* Various cleanups, optimizations as well as fixes for bugs found in v5
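
For example, with CONFIG_PROXY_EXEC=y the feature can still be
turned off at boot (per the parameter added in patch 08):

  proxy_exec=disable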

Also, I know Peter didn’t like the blocked_on wrappers, so I
previously dropped them, but I found them (and the debug checks
within) crucial to working out issues in this patch series. I’m
fine to consider dropping them later if they are still
objectionable, but their utility was too high at this point to
get rid of them.


Issues still to address:
-----------
* As already mentioned, the sleeping owner handling (where we
deactivate waiting tasks and enqueue them onto a list, then
reactivate them when the owner wakes up) has been very
difficult to get right. This is in part because when we want
to activate tasks, we’re already holding a task.pi_lock and a
rq_lock, just not the locks for the task we’re activating, nor
the rq we’re enqueuing it onto. So there has to be a bit of
lock juggling to drop and acquire the right locks (in the right
order).

* Similarly, as we’re often dealing with lists of tasks or chains
of tasks and mutexes, iterating across these chains of objects
can be done safely while holding the rq lock, but as these
chains can cross runqueues our ability to traverse the chains
safely is somewhat limited.

* Additionally, in the sleeping owner enqueueing as well as in
proxy return migration, we seem to be constantly working
against the proper locking order (task.pi_lock -> rq_lock
-> mutex.wait_lock -> task.blocked_lock), having to repeatedly
“swim upstream” where we need a higher-order lock than what
we’re already holding (see the rough sketch after this list).
Suggestions for different approaches to the serialization or
ways to move the logic outside of where locks are already held
would be appreciated.

* Still need to validate and re-add the chain migration patches
to ensure they resolve the RT/DL load balancing invariants.

* “rq_selected()” naming. Peter doesn’t like it, but I’ve not
thought of a better name. Open to suggestions.

* As discussed at OSPM[2], I’d like to split pick_next_task() up
into two phases: selecting and setting the next task. Currently
pick_next_task() assumes the returned task will be run, which
results in various side effects in sched class logic when it’s
run. I tried to take a pass at this earlier, but it’s hairy and
lower on the priority list for now.

* CFS load balancing. Blocked tasks may carry forward load (PELT)
to the lock owner's CPU, so the CPU may look like it is overloaded.

* Optimization to avoid migrating blocked tasks (to preserve
optimistic mutex spinning) if the runnable lock owner at the
end of the blocked_on chain is already running (though this is
difficult due to the limitations from the locking rules needed to
traverse the blocked_on chain across runqueues).
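
As a rough illustration of the locking order mentioned above
(this is a hypothetical helper, not code from the series; rechecks
and error handling are largely elided), activating a sleeping
waiter "from scratch" wants the locks nested like so:

  static void activate_blocked_waiter(struct task_struct *p, struct mutex *mutex)
  {
          struct rq_flags rf;
          struct rq *rq;

          raw_spin_lock_irq(&p->pi_lock);         /* 1: task.pi_lock */
          rq = __task_rq_lock(p, &rf);            /* 2: rq_lock */
          raw_spin_lock(&mutex->wait_lock);       /* 3: mutex.wait_lock */
          raw_spin_lock(&p->blocked_lock);        /* 4: task.blocked_lock */

          /* Recheck under the locks, as the state may have changed */
          if (p->blocked_on == mutex && !task_on_rq_queued(p)) {
                  update_rq_clock(rq);
                  activate_task(rq, p, ENQUEUE_NOCLOCK);
          }

          raw_spin_unlock(&p->blocked_lock);
          raw_spin_unlock(&mutex->wait_lock);
          __task_rq_unlock(rq, &rf);
          raw_spin_unlock_irq(&p->pi_lock);
  }

The trouble is that the unlock/handoff paths already hold the
inner locks (3 and 4), so they can't simply do the above without
first dropping what they hold and rechecking everything afterwards.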


Performance:
-----------
This patch series switches mutexes to use handoff mode rather
than optimistic spinning. This is a potential concern where locks
are under high contention. However, earlier performance analysis
(on both x86 and mobile devices) did not see major regressions.
That said, Chenyu did report a regression[3], which I’ll need to
look further into. I also briefly re-tested with the v5 series
and saw some average latencies grow vs v4, suggesting the changes
to return-migration (and extra locking) have some impact. With v6
the extra overhead is reduced but still not as nice as v4. I’ll
be digging more there, but my priority is still stability over
speed at this point (it’s easier to validate correctness of
optimizations if the baseline isn’t crashing).


If folks find it easier to test/tinker with, this patch series
can also be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6

A while back Sage Sharp had a nice blog post about types of
reviews [4], and while any review and feedback would be greatly
appreciated, those focusing on conceptual design or correctness
issues would be *especially* valued.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[2] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
[4] https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/


Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]


John Stultz (8):
locking/mutex: Removes wakeups from under mutex::wait_lock
sched: Add CONFIG_PROXY_EXEC & boot argument to enable/disable
locking/mutex: Split blocked_on logic into two states (blocked_on and
blocked_on_waking)
sched: Fix runtime accounting w/ split exec & sched contexts
sched: Split out __sched() deactivate task logic into a helper
sched: Add a very simple proxy() function
sched: Add proxy deactivate helper
sched: Handle blocked-waiter migration (and return migration)

Juri Lelli (2):
locking/mutex: make mutex::wait_lock irq safe
locking/mutex: Expose __mutex_owner()

Peter Zijlstra (8):
sched: Unify runtime accounting across classes
locking/mutex: Rework task_struct::blocked_on
locking/mutex: Add task_struct::blocked_lock to serialize changes to
the blocked_on state
locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC
sched: Split scheduler execution context
sched: Start blocked_on chain processing in proxy()
sched: Add blocked_donor link to task for smarter mutex handoffs
sched: Add deactivated (sleeping) owner handling to proxy()

Valentin Schneider (2):
locking/mutex: Add p->blocked_on wrappers for correctness checks
sched: Fix proxy/current (push,pull)ability

.../admin-guide/kernel-parameters.txt | 4 +
include/linux/sched.h | 49 +-
init/Kconfig | 7 +
init/init_task.c | 1 +
kernel/Kconfig.locks | 2 +-
kernel/fork.c | 8 +-
kernel/locking/mutex-debug.c | 9 +-
kernel/locking/mutex.c | 163 ++--
kernel/locking/mutex.h | 25 +
kernel/locking/ww_mutex.h | 72 +-
kernel/sched/core.c | 723 ++++++++++++++++--
kernel/sched/deadline.c | 50 +-
kernel/sched/fair.c | 104 ++-
kernel/sched/rt.c | 77 +-
kernel/sched/sched.h | 61 +-
kernel/sched/stop_task.c | 13 +-
16 files changed, 1117 insertions(+), 251 deletions(-)

--
2.42.0.869.gea05f2083d-goog


2023-11-06 19:37:11

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 01/20] sched: Unify runtime accounting across classes

From: Peter Zijlstra <[email protected]>

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[fix conflicts, fold in update_current_exec_runtime]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: rebased, resolving minor conflicts]
Signed-off-by: John Stultz <[email protected]>
---
NOTE: This patch is a general cleanup and if no one objects
could be merged at this point. If needed, I'll resend separately
if it isn't picked up on its own.
---
include/linux/sched.h | 2 +-
kernel/sched/deadline.c | 13 +++-------
kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++----------
kernel/sched/rt.c | 13 +++-------
kernel/sched/sched.h | 12 ++-------
kernel/sched/stop_task.c | 13 +---------
6 files changed, 52 insertions(+), 57 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..4f5b0710c0f1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -520,7 +520,7 @@ struct sched_statistics {
u64 block_max;
s64 sum_block_runtime;

- u64 exec_max;
+ s64 exec_max;
u64 slice_max;

u64 nr_migrations_cold;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 58b542bf2893..9522e6607754 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1299,9 +1299,8 @@ static void update_curr_dl(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_dl_entity *dl_se = &curr->dl;
- u64 delta_exec, scaled_delta_exec;
+ s64 delta_exec, scaled_delta_exec;
int cpu = cpu_of(rq);
- u64 now;

if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
@@ -1314,21 +1313,15 @@ static void update_curr_dl(struct rq *rq)
* natural solution, but the full ramifications of this
* approach need further study.
*/
- now = rq_clock_task(rq);
- delta_exec = now - curr->se.exec_start;
- if (unlikely((s64)delta_exec <= 0)) {
+ delta_exec = update_curr_common(rq);
+ if (unlikely(delta_exec <= 0)) {
if (unlikely(dl_se->dl_yielded))
goto throttle;
return;
}

- schedstat_set(curr->stats.exec_max,
- max(curr->stats.exec_max, delta_exec));
-
trace_sched_stat_runtime(curr, delta_exec, 0);

- update_current_exec_runtime(curr, now, delta_exec);
-
if (dl_entity_is_special(dl_se))
return;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df348aa55d3c..c919633acd3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1144,23 +1144,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_SMP */

-/*
- * Update the current task's runtime statistics.
- */
-static void update_curr(struct cfs_rq *cfs_rq)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
{
- struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_clock_task(rq_of(cfs_rq));
- u64 delta_exec;
-
- if (unlikely(!curr))
- return;
+ u64 now = rq_clock_task(rq);
+ s64 delta_exec;

delta_exec = now - curr->exec_start;
- if (unlikely((s64)delta_exec <= 0))
- return;
+ if (unlikely(delta_exec <= 0))
+ return delta_exec;

curr->exec_start = now;
+ curr->sum_exec_runtime += delta_exec;

if (schedstat_enabled()) {
struct sched_statistics *stats;
@@ -1170,9 +1164,43 @@ static void update_curr(struct cfs_rq *cfs_rq)
max(delta_exec, stats->exec_max));
}

- curr->sum_exec_runtime += delta_exec;
- schedstat_add(cfs_rq->exec_clock, delta_exec);
+ return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ s64 delta_exec;

+ delta_exec = update_curr_se(rq, &curr->se);
+ if (unlikely(delta_exec <= 0))
+ return delta_exec;
+
+ account_group_exec_runtime(curr, delta_exec);
+ cgroup_account_cputime(curr, delta_exec);
+
+ return delta_exec;
+}
+
+/*
+ * Update the current task's runtime statistics.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ s64 delta_exec;
+
+ if (unlikely(!curr))
+ return;
+
+ delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+ if (unlikely(delta_exec <= 0))
+ return;
+
+ schedstat_add(cfs_rq->exec_clock, delta_exec);
curr->vruntime += calc_delta_fair(delta_exec, curr);
update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0597ba0f85ff..327ae4148aec 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1046,24 +1046,17 @@ static void update_curr_rt(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_rt_entity *rt_se = &curr->rt;
- u64 delta_exec;
- u64 now;
+ s64 delta_exec;

if (curr->sched_class != &rt_sched_class)
return;

- now = rq_clock_task(rq);
- delta_exec = now - curr->se.exec_start;
- if (unlikely((s64)delta_exec <= 0))
+ delta_exec = update_curr_common(rq);
+ if (unlikely(delta_exec < 0))
return;

- schedstat_set(curr->stats.exec_max,
- max(curr->stats.exec_max, delta_exec));
-
trace_sched_stat_runtime(curr, delta_exec, 0);

- update_current_exec_runtime(curr, now, delta_exec);
-
if (!rt_bandwidth_enabled())
return;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04846272409c..1def5b7fa1df 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2228,6 +2228,8 @@ struct affinity_context {
unsigned int flags;
};

+extern s64 update_curr_common(struct rq *rq);
+
struct sched_class {

#ifdef CONFIG_UCLAMP_TASK
@@ -3280,16 +3282,6 @@ extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
#endif

-static inline void update_current_exec_runtime(struct task_struct *curr,
- u64 now, u64 delta_exec)
-{
- curr->se.sum_exec_runtime += delta_exec;
- account_group_exec_runtime(curr, delta_exec);
-
- curr->se.exec_start = now;
- cgroup_account_cputime(curr, delta_exec);
-}
-
#ifdef CONFIG_SCHED_MM_CID

#define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 85590599b4d6..7595494ceb6d 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)

static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
{
- struct task_struct *curr = rq->curr;
- u64 now, delta_exec;
-
- now = rq_clock_task(rq);
- delta_exec = now - curr->se.exec_start;
- if (unlikely((s64)delta_exec < 0))
- delta_exec = 0;
-
- schedstat_set(curr->stats.exec_max,
- max(curr->stats.exec_max, delta_exec));
-
- update_current_exec_runtime(curr, now, delta_exec);
+ update_curr_common(rq);
}

/*
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:13

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 02/20] locking/mutex: Removes wakeups from under mutex::wait_lock

In preparation to nest mutex::wait_lock under rq::lock we need to remove
wakeups from under it.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
mutexes")]
Signed-off-by: Juri Lelli <[email protected]>
[jstultz: rebased to mainline, added extra wake_up_q & init
to avoid hangs, similar to Connor's rework of this patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Reverted back to an earlier version of this patch to undo
the change that kept the wake_q in the ctx structure, as
that broke the rule that the wake_q must always be on the
stack, as its not safe for concurrency.
v6:
* Made tweaks suggested by Waiman Long
---
kernel/locking/mutex.c | 17 +++++++++++++----
kernel/locking/ww_mutex.h | 29 ++++++++++++++++++-----------
2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index d973fe6041bf..4ada158eb7ca 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -570,6 +570,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
struct lockdep_map *nest_lock, unsigned long ip,
struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
+ DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
int ret;
@@ -620,7 +621,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (__mutex_trylock(lock)) {
if (ww_ctx)
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);

goto skip_wait;
}
@@ -640,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* Add in stamp order, waking up waiters that must kill
* themselves.
*/
- ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
+ ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
if (ret)
goto err_early_kill;
}
@@ -676,6 +677,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}

raw_spin_unlock(&lock->wait_lock);
+ /* Make sure we do wakeups before calling schedule */
+ if (!wake_q_empty(&wake_q)) {
+ wake_up_q(&wake_q);
+ wake_q_init(&wake_q);
+ }
schedule_preempt_disabled();

first = __mutex_waiter_is_first(lock, &waiter);
@@ -709,7 +715,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (!ww_ctx->is_wait_die &&
!__mutex_waiter_is_first(lock, &waiter))
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
}

__mutex_remove_waiter(lock, &waiter);
@@ -725,6 +731,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
ww_mutex_lock_acquired(ww, ww_ctx);

raw_spin_unlock(&lock->wait_lock);
+ wake_up_q(&wake_q);
preempt_enable();
return 0;

@@ -736,6 +743,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
raw_spin_unlock(&lock->wait_lock);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
+ wake_up_q(&wake_q);
preempt_enable();
return ret;
}
@@ -929,6 +937,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}
}

+ preempt_disable();
raw_spin_lock(&lock->wait_lock);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
@@ -947,8 +956,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
__mutex_handoff(lock, next);

raw_spin_unlock(&lock->wait_lock);
-
wake_up_q(&wake_q);
+ preempt_enable();
}

#ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 3ad2cc4823e5..7189c6631d90 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
*/
static bool
__ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
{
if (!ww_ctx->is_wait_die)
return false;
@@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}

return true;
@@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
*/
static bool __ww_mutex_wound(struct MUTEX *lock,
struct ww_acquire_ctx *ww_ctx,
- struct ww_acquire_ctx *hold_ctx)
+ struct ww_acquire_ctx *hold_ctx,
+ struct wake_q_head *wake_q)
{
struct task_struct *owner = __ww_mutex_owner(lock);

@@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* wakeup pending to re-read the wounded state.
*/
if (owner != current)
- wake_up_process(owner);
+ wake_q_add(wake_q, owner);

return true;
}
@@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* The current task must not be on the wait list.
*/
static void
-__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
+__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur;

@@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
if (!cur->ww_ctx)
continue;

- if (__ww_mutex_die(lock, cur, ww_ctx) ||
- __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+ if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
+ __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
break;
}
}
@@ -377,6 +379,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
+ DEFINE_WAKE_Q(wake_q);
+
ww_mutex_lock_acquired(lock, ctx);

/*
@@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* die or wound us.
*/
lock_wait_lock(&lock->base);
- __ww_mutex_check_waiters(&lock->base, ctx);
+ __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
unlock_wait_lock(&lock->base);
+
+ wake_up_q(&wake_q);
}

static __always_inline int
@@ -488,7 +494,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
static inline int
__ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
struct MUTEX *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur, *pos = NULL;
bool is_wait_die;
@@ -532,7 +539,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
pos = cur;

/* Wait-Die: ensure younger waiters die. */
- __ww_mutex_die(lock, cur, ww_ctx);
+ __ww_mutex_die(lock, cur, ww_ctx, wake_q);
}

__ww_waiter_add(lock, waiter, pos);
@@ -550,7 +557,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
* such that either we or the fastpath will wound @ww->ctx.
*/
smp_mb();
- __ww_mutex_wound(lock, ww_ctx, ww->ctx);
+ __ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
}

return 0;
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:14

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 03/20] locking/mutex: make mutex::wait_lock irq safe

From: Juri Lelli <[email protected]>

mutex::wait_lock might be nested under rq->lock.

Make it irq safe then.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
Signed-off-by: Connor O'Brien <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
v3:
* Re-added this patch after it was dropped in v2 which
caused lockdep warnings to trip.
---
kernel/locking/mutex.c | 18 ++++++++++--------
kernel/locking/ww_mutex.h | 22 +++++++++++-----------
2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4ada158eb7ca..4c63a197f6fe 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -573,6 +573,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
+ unsigned long flags;
int ret;

if (!use_ww_ctx)
@@ -615,7 +616,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;
}

- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -676,7 +677,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err;
}

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* Make sure we do wakeups before calling schedule */
if (!wake_q_empty(&wake_q)) {
wake_up_q(&wake_q);
@@ -702,9 +703,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
trace_contention_begin(lock, LCB_F_MUTEX);
}

- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
__set_current_state(TASK_RUNNING);

@@ -730,7 +731,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
if (ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
return 0;
@@ -740,7 +741,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
trace_contention_end(lock, ret);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
wake_up_q(&wake_q);
@@ -911,6 +912,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
struct task_struct *next = NULL;
DEFINE_WAKE_Q(wake_q);
unsigned long owner;
+ unsigned long flags;

mutex_release(&lock->dep_map, ip);

@@ -938,7 +940,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}

preempt_disable();
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
@@ -955,7 +957,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 7189c6631d90..8b94f4b89e74 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
}

-static inline void lock_wait_lock(struct mutex *lock)
+static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, *flags);
}

-static inline void unlock_wait_lock(struct mutex *lock)
+static inline void unlock_wait_lock(struct mutex *lock, unsigned long flags)
{
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
}

static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
@@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
return rt_mutex_has_waiters(&lock->rtmutex);
}

-static inline void lock_wait_lock(struct rt_mutex *lock)
+static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->rtmutex.wait_lock);
+ raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
}

-static inline void unlock_wait_lock(struct rt_mutex *lock)
-static inline void unlock_wait_lock(struct rt_mutex *lock, unsigned long flags)
{
- raw_spin_unlock(&lock->rtmutex.wait_lock);
+ raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, flags);
}

static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
@@ -380,6 +380,7 @@ static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
DEFINE_WAKE_Q(wake_q);
+ unsigned long flags;

ww_mutex_lock_acquired(lock, ctx);

@@ -408,10 +409,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* Uh oh, we raced in fastpath, check if any of the waiters need to
* die or wound us.
*/
- lock_wait_lock(&lock->base);
+ lock_wait_lock(&lock->base, &flags);
__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
- unlock_wait_lock(&lock->base);
-
+ unlock_wait_lock(&lock->base, flags);
wake_up_q(&wake_q);
}

--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:15

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 04/20] locking/mutex: Expose __mutex_owner()

From: Juri Lelli <[email protected]>

Implementing proxy execution requires that scheduler code be able to
identify the current owner of a mutex. Expose __mutex_owner() for
this purpose (alone!).

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
[Removed the EXPORT_SYMBOL]
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Reworked per Peter's suggestions]
Signed-off-by: John Stultz <[email protected]>
---
v4:
* Move __mutex_owner() to kernel/locking/mutex.h instead of
adding a new globally available accessor function to keep
the exposure of this low, along with keeping it an inline
function, as suggested by PeterZ
---
kernel/locking/mutex.c | 25 -------------------------
kernel/locking/mutex.h | 25 +++++++++++++++++++++++++
2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4c63a197f6fe..2c5d1a9cf767 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -56,31 +56,6 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
}
EXPORT_SYMBOL(__mutex_init);

-/*
- * @owner: contains: 'struct task_struct *' to the current lock owner,
- * NULL means not owned. Since task_struct pointers are aligned at
- * at least L1_CACHE_BYTES, we have low bits to store extra state.
- *
- * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
- * Bit1 indicates unlock needs to hand the lock to the top-waiter
- * Bit2 indicates handoff has been done and we're waiting for pickup.
- */
-#define MUTEX_FLAG_WAITERS 0x01
-#define MUTEX_FLAG_HANDOFF 0x02
-#define MUTEX_FLAG_PICKUP 0x04
-
-#define MUTEX_FLAGS 0x07
-
-/*
- * Internal helper function; C doesn't allow us to hide it :/
- *
- * DO NOT USE (outside of mutex code).
- */
-static inline struct task_struct *__mutex_owner(struct mutex *lock)
-{
- return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
-}
-
static inline struct task_struct *__owner_task(unsigned long owner)
{
return (struct task_struct *)(owner & ~MUTEX_FLAGS);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 0b2a79c4013b..1c7d3d32def8 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -20,6 +20,31 @@ struct mutex_waiter {
#endif
};

+/*
+ * @owner: contains: 'struct task_struct *' to the current lock owner,
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
+ *
+ * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ * Bit1 indicates unlock needs to hand the lock to the top-waiter
+ * Bit2 indicates handoff has been done and we're waiting for pickup.
+ */
+#define MUTEX_FLAG_WAITERS 0x01
+#define MUTEX_FLAG_HANDOFF 0x02
+#define MUTEX_FLAG_PICKUP 0x04
+
+#define MUTEX_FLAGS 0x07
+
+/*
+ * Internal helper function; C doesn't allow us to hide it :/
+ *
+ * DO NOT USE (outside of mutex & scheduler code).
+ */
+static inline struct task_struct *__mutex_owner(struct mutex *lock)
+{
+ return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
+}
+
#ifdef CONFIG_DEBUG_MUTEXES
extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:18

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 07/20] locking/mutex: Add p->blocked_on wrappers for correctness checks

From: Valentin Schneider <[email protected]>

This lets us assert p->blocked_lock is held whenever we access
p->blocked_on, as well as warn us about unexpected state changes.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Valentin Schneider <[email protected]>
[fix conflicts, call in more places]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: tweaked commit subject, added get_task_blocked_on() as well]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Added get_task_blocked_on() accessor
v4:
* Address READ_ONCE usage that was dropped in v2
* Reordered to be a later add on to the main patch series as
Peter was unhappy with similar wrappers in other patches.
v5:
* Added some extra correctness checking in wrappers
---
include/linux/sched.h | 22 ++++++++++++++++++++++
kernel/locking/mutex-debug.c | 4 ++--
kernel/locking/mutex.c | 10 +++++-----
kernel/locking/ww_mutex.h | 4 ++--
4 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9258dae00e0..81334677e008 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2248,6 +2248,28 @@ static inline int rwlock_needbreak(rwlock_t *lock)
#endif
}

+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+ lockdep_assert_held(&p->blocked_lock);
+
+ /* We should only go from NULL to a value, or from a value back to NULL */
+ WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
+
+ p->blocked_on = m;
+}
+
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+ lockdep_assert_held(&p->blocked_lock);
+
+ return p->blocked_on;
+}
+
+static inline struct mutex *get_task_blocked_on_once(struct task_struct *p)
+{
+ return READ_ONCE(p->blocked_on);
+}
+
static __always_inline bool need_resched(void)
{
return unlikely(tif_need_resched());
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 7228909c3e62..1eedf7c60c00 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -53,13 +53,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
lockdep_assert_held(&lock->wait_lock);

/* Current thread can't be already blocked (since it's executing!) */
- DEBUG_LOCKS_WARN_ON(task->blocked_on);
+ DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
}

void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
{
- struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+ struct mutex *blocked_on = get_task_blocked_on_once(task);

DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index df186c0bf4a9..36e563f69705 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -623,7 +623,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err_early_kill;
}

- current->blocked_on = lock;
+ set_task_blocked_on(current, lock);
set_current_state(state);
trace_contention_begin(lock, LCB_F_MUTEX);
for (;;) {
@@ -670,7 +670,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
/*
* Gets reset by unlock path().
*/
- current->blocked_on = lock;
+ set_task_blocked_on(current, lock);
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -699,7 +699,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}
}
acquired:
- current->blocked_on = NULL;
+ set_task_blocked_on(current, NULL);
__set_current_state(TASK_RUNNING);

if (ww_ctx) {
@@ -731,7 +731,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;

err:
- current->blocked_on = NULL;
+ set_task_blocked_on(current, NULL);
__set_current_state(TASK_RUNNING);
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
@@ -950,7 +950,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
debug_mutex_wake_waiter(lock, waiter);
raw_spin_lock(&next->blocked_lock);
WARN_ON(next->blocked_on != lock);
- next->blocked_on = NULL;
+ set_task_blocked_on(next, NULL);
raw_spin_unlock(&next->blocked_lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 2929a95b4272..44a532dda927 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -292,7 +292,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
* blocked_on relationships that can't resolve.
*/
WARN_ON(waiter->task->blocked_on != lock);
- waiter->task->blocked_on = NULL;
+ set_task_blocked_on(waiter->task, NULL);
wake_q_add(wake_q, waiter->task);
raw_spin_unlock(&waiter->task->blocked_lock);
}
@@ -349,7 +349,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* blocked_on pointer. Otherwise we can see circular
* blocked_on relationships that can't resolve.
*/
- owner->blocked_on = NULL;
+ set_task_blocked_on(owner, NULL);
wake_q_add(wake_q, owner);
raw_spin_unlock(&owner->blocked_lock);
}
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:17

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 08/20] sched: Add CONFIG_PROXY_EXEC & boot argument to enable/disable

This patch adds the CONFIG_PROXY_EXEC option, along with
a boot argument proxy_exec= that can be used to disable the
feature at boot time if CONFIG_PROXY_EXEC was enabled.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 4 +++
include/linux/sched.h | 13 ++++++++
init/Kconfig | 7 +++++
kernel/sched/core.c | 31 +++++++++++++++++++
4 files changed, 55 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0a1731a0f0ef..199578ae1606 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4648,6 +4648,10 @@
that).
Format: <bool>

+ proxy_exec= [KNL] Enables or disables "proxy execution" style
+ solution to mutex-based priority inversion.
+ Format: "enable" or "disable"
+
psi= [KNL] Enable or disable pressure stall information
tracking.
Format: <bool>
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 81334677e008..5f05d9a4cc3f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1551,6 +1551,19 @@ struct task_struct {
*/
};

+#ifdef CONFIG_PROXY_EXEC
+DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static inline bool sched_proxy_exec(void)
+{
+ return static_branch_likely(&__sched_proxy_exec);
+}
+#else
+static inline bool sched_proxy_exec(void)
+{
+ return false;
+}
+#endif
+
static inline struct pid *task_pid(struct task_struct *task)
{
return task->thread_pid;
diff --git a/init/Kconfig b/init/Kconfig
index 6d35728b94b2..884f94d8ee9e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -908,6 +908,13 @@ config NUMA_BALANCING_DEFAULT_ENABLED
If set, automatic NUMA balancing will be enabled if running on a NUMA
machine.

+config PROXY_EXEC
+ bool "Proxy Execution"
+ default n
+ help
+ This option enables proxy execution, a mechanism for mutex owning
+ tasks to inherit the scheduling context of higher priority waiters.
+
menuconfig CGROUPS
bool "Control Group support"
select KERNFS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 802551e0009b..a38bf8ef5798 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -117,6 +117,37 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

+#ifdef CONFIG_PROXY_EXEC
+DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static int __init setup_proxy_exec(char *str)
+{
+ int ret = 0;
+
+ if (!str)
+ goto out;
+
+ if (!strcmp(str, "enable")) {
+ static_branch_enable(&__sched_proxy_exec);
+ ret = 1;
+ } else if (!strcmp(str, "disable")) {
+ static_branch_disable(&__sched_proxy_exec);
+ ret = 1;
+ }
+out:
+ if (!ret)
+ pr_warn("Unable to parse proxy_exec=\n");
+
+ return ret;
+}
+#else
+static int __init setup_proxy_exec(char *str)
+{
+ pr_warn("CONFIG_PROXY_EXEC=n, so it cannot be enabled or disabled at boottime\n");
+ return 0;
+}
+#endif
+__setup("proxy_exec=", setup_proxy_exec);
+
#ifdef CONFIG_SCHED_DEBUG
/*
* Debugging: various feature bits
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:22

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 09/20] locking/mutex: Split blocked_on logic into two states (blocked_on and blocked_on_waking)

This patch adds blocked_on_waking so we can track whether the
task should be able to try to acquire the lock, separately from
which lock it is blocked on.

This avoids some of the subtle magic where the blocked_on state
gets cleared, only to have it re-added by the
mutex_lock_slowpath call when it tries to acquire the lock on
wakeup.

This should make dealing with the ww_mutex issue cleaner.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/fork.c | 1 +
kernel/locking/mutex.c | 7 ++++---
kernel/locking/ww_mutex.h | 12 ++++++------
kernel/sched/sched.h | 12 ++++++++++++
5 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5f05d9a4cc3f..47c7095b918a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1146,6 +1146,7 @@ struct task_struct {
#endif

struct mutex *blocked_on; /* lock we're blocked on */
+ bool blocked_on_waking; /* blocked on, but waking */
raw_spinlock_t blocked_lock;

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
@@ -2269,6 +2270,7 @@ static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));

p->blocked_on = m;
+ p->blocked_on_waking = false;
}

static inline struct mutex *get_task_blocked_on(struct task_struct *p)
diff --git a/kernel/fork.c b/kernel/fork.c
index 47b76ed5ddf6..930947bf4569 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2457,6 +2457,7 @@ __latent_entropy struct task_struct *copy_process(
#endif

p->blocked_on = NULL; /* not blocked yet */
+ p->blocked_on_waking = false; /* not blocked yet */

#ifdef CONFIG_BCACHE
p->sequential_io = 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 36e563f69705..f37b7afe8aa5 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -667,10 +667,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas

raw_spin_lock_irqsave(&lock->wait_lock, flags);
raw_spin_lock(&current->blocked_lock);
+
/*
- * Gets reset by unlock path().
+ * Clear blocked_on_waking flag set by the unlock path().
*/
- set_task_blocked_on(current, lock);
+ current->blocked_on_waking = false;
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -950,7 +951,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
debug_mutex_wake_waiter(lock, waiter);
raw_spin_lock(&next->blocked_lock);
WARN_ON(next->blocked_on != lock);
- set_task_blocked_on(next, NULL);
+ next->blocked_on_waking = true;
raw_spin_unlock(&next->blocked_lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 44a532dda927..3b0a68d7e308 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -287,12 +287,12 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
debug_mutex_wake_waiter(lock, waiter);
#endif
/*
- * When waking up the task to die, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
+ * When waking up the task to die, be sure to set the
+ * blocked_on_waking flag. Otherwise we can see circular
* blocked_on relationships that can't resolve.
*/
WARN_ON(waiter->task->blocked_on != lock);
- set_task_blocked_on(waiter->task, NULL);
+ waiter->task->blocked_on_waking = true;
wake_q_add(wake_q, waiter->task);
raw_spin_unlock(&waiter->task->blocked_lock);
}
@@ -345,11 +345,11 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
/* nested as we should hold current->blocked_lock already */
raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
/*
- * When waking up the task to wound, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
+ * When waking up the task to wound, be sure to set the
+ * blocked_on_waking flag. Otherwise we can see circular
* blocked_on relationships that can't resolve.
*/
- set_task_blocked_on(owner, NULL);
+ owner->blocked_on_waking = true;
wake_q_add(wake_q, owner);
raw_spin_unlock(&owner->blocked_lock);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1def5b7fa1df..7c37d478e0f8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2133,6 +2133,18 @@ static inline int task_current(struct rq *rq, struct task_struct *p)
return rq->curr == p;
}

+#ifdef CONFIG_PROXY_EXEC
+static inline bool task_is_blocked(struct task_struct *p)
+{
+ return sched_proxy_exec() && !!p->blocked_on && !p->blocked_on_waking;
+}
+#else /* !PROXY_EXEC */
+static inline bool task_is_blocked(struct task_struct *p)
+{
+ return false;
+}
+#endif /* PROXY_EXEC */
+
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:25

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 13/20] sched: Split out __sched() deactivate task logic into a helper

As we're going to re-use the deactivation logic,
split it into a helper.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
v6:
* Define function as static to avoid "no previous prototype"
warnings as Reported-by: kernel test robot <[email protected]>
---
kernel/sched/core.c | 65 +++++++++++++++++++++++++--------------------
1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9931940ba474..1b38b34d3f64 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6575,6 +6575,41 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
# define SM_MASK_PREEMPT SM_PREEMPT
#endif

+static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
+{
+ if (signal_pending_state(state, p)) {
+ WRITE_ONCE(p->__state, TASK_RUNNING);
+ } else {
+ p->sched_contributes_to_load =
+ (state & TASK_UNINTERRUPTIBLE) &&
+ !(state & TASK_NOLOAD) &&
+ !(state & TASK_FROZEN);
+
+ if (p->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (p->on_rq && ...)
+ * if (prev_state) goto out;
+ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
+ * p->state = TASK_WAKING
+ *
+ * Where __schedule() and ttwu() have matching control dependencies.
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
+ deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+ if (p->in_iowait) {
+ atomic_inc(&rq->nr_iowait);
+ delayacct_blkio_start();
+ }
+ return true;
+ }
+ return false;
+}
+
/*
* __schedule() is the main scheduler function.
*
@@ -6665,35 +6700,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
- if (signal_pending_state(prev_state, prev)) {
- WRITE_ONCE(prev->__state, TASK_RUNNING);
- } else {
- prev->sched_contributes_to_load =
- (prev_state & TASK_UNINTERRUPTIBLE) &&
- !(prev_state & TASK_NOLOAD) &&
- !(prev_state & TASK_FROZEN);
-
- if (prev->sched_contributes_to_load)
- rq->nr_uninterruptible++;
-
- /*
- * __schedule() ttwu()
- * prev_state = prev->state; if (p->on_rq && ...)
- * if (prev_state) goto out;
- * p->on_rq = 0; smp_acquire__after_ctrl_dep();
- * p->state = TASK_WAKING
- *
- * Where __schedule() and ttwu() have matching control dependencies.
- *
- * After this, schedule() must not care about p->state any more.
- */
- deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
- if (prev->in_iowait) {
- atomic_inc(&rq->nr_iowait);
- delayacct_blkio_start();
- }
- }
+ try_to_deactivate_task(rq, prev, prev_state);
switch_count = &prev->nvcsw;
}

--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:27

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 14/20] sched: Add a very simple proxy() function

This adds a very simple proxy() function so that if we select a
blocked task to run, we will deactivate it and pick again, the
exception being if it has become unblocked after proxy() was
called.

Greatly simplified from patch by:
Peter Zijlstra (Intel) <[email protected]>
Juri Lelli <[email protected]>
Valentin Schneider <[email protected]>
Connor O'Brien <[email protected]>

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
[jstultz: Split out from larger proxy patch and simplified
for review and testing.]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split out from larger proxy patch
---
kernel/sched/core.c | 89 +++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/rt.c | 19 +++++++++-
2 files changed, 102 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b38b34d3f64..5770656b898d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6575,11 +6575,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
# define SM_MASK_PREEMPT SM_PREEMPT
#endif

-static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
+static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
+ unsigned long state, bool deactivate_cond)
{
if (signal_pending_state(state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
- } else {
+ } else if (deactivate_cond) {
p->sched_contributes_to_load =
(state & TASK_UNINTERRUPTIBLE) &&
!(state & TASK_NOLOAD) &&
@@ -6610,6 +6611,74 @@ static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigne
return false;
}

+#ifdef CONFIG_PROXY_EXEC
+/*
+ * Initial simple proxy that just returns the task if it's waking
+ * or deactivates the blocked task so we can pick something that
+ * isn't blocked.
+ */
+static struct task_struct *
+proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
+{
+ struct task_struct *p = next;
+ struct mutex *mutex;
+ unsigned long state;
+
+ mutex = p->blocked_on;
+ /* Something changed in the chain, pick_again */
+ if (!mutex)
+ return NULL;
+ /*
+ * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+ * and ensure @owner sticks around.
+ */
+ raw_spin_lock(&mutex->wait_lock);
+ raw_spin_lock(&p->blocked_lock);
+
+ /* Check again that p is blocked with blocked_lock held */
+ if (!task_is_blocked(p) || mutex != p->blocked_on) {
+ /*
+ * Something changed in the blocked_on chain and
+ * we don't know if only at this level. So, let's
+ * just bail out completely and let __schedule
+ * figure things out (pick_again loop).
+ */
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return NULL;
+ }
+
+ state = READ_ONCE(p->__state);
+ /* Don't deactivate if the state has been changed to TASK_RUNNING */
+ if (!state) {
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return p;
+ }
+
+ try_to_deactivate_task(rq, next, state, true);
+
+ /*
+ * If next is the selected task, then remove lingering
+ * references to it from rq and sched_class structs after
+ * dequeueing.
+ */
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ resched_curr(rq);
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return NULL;
+}
+#else /* PROXY_EXEC */
+static struct task_struct *
+proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
+{
+ BUG(); // This should never be called in the !PROXY case
+ return next;
+}
+#endif /* PROXY_EXEC */
+
/*
* __schedule() is the main scheduler function.
*
@@ -6700,12 +6769,24 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
- try_to_deactivate_task(rq, prev, prev_state);
+ try_to_deactivate_task(rq, prev, prev_state,
+ !task_is_blocked(prev));
switch_count = &prev->nvcsw;
}

- next = pick_next_task(rq, prev, &rf);
+pick_again:
+ next = pick_next_task(rq, rq_selected(rq), &rf);
rq_set_selected(rq, next);
+ if (unlikely(task_is_blocked(next))) {
+ next = proxy(rq, next, &rf);
+ if (!next) {
+ rq_unpin_lock(rq, &rf);
+ __balance_callbacks(rq);
+ rq_repin_lock(rq, &rf);
+ goto pick_again;
+ }
+ }
+
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bc243e70bc0e..0125a3ae5a7a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1537,8 +1537,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)

enqueue_rt_entity(rt_se, flags);

- if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
- enqueue_pushable_task(rq, p);
+ /*
+ * Current can't be pushed away. Selected is tied to current,
+ * so don't push it either.
+ */
+ if (task_current(rq, p) || task_current_selected(rq, p))
+ return;
+ /*
+ * Pinned tasks can't be pushed.
+ */
+ if (p->nr_cpus_allowed == 1)
+ return;
+
+ enqueue_pushable_task(rq, p);
}

static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -1825,6 +1836,10 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)

update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);

+ /* Avoid marking selected as pushable */
+ if (task_current_selected(rq, p))
+ return;
+
/*
* The previous task needs to be made eligible for pushing
* if it is still active
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:30

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 16/20] sched: Fix proxy/current (push,pull)ability

From: Valentin Schneider <[email protected]>

Proxy execution forms atomic pairs of tasks: the selected task
(scheduling context) and a proxy (execution context). The
selected task, along with the rest of the blocked chain,
follows the proxy wrt CPU placement.

They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however, consider
FIFO1 & FIFO42:

                     ,-> RT42
                     |     | blocked-on
                     |     v
      blocked_donor  |   mutex
                     |     | owner
                     |     v
                     `-- RT1

   RT1
   RT42

       CPU0             CPU1
        ^                 ^
        |                 |
    overloaded        !overloaded
    rq prio = 42      rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly here neither RT1 nor RT42 should be
considered push/pullable.

Unfortunately, only the selected task is usually dequeued from the
rq, and the proxy'ed execution context (rq->curr) remains on the rq.
This can cause RT1 to be selected for migration by logic like the
rt pushable_list.

This patch adds a dequeue/enqueue cycle on the proxy task before
__schedule returns, which allows the sched class logic to avoid
adding the now current task to the pushable_list.

Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().

XXX: pinned tasks becoming unblocked should be removed from the push/pull
lists, but those don't get to see __schedule() straight away.
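
Condensed, the helper this patch adds boils down to the sketch below
(illustrative only; the actual helper and its call sites are in the
diff that follows). The DEQUEUE_SAVE/ENQUEUE_RESTORE pair keeps the
task enqueued, but routes it back through the sched class enqueue
path, which now skips the pushable lists for current/selected tasks.

/* Illustrative sketch: make the execution context re-evaluate pushability. */
static inline void proxy_tag_curr_sketch(struct rq *rq, struct task_struct *owner)
{
        dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
        enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
}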

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
v3:
* Tweaked comments & commit message
v5:
* Minor simplifications to utilize the fix earlier
in the patch series.
* Rework the wording of the commit message to match selected/
proxy terminology and expand a bit to make it more clear how
it works.
v6:
* Dropped the now-unused proxied value, to be re-added later in the
series when it is used, as caught by Dietmar
---
kernel/sched/core.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b84d612332e..c148ee5dcf7e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6669,6 +6669,21 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
raw_spin_unlock(&mutex->wait_lock);
return ret;
}
+
+static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
+{
+ /*
+ * pick_next_task() calls set_next_task() on the selected task
+ * at some point, which ensures it is not push/pullable.
+ * However, the selected task *and* the mutex owner form an
+ * atomic pair wrt push/pull.
+ *
+ * Make sure owner is not pushable. Unfortunately we can only
+ * deal with that by means of a dequeue/enqueue cycle. :-/
+ */
+ dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
+ enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+}
#else /* PROXY_EXEC */
static struct task_struct *
proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
@@ -6676,6 +6691,8 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
BUG(); // This should never be called in the !PROXY case
return next;
}
+
+static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next) { }
#endif /* PROXY_EXEC */

/*
@@ -6799,6 +6816,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
* changes to task_struct made by pick_next_task().
*/
RCU_INIT_POINTER(rq->curr, next);
+
+ if (unlikely(!task_current_selected(rq, next)))
+ proxy_tag_curr(rq, next);
+
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
@@ -6823,6 +6844,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
+ /* In case next was already curr but just got blocked_donor*/
+ if (unlikely(!task_current_selected(rq, next)))
+ proxy_tag_curr(rq, next);
+
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

rq_unpin_lock(rq, &rf);
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:31

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 15/20] sched: Add proxy deactivate helper

Add a small helper for deactivating the selected task.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched/core.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5770656b898d..1b84d612332e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6612,6 +6612,22 @@ static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
}

#ifdef CONFIG_PROXY_EXEC
+
+bool proxy_deactivate(struct rq *rq, struct task_struct *next)
+{
+ unsigned long state = READ_ONCE(next->__state);
+
+ /* Don't deactivate if the state has been changed to TASK_RUNNING */
+ if (!state)
+ return false;
+ if (!try_to_deactivate_task(rq, next, state, true))
+ return false;
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ resched_curr(rq);
+ return true;
+}
+
/*
* Initial simple proxy that just returns the task if it's waking
* or deactivates the blocked task so we can pick something that
@@ -6620,10 +6636,9 @@ static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
static struct task_struct *
proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
{
+ struct task_struct *ret = NULL;
struct task_struct *p = next;
struct mutex *mutex;
- unsigned long state;
-
mutex = p->blocked_on;
/* Something changed in the chain, pick_again */
if (!mutex)
@@ -6645,30 +6660,14 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
*/
raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
- return NULL;
- }
-
- state = READ_ONCE(p->__state);
- /* Don't deactivate if the state has been changed to TASK_RUNNING */
- if (!state) {
- raw_spin_unlock(&p->blocked_lock);
- raw_spin_unlock(&mutex->wait_lock);
- return p;
+ return ret;
}

- try_to_deactivate_task(rq, next, state, true);
-
- /*
- * If next is the selected task, then remove lingering
- * references to it from rq and sched_class structs after
- * dequeueing.
- */
- put_prev_task(rq, next);
- rq_set_selected(rq, rq->idle);
- resched_curr(rq);
+ if (!proxy_deactivate(rq, next))
+ ret = p;
raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
- return NULL;
+ return ret;
}
#else /* PROXY_EXEC */
static struct task_struct *
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:32

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 05/20] locking/mutex: Rework task_struct::blocked_on

From: Peter Zijlstra <[email protected]>

Track the blocked-on relation for mutexes, this allows following this
relation at schedule time.

  task
    | blocked-on
    v
  mutex
    | owner
    v
  task
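
For context, later patches in the series walk this relation roughly
as in the sketch below. This is illustrative only: __mutex_owner() is
mutex-internal, and all the locking and re-checking done by the real
code is omitted here.

/* Follow task -> blocked_on mutex -> owner -> ... to a runnable owner. */
static struct task_struct *follow_blocked_on(struct task_struct *p)
{
        struct mutex *m;
        struct task_struct *owner;

        while ((m = p->blocked_on)) {
                owner = __mutex_owner(m);
                if (!owner)
                        break;          /* lock was just released */
                p = owner;
        }
        return p;
}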

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[minor changes while rebasing]
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
This is critical so we don't get circular blocked_on relationships
that can't be resolved.
v5:
* Fix potential bug where the skip_wait path might clear blocked_on
when that path never set it
* Slight tweaks to where we set blocked_on to make it consistent,
along with extra WARN_ON correctness checking
* Minor comment changes
---
include/linux/sched.h | 5 +----
kernel/fork.c | 3 +--
kernel/locking/mutex-debug.c | 9 +++++----
kernel/locking/mutex.c | 10 ++++++++++
kernel/locking/ww_mutex.h | 17 +++++++++++++++--
5 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f5b0710c0f1..22a6ac47d5fb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1145,10 +1145,7 @@ struct task_struct {
struct rt_mutex_waiter *pi_blocked_on;
#endif

-#ifdef CONFIG_DEBUG_MUTEXES
- /* Mutex deadlock detection: */
- struct mutex_waiter *blocked_on;
-#endif
+ struct mutex *blocked_on; /* lock we're blocked on */

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
int non_block_count;
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..1c3f7eaa9239 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2455,9 +2455,8 @@ __latent_entropy struct task_struct *copy_process(
lockdep_init_task(p);
#endif

-#ifdef CONFIG_DEBUG_MUTEXES
p->blocked_on = NULL; /* not blocked yet */
-#endif
+
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
p->sequential_io_avg = 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index bc8abb8549d2..7228909c3e62 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -52,17 +52,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
{
lockdep_assert_held(&lock->wait_lock);

- /* Mark the current thread as blocked on the lock: */
- task->blocked_on = waiter;
+ /* Current thread can't be already blocked (since it's executing!) */
+ DEBUG_LOCKS_WARN_ON(task->blocked_on);
}

void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
{
+ struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
- DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
- task->blocked_on = NULL;
+ DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);

INIT_LIST_HEAD(&waiter->list);
waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2c5d1a9cf767..73064e4865b7 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -622,6 +622,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err_early_kill;
}

+ current->blocked_on = lock;
set_current_state(state);
trace_contention_begin(lock, LCB_F_MUTEX);
for (;;) {
@@ -662,6 +663,10 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas

first = __mutex_waiter_is_first(lock, &waiter);

+ /*
+ * Gets reset by the unlock path.
+ */
+ current->blocked_on = lock;
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -682,6 +687,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}
raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
+ current->blocked_on = NULL;
__set_current_state(TASK_RUNNING);

if (ww_ctx) {
@@ -712,9 +718,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;

err:
+ current->blocked_on = NULL;
__set_current_state(TASK_RUNNING);
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
+ WARN_ON(current->blocked_on);
trace_contention_end(lock, ret);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
@@ -926,6 +934,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;

debug_mutex_wake_waiter(lock, waiter);
+ WARN_ON(next->blocked_on != lock);
+ next->blocked_on = NULL;
wake_q_add(&wake_q, next);
}

diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 8b94f4b89e74..8bb334491732 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -284,6 +284,13 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
+ /*
+ * When waking up the task to die, be sure to clear the
+ * blocked_on pointer. Otherwise we can see circular
+ * blocked_on relationships that can't resolve.
+ */
+ WARN_ON(waiter->task->blocked_on != lock);
+ waiter->task->blocked_on = NULL;
wake_q_add(wake_q, waiter->task);
}

@@ -331,9 +338,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* it's wounded in __ww_mutex_check_kill() or has a
* wakeup pending to re-read the wounded state.
*/
- if (owner != current)
+ if (owner != current) {
+ /*
+ * When waking up the task to wound, be sure to clear the
+ * blocked_on pointer. Otherwise we can see circular
+ * blocked_on relationships that can't resolve.
+ */
+ owner->blocked_on = NULL;
wake_q_add(wake_q, owner);
-
+ }
return true;
}

--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:36

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 19/20] sched: Add blocked_donor link to task for smarter mutex handoffs

From: Peter Zijlstra <[email protected]>

Add a link to the task this task is proxying for, and use it
to do an intelligent hand-off of the owned mutex to the task
we're running on behalf of.
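
Condensed, the waiter selection in __mutex_unlock_slowpath() becomes
roughly the sketch below (illustrative only; the wait_lock/blocked_lock
handling and the blocked_on_waking update are in the diff that follows,
and get_task_blocked_on() is the helper introduced earlier in the
series):

/* Prefer the task that was boosting us through this lock. */
struct task_struct *next = NULL;
struct task_struct *donor = current->blocked_donor;

if (donor && get_task_blocked_on(donor) == lock) {
        next = donor;
        current->blocked_donor = NULL;
}
/* Failing that, fall back to the first waiter on the wait list. */
if (!next && !list_empty(&lock->wait_list))
        next = list_first_entry(&lock->wait_list,
                                struct mutex_waiter, list)->task;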

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split out from larger proxy patch

v6:
* Moved proxied value from earlier patch to this one where it
is actually used
* Rework logic to check sched_proxy_exec() instead of using ifdefs
* Moved comment change to this patch where it makes sense
---
include/linux/sched.h | 1 +
kernel/fork.c | 1 +
kernel/locking/mutex.c | 35 ++++++++++++++++++++++++++++++++---
kernel/sched/core.c | 19 +++++++++++++++++--
4 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 47c7095b918a..9bff2f123207 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1145,6 +1145,7 @@ struct task_struct {
struct rt_mutex_waiter *pi_blocked_on;
#endif

+ struct task_struct *blocked_donor; /* task that is boosting us */
struct mutex *blocked_on; /* lock we're blocked on */
bool blocked_on_waking; /* blocked on, but waking */
raw_spinlock_t blocked_lock;
diff --git a/kernel/fork.c b/kernel/fork.c
index 930947bf4569..6604e0472da0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2456,6 +2456,7 @@ __latent_entropy struct task_struct *copy_process(
lockdep_init_task(p);
#endif

+ p->blocked_donor = NULL; /* nobody is boosting us yet */
p->blocked_on = NULL; /* not blocked yet */
p->blocked_on_waking = false; /* not blocked yet */

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5394a3c4b5d9..f7187a247482 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -907,7 +907,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
*/
static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
{
- struct task_struct *next = NULL;
+ struct task_struct *donor, *next = NULL;
DEFINE_WAKE_Q(wake_q);
unsigned long owner;
unsigned long flags;
@@ -945,7 +945,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
preempt_disable();
raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
- if (!list_empty(&lock->wait_list)) {
+
+ if (sched_proxy_exec()) {
+ raw_spin_lock(&current->blocked_lock);
+ /*
+ * If we have a task boosting us, and that task was boosting us through
+ * this lock, hand the lock to that task, as that is the highest
+ * waiter, as selected by the scheduling function.
+ */
+ donor = current->blocked_donor;
+ if (donor) {
+ struct mutex *next_lock;
+
+ raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
+ next_lock = get_task_blocked_on(donor);
+ if (next_lock == lock) {
+ next = donor;
+ donor->blocked_on_waking = true;
+ wake_q_add(&wake_q, donor);
+ current->blocked_donor = NULL;
+ }
+ raw_spin_unlock(&donor->blocked_lock);
+ }
+ }
+
+ /*
+ * Failing that, pick any on the wait list.
+ */
+ if (!next && !list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
struct mutex_waiter *waiter =
list_first_entry(&lock->wait_list,
@@ -954,7 +981,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;

debug_mutex_wake_waiter(lock, waiter);
- raw_spin_lock(&next->blocked_lock);
+ raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING);
WARN_ON(next->blocked_on != lock);
next->blocked_on_waking = true;
raw_spin_unlock(&next->blocked_lock);
@@ -964,6 +991,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);

+ if (sched_proxy_exec())
+ raw_spin_unlock(&current->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 760e2753a24c..6ac7a241dacc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6782,7 +6782,17 @@ static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
* Find who @next (currently blocked on a mutex) can proxy for.
*
* Follow the blocked-on relation:
- * task->blocked_on -> mutex->owner -> task...
+ *
+ *                 ,-> task
+ *                 |     | blocked-on
+ *                 |     v
+ *   blocked_donor |   mutex
+ *                 |     | owner
+ *                 |     v
+ *                 `-- task
+ *
+ * and set the blocked_donor relation, this latter is used by the mutex
+ * code to find which (blocked) task to hand-off to.
*
* Lock order:
*
@@ -6919,6 +6929,8 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
*/
raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
+
+ owner->blocked_donor = p;
}

WARN_ON_ONCE(owner && !owner->on_rq);
@@ -7003,6 +7015,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
+ bool proxied;
int cpu;
bool preserve_need_resched = false;

@@ -7053,9 +7066,11 @@ static void __sched notrace __schedule(unsigned int sched_mode)
switch_count = &prev->nvcsw;
}

+ proxied = !!prev->blocked_donor;
pick_again:
next = pick_next_task(rq, rq_selected(rq), &rf);
rq_set_selected(rq, next);
+ next->blocked_donor = NULL;
if (unlikely(task_is_blocked(next))) {
next = proxy(rq, next, &rf);
if (!next) {
@@ -7119,7 +7134,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq = context_switch(rq, prev, next, &rf);
} else {
/* In case next was already curr but just got blocked_donor*/
- if (unlikely(!task_current_selected(rq, next)))
+ if (unlikely(!proxied && next->blocked_donor))
proxy_tag_curr(rq, next);

rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:37

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 18/20] sched: Handle blocked-waiter migration (and return migration)

Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run,
this patch also modifies the scheduling classes to avoid
sched class migrations on mutex blocked tasks, leaving
proxy() to do the migrations and return migrations.

This was split out from the larger proxy patch, and
significantly reworked to avoid changes to the try_to_wakeup()
call path.

Credits for the original patch go to:
Peter Zijlstra (Intel) <[email protected]>
Juri Lelli <[email protected]>
Valentin Schneider <[email protected]>
Connor O'Brien <[email protected]>

NOTE: The return migration is further complicated in that we
need to take the pi_lock in order to decide which cpu we should
migrate back to. This requires dropping the current rq lock,
grabbing the pi_lock, re-taking the current rq lock, picking a
cpu, deactivating the task, switching its cpu, dropping the
current rq lock, grabbing the target rq, activating the task,
and then dropping the target rq and reacquiring the current
rq. This seems overly complex, so suggestions for a better
approach would be welcome!
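
Spelled out as a sketch (for review only, not the exact code in the
diff below), the return-migration lock dance is roughly:

/* rq->lock held, 'next' cannot run here and needs return migration: */
raw_spin_rq_unlock(rq);                         /* drop current rq lock  */
raw_spin_lock(&next->pi_lock);                  /* grab pi_lock          */
raw_spin_rq_lock(rq);                           /* re-take current rq    */
cpu = select_task_rq(next, next->wake_cpu, WF_TTWU);    /* pick a cpu    */
deactivate_task(rq, next, 0);
set_task_cpu(next, cpu);
raw_spin_rq_unlock(rq);                         /* drop current rq lock  */
raw_spin_rq_lock(cpu_rq(cpu));                  /* grab target rq        */
activate_task(cpu_rq(cpu), next, 0);
raw_spin_rq_unlock(cpu_rq(cpu));                /* drop target rq        */
raw_spin_rq_lock(rq);                           /* re-acquire current rq */
raw_spin_unlock(&next->pi_lock);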

TODO: Seeing stalls/hangs after running for a while with this
patch, which suggests we're losing track of a task somewhere
in the migrations.
[ 880.032744] BUG: workqueue lockup - pool cpus=11 node=0 flags=0x0 nice=0 stuck for 58s!
...
[ 1443.185762] watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [irqbalance:1880]
...

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migrate to migrate deeper task in chain, to avoid
tasks pingponging between rqs
---
kernel/sched/core.c | 183 ++++++++++++++++++++++++++++++++++++++--
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 4 +-
kernel/sched/rt.c | 9 +-
4 files changed, 187 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7b5cb5d8dc3..760e2753a24c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3000,8 +3000,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
struct set_affinity_pending my_pending = { }, *pending = NULL;
bool stop_pending, complete = false;

- /* Can the task run on the task's current CPU? If so, we're done */
- if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
+ /*
+ * Can the task run on the task's current CPU? If so, we're done
+ *
+ * We are also done if the task is selected, boosting a lock-
+ * holding proxy, (and potentially has been migrated outside its
+ * current or previous affinity mask)
+ */
+ if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
+ (task_current_selected(rq, p) && !task_current(rq, p))) {
struct task_struct *push_task = NULL;

if ((flags & SCA_MIGRATE_ENABLE) &&
@@ -6636,6 +6643,141 @@ bool proxy_deactivate(struct rq *rq, struct task_struct *next)
return true;
}

+static struct task_struct *
+proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int that_cpu)
+{
+ struct rq *that_rq;
+ int wake_cpu;
+
+ /*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * @owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (@owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ */
+ that_rq = cpu_rq(that_cpu);
+
+ /*
+ * @owner can disappear, but simply migrate to @that_cpu and leave
+ * that CPU to sort things out.
+ */
+
+ /*
+ * Since we're going to drop @rq, we have to put(@rq_selected) first,
+ * otherwise we have a reference that no longer belongs to us. Use
+ * @rq->idle to fill the void and make the next pick_next_task()
+ * invocation happy.
+ *
+ *     CPU0                            CPU1
+ *
+ *     B mutex_lock(X)
+ *
+ *     A mutex_lock(X) <- B
+ *     A __schedule()
+ *     A pick->A
+ *     A proxy->B
+ *     A migrate A to CPU1
+ *                                      B mutex_unlock(X) -> A
+ *                                      B __schedule()
+ *                                      B pick->A
+ *                                      B switch_to (A)
+ *                                      A ... does stuff
+ *     A ... is still running here
+ *
+ *             * BOOM *
+ */
+ put_prev_task(rq, rq_selected(rq));
+ rq_set_selected(rq, rq->idle);
+ set_next_task(rq, rq_selected(rq));
+ WARN_ON(p == rq->curr);
+
+ wake_cpu = p->wake_cpu;
+ deactivate_task(rq, p, 0);
+ set_task_cpu(p, that_cpu);
+ /*
+ * Preserve p->wake_cpu, such that we can tell where it
+ * used to run later.
+ */
+ p->wake_cpu = wake_cpu;
+
+ rq_unpin_lock(rq, rf);
+ __balance_callbacks(rq);
+
+ raw_spin_rq_unlock(rq);
+ raw_spin_rq_lock(that_rq);
+
+ activate_task(that_rq, p, 0);
+ check_preempt_curr(that_rq, p, 0);
+
+ raw_spin_rq_unlock(that_rq);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+
+ return NULL; /* Retry task selection on _this_ CPU. */
+}
+
+static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *next)
+{
+ if (!sched_proxy_exec())
+ return false;
+
+ if (next->blocked_on && next->blocked_on_waking) {
+ if (!is_cpu_allowed(next, cpu_of(rq))) {
+ struct rq *that_rq;
+ int cpu;
+
+ if (next == rq->curr) {
+ /* can't migrate curr, so return and let caller sort it */
+ return true;
+ }
+
+ put_prev_task(rq, rq_selected(rq));
+ rq_set_selected(rq, rq->idle);
+
+ /* First unpin & run balance callbacks */
+ rq_unpin_lock(rq, rf);
+ __balance_callbacks(rq);
+ /*
+ * Drop the rq lock so we can get pi_lock,
+ * then reacquire it again to figure out
+ * where to send it.
+ */
+ raw_spin_rq_unlock(rq);
+ raw_spin_lock(&next->pi_lock);
+ rq_lock(rq, rf);
+
+ cpu = select_task_rq(next, next->wake_cpu, WF_TTWU);
+
+ deactivate_task(rq, next, 0);
+ set_task_cpu(next, cpu);
+ that_rq = cpu_rq(cpu);
+
+ /* drop this rq lock and grab that_rq's */
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+ raw_spin_rq_lock(that_rq);
+
+ activate_task(that_rq, next, 0);
+ check_preempt_curr(that_rq, next, 0);
+
+ /* drop that_rq's lock and re-grab this' */
+ raw_spin_rq_unlock(that_rq);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+
+ raw_spin_unlock(&next->pi_lock);
+
+ return true;
+ }
+ }
+ return false;
+}
+
/*
* Find who @next (currently blocked on a mutex) can proxy for.
*
@@ -6658,7 +6800,8 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
struct task_struct *ret = NULL;
struct task_struct *p = next;
struct task_struct *owner = NULL;
- int this_cpu;
+ bool curr_in_chain = false;
+ int this_cpu, that_cpu;
struct mutex *mutex;

this_cpu = cpu_of(rq);
@@ -6694,6 +6837,9 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
return NULL;
}

+ if (task_current(rq, p))
+ curr_in_chain = true;
+
owner = __mutex_owner(mutex);
if (!owner) {
raw_spin_unlock(&p->blocked_lock);
@@ -6702,12 +6848,17 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
}

if (task_cpu(owner) != this_cpu) {
- /* XXX Don't handle migrations yet */
- if (!proxy_deactivate(rq, next))
- ret = next;
+ that_cpu = task_cpu(owner);
+ /*
+ * @owner can disappear, simply migrate to @that_cpu and leave that CPU
+ * to sort things out.
+ */
raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
- return ret;
+ if (curr_in_chain)
+ return proxy_resched_idle(rq, next);
+
+ return proxy_migrate_task(rq, rf, p, that_cpu);
}

if (task_on_rq_migrating(owner)) {
@@ -6788,7 +6939,14 @@ static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
}
+
#else /* PROXY_EXEC */
+static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *next)
+{
+ return false;
+}
+
static struct task_struct *
proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
{
@@ -6909,6 +7067,14 @@ static void __sched notrace __schedule(unsigned int sched_mode)
if (next == rq->idle && prev == rq->idle)
preserve_need_resched = true;
}
+ if (unlikely(proxy_return_migration(rq, &rf, next))) {
+ if (next != rq->curr)
+ goto pick_again;
+
+ rq_set_selected(rq, rq->idle);
+ set_tsk_need_resched(rq->idle);
+ next = rq->idle;
+ }

if (!preserve_need_resched)
clear_tsk_need_resched(prev);
@@ -7006,6 +7172,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
*/
SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);

+ if (task_is_blocked(tsk))
+ return;
+
/*
* If we are going to sleep and we have plugged IO queued,
* make sure to submit it to avoid deadlocks.
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e8bca6b8da6f..99788cfd8835 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1731,7 +1731,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)

enqueue_dl_entity(&p->dl, flags);

- if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+ if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p))
enqueue_pushable_dl_task(rq, p);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f334b129b269..f2dee89f475b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8220,7 +8220,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
goto idle;

#ifdef CONFIG_FAIR_GROUP_SCHED
- if (!prev || prev->sched_class != &fair_sched_class)
+ if (!prev ||
+ prev->sched_class != &fair_sched_class ||
+ rq->curr != rq_selected(rq))
goto simple;

/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0125a3ae5a7a..d4f5625e4433 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1549,6 +1549,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
if (p->nr_cpus_allowed == 1)
return;

+ if (task_is_blocked(p))
+ return;
+
enqueue_pushable_task(rq, p);
}

@@ -1836,10 +1839,12 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)

update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);

- /* Avoid marking selected as pushable */
- if (task_current_selected(rq, p))
+ /* Avoid marking current or selected as pushable */
+ if (task_current(rq, p) || task_current_selected(rq, p))
return;

+ if (task_is_blocked(p))
+ return;
/*
* The previous task needs to be made eligible for pushing
* if it is still active
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:39

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 10/20] locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC

From: Peter Zijlstra <[email protected]>

Since with PROXY_EXEC we will want to hand off locks to the
tasks we are running on behalf of, switch to using mutex
handoffs.
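
Condensed, the unlock slow path change amounts to the sketch below
(illustrative only; the existing release/cmpxchg loop is merely hinted
at here, see the diff that follows for the real thing):

if (sched_proxy_exec()) {
        /* Always behave as if a handoff was requested. */
        owner = MUTEX_FLAG_HANDOFF;
} else {
        owner = atomic_long_read(&lock->owner);
        /* ... existing release-or-break-on-HANDOFF cmpxchg loop ... */
}
/* later, once a waiter 'next' has been chosen: */
if (owner & MUTEX_FLAG_HANDOFF)
        __mutex_handoff(lock, next);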

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebased, added comments and changelog]
Signed-off-by: Juri Lelli <[email protected]>
[Fixed rebase conflicts]
[squashed sched: Ensure blocked_on is always guarded by blocked_lock]
Signed-off-by: Valentin Schneider <[email protected]>
[fix rebase conflicts, various fixes & tweaks commented inline]
[squashed sched: Use rq->curr vs rq->proxy checks]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Split out only the very basic initial framework
for proxy logic from a larger patch.]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split out from core proxy patch
v6:
* Rework to use sched_proxy_exec() instead of #ifdef CONFIG_PROXY_EXEC
---
kernel/Kconfig.locks | 2 +-
kernel/locking/mutex.c | 39 ++++++++++++++++++++++-----------------
2 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 4198f0273ecd..791c98f1d329 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -226,7 +226,7 @@ config ARCH_SUPPORTS_ATOMIC_RMW

config MUTEX_SPIN_ON_OWNER
def_bool y
- depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW
+ depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW && !PROXY_EXEC

config RWSEM_SPIN_ON_OWNER
def_bool y
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index f37b7afe8aa5..5394a3c4b5d9 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -914,26 +914,31 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne

mutex_release(&lock->dep_map, ip);

- /*
- * Release the lock before (potentially) taking the spinlock such that
- * other contenders can get on with things ASAP.
- *
- * Except when HANDOFF, in that case we must not clear the owner field,
- * but instead set it to the top waiter.
- */
- owner = atomic_long_read(&lock->owner);
- for (;;) {
- MUTEX_WARN_ON(__owner_task(owner) != current);
- MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);
-
- if (owner & MUTEX_FLAG_HANDOFF)
- break;
+ if (sched_proxy_exec()) {
+ /* Always force HANDOFF for Proxy Exec for now. Revisit. */
+ owner = MUTEX_FLAG_HANDOFF;
+ } else {
+ /*
+ * Release the lock before (potentially) taking the spinlock
+ * such that other contenders can get on with things ASAP.
+ *
+ * Except when HANDOFF, in that case we must not clear the
+ * owner field, but instead set it to the top waiter.
+ */
+ owner = atomic_long_read(&lock->owner);
+ for (;;) {
+ MUTEX_WARN_ON(__owner_task(owner) != current);
+ MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);

- if (atomic_long_try_cmpxchg_release(&lock->owner, &owner, __owner_flags(owner))) {
- if (owner & MUTEX_FLAG_WAITERS)
+ if (owner & MUTEX_FLAG_HANDOFF)
break;

- return;
+ if (atomic_long_try_cmpxchg_release(&lock->owner, &owner,
+ __owner_flags(owner))) {
+ if (owner & MUTEX_FLAG_WAITERS)
+ break;
+ return;
+ }
}
}

--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:37:50

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 17/20] sched: Start blocked_on chain processing in proxy()

From: Peter Zijlstra <[email protected]>

Start to flesh out the real proxy() implementation, but
avoid the migration cases for now; in those cases just
deactivate the selected task and pick again.

To ensure the selected task or other blocked tasks in
the chain aren't migrated away while we're running the
proxy, this patch also tweaks CFS logic to avoid migrating
selected or mutex-blocked tasks.
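
The shape of the chain walk added below can be sketched as follows
(illustrative only; the migration, blocked-owner and owner == p
special cases handled in the diff are left out here):

for (p = next; task_is_blocked(p); p = owner) {
        mutex = p->blocked_on;
        if (!mutex)
                return NULL;                    /* chain changed, pick again */

        raw_spin_lock(&mutex->wait_lock);       /* pins the owner            */
        raw_spin_lock(&p->blocked_lock);
        if (mutex != p->blocked_on) {           /* re-check under the locks  */
                raw_spin_unlock(&p->blocked_lock);
                raw_spin_unlock(&mutex->wait_lock);
                return NULL;
        }
        owner = __mutex_owner(mutex);
        if (!owner) {                           /* lock was just released    */
                raw_spin_unlock(&p->blocked_lock);
                raw_spin_unlock(&mutex->wait_lock);
                return p;
        }
        raw_spin_unlock(&p->blocked_lock);
        raw_spin_unlock(&mutex->wait_lock);
}
return owner;                                   /* the execution context     */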

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: This change was split out from the larger proxy patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split this out from larger proxy patch
---
kernel/sched/core.c | 162 ++++++++++++++++++++++++++++++++++++--------
kernel/sched/fair.c | 10 ++-
2 files changed, 143 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c148ee5dcf7e..c7b5cb5d8dc3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -95,6 +95,7 @@
#include "../workqueue_internal.h"
#include "../../io_uring/io-wq.h"
#include "../smpboot.h"
+#include "../locking/mutex.h"

EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
@@ -6613,6 +6614,15 @@ static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,

#ifdef CONFIG_PROXY_EXEC

+static inline struct task_struct *
+proxy_resched_idle(struct rq *rq, struct task_struct *next)
+{
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ set_tsk_need_resched(rq->idle);
+ return rq->idle;
+}
+
bool proxy_deactivate(struct rq *rq, struct task_struct *next)
{
unsigned long state = READ_ONCE(next->__state);
@@ -6622,52 +6632,146 @@ bool proxy_deactivate(struct rq *rq, struct task_struct *next)
return false;
if (!try_to_deactivate_task(rq, next, state, true))
return false;
- put_prev_task(rq, next);
- rq_set_selected(rq, rq->idle);
- resched_curr(rq);
+ proxy_resched_idle(rq, next);
return true;
}

/*
- * Initial simple proxy that just returns the task if it's waking
- * or deactivates the blocked task so we can pick something that
- * isn't blocked.
+ * Find who @next (currently blocked on a mutex) can proxy for.
+ *
+ * Follow the blocked-on relation:
+ * task->blocked_on -> mutex->owner -> task...
+ *
+ * Lock order:
+ *
+ * p->pi_lock
+ * rq->lock
+ * mutex->wait_lock
+ * p->blocked_lock
+ *
+ * Returns the task that is going to be used as execution context (the one
+ * that is actually going to be put to run on cpu_of(rq)).
*/
static struct task_struct *
proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
{
struct task_struct *ret = NULL;
struct task_struct *p = next;
+ struct task_struct *owner = NULL;
+ int this_cpu;
struct mutex *mutex;
- mutex = p->blocked_on;
- /* Something changed in the chain, pick_again */
- if (!mutex)
- return NULL;
+
+ this_cpu = cpu_of(rq);
+
/*
- * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
- * and ensure @owner sticks around.
+ * Follow blocked_on chain.
+ *
+ * TODO: deadlock detection
*/
- raw_spin_lock(&mutex->wait_lock);
- raw_spin_lock(&p->blocked_lock);
+ for (p = next; task_is_blocked(p); p = owner) {
+ mutex = p->blocked_on;
+ /* Something changed in the chain, pick_again */
+ if (!mutex)
+ return NULL;

- /* Check again that p is blocked with blocked_lock held */
- if (!task_is_blocked(p) || mutex != p->blocked_on) {
/*
- * Something changed in the blocked_on chain and
- * we don't know if only at this level. So, let's
- * just bail out completely and let __schedule
- * figure things out (pick_again loop).
+ * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+ * and ensure @owner sticks around.
+ */
+ raw_spin_lock(&mutex->wait_lock);
+ raw_spin_lock(&p->blocked_lock);
+
+ /* Check again that p is blocked with blocked_lock held */
+ if (mutex != p->blocked_on) {
+ /*
+ * Something changed in the blocked_on chain and
+ * we don't know if only at this level. So, let's
+ * just bail out completely and let __schedule
+ * figure things out (pick_again loop).
+ */
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return NULL;
+ }
+
+ owner = __mutex_owner(mutex);
+ if (!owner) {
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return p;
+ }
+
+ if (task_cpu(owner) != this_cpu) {
+ /* XXX Don't handle migrations yet */
+ if (!proxy_deactivate(rq, next))
+ ret = next;
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return ret;
+ }
+
+ if (task_on_rq_migrating(owner)) {
+ /*
+ * One of the chain of mutex owners is currently migrating to this
+ * CPU, but has not yet been enqueued because we are holding the
+ * rq lock. As a simple solution, just schedule rq->idle to give
+ * the migration a chance to complete. Much like the migrate_task
+ * case we should end up back in proxy(), this time hopefully with
+ * all relevant tasks already enqueued.
+ */
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return proxy_resched_idle(rq, next);
+ }
+
+ if (!owner->on_rq) {
+ /* XXX Don't handle blocked owners yet */
+ if (!proxy_deactivate(rq, next))
+ ret = next;
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return ret;
+ }
+
+ if (owner == p) {
+ /*
+ * It's possible we interleave with mutex_unlock like:
+ *
+ * lock(&rq->lock);
+ * proxy()
+ * mutex_unlock()
+ * lock(&wait_lock);
+ * next(owner) = current->blocked_donor;
+ * unlock(&wait_lock);
+ *
+ * wake_up_q();
+ * ...
+ * ttwu_runnable()
+ * __task_rq_lock()
+ * lock(&wait_lock);
+ * owner == p
+ *
+ * Which leaves us to finish the ttwu_runnable() and make it go.
+ *
+ * So schedule rq->idle so that ttwu_runnable can get the rq lock
+ * and mark owner as running.
+ */
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return proxy_resched_idle(rq, next);
+ }
+
+ /*
+ * OK, now we're absolutely sure @owner is not blocked _and_
+ * on this rq, therefore holding @rq->lock is sufficient to
+ * guarantee its existence, as per ttwu_remote().
*/
raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
- return ret;
}

- if (!proxy_deactivate(rq, next))
- ret = p;
- raw_spin_unlock(&p->blocked_lock);
- raw_spin_unlock(&mutex->wait_lock);
- return ret;
+ WARN_ON_ONCE(owner && !owner->on_rq);
+ return owner;
}

static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
@@ -6742,6 +6846,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
struct rq_flags rf;
struct rq *rq;
int cpu;
+ bool preserve_need_resched = false;

cpu = smp_processor_id();
rq = cpu_rq(cpu);
@@ -6801,9 +6906,12 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq_repin_lock(rq, &rf);
goto pick_again;
}
+ if (next == rq->idle && prev == rq->idle)
+ preserve_need_resched = true;
}

- clear_tsk_need_resched(prev);
+ if (!preserve_need_resched)
+ clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
rq->last_seen_need_resched_ns = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1aca675985b2..f334b129b269 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8752,7 +8752,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/* Disregard pcpu kthreads; they are where they need to be. */
if (kthread_is_per_cpu(p))
return 0;
-
+ if (task_is_blocked(p))
+ return 0;
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;

@@ -8789,7 +8790,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/* Record that we found at least one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;

- if (task_on_cpu(env->src_rq, p)) {
+ if (task_on_cpu(env->src_rq, p) ||
+ task_current_selected(env->src_rq, p)) {
schedstat_inc(p->stats.nr_failed_migrations_running);
return 0;
}
@@ -8828,6 +8830,10 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
{
lockdep_assert_rq_held(env->src_rq);

+ BUG_ON(task_is_blocked(p));
+ BUG_ON(task_current(env->src_rq, p));
+ BUG_ON(task_current_selected(env->src_rq, p));
+
deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, env->dst_cpu);
}
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:38:31

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 20/20] sched: Add deactivated (sleeping) owner handling to proxy()

From: Peter Zijlstra <[email protected]>

Adds an implementation of (sleeping) deactivated owner handling,
where we queue the selected task on the deactivated owner task
and deactivate it as well, re-activating it later when the owner
is woken up.

NOTE: This has been particularly challenging to get working
properly, and some of the locking is quite awkward. I'd very
much appreciate review and feedback on ways to simplify this.
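
The two halves of the mechanism can be sketched as below. This is
illustrative only: all the locking, the pi_lock ordering and the
races dealt with in the diff are omitted, and target_rq is simply
whatever rq the woken owner is being activated on.

/* proxy(): the mutex owner is not on a runqueue, so park the selected
 * task on it instead of spinning on an unrunnable chain. */
deactivate_task(rq, next, DEQUEUE_SLEEP);
next->sleeping_owner = owner;
list_add(&next->blocked_node, &owner->blocked_head);

/* wakeup path: the owner is being activated, re-activate everything
 * that was parked on it (the real code recurses for nested chains). */
while (!list_empty(&owner->blocked_head)) {
        struct task_struct *pp;

        pp = list_first_entry(&owner->blocked_head,
                              struct task_struct, blocked_node);
        list_del_init(&pp->blocked_node);
        pp->sleeping_owner = NULL;
        activate_task(target_rq, pp, ENQUEUE_WAKEUP);
}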

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: This was broken out from the larger proxy() patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Split out from larger proxy patch
v6:
* Major rework, replacing the single list head per task with
per-task list head and nodes, creating a tree structure so
we only wake up descendants of the task woken.
* Reworked the locking to take the task->pi_lock, so we can
avoid mid-chain wakeup races from try_to_wake_up() called by
the ww_mutex logic.
---
include/linux/sched.h | 3 +
kernel/fork.c | 4 +-
kernel/sched/core.c | 198 ++++++++++++++++++++++++++++++++++++++++--
3 files changed, 196 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9bff2f123207..c5aa0208104f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1148,6 +1148,9 @@ struct task_struct {
struct task_struct *blocked_donor; /* task that is boosting us */
struct mutex *blocked_on; /* lock we're blocked on */
bool blocked_on_waking; /* blocked on, but waking */
+ struct list_head blocked_head; /* tasks blocked on us */
+ struct list_head blocked_node; /* our entry on someone elses blocked_head */
+ struct task_struct *sleeping_owner; /* task our blocked_node is enqueued on */
raw_spinlock_t blocked_lock;

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
diff --git a/kernel/fork.c b/kernel/fork.c
index 6604e0472da0..bbcf2697652f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2459,7 +2459,9 @@ __latent_entropy struct task_struct *copy_process(
p->blocked_donor = NULL; /* nobody is boosting us yet */
p->blocked_on = NULL; /* not blocked yet */
p->blocked_on_waking = false; /* not blocked yet */
-
+ INIT_LIST_HEAD(&p->blocked_head);
+ INIT_LIST_HEAD(&p->blocked_node);
+ p->sleeping_owner = NULL;
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
p->sequential_io_avg = 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ac7a241dacc..8f87318784d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3804,6 +3804,119 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
trace_sched_wakeup(p);
}

+#ifdef CONFIG_PROXY_EXEC
+static void do_activate_task(struct rq *rq, struct task_struct *p, int en_flags)
+{
+ lockdep_assert_rq_held(rq);
+
+ if (!sched_proxy_exec()) {
+ activate_task(rq, p, en_flags);
+ return;
+ }
+
+ if (p->sleeping_owner) {
+ struct task_struct *owner = p->sleeping_owner;
+
+ raw_spin_lock(&owner->blocked_lock);
+ list_del_init(&p->blocked_node);
+ p->sleeping_owner = NULL;
+ raw_spin_unlock(&owner->blocked_lock);
+ }
+
+ /*
+ * By calling activate_task with blocked_lock held, we order against
+ * the proxy() blocked_task case such that no more blocked tasks will
+ * be enqueued on p once we release p->blocked_lock.
+ */
+ raw_spin_lock(&p->blocked_lock);
+ WARN_ON(task_cpu(p) != cpu_of(rq));
+ activate_task(rq, p, en_flags);
+ raw_spin_unlock(&p->blocked_lock);
+}
+
+static void activate_blocked_ents(struct rq *target_rq,
+ struct task_struct *owner,
+ int wake_flags)
+{
+ unsigned long flags;
+ struct rq_flags rf;
+ int target_cpu = cpu_of(target_rq);
+ int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
+
+ if (wake_flags & WF_MIGRATED)
+ en_flags |= ENQUEUE_MIGRATED;
+ /*
+ * A whole bunch of 'proxy' tasks back this blocked task, wake
+ * them all up to give this task its 'fair' share.
+ */
+ raw_spin_lock(&owner->blocked_lock);
+ while (!list_empty(&owner->blocked_head)) {
+ struct task_struct *pp;
+ unsigned int state;
+
+ pp = list_first_entry(&owner->blocked_head,
+ struct task_struct,
+ blocked_node);
+ BUG_ON(pp == owner);
+ list_del_init(&pp->blocked_node);
+ WARN_ON(!pp->sleeping_owner);
+ pp->sleeping_owner = NULL;
+ raw_spin_unlock(&owner->blocked_lock);
+
+ /* Nested as ttwu holds the owner's pi_lock */
+ /* XXX But how do we enforce ordering to avoid ABBA? */
+ raw_spin_lock_irqsave_nested(&pp->pi_lock, flags, SINGLE_DEPTH_NESTING);
+ smp_rmb();
+ state = READ_ONCE(pp->__state);
+ /* Avoid racing with ttwu */
+ if (state == TASK_WAKING) {
+ raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
+ raw_spin_lock(&owner->blocked_lock);
+ continue;
+ }
+ if (READ_ONCE(pp->on_rq)) {
+ /*
+ * We raced with a non mutex handoff activation of pp.
+ * That activation will also take care of activating
+ * all of the tasks after pp in the blocked_entry list,
+ * so we're done here.
+ */
+ raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
+ raw_spin_lock(&owner->blocked_lock);
+ continue;
+ }
+
+ __set_task_cpu(pp, target_cpu);
+
+ rq_lock_irqsave(target_rq, &rf);
+ update_rq_clock(target_rq);
+ do_activate_task(target_rq, pp, en_flags);
+ resched_curr(target_rq);
+ rq_unlock_irqrestore(target_rq, &rf);
+ raw_spin_unlock_irqrestore(&pp->pi_lock, flags);
+
+ /* recurse */
+ activate_blocked_ents(target_rq, pp, wake_flags);
+
+ raw_spin_lock(&owner->blocked_lock);
+ }
+ raw_spin_unlock(&owner->blocked_lock);
+}
+
+#else
+static inline void do_activate_task(struct rq *rq, struct task_struct *p,
+ int en_flags)
+{
+ activate_task(rq, p, en_flags);
+}
+
+static inline void activate_blocked_ents(struct rq *target_rq,
+ struct task_struct *owner,
+ int wake_flags)
+{
+}
+#endif
+
static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
@@ -3825,7 +3938,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
atomic_dec(&task_rq(p)->nr_iowait);
}

- activate_task(rq, p, en_flags);
+ do_activate_task(rq, p, en_flags);
+
check_preempt_curr(rq, p, wake_flags);

ttwu_do_wakeup(p);
@@ -3922,13 +4036,19 @@ void sched_ttwu_pending(void *arg)
update_rq_clock(rq);

llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
+ int wake_flags;
if (WARN_ON_ONCE(p->on_cpu))
smp_cond_load_acquire(&p->on_cpu, !VAL);

if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
set_task_cpu(p, cpu_of(rq));

- ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+ wake_flags = p->sched_remote_wakeup ? WF_MIGRATED : 0;
+ ttwu_do_activate(rq, p, wake_flags, &rf);
+ rq_unlock(rq, &rf);
+ activate_blocked_ents(rq, p, wake_flags);
+ rq_lock(rq, &rf);
+ update_rq_clock(rq);
}

/*
@@ -4069,6 +4189,15 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
update_rq_clock(rq);
ttwu_do_activate(rq, p, wake_flags, &rf);
rq_unlock(rq, &rf);
+
+ /*
+ * When activating blocked ents, we will take the entities
+ * pi_lock, so drop the owners. Would love suggestions for
+ * a better approach.
+ */
+ raw_spin_unlock(&p->pi_lock);
+ activate_blocked_ents(rq, p, wake_flags);
+ raw_spin_lock(&p->pi_lock);
}

/*
@@ -6778,6 +6907,31 @@ static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
return false;
}

+static void proxy_enqueue_on_owner(struct rq *rq, struct task_struct *owner,
+ struct task_struct *next)
+{
+ /*
+ * ttwu_activate() will pick them up and place them on whatever rq
+ * @owner will run next.
+ */
+ if (!owner->on_rq) {
+ BUG_ON(!next->on_rq);
+ deactivate_task(rq, next, DEQUEUE_SLEEP);
+ if (task_current_selected(rq, next)) {
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ }
+ /*
+ * ttwu_do_activate must not have a chance to activate p
+ * elsewhere before it's fully extricated from its old rq.
+ */
+ WARN_ON(next->sleeping_owner);
+ next->sleeping_owner = owner;
+ smp_mb();
+ list_add(&next->blocked_node, &owner->blocked_head);
+ }
+}
+
/*
* Find who @next (currently blocked on a mutex) can proxy for.
*
@@ -6807,7 +6961,6 @@ static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
static struct task_struct *
proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
{
- struct task_struct *ret = NULL;
struct task_struct *p = next;
struct task_struct *owner = NULL;
bool curr_in_chain = false;
@@ -6886,12 +7039,41 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
}

if (!owner->on_rq) {
- /* XXX Don't handle blocked owners yet */
- if (!proxy_deactivate(rq, next))
- ret = next;
- raw_spin_unlock(&p->blocked_lock);
+ /*
+ * rq->curr must not be added to the blocked_head list or else
+ * ttwu_do_activate could enqueue it elsewhere before it switches
+ * out here. The approach to avoiding this is the same as in the
+ * migrate_task case.
+ */
+ if (curr_in_chain) {
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return proxy_resched_idle(rq, next);
+ }
+
+ /*
+ * If !@owner->on_rq, holding @rq->lock will not pin the task,
+ * so we cannot drop @mutex->wait_lock until we're sure its a blocked
+ * task on this rq.
+ *
+ * We use @owner->blocked_lock to serialize against ttwu_activate().
+ * Either we see its new owner->on_rq or it will see our list_add().
+ */
+ if (owner != p) {
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_lock(&owner->blocked_lock);
+ }
+
+ proxy_enqueue_on_owner(rq, owner, next);
+
+ if (task_current_selected(rq, next)) {
+ put_prev_task(rq, next);
+ rq_set_selected(rq, rq->idle);
+ }
+ raw_spin_unlock(&owner->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
- return ret;
+
+ return NULL; /* retry task selection */
}

if (owner == p) {
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:38:57

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 06/20] locking/mutex: Add task_struct::blocked_lock to serialize changes to the blocked_on state

From: Peter Zijlstra <[email protected]>

This patch was split out from the later "sched: Add proxy
execution" patch.

Adds blocked_lock to the task_struct so we can safely keep track
of which tasks are blocked on us.

This will be used for tracking blocked-task/mutex chains with
the proxy-execution patch in a similar fashion to how priority
inheritance is done with rt_mutexes.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebased, added comments and changelog]
Signed-off-by: Juri Lelli <[email protected]>
[Fixed rebase conflicts]
[squashed sched: Ensure blocked_on is always guarded by blocked_lock]
Signed-off-by: Valentin Schneider <[email protected]>
[fix rebase conflicts, various fixes & tweaks commented inline]
[squashed sched: Use rq->curr vs rq->proxy checks]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Split out from bigger patch]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Split out into its own patch
v4:
* Remove verbose comments/questions to avoid review
distractions, as suggested by Dietmar
* Fixed nested block_on locking for ww_mutex access
---
include/linux/sched.h | 1 +
init/init_task.c | 1 +
kernel/fork.c | 1 +
kernel/locking/mutex.c | 24 ++++++++++++++++++++----
kernel/locking/ww_mutex.h | 6 ++++++
5 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 22a6ac47d5fb..a9258dae00e0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1146,6 +1146,7 @@ struct task_struct {
#endif

struct mutex *blocked_on; /* lock we're blocked on */
+ raw_spinlock_t blocked_lock;

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
int non_block_count;
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe6b..189ce67e9704 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -130,6 +130,7 @@ struct task_struct init_task
.journal_info = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+ .blocked_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
.timer_slack_ns = 50000, /* 50 usec default slack */
.thread_pid = &init_struct_pid,
.thread_group = LIST_HEAD_INIT(init_task.thread_group),
diff --git a/kernel/fork.c b/kernel/fork.c
index 1c3f7eaa9239..47b76ed5ddf6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2353,6 +2353,7 @@ __latent_entropy struct task_struct *copy_process(
ftrace_graph_init_task(p);

rt_mutex_init_task(p);
+ raw_spin_lock_init(&p->blocked_lock);

lockdep_assert_irqs_enabled();
#ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 73064e4865b7..df186c0bf4a9 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -592,6 +592,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}

raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -653,6 +654,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err;
}

+ raw_spin_unlock(&current->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* Make sure we do wakeups before calling schedule */
if (!wake_q_empty(&wake_q)) {
@@ -663,6 +665,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas

first = __mutex_waiter_is_first(lock, &waiter);

+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
/*
* Gets reset by unlock path().
*/
@@ -677,15 +681,23 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
break;

if (first) {
+ bool acquired;
+
+ /*
+ * mutex_optimistic_spin() can schedule, so we need to
+ * release these locks before calling it.
+ */
+ raw_spin_unlock(&current->blocked_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
- if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+ acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
+ if (acquired)
break;
trace_contention_begin(lock, LCB_F_MUTEX);
}
-
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
current->blocked_on = NULL;
__set_current_state(TASK_RUNNING);
@@ -712,6 +724,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
if (ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);

+ raw_spin_unlock(&current->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
@@ -724,6 +737,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
err_early_kill:
WARN_ON(current->blocked_on);
trace_contention_end(lock, ret);
+ raw_spin_unlock(&current->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
@@ -934,8 +948,10 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;

debug_mutex_wake_waiter(lock, waiter);
+ raw_spin_lock(&next->blocked_lock);
WARN_ON(next->blocked_on != lock);
next->blocked_on = NULL;
+ raw_spin_unlock(&next->blocked_lock);
wake_q_add(&wake_q, next);
}

diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 8bb334491732..2929a95b4272 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -281,6 +281,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
return false;

if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
+ /* nested as we should hold current->blocked_lock already */
+ raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
@@ -292,6 +294,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
WARN_ON(waiter->task->blocked_on != lock);
waiter->task->blocked_on = NULL;
wake_q_add(wake_q, waiter->task);
+ raw_spin_unlock(&waiter->task->blocked_lock);
}

return true;
@@ -339,6 +342,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* wakeup pending to re-read the wounded state.
*/
if (owner != current) {
+ /* nested as we should hold current->blocked_lock already */
+ raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
/*
* When waking up the task to wound, be sure to clear the
* blocked_on pointer. Otherwise we can see circular
@@ -346,6 +351,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
*/
owner->blocked_on = NULL;
wake_q_add(wake_q, owner);
+ raw_spin_unlock(&owner->blocked_lock);
}
return true;
}
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:39:07

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 12/20] sched: Fix runtime accounting w/ split exec & sched contexts

The idea here is we want to charge the scheduler-context task's
vruntime but charge the execution-context task's sum_exec_runtime.

This way cputime accounting goes against the task actually running
but vruntime accounting goes against the selected task so we get
proper fairness.
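
To make the intent concrete, here is a purely illustrative userspace
sketch (not kernel code; all names are made up for the example, and it
omits the weight scaling that calc_delta_fair() applies to vruntime):
execution time is charged to the task actually running, while the
fairness charge goes to the task the scheduler selected.

#include <stdio.h>

struct toy_task {
        const char *name;
        long long sum_exec_runtime;     /* ns actually executed */
        long long vruntime;             /* ns charged for fairness */
};

/* selected: scheduling context, running: execution context */
static void toy_update_curr(struct toy_task *selected,
                            struct toy_task *running,
                            long long delta_exec)
{
        running->sum_exec_runtime += delta_exec;  /* cputime -> running task */
        selected->vruntime += delta_exec;         /* fairness -> selected task */
}

int main(void)
{
        struct toy_task waiter = { "selected (blocked waiter)", 0, 0 };
        struct toy_task owner  = { "running (mutex owner)", 0, 0 };

        toy_update_curr(&waiter, &owner, 1000000);  /* one 1ms tick */

        printf("%-28s sum_exec=%lld vruntime=%lld\n", waiter.name,
               waiter.sum_exec_runtime, waiter.vruntime);
        printf("%-28s sum_exec=%lld vruntime=%lld\n", owner.name,
               owner.sum_exec_runtime, owner.vruntime);
        return 0;
}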

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched/fair.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d5c1ec34bf7..1aca675985b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1144,22 +1144,36 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_SMP */

-static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
{
u64 now = rq_clock_task(rq);
s64 delta_exec;

- delta_exec = now - curr->exec_start;
+ /* Calculate the delta from selected se */
+ delta_exec = now - se->exec_start;
if (unlikely(delta_exec <= 0))
return delta_exec;

- curr->exec_start = now;
- curr->sum_exec_runtime += delta_exec;
+ /* Update selected se's exec_start */
+ se->exec_start = now;
+ if (entity_is_task(se)) {
+ struct task_struct *running = rq->curr;
+ /*
+ * If se is a task, we account the time
+ * against the running task, as w/ proxy-exec
+ * they may not be the same.
+ */
+ running->se.exec_start = now;
+ running->se.sum_exec_runtime += delta_exec;
+ } else {
+ /* If not task, account the time against se */
+ se->sum_exec_runtime += delta_exec;
+ }

if (schedstat_enabled()) {
struct sched_statistics *stats;

- stats = __schedstats_from_se(curr);
+ stats = __schedstats_from_se(se);
__schedstat_set(stats->exec_max,
max(delta_exec, stats->exec_max));
}
--
2.42.0.869.gea05f2083d-goog

2023-11-06 19:39:33

by John Stultz

[permalink] [raw]
Subject: [PATCH v6 11/20] sched: Split scheduler execution context

From: Peter Zijlstra <[email protected]>

Let's define the scheduling context as all the scheduler state in
task_struct and the execution context as all state required to run
the task.

Currently both are intertwined in task_struct. We want to logically
split these such that we can run the execution context of one task
with the scheduling context of another.

To this purpose introduce rq_selected() macro to point to the
task_struct used for scheduler state and preserve rq->curr to
denote the execution context.

NOTE: Peter previously mentioned he didn't like the name
"rq_selected()", but I've not come up with a better alternative.
I'm very open to other name proposals.

Question for Peter: Dietmar suggested you'd prefer I drop the
conditionalization of the scheduler context pointer on the rq
(so rq_selected() would be open coded as rq->curr_sched or
whatever we agree on for a name), but I'd think in the
!CONFIG_PROXY_EXEC case we'd want to avoid the wasted pointer
and its use (since curr_sched would always be == curr)?
If I'm wrong I'm fine switching this, but would appreciate
clarification.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[add additional comments and update more sched_class code to use
rq::proxy]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Rebased and resolved minor collisions, reworked to use
accessors, tweaked update_curr_common to use rq_proxy fixing rt
scheduling issues]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Reworked to use accessors
* Fixed update_curr_common to use proxy instead of curr
v3:
* Tweaked wrapper names
* Swapped proxy for selected for clarity
v4:
* Minor variable name tweaks for readability
* Use a macro instead of an inline function and drop
other helper functions as suggested by Peter.
* Remove verbose comments/questions to avoid review
distractions, as suggested by Dietmar
v5:
* Add CONFIG_PROXY_EXEC option to this patch so the
new logic can be tested with this change
* Minor fix to grab rq_selected when holding the rq lock
---
kernel/sched/core.c | 38 +++++++++++++++++++++++++-------------
kernel/sched/deadline.c | 35 ++++++++++++++++++-----------------
kernel/sched/fair.c | 18 +++++++++---------
kernel/sched/rt.c | 40 ++++++++++++++++++++--------------------
kernel/sched/sched.h | 37 +++++++++++++++++++++++++++++++++++--
5 files changed, 107 insertions(+), 61 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a38bf8ef5798..9931940ba474 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -824,7 +824,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)

rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+ rq_selected(rq)->sched_class->task_tick(rq, rq_selected(rq), 1);
rq_unlock(rq, &rf);

return HRTIMER_NORESTART;
@@ -2251,16 +2251,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->sched_class == rq->curr->sched_class)
- rq->curr->sched_class->check_preempt_curr(rq, p, flags);
- else if (sched_class_above(p->sched_class, rq->curr->sched_class))
+ struct task_struct *curr = rq_selected(rq);
+
+ if (p->sched_class == curr->sched_class)
+ curr->sched_class->check_preempt_curr(rq, p, flags);
+ else if (sched_class_above(p->sched_class, curr->sched_class))
resched_curr(rq);

/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
- if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+ if (task_on_rq_queued(curr) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq);
}

@@ -2799,7 +2801,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
lockdep_assert_held(&p->pi_lock);

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);

if (queued) {
/*
@@ -5602,7 +5604,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
* project cycles that may never be accounted to this
* thread, breaking clock_gettime().
*/
- if (task_current(rq, p) && task_on_rq_queued(p)) {
+ if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
prefetch_curr_exec_start(p);
update_rq_clock(rq);
p->sched_class->update_curr(rq);
@@ -5670,7 +5672,8 @@ void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
- struct task_struct *curr = rq->curr;
+ /* accounting goes to the selected task */
+ struct task_struct *curr;
struct rq_flags rf;
unsigned long thermal_pressure;
u64 resched_latency;
@@ -5681,6 +5684,7 @@ void scheduler_tick(void)
sched_clock_tick();

rq_lock(rq, &rf);
+ curr = rq_selected(rq);

update_rq_clock(rq);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
@@ -5765,6 +5769,12 @@ static void sched_tick_remote(struct work_struct *work)
struct task_struct *curr = rq->curr;

if (cpu_online(cpu)) {
+ /*
+ * Since this is a remote tick for full dynticks mode,
+ * we are always sure that there is no proxy (only a
+ * single task is running).
+ */
+ SCHED_WARN_ON(rq->curr != rq_selected(rq));
update_rq_clock(rq);

if (!is_idle_task(curr)) {
@@ -6688,6 +6698,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}

next = pick_next_task(rq, prev, &rf);
+ rq_set_selected(rq, next);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
@@ -7154,7 +7165,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)

prev_class = p->sched_class;
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flag);
if (running)
@@ -7242,7 +7253,7 @@ void set_user_nice(struct task_struct *p, long nice)
goto out_unlock;
}
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
if (running)
@@ -7826,7 +7837,7 @@ static int __sched_setscheduler(struct task_struct *p,
}

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flags);
if (running)
@@ -9327,6 +9338,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
rcu_read_unlock();

rq->idle = idle;
+ rq_set_selected(rq, idle);
rcu_assign_pointer(rq->curr, idle);
idle->on_rq = TASK_ON_RQ_QUEUED;
#ifdef CONFIG_SMP
@@ -9416,7 +9428,7 @@ void sched_setnuma(struct task_struct *p, int nid)

rq = task_rq_lock(p, &rf);
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);

if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -10543,7 +10555,7 @@ void sched_move_task(struct task_struct *tsk)

update_rq_clock(rq);

- running = task_current(rq, tsk);
+ running = task_current_selected(rq, tsk);
queued = task_on_rq_queued(tsk);

if (queued)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9522e6607754..e8bca6b8da6f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1174,7 +1174,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
#endif

enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
check_preempt_curr_dl(rq, p, 0);
else
resched_curr(rq);
@@ -1297,7 +1297,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
*/
static void update_curr_dl(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_dl_entity *dl_se = &curr->dl;
s64 delta_exec, scaled_delta_exec;
int cpu = cpu_of(rq);
@@ -1810,7 +1810,7 @@ static int find_later_rq(struct task_struct *task);
static int
select_task_rq_dl(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
bool select_rq;
struct rq *rq;

@@ -1821,6 +1821,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)

rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));

/*
* If we are dealing with a -deadline task, we must
@@ -1831,9 +1832,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- select_rq = unlikely(dl_task(curr)) &&
+ select_rq = unlikely(dl_task(selected)) &&
(curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ !dl_entity_preempt(&p->dl, &selected->dl)) &&
p->nr_cpus_allowed > 1;

/*
@@ -1896,7 +1897,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
+ !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
return;

/*
@@ -1935,7 +1936,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
int flags)
{
- if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
+ if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
resched_curr(rq);
return;
}
@@ -1945,7 +1946,7 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
* In the unlikely case current and p have the same deadline
* let us try to decide what's the best thing to do...
*/
- if ((p->dl.deadline == rq->curr->dl.deadline) &&
+ if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
!test_tsk_need_resched(rq->curr))
check_preempt_equal_dl(rq, p);
#endif /* CONFIG_SMP */
@@ -1980,7 +1981,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (hrtick_enabled_dl(rq))
start_hrtick_dl(rq, p);

- if (rq->curr->sched_class != &dl_sched_class)
+ if (rq_selected(rq)->sched_class != &dl_sched_class)
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);

deadline_queue_push_tasks(rq);
@@ -2297,8 +2298,8 @@ static int push_dl_task(struct rq *rq)
* can move away, it makes sense to just reschedule
* without going further in pushing next_task.
*/
- if (dl_task(rq->curr) &&
- dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+ if (dl_task(rq_selected(rq)) &&
+ dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
rq->curr->nr_cpus_allowed > 1) {
resched_curr(rq);
return 0;
@@ -2423,7 +2424,7 @@ static void pull_dl_task(struct rq *this_rq)
* deadline than the current task of its runqueue.
*/
if (dl_time_before(p->dl.deadline,
- src_rq->curr->dl.deadline))
+ rq_selected(src_rq)->dl.deadline))
goto skip;

if (is_migration_disabled(p)) {
@@ -2462,9 +2463,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
if (!task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- dl_task(rq->curr) &&
+ dl_task(rq_selected(rq)) &&
(rq->curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
+ !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
push_dl_tasks(rq);
}
}
@@ -2639,12 +2640,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
return;
}

- if (rq->curr != p) {
+ if (rq_selected(rq) != p) {
#ifdef CONFIG_SMP
if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
deadline_queue_push_tasks(rq);
#endif
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
check_preempt_curr_dl(rq, p, 0);
else
resched_curr(rq);
@@ -2673,7 +2674,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
if (!rq->dl.overloaded)
deadline_queue_pull_task(rq);

- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
/*
* If we now have a earlier deadline task than p,
* then reschedule, provided p is still on this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c919633acd3d..3d5c1ec34bf7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1172,7 +1172,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
*/
s64 update_curr_common(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
s64 delta_exec;

delta_exec = update_curr_se(rq, &curr->se);
@@ -1218,7 +1218,7 @@ static void update_curr(struct cfs_rq *cfs_rq)

static void update_curr_fair(struct rq *rq)
{
- update_curr(cfs_rq_of(&rq->curr->se));
+ update_curr(cfs_rq_of(&rq_selected(rq)->se));
}

static inline void
@@ -6461,7 +6461,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
s64 delta = slice - ran;

if (delta < 0) {
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
return;
}
@@ -6476,7 +6476,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
*/
static void hrtick_update(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);

if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
return;
@@ -8082,7 +8082,7 @@ static void set_next_buddy(struct sched_entity *se)
*/
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int next_buddy_marked = 0;
@@ -8115,7 +8115,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
- if (test_tsk_need_resched(curr))
+ if (test_tsk_need_resched(rq->curr))
return;

/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -9099,7 +9099,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
* update_load_avg() can call cpufreq_update_util(). Make sure that RT,
* DL and IRQ signals have been updated before updating CFS.
*/
- curr_class = rq->curr->sched_class;
+ curr_class = rq_selected(rq)->sched_class;

thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));

@@ -12471,7 +12471,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
* our priority decreased, or if we are not currently running on
* this runqueue and our priority is higher than the current's
*/
- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
if (p->prio > oldprio)
resched_curr(rq);
} else
@@ -12574,7 +12574,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
* kick off the schedule if running, otherwise just see
* if we can still preempt the current task.
*/
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
else
check_preempt_curr(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 327ae4148aec..bc243e70bc0e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -574,7 +574,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)

static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
{
- struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
+ struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
struct rq *rq = rq_of_rt_rq(rt_rq);
struct sched_rt_entity *rt_se;

@@ -1044,7 +1044,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
*/
static void update_curr_rt(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_rt_entity *rt_se = &curr->rt;
s64 delta_exec;

@@ -1591,7 +1591,7 @@ static int find_lowest_rq(struct task_struct *task);
static int
select_task_rq_rt(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
struct rq *rq;
bool test;

@@ -1603,6 +1603,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)

rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));

/*
* If the current task on @p's runqueue is an RT task, then
@@ -1631,8 +1632,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
* systems like big.LITTLE.
*/
test = curr &&
- unlikely(rt_task(curr)) &&
- (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+ unlikely(rt_task(selected)) &&
+ (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);

if (test || !rt_task_fits_capacity(p, cpu)) {
int target = find_lowest_rq(p);
@@ -1662,12 +1663,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)

static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
{
- /*
- * Current can't be migrated, useless to reschedule,
- * let's hope p can move out.
- */
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+ !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
return;

/*
@@ -1710,7 +1707,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
*/
static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->prio < rq->curr->prio) {
+ struct task_struct *curr = rq_selected(rq);
+
+ if (p->prio < curr->prio) {
resched_curr(rq);
return;
}
@@ -1728,7 +1727,7 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
* to move current somewhere else, making room for our non-migratable
* task.
*/
- if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+ if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
check_preempt_equal_prio(rq, p);
#endif
}
@@ -1753,7 +1752,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
* utilization. We only care of the case where we start to schedule a
* rt task
*/
- if (rq->curr->sched_class != &rt_sched_class)
+ if (rq_selected(rq)->sched_class != &rt_sched_class)
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);

rt_queue_push_tasks(rq);
@@ -2034,6 +2033,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)

BUG_ON(rq->cpu != task_cpu(p));
BUG_ON(task_current(rq, p));
+ BUG_ON(task_current_selected(rq, p));
BUG_ON(p->nr_cpus_allowed <= 1);

BUG_ON(!task_on_rq_queued(p));
@@ -2066,7 +2066,7 @@ static int push_rt_task(struct rq *rq, bool pull)
* higher priority than current. If that's the case
* just reschedule current.
*/
- if (unlikely(next_task->prio < rq->curr->prio)) {
+ if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
resched_curr(rq);
return 0;
}
@@ -2419,7 +2419,7 @@ static void pull_rt_task(struct rq *this_rq)
* p if it is lower in priority than the
* current task on the run queue
*/
- if (p->prio < src_rq->curr->prio)
+ if (p->prio < rq_selected(src_rq)->prio)
goto skip;

if (is_migration_disabled(p)) {
@@ -2461,9 +2461,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
bool need_to_push = !task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- (dl_task(rq->curr) || rt_task(rq->curr)) &&
+ (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
(rq->curr->nr_cpus_allowed < 2 ||
- rq->curr->prio <= p->prio);
+ rq_selected(rq)->prio <= p->prio);

if (need_to_push)
push_rt_tasks(rq);
@@ -2547,7 +2547,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
rt_queue_push_tasks(rq);
#endif /* CONFIG_SMP */
- if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+ if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
resched_curr(rq);
}
}
@@ -2562,7 +2562,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
if (!task_on_rq_queued(p))
return;

- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
#ifdef CONFIG_SMP
/*
* If our priority decreases while running, we
@@ -2588,7 +2588,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
* greater than the current running task
* then reschedule.
*/
- if (p->prio < rq->curr->prio)
+ if (p->prio < rq_selected(rq)->prio)
resched_curr(rq);
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7c37d478e0f8..130e669cf5da 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1013,7 +1013,10 @@ struct rq {
*/
unsigned int nr_uninterruptible;

- struct task_struct __rcu *curr;
+ struct task_struct __rcu *curr; /* Execution context */
+#ifdef CONFIG_PROXY_EXEC
+ struct task_struct __rcu *curr_sched; /* Scheduling context (policy) */
+#endif
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
@@ -1212,6 +1215,22 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)

+#ifdef CONFIG_PROXY_EXEC
+#define rq_selected(rq) ((rq)->curr_sched)
+#define cpu_curr_selected(cpu) (cpu_rq(cpu)->curr_sched)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+ rcu_assign_pointer(rq->curr_sched, t);
+}
+#else
+#define rq_selected(rq) ((rq)->curr)
+#define cpu_curr_selected(cpu) (cpu_rq(cpu)->curr)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+ /* Do nothing */
+}
+#endif
+
struct sched_group;
#ifdef CONFIG_SCHED_CORE
static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2128,11 +2147,25 @@ static inline u64 global_rt_runtime(void)
return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
}

+/*
+ * Is p the current execution context?
+ */
static inline int task_current(struct rq *rq, struct task_struct *p)
{
return rq->curr == p;
}

+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq_selected() == p.
+ */
+static inline int task_current_selected(struct rq *rq, struct task_struct *p)
+{
+ return rq_selected(rq) == p;
+}
+
#ifdef CONFIG_PROXY_EXEC
static inline bool task_is_blocked(struct task_struct *p)
{
@@ -2308,7 +2341,7 @@ struct sched_class {

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
- WARN_ON_ONCE(rq->curr != prev);
+ WARN_ON_ONCE(rq_selected(rq) != prev);
prev->sched_class->put_prev_task(rq, prev);
}

--
2.42.0.869.gea05f2083d-goog

2023-11-18 00:28:18

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v6 15/20] sched: Add proxy deactivate helper

On Tue, Nov 7, 2023 at 6:52 PM kernel test robot <[email protected]> wrote:
>
> kernel test robot noticed the following build warnings:
>
...
> >> kernel/sched/core.c:6616:6: warning: no previous prototype for 'proxy_deactivate' [-Wmissing-prototypes]
> 6616 | bool proxy_deactivate(struct rq *rq, struct task_struct *next)
> | ^~~~~~~~~~~~~~~~


Thank you! I've fixed this up for v7

thanks again
-john

2023-12-13 06:38:02

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

Hello John,

I may have some data that might help you debug a potential performance
issue mentioned below.

On 11/7/2023 1:04 AM, John Stultz wrote:
> [..snip..]
>
> Performance:
> —----------
> This patch series switches mutexes to use handoff mode rather
> than optimistic spinning. This is a potential concern where locks
> are under high contention. However, earlier performance analysis
> (on both x86 and mobile devices) did not see major regressions.
> That said, Chenyu did report a regression[3], which I’ll need to
> look further into.

I too see this as the most notable regression. Some of the other
benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
DeathStarBench) show little to no difference when running with Proxy
Execution, however sched-messaging sees a 10x blowup in the runtime.
(taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)

While investigating, I drew up the runqueue length when running
sched-messaging pinned to 1CCX (CPUs 0-7,128-125 on my 3rd Generation
EPYC system) using the following bpftrace script that dumps it csv
format:

rqlen.bt
---
BEGIN
{
    /* Print the CSV header: a timestamp column plus CPUs 0-7 and 128-135 */
    $i = 0;
    printf("[%12s],", "Timestamp");
    while ($i < 8)
    {
        printf("CPU%3d,", $i);
        $i = $i + 1;
    }
    $i = 128;
    while ($i < 136)
    {
        printf("CPU%3d,", $i);
        $i = $i + 1;
    }
    printf("\n");
}

/* At every scheduler tick, record rq->nr_running for that CPU
 * (indexed via curtask->wake_cpu) */
kprobe:scheduler_tick
{
    @runqlen[curtask->wake_cpu] = curtask->se.cfs_rq->rq->nr_running;
}

/* Zero the CPU's sample on cpu_idle events */
tracepoint:power:cpu_idle
{
    @runqlen[curtask->wake_cpu] = 0;
}

/* At 50Hz, print one CSV row: elapsed time plus the sampled lengths */
interval:hz:50
{
    $i = 0;
    printf("[%12lld],", elapsed);
    while ($i < 8)
    {
        printf("%3d,", @runqlen[$i]);
        $i = $i + 1;
    }

    $i = 128;
    while ($i < 136)
    {
        printf("%3d,", @runqlen[$i]);
        $i = $i + 1;
    }
    printf("\n");
}

END
{
    clear(@runqlen);
}

--
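
(For reference, the script can be run, as root, with something like
`bpftrace rqlen.bt > rqlen50.csv` while the benchmark runs in another
shell; the file names here are just examples.)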

I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
execution (rqlen50-pe-pinned.csv) below.

The trend I see with hackbench is that the chain migration leads
to a single runqueue being completely overloaded, followed by some
amount of idling on the entire CCX and a similar chain appearing
on a different CPU. The trace for tip shows a lot more CPUs being
utilized.

Mathieu has been looking at hackbench and the effect of task migration
on the runtime and it appears that lowering the migrations improves
the hackbench performance significantly [1][2][3]

[1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/

Since migration itself is not cheap, I believe the chain migration at
the current scale hampers the performance since sched-messaging
emulates a worst-case scenario for proxy-execution.

I'll update the thread once I have more information. I'll continue
testing and take a closer look at the implementation.

> I also briefly re-tested with this v5 series
> and saw some average latencies grow vs v4, suggesting the changes
> to return-migration (and extra locking) have some impact. With v6
> the extra overhead is reduced but still not as nice as v4. I’ll
> be digging more there, but my priority is still stability over
> speed at this point (it’s easier to validate correctness of
> optimizations if the baseline isn’t crashing).
>
>
> If folks find it easier to test/tinker with, this patch series
> can also be found here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6

P.S. I was using the above tree.

>
> Awhile back Sage Sharp had a nice blog post about types of
> reviews [4], and while any review and feedback would be greatly
> appreciated, those focusing on conceptual design or correctness
> issues would be *especially* valued.

I have skipped a few phases that Sage mentions in their blog but I'll
try my best to follow the order from here on forward :)

>
> Thanks so much!
> -john
>
> [1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
> [2] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
> [3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
> [4] https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/
>
>
> [..snip..]
>

--
Thanks and Regards,
Prateek


Attachments:
rqlen50-pe-pinned.csv (193.58 kB)
rqlen50-tip-pinned.csv (22.57 kB)

2023-12-13 16:21:22

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

On Wed, 2023-12-13 at 12:07 +0530, K Prateek Nayak wrote:
> Hello John,
>
> I may have some data that might help you debug a potential
> performance
> issue mentioned below.
>
> On 11/7/2023 1:04 AM, John Stultz wrote:
> > [..snip..]
> >
> > Performance:
> > —----------
> > This patch series switches mutexes to use handoff mode rather
> > than optimistic spinning. This is a potential concern where locks
> > are under high contention. However, earlier performance analysis
> > (on both x86 and mobile devices) did not see major regressions.
> > That said, Chenyu did report a regression[3], which I’ll need to
> > look further into.
>
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy
> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g
> 1)

I observe similar regression. Total time of `taskset -c 0-5,6-11 perf
bench sched messaging -p -t -l 100000 -g 1` increases from 13.964 secs
to 184.866 secs on my test machine. Other perf sched benchmarks look
OK.

>
> While investigating, I drew up the runqueue length when running
> sched-messaging pinned to 1CCX (CPUs 0-7,128-125 on my 3rd Generation
> EPYC system) using the following bpftrace script that dumps it csv
> format:
>

[snip]


2023-12-13 19:12:02

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <[email protected]> wrote:
>
> Hello John,
>
> I may have some data that might help you debug a potential performance
> issue mentioned below.

Hey Prateek,
Thank you so much for taking the time to try this out and providing
such helpful analysis!
More below.

> On 11/7/2023 1:04 AM, John Stultz wrote:
> > [..snip..]
> >
> > Performance:
> > —----------
> > This patch series switches mutexes to use handoff mode rather
> > than optimistic spinning. This is a potential concern where locks
> > are under high contention. However, earlier performance analysis
> > (on both x86 and mobile devices) did not see major regressions.
> > That said, Chenyu did report a regression[3], which I’ll need to
> > look further into.
>
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy

This is great to hear! Thank you for providing this input!

> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)

Oof. I appreciate you sharing this!

> While investigating, I drew up the runqueue length when running
> sched-messaging pinned to 1CCX (CPUs 0-7,128-125 on my 3rd Generation
> EPYC system) using the following bpftrace script that dumps it csv
> format:

Just so I'm following you properly on the processor you're using, cpus
0-7 and 128-125 are in the same CCX?
(I thought there were only 8 cores per CCX?)

> rqlen.bt
> ---
<snip>
> --
>
> I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
> execution (rqlen50-pe-pinned.csv) below.
>
> The trend I see with hackbench is that the chain migration leads
> to a single runqueue being completely overloaded, followed by some
> amount of the idling on the entire CCX and a similar chain appearing
> on a different CPU. The trace for tip show a lot more CPUs being
> utilized.
>
> Mathieu has been looking at hackbench and the effect of task migration
> on the runtime and it appears that lowering the migrations improves
> the hackbench performance significantly [1][2][3]
>
> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] https://lore.kernel.org/lkml/[email protected]/
>
> Since migration itself is not cheap, I believe the chain migration at
> the current scale hampers the performance since sched-messaging
> emulates a worst-case scenario for proxy-execution.

Hrm.

> I'll update the thread once I have more information. I'll continue
> testing and take a closer look at the implementation.
>
> > I also briefly re-tested with this v5 series
> > and saw some average latencies grow vs v4, suggesting the changes
> > to return-migration (and extra locking) have some impact. With v6
> > the extra overhead is reduced but still not as nice as v4. I’ll
> > be digging more there, but my priority is still stability over
> > speed at this point (it’s easier to validate correctness of
> > optimizations if the baseline isn’t crashing).
> >
> >
> > If folks find it easier to test/tinker with, this patch series
> > can also be found here:
> > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
> > https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6
>
> P.S. I was using the above tree.

Ok, I've been working on getting v7 ready which includes two main things:
1) I've reworked the return migration back into the ttwu path to avoid
the lock juggling
2) Working to properly conditionalize and re-add Connor's
chain-migration feature (which when a migration happens pulls the full
blocked_donor list with it)

So I'll try to reproduce your results and see if these help any with
this particular case, and then I'll start to look closer at what can
be done.

Again, thanks so much, I've got so much gratitude for your testing and
analysis here. I really appreciate your feedback!
Do let me know if you find anything further!

thanks
-john

2023-12-14 01:00:41

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <[email protected]> wrote:
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy
> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)
...
> The trend I see with hackbench is that the chain migration leads
> to a single runqueue being completely overloaded, followed by some
> amount of the idling on the entire CCX and a similar chain appearing
> on a different CPU. The trace for tip show a lot more CPUs being
> utilized.

So I reproduced a similar issue with the test (I only have 8 cores on
the bare metal box, so I didn't bother using taskset):

perf bench sched messaging -p -t -l 100000 -g 1
v6.6: 4.583s
proxy-exec-6.6: 28.842s
proxy-exec-WIP: 26.1033s

So the pre-v7 code does improve things, but not substantially.

Bisecting through the patch series to see how it regressed shows it is
not a single change:
mutex-handoff: 16.957s
blocked-waiter/return: 29.628s
ttwu return: 20.045s
blocked_donor: 25.021s

So it seems like the majority of the regression comes from switching
to mutex handoff instead of optimistic spinning.
This would account for your comment about more CPUs being utilized, as more
of the blocked tasks would be spinning trying to take the mutex.

Then adding the initial blocked-waiter/return migration change hurts
things further (and this was a known issue with v5/v6).
Then the pending patch to switch back to doing return migration in
ttwu recovers a chunk of that cost.
And then the blocked_donor handoff optimization (which passes the lock
back to the donor boosting us, rather than the next task in the mutex
waitlist) further impacts performance here.

The chain migration feature in proxy-exec-WIP doesn't seem to help or
hurt in this case.

I'll need to look closer at each of those steps to see if there's
anything too daft I'm doing.

The loss of optimistic spinning has been a long term worry with the
patch. Unfortunately as I mentioned in the plumbers talk, the idea
from OSPM on avoiding migration and spinning if the runnable owner of
a blocked_on chain was on a cpu isn't easy to accomplish, as we are
limited to how far off the rq we can look. I feel like I need to come
up with some way to lock the entire blocked_on chain so it can be
traversed safely - as it is now, due to the locking order
(task->pi_lock, rq_lock, mutex->wait_lock, task->blocked_on) we
can't stably traverse task->mutex->task->mutex, unless the tasks
are all on the same rq (and we're holding the rq_lock). So we can
only safely look at one mutex owner off of the rq (when we also hold
the mutex wait_lock).
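
(Just to illustrate the shape of the walk I mean -- a toy userspace
model with made-up names, not the kernel structures or locking:)

#include <stdio.h>

struct toy_mutex;

struct toy_task {
        const char *comm;
        struct toy_mutex *blocked_on;   /* mutex this task waits on, or NULL */
};

struct toy_mutex {
        const char *name;
        struct toy_task *owner;         /* task currently holding it */
};

/* Walk task -> mutex -> owner -> mutex -> ... until an unblocked task */
static void walk_chain(struct toy_task *t)
{
        while (t && t->blocked_on) {
                struct toy_mutex *m = t->blocked_on;

                printf("%s blocked on %s, owned by %s\n", t->comm, m->name,
                       m->owner ? m->owner->comm : "nobody");
                t = m->owner;
        }
}

int main(void)
{
        struct toy_task c   = { "C", NULL };    /* runnable owner at the end */
        struct toy_mutex m2 = { "M2", &c };
        struct toy_task b   = { "B", &m2 };
        struct toy_mutex m1 = { "M1", &b };
        struct toy_task a   = { "A", &m1 };

        walk_chain(&a);                         /* A -> M1 -> B -> M2 -> C */
        return 0;
}

The toy walk is trivial; the point is that in the kernel each extra hop
needs locks the current ordering doesn't let us take safely once we're
one mutex owner off the rq.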

I'm stewing on an idea for some sort of mutex holder (similar to a
runqueue) that would allow the chain of mutexes to be traversed
quickly and stably - but the many-to-one blocked_on relationship
complicates it.
Suggestions or other ideas here would be welcome, as I didn't get a
chance to really discuss it much at plumbers.

thanks
-john

2023-12-14 01:03:53

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

On Wed, Dec 13, 2023 at 5:00 PM John Stultz <[email protected]> wrote:
> I'm stewing on an idea for some sort of mutex holder (similar to a

On re-reading this, I realized holder is far too overloaded a term in
this context.

"mutex container" might be a better term for the idea.

thanks
-john

2023-12-14 05:15:50

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

Hello John,

Thank you for taking a look at the report.

On 12/14/2023 12:41 AM, John Stultz wrote:
> On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <[email protected]> wrote:
>>
>> Hello John,
>>
>> I may have some data that might help you debug a potential performance
>> issue mentioned below.
>
> Hey Prateek,
> Thank you so much for taking the time to try this out and providing
> such helpful analysis!
> More below.
>
>> On 11/7/2023 1:04 AM, John Stultz wrote:
>>> [..snip..]
>>>
>>> Performance:
>>> —----------
>>> This patch series switches mutexes to use handoff mode rather
>>> than optimistic spinning. This is a potential concern where locks
>>> are under high contention. However, earlier performance analysis
>>> (on both x86 and mobile devices) did not see major regressions.
>>> That said, Chenyu did report a regression[3], which I’ll need to
>>> look further into.
>>
>> I too see this as the most notable regression. Some of the other
>> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
>> DeathStarBench) show little to no difference when running with Proxy
>
> This is great to hear! Thank you for providing this input!
>
>> Execution, however sched-messaging sees a 10x blowup in the runtime.
>> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)
>
> Oof. I appreciate you sharing this!
>
>> While investigating, I drew up the runqueue length when running
>> sched-messaging pinned to 1CCX (CPUs 0-7,128-125 on my 3rd Generation
>> EPYC system) using the following bpftrace script that dumps it csv
>> format:
>
> Just so I'm following you properly on the processor you're using, cpus
> 0-7 and 128-125 are in the same CCX?
> (I thought there were only 8 cores per CCX?)

Sorry about that! It should be 0-7,128-135 (16 threads of 8 cores in the
same CCX). The pinning was added so that I could observe only a subset of
the total CPUs, since analyzing the behavior of 40 tasks on 256 CPUs was
much harder than analyzing it on 16 CPUs :)

>
>> rqlen.bt
>> ---
> <snip>
>> --
>>
>> I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
>> execution (rqlen50-pe-pinned.csv) below.
>>
>> The trend I see with hackbench is that the chain migration leads
>> to a single runqueue being completely overloaded, followed by some
>> amount of the idling on the entire CCX and a similar chain appearing
>> on a different CPU. The trace for tip show a lot more CPUs being
>> utilized.
>>
>> Mathieu has been looking at hackbench and the effect of task migration
>> on the runtime and it appears that lowering the migrations improves
>> the hackbench performance significantly [1][2][3]
>>
>> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
>> [2] https://lore.kernel.org/lkml/[email protected]/
>> [3] https://lore.kernel.org/lkml/[email protected]/
>>
>> Since migration itself is not cheap, I believe the chain migration at
>> the current scale hampers the performance since sched-messaging
>> emulates a worst-case scenario for proxy-execution.
>
> Hrm.
>
>> I'll update the thread once I have more information. I'll continue
>> testing and take a closer look at the implementation.
>>
>>> I also briefly re-tested with this v5 series
>>> and saw some average latencies grow vs v4, suggesting the changes
>>> to return-migration (and extra locking) have some impact. With v6
>>> the extra overhead is reduced but still not as nice as v4. I’ll
>>> be digging more there, but my priority is still stability over
>>> speed at this point (it’s easier to validate correctness of
>>> optimizations if the baseline isn’t crashing).
>>>
>>>
>>> If folks find it easier to test/tinker with, this patch series
>>> can also be found here:
>>> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
>>> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6
>>
>> P.S. I was using the above tree.
>
> Ok, I've been working on getting v7 ready which includes two main things:
> 1) I've reworked the return migration back into the ttwu path to avoid
> the lock juggling
> 2) Working to properly conditionalize and re-add Connor's
> chain-migration feature (which, when a migration happens, pulls the full
> blocked_donor list with it)
>
> So I'll try to reproduce your results and see if these help any with
> this particular case, and then I'll start to look closer at what can
> be done.
>
> Again, thanks so much, I've got so much gratitude for your testing and
> analysis here. I really appreciate your feedback!
> Do let me know if you find anything further!

Sure thing! I'll keep you updated on any findings. Thank you for digging
further into this issue.

>
> thanks
> -john

--
Thanks and Regards,
Prateek

2023-12-17 03:08:29

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

Hi John

On 11/06/23 19:34, John Stultz wrote:
> Stabilizing this Proxy Execution series has unfortunately
> continued to be a challenging task. Since the v5 release, I’ve
> been focused on getting the deactivated/sleeping owner enqueuing
> functionality, which I left out of v5, stabilized. I’ve managed
> to rework enough to avoid the crashes previously tripped with the
> locktorture & ww_mutex selftests, so I feel it’s much improved,
> but I do still see some issues (blocked waitqueues and hung task
> watchdogs firing) after stressing the system for many hours in a
> 64 cpu qemu environment (though this issue seems to be introduced
> earlier in the series with proxy-migration/return-migration).
>
> I still haven’t had time to focus on testing and debugging the
> chain migration patches. So I’ve left that patch out of this
> submission for now, but will try to get it included in the next
> revision.
>
> This patch series is actually coarser than what I’ve been
> developing with, as there are a number of small “test” steps to
> help validate behavior I changed, which would then be replaced by
> the real logic afterwards. Including those here would just cause
> more work for reviewers, so I’ve folded them together, but if
> you’re interested you can find the fine-grained tree here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6-fine-grained
>
> As mentioned previously, this Proxy Execution series has a long
> history: First described in a paper[1] by Watkins, Straub,
> Niehaus, then from patches from Peter Zijlstra, extended with
> lots of work by Juri Lelli, Valentin Schneider, and Connor
> O'Brien. (and thank you to Steven Rostedt for providing
> additional details here!)

Thanks a lot for all your effort in trying to push this difficult patchset
forward!

I am trying to find more time to help with review and hopefully debugging too,
but as it stands, I think to make progress we need to think about breaking this
patchset into smaller problems and get them merged into phases so at least the
review and actual work done would be broken down into smaller more manageable
chunks.

From my very bird's-eye view it seems we have 3 elements:

1. Extend locking infrastructure.
2. Split task context into scheduling and execution.
3. Actual proxy_execution implementation.

It seems to me (and as ever I could be wrong of course) the first 7 patches are
more or less stable? Could you send patch 1 individually and the next 6 patches
to get the groundwork to extend locking reviewed and merged first?

After that we can focus on splitting the task context into scheduling and
execution (and maybe introduce the PROXY_EXEC config along with it) but without
actually implementing the inheritance, etc.? Just generally teaching the
scheduler that these are now 2 separate parts.

Are 1 and 2 dependent on each other, or can they actually be sent as two
series in parallel?

Hopefully this should reduce the work a lot from continuously rebasing these
patches and focus on the last part which is the meaty and most difficult bit
IIUC. Which I hope we can break down too; but I have no idea at the moment how
to do that.

Merging in parts will help with giving each part a chance to soak individually
in mainline while the rest is being discussed. Which would make handling
potential fall overs easier too.

I don't know what the others think, but IMHO if there are stable parts of this
series, I think we should focus on trying to merge those elements first. I hope
you'll be the one to get this over the finishing line, but if for whatever
reason yet another person needs to carry it over, they'd find something got
merged at least :-) I'm sure reviving these patches every time is no easy task!


Cheers

--
Qais Yousef

2023-12-17 16:19:49

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 01/20] sched: Unify runtime accounting across classes

On 11/06/23 19:34, John Stultz wrote:
> From: Peter Zijlstra <[email protected]>
>
> All classes use sched_entity::exec_start to track runtime and have
> copies of the exact same code around to compute runtime.
>
> Collapse all that.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E . McKenney" <[email protected]>
> Cc: [email protected]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> [fix conflicts, fold in update_current_exec_runtime]
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: rebased, resolving minor conflicts]
> Signed-off-by: John Stultz <[email protected]>
> ---
> NOTE: This patch is a general cleanup and if no one objects
> could be merged at this point. If needed, I'll resend separately
> if it isn't picked up on its own.

Looks like this actually got merged into tip via the deadline server work :-)

Though not sure if I caught a bug here

> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 85590599b4d6..7595494ceb6d 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
>
> static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
> {
> - struct task_struct *curr = rq->curr;
> - u64 now, delta_exec;
> -
> - now = rq_clock_task(rq);
> - delta_exec = now - curr->se.exec_start;
> - if (unlikely((s64)delta_exec < 0))
> - delta_exec = 0;

If negative, instead of returning for the stopper task, we set delta_exec to 0

> -
> - schedstat_set(curr->stats.exec_max,
> - max(curr->stats.exec_max, delta_exec));
> -
> - update_current_exec_runtime(curr, now, delta_exec);

And carry on to do time accounting

> + update_curr_common(rq);

But the new function will return early without doing accounting. Wouldn't this
re-introduce 8f6189684eb4 ("sched: Fix migration thread runtime bogosity")?

> }
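
To make the difference concrete, here is a toy userspace model of the two
behaviors for a negative delta (illustrative only; the struct and function
names below are made up for the example and this is not the kernel code):

#include <stdint.h>
#include <stdio.h>

struct toy_se {
	uint64_t exec_start;
	uint64_t sum_exec_runtime;
};

/* Old put_prev_task_stop() behavior: clamp a negative delta to 0, but
 * still account it and reset the accounting window (exec_start = now). */
static void old_put_prev(struct toy_se *se, uint64_t now)
{
	int64_t delta = (int64_t)(now - se->exec_start);

	if (delta < 0)
		delta = 0;
	se->sum_exec_runtime += (uint64_t)delta;
	se->exec_start = now;		/* window reset even when delta < 0 */
}

/* New update_curr_common()-style behavior, as I read it: bail out early
 * on a non-positive delta, leaving exec_start untouched. */
static void new_update_curr(struct toy_se *se, uint64_t now)
{
	int64_t delta = (int64_t)(now - se->exec_start);

	if (delta <= 0)
		return;			/* early return: no accounting, no window reset */
	se->sum_exec_runtime += (uint64_t)delta;
	se->exec_start = now;
}

int main(void)
{
	struct toy_se a = { .exec_start = 100 }, b = { .exec_start = 100 };

	old_put_prev(&a, 90);		/* rq clock "behind" exec_start */
	new_update_curr(&b, 90);
	printf("old: exec_start=%llu, new: exec_start=%llu\n",
	       (unsigned long long)a.exec_start,
	       (unsigned long long)b.exec_start);	/* old: 90, new: 100 */
	return 0;
}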


Cheers

--
Qais Yousef

2023-12-18 20:23:44

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v6 01/20] sched: Unify runtime accounting across classes

On Sun, Dec 17, 2023 at 8:19 AM Qais Yousef <[email protected]> wrote:
> On 11/06/23 19:34, John Stultz wrote:
> > From: Peter Zijlstra <[email protected]>
> >
> > All classes use sched_entity::exec_start to track runtime and have
> > copies of the exact same code around to compute runtime.
> >
> > Collapse all that.
> >
...
> Looks like this actually got merged into tip via the deadline server work :-)

Oh! That's great to see! The patch has been floating around for a while.

> Though not sure if I caught a bug here
>
> > diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> > index 85590599b4d6..7595494ceb6d 100644
> > --- a/kernel/sched/stop_task.c
> > +++ b/kernel/sched/stop_task.c
> > @@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
> >
> > static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
> > {
> > - struct task_struct *curr = rq->curr;
> > - u64 now, delta_exec;
> > -
> > - now = rq_clock_task(rq);
> > - delta_exec = now - curr->se.exec_start;
> > - if (unlikely((s64)delta_exec < 0))
> > - delta_exec = 0;
>
> If negative, instead of returning for the stopper task, we set delta_exec to 0
>
> > -
> > - schedstat_set(curr->stats.exec_max,
> > - max(curr->stats.exec_max, delta_exec));
> > -
> > - update_current_exec_runtime(curr, now, delta_exec);
>
> And carry on to do time accounting
>
> > + update_curr_common(rq);
>
> But the new function will return early without doing accounting. Wouldn't this
> re-introduce 8f6189684eb4 ("sched: Fix migration thread runtime bogosity")?

Hrm. So first, good eye for catching this!
Looking through the code, much of the accounting logic we end up
skipping doesn't have much effect when delta_exec = 0, so it seems
mostly harmless to return early without the accounting.

Though, there is one side-effect that does get skipped, which is that the
removed update_current_exec_runtime() unconditionally sets:
curr->se.exec_start = now;

Which basically resets the accounting window.

From the commit, it's unclear how intentional this side-effect is for
the edge case where the interval is negative.

I can't say I've really wrapped my head around the cases where the
se.exec_start would get ahead of the rq_clock_task(), so it's not
clear in which cases we would want to reset the accounting window vs
wait for the rq_clock_task() to catch up. But as this is getting
called from put_prev_task_stop(), it seems we're closing the
accounting window here anyway, and later set_next_task_stop() would be
called (which sets se.exec_start, resetting the accounting) to start
the accounting window again.
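
(For reference, paraphrasing roughly from memory rather than quoting the
source, the stop-class set_next callback is essentially just:

	static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool first)
	{
		/* opens a fresh accounting window before the stopper runs again */
		stop->se.exec_start = rq_clock_task(rq);
	}

so any stale exec_start left behind by the early return gets overwritten
before the next window is measured.)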

So you are right that there is a practical change in behavior, but I
don't think I see it having an effect.

But I've added Mike and Daniel to the CC in case I'm missing something.

thanks
-john

2023-12-18 23:39:11

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

On Sat, Dec 16, 2023 at 7:07 PM Qais Yousef <[email protected]> wrote:
> I am trying to find more time to help with review and hopefully debugging too,
> but as it stands, I think to make progress we need to think about breaking this
> patchset into smaller problems and get them merged into phases so at least the
> review and actual work done would be broken down into smaller more manageable
> chunks.
>
> From my very bird's-eye view it seems we have 3 elements:
>
> 1. Extend locking infrastructure.
> 2. Split task context into scheduling and execution.
> 3. Actual proxy_execution implementation.
>
> It seems to me (and as ever I could be wrong of course) the first 7 patches are
> more or less stable? Could you send patch 1 individually and the next 6 patches
> to get the groundwork to extend locking reviewed and merged first?

So I'm working hard to get v7 out the door here, but your general
suggestion here sounds fair.

Part of why I've not pushed as hard to get the first initial patches
in, is that the earlier changes to rework things are mostly done in
service of the later proxy exec logic, so it may be a little hard to
justify the churn on their own. Also, I've been hoping to get more
discussion & feedback around the big picture - but I suspect the size
& number of the changes makes this daunting.

That said, if Peter is up for it, I'd be happy if the initial chunks
of the series were to be considered to be pulled in.

> After that we can focus on splitting the task context into scheduling and
> execution (and maybe introduce the PROXY_EXEC config along with it) but without
> actually implementing the inheritance, etc.? Just generally teaching the
> scheduler that these are now 2 separate parts.

The majority of that is done in a single patch:
https://github.com/johnstultz-work/linux-dev/commit/9e3b364f3724ed840137d681876268b0ad67a469

There we start to have separate names, but at least until we are doing
the proxying, the rq->curr and rq_selected() will be the same task.
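
Conceptually it boils down to tracking two pointers on the rq. A rough
sketch only (the struct field name here is illustrative, not necessarily
what the patch uses):

	struct task_struct;			/* opaque here; sketch only */

	struct rq {
		struct task_struct *curr;		/* execution context: what is physically running */
		struct task_struct *curr_selected;	/* scheduler context: what got picked to run */
		/* ... */
	};

	#define rq_selected(rq)	((rq)->curr_selected)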

> Are 1 and 2 dependent on each other, or can they actually be sent as two
> series in parallel?

Technically, I think they can be done in parallel. Though I'm not sure
if managing and submitting multiple patch series is easier or harder
for folks to review.

> Hopefully this should reduce the work a lot from continuously rebasing these
> patches and focus on the last part which is the meaty and most difficult bit
> IIUC. Which I hope we can break down too; but I have no idea at the moment how
> to do that.

Rebasing hasn't been the major problem, but wrangling the large patch
set has. For v7 I got a lot of offlist feedback, and propagating those
changes through the full fine-grained stack makes even trivial changes
to early patches quite laborious.

As for breaking down the rest, I thought the v6 series had later
patches broken down fairly well:
1) Just local rq proxying (where the owner happens to be on the current rq)
2) Add proxy migration & return migration, so we can boost tasks on remote rqs
3) Sleeping owner enqueueing (so we get woken and added to the rq the
owner gets woken on)
...

And I have the fine-grained version that is even more split up so I
could test each small change at a time:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained

But I'm open to other suggestions.

> Merging in parts will help with giving each part a chance to soak individually
> in mainline while the rest is being discussed. Which would make handling
> potential fall overs easier too.

Again, I think this would be great.

I'll try to get v7 out the door so folks can once more consider the
big picture, but then maybe I'll send out the earlier changes more
frequently.

thanks
-john

2023-12-28 16:21:46

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 01/20] sched: Unify runtime accounting across classes

On 12/18/23 12:23, John Stultz wrote:
> On Sun, Dec 17, 2023 at 8:19 AM Qais Yousef <[email protected]> wrote:
> > On 11/06/23 19:34, John Stultz wrote:
> > > From: Peter Zijlstra <[email protected]>
> > >
> > > All classes use sched_entity::exec_start to track runtime and have
> > > copies of the exact same code around to compute runtime.
> > >
> > > Collapse all that.
> > >
> ...
> > Looks like this actually got merged into tip via the deadline server work :-)
>
> Oh! That's great to see! The patch has been floating around for a while.
>
> > Though not sure if I caught a bug here
> >
> > > diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> > > index 85590599b4d6..7595494ceb6d 100644
> > > --- a/kernel/sched/stop_task.c
> > > +++ b/kernel/sched/stop_task.c
> > > @@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
> > >
> > > static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
> > > {
> > > - struct task_struct *curr = rq->curr;
> > > - u64 now, delta_exec;
> > > -
> > > - now = rq_clock_task(rq);
> > > - delta_exec = now - curr->se.exec_start;
> > > - if (unlikely((s64)delta_exec < 0))
> > > - delta_exec = 0;
> >
> > If negative, instead of returning for the stopper task, we set delta_exec to 0
> >
> > > -
> > > - schedstat_set(curr->stats.exec_max,
> > > - max(curr->stats.exec_max, delta_exec));
> > > -
> > > - update_current_exec_runtime(curr, now, delta_exec);
> >
> > And carry on to do time accounting
> >
> > > + update_curr_common(rq);
> >
> > But the new function will return early without doing accounting. Wouldn't this
> > re-introduce 8f6189684eb4 ("sched: Fix migration thread runtime bogosity")?
>
> Hrm. So first, good eye for catching this!
> Looking through the code, much of the accounting logic we end up
> skipping doesn't have much effect when delta_exec = 0, so it seems
> mostly harmless to return early without the accounting.
>
> Though, there is one side-effect that does get skipped, which is that the
> removed update_current_exec_runtime() unconditionally sets:
> curr->se.exec_start = now;
>
> Which basically resets the accounting window.
>
> From the commit, it's unclear how intentional this side-effect is for
> the edge case where the interval is negative.
>
> I can't say I've really wrapped my head around the cases where the
> se.exec_start would get ahead of the rq_clock_task(), so it's not
> clear in which cases we would want to reset the accounting window vs
> wait for the rq_clock_task() to catch up. But as this is getting
> called from put_prev_task_stop(), it seems we're closing the
> accounting window here anyway, and later set_next_task_stop() would be
> called (which sets se.exec_start, resetting the accounting) to start
> the accounting window again.
>
> So you are right that there is a practical change in behavior, but I
> don't think I see it having an effect.

Yes, agreed. I couldn't reproduce any problem, and I can't see a terrible side
effect of returning early compared to continuing to do the accounting.


Cheers

--
Qais Yousef

2023-12-28 16:45:23

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

On 12/18/23 15:38, John Stultz wrote:
> On Sat, Dec 16, 2023 at 7:07 PM Qais Yousef <[email protected]> wrote:
> > I am trying to find more time to help with review and hopefully debugging too,
> > but as it stands, I think to make progress we need to think about breaking this
> > patchset into smaller problems and get them merged into phases so at least the
> > review and actual work done would be broken down into smaller more manageable
> > chunks.
> >
> > From my very bird's-eye view it seems we have 3 elements:
> >
> > 1. Extend locking infrastructure.
> > 2. Split task context into scheduling and execution.
> > 3. Actual proxy_execution implementation.
> >
> > It seems to me (and as ever I could be wrong of course) the first 7 patches are
> > more or less stable? Could you send patch 1 individually and the next 6 patches
> > to get the groundwork to extend locking reviewed and merged first?
>
> So I'm working hard to get v7 out the door here, but your general
> suggestion here sounds fair.
>
> Part of why I've not pushed as hard to get the first initial patches
> in, is that the earlier changes to rework things are mostly done in
> service of the later proxy exec logic, so it may be a little hard to
> justify the churn on their own. Also, I've been hoping to get more

If by churn you mean they'd be changing a lot later, then yes.

But if by churn you mean merging patches to serve a purpose that is yet to
follow up, then I think that is fine. I think other complex patches were staged
in pieces to help with both the review and test process, and to speed things up.

> discussion & feedback around the big picture - but I suspect the size
> & number of the changes makes this daunting.

I don't know how the maintainers manage in general tbh. But finding the time
for a smaller set of patches would be easier IMHO for everyone interested.

Assuming my understanding is correct that the first set of patches up to
splitting the scheduler context are more or less stable.

> That said, if Peter is up for it, I'd be happy if the initial chunks
> of the series were to be considered to be pulled in.
>
> > After that we can focus on splitting the task context into scheduling and
> > execution (and maybe introduce the PROXY_EXEC config along with it) but without
> > actually implementing the inheritance, etc.? Just generally teaching the
> > scheduler that these are now 2 separate parts.
>
> The majority of that is done in a single patch:
> https://github.com/johnstultz-work/linux-dev/commit/9e3b364f3724ed840137d681876268b0ad67a469

Yep!

>
> There we start to have separate names, but at least until we are doing
> the proxying, the rq->curr and rq_selected() will be the same task.
>
> > Are 1 and 2 dependent on each other, or can they actually be sent as two
> > series in parallel?
>
> Technically, I think they can be done in parallel. Though I'm not sure
> if managing and submitting multiple patch series is easier or harder
> for folks to review.

I didn't mean you should do that :-) I meant they're actually completely
independent.

>
> > Hopefully this should reduce the work a lot from continuously rebasing these
> > patches and focus on the last part which is the meaty and most difficult bit
> > IIUC. Which I hope we can break down too; but I have no idea at the moment how
> > to do that.
>
> Rebasing hasn't been the major problem, but wrangling the large patch
> set has. For v7 I got a lot of offlist feedback, and propagating those
> changes through the full fine-grained stack makes even trivial changes
> to early patches quite laborious.
>
> As for breaking down the rest, I thought the v6 series had later
> patches broken down fairly well:

Hmm okay. I thought they were dependent; I was referring to potentially
independent parts that can be piecewise merged. But perhaps I'm overthinking
the splitting here :-)

> 1) Just local rq proxying (where the owner happens to be on the current rq)
> 2) Add proxy migration & return migration, so we can boost tasks on remote rqs
> 3) Sleeping owner enqueueing (so we get woken and added to the rq the
> owner gets woken on)
> ...
>
> And I have the fine-grained version that is even more split up so I
> could test each small change at a time:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
>
> But I'm open to other suggestions.
>
> > Merging in parts will help with giving each part a chance to soak individually
> > in mainline while the rest is being discussed. Which would make handling
> > potential fall overs easier too.
>
> Again, I think this would be great.
>
> I'll try to get v7 out the door so folks can once more consider the
> big picture, but then maybe I'll send out the earlier changes more
> frequently.


Thanks!

--
Qais Yousef