2024-03-15 04:40:29

by John Stultz

Subject: [PATCH v9 0/7] Preparatory changes for Proxy Execution v9

As mentioned last time[1], after previous submissions of the
Proxy Execution series, I got feedback that the patch series was
getting a bit unwieldy to review, and Qais suggested I break out
just the cleanups/preparatory components of the patch series and
submit them on their own in the hope we can start to merge the
less complex bits and discussion can focus on the more
complicated portions afterwards. This so far has not been very
successful, with the submission & RESEND of the v8 preparatory
changes not getting much in the way of review.

Nonetheless, for v9 of this series, I’m again only submitting
those early cleanup/preparatory changes here (which have not
changed since the v8 submissions, but to avoid confusion with the
git branch names, I’m labeling it as v9). In the meantime, I’ve
continued to put a lot of effort into the full series, mostly
focused on polishing the series for correctness, and fixing some
hard to trip races.

If you are interested, the full v9 series can be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v9-6.8
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v9-6.8


New in v9:
(In the git tree. Again, none of the preparatory patches
submitted here have changed since v8)
---------
* Change to force mutex lock handoff when we have a blocked donor
(preserves optimistic spinning elsewhere, but still prioritizes
donor when present on unlock)

* Do return migration whenever we’re not on the wake_cpu (should
address placement concerns brought up earlier by Xuewen Yan)

* Closed hole where we might mark a task as BO_RUNNABLE without
doing return migration

* Much improved handling of balance callbacks when we need to
pick_again

* Fixes for cases where we put_prev_task() but left a dangling
pointer to rq_selected() when deactivating a task (as it could
then be migrated away while we still have a reference to it),
by selecting idle before deactivating next.

* Fixes for dangling references to rq->curr (which had been
put_prev_task’ed) when we drop rq lock for proxy_migration

* Fixes for ttwu / find_proxy_task() races if the lock owner was
being return migrated, and ttwu hadn’t yet set_task_cpu() and
activated it, which allowed that task to be scheduled on two
cpus at the same time.

* Fix for live-lock between activate_blocked_tasks() and
proxy_enqueue_on_owner() if activated owner went right back to
sleep (which also simplifies the locking in
activate_blocked_tasks())

* Cleanups to avoid locked BO_WAKING->BO_RUNNABLE transition in
try_to_wake_up() if proxy execution isn't enabled

* Fix for psi_dequeue, as proxy changes assumptions around
voluntary sleeps.

* Numerous typos, comment improvements, and other fixups
suggested by Metin

* And more!


Performance:
---------
K Prateek Nayak provided some feedback on the v8 series here[2].
Given the potential extra overhead of doing rq migrations/return
migrations/etc. for the proxy case, it’s not completely surprising
that a few of K Prateek’s test cases saw ~3-5% regressions, but I’m
hoping to look into this soon to see if we can reduce those
further. The donor mutex handoff in this revision may help some.


Issues still to address:
---------
* The chain migration functionality needs further iterations and
better validation to ensure it truly maintains the RT/DL load
balancing invariants.

* CFS load balancing. There was concern that blocked tasks may
carry forward load (PELT) to the lock owner's CPU, so the CPU
may look like it is overloaded. Needs investigation.

* The sleeping owner handling (where we deactivate waiting tasks
and enqueue them onto a list, then reactivate them when the
owner wakes up) doesn’t feel great. This is in part because
when we want to activate tasks, we’re already holding a
task.pi_lock and a rq_lock, just not the locks for the task
we’re activating, nor the rq we’re enqueuing it onto. So there
has to be a bit of lock juggling to drop and acquire the right
locks (in the right order). It feels like there’s got to be a
better way; a rough sketch of the juggling is shown below. Also
needs some rework to get rid of the recursion.
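
For illustration only (this is not code from the series), the
juggling described above has roughly the following shape, assuming
we arrive holding the donor task's pi_lock and the current rq's
lock, and need to activate the sleeping owner on its own rq:

/*
 * Rough sketch, not code from the series: drop the locks we hold
 * (innermost first), take the owner's locks in the usual
 * pi_lock -> rq lock order, activate it, then restore the locking
 * state the caller expects.
 */
static void sketch_activate_sleeping_owner(struct rq *rq,
					   struct task_struct *donor,
					   struct task_struct *owner,
					   unsigned long *flags)
{
	struct rq_flags rf;
	struct rq *owner_rq;

	raw_spin_rq_unlock(rq);
	raw_spin_unlock_irqrestore(&donor->pi_lock, *flags);

	raw_spin_lock_irqsave(&owner->pi_lock, *flags);
	owner_rq = __task_rq_lock(owner, &rf);
	if (!task_on_rq_queued(owner))
		activate_task(owner_rq, owner, ENQUEUE_NOCLOCK);
	__task_rq_unlock(owner_rq, &rf);
	raw_spin_unlock_irqrestore(&owner->pi_lock, *flags);

	raw_spin_lock_irqsave(&donor->pi_lock, *flags);
	raw_spin_rq_lock(rq);
}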


Credit/Disclaimer:
---------
As mentioned previously, this Proxy Execution series has a long
history: it was first described in a paper[3] by Watkins, Straub,
and Niehaus, then implemented in patches from Peter Zijlstra, and
extended with lots of work by Juri Lelli, Valentin Schneider, and
Connor O'Brien. (And thank you to Steven Rostedt for providing
additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Thanks so much!
-john

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]


Connor O'Brien (2):
sched: Add do_push_task helper
sched: Consolidate pick_*_task to task_is_pushable helper

John Stultz (1):
sched: Split out __schedule() deactivate task logic into a helper

Juri Lelli (2):
locking/mutex: Make mutex::wait_lock irq safe
locking/mutex: Expose __mutex_owner()

Peter Zijlstra (2):
locking/mutex: Remove wakeups from under mutex::wait_lock
sched: Split scheduler and execution contexts

kernel/locking/mutex.c | 60 +++++++----------
kernel/locking/mutex.h | 25 +++++++
kernel/locking/rtmutex.c | 26 +++++---
kernel/locking/rwbase_rt.c | 4 +-
kernel/locking/rwsem.c | 4 +-
kernel/locking/spinlock_rt.c | 3 +-
kernel/locking/ww_mutex.h | 49 ++++++++------
kernel/sched/core.c | 122 +++++++++++++++++++++--------------
kernel/sched/deadline.c | 53 ++++++---------
kernel/sched/fair.c | 18 +++---
kernel/sched/rt.c | 59 +++++++----------
kernel/sched/sched.h | 44 ++++++++++++-
12 files changed, 268 insertions(+), 199 deletions(-)

--
2.44.0.291.gc1ea87d7ee-goog



2024-03-15 04:40:46

by John Stultz

Subject: [PATCH v9 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock

From: Peter Zijlstra <[email protected]>

In preparation for nesting mutex::wait_lock under rq::lock, we need to
remove wakeups from under it.
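
The change throughout this patch follows the standard wake_q
deferral pattern: queue the wakeup while wait_lock is held, and only
issue it once the lock has been dropped. A minimal sketch (my_lock
and pick_top_waiter() are placeholders, not code from this patch):

static void my_unlock_slowpath(struct my_lock *lock)
{
	DEFINE_WAKE_Q(wake_q);			/* on-stack wake queue */
	struct task_struct *waiter;

	raw_spin_lock(&lock->wait_lock);
	waiter = pick_top_waiter(lock);		/* placeholder helper */
	if (waiter)
		wake_q_add(&wake_q, waiter);	/* queue, don't wake yet */
	raw_spin_unlock(&lock->wait_lock);

	wake_up_q(&wake_q);	/* wakeups happen with wait_lock dropped */
}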

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
mutexes")]
Signed-off-by: Juri Lelli <[email protected]>
[jstultz: rebased to mainline, added extra wake_up_q & init
to avoid hangs, similar to Connor's rework of this patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Reverted back to an earlier version of this patch to undo
the change that kept the wake_q in the ctx structure, as
that broke the rule that the wake_q must always be on the
stack, as it's not safe for concurrency.
v6:
* Made tweaks suggested by Waiman Long
v7:
* Fixups to pass wake_qs down for PREEMPT_RT logic
---
kernel/locking/mutex.c | 17 +++++++++++++----
kernel/locking/rtmutex.c | 26 +++++++++++++++++---------
kernel/locking/rwbase_rt.c | 4 +++-
kernel/locking/rwsem.c | 4 ++--
kernel/locking/spinlock_rt.c | 3 ++-
kernel/locking/ww_mutex.h | 29 ++++++++++++++++++-----------
6 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index cbae8c0b89ab..980ce630232c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -575,6 +575,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
struct lockdep_map *nest_lock, unsigned long ip,
struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
+ DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
int ret;
@@ -625,7 +626,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (__mutex_trylock(lock)) {
if (ww_ctx)
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);

goto skip_wait;
}
@@ -645,7 +646,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* Add in stamp order, waking up waiters that must kill
* themselves.
*/
- ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
+ ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
if (ret)
goto err_early_kill;
}
@@ -681,6 +682,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}

raw_spin_unlock(&lock->wait_lock);
+ /* Make sure we do wakeups before calling schedule */
+ if (!wake_q_empty(&wake_q)) {
+ wake_up_q(&wake_q);
+ wake_q_init(&wake_q);
+ }
schedule_preempt_disabled();

first = __mutex_waiter_is_first(lock, &waiter);
@@ -714,7 +720,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (!ww_ctx->is_wait_die &&
!__mutex_waiter_is_first(lock, &waiter))
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
}

__mutex_remove_waiter(lock, &waiter);
@@ -730,6 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
ww_mutex_lock_acquired(ww, ww_ctx);

raw_spin_unlock(&lock->wait_lock);
+ wake_up_q(&wake_q);
preempt_enable();
return 0;

@@ -741,6 +748,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
raw_spin_unlock(&lock->wait_lock);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
+ wake_up_q(&wake_q);
preempt_enable();
return ret;
}
@@ -934,6 +942,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}
}

+ preempt_disable();
raw_spin_lock(&lock->wait_lock);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
@@ -952,8 +961,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
__mutex_handoff(lock, next);

raw_spin_unlock(&lock->wait_lock);
-
wake_up_q(&wake_q);
+ preempt_enable();
}

#ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 4a10e8c16fd2..eaac8b196a69 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -34,13 +34,15 @@

static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter,
struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
return 0;
}

static inline void __ww_mutex_check_waiters(struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
}

@@ -1206,6 +1208,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
struct rt_mutex_waiter *top_waiter = waiter;
struct rt_mutex_base *next_lock;
int chain_walk = 0, res;
+ DEFINE_WAKE_Q(wake_q);

lockdep_assert_held(&lock->wait_lock);

@@ -1244,7 +1247,8 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,

/* Check whether the waiter should back out immediately */
rtm = container_of(lock, struct rt_mutex, rtmutex);
- res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
+ res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, &wake_q);
+ wake_up_q(&wake_q);
if (res) {
raw_spin_lock(&task->pi_lock);
rt_mutex_dequeue(lock, waiter);
@@ -1677,7 +1681,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state,
enum rtmutex_chainwalk chwalk,
- struct rt_mutex_waiter *waiter)
+ struct rt_mutex_waiter *waiter,
+ struct wake_q_head *wake_q)
{
struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
struct ww_mutex *ww = ww_container_of(rtm);
@@ -1688,7 +1693,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* Try to acquire the lock again: */
if (try_to_take_rt_mutex(lock, current, NULL)) {
if (build_ww_mutex() && ww_ctx) {
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
return 0;
@@ -1706,7 +1711,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* acquired the lock */
if (build_ww_mutex() && ww_ctx) {
if (!ww_ctx->is_wait_die)
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
} else {
@@ -1728,7 +1733,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,

static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
- unsigned int state)
+ unsigned int state,
+ struct wake_q_head *wake_q)
{
struct rt_mutex_waiter waiter;
int ret;
@@ -1737,7 +1743,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
waiter.ww_ctx = ww_ctx;

ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
- &waiter);
+ &waiter, wake_q);

debug_rt_mutex_free_waiter(&waiter);
return ret;
@@ -1753,6 +1759,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state)
{
+ DEFINE_WAKE_Q(wake_q);
unsigned long flags;
int ret;

@@ -1774,8 +1781,9 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
* irqsave/restore variants.
*/
raw_spin_lock_irqsave(&lock->wait_lock, flags);
- ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
+ ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+ wake_up_q(&wake_q);
rt_mutex_post_schedule();

return ret;
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 34a59569db6b..e9d2f38b70f3 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -69,6 +69,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
unsigned int state)
{
struct rt_mutex_base *rtm = &rwb->rtmutex;
+ DEFINE_WAKE_Q(wake_q);
int ret;

rwbase_pre_schedule();
@@ -110,7 +111,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
* For rwlocks this returns 0 unconditionally, so the below
* !ret conditionals are optimized out.
*/
- ret = rwbase_rtmutex_slowlock_locked(rtm, state);
+ ret = rwbase_rtmutex_slowlock_locked(rtm, state, &wake_q);

/*
* On success the rtmutex is held, so there can't be a writer
@@ -122,6 +123,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
if (!ret)
atomic_inc(&rwb->readers);
raw_spin_unlock_irq(&rtm->wait_lock);
+ wake_up_q(&wake_q);
if (!ret)
rwbase_rtmutex_unlock(rtm);

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..74ebb2915d63 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1415,8 +1415,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
#define rwbase_rtmutex_lock_state(rtm, state) \
__rt_mutex_lock(rtm, state)

-#define rwbase_rtmutex_slowlock_locked(rtm, state) \
- __rt_mutex_slowlock_locked(rtm, NULL, state)
+#define rwbase_rtmutex_slowlock_locked(rtm, state, wq) \
+ __rt_mutex_slowlock_locked(rtm, NULL, state, wq)

#define rwbase_rtmutex_unlock(rtm) \
__rt_mutex_unlock(rtm)
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 38e292454fcc..fb1810a14c9d 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -162,7 +162,8 @@ rwbase_rtmutex_lock_state(struct rt_mutex_base *rtm, unsigned int state)
}

static __always_inline int
-rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state)
+rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state,
+ struct wake_q_head *wake_q)
{
rtlock_slowlock_locked(rtm);
return 0;
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 3ad2cc4823e5..7189c6631d90 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
*/
static bool
__ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
{
if (!ww_ctx->is_wait_die)
return false;
@@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}

return true;
@@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
*/
static bool __ww_mutex_wound(struct MUTEX *lock,
struct ww_acquire_ctx *ww_ctx,
- struct ww_acquire_ctx *hold_ctx)
+ struct ww_acquire_ctx *hold_ctx,
+ struct wake_q_head *wake_q)
{
struct task_struct *owner = __ww_mutex_owner(lock);

@@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* wakeup pending to re-read the wounded state.
*/
if (owner != current)
- wake_up_process(owner);
+ wake_q_add(wake_q, owner);

return true;
}
@@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* The current task must not be on the wait list.
*/
static void
-__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
+__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur;

@@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
if (!cur->ww_ctx)
continue;

- if (__ww_mutex_die(lock, cur, ww_ctx) ||
- __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+ if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
+ __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
break;
}
}
@@ -377,6 +379,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
+ DEFINE_WAKE_Q(wake_q);
+
ww_mutex_lock_acquired(lock, ctx);

/*
@@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* die or wound us.
*/
lock_wait_lock(&lock->base);
- __ww_mutex_check_waiters(&lock->base, ctx);
+ __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
unlock_wait_lock(&lock->base);
+
+ wake_up_q(&wake_q);
}

static __always_inline int
@@ -488,7 +494,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
static inline int
__ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
struct MUTEX *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur, *pos = NULL;
bool is_wait_die;
@@ -532,7 +539,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
pos = cur;

/* Wait-Die: ensure younger waiters die. */
- __ww_mutex_die(lock, cur, ww_ctx);
+ __ww_mutex_die(lock, cur, ww_ctx, wake_q);
}

__ww_waiter_add(lock, waiter, pos);
@@ -550,7 +557,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
* such that either we or the fastpath will wound @ww->ctx.
*/
smp_mb();
- __ww_mutex_wound(lock, ww_ctx, ww->ctx);
+ __ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
}

return 0;
--
2.44.0.291.gc1ea87d7ee-goog


2024-03-15 04:40:56

by John Stultz

Subject: [PATCH v9 3/7] locking/mutex: Expose __mutex_owner()

From: Juri Lelli <[email protected]>

Implementing proxy execution requires that scheduler code be able to
identify the current owner of a mutex. Expose __mutex_owner() for
this purpose (alone!).
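
For context, a hypothetical illustration of the kind of lookup the
scheduler side needs (the blocked_on field and this helper are not
part of this patch set; they come from the full proxy-execution
series):

/* Hypothetical: find who owns the mutex a blocked task is waiting on. */
static struct task_struct *blocked_lock_owner(struct task_struct *p)
{
	struct mutex *m = p->blocked_on;	/* set by the full series */

	return m ? __mutex_owner(m) : NULL;
}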

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
[Removed the EXPORT_SYMBOL]
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Reworked per Peter's suggestions]
Signed-off-by: John Stultz <[email protected]>
---
v4:
* Move __mutex_owner() to kernel/locking/mutex.h instead of
adding a new globally available accessor function to keep
the exposure of this low, along with keeping it an inline
function, as suggested by PeterZ
---
kernel/locking/mutex.c | 25 -------------------------
kernel/locking/mutex.h | 25 +++++++++++++++++++++++++
2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 7de72c610c65..5741641be914 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -56,31 +56,6 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
}
EXPORT_SYMBOL(__mutex_init);

-/*
- * @owner: contains: 'struct task_struct *' to the current lock owner,
- * NULL means not owned. Since task_struct pointers are aligned at
- * at least L1_CACHE_BYTES, we have low bits to store extra state.
- *
- * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
- * Bit1 indicates unlock needs to hand the lock to the top-waiter
- * Bit2 indicates handoff has been done and we're waiting for pickup.
- */
-#define MUTEX_FLAG_WAITERS 0x01
-#define MUTEX_FLAG_HANDOFF 0x02
-#define MUTEX_FLAG_PICKUP 0x04
-
-#define MUTEX_FLAGS 0x07
-
-/*
- * Internal helper function; C doesn't allow us to hide it :/
- *
- * DO NOT USE (outside of mutex code).
- */
-static inline struct task_struct *__mutex_owner(struct mutex *lock)
-{
- return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
-}
-
static inline struct task_struct *__owner_task(unsigned long owner)
{
return (struct task_struct *)(owner & ~MUTEX_FLAGS);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 0b2a79c4013b..1c7d3d32def8 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -20,6 +20,31 @@ struct mutex_waiter {
#endif
};

+/*
+ * @owner: contains: 'struct task_struct *' to the current lock owner,
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
+ *
+ * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ * Bit1 indicates unlock needs to hand the lock to the top-waiter
+ * Bit2 indicates handoff has been done and we're waiting for pickup.
+ */
+#define MUTEX_FLAG_WAITERS 0x01
+#define MUTEX_FLAG_HANDOFF 0x02
+#define MUTEX_FLAG_PICKUP 0x04
+
+#define MUTEX_FLAGS 0x07
+
+/*
+ * Internal helper function; C doesn't allow us to hide it :/
+ *
+ * DO NOT USE (outside of mutex & scheduler code).
+ */
+static inline struct task_struct *__mutex_owner(struct mutex *lock)
+{
+ return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
+}
+
#ifdef CONFIG_DEBUG_MUTEXES
extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
--
2.44.0.291.gc1ea87d7ee-goog


2024-03-15 04:41:00

by John Stultz

Subject: [PATCH v9 2/7] locking/mutex: Make mutex::wait_lock irq safe

From: Juri Lelli <[email protected]>

mutex::wait_lock might be nested under rq->lock.

Make it irq safe then.
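
The conversion is mechanical: since rq->lock is taken with
interrupts disabled, anything that may nest inside it must use the
irqsave/irqrestore variants. A sketch of the resulting shape (not
the actual patch hunks):

static void example_wait_lock_section(struct mutex *lock)
{
	unsigned long flags;

	/* was: raw_spin_lock(&lock->wait_lock); */
	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	/* ... manipulate the wait list ... */
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
}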

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
Signed-off-by: Connor O'Brien <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
v3:
* Re-added this patch after it was dropped in v2 which
caused lockdep warnings to trip.
v7:
* Fix function definition for PREEMPT_RT case, as pointed out
by Metin Kaya.
* Fix incorrect flags handling in PREEMPT_RT case as found by
Metin Kaya
---
kernel/locking/mutex.c | 18 ++++++++++--------
kernel/locking/ww_mutex.h | 22 +++++++++++-----------
2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 980ce630232c..7de72c610c65 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -578,6 +578,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
+ unsigned long flags;
int ret;

if (!use_ww_ctx)
@@ -620,7 +621,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;
}

- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -681,7 +682,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err;
}

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* Make sure we do wakeups before calling schedule */
if (!wake_q_empty(&wake_q)) {
wake_up_q(&wake_q);
@@ -707,9 +708,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
trace_contention_begin(lock, LCB_F_MUTEX);
}

- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
__set_current_state(TASK_RUNNING);

@@ -735,7 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
if (ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
return 0;
@@ -745,7 +746,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
trace_contention_end(lock, ret);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
wake_up_q(&wake_q);
@@ -916,6 +917,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
struct task_struct *next = NULL;
DEFINE_WAKE_Q(wake_q);
unsigned long owner;
+ unsigned long flags;

mutex_release(&lock->dep_map, ip);

@@ -943,7 +945,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}

preempt_disable();
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
@@ -960,7 +962,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 7189c6631d90..9facc0ddfdd3 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
}

-static inline void lock_wait_lock(struct mutex *lock)
+static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, *flags);
}

-static inline void unlock_wait_lock(struct mutex *lock)
+static inline void unlock_wait_lock(struct mutex *lock, unsigned long *flags)
{
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
}

static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
@@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
return rt_mutex_has_waiters(&lock->rtmutex);
}

-static inline void lock_wait_lock(struct rt_mutex *lock)
+static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->rtmutex.wait_lock);
+ raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
}

-static inline void unlock_wait_lock(struct rt_mutex *lock)
+static inline void unlock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{
- raw_spin_unlock(&lock->rtmutex.wait_lock);
+ raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, *flags);
}

static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
@@ -380,6 +380,7 @@ static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
DEFINE_WAKE_Q(wake_q);
+ unsigned long flags;

ww_mutex_lock_acquired(lock, ctx);

@@ -408,10 +409,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* Uh oh, we raced in fastpath, check if any of the waiters need to
* die or wound us.
*/
- lock_wait_lock(&lock->base);
+ lock_wait_lock(&lock->base, &flags);
__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
- unlock_wait_lock(&lock->base);
-
+ unlock_wait_lock(&lock->base, &flags);
wake_up_q(&wake_q);
}

--
2.44.0.291.gc1ea87d7ee-goog


2024-03-15 04:41:10

by John Stultz

Subject: [PATCH v9 4/7] sched: Add do_push_task helper

From: Connor O'Brien <[email protected]>

Switch the logic that deactivates a task, sets its CPU,
and reactivates it on a different rq to use a helper
that will later be extended to push entire blocked
task chains.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch]
Signed-off-by: John Stultz <[email protected]>
---
v8:
* Renamed from push_task_chain to do_push_task so it makes
more sense without proxy-execution
---
kernel/sched/core.c | 4 +---
kernel/sched/deadline.c | 8 ++------
kernel/sched/rt.c | 8 ++------
kernel/sched/sched.h | 9 +++++++++
4 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..ad4748327651 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2714,9 +2714,7 @@ int push_cpu_stop(void *arg)

// XXX validate p is still the highest prio task
if (task_rq(p) == rq) {
- deactivate_task(rq, p, 0);
- set_task_cpu(p, lowest_rq->cpu);
- activate_task(lowest_rq, p, 0);
+ do_push_task(rq, lowest_rq, p);
resched_curr(lowest_rq);
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a04a436af8cc..e68d88963e89 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2443,9 +2443,7 @@ static int push_dl_task(struct rq *rq)
goto retry;
}

- deactivate_task(rq, next_task, 0);
- set_task_cpu(next_task, later_rq->cpu);
- activate_task(later_rq, next_task, 0);
+ do_push_task(rq, later_rq, next_task);
ret = 1;

resched_curr(later_rq);
@@ -2531,9 +2529,7 @@ static void pull_dl_task(struct rq *this_rq)
if (is_migration_disabled(p)) {
push_task = get_push_task(src_rq);
} else {
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ do_push_task(src_rq, this_rq, p);
dmin = p->dl.deadline;
resched = true;
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..dd072d11cc02 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2106,9 +2106,7 @@ static int push_rt_task(struct rq *rq, bool pull)
goto retry;
}

- deactivate_task(rq, next_task, 0);
- set_task_cpu(next_task, lowest_rq->cpu);
- activate_task(lowest_rq, next_task, 0);
+ do_push_task(rq, lowest_rq, next_task);
resched_curr(lowest_rq);
ret = 1;

@@ -2379,9 +2377,7 @@ static void pull_rt_task(struct rq *this_rq)
if (is_migration_disabled(p)) {
push_task = get_push_task(src_rq);
} else {
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ do_push_task(src_rq, this_rq, p);
resched = true;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe047bd5d..6ca83837e0f4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3472,5 +3472,14 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }

extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+#ifdef CONFIG_SMP
+static inline
+void do_push_task(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
+{
+ deactivate_task(rq, task, 0);
+ set_task_cpu(task, dst_rq->cpu);
+ activate_task(dst_rq, task, 0);
+}
+#endif

#endif /* _KERNEL_SCHED_SCHED_H */
--
2.44.0.291.gc1ea87d7ee-goog


2024-03-15 04:41:28

by John Stultz

Subject: [PATCH v9 5/7] sched: Consolidate pick_*_task to task_is_pushable helper

From: Connor O'Brien <[email protected]>

This patch consolidates the rt and deadline pick_*_task() functions
into a task_is_pushable() helper.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch,
renamed helper function]
Signed-off-by: John Stultz <[email protected]>
---
v7:
* Split from chain migration patch
* Renamed function
---
kernel/sched/deadline.c | 10 +---------
kernel/sched/rt.c | 11 +----------
kernel/sched/sched.h | 10 ++++++++++
3 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e68d88963e89..1b9cdb507498 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2179,14 +2179,6 @@ static void task_fork_dl(struct task_struct *p)
/* Only try algorithms three times */
#define DL_MAX_TRIES 3

-static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
-{
- if (!task_on_cpu(rq, p) &&
- cpumask_test_cpu(cpu, &p->cpus_mask))
- return 1;
- return 0;
-}
-
/*
* Return the earliest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise:
@@ -2205,7 +2197,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
if (next_node) {
p = __node_2_pdl(next_node);

- if (pick_dl_task(rq, p, cpu))
+ if (task_is_pushable(rq, p, cpu) == 1)
return p;

next_node = rb_next(next_node);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index dd072d11cc02..638e7c158ae4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1791,15 +1791,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
/* Only try algorithms three times */
#define RT_MAX_TRIES 3

-static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
-{
- if (!task_on_cpu(rq, p) &&
- cpumask_test_cpu(cpu, &p->cpus_mask))
- return 1;
-
- return 0;
-}
-
/*
* Return the highest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise
@@ -1813,7 +1804,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
return NULL;

plist_for_each_entry(p, head, pushable_tasks) {
- if (pick_rt_task(rq, p, cpu))
+ if (task_is_pushable(rq, p, cpu) == 1)
return p;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6ca83837e0f4..c83e5e0672dc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3480,6 +3480,16 @@ void do_push_task(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
set_task_cpu(task, dst_rq->cpu);
activate_task(dst_rq, task, 0);
}
+
+static inline
+int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
+{
+ if (!task_on_cpu(rq, p) &&
+ cpumask_test_cpu(cpu, &p->cpus_mask))
+ return 1;
+
+ return 0;
+}
#endif

#endif /* _KERNEL_SCHED_SCHED_H */
--
2.44.0.291.gc1ea87d7ee-goog


2024-03-15 04:41:43

by John Stultz

Subject: [PATCH v9 6/7] sched: Split out __schedule() deactivate task logic into a helper

As we're going to re-use the deactivation logic,
split it into a helper.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
v6:
* Define function as static to avoid "no previous prototype"
warnings as Reported-by: kernel test robot <[email protected]>
v7:
* Rename state task_state to be more clear, as suggested by
Metin Kaya
---
kernel/sched/core.c | 72 +++++++++++++++++++++++++++------------------
1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ad4748327651..b537e5f501ea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6563,6 +6563,48 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
# define SM_MASK_PREEMPT SM_PREEMPT
#endif

+/*
+ * Helper function for __schedule()
+ *
+ * If a task does not have signals pending, deactivate it and return true
+ * Otherwise marks the task's __state as RUNNING and returns false
+ */
+static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
+ unsigned long task_state)
+{
+ if (signal_pending_state(task_state, p)) {
+ WRITE_ONCE(p->__state, TASK_RUNNING);
+ } else {
+ p->sched_contributes_to_load =
+ (task_state & TASK_UNINTERRUPTIBLE) &&
+ !(task_state & TASK_NOLOAD) &&
+ !(task_state & TASK_FROZEN);
+
+ if (p->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (p->on_rq && ...)
+ * if (prev_state) goto out;
+ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
+ * p->state = TASK_WAKING
+ *
+ * Where __schedule() and ttwu() have matching control dependencies.
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
+ deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+ if (p->in_iowait) {
+ atomic_inc(&rq->nr_iowait);
+ delayacct_blkio_start();
+ }
+ return true;
+ }
+ return false;
+}
+
/*
* __schedule() is the main scheduler function.
*
@@ -6654,35 +6696,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
- if (signal_pending_state(prev_state, prev)) {
- WRITE_ONCE(prev->__state, TASK_RUNNING);
- } else {
- prev->sched_contributes_to_load =
- (prev_state & TASK_UNINTERRUPTIBLE) &&
- !(prev_state & TASK_NOLOAD) &&
- !(prev_state & TASK_FROZEN);
-
- if (prev->sched_contributes_to_load)
- rq->nr_uninterruptible++;
-
- /*
- * __schedule() ttwu()
- * prev_state = prev->state; if (p->on_rq && ...)
- * if (prev_state) goto out;
- * p->on_rq = 0; smp_acquire__after_ctrl_dep();
- * p->state = TASK_WAKING
- *
- * Where __schedule() and ttwu() have matching control dependencies.
- *
- * After this, schedule() must not care about p->state any more.
- */
- deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
- if (prev->in_iowait) {
- atomic_inc(&rq->nr_iowait);
- delayacct_blkio_start();
- }
- }
+ try_to_deactivate_task(rq, prev, prev_state);
switch_count = &prev->nvcsw;
}

--
2.44.0.291.gc1ea87d7ee-goog


2024-03-15 04:41:57

by John Stultz

Subject: [PATCH v9 7/7] sched: Split scheduler and execution contexts

From: Peter Zijlstra <[email protected]>

Let's define the scheduling context as all the scheduler state
in task_struct for the task selected to run, and the execution
context as all state required to actually run the task.

Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the task selected to be scheduled, but use the
execution context of a different task to actually be run.

To this purpose, introduce the rq_selected() macro to point to
the task_struct selected from the runqueue by the scheduler; it
will be used for scheduler state, while rq->curr is preserved to
indicate the execution context of the task that will actually be
run.
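
A minimal sketch of accessors consistent with how they are used in
the hunks below (the actual definitions live in the
kernel/sched/sched.h hunk of this patch; the field and config names
here follow the changelog and may differ in detail):

#ifdef CONFIG_SCHED_PROXY_EXEC
#define rq_selected(rq)		((rq)->curr_selected)
static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
{
	rq->curr_selected = t;
}
#else
#define rq_selected(rq)		((rq)->curr)
static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
{ }
#endif

/* Is @p the task the scheduler currently has selected on @rq? */
static inline bool task_current_selected(struct rq *rq, struct task_struct *p)
{
	return rq_selected(rq) == p;
}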

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[add additional comments and update more sched_class code to use
rq::proxy]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Rebased and resolved minor collisions, reworked to use
accessors, tweaked update_curr_common to use rq_proxy fixing rt
scheduling issues]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Reworked to use accessors
* Fixed update_curr_common to use proxy instead of curr
v3:
* Tweaked wrapper names
* Swapped proxy for selected for clarity
v4:
* Minor variable name tweaks for readability
* Use a macro instead of a inline function and drop
other helper functions as suggested by Peter.
* Remove verbose comments/questions to avoid review
distractions, as suggested by Dietmar
v5:
* Add CONFIG_PROXY_EXEC option to this patch so the
new logic can be tested with this change
* Minor fix to grab rq_selected when holding the rq lock
v7:
* Minor spelling fix and unused argument fixes suggested by
Metin Kaya
* Switch to curr_selected for consistency, and minor rewording
of commit message for clarity
* Rename variables selected instead of curr when we're using
rq_selected()
* Reduce macros in CONFIG_SCHED_PROXY_EXEC ifdef sections,
as suggested by Metin Kaya
v8:
* Use rq->curr, not rq_selected with task_tick, as suggested by
Valentin
* Minor rework to reorder this with CONFIG_SCHED_PROXY_EXEC patch
---
kernel/sched/core.c | 46 ++++++++++++++++++++++++++---------------
kernel/sched/deadline.c | 35 ++++++++++++++++---------------
kernel/sched/fair.c | 18 ++++++++--------
kernel/sched/rt.c | 40 +++++++++++++++++------------------
kernel/sched/sched.h | 25 ++++++++++++++++++++--
5 files changed, 99 insertions(+), 65 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b537e5f501ea..c17f91d6ceba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -794,7 +794,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)

rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+ rq_selected(rq)->sched_class->task_tick(rq, rq->curr, 1);
rq_unlock(rq, &rf);

return HRTIMER_NORESTART;
@@ -2238,16 +2238,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,

void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->sched_class == rq->curr->sched_class)
- rq->curr->sched_class->wakeup_preempt(rq, p, flags);
- else if (sched_class_above(p->sched_class, rq->curr->sched_class))
+ struct task_struct *selected = rq_selected(rq);
+
+ if (p->sched_class == selected->sched_class)
+ selected->sched_class->wakeup_preempt(rq, p, flags);
+ else if (sched_class_above(p->sched_class, selected->sched_class))
resched_curr(rq);

/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
- if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+ if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq);
}

@@ -2774,7 +2776,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
lockdep_assert_held(&p->pi_lock);

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);

if (queued) {
/*
@@ -5587,7 +5589,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
* project cycles that may never be accounted to this
* thread, breaking clock_gettime().
*/
- if (task_current(rq, p) && task_on_rq_queued(p)) {
+ if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
prefetch_curr_exec_start(p);
update_rq_clock(rq);
p->sched_class->update_curr(rq);
@@ -5655,7 +5657,8 @@ void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
- struct task_struct *curr = rq->curr;
+ /* accounting goes to the selected task */
+ struct task_struct *selected;
struct rq_flags rf;
unsigned long thermal_pressure;
u64 resched_latency;
@@ -5666,16 +5669,17 @@ void scheduler_tick(void)
sched_clock_tick();

rq_lock(rq, &rf);
+ selected = rq_selected(rq);

update_rq_clock(rq);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
- curr->sched_class->task_tick(rq, curr, 0);
+ selected->sched_class->task_tick(rq, selected, 0);
if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
sched_core_tick(rq);
- task_tick_mm_cid(rq, curr);
+ task_tick_mm_cid(rq, selected);

rq_unlock(rq, &rf);

@@ -5684,8 +5688,8 @@ void scheduler_tick(void)

perf_event_task_tick();

- if (curr->flags & PF_WQ_WORKER)
- wq_worker_tick(curr);
+ if (selected->flags & PF_WQ_WORKER)
+ wq_worker_tick(selected);

#ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
@@ -5750,6 +5754,12 @@ static void sched_tick_remote(struct work_struct *work)
struct task_struct *curr = rq->curr;

if (cpu_online(cpu)) {
+ /*
+ * Since this is a remote tick for full dynticks mode,
+ * we are always sure that there is no proxy (only a
+ * single task is running).
+ */
+ SCHED_WARN_ON(rq->curr != rq_selected(rq));
update_rq_clock(rq);

if (!is_idle_task(curr)) {
@@ -6701,6 +6711,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}

next = pick_next_task(rq, prev, &rf);
+ rq_set_selected(rq, next);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
@@ -7201,7 +7212,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)

prev_class = p->sched_class;
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flag);
if (running)
@@ -7291,7 +7302,7 @@ void set_user_nice(struct task_struct *p, long nice)
}

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
if (running)
@@ -7870,7 +7881,7 @@ static int __sched_setscheduler(struct task_struct *p,
}

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flags);
if (running)
@@ -9297,6 +9308,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
rcu_read_unlock();

rq->idle = idle;
+ rq_set_selected(rq, idle);
rcu_assign_pointer(rq->curr, idle);
idle->on_rq = TASK_ON_RQ_QUEUED;
#ifdef CONFIG_SMP
@@ -9386,7 +9398,7 @@ void sched_setnuma(struct task_struct *p, int nid)

rq = task_rq_lock(p, &rf);
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);

if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -10491,7 +10503,7 @@ void sched_move_task(struct task_struct *tsk)

update_rq_clock(rq);

- running = task_current(rq, tsk);
+ running = task_current_selected(rq, tsk);
queued = task_on_rq_queued(tsk);

if (queued)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1b9cdb507498..c30b592d6e9d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1218,7 +1218,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
#endif

enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
wakeup_preempt_dl(rq, p, 0);
else
resched_curr(rq);
@@ -1442,7 +1442,7 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
*/
static void update_curr_dl(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_dl_entity *dl_se = &curr->dl;
s64 delta_exec;

@@ -1899,7 +1899,7 @@ static int find_later_rq(struct task_struct *task);
static int
select_task_rq_dl(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
bool select_rq;
struct rq *rq;

@@ -1910,6 +1910,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)

rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));

/*
* If we are dealing with a -deadline task, we must
@@ -1920,9 +1921,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- select_rq = unlikely(dl_task(curr)) &&
+ select_rq = unlikely(dl_task(selected)) &&
(curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ !dl_entity_preempt(&p->dl, &selected->dl)) &&
p->nr_cpus_allowed > 1;

/*
@@ -1985,7 +1986,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
+ !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
return;

/*
@@ -2024,7 +2025,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
int flags)
{
- if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
+ if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
resched_curr(rq);
return;
}
@@ -2034,7 +2035,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
* In the unlikely case current and p have the same deadline
* let us try to decide what's the best thing to do...
*/
- if ((p->dl.deadline == rq->curr->dl.deadline) &&
+ if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
!test_tsk_need_resched(rq->curr))
check_preempt_equal_dl(rq, p);
#endif /* CONFIG_SMP */
@@ -2066,7 +2067,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (!first)
return;

- if (rq->curr->sched_class != &dl_sched_class)
+ if (rq_selected(rq)->sched_class != &dl_sched_class)
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);

deadline_queue_push_tasks(rq);
@@ -2391,8 +2392,8 @@ static int push_dl_task(struct rq *rq)
* can move away, it makes sense to just reschedule
* without going further in pushing next_task.
*/
- if (dl_task(rq->curr) &&
- dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+ if (dl_task(rq_selected(rq)) &&
+ dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
rq->curr->nr_cpus_allowed > 1) {
resched_curr(rq);
return 0;
@@ -2515,7 +2516,7 @@ static void pull_dl_task(struct rq *this_rq)
* deadline than the current task of its runqueue.
*/
if (dl_time_before(p->dl.deadline,
- src_rq->curr->dl.deadline))
+ rq_selected(src_rq)->dl.deadline))
goto skip;

if (is_migration_disabled(p)) {
@@ -2554,9 +2555,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
if (!task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- dl_task(rq->curr) &&
+ dl_task(rq_selected(rq)) &&
(rq->curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
+ !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
push_dl_tasks(rq);
}
}
@@ -2731,12 +2732,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
return;
}

- if (rq->curr != p) {
+ if (rq_selected(rq) != p) {
#ifdef CONFIG_SMP
if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
deadline_queue_push_tasks(rq);
#endif
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
wakeup_preempt_dl(rq, p, 0);
else
resched_curr(rq);
@@ -2765,7 +2766,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
if (!rq->dl.overloaded)
deadline_queue_pull_task(rq);

- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
/*
* If we now have a earlier deadline task than p,
* then reschedule, provided p is still on this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..dc342e1fc420 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1140,7 +1140,7 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
*/
s64 update_curr_common(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
s64 delta_exec;

delta_exec = update_curr_se(rq, &curr->se);
@@ -1177,7 +1177,7 @@ static void update_curr(struct cfs_rq *cfs_rq)

static void update_curr_fair(struct rq *rq)
{
- update_curr(cfs_rq_of(&rq->curr->se));
+ update_curr(cfs_rq_of(&rq_selected(rq)->se));
}

static inline void
@@ -6627,7 +6627,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
s64 delta = slice - ran;

if (delta < 0) {
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
return;
}
@@ -6642,7 +6642,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
*/
static void hrtick_update(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);

if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
return;
@@ -8267,7 +8267,7 @@ static void set_next_buddy(struct sched_entity *se)
*/
static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int cse_is_idle, pse_is_idle;
@@ -8298,7 +8298,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
- if (test_tsk_need_resched(curr))
+ if (test_tsk_need_resched(rq->curr))
return;

/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -9282,7 +9282,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
* update_load_avg() can call cpufreq_update_util(). Make sure that RT,
* DL and IRQ signals have been updated before updating CFS.
*/
- curr_class = rq->curr->sched_class;
+ curr_class = rq_selected(rq)->sched_class;

thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));

@@ -12673,7 +12673,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
* our priority decreased, or if we are not currently running on
* this runqueue and our priority is higher than the current's
*/
- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
if (p->prio > oldprio)
resched_curr(rq);
} else
@@ -12776,7 +12776,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
* kick off the schedule if running, otherwise just see
* if we can still preempt the current task.
*/
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
else
wakeup_preempt(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 638e7c158ae4..48fc7a198f1a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -530,7 +530,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)

static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
{
- struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
+ struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
struct rq *rq = rq_of_rt_rq(rt_rq);
struct sched_rt_entity *rt_se;

@@ -1000,7 +1000,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
*/
static void update_curr_rt(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_rt_entity *rt_se = &curr->rt;
s64 delta_exec;

@@ -1543,7 +1543,7 @@ static int find_lowest_rq(struct task_struct *task);
static int
select_task_rq_rt(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
struct rq *rq;
bool test;

@@ -1555,6 +1555,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)

rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));

/*
* If the current task on @p's runqueue is an RT task, then
@@ -1583,8 +1584,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
* systems like big.LITTLE.
*/
test = curr &&
- unlikely(rt_task(curr)) &&
- (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+ unlikely(rt_task(selected)) &&
+ (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);

if (test || !rt_task_fits_capacity(p, cpu)) {
int target = find_lowest_rq(p);
@@ -1614,12 +1615,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)

static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
{
- /*
- * Current can't be migrated, useless to reschedule,
- * let's hope p can move out.
- */
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+ !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
return;

/*
@@ -1662,7 +1659,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
*/
static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->prio < rq->curr->prio) {
+ struct task_struct *curr = rq_selected(rq);
+
+ if (p->prio < curr->prio) {
resched_curr(rq);
return;
}
@@ -1680,7 +1679,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
* to move current somewhere else, making room for our non-migratable
* task.
*/
- if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+ if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
check_preempt_equal_prio(rq, p);
#endif
}
@@ -1705,7 +1704,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
* utilization. We only care of the case where we start to schedule a
* rt task
*/
- if (rq->curr->sched_class != &rt_sched_class)
+ if (rq_selected(rq)->sched_class != &rt_sched_class)
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);

rt_queue_push_tasks(rq);
@@ -1977,6 +1976,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)

BUG_ON(rq->cpu != task_cpu(p));
BUG_ON(task_current(rq, p));
+ BUG_ON(task_current_selected(rq, p));
BUG_ON(p->nr_cpus_allowed <= 1);

BUG_ON(!task_on_rq_queued(p));
@@ -2009,7 +2009,7 @@ static int push_rt_task(struct rq *rq, bool pull)
* higher priority than current. If that's the case
* just reschedule current.
*/
- if (unlikely(next_task->prio < rq->curr->prio)) {
+ if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
resched_curr(rq);
return 0;
}
@@ -2362,7 +2362,7 @@ static void pull_rt_task(struct rq *this_rq)
* p if it is lower in priority than the
* current task on the run queue
*/
- if (p->prio < src_rq->curr->prio)
+ if (p->prio < rq_selected(src_rq)->prio)
goto skip;

if (is_migration_disabled(p)) {
@@ -2404,9 +2404,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
bool need_to_push = !task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- (dl_task(rq->curr) || rt_task(rq->curr)) &&
+ (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
(rq->curr->nr_cpus_allowed < 2 ||
- rq->curr->prio <= p->prio);
+ rq_selected(rq)->prio <= p->prio);

if (need_to_push)
push_rt_tasks(rq);
@@ -2490,7 +2490,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
rt_queue_push_tasks(rq);
#endif /* CONFIG_SMP */
- if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+ if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
resched_curr(rq);
}
}
@@ -2505,7 +2505,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
if (!task_on_rq_queued(p))
return;

- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
#ifdef CONFIG_SMP
/*
* If our priority decreases while running, we
@@ -2531,7 +2531,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
* greater than the current running task
* then reschedule.
*/
- if (p->prio < rq->curr->prio)
+ if (p->prio < rq_selected(rq)->prio)
resched_curr(rq);
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c83e5e0672dc..808d6ee8ae33 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1030,7 +1030,7 @@ struct rq {
*/
unsigned int nr_uninterruptible;

- struct task_struct __rcu *curr;
+ struct task_struct __rcu *curr; /* Execution context */
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
@@ -1225,6 +1225,13 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)

+/* For now, rq_selected == rq->curr */
+#define rq_selected(rq) ((rq)->curr)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+ /* Do nothing */
+}
+
struct sched_group;
#ifdef CONFIG_SCHED_CORE
static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2148,11 +2155,25 @@ static inline u64 global_rt_runtime(void)
return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
}

+/*
+ * Is p the current execution context?
+ */
static inline int task_current(struct rq *rq, struct task_struct *p)
{
return rq->curr == p;
}

+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq_selected() == p.
+ */
+static inline int task_current_selected(struct rq *rq, struct task_struct *p)
+{
+ return rq_selected(rq) == p;
+}
+
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
@@ -2322,7 +2343,7 @@ struct sched_class {

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
- WARN_ON_ONCE(rq->curr != prev);
+ WARN_ON_ONCE(rq_selected(rq) != prev);
prev->sched_class->put_prev_task(rq, prev);
}

--
2.44.0.291.gc1ea87d7ee-goog
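
To make the intent of the rq_selected()/task_current_selected() helpers above
more concrete, here is a minimal user-space sketch (not kernel code, only an
illustration of the idea): preemption and other scheduler-state decisions
consult the selected task (scheduling context), while "is this task on the
CPU" checks consult rq->curr (execution context). The donor/owner scenario and
the struct layouts below are simplified assumptions, not the series' actual
data structures.

```
/* Illustrative user-space model of the scheduling vs. execution context split. */
#include <stdio.h>

struct task {
	const char *comm;
	int prio;		/* lower value == higher priority, as for RT */
};

struct rq {
	struct task *curr;	/* execution context: the task running on the CPU */
	struct task *selected;	/* scheduling context: the task the scheduler picked */
};

/* In the preparatory patch rq_selected() is still rq->curr; later they may differ. */
static struct task *rq_selected(struct rq *rq)
{
	return rq->selected ? rq->selected : rq->curr;
}

/* Preemption decisions compare against the scheduling context... */
static int should_preempt(struct rq *rq, struct task *p)
{
	return p->prio < rq_selected(rq)->prio;
}

/* ...while "who is on the CPU" questions ask the execution context. */
static int task_current(struct rq *rq, struct task *p)
{
	return rq->curr == p;
}

static int task_current_selected(struct rq *rq, struct task *p)
{
	return rq_selected(rq) == p;
}

int main(void)
{
	struct task owner = { "lock-owner", 120 };	/* runs on the CPU */
	struct task donor = { "rt-donor",    10 };	/* supplies scheduling state */
	struct task waker = { "waker",       50 };
	struct rq rq = { .curr = &owner, .selected = &donor };

	printf("preempt? %d (compared against %s, prio %d)\n",
	       should_preempt(&rq, &waker), rq_selected(&rq)->comm,
	       rq_selected(&rq)->prio);
	printf("%s: task_current=%d task_current_selected=%d\n",
	       owner.comm, task_current(&rq, &owner),
	       task_current_selected(&rq, &owner));
	printf("%s: task_current=%d task_current_selected=%d\n",
	       donor.comm, task_current(&rq, &donor),
	       task_current_selected(&rq, &donor));
	return 0;
}
```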


2024-03-18 15:06:07

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v9 0/7] Preparatory changes for Proxy Execution v9

On 15/03/2024 4:39 am, John Stultz wrote:
> As mentioned last time[1], after previous submissions of the
> Proxy Execution series, I got feedback that the patch series was
> getting a bit unwieldy to review, and Qais suggested I break out
> just the cleanups/preparatory components of the patch series and
> submit them on their own in the hope we can start to merge the
> less complex bits and discussion can focus on the more
> complicated portions afterwards. This so far has not been very
> successful, with the submission & RESEND of the v8 preparatory
> changes not getting much in the way of review.
>
> Nonetheless, for v9 of this series, I’m again only submitting
> those early cleanup/preparatory changes here (which have not
> changed since the v8 submissions, but to avoid confusion with the
> git branch names, I’m labeling it as v9). In the meantime, I’ve
> continued to put a lot of effort into the full series, mostly
> focused on polishing the series for correctness, and fixing some
> hard to trip races.
>
> If you are interested, the full v9 series, it can be found here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v9-6.8
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v9-6.8
>
>
> New in v9:
> (In the git tree. Again, none of the preparatory patches
> submitted here have changed since v8)
> ---------
> * Change to force mutex lock handoff when we have a blocked donor
> (preserves optimistic spinning elsewhere, but still prioritizes
> donor when present on unlock)
>
> * Do return migration whenever we’re not on the wake_cpu (should
> address placement concerns brought up earlier by Xuewen Yan)
>
> * Closed hole where we might mark a task as BO_RUNNABLE without
> doing return migration
>
> * Much improved handling of balance callbacks when we need to
> pick_again
>
> * Fixes for cases where we put_prev_task() but left a dangling
> pointer to rq_selected() when deactivating a task (as it could
> then be migrated away while we still have a reference to it),
> by selecting idle before deactivating next.
>
> * Fixes for dangling references to rq->curr (which had been
> put_prev_task’ed) when we drop rq lock for proxy_migration
>
> * Fixes for ttwu / find_proxy_task() races if the lock owner was
> being return migrated, and ttwu hadn’t yet set_task_cpu() and
> activated it, which allowed that task to be scheduled on two
> cpus at the same time.
>
> * Fix for live-lock between activate_blocked_tasks() and
> proxy_enqueue_on_owner() if activated owner went right back to
> sleep (which also simplifies the locking in
> activate_blocked_tasks())
>
> * Cleanups to avoid locked BO_WAKING->BO_RUNNABLE transition in
> try_to_wake_up() if proxy execution isn't enabled
>
> * Fix for psi_dequeue, as proxy changes assumptions around
> voluntary sleeps.
>
> * Numerous typos, comment improvements, and other fixups
> suggested by Metin
>
> * And more!
>
>
> Performance:
> ---------
> K Prateek Nayak provided some feedback on the v8 series here[2].
> Given the potential extra overhead of doing rq migrations/return
> migrations/etc for the proxy case, it’s not completely surprising
> a few of K Prateek’s test cases saw ~3-5% regressions, but I’m
> hoping to look into this soon to see if we can reduce those
> further. The donor mutex handoff in this revision may help some.
>
>
> Issues still to address:
> ---------
> * The chain migration functionality needs further iterations and
> better validation to ensure it truly maintains the RT/DL load
> balancing invariants.
>
> * CFS load balancing. There was concern that blocked tasks may
> carry forward load (PELT) to the lock owner's CPU, so the CPU
> may look like it is overloaded. Needs investigation.
>
> * The sleeping owner handling (where we deactivate waiting tasks
> and enqueue them onto a list, then reactivate them when the
> owner wakes up) doesn’t feel great. This is in part because
> when we want to activate tasks, we’re already holding a
> task.pi_lock and a rq_lock, just not the locks for the task
> we’re activating, nor the rq we’re enqueuing it onto. So there
> has to be a bit of lock juggling to drop and acquire the right
> locks (in the right order). It feels like there’s got to be a
> better way. Also needs some rework to get rid of the recursion.
>
>
> Credit/Disclaimer:
> —--------------------
> As mentioned previously, this Proxy Execution series has a long
> history: First described in a paper[3] by Watkins, Straub,
> Niehaus, then from patches from Peter Zijlstra, extended with
> lots of work by Juri Lelli, Valentin Schneider, and Connor
> O'Brien. (and thank you to Steven Rostedt for providing
> additional details here!)
>
> So again, many thanks to those above, as all the credit for this
> series really is due to them - while the mistakes are likely
> mine.
>
> Thanks so much!
> -john
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
>
>
> Connor O'Brien (2):
> sched: Add do_push_task helper
> sched: Consolidate pick_*_task to task_is_pushable helper
>
> John Stultz (1):
> sched: Split out __schedule() deactivate task logic into a helper
>
> Juri Lelli (2):
> locking/mutex: Make mutex::wait_lock irq safe
> locking/mutex: Expose __mutex_owner()
>
> Peter Zijlstra (2):
> locking/mutex: Remove wakeups from under mutex::wait_lock
> sched: Split scheduler and execution contexts
>
> kernel/locking/mutex.c | 60 +++++++----------
> kernel/locking/mutex.h | 25 +++++++
> kernel/locking/rtmutex.c | 26 +++++---
> kernel/locking/rwbase_rt.c | 4 +-
> kernel/locking/rwsem.c | 4 +-
> kernel/locking/spinlock_rt.c | 3 +-
> kernel/locking/ww_mutex.h | 49 ++++++++------
> kernel/sched/core.c | 122 +++++++++++++++++++++--------------
> kernel/sched/deadline.c | 53 ++++++---------
> kernel/sched/fair.c | 18 +++---
> kernel/sched/rt.c | 59 +++++++----------
> kernel/sched/sched.h | 44 ++++++++++++-
> 12 files changed, 268 insertions(+), 199 deletions(-)
>

I could cleanly apply this patch series to the `linux-next/next-20240318`
branch.

Here is the checkpatch report:
```
$ ./scripts/checkpatch.pl -g 6aa2a9e7aea6d..HEAD

---------------------------------------------------------------------
Commit a9124906b27f ("locking/mutex: Make mutex::wait_lock irq safe")
---------------------------------------------------------------------
total: 0 errors, 0 warnings, 128 lines checked

Commit a9124906b27f ("locking/mutex: Make mutex::wait_lock irq safe")
has no obvious style problems and is ready for submission.
-------------------------------------------------------------
Commit c3b81eb4858a ("locking/mutex: Expose __mutex_owner()")
-------------------------------------------------------------
WARNING: Possible repeated word: 'at'
#91: FILE: kernel/locking/mutex.h:26:
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.

total: 0 errors, 1 warnings, 62 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or
--fix-inplace.

Commit c3b81eb4858a ("locking/mutex: Expose __mutex_owner()") has style
problems, please review.
------------------------------------------------------
Commit 6adb713bd389 ("sched: Add do_push_task helper")
------------------------------------------------------
total: 0 errors, 0 warnings, 64 lines checked

Commit 6adb713bd389 ("sched: Add do_push_task helper") has no obvious
style problems and is ready for submission.
---------------------------------------------------------------------------------
Commit 037345d6896a ("sched: Consolidate pick_*_task to task_is_pushable
helper")
---------------------------------------------------------------------------------
total: 0 errors, 0 warnings, 61 lines checked

Commit 037345d6896a ("sched: Consolidate pick_*_task to task_is_pushable
helper") has no obvious style problems and is ready for submission.
-----------------------------------------------------------------------------------------
Commit 6c9aa2fd15de ("sched: Split out __schedule() deactivate task
logic into a helper")
-----------------------------------------------------------------------------------------
total: 0 errors, 0 warnings, 84 lines checked

Commit 6c9aa2fd15de ("sched: Split out __schedule() deactivate task
logic into a helper") has no obvious style problems and is ready for
submission.
---------------------------------------------------------------------
Commit 6af85d44be35 ("sched: Split scheduler and execution contexts")
---------------------------------------------------------------------
WARNING: Duplicate signature
#46:
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

WARNING: Do not crash the kernel unless it is absolutely
unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead
of BUG() or variants
#557: FILE: kernel/sched/rt.c:1979:
+ BUG_ON(task_current_selected(rq, p));

total: 0 errors, 2 warnings, 539 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or
--fix-inplace.

Commit 6af85d44be35 ("sched: Split scheduler and execution contexts")
has style problems, please review.

NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
```

I think only the duplicate signature warning is a real problem; I will
reply to the offending mail.

I issued `make defconfig` on an x86_64 machine, enabled the lock & RCU
debugging config options and built the kernel.

Then I ran the commands below on a QEMU Debian guest (arch=x86_64, cores=32,
threads=2, memory=2GiB) before and after applying this patch series:
```
$ insmod torture.ko && insmod scftorture.ko verbose=1 onoff_holdoff=5
onoff_interval=1; sleep 60; rmmod scftorture torture
$ insmod torture.ko random_shuffle=1 && insmod locktorture.ko
torture_type=mutex_lock rt_boost=1 rt_boost_factor=50 nested_locks=3
writer_fifo=1; sleep 60; rmmod locktorture torture
$ taskset -c 0-31,32-63 perf bench sched messaging -p -t -l 100000 -g 2
$ perf bench sched pipe -l 100000 -T
```

Logs before this patch series:
```
$ insmod torture.ko && insmod scftorture.ko verbose=1 onoff_holdoff=5
onoff_interval=1; sleep 60; rmmod scftorture torture
[ 35.912508] torture module --- scftorture: disable_onoff_at_boot=0
ftrace_dump_at_shutdown=0 verbose_sleep_frequency=0
verbose_sleep_duration=1 random_shuffle=0
[ 35.914627] scftorture: --- Start of test: verbose=1 holdoff=0
longwait=0 nthreads=-1 onoff_holdoff=5 onoff_interval=1 shutdown_secs=0
stat_interval=60 stutter=5 use_cpus_read_lock=0, weight_resched=-1,
weight_single=-1, weight_single_rpc=-1, weight_single_wait=-1, weight_many1
[ 35.918824] scftorture: !!! built as module, weight_resched ignored
[ 35.919842] scf_sel_dump: 32.820 smp_call_function_single(nowait)
[ 35.921035] scf_sel_dump: 32.820 smp_call_function_single_rpc(wait)
[ 35.922183] scf_sel_dump: 32.820 smp_call_function_single(wait)
[ 35.923334] scf_sel_dump: 0.512 smp_call_function_many(nowait)
[ 35.924372] scf_sel_dump: 0.512 smp_call_function_many(wait)
[ 35.925410] scf_sel_dump: 0.256 smp_call_function(nowait)
[ 35.926398] scf_sel_dump: 0.256 smp_call_function(wait)
[ 35.927373] scftorture-torture: Creating torture_onoff task
[ 35.928665] scftorture-torture: torture_onoff task started
[ 35.928677] scftorture-torture: Creating torture_stutter task
[ 35.929696] scftorture-torture: torture_onoff begin holdoff
[ 35.930902] scftorture: Starting 64 smp_call_function() threads
[ 35.930904] scftorture-torture: torture_stutter task started
[ 35.933755] scftorture-torture: Creating scftorture_invoker task
[ 35.934966] scftorture-torture: Creating scftorture_invoker task
[ 35.934980] scftorture: scftorture_invoker 0: task started
..
[ 99.689799] scftorture: scf_invoked_count VER: 87177502 resched: 0
single: 28726722/28492519 single_ofl: 13877596/13766523 single_rpc:
28492768 single_rpc_ofl: 0 many: 445547/446215 all: 221861/223143 onoff:
1527/1527:1559/1559 14,74:8,642 23749:29438 (HZ=1000) ste: 0 stnmie: 0
[ 99.694023] scftorture: --- End of test: SUCCESS: verbose=1
holdoff=0 longwait=0 nthreads=64 onoff_holdoff=5 onoff_interval=1
shutdown_secs=0 stat_interval=60 stutter=5 use_cpus_read_lock=0,
weight_resched=-1, weight_single=-1, weight_single_rpc=-1,
weight_single_wait=-1, weig1

$ insmod torture.ko random_shuffle=1 && insmod locktorture.ko
torture_type=mutex_lock rt_boost=1 rt_boost_factor=50 nested_locks=3
writer_fifo=1; sleep 60; rmmod locktorture torture
[ 114.543156] torture module --- mutex_lock: disable_onoff_at_boot=0
ftrace_dump_at_shutdown=0 verbose_sleep_frequency=0
verbose_sleep_duration=1 random_shuffle=1
[ 114.545291] mutex_lock-torture:--- Start of test [debug]:
acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0
long_hold=100 nested_locks=3 nreaders_stress=0 nwriters_stress=128
onoff_holdoff=0 onoff_interval=0 rt_boost=1 rt_boost_factor=50
shuffle_interval=3 1
[ 114.549640] mutex_lock-torture: Creating torture_shuffle task
[ 114.550880] mutex_lock-torture: Creating torture_stutter task
[ 114.550905] mutex_lock-torture: torture_shuffle task started
[ 114.552006] mutex_lock-torture: Creating lock_torture_writer task
[ 114.552021] mutex_lock-torture: torture_stutter task started
..
[ 177.883162] mutex_lock-torture: Stopping lock_torture_stats task
[ 177.884249] Writes: Total: 72982638 Max/Min: 748221/404806 Fail: 0
[ 177.885483] mutex_lock-torture: lock_torture_stats is stopping
[ 177.886516] Writes: Total: 72982638 Max/Min: 748221/404806 Fail: 0
[ 177.887630] mutex_lock-torture:--- End of test: SUCCESS [debug]:
acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0
long_hold=100 nested_locks=3 nreaders_stress=0 nwriters_stress=128
onoff_holdoff=0 onoff_interval=0 rt_boost=1 rt_boost_factor=50 shuffle_inte1

$ taskset -c 0-31,32-63 perf bench sched messaging -p -t -l 100000 -g 2
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 2 groups == 80 threads run

Total time: 13.835 [sec]

$ perf bench sched pipe -l 100000 -T
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two threads

Total time: 2.772 [sec]

27.725990 usecs/op
36067 ops/sec
```

Logs after applying this patch series:
```
$ insmod torture.ko && insmod scftorture.ko verbose=1 onoff_holdoff=5
onoff_interval=1; sleep 60; rmmod scftorture torture
[ 17.840768] torture module --- scftorture: disable_onoff_at_boot=0
ftrace_dump_at_shutdown=0 verbose_sleep_frequency=0
verbose_sleep_duration=1 random_shuffle=0
[ 17.842941] scftorture: --- Start of test: verbose=1 holdoff=0
longwait=0 nthreads=-1 onoff_holdoff=5 onoff_interval=1 shutdown_secs=0
stat_interval=60 stutter=5 use_cpus_read_lock=0, weight_resched=-1,
weight_single=-1, weight_single_rpc=-1, weight_single_wait=-1, weight_many1
[ 17.847211] scftorture: !!! built as module, weight_resched ignored
[ 17.848135] scf_sel_dump: 32.820 smp_call_function_single(nowait)
[ 17.849230] scf_sel_dump: 32.820 smp_call_function_single_rpc(wait)
[ 17.850475] scf_sel_dump: 32.820 smp_call_function_single(wait)
[ 17.851523] scf_sel_dump: 0.512 smp_call_function_many(nowait)
[ 17.852559] scf_sel_dump: 0.512 smp_call_function_many(wait)
[ 17.853602] scf_sel_dump: 0.256 smp_call_function(nowait)
[ 17.854541] scf_sel_dump: 0.256 smp_call_function(wait)
[ 17.855490] scftorture-torture: Creating torture_onoff task
[ 17.856673] scftorture-torture: torture_onoff task started
[ 17.856694] scftorture-torture: Creating torture_stutter task
[ 17.857701] scftorture-torture: torture_onoff begin holdoff
[ 17.859816] scftorture: Starting 64 smp_call_function() threads
[ 17.859816] scftorture-torture: torture_stutter task started
[ 17.860839] scftorture-torture: Creating scftorture_invoker task
[ 17.860930] scftorture-torture: Creating scftorture_invoker task
[ 17.860940] scftorture: scftorture_invoker 0: task started
..
[ 78.914652] scftorture: scf_invoked_count VER: 86855271 resched: 0
single: 27751906/27523464 single_ofl: 12966034/12854343 single_rpc:
27530433 single_rpc_ofl: 0 many: 430178/430527 all: 215119/214473 onoff:
1441/1441:1474/1474 14,268:8,177 22685:28034 (HZ=1000) ste: 0 stnmie:0
[ 78.919133] scftorture: --- End of test: SUCCESS: verbose=1
holdoff=0 longwait=0 nthreads=64 onoff_holdoff=5 onoff_interval=1
shutdown_secs=0 stat_interval=60 stutter=5 use_cpus_read_lock=0,
weight_resched=-1, weight_single=-1, weight_single_rpc=-1,
weight_single_wait=-1, weig1

$ insmod torture.ko random_shuffle=1 && insmod locktorture.ko
torture_type=mutex_lock rt_boost=1 rt_boost_factor=50 nested_locks=3
writer_fifo=1; sleep 60; rmmod locktorture torture
[ 83.340115] torture module --- mutex_lock: disable_onoff_at_boot=0
ftrace_dump_at_shutdown=0 verbose_sleep_frequency=0
verbose_sleep_duration=1 random_shuffle=1
[ 83.342142] mutex_lock-torture:--- Start of test [debug]:
acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0
long_hold=100 nested_locks=3 nreaders_stress=0 nwriters_stress=128
onoff_holdoff=0 onoff_interval=0 rt_boost=1 rt_boost_factor=50
shuffle_interval=3 1
[ 83.346306] mutex_lock-torture: Creating torture_shuffle task
[ 83.347383] mutex_lock-torture: Creating torture_stutter task
[ 83.347395] mutex_lock-torture: torture_shuffle task started
[ 83.348549] mutex_lock-torture: Creating lock_torture_writer task
[ 83.348555] mutex_lock-torture: torture_stutter task started
..
[ 146.611980] mutex_lock-torture: Stopping lock_torture_stats task
[ 146.613012] Writes: Total: 75575274 Max/Min: 1085187/396665 ???
Fail: 0
[ 146.614183] mutex_lock-torture: lock_torture_stats is stopping
[ 146.615160] Writes: Total: 75575274 Max/Min: 1085187/396665 ???
Fail: 0
[ 146.616273] mutex_lock-torture:--- End of test: SUCCESS [debug]:
acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0
long_hold=100 nested_locks=3 nreaders_stress=0 nwriters_stress=128
onoff_holdoff=0 onoff_interval=0 rt_boost=1 rt_boost_factor=50 shuffle_inte1

$ taskset -c 0-31,32-63 perf bench sched messaging -p -t -l 100000 -g 2
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 2 groups == 80 threads run

Total time: 13.548 [sec]

$ perf bench sched pipe -l 100000 -T
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two threads

Total time: 2.350 [sec]

23.503460 usecs/op
42546 ops/sec
```

There are slight differences (both improvements and regressions) before and
after this patch series. I believe they can be ignored, since I ran the tests
only a few times and on a virtual machine.

I also ported the commits in
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v9-6.8
to an Android v6.1 kernel and tested them on my Pixel 6.

In conclusion, I'm happy to give my:
Reviewed-by: Metin Kaya <[email protected]>
Tested-by: Metin Kaya <[email protected]>

Thanks,

2024-03-18 15:07:20

by Metin Kaya

[permalink] [raw]
Subject: Re: [PATCH v9 7/7] sched: Split scheduler and execution contexts

On 15/03/2024 4:39 am, John Stultz wrote:
> From: Peter Zijlstra <[email protected]>
>
> Let's define the scheduling context as all the scheduler state
> in task_struct for the task selected to run, and the execution
> context as all state required to actually run the task.
>
> Currently both are intertwined in task_struct. We want to
> logically split these such that we can use the scheduling
> context of the task selected to be scheduled, but use the
> execution context of a different task to actually be run.
>
> To this purpose, introduce rq_selected() macro to point to the
> task_struct selected from the runqueue by the scheduler, and
> will be used for scheduler state, and preserve rq->curr to
> indicate the execution context of the task that will actually be
> run.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

Duplicate S-o-b for Peter.

[snip]


2024-03-25 13:09:08

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH v9 0/7] Preparatory changes for Proxy Execution v9

Hello John,

On 3/15/2024 10:09 AM, John Stultz wrote:
> As mentioned last time[1], after previous submissions of the
> Proxy Execution series, I got feedback that the patch series was
> getting a bit unwieldy to review, and Qais suggested I break out
> just the cleanups/preparatory components of the patch series and
> submit them on their own in the hope we can start to merge the
> less complex bits and discussion can focus on the more
> complicated portions afterwards. This so far has not been very
> successful, with the submission & RESEND of the v8 preparatory
> changes not getting much in the way of review.
>
> Nonetheless, for v9 of this series, I’m again only submitting
> those early cleanup/preparatory changes here (which have not
> changed since the v8 submissions, but to avoid confusion with the
> git branch names, I’m labeling it as v9). In the meantime, I’ve
> continued to put a lot of effort into the full series, mostly
> focused on polishing the series for correctness, and fixing some
> hard to trip races.
>
> If you are interested, the full v9 series, it can be found here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v9-6.8
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v9-6.8

Tested the v9 of the series.

tl;dr

o I still see a small regression for hackbench. I'll get some perf
profiles for the same and leave them in this thread soon (I do
not have them at the moment unfortunately)

o There is a regression for some combinations in schbench. I'll
have to recheck if I can consistently reproduce this or not and
look at the perf profile to see if something is sticking out.

The rest of the benchmark results look good. I'll leave them below and
go digging into the regressions.

o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode

o Kernels

tip: tip:sched/core at commit 8cec3dd9e593
("sched/core: Simplify code by removing
duplicate #ifdefs")

proxy-exec-full: tip + proxy execution commits from
"proxy-exec-v9-6.8"

o Results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
1-groups 1.00 [ -0.00]( 1.80) 1.03 [ -2.88]( 2.71)
2-groups 1.00 [ -0.00]( 1.76) 1.02 [ -2.32]( 1.71)
4-groups 1.00 [ -0.00]( 1.82) 1.03 [ -2.79]( 0.84)
8-groups 1.00 [ -0.00]( 1.40) 1.02 [ -1.89]( 0.89)
16-groups 1.00 [ -0.00]( 3.38) 1.01 [ -0.53]( 1.61)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
1 1.00 [ 0.00]( 0.44) 0.99 [ -1.30]( 0.66)
2 1.00 [ 0.00]( 0.39) 0.98 [ -1.76]( 0.64)
4 1.00 [ 0.00]( 0.40) 0.99 [ -1.12]( 0.63)
8 1.00 [ 0.00]( 0.16) 0.97 [ -2.94]( 1.49)
16 1.00 [ 0.00]( 3.00) 1.01 [ 0.92]( 2.18)
32 1.00 [ 0.00]( 0.84) 1.01 [ 0.66]( 1.22)
64 1.00 [ 0.00]( 1.66) 1.00 [ -0.39]( 0.24)
128 1.00 [ 0.00]( 1.04) 0.99 [ -1.23]( 2.26)
256 1.00 [ 0.00]( 0.26) 1.02 [ 1.92]( 1.09)
512 1.00 [ 0.00]( 0.15) 1.02 [ 1.84]( 0.17)
1024 1.00 [ 0.00]( 0.20) 1.03 [ 2.71]( 0.33)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
Copy 1.00 [ 0.00]( 6.19) 1.11 [ 11.16]( 2.57)
Scale 1.00 [ 0.00]( 6.47) 0.98 [ -2.43]( 7.68)
Add 1.00 [ 0.00]( 6.50) 0.99 [ -0.74]( 7.25)
Triad 1.00 [ 0.00]( 5.70) 1.03 [ 2.95]( 4.41)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
Copy 1.00 [ 0.00]( 3.22) 1.04 [ 4.29]( 3.02)
Scale 1.00 [ 0.00]( 6.17) 1.02 [ 1.97]( 1.55)
Add 1.00 [ 0.00]( 5.12) 1.02 [ 2.48]( 1.55)
Triad 1.00 [ 0.00]( 2.29) 1.01 [ 1.06]( 1.49)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.17) 0.98 [ -1.99]( 0.24)
2-clients 1.00 [ 0.00]( 0.49) 0.98 [ -1.86]( 0.45)
4-clients 1.00 [ 0.00]( 0.65) 0.98 [ -1.65]( 0.30)
8-clients 1.00 [ 0.00]( 0.56) 0.98 [ -1.73]( 0.41)
16-clients 1.00 [ 0.00]( 0.78) 0.98 [ -1.52]( 0.34)
32-clients 1.00 [ 0.00]( 0.62) 0.98 [ -1.90]( 0.73)
64-clients 1.00 [ 0.00]( 1.41) 0.99 [ -1.46]( 1.39)
128-clients 1.00 [ 0.00]( 0.83) 0.98 [ -1.63]( 0.89)
256-clients 1.00 [ 0.00]( 4.60) 1.01 [ 1.47]( 2.12)
512-clients 1.00 [ 0.00](54.18) 1.02 [ 2.25](56.18)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
1 1.00 [ -0.00](34.63) 1.43 [-43.33]( 2.73)
2 1.00 [ -0.00]( 2.70) 0.89 [ 10.81](23.82)
4 1.00 [ -0.00]( 4.70) 1.04 [ -4.44](12.54)
8 1.00 [ -0.00]( 5.09) 0.87 [ 13.21](14.08)
16 1.00 [ -0.00]( 5.08) 1.03 [ -3.39]( 4.10)
32 1.00 [ -0.00]( 2.91) 1.14 [-14.44]( 0.56)
64 1.00 [ -0.00]( 2.73) 1.04 [ -4.17]( 2.77)
128 1.00 [ -0.00]( 7.89) 1.07 [ -7.14]( 2.83)
256 1.00 [ -0.00](28.55) 0.69 [ 31.37](19.96)
512 1.00 [ -0.00]( 2.11) 1.01 [ -1.20]( 1.07)
--

I'll leave more test results on the thread as I get to them. It has been
a slightly busy season so sorry about the delays.
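
The exact methodology behind the tables above is not spelled out in the
thread, so treat the following as an assumption: each cell is the mean of
several runs normalized to the tip mean, "[pct imp]" is the percentage
improvement relative to tip, and "(CV)" is the coefficient of variation
(stddev/mean, in percent). A small sketch of that arithmetic:

```
/* Sketch (under the stated assumptions) of how normalized results, pct imp
 * and CV can be computed from raw runs. Build with: cc sketch.c -lm */
#include <math.h>
#include <stdio.h>

static double mean(const double *x, int n)
{
	double s = 0.0;
	for (int i = 0; i < n; i++)
		s += x[i];
	return s / n;
}

static double cv_percent(const double *x, int n)
{
	double m = mean(x, n), var = 0.0;
	for (int i = 0; i < n; i++)
		var += (x[i] - m) * (x[i] - m);
	return 100.0 * sqrt(var / n) / m;
}

int main(void)
{
	/* Made-up hackbench runtimes in seconds (lower is better). */
	double tip[]   = { 2.10, 2.14, 2.06 };
	double proxy[] = { 2.16, 2.20, 2.12 };
	int n = 3;
	double tm = mean(tip, n), pm = mean(proxy, n);

	/* tip is the baseline, so it normalizes to 1.00 with 0.00 pct imp. */
	printf("tip:   %.2f [ %+.2f ](%5.2f)\n",
	       tm / tm, 0.0, cv_percent(tip, n));
	printf("proxy: %.2f [ %+.2f ](%5.2f)\n",
	       pm / tm, 100.0 * (tm - pm) / tm, cv_percent(proxy, n));
	return 0;
}
```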

>
>
> New in v9:
> (In the git tree. Again, none of the preparatory patches
> submitted here have changed since v8)

Since these changes in this preparatory series have remained the same,
please feel free to add:

Tested-by: K Prateek Nayak <[email protected]>

> ---------
> * Change to force mutex lock handoff when we have a blocked donor
> (preserves optimistic spinning elsewhere, but still prioritizes
> donor when present on unlock)
>
> * Do return migration whenever we’re not on the wake_cpu (should
> address placement concerns brought up earlier by Xuewen Yan)
>
> * Closed hole where we might mark a task as BO_RUNNABLE without
> doing return migration
>
> * Much improved handling of balance callbacks when we need to
> pick_again
>
> * Fixes for cases where we put_prev_task() but left a dangling
> pointer to rq_selected() when deactivating a task (as it could
> then be migrated away while we still have a reference to it),
> by selecting idle before deactivating next.
>
> * Fixes for dangling references to rq->curr (which had been
> put_prev_task’ed) when we drop rq lock for proxy_migration
>
> * Fixes for ttwu / find_proxy_task() races if the lock owner was
> being return migrated, and ttwu hadn’t yet set_task_cpu() and
> activated it, which allowed that task to be scheduled on two
> cpus at the same time.
>
> * Fix for live-lock between activate_blocked_tasks() and
> proxy_enqueue_on_owner() if activated owner went right back to
> sleep (which also simplifies the locking in
> activate_blocked_tasks())
>
> * Cleanups to avoid locked BO_WAKING->BO_RUNNABLE transition in
> try_to_wake_up() if proxy execution isn't enabled
>
> * Fix for psi_dequeue, as proxy changes assumptions around
> voluntary sleeps.
>
> * Numerous typos, comment improvements, and other fixups
> suggested by Metin
>
> * And more!
>
>
> Performance:
> ---------
> K Prateek Nayak provided some feedback on the v8 series here[2].
> Given the potential extra overhead of doing rq migrations/return
> migrations/etc for the proxy case, it’s not completely surprising
> a few of K Prateek’s test cases saw ~3-5% regressions, but I’m
> hoping to look into this soon to see if we can reduce those
> further. The donor mutex handoff in this revision may help some.
>
>
> Issues still to address:
> ---------
> * The chain migration functionality needs further iterations and
> better validation to ensure it truly maintains the RT/DL load
> balancing invariants.
>
> * CFS load balancing. There was concern that blocked tasks may
> carry forward load (PELT) to the lock owner's CPU, so the CPU
> may look like it is overloaded. Needs investigation.
>
> * The sleeping owner handling (where we deactivate waiting tasks
> and enqueue them onto a list, then reactivate them when the
> owner wakes up) doesn’t feel great. This is in part because
> when we want to activate tasks, we’re already holding a
> task.pi_lock and a rq_lock, just not the locks for the task
> we’re activating, nor the rq we’re enqueuing it onto. So there
> has to be a bit of lock juggling to drop and acquire the right
> locks (in the right order). It feels like there’s got to be a
> better way. Also needs some rework to get rid of the recursion.
>
>
> Credit/Disclaimer:
> —--------------------
> As mentioned previously, this Proxy Execution series has a long
> history: First described in a paper[3] by Watkins, Straub,
> Niehaus, then from patches from Peter Zijlstra, extended with
> lots of work by Juri Lelli, Valentin Schneider, and Connor
> O'Brien. (and thank you to Steven Rostedt for providing
> additional details here!)
>
> So again, many thanks to those above, as all the credit for this
> series really is due to them - while the mistakes are likely
> mine.
>
> Thanks so much!
> -john
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
>
>[..snip..]
>

--
Thanks and Regards,
Prateek

2024-03-25 22:18:23

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v9 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock

On Thu, 14 Mar 2024, John Stultz wrote:

>From: Peter Zijlstra <[email protected]>
>
>In preparation to nest mutex::wait_lock under rq::lock we need to remove
>wakeups from under it.

Acked-by: Davidlohr Bueso <[email protected]>

>
>Cc: Joel Fernandes <[email protected]>
>Cc: Qais Yousef <[email protected]>
>Cc: Ingo Molnar <[email protected]>
>Cc: Peter Zijlstra <[email protected]>
>Cc: Juri Lelli <[email protected]>
>Cc: Vincent Guittot <[email protected]>
>Cc: Dietmar Eggemann <[email protected]>
>Cc: Valentin Schneider <[email protected]>
>Cc: Steven Rostedt <[email protected]>
>Cc: Ben Segall <[email protected]>
>Cc: Zimuzo Ezeozue <[email protected]>
>Cc: Youssef Esmat <[email protected]>
>Cc: Mel Gorman <[email protected]>
>Cc: Daniel Bristot de Oliveira <[email protected]>
>Cc: Will Deacon <[email protected]>
>Cc: Waiman Long <[email protected]>
>Cc: Boqun Feng <[email protected]>
>Cc: "Paul E. McKenney" <[email protected]>
>Cc: Metin Kaya <[email protected]>
>Cc: Xuewen Yan <[email protected]>
>Cc: K Prateek Nayak <[email protected]>
>Cc: Thomas Gleixner <[email protected]>
>Cc: [email protected]
>Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
>08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
>mutexes")]
>Signed-off-by: Juri Lelli <[email protected]>
>[jstultz: rebased to mainline, added extra wake_up_q & init
> to avoid hangs, similar to Connor's rework of this patch]
>Signed-off-by: John Stultz <[email protected]>
>---
>v5:
>* Reverted back to an earlier version of this patch to undo
> the change that kept the wake_q in the ctx structure, as
> that broke the rule that the wake_q must always be on the
> stack, as its not safe for concurrency.
>v6:
>* Made tweaks suggested by Waiman Long
>v7:
>* Fixups to pass wake_qs down for PREEMPT_RT logic
>---
> kernel/locking/mutex.c | 17 +++++++++++++----
> kernel/locking/rtmutex.c | 26 +++++++++++++++++---------
> kernel/locking/rwbase_rt.c | 4 +++-
> kernel/locking/rwsem.c | 4 ++--
> kernel/locking/spinlock_rt.c | 3 ++-
> kernel/locking/ww_mutex.h | 29 ++++++++++++++++++-----------
> 6 files changed, 55 insertions(+), 28 deletions(-)
>
>diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
>index cbae8c0b89ab..980ce630232c 100644
>--- a/kernel/locking/mutex.c
>+++ b/kernel/locking/mutex.c
>@@ -575,6 +575,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> struct lockdep_map *nest_lock, unsigned long ip,
> struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
> {
>+ DEFINE_WAKE_Q(wake_q);
> struct mutex_waiter waiter;
> struct ww_mutex *ww;
> int ret;
>@@ -625,7 +626,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> */
> if (__mutex_trylock(lock)) {
> if (ww_ctx)
>- __ww_mutex_check_waiters(lock, ww_ctx);
>+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
>
> goto skip_wait;
> }
>@@ -645,7 +646,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> * Add in stamp order, waking up waiters that must kill
> * themselves.
> */
>- ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
>+ ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
> if (ret)
> goto err_early_kill;
> }
>@@ -681,6 +682,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> }
>
> raw_spin_unlock(&lock->wait_lock);
>+ /* Make sure we do wakeups before calling schedule */
>+ if (!wake_q_empty(&wake_q)) {
>+ wake_up_q(&wake_q);
>+ wake_q_init(&wake_q);
>+ }
> schedule_preempt_disabled();
>
> first = __mutex_waiter_is_first(lock, &waiter);
>@@ -714,7 +720,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> */
> if (!ww_ctx->is_wait_die &&
> !__mutex_waiter_is_first(lock, &waiter))
>- __ww_mutex_check_waiters(lock, ww_ctx);
>+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
> }
>
> __mutex_remove_waiter(lock, &waiter);
>@@ -730,6 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> ww_mutex_lock_acquired(ww, ww_ctx);
>
> raw_spin_unlock(&lock->wait_lock);
>+ wake_up_q(&wake_q);
> preempt_enable();
> return 0;
>
>@@ -741,6 +748,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> raw_spin_unlock(&lock->wait_lock);
> debug_mutex_free_waiter(&waiter);
> mutex_release(&lock->dep_map, ip);
>+ wake_up_q(&wake_q);
> preempt_enable();
> return ret;
> }
>@@ -934,6 +942,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> }
> }
>
>+ preempt_disable();
> raw_spin_lock(&lock->wait_lock);
> debug_mutex_unlock(lock);
> if (!list_empty(&lock->wait_list)) {
>@@ -952,8 +961,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> __mutex_handoff(lock, next);
>
> raw_spin_unlock(&lock->wait_lock);
>-
> wake_up_q(&wake_q);
>+ preempt_enable();
> }
>
> #ifndef CONFIG_DEBUG_LOCK_ALLOC
>diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
>index 4a10e8c16fd2..eaac8b196a69 100644
>--- a/kernel/locking/rtmutex.c
>+++ b/kernel/locking/rtmutex.c
>@@ -34,13 +34,15 @@
>
> static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter,
> struct rt_mutex *lock,
>- struct ww_acquire_ctx *ww_ctx)
>+ struct ww_acquire_ctx *ww_ctx,
>+ struct wake_q_head *wake_q)
> {
> return 0;
> }
>
> static inline void __ww_mutex_check_waiters(struct rt_mutex *lock,
>- struct ww_acquire_ctx *ww_ctx)
>+ struct ww_acquire_ctx *ww_ctx,
>+ struct wake_q_head *wake_q)
> {
> }
>
>@@ -1206,6 +1208,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
> struct rt_mutex_waiter *top_waiter = waiter;
> struct rt_mutex_base *next_lock;
> int chain_walk = 0, res;
>+ DEFINE_WAKE_Q(wake_q);
>
> lockdep_assert_held(&lock->wait_lock);
>
>@@ -1244,7 +1247,8 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
>
> /* Check whether the waiter should back out immediately */
> rtm = container_of(lock, struct rt_mutex, rtmutex);
>- res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
>+ res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, &wake_q);
>+ wake_up_q(&wake_q);
> if (res) {
> raw_spin_lock(&task->pi_lock);
> rt_mutex_dequeue(lock, waiter);
>@@ -1677,7 +1681,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
> struct ww_acquire_ctx *ww_ctx,
> unsigned int state,
> enum rtmutex_chainwalk chwalk,
>- struct rt_mutex_waiter *waiter)
>+ struct rt_mutex_waiter *waiter,
>+ struct wake_q_head *wake_q)
> {
> struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
> struct ww_mutex *ww = ww_container_of(rtm);
>@@ -1688,7 +1693,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
> /* Try to acquire the lock again: */
> if (try_to_take_rt_mutex(lock, current, NULL)) {
> if (build_ww_mutex() && ww_ctx) {
>- __ww_mutex_check_waiters(rtm, ww_ctx);
>+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
> ww_mutex_lock_acquired(ww, ww_ctx);
> }
> return 0;
>@@ -1706,7 +1711,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
> /* acquired the lock */
> if (build_ww_mutex() && ww_ctx) {
> if (!ww_ctx->is_wait_die)
>- __ww_mutex_check_waiters(rtm, ww_ctx);
>+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
> ww_mutex_lock_acquired(ww, ww_ctx);
> }
> } else {
>@@ -1728,7 +1733,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
>
> static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
> struct ww_acquire_ctx *ww_ctx,
>- unsigned int state)
>+ unsigned int state,
>+ struct wake_q_head *wake_q)
> {
> struct rt_mutex_waiter waiter;
> int ret;
>@@ -1737,7 +1743,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
> waiter.ww_ctx = ww_ctx;
>
> ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
>- &waiter);
>+ &waiter, wake_q);
>
> debug_rt_mutex_free_waiter(&waiter);
> return ret;
>@@ -1753,6 +1759,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
> struct ww_acquire_ctx *ww_ctx,
> unsigned int state)
> {
>+ DEFINE_WAKE_Q(wake_q);
> unsigned long flags;
> int ret;
>
>@@ -1774,8 +1781,9 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
> * irqsave/restore variants.
> */
> raw_spin_lock_irqsave(&lock->wait_lock, flags);
>- ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
>+ ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
> raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>+ wake_up_q(&wake_q);
> rt_mutex_post_schedule();
>
> return ret;
>diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
>index 34a59569db6b..e9d2f38b70f3 100644
>--- a/kernel/locking/rwbase_rt.c
>+++ b/kernel/locking/rwbase_rt.c
>@@ -69,6 +69,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> unsigned int state)
> {
> struct rt_mutex_base *rtm = &rwb->rtmutex;
>+ DEFINE_WAKE_Q(wake_q);
> int ret;
>
> rwbase_pre_schedule();
>@@ -110,7 +111,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> * For rwlocks this returns 0 unconditionally, so the below
> * !ret conditionals are optimized out.
> */
>- ret = rwbase_rtmutex_slowlock_locked(rtm, state);
>+ ret = rwbase_rtmutex_slowlock_locked(rtm, state, &wake_q);
>
> /*
> * On success the rtmutex is held, so there can't be a writer
>@@ -122,6 +123,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> if (!ret)
> atomic_inc(&rwb->readers);
> raw_spin_unlock_irq(&rtm->wait_lock);
>+ wake_up_q(&wake_q);
> if (!ret)
> rwbase_rtmutex_unlock(rtm);
>
>diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>index 2340b6d90ec6..74ebb2915d63 100644
>--- a/kernel/locking/rwsem.c
>+++ b/kernel/locking/rwsem.c
>@@ -1415,8 +1415,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
> #define rwbase_rtmutex_lock_state(rtm, state) \
> __rt_mutex_lock(rtm, state)
>
>-#define rwbase_rtmutex_slowlock_locked(rtm, state) \
>- __rt_mutex_slowlock_locked(rtm, NULL, state)
>+#define rwbase_rtmutex_slowlock_locked(rtm, state, wq) \
>+ __rt_mutex_slowlock_locked(rtm, NULL, state, wq)
>
> #define rwbase_rtmutex_unlock(rtm) \
> __rt_mutex_unlock(rtm)
>diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
>index 38e292454fcc..fb1810a14c9d 100644
>--- a/kernel/locking/spinlock_rt.c
>+++ b/kernel/locking/spinlock_rt.c
>@@ -162,7 +162,8 @@ rwbase_rtmutex_lock_state(struct rt_mutex_base *rtm, unsigned int state)
> }
>
> static __always_inline int
>-rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state)
>+rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state,
>+ struct wake_q_head *wake_q)
> {
> rtlock_slowlock_locked(rtm);
> return 0;
>diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
>index 3ad2cc4823e5..7189c6631d90 100644
>--- a/kernel/locking/ww_mutex.h
>+++ b/kernel/locking/ww_mutex.h
>@@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
> */
> static bool
> __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
>- struct ww_acquire_ctx *ww_ctx)
>+ struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
> {
> if (!ww_ctx->is_wait_die)
> return false;
>@@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
> #ifndef WW_RT
> debug_mutex_wake_waiter(lock, waiter);
> #endif
>- wake_up_process(waiter->task);
>+ wake_q_add(wake_q, waiter->task);
> }
>
> return true;
>@@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
> */
> static bool __ww_mutex_wound(struct MUTEX *lock,
> struct ww_acquire_ctx *ww_ctx,
>- struct ww_acquire_ctx *hold_ctx)
>+ struct ww_acquire_ctx *hold_ctx,
>+ struct wake_q_head *wake_q)
> {
> struct task_struct *owner = __ww_mutex_owner(lock);
>
>@@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
> * wakeup pending to re-read the wounded state.
> */
> if (owner != current)
>- wake_up_process(owner);
>+ wake_q_add(wake_q, owner);
>
> return true;
> }
>@@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
> * The current task must not be on the wait list.
> */
> static void
>-__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
>+__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
>+ struct wake_q_head *wake_q)
> {
> struct MUTEX_WAITER *cur;
>
>@@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
> if (!cur->ww_ctx)
> continue;
>
>- if (__ww_mutex_die(lock, cur, ww_ctx) ||
>- __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
>+ if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
>+ __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
> break;
> }
> }
>@@ -377,6 +379,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
> static __always_inline void
> ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
> {
>+ DEFINE_WAKE_Q(wake_q);
>+
> ww_mutex_lock_acquired(lock, ctx);
>
> /*
>@@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
> * die or wound us.
> */
> lock_wait_lock(&lock->base);
>- __ww_mutex_check_waiters(&lock->base, ctx);
>+ __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
> unlock_wait_lock(&lock->base);
>+
>+ wake_up_q(&wake_q);
> }
>
> static __always_inline int
>@@ -488,7 +494,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
> static inline int
> __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
> struct MUTEX *lock,
>- struct ww_acquire_ctx *ww_ctx)
>+ struct ww_acquire_ctx *ww_ctx,
>+ struct wake_q_head *wake_q)
> {
> struct MUTEX_WAITER *cur, *pos = NULL;
> bool is_wait_die;
>@@ -532,7 +539,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
> pos = cur;
>
> /* Wait-Die: ensure younger waiters die. */
>- __ww_mutex_die(lock, cur, ww_ctx);
>+ __ww_mutex_die(lock, cur, ww_ctx, wake_q);
> }
>
> __ww_waiter_add(lock, waiter, pos);
>@@ -550,7 +557,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
> * such that either we or the fastpath will wound @ww->ctx.
> */
> smp_mb();
>- __ww_mutex_wound(lock, ww_ctx, ww->ctx);
>+ __ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
> }
>
> return 0;
>--
>2.44.0.291.gc1ea87d7ee-goog
>
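For context, the ww_mutex hunks above plumb a wake_q through the
wound/die paths so that wake_up_process() calls made under the wait
lock become wake_q_add() calls, with the actual wakeups issued by
wake_up_q() only after the lock is dropped. Below is a minimal,
hypothetical sketch of that deferred-wakeup pattern; the example_lock
and example_waiter types and the function name are illustrative only,
not part of the patch.

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/sched/wake_q.h>
#include <linux/spinlock.h>

struct example_waiter {
	struct list_head	entry;
	struct task_struct	*task;
};

struct example_lock {
	raw_spinlock_t		wait_lock;
	struct list_head	wait_list;
};

static void example_wake_all_waiters(struct example_lock *lock)
{
	DEFINE_WAKE_Q(wake_q);
	struct example_waiter *w;

	raw_spin_lock(&lock->wait_lock);
	list_for_each_entry(w, &lock->wait_list, entry)
		wake_q_add(&wake_q, w->task);	/* defer, don't wake under the lock */
	raw_spin_unlock(&lock->wait_lock);

	/* Wakeups happen here, after the wait lock has been released. */
	wake_up_q(&wake_q);
}

This mirrors the shape of the changes above: wake_q_add() replaces
wake_up_process() while the wait lock is held, and a single
wake_up_q() call runs after unlock_wait_lock().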

2024-04-01 21:28:58

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v9 0/7] Preparatory changes for Proxy Execution v9

On Sun, Mar 24, 2024 at 11:44 PM 'K Prateek Nayak' via kernel-team
<[email protected]> wrote:
> On 3/15/2024 10:09 AM, John Stultz wrote:
> > As mentioned last time[1], after previous submissions of the
> > Proxy Execution series, I got feedback that the patch series was
> > getting a bit unwieldy to review, and Qais suggested I break out
> > just the cleanups/preparatory components of the patch series and
> > submit them on their own in the hope we can start to merge the
> > less complex bits and discussion can focus on the more
> > complicated portions afterwards. This so far has not been very
> > successful, with the submission & RESEND of the v8 preparatory
> > changes not getting much in the way of review.
> >
> > Nonetheless, for v9 of this series, I’m again only submitting
> > those early cleanup/preparatory changes here (which have not
> > changed since the v8 submissions, but to avoid confusion with the
> > git branch names, I’m labeling it as v9). In the meantime, I’ve
> > continued to put a lot of effort into the full series, mostly
> > focused on polishing the series for correctness, and fixing some
> > hard to trip races.
> >
> > If you are interested, the full v9 series, it can be found here:
> > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v9-6.8
> > https://github.com/johnstultz-work/linux-dev.git proxy-exec-v9-6.8
>
> Tested the v9 of the series.
>
> tl;dr
>
> o I still see a small regression for hackbench. I'll get some perf
> profiles for the same and leave them in this thread soon (I do
> not have them at the moment unfortunately)
>
> o There is a regression for some combinations in schbench. I'll
> have to recheck if I can consistently reproduce this or not and
> look at the perf profile to see if something is sticking out.

The extra testing of the whole series is much appreciated!

I've not had much time to dig further into performance tuning since
v8 (v9 mostly focused on fixes), so I didn't expect much change.
As I do more detailed analysis of behavior via tracing for
correctness, I'm hoping there will be some findings there that help
further improve things, but I'm not sure all of the overhead of
handling proxy rq-migrations and re-selecting the next task will be
avoidable. So I expect some potential throughput impact, but
hopefully the latency improvements will be worth it.

> > New in v9:
> > (In the git tree. Again, none of the preparatory patches
> > submitted here have changed since v8)
>
> Since these changes in this preparatory series have remained the same,
> please feel free to add:
>
> Tested-by: K Prateek Nayak <[email protected]>

Just added to my WIP tree. I'll likely be sending a RESEND of the v9
prep patches in the next few days, and will add them there as well.

Thank you so much again!
-john

2024-04-01 23:34:45

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v9 7/7] sched: Split scheduler and execution contexts

On Mon, Mar 18, 2024 at 8:06 AM Metin Kaya <[email protected]> wrote:
> On 15/03/2024 4:39 am, John Stultz wrote:
> > From: Peter Zijlstra <[email protected]>
> >
> > Let's define the scheduling context as all the scheduler state
> > in task_struct for the task selected to run, and the execution
> > context as all state required to actually run the task.
> >
> > Currently both are intertwined in task_struct. We want to
> > logically split these such that we can use the scheduling
> > context of the task selected to be scheduled, but use the
> > execution context of a different task to actually be run.
> >
> > To this purpose, introduce the rq_selected() macro to point to the
> > task_struct selected from the runqueue by the scheduler; it will be
> > used for scheduler state, while rq->curr is preserved to indicate
> > the execution context of the task that will actually be run.
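For context, here is a minimal, hypothetical sketch of the split
described in the quoted paragraph above; the struct, field, and helper
names are assumptions for illustration, not the actual patch.

#include <linux/sched.h>

struct rq_sketch {
	struct task_struct	*curr;		/* execution context: task the CPU runs */
	struct task_struct	*curr_selected;	/* scheduling context: task the scheduler picked */
};

#define rq_selected(rq)		((rq)->curr_selected)

static inline void set_contexts_sketch(struct rq_sketch *rq,
				       struct task_struct *selected,
				       struct task_struct *exec_task)
{
	/* Scheduler state and accounting are tracked against 'selected'... */
	rq->curr_selected = selected;
	/* ...while the CPU actually executes 'exec_task' on its behalf. */
	rq->curr = exec_task;
}

In the common case both would point at the same task; under proxy
execution they can diverge, with a different task being run using the
scheduling context of the task that was selected.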
> >
> > Cc: Joel Fernandes <[email protected]>
> > Cc: Qais Yousef <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Ben Segall <[email protected]>
> > Cc: Zimuzo Ezeozue <[email protected]>
> > Cc: Youssef Esmat <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > Cc: Waiman Long <[email protected]>
> > Cc: Boqun Feng <[email protected]>
> > Cc: "Paul E. McKenney" <[email protected]>
> > Cc: Xuewen Yan <[email protected]>
> > Cc: K Prateek Nayak <[email protected]>
> > Cc: Metin Kaya <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > Signed-off-by: Juri Lelli <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>
> Duplicate S-o-b for Peter.

Yes, though I believe this accurately tracks the path of development.

The patch, originally by Peter, was sent by Juri:
https://lore.kernel.org/all/[email protected]/

Then Peter put it in a branch on his git tree (adding another SoB
and the Link tag), from which Conor picked it up:
https://lore.kernel.org/lkml/[email protected]/

Followed by me.

That said, I can no longer find the branch in Peter's tree, so it's
possible I have this history wrong, but this is what I recall.

So I'm hesitant to change that part of the attribution, unless Peter
requests otherwise.

thanks
-john