2024-02-24 00:12:11

by John Stultz

[permalink] [raw]
Subject: [RESEND][PATCH v8 0/7] Preparatory changes for Proxy Execution v8

After sending out v7 of Proxy Execution, I got feedback that the
patch series was getting a bit unwieldy to review, and Qais
suggested I break out just the cleanups/preparatory components
of the patch series and submit them on their own in the hope we
can start to merge the less complex bits and discussion can focus
on the more complicated portions afterwards.

So for the v8 of this series, I only submitted those earlier
cleanup/preparatory changes:
https://lore.kernel.org/lkml/[email protected]/

After sending this out a few weeks back, I’ve not heard much, so
I wanted to resend this again.

(I did correct one detail here, which was that I had accidentally
lost the author credit to one of the patches, and I’ve fixed that
in this submission).

As before, If you are interested, the full v8 series, it can be
found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v8-6.8-rc3
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v8-6.8-rc3

However, I’ve been focusing pretty intensely on the series to
shake out some issues with the more complicated later patches in
the series (not in what I’m submitting here), and have resolved
a number of problems I uncovered in doing wider testing (along
with lots of review feedback from Metin), so v9 and all of its
improvements will hopefully be ready to send out soon.

If you want a preview, my current WIP tree (careful, as I rebase
it frequently) is here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-WIP
https://github.com/johnstultz-work/linux-dev.git proxy-exec-WIP

Review and feedback would be greatly appreciated!

Thanks so much!
-john

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]


Connor O'Brien (2):
sched: Add do_push_task helper
sched: Consolidate pick_*_task to task_is_pushable helper

John Stultz (1):
sched: Split out __schedule() deactivate task logic into a helper

Juri Lelli (2):
locking/mutex: Make mutex::wait_lock irq safe
locking/mutex: Expose __mutex_owner()

Peter Zijlstra (2):
locking/mutex: Remove wakeups from under mutex::wait_lock
sched: Split scheduler and execution contexts

kernel/locking/mutex.c | 60 +++++++----------
kernel/locking/mutex.h | 25 +++++++
kernel/locking/rtmutex.c | 26 +++++---
kernel/locking/rwbase_rt.c | 4 +-
kernel/locking/rwsem.c | 4 +-
kernel/locking/spinlock_rt.c | 3 +-
kernel/locking/ww_mutex.h | 49 ++++++++------
kernel/sched/core.c | 122 +++++++++++++++++++++--------------
kernel/sched/deadline.c | 53 ++++++---------
kernel/sched/fair.c | 18 +++---
kernel/sched/rt.c | 59 +++++++----------
kernel/sched/sched.h | 44 ++++++++++++-
12 files changed, 268 insertions(+), 199 deletions(-)

--
2.44.0.rc0.258.g7320e95886-goog



2024-02-24 00:12:23

by John Stultz

[permalink] [raw]
Subject: [RESEND][PATCH v8 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock

From: Peter Zijlstra <[email protected]>

In preparation to nest mutex::wait_lock under rq::lock we need to remove
wakeups from under it.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
mutexes")]
Signed-off-by: Juri Lelli <[email protected]>
[jstultz: rebased to mainline, added extra wake_up_q & init
to avoid hangs, similar to Connor's rework of this patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Reverted back to an earlier version of this patch to undo
the change that kept the wake_q in the ctx structure, as
that broke the rule that the wake_q must always be on the
stack, as its not safe for concurrency.
v6:
* Made tweaks suggested by Waiman Long
v7:
* Fixups to pass wake_qs down for PREEMPT_RT logic
---
kernel/locking/mutex.c | 17 +++++++++++++----
kernel/locking/rtmutex.c | 26 +++++++++++++++++---------
kernel/locking/rwbase_rt.c | 4 +++-
kernel/locking/rwsem.c | 4 ++--
kernel/locking/spinlock_rt.c | 3 ++-
kernel/locking/ww_mutex.h | 29 ++++++++++++++++++-----------
6 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index cbae8c0b89ab..980ce630232c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -575,6 +575,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
struct lockdep_map *nest_lock, unsigned long ip,
struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
+ DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
int ret;
@@ -625,7 +626,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (__mutex_trylock(lock)) {
if (ww_ctx)
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);

goto skip_wait;
}
@@ -645,7 +646,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* Add in stamp order, waking up waiters that must kill
* themselves.
*/
- ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
+ ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
if (ret)
goto err_early_kill;
}
@@ -681,6 +682,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}

raw_spin_unlock(&lock->wait_lock);
+ /* Make sure we do wakeups before calling schedule */
+ if (!wake_q_empty(&wake_q)) {
+ wake_up_q(&wake_q);
+ wake_q_init(&wake_q);
+ }
schedule_preempt_disabled();

first = __mutex_waiter_is_first(lock, &waiter);
@@ -714,7 +720,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (!ww_ctx->is_wait_die &&
!__mutex_waiter_is_first(lock, &waiter))
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
}

__mutex_remove_waiter(lock, &waiter);
@@ -730,6 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
ww_mutex_lock_acquired(ww, ww_ctx);

raw_spin_unlock(&lock->wait_lock);
+ wake_up_q(&wake_q);
preempt_enable();
return 0;

@@ -741,6 +748,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
raw_spin_unlock(&lock->wait_lock);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
+ wake_up_q(&wake_q);
preempt_enable();
return ret;
}
@@ -934,6 +942,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}
}

+ preempt_disable();
raw_spin_lock(&lock->wait_lock);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
@@ -952,8 +961,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
__mutex_handoff(lock, next);

raw_spin_unlock(&lock->wait_lock);
-
wake_up_q(&wake_q);
+ preempt_enable();
}

#ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 4a10e8c16fd2..eaac8b196a69 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -34,13 +34,15 @@

static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter,
struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
return 0;
}

static inline void __ww_mutex_check_waiters(struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
}

@@ -1206,6 +1208,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
struct rt_mutex_waiter *top_waiter = waiter;
struct rt_mutex_base *next_lock;
int chain_walk = 0, res;
+ DEFINE_WAKE_Q(wake_q);

lockdep_assert_held(&lock->wait_lock);

@@ -1244,7 +1247,8 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,

/* Check whether the waiter should back out immediately */
rtm = container_of(lock, struct rt_mutex, rtmutex);
- res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
+ res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, &wake_q);
+ wake_up_q(&wake_q);
if (res) {
raw_spin_lock(&task->pi_lock);
rt_mutex_dequeue(lock, waiter);
@@ -1677,7 +1681,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state,
enum rtmutex_chainwalk chwalk,
- struct rt_mutex_waiter *waiter)
+ struct rt_mutex_waiter *waiter,
+ struct wake_q_head *wake_q)
{
struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
struct ww_mutex *ww = ww_container_of(rtm);
@@ -1688,7 +1693,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* Try to acquire the lock again: */
if (try_to_take_rt_mutex(lock, current, NULL)) {
if (build_ww_mutex() && ww_ctx) {
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
return 0;
@@ -1706,7 +1711,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* acquired the lock */
if (build_ww_mutex() && ww_ctx) {
if (!ww_ctx->is_wait_die)
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
} else {
@@ -1728,7 +1733,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,

static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
- unsigned int state)
+ unsigned int state,
+ struct wake_q_head *wake_q)
{
struct rt_mutex_waiter waiter;
int ret;
@@ -1737,7 +1743,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
waiter.ww_ctx = ww_ctx;

ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
- &waiter);
+ &waiter, wake_q);

debug_rt_mutex_free_waiter(&waiter);
return ret;
@@ -1753,6 +1759,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state)
{
+ DEFINE_WAKE_Q(wake_q);
unsigned long flags;
int ret;

@@ -1774,8 +1781,9 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
* irqsave/restore variants.
*/
raw_spin_lock_irqsave(&lock->wait_lock, flags);
- ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
+ ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+ wake_up_q(&wake_q);
rt_mutex_post_schedule();

return ret;
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 34a59569db6b..e9d2f38b70f3 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -69,6 +69,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
unsigned int state)
{
struct rt_mutex_base *rtm = &rwb->rtmutex;
+ DEFINE_WAKE_Q(wake_q);
int ret;

rwbase_pre_schedule();
@@ -110,7 +111,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
* For rwlocks this returns 0 unconditionally, so the below
* !ret conditionals are optimized out.
*/
- ret = rwbase_rtmutex_slowlock_locked(rtm, state);
+ ret = rwbase_rtmutex_slowlock_locked(rtm, state, &wake_q);

/*
* On success the rtmutex is held, so there can't be a writer
@@ -122,6 +123,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
if (!ret)
atomic_inc(&rwb->readers);
raw_spin_unlock_irq(&rtm->wait_lock);
+ wake_up_q(&wake_q);
if (!ret)
rwbase_rtmutex_unlock(rtm);

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..74ebb2915d63 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1415,8 +1415,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
#define rwbase_rtmutex_lock_state(rtm, state) \
__rt_mutex_lock(rtm, state)

-#define rwbase_rtmutex_slowlock_locked(rtm, state) \
- __rt_mutex_slowlock_locked(rtm, NULL, state)
+#define rwbase_rtmutex_slowlock_locked(rtm, state, wq) \
+ __rt_mutex_slowlock_locked(rtm, NULL, state, wq)

#define rwbase_rtmutex_unlock(rtm) \
__rt_mutex_unlock(rtm)
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 38e292454fcc..fb1810a14c9d 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -162,7 +162,8 @@ rwbase_rtmutex_lock_state(struct rt_mutex_base *rtm, unsigned int state)
}

static __always_inline int
-rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state)
+rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state,
+ struct wake_q_head *wake_q)
{
rtlock_slowlock_locked(rtm);
return 0;
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 3ad2cc4823e5..7189c6631d90 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
*/
static bool
__ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
{
if (!ww_ctx->is_wait_die)
return false;
@@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}

return true;
@@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
*/
static bool __ww_mutex_wound(struct MUTEX *lock,
struct ww_acquire_ctx *ww_ctx,
- struct ww_acquire_ctx *hold_ctx)
+ struct ww_acquire_ctx *hold_ctx,
+ struct wake_q_head *wake_q)
{
struct task_struct *owner = __ww_mutex_owner(lock);

@@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* wakeup pending to re-read the wounded state.
*/
if (owner != current)
- wake_up_process(owner);
+ wake_q_add(wake_q, owner);

return true;
}
@@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* The current task must not be on the wait list.
*/
static void
-__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
+__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur;

@@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
if (!cur->ww_ctx)
continue;

- if (__ww_mutex_die(lock, cur, ww_ctx) ||
- __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+ if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
+ __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
break;
}
}
@@ -377,6 +379,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
+ DEFINE_WAKE_Q(wake_q);
+
ww_mutex_lock_acquired(lock, ctx);

/*
@@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* die or wound us.
*/
lock_wait_lock(&lock->base);
- __ww_mutex_check_waiters(&lock->base, ctx);
+ __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
unlock_wait_lock(&lock->base);
+
+ wake_up_q(&wake_q);
}

static __always_inline int
@@ -488,7 +494,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
static inline int
__ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
struct MUTEX *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur, *pos = NULL;
bool is_wait_die;
@@ -532,7 +539,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
pos = cur;

/* Wait-Die: ensure younger waiters die. */
- __ww_mutex_die(lock, cur, ww_ctx);
+ __ww_mutex_die(lock, cur, ww_ctx, wake_q);
}

__ww_waiter_add(lock, waiter, pos);
@@ -550,7 +557,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
* such that either we or the fastpath will wound @ww->ctx.
*/
smp_mb();
- __ww_mutex_wound(lock, ww_ctx, ww->ctx);
+ __ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
}

return 0;
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-24 00:12:30

by John Stultz

[permalink] [raw]
Subject: [RESEND][PATCH v8 2/7] locking/mutex: Make mutex::wait_lock irq safe

From: Juri Lelli <[email protected]>

mutex::wait_lock might be nested under rq->lock.

Make it irq safe then.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
Signed-off-by: Connor O'Brien <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
v3:
* Re-added this patch after it was dropped in v2 which
caused lockdep warnings to trip.
v7:
* Fix function definition for PREEMPT_RT case, as pointed out
by Metin Kaya.
* Fix incorrect flags handling in PREEMPT_RT case as found by
Metin Kaya
---
kernel/locking/mutex.c | 18 ++++++++++--------
kernel/locking/ww_mutex.h | 22 +++++++++++-----------
2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 980ce630232c..7de72c610c65 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -578,6 +578,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
+ unsigned long flags;
int ret;

if (!use_ww_ctx)
@@ -620,7 +621,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;
}

- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -681,7 +682,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err;
}

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* Make sure we do wakeups before calling schedule */
if (!wake_q_empty(&wake_q)) {
wake_up_q(&wake_q);
@@ -707,9 +708,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
trace_contention_begin(lock, LCB_F_MUTEX);
}

- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
__set_current_state(TASK_RUNNING);

@@ -735,7 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
if (ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
return 0;
@@ -745,7 +746,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
trace_contention_end(lock, ret);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
wake_up_q(&wake_q);
@@ -916,6 +917,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
struct task_struct *next = NULL;
DEFINE_WAKE_Q(wake_q);
unsigned long owner;
+ unsigned long flags;

mutex_release(&lock->dep_map, ip);

@@ -943,7 +945,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}

preempt_disable();
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
@@ -960,7 +962,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);

- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 7189c6631d90..9facc0ddfdd3 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
}

-static inline void lock_wait_lock(struct mutex *lock)
+static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, *flags);
}

-static inline void unlock_wait_lock(struct mutex *lock)
+static inline void unlock_wait_lock(struct mutex *lock, unsigned long *flags)
{
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
}

static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
@@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
return rt_mutex_has_waiters(&lock->rtmutex);
}

-static inline void lock_wait_lock(struct rt_mutex *lock)
+static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->rtmutex.wait_lock);
+ raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
}

-static inline void unlock_wait_lock(struct rt_mutex *lock)
+static inline void unlock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{
- raw_spin_unlock(&lock->rtmutex.wait_lock);
+ raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, *flags);
}

static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
@@ -380,6 +380,7 @@ static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
DEFINE_WAKE_Q(wake_q);
+ unsigned long flags;

ww_mutex_lock_acquired(lock, ctx);

@@ -408,10 +409,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* Uh oh, we raced in fastpath, check if any of the waiters need to
* die or wound us.
*/
- lock_wait_lock(&lock->base);
+ lock_wait_lock(&lock->base, &flags);
__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
- unlock_wait_lock(&lock->base);
-
+ unlock_wait_lock(&lock->base, &flags);
wake_up_q(&wake_q);
}

--
2.44.0.rc0.258.g7320e95886-goog


2024-02-24 00:12:41

by John Stultz

[permalink] [raw]
Subject: [RESEND][PATCH v8 3/7] locking/mutex: Expose __mutex_owner()

From: Juri Lelli <[email protected]>

Implementing proxy execution requires that scheduler code be able to
identify the current owner of a mutex. Expose __mutex_owner() for
this purpose (alone!).

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Juri Lelli <[email protected]>
[Removed the EXPORT_SYMBOL]
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Reworked per Peter's suggestions]
Signed-off-by: John Stultz <[email protected]>
---
v4:
* Move __mutex_owner() to kernel/locking/mutex.h instead of
adding a new globally available accessor function to keep
the exposure of this low, along with keeping it an inline
function, as suggested by PeterZ
---
kernel/locking/mutex.c | 25 -------------------------
kernel/locking/mutex.h | 25 +++++++++++++++++++++++++
2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 7de72c610c65..5741641be914 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -56,31 +56,6 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
}
EXPORT_SYMBOL(__mutex_init);

-/*
- * @owner: contains: 'struct task_struct *' to the current lock owner,
- * NULL means not owned. Since task_struct pointers are aligned at
- * at least L1_CACHE_BYTES, we have low bits to store extra state.
- *
- * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
- * Bit1 indicates unlock needs to hand the lock to the top-waiter
- * Bit2 indicates handoff has been done and we're waiting for pickup.
- */
-#define MUTEX_FLAG_WAITERS 0x01
-#define MUTEX_FLAG_HANDOFF 0x02
-#define MUTEX_FLAG_PICKUP 0x04
-
-#define MUTEX_FLAGS 0x07
-
-/*
- * Internal helper function; C doesn't allow us to hide it :/
- *
- * DO NOT USE (outside of mutex code).
- */
-static inline struct task_struct *__mutex_owner(struct mutex *lock)
-{
- return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
-}
-
static inline struct task_struct *__owner_task(unsigned long owner)
{
return (struct task_struct *)(owner & ~MUTEX_FLAGS);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 0b2a79c4013b..1c7d3d32def8 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -20,6 +20,31 @@ struct mutex_waiter {
#endif
};

+/*
+ * @owner: contains: 'struct task_struct *' to the current lock owner,
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
+ *
+ * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ * Bit1 indicates unlock needs to hand the lock to the top-waiter
+ * Bit2 indicates handoff has been done and we're waiting for pickup.
+ */
+#define MUTEX_FLAG_WAITERS 0x01
+#define MUTEX_FLAG_HANDOFF 0x02
+#define MUTEX_FLAG_PICKUP 0x04
+
+#define MUTEX_FLAGS 0x07
+
+/*
+ * Internal helper function; C doesn't allow us to hide it :/
+ *
+ * DO NOT USE (outside of mutex & scheduler code).
+ */
+static inline struct task_struct *__mutex_owner(struct mutex *lock)
+{
+ return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
+}
+
#ifdef CONFIG_DEBUG_MUTEXES
extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-24 00:13:13

by John Stultz

[permalink] [raw]
Subject: [RESEND][PATCH v8 5/7] sched: Consolidate pick_*_task to task_is_pushable helper

From: Connor O'Brien <[email protected]>

This patch consolidates rt and deadline pick_*_task functions to
a task_is_pushable() helper

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch,
renamed helper function]
Signed-off-by: John Stultz <[email protected]>
---
v7:
* Split from chain migration patch
* Renamed function
---
kernel/sched/deadline.c | 10 +---------
kernel/sched/rt.c | 11 +----------
kernel/sched/sched.h | 10 ++++++++++
3 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e68d88963e89..1b9cdb507498 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2179,14 +2179,6 @@ static void task_fork_dl(struct task_struct *p)
/* Only try algorithms three times */
#define DL_MAX_TRIES 3

-static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
-{
- if (!task_on_cpu(rq, p) &&
- cpumask_test_cpu(cpu, &p->cpus_mask))
- return 1;
- return 0;
-}
-
/*
* Return the earliest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise:
@@ -2205,7 +2197,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
if (next_node) {
p = __node_2_pdl(next_node);

- if (pick_dl_task(rq, p, cpu))
+ if (task_is_pushable(rq, p, cpu) == 1)
return p;

next_node = rb_next(next_node);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index dd072d11cc02..638e7c158ae4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1791,15 +1791,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
/* Only try algorithms three times */
#define RT_MAX_TRIES 3

-static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
-{
- if (!task_on_cpu(rq, p) &&
- cpumask_test_cpu(cpu, &p->cpus_mask))
- return 1;
-
- return 0;
-}
-
/*
* Return the highest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise
@@ -1813,7 +1804,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
return NULL;

plist_for_each_entry(p, head, pushable_tasks) {
- if (pick_rt_task(rq, p, cpu))
+ if (task_is_pushable(rq, p, cpu) == 1)
return p;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6ca83837e0f4..c83e5e0672dc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3480,6 +3480,16 @@ void do_push_task(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
set_task_cpu(task, dst_rq->cpu);
activate_task(dst_rq, task, 0);
}
+
+static inline
+int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
+{
+ if (!task_on_cpu(rq, p) &&
+ cpumask_test_cpu(cpu, &p->cpus_mask))
+ return 1;
+
+ return 0;
+}
#endif

#endif /* _KERNEL_SCHED_SCHED_H */
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-24 00:13:26

by John Stultz

[permalink] [raw]
Subject: [RESEND][PATCH v8 6/7] sched: Split out __schedule() deactivate task logic into a helper

As we're going to re-use the deactivation logic,
split it into a helper.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: John Stultz <[email protected]>
---
v6:
* Define function as static to avoid "no previous prototype"
warnings as Reported-by: kernel test robot <[email protected]>
v7:
* Rename state task_state to be more clear, as suggested by
Metin Kaya
---
kernel/sched/core.c | 72 +++++++++++++++++++++++++++------------------
1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ad4748327651..b537e5f501ea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6563,6 +6563,48 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
# define SM_MASK_PREEMPT SM_PREEMPT
#endif

+/*
+ * Helper function for __schedule()
+ *
+ * If a task does not have signals pending, deactivate it and return true
+ * Otherwise marks the task's __state as RUNNING and returns false
+ */
+static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
+ unsigned long task_state)
+{
+ if (signal_pending_state(task_state, p)) {
+ WRITE_ONCE(p->__state, TASK_RUNNING);
+ } else {
+ p->sched_contributes_to_load =
+ (task_state & TASK_UNINTERRUPTIBLE) &&
+ !(task_state & TASK_NOLOAD) &&
+ !(task_state & TASK_FROZEN);
+
+ if (p->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (p->on_rq && ...)
+ * if (prev_state) goto out;
+ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
+ * p->state = TASK_WAKING
+ *
+ * Where __schedule() and ttwu() have matching control dependencies.
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
+ deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+ if (p->in_iowait) {
+ atomic_inc(&rq->nr_iowait);
+ delayacct_blkio_start();
+ }
+ return true;
+ }
+ return false;
+}
+
/*
* __schedule() is the main scheduler function.
*
@@ -6654,35 +6696,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
- if (signal_pending_state(prev_state, prev)) {
- WRITE_ONCE(prev->__state, TASK_RUNNING);
- } else {
- prev->sched_contributes_to_load =
- (prev_state & TASK_UNINTERRUPTIBLE) &&
- !(prev_state & TASK_NOLOAD) &&
- !(prev_state & TASK_FROZEN);
-
- if (prev->sched_contributes_to_load)
- rq->nr_uninterruptible++;
-
- /*
- * __schedule() ttwu()
- * prev_state = prev->state; if (p->on_rq && ...)
- * if (prev_state) goto out;
- * p->on_rq = 0; smp_acquire__after_ctrl_dep();
- * p->state = TASK_WAKING
- *
- * Where __schedule() and ttwu() have matching control dependencies.
- *
- * After this, schedule() must not care about p->state any more.
- */
- deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
- if (prev->in_iowait) {
- atomic_inc(&rq->nr_iowait);
- delayacct_blkio_start();
- }
- }
+ try_to_deactivate_task(rq, prev, prev_state);
switch_count = &prev->nvcsw;
}

--
2.44.0.rc0.258.g7320e95886-goog


2024-02-24 00:13:54

by John Stultz

[permalink] [raw]
Subject: [RESEND][PATCH v8 4/7] sched: Add do_push_task helper

From: Connor O'Brien <[email protected]>

Switch logic that deactivates, sets the task cpu,
and reactivates a task on a different rq to use a
helper that will be later extended to push entire
blocked task chains.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch]
Signed-off-by: John Stultz <[email protected]>
---
v8:
* Renamed from push_task_chain to do_push_task so it makes
more sense without proxy-execution
---
kernel/sched/core.c | 4 +---
kernel/sched/deadline.c | 8 ++------
kernel/sched/rt.c | 8 ++------
kernel/sched/sched.h | 9 +++++++++
4 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..ad4748327651 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2714,9 +2714,7 @@ int push_cpu_stop(void *arg)

// XXX validate p is still the highest prio task
if (task_rq(p) == rq) {
- deactivate_task(rq, p, 0);
- set_task_cpu(p, lowest_rq->cpu);
- activate_task(lowest_rq, p, 0);
+ do_push_task(rq, lowest_rq, p);
resched_curr(lowest_rq);
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a04a436af8cc..e68d88963e89 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2443,9 +2443,7 @@ static int push_dl_task(struct rq *rq)
goto retry;
}

- deactivate_task(rq, next_task, 0);
- set_task_cpu(next_task, later_rq->cpu);
- activate_task(later_rq, next_task, 0);
+ do_push_task(rq, later_rq, next_task);
ret = 1;

resched_curr(later_rq);
@@ -2531,9 +2529,7 @@ static void pull_dl_task(struct rq *this_rq)
if (is_migration_disabled(p)) {
push_task = get_push_task(src_rq);
} else {
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ do_push_task(src_rq, this_rq, p);
dmin = p->dl.deadline;
resched = true;
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..dd072d11cc02 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2106,9 +2106,7 @@ static int push_rt_task(struct rq *rq, bool pull)
goto retry;
}

- deactivate_task(rq, next_task, 0);
- set_task_cpu(next_task, lowest_rq->cpu);
- activate_task(lowest_rq, next_task, 0);
+ do_push_task(rq, lowest_rq, next_task);
resched_curr(lowest_rq);
ret = 1;

@@ -2379,9 +2377,7 @@ static void pull_rt_task(struct rq *this_rq)
if (is_migration_disabled(p)) {
push_task = get_push_task(src_rq);
} else {
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ do_push_task(src_rq, this_rq, p);
resched = true;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe047bd5d..6ca83837e0f4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3472,5 +3472,14 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }

extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+#ifdef CONFIG_SMP
+static inline
+void do_push_task(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
+{
+ deactivate_task(rq, task, 0);
+ set_task_cpu(task, dst_rq->cpu);
+ activate_task(dst_rq, task, 0);
+}
+#endif

#endif /* _KERNEL_SCHED_SCHED_H */
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-24 00:14:22

by John Stultz

[permalink] [raw]
Subject: [RESEND][PATCH v8 7/7] sched: Split scheduler and execution contexts

From: Peter Zijlstra <[email protected]>

Let's define the scheduling context as all the scheduler state
in task_struct for the task selected to run, and the execution
context as all state required to actually run the task.

Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the task selected to be scheduled, but use the
execution context of a different task to actually be run.

To this purpose, introduce rq_selected() macro to point to the
task_struct selected from the runqueue by the scheduler, and
will be used for scheduler state, and preserve rq->curr to
indicate the execution context of the task that will actually be
run.

Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[add additional comments and update more sched_class code to use
rq::proxy]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Rebased and resolved minor collisions, reworked to use
accessors, tweaked update_curr_common to use rq_proxy fixing rt
scheduling issues]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Reworked to use accessors
* Fixed update_curr_common to use proxy instead of curr
v3:
* Tweaked wrapper names
* Swapped proxy for selected for clarity
v4:
* Minor variable name tweaks for readability
* Use a macro instead of a inline function and drop
other helper functions as suggested by Peter.
* Remove verbose comments/questions to avoid review
distractions, as suggested by Dietmar
v5:
* Add CONFIG_PROXY_EXEC option to this patch so the
new logic can be tested with this change
* Minor fix to grab rq_selected when holding the rq lock
v7:
* Minor spelling fix and unused argument fixes suggested by
Metin Kaya
* Switch to curr_selected for consistency, and minor rewording
of commit message for clarity
* Rename variables selected instead of curr when we're using
rq_selected()
* Reduce macros in CONFIG_SCHED_PROXY_EXEC ifdef sections,
as suggested by Metin Kaya
v8:
* Use rq->curr, not rq_selected with task_tick, as suggested by
Valentin
* Minor rework to reorder this with CONFIG_SCHED_PROXY_EXEC patch
---
kernel/sched/core.c | 46 ++++++++++++++++++++++++++---------------
kernel/sched/deadline.c | 35 ++++++++++++++++---------------
kernel/sched/fair.c | 18 ++++++++--------
kernel/sched/rt.c | 40 +++++++++++++++++------------------
kernel/sched/sched.h | 25 ++++++++++++++++++++--
5 files changed, 99 insertions(+), 65 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b537e5f501ea..c17f91d6ceba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -794,7 +794,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)

rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+ rq_selected(rq)->sched_class->task_tick(rq, rq->curr, 1);
rq_unlock(rq, &rf);

return HRTIMER_NORESTART;
@@ -2238,16 +2238,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,

void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->sched_class == rq->curr->sched_class)
- rq->curr->sched_class->wakeup_preempt(rq, p, flags);
- else if (sched_class_above(p->sched_class, rq->curr->sched_class))
+ struct task_struct *selected = rq_selected(rq);
+
+ if (p->sched_class == selected->sched_class)
+ selected->sched_class->wakeup_preempt(rq, p, flags);
+ else if (sched_class_above(p->sched_class, selected->sched_class))
resched_curr(rq);

/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
- if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+ if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq);
}

@@ -2774,7 +2776,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
lockdep_assert_held(&p->pi_lock);

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);

if (queued) {
/*
@@ -5587,7 +5589,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
* project cycles that may never be accounted to this
* thread, breaking clock_gettime().
*/
- if (task_current(rq, p) && task_on_rq_queued(p)) {
+ if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
prefetch_curr_exec_start(p);
update_rq_clock(rq);
p->sched_class->update_curr(rq);
@@ -5655,7 +5657,8 @@ void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
- struct task_struct *curr = rq->curr;
+ /* accounting goes to the selected task */
+ struct task_struct *selected;
struct rq_flags rf;
unsigned long thermal_pressure;
u64 resched_latency;
@@ -5666,16 +5669,17 @@ void scheduler_tick(void)
sched_clock_tick();

rq_lock(rq, &rf);
+ selected = rq_selected(rq);

update_rq_clock(rq);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
- curr->sched_class->task_tick(rq, curr, 0);
+ selected->sched_class->task_tick(rq, selected, 0);
if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
sched_core_tick(rq);
- task_tick_mm_cid(rq, curr);
+ task_tick_mm_cid(rq, selected);

rq_unlock(rq, &rf);

@@ -5684,8 +5688,8 @@ void scheduler_tick(void)

perf_event_task_tick();

- if (curr->flags & PF_WQ_WORKER)
- wq_worker_tick(curr);
+ if (selected->flags & PF_WQ_WORKER)
+ wq_worker_tick(selected);

#ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
@@ -5750,6 +5754,12 @@ static void sched_tick_remote(struct work_struct *work)
struct task_struct *curr = rq->curr;

if (cpu_online(cpu)) {
+ /*
+ * Since this is a remote tick for full dynticks mode,
+ * we are always sure that there is no proxy (only a
+ * single task is running).
+ */
+ SCHED_WARN_ON(rq->curr != rq_selected(rq));
update_rq_clock(rq);

if (!is_idle_task(curr)) {
@@ -6701,6 +6711,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}

next = pick_next_task(rq, prev, &rf);
+ rq_set_selected(rq, next);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
@@ -7201,7 +7212,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)

prev_class = p->sched_class;
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flag);
if (running)
@@ -7291,7 +7302,7 @@ void set_user_nice(struct task_struct *p, long nice)
}

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
if (running)
@@ -7870,7 +7881,7 @@ static int __sched_setscheduler(struct task_struct *p,
}

queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flags);
if (running)
@@ -9297,6 +9308,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
rcu_read_unlock();

rq->idle = idle;
+ rq_set_selected(rq, idle);
rcu_assign_pointer(rq->curr, idle);
idle->on_rq = TASK_ON_RQ_QUEUED;
#ifdef CONFIG_SMP
@@ -9386,7 +9398,7 @@ void sched_setnuma(struct task_struct *p, int nid)

rq = task_rq_lock(p, &rf);
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);

if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -10491,7 +10503,7 @@ void sched_move_task(struct task_struct *tsk)

update_rq_clock(rq);

- running = task_current(rq, tsk);
+ running = task_current_selected(rq, tsk);
queued = task_on_rq_queued(tsk);

if (queued)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1b9cdb507498..c30b592d6e9d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1218,7 +1218,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
#endif

enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
wakeup_preempt_dl(rq, p, 0);
else
resched_curr(rq);
@@ -1442,7 +1442,7 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
*/
static void update_curr_dl(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_dl_entity *dl_se = &curr->dl;
s64 delta_exec;

@@ -1899,7 +1899,7 @@ static int find_later_rq(struct task_struct *task);
static int
select_task_rq_dl(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
bool select_rq;
struct rq *rq;

@@ -1910,6 +1910,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)

rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));

/*
* If we are dealing with a -deadline task, we must
@@ -1920,9 +1921,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- select_rq = unlikely(dl_task(curr)) &&
+ select_rq = unlikely(dl_task(selected)) &&
(curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ !dl_entity_preempt(&p->dl, &selected->dl)) &&
p->nr_cpus_allowed > 1;

/*
@@ -1985,7 +1986,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
+ !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
return;

/*
@@ -2024,7 +2025,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
int flags)
{
- if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
+ if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
resched_curr(rq);
return;
}
@@ -2034,7 +2035,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
* In the unlikely case current and p have the same deadline
* let us try to decide what's the best thing to do...
*/
- if ((p->dl.deadline == rq->curr->dl.deadline) &&
+ if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
!test_tsk_need_resched(rq->curr))
check_preempt_equal_dl(rq, p);
#endif /* CONFIG_SMP */
@@ -2066,7 +2067,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (!first)
return;

- if (rq->curr->sched_class != &dl_sched_class)
+ if (rq_selected(rq)->sched_class != &dl_sched_class)
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);

deadline_queue_push_tasks(rq);
@@ -2391,8 +2392,8 @@ static int push_dl_task(struct rq *rq)
* can move away, it makes sense to just reschedule
* without going further in pushing next_task.
*/
- if (dl_task(rq->curr) &&
- dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+ if (dl_task(rq_selected(rq)) &&
+ dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
rq->curr->nr_cpus_allowed > 1) {
resched_curr(rq);
return 0;
@@ -2515,7 +2516,7 @@ static void pull_dl_task(struct rq *this_rq)
* deadline than the current task of its runqueue.
*/
if (dl_time_before(p->dl.deadline,
- src_rq->curr->dl.deadline))
+ rq_selected(src_rq)->dl.deadline))
goto skip;

if (is_migration_disabled(p)) {
@@ -2554,9 +2555,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
if (!task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- dl_task(rq->curr) &&
+ dl_task(rq_selected(rq)) &&
(rq->curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
+ !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
push_dl_tasks(rq);
}
}
@@ -2731,12 +2732,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
return;
}

- if (rq->curr != p) {
+ if (rq_selected(rq) != p) {
#ifdef CONFIG_SMP
if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
deadline_queue_push_tasks(rq);
#endif
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
wakeup_preempt_dl(rq, p, 0);
else
resched_curr(rq);
@@ -2765,7 +2766,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
if (!rq->dl.overloaded)
deadline_queue_pull_task(rq);

- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
/*
* If we now have a earlier deadline task than p,
* then reschedule, provided p is still on this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..dc342e1fc420 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1140,7 +1140,7 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
*/
s64 update_curr_common(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
s64 delta_exec;

delta_exec = update_curr_se(rq, &curr->se);
@@ -1177,7 +1177,7 @@ static void update_curr(struct cfs_rq *cfs_rq)

static void update_curr_fair(struct rq *rq)
{
- update_curr(cfs_rq_of(&rq->curr->se));
+ update_curr(cfs_rq_of(&rq_selected(rq)->se));
}

static inline void
@@ -6627,7 +6627,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
s64 delta = slice - ran;

if (delta < 0) {
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
return;
}
@@ -6642,7 +6642,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
*/
static void hrtick_update(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);

if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
return;
@@ -8267,7 +8267,7 @@ static void set_next_buddy(struct sched_entity *se)
*/
static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int cse_is_idle, pse_is_idle;
@@ -8298,7 +8298,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
- if (test_tsk_need_resched(curr))
+ if (test_tsk_need_resched(rq->curr))
return;

/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -9282,7 +9282,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
* update_load_avg() can call cpufreq_update_util(). Make sure that RT,
* DL and IRQ signals have been updated before updating CFS.
*/
- curr_class = rq->curr->sched_class;
+ curr_class = rq_selected(rq)->sched_class;

thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));

@@ -12673,7 +12673,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
* our priority decreased, or if we are not currently running on
* this runqueue and our priority is higher than the current's
*/
- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
if (p->prio > oldprio)
resched_curr(rq);
} else
@@ -12776,7 +12776,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
* kick off the schedule if running, otherwise just see
* if we can still preempt the current task.
*/
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
else
wakeup_preempt(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 638e7c158ae4..48fc7a198f1a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -530,7 +530,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)

static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
{
- struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
+ struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
struct rq *rq = rq_of_rt_rq(rt_rq);
struct sched_rt_entity *rt_se;

@@ -1000,7 +1000,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
*/
static void update_curr_rt(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_rt_entity *rt_se = &curr->rt;
s64 delta_exec;

@@ -1543,7 +1543,7 @@ static int find_lowest_rq(struct task_struct *task);
static int
select_task_rq_rt(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
struct rq *rq;
bool test;

@@ -1555,6 +1555,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)

rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));

/*
* If the current task on @p's runqueue is an RT task, then
@@ -1583,8 +1584,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
* systems like big.LITTLE.
*/
test = curr &&
- unlikely(rt_task(curr)) &&
- (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+ unlikely(rt_task(selected)) &&
+ (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);

if (test || !rt_task_fits_capacity(p, cpu)) {
int target = find_lowest_rq(p);
@@ -1614,12 +1615,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)

static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
{
- /*
- * Current can't be migrated, useless to reschedule,
- * let's hope p can move out.
- */
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+ !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
return;

/*
@@ -1662,7 +1659,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
*/
static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->prio < rq->curr->prio) {
+ struct task_struct *curr = rq_selected(rq);
+
+ if (p->prio < curr->prio) {
resched_curr(rq);
return;
}
@@ -1680,7 +1679,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
* to move current somewhere else, making room for our non-migratable
* task.
*/
- if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+ if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
check_preempt_equal_prio(rq, p);
#endif
}
@@ -1705,7 +1704,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
* utilization. We only care of the case where we start to schedule a
* rt task
*/
- if (rq->curr->sched_class != &rt_sched_class)
+ if (rq_selected(rq)->sched_class != &rt_sched_class)
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);

rt_queue_push_tasks(rq);
@@ -1977,6 +1976,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)

BUG_ON(rq->cpu != task_cpu(p));
BUG_ON(task_current(rq, p));
+ BUG_ON(task_current_selected(rq, p));
BUG_ON(p->nr_cpus_allowed <= 1);

BUG_ON(!task_on_rq_queued(p));
@@ -2009,7 +2009,7 @@ static int push_rt_task(struct rq *rq, bool pull)
* higher priority than current. If that's the case
* just reschedule current.
*/
- if (unlikely(next_task->prio < rq->curr->prio)) {
+ if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
resched_curr(rq);
return 0;
}
@@ -2362,7 +2362,7 @@ static void pull_rt_task(struct rq *this_rq)
* p if it is lower in priority than the
* current task on the run queue
*/
- if (p->prio < src_rq->curr->prio)
+ if (p->prio < rq_selected(src_rq)->prio)
goto skip;

if (is_migration_disabled(p)) {
@@ -2404,9 +2404,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
bool need_to_push = !task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- (dl_task(rq->curr) || rt_task(rq->curr)) &&
+ (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
(rq->curr->nr_cpus_allowed < 2 ||
- rq->curr->prio <= p->prio);
+ rq_selected(rq)->prio <= p->prio);

if (need_to_push)
push_rt_tasks(rq);
@@ -2490,7 +2490,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
rt_queue_push_tasks(rq);
#endif /* CONFIG_SMP */
- if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+ if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
resched_curr(rq);
}
}
@@ -2505,7 +2505,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
if (!task_on_rq_queued(p))
return;

- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
#ifdef CONFIG_SMP
/*
* If our priority decreases while running, we
@@ -2531,7 +2531,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
* greater than the current running task
* then reschedule.
*/
- if (p->prio < rq->curr->prio)
+ if (p->prio < rq_selected(rq)->prio)
resched_curr(rq);
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c83e5e0672dc..808d6ee8ae33 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1030,7 +1030,7 @@ struct rq {
*/
unsigned int nr_uninterruptible;

- struct task_struct __rcu *curr;
+ struct task_struct __rcu *curr; /* Execution context */
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
@@ -1225,6 +1225,13 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)

+/* For now, rq_selected == rq->curr */
+#define rq_selected(rq) ((rq)->curr)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+ /* Do nothing */
+}
+
struct sched_group;
#ifdef CONFIG_SCHED_CORE
static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2148,11 +2155,25 @@ static inline u64 global_rt_runtime(void)
return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
}

+/*
+ * Is p the current execution context?
+ */
static inline int task_current(struct rq *rq, struct task_struct *p)
{
return rq->curr == p;
}

+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq_selected() == p.
+ */
+static inline int task_current_selected(struct rq *rq, struct task_struct *p)
+{
+ return rq_selected(rq) == p;
+}
+
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
@@ -2322,7 +2343,7 @@ struct sched_class {

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
- WARN_ON_ONCE(rq->curr != prev);
+ WARN_ON_ONCE(rq_selected(rq) != prev);
prev->sched_class->put_prev_task(rq, prev);
}

--
2.44.0.rc0.258.g7320e95886-goog


2024-02-28 04:43:35

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RESEND][PATCH v8 0/7] Preparatory changes for Proxy Execution v8

Hello John,

Happy to report that I did not see any regressions with the series
as expected. Full results below.

On 2/24/2024 5:41 AM, John Stultz wrote:
> After sending out v7 of Proxy Execution, I got feedback that the
> patch series was getting a bit unwieldy to review, and Qais
> suggested I break out just the cleanups/preparatory components
> of the patch series and submit them on their own in the hope we
> can start to merge the less complex bits and discussion can focus
> on the more complicated portions afterwards.
>
> So for the v8 of this series, I only submitted those earlier
> cleanup/preparatory changes:
> https://lore.kernel.org/lkml/[email protected]/
>
> After sending this out a few weeks back, I’ve not heard much, so
> I wanted to resend this again.
>
> (I did correct one detail here, which was that I had accidentally
> lost the author credit to one of the patches, and I’ve fixed that
> in this submission).
>
> As before, If you are interested, the full v8 series, it can be
> found here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v8-6.8-rc3
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v8-6.8-rc3
>
> However, I’ve been focusing pretty intensely on the series to
> shake out some issues with the more complicated later patches in
> the series (not in what I’m submitting here), and have resolved
> a number of problems I uncovered in doing wider testing (along
> with lots of review feedback from Metin), so v9 and all of its
> improvements will hopefully be ready to send out soon.
>
> If you want a preview, my current WIP tree (careful, as I rebase
> it frequently) is here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-WIP
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-WIP
>
> Review and feedback would be greatly appreciated!

o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode

o Kernels

tip: tip:sched/core at commit 8cec3dd9e593 ("sched/core:
Simplify code by removing duplicate #ifdefs")

proxy-setup: tip + this series

o Results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) proxy-setup[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.08) 1.01 [ -0.53]( 2.45)
2-groups 1.00 [ -0.00]( 0.89) 1.03 [ -3.32]( 1.48)
4-groups 1.00 [ -0.00]( 0.81) 1.02 [ -2.26]( 1.22)
8-groups 1.00 [ -0.00]( 0.78) 1.00 [ -0.29]( 0.97)
16-groups 1.00 [ -0.00]( 1.60) 1.00 [ -0.27]( 1.86)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy-setup[pct imp](CV)
1 1.00 [ 0.00]( 0.71) 1.00 [ 0.31]( 0.37)
2 1.00 [ 0.00]( 0.25) 0.99 [ -0.56]( 0.31)
4 1.00 [ 0.00]( 0.85) 0.98 [ -2.35]( 0.69)
8 1.00 [ 0.00]( 1.00) 0.99 [ -0.99]( 0.12)
16 1.00 [ 0.00]( 1.25) 0.99 [ -0.78]( 1.35)
32 1.00 [ 0.00]( 0.35) 1.00 [ 0.12]( 2.23)
64 1.00 [ 0.00]( 0.71) 0.99 [ -0.97]( 0.55)
128 1.00 [ 0.00]( 0.46) 0.96 [ -4.38]( 0.47)
256 1.00 [ 0.00]( 0.24) 0.99 [ -1.32]( 0.95)
512 1.00 [ 0.00]( 0.30) 0.98 [ -1.52]( 0.10)
1024 1.00 [ 0.00]( 0.40) 0.98 [ -1.59]( 0.23)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy-setup[pct imp](CV)
Copy 1.00 [ 0.00]( 9.73) 1.04 [ 4.18]( 3.12)
Scale 1.00 [ 0.00]( 5.57) 0.99 [ -1.35]( 5.74)
Add 1.00 [ 0.00]( 5.43) 0.99 [ -1.29]( 5.93)
Triad 1.00 [ 0.00]( 5.50) 0.97 [ -3.47]( 7.81)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy-setup[pct imp](CV)
Copy 1.00 [ 0.00]( 3.26) 1.01 [ 0.83]( 2.69)
Scale 1.00 [ 0.00]( 1.26) 1.00 [ -0.32]( 4.52)
Add 1.00 [ 0.00]( 1.47) 1.01 [ 0.63]( 0.96)
Triad 1.00 [ 0.00]( 1.77) 1.02 [ 1.81]( 1.00)


==================================================================
Test : netperf
Units : Normalized Througput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy-setup[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.22) 0.99 [ -0.53]( 0.26)
2-clients 1.00 [ 0.00]( 0.57) 1.00 [ -0.44]( 0.41)
4-clients 1.00 [ 0.00]( 0.43) 1.00 [ -0.48]( 0.39)
8-clients 1.00 [ 0.00]( 0.27) 1.00 [ -0.31]( 0.42)
16-clients 1.00 [ 0.00]( 0.46) 1.00 [ -0.11]( 0.42)
32-clients 1.00 [ 0.00]( 0.95) 1.00 [ -0.41]( 0.56)
64-clients 1.00 [ 0.00]( 1.79) 1.00 [ -0.15]( 1.65)
128-clients 1.00 [ 0.00]( 0.89) 1.00 [ -0.43]( 0.80)
256-clients 1.00 [ 0.00]( 3.88) 1.00 [ -0.37]( 4.74)
512-clients 1.00 [ 0.00](35.06) 1.01 [ 1.05](50.84)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) proxy-setup[pct imp](CV)
1 1.00 [ -0.00](27.28) 1.31 [-31.25]( 2.38)
2 1.00 [ -0.00]( 3.85) 1.00 [ -0.00]( 8.85)
4 1.00 [ -0.00](14.00) 1.11 [-10.53](11.18)
8 1.00 [ -0.00]( 4.68) 1.08 [ -8.33]( 9.93)
16 1.00 [ -0.00]( 4.08) 0.92 [ 8.06]( 3.70)
32 1.00 [ -0.00]( 6.68) 0.95 [ 5.10]( 2.22)
64 1.00 [ -0.00]( 1.79) 0.99 [ 1.02]( 3.18)
128 1.00 [ -0.00]( 6.30) 1.02 [ -2.48]( 7.37)
256 1.00 [ -0.00](43.39) 1.00 [ -0.00](37.06)
512 1.00 [ -0.00]( 2.26) 0.98 [ 1.88]( 6.96)

Note: schbench is known to have high run to run variance for
16-workers and below.


==================================================================
Test : Unixbench
Units : Normalized scores
Interpretation: Lower is better
Statistic : Various (Mentioned)
==================================================================
Metric Variant tip proxy-setup
Hmean unixbench-dhry2reg-1 0.00% -0.60%
Hmean unixbench-dhry2reg-512 0.00% -0.01%
Amean unixbench-syscall-1 0.00% -0.41%
Amean unixbench-syscall-512 0.00% 0.13%
Hmean unixbench-pipe-1 0.00% 1.02%
Hmean unixbench-pipe-512 0.00% 0.53%
Hmean unixbench-spawn-1 0.00% -2.68%
Hmean unixbench-spawn-512 0.00% 3.24%
Hmean unixbench-execl-1 0.00% 0.61%
Hmean unixbench-execl-512 0.00% 1.97%
--

Tested-by: K Prateek Nayak <[email protected]>

>
> Thanks so much!
> -john
>
> [..snip..]
>

--
Thanks and Regards,
Prateek

2024-02-28 04:56:42

by John Stultz

[permalink] [raw]
Subject: Re: [RESEND][PATCH v8 0/7] Preparatory changes for Proxy Execution v8

On Tue, Feb 27, 2024 at 8:43 PM 'K Prateek Nayak' via kernel-team
<[email protected]> wrote:
> Happy to report that I did not see any regressions with the series
> as expected. Full results below.
>
[snip]
> o System Details
>
> - 3rd Generation EPYC System
> - 2 x 64C/128T
> - NPS1 mode
>
> o Kernels
>
> tip: tip:sched/core at commit 8cec3dd9e593 ("sched/core:
> Simplify code by removing duplicate #ifdefs")
>
> proxy-setup: tip + this series
>

Hey! Thank you so much for taking the time to run these through the
testing! I *really* appreciate it!

Just to clarify: by "this series" did you test just the 7 preparatory
patches submitted to the list here, or did you pull the full
proxy-exec-v8-6.8-rc3 set from git?
(Either is great! I just wanted to make sure its clear which were covered)

[snip]
> Tested-by: K Prateek Nayak <[email protected]>

Thanks so much again!
-john

2024-02-28 05:13:01

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RESEND][PATCH v8 0/7] Preparatory changes for Proxy Execution v8

Hello John,

On 2/28/2024 10:21 AM, John Stultz wrote:
> On Tue, Feb 27, 2024 at 8:43 PM 'K Prateek Nayak' via kernel-team
> <[email protected]> wrote:
>> Happy to report that I did not see any regressions with the series
>> as expected. Full results below.
>>
> [snip]
>> o System Details
>>
>> - 3rd Generation EPYC System
>> - 2 x 64C/128T
>> - NPS1 mode
>>
>> o Kernels
>>
>> tip: tip:sched/core at commit 8cec3dd9e593 ("sched/core:
>> Simplify code by removing duplicate #ifdefs")
>>
>> proxy-setup: tip + this series
>>
>
> Hey! Thank you so much for taking the time to run these through the
> testing! I *really* appreciate it!
>
> Just to clarify: by "this series" did you test just the 7 preparatory
> patches submitted to the list here, or did you pull the full
> proxy-exec-v8-6.8-rc3 set from git?

Just these preparatory patches for now. On my way to queue a run for the
whole set from your tree. I'll use the "proxy-exec-v8-6.8-rc3" branch and
pick the commits past the
"[ANNOTATION] === Proxy Exec patches past this point ===" till the commit
ff90fb583a81 ("FIX: Avoid using possibly uninitialized cpu value with
activate_blocked_entities()") on top of the tip:sched/core mentioned
above since it'll allow me to reuse the baseline numbers :)

> (Either is great! I just wanted to make sure its clear which were covered)
>
> [snip]
>> Tested-by: K Prateek Nayak <[email protected]>
>
> Thanks so much again!
> -john

--
Thanks and Regards,
Prateek

2024-02-28 17:37:59

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RESEND][PATCH v8 0/7] Preparatory changes for Proxy Execution v8

Hello John,

On 2/28/2024 10:54 AM, John Stultz wrote:
> On Tue, Feb 27, 2024 at 9:12 PM K Prateek Nayak <[email protected]> wrote:
>> On 2/28/2024 10:21 AM, John Stultz wrote:
>>> Just to clarify: by "this series" did you test just the 7 preparatory
>>> patches submitted to the list here, or did you pull the full
>>> proxy-exec-v8-6.8-rc3 set from git?
>>
>> Just these preparatory patches for now. On my way to queue a run for the
>> whole set from your tree. I'll use the "proxy-exec-v8-6.8-rc3" branch and
>> pick the commits past the
>> "[ANNOTATION] === Proxy Exec patches past this point ===" till the commit
>> ff90fb583a81 ("FIX: Avoid using possibly uninitialized cpu value with
>> activate_blocked_entities()") on top of the tip:sched/core mentioned
>> above since it'll allow me to reuse the baseline numbers :)
>>
>
> Ah, thank you for the clarification!
>
> Also, I really appreciate your testing with the rest of the series as
> well. It will be good to have any potential problems identified early

I got a chance to test the whole of v8 patches on the same dual socket
3rd Generation EPYC system:

tl;dr

- There is a slight regression in hackbench but instead of the 10x
blowup seen previously, it is only around 5% with overloaded case
not regressing at all.

- A small but consistent (~2-3%) regression is seen in tbench and
netperf.

- schbench is inconclusive due to run to run variance and stream is
perf neutral with proxy execution.

I've not looked deeper into the regressions. I'll let you know if I
spot anything when digging deeper. Below are the full results:

o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode

o Kernels

tip: tip:sched/core at commit 8cec3dd9e593
("sched/core: Simplify code by removing
duplicate #ifdefs")

proxy-exec-full: tip + proxy execution commits from
"proxy-exec-v8-6.8-rc3" described previously in
this thread.

o Results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) proxy-exec-full[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.08) 1.00 [ -0.18]( 3.90)
2-groups 1.00 [ -0.00]( 0.89) 1.04 [ -4.43]( 0.78)
4-groups 1.00 [ -0.00]( 0.81) 1.05 [ -4.82]( 1.03)
8-groups 1.00 [ -0.00]( 0.78) 1.02 [ -1.90]( 1.00)
16-groups 1.00 [ -0.00]( 1.60) 1.01 [ -0.80]( 1.18)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy-exec-full[pct imp](CV)
1 1.00 [ 0.00]( 0.71) 0.97 [ -3.00]( 0.15)
2 1.00 [ 0.00]( 0.25) 0.97 [ -3.35]( 0.98)
4 1.00 [ 0.00]( 0.85) 0.97 [ -3.26]( 1.40)
8 1.00 [ 0.00]( 1.00) 0.97 [ -2.75]( 0.46)
16 1.00 [ 0.00]( 1.25) 0.99 [ -1.27]( 0.11)
32 1.00 [ 0.00]( 0.35) 0.98 [ -2.42]( 0.06)
64 1.00 [ 0.00]( 0.71) 0.97 [ -2.76]( 1.81)
128 1.00 [ 0.00]( 0.46) 0.97 [ -2.67]( 0.88)
256 1.00 [ 0.00]( 0.24) 0.98 [ -1.97]( 0.98)
512 1.00 [ 0.00]( 0.30) 0.98 [ -2.41]( 0.38)
1024 1.00 [ 0.00]( 0.40) 0.98 [ -2.21]( 0.11)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy-exec-full[pct imp](CV)
Copy 1.00 [ 0.00]( 9.73) 1.00 [ 0.26]( 6.36)
Scale 1.00 [ 0.00]( 5.57) 1.02 [ 1.59]( 2.98)
Add 1.00 [ 0.00]( 5.43) 1.00 [ 0.48]( 2.77)
Triad 1.00 [ 0.00]( 5.50) 0.98 [ -2.18]( 6.06)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy-exec-full[pct imp](CV)
Copy 1.00 [ 0.00]( 3.26) 0.98 [ -1.96]( 3.24)
Scale 1.00 [ 0.00]( 1.26) 0.96 [ -3.61]( 6.41)
Add 1.00 [ 0.00]( 1.47) 0.98 [ -1.84]( 4.14)
Triad 1.00 [ 0.00]( 1.77) 1.00 [ 0.27]( 2.60)


==================================================================
Test : netperf
Units : Normalized Througput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy-exec-full[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.22) 0.97 [ -3.01]( 0.40)
2-clients 1.00 [ 0.00]( 0.57) 0.97 [ -3.25]( 0.45)
4-clients 1.00 [ 0.00]( 0.43) 0.97 [ -3.26]( 0.59)
8-clients 1.00 [ 0.00]( 0.27) 0.97 [ -2.83]( 0.55)
16-clients 1.00 [ 0.00]( 0.46) 0.97 [ -2.99]( 0.65)
32-clients 1.00 [ 0.00]( 0.95) 0.97 [ -2.98]( 0.71)
64-clients 1.00 [ 0.00]( 1.79) 0.97 [ -2.61]( 1.38)
128-clients 1.00 [ 0.00]( 0.89) 0.97 [ -2.72]( 0.94)
256-clients 1.00 [ 0.00]( 3.88) 0.98 [ -1.89]( 2.92)
512-clients 1.00 [ 0.00](35.06) 0.99 [ -0.78](47.83)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) proxy-exec-full[pct imp](CV)
1 1.00 [ -0.00](27.28) 1.31 [-31.25]( 6.45)
2 1.00 [ -0.00]( 3.85) 0.95 [ 5.00](10.02)
4 1.00 [ -0.00](14.00) 1.11 [-10.53]( 1.36)
8 1.00 [ -0.00]( 4.68) 1.15 [-14.58](14.55)
16 1.00 [ -0.00]( 4.08) 0.98 [ 1.61]( 3.28)
32 1.00 [ -0.00]( 6.68) 1.02 [ -2.04]( 1.71)
64 1.00 [ -0.00]( 1.79) 1.12 [-11.73]( 7.08)
128 1.00 [ -0.00]( 6.30) 1.11 [-10.84]( 5.52)
256 1.00 [ -0.00](43.39) 1.37 [-37.14](20.11)
512 1.00 [ -0.00]( 2.26) 0.99 [ 1.17]( 1.43)


==================================================================
Test : Unixbench
Units : Normalized scores
Interpretation: Lower is better
Statistic : Various (Mentioned)
==================================================================
Metric Variant tip proxy-exec-full
Hmean unixbench-dhry2reg-1 0.00% -0.67%
Hmean unixbench-dhry2reg-512 0.00% 0.14%
Amean unixbench-syscall-1 0.00% -0.86%
Amean unixbench-syscall-512 0.00% -6.42%
Hmean unixbench-pipe-1 0.00% 0.79%
Hmean unixbench-pipe-512 0.00% 0.57%
Hmean unixbench-spawn-1 0.00% -3.91%
Hmean unixbench-spawn-512 0.00% 3.17%
Hmean unixbench-execl-1 0.00% -1.18%
Hmean unixbench-execl-512 0.00% 1.26%
--

> (I'm trying to get v9 ready as soon as I can here, as its fixed a
> number of smaller issues - However, I've also managed to uncover some
> new problems in stress testing, so we'll see how quickly I can chase
> those down).

I haven't seen any splats when running the above tests. I'll test some
larger workloads next. Please let me know if you would like me to test
any specific workload or need additional data from these tests :)

>
> thanks
> -john

--
Thanks and Regards,
Prateek

2024-02-29 06:19:33

by John Stultz

[permalink] [raw]
Subject: Re: [RESEND][PATCH v8 0/7] Preparatory changes for Proxy Execution v8

On Wed, Feb 28, 2024 at 9:37 AM 'K Prateek Nayak' via kernel-team
<[email protected]> wrote:
> I got a chance to test the whole of v8 patches on the same dual socket
> 3rd Generation EPYC system:
>
> tl;dr
>
> - There is a slight regression in hackbench but instead of the 10x
> blowup seen previously, it is only around 5% with overloaded case
> not regressing at all.
>
> - A small but consistent (~2-3%) regression is seen in tbench and
> netperf.

Once again, thank you so much for your testing and reporting of the
data! I really appreciate it!

Do you mind sharing exactly how you're running the benchmarks? (I'd
like to try to reproduce these locally (though my machine is much
smaller).

I'm guessing the hackbench one is the same command you shared earlier with v6?

thanks
-john

2024-02-29 06:46:16

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RESEND][PATCH v8 0/7] Preparatory changes for Proxy Execution v8

Hello John,

On 2/29/2024 11:49 AM, John Stultz wrote:
> On Wed, Feb 28, 2024 at 9:37 AM 'K Prateek Nayak' via kernel-team
> <[email protected]> wrote:
>> I got a chance to test the whole of v8 patches on the same dual socket
>> 3rd Generation EPYC system:
>>
>> tl;dr
>>
>> - There is a slight regression in hackbench but instead of the 10x
>> blowup seen previously, it is only around 5% with overloaded case
>> not regressing at all.
>>
>> - A small but consistent (~2-3%) regression is seen in tbench and
>> netperf.
>
> Once again, thank you so much for your testing and reporting of the
> data! I really appreciate it!
>
> Do you mind sharing exactly how you're running the benchmarks? (I'd
> like to try to reproduce these locally (though my machine is much
> smaller).
>
> I'm guessing the hackbench one is the same command you shared earlier with v6?

Yup it is same as earlier. I'll list all the commands down below:

o Hackbench

perf bench sched messaging -p -t -l 100000 -g <# of groups>

o Old schbench
git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
at commit e4aa540 ("Make sure rps isn't zero in auto_rps mode.")

schbench -m 2 -t <# workers> -r 30

(I should probably upgrade this to the latest! Let me get on it)

o tbench (https://www.samba.org/ftp/tridge/dbench/dbench-4.0.tar.gz)

nohup tbench_srv 0 &
tbench -c client.txt -t 60 <# clients> 127.0.0.1

o Stream (https://www.cs.virginia.edu/stream/FTP/Code/)

export ARRAY_SIZE=128000000; # 4 * Local L3 size
gcc -DSTREAM_ARRAY_SIZE=$ARRAY_SIZE -DNTIMES=<Loops internally> -fopenmp -O2 stream.c -o stream
export OMP_NUM_THREADS=16; # Number of CCX on my machine
./stream;

o netperf

netserver -L 127.0.0.1
for i in `seq 0 1 <num clients>`;
do
netperf -H 127.0.0.1 -t TCP_RR -l 100 -- -r 100 -k REQUEST_SIZE,RESPONSE_SIZE,ELAPSED_TIME,THROUGHPUT,THROUGHPUT_UNITS,MIN_LATENCY,MEAN_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MAX_LATENCY,STDDEV_LATENCY&
done
wait;

o Unixbench (from mmtest)

./run-mmtests.sh --no-monitor --config configs/config-workload-unixbench
--

If you have any other question, please do let me know :)

>
> thanks
> -john

--
Thanks and Regards,
Prateek

2024-02-29 07:05:15

by John Stultz

[permalink] [raw]
Subject: Re: [RESEND][PATCH v8 0/7] Preparatory changes for Proxy Execution v8

On Wed, Feb 28, 2024 at 10:44 PM 'K Prateek Nayak' via kernel-team
<[email protected]> wrote:
> On 2/29/2024 11:49 AM, John Stultz wrote:
> > Do you mind sharing exactly how you're running the benchmarks? (I'd
> > like to try to reproduce these locally (though my machine is much
> > smaller).
> >
> > I'm guessing the hackbench one is the same command you shared earlier with v6?
>
> Yup it is same as earlier. I'll list all the commands down below:

Great! I'll try to take a swing at reproducing these locally before I
send out v9.

[snip]
> If you have any other question, please do let me know :)

Thank you so much for the details! Your efforts here are very appreciated!
-john

2024-02-28 05:28:34

by John Stultz

[permalink] [raw]
Subject: Re: [RESEND][PATCH v8 0/7] Preparatory changes for Proxy Execution v8

On Tue, Feb 27, 2024 at 9:12 PM K Prateek Nayak <kprateek.nayak@amdcom> wrote:
> On 2/28/2024 10:21 AM, John Stultz wrote:
> > Just to clarify: by "this series" did you test just the 7 preparatory
> > patches submitted to the list here, or did you pull the full
> > proxy-exec-v8-6.8-rc3 set from git?
>
> Just these preparatory patches for now. On my way to queue a run for the
> whole set from your tree. I'll use the "proxy-exec-v8-6.8-rc3" branch and
> pick the commits past the
> "[ANNOTATION] === Proxy Exec patches past this point ===" till the commit
> ff90fb583a81 ("FIX: Avoid using possibly uninitialized cpu value with
> activate_blocked_entities()") on top of the tip:sched/core mentioned
> above since it'll allow me to reuse the baseline numbers :)
>

Ah, thank you for the clarification!

Also, I really appreciate your testing with the rest of the series as
well. It will be good to have any potential problems identified early
(I'm trying to get v9 ready as soon as I can here, as its fixed a
number of smaller issues - However, I've also managed to uncover some
new problems in stress testing, so we'll see how quickly I can chase
those down).

thanks
-john