As mentioned a few times previously[1], after earlier
submissions of the Proxy Execution series didn't get much in the
way of feedback, it was noted that the patch series was getting
a bit unwieldy to review. Qais suggested I break out just the
cleanup/preparatory components of the patch series and submit
them on their own, in the hope that we can start merging the
less complex bits and focus discussion on the more complicated
portions afterwards. So far this has not been very successful:
the submission & RESEND of the v8 & v9 preparatory changes did
not get all that much in the way of review or feedback either.
For v10 of this series, I'm again submitting only those early
cleanup/preparatory changes. Please let me know if there is
anything that would make the series easier to review and help
move this forward.
In the meantime, I've continued to put effort into the full
series, mostly focused on polishing it for correctness.
Unfortunately, one issue I found took a while to track down, as
it turned out to be a problem in mainline (the RT_PUSH_IPI
feature broke the RT scheduling invariant; after disabling it I
don't see problems with mainline or with proxy-exec). But going
through the analysis process was helpful, and I've made some
tweaks to Metin's trace-events patch to make it easier to follow
the proxy behavior using ftrace & perfetto. Doing this also
helped find a case where, when we were proxy-migrating current,
we first schedule idle but didn't preserve the needs_resched
flag, needlessly delaying things.
If you are interested, the full v10 series can be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v10-6.9-rc7
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v10-6.9-rc7
New in v10 (in the preparatory patches submitted here)
---------
* Switched preempt_enable to be lower, close to the unlock, as
suggested by Valentin
* Added additional preempt_disable coverage around the wake_q
calls, as also noted by Valentin
* Handle null lock ptr in __mutex_owner, to simplify later code,
as suggested by Metin Kaya
* Changed do_push_task to move_queued_task_locked as suggested
by Valentin
* Use rq_selected in push_rt_task & get_push_task
* Added Reviewed-by tags
New in v10 (in the rest of the series)
---------
* Tweak so that if find_proxy_task() returns idle, we always
preserve needs_resched
* Drop WARN_ON(task_is_blocked(p)) in the ttwu current case
* Add more details to the trace events (owner task for proxy
migrations, and prev, selected and next for task selection)
so it's easier to understand the proxy behavior
* Simplify task_queued_on_rq() logic, as suggested by Metin
* Rework from do_push_task usage to move_queued_task_locked
* Further cleanups suggested by Metin
Performance:
---------
K Prateek Nayak provided some feedback on the full v8 series
here[2]. Given the potential extra overhead of doing rq
migrations/return migrations/etc. for the proxy case, it's not
completely surprising that a few of K Prateek's test cases saw
~3-5% regressions, but I'm hoping to look into this soon to see
if we can reduce those further.
Issues still to address:
---------
* The chain migration functionality needs further iterations and
better validation to ensure it truly maintains the RT/DL load
balancing invariants.
* CFS load balancing. There was concern that blocked tasks may
carry forward load (PELT) to the lock owner's CPU, so the CPU
may look like it is overloaded. Needs investigation.
* The sleeping owner handling (where we deactivate waiting tasks
and enqueue them onto a list, then reactivate them when the
owner wakes up) doesn’t feel great. This is in part because
when we want to activate tasks, we’re already holding a
task.pi_lock and a rq_lock, just not the locks for the task
we’re activating, nor the rq we’re enqueuing it onto. So there
has to be a bit of lock juggling to drop and acquire the right
locks (in the right order). It feels like there’s got to be a
better way. Also needs some rework to get rid of the
recursion.
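To make the locking constraint above concrete, here is a minimal
sketch (not code from the series; the function name, flow, and the
ENQUEUE_WAKEUP flag are illustrative assumptions) of what waking a
task p, whose pi_lock and rq lock we do not yet hold, ends up
looking like:

  /*
   * Illustrative only: drop what we hold, then take p's locks in the
   * canonical order (p->pi_lock, then p's rq lock) before activating it.
   */
  static void activate_blocked_waiter(struct rq *rq, struct rq_flags *rf,
                                      struct task_struct *p)
  {
          struct rq_flags prf;
          struct rq *p_rq;

          /* The rq lock we came in with (and, in the real code, the
           * pi_lock we hold) does not cover p, so drop it first. */
          rq_unlock(rq, rf);

          /* task_rq_lock() takes p->pi_lock and then p's rq lock. */
          p_rq = task_rq_lock(p, &prf);
          if (!task_on_rq_queued(p))
                  activate_task(p_rq, p, ENQUEUE_WAKEUP);
          task_rq_unlock(p_rq, p, &prf);

          /* Finally re-take the rq lock we started with. */
          rq_lock(rq, rf);
  }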
Credit/Disclaimer:
---------
As mentioned previously, this Proxy Execution series has a long
history:
It was first described in a paper[3] by Watkins, Straub, and
Niehaus, then grew out of patches from Peter Zijlstra, and was
extended with lots of work by Juri Lelli, Valentin Schneider, and
Connor O'Brien. (and thank you to Steven Rostedt for providing
additional details here!)
So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.
Thanks so much!
-john
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Connor O'Brien (2):
sched: Add move_queued_task_locked helper
sched: Consolidate pick_*_task to task_is_pushable helper
John Stultz (1):
sched: Split out __schedule() deactivate task logic into a helper
Juri Lelli (2):
locking/mutex: Make mutex::wait_lock irq safe
locking/mutex: Expose __mutex_owner()
Peter Zijlstra (2):
locking/mutex: Remove wakeups from under mutex::wait_lock
sched: Split scheduler and execution contexts
kernel/locking/mutex.c | 60 +++++++----------
kernel/locking/mutex.h | 27 ++++++++
kernel/locking/rtmutex.c | 30 ++++++---
kernel/locking/rwbase_rt.c | 8 ++-
kernel/locking/rwsem.c | 4 +-
kernel/locking/spinlock_rt.c | 3 +-
kernel/locking/ww_mutex.h | 49 ++++++++------
kernel/sched/core.c | 122 +++++++++++++++++++++--------------
kernel/sched/deadline.c | 53 ++++++---------
kernel/sched/fair.c | 18 +++---
kernel/sched/rt.c | 61 +++++++-----------
kernel/sched/sched.h | 48 +++++++++++++-
12 files changed, 282 insertions(+), 201 deletions(-)
--
2.45.0.rc1.225.g2a3ae87e7f-goog
From: Juri Lelli <[email protected]>
Implementing proxy execution requires that scheduler code be able to
identify the current owner of a mutex. Expose __mutex_owner() for
this purpose (alone!).
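For context, a rough sketch of the scheduler-side use this enables
(not part of this patch; get_task_blocked_on() is a hypothetical
stand-in for the accessor later patches add):

  /* Illustrative only: walk from a blocked task to the mutex owner. */
  static struct task_struct *owner_of_blocked_task(struct task_struct *p)
  {
          struct mutex *m = get_task_blocked_on(p); /* hypothetical accessor */

          /* __mutex_owner() now tolerates a NULL lock pointer. */
          return __mutex_owner(m);
  }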
Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Tested-by: K Prateek Nayak <[email protected]>
Tested-by: Metin Kaya <[email protected]>
Reviewed-by: Metin Kaya <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
[Removed the EXPORT_SYMBOL]
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Reworked per Peter's suggestions]
Signed-off-by: John Stultz <[email protected]>
---
v4:
* Move __mutex_owner() to kernel/locking/mutex.h instead of
adding a new globally available accessor function to keep
the exposure of this low, along with keeping it an inline
function, as suggested by PeterZ
v10:
* Handle null lock ptr, to simplify later code, as suggested
by Metin Kaya
---
kernel/locking/mutex.c | 25 -------------------------
kernel/locking/mutex.h | 27 +++++++++++++++++++++++++++
2 files changed, 27 insertions(+), 25 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 6d843a0978a5..4b7193fd3be9 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -56,31 +56,6 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
}
EXPORT_SYMBOL(__mutex_init);
-/*
- * @owner: contains: 'struct task_struct *' to the current lock owner,
- * NULL means not owned. Since task_struct pointers are aligned at
- * at least L1_CACHE_BYTES, we have low bits to store extra state.
- *
- * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
- * Bit1 indicates unlock needs to hand the lock to the top-waiter
- * Bit2 indicates handoff has been done and we're waiting for pickup.
- */
-#define MUTEX_FLAG_WAITERS 0x01
-#define MUTEX_FLAG_HANDOFF 0x02
-#define MUTEX_FLAG_PICKUP 0x04
-
-#define MUTEX_FLAGS 0x07
-
-/*
- * Internal helper function; C doesn't allow us to hide it :/
- *
- * DO NOT USE (outside of mutex code).
- */
-static inline struct task_struct *__mutex_owner(struct mutex *lock)
-{
- return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
-}
-
static inline struct task_struct *__owner_task(unsigned long owner)
{
return (struct task_struct *)(owner & ~MUTEX_FLAGS);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 0b2a79c4013b..cbff35b9b7ae 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -20,6 +20,33 @@ struct mutex_waiter {
#endif
};
+/*
+ * @owner: contains: 'struct task_struct *' to the current lock owner,
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
+ *
+ * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ * Bit1 indicates unlock needs to hand the lock to the top-waiter
+ * Bit2 indicates handoff has been done and we're waiting for pickup.
+ */
+#define MUTEX_FLAG_WAITERS 0x01
+#define MUTEX_FLAG_HANDOFF 0x02
+#define MUTEX_FLAG_PICKUP 0x04
+
+#define MUTEX_FLAGS 0x07
+
+/*
+ * Internal helper function; C doesn't allow us to hide it :/
+ *
+ * DO NOT USE (outside of mutex & scheduler code).
+ */
+static inline struct task_struct *__mutex_owner(struct mutex *lock)
+{
+ if (!lock)
+ return NULL;
+ return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
+}
+
#ifdef CONFIG_DEBUG_MUTEXES
extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
From: Peter Zijlstra <[email protected]>
Let's define the scheduling context as all the scheduler state
in task_struct for the task selected to run, and the execution
context as all state required to actually run the task.
Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the task selected to be scheduled, but use the
execution context of a different task to actually be run.
To this end, introduce the rq_selected() macro to point to the
task_struct selected from the runqueue by the scheduler, which
will be used for scheduler state, while preserving rq->curr to
indicate the execution context of the task that will actually be
run.
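As a rough illustration of where this split is headed (the proxy
selection step below comes from later patches in the series, and the
find_proxy_task()/task_is_blocked() calls shown are assumptions about
their shape, not part of this set):

  /* In __schedule(), once proxy execution lands (sketch only): */
  next = pick_next_task(rq, prev, &rf);
  rq_set_selected(rq, next);                      /* scheduling context */
  if (task_is_blocked(next))                      /* later-series check */
          next = find_proxy_task(rq, next, &rf);  /* may pick the lock owner */
  /*
   * context_switch() then runs 'next' (the execution context), while
   * rq_selected(rq) still names the task whose scheduler state is used.
   */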
Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Tested-by: K Prateek Nayak <[email protected]>
Tested-by: Metin Kaya <[email protected]>
Reviewed-by: Metin Kaya <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
[add additional comments and update more sched_class code to use
rq::proxy]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Rebased and resolved minor collisions, reworked to use
accessors, tweaked update_curr_common to use rq_proxy fixing rt
scheduling issues]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Reworked to use accessors
* Fixed update_curr_common to use proxy instead of curr
v3:
* Tweaked wrapper names
* Swapped proxy for selected for clarity
v4:
* Minor variable name tweaks for readability
* Use a macro instead of an inline function and drop
other helper functions as suggested by Peter.
* Remove verbose comments/questions to avoid review
distractions, as suggested by Dietmar
v5:
* Add CONFIG_PROXY_EXEC option to this patch so the
new logic can be tested with this change
* Minor fix to grab rq_selected when holding the rq lock
v7:
* Minor spelling fix and unused argument fixes suggested by
Metin Kaya
* Switch to curr_selected for consistency, and minor rewording
of commit message for clarity
* Rename variables to selected instead of curr when we're using
rq_selected()
* Reduce macros in CONFIG_SCHED_PROXY_EXEC ifdef sections,
as suggested by Metin Kaya
v8:
* Use rq->curr, not rq_selected with task_tick, as suggested by
Valentin
* Minor rework to reorder this with CONFIG_SCHED_PROXY_EXEC patch
v10:
* Use rq_selected in push_rt_task & get_push_task
---
kernel/sched/core.c | 46 ++++++++++++++++++++++++++---------------
kernel/sched/deadline.c | 35 ++++++++++++++++---------------
kernel/sched/fair.c | 18 ++++++++--------
kernel/sched/rt.c | 42 ++++++++++++++++++-------------------
kernel/sched/sched.h | 27 +++++++++++++++++++++---
5 files changed, 101 insertions(+), 67 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8bc5844ebab9..30af17648f8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -794,7 +794,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+ rq_selected(rq)->sched_class->task_tick(rq, rq->curr, 1);
rq_unlock(rq, &rf);
return HRTIMER_NORESTART;
@@ -2236,16 +2236,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->sched_class == rq->curr->sched_class)
- rq->curr->sched_class->wakeup_preempt(rq, p, flags);
- else if (sched_class_above(p->sched_class, rq->curr->sched_class))
+ struct task_struct *selected = rq_selected(rq);
+
+ if (p->sched_class == selected->sched_class)
+ selected->sched_class->wakeup_preempt(rq, p, flags);
+ else if (sched_class_above(p->sched_class, selected->sched_class))
resched_curr(rq);
/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
- if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+ if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq);
}
@@ -2772,7 +2774,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
lockdep_assert_held(&p->pi_lock);
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued) {
/*
@@ -5596,7 +5598,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
* project cycles that may never be accounted to this
* thread, breaking clock_gettime().
*/
- if (task_current(rq, p) && task_on_rq_queued(p)) {
+ if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
prefetch_curr_exec_start(p);
update_rq_clock(rq);
p->sched_class->update_curr(rq);
@@ -5664,7 +5666,8 @@ void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
- struct task_struct *curr = rq->curr;
+ /* accounting goes to the selected task */
+ struct task_struct *selected;
struct rq_flags rf;
unsigned long thermal_pressure;
u64 resched_latency;
@@ -5675,16 +5678,17 @@ void scheduler_tick(void)
sched_clock_tick();
rq_lock(rq, &rf);
+ selected = rq_selected(rq);
update_rq_clock(rq);
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
- curr->sched_class->task_tick(rq, curr, 0);
+ selected->sched_class->task_tick(rq, selected, 0);
if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
sched_core_tick(rq);
- task_tick_mm_cid(rq, curr);
+ task_tick_mm_cid(rq, selected);
rq_unlock(rq, &rf);
@@ -5693,8 +5697,8 @@ void scheduler_tick(void)
perf_event_task_tick();
- if (curr->flags & PF_WQ_WORKER)
- wq_worker_tick(curr);
+ if (selected->flags & PF_WQ_WORKER)
+ wq_worker_tick(selected);
#ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
@@ -5759,6 +5763,12 @@ static void sched_tick_remote(struct work_struct *work)
struct task_struct *curr = rq->curr;
if (cpu_online(cpu)) {
+ /*
+ * Since this is a remote tick for full dynticks mode,
+ * we are always sure that there is no proxy (only a
+ * single task is running).
+ */
+ SCHED_WARN_ON(rq->curr != rq_selected(rq));
update_rq_clock(rq);
if (!is_idle_task(curr)) {
@@ -6712,6 +6722,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}
next = pick_next_task(rq, prev, &rf);
+ rq_set_selected(rq, next);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
@@ -7222,7 +7233,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
prev_class = p->sched_class;
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flag);
if (running)
@@ -7312,7 +7323,7 @@ void set_user_nice(struct task_struct *p, long nice)
}
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
if (running)
@@ -7891,7 +7902,7 @@ static int __sched_setscheduler(struct task_struct *p,
}
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, queue_flags);
if (running)
@@ -9318,6 +9329,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
rcu_read_unlock();
rq->idle = idle;
+ rq_set_selected(rq, idle);
rcu_assign_pointer(rq->curr, idle);
idle->on_rq = TASK_ON_RQ_QUEUED;
#ifdef CONFIG_SMP
@@ -9407,7 +9419,7 @@ void sched_setnuma(struct task_struct *p, int nid)
rq = task_rq_lock(p, &rf);
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_selected(rq, p);
if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -10512,7 +10524,7 @@ void sched_move_task(struct task_struct *tsk)
update_rq_clock(rq);
- running = task_current(rq, tsk);
+ running = task_current_selected(rq, tsk);
queued = task_on_rq_queued(tsk);
if (queued)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ae583a427539..6b49f9229414 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1218,7 +1218,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
#endif
enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
wakeup_preempt_dl(rq, p, 0);
else
resched_curr(rq);
@@ -1442,7 +1442,7 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
*/
static void update_curr_dl(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_dl_entity *dl_se = &curr->dl;
s64 delta_exec;
@@ -1899,7 +1899,7 @@ static int find_later_rq(struct task_struct *task);
static int
select_task_rq_dl(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
bool select_rq;
struct rq *rq;
@@ -1910,6 +1910,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));
/*
* If we are dealing with a -deadline task, we must
@@ -1920,9 +1921,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- select_rq = unlikely(dl_task(curr)) &&
+ select_rq = unlikely(dl_task(selected)) &&
(curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ !dl_entity_preempt(&p->dl, &selected->dl)) &&
p->nr_cpus_allowed > 1;
/*
@@ -1985,7 +1986,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
+ !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
return;
/*
@@ -2024,7 +2025,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
int flags)
{
- if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
+ if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
resched_curr(rq);
return;
}
@@ -2034,7 +2035,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
* In the unlikely case current and p have the same deadline
* let us try to decide what's the best thing to do...
*/
- if ((p->dl.deadline == rq->curr->dl.deadline) &&
+ if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
!test_tsk_need_resched(rq->curr))
check_preempt_equal_dl(rq, p);
#endif /* CONFIG_SMP */
@@ -2066,7 +2067,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (!first)
return;
- if (rq->curr->sched_class != &dl_sched_class)
+ if (rq_selected(rq)->sched_class != &dl_sched_class)
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
deadline_queue_push_tasks(rq);
@@ -2391,8 +2392,8 @@ static int push_dl_task(struct rq *rq)
* can move away, it makes sense to just reschedule
* without going further in pushing next_task.
*/
- if (dl_task(rq->curr) &&
- dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+ if (dl_task(rq_selected(rq)) &&
+ dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
rq->curr->nr_cpus_allowed > 1) {
resched_curr(rq);
return 0;
@@ -2515,7 +2516,7 @@ static void pull_dl_task(struct rq *this_rq)
* deadline than the current task of its runqueue.
*/
if (dl_time_before(p->dl.deadline,
- src_rq->curr->dl.deadline))
+ rq_selected(src_rq)->dl.deadline))
goto skip;
if (is_migration_disabled(p)) {
@@ -2554,9 +2555,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
if (!task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- dl_task(rq->curr) &&
+ dl_task(rq_selected(rq)) &&
(rq->curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
+ !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
push_dl_tasks(rq);
}
}
@@ -2731,12 +2732,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
return;
}
- if (rq->curr != p) {
+ if (rq_selected(rq) != p) {
#ifdef CONFIG_SMP
if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
deadline_queue_push_tasks(rq);
#endif
- if (dl_task(rq->curr))
+ if (dl_task(rq_selected(rq)))
wakeup_preempt_dl(rq, p, 0);
else
resched_curr(rq);
@@ -2765,7 +2766,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
if (!rq->dl.overloaded)
deadline_queue_pull_task(rq);
- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
/*
* If we now have a earlier deadline task than p,
* then reschedule, provided p is still on this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c62805dbd608..4c0018ba7ea3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1146,7 +1146,7 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
*/
s64 update_curr_common(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
s64 delta_exec;
delta_exec = update_curr_se(rq, &curr->se);
@@ -1183,7 +1183,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
static void update_curr_fair(struct rq *rq)
{
- update_curr(cfs_rq_of(&rq->curr->se));
+ update_curr(cfs_rq_of(&rq_selected(rq)->se));
}
static inline void
@@ -6639,7 +6639,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
s64 delta = slice - ran;
if (delta < 0) {
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
return;
}
@@ -6654,7 +6654,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
*/
static void hrtick_update(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
return;
@@ -8285,7 +8285,7 @@ static void set_next_buddy(struct sched_entity *se)
*/
static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int cse_is_idle, pse_is_idle;
@@ -8316,7 +8316,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
- if (test_tsk_need_resched(curr))
+ if (test_tsk_need_resched(rq->curr))
return;
/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -9298,7 +9298,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
* update_load_avg() can call cpufreq_update_util(). Make sure that RT,
* DL and IRQ signals have been updated before updating CFS.
*/
- curr_class = rq->curr->sched_class;
+ curr_class = rq_selected(rq)->sched_class;
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
@@ -12667,7 +12667,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
* our priority decreased, or if we are not currently running on
* this runqueue and our priority is higher than the current's
*/
- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
if (p->prio > oldprio)
resched_curr(rq);
} else
@@ -12770,7 +12770,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
* kick off the schedule if running, otherwise just see
* if we can still preempt the current task.
*/
- if (task_current(rq, p))
+ if (task_current_selected(rq, p))
resched_curr(rq);
else
wakeup_preempt(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8b6fb77e095b..b02e8aad9b86 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -530,7 +530,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
{
- struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
+ struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
struct rq *rq = rq_of_rt_rq(rt_rq);
struct sched_rt_entity *rt_se;
@@ -1000,7 +1000,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
*/
static void update_curr_rt(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *curr = rq_selected(rq);
struct sched_rt_entity *rt_se = &curr->rt;
s64 delta_exec;
@@ -1543,7 +1543,7 @@ static int find_lowest_rq(struct task_struct *task);
static int
select_task_rq_rt(struct task_struct *p, int cpu, int flags)
{
- struct task_struct *curr;
+ struct task_struct *curr, *selected;
struct rq *rq;
bool test;
@@ -1555,6 +1555,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ selected = READ_ONCE(rq_selected(rq));
/*
* If the current task on @p's runqueue is an RT task, then
@@ -1583,8 +1584,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
* systems like big.LITTLE.
*/
test = curr &&
- unlikely(rt_task(curr)) &&
- (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+ unlikely(rt_task(selected)) &&
+ (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);
if (test || !rt_task_fits_capacity(p, cpu)) {
int target = find_lowest_rq(p);
@@ -1614,12 +1615,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
{
- /*
- * Current can't be migrated, useless to reschedule,
- * let's hope p can move out.
- */
if (rq->curr->nr_cpus_allowed == 1 ||
- !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+ !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
return;
/*
@@ -1662,7 +1659,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
*/
static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
{
- if (p->prio < rq->curr->prio) {
+ struct task_struct *curr = rq_selected(rq);
+
+ if (p->prio < curr->prio) {
resched_curr(rq);
return;
}
@@ -1680,7 +1679,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
* to move current somewhere else, making room for our non-migratable
* task.
*/
- if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+ if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
check_preempt_equal_prio(rq, p);
#endif
}
@@ -1705,7 +1704,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
* utilization. We only care of the case where we start to schedule a
* rt task
*/
- if (rq->curr->sched_class != &rt_sched_class)
+ if (rq_selected(rq)->sched_class != &rt_sched_class)
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
rt_queue_push_tasks(rq);
@@ -1977,6 +1976,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)
BUG_ON(rq->cpu != task_cpu(p));
BUG_ON(task_current(rq, p));
+ BUG_ON(task_current_selected(rq, p));
BUG_ON(p->nr_cpus_allowed <= 1);
BUG_ON(!task_on_rq_queued(p));
@@ -2009,7 +2009,7 @@ static int push_rt_task(struct rq *rq, bool pull)
* higher priority than current. If that's the case
* just reschedule current.
*/
- if (unlikely(next_task->prio < rq->curr->prio)) {
+ if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
resched_curr(rq);
return 0;
}
@@ -2030,7 +2030,7 @@ static int push_rt_task(struct rq *rq, bool pull)
* Note that the stoppers are masqueraded as SCHED_FIFO
* (cf. sched_set_stop_task()), so we can't rely on rt_task().
*/
- if (rq->curr->sched_class != &rt_sched_class)
+ if (rq_selected(rq)->sched_class != &rt_sched_class)
return 0;
cpu = find_lowest_rq(rq->curr);
@@ -2362,7 +2362,7 @@ static void pull_rt_task(struct rq *this_rq)
* p if it is lower in priority than the
* current task on the run queue
*/
- if (p->prio < src_rq->curr->prio)
+ if (p->prio < rq_selected(src_rq)->prio)
goto skip;
if (is_migration_disabled(p)) {
@@ -2404,9 +2404,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
bool need_to_push = !task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 &&
- (dl_task(rq->curr) || rt_task(rq->curr)) &&
+ (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
(rq->curr->nr_cpus_allowed < 2 ||
- rq->curr->prio <= p->prio);
+ rq_selected(rq)->prio <= p->prio);
if (need_to_push)
push_rt_tasks(rq);
@@ -2490,7 +2490,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
rt_queue_push_tasks(rq);
#endif /* CONFIG_SMP */
- if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+ if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
resched_curr(rq);
}
}
@@ -2505,7 +2505,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
if (!task_on_rq_queued(p))
return;
- if (task_current(rq, p)) {
+ if (task_current_selected(rq, p)) {
#ifdef CONFIG_SMP
/*
* If our priority decreases while running, we
@@ -2531,7 +2531,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
* greater than the current running task
* then reschedule.
*/
- if (p->prio < rq->curr->prio)
+ if (p->prio < rq_selected(rq)->prio)
resched_curr(rq);
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e46f69ba9ba2..747233cf1116 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1032,7 +1032,7 @@ struct rq {
*/
unsigned int nr_uninterruptible;
- struct task_struct __rcu *curr;
+ struct task_struct __rcu *curr; /* Execution context */
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
@@ -1227,6 +1227,13 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)
+/* For now, rq_selected == rq->curr */
+#define rq_selected(rq) ((rq)->curr)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+ /* Do nothing */
+}
+
struct sched_group;
#ifdef CONFIG_SCHED_CORE
static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2150,11 +2157,25 @@ static inline u64 global_rt_runtime(void)
return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
}
+/*
+ * Is p the current execution context?
+ */
static inline int task_current(struct rq *rq, struct task_struct *p)
{
return rq->curr == p;
}
+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq_selected() == p.
+ */
+static inline int task_current_selected(struct rq *rq, struct task_struct *p)
+{
+ return rq_selected(rq) == p;
+}
+
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
@@ -2324,7 +2345,7 @@ struct sched_class {
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
- WARN_ON_ONCE(rq->curr != prev);
+ WARN_ON_ONCE(rq_selected(rq) != prev);
prev->sched_class->put_prev_task(rq, prev);
}
@@ -2405,7 +2426,7 @@ extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_conte
static inline struct task_struct *get_push_task(struct rq *rq)
{
- struct task_struct *p = rq->curr;
+ struct task_struct *p = rq_selected(rq);
lockdep_assert_rq_held(rq);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
From: Peter Zijlstra <[email protected]>
In preparation to nest mutex::wait_lock under rq::lock we need to remove
wakeups from under it.
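The conversion follows the usual wake_q pattern: queue the tasks to
wake while holding wait_lock, and only call wake_up_q() (which can
take rq->lock via try_to_wake_up()) once wait_lock has been dropped.
In generic form (a sketch, not a hunk from this patch):

  DEFINE_WAKE_Q(wake_q);

  raw_spin_lock(&lock->wait_lock);
  /* ... decide which waiter(s) to wake, but only queue them ... */
  wake_q_add(&wake_q, waiter->task);
  raw_spin_unlock(&lock->wait_lock);

  /* The actual wakeups happen outside wait_lock. */
  wake_up_q(&wake_q);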
Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Tested-by: K Prateek Nayak <[email protected]>
Tested-by: Metin Kaya <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Metin Kaya <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
mutexes")]
Signed-off-by: Juri Lelli <[email protected]>
[jstultz: rebased to mainline, added extra wake_up_q & init
to avoid hangs, similar to Connor's rework of this patch]
Signed-off-by: John Stultz <[email protected]>
---
v5:
* Reverted to an earlier version of this patch to undo
the change that kept the wake_q in the ctx structure, as
that broke the rule that the wake_q must always be on the
stack, since it's not safe for concurrency.
v6:
* Made tweaks suggested by Waiman Long
v7:
* Fixups to pass wake_qs down for PREEMPT_RT logic
v10:
* Switched preempt_enable to be lower, close to the unlock, as
suggested by Valentin
* Added additional preempt_disable coverage around the wake_q
calls, as also noted by Valentin
---
kernel/locking/mutex.c | 17 +++++++++++++----
kernel/locking/rtmutex.c | 30 +++++++++++++++++++++---------
kernel/locking/rwbase_rt.c | 8 +++++++-
kernel/locking/rwsem.c | 4 ++--
kernel/locking/spinlock_rt.c | 3 ++-
kernel/locking/ww_mutex.h | 29 ++++++++++++++++++-----------
6 files changed, 63 insertions(+), 28 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index cbae8c0b89ab..4269da1f3ef5 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -575,6 +575,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
struct lockdep_map *nest_lock, unsigned long ip,
struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
+ DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
int ret;
@@ -625,7 +626,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (__mutex_trylock(lock)) {
if (ww_ctx)
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
goto skip_wait;
}
@@ -645,7 +646,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* Add in stamp order, waking up waiters that must kill
* themselves.
*/
- ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
+ ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
if (ret)
goto err_early_kill;
}
@@ -681,6 +682,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}
raw_spin_unlock(&lock->wait_lock);
+ /* Make sure we do wakeups before calling schedule */
+ if (!wake_q_empty(&wake_q)) {
+ wake_up_q(&wake_q);
+ wake_q_init(&wake_q);
+ }
schedule_preempt_disabled();
first = __mutex_waiter_is_first(lock, &waiter);
@@ -714,7 +720,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (!ww_ctx->is_wait_die &&
!__mutex_waiter_is_first(lock, &waiter))
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
}
__mutex_remove_waiter(lock, &waiter);
@@ -730,6 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
ww_mutex_lock_acquired(ww, ww_ctx);
raw_spin_unlock(&lock->wait_lock);
+ wake_up_q(&wake_q);
preempt_enable();
return 0;
@@ -741,6 +748,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
raw_spin_unlock(&lock->wait_lock);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
+ wake_up_q(&wake_q);
preempt_enable();
return ret;
}
@@ -951,9 +959,10 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);
+ preempt_disable();
raw_spin_unlock(&lock->wait_lock);
-
wake_up_q(&wake_q);
+ preempt_enable();
}
#ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 88d08eeb8bc0..7a85d9bfa972 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -34,13 +34,15 @@
static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter,
struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
return 0;
}
static inline void __ww_mutex_check_waiters(struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
}
@@ -1207,6 +1209,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
struct rt_mutex_waiter *top_waiter = waiter;
struct rt_mutex_base *next_lock;
int chain_walk = 0, res;
+ DEFINE_WAKE_Q(wake_q);
lockdep_assert_held(&lock->wait_lock);
@@ -1245,7 +1248,10 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
/* Check whether the waiter should back out immediately */
rtm = container_of(lock, struct rt_mutex, rtmutex);
- res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
+ preempt_disable();
+ res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, &wake_q);
+ wake_up_q(&wake_q);
+ preempt_enable();
if (res) {
raw_spin_lock(&task->pi_lock);
rt_mutex_dequeue(lock, waiter);
@@ -1678,7 +1684,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state,
enum rtmutex_chainwalk chwalk,
- struct rt_mutex_waiter *waiter)
+ struct rt_mutex_waiter *waiter,
+ struct wake_q_head *wake_q)
{
struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
struct ww_mutex *ww = ww_container_of(rtm);
@@ -1689,7 +1696,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* Try to acquire the lock again: */
if (try_to_take_rt_mutex(lock, current, NULL)) {
if (build_ww_mutex() && ww_ctx) {
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
return 0;
@@ -1707,7 +1714,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* acquired the lock */
if (build_ww_mutex() && ww_ctx) {
if (!ww_ctx->is_wait_die)
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
} else {
@@ -1729,7 +1736,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
- unsigned int state)
+ unsigned int state,
+ struct wake_q_head *wake_q)
{
struct rt_mutex_waiter waiter;
int ret;
@@ -1738,7 +1746,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
waiter.ww_ctx = ww_ctx;
ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
- &waiter);
+ &waiter, wake_q);
debug_rt_mutex_free_waiter(&waiter);
return ret;
@@ -1754,6 +1762,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state)
{
+ DEFINE_WAKE_Q(wake_q);
unsigned long flags;
int ret;
@@ -1775,8 +1784,11 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
* irqsave/restore variants.
*/
raw_spin_lock_irqsave(&lock->wait_lock, flags);
- ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
+ ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
+ preempt_disable();
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+ wake_up_q(&wake_q);
+ preempt_enable();
rt_mutex_post_schedule();
return ret;
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 34a59569db6b..9f4322c07486 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -69,6 +69,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
unsigned int state)
{
struct rt_mutex_base *rtm = &rwb->rtmutex;
+ DEFINE_WAKE_Q(wake_q);
int ret;
rwbase_pre_schedule();
@@ -110,7 +111,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
* For rwlocks this returns 0 unconditionally, so the below
* !ret conditionals are optimized out.
*/
- ret = rwbase_rtmutex_slowlock_locked(rtm, state);
+ ret = rwbase_rtmutex_slowlock_locked(rtm, state, &wake_q);
/*
* On success the rtmutex is held, so there can't be a writer
@@ -121,7 +122,12 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
*/
if (!ret)
atomic_inc(&rwb->readers);
+
+ preempt_disable();
raw_spin_unlock_irq(&rtm->wait_lock);
+ wake_up_q(&wake_q);
+ preempt_enable();
+
if (!ret)
rwbase_rtmutex_unlock(rtm);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c6d17aee4209..79ab7b8df5c1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1415,8 +1415,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
#define rwbase_rtmutex_lock_state(rtm, state) \
__rt_mutex_lock(rtm, state)
-#define rwbase_rtmutex_slowlock_locked(rtm, state) \
- __rt_mutex_slowlock_locked(rtm, NULL, state)
+#define rwbase_rtmutex_slowlock_locked(rtm, state, wq) \
+ __rt_mutex_slowlock_locked(rtm, NULL, state, wq)
#define rwbase_rtmutex_unlock(rtm) \
__rt_mutex_unlock(rtm)
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 38e292454fcc..fb1810a14c9d 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -162,7 +162,8 @@ rwbase_rtmutex_lock_state(struct rt_mutex_base *rtm, unsigned int state)
}
static __always_inline int
-rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state)
+rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state,
+ struct wake_q_head *wake_q)
{
rtlock_slowlock_locked(rtm);
return 0;
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 3ad2cc4823e5..7189c6631d90 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
*/
static bool
__ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
{
if (!ww_ctx->is_wait_die)
return false;
@@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}
return true;
@@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
*/
static bool __ww_mutex_wound(struct MUTEX *lock,
struct ww_acquire_ctx *ww_ctx,
- struct ww_acquire_ctx *hold_ctx)
+ struct ww_acquire_ctx *hold_ctx,
+ struct wake_q_head *wake_q)
{
struct task_struct *owner = __ww_mutex_owner(lock);
@@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* wakeup pending to re-read the wounded state.
*/
if (owner != current)
- wake_up_process(owner);
+ wake_q_add(wake_q, owner);
return true;
}
@@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* The current task must not be on the wait list.
*/
static void
-__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
+__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur;
@@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
if (!cur->ww_ctx)
continue;
- if (__ww_mutex_die(lock, cur, ww_ctx) ||
- __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+ if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
+ __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
break;
}
}
@@ -377,6 +379,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
+ DEFINE_WAKE_Q(wake_q);
+
ww_mutex_lock_acquired(lock, ctx);
/*
@@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* die or wound us.
*/
lock_wait_lock(&lock->base);
- __ww_mutex_check_waiters(&lock->base, ctx);
+ __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
unlock_wait_lock(&lock->base);
+
+ wake_up_q(&wake_q);
}
static __always_inline int
@@ -488,7 +494,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
static inline int
__ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
struct MUTEX *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur, *pos = NULL;
bool is_wait_die;
@@ -532,7 +539,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
pos = cur;
/* Wait-Die: ensure younger waiters die. */
- __ww_mutex_die(lock, cur, ww_ctx);
+ __ww_mutex_die(lock, cur, ww_ctx, wake_q);
}
__ww_waiter_add(lock, waiter, pos);
@@ -550,7 +557,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
* such that either we or the fastpath will wound @ww->ctx.
*/
smp_mb();
- __ww_mutex_wound(lock, ww_ctx, ww->ctx);
+ __ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
}
return 0;
--
2.45.0.rc1.225.g2a3ae87e7f-goog
From: Connor O'Brien <[email protected]>
This patch consolidates the rt and deadline pick_*_task()
functions into a task_is_pushable() helper.
This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.
Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Tested-by: K Prateek Nayak <[email protected]>
Tested-by: Metin Kaya <[email protected]>
Reviewed-by: Metin Kaya <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch,
renamed helper function]
Signed-off-by: John Stultz <[email protected]>
---
v7:
* Split from chain migration patch
* Renamed function
---
kernel/sched/deadline.c | 10 +---------
kernel/sched/rt.c | 11 +----------
kernel/sched/sched.h | 10 ++++++++++
3 files changed, 12 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index eaedc69c5e30..ae583a427539 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2179,14 +2179,6 @@ static void task_fork_dl(struct task_struct *p)
/* Only try algorithms three times */
#define DL_MAX_TRIES 3
-static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
-{
- if (!task_on_cpu(rq, p) &&
- cpumask_test_cpu(cpu, &p->cpus_mask))
- return 1;
- return 0;
-}
-
/*
* Return the earliest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise:
@@ -2205,7 +2197,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
if (next_node) {
p = __node_2_pdl(next_node);
- if (pick_dl_task(rq, p, cpu))
+ if (task_is_pushable(rq, p, cpu) == 1)
return p;
next_node = rb_next(next_node);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 975cb49a64dc..8b6fb77e095b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1791,15 +1791,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
/* Only try algorithms three times */
#define RT_MAX_TRIES 3
-static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
-{
- if (!task_on_cpu(rq, p) &&
- cpumask_test_cpu(cpu, &p->cpus_mask))
- return 1;
-
- return 0;
-}
-
/*
* Return the highest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise
@@ -1813,7 +1804,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
return NULL;
plist_for_each_entry(p, head, pushable_tasks) {
- if (pick_rt_task(rq, p, cpu))
+ if (task_is_pushable(rq, p, cpu) == 1)
return p;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2d41ebe200c7..e46f69ba9ba2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3490,6 +3490,16 @@ void move_queued_task_locked(struct rq *rq, struct rq *dst_rq, struct task_struc
set_task_cpu(task, dst_rq->cpu);
activate_task(dst_rq, task, 0);
}
+
+static inline
+int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
+{
+ if (!task_on_cpu(rq, p) &&
+ cpumask_test_cpu(cpu, &p->cpus_mask))
+ return 1;
+
+ return 0;
+}
#endif
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.45.0.rc1.225.g2a3ae87e7f-goog
From: Juri Lelli <[email protected]>
mutex::wait_lock might be nested under rq->lock.
Make it irq safe then.
Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Tested-by: K Prateek Nayak <[email protected]>
Tested-by: Metin Kaya <[email protected]>
Reviewed-by: Metin Kaya <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
Signed-off-by: Connor O'Brien <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
v3:
* Re-added this patch after it was dropped in v2 which
caused lockdep warnings to trip.
v7:
* Fix function definition for PREEMPT_RT case, as pointed out
by Metin Kaya.
* Fix incorrect flags handling in PREEMPT_RT case as found by
Metin Kaya
---
kernel/locking/mutex.c | 18 ++++++++++--------
kernel/locking/ww_mutex.h | 22 +++++++++++-----------
2 files changed, 21 insertions(+), 19 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4269da1f3ef5..6d843a0978a5 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -578,6 +578,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
+ unsigned long flags;
int ret;
if (!use_ww_ctx)
@@ -620,7 +621,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -681,7 +682,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err;
}
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* Make sure we do wakeups before calling schedule */
if (!wake_q_empty(&wake_q)) {
wake_up_q(&wake_q);
@@ -707,9 +708,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
trace_contention_begin(lock, LCB_F_MUTEX);
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
__set_current_state(TASK_RUNNING);
@@ -735,7 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
if (ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
return 0;
@@ -745,7 +746,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
trace_contention_end(lock, ret);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
wake_up_q(&wake_q);
@@ -916,6 +917,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
struct task_struct *next = NULL;
DEFINE_WAKE_Q(wake_q);
unsigned long owner;
+ unsigned long flags;
mutex_release(&lock->dep_map, ip);
@@ -942,7 +944,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
@@ -960,7 +962,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
__mutex_handoff(lock, next);
preempt_disable();
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 7189c6631d90..9facc0ddfdd3 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
}
-static inline void lock_wait_lock(struct mutex *lock)
+static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, *flags);
}
-static inline void unlock_wait_lock(struct mutex *lock)
+static inline void unlock_wait_lock(struct mutex *lock, unsigned long *flags)
{
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
}
static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
@@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
return rt_mutex_has_waiters(&lock->rtmutex);
}
-static inline void lock_wait_lock(struct rt_mutex *lock)
+static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{
- raw_spin_lock(&lock->rtmutex.wait_lock);
+ raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
}
-static inline void unlock_wait_lock(struct rt_mutex *lock)
+static inline void unlock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{
- raw_spin_unlock(&lock->rtmutex.wait_lock);
+ raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, *flags);
}
static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
@@ -380,6 +380,7 @@ static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
DEFINE_WAKE_Q(wake_q);
+ unsigned long flags;
ww_mutex_lock_acquired(lock, ctx);
@@ -408,10 +409,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* Uh oh, we raced in fastpath, check if any of the waiters need to
* die or wound us.
*/
- lock_wait_lock(&lock->base);
+ lock_wait_lock(&lock->base, &flags);
__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
- unlock_wait_lock(&lock->base);
-
+ unlock_wait_lock(&lock->base, &flags);
wake_up_q(&wake_q);
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
From: Connor O'Brien <[email protected]>
Switch logic that deactivates, sets the task cpu,
and reactivates a task on a different rq to use a
helper that will be later extended to push entire
blocked task chains.
This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.
Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Tested-by: K Prateek Nayak <[email protected]>
Tested-by: Metin Kaya <[email protected]>
Reviewed-by: Metin Kaya <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: split out from larger chain migration patch]
Signed-off-by: John Stultz <[email protected]>
---
v8:
* Renamed from push_task_chain to do_push_task so it makes
more sense without proxy-execution
v10:
* Changed name to move_queued_task_locked as suggested by Valentin
---
kernel/sched/core.c | 4 +---
kernel/sched/deadline.c | 8 ++------
kernel/sched/rt.c | 8 ++------
kernel/sched/sched.h | 11 +++++++++++
4 files changed, 16 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7019a40457a6..48f0d4b381d5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2712,9 +2712,7 @@ int push_cpu_stop(void *arg)
// XXX validate p is still the highest prio task
if (task_rq(p) == rq) {
- deactivate_task(rq, p, 0);
- set_task_cpu(p, lowest_rq->cpu);
- activate_task(lowest_rq, p, 0);
+ move_queued_task_locked(rq, lowest_rq, p);
resched_curr(lowest_rq);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a04a436af8cc..eaedc69c5e30 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2443,9 +2443,7 @@ static int push_dl_task(struct rq *rq)
goto retry;
}
- deactivate_task(rq, next_task, 0);
- set_task_cpu(next_task, later_rq->cpu);
- activate_task(later_rq, next_task, 0);
+ move_queued_task_locked(rq, later_rq, next_task);
ret = 1;
resched_curr(later_rq);
@@ -2531,9 +2529,7 @@ static void pull_dl_task(struct rq *this_rq)
if (is_migration_disabled(p)) {
push_task = get_push_task(src_rq);
} else {
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ move_queued_task_locked(src_rq, this_rq, p);
dmin = p->dl.deadline;
resched = true;
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..975cb49a64dc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2106,9 +2106,7 @@ static int push_rt_task(struct rq *rq, bool pull)
goto retry;
}
- deactivate_task(rq, next_task, 0);
- set_task_cpu(next_task, lowest_rq->cpu);
- activate_task(lowest_rq, next_task, 0);
+ move_queued_task_locked(rq, lowest_rq, next_task);
resched_curr(lowest_rq);
ret = 1;
@@ -2379,9 +2377,7 @@ static void pull_rt_task(struct rq *this_rq)
if (is_migration_disabled(p)) {
push_task = get_push_task(src_rq);
} else {
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ move_queued_task_locked(src_rq, this_rq, p);
resched = true;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae50f212775e..2d41ebe200c7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3480,5 +3480,16 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+#ifdef CONFIG_SMP
+static inline
+void move_queued_task_locked(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
+{
+ lockdep_assert_rq_held(rq);
+ lockdep_assert_rq_held(dst_rq);
+ deactivate_task(rq, task, 0);
+ set_task_cpu(task, dst_rq->cpu);
+ activate_task(dst_rq, task, 0);
+}
+#endif
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.45.0.rc1.225.g2a3ae87e7f-goog
As we're going to re-use the deactivation logic,
split it into a helper.
Cc: Joel Fernandes <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Zimuzo Ezeozue <[email protected]>
Cc: Youssef Esmat <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Metin Kaya <[email protected]>
Cc: Xuewen Yan <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Tested-by: K Prateek Nayak <[email protected]>
Tested-by: Metin Kaya <[email protected]>
Reviewed-by: Metin Kaya <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
v6:
* Define function as static to avoid "no previous prototype"
warnings as Reported-by: kernel test robot <[email protected]>
v7:
* Rename state task_state to be more clear, as suggested by
Metin Kaya
---
kernel/sched/core.c | 72 +++++++++++++++++++++++++++------------------
1 file changed, 43 insertions(+), 29 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48f0d4b381d5..8bc5844ebab9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6572,6 +6572,48 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
# define SM_MASK_PREEMPT SM_PREEMPT
#endif
+/*
+ * Helper function for __schedule()
+ *
+ * If a task does not have signals pending, deactivate it and return true
+ * Otherwise marks the task's __state as RUNNING and returns false
+ */
+static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
+ unsigned long task_state)
+{
+ if (signal_pending_state(task_state, p)) {
+ WRITE_ONCE(p->__state, TASK_RUNNING);
+ } else {
+ p->sched_contributes_to_load =
+ (task_state & TASK_UNINTERRUPTIBLE) &&
+ !(task_state & TASK_NOLOAD) &&
+ !(task_state & TASK_FROZEN);
+
+ if (p->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (p->on_rq && ...)
+ * if (prev_state) goto out;
+ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
+ * p->state = TASK_WAKING
+ *
+ * Where __schedule() and ttwu() have matching control dependencies.
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
+ deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+ if (p->in_iowait) {
+ atomic_inc(&rq->nr_iowait);
+ delayacct_blkio_start();
+ }
+ return true;
+ }
+ return false;
+}
+
/*
* __schedule() is the main scheduler function.
*
@@ -6665,35 +6707,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
- if (signal_pending_state(prev_state, prev)) {
- WRITE_ONCE(prev->__state, TASK_RUNNING);
- } else {
- prev->sched_contributes_to_load =
- (prev_state & TASK_UNINTERRUPTIBLE) &&
- !(prev_state & TASK_NOLOAD) &&
- !(prev_state & TASK_FROZEN);
-
- if (prev->sched_contributes_to_load)
- rq->nr_uninterruptible++;
-
- /*
- * __schedule() ttwu()
- * prev_state = prev->state; if (p->on_rq && ...)
- * if (prev_state) goto out;
- * p->on_rq = 0; smp_acquire__after_ctrl_dep();
- * p->state = TASK_WAKING
- *
- * Where __schedule() and ttwu() have matching control dependencies.
- *
- * After this, schedule() must not care about p->state any more.
- */
- deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
- if (prev->in_iowait) {
- atomic_inc(&rq->nr_iowait);
- delayacct_blkio_start();
- }
- }
+ try_to_deactivate_task(rq, prev, prev_state);
switch_count = &prev->nvcsw;
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
On 05/06/24 21:54, John Stultz wrote:
> From: Connor O'Brien <[email protected]>
>
> Switch logic that deactivates, sets the task cpu,
> and reactivates a task on a different rq to use a
> helper that will be later extended to push entire
> blocked task chains.
>
> This patch was broken out from a larger chain migration
> patch originally by Connor O'Brien.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Tested-by: K Prateek Nayak <[email protected]>
> Tested-by: Metin Kaya <[email protected]>
> Reviewed-by: Metin Kaya <[email protected]>
> Reviewed-by: Valentin Schneider <[email protected]>
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: split out from larger chain migration patch]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v8:
> * Renamed from push_task_chain to do_push_task so it makes
> more sense without proxy-execution
> v10:
> * Changed name to move_queued_task_locked as suggested by Valentin
> ---
> kernel/sched/core.c | 4 +---
> kernel/sched/deadline.c | 8 ++------
> kernel/sched/rt.c | 8 ++------
> kernel/sched/sched.h | 11 +++++++++++
> 4 files changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7019a40457a6..48f0d4b381d5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2712,9 +2712,7 @@ int push_cpu_stop(void *arg)
>
> // XXX validate p is still the highest prio task
> if (task_rq(p) == rq) {
> - deactivate_task(rq, p, 0);
> - set_task_cpu(p, lowest_rq->cpu);
> - activate_task(lowest_rq, p, 0);
> + move_queued_task_locked(rq, lowest_rq, p);
> resched_curr(lowest_rq);
> }
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a04a436af8cc..eaedc69c5e30 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2443,9 +2443,7 @@ static int push_dl_task(struct rq *rq)
> goto retry;
> }
>
> - deactivate_task(rq, next_task, 0);
> - set_task_cpu(next_task, later_rq->cpu);
> - activate_task(later_rq, next_task, 0);
> + move_queued_task_locked(rq, later_rq, next_task);
> ret = 1;
>
> resched_curr(later_rq);
> @@ -2531,9 +2529,7 @@ static void pull_dl_task(struct rq *this_rq)
> if (is_migration_disabled(p)) {
> push_task = get_push_task(src_rq);
> } else {
> - deactivate_task(src_rq, p, 0);
> - set_task_cpu(p, this_cpu);
> - activate_task(this_rq, p, 0);
> + move_queued_task_locked(src_rq, this_rq, p);
> dmin = p->dl.deadline;
> resched = true;
> }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 3261b067b67e..975cb49a64dc 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2106,9 +2106,7 @@ static int push_rt_task(struct rq *rq, bool pull)
> goto retry;
> }
>
> - deactivate_task(rq, next_task, 0);
> - set_task_cpu(next_task, lowest_rq->cpu);
> - activate_task(lowest_rq, next_task, 0);
> + move_queued_task_locked(rq, lowest_rq, next_task);
> resched_curr(lowest_rq);
> ret = 1;
>
> @@ -2379,9 +2377,7 @@ static void pull_rt_task(struct rq *this_rq)
> if (is_migration_disabled(p)) {
> push_task = get_push_task(src_rq);
> } else {
> - deactivate_task(src_rq, p, 0);
> - set_task_cpu(p, this_cpu);
> - activate_task(this_rq, p, 0);
> + move_queued_task_locked(src_rq, this_rq, p);
> resched = true;
> }
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ae50f212775e..2d41ebe200c7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3480,5 +3480,16 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
>
> extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
> +#ifdef CONFIG_SMP
> +static inline
> +void move_queued_task_locked(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +{
> + lockdep_assert_rq_held(rq);
> + lockdep_assert_rq_held(dst_rq);
> + deactivate_task(rq, task, 0);
> + set_task_cpu(task, dst_rq->cpu);
> + activate_task(dst_rq, task, 0);
> +}
> +#endif
I see this pattern in __migrate_swap_task() and try_steal_cookie(), should they
be converted too?
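For example, assuming the locals in __migrate_swap_task() are still named
src_rq/dst_rq, the conversion there would presumably just be (untested sketch,
not from the series):

	-	deactivate_task(src_rq, p, 0);
	-	set_task_cpu(p, cpu);
	-	activate_task(dst_rq, p, 0);
	+	move_queued_task_locked(src_rq, dst_rq, p);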
Besides this
Reviewed-by: Qais Yousef <[email protected]>
>
> #endif /* _KERNEL_SCHED_SCHED_H */
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
On 05/06/24 21:54, John Stultz wrote:
> From: Connor O'Brien <[email protected]>
>
> This patch consolidates rt and deadline pick_*_task functions to
> a task_is_pushable() helper
>
> This patch was broken out from a larger chain migration
> patch originally by Connor O'Brien.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Tested-by: K Prateek Nayak <[email protected]>
> Tested-by: Metin Kaya <[email protected]>
> Reviewed-by: Metin Kaya <[email protected]>
> Reviewed-by: Valentin Schneider <[email protected]>
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: split out from larger chain migration patch,
> renamed helper function]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v7:
> * Split from chain migration patch
> * Renamed function
> ---
> kernel/sched/deadline.c | 10 +---------
> kernel/sched/rt.c | 11 +----------
> kernel/sched/sched.h | 10 ++++++++++
> 3 files changed, 12 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index eaedc69c5e30..ae583a427539 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2179,14 +2179,6 @@ static void task_fork_dl(struct task_struct *p)
> /* Only try algorithms three times */
> #define DL_MAX_TRIES 3
>
> -static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
> -{
> - if (!task_on_cpu(rq, p) &&
> - cpumask_test_cpu(cpu, &p->cpus_mask))
> - return 1;
> - return 0;
> -}
> -
> /*
> * Return the earliest pushable rq's task, which is suitable to be executed
> * on the CPU, NULL otherwise:
> @@ -2205,7 +2197,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
> if (next_node) {
> p = __node_2_pdl(next_node);
>
> - if (pick_dl_task(rq, p, cpu))
> + if (task_is_pushable(rq, p, cpu) == 1)
Any reason we're checking specifically for == 1? Could the function later
return something other than 0 or 1?
It's worth explaining in the commit message and in the function docs if 1 could
end up meaning something else. Otherwise let's make it a bool and not do an
explicit check on the return value.
With this
Reviewed-by: Qais Yousef <[email protected]>
> return p;
>
> next_node = rb_next(next_node);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 975cb49a64dc..8b6fb77e095b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1791,15 +1791,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> /* Only try algorithms three times */
> #define RT_MAX_TRIES 3
>
> -static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
> -{
> - if (!task_on_cpu(rq, p) &&
> - cpumask_test_cpu(cpu, &p->cpus_mask))
> - return 1;
> -
> - return 0;
> -}
> -
> /*
> * Return the highest pushable rq's task, which is suitable to be executed
> * on the CPU, NULL otherwise
> @@ -1813,7 +1804,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
> return NULL;
>
> plist_for_each_entry(p, head, pushable_tasks) {
> - if (pick_rt_task(rq, p, cpu))
> + if (task_is_pushable(rq, p, cpu) == 1)
> return p;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2d41ebe200c7..e46f69ba9ba2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3490,6 +3490,16 @@ void move_queued_task_locked(struct rq *rq, struct rq *dst_rq, struct task_struc
> set_task_cpu(task, dst_rq->cpu);
> activate_task(dst_rq, task, 0);
> }
> +
> +static inline
> +int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> +{
> + if (!task_on_cpu(rq, p) &&
> + cpumask_test_cpu(cpu, &p->cpus_mask))
> + return 1;
> +
> + return 0;
> +}
> #endif
>
> #endif /* _KERNEL_SCHED_SCHED_H */
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
On 05/06/24 21:54, John Stultz wrote:
> As we're going to re-use the deactivation logic,
> split it into a helper.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Tested-by: K Prateek Nayak <[email protected]>
> Tested-by: Metin Kaya <[email protected]>
> Reviewed-by: Metin Kaya <[email protected]>
> Signed-off-by: John Stultz <[email protected]>
> ---
> v6:
> * Define function as static to avoid "no previous prototype"
> warnings as Reported-by: kernel test robot <[email protected]>
> v7:
> * Rename state task_state to be more clear, as suggested by
> Metin Kaya
> ---
> kernel/sched/core.c | 72 +++++++++++++++++++++++++++------------------
> 1 file changed, 43 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 48f0d4b381d5..8bc5844ebab9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6572,6 +6572,48 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> # define SM_MASK_PREEMPT SM_PREEMPT
> #endif
>
> +/*
> + * Helper function for __schedule()
> + *
> + * If a task does not have signals pending, deactivate it and return true
> + * Otherwise marks the task's __state as RUNNING and returns false
> + */
> +static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
> + unsigned long task_state)
> +{
> + if (signal_pending_state(task_state, p)) {
> + WRITE_ONCE(p->__state, TASK_RUNNING);
We can avoid extra indentation for the other (lengthy) leg if we return here?
The return value is ignored for now. I don't mind keeping it, but it's better to
call it out in the commit message, or when you add the new user later you can
update the signature more easily.
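I.e. something along these lines (same body as in your patch, just flattened,
untested):

	if (signal_pending_state(task_state, p)) {
		WRITE_ONCE(p->__state, TASK_RUNNING);
		return false;
	}

	p->sched_contributes_to_load = ...;
	...
	deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
	...
	return true;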
Generally I think patches 4, 5 and 6 could be sent as their own series with
minor commit message tweaks to make them more independent, and I hope Ingo and
Peter are okay to pick them up as they look like a nice cleanup in general.
Anyway:
Reviewed-by: Qais Yousef <[email protected]>
> + } else {
> + p->sched_contributes_to_load =
> + (task_state & TASK_UNINTERRUPTIBLE) &&
> + !(task_state & TASK_NOLOAD) &&
> + !(task_state & TASK_FROZEN);
> +
> + if (p->sched_contributes_to_load)
> + rq->nr_uninterruptible++;
> +
> + /*
> + * __schedule() ttwu()
> + * prev_state = prev->state; if (p->on_rq && ...)
> + * if (prev_state) goto out;
> + * p->on_rq = 0; smp_acquire__after_ctrl_dep();
> + * p->state = TASK_WAKING
> + *
> + * Where __schedule() and ttwu() have matching control dependencies.
> + *
> + * After this, schedule() must not care about p->state any more.
> + */
> + deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> +
> + if (p->in_iowait) {
> + atomic_inc(&rq->nr_iowait);
> + delayacct_blkio_start();
> + }
> + return true;
> + }
> + return false;
> +}
> +
> /*
> * __schedule() is the main scheduler function.
> *
> @@ -6665,35 +6707,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> */
> prev_state = READ_ONCE(prev->__state);
> if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
> - if (signal_pending_state(prev_state, prev)) {
> - WRITE_ONCE(prev->__state, TASK_RUNNING);
> - } else {
> - prev->sched_contributes_to_load =
> - (prev_state & TASK_UNINTERRUPTIBLE) &&
> - !(prev_state & TASK_NOLOAD) &&
> - !(prev_state & TASK_FROZEN);
> -
> - if (prev->sched_contributes_to_load)
> - rq->nr_uninterruptible++;
> -
> - /*
> - * __schedule() ttwu()
> - * prev_state = prev->state; if (p->on_rq && ...)
> - * if (prev_state) goto out;
> - * p->on_rq = 0; smp_acquire__after_ctrl_dep();
> - * p->state = TASK_WAKING
> - *
> - * Where __schedule() and ttwu() have matching control dependencies.
> - *
> - * After this, schedule() must not care about p->state any more.
> - */
> - deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> -
> - if (prev->in_iowait) {
> - atomic_inc(&rq->nr_iowait);
> - delayacct_blkio_start();
> - }
> - }
> + try_to_deactivate_task(rq, prev, prev_state);
> switch_count = &prev->nvcsw;
> }
>
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
On 04/06/2024 2:29 pm, Qais Yousef wrote:
> On 05/06/24 21:54, John Stultz wrote:
>> As we're going to re-use the deactivation logic,
>> split it into a helper.
>>
>> Cc: Joel Fernandes <[email protected]>
>> Cc: Qais Yousef <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Cc: Dietmar Eggemann <[email protected]>
>> Cc: Valentin Schneider <[email protected]>
>> Cc: Steven Rostedt <[email protected]>
>> Cc: Ben Segall <[email protected]>
>> Cc: Zimuzo Ezeozue <[email protected]>
>> Cc: Youssef Esmat <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Daniel Bristot de Oliveira <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Waiman Long <[email protected]>
>> Cc: Boqun Feng <[email protected]>
>> Cc: "Paul E. McKenney" <[email protected]>
>> Cc: Metin Kaya <[email protected]>
>> Cc: Xuewen Yan <[email protected]>
>> Cc: K Prateek Nayak <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: [email protected]
>> Tested-by: K Prateek Nayak <[email protected]>
>> Tested-by: Metin Kaya <[email protected]>
>> Reviewed-by: Metin Kaya <[email protected]>
>> Signed-off-by: John Stultz <[email protected]>
>> ---
>> v6:
>> * Define function as static to avoid "no previous prototype"
>> warnings as Reported-by: kernel test robot <[email protected]>
>> v7:
>> * Rename state task_state to be more clear, as suggested by
>> Metin Kaya
>> ---
>> kernel/sched/core.c | 72 +++++++++++++++++++++++++++------------------
>> 1 file changed, 43 insertions(+), 29 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 48f0d4b381d5..8bc5844ebab9 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6572,6 +6572,48 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>> # define SM_MASK_PREEMPT SM_PREEMPT
>> #endif
>>
>> +/*
>> + * Helper function for __schedule()
>> + *
>> + * If a task does not have signals pending, deactivate it and return true
>> + * Otherwise marks the task's __state as RUNNING and returns false
>> + */
>> +static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
>> + unsigned long task_state)
>> +{
>> + if (signal_pending_state(task_state, p)) {
>> + WRITE_ONCE(p->__state, TASK_RUNNING);
>
> We can avoid extra indentation for the other (lengthy) leg if we return here?
>
> The return value is ignored for now. I don't mind keeping it, but it's better to
> call it out in the commit message, or when you add the new user later you can
> update the signature more easily.
I see the return value is unused even after the complete patch set. So,
agreed with your proposal.
>
> Generally I think patches 4, 5 and 6 could be sent as their own series with
> minor commit message tweaks to make them more independent, and I hope Ingo and
> Peter are okay to pick them up as they look like a nice cleanup in general.
>
> Anyway:
>
> Reviewed-by: Qais Yousef <[email protected]>
>
>> + } else {
>> + p->sched_contributes_to_load =
>> + (task_state & TASK_UNINTERRUPTIBLE) &&
>> + !(task_state & TASK_NOLOAD) &&
>> + !(task_state & TASK_FROZEN);
>> +
>> + if (p->sched_contributes_to_load)
>> + rq->nr_uninterruptible++;
>> +
>> + /*
>> + * __schedule() ttwu()
>> + * prev_state = prev->state; if (p->on_rq && ...)
>> + * if (prev_state) goto out;
>> + * p->on_rq = 0; smp_acquire__after_ctrl_dep();
>> + * p->state = TASK_WAKING
>> + *
>> + * Where __schedule() and ttwu() have matching control dependencies.
>> + *
>> + * After this, schedule() must not care about p->state any more.
>> + */
>> + deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
>> +
>> + if (p->in_iowait) {
>> + atomic_inc(&rq->nr_iowait);
>> + delayacct_blkio_start();
>> + }
>> + return true;
>> + }
>> + return false;
>> +}
>> +
>> /*
>> * __schedule() is the main scheduler function.
>> *
>> @@ -6665,35 +6707,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>> */
>> prev_state = READ_ONCE(prev->__state);
>> if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
>> - if (signal_pending_state(prev_state, prev)) {
>> - WRITE_ONCE(prev->__state, TASK_RUNNING);
>> - } else {
>> - prev->sched_contributes_to_load =
>> - (prev_state & TASK_UNINTERRUPTIBLE) &&
>> - !(prev_state & TASK_NOLOAD) &&
>> - !(prev_state & TASK_FROZEN);
>> -
>> - if (prev->sched_contributes_to_load)
>> - rq->nr_uninterruptible++;
>> -
>> - /*
>> - * __schedule() ttwu()
>> - * prev_state = prev->state; if (p->on_rq && ...)
>> - * if (prev_state) goto out;
>> - * p->on_rq = 0; smp_acquire__after_ctrl_dep();
>> - * p->state = TASK_WAKING
>> - *
>> - * Where __schedule() and ttwu() have matching control dependencies.
>> - *
>> - * After this, schedule() must not care about p->state any more.
>> - */
>> - deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
>> -
>> - if (prev->in_iowait) {
>> - atomic_inc(&rq->nr_iowait);
>> - delayacct_blkio_start();
>> - }
>> - }
>> + try_to_deactivate_task(rq, prev, prev_state);
>> switch_count = &prev->nvcsw;
>> }
>>
>> --
>> 2.45.0.rc1.225.g2a3ae87e7f-goog
>>
On 05/06/24 21:54, John Stultz wrote:
> From: Juri Lelli <[email protected]>
>
> mutex::wait_lock might be nested under rq->lock.
>
> Make it irq safe then.
>
> Cc: Joel Fernandes <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Zimuzo Ezeozue <[email protected]>
> Cc: Youssef Esmat <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Metin Kaya <[email protected]>
> Cc: Xuewen Yan <[email protected]>
> Cc: K Prateek Nayak <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Tested-by: K Prateek Nayak <[email protected]>
> Tested-by: Metin Kaya <[email protected]>
> Reviewed-by: Metin Kaya <[email protected]>
> Reviewed-by: Valentin Schneider <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> [rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
> Signed-off-by: Connor O'Brien <[email protected]>
> Signed-off-by: John Stultz <[email protected]>
> ---
> v3:
> * Re-added this patch after it was dropped in v2 which
> caused lockdep warnings to trip.
> v7:
> * Fix function definition for PREEMPT_RT case, as pointed out
> by Metin Kaya.
> * Fix incorrect flags handling in PREEMPT_RT case as found by
> Metin Kaya
> ---
> kernel/locking/mutex.c | 18 ++++++++++--------
> kernel/locking/ww_mutex.h | 22 +++++++++++-----------
> 2 files changed, 21 insertions(+), 19 deletions(-)
Are locking folks okay with patches 1 and 2? Is there any concern or side effect
that would keep them from being picked up? It'd be nice if they could get merged
and soak in linux-next. Reducing the number of patches to help this series make
progress would be appreciated - if there are no major concerns of course. Some
feedback would be very helpful either way.
Thanks!
--
Qais Yousef
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 4269da1f3ef5..6d843a0978a5 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -578,6 +578,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> DEFINE_WAKE_Q(wake_q);
> struct mutex_waiter waiter;
> struct ww_mutex *ww;
> + unsigned long flags;
> int ret;
>
> if (!use_ww_ctx)
> @@ -620,7 +621,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> return 0;
> }
>
> - raw_spin_lock(&lock->wait_lock);
> + raw_spin_lock_irqsave(&lock->wait_lock, flags);
> /*
> * After waiting to acquire the wait_lock, try again.
> */
> @@ -681,7 +682,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> goto err;
> }
>
> - raw_spin_unlock(&lock->wait_lock);
> + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> /* Make sure we do wakeups before calling schedule */
> if (!wake_q_empty(&wake_q)) {
> wake_up_q(&wake_q);
> @@ -707,9 +708,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> trace_contention_begin(lock, LCB_F_MUTEX);
> }
>
> - raw_spin_lock(&lock->wait_lock);
> + raw_spin_lock_irqsave(&lock->wait_lock, flags);
> }
> - raw_spin_lock(&lock->wait_lock);
> + raw_spin_lock_irqsave(&lock->wait_lock, flags);
> acquired:
> __set_current_state(TASK_RUNNING);
>
> @@ -735,7 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> if (ww_ctx)
> ww_mutex_lock_acquired(ww, ww_ctx);
>
> - raw_spin_unlock(&lock->wait_lock);
> + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> wake_up_q(&wake_q);
> preempt_enable();
> return 0;
> @@ -745,7 +746,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> __mutex_remove_waiter(lock, &waiter);
> err_early_kill:
> trace_contention_end(lock, ret);
> - raw_spin_unlock(&lock->wait_lock);
> + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> debug_mutex_free_waiter(&waiter);
> mutex_release(&lock->dep_map, ip);
> wake_up_q(&wake_q);
> @@ -916,6 +917,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> struct task_struct *next = NULL;
> DEFINE_WAKE_Q(wake_q);
> unsigned long owner;
> + unsigned long flags;
>
> mutex_release(&lock->dep_map, ip);
>
> @@ -942,7 +944,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> }
> }
>
> - raw_spin_lock(&lock->wait_lock);
> + raw_spin_lock_irqsave(&lock->wait_lock, flags);
> debug_mutex_unlock(lock);
> if (!list_empty(&lock->wait_list)) {
> /* get the first entry from the wait-list: */
> @@ -960,7 +962,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> __mutex_handoff(lock, next);
>
> preempt_disable();
> - raw_spin_unlock(&lock->wait_lock);
> + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> wake_up_q(&wake_q);
> preempt_enable();
> }
> diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> index 7189c6631d90..9facc0ddfdd3 100644
> --- a/kernel/locking/ww_mutex.h
> +++ b/kernel/locking/ww_mutex.h
> @@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
> return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
> }
>
> -static inline void lock_wait_lock(struct mutex *lock)
> +static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
> {
> - raw_spin_lock(&lock->wait_lock);
> + raw_spin_lock_irqsave(&lock->wait_lock, *flags);
> }
>
> -static inline void unlock_wait_lock(struct mutex *lock)
> +static inline void unlock_wait_lock(struct mutex *lock, unsigned long *flags)
> {
> - raw_spin_unlock(&lock->wait_lock);
> + raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
> }
>
> static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
> @@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
> return rt_mutex_has_waiters(&lock->rtmutex);
> }
>
> -static inline void lock_wait_lock(struct rt_mutex *lock)
> +static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
> {
> - raw_spin_lock(&lock->rtmutex.wait_lock);
> + raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
> }
>
> -static inline void unlock_wait_lock(struct rt_mutex *lock)
> +static inline void unlock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
> {
> - raw_spin_unlock(&lock->rtmutex.wait_lock);
> + raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, *flags);
> }
>
> static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
> @@ -380,6 +380,7 @@ static __always_inline void
> ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
> {
> DEFINE_WAKE_Q(wake_q);
> + unsigned long flags;
>
> ww_mutex_lock_acquired(lock, ctx);
>
> @@ -408,10 +409,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
> * Uh oh, we raced in fastpath, check if any of the waiters need to
> * die or wound us.
> */
> - lock_wait_lock(&lock->base);
> + lock_wait_lock(&lock->base, &flags);
> __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
> - unlock_wait_lock(&lock->base);
> -
> + unlock_wait_lock(&lock->base, &flags);
> wake_up_q(&wake_q);
> }
>
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
On 05/06/24 21:54, John Stultz wrote:
> From: Peter Zijlstra <[email protected]>
>
> Let's define the scheduling context as all the scheduler state
> in task_struct for the task selected to run, and the execution
> context as all state required to actually run the task.
>
> Currently both are intertwined in task_struct. We want to
> logically split these such that we can use the scheduling
> context of the task selected to be scheduled, but use the
> execution context of a different task to actually be run.
>
> To this purpose, introduce rq_selected() macro to point to the
> task_struct selected from the runqueue by the scheduler, and
> will be used for scheduler state, and preserve rq->curr to
> indicate the execution context of the task that will actually be
> run.
This is a subjective opinion of course. But I am not a fan of rq_selected().
I think we are better off with the more explicit
rq->currs /* current sched context */
rq->currx /* current execution context */
and the helpers
task_curr_sctx()
task_curr_xctx()
This will ensure all current users of rq->curr will generate a build error and
be better audited for what they are supposed to be. And I think the code is more
readable this way.
Another way to do it is to split task_struct into sched and exec contexts
(rather than controlling the reference to curr as done here). Then curr would
always be the same, but the reference to its sched context would change with
inheritance.
ie:
struct task_sctx {
...
};
struct task_struct {
struct task_sctx *sctx;
/* exec context is embedded as-is */
...
};
Which means we would just have to update all accessors to the sched context to
do an extra dereference, which I am not sure is a problematic extra level of
indirection.
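E.g. (picking prio purely as an example attribute, just to show the shape):

	/* today */			/* with the split */
	p->prio				p->sctx->prio
	rq->curr->prio			rq->curr->sctx->prio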
I see the latter approach as more intuitive and explicit about what is sched
context that is supposed to be inherited and what is not.
I generally think breaking down the task structure would be good. Hopefully the
new data perf can help us make a better decision on what attributes to group
together. And generally this might be a good exercise to rethink its content,
which has grown organically over the years. Maybe we want to create similar
smaller wrapper structs for other fields too.
I **think** with this approach we would just need to go fix compilation errors
to do p->sctx->{attribute}.
Proxy exec later would only update the reference to this struct to enforce
inheritance for rq->curr->sctx to point to the donor's context.
On 04/06/2024 2:18 pm, Qais Yousef wrote:
> On 05/06/24 21:54, John Stultz wrote:
>> From: Connor O'Brien <[email protected]>
>>
>> This patch consolidates rt and deadline pick_*_task functions to
>> a task_is_pushable() helper
>>
>> This patch was broken out from a larger chain migration
>> patch originally by Connor O'Brien.
>>
>> Cc: Joel Fernandes <[email protected]>
>> Cc: Qais Yousef <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Cc: Dietmar Eggemann <[email protected]>
>> Cc: Valentin Schneider <[email protected]>
>> Cc: Steven Rostedt <[email protected]>
>> Cc: Ben Segall <[email protected]>
>> Cc: Zimuzo Ezeozue <[email protected]>
>> Cc: Youssef Esmat <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Daniel Bristot de Oliveira <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Waiman Long <[email protected]>
>> Cc: Boqun Feng <[email protected]>
>> Cc: "Paul E. McKenney" <[email protected]>
>> Cc: Metin Kaya <[email protected]>
>> Cc: Xuewen Yan <[email protected]>
>> Cc: K Prateek Nayak <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: [email protected]
>> Tested-by: K Prateek Nayak <[email protected]>
>> Tested-by: Metin Kaya <[email protected]>
>> Reviewed-by: Metin Kaya <[email protected]>
>> Reviewed-by: Valentin Schneider <[email protected]>
>> Signed-off-by: Connor O'Brien <[email protected]>
>> [jstultz: split out from larger chain migration patch,
>> renamed helper function]
>> Signed-off-by: John Stultz <[email protected]>
>> ---
>> v7:
>> * Split from chain migration patch
>> * Renamed function
>> ---
>> kernel/sched/deadline.c | 10 +---------
>> kernel/sched/rt.c | 11 +----------
>> kernel/sched/sched.h | 10 ++++++++++
>> 3 files changed, 12 insertions(+), 19 deletions(-)
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index eaedc69c5e30..ae583a427539 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2179,14 +2179,6 @@ static void task_fork_dl(struct task_struct *p)
>> /* Only try algorithms three times */
>> #define DL_MAX_TRIES 3
>>
>> -static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
>> -{
>> - if (!task_on_cpu(rq, p) &&
>> - cpumask_test_cpu(cpu, &p->cpus_mask))
>> - return 1;
>> - return 0;
>> -}
>> -
>> /*
>> * Return the earliest pushable rq's task, which is suitable to be executed
>> * on the CPU, NULL otherwise:
>> @@ -2205,7 +2197,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
>> if (next_node) {
>> p = __node_2_pdl(next_node);
>>
>> - if (pick_dl_task(rq, p, cpu))
>> + if (task_is_pushable(rq, p, cpu) == 1)
>
> Any reason we're checking specifically for == 1? Could the function later
> return something other than 0 or 1?
I hear you, but just letting you know that `task_is_pushable()` will
return different non-zero results per upcoming patches in the queue. The
commit "sched: Fix rt/dl load balancing via chain level balance" will
make the function look like below (excerpt from `kernel/sched/core.c`):
/*
* Returns:
* 1 if chain is pushable and affinity does not prevent pushing to cpu
* 0 if chain is unpushable
* -1 if chain is pushable but affinity blocks running on cpu.
*/
int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
>
> It's worth explaining in the commit message and in the function docs if 1 could
> end up meaning something else. Otherwise let's make it a bool and not do an
> explicit check on the return value.
>
> With this
>
> Reviewed-by: Qais Yousef <[email protected]>
>
>> return p;
>>
>> next_node = rb_next(next_node);
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 975cb49a64dc..8b6fb77e095b 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1791,15 +1791,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>> /* Only try algorithms three times */
>> #define RT_MAX_TRIES 3
>>
>> -static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
>> -{
>> - if (!task_on_cpu(rq, p) &&
>> - cpumask_test_cpu(cpu, &p->cpus_mask))
>> - return 1;
>> -
>> - return 0;
>> -}
>> -
>> /*
>> * Return the highest pushable rq's task, which is suitable to be executed
>> * on the CPU, NULL otherwise
>> @@ -1813,7 +1804,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
>> return NULL;
>>
>> plist_for_each_entry(p, head, pushable_tasks) {
>> - if (pick_rt_task(rq, p, cpu))
>> + if (task_is_pushable(rq, p, cpu) == 1)
>> return p;
>> }
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 2d41ebe200c7..e46f69ba9ba2 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -3490,6 +3490,16 @@ void move_queued_task_locked(struct rq *rq, struct rq *dst_rq, struct task_struc
>> set_task_cpu(task, dst_rq->cpu);
>> activate_task(dst_rq, task, 0);
>> }
>> +
>> +static inline
>> +int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
>> +{
>> + if (!task_on_cpu(rq, p) &&
>> + cpumask_test_cpu(cpu, &p->cpus_mask))
>> + return 1;
>> +
>> + return 0;
>> +}
>> #endif
>>
>> #endif /* _KERNEL_SCHED_SCHED_H */
>> --
>> 2.45.0.rc1.225.g2a3ae87e7f-goog
>>
On Tue, Jun 4, 2024 at 6:12 AM Qais Yousef <[email protected]> wrote:
> On 05/06/24 21:54, John Stultz wrote:
> > From: Connor O'Brien <[email protected]>
> >
> > Switch logic that deactivates, sets the task cpu,
> > and reactivates a task on a different rq to use a
> > helper that will be later extended to push entire
> > blocked task chains.
> >
> > This patch was broken out from a larger chain migration
> > patch originally by Connor O'Brien.
...
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ae50f212775e..2d41ebe200c7 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -3480,5 +3480,16 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
> >
> > extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> > extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
> > +#ifdef CONFIG_SMP
> > +static inline
> > +void move_queued_task_locked(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> > +{
> > + lockdep_assert_rq_held(rq);
> > + lockdep_assert_rq_held(dst_rq);
> > + deactivate_task(rq, task, 0);
> > + set_task_cpu(task, dst_rq->cpu);
> > + activate_task(dst_rq, task, 0);
> > +}
> > +#endif
>
> > I see this pattern in __migrate_swap_task() and try_steal_cookie(), should they
> > be converted too?
Ah! Thanks for pointing it out. I'll include that in the next iteration.
> Besides this
>
> Reviewed-by: Qais Yousef <[email protected]>
Thank you so much for the review and feedback!
-john
On Tue, Jun 4, 2024 at 7:11 AM Qais Yousef <[email protected]> wrote:
>
> On 05/06/24 21:54, John Stultz wrote:
> > From: Peter Zijlstra <[email protected]>
> >
> > Let's define the scheduling context as all the scheduler state
> > in task_struct for the task selected to run, and the execution
> > context as all state required to actually run the task.
> >
> > Currently both are intertwined in task_struct. We want to
> > logically split these such that we can use the scheduling
> > context of the task selected to be scheduled, but use the
> > execution context of a different task to actually be run.
> >
> > To this purpose, introduce rq_selected() macro to point to the
> > task_struct selected from the runqueue by the scheduler, and
> > will be used for scheduler state, and preserve rq->curr to
> > indicate the execution context of the task that will actually be
> > run.
>
> This is a subjective opinion of course. But I am not a fan of rq_selected().
> I think we are better off with the more explicit
>
> rq->currs /* current sched context */
> rq->currx /* current execution context */
Yeah. While I've got my own opinions, I'm very flexible on names/style
details if maintainers want it any specific way.
But just personally, this strikes me as easier to misread, and not as
descriptive as the "selected" vs the "executing" task.
> and the helpers
>
> task_curr_sctx()
> task_curr_xctx()
Though would rq_curr_sctx() and rq_curr_xctx() make more sense, as
it's a property of the runqueue (not the task)?
> This will ensure all current users of rq->curr will generate a build error and
> be better audited for what they are supposed to be. And I think the code is more
> readable this way.
So earlier I had changed it to: rq_curr() and rq_selected(), but Peter
complained about the rq_curr() macro making things indirect, so I
dropped it (though I kept the rq_selected() macro because it allowed
things to collapse down better if SCHED_PROXY_EXEC was not
configured). I do agree the macro (along with renaming the field)
helps ensure each use is considered properly, but I also want to align
with the feedback Peter gave previously.
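Roughly, the idea is that with SCHED_PROXY_EXEC disabled the macro just
evaluates to rq->curr, so the proxy-specific bits compile away. Something like
(the field name here is made up for illustration, not the exact one in the
series):

	#ifdef CONFIG_SCHED_PROXY_EXEC
	# define rq_selected(rq)	((rq)->curr_selected)
	#else
	# define rq_selected(rq)	((rq)->curr)
	#endif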
> Another way to do it is to split task_struct into sched and exec contexts
> (rather than controlling the reference to curr as done here). Then curr would
> always be the same, but the reference to its sched context would change with
> inheritance.
>
> ie:
>
> struct task_sctx {
> ...
> };
>
> struct task_struct {
> struct task_sctx *sctx;
> /* exec context is embedded as-is */
> ...
> };
>
> Which means we would just have to update all accessors to the sched context to
> do an extra dereference, which I am not sure is a problematic extra level of
> indirection.
>
> I see the latter approach as more intuitive and explicit about what is sched
> context that is supposed to be inherited and what is not.
Hrm. So I think I see what you're saying. Particularly with the logic
where we sometimes care about the execution context and sometimes care
about the scheduler context, I can understand wanting to better define
that separation with structures. One thing to note from your example
above: for unblocked tasks, the scheduler and execution contexts can be
the same. So you'd likely want the task_struct to not just have a
reference to the scheduler context, but its own internal scheduler
context structure as well.
That said, I'm not sure if adding the current scheduler context
reference into the task struct as you propose above is the right
approach. Particularly because while this is a generalization of
priority inheritance, it isn't actually implemented in the same way as
priority inheritance. We don't raise and lower the lock owner task's
priority. Instead we pick a task from the rq, and if it's blocked, we
traverse the blocked_on chain to find the runnable owner. So the two
are really properties of the rq and not really part of the task. And
remember there can be multiple blocked waiters on a mutex, and any of
them can be used to run the lock owner. So one would have to be
setting the owner's scheduler context reference on each selection,
which seems a bit noisy (I also will have to think about the
serialization as you may need to hold the rq lock to keep the donor's
sched context in place (as you're crossing task to task), which might be
an unintuitive/unexpected dependency).
Instead as the two contexts are really attributes of the runqueue, it
would seem like one could have what is currently rq_selected() just
return the struct task_sctx from the selected task (instead of an
entire task), but that might also cause some annoyance for things like
debugging (as having both selected and current task's comm and pid
available together is very useful for understanding what's going on).
> I generally think breaking down the task structure would be good. Hopefully the
> new data perf can help us make a better decision on what attributes to group
> together. And generally this might be a good exercise to rethink its content,
> which has grown organically over the years. Maybe we want to create similar
> smaller wrapper structs for other fields too.
>
> I **think** with this approach we would just need to go fix compilation errors
> to do p->sctx->{attribute}.
>
> Proxy exec later would only update the reference to this struct to enforce
> inheritance for rq->curr->sctx to point to the donor's context.
I'll need to think about this a bit more (and hopefully others can
chime in on whether they see it as worth it). It would be a *lot* of churn
to tweak all the users to have the extra structure indirection, so I'm
hesitant to go running after this without wider agreement. But I do
see there would be some benefit to avoiding some of the confusion that
can come from having to deal with two full separate tasks in much of
the logic.
thanks
-john
On Tue, Jun 4, 2024 at 6:29 AM Qais Yousef <[email protected]> wrote:
>
> On 05/06/24 21:54, John Stultz wrote:
> > As we're going to re-use the deactivation logic,
> > split it into a helper.
> >
> > Cc: Joel Fernandes <[email protected]>
> > Cc: Qais Yousef <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Ben Segall <[email protected]>
> > Cc: Zimuzo Ezeozue <[email protected]>
> > Cc: Youssef Esmat <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > Cc: Waiman Long <[email protected]>
> > Cc: Boqun Feng <[email protected]>
> > Cc: "Paul E. McKenney" <[email protected]>
> > Cc: Metin Kaya <[email protected]>
> > Cc: Xuewen Yan <[email protected]>
> > Cc: K Prateek Nayak <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: [email protected]
> > Tested-by: K Prateek Nayak <[email protected]>
> > Tested-by: Metin Kaya <[email protected]>
> > Reviewed-by: Metin Kaya <[email protected]>
> > Signed-off-by: John Stultz <[email protected]>
> > ---
> > v6:
> > * Define function as static to avoid "no previous prototype"
> > warnings as Reported-by: kernel test robot <[email protected]>
> > v7:
> > * Rename state task_state to be more clear, as suggested by
> > Metin Kaya
> > ---
> > kernel/sched/core.c | 72 +++++++++++++++++++++++++++------------------
> > 1 file changed, 43 insertions(+), 29 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 48f0d4b381d5..8bc5844ebab9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6572,6 +6572,48 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > # define SM_MASK_PREEMPT SM_PREEMPT
> > #endif
> >
> > +/*
> > + * Helper function for __schedule()
> > + *
> > + * If a task does not have signals pending, deactivate it and return true
> > + * Otherwise marks the task's __state as RUNNING and returns false
> > + */
> > +static bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
> > + unsigned long task_state)
> > +{
> > + if (signal_pending_state(task_state, p)) {
> > + WRITE_ONCE(p->__state, TASK_RUNNING);
>
> We can avoid extra indentation for the other (lengthy) leg if we return here?
>
> The return value is ignored for now. I don't mind keeping it, but it's better to
> call it out in the commit message, or when you add the new user later you can
> update the signature more easily.
Ah. Good point on both counts. I've reworked it to use your
suggestions here.
> Generally I think patches 4, 5 and 6 could be sent as their own series with
> minor commit message tweaks to make them more independent, and I hope Ingo and
> Peter are okay to pick them up as they look like a nice cleanup in general.
>
> Anyway:
>
> Reviewed-by: Qais Yousef <[email protected]>
Thanks so much for the review!
-john
On 06/04/24 12:33, John Stultz wrote:
> On Tue, Jun 4, 2024 at 7:11 AM Qais Yousef <[email protected]> wrote:
> >
> > On 05/06/24 21:54, John Stultz wrote:
> > > From: Peter Zijlstra <[email protected]>
> > >
> > > Let's define the scheduling context as all the scheduler state
> > > in task_struct for the task selected to run, and the execution
> > > context as all state required to actually run the task.
> > >
> > > Currently both are intertwined in task_struct. We want to
> > > logically split these such that we can use the scheduling
> > > context of the task selected to be scheduled, but use the
> > > execution context of a different task to actually be run.
> > >
> > > To this purpose, introduce the rq_selected() macro to point to the
> > > task_struct selected from the runqueue by the scheduler, which will
> > > be used for scheduler state, and preserve rq->curr to indicate the
> > > execution context of the task that will actually be run.
> >
> > This is a subjective opinion of course, but I am not a fan of rq_selected().
> > I think we are better off with the more explicit
> >
> > rq->currs /* current sched context */
> > rq->currx /* current execution context */
>
> Yeah. While I've got my own opinions, I'm very flexible on names/style
> details if maintainers want it any specific way.
>
> But just personally, this strikes me as easier to misread, and not as
> descriptive as the "selected" vs the "executing" task.
rq_selected() makes me think we're returning an rq, not a task.
>
> > and the helpers
> >
> > task_curr_sctx()
> > task_curr_xctx()
>
> Though would rq_curr_sctx() and rq_curr_xctx() make more sense, as
> it's a property of the runqueue (not the task)?
Hmm, maybe they could be viewed as a property of the runqueue. But I think they
belong to the task. With the current suggestion we are keeping two task
references, one to get the execution context and the other to get the
scheduling context. But in reality we have one current task, not two. I think
the fact that we look at the current task should be a sign that it's a property
of the task, not the runqueue; otherwise the info should be stored in struct rq
directly.
IOW, the rq keeps track of the current task, but the current task might have
its scheduling properties upgraded due to inheritance. I don't see the latter
as a property of the rq. But the rq can help keep track of the two tasks to
make accessing the two properties (scheduling vs execution) easier.
>
> > This will ensure all current users of rq->curr generate a build error and can
> > be better audited for which context they are supposed to use. And I think the
> > code is more readable this way.
>
> So earlier I had changed it to: rq_curr() and rq_selected(), but Peter
> complained about the rq_curr() macro making things indirect, so I
> dropped it (though I kept the rq_selected() macro because it allowed
> things to collapse down better if SCHED_PROXY_EXEC was not
> configured). I do agree the macro (along with renaming the field)
> helps ensure each use is considered properly, but I also want to align
> with the feedback Peter gave previously.
Yes please, let's wait for the maintainers. I don't want to waste your time on
changes that might not end up being agreeable :-)
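(For reference, my understanding of the collapsing you describe is roughly the
below; I'm guessing at the field name used when SCHED_PROXY_EXEC is enabled:)

  #ifdef CONFIG_SCHED_PROXY_EXEC
  /* Proxy exec: the task selected by the scheduler has its own rq field */
  #define rq_selected(rq)         ((rq)->curr_selected)
  #else
  /* Without proxy exec, the selected task is simply rq->curr */
  #define rq_selected(rq)         ((rq)->curr)
  #endif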
>
> > Another way to do it is to split task_struct into sched and exec contexts
> > (rather than control the reference to curr as done here). Then curr would
> > always be the same, but the reference to its sched context would change with
> > inheritance.
> >
> > ie:
> >
> > struct task_sctx {
> > ...
> > };
> >
> > struct task_struct {
> > struct task_sctx *sctx;
> > /* exec context is embedded as-is */
> > ...
> > };
> >
> > Which means we would just have to update all accessors of the sched context to
> > do an extra dereference. I am not sure whether that extra level of indirection
> > would be problematic.
> >
> > I see the latter approach as more intuitive and explicit about what is sched
> > context that is supposed to be inherited and what is not.
>
> Hrm. So I think I see what you're saying. Particularly with the logic
> where we sometimes care about the execution context and sometimes care
> about the scheduler context, I can understand wanting to better define
> that separation with structures. One thing to note from your example
> above: for unblocked tasks, the scheduler and execution contexts can be
> the same. So you'd likely want the task_struct to not just have a
> reference to the scheduler context, but its own internal scheduler
> context structure as well.
>
> That said, I'm not sure if adding the current scheduler context
> reference into the task struct as you propose above is the right
> approach. Particularly because while this is a generalization of
> priority inheritance, it isn't actually implemented in the same way as
> priority inheritance. We don't raise and lower the lock owner task's
> priority. Instead we pick a task from the rq, and if it's blocked, we
> traverse the blocked_on chain to find the runnable owner. So the two
> are really properties of the rq and not really part of the task. And
> remember there can be multiple blocked waiters on a mutex, and any of
> them can be used to run the lock owner. So one would have to be
Hmm, I would have thought this should still work the same, but we'd end up
maintaining current's scheduling context. In my head, what would be different
is what gets updated: rq->currs/rq->currx or rq->curr->sctx.
But you've been looking at this for long enough now, and I could be missing
some (many) things for sure.
> setting the owner's scheduler context reference on each selection,
> which seems a bit noisy (I'll also have to think about the
> serialization, as you may need to hold the rq lock to keep the donor's
> sched context in place since you're crossing task to task, which might
> be an unintuitive/unexpected dependency).
I could be oversimplifying it in my head...
>
> Instead, as the two contexts are really attributes of the runqueue, it
I guess I am struggling to see how the contexts are attributes of the runqueue.
The current task, yes. But what scheduling context the task has is its own
property, the way I see it.
> would seem like one could have what is currently rq_selected() just
> return the struct task_sctx from the selected task (instead of an
> entire task), but that might also cause some annoyance for things like
> debugging (as having both the selected and current tasks' comm and pid
> available together is very useful for understanding what's going on).
>
> > I generally think breaking down the task structure would be good. Hopefully
> > the new data profiling in perf can help us make better decisions on what
> > attributes to group together. And generally this might be a good exercise to
> > rethink its content, which has grown organically over the years. Maybe we want
> > to create similar smaller wrapper structs for other fields too.
> >
> > I **think** with this approach we would just need to go fix compilation errors
> > to do p->sctx->{attribute}.
> >
> > Proxy exec would later only update the reference to this struct to enforce
> > inheritance, making rq->curr->sctx point to the donor's context.
>
> I'll need to think about this a bit more (and hopefully others can
> chime in on if they see it as worth it). It would be a *lot* of churn
> to tweak all the users to have the extra structure indirection, so I'm
> hesitant to go running after this without wider agreement. But I do
> see there would be some benefit to avoiding some of the confusion that
> can come from having to deal with two full separate tasks in much of
> the logic.
Don't change anything yet please :-) I could be missing a lot of details. But
I think for now it's good to know why one alternative could potentially be
better than the other, without any specific action. It's the type of detail
that can be changed at any time later, but I see these as properties of the
tasks, and that's what prompted me to enquire why it wasn't done that way
instead.
The proposed way does seem the fastest path to getting something working, so I
am not against it. But I had to ask whether it is the best thing longer term.
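To be a bit more concrete about the shape I have in mind (all names below are
placeholders, and this ignores the serialization questions you raised):

  /* Inheritable scheduling attributes, split out of task_struct */
  struct task_sctx {
          int                             prio;
          const struct sched_class        *sched_class;
          /* ... other scheduling-context fields ... */
  };

  struct task_struct {
          struct task_sctx        *sctx;          /* effective sched context */
          struct task_sctx        own_sctx;       /* the task's own context */
          /* execution context stays embedded as-is */
          /* ... */
  };

  /* Inheritance would then just be a pointer update on the lock owner */
  static inline void sctx_inherit(struct task_struct *owner,
                                  struct task_struct *donor)
  {
          owner->sctx = donor->sctx;
  }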
Thanks!
--
Qais Yousef