Hi all,
Proxy Execution (which also goes under several other names) isn't a new
concept; it has been mentioned to this community before, both in email
discussions and at conferences [1, 2], but no actual implementation that
applies to a fairly recent kernel exists as of today (none that I'm
aware of, at least - happy to be proven wrong).
Very broadly speaking (more info below), proxy execution enables a task
to run using the context of some other task that is "willing" to
participate in the mechanism; this helps both tasks improve performance
compared with the latter task not participating in proxy execution.
This RFD/proof of concept aims at starting a discussion about how we can
get proxy execution into mainline. But, first things first, why do we
even care about it?
I'm pretty confident in saying that the line of development most
interested in this at the moment is the one that might benefit from
allowing non-privileged processes to use deadline scheduling [3].
The main missing bit before we can safely relax the root-privileges
constraint is a proper priority inheritance mechanism, which for
deadline scheduling translates to bandwidth inheritance [4, 5] - or,
put differently, to the concept of running a task that holds a
(rt_)mutex within the bandwidth allotment of some other task that is
blocked on the same (rt_)mutex.
The concept itself is pretty general, however, and it is not hard to
foresee possible applications in other scenarios (say, for example,
nice values/shares across co-operating CFS tasks, or clamping values [6]).
But I'm already digressing, so let's get back to the code that comes
with this cover letter.
One can define the scheduling context of a task as all the information
in task_struct that the scheduler needs to implement a policy, and the
execution context as all the state required to actually "run" the task.
An example of scheduling context is the information contained in the
task_struct se, rt and dl fields; affinity instead pertains to the
execution context (and I guess deciding what pertains to which is
actually up for discussion as well ;-). Patch 04/08 implements this
distinction; a minimal sketch of the split follows.
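Purely for illustration, here is a condensed, self-contained view of the
split as patch 04/08 introduces it (rq::curr, rq::proxy and the
task_current{,_proxy}() helpers are the names used in the patch; the
stripped-down struct is of course mine):

struct task_struct;

struct rq {
	struct task_struct *curr;	/* execution context: the task physically on the CPU */
	struct task_struct *proxy;	/* scheduling context: whose policy/bandwidth gets charged */
};

/* Is p the current execution context? */
static inline int task_current(struct rq *rq, struct task_struct *p)
{
	return rq->curr == p;
}

/* Is p the current scheduling context? (it may well be both at once) */
static inline int task_current_proxy(struct rq *rq, struct task_struct *p)
{
	return rq->proxy == p;
}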
As implemented in this set, a link between the scheduling contexts of
different tasks can be established when a task blocks on a mutex held
by some other task (the blocked_on relation). In this case the former
task starts to be considered a potential proxy for the latter (the
mutex owner). One key change to how mutexes work here is that waiters
don't really sleep: they are not dequeued, so they can be picked up by
the scheduler when it runs. If a waiter (potential proxy) task is
selected by the scheduler, the blocked_on relation is used to find the
mutex owner and put it to run on the CPU, using the proxy task's
scheduling context.
Follow the blocked-on relation:
             ,-> task                <- proxy, picked by scheduler
             |     | blocked-on
             |     v
blocked-task |   mutex
             |     | owner
             |     v
             `-- task                <- gets to run using proxy info
Now, the situation is (of course) more tricky than depicted so far,
because we have to deal with all sorts of states the mutex owner might
be in when a potential proxy is selected by the scheduler: the owner
might be sleeping, running on a different CPU, itself blocked on
another mutex, and so on. So I'd kindly ask people to have a look at
the proxy() implementation and comments in 05/08; a simplified sketch
of the core walk follows.
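The sketch below is deliberately simplified: locking, migration and all
the corner cases the real proxy() handles are omitted, and the function
name proxy_sketch() is mine, while blocked_on, blocked_task and
__mutex_owner() are the names actually used in the patches.

static struct task_struct *
proxy_sketch(struct rq *rq, struct task_struct *next)
{
	struct task_struct *p, *owner = next;

	/* Walk the chain: waiter -> blocked_on mutex -> owner -> ... */
	for (p = next; p->blocked_on; p = owner) {
		owner = __mutex_owner(p->blocked_on);
		/* Record the inverse link; mutex_unlock() hands off to it. */
		owner->blocked_task = p;
	}

	/*
	 * @next remains the scheduling context (rq->proxy, who gets
	 * charged); @owner is the execution context actually run.
	 */
	return owner;
}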
Peter kindly shared his WIP patches with us (me, Luca, Tommaso, Claudio,
Daniel, the Pisa gang) a while ago, but I only recently managed to have
a decent look at them (thanks a lot to the other guys for giving this a
first look way before me!). This set is thus composed of Peter's
original patches (which I rebased on tip/sched/core as of today,
commented, and hopefully duly reported in the changelogs what I might
have broken) plus a bunch of additional changes that seemed required to
make all this boot "successfully" on a virtual machine. So be advised!
This is good only for fun ATM (I actually really hope it is good enough
for discussion), and pretty far from production I'm afraid. Share early,
share often, right? :-)
The main concerns I have with the current approach are that, being based
on mutex.c, it's both
- not linked with futexes
- not involving "legacy" priority inheritance (rt_mutex.c)
I believe one of the main reasons Peter started this on mutexes is to
get better coverage of potential problems (and I can assure everybody
it did). I'm not yet sure what we should do moving forward, and this is
exactly what I'd be pleased to hear your opinions on.
https://github.com/jlelli/linux.git experimental/deadline/proxy-rfc-v1
Thanks a lot in advance!
- Juri
1 - https://wiki.linuxfoundation.org/_media/realtime/events/rt-summit2017/proxy-execution_peter-zijlstra.pdf
2 - https://lwn.net/Articles/397422/ which "points" to https://goo.gl/3VrLza
3 - https://marc.info/?l=linux-rt-users&m=153450086400459&w=2
4 - https://ieeexplore.ieee.org/document/5562902
5 - http://retis.sssup.it/~lipari/papers/rtlws2013.pdf
6 - https://lore.kernel.org/lkml/[email protected]/
Juri Lelli (3):
locking/mutex: make mutex::wait_lock irq safe
sched: Ensure blocked_on is always guarded by blocked_lock
sched: Fixup task CPUs for potential proxies.
Peter Zijlstra (5):
locking/mutex: Convert mutex::wait_lock to raw_spinlock_t
locking/mutex: Removes wakeups from under mutex::wait_lock
locking/mutex: Rework task_struct::blocked_on
sched: Split scheduler execution context
sched: Add proxy execution
include/linux/mutex.h | 4 +-
include/linux/sched.h | 8 +-
init/Kconfig | 4 +
init/init_task.c | 1 +
kernel/Kconfig.locks | 2 +-
kernel/fork.c | 8 +-
kernel/locking/mutex-debug.c | 12 +-
kernel/locking/mutex.c | 127 +++++++--
kernel/sched/core.c | 510 +++++++++++++++++++++++++++++++++--
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 7 +
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 30 ++-
13 files changed, 642 insertions(+), 75 deletions(-)
--
2.17.1
From: Peter Zijlstra <[email protected]>
In preparation for nesting mutex::wait_lock under rq::lock, it needs to
be a raw_spinlock_t.
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/mutex.h | 4 ++--
kernel/locking/mutex-debug.c | 4 ++--
kernel/locking/mutex.c | 22 +++++++++++-----------
3 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 3093dd162424..725aa113626f 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -52,7 +52,7 @@ struct ww_acquire_ctx;
*/
struct mutex {
atomic_long_t owner;
- spinlock_t wait_lock;
+ raw_spinlock_t wait_lock;
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
struct optimistic_spin_queue osq; /* Spinner MCS lock */
#endif
@@ -127,7 +127,7 @@ do { \
#define __MUTEX_INITIALIZER(lockname) \
{ .owner = ATOMIC_LONG_INIT(0) \
- , .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock) \
+ , .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(lockname.wait_lock) \
, .wait_list = LIST_HEAD_INIT(lockname.wait_list) \
__DEBUG_MUTEX_INITIALIZER(lockname) \
__DEP_MAP_MUTEX_INITIALIZER(lockname) }
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 9aa713629387..a660d38b6c29 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -36,7 +36,7 @@ void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
void debug_mutex_wake_waiter(struct mutex *lock, struct mutex_waiter *waiter)
{
- SMP_DEBUG_LOCKS_WARN_ON(!spin_is_locked(&lock->wait_lock));
+ SMP_DEBUG_LOCKS_WARN_ON(!raw_spin_is_locked(&lock->wait_lock));
DEBUG_LOCKS_WARN_ON(list_empty(&lock->wait_list));
DEBUG_LOCKS_WARN_ON(waiter->magic != waiter);
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
@@ -51,7 +51,7 @@ void debug_mutex_free_waiter(struct mutex_waiter *waiter)
void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
{
- SMP_DEBUG_LOCKS_WARN_ON(!spin_is_locked(&lock->wait_lock));
+ SMP_DEBUG_LOCKS_WARN_ON(!raw_spin_is_locked(&lock->wait_lock));
/* Mark the current thread as blocked on the lock: */
task->blocked_on = waiter;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 3f8a35104285..df34ce70fcde 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -39,7 +39,7 @@ void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
atomic_long_set(&lock->owner, 0);
- spin_lock_init(&lock->wait_lock);
+ raw_spin_lock_init(&lock->wait_lock);
INIT_LIST_HEAD(&lock->wait_list);
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
osq_lock_init(&lock->osq);
@@ -464,9 +464,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* Uh oh, we raced in fastpath, check if any of the waiters need to
* die or wound us.
*/
- spin_lock(&lock->base.wait_lock);
+ raw_spin_lock(&lock->base.wait_lock);
__ww_mutex_check_waiters(&lock->base, ctx);
- spin_unlock(&lock->base.wait_lock);
+ raw_spin_unlock(&lock->base.wait_lock);
}
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
@@ -934,7 +934,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
return 0;
}
- spin_lock(&lock->wait_lock);
+ raw_spin_lock(&lock->wait_lock);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -998,7 +998,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
goto err;
}
- spin_unlock(&lock->wait_lock);
+ raw_spin_unlock(&lock->wait_lock);
schedule_preempt_disabled();
/*
@@ -1021,9 +1021,9 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
(first && mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, &waiter)))
break;
- spin_lock(&lock->wait_lock);
+ raw_spin_lock(&lock->wait_lock);
}
- spin_lock(&lock->wait_lock);
+ raw_spin_lock(&lock->wait_lock);
acquired:
__set_current_state(TASK_RUNNING);
@@ -1050,7 +1050,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (use_ww_ctx && ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);
- spin_unlock(&lock->wait_lock);
+ raw_spin_unlock(&lock->wait_lock);
preempt_enable();
return 0;
@@ -1058,7 +1058,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
__set_current_state(TASK_RUNNING);
mutex_remove_waiter(lock, &waiter, current);
err_early_kill:
- spin_unlock(&lock->wait_lock);
+ raw_spin_unlock(&lock->wait_lock);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, 1, ip);
preempt_enable();
@@ -1227,7 +1227,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
owner = old;
}
- spin_lock(&lock->wait_lock);
+ raw_spin_lock(&lock->wait_lock);
debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
@@ -1244,7 +1244,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);
- spin_unlock(&lock->wait_lock);
+ raw_spin_unlock(&lock->wait_lock);
wake_up_q(&wake_q);
}
--
2.17.1
From: Peter Zijlstra <[email protected]>
Let's define the scheduling context as all the scheduler state in
task_struct and the execution context as all the state required to run
the task.
Currently both are intertwined in task_struct. We want to logically
split these such that we can run the execution context of one task
with the scheduling context of another.
To this end, introduce rq::proxy to point to the task_struct used for
scheduler state and preserve rq::curr to denote the execution context.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[added lot of comments/questions - identifiable by XXX]
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/core.c | 62 ++++++++++++++++++++++++++++++++++----------
kernel/sched/fair.c | 4 +++
kernel/sched/sched.h | 30 ++++++++++++++++++++-
3 files changed, 82 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe0223121883..d3c481b734dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -224,12 +224,13 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
{
struct rq *rq = container_of(timer, struct rq, hrtick_timer);
struct rq_flags rf;
+ struct task_struct *curr = rq->proxy;
WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+ curr->sched_class->task_tick(rq, curr, 1);
rq_unlock(rq, &rf);
return HRTIMER_NORESTART;
@@ -836,13 +837,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
+ struct task_struct *curr = rq->proxy;
const struct sched_class *class;
- if (p->sched_class == rq->curr->sched_class) {
- rq->curr->sched_class->check_preempt_curr(rq, p, flags);
+ if (p->sched_class == curr->sched_class) {
+ /*
+ * XXX check_preempt_curr will check rq->curr against p, looks
+ * like we want to check rq->proxy against p though?
+ */
+ curr->sched_class->check_preempt_curr(rq, p, flags);
} else {
for_each_class(class) {
- if (class == rq->curr->sched_class)
+ if (class == curr->sched_class)
break;
if (class == p->sched_class) {
resched_curr(rq);
@@ -855,7 +861,7 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
- if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+ if (task_on_rq_queued(curr) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq);
}
@@ -1016,7 +1022,11 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
lockdep_assert_held(&p->pi_lock);
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ /*
+ * XXX is changing affinity of a proxy a problem?
+ * Consider for example put_prev_ set_curr_ below...
+ */
+ running = task_current_proxy(rq, p);
if (queued) {
/*
@@ -3021,7 +3031,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
* project cycles that may never be accounted to this
* thread, breaking clock_gettime().
*/
- if (task_current(rq, p) && task_on_rq_queued(p)) {
+ if (task_current_proxy(rq, p) && task_on_rq_queued(p)) {
prefetch_curr_exec_start(p);
update_rq_clock(rq);
p->sched_class->update_curr(rq);
@@ -3040,8 +3050,9 @@ void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
- struct task_struct *curr = rq->curr;
struct rq_flags rf;
+ /* accounting goes to the proxy task */
+ struct task_struct *curr = rq->proxy;
sched_clock_tick();
@@ -3096,6 +3107,13 @@ static void sched_tick_remote(struct work_struct *work)
if (is_idle_task(curr))
goto out_unlock;
+ /*
+ * XXX don't we need to account to rq->proxy?
+ * Maybe, since this is a remote tick for full dynticks mode, we are
+ * always sure that there is no proxy (only a single task is running).
+ */
+ SCHED_WARN_ON(rq->curr != rq->proxy);
+
update_rq_clock(rq);
delta = rq_clock_task(rq) - curr->se.exec_start;
@@ -3804,7 +3822,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
prev_class = p->sched_class;
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ /*
+ * XXX how do (proxy exec) mutexes and RT_mutexes work together?!
+ */
+ running = task_current_proxy(rq, p);
if (queued)
dequeue_task(rq, p, queue_flag);
if (running)
@@ -3891,7 +3912,10 @@ void set_user_nice(struct task_struct *p, long nice)
goto out_unlock;
}
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ /*
+ * XXX see concerns about do_set_cpus_allowed, rt_mutex_prio & Co.
+ */
+ running = task_current_proxy(rq, p);
if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
if (running)
@@ -4318,7 +4342,10 @@ static int __sched_setscheduler(struct task_struct *p,
}
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ /*
+ * XXX and again, how is this safe w.r.t. proxy exec?
+ */
+ running = task_current_proxy(rq, p);
if (queued)
dequeue_task(rq, p, queue_flags);
if (running)
@@ -4938,6 +4965,11 @@ static void do_sched_yield(void)
rq_lock(rq, &rf);
schedstat_inc(rq->yld_count);
+ /*
+ * XXX how about proxy exec?
+ * If a task currently proxied by some other task yields, should we
+ * apply the proxy or the current yield "behaviour" ?
+ */
current->sched_class->yield_task(rq);
/*
@@ -5044,6 +5076,10 @@ EXPORT_SYMBOL(yield);
*/
int __sched yield_to(struct task_struct *p, bool preempt)
{
+ /*
+ * XXX what about current being proxied?
+ * Should we use proxy->sched_class methods in this case?
+ */
struct task_struct *curr = current;
struct rq *rq, *p_rq;
unsigned long flags;
@@ -5502,7 +5538,7 @@ void sched_setnuma(struct task_struct *p, int nid)
rq = task_rq_lock(p, &rf);
queued = task_on_rq_queued(p);
- running = task_current(rq, p);
+ running = task_current_proxy(rq, p);
if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -6351,7 +6387,7 @@ void sched_move_task(struct task_struct *tsk)
rq = task_rq_lock(tsk, &rf);
update_rq_clock(rq);
- running = task_current(rq, tsk);
+ running = task_current_proxy(rq, tsk);
queued = task_on_rq_queued(tsk);
if (queued)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d59307ecd67d..7f8a5dcda923 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9721,6 +9721,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
entity_tick(cfs_rq, se, queued);
}
+ /*
+ * XXX need to use execution context (rq->curr) for task_tick_numa and
+ * update_misfit_status?
+ */
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 798b1afd5092..287ff248836f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -824,7 +824,8 @@ struct rq {
*/
unsigned long nr_uninterruptible;
- struct task_struct *curr;
+ struct task_struct *curr; /* Execution context */
+ struct task_struct *proxy; /* Scheduling context (policy) */
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
@@ -1421,11 +1422,38 @@ static inline u64 global_rt_runtime(void)
return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
}
+/*
+ * Is p the current execution context?
+ */
static inline int task_current(struct rq *rq, struct task_struct *p)
{
return rq->curr == p;
}
+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq->proxy == p.
+ */
+static inline int task_current_proxy(struct rq *rq, struct task_struct *p)
+{
+ return rq->proxy == p;
+}
+
+#ifdef CONFIG_PROXY_EXEC
+static inline bool task_is_blocked(struct task_struct *p)
+{
+ return !!p->blocked_on;
+}
+#else /* !PROXY_EXEC */
+static inline bool task_is_blocked(struct task_struct *p)
+{
+ return false;
+}
+
+#endif /* PROXY_EXEC */
+
static inline int task_running(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
--
2.17.1
From: Peter Zijlstra <[email protected]>
A task currently holding a mutex (executing a critical section) might
find it beneficial to use the scheduling contexts of other tasks blocked
on the same mutex, if they happen to have a higher priority than the
current owner (e.g., to prevent priority inversions).
Proxy execution lets a task do exactly that: if a mutex owner has
waiters, it can use the waiters' scheduling contexts to potentially
continue running if it would otherwise be preempted.
The basic mechanism is implemented by this patch, the core of which
resides in the proxy() function. Potential proxies (i.e., tasks blocked
on a mutex) are not dequeued, so, if one of them is actually selected by
schedule() as the next task to be put to run on a CPU, proxy() is used
to walk the blocked_on relation and find which task (the mutex owner)
might be able to use the proxy's scheduling context.
Here come the tricky bits. The owner task might in fact be in all sorts
of states when a proxy is found (blocked, executing on a different CPU,
etc.). Details on how the different situations are handled are to be
found in the proxy() code comments.
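Purely for orientation while reading, here is a condensed, illustrative
sketch of the decision proxy() makes once it holds mutex->wait_lock.
The helper name is mine; the real code below implements each case via
its migrate_task/owned_task/blocked_task paths and also deals with the
races in between.

static struct task_struct *
proxy_owner_case(struct rq *rq, struct task_struct *p, struct mutex *mutex)
{
	struct task_struct *owner = __mutex_owner(mutex);

	if (task_cpu(owner) != cpu_of(rq))
		return NULL;	/* migrate_task: move the chain to owner's CPU, repick */
	if (owner == p)
		return p;	/* owned_task: raced with unlock, @p now owns the mutex */
	if (!owner->on_rq)
		return NULL;	/* blocked_task: owner sleeps, queue behind it, repick */

	return owner;		/* runnable here: run @owner with @p's scheduling context */
}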
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebased, added comments and changelog]
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 2 +
init/Kconfig | 4 +
init/init_task.c | 1 +
kernel/Kconfig.locks | 2 +-
kernel/fork.c | 2 +
kernel/locking/mutex.c | 47 ++++-
kernel/sched/core.c | 408 ++++++++++++++++++++++++++++++++++++++--
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 3 +
kernel/sched/rt.c | 2 +-
10 files changed, 455 insertions(+), 18 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a35e8ab3eef1..e3e7b1bcb8fa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -909,6 +909,8 @@ struct task_struct {
struct task_struct *blocked_task; /* task that's boosting us */
struct mutex *blocked_on; /* lock we're blocked on */
+ struct list_head blocked_entry; /* tasks blocked on us */
+ raw_spinlock_t blocked_lock;
#ifdef CONFIG_TRACE_IRQFLAGS
unsigned int irq_events;
diff --git a/init/Kconfig b/init/Kconfig
index 317d5ccb5191..210edff253f8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -667,6 +667,10 @@ config NUMA_BALANCING_DEFAULT_ENABLED
If set, automatic NUMA balancing will be enabled if running on a NUMA
machine.
+config PROXY_EXEC
+ bool "Proxy Execution"
+ default n
+
menuconfig CGROUPS
bool "Control Group support"
select KERNFS
diff --git a/init/init_task.c b/init/init_task.c
index 5aebe3be4d7c..6505df95f6ac 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -117,6 +117,7 @@ struct task_struct init_task
.journal_info = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+ .blocked_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
.timer_slack_ns = 50000, /* 50 usec default slack */
.thread_pid = &init_struct_pid,
.thread_group = LIST_HEAD_INIT(init_task.thread_group),
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 84d882f3e299..5a627839a048 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -225,7 +225,7 @@ config ARCH_SUPPORTS_ATOMIC_RMW
config MUTEX_SPIN_ON_OWNER
def_bool y
- depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW
+ depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW && !PROXY_EXEC
config RWSEM_SPIN_ON_OWNER
def_bool y
diff --git a/kernel/fork.c b/kernel/fork.c
index ef27a675b0d7..b56ca9780194 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1723,6 +1723,7 @@ static __latent_entropy struct task_struct *copy_process(
ftrace_graph_init_task(p);
rt_mutex_init_task(p);
+ raw_spin_lock_init(&p->blocked_lock);
#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
@@ -1829,6 +1830,7 @@ static __latent_entropy struct task_struct *copy_process(
p->blocked_task = NULL; /* nobody is boosting us yet*/
p->blocked_on = NULL; /* not blocked yet */
+ INIT_LIST_HEAD(&p->blocked_entry);
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 76b59b555da3..23312afa7fca 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -35,6 +35,10 @@
# include "mutex.h"
#endif
+#if defined(CONFIG_PROXY_EXEC) && defined(CONFIG_MUTEX_SPIN_ON_OWNER)
+#error BOOM
+#endif
+
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
@@ -1021,6 +1025,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
}
+ /*
+ * Gets reset by ttwu_remote().
+ */
+ current->blocked_on = lock;
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -1206,10 +1214,21 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
{
struct task_struct *next = NULL;
DEFINE_WAKE_Q(wake_q);
- unsigned long owner;
+ /*
+ * XXX [juril] Proxy Exec always forces a HANDOFF (so that owner is
+ * never empty when there are waiters waiting?). Should we make this
+ * conditional on having proxy exec configured in?
+ */
+ unsigned long owner = MUTEX_FLAG_HANDOFF;
mutex_release(&lock->dep_map, 1, ip);
+ /*
+ * XXX must always hand off the mutex to avoid !owner in proxy().
+ * Scheduler delay is minimal since we hand off to the task that
+ * is to be scheduled next.
+ */
+#ifndef CONFIG_PROXY_EXEC
/*
* Release the lock before (potentially) taking the spinlock such that
* other contenders can get on with things ASAP.
@@ -1240,10 +1259,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
owner = old;
}
+#endif
raw_spin_lock(&lock->wait_lock);
debug_mutex_unlock(lock);
- if (!list_empty(&lock->wait_list)) {
+
+#ifdef CONFIG_PROXY_EXEC
+ /*
+ * If we have a task boosting us, and that task was boosting us through
+ * this lock, hand the lock to that task, as that is the highest
+ * waiter, as selected by the scheduling function.
+ *
+ * XXX existence guarantee on ->blocked_task ?
+ */
+ next = current->blocked_task;
+ if (next && next->blocked_on != lock)
+ next = NULL;
+
+ /*
+ * XXX if there was no higher prio proxy, ->blocked_task will not have
+ * been set. Therefore lower prio contending tasks are serviced in
+ * FIFO order.
+ */
+#endif
+
+ /*
+ * Failing that, pick any on the wait list.
+ */
+ if (!next && !list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
struct mutex_waiter *waiter =
list_first_entry(&lock->wait_list,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d3c481b734dd..e3e3eea3f5b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1646,8 +1646,31 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
{
+ /*
+ * XXX can we possibly avoid the spinlock ? See proxy()'s blocked_task
+ * case.
+ */
+ raw_spin_lock(&p->blocked_lock);
activate_task(rq, p, en_flags);
+
+ /*
+ * A whole bunch of 'proxy' tasks back this blocked task, wake
+ * them all up to give this task its 'fair' share.
+ */
+ while (!list_empty(&p->blocked_entry)) {
+ struct task_struct *pp =
+ list_first_entry(&p->blocked_entry,
+ struct task_struct,
+ blocked_entry);
+
+ list_del_init(&pp->blocked_entry);
+ activate_task(rq, pp, en_flags);
+ pp->on_rq = TASK_ON_RQ_QUEUED;
+ resched_curr(rq);
+ }
+
p->on_rq = TASK_ON_RQ_QUEUED;
+ raw_spin_unlock(&p->blocked_lock);
/* If a worker is waking up, notify the workqueue: */
if (p->flags & PF_WQ_WORKER)
@@ -1722,12 +1745,46 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
int ret = 0;
rq = __task_rq_lock(p, &rf);
- if (task_on_rq_queued(p)) {
- /* check_preempt_curr() may use rq clock */
- update_rq_clock(rq);
- ttwu_do_wakeup(rq, p, wake_flags, &rf);
- ret = 1;
+ if (!task_on_rq_queued(p)) {
+ BUG_ON(p->state == TASK_RUNNING);
+ goto out_unlock;
}
+
+ /*
+ * ttwu_do_wakeup()->
+ * check_preempt_curr() may use rq clock
+ */
+ update_rq_clock(rq);
+
+ /*
+ * Since we don't dequeue for blocked-on relations, we'll always
+ * trigger the on_rq_queued() clause for them.
+ */
+ if (task_is_blocked(p)) {
+ p->blocked_on = NULL; /* let it run again */
+ if (!cpumask_test_cpu(cpu_of(rq), &p->cpus_allowed)) {
+ /*
+ * proxy stuff moved us outside of the affinity mask
+ * 'sleep' now and fail the direct wakeup so that the
+ * normal wakeup path will fix things.
+ */
+ p->on_rq = 0;
+ /* XXX [juril] SLEEP|NOCLOCK ? */
+ deactivate_task(rq, p, DEQUEUE_SLEEP);
+ goto out_unlock;
+ }
+
+ /*
+ * Must resched after killing a blocked_on relation. The currently
+ * executing context might not be the most eligible anymore.
+ */
+ resched_curr(rq);
+ }
+
+ ttwu_do_wakeup(rq, p, wake_flags, &rf);
+ ret = 1;
+
+out_unlock:
__task_rq_unlock(rq, &rf);
return ret;
@@ -3360,6 +3417,308 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
BUG();
}
+#ifdef CONFIG_PROXY_EXEC
+static struct task_struct fake_task;
+
+/*
+ * Follow the blocked-on relation:
+ *
+ *              ,-> task
+ *              |     | blocked-on
+ *              |     v
+ * blocked-task |   mutex
+ *              |     | owner
+ *              |     v
+ *              `-- task
+ *
+ * and set the blocked-task relation, this latter is used by the mutex code
+ * to find which (blocked) task to hand-off to.
+ *
+ * Lock order:
+ *
+ * p->pi_lock
+ * rq->lock
+ * mutex->wait_lock
+ * p->blocked_lock
+ *
+ * Returns the task that is going to be used as execution context (the one
+ * that is actually going to be put to run on cpu_of(rq)).
+ *
+ */
+static struct task_struct *
+proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
+{
+ struct task_struct *p = next;
+ struct task_struct *owner;
+ struct mutex *mutex;
+ struct rq *that_rq;
+ int this_cpu, that_cpu;
+ LIST_HEAD(migrate_list);
+
+ this_cpu = cpu_of(rq);
+
+ /*
+ * Follow blocked_on chain.
+ *
+ * TODO: deadlock detection
+ */
+ for (p = next; p->blocked_on; p = owner) {
+ mutex = p->blocked_on;
+
+ /*
+ * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+ * and ensure @owner sticks around.
+ */
+ raw_spin_lock(&mutex->wait_lock);
+ owner = __mutex_owner(mutex);
+ /*
+ * XXX can't this be 0|FLAGS? See __mutex_unlock_slowpath for(;;)
+ * Mmm, OK, this probably can't happen because we force
+ * unlock to skip the for(;;) loop. Is this acceptable though?
+ */
+retry_owner:
+ if (task_cpu(owner) != this_cpu)
+ goto migrate_task;
+
+ if (owner == p)
+ goto owned_task;
+
+ if (!owner->on_rq)
+ goto blocked_task;
+
+ /*
+ * OK, now we're absolutely sure @owner is not blocked _and_
+ * on this rq, therefore holding @rq->lock is sufficient to
+ * guarantee its existence, as per ttwu_remote().
+ */
+ raw_spin_unlock(&mutex->wait_lock);
+
+ owner->blocked_task = p;
+ }
+
+ WARN_ON_ONCE(!owner->on_rq);
+
+ return owner;
+
+migrate_task:
+ /*
+ * The blocked-on relation must not cross CPUs, if this happens
+ * migrate @p to the @owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (@owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * XXX [juril] what if @p is not the highest prio task once migrated
+ * to @owner's CPU?
+ *
+ * XXX [juril] also, after @p is migrated it is not migrated back once
+ * @owner releases the lock? Isn't this a potential problem w.r.t.
+ * @owner affinity settings?
+ * [juril] OK. It is migrated back into its affinity mask in
+ * ttwu_remote(), or by using wake_cpu via select_task_rq, guess we
+ * might want to add a comment about that here. :-)
+ *
+ * TODO: could optimize by finding the CPU of the final owner
+ * and migrating things there. Given:
+ *
+ * CPU0 CPU1 CPU2
+ *
+ * a ----> b ----> c
+ *
+ * the current scheme would result in migrating 'a' to CPU1,
+ * then CPU1 would migrate b and a to CPU2. Only then would
+ * CPU2 run c.
+ */
+ that_cpu = task_cpu(owner);
+ that_rq = cpu_rq(that_cpu);
+ /*
+ * @owner can disappear, simply migrate to @that_cpu and leave that CPU
+ * to sort things out.
+ */
+ raw_spin_unlock(&mutex->wait_lock);
+
+ /*
+ * Since we're going to drop @rq, we have to put(@next) first,
+ * otherwise we have a reference that no longer belongs to us. Use
+ * @fake_task to fill the void and make the next pick_next_task()
+ * invocation happy.
+ *
+ * XXX double, triple think about this.
+ * XXX put doesn't work with ON_RQ_MIGRATE
+ *
+ * CPU0 CPU1
+ *
+ * B mutex_lock(X)
+ *
+ * A mutex_lock(X) <- B
+ * A __schedule()
+ * A pick->A
+ * A proxy->B
+ * A migrate A to CPU1
+ * B mutex_unlock(X) -> A
+ * B __schedule()
+ * B pick->A
+ * B switch_to (A)
+ * A ... does stuff
+ * A ... is still running here
+ *
+ * * BOOM *
+ */
+ put_prev_task(rq, next);
+ if (rq->curr != rq->idle) {
+ rq->proxy = rq->idle;
+ set_tsk_need_resched(rq->idle);
+ /*
+ * XXX [juril] don't we still need to migrate @next to
+ * @owner's CPU?
+ */
+ return rq->idle;
+ }
+ rq->proxy = &fake_task;
+
+ for (; p; p = p->blocked_task) {
+ int wake_cpu = p->wake_cpu;
+
+ WARN_ON(p == rq->curr);
+
+ p->on_rq = TASK_ON_RQ_MIGRATING;
+ dequeue_task(rq, p, 0);
+ set_task_cpu(p, that_cpu);
+ /*
+ * We can abuse blocked_entry to migrate the thing, because @p is
+ * still on the rq.
+ */
+ list_add(&p->blocked_entry, &migrate_list);
+
+ /*
+ * Preserve p->wake_cpu, such that we can tell where it
+ * used to run later.
+ */
+ p->wake_cpu = wake_cpu;
+ }
+
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock(&rq->lock);
+ raw_spin_lock(&that_rq->lock);
+
+ while (!list_empty(&migrate_list)) {
+ p = list_first_entry(&migrate_list, struct task_struct, blocked_entry);
+ list_del_init(&p->blocked_entry);
+
+ enqueue_task(that_rq, p, 0);
+ check_preempt_curr(that_rq, p, 0);
+ p->on_rq = TASK_ON_RQ_QUEUED;
+ resched_curr(that_rq);
+ }
+
+ raw_spin_unlock(&that_rq->lock);
+ raw_spin_lock(&rq->lock);
+ rq_repin_lock(rq, rf);
+
+ return NULL; /* Retry task selection on _this_ CPU. */
+
+owned_task:
+ /*
+ * It's possible we interleave with mutex_unlock like:
+ *
+ * lock(&rq->lock);
+ * proxy()
+ * mutex_unlock()
+ * lock(&wait_lock);
+ * next(owner) = current->blocked_task;
+ * unlock(&wait_lock);
+ *
+ * wake_up_q();
+ * ...
+ * ttwu_remote()
+ * __task_rq_lock()
+ * lock(&wait_lock);
+ * owner == p
+ *
+ * Which leaves us to finish the ttwu_remote() and make it go.
+ *
+ * XXX is this happening in case of a HANDOFF to p?
+ * In any case, reading of the owner in __mutex_unlock_slowpath is
+ * done atomically outside wait_lock (only adding waiters to wake_q is
+ * done inside the critical section).
+ * Does this mean we can get to proxy _w/o an owner_ if that was
+ * cleared before grabbing wait_lock? Do we account for this case?
+ * OK we actually do (see PROXY_EXEC ifdeffery in unlock function).
+ */
+
+ /*
+ * Finish wakeup, will make the contending ttwu do a
+ * _spurious_ wakeup, but all code should be able to
+ * deal with that.
+ */
+ owner->blocked_on = NULL;
+ owner->state = TASK_RUNNING;
+ // XXX task_woken
+
+ /*
+ * If @owner/@p is allowed to run on this CPU, make it go.
+ */
+ if (cpumask_test_cpu(this_cpu, &owner->cpus_allowed)) {
+ raw_spin_unlock(&mutex->wait_lock);
+ return owner;
+ }
+
+ /*
+ * We have to let ttwu fix things up, because we
+ * can't restore the affinity. So dequeue.
+ */
+ owner->on_rq = 0;
+ deactivate_task(rq, p, DEQUEUE_SLEEP);
+ goto blocked_task;
+
+blocked_task:
+ /*
+ * If !@owner->on_rq, holding @rq->lock will not pin the task,
+ * so we cannot drop @mutex->wait_lock until we're sure it's a blocked
+ * task on this rq.
+ *
+ * We use @owner->blocked_lock to serialize against ttwu_activate().
+ * Either we see its new owner->on_rq or it will see our list_add().
+ */
+ raw_spin_lock(&owner->blocked_lock);
+
+ /*
+ * If we became runnable while waiting for blocked_lock, retry.
+ */
+ if (owner->on_rq) {
+ /*
+ * If we see the new on->rq, we must also see the new task_cpu().
+ */
+ raw_spin_unlock(&owner->blocked_lock);
+ goto retry_owner;
+ }
+
+ /*
+ * Walk back up the blocked_task relation and enqueue them all on @owner
+ *
+ * ttwu_activate() will pick them up and place them on whatever rq
+ * @owner will run next.
+ */
+ for (; p; p = p->blocked_task) {
+ p->on_rq = 0;
+ deactivate_task(rq, p, DEQUEUE_SLEEP);
+ list_add(&p->blocked_entry, &owner->blocked_entry);
+ }
+ raw_spin_unlock(&owner->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+
+ return NULL; /* retry task selection */
+}
+#else /* PROXY_EXEC */
+static struct task_struct *
+proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
+{
+ return next;
+}
+#endif /* PROXY_EXEC */
+
/*
* __schedule() is the main scheduler function.
*
@@ -3439,12 +3798,19 @@ static void __sched notrace __schedule(bool preempt)
if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
} else {
- deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
- prev->on_rq = 0;
-
- if (prev->in_iowait) {
- atomic_inc(&rq->nr_iowait);
- delayacct_blkio_start();
+ if (!task_is_blocked(prev)) {
+ prev->on_rq = 0;
+ deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+ } else {
+ /*
+ * XXX
+ * Let's make this task, which is blocked on
+ * a mutex, (push/pull)able (RT/DL).
+ * Unfortunately we can only deal with that by
+ * means of a dequeue/enqueue cycle. :-/
+ */
+ dequeue_task(rq, prev, 0);
+ enqueue_task(rq, prev, 0);
}
/*
@@ -3463,7 +3829,23 @@ static void __sched notrace __schedule(bool preempt)
switch_count = &prev->nvcsw;
}
- next = pick_next_task(rq, prev, &rf);
+pick_again:
+ /*
+ * If the picked task is actually blocked, it can act as a proxy for
+ * the task holding the mutex the picked task is blocked on. Get a
+ * reference to the blocked (going-to-be-proxy) task here.
+ * Note that if next isn't actually blocked we will have rq->proxy ==
+ * rq->curr == next in the end, which is intended and means that proxy
+ * execution is currently "not in use".
+ */
+ rq->proxy = next = pick_next_task(rq, rq->proxy, &rf);
+ next->blocked_task = NULL;
+ if (unlikely(task_is_blocked(next))) {
+ next = proxy(rq, next, &rf);
+ if (!next)
+ goto pick_again;
+ }
+
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
@@ -5441,7 +5823,7 @@ void init_idle(struct task_struct *idle, int cpu)
__set_task_cpu(idle, cpu);
rcu_read_unlock();
- rq->curr = rq->idle = idle;
+ rq->curr = rq->proxy = rq->idle = idle;
idle->on_rq = TASK_ON_RQ_QUEUED;
#ifdef CONFIG_SMP
idle->on_cpu = 1;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 91e4202b0634..9336310c541d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1499,7 +1499,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
enqueue_dl_entity(&p->dl, pi_se, flags);
- if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+ if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p))
enqueue_pushable_dl_task(rq, p);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f8a5dcda923..3f9f60bdc1d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7043,6 +7043,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
lockdep_assert_held(&env->src_rq->lock);
+ if (task_is_blocked(p))
+ return 0;
+
/*
* We do not migrate tasks that are:
* 1) throttled_lb_pair, or
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2e2955a8cf8f..9dada9e0d699 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1334,7 +1334,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
enqueue_rt_entity(rt_se, flags);
- if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+ if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p))
enqueue_pushable_task(rq, p);
}
--
2.17.1
When a mutex owner with potential proxies wakes up, those proxies are
activated as well, on the owner's CPU. They might however have been
sleeping on a different CPU.
Fix up the potential proxies' CPU at wakeup time, before activating
them (otherwise they get woken up with a wrong CPU reference).
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/core.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54003515fd29..0314afe4ba80 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1664,6 +1664,14 @@ static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_fl
blocked_entry);
list_del_init(&pp->blocked_entry);
+ /* XXX can't call set_task_cpu() because we hold neither
+ * pp->pi_lock nor the task's rq lock. This should however
+ * be fine, as this task can't be woken up while it is blocked
+ * on this mutex atm.
+ * A problem however might be that __set_task_cpu() calls
+ * set_task_rq() which deals with groups and such...
+ */
+ __set_task_cpu(pp, cpu_of(rq));
activate_task(rq, pp, en_flags);
pp->on_rq = TASK_ON_RQ_QUEUED;
resched_curr(rq);
@@ -3847,7 +3855,8 @@ static void __sched notrace __schedule(bool preempt)
* whether it wants to wake up a task to maintain
* concurrency.
*/
- if (prev->flags & PF_WQ_WORKER) {
+ if ((prev->flags & PF_WQ_WORKER) &&
+ !task_is_blocked(prev)) {
struct task_struct *to_wakeup;
to_wakeup = wq_worker_sleeping(prev);
--
2.17.1
mutex::wait_lock might be nested under rq->lock.
Make it irq safe then.
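For context: rq->lock is acquired with interrupts disabled, so a lock
that can nest inside it must be taken with interrupts disabled as well.
The diff below thus switches every wait_lock site in mutex.c to the
irqsave/irqrestore variants, i.e. (sketch of the pattern only):

	unsigned long flags;

	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	/* ... manipulate wait_list / blocked_on / wake_q ... */
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);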
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/locking/mutex.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 23312afa7fca..c16cb84420c3 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -444,6 +444,7 @@ __ww_mutex_check_waiters(struct mutex *lock,
static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
+ unsigned long flags;
DEFINE_WAKE_Q(wake_q);
ww_mutex_lock_acquired(lock, ctx);
@@ -473,9 +474,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* Uh oh, we raced in fastpath, check if any of the waiters need to
* die or wound us.
*/
- raw_spin_lock(&lock->base.wait_lock);
+ raw_spin_lock_irqsave(&lock->base.wait_lock, flags);
__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
- raw_spin_unlock(&lock->base.wait_lock);
+ raw_spin_unlock_irqrestore(&lock->base.wait_lock, flags);
wake_up_q(&wake_q);
}
@@ -917,6 +918,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
bool first = false;
struct ww_mutex *ww;
int ret;
+ unsigned long flags;
might_sleep();
@@ -947,7 +949,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
return 0;
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -1012,7 +1014,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
goto err;
}
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
schedule_preempt_disabled();
/*
@@ -1039,9 +1041,9 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
(first && mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, &waiter)))
break;
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
__set_current_state(TASK_RUNNING);
@@ -1070,7 +1072,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (use_ww_ctx && ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
return 0;
@@ -1079,7 +1081,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
__set_current_state(TASK_RUNNING);
mutex_remove_waiter(lock, &waiter, current);
err_early_kill:
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, 1, ip);
wake_up_q(&wake_q);
@@ -1220,6 +1222,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
* conditional on having proxy exec configured in?
*/
unsigned long owner = MUTEX_FLAG_HANDOFF;
+ unsigned long flags;
mutex_release(&lock->dep_map, 1, ip);
@@ -1261,7 +1264,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}
#endif
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
#ifdef CONFIG_PROXY_EXEC
@@ -1302,7 +1305,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
__mutex_handoff(lock, next);
preempt_disable(); // XXX unlock->wakeup inversion like
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q); // XXX must force resched on proxy
preempt_enable();
--
2.17.1
The blocked_on pointer might be concurrently modified by schedule()
(when proxy() is called) and by the wakeup path, so we need to guard
changes.
Ensure blocked_lock is always held before updating the blocked_on
pointer.
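The resulting nesting, shown here only as a minimal sketch (lock and
field names are the ones used by the series; error paths are omitted),
is mutex->wait_lock outside, task->blocked_lock inside:

	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	raw_spin_lock(&current->blocked_lock);

	current->blocked_on = lock;	/* or NULL again on wakeup/release */

	raw_spin_unlock(&current->blocked_lock);
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);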
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/locking/mutex-debug.c | 1 +
kernel/locking/mutex.c | 13 ++++++++++---
kernel/sched/core.c | 31 ++++++++++++++++++++++++++++++-
3 files changed, 41 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 6605e083a3e9..2e3fbdaa8474 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -62,6 +62,7 @@ void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
{
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
+ DEBUG_LOCKS_WARN_ON(task->blocked_on == NULL);
DEBUG_LOCKS_WARN_ON(task->blocked_on != lock);
list_del_init(&waiter->list);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index c16cb84420c3..8f6d4ceca2da 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -950,6 +950,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
}
raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(¤t->blocked_lock);
/*
* After waiting to acquire the wait_lock, try again.
*/
@@ -1014,6 +1015,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
goto err;
}
+ raw_spin_unlock(¤t->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
schedule_preempt_disabled();
@@ -1027,6 +1029,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
}
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(¤t->blocked_lock);
/*
* Gets reset by ttwu_remote().
*/
@@ -1040,10 +1044,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (__mutex_trylock(lock) ||
(first && mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, &waiter)))
break;
-
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
__set_current_state(TASK_RUNNING);
@@ -1072,6 +1073,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (use_ww_ctx && ww_ctx)
ww_mutex_lock_acquired(ww, ww_ctx);
+ raw_spin_unlock(¤t->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
@@ -1081,6 +1083,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
__set_current_state(TASK_RUNNING);
mutex_remove_waiter(lock, &waiter, current);
err_early_kill:
+ raw_spin_unlock(¤t->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, 1, ip);
@@ -1268,6 +1271,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
debug_mutex_unlock(lock);
#ifdef CONFIG_PROXY_EXEC
+ raw_spin_lock(¤t->blocked_lock);
/*
* If we have a task boosting us, and that task was boosting us through
* this lock, hand the lock to that task, as that is the highest
@@ -1305,6 +1309,9 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
__mutex_handoff(lock, next);
preempt_disable(); // XXX unlock->wakeup inversion like
+#ifdef CONFIG_PROXY_EXEC
+ raw_spin_unlock(¤t->blocked_lock);
+#endif
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q); // XXX must force resched on proxy
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e3e3eea3f5b2..54003515fd29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1761,7 +1761,15 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
* trigger the on_rq_queued() clause for them.
*/
if (task_is_blocked(p)) {
- p->blocked_on = NULL; /* let it run again */
+ raw_spin_lock(&p->blocked_lock);
+
+ if (task_is_blocked(p)) {
+ p->blocked_on = NULL; /* let it run again */
+ } else {
+ raw_spin_unlock(&p->blocked_lock);
+ goto out_wakeup;
+ }
+
if (!cpumask_test_cpu(cpu_of(rq), &p->cpus_allowed)) {
/*
* proxy stuff moved us outside of the affinity mask
@@ -1771,6 +1779,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
p->on_rq = 0;
/* XXX [juril] SLEEP|NOCLOCK ? */
deactivate_task(rq, p, DEQUEUE_SLEEP);
+ raw_spin_unlock(&p->blocked_lock);
goto out_unlock;
}
@@ -1779,8 +1788,10 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
* executing context might not be the most eligible anymore.
*/
resched_curr(rq);
+ raw_spin_unlock(&p->blocked_lock);
}
+out_wakeup:
ttwu_do_wakeup(rq, p, wake_flags, &rf);
ret = 1;
@@ -3464,12 +3475,26 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
*/
for (p = next; p->blocked_on; p = owner) {
mutex = p->blocked_on;
+ if (!mutex)
+ return NULL;
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
* and ensure @owner sticks around.
*/
raw_spin_lock(&mutex->wait_lock);
+ raw_spin_lock(&p->blocked_lock);
+
+ /* Check again that p is blocked with blocked_lock held */
+ if (task_is_blocked(p)) {
+ BUG_ON(mutex != p->blocked_on);
+ } else {
+ /* Something changed in the blocked_on chain */
+ raw_spin_unlock(&p->blocked_lock);
+ raw_spin_unlock(&mutex->wait_lock);
+ return NULL;
+ }
+
owner = __mutex_owner(mutex);
/*
* XXX can't this be 0|FLAGS? See __mutex_unlock_slowpath for(;;)
@@ -3491,6 +3516,7 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
* on this rq, therefore holding @rq->lock is sufficient to
* guarantee its existence, as per ttwu_remote().
*/
+ raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
owner->blocked_task = p;
@@ -3537,6 +3563,7 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
* @owner can disappear, simply migrate to @that_cpu and leave that CPU
* to sort things out.
*/
+ raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
/*
@@ -3661,6 +3688,7 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
* If @owner/@p is allowed to run on this CPU, make it go.
*/
if (cpumask_test_cpu(this_cpu, &owner->cpus_allowed)) {
+ raw_spin_unlock(&p->blocked_lock);
raw_spin_unlock(&mutex->wait_lock);
return owner;
}
@@ -3682,6 +3710,7 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
* We use @owner->blocked_lock to serialize against ttwu_activate().
* Either we see its new owner->on_rq or it will see our list_add().
*/
+ raw_spin_unlock(&p->blocked_lock);
raw_spin_lock(&owner->blocked_lock);
/*
--
2.17.1
From: Peter Zijlstra <[email protected]>
Track the blocked-on relation for mutexes; this allows following the
relation at schedule time. Add blocked_task to track the inverse
relation.
             ,-> task
             |     | blocked-on
             |     v
blocked-task |   mutex
             |     | owner
             |     v
             `-- task
This patch only enables the blocked-on relation; blocked-task will be
used by a later patch implementing proxy().
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[minor changes while rebasing]
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 6 ++----
kernel/fork.c | 6 +++---
kernel/locking/mutex-debug.c | 7 +++----
kernel/locking/mutex.c | 3 +++
4 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57d7bc9..a35e8ab3eef1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -907,10 +907,8 @@ struct task_struct {
struct rt_mutex_waiter *pi_blocked_on;
#endif
-#ifdef CONFIG_DEBUG_MUTEXES
- /* Mutex deadlock detection: */
- struct mutex_waiter *blocked_on;
-#endif
+ struct task_struct *blocked_task; /* task that's boosting us */
+ struct mutex *blocked_on; /* lock we're blocked on */
#ifdef CONFIG_TRACE_IRQFLAGS
unsigned int irq_events;
diff --git a/kernel/fork.c b/kernel/fork.c
index f0b58479534f..ef27a675b0d7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1827,9 +1827,9 @@ static __latent_entropy struct task_struct *copy_process(
lockdep_init_task(p);
#endif
-#ifdef CONFIG_DEBUG_MUTEXES
- p->blocked_on = NULL; /* not blocked yet */
-#endif
+ p->blocked_task = NULL; /* nobody is boosting us yet*/
+ p->blocked_on = NULL; /* not blocked yet */
+
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
p->sequential_io_avg = 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index a660d38b6c29..6605e083a3e9 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -53,8 +53,8 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
{
SMP_DEBUG_LOCKS_WARN_ON(!raw_spin_is_locked(&lock->wait_lock));
- /* Mark the current thread as blocked on the lock: */
- task->blocked_on = waiter;
+ /* Current thread can't be already blocked (since it's executing!) */
+ DEBUG_LOCKS_WARN_ON(task->blocked_on);
}
void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -62,8 +62,7 @@ void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
{
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
- DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
- task->blocked_on = NULL;
+ DEBUG_LOCKS_WARN_ON(task->blocked_on != lock);
list_del_init(&waiter->list);
waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index f37402cd8496..76b59b555da3 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -979,6 +979,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
}
waiter.task = current;
+ current->blocked_on = lock;
set_current_state(state);
for (;;) {
@@ -1047,6 +1048,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
}
mutex_remove_waiter(lock, &waiter, current);
+ current->blocked_on = NULL;
+
if (likely(list_empty(&lock->wait_list)))
__mutex_clear_flag(lock, MUTEX_FLAGS);
--
2.17.1
From: Peter Zijlstra <[email protected]>
In preparation for nesting mutex::wait_lock under rq::lock, we need to
remove wakeups from under it.
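Concretely, the diff below applies the standard wake_q deferral pattern:
tasks to be woken are queued on a wake_q while wait_lock is held, and
the actual wakeups happen only after the lock is dropped. A minimal
sketch of the pattern (the wake_q API is real; the surrounding context
is illustrative):

	DEFINE_WAKE_Q(wake_q);

	raw_spin_lock(&lock->wait_lock);
	/* decide which waiters must die/be wounded/be woken ... */
	wake_q_add(&wake_q, waiter->task);	/* queue it, don't wake yet */
	raw_spin_unlock(&lock->wait_lock);

	wake_up_q(&wake_q);			/* wakeups outside wait_lock */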
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
mutexes")]
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/locking/mutex.c | 43 +++++++++++++++++++++++++++---------------
1 file changed, 28 insertions(+), 15 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index df34ce70fcde..f37402cd8496 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -338,7 +338,7 @@ __ww_ctx_stamp_after(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
*/
static bool __sched
__ww_mutex_die(struct mutex *lock, struct mutex_waiter *waiter,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
{
if (!ww_ctx->is_wait_die)
return false;
@@ -346,7 +346,7 @@ __ww_mutex_die(struct mutex *lock, struct mutex_waiter *waiter,
if (waiter->ww_ctx->acquired > 0 &&
__ww_ctx_stamp_after(waiter->ww_ctx, ww_ctx)) {
debug_mutex_wake_waiter(lock, waiter);
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task); // XXX
}
return true;
@@ -361,7 +361,8 @@ __ww_mutex_die(struct mutex *lock, struct mutex_waiter *waiter,
*/
static bool __ww_mutex_wound(struct mutex *lock,
struct ww_acquire_ctx *ww_ctx,
- struct ww_acquire_ctx *hold_ctx)
+ struct ww_acquire_ctx *hold_ctx,
+ struct wake_q_head *wake_q)
{
struct task_struct *owner = __mutex_owner(lock);
@@ -393,7 +394,7 @@ static bool __ww_mutex_wound(struct mutex *lock,
* wakeup pending to re-read the wounded state.
*/
if (owner != current)
- wake_up_process(owner);
+ wake_q_add(wake_q, owner); // XXX
return true;
}
@@ -414,7 +415,9 @@ static bool __ww_mutex_wound(struct mutex *lock,
* The current task must not be on the wait list.
*/
static void __sched
-__ww_mutex_check_waiters(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
+__ww_mutex_check_waiters(struct mutex *lock,
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct mutex_waiter *cur;
@@ -424,8 +427,8 @@ __ww_mutex_check_waiters(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
if (!cur->ww_ctx)
continue;
- if (__ww_mutex_die(lock, cur, ww_ctx) ||
- __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+ if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
+ __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
break;
}
}
@@ -437,6 +440,8 @@ __ww_mutex_check_waiters(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
+ DEFINE_WAKE_Q(wake_q);
+
ww_mutex_lock_acquired(lock, ctx);
/*
@@ -465,8 +470,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* die or wound us.
*/
raw_spin_lock(&lock->base.wait_lock);
- __ww_mutex_check_waiters(&lock->base, ctx);
+ __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
raw_spin_unlock(&lock->base.wait_lock);
+
+ wake_up_q(&wake_q);
}
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
@@ -824,7 +831,8 @@ __ww_mutex_check_kill(struct mutex *lock, struct mutex_waiter *waiter,
static inline int __sched
__ww_mutex_add_waiter(struct mutex_waiter *waiter,
struct mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct mutex_waiter *cur;
struct list_head *pos;
@@ -868,7 +876,7 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
pos = &cur->list;
/* Wait-Die: ensure younger waiters die. */
- __ww_mutex_die(lock, cur, ww_ctx);
+ __ww_mutex_die(lock, cur, ww_ctx, wake_q);
}
__mutex_add_waiter(lock, waiter, pos);
@@ -886,7 +894,7 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
* such that either we or the fastpath will wound @ww->ctx.
*/
smp_mb();
- __ww_mutex_wound(lock, ww_ctx, ww->ctx);
+ __ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
}
return 0;
@@ -900,6 +908,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
struct lockdep_map *nest_lock, unsigned long ip,
struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
+ DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
bool first = false;
struct ww_mutex *ww;
@@ -940,7 +949,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*/
if (__mutex_trylock(lock)) {
if (use_ww_ctx && ww_ctx)
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
goto skip_wait;
}
@@ -962,7 +971,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* Add in stamp order, waking up waiters that must kill
* themselves.
*/
- ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
+ ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
if (ret)
goto err_early_kill;
@@ -1034,7 +1043,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*/
if (!ww_ctx->is_wait_die &&
!__mutex_waiter_is_first(lock, &waiter))
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
}
mutex_remove_waiter(lock, &waiter, current);
@@ -1051,6 +1060,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
ww_mutex_lock_acquired(ww, ww_ctx);
raw_spin_unlock(&lock->wait_lock);
+ wake_up_q(&wake_q);
preempt_enable();
return 0;
@@ -1061,6 +1071,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
raw_spin_unlock(&lock->wait_lock);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, 1, ip);
+ wake_up_q(&wake_q);
preempt_enable();
return ret;
}
@@ -1244,9 +1255,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);
+ preempt_disable(); // XXX unlock->wakeup inversion like
raw_spin_unlock(&lock->wait_lock);
- wake_up_q(&wake_q);
+ wake_up_q(&wake_q); // XXX must force resched on proxy
+ preempt_enable();
}
#ifndef CONFIG_DEBUG_LOCK_ALLOC
--
2.17.1
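To make the shape of the change easier to see outside the diff context: the idea is to collect wakeups on an on-stack wake_q while holding wait_lock and only issue them once the lock is dropped. Below is a minimal, stand-alone sketch of that pattern; release_all_waiters() and struct waiter_entry are made up for the example, only DEFINE_WAKE_Q()/wake_q_add()/wake_up_q() are the existing kernel API the patch uses.

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/sched/wake_q.h>
#include <linux/spinlock.h>

struct waiter_entry {			/* hypothetical, stands in for mutex_waiter */
	struct list_head	list;
	struct task_struct	*task;
};

/* Hypothetical helper: wake every waiter without holding wait_lock. */
static void release_all_waiters(raw_spinlock_t *wait_lock,
				struct list_head *waiters)
{
	DEFINE_WAKE_Q(wake_q);
	struct waiter_entry *w;

	raw_spin_lock(wait_lock);
	list_for_each_entry(w, waiters, list)
		wake_q_add(&wake_q, w->task);	/* defer the wakeup */
	raw_spin_unlock(wait_lock);

	wake_up_q(&wake_q);			/* wakeups happen outside wait_lock */
}

Keeping the actual wakeups outside the wait_lock critical section is what later allows wait_lock to nest under rq::lock without running into the unlock->wakeup inversion the XXX comments in the diff mention.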
On Tue, Oct 09, 2018 at 11:24:26AM +0200, Juri Lelli wrote:
> The main concerns I have with the current approach is that, being based
> on mutex.c, it's both
>
> - not linked with futexes
> - not involving "legacy" priority inheritance (rt_mutex.c)
>
> I believe one of the main reasons Peter started this on mutexes is to
> have better coverage of potential problems (which I can assure everybody
> it had). I'm not yet sure what should we do moving forward, and this is
> exactly what I'd be pleased to hear your opinions on.
Well that, and mutex was 'simple', I didn't have to go rip out all the
legacy PI crud.
If this all ends up working well, the solution is 'simple' and we can
simply copy mutex to rt_mutex or something along those lines if we want
to keep the distinction between them. Alternatively we simply delete
rt_mutex.
Thanks for reviving this.. it's been an 'interesting' year and a half
since I wrote all this and I've really not had time to work on it.
On 09/10/18 11:44, Peter Zijlstra wrote:
> On Tue, Oct 09, 2018 at 11:24:26AM +0200, Juri Lelli wrote:
> > The main concerns I have with the current approach is that, being based
> > on mutex.c, it's both
> >
> > - not linked with futexes
> > - not involving "legacy" priority inheritance (rt_mutex.c)
> >
> > I believe one of the main reasons Peter started this on mutexes is to
> > have better coverage of potential problems (which I can assure everybody
> > it had). I'm not yet sure what should we do moving forward, and this is
> > exactly what I'd be pleased to hear your opinions on.
>
> Well that, and mutex was 'simple', I didn't have to go rip out all the
> legacy PI crud.
Indeed.
> If this all ends up working well, the solution is 'simple' and we can
> simply copy mutex to rt_mutex or something along those lines if we want
> to keep the distinction between them. Alternatively we simply delete
> rt_mutex.
Ah.. right.. sounds *scary*, but I guess it might be an option after
all.
> Thanks for reviving this.. it's been an 'interesting' year and a half
> since I wrote all this and I've really not had time to work on it.
n/p, it has been a long standing thing to look at for me as well. Thanks
again for sharing your patches!
On 2018-10-09 11:24:26 [+0200], Juri Lelli wrote:
> The main concerns I have with the current approach is that, being based
> on mutex.c, it's both
>
> - not linked with futexes
> - not involving "legacy" priority inheritance (rt_mutex.c)
>
> I believe one of the main reasons Peter started this on mutexes is to
> have better coverage of potential problems (which I can assure everybody
> it had). I'm not yet sure what should we do moving forward, and this is
> exactly what I'd be pleased to hear your opinions on.
wasn't the idea to get rid of rt_mutex once it works?
Sebastian
On 10/9/18 12:51 PM, Sebastian Andrzej Siewior wrote:
>> The main concerns I have with the current approach is that, being based
>> on mutex.c, it's both
>>
>> - not linked with futexes
>> - not involving "legacy" priority inheritance (rt_mutex.c)
>>
>> I believe one of the main reasons Peter started this on mutexes is to
>> have better coverage of potential problems (which I can assure everybody
>> it had). I'm not yet sure what should we do moving forward, and this is
>> exactly what I'd be pleased to hear your opinions on.
> wasn't the idea that once it works to get rid of rt_mutex?
As far as I know, it is. But there is some additional complexity
involved in a -rt version of this patch, for instance:
What should the protocol do if the migrating thread has migration
disabled?
The side effects of, for instance, ignoring the migrate_disable() would
add noise to the initial implementation... too much complexity at once.
IMHO, once it works in the non-rt case, it will be easier to do the
changes needed to integrate it with -rt.
Thoughts?
-- Daniel
On 09/10/18 13:56, Daniel Bristot de Oliveira wrote:
> On 10/9/18 12:51 PM, Sebastian Andrzej Siewior wrote:
> >> The main concerns I have with the current approach is that, being based
> >> on mutex.c, it's both
> >>
> >> - not linked with futexes
> >> - not involving "legacy" priority inheritance (rt_mutex.c)
> >>
> >> I believe one of the main reasons Peter started this on mutexes is to
> >> have better coverage of potential problems (which I can assure everybody
> >> it had). I'm not yet sure what should we do moving forward, and this is
> >> exactly what I'd be pleased to hear your opinions on.
> > wasn't the idea that once it works to get rid of rt_mutex?
Looks like it was (see Peter's reply).
> As far as I know, it is. But there are some additional complexity
> involving a -rt version of this patch, for instance:
>
> What should the protocol do if the thread migrating is with migration
> disabled?
>
> The side effects of, for instance, ignoring the migrate_disable() would
> add noise for the initial implementation... too much complexity at once.
>
> IMHO, once it works in the non-rt, it will be easier to do the changes
> needed to integrate it with -rt.
>
> Thoughts?
Makes sense to me. Maybe we should still keep eventual -rt integration
in mind, so that we don't take decisions we would later regret.
Hi,
On Tue, 9 Oct 2018 11:24:29 +0200
Juri Lelli <[email protected]> wrote:
> From: Peter Zijlstra <[email protected]>
>
> Track the blocked-on relation for mutexes, this allows following this
> relation at schedule time. Add blocked_task to track the inverse
> relation.
>
> ,-> task
> | | blocked-on
> | v
> blocked-task | mutex
> | | owner
> | v
> `-- task
I was a little bit confused by this description, because (if I
understand the code well) blocked_task does not actually track the
inverse of the "blocked_on" relationship, but just points to the task
that is _currently_ acting as a proxy for a given task.
In theory, we could have multiple tasks blocked on "mutex" (which is
owned by "task"), so if "blocked_task" tracked the inverse of
"blocked_on" it should have been a list (or a data structure containing
pointers to multiple task structures), no?
I would propose to change "blocked_task" into something like
"current_proxy", or similar, which should be clearer (unless I
completely misunderstood this stuff... In that case, sorry about the
noise).
Also, I suspect that this "blocked_task" (or "current_proxy") field
should be introduced in patch 5 (same for the "task_is_blocked()"
function from patch 4... should it go in patch 5?)
Luca
>
> This patch only enables blocked-on relation, blocked-task will be
> enabled in a later patch implementing proxy().
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> [minor changes while rebasing]
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> include/linux/sched.h | 6 ++----
> kernel/fork.c | 6 +++---
> kernel/locking/mutex-debug.c | 7 +++----
> kernel/locking/mutex.c | 3 +++
> 4 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 977cb57d7bc9..a35e8ab3eef1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -907,10 +907,8 @@ struct task_struct {
> struct rt_mutex_waiter *pi_blocked_on;
> #endif
>
> -#ifdef CONFIG_DEBUG_MUTEXES
> - /* Mutex deadlock detection: */
> - struct mutex_waiter *blocked_on;
> -#endif
> + struct task_struct *blocked_task; /* task
> that's boosting us */
> + struct mutex *blocked_on; /* lock
> we're blocked on */
> #ifdef CONFIG_TRACE_IRQFLAGS
> unsigned int irq_events;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f0b58479534f..ef27a675b0d7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1827,9 +1827,9 @@ static __latent_entropy struct task_struct
> *copy_process( lockdep_init_task(p);
> #endif
>
> -#ifdef CONFIG_DEBUG_MUTEXES
> - p->blocked_on = NULL; /* not blocked yet */
> -#endif
> + p->blocked_task = NULL; /* nobody is boosting us yet*/
> + p->blocked_on = NULL; /* not blocked yet */
> +
> #ifdef CONFIG_BCACHE
> p->sequential_io = 0;
> p->sequential_io_avg = 0;
> diff --git a/kernel/locking/mutex-debug.c
> b/kernel/locking/mutex-debug.c index a660d38b6c29..6605e083a3e9 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -53,8 +53,8 @@ void debug_mutex_add_waiter(struct mutex *lock,
> struct mutex_waiter *waiter, {
> SMP_DEBUG_LOCKS_WARN_ON(!raw_spin_is_locked(&lock->wait_lock));
>
> - /* Mark the current thread as blocked on the lock: */
> - task->blocked_on = waiter;
> + /* Current thread can't be alredy blocked (since it's
> executing!) */
> + DEBUG_LOCKS_WARN_ON(task->blocked_on);
> }
>
> void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter
> *waiter, @@ -62,8 +62,7 @@ void mutex_remove_waiter(struct mutex
> *lock, struct mutex_waiter *waiter, {
> DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
> DEBUG_LOCKS_WARN_ON(waiter->task != task);
> - DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
> - task->blocked_on = NULL;
> + DEBUG_LOCKS_WARN_ON(task->blocked_on != lock);
>
> list_del_init(&waiter->list);
> waiter->task = NULL;
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index f37402cd8496..76b59b555da3 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -979,6 +979,7 @@ __mutex_lock_common(struct mutex *lock, long
> state, unsigned int subclass, }
>
> waiter.task = current;
> + current->blocked_on = lock;
>
> set_current_state(state);
> for (;;) {
> @@ -1047,6 +1048,8 @@ __mutex_lock_common(struct mutex *lock, long
> state, unsigned int subclass, }
>
> mutex_remove_waiter(lock, &waiter, current);
> + current->blocked_on = NULL;
> +
> if (likely(list_empty(&lock->wait_list)))
> __mutex_clear_flag(lock, MUTEX_FLAGS);
>
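To help picture the relation the patch above introduces, here is a toy, userspace-only model of following blocked_on from the task the scheduler picked down to the task that actually gets to run. The struct layouts and names (toy_task, toy_mutex, proxy_walk) are invented for the sketch and do not correspond to the real task_struct, mutex or proxy() code.

/* toy_blocked_on.c - purely illustrative model of the blocked_on walk. */
#include <stdio.h>
#include <stddef.h>

struct toy_task;

struct toy_mutex {
	struct toy_task *owner;
};

struct toy_task {
	const char *name;
	struct toy_mutex *blocked_on;	/* mutex this task waits on, or NULL */
};

/*
 * Starting from the task the scheduler picked, follow owner links until a
 * task that is not blocked is found; that task runs, using the picked
 * task's scheduling context.
 */
static struct toy_task *proxy_walk(struct toy_task *picked)
{
	struct toy_task *t = picked;

	while (t->blocked_on && t->blocked_on->owner)
		t = t->blocked_on->owner;
	return t;
}

int main(void)
{
	struct toy_task owner = { "owner", NULL };
	struct toy_mutex m = { &owner };
	struct toy_task waiter = { "waiter", &m };

	printf("picked %s, runs %s\n", waiter.name, proxy_walk(&waiter)->name);
	return 0;
}

Built with gcc, it prints "picked waiter, runs owner", which is the little diagram from the changelog in code form.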
On Wed, Oct 10, 2018 at 12:34:17PM +0200, luca abeni wrote:
> So, I would propose to make the proxy() function of patch more generic,
> and not strictly bound to mutexes. Maybe a task structure can contain a
> list of tasks for which the task can act as a proxy, and we can have a
> function like "I want to act as a proxy for task T" to be invoked when
> a task blocks?
Certainly possible, but that's something I'd prefer to look at after it
all 'works'. The mutex blocking thing doesn't require external
interfaces etc..
On 10/10/18 12:43, luca abeni wrote:
> Hi,
>
> On Tue, 9 Oct 2018 11:24:29 +0200
> Juri Lelli <[email protected]> wrote:
>
> > From: Peter Zijlstra <[email protected]>
> >
> > Track the blocked-on relation for mutexes, this allows following this
> > relation at schedule time. Add blocked_task to track the inverse
> > relation.
> >
> > ,-> task
> > | | blocked-on
> > | v
> > blocked-task | mutex
> > | | owner
> > | v
> > `-- task
>
> I was a little bit confused by this description, because (if I
> understand the code well) blocked_task does not actually track the
> inverse of the "blocked_on" relationship, but just points to the task
> that is _currently_ acting as a proxy for a given task.
>
> In theory, we could have multiple tasks blocked on "mutex" (which is
> owned by "task"), so if "blocked_task" tracked the inverse of
> "blocked_on" it should have been a list (or a data structure containing
> pointers to multiple task structures), no?
>
> I would propose to change "blocked_task" into something like
> "current_proxy", or similar, which should be more clear (unless I
> completely misunderstood this stuff... In that case, sorry about the
> noise)
Makes sense to me. It also looks closer to what the comment says.
> Also, I suspect that this "blocked_task" (or "current_proxy") field
> should be introcuced in patch 5 (same for the "task_is_blocked()"
> function from patch 4... Should it go in patch 5?)
Sure. I believe I might have wrongly split things while rebasing.
Will fix.
Thanks,
- Juri
Hi,
On Tue, 9 Oct 2018 11:24:31 +0200
Juri Lelli <[email protected]> wrote:
[...]
> +migrate_task:
[...]
> + put_prev_task(rq, next);
> + if (rq->curr != rq->idle) {
> + rq->proxy = rq->idle;
> + set_tsk_need_resched(rq->idle);
> + /*
> + * XXX [juril] don't we still need to migrate @next
> to
> + * @owner's CPU?
> + */
> + return rq->idle;
> + }
If I understand well, this code ends up migrating the task only if the
CPU was previously idle? (scheduling the idle task if the CPU was not
previously idle)
Out of curiosity (I admit this is my ignorance), why is this needed?
If I understand well, after scheduling the idle task the scheduler will
be invoked again (because of the set_tsk_need_resched(rq->idle)) but I
do not understand why it is not possible to migrate task "p" immediately
(I would just check "rq->curr != p", to avoid migrating the currently
scheduled task).
Thanks,
Luca
> + rq->proxy = &fake_task;
> +
> + for (; p; p = p->blocked_task) {
> + int wake_cpu = p->wake_cpu;
> +
> + WARN_ON(p == rq->curr);
> +
> + p->on_rq = TASK_ON_RQ_MIGRATING;
> + dequeue_task(rq, p, 0);
> + set_task_cpu(p, that_cpu);
> + /*
> + * We can abuse blocked_entry to migrate the thing,
> because @p is
> + * still on the rq.
> + */
> + list_add(&p->blocked_entry, &migrate_list);
> +
> + /*
> + * Preserve p->wake_cpu, such that we can tell where
> it
> + * used to run later.
> + */
> + p->wake_cpu = wake_cpu;
> + }
> +
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock(&rq->lock);
> + raw_spin_lock(&that_rq->lock);
> +
> + while (!list_empty(&migrate_list)) {
> + p = list_first_entry(&migrate_list, struct
> task_struct, blocked_entry);
> + list_del_init(&p->blocked_entry);
> +
> + enqueue_task(that_rq, p, 0);
> + check_preempt_curr(that_rq, p, 0);
> + p->on_rq = TASK_ON_RQ_QUEUED;
> + resched_curr(that_rq);
> + }
> +
> + raw_spin_unlock(&that_rq->lock);
> + raw_spin_lock(&rq->lock);
> + rq_repin_lock(rq, rf);
> +
> + return NULL; /* Retry task selection on _this_ CPU. */
> +
> +owned_task:
> + /*
> + * Its possible we interleave with mutex_unlock like:
> + *
> + * lock(&rq->lock);
> + * proxy()
> + * mutex_unlock()
> + * lock(&wait_lock);
> + * next(owner) = current->blocked_task;
> + * unlock(&wait_lock);
> + *
> + * wake_up_q();
> + * ...
> + * ttwu_remote()
> + * __task_rq_lock()
> + * lock(&wait_lock);
> + * owner == p
> + *
> + * Which leaves us to finish the ttwu_remote() and make it
> go.
> + *
> + * XXX is this happening in case of an HANDOFF to p?
> + * In any case, reading of the owner in
> __mutex_unlock_slowpath is
> + * done atomically outside wait_lock (only adding waiters to
> wake_q is
> + * done inside the critical section).
> + * Does this means we can get to proxy _w/o an owner_ if
> that was
> + * cleared before grabbing wait_lock? Do we account for this
> case?
> + * OK we actually do (see PROXY_EXEC ifdeffery in unlock
> function).
> + */
> +
> + /*
> + * Finish wakeup, will make the contending ttwu do a
> + * _spurious_ wakeup, but all code should be able to
> + * deal with that.
> + */
> + owner->blocked_on = NULL;
> + owner->state = TASK_RUNNING;
> + // XXX task_woken
> +
> + /*
> + * If @owner/@p is allowed to run on this CPU, make it go.
> + */
> + if (cpumask_test_cpu(this_cpu, &owner->cpus_allowed)) {
> + raw_spin_unlock(&mutex->wait_lock);
> + return owner;
> + }
> +
> + /*
> + * We have to let ttwu fix things up, because we
> + * can't restore the affinity. So dequeue.
> + */
> + owner->on_rq = 0;
> + deactivate_task(rq, p, DEQUEUE_SLEEP);
> + goto blocked_task;
> +
> +blocked_task:
> + /*
> + * If !@owner->on_rq, holding @rq->lock will not pin the
> task,
> + * so we cannot drop @mutex->wait_lock until we're sure its
> a blocked
> + * task on this rq.
> + *
> + * We use @owner->blocked_lock to serialize against
> ttwu_activate().
> + * Either we see its new owner->on_rq or it will see our
> list_add().
> + */
> + raw_spin_lock(&owner->blocked_lock);
> +
> + /*
> + * If we became runnable while waiting for blocked_lock,
> retry.
> + */
> + if (owner->on_rq) {
> + /*
> + * If we see the new on->rq, we must also see the
> new task_cpu().
> + */
> + raw_spin_unlock(&owner->blocked_lock);
> + goto retry_owner;
> + }
> +
> + /*
> + * Walk back up the blocked_task relation and enqueue them
> all on @owner
> + *
> + * ttwu_activate() will pick them up and place them on
> whatever rq
> + * @owner will run next.
> + */
> + for (; p; p = p->blocked_task) {
> + p->on_rq = 0;
> + deactivate_task(rq, p, DEQUEUE_SLEEP);
> + list_add(&p->blocked_entry, &owner->blocked_entry);
> + }
> + raw_spin_unlock(&owner->blocked_lock);
> + raw_spin_unlock(&mutex->wait_lock);
> +
> + return NULL; /* retry task selection */
> +}
> +#else /* PROXY_EXEC */
> +static struct task_struct *
> +proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> +{
> + return next;
> +}
> +#endif /* PROXY_EXEC */
> +
> /*
> * __schedule() is the main scheduler function.
> *
> @@ -3439,12 +3798,19 @@ static void __sched notrace __schedule(bool
> preempt) if (unlikely(signal_pending_state(prev->state, prev))) {
> prev->state = TASK_RUNNING;
> } else {
> - deactivate_task(rq, prev, DEQUEUE_SLEEP |
> DEQUEUE_NOCLOCK);
> - prev->on_rq = 0;
> -
> - if (prev->in_iowait) {
> - atomic_inc(&rq->nr_iowait);
> - delayacct_blkio_start();
> + if (!task_is_blocked(prev)) {
> + prev->on_rq = 0;
> + deactivate_task(rq, prev,
> DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> + } else {
> + /*
> + * XXX
> + * Let's make this task, which is
> blocked on
> + * a mutex, (push/pull)able (RT/DL).
> + * Unfortunately we can only deal
> with that by
> + * means of a dequeue/enqueue
> cycle. :-/
> + */
> + dequeue_task(rq, prev, 0);
> + enqueue_task(rq, prev, 0);
> }
>
> /*
> @@ -3463,7 +3829,23 @@ static void __sched notrace __schedule(bool
> preempt) switch_count = &prev->nvcsw;
> }
>
> - next = pick_next_task(rq, prev, &rf);
> +pick_again:
> + /*
> + * If picked task is actually blocked it means that it can
> act as a
> + * proxy for the task that is holding the mutex picked task
> is blocked
> + * on. Get a reference to the blocked (going to be proxy)
> task here.
> + * Note that if next isn't actually blocked we will have
> rq->proxy ==
> + * rq->curr == next in the end, which is intended and means
> that proxy
> + * execution is currently "not in use".
> + */
> + rq->proxy = next = pick_next_task(rq, rq->proxy, &rf);
> + next->blocked_task = NULL;
> + if (unlikely(task_is_blocked(next))) {
> + next = proxy(rq, next, &rf);
> + if (!next)
> + goto pick_again;
> + }
> +
> clear_tsk_need_resched(prev);
> clear_preempt_need_resched();
>
> @@ -5441,7 +5823,7 @@ void init_idle(struct task_struct *idle, int
> cpu) __set_task_cpu(idle, cpu);
> rcu_read_unlock();
>
> - rq->curr = rq->idle = idle;
> + rq->curr = rq->proxy = rq->idle = idle;
> idle->on_rq = TASK_ON_RQ_QUEUED;
> #ifdef CONFIG_SMP
> idle->on_cpu = 1;
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 91e4202b0634..9336310c541d 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1499,7 +1499,7 @@ static void enqueue_task_dl(struct rq *rq,
> struct task_struct *p, int flags)
> enqueue_dl_entity(&p->dl, pi_se, flags);
>
> - if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> + if (!task_current(rq, p) && p->nr_cpus_allowed > 1
> && !task_is_blocked(p)) enqueue_pushable_dl_task(rq, p);
> }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7f8a5dcda923..3f9f60bdc1d6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7043,6 +7043,9 @@ int can_migrate_task(struct task_struct *p,
> struct lb_env *env)
> lockdep_assert_held(&env->src_rq->lock);
>
> + if (task_is_blocked(p))
> + return 0;
> +
> /*
> * We do not migrate tasks that are:
> * 1) throttled_lb_pair, or
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 2e2955a8cf8f..9dada9e0d699 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1334,7 +1334,7 @@ enqueue_task_rt(struct rq *rq, struct
> task_struct *p, int flags)
> enqueue_rt_entity(rt_se, flags);
>
> - if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> + if (!task_current(rq, p) && p->nr_cpus_allowed > 1
> && !task_is_blocked(p)) enqueue_pushable_task(rq, p);
> }
>
On Wed, 10 Oct 2018 12:57:10 +0200
Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 10, 2018 at 12:34:17PM +0200, luca abeni wrote:
> > So, I would propose to make the proxy() function of patch more
> > generic, and not strictly bound to mutexes. Maybe a task structure
> > can contain a list of tasks for which the task can act as a proxy,
> > and we can have a function like "I want to act as a proxy for task
> > T" to be invoked when a task blocks?
>
> Certainly possible, but that's something I'd prefer to look at after
> it all 'works'.
Of course :)
I was mentioning this idea because maybe it can have some impact on the
design.
BTW, here is another "interesting" issue I had in the past with changes
like this one: how do we check if the patchset works as expected?
"No crashes" is surely a requirement, but I think we also need some
kind of testcase that fails if the inheritance mechanism is not working
properly, and is successful if the inheritance works.
Maybe we can develop some testcase based on rt-app (if no one has such a
testcase already)
Thanks,
Luca
> The mutex blocking thing doesn't require external
> interfaces etc..
On Wed, Oct 10, 2018 at 01:16:29PM +0200, luca abeni wrote:
> On Wed, 10 Oct 2018 12:57:10 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, Oct 10, 2018 at 12:34:17PM +0200, luca abeni wrote:
> > > So, I would propose to make the proxy() function of patch more
> > > generic, and not strictly bound to mutexes. Maybe a task structure
> > > can contain a list of tasks for which the task can act as a proxy,
> > > and we can have a function like "I want to act as a proxy for task
> > > T" to be invoked when a task blocks?
> >
> > Certainly possible, but that's something I'd prefer to look at after
> > it all 'works'.
>
> Of course :)
> I was mentioning this idea because maybe it can have some impact on the
> design.
>
> BTW, here is another "interesting" issue I had in the past with changes
> like this one: how do we check if the patchset works as expected?
>
> "No crashes" is surely a requirement, but I think we also need some
> kind of testcase that fails if the inheritance mechanism is not working
> properly, and is successful if the inheritance works.
>
> Maybe we can develop some testcase based on rt-app (if noone has such a
> testcase already)
Indeed; IIRC there is a test suite that mostly covers the FIFO-PI stuff,
that should obviously still pass. Steve, do you know where that lives?
For the extended DL stuff, we'd need new tests.
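For reference, the basic shape of such an inheritance check for the FIFO-PI case can be put together with nothing but standard pthread PI mutexes. The program below is only a sketch and not part of any existing suite; the priorities, sleeps and CPU pinning are arbitrary choices, and it needs rtprio privileges to be meaningful. With PTHREAD_PRIO_INHERIT the low-priority lock holder gets boosted above the spinning middle-priority thread and the high-priority waiter's measured wait stays close to the lock hold time; without it, the waiter would sit behind the hog indefinitely.

/* pi_check.c - minimal sketch of a priority-inheritance check.
 * Build with: gcc -O2 -pthread pi_check.c -o pi_check
 * Run as root, everything pinned to one CPU so the inversion can happen.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock;
static volatile int stop_hog;

static void set_fifo(int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp))
		fprintf(stderr, "set_fifo(%d) failed (need rtprio?)\n", prio);
}

static void *low_holder(void *arg)
{
	set_fifo(10);
	pthread_mutex_lock(&lock);
	usleep(200000);			/* "work" while holding the lock */
	pthread_mutex_unlock(&lock);
	return arg;
}

static void *mid_hog(void *arg)
{
	set_fifo(20);
	while (!stop_hog)
		;			/* burn the CPU above the holder's base prio */
	return arg;
}

static void *high_waiter(void *arg)
{
	struct timespec t0, t1;

	set_fifo(30);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_mutex_lock(&lock);	/* should boost the holder past the hog */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	pthread_mutex_unlock(&lock);
	stop_hog = 1;
	printf("waited %.0f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return arg;
}

int main(void)
{
	pthread_mutexattr_t ma;
	pthread_t lo, hi, mid;
	cpu_set_t cpus;

	CPU_ZERO(&cpus);		/* everything on CPU 0 */
	CPU_SET(0, &cpus);
	sched_setaffinity(0, sizeof(cpus), &cpus);

	pthread_mutexattr_init(&ma);
	pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&lock, &ma);

	pthread_create(&lo, NULL, low_holder, NULL);
	usleep(50000);			/* let the holder grab the lock */
	pthread_create(&hi, NULL, high_waiter, NULL);
	usleep(10000);			/* let the waiter block on it */
	pthread_create(&mid, NULL, mid_hog, NULL);

	pthread_join(hi, NULL);
	pthread_join(mid, NULL);
	pthread_join(lo, NULL);
	return 0;
}

A DEADLINE-aware variant would need SCHED_DEADLINE setup instead of the SCHED_FIFO priorities used here.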
Hi all,
On Tue, 9 Oct 2018 11:24:26 +0200
Juri Lelli <[email protected]> wrote:
> Hi all,
>
> Proxy Execution (also goes under several other names) isn't a new
> concept, it has been mentioned already in the past to this community
> (both in email discussions and at conferences [1, 2]), but no actual
> implementation that applies to a fairly recent kernel exists as of
> today (of which I'm aware of at least - happy to be proven wrong).
>
> Very broadly speaking, more info below, proxy execution enables a task
> to run using the context of some other task that is "willing" to
> participate in the mechanism, as this helps both tasks to improve
> performance (w.r.t. the latter task not participating to proxy
> execution).
First of all, thanks to Juri for working on this!
I am looking at the patchset, and I have some questions / comments
(maybe some of my questions are really stupid, I do not know... :)
To begin, I have a very high-level comment about proxy execution: I
believe proxy execution is a very powerful concept, that can be used in
many cases, not only to do inheritance on mutual exclusion (think, for
example, about pipelines of tasks: a task implementing a stage of the
pipeline can act as a proxy for the task implementing the previous
stage).
So, I would propose to make the proxy() function of the patch more
generic, and not strictly bound to mutexes. Maybe a task structure can contain a
list of tasks for which the task can act as a proxy, and we can have a
function like "I want to act as a proxy for task T" to be invoked when
a task blocks?
Luca
On Tue, Oct 09, 2018 at 11:24:26AM +0200, Juri Lelli wrote:
> Hi all,
Hi, nice series, I have a lot of details to grok, but I like the idea of PE
> Proxy Execution (also goes under several other names) isn't a new
> concept, it has been mentioned already in the past to this community
> (both in email discussions and at conferences [1, 2]), but no actual
> implementation that applies to a fairly recent kernel exists as of today
> (of which I'm aware of at least - happy to be proven wrong).
>
> Very broadly speaking, more info below, proxy execution enables a task
> to run using the context of some other task that is "willing" to
> participate in the mechanism, as this helps both tasks to improve
> performance (w.r.t. the latter task not participating to proxy
> execution).
From what I remember, PEP was originally proposed for a global EDF, and as
far as my head has been able to read this series, this implementation is
planned for not only deadline, but eventually also for sched_(rr|fifo|other)
- is that correct?
I have a bit of concern when it comes to affinities and where the
lock owner will actually execute while in the context of the proxy,
especially when you run into the situation where you have disjoint CPU
affinities for _rr tasks to ensure the deadlines.
I believe there were some papers circulated last year that looked at
something similar to this when you had overlapping or completely disjoint
CPUsets; I think it would be nice to drag them into the discussion. Has this
been considered? (if so, sorry for adding line-noise!)
Let me know if my attempt at translating brainlanguage into semi-coherent
english failed and I'll do another attempt
> This RFD/proof of concept aims at starting a discussion about how we can
> get proxy execution in mainline. But, first things first, why do we even
> care about it?
>
> I'm pretty confident with saying that the line of development that is
> mainly interested in this at the moment is the one that might benefit
> in allowing non privileged processes to use deadline scheduling [3].
> The main missing bit before we can safely relax the root privileges
> constraint is a proper priority inheritance mechanism, which translates
> to bandwidth inheritance [4, 5] for deadline scheduling, or to some sort
> of interpretation of the concept of running a task holding a (rt_)mutex
> within the bandwidth allotment of some other task that is blocked on the
> same (rt_)mutex.
>
> The concept itself is pretty general however, and it is not hard to
> foresee possible applications in other scenarios (say for example nice
> values/shares across co-operating CFS tasks or clamping values [6]).
> But I'm already digressing, so let's get back to the code that comes
> with this cover letter.
>
> One can define the scheduling context of a task as all the information
> in task_struct that the scheduler needs to implement a policy and the
> execution contex as all the state required to actually "run" the task.
> An example of scheduling context might be the information contained in
> task_struct se, rt and dl fields; affinity pertains instead to execution
> context (and I guess decideing what pertains to what is actually up for
> discussion as well ;-). Patch 04/08 implements such distinction.
I really like the idea of splitting scheduling ctx and execution context!
> As implemented in this set, a link between scheduling contexts of
> different tasks might be established when a task blocks on a mutex held
> by some other task (blocked_on relation). In this case the former task
> starts to be considered a potential proxy for the latter (mutex owner).
> One key change in how mutexes work made in here is that waiters don't
> really sleep: they are not dequeued, so they can be picked up by the
> scheduler when it runs. If a waiter (potential proxy) task is selected
> by the scheduler, the blocked_on relation is used to find the mutex
> owner and put that to run on the CPU, using the proxy task scheduling
> context.
>
> Follow the blocked-on relation:
>
> ,-> task <- proxy, picked by scheduler
> | | blocked-on
> | v
> blocked-task | mutex
> | | owner
> | v
> `-- task <- gets to run using proxy info
>
> Now, the situation is (of course) more tricky than depicted so far
> because we have to deal with all sort of possible states the mutex
> owner might be in while a potential proxy is selected by the scheduler,
> e.g. owner might be sleeping, running on a different CPU, blocked on
> another mutex itself... so, I'd kindly refer people to have a look at
> 05/08 proxy() implementation and comments.
My head hurt already.. :)
> Peter kindly shared his WIP patches with us (me, Luca, Tommaso, Claudio,
> Daniel, the Pisa gang) a while ago, but I could seriously have a decent
> look at them only recently (thanks a lot to the other guys for giving a
> first look at this way before me!). This set is thus composed of Peter's
> original patches (which I rebased on tip/sched/core as of today,
> commented and hopefully duly reported in changelogs what have I possibly
> broke) plus a bunch of additional changes that seemed required to make
> all this boot "successfully" on a virtual machine. So be advised! This
> is good only for fun ATM (I actually really hope this is good enough for
> discussion), pretty far from production I'm afraid. Share early, share
> often, right? :-)
I'll give it a spin and see if it boots, then I probably have a ton of
extra questions :)
Thanks for sharing!
-Henrik
On Wed, Oct 10, 2018 at 01:56:39PM +0200, Henrik Austad wrote:
> On Tue, Oct 09, 2018 at 11:24:26AM +0200, Juri Lelli wrote:
> > Hi all,
>
> Hi, nice series, I have a lot of details to grok, but I like the idea of PE
>
> > Proxy Execution (also goes under several other names) isn't a new
> > concept, it has been mentioned already in the past to this community
> > (both in email discussions and at conferences [1, 2]), but no actual
> > implementation that applies to a fairly recent kernel exists as of today
> > (of which I'm aware of at least - happy to be proven wrong).
> >
> > Very broadly speaking, more info below, proxy execution enables a task
> > to run using the context of some other task that is "willing" to
> > participate in the mechanism, as this helps both tasks to improve
> > performance (w.r.t. the latter task not participating to proxy
> > execution).
>
> From what I remember, PEP was originally proposed for a global EDF, and as
> far as my head has been able to read this series, this implementation is
> planned for not only deadline, but eventuall also for sched_(rr|fifo|other)
> - is that correct?
This implementation covers every scheduling class unconditionally. It
directly uses the scheduling function to order things; whereas PI
re-implements the FIFO scheduling function to order the blocked lists.
> I have a bit of concern when it comes to affinities and and where the
> lock owner will actually execute while in the context of the proxy,
> especially when you run into the situation where you have disjoint CPU
> affinities for _rr tasks to ensure the deadlines.
The affinities of execution contexts are respected.
> I believe there were some papers circulated last year that looked at
> something similar to this when you had overlapping or completely disjoint
> CPUsets I think it would be nice to drag into the discussion. Has this been
> considered? (if so, sorry for adding line-noise!)
Hurm, was that one of Bjorn's papers? Didn't that deal with AC of
disjoint/overlapping sets?
> > One can define the scheduling context of a task as all the information
> > in task_struct that the scheduler needs to implement a policy and the
> > execution contex as all the state required to actually "run" the task.
> > An example of scheduling context might be the information contained in
> > task_struct se, rt and dl fields; affinity pertains instead to execution
> > context (and I guess decideing what pertains to what is actually up for
> > discussion as well ;-). Patch 04/08 implements such distinction.
>
> I really like the idea of splitting scheduling ctx and execution context!
Right; so this whole thing relies on 'violating' affinities for
scheduling contexts, but respects affinities for execution contexts.
The basic observation is that affinities only matter when you execute
code.
This then also gives a fairly clear definition of what an execution
context is.
On 10/10/18 13:23, Peter Zijlstra wrote:
> On Wed, Oct 10, 2018 at 01:16:29PM +0200, luca abeni wrote:
> > On Wed, 10 Oct 2018 12:57:10 +0200
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > On Wed, Oct 10, 2018 at 12:34:17PM +0200, luca abeni wrote:
> > > > So, I would propose to make the proxy() function of patch more
> > > > generic, and not strictly bound to mutexes. Maybe a task structure
> > > > can contain a list of tasks for which the task can act as a proxy,
> > > > and we can have a function like "I want to act as a proxy for task
> > > > T" to be invoked when a task blocks?
> > >
> > > Certainly possible, but that's something I'd prefer to look at after
> > > it all 'works'.
> >
> > Of course :)
> > I was mentioning this idea because maybe it can have some impact on the
> > design.
> >
> > BTW, here is another "interesting" issue I had in the past with changes
> > like this one: how do we check if the patchset works as expected?
> >
> > "No crashes" is surely a requirement, but I think we also need some
> > kind of testcase that fails if the inheritance mechanism is not working
> > properly, and is successful if the inheritance works.
> >
> > Maybe we can develop some testcase based on rt-app (if noone has such a
> > testcase already)
>
> Indeed; IIRC there is a test suite that mostly covers the FIFO-PI stuff,
> that should obviously still pass. Steve, do you know where that lives?
>
> For the extended DL stuff, we'd need new tests.
This one, right?
https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/tree/src/pi_tests/pi_stress.c?h=stable/v1.0
It looks like it supports DEADLINE as well.. although I'll have to check
again what it does for the DEADLINE case.
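Whatever the DEADLINE coverage ends up looking like, the test will at least have to put its threads into SCHED_DEADLINE, and glibc has no wrapper for that, so it means calling sched_setattr() directly (and, today, doing so with the right privileges). A minimal sketch of that setup, with the struct layout and constants taken from the sched_setattr(2) man page and arbitrary runtime/deadline/period values:

/* dl_setup.c - put the calling thread into SCHED_DEADLINE. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr {			/* layout as documented in sched_setattr(2) */
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	=  10 * 1000 * 1000,	/* 10 ms of runtime */
		.sched_deadline	= 100 * 1000 * 1000,	/* every 100 ms */
		.sched_period	= 100 * 1000 * 1000,
	};

	if (syscall(__NR_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	/* From here a test would contend a lock held by another task and
	 * observe where the owner's execution time gets accounted. */
	return 0;
}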
On 10/10/18 13:56, Henrik Austad wrote:
> On Tue, Oct 09, 2018 at 11:24:26AM +0200, Juri Lelli wrote:
> > Hi all,
>
> Hi, nice series, I have a lot of details to grok, but I like the idea of PE
>
> > Proxy Execution (also goes under several other names) isn't a new
> > concept, it has been mentioned already in the past to this community
> > (both in email discussions and at conferences [1, 2]), but no actual
> > implementation that applies to a fairly recent kernel exists as of today
> > (of which I'm aware of at least - happy to be proven wrong).
> >
> > Very broadly speaking, more info below, proxy execution enables a task
> > to run using the context of some other task that is "willing" to
> > participate in the mechanism, as this helps both tasks to improve
> > performance (w.r.t. the latter task not participating to proxy
> > execution).
>
> From what I remember, PEP was originally proposed for a global EDF, and as
> far as my head has been able to read this series, this implementation is
> planned for not only deadline, but eventuall also for sched_(rr|fifo|other)
> - is that correct?
Correct, this is cross class.
> I have a bit of concern when it comes to affinities and and where the
> lock owner will actually execute while in the context of the proxy,
> especially when you run into the situation where you have disjoint CPU
> affinities for _rr tasks to ensure the deadlines.
Well, it's the scheduler context of the proxy that is potentially
moved around. The lock owner stays inside its affinity.
> I believe there were some papers circulated last year that looked at
> something similar to this when you had overlapping or completely disjoint
> CPUsets I think it would be nice to drag into the discussion. Has this been
> considered? (if so, sorry for adding line-noise!)
I think you are referring to BBB's work. Not sure it applies here, though
(considering the above).
> Let me know if my attempt at translating brainlanguage into semi-coherent
> english failed and I'll do another attempt
You succeeded! (that's assuming that I got your questions right of
course :)
>
> > This RFD/proof of concept aims at starting a discussion about how we can
> > get proxy execution in mainline. But, first things first, why do we even
> > care about it?
> >
> > I'm pretty confident with saying that the line of development that is
> > mainly interested in this at the moment is the one that might benefit
> > in allowing non privileged processes to use deadline scheduling [3].
> > The main missing bit before we can safely relax the root privileges
> > constraint is a proper priority inheritance mechanism, which translates
> > to bandwidth inheritance [4, 5] for deadline scheduling, or to some sort
> > of interpretation of the concept of running a task holding a (rt_)mutex
> > within the bandwidth allotment of some other task that is blocked on the
> > same (rt_)mutex.
> >
> > The concept itself is pretty general however, and it is not hard to
> > foresee possible applications in other scenarios (say for example nice
> > values/shares across co-operating CFS tasks or clamping values [6]).
> > But I'm already digressing, so let's get back to the code that comes
> > with this cover letter.
> >
> > One can define the scheduling context of a task as all the information
> > in task_struct that the scheduler needs to implement a policy and the
> > execution contex as all the state required to actually "run" the task.
> > An example of scheduling context might be the information contained in
> > task_struct se, rt and dl fields; affinity pertains instead to execution
> > context (and I guess decideing what pertains to what is actually up for
> > discussion as well ;-). Patch 04/08 implements such distinction.
>
> I really like the idea of splitting scheduling ctx and execution context!
>
> > As implemented in this set, a link between scheduling contexts of
> > different tasks might be established when a task blocks on a mutex held
> > by some other task (blocked_on relation). In this case the former task
> > starts to be considered a potential proxy for the latter (mutex owner).
> > One key change in how mutexes work made in here is that waiters don't
> > really sleep: they are not dequeued, so they can be picked up by the
> > scheduler when it runs. If a waiter (potential proxy) task is selected
> > by the scheduler, the blocked_on relation is used to find the mutex
> > owner and put that to run on the CPU, using the proxy task scheduling
> > context.
> >
> > Follow the blocked-on relation:
> >
> > ,-> task <- proxy, picked by scheduler
> > | | blocked-on
> > | v
> > blocked-task | mutex
> > | | owner
> > | v
> > `-- task <- gets to run using proxy info
> >
> > Now, the situation is (of course) more tricky than depicted so far
> > because we have to deal with all sort of possible states the mutex
> > owner might be in while a potential proxy is selected by the scheduler,
> > e.g. owner might be sleeping, running on a different CPU, blocked on
> > another mutex itself... so, I'd kindly refer people to have a look at
> > 05/08 proxy() implementation and comments.
>
> My head hurt already.. :)
Eh. I was wondering about putting even more details in the cover. But
then I thought there might already be enough info for this first spin.
Guess we'll have to create proper docs (after how to properly implement
this has been agreed upon?).
> > Peter kindly shared his WIP patches with us (me, Luca, Tommaso, Claudio,
> > Daniel, the Pisa gang) a while ago, but I could seriously have a decent
> > look at them only recently (thanks a lot to the other guys for giving a
> > first look at this way before me!). This set is thus composed of Peter's
> > original patches (which I rebased on tip/sched/core as of today,
> > commented and hopefully duly reported in changelogs what have I possibly
> > broke) plus a bunch of additional changes that seemed required to make
> > all this boot "successfully" on a virtual machine. So be advised! This
> > is good only for fun ATM (I actually really hope this is good enough for
> > discussion), pretty far from production I'm afraid. Share early, share
> > often, right? :-)
>
> I'll give it a spin and see if it boots, then I probably have a ton of
> extra questions :)
Thanks! (I honestly expect sparks.. but it'll give us clues about what
needs fixing)
Thanks a lot for looking at this.
Best,
- Juri
On 10/10/18 2:24 PM, Peter Zijlstra wrote:
>> I believe there were some papers circulated last year that looked at
>> something similar to this when you had overlapping or completely disjoint
>> CPUsets I think it would be nice to drag into the discussion. Has this been
>> considered? (if so, sorry for adding line-noise!)
> Hurm, was that one of Bjorn's papers? Didn't that deal with AC of
> disjoint/overlapping sets?
>
This paper:
https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf
But, unless I am wrong, there were findings after this paper that show
some imprecision in it.
Anyway, it does not analyse the locking properties, only the scheduling
of independent tasks - it is a start, but far from what we do here.
(BTW, this paper is really complex...)
The locking problem for such a case, i.e. APA with the nesting of
different locks in the locking implementation (we use a raw spinlock
here, and this method could also be used for rwlock/rwsem in the future,
nesting rw_lock(mutex_proxy(raw_spinlock()))), is an open problem from
the academic point of view.
I explained these things (nested locks and the need of APA for locking)
as "Open Problems" at RTSOPS (part of ECRTS) earlier this year:
http://rtsops2018.loria.fr/wp-content/uploads/2018/07/RTSOPS18_proceedings_final.pdf
Bjorn was there, but not only him... Baruah, Davis and Alan Burns were
there too.
There is some work being done on more complex locking in academia, like:
https://www.cs.unc.edu/~jarretc/papers/ecrts18b_long.pdf
But still, the task model used in these works is not the Linux one.
-- Daniel
Hi Luca,
On 10/10/18 13:10, luca abeni wrote:
> Hi,
>
> On Tue, 9 Oct 2018 11:24:31 +0200
> Juri Lelli <[email protected]> wrote:
> [...]
> > +migrate_task:
> [...]
> > + put_prev_task(rq, next);
> > + if (rq->curr != rq->idle) {
> > + rq->proxy = rq->idle;
> > + set_tsk_need_resched(rq->idle);
> > + /*
> > + * XXX [juril] don't we still need to migrate @next
> > to
> > + * @owner's CPU?
> > + */
> > + return rq->idle;
> > + }
>
> If I understand well, this code ends up migrating the task only if the
> CPU was previously idle? (scheduling the idle task if the CPU was not
> previously idle)
>
> Out of curiosity (I admit this is my ignorance), why is this needed?
> If I understand well, after scheduling the idle task the scheduler will
> be invoked again (because of the set_tsk_need_resched(rq->idle)) but I
> do not understand why it is not possible to migrate task "p" immediately
> (I would just check "rq->curr != p", to avoid migrating the currently
> scheduled task).
As the comment suggests, I was also puzzled by this bit.
I'd be inclined to agree with you: it seems that the only case in which
we want to "temporarily" schedule the idle task is if the proxy was
executing (so it just blocked on the mutex and is being scheduled out).
If it wasn't, we should be able to let the current "curr" continue
executing; in that case, returning it as next will mean that schedule()
takes the else branch and there isn't an actual context switch.
> > + rq->proxy = &fake_task;
[...]
We could maybe also set rq->proxy = rq->curr and return rq->curr in such
a case, instead of the below?
> > + return NULL; /* Retry task selection on _this_ CPU. */
Peter, what are we missing? :-)
On Thu, Oct 11, 2018 at 02:34:48PM +0200, Juri Lelli wrote:
> Hi Luca,
>
> On 10/10/18 13:10, luca abeni wrote:
> > Hi,
> >
> > On Tue, 9 Oct 2018 11:24:31 +0200
> > Juri Lelli <[email protected]> wrote:
> > [...]
> > > +migrate_task:
> > [...]
> > > + put_prev_task(rq, next);
> > > + if (rq->curr != rq->idle) {
> > > + rq->proxy = rq->idle;
> > > + set_tsk_need_resched(rq->idle);
> > > + /*
> > > + * XXX [juril] don't we still need to migrate @next
> > > to
> > > + * @owner's CPU?
> > > + */
> > > + return rq->idle;
> > > + }
> >
> > If I understand well, this code ends up migrating the task only if the
> > CPU was previously idle? (scheduling the idle task if the CPU was not
> > previously idle)
> >
> > Out of curiosity (I admit this is my ignorance), why is this needed?
> > If I understand well, after scheduling the idle task the scheduler will
> > be invoked again (because of the set_tsk_need_resched(rq->idle)) but I
> > do not understand why it is not possible to migrate task "p" immediately
> > (I would just check "rq->curr != p", to avoid migrating the currently
> > scheduled task).
>
> As the comment suggests, I was also puzzled by this bit.
>
> I'd be inclined to agree with you, it seems that the only case in which
> we want to "temporarily" schedule the idle task is if the proxy was
> executing (so it just blocked on the mutex and being scheduled out).
>
> If it wasn't we should be able to let the current "curr" continue
> executing, in this case returning it as next will mean that schedule
> takes the else branch and there isn't an actual context switch.
>
> > > + rq->proxy = &fake_task;
>
> [...]
>
> We can maybe also rq->proxy = rq->curr and return rq->curr in such a
> case, instead of the below?
>
> > > + return NULL; /* Retry task selection on _this_ CPU. */
>
> Peter, what are we missing? :-)
I think it was the safe and simple choice; note that we're not migrating
just a single @p, but a whole chain of @p. rq->curr must not be any of the
possible @p's. rq->idle is, per definition, not one of the @p's.
Does that make sense?
On 11/10/18 14:53, Peter Zijlstra wrote:
[...]
> I think it was the safe and simple choice; note that we're not migrating
> just a single @p, but a whole chain of @p. rq->curr must not be any of the
> possible @p's. rq->idle, is per definition not one of the @p's.
>
> Does that make sense?
It does, and I guess it is most probably the safest choice indeed. But,
just to put together a proper comment for the next version..
The chain we are migrating is composed of blocked_task(s), i.e. tasks that
blocked on a mutex owned by @p. Now, if such a chain has been built, it
means that proxy() has been called "for" @p previously, and @p might be
curr while one of its waiters might be the proxy. So, none of the
blocked_task(s) should be rq->curr (even if one of their scheduling
contexts might be in use)?
On Thu, 11 Oct 2018 14:53:25 +0200
Peter Zijlstra <[email protected]> wrote:
[...]
> > > > + if (rq->curr != rq->idle) {
> > > > + rq->proxy = rq->idle;
> > > > + set_tsk_need_resched(rq->idle);
> > > > + /*
> > > > + * XXX [juril] don't we still need to migrate
> > > > @next to
> > > > + * @owner's CPU?
> > > > + */
> > > > + return rq->idle;
> > > > + }
> > >
> > > If I understand well, this code ends up migrating the task only
> > > if the CPU was previously idle? (scheduling the idle task if the
> > > CPU was not previously idle)
> > >
> > > Out of curiosity (I admit this is my ignorance), why is this
> > > needed? If I understand well, after scheduling the idle task the
> > > scheduler will be invoked again (because of the
> > > set_tsk_need_resched(rq->idle)) but I do not understand why it is
> > > not possible to migrate task "p" immediately (I would just check
> > > "rq->curr != p", to avoid migrating the currently scheduled
> > > task).
[...]
> I think it was the safe and simple choice; note that we're not
> migrating just a single @p, but a whole chain of @p.
Ah, that's the point I was missing... Thanks for explaining, now
everything looks more clear!
But... Here is my next dumb question: once the tasks are migrated to
the other runqueue, what prevents the scheduler from migrating them
back? In particular, task p: if it is (for example) a fixed priority
task and is on this runqueue, it is probably because the FP invariant
wants this... So, the push mechanism might end up migrating p back to
this runqueue soon... No?
Another doubt: if I understand well, when a task p "blocks" on a mutex
the proxy mechanism migrates it (and the whole chain of blocked tasks)
to the owner's core... Right?
Now, I understand why this is simpler to implement, but from the
schedulability point of view shouldn't we migrate the owner to p's core
instead?
Thanks,
Luca
> rq->curr must
> not be any of the possible @p's. rq->idle, is per definition not one
> of the @p's.
>
> Does that make sense?
On 12/10/18 09:22, luca abeni wrote:
> On Thu, 11 Oct 2018 14:53:25 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> [...]
> > > > > + if (rq->curr != rq->idle) {
> > > > > + rq->proxy = rq->idle;
> > > > > + set_tsk_need_resched(rq->idle);
> > > > > + /*
> > > > > + * XXX [juril] don't we still need to migrate
> > > > > @next to
> > > > > + * @owner's CPU?
> > > > > + */
> > > > > + return rq->idle;
> > > > > + }
> > > >
> > > > If I understand well, this code ends up migrating the task only
> > > > if the CPU was previously idle? (scheduling the idle task if the
> > > > CPU was not previously idle)
> > > >
> > > > Out of curiosity (I admit this is my ignorance), why is this
> > > > needed? If I understand well, after scheduling the idle task the
> > > > scheduler will be invoked again (because of the
> > > > set_tsk_need_resched(rq->idle)) but I do not understand why it is
> > > > not possible to migrate task "p" immediately (I would just check
> > > > "rq->curr != p", to avoid migrating the currently scheduled
> > > > task).
> [...]
> > I think it was the safe and simple choice; note that we're not
> > migrating just a single @p, but a whole chain of @p.
>
> Ah, that's the point I was missing... Thanks for explaining, now
> everything looks more clear!
>
>
> But... Here is my next dumb question: once the tasks are migrated to
> the other runqueue, what prevents the scheduler from migrating them
> back? In particular, task p: if it is (for example) a fixed priority
> task an is on this runqueue, it is probably because the FP invariant
> wants this... So, the push mechanism might end up migrating p back to
> this runqueue soon... No?
Not if p is going to be proxying for owner on owner's rq.
OTOH, I guess we might have counter-migrations generated by push/pull
decisions. Maybe we should remove potential proxies from pushable?
We'd still have the same problem for FAIR though.
In general it makes sense to me that potential proxies shouldn't
participate in load balancing while waiting to be activated by the
mutex owner; they are basically sleeping, even if they are not.
> Another doubt: if I understand well, when a task p "blocks" on a mutex
> the proxy mechanism migrates it (and the whole chain of blocked tasks)
> to the owner's core... Right?
> Now, I understand why this is simpler to implement, but from the
> schedulability point of view shouldn't we migrate the owner to p's core
> instead?
Guess the most important reason is that we need to respect the owner's
affinity: p (and the rest of the list) might have an affinity mask that
doesn't work well with the owner's.
Dear Juri,
just a small comment for the next round of patches (I guess after
OSPM)...
On 091018, 11:24, Juri Lelli wrote:
> From: Peter Zijlstra <[email protected]>
>
> Lets define the scheduling context as all the scheduler state in
> task_struct and the execution context as all state required to run the
> task.
>
> Currently both are intertwined in task_struct. We want to logically
> split these such that we can run the execution context of one task
> with the scheduling context of another.
>
> To this purpose introduce rq::proxy to point to the task_struct used
> for scheduler state and preserve rq::curr to denote the execution
> context.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> [added lot of comments/questions - identifiable by XXX]
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> kernel/sched/core.c | 62 ++++++++++++++++++++++++++++++++++----------
> kernel/sched/fair.c | 4 +++
> kernel/sched/sched.h | 30 ++++++++++++++++++++-
> 3 files changed, 82 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fe0223121883..d3c481b734dd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -224,12 +224,13 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
> {
> struct rq *rq = container_of(timer, struct rq, hrtick_timer);
> struct rq_flags rf;
> + struct task_struct *curr = rq->proxy;
You may want to use different naming for these local variables (e.g.
"proxy") to help the reader avoid confusing curr (i.e. the scheduling ctx)
with rq::curr (i.e. the execution ctx).
Best regards,
Claudio