PATCH 1~2 mainly fix the deadline PI crash that happens in
enqueue_task_dl()->rt_mutex_get_top_task(), caused by the top waiter
being updated without holding the rq lock.

PATCH 3~6 mainly fix the deadline PI issue that happens when
enqueue_task_dl() obtains @pi_task and then accesses pi_task's data
(dl.dl_runtime and dl.dl_period) without holding any of pi_task's
locks (pi_lock or rq lock). PATCH 3~4 are split out to keep PATCH 5
smaller and easier to review.

The two issues can be fixed with the same logic, so they are bound
together as one series.
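For context, the unprotected access that PATCH 5 removes looks like this
in the pre-patch enqueue_task_dl() (excerpted from the hunk in PATCH 5
below; the comment is added here purely for illustration):

        struct task_struct *pi_task = rt_mutex_get_top_task(p);
        struct sched_dl_entity *pi_se = &p->dl;

        if (pi_task && p->dl.dl_boosted && dl_prio(pi_task->normal_prio)) {
                /*
                 * pi_task's dl parameters are read here without holding
                 * any of pi_task's locks (pi_lock or rq lock), so they
                 * can change underneath us.
                 */
                pi_se = &pi_task->dl;
        }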
Xunlei Pang (6):
rtmutex: Deboost before waking up the top waiter
sched/rtmutex/deadline: Fix a PI crash for deadline tasks
rtmutex: Move "rt_mutex_waiter" definition to
"include/linux/rtmutex.h"
sched: Move dl_policy() to "include/linux/sched.h"
sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
sched/deadline/rtmutex: Don't miss the dl_runtime/dl_period update
include/linux/init_task.h | 3 +-
include/linux/rtmutex.h | 29 +++++++++++++-
include/linux/sched.h | 10 ++++-
include/linux/sched/deadline.h | 22 +++++++++++
kernel/fork.c | 1 +
kernel/futex.c | 5 +--
kernel/locking/rtmutex.c | 84 ++++++++++++++++++++++++++++-------------
kernel/locking/rtmutex_common.h | 22 +----------
kernel/sched/core.c | 2 +
kernel/sched/deadline.c | 10 +++--
kernel/sched/sched.h | 4 --
11 files changed, 132 insertions(+), 60 deletions(-)
--
1.8.3.1
Currently, dl tasks return at the very beginning of
rt_mutex_adjust_prio_chain() in the !detect_deadlock case:
        if (waiter->prio == task->prio) {
                if (!detect_deadlock)
                        goto out_unlock_pi; // out here
                else
                        requeue = false;
        }
Since the deadline value of a blocked deadline task (waiter) never
changes unless it changes sched_class (and thus prio), this looks
reasonable. However, it misses the chance to update the
rt_mutex_waiter's "dl_runtime(period)_copy" when a waiter updates its
deadline parameters (dl_runtime, dl_period) or a boosted waiter changes
to a !deadline class.

Thus, keep deadline tasks from taking this early exit by adding the
!dl_task(task) condition.
Signed-off-by: Xunlei Pang <[email protected]>
---
kernel/locking/rtmutex.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 4d14eee..f3b7d29 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -554,7 +554,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
* enabled we continue, but stop the requeueing in the chain
* walk.
*/
- if (waiter->prio == task->prio) {
+ if (waiter->prio == task->prio && !dl_task(task)) {
if (!detect_deadlock)
goto out_unlock_pi;
else
--
1.8.3.1
We should deboost before waking the high-prio task such that
we don't run two tasks with the 'same' priority.
This patch fixes the logic and introduces rt_mutex_postunlock() to
refactor the common code.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Xunlei Pang <[email protected]>
---
kernel/futex.c | 5 ++---
kernel/locking/rtmutex.c | 28 ++++++++++++++++++++++++----
kernel/locking/rtmutex_common.h | 1 +
3 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 4e1a53e..4ae3523 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1524,9 +1524,8 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
* scheduled away before the wake up can take place.
*/
spin_unlock(&hb->lock);
- wake_up_q(&wake_q);
- if (deboost)
- rt_mutex_adjust_prio(current);
+
+ rt_mutex_postunlock(&wake_q, deboost);
return 0;
}
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 3e74660..42bc59b 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1390,12 +1390,32 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
} else {
bool deboost = slowfn(lock, &wake_q);
- wake_up_q(&wake_q);
+ rt_mutex_postunlock(&wake_q, deboost);
+ }
+}
+
- /* Undo pi boosting if necessary: */
- if (deboost)
- rt_mutex_adjust_prio(current);
+/*
+ * Undo pi boosting (if necessary) and wake top waiter.
+ */
+void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+{
+ /*
+ * We should deboost before waking the high-prio task such that
+ * we don't run two tasks with the 'same' priority. This however
+ * can lead to prio-inversion if we would get preempted after
+ * the deboost but before waking our high-prio task, hence the
+ * preempt_disable.
+ */
+ if (deboost) {
+ preempt_disable();
+ rt_mutex_adjust_prio(current);
}
+
+ wake_up_q(wake_q);
+
+ if (deboost)
+ preempt_enable();
}
/**
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 4f5f83c..93b0924 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -111,6 +111,7 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
struct wake_q_head *wqh);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
extern void rt_mutex_adjust_prio(struct task_struct *task);
#ifdef CONFIG_DEBUG_RT_MUTEXES
--
1.8.3.1
We access @pi_task's data without holding any lock in enqueue_task_dl();
although we check the "dl_prio(pi_task->normal_prio)" condition, that is
not enough. "dl_period" and "dl_runtime" of "pi_task->dl" can change.
For example, if pi_task changes to a !deadline class, dl_runtime will be
cleared to zero, and we will then hit an endless loop in
replenish_dl_entity() below:
        while (dl_se->runtime <= 0) {
                dl_se->deadline += pi_se->dl_period;
                dl_se->runtime += pi_se->dl_runtime;
        }
or hit "BUG_ON(pi_se->dl_runtime <= 0)" earlier.

This happens because, without holding any of the top waiter's locks,
there is no such guarantee.

To solve this, we add new members to "rt_mutex_waiter" and have
enqueue_task_dl() access this structure instead of the task_struct.
Specifically, we add:
struct rt_mutex_waiter {
        ... ...
        int prio;
+       /* Updated under waiter's pi_lock and rt_mutex lock */
+       u64 dl_runtime, dl_period;
+       /*
+        * Copied directly from above.
+        * Updated under owner's pi_lock, rq lock, and rt_mutex lock.
+        */
+       u64 dl_runtime_copy, dl_period_copy;
};
We must update "dl_runtime_copy" and "dl_period_copy" under the rt_mutex
lock, because they are copied from rt_mutex_waiter's "dl_runtime" and
"dl_period", which are protected by that same rt_mutex lock. We update
the copy in rt_mutex_update_copy(), introduced previously, as it is
called by rt_mutex_setprio(), which holds the owner's pi_lock, rq lock,
and rt_mutex lock.

"dl_runtime_copy" and "dl_period_copy" are updated under the owner's
pi_lock, rq lock, and rt_mutex lock, and the waiter is reliably blocked
on the rtmutex, so they can be safely accessed by enqueue_task_dl(),
which holds the rq lock.

Note that, since we now hand a rt_mutex_waiter to enqueue_task_dl(), we
add a new "struct sched_dl_entity_fake" that masquerades as a real
sched_dl_entity; this is fine as long as its "dl_runtime" and "dl_period"
fields stay in the same order as their counterparts in sched_dl_entity.
Also adjust the location of "dl_period" in sched_dl_entity to make
sched_dl_entity_fake smaller.
Signed-off-by: Xunlei Pang <[email protected]>
---
include/linux/rtmutex.h | 7 +++++++
include/linux/sched.h | 2 +-
include/linux/sched/deadline.h | 20 ++++++++++++++++++++
kernel/locking/rtmutex.c | 23 +++++++++++++++++++++++
kernel/sched/deadline.c | 10 +++++++---
5 files changed, 58 insertions(+), 4 deletions(-)
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index f9bf40a..56e2aaf 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -58,6 +58,13 @@ struct rt_mutex_waiter {
struct rt_mutex *deadlock_lock;
#endif
int prio;
+ /* Updated under waiter's pi_lock and rt_mutex lock */
+ u64 dl_runtime, dl_period;
+ /*
+ * Copied directly from above.
+ * Updated under owner's pi_lock, rq lock, and rt_mutex lock.
+ */
+ u64 dl_runtime_copy, dl_period_copy;
};
struct hrtimer_sleeper;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ad3522..960465c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1323,8 +1323,8 @@ struct sched_dl_entity {
* the next sched_setattr().
*/
u64 dl_runtime; /* maximum runtime for each instance */
- u64 dl_deadline; /* relative deadline of each instance */
u64 dl_period; /* separation of two instances (period) */
+ u64 dl_deadline; /* relative deadline of each instance */
u64 dl_bw; /* dl_runtime / dl_deadline */
/*
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index e8304d4..ca5bae5 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -2,6 +2,16 @@
#define _SCHED_DEADLINE_H
/*
+ * Used by enqueue_task_dl() for PI cases to disguise sched_dl_entity,
+ * thus must be the same order as the counterparts in sched_dl_entity.
+ */
+struct sched_dl_entity_fake {
+ struct rb_node rb_node;
+ u64 dl_runtime;
+ u64 dl_period;
+};
+
+/*
* SCHED_DEADLINE tasks has negative priorities, reflecting
* the fact that any of them has higher prio than RT and
* NORMAL/BATCH tasks.
@@ -28,4 +38,14 @@ static inline bool dl_time_before(u64 a, u64 b)
extern void rt_mutex_update_copy(struct task_struct *p);
+#ifdef CONFIG_RT_MUTEXES
+extern struct rt_mutex_waiter *rt_mutex_get_top_waiter(struct task_struct *p);
+#else
+static inline
+struct rt_mutex_waiter *rt_mutex_get_top_waiter(struct task_struct *p)
+{
+ return NULL;
+}
+#endif
+
#endif /* _SCHED_DEADLINE_H */
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 00c6560..4d14eee 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -280,6 +280,15 @@ struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
struct rt_mutex_waiter, pi_tree_entry)->task;
}
+struct rt_mutex_waiter *rt_mutex_get_top_waiter(struct task_struct *task)
+{
+ if (!task->pi_waiters_leftmost_copy)
+ return NULL;
+
+ return rb_entry(task->pi_waiters_leftmost_copy,
+ struct rt_mutex_waiter, pi_tree_entry);
+}
+
/*
* Called by sched_setscheduler() to get the priority which will be
* effective after the change.
@@ -299,7 +308,17 @@ int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
*/
void rt_mutex_update_copy(struct task_struct *p)
{
+ struct rt_mutex_waiter *top_waiter;
+
+ /* We must always update it, even if NULL */
p->pi_waiters_leftmost_copy = p->pi_waiters_leftmost;
+
+ if (!task_has_pi_waiters(p))
+ return;
+
+ top_waiter = task_top_pi_waiter(p);
+ top_waiter->dl_runtime_copy = top_waiter->dl_runtime;
+ top_waiter->dl_period_copy = top_waiter->dl_period;
}
/*
@@ -632,6 +651,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/* [7] Requeue the waiter in the lock waiter tree. */
rt_mutex_dequeue(lock, waiter);
waiter->prio = task->prio;
+ waiter->dl_runtime = dl_policy(task->policy) ? task->dl.dl_runtime : 0;
+ waiter->dl_period = dl_policy(task->policy) ? task->dl.dl_period : 0;
rt_mutex_enqueue(lock, waiter);
/* [8] Release the task */
@@ -902,6 +923,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
waiter->task = task;
waiter->lock = lock;
waiter->prio = task->prio;
+ waiter->dl_runtime = dl_policy(task->policy) ? task->dl.dl_runtime : 0;
+ waiter->dl_period = dl_policy(task->policy) ? task->dl.dl_period : 0;
/* Get the top priority waiter on the lock */
if (rt_mutex_has_waiters(lock))
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index affd97e..224aa64 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -932,8 +932,9 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
- struct task_struct *pi_task = rt_mutex_get_top_task(p);
+ struct rt_mutex_waiter *top_waiter = rt_mutex_get_top_waiter(p);
struct sched_dl_entity *pi_se = &p->dl;
+ struct sched_dl_entity_fake pi_se_fake;
/*
* Use the scheduling parameters of the top pi-waiter
@@ -941,8 +942,11 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
* smaller than our one... OTW we keep our runtime and
* deadline.
*/
- if (pi_task && p->dl.dl_boosted && dl_prio(pi_task->normal_prio)) {
- pi_se = &pi_task->dl;
+ if (top_waiter && p->dl.dl_boosted && top_waiter->dl_runtime_copy) {
+ BUG_ON(top_waiter->dl_period_copy == 0);
+ pi_se_fake.dl_runtime = top_waiter->dl_runtime_copy;
+ pi_se_fake.dl_period = top_waiter->dl_period_copy;
+ pi_se = (struct sched_dl_entity *)&pi_se_fake;
} else if (!dl_prio(p->normal_prio)) {
/*
* Special case in which we have a !SCHED_DEADLINE task
--
1.8.3.1
Rtmutex code will need dl_policy(), so make it visible to rtmutex.
Signed-off-by: Xunlei Pang <[email protected]>
---
include/linux/sched.h | 5 +++++
kernel/sched/sched.h | 4 ----
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 332a6b5..8ad3522 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1309,6 +1309,11 @@ struct sched_rt_entity {
#endif
};
+static inline int dl_policy(int policy)
+{
+ return policy == SCHED_DEADLINE;
+}
+
struct sched_dl_entity {
struct rb_node rb_node;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7cbad7..0e6ea02 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -99,10 +99,6 @@ static inline int rt_policy(int policy)
return policy == SCHED_FIFO || policy == SCHED_RR;
}
-static inline int dl_policy(int policy)
-{
- return policy == SCHED_DEADLINE;
-}
static inline bool valid_policy(int policy)
{
return idle_policy(policy) || fair_policy(policy) ||
--
1.8.3.1
A crash happened while I was playing with deadline PI rtmutex.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
PGD 232a75067 PUD 230947067 PMD 0
Oops: 0000 [#1] SMP
CPU: 1 PID: 10994 Comm: a.out Not tainted
Call Trace:
[<ffffffff810cf8aa>] ? enqueue_task_dl+0x2a/0x320
[<ffffffff810b658c>] enqueue_task+0x2c/0x80
[<ffffffff810ba763>] activate_task+0x23/0x30
[<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
[<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
[<ffffffff8164e783>] __schedule+0xd3/0x900
[<ffffffff8164efd9>] schedule+0x29/0x70
[<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
[<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
[<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
[<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
[<ffffffff810cfa15>] ? enqueue_task_dl+0x195/0x320
[<ffffffff810d0bac>] ? prio_changed_dl+0x6c/0x90
[<ffffffff810ed8b0>] do_futex+0x190/0x5b0
[<ffffffff810edd50>] SyS_futex+0x80/0x180
[<ffffffff8165a089>] system_call_fastpath+0x16/0x1b
RIP [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
This happens because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi() are
protected only by pi_lock when operating on the pi waiters, while
rt_mutex_get_top_task() (later replaced by rt_mutex_get_top_waiter() in
this series, which has the same issue) accesses them with the rq lock
held but without holding pi_lock.
To tackle this, we introduce a new "pi_waiters_leftmost_copy" in
task_struct, and add a new function, rt_mutex_update_copy(), to do the
copy. It is called by rt_mutex_setprio(), which holds the owner's
pi_lock, rq lock, and rt_mutex lock. The one exception is that
rt_mutex_setprio() can also be called without the rtmutex locked, after
mark_wakeup_next_waiter(), via rt_mutex_adjust_prio(); that would
require a new function in mark_wakeup_next_waiter() to take the rq lock
first and then call rt_mutex_update_copy().

We avoid adding that new logic by calling __rt_mutex_adjust_prio()
directly in mark_wakeup_next_waiter() and removing the old
rt_mutex_adjust_prio() call outside the lock. Since this moves the
deboost point, we also move preempt_disable() to before unlocking the
rtmutex, to avoid the current process being preempted (due to the
earlier deboost) before it finishes wake_up_q().
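As a purely illustrative sketch of the locking rule above (not part of
the patch), rt_mutex_update_copy() could document its requirement with a
lockdep assertion; the caller, rt_mutex_setprio(), already holds the
owner's pi_lock, rq lock, and rt_mutex lock:

        /*
         * Hypothetical sketch: the copy is only written with pi_lock,
         * rq lock, and rt_mutex lock held, so readers that hold just
         * the rq lock (enqueue_task_dl()) see a stable value.
         */
        void rt_mutex_update_copy(struct task_struct *p)
        {
                lockdep_assert_held(&p->pi_lock);

                p->pi_waiters_leftmost_copy = p->pi_waiters_leftmost;
        }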
Originally-From: Peter Zijlstra <[email protected]>
Signed-off-by: Xunlei Pang <[email protected]>
---
include/linux/init_task.h | 3 ++-
include/linux/sched.h | 3 +++
include/linux/sched/deadline.h | 2 ++
kernel/fork.c | 1 +
kernel/locking/rtmutex.c | 61 +++++++++++++++++-------------------------
kernel/sched/core.c | 2 ++
6 files changed, 35 insertions(+), 37 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d4..7be5a83 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -162,7 +162,8 @@ extern struct task_group root_task_group;
#ifdef CONFIG_RT_MUTEXES
# define INIT_RT_MUTEXES(tsk) \
.pi_waiters = RB_ROOT, \
- .pi_waiters_leftmost = NULL,
+ .pi_waiters_leftmost = NULL, \
+ .pi_waiters_leftmost_copy = NULL,
#else
# define INIT_RT_MUTEXES(tsk)
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45e848c..332a6b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1621,7 +1621,10 @@ struct task_struct {
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
struct rb_root pi_waiters;
+ /* Updated under pi_lock and rtmutex lock */
struct rb_node *pi_waiters_leftmost;
+ /* Updated under pi_lock, rq_lock, and rtmutex lock */
+ struct rb_node *pi_waiters_leftmost_copy;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
#endif
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 9089a2a..e8304d4 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -26,4 +26,6 @@ static inline bool dl_time_before(u64 a, u64 b)
return (s64)(a - b) < 0;
}
+extern void rt_mutex_update_copy(struct task_struct *p);
+
#endif /* _SCHED_DEADLINE_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index a453546..4b847e4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1225,6 +1225,7 @@ static void rt_mutex_init_task(struct task_struct *p)
#ifdef CONFIG_RT_MUTEXES
p->pi_waiters = RB_ROOT;
p->pi_waiters_leftmost = NULL;
+ p->pi_waiters_leftmost_copy = NULL;
p->pi_blocked_on = NULL;
#endif
}
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 42bc59b..00c6560 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -273,10 +273,11 @@ int rt_mutex_getprio(struct task_struct *task)
struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
{
- if (likely(!task_has_pi_waiters(task)))
+ if (!task->pi_waiters_leftmost_copy)
return NULL;
- return task_top_pi_waiter(task)->task;
+ return rb_entry(task->pi_waiters_leftmost_copy,
+ struct rt_mutex_waiter, pi_tree_entry)->task;
}
/*
@@ -285,12 +286,20 @@ struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
*/
int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
{
- if (!task_has_pi_waiters(task))
+ struct task_struct *top_task = rt_mutex_get_top_task(task);
+
+ if (!top_task)
return newprio;
- if (task_top_pi_waiter(task)->task->prio <= newprio)
- return task_top_pi_waiter(task)->task->prio;
- return newprio;
+ return min(top_task->prio, newprio);
+}
+
+/*
+ * Must hold rq lock, p's pi_lock, and rt_mutex lock.
+ */
+void rt_mutex_update_copy(struct task_struct *p)
+{
+ p->pi_waiters_leftmost_copy = p->pi_waiters_leftmost;
}
/*
@@ -307,24 +316,6 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
}
/*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-void rt_mutex_adjust_prio(struct task_struct *task)
-{
- unsigned long flags;
-
- raw_spin_lock_irqsave(&task->pi_lock, flags);
- __rt_mutex_adjust_prio(task);
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-}
-
-/*
* Deadlock detection is conditional:
*
* If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
@@ -987,6 +978,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
* lock->wait_lock.
*/
rt_mutex_dequeue_pi(current, waiter);
+ __rt_mutex_adjust_prio(current);
/*
* As we are waking up the top waiter, and the waiter stays
@@ -1325,6 +1317,15 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
*/
mark_wakeup_next_waiter(wake_q, lock);
+ /*
+ * We should deboost before waking the high-prio task such that
+ * we don't run two tasks with the 'same' priority. This however
+ * can lead to prio-inversion if we would get preempted after
+ * the deboost but before waking our high-prio task, hence the
+ * preempt_disable before unlock.
+ */
+ preempt_disable();
+
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* check PI boosting */
@@ -1400,18 +1401,6 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
*/
void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
{
- /*
- * We should deboost before waking the high-prio task such that
- * we don't run two tasks with the 'same' priority. This however
- * can lead to prio-inversion if we would get preempted after
- * the deboost but before waking our high-prio task, hence the
- * preempt_disable.
- */
- if (deboost) {
- preempt_disable();
- rt_mutex_adjust_prio(current);
- }
-
wake_up_q(wake_q);
if (deboost)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1159423..4a2c79d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3480,6 +3480,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
goto out_unlock;
}
+ rt_mutex_update_copy(p);
+
trace_sched_pi_setprio(p, prio);
oldprio = p->prio;
--
1.8.3.1
The deadline code will need "rt_mutex_waiter", so make it visible to the deadline code.
Signed-off-by: Xunlei Pang <[email protected]>
---
include/linux/rtmutex.h | 22 +++++++++++++++++++++-
kernel/locking/rtmutex_common.h | 21 ---------------------
2 files changed, 21 insertions(+), 22 deletions(-)
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 1abba5c..f9bf40a 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -39,7 +39,27 @@ struct rt_mutex {
#endif
};
-struct rt_mutex_waiter;
+/*
+ * This is the control structure for tasks blocked on a rt_mutex,
+ * which is allocated on the kernel stack on of the blocked task.
+ *
+ * @tree_entry: pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry: pi node to enqueue into the mutex owner waiters tree
+ * @task: task reference to the blocked task
+ */
+struct rt_mutex_waiter {
+ struct rb_node tree_entry;
+ struct rb_node pi_tree_entry;
+ struct task_struct *task;
+ struct rt_mutex *lock;
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+ unsigned long ip;
+ struct pid *deadlock_task_pid;
+ struct rt_mutex *deadlock_lock;
+#endif
+ int prio;
+};
+
struct hrtimer_sleeper;
#ifdef CONFIG_DEBUG_RT_MUTEXES
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 93b0924..c94adcd 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -15,27 +15,6 @@
#include <linux/rtmutex.h>
/*
- * This is the control structure for tasks blocked on a rt_mutex,
- * which is allocated on the kernel stack on of the blocked task.
- *
- * @tree_entry: pi node to enqueue into the mutex waiters tree
- * @pi_tree_entry: pi node to enqueue into the mutex owner waiters tree
- * @task: task reference to the blocked task
- */
-struct rt_mutex_waiter {
- struct rb_node tree_entry;
- struct rb_node pi_tree_entry;
- struct task_struct *task;
- struct rt_mutex *lock;
-#ifdef CONFIG_DEBUG_RT_MUTEXES
- unsigned long ip;
- struct pid *deadlock_task_pid;
- struct rt_mutex *deadlock_lock;
-#endif
- int prio;
-};
-
-/*
* Various helpers to access the waiters-tree:
*/
static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
--
1.8.3.1
On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
> We access @pi_task's data without any lock in enqueue_task_dl(), though
> checked "dl_prio(pi_task->normal_prio)" condition, that's not enough.
The proper fix is to ensure that pi_task is guaranteed to be blocked.
On 2016/04/14 at 23:31, Peter Zijlstra wrote:
> On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
>> We access @pi_task's data without any lock in enqueue_task_dl(), though
>> checked "dl_prio(pi_task->normal_prio)" condition, that's not enough.
> The proper fix is to ensure that pi_task is guaranteed to be blocked.
Even if pi_task was blocked, its parameters are still allowed to be changed,
so we have to do that. Did I miss something?
Regards,
Xunlei
On 2016/04/15 at 09:58, Xunlei Pang wrote:
> On 2016/04/14 at 23:31, Peter Zijlstra wrote:
>> On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
>>> We access @pi_task's data without any lock in enqueue_task_dl(), though
>>> checked "dl_prio(pi_task->normal_prio)" condition, that's not enough.
>> The proper fix is to ensure that pi_task is guaranteed to be blocked.
> Even if pi_task was blocked, its parameters are still allowed to be changed,
> so we have to do that. Did I miss something?
>
> Regards,
> Xunlei
Fortunately, I just reproduced it in an overnight test, so it really does happen in reality, as I thought.
[50697.042391] kernel BUG at kernel/sched/deadline.c:398!
[50697.048212] invalid opcode: 0000 [#1] SMP
[50697.137676] CPU: 1 PID: 10676 Comm: bugon Tainted: G W 4.6.0-rc3+ #19
[50697.146250] Hardware name: Intel Corporation Broadwell Client platform/SawTooth Peak, BIOS BDW-E1R1.86C.0127.R00.150
8062034 08/06/2015
[50697.159942] task: ffff880089d72b80 ti: ffff880074bb4000 task.ti: ffff880074bb4000
[50697.168420] RIP: 0010:[<ffffffff810cb4ef>] [<ffffffff810cb4ef>] replenish_dl_entity+0xff/0x110
[50697.178292] RSP: 0000:ffff88016ec43d90 EFLAGS: 00010046
[50697.184307] RAX: 0000000000000001 RBX: ffff880089d72d50 RCX: 0000000000000001
[50697.192390] RDX: 0000000000000010 RSI: ffff8800719858d0 RDI: ffff880089d72d50
[50697.200473] RBP: ffff88016ec43da8 R08: 0000000000000001 R09: 0000000000000097
[50697.208556] R10: 0000000057102e72 R11: 000000000f9e6fd7 R12: ffff88016ec56e40
[50697.216638] R13: ffff88016ec56e40 R14: 0000000000016e40 R15: ffff880089d72d50
[50697.224721] FS: 00007f14e788b700(0000) GS:ffff88016ec40000(0000) knlGS:0000000000000000
[50697.233887] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50697.240396] CR2: 000055be08240c68 CR3: 000000008a5d5000 CR4: 00000000003406e0
[50697.248478] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[50697.256561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[50697.264643] Stack:
[50697.266917] ffff88016ec56e40 ffff880089d72b80 0000000000000010 ffff88016ec43de8
[50697.275338] ffffffff810cbfe4 ffff88016ec43de8 ffff880089d72b80 ffff88016ec56e40
[50697.283757] 00000000000188c5 ffff880089d72d50 ffff88016ec4f228 ffff88016ec43e18
[50697.292175] Call Trace:
[50697.294943] <IRQ>
[50697.297122] [<ffffffff810cbfe4>] enqueue_task_dl+0x264/0x340
[50697.303838] [<ffffffff810cc453>] update_curr_dl+0x1c3/0x1f0
[50697.310249] [<ffffffff810cc51c>] task_tick_dl+0x1c/0x80
[50697.316265] [<ffffffff810b66ac>] scheduler_tick+0x5c/0xe0
[50697.322480] [<ffffffff811060d0>] ? tick_sched_do_timer+0x50/0x50
[50697.329383] [<ffffffff810f60e1>] update_process_times+0x51/0x60
[50697.336188] [<ffffffff81105a25>] tick_sched_handle.isra.17+0x25/0x60
[50697.343486] [<ffffffff8110610d>] tick_sched_timer+0x3d/0x70
[50697.349895] [<ffffffff810f6c93>] __hrtimer_run_queues+0xf3/0x270
[50697.356797] [<ffffffff810f7168>] hrtimer_interrupt+0xa8/0x1a0
[50697.363404] [<ffffffff81053de5>] local_apic_timer_interrupt+0x35/0x60
[50697.370799] [<ffffffff816b7a1d>] smp_apic_timer_interrupt+0x3d/0x50
[50697.377996] [<ffffffff816b5b5c>] apic_timer_interrupt+0x8c/0xa0
[50697.384798] <EOI>
[50697.386974] Code: a9 48 c7 c7 38 5f a0 81 31 c0 48 89 75 e8 c6 05 5c 48 c8 00 01 e8 74 20 0c 00 49 8b 84 24 28 09 00 00 8b 4b 54 48 8b 75 e8 eb c4 <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[50697.409201] RIP [<ffffffff810cb4ef>] replenish_dl_entity+0xff/0x110
[50697.416409] RSP <ffff88016ec43d90>
[50697.433683] ---[ end trace da6e1e42babefb7f ]---
[50697.438913] Kernel panic - not syncing: Fatal exception in interrupt
[50698.484088] Shutting down cpus with NMI
[50698.488434] Kernel Offset: disabled
[50698.492383] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
On Thu, 14 Apr 2016, Xunlei Pang wrote:
> We should deboost before waking the high-prio task such that
> we don't run two tasks with the 'same' priority.
No. This is fundamentaly broken.
T1 (prio 0) lock(X)
--> preemption
T2 (prio 10) lock(X)
boost(T1)
schedule()
T1 (prio 10) unlock(X)
deboost()
(prio 0)
--> preemption
T3 (prio 5) ....
Classic priority inversion enabled by a mechanism to avoid it. Brilliant
stuff.
Thanks,
tglx
On 2016/04/18 at 16:23, Thomas Gleixner wrote:
> On Thu, 14 Apr 2016, Xunlei Pang wrote:
>> We should deboost before waking the high-prio task such that
>> we don't run two tasks with the 'same' priority.
> No. This is fundamentaly broken.
>
> T1 (prio 0) lock(X)
>
> --> preemption
>
> T2 (prio 10) lock(X)
> boost(T1)
> schedule()
>
> T1 (prio 10) unlock(X)
We add a preempt_disable() before deboost to avoid the breakage,
there's also some comment about this in the patch's code.
Regards,
Xunlei
> deboost()
> (prio 0)
>
> --> preemption
>
> T3 (prio 5) ....
>
> Classic priority inversion enabled by a mechanism to avoid it. Brilliant
> stuff.
>
> Thanks,
>
> tglx
On Mon, 18 Apr 2016, Xunlei Pang wrote:
> On 2016/04/18 at 16:23, Thomas Gleixner wrote:
> > On Thu, 14 Apr 2016, Xunlei Pang wrote:
> >> We should deboost before waking the high-prio task such that
> >> we don't run two tasks with the 'same' priority.
> > No. This is fundamentaly broken.
> >
> > T1 (prio 0) lock(X)
> >
> > --> preemption
> >
> > T2 (prio 10) lock(X)
> > boost(T1)
> > schedule()
> >
> > T1 (prio 10) unlock(X)
>
> We add a preempt_disable() before deboost to avoid the breakage,
> there's also some comment about this in the patch's code.
So the changelog is useless and misleading. Neither does it explain what's
wrong with having two tasks with the same priority in running state.
Thanks,
tglx
On 2016/04/18 at 17:02, Thomas Gleixner wrote:
> On Mon, 18 Apr 2016, Xunlei Pang wrote:
>> On 2016/04/18 at 16:23, Thomas Gleixner wrote:
>>> On Thu, 14 Apr 2016, Xunlei Pang wrote:
>>>> We should deboost before waking the high-prio task such that
>>>> we don't run two tasks with the 'same' priority.
>>> No. This is fundamentaly broken.
>>>
>>> T1 (prio 0) lock(X)
>>>
>>> --> preemption
>>>
>>> T2 (prio 10) lock(X)
>>> boost(T1)
>>> schedule()
>>>
>>> T1 (prio 10) unlock(X)
>> We add a preempt_disable() before deboost to avoid the breakage,
>> there's also some comment about this in the patch's code.
> So the changelog is useless and misleading. Neither does it explain what's
> wrong with having two tasks with the same priority in running state.
Sorry about that, will improve it.
Regards,
Xunlei
>
> Thanks,
>
> tglx
>
>
>
>
On Mon, Apr 18, 2016 at 11:02:28AM +0200, Thomas Gleixner wrote:
> On Mon, 18 Apr 2016, Xunlei Pang wrote:
> > On 2016/04/18 at 16:23, Thomas Gleixner wrote:
> > > On Thu, 14 Apr 2016, Xunlei Pang wrote:
> > >> We should deboost before waking the high-prio task such that
> > >> we don't run two tasks with the 'same' priority.
> > > No. This is fundamentaly broken.
> > >
> > > T1 (prio 0) lock(X)
> > >
> > > --> preemption
> > >
> > > T2 (prio 10) lock(X)
> > > boost(T1)
> > > schedule()
> > >
> > > T1 (prio 10) unlock(X)
> >
> > We add a preempt_disable() before deboost to avoid the breakage,
> > there's also some comment about this in the patch's code.
>
> So the changelog is useless and misleading. Neither does it explain what's
> wrong with having two tasks with the same priority in running state.
So it's semantically icky to have the two tasks running off the same
state and practically icky when you consider bandwidth inheritance --
where the boosted task wants to explicitly modify the state of the
booster.
In that latter case you really want to unboost before you let the
booster run again.
However, you noted we need to deal with this case due to the whole
optimistic spinning crap anyway :/
On Fri, Apr 15, 2016 at 10:19:12AM +0800, Xunlei Pang wrote:
> On 2016/04/15 at 09:58, Xunlei Pang wrote:
> > On 2016/04/14 at 23:31, Peter Zijlstra wrote:
> >> On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
> >>> We access @pi_task's data without any lock in enqueue_task_dl(), though
> >>> checked "dl_prio(pi_task->normal_prio)" condition, that's not enough.
> >> The proper fix is to ensure that pi_task is guaranteed to be blocked.
> > Even if pi_task was blocked, its parameters are still allowed to be changed,
> > so we have to do that. Did I miss something?
> >
> > Regards,
> > Xunlei
>
> Fortunately, I just reproduced through an overnight test, so it really happened in reality as I thought.
But what happens? How is it changed when it is blocked?
On Wed, 20 Apr 2016, Peter Zijlstra wrote:
> On Mon, Apr 18, 2016 at 11:02:28AM +0200, Thomas Gleixner wrote:
> > On Mon, 18 Apr 2016, Xunlei Pang wrote:
> > > We add a preempt_disable() before deboost to avoid the breakage,
> > > there's also some comment about this in the patch's code.
> >
> > So the changelog is useless and misleading. Neither does it explain what's
> > wrong with having two tasks with the same priority in running state.
>
> So its semantically icky to have the two tasks running off the same
> state and practically icky when you consider bandwidth inheritance --
> where the boosted task wants to explicitly modify the state of the
> booster.
>
> In that latter case you really want to unboost before you let the
> booster run again.
I understand that. That doesn't make the changelog any better, which mumbles
about priorities :(
> However, you noted we need to deal with this case due to the whole
> optimistic spinning crap anyway :/
Right, but that's another dimension of madness. Both tasks are on a cpu. The
reason why we boost the lock holder before spinning is to make sure that it
does not get preempted by something of medium priority before dropping the
lock. That really gets interesting with bandwidth inheritance ....
Thanks,
tglx
On 2016/04/20 at 20:25, Peter Zijlstra wrote:
> On Fri, Apr 15, 2016 at 10:19:12AM +0800, Xunlei Pang wrote:
>> On 2016/04/15 at 09:58, Xunlei Pang wrote:
>>> On 2016/04/14 at 23:31, Peter Zijlstra wrote:
>>>> On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
>>>>> We access @pi_task's data without any lock in enqueue_task_dl(), though
>>>>> checked "dl_prio(pi_task->normal_prio)" condition, that's not enough.
>>>> The proper fix is to ensure that pi_task is guaranteed to be blocked.
>>> Even if pi_task was blocked, its parameters are still allowed to be changed,
>>> so we have to do that. Did I miss something?
>>>
>>> Regards,
>>> Xunlei
>> Fortunately, I just reproduced through an overnight test, so it really happened in reality as I thought.
> But what happens? How is it changed when it is blocked?
The top waiter's policy can be changed by other tasks through the sched_setattr() syscall while it is blocked.
I created another thread doing the following thing:
while (1) {
change the waiter to cfs
do something
change the waiter to deadline
}
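A minimal userspace sketch of that flipping thread could look like the
following; names, deadline parameters, and the hand-rolled struct
sched_attr/syscall wrapper are illustrative assumptions (glibc provides
no sched_setattr() wrapper, and __NR_sched_setattr is assumed to come
from the installed kernel headers), not the actual test program:

        #define _GNU_SOURCE
        #include <sched.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/types.h>

        #ifndef SCHED_DEADLINE
        #define SCHED_DEADLINE 6
        #endif

        struct sched_attr {
                __u32 size;
                __u32 sched_policy;
                __u64 sched_flags;
                __s32 sched_nice;
                __u32 sched_priority;
                __u64 sched_runtime;
                __u64 sched_deadline;
                __u64 sched_period;
        };

        static int sched_setattr(pid_t pid, const struct sched_attr *attr)
        {
                return syscall(__NR_sched_setattr, pid, attr, 0);
        }

        /* waiter_tid: tid of the thread blocked on the PI futex */
        static void flip_waiter_policy(pid_t waiter_tid)
        {
                struct sched_attr cfs = {
                        .size = sizeof(cfs),
                        .sched_policy = SCHED_OTHER,
                };
                struct sched_attr dl = {
                        .size = sizeof(dl),
                        .sched_policy = SCHED_DEADLINE,
                        .sched_runtime  = 10 * 1000 * 1000,   /* 10 ms */
                        .sched_deadline = 30 * 1000 * 1000,   /* 30 ms */
                        .sched_period   = 30 * 1000 * 1000,   /* 30 ms */
                };

                for (;;) {
                        sched_setattr(waiter_tid, &cfs); /* change the waiter to cfs */
                        usleep(1000);                    /* do something */
                        sched_setattr(waiter_tid, &dl);  /* change it back to deadline */
                }
        }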
Regards,
Xunlei
On Wed, Apr 20, 2016 at 02:43:29PM +0200, Thomas Gleixner wrote:
> > So its semantically icky to have the two tasks running off the same
> > state and practically icky when you consider bandwidth inheritance --
> > where the boosted task wants to explicitly modify the state of the
> > booster.
> >
> > In that latter case you really want to unboost before you let the
> > booster run again.
>
> I understand that. That doesn't make the changelog any better, which mumbles
> about priorities :(
Agreed.
> > However, you noted we need to deal with this case due to the whole
> > optimistic spinning crap anyway :/
>
> Right, but that's another dimension of madness. Both tasks are on a cpu.
> The reason why we boost the lock holder before spinning is to make
> sure that it does not get preempted by something of medium priority
> before dropping the lock.
Right; I figured that out pretty quickly, which is why this patch does a
preempt_disable() over the unboost+wakeup.
FWIW, the immediate reason for this patch is that it ensures the new
p->pi_task pointer points to something that exists.
> That really gets interesting with bandwidth inheritance ....
I'm more worried about the optimistic spinning case..
On Wed, Apr 20, 2016 at 09:00:32PM +0800, Xunlei Pang wrote:
> > But what happens? How is it changed when it is blocked?
>
> The top waiter's policy can be changed by other tasks through sched_setattr() syscall during it was blocked.
> I created another thread doing the following thing:
> while (1) {
> change the waiter to cfs
> do something
> change the waiter to deadline
> }
Indeed; so why didn't you say that? That is the single most important
thing in the Changelog -- the _actual_ problem, and you left it out.
I'm not quite sure how to go fix that best, but copying the state is not
right. That precludes being able to change the state.
There's two (obvious but) rather ugly solutions:
- delay the __sched_setscheduler() call until such a time that the task
is no longer the top waiter.
- deboost + __sched_setscheduler() + boost
Both have a different set of problems, but both keep the p->pi_task
pointer and its state 'stable'.
On Thu, Apr 14, 2016 at 07:37:03PM +0800, Xunlei Pang wrote:
> + /* Updated under pi_lock and rtmutex lock */
> struct rb_node *pi_waiters_leftmost;
> + struct rb_node *pi_waiters_leftmost_copy;
> struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
> {
> + if (!task->pi_waiters_leftmost_copy)
> return NULL;
>
> + return rb_entry(task->pi_waiters_leftmost_copy,
> + struct rt_mutex_waiter, pi_tree_entry)->task;
> }
why ?! Why not keep a regular task_struct pointer and avoid this stuff?
On 2016/04/20 at 21:17, Peter Zijlstra wrote:
> On Wed, Apr 20, 2016 at 09:00:32PM +0800, Xunlei Pang wrote:
>
>>> But what happens? How is it changed when it is blocked?
>> The top waiter's policy can be changed by other tasks through sched_setattr() syscall during it was blocked.
>> I created another thread doing the following thing:
>> while (1) {
>> change the waiter to cfs
>> do something
>> change the waiter to deadline
>> }
> Indeed; so why didn't you say that? That is the single most important
> thing in the Changelog -- the _actual_ problem, and you left it out.
Sorry, the changelog only mentioned it briefly; I should have described it in detail.
>
> I'm not quite sure how to go fix that best, but copying the state is not
> right. That precludes being able to change the state.
The patch updates the copy every time the waiter's policy/runtime/period
is changed. The calling path is rt_mutex_setprio()->rt_mutex_update_copy(),
so the copy is updated very soon after __sched_setscheduler()->rt_mutex_adjust_pi()
is made; PATCH 6 also forces the update for deadline cases.
Is this not acceptable?
Regards,
Xunlei
>
> There's two (obvious but) rather ugly solutions:
>
> - delay the __sched_setscheduler() call until such a time that the task
> is no longer the top waiter.
>
> - deboost + __sched_setscheduler() + boost
>
> Both have a different set of problems, but both keep the p->pi_task
> pointer and its state 'stable'.
>
>
On 2016/04/20 at 21:19, Peter Zijlstra wrote:
> On Thu, Apr 14, 2016 at 07:37:03PM +0800, Xunlei Pang wrote:
>> + /* Updated under pi_lock and rtmutex lock */
>> struct rb_node *pi_waiters_leftmost;
>> + struct rb_node *pi_waiters_leftmost_copy;
>> struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>> {
>> + if (!task->pi_waiters_leftmost_copy)
>> return NULL;
>>
>> + return rb_entry(task->pi_waiters_leftmost_copy,
>> + struct rt_mutex_waiter, pi_tree_entry)->task;
>> }
> why ?! Why not keep a regular task_struct pointer and avoid this stuff?
I meant to make it semantically consistent, but I can change it since you think task_struct is better.
Regards,
Xunlei
Hi Peter,
On 2016/04/20 at 21:49, Xunlei Pang wrote:
> On 2016/04/20 at 21:19, Peter Zijlstra wrote:
>> On Thu, Apr 14, 2016 at 07:37:03PM +0800, Xunlei Pang wrote:
>>> + /* Updated under pi_lock and rtmutex lock */
>>> struct rb_node *pi_waiters_leftmost;
>>> + struct rb_node *pi_waiters_leftmost_copy;
>>> struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>>> {
>>> + if (!task->pi_waiters_leftmost_copy)
>>> return NULL;
>>>
>>> + return rb_entry(task->pi_waiters_leftmost_copy,
>>> + struct rt_mutex_waiter, pi_tree_entry)->task;
>>> }
>> why ?! Why not keep a regular task_struct pointer and avoid this stuff?
> I meant to make it semantically consistent, but I can change it since you think task_struct is better.
What do you think of this version of PATCH 1 and PATCH 2?
If you are fine with them, I can send them out as v4 separately, and then we can focus on the issue in PATCH 5.
Thanks!
>
> Regards,
> Xunlei