Commit 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state")
mentions that it could cause an infinite loop and that the loop would be fixed
by the later patches:
bebe5b514345f09 futex: Futex_unlock_pi() determinism
cfafcd117da0216 futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
But at the moment they have not been backported. In a single-core environment,
the probability of triggering the problem is high.
I also backported commit b4abf91047cf ("rtmutex: Make wait_lock irq safe");
it fixes a potential deadlock, although we have not actually hit that deadlock
in our environment so far.
The other patches are needed to resolve conflicts or to fix problems
introduced by the new patches.
Anna-Maria Gleixner (1):
rcu: Update documentation of rcu_read_unlock()
Mike Galbraith (1):
futex: Handle transient "ownerless" rtmutex state correctly
Peter Zijlstra (6):
futex: Cleanup refcounting
futex,rt_mutex: Introduce rt_mutex_init_waiter()
futex: Pull rt_mutex_futex_unlock() out from under hb->lock
futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
futex: Futex_unlock_pi() determinism
futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock()
Thomas Gleixner (3):
futex: Rename free_pi_state() to put_pi_state()
rtmutex: Make wait_lock irq safe
futex: Avoid freeing an active timer
include/linux/rcupdate.h | 4 +-
kernel/futex.c | 245 +++++++++++++++++++++-----------
kernel/locking/rtmutex.c | 185 +++++++++++++-----------
kernel/locking/rtmutex_common.h | 2 +-
4 files changed, 262 insertions(+), 174 deletions(-)
--
2.26.0.106.g9fadedd
From: Anna-Maria Gleixner <[email protected]>
[ Upstream commit ec84b27f9b3b569f9235413d1945a2006b97b0aa ]
Since commit b4abf91047cf ("rtmutex: Make wait_lock irq safe") the
explanation in rcu_read_unlock() documentation about irq unsafe rtmutex
wait_lock is no longer valid.
Remove it to prevent kernel developers who read the documentation from
relying on it.
Suggested-by: Eric W. Biederman <[email protected]>
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhen Lei <[email protected]>
---
include/linux/rcupdate.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0a93e9d1708e29e..3072e9c93ae6be2 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -880,9 +880,7 @@ static __always_inline void rcu_read_lock(void)
* Unfortunately, this function acquires the scheduler's runqueue and
* priority-inheritance spinlocks. This means that deadlock could result
* if the caller of rcu_read_unlock() already holds one of these locks or
- * any lock that is ever acquired while holding them; or any lock which
- * can be taken from interrupt context because rcu_boost()->rt_mutex_lock()
- * does not disable irqs while taking ->wait_lock.
+ * any lock that is ever acquired while holding them.
*
* That said, RCU readers are never priority boosted unless they were
* preempted. Therefore, one way to avoid deadlock is to make sure
--
2.26.0.106.g9fadedd
From: Peter Zijlstra <[email protected]>
[ Upstream commit 04dc1b2fff4e96cb4142227fbdc63c8871ad4ed9 ]
Markus reported that the glibc/nptl/tst-robustpi8 test was failing after
commit:
cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
The following trace shows the problem:
ld-linux-x86-64-2161 [019] .... 410.760971: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_LOCK_PI
ld-linux-x86-64-2161 [019] ...1 410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0
ld-linux-x86-64-2165 [011] .... 410.760978: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_UNLOCK_PI
ld-linux-x86-64-2165 [011] d..1 410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0
ld-linux-x86-64-2165 [011] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000
ld-linux-x86-64-2161 [019] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT
Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161
which then returns with -ETIMEDOUT. That wrecks the lock state, because now
the owner isn't aware it acquired the lock and removes the pending robust
list entry.
If 2161 is killed, the robust list will not clear out this futex, and the
subsequent acquire on this futex will then (correctly) result in -ESRCH,
which is unexpected by glibc; it triggers an internal assertion and dies.
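For reference, the user-space pattern that exercises this kernel path is,
roughly, a robust PI mutex acquired with a timeout. A minimal sketch (not the
actual glibc test; function names are illustrative):
  #include <pthread.h>
  #include <time.h>

  static pthread_mutex_t m;

  static void init_robust_pi_mutex(void)
  {
          pthread_mutexattr_t a;

          pthread_mutexattr_init(&a);
          /* PI protocol -> kernel FUTEX_LOCK_PI / FUTEX_UNLOCK_PI */
          pthread_mutexattr_setprotocol(&a, PTHREAD_PRIO_INHERIT);
          /* robust -> kernel cleans up via the robust list on owner death */
          pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
          pthread_mutex_init(&m, &a);
  }

  static int lock_with_timeout(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_REALTIME, &ts);
          ts.tv_sec += 1;
          /* May race with an UNLOCK_PI that hands us the lock. */
          return pthread_mutex_timedlock(&m, &ts);
  }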
The interleaving between the two tasks is:
Task 2161                               Task 2165
rt_mutex_wait_proxy_lock()
   timeout();
   /* T2161 is still queued in the waiter list */
   return -ETIMEDOUT;
                                        futex_unlock_pi()
                                        spin_lock(hb->lock);
                                        rtmutex_unlock()
                                          remove_rtmutex_waiter(T2161);
                                          mark_lock_available();
                                        /* Make the next waiter owner of
                                           the user space side */
                                        futex_uval = 2161;
                                        spin_unlock(hb->lock);
spin_lock(hb->lock);
rt_mutex_cleanup_proxy_lock()
  if (rt_mutex_owner() != current)
    ...
    return FAIL;
....
return -ETIMEDOUT;
This means that rt_mutex_cleanup_proxy_lock() needs to call
try_to_take_rt_mutex() so it can take over the rtmutex correctly which was
assigned by the waker. If the rtmutex is owned by some other task then this
call is harmless and just confirmes that the waiter is not able to acquire
it.
While there, fix what looks like a merge error which resulted in
rt_mutex_cleanup_proxy_lock() having two calls to
fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any.
Both should have one, since both potentially touch the waiter list.
Fixes: 38d589f2fd08 ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()")
Reported-by: Markus Trippelsdorf <[email protected]>
Bug-Spotted-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Markus Trippelsdorf <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Zhen Lei <[email protected]>
---
kernel/locking/rtmutex.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 18154e10d63a23b..532986d82179b69 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1764,12 +1764,14 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
int ret;
raw_spin_lock_irq(&lock->wait_lock);
-
- set_current_state(TASK_INTERRUPTIBLE);
-
/* sleep on the mutex */
+ set_current_state(TASK_INTERRUPTIBLE);
ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
-
+ /*
+ * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
+ * have to fix that up.
+ */
+ fixup_rt_mutex_waiters(lock);
raw_spin_unlock_irq(&lock->wait_lock);
return ret;
@@ -1800,16 +1802,26 @@ bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
bool cleanup = false;
raw_spin_lock_irq(&lock->wait_lock);
+ /*
+ * Do an unconditional try-lock, this deals with the lock stealing
+ * state where __rt_mutex_futex_unlock() -> mark_wakeup_next_waiter()
+ * sets a NULL owner.
+ *
+ * We're not interested in the return value, because the subsequent
+ * test on rt_mutex_owner() will infer that. If the trylock succeeded,
+ * we will own the lock and it will have removed the waiter. If we
+ * failed the trylock, we're still not owner and we need to remove
+ * ourselves.
+ */
+ try_to_take_rt_mutex(lock, current, waiter);
/*
* Unless we're the owner; we're still enqueued on the wait_list.
* So check if we became owner, if not, take us off the wait_list.
*/
if (rt_mutex_owner(lock) != current) {
remove_waiter(lock, waiter);
- fixup_rt_mutex_waiters(lock);
cleanup = true;
}
-
/*
* try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
* have to fix that up.
--
2.26.0.106.g9fadedd
From: Peter Zijlstra <[email protected]>
[ Upstream commit bf92cf3a5100f5a0d5f9834787b130159397cb22 ]
Add a put_pi_state() as counterpart for get_pi_state() so the refcounting
becomes consistent.
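The counterpart only tears the object down on the final put. Conceptually it
behaves like the sketch below (simplified; not the literal 4.4 body):
  static void put_pi_state(struct futex_pi_state *pi_state)
  {
          if (!pi_state)
                  return;

          /* Not the last reference: nothing further to do. */
          if (!atomic_dec_and_test(&pi_state->refcount))
                  return;

          /* Last reference: detach the pi_state from its owner and the
           * pi_mutex, then free it or cache it for reuse. */
  }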
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Zhen Lei <[email protected]>
---
kernel/futex.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index dbb38e14f6fcc8e..dab9c79a931a23e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -825,7 +825,7 @@ static int refill_pi_state_cache(void)
return 0;
}
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_pi_state *alloc_pi_state(void)
{
struct futex_pi_state *pi_state = current->pi_state_cache;
@@ -858,6 +858,11 @@ static void pi_state_update_owner(struct futex_pi_state *pi_state,
}
}
+static void get_pi_state(struct futex_pi_state *pi_state)
+{
+ WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+}
+
/*
* Drops a reference to the pi_state object and frees or caches it
* when the last reference is gone.
@@ -901,7 +906,7 @@ static void put_pi_state(struct futex_pi_state *pi_state)
* Look up the task based on what TID userspace gave us.
* We dont trust it.
*/
-static struct task_struct * futex_find_get_task(pid_t pid)
+static struct task_struct *futex_find_get_task(pid_t pid)
{
struct task_struct *p;
@@ -1149,7 +1154,7 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
goto out_einval;
out_attach:
- atomic_inc(&pi_state->refcount);
+ get_pi_state(pi_state);
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
*ps = pi_state;
return 0;
@@ -2204,7 +2209,7 @@ retry_private:
*/
if (requeue_pi) {
/* Prepare the waiter to take the rt_mutex. */
- atomic_inc(&pi_state->refcount);
+ get_pi_state(pi_state);
this->pi_state = pi_state;
ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
this->rt_waiter,
--
2.26.0.106.g9fadedd
From: Thomas Gleixner <[email protected]>
[ Upstream commit 29e9ee5d48c35d6cf8afe09bdf03f77125c9ac11 ]
free_pi_state() is confusing as it is in fact only freeing/caching the
pi state when the last reference is gone. Rename it to put_pi_state()
which reflects better what it is doing.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: [email protected]
Cc: Andy Lowe <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Zhen Lei <[email protected]>
---
kernel/futex.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index ff5499b0c5b34f7..dbb38e14f6fcc8e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -859,9 +859,12 @@ static void pi_state_update_owner(struct futex_pi_state *pi_state,
}
/*
+ * Drops a reference to the pi_state object and frees or caches it
+ * when the last reference is gone.
+ *
* Must be called with the hb lock held.
*/
-static void free_pi_state(struct futex_pi_state *pi_state)
+static void put_pi_state(struct futex_pi_state *pi_state)
{
if (!pi_state)
return;
@@ -2121,7 +2124,7 @@ retry_private:
case 0:
break;
case -EFAULT:
- free_pi_state(pi_state);
+ put_pi_state(pi_state);
pi_state = NULL;
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
@@ -2139,7 +2142,7 @@ retry_private:
* exit to complete.
* - EAGAIN: The user space value changed.
*/
- free_pi_state(pi_state);
+ put_pi_state(pi_state);
pi_state = NULL;
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
@@ -2214,7 +2217,7 @@ retry_private:
} else if (ret) {
/* -EDEADLK */
this->pi_state = NULL;
- free_pi_state(pi_state);
+ put_pi_state(pi_state);
goto out_unlock;
}
}
@@ -2223,7 +2226,7 @@ retry_private:
}
out_unlock:
- free_pi_state(pi_state);
+ put_pi_state(pi_state);
double_unlock_hb(hb1, hb2);
wake_up_q(&wake_q);
hb_waiters_dec(hb2);
@@ -2376,7 +2379,7 @@ static void unqueue_me_pi(struct futex_q *q)
__unqueue_futex(q);
BUG_ON(!q->pi_state);
- free_pi_state(q->pi_state);
+ put_pi_state(q->pi_state);
q->pi_state = NULL;
spin_unlock(q->lock_ptr);
@@ -3210,7 +3213,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
* Drop the reference to the pi state which
* the requeue_pi() code acquired for us.
*/
- free_pi_state(q.pi_state);
+ put_pi_state(q.pi_state);
spin_unlock(q.lock_ptr);
/*
* Adjust the return value. It's either -EFAULT or
--
2.26.0.106.g9fadedd
From: Peter Zijlstra <[email protected]>
[ Upstream commit 50809358dd7199aa7ce232f6877dd09ec30ef374 ]
Since there are already two copies of this code, introduce a helper now
before adding a third one.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Zhen Lei <[email protected]>
---
kernel/futex.c | 5 +----
kernel/locking/rtmutex.c | 12 +++++++++---
kernel/locking/rtmutex_common.h | 1 +
3 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index dab9c79a931a23e..53a085a378f3844 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3156,10 +3156,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
* The waiter is allocated on our stack, manipulated by the requeue
* code while we sleep on uaddr.
*/
- debug_rt_mutex_init_waiter(&rt_waiter);
- RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
- RB_CLEAR_NODE(&rt_waiter.tree_entry);
- rt_waiter.task = NULL;
+ rt_mutex_init_waiter(&rt_waiter);
ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 1c0cb5c3c6ad6fe..fa24df4bc7a8ff0 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1155,6 +1155,14 @@ void rt_mutex_adjust_pi(struct task_struct *task)
next_lock, NULL, task);
}
+void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
+{
+ debug_rt_mutex_init_waiter(waiter);
+ RB_CLEAR_NODE(&waiter->pi_tree_entry);
+ RB_CLEAR_NODE(&waiter->tree_entry);
+ waiter->task = NULL;
+}
+
/**
* __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
* @lock: the rt_mutex to take
@@ -1236,9 +1244,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct rt_mutex_waiter waiter;
int ret = 0;
- debug_rt_mutex_init_waiter(&waiter);
- RB_CLEAR_NODE(&waiter.pi_tree_entry);
- RB_CLEAR_NODE(&waiter.tree_entry);
+ rt_mutex_init_waiter(&waiter);
raw_spin_lock(&lock->wait_lock);
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 4584db96265d429..0424ff97bb69cad 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -102,6 +102,7 @@ extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock);
extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
struct task_struct *proxy_owner);
extern void rt_mutex_proxy_unlock(struct rt_mutex *lock);
+extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
struct task_struct *task);
--
2.26.0.106.g9fadedd
From: Thomas Gleixner <[email protected]>
[ Upstream commit b4abf91047cf054f203dcfac97e1038388826937 ]
Sasha reported a lockdep splat about a potential deadlock between RCU boosting
rtmutex and the posix timer it_lock.
CPU0                                    CPU1
rtmutex_lock(&rcu->rt_mutex)
  spin_lock(&rcu->rt_mutex.wait_lock)
                                        local_irq_disable()
                                        spin_lock(&timer->it_lock)
                                        spin_lock(&rcu->rt_mutex.wait_lock)
--> Interrupt
    spin_lock(&timer->it_lock)
This is caused by the following code sequence on CPU1
rcu_read_lock()
x = lookup();
if (x)
	spin_lock_irqsave(&x->it_lock);
rcu_read_unlock();
return x;
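Spelled out, that idiom is along the lines of the sketch below (modelled on
the posix-timer lookup path; lookup_timer() is a hypothetical stand-in for
the real id-to-timer lookup):
  struct k_itimer *lock_timer_sketch(timer_t timer_id, unsigned long *flags)
  {
          struct k_itimer *timr;

          rcu_read_lock();
          timr = lookup_timer(timer_id);  /* hypothetical lookup helper */
          if (timr) {
                  /* Disables irqs; they stay disabled across the
                   * rcu_read_unlock() below - the case described above. */
                  spin_lock_irqsave(&timr->it_lock, *flags);
                  rcu_read_unlock();
                  return timr;
          }
          rcu_read_unlock();
          return NULL;
  }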
We could fix that in the posix timer code by keeping rcu read locked across
the spinlocked and irq disabled section, but the above sequence is common and
there is no reason not to support it.
Taking rt_mutex.wait_lock irq safe prevents the deadlock.
Reported-by: Sasha Levin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Signed-off-by: Zhen Lei <[email protected]>
---
kernel/locking/rtmutex.c | 135 +++++++++++++++++++++------------------
1 file changed, 72 insertions(+), 63 deletions(-)
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 98e45a2e1236e30..18154e10d63a23b 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -163,13 +163,14 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
* 2) Drop lock->wait_lock
* 3) Try to unlock the lock with cmpxchg
*/
-static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
+ unsigned long flags)
__releases(lock->wait_lock)
{
struct task_struct *owner = rt_mutex_owner(lock);
clear_rt_mutex_waiters(lock);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/*
* If a new waiter comes in between the unlock and the cmpxchg
* we have two situations:
@@ -211,11 +212,12 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
/*
* Simple slow path only version: lock->owner is protected by lock->wait_lock.
*/
-static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
+ unsigned long flags)
__releases(lock->wait_lock)
{
lock->owner = NULL;
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
return true;
}
#endif
@@ -497,7 +499,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
int ret = 0, depth = 0;
struct rt_mutex *lock;
bool detect_deadlock;
- unsigned long flags;
bool requeue = true;
detect_deadlock = rt_mutex_cond_detect_deadlock(orig_waiter, chwalk);
@@ -540,7 +541,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/*
* [1] Task cannot go away as we did a get_task() before !
*/
- raw_spin_lock_irqsave(&task->pi_lock, flags);
+ raw_spin_lock_irq(&task->pi_lock);
/*
* [2] Get the waiter on which @task is blocked on.
@@ -624,7 +625,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
* operations.
*/
if (!raw_spin_trylock(&lock->wait_lock)) {
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+ raw_spin_unlock_irq(&task->pi_lock);
cpu_relax();
goto retry;
}
@@ -655,7 +656,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/*
* No requeue[7] here. Just release @task [8]
*/
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+ raw_spin_unlock(&task->pi_lock);
put_task_struct(task);
/*
@@ -663,14 +664,14 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
* If there is no owner of the lock, end of chain.
*/
if (!rt_mutex_owner(lock)) {
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
return 0;
}
/* [10] Grab the next task, i.e. owner of @lock */
task = rt_mutex_owner(lock);
get_task_struct(task);
- raw_spin_lock_irqsave(&task->pi_lock, flags);
+ raw_spin_lock(&task->pi_lock);
/*
* No requeue [11] here. We just do deadlock detection.
@@ -685,8 +686,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
top_waiter = rt_mutex_top_waiter(lock);
/* [13] Drop locks */
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock(&task->pi_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
/* If owner is not blocked, end of chain. */
if (!next_lock)
@@ -707,7 +708,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
rt_mutex_enqueue(lock, waiter);
/* [8] Release the task */
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+ raw_spin_unlock(&task->pi_lock);
put_task_struct(task);
/*
@@ -725,14 +726,14 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
*/
if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
wake_up_process(rt_mutex_top_waiter(lock)->task);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
return 0;
}
/* [10] Grab the next task, i.e. the owner of @lock */
task = rt_mutex_owner(lock);
get_task_struct(task);
- raw_spin_lock_irqsave(&task->pi_lock, flags);
+ raw_spin_lock(&task->pi_lock);
/* [11] requeue the pi waiters if necessary */
if (waiter == rt_mutex_top_waiter(lock)) {
@@ -786,8 +787,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
top_waiter = rt_mutex_top_waiter(lock);
/* [13] Drop the locks */
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock(&task->pi_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
/*
* Make the actual exit decisions [12], based on the stored
@@ -810,7 +811,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
goto again;
out_unlock_pi:
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+ raw_spin_unlock_irq(&task->pi_lock);
out_put_task:
put_task_struct(task);
@@ -820,7 +821,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/*
* Try to take an rt-mutex
*
- * Must be called with lock->wait_lock held.
+ * Must be called with lock->wait_lock held and interrupts disabled
*
* @lock: The lock to be acquired.
* @task: The task which wants to acquire the lock
@@ -830,8 +831,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
struct rt_mutex_waiter *waiter)
{
- unsigned long flags;
-
/*
* Before testing whether we can acquire @lock, we set the
* RT_MUTEX_HAS_WAITERS bit in @lock->owner. This forces all
@@ -916,7 +915,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
* case, but conditionals are more expensive than a redundant
* store.
*/
- raw_spin_lock_irqsave(&task->pi_lock, flags);
+ raw_spin_lock(&task->pi_lock);
task->pi_blocked_on = NULL;
/*
* Finish the lock acquisition. @task is the new owner. If
@@ -925,7 +924,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
*/
if (rt_mutex_has_waiters(lock))
rt_mutex_enqueue_pi(task, rt_mutex_top_waiter(lock));
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+ raw_spin_unlock(&task->pi_lock);
takeit:
/* We got the lock. */
@@ -945,7 +944,7 @@ takeit:
*
* Prepare waiter and propagate pi chain
*
- * This must be called with lock->wait_lock held.
+ * This must be called with lock->wait_lock held and interrupts disabled
*/
static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
@@ -956,7 +955,6 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
struct rt_mutex_waiter *top_waiter = waiter;
struct rt_mutex *next_lock;
int chain_walk = 0, res;
- unsigned long flags;
/*
* Early deadlock detection. We really don't want the task to
@@ -970,7 +968,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
if (owner == task)
return -EDEADLK;
- raw_spin_lock_irqsave(&task->pi_lock, flags);
+ raw_spin_lock(&task->pi_lock);
__rt_mutex_adjust_prio(task);
waiter->task = task;
waiter->lock = lock;
@@ -983,12 +981,12 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
task->pi_blocked_on = waiter;
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+ raw_spin_unlock(&task->pi_lock);
if (!owner)
return 0;
- raw_spin_lock_irqsave(&owner->pi_lock, flags);
+ raw_spin_lock(&owner->pi_lock);
if (waiter == rt_mutex_top_waiter(lock)) {
rt_mutex_dequeue_pi(owner, top_waiter);
rt_mutex_enqueue_pi(owner, waiter);
@@ -1003,7 +1001,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
/* Store the lock on which owner is blocked or NULL */
next_lock = task_blocked_on_lock(owner);
- raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
+ raw_spin_unlock(&owner->pi_lock);
/*
* Even if full deadlock detection is on, if the owner is not
* blocked itself, we can avoid finding this out in the chain
@@ -1019,12 +1017,12 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
*/
get_task_struct(owner);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
res = rt_mutex_adjust_prio_chain(owner, chwalk, lock,
next_lock, waiter, task);
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irq(&lock->wait_lock);
return res;
}
@@ -1033,15 +1031,14 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
* Remove the top waiter from the current tasks pi waiter tree and
* queue it up.
*
- * Called with lock->wait_lock held.
+ * Called with lock->wait_lock held and interrupts disabled.
*/
static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
struct rt_mutex *lock)
{
struct rt_mutex_waiter *waiter;
- unsigned long flags;
- raw_spin_lock_irqsave(¤t->pi_lock, flags);
+ raw_spin_lock(¤t->pi_lock);
waiter = rt_mutex_top_waiter(lock);
@@ -1063,7 +1060,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
*/
lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
- raw_spin_unlock_irqrestore(¤t->pi_lock, flags);
+ raw_spin_unlock(¤t->pi_lock);
wake_q_add(wake_q, waiter->task);
}
@@ -1071,7 +1068,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
/*
* Remove a waiter from a lock and give up
*
- * Must be called with lock->wait_lock held and
+ * Must be called with lock->wait_lock held and interrupts disabled. I must
* have just failed to try_to_take_rt_mutex().
*/
static void remove_waiter(struct rt_mutex *lock,
@@ -1080,12 +1077,11 @@ static void remove_waiter(struct rt_mutex *lock,
bool is_top_waiter = (waiter == rt_mutex_top_waiter(lock));
struct task_struct *owner = rt_mutex_owner(lock);
struct rt_mutex *next_lock;
- unsigned long flags;
- raw_spin_lock_irqsave(¤t->pi_lock, flags);
+ raw_spin_lock(¤t->pi_lock);
rt_mutex_dequeue(lock, waiter);
current->pi_blocked_on = NULL;
- raw_spin_unlock_irqrestore(¤t->pi_lock, flags);
+ raw_spin_unlock(¤t->pi_lock);
/*
* Only update priority if the waiter was the highest priority
@@ -1094,7 +1090,7 @@ static void remove_waiter(struct rt_mutex *lock,
if (!owner || !is_top_waiter)
return;
- raw_spin_lock_irqsave(&owner->pi_lock, flags);
+ raw_spin_lock(&owner->pi_lock);
rt_mutex_dequeue_pi(owner, waiter);
@@ -1106,7 +1102,7 @@ static void remove_waiter(struct rt_mutex *lock,
/* Store the lock on which owner is blocked or NULL */
next_lock = task_blocked_on_lock(owner);
- raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
+ raw_spin_unlock(&owner->pi_lock);
/*
* Don't walk the chain, if the owner task is not blocked
@@ -1118,12 +1114,12 @@ static void remove_waiter(struct rt_mutex *lock,
/* gets dropped in rt_mutex_adjust_prio_chain()! */
get_task_struct(owner);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
rt_mutex_adjust_prio_chain(owner, RT_MUTEX_MIN_CHAINWALK, lock,
next_lock, NULL, current);
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irq(&lock->wait_lock);
}
/*
@@ -1167,11 +1163,11 @@ void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
* __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
* @lock: the rt_mutex to take
* @state: the state the task should block in (TASK_INTERRUPTIBLE
- * or TASK_UNINTERRUPTIBLE)
+ * or TASK_UNINTERRUPTIBLE)
* @timeout: the pre-initialized and started timer, or NULL for none
* @waiter: the pre-initialized rt_mutex_waiter
*
- * lock->wait_lock must be held by the caller.
+ * Must be called with lock->wait_lock held and interrupts disabled
*/
static int __sched
__rt_mutex_slowlock(struct rt_mutex *lock, int state,
@@ -1199,13 +1195,13 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
break;
}
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
debug_rt_mutex_print_deadlock(waiter);
schedule();
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irq(&lock->wait_lock);
set_current_state(state);
}
@@ -1242,15 +1238,24 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
enum rtmutex_chainwalk chwalk)
{
struct rt_mutex_waiter waiter;
+ unsigned long flags;
int ret = 0;
rt_mutex_init_waiter(&waiter);
- raw_spin_lock(&lock->wait_lock);
+ /*
+ * Technically we could use raw_spin_[un]lock_irq() here, but this can
+ * be called in early boot if the cmpxchg() fast path is disabled
+ * (debug, no architecture support). In this case we will acquire the
+ * rtmutex with lock->wait_lock held. But we cannot unconditionally
+ * enable interrupts in that early boot case. So we need to use the
+ * irqsave/restore variants.
+ */
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
/* Try to acquire the lock again: */
if (try_to_take_rt_mutex(lock, current, NULL)) {
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
return 0;
}
@@ -1279,7 +1284,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
*/
fixup_rt_mutex_waiters(lock);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* Remove pending timer: */
if (unlikely(timeout))
@@ -1308,6 +1313,7 @@ static inline int __rt_mutex_slowtrylock(struct rt_mutex *lock)
*/
static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
{
+ unsigned long flags;
int ret;
/*
@@ -1319,14 +1325,14 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
return 0;
/*
- * The mutex has currently no owner. Lock the wait lock and
- * try to acquire the lock.
+ * The mutex has currently no owner. Lock the wait lock and try to
+ * acquire the lock. We use irqsave here to support early boot calls.
*/
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
ret = __rt_mutex_slowtrylock(lock);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
return ret;
}
@@ -1338,7 +1344,10 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
struct wake_q_head *wake_q)
{
- raw_spin_lock(&lock->wait_lock);
+ unsigned long flags;
+
+ /* irqsave required to support early boot calls */
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_rt_mutex_unlock(lock);
@@ -1375,10 +1384,10 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
*/
while (!rt_mutex_has_waiters(lock)) {
/* Drops lock->wait_lock ! */
- if (unlock_rt_mutex_safe(lock) == true)
+ if (unlock_rt_mutex_safe(lock, flags) == true)
return false;
/* Relock the rtmutex and try again */
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
/*
@@ -1389,7 +1398,7 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
*/
mark_wakeup_next_waiter(wake_q, lock);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* check PI boosting */
return true;
@@ -1680,10 +1689,10 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
{
int ret;
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irq(&lock->wait_lock);
if (try_to_take_rt_mutex(lock, task, NULL)) {
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
return 1;
}
@@ -1704,7 +1713,7 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
if (unlikely(ret))
remove_waiter(lock, waiter);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
debug_rt_mutex_print_deadlock(waiter);
@@ -1754,14 +1763,14 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
{
int ret;
- raw_spin_lock(&lock->wait_lock);
+ raw_spin_lock_irq(&lock->wait_lock);
set_current_state(TASK_INTERRUPTIBLE);
/* sleep on the mutex */
ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
- raw_spin_unlock(&lock->wait_lock);
+ raw_spin_unlock_irq(&lock->wait_lock);
return ret;
}
--
2.26.0.106.g9fadedd
On Mon, Aug 02, 2021 at 09:46:13PM +0800, Zhen Lei wrote:
> Commit 73d786bd043e "futex: Rework inconsistent rt_mutex/futex_q state"
> mentions that it could cause an infinite loop, and will fix it in the later
> patches:
> bebe5b514345f09 futex: Futex_unlock_pi() determinism
> cfafcd117da0216 futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
>
> But at the moment they're not backported. In a single-core environment, the
> probability of triggering is high.
>
> I also backported commit b4abf91047cf ("rtmutex: Make wait_lock irq safe"),
> it fixes a potential deadlock problem. Although it hasn't actually been
> triggered in our environment at the moment.
>
> Other patches are used to resolve conflicts or fix problems caused by new
> patches.
>
>
> Anna-Maria Gleixner (1):
> rcu: Update documentation of rcu_read_unlock()
>
> Mike Galbraith (1):
> futex: Handle transient "ownerless" rtmutex state correctly
>
> Peter Zijlstra (6):
> futex: Cleanup refcounting
> futex,rt_mutex: Introduce rt_mutex_init_waiter()
> futex: Pull rt_mutex_futex_unlock() out from under hb->lock
> futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
> futex: Futex_unlock_pi() determinism
> futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock()
>
> Thomas Gleixner (3):
> futex: Rename free_pi_state() to put_pi_state()
> rtmutex: Make wait_lock irq safe
> futex: Avoid freeing an active timer
>
> include/linux/rcupdate.h | 4 +-
> kernel/futex.c | 245 +++++++++++++++++++++-----------
> kernel/locking/rtmutex.c | 185 +++++++++++++-----------
> kernel/locking/rtmutex_common.h | 2 +-
> 4 files changed, 262 insertions(+), 174 deletions(-)
To all concerned,
I have verified that this series of patches, when applied
to 4.4.277, passes the futex-unlock-pi replicator I posted
to lkml on July 19.
Subject: [BUG] 4.4.262: infinite loop in futex_unlock_pi (EAGAIN loop)
Acked-by: Joe Korty <[email protected]>
On Mon, Aug 02, 2021 at 08:53:25PM -0400, Joe Korty wrote:
> On Mon, Aug 02, 2021 at 09:46:13PM +0800, Zhen Lei wrote:
> > Commit 73d786bd043e "futex: Rework inconsistent rt_mutex/futex_q state"
> > mentions that it could cause an infinite loop, and will fix it in the later
> > patches:
> > bebe5b514345f09 futex: Futex_unlock_pi() determinism
> > cfafcd117da0216 futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
> >
> > But at the moment they're not backported. In a single-core environment, the
> > probability of triggering is high.
> >
> > I also backported commit b4abf91047cf ("rtmutex: Make wait_lock irq safe"),
> > it fixes a potential deadlock problem. Although it hasn't actually been
> > triggered in our environment at the moment.
> >
> > Other patches are used to resolve conflicts or fix problems caused by new
> > patches.
> >
> >
> > Anna-Maria Gleixner (1):
> > rcu: Update documentation of rcu_read_unlock()
> >
> > Mike Galbraith (1):
> > futex: Handle transient "ownerless" rtmutex state correctly
> >
> > Peter Zijlstra (6):
> > futex: Cleanup refcounting
> > futex,rt_mutex: Introduce rt_mutex_init_waiter()
> > futex: Pull rt_mutex_futex_unlock() out from under hb->lock
> > futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
> > futex: Futex_unlock_pi() determinism
> > futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock()
> >
> > Thomas Gleixner (3):
> > futex: Rename free_pi_state() to put_pi_state()
> > rtmutex: Make wait_lock irq safe
> > futex: Avoid freeing an active timer
> >
> > include/linux/rcupdate.h | 4 +-
> > kernel/futex.c | 245 +++++++++++++++++++++-----------
> > kernel/locking/rtmutex.c | 185 +++++++++++++-----------
> > kernel/locking/rtmutex_common.h | 2 +-
> > 4 files changed, 262 insertions(+), 174 deletions(-)
>
>
> To all concerned,
>
> I have verified that this series of patches, when applied
> to 4.4.277, passes the futex-unlock-pi replicator I posted
> to lkml on July 19.
>
> Subject: [BUG] 4.4.262: infinite loop in futex_unlock_pi (EAGAIN loop)
>
> Acked-by: Joe Korty <[email protected]>
>
Thanks for testing and the series, all now queued up.
I'll go do a -rc release with just this set of patches in it so that
people can test it well.
greg k-h