2023-01-17 18:33:56

by Wander Lairson Costa

Subject: [PATCH] rtmutex: ensure we wake up the top waiter

In task_blocked_on_lock() we save the owner, release the wait_lock and
call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
again, the owner may release the lock and deboost.

rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
phase, the waiter may initially be at the top of the queue, but after
being dequeued and requeued this may no longer be true.

This scenario ends up waking the wrong task, which will verify it is
not the top waiter and go back to sleep. Now we have a situation in
which no task is holding the lock, but no one acquires it.

We can reproduce the bug in PREEMPT_RT with stress-ng:

while true; do
	stress-ng --sched deadline --sched-period 1000000000 \
		--sched-runtime 800000000 --sched-deadline \
		1000000000 --mmapfork 23 -t 20
done

Signed-off-by: Wander Lairson Costa <[email protected]>
---
kernel/locking/rtmutex.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 010cf4e6d0b8..728f434de2bb 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -901,8 +901,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
* then we need to wake the new top waiter up to try
* to get the lock.
*/
- if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
- wake_up_state(waiter->task, waiter->wake_state);
+ top_waiter = rt_mutex_top_waiter(lock);
+ if (prerequeue_top_waiter != top_waiter)
+ wake_up_state(top_waiter->task, top_waiter->wake_state);
raw_spin_unlock_irq(&lock->wait_lock);
return 0;
}
--
2.39.0


2023-01-17 21:55:53

by Wander Lairson Costa

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

On Tue, Jan 17, 2023 at 4:32 PM Waiman Long <[email protected]> wrote:
>
> On 1/17/23 12:26, Wander Lairson Costa wrote:
> > In task_blocked_on_lock() we save the owner, release the wait_lock and
> > call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> > again, the owner may release the lock and deboost.
> Are you referring to task_blocks_on_rt_mutex(), not task_blocked_on_lock()?
> >
> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> > phase, waiter may be initially in the top of the queue, but after
> > dequeued and requeued it may no longer be true.
> >
> > This scenario ends up waking the wrong task, which will verify it is no
> > the top waiter and comes back to sleep. Now we have a situation in which
> > no task is holding the lock but no one acquires it.
> >
> > We can reproduce the bug in PREEMPT_RT with stress-ng:
> >
> > while true; do
> > stress-ng --sched deadline --sched-period 1000000000 \
> > --sched-runtime 800000000 --sched-deadline \
> > 1000000000 --mmapfork 23 -t 20
> > done
> >
> > Signed-off-by: Wander Lairson Costa <[email protected]>
> > ---
> > kernel/locking/rtmutex.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> > index 010cf4e6d0b8..728f434de2bb 100644
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -901,8 +901,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
> > * then we need to wake the new top waiter up to try
> > * to get the lock.
> > */
> > - if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
> > - wake_up_state(waiter->task, waiter->wake_state);
> > + top_waiter = rt_mutex_top_waiter(lock);
> > + if (prerequeue_top_waiter != top_waiter)
> > + wake_up_state(top_waiter->task, top_waiter->wake_state);
> > raw_spin_unlock_irq(&lock->wait_lock);
> > return 0;
> > }
>
> I would say that if a rt_mutex has no owner but have waiters, we should
> always wake up the top waiter whether or not it is the current waiter.

In practice, there is this case in which the unlock code wakes up the
top waiter, but before that task awakes, a third running task changes
the top waiter.

> So what is the result of the stress-ng run above? Is it a hang? It is
> not clear from your patch description.

It manifests as an rcu_preempt stall.

>
> I am not that familiar with the rt_mutex code, I am cc'ing Thomas and
> Sebastian for their input.
>
> Cheers,
> Longman
>

2023-01-17 21:56:19

by Waiman Long

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

On 1/17/23 12:26, Wander Lairson Costa wrote:
> In task_blocked_on_lock() we save the owner, release the wait_lock and
> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> again, the owner may release the lock and deboost.
Are you referring to task_blocks_on_rt_mutex(), not task_blocked_on_lock()?
>
> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> phase, waiter may be initially in the top of the queue, but after
> dequeued and requeued it may no longer be true.
>
> This scenario ends up waking the wrong task, which will verify it is no
> the top waiter and comes back to sleep. Now we have a situation in which
> no task is holding the lock but no one acquires it.
>
> We can reproduce the bug in PREEMPT_RT with stress-ng:
>
> while true; do
> stress-ng --sched deadline --sched-period 1000000000 \
> --sched-runtime 800000000 --sched-deadline \
> 1000000000 --mmapfork 23 -t 20
> done
>
> Signed-off-by: Wander Lairson Costa <[email protected]>
> ---
> kernel/locking/rtmutex.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 010cf4e6d0b8..728f434de2bb 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -901,8 +901,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
> * then we need to wake the new top waiter up to try
> * to get the lock.
> */
> - if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
> - wake_up_state(waiter->task, waiter->wake_state);
> + top_waiter = rt_mutex_top_waiter(lock);
> + if (prerequeue_top_waiter != top_waiter)
> + wake_up_state(top_waiter->task, top_waiter->wake_state);
> raw_spin_unlock_irq(&lock->wait_lock);
> return 0;
> }

I would say that if an rt_mutex has no owner but has waiters, we should
always wake up the top waiter whether or not it is the current waiter.
So what is the result of the stress-ng run above? Is it a hang? It is
not clear from your patch description.

I am not that familiar with the rt_mutex code, I am cc'ing Thomas and
Sebastian for their input.

Cheers,
Longman

2023-01-17 21:56:35

by Wander Lairson Costa

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

On Tue, Jan 17, 2023 at 4:40 PM Waiman Long <[email protected]> wrote:
>
> On 1/17/23 14:32, Waiman Long wrote:
> > On 1/17/23 12:26, Wander Lairson Costa wrote:
> >> In task_blocked_on_lock() we save the owner, release the wait_lock and
> >> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> >> again, the owner may release the lock and deboost.
> > Are you referring to task_blocks_on_rt_mutex(), not
> > task_blocked_on_lock()?
> >>
> >> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> >> phase, waiter may be initially in the top of the queue, but after
> >> dequeued and requeued it may no longer be true.
> >>
> >> This scenario ends up waking the wrong task, which will verify it is no
> >> the top waiter and comes back to sleep. Now we have a situation in which
> >> no task is holding the lock but no one acquires it.
> >>
> >> We can reproduce the bug in PREEMPT_RT with stress-ng:
> >>
> >> while true; do
> >> stress-ng --sched deadline --sched-period 1000000000 \
> >> --sched-runtime 800000000 --sched-deadline \
> >> 1000000000 --mmapfork 23 -t 20
> >> done
> >>
> >> Signed-off-by: Wander Lairson Costa <[email protected]>
> >> ---
> >> kernel/locking/rtmutex.c | 5 +++--
> >> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> >> index 010cf4e6d0b8..728f434de2bb 100644
> >> --- a/kernel/locking/rtmutex.c
> >> +++ b/kernel/locking/rtmutex.c
> >> @@ -901,8 +901,9 @@ static int __sched
> >> rt_mutex_adjust_prio_chain(struct task_struct *task,
> >> * then we need to wake the new top waiter up to try
> >> * to get the lock.
> >> */
> >> - if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
> >> - wake_up_state(waiter->task, waiter->wake_state);
> >> + top_waiter = rt_mutex_top_waiter(lock);
> >> + if (prerequeue_top_waiter != top_waiter)
> >> + wake_up_state(top_waiter->task, top_waiter->wake_state);
> >> raw_spin_unlock_irq(&lock->wait_lock);
> >> return 0;
> >> }
> >
> > I would say that if a rt_mutex has no owner but have waiters, we
> > should always wake up the top waiter whether or not it is the current
> > waiter. So what is the result of the stress-ng run above? Is it a
> > hang? It is not clear from your patch description.
>
> BTW, if it is a hang. What arch has this problem? x86 or arm64? There is
> a recent report of some rt_mutex locking issue in arm64, I believe. I
> don't know if it will be related. So it is important to know in what
> arch you see this problem.
>

x86_64. Note that, at least in my test case, it only manifested under PREEMPT_RT.

> Cheers,
> Longman
>

2023-01-17 22:31:01

by Waiman Long

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

On 1/17/23 14:32, Waiman Long wrote:
> On 1/17/23 12:26, Wander Lairson Costa wrote:
>> In task_blocked_on_lock() we save the owner, release the wait_lock and
>> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
>> again, the owner may release the lock and deboost.
> Are you referring to task_blocks_on_rt_mutex(), not
> task_blocked_on_lock()?
>>
>> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
>> phase, waiter may be initially in the top of the queue, but after
>> dequeued and requeued it may no longer be true.
>>
>> This scenario ends up waking the wrong task, which will verify it is no
>> the top waiter and comes back to sleep. Now we have a situation in which
>> no task is holding the lock but no one acquires it.
>>
>> We can reproduce the bug in PREEMPT_RT with stress-ng:
>>
>> while true; do
>>      stress-ng --sched deadline --sched-period 1000000000 \
>>              --sched-runtime 800000000 --sched-deadline \
>>              1000000000 --mmapfork 23 -t 20
>> done
>>
>> Signed-off-by: Wander Lairson Costa <[email protected]>
>> ---
>>   kernel/locking/rtmutex.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
>> index 010cf4e6d0b8..728f434de2bb 100644
>> --- a/kernel/locking/rtmutex.c
>> +++ b/kernel/locking/rtmutex.c
>> @@ -901,8 +901,9 @@ static int __sched
>> rt_mutex_adjust_prio_chain(struct task_struct *task,
>>            * then we need to wake the new top waiter up to try
>>            * to get the lock.
>>            */
>> -        if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
>> -            wake_up_state(waiter->task, waiter->wake_state);
>> +        top_waiter = rt_mutex_top_waiter(lock);
>> +        if (prerequeue_top_waiter != top_waiter)
>> +            wake_up_state(top_waiter->task, top_waiter->wake_state);
>>           raw_spin_unlock_irq(&lock->wait_lock);
>>           return 0;
>>       }
>
> I would say that if a rt_mutex has no owner but have waiters, we
> should always wake up the top waiter whether or not it is the current
> waiter. So what is the result of the stress-ng run above? Is it a
> hang? It is not clear from your patch description.

BTW, if it is a hang. What arch has this problem? x86 or arm64? There is
a recent report of some rt_mutex locking issue in arm64, I believe. I
don't know if it will be related. So it is important to know in what
arch you see this problem.

Cheers,
Longman

2023-01-18 00:51:31

by Thomas Gleixner

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

Wander!

On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> In task_blocked_on_lock() we save the owner, release the wait_lock and
> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> again, the owner may release the lock and deboost.

This does not make sense in several aspects:

1) Who is 'we'? You, me, someone else? None of us does anything of the
above.

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog

2) What has task_blocked_on_lock() to do with the logic in
rt_mutex_adjust_prio_chain() which is called by other callsites
too?

3) If the owner releases the lock and deboosts then this has
absolutely nothing to do with the lock because the priority of
the owner is determined by its own priority and the priority of the
top most waiter. If the owner releases the lock then it marks the
lock ownerless, wakes the top most waiter and deboosts itself. In
this owner deboost rt_mutex_adjust_prio_chain() is not involved at
all. Why?

Because the owner deboost does not affect the priority of the
waiters at all. It's the other way round: Waiter priority affects
the owner priority if the waiter priority is higher than the owner
priority.

> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> phase, waiter may be initially in the top of the queue, but after
> dequeued and requeued it may no longer be true.

That's related to your above argumentation in which way?

rt_mutex_adjust_prio_chain()

    lock->wait_lock is held across the whole operation

        prerequeue_top_waiter = rt_mutex_top_waiter(lock);

    This saves the current top waiter before the dequeue()/enqueue()
    sequence.

        rt_mutex_dequeue(lock, waiter);
        waiter_update_prio(waiter, task);
        rt_mutex_enqueue(lock, waiter);

        if (!rt_mutex_owner(lock)) {

    This is the case where the lock has no owner, i.e. the previous owner
    unlocked and the chainwalk cannot be continued.

    Now the code checks whether the requeue changed the top waiter task:

        if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))

What can make this condition true?

1) @waiter is the new top waiter due to the requeue operation

2) @waiter is no longer the top waiter due to the requeue operation

So in both cases the new top waiter must be woken up so it can take over
the ownerless lock.

Here is where the code is buggy. It only considers case #1, but not
case #2, right?

So your patch is correct, but the explanation in your changelog has
absolutely nothing to do with the problem.

Why?

#2 is caused by a top waiter dropping out due to a signal or timeout
and thereby deboosting the whole lock chain.

So the relevant callchain which causes the problem originates from
remove_waiter()

See?

Thanks,

tglx

2023-01-18 19:15:54

by Wander Lairson Costa

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <[email protected]> wrote:
>
> Wander!
>
> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> > In task_blocked_on_lock() we save the owner, release the wait_lock and
> > call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> > again, the owner may release the lock and deboost.
>
> This does not make sense in several aspects:
>
> 1) Who is 'we'? You, me, someone else? None of us does anything of the
> above.
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
>
> 2) What has task_blocked_on_lock() to do with the logic in
> rt_mutex_adjust_prio_chain() which is called by other callsites
> too?
>
> 3) If the owner releases the lock and deboosts then this has
> absolutely nothing to do with the lock because the priority of a
> the owner is determined by its own priority and the priority of the
> top most waiter. If the owner releases the lock then it marks the
> lock ownerless, wakes the top most waiter and deboosts itself. In
> this owner deboost rt_mutex_adjust_prio_chain() is not involved at
> all. Why?
>
> Because the owner deboost does not affect the priority of the
> waiters at all. It's the other way round: Waiter priority affects
> the owner priority if the waiter priority is higher than the owner
> priority.
>
> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> > phase, waiter may be initially in the top of the queue, but after
> > dequeued and requeued it may no longer be true.
>
> That's related to your above argumentation in which way?
>

I think I made the mistake of not explicitly saying at least three
tasks are involved:

- A Task T1 that currently holds the mutex
- A Task T2 that is the top waiter
- A Task T3 that changes the top waiter

T3 tries to acquire the mutex, but as T1 holds it, it calls
task_blocked_on_lock() and saves the owner. It eventually calls
rt_mutex_adjust_prio_chain(), but it releases the wait lock before
doing so. This opens a window for T1 to release the mutex and wake up
T2. Before T2 runs, T3 acquires the wait lock again inside
rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
changes the top waiter, then 1) when T2 runs, it will verify that it
is no longer the top waiter and go back to sleep, and 2) as you observed
below, the waiter doesn't point to the top waiter and will therefore
wake up the wrong task.


> rt_mutex_adjust_prio_chain()
>
> lock->wait_lock is held across the whole operation
>
> prerequeue_top_waiter = rt_mutex_top_waiter(lock);
>
> This saves the current top waiter before the dequeue()/enqueue()
> sequence.
>
> rt_mutex_dequeue(lock, waiter);
> waiter_update_prio(waiter, task);
> rt_mutex_enqueue(lock, waiter);
>
> if (!rt_mutex_owner(lock)) {
>
> This is the case where the lock has no owner, i.e. the previous owner
> unlocked and the chainwalk cannot be continued.
>
> Now the code checks whether the requeue changed the top waiter task:
>
> if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
>
> What can make this condition true?
>
> 1) @waiter is the new top waiter due to the requeue operation
>
> 2) @waiter is not longer the top waiter due to the requeue operation
>
> So in both cases the new top waiter must be woken up so it can take over
> the ownerless lock.
>
> Here is where the code is buggy. It only considers case #1, but not
> case #2, right?
>
> So your patch is correct, but the explanation in your changelog has
> absolutely nothing to do with the problem.
>
> Why?
>
> #2 is caused by a top waiter dropping out due to a signal or timeout
> and thereby deboosting the whole lock chain.
>
> So the relevant callchain which causes the problem originates from
> remove_waiter()
>
> See?
>

Another piece of information I forgot: I spotted the bug in
spinlock_rt, which uses an rtmutex under the hood. It has a different
code path in the lock scenario, and there is no call to
remove_waiter() (unless I am missing something).
Anyway, you summed it up pretty well here: "@waiter is no longer the
top waiter due to the requeue operation". I tried (and failed) to
explain the call chain that ends up in the buggy scenario, but now I
think I should just describe the fundamental problem (the waiter
doesn't point to the top waiter).

> Thanks,
>
> tglx
>

2023-01-19 21:22:49

by Thomas Gleixner

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

Wander!

On Wed, Jan 18 2023 at 15:49, Wander Lairson Costa wrote:
> On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <[email protected]> wrote:
>> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
>> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
>> > phase, waiter may be initially in the top of the queue, but after
>> > dequeued and requeued it may no longer be true.
>>
>> That's related to your above argumentation in which way?
>>
>
> I think I made the mistake of not explicitly saying at least three
> tasks are involved:
>
> - A Task T1 that currently holds the mutex
> - A Task T2 that is the top waiter
> - A Task T3 that changes the top waiter
>
> T3 tries to acquire the mutex, but as T1 holds it, it calls
> task_blocked_on_lock() and saves the owner. It eventually calls
> rt_mutex_adjust_prio_chain(), but it releases the wait lock before
> doing so. This opens a window for T1 to release the mutex and wake up
> T2. Before T2 runs, T3 acquires the wait lock again inside
> rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
> changes the top waiter, then 1) When T2 runs, it will verify that it
> is no longer the top waiter and comes back to sleep 2) As you observed
> below, the waiter doesn't point to the top waiter and, therefore, it
> will wake up the wrong task.

This is still confusing as hell because the wait locks you are talking
about belong to different rtmutexes. So there is no drop-the-wait_lock,
reacquire-the-wait_lock window.

There must be (multiple) lock chains involved, but I couldn't figure out
yet what the actual scenario is in the case of a pure rt_spinlock lock
chain:

> Another piece of information I forgot: I spotted the bug in the
> spinlock_rt, which uses a rtmutex under the hood. It has a different
> code path in the lock scenario, and there is no call to
> remove_waiter() (or I am missing something).

Correct. But this still might be a lock chain issue involving a
non-rt_spinlock lock, which allows early removal.

> Anyway, you summed it up pretty well here: "@waiter is no longer the
> top waiter due to the requeue operation". I tried (and failed) to
> explain the call chain that ends up in the buggy scenario, but now I
> think I should just describe the fundamental problem (the waiter
> doesn't point to the top waiter).

You really want to provide the information WHY this case can happen at
all. If it's not the removal case and related to some other obscure lock
chain problem, then we really need to understand the scenario because
that lets us analyze whether there are other holes we did not think
about yet.

If you have traces which show the sequence of lock events leading to
this problem, then you should be able to decode the scenario. If you
fail to extract the information, then please provide the traces so we
can stare at them.

Thanks,

tglx

2023-01-31 17:47:24

by Wander Lairson Costa

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

On Thu, Jan 19, 2023 at 5:54 PM Thomas Gleixner <[email protected]> wrote:
>
> Wander!
>
> On Wed, Jan 18 2023 at 15:49, Wander Lairson Costa wrote:
> > On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <[email protected]> wrote:
> >> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> >> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> >> > phase, waiter may be initially in the top of the queue, but after
> >> > dequeued and requeued it may no longer be true.
> >>
> >> That's related to your above argumentation in which way?
> >>
> >
> > I think I made the mistake of not explicitly saying at least three
> > tasks are involved:
> >
> > - A Task T1 that currently holds the mutex
> > - A Task T2 that is the top waiter
> > - A Task T3 that changes the top waiter
> >
> > T3 tries to acquire the mutex, but as T1 holds it, it calls
> > task_blocked_on_lock() and saves the owner. It eventually calls
> > rt_mutex_adjust_prio_chain(), but it releases the wait lock before
> > doing so. This opens a window for T1 to release the mutex and wake up
> > T2. Before T2 runs, T3 acquires the wait lock again inside
> > rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
> > changes the top waiter, then 1) When T2 runs, it will verify that it
> > is no longer the top waiter and comes back to sleep 2) As you observed
> > below, the waiter doesn't point to the top waiter and, therefore, it
> > will wake up the wrong task.
>
> This is still confusing as hell because the wait locks you are talking
> about belong to different rtmutexes. So there is no drops wait lock and
> reacquires wait lock window.
>
> There must be (multiple) lock chains involved, but I couldn't figure out
> yet what the actaul scenario is in the case of a pure rt_spinlock clock
> chain:
>

Sorry it took so long to reply. I didn't have the traces anymore and
had to regenerate them. You spotted an error in my analysis, so I
had to start over.

> > Another piece of information I forgot: I spotted the bug in the
> > spinlock_rt, which uses a rtmutex under the hood. It has a different
> > code path in the lock scenario, and there is no call to
> > remove_waiter() (or I am missing something).
>
> Correct. But this still might be a lock chain issue where a non
> rt_spinlock which allows early removal.
>
> > Anyway, you summed it up pretty well here: "@waiter is no longer the
> > top waiter due to the requeue operation". I tried (and failed) to
> > explain the call chain that ends up in the buggy scenario, but now I
> > think I should just describe the fundamental problem (the waiter
> > doesn't point to the top waiter).
>
> You really want to provide the information WHY this case can happen at
> all. If it's not the removal case and related to some other obscure lock
> chain problem, then we really need to understand the scenario because
> that lets us analyze whether there are other holes we did not think
> about yet.
>
> If you have traces which show the sequence of lock events leading to
> this problem, then you should be able to decode the scenario. If you
> fail to extract the information, then please provide the traces so we
> can stare at them.
>

Here we go:

Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty but nobody holds it.

T1                                T2                                T3
                                                                    spin_lock(L1)
                                                                      raw_spin_lock(L1->wait_lock)
                                                                      rtlock_slowlock_locked(L1)
                                                                        task_blocks_on_rt_mutex(L1, T3)
                                                                          orig_waiter->lock = L1
                                                                          orig_waiter->task = T3
                                                                          raw_spin_unlock(L1->wait_lock)
                                                                          rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)

                                  spin_unlock(L2)
                                    rt_mutex_slowunlock(L2)
                                      raw_spin_lock(L2->wait_lock)
                                      wakeup(T1)
                                      raw_spin_unlock(L2->wait_lock)
                                                                          waiter = T1->pi_blocked_on
                                                                          waiter == rt_mutex_top_waiter(L2)
                                                                          waiter->task == T1
                                                                          raw_spin_lock(L2->wait_lock)
                                                                          dequeue(L2, waiter)
                                                                          update_prio(waiter, T1)
                                                                          enqueue(L2, waiter)
                                                                          waiter != rt_mutex_top_waiter(L2)
                                                                          L2->owner == NULL
                                                                          wakeup(T1)
                                                                          raw_spin_unlock(L2->wait_lock)
T1 wakes up
T1 != top_waiter(L2)
  schedule_rtlock()

What I observed is that T1's deadline is updated before the call to
update_prio(), and the new deadline removes it from the top waiter
position in L2's wait queue.

Does that make sense?


2023-01-31 17:54:49

by Wander Lairson Costa

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

On Tue, Jan 31, 2023 at 02:46:19PM -0300, Wander Lairson Costa wrote:
> On Thu, Jan 19, 2023 at 5:54 PM Thomas Gleixner <[email protected]> wrote:
> >
> > Wander!
> >
> > On Wed, Jan 18 2023 at 15:49, Wander Lairson Costa wrote:
> > > On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <[email protected]> wrote:
> > >> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> > >> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> > >> > phase, waiter may be initially in the top of the queue, but after
> > >> > dequeued and requeued it may no longer be true.
> > >>
> > >> That's related to your above argumentation in which way?
> > >>
> > >
> > > I think I made the mistake of not explicitly saying at least three
> > > tasks are involved:
> > >
> > > - A Task T1 that currently holds the mutex
> > > - A Task T2 that is the top waiter
> > > - A Task T3 that changes the top waiter
> > >
> > > T3 tries to acquire the mutex, but as T1 holds it, it calls
> > > task_blocked_on_lock() and saves the owner. It eventually calls
> > > rt_mutex_adjust_prio_chain(), but it releases the wait lock before
> > > doing so. This opens a window for T1 to release the mutex and wake up
> > > T2. Before T2 runs, T3 acquires the wait lock again inside
> > > rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
> > > changes the top waiter, then 1) When T2 runs, it will verify that it
> > > is no longer the top waiter and comes back to sleep 2) As you observed
> > > below, the waiter doesn't point to the top waiter and, therefore, it
> > > will wake up the wrong task.
> >
> > This is still confusing as hell because the wait locks you are talking
> > about belong to different rtmutexes. So there is no drops wait lock and
> > reacquires wait lock window.
> >
> > There must be (multiple) lock chains involved, but I couldn't figure out
> > yet what the actaul scenario is in the case of a pure rt_spinlock clock
> > chain:
> >
>
> Sorry, it took so long to reply. I didn't have the traces anymore and
> had to regenerate them. You spotted an error in my analysis, then I
> had to start over.
>
> > > Another piece of information I forgot: I spotted the bug in the
> > > spinlock_rt, which uses a rtmutex under the hood. It has a different
> > > code path in the lock scenario, and there is no call to
> > > remove_waiter() (or I am missing something).
> >
> > Correct. But this still might be a lock chain issue where a non
> > rt_spinlock which allows early removal.
> >
> > > Anyway, you summed it up pretty well here: "@waiter is no longer the
> > > top waiter due to the requeue operation". I tried (and failed) to
> > > explain the call chain that ends up in the buggy scenario, but now I
> > > think I should just describe the fundamental problem (the waiter
> > > doesn't point to the top waiter).
> >
> > You really want to provide the information WHY this case can happen at
> > all. If it's not the removal case and related to some other obscure lock
> > chain problem, then we really need to understand the scenario because
> > that lets us analyze whether there are other holes we did not think
> > about yet.
> >
> > If you have traces which show the sequence of lock events leading to
> > this problem, then you should be able to decode the scenario. If you
> > fail to extract the information, then please provide the traces so we
> > can stare at them.
> >
>
> Here we go:
>
> Let L1 and L2 be two spinlocks.
>
> Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
> waiter of L2.
>
> Let T2 be the task holding L2.
>
> Let T3 be a task trying to acquire L1.
>
> The following events will lead to a state in which the wait queue of L2
> isn't empty but nobody holds it.
>
> T1 T2 T3
> spin_lock(L1)
>
> raw_spin_lock(L1->wait_lock)
>
> rtlock_slowlock_locked(L1)
>
> task_blocks_on_rt_mutex(L1, T3)
>
> orig_waiter->lock = L1
>
> orig_waiter->task = T3
>
> raw_spin_unlock(L1->wait_lock)
>
> rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)
>
> spin_unlock(L2)
> rt_mutex_slowunlock(L2)
> raw_spin_lock(L2->wait_lock)
> wakeup(T1)
> raw_spin_unlock(L2->wait_lock)
>
> waiter = T1->pi_blocked_on
>
> waiter == rt_mutex_top_waiter(L2)
>
> waiter->task == T1
>
> raw_spin_lock(L2->wait_lock)
>
> dequeue(L2, waiter)
>
> update_prio(waiter, T1)
>
> enqueue(L2, waiter)
>
> waiter != rt_mutex_top_waiter(L2)
>
> L2->owner == NULL
>
> wakeup(T1)
>
> raw_spin_unlock(L2->wait_lock)
> T1 wakes up
> T1 != top_waiter(L2)
> schedule_rtlock()
>

Arrrrghhhh, s**t mail servers... Hopefully now it is formatted correctly:

Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty but nobody holds it.

T1                   T2                       T3
==                   ==                       ==

                                              spin_lock(L1)
                                              | raw_spin_lock(L1->wait_lock)
                                              | rtlock_slowlock_locked(L1)
                                              | | task_blocks_on_rt_mutex(L1, T3)
                                              | | | orig_waiter->lock = L1
                                              | | | orig_waiter->task = T3
                                              | | | raw_spin_unlock(L1->wait_lock)
                                              | | | rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)

                     spin_unlock(L2)
                     | rt_mutex_slowunlock(L2)
                     | | raw_spin_lock(L2->wait_lock)
                     | | wakeup(T1)
                     | | raw_spin_unlock(L2->wait_lock)
                                              | | | | waiter = T1->pi_blocked_on
                                              | | | | waiter == rt_mutex_top_waiter(L2)
                                              | | | | waiter->task == T1
                                              | | | | raw_spin_lock(L2->wait_lock)
                                              | | | | dequeue(L2, waiter)
                                              | | | | update_prio(waiter, T1)
                                              | | | | enqueue(L2, waiter)
                                              | | | | waiter != rt_mutex_top_waiter(L2)
                                              | | | | L2->owner == NULL
                                              | | | | wakeup(T1)
                                              | | | | raw_spin_unlock(L2->wait_lock)
T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()


2023-02-02 08:07:43

by Thomas Gleixner

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

On Tue, Jan 31 2023 at 14:53, Wander Lairson Costa wrote:
> On Tue, Jan 31, 2023 at 02:46:19PM -0300, Wander Lairson Costa wrote:
>> > If you have traces which show the sequence of lock events leading to
>> > this problem, then you should be able to decode the scenario. If you
>> > fail to extract the information, then please provide the traces so we
>> > can stare at them.
>> >
>>
>> Here we go:
>>
>> Let L1 and L2 be two spinlocks.
>>
>> Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
>> waiter of L2.
>>
>> Let T2 be the task holding L2.
>>
>> Let T3 be a task trying to acquire L1.
>>
>> The following events will lead to a state in which the wait queue of L2
>> isn't empty but nobody holds it.

That explains it nicely. Care to resend with proper explanations in the
changelog?

Thanks,

tglx

2023-02-02 11:21:35

by Wander Lairson Costa

Subject: Re: [PATCH] rtmutex: ensure we wake up the top waiter

On Thu, Feb 2, 2023 at 5:07 AM Thomas Gleixner <[email protected]> wrote:
>
> On Tue, Jan 31 2023 at 14:53, Wander Lairson Costa wrote:
> > On Tue, Jan 31, 2023 at 02:46:19PM -0300, Wander Lairson Costa wrote:
> >> > If you have traces which show the sequence of lock events leading to
> >> > this problem, then you should be able to decode the scenario. If you
> >> > fail to extract the information, then please provide the traces so we
> >> > can stare at them.
> >> >
> >>
> >> Here we go:
> >>
> >> Let L1 and L2 be two spinlocks.
> >>
> >> Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
> >> waiter of L2.
> >>
> >> Let T2 be the task holding L2.
> >>
> >> Let T3 be a task trying to acquire L1.
> >>
> >> The following events will lead to a state in which the wait queue of L2
> >> isn't empty but nobody holds it.
>
> That explains it nicely. Care to resend with proper explanations in the
> changelog?
>

Sure thing, thanks.


Subject: [tip: locking/urgent] rtmutex: Ensure that the top waiter is always woken up

The following commit has been merged into the locking/urgent branch of tip:

Commit-ID: db370a8b9f67ae5f17e3d5482493294467784504
Gitweb: https://git.kernel.org/tip/db370a8b9f67ae5f17e3d5482493294467784504
Author: Wander Lairson Costa <[email protected]>
AuthorDate: Thu, 02 Feb 2023 09:30:20 -03:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Mon, 06 Feb 2023 14:49:13 +01:00

rtmutex: Ensure that the top waiter is always woken up

Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty, but no task actually holds the lock.

T1                  T2                  T3
==                  ==                  ==

                                        spin_lock(L1)
                                        | raw_spin_lock(L1->wait_lock)
                                        | rtlock_slowlock_locked(L1)
                                        | | task_blocks_on_rt_mutex(L1, T3)
                                        | | | orig_waiter->lock = L1
                                        | | | orig_waiter->task = T3
                                        | | | raw_spin_unlock(L1->wait_lock)
                                        | | | rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)

                    spin_unlock(L2)              | | | |
                    | rt_mutex_slowunlock(L2)    | | | |
                    | | raw_spin_lock(L2->wait_lock)   | | | |
                    | | wakeup(T1)               | | | |
                    | | raw_spin_unlock(L2->wait_lock) | | | |
                                        | | | | waiter = T1->pi_blocked_on
                                        | | | | waiter == rt_mutex_top_waiter(L2)
                                        | | | | waiter->task == T1
                                        | | | | raw_spin_lock(L2->wait_lock)
                                        | | | | dequeue(L2, waiter)
                                        | | | | update_prio(waiter, T1)
                                        | | | | enqueue(L2, waiter)
                                        | | | | waiter != rt_mutex_top_waiter(L2)
                                        | | | | L2->owner == NULL
                                        | | | | wakeup(T1)
                                        | | | | raw_spin_unlock(L2->wait_lock)
T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()

If the deadline of T1 is updated before the call to update_prio(), and the
new deadline is greater than the deadline of the second top waiter, then
after the requeue T1 is no longer the top waiter, and the wrong task is
woken up, which then goes back to sleep because it is not the top waiter.

This can be reproduced in PREEMPT_RT with stress-ng:

while true; do
stress-ng --sched deadline --sched-period 1000000000 \
--sched-runtime 800000000 --sched-deadline \
1000000000 --mmapfork 23 -t 20
done

A similar issue was pointed out by Thomas versus the cases where the top
waiter drops out early due to a signal or timeout, which is a general issue
for all regular rtmutex use cases, e.g. futex.

The problematic code is in rt_mutex_adjust_prio_chain():

	// Save the top waiter before dequeue/enqueue
	prerequeue_top_waiter = rt_mutex_top_waiter(lock);

	rt_mutex_dequeue(lock, waiter);
	waiter_update_prio(waiter, task);
	rt_mutex_enqueue(lock, waiter);

	// Lock has no owner?
	if (!rt_mutex_owner(lock)) {
		// Top waiter changed
---->		if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
---->			wake_up_state(waiter->task, waiter->wake_state);

This only takes the case into account where @waiter is the new top waiter
due to the requeue operation.

But it fails to handle the case where @waiter is no longer the top
waiter due to the requeue operation.

Ensure that the new top waiter is always woken up so it can take over the
ownerless lock.

[ tglx: Amend changelog, add Fixes tag ]

Fixes: c014ef69b3ac ("locking/rtmutex: Add wake_state to rt_mutex_waiter")
Signed-off-by: Wander Lairson Costa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
---
kernel/locking/rtmutex.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 010cf4e..728f434 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -901,8 +901,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
* then we need to wake the new top waiter up to try
* to get the lock.
*/
- if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
- wake_up_state(waiter->task, waiter->wake_state);
+ top_waiter = rt_mutex_top_waiter(lock);
+ if (prerequeue_top_waiter != top_waiter)
+ wake_up_state(top_waiter->task, top_waiter->wake_state);
raw_spin_unlock_irq(&lock->wait_lock);
return 0;
}