2022-12-02 11:12:21

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] rtmutex: Add acquire semantics for rtmutex lock acquisition

Jan Kara reported the following bug triggering on 6.0.5-rt14 running
dbench on XFS on arm64.

kernel BUG at fs/inode.c:625!
Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP
CPU: 11 PID: 6611 Comm: dbench Tainted: G E 6.0.0-rt14-rt+ #1
Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG
pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : clear_inode+0xa0/0xc0
lr : clear_inode+0x38/0xc0
sp : ffff80000f4f3cd0
x29: ffff80000f4f3cd0 x28: ffff07ff92142000 x27: 0000000000000000
x26: ffff08012aef6058 x25: 0000000000000002 x24: ffffb657395e8000
x23: ffffb65739072008 x22: ffffb656e0bed0a8 x21: ffff08012aef6190
x20: ffff08012aef61f8 x19: ffff08012aef6058 x18: 0000000000000014
x17: 00000000f0d86255 x16: ffffb65737dfdb00 x15: 0100000004000000
x14: 644d000008090000 x13: 644d000008090000 x12: ffff80000f4f3b20
x11: 0000000000000002 x10: ffff083f5ffbe1c0 x9 : ffffb657388284a4
x8 : fffffffffffffffe x7 : ffff80000f4f3b20 x6 : ffff80000f4f3b20
x5 : ffff08012aef6210 x4 : ffff08012aef6210 x3 : 0000000000000000
x2 : ffff08012aef62d8 x1 : ffff07ff8fbbf690 x0 : ffff08012aef61a0
Call trace:
clear_inode+0xa0/0xc0
evict+0x160/0x180
iput+0x154/0x240
do_unlinkat+0x184/0x300
__arm64_sys_unlinkat+0x48/0xc0
el0_svc_common.constprop.4+0xe4/0x2c0
do_el0_svc+0xac/0x100
el0_svc+0x78/0x200
el0t_64_sync_handler+0x9c/0xc0
el0t_64_sync+0x19c/0x1a0

It also affects 6.1-rc7-rt5 and affects a preempt-rt fork of 5.14 so this is
likely a bug that existed forever and only became visible when ARM support
was added to preempt-rt. The same problem does not occur on x86-64 and he
also reported that converting sb->s_inode_wblist_lock to raw_spinlock_t makes
the problem disappear indicating that the RT spinlock variant is the problem.

Will Deacon observed

I'd be more inclined to be suspicious of the slowpath tbh, as we
need to make sure that we have acquire semantics on all paths
where the lock can be taken. Looking at the rtmutex code, this
really isn't obvious to me -- for example, try_to_take_rt_mutex()
appears to be able to return via the 'takeit' label without acquire
semantics and it looks like we might be relying on the caller's
subsequent _unlock_ of the wait_lock for ordering, but that will
give us release semantics which aren't correct.

Sebastian Andrzej Siewior prototyped a fix that does work based on that
comment but it was a little bit overkill and added some fences that
should not be necessary.

The lock owner is updated with an IRQ-safe raw spinlock held but the
spin_unlock does not provide acquire semantics which are needed when
acquiring a mutex. This patch adds the necessary acquire semantics for a
lock operation when the lock owner is updated. It successfully completed
10 iterations of the dbench workload while the vanilla kernel fails on
the first iteration.

[[email protected]: Initial prototype fix]
Reported-by: Jan Kara <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/locking/rtmutex.c | 48 +++++++++++++++++++++++++++++++++++---------
kernel/locking/rtmutex_api.c | 6 +++---
2 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 7779ee8abc2a..35212f260148 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -89,15 +89,33 @@ static inline int __ww_mutex_check_kill(struct rt_mutex *lock,
* set this bit before looking at the lock.
*/

-static __always_inline void
-rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
+static __always_inline struct task_struct *
+rt_mutex_owner_encode(struct rt_mutex_base *lock, struct task_struct *owner)
{
unsigned long val = (unsigned long)owner;

if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;

- WRITE_ONCE(lock->owner, (struct task_struct *)val);
+ return (struct task_struct *)val;
+}
+
+
+static __always_inline void
+rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
+{
+ /*
+ * lock->wait_lock is held but explicit acquire semantics are needed
+ * for a new lock owner so WRITE_ONCE is insufficient.
+ */
+ xchg_acquire(&lock->owner, rt_mutex_owner_encode(lock, owner));
+}
+
+static __always_inline void
+rt_mutex_clear_owner(struct rt_mutex_base *lock)
+{
+ /* lock->wait_lock is held so the unlock provides release semantics. */
+ WRITE_ONCE(lock->owner, rt_mutex_owner_encode(lock, NULL));
}

static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
@@ -106,7 +124,8 @@ static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
}

-static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
+static __always_inline void
+fixup_rt_mutex_waiters(struct rt_mutex_base *lock, bool acquire_lock)
{
unsigned long owner, *p = (unsigned long *) &lock->owner;

@@ -172,8 +191,19 @@ static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
* still set.
*/
owner = READ_ONCE(*p);
- if (owner & RT_MUTEX_HAS_WAITERS)
- WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
+ if (owner & RT_MUTEX_HAS_WAITERS) {
+ /*
+ * See comments in rt_mutex_set_owner and
+ * rt_mutex_clear_owner on why xchg_acquire is used for
+ * updating owner for locking and WRITE_ONCE for unlocking.
+ * WRITE_ONCE would work for both here although other lock
+ * acquisitions may enter the slow path unnecessarily.
+ */
+ if (acquire_lock)
+ xchg_acquire(p, owner & ~RT_MUTEX_HAS_WAITERS);
+ else
+ WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
+ }
}

/*
@@ -1243,7 +1273,7 @@ static int __sched __rt_mutex_slowtrylock(struct rt_mutex_base *lock)
* try_to_take_rt_mutex() sets the lock waiters bit
* unconditionally. Clean this up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);

return ret;
}
@@ -1604,7 +1634,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
* try_to_take_rt_mutex() sets the waiter bit
* unconditionally. We might have to fix that up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);

trace_contention_end(lock, ret);

@@ -1719,7 +1749,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
* try_to_take_rt_mutex() sets the waiter bit unconditionally.
* We might have to fix that up:
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);
debug_rt_mutex_free_waiter(&waiter);

trace_contention_end(lock, 0);
diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
index 900220941caa..cb9fdff76a8a 100644
--- a/kernel/locking/rtmutex_api.c
+++ b/kernel/locking/rtmutex_api.c
@@ -267,7 +267,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
{
debug_rt_mutex_proxy_unlock(lock);
- rt_mutex_set_owner(lock, NULL);
+ rt_mutex_clear_owner(lock);
}

/**
@@ -382,7 +382,7 @@ int __sched rt_mutex_wait_proxy_lock(struct rt_mutex_base *lock,
* try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
* have to fix that up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);
raw_spin_unlock_irq(&lock->wait_lock);

return ret;
@@ -438,7 +438,7 @@ bool __sched rt_mutex_cleanup_proxy_lock(struct rt_mutex_base *lock,
* try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
* have to fix that up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, false);

raw_spin_unlock_irq(&lock->wait_lock);


Subject: Re: [PATCH] rtmutex: Add acquire semantics for rtmutex lock acquisition

On 2022-12-02 10:02:23 [+0000], Mel Gorman wrote:
> The lock owner is updated with an IRQ-safe raw spinlock held but the
> spin_unlock does not provide acquire semantics which are needed when
> acquiring a mutex. This patch adds the necessary acquire semantics for a
> lock operation when the lock owner is updated. It successfully completed
> 10 iterations of the dbench workload while the vanilla kernel fails on
> the first iteration.

I *think* it is

Fixes: 700318d1d7b38 ("locking/rtmutex: Use acquire/release semantics")

Before that, it did cmpxchg() which should be fine.

Regarding mark_rt_mutex_waiters(). Isn't acquire semantic required in
order for the lock-owner not perform the fastpath but go to the slowpath
instead?

Sebastian

2022-12-02 15:43:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] rtmutex: Add acquire semantics for rtmutex lock acquisition

On Fri, Dec 02, 2022 at 12:21:06PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-12-02 10:02:23 [+0000], Mel Gorman wrote:
> > The lock owner is updated with an IRQ-safe raw spinlock held but the
> > spin_unlock does not provide acquire semantics which are needed when
> > acquiring a mutex. This patch adds the necessary acquire semantics for a
> > lock operation when the lock owner is updated. It successfully completed
> > 10 iterations of the dbench workload while the vanilla kernel fails on
> > the first iteration.
>
> I *think* it is
>
> Fixes: 700318d1d7b38 ("locking/rtmutex: Use acquire/release semantics")
>

Adding Davidlohr to cc.

It might have made the problem worse but even then rt_mutex_set_owner was
just a plain assignment and while I didn't check carefully, at a glance
try_to_take_rt_mutex didn't look like it guaranteed ACQUIRE semantics.

> Before that, it did cmpxchg() which should be fine.
>
> Regarding mark_rt_mutex_waiters(). Isn't acquire semantic required in
> order for the lock-owner not perform the fastpath but go to the slowpath
> instead?
>

Good spot, it does. While the most straight-forward solution is to use
cmpxchg_acquire, I think it is overkill because it could incur back-to-back
ACQUIRE operations in the event of contention. There could be a smp_wmb
after the cmpxchg_relaxed but that impacts all arches and a non-paired
smp_wmb is generally frowned upon.

I'm thinking this on top of the patch should be sufficient even though
it's a heavier operation than is necesary for ACQUIRE as well as being
"not typical" according to Documentation/atomic_t.txt. Will, as this
affects ARM primarily do you have any preference?

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 35212f260148..af0dbe4d5e97 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -238,6 +238,13 @@ static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
owner = *p;
} while (cmpxchg_relaxed(p, owner,
owner | RT_MUTEX_HAS_WAITERS) != owner);
+
+ /*
+ * The cmpxchg loop above is relaxed to avoid back-to-back ACQUIRE
+ * operations in the event of contention. Ensure the successful
+ * cmpxchg is visible.
+ */
+ smp_mb__after_atomic();
}

/*

Subject: Re: [PATCH] rtmutex: Add acquire semantics for rtmutex lock acquisition

On 2022-12-02 15:01:58 [+0000], Mel Gorman wrote:
> > Fixes: 700318d1d7b38 ("locking/rtmutex: Use acquire/release semantics")
> >
>
> Adding Davidlohr to cc.
>
> It might have made the problem worse but even then rt_mutex_set_owner was
> just a plain assignment and while I didn't check carefully, at a glance
> try_to_take_rt_mutex didn't look like it guaranteed ACQUIRE semantics.

It looks like it had a strong cmpxchg which was relaxed. But I might be
wrong ;) Either way we need stable tag so that this gets back ported.
The commit in question is since v4.4 and stable trees are maintained
down to 4.9 (but only until JAN so we should hurry ;)).

> > Before that, it did cmpxchg() which should be fine.
> >
> > Regarding mark_rt_mutex_waiters(). Isn't acquire semantic required in
> > order for the lock-owner not perform the fastpath but go to the slowpath
> > instead?
> >
>
> Good spot, it does. While the most straight-forward solution is to use
> cmpxchg_acquire, I think it is overkill because it could incur back-to-back
> ACQUIRE operations in the event of contention. There could be a smp_wmb
> after the cmpxchg_relaxed but that impacts all arches and a non-paired
> smp_wmb is generally frowned upon.

but in general, it should succeed on the first iteration. It can only
fail (and retry) if the owner was able to unlock it first. A second
locker will spin on the wait_lock so.

> I'm thinking this on top of the patch should be sufficient even though
> it's a heavier operation than is necesary for ACQUIRE as well as being
> "not typical" according to Documentation/atomic_t.txt. Will, as this
> affects ARM primarily do you have any preference?
>
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 35212f260148..af0dbe4d5e97 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -238,6 +238,13 @@ static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
> owner = *p;
> } while (cmpxchg_relaxed(p, owner,
> owner | RT_MUTEX_HAS_WAITERS) != owner);
> +
> + /*
> + * The cmpxchg loop above is relaxed to avoid back-to-back ACQUIRE
> + * operations in the event of contention. Ensure the successful
> + * cmpxchg is visible.
> + */
> + smp_mb__after_atomic();
> }

Sebastian

Subject: [tip: locking/urgent] ]65;6203;1crtmutex: Add acquire semantics for rtmutex lock acquisition slow path

The following commit has been merged into the locking/urgent branch of tip:

Commit-ID: 2d77ed7afafef79dbb9db77f646ba429d907ed2b
Gitweb: https://git.kernel.org/tip/2d77ed7afafef79dbb9db77f646ba429d907ed2b
Author: Mel Gorman <[email protected]>
AuthorDate: Fri, 02 Dec 2022 10:02:23
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Mon, 12 Dec 2022 13:13:38 +01:00

]65;6203;1crtmutex: Add acquire semantics for rtmutex lock acquisition slow path

Jan Kara reported the following bug triggering on 6.0.5-rt14 running dbench
on XFS on arm64.

kernel BUG at fs/inode.c:625!
Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP
CPU: 11 PID: 6611 Comm: dbench Tainted: G E 6.0.0-rt14-rt+ #1
pc : clear_inode+0xa0/0xc0
lr : clear_inode+0x38/0xc0
Call trace:
clear_inode+0xa0/0xc0
evict+0x160/0x180
iput+0x154/0x240
do_unlinkat+0x184/0x300
__arm64_sys_unlinkat+0x48/0xc0
el0_svc_common.constprop.4+0xe4/0x2c0
do_el0_svc+0xac/0x100
el0_svc+0x78/0x200
el0t_64_sync_handler+0x9c/0xc0
el0t_64_sync+0x19c/0x1a0

It also affects 6.1-rc7-rt5 and affects a preempt-rt fork of 5.14 so this
is likely a bug that existed forever and only became visible when ARM
support was added to preempt-rt. The same problem does not occur on x86-64
and he also reported that converting sb->s_inode_wblist_lock to
raw_spinlock_t makes the problem disappear indicating that the RT spinlock
variant is the problem.

Which in turn means that RT mutexes on ARM64 and any other weakly ordered
architecture are affected by this independent of RT.

Will Deacon observed:

"I'd be more inclined to be suspicious of the slowpath tbh, as we need to
make sure that we have acquire semantics on all paths where the lock can
be taken. Looking at the rtmutex code, this really isn't obvious to me
-- for example, try_to_take_rt_mutex() appears to be able to return via
the 'takeit' label without acquire semantics and it looks like we might
be relying on the caller's subsequent _unlock_ of the wait_lock for
ordering, but that will give us release semantics which aren't correct."

Sebastian Andrzej Siewior prototyped a fix that does work based on that
comment but it was a little bit overkill and added some fences that should
not be necessary.

The lock owner is updated with an IRQ-safe raw spinlock held, but the
spin_unlock does not provide acquire semantics which are needed when
acquiring a mutex.

Adds the necessary acquire semantics for lock owner updates in the slow path
acquisition and the waiter bit logic.

It successfully completed 10 iterations of the dbench workload while the
vanilla kernel fails on the first iteration.

[ [email protected]: Initial prototype fix ]

Fixes: 700318d1d7b38 ("locking/rtmutex: Use acquire/release semantics")
Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
Reported-by: Jan Kara <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
---
kernel/locking/rtmutex.c | 55 +++++++++++++++++++++++++++++------
kernel/locking/rtmutex_api.c | 6 ++--
2 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 7779ee8..010cf4e 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -89,15 +89,31 @@ static inline int __ww_mutex_check_kill(struct rt_mutex *lock,
* set this bit before looking at the lock.
*/

-static __always_inline void
-rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
+static __always_inline struct task_struct *
+rt_mutex_owner_encode(struct rt_mutex_base *lock, struct task_struct *owner)
{
unsigned long val = (unsigned long)owner;

if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;

- WRITE_ONCE(lock->owner, (struct task_struct *)val);
+ return (struct task_struct *)val;
+}
+
+static __always_inline void
+rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
+{
+ /*
+ * lock->wait_lock is held but explicit acquire semantics are needed
+ * for a new lock owner so WRITE_ONCE is insufficient.
+ */
+ xchg_acquire(&lock->owner, rt_mutex_owner_encode(lock, owner));
+}
+
+static __always_inline void rt_mutex_clear_owner(struct rt_mutex_base *lock)
+{
+ /* lock->wait_lock is held so the unlock provides release semantics. */
+ WRITE_ONCE(lock->owner, rt_mutex_owner_encode(lock, NULL));
}

static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
@@ -106,7 +122,8 @@ static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
}

-static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
+static __always_inline void
+fixup_rt_mutex_waiters(struct rt_mutex_base *lock, bool acquire_lock)
{
unsigned long owner, *p = (unsigned long *) &lock->owner;

@@ -172,8 +189,21 @@ static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
* still set.
*/
owner = READ_ONCE(*p);
- if (owner & RT_MUTEX_HAS_WAITERS)
- WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
+ if (owner & RT_MUTEX_HAS_WAITERS) {
+ /*
+ * See rt_mutex_set_owner() and rt_mutex_clear_owner() on
+ * why xchg_acquire() is used for updating owner for
+ * locking and WRITE_ONCE() for unlocking.
+ *
+ * WRITE_ONCE() would work for the acquire case too, but
+ * in case that the lock acquisition failed it might
+ * force other lockers into the slow path unnecessarily.
+ */
+ if (acquire_lock)
+ xchg_acquire(p, owner & ~RT_MUTEX_HAS_WAITERS);
+ else
+ WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
+ }
}

/*
@@ -208,6 +238,13 @@ static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
owner = *p;
} while (cmpxchg_relaxed(p, owner,
owner | RT_MUTEX_HAS_WAITERS) != owner);
+
+ /*
+ * The cmpxchg loop above is relaxed to avoid back-to-back ACQUIRE
+ * operations in the event of contention. Ensure the successful
+ * cmpxchg is visible.
+ */
+ smp_mb__after_atomic();
}

/*
@@ -1243,7 +1280,7 @@ static int __sched __rt_mutex_slowtrylock(struct rt_mutex_base *lock)
* try_to_take_rt_mutex() sets the lock waiters bit
* unconditionally. Clean this up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);

return ret;
}
@@ -1604,7 +1641,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
* try_to_take_rt_mutex() sets the waiter bit
* unconditionally. We might have to fix that up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);

trace_contention_end(lock, ret);

@@ -1719,7 +1756,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
* try_to_take_rt_mutex() sets the waiter bit unconditionally.
* We might have to fix that up:
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);
debug_rt_mutex_free_waiter(&waiter);

trace_contention_end(lock, 0);
diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
index 9002209..cb9fdff 100644
--- a/kernel/locking/rtmutex_api.c
+++ b/kernel/locking/rtmutex_api.c
@@ -267,7 +267,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
{
debug_rt_mutex_proxy_unlock(lock);
- rt_mutex_set_owner(lock, NULL);
+ rt_mutex_clear_owner(lock);
}

/**
@@ -382,7 +382,7 @@ int __sched rt_mutex_wait_proxy_lock(struct rt_mutex_base *lock,
* try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
* have to fix that up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);
raw_spin_unlock_irq(&lock->wait_lock);

return ret;
@@ -438,7 +438,7 @@ bool __sched rt_mutex_cleanup_proxy_lock(struct rt_mutex_base *lock,
* try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
* have to fix that up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, false);

raw_spin_unlock_irq(&lock->wait_lock);

Subject: [tip: locking/urgent] rtmutex: Add acquire semantics for rtmutex lock acquisition slow path

The following commit has been merged into the locking/urgent branch of tip:

Commit-ID: 1c0908d8e441631f5b8ba433523cf39339ee2ba0
Gitweb: https://git.kernel.org/tip/1c0908d8e441631f5b8ba433523cf39339ee2ba0
Author: Mel Gorman <[email protected]>
AuthorDate: Fri, 02 Dec 2022 10:02:23
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Mon, 12 Dec 2022 19:55:56 +01:00

rtmutex: Add acquire semantics for rtmutex lock acquisition slow path

Jan Kara reported the following bug triggering on 6.0.5-rt14 running dbench
on XFS on arm64.

kernel BUG at fs/inode.c:625!
Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP
CPU: 11 PID: 6611 Comm: dbench Tainted: G E 6.0.0-rt14-rt+ #1
pc : clear_inode+0xa0/0xc0
lr : clear_inode+0x38/0xc0
Call trace:
clear_inode+0xa0/0xc0
evict+0x160/0x180
iput+0x154/0x240
do_unlinkat+0x184/0x300
__arm64_sys_unlinkat+0x48/0xc0
el0_svc_common.constprop.4+0xe4/0x2c0
do_el0_svc+0xac/0x100
el0_svc+0x78/0x200
el0t_64_sync_handler+0x9c/0xc0
el0t_64_sync+0x19c/0x1a0

It also affects 6.1-rc7-rt5 and affects a preempt-rt fork of 5.14 so this
is likely a bug that existed forever and only became visible when ARM
support was added to preempt-rt. The same problem does not occur on x86-64
and he also reported that converting sb->s_inode_wblist_lock to
raw_spinlock_t makes the problem disappear indicating that the RT spinlock
variant is the problem.

Which in turn means that RT mutexes on ARM64 and any other weakly ordered
architecture are affected by this independent of RT.

Will Deacon observed:

"I'd be more inclined to be suspicious of the slowpath tbh, as we need to
make sure that we have acquire semantics on all paths where the lock can
be taken. Looking at the rtmutex code, this really isn't obvious to me
-- for example, try_to_take_rt_mutex() appears to be able to return via
the 'takeit' label without acquire semantics and it looks like we might
be relying on the caller's subsequent _unlock_ of the wait_lock for
ordering, but that will give us release semantics which aren't correct."

Sebastian Andrzej Siewior prototyped a fix that does work based on that
comment but it was a little bit overkill and added some fences that should
not be necessary.

The lock owner is updated with an IRQ-safe raw spinlock held, but the
spin_unlock does not provide acquire semantics which are needed when
acquiring a mutex.

Adds the necessary acquire semantics for lock owner updates in the slow path
acquisition and the waiter bit logic.

It successfully completed 10 iterations of the dbench workload while the
vanilla kernel fails on the first iteration.

[ [email protected]: Initial prototype fix ]

Fixes: 700318d1d7b38 ("locking/rtmutex: Use acquire/release semantics")
Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
Reported-by: Jan Kara <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
---
kernel/locking/rtmutex.c | 55 +++++++++++++++++++++++++++++------
kernel/locking/rtmutex_api.c | 6 ++--
2 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 7779ee8..010cf4e 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -89,15 +89,31 @@ static inline int __ww_mutex_check_kill(struct rt_mutex *lock,
* set this bit before looking at the lock.
*/

-static __always_inline void
-rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
+static __always_inline struct task_struct *
+rt_mutex_owner_encode(struct rt_mutex_base *lock, struct task_struct *owner)
{
unsigned long val = (unsigned long)owner;

if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;

- WRITE_ONCE(lock->owner, (struct task_struct *)val);
+ return (struct task_struct *)val;
+}
+
+static __always_inline void
+rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
+{
+ /*
+ * lock->wait_lock is held but explicit acquire semantics are needed
+ * for a new lock owner so WRITE_ONCE is insufficient.
+ */
+ xchg_acquire(&lock->owner, rt_mutex_owner_encode(lock, owner));
+}
+
+static __always_inline void rt_mutex_clear_owner(struct rt_mutex_base *lock)
+{
+ /* lock->wait_lock is held so the unlock provides release semantics. */
+ WRITE_ONCE(lock->owner, rt_mutex_owner_encode(lock, NULL));
}

static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
@@ -106,7 +122,8 @@ static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
}

-static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
+static __always_inline void
+fixup_rt_mutex_waiters(struct rt_mutex_base *lock, bool acquire_lock)
{
unsigned long owner, *p = (unsigned long *) &lock->owner;

@@ -172,8 +189,21 @@ static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
* still set.
*/
owner = READ_ONCE(*p);
- if (owner & RT_MUTEX_HAS_WAITERS)
- WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
+ if (owner & RT_MUTEX_HAS_WAITERS) {
+ /*
+ * See rt_mutex_set_owner() and rt_mutex_clear_owner() on
+ * why xchg_acquire() is used for updating owner for
+ * locking and WRITE_ONCE() for unlocking.
+ *
+ * WRITE_ONCE() would work for the acquire case too, but
+ * in case that the lock acquisition failed it might
+ * force other lockers into the slow path unnecessarily.
+ */
+ if (acquire_lock)
+ xchg_acquire(p, owner & ~RT_MUTEX_HAS_WAITERS);
+ else
+ WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
+ }
}

/*
@@ -208,6 +238,13 @@ static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
owner = *p;
} while (cmpxchg_relaxed(p, owner,
owner | RT_MUTEX_HAS_WAITERS) != owner);
+
+ /*
+ * The cmpxchg loop above is relaxed to avoid back-to-back ACQUIRE
+ * operations in the event of contention. Ensure the successful
+ * cmpxchg is visible.
+ */
+ smp_mb__after_atomic();
}

/*
@@ -1243,7 +1280,7 @@ static int __sched __rt_mutex_slowtrylock(struct rt_mutex_base *lock)
* try_to_take_rt_mutex() sets the lock waiters bit
* unconditionally. Clean this up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);

return ret;
}
@@ -1604,7 +1641,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
* try_to_take_rt_mutex() sets the waiter bit
* unconditionally. We might have to fix that up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);

trace_contention_end(lock, ret);

@@ -1719,7 +1756,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
* try_to_take_rt_mutex() sets the waiter bit unconditionally.
* We might have to fix that up:
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);
debug_rt_mutex_free_waiter(&waiter);

trace_contention_end(lock, 0);
diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
index 9002209..cb9fdff 100644
--- a/kernel/locking/rtmutex_api.c
+++ b/kernel/locking/rtmutex_api.c
@@ -267,7 +267,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
{
debug_rt_mutex_proxy_unlock(lock);
- rt_mutex_set_owner(lock, NULL);
+ rt_mutex_clear_owner(lock);
}

/**
@@ -382,7 +382,7 @@ int __sched rt_mutex_wait_proxy_lock(struct rt_mutex_base *lock,
* try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
* have to fix that up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, true);
raw_spin_unlock_irq(&lock->wait_lock);

return ret;
@@ -438,7 +438,7 @@ bool __sched rt_mutex_cleanup_proxy_lock(struct rt_mutex_base *lock,
* try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
* have to fix that up.
*/
- fixup_rt_mutex_waiters(lock);
+ fixup_rt_mutex_waiters(lock, false);

raw_spin_unlock_irq(&lock->wait_lock);

2022-12-16 11:28:02

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] rtmutex: Add acquire semantics for rtmutex lock acquisition

On Tue, Dec 06, 2022 at 12:43:51PM +0100, Sebastian Andrzej Siewior wrote:
> > > Before that, it did cmpxchg() which should be fine.
> > >
> > > Regarding mark_rt_mutex_waiters(). Isn't acquire semantic required in
> > > order for the lock-owner not perform the fastpath but go to the slowpath
> > > instead?
> > >
> >
> > Good spot, it does. While the most straight-forward solution is to use
> > cmpxchg_acquire, I think it is overkill because it could incur back-to-back
> > ACQUIRE operations in the event of contention. There could be a smp_wmb
> > after the cmpxchg_relaxed but that impacts all arches and a non-paired
> > smp_wmb is generally frowned upon.
>
> but in general, it should succeed on the first iteration. It can only
> fail (and retry) if the owner was able to unlock it first. A second
> locker will spin on the wait_lock so.
>

Sure, generally it would be fine but it also costs us nothing
to avoid additional overhead in the contended case. The pattern of
atomic_relaxed+smp_mb__after_atomic is unusual but I think the comment is
sufficient to explain why it's structured like that.

--
Mel Gorman
SUSE Labs

2022-12-16 11:51:16

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH] rtmutex: Add acquire semantics for rtmutex lock acquisition

On Fri, Dec 02, 2022 at 03:01:58PM +0000, Mel Gorman wrote:
> On Fri, Dec 02, 2022 at 12:21:06PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2022-12-02 10:02:23 [+0000], Mel Gorman wrote:
> > > The lock owner is updated with an IRQ-safe raw spinlock held but the
> > > spin_unlock does not provide acquire semantics which are needed when
> > > acquiring a mutex. This patch adds the necessary acquire semantics for a
> > > lock operation when the lock owner is updated. It successfully completed
> > > 10 iterations of the dbench workload while the vanilla kernel fails on
> > > the first iteration.
> >
> > I *think* it is
> >
> > Fixes: 700318d1d7b38 ("locking/rtmutex: Use acquire/release semantics")
> >
>
> Adding Davidlohr to cc.
>
> It might have made the problem worse but even then rt_mutex_set_owner was
> just a plain assignment and while I didn't check carefully, at a glance
> try_to_take_rt_mutex didn't look like it guaranteed ACQUIRE semantics.
>
> > Before that, it did cmpxchg() which should be fine.
> >
> > Regarding mark_rt_mutex_waiters(). Isn't acquire semantic required in
> > order for the lock-owner not perform the fastpath but go to the slowpath
> > instead?
> >
>
> Good spot, it does. While the most straight-forward solution is to use
> cmpxchg_acquire, I think it is overkill because it could incur back-to-back
> ACQUIRE operations in the event of contention. There could be a smp_wmb
> after the cmpxchg_relaxed but that impacts all arches and a non-paired
> smp_wmb is generally frowned upon.
>
> I'm thinking this on top of the patch should be sufficient even though
> it's a heavier operation than is necesary for ACQUIRE as well as being
> "not typical" according to Documentation/atomic_t.txt. Will, as this
> affects ARM primarily do you have any preference?
>
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 35212f260148..af0dbe4d5e97 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -238,6 +238,13 @@ static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
> owner = *p;
> } while (cmpxchg_relaxed(p, owner,
> owner | RT_MUTEX_HAS_WAITERS) != owner);
> +
> + /*
> + * The cmpxchg loop above is relaxed to avoid back-to-back ACQUIRE
> + * operations in the event of contention. Ensure the successful
> + * cmpxchg is visible.
> + */
> + smp_mb__after_atomic();

Could we use smp_acquire__after_ctrl_dep() instead?

Will

2022-12-16 14:07:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] rtmutex: Add acquire semantics for rtmutex lock acquisition

On Fri, Dec 16, 2022 at 11:14:12AM +0000, Will Deacon wrote:
> > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> > index 35212f260148..af0dbe4d5e97 100644
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -238,6 +238,13 @@ static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
> > owner = *p;
> > } while (cmpxchg_relaxed(p, owner,
> > owner | RT_MUTEX_HAS_WAITERS) != owner);
> > +
> > + /*
> > + * The cmpxchg loop above is relaxed to avoid back-to-back ACQUIRE
> > + * operations in the event of contention. Ensure the successful
> > + * cmpxchg is visible.
> > + */
> > + smp_mb__after_atomic();
>
> Could we use smp_acquire__after_ctrl_dep() instead?
>

It's might be sufficient but it's not as obviously correct. It lacks an
obvious pairing that is definitely correct but the control dependency
combined with the smp_acquire__after_ctrl_dep *should* be sufficient
against the lock fast path based on the available documentation. However,
I was unable to convince myself that this is definitely correct on all CPUs.

Given that arm64 was trivial to crash on PREEMPT_RT before the patch
(hackbench pipes also triggers the same problem), I'm reluctant to try and
be too clever particularly as I didn't have a reproducer for a problem in
this specific path. If someone can demonstrate a reasonable case where
smp_mb__after_atomic() is too heavy and document that it can be safely
relaxed then great. At least if that relaxing is wrong, there will be a
bisection point between the fix and the reintroduction of damage.

--
Mel Gorman
SUSE Labs

2022-12-16 16:08:54

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH] rtmutex: Add acquire semantics for rtmutex lock acquisition

Hey Mel,

On Fri, Dec 16, 2022 at 01:55:48PM +0000, Mel Gorman wrote:
> On Fri, Dec 16, 2022 at 11:14:12AM +0000, Will Deacon wrote:
> > > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> > > index 35212f260148..af0dbe4d5e97 100644
> > > --- a/kernel/locking/rtmutex.c
> > > +++ b/kernel/locking/rtmutex.c
> > > @@ -238,6 +238,13 @@ static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
> > > owner = *p;
> > > } while (cmpxchg_relaxed(p, owner,
> > > owner | RT_MUTEX_HAS_WAITERS) != owner);
> > > +
> > > + /*
> > > + * The cmpxchg loop above is relaxed to avoid back-to-back ACQUIRE
> > > + * operations in the event of contention. Ensure the successful
> > > + * cmpxchg is visible.
> > > + */
> > > + smp_mb__after_atomic();
> >
> > Could we use smp_acquire__after_ctrl_dep() instead?
> >
>
> It's might be sufficient but it's not as obviously correct. It lacks an
> obvious pairing that is definitely correct but the control dependency
> combined with the smp_acquire__after_ctrl_dep *should* be sufficient
> against the lock fast path based on the available documentation. However,
> I was unable to convince myself that this is definitely correct on all CPUs.

I'm reasonably confident (insofar as you can be confident about any of
this) that it will work, and on arm64 it will result in a slightly cheaper
instruction being emitted. However, I just chimed in because you asked for
my opinion so I'm absolutely fine with whichever direction you prefer to
take!

> Given that arm64 was trivial to crash on PREEMPT_RT before the patch
> (hackbench pipes also triggers the same problem), I'm reluctant to try and
> be too clever particularly as I didn't have a reproducer for a problem in
> this specific path. If someone can demonstrate a reasonable case where
> smp_mb__after_atomic() is too heavy and document that it can be safely
> relaxed then great. At least if that relaxing is wrong, there will be a
> bisection point between the fix and the reintroduction of damage.

I guess bigeasy can give the weaker barrier a try if he has the time, but
otherwise we can leave the change as-is.

Cheers,

Will

Subject: Re: [PATCH] rtmutex: Add acquire semantics for rtmutex lock acquisition

On 2022-12-16 15:58:04 [+0000], Will Deacon wrote:
> I guess bigeasy can give the weaker barrier a try if he has the time, but
> otherwise we can leave the change as-is.

I can't give a try because I have no HW. All I contributed here so far
was based on what you wrote in the previous email and then I spotted the
lack of the barrier of any sorts and asked about it.
I _would_ assume that the cmpxchg_barrier() here would work but I'm far
from knowing.
If the explicit barrier after the cmpxchg_relaxed() is what you two
agree on and it is better/ cheaper/ more obvious then fine. Do it ;)

> Cheers,
>
> Will

Sebastian