LinuxLists.cc - [RFC] Potential unnecessary barrier in slow path of rt

2023-01-22 15:06:41

Subject: [RFC] Potential unnecessary barrier in slow path of rt_mutex

Hello,

We have been trying to verify that the rt_mutex patch
https://lkml.org/lkml/2022/12/2/279 guarantees the intended acquire
semantics and that there are no other potential problems with it. For
that, we are using a verification tool as discussed in
https://lkml.org/lkml/2022/8/26/597.

The tool reported a data race for which I already submitted a patch in
https://lkml.org/lkml/2023/1/20/702.

During the discussion of the rt_mutex patch, Will proposed replacing the
explicit barrier in 'mark_rt_mutex_waiters' with
'smp_acquire__after_ctrl_dep()'. We wanted to check if it would be
possible to get rid of the barrier instead of weakening it.

While according to LKMM /tools/memory-model/linux-kernel.cat, mutual
exclusion is violated if we do this, the verification tool reported that
this violation is not possible according to the formal memory models of
aarch64, riscv and power. Interestingly, if the data race from above is
fixed, mutual exclusion is guaranteed by LKMM even if we fully remove
the barrier. The reason for this is that marking the racy access
introduces more ordering guarantees.

Another possibility is to keep the barrier, but revert the change
WRITE_ONCE() -> xchg_acquire() in 'rt_mutex_set_owner'. Boqun suggested
this one might be better because it's weird using xchg_acquire() to get
an acquire store.

Either the barrier or the acquire store are needed for the correctness
of the algorithm. If we relax both, the tool reports a violation.

I would like to get some feedback about the following.

(1) Since the barriers are in the slow path of the algorithm, is it
worth it to try to play it smart or should we play it safe (at the cost
of potentially having an extra barrier)?

(2) In which cases were the acquiring semantics missing before the
rt_mutex patch? This is not clear for me from the patch and the
corresponding discussion. Once I understand this, I might be able to use
the rules from LKMM to give a more formal argument of why one of the
barriers is not needed (or find an example that shows that indeed both
barriers are necessary).

(3) If we agree that one barrier might not necessary, but we still keep
both to be safe, would it make sense to add some comment in the code
along the lines "Of these N barriers, you can probably get away with the
following subset of them, but we leave all in place to be safe"?

The following litmus test (I hope comments are enough to trace back to
the real implementation) shows all the above issues in detail and how
the changes I propose impact them. It can be run using the herd7 tool
http://diy.inria.fr/www/?record=linux#.
The final query asks if P0 can release the lock via the fast path even
if P1 set the wait bit. The expected result is "No".

C rt_mutex

{
atomic_t owner = ATOMIC_INIT(0);
}

P0(int *owner, int *x) {
int r0 = 2; // current task
int r1 = atomic_cmpxchg_acquire(owner,0,r0); // rtlock acquire
int r2 = READ_ONCE(*x); // critical section;
WRITE_ONCE(*x, r2 + 1); // marked to avoid
data races
int r3 = atomic_cmpxchg_release(owner,r0,0); // rtlock release
succeeds
}

P1(int *owner, int *x) {
int r0 = 4; // current task
int r1 = atomic_cmpxchg_acquire(owner,0,r0); // rtlock acquire failed
int r2 = *owner; // mark waiters
int r3 = atomic_cmpxchg_relaxed(owner,r2,r2|1); // mark waiters succeeds
smp_mb(); // mark waiters
int r4 = xchg_acquire(owner,r0); // set owner
int r5 = READ_ONCE(*x); // critical section
}

// 0:r1 = 0: rtlock acquire succeeds
// 0:r3 = 2: rtlock release succeeds
// 1:r1 = 2: rtlock acquire failed
// 1:r3 = 1:r2: mark waiters succeeds
// 1:r5 = 0: critical section violated
exists (0:r1 = 0 /\ 0:r3 = 2 /\ 1:r1 = 2 /\ 1:r3 = 1:r2 /\ 1:r5 = 0)

herd7 reports a race related to 'mark waiters'. Below is the fix I
proposed in https://lkml.org/lkml/2023/1/20/702.

P0(int *owner, int *x) {
int r0 = 2;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*x);
WRITE_ONCE(*x, r2 + 1);
int r3 = atomic_cmpxchg_release(owner,r0,0);
}

P1(int *owner, int *x) {
int r0 = 4;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*owner);
int r3 = atomic_cmpxchg_relaxed(owner,r2,r2|1);
smp_mb();
int r4 = xchg_acquire(owner,r0);
int r5 = READ_ONCE(*x);
}

This code is correct (result is "No") and there are no races, but it has
an unnecessary barrier. smp_mb() can be removed without affecting the
result.

P0(int *owner, int *x) {
int r0 = 2;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*x);
WRITE_ONCE(*x, r2 + 1);
int r3 = atomic_cmpxchg_release(owner,r0,0);
}

P1(int *owner, int *x) {
int r0 = 4;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*owner);
int r3 = atomic_cmpxchg_relaxed(owner,r2,r2|1);
int r4 = xchg_acquire(owner,r0);
int r5 = READ_ONCE(*x);
}

The other possibility is to keep the barrier, but relax the acquire store.

P0(int *owner, int *x) {
int r0 = 2;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*x);
WRITE_ONCE(*x, r2 + 1);
int r3 = atomic_cmpxchg_release(owner,r0,0);
}

P1(int *owner, int *x) {
int r0 = 4;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*owner);
int r3 = atomic_cmpxchg_relaxed(owner,r2,r2|1);
smp_mb();
WRITE_ONCE(*owner,r0);
int r5 = READ_ONCE(*x);
}

Once again the result is "No".

Regards,
Hernan