2019-04-19 18:24:05

by Alan Stern

Subject: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

The description of smp_mb__before_atomic() and smp_mb__after_atomic()
in Documentation/atomic_t.txt is slightly terse and misleading. It
does not clearly state that these barriers only affect the ordering of
other instructions with respect to the atomic operation.

This improves the text to make the actual ordering implications clear,
and also to explain how these barriers differ from a RELEASE or
ACQUIRE ordering.

Signed-off-by: Alan Stern <[email protected]>

---


Documentation/atomic_t.txt | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

Index: usb-devel/Documentation/atomic_t.txt
===================================================================
--- usb-devel.orig/Documentation/atomic_t.txt
+++ usb-devel/Documentation/atomic_t.txt
@@ -171,7 +171,10 @@ The barriers:
smp_mb__{before,after}_atomic()

only apply to the RMW ops and can be used to augment/upgrade the ordering
-inherent to the used atomic op. These barriers provide a full smp_mb().
+inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
+only the RMW op itself against the instructions preceding the
+smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
+not order instructions on the other side of the RMW op at all.

These helper barriers exist because architectures have varying implicit
ordering on their SMP atomic primitives. For example our TSO architectures
@@ -195,7 +198,8 @@ Further, while something like:
atomic_dec(&X);

is a 'typical' RELEASE pattern, the barrier is strictly stronger than
-a RELEASE. Similarly for something like:
+a RELEASE because it orders preceding instructions against both the read
+and write parts of the atomic_dec(). Similarly, something like:

atomic_inc(&X);
smp_mb__after_atomic();
@@ -227,7 +231,8 @@ strictly stronger than ACQUIRE. As illus

This should not happen; but a hypothetical atomic_inc_acquire() --
(void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
-since then:
+because it would not order the W part of the RMW against the following
+WRITE_ONCE. Thus:

P1 P2




2019-04-19 18:20:51

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Fri, Apr 19, 2019 at 01:21:45PM -0400, Alan Stern wrote:
> The description of smp_mb__before_atomic() and smp_mb__after_atomic()
> in Documentation/atomic_t.txt is slightly terse and misleading. It
> does not clearly state that these barriers only affect the ordering of
> other instructions with respect to the atomic operation.
>
> This improves the text to make the actual ordering implications clear,
> and also to explain how these barriers differ from a RELEASE or
> ACQUIRE ordering.
>
> Signed-off-by: Alan Stern <[email protected]>

Queued for further review, thank you, Alan!

Thanx, Paul

> ---
>
>
> Documentation/atomic_t.txt | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> Index: usb-devel/Documentation/atomic_t.txt
> ===================================================================
> --- usb-devel.orig/Documentation/atomic_t.txt
> +++ usb-devel/Documentation/atomic_t.txt
> @@ -171,7 +171,10 @@ The barriers:
> smp_mb__{before,after}_atomic()
>
> only apply to the RMW ops and can be used to augment/upgrade the ordering
> -inherent to the used atomic op. These barriers provide a full smp_mb().
> +inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
> +only the RMW op itself against the instructions preceding the
> +smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
> +not order instructions on the other side of the RMW op at all.
>
> These helper barriers exist because architectures have varying implicit
> ordering on their SMP atomic primitives. For example our TSO architectures
> @@ -195,7 +198,8 @@ Further, while something like:
> atomic_dec(&X);
>
> is a 'typical' RELEASE pattern, the barrier is strictly stronger than
> -a RELEASE. Similarly for something like:
> +a RELEASE because it orders preceding instructions against both the read
> +and write parts of the atomic_dec(). Similarly, something like:
>
> atomic_inc(&X);
> smp_mb__after_atomic();
> @@ -227,7 +231,8 @@ strictly stronger than ACQUIRE. As illus
>
> This should not happen; but a hypothetical atomic_inc_acquire() --
> (void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
> -since then:
> +because it would not order the W part of the RMW against the following
> +WRITE_ONCE. Thus:
>
> P1 P2
>
>


2019-04-19 18:29:16

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Fri, Apr 19, 2019 at 08:00:17PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 19, 2019 at 01:21:45PM -0400, Alan Stern wrote:
> > Index: usb-devel/Documentation/atomic_t.txt
> > ===================================================================
> > --- usb-devel.orig/Documentation/atomic_t.txt
> > +++ usb-devel/Documentation/atomic_t.txt
> > @@ -171,7 +171,10 @@ The barriers:
> > smp_mb__{before,after}_atomic()
> >
> > only apply to the RMW ops and can be used to augment/upgrade the ordering
> > -inherent to the used atomic op. These barriers provide a full smp_mb().
> > +inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
> > +only the RMW op itself against the instructions preceding the
> > +smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
> > +not order instructions on the other side of the RMW op at all.
>
> Now it is I who is confused; what?
>
> x = 1;
> smp_mb__before_atomic();
> atomic_add(1, &a);
> y = 1;
>
> the stores to both x and y will be ordered as if an smp_mb() were
> there. There is no order between a and y otoh.

Let's look at x86. And a slightly different example:

x = 1;
smp_mb__before_atomic();
atomic_add(1, &a);
r1 = y;

The atomic_add() asm does not have the "memory" constraint, which is
completely legitimate because atomic_add() does not return a value,
and thus guarantees no ordering. The compiler is therefore within
its rights to transform the code into the following:

x = 1;
smp_mb__before_atomic();
r1 = y;
atomic_add(1, &a);

But x86's smp_mb__before_atomic() is just a compiler barrier, and
x86 is further allowed to reorder prior stores with later loads.
The CPU can therefore execute this code as follows:

r1 = y;
x = 1;
smp_mb__before_atomic();
atomic_add(1, &a);

So in general, the ordering is guaranteed only to the atomic itself,
not to accesses on the other side of the atomic.
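
For reference, the arch_atomic_add() in question looks roughly like this
(a sketch of the existing x86 code, not a proposed change); note the
absence of a "memory" clobber, which is what leaves the compiler free to
move other memory accesses across it:

        static __always_inline void arch_atomic_add(int i, atomic_t *v)
        {
                asm volatile(LOCK_PREFIX "addl %1,%0"
                             : "+m" (v->counter)
                             : "ir" (i));
        }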

Thanx, Paul


2019-04-19 19:32:11

by Peter Zijlstra

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Fri, Apr 19, 2019 at 01:21:45PM -0400, Alan Stern wrote:
> Index: usb-devel/Documentation/atomic_t.txt
> ===================================================================
> --- usb-devel.orig/Documentation/atomic_t.txt
> +++ usb-devel/Documentation/atomic_t.txt
> @@ -171,7 +171,10 @@ The barriers:
> smp_mb__{before,after}_atomic()
>
> only apply to the RMW ops and can be used to augment/upgrade the ordering
> -inherent to the used atomic op. These barriers provide a full smp_mb().
> +inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
> +only the RMW op itself against the instructions preceding the
> +smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
> +not order instructions on the other side of the RMW op at all.

Now it is I who is confused; what?

x = 1;
smp_mb__before_atomic();
atomic_add(1, &a);
y = 1;

the stores to both x and y will be ordered as if an smp_mb() were
there. There is no order between a and y otoh.

2019-04-20 00:27:51

by Nicholas Piggin

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

Paul E. McKenney's on April 20, 2019 4:26 am:
> On Fri, Apr 19, 2019 at 08:00:17PM +0200, Peter Zijlstra wrote:
>> On Fri, Apr 19, 2019 at 01:21:45PM -0400, Alan Stern wrote:
>> > Index: usb-devel/Documentation/atomic_t.txt
>> > ===================================================================
>> > --- usb-devel.orig/Documentation/atomic_t.txt
>> > +++ usb-devel/Documentation/atomic_t.txt
>> > @@ -171,7 +171,10 @@ The barriers:
>> > smp_mb__{before,after}_atomic()
>> >
>> > only apply to the RMW ops and can be used to augment/upgrade the ordering
>> > -inherent to the used atomic op. These barriers provide a full smp_mb().
>> > +inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
>> > +only the RMW op itself against the instructions preceding the
>> > +smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
>> > +not order instructions on the other side of the RMW op at all.
>>
>> Now it is I who is confused; what?
>>
>> x = 1;
>> smp_mb__before_atomic();
>> atomic_add(1, &a);
>> y = 1;
>>
>> the stores to both x and y will be ordered as if an smp_mb() were
>> there. There is no order between a and y otoh.
>
> Let's look at x86. And a slightly different example:
>
> x = 1;
> smp_mb__before_atomic();
> atomic_add(1, &a);
> r1 = y;
>
> The atomic_add() asm does not have the "memory" constraint, which is
> completely legitimate because atomic_add() does not return a value,
> and thus guarantees no ordering. The compiler is therefore within
> its rights to transform the code into the following:
>
> x = 1;
> smp_mb__before_atomic();
> r1 = y;
> atomic_add(1, &a);
>
> But x86's smp_mb__before_atomic() is just a compiler barrier, and
> x86 is further allowed to reorder prior stores with later loads.
> The CPU can therefore execute this code as follows:
>
> r1 = y;
> x = 1;
> smp_mb__before_atomic();
> atomic_add(1, &a);
>
> So in general, the ordering is guaranteed only to the atomic itself,
> not to accesses on the other side of the atomic.

That's interesting. I don't think that's what all our code expects.
I had the same idea as Peter.

IIRC the primitive was originally introduced exactly so x86 would not
need to have the unnecessary hardware barrier with sequences like

smp_mb();
...
atomic_inc(&v);

The "new" semantics are a bit subtle. One option might be just to
replace it entirely with atomic_xxx_mb() ?
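
As a rough sketch, assuming such a helper would simply wrap the existing
primitives (atomic_inc_mb() here is just an example of the naming):

        /* atomic_inc() with full ordering on both sides; needs <linux/atomic.h>. */
        static inline void atomic_inc_mb(atomic_t *v)
        {
                smp_mb__before_atomic();
                atomic_inc(v);
                smp_mb__after_atomic();
        }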

Thanks,
Nick

2019-04-20 08:55:57

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Sat, Apr 20, 2019 at 10:26:15AM +1000, Nicholas Piggin wrote:
> Paul E. McKenney's on April 20, 2019 4:26 am:
> > On Fri, Apr 19, 2019 at 08:00:17PM +0200, Peter Zijlstra wrote:
> >> On Fri, Apr 19, 2019 at 01:21:45PM -0400, Alan Stern wrote:
> >> > Index: usb-devel/Documentation/atomic_t.txt
> >> > ===================================================================
> >> > --- usb-devel.orig/Documentation/atomic_t.txt
> >> > +++ usb-devel/Documentation/atomic_t.txt
> >> > @@ -171,7 +171,10 @@ The barriers:
> >> > smp_mb__{before,after}_atomic()
> >> >
> >> > only apply to the RMW ops and can be used to augment/upgrade the ordering
> >> > -inherent to the used atomic op. These barriers provide a full smp_mb().
> >> > +inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
> >> > +only the RMW op itself against the instructions preceding the
> >> > +smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
> >> > +not order instructions on the other side of the RMW op at all.
> >>
> >> Now it is I who is confused; what?
> >>
> >> x = 1;
> >> smp_mb__before_atomic();
> >> atomic_add(1, &a);
> >> y = 1;
> >>
> >> the stores to both x and y will be ordered as if an smp_mb() were
> >> there. There is no order between a and y otoh.
> >
> > Let's look at x86. And a slightly different example:
> >
> > x = 1;
> > smp_mb__before_atomic();
> > atomic_add(1, &a);
> > r1 = y;
> >
> > The atomic_add() asm does not have the "memory" constraint, which is
> > completely legitimate because atomic_add() does not return a value,
> > and thus guarantees no ordering. The compiler is therefore within
> > its rights to transform the code into the following:
> >
> > x = 1;
> > smp_mb__before_atomic();
> > r1 = y;
> > atomic_add(1, &a);
> >
> > But x86's smp_mb__before_atomic() is just a compiler barrier, and
> > x86 is further allowed to reorder prior stores with later loads.
> > The CPU can therefore execute this code as follows:
> >
> > r1 = y;
> > x = 1;
> > smp_mb__before_atomic();
> > atomic_add(1, &a);
> >
> > So in general, the ordering is guaranteed only to the atomic itself,
> > not to accesses on the other side of the atomic.
>
> That's interesting. I don't think that's what all our code expects.
> I had the same idea as Peter.
>
> IIRC the primitive was originally introduced exactly so x86 would not
> need to have the unnecessary hardware barrier with sequences like
>
> smp_mb();
> ...
> atomic_inc(&v);
>
> The "new" semantics are a bit subtle. One option might be just to
> replace it entirely with atomic_xxx_mb() ?

Hmmm... There are more than 2,000 uses of atomic_inc() in the kernel.
There are about 300-400 uses of smp_mb__before_atomic() and
smp_mb__after_atomic() combined.

So what are our options?

1. atomic_xxx_mb() as you say.

From a quick scan of smp_mb__before_atomic() uses, we need this
for atomic_inc(), atomic_dec(), atomic_add(), atomic_sub(),
clear_bit(), set_bit(), test_bit(), atomic_long_dec(),
atomic_long_add(), refcount_dec(), cmpxchg_relaxed(),
set_tsk_thread_flag(), clear_bit_unlock().

Another random look identifies atomic_andnot().

And atomic_set(): set_preempt_state(). This fails
on x86, s390, and TSO friends, does it not? Or is
this ARM-only? Still, why not just smp_mb() before and
after? Same issue in __kernfs_new_node(), bio_cnt_set(),
sbitmap_queue_update_wake_batch(),

Ditto for atomic64_set() in __ceph_dir_set_complete().

Ditto for atomic_read() in rvt_qp_is_avail(). This function
has a couple of other oddly placed smp_mb__before_atomic().

And atomic_cmpxchg(): msc_buffer_alloc(). This instance
of smp_mb__before_atomic() can be removed unless I am missing
something subtle. Ditto for kvm_vcpu_exiting_guest_mode(),
pv_kick_node(), __sbq_wake_up(),

And lock acquisition??? acm_read_bulk_callback().

In nfnl_acct_fill_info(), a smp_mb__before_atomic() after
an atomic64_xchg()??? Also before a clear_bit(), but the
clear_bit() is inside an "if".

There are a few cases that would see added overhead. For example,
svc_get_next_xprt() has the following:

smp_mb__before_atomic();
clear_bit(SP_CONGESTED, &pool->sp_flags);
clear_bit(RQ_BUSY, &rqstp->rq_flags);
smp_mb__after_atomic();

And xs_sock_reset_connection_flags() has this:

smp_mb__before_atomic();
clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
clear_bit(XPRT_CLOSING, &xprt->state);
xs_sock_reset_state_flags(xprt); /* Also a clear_bit(). */
smp_mb__after_atomic();

Yeah, there are more than a few misuses, aren't there? :-/
A coccinelle script seems in order, perhaps in the 0day test robot.

But there are a number of helper functions whose purpose
seems to be to wrap an atomic in smp_mb__before_atomic() and
smp_mb__after_atomic(), so some of the atomic_xxx_mb() functions
might be a good idea just for improved readability.

2. Add something to checkpatch.pl warning about non-total ordering,
with the error message explaining that the ordering is partial.

3. Make non-value-returning atomics provide full ordering.
This would of course need some benchmarking, but would be a
simple change to make and would eliminate a large class of
potential bugs. My guess is that the loss in performance
would be non-negligible, but who knows?

4. Your idea here!

Thanx, Paul

2019-04-23 12:19:04

by Peter Zijlstra

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:
> 3. Make non-value-returning atomics provide full ordering.
> This would of course need some benchmarking, but would be a
> simple change to make and would eliminate a large class of
> potential bugs. My guess is that the loss in performance
> would be non-negligible, but who knows?

Well, only for the architectures that have
smp_mb__{before,after}_atomic() as barrier(), which are: ia64, mips,
s390, sparc, x86 and xtensa.
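
On those architectures the helpers amount to no more than a compiler
barrier; modulo the exact spelling of the arch wrappers, roughly:

        #define smp_mb__before_atomic()         barrier()
        #define smp_mb__after_atomic()          barrier()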

$ ./compare.sh defconfig-build defconfig-build1 vmlinux
do_profile_hits 275 278 +3,+0
freezer_apply_state 86 98 +12,+0
perf_event_alloc 2232 2261 +29,+0
_free_event 631 660 +29,+0
shmem_add_to_page_cache 712 722 +10,+0
_enable_swap_info 333 337 +4,+0
do_mmu_notifier_register 303 311 +8,+0
__nfs_commit_inode 356 359 +3,+0
tcp_try_coalesce 246 250 +4,+0
i915_gem_free_object 90 97 +7,+0
mce_intel_hcpu_update 39 47 +8,+0
__ia32_sys_swapoff 1177 1181 +4,+0
pci_enable_ats 124 131 +7,+0
__x64_sys_swapoff 1178 1182 +4,+0
i915_gem_madvise_ioctl 447 443 -4,+0
calc_global_load_tick 75 82 +7,+0
i915_gem_object_set_tiling 712 708 -4,+0
total 11374236 11374367 +131,+0
Which doesn't look too bad.

---
diff --git a/arch/x86/include/asm/atomic.h b/arch/x86/include/asm/atomic.h
index ea3d95275b43..115127c7ad28 100644
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -54,7 +54,7 @@ static __always_inline void arch_atomic_add(int i, atomic_t *v)
{
asm volatile(LOCK_PREFIX "addl %1,%0"
: "+m" (v->counter)
- : "ir" (i));
+ : "ir" (i) : "memory");
}

/**
@@ -68,7 +68,7 @@ static __always_inline void arch_atomic_sub(int i, atomic_t *v)
{
asm volatile(LOCK_PREFIX "subl %1,%0"
: "+m" (v->counter)
- : "ir" (i));
+ : "ir" (i) : "memory");
}

/**
@@ -95,7 +95,7 @@ static __always_inline bool arch_atomic_sub_and_test(int i, atomic_t *v)
static __always_inline void arch_atomic_inc(atomic_t *v)
{
asm volatile(LOCK_PREFIX "incl %0"
- : "+m" (v->counter));
+ : "+m" (v->counter) :: "memory");
}
#define arch_atomic_inc arch_atomic_inc

@@ -108,7 +108,7 @@ static __always_inline void arch_atomic_inc(atomic_t *v)
static __always_inline void arch_atomic_dec(atomic_t *v)
{
asm volatile(LOCK_PREFIX "decl %0"
- : "+m" (v->counter));
+ : "+m" (v->counter) :: "memory");
}
#define arch_atomic_dec arch_atomic_dec

diff --git a/arch/x86/include/asm/atomic64_64.h b/arch/x86/include/asm/atomic64_64.h
index dadc20adba21..5e86c0d68ac1 100644
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -45,7 +45,7 @@ static __always_inline void arch_atomic64_add(long i, atomic64_t *v)
{
asm volatile(LOCK_PREFIX "addq %1,%0"
: "=m" (v->counter)
- : "er" (i), "m" (v->counter));
+ : "er" (i), "m" (v->counter) : "memory");
}

/**
@@ -59,7 +59,7 @@ static inline void arch_atomic64_sub(long i, atomic64_t *v)
{
asm volatile(LOCK_PREFIX "subq %1,%0"
: "=m" (v->counter)
- : "er" (i), "m" (v->counter));
+ : "er" (i), "m" (v->counter) : "memory");
}

/**
@@ -87,7 +87,7 @@ static __always_inline void arch_atomic64_inc(atomic64_t *v)
{
asm volatile(LOCK_PREFIX "incq %0"
: "=m" (v->counter)
- : "m" (v->counter));
+ : "m" (v->counter) : "memory");
}
#define arch_atomic64_inc arch_atomic64_inc

@@ -101,7 +101,7 @@ static __always_inline void arch_atomic64_dec(atomic64_t *v)
{
asm volatile(LOCK_PREFIX "decq %0"
: "=m" (v->counter)
- : "m" (v->counter));
+ : "m" (v->counter) : "memory");
}
#define arch_atomic64_dec arch_atomic64_dec

2019-04-23 12:35:03

by Peter Zijlstra

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:
> And atomic_set(): set_preempt_state(). This fails
> on x86, s390, and TSO friends, does it not? Or is
> this ARM-only? Still, why not just smp_mb() before and
> after? Same issue in __kernfs_new_node(), bio_cnt_set(),
> sbitmap_queue_update_wake_batch(),
>
> Ditto for atomic64_set() in __ceph_dir_set_complete().
>
> Ditto for atomic_read() in rvt_qp_is_avail(). This function
> has a couple of other oddly placed smp_mb__before_atomic().

Those are just straight up bugs. The atomic_t.txt file clearly specifies
the barriers only apply to RmW ops and both _set() and _read() are
specified to not be a RmW.

> And atomic_cmpxchg(): msc_buffer_alloc(). This instance
> of smp_mb__before_atomic() can be removed unless I am missing
> something subtle. Ditto for kvm_vcpu_exiting_guest_mode(),
> pv_kick_node(), __sbq_wake_up(),

Note that pv_kick_node() uses cmpxchg_relaxed(), which does not
otherwise imply barriers.

> And lock acquisition??? acm_read_bulk_callback().

I think it goes with the set_bit() earlier, but what do I know.

> In nfnl_acct_fill_info(), a smp_mb__before_atomic() after
> an atomic64_xchg()??? Also before a clear_bit(), but the
> clear_bit() is inside an "if".

Since it is _before, I'm thinking the pairing was intended with the
clear_bit(), and yes, then I would expect the smp_mb__before_atomic() to
be part of that same branch.

> There are a few cases that would see added overhead. For example,
> svc_get_next_xprt() has the following:
>
> smp_mb__before_atomic();
> clear_bit(SP_CONGESTED, &pool->sp_flags);
> clear_bit(RQ_BUSY, &rqstp->rq_flags);
> smp_mb__after_atomic();
>
> And xs_sock_reset_connection_flags() has this:
>
> smp_mb__before_atomic();
> clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
> clear_bit(XPRT_CLOSING, &xprt->state);
> xs_sock_reset_state_flags(xprt); /* Also a clear_bit(). */
> smp_mb__after_atomic();
>
> Yeah, there are more than a few misuses, aren't there? :-/
> A coccinelle script seems in order, perhaps in the 0day test robot.

If we can get it to flag the right patterns, then yes that might be
useful regardless of the issue at hand, people seem to get this one
wrong a lot.

> But there are a number of helper functions whose purpose
> seems to be to wrap an atomic in smp_mb__before_atomic() and
> smp_mb__after_atomic(), so some of the atomic_xxx_mb() functions
> might be a good idea just for improved readability.

Are there really sites where _mb() makes sense? The above is just a lot
of buggy code.

2019-04-23 13:23:13

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 02:17:15PM +0200, Peter Zijlstra wrote:
> On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:
> > 3. Make non-value-returning atomics provide full ordering.
> > This would of course need some benchmarking, but would be a
> > simple change to make and would eliminate a large class of
> > potential bugs. My guess is that the loss in performance
> > would be non-negligible, but who knows?
>
> Well, only for the architectures that have
> smp_mb__{before,after}_atomic() as barrier(), which are: ia64, mips,
> s390, sparc, x86 and xtensa.

The weakly ordered architectures would need to add the equivalent of
smp_mb() before and after, right? This might result in a more noticeable
loss of performance.

Thanx, Paul

> $ ./compare.sh defconfig-build defconfig-build1 vmlinux
> do_profile_hits 275 278 +3,+0
> freezer_apply_state 86 98 +12,+0
> perf_event_alloc 2232 2261 +29,+0
> _free_event 631 660 +29,+0
> shmem_add_to_page_cache 712 722 +10,+0
> _enable_swap_info 333 337 +4,+0
> do_mmu_notifier_register 303 311 +8,+0
> __nfs_commit_inode 356 359 +3,+0
> tcp_try_coalesce 246 250 +4,+0
> i915_gem_free_object 90 97 +7,+0
> mce_intel_hcpu_update 39 47 +8,+0
> __ia32_sys_swapoff 1177 1181 +4,+0
> pci_enable_ats 124 131 +7,+0
> __x64_sys_swapoff 1178 1182 +4,+0
> i915_gem_madvise_ioctl 447 443 -4,+0
> calc_global_load_tick 75 82 +7,+0
> i915_gem_object_set_tiling 712 708 -4,+0
> total 11374236 11374367 +131,+0
> Which doesn't look too bad.
>
> ---
> diff --git a/arch/x86/include/asm/atomic.h b/arch/x86/include/asm/atomic.h
> index ea3d95275b43..115127c7ad28 100644
> --- a/arch/x86/include/asm/atomic.h
> +++ b/arch/x86/include/asm/atomic.h
> @@ -54,7 +54,7 @@ static __always_inline void arch_atomic_add(int i, atomic_t *v)
> {
> asm volatile(LOCK_PREFIX "addl %1,%0"
> : "+m" (v->counter)
> - : "ir" (i));
> + : "ir" (i) : "memory");
> }
>
> /**
> @@ -68,7 +68,7 @@ static __always_inline void arch_atomic_sub(int i, atomic_t *v)
> {
> asm volatile(LOCK_PREFIX "subl %1,%0"
> : "+m" (v->counter)
> - : "ir" (i));
> + : "ir" (i) : "memory");
> }
>
> /**
> @@ -95,7 +95,7 @@ static __always_inline bool arch_atomic_sub_and_test(int i, atomic_t *v)
> static __always_inline void arch_atomic_inc(atomic_t *v)
> {
> asm volatile(LOCK_PREFIX "incl %0"
> - : "+m" (v->counter));
> + : "+m" (v->counter) :: "memory");
> }
> #define arch_atomic_inc arch_atomic_inc
>
> @@ -108,7 +108,7 @@ static __always_inline void arch_atomic_inc(atomic_t *v)
> static __always_inline void arch_atomic_dec(atomic_t *v)
> {
> asm volatile(LOCK_PREFIX "decl %0"
> - : "+m" (v->counter));
> + : "+m" (v->counter) :: "memory");
> }
> #define arch_atomic_dec arch_atomic_dec
>
> diff --git a/arch/x86/include/asm/atomic64_64.h b/arch/x86/include/asm/atomic64_64.h
> index dadc20adba21..5e86c0d68ac1 100644
> --- a/arch/x86/include/asm/atomic64_64.h
> +++ b/arch/x86/include/asm/atomic64_64.h
> @@ -45,7 +45,7 @@ static __always_inline void arch_atomic64_add(long i, atomic64_t *v)
> {
> asm volatile(LOCK_PREFIX "addq %1,%0"
> : "=m" (v->counter)
> - : "er" (i), "m" (v->counter));
> + : "er" (i), "m" (v->counter) : "memory");
> }
>
> /**
> @@ -59,7 +59,7 @@ static inline void arch_atomic64_sub(long i, atomic64_t *v)
> {
> asm volatile(LOCK_PREFIX "subq %1,%0"
> : "=m" (v->counter)
> - : "er" (i), "m" (v->counter));
> + : "er" (i), "m" (v->counter) : "memory");
> }
>
> /**
> @@ -87,7 +87,7 @@ static __always_inline void arch_atomic64_inc(atomic64_t *v)
> {
> asm volatile(LOCK_PREFIX "incq %0"
> : "=m" (v->counter)
> - : "m" (v->counter));
> + : "m" (v->counter) : "memory");
> }
> #define arch_atomic64_inc arch_atomic64_inc
>
> @@ -101,7 +101,7 @@ static __always_inline void arch_atomic64_dec(atomic64_t *v)
> {
> asm volatile(LOCK_PREFIX "decq %0"
> : "=m" (v->counter)
> - : "m" (v->counter));
> + : "m" (v->counter) : "memory");
> }
> #define arch_atomic64_dec arch_atomic64_dec
>
>

2019-04-23 13:28:32

by Peter Zijlstra

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 06:21:16AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 23, 2019 at 02:17:15PM +0200, Peter Zijlstra wrote:
> > On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:
> > > 3. Make non-value-returning atomics provide full ordering.
> > > This would of course need some benchmarking, but would be a
> > > simple change to make and would eliminate a large class of
> > > potential bugs. My guess is that the loss in performance
> > > would be non-negligible, but who knows?
> >
> > Well, only for the architectures that have
> > smp_mb__{before,after}_atomic() as barrier(), which are: ia64, mips,
> > s390, sparc, x86 and xtensa.
>
> The weakly ordered architectures would need to add the equivalent of
> smp_mb() before and after, right? This might result in a more noticeable
> loss of performance.

The weak archs already have: smp_mb__{before,after}_atomic() :=
smp_mb().


2019-04-23 13:31:28

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 02:32:09PM +0200, Peter Zijlstra wrote:
> On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:
> > And atomic_set(): set_preempt_state(). This fails
> > on x86, s390, and TSO friends, does it not? Or is
> > this ARM-only? Still, why not just smp_mb() before and
> > after? Same issue in __kernfs_new_node(), bio_cnt_set(),
> > sbitmap_queue_update_wake_batch(),
> >
> > Ditto for atomic64_set() in __ceph_dir_set_complete().
> >
> > Ditto for atomic_read() in rvt_qp_is_avail(). This function
> > has a couple of other oddly placed smp_mb__before_atomic().
>
> Those are just straight up bugs. The atomic_t.txt file clearly specifies
> the barriers only apply to RmW ops and both _set() and _read() are
> specified to not be a RmW.

Agreed. The "Ditto" covers my atomic_set() consternation. ;-)

> > And atomic_cmpxchg(): msc_buffer_alloc(). This instance
> > of smp_mb__before_atomic() can be removed unless I am missing
> > something subtle. Ditto for kvm_vcpu_exiting_guest_mode(),
> > pv_kick_node(), __sbq_wake_up(),
>
> Note that pv_kick_node() uses cmpxchg_relaxed(), which does not
> otherwise imply barriers.

Good point, my eyes must have been going funny.

> > And lock acquisition??? acm_read_bulk_callback().
>
> I think it goes with the set_bit() earlier, but what do I know.

Quite possibly! In that case it should be smp_mb__after_atomic(),
and it would be nice if it immediately followed the set_bit().

> > In nfnl_acct_fill_info(), a smp_mb__before_atomic() after
> > an atomic64_xchg()??? Also before a clear_bit(), but the
> > clear_bit() is inside an "if".
>
> Since it is _before, I'm thinking the pairing was intended with the
> clear_bit(), and yes, then I would expect the smp_mb__before_atomic() to
> be part of that same branch.

It is quite possible that this one is a leftover, where the atomic
operation was removed but the smp_mb__{before,after}_atomic() lived on.
I had one of those in RCU, which now has a patch in -rcu.

> > There are a few cases that would see added overhead. For example,
> > svc_get_next_xprt() has the following:
> >
> > smp_mb__before_atomic();
> > clear_bit(SP_CONGESTED, &pool->sp_flags);
> > clear_bit(RQ_BUSY, &rqstp->rq_flags);
> > smp_mb__after_atomic();
> >
> > And xs_sock_reset_connection_flags() has this:
> >
> > smp_mb__before_atomic();
> > clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
> > clear_bit(XPRT_CLOSING, &xprt->state);
> > xs_sock_reset_state_flags(xprt); /* Also a clear_bit(). */
> > smp_mb__after_atomic();
> >
> > Yeah, there are more than a few misuses, aren't there? :-/
> > A coccinelle script seems in order, perhaps in the 0day test robot.
>
> If we can get it to flag the right patterns, then yes that might be
> useful regardless of the issue at hand, people seem to get this one
> wrong a lot.

To be fair, the odd-looking ones are maybe 5% of the total. Still too
many wrong, but the vast majority look OK.

> > But there are a number of helper functions whose purpose
> > seems to be to wrap an atomic in smp_mb__before_atomic() and
> > smp_mb__after_atomic(), so some of the atomic_xxx_mb() functions
> > might be a good idea just for improved readability.
>
> Are there really sites where _mb() makes sense? The above is just a lot
> of buggy code.

There are a great many that look like this:

smp_mb__before_atomic();
clear_bit(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
smp_mb__after_atomic();

Replacing these three lines with this would not be a bad thing:

clear_bit_mb(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
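
A minimal sketch of what such a helper might look like, assuming it
simply wraps the existing primitives:

        /* clear_bit() with full ordering on both sides; needs <linux/bitops.h>. */
        static inline void clear_bit_mb(long nr, volatile unsigned long *addr)
        {
                smp_mb__before_atomic();
                clear_bit(nr, addr);
                smp_mb__after_atomic();
        }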

Thanx, Paul

2019-04-23 13:41:29

by Peter Zijlstra

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 06:30:10AM -0700, Paul E. McKenney wrote:
> There are a great many that look like this:
>
> smp_mb__before_atomic();
> clear_bit(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
> smp_mb__after_atomic();

Ooh, marvel at the comment describing the ordering there. Oh wait :-(
So much for checkpatch.pl I suppose.

I think the first is a release order for the 'LOCK' bit and the second
is because of wake_up_bit() being a shitty interface.

So maybe that should've been:

clear_bit_unlock(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
smp_mb__after_atomic();
wake_up_bit(&clp->cl_flags, NFSD4_CLIENT_UPCALL_LOCK);

instead?

2019-04-23 20:18:05

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 03:26:20PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 06:21:16AM -0700, Paul E. McKenney wrote:
> > On Tue, Apr 23, 2019 at 02:17:15PM +0200, Peter Zijlstra wrote:
> > > On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:
> > > > 3. Make non-value-returning atomics provide full ordering.
> > > > This would of course need some benchmarking, but would be a
> > > > simple change to make and would eliminate a large class of
> > > > potential bugs. My guess is that the loss in performance
> > > > would be non-negligible, but who knows?
> > >
> > > Well, only for the architectures that have
> > > smp_mb__{before,after}_atomic() as barrier(), which are: ia64, mips,
> > > s390, sparc, x86 and xtensa.
> >
> > The weakly ordered architectures would need to add the equivalent of
> > smp_mb() before and after, right? This might result in a more noticeable
> > loss of performance.
>
> The weak archs already have: smp_mb__{before,after}_atomic() :=
> smp_mb().

Agreed, but I thought that one of the ideas going forward was to get
rid of smp_mb__{before,after}_atomic().

Thanx, Paul

2019-04-23 20:22:35

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 03:40:03PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 06:30:10AM -0700, Paul E. McKenney wrote:
> > There are a great many that look like this:
> >
> > smp_mb__before_atomic();
> > clear_bit(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
> > smp_mb__after_atomic();
>
> Ooh, marvel at the comment describing the ordering there. Oh wait :-(
> So much for checkpatch.pl I suppose.

Especially if the code was in the kernel before checkpatch.pl started
asking for comments. Which might or might not be the case with this
code. No idea either way.

> I think the first is a release order for the 'LOCK' bit and the second
> is because of wake_up_bit() being a shitty interface.
>
> So maybe that should've been:
>
> clear_bit_unlock(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
> smp_mb__after_atomic();
> wake_up_bit(&clp->cl_flags, NFSD4_CLIENT_UPCALL_LOCK);
>
> instead?

Quite possibly, but my brain is too fried to be sure.

Thanx, Paul

2019-04-23 20:30:29

by Peter Zijlstra

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 01:16:37PM -0700, Paul E. McKenney wrote:

> Agreed, but I thought that one of the ideas going forward was to get
> rid of smp_mb__{before,after}_atomic().

It's not one I had considered.. I just wanted to get rid of this
'surprise' behaviour.

2019-04-24 08:32:46

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 10:28:31PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 01:16:37PM -0700, Paul E. McKenney wrote:
>
> > Agreed, but I thought that one of the ideas going forward was to get
> > rid of smp_mb__{before,after}_atomic().
>
> It's not one I had considered.. I just wanted to get rid of this
> 'surprise' behaviour.

Ah, good point, your patch is in fact a midpoint between those two
positions. Just to make sure I understand:

1. Without your patch, smp_mb__{before,after}_atomic() orders
only against the atomic itself.

2. With your patch, smp_mb__{before,after}_atomic() orders against
the atomic itself and the accesses on the other side of that
atomic. However, it does not order the atomic against the
accesses on the other side of that atomic (restated concretely below).

Putting things between the smp_mb__{before,after}_atomic()
and the atomic is in my opinion a bad idea, but in this case
they are not necessarily ordered.

3. Dispensing with smp_mb__{before,after}_atomic() would have
void RMW atomics fully ordered, but I suspect that it results
in ugly performance regressions.
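
Concretely, restating point 2 with the earlier example:

        x = 1;
        smp_mb__before_atomic();
        atomic_add(1, &a);
        r1 = y;

With your patch, the store to x is guaranteed to be ordered before the
load from y, as with a full smp_mb(), but the atomic_add() itself is
still only guaranteed to be ordered after the store to x, not before
the load from y.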

Or am I still missing something?

Thanx, Paul

2019-04-24 08:46:18

by Peter Zijlstra

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Wed, Apr 24, 2019 at 01:29:48AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 23, 2019 at 10:28:31PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 23, 2019 at 01:16:37PM -0700, Paul E. McKenney wrote:
> >
> > > Agreed, but I thought that one of the ideas going forward was to get
> > > rid of smp_mb__{before,after}_atomic().
> >
> > It's not one I had considered.. I just wanted to get rid of this
> > 'surprise' behaviour.
>
> Ah, good point, your patch is in fact a midpoint between those two
> positions. Just to make sure I understand:
>
> 1. Without your patch, smp_mb__{before,after}_atomic() orders
> only against the atomic itself.

Right, and that was not intentional.

> 2. With your patch, smp_mb__{before,after}_atomic() orders against
> the atomic itself and the accesses on the other side of that
> atomic. However, it does not order the atomic against the
> accesses on the other side of that atomic.

Right. I'll go make a more complete patch, covering all the
architectures.

> Putting things between the smp_mb__{before,after}_atomic()
> and the atomic is in my opinion a bad idea, but in this case
> they are not necessarily ordered.

Agreed, that is an unsupported idiom and it would be good to have
something check for this.

> 3. Dispensing with smp_mb__{before,after}_atomic() would have
> void RMW atomics fully ordered, but I suspect that it results
> in ugly performance regressions.
>
> Or am I still missing something?

I think we're good :-)

2019-04-27 08:18:53

by Andrea Parri

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 06:30:10AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 23, 2019 at 02:32:09PM +0200, Peter Zijlstra wrote:
> > On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:
> > > And atomic_set(): set_preempt_state(). This fails
> > > on x86, s390, and TSO friends, does it not? Or is
> > > this ARM-only? Still, why not just smp_mb() before and
> > > after? Same issue in __kernfs_new_node(), bio_cnt_set(),
> > > sbitmap_queue_update_wake_batch(),
> > >
> > > Ditto for atomic64_set() in __ceph_dir_set_complete().
> > >
> > > Ditto for atomic_read() in rvt_qp_is_avail(). This function
> > > has a couple of other oddly placed smp_mb__before_atomic().
> >
> > Those are just straight up bugs. The atomic_t.txt file clearly specifies
> > the barriers only apply to RmW ops and both _set() and _read() are
> > specified to not be a RmW.
>
> Agreed. The "Ditto" covers my atomic_set() consternation. ;-)

I was working on some of these before the Easter break [1, 2]: the plan
was to continue next week, addressing the remaining cases with a
conservative s/that barrier/smp_mb/ at first; unless you've other plans?

Andrea

[1] http://lkml.kernel.org/r/1555417031-27356-1-git-send-email-andrea.parri@amarulasolutions.com
[2] http://lkml.kernel.org/r/[email protected]

2019-04-27 08:41:47

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Sat, Apr 27, 2019 at 10:17:38AM +0200, Andrea Parri wrote:
> On Tue, Apr 23, 2019 at 06:30:10AM -0700, Paul E. McKenney wrote:
> > On Tue, Apr 23, 2019 at 02:32:09PM +0200, Peter Zijlstra wrote:
> > > On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:
> > > > And atomic_set(): set_preempt_state(). This fails
> > > > on x86, s390, and TSO friends, does it not? Or is
> > > > this ARM-only? Still, why not just smp_mb() before and
> > > > after? Same issue in __kernfs_new_node(), bio_cnt_set(),
> > > > sbitmap_queue_update_wake_batch(),
> > > >
> > > > Ditto for atomic64_set() in __ceph_dir_set_complete().
> > > >
> > > > Ditto for atomic_read() in rvt_qp_is_avail(). This function
> > > > has a couple of other oddly placed smp_mb__before_atomic().
> > >
> > > Those are just straight up bugs. The atomic_t.txt file clearly specifies
> > > the barriers only apply to RmW ops and both _set() and _read() are
> > > specified to not be a RmW.
> >
> > Agreed. The "Ditto" covers my atomic_set() consternation. ;-)
>
> I was working on some of these before the Easter break [1, 2]: the plan
> was to continue next week, addressing the remaining cases with a
> conservative s/that barrier/smp_mb/ at first; unless you've other plans?
>
> Andrea
>
> [1] http://lkml.kernel.org/r/1555417031-27356-1-git-send-email-andrea.parri@amarulasolutions.com
> [2] http://lkml.kernel.org/r/[email protected]

Sounds good to me! ;-)

Thanx, Paul

2019-04-29 09:25:42

by Johan Hovold

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Tue, Apr 23, 2019 at 06:30:10AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 23, 2019 at 02:32:09PM +0200, Peter Zijlstra wrote:
> > On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:

> > > And lock acquisition??? acm_read_bulk_callback().
> >
> > I think it goes with the set_bit() earlier, but what do I know.
>
> Quite possibly! In that case it should be smp_mb__after_atomic(),
> and it would be nice if it immediately followed the set_bit().

I noticed this one last week as well. The set_bit() had been incorrectly
moved without noticing the smp_mb__before_atomic(). I've submitted a
patch to restore it and to fix a related issue due to missing barriers:

https://lkml.kernel.org/r/[email protected]

Johan

2019-04-29 14:54:06

by Paul E. McKenney

Subject: Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

On Mon, Apr 29, 2019 at 11:24:30AM +0200, Johan Hovold wrote:
> On Tue, Apr 23, 2019 at 06:30:10AM -0700, Paul E. McKenney wrote:
> > On Tue, Apr 23, 2019 at 02:32:09PM +0200, Peter Zijlstra wrote:
> > > On Sat, Apr 20, 2019 at 01:54:40AM -0700, Paul E. McKenney wrote:
>
> > > > And lock acquisition??? acm_read_bulk_callback().
> > >
> > > I think it goes with the set_bit() earlier, but what do I know.
> >
> > Quite possibly! In that case it should be smp_mb__after_atomic(),
> > and it would be nice if it immediately followed the set_bit().
>
> I noticed this one last week as well. The set_bit() had been incorrectly
> moved without noticing the smp_mb__before_atomic(). I've submitted a
> patch to restore it and to fix a related issue due to missing barriers:
>
> https://lkml.kernel.org/r/[email protected]

Good to know, thank you!

Thanx, Paul