2017-03-24 12:44:32

by Dmitry Vyukov

Subject: re: locking/atomic: Introduce atomic_try_cmpxchg()

Hi,

I've come across:

commit a9ebf306f52c756c4f9e50ee9a60cd6389d71344
Author: Peter Zijlstra
Date: Wed Feb 1 16:39:38 2017 +0100
locking/atomic: Introduce atomic_try_cmpxchg()

The primitive has a subtle difference from all other implementations that
I know of, and it can lead to very subtle bugs. Some time ago I spent
several days debugging a memory corruption caused by a similar
implementation. Consider a classical lock-free stack push:

node->next = atomic_read(&head);
do {
} while (!atomic_try_cmpxchg(&head, &node->next, node));

This code is broken with the current implementation, the problem is
with unconditional update of *__po here:

#define __atomic64_try_cmpxchg(type, _p, _po, _n) \
({ \
        typeof(_po) __po = (_po); \
        typeof(*(_po)) __o = *__po; \
        *__po = atomic64_cmpxchg##type((_p), __o, (_n)); \
        (*__po == __o); \
})

In case of success it writes the same value back into *__po, but by the
time the cmpxchg has succeeded we may already have lost ownership of some
memory locations, potentially including the one __po points to, so even
that write-back is a store to freed memory. The same holds for the
re-read of *__po.
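
To spell out the interleaving (the pop and kfree() on the other CPU are
hypothetical, they are only there to show where ownership is lost):

/*
 * With the current macro, each atomic_try_cmpxchg() in the loop above
 * does, in effect:
 *
 *      old = node->next;                       // __o  = *__po
 *      ret = cmpxchg(&head, old, node);
 *         <- once this succeeds, another CPU can pop the node off the
 *            stack and kfree() it in this window ...
 *      node->next = ret;                       // *__po = ret
 *         <- ... so the unconditional write-back stores into freed
 *            memory, and the final comparison re-reads it:
 *      return (node->next == old);
 */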
The problem is very easy to overlook in user code (not to mention the
case where the developer is already familiar with the different semantics
of similar functions, e.g. C/C++ atomic operations or the gcc/clang
atomic builtins). For this reason cmpxchg implementations usually update
the comparand only in case of failure. So I would suggest changing it to
a safer and less surprising alternative:

diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index fb961db51a2a..81fb985f51f4 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
default: \
__cmpxchg_wrong_size(); \
} \
- *_old = __old; \
+ if (!success) \
+ *_old = __old; \
success; \
})

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index aae5953817d6..f8098157f7c8 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -1023,8 +1023,11 @@ static inline int atomic_dec_if_positive(atomic_t *v)
({ \
typeof(_po) __po = (_po); \
typeof(*(_po)) __o = *__po; \
- *__po = atomic64_cmpxchg##type((_p), __o, (_n)); \
- (*__po == __o); \
+ typeof(*(_po)) __v = atomic64_cmpxchg##type((_p), __o, (_n)); \
+ if (__v == __o) \
+ return true; \
+ *__po = __v; \
+ return false; \
})
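
For comparison, this is roughly what the conventional update-only-on-failure
semantics look like for the same push, as a userspace sketch with the gcc
builtin (not a kernel patch, and struct node is made up for the example):

struct node { struct node *next; };

/* &node->next is only written back on failure, i.e. while we still own
 * the node; on success it is left untouched. */
static void push(struct node **head, struct node *node)
{
        node->next = __atomic_load_n(head, __ATOMIC_RELAXED);
        while (!__atomic_compare_exchange_n(head, &node->next, node,
                                            0 /* strong */,
                                            __ATOMIC_RELEASE,
                                            __ATOMIC_RELAXED))
                ;       /* on failure the builtin already reloaded node->next */
}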


2017-03-24 14:22:01

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
>
> The primitive has subtle difference with all other implementation that
> I know of, and can lead to very subtle bugs. Some time ago I've spent
> several days debugging a memory corruption caused by similar
> implementation. Consider a classical lock-free stack push:
>
> node->next = atomic_read(&head);
> do {
> } while (!atomic_try_cmpxchg(&head, &node->next, node));
>
> This code is broken with the current implementation, the problem is
> with unconditional update of *__po here:

Indeed. I had only considered stack local variables when I wrote that.

> So I would suggest to change it to a safer and less surprising
> alternative:
>
> diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
> index fb961db51a2a..81fb985f51f4 100644
> --- a/arch/x86/include/asm/cmpxchg.h
> +++ b/arch/x86/include/asm/cmpxchg.h
> @@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
> default: \
> __cmpxchg_wrong_size(); \
> } \
> - *_old = __old; \
> + if (!success) \
> + *_old = __old; \
> success; \
> })

I've no immediate objection, I'll double check what, if anything, it
does for code gen.

> diff --git a/include/linux/atomic.h b/include/linux/atomic.h
> index aae5953817d6..f8098157f7c8 100644
> --- a/include/linux/atomic.h
> +++ b/include/linux/atomic.h
> @@ -1023,8 +1023,11 @@ static inline int atomic_dec_if_positive(atomic_t *v)
> ({ \
> typeof(_po) __po = (_po); \
> typeof(*(_po)) __o = *__po; \
> - *__po = atomic64_cmpxchg##type((_p), __o, (_n)); \
> - (*__po == __o); \
> + typeof(*(_po)) __v = atomic64_cmpxchg##type((_p), __o, (_n)); \
> + if (__v == __o) \
> + return true; \
> + *__po = __v; \
> + return false; \
> })

Can you actually use return in statement-expressions?


2017-03-24 14:24:29

by Dmitry Vyukov

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 3:21 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
>>
>> The primitive has subtle difference with all other implementation that
>> I know of, and can lead to very subtle bugs. Some time ago I've spent
>> several days debugging a memory corruption caused by similar
>> implementation. Consider a classical lock-free stack push:
>>
>> node->next = atomic_read(&head);
>> do {
>> } while (!atomic_try_cmpxchg(&head, &node->next, node));
>>
>> This code is broken with the current implementation, the problem is
>> with unconditional update of *__po here:
>
> Indeed. I had only considered stack local variables when I wrote that.
>
>> So I would suggest to change it to a safer and less surprising
>> alternative:
>>
>> diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
>> index fb961db51a2a..81fb985f51f4 100644
>> --- a/arch/x86/include/asm/cmpxchg.h
>> +++ b/arch/x86/include/asm/cmpxchg.h
>> @@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
>> default: \
>> __cmpxchg_wrong_size(); \
>> } \
>> - *_old = __old; \
>> + if (!success) \
>> + *_old = __old; \
>> success; \
>> })
>
> I've no immediate objection, I'll double check what, if anything, it
> does for code gen.
>
>> diff --git a/include/linux/atomic.h b/include/linux/atomic.h
>> index aae5953817d6..f8098157f7c8 100644
>> --- a/include/linux/atomic.h
>> +++ b/include/linux/atomic.h
>> @@ -1023,8 +1023,11 @@ static inline int atomic_dec_if_positive(atomic_t *v)
>> ({ \
>> typeof(_po) __po = (_po); \
>> typeof(*(_po)) __o = *__po; \
>> - *__po = atomic64_cmpxchg##type((_p), __o, (_n)); \
>> - (*__po == __o); \
>> + typeof(*(_po)) __v = atomic64_cmpxchg##type((_p), __o, (_n)); \
>> + if (__v == __o) \
>> + return true; \
>> + *__po = __v; \
>> + return false; \
>> })
>
> Can you actually use return in statement-expressions?

Dunno. Just wanted to show the idea.

2017-03-24 16:41:27

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 03:21:40PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:

> > So I would suggest to change it to a safer and less surprising
> > alternative:
> >
> > diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
> > index fb961db51a2a..81fb985f51f4 100644
> > --- a/arch/x86/include/asm/cmpxchg.h
> > +++ b/arch/x86/include/asm/cmpxchg.h
> > @@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
> > default: \
> > __cmpxchg_wrong_size(); \
> > } \
> > - *_old = __old; \
> > + if (!success) \
> > + *_old = __old; \
> > success; \
> > })
>
> I've no immediate objection, I'll double check what, if anything, it
> does for code gen.

So the first snippet I tested regressed like so:


0000000000000000 <T_refcount_inc>: 0000000000000000 <T_refcount_inc>:
0: 8b 07 mov (%rdi),%eax 0: 8b 17 mov (%rdi),%edx
2: 83 f8 ff cmp $0xffffffff,%eax 2: 83 fa ff cmp $0xffffffff,%edx
5: 74 13 je 1a <T_refcount_inc+0x1a> 5: 74 1a je 21 <T_refcount_inc+0x21>
7: 85 c0 test %eax,%eax 7: 85 d2 test %edx,%edx
9: 74 0d je 18 <T_refcount_inc+0x18> 9: 74 13 je 1e <T_refcount_inc+0x1e>
b: 8d 50 01 lea 0x1(%rax),%edx b: 8d 4a 01 lea 0x1(%rdx),%ecx
e: f0 0f b1 17 lock cmpxchg %edx,(%rdi) e: 89 d0 mov %edx,%eax
12: 75 ee jne 2 <T_refcount_inc+0x2> 10: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
14: ff c2 inc %edx 14: 74 04 je 1a <T_refcount_inc+0x1a>
16: 75 02 jne 1a <T_refcount_inc+0x1a> 16: 89 c2 mov %eax,%edx
18: 0f 0b ud2 18: eb e8 jmp 2 <T_refcount_inc+0x2>
1a: c3 retq 1a: ff c1 inc %ecx
1c: 75 03 jne 21 <T_refcount_inc+0x21>
1e: 0f 0b ud2
20: c3 retq
21: c3 retq

Which is rather unfortunate...

2017-03-24 16:55:17

by Andy Lutomirski

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 9:41 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 03:21:40PM +0100, Peter Zijlstra wrote:
>> On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
>
>> > So I would suggest to change it to a safer and less surprising
>> > alternative:
>> >
>> > diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
>> > index fb961db51a2a..81fb985f51f4 100644
>> > --- a/arch/x86/include/asm/cmpxchg.h
>> > +++ b/arch/x86/include/asm/cmpxchg.h
>> > @@ -212,7 +212,8 @@ extern void __add_wrong_size(void)
>> > default: \
>> > __cmpxchg_wrong_size(); \
>> > } \
>> > - *_old = __old; \
>> > + if (!success) \
>> > + *_old = __old; \
>> > success; \
>> > })
>>
>> I've no immediate objection, I'll double check what, if anything, it
>> does for code gen.
>
> So the first snipped I tested regressed like so:
>
>
> 0000000000000000 <T_refcount_inc>: 0000000000000000 <T_refcount_inc>:
> 0: 8b 07 mov (%rdi),%eax 0: 8b 17 mov (%rdi),%edx
> 2: 83 f8 ff cmp $0xffffffff,%eax 2: 83 fa ff cmp $0xffffffff,%edx
> 5: 74 13 je 1a <T_refcount_inc+0x1a> 5: 74 1a je 21 <T_refcount_inc+0x21>
> 7: 85 c0 test %eax,%eax 7: 85 d2 test %edx,%edx
> 9: 74 0d je 18 <T_refcount_inc+0x18> 9: 74 13 je 1e <T_refcount_inc+0x1e>
> b: 8d 50 01 lea 0x1(%rax),%edx b: 8d 4a 01 lea 0x1(%rdx),%ecx
> e: f0 0f b1 17 lock cmpxchg %edx,(%rdi) e: 89 d0 mov %edx,%eax
> 12: 75 ee jne 2 <T_refcount_inc+0x2> 10: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
> 14: ff c2 inc %edx 14: 74 04 je 1a <T_refcount_inc+0x1a>
> 16: 75 02 jne 1a <T_refcount_inc+0x1a> 16: 89 c2 mov %eax,%edx
> 18: 0f 0b ud2 18: eb e8 jmp 2 <T_refcount_inc+0x2>
> 1a: c3 retq 1a: ff c1 inc %ecx
> 1c: 75 03 jne 21 <T_refcount_inc+0x21>
> 1e: 0f 0b ud2
> 20: c3 retq
> 21: c3 retq

Can you re-send the better asm you got earlier?

If I pretend to be a dumb compiler, I wonder if you'd get better results with:

if (!success) {
*_old = __old;
return false;
} else {
return true;
}

or however you jam that into a statement expression. That way you
aren't relying on the compiler to merge the branches.
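
In function form (reusing the same "lock cmpxchgl" / "=@ccz" flag-output
asm as the x86 implementation, so gcc 6+; an untested sketch) that would
be something like:

static inline bool try_cmpxchg(unsigned int *ptr, unsigned int *val, unsigned int new)
{
        unsigned int old = *val;
        bool success;

        asm volatile("lock cmpxchgl %[new], %[ptr]"
                     : "=@ccz" (success),
                       [ptr] "+m" (*ptr),
                       [old] "+a" (old)
                     : [new] "r" (new)
                     : "memory");

        if (!success) {
                *val = old;     /* only write the comparand back on failure */
                return false;
        }
        return true;
}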

>
> Which is rather unfortunate...



--
Andy Lutomirski
AMA Capital Management, LLC

2017-03-24 17:24:03

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
> > So the first snipped I tested regressed like so:
> >
> >
> > 0000000000000000 <T_refcount_inc>: 0000000000000000 <T_refcount_inc>:
> > 0: 8b 07 mov (%rdi),%eax 0: 8b 17 mov (%rdi),%edx
> > 2: 83 f8 ff cmp $0xffffffff,%eax 2: 83 fa ff cmp $0xffffffff,%edx
> > 5: 74 13 je 1a <T_refcount_inc+0x1a> 5: 74 1a je 21 <T_refcount_inc+0x21>
> > 7: 85 c0 test %eax,%eax 7: 85 d2 test %edx,%edx
> > 9: 74 0d je 18 <T_refcount_inc+0x18> 9: 74 13 je 1e <T_refcount_inc+0x1e>
> > b: 8d 50 01 lea 0x1(%rax),%edx b: 8d 4a 01 lea 0x1(%rdx),%ecx
> > e: f0 0f b1 17 lock cmpxchg %edx,(%rdi) e: 89 d0 mov %edx,%eax
> > 12: 75 ee jne 2 <T_refcount_inc+0x2> 10: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
> > 14: ff c2 inc %edx 14: 74 04 je 1a <T_refcount_inc+0x1a>
> > 16: 75 02 jne 1a <T_refcount_inc+0x1a> 16: 89 c2 mov %eax,%edx
> > 18: 0f 0b ud2 18: eb e8 jmp 2 <T_refcount_inc+0x2>
> > 1a: c3 retq 1a: ff c1 inc %ecx
> > 1c: 75 03 jne 21 <T_refcount_inc+0x21>
> > 1e: 0f 0b ud2
> > 20: c3 retq
> > 21: c3 retq
>
> Can you re-send the better asm you got earlier?

On the left?

> If I pretend to be a dumb compiler, I wonder if you'd get better results with:
>
> if (!success) {
> *_old = __old;
> return false;
> } else {
> return true;
> }
>
> or however you jam that into a statement expression. That way you
> aren't relying on the compiler to merge the branches.

I tried a few variants, but nothing really made it better.

Find the tiny.c file below; I'm using:

gcc (Debian 6.3.0-5) 6.3.0 20170124

it has both an inline and an stmt-expr try_cmpxchg variant to play with;
the 'expected' output is at the bottom (same as above left).

Note that clang doesn't compile this stuff due to missing features.

---

/* gcc -Os -std=gnu99 -fno-strict-overflow -falign-jumps=1 -falign-loops=1 -c tiny.c; objdump -dr tiny.o */

typedef _Bool bool;

static inline bool try_cmpxchg(unsigned int *ptr, unsigned int *val, unsigned int new)
{
        unsigned int old = *val;
        bool success;

        asm volatile("lock cmpxchgl %[new], %[ptr]"
                     : "=@ccz" (success),
                       [ptr] "+m" (*ptr),
                       [old] "+a" (old)
                     : [new] "r" (new)
                     : "memory");

        // if (!success)
        *val = old;

        return success;
}

#define __try_cmpxchg(_ptr, _pold, _new) \
({ \
        bool success; \
        __typeof__(_ptr) _old = (_pold); \
        __typeof__(*(_ptr)) __old = *_old; \
        __typeof__(*(_ptr)) __new = (_new); \
        volatile unsigned int * __ptr = (volatile unsigned int *)(_ptr);\
 \
        asm volatile("lock cmpxchgl %[new], %[ptr]" \
                     : "=@ccz" (success), \
                       [ptr] "+m" (*__ptr), \
                       [old] "+a" (__old) \
                     : [new] "r" (__new) \
                     : "memory"); \
        /* if (!success) */ \
        *_old = __old; \
        success; \
})

#define EXCEPTION_VALUE(val, handler) asm volatile ("ud2 # %0" : : "r" (val))

#define UINT_MAX (~0U)

#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static inline void refcount_inc(unsigned int *r)
{
        unsigned int new, val = *(unsigned int volatile *)r;

        do {
                if (unlikely(val == UINT_MAX)) /* saturated */
                        return;

                if (unlikely(!val)) /* use-after-free */
                        goto exception;

                /* cannot overflow because we already checked UINT_MAX */
                new = val + 1;

        } while (unlikely(!try_cmpxchg(r, &val, new)));

        if (unlikely(new == UINT_MAX))
exception:      EXCEPTION_VALUE(val, __refcount_warn);
}

void T_refcount_inc(unsigned int *r)
{
        refcount_inc(r);
}

/*
0: 8b 07 mov (%rdi),%eax
2: 83 f8 ff cmp $0xffffffff,%eax
5: 74 13 je 1a <T_refcount_inc+0x1a>
7: 85 c0 test %eax,%eax
9: 74 0d je 18 <T_refcount_inc+0x18>
b: 8d 50 01 lea 0x1(%rax),%edx
e: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
12: 75 ee jne 2 <T_refcount_inc+0x2>
14: ff c2 inc %edx
16: 75 02 jne 1a <T_refcount_inc+0x1a>
18: 0f 0b ud2
1a: c3 retq
*/

2017-03-24 17:51:49

by Dmitry Vyukov

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
>> > So the first snipped I tested regressed like so:
>> >
>> >
>> > 0000000000000000 <T_refcount_inc>: 0000000000000000 <T_refcount_inc>:
>> > 0: 8b 07 mov (%rdi),%eax 0: 8b 17 mov (%rdi),%edx
>> > 2: 83 f8 ff cmp $0xffffffff,%eax 2: 83 fa ff cmp $0xffffffff,%edx
>> > 5: 74 13 je 1a <T_refcount_inc+0x1a> 5: 74 1a je 21 <T_refcount_inc+0x21>
>> > 7: 85 c0 test %eax,%eax 7: 85 d2 test %edx,%edx
>> > 9: 74 0d je 18 <T_refcount_inc+0x18> 9: 74 13 je 1e <T_refcount_inc+0x1e>
>> > b: 8d 50 01 lea 0x1(%rax),%edx b: 8d 4a 01 lea 0x1(%rdx),%ecx
>> > e: f0 0f b1 17 lock cmpxchg %edx,(%rdi) e: 89 d0 mov %edx,%eax
>> > 12: 75 ee jne 2 <T_refcount_inc+0x2> 10: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
>> > 14: ff c2 inc %edx 14: 74 04 je 1a <T_refcount_inc+0x1a>
>> > 16: 75 02 jne 1a <T_refcount_inc+0x1a> 16: 89 c2 mov %eax,%edx
>> > 18: 0f 0b ud2 18: eb e8 jmp 2 <T_refcount_inc+0x2>
>> > 1a: c3 retq 1a: ff c1 inc %ecx
>> > 1c: 75 03 jne 21 <T_refcount_inc+0x21>
>> > 1e: 0f 0b ud2
>> > 20: c3 retq
>> > 21: c3 retq
>>
>> Can you re-send the better asm you got earlier?
>
> On the left?
>
>> If I pretend to be a dumb compiler, I wonder if you'd get better results with:
>>
>> if (!success) {
>> *_old = __old;
>> return false;
>> } else {
>> return true;
>> }
>>
>> or however you jam that into a statement expression. That way you
>> aren't relying on the compiler to merge the branches.
>
> I tried a few variants, but nothing really made it better.
>
> Find the tiny.c file below; I'm using:
>
> gcc (Debian 6.3.0-5) 6.3.0 20170124
>
> it has both an inline and an stmt-expr try_cmpxchg variant to play with;
> the 'expected' output is at the bottom (same as above left).
>
> Note that clang doesn't compile this stuff due to missing features.
>
> ---
>
> /* gcc -Os -std=gnu99 -fno-strict-overflow -falign-jumps=1 -falign-loops=1 -c tiny.c; objdump -dr tiny.o */
>
> typedef _Bool bool;
>
> static inline bool try_cmpxchg(unsigned int *ptr, unsigned int *val, unsigned int new)
> {
> unsigned int old = *val;
> bool success;
>
> asm volatile("lock cmpxchgl %[new], %[ptr]"
> : "=@ccz" (success),
> [ptr] "+m" (*ptr),
> [old] "+a" (old)
> : [new] "r" (new)
> : "memory");
>
> // if (!success)
> *val = old;
>
> return success;
> }
>
> #define __try_cmpxchg(_ptr, _pold, _new) \
> ({ \
> bool success; \
> __typeof__(_ptr) _old = (_pold); \
> __typeof__(*(_ptr)) __old = *_old; \
> __typeof__(*(_ptr)) __new = (_new); \
> volatile unsigned int * __ptr = (volatile unsigned int *)(_ptr);\
> \
> asm volatile("lock cmpxchgl %[new], %[ptr]" \
> : "=@ccz" (success), \
> [ptr] "+m" (*__ptr), \
> [old] "+a" (__old) \
> : [new] "r" (__new) \
> : "memory"); \
> /* if (!success) */ \
> *_old = __old; \
> success; \
> })
>
> #define EXCEPTION_VALUE(val, handler) asm volatile ("ud2 # %0" : : "r" (val))
>
> #define UINT_MAX (~0U)
>
> #define likely(x) __builtin_expect(!!(x), 1)
> #define unlikely(x) __builtin_expect(!!(x), 0)
>
> static inline void refcount_inc(unsigned int *r)
> {
> unsigned int new, val = *(unsigned int volatile *)r;
>
> do {
> if (unlikely(val == UINT_MAX)) /* saturated */
> return;
>
> if (unlikely(!val)) /* use-after-free */
> goto exception;
>
> /* cannot overflow because we already checked UINT_MAX */
> new = val + 1;
>
> } while (unlikely(!try_cmpxchg(r, &val, new)));
>
> if (unlikely(new == UINT_MAX))
> exception: EXCEPTION_VALUE(val, __refcount_warn);
> }
>
> void T_refcount_inc(unsigned int *r)
> {
> refcount_inc(r);
> }
>
> /*
> 0: 8b 07 mov (%rdi),%eax
> 2: 83 f8 ff cmp $0xffffffff,%eax
> 5: 74 13 je 1a <T_refcount_inc+0x1a>
> 7: 85 c0 test %eax,%eax
> 9: 74 0d je 18 <T_refcount_inc+0x18>
> b: 8d 50 01 lea 0x1(%rax),%edx
> e: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
> 12: 75 ee jne 2 <T_refcount_inc+0x2>
> 14: ff c2 inc %edx
> 16: 75 02 jne 1a <T_refcount_inc+0x1a>
> 18: 0f 0b ud2
> 1a: c3 retq
> */



This seems to help ;)

#define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)

2017-03-24 18:05:13

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 07:00:25PM +0100, Peter Zijlstra wrote:
> static inline bool try_cmpxchg2(unsigned int *ptr, unsigned int *val, unsigned int new)
> {
> unsigned int old = *val;
> bool success;
>
> asm volatile goto("lock cmpxchgl %[new], %[ptr]; "
> "jnz %[label]"
> : /* no output */
> : [ptr] "+m" (*ptr),
> [old] "+a" (old)
> [new] "r" (new)
> : "memory"
> : label);
> return 1;
>
> label:
> *val = old;
> return 0;
> }

N/m, I'm apparently so tired I didn't see the compiler errors :/

2017-03-24 18:10:29

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 06:51:15PM +0100, Dmitry Vyukov wrote:
> On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
> >> > So the first snipped I tested regressed like so:
> >> >
> >> >
> >> > 0000000000000000 <T_refcount_inc>: 0000000000000000 <T_refcount_inc>:
> >> > 0: 8b 07 mov (%rdi),%eax 0: 8b 17 mov (%rdi),%edx
> >> > 2: 83 f8 ff cmp $0xffffffff,%eax 2: 83 fa ff cmp $0xffffffff,%edx
> >> > 5: 74 13 je 1a <T_refcount_inc+0x1a> 5: 74 1a je 21 <T_refcount_inc+0x21>
> >> > 7: 85 c0 test %eax,%eax 7: 85 d2 test %edx,%edx
> >> > 9: 74 0d je 18 <T_refcount_inc+0x18> 9: 74 13 je 1e <T_refcount_inc+0x1e>
> >> > b: 8d 50 01 lea 0x1(%rax),%edx b: 8d 4a 01 lea 0x1(%rdx),%ecx
> >> > e: f0 0f b1 17 lock cmpxchg %edx,(%rdi) e: 89 d0 mov %edx,%eax
> >> > 12: 75 ee jne 2 <T_refcount_inc+0x2> 10: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
> >> > 14: ff c2 inc %edx 14: 74 04 je 1a <T_refcount_inc+0x1a>
> >> > 16: 75 02 jne 1a <T_refcount_inc+0x1a> 16: 89 c2 mov %eax,%edx
> >> > 18: 0f 0b ud2 18: eb e8 jmp 2 <T_refcount_inc+0x2>
> >> > 1a: c3 retq 1a: ff c1 inc %ecx
> >> > 1c: 75 03 jne 21 <T_refcount_inc+0x21>
> >> > 1e: 0f 0b ud2
> >> > 20: c3 retq
> >> > 21: c3 retq
> >>

> This seems to help ;)
>
> #define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)

That gets me:

0000000000000000 <T_refcount_inc>:
0: 8b 07 mov (%rdi),%eax
2: 89 44 24 fc mov %eax,-0x4(%rsp)
6: 8b 44 24 fc mov -0x4(%rsp),%eax
a: 83 f8 ff cmp $0xffffffff,%eax
d: 74 1c je 2b <T_refcount_inc+0x2b>
f: 85 c0 test %eax,%eax
11: 75 07 jne 1a <T_refcount_inc+0x1a>
13: 8b 44 24 fc mov -0x4(%rsp),%eax
17: 0f 0b ud2
19: c3 retq
1a: 8d 50 01 lea 0x1(%rax),%edx
1d: 8b 44 24 fc mov -0x4(%rsp),%eax
21: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
25: 75 db jne 2 <T_refcount_inc+0x2>
27: ff c2 inc %edx
29: 74 e8 je 13 <T_refcount_inc+0x13>
2b: c3 retq


Which is even worse... (I did double check it actually compiled)

2017-03-24 18:13:36

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 07:08:38PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 24, 2017 at 06:51:15PM +0100, Dmitry Vyukov wrote:
> > On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <[email protected]> wrote:
> > > On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
> > >> > So the first snipped I tested regressed like so:
> > >> >
> > >> >
> > >> > 0000000000000000 <T_refcount_inc>: 0000000000000000 <T_refcount_inc>:
> > >> > 0: 8b 07 mov (%rdi),%eax 0: 8b 17 mov (%rdi),%edx
> > >> > 2: 83 f8 ff cmp $0xffffffff,%eax 2: 83 fa ff cmp $0xffffffff,%edx
> > >> > 5: 74 13 je 1a <T_refcount_inc+0x1a> 5: 74 1a je 21 <T_refcount_inc+0x21>
> > >> > 7: 85 c0 test %eax,%eax 7: 85 d2 test %edx,%edx
> > >> > 9: 74 0d je 18 <T_refcount_inc+0x18> 9: 74 13 je 1e <T_refcount_inc+0x1e>
> > >> > b: 8d 50 01 lea 0x1(%rax),%edx b: 8d 4a 01 lea 0x1(%rdx),%ecx
> > >> > e: f0 0f b1 17 lock cmpxchg %edx,(%rdi) e: 89 d0 mov %edx,%eax
> > >> > 12: 75 ee jne 2 <T_refcount_inc+0x2> 10: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
> > >> > 14: ff c2 inc %edx 14: 74 04 je 1a <T_refcount_inc+0x1a>
> > >> > 16: 75 02 jne 1a <T_refcount_inc+0x1a> 16: 89 c2 mov %eax,%edx
> > >> > 18: 0f 0b ud2 18: eb e8 jmp 2 <T_refcount_inc+0x2>
> > >> > 1a: c3 retq 1a: ff c1 inc %ecx
> > >> > 1c: 75 03 jne 21 <T_refcount_inc+0x21>
> > >> > 1e: 0f 0b ud2
> > >> > 20: c3 retq
> > >> > 21: c3 retq
> > >>
>
> > This seems to help ;)
> >
> > #define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
>
> That gets me:
>
> 0000000000000000 <T_refcount_inc>:
> 0: 8b 07 mov (%rdi),%eax
> 2: 89 44 24 fc mov %eax,-0x4(%rsp)
> 6: 8b 44 24 fc mov -0x4(%rsp),%eax
> a: 83 f8 ff cmp $0xffffffff,%eax
> d: 74 1c je 2b <T_refcount_inc+0x2b>
> f: 85 c0 test %eax,%eax
> 11: 75 07 jne 1a <T_refcount_inc+0x1a>
> 13: 8b 44 24 fc mov -0x4(%rsp),%eax
> 17: 0f 0b ud2
> 19: c3 retq
> 1a: 8d 50 01 lea 0x1(%rax),%edx
> 1d: 8b 44 24 fc mov -0x4(%rsp),%eax
> 21: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
> 25: 75 db jne 2 <T_refcount_inc+0x2>
> 27: ff c2 inc %edx
> 29: 74 e8 je 13 <T_refcount_inc+0x13>
> 2b: c3 retq
>
>
> Which is even worse... (I did double check it actually compiled)

Not to mention we cannot use the C11 atomics in kernel because we want
to be able to runtime patch LOCK prefixes when only 1 CPU is available.

2017-03-24 18:02:04

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 06:23:42PM +0100, Peter Zijlstra wrote:
> I tried a few variants, but nothing really made it better.
>
> Find the tiny.c file below; I'm using:
>
> gcc (Debian 6.3.0-5) 6.3.0 20170124
>
> it has both an inline and an stmt-expr try_cmpxchg variant to play with;
> the 'expected' output is at the bottom (same as above left).
>
> Note that clang doesn't compile this stuff due to missing features.
>

static inline bool try_cmpxchg2(unsigned int *ptr, unsigned int *val, unsigned int new)
{
unsigned int old = *val;
bool success;

asm volatile goto("lock cmpxchgl %[new], %[ptr]; "
"jnz %[label]"
: /* no output */
: [ptr] "+m" (*ptr),
[old] "+a" (old)
[new] "r" (new)
: "memory"
: label);
return 1;

label:
*val = old;
return 0;
}


generates better code than the @cc output with extra if (!success), but
not as good as the original.

2017-03-24 18:19:25

by Dmitry Vyukov

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 7:08 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 06:51:15PM +0100, Dmitry Vyukov wrote:
>> On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
>> >> > So the first snipped I tested regressed like so:
>> >> >
>> >> >
>> >> > 0000000000000000 <T_refcount_inc>: 0000000000000000 <T_refcount_inc>:
>> >> > 0: 8b 07 mov (%rdi),%eax 0: 8b 17 mov (%rdi),%edx
>> >> > 2: 83 f8 ff cmp $0xffffffff,%eax 2: 83 fa ff cmp $0xffffffff,%edx
>> >> > 5: 74 13 je 1a <T_refcount_inc+0x1a> 5: 74 1a je 21 <T_refcount_inc+0x21>
>> >> > 7: 85 c0 test %eax,%eax 7: 85 d2 test %edx,%edx
>> >> > 9: 74 0d je 18 <T_refcount_inc+0x18> 9: 74 13 je 1e <T_refcount_inc+0x1e>
>> >> > b: 8d 50 01 lea 0x1(%rax),%edx b: 8d 4a 01 lea 0x1(%rdx),%ecx
>> >> > e: f0 0f b1 17 lock cmpxchg %edx,(%rdi) e: 89 d0 mov %edx,%eax
>> >> > 12: 75 ee jne 2 <T_refcount_inc+0x2> 10: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
>> >> > 14: ff c2 inc %edx 14: 74 04 je 1a <T_refcount_inc+0x1a>
>> >> > 16: 75 02 jne 1a <T_refcount_inc+0x1a> 16: 89 c2 mov %eax,%edx
>> >> > 18: 0f 0b ud2 18: eb e8 jmp 2 <T_refcount_inc+0x2>
>> >> > 1a: c3 retq 1a: ff c1 inc %ecx
>> >> > 1c: 75 03 jne 21 <T_refcount_inc+0x21>
>> >> > 1e: 0f 0b ud2
>> >> > 20: c3 retq
>> >> > 21: c3 retq
>> >>
>
>> This seems to help ;)
>>
>> #define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
>
> That gets me:
>
> 0000000000000000 <T_refcount_inc>:
> 0: 8b 07 mov (%rdi),%eax
> 2: 89 44 24 fc mov %eax,-0x4(%rsp)
> 6: 8b 44 24 fc mov -0x4(%rsp),%eax
> a: 83 f8 ff cmp $0xffffffff,%eax
> d: 74 1c je 2b <T_refcount_inc+0x2b>
> f: 85 c0 test %eax,%eax
> 11: 75 07 jne 1a <T_refcount_inc+0x1a>
> 13: 8b 44 24 fc mov -0x4(%rsp),%eax
> 17: 0f 0b ud2
> 19: c3 retq
> 1a: 8d 50 01 lea 0x1(%rax),%edx
> 1d: 8b 44 24 fc mov -0x4(%rsp),%eax
> 21: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
> 25: 75 db jne 2 <T_refcount_inc+0x2>
> 27: ff c2 inc %edx
> 29: 74 e8 je 13 <T_refcount_inc+0x13>
> 2b: c3 retq
>
>
> Which is even worse... (I did double check it actually compiled)


For me gcc 7.0.1 generates:

0000000000000330 <T_refcount_inc1>:
330: 55 push %rbp
331: 8b 07 mov (%rdi),%eax
333: 48 89 e5 mov %rsp,%rbp
336: 83 f8 ff cmp $0xffffffff,%eax
339: 74 12 je 34d <T_refcount_inc1+0x1d>
33b: 85 c0 test %eax,%eax
33d: 74 10 je 34f <T_refcount_inc1+0x1f>
33f: 8d 50 01 lea 0x1(%rax),%edx
342: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
346: 75 ee jne 336 <T_refcount_inc1+0x6>
348: 83 fa ff cmp $0xffffffff,%edx
34b: 74 04 je 351 <T_refcount_inc1+0x21>
34d: 5d pop %rbp
34e: c3 retq
34f: 31 c0 xor %eax,%eax
351: 0f 0b ud2
353: 5d pop %rbp
354: c3 retq

with:

if (!success) \
*_old = __old; \

0000000000000320 <T_refcount_inc1>:
320: 8b 0f mov (%rdi),%ecx
322: 55 push %rbp
323: 48 89 e5 mov %rsp,%rbp
326: 83 f9 ff cmp $0xffffffff,%ecx
329: 74 2d je 358 <T_refcount_inc1+0x38>
32b: 85 c9 test %ecx,%ecx
32d: 74 25 je 354 <T_refcount_inc1+0x34>
32f: 8d 71 01 lea 0x1(%rcx),%esi
332: 89 c8 mov %ecx,%eax
334: f0 0f b1 37 lock cmpxchg %esi,(%rdi)
338: 89 c2 mov %eax,%edx
33a: 74 20 je 35c <T_refcount_inc1+0x3c>
33c: 83 fa ff cmp $0xffffffff,%edx
33f: 74 17 je 358 <T_refcount_inc1+0x38>
341: 85 d2 test %edx,%edx
343: 74 0f je 354 <T_refcount_inc1+0x34>
345: 8d 72 01 lea 0x1(%rdx),%esi
348: 89 d0 mov %edx,%eax
34a: f0 0f b1 37 lock cmpxchg %esi,(%rdi)
34e: 74 0a je 35a <T_refcount_inc1+0x3a>
350: 89 c2 mov %eax,%edx
352: eb e8 jmp 33c <T_refcount_inc1+0x1c>
354: 31 c9 xor %ecx,%ecx
356: 0f 0b ud2
358: 5d pop %rbp
359: c3 retq
35a: 89 d1 mov %edx,%ecx
35c: 83 fe ff cmp $0xffffffff,%esi
35f: 74 f5 je 356 <T_refcount_inc1+0x36>
361: 5d pop %rbp
362: c3 retq


with __atomic_compare_exchange_n:

exactly the same as the original code.


But I don't have an answer for runtime patching of LOCK.
Looks like something to fix in gcc.

2017-03-24 18:46:29

by Andy Lutomirski

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 10:23 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
>> > So the first snipped I tested regressed like so:
>> >
>> >
>> > 0000000000000000 <T_refcount_inc>: 0000000000000000 <T_refcount_inc>:
>> > 0: 8b 07 mov (%rdi),%eax 0: 8b 17 mov (%rdi),%edx
>> > 2: 83 f8 ff cmp $0xffffffff,%eax 2: 83 fa ff cmp $0xffffffff,%edx
>> > 5: 74 13 je 1a <T_refcount_inc+0x1a> 5: 74 1a je 21 <T_refcount_inc+0x21>
>> > 7: 85 c0 test %eax,%eax 7: 85 d2 test %edx,%edx
>> > 9: 74 0d je 18 <T_refcount_inc+0x18> 9: 74 13 je 1e <T_refcount_inc+0x1e>
>> > b: 8d 50 01 lea 0x1(%rax),%edx b: 8d 4a 01 lea 0x1(%rdx),%ecx
>> > e: f0 0f b1 17 lock cmpxchg %edx,(%rdi) e: 89 d0 mov %edx,%eax
>> > 12: 75 ee jne 2 <T_refcount_inc+0x2> 10: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
>> > 14: ff c2 inc %edx 14: 74 04 je 1a <T_refcount_inc+0x1a>
>> > 16: 75 02 jne 1a <T_refcount_inc+0x1a> 16: 89 c2 mov %eax,%edx
>> > 18: 0f 0b ud2 18: eb e8 jmp 2 <T_refcount_inc+0x2>
>> > 1a: c3 retq 1a: ff c1 inc %ecx
>> > 1c: 75 03 jne 21 <T_refcount_inc+0x21>
>> > 1e: 0f 0b ud2
>> > 20: c3 retq
>> > 21: c3 retq
>>
>> Can you re-send the better asm you got earlier?
>
> On the left?

Apparently I'm just blind this morning.

After playing with it a bit, I found some of the problem: you're
passing val into EXCEPTION_VALUE, which keeps it live. If I get rid
of that, the generated code is great.
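
Concretely, that means dropping the value operand from the ud2 asm, e.g.
(untested, and of course the handler then no longer gets to see the value):

/* EXCEPTION_VALUE() without the "r" (val) input, so 'val' need not be kept live */
#define EXCEPTION_VALUE(val, handler) asm volatile ("ud2")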

I haven't found a way to convince GCC that, in the success case, eax
isn't clobbered. I wrote this:

static inline bool try_cmpxchg(unsigned int *ptr, unsigned int *val,
                               unsigned int new)
{
        unsigned int old = *val;
        bool success;

        asm volatile("lock cmpxchgl %[new], %[ptr]"
                     : "=@ccz" (success),
                       [ptr] "+m" (*ptr),
                       [old] "+a" (old)
                     : [new] "r" (new)
                     : "memory");

        if (!success) {
                *val = old;
        } else {
                if (*val != old) {
                        *val = old;
                        __builtin_unreachable();
                } else {
                        /*
                         * Damnit, GCC, I want you to realize that this
                         * is happening but to avoid emitting the store.
                         */
                        *val = old; /* <-- here */
                }
        }

        return success;
}

The "here" line is the problematic code that breaks certain use cases,
and it obviously needn't have any effect in the generated code, but
I'm having trouble getting GCC to generate good code without it.

Is there some hack like if __builtin_is_unescaped(*val) *val = old;
that would work?

--Andy

2017-03-24 19:08:41

by Linus Torvalds

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 10:23 AM, Peter Zijlstra <[email protected]> wrote:
>
> I tried a few variants, but nothing really made it better.

So I really hate how your thing has two return values, and fakes the
second one using the pointer value. I dislike it for two different
reasons:

- it's bad for type checking: it makes the "real" pointer value and
the "old value" pointer value both be pointers of the same type

- I think it's part of the reason why gcc easily generates crap code.

And I think the problem is fundamental to your interface.

So how about we change the interface entirely, with the goal being
both type safety and good code generation?

In particular, let's make the interface something that might be
optimal given possible future gcc improvements, like the ability to
use "asm goto" with outputs. We can't do that today, but maybe some
day...

So I would suggest that the format of the "try_cmpxchg()" thing be
something like this:

#define try_cmpxchg(ptr, value, new, success_label) ({ \
        bool __txchg_success; \
        __typeof__(*(ptr)) __old; \
        asm volatile("lock cmpxchgl %3, %1" \
                     : "=@ccz" (__txchg_success), \
                       "+m" (*ptr), \
                       "=a" (__old) \
                     : "r" (new), \
                       "2" (value) \
                     : "memory"); \
        if (__txchg_success) goto success_label; \
        __old; })

which seems to be fairly natural.

Then you do your refcount loop (or pretty much *any* cmpxchg loop) with

        for (;;) {
                if (unlikely(val == UINT_MAX))
                        goto saturated;

                if (unlikely(!val))
                        goto use_after_free;

                new = val + 1;
                val = try_cmpxchg(r, val, new, success);
        }

success:
        return;

saturated: use_after_free: .. whatever error handling ..

and I think gcc should generate reasonable code.

Hmm?

NOTE! I would suggest that this odd "macro with a success label" model
of try_cmpxchg() never be used for something that people are expected
to use directly. So I don't think "normal" code should use this label
form.

But it's useful as a helper function that is then used to implement
the "real" ABI, ie to implement that "refcount_inc()" and other things
like that..

Linus

2017-03-24 19:17:03

by Andy Lutomirski

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 11:13 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 07:08:38PM +0100, Peter Zijlstra wrote:
>> On Fri, Mar 24, 2017 at 06:51:15PM +0100, Dmitry Vyukov wrote:
>> > On Fri, Mar 24, 2017 at 6:23 PM, Peter Zijlstra <[email protected]> wrote:
>> > > On Fri, Mar 24, 2017 at 09:54:46AM -0700, Andy Lutomirski wrote:
>> > >> > So the first snipped I tested regressed like so:
>> > >> >
>> > >> >
>> > >> > 0000000000000000 <T_refcount_inc>: 0000000000000000 <T_refcount_inc>:
>> > >> > 0: 8b 07 mov (%rdi),%eax 0: 8b 17 mov (%rdi),%edx
>> > >> > 2: 83 f8 ff cmp $0xffffffff,%eax 2: 83 fa ff cmp $0xffffffff,%edx
>> > >> > 5: 74 13 je 1a <T_refcount_inc+0x1a> 5: 74 1a je 21 <T_refcount_inc+0x21>
>> > >> > 7: 85 c0 test %eax,%eax 7: 85 d2 test %edx,%edx
>> > >> > 9: 74 0d je 18 <T_refcount_inc+0x18> 9: 74 13 je 1e <T_refcount_inc+0x1e>
>> > >> > b: 8d 50 01 lea 0x1(%rax),%edx b: 8d 4a 01 lea 0x1(%rdx),%ecx
>> > >> > e: f0 0f b1 17 lock cmpxchg %edx,(%rdi) e: 89 d0 mov %edx,%eax
>> > >> > 12: 75 ee jne 2 <T_refcount_inc+0x2> 10: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
>> > >> > 14: ff c2 inc %edx 14: 74 04 je 1a <T_refcount_inc+0x1a>
>> > >> > 16: 75 02 jne 1a <T_refcount_inc+0x1a> 16: 89 c2 mov %eax,%edx
>> > >> > 18: 0f 0b ud2 18: eb e8 jmp 2 <T_refcount_inc+0x2>
>> > >> > 1a: c3 retq 1a: ff c1 inc %ecx
>> > >> > 1c: 75 03 jne 21 <T_refcount_inc+0x21>
>> > >> > 1e: 0f 0b ud2
>> > >> > 20: c3 retq
>> > >> > 21: c3 retq
>> > >>
>>
>> > This seems to help ;)
>> >
>> > #define try_cmpxchg(ptr, pold, new) __atomic_compare_exchange_n(ptr, pold, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
>>
>> That gets me:
>>
>> 0000000000000000 <T_refcount_inc>:
>> 0: 8b 07 mov (%rdi),%eax
>> 2: 89 44 24 fc mov %eax,-0x4(%rsp)
>> 6: 8b 44 24 fc mov -0x4(%rsp),%eax
>> a: 83 f8 ff cmp $0xffffffff,%eax
>> d: 74 1c je 2b <T_refcount_inc+0x2b>
>> f: 85 c0 test %eax,%eax
>> 11: 75 07 jne 1a <T_refcount_inc+0x1a>
>> 13: 8b 44 24 fc mov -0x4(%rsp),%eax
>> 17: 0f 0b ud2
>> 19: c3 retq
>> 1a: 8d 50 01 lea 0x1(%rax),%edx
>> 1d: 8b 44 24 fc mov -0x4(%rsp),%eax
>> 21: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
>> 25: 75 db jne 2 <T_refcount_inc+0x2>
>> 27: ff c2 inc %edx
>> 29: 74 e8 je 13 <T_refcount_inc+0x13>
>> 2b: c3 retq
>>
>>
>> Which is even worse... (I did double check it actually compiled)
>
> Not to mention we cannot use the C11 atomics in kernel because we want
> to be able to runtime patch LOCK prefixes when only 1 CPU is available.

Is this really a show-stopper? I bet that objtool could be persuaded
to emit a list of the locations of all those LOCK prefixes.

--Andy

2017-03-24 19:17:41

by Linus Torvalds

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 11:45 AM, Andy Lutomirski <[email protected]> wrote:
>
> Is there some hack like if __builtin_is_unescaped(*val) *val = old;
> that would work?

See my recent email suggesting a completely different interface, which
avoids this problem.

My interface generates:

0000000000000000 <T_refcount_inc>:
0: 8b 07 mov (%rdi),%eax
2: 83 f8 ff cmp $0xffffffff,%eax
5: 74 12 je 19 <T_refcount_inc+0x19>
7: 85 c0 test %eax,%eax
9: 74 0a je 15 <T_refcount_inc+0x15>
b: 8d 50 01 lea 0x1(%rax),%edx
e: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
12: 75 ee jne 2 <T_refcount_inc+0x2>
14: c3 retq
15: 31 c0 xor %eax,%eax
17: 0f 0b ud2
19: c3 retq

for PeterZ's test-case, which seems optimal.

Of course, PeterZ used -Os, which isn't actually very natural for the
kernel. Using -O2 I get something else. It turns out that my macro
should use

if (likely(__txchg_success)) goto success_label;

(that "likely()" is critical) to make gcc not try to optimize for the
looping case.

So with that "likely()" fixed, with -O2 I get:

0000000000000000 <T_refcount_inc>:
0: 8b 07 mov (%rdi),%eax
2: 83 f8 ff cmp $0xffffffff,%eax
5: 74 0d je 14 <T_refcount_inc+0x14>
7: 85 c0 test %eax,%eax
9: 74 12 je 1d <T_refcount_inc+0x1d>
b: 8d 50 01 lea 0x1(%rax),%edx
e: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
12: 75 02 jne 16 <T_refcount_inc+0x16>
14: f3 c3 repz retq
16: 83 f8 ff cmp $0xffffffff,%eax
19: 75 ec jne 7 <T_refcount_inc+0x7>
1b: f3 c3 repz retq
1d: 31 c0 xor %eax,%eax
1f: 0f 0b ud2
21: c3 retq

which again looks pretty optimal (it did indeed generate bigger but
potentially higher-performance code by making the good case a
fallthrough, and the unlikely case a _forward_ jump that will be
predicted not-taken in the absence of other prediction information).

(Of course, this also depends on the exact behavior that PeterZ's code
had, namely an exception for use-after-free, but a silent saturation)

Linus

2017-03-24 19:22:23

by Linus Torvalds

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 12:16 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> Not to mention we cannot use the C11 atomics in kernel because we want
>> to be able to runtime patch LOCK prefixes when only 1 CPU is available.
>
> Is this really a show-stopper? I bet that objtool could be persuaded
> to emit a list of the locations of all those LOCK prefixes.

I doubt it's a show-stopper, if only because nobody cares about UP any
more. Not even the embedded world does.

That said, I'm not convinced that there will ever really be a reason
for the kernel to use the C11 atomics. They just aren't any better
than what we can do ourselves.

The reason for C11 atomics is "portably good atomics". We use
"architecture-specific good atomics" instead, and are willing to
maintain that. We will *have* to maintain that in the foreseeable
future anyway, for legacy compiler issues.

So any C11 atomics use by the kernel is realistically at least a
decade away if it *ever* happens.

Don't even worry about it. Planning decades ahead for some detail like
that is stupid.

Linus

2017-03-24 19:27:48

by Andy Lutomirski

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 12:20 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Mar 24, 2017 at 12:16 PM, Andy Lutomirski <[email protected]> wrote:
>>>
>>> Not to mention we cannot use the C11 atomics in kernel because we want
>>> to be able to runtime patch LOCK prefixes when only 1 CPU is available.
>>
>> Is this really a show-stopper? I bet that objtool could be persuaded
>> to emit a list of the locations of all those LOCK prefixes.
>
> I doubt it's a show-stopper, if only because nobody cares about UP any
> more. Not even the embedded world does.
>
> That said, I'm not convinced that there will ever really be a reason
> for the kernel to use the C11 atomics. They just aren't any better
> than what we can do ourselves.
>
> The reason for C11 atomics is "portably good atomics". We use
> "architecture-specific good atomics" instead, and are willing to
> maintain that. We will *have* to maintain that in the forseeable
> future anyway, for legacy compiler issues.

In theory, though, the compiler could optimize based on its knowledge
of what the C11 atomics do. ISTR reading about a few optimizations
that were already starting to be developed.

Using asm goto seems okay, too, but it's a lot more tedious and is less
friendly to the optimizers.

--Andy

2017-03-24 20:15:08

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 12:16:11PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 24, 2017 at 11:13 AM, Peter Zijlstra <[email protected]> wrote:

> > Not to mention we cannot use the C11 atomics in kernel because we want
> > to be able to runtime patch LOCK prefixes when only 1 CPU is available.
>
> Is this really a show-stopper? I bet that objtool could be persuaded
> to emit a list of the locations of all those LOCK prefixes.

Ah, but it's not _all_ LOCK prefixes. Some are needed even on UP, because
they are against hardware instead of other CPUs. Or against the
hypervisor instead of other vCPUs.

2017-03-24 20:16:57

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 12:20:37PM -0700, Linus Torvalds wrote:
> I doubt it's a show-stopper, if only because nobody cares about UP any
> more. Not even the embedded world does.

For some obscure reason we recently introduced a variant for virt
people, where it needs memory barriers against the hypervisor but,
having only a single vCPU, not against itself.

2017-03-24 20:21:58

by Andy Lutomirski

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 1:14 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 12:16:11PM -0700, Andy Lutomirski wrote:
>> On Fri, Mar 24, 2017 at 11:13 AM, Peter Zijlstra <[email protected]> wrote:
>
>> > Not to mention we cannot use the C11 atomics in kernel because we want
>> > to be able to runtime patch LOCK prefixes when only 1 CPU is available.
>>
>> Is this really a show-stopper? I bet that objtool could be persuaded
>> to emit a list of the locations of all those LOCK prefixes.
>
> Ah, but its not _all_ LOCK prefixes. Some are needed even on UP, because
> against hardware instead of other CPUs. Or again hypervisor instead of
> other vCPU.
>

Make a table of mandatory lock prefixes and assume all the others (due
to C11 atomics, etc) can be omitted on UP?

I'm curious, though, whether anyone actually compiles an x86 SMP
kernel, runs it on UP, and cares about performance these days.

--Andy

2017-03-24 20:22:59

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 11:45:46AM -0700, Andy Lutomirski wrote:
> After playing with it a bit, I found some of the problem: you're
> passing val into EXCEPTION_VALUE, which keeps it live. If I get rid
> of that, the generated code is great.

Right, so I needed that because I land on ud2 through 2 different paths:

- newly saturated
- use-after-free

And the exception handler can figure out which of the two by looking at
the variable, but then of course, it needs to be live.

For the full horror of how to do this, look here:

http://paste.debian.net/924190/

But I didn't just show you that, so you can't blame me for any damage
that might've done you.

2017-03-24 20:28:23

by Andy Lutomirski

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 1:22 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 11:45:46AM -0700, Andy Lutomirski wrote:
>> After playing with it a bit, I found some of the problem: you're
>> passing val into EXCEPTION_VALUE, which keeps it live. If I get rid
>> of that, the generated code is great.
>
> Right, so I needed that because I land on ud2 through 2 different paths:
>
> - newly saturated
> - use-after-free
>
> And the exception handler can figure out which of the two by looking at
> the variable, but then of course, it needs to be life.
>
> For the full horror of how to do this, look here:
>
> http://paste.debian.net/924190/
>
> But I didn't just show you that, so you can't blame me for any damage
> that might've done you.

Wow, that's horrible. Could this not be done by looking at flags
instead of regs?

For that matter, you're effectively comparing to -1 and 0. I'm not
really sure it would be faster, but you could plausibly add one then
subtract one again and get the full picture just from flags and a
single comparison?

2017-03-24 20:46:42

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 12:08:32PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2017 at 10:23 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > I tried a few variants, but nothing really made it better.
>
> So I really hate how your thing has two return values, and fakes the
> second one using the pointer value.

Inspired by C11 I'm afraid..

> So how about we change the interface entirely, with the goal being
> both type safety and good code generation?

I certainly like it better, but so far I'm having trouble reproducing
your results. What compiler version are you on?

2017-03-24 20:59:07

by Linus Torvalds

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 1:46 PM, Peter Zijlstra <[email protected]> wrote:
>
> I certainly like it better, but so far I'm having trouble reproducing
> your results. What compiler version are you on?

I have:

gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC)

from

gcc-6.3.1-1.fc24.x86_64

and I'm attaching the edited form of your test-case, just so that
we're on the exact same page.

Linus


Attachments:
tiny.c (1.18 kB)

2017-03-24 21:08:20

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 01:27:49PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 24, 2017 at 1:22 PM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, Mar 24, 2017 at 11:45:46AM -0700, Andy Lutomirski wrote:
> >> After playing with it a bit, I found some of the problem: you're
> >> passing val into EXCEPTION_VALUE, which keeps it live. If I get rid
> >> of that, the generated code is great.
> >
> > Right, so I needed that because I land on ud2 through 2 different paths:
> >
> > - newly saturated
> > - use-after-free
> >
> > And the exception handler can figure out which of the two by looking at
> > the variable, but then of course, it needs to be life.
> >
> > For the full horror of how to do this, look here:
> >
> > http://paste.debian.net/924190/
> >
> > But I didn't just show you that, so you can't blame me for any damage
> > that might've done you.
>
> Wow, that's horrible. Could this not be done by looking at flags
> instead of regs?

Well, the EXCEPTION_HANDLER() thing is something ARM/ARM64 could also
implement.

2017-03-24 21:25:00

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 12:17:28PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2017 at 11:45 AM, Andy Lutomirski <[email protected]> wrote:
> >
> > Is there some hack like if __builtin_is_unescaped(*val) *val = old;
> > that would work?
>
> See my recent email suggesting a completely different interface, which
> avoids this problem.
>
> My interface generates:
>
> 0000000000000000 <T_refcount_inc>:
> 0: 8b 07 mov (%rdi),%eax
> 2: 83 f8 ff cmp $0xffffffff,%eax
> 5: 74 12 je 19 <T_refcount_inc+0x19>
> 7: 85 c0 test %eax,%eax
> 9: 74 0a je 15 <T_refcount_inc+0x15>
> b: 8d 50 01 lea 0x1(%rax),%edx
> e: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
> 12: 75 ee jne 2 <T_refcount_inc+0x2>
> 14: c3 retq
> 15: 31 c0 xor %eax,%eax
> 17: 0f 0b ud2
> 19: c3 retq
>
> for PeterZ's test-case, which seems optimal.

Right; now my GCC emits more or less the same code (it's a slightly
different compiler, and instead of '12: jne' it does '12: je; 14: jmp 2').

But maybe that's the likely() you added later.

Also, see how at 7 we test if eax is 0 and then at 9 jump to 15 where we
make eax 0. Pretty daft code-gen.

In any case, you lost one branch into ud2; your success: return, should
be success: if (new == UINT_MAX), such that when we newly saturate the
count we also raise an exception.
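
I.e., in the EXCEPTION_VALUE() terms of tiny.c, something like (sketch):

success:
        if (unlikely(new == UINT_MAX))          /* newly saturated */
                EXCEPTION_VALUE(val, __refcount_warn);
        return;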

With that, the code is still larger than it used to be. I'll have a play
around. I do like this interface better, but getting GCC to generate
sensible code seems 'interesting'.

I'll try and redo the patches that landed in tip and see what it does
for total vmlinux size somewhere tomorrow.

2017-03-25 07:52:14

by Peter Zijlstra

Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 10:23:29PM +0100, Peter Zijlstra wrote:

> I'll try and redo the patches that landed in tip and see what it does
> for total vmlinux size somewhere tomorrow.

text data bss dec hex filename
10726413 4540256 843776 16110445 f5d36d defconfig-build/vmlinux.pre
10730509 4540256 843776 16114541 f5e36d defconfig-build/vmlinux.post

:-(

---
arch/x86/include/asm/atomic.h | 18 ++++++-------
arch/x86/include/asm/atomic64_64.h | 24 ++++++++++--------
arch/x86/include/asm/cmpxchg.h | 50 ++++++++++++++++++++----------------
include/linux/atomic.h | 52 +++++++++++++++++++++++++-------------
lib/refcount.c | 35 ++++++++++++++-----------
5 files changed, 104 insertions(+), 75 deletions(-)

diff --git a/arch/x86/include/asm/atomic.h b/arch/x86/include/asm/atomic.h
index caa5798..c80d4914 100644
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -186,11 +186,8 @@ static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
return cmpxchg(&v->counter, old, new);
}

-#define atomic_try_cmpxchg atomic_try_cmpxchg
-static __always_inline bool atomic_try_cmpxchg(atomic_t *v, int *old, int new)
-{
- return try_cmpxchg(&v->counter, old, new);
-}
+#define atomic_try_cmpxchg(_ptr, _old, _new, _label) \
+ try_cmpxchg(&(_ptr)->counter, _old, _new, _label)

static inline int atomic_xchg(atomic_t *v, int new)
{
@@ -210,8 +207,9 @@ static inline void atomic_##op(int i, atomic_t *v) \
static inline int atomic_fetch_##op(int i, atomic_t *v) \
{ \
int val = atomic_read(v); \
- do { \
- } while (!atomic_try_cmpxchg(v, &val, val c_op i)); \
+ for (;;) \
+ val = atomic_try_cmpxchg(v, val, val c_op i, success); \
+success: \
return val; \
}

@@ -239,10 +237,12 @@ ATOMIC_OPS(xor, ^)
static __always_inline int __atomic_add_unless(atomic_t *v, int a, int u)
{
int c = atomic_read(v);
- do {
+ for (;;) {
if (unlikely(c == u))
break;
- } while (!atomic_try_cmpxchg(v, &c, c + a));
+ c = atomic_try_cmpxchg(v, c, c + a, success);
+ }
+success:
return c;
}

diff --git a/arch/x86/include/asm/atomic64_64.h b/arch/x86/include/asm/atomic64_64.h
index 6189a43..489c3e2 100644
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -176,11 +176,8 @@ static inline long atomic64_cmpxchg(atomic64_t *v, long old, long new)
return cmpxchg(&v->counter, old, new);
}

-#define atomic64_try_cmpxchg atomic64_try_cmpxchg
-static __always_inline bool atomic64_try_cmpxchg(atomic64_t *v, long *old, long new)
-{
- return try_cmpxchg(&v->counter, old, new);
-}
+#define atomic64_try_cmpxchg(_ptr, _old, _new, _label) \
+ try_cmpxchg(&(_ptr)->counter, _old, _new, _label)

static inline long atomic64_xchg(atomic64_t *v, long new)
{
@@ -199,10 +196,12 @@ static inline long atomic64_xchg(atomic64_t *v, long new)
static inline bool atomic64_add_unless(atomic64_t *v, long a, long u)
{
long c = atomic64_read(v);
- do {
+ for (;;) {
if (unlikely(c == u))
return false;
- } while (!atomic64_try_cmpxchg(v, &c, c + a));
+ c = atomic64_try_cmpxchg(v, c, c + a, success);
+ }
+success:
return true;
}

@@ -218,11 +217,13 @@ static inline bool atomic64_add_unless(atomic64_t *v, long a, long u)
static inline long atomic64_dec_if_positive(atomic64_t *v)
{
long dec, c = atomic64_read(v);
- do {
+ for (;;) {
dec = c - 1;
if (unlikely(dec < 0))
break;
- } while (!atomic64_try_cmpxchg(v, &c, dec));
+ c = atomic64_try_cmpxchg(v, c, dec, success);
+ }
+success:
return dec;
}

@@ -239,8 +240,9 @@ static inline void atomic64_##op(long i, atomic64_t *v) \
static inline long atomic64_fetch_##op(long i, atomic64_t *v) \
{ \
long val = atomic64_read(v); \
- do { \
- } while (!atomic64_try_cmpxchg(v, &val, val c_op i)); \
+ for (;;) \
+ val = atomic64_try_cmpxchg(v, val, val c_op i, success);\
+success: \
return val; \
}

diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index fb961db..e6b8a8f 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -154,22 +154,24 @@ extern void __add_wrong_size(void)
__cmpxchg_local(ptr, old, new, sizeof(*(ptr)))


-#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
+#define __raw_try_cmpxchg(_ptr, _old, _new, success_label, size, lock) \
({ \
- bool success; \
- __typeof__(_ptr) _old = (_pold); \
- __typeof__(*(_ptr)) __old = *_old; \
+ __typeof__(*(_ptr)) __old = (_old); \
__typeof__(*(_ptr)) __new = (_new); \
+ __typeof__(*(_ptr)) __ret; \
+ bool __success; \
+ \
switch (size) { \
case __X86_CASE_B: \
{ \
volatile u8 *__ptr = (volatile u8 *)(_ptr); \
asm volatile(lock "cmpxchgb %[new], %[ptr]" \
CC_SET(z) \
- : CC_OUT(z) (success), \
+ : CC_OUT(z) (__success), \
[ptr] "+m" (*__ptr), \
- [old] "+a" (__old) \
- : [new] "q" (__new) \
+ [old] "=a" (__ret) \
+ : [new] "q" (__new), \
+ "2" (__old) \
: "memory"); \
break; \
} \
@@ -178,10 +180,11 @@ extern void __add_wrong_size(void)
volatile u16 *__ptr = (volatile u16 *)(_ptr); \
asm volatile(lock "cmpxchgw %[new], %[ptr]" \
CC_SET(z) \
- : CC_OUT(z) (success), \
+ : CC_OUT(z) (__success), \
[ptr] "+m" (*__ptr), \
- [old] "+a" (__old) \
- : [new] "r" (__new) \
+ [old] "=a" (__ret) \
+ : [new] "r" (__new), \
+ "2" (__old) \
: "memory"); \
break; \
} \
@@ -190,10 +193,11 @@ extern void __add_wrong_size(void)
volatile u32 *__ptr = (volatile u32 *)(_ptr); \
asm volatile(lock "cmpxchgl %[new], %[ptr]" \
CC_SET(z) \
- : CC_OUT(z) (success), \
+ : CC_OUT(z) (__success), \
[ptr] "+m" (*__ptr), \
- [old] "+a" (__old) \
- : [new] "r" (__new) \
+ [old] "=a" (__ret) \
+ : [new] "r" (__new), \
+ "2" (__old) \
: "memory"); \
break; \
} \
@@ -202,25 +206,27 @@ extern void __add_wrong_size(void)
volatile u64 *__ptr = (volatile u64 *)(_ptr); \
asm volatile(lock "cmpxchgq %[new], %[ptr]" \
CC_SET(z) \
- : CC_OUT(z) (success), \
+ : CC_OUT(z) (__success), \
[ptr] "+m" (*__ptr), \
- [old] "+a" (__old) \
- : [new] "r" (__new) \
+ [old] "=a" (__ret) \
+ : [new] "r" (__new), \
+ "2" (__old) \
: "memory"); \
break; \
} \
default: \
__cmpxchg_wrong_size(); \
} \
- *_old = __old; \
- success; \
+ \
+ if (likely(__success)) goto success_label; \
+ __ret; \
})

-#define __try_cmpxchg(ptr, pold, new, size) \
- __raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
+#define __try_cmpxchg(ptr, pold, new, success_label, size) \
+ __raw_try_cmpxchg((ptr), (pold), (new), success_label, (size), LOCK_PREFIX)

-#define try_cmpxchg(ptr, pold, new) \
- __try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
+#define try_cmpxchg(ptr, pold, new, success_label) \
+ __try_cmpxchg((ptr), (pold), (new), success_label, sizeof(*(ptr)))

/*
* xadd() adds "inc" to "*ptr" and atomically returns the previous
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index aae5953..13a6eac 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -425,18 +425,26 @@

#ifndef atomic_try_cmpxchg

-#define __atomic_try_cmpxchg(type, _p, _po, _n) \
+#define __atomic_try_cmpxchg(type, _p, _o, _n, _label) \
({ \
- typeof(_po) __po = (_po); \
- typeof(*(_po)) __o = *__po; \
- *__po = atomic_cmpxchg##type((_p), __o, (_n)); \
- (*__po == __o); \
+ typeof(*(_p)) __r; \
+ typeof(*(_p)) __o = (_o); \
+ __r = atomic_cmpxchg##type((_p), __o, (_n)); \
+ if (__r == __o) goto _label; \
+ __r; \
})

-#define atomic_try_cmpxchg(_p, _po, _n) __atomic_try_cmpxchg(, _p, _po, _n)
-#define atomic_try_cmpxchg_relaxed(_p, _po, _n) __atomic_try_cmpxchg(_relaxed, _p, _po, _n)
-#define atomic_try_cmpxchg_acquire(_p, _po, _n) __atomic_try_cmpxchg(_acquire, _p, _po, _n)
-#define atomic_try_cmpxchg_release(_p, _po, _n) __atomic_try_cmpxchg(_release, _p, _po, _n)
+#define atomic_try_cmpxchg(_p, _o, _n, _l) \
+ __atomic_try_cmpxchg(, _p, _o, _n, _l)
+
+#define atomic_try_cmpxchg_relaxed(_p, _o, _n, _l) \
+ __atomic_try_cmpxchg(_relaxed, _p, _o, _n, _l)
+
+#define atomic_try_cmpxchg_acquire(_p, _o, _n, _l) \
+ __atomic_try_cmpxchg(_acquire, _p, _o, _n, _l)
+
+#define atomic_try_cmpxchg_release(_p, _o, _n, _l) \
+ __atomic_try_cmpxchg(_release, _p, _o, _n, _l)

#else /* atomic_try_cmpxchg */
#define atomic_try_cmpxchg_relaxed atomic_try_cmpxchg
@@ -1019,18 +1027,26 @@ static inline int atomic_dec_if_positive(atomic_t *v)

#ifndef atomic64_try_cmpxchg

-#define __atomic64_try_cmpxchg(type, _p, _po, _n) \
+#define __atomic64_try_cmpxchg(type, _p, _o, _n, _label) \
({ \
- typeof(_po) __po = (_po); \
- typeof(*(_po)) __o = *__po; \
- *__po = atomic64_cmpxchg##type((_p), __o, (_n)); \
- (*__po == __o); \
+ typeof(*(_p)) __r; \
+ typeof(*(_p)) __o = (_o); \
+ __r = atomic64_cmpxchg##type((_p), __o, (_n)); \
+ if (__r == __o) goto _label; \
+ __r; \
})

-#define atomic64_try_cmpxchg(_p, _po, _n) __atomic64_try_cmpxchg(, _p, _po, _n)
-#define atomic64_try_cmpxchg_relaxed(_p, _po, _n) __atomic64_try_cmpxchg(_relaxed, _p, _po, _n)
-#define atomic64_try_cmpxchg_acquire(_p, _po, _n) __atomic64_try_cmpxchg(_acquire, _p, _po, _n)
-#define atomic64_try_cmpxchg_release(_p, _po, _n) __atomic64_try_cmpxchg(_release, _p, _po, _n)
+#define atomic64_try_cmpxchg(_p, _o, _n, _l) \
+ __atomic64_try_cmpxchg(, _p, _o, _n, _l)
+
+#define atomic64_try_cmpxchg_relaxed(_p, _o, _n, _l) \
+ __atomic64_try_cmpxchg(_relaxed, _p, _o, _n, _l)
+
+#define atomic64_try_cmpxchg_acquire(_p, _o, _n, _l) \
+ __atomic64_try_cmpxchg(_acquire, _p, _o, _n, _l)
+
+#define atomic64_try_cmpxchg_release(_p, _o, _n, _l) \
+ __atomic64_try_cmpxchg(_release, _p, _o, _n, _l)

#else /* atomic64_try_cmpxchg */
#define atomic64_try_cmpxchg_relaxed atomic64_try_cmpxchg
diff --git a/lib/refcount.c b/lib/refcount.c
index f42124c..18b8926 100644
--- a/lib/refcount.c
+++ b/lib/refcount.c
@@ -59,7 +59,7 @@ bool refcount_add_not_zero(unsigned int i, refcount_t *r)
{
unsigned int new, val = atomic_read(&r->refs);

- do {
+ for (;;) {
if (!val)
return false;

@@ -70,8 +70,9 @@ bool refcount_add_not_zero(unsigned int i, refcount_t *r)
if (new < val)
new = UINT_MAX;

- } while (!atomic_try_cmpxchg_relaxed(&r->refs, &val, new));
-
+ val = atomic_try_cmpxchg_relaxed(&r->refs, val, new, success);
+ }
+success:
WARN_ONCE(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");

return true;
@@ -116,7 +117,7 @@ bool refcount_inc_not_zero(refcount_t *r)
{
unsigned int new, val = atomic_read(&r->refs);

- do {
+ for (;;) {
new = val + 1;

if (!val)
@@ -125,8 +126,9 @@ bool refcount_inc_not_zero(refcount_t *r)
if (unlikely(!new))
return true;

- } while (!atomic_try_cmpxchg_relaxed(&r->refs, &val, new));
-
+ val = atomic_try_cmpxchg_relaxed(&r->refs, val, new, success);
+ }
+success:
WARN_ONCE(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");

return true;
@@ -175,7 +177,7 @@ bool refcount_sub_and_test(unsigned int i, refcount_t *r)
{
unsigned int new, val = atomic_read(&r->refs);

- do {
+ for (;;) {
if (unlikely(val == UINT_MAX))
return false;

@@ -185,8 +187,9 @@ bool refcount_sub_and_test(unsigned int i, refcount_t *r)
return false;
}

- } while (!atomic_try_cmpxchg_release(&r->refs, &val, new));
-
+ val = atomic_try_cmpxchg_release(&r->refs, val, new, success);
+ }
+success:
return !new;
}
EXPORT_SYMBOL_GPL(refcount_sub_and_test);
@@ -244,9 +247,10 @@ EXPORT_SYMBOL_GPL(refcount_dec);
*/
bool refcount_dec_if_one(refcount_t *r)
{
- int val = 1;
-
- return atomic_try_cmpxchg_release(&r->refs, &val, 0);
+ atomic_try_cmpxchg_release(&r->refs, 1, 0, success);
+ return false;
+success:
+ return true;
}
EXPORT_SYMBOL_GPL(refcount_dec_if_one);

@@ -265,7 +269,7 @@ bool refcount_dec_not_one(refcount_t *r)
{
unsigned int new, val = atomic_read(&r->refs);

- do {
+ for (;;) {
if (unlikely(val == UINT_MAX))
return true;

@@ -278,8 +282,9 @@ bool refcount_dec_not_one(refcount_t *r)
return true;
}

- } while (!atomic_try_cmpxchg_release(&r->refs, &val, new));
-
+ val = atomic_try_cmpxchg_release(&r->refs, val, new, success);
+ }
+success:
return true;
}
EXPORT_SYMBOL_GPL(refcount_dec_not_one);
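
For readability, this is what refcount_inc_not_zero() looks like with the
hunks above applied (reconstructed from the diff; only the comment on the
try_cmpxchg line is added here):

bool refcount_inc_not_zero(refcount_t *r)
{
        unsigned int new, val = atomic_read(&r->refs);

        for (;;) {
                new = val + 1;

                if (!val)
                        return false;

                if (unlikely(!new))
                        return true;

                /* jumps to the label on success, yields the fresh value on failure */
                val = atomic_try_cmpxchg_relaxed(&r->refs, val, new, success);
        }
success:
        WARN_ONCE(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");

        return true;
}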

2017-03-25 18:01:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Sat, Mar 25, 2017 at 12:51 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 10:23:29PM +0100, Peter Zijlstra wrote:
>
>> I'll try and redo the patches that landed in tip and see what it does
>> for total vmlinux size somewhere tomorrow.
>
> text data bss dec hex filename
> 10726413 4540256 843776 16110445 f5d36d defconfig-build/vmlinux.pre
> 10730509 4540256 843776 16114541 f5e36d defconfig-build/vmlinux.post
>
> :-(

Hmm. But you are comparing against the *broken* version that did the
unconditional store of the result.

You should at least compare against the fixed version with the
conditional store. That's the one that was hard to get good code
generation from, wasn't it?

Linus

2017-03-25 18:20:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Sat, Mar 25, 2017 at 11:00:44AM -0700, Linus Torvalds wrote:
> On Sat, Mar 25, 2017 at 12:51 AM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, Mar 24, 2017 at 10:23:29PM +0100, Peter Zijlstra wrote:
> >
> >> I'll try and redo the patches that landed in tip and see what it does
> >> for total vmlinux size somewhere tomorrow.
> >
> > text data bss dec hex filename
> > 10726413 4540256 843776 16110445 f5d36d defconfig-build/vmlinux.pre
> > 10730509 4540256 843776 16114541 f5e36d defconfig-build/vmlinux.post
10730445 4540256 843776 16114477 f5e32d defconfig-build/vmlinux

> >
> > :-(
>
> Hmm. But you are comparing against the *broken* version that did the
> unconditional store of the result.

Well, only broken if not used on stack local variables, but yes.

> You should at least compare against the fixed version with the
> conditional store. That's the one that was hard to get good code
> generation from, wasn't it?

Added above, a few bytes smaller than the shiny new one actually.

2017-03-25 18:34:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Sat, Mar 25, 2017 at 11:28 AM, Linus Torvalds
<[email protected]> wrote:
>
> Hmm. Sad. The label approach looked like it would match the semantics
> of cmpxchg perfectly, but it's not as optimal as it superficially
> would have seemed.

Oh, I just noticed that at least your other one didn't mark "success"
as being likely.

That changed code generation a lot for me for the loops, where gcc
would assume that the loop was likely to be taken, which in turn means
that gcc lays out the loop with a backwards branch. Which is denser,
but also likely slower, since most x86 chips predict backwards
branches taken.
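
For illustration only (this snippet is not from any of the patches, and
example_inc() is made up), the same expectation can also be spelled at the
call site, which should give gcc the same information as the in-macro
likely(success):

static void example_inc(refcount_t *r)
{
        unsigned int new, val = atomic_read(&r->refs);

        do {
                new = val + 1;
                /* retry edge marked cold: laid out as a forward branch with
                 * the retry code out of line, instead of the dense backwards
                 * branch gcc's loop heuristic would otherwise emit */
        } while (unlikely(!atomic_try_cmpxchg_relaxed(&r->refs, &val, new)));
}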

So that might be one difference. Although the size differences in the
last case are so small that it might also just be random noise.

Linus

2017-03-25 18:35:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Sat, Mar 25, 2017 at 11:20 AM, Peter Zijlstra <[email protected]> wrote:
>
> Added above, a few bytes smaller than the shiny new one actually.

Hmm. Sad. The label approach looked like it would match the semantics
of cmpxchg perfectly, but it's not as optimal as it superficially
would have seemed.

And I assume that register allocation etc is different enough that
there's no sane way to diff the asm to see what changed.

Linus

2017-03-25 21:14:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Sat, Mar 25, 2017 at 11:34:32AM -0700, Linus Torvalds wrote:
> On Sat, Mar 25, 2017 at 11:28 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Hmm. Sad. The label approach looked like it would match the semantics
> > of cmpxchg perfectly, but it's not as optimal as it superficially
> > would have seemed.
>
> Oh, I just noticed that at least your other one didn't mark "success"
> as being likely.

10730509 4540256 843776 16114541 f5e36d defconfig-build/vmlinux


---
arch/x86/include/asm/cmpxchg.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index fb961db..d347abc 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -212,8 +212,9 @@ extern void __add_wrong_size(void)
default: \
__cmpxchg_wrong_size(); \
} \
+ if (unlikely(!success)) \
*_old = __old; \
- success; \
+ likely(success); \
})

#define __try_cmpxchg(ptr, pold, new, size) \

2017-03-25 22:08:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Sat, Mar 25, 2017 at 2:13 PM, Peter Zijlstra <[email protected]> wrote:
> On Sat, Mar 25, 2017 at 11:34:32AM -0700, Linus Torvalds wrote:
>>
>> Oh, I just noticed that at least your other one didn't mark "success"
>> as being likely.
>
> 10730509 4540256 843776 16114541 f5e36d defconfig-build/vmlinux

Ok, that seems to be the exact same size as with the patch using the
"goto label" approach. So maybe the code generation is the same now.

Linus

2017-03-27 09:53:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Sat, Mar 25, 2017 at 03:08:10PM -0700, Linus Torvalds wrote:
> On Sat, Mar 25, 2017 at 2:13 PM, Peter Zijlstra <[email protected]> wrote:
> > On Sat, Mar 25, 2017 at 11:34:32AM -0700, Linus Torvalds wrote:
> >>
> >> Oh, I just noticed that at least your other one didn't mark "success"
> >> as being likely.
> >
> > 10730509 4540256 843776 16114541 f5e36d defconfig-build/vmlinux
>
> Ok, that seems to be the exact same size as with the patch using the
> "goto label" approach. So maybe the code generation is the same now.


OK, so I went and built myself a GCC-7 compiler and constructed the
below table. From this I would propose we do the "try_cmpxchg + if"
thing, also below, because, while the interface is icky, it is what C11
does for this construct.


GCC-6.3.0:

10735757 (cmpxchg)
10726413 (try_cmpxchg)
10730701 (try_cmpxchg + likely)
10730509 (try_cmpxchg + if)
10730445 (try_cmpxchg-linus)

GCC-7 (20170327):

10709514 (cmpxchg)
10704266 (try_cmpxchg)
10704458 (try_cmpxchg + likely)
10704266 (try_cmpxchg + if)
10704394 (try_cmpxchg-linus)



---
arch/x86/include/asm/cmpxchg.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index fb961db..d90296d 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -212,8 +212,9 @@ extern void __add_wrong_size(void)
default: \
__cmpxchg_wrong_size(); \
} \
- *_old = __old; \
- success; \
+ if (unlikely(!success)) \
+ *_old = __old; \
+ likely(success); \
})

#define __try_cmpxchg(ptr, pold, new, size) \
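
For comparison, the C11 construct referred to above; a user-space sketch
(inc_if_nonzero() is made up for illustration), where the expected value
(val here) is likewise only rewritten when the exchange fails:

#include <stdatomic.h>
#include <stdbool.h>

static bool inc_if_nonzero(_Atomic unsigned int *p)
{
        unsigned int val = atomic_load_explicit(p, memory_order_relaxed);

        do {
                if (!val)
                        return false;
                /* on failure val is reloaded with the observed value */
        } while (!atomic_compare_exchange_weak_explicit(p, &val, val + 1,
                                                        memory_order_relaxed,
                                                        memory_order_relaxed));
        return true;
}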

2017-03-27 12:16:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
> Hi,
>
> I've come across:
>
> commit a9ebf306f52c756c4f9e50ee9a60cd6389d71344
> Author: Peter Zijlstra
> Date: Wed Feb 1 16:39:38 2017 +0100
> locking/atomic: Introduce atomic_try_cmpxchg()
>
> The primitive has subtle difference with all other implementation that
> I know of, and can lead to very subtle bugs. Some time ago I've spent
> several days debugging a memory corruption caused by similar
> implementation.

OK, so how about this?

---
Subject: atomic: Fix try_cmpxchg semantics
From: Peter Zijlstra <[email protected]>
Date: Mon Mar 27 13:54:38 CEST 2017

Dmitry noted that the new try_cmpxchg() primitive is broken when the
old pointer doesn't point to the local stack.

He writes: "Consider a classical lock-free stack push:

node->next = atomic_read(&head);
do {
} while (!atomic_try_cmpxchg(&head, &node->next, node));

This code is broken with the current implementation; the problem is
the unconditional update of *__po.

In case of success it writes the same value back into *__po, but in
case of cmpxchg success we might have lost ownership of some memory
locations, potentially including what __po points to. The same
holds for the re-read of *__po. "

He also points out that this makes it surprisingly different from the
similar C/C++ atomic operation.

After investigating the code-gen differences caused by this patch and
a number of alternatives (Linus dislikes this interface lots), we
arrived at these results (size x86_64-defconfig/vmlinux):

GCC-6.3.0:

10735757 cmpxchg
10726413 try_cmpxchg
10730509 try_cmpxchg + patch
10730445 try_cmpxchg-linus

GCC-7 (20170327):

10709514 cmpxchg
10704266 try_cmpxchg
10704266 try_cmpxchg + patch
10704394 try_cmpxchg-linus

From this we see that the patch has the advantage of better code-gen on
GCC-7 and keeps the interface roughly consistent with the C language
variant.

Fixes: a9ebf306f52c ("locking/atomic: Introduce atomic_try_cmpxchg()")
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/cmpxchg.h | 5 +++--
include/linux/atomic.h | 16 ++++++++++------
2 files changed, 13 insertions(+), 8 deletions(-)

--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -212,8 +212,9 @@ extern void __add_wrong_size(void)
default: \
__cmpxchg_wrong_size(); \
} \
- *_old = __old; \
- success; \
+ if (unlikely(!success)) \
+ *_old = __old; \
+ likely(success); \
})

#define __try_cmpxchg(ptr, pold, new, size) \
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -428,9 +428,11 @@
#define __atomic_try_cmpxchg(type, _p, _po, _n) \
({ \
typeof(_po) __po = (_po); \
- typeof(*(_po)) __o = *__po; \
- *__po = atomic_cmpxchg##type((_p), __o, (_n)); \
- (*__po == __o); \
+ typeof(*(_po)) __r, __o = *__po; \
+ __r = atomic_cmpxchg##type((_p), __o, (_n)); \
+ if (unlikely(__r != __o)) \
+ *__po = __r; \
+ likely(__r == __o); \
})

#define atomic_try_cmpxchg(_p, _po, _n) __atomic_try_cmpxchg(, _p, _po, _n)
@@ -1022,9 +1024,11 @@ static inline int atomic_dec_if_positive
#define __atomic64_try_cmpxchg(type, _p, _po, _n) \
({ \
typeof(_po) __po = (_po); \
- typeof(*(_po)) __o = *__po; \
- *__po = atomic64_cmpxchg##type((_p), __o, (_n)); \
- (*__po == __o); \
+ typeof(*(_po)) __r, __o = *__po; \
+ __r = atomic64_cmpxchg##type((_p), __o, (_n)); \
+ if (unlikely(__r != __o)) \
+ *__po = __r; \
+ likely(__r == __o); \
})

#define atomic64_try_cmpxchg(_p, _po, _n) __atomic64_try_cmpxchg(, _p, _po, _n)
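
With the conditional store the comparand is only written when the cmpxchg
loses, so the compact do/while form stays safe even when it does not point
at a stack local; a simplified sketch of the usual caller shape
(example_inc_not_zero() is illustrative, modelled on the refcount code
earlier in the thread):

static bool example_inc_not_zero(refcount_t *r)
{
        unsigned int new, val = atomic_read(&r->refs);

        do {
                if (!val)
                        return false;

                new = val + 1;
        } while (!atomic_try_cmpxchg_relaxed(&r->refs, &val, new));

        /* val was rewritten only on failed attempts; once the cmpxchg
         * succeeds nothing more is stored through &val */
        return true;
}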

2017-03-27 13:45:45

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: locking/atomic: Introduce atomic_try_cmpxchg()

On Mon, Mar 27, 2017 at 2:16 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 01:44:00PM +0100, Dmitry Vyukov wrote:
>> Hi,
>>
>> I've come across:
>>
>> commit a9ebf306f52c756c4f9e50ee9a60cd6389d71344
>> Author: Peter Zijlstra
>> Date: Wed Feb 1 16:39:38 2017 +0100
>> locking/atomic: Introduce atomic_try_cmpxchg()
>>
>> The primitive has subtle difference with all other implementation that
>> I know of, and can lead to very subtle bugs. Some time ago I've spent
>> several days debugging a memory corruption caused by similar
>> implementation.
>
> OK, so how about this?
>
> ---
> Subject: atomic: Fix try_cmpxchg semantics
> From: Peter Zijlstra <[email protected]>
> Date: Mon Mar 27 13:54:38 CEST 2017
>
> Dmitry noted that the new try_cmpxchg() primitive is broken when the
> old pointer doesn't point to the local stack.
>
> He writes: "Consider a classical lock-free stack push:
>
> node->next = atomic_read(&head);
> do {
> } while (!atomic_try_cmpxchg(&head, &node->next, node));
>
> This code is broken with the current implementation; the problem is
> the unconditional update of *__po.
>
> In case of success it writes the same value back into *__po, but in
> case of cmpxchg success we might have lost ownership of some memory
> locations, potentially including what __po points to. The same
> holds for the re-read of *__po. "
>
> He also points out that this makes it surprisingly different from the
> similar C/C++ atomic operation.
>
> After investigating the code-gen differences caused by this patch and
> a number of alternatives (Linus dislikes this interface lots), we
> arrived at these results (size x86_64-defconfig/vmlinux):
>
> GCC-6.3.0:
>
> 10735757 cmpxchg
> 10726413 try_cmpxchg
> 10730509 try_cmpxchg + patch
> 10730445 try_cmpxchg-linus
>
> GCC-7 (20170327):
>
> 10709514 cmpxchg
> 10704266 try_cmpxchg
> 10704266 try_cmpxchg + patch
> 10704394 try_cmpxchg-linus
>
> From this we see that the patch has the advantage of better code-gen on
> GCC-7 and keeps the interface roughly consistent with the C language
> variant.
>
> Fixes: a9ebf306f52c ("locking/atomic: Introduce atomic_try_cmpxchg()")
> Reported-by: Dmitry Vyukov <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/include/asm/cmpxchg.h | 5 +++--
> include/linux/atomic.h | 16 ++++++++++------
> 2 files changed, 13 insertions(+), 8 deletions(-)
>
> --- a/arch/x86/include/asm/cmpxchg.h
> +++ b/arch/x86/include/asm/cmpxchg.h
> @@ -212,8 +212,9 @@ extern void __add_wrong_size(void)
> default: \
> __cmpxchg_wrong_size(); \
> } \
> - *_old = __old; \
> - success; \
> + if (unlikely(!success)) \
> + *_old = __old; \
> + likely(success); \
> })
>
> #define __try_cmpxchg(ptr, pold, new, size) \
> --- a/include/linux/atomic.h
> +++ b/include/linux/atomic.h
> @@ -428,9 +428,11 @@
> #define __atomic_try_cmpxchg(type, _p, _po, _n) \
> ({ \
> typeof(_po) __po = (_po); \
> - typeof(*(_po)) __o = *__po; \
> - *__po = atomic_cmpxchg##type((_p), __o, (_n)); \
> - (*__po == __o); \
> + typeof(*(_po)) __r, __o = *__po; \
> + __r = atomic_cmpxchg##type((_p), __o, (_n)); \
> + if (unlikely(__r != __o)) \
> + *__po = __r; \
> + likely(__r == __o); \
> })
>
> #define atomic_try_cmpxchg(_p, _po, _n) __atomic_try_cmpxchg(, _p, _po, _n)
> @@ -1022,9 +1024,11 @@ static inline int atomic_dec_if_positive
> #define __atomic64_try_cmpxchg(type, _p, _po, _n) \
> ({ \
> typeof(_po) __po = (_po); \
> - typeof(*(_po)) __o = *__po; \
> - *__po = atomic64_cmpxchg##type((_p), __o, (_n)); \
> - (*__po == __o); \
> + typeof(*(_po)) __r, __o = *__po; \
> + __r = atomic64_cmpxchg##type((_p), __o, (_n)); \
> + if (unlikely(__r != __o)) \
> + *__po = __r; \
> + likely(__r == __o); \
> })
>
> #define atomic64_try_cmpxchg(_p, _po, _n) __atomic64_try_cmpxchg(, _p, _po, _n)


Acked-by: Dmitry Vyukov <[email protected]>