2013-09-10 13:22:42

by Peter Zijlstra

Subject: [PATCH 0/7] preempt_count rework -v2

These patches optimize preempt_enable() by, firstly, folding the preempt and
need_resched tests into one -- this should work for all architectures -- and,
secondly, by providing per-arch preempt_count implementations, with x86 using a
per-cpu preempt_count for fastest access.
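
For illustration, the fold can be sketched as reserving one bit of the count
for the *inverted* need_resched state, so that a single test for zero covers
both "no preempt-disable nesting left" and "reschedule wanted" (a minimal
sketch only; the names and the exact bit below are illustrative, not the
patch itself):

#define PREEMPT_NEED_RESCHED	0x80000000U	/* illustrative bit choice */

static unsigned int count = PREEMPT_NEED_RESCHED; /* stands in for the per-cpu preempt_count */

static inline void want_resched(void)
{
	count &= ~PREEMPT_NEED_RESCHED;		/* resched wanted: clear the bit */
}

static inline bool dec_and_test(void)
{
	return --count == 0;			/* zero only if nesting is 0 *and* resched wanted */
}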


These patches have been boot tested on CONFIG_PREEMPT=y x86_64 and survive
building a x86_64-defconfig kernel.

kernel/sched/core.c:kick_process() now looks like:

ffffffff8106f3f0 <kick_process>:
ffffffff8106f3f0: 55 push %rbp
ffffffff8106f3f1: 65 ff 04 25 e0 b7 00 incl %gs:0xb7e0
ffffffff8106f3f8: 00
ffffffff8106f3f9: 48 89 e5 mov %rsp,%rbp
ffffffff8106f3fc: 48 8b 47 08 mov 0x8(%rdi),%rax
ffffffff8106f400: 8b 50 18 mov 0x18(%rax),%edx
ffffffff8106f403: 65 8b 04 25 1c b0 00 mov %gs:0xb01c,%eax
ffffffff8106f40a: 00
ffffffff8106f40b: 39 c2 cmp %eax,%edx
ffffffff8106f40d: 74 1b je ffffffff8106f42a <kick_process+0x3a>
ffffffff8106f40f: 89 d1 mov %edx,%ecx
ffffffff8106f411: 48 c7 c0 00 2c 01 00 mov $0x12c00,%rax
ffffffff8106f418: 48 8b 0c cd a0 bc cb mov -0x7e344360(,%rcx,8),%rcx
ffffffff8106f41f: 81
ffffffff8106f420: 48 3b bc 08 00 08 00 cmp 0x800(%rax,%rcx,1),%rdi
ffffffff8106f427: 00
ffffffff8106f428: 74 1e je ffffffff8106f448 <kick_process+0x58>
* ffffffff8106f42a: 65 ff 0c 25 e0 b7 00 decl %gs:0xb7e0
ffffffff8106f431: 00
* ffffffff8106f432: 0f 94 c0 sete %al
* ffffffff8106f435: 84 c0 test %al,%al
* ffffffff8106f437: 75 02 jne ffffffff8106f43b <kick_process+0x4b>
ffffffff8106f439: 5d pop %rbp
ffffffff8106f43a: c3 retq
* ffffffff8106f43b: e8 b0 b6 f9 ff callq ffffffff8100aaf0 <___preempt_schedule>
ffffffff8106f440: 5d pop %rbp
ffffffff8106f441: c3 retq
ffffffff8106f442: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
ffffffff8106f448: 89 d7 mov %edx,%edi
ffffffff8106f44a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
ffffffff8106f450: ff 15 ea e0 ba 00 callq *0xbae0ea(%rip) # ffffffff81c1d540 <smp_ops+0x20>
ffffffff8106f456: eb d2 jmp ffffffff8106f42a <kick_process+0x3a>
ffffffff8106f458: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
ffffffff8106f45f: 00

Where the '*'-marked lines are preempt_enable(); sadly GCC isn't able to get
rid of the sete+test :/ It's a rather frequent pattern in the kernel, so
'fixing' the x86 GCC backend to recognise this might be useful.


---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/Kbuild | 1 +
arch/avr32/include/asm/Kbuild | 1 +
arch/blackfin/include/asm/Kbuild | 1 +
arch/c6x/include/asm/Kbuild | 1 +
arch/cris/include/asm/Kbuild | 1 +
arch/frv/include/asm/Kbuild | 1 +
arch/h8300/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/ia64/include/asm/Kbuild | 1 +
arch/m32r/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/metag/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/mips/mm/init.c | 5 +-
arch/mn10300/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/score/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/tile/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/unicore32/include/asm/Kbuild | 1 +
arch/x86/include/asm/calling.h | 50 +++++++++++++++++
arch/x86/include/asm/thread_info.h | 5 +-
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/asm-offsets.c | 1 -
arch/x86/kernel/cpu/common.c | 5 ++
arch/x86/kernel/entry_32.S | 7 +--
arch/x86/kernel/entry_64.S | 4 +-
arch/x86/kernel/process_32.c | 10 ++++
arch/x86/kernel/process_64.c | 10 ++++
arch/x86/kernel/traps.c | 4 +-
arch/xtensa/include/asm/Kbuild | 1 +
include/linux/hardirq.h | 8 +--
include/linux/preempt.h | 112 +++++++++++++++++--------------------
include/linux/sched.h | 5 --
include/linux/thread_info.h | 1 +
include/linux/uaccess.h | 8 +--
include/trace/events/sched.h | 2 +-
init/main.c | 2 +-
kernel/context_tracking.c | 4 +-
kernel/cpu/idle.c | 7 +++
kernel/sched/core.c | 54 +++++++++---------
kernel/softirq.c | 16 +++---
kernel/timer.c | 8 +--
lib/smp_processor_id.c | 3 +-
53 files changed, 226 insertions(+), 136 deletions(-)


2013-09-10 13:51:59

by Ingo Molnar

Subject: Re: [PATCH 0/7] preempt_count rework -v2


* Peter Zijlstra <[email protected]> wrote:

> These patches optimize preempt_enable by firstly folding the preempt and
> need_resched tests into one -- this should work for all architectures. And
> secondly by providing per-arch preempt_count implementations; with x86 using
> per-cpu preempt_count for fastest access.
>
>
> These patches have been boot tested on CONFIG_PREEMPT=y x86_64 and survive
> building a x86_64-defconfig kernel.
>
> kernel/sched/core.c:kick_process() now looks like:
>
> ffffffff8106f3f0 <kick_process>:
> ffffffff8106f3f0: 55 push %rbp
> ffffffff8106f3f1: 65 ff 04 25 e0 b7 00 incl %gs:0xb7e0
> ffffffff8106f3f8: 00
> ffffffff8106f3f9: 48 89 e5 mov %rsp,%rbp
> ffffffff8106f3fc: 48 8b 47 08 mov 0x8(%rdi),%rax
> ffffffff8106f400: 8b 50 18 mov 0x18(%rax),%edx
> ffffffff8106f403: 65 8b 04 25 1c b0 00 mov %gs:0xb01c,%eax
> ffffffff8106f40a: 00
> ffffffff8106f40b: 39 c2 cmp %eax,%edx
> ffffffff8106f40d: 74 1b je ffffffff8106f42a <kick_process+0x3a>
> ffffffff8106f40f: 89 d1 mov %edx,%ecx
> ffffffff8106f411: 48 c7 c0 00 2c 01 00 mov $0x12c00,%rax
> ffffffff8106f418: 48 8b 0c cd a0 bc cb mov -0x7e344360(,%rcx,8),%rcx
> ffffffff8106f41f: 81
> ffffffff8106f420: 48 3b bc 08 00 08 00 cmp 0x800(%rax,%rcx,1),%rdi
> ffffffff8106f427: 00
> ffffffff8106f428: 74 1e je ffffffff8106f448 <kick_process+0x58>
> * ffffffff8106f42a: 65 ff 0c 25 e0 b7 00 decl %gs:0xb7e0
> ffffffff8106f431: 00
> * ffffffff8106f432: 0f 94 c0 sete %al
> * ffffffff8106f435: 84 c0 test %al,%al
> * ffffffff8106f437: 75 02 jne ffffffff8106f43b <kick_process+0x4b>
> ffffffff8106f439: 5d pop %rbp
> ffffffff8106f43a: c3 retq
> * ffffffff8106f43b: e8 b0 b6 f9 ff callq ffffffff8100aaf0 <___preempt_schedule>

Mind also posting the 'before' assembly, to make it clear how much we've
improved things?

> ffffffff8106f440: 5d pop %rbp
> ffffffff8106f441: c3 retq
> ffffffff8106f442: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
> ffffffff8106f448: 89 d7 mov %edx,%edi
> ffffffff8106f44a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
> ffffffff8106f450: ff 15 ea e0 ba 00 callq *0xbae0ea(%rip) # ffffffff81c1d540 <smp_ops+0x20>
> ffffffff8106f456: eb d2 jmp ffffffff8106f42a <kick_process+0x3a>
> ffffffff8106f458: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
> ffffffff8106f45f: 00
>
> Where the '*' marked lines are preempt_enable(), sadly GCC isn't able to
> get rid of the sete+test :/ Its a rather frequent pattern in the kernel,
> so 'fixing' the x86 GCC backend to recognise this might be useful.

So what we do in kick_process() is:

	preempt_disable();
	cpu = task_cpu(p);
	if ((cpu != smp_processor_id()) && task_curr(p))
		smp_send_reschedule(cpu);
	preempt_enable();

The preempt_disable() looks sweet:

> ffffffff8106f3f1: 65 ff 04 25 e0 b7 00 incl %gs:0xb7e0
> ffffffff8106f3f8: 00

and the '*' you marked is the preempt_enable() portion, which, with your
new code, looks like this:

#define preempt_check_resched() \
do { \
	if (unlikely(!*preempt_count_ptr())) \
		preempt_schedule(); \
} while (0)

Which GCC translates to:

> * ffffffff8106f42a: 65 ff 0c 25 e0 b7 00 decl %gs:0xb7e0
> ffffffff8106f431: 00
> * ffffffff8106f432: 0f 94 c0 sete %al
> * ffffffff8106f435: 84 c0 test %al,%al
> * ffffffff8106f437: 75 02 jne ffffffff8106f43b <kick_process+0x4b>

So the problem is that GCC cannot pass a 'CPU flags' state out of asm(),
only an explicit (pseudo-)value, right?

Ideally we'd like to have something like:

> * ffffffff8106f42a: 65 ff 0c 25 e0 b7 00 decl %gs:0xb7e0
> ffffffff8106f431: 00
> * ffffffff8106f437: 75 02 jne ffffffff8106f43b <kick_process+0x4b>

right?

Thanks,

Ingo

2013-09-10 13:56:42

by Ingo Molnar

Subject: Re: [PATCH 0/7] preempt_count rework -v2


* Ingo Molnar <[email protected]> wrote:

> So what we do in kick_process() is:
>
> preempt_disable();
> cpu = task_cpu(p);
> if ((cpu != smp_processor_id()) && task_curr(p))
> smp_send_reschedule(cpu);
> preempt_enable();
>
> The preempt_disable() looks sweet:
>
> > ffffffff8106f3f1: 65 ff 04 25 e0 b7 00 incl %gs:0xb7e0
> > ffffffff8106f3f8: 00
>
> and the '*' you marked is the preempt_enable() portion, which, with your
> new code, looks like this:
>
> #define preempt_check_resched() \
> do { \
> if (unlikely(!*preempt_count_ptr())) \
> preempt_schedule(); \
> } while (0)
>
> Which GCC translates to:
>
> > * ffffffff8106f42a: 65 ff 0c 25 e0 b7 00 decl %gs:0xb7e0
> > ffffffff8106f431: 00
> > * ffffffff8106f432: 0f 94 c0 sete %al
> > * ffffffff8106f435: 84 c0 test %al,%al
> > * ffffffff8106f437: 75 02 jne ffffffff8106f43b <kick_process+0x4b>

Correction, so this comes from the new x86-specific optimization:

+static __always_inline bool __preempt_count_dec_and_test(void)
+{
+	unsigned char c;
+
+	asm ("decl " __percpu_arg(0) "; sete %1"
+	     : "+m" (__preempt_count), "=qm" (c));
+
+	return c != 0;
+}

And that's where the sete and test originates from.

Couldn't it be improved by merging the preempt_schedule() call into a new
primitive, keeping the call in the regular flow, or using section tricks
to move it out of line? The scheduling case is a slowpath in most cases.

Thanks,

Ingo

2013-09-10 15:14:32

by Peter Zijlstra

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 03:56:36PM +0200, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:

> > > * ffffffff8106f42a: 65 ff 0c 25 e0 b7 00 decl %gs:0xb7e0
> > > ffffffff8106f431: 00
> > > * ffffffff8106f432: 0f 94 c0 sete %al
> > > * ffffffff8106f435: 84 c0 test %al,%al
> > > * ffffffff8106f437: 75 02 jne ffffffff8106f43b <kick_process+0x4b>
>
> Correction, so this comes from the new x86-specific optimization:
>
> +static __always_inline bool __preempt_count_dec_and_test(void)
> +{
> + unsigned char c;
> +
> + asm ("decl " __percpu_arg(0) "; sete %1"
> + : "+m" (__preempt_count), "=qm" (c));
> +
> + return c != 0;
> +}
>
> And that's where the sete and test originates from.

Correct, used in:

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)

> Couldn't it be improved by merging the preempt_schedule() call into a new
> primitive, keeping the call in the regular flow, or using section tricks
> to move it out of line? The scheduling case is a slowpath in most cases.

Not if we want to keep using the GCC unlikely thing afaik. That said,
all this inline asm stuff isn't my strong point, so maybe someone
else has a good idea.

But I really think fixing GCC would be good, as we have the same pattern
with all *_and_test() functions.

2013-09-10 15:29:22

by Arjan van de Ven

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On 9/10/2013 6:56 AM, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> So what we do in kick_process() is:
>>
>> preempt_disable();
>> cpu = task_cpu(p);
>> if ((cpu != smp_processor_id()) && task_curr(p))
>> smp_send_reschedule(cpu);
>> preempt_enable();
>>
>> The preempt_disable() looks sweet:
>>
>>> ffffffff8106f3f1: 65 ff 04 25 e0 b7 00 incl %gs:0xb7e0
>>> ffffffff8106f3f8: 00
>>
>> and the '*' you marked is the preempt_enable() portion, which, with your
>> new code, looks like this:
>>
>> #define preempt_check_resched() \
>> do { \
>> if (unlikely(!*preempt_count_ptr())) \
>> preempt_schedule(); \
>> } while (0)
>>
>> Which GCC translates to:
>>
>>> * ffffffff8106f42a: 65 ff 0c 25 e0 b7 00 decl %gs:0xb7e0
>>> ffffffff8106f431: 00
>>> * ffffffff8106f432: 0f 94 c0 sete %al
>>> * ffffffff8106f435: 84 c0 test %al,%al
>>> * ffffffff8106f437: 75 02 jne ffffffff8106f43b <kick_process+0x4b>
>
> Correction, so this comes from the new x86-specific optimization:
>
> +static __always_inline bool __preempt_count_dec_and_test(void)
> +{
> + unsigned char c;
> +
> + asm ("decl " __percpu_arg(0) "; sete %1"
> + : "+m" (__preempt_count), "=qm" (c));
> +
> + return c != 0;
> +}
>
> And that's where the sete and test originates from.
>
> Couldn't it be improved by merging the preempt_schedule() call into a new
> primitive, keeping the call in the regular flow, or using section tricks
> to move it out of line? The scheduling case is a slowpath in most cases.
>
also.. yuck on using "dec"
"dec" sucks, please use "sub foo ,1" instead
(dec sucks because of its broken flags behavior; it creates basically a bubble in the pipeline)

2013-09-10 15:35:16

by Peter Zijlstra

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 08:29:16AM -0700, Arjan van de Ven wrote:

> also.. yuck on using "dec" "dec" sucks, please use "sub foo ,1"
> instead (dec sucks because of its broken flags behavior; it creates
> basically a bubble in the pipeline)

Then someone needs to go change:

atomic{,64}_dec*()
local{,64}_dec*()
percpu_add_op()

All of which use decl/incl, percpu_add_op() actually goes out of its way
to generate them.

Also, subl ,1 is 4 bytes larger.

2013-09-10 16:24:25

by Linus Torvalds

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 8:29 AM, Arjan van de Ven <[email protected]> wrote:
>>
> also.. yuck on using "dec"
> "dec" sucks, please use "sub foo ,1" instead

That's a bigger instruction, largely due to the constant.

> (dec sucks because of its broken flags behavior; it creates basically a
> bubble in the pipeline)

Intel could (and should) just fix that. It's "easy" enough - you just
need to rename the carry flag as a separate register (and make sure
that the conditional instructions that don't use it don't think they
need it).

In fact I thought Intel already did that on their large cores. Are you
sure they still have trouble with inc/dec?

Linus

2013-09-10 16:34:56

by Linus Torvalds

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 6:56 AM, Ingo Molnar <[email protected]> wrote:
>
> +static __always_inline bool __preempt_count_dec_and_test(void)
> +{
> + unsigned char c;
> +
> + asm ("decl " __percpu_arg(0) "; sete %1"
> + : "+m" (__preempt_count), "=qm" (c));
> +
> + return c != 0;
> +}
>
> And that's where the sete and test originates from.

We could make this use "asm goto" instead.

An "asm goto" cannot have outputs, but this particular one doesn't
_need_ outputs. You could mark the preempt_count memory as an input,
and then have a memory clobber. I think you need the memory clobber
anyway for that preempt-count thing.

So I _think_ something like

static __always_inline bool __preempt_count_dec_and_test(void)
{
	asm goto("decl " __percpu_arg(0) "\n\t"
		 "je %l[became_zero]"
		 : : "m" (__preempt_count) : "memory" : became_zero);
	return 0;
became_zero:
	return 1;
}

would work.

You need to wrap it in

#ifdef CC_HAVE_ASM_GOTO

and then have the old "sete" version for older compilers, but for
newer ones you'd get pretty much perfect code. UNTESTED.
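
Put together, the wrapped form would look roughly like this (a sketch of the
suggestion above, not the final patch):

#ifdef CC_HAVE_ASM_GOTO
static __always_inline bool __preempt_count_dec_and_test(void)
{
	asm goto("decl " __percpu_arg(0) "\n\t"
		 "je %l[became_zero]"
		 : : "m" (__preempt_count) : "memory" : became_zero);
	return false;
became_zero:
	return true;
}
#else
/* older compilers: fall back to the sete variant */
static __always_inline bool __preempt_count_dec_and_test(void)
{
	unsigned char c;

	asm ("decl " __percpu_arg(0) "; sete %1"
	     : "+m" (__preempt_count), "=qm" (c));
	return c != 0;
}
#endif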

Linus

2013-09-10 16:45:36

by Peter Zijlstra

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 09:34:52AM -0700, Linus Torvalds wrote:
> On Tue, Sep 10, 2013 at 6:56 AM, Ingo Molnar <[email protected]> wrote:
> >
> > +static __always_inline bool __preempt_count_dec_and_test(void)
> > +{
> > + unsigned char c;
> > +
> > + asm ("decl " __percpu_arg(0) "; sete %1"
> > + : "+m" (__preempt_count), "=qm" (c));
> > +
> > + return c != 0;
> > +}
> >
> > And that's where the sete and test originates from.
>
> We could make this use "asm goto" instead.
>
> An "asm goto" cannot have outputs, but this particular one doesn't
> _need_ outputs. You could mark the preempt_count memory as an input,
> and then have a memory clobber. I think you need the memory clobber
> anyway for that preempt-count thing.
>
> So I _think_ something like
>
> static __always_inline bool __preempt_count_dec_and_test(void)
> {
> asm goto("decl " __percpu_arg(0) "\n\t"
> "je %l[became_zero]"
> : :"m" (__preempt_count):"memory":became_zero);
> return 0;
> became_zero:
> return 1;
> }

The usage site:

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)

Already includes the barrier explicitly, so do we still need the memory
clobber in that asm goto thing?

That said, your change results in:

ffffffff8106f420 <kick_process>:
ffffffff8106f420: 55 push %rbp
ffffffff8106f421: 65 ff 04 25 e0 b7 00 incl %gs:0xb7e0
ffffffff8106f428: 00
ffffffff8106f429: 48 89 e5 mov %rsp,%rbp
ffffffff8106f42c: 48 8b 47 08 mov 0x8(%rdi),%rax
ffffffff8106f430: 8b 50 18 mov 0x18(%rax),%edx
ffffffff8106f433: 65 8b 04 25 1c b0 00 mov %gs:0xb01c,%eax
ffffffff8106f43a: 00
ffffffff8106f43b: 39 c2 cmp %eax,%edx
ffffffff8106f43d: 74 1b je ffffffff8106f45a <kick_process+0x3a>
ffffffff8106f43f: 89 d1 mov %edx,%ecx
ffffffff8106f441: 48 c7 c0 00 2c 01 00 mov $0x12c00,%rax
ffffffff8106f448: 48 8b 0c cd a0 bc cb mov -0x7e344360(,%rcx,8),%rcx
ffffffff8106f44f: 81
ffffffff8106f450: 48 3b bc 08 00 08 00 cmp 0x800(%rax,%rcx,1),%rdi
ffffffff8106f457: 00
ffffffff8106f458: 74 26 je ffffffff8106f480 <kick_process+0x60>
* ffffffff8106f45a: 65 ff 0c 25 e0 b7 00 decl %gs:0xb7e0
ffffffff8106f461: 00
* ffffffff8106f462: 74 0c je ffffffff8106f470 <kick_process+0x50>
ffffffff8106f464: 5d pop %rbp
ffffffff8106f465: c3 retq
ffffffff8106f466: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
ffffffff8106f46d: 00 00 00
* ffffffff8106f470: e8 9b b6 f9 ff callq ffffffff8100ab10 <___preempt_schedule>
ffffffff8106f475: 5d pop %rbp
ffffffff8106f476: c3 retq
ffffffff8106f477: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
ffffffff8106f47e: 00 00
ffffffff8106f480: 89 d7 mov %edx,%edi
ffffffff8106f482: ff 15 b8 e0 ba 00 callq *0xbae0b8(%rip) # ffffffff81c1d540 <smp_ops+0x20>
ffffffff8106f488: eb d0 jmp ffffffff8106f45a <kick_process+0x3a>
ffffffff8106f48a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)


Which is indeed perfect. So should I go 'fix' the other _and_test()
functions we have to do this same thing?

2013-09-10 17:06:47

by Linus Torvalds

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 9:45 AM, Peter Zijlstra <[email protected]> wrote:
>
> The usage site:
>
> #define preempt_enable() \
> do { \
> barrier(); \
> if (unlikely(preempt_count_dec_and_test())) \
> __preempt_schedule(); \
> } while (0)
>
> Already includes the barrier explicitly, so do we still need the memory
> clobber in that asm goto thing?

Yeah, you do to be safe, just to let gcc know that it may be changing
the memory location.

The issue is that because an "asm goto" cannot have outputs, I had to
change the (correct) "+m" input/output into just a "m" (input).

So without the memory clobber, gcc might decide that the thing doesn't
actually change the preempt count, and perhaps move a load of that
across the "asm goto".

Admittedly, that does sound almost impossibly unlikely, but I'd be
happier being careful.

> That said, your change results in:
>
> * ffffffff8106f45a: 65 ff 0c 25 e0 b7 00 decl %gs:0xb7e0
> ffffffff8106f461: 00
> * ffffffff8106f462: 74 0c je ffffffff8106f470 <kick_process+0x50>
...
> Which is indeed perfect. So should I go 'fix' the other _and_test()
> functions we have to do this same thing?

It would be a good thing to test. There might be downsides with "asm
goto" (maybe it limits gcc some way), but it does in general save us
not only two instructions but also a register.

Linus

2013-09-10 21:25:26

by Peter Zijlstra

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 10:06:44AM -0700, Linus Torvalds wrote:
> > Which is indeed perfect. So should I go 'fix' the other _and_test()
> > functions we have to do this same thing?
>
> It would be a good thing to test. There might be downsides with "asm
> goto" (maybe it limits gcc some way), but it does in general save us
> not only two instructions but also a register.

Here's one that builds and boots on kvm until wanting to mount root.

I'm not entirely sure on the "ir" vs "er" thing, and atomic64_t and
local_t are inconsistent wrt that, so I am too.

I'll try running on actual hardware tomorrow sometime.

---
arch/x86/include/asm/atomic.h | 52 +++++++++++++++++++++++++++++++++++
arch/x86/include/asm/atomic64_64.h | 54 ++++++++++++++++++++++++++++++++++++-
arch/x86/include/asm/local.h | 52 +++++++++++++++++++++++++++++++++++
3 files changed, 157 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -74,6 +74,18 @@ static inline void atomic_sub(int i, ato
* true if the result is zero, or false for all
* other cases.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int atomic_sub_and_test(int i, atomic_t *v)
+{
+ asm volatile goto(LOCK_PREFIX "subl %1,%0;"
+ "je %l[became_zero]"
+ : : "m" (v->counter), "ir" (i)
+ : "memory" : became_zero);
+ return 0;
+became_zero:
+ return 1;
+}
+#else
static inline int atomic_sub_and_test(int i, atomic_t *v)
{
unsigned char c;
@@ -83,6 +95,7 @@ static inline int atomic_sub_and_test(in
: "ir" (i) : "memory");
return c;
}
+#endif

/**
* atomic_inc - increment atomic variable
@@ -116,6 +129,18 @@ static inline void atomic_dec(atomic_t *
* returns true if the result is 0, or false for all other
* cases.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int atomic_dec_and_test(atomic_t *v)
+{
+ asm volatile goto(LOCK_PREFIX "decl %0;"
+ "je %l[became_zero]"
+ : : "m" (v->counter)
+ : "memory" : became_zero);
+ return 0;
+became_zero:
+ return 1;
+}
+#else
static inline int atomic_dec_and_test(atomic_t *v)
{
unsigned char c;
@@ -125,6 +150,7 @@ static inline int atomic_dec_and_test(at
: : "memory");
return c != 0;
}
+#endif

/**
* atomic_inc_and_test - increment and test
@@ -134,6 +160,18 @@ static inline int atomic_dec_and_test(at
* and returns true if the result is zero, or false for all
* other cases.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int atomic_inc_and_test(atomic_t *v)
+{
+ asm volatile goto(LOCK_PREFIX "incl %0;"
+ "je %l[became_zero]"
+ : : "m" (v->counter)
+ : "memory" : became_zero);
+ return 0;
+became_zero:
+ return 1;
+}
+#else
static inline int atomic_inc_and_test(atomic_t *v)
{
unsigned char c;
@@ -143,6 +181,7 @@ static inline int atomic_inc_and_test(at
: : "memory");
return c != 0;
}
+#endif

/**
* atomic_add_negative - add and test if negative
@@ -153,6 +192,18 @@ static inline int atomic_inc_and_test(at
* if the result is negative, or false when
* result is greater than or equal to zero.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int atomic_add_negative(int i, atomic_t *v)
+{
+ asm volatile goto(LOCK_PREFIX "addl %1,%0;"
+ "js %l[became_neg]"
+ : : "m" (v->counter), "ir" (i)
+ : "memory" : became_neg);
+ return 0;
+became_neg:
+ return 1;
+}
+#else
static inline int atomic_add_negative(int i, atomic_t *v)
{
unsigned char c;
@@ -162,6 +213,7 @@ static inline int atomic_add_negative(in
: "ir" (i) : "memory");
return c;
}
+#endif

/**
* atomic_add_return - add integer and return
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -70,6 +70,18 @@ static inline void atomic64_sub(long i,
* true if the result is zero, or false for all
* other cases.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int atomic64_sub_and_test(long i, atomic64_t *v)
+{
+ asm volatile goto(LOCK_PREFIX "subq %1,%0;"
+ "je %l[became_zero]"
+ : : "m" (v->counter), "er" (i)
+ : "memory" : became_zero);
+ return 0;
+became_zero:
+ return 1;
+}
+#else
static inline int atomic64_sub_and_test(long i, atomic64_t *v)
{
unsigned char c;
@@ -79,6 +91,7 @@ static inline int atomic64_sub_and_test(
: "er" (i), "m" (v->counter) : "memory");
return c;
}
+#endif

/**
* atomic64_inc - increment atomic64 variable
@@ -114,6 +127,18 @@ static inline void atomic64_dec(atomic64
* returns true if the result is 0, or false for all other
* cases.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int atomic64_dec_and_test(atomic64_t *v)
+{
+ asm volatile goto(LOCK_PREFIX "decq %0;"
+ "je %l[became_zero]"
+ : : "m" (v->counter)
+ : "memory" : became_zero);
+ return 0;
+became_zero:
+ return 1;
+}
+#else
static inline int atomic64_dec_and_test(atomic64_t *v)
{
unsigned char c;
@@ -123,6 +148,7 @@ static inline int atomic64_dec_and_test(
: "m" (v->counter) : "memory");
return c != 0;
}
+#endif

/**
* atomic64_inc_and_test - increment and test
@@ -132,6 +158,18 @@ static inline int atomic64_dec_and_test(
* and returns true if the result is zero, or false for all
* other cases.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int atomic64_inc_and_test(atomic64_t *v)
+{
+ asm volatile goto(LOCK_PREFIX "incq %0;"
+ "je %l[became_zero]"
+ : : "m" (v->counter)
+ : "memory" : became_zero);
+ return 0;
+became_zero:
+ return 1;
+}
+#else
static inline int atomic64_inc_and_test(atomic64_t *v)
{
unsigned char c;
@@ -141,6 +179,7 @@ static inline int atomic64_inc_and_test(
: "m" (v->counter) : "memory");
return c != 0;
}
+#endif

/**
* atomic64_add_negative - add and test if negative
@@ -151,6 +190,18 @@ static inline int atomic64_inc_and_test(
* if the result is negative, or false when
* result is greater than or equal to zero.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int atomic64_add_negative(long i, atomic64_t *v)
+{
+ asm volatile goto(LOCK_PREFIX "addq %1,%0;"
+ "js %l[became_neg]"
+ : : "m" (v->counter), "er" (i)
+ : "memory" : became_neg);
+ return 0;
+became_neg:
+ return 1;
+}
+#else
static inline int atomic64_add_negative(long i, atomic64_t *v)
{
unsigned char c;
@@ -160,6 +211,7 @@ static inline int atomic64_add_negative(
: "er" (i), "m" (v->counter) : "memory");
return c;
}
+#endif

/**
* atomic64_add_return - add and return
@@ -219,7 +271,7 @@ static inline int atomic64_add_unless(at

/*
* atomic64_dec_if_positive - decrement by 1 if old value positive
- * @v: pointer of type atomic_t
+ * @v: pointer of type atomic64_t
*
* The function returns the old value of *v minus 1, even if
* the atomic variable, v, was not decremented.
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -50,6 +50,18 @@ static inline void local_sub(long i, loc
* true if the result is zero, or false for all
* other cases.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int local_sub_and_test(long i, local_t *l)
+{
+ asm volatile goto(_ASM_SUB " %1,%0;"
+ "je %l[became_zero]"
+ : : "m" (l->a.counter), "ir" (i)
+ : "memory" : became_zero);
+ return 0;
+became_zero:
+ return 1;
+}
+#else
static inline int local_sub_and_test(long i, local_t *l)
{
unsigned char c;
@@ -59,6 +71,7 @@ static inline int local_sub_and_test(lon
: "ir" (i) : "memory");
return c;
}
+#endif

/**
* local_dec_and_test - decrement and test
@@ -68,6 +81,18 @@ static inline int local_sub_and_test(lon
* returns true if the result is 0, or false for all other
* cases.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int local_dec_and_test(local_t *l)
+{
+ asm volatile goto(_ASM_DEC " %0;"
+ "je %l[became_zero]"
+ : : "m" (l->a.counter)
+ : "memory" : became_zero);
+ return 0;
+became_zero:
+ return 1;
+}
+#else
static inline int local_dec_and_test(local_t *l)
{
unsigned char c;
@@ -77,6 +102,7 @@ static inline int local_dec_and_test(loc
: : "memory");
return c != 0;
}
+#endif

/**
* local_inc_and_test - increment and test
@@ -86,6 +112,18 @@ static inline int local_dec_and_test(loc
* and returns true if the result is zero, or false for all
* other cases.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int local_inc_and_test(local_t *l)
+{
+ asm volatile goto(_ASM_INC " %0;"
+ "je %l[became_zero]"
+ : : "m" (l->a.counter)
+ : "memory" : became_zero);
+ return 0;
+became_zero:
+ return 1;
+}
+#else
static inline int local_inc_and_test(local_t *l)
{
unsigned char c;
@@ -95,6 +133,7 @@ static inline int local_inc_and_test(loc
: : "memory");
return c != 0;
}
+#endif

/**
* local_add_negative - add and test if negative
@@ -105,6 +144,18 @@ static inline int local_inc_and_test(loc
* if the result is negative, or false when
* result is greater than or equal to zero.
*/
+#ifdef CC_HAVE_ASM_GOTO
+static inline int local_add_negative(long i, local_t *l)
+{
+ asm volatile goto(_ASM_ADD " %1,%0;"
+ "js %l[became_neg]"
+ : : "m" (l->a.counter), "ir" (i)
+ : "memory" : became_neg);
+ return 0;
+became_neg:
+ return 1;
+}
+#else
static inline int local_add_negative(long i, local_t *l)
{
unsigned char c;
@@ -114,6 +165,7 @@ static inline int local_add_negative(lon
: "ir" (i) : "memory");
return c;
}
+#endif

/**
* local_add_return - add and return

2013-09-10 21:43:10

by Linus Torvalds

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 2:25 PM, Peter Zijlstra <[email protected]> wrote:
>
> Here's one that builds and boots on kvm until wanting to mount root.
>
> I'm not entirely sure on the "ir" vs "er" thing and atomic64_t and
> local_t are inconsistent wrt that so I'm too.

"i" is "any constant", while "e" is "32-bit signed constant".

And I think all of the 64-bit ones should probably be "e", because
afaik there is no way to add a 64-bit constant directly to memory (you
have to load it into a register first).

Of course, in reality, the constant is always just 1 or -1 or
something like that, so nobody will ever notice the incorrect case...

And it doesn't matter for the 32-bit cases, obviously, but we could
just make them all be "e" for simplicity.
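
A hypothetical illustration of the difference (not from this thread): "e" only
accepts what addq can actually encode as an immediate, while "i" accepts any
constant and leaves the failure to the assembler:

static inline void add_imm_ok(long *p)
{
	/* fits a signed 32-bit immediate, so "e" accepts it */
	asm volatile("addq %1, %0" : "+m" (*p) : "e" (0x12345678L));
}

#if 0	/* "i" would accept this constant, but addq has no 64-bit immediate
	   form, so it can only work if the value goes through a register */
static inline void add_imm_too_big(long *p)
{
	asm volatile("addq %1, %0" : "+m" (*p) : "i" (1L << 40));
}
#endif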

That said, looking at your patch, I get the *very* strong feeling that
we could make a macro that does all the repetitions for us, and then
have a

GENERATE_RMW(atomic_sub_and_test, LOCK_PREFIX "subl", "e", "")
GENERATE_RMW(atomic_dec_and_test, LOCK_PREFIX "decl", "e", "")
..
GENERATE_RMW(atomic_add_negative, LOCK_PREFIX "addl", "s", "")

GENERATE_RMW(local_sub_and_test, "subl", "e", __percpu_prefix)
...

etc.

I'm sure the macro would be nasty as hell (and I bet it needs a few
more arguments), but then we'd avoid the repetition..

Linus

2013-09-10 21:53:03

by H. Peter Anvin

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On 09/10/2013 02:43 PM, Linus Torvalds wrote:
> On Tue, Sep 10, 2013 at 2:25 PM, Peter Zijlstra <[email protected]> wrote:
>>
>> Here's one that builds and boots on kvm until wanting to mount root.
>>
>> I'm not entirely sure on the "ir" vs "er" thing and atomic64_t and
>> local_t are inconsistent wrt that so I'm too.
>
> "i" is "any constant", while "e" is "32-bit signed constant".
>
> And I think all of the 64-bit ones should probably be "e", because
> afaik there is no way to add a 64-bit constant directly to memory (you
> have to load it into a register first).
>
> Of course, in reality, the constant is always just 1 or -1 or
> something like that, so nobody will ever notice the incorrect case...
>
> And it doesn't matter for the 32-bit cases, obviously, but we could
> just make them all be "e" for simplicity.
>
> That said, looking at your patch, I get the *very* strong feeling that
> we could make a macro that does all the repetitions for us, and then
> have a
>
> GENERATE_RMW(atomic_sub_and_test, LOCK_PREFIX "subl", "e", "")
> GENERATE_RMW(atomic_dec_and_test, LOCK_PREFIX "decl", "e", "")
> ..
> GENERATE_RMW(atomic_add_negative, LOCK_PREFIX "addl", "s", "")
>
> GENERATE_RMW(local_sub_and_test, "subl", "e", __percpu_prefix)
> ...
>
> etc.
>
> I'm sure the macro would be nasty as hell (and I bet it needs a few
> more arguments), but then we'd avoid the repetition..
>

Actually, the right thing here really is "er" (which I think you meant,
but just to make it clear.) Why? Even if the value is representable as
a signed immediate, if gcc already happens to have it in a register it
will be better to use the register.

"e" doesn't work on versions of gcc older than the first x86-64 release,
but we don't care about that anymore.

A final good question is if we should encapsulate the add/inc and
sub/dec into a single function; one could easily do something like:

static inline int atomic_sub_and_test(int i, atomic_t *v)
{
	if (__builtin_constant_p(i) && i == 1)
		/* Use decl */
	else if (__builtin_constant_p(i) && i == -1)
		/* Use incl */
	else
		/* Use subl */
}
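
Filled in with the existing sete-style asm, that would be roughly (a sketch,
not posted code):

static inline int atomic_sub_and_test(int i, atomic_t *v)
{
	unsigned char c;

	if (__builtin_constant_p(i) && i == 1)
		asm volatile(LOCK_PREFIX "decl %0; sete %1"
			     : "+m" (v->counter), "=qm" (c) : : "memory");
	else if (__builtin_constant_p(i) && i == -1)
		asm volatile(LOCK_PREFIX "incl %0; sete %1"
			     : "+m" (v->counter), "=qm" (c) : : "memory");
	else
		asm volatile(LOCK_PREFIX "subl %2,%0; sete %1"
			     : "+m" (v->counter), "=qm" (c)
			     : "er" (i) : "memory");
	return c;
}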

-hpa

2013-09-10 22:03:02

by Linus Torvalds

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 2:51 PM, H. Peter Anvin <[email protected]> wrote:
> On 09/10/2013 02:43 PM, Linus Torvalds wrote:
>
> Actually, the right thing here really is "er" (which I think you meant,
> but just to make it clear.)

Yes, I was just answering the i-vs-e confusion.

> "e" doesn't work on versions of gcc older than the first x86-64 release,
> but we don't care about that anymore.

Indeed.

> A final good question is if we should encapsulate the add/inc and
> sub/dec into a single function; one could easily do somethin glike:

Yes. However, I would do that at a higher level than the one that
builds the actual functions.

That said, there's a few cases where you might want to specify
add-vs-sub explicitly, but they are rather odd, namely the fact that
"-128" fits in a byte, but "128" does not.

So it can be better to add 128 by doing a "subl $-128" than by doing
an "add $128".

But we probably don't have any situation where we care about that
special value of "128". I've seen the trick, though.
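
For reference, the encoding difference (illustrative, AT&T syntax; the
immediate sizes come from the standard imm8 vs imm32 forms):

static inline void add_128(int *p)
{
	asm volatile("addl $128, %0"  : "+m" (*p));	/* 81 /0 id: 4-byte immediate */
}

static inline void add_128_short(int *p)
{
	asm volatile("subl $-128, %0" : "+m" (*p));	/* 83 /5 ib: 1-byte immediate */
}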

Linus

2013-09-10 22:07:32

by H. Peter Anvin

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On 09/10/2013 03:02 PM, Linus Torvalds wrote:
> On Tue, Sep 10, 2013 at 2:51 PM, H. Peter Anvin <[email protected]> wrote:
>> On 09/10/2013 02:43 PM, Linus Torvalds wrote:
>>
>> Actually, the right thing here really is "er" (which I think you meant,
>> but just to make it clear.)
>
> Yes, I was just answering the i-vs-e confusion.
>
>> "e" doesn't work on versions of gcc older than the first x86-64 release,
>> but we don't care about that anymore.
>
> Indeed.
>
>> A final good question is if we should encapsulate the add/inc and
>> sub/dec into a single function; one could easily do somethin glike:
>
> Yes. However, I would do that at a higher level than the one that
> builds the actual functions.
>
> That said, there's a few cases where you might want to specify
> add-vs-sub explicitly, but they are rather odd, namely the fact that
> "-128" fits in a byte, but "128" does not.
>
> So it can be better to add 128 by doing a "subl $-128" than by doing
> an "add $128".
>
> But we probably don't have any situation where we care about that
> special value of "128". I've seen the trick, though.
>

Yes, and if __builtin_constant_p() we could even do it explicitly.
Unfortunately I don't think gcc allows alternatives in asm() statements,
unlike in its own pattern tables.

-hpa

2013-09-11 13:13:40

by Peter Zijlstra

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Tue, Sep 10, 2013 at 02:43:06PM -0700, Linus Torvalds wrote:
> That said, looking at your patch, I get the *very* strong feeling that
> we could make a macro that does all the repetitions for us, and then
> have a
>
> GENERATE_RMW(atomic_sub_and_test, LOCK_PREFIX "subl", "e", "")

The below seems to compile..

---
arch/x86/include/asm/atomic.h | 29 +--------
arch/x86/include/asm/atomic64_64.h | 28 +--------
arch/x86/include/asm/local.h | 28 +--------
arch/x86/include/asm/addcc.h | 115 +++++++++++++++++++++++++++++++++++++
4 files changed, 128 insertions(+), 72 deletions(-)

Index: linux-2.6/arch/x86/include/asm/atomic.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/atomic.h
+++ linux-2.6/arch/x86/include/asm/atomic.h
@@ -6,6 +6,7 @@
#include <asm/processor.h>
#include <asm/alternative.h>
#include <asm/cmpxchg.h>
+#include <asm/addcc.h>

/*
* Atomic operations that C can't guarantee us. Useful for
@@ -76,12 +77,7 @@ static inline void atomic_sub(int i, ato
*/
static inline int atomic_sub_and_test(int i, atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "subl %2,%0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_ADDcc(v->counter, -i, LOCK_PREFIX, "e");
}

/**
@@ -118,12 +114,7 @@ static inline void atomic_dec(atomic_t *
*/
static inline int atomic_dec_and_test(atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "decl %0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_ADDcc(v->counter, -1, LOCK_PREFIX, "e");
}

/**
@@ -136,12 +127,7 @@ static inline int atomic_dec_and_test(at
*/
static inline int atomic_inc_and_test(atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "incl %0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_ADDcc(v->counter, 1, LOCK_PREFIX, "e");
}

/**
@@ -155,12 +141,7 @@ static inline int atomic_inc_and_test(at
*/
static inline int atomic_add_negative(int i, atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "addl %2,%0; sets %1"
- : "+m" (v->counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_ADDcc(v->counter, i, LOCK_PREFIX, "s");
}

/**
Index: linux-2.6/arch/x86/include/asm/atomic64_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/atomic64_64.h
+++ linux-2.6/arch/x86/include/asm/atomic64_64.h
@@ -72,12 +72,7 @@ static inline void atomic64_sub(long i,
*/
static inline int atomic64_sub_and_test(long i, atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "subq %2,%0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "er" (i), "m" (v->counter) : "memory");
- return c;
+ GENERATE_ADDcc(v->counter, -i, LOCK_PREFIX, "e");
}

/**
@@ -116,12 +111,7 @@ static inline void atomic64_dec(atomic64
*/
static inline int atomic64_dec_and_test(atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "decq %0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "m" (v->counter) : "memory");
- return c != 0;
+ GENERATE_ADDcc(v->counter, -1, LOCK_PREFIX, "e");
}

/**
@@ -134,12 +124,7 @@ static inline int atomic64_dec_and_test(
*/
static inline int atomic64_inc_and_test(atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "incq %0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "m" (v->counter) : "memory");
- return c != 0;
+ GENERATE_ADDcc(v->counter, 1, LOCK_PREFIX, "e");
}

/**
@@ -153,12 +138,7 @@ static inline int atomic64_inc_and_test(
*/
static inline int atomic64_add_negative(long i, atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "addq %2,%0; sets %1"
- : "=m" (v->counter), "=qm" (c)
- : "er" (i), "m" (v->counter) : "memory");
- return c;
+ GENERATE_ADDcc(v->counter, i, LOCK_PREFIX, "s");
}

/**
Index: linux-2.6/arch/x86/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/local.h
+++ linux-2.6/arch/x86/include/asm/local.h
@@ -52,12 +52,7 @@ static inline void local_sub(long i, loc
*/
static inline int local_sub_and_test(long i, local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_SUB "%2,%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_ADDcc(l->a.counter, -i, "", "e");
}

/**
@@ -70,12 +65,7 @@ static inline int local_sub_and_test(lon
*/
static inline int local_dec_and_test(local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_DEC "%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_ADDcc(l->a.counter, -1, "", "e");
}

/**
@@ -88,12 +78,7 @@ static inline int local_dec_and_test(loc
*/
static inline int local_inc_and_test(local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_INC "%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_ADDcc(l->a.counter, 1, "", "e");
}

/**
@@ -107,12 +92,7 @@ static inline int local_inc_and_test(loc
*/
static inline int local_add_negative(long i, local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_ADD "%2,%0; sets %1"
- : "+m" (l->a.counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_ADDcc(l->a.counter, i, "", "s");
}

/**
Index: linux-2.6/arch/x86/include/asm/addcc.h
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/include/asm/addcc.h
@@ -0,0 +1,115 @@
+#ifndef _ASM_X86_ADDcc
+#define _ASM_X86_ADDcc
+
+extern void __bad_addcc_size(void);
+
+#ifdef CC_HAVE_ASM_GOTO
+
+#define GENERATE_ADDcc(var, val, lock, cc) \
+do { \
+ const int add_ID__ = (__builtin_constant_p(val) && \
+ ((val) == 1 || (val) == -1)) ? (val) : 0; \
+ \
+ switch (sizeof(var)) { \
+ case 4: \
+ if (add_ID__ == 1) { \
+ asm volatile goto(lock "incl %0;" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ } else if (add_ID__ == -1) { \
+ asm volatile goto(lock "decl %0;" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ } else { \
+ asm volatile goto(lock "addl %1, %0;" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var), "er" (val) \
+ : "memory" : cc_label); \
+ } \
+ break; \
+ \
+ case 8: \
+ if (add_ID__ == 1) { \
+ asm volatile goto(lock "incq %0;" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ } else if (add_ID__ == -1) { \
+ asm volatile goto(lock "decq %0;" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ } else { \
+ asm volatile goto(lock "addq %1, %0;" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var), "er" (val) \
+ : "memory" : cc_label); \
+ } \
+ break; \
+ \
+ default: __bad_addcc_size(); \
+ } \
+ \
+ return 0; \
+cc_label: \
+ return 1; \
+} while (0)
+
+#else /* !CC_HAVE_ASM_GOTO */
+
+#define GENERATE_ADDcc(var, val, lock, cc) \
+do { \
+ const int add_ID__ = (__builtin_constant_p(val) && \
+ ((val) == 1 || (val) == -1)) ? (val) : 0; \
+ char c; \
+ \
+ switch (sizeof(var)) { \
+ case 4: \
+ if (add_ID__ == 1) { \
+ asm volatile (lock "incl %0;" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ } else if (add_ID__ == -1) { \
+ asm volatile (lock "decl %0;" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ } else { \
+ asm volatile (lock "addl %2, %0;" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : "er" (val) : "memory"); \
+ } \
+ break; \
+ \
+ case 8: \
+ if (add_ID__ == 1) { \
+ asm volatile (lock "incq %0;" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ } else if (add_ID__ == -1) { \
+ asm volatile (lock "decq %0;" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ } else { \
+ asm volatile (lock "addq %2, %0;" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : "er" (val) : "memory"); \
+ } \
+ break; \
+ \
+ default: __bad_addcc_size(); \
+ } \
+ \
+ return c != 0; \
+} while (0)
+
+#endif /* CC_HAVE_ASM_GOTO */
+
+#endif /* _ASM_X86_ADDcc */

2013-09-11 13:26:29

by Peter Zijlstra

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Wed, Sep 11, 2013 at 03:13:23PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 02:43:06PM -0700, Linus Torvalds wrote:
> > That said, looking at your patch, I get the *very* strong feeling that
> > we could make a macro that does all the repetitions for us, and then
> > have a
> >
> > GENERATE_RMW(atomic_sub_and_test, LOCK_PREFIX "subl", "e", "")
>
> The below seems to compile..
>
> +#define GENERATE_ADDcc(var, val, lock, cc) \

And here's one that builds and is usable for per-cpu things too:

static __always_inline bool __preempt_count_dec_and_test(void)
{
	GENERATE_ADDcc(__preempt_count, -1, "", __percpu_arg(0), "e");
}

---
arch/x86/include/asm/addcc.h | 115 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/atomic.h | 29 +--------
arch/x86/include/asm/atomic64_64.h | 28 +--------
arch/x86/include/asm/local.h | 28 +--------
4 files changed, 128 insertions(+), 72 deletions(-)

--- /dev/null
+++ b/arch/x86/include/asm/addcc.h
@@ -0,0 +1,115 @@
+#ifndef _ASM_X86_ADDcc
+#define _ASM_X86_ADDcc
+
+extern void __bad_addcc_size(void);
+
+#ifdef CC_HAVE_ASM_GOTO
+
+#define GENERATE_ADDcc(var, val, lock, arg0, cc) \
+do { \
+ const int add_ID__ = (__builtin_constant_p(val) && \
+ ((val) == 1 || (val) == -1)) ? (val) : 0; \
+ \
+ switch (sizeof(var)) { \
+ case 4: \
+ if (add_ID__ == 1) { \
+ asm volatile goto(lock "incl " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ } else if (add_ID__ == -1) { \
+ asm volatile goto(lock "decl " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ } else { \
+ asm volatile goto(lock "addl %1, " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var), "er" (val) \
+ : "memory" : cc_label); \
+ } \
+ break; \
+ \
+ case 8: \
+ if (add_ID__ == 1) { \
+ asm volatile goto(lock "incq " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ } else if (add_ID__ == -1) { \
+ asm volatile goto(lock "decq " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ } else { \
+ asm volatile goto(lock "addq %1, " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var), "er" (val) \
+ : "memory" : cc_label); \
+ } \
+ break; \
+ \
+ default: __bad_addcc_size(); \
+ } \
+ \
+ return 0; \
+cc_label: \
+ return 1; \
+} while (0)
+
+#else /* !CC_HAVE_ASM_GOTO */
+
+#define GENERATE_ADDcc(var, val, lock, arg0, cc) \
+do { \
+ const int add_ID__ = (__builtin_constant_p(val) && \
+ ((val) == 1 || (val) == -1)) ? (val) : 0; \
+ char c; \
+ \
+ switch (sizeof(var)) { \
+ case 4: \
+ if (add_ID__ == 1) { \
+ asm volatile (lock "incl " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ } else if (add_ID__ == -1) { \
+ asm volatile (lock "decl " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ } else { \
+ asm volatile (lock "addl %2, " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : "er" (val) : "memory"); \
+ } \
+ break; \
+ \
+ case 8: \
+ if (add_ID__ == 1) { \
+ asm volatile (lock "incq " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ } else if (add_ID__ == -1) { \
+ asm volatile (lock "decq " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ } else { \
+ asm volatile (lock "addq %2, " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : "er" (val) : "memory"); \
+ } \
+ break; \
+ \
+ default: __bad_addcc_size(); \
+ } \
+ \
+ return c != 0; \
+} while (0)
+
+#endif /* CC_HAVE_ASM_GOTO */
+
+#endif /* _ASM_X86_ADDcc */
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -6,6 +6,7 @@
#include <asm/processor.h>
#include <asm/alternative.h>
#include <asm/cmpxchg.h>
+#include <asm/addcc.h>

/*
* Atomic operations that C can't guarantee us. Useful for
@@ -76,12 +77,7 @@ static inline void atomic_sub(int i, ato
*/
static inline int atomic_sub_and_test(int i, atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "subl %2,%0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_ADDcc(v->counter, -i, LOCK_PREFIX, "%0", "e");
}

/**
@@ -118,12 +114,7 @@ static inline void atomic_dec(atomic_t *
*/
static inline int atomic_dec_and_test(atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "decl %0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_ADDcc(v->counter, -1, LOCK_PREFIX, "%0", "e");
}

/**
@@ -136,12 +127,7 @@ static inline int atomic_dec_and_test(at
*/
static inline int atomic_inc_and_test(atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "incl %0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_ADDcc(v->counter, 1, LOCK_PREFIX, "%0", "e");
}

/**
@@ -155,12 +141,7 @@ static inline int atomic_inc_and_test(at
*/
static inline int atomic_add_negative(int i, atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "addl %2,%0; sets %1"
- : "+m" (v->counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_ADDcc(v->counter, i, LOCK_PREFIX, "%0", "s");
}

/**
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -72,12 +72,7 @@ static inline void atomic64_sub(long i,
*/
static inline int atomic64_sub_and_test(long i, atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "subq %2,%0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "er" (i), "m" (v->counter) : "memory");
- return c;
+ GENERATE_ADDcc(v->counter, -i, LOCK_PREFIX, "%0", "e");
}

/**
@@ -116,12 +111,7 @@ static inline void atomic64_dec(atomic64
*/
static inline int atomic64_dec_and_test(atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "decq %0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "m" (v->counter) : "memory");
- return c != 0;
+ GENERATE_ADDcc(v->counter, -1, LOCK_PREFIX, "%0", "e");
}

/**
@@ -134,12 +124,7 @@ static inline int atomic64_dec_and_test(
*/
static inline int atomic64_inc_and_test(atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "incq %0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "m" (v->counter) : "memory");
- return c != 0;
+ GENERATE_ADDcc(v->counter, 1, LOCK_PREFIX, "%0", "e");
}

/**
@@ -153,12 +138,7 @@ static inline int atomic64_inc_and_test(
*/
static inline int atomic64_add_negative(long i, atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "addq %2,%0; sets %1"
- : "=m" (v->counter), "=qm" (c)
- : "er" (i), "m" (v->counter) : "memory");
- return c;
+ GENERATE_ADDcc(v->counter, i, LOCK_PREFIX, "%0", "s");
}

/**
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -52,12 +52,7 @@ static inline void local_sub(long i, loc
*/
static inline int local_sub_and_test(long i, local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_SUB "%2,%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_ADDcc(l->a.counter, -i, "", "%0", "e");
}

/**
@@ -70,12 +65,7 @@ static inline int local_sub_and_test(lon
*/
static inline int local_dec_and_test(local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_DEC "%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_ADDcc(l->a.counter, -1, "", "%0", "e");
}

/**
@@ -88,12 +78,7 @@ static inline int local_dec_and_test(loc
*/
static inline int local_inc_and_test(local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_INC "%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_ADDcc(l->a.counter, 1, "", "%0", "e");
}

/**
@@ -107,12 +92,7 @@ static inline int local_inc_and_test(loc
*/
static inline int local_add_negative(long i, local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_ADD "%2,%0; sets %1"
- : "+m" (l->a.counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_ADDcc(l->a.counter, i, "", "%0", "s");
}

/**

2013-09-11 15:30:24

by H. Peter Anvin

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On 09/11/2013 06:13 AM, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 02:43:06PM -0700, Linus Torvalds wrote:
>> That said, looking at your patch, I get the *very* strong feeling that
>> we could make a macro that does all the repetitions for us, and then
>> have a
>>
>> GENERATE_RMW(atomic_sub_and_test, LOCK_PREFIX "subl", "e", "")
>
> The below seems to compile..
>
> +
> +#define GENERATE_ADDcc(var, val, lock, cc) \
> +do { \
> + const int add_ID__ = (__builtin_constant_p(val) && \
> + ((val) == 1 || (val) == -1)) ? (val) : 0; \
> + \
> + switch (sizeof(var)) { \
> + case 4: \
> + if (add_ID__ == 1) { \
> + asm volatile goto(lock "incl %0;" \
> + "j" cc " %l[cc_label]" \
> + : : "m" (var) \
> + : "memory" : cc_label); \
> + } else if (add_ID__ == -1) { \
> + asm volatile goto(lock "decl %0;" \
> + "j" cc " %l[cc_label]" \
> + : : "m" (var) \
> + : "memory" : cc_label); \
> + } else { \
> + asm volatile goto(lock "addl %1, %0;" \
> + "j" cc " %l[cc_label]" \
> + : : "m" (var), "er" (val) \
> + : "memory" : cc_label); \
> + } \
> + break; \
> + \
> + case 8: \
> + if (add_ID__ == 1) { \
> + asm volatile goto(lock "incq %0;" \
> + "j" cc " %l[cc_label]" \
> + : : "m" (var) \
> + : "memory" : cc_label); \
> + } else if (add_ID__ == -1) { \
> + asm volatile goto(lock "decq %0;" \
> + "j" cc " %l[cc_label]" \
> + : : "m" (var) \
> + : "memory" : cc_label); \
> + } else { \
> + asm volatile goto(lock "addq %1, %0;" \
> + "j" cc " %l[cc_label]" \
> + : : "m" (var), "er" (val) \
> + : "memory" : cc_label); \
> + } \
> + break; \
> + \

At least in the "asm goto" case you can use:

lock "add%z0 %1,%0;"

... and skip the switch statement.

There was a bug in some old (gcc 3.x?) early x86-64 versions which would
treat %z0 as if it was %Z0, which means it would emit "ll" instead of "q",
but that doesn't apply to any gcc which has "asm goto"...
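
Following that suggestion, the asm-goto half could shrink to something like
this (a sketch, untested; the inc/dec special-casing is dropped here for
brevity):

#define GENERATE_ADDcc(var, val, lock, cc)				\
do {									\
	asm volatile goto(lock "add%z0 %1, %0;"				\
			  "j" cc " %l[cc_label]"			\
			  : : "m" (var), "er" (val)			\
			  : "memory" : cc_label);			\
	return 0;							\
cc_label:								\
	return 1;							\
} while (0)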

-hpa

2013-09-11 15:33:26

by Linus Torvalds

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Wed, Sep 11, 2013 at 6:13 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Sep 10, 2013 at 02:43:06PM -0700, Linus Torvalds wrote:
>> That said, looking at your patch, I get the *very* strong feeling that
>> we could make a macro that does all the repetitions for us, and then
>> have a
>>
>> GENERATE_RMW(atomic_sub_and_test, LOCK_PREFIX "subl", "e", "")
>
> The below seems to compile..

You have that horrible duplication in the macro itself now, though.

The only difference between the 4-byte and the 8-byte one is the "l"
vs "q". And the counter increment/decrement argument (by the caller)
defines whether it's add/dec/inc. Why not just add those as parameters
to the macro, and suddenly the macro becomes 10x smaller.

An excessively complex macro that makes the arguments trivially
simpler is not worth it.

Especially since that excessive macro complexity now means that your
macro is useless for things that the *simpler* macro could have done,
like the bit-test-and-modify cases.

So your complexity actually makes things worse.

So just pass in the operation name and size. Then somebody will use it
for test_and_set_bit() and friends too.
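
In the GENERATE_RMW() spelling suggested earlier, that could be something like
(hypothetical, not part of the posted series):

	GENERATE_RMW(test_and_set_bit, LOCK_PREFIX "btsq", "c", "")

with "c" testing the carry flag that bts sets.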

Linus

2013-09-11 16:01:50

by H. Peter Anvin

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On 09/10/2013 09:24 AM, Linus Torvalds wrote:
> On Tue, Sep 10, 2013 at 8:29 AM, Arjan van de Ven <[email protected]> wrote:
>>>
>> also.. yuck on using "dec"
>> "dec" sucks, please use "sub foo ,1" instead
>
> That's a bigger instruction, largely due to the constant.
>
>> (dec sucks because of its broken flags behavior; it creates basically a
>> bubble in the pipeline)
>
> Intel could (and should) just fix that. It's "easy" enough - you just
> need to rename the carry flag as a separate register (and make sure
> that the conditional instructions that don't use it don't think they
> need it).
>
> In fact I thought Intel already did that on their large cores. Are you
> sure they still have trouble with inc/dec?
>

Big cores are fine, but Atom (and presumably Quark) might still suffer a
performance penalty.

-hpa

2013-09-11 19:00:02

by Peter Zijlstra

Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Wed, Sep 11, 2013 at 08:33:21AM -0700, Linus Torvalds wrote:
> An excessively complex macro that makes the arguments trivially
> simpler is not worth it.

OK, stripped it down further; I couldn't quite see how to collapse the
unary and binary operator variants though :/

---
arch/x86/include/asm/rmwcc.h | 62 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/atomic.h | 29 ++---------------
arch/x86/include/asm/atomic64_64.h | 28 ++--------------
arch/x86/include/asm/local.h | 28 ++--------------
4 files changed, 75 insertions(+), 72 deletions(-)

--- /dev/null
+++ b/arch/x86/include/asm/rmwcc.h
@@ -0,0 +1,62 @@
+#ifndef _ASM_X86_RMWcc
+#define _ASM_X86_RMWcc
+
+extern void __bad_rmwcc_nargs(void);
+
+#ifdef CC_HAVE_ASM_GOTO
+
+#define GENERATE_RMWcc(var, val, op, nargs, arg0, cc) \
+do { \
+ switch (nargs) { \
+ case 0: \
+ asm volatile goto (op " " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ break; \
+ \
+ case 1: \
+ asm volatile goto (op " %1, " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var), "er" (val) \
+ : "memory" : cc_label); \
+ break; \
+ \
+ default: __bad_rmwcc_nargs(); \
+ } \
+ \
+ return 0; \
+cc_label: \
+ return 1; \
+} while (0)
+
+#else /* !CC_HAVE_ASM_GOTO */
+
+#define GENERATE_RMWcc(var, val, op, nargs, arg0, cc) \
+do { \
+ char c; \
+ \
+ switch (nargs) { \
+ case 0: \
+ asm volatile (op " " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ break; \
+ \
+ case 1: \
+ asm volatile (op " %2, " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : "er" (val) : "memory"); \
+ break; \
+ \
+ default: __bad_rmwcc_nargs(); \
+ } \
+ \
+ return c != 0; \
+} while (0)
+
+#endif /* CC_HAVE_ASM_GOTO */
+
+#endif /* _ASM_X86_RMWcc */
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -6,6 +6,7 @@
#include <asm/processor.h>
#include <asm/alternative.h>
#include <asm/cmpxchg.h>
+#include <asm/rmwcc.h>

/*
* Atomic operations that C can't guarantee us. Useful for
@@ -76,12 +77,7 @@ static inline void atomic_sub(int i, ato
*/
static inline int atomic_sub_and_test(int i, atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "subl %2,%0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_RMWcc(v->counter, i, LOCK_PREFIX "subl", 1, "%0", "e");
}

/**
@@ -118,12 +114,7 @@ static inline void atomic_dec(atomic_t *
*/
static inline int atomic_dec_and_test(atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "decl %0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_RMWcc(v->counter, 0, LOCK_PREFIX "decl", 0, "%0", "e");
}

/**
@@ -136,12 +127,7 @@ static inline int atomic_dec_and_test(at
*/
static inline int atomic_inc_and_test(atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "incl %0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_RMWcc(v->counter, 0, LOCK_PREFIX "incl", 0, "%0", "e");
}

/**
@@ -155,12 +141,7 @@ static inline int atomic_inc_and_test(at
*/
static inline int atomic_add_negative(int i, atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "addl %2,%0; sets %1"
- : "+m" (v->counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_RMWcc(v->counter, i, LOCK_PREFIX "addl", 1, "%0", "s");
}

/**
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -72,12 +72,7 @@ static inline void atomic64_sub(long i,
*/
static inline int atomic64_sub_and_test(long i, atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "subq %2,%0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "er" (i), "m" (v->counter) : "memory");
- return c;
+ GENERATE_RMWcc(v->counter, i, LOCK_PREFIX "subq", 1, "%0", "e");
}

/**
@@ -116,12 +111,7 @@ static inline void atomic64_dec(atomic64
*/
static inline int atomic64_dec_and_test(atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "decq %0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "m" (v->counter) : "memory");
- return c != 0;
+ GENERATE_RMWcc(v->counter, 0, LOCK_PREFIX "decq", 0, "%0", "e");
}

/**
@@ -134,12 +124,7 @@ static inline int atomic64_dec_and_test(
*/
static inline int atomic64_inc_and_test(atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "incq %0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "m" (v->counter) : "memory");
- return c != 0;
+ GENERATE_RMWcc(v->counter, 0, LOCK_PREFIX "incq", 0, "%0", "e");
}

/**
@@ -153,12 +138,7 @@ static inline int atomic64_inc_and_test(
*/
static inline int atomic64_add_negative(long i, atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "addq %2,%0; sets %1"
- : "=m" (v->counter), "=qm" (c)
- : "er" (i), "m" (v->counter) : "memory");
- return c;
+ GENERATE_RMWcc(v->counter, i, LOCK_PREFIX "addq", 1, "%0", "s");
}

/**
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -52,12 +52,7 @@ static inline void local_sub(long i, loc
*/
static inline int local_sub_and_test(long i, local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_SUB "%2,%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_RMWcc(l->a.counter, i, _ASM_SUB, 1, "%0", "e");
}

/**
@@ -70,12 +65,7 @@ static inline int local_sub_and_test(lon
*/
static inline int local_dec_and_test(local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_DEC "%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_RMWcc(l->a.counter, 0, _ASM_DEC, 0, "%0", "e");
}

/**
@@ -88,12 +78,7 @@ static inline int local_dec_and_test(loc
*/
static inline int local_inc_and_test(local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_INC "%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GENERATE_RMWcc(l->a.counter, 0, _ASM_INC, 0, "%0", "e");
}

/**
@@ -107,12 +92,7 @@ static inline int local_inc_and_test(loc
*/
static inline int local_add_negative(long i, local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_ADD "%2,%0; sets %1"
- : "+m" (l->a.counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GENERATE_RMWcc(l->a.counter, i, _ASM_ADD, 1, "%0", "s");
}

/**

2013-09-11 23:02:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Wed, Sep 11, 2013 at 11:59 AM, Peter Zijlstra <[email protected]> wrote:
>
> OK, stripped it down further, I couldn't quite see how to collapse the
> unary and binary operator variants though :/

Ok, this looks pretty good. I assume it works too? ;)

Linus

2013-09-12 02:20:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Wed, Sep 11, 2013 at 04:02:14PM -0700, Linus Torvalds wrote:
> On Wed, Sep 11, 2013 at 11:59 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > OK, stripped it down further, I couldn't quite see how to collapse the
> > unary and binary operator variants though :/
>
> Ok, this looks pretty good. I assume it works too? ;)

Only compile tested that one.. the below is kvm boot tested until root
mount -- I'll try on actual hardware when I've gotten some sleep.

I split the thing up into two macros GEN_UNARY_RMWcc and
GEN_BINARY_RMWcc which ends up being more readable as well as smaller
code overall.

I also attempted to convert asm/bitops.h, although I'm not sure it'll
compile right with older GCCs due to the comment near BITOP_ADDR()

It also changes the fallback from sbb %0,%0 to setc %0, which afaict
should be similar but I'm not too familiar with all that.
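
(Illustration of the sbb-versus-setc point, not taken from the patch: after a
bit-test instruction, sbb %0,%0 materializes 0 or -1 in a full register from
the carry flag, while setc writes 0 or 1 into a byte; either way the result is
non-zero exactly when CF was set, so the returned truth value is the same. The
helper names below are invented:)

static inline int test_bit_sbb(long nr, volatile unsigned long *addr)
{
	int oldbit;

	asm volatile ("bt %2,%1\n\t"
		      "sbb %0,%0"		/* oldbit = CF ? -1 : 0 */
		      : "=r" (oldbit)
		      : "m" (*addr), "Ir" (nr));
	return oldbit;
}

static inline int test_bit_setc(long nr, volatile unsigned long *addr)
{
	unsigned char c;

	asm volatile ("bt %2,%1\n\t"
		      "setc %0"			/* c = CF ? 1 : 0 */
		      : "=qm" (c)
		      : "m" (*addr), "Ir" (nr));
	return c;
}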

I might have to add the clobber to the macro arguments so we can do a
version without the "memory" clobber, although bitops is inconsistent with
that as well: __test_and_clear_bit() doesn't have a memory clobber but
__test_and_change_bit() does.

---
arch/x86/include/asm/atomic.h | 29 +++---------------
arch/x86/include/asm/atomic64_64.h | 28 ++---------------
arch/x86/include/asm/bitops.h | 58 +++++--------------------------------
arch/x86/include/asm/local.h | 28 ++---------------
arch/x86/include/asm/rmwcc.h | 52 +++++++++++++++++++++++++++++++++
5 files changed, 73 insertions(+), 122 deletions(-)

--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -6,6 +6,7 @@
#include <asm/processor.h>
#include <asm/alternative.h>
#include <asm/cmpxchg.h>
+#include <asm/rmwcc.h>

/*
* Atomic operations that C can't guarantee us. Useful for
@@ -76,12 +77,7 @@ static inline void atomic_sub(int i, ato
*/
static inline int atomic_sub_and_test(int i, atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "subl %2,%0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GEN_BINARY_RMWcc(LOCK_PREFIX "subl", v->counter, i, "%0", "e");
}

/**
@@ -118,12 +114,7 @@ static inline void atomic_dec(atomic_t *
*/
static inline int atomic_dec_and_test(atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "decl %0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GEN_UNARY_RMWcc(LOCK_PREFIX "decl", v->counter, "%0", "e");
}

/**
@@ -136,12 +127,7 @@ static inline int atomic_dec_and_test(at
*/
static inline int atomic_inc_and_test(atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "incl %0; sete %1"
- : "+m" (v->counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GEN_UNARY_RMWcc(LOCK_PREFIX "incl", v->counter, "%0", "e");
}

/**
@@ -155,12 +141,7 @@ static inline int atomic_inc_and_test(at
*/
static inline int atomic_add_negative(int i, atomic_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "addl %2,%0; sets %1"
- : "+m" (v->counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, i, "%0", "s");
}

/**
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -72,12 +72,7 @@ static inline void atomic64_sub(long i,
*/
static inline int atomic64_sub_and_test(long i, atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "subq %2,%0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "er" (i), "m" (v->counter) : "memory");
- return c;
+ GEN_BINARY_RMWcc(LOCK_PREFIX "subq", v->counter, i, "%0", "e");
}

/**
@@ -116,12 +111,7 @@ static inline void atomic64_dec(atomic64
*/
static inline int atomic64_dec_and_test(atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "decq %0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "m" (v->counter) : "memory");
- return c != 0;
+ GEN_UNARY_RMWcc(LOCK_PREFIX "decq", v->counter, "%0", "e");
}

/**
@@ -134,12 +124,7 @@ static inline int atomic64_dec_and_test(
*/
static inline int atomic64_inc_and_test(atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "incq %0; sete %1"
- : "=m" (v->counter), "=qm" (c)
- : "m" (v->counter) : "memory");
- return c != 0;
+ GEN_UNARY_RMWcc(LOCK_PREFIX "incq", v->counter, "%0", "e");
}

/**
@@ -153,12 +138,7 @@ static inline int atomic64_inc_and_test(
*/
static inline int atomic64_add_negative(long i, atomic64_t *v)
{
- unsigned char c;
-
- asm volatile(LOCK_PREFIX "addq %2,%0; sets %1"
- : "=m" (v->counter), "=qm" (c)
- : "er" (i), "m" (v->counter) : "memory");
- return c;
+ GEN_BINARY_RMWcc(LOCK_PREFIX "addq", v->counter, i, "%0", "s");
}

/**
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -14,6 +14,7 @@

#include <linux/compiler.h>
#include <asm/alternative.h>
+#include <asm/rmwcc.h>

#if BITS_PER_LONG == 32
# define _BITOPS_LONG_SHIFT 5
@@ -204,12 +205,7 @@ static inline void change_bit(long nr, v
*/
static inline int test_and_set_bit(long nr, volatile unsigned long *addr)
{
- int oldbit;
-
- asm volatile(LOCK_PREFIX "bts %2,%1\n\t"
- "sbb %0,%0" : "=r" (oldbit), ADDR : "Ir" (nr) : "memory");
-
- return oldbit;
+ GEN_BINARY_RMWcc(LOCK_PREFIX "bts", *addr, nr, "%0", "c");
}

/**
@@ -236,13 +232,7 @@ test_and_set_bit_lock(long nr, volatile
*/
static inline int __test_and_set_bit(long nr, volatile unsigned long *addr)
{
- int oldbit;
-
- asm("bts %2,%1\n\t"
- "sbb %0,%0"
- : "=r" (oldbit), ADDR
- : "Ir" (nr));
- return oldbit;
+ GEN_BINARY_RMWcc("bts", *addr, nr, "%0", "c");
}

/**
@@ -255,13 +245,7 @@ static inline int __test_and_set_bit(lon
*/
static inline int test_and_clear_bit(long nr, volatile unsigned long *addr)
{
- int oldbit;
-
- asm volatile(LOCK_PREFIX "btr %2,%1\n\t"
- "sbb %0,%0"
- : "=r" (oldbit), ADDR : "Ir" (nr) : "memory");
-
- return oldbit;
+ GEN_BINARY_RMWcc(LOCK_PREFIX "btr", *addr, nr, "%0", "c");
}

/**
@@ -282,26 +266,13 @@ static inline int test_and_clear_bit(lon
*/
static inline int __test_and_clear_bit(long nr, volatile unsigned long *addr)
{
- int oldbit;
-
- asm volatile("btr %2,%1\n\t"
- "sbb %0,%0"
- : "=r" (oldbit), ADDR
- : "Ir" (nr));
- return oldbit;
+ GEN_BINARY_RMWcc("btr", *addr, nr, "%0", "c");
}

/* WARNING: non atomic and it can be reordered! */
static inline int __test_and_change_bit(long nr, volatile unsigned long *addr)
{
- int oldbit;
-
- asm volatile("btc %2,%1\n\t"
- "sbb %0,%0"
- : "=r" (oldbit), ADDR
- : "Ir" (nr) : "memory");
-
- return oldbit;
+ GEN_BINARY_RMWcc("btc", *addr, nr, "%0", "c");
}

/**
@@ -314,13 +285,7 @@ static inline int __test_and_change_bit(
*/
static inline int test_and_change_bit(long nr, volatile unsigned long *addr)
{
- int oldbit;
-
- asm volatile(LOCK_PREFIX "btc %2,%1\n\t"
- "sbb %0,%0"
- : "=r" (oldbit), ADDR : "Ir" (nr) : "memory");
-
- return oldbit;
+ GEN_BINARY_RMWcc(LOCK_PREFIX "btc", *addr, nr, "%0", "c");
}

static __always_inline int constant_test_bit(long nr, const volatile unsigned long *addr)
@@ -331,14 +296,7 @@ static __always_inline int constant_test

static inline int variable_test_bit(long nr, volatile const unsigned long *addr)
{
- int oldbit;
-
- asm volatile("bt %2,%1\n\t"
- "sbb %0,%0"
- : "=r" (oldbit)
- : "m" (*(unsigned long *)addr), "Ir" (nr));
-
- return oldbit;
+ GEN_BINARY_RMWcc("bt", *(volatile unsigned long *)addr, nr, "%0", "c");
}

#if 0 /* Fool kernel-doc since it doesn't do macros yet */
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -52,12 +52,7 @@ static inline void local_sub(long i, loc
*/
static inline int local_sub_and_test(long i, local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_SUB "%2,%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GEN_BINARY_RMWcc(_ASM_SUB, l->a.counter, i, "%0", "e");
}

/**
@@ -70,12 +65,7 @@ static inline int local_sub_and_test(lon
*/
static inline int local_dec_and_test(local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_DEC "%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GEN_UNARY_RMWcc(_ASM_DEC, l->a.counter, "%0", "e");
}

/**
@@ -88,12 +78,7 @@ static inline int local_dec_and_test(loc
*/
static inline int local_inc_and_test(local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_INC "%0; sete %1"
- : "+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
+ GEN_UNARY_RMWcc(_ASM_INC, l->a.counter, "%0", "e");
}

/**
@@ -107,12 +92,7 @@ static inline int local_inc_and_test(loc
*/
static inline int local_add_negative(long i, local_t *l)
{
- unsigned char c;
-
- asm volatile(_ASM_ADD "%2,%0; sets %1"
- : "+m" (l->a.counter), "=qm" (c)
- : "ir" (i) : "memory");
- return c;
+ GEN_BINARY_RMWcc(_ASM_ADD, l->a.counter, i, "%0", "s");
}

/**
--- /dev/null
+++ b/arch/x86/include/asm/rmwcc.h
@@ -0,0 +1,52 @@
+#ifndef _ASM_X86_RMWcc
+#define _ASM_X86_RMWcc
+
+#ifdef CC_HAVE_ASM_GOTO
+
+#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
+do { \
+ asm volatile goto (op " " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var) \
+ : "memory" : cc_label); \
+ return 0; \
+cc_label: \
+ return 1; \
+} while (0)
+
+#define GEN_BINARY_RMWcc(op, var, val, arg0, cc) \
+do { \
+ asm volatile goto (op " %1, " arg0 ";" \
+ "j" cc " %l[cc_label]" \
+ : : "m" (var), "er" (val) \
+ : "memory" : cc_label); \
+ return 0; \
+cc_label: \
+ return 1; \
+} while (0)
+
+#else /* !CC_HAVE_ASM_GOTO */
+
+#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
+do { \
+ char c; \
+ asm volatile (op " " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : : "memory"); \
+ return c != 0; \
+} while (0)
+
+#define GEN_BINARY_RMWcc(op, var, val, arg0, cc) \
+do { \
+ char c; \
+ asm volatile (op " %2, " arg0 ";" \
+ "set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : "er" (val) : "memory"); \
+ return c != 0; \
+} while (0)
+
+#endif /* CC_HAVE_ASM_GOTO */
+
+#endif /* _ASM_X86_RMWcc */

2013-09-12 02:43:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Wed, Sep 11, 2013 at 7:20 PM, Peter Zijlstra <[email protected]> wrote:
>
> I split the thing up into two macros GEN_UNARY_RMWcc and
> GEN_BINARY_RMWcc which ends up being more readable as well as smaller
> code overall.

Yes, that looks like the right abstraction level. Powerful without
being complicated.

> I also attempted to convert asm/bitops.h, although I'm not sure it'll
> compile right with older GCCs due to the comment near BITOP_ADDR()

Actually, since you now have that memory clobber, it no longer
matters, and "m" is the right thing.

That said, the very same memory clobber may be what makes this whole
approach a loss, if it causes gcc to do tons of reloads or other
random things.

That memory clobber wasn't an issue for the whole
__preempt_count_dec_and_test() thing, because there we needed to keep
anything before the decrement contained anyway.

For other cases? Who knows.. A lot of the "change and test atomically"
things have the same containment need, so it might not be a problem.

> I might have to add the clobber to the macro arguments so we can do a
> version without the "memory" clobber, although bitops is inconsistent with
> that as well: __test_and_clear_bit() doesn't have a memory clobber but
> __test_and_change_bit() does.

You do need the memory clobber. The "asm goto" version can not have
outputs, so "=m" and "+m" are both no-go, and so the only thing left
is that memory clobber, as far as I can see..

The bitops didn't use to need it, because they had that "+/=m" that
already tells gcc that the target is changed.

Linus

2013-09-12 11:52:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Wed, Sep 11, 2013 at 07:43:42PM -0700, Linus Torvalds wrote:
> On Wed, Sep 11, 2013 at 7:20 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > I split the thing up into two macros GEN_UNARY_RMWcc and
> > GEN_BINARY_RMWcc which ends up being more readable as well as smaller
> > code overall.
>
> Yes, that looks like the right abstraction level. Powerful without
> being complicated.

> That said, the very same memory clobber may be what makes this whole
> approach a loss, if it causes gcc to do tons of reloads or other
> random things.

Indeed it does random things, see below; I'm not sure why.

> For other cases? Who knows.. A lot of the "change and test atomically"
> things have the same containment need, so it might not be a problem.

Right, all atomic ops already contain a memory clobber to go along with
their memory barrier semantics.

So I did a defconfig build without the patch and with the patch and got
a significant size increase:

text data bss filename
11173443 1423736 1183744 defconfig-build/vmlinux.before
11174516 1423736 1183744 defconfig-build/vmlinux.after

I then undid all the bitop conversions that added the extra memory
clobber and got:

11174204 1423736 1183744 defconfig-build/vmlinux.after

Which is still a significant increase, so I had a look at what GCC
generates for mm/rmap.c, which uses quite a few atomic_*_and_test()
functions, and got:

text data bss dec hex filename
8675 1 16 8692 21f4 defconfig-build/mm/rmap.o
8739 1 16 8756 2234 defconfig-build/mm/rmap.o

So the increase is there too; doing an objdump -D on them, the first
difference is:

0000000000000660 <do_page_add_anon_rmap>:
660: 55 push %rbp
661: 48 89 e5 mov %rsp,%rbp
664: 48 83 ec 20 sub $0x20,%rsp
668: 48 89 5d f0 mov %rbx,-0x10(%rbp)
66c: 4c 89 65 f8 mov %r12,-0x8(%rbp)
670: 48 89 fb mov %rdi,%rbx
673: f0 ff 47 18 lock incl 0x18(%rdi)
677: 0f 94 c0 sete %al
67a: 84 c0 test %al,%al
67c: 75 12 jne 690 <do_page_add_anon_rmap+0x30>
67e: 48 8b 5d f0 mov -0x10(%rbp),%rbx
682: 4c 8b 65 f8 mov -0x8(%rbp),%r12
686: c9 leaveq

vs.:

0000000000000660 <do_page_add_anon_rmap>:
660: 55 push %rbp
661: 48 89 e5 mov %rsp,%rbp
664: 48 83 ec 20 sub $0x20,%rsp
668: 48 89 5d e0 mov %rbx,-0x20(%rbp)
66c: 4c 89 65 e8 mov %r12,-0x18(%rbp)
670: 48 89 fb mov %rdi,%rbx
673: 4c 89 6d f0 mov %r13,-0x10(%rbp)
677: 4c 89 75 f8 mov %r14,-0x8(%rbp)
67b: f0 ff 47 18 lock incl 0x18(%rdi)
67f: 74 17 je 698 <do_page_add_anon_rmap+0x38>
681: 48 8b 5d e0 mov -0x20(%rbp),%rbx
685: 4c 8b 65 e8 mov -0x18(%rbp),%r12
689: 4c 8b 6d f0 mov -0x10(%rbp),%r13
68d: 4c 8b 75 f8 mov -0x8(%rbp),%r14
691: c9 leaveq

For some obscure (to me) reason the newfangled asm goto construct
generates a bunch of extra MOVs.

OTOH the good news is that a kernel with that patch applied does indeed
boot properly on real hardware.

2013-09-12 12:25:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/7] preempt_count rework -v2


* Peter Zijlstra <[email protected]> wrote:

> So the increase is there too, doing a objdump -D on them the first
> difference is:
>
> 0000000000000660 <do_page_add_anon_rmap>:
> 660: 55 push %rbp
> 661: 48 89 e5 mov %rsp,%rbp
> 664: 48 83 ec 20 sub $0x20,%rsp
> 668: 48 89 5d f0 mov %rbx,-0x10(%rbp)
> 66c: 4c 89 65 f8 mov %r12,-0x8(%rbp)
> 670: 48 89 fb mov %rdi,%rbx
> 673: f0 ff 47 18 lock incl 0x18(%rdi)
> 677: 0f 94 c0 sete %al
> 67a: 84 c0 test %al,%al
> 67c: 75 12 jne 690 <do_page_add_anon_rmap+0x30>
> 67e: 48 8b 5d f0 mov -0x10(%rbp),%rbx
> 682: 4c 8b 65 f8 mov -0x8(%rbp),%r12
> 686: c9 leaveq
>
> vs.:
>
> 0000000000000660 <do_page_add_anon_rmap>:
> 660: 55 push %rbp
> 661: 48 89 e5 mov %rsp,%rbp
> 664: 48 83 ec 20 sub $0x20,%rsp
> 668: 48 89 5d e0 mov %rbx,-0x20(%rbp)
> 66c: 4c 89 65 e8 mov %r12,-0x18(%rbp)
> 670: 48 89 fb mov %rdi,%rbx
> 673: 4c 89 6d f0 mov %r13,-0x10(%rbp)
> 677: 4c 89 75 f8 mov %r14,-0x8(%rbp)
> 67b: f0 ff 47 18 lock incl 0x18(%rdi)
> 67f: 74 17 je 698 <do_page_add_anon_rmap+0x38>
> 681: 48 8b 5d e0 mov -0x20(%rbp),%rbx
> 685: 4c 8b 65 e8 mov -0x18(%rbp),%r12
> 689: 4c 8b 6d f0 mov -0x10(%rbp),%r13
> 68d: 4c 8b 75 f8 mov -0x8(%rbp),%r14
> 691: c9 leaveq
>
> For some obscure (to me) reason the newfangled asm goto construct
> generates a bunch of extra MOVs.

It adds two pairs of MOVs that show that R13 and R14 got clobbered, but
the change also got rid of a SETE and a TEST here:

> 673: f0 ff 47 18 lock incl 0x18(%rdi)
> 677: 0f 94 c0 sete %al
> 67a: 84 c0 test %al,%al
> 67c: 75 12 jne 690 <do_page_add_anon_rmap+0x30>

so there's a slight increase in size, but the extra instructions look
rather lightweight and it could all go away if asm goto is improved ...

It would all be very sweet if all those clobbers went away.

Thanks,

Ingo

2013-09-13 07:33:55

by Kevin Easton

[permalink] [raw]
Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Thu, Sep 12, 2013 at 04:20:40AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 11, 2013 at 04:02:14PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 11, 2013 at 11:59 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > OK, stripped it down further, I couldn't quite see how to collapse the
> > > unary and binary operator variants though :/
> >
> > Ok, this looks pretty good. I assume it works too? ;)
>
> Only compile tested that one.. the below is kvm boot tested until root
> mount -- I'll try on actual hardware when I've gotten some sleep.
>
> I split the thing up into two macros GEN_UNARY_RMWcc and
> GEN_BINARY_RMWcc which ends up being more readable as well as smaller
> code overall.

If you wanted to collapse the unary and binary variants as you mentioned
upthread, you could do something like (for the CC_HAVE_ASM_GOTO case):

#define GEN_RMWcc(fullop, cc, ...) \
do { \
asm volatile goto (fullop \
"j" cc " %l[cc_label]" \
: : __VA_ARGS__ \
: "memory" : cc_label); \
return 0; \
cc_label: \
return 1; \
} while (0)

#define GEN_UNARY_RMWcc(op, var, arg0, cc) GEN_RMWcc(op " " arg0 ";", cc, "m" (var))
#define GEN_BINARY_RMWcc(op, var, val, arg0, cc) GEN_RMWcc(op " %1, " arg0 ";", cc, "m" (var), "er" (val))
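
(For reference, the asm-goto case of Kevin's GEN_UNARY_RMWcc wrapper expands,
for a locked decl-and-test, to roughly the function body below; the
example_dec_and_test name is made up for illustration:)

static inline int example_dec_and_test(atomic_t *v)
{
	asm volatile goto (LOCK_PREFIX "decl %0;"
			   "je %l[cc_label]"
			   : : "m" (v->counter)
			   : "memory" : cc_label);
	return 0;
cc_label:
	return 1;
}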

- Kevin

2013-09-13 08:06:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/7] preempt_count rework -v2

On Fri, Sep 13, 2013 at 05:25:00PM +1000, Kevin Easton wrote:
> If you wanted to collapse the unary and binary variants as you mentioned
> upthread, you could do something like (for the CC_HAVE_ASM_GOTO case):

Indeed, another 11 lines off, awesome!

--- /dev/null
+++ b/arch/x86/include/asm/rmwcc.h
@@ -0,0 +1,41 @@
+#ifndef _ASM_X86_RMWcc
+#define _ASM_X86_RMWcc
+
+#ifdef CC_HAVE_ASM_GOTO
+
+#define __GEN_RMWcc(fullop, var, cc, ...) \
+do { \
+ asm volatile goto (fullop "; j" cc " %l[cc_label]" \
+ : : "m" (var), ## __VA_ARGS__ \
+ : "memory" : cc_label); \
+ return 0; \
+cc_label: \
+ return 1; \
+} while (0)
+
+#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
+ __GEN_RMWcc(op " " arg0, var, cc)
+
+#define GEN_BINARY_RMWcc(op, var, val, arg0, cc) \
+ __GEN_RMWcc(op " %1, " arg0, var, cc, "er" (val))
+
+#else /* !CC_HAVE_ASM_GOTO */
+
+#define __GEN_RMWcc(fullop, var, cc, ...) \
+do { \
+ char c; \
+ asm volatile (fullop "; set" cc " %1" \
+ : "+m" (var), "=qm" (c) \
+ : __VA_ARGS__ : "memory"); \
+ return c != 0; \
+} while (0)
+
+#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
+ __GEN_RMWcc(op " " arg0, var, cc)
+
+#define GEN_BINARY_RMWcc(op, var, val, arg0, cc) \
+ __GEN_RMWcc(op " %2, " arg0, var, cc, "er" (val))
+
+#endif /* CC_HAVE_ASM_GOTO */
+
+#endif /* _ASM_X86_RMWcc */
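
(A throwaway user-space sketch, not part of the series, for eyeballing with
objdump what the final macros generate outside the kernel tree. LOCK_PREFIX,
CC_HAVE_ASM_GOTO and atomic_t are normally provided by kbuild and the kernel
headers, so defining them by hand here is purely an assumption of this
sketch, as is having a gcc new enough for asm goto:)

#define CC_HAVE_ASM_GOTO 1		/* assume a gcc that has asm goto */
#define LOCK_PREFIX "lock; "		/* stand-in for the kernel's definition */

#include "rmwcc.h"			/* the header from the patch above */

typedef struct { int counter; } atomic_t;

static inline int atomic_dec_and_test(atomic_t *v)
{
	GEN_UNARY_RMWcc(LOCK_PREFIX "decl", v->counter, "%0", "e");
}

static inline int atomic_sub_and_test(int i, atomic_t *v)
{
	GEN_BINARY_RMWcc(LOCK_PREFIX "subl", v->counter, i, "%0", "e");
}

int main(void)
{
	atomic_t a = { .counter = 2 };

	atomic_sub_and_test(1, &a);		/* 2 -> 1, returns 0 */
	return atomic_dec_and_test(&a);		/* 1 -> 0, returns 1 */
}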