LinuxLists.cc - [patch 4/7] Immediate Values

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Mathieu Desnoyers wrote:

> +#define immediate_read(name) \
> + ({ \
> + __typeof__(name##__immediate) value; \
> + switch (sizeof(value)) { \
> + case 1: \
> + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> + ".long %1, (0f)+1, 1;\n\t" \
> + ".previous;\n\t" \
> + "0:\n\t" \
> + "mov %2,%0;\n\t" \

Given that you're relying on the exact instruction that this mov
generates, it might be better to explicitly put the opcodes in with
.byte. That way you're protected from the assembler deciding to
generate some other form of the instruction (for whatever reason). I
guess substituting in different registers would be a pain.

Aside from that, is there any reason not to just put $0 in there rather
than use %2?

> + ".long %1, (0f)+1, 4;\n\t" \
> + ".previous;\n\t" \
> + "1:\n\t" \
> + ".org (1b)+(3-((1b)%%4)), 0x90;\n\t" \
>
Seems a little complex, but I couldn't come up with anything much better:

.org . + 3 - (. & 3), 0x90

You can use . rather than needing to define 1:, it doesn't need quite so
many parens, and using &3 avoids the %% wart.

It's a pity that gas seems to generate plain 0x90 nops rather than
long-nop forms here. I thought it could do that.

J

2007-09-18 22:14:39

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Jeremy Fitzhardinge <[email protected]> writes:
>
> It's a pity that gas seems to generate plain 0x90 nops rather than
> long-nop forms here. I thought it could do that.

.p2align does it.

-Andi

2007-09-18 22:17:28

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Jeremy Fitzhardinge wrote:
> Mathieu Desnoyers wrote:
>
>> +#define immediate_read(name) \
>> + ({ \
>> + __typeof__(name##__immediate) value; \
>> + switch (sizeof(value)) { \
>> + case 1: \
>> + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
>> + ".long %1, (0f)+1, 1;\n\t" \
>> + ".previous;\n\t" \
>> + "0:\n\t" \
>> + "mov %2,%0;\n\t" \
>
> Given that you're relying on the exact instruction that this mov
> generates, it might be better to explicitly put the opcodes in with
> .byte. That way you're protected from the assembler deciding to
> generate some other form of the instruction (for whatever reason). I
> guess substituting in different registers would be a pain.

Allowing different registers should be doable, but if so, one would have
to put 0: at the *end* of the instruction and use (0f)-4 instead, since
the non-%eax forms are one byte longer.

This also seems "safer", since an imm32 is always the last thing in the
instruction.

-hpa

2007-09-18 22:27:59

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

H. Peter Anvin wrote:
> Allowing different registers should be doable, but if so, one would have
> to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> the non-%eax forms are one byte longer.
>

OK, that's already a problem since its using "=r" as the constraint.

> This also seems "safer", since an imm32 is always the last thing in the
> instruction.

Good idea. If gas/gcc generates entirely the wrong addressing mode,
then we've got bigger problems.

J

2007-09-18 22:29:57

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Andi Kleen wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>> It's a pity that gas seems to generate plain 0x90 nops rather than
>> long-nop forms here. I thought it could do that.
>>
>
> .p2align does it

Just .p2align? Not align, balign, org or skip? Seems... strange.

J

2007-09-18 22:37:10

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
>> Jeremy Fitzhardinge <[email protected]> writes:
>>
>>> It's a pity that gas seems to generate plain 0x90 nops rather than
>>> long-nop forms here. I thought it could do that.
>>>
>> .p2align does it
>
> Just .p2align? Not align, balign, org or skip? Seems... strange.
>

Probably it works for .align and .balign too, but not .org.

-hpa

2007-09-18 22:45:07

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

On Tue, Sep 18, 2007 at 03:29:50PM -0700, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Jeremy Fitzhardinge <[email protected]> writes:
> >
> >> It's a pity that gas seems to generate plain 0x90 nops rather than
> >> long-nop forms here. I thought it could do that.
> >>
> >
> > .p2align does it
>
> Just .p2align? Not align, balign, org or skip? Seems... strange.

The problem is that you cannot always safely jump into the middle of the
longer form nops. So I suppose they didn't risk breakage on older
code relying on this.

-Andi

2007-09-19 11:00:08

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Jeremy Fitzhardinge ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>
> > +#define immediate_read(name) \
> > + ({ \
> > + __typeof__(name##__immediate) value; \
> > + switch (sizeof(value)) { \
> > + case 1: \
> > + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> > + ".long %1, (0f)+1, 1;\n\t" \
> > + ".previous;\n\t" \
> > + "0:\n\t" \
> > + "mov %2,%0;\n\t" \
>
> Given that you're relying on the exact instruction that this mov
> generates, it might be better to explicitly put the opcodes in with
> .byte. That way you're protected from the assembler deciding to
> generate some other form of the instruction (for whatever reason). I
> guess substituting in different registers would be a pain.
>

Good point. I thought it might come up, especially for 16 bits mov that
can be expressed under different forms, one of which has a prefix. I
would like to go for Peter's suggestion: putting the label _after_ the
instruction, since we know that we will be right after the immediate
value, but it has a drawback: we cannot insure correct alignment of the
immediate value in that case. But that would help not having to force
the register.

> Aside from that, is there any reason not to just put $0 in there rather
> than use %2?
>

Actually, no, since the initial value is written to the immediate value
references at early boot and at module load time. I originally thought
passing the referenced variable to it, but, as I recall, it brought
linker issues when the symbol was defined in another module. So yes,
just $0 is ok, I'll change that.

>
> > + ".long %1, (0f)+1, 4;\n\t" \
> > + ".previous;\n\t" \
> > + "1:\n\t" \
> > + ".org (1b)+(3-((1b)%%4)), 0x90;\n\t" \
> >
> Seems a little complex, but I couldn't come up with anything much better:
>
> .org . + 3 - (. & 3), 0x90
>
> You can use . rather than needing to define 1:, it doesn't need quite so
> many parens, and using &3 avoids the %% wart.
>

Yes, this one is tricky.. trying to align efficiently something on a 4
bytes address - 1 is not what gas is used to help doing.

> It's a pity that gas seems to generate plain 0x90 nops rather than
> long-nop forms here. I thought it could do that.
>

At least we will have at most 3 nops there.

Mathieu

> J

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 11:01:09

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* H. Peter Anvin ([email protected]) wrote:
> Jeremy Fitzhardinge wrote:
> > Mathieu Desnoyers wrote:
> >
> >> +#define immediate_read(name) \
> >> + ({ \
> >> + __typeof__(name##__immediate) value; \
> >> + switch (sizeof(value)) { \
> >> + case 1: \
> >> + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> >> + ".long %1, (0f)+1, 1;\n\t" \
> >> + ".previous;\n\t" \
> >> + "0:\n\t" \
> >> + "mov %2,%0;\n\t" \
> >
> > Given that you're relying on the exact instruction that this mov
> > generates, it might be better to explicitly put the opcodes in with
> > .byte. That way you're protected from the assembler deciding to
> > generate some other form of the instruction (for whatever reason). I
> > guess substituting in different registers would be a pain.
>
> Allowing different registers should be doable, but if so, one would have
> to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> the non-%eax forms are one byte longer.
>
> This also seems "safer", since an imm32 is always the last thing in the
> instruction.
>

The idea is very interesting, but then, if I can't be sure of the size
of my instruction, how can I align the immediate value properly ?

> -hpa
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 11:08:58

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Andi Kleen ([email protected]) wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
> >
> > It's a pity that gas seems to generate plain 0x90 nops rather than
> > long-nop forms here. I thought it could do that.
>
> .p2align does it.
>

Sadly, p2align does not apply well to my context. I have to align on 4
bytes boundaries - 1 for the 4 bytes mov, so, if I would use p2align, I
would end up aligning on 4 bytes with p2align and then add 3 bytes
(worse case: adding 3 + 3 = 6 bytes of nops).

However, with the .org arithmetic, I can simply add then quantity of
nops needed to make my alignment on 4 bytes - 1, so the worse case
becomes adding 3 bytes.

The example is:

originally: address & 3 = 1
* p2align
p2align adds 3 bytes to align on 4 bytes boundaries
we add 3 bytes to align on the next 4 bytes - 1, so the immediate value
within the instruction is aligned on 4 bytes boundaries

* org
we add 2 bytes to be aligned on the next 4 bytes - 1.

And yes, it's a pity there is no way to produce the long-nops there. :(

Mathieu

> -Andi

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 11:14:22

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

> And yes, it's a pity there is no way to produce the long-nops there. :(

To be honest I consider it extremly unlikely you would be able
to measure a difference.

-Andi

2007-09-19 13:01:34

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Jeremy Fitzhardinge ([email protected]) wrote:
> H. Peter Anvin wrote:
> > Allowing different registers should be doable, but if so, one would have
> > to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> > the non-%eax forms are one byte longer.
> >
>
> OK, that's already a problem since its using "=r" as the constraint.
>
> > This also seems "safer", since an imm32 is always the last thing in the
> > instruction.
>
> Good idea. If gas/gcc generates entirely the wrong addressing mode,
> then we've got bigger problems.
>

Ok, let's have a good look at what we want:

1 - get a pointer to the beginning of the immediate value within the
instruction.
2 - make sure that the immediate value, within the instruction, is
written to atomically wrt all CPUs, even on older architectures
where non aligned writes are not atomic.

Effectively, placing a label at the end of the instruction, and then
offsetting backward from there, will give us (1).

Then, for the alignment, we have to give a good look at the instruction
set reference, mov instruction, to see what the variants of mov
immediate value to register are on i386.

First, let's look at the possible prefixes:
- Lock and repeat prefixes : no
- Segment override prefixes: no memory reference there, so doesn't
apply.
- Branch hints : no, it's a mov instruction
- Operand-size override prefix: *yes*, can be used for 2 bytes mov
- Address-size override prefix: no address in there, only immediate
value and register.

(looking at the Compat/Leg Mode)

3 cases:

* 1 byte
B0 + rb MOV r8, imm8 (1 byte opcode)
REX + B0 + rb MOV r8, imm8 (only on 64 bits archs, never generated)
C6 /0 MOV r/m8, imm8 (2 bytes opcode)
(this one doesn't require alignment at all, since we do a 1 byte write)

* 2 bytes
B8 + rw MOV r16, imm16 (1 byte opcode)
66 B8 + rd MOV r16, imm16 (2 bytes opcode) (with 66H prefix)
C7 /0 MOV r/m16, imm16 (2 bytes opcode)
(Alignment on 4 bytes boundaries would be required because of the
possible 1 byte opcode ? Or is the 66H prefix mandatory there ? If it
is, then we can safely align on 2 bytes boundaries.)

* 4 bytes
B8 + rd MOV r32, imm32 (1 byte opcode)
C7 /0 MOV r/m32, imm32 (2 bytes opcode)
(the 2 bytes opcode can be a problem)

I have missed anything ?

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 16:03:59

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Mathieu Desnoyers ([email protected]) wrote:
> * Jeremy Fitzhardinge ([email protected]) wrote:
> > H. Peter Anvin wrote:
> > > Allowing different registers should be doable, but if so, one would have
> > > to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> > > the non-%eax forms are one byte longer.
> > >
> >
> > OK, that's already a problem since its using "=r" as the constraint.
> >
> > > This also seems "safer", since an imm32 is always the last thing in the
> > > instruction.
> >
> > Good idea. If gas/gcc generates entirely the wrong addressing mode,
> > then we've got bigger problems.
> >
>
> Ok, let's have a good look at what we want:
>
> 1 - get a pointer to the beginning of the immediate value within the
> instruction.
> 2 - make sure that the immediate value, within the instruction, is
> written to atomically wrt all CPUs, even on older architectures
> where non aligned writes are not atomic.
>
> Effectively, placing a label at the end of the instruction, and then
> offsetting backward from there, will give us (1).
>
> Then, for the alignment, we have to give a good look at the instruction
> set reference, mov instruction, to see what the variants of mov
> immediate value to register are on i386.
>
> First, let's look at the possible prefixes:
> - Lock and repeat prefixes : no
> - Segment override prefixes: no memory reference there, so doesn't
> apply.
> - Branch hints : no, it's a mov instruction
> - Operand-size override prefix: *yes*, can be used for 2 bytes mov
> - Address-size override prefix: no address in there, only immediate
> value and register.
>
> (looking at the Compat/Leg Mode)
>
> 3 cases:
>
> * 1 byte
> B0 + rb MOV r8, imm8 (1 byte opcode)
> REX + B0 + rb MOV r8, imm8 (only on 64 bits archs, never generated)
> C6 /0 MOV r/m8, imm8 (2 bytes opcode)
> (this one doesn't require alignment at all, since we do a 1 byte write)
>
> * 2 bytes
> B8 + rw MOV r16, imm16 (1 byte opcode)
> 66 B8 + rd MOV r16, imm16 (2 bytes opcode) (with 66H prefix)
> C7 /0 MOV r/m16, imm16 (2 bytes opcode)
> (Alignment on 4 bytes boundaries would be required because of the
> possible 1 byte opcode ? Or is the 66H prefix mandatory there ? If it
> is, then we can safely align on 2 bytes boundaries.)
>
> * 4 bytes
> B8 + rd MOV r32, imm32 (1 byte opcode)
> C7 /0 MOV r/m32, imm32 (2 bytes opcode)
> (the 2 bytes opcode can be a problem)
>
> I have missed anything ?
>

This is not a problem in practice because we place a breakpoint and use
a bypass when we are updating immediate values that can be used
concurrently. So let's not bother about that too much. It just means
that the breakpoint approach might have to be used for more
architectures in the future to address that kind of issue.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 16:16:35

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Mathieu Desnoyers wrote:
>
> Ok, let's have a good look at what we want:
>
> 1 - get a pointer to the beginning of the immediate value within the
> instruction.
> 2 - make sure that the immediate value, within the instruction, is
> written to atomically wrt all CPUs, even on older architectures
> where non aligned writes are not atomic.
>

I think you'll find that even on modern architectures cross-cacheline
writes aren't atomic.

Anyway, disregard the bit I mentioned about one byte longer; I've been
staring myself too blind at instruction variants lately (lots of
arguments about how to generate various instructions using NASM syntax).

>
> * 4 bytes
> B8 + rd MOV r32, imm32 (1 byte opcode)
> C7 /0 MOV r/m32, imm32 (2 bytes opcode)
> (the 2 bytes opcode can be a problem)
>

If gas generates the C7 opcodes by default, then that's a bug, nothing less.

-hpa

2007-09-19 17:30:54

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

H. Peter Anvin wrote:
> Mathieu Desnoyers wrote:
>
>> Ok, let's have a good look at what we want:
>>
>> 1 - get a pointer to the beginning of the immediate value within the
>> instruction.
>> 2 - make sure that the immediate value, within the instruction, is
>> written to atomically wrt all CPUs, even on older architectures
>> where non aligned writes are not atomic.
>>
>>
>
> I think you'll find that even on modern architectures cross-cacheline
> writes aren't atomic.
>

Cross-cache-line, sure. But what about just not sizeof aligned? If its
enough to avoid cross-cache-line, then that's simpler.

Which is something I was going to comment on: Mathieu, you try to align
the constant itself, but you don't prevent the instruction overall from
crossing a cache line. Given how delicate all this stuff is, it seems
like a good idea to do that.

>> * 4 bytes
>> B8 + rd MOV r32, imm32 (1 byte opcode)
>> C7 /0 MOV r/m32, imm32 (2 bytes opcode)
>> (the 2 bytes opcode can be a problem)
>>
>>
>
> If gas generates the C7 opcodes by default, then that's a bug, nothing less.
>

Well, in this case, it might be preferred if it brings the constant into
alignment without explicit padding :)

J

2007-09-19 17:44:20

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Jeremy Fitzhardinge wrote:
>
> Cross-cache-line, sure. But what about just not sizeof aligned? If its
> enough to avoid cross-cache-line, then that's simpler.
>

Not really. It pretty much comes down to the same problem.

> Which is something I was going to comment on: Mathieu, you try to align
> the constant itself, but you don't prevent the instruction overall from
> crossing a cache line. Given how delicate all this stuff is, it seems
> like a good idea to do that.

Given how delicate all his stuff is, it would be interesting to know the
magnitude of the expected payoff, which I personally don't remember seeing.

-hpa

2007-09-19 18:22:27

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Jeremy Fitzhardinge ([email protected]) wrote:
> H. Peter Anvin wrote:
> > Mathieu Desnoyers wrote:
> >
> >> Ok, let's have a good look at what we want:
> >>
> >> 1 - get a pointer to the beginning of the immediate value within the
> >> instruction.
> >> 2 - make sure that the immediate value, within the instruction, is
> >> written to atomically wrt all CPUs, even on older architectures
> >> where non aligned writes are not atomic.
> >>
> >>
> >
> > I think you'll find that even on modern architectures cross-cacheline
> > writes aren't atomic.
> >
>
> Cross-cache-line, sure. But what about just not sizeof aligned? If its
> enough to avoid cross-cache-line, then that's simpler.
>

Being sizeof aligned on a cache-line (e.g. 32 bytes boundaries) is a
superset of being aligned on sizeof multiples (e.g. 4 bytes). Therefore,
if we declare data of a certain size not aligned on the sizeof
boundaries, we won't be aligned on cache-lines neither. (unless I am
utterly wrong..) :)

> Which is something I was going to comment on: Mathieu, you try to align
> the constant itself, but you don't prevent the instruction overall from
> crossing a cache line. Given how delicate all this stuff is, it seems
> like a good idea to do that.
>

We just can't, for movl is 5 bytes in total : 1 byte for opcode, 4
bytes for the immediate value. But since we do not modify the opcode at
all, CPUs will either see the old or new immediate value (each of those
will be coherent because of the atomic update) and, in every case, they
will use it with the same opcode that haven't been touched.

>
> >> * 4 bytes
> >> B8 + rd MOV r32, imm32 (1 byte opcode)
> >> C7 /0 MOV r/m32, imm32 (2 bytes opcode)
> >> (the 2 bytes opcode can be a problem)
> >>
> >>
> >
> > If gas generates the C7 opcodes by default, then that's a bug, nothing less.
> >
>
> Well, in this case, it might be preferred if it brings the constant into
> alignment without explicit padding :)
>

It will need explicit padding too. We would have to align the 4 bytes
immediate value on 4 bytes multiples. Therefore, this 2 bytes opcode
followed by 4 bytes immediate value would have to be aligned on
(4 bytes - 2) boundaries.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 18:36:36

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* H. Peter Anvin ([email protected]) wrote:
> Jeremy Fitzhardinge wrote:
> >
> > Cross-cache-line, sure. But what about just not sizeof aligned? If its
> > enough to avoid cross-cache-line, then that's simpler.
> >
>
> Not really. It pretty much comes down to the same problem.
>
> > Which is something I was going to comment on: Mathieu, you try to align
> > the constant itself, but you don't prevent the instruction overall from
> > crossing a cache line. Given how delicate all this stuff is, it seems
> > like a good idea to do that.
>
> Given how delicate all his stuff is, it would be interesting to know the
> magnitude of the expected payoff, which I personally don't remember seeing.
>

Please have a look at the patch header at:
http://lkml.org/lkml/2007/9/18/346

And at the documentation:
http://lkml.org/lkml/2007/9/18/375

Where I provide speedups experienced with this patch. Please note that
my primary goal is to provide non intrusive instrumentation with the
Linux Kernel Markers, and the immediate values are the underlying
mechanism that help enabling/disabling dynamically instrumentation sites
with the smallest effect possible (and still get the local variables
in the middle of a function compiled with -O2...).

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-20 10:24:46

by Denys Vlasenko

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

On Tuesday 18 September 2007 22:07, Mathieu Desnoyers wrote:
> i386 optimization of the immediate values which uses a movl with code patching
> to set/unset the value used to populate the register used as variable source.
>
> Changelog:
> - Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
> non atomic writes to a code region only touched by us (nobody can execute it
> since we are protected by the immediate_mutex).
> - Put immediate_set and _immediate_set in the architecture independent header.

> +struct __immediate {
> + long var; /* Pointer to the identifier variable of the
> + * immediate value
> + */
> + long immediate; /*
> + * Pointer to the memory location of the
> + * immediate value within the instruction.
> + */
> + long size; /* Type size. */
> +};

> + case 2: \
> + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> + ".long %1, (0f)+2, 2;\n\t" \
> + ".previous;\n\t" \
> + "1:\n\t" \
> + ".align 2;\n\t" \
> + "0:\n\t" \
> + "mov %2,%0;\n\t" \
> + : "=r" (value) \
> + : "m" (name##__immediate), \
> + "i" (0)); \

Instead of letting gcc use whatever instruction it sees fit best
for accessing the variable (like add/cmp/test...)
now we force it to use mov imm,reg first. Maybe with preceding nop
due to "align 2".

And then we use 12 more bytes in __immediate section
*for each* place where you read the variable.

Do you plan to use the same approach on x86_64?
I mean, longs there are twice as long.

Can this be made conditional, on CONFIG_CC_OPTIMIZE_FOR_SIZE perhaps?
--
vda

2007-09-21 13:31:43

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Denys Vlasenko ([email protected]) wrote:
> On Tuesday 18 September 2007 22:07, Mathieu Desnoyers wrote:
> > i386 optimization of the immediate values which uses a movl with code patching
> > to set/unset the value used to populate the register used as variable source.
> >
> > Changelog:
> > - Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
> > non atomic writes to a code region only touched by us (nobody can execute it
> > since we are protected by the immediate_mutex).
> > - Put immediate_set and _immediate_set in the architecture independent header.
>
> > +struct __immediate {
> > + long var; /* Pointer to the identifier variable of the
> > + * immediate value
> > + */
> > + long immediate; /*
> > + * Pointer to the memory location of the
> > + * immediate value within the instruction.
> > + */
> > + long size; /* Type size. */
> > +};
>
>
> > + case 2: \
> > + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> > + ".long %1, (0f)+2, 2;\n\t" \
> > + ".previous;\n\t" \
> > + "1:\n\t" \
> > + ".align 2;\n\t" \
> > + "0:\n\t" \
> > + "mov %2,%0;\n\t" \
> > + : "=r" (value) \
> > + : "m" (name##__immediate), \
> > + "i" (0)); \
>
> Instead of letting gcc use whatever instruction it sees fit best
> for accessing the variable (like add/cmp/test...)
> now we force it to use mov imm,reg first. Maybe with preceding nop
> due to "align 2".
>

Yes, this is true. So, the following branch:

char x;

void testb(void)
{
if (x > 5)
testa();
}

Would turn into:
56: b0 00 mov $0x0,%al
58: 3c 05 cmp $0x5,%al
5a: 7e 05 jle 61 <testb+0x11>

Rather than:

56: 80 3d 00 00 00 00 05 cmpb $0x5,0x0
5d: 7e 05 jle 64 <testb+0x14>

> And then we use 12 more bytes in __immediate section
> *for each* place where you read the variable.
>

Yes. You must consider the this section is only used when updating the
variable. It is never used by the read-side and therefore does not
consume data cache on hot paths.

> Do you plan to use the same approach on x86_64?
> I mean, longs there are twice as long.
>

Yup. It's a memory footprint vs active cacheline footprints tradeoff.
When GCC optimizes for size and we see kernel speedups, it is not so
because it "consumes" less memory, but rather that there is less junk
polluting the cachelines. So unless you worry about a few K of data and
are an embedded system developer, I really don't see why you worry about
this. Oh, and by the way, I provide the ability to disable immediate
values in the EMBEDDED menu.

> Can this be made conditional, on CONFIG_CC_OPTIMIZE_FOR_SIZE perhaps?

No. As I just stated, only embedded developers would have an interest in
disabling this features because they would have so few memory available
on their architecture. The memory consumed by the immediate values table
is out of the hot path cachelines and therefore does not impact overall
performance.

> --
> vda

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-20 16:47:32

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Jeremy Fitzhardinge ([email protected]) wrote:
> H. Peter Anvin wrote:
> > Allowing different registers should be doable, but if so, one would have
> > to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> > the non-%eax forms are one byte longer.
> >
>
> OK, that's already a problem since its using "=r" as the constraint.
>

Hi Jeremy,

I have tried generating asm-to-"register" c variables for char, short
and int on i386 and I do not see this happening. The char opcode is
always 1 byte, short 2 bytes and int 1 byte. Result:

gcc version 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)

8: b3 02 mov $0x2,%bl
a: b1 03 mov $0x3,%cl
c: b2 04 mov $0x4,%dl
e: b0 05 mov $0x5,%al

4f: 66 be 06 00 mov $0x6,%si
53: 66 bb 07 00 mov $0x7,%bx
57: 66 b9 08 00 mov $0x8,%cx
5b: 66 ba 09 00 mov $0x9,%dx
5f: 66 b8 0a 00 mov $0xa,%ax

9f: bb 0b 00 00 00 mov $0xb,%ebx
a4: be 0c 00 00 00 mov $0xc,%esi
a9: b9 0d 00 00 00 mov $0xd,%ecx
ae: ba 0e 00 00 00 mov $0xe,%edx
b3: b8 0f 00 00 00 mov $0xf,%eax

I notice that having a "=r" inline assembly that outputs to the first
"register char" variable seems to be problematic. It fails with the
following error:

/tmp/ccy35Hq1.s: Assembler messages:
/tmp/ccy35Hq1.s:15: Error: bad register name `%sil'

But it seems to be specific to "register" variables, which I do not use
in my immediate values.

> > This also seems "safer", since an imm32 is always the last thing in the
> > instruction.
>
> Good idea. If gas/gcc generates entirely the wrong addressing mode,
> then we've got bigger problems.
>

I am still trying to figure out if we must assume that gas will produce
different length opcodes for mov instructions. The choice is:

- Either I use a "r" constraint and let gcc produce the instructions,
that I need to assume to have correct size so I can align their
immediate values (therefore, taking the offset from the end of the
instruction will not help). Here, if gas changes its behavior
dramatically for a given immediate value size, it will break.

- Second choice is to stick to a particular register, choosing the one
with the less side-effect, and encoding the instruction ourselves. I
start to think that this second solution might be safer, even though
we wouldn't let the compiler select the register which has the less
impact by itself.

Any comments about this ?

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-20 18:36:57

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Mathieu Desnoyers wrote:
>
> I have tried generating asm-to-"register" c variables for char, short
> and int on i386 and I do not see this happening. The char opcode is
> always 1 byte, short 2 bytes and int 1 byte. Result:
>

The comment was referring to x86-64, but I incorrectly remembered that
applying to "movq $imm,%reg" as opposed to loading from an absolute
address. gas actually has a special opcode (movabs) for the 64-bit
version of the latter variant, which is only available with %rax and its
subregisters.

Nevermind, in other words. It's still true, though, that the immediate
will always be the last thing in the instruction -- that's a fixture of
the instruction format.

> gcc version 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)
>
> 8: b3 02 mov $0x2,%bl
> a: b1 03 mov $0x3,%cl
> c: b2 04 mov $0x4,%dl
> e: b0 05 mov $0x5,%al
>
> 4f: 66 be 06 00 mov $0x6,%si
> 53: 66 bb 07 00 mov $0x7,%bx
> 57: 66 b9 08 00 mov $0x8,%cx
> 5b: 66 ba 09 00 mov $0x9,%dx
> 5f: 66 b8 0a 00 mov $0xa,%ax
>
> 9f: bb 0b 00 00 00 mov $0xb,%ebx
> a4: be 0c 00 00 00 mov $0xc,%esi
> a9: b9 0d 00 00 00 mov $0xd,%ecx
> ae: ba 0e 00 00 00 mov $0xe,%edx
> b3: b8 0f 00 00 00 mov $0xf,%eax
>
>
> I notice that having a "=r" inline assembly that outputs to the first
> "register char" variable seems to be problematic. It fails with the
> following error:
>
> /tmp/ccy35Hq1.s: Assembler messages:
> /tmp/ccy35Hq1.s:15: Error: bad register name `%sil'

'r' is wrong for 8-bit variables on i386. It needs to be 'q'.

2007-10-22 09:59:56

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

> - Either I use a "r" constraint and let gcc produce the instructions,
> that I need to assume to have correct size so I can align their
> immediate values (therefore, taking the offset from the end of the
> instruction will not help). Here, if gas changes its behavior
> dramatically for a given immediate value size, it will break.

I wouldn't expect it to do that, but you could perhaps add a self
test somewhere to check for it.

>
> - Second choice is to stick to a particular register, choosing the one
> with the less side-effect, and encoding the instruction ourselves. I
> start to think that this second solution might be safer, even though
> we wouldn't let the compiler select the register which has the less
> impact by itself.

Such effects caused occassional bugs in the alternative() implementation
which requires a maximum size for the replacement.

But in this case it should be safe enough to trust gas stability.

-Andi

2007-10-22 15:45:31

[permalink] [raw]

Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* H. Peter Anvin ([email protected]) wrote:
> Mathieu Desnoyers wrote:
> >
> >I have tried generating asm-to-"register" c variables for char, short
> >and int on i386 and I do not see this happening. The char opcode is
> >always 1 byte, short 2 bytes and int 1 byte. Result:
> >
>
> The comment was referring to x86-64, but I incorrectly remembered that
> applying to "movq $imm,%reg" as opposed to loading from an absolute
> address. gas actually has a special opcode (movabs) for the 64-bit
> version of the latter variant, which is only available with %rax and its
> subregisters.
>
> Nevermind, in other words. It's still true, though, that the immediate
> will always be the last thing in the instruction -- that's a fixture of
> the instruction format.
>

Hrm, some basic testing shows me that, on x86_64:

register long var, var2, var3, var4, var5;
asm ("movq $0xFFFFFFFF,%0;\n\t" : "=r" (var) :);
asm ("movq $0x100000000,%0;\n\t" : "=r" (var2) :);
asm ("movq $0,%0;\n\t" : "=r" (var3) :);
asm ("movq $0xFFFFFFFFFFFFFFFF,%0;\n\t" : "=r" (var4) :);
asm ("movq $0xFEFEFEFE01010101,%0;\n\t" : "=r" (var5) :);

generates:

ae: 48 be ff ff ff ff 00 mov $0xffffffff,%rsi
b5: 00 00 00
b8: 48 bf 00 00 00 00 01 mov $0x100000000,%rdi
bf: 00 00 00
c2: 48 c7 c1 00 00 00 00 mov $0x0,%rcx
c9: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
d0: 48 b8 01 01 01 01 fe mov $0xfefefefe01010101,%rax
d7: fe fe fe

So we have to deal with the fact that gas produces opcodes of
different size (2 and 3 bytes) and with different immediate value sizes
for 8 bytes load immediates, depending on the immediate value to load.
There even seems to be an implicit duplication of the value when the
lower and higher 32 bits of the immediate value are the same (the
0xFFFFFFFFFFFFFFFF case).

Therefore, I propose the following. We should use the

asm ("movq $0xFEFEFEFE01010101,%0;\n\t" : "=r" (var5) :);

to make sure we always deal with the same instruction and have a 8 bytes
immediate value. We don't really care about the static value given
because we always update it dynamically before the first time it is
executed.

> >gcc version 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)
> >
> > 8: b3 02 mov $0x2,%bl
> > a: b1 03 mov $0x3,%cl
> > c: b2 04 mov $0x4,%dl
> > e: b0 05 mov $0x5,%al
> >
> > 4f: 66 be 06 00 mov $0x6,%si
> > 53: 66 bb 07 00 mov $0x7,%bx
> > 57: 66 b9 08 00 mov $0x8,%cx
> > 5b: 66 ba 09 00 mov $0x9,%dx
> > 5f: 66 b8 0a 00 mov $0xa,%ax
> >
> > 9f: bb 0b 00 00 00 mov $0xb,%ebx
> > a4: be 0c 00 00 00 mov $0xc,%esi
> > a9: b9 0d 00 00 00 mov $0xd,%ecx
> > ae: ba 0e 00 00 00 mov $0xe,%edx
> > b3: b8 0f 00 00 00 mov $0xf,%eax
> >
> >
> >I notice that having a "=r" inline assembly that outputs to the first
> >"register char" variable seems to be problematic. It fails with the
> >following error:
> >
> >/tmp/ccy35Hq1.s: Assembler messages:
> >/tmp/ccy35Hq1.s:15: Error: bad register name `%sil'
>
> 'r' is wrong for 8-bit variables on i386. It needs to be 'q'.
>

Will fix. I noticed that it was because I had too much "register" char
variables declared and used at the same time. Putting a "g" constraint
gave the same result.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-22 16:44:26