2007-09-18 21:11:46

by Mathieu Desnoyers

[permalink] [raw]
Subject: [patch 4/7] Immediate Values - i386 Optimization

i386 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.

Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
non atomic writes to a code region only touched by us (nobody can execute it
since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Andi Kleen <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Chuck Ebbert <[email protected]>
CC: Christoph Hellwig <[email protected]>
---
arch/i386/kernel/Makefile | 1
arch/i386/kernel/immediate.c | 307 +++++++++++++++++++++++++++++++++++++++++++
arch/i386/kernel/traps.c | 8 -
include/asm-i386/immediate.h | 82 +++++++++++
4 files changed, 394 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/include/asm-i386/immediate.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-i386/immediate.h 2007-09-18 13:15:02.000000000 -0400
@@ -0,0 +1,82 @@
+#ifndef _ASM_I386_IMMEDIATE_H
+#define _ASM_I386_IMMEDIATE_H
+
+/*
+ * Immediate values. i386 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <[email protected]>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+struct __immediate {
+ long var; /* Pointer to the identifier variable of the
+ * immediate value
+ */
+ long immediate; /*
+ * Pointer to the memory location of the
+ * immediate value within the instruction.
+ */
+ long size; /* Type size. */
+};
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _immediate_read() instead.
+ * Makes sure the 2 and 4 bytes update will be atomic by aligning the immediate
+ * value. 2 bytes (short) uses a 66H prefix. If size is bigger than 4 bytes,
+ * fall back on a memory read.
+ */
+#define immediate_read(name) \
+ ({ \
+ __typeof__(name##__immediate) value; \
+ switch (sizeof(value)) { \
+ case 1: \
+ asm ( ".section __immediate, \"a\", @progbits;\n\t" \
+ ".long %1, (0f)+1, 1;\n\t" \
+ ".previous;\n\t" \
+ "0:\n\t" \
+ "mov %2,%0;\n\t" \
+ : "=r" (value) \
+ : "m" (name##__immediate), \
+ "i" (0)); \
+ break; \
+ case 2: \
+ asm ( ".section __immediate, \"a\", @progbits;\n\t" \
+ ".long %1, (0f)+2, 2;\n\t" \
+ ".previous;\n\t" \
+ "1:\n\t" \
+ ".align 2;\n\t" \
+ "0:\n\t" \
+ "mov %2,%0;\n\t" \
+ : "=r" (value) \
+ : "m" (name##__immediate), \
+ "i" (0)); \
+ break; \
+ case 4: \
+ asm ( ".section __immediate, \"a\", @progbits;\n\t" \
+ ".long %1, (0f)+1, 4;\n\t" \
+ ".previous;\n\t" \
+ "1:\n\t" \
+ ".org (1b)+(3-((1b)%%4)), 0x90;\n\t" \
+ "0:\n\t" \
+ "mov %2,%0;\n\t" \
+ : "=r" (value) \
+ : "m" (name##__immediate), \
+ "i" (0)); \
+ break; \
+ default:value = name##__immediate; \
+ break; \
+ }; \
+ value; \
+ })
+
+extern int arch_immediate_update(const struct __immediate *immediate);
+extern void arch_immediate_update_early(const struct __immediate *immediate);
+
+#endif /* _ASM_I386_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/i386/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/arch/i386/kernel/Makefile 2007-09-18 10:08:33.000000000 -0400
+++ linux-2.6-lttng/arch/i386/kernel/Makefile 2007-09-18 13:15:02.000000000 -0400
@@ -36,6 +36,7 @@ obj-$(CONFIG_MODULES) += module.o
obj-y += sysenter.o vsyscall.o
obj-$(CONFIG_ACPI_SRAT) += srat.o
obj-$(CONFIG_EFI) += efi.o efi_stub.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
obj-$(CONFIG_DOUBLEFAULT) += doublefault.o
obj-$(CONFIG_KGDB) += kgdb.o kgdb-jmp.o
obj-$(CONFIG_VM86) += vm86.o
Index: linux-2.6-lttng/arch/i386/kernel/immediate.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/i386/kernel/immediate.c 2007-09-18 13:15:02.000000000 -0400
@@ -0,0 +1,307 @@
+/*
+ * Immediate Value - i386 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of :
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ * Centrino Duo Processor Technology Specification Update, AH33.
+ * Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ * Instruction Execution Results.
+ *
+ * Permits immediate value modification by XMC with correct serialization.
+ *
+ * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
+ * location that has preemption enabled because it involves no temporary or
+ * reused data structure.
+ *
+ * Quoting Richard J Moore, source of the information motivating this
+ * implementation which differs from the one proposed by Intel which is not
+ * suitable for kernel context (does not support NMI and would require disabling
+ * interrupts on every CPU for a long period) :
+ *
+ * "There is another issue to consider when looking into using probes other
+ * then int3:
+ *
+ * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
+ * practice of modifying code on one processor where another has prefetched
+ * the unmodified version of the code. Intel states that unpredictable general
+ * protection faults may result if a synchronizing instruction (iret, int,
+ * int3, cpuid, etc ) is not executed on the second processor before it
+ * executes the pre-fetched out-of-date copy of the instruction.
+ *
+ * When we became aware of this I had a long discussion with Intel's
+ * microarchitecture guys. It turns out that the reason for this erratum
+ * (which incidentally Intel does not intend to fix) is because the trace
+ * cache - the stream of micorops resulting from instruction interpretation -
+ * cannot guaranteed to be valid. Reading between the lines I assume this
+ * issue arises because of optimization done in the trace cache, where it is
+ * no longer possible to identify the original instruction boundaries. If the
+ * CPU discoverers that the trace cache has been invalidated because of
+ * unsynchronized cross-modification then instruction execution will be
+ * aborted with a GPF. Further discussion with Intel revealed that replacing
+ * the first opcode byte with an int3 would not be subject to this erratum.
+ *
+ * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
+ *
+ * Overall design
+ *
+ * The algorithm proposed by Intel applies not so well in kernel context: it
+ * would imply disabling interrupts and looping on every CPUs while modifying
+ * the code and would not support instrumentation of code called from interrupt
+ * sources that cannot be disabled.
+ *
+ * Therefore, we use a different algorithm to respect Intel's erratum (see the
+ * quoted discussion above). We make sure that no CPU sees an out-of-date copy
+ * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
+ * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
+ * execute a sync_core(), to make sure that even when the breakpoint is removed,
+ * no cpu could possibly still have the out-of-date copy of the instruction,
+ * modify the now unused 2nd byte of the instruction, and then put back the
+ * original 1st byte of the instruction.
+ *
+ * It has exactly the same intent as the algorithm proposed by Intel, but
+ * it has less side-effects, scales better and supports NMI, SMI and MCE.
+ *
+ * Mathieu Desnoyers <[email protected]>
+ */
+
+#include <linux/notifier.h>
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/kdebug.h>
+#include <linux/rcupdate.h>
+#include <linux/kprobes.h>
+
+#include <asm/cacheflush.h>
+
+#define BREAKPOINT_INSTRUCTION 0xcc
+#define BREAKPOINT_INS_LEN 1
+#define NR_NOPS 8
+
+static long target_after_int3; /* EIP of the target after the int3 */
+static long bypass_eip; /* EIP of the bypass. */
+static long bypass_after_int3; /* EIP after the end-of-bypass int3 */
+static long after_immediate; /*
+ * EIP where to resume after the
+ * single-stepping.
+ */
+
+/*
+ * Size of the movl instruction (without the immediate value) in bytes.
+ * The 2 bytes load immediate has a 66H prefix, which makes the opcode 2 bytes
+ * wide.
+ */
+static inline size_t _immediate_get_insn_size(long size)
+{
+ switch (size) {
+ case 1: return 1;
+ case 2: return 2;
+ case 4: return 1;
+ default: BUG();
+ };
+}
+
+/*
+ * Internal bypass used during value update. The bypass is skipped by the
+ * function in which it is inserted.
+ * No need to be aligned because we exclude readers from the site during
+ * update.
+ * Layout is:
+ * nop nop nop nop nop nop nop nop int3
+ * The nops are the target replaced by the instruction to single-step.
+ */
+static inline void _immediate_bypass(long *bypassaddr, long *breaknextaddr)
+{
+ asm volatile ( "jmp 2f;\n\t"
+ "0:\n\t"
+ ".space 8, 0x90;\n\t"
+ "1:\n\t"
+ "int3;\n\t"
+ "2:\n\t"
+ "movl $(0b),%0;\n\t"
+ "movl $((1b)+1),%1;\n\t"
+ : "=r" (*bypassaddr),
+ "=r" (*breaknextaddr) : );
+}
+
+static void immediate_synchronize_core(void *info)
+{
+ sync_core(); /* use cpuid to stop speculative execution */
+}
+
+/*
+ * The eip value points right after the breakpoint instruction, in the second
+ * byte of the movl.
+ * Disable preemption in the bypass to make sure no thread will be preempted in
+ * it. We can then use synchronize_sched() to make sure every bypass users have
+ * ended.
+ */
+static int immediate_notifier(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ enum die_val die_val = (enum die_val) val;
+ struct die_args *args = data;
+
+ if (!args->regs || user_mode_vm(args->regs))
+ return NOTIFY_DONE;
+
+ if (die_val == DIE_INT3) {
+ if (args->regs->eip == target_after_int3) {
+ preempt_disable();
+ args->regs->eip = bypass_eip;
+ return NOTIFY_STOP;
+ } else if (args->regs->eip == bypass_after_int3) {
+ args->regs->eip = after_immediate;
+ preempt_enable();
+ return NOTIFY_STOP;
+ }
+ }
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block immediate_notify = {
+ .notifier_call = immediate_notifier,
+ .priority = 0x7fffffff, /* we need to be notified first */
+};
+
+
+/**
+ * arch_immediate_update - update one immediate value
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value. Must be called with immediate_mutex held.
+ */
+__kprobes int arch_immediate_update(const struct __immediate *immediate)
+{
+ int ret;
+ size_t insn_size = _immediate_get_insn_size(immediate->size);
+ long insn = immediate->immediate - insn_size;
+ long len;
+ unsigned long cr0;
+
+#ifdef CONFIG_KPROBES
+ /*
+ * Fail if a kprobe has been set on this instruction.
+ * (TODO: we could eventually do better and modify all the (possibly
+ * nested) kprobes for this site if kprobes had an API for this.
+ */
+ if (unlikely(*(unsigned char*)insn == BREAKPOINT_INSTRUCTION)) {
+ printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+ "Variable at %p, "
+ "instruction at %p, size %lu\n",
+ (void*)immediate->immediate,
+ (void*)immediate->var, immediate->size);
+ return -EBUSY;
+ }
+#endif
+
+ /*
+ * If the variable and the instruction have the same value, there is
+ * nothing to do.
+ */
+ switch (immediate->size) {
+ case 1: if (*(uint8_t*)immediate->immediate
+ == *(uint8_t*)immediate->var)
+ return 0;
+ break;
+ case 2: if (*(uint16_t*)immediate->immediate
+ == *(uint16_t*)immediate->var)
+ return 0;
+ break;
+ case 4: if (*(uint32_t*)immediate->immediate
+ == *(uint32_t*)immediate->var)
+ return 0;
+ break;
+ default:return -EINVAL;
+ }
+
+ _immediate_bypass(&bypass_eip, &bypass_after_int3);
+
+ after_immediate = immediate->immediate + immediate->size;
+
+ /*
+ * Using the _early variants because nobody is executing the
+ * bypass code while we patch it. It is protected by the
+ * immediate_mutex. Since we modify the instructions non atomically (for
+ * nops), we have to use the _early variant.
+ * We must however deal with the WP flag in cr0 by ourself.
+ */
+ kernel_wp_save(cr0);
+ text_poke_early((void*)bypass_eip, (void*)insn,
+ insn_size + immediate->size);
+ /*
+ * Fill the rest with nops.
+ */
+ len = NR_NOPS - immediate->size - insn_size;
+ add_nops((void*)(bypass_eip + insn_size + immediate->size), len);
+ kernel_wp_restore(cr0);
+
+ target_after_int3 = insn + BREAKPOINT_INS_LEN;
+ /* register_die_notifier has memory barriers */
+ register_die_notifier(&immediate_notify);
+ /* The breakpoint will single-step the bypass */
+ text_poke((void*)insn,
+ INIT_ARRAY(unsigned char, BREAKPOINT_INSTRUCTION, 1), 1);
+ wmb();
+ /*
+ * Execute serializing instruction on each CPU.
+ * Acts as a memory barrier.
+ */
+ ret = on_each_cpu(immediate_synchronize_core, NULL, 1, 1);
+ BUG_ON(ret != 0);
+
+ text_poke((void*)(insn + insn_size), (void*)immediate->var,
+ immediate->size);
+ wmb();
+ text_poke((void*)insn, (unsigned char*)bypass_eip, 1);
+ /*
+ * Wait for all int3 handlers to end
+ * (interrupts are disabled in int3).
+ * This CPU is clearly not in a int3 handler,
+ * because int3 handler is not preemptible and
+ * there cannot be any more int3 handler called
+ * for this site, because we placed the original
+ * instruction back.
+ * synchronize_sched has memory barriers.
+ */
+ synchronize_sched();
+ unregister_die_notifier(&immediate_notify);
+ /* unregister_die_notifier has memory barriers */
+ return 0;
+}
+
+/**
+ * arch_immediate_update_early - update one immediate value at boot time
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value at boot time.
+ */
+void __init arch_immediate_update_early(const struct __immediate *immediate)
+{
+ /*
+ * If the variable and the instruction have the same value, there is
+ * nothing to do.
+ */
+ switch (immediate->size) {
+ case 1: if (*(uint8_t*)immediate->immediate
+ == *(uint8_t*)immediate->var)
+ return;
+ break;
+ case 2: if (*(uint16_t*)immediate->immediate
+ == *(uint16_t*)immediate->var)
+ return;
+ break;
+ case 4: if (*(uint32_t*)immediate->immediate
+ == *(uint32_t*)immediate->var)
+ return;
+ break;
+ default:return;
+ }
+ memcpy((void*)immediate->immediate, (void*)immediate->var,
+ immediate->size);
+}
Index: linux-2.6-lttng/arch/i386/kernel/traps.c
===================================================================
--- linux-2.6-lttng.orig/arch/i386/kernel/traps.c 2007-09-18 10:09:57.000000000 -0400
+++ linux-2.6-lttng/arch/i386/kernel/traps.c 2007-09-18 13:15:02.000000000 -0400
@@ -637,7 +637,7 @@ fastcall void do_##name(struct pt_regs *
}

DO_VM86_ERROR_INFO( 0, SIGFPE, "divide error", divide_error, FPE_INTDIV, regs->eip)
-#ifndef CONFIG_KPROBES
+#if !defined(CONFIG_KPROBES) && !defined(CONFIG_IMMEDIATE)
DO_VM86_ERROR( 3, SIGTRAP, "int3", int3)
#endif
DO_VM86_ERROR( 4, SIGSEGV, "overflow", overflow)
@@ -879,14 +879,14 @@ void restart_nmi(void)
acpi_nmi_enable();
}

-#ifdef CONFIG_KPROBES
+#if defined(CONFIG_KPROBES) || defined(CONFIG_IMMEDIATE)
fastcall void __kprobes do_int3(struct pt_regs *regs, long error_code)
{
if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
== NOTIFY_STOP)
return;
- /* This is an interrupt gate, because kprobes wants interrupts
- disabled. Normal trap handlers don't. */
+ /* This is an interrupt gate, because kprobes and immediate values wants
+ * interrupts disabled. Normal trap handlers don't. */
restore_interrupts(regs);
do_trap(3, SIGTRAP, "int3", 1, regs, error_code, NULL);
}

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2007-09-18 21:51:28

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Mathieu Desnoyers wrote:

> +#define immediate_read(name) \
> + ({ \
> + __typeof__(name##__immediate) value; \
> + switch (sizeof(value)) { \
> + case 1: \
> + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> + ".long %1, (0f)+1, 1;\n\t" \
> + ".previous;\n\t" \
> + "0:\n\t" \
> + "mov %2,%0;\n\t" \

Given that you're relying on the exact instruction that this mov
generates, it might be better to explicitly put the opcodes in with
.byte. That way you're protected from the assembler deciding to
generate some other form of the instruction (for whatever reason). I
guess substituting in different registers would be a pain.

Aside from that, is there any reason not to just put $0 in there rather
than use %2?


> + ".long %1, (0f)+1, 4;\n\t" \
> + ".previous;\n\t" \
> + "1:\n\t" \
> + ".org (1b)+(3-((1b)%%4)), 0x90;\n\t" \
>
Seems a little complex, but I couldn't come up with anything much better:

.org . + 3 - (. & 3), 0x90

You can use . rather than needing to define 1:, it doesn't need quite so
many parens, and using &3 avoids the %% wart.

It's a pity that gas seems to generate plain 0x90 nops rather than
long-nop forms here. I thought it could do that.

J

2007-09-18 22:14:39

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Jeremy Fitzhardinge <[email protected]> writes:
>
> It's a pity that gas seems to generate plain 0x90 nops rather than
> long-nop forms here. I thought it could do that.

.p2align does it.

-Andi

2007-09-18 22:17:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Jeremy Fitzhardinge wrote:
> Mathieu Desnoyers wrote:
>
>> +#define immediate_read(name) \
>> + ({ \
>> + __typeof__(name##__immediate) value; \
>> + switch (sizeof(value)) { \
>> + case 1: \
>> + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
>> + ".long %1, (0f)+1, 1;\n\t" \
>> + ".previous;\n\t" \
>> + "0:\n\t" \
>> + "mov %2,%0;\n\t" \
>
> Given that you're relying on the exact instruction that this mov
> generates, it might be better to explicitly put the opcodes in with
> .byte. That way you're protected from the assembler deciding to
> generate some other form of the instruction (for whatever reason). I
> guess substituting in different registers would be a pain.

Allowing different registers should be doable, but if so, one would have
to put 0: at the *end* of the instruction and use (0f)-4 instead, since
the non-%eax forms are one byte longer.

This also seems "safer", since an imm32 is always the last thing in the
instruction.

-hpa

2007-09-18 22:27:59

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

H. Peter Anvin wrote:
> Allowing different registers should be doable, but if so, one would have
> to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> the non-%eax forms are one byte longer.
>

OK, that's already a problem since its using "=r" as the constraint.

> This also seems "safer", since an imm32 is always the last thing in the
> instruction.

Good idea. If gas/gcc generates entirely the wrong addressing mode,
then we've got bigger problems.

J

2007-09-18 22:29:57

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Andi Kleen wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>> It's a pity that gas seems to generate plain 0x90 nops rather than
>> long-nop forms here. I thought it could do that.
>>
>
> .p2align does it

Just .p2align? Not align, balign, org or skip? Seems... strange.

J

2007-09-18 22:37:10

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
>> Jeremy Fitzhardinge <[email protected]> writes:
>>
>>> It's a pity that gas seems to generate plain 0x90 nops rather than
>>> long-nop forms here. I thought it could do that.
>>>
>> .p2align does it
>
> Just .p2align? Not align, balign, org or skip? Seems... strange.
>

Probably it works for .align and .balign too, but not .org.

-hpa

2007-09-18 22:45:07

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

On Tue, Sep 18, 2007 at 03:29:50PM -0700, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Jeremy Fitzhardinge <[email protected]> writes:
> >
> >> It's a pity that gas seems to generate plain 0x90 nops rather than
> >> long-nop forms here. I thought it could do that.
> >>
> >
> > .p2align does it
>
> Just .p2align? Not align, balign, org or skip? Seems... strange.

The problem is that you cannot always safely jump into the middle of the
longer form nops. So I suppose they didn't risk breakage on older
code relying on this.

-Andi

2007-09-19 11:00:08

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Jeremy Fitzhardinge ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>
> > +#define immediate_read(name) \
> > + ({ \
> > + __typeof__(name##__immediate) value; \
> > + switch (sizeof(value)) { \
> > + case 1: \
> > + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> > + ".long %1, (0f)+1, 1;\n\t" \
> > + ".previous;\n\t" \
> > + "0:\n\t" \
> > + "mov %2,%0;\n\t" \
>
> Given that you're relying on the exact instruction that this mov
> generates, it might be better to explicitly put the opcodes in with
> .byte. That way you're protected from the assembler deciding to
> generate some other form of the instruction (for whatever reason). I
> guess substituting in different registers would be a pain.
>

Good point. I thought it might come up, especially for 16 bits mov that
can be expressed under different forms, one of which has a prefix. I
would like to go for Peter's suggestion: putting the label _after_ the
instruction, since we know that we will be right after the immediate
value, but it has a drawback: we cannot insure correct alignment of the
immediate value in that case. But that would help not having to force
the register.

> Aside from that, is there any reason not to just put $0 in there rather
> than use %2?
>

Actually, no, since the initial value is written to the immediate value
references at early boot and at module load time. I originally thought
passing the referenced variable to it, but, as I recall, it brought
linker issues when the symbol was defined in another module. So yes,
just $0 is ok, I'll change that.

>
> > + ".long %1, (0f)+1, 4;\n\t" \
> > + ".previous;\n\t" \
> > + "1:\n\t" \
> > + ".org (1b)+(3-((1b)%%4)), 0x90;\n\t" \
> >
> Seems a little complex, but I couldn't come up with anything much better:
>
> .org . + 3 - (. & 3), 0x90
>
> You can use . rather than needing to define 1:, it doesn't need quite so
> many parens, and using &3 avoids the %% wart.
>

Yes, this one is tricky.. trying to align efficiently something on a 4
bytes address - 1 is not what gas is used to help doing.

> It's a pity that gas seems to generate plain 0x90 nops rather than
> long-nop forms here. I thought it could do that.
>

At least we will have at most 3 nops there.

Mathieu

> J

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 11:01:09

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* H. Peter Anvin ([email protected]) wrote:
> Jeremy Fitzhardinge wrote:
> > Mathieu Desnoyers wrote:
> >
> >> +#define immediate_read(name) \
> >> + ({ \
> >> + __typeof__(name##__immediate) value; \
> >> + switch (sizeof(value)) { \
> >> + case 1: \
> >> + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> >> + ".long %1, (0f)+1, 1;\n\t" \
> >> + ".previous;\n\t" \
> >> + "0:\n\t" \
> >> + "mov %2,%0;\n\t" \
> >
> > Given that you're relying on the exact instruction that this mov
> > generates, it might be better to explicitly put the opcodes in with
> > .byte. That way you're protected from the assembler deciding to
> > generate some other form of the instruction (for whatever reason). I
> > guess substituting in different registers would be a pain.
>
> Allowing different registers should be doable, but if so, one would have
> to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> the non-%eax forms are one byte longer.
>
> This also seems "safer", since an imm32 is always the last thing in the
> instruction.
>

The idea is very interesting, but then, if I can't be sure of the size
of my instruction, how can I align the immediate value properly ?

> -hpa
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 11:08:58

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Andi Kleen ([email protected]) wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
> >
> > It's a pity that gas seems to generate plain 0x90 nops rather than
> > long-nop forms here. I thought it could do that.
>
> .p2align does it.
>

Sadly, p2align does not apply well to my context. I have to align on 4
bytes boundaries - 1 for the 4 bytes mov, so, if I would use p2align, I
would end up aligning on 4 bytes with p2align and then add 3 bytes
(worse case: adding 3 + 3 = 6 bytes of nops).

However, with the .org arithmetic, I can simply add then quantity of
nops needed to make my alignment on 4 bytes - 1, so the worse case
becomes adding 3 bytes.

The example is:

originally: address & 3 = 1
* p2align
p2align adds 3 bytes to align on 4 bytes boundaries
we add 3 bytes to align on the next 4 bytes - 1, so the immediate value
within the instruction is aligned on 4 bytes boundaries

* org
we add 2 bytes to be aligned on the next 4 bytes - 1.

And yes, it's a pity there is no way to produce the long-nops there. :(

Mathieu

> -Andi

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 11:14:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

> And yes, it's a pity there is no way to produce the long-nops there. :(

To be honest I consider it extremly unlikely you would be able
to measure a difference.

-Andi

2007-09-19 13:01:34

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Jeremy Fitzhardinge ([email protected]) wrote:
> H. Peter Anvin wrote:
> > Allowing different registers should be doable, but if so, one would have
> > to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> > the non-%eax forms are one byte longer.
> >
>
> OK, that's already a problem since its using "=r" as the constraint.
>
> > This also seems "safer", since an imm32 is always the last thing in the
> > instruction.
>
> Good idea. If gas/gcc generates entirely the wrong addressing mode,
> then we've got bigger problems.
>

Ok, let's have a good look at what we want:

1 - get a pointer to the beginning of the immediate value within the
instruction.
2 - make sure that the immediate value, within the instruction, is
written to atomically wrt all CPUs, even on older architectures
where non aligned writes are not atomic.

Effectively, placing a label at the end of the instruction, and then
offsetting backward from there, will give us (1).

Then, for the alignment, we have to give a good look at the instruction
set reference, mov instruction, to see what the variants of mov
immediate value to register are on i386.

First, let's look at the possible prefixes:
- Lock and repeat prefixes : no
- Segment override prefixes: no memory reference there, so doesn't
apply.
- Branch hints : no, it's a mov instruction
- Operand-size override prefix: *yes*, can be used for 2 bytes mov
- Address-size override prefix: no address in there, only immediate
value and register.

(looking at the Compat/Leg Mode)

3 cases:

* 1 byte
B0 + rb MOV r8, imm8 (1 byte opcode)
REX + B0 + rb MOV r8, imm8 (only on 64 bits archs, never generated)
C6 /0 MOV r/m8, imm8 (2 bytes opcode)
(this one doesn't require alignment at all, since we do a 1 byte write)

* 2 bytes
B8 + rw MOV r16, imm16 (1 byte opcode)
66 B8 + rd MOV r16, imm16 (2 bytes opcode) (with 66H prefix)
C7 /0 MOV r/m16, imm16 (2 bytes opcode)
(Alignment on 4 bytes boundaries would be required because of the
possible 1 byte opcode ? Or is the 66H prefix mandatory there ? If it
is, then we can safely align on 2 bytes boundaries.)

* 4 bytes
B8 + rd MOV r32, imm32 (1 byte opcode)
C7 /0 MOV r/m32, imm32 (2 bytes opcode)
(the 2 bytes opcode can be a problem)

I have missed anything ?

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 16:03:59

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Mathieu Desnoyers ([email protected]) wrote:
> * Jeremy Fitzhardinge ([email protected]) wrote:
> > H. Peter Anvin wrote:
> > > Allowing different registers should be doable, but if so, one would have
> > > to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> > > the non-%eax forms are one byte longer.
> > >
> >
> > OK, that's already a problem since its using "=r" as the constraint.
> >
> > > This also seems "safer", since an imm32 is always the last thing in the
> > > instruction.
> >
> > Good idea. If gas/gcc generates entirely the wrong addressing mode,
> > then we've got bigger problems.
> >
>
> Ok, let's have a good look at what we want:
>
> 1 - get a pointer to the beginning of the immediate value within the
> instruction.
> 2 - make sure that the immediate value, within the instruction, is
> written to atomically wrt all CPUs, even on older architectures
> where non aligned writes are not atomic.
>
> Effectively, placing a label at the end of the instruction, and then
> offsetting backward from there, will give us (1).
>
> Then, for the alignment, we have to give a good look at the instruction
> set reference, mov instruction, to see what the variants of mov
> immediate value to register are on i386.
>
> First, let's look at the possible prefixes:
> - Lock and repeat prefixes : no
> - Segment override prefixes: no memory reference there, so doesn't
> apply.
> - Branch hints : no, it's a mov instruction
> - Operand-size override prefix: *yes*, can be used for 2 bytes mov
> - Address-size override prefix: no address in there, only immediate
> value and register.
>
> (looking at the Compat/Leg Mode)
>
> 3 cases:
>
> * 1 byte
> B0 + rb MOV r8, imm8 (1 byte opcode)
> REX + B0 + rb MOV r8, imm8 (only on 64 bits archs, never generated)
> C6 /0 MOV r/m8, imm8 (2 bytes opcode)
> (this one doesn't require alignment at all, since we do a 1 byte write)
>
> * 2 bytes
> B8 + rw MOV r16, imm16 (1 byte opcode)
> 66 B8 + rd MOV r16, imm16 (2 bytes opcode) (with 66H prefix)
> C7 /0 MOV r/m16, imm16 (2 bytes opcode)
> (Alignment on 4 bytes boundaries would be required because of the
> possible 1 byte opcode ? Or is the 66H prefix mandatory there ? If it
> is, then we can safely align on 2 bytes boundaries.)
>
> * 4 bytes
> B8 + rd MOV r32, imm32 (1 byte opcode)
> C7 /0 MOV r/m32, imm32 (2 bytes opcode)
> (the 2 bytes opcode can be a problem)
>
> I have missed anything ?
>

This is not a problem in practice because we place a breakpoint and use
a bypass when we are updating immediate values that can be used
concurrently. So let's not bother about that too much. It just means
that the breakpoint approach might have to be used for more
architectures in the future to address that kind of issue.

Mathieu


--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 16:16:35

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Mathieu Desnoyers wrote:
>
> Ok, let's have a good look at what we want:
>
> 1 - get a pointer to the beginning of the immediate value within the
> instruction.
> 2 - make sure that the immediate value, within the instruction, is
> written to atomically wrt all CPUs, even on older architectures
> where non aligned writes are not atomic.
>

I think you'll find that even on modern architectures cross-cacheline
writes aren't atomic.

Anyway, disregard the bit I mentioned about one byte longer; I've been
staring myself too blind at instruction variants lately (lots of
arguments about how to generate various instructions using NASM syntax).

>
> * 4 bytes
> B8 + rd MOV r32, imm32 (1 byte opcode)
> C7 /0 MOV r/m32, imm32 (2 bytes opcode)
> (the 2 bytes opcode can be a problem)
>

If gas generates the C7 opcodes by default, then that's a bug, nothing less.

-hpa

2007-09-19 17:30:54

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

H. Peter Anvin wrote:
> Mathieu Desnoyers wrote:
>
>> Ok, let's have a good look at what we want:
>>
>> 1 - get a pointer to the beginning of the immediate value within the
>> instruction.
>> 2 - make sure that the immediate value, within the instruction, is
>> written to atomically wrt all CPUs, even on older architectures
>> where non aligned writes are not atomic.
>>
>>
>
> I think you'll find that even on modern architectures cross-cacheline
> writes aren't atomic.
>

Cross-cache-line, sure. But what about just not sizeof aligned? If its
enough to avoid cross-cache-line, then that's simpler.

Which is something I was going to comment on: Mathieu, you try to align
the constant itself, but you don't prevent the instruction overall from
crossing a cache line. Given how delicate all this stuff is, it seems
like a good idea to do that.


>> * 4 bytes
>> B8 + rd MOV r32, imm32 (1 byte opcode)
>> C7 /0 MOV r/m32, imm32 (2 bytes opcode)
>> (the 2 bytes opcode can be a problem)
>>
>>
>
> If gas generates the C7 opcodes by default, then that's a bug, nothing less.
>

Well, in this case, it might be preferred if it brings the constant into
alignment without explicit padding :)

J

2007-09-19 17:44:20

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Jeremy Fitzhardinge wrote:
>
> Cross-cache-line, sure. But what about just not sizeof aligned? If its
> enough to avoid cross-cache-line, then that's simpler.
>

Not really. It pretty much comes down to the same problem.

> Which is something I was going to comment on: Mathieu, you try to align
> the constant itself, but you don't prevent the instruction overall from
> crossing a cache line. Given how delicate all this stuff is, it seems
> like a good idea to do that.

Given how delicate all his stuff is, it would be interesting to know the
magnitude of the expected payoff, which I personally don't remember seeing.

-hpa

2007-09-19 18:22:27

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Jeremy Fitzhardinge ([email protected]) wrote:
> H. Peter Anvin wrote:
> > Mathieu Desnoyers wrote:
> >
> >> Ok, let's have a good look at what we want:
> >>
> >> 1 - get a pointer to the beginning of the immediate value within the
> >> instruction.
> >> 2 - make sure that the immediate value, within the instruction, is
> >> written to atomically wrt all CPUs, even on older architectures
> >> where non aligned writes are not atomic.
> >>
> >>
> >
> > I think you'll find that even on modern architectures cross-cacheline
> > writes aren't atomic.
> >
>
> Cross-cache-line, sure. But what about just not sizeof aligned? If its
> enough to avoid cross-cache-line, then that's simpler.
>

Being sizeof aligned on a cache-line (e.g. 32 bytes boundaries) is a
superset of being aligned on sizeof multiples (e.g. 4 bytes). Therefore,
if we declare data of a certain size not aligned on the sizeof
boundaries, we won't be aligned on cache-lines neither. (unless I am
utterly wrong..) :)

> Which is something I was going to comment on: Mathieu, you try to align
> the constant itself, but you don't prevent the instruction overall from
> crossing a cache line. Given how delicate all this stuff is, it seems
> like a good idea to do that.
>

We just can't, for movl is 5 bytes in total : 1 byte for opcode, 4
bytes for the immediate value. But since we do not modify the opcode at
all, CPUs will either see the old or new immediate value (each of those
will be coherent because of the atomic update) and, in every case, they
will use it with the same opcode that haven't been touched.

>
> >> * 4 bytes
> >> B8 + rd MOV r32, imm32 (1 byte opcode)
> >> C7 /0 MOV r/m32, imm32 (2 bytes opcode)
> >> (the 2 bytes opcode can be a problem)
> >>
> >>
> >
> > If gas generates the C7 opcodes by default, then that's a bug, nothing less.
> >
>
> Well, in this case, it might be preferred if it brings the constant into
> alignment without explicit padding :)
>

It will need explicit padding too. We would have to align the 4 bytes
immediate value on 4 bytes multiples. Therefore, this 2 bytes opcode
followed by 4 bytes immediate value would have to be aligned on
(4 bytes - 2) boundaries.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-19 18:36:36

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* H. Peter Anvin ([email protected]) wrote:
> Jeremy Fitzhardinge wrote:
> >
> > Cross-cache-line, sure. But what about just not sizeof aligned? If its
> > enough to avoid cross-cache-line, then that's simpler.
> >
>
> Not really. It pretty much comes down to the same problem.
>
> > Which is something I was going to comment on: Mathieu, you try to align
> > the constant itself, but you don't prevent the instruction overall from
> > crossing a cache line. Given how delicate all this stuff is, it seems
> > like a good idea to do that.
>
> Given how delicate all his stuff is, it would be interesting to know the
> magnitude of the expected payoff, which I personally don't remember seeing.
>

Please have a look at the patch header at:
http://lkml.org/lkml/2007/9/18/346

And at the documentation:
http://lkml.org/lkml/2007/9/18/375

Where I provide speedups experienced with this patch. Please note that
my primary goal is to provide non intrusive instrumentation with the
Linux Kernel Markers, and the immediate values are the underlying
mechanism that help enabling/disabling dynamically instrumentation sites
with the smallest effect possible (and still get the local variables
in the middle of a function compiled with -O2...).

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-20 10:24:46

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

On Tuesday 18 September 2007 22:07, Mathieu Desnoyers wrote:
> i386 optimization of the immediate values which uses a movl with code patching
> to set/unset the value used to populate the register used as variable source.
>
> Changelog:
> - Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
> non atomic writes to a code region only touched by us (nobody can execute it
> since we are protected by the immediate_mutex).
> - Put immediate_set and _immediate_set in the architecture independent header.

> +struct __immediate {
> + long var; /* Pointer to the identifier variable of the
> + * immediate value
> + */
> + long immediate; /*
> + * Pointer to the memory location of the
> + * immediate value within the instruction.
> + */
> + long size; /* Type size. */
> +};


> + case 2: \
> + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> + ".long %1, (0f)+2, 2;\n\t" \
> + ".previous;\n\t" \
> + "1:\n\t" \
> + ".align 2;\n\t" \
> + "0:\n\t" \
> + "mov %2,%0;\n\t" \
> + : "=r" (value) \
> + : "m" (name##__immediate), \
> + "i" (0)); \

Instead of letting gcc use whatever instruction it sees fit best
for accessing the variable (like add/cmp/test...)
now we force it to use mov imm,reg first. Maybe with preceding nop
due to "align 2".

And then we use 12 more bytes in __immediate section
*for each* place where you read the variable.

Do you plan to use the same approach on x86_64?
I mean, longs there are twice as long.

Can this be made conditional, on CONFIG_CC_OPTIMIZE_FOR_SIZE perhaps?
--
vda

2007-09-21 13:31:43

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Denys Vlasenko ([email protected]) wrote:
> On Tuesday 18 September 2007 22:07, Mathieu Desnoyers wrote:
> > i386 optimization of the immediate values which uses a movl with code patching
> > to set/unset the value used to populate the register used as variable source.
> >
> > Changelog:
> > - Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
> > non atomic writes to a code region only touched by us (nobody can execute it
> > since we are protected by the immediate_mutex).
> > - Put immediate_set and _immediate_set in the architecture independent header.
>
> > +struct __immediate {
> > + long var; /* Pointer to the identifier variable of the
> > + * immediate value
> > + */
> > + long immediate; /*
> > + * Pointer to the memory location of the
> > + * immediate value within the instruction.
> > + */
> > + long size; /* Type size. */
> > +};
>
>
> > + case 2: \
> > + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> > + ".long %1, (0f)+2, 2;\n\t" \
> > + ".previous;\n\t" \
> > + "1:\n\t" \
> > + ".align 2;\n\t" \
> > + "0:\n\t" \
> > + "mov %2,%0;\n\t" \
> > + : "=r" (value) \
> > + : "m" (name##__immediate), \
> > + "i" (0)); \
>
> Instead of letting gcc use whatever instruction it sees fit best
> for accessing the variable (like add/cmp/test...)
> now we force it to use mov imm,reg first. Maybe with preceding nop
> due to "align 2".
>

Yes, this is true. So, the following branch:

char x;

void testb(void)
{
if (x > 5)
testa();
}

Would turn into:
56: b0 00 mov $0x0,%al
58: 3c 05 cmp $0x5,%al
5a: 7e 05 jle 61 <testb+0x11>


Rather than:

56: 80 3d 00 00 00 00 05 cmpb $0x5,0x0
5d: 7e 05 jle 64 <testb+0x14>

> And then we use 12 more bytes in __immediate section
> *for each* place where you read the variable.
>

Yes. You must consider the this section is only used when updating the
variable. It is never used by the read-side and therefore does not
consume data cache on hot paths.

> Do you plan to use the same approach on x86_64?
> I mean, longs there are twice as long.
>

Yup. It's a memory footprint vs active cacheline footprints tradeoff.
When GCC optimizes for size and we see kernel speedups, it is not so
because it "consumes" less memory, but rather that there is less junk
polluting the cachelines. So unless you worry about a few K of data and
are an embedded system developer, I really don't see why you worry about
this. Oh, and by the way, I provide the ability to disable immediate
values in the EMBEDDED menu.

> Can this be made conditional, on CONFIG_CC_OPTIMIZE_FOR_SIZE perhaps?

No. As I just stated, only embedded developers would have an interest in
disabling this features because they would have so few memory available
on their architecture. The memory consumed by the immediate values table
is out of the hot path cachelines and therefore does not impact overall
performance.

> --
> vda

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-20 16:47:32

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* Jeremy Fitzhardinge ([email protected]) wrote:
> H. Peter Anvin wrote:
> > Allowing different registers should be doable, but if so, one would have
> > to put 0: at the *end* of the instruction and use (0f)-4 instead, since
> > the non-%eax forms are one byte longer.
> >
>
> OK, that's already a problem since its using "=r" as the constraint.
>

Hi Jeremy,

I have tried generating asm-to-"register" c variables for char, short
and int on i386 and I do not see this happening. The char opcode is
always 1 byte, short 2 bytes and int 1 byte. Result:

gcc version 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)

8: b3 02 mov $0x2,%bl
a: b1 03 mov $0x3,%cl
c: b2 04 mov $0x4,%dl
e: b0 05 mov $0x5,%al

4f: 66 be 06 00 mov $0x6,%si
53: 66 bb 07 00 mov $0x7,%bx
57: 66 b9 08 00 mov $0x8,%cx
5b: 66 ba 09 00 mov $0x9,%dx
5f: 66 b8 0a 00 mov $0xa,%ax

9f: bb 0b 00 00 00 mov $0xb,%ebx
a4: be 0c 00 00 00 mov $0xc,%esi
a9: b9 0d 00 00 00 mov $0xd,%ecx
ae: ba 0e 00 00 00 mov $0xe,%edx
b3: b8 0f 00 00 00 mov $0xf,%eax


I notice that having a "=r" inline assembly that outputs to the first
"register char" variable seems to be problematic. It fails with the
following error:

/tmp/ccy35Hq1.s: Assembler messages:
/tmp/ccy35Hq1.s:15: Error: bad register name `%sil'

But it seems to be specific to "register" variables, which I do not use
in my immediate values.


> > This also seems "safer", since an imm32 is always the last thing in the
> > instruction.
>
> Good idea. If gas/gcc generates entirely the wrong addressing mode,
> then we've got bigger problems.
>

I am still trying to figure out if we must assume that gas will produce
different length opcodes for mov instructions. The choice is:

- Either I use a "r" constraint and let gcc produce the instructions,
that I need to assume to have correct size so I can align their
immediate values (therefore, taking the offset from the end of the
instruction will not help). Here, if gas changes its behavior
dramatically for a given immediate value size, it will break.

- Second choice is to stick to a particular register, choosing the one
with the less side-effect, and encoding the instruction ourselves. I
start to think that this second solution might be safer, even though
we wouldn't let the compiler select the register which has the less
impact by itself.

Any comments about this ?

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-20 18:36:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Mathieu Desnoyers wrote:
>
> I have tried generating asm-to-"register" c variables for char, short
> and int on i386 and I do not see this happening. The char opcode is
> always 1 byte, short 2 bytes and int 1 byte. Result:
>

The comment was referring to x86-64, but I incorrectly remembered that
applying to "movq $imm,%reg" as opposed to loading from an absolute
address. gas actually has a special opcode (movabs) for the 64-bit
version of the latter variant, which is only available with %rax and its
subregisters.

Nevermind, in other words. It's still true, though, that the immediate
will always be the last thing in the instruction -- that's a fixture of
the instruction format.

> gcc version 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)
>
> 8: b3 02 mov $0x2,%bl
> a: b1 03 mov $0x3,%cl
> c: b2 04 mov $0x4,%dl
> e: b0 05 mov $0x5,%al
>
> 4f: 66 be 06 00 mov $0x6,%si
> 53: 66 bb 07 00 mov $0x7,%bx
> 57: 66 b9 08 00 mov $0x8,%cx
> 5b: 66 ba 09 00 mov $0x9,%dx
> 5f: 66 b8 0a 00 mov $0xa,%ax
>
> 9f: bb 0b 00 00 00 mov $0xb,%ebx
> a4: be 0c 00 00 00 mov $0xc,%esi
> a9: b9 0d 00 00 00 mov $0xd,%ecx
> ae: ba 0e 00 00 00 mov $0xe,%edx
> b3: b8 0f 00 00 00 mov $0xf,%eax
>
>
> I notice that having a "=r" inline assembly that outputs to the first
> "register char" variable seems to be problematic. It fails with the
> following error:
>
> /tmp/ccy35Hq1.s: Assembler messages:
> /tmp/ccy35Hq1.s:15: Error: bad register name `%sil'

'r' is wrong for 8-bit variables on i386. It needs to be 'q'.

2007-10-22 09:59:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

> - Either I use a "r" constraint and let gcc produce the instructions,
> that I need to assume to have correct size so I can align their
> immediate values (therefore, taking the offset from the end of the
> instruction will not help). Here, if gas changes its behavior
> dramatically for a given immediate value size, it will break.

I wouldn't expect it to do that, but you could perhaps add a self
test somewhere to check for it.

>
> - Second choice is to stick to a particular register, choosing the one
> with the less side-effect, and encoding the instruction ourselves. I
> start to think that this second solution might be safer, even though
> we wouldn't let the compiler select the register which has the less
> impact by itself.

Such effects caused occassional bugs in the alternative() implementation
which requires a maximum size for the replacement.

But in this case it should be safe enough to trust gas stability.

-Andi

2007-10-22 15:45:31

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

* H. Peter Anvin ([email protected]) wrote:
> Mathieu Desnoyers wrote:
> >
> >I have tried generating asm-to-"register" c variables for char, short
> >and int on i386 and I do not see this happening. The char opcode is
> >always 1 byte, short 2 bytes and int 1 byte. Result:
> >
>
> The comment was referring to x86-64, but I incorrectly remembered that
> applying to "movq $imm,%reg" as opposed to loading from an absolute
> address. gas actually has a special opcode (movabs) for the 64-bit
> version of the latter variant, which is only available with %rax and its
> subregisters.
>
> Nevermind, in other words. It's still true, though, that the immediate
> will always be the last thing in the instruction -- that's a fixture of
> the instruction format.
>

Hrm, some basic testing shows me that, on x86_64:

register long var, var2, var3, var4, var5;
asm ("movq $0xFFFFFFFF,%0;\n\t" : "=r" (var) :);
asm ("movq $0x100000000,%0;\n\t" : "=r" (var2) :);
asm ("movq $0,%0;\n\t" : "=r" (var3) :);
asm ("movq $0xFFFFFFFFFFFFFFFF,%0;\n\t" : "=r" (var4) :);
asm ("movq $0xFEFEFEFE01010101,%0;\n\t" : "=r" (var5) :);

generates:

ae: 48 be ff ff ff ff 00 mov $0xffffffff,%rsi
b5: 00 00 00
b8: 48 bf 00 00 00 00 01 mov $0x100000000,%rdi
bf: 00 00 00
c2: 48 c7 c1 00 00 00 00 mov $0x0,%rcx
c9: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
d0: 48 b8 01 01 01 01 fe mov $0xfefefefe01010101,%rax
d7: fe fe fe

So we have to deal with the fact that gas produces opcodes of
different size (2 and 3 bytes) and with different immediate value sizes
for 8 bytes load immediates, depending on the immediate value to load.
There even seems to be an implicit duplication of the value when the
lower and higher 32 bits of the immediate value are the same (the
0xFFFFFFFFFFFFFFFF case).

Therefore, I propose the following. We should use the

asm ("movq $0xFEFEFEFE01010101,%0;\n\t" : "=r" (var5) :);

to make sure we always deal with the same instruction and have a 8 bytes
immediate value. We don't really care about the static value given
because we always update it dynamically before the first time it is
executed.

> >gcc version 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)
> >
> > 8: b3 02 mov $0x2,%bl
> > a: b1 03 mov $0x3,%cl
> > c: b2 04 mov $0x4,%dl
> > e: b0 05 mov $0x5,%al
> >
> > 4f: 66 be 06 00 mov $0x6,%si
> > 53: 66 bb 07 00 mov $0x7,%bx
> > 57: 66 b9 08 00 mov $0x8,%cx
> > 5b: 66 ba 09 00 mov $0x9,%dx
> > 5f: 66 b8 0a 00 mov $0xa,%ax
> >
> > 9f: bb 0b 00 00 00 mov $0xb,%ebx
> > a4: be 0c 00 00 00 mov $0xc,%esi
> > a9: b9 0d 00 00 00 mov $0xd,%ecx
> > ae: ba 0e 00 00 00 mov $0xe,%edx
> > b3: b8 0f 00 00 00 mov $0xf,%eax
> >
> >
> >I notice that having a "=r" inline assembly that outputs to the first
> >"register char" variable seems to be problematic. It fails with the
> >following error:
> >
> >/tmp/ccy35Hq1.s: Assembler messages:
> >/tmp/ccy35Hq1.s:15: Error: bad register name `%sil'
>
> 'r' is wrong for 8-bit variables on i386. It needs to be 'q'.
>

Will fix. I noticed that it was because I had too much "register" char
variables declared and used at the same time. Putting a "g" constraint
gave the same result.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-22 16:44:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 4/7] Immediate Values - i386 Optimization

Mathieu Desnoyers wrote:
>
> Will fix. I noticed that it was because I had too much "register" char
> variables declared and used at the same time. Putting a "g" constraint
> gave the same result.
>

"g" is the same as "rmi", which is probably *NOT* what you want.

Don't use "register" variables. That's a poor workaround for using the
proper constraints.

-hpa