Subject: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Hi all,

An x86 processor handles an interrupt (from an external
source, software generated, or due to an exception)
depending on the contents of the IDT. Normally the IDT
contains mostly interrupt gates. Linux points each
interrupt gate to a unique function. Some are specific
to a particular task (handling traps, IPIs, ...); the
others are stubs that push the interrupt number onto the
stack and jump to 'common_interrupt'.

This patch removes the need for the stubs.

An interrupt gate contains a FAR pointer to the interrupt
handler, meaning that the code segment of the interrupt
handler is also reloaded. Instead of pointing each (non-
specific) interrupt gate to a unique handler, we set a
unique code segment and use a common handler. When the
handler finishes, the code segment is restored to the
'normal'/previous one.

In order to have a unique code segment for each interrupt
vector, the GDT is extended to 512 entries (one full page),
and the last half of the page describes identical code
segments (identical except for the number that ends up in
the cs register), which are referred to by the 256
interrupt gates.
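
For illustration, here is a small user-space sketch (plain C,
not kernel code) of the selector arithmetic: interrupt gate n
gets selector (256 + n) * 8, and the common handler recovers
n from the cs value the CPU loaded through that gate.

#include <assert.h>
#include <stdio.h>

/* selector installed in interrupt gate n: GDT entries 256..511, RPL 0 */
static unsigned int gate_selector(unsigned int vector)
{
	return (256 + vector) * 8;
}

/* what the common handler computes from the cs it was entered with */
static unsigned int vector_from_cs(unsigned int cs)
{
	return (cs >> 3) & 0xff;
}

int main(void)
{
	unsigned int v;

	for (v = 0; v < 256; v++)
		assert(vector_from_cs(gate_selector(v)) == v);
	printf("vector 0x31 -> selector 0x%x\n", gate_selector(0x31));
	return 0;
}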

In this version, even the specialized handlers run with
their code segment switched. This is not necessary, but I
like the fact that in a register dump one can now see from
the code segment that the code is running due to a (hard)
interrupt. The exception I made is int 0x80 (syscall),
which runs with the normal kernel code segment.


Concluding: changing interrupt handling in this way
removes quite a bit of source code. It also removes the
need for the interrupt stubs and, on i386, the pointers to
them. This saves a few kilobytes of code. The page
reserved for the GDT is now fully used. The cs register
directly indicating that code is executed on behalf of
a (hardware) interrupt is a nice debugging aid. This way
of handling interrupts also leads to cleaner code: this
patch already gets rid of some 'ugly' macro magic in
entry_32.S and irqinit_64.c.

More cleanup is certainly possible, but I have tried to
keep the changes local and small. If switching code
segments is too expensive for some paths, that can be
fixed by not doing that ;).

I'd welcome some numbers from a few benchmarks on real
hardware (I only tested on qemu: debian runs without
noticeable differences before/after this patch).

Greetings,
Alexander

P.S. Just in case someone thinks this is a great idea and
testing and benchmarking goes well...

Signed-off-by: Alexander van Heukelum <[email protected]>

arch/x86/include/asm/desc.h | 24 +++++-----
arch/x86/include/asm/segment.h | 14 +-----
arch/x86/kernel/cpu/common.c | 3 +
arch/x86/kernel/entry_32.S | 33 ++++----------
arch/x86/kernel/head64.c | 4 --
arch/x86/kernel/head_32.S | 37 +++++----------
arch/x86/kernel/head_64.S | 18 ++-----
arch/x86/kernel/irqinit_32.c | 4 +-
arch/x86/kernel/irqinit_64.c | 96 +++++++---------------------------------
arch/x86/kernel/traps.c | 4 +-
10 files changed, 64 insertions(+), 173 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index e6b82b1..3125345 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -50,7 +50,7 @@ static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
unsigned dpl, unsigned ist, unsigned seg)
{
gate->offset_low = PTR_LOW(func);
- gate->segment = __KERNEL_CS;
+ gate->segment = seg;
gate->ist = ist;
gate->p = 1;
gate->dpl = dpl;
@@ -317,7 +317,7 @@ static inline void _set_gate(int gate, unsigned type, void *addr,
static inline void set_intr_gate(unsigned int n, void *addr)
{
BUG_ON((unsigned)n > 0xFF);
- _set_gate(n, GATE_INTERRUPT, addr, 0, 0, __KERNEL_CS);
+ _set_gate(n, GATE_INTERRUPT, addr, 0, 0, (256 + n) * 8);
}

#define SYS_VECTOR_FREE 0
@@ -348,37 +348,37 @@ static inline void alloc_intr_gate(unsigned int n, void *addr)
static inline void set_system_intr_gate(unsigned int n, void *addr)
{
BUG_ON((unsigned)n > 0xFF);
- _set_gate(n, GATE_INTERRUPT, addr, 0x3, 0, __KERNEL_CS);
+ _set_gate(n, GATE_INTERRUPT, addr, 0x3, 0, (256 + n) * 8);
}

-static inline void set_system_trap_gate(unsigned int n, void *addr)
+static inline void set_system_gate(unsigned int n, void *addr)
{
BUG_ON((unsigned)n > 0xFF);
+#ifdef CONFIG_X86_64
+ _set_gate(n, GATE_INTERRUPT, addr, 0x3, 0, __KERNEL_CS);
+#else
_set_gate(n, GATE_TRAP, addr, 0x3, 0, __KERNEL_CS);
+#endif
}

-static inline void set_trap_gate(unsigned int n, void *addr)
-{
- BUG_ON((unsigned)n > 0xFF);
- _set_gate(n, GATE_TRAP, addr, 0, 0, __KERNEL_CS);
-}
-
+#ifdef CONFIG_X86_32
static inline void set_task_gate(unsigned int n, unsigned int gdt_entry)
{
BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_TASK, (void *)0, 0, 0, (gdt_entry<<3));
}
+#endif

static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
{
BUG_ON((unsigned)n > 0xFF);
- _set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
+ _set_gate(n, GATE_INTERRUPT, addr, 0, ist, (256 + n) * 8);
}

static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
{
BUG_ON((unsigned)n > 0xFF);
- _set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
+ _set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, (256 + n) * 8);
}

#else
diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index 1dc1b51..c494a15 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -97,11 +97,6 @@

#define GDT_ENTRY_DOUBLEFAULT_TSS 31

-/*
- * The GDT has 32 entries
- */
-#define GDT_ENTRIES 32
-
/* The PnP BIOS entries in the GDT */
#define GDT_ENTRY_PNPBIOS_CS32 (GDT_ENTRY_PNPBIOS_BASE + 0)
#define GDT_ENTRY_PNPBIOS_CS16 (GDT_ENTRY_PNPBIOS_BASE + 1)
@@ -171,8 +166,6 @@
#define GS_TLS_SEL ((GDT_ENTRY_TLS_MIN+GS_TLS)*8 + 3)
#define FS_TLS_SEL ((GDT_ENTRY_TLS_MIN+FS_TLS)*8 + 3)

-#define GDT_ENTRIES 16
-
#endif

#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS * 8)
@@ -195,15 +188,10 @@
#define SEGMENT_TI_MASK 0x4

#define IDT_ENTRIES 256
+#define GDT_ENTRIES 512
#define NUM_EXCEPTION_VECTORS 32
#define GDT_SIZE (GDT_ENTRIES * 8)
#define GDT_ENTRY_TLS_ENTRIES 3
#define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES * 8)

-#ifdef __KERNEL__
-#ifndef __ASSEMBLY__
-extern const char early_idt_handlers[NUM_EXCEPTION_VECTORS][10];
-#endif
-#endif
-
#endif /* _ASM_X86_SEGMENT_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b9c9ea0..8aed74b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -55,6 +55,7 @@ DEFINE_PER_CPU(struct gdt_page, gdt_page) = { .gdt = {
[GDT_ENTRY_DEFAULT_USER32_CS] = { { { 0x0000ffff, 0x00cffb00 } } },
[GDT_ENTRY_DEFAULT_USER_DS] = { { { 0x0000ffff, 0x00cff300 } } },
[GDT_ENTRY_DEFAULT_USER_CS] = { { { 0x0000ffff, 0x00affb00 } } },
+ [256 ... 511] = { { { 0x0000ffff, 0x00af9b00 } } }
} };
#else
DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
@@ -90,6 +91,8 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {

[GDT_ENTRY_ESPFIX_SS] = { { { 0x00000000, 0x00c09200 } } },
[GDT_ENTRY_PERCPU] = { { { 0x00000000, 0x00000000 } } },
+
+ [256 ... 511] = { { { 0x0000ffff, 0x00cf9a00 } } }
} };
#endif
EXPORT_PER_CPU_SYMBOL_GPL(gdt_page);
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 28b597e..fadc971 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -622,31 +622,18 @@ END(syscall_badsys)
* Build the entry stubs and pointer table with
* some assembler magic.
*/
-.section .rodata,"a"
-ENTRY(interrupt)
-.text
-
-ENTRY(irq_entries_start)
+.p2align
+ENTRY(maininterrupt)
RING0_INT_FRAME
-vector=0
-.rept NR_VECTORS
- ALIGN
- .if vector
- CFI_ADJUST_CFA_OFFSET -4
- .endif
-1: pushl $~(vector)
- CFI_ADJUST_CFA_OFFSET 4
+ push %eax
+ push %eax
+ mov %cs,%eax
+ shr $3,%eax
+ and $0xff,%eax
+ not %eax
+ mov %eax,4(%esp)
+ pop %eax
jmp common_interrupt
- .previous
- .long 1b
- .text
-vector=vector+1
-.endr
-END(irq_entries_start)
-
-.previous
-END(interrupt)
-.previous

/*
* the CPU automatically disables interrupts when executing an IRQ vector,
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index d16084f..3d4e142 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -100,11 +100,7 @@ void __init x86_64_start_kernel(char * real_mode_data)
cleanup_highmap();

for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
-#ifdef CONFIG_EARLY_PRINTK
- set_intr_gate(i, &early_idt_handlers[i]);
-#else
set_intr_gate(i, early_idt_handler);
-#endif
}
load_idt((const struct desc_ptr *)&idt_descr);

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index eb7515c..028427c 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -486,7 +486,7 @@ check_x87:
*/
setup_idt:
lea ignore_int,%edx
- movl $(__KERNEL_CS << 16),%eax
+ movl $((256 * 8) << 16),%eax /* cs = (256 + irq_nr) * 8 */
movw %dx,%ax /* selector = 0x0010 = cs */
movw $0x8E00,%dx /* interrupt gate - dpl=0, present */

@@ -496,12 +496,13 @@ rp_sidt:
movl %eax,(%edi)
movl %edx,4(%edi)
addl $8,%edi
+ addl $(8 << 16),%eax /* cs = (256 + irq_nr) * 8 */
dec %ecx
jne rp_sidt

.macro set_early_handler handler,trapno
lea \handler,%edx
- movl $(__KERNEL_CS << 16),%eax
+ movl $(((256 + \trapno) * 8) << 16),%eax
movw %dx,%ax
movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
lea idt_table,%edi
@@ -509,30 +510,15 @@ rp_sidt:
movl %edx,8*\trapno+4(%edi)
.endm

- set_early_handler handler=early_divide_err,trapno=0
- set_early_handler handler=early_illegal_opcode,trapno=6
- set_early_handler handler=early_protection_fault,trapno=13
- set_early_handler handler=early_page_fault,trapno=14
+ set_early_handler handler=early_fault_fake_errorcode,trapno=0
+ set_early_handler handler=early_fault_fake_errorcode,trapno=6
+ set_early_handler handler=early_fault,trapno=13
+ set_early_handler handler=early_fault,trapno=14

ret

-early_divide_err:
- xor %edx,%edx
- pushl $0 /* fake errcode */
- jmp early_fault
-
-early_illegal_opcode:
- movl $6,%edx
- pushl $0 /* fake errcode */
- jmp early_fault
-
-early_protection_fault:
- movl $13,%edx
- jmp early_fault
-
-early_page_fault:
- movl $14,%edx
- jmp early_fault
+early_fault_fake_errorcode:
+ pushl $0

early_fault:
cld
@@ -546,7 +532,10 @@ early_fault:
incl early_recursion_flag
movl %cr2,%eax
pushl %eax
- pushl %edx /* trapno */
+ mov %cs, %eax
+ shr $3, %eax
+ and $0xff, %eax
+ pushl %eax /* trapno */
pushl $fault_msg
#ifdef CONFIG_EARLY_PRINTK
call early_printk
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 26cfdc1..c2ec12f 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -267,17 +267,6 @@ bad_address:
jmp bad_address

.section ".init.text","ax"
-#ifdef CONFIG_EARLY_PRINTK
- .globl early_idt_handlers
-early_idt_handlers:
- i = 0
- .rept NUM_EXCEPTION_VECTORS
- movl $i, %esi
- jmp early_idt_handler
- i = i + 1
- .endr
-#endif
-
ENTRY(early_idt_handler)
#ifdef CONFIG_EARLY_PRINTK
cmpl $2,early_recursion_flag(%rip)
@@ -286,7 +275,9 @@ ENTRY(early_idt_handler)
GET_CR2_INTO_RCX
movq %rcx,%r9
xorl %r8d,%r8d # zero for error code
- movl %esi,%ecx # get vector number
+ mov %cs, %ecx
+ shr $3, %ecx
+ and $0xff, %ecx # get vector number from CS
# Test %ecx against mask of vectors that push error code.
cmpl $31,%ecx
ja 0f
@@ -295,7 +286,8 @@ ENTRY(early_idt_handler)
testl $0x27d00,%eax
je 0f
popq %r8 # get error code
-0: movq 0(%rsp),%rcx # get ip
+0: mov %ecx, %esi # vector number
+ movq 0(%rsp),%rcx # get ip
movq 8(%rsp),%rdx # get cs
xorl %eax,%eax
leaq early_idt_msg(%rip),%rdi
diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
index 845aa98..d7c4b01 100644
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -21,7 +21,7 @@
#include <asm/arch_hooks.h>
#include <asm/i8259.h>

-
+void maininterrupt(void);

/*
* Note that on a 486, we don't want to do a SIGFPE on an irq13
@@ -129,7 +129,7 @@ void __init native_init_IRQ(void)
for (i = FIRST_EXTERNAL_VECTOR; i < NR_VECTORS; i++) {
/* SYSCALL_VECTOR was reserved in trap_init. */
if (i != SYSCALL_VECTOR)
- set_intr_gate(i, interrupt[i]);
+ set_intr_gate(i, maininterrupt);
}


diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
index ff02353..36e7f6d 100644
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -24,86 +24,22 @@
#include <asm/i8259.h>

/*
- * Common place to define all x86 IRQ vectors
- *
- * This builds up the IRQ handler stubs using some ugly macros in irq.h
- *
- * These macros create the low-level assembly IRQ routines that save
- * register context and call do_IRQ(). do_IRQ() then does all the
- * operations that are needed to keep the AT (or SMP IOAPIC)
- * interrupt-controller happy.
+ * All incoming IRQs are caught here.
*/
-
-#define IRQ_NAME2(nr) nr##_interrupt(void)
-#define IRQ_NAME(nr) IRQ_NAME2(IRQ##nr)
-
-/*
- * SMP has a few special interrupts for IPI messages
- */
-
-#define BUILD_IRQ(nr) \
- asmlinkage void IRQ_NAME(nr); \
- asm("\n.text\n.p2align\n" \
- "IRQ" #nr "_interrupt:\n\t" \
- "push $~(" #nr ") ; " \
- "jmp common_interrupt\n" \
- ".previous");
-
-#define BI(x,y) \
- BUILD_IRQ(x##y)
-
-#define BUILD_16_IRQS(x) \
- BI(x,0) BI(x,1) BI(x,2) BI(x,3) \
- BI(x,4) BI(x,5) BI(x,6) BI(x,7) \
- BI(x,8) BI(x,9) BI(x,a) BI(x,b) \
- BI(x,c) BI(x,d) BI(x,e) BI(x,f)
-
-/*
- * ISA PIC or low IO-APIC triggered (INTA-cycle or APIC) interrupts:
- * (these are usually mapped to vectors 0x30-0x3f)
- */
-
-/*
- * The IO-APIC gives us many more interrupt sources. Most of these
- * are unused but an SMP system is supposed to have enough memory ...
- * sometimes (mostly wrt. hw bugs) we get corrupted vectors all
- * across the spectrum, so we really want to be prepared to get all
- * of these. Plus, more powerful systems might have more than 64
- * IO-APIC registers.
- *
- * (these are usually mapped into the 0x30-0xff vector range)
- */
- BUILD_16_IRQS(0x2) BUILD_16_IRQS(0x3)
-BUILD_16_IRQS(0x4) BUILD_16_IRQS(0x5) BUILD_16_IRQS(0x6) BUILD_16_IRQS(0x7)
-BUILD_16_IRQS(0x8) BUILD_16_IRQS(0x9) BUILD_16_IRQS(0xa) BUILD_16_IRQS(0xb)
-BUILD_16_IRQS(0xc) BUILD_16_IRQS(0xd) BUILD_16_IRQS(0xe) BUILD_16_IRQS(0xf)
-
-#undef BUILD_16_IRQS
-#undef BI
-
-
-#define IRQ(x,y) \
- IRQ##x##y##_interrupt
-
-#define IRQLIST_16(x) \
- IRQ(x,0), IRQ(x,1), IRQ(x,2), IRQ(x,3), \
- IRQ(x,4), IRQ(x,5), IRQ(x,6), IRQ(x,7), \
- IRQ(x,8), IRQ(x,9), IRQ(x,a), IRQ(x,b), \
- IRQ(x,c), IRQ(x,d), IRQ(x,e), IRQ(x,f)
-
-/* for the irq vectors */
-static void (*__initdata interrupt[NR_VECTORS - FIRST_EXTERNAL_VECTOR])(void) = {
- IRQLIST_16(0x2), IRQLIST_16(0x3),
- IRQLIST_16(0x4), IRQLIST_16(0x5), IRQLIST_16(0x6), IRQLIST_16(0x7),
- IRQLIST_16(0x8), IRQLIST_16(0x9), IRQLIST_16(0xa), IRQLIST_16(0xb),
- IRQLIST_16(0xc), IRQLIST_16(0xd), IRQLIST_16(0xe), IRQLIST_16(0xf)
-};
-
-#undef IRQ
-#undef IRQLIST_16
-
-
-
+asmlinkage void maininterrupt(void);
+asm("\n.text\n.p2align\n"
+ "maininterrupt:\n\t"
+ "push %rax\n\t"
+ "push %rax\n\t"
+ "mov %cs,%eax\n\t"
+ "shr $3,%eax\n\t"
+ "and $0xff,%eax\n\t"
+ "not %rax\n\t"
+ "mov %rax,8(%rsp)\n\t"
+ "pop %rax\n\t"
+ "jmp common_interrupt\n"
+ ".previous"
+);

/*
* IRQ2 is cascade interrupt to second interrupt controller
@@ -219,7 +155,7 @@ void __init native_init_IRQ(void)
for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) {
int vector = FIRST_EXTERNAL_VECTOR + i;
if (vector != IA32_SYSCALL_VECTOR)
- set_intr_gate(vector, interrupt[i]);
+ set_intr_gate(vector, maininterrupt);
}

apic_intr_init();
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 47f6041..c2a794a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -996,7 +996,7 @@ void __init trap_init(void)
set_intr_gate(19, &simd_coprocessor_error);

#ifdef CONFIG_IA32_EMULATION
- set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
+ set_system_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
#endif

#ifdef CONFIG_X86_32
@@ -1012,7 +1012,7 @@ void __init trap_init(void)
printk("done.\n");
}

- set_system_trap_gate(SYSCALL_VECTOR, &system_call);
+ set_system_gate(SYSCALL_VECTOR, &system_call);

/* Reserve all the builtin and the syscall vector: */
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)


2008-11-04 12:43:43

by Ingo Molnar

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Alexander van Heukelum <[email protected]> wrote:

> Hi all,
>
> An x86 processor handles an interrupt (from an external source,
> software generated or due to an exception), depending on the
> contents if the IDT. Normally the IDT contains mostly interrupt
> gates. Linux points each interrupt gate to a unique function. Some
> are specific to some task (handling traps, IPI's, ...), the others
> are stubs that push the interrupt number to the stack and jump to
> 'common_interrupt'.
>
> This patch removes the need for the stubs.

hm, the cost would be this new code:

> +.p2align
> +ENTRY(maininterrupt)
> RING0_INT_FRAME
> -vector=0
> -.rept NR_VECTORS
> - ALIGN
> - .if vector
> - CFI_ADJUST_CFA_OFFSET -4
> - .endif
> -1: pushl $~(vector)
> - CFI_ADJUST_CFA_OFFSET 4
> + push %eax
> + push %eax
> + mov %cs,%eax
> + shr $3,%eax
> + and $0xff,%eax
> + not %eax
> + mov %eax,4(%esp)
> + pop %eax
> jmp common_interrupt

.. which we were able to avoid before. A couple of segment register
accesses, shifts, etc to calculate the vector - each of which can be
quite costly (especially the segment register access - this is a
relatively rare instruction pattern).

I'm not unconvincable, but we need to be conservative here: could you
try to measure the full before/after cost of IRQ entry, to the cycle
level? I'm curious what the performance impact is.

Also, this makes life probably a bit harder for Xen, which assumes
that the GDT of the guest OS is small-ish. (Jeremy Cc:-ed)

Ingo

2008-11-04 13:51:17

by Alexander van Heukelum

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Tue, 4 Nov 2008 13:42:42 +0100, "Ingo Molnar" <[email protected]> said:
>
> * Alexander van Heukelum <[email protected]> wrote:
>
> > Hi all,
> >
> > An x86 processor handles an interrupt (from an external source,
> > software generated or due to an exception), depending on the
> > contents if the IDT. Normally the IDT contains mostly interrupt
> > gates. Linux points each interrupt gate to a unique function. Some
> > are specific to some task (handling traps, IPI's, ...), the others
> > are stubs that push the interrupt number to the stack and jump to
> > 'common_interrupt'.
> >
> > This patch removes the need for the stubs.
>
> hm, the cost would be this new code:
>
> > +.p2align
> > +ENTRY(maininterrupt)
> > RING0_INT_FRAME
> > -vector=0
> > -.rept NR_VECTORS
> > - ALIGN
> > - .if vector
> > - CFI_ADJUST_CFA_OFFSET -4
> > - .endif
> > -1: pushl $~(vector)
> > - CFI_ADJUST_CFA_OFFSET 4
> > + push %eax
> > + push %eax
> > + mov %cs,%eax
> > + shr $3,%eax
> > + and $0xff,%eax
> > + not %eax
> > + mov %eax,4(%esp)
> > + pop %eax
> > jmp common_interrupt
>
> .. which we were able to avoid before. A couple of segment register
> accesses, shifts, etc to calculate the vector - each of which can be
> quite costly (especially the segment register access - this is a
> relatively rare instruction pattern).

The way it is written now is just so I did not have to change
common_interrupt (to keep changes small). All those accesses
so close together will cost some cycles, but much can be avoided
if it is integrated. If the precise content of the stack can be
changed, this could be as simple as "push %cs". Even that can be
delayed, because the content of the cs register will still be
there.

Note that the specialized interrupts (including page fault, etc.)
will not go via this path. As far as I understand now, it is only
the interrupts from external devices that normally go via
common_interrupt. There I think the overhead is really tiny
compared to the rest of the handling of the interrupt.

> I'm not unconvicable, but we need to be conservative here: could you
> try to measure the full before/after cost of IRQ entry, to the cycle
> level? I'm curious what the performance impact is.
>
> Also, this makes life probably a bit harder for Xen, which assumes
> that the GDT of the guest OS is small-ish. (Jeremy Cc:-ed)

I already had [email protected] for exactly this reason ;).

Greetings,
Alexander

> Ingo
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - A no graphics, no pop-ups email service

2008-11-04 14:01:31

by Ingo Molnar

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Alexander van Heukelum <[email protected]> wrote:

> On Tue, 4 Nov 2008 13:42:42 +0100, "Ingo Molnar" <[email protected]> said:
> >
> > * Alexander van Heukelum <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > An x86 processor handles an interrupt (from an external source,
> > > software generated or due to an exception), depending on the
> > > contents if the IDT. Normally the IDT contains mostly interrupt
> > > gates. Linux points each interrupt gate to a unique function. Some
> > > are specific to some task (handling traps, IPI's, ...), the others
> > > are stubs that push the interrupt number to the stack and jump to
> > > 'common_interrupt'.
> > >
> > > This patch removes the need for the stubs.
> >
> > hm, the cost would be this new code:
> >
> > > +.p2align
> > > +ENTRY(maininterrupt)
> > > RING0_INT_FRAME
> > > -vector=0
> > > -.rept NR_VECTORS
> > > - ALIGN
> > > - .if vector
> > > - CFI_ADJUST_CFA_OFFSET -4
> > > - .endif
> > > -1: pushl $~(vector)
> > > - CFI_ADJUST_CFA_OFFSET 4
> > > + push %eax
> > > + push %eax
> > > + mov %cs,%eax
> > > + shr $3,%eax
> > > + and $0xff,%eax
> > > + not %eax
> > > + mov %eax,4(%esp)
> > > + pop %eax
> > > jmp common_interrupt
> >
> > .. which we were able to avoid before. A couple of segment register
> > accesses, shifts, etc to calculate the vector - each of which can be
> > quite costly (especially the segment register access - this is a
> > relatively rare instruction pattern).
>
> The way it is written now is just so I did not have to change
> common_interrupt (to keep changes small). All those accesses so
> close together will cost some cycles, but much can be avoided if it
> is integrated. If the precise content of the stack can be changed,
> this could be as simple as "push %cs". Even that can be delayed,
> because the content of the cs register will still be there.
>
> Note that the specialized interrupts (including page fault, etc.)
> will not go via this path. As far as I understand now, it is only
> the interrupts from external devices that normally go via
> common_interrupt. There I think the overhead is really tiny compared
> to the rest of the handling of the interrupt.

no complaints from me about the cleanup/simplification effect - that's
really great. To make the reasoning all iron-clad, please post timings
of "push %cs" costs measured via RDTSC or so - can be done in
user-space as well. (you can simulate the entry+exit sequence in
user-space as well and prove that the overhead is near zero.) In the
end it could all even be faster (perhaps), besides smaller.

( another advantage is that the 6-byte GDT descriptor is more
compressed and hence uses up less L1/L2 cache footprint than the
larger (~7 byte) trampolines we have at the moment. )

plus it's possible to observe the typical cost of irqs from user-space
as well: run a task on a single CPU and save away all the RDTSC deltas
that are larger than ~10 cycles - these will be the IRQ entry costs.
Print out these deltas after 60 seconds of runtime (or something like
that), and look at the histogram.
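
A minimal user-space sketch of such a sampler (illustration
only: pin it to one CPU with taskset, and treat the threshold,
bucket count and runtime as arbitrary choices):

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

#define THRESHOLD 10	/* cycles; ignore the loop's own tiny deltas */
#define BUCKETS 4096

int main(void)
{
	static unsigned long hist[BUCKETS];
	time_t end = time(NULL) + 60;
	unsigned int i;

	while (time(NULL) < end) {
		/* re-read the TSC after the time() call so the cost of
		 * the syscall itself is not recorded as an interruption */
		uint64_t prev = __rdtsc();

		for (i = 0; i < 1000000; i++) {
			uint64_t now = __rdtsc();
			uint64_t delta = now - prev;

			prev = now;
			if (delta > THRESHOLD)
				hist[delta < BUCKETS ? delta : BUCKETS - 1]++;
		}
	}

	for (i = 0; i < BUCKETS; i++)
		if (hist[i])
			printf("%u cycles: %lu\n", i, hist[i]);
	return 0;
}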

Ingo

2008-11-04 15:07:43

by Cyrill Gorcunov

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

[Alexander van Heukelum - Tue, Nov 04, 2008 at 01:28:39PM +0100]
| Hi all,
|
| An x86 processor handles an interrupt (from an external
| source, software generated or due to an exception),
| depending on the contents if the IDT. Normally the IDT
| contains mostly interrupt gates. Linux points each
| interrupt gate to a unique function. Some are specific
| to some task (handling traps, IPI's, ...), the others
| are stubs that push the interrupt number to the stack
| and jump to 'common_interrupt'.
|
| This patch removes the need for the stubs.
|
| An interrupt gate contains a FAR pointer to the interrupt
| handler, meaning that the code segment of the interrupt
| handler is also reloaded. Instead of pointing each (non-
| specific) interrupt gate to a unique handler, we set a
| unique code segment and use a common handler. When the
| handler finishes the code segment is restored to the
| 'normal'/previous one.
|
| In order to have a unique code segment for each interrupt
| vector, the GDT is extended to 512 entries (1 full page),
| and the last half of the page describes identical code
| segments (identical, except for the number in the cs
| register), which are refered to by the 256 interrupt
| gates.
|
| In this version, even the specialized handlers get run
| with their code segment switched. This is not necessary,
| but I like the fact that in a register dump one can now
| see from the code segment that the code is ran due to
| a (hard) interrupt. The exception I made is the int 0x80
| (syscall), which runs with the normal kernel code segment.
|
|
| Concluding: changing interrupt handling to this way
| removes quite a bit of source code. It also removes the
| need for the interrupt stubs and, on i386, pointers to
| them. This saves a few kilobytes of code. The page
| reserved for the GDT is now fully used. The cs register
| indicating directly that code is executed on behalf of
| a (hardware) interrupt is a nice debugging aid. This way
| of handling interrupts also leads to cleaner code: this
| patch already gets rid of some 'ugly' macro magic in
| entry_32.S and irqinit_64.c.
|
| More cleanup is certainly possible, but I have tried to
| keep the changes local and small. If switching code
| segments is too expensive for some paths, that can be
| fixed by not doing that ;).
|
| I'ld welcome some numbers on a few benchmarks on real
| hardware (I only tested on qemu: debian runs without
| noticable differences before/after this patch).
|
| Greetings,
| Alexander
|
| P.S. Just in case someone thinks this is a great idea and
| testing and benchmarking goes well...
|
...

Hi Alexander, nicely done!

I'm not taking into account the cost of reading cs (which I
don't suspect to be that expensive, as opposed to writing it).
On the other hand, I guess walking the GDT entries could be
not that cheap, especially with the new segments you propose.
I guess the CPU internally checks whether the segment is the
same and does not reload it even though the gate holds a FAR
pointer, but I could be wrong, so Andi is CC'ed :)

A small nit in implementation:

entry_32.S:
+ push %eax
+ push %eax
+ mov %cs,%eax
+ shr $3,%eax
+ and $0xff,%eax
+ not %eax
+ mov %eax,4(%esp)
+ pop %eax

CFI_ADJUST_CFA_OFFSET missing?

- Cyrill -

2008-11-04 15:47:52

by Alexander van Heukelum

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


On Tue, 4 Nov 2008 18:07:29 +0300, "Cyrill Gorcunov"
<[email protected]> said:
> [Alexander van Heukelum - Tue, Nov 04, 2008 at 01:28:39PM +0100]
> | Hi all,
> |
> | An x86 processor handles an interrupt (from an external
> | source, software generated or due to an exception),
> | depending on the contents if the IDT. Normally the IDT
> | contains mostly interrupt gates. Linux points each
> | interrupt gate to a unique function. Some are specific
> | to some task (handling traps, IPI's, ...), the others
> | are stubs that push the interrupt number to the stack
> | and jump to 'common_interrupt'.
> |
> | This patch removes the need for the stubs.
> |
> | An interrupt gate contains a FAR pointer to the interrupt
> | handler, meaning that the code segment of the interrupt
> | handler is also reloaded. Instead of pointing each (non-
> | specific) interrupt gate to a unique handler, we set a
> | unique code segment and use a common handler. When the
> | handler finishes the code segment is restored to the
> | 'normal'/previous one.
> |
> | In order to have a unique code segment for each interrupt
> | vector, the GDT is extended to 512 entries (1 full page),
> | and the last half of the page describes identical code
> | segments (identical, except for the number in the cs
> | register), which are refered to by the 256 interrupt
> | gates.
> |
> | In this version, even the specialized handlers get run
> | with their code segment switched. This is not necessary,
> | but I like the fact that in a register dump one can now
> | see from the code segment that the code is ran due to
> | a (hard) interrupt. The exception I made is the int 0x80
> | (syscall), which runs with the normal kernel code segment.
> |
> |
> | Concluding: changing interrupt handling to this way
> | removes quite a bit of source code. It also removes the
> | need for the interrupt stubs and, on i386, pointers to
> | them. This saves a few kilobytes of code. The page
> | reserved for the GDT is now fully used. The cs register
> | indicating directly that code is executed on behalf of
> | a (hardware) interrupt is a nice debugging aid. This way
> | of handling interrupts also leads to cleaner code: this
> | patch already gets rid of some 'ugly' macro magic in
> | entry_32.S and irqinit_64.c.
> |
> | More cleanup is certainly possible, but I have tried to
> | keep the changes local and small. If switching code
> | segments is too expensive for some paths, that can be
> | fixed by not doing that ;).
> |
> | I'ld welcome some numbers on a few benchmarks on real
> | hardware (I only tested on qemu: debian runs without
> | noticable differences before/after this patch).
> |
> | Greetings,
> | Alexander
> |
> | P.S. Just in case someone thinks this is a great idea and
> | testing and benchmarking goes well...
> |
> ...
>
> Hi Alexander, great done!
>
> not taking into account the cost of cs reading (which I
> don't suspect to be that expensive apart from writting,
> on the other hand I guess walking on GDT entries could
> be not that cheap especially with new segments you propose,
> I guess cpu internally check for segment to be the same
> and do not reload it again even if it's described as FAR
> pointer but I could be wrong so Andi CC'ed :)

Thanks! And indeed Andi might know more about this.

I wonder how the time needed for reading the GDT segments
balances against the cost of the extra redirection through
the stubs. I'd be interested to know whether the difference
can be measured with the current implementation. (I really
need to hijack a machine to do some measurements; I hoped
someone would do it before I got to it ;) )

Even if some CPUs have some internal optimization for the case
where the gate segment is the same as the current one, I wonder
if it is really important... Interrupts that occur while the
processor is running userspace already cause a segment change.
They are more likely to be in cache, maybe.

Greetings,
Alexander

> A small nit in implementation:
>
> entry_32.S:
> + push %eax
> + push %eax
> + mov %cs,%eax
> + shr $3,%eax
> + and $0xff,%eax
> + not %eax
> + mov %eax,4(%esp)
> + pop %eax
>
> CFI_ADJUST_CFA_OFFSET missed?

Sure, I did just enough to make it work for me ;).

> - Cyrill -
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - IMAP accessible web-mail

2008-11-04 16:23:21

by Alexander van Heukelum

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Tue, 4 Nov 2008 15:00:30 +0100, "Ingo Molnar" <[email protected]> said:
>
> * Alexander van Heukelum <[email protected]> wrote:
>
> > On Tue, 4 Nov 2008 13:42:42 +0100, "Ingo Molnar" <[email protected]> said:
> > >
> > > * Alexander van Heukelum <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > An x86 processor handles an interrupt (from an external source,
> > > > software generated or due to an exception), depending on the
> > > > contents if the IDT. Normally the IDT contains mostly interrupt
> > > > gates. Linux points each interrupt gate to a unique function. Some
> > > > are specific to some task (handling traps, IPI's, ...), the others
> > > > are stubs that push the interrupt number to the stack and jump to
> > > > 'common_interrupt'.
> > > >
> > > > This patch removes the need for the stubs.
> > >
> > > hm, the cost would be this new code:
> > >
> > > > +.p2align
> > > > +ENTRY(maininterrupt)
> > > > RING0_INT_FRAME
> > > > -vector=0
> > > > -.rept NR_VECTORS
> > > > - ALIGN
> > > > - .if vector
> > > > - CFI_ADJUST_CFA_OFFSET -4
> > > > - .endif
> > > > -1: pushl $~(vector)
> > > > - CFI_ADJUST_CFA_OFFSET 4
> > > > + push %eax
> > > > + push %eax
> > > > + mov %cs,%eax
> > > > + shr $3,%eax
> > > > + and $0xff,%eax
> > > > + not %eax
> > > > + mov %eax,4(%esp)
> > > > + pop %eax
> > > > jmp common_interrupt
> > >
> > > .. which we were able to avoid before. A couple of segment register
> > > accesses, shifts, etc to calculate the vector - each of which can be
> > > quite costly (especially the segment register access - this is a
> > > relatively rare instruction pattern).
> >
> > The way it is written now is just so I did not have to change
> > common_interrupt (to keep changes small). All those accesses so
> > close together will cost some cycles, but much can be avoided if it
> > is integrated. If the precise content of the stack can be changed,
> > this could be as simple as "push %cs". Even that can be delayed,
> > because the content of the cs register will still be there.
> >
> > Note that the specialized interrupts (including page fault, etc.)
> > will not go via this path. As far as I understand now, it is only
> > the interrupts from external devices that normally go via
> > common_interrupt. There I think the overhead is really tiny compared
> > to the rest of the handling of the interrupt.
>
> no complaints from me about the cleanup/simplification effect - that's
> really great. To make the reasoning all iron-clad please post timings
> of "push %cs" costs measured via RDTSC or so - can be done in
> user-space as well. (you can simulate the entry+exit sequence in
> user-space as well and prove that the overhead is near zero.) In the
> end it could all even be faster (perhaps), besides smaller.

I did some timings using the little program below (32-bit only),
repeating the same sequence 1024 times. TEST1 is just pushing a constant onto
the stack; TEST2 is pushing the cs register; TEST3 is the sequence
from the patch to extract the vector number from the cs register.

Opteron (cycles): 1024 / 1157 / 3527
Xeon E5345 (cycles): 1092 / 1085 / 6622
Athlon XP (cycles): 1028 / 1166 / 5192

I'd say that the cost of the push %cs itself is negligible.

> ( another advantage is that the 6 bytes GDT descriptor is more
> compressed and hence uses up less L1/L2 cache footprint than the
> larger (~7 byte) trampolines we have at the moment. )

A GDT descriptor has to be read and processed anyhow... It might
just not be in cache. But at least it is aligned. The trampolines
are 7 bytes (irq#<128) or 10 bytes (irq#>127) on i386 and x86_64.
And one is data, and the other is code, which might also cause
different behaviour. It's just a bit too complicated to decide by
just reasoning about it ;).

> plus it's possible to observe the typical cost of irqs from user-space
> as well: run a task on a single CPU and save away all the RDTSC deltas
> that are larger than ~10 cycles - these will be the IRQ entry costs.
> Print out these deltas after 60 seconds of runtime (or something like
> that), and look at the histogram.

I'll see if I can do that. Maybe in a few days...

Thanks,
Alexander

> Ingo


#include <stdio.h>
#include <stdlib.h>

#define TEST 3

int main(void)
{
	int i, ticks[1024];

	/* Each iteration measures one run of 1024 repetitions of the
	 * selected test sequence, bracketed by two rdtsc reads. */
	for (i = 0; i < (sizeof(ticks)/sizeof(*ticks)); i++) {
		asm volatile (
			"push %%edx\n\t"
			"push %%ecx\n\t"
			"rdtsc\n\t"
			"mov %%eax,%%ecx\n\t"
			".rept 1024\n\t"
#if TEST==1
			"push $-255\n\t"		/* push a constant */
#endif
#if TEST==2
			"push %%cs\n\t"			/* push the cs register */
#endif
#if TEST==3
			/* the sequence from the patch: recover the vector
			 * number from the cs register */
			"push %%eax\n\t"
			"push %%eax\n\t"
			"mov %%cs,%%eax\n\t"
			"shr $3,%%eax\n\t"
			"and $0xff,%%eax\n\t"
			"not %%eax\n\t"
			"mov %%eax,4(%%esp)\n\t"
			"pop %%eax\n\t"
#endif
			".endr\n\t"
			"rdtsc\n\t"
			".rept 1024\n\t"
			"pop %%edx\n\t"			/* unwind the pushes */
			".endr\n\t"
			"sub %%ecx,%%eax\n\t"		/* cycle delta */
			"pop %%ecx\n\t"
			"pop %%edx"
			: "=a" (ticks[i]) );
	}

	for (i = 0; i < (sizeof(ticks)/sizeof(*ticks)); i++) {
		printf("%i\n", ticks[i]);
	}
}
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - A fast, anti-spam email service.

2008-11-04 16:37:18

by Ingo Molnar

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Alexander van Heukelum <[email protected]> wrote:

> I wonder how the time needed for reading the GDT segments balances
> against the time needed due to the extra redirection due to running
> the stubs. I'ld be interested if the difference can be measured with
> the current implementation. (I really need to highjack a machine to
> do some measurements; I hoped someone would do it before I got to it
> ;) )
>
> Even if some CPU's have some internal optimization for the case
> where the gate segment is the same as the current one, I wonder if
> it is really important... Interrupts that occur while the processor
> is running userspace already cause changing segments. They are more
> likely to be in cache, maybe.

there are three main factors:

- Same-value segment loads are optimized on most modern CPUs and can
give a few cycles (2-3) advantage. That might or might not apply to
the microcode that does IRQ entry processing. (A cache miss will
increase the cost much more but that is true in general as well)

- A second effect is that the changed data structure layout: a more
compressed GDT entry (6 bytes) against a more spread out (~7 bytes,
not aligned) interrupt trampoline. Note that the first one is data
cache the second one is instruction cache - the two have different
sizes, different implementations and different hit/miss pressures.
Generally the instruction-cache is the more precious resource and we
optimize for that first, for data cache second.

- A third effect is branch prediction: currently we are fanning
out all the vectors into ~240 branches just to recover a single
constant in essence. That is quite wasteful of instruction cache
resources, because from the logic side it's a data constant, not a
control flow difference. (we demultiplex that number into an
interrupt handler later on, but the CPU has no knowledge of that
relationship)

... all in one, the situation is complex enough on the CPU
architecture side for it to really necessiate a measurement in
practice, and that's why i have asked you to do them: the numbers need
to go hand in hand with the patch submission.

My estimation is that if we do it right, your approach will behave
better on modern CPUs (which is what matters most for such things),
especially on real workloads where there's a considerable
instruction-cache pressure. But it should be measured in any case.

Ingo

2008-11-04 16:45:28

by Alexander van Heukelum

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


On Tue, 4 Nov 2008 17:36:36 +0100, "Ingo Molnar" <[email protected]> said:
>
> * Alexander van Heukelum <[email protected]> wrote:
>
> > I wonder how the time needed for reading the GDT segments balances
> > against the time needed due to the extra redirection due to running
> > the stubs. I'ld be interested if the difference can be measured with
> > the current implementation. (I really need to highjack a machine to
> > do some measurements; I hoped someone would do it before I got to it
> > ;) )
> >
> > Even if some CPU's have some internal optimization for the case
> > where the gate segment is the same as the current one, I wonder if
> > it is really important... Interrupts that occur while the processor
> > is running userspace already cause changing segments. They are more
> > likely to be in cache, maybe.
>
> there are three main factors:
>
> - Same-value segment loads are optimized on most modern CPUs and can
> give a few cycles (2-3) advantage. That might or might not apply to
> the microcode that does IRQ entry processing. (A cache miss will
> increase the cost much more but that is true in general as well)
>
> - A second effect is that the changed data structure layout: a more
> compressed GDT entry (6 bytes) against a more spread out (~7 bytes,
> not aligned) interrupt trampoline. Note that the first one is data
> cache the second one is instruction cache - the two have different
> sizes, different implementations and different hit/miss pressures.
> Generally the instruction-cache is the more precious resource and we
> optimize for that first, for data cache second.
>
> - A third effect is branch prediction: currently we are fanning
> out all the vectors into ~240 branches just to recover a single
> constant in essence. That is quite wasteful of instruction cache
> resources, because from the logic side it's a data constant, not a
> control flow difference. (we demultiplex that number into an
> interrupt handler later on, but the CPU has no knowledge of that
> relationship)
>
> ... all in one, the situation is complex enough on the CPU
> architecture side for it to really necessiate a measurement in
> practice, and that's why i have asked you to do them: the numbers need
> to go hand in hand with the patch submission.
>
> My estimation is that if we do it right, your approach will behave
> better on modern CPUs (which is what matters most for such things),
> especially on real workloads where there's a considerable
> instruction-cache pressure. But it should be measured in any case.

Fully agreed. I will do some measurements in the near future, maybe
next week. At least no one came up with an absolutely blocking problem
with this approach ;).

Greetings,
Alexander

> Ingo
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - IMAP accessible web-mail

2008-11-04 16:51:56

by Cyrill Gorcunov

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

[Alexander van Heukelum - Tue, Nov 04, 2008 at 05:23:09PM +0100]
...
|
| I did some timings using the little program below (32-bit only), doing
| 1024 times the same sequence. TEST1 is just pushing a constant onto
| the stack; TEST2 is pushing the cs register; TEST3 is the sequence
| from the patch to extract the vector number from the cs register.
|
| Opteron (cycles): 1024 / 1157 / 3527
| Xeon E5345 (cycles): 1092 / 1085 / 6622
| Athlon XP (cycles): 1028 / 1166 / 5192

Xeon is definitely out of luck :-)

|
| I'ld say that the cost of the push %cs itself is negligible.
|
| > ( another advantage is that the 6 bytes GDT descriptor is more
| > compressed and hence uses up less L1/L2 cache footprint than the
| > larger (~7 byte) trampolines we have at the moment. )
|
| A GDT descriptor has to be read and processed anyhow... It might
| just not be in cache. But at least it is aligned. The trampolines
| are 7 bytes (irq#<128) or 10 bytes (irq#>127) on i386 and x86_64.
| And one is data, and the other is code, which might also cause
| different behaviour. It's just a bit too complicated to decide by
| just reasoning about it ;).
|
| > plus it's possible to observe the typical cost of irqs from user-space
| > as well: run a task on a single CPU and save away all the RDTSC deltas
| > that are larger than ~10 cycles - these will be the IRQ entry costs.
| > Print out these deltas after 60 seconds of runtime (or something like
| > that), and look at the histogram.
|
| I'll see if I can do that. Maybe in a few days...
|
| Thanks,
| Alexander
|
| > Ingo
...

- Cyrill -

2008-11-04 16:55:04

by Ingo Molnar

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Alexander van Heukelum <[email protected]> wrote:

> > My estimation is that if we do it right, your approach will behave
> > better on modern CPUs (which is what matters most for such
> > things), especially on real workloads where there's a considerable
> > instruction-cache pressure. But it should be measured in any case.
>
> Fully agreed. I will do some measurements in the near future, maybe
> next week. At least noone came up with an absolutely blocking
> problem with this approach ;).

how about "it does not build with lguest enabled" as a blocking
problem? ;-)

arch/x86/lguest/built-in.o: In function `lguest_init_IRQ':
boot.c:(.init.text+0x33f): undefined reference to `interrupt'

config attached.

Ingo


Attachments:
config (59.89 kB)

2008-11-04 16:56:39

by Ingo Molnar

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Ingo Molnar <[email protected]> wrote:

> * Alexander van Heukelum <[email protected]> wrote:
>
> > > My estimation is that if we do it right, your approach will behave
> > > better on modern CPUs (which is what matters most for such
> > > things), especially on real workloads where there's a considerable
> > > instruction-cache pressure. But it should be measured in any case.
> >
> > Fully agreed. I will do some measurements in the near future, maybe
> > next week. At least noone came up with an absolutely blocking
> > problem with this approach ;).
>
> how about "it does not build with lguest enabled" as a blocking
> problem? ;-)
>
> arch/x86/lguest/built-in.o: In function `lguest_init_IRQ':
> boot.c:(.init.text+0x33f): undefined reference to `interrupt'
>
> config attached.

... other than that it booted fine on a few testboxes here. That's
still not an exhaustive test by any means, but it's promising.

Ingo

2008-11-04 16:56:53

by Andi Kleen

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

> not taking into account the cost of cs reading (which I
> don't suspect to be that expensive apart from writting,

GDT accesses have an implied LOCK prefix. Especially
on some older CPUs that could be slow.

I don't know if it's a problem or not but it would need
some careful benchmarking on different systems to make sure interrupt
latencies are not impacted.

Another reason I would be also careful with this patch is that
it will likely trigger slow paths in JITs like qemu/vmware/etc.

Also code segment switching is likely not something that
current and future micro architectures will spend a lot of time optimizing.

I'm not sure that risk is worth the small improvement in code
size.

An alternative BTW to having all the stubs in the executable
would be to just dynamically generate them when the interrupt
is set up. Then you would only have the stubs around for the
interrupts which are actually used.

-Andi

2008-11-04 16:58:52

by Ingo Molnar

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Cyrill Gorcunov <[email protected]> wrote:

> [Alexander van Heukelum - Tue, Nov 04, 2008 at 05:23:09PM +0100]
> ...
> |
> | I did some timings using the little program below (32-bit only), doing
> | 1024 times the same sequence. TEST1 is just pushing a constant onto
> | the stack; TEST2 is pushing the cs register; TEST3 is the sequence
> | from the patch to extract the vector number from the cs register.
> |
> | Opteron (cycles): 1024 / 1157 / 3527
> | Xeon E5345 (cycles): 1092 / 1085 / 6622
> | Athlon XP (cycles): 1028 / 1166 / 5192
>
> Xeon is defenitely out of luck :-)

it's still OK - i.e. no outrageous showstopper overhead anywhere in
that instruction sequence. The total round-trip overhead is what will
matter most.

Ingo

2008-11-04 16:59:33

by Alexander van Heukelum

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


On Tue, 4 Nov 2008 17:54:09 +0100, "Ingo Molnar" <[email protected]> said:
>
> * Alexander van Heukelum <[email protected]> wrote:
>
> > > My estimation is that if we do it right, your approach will behave
> > > better on modern CPUs (which is what matters most for such
> > > things), especially on real workloads where there's a considerable
> > > instruction-cache pressure. But it should be measured in any case.
> >
> > Fully agreed. I will do some measurements in the near future, maybe
> > next week. At least noone came up with an absolutely blocking
> > problem with this approach ;).
>
> how about "it does not build with lguest enabled" as a blocking
> problem? ;-)

Blocking for applying the patch as is, sure... As a proof of concept
it is still fine ;).

> arch/x86/lguest/built-in.o: In function `lguest_init_IRQ':
> boot.c:(.init.text+0x33f): undefined reference to `interrupt'

It just needs to be fixed. I guess similar problems are to
be expected with xen or um.

Thanks,
Alexander

> config attached.
>
> Ingo
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - mmm... Fastmail...

2008-11-04 17:14:17

by Cyrill Gorcunov

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

[Ingo Molnar - Tue, Nov 04, 2008 at 05:58:11PM +0100]
|
| * Cyrill Gorcunov <[email protected]> wrote:
|
| > [Alexander van Heukelum - Tue, Nov 04, 2008 at 05:23:09PM +0100]
| > ...
| > |
| > | I did some timings using the little program below (32-bit only), doing
| > | 1024 times the same sequence. TEST1 is just pushing a constant onto
| > | the stack; TEST2 is pushing the cs register; TEST3 is the sequence
| > | from the patch to extract the vector number from the cs register.
| > |
| > | Opteron (cycles): 1024 / 1157 / 3527
| > | Xeon E5345 (cycles): 1092 / 1085 / 6622
| > | Athlon XP (cycles): 1028 / 1166 / 5192
| >
| > Xeon is defenitely out of luck :-)
|
| it's still OK - i.e. no outrageous showstopper overhead anywhere in
| that instruction sequence. The total round-trip overhead is what will
| matter most.
|
| Ingo
|

Please don't get me wrong, I really like what Alexander has done!
But frankly, six times slower scares me a bit.

- Cyrill -

2008-11-04 17:29:37

by Alexander van Heukelum

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


On Tue, 4 Nov 2008 20:13:46 +0300, "Cyrill Gorcunov"
<[email protected]> said:
> [Ingo Molnar - Tue, Nov 04, 2008 at 05:58:11PM +0100]
> |
> | * Cyrill Gorcunov <[email protected]> wrote:
> |
> | > [Alexander van Heukelum - Tue, Nov 04, 2008 at 05:23:09PM +0100]
> | > ...
> | > |
> | > | I did some timings using the little program below (32-bit only),
> doing
> | > | 1024 times the same sequence. TEST1 is just pushing a constant onto
> | > | the stack; TEST2 is pushing the cs register; TEST3 is the sequence
> | > | from the patch to extract the vector number from the cs register.
> | > |
> | > | Opteron (cycles): 1024 / 1157 / 3527
> | > | Xeon E5345 (cycles): 1092 / 1085 / 6622
> | > | Athlon XP (cycles): 1028 / 1166 / 5192
> | >
> | > Xeon is defenitely out of luck :-)
> |
> | it's still OK - i.e. no outrageous showstopper overhead anywhere in
> | that instruction sequence. The total round-trip overhead is what will
> | matter most.
> |
> | Ingo
> |
>
> Don't get me wrong please, I really like what Alexander have done!
> But frankly six time slower is a bit scarying me.

Thanks again ;). Now it _is_ six times slower to do this tiny
piece of code... But please keep in mind all the activity that
follows to save the current data segment registers (the stack
segment and code segment are saved automatically), the general
purpose registers and to load most of the data segments with
kernel-space values. And looking at it now... do_IRQ is also
not exactly trivial.

Also, I kept the information that is saved on the stack
exactly the same. If this is not a requirement, "push %cs"
is what is left of this expensive (6 cycle!) sequence.
Even that could be unnecessary if the stack layout can
be changed... But I'd like to consider that separately.

Greetings,
Alexander

> - Cyrill -
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - A no graphics, no pop-ups email service

2008-11-04 17:39:43

by Alexander van Heukelum

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


On Tue, 4 Nov 2008 17:54:09 +0100, "Ingo Molnar" <[email protected]> said:
>
> * Alexander van Heukelum <[email protected]> wrote:
>
> > > My estimation is that if we do it right, your approach will behave
> > > better on modern CPUs (which is what matters most for such
> > > things), especially on real workloads where there's a considerable
> > > instruction-cache pressure. But it should be measured in any case.
> >
> > Fully agreed. I will do some measurements in the near future, maybe
> > next week. At least noone came up with an absolutely blocking
> > problem with this approach ;).
>
> how about "it does not build with lguest enabled" as a blocking
> problem? ;-)
>
> arch/x86/lguest/built-in.o: In function `lguest_init_IRQ':
> boot.c:(.init.text+0x33f): undefined reference to `interrupt'

The following makes it compile... Whether it works is a different
question ;).

index a5d8e1a..ad7e292 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -580,6 +580,7 @@ static struct irq_chip lguest_irq_controller = {
* interrupt (except 128, which is used for system calls), and then tells the
* Linux infrastructure that each interrupt is controlled by our level-based
* lguest interrupt controller. */
+void maininterrupt(void);
static void __init lguest_init_IRQ(void)
{
unsigned int i;
@@ -590,7 +591,7 @@ static void __init lguest_init_IRQ(void)
* a straightforward 1 to 1 mapping, so force that here.
*/
__get_cpu_var(vector_irq)[vector] = i;
if (vector != SYSCALL_VECTOR) {
- set_intr_gate(vector, interrupt[vector]);
+ set_intr_gate(vector, maininterrupt);
set_irq_chip_and_handler_name(i, &lguest_irq_controller,
                              handle_level_irq, "level");

> config attached.
>
> Ingo
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - One of many happy users:
http://www.fastmail.fm/docs/quotes.html

2008-11-04 18:07:19

by Alexander van Heukelum

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Tue, 4 Nov 2008 18:05:01 +0100, "Andi Kleen" <[email protected]>
said:
> > not taking into account the cost of cs reading (which I
> > don't suspect to be that expensive apart from writting,
>
> GDT accesses have an implied LOCK prefix. Especially
> on some older CPUs that could be slow.
>
> I don't know if it's a problem or not but it would need
> some careful benchmarking on different systems to make sure interrupt
> latencies are not impacted.

That's good to know. I assume this LOCKed bus cycle only occurs
if the (hidden) segment information is not cached in some way?
How many segments are typically cached? In particular, does it
optimize switching between two segments?

> Another reason I would be also careful with this patch is that
> it will likely trigger slow paths in JITs like qemu/vmware/etc.

Software can be fixed ;).

> Also code segment switching is likely not something that
> current and future micro architectures will spend a lot of time
> optimizing.
>
> I'm not sure that risk is worth the small improvement in code
> size.

I think it is worth exploring a bit more. I feel it should be
a neutral change worst-case performance-wise, but I really
think the new code is more readable/understandable.

> An alternative BTW to having all the stubs in the executable
> would be to just dynamically generate them when the interrupt
> is set up. Then you would only have the stubs around for the
> interrupts which are actually used.

I was trying to simplify things, not make it even less
transparent ;).

> -Andi
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - A no graphics, no pop-ups email service

2008-11-04 18:15:34

by H. Peter Anvin

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Alexander van Heukelum wrote:
>
> That's good to know. I assume this LOCKed bus cycle only occurs
> if the (hidden) segment information is not cached in some way?
> How many segments are typically cached? In particular, does it
> optimize switching between two segments?
>

Yes, there is a segment descriptor cache (as opposed to the hidden but
architectural segment descriptor *registers*, which the Intel
documentation confusingly calls a "cache".)

It is used to optimize switching between a small number of segments, and
was crucial for decent performance on Win9x, which contained a bunch of
16-bit code.

-hpa

2008-11-04 18:44:35

by Alexander van Heukelum

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


On Tue, 04 Nov 2008 10:14:11 -0800, "H. Peter Anvin" <[email protected]>
said:
> Alexander van Heukelum wrote:
> >
> > That's good to know. I assume this LOCKed bus cycle only occurs
> > if the (hidden) segment information is not cached in some way?
> > How many segments are typically cached? In particular, does it
> > optimize switching between two segments?
> >
>
> Yes, there is a segment descriptor cache (as opposed to the hidden but
> architectural segment descriptor *registers*, which the Intel
> documentation confusingly calls a "cache".)
>
> It is used to optimize switching between a small number of segments, and
> was crucial for decent performance on Win9x, which contained a bunch of
> 16-bit code.

Thanks for the info!

This just means that if there are performance problems, the
'specialized'
handlers should be using the kernel segment or maybe a single common
segment. It would still allow us to get rid of the trampolines. A stack
trace should be enough to reconstruct which vector was originally called
in that case. Only the common_interrupt-codepath needs the original
vector as far as I can see.
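
(Just to illustrate -- this is a minimal C sketch, not the patch
itself, and the constant name is made up; it assumes the layout from
the patch, where the last 256 entries of the 512-entry GDT are the
per-vector code segments:)

#define FIRST_IRQ_SEGMENT_INDEX 256	/* assumed: second half of the 512-entry GDT */

/* Recover the vector from the saved %cs: selector -> GDT index -> vector. */
static inline unsigned int vector_from_cs(unsigned short cs)
{
	return (cs >> 3) - FIRST_IRQ_SEGMENT_INDEX;
}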

You just made testing on larger machines with a lot of external
interrupts necessary :-/. (Assuming small machines do not show
performance problems, that is.)

Greetings,
Alexander

> -hpa
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - I mean, what is it about a decent email service?

2008-11-04 19:08:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Alexander van Heukelum wrote:
>
> Thanks for the info!
>
> This just means that if there are performance problems, the
> 'specialized'
> handlers should be using the kernel segment or maybe a single common
> segment. It would still allow us to get rid of the trampolines. A stack
> trace should be enough to reconstruct which vector was originally called
> in that case. Only the common_interrupt-codepath needs the original
> vector as far as I can see.
>
> You just made testing on larger machines with a lot of external
> interrupts necessary :-/. (Assuming small machines do not show
> performance problems, that is.)
>

Overall, it more than anything else shows the x86 architectural
braindamage of not having an interrupt vector number register available
anywhere. However, I suspect using segment registers is liable to
suffer from a "wall of jello" effect once you overflow the segment
descriptor cache, which will typically be around 32 entries in size.

Now, at least on my kernel, the existing IRQ stubs are rather weird:
they are padded to 8 bytes and then misaligned onto a 4-byte boundary.
Furthermore, at the cost of an extra jump, they can trivially be
compressed down to 4 bytes apiece instead of 7-padded-to-8.

-hpa

2008-11-04 19:34:23

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Okay, looking at this some more, the current interrupt stubs are just
plain braindead.

We have a large number of push instructions which save a negative
number, even when that means using the full 5-byte form; then we use:

unsigned vector = ~regs->orig_ax;

in do_IRQ. This is utterly moronic; if we use the short form push at
all times, then we can set the upper bits (which distinguish us from a
system call entry) at leisure (via a simple orl in common code), rather
than in each stub, which, to boot, bloats it above the 8-byte mark.

That way, each stub has the simple form:

6A xx E9 yy yy yy yy 90

Down to 8 bytes, including one byte of padding. Already better - we are
down to 2K total, and each stub is aligned.
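
(Purely to illustrate that byte layout -- this is not kernel code, and
the helper and its arguments are made up -- one such 8-byte slot could
be generated like this:)

#include <stdint.h>
#include <string.h>

/*
 * Illustrative only: fill one 8-byte stub slot with
 * "push $vector; jmp common; nop", i.e. 6A xx E9 yy yy yy yy 90.
 * For vectors >= 0x80 the push sign-extends; common code would fix
 * up the high bits afterwards, as described above.
 */
static void emit_stub(uint8_t *slot, uint8_t vector, const uint8_t *common)
{
	int32_t rel = (int32_t)(common - (slot + 7));	/* rel32 is relative to the next insn */

	slot[0] = 0x6a;			/* push imm8 (sign-extended) */
	slot[1] = vector;
	slot[2] = 0xe9;			/* jmp rel32 */
	memcpy(&slot[3], &rel, sizeof(rel));
	slot[7] = 0x90;			/* nop: pad to 8 bytes */
}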

Now, we can do better than that at the cost of an extra branch. The
extra branch, however, is a direct unconditional branch and so is not
subject to misprediction (good), although it may end up taking an extra
icache miss (bad):

We can group our vectors in 8 groups of 32 vectors each. Each contains a
stub of the form:

6A xx EB yy

... which jump to a common jump instruction at the end of each group.
Thus, each group takes 32*4+5 bytes+3 bytes for alignment = 136 bytes,
for a total of 1088 bytes.

This has two disadvantages:
- an extra jump.
- we can no longer redirect a stub away from common code by
changing the branch in that slot. We have to instead modify
the IDT. This means "dedicated" interrupts don't get the
vector number at all, which is probably fine -- to be honest,
I'm not sure if they do at the moment either.

Fixing the first of these I think is a no-brainer. That will cut the
size of the existing stub pool by almost half. The second is more of a
judgement call, and I'd like to see performance numbers for it. Either
which way, I think it's worthwhile to consider this as an alternative to
playing segmentation tricks, which I think could have really nasty
side effects.

-hpa

2008-11-04 20:02:30

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Ingo Molnar wrote:
> .. which we were able to avoid before. A couple of segment register
> accesses, shifts, etc to calculate the vector - each of which can be
> quite costly (especially the segment register access - this is a
> relatively rare instruction pattern).
>
> I'm not unconvicable, but we need to be conservative here: could you
> try to measure the full before/after cost of IRQ entry, to the cycle
> level? I'm curious what the performance impact is.
>
> Also, this makes life probably a bit harder for Xen, which assumes
> that the GDT of the guest OS is small-ish. (Jeremy Cc:-ed)
>

It doesn't increase the GDT to more than one page, so there's no issue
there. The only reason the GDT uses a whole page is because of Xen's
requirements anyway, so if we can make good use of the rest of the
entries, so much the better.

The other possible concern with Xen is whether Xen will properly load an
arbitrary %cs on exception entry, or if it always loads KERNEL_CS; looks
like it will load any %cs, so we should be fine there.

Overall the patch looks good. Saving a segment register should be much
faster than loading it, so I don't think the %cs read on entry should
cost too much, but reloading %cs with KERNEL_CS might be a bit more of a
cost (or does it run the whole exception with the new %cs?).

J

2008-11-04 20:02:49

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Ingo Molnar wrote:
> ( another advantage is that the 6 bytes GDT descriptor is more
> compressed and hence uses up less L1/L2 cache footprint than the
> larger (~7 byte) trampolines we have at the moment. )
>

Also it's D cache rather than I cache, which is generally more
plentiful. However, I think the cost of GDT cache misses on exception
latency is something that we've largely overlooked, and this will make
it a bigger factor (vs cache misses on the actual exception handler code
itself, which should be reduced).

J

2008-11-04 20:06:37

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

H. Peter Anvin wrote:
> - we can no longer redirect a stub away from common code by
> changing the branch in that slot. We have to instead modify
> the IDT. This means "dedicated" interrupts don't get the
> vector number at all, which is probably fine -- to be honest,
> I'm not sure if they do at the moment either.
>

They do, and the various patches to multiplex things like IPIs across
multiple vectors rely on it.

J

2008-11-04 20:19:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
>> ( another advantage is that the 6 bytes GDT descriptor is more
>> compressed and hence uses up less L1/L2 cache footprint than the
>> larger (~7 byte) trampolines we have at the moment. )
>>
>
> Also it's D cache rather than I cache, which is generally more
> plentiful. However, I think the cost of GDT cache misses on exception
> latency is something that we've largely overlooked, and this will make
> it a bigger factor (vs cache misses on the actual exception handler code
> itself, which should be reduced).
>

Besides, a GDT entry is 8 bytes, not 6.

-hpa

2008-11-04 20:22:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

> Fixing the first of these I think is a no-brainer. That will cut the
> size of the existing stub pool by almost half. The second is more of a
> judgement call, and I'd like to see performance numbers for it. Either
> which way, I think it's worthwhile to consider this as an alternative to
> playing segmentation tricks, which I think could have really nasty
> side effects.

Or again just generate them on demand when the interrupt is set up.
If you really have 240 interrupt sources you can afford the 5k likely,
but for most there will be only a minimum number of stubs.

Although frankly I suspect there are far easier ways to save 5k of memory.

-Andi

2008-11-04 20:26:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Andi Kleen wrote:
>
> Or again just generate them on demand when the interrupt is set up.
> If you really have 240 interrupt sources you can afford the 5k likely,
> but for most there will be only a minimum number of stubs.
>
> Although frankly I suspect there are far easier ways to save 5k of memory.
>

Generating them dynamically is probably pretty ugly too, though.

Shrinking the whole table down to 2K by just regularizing the structure
is trivial, though, and should almost certainly be a win. The more
esoteric ideas are probably worse.

-hpa

2008-11-04 20:38:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Tue, Nov 04, 2008 at 12:26:13PM -0800, H. Peter Anvin wrote:
> Andi Kleen wrote:
> >
> > Or again just generate them on demand when the interrupt is set up.
> > If you really have 240 interrupt sources you can afford the 5k likely,
> > but for most there will be only a minimum number of stubs.
> >
> > Although frankly I suspect there are far easier ways to save 5k of memory.
> >
>
> Generating them dynamically is probably pretty ugly too, though.

Why? The only slightly tricky thing is that they need to be in non-NX (executable) memory.
Then it's just a few bytes patched in a template.

> Shrinking the whole table down to 2K by just regularizing the structure
> is trivial, though, and should almost certainly be a win. The more
> esoteric ideas are probably worse.

Just think how much memory you could save elsewhere with the same
effort.

-Andi

2008-11-04 20:44:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Alexander van Heukelum <[email protected]> wrote:

> On Tue, 4 Nov 2008 18:05:01 +0100, "Andi Kleen" <[email protected]>
> said:
> > > not taking into account the cost of cs reading (which I
> > > don't suspect to be that expensive apart from writing,
> >
> > GDT accesses have an implied LOCK prefix. Especially
> > on some older CPUs that could be slow.
> >
> > I don't know if it's a problem or not but it would need
> > some careful benchmarking on different systems to make sure interrupt
> > latencies are not impacted.

That's not a real issue on anything produced in this decade as we have
had per CPU GDTs in Linux for about a decade as well.

It's only an issue on ancient CPUs that export all their LOCKed cycles
to the bus. Pentium and older or so. The PPro got it right already.

What matters is what i said before: the actual raw cycle count before
and after the patch, on the two main classes of CPUs, and the amount
of icache we can save.

> That's good to know. I assume this LOCKed bus cycle only occurs if
> the (hidden) segment information is not cached in some way? How many
> segments are typically cached? In particular, does it optimize
> switching between two segments?
>
> > Another reason I would be also careful with this patch is that it
> > will likely trigger slow paths in JITs like qemu/vmware/etc.
>
> Software can be fixed ;).

Yes, and things like vmware were never a reason to hinder Linux.

> > Also code segment switching is likely not something that current
> > and future micro architectures will spend a lot of time
> > optimizing.
> >
> > I'm not sure that risk is worth the small improvement in code
> > size.
>
> I think it is worth exploring a bit more. I feel it should be a
> neutral change worst-case performance-wise, but I really think the
> new code is more readable/understandable.

It's all measurable, so the vague "risk" mentioned above can be
dispelled via hard numbers.

> > An alternative BTW to having all the stubs in the executable would
> > be to just dynamically generate them when the interrupt is set up.
> > Then you would only have the stubs around for the interrupts which
> > are actually used.
>
> I was trying to simplify things, not make it even less transparent
> ;).

yep, the complexity of dynamic stubs is the last thing we need here.

And as hpa's comments point out, compressing the rather stupid irq
stubs might be a third option that looks promising as well.

Ingo

2008-11-04 20:58:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Tue, Nov 04, 2008 at 09:44:00PM +0100, Ingo Molnar wrote:
>
> * Alexander van Heukelum <[email protected]> wrote:
>
> > On Tue, 4 Nov 2008 18:05:01 +0100, "Andi Kleen" <[email protected]>
> > said:
> > > > not taking into account the cost of cs reading (which I
> > > > don't suspect to be that expensive apart from writing,
> > >
> > > GDT accesses have an implied LOCK prefix. Especially
> > > on some older CPUs that could be slow.
> > >
> > > I don't know if it's a problem or not but it would need
> > > some careful benchmarking on different systems to make sure interrupt
> > > latencies are not impacted.
>
> That's not a real issue on anything produced in this decade as we have
> had per CPU GDTs in Linux for about a decade as well.
>
> It's only an issue on ancient CPUs that export all their LOCKed cycles
> to the bus. Pentium and older or so. The PPro got it right already.

??? LOCK slowness is not because of the bus. And I know you know
that Ingo, so I don't know why you wrote that bogosity above.

> What matters is what i said before: the actual raw cycle count before
> and after the patch, on the two main classes of CPUs, and the amount

iirc there are between three and five classes of CPUs that
matter (P6, K8, P4 and possibly Atom and C3). But I would only
expect P4 to be a real problem.

> > That's good to know. I assume this LOCKed bus cycle only occurs if
> > the (hidden) segment information is not cached in some way? How many
> > segments are typically cached? In particular, does it optimize
> > switching between two segments?
> >
> > > Another reason I would be also careful with this patch is that it
> > > will likely trigger slow paths in JITs like qemu/vmware/etc.
> >
> > Software can be fixed ;).
>
> Yes, and things like vmware were never a reason to hinder Linux.

Hopefully the users agree with you on that.

But anyway, having to fix the JITs for saving 3-5k of memory would seem
like a bad payoff in terms of effort:gain. Yes, I know you personally
wouldn't need to fix them, but wasting other engineers' time is nearly
as bad as wasting your own.

> > > An alternative BTW to having all the stubs in the executable would
> > > be to just dynamically generate them when the interrupt is set up.
> > > Then you would only have the stubs around for the interrupts which
> > > are actually used.
> >
> > I was trying to simplify things, not make it even less transparent
> > ;).

Doesn't make sense to me. The current code is not complex at all,
just not particularly efficient. Yours might be better (at some
risk), but simpler is probably not the right word to describe it.

>
> yep, the complexity of dynamic stubs is the last thing we need here.

I don't think it's particularly complex. You just have a few bytes
and you fill in the number and the target.

-Andi

2008-11-04 21:30:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Ingo Molnar <[email protected]> wrote:

> And as hpa's comments point it out, compressing the rather stupid
> irq stubs might be a third option that looks promising as well.

... and we should try and see how far we can compress those stubs,
before we do any segment register based tricks.

Ingo

2008-11-04 21:37:15

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> And as hpa's comments point it out, compressing the rather stupid
>> irq stubs might be a third option that looks promising as well.
>
> ... and we should try and see how far we can compress those stubs,
> before we do any segment register based tricks.
>

Using the techniques previously mentioned, for 224 vectors:

1792 bytes ( 8 bytes/stub) - trivial.
1568 bytes ( 7 bytes/stub) - same without alignment.
952 bytes (~4 bytes/stub) - extra jump needed.

For comparison, the IDT itself is 2048 bytes on x86-32 and 4096 bytes on
x86-64.
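
(For reference, that is 224 x 8 = 1792, 224 x 7 = 1568, and 7 groups x
136 bytes = 952, using the 136-byte group size from my earlier mail.)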

-hpa

2008-11-04 21:54:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* H. Peter Anvin <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Ingo Molnar <[email protected]> wrote:
> >
> >> And as hpa's comments point it out, compressing the rather stupid
> >> irq stubs might be a third option that looks promising as well.
> >
> > ... and we should try and see how far we can compress those stubs,
> > before we do any segment register based tricks.
> >
>
> Using the techniques previously mentioned, for 224 vectors:
>
> 1792 bytes ( 8 bytes/stub) - trivial.
> 1568 bytes ( 7 bytes/stub) - same without alignment.
> 952 bytes (~4 bytes/stub) - extra jump needed.
>
> For comparison, the IDT itself is 2048 bytes on x86-32 and 4096 bytes on
> x86-64.

sounds like a plan :)

Ingo

2008-11-05 00:42:42

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Andi Kleen wrote:
> ??? LOCK slowness is not because of the bus. And I know you know
> that Ingo, so I don't know why you wrote that bogosity above.
>

Why are the accesses locked? Is it because it does an update of the
accessed bit in the descriptor? (We should be pre-setting them all anyway.)

J

2008-11-05 00:55:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
>> ??? LOCK slowness is not because of the bus. And I know you know that
>> Ingo, so I don't know why you wrote that bogosity above.
>>
>
> Why are the accesses locked? Is it because it does an update of the
> accessed bit in the descriptor? (We should be pre-setting them all
> anyway.)
>

It is, but the locked access is unconditional. Similar to any other
read/modify/write transaction -- the write is required to release the lock.

-hpa

2008-11-05 10:27:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Andi Kleen <[email protected]> wrote:

> On Tue, Nov 04, 2008 at 09:44:00PM +0100, Ingo Molnar wrote:
> >
> > * Alexander van Heukelum <[email protected]> wrote:
> >
> > > On Tue, 4 Nov 2008 18:05:01 +0100, "Andi Kleen" <[email protected]>
> > > said:
> > > > > not taking into account the cost of cs reading (which I
> > > > > don't suspect to be that expensive apart from writing,
> > > >
> > > > GDT accesses have an implied LOCK prefix. Especially
> > > > on some older CPUs that could be slow.
> > > >
> > > > I don't know if it's a problem or not but it would need
> > > > some careful benchmarking on different systems to make sure interrupt
> > > > latencies are not impacted.
> >
> > That's not a real issue on anything produced in this decade as we have
> > had per CPU GDTs in Linux for about a decade as well.
> >
> > It's only an issue on ancient CPUs that export all their LOCKed
> > cycles to the bus. Pentium and older or so. The PPro got it right
> > already.
>
> ??? LOCK slowness is not because of the bus. And I know you know
> that Ingo, so I don't know why you wrote that bogosity above.

.. of course the historic LOCK slowness was all due to the system bus:
very old CPUs exported a LOCK signal to the system bus for every
LOCK-prefix access (implicit and explicit) and that made it _really_
expensive. (hundreds of cycles)

... on reasonably modern CPUs the LOCK-ed access has been abstracted
away to within the CPU, and the cost of LOCK-ed access is rather low
(think 10-20 cycles - of course only if there's no cache miss cost)
(That's obviously the case with the GDT, which is both per CPU and well
cached.)

on _really_ modern CPUs LOCK can be as cheap as just a few cycles - so
low that we can stop bothering about it in the future. There's no
fundamental physical reason why the LOCK prefix (implicit or explicit)
should be expensive.

the real reason why Alexander's patch needs to be measured is not the
LOCK cycle of GDT accesses but what i pointed out in my first mail:
the segmentation trick it plays. And that's why shrinking the stubs is
probably a better idea which should be tried first.

... anyway, the unacceptable tone of your reply shows that you still
have not changed a bit in your old habit of attacking and bullying
people on lkml. All the other Intel engineers i'm working with as a
maintainer show a very professional approach and are very easy to work
with. You need to stop your attacks, and until you change this
negative way of working with people i'll continue to ignore you.

Ingo

2008-11-05 17:53:40

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

[Ingo Molnar - Tue, Nov 04, 2008 at 10:52:45PM +0100]
|
| * H. Peter Anvin <[email protected]> wrote:
|
| > Ingo Molnar wrote:
| > > * Ingo Molnar <[email protected]> wrote:
| > >
| > >> And as hpa's comments point it out, compressing the rather stupid
| > >> irq stubs might be a third option that looks promising as well.
| > >
| > > ... and we should try and see how far we can compress those stubs,
| > > before we do any segment register based tricks.
| > >
| >
| > Using the techniques previously mentioned, for 224 vectors:
| >
| > 1792 bytes ( 8 bytes/stub) - trivial.
| > 1568 bytes ( 7 bytes/stub) - same without alignment.
| > 952 bytes (~4 bytes/stub) - extra jump needed.
| >
| > For comparison, the IDT itself is 2048 bytes on x86-32 and 4096 bytes on
| > x86-64.
|
| sounds like a plan :)
|
| Ingo
|

Ingo, what is the conclusion? As I understand it from the thread --

1) Implement Peter's proposed cleanup/compress.
2) Test Alexander's patch.

Did I miss something?

- Cyrill -

2008-11-05 18:06:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Cyrill Gorcunov wrote:
>
> Ingo, what is the conclusion? As I understand it from the thread --
>
> 1) Implement Peter's proposed cleanup/compress.
> 2) Test Alexander's patch.
>
> Did I miss something?
>

Nope, that's pretty much it.

However, there are good reasons to believe that using this kind of
segment selector tricks is probably a bad idea in the long term,
especially since CPU vendors have strong incentives to reduce the size
of the segment descriptor cache now when none of the mainstream OSes
rely on more than a small handful of segments.

I was planning to look at doing the obvious stub shrink today.

-hpa

2008-11-05 18:14:21

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

[H. Peter Anvin - Wed, Nov 05, 2008 at 10:04:50AM -0800]
| Cyrill Gorcunov wrote:
| >
| > Ingo, what is the conclusion? As I understand it from the thread --
| >
| > 1) Implement Peter's proposed cleanup/compress.
| > 2) Test Alexander's patch.
| >
| > Did I miss something?
| >
|
| Nope, that's pretty much it.
|
| However, there are good reasons to believe that using this kind of
| segment selector tricks is probably a bad idea in the long term,
| especially since CPU vendors have strong incentives to reduce the size
| of the segment descriptor cache now when none of the mainstream OSes
| rely on more than a small handful of segments.
|
| I was planning to look at doing the obvious stub shrink today.
|
| -hpa
|

I see. Thanks! Btw Peter, I remember reading about segment caches a
long time ago (back in my DOS programming days, actually). But there
were only 'common' statements that such a cache exists. Maybe it's
possible to know what the exact size of such a cache is? You
mentioned the number 32. (heh... I hadn't remembered it until you
mentioned such a cache :-)

- Cyrill -

2008-11-05 18:19:59

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

[Andi Kleen - Tue, Nov 04, 2008 at 06:05:01PM +0100]
| > not taking into account the cost of cs reading (which I
| > don't suspect to be that expensive apart from writing,
|
| GDT accesses have an implied LOCK prefix. Especially
| on some older CPUs that could be slow.
|
| I don't know if it's a problem or not but it would need
| some careful benchmarking on different systems to make sure interrupt
| latencies are not impacted.
|
| Another reason I would be also careful with this patch is that
| it will likely trigger slow paths in JITs like qemu/vmware/etc.
|
| Also code segment switching is likely not something that
| current and future micro architectures will spend a lot of time optimizing.
|
| I'm not sure that risk is worth the small improvement in code
| size.
|
| An alternative BTW to having all the stubs in the executable
| would be to just dynamically generate them when the interrupt
| is set up. Then you would only have the stubs around for the
| interrupts which are actually used.
|
| -Andi
|

Thanks a lot for comments, Andi!

- Cyrill -

2008-11-05 18:21:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Cyrill Gorcunov wrote:
>
> I see. Thanks! Btw Peter, I remember reading about segment caches a
> long time ago (back in my DOS programming days, actually). But there
> were only 'common' statements that such a cache exists. Maybe it's
> possible to know what the exact size of such a cache is? You
> mentioned the number 32. (heh... I hadn't remembered it until you
> mentioned such a cache :-)
>

As with any other caching structure, you can discover its size,
associativity, and replacement policy by artificially trying to provoke
patterns that produce pathological timings.
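
(A rough sketch of such a probe -- illustrative only, not a tuned
benchmark: put N data segments in the LDT via modify_ldt() and time %es
reloads across them with RDTSC; once N exceeds the descriptor cache
size, the cycles per reload should jump:)

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/ldt.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { NSEG = 64, LOOPS = 100000 };	/* vary NSEG to find the knee */
	struct user_desc d;
	unsigned short old_es;
	uint64_t t0, t1;
	int i, j;

	/* Create NSEG flat 32-bit data segments in the LDT. */
	for (i = 0; i < NSEG; i++) {
		memset(&d, 0, sizeof(d));
		d.entry_number = i;
		d.limit = 0xfffff;
		d.seg_32bit = 1;
		d.limit_in_pages = 1;
		d.useable = 1;
		if (syscall(SYS_modify_ldt, 1, &d, sizeof(d)) != 0)
			return 1;
	}

	asm volatile("mov %%es, %0" : "=r" (old_es));
	t0 = rdtsc();
	for (j = 0; j < LOOPS; j++)
		for (i = 0; i < NSEG; i++)	/* LDT selector: (index << 3) | 7 */
			asm volatile("mov %0, %%es"
				     : : "r" ((unsigned short)((i << 3) | 7)));
	t1 = rdtsc();
	asm volatile("mov %0, %%es" : : "r" (old_es));

	printf("%.1f cycles per %%es reload over %d segments\n",
	       (double)(t1 - t0) / ((uint64_t)LOOPS * NSEG), NSEG);
	return 0;
}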

At Transmeta, at one time we used a 32-entry direct-mapped cache, which
ended up with a ~96% hit rate on common Win95 benchmarks.

I should, however, make it clear that there are other alternatives for
speeding up segment descriptor loading, and not all of them rely on a cache.

-hpa

2008-11-05 18:26:40

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

[H. Peter Anvin - Wed, Nov 05, 2008 at 10:20:23AM -0800]
| Cyrill Gorcunov wrote:
| >
| > I see. Thanks! Btw Peter, I remember reading about segment caches a
| > long time ago (back in my DOS programming days, actually). But there
| > were only 'common' statements that such a cache exists. Maybe it's
| > possible to know what the exact size of such a cache is? You
| > mentioned the number 32. (heh... I hadn't remembered it until you
| > mentioned such a cache :-)
| >
|
| As with any other caching structure, you can discover its size,
| associativity, and replacement policy by artificially trying to provoke
| patterns that produce pathological timings.
|
| At Transmeta, at one time we used a 32-entry direct-mapped cache, which
| ended up with a ~96% hit rate on common Win95 benchmarks.
|
| I should, however, make it clear that there are other alternatives for
| speeding up segment descriptor loading, and not all of them rely on a cache.
|
| -hpa
|

Thanks a lot for explanation!

- Cyrill -

2008-11-06 09:16:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Jeremy Fitzhardinge <[email protected]> wrote:

> Why are the accesses locked? Is it because it does an update of the
> accessed bit in the descriptor? (We should be pre-setting them all
> anyway.)

yes, the accessed bit in the segment descriptor has to be updated in
an atomic transaction: the CPU has to do a MESI coherent
read+compare+write transaction, without damaging other updates to the
6 bytes segment descriptor.

Old OSs implemented paging to disk by swapping out segments based on
the accessed bit, and clearing the present and accessed bit when the
segment is swapped out.

But given that all our GDT entries have the accessed bit set on Linux,
there's no physical reason why the CPU should be using a locked cycle
here - only to stay compatible with ancient stuff.
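
(As an aside, the bit itself is easy to see in a descriptor -- an
illustrative snippet, not the kernel's desc.h code; the example value
is just a flat 4 GiB ring-0 code segment with the accessed bit set:)

#include <stdint.h>

/* The access byte is bits 40-47 of the 8-byte descriptor; the
 * "accessed" bit is bit 0 of the type field, i.e. bit 40 overall. */
static inline int descriptor_accessed(uint64_t desc)
{
	return (int)((desc >> 40) & 1);
}

/* Example: flat 4 GiB ring-0 code segment, readable, accessed. */
static const uint64_t example_code_desc = 0x00cf9b000000ffffULL;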

So ... that notion just survived in the backwards-compatibility stream
of CPU enhancements, over the past 10 years.

On 64-bit Linux there's no reason to maintain that principle, so i'd
expect future CPUs to relax this even more, were it ever to show up on
the performance radar. Note that SYSCALL/SYSRET already optimize that
away.

Ingo

2008-11-06 09:20:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Alexander van Heukelum <[email protected]> wrote:

> > | > | Opteron (cycles): 1024 / 1157 / 3527
> > | > | Xeon E5345 (cycles): 1092 / 1085 / 6622
> > | > | Athlon XP (cycles): 1028 / 1166 / 5192
> > | >
> > | > Xeon is definitely out of luck :-)
> > |
> > | it's still OK - i.e. no outrageous showstopper overhead anywhere in
> > | that instruction sequence. The total round-trip overhead is what will
> > | matter most.
> > |
> > | Ingo
> > |
> >
> > Don't get me wrong please, I really like what Alexander has done!
> > But frankly six times slower scares me a bit.

the cost is 6 cycles instead of 1 cycle. In a codepath that takes
thousands of cycles and is often cache-limited.

> Thanks again ;). Now it _is_ six times slower to do this tiny piece
> of code... But please keep in mind all the activity that follows to
> save the current data segment registers (the stack segment and code
> segment are saved automatically), the general purpose registers and
> to load most of the data segments with kernel-space values. And
> looking at it now... do_IRQ is also not exactly trivial.
>
> Also, I kept the information that is saved on the stack exactly the
> same. If this is not a requirement, "push %cs" is what is left of
> this expensive (6 cycle!) sequence. Even that could be unnecessary
> if the stack layout can be changed... But I'd like to consider that
> separately.

we really want to keep the stack frame consistent between all the
context types. We can do things like return-to-userspace-from-irq or
schedule-from-irq-initiated-event, etc. - so crossing between these
context frames has to be standard and straightforward.

Ingo

2008-11-06 09:30:11

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Ingo Molnar wrote:
>
> yes, the accessed bit in the segment descriptor has to be updated in
> an atomic transaction: the CPU has to do a MESI coherent
> read+compare+write transaction, without damaging other updates to the
> 6 bytes segment descriptor.
>

8 bytes, rather.

2008-11-06 09:30:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* H. Peter Anvin <[email protected]> wrote:

> Ingo Molnar wrote:
>>
>> yes, the accessed bit in the segment descriptor has to be updated
>> in an atomic transaction: the CPU has to do a MESI coherent
>> read+compare+write transaction, without damaging other updates to
>> the 6 bytes segment descriptor.
>
> 8 bytes, rather.

heh, yes of course :-)

Ingo

2008-11-10 01:31:05

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Alexander van Heukelum wrote:
>
> In general: after applying the patch, latencies are more
> often seen by the rdtsctest. It also seems to cause a
> small percentage decrease in speed of hackbench.
> Looking at the latency histograms I believe this is
> a real effect, but I could not do enough boots/runs to
> make this a certainty from the runtimes alone.
>
> At least for this PC, doing hpa's suggested cleanup of
> the stub table is the right way to go for now... A
> second option would be to get rid of the stub table by
> assigning each important vector a unique handler and
> to make sure those handlers do not rely on the vector
> number at all.
>

Hi Alexander,

First of all, great job on the timing analysis. I believe this confirms
the concerns that I had about this technique.

Here is a prototype patch of the compressed IRQ stubs -- this patch
compresses them down to 7 stubs per 32-byte cache line (or part of cache
line) at the expense of a back-to-back jmp which has the potential of
being ugly on some pipelines (we can only get 4 stubs into 32 bytes
without that).

Would you be willing to run your timing test on this patch? This isn't
submission-quality since it commingles multiple changes, and it needs
some cleanup, but it should be useful for testing.

As a side benefit it eliminates some gratuitous differences between the
32- and 64-bit code.

-hpa


Attachments:
irqstubs.patch (7.15 kB)

2008-11-10 08:59:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Alexander van Heukelum <[email protected]> wrote:

> Hi all,
>
> I have spent some time trying to find out how expensive the
> segment-switching patch was. I have only one computer available at
> the time: a "Sempron 2400+", 32-bit-only machine.
>
> Measured were timings of "hackbench 10" in a loop. The average was
> taken of more than 100 runs. Timings were done for two separate
> boots of the system.

hackbench is _way_ too noisy to measure such cycle-level differences
as irq entry changes cause. It also does not really stress interrupts
- it only stresses networking, the VFS and the scheduler.

a better test might have been to generate a ton of interrupts, but
even then it's _very_ hard to measure it properly. The best method is
what i've suggested to you early on: run a loop in user-space and
observe irq costs via RDTSC, as they happen. Then build a histogram
and compare the before/after histogram. Compare best-case results as
well (the first slot of the histogram), as those are statistically
much more significant than a noisy average.
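
Something like the following self-contained sketch would do
(illustrative only -- not Alexander's program; the iteration count and
bin layout are arbitrary):

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

#define NBINS	65536		/* larger deltas land in the last bin */

int main(void)
{
	static uint64_t hist[NBINS];
	uint64_t prev = rdtsc(), now, delta;
	long i;

	/* Spin reading the TSC; big gaps between reads are interrupts. */
	for (i = 0; i < 1000000000L; i++) {
		now = rdtsc();
		delta = now - prev;
		prev = now;
		hist[delta < NBINS ? delta : NBINS - 1]++;
	}
	for (i = 0; i < NBINS; i++)
		if (hist[i])
			printf("%ld %llu\n", i, (unsigned long long)hist[i]);
	return 0;
}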

Measuring such things in a meaningful way is really tricky business.
Using hackbench to measure IRQ entry micro-costs is like trying to
take a photo of a delicate flower at night, by using an atomic bomb as
the flash-light: you certainly get some sort of effect to report, but
there's not many nuances left in the picture to really look at ;-)

Ingo

2008-11-10 12:44:32

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Mon, 10 Nov 2008 09:58:46 +0100, "Ingo Molnar" <[email protected]> said:
> * Alexander van Heukelum <[email protected]> wrote:
> > Hi all,
> >
> > I have spent some time trying to find out how expensive the
> > segment-switching patch was. I have only one computer available at
> > the time: a "Sempron 2400+", 32-bit-only machine.
> >
> > Measured were timings of "hackbench 10" in a loop. The average was
> > taken of more than 100 runs. Timings were done for two separate
> > boots of the system.

Hi Ingo,

I guess you just stopped reading here?

> hackbench is _way_ too noisy to measure such cycle-level differences
> as irq entry changes cause. It also does not really stress interrupts
> - it only stresses networking, the VFS and the scheduler.
>
> a better test might have been to generate a ton of interrupts, but
> even then it's _very_ hard to measure it properly.

I should have presented the second benchmark as the first I
guess. I really just used hackbench as a workload. I gathered
it would give a good amount of exceptions like page faults and
maybe others. It would be nice to have a simple debug switch in
the kernel to make it generate a lot of interrupts, though ;).

> The best method is
> what i've suggested to you early on: run a loop in user-space and
> observe irq costs via RDTSC, as they happen. Then build a histogram
> and compare the before/after histogram. Compare best-case results as
> well (the first slot of the histogram), as those are statistically
> much more significant than a noisy average.

See the rest of the mail you replied to and its attachment. I've put
the programs I used and the histogram in

http://heukelum.fastmail.fm/irqstubs/

I think rdtsctest.c is pretty much what you describe.

Greetings,
Alexander

> Measuring such things in a meaningful way is really tricky business.
> Using hackbench to measure IRQ entry micro-costs is like trying to
> take a photo of a delicate flower at night, by using an atomic bomb as
> the flash-light: you certainly get some sort of effect to report, but
> there's not many nuances left in the picture to really look at ;-)
>
> Ingo
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - Same, same, but different...

2008-11-10 13:07:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Alexander van Heukelum <[email protected]> wrote:

> On Mon, 10 Nov 2008 09:58:46 +0100, "Ingo Molnar" <[email protected]> said:
> > * Alexander van Heukelum <[email protected]> wrote:
> > > Hi all,
> > >
> > > I have spent some time trying to find out how expensive the
> > > segment-switching patch was. I have only one computer available at
> > > the time: a "Sempron 2400+", 32-bit-only machine.
> > >
> > > Measured were timings of "hackbench 10" in a loop. The average was
> > > taken of more than 100 runs. Timings were done for two separate
> > > boots of the system.
>
> Hi Ingo,
>
> I guess you just stopped reading here?

yeah, sorry! You described and did exactly the kind of histogram that i
wanted to see done ;-)

I'm not sure i can read out the same thing from the result though.
Firstly, it seems the 'after' histograms are better, because there the
histogram shifted towards shorter delays. (i.e. lower effective irq
entry overhead)

OTOH, unless i'm misreading them, it's a bit hard to compare them
visually: the integral of the histograms does not seem to be constant,
they don't seem to be normalized.

It should be made constant for them to be comparable. (i.e. the total
number of irq hits profiled should be equal - or should be normalized
with the sum after the fact)

Ingo

2008-11-10 15:40:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Ingo Molnar wrote:
> * Alexander van Heukelum <[email protected]> wrote:
>
> hackbench is _way_ too noisy to measure such cycle-level differences
> as irq entry changes cause. It also does not really stress interrupts
> - it only stresses networking, the VFS and the scheduler.
>
> a better test might have been to generate a ton of interrupts, but
> even then it's _very_ hard to measure it properly. The best method is
> what i've suggested to you early on: run a loop in user-space and
> observe irq costs via RDTSC, as they happen. Then build a histogram
> and compare the before/after histogram. Compare best-case results as
> well (the first slot of the histogram), as those are statistically
> much more significant than a noisy average.
>

For what it's worth, I tested this out, and I'm pretty sure you need to
run a uniprocessor configuration (or system) for it to make sense --
otherwise you end up missing too many of the interrupts. I first tested
this on an 8-processor system and, well, came up with nothing.

I'm going to try this later on a uniprocessor, unless Alexander beats me
to it.

-hpa

2008-11-10 21:35:43

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Mon, 10 Nov 2008 14:07:09 +0100, "Ingo Molnar" <[email protected]> said:
> * Alexander van Heukelum <[email protected]> wrote:
> > On Mon, 10 Nov 2008 09:58:46 +0100, "Ingo Molnar" <[email protected]> said:
> > > * Alexander van Heukelum <[email protected]> wrote:
> > > > Hi all,
> > > >
> > > > I have spent some time trying to find out how expensive the
> > > > segment-switching patch was. I have only one computer available at
> > > > the time: a "Sempron 2400+", 32-bit-only machine.
> > > >
> > > > Measured were timings of "hackbench 10" in a loop. The average was
> > > > taken of more than 100 runs. Timings were done for two separate
> > > > boots of the system.
> >
> > Hi Ingo,
> >
> > I guess you just stopped reading here?
>
> yeah, sorry! You described and did exactly the kind of histogram that i
> wanted to see done ;-)

I thought so ;).

> I'm not sure i can read out the same thing from the result though.
> Firstly, it seems the 'after' histograms are better, because there the
> histogram shifted towards shorter delays. (i.e. lower effective irq
> entry overhead)
>
> OTOH, unless i'm misreading them, it's a bit hard to compare them
> visually: the integral of the histograms does not seem to be constant,
> they don't seem to be normalized.

The total number of measured intervals (between two almost-adjacent
rdtsc's) is exactly the same for all histograms (10^10). Almost all
measurements are of the "nothing happened" type, i.e., around 11
clock cycles on this machine. The user time spent inside the
rdtsctest program is almost independent of the load, but it
measures time spent outside of the program... But what should be
attributed to what effect is unclear to me at the moment.

> It should be made constant for them to be comparable. (i.e. the total
> number of irq hits profiled should be equal - or should be normalized
> with the sum after the fact)

Basically the difference between the "idle" and "hack10" versions
should indicate the effect of extra interrupts (timer) and additional
exceptions and cache effects due to context switching.

Thanks,
Alexander

> Ingo
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - I mean, what is it about a decent email service?

2008-11-10 21:44:28

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


On Mon, 10 Nov 2008 07:39:22 -0800, "H. Peter Anvin" <[email protected]>
said:
> Ingo Molnar wrote:
> > * Alexander van Heukelum <[email protected]> wrote:
> >
> > hackbench is _way_ too noisy to measure such cycle-level differences
> > as irq entry changes cause. It also does not really stress interrupts
> > - it only stresses networking, the VFS and the scheduler.
> >
> > a better test might have been to generate a ton of interrupts, but
> > even then it's _very_ hard to measure it properly. The best method is
> > what i've suggested to you early on: run a loop in user-space and
> > observe irq costs via RDTSC, as they happen. Then build a histogram
> > and compare the before/after histogram. Compare best-case results as
> > well (the first slot of the histogram), as those are statistically
> > much more significant than a noisy average.
> >
>
> For what it's worth, I tested this out, and I'm pretty sure you need to
> run a uniprocessor configuration (or system) for it to make sense --
> otherwise you end up missing too many of the interrupts. I first tested
> this on an 8-processor system and, well, came up with nothing.
>
> I'm going to try this later on a uniprocessor, unless Alexander beats me
> to it.

I did the rdtsctest again for the irqstubs patch you sent. The data
is at http://heukelum.fastmail.fm/irqstubs/ and the latency histogram
is http://heukelum.fastmail.fm/irqstubs/latency_hpa.png

Greetings,
Alexander

> -hpa
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - Same, same, but different...

2008-11-10 22:22:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Alexander van Heukelum wrote:
>>
>> OTOH, unless i'm misreading them, it's a bit hard to compare them
>> visually: the integral of the histograms does not seem to be constant,
>> they don't seem to be normalized.
>
> The total number of measured intervals (between two almost-adjacent
> rdtsc's) is exactly the same for all histograms (10^10). Almost all
> measurements are of the "nothing happened" type, i.e., around 11
> clock cycles on this machine. The user time spent inside the
> rdtsctest program is almost independent of the load, but it
> measures time spent outside of the program... But what should be
> attributed to what effect is unclear to me at the moment.
>

I believe you need to remove the obvious null events at the low end (no
interrupt happened) and renormalize to the same scale for the histograms
to make sense.

As it is, the difference in the number of events that actually matter
dominates the graphs; for example, there are 142187 events >= 12 in
hack10ticks, but 136533 in hack10ticks_hpa.

-hpa

2008-11-10 23:37:06

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Alexander van Heukelum wrote:
>
> I did the rdtsctest again for the irqstubs patch you sent. The data
> is at http://heukelum.fastmail.fm/irqstubs/ and the latency histogram
> is http://heukelum.fastmail.fm/irqstubs/latency_hpa.png
>

Okay, I've stared at a bunch of different transformations of this data
and I'm starting to think that it's getting lost in the noise. The
difference between your "idleticks" and "idleticks2" data sets, for
example, is as big as the differences between any two data sets that I
can see.

Just for reference, see this graph where I have filtered out events
outside the [30..1000] cycle range and renormalized.

http://www.zytor.com/~hpa/hist.pdf

I don't know how to even figure out what a realistic error range looks
like, other than repeating each run something like 100+ times and doing an
"eye chart" kind of diagram.

-hpa

2008-11-11 05:02:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Okay, after spending most of the day trying to get something that isn't
completely like white noise (interesting problem, otherwise I'd have
given up long ago) I did, eventually, come up with something that looks
like it's significant. I did a set of multiple runs, and am looking for
the "waterfall points" in the cumulative statistics.

http://www.zytor.com/~hpa/baseline-hpa-3000-3600.pdf

This particular set of data points was gathered on a 64-bit kernel, so I
didn't try the segment technique.

It looks to me that the collection of red lines is enough to the left of
the black ones that one can assume there is a significant effect,
probably by about a cache miss worth of time.

-hpa

2008-11-11 09:55:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Alexander van Heukelum <[email protected]> wrote:

> > OTOH, unless i'm misreading them, it's a bit hard to compare them
> > visually: the integral of the histograms does not seem to be
> > constant, they dont seem to be normalized.
>
> The total number of measured intervals (between two almost-adjacent
> rdtsc's) is exactly the same for all histograms (10^10). Almost all
> measurements are of the "nothing happened" type, i.e., around 11
> clock cycles on this machine. The user time spent inside the
> rdtsctest program is almost independent of the load, but it measures
> time spent outside of the program... But what should be attributed
> to what effect is unclear to me at the moment.

a high-pass filter should be applied in any case, to filter out the
"nothing happened" baseline. Eliminating every delta below 500-1000
cycles would do the trick i think, all IRQ costs are at least 1000
cycles.

then a low-pass filter should be applied to eliminate non-irq noise
such as scheduling effects or expensive irqs (which are both
uninteresting to such analysis).

and then _that_ double-filtered dataset should be normalized: the
number of events should be made the same. (just clip the larger
dataset to the length of the smaller dataset)
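
in code terms, something like this sketch (the thresholds below are
arbitrary example values, and the clipping is left to the caller):

#include <stddef.h>
#include <stdint.h>

/*
 * Keep only the deltas in [lo, hi): the high-pass drops the
 * "nothing happened" baseline, the low-pass drops scheduling and
 * expensive-irq outliers.  The caller then clips both runs to the
 * same number of surviving events before comparing histograms.
 */
static size_t band_pass(uint64_t *delta, size_t n, uint64_t lo, uint64_t hi)
{
	size_t i, kept = 0;

	for (i = 0; i < n; i++)
		if (delta[i] >= lo && delta[i] < hi)
			delta[kept++] = delta[i];
	return kept;
}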

> > It should be made constant for them to be comparable. (i.e. the
> > total number of irq hits profiled should be equal - or should be
> > normalized with the sum after the fact)
>
> Basically the difference between the "idle" and "hack10" versions
> should indicate the effect of extra interrupts (timer) and
> additional exceptions and cache effects due to context switching.

i was only looking at before/after duos, for the same basic type of
workload. Idle versus hackbench is indeed apples to oranges.

Ingo

2008-11-13 22:25:20

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Mon, 2008-11-10 at 21:00 -0800, H. Peter Anvin wrote:
> Okay, after spending most of the day trying to get something that isn't
> completely like white noise (interesting problem, otherwise I'd have
> given up long ago) I did, eventually, come up with something that looks
> like it's significant. I did a set of multiple runs, and am looking for
> the "waterfall points" in the cumulative statistics.
>
> http://www.zytor.com/~hpa/baseline-hpa-3000-3600.pdf
>
> This particular set of data points was gathered on a 64-bit kernel, so I
> didn't try the segment technique.
>
> It looks to me that the collection of red lines is enough to the left of
> the black ones that one can assume there is a significant effect,
> probably by about a cache miss worth of time.

This graph is a little confusing. Is the area under each curve here
supposed to be a constant?

Is this latency from all interrupts as seen by userspace? Or does a
particular interrupt dominate?

--
Mathematics is the supreme nostalgia of our time.

2008-11-14 01:11:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Sorry to reply so late on this slightly offtopic rant...

On Wednesday 05 November 2008 21:26, Ingo Molnar wrote:
> * Andi Kleen <[email protected]> wrote:
> > On Tue, Nov 04, 2008 at 09:44:00PM +0100, Ingo Molnar wrote:

> > > It's only an issue on ancient CPUs that export all their LOCKed
> > > cycles to the bus. Pentium and older or so. The PPro got it right
> > > already.
> >
> > ??? LOCK slowness is not because of the bus. And I know you know
> > that Ingo, so I don't know why you wrote that bogosity above.
>
> .. of course the historic LOCK slowness was all due to the system bus:
> very old CPUs exported a LOCK signal to the system bus for every
> LOCK-prefix access (implicit and explicit) and that made it _really_
> expensive. (hundreds of cycles)
>
> ... on reasonably modern CPUs the LOCK-ed access has been abstracted
> away to within the CPU, and the cost of LOCK-ed access is rather low
> (think 10-20 cycles - of course only if there's no cache miss cost)
> (That's obviously the case with the GDT, which is both per CPU and well
> cached.)

Locked instruction AFAIR is about 50 cycles on Core2. I think it is
a bit lower on K8. On Nehalem, which has optimisations for these,
I have heard it is still about 20-25 cycles. Although I don't have
one, so I don't actually know.

These (on my Core2) don't seem to pipeline at all with other
instructions either. So on my Core2, a locked instruction is worth
maybe 150-200 regular pipelined, superscalar instructions.
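
(For what it's worth, a rough way to reproduce such a number -- an
illustrative sketch only, with no serialisation or frequency-scaling
care taken:)

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { N = 10000000 };
	static int x;
	uint64_t t0, t1, t2;
	int i;

	/* Time N plain increments, then N LOCK'ed increments. */
	t0 = rdtsc();
	for (i = 0; i < N; i++)
		asm volatile("incl %0" : "+m" (x));
	t1 = rdtsc();
	for (i = 0; i < N; i++)
		asm volatile("lock; incl %0" : "+m" (x));
	t2 = rdtsc();

	printf("plain: %.1f cycles/op, locked: %.1f cycles/op\n",
	       (double)(t1 - t0) / N, (double)(t2 - t1) / N);
	return 0;
}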

There is another big reason why lock instructions are expensive, and
that is because they have to prevent subsequent loads from passing any
previous stores, i.e. from completing before those stores become
visible. This in theory could be somewhat speculated, but no matter
what happens, the program-visible state can't be committed until the
stores are.

I heard from an Intel hardware engineer that Nehalem has some
really fancy logic in it to make locked instructions "free", that
was nacked from earlier CPUs because it was too costly. So obviously
it is taking a fair whack of transistors or power for them to do it.
And even then it is far from free, but still seems to be one or two
orders of magnitude more expensive than a regular instruction.


> on _really_ modern CPUs LOCK can be as cheap as just a few cycles - so

Oh, maybe I'm mistaken about Nehalem then? How many is "just a few"?
If it is 25 non-pipelined cycles, then that's still 100 instructions
if it is a 4 issue machine.


> low that we can stop bothering about it in the future. There's no
> fundamental physical reason why the LOCK prefix (implicit or explicit)
> should be expensive.

Even if they could make it free on the software side, it is obviously
expensive on the hardware side. Not bothering about it is a copout.
The atomic instruction speedups in Nehalem are cool, but what would
have been even cooler is if Intel had decided *not* to spend resources
making this cheaper because they found Linux has so few locked
instructions :)

Even if somehow the x86 ISA didn't have the implicit memory ordering
requirement in the lock instruction, I think it's obviously a special
case path that doesn't fit in with a load/store uarch (whether they
implement it in uops with ll/sc like thing or whatnot, it's going to
need special logic).

IMO, we shouldn't stop bothering about the LOCK prefix in the foreseeable
future.

2008-11-14 01:21:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Matt Mackall wrote:
> On Mon, 2008-11-10 at 21:00 -0800, H. Peter Anvin wrote:
>> Okay, after spending most of the day trying to get something that isn't
>> completely like white noise (interesting problem, otherwise I'd have
>> given up long ago) I did, eventually, come up with something that looks
>> like it's significant. I did a set of multiple runs, and am looking for
>> the "waterfall points" in the cumulative statistics.
>>
>> http://www.zytor.com/~hpa/baseline-hpa-3000-3600.pdf
>>
>> This particular set of data points was gathered on a 64-bit kernel, so I
>> didn't try the segment technique.
>>
>> It looks to me that the collection of red lines is enough to the left of
>> the black ones that one can assume there is a significant effect,
>> probably by about a cache miss worth of time.
>
> This graph is a little confusing. Is the area under each curve here
> supposed to be a constant?
>

No, they reflect individual runs. They start at 1 at the top left and
drop to 0 at the far right in each case. What matters is the horizontal
position of large vertical drops.

> Is this latency from all interrupts as seen by userspace? Or does a
> particular interrupt dominate?
>

All interrupts, but rather inherently the difference between interrupt
handlers is going to be bigger than the differences between
implementations of the same handler. I *believe* all the interrupts
you're seeing in that graph are probably timer interrupts. The other
major interrupt source that was active on the system was USB.

-hpa

2008-11-14 01:30:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Nick Piggin wrote:
>
> I heard from an Intel hardware engineer that Nehalem has some
> really fancy logic in it to make locked instructions "free", that
> was nacked from earlier CPUs because it was too costly. So obviously
> it is taking a fair whack of transistors or power for them to do it.
> And even then it is far from free, but still seems to be one or two
> orders of magnitude more expensive than a regular instruction.
>

Last I heard it was still a dozen-ish cycles even on Nehalem.

>
> IMO, we shouldn't stop bothering about the LOCK prefix in the foreseeable
> future.
>

Even if a CPU came out *today* that had zero-cost locks we'd have to
worry about it for at least another 5-10 years. The good news is that
we're doing pretty good with it for now, but I don't believe in general
we can avoid the fact that improving LOCK performance helps everything
when you're dealing with large numbers of cores/threads.

-hpa

2008-11-14 02:12:36

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Friday 14 November 2008 12:20, H. Peter Anvin wrote:
> Nick Piggin wrote:
> > I heard from an Intel hardware engineer that Nehalem has some
> > really fancy logic in it to make locked instructions "free", that
> > was nacked from earlier CPUs because it was too costly. So obviously
> > it is taking a fair whack of transistors or power for them to do it.
> > And even then it is far from free, but still seems to be one or two
> > orders of magnitude more expensive than a regular instruction.
>
> Last I heard it was still a dozen-ish cycles even on Nehalem.

Right -- that's their definition of "free", and even that comes after
they seem to have put a large amount of effort into it. So they
are still expensive.


> > IMO, we shouldn't stop bothering about LOCK prefix in the forseeable
> > future.
>
> Even if a CPU came out *today* that had zero-cost locks we'd have to
> worry about it for at least another 5-10 years. The good news is that
> we're doing pretty good with it for now, but I don't believe in general
> we can avoid the fact that improving LOCK performance helps everything
> when you're dealing with large numbers of cores/threads.

There is a balance, though. And that is going to depend on what other
low-hanging fruit the CPU has eaten, and what the ratio of lock
instructions is on the target workload. So it's always preferable to
reduce their number (by definition they will always be more difficult
for the CPU to execute).
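
A minimal user-space sketch of the kind of cost being discussed (this is not
from the thread; the loop count and the lack of serialization make the
numbers only indicative): time a plain increment against a locked one.

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { N = 1000000 };
	volatile long counter = 0;
	uint64_t t0, t1, t2;

	t0 = rdtsc();
	for (int i = 0; i < N; i++)
		counter++;				/* plain add */
	t1 = rdtsc();
	for (int i = 0; i < N; i++)
		__sync_fetch_and_add(&counter, 1);	/* lock xadd */
	t2 = rdtsc();

	printf("plain:  %.1f cycles/op\n", (double)(t1 - t0) / N);
	printf("locked: %.1f cycles/op\n", (double)(t2 - t1) / N);
	return 0;
}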

2008-11-14 02:31:20

by Matt Mackall

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Thu, 2008-11-13 at 17:18 -0800, H. Peter Anvin wrote:
> Matt Mackall wrote:
> > On Mon, 2008-11-10 at 21:00 -0800, H. Peter Anvin wrote:
> >> Okay, after spending most of the day trying to get something that isn't
> >> completely like white noise (interesting problem, otherwise I'd have
> >> given up long ago) I did, eventually, come up with something that looks
> >> like it's significant. I did a set of multiple runs, and am looking for
> >> the "waterfall points" in the cumulative statistics.
> >>
> >> http://www.zytor.com/~hpa/baseline-hpa-3000-3600.pdf
> >>
> >> This particular set of data points was gathered on a 64-bit kernel, so I
> >> didn't try the segment technique.
> >>
> >> It looks to me that the collection of red lines is enough to the left of
> >> the black ones that one can assume there is a significant effect,
> >> probably by about a cache miss worth of time.
> >
> > This graph is a little confusing. Is the area under each curve here
> > supposed to be a constant?
> >
>
> No, they reflect individual runs. They start at 1 at the top left and
> drop to 0 at the far right in each case. What matters is the horizontal
> position of large vertical drops.

Still confused. If, say, the top blue line on the left represents the
same number of interrupts as the bottom red one, then at some point it
must cross under the red one as it goes to the right, which it does not
appear to do. Thus, it does not appear the scale on the left is actually
in units of constant probability, no?

Though I'll agree that even if they're not scaled so that the area under
the curve sums to a probability of 1, the centerpoint of the vertical
drop is what matters. But that's rather hard to read off this chart, as
the blue line I mentioned has a center point well above the red
one, so while it looks like a shift of 80 cycles, it's more like 30.

Is there any theoretical reason you can't just sum the histograms for
runs of the same code and then divide by event count? Is there some sort
of alignment/cache-coloring issue across boots?

> > Is this latency from all interrupts as seen by userspace? Or does a
> > particular interrupt dominate?
> >
>
> All interrupts, but rather inherently the difference between interrupt
> handlers is going to be bigger than the differences between
> implementations of the same handler. I *believe* all the interrupts
> you're seeing in that graph are probably timer interrupts. The other
> major interrupt source that was active on the system was USB.

That's what I'd expect on an idle system, certainly.

Anyway, I'm actually surprised your red graph is visibly better than
your blue one. FWIW, I was leaning towards your simpler variant (and
away from the magical segment register proposal). I'd be happy to see
either of your versions submitted.

--
Mathematics is the supreme nostalgia of our time.

2008-11-14 03:24:25

by H. Peter Anvin

Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Matt Mackall wrote:
>>>
>> No, they reflect individual runs. They start at 1 at the top left and
>> drop to 0 at the far right in each case. What matters is the horizontal
>> position of large vertical drops.
>
> Still confused. If, say, the top blue line on the left represents the
> same number of interrupts as the bottom red one, then at some point it
> must cross under the red one as it goes to the right, which it does not
> appear to do. Thus, it does not appear the scale on the left is actually
> in units of constant probability, no?
>
> Though I'll agree that even if they're not scaled so that the area under
> the curve sums to a probability of 1, the centerpoint of the vertical
> drop is what matters. But that's rather hard to read off this chart, as
> the blue line I mentioned has a center point point well above the red
> one, so while it looks like a shift of 80 cycles, it's more like 30.
>
> Is there any theoretical reason you can't just sum the histograms for
> runs of the same code and then divide by event count? Is there some sort
> of alignment/cache-coloring issue across boots?
>

The reason for the multiple curves is to show the range of uncertainty.

There are three sets of graphs in there: black (current mainline,
16-byte stubs), red (4-byte stubs with a double jump), and blue (8-byte
stubs).

The fact that the system is idle is pretty much a case which should
favor "blue" over "red"; realistically I think the graphs show that they
are identical within the limits of measurement, and both are
significantly better than "black".

Since this is pretty much the optimal case for "blue" and it doesn't
show any performance win over "red", I implemented the "red" option and
pushed it into tip:x86/irq.

> That's what I'd expect on an idle system, certainly.
>
> Anyway, I'm actually surprised your red graph is visibly better than
> your blue one. FWIW, I was leaning towards your simpler variant (and
> away from the magical segment register proposal). I'd be happy to see
> either of your versions submitted.

Same here, which is why I wanted to check them both out.

-hpa

2008-11-26 21:36:35

by Avi Kivity

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


> Here is a prototype patch of the compressed IRQ stubs -- this patch
> compresses them down to 7 stubs per 32-byte cache line (or part of cache
> line) at the expense of a back-to-back jmp which has the potential of
> being ugly on some pipelines (we can only get 4 stubs into 32 bytes
> without that).
>

You could actually get 4-byte stubs, using a 16-bit call (66 e8 ww ww).
But it would be slower, since we won't be pairing it with a ret.

I suspect we could get it down to three bytes, by sharing the last byte
of the four-byte call sequence with the first byte of the next:

66 e8 ff 66 e8 fc 66 e8 f9 66 e8 f6 ...

Every three bytes a new stub begins; it's a four-byte call to offset
0x6703 relative to the beginning of the first stub.

Can anyone better 24 bits/stub?
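
A quick sanity check of that arithmetic (not from the original mail): each
stub is a 4-byte call - operand-size prefix, e8, 16-bit displacement - whose
last byte (0x66) doubles as the prefix of the next stub, so consecutive
stubs start 3 bytes apart and all land on the same common entry point.

#include <stdio.h>

int main(void)
{
	for (unsigned v = 0; v < 4; v++) {
		unsigned stub = 3 * v;			/* start of stub v     */
		unsigned disp = 0x66ff - 3 * v;		/* ff 66, fc 66, f9 66 */
		printf("stub %u at %#x -> target %#x\n", v, stub, stub + 4 + disp);
	}
	return 0;
}

Every line prints the same target, 0x6703; the return address pushed by the
call would then be what distinguishes one vector from another.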

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-26 21:50:41

by Avi Kivity

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Avi Kivity wrote:
>
>> Here is a prototype patch of the compressed IRQ stubs -- this patch
>> compresses them down to 7 stubs per 32-byte cache line (or part of cache
>> line) at the expense of a back-to-back jmp which has the potential of
>> being ugly on some pipelines (we can only get 4 stubs into 32 bytes
>> without that).
>>
>
> You could actually get 4-byte stubs, using a 16-bit call (66 e8 ww
> ww). But it would be slower, since we won't be pairing it with a ret.
>
> I suspect we could get it down to three bytes, by sharing the last
> byte of the four-byte call sequence with the first byte of the next:
>
> 66 e8 ff 66 e8 fc 66 e8 f9 66 e8 f6 ...
>
> Every three bytes a new stub begins; it's a four-byte call to offset
> 0x6703 relative to the beginning of the first stub.
>
> Can anyone better 24 bits/stub?

I actually got it down to 16 bits: use a 16-bit code segment, so you can
drop the operand-size override:

e8 ff e8 fd e8 fb ...

of course the common code has to jump back to a 32-bit code segment.
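
The same check for the 2-byte variant (again not from the mail): in a 16-bit
code segment the bare e8 call takes a 16-bit displacement, so each stub owns
only two bytes of the stream and still calls the same common entry point.

#include <stdio.h>

int main(void)
{
	for (unsigned v = 0; v < 4; v++) {
		unsigned stub = 2 * v;			/* start of stub v     */
		unsigned disp = 0xe8ff - 2 * v;		/* ff e8, fd e8, fb e8 */
		printf("stub %u at %#x -> target %#x\n", v, stub, stub + 3 + disp);
	}
	return 0;
}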

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-27 00:05:06

by H. Peter Anvin

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Avi Kivity wrote:
>
>> Here is a prototype patch of the compressed IRQ stubs -- this patch
>> compresses them down to 7 stubs per 32-byte cache line (or part of cache
>> line) at the expense of a back-to-back jmp which has the potential of
>> being ugly on some pipelines (we can only get 4 stubs into 32 bytes
>> without that).
>
> You could actually get 4-byte stubs, using a 16-bit call (66 e8 ww ww).
> But it would be slower, since we won't be pairing it with a ret.
>

Yes, I would consider that a theoretical exercise only :)

> I suspect we could get it down to three bytes, by sharing the last byte
> of the four-byte call sequence with the first byte of the next:
>
> 66 e8 ff 66 e8 fc 66 e8 f9 66 e8 f6 ...
>
> Every three bytes a new stub begins; it's a four-byte call to offset
> 0x6703 relative to the beginning of the first stub.
>
> Can anyone better 24 bits/stub?

On the entirely silly level...

CC xx

-hpa

2008-11-27 10:19:21

by Avi Kivity

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

H. Peter Anvin wrote:
>
>> I suspect we could get it down to three bytes, by sharing the last
>> byte of the four-byte call sequence with the first byte of the next:
>>
>> 66 e8 ff 66 e8 fc 66 e8 f9 66 e8 f6 ...
>>
>> Every three bytes a new stub begins; it's a four-byte call to offset
>> 0x6703 relative to the beginning of the first stub.
>>
>> Can anyone better 24 bits/stub?
>
> On the entirely silly level...
>
> CC xx

Nice. Can actually go to zero, by pointing the IDT at (unmapped_area +
vector), and deducing the vector in the page fault handler from cr2.
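
A minimal sketch of that idea, with entirely made-up names (irq_probe_base,
NR_VECTORS, handle_vector) - nothing here is an existing kernel interface:

#define NR_VECTORS 256

extern unsigned long irq_probe_base;	/* start of the never-mapped window */
extern void handle_vector(unsigned int vector);

/* Called early in the page fault path with the faulting address (CR2). */
int maybe_irq_probe_fault(unsigned long cr2)
{
	if (cr2 < irq_probe_base || cr2 >= irq_probe_base + NR_VECTORS)
		return 0;	/* an ordinary page fault */
	handle_vector((unsigned int)(cr2 - irq_probe_base));
	return 1;
}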

--
error compiling committee.c: too many arguments to function

2008-11-27 10:57:59

by Andi Kleen

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Thu, Nov 27, 2008 at 12:13:43PM +0200, Avi Kivity wrote:
> H. Peter Anvin wrote:
> >
> >>I suspect we could get it down to three bytes, by sharing the last
> >>byte of the four-byte call sequence with the first byte of the next:
> >>
> >> 66 e8 ff 66 e8 fc 66 e8 f9 66 e8 f6 ...
> >>
> >>Every three bytes a new stub begins; it's a four-byte call to offset
> >>0x6703 relative to the beginning of the first stub.
> >>
> >>Can anyone better 24 bits/stub?
> >
> >On the entirely silly level...
> >
> >CC xx
>
> Nice. Can actually go to zero, by pointing the IDT at (unmapped_area +
> vector), and deducing the vector in the page fault handler from cr2.

That would still be one byte; otherwise you wouldn't get a unique index.

-Andi

--
[email protected]

2008-11-27 11:00:31

by Avi Kivity

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Andi Kleen wrote:
>> Nice. Can actually go to zero, by pointing the IDT at (unmapped_area +
>> vector), and deducing the vector in the page fault handler from cr2.
>>
>
> That would be still one byte, otherwise you wouldn't get a unique index.
>

One virtual byte; zero physical bytes. unmapped_area above need not be
mapped.

--
error compiling committee.c: too many arguments to function

by Alexander van Heukelum

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Thu, Nov 27, 2008 at 12:13:43PM +0200, Avi Kivity wrote:
> H. Peter Anvin wrote:
> >
> >>I suspect we could get it down to three bytes, by sharing the last
> >>byte of the four-byte call sequence with the first byte of the next:
> >>
> >> 66 e8 ff 66 e8 fc 66 e8 f9 66 e8 f6 ...
> >>
> >>Every three bytes a new stub begins; it's a four-byte call to offset
> >>0x6703 relative to the beginning of the first stub.
> >>
> >>Can anyone better 24 bits/stub?
> >
> >On the entirely silly level...
> >
> >CC xx
>
> Nice. Can actually go to zero, by pointing the IDT at (unmapped_area +
> vector), and deducing the vector in the page fault handler from cr2.

Hi all,

We started the discussion with doing away with the whole jump
array entirely, by changing the value of the CS index in the
IDT. This needs the GDT to be extended with 256 entries, but an
entire page (space for 512 entries) was already reserved anyhow!
I think there is still some problem with the patch I sent due to
some code depending on certain values of the CS index, but the
system I've benchmarked on seemed to behave.
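
As a minimal sketch (this is not code from the patch; FIRST_IRQ_GDT_ENTRY is
an assumed name and value), the common handler can recover the vector from
the code segment selector it was entered with, since the GDT index of the
per-vector segments is just the vector plus a fixed offset:

#define FIRST_IRQ_GDT_ENTRY 256	/* assumed: entries 256..511 of the GDT */

static inline unsigned int vector_from_cs(unsigned short cs)
{
	return (cs >> 3) - FIRST_IRQ_GDT_ENTRY;	/* GDT index = selector >> 3 */
}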

I did a set of benchmarks on an 8-way Xeon in 64-bit mode. The
system was loaded with an instance of bonnie++ pinned to processor
0, and all 8 processors were running a program doing (almost)
adjacent rdtsc's. Bonnie++ causes interrupts and the latencies
due to these show up as larger time intervals. Complete runs of
bonnie++ in fast mode were sampled this way for a current -rc6
kernel and an -rc6 kernel plus my patch. The total sampling time
was 30 minutes for each run. Per kernel I did one run as a warm-up
and another two runs to measure the latencies. The results for
measured latencies between 5 and 1000 microseconds are shown in
the attached graph. Above 1000 microseconds there is only one big
contribution: at 40000 microseconds ;). The area under the graph
is a measure of time.
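
A rough sketch of such a measuring loop (not the actual benchmark code; the
threshold is just an example value): read the TSC back to back and record
any gap larger than the threshold, since an interrupt shows up as an
unusually large gap between two adjacent reads.

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	const uint64_t threshold = 10000;	/* cycles; assumed value */
	uint64_t prev = rdtsc();

	for (;;) {				/* runs until killed */
		uint64_t now = rdtsc();
		if (now - prev > threshold)	/* something stole the CPU */
			printf("%llu\n", (unsigned long long)(now - prev));
		prev = now;
	}
}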

Observations (for this test load!):

Near 200, 250 and 350 microseconds, the peaks shift to longer
latencies for the cs-changing code by about 10 microseconds,
but the total time spent is pretty much constant.

The highest latencies for the cs-changing code are near 600
and 650 microseconds. The highest latencies for the current
code are near 800 and 850 microseconds.

The total area under the graphs between 5 and 1000 microseconds
is equal for both cases to within an estimated error of 1%, and is
about 0.69% of the total time.

Most time is spent measuring 'latencies' of less than 5 microseconds,
since bonnie++ is taking only about 5% cpu time on a
single cpu most of the time, and only up to 50% on a single cpu
during a short time in the file creation benchmark.

Greetings,
Alexander


Attachments:
(No filename) (2.62 kB)
load.png (10.98 kB)

by Alexander van Heukelum

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Fri, Nov 28, 2008 at 09:48:10PM +0100, Alexander van Heukelum wrote:
> On Thu, Nov 27, 2008 at 12:13:43PM +0200, Avi Kivity wrote:
> > H. Peter Anvin wrote:
> > >
> > >>I suspect we could get it down to three bytes, by sharing the last
> > >>byte of the four-byte call sequence with the first byte of the next:
> > >>
> > >> 66 e8 ff 66 e8 fc 66 e8 f9 66 e8 f6 ...
> > >>
> > >>Every three bytes a new stub begins; it's a four-byte call to offset
> > >>0x6703 relative to the beginning of the first stub.
> > >>
> > >>Can anyone better 24 bits/stub?
> > >
> > >On the entirely silly level...
> > >
> > >CC xx
> >
> > Nice. Can actually go to zero, by pointing the IDT at (unmapped_area +
> > vector), and deducing the vector in the page fault handler from cr2.
>
> Hi all,
>
> We started the discussion with doing away with the whole jump
> array entirely, by changing the value of the CS index in the
> IDT. This needs the GDT to be extended with 256 entries, but an
> entire page (space for 512 entries) was already reserved anyhow!
> I think there is still some problem with the patch I sent due to
> some code depending on certain values of the CS index, but the
> system I've benchmarked on seemed to behave.
>
> I did a set of benchmarks on an 8-way Xeon in 64-bit mode. The
> system was loaded with an instance of bonnie++ pinned to processor
> 0, and all 8 processors were running a program doing (almost)
> adjacent rdtsc's. Bonnie++ causes interrupts and the latencies
> due to these show up as larger time intervals. Complete runs of
> bonnie++ in fast mode were sampled this way for a current -rc6
> kernel and an -rc6 kernel plus my patch. The total sampling time
> was 30 minutes for each run. Per kernel I did one run as a warm-up
> and another two runs to measure the latencies. The results for
> measured latencies between 5 and 1000 microseconds are shown in
> the attached graph. Above 1000 microseconds there is only one big
> contribution: at 40000 microseconds ;). The surface below the graph
> is a measure of time.
>
> Observations (for this test load!):
>
> Near 200, 250 and 350 microseconds, the peaks shift to longer
> latencies for the cs-changing code by about 10 microseconds,
> but the total time spent is pretty much constant.
>
> The highest latencies for the cs-changing code are near 600
> and 650 microseconds. The highest latencies for the current
> code are near 800 and 850 microseconds.
>
> The total surface of the graphs between 5 and 1000 microseconds
> is within an error estimate of 1% equal for both cases, and is
> about 0.69% of the total time.
>
> Most time is spent measuring 'latencies' of less than 5 micro-
> seconds, since bonnie++ is taking only about 5% cpu time on a
> single cpu most of the time, and only up to 50% on a single cpu
> during a short time in the file creation benchmark.

I now did the benchmarks for the same -rc6 with hpa's 4-byte stubs
too. Same machine. It's significantly better than the other two
options in terms of speed. It takes about 7% less cpu to handle
the interrupts (0.64% cpu instead of 0.69%). I have to run now;
I'll leave interpreting the histogram to someone else ;).

Greetings,
Alexander



Attachments:
(No filename) (3.13 kB)
load.png (11.57 kB)

2008-11-29 18:23:18

by Avi Kivity

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Alexander van Heukelum wrote:
> I now did the benchmarks for the same -rc6 with hpa's 4-byte stubs
> too. Same machine. It's significantly better than the other two
> options in terms of speed. It takes about 7% less cpu to handle
> the interrupts. (0.64% cpu instead of 0.69%.) I have to run now,
> I'll let interpreting the histogram to someone else ;).
>

This is noise. 0.05% cpu on a 1GHz machine servicing 1000 interrupts/sec
boils down to 500 cycles/interrupt. These changes shouldn't amount to
that much (and I doubt you have 1000 interrupts/sec with a single disk).

I'm sorry, but the whole effort is misguided, in my opinion. If you
want to optimize, try reducing the number of interrupts that occur
rather than saving a few cycles in the interrupt path.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-29 19:59:32

by Ingo Molnar

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Avi Kivity <[email protected]> wrote:

> Alexander van Heukelum wrote:
>> I now did the benchmarks for the same -rc6 with hpa's 4-byte stubs
>> too. Same machine. It's significantly better than the other two
>> options in terms of speed. It takes about 7% less cpu to handle
>> the interrupts. (0.64% cpu instead of 0.69%.) I have to run now,
>> I'll let interpreting the histogram to someone else ;).
>>
>
> This is noise. 0.05% cpu on a 1GHz machine servicing 1000 interrupt/sec
> boils down to 500 cycles/interrupt. These changes shouldn't amount to
> so much (and I doubt you have 1000 interrupts/sec with a single disk)..
>
> I'm sorry, but the whole effort is misguided, in my opinion. If you
> want to optimize, try reducing the number of interrupts that occur
> rather than saving a few cycles in the interrupt path.

the goal was not to optimize those workloads - the goal was to (try to)
validate those irq trampoline changes / cleanups. We went with hpa's
changes in the end, which compress the trampolines - that reduces the
$icache footprint, which is hard to measure but is a very real concern on
real workloads.

Ingo

2008-12-01 04:32:58

by Rusty Russell

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Sunday 30 November 2008 04:52:41 Avi Kivity wrote:
> Alexander van Heukelum wrote:
> > I now did the benchmarks for the same -rc6 with hpa's 4-byte stubs
> > too. Same machine. It's significantly better than the other two
> > options in terms of speed. It takes about 7% less cpu to handle
> > the interrupts. (0.64% cpu instead of 0.69%.) I have to run now,
> > I'll let interpreting the histogram to someone else ;).
>
> This is noise. 0.05% cpu on a 1GHz machine servicing 1000 interrupt/sec
> boils down to 500 cycles/interrupt. These changes shouldn't amount to
> so much (and I doubt you have 1000 interrupts/sec with a single disk)..

Sure, but the smallest cache footprint wins. Which is why I thought hpa chose
the 3-byte option.

> I'm sorry, but the whole effort is misguided, in my opinion.

Respectfully disagree. I wouldn't do it myself, but it warms my heart that
others are. It's not subtractive from other optimization efforts.

Cheers,
Rusty.

2008-12-01 08:00:59

by Ingo Molnar

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Rusty Russell <[email protected]> wrote:

> On Sunday 30 November 2008 04:52:41 Avi Kivity wrote:
> > Alexander van Heukelum wrote:
> > > I now did the benchmarks for the same -rc6 with hpa's 4-byte stubs
> > > too. Same machine. It's significantly better than the other two
> > > options in terms of speed. It takes about 7% less cpu to handle
> > > the interrupts. (0.64% cpu instead of 0.69%.) I have to run now,
> > > I'll let interpreting the histogram to someone else ;).
> >
> > This is noise. 0.05% cpu on a 1GHz machine servicing 1000 interrupt/sec
> > boils down to 500 cycles/interrupt. These changes shouldn't amount to
> > so much (and I doubt you have 1000 interrupts/sec with a single disk)..
>
> Sure, but smallest cache wins. Which is why I thought hpa chose the 3 byte
> option.
>
> > I'm sorry, but the whole effort is misguided, in my opinion.
>
> Respectfully disagree. I wouldn't do it, but it warms my heart that
> others are. It's are not subtractive from other optimization efforts.

Yeah, the efforts from Alexander, Peter and Cyrill are fantastic.

The most positive effect of it isn't just the optimizations and code
compression. What is important is the fact that this piece of code
(entry_*.S) has not gotten systematic attention for a decade, and has
become a rather crufty patchwork of hit-and-run changes. It has become
very hard to review and a recurring source of nasty bugs.

It is now being reworked and re-measured. It does not matter how small
and hard to measure the gain is in one specific case - we are mainly
concerned about not creating measurable _harm_, and we are after the
cleanups and the increase in maintainability. Once the code is cleaner,
improvements are much easier to do and bugs are much easier to fix.

Ingo

2008-12-01 09:25:02

by Avi Kivity

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Rusty Russell wrote:
> On Sunday 30 November 2008 04:52:41 Avi Kivity wrote:
>
>> Alexander van Heukelum wrote:
>>
>>> I now did the benchmarks for the same -rc6 with hpa's 4-byte stubs
>>> too. Same machine. It's significantly better than the other two
>>> options in terms of speed. It takes about 7% less cpu to handle
>>> the interrupts. (0.64% cpu instead of 0.69%.) I have to run now,
>>> I'll let interpreting the histogram to someone else ;).
>>>
>> This is noise. 0.05% cpu on a 1GHz machine servicing 1000 interrupt/sec
>> boils down to 500 cycles/interrupt. These changes shouldn't amount to
>> so much (and I doubt you have 1000 interrupts/sec with a single disk)..
>>
>
> Sure, but smallest cache wins. Which is why I thought hpa chose the 3 byte
> option.
>
>

Four bytes was the smallest sane option. Three bytes involved
overlapping instruction opcodes.

>> I'm sorry, but the whole effort is misguided, in my opinion.
>>
>
> Respectfully disagree. I wouldn't do it, but it warms my heart that others
> are. It's are not subtractive from other optimization efforts.
>

Once it's done there's no reason not to commit it. But the effort
expended to do it is gone, without any measurable return.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-12-01 10:32:27

by Cyrill Gorcunov

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

On Mon, Dec 1, 2008 at 12:24 PM, Avi Kivity <[email protected]> wrote:
...
>
> Once it's done there's no reason not to commit it. But the effort expended
> to do it is gone, without any measurable return.
>
>
...

Not sure what you mean, Avi, but as far as I know Alexander is working on
this file, so he just needs time to finish (we all have other duties, you know :).
So I think the idea Peter proposed could be merged right after Alexander
has finished. At least Peter's suggestion was recorded in this
thread, which means it will *not* be lost. Or did you mean something
else (yep, I could have it translated plain wrong)?

2008-12-01 10:47:32

by Avi Kivity

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Cyrill Gorcunov wrote:
> On Mon, Dec 1, 2008 at 12:24 PM, Avi Kivity <[email protected]> wrote:
> ...
>
>> Once it's done there's no reason not to commit it. But the effort expended
>> to do it is gone, without any measurable return.
>>
>>
> ...
>
> Not sure Avi what you mean but as far as I know Alexander is working on
> this file so he need just time to finish (we all have other duties you know :).
> So I think the idea Peter proposed could be merged right after Alexander
> will have finished. At least the Peter's suggestion was recorded in this
> thread which means it will *not* be lost eventually. Or you meant something
> else (yep, I could have it translated plain wrong)?
>

What I mean is that hpa's patch makes the kernel better, so it should be
applied. I'm not sure what else Alexander is working on, but I do hope
the improvements will be more concrete.

(Alexander, I apologise for being so discouraging; but I really feel
these improvements are marginal)

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-12-01 10:49:47

by Ingo Molnar

Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes


* Avi Kivity <[email protected]> wrote:

> Cyrill Gorcunov wrote:
>> On Mon, Dec 1, 2008 at 12:24 PM, Avi Kivity <[email protected]> wrote:
>> ...
>>
>>> Once it's done there's no reason not to commit it. But the effort expended
>>> to do it is gone, without any measurable return.
>>>
>>>
>> ...
>>
>> Not sure Avi what you mean but as far as I know Alexander is working on
>> this file so he need just time to finish (we all have other duties you know :).
>> So I think the idea Peter proposed could be merged right after Alexander
>> will have finished. At least the Peter's suggestion was recorded in this
>> thread which means it will *not* be lost eventually. Or you meant something
>> else (yep, I could have it translated plain wrong)?
>>
>
> What I mean is that hpa's patch makes the kernel better, so it should
> be applied. I'm not sure what else Alexander is working on, but I do
> hope the improvements will be more concrete.

that is what happened three weeks ago already, on Nov 11 - we applied
Peter's patches to tip/x86/irq:

939b787: x86: 64 bits: shrink and align IRQ stubs
b7c6244: x86: 32 bits: shrink and align IRQ stubs

it's all in the x86 tree and in linux-next as well. Alexander and Peter
are working on this together, not against each other. Alexander was still
running some numbers to make sure we made the right decision.

Ingo