2006-09-22 11:51:06

by Rusty Russell

Subject: [PATCH 0/7] Using %gs for per-cpu areas on x86

OK, here it is. Benchmarks still coming. This is against Andi's
2.6.18-rc7-git3 tree, and replaces the patches between (and not
including) i386-pda-asm-offsets and i386-early-fault.

One patch is identical, one is mildly modified, the rest are
re-implemented but inspired by Jeremy's PDA work.

Thanks,
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law


2006-09-22 11:53:21

by Rusty Russell

Subject: [PATCH 1/7] Use per-cpu GDT tables from early in boot

Currently, we have 3 GDT tables: one for early boot (boot_gdt_table),
one for late boot (cpu_gdt_table), and a dynamically-allocated per-cpu
one (per-cpu cpu_gdt_table *).

Simplify this and reduce waste by using the per-cpu GDT table
directly in boot. This is done by using the master copy in the per-cpu
section until cpu_init(), by which time the per-cpu copies have been set up.

This also means it's easier to define the cpu_gdt_table in C; otherwise
we'd need asm per-cpu definition macros.

Signed-off-by: Rusty Russell <[email protected]>

Index: ak-pda-gs/arch/i386/kernel/setup.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/setup.c 2006-09-20 15:38:27.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/setup.c 2006-09-20 15:39:01.000000000 +1000
@@ -1437,6 +1437,50 @@
tsc_init();
}

+/*
+ * The Global Descriptor Table contains 28 quadwords, per-CPU.
+ */
+__attribute__((aligned(L1_CACHE_BYTES)))
+DEFINE_PER_CPU(struct desc_struct, cpu_gdt_table[GDT_ENTRIES]) =
+{
+ /* kernel 4GB code at 0x00000000 */
+ [GDT_ENTRY_KERNEL_CS] = { 0x0000ffff, 0x00cf9a00 },
+ /* kernel 4GB data at 0x00000000 */
+ [GDT_ENTRY_KERNEL_DS] = { 0x0000ffff, 0x00cf9200 },
+ /* user 4GB code at 0x00000000 */
+ [GDT_ENTRY_DEFAULT_USER_CS] = { 0x0000ffff, 0x00cffa00 },
+ /* user 4GB data at 0x00000000 */
+ [GDT_ENTRY_DEFAULT_USER_DS] = { 0x0000ffff, 0x00cff200 },
+ /*
+ * Segments used for calling PnP BIOS have byte granularity.
+ * They code segments and data segments have fixed 64k limits,
+ * the transfer segment sizes are set at run time.
+ */
+ [GDT_ENTRY_PNPBIOS_BASE] =
+ { 0x0000ffff, 0x00409a00 }, /* 32-bit code */
+ { 0x0000ffff, 0x00009a00 }, /* 16-bit code */
+ { 0x0000ffff, 0x00009200 }, /* 16-bit data */
+ { 0x00000000, 0x00009200 }, /* 16-bit data */
+ { 0x00000000, 0x00009200 }, /* 16-bit data */
+
+ /*
+ * The APM segments have byte granularity and their bases
+ * are set at run time. All have 64k limits.
+ */
+ [GDT_ENTRY_APMBIOS_BASE] =
+ { 0x0000ffff, 0x00409a00 }, /* APM CS code */
+ { 0x0000ffff, 0x00009a00 }, /* APM CS 16 code (16 bit) */
+ { 0x0000ffff, 0x00409200 }, /* APM DS data */
+
+ /* ESPFIX 16-bit SS */
+ [GDT_ENTRY_ESPFIX_SS] = { 0x00000000, 0x00009200 },
+};
+
+/* Early in boot we use the master per-cpu gdt_table directly. */
+DEFINE_PER_CPU(struct Xgt_desc_struct, cpu_gdt_descr)
+= { .size = GDT_ENTRIES*8-1, .address = (long)&per_cpu__cpu_gdt_table };
+EXPORT_PER_CPU_SYMBOL(cpu_gdt_descr);
+
static __init int add_pcspkr(void)
{
struct platform_device *pd;
Index: ak-pda-gs/arch/i386/kernel/cpu/common.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/cpu/common.c 2006-09-20 15:38:27.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/cpu/common.c 2006-09-20 15:39:01.000000000 +1000
@@ -21,9 +21,6 @@

#include "cpu.h"

-DEFINE_PER_CPU(struct Xgt_desc_struct, cpu_gdt_descr);
-EXPORT_PER_CPU_SYMBOL(cpu_gdt_descr);
-
DEFINE_PER_CPU(unsigned char, cpu_16bit_stack[CPU_16BIT_STACK_SIZE]);
EXPORT_PER_CPU_SYMBOL(cpu_16bit_stack);

@@ -612,38 +609,8 @@
set_in_cr4(X86_CR4_TSD);
}

- /* The CPU hotplug case */
- if (cpu_gdt_descr->address) {
- gdt = (struct desc_struct *)cpu_gdt_descr->address;
- memset(gdt, 0, PAGE_SIZE);
- goto old_gdt;
- }
- /*
- * This is a horrible hack to allocate the GDT. The problem
- * is that cpu_init() is called really early for the boot CPU
- * (and hence needs bootmem) but much later for the secondary
- * CPUs, when bootmem will have gone away
- */
- if (NODE_DATA(0)->bdata->node_bootmem_map) {
- gdt = (struct desc_struct *)alloc_bootmem_pages(PAGE_SIZE);
- /* alloc_bootmem_pages panics on failure, so no check */
- memset(gdt, 0, PAGE_SIZE);
- } else {
- gdt = (struct desc_struct *)get_zeroed_page(GFP_KERNEL);
- if (unlikely(!gdt)) {
- printk(KERN_CRIT "CPU%d failed to allocate GDT\n", cpu);
- for (;;)
- local_irq_enable();
- }
- }
-old_gdt:
- /*
- * Initialize the per-CPU GDT with the boot GDT,
- * and set up the GDT descriptor:
- */
- memcpy(gdt, cpu_gdt_table, GDT_SIZE);
-
/* Set up GDT entry for 16bit stack */
+ gdt = __get_cpu_var(cpu_gdt_table);
*(__u64 *)(&gdt[GDT_ENTRY_ESPFIX_SS]) |=
((((__u64)stk16_off) << 16) & 0x000000ffffff0000ULL) |
((((__u64)stk16_off) << 32) & 0xff00000000000000ULL) |
Index: ak-pda-gs/arch/i386/kernel/head.S
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/head.S 2006-09-20 15:38:27.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/head.S 2006-09-20 15:39:01.000000000 +1000
@@ -302,7 +302,7 @@
movl %eax,%cr0

call check_x87
- lgdt cpu_gdt_descr
+ lgdt per_cpu__cpu_gdt_descr
lidt idt_descr
ljmp $(__KERNEL_CS),$1f
1: movl $(__KERNEL_DS),%eax # reload all the segment registers
@@ -456,12 +456,6 @@
.word IDT_ENTRIES*8-1 # idt contains 256 entries
.long idt_table

-# boot GDT descriptor (later on used by CPU#0):
- .word 0 # 32 bit align gdt_desc.address
-cpu_gdt_descr:
- .word GDT_ENTRIES*8-1
- .long cpu_gdt_table
-
/*
* The boot_gdt_table must mirror the equivalent in setup.S and is
* used only for booting.
@@ -472,55 +466,3 @@
.quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */

-/*
- * The Global Descriptor Table contains 28 quadwords, per-CPU.
- */
- .align L1_CACHE_BYTES
-ENTRY(cpu_gdt_table)
- .quad 0x0000000000000000 /* NULL descriptor */
- .quad 0x0000000000000000 /* 0x0b reserved */
- .quad 0x0000000000000000 /* 0x13 reserved */
- .quad 0x0000000000000000 /* 0x1b reserved */
- .quad 0x0000000000000000 /* 0x20 unused */
- .quad 0x0000000000000000 /* 0x28 unused */
- .quad 0x0000000000000000 /* 0x33 TLS entry 1 */
- .quad 0x0000000000000000 /* 0x3b TLS entry 2 */
- .quad 0x0000000000000000 /* 0x43 TLS entry 3 */
- .quad 0x0000000000000000 /* 0x4b reserved */
- .quad 0x0000000000000000 /* 0x53 reserved */
- .quad 0x0000000000000000 /* 0x5b reserved */
-
- .quad 0x00cf9a000000ffff /* 0x60 kernel 4GB code at 0x00000000 */
- .quad 0x00cf92000000ffff /* 0x68 kernel 4GB data at 0x00000000 */
- .quad 0x00cffa000000ffff /* 0x73 user 4GB code at 0x00000000 */
- .quad 0x00cff2000000ffff /* 0x7b user 4GB data at 0x00000000 */
-
- .quad 0x0000000000000000 /* 0x80 TSS descriptor */
- .quad 0x0000000000000000 /* 0x88 LDT descriptor */
-
- /*
- * Segments used for calling PnP BIOS have byte granularity.
- * They code segments and data segments have fixed 64k limits,
- * the transfer segment sizes are set at run time.
- */
- .quad 0x00409a000000ffff /* 0x90 32-bit code */
- .quad 0x00009a000000ffff /* 0x98 16-bit code */
- .quad 0x000092000000ffff /* 0xa0 16-bit data */
- .quad 0x0000920000000000 /* 0xa8 16-bit data */
- .quad 0x0000920000000000 /* 0xb0 16-bit data */
-
- /*
- * The APM segments have byte granularity and their bases
- * are set at run time. All have 64k limits.
- */
- .quad 0x00409a000000ffff /* 0xb8 APM CS code */
- .quad 0x00009a000000ffff /* 0xc0 APM CS 16 code (16 bit) */
- .quad 0x004092000000ffff /* 0xc8 APM DS data */
-
- .quad 0x0000920000000000 /* 0xd0 - ESPFIX 16-bit SS */
- .quad 0x0000000000000000 /* 0xd8 - unused */
- .quad 0x0000000000000000 /* 0xe0 - unused */
- .quad 0x0000000000000000 /* 0xe8 - unused */
- .quad 0x0000000000000000 /* 0xf0 - unused */
- .quad 0x0000000000000000 /* 0xf8 - GDT entry 31: double-fault TSS */
-
Index: ak-pda-gs/arch/i386/kernel/smpboot.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/smpboot.c 2006-09-20 15:38:27.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/smpboot.c 2006-09-20 15:39:01.000000000 +1000
@@ -1058,7 +1058,6 @@
struct warm_boot_cpu_info info;
struct work_struct task;
int apicid, ret;
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);

apicid = x86_cpu_to_apicid[cpu];
if (apicid == BAD_APICID) {
@@ -1066,18 +1065,6 @@
goto exit;
}

- /*
- * the CPU isn't initialized at boot time, allocate gdt table here.
- * cpu_init will initialize it
- */
- if (!cpu_gdt_descr->address) {
- cpu_gdt_descr->address = get_zeroed_page(GFP_KERNEL);
- if (!cpu_gdt_descr->address)
- printk(KERN_CRIT "CPU%d failed to allocate GDT\n", cpu);
- ret = -ENOMEM;
- goto exit;
- }
-
info.complete = &done;
info.apicid = apicid;
info.cpu = cpu;
Index: ak-pda-gs/include/asm-i386/desc.h
===================================================================
--- ak-pda-gs.orig/include/asm-i386/desc.h 2006-09-20 15:27:51.000000000 +1000
+++ ak-pda-gs/include/asm-i386/desc.h 2006-09-20 15:39:01.000000000 +1000
@@ -14,8 +14,7 @@

#include <asm/mmu.h>

-extern struct desc_struct cpu_gdt_table[GDT_ENTRIES];
-
+DECLARE_PER_CPU(struct desc_struct, cpu_gdt_table[GDT_ENTRIES]);
DECLARE_PER_CPU(unsigned char, cpu_16bit_stack[CPU_16BIT_STACK_SIZE]);

struct Xgt_desc_struct {

--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-22 11:55:14

by Rusty Russell

Subject: [PATCH 2/7]

This patch implements save/restore of %gs in the kernel, so it can be
used for per-cpu data. This is not cheap, and we do it for UP as well
as SMP, which is stupid. Benchmarks, anyone?

Based on this patch by Jeremy Fitzhardinge:

From: Jeremy Fitzhardinge <[email protected]>
This patch is the meat of the PDA change. This patch makes several
related changes:

1: Most significantly, %gs is now used in the kernel. This means that on
entry, the old value of %gs is saved away, and it is reloaded with
__KERNEL_PDA.

2: entry.S constructs the stack in the shape of struct pt_regs, and this
is passed around the kernel so that the process's saved register
state can be accessed.

Unfortunately struct pt_regs doesn't currently have space for %gs
(or %fs). This patch extends pt_regs to add space for gs (no space
is allocated for %fs, since it won't be used, and it would just
complicate the code in entry.S to work around the space).

3: Because %gs is now saved on the stack like %ds, %es and the integer
registers, there are a number of places where it no longer needs to
be handled specially; namely context switch, and saving/restoring the
register state in a signal context.

4: And since kernel threads run in kernel space and call normal kernel
code, they need to be created with their %gs == __KERNEL_PDA.

NOTE: even though it's called "ptrace-abi.h", this file does not
define a user-space visible ABI.

And

From: Ingo Molnar <[email protected]>

in the syscall exit path the %gs selector has to be restored _after_ the
last kernel function has been called. If lockdep is enabled then this
kernel function is TRACE_IRQS_ON.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Chuck Ebbert <[email protected]>
Cc: Zachary Amsden <[email protected]>
Cc: Jan Beulich <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>

Index: ak-pda-gs/arch/i386/kernel/asm-offsets.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/asm-offsets.c 2006-09-20 16:40:06.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/asm-offsets.c 2006-09-20 16:40:18.000000000 +1000
@@ -67,6 +67,7 @@
OFFSET(PT_EAX, pt_regs, eax);
OFFSET(PT_DS, pt_regs, xds);
OFFSET(PT_ES, pt_regs, xes);
+ OFFSET(PT_GS, pt_regs, xgs);
OFFSET(PT_ORIG_EAX, pt_regs, orig_eax);
OFFSET(PT_EIP, pt_regs, eip);
OFFSET(PT_CS, pt_regs, xcs);
Index: ak-pda-gs/arch/i386/kernel/cpu/common.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/cpu/common.c 2006-09-20 16:40:06.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/cpu/common.c 2006-09-20 16:40:18.000000000 +1000
@@ -579,6 +579,15 @@
disable_pse = 1;
#endif
}
+
+/* Make sure %gs is initialized properly in idle threads */
+struct pt_regs * __devinit idle_regs(struct pt_regs *regs)
+{
+ memset(regs, 0, sizeof(struct pt_regs));
+ regs->xgs = __KERNEL_PERCPU;
+ return regs;
+}
+
/*
* cpu_init() initializes state that is per-CPU. Some data is already
* initialized (naturally) in the bootstrap process, such as the GDT
@@ -641,8 +650,8 @@
__set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss);
#endif

- /* Clear %fs and %gs. */
- asm volatile ("movl %0, %%fs; movl %0, %%gs" : : "r" (0));
+ /* Clear %fs. */
+ asm volatile ("mov %0, %%fs" : : "r" (0));

/* Clear all 6 debug registers: */
set_debugreg(0, 0);
Index: ak-pda-gs/arch/i386/kernel/entry.S
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/entry.S 2006-09-20 16:40:06.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/entry.S 2006-09-20 16:40:18.000000000 +1000
@@ -30,12 +30,13 @@
* 18(%esp) - %eax
* 1C(%esp) - %ds
* 20(%esp) - %es
- * 24(%esp) - orig_eax
- * 28(%esp) - %eip
- * 2C(%esp) - %cs
- * 30(%esp) - %eflags
- * 34(%esp) - %oldesp
- * 38(%esp) - %oldss
+ * 24(%esp) - %gs
+ * 28(%esp) - orig_eax
+ * 2C(%esp) - %eip
+ * 30(%esp) - %cs
+ * 34(%esp) - %eflags
+ * 38(%esp) - %oldesp
+ * 3C(%esp) - %oldss
*
* "current" is in register %ebx during any slow entries.
*/
@@ -91,6 +92,9 @@

#define SAVE_ALL \
cld; \
+ pushl %gs; \
+ CFI_ADJUST_CFA_OFFSET 4;\
+ /*CFI_REL_OFFSET gs, 0;*/\
pushl %es; \
CFI_ADJUST_CFA_OFFSET 4;\
/*CFI_REL_OFFSET es, 0;*/\
@@ -120,7 +124,9 @@
CFI_REL_OFFSET ebx, 0;\
movl $(__USER_DS), %edx; \
movl %edx, %ds; \
- movl %edx, %es;
+ movl %edx, %es; \
+ movl $(__KERNEL_PERCPU), %edx; \
+ movl %edx, %gs

#define RESTORE_INT_REGS \
popl %ebx; \
@@ -153,17 +159,22 @@
2: popl %es; \
CFI_ADJUST_CFA_OFFSET -4;\
/*CFI_RESTORE es;*/\
-.section .fixup,"ax"; \
-3: movl $0,(%esp); \
- jmp 1b; \
+3: popl %gs; \
+ CFI_ADJUST_CFA_OFFSET -4;\
+ /*CFI_RESTORE gs;*/\
+.pushsection .fixup,"ax"; \
4: movl $0,(%esp); \
+ jmp 1b; \
+5: movl $0,(%esp); \
jmp 2b; \
-.previous; \
+6: movl $0,(%esp); \
+ jmp 3b; \
.section __ex_table,"a";\
.align 4; \
- .long 1b,3b; \
- .long 2b,4b; \
-.previous
+ .long 1b,4b; \
+ .long 2b,5b; \
+ .long 3b,6b; \
+.popsection

#define RING0_INT_FRAME \
CFI_STARTPROC simple;\
@@ -320,11 +331,18 @@
/* if something modifies registers it must also disable sysexit */
movl PT_EIP(%esp), %edx
movl PT_OLDESP(%esp), %ecx
- xorl %ebp,%ebp
TRACE_IRQS_ON
+1: mov PT_GS(%esp), %gs
+ xorl %ebp,%ebp
ENABLE_INTERRUPTS_SYSEXIT
CFI_ENDPROC
-
+.pushsection .fixup,"ax"; \
+2: movl $0,PT_GS(%esp); \
+ jmp 1b; \
+.section __ex_table,"a";\
+ .align 4; \
+ .long 1b,2b; \
+.popsection

# system call handler stub
ENTRY(system_call)
@@ -370,7 +388,7 @@
TRACE_IRQS_IRET
restore_nocheck_notrace:
RESTORE_REGS
- addl $4, %esp
+ addl $4, %esp # skip orig_eax/error_code
CFI_ADJUST_CFA_OFFSET -4
1: INTERRUPT_RETURN
.section .fixup,"ax"
@@ -512,14 +530,12 @@
/* put ESP to the proper location */ \
movl %eax, %esp;
#define UNWIND_ESPFIX_STACK \
- pushl %eax; \
CFI_ADJUST_CFA_OFFSET 4; \
movl %ss, %eax; \
/* see if on 16bit stack */ \
- cmpw $__ESPFIX_SS, %ax; \
+ cmp $__ESPFIX_SS, %eax; \
je 28f; \
-27: popl %eax; \
- CFI_ADJUST_CFA_OFFSET -4; \
+27: CFI_ADJUST_CFA_OFFSET -4; \
.section .fixup,"ax"; \
28: movl $__KERNEL_DS, %eax; \
movl %eax, %ds; \
@@ -588,13 +604,15 @@
CFI_ADJUST_CFA_OFFSET 4
ALIGN
error_code:
+ /* the function address is in %gs's slot on the stack */
+ pushl %es
+ CFI_ADJUST_CFA_OFFSET 4
pushl %ds
CFI_ADJUST_CFA_OFFSET 4
/*CFI_REL_OFFSET ds, 0*/
pushl %eax
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET eax, 0
- xorl %eax, %eax
pushl %ebp
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET ebp, 0
@@ -607,7 +625,6 @@
pushl %edx
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET edx, 0
- decl %eax # eax = -1
pushl %ecx
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET ecx, 0
@@ -615,21 +632,17 @@
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET ebx, 0
cld
- pushl %es
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET es, 0*/
UNWIND_ESPFIX_STACK
- popl %ecx
- CFI_ADJUST_CFA_OFFSET -4
- /*CFI_REGISTER es, ecx*/
- movl PT_ES(%esp), %edi # get the function address
+ movl PT_GS(%esp), %edi # get the function address
movl PT_ORIG_EAX(%esp), %edx # get the error code
- movl %eax, PT_ORIG_EAX(%esp)
- movl %ecx, PT_ES(%esp)
- /*CFI_REL_OFFSET es, ES*/
+ movl $-1, PT_ORIG_EAX(%esp) # no syscall to restart
+ mov %gs, PT_GS(%esp)
+ /*CFI_REL_OFFSET gs, GS*/
movl $(__USER_DS), %ecx
movl %ecx, %ds
movl %ecx, %es
+ movl $(__KERNEL_PERCPU), %ecx
+ movl %ecx, %gs
movl %esp,%eax # pt_regs pointer
call *%edi
jmp ret_from_exception
@@ -939,6 +952,7 @@
movl %ebx, PT_EAX(%edx)
movl $__USER_DS, PT_DS(%edx)
movl $__USER_DS, PT_ES(%edx)
+ movl $0, PT_GS(%edx)
movl %ebx, PT_ORIG_EAX(%edx)
movl %ecx, PT_EIP(%edx)
movl 12(%esp), %ecx
Index: ak-pda-gs/arch/i386/kernel/process.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/process.c 2006-09-20 16:40:06.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/process.c 2006-09-20 16:40:18.000000000 +1000
@@ -336,6 +336,7 @@

regs.xds = __USER_DS;
regs.xes = __USER_DS;
+ regs.xgs = __KERNEL_PERCPU;
regs.orig_eax = -1;
regs.eip = (unsigned long) kernel_thread_helper;
regs.xcs = __KERNEL_CS | get_kernel_rpl();
@@ -421,7 +422,6 @@
p->thread.eip = (unsigned long) ret_from_fork;

savesegment(fs,p->thread.fs);
- savesegment(gs,p->thread.gs);

tsk = current;
if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
@@ -645,16 +645,16 @@
load_esp0(tss, next);

/*
- * Save away %fs and %gs. No need to save %es and %ds, as
- * those are always kernel segments while inside the kernel.
- * Doing this before setting the new TLS descriptors avoids
- * the situation where we temporarily have non-reloadable
- * segments in %fs and %gs. This could be an issue if the
- * NMI handler ever used %fs or %gs (it does not today), or
- * if the kernel is running inside of a hypervisor layer.
+ * Save away %fs. No need to save %gs, as it was saved on the
+ * stack on entry. No need to save %es and %ds, as those are
+ * always kernel segments while inside the kernel. Doing this
+ * before setting the new TLS descriptors avoids the situation
+ * where we temporarily have non-reloadable segments in %fs
+ * and %gs. This could be an issue if the NMI handler ever
+ * used %fs or %gs (it does not today), or if the kernel is
+ * running inside of a hypervisor layer.
*/
savesegment(fs, prev->fs);
- savesegment(gs, prev->gs);

/*
* Load the per-thread Thread-Local Storage descriptor.
@@ -662,16 +662,13 @@
load_TLS(next, cpu);

/*
- * Restore %fs and %gs if needed.
+ * Restore %fs if needed.
*
- * Glibc normally makes %fs be zero, and %gs is one of
- * the TLS segments.
+ * Glibc normally makes %fs be zero.
*/
if (unlikely(prev->fs | next->fs))
loadsegment(fs, next->fs);

- if (prev->gs | next->gs)
- loadsegment(gs, next->gs);

/*
* Restore IOPL if needed.
Index: ak-pda-gs/arch/i386/kernel/signal.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/signal.c 2006-09-20 16:40:06.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/signal.c 2006-09-20 16:40:18.000000000 +1000
@@ -128,7 +128,7 @@
X86_EFLAGS_TF | X86_EFLAGS_SF | X86_EFLAGS_ZF | \
X86_EFLAGS_AF | X86_EFLAGS_PF | X86_EFLAGS_CF)

- GET_SEG(gs);
+ COPY_SEG(gs);
GET_SEG(fs);
COPY_SEG(es);
COPY_SEG(ds);
@@ -244,9 +244,7 @@
{
int tmp, err = 0;

- tmp = 0;
- savesegment(gs, tmp);
- err |= __put_user(tmp, (unsigned int __user *)&sc->gs);
+ err |= __put_user(regs->xgs, (unsigned int __user *)&sc->gs);
savesegment(fs, tmp);
err |= __put_user(tmp, (unsigned int __user *)&sc->fs);

Index: ak-pda-gs/include/asm-i386/mmu_context.h
===================================================================
--- ak-pda-gs.orig/include/asm-i386/mmu_context.h 2006-09-20 16:40:06.000000000 +1000
+++ ak-pda-gs/include/asm-i386/mmu_context.h 2006-09-20 16:40:18.000000000 +1000
@@ -62,8 +62,8 @@
#endif
}

-#define deactivate_mm(tsk, mm) \
- asm("movl %0,%%fs ; movl %0,%%gs": :"r" (0))
+#define deactivate_mm(tsk, mm) \
+ asm("movl %0,%%fs": :"r" (0));

#define activate_mm(prev, next) \
switch_mm((prev),(next),NULL)
Index: ak-pda-gs/include/asm-i386/processor.h
===================================================================
--- ak-pda-gs.orig/include/asm-i386/processor.h 2006-09-20 16:40:06.000000000 +1000
+++ ak-pda-gs/include/asm-i386/processor.h 2006-09-20 16:40:18.000000000 +1000
@@ -477,6 +477,7 @@
.vm86_info = NULL, \
.sysenter_cs = __KERNEL_CS, \
.io_bitmap_ptr = NULL, \
+ .gs = __KERNEL_PERCPU, \
}

/*
@@ -504,7 +505,8 @@
}

#define start_thread(regs, new_eip, new_esp) do { \
- __asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0)); \
+ __asm__("movl %0,%%fs": :"r" (0)); \
+ regs->xgs = 0; \
set_fs(USER_DS); \
regs->xds = __USER_DS; \
regs->xes = __USER_DS; \
Index: ak-pda-gs/kernel/fork.c
===================================================================
--- ak-pda-gs.orig/kernel/fork.c 2006-09-20 16:40:06.000000000 +1000
+++ ak-pda-gs/kernel/fork.c 2006-09-20 16:40:18.000000000 +1000
@@ -1299,7 +1299,7 @@
return ERR_PTR(retval);
}

-struct pt_regs * __devinit __attribute__((weak)) idle_regs(struct pt_regs *regs)
+noinline struct pt_regs * __devinit __attribute__((weak)) idle_regs(struct pt_regs *regs)
{
memset(regs, 0, sizeof(struct pt_regs));
return regs;
Index: ak-pda-gs/include/asm-i386/ptrace.h
===================================================================
--- ak-pda-gs.orig/include/asm-i386/ptrace.h 2006-09-20 16:40:06.000000000 +1000
+++ ak-pda-gs/include/asm-i386/ptrace.h 2006-09-20 16:40:18.000000000 +1000
@@ -33,6 +33,8 @@
long eax;
int xds;
int xes;
+ /* int xfs; */
+ int xgs;
long orig_eax;
long eip;
int xcs;
Index: ak-pda-gs/include/asm-i386/segment.h
===================================================================
--- ak-pda-gs.orig/include/asm-i386/segment.h 2006-09-20 16:40:01.000000000 +1000
+++ ak-pda-gs/include/asm-i386/segment.h 2006-09-20 16:40:18.000000000 +1000
@@ -39,7 +39,7 @@
* 25 - APM BIOS support
*
* 26 - ESPFIX small SS
- * 27 - unused
+ * 27 - PERCPU [ offset segment for per-cpu area ]
* 28 - unused
* 29 - unused
* 30 - unused
@@ -74,6 +74,9 @@
#define GDT_ENTRY_ESPFIX_SS (GDT_ENTRY_KERNEL_BASE + 14)
#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8)

+#define GDT_ENTRY_PERCPU (GDT_ENTRY_KERNEL_BASE + 15)
+#define __KERNEL_PERCPU (GDT_ENTRY_PERCPU * 8)
+
#define GDT_ENTRY_DOUBLEFAULT_TSS 31

/*
Index: ak-pda-gs/include/asm-i386/unwind.h
===================================================================
--- ak-pda-gs.orig/include/asm-i386/unwind.h 2006-09-20 16:40:01.000000000 +1000
+++ ak-pda-gs/include/asm-i386/unwind.h 2006-09-20 16:40:18.000000000 +1000
@@ -64,6 +64,7 @@
info->regs.xss = __KERNEL_DS;
info->regs.xds = __USER_DS;
info->regs.xes = __USER_DS;
+ info->regs.xgs = __KERNEL_PERCPU;
}

extern asmlinkage int arch_unwind_init_running(struct unwind_frame_info *,

--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-22 11:57:00

by Rusty Russell

Subject: [PATCH 3/7] Update sys_vm86 to cope with changed pt_regs and %gs usage.

From: Jeremy Fitzhardinge <[email protected]>
sys_vm86 uses a struct kernel_vm86_regs, which is identical to
pt_regs, but adds an extra space for all the segment registers that
iret needs when returning into vm86 mode.

Previously this structure was completely independent, so changes in
pt_regs had to be manually reflected in kernel_vm86_regs. This change
just embeds pt_regs in kernel_vm86_regs, and makes the appropriate
changes to vm86.c to deal with the new naming.

Also, since %gs is dealt with differently in the kernel, this change
adjusts vm86.c accordingly. Namely, the on-stack saved regs->xgs is
the place where usermode gs is stored, rather than in the CPU's %gs
register.

While making these changes, I also cleaned up some frankly bizarre
code which was added when auditing was added to sys_vm86.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Chuck Ebbert <[email protected]>
Cc: Zachary Amsden <[email protected]>
Cc: Jan Beulich <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Chris Wright <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>

---
arch/i386/kernel/vm86.c | 121 +++++++++++++++++++++++++++++-------------------
include/asm-i386/vm86.h | 17 ------
2 files changed, 76 insertions(+), 62 deletions(-)


Index: linux/arch/i386/kernel/vm86.c
===================================================================
--- linux.orig/arch/i386/kernel/vm86.c
+++ linux/arch/i386/kernel/vm86.c
@@ -43,6 +43,7 @@
#include <linux/highmem.h>
#include <linux/ptrace.h>
#include <linux/audit.h>
+#include <linux/stddef.h>

#include <asm/uaccess.h>
#include <asm/io.h>
@@ -72,10 +73,10 @@
/*
* 8- and 16-bit register defines..
*/
-#define AL(regs) (((unsigned char *)&((regs)->eax))[0])
-#define AH(regs) (((unsigned char *)&((regs)->eax))[1])
-#define IP(regs) (*(unsigned short *)&((regs)->eip))
-#define SP(regs) (*(unsigned short *)&((regs)->esp))
+#define AL(regs) (((unsigned char *)&((regs)->pt.eax))[0])
+#define AH(regs) (((unsigned char *)&((regs)->pt.eax))[1])
+#define IP(regs) (*(unsigned short *)&((regs)->pt.eip))
+#define SP(regs) (*(unsigned short *)&((regs)->pt.esp))

/*
* virtual flags (16 and 32-bit versions)
@@ -89,10 +90,37 @@
#define SAFE_MASK (0xDD5)
#define RETURN_MASK (0xDFF)

-#define VM86_REGS_PART2 orig_eax
-#define VM86_REGS_SIZE1 \
- ( (unsigned)( & (((struct kernel_vm86_regs *)0)->VM86_REGS_PART2) ) )
-#define VM86_REGS_SIZE2 (sizeof(struct kernel_vm86_regs) - VM86_REGS_SIZE1)
+/* convert kernel_vm86_regs to vm86_regs */
+static int copy_vm86_regs_to_user(struct vm86_regs __user *user,
+ const struct kernel_vm86_regs *regs)
+{
+ int ret = 0;
+
+ /* kernel_vm86_regs is missing xfs, so copy everything up to
+ (but not including) xgs, and then rest after xgs. */
+ ret += copy_to_user(user, regs, offsetof(struct kernel_vm86_regs, pt.xgs));
+ ret += copy_to_user(&user->__null_gs, &regs->pt.xgs,
+ sizeof(struct kernel_vm86_regs) -
+ offsetof(struct kernel_vm86_regs, pt.xgs));
+
+ return ret;
+}
+
+/* convert vm86_regs to kernel_vm86_regs */
+static int copy_vm86_regs_from_user(struct kernel_vm86_regs *regs,
+ const struct vm86_regs __user *user,
+ unsigned extra)
+{
+ int ret = 0;
+
+ ret += copy_from_user(regs, user, offsetof(struct kernel_vm86_regs, pt.xgs));
+ ret += copy_from_user(&regs->pt.xgs, &user->__null_gs,
+ sizeof(struct kernel_vm86_regs) -
+ offsetof(struct kernel_vm86_regs, pt.xgs) +
+ extra);
+
+ return ret;
+}

struct pt_regs * FASTCALL(save_v86_state(struct kernel_vm86_regs * regs));
struct pt_regs * fastcall save_v86_state(struct kernel_vm86_regs * regs)
@@ -112,10 +140,8 @@ struct pt_regs * fastcall save_v86_state
printk("no vm86_info: BAD\n");
do_exit(SIGSEGV);
}
- set_flags(regs->eflags, VEFLAGS, VIF_MASK | current->thread.v86mask);
- tmp = copy_to_user(&current->thread.vm86_info->regs,regs, VM86_REGS_SIZE1);
- tmp += copy_to_user(&current->thread.vm86_info->regs.VM86_REGS_PART2,
- &regs->VM86_REGS_PART2, VM86_REGS_SIZE2);
+ set_flags(regs->pt.eflags, VEFLAGS, VIF_MASK | current->thread.v86mask);
+ tmp = copy_vm86_regs_to_user(&current->thread.vm86_info->regs,regs);
tmp += put_user(current->thread.screen_bitmap,&current->thread.vm86_info->screen_bitmap);
if (tmp) {
printk("vm86: could not access userspace vm86_info\n");
@@ -129,9 +155,11 @@ struct pt_regs * fastcall save_v86_state
current->thread.saved_esp0 = 0;
put_cpu();

- loadsegment(fs, current->thread.saved_fs);
- loadsegment(gs, current->thread.saved_gs);
ret = KVM86->regs32;
+
+ loadsegment(fs, current->thread.saved_fs);
+ ret->xgs = current->thread.saved_gs;
+
return ret;
}

@@ -183,9 +211,9 @@ asmlinkage int sys_vm86old(struct pt_reg
tsk = current;
if (tsk->thread.saved_esp0)
goto out;
- tmp = copy_from_user(&info, v86, VM86_REGS_SIZE1);
- tmp += copy_from_user(&info.regs.VM86_REGS_PART2, &v86->regs.VM86_REGS_PART2,
- (long)&info.vm86plus - (long)&info.regs.VM86_REGS_PART2);
+ tmp = copy_vm86_regs_from_user(&info.regs, &v86->regs,
+ offsetof(struct kernel_vm86_struct, vm86plus) -
+ sizeof(info.regs));
ret = -EFAULT;
if (tmp)
goto out;
@@ -233,9 +261,9 @@ asmlinkage int sys_vm86(struct pt_regs r
if (tsk->thread.saved_esp0)
goto out;
v86 = (struct vm86plus_struct __user *)regs.ecx;
- tmp = copy_from_user(&info, v86, VM86_REGS_SIZE1);
- tmp += copy_from_user(&info.regs.VM86_REGS_PART2, &v86->regs.VM86_REGS_PART2,
- (long)&info.regs32 - (long)&info.regs.VM86_REGS_PART2);
+ tmp = copy_vm86_regs_from_user(&info.regs, &v86->regs,
+ offsetof(struct kernel_vm86_struct, regs32) -
+ sizeof(info.regs));
ret = -EFAULT;
if (tmp)
goto out;
@@ -252,15 +280,15 @@ out:
static void do_sys_vm86(struct kernel_vm86_struct *info, struct task_struct *tsk)
{
struct tss_struct *tss;
- long eax;
/*
* make sure the vm86() system call doesn't try to do anything silly
*/
- info->regs.__null_ds = 0;
- info->regs.__null_es = 0;
+ info->regs.pt.xds = 0;
+ info->regs.pt.xes = 0;
+ info->regs.pt.xgs = 0;

-/* we are clearing fs,gs later just before "jmp resume_userspace",
- * because starting with Linux 2.1.x they aren't no longer saved/restored
+/* we are clearing fs later just before "jmp resume_userspace",
+ * because it is not saved/restored.
*/

/*
@@ -268,10 +296,10 @@ static void do_sys_vm86(struct kernel_vm
* has set it up safely, so this makes sure interrupt etc flags are
* inherited from protected mode.
*/
- VEFLAGS = info->regs.eflags;
- info->regs.eflags &= SAFE_MASK;
- info->regs.eflags |= info->regs32->eflags & ~SAFE_MASK;
- info->regs.eflags |= VM_MASK;
+ VEFLAGS = info->regs.pt.eflags;
+ info->regs.pt.eflags &= SAFE_MASK;
+ info->regs.pt.eflags |= info->regs32->eflags & ~SAFE_MASK;
+ info->regs.pt.eflags |= VM_MASK;

switch (info->cpu_type) {
case CPU_286:
@@ -294,7 +322,7 @@ static void do_sys_vm86(struct kernel_vm
info->regs32->eax = 0;
tsk->thread.saved_esp0 = tsk->thread.esp0;
savesegment(fs, tsk->thread.saved_fs);
- savesegment(gs, tsk->thread.saved_gs);
+ tsk->thread.saved_gs = info->regs32->xgs;

tss = &per_cpu(init_tss, get_cpu());
tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;
@@ -306,19 +334,18 @@ static void do_sys_vm86(struct kernel_vm
tsk->thread.screen_bitmap = info->screen_bitmap;
if (info->flags & VM86_SCREEN_BITMAP)
mark_screen_rdonly(tsk->mm);
- __asm__ __volatile__("xorl %eax,%eax; movl %eax,%fs; movl %eax,%gs\n\t");
- __asm__ __volatile__("movl %%eax, %0\n" :"=r"(eax));

/*call audit_syscall_exit since we do not exit via the normal paths */
if (unlikely(current->audit_context))
- audit_syscall_exit(AUDITSC_RESULT(eax), eax);
+ audit_syscall_exit(AUDITSC_RESULT(0), 0);

__asm__ __volatile__(
"movl %0,%%esp\n\t"
"movl %1,%%ebp\n\t"
+ "mov %2, %%fs\n\t"
"jmp resume_userspace"
: /* no outputs */
- :"r" (&info->regs), "r" (task_thread_info(tsk)));
+ :"r" (&info->regs), "r" (task_thread_info(tsk)), "r" (0));
/* we never return here */
}

@@ -348,12 +375,12 @@ static inline void clear_IF(struct kerne

static inline void clear_TF(struct kernel_vm86_regs * regs)
{
- regs->eflags &= ~TF_MASK;
+ regs->pt.eflags &= ~TF_MASK;
}

static inline void clear_AC(struct kernel_vm86_regs * regs)
{
- regs->eflags &= ~AC_MASK;
+ regs->pt.eflags &= ~AC_MASK;
}

/* It is correct to call set_IF(regs) from the set_vflags_*
@@ -370,7 +397,7 @@ static inline void clear_AC(struct kerne
static inline void set_vflags_long(unsigned long eflags, struct kernel_vm86_regs * regs)
{
set_flags(VEFLAGS, eflags, current->thread.v86mask);
- set_flags(regs->eflags, eflags, SAFE_MASK);
+ set_flags(regs->pt.eflags, eflags, SAFE_MASK);
if (eflags & IF_MASK)
set_IF(regs);
else
@@ -380,7 +407,7 @@ static inline void set_vflags_long(unsig
static inline void set_vflags_short(unsigned short flags, struct kernel_vm86_regs * regs)
{
set_flags(VFLAGS, flags, current->thread.v86mask);
- set_flags(regs->eflags, flags, SAFE_MASK);
+ set_flags(regs->pt.eflags, flags, SAFE_MASK);
if (flags & IF_MASK)
set_IF(regs);
else
@@ -389,7 +416,7 @@ static inline void set_vflags_short(unsi

static inline unsigned long get_vflags(struct kernel_vm86_regs * regs)
{
- unsigned long flags = regs->eflags & RETURN_MASK;
+ unsigned long flags = regs->pt.eflags & RETURN_MASK;

if (VEFLAGS & VIF_MASK)
flags |= IF_MASK;
@@ -493,7 +520,7 @@ static void do_int(struct kernel_vm86_re
unsigned long __user *intr_ptr;
unsigned long segoffs;

- if (regs->cs == BIOSSEG)
+ if (regs->pt.xcs == BIOSSEG)
goto cannot_handle;
if (is_revectored(i, &KVM86->int_revectored))
goto cannot_handle;
@@ -505,9 +532,9 @@ static void do_int(struct kernel_vm86_re
if ((segoffs >> 16) == BIOSSEG)
goto cannot_handle;
pushw(ssp, sp, get_vflags(regs), cannot_handle);
- pushw(ssp, sp, regs->cs, cannot_handle);
+ pushw(ssp, sp, regs->pt.xcs, cannot_handle);
pushw(ssp, sp, IP(regs), cannot_handle);
- regs->cs = segoffs >> 16;
+ regs->pt.xcs = segoffs >> 16;
SP(regs) -= 6;
IP(regs) = segoffs & 0xffff;
clear_TF(regs);
@@ -524,7 +551,7 @@ int handle_vm86_trap(struct kernel_vm86_
if (VMPI.is_vm86pus) {
if ( (trapno==3) || (trapno==1) )
return_to_32bit(regs, VM86_TRAP + (trapno << 8));
- do_int(regs, trapno, (unsigned char __user *) (regs->ss << 4), SP(regs));
+ do_int(regs, trapno, (unsigned char __user *) (regs->pt.xss << 4), SP(regs));
return 0;
}
if (trapno !=1)
@@ -560,10 +587,10 @@ void handle_vm86_fault(struct kernel_vm8
handle_vm86_trap(regs, 0, 1); \
return; } while (0)

- orig_flags = *(unsigned short *)&regs->eflags;
+ orig_flags = *(unsigned short *)&regs->pt.eflags;

- csp = (unsigned char __user *) (regs->cs << 4);
- ssp = (unsigned char __user *) (regs->ss << 4);
+ csp = (unsigned char __user *) (regs->pt.xcs << 4);
+ ssp = (unsigned char __user *) (regs->pt.xss << 4);
sp = SP(regs);
ip = IP(regs);

@@ -650,7 +677,7 @@ void handle_vm86_fault(struct kernel_vm8
SP(regs) += 6;
}
IP(regs) = newip;
- regs->cs = newcs;
+ regs->pt.xcs = newcs;
CHECK_IF_IN_TRAP;
if (data32) {
set_vflags_long(newflags, regs);
Index: linux/include/asm-i386/vm86.h
===================================================================
--- linux.orig/include/asm-i386/vm86.h
+++ linux/include/asm-i386/vm86.h
@@ -145,26 +145,13 @@ struct vm86plus_struct {
* at the end of the structure. Look at ptrace.h to see the "normal"
* setup. For user space layout see 'struct vm86_regs' above.
*/
+#include <asm/ptrace.h>

struct kernel_vm86_regs {
/*
* normal regs, with special meaning for the segment descriptors..
*/
- long ebx;
- long ecx;
- long edx;
- long esi;
- long edi;
- long ebp;
- long eax;
- long __null_ds;
- long __null_es;
- long orig_eax;
- long eip;
- unsigned short cs, __csh;
- long eflags;
- long esp;
- unsigned short ss, __ssh;
+ struct pt_regs pt;
/*
* these are specific to v86 mode:
*/

--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-22 11:58:32

by Rusty Russell

Subject: [PATCH 4/7] Fix places where using %gs changes the usermode ABI.

There are a few places where the change in struct pt_regs and the use
of %gs affect the userspace ABI. These are primarily debugging
interfaces where thread state can be inspected or extracted.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Chuck Ebbert <[email protected]>
Cc: Zachary Amsden <[email protected]>
Cc: Jan Beulich <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>

Index: ak-pda-gs/arch/i386/kernel/process.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/process.c 2006-09-20 16:40:18.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/process.c 2006-09-20 16:46:40.000000000 +1000
@@ -304,8 +304,8 @@
regs->eax,regs->ebx,regs->ecx,regs->edx);
printk("ESI: %08lx EDI: %08lx EBP: %08lx",
regs->esi, regs->edi, regs->ebp);
- printk(" DS: %04x ES: %04x\n",
- 0xffff & regs->xds,0xffff & regs->xes);
+ printk(" DS: %04x ES: %04x GS: %04x\n",
+ 0xffff & regs->xds,0xffff & regs->xes, 0xffff & regs->xgs);

cr0 = read_cr0();
cr2 = read_cr2();
@@ -499,7 +499,7 @@
dump->regs.ds = regs->xds;
dump->regs.es = regs->xes;
savesegment(fs,dump->regs.fs);
- savesegment(gs,dump->regs.gs);
+ dump->regs.gs = regs->xgs;
dump->regs.orig_eax = regs->orig_eax;
dump->regs.eip = regs->eip;
dump->regs.cs = regs->xcs;
Index: ak-pda-gs/arch/i386/kernel/ptrace.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/ptrace.c 2006-09-20 15:33:57.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/ptrace.c 2006-09-20 16:46:40.000000000 +1000
@@ -94,13 +94,9 @@
return -EIO;
child->thread.fs = value;
return 0;
- case GS:
- if (value && (value & 3) != 3)
- return -EIO;
- child->thread.gs = value;
- return 0;
case DS:
case ES:
+ case GS:
if (value && (value & 3) != 3)
return -EIO;
value &= 0xffff;
@@ -116,8 +112,8 @@
value |= get_stack_long(child, EFL_OFFSET) & ~FLAG_MASK;
break;
}
- if (regno > GS*4)
- regno -= 2*4;
+ if (regno > ES*4)
+ regno -= 1*4;
put_stack_long(child, regno - sizeof(struct pt_regs), value);
return 0;
}
@@ -131,18 +127,16 @@
case FS:
retval = child->thread.fs;
break;
- case GS:
- retval = child->thread.gs;
- break;
case DS:
case ES:
+ case GS:
case SS:
case CS:
retval = 0xffff;
/* fall through */
default:
- if (regno > GS*4)
- regno -= 2*4;
+ if (regno > ES*4)
+ regno -= 1*4;
regno = regno - sizeof(struct pt_regs);
retval &= get_stack_long(child, regno);
}
Index: ak-pda-gs/include/asm-i386/elf.h
===================================================================
--- ak-pda-gs.orig/include/asm-i386/elf.h 2006-09-20 15:33:57.000000000 +1000
+++ ak-pda-gs/include/asm-i386/elf.h 2006-09-20 16:46:40.000000000 +1000
@@ -91,7 +91,7 @@
pr_reg[7] = regs->xds; \
pr_reg[8] = regs->xes; \
savesegment(fs,pr_reg[9]); \
- savesegment(gs,pr_reg[10]); \
+ pr_reg[10] = regs->xgs; \
pr_reg[11] = regs->orig_eax; \
pr_reg[12] = regs->eip; \
pr_reg[13] = regs->xcs; \

--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-22 11:59:50

by Rusty Russell

Subject: [PATCH 5/7] Use %gs for per-cpu sections in kernel

This patch actually uses the gs register to implement the per-cpu
sections. It's fairly straightforward: the gs segment starts at the
per-cpu offset for the particular cpu (or 0, in very early boot).

We also implement x86_64-inspired (via Jeremy Fitzhardinge) per-cpu
accesses where a general lvalue isn't needed. These
single-instruction accesses are slightly more efficient, plus (being a
single insn) are atomic wrt. preemption so we can use them to
implement cpu_local_inc etc.
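
For illustration only (not part of the patch, and using a made-up variable
name), a minimal sketch of what the new accessors look like in use:

    DEFINE_PER_CPU(unsigned long, nr_foo);  /* hypothetical counter */

    static void bump_foo(void)
    {
            /* compiles to a single "addl $1, %gs:per_cpu__nr_foo" */
            x86_add_percpu(nr_foo, 1);
    }

    static unsigned long read_foo(void)
    {
            /* compiles to a single "movl %gs:per_cpu__nr_foo, %eax" */
            return x86_read_percpu(nr_foo);
    }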

Signed-off-by: Rusty Russell <[email protected]>

Index: ak-fresh/arch/i386/kernel/cpu/common.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/cpu/common.c 2006-09-22 16:48:14.000000000 +1000
+++ ak-fresh/arch/i386/kernel/cpu/common.c 2006-09-22 17:02:47.000000000 +1000
@@ -13,6 +13,7 @@
#include <asm/mmu_context.h>
#include <asm/mtrr.h>
#include <asm/mce.h>
+#include <asm/smp.h>
#ifdef CONFIG_X86_LOCAL_APIC
#include <asm/mpspec.h>
#include <asm/apic.h>
@@ -601,12 +602,24 @@
struct thread_struct *thread = &current->thread;
struct desc_struct *gdt;
__u32 stk16_off = (__u32)&per_cpu(cpu_16bit_stack, cpu);
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);

if (cpu_test_and_set(cpu, cpu_initialized)) {
printk(KERN_WARNING "CPU#%d already initialized!\n", cpu);
for (;;) local_irq_enable();
}
+
+ /* Set up GDT entry for 16bit stack */
+ stk16_off = (u32)&per_cpu(cpu_16bit_stack, cpu);
+ gdt = per_cpu(cpu_gdt_table, cpu);
+ *(__u64 *)(&gdt[GDT_ENTRY_ESPFIX_SS]) |=
+ ((((__u64)stk16_off) << 16) & 0x000000ffffff0000ULL) |
+ ((((__u64)stk16_off) << 32) & 0xff00000000000000ULL) |
+ (CPU_16BIT_STACK_SIZE - 1);
+
+ /* Complete percpu area setup early, before calling printk(),
+ since it may end up using it indirectly. */
+ setup_percpu_for_this_cpu(cpu);
+
printk(KERN_INFO "Initializing CPU#%d\n", cpu);

if (cpu_has_vme || cpu_has_tsc || cpu_has_de)
@@ -618,17 +631,6 @@
set_in_cr4(X86_CR4_TSD);
}

- /* Set up GDT entry for 16bit stack */
- gdt = __get_cpu_var(cpu_gdt_table);
- *(__u64 *)(&gdt[GDT_ENTRY_ESPFIX_SS]) |=
- ((((__u64)stk16_off) << 16) & 0x000000ffffff0000ULL) |
- ((((__u64)stk16_off) << 32) & 0xff00000000000000ULL) |
- (CPU_16BIT_STACK_SIZE - 1);
-
- cpu_gdt_descr->size = GDT_SIZE - 1;
- cpu_gdt_descr->address = (unsigned long)gdt;
-
- load_gdt(cpu_gdt_descr);
load_idt(&idt_descr);

/*
Index: ak-fresh/arch/i386/kernel/smpboot.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/smpboot.c 2006-09-22 16:48:14.000000000 +1000
+++ ak-fresh/arch/i386/kernel/smpboot.c 2006-09-22 17:02:47.000000000 +1000
@@ -102,6 +102,9 @@
{ [0 ... NR_CPUS-1] = 0xff };
EXPORT_SYMBOL(x86_cpu_to_apicid);

+DEFINE_PER_CPU(unsigned long, this_cpu_off);
+EXPORT_PER_CPU_SYMBOL(this_cpu_off);
+
/*
* Trampoline 80x86 program as an array.
*/
@@ -1303,6 +1306,37 @@
synchronize_tsc_bp();
}

+static inline void set_kernel_gs(void)
+{
+ /* Set %gs for this CPU's per-cpu area. Memory clobber is to create a
+ barrier with respect to any per-cpu operations, so the compiler
+ doesn't move any before here. */
+ asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PERCPU) : "memory");
+}
+
+static __cpuinit void setup_percpu_descriptor(struct desc_struct *gdt,
+ unsigned long per_cpu_off)
+{
+ unsigned limit, flags;
+
+ limit = (1 << 20);
+ flags = 0x8; /* 4k granularity */
+
+ /* present read-write data segment */
+ pack_descriptor((u32 *)&gdt->a, (u32 *)&gdt->b,
+ per_cpu_off, limit - 1,
+ 0x80 | DESCTYPE_S | 0x2, flags);
+}
+
+/* Set up a very early per-cpu for the boot CPU so that smp_processor_id()
+ and current will work. */
+void __init smp_setup_processor_id(void)
+{
+ /* We use the per-cpu template area (__per_cpu_offset[0] == 0). */
+ __per_cpu_offset[0] = 0;
+ setup_percpu_for_this_cpu(0);
+}
+
/* These are wrappers to interface to the new boot process. Someone
who understands all this stuff should rewrite it properly. --RR 15/Jul/02 */
void __init smp_prepare_cpus(unsigned int max_cpus)
@@ -1313,8 +1347,25 @@
smp_boot_cpus(max_cpus);
}

+/* Be careful not to use %gs references until this is setup: needs to
+ * be done on this CPU. */
+void __init setup_percpu_for_this_cpu(unsigned int cpu)
+{
+ struct desc_struct *gdt = per_cpu(cpu_gdt_table, cpu);
+ struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
+
+ per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
+ setup_percpu_descriptor(&gdt[GDT_ENTRY_PERCPU], __per_cpu_offset[cpu]);
+ cpu_gdt_descr->address = (unsigned long)gdt;
+ cpu_gdt_descr->size = GDT_SIZE - 1;
+ load_gdt(cpu_gdt_descr);
+ set_kernel_gs();
+}
+
void __devinit smp_prepare_boot_cpu(void)
{
+ setup_percpu_for_this_cpu(0);
+
cpu_set(smp_processor_id(), cpu_online_map);
cpu_set(smp_processor_id(), cpu_callout_map);
cpu_set(smp_processor_id(), cpu_present_map);
Index: ak-fresh/include/asm-i386/percpu.h
===================================================================
--- ak-fresh.orig/include/asm-i386/percpu.h 2006-09-22 16:48:14.000000000 +1000
+++ ak-fresh/include/asm-i386/percpu.h 2006-09-22 16:59:00.000000000 +1000
@@ -1,6 +1,107 @@
#ifndef __ARCH_I386_PERCPU__
#define __ARCH_I386_PERCPU__

+#ifdef CONFIG_SMP
+/* Same as generic implementation except for optimized local access. */
+#define __GENERIC_PER_CPU
+
+/* This is used for other cpus to find our section. */
+extern unsigned long __per_cpu_offset[NR_CPUS];
+
+/* Separate out the type, so (int[3], foo) works. */
+#define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
+#define DEFINE_PER_CPU(type, name) \
+ __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
+
+/* We can use this directly for local CPU (faster). */
+DECLARE_PER_CPU(unsigned long, this_cpu_off);
+
+/* var is in discarded region: offset to particular copy we want */
+#define per_cpu(var, cpu) (*({ \
+ extern int simple_indentifier_##var(void); \
+ RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]); }))
+
+#define __raw_get_cpu_var(var) (*({ \
+ extern int simple_indentifier_##var(void); \
+ RELOC_HIDE(&per_cpu__##var, x86_read_percpu(this_cpu_off)); \
+}))
+
+#define __get_cpu_var(var) __raw_get_cpu_var(var)
+
+/* A macro to avoid #include hell... */
+#define percpu_modcopy(pcpudst, src, size) \
+do { \
+ unsigned int __i; \
+ for_each_possible_cpu(__i) \
+ memcpy((pcpudst)+__per_cpu_offset[__i], \
+ (src), (size)); \
+} while (0)
+
+#define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)
+#define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var)
+
+/* gs segment starts at (positive) offset == __per_cpu_offset[cpu] */
+#define __percpu_seg "%%gs:"
+#else /* !SMP */
#include <asm-generic/percpu.h>
+#define __percpu_seg ""
+#endif /* SMP */
+
+/* For arch-specific code, we can use direct single-insn ops (they
+ * don't give an lvalue though). */
+extern void __bad_percpu_size(void);
+
+#define percpu_to_op(op,var,val) \
+ do { \
+ typedef typeof(var) T__; \
+ if (0) { T__ tmp__; tmp__ = (val); } \
+ switch (sizeof(var)) { \
+ case 1: \
+ asm(op "b %1,"__percpu_seg"%0" \
+ : "+m" (var) \
+ :"ri" ((T__)val)); \
+ break; \
+ case 2: \
+ asm(op "w %1,"__percpu_seg"%0" \
+ : "+m" (var) \
+ :"ri" ((T__)val)); \
+ break; \
+ case 4: \
+ asm(op "l %1,"__percpu_seg"%0" \
+ : "+m" (var) \
+ :"ri" ((T__)val)); \
+ break; \
+ default: __bad_percpu_size(); \
+ } \
+ } while (0)
+
+#define percpu_from_op(op,var) \
+ ({ \
+ typeof(var) ret__; \
+ switch (sizeof(var)) { \
+ case 1: \
+ asm(op "b "__percpu_seg"%1,%0" \
+ : "=r" (ret__) \
+ : "m" (var)); \
+ break; \
+ case 2: \
+ asm(op "w "__percpu_seg"%1,%0" \
+ : "=r" (ret__) \
+ : "m" (var)); \
+ break; \
+ case 4: \
+ asm(op "l "__percpu_seg"%1,%0" \
+ : "=r" (ret__) \
+ : "m" (var)); \
+ break; \
+ default: __bad_percpu_size(); \
+ } \
+ ret__; })
+
+#define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var)
+#define x86_write_percpu(var,val) percpu_to_op("mov", per_cpu__##var, val)
+#define x86_add_percpu(var,val) percpu_to_op("add", per_cpu__##var, val)
+#define x86_sub_percpu(var,val) percpu_to_op("sub", per_cpu__##var, val)
+#define x86_or_percpu(var,val) percpu_to_op("or", per_cpu__##var, val)

#endif /* __ARCH_I386_PERCPU__ */
Index: ak-fresh/include/asm-i386/smp.h
===================================================================
--- ak-fresh.orig/include/asm-i386/smp.h 2006-09-22 16:48:14.000000000 +1000
+++ ak-fresh/include/asm-i386/smp.h 2006-09-22 17:02:47.000000000 +1000
@@ -86,6 +86,8 @@
extern void __cpu_die(unsigned int cpu);
extern unsigned int num_processors;

+void setup_percpu_for_this_cpu(unsigned int cpu);
+
#endif /* !__ASSEMBLY__ */

#else /* CONFIG_SMP */
@@ -94,6 +96,8 @@

#define NO_PROC_ID 0xFF /* No processor magic marker */

+#define setup_percpu_for_this_cpu(cpu)
+
#endif

#ifndef __ASSEMBLY__
Index: ak-fresh/arch/i386/kernel/setup.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/setup.c 2006-09-22 17:02:46.000000000 +1000
+++ ak-fresh/arch/i386/kernel/setup.c 2006-09-22 17:03:49.000000000 +1000
@@ -1474,6 +1474,8 @@

/* ESPFIX 16-bit SS */
[GDT_ENTRY_ESPFIX_SS] = { 0x00000000, 0x00009200 },
+ /* FIXME: We save/restore %gs even on UP: fix entry.S. */
+ [GDT_ENTRY_PERCPU] = { 0x0000ffff, 0x00cf9200 },
};

/* Early in boot we use the master per-cpu gdt_table directly. */

--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-22 12:00:55

by Rusty Russell

Subject: [PATCH 6/7] (Optional) implement smp_processor_id() as a per-cpu var

This implements smp_processor_id() as a per-cpu variable. The generic
code expects it in thread_info still, so we can't remove it from
there, but reducing smp_processor_id() from 9 bytes/3 insns to 6
bytes/1 insn is a nice micro-optimization.
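
Roughly, and purely as an illustration (approximate codegen, not from the
patch), the two forms compare as:

    current_thread_info()->cpu;         /* movl %esp,%eax; andl $-THREAD_SIZE,%eax;
                                           movl <cpu offset>(%eax),%eax -- 3 insns */
    x86_read_percpu(processor_id);      /* movl %gs:per_cpu__processor_id,%eax -- 1 insn */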

Signed-off-by: Rusty Russell <[email protected]>

Index: ak-pda-gs/arch/i386/kernel/smpboot.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/smpboot.c 2006-09-20 19:55:07.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/smpboot.c 2006-09-20 22:36:06.000000000 +1000
@@ -105,6 +105,9 @@
DEFINE_PER_CPU(unsigned long, this_cpu_off);
EXPORT_PER_CPU_SYMBOL(this_cpu_off);

+DEFINE_PER_CPU(u32, processor_id);
+EXPORT_PER_CPU_SYMBOL(processor_id);
+
/*
* Trampoline 80x86 program as an array.
*/
@@ -1355,6 +1358,7 @@
struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);

per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
+ per_cpu(processor_id, cpu) = cpu;
setup_percpu_descriptor(&gdt[GDT_ENTRY_PERCPU], __per_cpu_offset[cpu]);
cpu_gdt_descr->address = (unsigned long)gdt;
cpu_gdt_descr->size = GDT_SIZE - 1;
Index: ak-pda-gs/include/asm-i386/smp.h
===================================================================
--- ak-pda-gs.orig/include/asm-i386/smp.h 2006-09-20 20:20:50.000000000 +1000
+++ ak-pda-gs/include/asm-i386/smp.h 2006-09-20 22:36:22.000000000 +1000
@@ -8,6 +8,7 @@
#include <linux/kernel.h>
#include <linux/threads.h>
#include <linux/cpumask.h>
+#include <asm/percpu.h>
#endif

#ifdef CONFIG_X86_LOCAL_APIC
@@ -56,7 +57,8 @@
* from the initial startup. We map APIC_BASE very early in page_setup(),
* so this is correct in the x86 case.
*/
-#define raw_smp_processor_id() (current_thread_info()->cpu)
+DECLARE_PER_CPU(u32, processor_id);
+#define raw_smp_processor_id() x86_read_percpu(processor_id)

extern cpumask_t cpu_callout_map;
extern cpumask_t cpu_callin_map;
Index: ak-pda-gs/arch/i386/kernel/cpu/common.c
===================================================================
--- ak-pda-gs.orig/arch/i386/kernel/cpu/common.c 2006-09-20 21:28:28.000000000 +1000
+++ ak-pda-gs/arch/i386/kernel/cpu/common.c 2006-09-20 22:37:17.000000000 +1000
@@ -597,7 +597,7 @@
*/
void __cpuinit cpu_init(void)
{
- int cpu = smp_processor_id();
+ unsigned int cpu = current->thread_info->cpu;
struct tss_struct * t = &per_cpu(init_tss, cpu);
struct thread_struct *thread = &current->thread;
struct desc_struct *gdt;

--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-22 12:01:46

by Rusty Russell

Subject: [PATCH 7/7] (Optional) implement current as a per-cpu var

This implements current as a per-cpu variable. The generic code
expects it in thread_info still, so I don't remove it from there, but
reducing current from 9 bytes/3 insns to 6 bytes/1 insn is a nice
micro-optimization.

Signed-off-by: Rusty Russell <[email protected]>

Index: ak-fresh/arch/i386/kernel/setup.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/setup.c 2006-09-22 17:03:49.000000000 +1000
+++ ak-fresh/arch/i386/kernel/setup.c 2006-09-22 17:04:29.000000000 +1000
@@ -148,6 +148,9 @@

unsigned char __initdata boot_params[PARAM_SIZE];

+DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
+EXPORT_PER_CPU_SYMBOL(current_task);
+
static struct resource data_resource = {
.name = "Kernel data",
.start = 0,
Index: ak-fresh/arch/i386/kernel/smpboot.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/smpboot.c 2006-09-22 17:04:18.000000000 +1000
+++ ak-fresh/arch/i386/kernel/smpboot.c 2006-09-22 17:04:29.000000000 +1000
@@ -598,12 +598,13 @@
* We don't actually need to load the full TSS,
* basically just the stack pointer and the eip.
*/
-
+ /* Can't use current until setup_percpu_for_this_cpu */
asm volatile(
"movl %0,%%esp\n\t"
"jmp *%1"
:
- :"r" (current->thread.esp),"r" (current->thread.eip));
+ :"r" (current_thread_info()->task->thread.esp),
+ "r" (current_thread_info()->task->thread.eip));
}

extern struct {
@@ -946,6 +947,8 @@
if (IS_ERR(idle))
panic("failed fork for CPU %d", cpu);
idle->thread.eip = (unsigned long) start_secondary;
+ per_cpu(current_task, cpu) = idle;
+
/* start_eip had better be page-aligned! */
start_eip = setup_trampoline();

Index: ak-fresh/include/asm-i386/current.h
===================================================================
--- ak-fresh.orig/include/asm-i386/current.h 2006-09-22 17:02:46.000000000 +1000
+++ ak-fresh/include/asm-i386/current.h 2006-09-22 17:04:29.000000000 +1000
@@ -1,15 +1,10 @@
#ifndef _I386_CURRENT_H
#define _I386_CURRENT_H

+#include <asm/percpu.h>
#include <linux/thread_info.h>

-struct task_struct;
-
-static __always_inline struct task_struct * get_current(void)
-{
- return current_thread_info()->task;
-}
-
-#define current get_current()
+DECLARE_PER_CPU(struct task_struct *, current_task);
+#define current x86_read_percpu(current_task)

#endif /* !(_I386_CURRENT_H) */
Index: ak-fresh/arch/i386/kernel/process.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/process.c 2006-09-22 17:02:46.000000000 +1000
+++ ak-fresh/arch/i386/kernel/process.c 2006-09-22 17:04:30.000000000 +1000
@@ -669,6 +669,7 @@
if (unlikely(prev->fs | next->fs))
loadsegment(fs, next->fs);

+ x86_write_percpu(current_task, next_p);

/*
* Restore IOPL if needed.
Index: ak-fresh/arch/i386/kernel/cpu/common.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/cpu/common.c 2006-09-22 17:04:18.000000000 +1000
+++ ak-fresh/arch/i386/kernel/cpu/common.c 2006-09-22 17:05:25.000000000 +1000
@@ -597,9 +597,10 @@
*/
void __cpuinit cpu_init(void)
{
- unsigned int cpu = current->thread_info->cpu;
+ /* Careful: can't use current() or smp_processor_id() yet! */
+ unsigned int cpu = current_thread_info()->cpu;
struct tss_struct * t = &per_cpu(init_tss, cpu);
- struct thread_struct *thread = &current->thread;
+ struct thread_struct *thread = &per_cpu(current_task, cpu)->thread;
struct desc_struct *gdt;
__u32 stk16_off = (__u32)&per_cpu(cpu_16bit_stack, cpu);


--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-22 12:32:18

by Andi Kleen

Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

BTW I changed my copy, sorry. I redid the early PDA support
to not be in assembler.

On Fri, Sep 22, 2006 at 09:59:45PM +1000, Rusty Russell wrote:
> This patch actually uses the gs register to implement the per-cpu
> sections. It's fairly straightforward: the gs segment starts at the
> per-cpu offset for the particular cpu (or 0, in very early boot).
>
> We also implement x86_64-inspired (via Jeremy Fitzhardinge) per-cpu
> accesses where a general lvalue isn't needed. These
> single-instruction accesses are slightly more efficient, plus (being a
> single insn) are atomic wrt. preemption so we can use them to
> implement cpu_local_inc etc.

The problem is nobody uses cpu_local_inc() etc :/ And it is difficult
to use in generic code because of the usual preemption issues
(and being slower on other archs in many cases compared to preempt disabling
around a larger block of code).
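
(To illustrate the comparison, with a made-up per-cpu counter: generic code
today has to write

    preempt_disable();
    __get_cpu_var(nr_foo)++;
    preempt_enable();

while a cpu_local_inc()-style accessor, e.g. the patch's x86_add_percpu(),
can collapse that to a single "addl $1,%gs:per_cpu__nr_foo".)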

Without that it is the same code as Jeremy's variant: a %gs memory
reference plus another reference with an offset, as far as I can see.

So while it looks nice I don't think it will have advantages. Or did
I miss something?

-Andi

2006-09-22 22:24:49

by Jeremy Fitzhardinge

Subject: Re: [PATCH 2/7]

Rusty Russell wrote:
> This patch implements save/restore of %gs in the kernel, so it can be
> used for per-cpu data. This is not cheap, and we do it for UP as well
> as SMP, which is stupid. Benchmarks, anyone?
>
I measured the cost as adding 9 cycles to a null syscall on my Core Duo
machine. I have not explicitly measured it on other machines, but I run
a number of other segment save/load tests on a wide range of machines,
and didn't find much variability.

I think saving/restoring %gs will still be necessary. There are a number
of places in the kernel which expect to find the usermode %gs on the
kernel stack frame, including context switch, ptrace, vm86, signal
context, and maybe something else. If you don't save it on the stack,
then you need to have UP variations of %gs handling in all those other
places, which is pretty messy. Also, unless you want to have two
definitions of struct_pt regs (which would add even more mess into
ptrace), you'd still need to sub/add %esp in entry.S to skip over the
%gs hole. I don't think this UP microoptimisation would be worth enough
to justify the mess it would cause elsewhere.

How does this version of the patch differ from mine? Is it just my
patch+Ingo's fix, or are there other changes? I couldn't see anything
from a quick read-over.

J

2006-09-22 22:40:01

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

Rusty Russell wrote:
> This patch actually uses the gs register to implement the per-cpu
> sections. It's fairly straightforward: the gs segment starts at the
> per-cpu offset for the particular cpu (or 0, in very early boot).
>
> We also implement x86_64-inspired (via Jeremy Fitzhardinge) per-cpu
> accesses where a general lvalue isn't needed. These
> single-instruction accesses are slightly more efficient, plus (being a
> single insn) are atomic wrt. preemption so we can use them to
> implement cpu_local_inc etc.
>
> Signed-off-by: Rusty Russell <[email protected]>
>
> Index: ak-fresh/arch/i386/kernel/cpu/common.c
> ===================================================================
> --- ak-fresh.orig/arch/i386/kernel/cpu/common.c 2006-09-22 16:48:14.000000000 +1000
> +++ ak-fresh/arch/i386/kernel/cpu/common.c 2006-09-22 17:02:47.000000000 +1000
> @@ -13,6 +13,7 @@
> #include <asm/mmu_context.h>
> #include <asm/mtrr.h>
> #include <asm/mce.h>
> +#include <asm/smp.h>
> #ifdef CONFIG_X86_LOCAL_APIC
> #include <asm/mpspec.h>
> #include <asm/apic.h>
> @@ -601,12 +602,24 @@
> struct thread_struct *thread = &current->thread;
> struct desc_struct *gdt;
> __u32 stk16_off = (__u32)&per_cpu(cpu_16bit_stack, cpu);
> - struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
>
> if (cpu_test_and_set(cpu, cpu_initialized)) {
> printk(KERN_WARNING "CPU#%d already initialized!\n", cpu);
> for (;;) local_irq_enable();
> }
> +
> + /* Set up GDT entry for 16bit stack */
> + stk16_off = (u32)&per_cpu(cpu_16bit_stack, cpu);
> + gdt = per_cpu(cpu_gdt_table, cpu);
> + *(__u64 *)(&gdt[GDT_ENTRY_ESPFIX_SS]) |=
> + ((((__u64)stk16_off) << 16) & 0x000000ffffff0000ULL) |
> + ((((__u64)stk16_off) << 32) & 0xff00000000000000ULL) |
> + (CPU_16BIT_STACK_SIZE - 1);
>

This should use pack_descriptor(). I'd never got around to changing it,
but it really should.
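
A rough sketch of what that could look like, reusing the type/flags
values from the percpu descriptor setup elsewhere in this thread
(illustrative only, not a tested replacement):

	/* present, read-write data, byte granularity, base = 16bit stack */
	pack_descriptor((u32 *)&gdt[GDT_ENTRY_ESPFIX_SS].a,
			(u32 *)&gdt[GDT_ENTRY_ESPFIX_SS].b,
			stk16_off, CPU_16BIT_STACK_SIZE - 1,
			0x80 | DESCTYPE_S | 0x2, 0);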

> + /* Complete percpu area setup early, before calling printk(),
> + since it may end up using it indirectly. */
> + setup_percpu_for_this_cpu(cpu);
> +
>

I managed to get all this done in head.S before going into C code; is
that not still possible? Or is there a later patch to do this?

> +static __cpuinit void setup_percpu_descriptor(struct desc_struct *gdt,
> + unsigned long per_cpu_off)
> +{
> + unsigned limit, flags;
> +
> + limit = (1 << 20);
> + flags = 0x8; /* 4k granularity */
>

Why not set the limit to the percpu section size? It would avoid having
it clipped under Xen.


> +/* Be careful not to use %gs references until this is setup: needs to
> + * be done on this CPU. */
> +void __init setup_percpu_for_this_cpu(unsigned int cpu)
> +{
> + struct desc_struct *gdt = per_cpu(cpu_gdt_table, cpu);
> + struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
> +
> + per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
> + setup_percpu_descriptor(&gdt[GDT_ENTRY_PERCPU], __per_cpu_offset[cpu]);
> + cpu_gdt_descr->address = (unsigned long)gdt;
> + cpu_gdt_descr->size = GDT_SIZE - 1;
> + load_gdt(cpu_gdt_descr);
> + set_kernel_gs();
> +}
>

Everything except the load_gdt and set_kernel_gs could be done in advance.

> +
> void __devinit smp_prepare_boot_cpu(void)
> {
> + setup_percpu_for_this_cpu(0);
> +
> cpu_set(smp_processor_id(), cpu_online_map);
> cpu_set(smp_processor_id(), cpu_callout_map);
> cpu_set(smp_processor_id(), cpu_present_map);
> Index: ak-fresh/include/asm-i386/percpu.h
> ===================================================================
> --- ak-fresh.orig/include/asm-i386/percpu.h 2006-09-22 16:48:14.000000000 +1000
> +++ ak-fresh/include/asm-i386/percpu.h 2006-09-22 16:59:00.000000000 +1000
> @@ -1,6 +1,107 @@
> #ifndef __ARCH_I386_PERCPU__
> #define __ARCH_I386_PERCPU__
>
> +#ifdef CONFIG_SMP
> +/* Same as generic implementation except for optimized local access. */
> +#define __GENERIC_PER_CPU
> +
> +/* This is used for other cpus to find our section. */
> +extern unsigned long __per_cpu_offset[NR_CPUS];
> +
> +/* Separate out the type, so (int[3], foo) works. */
> +#define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
> +#define DEFINE_PER_CPU(type, name) \
> + __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
> +
> +/* We can use this directly for local CPU (faster). */
> +DECLARE_PER_CPU(unsigned long, this_cpu_off);
> +
> +/* var is in discarded region: offset to particular copy we want */
> +#define per_cpu(var, cpu) (*({ \
> + extern int simple_indentifier_##var(void); \
> + RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]); }))
> +
> +#define __raw_get_cpu_var(var) (*({ \
> + extern int simple_indentifier_##var(void); \
> + RELOC_HIDE(&per_cpu__##var, x86_read_percpu(this_cpu_off)); \
> +}))
> +
> +#define __get_cpu_var(var) __raw_get_cpu_var(var)
> +
> +/* A macro to avoid #include hell... */
> +#define percpu_modcopy(pcpudst, src, size) \
> +do { \
> + unsigned int __i; \
> + for_each_possible_cpu(__i) \
> + memcpy((pcpudst)+__per_cpu_offset[__i], \
> + (src), (size)); \
> +} while (0)
> +
> +#define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)
> +#define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var)
> +
> +/* gs segment starts at (positive) offset == __per_cpu_offset[cpu] */
> +#define __percpu_seg "%%gs:"
> +#else /* !SMP */
> #include <asm-generic/percpu.h>
> +#define __percpu_seg ""
> +#endif /* SMP */
> +
> +/* For arch-specific code, we can use direct single-insn ops (they
> + * don't give an lvalue though). */
> +extern void __bad_percpu_size(void);
> +
> +#define percpu_to_op(op,var,val) \
> + do { \
> + typedef typeof(var) T__; \
> + if (0) { T__ tmp__; tmp__ = (val); } \
> + switch (sizeof(var)) { \
> + case 1: \
> + asm(op "b %1,"__percpu_seg"%0" \
>

So are symbols referencing the .data.percpu section 0-based? Wouldn't
you need to subtract __per_cpu_start from the symbols to get a 0-based
segment offset?

Or is the only percpu benefit you're getting from %gs a slightly
quicker way of getting the percpu_offset? Does that help much?
> +#define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var)
> +#define x86_write_percpu(var,val) percpu_to_op("mov", per_cpu__##var, val)
> +#define x86_add_percpu(var,val) percpu_to_op("add", per_cpu__##var, val)
> +#define x86_sub_percpu(var,val) percpu_to_op("sub", per_cpu__##var, val)
> +#define x86_or_percpu(var,val) percpu_to_op("or", per_cpu__##var, val)
>

Why x86_? If some other arch implemented a similar mechanism, wouldn't
they want to use the same macro names?

J

2006-09-22 22:43:51

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

Andi Kleen wrote:
> BTW I changed my copy sorry. I redid the early PDA support
> to not be in assembler.

I went to the trouble of making the PDA completely set up before any C
code ran. Did you undo that?

Andrew mentioned that people have various hacks which hook into mcount,
and want to use current/smp_processor_id, which means that they
have to work from the first function prologue.
It also simplifies things to get all that set up ASAP so there's no
bootstrap dependency problem.

J

2006-09-22 23:53:01

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Saturday 23 September 2006 00:43, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > BTW I changed my copy sorry. I redid the early PDA support
> > to not be in assembler.
>
> I went to the trouble of making the PDA completely set up before any C
> code ran.

Yes, but your patch never applied to anything even remotely
looking like the code in my tree. I got so frustrated that
I ended up reimplementing it in a cleaner way.

Now head.S calls i386_start_kernel() and that calls pda_init()
without any additional assembly code or other special cases etc.

This is very similar to how x86-64 works.

> which means that they
> have to work from the first function prologue.

I mainly did it to fix lockdep.

I used to do mcount hacks myself, but you typically need
a few special annotations for those anyway, so I am not too
concerned about them.

> It also simplifies things to get all that set up ASAP so there's no
> bootstrap dependency problem.

Yes, no argument on that.

-Andi

2006-09-23 04:31:27

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Fri, 2006-09-22 at 15:39 -0700, Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> > +
> > + /* Set up GDT entry for 16bit stack */
> > + stk16_off = (u32)&per_cpu(cpu_16bit_stack, cpu);
> > + gdt = per_cpu(cpu_gdt_table, cpu);
> > + *(__u64 *)(&gdt[GDT_ENTRY_ESPFIX_SS]) |=
> > + ((((__u64)stk16_off) << 16) & 0x000000ffffff0000ULL) |
> > + ((((__u64)stk16_off) << 32) & 0xff00000000000000ULL) |
> > + (CPU_16BIT_STACK_SIZE - 1);
> >
>
> This should use pack_descriptor(). I'd never got around to changing it,
> but it really should.

Yep, agreed.

> > + /* Complete percpu area setup early, before calling printk(),
> > + since it may end up using it indirectly. */
> > + setup_percpu_for_this_cpu(cpu);
> > +
> >
>
> I managed to get all this done in head.S before going into C code; is
> that not still possible? Or is there a later patch to do this.

It's possible; it would simplify the C code a little, but I'll have to
see what the asm looks like.

> > +static __cpuinit void setup_percpu_descriptor(struct desc_struct *gdt,
> > + unsigned long per_cpu_off)
> > +{
> > + unsigned limit, flags;
> > +
> > + limit = (1 << 20);
> > + flags = 0x8; /* 4k granularity */
> >
>
> Why not set the limit to the percpu section size? It would avoid having
> it clipped under Xen.

Sure... there were a couple of other things Xen needs, too, so I thought
I'd do a separate patch (a whole page for the GDT and the Xen page, which
means generic per-cpu setup should use boot_alloc_pages()).

> > +/* Be careful not to use %gs references until this is setup: needs to
> > + * be done on this CPU. */
> > +void __init setup_percpu_for_this_cpu(unsigned int cpu)
> > +{
> > + struct desc_struct *gdt = per_cpu(cpu_gdt_table, cpu);
> > + struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
> > +
> > + per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
> > + setup_percpu_descriptor(&gdt[GDT_ENTRY_PERCPU], __per_cpu_offset[cpu]);
> > + cpu_gdt_descr->address = (unsigned long)gdt;
> > + cpu_gdt_descr->size = GDT_SIZE - 1;
> > + load_gdt(cpu_gdt_descr);
> > + set_kernel_gs();
> > +}
> >
>
> Everything except the load_gdt and set_kernel_gs could be done in advance.

Yes. Which particularly makes sense if this is done in asm, as you
suggested above.

> > +#define percpu_to_op(op,var,val) \
> > + do { \
> > + typedef typeof(var) T__; \
> > + if (0) { T__ tmp__; tmp__ = (val); } \
> > + switch (sizeof(var)) { \
> > + case 1: \
> > + asm(op "b %1,"__percpu_seg"%0" \
> >
>
> So are symbols referencing the .data.percpu section 0-based? Wouldn't
> you need to subtract __per_cpu_start from the symbols to get a 0-based
> segment offset?

I don't think I understand the question.

The .data.percpu section is the "template" per-cpu section, freed along
with other initdata: after setup_per_cpu_areas() is called, it is not
supposed to be used. Around that time, the gs segment is set up based
at __per_cpu_offset[cpu], so "%gs:<varname>" accesses the local version.
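
Spelling out the intended arithmetic (same names as above; this is just
a restatement, not code from the patch):

	__per_cpu_offset[cpu] = this cpu's per-cpu block - __per_cpu_start
	%gs base              = __per_cpu_offset[cpu]
	%gs:per_cpu__foo      = __per_cpu_offset[cpu] + &per_cpu__foo
	                      = address of this cpu's copy of foo

so the template symbol addresses are used unmodified as segment offsets.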

> Or is the only percpu benefit you're getting from %gs a slightly
> quicker way of getting the percpu_offset? Does that help much?

In generic code, that's true (the arch-specific accessors can do it in 1
insn, not two). But it's still a help. This is __raw_get_cpu_var(x)
before:

3: 89 e0 mov %esp,%eax
5: 25 00 e0 ff ff and $0xffffe000,%eax
a: 8b 40 08 mov 0x8(%eax),%eax
d: 8b 04 85 00 00 00 00 mov 0x0(,%eax,4),%eax
10: R_386_32 __per_cpu_offset
14: 8b 80 00 00 00 00 mov 0x0(%eax),%eax
16: R_386_32 per_cpu__x

And this is after:

1f: 65 a1 00 00 00 00 mov %gs:0x0,%eax
21: R_386_32 per_cpu__this_cpu_off
25: 8b 80 00 00 00 00 mov 0x0(%eax),%eax
27: R_386_32 per_cpu__x

So we go from 5 instructions, 23 bytes, 3 memory references, to 2
instructions, 12 bytes, 2 memory references (although the extra mem ref
will almost certainly have been in cache).
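
For comparison, where only an rvalue is needed, x86_read_percpu(x)
should collapse to a single instruction of the same form as the first
one above, roughly (illustrative, not from an actual build):

  65 a1 00 00 00 00	mov %gs:0x0,%eax
			R_386_32 per_cpu__x

i.e. 1 instruction, 6 bytes, 1 memory reference.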

> > +#define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var)
> > +#define x86_write_percpu(var,val) percpu_to_op("mov", per_cpu__##var, val)
> > +#define x86_add_percpu(var,val) percpu_to_op("add", per_cpu__##var, val)
> > +#define x86_sub_percpu(var,val) percpu_to_op("sub", per_cpu__##var, val)
> > +#define x86_or_percpu(var,val) percpu_to_op("or", per_cpu__##var, val)
>
> Why x86_? If some other arch implemented a similar mechanism, wouldn't
> they want to use the same macro names?

Possibly, but for the moment they are very arch specific: we really
don't want them in generic code. It *might* be worth creating a generic
"read_per_cpu()" which returns a rvalue, but IMHO adding a new thread
model which is all-positive-offset is probably a better long-term plan.
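
Roughly, such a generic wrapper might be no more than the following
(the name and the generic fallback are assumptions, not existing code):

	#ifndef read_per_cpu
	#define read_per_cpu(var)	__get_cpu_var(var)	/* generic */
	#endif

with i386 overriding it as #define read_per_cpu(var) x86_read_percpu(var).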

Cheers,
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-23 04:36:06

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 2/7]

On Fri, 2006-09-22 at 15:24 -0700, Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> > This patch implements save/restore of %gs in the kernel, so it can be
> > used for per-cpu data. This is not cheap, and we do it for UP as well
> > as SMP, which is stupid. Benchmarks, anyone?
> >
> I measured the cost as adding 9 cycles to a null syscall on my Core Duo
> machine. I have not explicitly measured it on other machines, but I ran
> a number of other segment save/load tests on a wide range of machines,
> and didn't find much variability.

Oh, OK! I had a belief that segment loading was expensive, perhaps I'm
off-base here.

> I think saving/restoring %gs will still be necessary. There are a number
> of places in the kernel which expect to find the usermode %gs on the
> kernel stack frame, including context switch, ptrace, vm86, signal
> context, and maybe something else. If you don't save it on the stack,
> then you need to have UP variations of %gs handling in all those other
> places, which is pretty messy. Also, unless you want to have two
> definitions of struct pt_regs (which would add even more mess into
> ptrace), you'd still need to sub/add %esp in entry.S to skip over the
> %gs hole. I don't think this UP microoptimisation would be worth enough
> to justify the mess it would cause elsewhere.
>
> How does this version of the patch differ from mine? Is it just my
> patch+Ingo's fix, or are there other changes? I couldn't see anything
> from a quick read-over.

Yep, no substantive changes. s/__KERNEL_PDA/__KERNEL_PERCPU/, plus your
version had a "write_pda(pcurrent, next_p)" inserted in process.c's
__switch_to which belonged in a successor patch...

Thanks!
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-23 04:51:19

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Fri, 2006-09-22 at 14:32 +0200, Andi Kleen wrote:
> BTW I changed my copy sorry. I redid the early PDA support
> to not be in assembler.
>
> On Fri, Sep 22, 2006 at 09:59:45PM +1000, Rusty Russell wrote:
> > This patch actually uses the gs register to implement the per-cpu
> > sections. It's fairly straightforward: the gs segment starts at the
> > per-cpu offset for the particular cpu (or 0, in very early boot).
> >
> > We also implement x86_64-inspired (via Jeremy Fitzhardinge) per-cpu
> > accesses where a general lvalue isn't needed. These
> > single-instruction accesses are slightly more efficient, plus (being a
> > single insn) are atomic wrt. preemption so we can use them to
> > implement cpu_local_inc etc.
>
> The problem is nobody uses cpu_local_inc() etc :/ And it is difficult
> to use in generic code because of the usual preemption issues
> (and being slower on other archs in many cases compared to preempt disabling
> around a larger block of code)

True, they were primarily designed for the SNMP counters in networking
(as per your comment in snmp.h). They're now dynamic per-cpu pointers,
which adds a new wrinkle though 8(

> Without that it is the same code as Jeremy's variant:
> a %gs memory reference + another reference with an offset, as far as I can see.
>
> So while it looks nice I don't think it will have advantages. Or did
> I miss something?

Mainly that it makes more sense to use the existing per-cpu concept than
introduce another kind of per-cpu var within a special structure, but
it's also more efficient (see other post). Hopefully it will spark
interest in making dynamic-percpu pointers use the same offset scheme,
now x86 will experience the benefits.

And we might even get a third user of local_t!

Cheers,
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-23 08:13:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

> >+
> >+ /* Set up GDT entry for 16bit stack */
> >+ stk16_off = (u32)&per_cpu(cpu_16bit_stack, cpu);
> >+ gdt = per_cpu(cpu_gdt_table, cpu);
> >+ *(__u64 *)(&gdt[GDT_ENTRY_ESPFIX_SS]) |=
> >+ ((((__u64)stk16_off) << 16) & 0x000000ffffff0000ULL) |
> >+ ((((__u64)stk16_off) << 32) & 0xff00000000000000ULL) |
> >+ (CPU_16BIT_STACK_SIZE - 1);
> >
>
> This should use pack_descriptor(). I'd never got around to changing it,
> but it really should.

I fixed it now in the original patch

> >+ /* Complete percpu area setup early, before calling printk(),
> >+ since it may end up using it indirectly. */
> >+ setup_percpu_for_this_cpu(cpu);
> >+
> >
>
> I managed to get all this done in head.S before going into C code; is
> that not still possible? Or is there a later patch to do this.

Why write in assembler what you can write in C?

-Andi

2006-09-23 08:17:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

> Mainly that it makes more sense to use the existing per-cpu concept than
> introduce another kind of per-cpu var within a special structure, but
> it's also more efficient (see other post). Hopefully it will spark

What post exactly? AFAIK it is the same code for common code.

The advantage of the PDA split is that the important variables which are
in the PDA can be accessed with a single reference, while generic portable
per CPU data is the same as it was before. With your scheme even
the PDA accesses are at least two instructions, right? (I don't
think gcc/ld can resolve the per cpu section offset into a constant,
so it has to load them into a register first)

> interest in making dynamic-percpu pointers use the same offset scheme,
> now x86 will experience the benefits.
>
> And we might even get a third user of local_t!

I'm not holding my breath. I guess it was a nice idea before preemption
became popular ...

-Andi

2006-09-23 08:55:58

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Sat, 2006-09-23 at 10:17 +0200, Andi Kleen wrote:
> > Mainly that it makes more sense to use the existing per-cpu concept than
> > introduce another kind of per-cpu var within a special structure, but
> > it's also more efficient (see other post). Hopefully it will spark
>
> What post exactly? AFAIK it is the same code for common code.
>
> The advantage of the PDA split is that the important variables which are
> in the PDA can be accessed with a single reference, while generic portable
> per CPU data is the same as it was before. With your scheme even
> the PDA accesses are at least two instructions, right? (I don't
> think gcc/ld can resolve the per cpu section offset into a constant,
> so it has to load them into a register first)

No, now normal per-cpu accesses are 2 insns, and per-cpu accesses using
the arch-specific macros are 1 insn, i.e. it's as if every per-cpu
variable were in the "pda".

Here's the reply to Jeremy's query:

Jeremy says:
> Or is the only percpu benefit you're getting from %gs a slightly
> quicker way of getting the percpu_offset? Does that help much?

In generic code, that's true (the arch-specific accessors can do it in 1
insn, not two). But it's still a help. This is __raw_get_cpu_var(x)
before:

3: 89 e0 mov %esp,%eax
5: 25 00 e0 ff ff and $0xffffe000,%eax
a: 8b 40 08 mov 0x8(%eax),%eax
d: 8b 04 85 00 00 00 00 mov 0x0(,%eax,4),%eax
10: R_386_32 __per_cpu_offset
14: 8b 80 00 00 00 00 mov 0x0(%eax),%eax
16: R_386_32 per_cpu__x

And this is after:

1f: 65 a1 00 00 00 00 mov %gs:0x0,%eax
21: R_386_32 per_cpu__this_cpu_off
25: 8b 80 00 00 00 00 mov 0x0(%eax),%eax
27: R_386_32 per_cpu__x

So we go from 5 instructions, 23 bytes, 3 memory references, to 2
instructions, 12 bytes, 2 memory references (although the extra mem ref
will almost certainly have been in cache).

> > interest in making dynamic-percpu pointers use the same offset scheme,
> > now x86 will experience the benefits.
> >
> > And we might even get a third user of local_t!
>
> I'm not holding my breath. I guess it was a nice idea before preemption
> became popular ...

Well, since Xen doesn't support preemption, perhaps we'll convince
distros to turn it off again? 8)

Sorry for the confusion,
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-25 01:02:59

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

Rusty Russell wrote:
>> So are symbols referencing the .data.percpu section 0-based? Wouldn't
>> you need to subtract __per_cpu_start from the symbols to get a 0-based
>> segment offset?
>>
>
> I don't think I understand the question.
>
> The .data.percpu section is the "template" per-cpu section, freed along
> with other initdata: after setup_per_cpu_areas() is called, it is not
> supposed to be used. Around that time, the gs segment is set up based
> at __per_cpu_offset[cpu], so "%gs:<varname>" accesses the local version.
>

If you do

DEFINE_PER_CPU(int, foo);

then this ends up defining per_cpu__foo in .data.percpu. Since
.data.percpu is part of the init data section, it starts at some address
X (not 0), so the real offset into the actual per-cpu memory is actually
(per_cpu__foo - __per_cpu_start). setup_per_cpu_areas() builds this
delta into the __per_cpu_offset[], and so it means that the base of your
%gs segment is at -__per_cpu_start from the actual start of the CPU's
per-cpu memory, and the limit has to be correspondingly larger. Which
is a bit ugly. Especially since "__per_cpu_start" is actually very
large, and so this scheme pretty much relies on being able to wrap
around the segment limit, and will be very bad for Xen.

An alternative is to put the "-__per_cpu_start" into the addressing mode
when constructing the address of the per-cpu variable.

J

2006-09-25 01:07:21

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

Andi Kleen wrote:
>> I managed to get all this done in head.S before going into C code; is
>> that not still possible? Or is there a later patch to do this.
>>
>
> Why write in assembler what you can write in C?
>
This stuff is very basic, and you could consider it as being part of the
kernel's C runtime model, and therefore can be expected to be available
everywhere. In particular, the use of current is so prevalent that you
really can't call anything without having the PDA set up.

J

2006-09-25 01:16:24

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Sun, 2006-09-24 at 18:03 -0700, Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> >> So are symbols referencing the .data.percpu section 0-based? Wouldn't
> >> you need to subtract __per_cpu_start from the symbols to get a 0-based
> >> segment offset?
> >>
> >
> > I don't think I understand the question.
> >
> > The .data.percpu section is the "template" per-cpu section, freed along
> > with other initdata: after setup_per_cpu_areas() is called, it is not
> > supposed to be used. Around that time, the gs segment is set up based
> > at __per_cpu_offset[cpu], so "%gs:<varname>" accesses the local version.
> >
>
> If you do
>
> DEFINE_PER_CPU(int, foo);
>
> then this ends up defining per_cpu__foo in .data.percpu. Since
> .data.percpu is part of the init data section, it starts at some address
> X (not 0), so the real offset into the actual per-cpu memory is actually
> (per_cpu__foo - __per_cpu_start). setup_per_cpu_areas() builds this
> delta into the __per_cpu_offset[], and so it means that the base of your
> %gs segment is at -__per_cpu_start from the actual start of the CPU's
> per-cpu memory, and the limit has to be correspondingly larger. Which
> is a bit ugly.

Hi Jeremy!

You're thinking of it in a convoluted way, by converting to offsets
from the per-cpu section, then converting it back. How about this
explanation: the local cpu's versions are offset from where the compiler
thinks they are by __per_cpu_offset[cpu]. We set the segment base to
__per_cpu_offset[cpu], so "%gs:per_cpu__foo" gets us straight to the
local cpu version. __per_cpu_offset[cpu] is always positive (kernel
image sits at bottom of kernel address space).

> Especially since "__per_cpu_start" is actually very
> large, and so this scheme pretty much relies on being able to wrap
> around the segment limit, and will be very bad for Xen.

__per_cpu_start is large, yes. But there's no reason to use it in
address calculation. The second half of your statement is not correct.

> An alternative is to put the "-__per_cpu_start" into the addressing mode
> when constructing the address of the per-cpu variable.

I think you're thinking of TLS relocations? I don't use them...

Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-25 01:20:59

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Sun, 2006-09-24 at 18:07 -0700, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> >> I managed to get all this done in head.S before going into C code; is
> >> that not still possible? Or is there a later patch to do this.
> >>
> >
> > Why write in assembler what you can write in C?
> >
> This stuff is very basic, and you could consider it as being part of the
> kernel's C runtime model, and therefore can be expected to be available
> everywhere. In particular, the use of current is so prevalent that you
> really can't call anything without having the PDA set up.

Yeah, I agree with Jeremy. It's nice if we can just use it everywhere,
and he kindly explained to me that for secondary CPUs we can do most of
it before the CPU bringup (i.e. in C): we just have to load the gs
register in asm AFAICT.

I'll produce a patch, so we can see what we're talking about...

Thanks!
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-25 01:36:04

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

Rusty Russell wrote:
> You're thinking of it in a convoluted way, by converting to offsets
> from the per-cpu section, then converting it back. How about this
> explanation: the local cpu's versions are offset from where the compiler
> thinks they are by __per_cpu_offset[cpu]. We set the segment base to
> __per_cpu_offset[cpu], so "%gs:per_cpu__foo" gets us straight to the
> local cpu version. __per_cpu_offset[cpu] is always positive (kernel
> image sits at bottom of kernel address space).
>

We're talking kernel virtual addresses, so the physical load address
doesn't matter, of course.

So, take this kernel I have here as an explicit example:

$ nm -n vmlinux
[...]
c0431100 A __per_cpu_start
[...]
c0433800 D per_cpu__cpu_gdt_descr
c0433880 D per_cpu__cpu_tlbstate


And say that this CPU has its percpu data allocated at 0xc100000.

So, in this case the %gs base will be loaded with 0xc100000-0xc0431100 =
0x4bccef00
The offset of per_cpu__cpu_gdt_descr is 0xc0433800, so
%gs:per_cpu__cpu_gdt_descr will compute 0x4bccef00+0xc0433800 to get the
final linear address. Since 0xc0433800 is negative, this is actually a
subtraction, and it therefore requires the segment to have a 4G limit.
Which makes Xen sad.

>> Especially since "__per_cpu_start" is actually very
>> large, and so this scheme pretty much relies on being able to wrap
>> around the segment limit, and will be very bad for Xen.
>>
>
> __per_cpu_start is large, yes. But there's no reason to use it in
> address calculation. The second half of your statement is not correct.
>

__per_cpu_start is added to all per_cpu__* addresses.

>> An alternative is to put the "-__per_cpu_start" into the addressing mode
>> when constructing the address of the per-cpu variable.
>>
>
> I think you're thinking of TLS relocations? I don't use them...
>

No, but this is just as bad.

J

2006-09-25 02:51:21

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Sun, 2006-09-24 at 18:36 -0700, Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> > You're thinking of it in a convoluted way, by converting to offsets
> > from the per-cpu section, then converting it back. How about this
> > explanation: the local cpu's versions are offset from where the compiler
> > thinks they are by __per_cpu_offset[cpu]. We set the segment base to
> > __per_cpu_offset[cpu], so "%gs:per_cpu__foo" gets us straight to the
> > local cpu version. __per_cpu_offset[cpu] is always positive (kernel
> > image sits at bottom of kernel address space).
> >
>
> We're talking kernel virtual addresses, so the physical load address
> doesn't matter, of course.
>
> So, take this kernel I have here as an explicit example:
>
> $ nm -n vmlinux
> [...]
> c0431100 A __per_cpu_start
> [...]
> c0433800 D per_cpu__cpu_gdt_descr
> c0433880 D per_cpu__cpu_tlbstate
>
>
> And say that this CPU has its percpu data allocated at 0xc100000.

That can't happen, since 0xc100000 is not in the kernel address space.
0xc1000000 is though, perhaps that's what you meant?

> So, in this case the %gs base will be loaded with 0xc100000-0xc0431100 =
> 0x4bccef00

A negative offset, exactly, which can't happen, as I said.

Hope that clarifies?

Confused,
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-25 05:26:34

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Mon, 2006-09-25 at 11:20 +1000, Rusty Russell wrote:
> On Sun, 2006-09-24 at 18:07 -0700, Jeremy Fitzhardinge wrote:
> > Andi Kleen wrote:
> > >> I managed to get all this done in head.S before going into C code; is
> > >> that not still possible? Or is there a later patch to do this.
> > >>
> > >
> > > Why write in assembler what you can write in C?
> > >
> > This stuff is very basic, and you could consider it as being part of the
> > kernel's C runtime model, and therefore can be expected to be available
> > everywhere. In particular, the use of current is so prevalent that you
> > really can't call anything without having the PDA set up.
>
> Yeah, I agree with Jeremy. It's nice if we can just use it everywhere,
> and he kindly explained to me that for secondary CPUs we can do most of
> it before the CPU bringup (i.e. in C): we just have to load the gs
> register in asm AFAICT.
>
> I'll produce a patch, so we can see what we're talking about...

Actually, it's quite sweet! We have a pointer to the currently booting
CPU's GDT descriptor and load that in asm (only one additional insn),
and the C code is significantly simpler. Here is a replacement for
patch 5/7, and I'll post updated (simplified) 6/7 and 7/7 separately.

Thanks Jeremy!
Rusty.

====
This patch actually uses the gs register to implement the per-cpu
sections. It's fairly straightforward: the gs segment starts at the
per-cpu offset for the particular cpu (or 0, in very early boot).

Jeremy Fitzhardinge points out that if we set up %gs correctly in asm,
we can always use per-cpu variables (eg. smp_processor_id() and
current()) in C code. As he intimated, this is quite clean.

We keep a "booting_cpu_gdt_desc_ptr". It's statically initialized to
the master per-cpu gdt descriptor, and then updated to point to the
current cpu's GDT descriptor as CPUs come up. This keeps the asm much
the same as before.

We also implement x86_64-inspired (via Jeremy Fitzhardinge) per-cpu
accesses where a general lvalue isn't needed. These
single-instruction accesses are slightly more efficient, plus (being a
single insn) are atomic wrt. preemption so we can use them to
implement cpu_local_inc etc.

Signed-off-by: Rusty Russell <[email protected]>

Index: ak-fresh/arch/i386/kernel/cpu/common.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/cpu/common.c 2006-09-25 14:15:32.000000000 +1000
+++ ak-fresh/arch/i386/kernel/cpu/common.c 2006-09-25 14:21:40.000000000 +1000
@@ -13,6 +13,7 @@
#include <asm/mmu_context.h>
#include <asm/mtrr.h>
#include <asm/mce.h>
+#include <asm/smp.h>
#ifdef CONFIG_X86_LOCAL_APIC
#include <asm/mpspec.h>
#include <asm/apic.h>
@@ -601,12 +602,19 @@
struct thread_struct *thread = &current->thread;
struct desc_struct *gdt;
__u32 stk16_off = (__u32)&per_cpu(cpu_16bit_stack, cpu);
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);

if (cpu_test_and_set(cpu, cpu_initialized)) {
printk(KERN_WARNING "CPU#%d already initialized!\n", cpu);
for (;;) local_irq_enable();
}
+
+ /* Set up GDT entry for 16bit stack */
+ gdt = __get_cpu_var(cpu_gdt_table);
+ *(__u64 *)(&gdt[GDT_ENTRY_ESPFIX_SS]) |=
+ ((((__u64)stk16_off) << 16) & 0x000000ffffff0000ULL) |
+ ((((__u64)stk16_off) << 32) & 0xff00000000000000ULL) |
+ (CPU_16BIT_STACK_SIZE - 1);
+
printk(KERN_INFO "Initializing CPU#%d\n", cpu);

if (cpu_has_vme || cpu_has_tsc || cpu_has_de)
@@ -618,17 +626,6 @@
set_in_cr4(X86_CR4_TSD);
}

- /* Set up GDT entry for 16bit stack */
- gdt = __get_cpu_var(cpu_gdt_table);
- *(__u64 *)(&gdt[GDT_ENTRY_ESPFIX_SS]) |=
- ((((__u64)stk16_off) << 16) & 0x000000ffffff0000ULL) |
- ((((__u64)stk16_off) << 32) & 0xff00000000000000ULL) |
- (CPU_16BIT_STACK_SIZE - 1);
-
- cpu_gdt_descr->size = GDT_SIZE - 1;
- cpu_gdt_descr->address = (unsigned long)gdt;
-
- load_gdt(cpu_gdt_descr);
load_idt(&idt_descr);

/*
Index: ak-fresh/arch/i386/kernel/smpboot.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/smpboot.c 2006-09-25 14:15:32.000000000 +1000
+++ ak-fresh/arch/i386/kernel/smpboot.c 2006-09-25 14:38:46.000000000 +1000
@@ -102,6 +102,9 @@
{ [0 ... NR_CPUS-1] = 0xff };
EXPORT_SYMBOL(x86_cpu_to_apicid);

+DEFINE_PER_CPU(unsigned long, this_cpu_off);
+EXPORT_PER_CPU_SYMBOL(this_cpu_off);
+
/*
* Trampoline 80x86 program as an array.
*/
@@ -916,6 +919,31 @@
#define alloc_idle_task(cpu) fork_idle(cpu)
#endif

+static __cpuinit void setup_percpu_descriptor(struct desc_struct *gdt,
+ unsigned long per_cpu_off)
+{
+ unsigned limit, flags;
+
+ limit = (1 << 20);
+ flags = 0x8; /* 4k granularity */
+
+ /* present read-write data segment */
+ pack_descriptor((u32 *)&gdt->a, (u32 *)&gdt->b,
+ per_cpu_off, limit - 1,
+ 0x80 | DESCTYPE_S | 0x2, flags);
+}
+
+static void __init setup_percpu(unsigned int cpu)
+{
+ struct desc_struct *gdt = per_cpu(cpu_gdt_table, cpu);
+ struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
+
+ per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
+ setup_percpu_descriptor(&gdt[GDT_ENTRY_PERCPU], __per_cpu_offset[cpu]);
+ cpu_gdt_descr->address = (unsigned long)gdt;
+ cpu_gdt_descr->size = GDT_SIZE - 1;
+}
+
static int __devinit do_boot_cpu(int apicid, int cpu)
/*
* NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
@@ -940,6 +968,10 @@
if (IS_ERR(idle))
panic("failed fork for CPU %d", cpu);
idle->thread.eip = (unsigned long) start_secondary;
+
+ setup_percpu(cpu);
+ booting_cpu_gdt_desc_ptr = &per_cpu(cpu_gdt_descr, cpu);
+
/* start_eip had better be page-aligned! */
start_eip = setup_trampoline();

@@ -1303,6 +1335,14 @@
synchronize_tsc_bp();
}

+static inline void set_kernel_gs(void)
+{
+ /* Set %gs for this CPU's per-cpu area. Memory clobber is to create a
+ barrier with respect to any per-cpu operations, so the compiler
+ doesn't move any before here. */
+ asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PERCPU) : "memory");
+}
+
/* These are wrappers to interface to the new boot process. Someone
who understands all this stuff should rewrite it properly. --RR 15/Jul/02 */
void __init smp_prepare_cpus(unsigned int max_cpus)
@@ -1315,6 +1355,18 @@

void __devinit smp_prepare_boot_cpu(void)
{
+ /* Move over to our per-cpu area now it's allocated. */
+ unsigned int cpu = smp_processor_id();
+ struct desc_struct *gdt = per_cpu(cpu_gdt_table, cpu);
+ struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
+
+ per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
+ setup_percpu_descriptor(&gdt[GDT_ENTRY_PERCPU], __per_cpu_offset[cpu]);
+ cpu_gdt_descr->address = (unsigned long)gdt;
+ cpu_gdt_descr->size = GDT_SIZE - 1;
+ load_gdt(cpu_gdt_descr);
+ set_kernel_gs();
+
cpu_set(smp_processor_id(), cpu_online_map);
cpu_set(smp_processor_id(), cpu_callout_map);
cpu_set(smp_processor_id(), cpu_present_map);
Index: ak-fresh/include/asm-i386/percpu.h
===================================================================
--- ak-fresh.orig/include/asm-i386/percpu.h 2006-09-25 14:15:32.000000000 +1000
+++ ak-fresh/include/asm-i386/percpu.h 2006-09-25 14:21:40.000000000 +1000
@@ -1,6 +1,107 @@
#ifndef __ARCH_I386_PERCPU__
#define __ARCH_I386_PERCPU__

+#ifdef CONFIG_SMP
+/* Same as generic implementation except for optimized local access. */
+#define __GENERIC_PER_CPU
+
+/* This is used for other cpus to find our section. */
+extern unsigned long __per_cpu_offset[NR_CPUS];
+
+/* Separate out the type, so (int[3], foo) works. */
+#define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
+#define DEFINE_PER_CPU(type, name) \
+ __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
+
+/* We can use this directly for local CPU (faster). */
+DECLARE_PER_CPU(unsigned long, this_cpu_off);
+
+/* var is in discarded region: offset to particular copy we want */
+#define per_cpu(var, cpu) (*({ \
+ extern int simple_indentifier_##var(void); \
+ RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]); }))
+
+#define __raw_get_cpu_var(var) (*({ \
+ extern int simple_indentifier_##var(void); \
+ RELOC_HIDE(&per_cpu__##var, x86_read_percpu(this_cpu_off)); \
+}))
+
+#define __get_cpu_var(var) __raw_get_cpu_var(var)
+
+/* A macro to avoid #include hell... */
+#define percpu_modcopy(pcpudst, src, size) \
+do { \
+ unsigned int __i; \
+ for_each_possible_cpu(__i) \
+ memcpy((pcpudst)+__per_cpu_offset[__i], \
+ (src), (size)); \
+} while (0)
+
+#define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)
+#define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var)
+
+/* gs segment starts at (positive) offset == __per_cpu_offset[cpu] */
+#define __percpu_seg "%%gs:"
+#else /* !SMP */
#include <asm-generic/percpu.h>
+#define __percpu_seg ""
+#endif /* SMP */
+
+/* For arch-specific code, we can use direct single-insn ops (they
+ * don't give an lvalue though). */
+extern void __bad_percpu_size(void);
+
+#define percpu_to_op(op,var,val) \
+ do { \
+ typedef typeof(var) T__; \
+ if (0) { T__ tmp__; tmp__ = (val); } \
+ switch (sizeof(var)) { \
+ case 1: \
+ asm(op "b %1,"__percpu_seg"%0" \
+ : "+m" (var) \
+ :"ri" ((T__)val)); \
+ break; \
+ case 2: \
+ asm(op "w %1,"__percpu_seg"%0" \
+ : "+m" (var) \
+ :"ri" ((T__)val)); \
+ break; \
+ case 4: \
+ asm(op "l %1,"__percpu_seg"%0" \
+ : "+m" (var) \
+ :"ri" ((T__)val)); \
+ break; \
+ default: __bad_percpu_size(); \
+ } \
+ } while (0)
+
+#define percpu_from_op(op,var) \
+ ({ \
+ typeof(var) ret__; \
+ switch (sizeof(var)) { \
+ case 1: \
+ asm(op "b "__percpu_seg"%1,%0" \
+ : "=r" (ret__) \
+ : "m" (var)); \
+ break; \
+ case 2: \
+ asm(op "w "__percpu_seg"%1,%0" \
+ : "=r" (ret__) \
+ : "m" (var)); \
+ break; \
+ case 4: \
+ asm(op "l "__percpu_seg"%1,%0" \
+ : "=r" (ret__) \
+ : "m" (var)); \
+ break; \
+ default: __bad_percpu_size(); \
+ } \
+ ret__; })
+
+#define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var)
+#define x86_write_percpu(var,val) percpu_to_op("mov", per_cpu__##var, val)
+#define x86_add_percpu(var,val) percpu_to_op("add", per_cpu__##var, val)
+#define x86_sub_percpu(var,val) percpu_to_op("sub", per_cpu__##var, val)
+#define x86_or_percpu(var,val) percpu_to_op("or", per_cpu__##var, val)

#endif /* __ARCH_I386_PERCPU__ */
Index: ak-fresh/include/asm-i386/smp.h
===================================================================
--- ak-fresh.orig/include/asm-i386/smp.h 2006-09-25 14:15:32.000000000 +1000
+++ ak-fresh/include/asm-i386/smp.h 2006-09-25 14:21:40.000000000 +1000
@@ -86,6 +86,8 @@
extern void __cpu_die(unsigned int cpu);
extern unsigned int num_processors;

+void setup_percpu_for_this_cpu(unsigned int cpu);
+
#endif /* !__ASSEMBLY__ */

#else /* CONFIG_SMP */
@@ -94,6 +96,8 @@

#define NO_PROC_ID 0xFF /* No processor magic marker */

+#define setup_percpu_for_this_cpu(cpu)
+
#endif

#ifndef __ASSEMBLY__
Index: ak-fresh/arch/i386/kernel/setup.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/setup.c 2006-09-25 14:15:32.000000000 +1000
+++ ak-fresh/arch/i386/kernel/setup.c 2006-09-25 14:21:40.000000000 +1000
@@ -1474,6 +1474,8 @@

/* ESPFIX 16-bit SS */
[GDT_ENTRY_ESPFIX_SS] = { 0x00000000, 0x00009200 },
+ /* FIXME: We save/restore %gs even on UP: fix entry.S. */
+ [GDT_ENTRY_PERCPU] = { 0x0000ffff, 0x00cf9200 },
};

/* Early in boot we use the master per-cpu gdt_table directly. */
@@ -1481,6 +1483,8 @@
= { .size = GDT_ENTRIES*8-1, .address = (long)&per_cpu__cpu_gdt_table };
EXPORT_PER_CPU_SYMBOL(cpu_gdt_descr);

+struct Xgt_desc_struct *booting_cpu_gdt_desc_ptr = &per_cpu__cpu_gdt_descr;
+
static __init int add_pcspkr(void)
{
struct platform_device *pd;
Index: ak-fresh/arch/i386/kernel/head.S
===================================================================
--- ak-fresh.orig/arch/i386/kernel/head.S 2006-09-25 14:13:34.000000000 +1000
+++ ak-fresh/arch/i386/kernel/head.S 2006-09-25 14:21:40.000000000 +1000
@@ -302,7 +302,8 @@
movl %eax,%cr0

call check_x87
- lgdt per_cpu__cpu_gdt_descr
+ movl booting_cpu_gdt_desc_ptr, %eax
+ lgdt (%eax)
lidt idt_descr
ljmp $(__KERNEL_CS),$1f
1: movl $(__KERNEL_DS),%eax # reload all the segment registers
@@ -311,10 +312,11 @@
movl $(__USER_DS),%eax # DS/ES contains default USER segment
movl %eax,%ds
movl %eax,%es
+ movl $(__KERNEL_PERCPU),%eax
+ mov %eax,%gs

- xorl %eax,%eax # Clear FS/GS and LDT
+ xorl %eax,%eax # Clear FS and LDT
movl %eax,%fs
- movl %eax,%gs
lldt %ax
cld # gcc2 wants the direction flag cleared at all times
pushl %eax # fake return address

--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-25 05:25:52

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

Rusty Russell wrote:
> That can't happen, since 0xc100000 is not in the kernel address space.
> 0xc1000000 is though, perhaps that's what you meant?
>

Yes, it is. Though it doesn't actually make any material difference to
my argument.

>> So, in this case the %gs base will be loaded with 0xc100000-0xc0431100 =
>> 0x4bccef00
>>
>
>
> A negative offset, exactly, which can't happen, as I said.
0x4bccef00 is positive. The correct number is 0xc1000000-0xc0431100 =
0xbcef00

The %gs:per_cpu__foo addressing mode still calculates
0xbcef00+0xc0433800, which is still a subtraction. My essential point
is that *all* kernel addresses (=kernel symbols) are negative, so using
them as an offset from a segment base (any segment base) is a
subtraction, which requires a 4G limit.

J

2006-09-25 05:27:48

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 6/7] (Optional) implement smp_processor_id() as a per-cpu var

On Fri, 2006-09-22 at 22:00 +1000, Rusty Russell wrote:
> This implements smp_processor_id() as a per-cpu variable.

Updated for revised 5/7: we no longer need to avoid using
smp_processor_id() in cpu_init.

This implements smp_processor_id() as a per-cpu variable. The generic
code expects it in thread_info still, so we can't remove it from
there, but reducing smp_processor_id() from 9 bytes/3 insns to 6
bytes/1 insn is a nice micro-optimization.
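
By analogy with the __raw_get_cpu_var() disassembly posted earlier in
this thread, the before/after is roughly (illustrative, not from an
actual build):

	before (current_thread_info()->cpu):
		mov  %esp,%eax
		and  $0xffffe000,%eax
		mov  <cpu offset>(%eax),%eax

	after:
		mov  %gs:per_cpu__processor_id,%eax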

Signed-off-by: Rusty Russell <[email protected]>

Index: ak-fresh/arch/i386/kernel/smpboot.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/smpboot.c 2006-09-25 14:49:05.000000000 +1000
+++ ak-fresh/arch/i386/kernel/smpboot.c 2006-09-25 15:02:51.000000000 +1000
@@ -105,6 +105,9 @@
DEFINE_PER_CPU(unsigned long, this_cpu_off);
EXPORT_PER_CPU_SYMBOL(this_cpu_off);

+DEFINE_PER_CPU(u32, processor_id);
+EXPORT_PER_CPU_SYMBOL(processor_id);
+
/*
* Trampoline 80x86 program as an array.
*/
@@ -939,6 +942,7 @@
struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);

per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
+ per_cpu(processor_id, cpu) = cpu;
setup_percpu_descriptor(&gdt[GDT_ENTRY_PERCPU], __per_cpu_offset[cpu]);
cpu_gdt_descr->address = (unsigned long)gdt;
cpu_gdt_descr->size = GDT_SIZE - 1;
@@ -1361,6 +1365,7 @@
struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);

per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
+ per_cpu(processor_id, cpu) = cpu;
setup_percpu_descriptor(&gdt[GDT_ENTRY_PERCPU], __per_cpu_offset[cpu]);
cpu_gdt_descr->address = (unsigned long)gdt;
cpu_gdt_descr->size = GDT_SIZE - 1;
Index: ak-fresh/include/asm-i386/smp.h
===================================================================
--- ak-fresh.orig/include/asm-i386/smp.h 2006-09-25 14:49:05.000000000 +1000
+++ ak-fresh/include/asm-i386/smp.h 2006-09-25 14:52:05.000000000 +1000
@@ -8,6 +8,7 @@
#include <linux/kernel.h>
#include <linux/threads.h>
#include <linux/cpumask.h>
+#include <asm/percpu.h>
#endif

#ifdef CONFIG_X86_LOCAL_APIC
@@ -56,7 +57,8 @@
* from the initial startup. We map APIC_BASE very early in page_setup(),
* so this is correct in the x86 case.
*/
-#define raw_smp_processor_id() (current_thread_info()->cpu)
+DECLARE_PER_CPU(u32, processor_id);
+#define raw_smp_processor_id() x86_read_percpu(processor_id)

extern cpumask_t cpu_callout_map;
extern cpumask_t cpu_callin_map;

--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-25 05:29:09

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 7/7] (Optional) implement current as a per-cpu var

On Fri, 2006-09-22 at 22:01 +1000, Rusty Russell wrote:
> This implements current as a per-cpu variable.

Again, revised for updated 5/7: we no longer need to avoid using current
early on.

This implements current as a per-cpu variable. The generic code
expects it in thread_info still, so I don't remove it from there, but
reducing current from 9 bytes/3 insns to 6 bytes/1 insn is a nice
micro-optimization.

Signed-off-by: Rusty Russell <[email protected]>

Index: ak-fresh/arch/i386/kernel/setup.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/setup.c 2006-09-25 15:02:33.000000000 +1000
+++ ak-fresh/arch/i386/kernel/setup.c 2006-09-25 15:06:31.000000000 +1000
@@ -148,6 +148,9 @@

unsigned char __initdata boot_params[PARAM_SIZE];

+DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
+EXPORT_PER_CPU_SYMBOL(current_task);
+
static struct resource data_resource = {
.name = "Kernel data",
.start = 0,
Index: ak-fresh/arch/i386/kernel/smpboot.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/smpboot.c 2006-09-25 15:02:51.000000000 +1000
+++ ak-fresh/arch/i386/kernel/smpboot.c 2006-09-25 15:06:31.000000000 +1000
@@ -972,6 +972,7 @@
if (IS_ERR(idle))
panic("failed fork for CPU %d", cpu);
idle->thread.eip = (unsigned long) start_secondary;
+ per_cpu(current_task, cpu) = idle;

setup_percpu(cpu);
booting_cpu_gdt_desc_ptr = &per_cpu(cpu_gdt_descr, cpu);
Index: ak-fresh/include/asm-i386/current.h
===================================================================
--- ak-fresh.orig/include/asm-i386/current.h 2006-09-25 15:02:33.000000000 +1000
+++ ak-fresh/include/asm-i386/current.h 2006-09-25 15:06:31.000000000 +1000
@@ -1,15 +1,10 @@
#ifndef _I386_CURRENT_H
#define _I386_CURRENT_H

+#include <asm/percpu.h>
#include <linux/thread_info.h>

-struct task_struct;
-
-static __always_inline struct task_struct * get_current(void)
-{
- return current_thread_info()->task;
-}
-
-#define current get_current()
+DECLARE_PER_CPU(struct task_struct *, current_task);
+#define current x86_read_percpu(current_task)

#endif /* !(_I386_CURRENT_H) */
Index: ak-fresh/arch/i386/kernel/process.c
===================================================================
--- ak-fresh.orig/arch/i386/kernel/process.c 2006-09-25 15:02:33.000000000 +1000
+++ ak-fresh/arch/i386/kernel/process.c 2006-09-25 15:06:31.000000000 +1000
@@ -669,6 +669,7 @@
if (unlikely(prev->fs | next->fs))
loadsegment(fs, next->fs);

+ x86_write_percpu(current_task, next_p);

/*
* Restore IOPL if needed.

--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-25 06:03:56

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Sun, 2006-09-24 at 22:25 -0700, Jeremy Fitzhardinge wrote:
> The %gs:per_cpu__foo addressing mode still calculates
> 0xbcef00+0xc0433800, which is still a subtraction. My essential point
> is that *all* kernel addresses (=kernel symbols) are negative, so using
> them as an offset from a segment base (any segment base) is a
> subtraction, which requires a 4G limit.

I don't think so. There's *never* address subtraction, there's
sometimes 32 bit wrap (glibc uses this to effect subtraction, sure).
But there's no wrap here.

To test, I changed the following:

--- smpboot.c.~8~ 2006-09-25 15:51:50.000000000 +1000
+++ smpboot.c 2006-09-25 16:00:36.000000000 +1000
@@ -926,8 +926,9 @@
unsigned long per_cpu_off)
{
unsigned limit, flags;
+ extern char __per_cpu_end[];

- limit = (1 << 20);
+ limit = PAGE_ALIGN((long)__per_cpu_end) >> PAGE_SHIFT;
flags = 0x8; /* 4k granularity */

/* present read-write data segment */


Works fine...

Hope that clarifies!
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law

2006-09-25 06:25:35

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

Rusty Russell wrote:
> I don't think so. There's *never* address subtraction, there's
> sometimes 32 bit wrap (glibc uses this to effect subtraction, sure).
> But there's no wrap here.
>
Hm, I guess, so long as you assume the kernel data segment is always
below the kernel heap.

> To test, I changed the following:
>
> --- smpboot.c.~8~ 2006-09-25 15:51:50.000000000 +1000
> +++ smpboot.c 2006-09-25 16:00:36.000000000 +1000
> @@ -926,8 +926,9 @@
> unsigned long per_cpu_off)
> {
> unsigned limit, flags;
> + extern char __per_cpu_end[];
>
> - limit = (1 << 20);
> + limit = PAGE_ALIGN((long)__per_cpu_end) >> PAGE_SHIFT;
>
limit is a size, rather than the end address, so this isn't quite right.

J

2006-09-25 23:33:19

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/7] Use %gs for per-cpu sections in kernel

On Sun, 2006-09-24 at 23:25 -0700, Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> > I don't think so. There's *never* address subtraction, there's
> > sometimes 32 bit wrap (glibc uses this to effect subtraction, sure).
> > But there's no wrap here.
> >
> Hm, I guess, so long as you assume the kernel data segment is always
> below the kernel heap.

Agreed, we should BUG_ON() in case anyone ever changes this... I will
create a patch for this...

> > To test, I changed the following:
> >
> > --- smpboot.c.~8~ 2006-09-25 15:51:50.000000000 +1000
> > +++ smpboot.c 2006-09-25 16:00:36.000000000 +1000
> > @@ -926,8 +926,9 @@
> > unsigned long per_cpu_off)
> > {
> > unsigned limit, flags;
> > + extern char __per_cpu_end[];
> >
> > - limit = (1 << 20);
> > + limit = PAGE_ALIGN((long)__per_cpu_end) >> PAGE_SHIFT;
> >
> limit is a size, rather than the end address, so this isn't quite right.

I think it's OK. For every "%gs:var", var will be less than
__per_cpu_end.
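
Using the example addresses from earlier in the thread: __per_cpu_end
lies somewhere past per_cpu__cpu_tlbstate (0xc0433880 in Jeremy's nm
output), so PAGE_ALIGN(__per_cpu_end) covers segment offsets up to
roughly 0xc0434000, which is above every per_cpu__* symbol (e.g.
per_cpu__cpu_gdt_descr at 0xc0433800). So no %gs:var access exceeds
the limit, and nothing relies on segment wrap.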

Thanks!
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law