2006-11-13 16:58:13

by Vivek Goyal

Subject: [RFC] [PATCH 0/16] x86_64: Relocatable bzImage Support (V2)

Hi All,

Eric Biederman implemented relocatable bzImage support for x86_64 and
posted the patches for comments quite some time back.

http://marc.theaimsgroup.com/?l=linux-kernel&m=115443019026302&w=2

We have been testing the patches in RHEL kernels since then and things are
looking up. I think it is time for the patches to be included in -mm so
they can get wider testing and the rest of the issues can be sorted out.

Eric is currently held up with other things, so I have taken his patches
and forward ported them to 2.6.19-rc5-git2. I did a few cleanups and fixed
a few bugs we hit in our testing. I have also accommodated the review
comments received last time.

These changes make the bzImage and vmlinux relocatable, hence the kernel
can be loaded at, and run from, a non-1MB location. This is especially
useful for kdump, where a single kernel can then be used both as the
production kernel and as the dump capture kernel, making life easier for
distros and developers alike.
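
To illustrate the kdump flow this enables (a sketch, not part of the
patches; the exact flags depend on your setup): boot the production
kernel with a crashkernel=64M@16M reservation on its command line, then
load the very same bzImage as the capture kernel:

    kexec -p /boot/bzImage --initrd=/boot/initrd.img \
          --append="root=/dev/sda1 irqpoll maxcpus=1"

On a panic the machine then boots into the same image, relocated into
the reserved region, instead of into a separately built dump kernel.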

Following is a brief account of the changes I have made since the patches
were last posted.

- Forward ported the changes to the latest kernel.
- Extended the bzImage protocol to handle relocatable kernels.
- Dropped support for the ELF bzImage.
- Dropped support for serial debugging in the decompressor code.
- Fixed a bug related to memory hotplug.
- Fixed a bug related to setting the NX bit (thanks to Larry Woodman).
- Fixed a bug by explicitly jumping to secondary_startup_64 instead of
  assuming that .align fills the empty space with "nop".
- Fixed a bug regarding phys_base being put in the initdata section.
- Fixed a bug where the bss was not being zeroed properly.
- Reverted a change so that the kernel continues to be compiled for 2MB,
  so that loaders loading vmlinux directly are not broken.
- Aligned the data segment to a 4K boundary to make kexec using vmlinux work.
- Tested the patches to make sure suspend/resume to/from memory is working
  (required due to the ACPI wakeup code changes).

Your comments/suggestions are welcome.

Thanks
Vivek


2006-11-13 16:54:46

by Vivek Goyal

Subject: [RFC] [PATCH 6/16] x86_64: Modify copy_bootdata to use virtual addresses



Use virtual addresses instead of physical addresses
in copy_bootdata. In addition, fix the implementation
of the old bootloader command line convention: everything
is always located relative to real_mode_data; it is just
that real_mode_data is sometimes relocated by setup.S so
that it does not sit at 0x90000.
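
For illustration, a sketch of what the old-convention lookup amounts to
after this change (offsets as defined below; real_mode_data here has
already been converted to a virtual address):

	/* Old bootloader command line convention, with all offsets
	 * relative to real_mode_data rather than absolute 0x90000:
	 */
	if (*(u16 *)(real_mode_data + 0x20) == 0xA33F)	/* OLD_CL_MAGIC */
		new_data = __pa(real_mode_data) +
			*(u16 *)(real_mode_data + 0x22);	/* OLD_CL_OFFSET */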

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/head64.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)

diff -puN arch/x86_64/kernel/head64.c~x86_64-modify-copy_bootdata-to-use-virtual-addresses arch/x86_64/kernel/head64.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/head64.c~x86_64-modify-copy_bootdata-to-use-virtual-addresses 2006-11-09 22:57:55.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/head64.c 2006-11-09 22:57:55.000000000 -0500
@@ -29,27 +29,26 @@ static void __init clear_bss(void)
}

#define NEW_CL_POINTER 0x228 /* Relative to real mode data */
-#define OLD_CL_MAGIC_ADDR 0x90020
+#define OLD_CL_MAGIC_ADDR 0x20
#define OLD_CL_MAGIC 0xA33F
-#define OLD_CL_BASE_ADDR 0x90000
-#define OLD_CL_OFFSET 0x90022
+#define OLD_CL_OFFSET 0x22

extern char saved_command_line[];

static void __init copy_bootdata(char *real_mode_data)
{
- int new_data;
+ unsigned long new_data;
char * command_line;

memcpy(x86_boot_params, real_mode_data, BOOT_PARAM_SIZE);
- new_data = *(int *) (x86_boot_params + NEW_CL_POINTER);
+ new_data = *(u32 *) (x86_boot_params + NEW_CL_POINTER);
if (!new_data) {
- if (OLD_CL_MAGIC != * (u16 *) OLD_CL_MAGIC_ADDR) {
+ if (OLD_CL_MAGIC != *(u16 *)(real_mode_data + OLD_CL_MAGIC_ADDR)) {
return;
}
- new_data = OLD_CL_BASE_ADDR + * (u16 *) OLD_CL_OFFSET;
+ new_data = __pa(real_mode_data) + *(u16 *)(real_mode_data + OLD_CL_OFFSET);
}
- command_line = (char *) ((u64)(new_data));
+ command_line = __va(new_data);
memcpy(saved_command_line, command_line, COMMAND_LINE_SIZE);
}

@@ -74,7 +73,7 @@ void __init x86_64_start_kernel(char * r
cpu_pda(i) = &boot_cpu_pda[i];

pda_init(0);
- copy_bootdata(real_mode_data);
+ copy_bootdata(__va(real_mode_data));
#ifdef CONFIG_SMP
cpu_set(0, cpu_online_map);
#endif
_

2006-11-13 16:53:42

by Vivek Goyal

Subject: [RFC] [PATCH 7/16] x86_64: cleanup segments



Move __KERNEL32_CS up into the unused gdt entry. __KERNEL32_CS is
used when entering the kernel, so putting it first is useful when
trying to keep the boot gdt size to a minimum.

Set the accessed bit on all gdt entries. We don't care about it,
so there is no need for the cpu to burn the extra cycles setting it,
and it potentially allows the pages to be immutable. Plus
it is confusing when you are debugging and your gdt entries
mysteriously change.
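
For reference, the accessed bit is bit 40 of a descriptor (the low bit
of the type field), so each quad below changes only in its access byte.
A decode of __KERNEL_CS as a sketch:

	/* 0x00af9a000000ffff -> 0x00af9b000000ffff (__KERNEL_CS)
	 *
	 * bits 63-56: base 31-24  = 0x00
	 * bits 55-52: flags       = 0xa (G=1, D=0, L=1)
	 * bits 51-48: limit 19-16 = 0xf
	 * bits 47-40: access byte = 0x9a -> 0x9b (present, DPL 0,
	 *             readable code segment, now marked accessed)
	 * bits 39-16: base 23-0   = 0x000000
	 * bits 15-0 : limit 15-0  = 0xffff
	 */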

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/head.S | 12 ++++++------
include/asm-x86_64/segment.h | 2 +-
2 files changed, 7 insertions(+), 7 deletions(-)

diff -puN arch/x86_64/kernel/head.S~x86_64-cleanup-segments arch/x86_64/kernel/head.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/head.S~x86_64-cleanup-segments 2006-11-09 22:58:28.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/head.S 2006-11-09 22:58:28.000000000 -0500
@@ -354,13 +354,13 @@ gdt:

ENTRY(cpu_gdt_table)
.quad 0x0000000000000000 /* NULL descriptor */
+ .quad 0x00cf9b000000ffff /* __KERNEL32_CS */
+ .quad 0x00af9b000000ffff /* __KERNEL_CS */
+ .quad 0x00cf93000000ffff /* __KERNEL_DS */
+ .quad 0x00cffb000000ffff /* __USER32_CS */
+ .quad 0x00cff3000000ffff /* __USER_DS, __USER32_DS */
+ .quad 0x00affb000000ffff /* __USER_CS */
.quad 0x0 /* unused */
- .quad 0x00af9a000000ffff /* __KERNEL_CS */
- .quad 0x00cf92000000ffff /* __KERNEL_DS */
- .quad 0x00cffa000000ffff /* __USER32_CS */
- .quad 0x00cff2000000ffff /* __USER_DS, __USER32_DS */
- .quad 0x00affa000000ffff /* __USER_CS */
- .quad 0x00cf9a000000ffff /* __KERNEL32_CS */
.quad 0,0 /* TSS */
.quad 0,0 /* LDT */
.quad 0,0,0 /* three TLS descriptors */
diff -puN include/asm-x86_64/segment.h~x86_64-cleanup-segments include/asm-x86_64/segment.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/segment.h~x86_64-cleanup-segments 2006-11-09 22:58:28.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/segment.h 2006-11-09 22:58:28.000000000 -0500
@@ -6,7 +6,7 @@
#define __KERNEL_CS 0x10
#define __KERNEL_DS 0x18

-#define __KERNEL32_CS 0x38
+#define __KERNEL32_CS 0x08

/*
* we cannot use the same code segment descriptor for user and kernel
_

2006-11-13 16:55:28

by Vivek Goyal

Subject: [RFC] [PATCH 15/16] x86_64: Relocatable kernel support



This patch modifies the x86_64 kernel so that it can be loaded and run
at any 2M-aligned address below 512G. The technique used is to
compile the decompressor with -fPIC and modify it so that it
is fully relocatable. For the main kernel the page tables are
modified so the kernel remains at the same virtual address. In
addition a variable phys_base is kept that holds the physical address
the kernel is loaded at. __pa_symbol is modified to add phys_base
whenever we take the physical address of a kernel symbol.

When loaded with a normal bootloader the decompressor will decompress
the kernel to 2M and it will run there. This both ensures the
relocation code is always exercised, and makes it easier to use 2M
pages for the kernel and the cpu.
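
To make the in-place decompression safety margin concrete, here is a
hypothetical C rendering of the relocation-target computation head.S
performs in assembly below (names invented for the sketch):

	/* Where the decompressor copies itself plus the compressed data,
	 * so that decompressing in place at the 2M-aligned load address
	 * can never overrun the compressed input.
	 */
	static unsigned long safe_copy_dest(unsigned long load_addr,
					    unsigned long input_len,
					    unsigned long output_len)
	{
		/* The kernel is decompressed at load_addr rounded up to 2M */
		unsigned long dest = (load_addr + LARGE_PAGE_SIZE - 1)
					& LARGE_PAGE_MASK;

		dest += output_len - input_len;	/* room for the output */
		dest += output_len >> 12;	/* ~8 bytes per 32K block */
		dest += 32768 + 18;		/* worst case block + gzip overhead */
		return (dest + 4095) & ~4095UL;	/* keep it 4K aligned */
	}

The constants match the analysis in the big comment added to misc.c:
extra_bytes = (uncompressed_size >> 12) + 32768 + 18.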

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/boot/compressed/Makefile | 12 -
arch/x86_64/boot/compressed/head.S | 311 ++++++++++++++++++++++----------
arch/x86_64/boot/compressed/misc.c | 251 +++++++++++++------------
arch/x86_64/boot/compressed/vmlinux.lds | 44 ++++
arch/x86_64/boot/compressed/vmlinux.scr | 5
arch/x86_64/kernel/head.S | 221 ++++++++++++----------
include/asm-x86_64/page.h | 6
7 files changed, 532 insertions(+), 318 deletions(-)

diff -puN arch/x86_64/boot/compressed/head.S~x86_64-Relocatable-kernel-support arch/x86_64/boot/compressed/head.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/boot/compressed/head.S~x86_64-Relocatable-kernel-support 2006-11-09 23:06:47.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/boot/compressed/head.S 2006-11-09 23:06:47.000000000 -0500
@@ -26,116 +26,245 @@

#include <linux/linkage.h>
#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
+#include <asm/msr.h>

+.section ".text.head"
.code32
.globl startup_32

startup_32:
cld
cli
- movl $(__KERNEL_DS),%eax
- movl %eax,%ds
- movl %eax,%es
- movl %eax,%fs
- movl %eax,%gs
-
- lss stack_start,%esp
- xorl %eax,%eax
-1: incl %eax # check that A20 really IS enabled
- movl %eax,0x000000 # loop forever if it isn't
- cmpl %eax,0x100000
- je 1b
-
-/*
- * Initialize eflags. Some BIOS's leave bits like NT set. This would
- * confuse the debugger if this code is traced.
- * XXX - best to initialize before switching to protected mode.
+ movl $(__KERNEL_DS), %eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %ss
+
+/* Calculate the delta between where we were compiled to run
+ * at and where we were actually loaded at. This can only be done
+ * with a short local call on x86. Nothing else will tell us what
+ * address we are running at. The reserved chunk of the real-mode
+ * data at 0x34-0x3f is used as the stack for this calculation.
+ * Only 4 bytes are needed.
*/
- pushl $0
- popfl
+ leal 0x40(%esi), %esp
+ call 1f
+1: popl %ebp
+ subl $1b, %ebp
+
+/* Compute the delta between where we were compiled to run at
+ * and where the code will actually run at.
+ */
+ movl %ebp, %ebx
+ addl $(LARGE_PAGE_SIZE -1), %ebx
+ andl $LARGE_PAGE_MASK, %ebx
+
+ /* Replace the compressed data size with the uncompressed size */
+ subl input_len(%ebp), %ebx
+ movl output_len(%ebp), %eax
+ addl %eax, %ebx
+ /* Add 8 bytes for every 32K input block */
+ shrl $12, %eax
+ addl %eax, %ebx
+ /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
+ addl $(32768 + 18 + 4095), %ebx
+ andl $~4095, %ebx
+
+/*
+ * Prepare for entering 64 bit mode
+ */
+
+ /* Load new GDT with the 64bit segments using 32bit descriptor */
+ leal gdt(%ebp), %eax
+ movl %eax, gdt+2(%ebp)
+ lgdt gdt(%ebp)
+
+ /* Enable PAE mode */
+ xorl %eax, %eax
+ orl $(1 << 5), %eax
+ movl %eax, %cr4
+
+/*
+ * Build early 4G boot pagetable
+ */
+ /* Initialize page tables to 0 */
+ leal pgtable(%ebx), %edi
+ xorl %eax, %eax
+ movl $((4096*6)/4), %ecx
+ rep stosl
+
+ /* Build Level 4 */
+ leal pgtable + 0(%ebx), %edi
+ leal 0x1007 (%edi), %eax
+ movl %eax, 0(%edi)
+
+ /* Build Level 3 */
+ leal pgtable + 0x1000(%ebx), %edi
+ leal 0x1007(%edi), %eax
+ movl $4, %ecx
+1: movl %eax, 0x00(%edi)
+ addl $0x00001000, %eax
+ addl $8, %edi
+ decl %ecx
+ jnz 1b
+
+ /* Build Level 2 */
+ leal pgtable + 0x2000(%ebx), %edi
+ movl $0x00000183, %eax
+ movl $2048, %ecx
+1: movl %eax, 0(%edi)
+ addl $0x00200000, %eax
+ addl $8, %edi
+ decl %ecx
+ jnz 1b
+
+ /* Enable the boot page tables */
+ leal pgtable(%ebx), %eax
+ movl %eax, %cr3
+
+ /* Enable Long mode in EFER (Extended Feature Enable Register) */
+ movl $MSR_EFER, %ecx
+ rdmsr
+ btsl $_EFER_LME, %eax
+ wrmsr
+
+ /* Setup for the jump to 64bit mode
+ *
+ * When the jump is performed we will be in long mode but
+ * in 32bit compatibility mode with EFER.LME = 1, CS.L = 0, CS.D = 1
+ * (and in turn EFER.LMA = 1). To jump into 64bit mode we use
+ * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
+ * We place all of the values on our mini stack so lret can
+ * be used to perform that far jump.
+ */
+ pushl $__KERNEL_CS
+ leal startup_64(%ebp), %eax
+ pushl %eax
+
+ /* Enter paged protected Mode, activating Long Mode */
+ movl $0x80000001, %eax /* Enable Paging and Protected mode */
+ movl %eax, %cr0
+
+ /* Jump from 32bit compatibility mode into 64bit mode. */
+ lret
+
+ /* Be careful here: startup_64 needs to be at a predictable
+ * address so I can export it in an ELF header. Bootloaders
+ * should look at the ELF header to find this address, as
+ * it may change in the future.
+ */
+ .code64
+ .org 0x100
+ENTRY(startup_64)
+ /* We come here either from startup_32 or directly from a
+ * 64bit bootloader. If we come here from a bootloader we depend on
+ * an identity mapped page table being provided that maps our
+ * entire text+data+bss and hopefully all of memory.
+ */
+
+ /* Setup data segments. */
+ xorl %eax, %eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %ss
+
+ /* Compute the decompressed kernel start address. It is where
+ * we were loaded, aligned up to a 2M boundary.
+ */
+ leaq startup_32(%rip) /* - $startup_32 */, %rbp
+ addq $(LARGE_PAGE_SIZE - 1), %rbp
+ andq $LARGE_PAGE_MASK, %rbp
+
+/* Compute the delta between where we were compiled to run at
+ * and where the code will actually run at.
+ */
+ /* Start with the delta to where the kernel will run at. */
+ movq %rbp, %rbx
+
+ /* Replace the compressed data size with the uncompressed size */
+ movl input_len(%rip), %eax
+ subq %rax, %rbx
+ movl output_len(%rip), %eax
+ addq %rax, %rbx
+ /* Add 8 bytes for every 32K input block */
+ shrq $12, %rax
+ addq %rax, %rbx
+ /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
+ addq $(32768 + 18 + 4095), %rbx
+ andq $~4095, %rbx
+
+/* Copy the compressed kernel to the end of our buffer
+ * where decompression in place becomes safe.
+ */
+ leaq _end(%rip), %r8
+ leaq _end(%rbx), %r9
+ movq $_end /* - $startup_32 */, %rcx
+1: subq $8, %r8
+ subq $8, %r9
+ movq 0(%r8), %rax
+ movq %rax, 0(%r9)
+ subq $8, %rcx
+ jnz 1b
+
+/*
+ * Jump to the relocated address.
+ */
+ leaq relocated(%rbx), %rax
+ jmp *%rax
+
+.section ".text"
+relocated:
+
/*
* Clear BSS
*/
- xorl %eax,%eax
- movl $_edata,%edi
- movl $_end,%ecx
- subl %edi,%ecx
+ xorq %rax, %rax
+ leaq _edata(%rbx), %rdi
+ leaq _end(%rbx), %rcx
+ subq %rdi, %rcx
cld
rep
stosb
+
+ /* Setup the stack */
+ leaq user_stack_end(%rip), %rsp
+
+ /* zero EFLAGS after setting rsp */
+ pushq $0
+ popfq
+
/*
* Do the decompression, and jump to the new kernel..
*/
- subl $16,%esp # place for structure on the stack
- movl %esp,%eax
- pushl %esi # real mode pointer as second arg
- pushl %eax # address of structure as first arg
- call decompress_kernel
- orl %eax,%eax
- jnz 3f
- addl $8,%esp
- xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $0x200000
-
-/*
- * We come here, if we were loaded high.
- * We need to move the move-in-place routine down to 0x1000
- * and then start it with the buffer addresses in registers,
- * which we got from the stack.
- */
-3:
- movl %esi,%ebx
- movl $move_routine_start,%esi
- movl $0x1000,%edi
- movl $move_routine_end,%ecx
- subl %esi,%ecx
- addl $3,%ecx
- shrl $2,%ecx
- cld
- rep
- movsl
-
- popl %esi # discard the address
- addl $4,%esp # real mode pointer
- popl %esi # low_buffer_start
- popl %ecx # lcount
- popl %edx # high_buffer_start
- popl %eax # hcount
- movl $0x200000,%edi
- cli # make sure we don't get interrupted
- ljmp $(__KERNEL_CS), $0x1000 # and jump to the move routine
-
-/*
- * Routine (template) for moving the decompressed kernel in place,
- * if we were high loaded. This _must_ PIC-code !
- */
-move_routine_start:
- movl %ecx,%ebp
- shrl $2,%ecx
- rep
- movsl
- movl %ebp,%ecx
- andl $3,%ecx
- rep
- movsb
- movl %edx,%esi
- movl %eax,%ecx # NOTE: rep movsb won't move if %ecx == 0
- addl $3,%ecx
- shrl $2,%ecx
- rep
- movsl
- movl %ebx,%esi # Restore setup pointer
- xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $0x200000
-move_routine_end:
+ pushq %rsi # Save the real mode argument
+ movq %rsi, %rdi # real mode address
+ leaq _heap(%rip), %rsi # _heap
+ leaq input_data(%rip), %rdx # input_data
+ movl input_len(%rip), %eax
+ movq %rax, %rcx # input_len
+ movq %rbp, %r8 # output
+ call decompress_kernel
+ popq %rsi

+/*
+ * Jump to the decompressed kernel.
+ */
+ jmp *%rbp

-/* Stack for uncompression */
- .align 32
+ .data
+gdt:
+ .word gdt_end - gdt
+ .long gdt
+ .word 0
+ .quad 0x0000000000000000 /* NULL descriptor */
+ .quad 0x00af9a000000ffff /* __KERNEL_CS */
+ .quad 0x00cf92000000ffff /* __KERNEL_DS */
+gdt_end:
+ .bss
+/* Stack for uncompression */
+ .balign 4
user_stack:
.fill 4096,4,0
-stack_start:
- .long user_stack+4096
- .word __KERNEL_DS
-
+user_stack_end:
diff -puN arch/x86_64/boot/compressed/Makefile~x86_64-Relocatable-kernel-support arch/x86_64/boot/compressed/Makefile
--- linux-2.6.19-rc5-reloc/arch/x86_64/boot/compressed/Makefile~x86_64-Relocatable-kernel-support 2006-11-09 23:06:47.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/boot/compressed/Makefile 2006-11-09 23:06:47.000000000 -0500
@@ -8,16 +8,14 @@

targets := vmlinux vmlinux.bin vmlinux.bin.gz head.o misc.o piggy.o
EXTRA_AFLAGS := -traditional
-AFLAGS := $(subst -m64,-m32,$(AFLAGS))

# cannot use EXTRA_CFLAGS because base CFLAGS contains -mkernel which conflicts with
# -m32
-CFLAGS := -m32 -D__KERNEL__ -Iinclude -O2 -fno-strict-aliasing
-LDFLAGS := -m elf_i386
+CFLAGS := -m64 -D__KERNEL__ -Iinclude -O2 -fno-strict-aliasing -fPIC -mcmodel=small -fno-builtin
+LDFLAGS := -m elf_x86_64

-LDFLAGS_vmlinux := -Ttext $(IMAGE_OFFSET) -e startup_32 -m elf_i386
-
-$(obj)/vmlinux: $(obj)/head.o $(obj)/misc.o $(obj)/piggy.o FORCE
+LDFLAGS_vmlinux := -T
+$(obj)/vmlinux: $(src)/vmlinux.lds $(obj)/head.o $(obj)/misc.o $(obj)/piggy.o FORCE
$(call if_changed,ld)
@:

@@ -27,7 +25,7 @@ $(obj)/vmlinux.bin: vmlinux FORCE
$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin FORCE
$(call if_changed,gzip)

-LDFLAGS_piggy.o := -r --format binary --oformat elf32-i386 -T
+LDFLAGS_piggy.o := -r --format binary --oformat elf64-x86-64 -T

$(obj)/piggy.o: $(obj)/vmlinux.scr $(obj)/vmlinux.bin.gz FORCE
$(call if_changed,ld)
diff -puN arch/x86_64/boot/compressed/misc.c~x86_64-Relocatable-kernel-support arch/x86_64/boot/compressed/misc.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/boot/compressed/misc.c~x86_64-Relocatable-kernel-support 2006-11-09 23:06:47.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/boot/compressed/misc.c 2006-11-09 23:06:47.000000000 -0500
@@ -9,10 +9,95 @@
* High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996
*/

+#define _LINUX_STRING_H_ 1
+#define __LINUX_BITMAP_H 1
+
+#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <asm/io.h>
#include <asm/page.h>

+/* WARNING!!
+ * This code is compiled with -fPIC and it is relocated dynamically
+ * at run time, but no relocation processing is performed.
+ * This means that it is not safe to place pointers in static structures.
+ */
+
+/*
+ * Getting to provable safe in place decompression is hard.
+ * Worst case behaviours need to be analyzed.
+ * Background information:
+ *
+ * The file layout is:
+ * magic[2]
+ * method[1]
+ * flags[1]
+ * timestamp[4]
+ * extraflags[1]
+ * os[1]
+ * compressed data blocks[N]
+ * crc[4] orig_len[4]
+ *
+ * resulting in 18 bytes of non compressed data overhead.
+ *
+ * Files divided into blocks
+ * 1 bit (last block flag)
+ * 2 bits (block type)
+ *
+ * A block occurs every 32K - 1 bytes, or whenever 50% compression has been achieved.
+ * The smallest block type encoding is always used.
+ *
+ * stored:
+ * 32 bits length in bytes.
+ *
+ * fixed:
+ * magic fixed tree.
+ * symbols.
+ *
+ * dynamic:
+ * dynamic tree encoding.
+ * symbols.
+ *
+ *
+ * The buffer for decompression in place is the length of the
+ * uncompressed data, plus a small amount extra to keep the algorithm safe.
+ * The compressed data is placed at the end of the buffer. The output
+ * pointer is placed at the start of the buffer and the input pointer
+ * is placed where the compressed data starts. Problems will occur
+ * when the output pointer overruns the input pointer.
+ *
+ * The output pointer can only overrun the input pointer if the input
+ * pointer is moving faster than the output pointer. A condition only
+ * triggered by data whose compressed form is larger than the uncompressed
+ * form.
+ *
+ * The worst case at the block level is a growth of the compressed data
+ * of 5 bytes per 32767 bytes.
+ *
+ * The worst case internal to a compressed block is very hard to figure.
+ * The worst case can at least be bounded by having one bit that represents
+ * 32764 bytes and then all of the rest of the bytes representing the very
+ * very last byte.
+ *
+ * All of which is enough to compute an amount of extra data that is required
+ * to be safe. To avoid problems at the block level allocating 5 extra bytes
+ * per 32767 bytes of data is sufficient. To avoid problems internal to a block
+ * adding an extra 32767 bytes (the worst case uncompressed block size) is
+ * sufficient, to ensure that in the worst case the decompressed data for a
+ * block will stop one byte before the compressed data for that block begins.
+ * To avoid problems with the compressed data's meta information an extra 18
+ * bytes are needed. Leading to the formula:
+ *
+ * extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size.
+ *
+ * Adding 8 bytes per 32K is a bit excessive but much easier to calculate.
+ * Adding 32768 instead of 32767 just makes for round numbers.
+ * Adding the decompressor_size is necessary as it must live after all
+ * of the data as well. Last I measured the decompressor is about 14K:
+ * 10K of actual data and 4K of bss.
+ *
+ */
+
/*
* gzip declarations
*/
@@ -28,15 +113,20 @@ typedef unsigned char uch;
typedef unsigned short ush;
typedef unsigned long ulg;

-#define WSIZE 0x8000 /* Window size must be at least 32k, */
- /* and a power of two */
-
-static uch *inbuf; /* input buffer */
-static uch window[WSIZE]; /* Sliding window buffer */
-
-static unsigned insize = 0; /* valid bytes in inbuf */
-static unsigned inptr = 0; /* index of next byte to be processed in inbuf */
-static unsigned outcnt = 0; /* bytes in output buffer */
+#define WSIZE 0x80000000 /* Window size must be at least 32k,
+ * and a power of two
+ * We don't actually have a window just
+ * a huge output buffer so I report
+ * a 2G window size, as that should
+ * always be larger than our output buffer.
+ */
+
+static uch *inbuf; /* input buffer */
+static uch *window; /* Sliding window buffer, (and final output buffer) */
+
+static unsigned insize; /* valid bytes in inbuf */
+static unsigned inptr; /* index of next byte to be processed in inbuf */
+static unsigned outcnt; /* bytes in output buffer */

/* gzip flag byte */
#define ASCII_FLAG 0x01 /* bit 0 set: file probably ASCII text */
@@ -87,8 +177,6 @@ extern unsigned char input_data[];
extern int input_len;

static long bytes_out = 0;
-static uch *output_data;
-static unsigned long output_ptr = 0;

static void *malloc(int size);
static void free(void *where);
@@ -98,17 +186,10 @@ static void *memcpy(void *dest, const vo

static void putstr(const char *);

-extern int end;
-static long free_mem_ptr = (long)&end;
+static long free_mem_ptr;
static long free_mem_end_ptr;

-#define INPLACE_MOVE_ROUTINE 0x1000
-#define LOW_BUFFER_START 0x2000
-#define LOW_BUFFER_MAX 0x90000
-#define HEAP_SIZE 0x3000
-static unsigned int low_buffer_end, low_buffer_size;
-static int high_loaded =0;
-static uch *high_buffer_start /* = (uch *)(((ulg)&end) + HEAP_SIZE)*/;
+#define HEAP_SIZE 0x6000

static char *vidmem = (char *)0xb8000;
static int vidport;
@@ -218,58 +299,31 @@ static void* memcpy(void* dest, const vo
*/
static int fill_inbuf(void)
{
- if (insize != 0) {
- error("ran out of input data");
- }
-
- inbuf = input_data;
- insize = input_len;
- inptr = 1;
- return inbuf[0];
+ error("ran out of input data");
+ return 0;
}

/* ===========================================================================
* Write the output window window[0..outcnt-1] and update crc and bytes_out.
* (Used for the decompressed data only.)
*/
-static void flush_window_low(void)
-{
- ulg c = crc; /* temporary variable */
- unsigned n;
- uch *in, *out, ch;
-
- in = window;
- out = &output_data[output_ptr];
- for (n = 0; n < outcnt; n++) {
- ch = *out++ = *in++;
- c = crc_32_tab[((int)c ^ ch) & 0xff] ^ (c >> 8);
- }
- crc = c;
- bytes_out += (ulg)outcnt;
- output_ptr += (ulg)outcnt;
- outcnt = 0;
-}
-
-static void flush_window_high(void)
-{
- ulg c = crc; /* temporary variable */
- unsigned n;
- uch *in, ch;
- in = window;
- for (n = 0; n < outcnt; n++) {
- ch = *output_data++ = *in++;
- if ((ulg)output_data == low_buffer_end) output_data=high_buffer_start;
- c = crc_32_tab[((int)c ^ ch) & 0xff] ^ (c >> 8);
- }
- crc = c;
- bytes_out += (ulg)outcnt;
- outcnt = 0;
-}
-
static void flush_window(void)
{
- if (high_loaded) flush_window_high();
- else flush_window_low();
+ /* With my window equal to my output buffer
+ * I only need to compute the crc here.
+ */
+ ulg c = crc; /* temporary variable */
+ unsigned n;
+ uch *in, ch;
+
+ in = window;
+ for (n = 0; n < outcnt; n++) {
+ ch = *in++;
+ c = crc_32_tab[((int)c ^ ch) & 0xff] ^ (c >> 8);
+ }
+ crc = c;
+ bytes_out += (ulg)outcnt;
+ outcnt = 0;
}

static void error(char *x)
@@ -281,57 +335,8 @@ static void error(char *x)
while(1); /* Halt */
}

-static void setup_normal_output_buffer(void)
-{
-#ifdef STANDARD_MEMORY_BIOS_CALL
- if (RM_EXT_MEM_K < 1024) error("Less than 2MB of memory");
-#else
- if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < 1024) error("Less than 2MB of memory");
-#endif
- output_data = (unsigned char *)0x200000;
- free_mem_end_ptr = (long)real_mode;
-}
-
-struct moveparams {
- uch *low_buffer_start; int lcount;
- uch *high_buffer_start; int hcount;
-};
-
-static void setup_output_buffer_if_we_run_high(struct moveparams *mv)
-{
- high_buffer_start = (uch *)(((ulg)&end) + HEAP_SIZE);
-#ifdef STANDARD_MEMORY_BIOS_CALL
- if (RM_EXT_MEM_K < (3*1024)) error("Less than 4MB of memory");
-#else
- if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < (3*1024)) error("Less than 4MB of memory");
-#endif
- mv->low_buffer_start = output_data = (unsigned char *)LOW_BUFFER_START;
- low_buffer_end = ((unsigned int)real_mode > LOW_BUFFER_MAX
- ? LOW_BUFFER_MAX : (unsigned int)real_mode) & ~0xfff;
- low_buffer_size = low_buffer_end - LOW_BUFFER_START;
- high_loaded = 1;
- free_mem_end_ptr = (long)high_buffer_start;
- if ( (0x200000 + low_buffer_size) > ((ulg)high_buffer_start)) {
- high_buffer_start = (uch *)(0x200000 + low_buffer_size);
- mv->hcount = 0; /* say: we need not to move high_buffer */
- }
- else mv->hcount = -1;
- mv->high_buffer_start = high_buffer_start;
-}
-
-static void close_output_buffer_if_we_run_high(struct moveparams *mv)
-{
- if (bytes_out > low_buffer_size) {
- mv->lcount = low_buffer_size;
- if (mv->hcount)
- mv->hcount = bytes_out - low_buffer_size;
- } else {
- mv->lcount = bytes_out;
- mv->hcount = 0;
- }
-}
-
-int decompress_kernel(struct moveparams *mv, void *rmode)
+asmlinkage void decompress_kernel(void *rmode, unsigned long heap,
+ uch *input_data, unsigned long input_len, uch *output)
{
real_mode = rmode;

@@ -346,13 +351,21 @@ int decompress_kernel(struct moveparams
lines = RM_SCREEN_INFO.orig_video_lines;
cols = RM_SCREEN_INFO.orig_video_cols;

- if (free_mem_ptr < 0x100000) setup_normal_output_buffer();
- else setup_output_buffer_if_we_run_high(mv);
+ window = output; /* Output buffer (Normally at 1M) */
+ free_mem_ptr = heap; /* Heap */
+ free_mem_end_ptr = heap + HEAP_SIZE;
+ inbuf = input_data; /* Input buffer */
+ insize = input_len;
+ inptr = 0;
+
+ if ((ulg)output & 0x1fffffUL)
+ error("Destination address not 2M aligned");
+ if ((ulg)output >= 0xffffffffffUL)
+ error("Destination address too large");

makecrc();
putstr(".\nDecompressing Linux...");
gunzip();
putstr("done.\nBooting the kernel.\n");
- if (high_loaded) close_output_buffer_if_we_run_high(mv);
- return high_loaded;
+ return;
}
diff -puN /dev/null arch/x86_64/boot/compressed/vmlinux.lds
--- /dev/null 2006-11-09 22:37:03.200734626 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/boot/compressed/vmlinux.lds 2006-11-09 23:06:47.000000000 -0500
@@ -0,0 +1,44 @@
+OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
+OUTPUT_ARCH(i386:x86-64)
+ENTRY(startup_64)
+SECTIONS
+{
+ /* Be careful: parts of head.S assume startup_32 is at
+ * address 0.
+ */
+ . = 0;
+ .text : {
+ _head = . ;
+ *(.text.head)
+ _ehead = . ;
+ *(.text.compressed)
+ _text = .; /* Text */
+ *(.text)
+ *(.text.*)
+ _etext = . ;
+ }
+ .rodata : {
+ _rodata = . ;
+ *(.rodata) /* read-only data */
+ *(.rodata.*)
+ _erodata = . ;
+ }
+ .data : {
+ _data = . ;
+ *(.data)
+ *(.data.*)
+ _edata = . ;
+ }
+ .bss : {
+ _bss = . ;
+ *(.bss)
+ *(.bss.*)
+ *(COMMON)
+ . = ALIGN(8);
+ _end = . ;
+ . = ALIGN(4096);
+ pgtable = . ;
+ . = . + 4096 * 6;
+ _heap = .;
+ }
+}
diff -puN arch/x86_64/boot/compressed/vmlinux.scr~x86_64-Relocatable-kernel-support arch/x86_64/boot/compressed/vmlinux.scr
--- linux-2.6.19-rc5-reloc/arch/x86_64/boot/compressed/vmlinux.scr~x86_64-Relocatable-kernel-support 2006-11-09 23:06:47.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/boot/compressed/vmlinux.scr 2006-11-09 23:06:47.000000000 -0500
@@ -1,9 +1,10 @@
SECTIONS
{
- .data : {
+ .text.compressed : {
input_len = .;
LONG(input_data_end - input_data) input_data = .;
- *(.data)
+ *(.data)
+ output_len = . - 4;
input_data_end = .;
}
}
diff -puN arch/x86_64/kernel/head.S~x86_64-Relocatable-kernel-support arch/x86_64/kernel/head.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/head.S~x86_64-Relocatable-kernel-support 2006-11-09 23:06:47.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/head.S 2006-11-09 23:06:47.000000000 -0500
@@ -5,6 +5,7 @@
* Copyright (C) 2000 Pavel Machek <[email protected]>
* Copyright (C) 2000 Karsten Keil <[email protected]>
* Copyright (C) 2001,2002 Andi Kleen <[email protected]>
+ * Copyright (C) 2005 Eric Biederman <[email protected]>
*/


@@ -19,94 +20,126 @@
#include <asm/cache.h>

/* we are not able to switch in one step to the final KERNEL ADRESS SPACE
- * because we need identity-mapped pages on setup so define __START_KERNEL to
- * 0x100000 for this stage
+ * because we need identity-mapped pages.
*
*/

.text
.section .bootstrap.text
- .code32
- .globl startup_32
-/* %bx: 1 if coming from smp trampoline on secondary cpu */
-startup_32:
-
+ .code64
+ .globl startup_64
+startup_64:
+
/*
- * At this point the CPU runs in 32bit protected mode (CS.D = 1) with
- * paging disabled and the point of this file is to switch to 64bit
- * long mode with a kernel mapping for kerneland to jump into the
- * kernel virtual addresses.
- * There is no stack until we set one up.
+ * At this point the CPU runs in 64bit mode CS.L = 1, CS.D = 0,
+ * and someone has loaded an identity mapped page table
+ * for us. These identity mapped page tables map all of the
+ * kernel pages and possibly all of memory.
+ *
+ * %esi holds a physical pointer to real_mode_data.
+ *
+ * We come here either directly from a 64bit bootloader, or from
+ * arch/x86_64/boot/compressed/head.S.
+ *
+ * We only come here initially at boot; nothing else comes here.
+ *
+ * Since we may be loaded at an address different from what we were
+ * compiled to run at we first fixup the physical addresses in our page
+ * tables and then reload them.
*/

- /* Initialize the %ds segment register */
- movl $__KERNEL_DS,%eax
- movl %eax,%ds
+ /* Compute the delta between the address I am compiled to run at and the
+ * address I am actually running at.
+ */
+ leaq _text(%rip), %rbp
+ subq $_text - __START_KERNEL_map, %rbp

- /* Load new GDT with the 64bit segments using 32bit descriptor */
- lgdt pGDT32 - __START_KERNEL_map
+ /* Is the address not 2M aligned? */
+ movq %rbp, %rax
+ andl $~LARGE_PAGE_MASK, %eax
+ testl %eax, %eax
+ jnz bad_address
+
+ /* Is the address too large? */
+ leaq _text(%rip), %rdx
+ movq $PGDIR_SIZE, %rax
+ cmpq %rax, %rdx
+ jae bad_address

- /* If the CPU doesn't support CPUID this will double fault.
- * Unfortunately it is hard to check for CPUID without a stack.
+ /* Fixup the physical addresses in the page table
*/
-
- /* Check if extended functions are implemented */
- movl $0x80000000, %eax
- cpuid
- cmpl $0x80000000, %eax
- jbe no_long_mode
- /* Check if long mode is implemented */
- mov $0x80000001, %eax
- cpuid
- btl $29, %edx
- jnc no_long_mode
+ addq %rbp, init_level4_pgt + 0(%rip)
+ addq %rbp, init_level4_pgt + (258*8)(%rip)
+ addq %rbp, init_level4_pgt + (511*8)(%rip)
+
+ addq %rbp, level3_ident_pgt + 0(%rip)
+ addq %rbp, level3_kernel_pgt + (510*8)(%rip)
+
+ /* Add an Identity mapping if I am above 1G */
+ leaq _text(%rip), %rdi
+ andq $LARGE_PAGE_MASK, %rdi
+
+ movq %rdi, %rax
+ shrq $PUD_SHIFT, %rax
+ andq $(PTRS_PER_PUD - 1), %rax
+ jz ident_complete
+
+ leaq (level2_spare_pgt - __START_KERNEL_map + _KERNPG_TABLE)(%rbp), %rdx
+ leaq level3_ident_pgt(%rip), %rbx
+ movq %rdx, 0(%rbx, %rax, 8)
+
+ movq %rdi, %rax
+ shrq $PMD_SHIFT, %rax
+ andq $(PTRS_PER_PMD - 1), %rax
+ leaq __PAGE_KERNEL_LARGE_EXEC(%rdi), %rdx
+ leaq level2_spare_pgt(%rip), %rbx
+ movq %rdx, 0(%rbx, %rax, 8)
+ident_complete:

- /*
- * Prepare for entering 64bits mode
+ /* Fixup the kernel text+data virtual addresses
*/
+ leaq level2_kernel_pgt(%rip), %rdi
+ leaq 4096(%rdi), %r8
+ /* See if it is a valid page table entry */
+1: testq $1, 0(%rdi)
+ jz 2f
+ addq %rbp, 0(%rdi)
+ /* Go to the next page */
+2: addq $8, %rdi
+ cmp %r8, %rdi
+ jne 1b

- /* Enable PAE mode */
- xorl %eax, %eax
- btsl $5, %eax
- movl %eax, %cr4
-
- /* Setup early boot stage 4 level pagetables */
- movl $(init_level4_pgt - __START_KERNEL_map), %eax
- movl %eax, %cr3
+ /* Fixup phys_base */
+ addq %rbp, phys_base(%rip)

- /* Setup EFER (Extended Feature Enable Register) */
- movl $MSR_EFER, %ecx
- rdmsr
-
- /* Enable Long Mode */
- btsl $_EFER_LME, %eax
-
- /* Make changes effective */
- wrmsr
+#ifdef CONFIG_SMP
+ addq %rbp, trampoline_level4_pgt + 0(%rip)
+ addq %rbp, trampoline_level4_pgt + (511*8)(%rip)
+#endif
+#ifdef CONFIG_ACPI_SLEEP
+ addq %rbp, wakeup_level4_pgt + 0(%rip)
+ addq %rbp, wakeup_level4_pgt + (511*8)(%rip)
+#endif

- xorl %eax, %eax
- btsl $31, %eax /* Enable paging and in turn activate Long Mode */
- btsl $0, %eax /* Enable protected mode */
- /* Make changes effective */
- movl %eax, %cr0
- /*
- * At this point we're in long mode but in 32bit compatibility mode
- * with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
- * EFER.LMA = 1). Now we want to jump in 64bit mode, to do that we use
- * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
+ /* Due to ENTRY(), sometimes the empty space gets filled with
+ * zeros. Better to take a jmp than to rely on the empty space
+ * being filled with 0x90 (nop)
*/
- ljmp $__KERNEL_CS, $(startup_64 - __START_KERNEL_map)
-
- .code64
- .org 0x100
- .globl startup_64
-startup_64:
+ jmp secondary_startup_64
ENTRY(secondary_startup_64)
- /* We come here either from startup_32
- * or directly from a 64bit bootloader.
- * Since we may have come directly from a bootloader we
- * reload the page tables here.
- */
+ /*
+ * At this point the CPU runs in 64bit mode CS.L = 1, CS.D = 0,
+ * and someone has loaded a mapped page table.
+ *
+ * %esi holds a physical pointer to real_mode_data.
+ *
+ * We come here either from startup_64 (using physical addresses)
+ * or from trampoline.S (using virtual addresses).
+ *
+ * Using virtual addresses from trampoline.S removes the need
+ * to have any identity mapped pages in the kernel page table
+ * after the boot processor executes this code.
+ */

/* Enable PAE mode and PGE */
xorq %rax, %rax
@@ -116,8 +149,14 @@ ENTRY(secondary_startup_64)

/* Setup early boot stage 4 level pagetables. */
movq $(init_level4_pgt - __START_KERNEL_map), %rax
+ addq phys_base(%rip), %rax
movq %rax, %cr3

+ /* Ensure I am executing from virtual addresses */
+ movq $1f, %rax
+ jmp *%rax
+1:
+
/* Check if nx is implemented */
movl $0x80000001, %eax
cpuid
@@ -126,17 +165,11 @@ ENTRY(secondary_startup_64)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
-
- /* Enable System Call */
- btsl $_EFER_SCE, %eax
-
- /* No Execute supported? */
- btl $20,%edi
+ btsl $_EFER_SCE, %eax /* Enable System Call */
+ btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
-1:
- /* Make changes effective */
- wrmsr
+1: wrmsr /* Make changes effective */

/* Setup cr0 */
#define CR0_PM 1 /* protected mode */
@@ -163,7 +196,7 @@ ENTRY(secondary_startup_64)
* addresses where we're currently running on. We have to do that here
* because in 32bit we couldn't load a 64bit linear address.
*/
- lgdt cpu_gdt_descr
+ lgdt cpu_gdt_descr(%rip)

/*
* Setup up a dummy PDA. this is just for some early bootup code
@@ -206,6 +239,9 @@ initial_code:
init_rsp:
.quad init_thread_union+THREAD_SIZE-8

+bad_address:
+ jmp bad_address
+
ENTRY(early_idt_handler)
cmpl $2,early_recursion_flag(%rip)
jz 1f
@@ -234,23 +270,7 @@ early_idt_msg:
early_idt_ripmsg:
.asciz "RIP %s\n"

-.code32
-ENTRY(no_long_mode)
- /* This isn't an x86-64 CPU so hang */
-1:
- jmp 1b
-
-.org 0xf00
- .globl pGDT32
-pGDT32:
- .word gdt_end-cpu_gdt_table-1
- .long cpu_gdt_table-__START_KERNEL_map
-
-.org 0xf10
-ljumpvector:
- .long startup_64-__START_KERNEL_map
- .word __KERNEL_CS
-
+.balign PAGE_SIZE
ENTRY(stext)
ENTRY(_stext)

@@ -305,6 +325,9 @@ NEXT_PAGE(level2_kernel_pgt)
/* Module mapping starts here */
.fill (PTRS_PER_PMD - (KERNEL_TEXT_SIZE/PMD_SIZE)),8,0

+NEXT_PAGE(level2_spare_pgt)
+ .fill 512,8,0
+
#undef PMDS
#undef NEXT_PAGE

@@ -322,6 +345,10 @@ gdt:
.endr
#endif

+ENTRY(phys_base)
+ /* This must match the first entry in level2_kernel_pgt */
+ .quad 0x0000000000000000
+
/* We need valid kernel segments for data and code in long mode too
* IRET will check the segment types kkeil 2000/10/28
* Also sysret mandates a special GDT layout
diff -puN include/asm-x86_64/page.h~x86_64-Relocatable-kernel-support include/asm-x86_64/page.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/page.h~x86_64-Relocatable-kernel-support 2006-11-09 23:06:47.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/page.h 2006-11-09 23:06:47.000000000 -0500
@@ -61,6 +61,8 @@ typedef struct { unsigned long pgd; } pg

typedef struct { unsigned long pgprot; } pgprot_t;

+extern unsigned long phys_base;
+
#define pte_val(x) ((x).pte)
#define pmd_val(x) ((x).pmd)
#define pud_val(x) ((x).pud)
@@ -99,14 +101,14 @@ typedef struct { unsigned long pgprot; }
#define PAGE_OFFSET __PAGE_OFFSET

/* Note: __pa(&symbol_visible_to_c) should be always replaced with __pa_symbol.
- Otherwise you risk miscompilation. */
+ Otherwise you risk miscompilation. */
#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
/* __pa_symbol should be used for C visible symbols.
This seems to be the official gcc blessed way to do such arithmetic. */
#define __pa_symbol(x) \
({unsigned long v; \
asm("" : "=r" (v) : "0" (x)); \
- (v - __START_KERNEL_map); })
+ ((v - __START_KERNEL_map) + phys_base); })

#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
#ifdef CONFIG_FLATMEM
_

2006-11-13 16:55:32

by Vivek Goyal

Subject: [RFC] [PATCH 8/16] x86_64: Add EFER to the set of registers saved by save_processor_state



EFER varies like %cr4 depending on the cpu capabilities, and on which cpu
capabilities we want to make use of. So save/restore it to make certain
we have the same EFER value when we are done.

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/suspend.c | 3 ++-
include/asm-x86_64/suspend.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)

diff -puN arch/x86_64/kernel/suspend.c~x86_64-Add-EFER-to-the-set-registers-saved-by-save_processor_state arch/x86_64/kernel/suspend.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/suspend.c~x86_64-Add-EFER-to-the-set-registers-saved-by-save_processor_state 2006-11-09 22:58:53.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/suspend.c 2006-11-09 22:58:53.000000000 -0500
@@ -33,7 +33,6 @@ void __save_processor_state(struct saved
asm volatile ("str %0" : "=m" (ctxt->tr));

/* XMM0..XMM15 should be handled by kernel_fpu_begin(). */
- /* EFER should be constant for kernel version, no need to handle it. */
/*
* segment registers
*/
@@ -50,6 +49,7 @@ void __save_processor_state(struct saved
/*
* control registers
*/
+ rdmsrl(MSR_EFER, ctxt->efer);
asm volatile ("movq %%cr0, %0" : "=r" (ctxt->cr0));
asm volatile ("movq %%cr2, %0" : "=r" (ctxt->cr2));
asm volatile ("movq %%cr3, %0" : "=r" (ctxt->cr3));
@@ -75,6 +75,7 @@ void __restore_processor_state(struct sa
/*
* control registers
*/
+ wrmsrl(MSR_EFER, ctxt->efer);
asm volatile ("movq %0, %%cr8" :: "r" (ctxt->cr8));
asm volatile ("movq %0, %%cr4" :: "r" (ctxt->cr4));
asm volatile ("movq %0, %%cr3" :: "r" (ctxt->cr3));
diff -puN include/asm-x86_64/suspend.h~x86_64-Add-EFER-to-the-set-registers-saved-by-save_processor_state include/asm-x86_64/suspend.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/suspend.h~x86_64-Add-EFER-to-the-set-registers-saved-by-save_processor_state 2006-11-09 22:58:53.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/suspend.h 2006-11-09 22:58:53.000000000 -0500
@@ -17,6 +17,7 @@ struct saved_context {
u16 ds, es, fs, gs, ss;
unsigned long gs_base, gs_kernel_base, fs_base;
unsigned long cr0, cr2, cr3, cr4, cr8;
+ unsigned long efer;
u16 gdt_pad;
u16 gdt_limit;
unsigned long gdt_base;
_

2006-11-13 16:55:32

by Vivek Goyal

Subject: [RFC] [PATCH 3/16] x86_64: Kill temp_boot_pmds



Early in the boot process we need the ability to set
up temporary mappings, before our normal mechanisms are
initialized. Currently this is used to map pages that
are part of the page tables we are building, and pages
used during the dmi scan.

The core problem is that we are using the user portion of
the page tables to implement this, which means that while
this mechanism is active we cannot catch NULL pointer dereferences,
and we deviate from the normal way of handling things.

In this patch I modify early_ioremap to map pages into
the kernel portion of the address space, roughly where
we will later put modules. I also make the discovery of
which addresses we can use dynamic, which removes all
kinds of static limits and removes the dependencies
on implementation details between different parts of the code.

Now alloc_low_page() and unmap_low_page() use
early_ioremap() and early_iounmap() to allocate/map and
unmap a page.
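
The resulting calling convention is symmetric; a minimal sketch of the
usage pattern (the real callers are in the diff below):

	unsigned long pmd_phys;
	pmd_t *pmd;

	/* Grab a zeroed page from the boot-time pool and map it through
	 * a temporary pmd in the kernel portion of the address space.
	 */
	pmd = alloc_low_page(&pmd_phys);
	set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
	phys_pmd_init(pmd, addr, end);
	unmap_low_page(pmd);		/* tear the mapping back down */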

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/head.S | 3 -
arch/x86_64/mm/init.c | 100 ++++++++++++++++++++--------------------------
2 files changed, 45 insertions(+), 58 deletions(-)

diff -puN arch/x86_64/kernel/head.S~x86_64-Kill-temp_boot_pmds arch/x86_64/kernel/head.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/head.S~x86_64-Kill-temp_boot_pmds 2006-11-09 22:53:32.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/head.S 2006-11-09 22:53:32.000000000 -0500
@@ -280,9 +280,6 @@ NEXT_PAGE(level2_ident_pgt)
.quad i << 21 | 0x083
i = i + 1
.endr
- /* Temporary mappings for the super early allocator in arch/x86_64/mm/init.c */
- .globl temp_boot_pmds
-temp_boot_pmds:
.fill 492,8,0

NEXT_PAGE(level2_kernel_pgt)
diff -puN arch/x86_64/mm/init.c~x86_64-Kill-temp_boot_pmds arch/x86_64/mm/init.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/mm/init.c~x86_64-Kill-temp_boot_pmds 2006-11-09 22:53:32.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/mm/init.c 2006-11-09 22:53:32.000000000 -0500
@@ -167,23 +167,9 @@ __set_fixmap (enum fixed_addresses idx,

unsigned long __initdata table_start, table_end;

-extern pmd_t temp_boot_pmds[];
-
-static struct temp_map {
- pmd_t *pmd;
- void *address;
- int allocated;
-} temp_mappings[] __initdata = {
- { &temp_boot_pmds[0], (void *)(40UL * 1024 * 1024) },
- { &temp_boot_pmds[1], (void *)(42UL * 1024 * 1024) },
- {}
-};
-
-static __meminit void *alloc_low_page(int *index, unsigned long *phys)
+static __meminit void *alloc_low_page(unsigned long *phys)
{
- struct temp_map *ti;
- int i;
- unsigned long pfn = table_end++, paddr;
+ unsigned long pfn = table_end++;
void *adr;

if (after_bootmem) {
@@ -194,57 +180,63 @@ static __meminit void *alloc_low_page(in

if (pfn >= end_pfn)
panic("alloc_low_page: ran out of memory");
- for (i = 0; temp_mappings[i].allocated; i++) {
- if (!temp_mappings[i].pmd)
- panic("alloc_low_page: ran out of temp mappings");
- }
- ti = &temp_mappings[i];
- paddr = (pfn << PAGE_SHIFT) & PMD_MASK;
- set_pmd(ti->pmd, __pmd(paddr | _KERNPG_TABLE | _PAGE_PSE));
- ti->allocated = 1;
- __flush_tlb();
- adr = ti->address + ((pfn << PAGE_SHIFT) & ~PMD_MASK);
+
+ adr = early_ioremap(pfn * PAGE_SIZE, PAGE_SIZE);
memset(adr, 0, PAGE_SIZE);
- *index = i;
- *phys = pfn * PAGE_SIZE;
- return adr;
-}
+ *phys = pfn * PAGE_SIZE;
+ return adr;
+}

-static __meminit void unmap_low_page(int i)
+static __meminit void unmap_low_page(void *adr)
{
- struct temp_map *ti;

if (after_bootmem)
return;

- ti = &temp_mappings[i];
- set_pmd(ti->pmd, __pmd(0));
- ti->allocated = 0;
+ early_iounmap(adr, PAGE_SIZE);
}

/* Must run before zap_low_mappings */
__init void *early_ioremap(unsigned long addr, unsigned long size)
{
- unsigned long map = round_down(addr, LARGE_PAGE_SIZE);
-
- /* actually usually some more */
- if (size >= LARGE_PAGE_SIZE) {
- return NULL;
+ unsigned long vaddr;
+ pmd_t *pmd, *last_pmd;
+ int i, pmds;
+
+ pmds = ((addr & ~PMD_MASK) + size + ~PMD_MASK) / PMD_SIZE;
+ vaddr = __START_KERNEL_map;
+ pmd = level2_kernel_pgt;
+ last_pmd = level2_kernel_pgt + PTRS_PER_PMD - 1;
+ for (; pmd <= last_pmd; pmd++, vaddr += PMD_SIZE) {
+ for (i = 0; i < pmds; i++) {
+ if (pmd_present(pmd[i]))
+ goto next;
+ }
+ vaddr += addr & ~PMD_MASK;
+ addr &= PMD_MASK;
+ for (i = 0; i < pmds; i++, addr += PMD_SIZE)
+ set_pmd(pmd + i,__pmd(addr | _KERNPG_TABLE | _PAGE_PSE));
+ __flush_tlb();
+ return (void *)vaddr;
+ next:
+ ;
}
- set_pmd(temp_mappings[0].pmd, __pmd(map | _KERNPG_TABLE | _PAGE_PSE));
- map += LARGE_PAGE_SIZE;
- set_pmd(temp_mappings[1].pmd, __pmd(map | _KERNPG_TABLE | _PAGE_PSE));
- __flush_tlb();
- return temp_mappings[0].address + (addr & (LARGE_PAGE_SIZE-1));
+ printk("early_ioremap(0x%lx, %lu) failed\n", addr, size);
+ return NULL;
}

/* To avoid virtual aliases later */
__init void early_iounmap(void *addr, unsigned long size)
{
- if ((void *)round_down((unsigned long)addr, LARGE_PAGE_SIZE) != temp_mappings[0].address)
- printk("early_iounmap: bad address %p\n", addr);
- set_pmd(temp_mappings[0].pmd, __pmd(0));
- set_pmd(temp_mappings[1].pmd, __pmd(0));
+ unsigned long vaddr;
+ pmd_t *pmd;
+ int i, pmds;
+
+ vaddr = (unsigned long)addr;
+ pmds = ((vaddr & ~PMD_MASK) + size + ~PMD_MASK) / PMD_SIZE;
+ pmd = level2_kernel_pgt + pmd_index(vaddr);
+ for (i = 0; i < pmds; i++)
+ pmd_clear(pmd + i);
__flush_tlb();
}

@@ -289,7 +281,6 @@ static void __meminit phys_pud_init(pud_


for (; i < PTRS_PER_PUD; i++, addr = (addr & PUD_MASK) + PUD_SIZE ) {
- int map;
unsigned long pmd_phys;
pud_t *pud = pud_page + pud_index(addr);
pmd_t *pmd;
@@ -307,12 +298,12 @@ static void __meminit phys_pud_init(pud_
continue;
}

- pmd = alloc_low_page(&map, &pmd_phys);
+ pmd = alloc_low_page(&pmd_phys);
spin_lock(&init_mm.page_table_lock);
set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
phys_pmd_init(pmd, addr, end);
spin_unlock(&init_mm.page_table_lock);
- unmap_low_page(map);
+ unmap_low_page(pmd);
}
__flush_tlb();
}
@@ -364,7 +355,6 @@ void __meminit init_memory_mapping(unsig
end = (unsigned long)__va(end);

for (; start < end; start = next) {
- int map;
unsigned long pud_phys;
pgd_t *pgd = pgd_offset_k(start);
pud_t *pud;
@@ -372,7 +362,7 @@ void __meminit init_memory_mapping(unsig
if (after_bootmem)
pud = pud_offset(pgd, start & PGDIR_MASK);
else
- pud = alloc_low_page(&map, &pud_phys);
+ pud = alloc_low_page(&pud_phys);

next = start + PGDIR_SIZE;
if (next > end)
@@ -380,7 +370,7 @@ void __meminit init_memory_mapping(unsig
phys_pud_init(pud, __pa(start), __pa(next));
if (!after_bootmem)
set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
- unmap_low_page(map);
+ unmap_low_page(pud);
}

if (!after_bootmem)
_

2006-11-13 16:54:08

by Vivek Goyal

Subject: [RFC] [PATCH 14/16] x86_64: Remove CONFIG_PHYSICAL_START



I am about to add relocatable kernel support, which has essentially
no cost, so there is no point in retaining CONFIG_PHYSICAL_START;
retaining it makes the implementation and testing of a
relocatable kernel more difficult.

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/Kconfig | 19 -------------------
arch/x86_64/boot/compressed/head.S | 6 +++---
arch/x86_64/boot/compressed/misc.c | 6 +++---
arch/x86_64/defconfig | 1 -
arch/x86_64/kernel/vmlinux.lds.S | 2 +-
arch/x86_64/mm/fault.c | 4 ++--
include/asm-x86_64/page.h | 2 --
7 files changed, 9 insertions(+), 31 deletions(-)

diff -puN arch/x86_64/boot/compressed/head.S~x86_64-Remove-CONFIG_PHYSICAL_START arch/x86_64/boot/compressed/head.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/boot/compressed/head.S~x86_64-Remove-CONFIG_PHYSICAL_START 2006-11-09 23:05:52.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/boot/compressed/head.S 2006-11-09 23:05:52.000000000 -0500
@@ -76,7 +76,7 @@ startup_32:
jnz 3f
addl $8,%esp
xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $__PHYSICAL_START
+ ljmp $(__KERNEL_CS), $0x200000

/*
* We come here, if we were loaded high.
@@ -102,7 +102,7 @@ startup_32:
popl %ecx # lcount
popl %edx # high_buffer_start
popl %eax # hcount
- movl $__PHYSICAL_START,%edi
+ movl $0x200000,%edi
cli # make sure we don't get interrupted
ljmp $(__KERNEL_CS), $0x1000 # and jump to the move routine

@@ -127,7 +127,7 @@ move_routine_start:
movsl
movl %ebx,%esi # Restore setup pointer
xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $__PHYSICAL_START
+ ljmp $(__KERNEL_CS), $0x200000
move_routine_end:


diff -puN arch/x86_64/boot/compressed/misc.c~x86_64-Remove-CONFIG_PHYSICAL_START arch/x86_64/boot/compressed/misc.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/boot/compressed/misc.c~x86_64-Remove-CONFIG_PHYSICAL_START 2006-11-09 23:05:52.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/boot/compressed/misc.c 2006-11-09 23:05:52.000000000 -0500
@@ -288,7 +288,7 @@ static void setup_normal_output_buffer(v
#else
if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < 1024) error("Less than 2MB of memory");
#endif
- output_data = (unsigned char *)__PHYSICAL_START; /* Normally Points to 1M */
+ output_data = (unsigned char *)0x200000;
free_mem_end_ptr = (long)real_mode;
}

@@ -311,8 +311,8 @@ static void setup_output_buffer_if_we_ru
low_buffer_size = low_buffer_end - LOW_BUFFER_START;
high_loaded = 1;
free_mem_end_ptr = (long)high_buffer_start;
- if ( (__PHYSICAL_START + low_buffer_size) > ((ulg)high_buffer_start)) {
- high_buffer_start = (uch *)(__PHYSICAL_START + low_buffer_size);
+ if ( (0x200000 + low_buffer_size) > ((ulg)high_buffer_start)) {
+ high_buffer_start = (uch *)(0x200000 + low_buffer_size);
mv->hcount = 0; /* say: we need not to move high_buffer */
}
else mv->hcount = -1;
diff -puN arch/x86_64/defconfig~x86_64-Remove-CONFIG_PHYSICAL_START arch/x86_64/defconfig
--- linux-2.6.19-rc5-reloc/arch/x86_64/defconfig~x86_64-Remove-CONFIG_PHYSICAL_START 2006-11-09 23:05:52.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/defconfig 2006-11-09 23:05:52.000000000 -0500
@@ -165,7 +165,6 @@ CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
-CONFIG_PHYSICAL_START=0x200000
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
diff -puN arch/x86_64/Kconfig~x86_64-Remove-CONFIG_PHYSICAL_START arch/x86_64/Kconfig
--- linux-2.6.19-rc5-reloc/arch/x86_64/Kconfig~x86_64-Remove-CONFIG_PHYSICAL_START 2006-11-09 23:05:52.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/Kconfig 2006-11-09 23:05:52.000000000 -0500
@@ -513,25 +513,6 @@ config CRASH_DUMP
PHYSICAL_START.
For more details see Documentation/kdump/kdump.txt

-config PHYSICAL_START
- hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)
- default "0x1000000" if CRASH_DUMP
- default "0x200000"
- help
- This gives the physical address where the kernel is loaded. Normally
- for regular kernels this value is 0x200000 (2MB). But in the case
- of kexec on panic the fail safe kernel needs to run at a different
- address than the panic-ed kernel. This option is used to set the load
- address for kernels used to capture crash dump on being kexec'ed
- after panic. The default value for crash dump kernels is
- 0x1000000 (16MB). This can also be set based on the "X" value as
- specified in the "crashkernel=YM@XM" command line boot parameter
- passed to the panic-ed kernel. Typically this parameter is set as
- crashkernel=64M@16M. Please take a look at
- Documentation/kdump/kdump.txt for more details about crash dumps.
-
- Don't change this unless you know what you are doing.
-
config SECCOMP
bool "Enable seccomp to safely compute untrusted bytecode"
depends on PROC_FS
diff -puN arch/x86_64/kernel/vmlinux.lds.S~x86_64-Remove-CONFIG_PHYSICAL_START arch/x86_64/kernel/vmlinux.lds.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/vmlinux.lds.S~x86_64-Remove-CONFIG_PHYSICAL_START 2006-11-09 23:05:52.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/vmlinux.lds.S 2006-11-09 23:05:52.000000000 -0500
@@ -22,7 +22,7 @@ PHDRS {
}
SECTIONS
{
- . = __START_KERNEL;
+ . = __START_KERNEL_map + 0x200000;
phys_startup_64 = startup_64 - LOAD_OFFSET;
_text = .; /* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
diff -puN arch/x86_64/mm/fault.c~x86_64-Remove-CONFIG_PHYSICAL_START arch/x86_64/mm/fault.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/mm/fault.c~x86_64-Remove-CONFIG_PHYSICAL_START 2006-11-09 23:05:52.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/mm/fault.c 2006-11-09 23:05:52.000000000 -0500
@@ -644,9 +644,9 @@ void vmalloc_sync_all(void)
start = address + PGDIR_SIZE;
}
/* Check that there is no need to do the same for the modules area. */
- BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
+ BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL_map));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
- (__START_KERNEL & PGDIR_MASK)));
+ (__START_KERNEL_map & PGDIR_MASK)));
}

static int __init enable_pagefaulttrace(char *str)
diff -puN include/asm-x86_64/page.h~x86_64-Remove-CONFIG_PHYSICAL_START include/asm-x86_64/page.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/page.h~x86_64-Remove-CONFIG_PHYSICAL_START 2006-11-09 23:05:52.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/page.h 2006-11-09 23:05:52.000000000 -0500
@@ -75,8 +75,6 @@ typedef struct { unsigned long pgprot; }

#endif /* !__ASSEMBLY__ */

-#define __PHYSICAL_START _AC(CONFIG_PHYSICAL_START,UL)
-#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
#define __START_KERNEL_map _AC(0xffffffff80000000,UL)
#define __PAGE_OFFSET _AC(0xffff810000000000,UL)

_

2006-11-13 16:54:10

by Vivek Goyal

Subject: [RFC] [PATCH 4/16] x86_64: Clean up the early boot page table



- Merge physmem_pgt and ident_pgt, removing physmem_pgt. The merge
  is broken as soon as mm/init.c:init_memory_mapping is run.
- As physmem_pgt is gone, don't export it in pgtable.h.
- Use defines from pgtable.h for page permissions.
- Fix the physical memory identity mapping so it is at the correct
  address.
- Remove the physical memory mapping from wakeup_level4_pgt; it
  is at the wrong address so we can't possibly be using it.
- Simplify NEXT_PAGE. The work to calculate the phys_ alias
  of the labels was very cool. Unfortunately it was a brittle
  special purpose hack that makes maintenance more difficult.
  Instead just use label - __START_KERNEL_map like we do
  everywhere else in assembly. (A C sketch of the new PMDS()
  mapping helper follows below.)
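
As promised above, here is that C sketch of what the new PMDS()
assembler macro in the diff emits (illustrative only; it simply fills
consecutive pmd slots with 2MB mappings):

	static void pmds_fill(unsigned long *pmd, unsigned long start,
			      unsigned long perm, int count)
	{
		int i;

		for (i = 0; i < count; i++)
			pmd[i] = start + ((unsigned long)i << 21) + perm;
	}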

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/head.S | 61 +++++++++++++++++++------------------------
include/asm-x86_64/pgtable.h | 1
2 files changed, 28 insertions(+), 34 deletions(-)

diff -puN arch/x86_64/kernel/head.S~x86_64-Cleanup-the-early-boot-page-table arch/x86_64/kernel/head.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/head.S~x86_64-Cleanup-the-early-boot-page-table 2006-11-09 22:55:39.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/head.S 2006-11-09 22:55:39.000000000 -0500
@@ -13,6 +13,7 @@
#include <linux/init.h>
#include <asm/desc.h>
#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
#include <asm/msr.h>
#include <asm/cache.h>
@@ -252,52 +253,48 @@ ljumpvector:
ENTRY(stext)
ENTRY(_stext)

- $page = 0
#define NEXT_PAGE(name) \
- $page = $page + 1; \
- .org $page * 0x1000; \
- phys_/**/name = $page * 0x1000 + __PHYSICAL_START; \
+ .balign PAGE_SIZE; \
ENTRY(name)

+/* Automate the creation of 1 to 1 mapping pmd entries */
+#define PMDS(START, PERM, COUNT) \
+ i = 0 ; \
+ .rept (COUNT) ; \
+ .quad (START) + (i << 21) + (PERM) ; \
+ i = i + 1 ; \
+ .endr
+
NEXT_PAGE(init_level4_pgt)
/* This gets initialized in x86_64_start_kernel */
.fill 512,8,0

NEXT_PAGE(level3_ident_pgt)
- .quad phys_level2_ident_pgt | 0x007
+ .quad level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
.fill 511,8,0

NEXT_PAGE(level3_kernel_pgt)
.fill 510,8,0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
- .quad phys_level2_kernel_pgt | 0x007
+ .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
.fill 1,8,0

NEXT_PAGE(level2_ident_pgt)
- /* 40MB for bootup. */
- i = 0
- .rept 20
- .quad i << 21 | 0x083
- i = i + 1
- .endr
- .fill 492,8,0
+ /* Since I easily can, map the first 1G.
+ * Don't set NX because code runs from these pages.
+ */
+ PMDS(0x0000000000000000, __PAGE_KERNEL_LARGE_EXEC, PTRS_PER_PMD)

NEXT_PAGE(level2_kernel_pgt)
/* 40MB kernel mapping. The kernel code cannot be bigger than that.
When you change this change KERNEL_TEXT_SIZE in page.h too. */
/* (2^48-(2*1024*1024*1024)-((2^39)*511)-((2^30)*510)) = 0 */
- i = 0
- .rept 20
- .quad i << 21 | 0x183
- i = i + 1
- .endr
+ PMDS(0x0000000000000000, __PAGE_KERNEL_LARGE_EXEC|_PAGE_GLOBAL,
+ KERNEL_TEXT_SIZE/PMD_SIZE)
/* Module mapping starts here */
- .fill 492,8,0
-
-NEXT_PAGE(level3_physmem_pgt)
- .quad phys_level2_kernel_pgt | 0x007 /* so that __va works even before pagetable_init */
- .fill 511,8,0
+ .fill (PTRS_PER_PMD - (KERNEL_TEXT_SIZE/PMD_SIZE)),8,0

+#undef PMDS
#undef NEXT_PAGE

.data
@@ -305,12 +302,10 @@ NEXT_PAGE(level3_physmem_pgt)
#ifdef CONFIG_ACPI_SLEEP
.align PAGE_SIZE
ENTRY(wakeup_level4_pgt)
- .quad phys_level3_ident_pgt | 0x007
- .fill 255,8,0
- .quad phys_level3_physmem_pgt | 0x007
- .fill 254,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 510,8,0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad phys_level3_kernel_pgt | 0x007
+ .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
#endif

#ifndef CONFIG_HOTPLUG_CPU
@@ -324,12 +319,12 @@ ENTRY(wakeup_level4_pgt)
*/
.align PAGE_SIZE
ENTRY(boot_level4_pgt)
- .quad phys_level3_ident_pgt | 0x007
- .fill 255,8,0
- .quad phys_level3_physmem_pgt | 0x007
- .fill 254,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 257,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 252,8,0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad phys_level3_kernel_pgt | 0x007
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

.data

diff -puN include/asm-x86_64/pgtable.h~x86_64-Cleanup-the-early-boot-page-table include/asm-x86_64/pgtable.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/pgtable.h~x86_64-Cleanup-the-early-boot-page-table 2006-11-09 22:55:39.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/pgtable.h 2006-11-09 22:55:39.000000000 -0500
@@ -15,7 +15,6 @@
#include <asm/pda.h>

extern pud_t level3_kernel_pgt[512];
-extern pud_t level3_physmem_pgt[512];
extern pud_t level3_ident_pgt[512];
extern pmd_t level2_kernel_pgt[512];
extern pgd_t init_level4_pgt[];
_

2006-11-13 16:54:46

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] [PATCH 13/16] x86_64: __pa and __pa_symbol address space separation



Currently __pa_symbol is for use with symbols in the kernel address
map and __pa is for use with pointers into the physical memory map.
But the code is implemented so you can usually interchange the two.

__pa, which is much more common, can be implemented much more
cheaply if it doesn't have to worry about any other kernel
address spaces. This is especially true with a relocatable
kernel, as __pa_symbol needs to perform an extra variable read
to resolve the address.

A third macro, __pa_vsymbol, is added for finding the physical
addresses of the vsyscall pages.

Most of this patch is simply sorting through the references to
__pa or __pa_symbol and using the proper one. A little of
it is continuing to use a physical address when we have it
instead of recalculating it several times.

swapper_pg_dir is now NULL. leave_mm now uses init_mm.pgd,
and init_mm.pgd is initialized at boot (instead of compile time)
to the physmem virtual mapping of init_level4_pgt, since its
physical address can now change.

Except for the empty_zero_page, all of the remaining references
to __pa_symbol appear to be during kernel initialization. So this
should reduce the cost of __pa in the common case, even on a
relocated kernel.

As this is technically a semantic change, we need to be on the lookout
for anything I missed. But it works for me (tm).
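
To make the new split concrete, here is a minimal user-space sketch
of the arithmetic the two macros now perform (the constants mirror
page.h; with the old interchangeable __pa both subtractions were
folded into one conditional macro):

#include <stdio.h>

#define PAGE_OFFSET      0xffff810000000000UL /* direct physical mapping */
#define START_KERNEL_MAP 0xffffffff80000000UL /* kernel text mapping */

/* __pa(): valid only for pointers into the direct mapping */
static unsigned long pa(unsigned long vaddr)
{
        return vaddr - PAGE_OFFSET;
}

/* __pa_symbol(): valid only for symbols in the kernel image */
static unsigned long pa_symbol(unsigned long vaddr)
{
        return vaddr - START_KERNEL_MAP;
}

int main(void)
{
        printf("%#lx\n", pa(PAGE_OFFSET + 0x100000));             /* 0x100000 */
        printf("%#lx\n", pa_symbol(START_KERNEL_MAP + 0x200000)); /* 0x200000 */
        return 0;
}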

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/i386/kernel/alternative.c | 8 ++++----
arch/i386/mm/init.c | 15 ++++++++-------
arch/x86_64/kernel/machine_kexec.c | 14 +++++++-------
arch/x86_64/kernel/setup.c | 9 +++++----
arch/x86_64/kernel/smp.c | 2 +-
arch/x86_64/kernel/vsyscall.c | 10 ++++++++--
arch/x86_64/mm/init.c | 21 +++++++++++----------
arch/x86_64/mm/pageattr.c | 17 ++++++++++-------
include/asm-x86_64/page.h | 6 ++----
include/asm-x86_64/pgtable.h | 4 ++--
10 files changed, 58 insertions(+), 48 deletions(-)

diff -puN arch/i386/kernel/alternative.c~x86_64-__pa-and-__pa_symbol-address-space-separation arch/i386/kernel/alternative.c
--- linux-2.6.19-rc5-reloc/arch/i386/kernel/alternative.c~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/i386/kernel/alternative.c 2006-11-09 23:05:12.000000000 -0500
@@ -348,8 +348,8 @@ void __init alternative_instructions(voi
if (no_replacement) {
printk(KERN_INFO "(SMP-)alternatives turned off\n");
free_init_pages("SMP alternatives",
- (unsigned long)__smp_alt_begin,
- (unsigned long)__smp_alt_end);
+ __pa_symbol(&__smp_alt_begin),
+ __pa_symbol(&__smp_alt_end));
return;
}

@@ -378,8 +378,8 @@ void __init alternative_instructions(voi
_text, _etext);
}
free_init_pages("SMP alternatives",
- (unsigned long)__smp_alt_begin,
- (unsigned long)__smp_alt_end);
+ __pa_symbol(&__smp_alt_begin),
+ __pa_symbol(&__smp_alt_end));
} else {
alternatives_smp_save(__smp_alt_instructions,
__smp_alt_instructions_end);
diff -puN arch/i386/mm/init.c~x86_64-__pa-and-__pa_symbol-address-space-separation arch/i386/mm/init.c
--- linux-2.6.19-rc5-reloc/arch/i386/mm/init.c~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/i386/mm/init.c 2006-11-09 23:05:12.000000000 -0500
@@ -778,10 +778,11 @@ void free_init_pages(char *what, unsigne
unsigned long addr;

for (addr = begin; addr < end; addr += PAGE_SIZE) {
- ClearPageReserved(virt_to_page(addr));
- init_page_count(virt_to_page(addr));
- memset((void *)addr, POISON_FREE_INITMEM, PAGE_SIZE);
- free_page(addr);
+ struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
+ ClearPageReserved(page);
+ init_page_count(page);
+ memset(page_address(page), POISON_FREE_INITMEM, PAGE_SIZE);
+ __free_page(page);
totalram_pages++;
}
printk(KERN_INFO "Freeing %s: %ldk freed\n", what, (end - begin) >> 10);
@@ -790,14 +791,14 @@ void free_init_pages(char *what, unsigne
void free_initmem(void)
{
free_init_pages("unused kernel memory",
- (unsigned long)(&__init_begin),
- (unsigned long)(&__init_end));
+ __pa_symbol(&__init_begin),
+ __pa_symbol(&__init_end));
}

#ifdef CONFIG_BLK_DEV_INITRD
void free_initrd_mem(unsigned long start, unsigned long end)
{
- free_init_pages("initrd memory", start, end);
+ free_init_pages("initrd memory", __pa(start), __pa(end));
}
#endif

diff -puN arch/x86_64/kernel/machine_kexec.c~x86_64-__pa-and-__pa_symbol-address-space-separation arch/x86_64/kernel/machine_kexec.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/machine_kexec.c~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/machine_kexec.c 2006-11-09 23:05:12.000000000 -0500
@@ -191,19 +191,19 @@ NORET_TYPE void machine_kexec(struct kim

page_list[PA_CONTROL_PAGE] = __pa(control_page);
page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel;
- page_list[PA_PGD] = __pa(kexec_pgd);
+ page_list[PA_PGD] = __pa_symbol(&kexec_pgd);
page_list[VA_PGD] = (unsigned long)kexec_pgd;
- page_list[PA_PUD_0] = __pa(kexec_pud0);
+ page_list[PA_PUD_0] = __pa_symbol(&kexec_pud0);
page_list[VA_PUD_0] = (unsigned long)kexec_pud0;
- page_list[PA_PMD_0] = __pa(kexec_pmd0);
+ page_list[PA_PMD_0] = __pa_symbol(&kexec_pmd0);
page_list[VA_PMD_0] = (unsigned long)kexec_pmd0;
- page_list[PA_PTE_0] = __pa(kexec_pte0);
+ page_list[PA_PTE_0] = __pa_symbol(&kexec_pte0);
page_list[VA_PTE_0] = (unsigned long)kexec_pte0;
- page_list[PA_PUD_1] = __pa(kexec_pud1);
+ page_list[PA_PUD_1] = __pa_symbol(&kexec_pud1);
page_list[VA_PUD_1] = (unsigned long)kexec_pud1;
- page_list[PA_PMD_1] = __pa(kexec_pmd1);
+ page_list[PA_PMD_1] = __pa_symbol(&kexec_pmd1);
page_list[VA_PMD_1] = (unsigned long)kexec_pmd1;
- page_list[PA_PTE_1] = __pa(kexec_pte1);
+ page_list[PA_PTE_1] = __pa_symbol(&kexec_pte1);
page_list[VA_PTE_1] = (unsigned long)kexec_pte1;

page_list[PA_TABLE_PAGE] =
diff -puN arch/x86_64/kernel/setup.c~x86_64-__pa-and-__pa_symbol-address-space-separation arch/x86_64/kernel/setup.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/setup.c~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/setup.c 2006-11-09 23:05:12.000000000 -0500
@@ -365,11 +365,12 @@ void __init setup_arch(char **cmdline_p)
init_mm.end_code = (unsigned long) &_etext;
init_mm.end_data = (unsigned long) &_edata;
init_mm.brk = (unsigned long) &_end;
+ init_mm.pgd = __va(__pa_symbol(&init_level4_pgt));

- code_resource.start = virt_to_phys(&_text);
- code_resource.end = virt_to_phys(&_etext)-1;
- data_resource.start = virt_to_phys(&_etext);
- data_resource.end = virt_to_phys(&_edata)-1;
+ code_resource.start = __pa_symbol(&_text);
+ code_resource.end = __pa_symbol(&_etext)-1;
+ data_resource.start = __pa_symbol(&_etext);
+ data_resource.end = __pa_symbol(&_edata)-1;

early_identify_cpu(&boot_cpu_data);

diff -puN arch/x86_64/kernel/smp.c~x86_64-__pa-and-__pa_symbol-address-space-separation arch/x86_64/kernel/smp.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/smp.c~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/smp.c 2006-11-09 23:05:12.000000000 -0500
@@ -76,7 +76,7 @@ static inline void leave_mm(int cpu)
if (read_pda(mmu_state) == TLBSTATE_OK)
BUG();
cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask);
- load_cr3(swapper_pg_dir);
+ load_cr3(init_mm.pgd);
}

/*
diff -puN arch/x86_64/kernel/vsyscall.c~x86_64-__pa-and-__pa_symbol-address-space-separation arch/x86_64/kernel/vsyscall.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/vsyscall.c~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/vsyscall.c 2006-11-09 23:05:12.000000000 -0500
@@ -46,6 +46,12 @@ int __vgetcpu_mode __section_vgetcpu_mod

#include <asm/unistd.h>

+#define __pa_vsymbol(x) \
+ ({unsigned long v; \
+ extern char __vsyscall_0; \
+ asm("" : "=r" (v) : "0" (x)); \
+ ((v - VSYSCALL_FIRST_PAGE) + __pa_symbol(&__vsyscall_0)); })
+
static __always_inline void timeval_normalize(struct timeval * tv)
{
time_t __sec;
@@ -198,10 +204,10 @@ static int vsyscall_sysctl_change(ctl_ta
return ret;
/* gcc has some trouble with __va(__pa()), so just do it this
way. */
- map1 = ioremap(__pa_symbol(&vsysc1), 2);
+ map1 = ioremap(__pa_vsymbol(&vsysc1), 2);
if (!map1)
return -ENOMEM;
- map2 = ioremap(__pa_symbol(&vsysc2), 2);
+ map2 = ioremap(__pa_vsymbol(&vsysc2), 2);
if (!map2) {
ret = -ENOMEM;
goto out;
diff -puN arch/x86_64/mm/init.c~x86_64-__pa-and-__pa_symbol-address-space-separation arch/x86_64/mm/init.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/mm/init.c~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/mm/init.c 2006-11-09 23:05:12.000000000 -0500
@@ -572,11 +572,11 @@ void free_init_pages(char *what, unsigne

printk(KERN_INFO "Freeing %s: %ldk freed\n", what, (end - begin) >> 10);
for (addr = begin; addr < end; addr += PAGE_SIZE) {
- ClearPageReserved(virt_to_page(addr));
- init_page_count(virt_to_page(addr));
- memset((void *)(addr & ~(PAGE_SIZE-1)),
- POISON_FREE_INITMEM, PAGE_SIZE);
- free_page(addr);
+ struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
+ ClearPageReserved(page);
+ init_page_count(page);
+ memset(page_address(page), POISON_FREE_INITMEM, PAGE_SIZE);
+ __free_page(page);
totalram_pages++;
}
}
@@ -586,17 +586,18 @@ void free_initmem(void)
memset(__initdata_begin, POISON_FREE_INITDATA,
__initdata_end - __initdata_begin);
free_init_pages("unused kernel memory",
- (unsigned long)(&__init_begin),
- (unsigned long)(&__init_end));
+ __pa_symbol(&__init_begin),
+ __pa_symbol(&__init_end));
}

#ifdef CONFIG_DEBUG_RODATA

void mark_rodata_ro(void)
{
- unsigned long addr = (unsigned long)__start_rodata;
+ unsigned long addr = (unsigned long)__va(__pa_symbol(&__start_rodata));
+ unsigned long end = (unsigned long)__va(__pa_symbol(&__end_rodata));

- for (; addr < (unsigned long)__end_rodata; addr += PAGE_SIZE)
+ for (; addr < end; addr += PAGE_SIZE)
change_page_attr_addr(addr, 1, PAGE_KERNEL_RO);

printk ("Write protecting the kernel read-only data: %luk\n",
@@ -615,7 +616,7 @@ void mark_rodata_ro(void)
#ifdef CONFIG_BLK_DEV_INITRD
void free_initrd_mem(unsigned long start, unsigned long end)
{
- free_init_pages("initrd memory", start, end);
+ free_init_pages("initrd memory", __pa(start), __pa(end));
}
#endif

diff -puN arch/x86_64/mm/pageattr.c~x86_64-__pa-and-__pa_symbol-address-space-separation arch/x86_64/mm/pageattr.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/mm/pageattr.c~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/mm/pageattr.c 2006-11-09 23:05:12.000000000 -0500
@@ -51,7 +51,6 @@ static struct page *split_large_page(uns
SetPagePrivate(base);
page_private(base) = 0;

- address = __pa(address);
addr = address & LARGE_PAGE_MASK;
pbase = (pte_t *)page_address(base);
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
@@ -95,7 +94,7 @@ static inline void save_page(struct page
* No more special protections in this 2/4MB area - revert to a
* large page again.
*/
-static void revert_page(unsigned long address, pgprot_t ref_prot)
+static void revert_page(unsigned long address, unsigned long pfn, pgprot_t ref_prot)
{
pgd_t *pgd;
pud_t *pud;
@@ -108,7 +107,8 @@ static void revert_page(unsigned long ad
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, address);
BUG_ON(pmd_val(*pmd) & _PAGE_PSE);
- large_pte = mk_pte_phys(__pa(address) & LARGE_PAGE_MASK, ref_prot);
+ large_pte = mk_pte_phys((pfn << PAGE_SHIFT) & LARGE_PAGE_MASK,
+ ref_prot);
large_pte = pte_mkhuge(large_pte);
set_pte((pte_t *)pmd, large_pte);
}
@@ -133,7 +133,8 @@ __change_page_attr(unsigned long address
*/
struct page *split;
ref_prot2 = pte_pgprot(pte_clrhuge(*kpte));
- split = split_large_page(address, prot, ref_prot2);
+ split = split_large_page(pfn << PAGE_SHIFT, prot,
+ ref_prot2);
if (!split)
return -ENOMEM;
set_pte(kpte, mk_pte(split, ref_prot2));
@@ -152,7 +153,7 @@ __change_page_attr(unsigned long address

if (page_private(kpte_page) == 0) {
save_page(kpte_page);
- revert_page(address, ref_prot);
+ revert_page(address, pfn, ref_prot);
}
return 0;
}
@@ -172,6 +173,7 @@ __change_page_attr(unsigned long address
*/
int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
{
+ unsigned long phys_base_pfn = __pa_symbol(__START_KERNEL_map) >> PAGE_SHIFT;
int err = 0;
int i;

@@ -184,10 +186,11 @@ int change_page_attr_addr(unsigned long
break;
/* Handle kernel mapping too which aliases part of the
* lowmem */
- if (__pa(address) < KERNEL_TEXT_SIZE) {
+ if ((pfn >= phys_base_pfn) &&
+ ((pfn - phys_base_pfn) < (KERNEL_TEXT_SIZE >> PAGE_SHIFT))) {
unsigned long addr2;
pgprot_t prot2;
- addr2 = __START_KERNEL_map + __pa(address);
+ addr2 = __START_KERNEL_map + ((pfn - phys_base_pfn) << PAGE_SHIFT);
/* Make sure the kernel mappings stay executable */
prot2 = pte_pgprot(pte_mkexec(pfn_pte(0, prot)));
err = __change_page_attr(addr2, pfn, prot2,
diff -puN include/asm-x86_64/page.h~x86_64-__pa-and-__pa_symbol-address-space-separation include/asm-x86_64/page.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/page.h~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/page.h 2006-11-09 23:05:12.000000000 -0500
@@ -102,17 +102,15 @@ typedef struct { unsigned long pgprot; }

/* Note: __pa(&symbol_visible_to_c) should be always replaced with __pa_symbol.
Otherwise you risk miscompilation. */
-#define __pa(x) (((unsigned long)(x)>=__START_KERNEL_map)?(unsigned long)(x) - (unsigned long)__START_KERNEL_map:(unsigned long)(x) - PAGE_OFFSET)
+#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
/* __pa_symbol should be used for C visible symbols.
This seems to be the official gcc blessed way to do such arithmetic. */
#define __pa_symbol(x) \
({unsigned long v; \
asm("" : "=r" (v) : "0" (x)); \
- __pa(v); })
+ (v - __START_KERNEL_map); })

#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
-#define __boot_va(x) __va(x)
-#define __boot_pa(x) __pa(x)
#ifdef CONFIG_FLATMEM
#define pfn_valid(pfn) ((pfn) < end_pfn)
#endif
diff -puN include/asm-x86_64/pgtable.h~x86_64-__pa-and-__pa_symbol-address-space-separation include/asm-x86_64/pgtable.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/pgtable.h~x86_64-__pa-and-__pa_symbol-address-space-separation 2006-11-09 23:05:12.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/pgtable.h 2006-11-09 23:05:12.000000000 -0500
@@ -20,7 +20,7 @@ extern pmd_t level2_kernel_pgt[512];
extern pgd_t init_level4_pgt[];
extern unsigned long __supported_pte_mask;

-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir ((pgd_t *)NULL)

extern void paging_init(void);
extern void clear_kernel_mapping(unsigned long addr, unsigned long size);
@@ -30,7 +30,7 @@ extern void clear_kernel_mapping(unsigne
* for zero-mapped memory areas etc..
*/
extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
-#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
+#define ZERO_PAGE(vaddr) (pfn_to_page(__pa_symbol(&empty_zero_page) >> PAGE_SHIFT))

#endif /* !__ASSEMBLY__ */

_

2006-11-13 16:53:45

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] [PATCH 12/16] x86_64: Remove the identity mapping as early as possible



With the rewrite of the SMP trampoline and the early page
allocator, nothing needs identity-mapped pages once we start
executing C code.

So add zap_identity_mappings into head64.c and remove
zap_low_mappings() from much later in the code. The functions
are subtly different, hence the name change.

This also kills boot_level4_pgt which was from an earlier
attempt to move the identity mappings as early as possible,
and is now no longer needed. Essentially I have replaced
boot_level4_pgt with trampoline_level4_pgt in trampoline.S.

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/head.S | 39 ++++++++++++++-------------------------
arch/x86_64/kernel/head64.c | 16 ++++++++++------
arch/x86_64/kernel/setup.c | 2 --
arch/x86_64/kernel/setup64.c | 1 -
arch/x86_64/mm/init.c | 24 ------------------------
include/asm-x86_64/pgtable.h | 1 -
include/asm-x86_64/proto.h | 2 --
7 files changed, 24 insertions(+), 61 deletions(-)

diff -puN arch/x86_64/kernel/head64.c~x86_64-Remove-the-identity-mapping-as-early-as-possible arch/x86_64/kernel/head64.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/head64.c~x86_64-Remove-the-identity-mapping-as-early-as-possible 2006-11-09 23:04:38.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/head64.c 2006-11-09 23:04:38.000000000 -0500
@@ -18,8 +18,16 @@
#include <asm/setup.h>
#include <asm/desc.h>
#include <asm/pgtable.h>
+#include <asm/tlbflush.h>
#include <asm/sections.h>

+static void __init zap_identity_mappings(void)
+{
+ pgd_t *pgd = pgd_offset_k(0UL);
+ pgd_clear(pgd);
+ __flush_tlb();
+}
+
/* Don't add a printk in there. printk relies on the PDA which is not initialized
yet. */
static void __init clear_bss(void)
@@ -56,6 +64,8 @@ void __init x86_64_start_kernel(char * r
{
int i;

+ /* Make NULL pointers segfault */
+ zap_identity_mappings();
for (i = 0; i < 256; i++)
set_intr_gate(i, early_idt_handler);
asm volatile("lidt %0" :: "m" (idt_descr));
@@ -63,12 +73,6 @@ void __init x86_64_start_kernel(char * r

early_printk("Kernel alive\n");

- /*
- * switch to init_level4_pgt from boot_level4_pgt
- */
- memcpy(init_level4_pgt, boot_level4_pgt, PTRS_PER_PGD*sizeof(pgd_t));
- asm volatile("movq %0,%%cr3" :: "r" (__pa_symbol(&init_level4_pgt)));
-
for (i = 0; i < NR_CPUS; i++)
cpu_pda(i) = &boot_cpu_pda[i];

diff -puN arch/x86_64/kernel/head.S~x86_64-Remove-the-identity-mapping-as-early-as-possible arch/x86_64/kernel/head.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/head.S~x86_64-Remove-the-identity-mapping-as-early-as-possible 2006-11-09 23:04:38.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/head.S 2006-11-09 23:04:38.000000000 -0500
@@ -71,7 +71,7 @@ startup_32:
movl %eax, %cr4

/* Setup early boot stage 4 level pagetables */
- movl $(boot_level4_pgt - __START_KERNEL_map), %eax
+ movl $(init_level4_pgt - __START_KERNEL_map), %eax
movl %eax, %cr3

/* Setup EFER (Extended Feature Enable Register) */
@@ -115,7 +115,7 @@ ENTRY(secondary_startup_64)
movq %rax, %cr4

/* Setup early boot stage 4 level pagetables. */
- movq $(boot_level4_pgt - __START_KERNEL_map), %rax
+ movq $(init_level4_pgt - __START_KERNEL_map), %rax
movq %rax, %cr3

/* Check if nx is implemented */
@@ -266,9 +266,19 @@ ENTRY(name)
i = i + 1 ; \
.endr

+ /*
+ * This default setting generates an ident mapping at address 0x100000
+ * and a mapping for the kernel that precisely maps virtual address
+ * 0xffffffff80000000 to physical address 0x000000. (always using
+ * 2Mbyte large pages provided by PAE mode)
+ */
NEXT_PAGE(init_level4_pgt)
- /* This gets initialized in x86_64_start_kernel */
- .fill 512,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 257,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 252,8,0
+ /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(level3_ident_pgt)
.quad level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
@@ -299,27 +309,6 @@ NEXT_PAGE(level2_kernel_pgt)
#undef NEXT_PAGE

.data
-
-#ifndef CONFIG_HOTPLUG_CPU
- __INITDATA
-#endif
- /*
- * This default setting generates an ident mapping at address 0x100000
- * and a mapping for the kernel that precisely maps virtual address
- * 0xffffffff80000000 to physical address 0x000000. (always using
- * 2Mbyte large pages provided by PAE mode)
- */
- .align PAGE_SIZE
-ENTRY(boot_level4_pgt)
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 257,8,0
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 252,8,0
- /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
-
- .data
-
.align 16
.globl cpu_gdt_descr
cpu_gdt_descr:
diff -puN arch/x86_64/kernel/setup64.c~x86_64-Remove-the-identity-mapping-as-early-as-possible arch/x86_64/kernel/setup64.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/setup64.c~x86_64-Remove-the-identity-mapping-as-early-as-possible 2006-11-09 23:04:38.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/setup64.c 2006-11-09 23:04:38.000000000 -0500
@@ -202,7 +202,6 @@ void __cpuinit cpu_init (void)
/* CPU 0 is initialised in head64.c */
if (cpu != 0) {
pda_init(cpu);
- zap_low_mappings(cpu);
} else
estacks = boot_exception_stacks;

diff -puN arch/x86_64/kernel/setup.c~x86_64-Remove-the-identity-mapping-as-early-as-possible arch/x86_64/kernel/setup.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/setup.c~x86_64-Remove-the-identity-mapping-as-early-as-possible 2006-11-09 23:04:38.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/setup.c 2006-11-09 23:04:38.000000000 -0500
@@ -396,8 +396,6 @@ void __init setup_arch(char **cmdline_p)

dmi_scan_machine();

- zap_low_mappings(0);
-
#ifdef CONFIG_ACPI
/*
* Initialize the ACPI boot-time table parser (gets the RSDP and SDT).
diff -puN arch/x86_64/mm/init.c~x86_64-Remove-the-identity-mapping-as-early-as-possible arch/x86_64/mm/init.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/mm/init.c~x86_64-Remove-the-identity-mapping-as-early-as-possible 2006-11-09 23:04:38.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/mm/init.c 2006-11-09 23:04:38.000000000 -0500
@@ -378,21 +378,6 @@ void __meminit init_memory_mapping(unsig
__flush_tlb_all();
}

-void __cpuinit zap_low_mappings(int cpu)
-{
- if (cpu == 0) {
- pgd_t *pgd = pgd_offset_k(0UL);
- pgd_clear(pgd);
- } else {
- /*
- * For AP's, zap the low identity mappings by changing the cr3
- * to init_level4_pgt and doing local flush tlb all
- */
- asm volatile("movq %0,%%cr3" :: "r" (__pa_symbol(&init_level4_pgt)));
- }
- __flush_tlb_all();
-}
-
#ifndef CONFIG_NUMA
void __init paging_init(void)
{
@@ -576,15 +561,6 @@ void __init mem_init(void)
reservedpages << (PAGE_SHIFT-10),
datasize >> 10,
initsize >> 10);
-
-#ifdef CONFIG_SMP
- /*
- * Sync boot_level4_pgt mappings with the init_level4_pgt
- * except for the low identity mappings which are already zapped
- * in init_level4_pgt. This sync-up is essential for AP's bringup
- */
- memcpy(boot_level4_pgt+1, init_level4_pgt+1, (PTRS_PER_PGD-1)*sizeof(pgd_t));
-#endif
}

void free_init_pages(char *what, unsigned long begin, unsigned long end)
diff -puN include/asm-x86_64/pgtable.h~x86_64-Remove-the-identity-mapping-as-early-as-possible include/asm-x86_64/pgtable.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/pgtable.h~x86_64-Remove-the-identity-mapping-as-early-as-possible 2006-11-09 23:04:38.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/pgtable.h 2006-11-09 23:04:38.000000000 -0500
@@ -18,7 +18,6 @@ extern pud_t level3_kernel_pgt[512];
extern pud_t level3_ident_pgt[512];
extern pmd_t level2_kernel_pgt[512];
extern pgd_t init_level4_pgt[];
-extern pgd_t boot_level4_pgt[];
extern unsigned long __supported_pte_mask;

#define swapper_pg_dir init_level4_pgt
diff -puN include/asm-x86_64/proto.h~x86_64-Remove-the-identity-mapping-as-early-as-possible include/asm-x86_64/proto.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/proto.h~x86_64-Remove-the-identity-mapping-as-early-as-possible 2006-11-09 23:04:38.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/proto.h 2006-11-09 23:04:38.000000000 -0500
@@ -11,8 +11,6 @@ struct pt_regs;
extern void start_kernel(void);
extern void pda_init(int);

-extern void zap_low_mappings(int cpu);
-
extern void early_idt_handler(void);

extern void mcheck_init(struct cpuinfo_x86 *c);
_

2006-11-13 16:56:09

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] [PATCH 2/16] x86_64: Assembly safe page.h and pgtable.h


This patch makes pgtable.h and page.h safe to include
in assembly files like head.S, allowing us to use
symbolic constants instead of hard-coded numbers when
referring to the page tables.

This patch copies asm-sparc64/const.h to asm-x86_64 to
get a definition of _AC(), a very convenient macro that
allows us to force the type when we are compiling the
code in C and to drop all of the type information when
we are using the constant in assembly. Previously this
was done with multiple definitions of the same constant.
const.h was modified slightly so that it works when given
CONFIG options as arguments.

This patch adds #ifndef __ASSEMBLY__ ... #endif
and _AC(1,UL) where appropriate so the assembler won't
choke on the header files. Otherwise nothing
should have changed.
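
As a quick illustration of the C side (a standalone sketch mirroring
the const.h added below; on the assembly side _AC(X,Y) simply expands
to the bare X):

#include <stdio.h>

#define __AC(X,Y) (X##Y)
#define _AC(X,Y)  __AC(X,Y)

#define PAGE_SHIFT 12
#define PAGE_SIZE  (_AC(1,UL) << PAGE_SHIFT)

int main(void)
{
        /* Here PAGE_SIZE is (1UL << 12); in head.S the same macro
         * yields the plain integer expression (1 << 12). */
        printf("%lu\n", PAGE_SIZE); /* prints 4096 */
        return 0;
}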

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

include/asm-x86_64/const.h | 20 ++++++++++++++++++++
include/asm-x86_64/page.h | 34 +++++++++++++---------------------
include/asm-x86_64/pgtable.h | 33 +++++++++++++++++++++------------
3 files changed, 54 insertions(+), 33 deletions(-)

diff -puN /dev/null include/asm-x86_64/const.h
--- /dev/null 2006-11-09 22:37:03.200734626 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/const.h 2006-11-09 22:31:04.000000000 -0500
@@ -0,0 +1,20 @@
+/* const.h: Macros for dealing with constants. */
+
+#ifndef _X86_64_CONST_H
+#define _X86_64_CONST_H
+
+/* Some constant macros are used in both assembler and
+ * C code. Therefore we cannot annotate them always with
+ * 'UL' and other type specifiers unilaterally. We
+ * use the following macros to deal with this.
+ */
+
+#ifdef __ASSEMBLY__
+#define _AC(X,Y) X
+#else
+#define __AC(X,Y) (X##Y)
+#define _AC(X,Y) __AC(X,Y)
+#endif
+
+
+#endif /* !(_X86_64_CONST_H) */
diff -puN include/asm-x86_64/page.h~x86_64-Assembly-safe-page.h-and-pgtable.h include/asm-x86_64/page.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/page.h~x86_64-Assembly-safe-page.h-and-pgtable.h 2006-11-09 22:31:04.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/page.h 2006-11-09 22:53:16.000000000 -0500
@@ -1,14 +1,11 @@
#ifndef _X86_64_PAGE_H
#define _X86_64_PAGE_H

+#include <asm/const.h>

/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT 12
-#ifdef __ASSEMBLY__
-#define PAGE_SIZE (0x1 << PAGE_SHIFT)
-#else
-#define PAGE_SIZE (1UL << PAGE_SHIFT)
-#endif
+#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
#define PHYSICAL_PAGE_MASK (~(PAGE_SIZE-1) & __PHYSICAL_MASK)

@@ -33,10 +30,10 @@
#define N_EXCEPTION_STACKS 5 /* hw limit: 7 */

#define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1))
-#define LARGE_PAGE_SIZE (1UL << PMD_SHIFT)
+#define LARGE_PAGE_SIZE (_AC(1,UL) << PMD_SHIFT)

#define HPAGE_SHIFT PMD_SHIFT
-#define HPAGE_SIZE ((1UL) << HPAGE_SHIFT)
+#define HPAGE_SIZE (_AC(1,UL) << HPAGE_SHIFT)
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)

@@ -76,29 +73,24 @@ typedef struct { unsigned long pgprot; }
#define __pgd(x) ((pgd_t) { (x) } )
#define __pgprot(x) ((pgprot_t) { (x) } )

-#define __PHYSICAL_START ((unsigned long)CONFIG_PHYSICAL_START)
-#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
-#define __START_KERNEL_map 0xffffffff80000000UL
-#define __PAGE_OFFSET 0xffff810000000000UL
+#endif /* !__ASSEMBLY__ */

-#else
-#define __PHYSICAL_START CONFIG_PHYSICAL_START
+#define __PHYSICAL_START _AC(CONFIG_PHYSICAL_START,UL)
#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
-#define __START_KERNEL_map 0xffffffff80000000
-#define __PAGE_OFFSET 0xffff810000000000
-#endif /* !__ASSEMBLY__ */
+#define __START_KERNEL_map _AC(0xffffffff80000000,UL)
+#define __PAGE_OFFSET _AC(0xffff810000000000,UL)

/* to align the pointer to the (next) page boundary */
#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)

/* See Documentation/x86_64/mm.txt for a description of the memory map. */
#define __PHYSICAL_MASK_SHIFT 46
-#define __PHYSICAL_MASK ((1UL << __PHYSICAL_MASK_SHIFT) - 1)
+#define __PHYSICAL_MASK ((_AC(1,UL) << __PHYSICAL_MASK_SHIFT) - 1)
#define __VIRTUAL_MASK_SHIFT 48
-#define __VIRTUAL_MASK ((1UL << __VIRTUAL_MASK_SHIFT) - 1)
+#define __VIRTUAL_MASK ((_AC(1,UL) << __VIRTUAL_MASK_SHIFT) - 1)

-#define KERNEL_TEXT_SIZE (40UL*1024*1024)
-#define KERNEL_TEXT_START 0xffffffff80000000UL
+#define KERNEL_TEXT_SIZE (_AC(40,UL)*1024*1024)
+#define KERNEL_TEXT_START _AC(0xffffffff80000000,UL)

#ifndef __ASSEMBLY__

@@ -106,7 +98,7 @@ typedef struct { unsigned long pgprot; }

#endif /* __ASSEMBLY__ */

-#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET)
+#define PAGE_OFFSET __PAGE_OFFSET

/* Note: __pa(&symbol_visible_to_c) should be always replaced with __pa_symbol.
Otherwise you risk miscompilation. */
diff -puN include/asm-x86_64/pgtable.h~x86_64-Assembly-safe-page.h-and-pgtable.h include/asm-x86_64/pgtable.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/pgtable.h~x86_64-Assembly-safe-page.h-and-pgtable.h 2006-11-09 22:31:04.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/pgtable.h 2006-11-09 22:53:16.000000000 -0500
@@ -1,6 +1,9 @@
#ifndef _X86_64_PGTABLE_H
#define _X86_64_PGTABLE_H

+#include <asm/const.h>
+#ifndef __ASSEMBLY__
+
/*
* This file contains the functions and defines necessary to modify and use
* the x86-64 page table tree.
@@ -31,6 +34,8 @@ extern void clear_kernel_mapping(unsigne
extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))

+#endif /* !__ASSEMBLY__ */
+
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
@@ -55,6 +60,8 @@ extern unsigned long empty_zero_page[PAG
*/
#define PTRS_PER_PTE 512

+#ifndef __ASSEMBLY__
+
#define pte_ERROR(e) \
printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), pte_val(e))
#define pmd_ERROR(e) \
@@ -118,22 +125,23 @@ static inline pte_t ptep_get_and_clear_f

#define pte_pgprot(a) (__pgprot((a).pte & ~PHYSICAL_PAGE_MASK))

-#define PMD_SIZE (1UL << PMD_SHIFT)
+#endif /* !__ASSEMBLY__ */
+
+#define PMD_SIZE (_AC(1,UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE-1))
-#define PUD_SIZE (1UL << PUD_SHIFT)
+#define PUD_SIZE (_AC(1,UL) << PUD_SHIFT)
#define PUD_MASK (~(PUD_SIZE-1))
-#define PGDIR_SIZE (1UL << PGDIR_SHIFT)
+#define PGDIR_SIZE (_AC(1,UL) << PGDIR_SHIFT)
#define PGDIR_MASK (~(PGDIR_SIZE-1))

#define USER_PTRS_PER_PGD ((TASK_SIZE-1)/PGDIR_SIZE+1)
#define FIRST_USER_ADDRESS 0

-#ifndef __ASSEMBLY__
-#define MAXMEM 0x3fffffffffffUL
-#define VMALLOC_START 0xffffc20000000000UL
-#define VMALLOC_END 0xffffe1ffffffffffUL
-#define MODULES_VADDR 0xffffffff88000000UL
-#define MODULES_END 0xfffffffffff00000UL
+#define MAXMEM _AC(0x3fffffffffff,UL)
+#define VMALLOC_START _AC(0xffffc20000000000,UL)
+#define VMALLOC_END _AC(0xffffe1ffffffffff,UL)
+#define MODULES_VADDR _AC(0xffffffff88000000,UL)
+#define MODULES_END _AC(0xfffffffffff00000,UL)
#define MODULES_LEN (MODULES_END - MODULES_VADDR)

#define _PAGE_BIT_PRESENT 0
@@ -159,7 +167,7 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_GLOBAL 0x100 /* Global TLB entry */

#define _PAGE_PROTNONE 0x080 /* If not present */
-#define _PAGE_NX (1UL<<_PAGE_BIT_NX)
+#define _PAGE_NX (_AC(1,UL)<<_PAGE_BIT_NX)

#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
@@ -221,6 +229,8 @@ static inline pte_t ptep_get_and_clear_f
#define __S110 PAGE_SHARED_EXEC
#define __S111 PAGE_SHARED_EXEC

+#ifndef __ASSEMBLY__
+
static inline unsigned long pgd_bad(pgd_t pgd)
{
unsigned long val = pgd_val(pgd);
@@ -417,8 +427,6 @@ extern spinlock_t pgd_lock;
extern struct page *pgd_list;
void vmalloc_sync_all(void);

-#endif /* !__ASSEMBLY__ */
-
extern int kern_addr_valid(unsigned long addr);

#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
@@ -448,5 +456,6 @@ extern int kern_addr_valid(unsigned long
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTE_SAME
#include <asm-generic/pgtable.h>
+#endif /* !__ASSEMBLY__ */

#endif /* _X86_64_PGTABLE_H */
_

2006-11-13 16:56:10

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] [PATCH 9/16] x86_64: 64bit PIC SMP trampoline



This modifies the SMP trampoline and all of the associated code so
it can jump to a 64bit kernel loaded at an arbitrary address.

The dependencies on having an identity-mapped page in the kernel
page tables for SMP bootup have all been removed.

In addition the trampoline has been modified to verify
that long mode is supported. Asking if long mode is implemented is
downright silly, but we have traditionally had some of these checks,
and they can't hurt anything. So when the totally ludicrous happens
we just might handle it correctly.
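
In user space, the long-mode probe that verify_cpu performs boils
down to this C sketch (using GCC's cpuid.h for illustration; the
real code runs in 16-bit real mode, so it must also first prove that
CPUID itself exists by toggling the EFLAGS ID bit):

#include <stdio.h>
#include <cpuid.h>

/* CPUID leaf 0x80000001, EDX bit 29 is the long mode (LM) flag,
 * i.e. REQUIRED_MASK2 in the trampoline below. */
static int has_long_mode(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
                return 0; /* extended cpuid leaf not implemented */
        return (edx >> 29) & 1;
}

int main(void)
{
        printf("long mode: %s\n", has_long_mode() ? "yes" : "no");
        return 0;
}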

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/head.S | 1
arch/x86_64/kernel/setup.c | 9 --
arch/x86_64/kernel/trampoline.S | 168 ++++++++++++++++++++++++++++++++++++----
3 files changed, 156 insertions(+), 22 deletions(-)

diff -puN arch/x86_64/kernel/head.S~x86_64-64bit-PIC-SMP-trampoline arch/x86_64/kernel/head.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/head.S~x86_64-64bit-PIC-SMP-trampoline 2006-11-09 22:59:13.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/head.S 2006-11-09 22:59:13.000000000 -0500
@@ -101,6 +101,7 @@ startup_32:
.org 0x100
.globl startup_64
startup_64:
+ENTRY(secondary_startup_64)
/* We come here either from startup_32
* or directly from a 64bit bootloader.
* Since we may have come directly from a bootloader we
diff -puN arch/x86_64/kernel/setup.c~x86_64-64bit-PIC-SMP-trampoline arch/x86_64/kernel/setup.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/setup.c~x86_64-64bit-PIC-SMP-trampoline 2006-11-09 22:59:13.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/setup.c 2006-11-09 22:59:13.000000000 -0500
@@ -446,15 +446,8 @@ void __init setup_arch(char **cmdline_p)
reserve_bootmem_generic(ebda_addr, ebda_size);

#ifdef CONFIG_SMP
- /*
- * But first pinch a few for the stack/trampoline stuff
- * FIXME: Don't need the extra page at 4K, but need to fix
- * trampoline before removing it. (see the GDT stuff)
- */
- reserve_bootmem_generic(PAGE_SIZE, PAGE_SIZE);
-
/* Reserve SMP trampoline */
- reserve_bootmem_generic(SMP_TRAMPOLINE_BASE, PAGE_SIZE);
+ reserve_bootmem_generic(SMP_TRAMPOLINE_BASE, 2*PAGE_SIZE);
#endif

#ifdef CONFIG_ACPI_SLEEP
diff -puN arch/x86_64/kernel/trampoline.S~x86_64-64bit-PIC-SMP-trampoline arch/x86_64/kernel/trampoline.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/trampoline.S~x86_64-64bit-PIC-SMP-trampoline 2006-11-09 22:59:13.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/trampoline.S 2006-11-09 22:59:13.000000000 -0500
@@ -3,6 +3,7 @@
* Trampoline.S Derived from Setup.S by Linus Torvalds
*
* 4 Jan 1997 Michael Chastain: changed to gnu as.
+ * 15 Sept 2005 Eric Biederman: 64bit PIC support
*
* Entry: CS:IP point to the start of our code, we are
* in real mode with no stack, but the rest of the
@@ -17,15 +18,20 @@
* and IP is zero. Thus, data addresses need to be absolute
* (no relocation) and are taken with regard to r_base.
*
+ * With the addition of trampoline_level4_pgt this code can
+ * now enter a 64bit kernel that lives at arbitrary 64bit
+ * physical addresses.
+ *
* If you work on this file, check the object module with objdump
* --full-contents --reloc to make sure there are no relocation
- * entries. For the GDT entry we do hand relocation in smpboot.c
- * because of 64bit linker limitations.
+ * entries.
*/

#include <linux/linkage.h>
-#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
+#include <asm/msr.h>
+#include <asm/segment.h>

.data

@@ -33,15 +39,31 @@

ENTRY(trampoline_data)
r_base = .
+ cli # We should be safe anyway
wbinvd
mov %cs, %ax # Code and data in the same place
mov %ax, %ds
+ mov %ax, %es
+ mov %ax, %ss

- cli # We should be safe anyway

movl $0xA5A5A5A5, trampoline_data - r_base
# write marker for master knows we're running

+ # Setup stack
+ movw $(trampoline_stack_end - r_base), %sp
+
+ call verify_cpu # Verify the cpu supports long mode
+
+ mov %cs, %ax
+ movzx %ax, %esi # Find the 32bit trampoline location
+ shll $4, %esi
+
+ # Fixup the vectors
+ addl %esi, startup_32_vector - r_base
+ addl %esi, startup_64_vector - r_base
+ addl %esi, tgdt + 2 - r_base # Fixup the gdt pointer
+
/*
* GDT tables in non default location kernel can be beyond 16MB and
* lgdt will not be able to load the address as in real mode default
@@ -49,23 +71,141 @@ r_base = .
* to 32 bit.
*/

- lidtl idt_48 - r_base # load idt with 0, 0
- lgdtl gdt_48 - r_base # load gdt with whatever is appropriate
+ lidtl tidt - r_base # load idt with 0, 0
+ lgdtl tgdt - r_base # load gdt with whatever is appropriate

xor %ax, %ax
inc %ax # protected mode (PE) bit
lmsw %ax # into protected mode
- # flaush prefetch and jump to startup_32 in arch/x86_64/kernel/head.S
- ljmpl $__KERNEL32_CS, $(startup_32-__START_KERNEL_map)
+
+ # flush prefetch and jump to startup_32
+ ljmpl *(startup_32_vector - r_base)
+
+ .code32
+ .balign 4
+startup_32:
+ movl $__KERNEL_DS, %eax # Initialize the %ds segment register
+ movl %eax, %ds
+
+ xorl %eax, %eax
+ btsl $5, %eax # Enable PAE mode
+ movl %eax, %cr4
+
+ # Setup trampoline 4 level pagetables
+ leal (trampoline_level4_pgt - r_base)(%esi), %eax
+ movl %eax, %cr3
+
+ movl $MSR_EFER, %ecx
+ movl $(1 << _EFER_LME), %eax # Enable Long Mode
+ xorl %edx, %edx
+ wrmsr
+
+ xorl %eax, %eax
+ btsl $31, %eax # Enable paging and in turn activate Long Mode
+ btsl $0, %eax # Enable protected mode
+ movl %eax, %cr0
+
+ /*
+ * At this point we're in long mode but in 32bit compatibility mode
+ * with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
+ * EFER.LMA = 1). Now we want to jump in 64bit mode, to do that we use
+ * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
+ */
+ ljmp *(startup_64_vector - r_base)(%esi)
+
+ .code64
+ .balign 4
+startup_64:
+ # Now jump into the kernel using virtual addresses
+ movq $secondary_startup_64, %rax
+ jmp *%rax
+
+ .code16
+verify_cpu:
+ pushl $0 # Kill any dangerous flags
+ popfl
+
+ /* minimum CPUID flags for x86-64 */
+ /* see http://www.x86-64.org/lists/discuss/msg02971.html */
+#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
+ (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
+#define REQUIRED_MASK2 (1<<29)
+
+ pushfl # check for cpuid
+ popl %eax
+ movl %eax, %ebx
+ xorl $0x200000,%eax
+ pushl %eax
+ popfl
+ pushfl
+ popl %eax
+ pushl %ebx
+ popfl
+ cmpl %eax, %ebx
+ jz no_longmode
+
+ xorl %eax, %eax # See if cpuid 1 is implemented
+ cpuid
+ cmpl $0x1, %eax
+ jb no_longmode
+
+ movl $0x01, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK1, %edx
+ xorl $REQUIRED_MASK1, %edx
+ jnz no_longmode
+
+ movl $0x80000000, %eax # See if extended cpuid is implemented
+ cpuid
+ cmpl $0x80000001, %eax
+ jb no_longmode
+
+ movl $0x80000001, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK2, %edx
+ xorl $REQUIRED_MASK2, %edx
+ jnz no_longmode
+
+ ret # The cpu supports long mode
+
+no_longmode:
+ hlt
+ jmp no_longmode
+

# Careful these need to be in the same 64K segment as the above;
-idt_48:
+tidt:
.word 0 # idt limit = 0
.word 0, 0 # idt base = 0L

-gdt_48:
- .short GDT_ENTRIES*8 - 1 # gdt limit
- .long cpu_gdt_table-__START_KERNEL_map
+ # Duplicate the global descriptor table
+ # so the kernel can live anywhere
+ .balign 4
+tgdt:
+ .short tgdt_end - tgdt # gdt limit
+ .long tgdt - r_base
+ .short 0
+ .quad 0x00cf9b000000ffff # __KERNEL32_CS
+ .quad 0x00af9b000000ffff # __KERNEL_CS
+ .quad 0x00cf93000000ffff # __KERNEL_DS
+tgdt_end:
+
+ .balign 4
+startup_32_vector:
+ .long startup_32 - r_base
+ .word __KERNEL32_CS, 0
+
+ .balign 4
+startup_64_vector:
+ .long startup_64 - r_base
+ .word __KERNEL_CS, 0
+
+trampoline_stack:
+ .org 0x1000
+trampoline_stack_end:
+ENTRY(trampoline_level4_pgt)
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 510,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE

-.globl trampoline_end
-trampoline_end:
+ENTRY(trampoline_end)
_

2006-11-13 16:54:47

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] [PATCH 1/16] x86_64: Align data segment to page size boundary


o Explicitly align the data segment to a PAGE_SIZE boundary. Otherwise,
depending on config options and tool chain, it might be placed on a
non-PAGE_SIZE-aligned boundary, and vmlinux loaders like kexec fail when
they encounter a PT_LOAD segment that is not aligned to a PAGE_SIZE
boundary (a sketch of such an alignment check follows below).
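
A hypothetical user-space checker for the property this patch
guarantees, namely that every PT_LOAD segment of vmlinux starts on
a page boundary (illustrative only; kexec's real loader differs):

#include <elf.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
        Elf64_Ehdr eh;
        Elf64_Phdr ph;
        FILE *f;
        int i;

        if (argc < 2 || !(f = fopen(argv[1], "rb")))
                return 1;
        if (fread(&eh, sizeof(eh), 1, f) != 1)
                return 1;
        for (i = 0; i < eh.e_phnum; i++) {
                fseek(f, eh.e_phoff + (uint64_t)i * eh.e_phentsize, SEEK_SET);
                if (fread(&ph, sizeof(ph), 1, f) != 1)
                        return 1;
                if (ph.p_type == PT_LOAD && (ph.p_vaddr & 0xfff))
                        printf("PT_LOAD %d not page aligned: vaddr %#llx\n",
                               i, (unsigned long long)ph.p_vaddr);
        }
        fclose(f);
        return 0;
}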

Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/vmlinux.lds.S | 1 +
1 file changed, 1 insertion(+)

diff -puN arch/x86_64/kernel/vmlinux.lds.S~x86_64-align-data-segment-to-4K-boundary arch/x86_64/kernel/vmlinux.lds.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/vmlinux.lds.S~x86_64-align-data-segment-to-4K-boundary 2006-11-09 22:25:15.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/vmlinux.lds.S 2006-11-09 22:28:24.000000000 -0500
@@ -60,6 +60,7 @@ SECTIONS
}
#endif

+ . = ALIGN(PAGE_SIZE); /* Align data segment to page size boundary */
/* Data */
.data : AT(ADDR(.data) - LOAD_OFFSET) {
*(.data)
_

2006-11-13 16:54:46

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup



- Killed lots of dead code
- Improved the cpu sanity checks to verify long mode
is enabled when we wake up.
- Removed the need for modifying any existing kernel page table.
- Moved wakeup_level4_pgt into the wakeup routine so we can
run the kernel above 4G.
- Increased the size of the wakeup routine to 8K.
- Renamed the variables to use the 64bit register names.
- Lots of misc cleanups to match trampoline.S

I don't have a configuration on which I can test this, but it
compiles cleanly and should work; the code is very similar to the
SMP trampoline, which I have tested. At least now the comments
about still running in low memory are actually correct.

Vivek has tested this patch for suspend to memory and it works fine.

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/acpi/sleep.c | 19 --
arch/x86_64/kernel/acpi/wakeup.S | 329 +++++++++++++++++----------------------
arch/x86_64/kernel/head.S | 9 -
include/asm-x86_64/suspend.h | 12 -
4 files changed, 153 insertions(+), 216 deletions(-)

diff -puN arch/x86_64/kernel/acpi/sleep.c~x86_64-64bit-PIC-ACPI-wakeup arch/x86_64/kernel/acpi/sleep.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/acpi/sleep.c~x86_64-64bit-PIC-ACPI-wakeup 2006-11-09 22:59:48.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/acpi/sleep.c 2006-11-09 22:59:48.000000000 -0500
@@ -60,17 +60,6 @@ extern char wakeup_start, wakeup_end;

extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long));

-static pgd_t low_ptr;
-
-static void init_low_mapping(void)
-{
- pgd_t *slot0 = pgd_offset(current->mm, 0UL);
- low_ptr = *slot0;
- set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
- WARN_ON(num_online_cpus() != 1);
- local_flush_tlb();
-}
-
/**
* acpi_save_state_mem - save kernel state
*
@@ -79,8 +68,6 @@ static void init_low_mapping(void)
*/
int acpi_save_state_mem(void)
{
- init_low_mapping();
-
memcpy((void *)acpi_wakeup_address, &wakeup_start,
&wakeup_end - &wakeup_start);
acpi_copy_wakeup_routine(acpi_wakeup_address);
@@ -93,8 +80,6 @@ int acpi_save_state_mem(void)
*/
void acpi_restore_state_mem(void)
{
- set_pgd(pgd_offset(current->mm, 0UL), low_ptr);
- local_flush_tlb();
}

/**
@@ -107,8 +92,8 @@ void acpi_restore_state_mem(void)
*/
void __init acpi_reserve_bootmem(void)
{
- acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE);
- if ((&wakeup_end - &wakeup_start) > PAGE_SIZE)
+ acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE*2);
+ if ((&wakeup_end - &wakeup_start) > (PAGE_SIZE*2))
printk(KERN_CRIT
"ACPI: Wakeup code way too big, will crash on attempt to suspend\n");
}
diff -puN arch/x86_64/kernel/acpi/wakeup.S~x86_64-64bit-PIC-ACPI-wakeup arch/x86_64/kernel/acpi/wakeup.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/acpi/wakeup.S~x86_64-64bit-PIC-ACPI-wakeup 2006-11-09 22:59:48.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/acpi/wakeup.S 2006-11-09 22:59:48.000000000 -0500
@@ -1,6 +1,7 @@
.text
#include <linux/linkage.h>
#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
#include <asm/msr.h>

@@ -15,7 +16,6 @@
# cs = 0x1234, eip = 0x05
#

-
ALIGN
.align 16
ENTRY(wakeup_start)
@@ -30,22 +30,25 @@ wakeup_code:
cld
# setup data segment
movw %cs, %ax
- movw %ax, %ds # Make ds:0 point to wakeup_start
+ movw %ax, %ds # Make ds:0 point to wakeup_start
movw %ax, %ss
- mov $(wakeup_stack - wakeup_code), %sp # Private stack is needed for ASUS board
+ # Private stack is needed for ASUS board
+ mov $(wakeup_stack - wakeup_code), %sp

- pushl $0 # Kill any dangerous flags
+ pushl $0 # Kill any dangerous flags
popfl

movl real_magic - wakeup_code, %eax
cmpl $0x12345678, %eax
jne bogus_real_magic

+ call verify_cpu # Verify the cpu supports long mode
+
testl $1, video_flags - wakeup_code
jz 1f
lcall $0xc000,$3
movw %cs, %ax
- movw %ax, %ds # Bios might have played with that
+ movw %ax, %ds # Bios might have played with that
movw %ax, %ss
1:

@@ -60,13 +63,17 @@ wakeup_code:
movw $0x0e00 + 'L', %fs:(0x10)

movb $0xa2, %al ; outb %al, $0x80
+
+ mov %ds, %ax # Find 32bit wakeup_code address
+ movzx %ax, %esi # (Convert %ds:gdt to a linear ptr)
+ shll $4, %esi
+
+ # Fixup the vectors
+ addl %esi, wakeup_32_vector - wakeup_code
+ addl %esi, wakeup_long64_vector - wakeup_code
+ addl %esi, gdt_48a + 2 - wakeup_code # Fixup the gdt pointer

- lidt %ds:idt_48a - wakeup_code
- xorl %eax, %eax
- movw %ds, %ax # (Convert %ds:gdt to a linear ptr)
- shll $4, %eax
- addl $(gdta - wakeup_code), %eax
- movl %eax, gdt_48a +2 - wakeup_code
+ lidtl %ds:idt_48a - wakeup_code
lgdtl %ds:gdt_48a - wakeup_code # load gdt with whatever is
# appropriate

@@ -75,85 +82,47 @@ wakeup_code:
jmp 1f
1:

- .byte 0x66, 0xea # prefix + jmpi-opcode
- .long wakeup_32 - __START_KERNEL_map
- .word __KERNEL_CS
+ ljmpl *(wakeup_32_vector - wakeup_code)
+
+ .balign 4
+wakeup_32_vector:
+ .long wakeup_32 - wakeup_code
+ .word __KERNEL32_CS, 0

.code32
wakeup_32:
# Running in this code, but at low address; paging is not yet turned on.
movb $0xa5, %al ; outb %al, $0x80

- /* Check if extended functions are implemented */
- movl $0x80000000, %eax
- cpuid
- cmpl $0x80000000, %eax
- jbe bogus_cpu
- wbinvd
- mov $0x80000001, %eax
- cpuid
- btl $29, %edx
- jnc bogus_cpu
- movl %edx,%edi
-
- movw $__KERNEL_DS, %ax
- movw %ax, %ds
- movw %ax, %es
- movw %ax, %fs
- movw %ax, %gs
-
- movw $__KERNEL_DS, %ax
- movw %ax, %ss
+ /* Initialize segments */
+ movl $__KERNEL_DS, %eax
+ movl %eax, %ds

- mov $(wakeup_stack - __START_KERNEL_map), %esp
- movl saved_magic - __START_KERNEL_map, %eax
- cmpl $0x9abcdef0, %eax
- jne bogus_32_magic
+ movw $0x0e00 + 'i', %ds:(0xb8012)
+ movb $0xa8, %al ; outb %al, $0x80;

/*
* Prepare for entering 64bits mode
*/

- /* Enable PAE mode and PGE */
+ /* Enable PAE */
xorl %eax, %eax
btsl $5, %eax
- btsl $7, %eax
movl %eax, %cr4

/* Setup early boot stage 4 level pagetables */
- movl $(wakeup_level4_pgt - __START_KERNEL_map), %eax
+ leal (wakeup_level4_pgt - wakeup_code)(%esi), %eax
movl %eax, %cr3

- /* Setup EFER (Extended Feature Enable Register) */
- movl $MSR_EFER, %ecx
- rdmsr
- /* Fool rdmsr and reset %eax to avoid dependences */
- xorl %eax, %eax
/* Enable Long Mode */
- btsl $_EFER_LME, %eax
- /* Enable System Call */
- btsl $_EFER_SCE, %eax
-
- /* No Execute supported? */
- btl $20,%edi
- jnc 1f
- btsl $_EFER_NX, %eax
-1:
-
- /* Make changes effective */
+ movl $MSR_EFER, %ecx
+ movl $(1 << _EFER_LME), %eax # Enable Long Mode
+ xorl %edx, %edx
wrmsr
- wbinvd

xorl %eax, %eax
btsl $31, %eax /* Enable paging and in turn activate Long Mode */
btsl $0, %eax /* Enable protected mode */
- btsl $1, %eax /* Enable MP */
- btsl $4, %eax /* Enable ET */
- btsl $5, %eax /* Enable NE */
- btsl $16, %eax /* Enable WP */
- btsl $18, %eax /* Enable AM */
-
- /* Make changes effective */
movl %eax, %cr0
/* At this point:
CR4.PAE must be 1
@@ -162,11 +131,6 @@ wakeup_32:
Next instruction must be a branch
This must be on identity-mapped page
*/
- jmp reach_compatibility_mode
-reach_compatibility_mode:
- movw $0x0e00 + 'i', %ds:(0xb8012)
- movb $0xa8, %al ; outb %al, $0x80;
-
/*
* At this point we're in long mode but in 32bit compatibility mode
* with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
@@ -174,20 +138,13 @@ reach_compatibility_mode:
* the new gdt/idt that has __KERNEL_CS with CS.L = 1.
*/

- movw $0x0e00 + 'n', %ds:(0xb8014)
- movb $0xa9, %al ; outb %al, $0x80
-
- /* Load new GDT with the 64bit segment using 32bit descriptor */
- movl $(pGDT32 - __START_KERNEL_map), %eax
- lgdt (%eax)
-
- movl $(wakeup_jumpvector - __START_KERNEL_map), %eax
/* Finally jump in 64bit mode */
- ljmp *(%eax)
+ ljmp *(wakeup_long64_vector - wakeup_code)(%esi)

-wakeup_jumpvector:
- .long wakeup_long64 - __START_KERNEL_map
- .word __KERNEL_CS
+ .balign 4
+wakeup_long64_vector:
+ .long wakeup_long64 - wakeup_code
+ .word __KERNEL_CS, 0

.code64

@@ -199,10 +156,18 @@ wakeup_long64:
* addresses where we're currently running on. We have to do that here
* because in 32bit we couldn't load a 64bit linear address.
*/
- lgdt cpu_gdt_descr - __START_KERNEL_map
+ lgdt cpu_gdt_descr
+
+ movw $0x0e00 + 'n', %ds:(0xb8014)
+ movb $0xa9, %al ; outb %al, $0x80
+
+ movq saved_magic, %rax
+ movq $0x123456789abcdef0, %rdx
+ cmpq %rdx, %rax
+ jne bogus_64_magic

movw $0x0e00 + 'u', %ds:(0xb8016)
-
+
nop
nop
movw $__KERNEL_DS, %ax
@@ -211,16 +176,16 @@ wakeup_long64:
movw %ax, %es
movw %ax, %fs
movw %ax, %gs
- movq saved_esp, %rsp
+ movq saved_rsp, %rsp

movw $0x0e00 + 'x', %ds:(0xb8018)
- movq saved_ebx, %rbx
- movq saved_edi, %rdi
- movq saved_esi, %rsi
- movq saved_ebp, %rbp
+ movq saved_rbx, %rbx
+ movq saved_rdi, %rdi
+ movq saved_rsi, %rsi
+ movq saved_rbp, %rbp

movw $0x0e00 + '!', %ds:(0xb801a)
- movq saved_eip, %rax
+ movq saved_rip, %rax
jmp *%rax

.code32
@@ -228,25 +193,10 @@ wakeup_long64:
.align 64
gdta:
.word 0, 0, 0, 0 # dummy
-
- .word 0, 0, 0, 0 # unused
-
- .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
- .word 0 # base address = 0
- .word 0x9B00 # code read/exec. ??? Why I need 0x9B00 (as opposed to 0x9A00 in order for this to work?)
- .word 0x00CF # granularity = 4096, 386
- # (+5th nibble of limit)
-
- .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
- .word 0 # base address = 0
- .word 0x9200 # data read/write
- .word 0x00CF # granularity = 4096, 386
- # (+5th nibble of limit)
-# this is 64bit descriptor for code
- .word 0xFFFF
- .word 0
- .word 0x9A00 # code read/exec
- .word 0x00AF # as above, but it is long mode and with D=0
+ /* ??? Why I need the accessed bit set in order for this to work? */
+ .quad 0x00cf9b000000ffff # __KERNEL32_CS
+ .quad 0x00af9b000000ffff # __KERNEL_CS
+ .quad 0x00cf93000000ffff # __KERNEL_DS

idt_48a:
.word 0 # idt limit = 0
@@ -255,30 +205,24 @@ idt_48a:
gdt_48a:
.word 0x8000 # gdt limit=2048,
# 256 GDT entries
- .word 0, 0 # gdt base (filled in later)
-
-
+ .long gdta - wakeup_code # gdt base (relocated in later)
+
+
real_save_gdt: .word 0
.quad 0
real_magic: .quad 0
video_mode: .quad 0
video_flags: .quad 0

+.code16
bogus_real_magic:
movb $0xba,%al ; outb %al,$0x80
jmp bogus_real_magic

-bogus_32_magic:
+.code64
+bogus_64_magic:
movb $0xb3,%al ; outb %al,$0x80
- jmp bogus_32_magic
-
-bogus_31_magic:
- movb $0xb1,%al ; outb %al,$0x80
- jmp bogus_31_magic
-
-bogus_cpu:
- movb $0xbc,%al ; outb %al,$0x80
- jmp bogus_cpu
+ jmp bogus_64_magic


/* This code uses an extended set of video mode numbers. These include:
@@ -301,6 +245,7 @@ bogus_cpu:
#define VIDEO_FIRST_V7 0x0900

# Setting of user mode (AX=mode ID) => CF=success
+.code16
mode_seta:
movw %ax, %bx
#if 0
@@ -346,14 +291,59 @@ check_vesaa:

_setbada: jmp setbada

- .code64
-bogus_magic:
- movw $0x0e00 + 'B', %ds:(0xb8018)
- jmp bogus_magic
-
-bogus_magic2:
- movw $0x0e00 + '2', %ds:(0xb8018)
- jmp bogus_magic2
+ .code16
+verify_cpu:
+ pushl $0 # Kill any dangerous flags
+ popfl
+
+ /* minimum CPUID flags for x86-64 */
+ /* see http://www.x86-64.org/lists/discuss/msg02971.html */
+#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
+ (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
+#define REQUIRED_MASK2 (1<<29)
+
+ pushfl # check for cpuid
+ popl %eax
+ movl %eax, %ebx
+ xorl $0x200000,%eax
+ pushl %eax
+ popfl
+ pushfl
+ popl %eax
+ pushl %ebx
+ popfl
+ cmpl %eax, %ebx
+ jz no_longmode
+
+ xorl %eax, %eax # See if cpuid 1 is implemented
+ cpuid
+ cmpl $0x1, %eax
+ jb no_longmode
+
+ movl $0x01, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK1, %edx
+ xorl $REQUIRED_MASK1, %edx
+ jnz no_longmode
+
+ movl $0x80000000, %eax # See if extended cpuid is implemented
+ cpuid
+ cmpl $0x80000001, %eax
+ jb no_longmode
+
+ movl $0x80000001, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK2, %edx
+ xorl $REQUIRED_MASK2, %edx
+ jnz no_longmode
+
+ ret # The cpu supports long mode
+
+no_longmode:
+ movb $0xbc,%al ; outb %al,$0x80
+ jmp no_longmode
+
+ ret


wakeup_stack_begin: # Stack grows down
@@ -361,7 +351,15 @@ wakeup_stack_begin: # Stack grows down
.org 0xff0
wakeup_stack: # Just below end of page

+.org 0x1000
+ENTRY(wakeup_level4_pgt)
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 510,8,0
+ /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+ .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
+
ENTRY(wakeup_end)
+ .code64

##
# acpi_copy_wakeup_routine
@@ -378,23 +376,6 @@ ENTRY(acpi_copy_wakeup_routine)
pushq %rcx
pushq %rdx

- sgdt saved_gdt
- sidt saved_idt
- sldt saved_ldt
- str saved_tss
-
- movq %cr3, %rdx
- movq %rdx, saved_cr3
- movq %cr4, %rdx
- movq %rdx, saved_cr4
- movq %cr0, %rdx
- movq %rdx, saved_cr0
- sgdt real_save_gdt - wakeup_start (,%rdi)
- movl $MSR_EFER, %ecx
- rdmsr
- movl %eax, saved_efer
- movl %edx, saved_efer2
-
movl saved_video_mode, %edx
movl %edx, video_mode - wakeup_start (,%rdi)
movl acpi_video_flags, %edx
@@ -403,18 +384,11 @@ ENTRY(acpi_copy_wakeup_routine)
movq $0x123456789abcdef0, %rdx
movq %rdx, saved_magic

- movl saved_magic - __START_KERNEL_map, %eax
- cmpl $0x9abcdef0, %eax
- jne bogus_32_magic
-
- # make sure %cr4 is set correctly (features, etc)
- movl saved_cr4 - __START_KERNEL_map, %eax
- movq %rax, %cr4
-
- movl saved_cr0 - __START_KERNEL_map, %eax
- movq %rax, %cr0
- jmp 1f # Flush pipelines
-1:
+ movq saved_magic, %rax
+ movq $0x123456789abcdef0, %rdx
+ cmpq %rdx, %rax
+ jne bogus_64_magic
+
# restore the regs we used
popq %rdx
popq %rcx
@@ -450,13 +424,13 @@ do_suspend_lowlevel:
movq %r15, saved_context_r15(%rip)
pushfq ; popq saved_context_eflags(%rip)

- movq $.L97, saved_eip(%rip)
+ movq $.L97, saved_rip(%rip)

- movq %rsp,saved_esp
- movq %rbp,saved_ebp
- movq %rbx,saved_ebx
- movq %rdi,saved_edi
- movq %rsi,saved_esi
+ movq %rsp,saved_rsp
+ movq %rbp,saved_rbp
+ movq %rbx,saved_rbx
+ movq %rdi,saved_rdi
+ movq %rsi,saved_rsi

addq $8, %rsp
movl $3, %edi
@@ -503,25 +477,12 @@ do_suspend_lowlevel:

.data
ALIGN
-ENTRY(saved_ebp) .quad 0
-ENTRY(saved_esi) .quad 0
-ENTRY(saved_edi) .quad 0
-ENTRY(saved_ebx) .quad 0
+ENTRY(saved_rbp) .quad 0
+ENTRY(saved_rsi) .quad 0
+ENTRY(saved_rdi) .quad 0
+ENTRY(saved_rbx) .quad 0

-ENTRY(saved_eip) .quad 0
-ENTRY(saved_esp) .quad 0
+ENTRY(saved_rip) .quad 0
+ENTRY(saved_rsp) .quad 0

ENTRY(saved_magic) .quad 0
-
-ALIGN
-# saved registers
-saved_gdt: .quad 0,0
-saved_idt: .quad 0,0
-saved_ldt: .quad 0
-saved_tss: .quad 0
-
-saved_cr0: .quad 0
-saved_cr3: .quad 0
-saved_cr4: .quad 0
-saved_efer: .quad 0
-saved_efer2: .quad 0
diff -puN arch/x86_64/kernel/head.S~x86_64-64bit-PIC-ACPI-wakeup arch/x86_64/kernel/head.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/head.S~x86_64-64bit-PIC-ACPI-wakeup 2006-11-09 22:59:48.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/head.S 2006-11-09 22:59:48.000000000 -0500
@@ -300,15 +300,6 @@ NEXT_PAGE(level2_kernel_pgt)

.data

-#ifdef CONFIG_ACPI_SLEEP
- .align PAGE_SIZE
-ENTRY(wakeup_level4_pgt)
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 510,8,0
- /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
-#endif
-
#ifndef CONFIG_HOTPLUG_CPU
__INITDATA
#endif
diff -puN include/asm-x86_64/suspend.h~x86_64-64bit-PIC-ACPI-wakeup include/asm-x86_64/suspend.h
--- linux-2.6.19-rc5-reloc/include/asm-x86_64/suspend.h~x86_64-64bit-PIC-ACPI-wakeup 2006-11-09 22:59:48.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/include/asm-x86_64/suspend.h 2006-11-09 22:59:48.000000000 -0500
@@ -45,12 +45,12 @@ extern unsigned long saved_context_eflag
extern void fix_processor_context(void);

#ifdef CONFIG_ACPI_SLEEP
-extern unsigned long saved_eip;
-extern unsigned long saved_esp;
-extern unsigned long saved_ebp;
-extern unsigned long saved_ebx;
-extern unsigned long saved_esi;
-extern unsigned long saved_edi;
+extern unsigned long saved_rip;
+extern unsigned long saved_rsp;
+extern unsigned long saved_rbp;
+extern unsigned long saved_rbx;
+extern unsigned long saved_rsi;
+extern unsigned long saved_rdi;

/* routines for saving/restoring kernel state */
extern int acpi_save_state_mem(void);
_
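
A note on the three GDT quads in the rewritten gdta above: they decode as the usual flat segments (a hedged decode per the x86 descriptor layout; the annotated constants below are illustrative C, not kernel source):

    /* Bits 40-47 hold the access byte; 0x9b differs from 0x9a only in
     * bit 40, the "accessed" bit the comment in wakeup.S asks about. */
    static const unsigned long long kernel32_cs = 0x00cf9b000000ffffULL;
            /* base 0, 4GiB limit; present, code, readable, accessed;
             * G=1 (4K granularity), D=1 (32-bit) */
    static const unsigned long long kernel_cs   = 0x00af9b000000ffffULL;
            /* same access byte, but L=1 and D=0: 64-bit code segment */
    static const unsigned long long kernel_ds   = 0x00cf93000000ffffULL;
            /* present, data, read/write, accessed */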

2006-11-13 16:57:36

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] [PATCH 5/16] x86_64: Fix earlyprintk to use standard ISA mapping



Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/early_printk.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff -puN arch/x86_64/kernel/early_printk.c~x86_64-fix-early_printk-to-use-the-standard-ISA-mapping arch/x86_64/kernel/early_printk.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/early_printk.c~x86_64-fix-early_printk-to-use-the-standard-ISA-mapping 2006-11-09 22:57:41.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/early_printk.c 2006-11-09 22:57:41.000000000 -0500
@@ -11,11 +11,10 @@

#ifdef __i386__
#include <asm/setup.h>
-#define VGABASE (__ISA_IO_base + 0xb8000)
#else
#include <asm/bootsetup.h>
-#define VGABASE ((void __iomem *)0xffffffff800b8000UL)
#endif
+#define VGABASE (__ISA_IO_base + 0xb8000)

static int max_ypos = 25, max_xpos = 80;
static int current_ypos = 25, current_xpos = 0;
_
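
The point of this one-liner: the old x86_64 definition hard-coded 0xffffffff800b8000, which assumes the kernel is always mapped at 0xffffffff80000000; a relocatable kernel no longer guarantees that, while __ISA_IO_base maps the low ISA range independently of where the kernel itself sits. A minimal sketch of how the early console then puts a character on screen (assuming writew() and the text-mode geometry above; the helper name is illustrative):

    /* write one character with a normal attribute into VGA text RAM */
    static void early_vga_putc(int ypos, int xpos, char c)
    {
            writew(0x0700 | (unsigned char)c,
                   VGABASE + 2 * (ypos * max_xpos + xpos));
    }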

2006-11-13 16:56:43

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] [PATCH 16/16] x86_64: Extend bzImage protocol for relocatable bzImage




o Extend the bzImage protocol (same as i386) to allow bzImage loaders to
load the protected mode kernel at a non-1MB address. Now the protected mode
component is relocatable and can be loaded at non-1MB addresses.

o As of today kdump uses it to run a second kernel from a reserved memory
area.

Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/boot/setup.S | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff -puN arch/x86_64/boot/setup.S~x86_64-extend-bzImage-protocol-for-relocatable-bzImage arch/x86_64/boot/setup.S
--- linux-2.6.19-rc5-reloc/arch/x86_64/boot/setup.S~x86_64-extend-bzImage-protocol-for-relocatable-bzImage 2006-11-09 23:07:08.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/boot/setup.S 2006-11-09 23:07:08.000000000 -0500
@@ -80,7 +80,7 @@ start:
# This is the setup header, and it must start at %cs:2 (old 0x9020:2)

.ascii "HdrS" # header signature
- .word 0x0204 # header version number (>= 0x0105)
+ .word 0x0205 # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
realmode_swtch: .word 0, 0 # default_switch, SETUPSEG
start_sys_seg: .word SYSSEG
@@ -155,7 +155,12 @@ cmd_line_ptr: .long 0 # (Header versio
# low memory 0x10000 or higher.

ramdisk_max: .long 0xffffffff
-
+kernel_alignment: .long 0x200000 # physical addr alignment required for
+ # protected mode relocatable kernel
+relocatable_kernel: .byte 1
+pad2: .byte 0
+pad3: .word 0
+
trampoline: call start_of_setup
.align 16
# The offset at this point is 0x240
_
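
For context, this is roughly what a bzImage loader does with the new fields (a hedged sketch; the header field names follow the setup header, while the variables and alignment logic are illustrative):

    /* Boot protocol >= 2.05: the protected mode kernel may be placed
     * at any suitably aligned address instead of the fixed 1MB. */
    if (hdr->version >= 0x0205 && hdr->relocatable_kernel) {
            unsigned long align = hdr->kernel_alignment;  /* 0x200000 here */
            load_addr = (preferred_addr + align - 1) & ~(align - 1);
    } else {
            load_addr = 0x100000;                         /* legacy 1MB */
    }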

2006-11-13 16:57:35

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] [PATCH 11/16] x86_64: Modify discover_ebda to use virtual address



Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/setup.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN arch/x86_64/kernel/setup.c~x86_64-Modify-discover_ebda-to-use-virtual-addresses arch/x86_64/kernel/setup.c
--- linux-2.6.19-rc5-reloc/arch/x86_64/kernel/setup.c~x86_64-Modify-discover_ebda-to-use-virtual-addresses 2006-11-09 23:02:24.000000000 -0500
+++ linux-2.6.19-rc5-reloc-root/arch/x86_64/kernel/setup.c 2006-11-09 23:02:24.000000000 -0500
@@ -327,10 +327,10 @@ static void discover_ebda(void)
* there is a real-mode segmented pointer pointing to the
* 4K EBDA area at 0x40E
*/
- ebda_addr = *(unsigned short *)EBDA_ADDR_POINTER;
+ ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);
ebda_addr <<= 4;

- ebda_size = *(unsigned short *)(unsigned long)ebda_addr;
+ ebda_size = *(unsigned short *)__va(ebda_addr);

/* Round EBDA up to pages */
if (ebda_size == 0)
_
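
The helper doing the work: on x86_64, __va() turns a physical address into its address in the kernel's direct mapping, essentially (per include/asm-x86_64/page.h of this era):

    #define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))

so the EBDA pointer at physical 0x40E is read through the direct map instead of relying on low memory being identity mapped.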

2006-11-13 17:23:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Monday 13 November 2006 17:43, Vivek Goyal wrote:
>
> - Killed lots of dead code
> - Improve the cpu sanity checks to verify long mode
> is enabled when we wake up.
> - Removed the need for modifying any existing kernel page table.
> - Moved wakeup_level4_pgt into the wakeup routine so we can
> run the kernel above 4G.
> - Increased the size of the wakeup routine to 8K.
> - Renamed the variables to use the 64bit register names.
> - Lots of misc cleanups to match trampoline.S
>
> I don't have a configuration I can test this but it compiles cleanly
> and it should work, the code is very similar to the SMP trampoline,
> which I have tested. At least now the comments about still running in
> low memory are actually correct.
>
> Vivek has tested this patch for suspend to memory and it works fine.

Suspend is unfortunately quite fragile.

pavel, rafael can you please test and review this patch?

(full patch is on l-k)

> +verify_cpu:
> + pushl $0 # Kill any dangerous flags
> + popfl
> +
> + /* minimum CPUID flags for x86-64 */
> + /* see http://www.x86-64.org/lists/discuss/msg02971.html */
> +#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
> + (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
> +#define REQUIRED_MASK2 (1<<29)

It would be much better if at least this CPUID code was in a common shared
file with head.S

-Andi
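
For reference, the bits those masks test (a hedged decode from the CPUID leaf 1 and leaf 0x80000001 EDX bit assignments; the annotated form is illustrative, not kernel source):

    #define REQUIRED_MASK1 ((1<<0)  /* FPU  */ | (1<<3)  /* PSE  */ | \
                            (1<<4)  /* TSC  */ | (1<<5)  /* MSR  */ | \
                            (1<<6)  /* PAE  */ | (1<<8)  /* CX8  */ | \
                            (1<<13) /* PGE  */ | (1<<15) /* CMOV */ | \
                            (1<<24) /* FXSR */ | (1<<25) /* SSE  */ | \
                            (1<<26) /* SSE2 */)
    #define REQUIRED_MASK2 (1<<29)  /* LM: long mode */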

2006-11-13 17:23:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 2/16] x86_64: Assembly safe page.h and pgtable.h

On Monday 13 November 2006 17:28, Vivek Goyal wrote:
>
> This patch makes pgtable.h and page.h safe to include
> in assembly files like head.S. Allowing us to use
> symbolic constants instead of hard coded numbers when
> refering to the page tables.

Hmm, I think the ULs are probably not needed anyway. What
happens when you just drop them, even for C? You shouldn't get any
new warnings, I hope.

-Andi

2006-11-13 17:28:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 9/16] x86_64: 64bit PIC SMP trampoline


> +verify_cpu:

Another duplication. Please get rid of that.

-Andi

2006-11-13 18:00:38

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Mon, Nov 13, 2006 at 06:22:43PM +0100, Andi Kleen wrote:
[..]
>
> > +verify_cpu:
> > + pushl $0 # Kill any dangerous flags
> > + popfl
> > +
> > + /* minimum CPUID flags for x86-64 */
> > + /* see http://www.x86-64.org/lists/discuss/msg02971.html */
> > +#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
> > + (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
> > +#define REQUIRED_MASK2 (1<<29)
>
> It would be much better if at least this CPUID code was in a common shared
> file with head.S

Hi Andi,

This code (verify_cpu) is called while we are still in real mode, so it has
to be present in the low 1MB. Now the trampoline has been designed to switch to
64bit mode and then jump to the kernel, hence the kernel can be loaded anywhere,
even beyond 4G. So if we move this code into, say, arch/x86_64/kernel/head.S,
then we can't even call it.

So I think we have got to leave it in the trampoline only.

Thanks
Vivek

2006-11-13 18:13:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup


> This code (verify_cpu) is called while we are still in real mode, so it has
> to be present in the low 1MB. Now the trampoline has been designed to switch to
> 64bit mode and then jump to the kernel, hence the kernel can be loaded anywhere,
> even beyond 4G. So if we move this code into, say, arch/x86_64/kernel/head.S,
> then we can't even call it.

I didn't mean to call it. Just #include it from a common file

-Andi

2006-11-13 19:22:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Andi Kleen <[email protected]> writes:

>> This code (verify_cpu) is called while we are still in real mode, so it has
>> to be present in the low 1MB. Now the trampoline has been designed to switch to
>> 64bit mode and then jump to the kernel, hence the kernel can be loaded anywhere,
>> even beyond 4G. So if we move this code into, say, arch/x86_64/kernel/head.S,
>> then we can't even call it.
>
> I didn't mean to call it. Just #include it from a common file

I believe the duplication winds up happening in setup.S

Eric

2006-11-13 19:28:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] [PATCH 2/16] x86_64: Assembly safe page.h and pgtable.h

Andi Kleen <[email protected]> writes:

> On Monday 13 November 2006 17:28, Vivek Goyal wrote:
>>
>> This patch makes pgtable.h and page.h safe to include
>> in assembly files like head.S. Allowing us to use
>> symbolic constants instead of hard coded numbers when
>> refering to the page tables.
>
> Hmm, I think the ULs are probably not needed anyways. What
> happens when you just drop them even for C? You shouldn't get any
> new warnings i hope.

I don't remember the details anymore, but there were problems when I
tried just removing the suffixes. I think they were
new warnings or possibly promotion problems.

Using _AC is just about as simple as dropping the suffix
and much more maintainable than maintaining multiple definitions.

Eric

2006-11-13 19:35:24

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Mon, Nov 13, 2006 at 12:21:05PM -0700, Eric W. Biederman wrote:
> Andi Kleen <[email protected]> writes:
>
> >> This code (verify_cpu) is called while we are still in real mode, so it has
> >> to be present in the low 1MB. Now the trampoline has been designed to switch to
> >> 64bit mode and then jump to the kernel, hence the kernel can be loaded anywhere,
> >> even beyond 4G. So if we move this code into, say, arch/x86_64/kernel/head.S,
> >> then we can't even call it.
> >
> > I didn't mean to call it. Just #include it from a common file
>
> I believe the duplication winds up happening in setup.S
>

Yes. So the boot cpu code in setup.S is also doing these checks. So one
of the options is that I create a new file, say verify_cpu.S, and this
code can be shared by setup.S, trampoline.S and wakeup.S.

Or, I can simply drop the verify_cpu bit from trampoline.S and wakeup.S.
This looks like a non-essential bit, and in the past we did not perform
these checks in trampoline.S and wakeup.S.

At this point I would prefer to go with the second option of dropping the
extended checks in trampoline.S and wakeup.S to keep things simple.

Does that make sense?

Thanks
Vivek

2006-11-13 19:59:40

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Vivek Goyal <[email protected]> writes:

> On Mon, Nov 13, 2006 at 12:21:05PM -0700, Eric W. Biederman wrote:
>> Andi Kleen <[email protected]> writes:
>>
>> >> This code (verify_cpu) is called while we are still in real mode, so it has
>> >> to be present in the low 1MB. Now the trampoline has been designed to switch to
>> >> 64bit mode and then jump to the kernel, hence the kernel can be loaded anywhere,
>> >> even beyond 4G. So if we move this code into, say, arch/x86_64/kernel/head.S,
>> >> then we can't even call it.
>> >
>> > I didn't mean to call it. Just #include it from a common file
>>
>> I believe the duplication winds up happening in setup.S
>>
>
> Yes. So boot cpu code in setup.S is also doing these checks. So one
> of the options is that I create a new file says verify_cpu.S and this
> code can be shared by setup.S, trampoline.S and wakeup.S.
>
> Or, I can simply drop the verify_cpu bit from trampoline.S and wakeup.S.
> This looks like a non-essential bit and in the past we did not perform
> these checks in trampoline.S and wakeup.S

We do it in head.S instead, although the version in head.S is less
complete.

> At this point of time, I will prefer to go with second option of dropping
> extended checks in trampoline.S and wakeup.S to keep things simple.
>
> Does that make sense?

I think just making an arch/x86_64/kernel/verify_cpu.S that can
be included from setup.S, wakeup.S and trampoline.S will be just
an exercise in code motion. It provides a good sanity check in
case things are hideously wrong.

If we are looking at more than code motion then it makes sense to
reevaluate and probably drop the code.

Deduping this code path makes a lot of sense.

Eric


2006-11-13 21:17:27

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] [PATCH 2/16] x86_64: Assembly safe page.h and pgtable.h

On Mon, Nov 13, 2006 at 06:17:00PM +0100, Andi Kleen wrote:
> On Monday 13 November 2006 17:28, Vivek Goyal wrote:
> >
> > This patch makes pgtable.h and page.h safe to include
> > in assembly files like head.S. Allowing us to use
> > symbolic constants instead of hard coded numbers when
> > refering to the page tables.
>
> Hmm, I think the ULs are probably not needed anyways. What
> happens when you just drop them even for C? You shouldn't get any
> new warnings i hope.
>

I think we need these UL suffixes. Otherwise in some cases overflow
can take place and the compiler emits warnings.

For example, in the following definition I got rid of the UL.

#define PGDIR_SIZE (1 << PGDIR_SHIFT)

Here the constant defaults to int and PGDIR_SHIFT is 39, hence the compiler
emits the following warning wherever PGDIR_SIZE is used.

arch/x86_64/kernel/machine_kexec.c: In function init_level3_page:
arch/x86_64/kernel/machine_kexec.c:47: warning: left shift count >= width of type
arch/x86_64/kernel/machine_kexec.c: In function init_level4_page:
arch/x86_64/kernel/machine_kexec.c:80: warning: left shift count >= width of type
arch/x86_64/kernel/machine_kexec.c:96: warning: left shift count >= width of type
arch/x86_64/kernel/machine_kexec.c:101: warning: left shift count >= width of type
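
A minimal illustration of the failure mode and the fix (the PGDIR_SHIFT value is taken from the discussion above; these are not the kernel's actual definitions):

    #define PGDIR_SHIFT    39
    /* int is 32 bits wide, so shifting by 39 is undefined and warns: */
    #define PGDIR_SIZE_BAD (1 << PGDIR_SHIFT)
    /* the UL suffix gives a 64-bit shift, which is what C needs,
     * but that same suffix is what the assembler cannot parse: */
    #define PGDIR_SIZE     (1UL << PGDIR_SHIFT)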

Thanks
Vivek

2006-11-13 23:02:20

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Hi!

> > - Killed lots of dead code
> > - Improve the cpu sanity checks to verify long mode
> > is enabled when we wake up.
> > - Removed the need for modifying any existing kernel page table.
> > - Moved wakeup_level4_pgt into the wakeup routine so we can
> > run the kernel above 4G.
> > - Increased the size of the wakeup routine to 8K.
> > - Renamed the variables to use the 64bit register names.
> > - Lots of misc cleanups to match trampoline.S
> >
> > I don't have a configuration I can test this but it compiles cleanly
> > and it should work, the code is very similar to the SMP trampoline,
> > which I have tested. At least now the comments about still running in
> > low memory are actually correct.
> >
> > Vivek has tested this patch for suspend to memory and it works fine.
>
> Suspend is unfortunately quite fragile.
>
> pavel, rafael can you please test and review this patch?

> (full patch is on l-k)

Based on the comments, it looks like there'll be a new version of the patch
anyway? Vivek, would you cc me?
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-13 23:10:02

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Tue, Nov 14, 2006 at 12:01:56AM +0100, Pavel Machek wrote:
> Hi!
>
> > > - Killed lots of dead code
> > > - Improve the cpu sanity checks to verify long mode
> > > is enabled when we wake up.
> > > - Removed the need for modifying any existing kernel page table.
> > > - Moved wakeup_level4_pgt into the wakeup routine so we can
> > > run the kernel above 4G.
> > > - Increased the size of the wakeup routine to 8K.
> > > - Renamed the variables to use the 64bit register names.
> > > - Lots of misc cleanups to match trampoline.S
> > >
> > > I don't have a configuration I can test this but it compiles cleanly
> > > and it should work, the code is very similar to the SMP trampoline,
> > > which I have tested. At least now the comments about still running in
> > > low memory are actually correct.
> > >
> > > Vivek has tested this patch for suspend to memory and it works fine.
> >
> > Suspend is unfortunately quite fragile.
> >
> > pavel, rafael can you please test and review this patch?
>
> > (full patch is on l-k)
>
> Based on comments, it looks like there'll be new version of patch,
> anyway? Vivek, would you cc me?
> Pavel

Hi Pavel,

Sure I will.

Thanks
Vivek

2006-11-14 01:51:13

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 2/16] x86_64: Assembly safe page.h and pgtable.h


>
> I think we need these UL suffixes. Otherwise in some cases overflow
> can take place and compiler emits warning.
>
> For ex. in following definition I got rid of UL.
>
> #define PGDIR_SIZE (1 << PGDIR_SHIFT)

Yes, for the shifts it is needed, but not for the unshifted constants,
I think. At least when they're hex the compiler should choose the right
type on its own.

-Andi

2006-11-14 02:42:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] [PATCH 2/16] x86_64: Assembly safe page.h and pgtable.h

Andi Kleen <[email protected]> writes:

>>
>> I think we need these UL suffixes. Otherwise in some cases overflow
>> can take place and compiler emits warning.
>>
>> For ex. in following definition I got rid of UL.
>>
>> #define PGDIR_SIZE (1 << PGDIR_SHIFT)
>
> Yes for the shifts it is needed, but not for the unshifted constants.
> I think. At least when they're hex the compiler should chose the right
> type on its own.

Only if the high bit is set. But it should choose a big enough type.
However, there is no reason to play games and possibly outsmart ourselves.
That is the point of the _AC() macro. It adds the suffix only for
C code, and drops it for assembly.

Eric
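
For reference, the _AC() macro Eric refers to is essentially this (a from-memory sketch of the const helper, not a verbatim copy):

    #ifdef __ASSEMBLY__
    #define _AC(X,Y)  X            /* assembly: drop the type suffix */
    #else
    #define __AC(X,Y) (X##Y)
    #define _AC(X,Y)  __AC(X,Y)    /* C: paste the suffix onto X */
    #endif

    /* so a single definition serves both C and gas: */
    #define PGDIR_SIZE (_AC(1,UL) << PGDIR_SHIFT)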

2006-11-14 21:15:47

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Hi!

> - Killed lots of dead code
> - Improve the cpu sanity checks to verify long mode
> is enabled when we wake up.
> - Removed the need for modifying any existing kernel page table.
> - Moved wakeup_level4_pgt into the wakeup routine so we can
> run the kernel above 4G.
> - Increased the size of the wakeup routine to 8K.
> - Renamed the variables to use the 64bit register names.
> - Lots of misc cleanups to match trampoline.S
>
> I don't have a configuration I can test this but it compiles cleanly

Ugh, now that's a big patch.. and untested, too :-(.

Why is PGE no longer required, for example?

Can we get it piece-by-piece?

> Vivek has tested this patch for suspend to memory and it works fine.

Ok, so it was tested on one config. Given that the patch deals with
detecting CPU oddities... :-(

Pavel
--
Thanks for all the (sleeping) penguins.

2006-11-14 21:38:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Pavel Machek <[email protected]> writes:

> Hi!
>
>> - Killed lots of dead code
>> - Improve the cpu sanity checks to verify long mode
>> is enabled when we wake up.
>> - Removed the need for modifying any existing kernel page table.
>> - Moved wakeup_level4_pgt into the wakeup routine so we can
>> run the kernel above 4G.
>> - Increased the size of the wakeup routine to 8K.
>> - Renamed the variables to use the 64bit register names.
>> - Lots of misc cleanups to match trampoline.S
>>
>> I don't have a configuration I can test this but it compiles cleanly
>
> Ugh, now that's a big patch.. and untested, too :-(.

It was very carefully code reviewed at least the first time,
and the code was put in sync with code that was tested.

But things happen, so the lack of testing was noted.

> Why is PGE no longer required, for example?

PGE is never required. Especially on a temporary page table.
PGE is an optimization, to make context switches faster.

> Can we get it piece-by-piece?


>> Vivek has tested this patch for suspend to memory and it works fine.
>
> Ok, so it was tested on one config. Given that the patch deals with
> detecting CPU oddities... :-(

Read the code. Given your scorn and the state of that mess when I
started I'm not certain a productive conversation can be had.

Do you understand the code as it is currently written?

Eric

2006-11-14 21:40:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Eric W. Biederman wrote:
>
>> Why is PGE no longer required, for example?
>
> PGE is never required. Especially on a temporary page table.
> PGE is an optimization, to make context switches faster.
>

You cannot, however, set the Global bit unless PGE is supported by the
CPU (you'll trap.)

-hpa

2006-11-14 23:18:22

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Tue, Nov 14, 2006 at 04:30:03PM +0000, Pavel Machek wrote:
> Hi!
>
> > - Killed lots of dead code
> > - Improve the cpu sanity checks to verify long mode
> > is enabled when we wake up.
> > - Removed the need for modifying any existing kernel page table.
> > - Moved wakeup_level4_pgt into the wakeup routine so we can
> > run the kernel above 4G.
> > - Increased the size of the wakeup routine to 8K.
> > - Renamed the variables to use the 64bit register names.
> > - Lots of misc cleanups to match trampoline.S
> >
> > I don't have a configuration I can test this but it compiles cleanly
>
> Ugh, now that's a big patch.. and untested, too :-(.
>
> Why is PGE no longer required, for example?
>
> Can we get it piece-by-piece?
>
> > Vivek has tested this patch for suspend to memory and it works fine.
>
> Ok, so it was tested on one config. Given that the patch deals with
> detecting CPU oddities... :-(

Hi Pavel,

This code has been lying in RHEL kernels for close to 3 months now, and I
have not heard of suspend/resume complaints. So I am hoping it got
tested on a wide variety of hardware apart from the testing on my machine.

Thanks
Vivek

2006-11-14 23:39:52

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Hi!

> > > Vivek has tested this patch for suspend to memory and it works fine.
> >
> > Ok, so it was tested on one config. Given that the patch deals with
> > detecting CPU oddities... :-(
>
> This code has been lying in RHEL kernels for close to 3 months now.
> Have not heard of suspend/resume complaints. So hoping it got
> tested on wide variety of hardware too apart from testing on my machine.

Well, unless you have some way to restore video in RHEL, I'd not
expect many users of suspend to RAM... On systems without an extensive
whitelist, s2ram is fairly hard to test.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-14 23:43:52

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Hi!

> >> I don't have a configuration I can test this but it compiles cleanly
> >
> > Ugh, now that's a big patch.. and untested, too :-(.
>
> It was very carefully code reviewed at least the first time,
> and the code was put in sync with code that was tested.

So we had two very different versions of "switch to 64-bit" and now we
have two mostly similar versions. Not a big improvement...

> > Why is PGE no longer required, for example?
>
> PGE is never required. Especially on a temporary page table.
> PGE is an optimization, to make context switches faster.

HPA tells me it is.

> > Can we get it piece-by-piece?

Please?

> >> Vivek has tested this patch for suspend to memory and it works fine.
> >
> > Ok, so it was tested on one config. Given that the patch deals with
> > detecting CPU oddities... :-(
>
> Read the code. Given your scorn and the state of that mess when I
> started I'm not certain a productive conversation can be had.
>
> Do you understand the code as it is currently written?

Mostly. I've written it at some point.

It may be a mess, but the patch below is a wholesale rewrite, mixing
cleanups (ebx->rbx) with serious changes (PGE). And then you tell me
it was tested on one machine. It is hard/impossible to review, and the
changelog is not helpful, either.

Please split it up.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-15 03:49:42

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Wednesday 15 November 2006 00:43, Pavel Machek wrote:
> Hi!
>
> > >> I don't have a configuration I can test this but it compiles cleanly
> > >
> > > Ugh, now that's a big patch.. and untested, too :-(.
> >
> > It was very carefully code reviewed at least the first time,
> > and the code was put in sync with code that was tested.
>
> So we had two very different versions of "switch to 64-bit" and now we
> have two mostly similar versions. Not a big improvement...

Hmm? That's an improvement in my book. Of course I would prefer
truly shared code, but even more similar code is better.

> > > Why is PGE no longer required, for example?
> >
> > PGE is never required. Especially on a temporary page table.
> > PGE is an optimization, to make context switches faster.
>
> HPA tells me it is.

He was just nitpicking. Eric is right, it is just an optimization
(modulo hardware/software bugs).

-Andi

2006-11-15 17:27:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup


Vivek, it looks like the patch had whitespace damage sometime in its
history. Do you know what happened to it to make the indentation
inconsistent? I'm pretty certain the version I generated didn't have
that problem.

Pavel Machek <[email protected]> writes:

> Hi!
>
>> >> I don't have a configuration I can test this but it compiles cleanly
>> >
>> > Ugh, now that's a big patch.. and untested, too :-(.
>>
>> It was very carefully code reviewed at least the first time,
>> and the code was put in sync with code that was tested.
>
> So we had two very different versions of "switch to 64-bit" and now we
> have two mostly similar versions. Not a big improvement...

Well, we had two versions. It looks like the acpi wakeup was a cut
and paste of the primary kernel entry path, and it diverged and
was less well maintained because it is less considered and gets used
less often.

>> > Can we get it piece-by-piece?
>
> Please?

I honestly think it would make the change less reviewable.

>> >> Vivek has tested this patch for suspend to memory and it works fine.
>> >
>> > Ok, so it was tested on one config. Given that the patch deals with
>> > detecting CPU oddities... :-(
>>
>> Read the code. Given your scorn and the state of that mess when I
>> started I'm not certain a productive conversation can be had.
>>
>> Do you understand the code as it is currently written?
>
> Mostly. I've written it at some point.
>
> It may be a mess, but patch below is wholesale rewrite, mixing
> cleanups (ebx->rbx) with serious changes (PGE). And then you tell me
> it was tested on one machine. It is hard/impossible to rewrite, and
> changelog is not helpful, either.

So there is one major change in the patch: modifying the trampoline
to take the code all of the way to 64bit mode, allowing us to switch
to a kernel that is loaded above 4GB. Essentially everything else is
required by that change.

The basic point is that I am completely changing the idiom for how
cpus enter the kernel. If you focus on the details of the change and
miss the primary change itself, you will simply not understand what
is happening. The PGE change just happens to be part of the change
in requirements for entering the kernel.

The code paths that are changed are not conditional code; they will
always execute. So a single test tells us a lot.

I just looked at the patch again, and it really is not that large.
At this point I'm afraid splitting it up is more likely to cause
review problems than the other way around. I don't think any of the
cleanups touch a code path I don't need to touch.

As for the rewrite being hard/impossible: it was a bit tedious because
of all of the picky details, but I didn't find it hard. I suspect this
is just a skill set difference, or the fact that I wasn't messing
with the actual code that interfaces with acpi.

Eric

2006-11-15 18:30:39

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Wed, Nov 15, 2006 at 10:26:12AM -0700, Eric W. Biederman wrote:
>
> Vivek. It looks like the patch had whitespace damage sometime in it's
> history. Do you know what happened to the to make the indentation
> inconsistent? I'm pretty certain the version I generated didn't have
> that problem.
>

Hi Eric,

Sorry, I did not get that. This patch has been posted twice on the mailing
list: first by you, and now a second time by me. Are you saying that the
original post did not have whitespace damage and the second post does? (Maybe
I goofed up while forward porting, though I can't spot the whitespace damage.)

Thanks
Vivek

2006-11-15 18:39:20

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Hi!

> >> > Can we get it piece-by-piece?
> >
> > Please?
>
> I honestly think it would make the change less reviewable.

It included mass renames, "unused" code removal, and miscellaneous
cleanups. At least separate the cleanups from the "real" changes.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-15 21:09:20

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH] x86_64: Move cpu long mode verification code to common file (was Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup)

On Mon, Nov 13, 2006 at 06:22:43PM +0100, Andi Kleen wrote:
>
> > +verify_cpu:
> > + pushl $0 # Kill any dangerous flags
> > + popfl
> > +
> > + /* minimum CPUID flags for x86-64 */
> > + /* see http://www.x86-64.org/lists/discuss/msg02971.html */
> > +#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
> > + (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
> > +#define REQUIRED_MASK2 (1<<29)
>
> It would be much better if this least this CPUID code was in a common shared
> file with head.S
>

Hi Andi,

Please find attached the patch which moves the verify_cpu code to a
single file, arch/x86_64/kernel/verify_cpu.S, and this file is included
by all the callers to do the cpu long mode and SSE checks.

Thanks
Vivek



o This patch moves the code to verify long mode and SSE to a common file.
This code is now shared by trampoline.S, wakeup.S, boot/setup.S and
boot/compressed/head.S

o So far we used to do a very limited check in trampoline.S, wakeup.S and
in the 32bit entry point. Now all the entry paths are forced to do the
exhaustive check, including SSE, because verify_cpu is shared.

o I am keeping this patch as the last in the x86 relocatable series because
the previous patches have got quite some amount of testing done and I don't
want to disturb that. That way, if there is a problem introduced by this
patch, at least it can be easily isolated.

Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/boot/compressed/head.S | 19 ++++++
arch/x86_64/boot/setup.S | 65 ++---------------------
arch/x86_64/kernel/acpi/wakeup.S | 51 +-----------------
arch/x86_64/kernel/trampoline.S | 51 +-----------------
arch/x86_64/kernel/verify_cpu.S | 103 +++++++++++++++++++++++++++++++++++++
5 files changed, 134 insertions(+), 155 deletions(-)

diff -puN arch/x86_64/boot/compressed/head.S~x86_64-move-cpu-verfication-code-to-common-file arch/x86_64/boot/compressed/head.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/boot/compressed/head.S~x86_64-move-cpu-verfication-code-to-common-file 2006-11-14 23:11:44.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/boot/compressed/head.S 2006-11-14 23:11:44.000000000 -0500
@@ -54,6 +54,15 @@ startup_32:
1: popl %ebp
subl $1b, %ebp

+/* setup a stack and make sure cpu supports long mode. */
+ movl $user_stack_end, %eax
+ addl %ebp, %eax
+ movl %eax, %esp
+
+ call verify_cpu
+ testl %eax, %eax
+ jnz no_longmode
+
/* Compute the delta between where we were compiled to run at
* and where the code will actually run at.
*/
@@ -150,13 +159,21 @@ startup_32:
/* Jump from 32bit compatibility mode into 64bit mode. */
lret

+no_longmode:
+ /* This isn't an x86-64 CPU so hang */
+1:
+ hlt
+ jmp 1b
+
+#include "../../kernel/verify_cpu.S"
+
/* Be careful here startup_64 needs to be at a predictable
* address so I can export it in an ELF header. Bootloaders
* should look at the ELF header to find this address, as
* it may change in the future.
*/
.code64
- .org 0x100
+ .org 0x200
ENTRY(startup_64)
/* We come here either from startup_32 or directly from a
* 64bit bootloader. If we come here from a bootloader we depend on
diff -puN arch/x86_64/boot/setup.S~x86_64-move-cpu-verfication-code-to-common-file arch/x86_64/boot/setup.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/boot/setup.S~x86_64-move-cpu-verfication-code-to-common-file 2006-11-14 23:11:44.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/boot/setup.S 2006-11-14 23:11:44.000000000 -0500
@@ -295,64 +295,10 @@ loader_ok:
movw %cs,%ax
movw %ax,%ds

- /* minimum CPUID flags for x86-64 */
- /* see http://www.x86-64.org/lists/discuss/msg02971.html */
-#define SSE_MASK ((1<<25)|(1<<26))
-#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
- (1<<13)|(1<<15)|(1<<24))
-#define REQUIRED_MASK2 (1<<29)
-
- pushfl /* standard way to check for cpuid */
- popl %eax
- movl %eax,%ebx
- xorl $0x200000,%eax
- pushl %eax
- popfl
- pushfl
- popl %eax
- cmpl %eax,%ebx
- jz no_longmode /* cpu has no cpuid */
- movl $0x0,%eax
- cpuid
- cmpl $0x1,%eax
- jb no_longmode /* no cpuid 1 */
- xor %di,%di
- cmpl $0x68747541,%ebx /* AuthenticAMD */
- jnz noamd
- cmpl $0x69746e65,%edx
- jnz noamd
- cmpl $0x444d4163,%ecx
- jnz noamd
- mov $1,%di /* cpu is from AMD */
-noamd:
- movl $0x1,%eax
- cpuid
- andl $REQUIRED_MASK1,%edx
- xorl $REQUIRED_MASK1,%edx
- jnz no_longmode
- movl $0x80000000,%eax
- cpuid
- cmpl $0x80000001,%eax
- jb no_longmode /* no extended cpuid */
- movl $0x80000001,%eax
- cpuid
- andl $REQUIRED_MASK2,%edx
- xorl $REQUIRED_MASK2,%edx
- jnz no_longmode
-sse_test:
- movl $1,%eax
- cpuid
- andl $SSE_MASK,%edx
- cmpl $SSE_MASK,%edx
- je sse_ok
- test %di,%di
- jz no_longmode /* only try to force SSE on AMD */
- movl $0xc0010015,%ecx /* HWCR */
- rdmsr
- btr $15,%eax /* enable SSE */
- wrmsr
- xor %di,%di /* don't loop */
- jmp sse_test /* try again */
+ call verify_cpu
+ testl %eax,%eax
+ jz sse_ok
+
no_longmode:
call beep
lea long_mode_panic,%si
@@ -362,7 +308,8 @@ no_longmode_loop:
long_mode_panic:
.string "Your CPU does not support long mode. Use a 32bit distribution."
.byte 0
-
+
+#include "../kernel/verify_cpu.S"
sse_ok:
popw %ds

diff -puN arch/x86_64/kernel/acpi/wakeup.S~x86_64-move-cpu-verfication-code-to-common-file arch/x86_64/kernel/acpi/wakeup.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/kernel/acpi/wakeup.S~x86_64-move-cpu-verfication-code-to-common-file 2006-11-14 23:11:44.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/kernel/acpi/wakeup.S 2006-11-14 23:11:44.000000000 -0500
@@ -43,6 +43,8 @@ wakeup_code:
jne bogus_real_magic

call verify_cpu # Verify the cpu supports long mode
+ testl %eax, %eax
+ jnz no_longmode

testl $1, video_flags - wakeup_code
jz 1f
@@ -292,57 +294,12 @@ check_vesaa:
_setbada: jmp setbada

.code16
-verify_cpu:
- pushl $0 # Kill any dangerous flags
- popfl
-
- /* minimum CPUID flags for x86-64 */
- /* see http://www.x86-64.org/lists/discuss/msg02971.html */
-#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
- (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
-#define REQUIRED_MASK2 (1<<29)
-
- pushfl # check for cpuid
- popl %eax
- movl %eax, %ebx
- xorl $0x200000,%eax
- pushl %eax
- popfl
- pushfl
- popl %eax
- pushl %ebx
- popfl
- cmpl %eax, %ebx
- jz no_longmode
-
- xorl %eax, %eax # See if cpuid 1 is implemented
- cpuid
- cmpl $0x1, %eax
- jb no_longmode
-
- movl $0x01, %eax # Does the cpu have what it takes?
- cpuid
- andl $REQUIRED_MASK1, %edx
- xorl $REQUIRED_MASK1, %edx
- jnz no_longmode
-
- movl $0x80000000, %eax # See if extended cpuid is implemented
- cpuid
- cmpl $0x80000001, %eax
- jb no_longmode
-
- movl $0x80000001, %eax # Does the cpu have what it takes?
- cpuid
- andl $REQUIRED_MASK2, %edx
- xorl $REQUIRED_MASK2, %edx
- jnz no_longmode
-
- ret # The cpu supports long mode
-
no_longmode:
movb $0xbc,%al ; outb %al,$0x80
jmp no_longmode

+#include "../verify_cpu.S"
+
ret


diff -puN arch/x86_64/kernel/trampoline.S~x86_64-move-cpu-verfication-code-to-common-file arch/x86_64/kernel/trampoline.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/kernel/trampoline.S~x86_64-move-cpu-verfication-code-to-common-file 2006-11-14 23:11:44.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/kernel/trampoline.S 2006-11-14 23:11:44.000000000 -0500
@@ -54,6 +54,8 @@ r_base = .
movw $(trampoline_stack_end - r_base), %sp

call verify_cpu # Verify the cpu supports long mode
+ testl %eax, %eax # Check for return code
+ jnz no_longmode

mov %cs, %ax
movzx %ax, %esi # Find the 32bit trampoline location
@@ -121,57 +123,10 @@ startup_64:
jmp *%rax

.code16
-verify_cpu:
- pushl $0 # Kill any dangerous flags
- popfl
-
- /* minimum CPUID flags for x86-64 */
- /* see http://www.x86-64.org/lists/discuss/msg02971.html */
-#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
- (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
-#define REQUIRED_MASK2 (1<<29)
-
- pushfl # check for cpuid
- popl %eax
- movl %eax, %ebx
- xorl $0x200000,%eax
- pushl %eax
- popfl
- pushfl
- popl %eax
- pushl %ebx
- popfl
- cmpl %eax, %ebx
- jz no_longmode
-
- xorl %eax, %eax # See if cpuid 1 is implemented
- cpuid
- cmpl $0x1, %eax
- jb no_longmode
-
- movl $0x01, %eax # Does the cpu have what it takes?
- cpuid
- andl $REQUIRED_MASK1, %edx
- xorl $REQUIRED_MASK1, %edx
- jnz no_longmode
-
- movl $0x80000000, %eax # See if extended cpuid is implemented
- cpuid
- cmpl $0x80000001, %eax
- jb no_longmode
-
- movl $0x80000001, %eax # Does the cpu have what it takes?
- cpuid
- andl $REQUIRED_MASK2, %edx
- xorl $REQUIRED_MASK2, %edx
- jnz no_longmode
-
- ret # The cpu supports long mode
-
no_longmode:
hlt
jmp no_longmode
-
+#include "verify_cpu.S"

# Careful these need to be in the same 64K segment as the above;
tidt:
diff -puN /dev/null arch/x86_64/kernel/verify_cpu.S
--- /dev/null 2006-11-14 23:08:29.168044802 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/kernel/verify_cpu.S 2006-11-14 23:11:44.000000000 -0500
@@ -0,0 +1,103 @@
+/*
+ *
+ * verify_cpu.S
+ *
+ * 14 Nov 2006 Vivek Goyal: Created the file
+ *
+ * This is common code for verifying whether the CPU supports
+ * long mode and SSE or not. It is not called directly; instead this
+ * file is included at various places and compiled in that context.
+ * Following are the current usages.
+ *
+ * This file is included by both 16bit and 32bit code.
+ *
+ * arch/x86_64/boot/setup.S : Boot cpu verification (16bit)
+ * arch/x86_64/boot/compressed/head.S: Boot cpu verification (32bit)
+ * arch/x86_64/kernel/trampoline.S: secondary processor verification (16bit)
+ * arch/x86_64/kernel/acpi/wakeup.S: verification at resume (16bit)
+ *
+ * verify_cpu returns the status of the cpu check in register %eax.
+ * 0: Success 1: Failure
+ *
+ * The caller needs to check for the error code and take the action
+ * appropriately. Either display a message or halt.
+ */
+
+verify_cpu:
+
+ pushfl # Save caller passed flags
+ pushl $0 # Kill any dangerous flags
+ popfl
+
+ /* minimum CPUID flags for x86-64 */
+ /* see http://www.x86-64.org/lists/discuss/msg02971.html */
+#define SSE_MASK ((1<<25)|(1<<26))
+#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
+ (1<<13)|(1<<15)|(1<<24))
+#define REQUIRED_MASK2 (1<<29)
+ pushfl # standard way to check for cpuid
+ popl %eax
+ movl %eax,%ebx
+ xorl $0x200000,%eax
+ pushl %eax
+ popfl
+ pushfl
+ popl %eax
+ cmpl %eax,%ebx
+ jz verify_cpu_no_longmode # cpu has no cpuid
+
+ movl $0x0,%eax # See if cpuid 1 is implemented
+ cpuid
+ cmpl $0x1,%eax
+ jb verify_cpu_no_longmode # no cpuid 1
+
+ xor %di,%di
+ cmpl $0x68747541,%ebx # AuthenticAMD
+ jnz verify_cpu_noamd
+ cmpl $0x69746e65,%edx
+ jnz verify_cpu_noamd
+ cmpl $0x444d4163,%ecx
+ jnz verify_cpu_noamd
+ mov $1,%di # cpu is from AMD
+
+verify_cpu_noamd:
+ movl $0x1,%eax # Does the cpu have what it takes
+ cpuid
+ andl $REQUIRED_MASK1,%edx
+ xorl $REQUIRED_MASK1,%edx
+ jnz verify_cpu_no_longmode
+
+ movl $0x80000000,%eax # See if extended cpuid is implemented
+ cpuid
+ cmpl $0x80000001,%eax
+ jb verify_cpu_no_longmode # no extended cpuid
+
+ movl $0x80000001,%eax # Does the cpu have what it takes
+ cpuid
+ andl $REQUIRED_MASK2,%edx
+ xorl $REQUIRED_MASK2,%edx
+ jnz verify_cpu_no_longmode
+
+verify_cpu_sse_test:
+ movl $1,%eax
+ cpuid
+ andl $SSE_MASK,%edx
+ cmpl $SSE_MASK,%edx
+ je verify_cpu_sse_ok
+ test %di,%di
+ jz verify_cpu_no_longmode # only try to force SSE on AMD
+ movl $0xc0010015,%ecx # HWCR
+ rdmsr
+ btr $15,%eax # enable SSE
+ wrmsr
+ xor %di,%di # don't loop
+ jmp verify_cpu_sse_test # try again
+
+verify_cpu_no_longmode:
+ popfl # Restore caller passed flags
+ movl $1,%eax
+ ret
+verify_cpu_sse_ok:
+ popfl # Restore caller passed flags
+ xorl %eax, %eax
+ ret
_
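
As an aside, the EFLAGS dance at the top of verify_cpu (saving the flags, then toggling bit 21) is the standard probe for CPUID presence; in C it looks roughly like this (an illustrative gcc inline-asm sketch, not kernel code):

    /* EFLAGS.ID (bit 21) can only be toggled if CPUID exists. */
    static int has_cpuid(void)
    {
            unsigned long f0, f1;

            asm volatile("pushf; pop %0" : "=r" (f0));
            asm volatile("push %0; popf; pushf; pop %1"
                         : "=r" (f1) : "r" (f0 ^ 0x200000) : "cc");
            return !!((f0 ^ f1) & 0x200000);
    }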

2006-11-15 21:25:27

by Vivek Goyal

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Mon, Nov 13, 2006 at 11:43:14AM -0500, Vivek Goyal wrote:
>
>
> - Killed lots of dead code
> - Improve the cpu sanity checks to verify long mode
> is enabled when we wake up.
> - Removed the need for modifying any existing kernel page table.
> - Moved wakeup_level4_pgt into the wakeup routine so we can
> run the kernel above 4G.
> - Increased the size of the wakeup routine to 8K.
> - Renamed the variables to use the 64bit register names.
> - Lots of misc cleanups to match trampoline.S
>
> I don't have a configuration I can test this but it compiles cleanly
> and it should work, the code is very similar to the SMP trampoline,
> which I have tested. At least now the comments about still running in
> low memory are actually correct.
>
> Vivek has tested this patch for suspend to memory and it works fine.
>

Another update: I got hold of another machine, and suspend/resume seems to be
facing problems.

With 2.6.19-rc5-git2
--------------------
- echo 3 > /proc/acpi/sleep (Suspend to memory takes place)
- Press power button (System tries to come back but fails in MPT adapter
initialization)

With 2.6.19-rc5-git2 + Reloc patches
------------------------------------
- echo 3 > /proc/acpi/sleep (Suspend to memory takes place)
- Press power button (Fan powers on but nothing additional is displayed on
serial console.)

Will do a bisect and try to isolate the problem.

Pavel, I hope my testing procedure is right?

Thanks
Vivek

2006-11-15 22:54:48

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] x86_64: Move cpu long mode verification code to common file (was Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup)

Hi!

> > It would be much better if at least this CPUID code was in a common shared
> > file with head.S

>
> Please find attached the patch which moves the verify_cpu code to a
> single file, arch/x86_64/kernel/verify_cpu.S, and this file is included
> by all the callers to do the cpu long mode and SSE checks.

Looks ok to me on quick look...

> @@ -0,0 +1,103 @@
> +/*
> + *
> + * verify_cpu.S
> + *
> + * 14 Nov 2006 Vivek Goyal: Created the file

Could we get copyright/GPL here, instead?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-15 23:03:23

by Pavel Machek

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Hi!

> > I don't have a configuration I can test this but it compiles cleanly
> > and it should work, the code is very similar to the SMP trampoline,
> > which I have tested. At least now the comments about still running in
> > low memory are actually correct.
> >
> > Vivek has tested this patch for suspend to memory and it works fine.
> >
>
> More update. Got hold of another machine and suspend/resume seems to be
> facing problems.
>
> With 2.6.19-rc5-git2
> --------------------
> - echo 3 > /proc/acpi/sleep (Suspend to memory takes place)
> - Press power button (System tries to come back but fails in MPT adapter
> initialization)
>
> With 2.6.19-rc5-git2 + Reloc patches
> ------------------------------------
> - echo 3 > /proc/acpi/sleep (Suspend to memory takes place)
> - Press power button (Fan powers on but nothing additional is displayed on
> serial console.)
>
> Will do a bisect and try to isolate the problem.
>
> Pavel, I hope my testing procedure is right?

Yep, basically.

If you were an end user, I'd tell you to use s2ram from suspend.sf.net to
get video. But you have a serial console, so you don't need VGA ;-).

I do not know what MPT is; SCSI adapter? Yep, I'd expect that to have
problems.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-16 00:30:09

by Vivek Goyal

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Wed, Nov 15, 2006 at 04:24:11PM -0500, Vivek Goyal wrote:
> On Mon, Nov 13, 2006 at 11:43:14AM -0500, Vivek Goyal wrote:
> >
> >
> > - Killed lots of dead code
> > - Improve the cpu sanity checks to verify long mode
> > is enabled when we wake up.
> > - Removed the need for modifying any existing kernel page table.
> > - Moved wakeup_level4_pgt into the wakeup routine so we can
> > run the kernel above 4G.
> > - Increased the size of the wakeup routine to 8K.
> > - Renamed the variables to use the 64bit register names.
> > - Lots of misc cleanups to match trampoline.S
> >
> > I don't have a configuration I can test this but it compiles cleanly
> > and it should work, the code is very similar to the SMP trampoline,
> > which I have tested. At least now the comments about still running in
> > low memory are actually correct.
> >
> > Vivek has tested this patch for suspend to memory and it works fine.
> >
>
> More update. Got hold of another machine and suspend/resume seems to be
> facing problems.
>
> With 2.6.19-rc5-git2
> --------------------
> - echo 3 > /proc/acpi/sleep (Suspend to memory takes place)
> - Press power button (System tries to come back but fails in MPT adapter
> initialization)
>
> With 2.6.19-rc5-git2 + Reloc patches
> ------------------------------------
> - echo 3 > /proc/acpi/sleep (Suspend to memory takes place)
> - Press power button (Fan powers on but nothing additional is displayed on
> serial console.)
>
> Will do a bisect and try to isolate the problem.
>

Ok. In the new code the NX bit protection feature is not being enabled, and
that seems to be causing the problem. I checked, and enabling the NX bit
feature in EFER in wakeup.S makes it start working.

I think my new machine supports the NX bit protection feature, and if I don't
enable that feature again while resuming, it probably causes a GPF when
loading page tables which have the NX bit set. (A guess.)

I know that the previous machine I was testing on does not support the NX bit
feature, and that could be the reason it did not run into the problem.
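
A sketch of the fix being described (hedged: the MSR number and bit are per the AMD64 manual, but rdmsr()/wrmsr() here are illustrative helpers rather than the kernel's macros; the real change is MSR code in wakeup.S):

    #define MSR_EFER 0xc0000080
    #define EFER_NX  (1ULL << 11)    /* no-execute enable */

    /* on resume, re-enable NX before page tables carrying NX bits
     * are loaded */
    static void restore_efer_nx(void)
    {
            wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_NX);
    }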

Thanks
Vivek

2006-11-16 20:07:22

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH] x86_64: Move cpu long mode verification code to common file (was Re: [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup)

On Wed, Nov 15, 2006 at 11:54:29PM +0100, Pavel Machek wrote:
> Hi!
>
> > > It would be much better if at least this CPUID code was in a common shared
> > > file with head.S
>
> >
> > Please find attached the patch which moves the verify_cpu code to a
> > single file, arch/x86_64/kernel/verify_cpu.S, and this file is included
> > by all the callers to do the cpu long mode and SSE checks.
>
> Looks ok to me on quick look...
>
> > @@ -0,0 +1,103 @@
> > +/*
> > + *
> > + * verify_cpu.S
> > + *
> > + * 14 Nov 2006 Vivek Goyal: Created the file
>
> Could we get copyright/GPL here, instead?

Pavel, I made the modifications to reflect the copyright/GPL info. Please find
attached the regenerated patch.



o This patch moves the code to verify long mode and SSE to a common file.
This code is now shared by trampoline.S, wakeup.S, boot/setup.S and
boot/compressed/head.S

o So far we used to do a very limited check in trampoline.S, wakeup.S and
in the 32bit entry point. Now all the entry paths are forced to do the
exhaustive check, including SSE, because verify_cpu is shared.

o I am keeping this patch as the last in the x86 relocatable series because
the previous patches have got quite some amount of testing done and I don't
want to disturb that. That way, if there is a problem introduced by this
patch, at least it can be easily isolated.

Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/boot/compressed/head.S | 19 ++++++
arch/x86_64/boot/setup.S | 65 ++--------------------
arch/x86_64/kernel/acpi/wakeup.S | 51 +----------------
arch/x86_64/kernel/trampoline.S | 51 +----------------
arch/x86_64/kernel/verify_cpu.S | 106 +++++++++++++++++++++++++++++++++++++
5 files changed, 137 insertions(+), 155 deletions(-)

diff -puN arch/x86_64/boot/compressed/head.S~x86_64-move-cpu-verfication-code-to-common-file arch/x86_64/boot/compressed/head.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/boot/compressed/head.S~x86_64-move-cpu-verfication-code-to-common-file 2006-11-15 21:43:44.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/boot/compressed/head.S 2006-11-15 21:43:44.000000000 -0500
@@ -54,6 +54,15 @@ startup_32:
1: popl %ebp
subl $1b, %ebp

+/* setup a stack and make sure cpu supports long mode. */
+ movl $user_stack_end, %eax
+ addl %ebp, %eax
+ movl %eax, %esp
+
+ call verify_cpu
+ testl %eax, %eax
+ jnz no_longmode
+
/* Compute the delta between where we were compiled to run at
* and where the code will actually run at.
*/
@@ -150,13 +159,21 @@ startup_32:
/* Jump from 32bit compatibility mode into 64bit mode. */
lret

+no_longmode:
+ /* This isn't an x86-64 CPU so hang */
+1:
+ hlt
+ jmp 1b
+
+#include "../../kernel/verify_cpu.S"
+
/* Be careful here startup_64 needs to be at a predictable
* address so I can export it in an ELF header. Bootloaders
* should look at the ELF header to find this address, as
* it may change in the future.
*/
.code64
- .org 0x100
+ .org 0x200
ENTRY(startup_64)
/* We come here either from startup_32 or directly from a
* 64bit bootloader. If we come here from a bootloader we depend on
diff -puN arch/x86_64/boot/setup.S~x86_64-move-cpu-verfication-code-to-common-file arch/x86_64/boot/setup.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/boot/setup.S~x86_64-move-cpu-verfication-code-to-common-file 2006-11-15 21:43:44.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/boot/setup.S 2006-11-15 21:43:44.000000000 -0500
@@ -295,64 +295,10 @@ loader_ok:
movw %cs,%ax
movw %ax,%ds

- /* minimum CPUID flags for x86-64 */
- /* see http://www.x86-64.org/lists/discuss/msg02971.html */
-#define SSE_MASK ((1<<25)|(1<<26))
-#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
- (1<<13)|(1<<15)|(1<<24))
-#define REQUIRED_MASK2 (1<<29)
-
- pushfl /* standard way to check for cpuid */
- popl %eax
- movl %eax,%ebx
- xorl $0x200000,%eax
- pushl %eax
- popfl
- pushfl
- popl %eax
- cmpl %eax,%ebx
- jz no_longmode /* cpu has no cpuid */
- movl $0x0,%eax
- cpuid
- cmpl $0x1,%eax
- jb no_longmode /* no cpuid 1 */
- xor %di,%di
- cmpl $0x68747541,%ebx /* AuthenticAMD */
- jnz noamd
- cmpl $0x69746e65,%edx
- jnz noamd
- cmpl $0x444d4163,%ecx
- jnz noamd
- mov $1,%di /* cpu is from AMD */
-noamd:
- movl $0x1,%eax
- cpuid
- andl $REQUIRED_MASK1,%edx
- xorl $REQUIRED_MASK1,%edx
- jnz no_longmode
- movl $0x80000000,%eax
- cpuid
- cmpl $0x80000001,%eax
- jb no_longmode /* no extended cpuid */
- movl $0x80000001,%eax
- cpuid
- andl $REQUIRED_MASK2,%edx
- xorl $REQUIRED_MASK2,%edx
- jnz no_longmode
-sse_test:
- movl $1,%eax
- cpuid
- andl $SSE_MASK,%edx
- cmpl $SSE_MASK,%edx
- je sse_ok
- test %di,%di
- jz no_longmode /* only try to force SSE on AMD */
- movl $0xc0010015,%ecx /* HWCR */
- rdmsr
- btr $15,%eax /* enable SSE */
- wrmsr
- xor %di,%di /* don't loop */
- jmp sse_test /* try again */
+ call verify_cpu
+ testl %eax,%eax
+ jz sse_ok
+
no_longmode:
call beep
lea long_mode_panic,%si
@@ -362,7 +308,8 @@ no_longmode_loop:
long_mode_panic:
.string "Your CPU does not support long mode. Use a 32bit distribution."
.byte 0
-
+
+#include "../kernel/verify_cpu.S"
sse_ok:
popw %ds

diff -puN arch/x86_64/kernel/acpi/wakeup.S~x86_64-move-cpu-verfication-code-to-common-file arch/x86_64/kernel/acpi/wakeup.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/kernel/acpi/wakeup.S~x86_64-move-cpu-verfication-code-to-common-file 2006-11-15 21:43:44.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/kernel/acpi/wakeup.S 2006-11-15 21:43:44.000000000 -0500
@@ -43,6 +43,8 @@ wakeup_code:
jne bogus_real_magic

call verify_cpu # Verify the cpu supports long mode
+ testl %eax, %eax
+ jnz no_longmode

testl $1, video_flags - wakeup_code
jz 1f
@@ -305,57 +307,12 @@ check_vesaa:
_setbada: jmp setbada

.code16
-verify_cpu:
- pushl $0 # Kill any dangerous flags
- popfl
-
- /* minimum CPUID flags for x86-64 */
- /* see http://www.x86-64.org/lists/discuss/msg02971.html */
-#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
- (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
-#define REQUIRED_MASK2 (1<<29)
-
- pushfl # check for cpuid
- popl %eax
- movl %eax, %ebx
- xorl $0x200000,%eax
- pushl %eax
- popfl
- pushfl
- popl %eax
- pushl %ebx
- popfl
- cmpl %eax, %ebx
- jz no_longmode
-
- xorl %eax, %eax # See if cpuid 1 is implemented
- cpuid
- cmpl $0x1, %eax
- jb no_longmode
-
- movl $0x01, %eax # Does the cpu have what it takes?
- cpuid
- andl $REQUIRED_MASK1, %edx
- xorl $REQUIRED_MASK1, %edx
- jnz no_longmode
-
- movl $0x80000000, %eax # See if extended cpuid is implemented
- cpuid
- cmpl $0x80000001, %eax
- jb no_longmode
-
- movl $0x80000001, %eax # Does the cpu have what it takes?
- cpuid
- andl $REQUIRED_MASK2, %edx
- xorl $REQUIRED_MASK2, %edx
- jnz no_longmode
-
- ret # The cpu supports long mode
-
no_longmode:
movb $0xbc,%al ; outb %al,$0x80
jmp no_longmode

+#include "../verify_cpu.S"
+
ret


diff -puN arch/x86_64/kernel/trampoline.S~x86_64-move-cpu-verfication-code-to-common-file arch/x86_64/kernel/trampoline.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/kernel/trampoline.S~x86_64-move-cpu-verfication-code-to-common-file 2006-11-15 21:43:44.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/kernel/trampoline.S 2006-11-15 21:43:44.000000000 -0500
@@ -54,6 +54,8 @@ r_base = .
movw $(trampoline_stack_end - r_base), %sp

call verify_cpu # Verify the cpu supports long mode
+ testl %eax, %eax # Check for return code
+ jnz no_longmode

mov %cs, %ax
movzx %ax, %esi # Find the 32bit trampoline location
@@ -121,57 +123,10 @@ startup_64:
jmp *%rax

.code16
-verify_cpu:
- pushl $0 # Kill any dangerous flags
- popfl
-
- /* minimum CPUID flags for x86-64 */
- /* see http://www.x86-64.org/lists/discuss/msg02971.html */
-#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
- (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
-#define REQUIRED_MASK2 (1<<29)
-
- pushfl # check for cpuid
- popl %eax
- movl %eax, %ebx
- xorl $0x200000,%eax
- pushl %eax
- popfl
- pushfl
- popl %eax
- pushl %ebx
- popfl
- cmpl %eax, %ebx
- jz no_longmode
-
- xorl %eax, %eax # See if cpuid 1 is implemented
- cpuid
- cmpl $0x1, %eax
- jb no_longmode
-
- movl $0x01, %eax # Does the cpu have what it takes?
- cpuid
- andl $REQUIRED_MASK1, %edx
- xorl $REQUIRED_MASK1, %edx
- jnz no_longmode
-
- movl $0x80000000, %eax # See if extended cpuid is implemented
- cpuid
- cmpl $0x80000001, %eax
- jb no_longmode
-
- movl $0x80000001, %eax # Does the cpu have what it takes?
- cpuid
- andl $REQUIRED_MASK2, %edx
- xorl $REQUIRED_MASK2, %edx
- jnz no_longmode
-
- ret # The cpu supports long mode
-
no_longmode:
hlt
jmp no_longmode
-
+#include "verify_cpu.S"

# Careful these need to be in the same 64K segment as the above;
tidt:
diff -puN /dev/null arch/x86_64/kernel/verify_cpu.S
--- /dev/null 2006-11-15 21:36:19.165423574 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/kernel/verify_cpu.S 2006-11-15 21:43:44.000000000 -0500
@@ -0,0 +1,106 @@
+/*
+ *
+ * verify_cpu.S - Code for cpu long mode and SSE verification
+ *
+ * Copyright (c) 2006-2007 Vivek Goyal ([email protected])
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ *
+ * This is common code for verifying whether a CPU supports
+ * long mode and SSE. It is not called directly; instead this
+ * file is included at various places and compiled in that context.
+ * The current users are listed below.
+ *
+ * This file is included by both 16bit and 32bit code.
+ *
+ * arch/x86_64/boot/setup.S : Boot cpu verification (16bit)
+ * arch/x86_64/boot/compressed/head.S: Boot cpu verification (32bit)
+ * arch/x86_64/kernel/trampoline.S: secondary processor verification (16bit)
+ * arch/x86_64/kernel/acpi/wakeup.S: Verification at resume (16bit)
+ *
+ * verify_cpu returns the status of the cpu check in register %eax:
+ * 0: Success 1: Failure
+ *
+ * The caller needs to check the error code and take action
+ * accordingly, e.g. display a message or halt.
+ */
+
+verify_cpu:
+
+ pushfl # Save caller passed flags
+ pushl $0 # Kill any dangerous flags
+ popfl
+
+ /* minimum CPUID flags for x86-64 */
+ /* see http://www.x86-64.org/lists/discuss/msg02971.html */
+#define SSE_MASK ((1<<25)|(1<<26))
+#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
+ (1<<13)|(1<<15)|(1<<24))
+#define REQUIRED_MASK2 (1<<29)
+ pushfl # standard way to check for cpuid
+ popl %eax
+ movl %eax,%ebx
+ xorl $0x200000,%eax
+ pushl %eax
+ popfl
+ pushfl
+ popl %eax
+ cmpl %eax,%ebx
+ jz verify_cpu_no_longmode # cpu has no cpuid
+
+ movl $0x0,%eax # See if cpuid 1 is implemented
+ cpuid
+ cmpl $0x1,%eax
+ jb verify_cpu_no_longmode # no cpuid 1
+
+ xor %di,%di
+ cmpl $0x68747541,%ebx # AuthenticAMD
+ jnz verify_cpu_noamd
+ cmpl $0x69746e65,%edx
+ jnz verify_cpu_noamd
+ cmpl $0x444d4163,%ecx
+ jnz verify_cpu_noamd
+ mov $1,%di # cpu is from AMD
+
+verify_cpu_noamd:
+ movl $0x1,%eax # Does the cpu have what it takes
+ cpuid
+ andl $REQUIRED_MASK1,%edx
+ xorl $REQUIRED_MASK1,%edx
+ jnz verify_cpu_no_longmode
+
+ movl $0x80000000,%eax # See if extended cpuid is implemented
+ cpuid
+ cmpl $0x80000001,%eax
+ jb verify_cpu_no_longmode # no extended cpuid
+
+ movl $0x80000001,%eax # Does the cpu have what it takes
+ cpuid
+ andl $REQUIRED_MASK2,%edx
+ xorl $REQUIRED_MASK2,%edx
+ jnz verify_cpu_no_longmode
+
+verify_cpu_sse_test:
+ movl $1,%eax
+ cpuid
+ andl $SSE_MASK,%edx
+ cmpl $SSE_MASK,%edx
+ je verify_cpu_sse_ok
+ test %di,%di
+ jz verify_cpu_no_longmode # only try to force SSE on AMD
+ movl $0xc0010015,%ecx # HWCR
+ rdmsr
+ btr $15,%eax # enable SSE
+ wrmsr
+ xor %di,%di # don't loop
+ jmp verify_cpu_sse_test # try again
+
+verify_cpu_no_longmode:
+ popfl # Restore caller passed flags
+ movl $1,%eax
+ ret
+verify_cpu_sse_ok:
+ popfl # Restore caller passed flags
+ xorl %eax, %eax
+ ret
_
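For the curious, the checks verify_cpu performs can be reproduced from user
space. Here is a rough C equivalent (my own sketch using GCC's <cpuid.h>,
not kernel code; it skips the AMD HWCR step that force-enables SSE, since
rdmsr/wrmsr need ring 0):

#include <stdio.h>
#include <cpuid.h>

/* Same masks as verify_cpu.S. Leaf 1 EDX bits: FPU(0) PSE(3) TSC(4)
 * MSR(5) PAE(6) CX8(8) PGE(13) CMOV(15) FXSR(24); SSE(25) SSE2(26).
 * Extended leaf 0x80000001 EDX bit 29 is Long Mode. */
#define SSE_MASK       ((1<<25)|(1<<26))
#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
                        (1<<13)|(1<<15)|(1<<24))
#define REQUIRED_MASK2 (1<<29)

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	unsigned int want = REQUIRED_MASK1 | SSE_MASK;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) ||
	    (edx & want) != want) {
		puts("missing base features");	/* verify_cpu_no_longmode */
		return 1;
	}
	if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) ||
	    !(edx & REQUIRED_MASK2)) {
		puts("no long mode");		/* verify_cpu_no_longmode */
		return 1;
	}
	puts("cpu supports long mode and SSE");	/* verify_cpu_sse_ok */
	return 0;
}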

2006-11-16 20:10:08

by Vivek Goyal

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Wed, Nov 15, 2006 at 07:28:36PM -0500, Vivek Goyal wrote:
> On Wed, Nov 15, 2006 at 04:24:11PM -0500, Vivek Goyal wrote:
> > On Mon, Nov 13, 2006 at 11:43:14AM -0500, Vivek Goyal wrote:
> > >
> > >
> > > - Killed lots of dead code
> > > - Improve the cpu sanity checks to verify long mode
> > > is enabled when we wake up.
> > > - Removed the need for modifying any existing kernel page table.
> > > - Moved wakeup_level4_pgt into the wakeup routine so we can
> > > run the kernel above 4G.
> > > - Increased the size of the wakeup routine to 8K.
> > > - Renamed the variables to use the 64bit register names.
> > > - Lots of misc cleanups to match trampoline.S
> > >
> > > I don't have a configuration I can test this but it compiles cleanly
> > > and it should work, the code is very similar to the SMP trampoline,
> > > which I have tested. At least now the comments about still running in
> > > low memory are actually correct.
> > >
> > > Vivek has tested this patch for suspend to memory and it works fine.
> > >
> >
> > More updates. I got hold of another machine, and suspend/resume seems to
> > be facing problems.
> >
> > With 2.6.19-rc5-git2
> > --------------------
> > - echo 3 > /proc/acpi/sleep (Suspend to memory takes place)
> > - Press power button (System tries to come back but fails in MPT adapter
> > initialization)
> >
> > With 2.6.19-rc5-git2 + Reloc patches
> > ------------------------------------
> > - echo 3 > /proc/acpi/sleep (Suspend to memory takes place)
> > - Press power button (Fan powers on but nothing additional is displayed on
> > serial console.)
> >
> > Will do a bisect and try to isolate the problem.
> >
>
> Ok. In the new code the NX bit protection feature is not being enabled,
> and that seems to be causing the problem. I checked and enabled the NX
> bit feature in EFER in wakeup.S and it starts working.
>
> I think my new machine supports the NX bit protection feature, and if I
> don't enable that feature again while resuming, it probably caused a GPF
> while loading the page tables which have the NX bit set. (A guess.)
>
> I know that the previous machine I was testing on does not support the
> NX bit feature, and that could be why that machine did not run into the
> problem.
>

Hi,

Fixed the resume problem happening on my second box, which supports the
NX protection bit. Please find the regenerated patch attached.

Thanks
Vivek



- Killed lots of dead code
- Improve the cpu sanity checks to verify long mode
is enabled when we wake up.
- Removed the need for modifying any existing kernel page table.
- Moved wakeup_level4_pgt into the wakeup routine so we can
run the kernel above 4G.
- Increased the size of the wakeup routine to 8K.
- Renamed the variables to use the 64bit register names.
- Lots of misc cleanups to match trampoline.S

I don't have a configuration I can test this but it compiles cleanly
and it should work, the code is very similar to the SMP trampoline,
which I have tested. At least now the comments about still running in
low memory are actually correct.

Vivek has tested this patch for suspend to memory and it works fine.

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---

arch/x86_64/kernel/acpi/sleep.c | 19 --
arch/x86_64/kernel/acpi/wakeup.S | 334 +++++++++++++++++----------------------
arch/x86_64/kernel/head.S | 9 -
include/asm-x86_64/suspend.h | 12 -
4 files changed, 162 insertions(+), 212 deletions(-)

diff -puN arch/x86_64/kernel/acpi/sleep.c~x86_64-64bit-PIC-ACPI-wakeup arch/x86_64/kernel/acpi/sleep.c
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/kernel/acpi/sleep.c~x86_64-64bit-PIC-ACPI-wakeup 2006-11-15 00:34:26.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/kernel/acpi/sleep.c 2006-11-15 00:34:26.000000000 -0500
@@ -60,17 +60,6 @@ extern char wakeup_start, wakeup_end;

extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long));

-static pgd_t low_ptr;
-
-static void init_low_mapping(void)
-{
- pgd_t *slot0 = pgd_offset(current->mm, 0UL);
- low_ptr = *slot0;
- set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
- WARN_ON(num_online_cpus() != 1);
- local_flush_tlb();
-}
-
/**
* acpi_save_state_mem - save kernel state
*
@@ -79,8 +68,6 @@ static void init_low_mapping(void)
*/
int acpi_save_state_mem(void)
{
- init_low_mapping();
-
memcpy((void *)acpi_wakeup_address, &wakeup_start,
&wakeup_end - &wakeup_start);
acpi_copy_wakeup_routine(acpi_wakeup_address);
@@ -93,8 +80,6 @@ int acpi_save_state_mem(void)
*/
void acpi_restore_state_mem(void)
{
- set_pgd(pgd_offset(current->mm, 0UL), low_ptr);
- local_flush_tlb();
}

/**
@@ -107,8 +92,8 @@ void acpi_restore_state_mem(void)
*/
void __init acpi_reserve_bootmem(void)
{
- acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE);
- if ((&wakeup_end - &wakeup_start) > PAGE_SIZE)
+ acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE*2);
+ if ((&wakeup_end - &wakeup_start) > (PAGE_SIZE*2))
printk(KERN_CRIT
"ACPI: Wakeup code way too big, will crash on attempt to suspend\n");
}
diff -puN arch/x86_64/kernel/acpi/wakeup.S~x86_64-64bit-PIC-ACPI-wakeup arch/x86_64/kernel/acpi/wakeup.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/kernel/acpi/wakeup.S~x86_64-64bit-PIC-ACPI-wakeup 2006-11-15 00:34:26.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/kernel/acpi/wakeup.S 2006-11-15 21:41:20.000000000 -0500
@@ -1,6 +1,7 @@
.text
#include <linux/linkage.h>
#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
#include <asm/msr.h>

@@ -15,7 +16,6 @@
# cs = 0x1234, eip = 0x05
#

-
ALIGN
.align 16
ENTRY(wakeup_start)
@@ -30,22 +30,25 @@ wakeup_code:
cld
# setup data segment
movw %cs, %ax
- movw %ax, %ds # Make ds:0 point to wakeup_start
+ movw %ax, %ds # Make ds:0 point to wakeup_start
movw %ax, %ss
- mov $(wakeup_stack - wakeup_code), %sp # Private stack is needed for ASUS board
+ # Private stack is needed for ASUS board
+ mov $(wakeup_stack - wakeup_code), %sp

- pushl $0 # Kill any dangerous flags
+ pushl $0 # Kill any dangerous flags
popfl

movl real_magic - wakeup_code, %eax
cmpl $0x12345678, %eax
jne bogus_real_magic

+ call verify_cpu # Verify the cpu supports long mode
+
testl $1, video_flags - wakeup_code
jz 1f
lcall $0xc000,$3
movw %cs, %ax
- movw %ax, %ds # Bios might have played with that
+ movw %ax, %ds # Bios might have played with that
movw %ax, %ss
1:

@@ -60,13 +63,17 @@ wakeup_code:
movw $0x0e00 + 'L', %fs:(0x10)

movb $0xa2, %al ; outb %al, $0x80
+
+ mov %ds, %ax # Find 32bit wakeup_code address
+ movzx %ax, %esi # (Convert %ds:gdt to a linear ptr)
+ shll $4, %esi
+
+ # Fixup the vectors
+ addl %esi, wakeup_32_vector - wakeup_code
+ addl %esi, wakeup_long64_vector - wakeup_code
+ addl %esi, gdt_48a + 2 - wakeup_code # Fixup the gdt pointer

- lidt %ds:idt_48a - wakeup_code
- xorl %eax, %eax
- movw %ds, %ax # (Convert %ds:gdt to a linear ptr)
- shll $4, %eax
- addl $(gdta - wakeup_code), %eax
- movl %eax, gdt_48a +2 - wakeup_code
+ lidtl %ds:idt_48a - wakeup_code
lgdtl %ds:gdt_48a - wakeup_code # load gdt with whatever is
# appropriate

@@ -75,85 +82,60 @@ wakeup_code:
jmp 1f
1:

- .byte 0x66, 0xea # prefix + jmpi-opcode
- .long wakeup_32 - __START_KERNEL_map
- .word __KERNEL_CS
+ ljmpl *(wakeup_32_vector - wakeup_code)
+
+ .balign 4
+wakeup_32_vector:
+ .long wakeup_32 - wakeup_code
+ .word __KERNEL32_CS, 0

.code32
wakeup_32:
# Running in this code, but at low address; paging is not yet turned on.
movb $0xa5, %al ; outb %al, $0x80

- /* Check if extended functions are implemented */
- movl $0x80000000, %eax
- cpuid
- cmpl $0x80000000, %eax
- jbe bogus_cpu
- wbinvd
- mov $0x80000001, %eax
- cpuid
- btl $29, %edx
- jnc bogus_cpu
- movl %edx,%edi
-
- movw $__KERNEL_DS, %ax
- movw %ax, %ds
- movw %ax, %es
- movw %ax, %fs
- movw %ax, %gs
+ /* Initialize segments */
+ movl $__KERNEL_DS, %eax
+ movl %eax, %ds

- movw $__KERNEL_DS, %ax
- movw %ax, %ss
-
- mov $(wakeup_stack - __START_KERNEL_map), %esp
- movl saved_magic - __START_KERNEL_map, %eax
- cmpl $0x9abcdef0, %eax
- jne bogus_32_magic
+ movw $0x0e00 + 'i', %ds:(0xb8012)
+ movb $0xa8, %al ; outb %al, $0x80;

/*
* Prepare for entering 64bits mode
*/

- /* Enable PAE mode and PGE */
+ /* Enable PAE */
xorl %eax, %eax
btsl $5, %eax
- btsl $7, %eax
movl %eax, %cr4

/* Setup early boot stage 4 level pagetables */
- movl $(wakeup_level4_pgt - __START_KERNEL_map), %eax
+ leal (wakeup_level4_pgt - wakeup_code)(%esi), %eax
movl %eax, %cr3

- /* Setup EFER (Extended Feature Enable Register) */
- movl $MSR_EFER, %ecx
- rdmsr
- /* Fool rdmsr and reset %eax to avoid dependences */
- xorl %eax, %eax
+ /* Check if nx is implemented */
+ movl $0x80000001, %eax
+ cpuid
+ movl %edx,%edi
+
/* Enable Long Mode */
- btsl $_EFER_LME, %eax
- /* Enable System Call */
- btsl $_EFER_SCE, %eax
+ xorl %eax, %eax
+ btsl $_EFER_LME, %eax

- /* No Execute supported? */
+ /* No Execute supported? */
btl $20,%edi
jnc 1f
btsl $_EFER_NX, %eax
-1:
-
- /* Make changes effective */
+
+ /* Enable Long Mode */
+1: movl $MSR_EFER, %ecx
+ xorl %edx, %edx
wrmsr
- wbinvd

xorl %eax, %eax
btsl $31, %eax /* Enable paging and in turn activate Long Mode */
btsl $0, %eax /* Enable protected mode */
- btsl $1, %eax /* Enable MP */
- btsl $4, %eax /* Enable ET */
- btsl $5, %eax /* Enable NE */
- btsl $16, %eax /* Enable WP */
- btsl $18, %eax /* Enable AM */
-
- /* Make changes effective */
movl %eax, %cr0
/* At this point:
CR4.PAE must be 1
@@ -162,11 +144,6 @@ wakeup_32:
Next instruction must be a branch
This must be on identity-mapped page
*/
- jmp reach_compatibility_mode
-reach_compatibility_mode:
- movw $0x0e00 + 'i', %ds:(0xb8012)
- movb $0xa8, %al ; outb %al, $0x80;
-
/*
* At this point we're in long mode but in 32bit compatibility mode
* with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
@@ -174,20 +151,13 @@ reach_compatibility_mode:
* the new gdt/idt that has __KERNEL_CS with CS.L = 1.
*/

- movw $0x0e00 + 'n', %ds:(0xb8014)
- movb $0xa9, %al ; outb %al, $0x80
-
- /* Load new GDT with the 64bit segment using 32bit descriptor */
- movl $(pGDT32 - __START_KERNEL_map), %eax
- lgdt (%eax)
-
- movl $(wakeup_jumpvector - __START_KERNEL_map), %eax
/* Finally jump in 64bit mode */
- ljmp *(%eax)
+ ljmp *(wakeup_long64_vector - wakeup_code)(%esi)

-wakeup_jumpvector:
- .long wakeup_long64 - __START_KERNEL_map
- .word __KERNEL_CS
+ .balign 4
+wakeup_long64_vector:
+ .long wakeup_long64 - wakeup_code
+ .word __KERNEL_CS, 0

.code64

@@ -199,10 +169,18 @@ wakeup_long64:
* addresses where we're currently running on. We have to do that here
* because in 32bit we couldn't load a 64bit linear address.
*/
- lgdt cpu_gdt_descr - __START_KERNEL_map
+ lgdt cpu_gdt_descr
+
+ movw $0x0e00 + 'n', %ds:(0xb8014)
+ movb $0xa9, %al ; outb %al, $0x80
+
+ movq saved_magic, %rax
+ movq $0x123456789abcdef0, %rdx
+ cmpq %rdx, %rax
+ jne bogus_64_magic

movw $0x0e00 + 'u', %ds:(0xb8016)
-
+
nop
nop
movw $__KERNEL_DS, %ax
@@ -211,16 +189,16 @@ wakeup_long64:
movw %ax, %es
movw %ax, %fs
movw %ax, %gs
- movq saved_esp, %rsp
+ movq saved_rsp, %rsp

movw $0x0e00 + 'x', %ds:(0xb8018)
- movq saved_ebx, %rbx
- movq saved_edi, %rdi
- movq saved_esi, %rsi
- movq saved_ebp, %rbp
+ movq saved_rbx, %rbx
+ movq saved_rdi, %rdi
+ movq saved_rsi, %rsi
+ movq saved_rbp, %rbp

movw $0x0e00 + '!', %ds:(0xb801a)
- movq saved_eip, %rax
+ movq saved_rip, %rax
jmp *%rax

.code32
@@ -228,25 +206,10 @@ wakeup_long64:
.align 64
gdta:
.word 0, 0, 0, 0 # dummy
-
- .word 0, 0, 0, 0 # unused
-
- .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
- .word 0 # base address = 0
- .word 0x9B00 # code read/exec. ??? Why I need 0x9B00 (as opposed to 0x9A00 in order for this to work?)
- .word 0x00CF # granularity = 4096, 386
- # (+5th nibble of limit)
-
- .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
- .word 0 # base address = 0
- .word 0x9200 # data read/write
- .word 0x00CF # granularity = 4096, 386
- # (+5th nibble of limit)
-# this is 64bit descriptor for code
- .word 0xFFFF
- .word 0
- .word 0x9A00 # code read/exec
- .word 0x00AF # as above, but it is long mode and with D=0
+ /* ??? Why I need the accessed bit set in order for this to work? */
+ .quad 0x00cf9b000000ffff # __KERNEL32_CS
+ .quad 0x00af9b000000ffff # __KERNEL_CS
+ .quad 0x00cf93000000ffff # __KERNEL_DS

idt_48a:
.word 0 # idt limit = 0
@@ -255,30 +218,24 @@ idt_48a:
gdt_48a:
.word 0x8000 # gdt limit=2048,
# 256 GDT entries
- .word 0, 0 # gdt base (filled in later)
-
-
+ .long gdta - wakeup_code # gdt base (relocated later)
+
+
real_save_gdt: .word 0
.quad 0
real_magic: .quad 0
video_mode: .quad 0
video_flags: .quad 0

+.code16
bogus_real_magic:
movb $0xba,%al ; outb %al,$0x80
jmp bogus_real_magic

-bogus_32_magic:
+.code64
+bogus_64_magic:
movb $0xb3,%al ; outb %al,$0x80
- jmp bogus_32_magic
-
-bogus_31_magic:
- movb $0xb1,%al ; outb %al,$0x80
- jmp bogus_31_magic
-
-bogus_cpu:
- movb $0xbc,%al ; outb %al,$0x80
- jmp bogus_cpu
+ jmp bogus_64_magic


/* This code uses an extended set of video mode numbers. These include:
@@ -301,6 +258,7 @@ bogus_cpu:
#define VIDEO_FIRST_V7 0x0900

# Setting of user mode (AX=mode ID) => CF=success
+.code16
mode_seta:
movw %ax, %bx
#if 0
@@ -346,14 +304,59 @@ check_vesaa:

_setbada: jmp setbada

- .code64
-bogus_magic:
- movw $0x0e00 + 'B', %ds:(0xb8018)
- jmp bogus_magic
-
-bogus_magic2:
- movw $0x0e00 + '2', %ds:(0xb8018)
- jmp bogus_magic2
+ .code16
+verify_cpu:
+ pushl $0 # Kill any dangerous flags
+ popfl
+
+ /* minimum CPUID flags for x86-64 */
+ /* see http://www.x86-64.org/lists/discuss/msg02971.html */
+#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
+ (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
+#define REQUIRED_MASK2 (1<<29)
+
+ pushfl # check for cpuid
+ popl %eax
+ movl %eax, %ebx
+ xorl $0x200000,%eax
+ pushl %eax
+ popfl
+ pushfl
+ popl %eax
+ pushl %ebx
+ popfl
+ cmpl %eax, %ebx
+ jz no_longmode
+
+ xorl %eax, %eax # See if cpuid 1 is implemented
+ cpuid
+ cmpl $0x1, %eax
+ jb no_longmode
+
+ movl $0x01, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK1, %edx
+ xorl $REQUIRED_MASK1, %edx
+ jnz no_longmode
+
+ movl $0x80000000, %eax # See if extended cpuid is implemented
+ cpuid
+ cmpl $0x80000001, %eax
+ jb no_longmode
+
+ movl $0x80000001, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK2, %edx
+ xorl $REQUIRED_MASK2, %edx
+ jnz no_longmode
+
+ ret # The cpu supports long mode
+
+no_longmode:
+ movb $0xbc,%al ; outb %al,$0x80
+ jmp no_longmode
+
+ ret


wakeup_stack_begin: # Stack grows down
@@ -361,7 +364,15 @@ wakeup_stack_begin: # Stack grows down
.org 0xff0
wakeup_stack: # Just below end of page

+.org 0x1000
+ENTRY(wakeup_level4_pgt)
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 510,8,0
+ /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+ .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
+
ENTRY(wakeup_end)
+ .code64

##
# acpi_copy_wakeup_routine
@@ -378,23 +389,6 @@ ENTRY(acpi_copy_wakeup_routine)
pushq %rcx
pushq %rdx

- sgdt saved_gdt
- sidt saved_idt
- sldt saved_ldt
- str saved_tss
-
- movq %cr3, %rdx
- movq %rdx, saved_cr3
- movq %cr4, %rdx
- movq %rdx, saved_cr4
- movq %cr0, %rdx
- movq %rdx, saved_cr0
- sgdt real_save_gdt - wakeup_start (,%rdi)
- movl $MSR_EFER, %ecx
- rdmsr
- movl %eax, saved_efer
- movl %edx, saved_efer2
-
movl saved_video_mode, %edx
movl %edx, video_mode - wakeup_start (,%rdi)
movl acpi_video_flags, %edx
@@ -403,18 +397,11 @@ ENTRY(acpi_copy_wakeup_routine)
movq $0x123456789abcdef0, %rdx
movq %rdx, saved_magic

- movl saved_magic - __START_KERNEL_map, %eax
- cmpl $0x9abcdef0, %eax
- jne bogus_32_magic
-
- # make sure %cr4 is set correctly (features, etc)
- movl saved_cr4 - __START_KERNEL_map, %eax
- movq %rax, %cr4
-
- movl saved_cr0 - __START_KERNEL_map, %eax
- movq %rax, %cr0
- jmp 1f # Flush pipelines
-1:
+ movq saved_magic, %rax
+ movq $0x123456789abcdef0, %rdx
+ cmpq %rdx, %rax
+ jne bogus_64_magic
+
# restore the regs we used
popq %rdx
popq %rcx
@@ -450,13 +437,13 @@ do_suspend_lowlevel:
movq %r15, saved_context_r15(%rip)
pushfq ; popq saved_context_eflags(%rip)

- movq $.L97, saved_eip(%rip)
+ movq $.L97, saved_rip(%rip)

- movq %rsp,saved_esp
- movq %rbp,saved_ebp
- movq %rbx,saved_ebx
- movq %rdi,saved_edi
- movq %rsi,saved_esi
+ movq %rsp,saved_rsp
+ movq %rbp,saved_rbp
+ movq %rbx,saved_rbx
+ movq %rdi,saved_rdi
+ movq %rsi,saved_rsi

addq $8, %rsp
movl $3, %edi
@@ -503,25 +490,12 @@ do_suspend_lowlevel:

.data
ALIGN
-ENTRY(saved_ebp) .quad 0
-ENTRY(saved_esi) .quad 0
-ENTRY(saved_edi) .quad 0
-ENTRY(saved_ebx) .quad 0
+ENTRY(saved_rbp) .quad 0
+ENTRY(saved_rsi) .quad 0
+ENTRY(saved_rdi) .quad 0
+ENTRY(saved_rbx) .quad 0

-ENTRY(saved_eip) .quad 0
-ENTRY(saved_esp) .quad 0
+ENTRY(saved_rip) .quad 0
+ENTRY(saved_rsp) .quad 0

ENTRY(saved_magic) .quad 0
-
-ALIGN
-# saved registers
-saved_gdt: .quad 0,0
-saved_idt: .quad 0,0
-saved_ldt: .quad 0
-saved_tss: .quad 0
-
-saved_cr0: .quad 0
-saved_cr3: .quad 0
-saved_cr4: .quad 0
-saved_efer: .quad 0
-saved_efer2: .quad 0
diff -puN arch/x86_64/kernel/head.S~x86_64-64bit-PIC-ACPI-wakeup arch/x86_64/kernel/head.S
--- linux-2.6.19-rc5-git2-reloc/arch/x86_64/kernel/head.S~x86_64-64bit-PIC-ACPI-wakeup 2006-11-15 00:34:26.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/arch/x86_64/kernel/head.S 2006-11-15 21:37:37.000000000 -0500
@@ -300,15 +300,6 @@ NEXT_PAGE(level2_kernel_pgt)

.data

-#ifdef CONFIG_ACPI_SLEEP
- .align PAGE_SIZE
-ENTRY(wakeup_level4_pgt)
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 510,8,0
- /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
-#endif
-
#ifndef CONFIG_HOTPLUG_CPU
__INITDATA
#endif
diff -puN include/asm-x86_64/suspend.h~x86_64-64bit-PIC-ACPI-wakeup include/asm-x86_64/suspend.h
--- linux-2.6.19-rc5-git2-reloc/include/asm-x86_64/suspend.h~x86_64-64bit-PIC-ACPI-wakeup 2006-11-15 00:34:26.000000000 -0500
+++ linux-2.6.19-rc5-git2-reloc-root/include/asm-x86_64/suspend.h 2006-11-15 00:34:26.000000000 -0500
@@ -45,12 +45,12 @@ extern unsigned long saved_context_eflag
extern void fix_processor_context(void);

#ifdef CONFIG_ACPI_SLEEP
-extern unsigned long saved_eip;
-extern unsigned long saved_esp;
-extern unsigned long saved_ebp;
-extern unsigned long saved_ebx;
-extern unsigned long saved_esi;
-extern unsigned long saved_edi;
+extern unsigned long saved_rip;
+extern unsigned long saved_rsp;
+extern unsigned long saved_rbp;
+extern unsigned long saved_rbx;
+extern unsigned long saved_rsi;
+extern unsigned long saved_rdi;

/* routines for saving/restoring kernel state */
extern int acpi_save_state_mem(void);
_
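As an aside, the NX part of the fix above boils down to detecting NX via
cpuid leaf 0x80000001 (EDX bit 20) and setting _EFER_NX alongside _EFER_LME
before the wrmsr. A user-space sketch of the detection half (my own
illustration, not kernel code; the wrmsr itself needs ring 0):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Mirrors the new wakeup_32 code: cpuid 0x80000001, EDX bit 20.
	 * If NX is supported, EFER.NX must be set again on resume;
	 * otherwise page table entries carrying the NX bit look like
	 * they have a reserved bit set. */
	if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) &&
	    (edx & (1u << 20)))
		puts("NX supported: re-enable EFER.NX on resume");
	else
		puts("no NX: EFER.NX stays clear");
	return 0;
}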

2006-11-16 20:53:44

by Pavel Machek

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Hi!

> > Ok. In the new code the NX bit protection feature is not being enabled,
> > and that seems to be causing the problem. I checked and enabled the NX
> > bit feature in EFER in wakeup.S and it starts working.
> >
> > I think my new machine supports the NX bit protection feature, and if I
> > don't enable that feature again while resuming, it probably caused a GPF
> > while loading the page tables which have the NX bit set. (A guess.)
> >
> > I know that the previous machine I was testing on does not support the
> > NX bit feature, and that could be why that machine did not run into the
> > problem.
>
> Fixed the resume problem happening on my second box, which supports the
> NX protection bit. Please find the regenerated patch attached.
>
> - Killed lots of dead code

Cleanup. (a)

> - Improve the cpu sanity checks to verify long mode
> is enabled when we wake up.

Change. (b). I'm not sure if we really need this one. I do not think
replacing the cpu while suspended is a supported operation.

> - Removed the need for modifying any existing kernel page table.

Unrelated change, probably good one. (c).

> - Moved wakeup_level4_pgt into the wakeup routine so we can
> run the kernel above 4G.

The change you really wanted to do in the first place. (d).

> - Increased the size of the wakeup routine to 8K.

You want a bigger stack or what? (e)

> - Renamed the variables to use the 64bit register names.

Cleanup. (a)

> - Lots of misc cleanups to match trampoline.S

More cleanups. (a).

Can we at least get (a) (b) (c) (d) and (e) separated?

Oh and please drop the whitespace changes.

> I don't have a configuration I can test this but it compiles cleanly
> and it should work, the code is very similar to the SMP trampoline,

I assume you have a configuration to test on now?

> @@ -60,17 +60,6 @@ extern char wakeup_start, wakeup_end;
>
> extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long));
>
> -static pgd_t low_ptr;
> -
> -static void init_low_mapping(void)
> -{
> - pgd_t *slot0 = pgd_offset(current->mm, 0UL);
> - low_ptr = *slot0;
> - set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
> - WARN_ON(num_online_cpus() != 1);
> - local_flush_tlb();
> -}
> -

So you no longer need identity mapping? Isn't it specified that when
you transition between modes, you should do that while in identity
mapping?

> @@ -15,7 +16,6 @@
> # cs = 0x1234, eip = 0x05
> #
>
> -
> ALIGN
> .align 16
> ENTRY(wakeup_start)

Whitespace changes.

> @@ -30,22 +30,25 @@ wakeup_code:
> cld
> # setup data segment
> movw %cs, %ax
> - movw %ax, %ds # Make ds:0 point to wakeup_start
> + movw %ax, %ds # Make ds:0 point to wakeup_start
> movw %ax, %ss
> - mov $(wakeup_stack - wakeup_code), %sp # Private stack is needed for ASUS board
> + # Private stack is needed for ASUS board
> + mov $(wakeup_stack - wakeup_code), %sp
>
> - pushl $0 # Kill any dangerous flags
> + pushl $0 # Kill any dangerous flags
> popfl

More whitespace changes.


> movl real_magic - wakeup_code, %eax
> cmpl $0x12345678, %eax
> jne bogus_real_magic
>
> + call verify_cpu # Verify the cpu supports long mode
> +

Check if the cpu supports long mode... but we suspended while running in
long mode, so why check again?

> testl $1, video_flags - wakeup_code
> jz 1f
> lcall $0xc000,$3
> movw %cs, %ax
> - movw %ax, %ds # Bios might have played with that
> + movw %ax, %ds # Bios might have played with that
> movw %ax, %ss
> 1:

More whitespace changes.

> @@ -228,25 +206,10 @@ wakeup_long64:
> .align 64
> gdta:
> .word 0, 0, 0, 0 # dummy
> -
> - .word 0, 0, 0, 0 # unused
> -
> - .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
> - .word 0 # base address = 0
> - .word 0x9B00 # code read/exec. ??? Why I need 0x9B00 (as opposed to 0x9A00 in order for this to work?)
> - .word 0x00CF # granularity = 4096, 386
> - # (+5th nibble of limit)
> -
> - .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
> - .word 0 # base address = 0
> - .word 0x9200 # data read/write
> - .word 0x00CF # granularity = 4096, 386
> - # (+5th nibble of limit)
> -# this is 64bit descriptor for code
> - .word 0xFFFF
> - .word 0
> - .word 0x9A00 # code read/exec
> - .word 0x00AF # as above, but it is long mode and with D=0
> + /* ??? Why I need the accessed bit set in order for this to work? */
> + .quad 0x00cf9b000000ffff # __KERNEL32_CS
> + .quad 0x00af9b000000ffff # __KERNEL_CS
> + .quad 0x00cf93000000ffff # __KERNEL_DS

Why this change, why did you change the values in here, and why did you
not tell me about it in the changelog?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-16 21:30:35

by Vivek Goyal

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

On Thu, Nov 16, 2006 at 09:53:13PM +0100, Pavel Machek wrote:
> Hi!
>
> > > Ok. In the new code the NX bit protection feature is not being enabled,
> > > and that seems to be causing the problem. I checked and enabled the NX
> > > bit feature in EFER in wakeup.S and it starts working.
> > >
> > > I think my new machine supports the NX bit protection feature, and if I
> > > don't enable that feature again while resuming, it probably caused a GPF
> > > while loading the page tables which have the NX bit set. (A guess.)
> > >
> > > I know that the previous machine I was testing on does not support the
> > > NX bit feature, and that could be why that machine did not run into the
> > > problem.
> >
> > Fixed the resume problem happening on my second box, which supports the
> > NX protection bit. Please find the regenerated patch attached.
> >
> > - Killed lots of dead code
>
> Cleanup. (a)
>
> > - Improve the cpu sanity checks to verify long mode
> > is enabled when we wake up.
>
> Change. (b). I'm not sure if we really need this one. I do not think
> replacing the cpu while suspended is a supported operation.
>

That's fine, but it does no harm. Now all the entry paths share the
same sanity check (verify_cpu.S), and I believe it makes the code more
maintainable and more robust. It just makes our checks stronger in
case somebody really does replace the cpus.

> > - Removed the need for modifying any existing kernel page table.
>
> Unrelated change, probably good one. (c).
>
> > - Moved wakeup_level4_pgt into the wakeup routine so we can
> > run the kernel above 4G.
>
> The change you really wanted to do in the first place. (d).
>
> > - Increased the size of the wakeup routine to 8K.
>
> You want a bigger stack or what? (e)
>

I think this is because of the wakeup_level4_pgt page tables, which are
now part of the trampoline. These page tables have to be on a 4K byte
boundary, hence we now need two pages for the trampoline instead of one.

> > - Renamed the variables to use the 64bit register names.
>
> Cleanup. (a)
>
> > - Lots of misc cleanups to match trampoline.S
>
> More cleanups. (a).
>
> Can we at least get (a) (b) (c) (d) and (e) separated?
>

Ok. I will separate the patches.

> Oh and please drop the whitespace changes.
>

Sure. Will drop the whitespace too.

> > I don't have a configuration I can test this but it compiles cleanly
> > and it should work, the code is very similar to the SMP trampoline,
>
> I assume you have a configuration to test on now?
>

Eric did not have one, but I have now tested it on two configurations.
I think that's good enough, isn't it?

> > @@ -60,17 +60,6 @@ extern char wakeup_start, wakeup_end;
> >
> > extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long));
> >
> > -static pgd_t low_ptr;
> > -
> > -static void init_low_mapping(void)
> > -{
> > - pgd_t *slot0 = pgd_offset(current->mm, 0UL);
> > - low_ptr = *slot0;
> > - set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
> > - WARN_ON(num_online_cpus() != 1);
> > - local_flush_tlb();
> > -}
> > -
>
> So you no longer need identity mapping? Isn't it specified that when
> you transition between modes, you should do that while in identity
> mapping?
>

I am not sure where these mappings were required in the first place.
While going to the sleep state? While resuming we use the wakeup page
tables, and they already have identity mappings, so it should not be
an issue.
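For what it's worth, a small user-space sketch (mine, not from the patch) of
why the wakeup_level4_pgt layout in wakeup.S covers both cases: slot 0 holds
the identity mapping and slot 511 covers __START_KERNEL_map:

#include <stdio.h>
#include <stdint.h>

#define PGDIR_SHIFT	39
#define PTRS_PER_PGD	512
#define __START_KERNEL_map 0xffffffff80000000ULL

static unsigned int pgd_index(uint64_t addr)
{
	return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}

int main(void)
{
	/* level3_ident_pgt lives in slot 0 of wakeup_level4_pgt ... */
	printf("identity 0x0       -> pgd slot %u\n", pgd_index(0));
	/* ... and level3_kernel_pgt in slot 511, matching the patch's
	 * (2^48-(2*1024*1024*1024))/(2^39) = 511 comment. */
	printf("__START_KERNEL_map -> pgd slot %u\n",
	       pgd_index(__START_KERNEL_map));
	return 0;
}

With this PML4 loaded, the low slot keeps the trampoline's next instruction
fetch valid while slot 511 makes the jump to the kernel text address work.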

[..]
> More whitespace changes.
>
>
> > movl real_magic - wakeup_code, %eax
> > cmpl $0x12345678, %eax
> > jne bogus_real_magic
> >
> > + call verify_cpu # Verify the cpu supports long mode
> > +
>
> Check if the cpu supports long mode... but we suspended while running in
> long mode, so why check again?
>

Please see above.

> > testl $1, video_flags - wakeup_code
> > jz 1f
> > lcall $0xc000,$3
> > movw %cs, %ax
> > - movw %ax, %ds # Bios might have played with that
> > + movw %ax, %ds # Bios might have played with that
> > movw %ax, %ss
> > 1:
>
> More whitespace changes.
>
> > @@ -228,25 +206,10 @@ wakeup_long64:
> > .align 64
> > gdta:
> > .word 0, 0, 0, 0 # dummy
> > -
> > - .word 0, 0, 0, 0 # unused
> > -
> > - .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
> > - .word 0 # base address = 0
> > - .word 0x9B00 # code read/exec. ??? Why I need 0x9B00 (as opposed to 0x9A00 in order for this to work?)
> > - .word 0x00CF # granularity = 4096, 386
> > - # (+5th nibble of limit)
> > -
> > - .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
> > - .word 0 # base address = 0
> > - .word 0x9200 # data read/write
> > - .word 0x00CF # granularity = 4096, 386
> > - # (+5th nibble of limit)
> > -# this is 64bit descriptor for code
> > - .word 0xFFFF
> > - .word 0
> > - .word 0x9A00 # code read/exec
> > - .word 0x00AF # as above, but it is long mode and with D=0
> > + /* ??? Why I need the accessed bit set in order for this to work? */
> > + .quad 0x00cf9b000000ffff # __KERNEL32_CS
> > + .quad 0x00af9b000000ffff # __KERNEL_CS
> > + .quad 0x00cf93000000ffff # __KERNEL_DS
>
> Why this change, why did you change the values in here, and why did you
> not tell me about it in the changelog?
>

I think it has mainly been modified to keep the gdt table consistent
across the kernel (cpu_gdt_table, trampoline.S and wakeup.S). The
__KERNEL32_CS entry has been moved up to keep the gdt on the trampoline
small. This change was done in patch number 7 (cleanup segments).

Secondly, I think it is just a change of form from .word to .quad, which
is more compact.

Thirdly, I think it does no harm to mark that gdt entry as accessed.
Eric can elaborate more on it. Patch 7 also has the details.
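If it helps, here is a small decoder (my own sketch) for the .quad form,
showing where the accessed bit lives in, say, __KERNEL_CS =
0x00af9b000000ffff; the 0x9b vs 0x9a difference is exactly bit 0 of the
access byte:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t desc = 0x00af9b000000ffffULL;	/* __KERNEL_CS */
	unsigned int access = (desc >> 40) & 0xff;	/* 0x9b */
	unsigned int flags  = (desc >> 52) & 0xf;	/* AVL/L/D/G nibble */

	printf("present=%u dpl=%u code=%u accessed=%u\n",
	       (access >> 7) & 1,	/* P = 1 */
	       (access >> 5) & 3,	/* DPL = 0 */
	       (access >> 3) & 1,	/* executable */
	       access & 1);		/* A = 1, the 0x9b vs 0x9a bit */
	printf("L=%u D=%u G=%u\n",
	       (flags >> 1) & 1,	/* long mode code segment */
	       (flags >> 2) & 1,	/* D = 0 as required with L = 1 */
	       (flags >> 3) & 1);	/* 4K granularity */
	return 0;
}

Run on 0x00cf9b000000ffff (__KERNEL32_CS) it shows L=0, D=1, i.e. the
compatibility-mode counterpart.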

Thanks
Vivek

2006-11-16 21:52:20

by Pavel Machek

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] [PATCH 10/16] x86_64: 64bit PIC ACPI wakeup

Hi!

> > > Fixed the resume problem happening on my second box, which supports the
> > > NX protection bit. Please find the regenerated patch attached.
> > >
> > > - Killed lots of dead code
> >
> > Cleanup. (a)
> >
> > > - Improve the cpu sanity checks to verify long mode
> > > is enabled when we wake up.
> >
> > Change. (b). I'm not sure if we really need this one. I do not think
> > replacing the cpu while suspended is a supported operation.
> >
>
> That's fine, but it does no harm. Now all the entry paths share the
> same sanity check (verify_cpu.S), and I believe it makes the code more
> maintainable and more robust. It just makes our checks stronger in
> case somebody really does replace the cpus.

It is probably okay if shared.

> > > - Removed the need for modifying any existing kernel page table.
> >
> > Unrelated change, probably good one. (c).
> >
> > > - Moved wakeup_level4_pgt into the wakeup routine so we can
> > > run the kernel above 4G.
> >
> > The change you really wanted to do in the first place. (d).
> >
> > > - Increased the size of the wakeup routine to 8K.
> >
> > You want a bigger stack or what? (e)
> >
>
> I think this is because of the wakeup_level4_pgt page tables, which are
> now part of the trampoline. These page tables have to be on a 4K byte
> boundary, hence we now need two pages for the trampoline instead of one.

Aha, ok. Please mention that in the changelog.

> > > - Renamed the variables to use the 64bit register names.
> >
> > Cleanup. (a)
> >
> > > - Lots of misc cleanups to match trampoline.S
> >
> > More cleanups. (a).
> >
> > Can we at least get (a) (b) (c) (d) and (e) separated?
>
> Ok. I will separate the patches.

Thanks!

> > > I don't have a configuration I can test this but it compiles cleanly
> > > and it should work, the code is very similar to the SMP trampoline,
> >
> > I assume you have a configuration to test on now?
>
> Eric did not have one, but I have now tested it on two configurations.
> I think that's good enough, isn't it?

Should be... if it is in -mm for long enough. Unfortunately very
few people are testing x86-64 on notebooks :-(. I guess I should
force myself to install a 64-bit distro on the old arima here...

> > > -static pgd_t low_ptr;
> > > -
> > > -static void init_low_mapping(void)
> > > -{
> > > - pgd_t *slot0 = pgd_offset(current->mm, 0UL);
> > > - low_ptr = *slot0;
> > > - set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
> > > - WARN_ON(num_online_cpus() != 1);
> > > - local_flush_tlb();
> > > -}
> > > -
> >
> > So you no longer need identity mapping? Isn't it specified that when
> > you transition between modes, you should do that while in identity
> > mapping?
> >
>
> I am not sure where these mappings were required in the first place.
> While going to the sleep state? While resuming we use the wakeup page
> tables, and they already have identity mappings, so it should not be
> an issue.

Ok, I guess that's okay.

> > > @@ -228,25 +206,10 @@ wakeup_long64:
> > > .align 64
> > > gdta:
> > > .word 0, 0, 0, 0 # dummy
> > > -
> > > - .word 0, 0, 0, 0 # unused
> > > -
> > > - .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
> > > - .word 0 # base address = 0
> > > - .word 0x9B00 # code read/exec. ??? Why I need 0x9B00 (as opposed to 0x9A00 in order for this to work?)
> > > - .word 0x00CF # granularity = 4096, 386
> > > - # (+5th nibble of limit)
> > > -
> > > - .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
> > > - .word 0 # base address = 0
> > > - .word 0x9200 # data read/write
> > > - .word 0x00CF # granularity = 4096, 386
> > > - # (+5th nibble of limit)
> > > -# this is 64bit descriptor for code
> > > - .word 0xFFFF
> > > - .word 0
> > > - .word 0x9A00 # code read/exec
> > > - .word 0x00AF # as above, but it is long mode and with D=0
> > > + /* ??? Why I need the accessed bit set in order for this to work? */
> > > + .quad 0x00cf9b000000ffff # __KERNEL32_CS
> > > + .quad 0x00af9b000000ffff # __KERNEL_CS
> > > + .quad 0x00cf93000000ffff # __KERNEL_DS
> >
> > Why this change, why did you change the values in here, and why did you
> > not tell me about it in the changelog?
>
> I think it has mainly been modified to keep the gdt table consistent
> across the kernel (cpu_gdt_table, trampoline.S and wakeup.S). The
> __KERNEL32_CS entry has been moved up to keep the gdt on the trampoline
> small. This change was done in patch number 7 (cleanup segments).
>
> Secondly, I think it is just a change of form from .word to .quad, which
> is more compact.
>
> Thirdly, I think it does no harm to mark that gdt entry as accessed.
> Eric can elaborate more on it. Patch 7 also has the details.

Ok. Maybe it would be nice to #include the GDTs, too, so they do not
drift apart? Or at least comment "this has to be kept in sync
with..."?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html