2007-02-12 07:37:49

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [1/39] i386: move startup_32() in text.head section


From: Vivek Goyal <[email protected]>

o Entry point startup_32 was in the .text section, but it also accesses some
init data, prompting MODPOST to generate section mismatch warnings.

WARNING: vmlinux - Section mismatch: reference to .init.data:boot_params from
.text between '_text' (at offset 0xc0100029) and 'startup_32_smp'
WARNING: vmlinux - Section mismatch: reference to .init.data:boot_params from
.text between '_text' (at offset 0xc0100037) and 'startup_32_smp'
WARNING: vmlinux - Section mismatch: reference to
.init.data:init_pg_tables_end from .text between '_text' (at offset
0xc0100099) and 'startup_32_smp'

o startup_32 can't be moved to .init.text, as this entry point has to be at
the start of bzImage. Hence startup_32 has been moved to a new section,
.text.head, and MODPOST has been instructed not to generate warnings when
init data is accessed from the .text.head section. This code has been audited.

o The SMP bootup code (startup_32_smp) can go into .init.text if CPU hotplug
is not supported. Otherwise it generates more warnings:

WARNING: vmlinux - Section mismatch: reference to .init.data:new_cpu_data from
.text between 'checkCPUtype' (at offset 0xc0100126) and 'is486'
WARNING: vmlinux - Section mismatch: reference to .init.data:new_cpu_data from
.text between 'checkCPUtype' (at offset 0xc0100130) and 'is486'

Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

arch/i386/kernel/head.S | 17 ++++++++++++++---
arch/i386/kernel/vmlinux.lds.S | 7 ++++++-
scripts/mod/modpost.c | 10 +++++++++-
3 files changed, 29 insertions(+), 5 deletions(-)

Index: linux/arch/i386/kernel/head.S
===================================================================
--- linux.orig/arch/i386/kernel/head.S
+++ linux/arch/i386/kernel/head.S
@@ -53,6 +53,7 @@
* any particular GDT layout, because we load our own as soon as we
* can.
*/
+.section .text.head,"ax",@progbits
ENTRY(startup_32)

#ifdef CONFIG_PARAVIRT
@@ -141,16 +142,25 @@ page_pde_offset = (__PAGE_OFFSET >> 20);
jb 10b
movl %edi,(init_pg_tables_end - __PAGE_OFFSET)

-#ifdef CONFIG_SMP
xorl %ebx,%ebx /* This is the boot CPU (BSP) */
jmp 3f
-
/*
* Non-boot CPU entry point; entered from trampoline.S
* We can't lgdt here, because lgdt itself uses a data segment, but
* we know the trampoline has already loaded the boot_gdt_table GDT
* for us.
+ *
+ * If cpu hotplug is not supported then this code can go in init section
+ * which will be freed later
*/
+
+#ifdef CONFIG_HOTPLUG_CPU
+.section .text,"ax",@progbits
+#else
+.section .init.text,"ax",@progbits
+#endif
+
+#ifdef CONFIG_SMP
ENTRY(startup_32_smp)
cld
movl $(__BOOT_DS),%eax
@@ -208,8 +218,8 @@ ENTRY(startup_32_smp)
xorl %ebx,%ebx
incl %ebx

-3:
#endif /* CONFIG_SMP */
+3:

/*
* Enable paging
@@ -492,6 +502,7 @@ ignore_int:
#endif
iret

+.section .text
#ifdef CONFIG_PARAVIRT
startup_paravirt:
cld
Index: linux/arch/i386/kernel/vmlinux.lds.S
===================================================================
--- linux.orig/arch/i386/kernel/vmlinux.lds.S
+++ linux/arch/i386/kernel/vmlinux.lds.S
@@ -37,9 +37,14 @@ SECTIONS
{
. = LOAD_OFFSET + LOAD_PHYSICAL_ADDR;
phys_startup_32 = startup_32 - LOAD_OFFSET;
+
+ .text.head : AT(ADDR(.text.head) - LOAD_OFFSET) {
+ _text = .; /* Text and read-only data */
+ *(.text.head)
+ } :text = 0x9090
+
/* read-only */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
- _text = .; /* Text and read-only data */
*(.text)
SCHED_TEXT
LOCK_TEXT
Index: linux/scripts/mod/modpost.c
===================================================================
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -641,12 +641,20 @@ static int secref_whitelist(const char *
if (f1 && f2)
return 1;

- /* Whitelist all references from .pci_fixup section if vmlinux */
+ /* Whitelist all references from .pci_fixup section if vmlinux
+ * Whitelist all references from .text.head to .init.data if vmlinux
+ * Whitelist all references from .text.head to .init.text if vmlinux
+ */
if (is_vmlinux(modname)) {
if ((strcmp(fromsec, ".pci_fixup") == 0) &&
(strcmp(tosec, ".init.text") == 0))
return 1;

+ if ((strcmp(fromsec, ".text.head") == 0) &&
+ ((strcmp(tosec, ".init.data") == 0) ||
+ (strcmp(tosec, ".init.text") == 0)))
+ return 1;
+
/* Check for pattern 3 */
for (s = pat3refsym; *s; s++)
if (strcmp(refsymname, *s) == 0)


2007-02-12 07:37:50

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [2/39] x86_64: Break init() in two parts to avoid MODPOST warnings


From: Vivek Goyal <[email protected]>

o init() is a non-__init function in the .text section, but it calls many
functions that are in the .init.text section. Hence MODPOST generates lots
of cross-reference warnings on i386 if compiled with CONFIG_RELOCATABLE=y:

WARNING: vmlinux - Section mismatch: reference to .init.text:smp_prepare_cpus from .text between 'init' (at offset 0xc0101049) and 'rest_init'
WARNING: vmlinux - Section mismatch: reference to .init.text:migration_init from .text between 'init' (at offset 0xc010104e) and 'rest_init'
WARNING: vmlinux - Section mismatch: reference to .init.text:spawn_ksoftirqd from .text between 'init' (at offset 0xc0101053) and 'rest_init'

o This patch breaks init() down into two parts: one part that can go
in the .init.text section and be freed, and another part that has to
be non-__init (init_post()). Now init() calls init_post(), and
init_post() does not call any functions present in .init sections,
which gets rid of the warnings.
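
A minimal sketch of the pattern (illustrative names, assuming <linux/init.h>;
this is not the actual init/main.c code):

	/* Kept in .text and forced out of line so gcc cannot inline it
	 * back into its __init caller; it runs after init memory is
	 * freed, so it must not call anything marked __init. */
	static int noinline late_part(void)
	{
		return 0;
	}

	/* Discarded after boot along with the rest of .init.text; may
	 * call __init functions freely. */
	static int __init early_part(void *unused)
	{
		return late_part();
	}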

Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

init/main.c | 81 +++++++++++++++++++++++++++++++++---------------------------
1 file changed, 45 insertions(+), 36 deletions(-)

Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -713,7 +713,49 @@ static void run_init_process(char *init_
kernel_execve(init_filename, argv_init, envp_init);
}

-static int init(void * unused)
+/* This is a non-__init function. Force it to be noinline, otherwise gcc
+ * inlines it into init() and it becomes part of the .init.text section
+ */
+static int noinline init_post(void)
+{
+ free_initmem();
+ unlock_kernel();
+ mark_rodata_ro();
+ system_state = SYSTEM_RUNNING;
+ numa_default_policy();
+
+ if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
+ printk(KERN_WARNING "Warning: unable to open an initial console.\n");
+
+ (void) sys_dup(0);
+ (void) sys_dup(0);
+
+ if (ramdisk_execute_command) {
+ run_init_process(ramdisk_execute_command);
+ printk(KERN_WARNING "Failed to execute %s\n",
+ ramdisk_execute_command);
+ }
+
+ /*
+ * We try each of these until one succeeds.
+ *
+ * The Bourne shell can be used instead of init if we are
+ * trying to recover a really broken machine.
+ */
+ if (execute_command) {
+ run_init_process(execute_command);
+ printk(KERN_WARNING "Failed to execute %s. Attempting "
+ "defaults...\n", execute_command);
+ }
+ run_init_process("/sbin/init");
+ run_init_process("/etc/init");
+ run_init_process("/bin/init");
+ run_init_process("/bin/sh");
+
+ panic("No init found. Try passing init= option to kernel.");
+}
+
+static int __init init(void * unused)
{
lock_kernel();
/*
@@ -761,39 +803,6 @@ static int init(void * unused)
* we're essentially up and running. Get rid of the
* initmem segments and start the user-mode stuff..
*/
- free_initmem();
- unlock_kernel();
- mark_rodata_ro();
- system_state = SYSTEM_RUNNING;
- numa_default_policy();
-
- if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
- printk(KERN_WARNING "Warning: unable to open an initial console.\n");
-
- (void) sys_dup(0);
- (void) sys_dup(0);
-
- if (ramdisk_execute_command) {
- run_init_process(ramdisk_execute_command);
- printk(KERN_WARNING "Failed to execute %s\n",
- ramdisk_execute_command);
- }
-
- /*
- * We try each of these until one succeeds.
- *
- * The Bourne shell can be used instead of init if we are
- * trying to recover a really broken machine.
- */
- if (execute_command) {
- run_init_process(execute_command);
- printk(KERN_WARNING "Failed to execute %s. Attempting "
- "defaults...\n", execute_command);
- }
- run_init_process("/sbin/init");
- run_init_process("/etc/init");
- run_init_process("/bin/init");
- run_init_process("/bin/sh");
-
- panic("No init found. Try passing init= option to kernel.");
+ init_post();
+ return 0;
}

2007-02-12 07:38:15

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [15/39] x86_64: list x86_64 quilt tree


From: Randy Dunlap <[email protected]>

List x86_64 quilt tree in MAINTAINERS.

Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
MAINTAINERS | 1 +
1 file changed, 1 insertion(+)

Index: linux/MAINTAINERS
===================================================================
--- linux.orig/MAINTAINERS
+++ linux/MAINTAINERS
@@ -3735,6 +3735,7 @@ P: Andi Kleen
M: [email protected]
L: [email protected]
W: http://www.x86-64.org
+T: quilt ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt-current
S: Maintained

YAM DRIVER FOR AX.25

2007-02-12 07:38:19

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [23/39] i386: use smp_call_function_single()


From: Alexey Dobriyan <[email protected]>
It will execute rdmsr and wrmsr only on the cpu we need.
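
For comparison, a sketch of the two calling styles with the 2.6.20-era
signatures (the trailing arguments are the nonatomic and wait flags):

	/* old: interrupt every other CPU; each handler had to check
	 * cmd->cpu before acting */
	smp_call_function(msr_smp_rdmsr, &cmd, 1, 1);

	/* new: interrupt only the CPU we actually care about */
	smp_call_function_single(cpu, msr_smp_rdmsr, &cmd, 1, 1);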

Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

arch/i386/kernel/msr.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)

Index: linux/arch/i386/kernel/msr.c
===================================================================
--- linux.orig/arch/i386/kernel/msr.c
+++ linux/arch/i386/kernel/msr.c
@@ -68,7 +68,6 @@ static inline int rdmsr_eio(u32 reg, u32
#ifdef CONFIG_SMP

struct msr_command {
- int cpu;
int err;
u32 reg;
u32 data[2];
@@ -78,16 +77,14 @@ static void msr_smp_wrmsr(void *cmd_bloc
{
struct msr_command *cmd = (struct msr_command *)cmd_block;

- if (cmd->cpu == smp_processor_id())
- cmd->err = wrmsr_eio(cmd->reg, cmd->data[0], cmd->data[1]);
+ cmd->err = wrmsr_eio(cmd->reg, cmd->data[0], cmd->data[1]);
}

static void msr_smp_rdmsr(void *cmd_block)
{
struct msr_command *cmd = (struct msr_command *)cmd_block;

- if (cmd->cpu == smp_processor_id())
- cmd->err = rdmsr_eio(cmd->reg, &cmd->data[0], &cmd->data[1]);
+ cmd->err = rdmsr_eio(cmd->reg, &cmd->data[0], &cmd->data[1]);
}

static inline int do_wrmsr(int cpu, u32 reg, u32 eax, u32 edx)
@@ -99,12 +96,11 @@ static inline int do_wrmsr(int cpu, u32
if (cpu == smp_processor_id()) {
ret = wrmsr_eio(reg, eax, edx);
} else {
- cmd.cpu = cpu;
cmd.reg = reg;
cmd.data[0] = eax;
cmd.data[1] = edx;

- smp_call_function(msr_smp_wrmsr, &cmd, 1, 1);
+ smp_call_function_single(cpu, msr_smp_wrmsr, &cmd, 1, 1);
ret = cmd.err;
}
preempt_enable();
@@ -120,10 +116,9 @@ static inline int do_rdmsr(int cpu, u32
if (cpu == smp_processor_id()) {
ret = rdmsr_eio(reg, eax, edx);
} else {
- cmd.cpu = cpu;
cmd.reg = reg;

- smp_call_function(msr_smp_rdmsr, &cmd, 1, 1);
+ smp_call_function_single(cpu, msr_smp_rdmsr, &cmd, 1, 1);

*eax = cmd.data[0];
*edx = cmd.data[1];

2007-02-12 07:38:54

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [21/39] i386: rdmsr_on_cpu, wrmsr_on_cpu


From: Alexey Dobriyan <[email protected]>
There was an OpenVZ-specific bug rendering some cpufreq drivers unusable
on SMP. In short, when the cpufreq code thinks it has confined itself to
the needed cpu by means of set_cpus_allowed() in order to execute rdmsr,
a "virtual cpu" feature can migrate the process anywhere. This triggers
BUG_ON()s and does wrong things in general.

This got fixed by introducing rdmsr_on_cpu and wrmsr_on_cpu executing
rdmsr and wrmsr on given physical cpu by means of
smp_call_function_single().

AK: link it into 64bit kernel too because cpufreq drivers use it.
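
A usage sketch of the new helpers (hypothetical call site; the CPU number
and the modified bit are illustrative):

	u32 l, h;

	/* Read/modify/write an MSR on CPU 2, regardless of which CPU
	 * this code is currently running on. */
	rdmsr_on_cpu(2, MSR_IA32_THERM_CONTROL, &l, &h);
	wrmsr_on_cpu(2, MSR_IA32_THERM_CONTROL, l | (1 << 4), h);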

arch/i386/kernel/cpu/cpufreq/p4-clockmod.c | 30 ++----------
arch/i386/lib/Makefile | 2
arch/i386/lib/msr-on-cpu.c | 70 +++++++++++++++++++++++++++++
arch/x86_64/lib/Makefile | 4 +
include/asm-i386/msr.h | 3 +
5 files changed, 84 insertions(+), 25 deletions(-)

Signed-off-by: Andi Kleen <[email protected]>

Index: linux/arch/i386/lib/Makefile
===================================================================
--- linux.orig/arch/i386/lib/Makefile
+++ linux/arch/i386/lib/Makefile
@@ -7,3 +7,5 @@ lib-y = checksum.o delay.o usercopy.o ge
bitops.o semaphore.o

lib-$(CONFIG_X86_USE_3DNOW) += mmx.o
+
+obj-y = msr-on-cpu.o
Index: linux/arch/i386/lib/msr-on-cpu.c
===================================================================
--- /dev/null
+++ linux/arch/i386/lib/msr-on-cpu.c
@@ -0,0 +1,70 @@
+#include <linux/module.h>
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <asm/msr.h>
+
+#ifdef CONFIG_SMP
+struct msr_info {
+ u32 msr_no;
+ u32 l, h;
+};
+
+static void __rdmsr_on_cpu(void *info)
+{
+ struct msr_info *rv = info;
+
+ rdmsr(rv->msr_no, rv->l, rv->h);
+}
+
+void rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h)
+{
+ preempt_disable();
+ if (smp_processor_id() == cpu)
+ rdmsr(msr_no, *l, *h);
+ else {
+ struct msr_info rv;
+
+ rv.msr_no = msr_no;
+ smp_call_function_single(cpu, __rdmsr_on_cpu, &rv, 0, 1);
+ *l = rv.l;
+ *h = rv.h;
+ }
+ preempt_enable();
+}
+
+static void __wrmsr_on_cpu(void *info)
+{
+ struct msr_info *rv = info;
+
+ wrmsr(rv->msr_no, rv->l, rv->h);
+}
+
+void wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h)
+{
+ preempt_disable();
+ if (smp_processor_id() == cpu)
+ wrmsr(msr_no, l, h);
+ else {
+ struct msr_info rv;
+
+ rv.msr_no = msr_no;
+ rv.l = l;
+ rv.h = h;
+ smp_call_function_single(cpu, __wrmsr_on_cpu, &rv, 0, 1);
+ }
+ preempt_enable();
+}
+#else
+void rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h)
+{
+ rdmsr(msr_no, *l, *h);
+}
+
+void wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h)
+{
+ wrmsr(msr_no, l, h);
+}
+#endif
+
+EXPORT_SYMBOL(rdmsr_on_cpu);
+EXPORT_SYMBOL(wrmsr_on_cpu);
Index: linux/include/asm-i386/msr.h
===================================================================
--- linux.orig/include/asm-i386/msr.h
+++ linux/include/asm-i386/msr.h
@@ -83,6 +83,9 @@ static inline void wrmsrl (unsigned long
: "c" (counter))
#endif /* !CONFIG_PARAVIRT */

+void rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
+void wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
+
/* symbolic names for some interesting MSRs */
/* Intel defined MSRs. */
#define MSR_IA32_P5_MC_ADDR 0
Index: linux/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c
+++ linux/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c
@@ -62,7 +62,7 @@ static int cpufreq_p4_setdc(unsigned int
if (!cpu_online(cpu) || (newstate > DC_DISABLE) || (newstate == DC_RESV))
return -EINVAL;

- rdmsr(MSR_IA32_THERM_STATUS, l, h);
+ rdmsr_on_cpu(cpu, MSR_IA32_THERM_STATUS, &l, &h);

if (l & 0x01)
dprintk("CPU#%d currently thermal throttled\n", cpu);
@@ -70,10 +70,10 @@ static int cpufreq_p4_setdc(unsigned int
if (has_N44_O17_errata[cpu] && (newstate == DC_25PT || newstate == DC_DFLT))
newstate = DC_38PT;

- rdmsr(MSR_IA32_THERM_CONTROL, l, h);
+ rdmsr_on_cpu(cpu, MSR_IA32_THERM_CONTROL, &l, &h);
if (newstate == DC_DISABLE) {
dprintk("CPU#%d disabling modulation\n", cpu);
- wrmsr(MSR_IA32_THERM_CONTROL, l & ~(1<<4), h);
+ wrmsr_on_cpu(cpu, MSR_IA32_THERM_CONTROL, l & ~(1<<4), h);
} else {
dprintk("CPU#%d setting duty cycle to %d%%\n",
cpu, ((125 * newstate) / 10));
@@ -84,7 +84,7 @@ static int cpufreq_p4_setdc(unsigned int
*/
l = (l & ~14);
l = l | (1<<4) | ((newstate & 0x7)<<1);
- wrmsr(MSR_IA32_THERM_CONTROL, l, h);
+ wrmsr_on_cpu(cpu, MSR_IA32_THERM_CONTROL, l, h);
}

return 0;
@@ -111,7 +111,6 @@ static int cpufreq_p4_target(struct cpuf
{
unsigned int newstate = DC_RESV;
struct cpufreq_freqs freqs;
- cpumask_t cpus_allowed;
int i;

if (cpufreq_frequency_table_target(policy, &p4clockmod_table[0], target_freq, relation, &newstate))
@@ -132,17 +131,8 @@ static int cpufreq_p4_target(struct cpuf
/* run on each logical CPU, see section 13.15.3 of IA32 Intel Architecture Software
* Developer's Manual, Volume 3
*/
- cpus_allowed = current->cpus_allowed;
-
- for_each_cpu_mask(i, policy->cpus) {
- cpumask_t this_cpu = cpumask_of_cpu(i);
-
- set_cpus_allowed(current, this_cpu);
- BUG_ON(smp_processor_id() != i);
-
+ for_each_cpu_mask(i, policy->cpus)
cpufreq_p4_setdc(i, p4clockmod_table[newstate].index);
- }
- set_cpus_allowed(current, cpus_allowed);

/* notifiers */
for_each_cpu_mask(i, policy->cpus) {
@@ -256,17 +246,9 @@ static int cpufreq_p4_cpu_exit(struct cp

static unsigned int cpufreq_p4_get(unsigned int cpu)
{
- cpumask_t cpus_allowed;
u32 l, h;

- cpus_allowed = current->cpus_allowed;
-
- set_cpus_allowed(current, cpumask_of_cpu(cpu));
- BUG_ON(smp_processor_id() != cpu);
-
- rdmsr(MSR_IA32_THERM_CONTROL, l, h);
-
- set_cpus_allowed(current, cpus_allowed);
+ rdmsr_on_cpu(cpu, MSR_IA32_THERM_CONTROL, &l, &h);

if (l & 0x10) {
l = l >> 1;
Index: linux/arch/x86_64/lib/Makefile
===================================================================
--- linux.orig/arch/x86_64/lib/Makefile
+++ linux/arch/x86_64/lib/Makefile
@@ -4,10 +4,12 @@

CFLAGS_csum-partial.o := -funroll-loops

-obj-y := io.o iomap_copy.o
+obj-y := io.o iomap_copy.o msr-on-cpu.o

lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
usercopy.o getuser.o putuser.o \
thunk.o clear_page.o copy_page.o bitstr.o bitops.o
lib-y += memcpy.o memmove.o memset.o copy_user.o rwlock.o copy_user_nocache.o
lib-y += memcpy_uncached_read.o
+
+msr-on-cpu-y += ../../i386/lib/msr-on-cpu.o

2007-02-12 07:38:56

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [30/39] x86_64: Check return value of putreg in PTRACE_SETREGS


This means that if an illegal value is set for the segment registers,
ptrace will now error out with an errno instead of silently ignoring
it.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/kernel/ptrace.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/ptrace.c
===================================================================
--- linux.orig/arch/x86_64/kernel/ptrace.c
+++ linux/arch/x86_64/kernel/ptrace.c
@@ -536,8 +536,12 @@ long arch_ptrace(struct task_struct *chi
}
ret = 0;
for (ui = 0; ui < sizeof(struct user_regs_struct); ui += sizeof(long)) {
- ret |= __get_user(tmp, (unsigned long __user *) data);
- putreg(child, ui, tmp);
+ ret = __get_user(tmp, (unsigned long __user *) data);
+ if (ret)
+ break;
+ ret = putreg(child, ui, tmp);
+ if (ret)
+ break;
data += sizeof(long);
}
break;

2007-02-12 07:38:58

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [28/39] i386: fix size_or_mask and size_and_mask


From: "Andreas Herrmann" <[email protected]>
mtrr: fix size_or_mask and size_and_mask

This fixes two bugs in the /proc/mtrr interface:
o If the physical address size crosses the 44-bit boundary,
size_or_mask is evaluated incorrectly.
o size_and_mask limits the width of an MTRR's physical base
address to less than 44 bits.
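
A worked example of the first bug (a sketch, assuming PAGE_SHIFT == 12):
a CPU reporting 48 physical address bits needs a shift count of
48 - 12 = 36, which overflows the old 32-bit arithmetic:

	/* old, u32: shifting a 32-bit 1 by 36 is undefined/truncated */
	size_or_mask = ~((1 << (48 - 12)) - 1);

	/* new, u64: */
	size_or_mask  = ~((1ULL << 36) - 1);             /* 0xfffffff000000000 */
	size_and_mask = ~size_or_mask & 0xfffff00000ULL; /* 0x0000000ffff00000 */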

TBD: later patch had one more change, but I think that was bogus.
TBD: need to double check

Signed-off-by: Andreas Herrmann <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/kernel/cpu/mtrr/main.c | 6 +++---
arch/i386/kernel/cpu/mtrr/mtrr.h | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux/arch/i386/kernel/cpu/mtrr/main.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/mtrr/main.c
+++ linux/arch/i386/kernel/cpu/mtrr/main.c
@@ -50,7 +50,7 @@ u32 num_var_ranges = 0;
unsigned int *usage_table;
static DEFINE_MUTEX(mtrr_mutex);

-u32 size_or_mask, size_and_mask;
+u64 size_or_mask, size_and_mask;

static struct mtrr_ops * mtrr_ops[X86_VENDOR_NUM] = {};

@@ -662,8 +662,8 @@ void __init mtrr_bp_init(void)
boot_cpu_data.x86_mask == 0x4))
phys_addr = 36;

- size_or_mask = ~((1 << (phys_addr - PAGE_SHIFT)) - 1);
- size_and_mask = ~size_or_mask & 0xfff00000;
+ size_or_mask = ~((1ULL << (phys_addr - PAGE_SHIFT)) - 1);
+ size_and_mask = ~size_or_mask & 0xfffff00000ULL;
} else if (boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR &&
boot_cpu_data.x86 == 6) {
/* VIA C* family have Intel style MTRRs, but
Index: linux/arch/i386/kernel/cpu/mtrr/mtrr.h
===================================================================
--- linux.orig/arch/i386/kernel/cpu/mtrr/mtrr.h
+++ linux/arch/i386/kernel/cpu/mtrr/mtrr.h
@@ -84,7 +84,7 @@ void get_mtrr_state(void);

extern void set_mtrr_ops(struct mtrr_ops * ops);

-extern u32 size_or_mask, size_and_mask;
+extern u64 size_or_mask, size_and_mask;
extern struct mtrr_ops * mtrr_if;

#define is_cpu(vnd) (mtrr_if && mtrr_if->vendor == X86_VENDOR_##vnd)

2007-02-12 07:38:59

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [29/39] x86_64: - Ignore long SMI interrupts in clock calibration code - update 1


From: Jack Steiner <[email protected]>
Add failsafe mechanism to HPET/TSC clock calibration.

Signed-off-by: Jack Steiner <[email protected]>

Updated to include failsafe mechanism & additional community feedback.
Patch built on latest 2.6.20-rc4-mm1 tree.
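
The idea behind the bracketing, as a sketch (same shape as the hunk below):
if an SMI lands inside the sequence, the TSC delta blows past TICK_MIN and
the sample is thrown away; the new retry cap bounds the loop on platforms
where the delta never settles:

	do {
		tsc1 = get_cycles_sync();
		hpet1 = hpet_readl(HPET_COUNTER);	/* the value we want */
		tsc2 = get_cycles_sync();
	} while (tsc2 - tsc1 > TICK_MIN && retries++ < MAX_READ_RETRIES);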

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/kernel/time.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/time.c
===================================================================
--- linux.orig/arch/x86_64/kernel/time.c
+++ linux/arch/x86_64/kernel/time.c
@@ -657,6 +657,7 @@ core_initcall(cpufreq_tsc);

#define TICK_COUNT 100000000
#define TICK_MIN 5000
+#define MAX_READ_RETRIES 5

/*
* Some platforms take periodic SMI interrupts with 5ms duration. Make sure none
@@ -664,13 +665,17 @@ core_initcall(cpufreq_tsc);
*/
static void __init read_hpet_tsc(int *hpet, int *tsc)
{
- int tsc1, tsc2, hpet1;
+ int tsc1, tsc2, hpet1, retries = 0;
+ static int msg;

do {
tsc1 = get_cycles_sync();
hpet1 = hpet_readl(HPET_COUNTER);
tsc2 = get_cycles_sync();
- } while (tsc2 - tsc1 > TICK_MIN);
+ } while (tsc2 - tsc1 > TICK_MIN && retries++ < MAX_READ_RETRIES);
+ if (retries >= MAX_READ_RETRIES && !msg++)
+ printk(KERN_WARNING
+ "hpet.c: exceeded max retries to read HPET & TSC\n");
*hpet = hpet1;
*tsc = tsc2;
}

2007-02-12 07:39:34

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [38/39] x86: fix laptop bootup hang in init_acpi()


From: Ingo Molnar <[email protected]>

During kernel bootup, a new T60 laptop (CoreDuo, 32-bit) hangs about
10%-20% of the time in acpi_init():

Calling initcall 0xc055ce1a: topology_init+0x0/0x2f()
Calling initcall 0xc055d75e: mtrr_init_finialize+0x0/0x2c()
Calling initcall 0xc05664f3: param_sysfs_init+0x0/0x175()
Calling initcall 0xc014cb65: pm_sysrq_init+0x0/0x17()
Calling initcall 0xc0569f99: init_bio+0x0/0xf4()
Calling initcall 0xc056b865: genhd_device_init+0x0/0x50()
Calling initcall 0xc056c4bd: fbmem_init+0x0/0x87()
Calling initcall 0xc056dd74: acpi_init+0x0/0x1ee()

It's a hard hang that not even an NMI could punch through! Frustratingly,
adding printks or function tracing to the ACPI code made the hangs go away
...

After some time an additional detail emerged: disabling the NMI watchdog
made these occasional hangs go away.

So I spent the better part of today trying to debug this and trying out
various theories when I finally found the likely reason for the hang: if
acpi_ns_initialize_devices() executes an _INI AML method and an NMI
happens to hit that AML execution at the wrong moment, the machine would
hang. (My theory is that this must be some sort of chipset setup method
doing stores to chipset MMIO registers?)

Unfortunately, given the characteristics of the hang, it was simply
impossible to figure out which of the numerous AML methods is impacted
by this problem.

As a workaround I wrote an interface to disable chipset-based NMIs while
executing _INI sections - and indeed this fixed the hang. I did a
boot-loop of 100 separate reboots and none hung - while without the patch
it would hang every 5-10 attempts. Out of caution I did not touch the
nmi_watchdog=2 case (it's not related to the chipset anyway and didn't
hang).

I implemented this for both x86_64 and i686, tested the i686 laptop both
with nmi_watchdog=1 [which triggered the hangs] and nmi_watchdog=2, and
tested an Athlon64 box with the 64-bit kernel as well. Everything builds
and works with the patch applied.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Len Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/i386/kernel/nmi.c | 28 ++++++++++++++++++++++++++++
arch/x86_64/kernel/nmi.c | 27 +++++++++++++++++++++++++++
drivers/acpi/namespace/nsinit.c | 9 +++++++++
include/linux/nmi.h | 9 ++++++++-
4 files changed, 72 insertions(+), 1 deletion(-)

Index: linux/arch/i386/kernel/nmi.c
===================================================================
--- linux.orig/arch/i386/kernel/nmi.c
+++ linux/arch/i386/kernel/nmi.c
@@ -383,6 +383,34 @@ void enable_timer_nmi_watchdog(void)
}
}

+static void __acpi_nmi_disable(void *__unused)
+{
+ apic_write_around(APIC_LVT0, APIC_DM_NMI | APIC_LVT_MASKED);
+}
+
+/*
+ * Disable timer based NMIs on all CPUs:
+ */
+void acpi_nmi_disable(void)
+{
+ if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC)
+ on_each_cpu(__acpi_nmi_disable, NULL, 0, 1);
+}
+
+static void __acpi_nmi_enable(void *__unused)
+{
+ apic_write_around(APIC_LVT0, APIC_DM_NMI);
+}
+
+/*
+ * Enable timer based NMIs on all CPUs:
+ */
+void acpi_nmi_enable(void)
+{
+ if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC)
+ on_each_cpu(__acpi_nmi_enable, NULL, 0, 1);
+}
+
#ifdef CONFIG_PM

static int nmi_pm_active; /* nmi_active before suspend */
Index: linux/arch/x86_64/kernel/nmi.c
===================================================================
--- linux.orig/arch/x86_64/kernel/nmi.c
+++ linux/arch/x86_64/kernel/nmi.c
@@ -368,6 +368,33 @@ void enable_timer_nmi_watchdog(void)
}
}

+static void __acpi_nmi_disable(void *__unused)
+{
+ apic_write(APIC_LVT0, APIC_DM_NMI | APIC_LVT_MASKED);
+}
+
+/*
+ * Disable timer based NMIs on all CPUs:
+ */
+void acpi_nmi_disable(void)
+{
+ if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC)
+ on_each_cpu(__acpi_nmi_disable, NULL, 0, 1);
+}
+
+static void __acpi_nmi_enable(void *__unused)
+{
+ apic_write(APIC_LVT0, APIC_DM_NMI);
+}
+
+/*
+ * Enable timer based NMIs on all CPUs:
+ */
+void acpi_nmi_enable(void)
+{
+ if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC)
+ on_each_cpu(__acpi_nmi_enable, NULL, 0, 1);
+}
#ifdef CONFIG_PM

static int nmi_pm_active; /* nmi_active before suspend */
Index: linux/drivers/acpi/namespace/nsinit.c
===================================================================
--- linux.orig/drivers/acpi/namespace/nsinit.c
+++ linux/drivers/acpi/namespace/nsinit.c
@@ -45,6 +45,7 @@
#include <acpi/acnamesp.h>
#include <acpi/acdispat.h>
#include <acpi/acinterp.h>
+#include <linux/nmi.h>

#define _COMPONENT ACPI_NAMESPACE
ACPI_MODULE_NAME("nsinit")
@@ -534,7 +535,15 @@ acpi_ns_init_one_device(acpi_handle obj_
info->parameter_type = ACPI_PARAM_ARGS;
info->flags = ACPI_IGNORE_RETURN_VALUE;

+ /*
+ * Some hardware relies on this being executed as atomically
+ * as possible (without an NMI being received in the middle of
+ * this) - so disable NMIs and initialize the device:
+ */
+ acpi_nmi_disable();
status = acpi_ns_evaluate(info);
+ acpi_nmi_enable();
+
if (ACPI_SUCCESS(status)) {
walk_info->num_INI++;

Index: linux/include/linux/nmi.h
===================================================================
--- linux.orig/include/linux/nmi.h
+++ linux/include/linux/nmi.h
@@ -17,8 +17,15 @@
#ifdef ARCH_HAS_NMI_WATCHDOG
#include <asm/nmi.h>
extern void touch_nmi_watchdog(void);
+extern void acpi_nmi_disable(void);
+extern void acpi_nmi_enable(void);
#else
-# define touch_nmi_watchdog() touch_softlockup_watchdog()
+static inline void touch_nmi_watchdog(void)
+{
+ touch_softlockup_watchdog();
+}
+static inline void acpi_nmi_disable(void) { }
+static inline void acpi_nmi_enable(void) { }
#endif

#ifndef trigger_all_cpu_backtrace

2007-02-12 07:40:19

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [16/39] x86: simplify notify_page_fault()


From: "Jan Beulich" <[email protected]>
Remove all parameters from this function that aren't really variable.

Signed-off-by: Jan Beulich <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/mm/fault.c | 18 ++++++++----------
arch/x86_64/mm/fault.c | 18 ++++++++----------
2 files changed, 16 insertions(+), 20 deletions(-)

Index: linux/arch/i386/mm/fault.c
===================================================================
--- linux.orig/arch/i386/mm/fault.c
+++ linux/arch/i386/mm/fault.c
@@ -46,17 +46,17 @@ int unregister_page_fault_notifier(struc
}
EXPORT_SYMBOL_GPL(unregister_page_fault_notifier);

-static inline int notify_page_fault(enum die_val val, const char *str,
- struct pt_regs *regs, long err, int trap, int sig)
+static inline int notify_page_fault(struct pt_regs *regs, long err)
{
struct die_args args = {
.regs = regs,
- .str = str,
+ .str = "page fault",
.err = err,
- .trapnr = trap,
- .signr = sig
+ .trapnr = 14,
+ .signr = SIGSEGV
};
- return atomic_notifier_call_chain(&notify_page_fault_chain, val, &args);
+ return atomic_notifier_call_chain(&notify_page_fault_chain,
+ DIE_PAGE_FAULT, &args);
}

/*
@@ -353,8 +353,7 @@ fastcall void __kprobes do_page_fault(st
if (unlikely(address >= TASK_SIZE)) {
if (!(error_code & 0x0000000d) && vmalloc_fault(address) >= 0)
return;
- if (notify_page_fault(DIE_PAGE_FAULT, "page fault", regs, error_code, 14,
- SIGSEGV) == NOTIFY_STOP)
+ if (notify_page_fault(regs, error_code) == NOTIFY_STOP)
return;
/*
* Don't take the mm semaphore here. If we fixup a prefetch
@@ -363,8 +362,7 @@ fastcall void __kprobes do_page_fault(st
goto bad_area_nosemaphore;
}

- if (notify_page_fault(DIE_PAGE_FAULT, "page fault", regs, error_code, 14,
- SIGSEGV) == NOTIFY_STOP)
+ if (notify_page_fault(regs, error_code) == NOTIFY_STOP)
return;

/* It's safe to allow irq's after cr2 has been saved and the vmalloc
Index: linux/arch/x86_64/mm/fault.c
===================================================================
--- linux.orig/arch/x86_64/mm/fault.c
+++ linux/arch/x86_64/mm/fault.c
@@ -56,17 +56,17 @@ int unregister_page_fault_notifier(struc
}
EXPORT_SYMBOL_GPL(unregister_page_fault_notifier);

-static inline int notify_page_fault(enum die_val val, const char *str,
- struct pt_regs *regs, long err, int trap, int sig)
+static inline int notify_page_fault(struct pt_regs *regs, long err)
{
struct die_args args = {
.regs = regs,
- .str = str,
+ .str = "page fault",
.err = err,
- .trapnr = trap,
- .signr = sig
+ .trapnr = 14,
+ .signr = SIGSEGV
};
- return atomic_notifier_call_chain(&notify_page_fault_chain, val, &args);
+ return atomic_notifier_call_chain(&notify_page_fault_chain,
+ DIE_PAGE_FAULT, &args);
}

void bust_spinlocks(int yes)
@@ -376,8 +376,7 @@ asmlinkage void __kprobes do_page_fault(
if (vmalloc_fault(address) >= 0)
return;
}
- if (notify_page_fault(DIE_PAGE_FAULT, "page fault", regs, error_code, 14,
- SIGSEGV) == NOTIFY_STOP)
+ if (notify_page_fault(regs, error_code) == NOTIFY_STOP)
return;
/*
* Don't take the mm semaphore here. If we fixup a prefetch
@@ -386,8 +385,7 @@ asmlinkage void __kprobes do_page_fault(
goto bad_area_nosemaphore;
}

- if (notify_page_fault(DIE_PAGE_FAULT, "page fault", regs, error_code, 14,
- SIGSEGV) == NOTIFY_STOP)
+ if (notify_page_fault(regs, error_code) == NOTIFY_STOP)
return;

if (likely(regs->eflags & X86_EFLAGS_IF))

2007-02-12 07:40:19

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [17/39] x86_64: Tighten mce_amd driver MSR reads


From: "Jan Beulich" <[email protected]>

While debugging an unrelated problem in Xen, I noticed odd reads from
non-existent MSRs. Having now found time to look into why these happen, I
came up with the patch below, which
- prevents accessing MCi_MISCj with j > 0 when the block pointer in
MCi_MISC0 is zero
- accesses only contiguous MCi_MISCj until a non-implemented one is
found
- doesn't touch unimplemented blocks in mce_threshold_interrupt at all
- gives names to two bits previously derived from MASK_VALID_HI (it
took me some time to understand the code without this)

The first three items, besides being apparently closer to the spec, should
also help cut down on the time mce_threshold_interrupt() takes.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/kernel/mce_amd.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)

Index: linux/arch/x86_64/kernel/mce_amd.c
===================================================================
--- linux.orig/arch/x86_64/kernel/mce_amd.c
+++ linux/arch/x86_64/kernel/mce_amd.c
@@ -37,6 +37,8 @@
#define THRESHOLD_MAX 0xFFF
#define INT_TYPE_APIC 0x00020000
#define MASK_VALID_HI 0x80000000
+#define MASK_CNTP_HI 0x40000000
+#define MASK_LOCKED_HI 0x20000000
#define MASK_LVTOFF_HI 0x00F00000
#define MASK_COUNT_EN_HI 0x00080000
#define MASK_INT_TYPE_HI 0x00060000
@@ -122,14 +124,17 @@ void __cpuinit mce_amd_feature_init(stru
for (block = 0; block < NR_BLOCKS; ++block) {
if (block == 0)
address = MSR_IA32_MC0_MISC + bank * 4;
- else if (block == 1)
- address = MCG_XBLK_ADDR
- + ((low & MASK_BLKPTR_LO) >> 21);
+ else if (block == 1) {
+ address = (low & MASK_BLKPTR_LO) >> 21;
+ if (!address)
+ break;
+ address += MCG_XBLK_ADDR;
+ }
else
++address;

if (rdmsr_safe(address, &low, &high))
- continue;
+ break;

if (!(high & MASK_VALID_HI)) {
if (block)
@@ -138,8 +143,8 @@ void __cpuinit mce_amd_feature_init(stru
break;
}

- if (!(high & MASK_VALID_HI >> 1) ||
- (high & MASK_VALID_HI >> 2))
+ if (!(high & MASK_CNTP_HI) ||
+ (high & MASK_LOCKED_HI))
continue;

if (!block)
@@ -187,17 +192,22 @@ asmlinkage void mce_threshold_interrupt(

/* assume first bank caused it */
for (bank = 0; bank < NR_BANKS; ++bank) {
+ if (!(per_cpu(bank_map, m.cpu) & (1 << bank)))
+ continue;
for (block = 0; block < NR_BLOCKS; ++block) {
if (block == 0)
address = MSR_IA32_MC0_MISC + bank * 4;
- else if (block == 1)
- address = MCG_XBLK_ADDR
- + ((low & MASK_BLKPTR_LO) >> 21);
+ else if (block == 1) {
+ address = (low & MASK_BLKPTR_LO) >> 21;
+ if (!address)
+ break;
+ address += MCG_XBLK_ADDR;
+ }
else
++address;

if (rdmsr_safe(address, &low, &high))
- continue;
+ break;

if (!(high & MASK_VALID_HI)) {
if (block)
@@ -206,8 +216,8 @@ asmlinkage void mce_threshold_interrupt(
break;
}

- if (!(high & MASK_VALID_HI >> 1) ||
- (high & MASK_VALID_HI >> 2))
+ if (!(high & MASK_CNTP_HI) ||
+ (high & MASK_LOCKED_HI))
continue;

if (high & MASK_OVERFLOW_HI) {
@@ -385,7 +395,7 @@ static __cpuinit int allocate_threshold_
return 0;

if (rdmsr_safe(address, &low, &high))
- goto recurse;
+ return 0;

if (!(high & MASK_VALID_HI)) {
if (block)
@@ -394,8 +404,8 @@ static __cpuinit int allocate_threshold_
return 0;
}

- if (!(high & MASK_VALID_HI >> 1) ||
- (high & MASK_VALID_HI >> 2))
+ if (!(high & MASK_CNTP_HI) ||
+ (high & MASK_LOCKED_HI))
goto recurse;

b = kzalloc(sizeof(struct threshold_block), GFP_KERNEL);

2007-02-12 07:41:00

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [34/39] i386: Use stack arguments for calling into EFI


When calling into the EFI firmware, the parameters need to be passed on
the stack. The recent change to use -mregparm=3 breaks x86 EFI support.
This patch is needed to allow the new Intel-based Macs to suspend to ram
(efi.get_time is called during the suspend phase).
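
Background, as a sketch (the i386 definition of asmlinkage at the time,
from include/asm-i386/linkage.h, is assumed here):

	#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))

	/* With the typedefs below marked asmlinkage, a call through
	 * e.g. efi.get_time(&tm, &cap) passes both pointers on the
	 * stack instead of in %eax/%edx, which is what the firmware
	 * expects. */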

Signed-off-by: Frederic Riss <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/efi.h | 43 +++++++++++++++++++++++++++----------------
1 file changed, 27 insertions(+), 16 deletions(-)

Index: linux/include/linux/efi.h
===================================================================
--- linux.orig/include/linux/efi.h
+++ linux/include/linux/efi.h
@@ -157,22 +157,33 @@ typedef struct {
unsigned long reset_system;
} efi_runtime_services_t;

-typedef efi_status_t efi_get_time_t (efi_time_t *tm, efi_time_cap_t *tc);
-typedef efi_status_t efi_set_time_t (efi_time_t *tm);
-typedef efi_status_t efi_get_wakeup_time_t (efi_bool_t *enabled, efi_bool_t *pending,
- efi_time_t *tm);
-typedef efi_status_t efi_set_wakeup_time_t (efi_bool_t enabled, efi_time_t *tm);
-typedef efi_status_t efi_get_variable_t (efi_char16_t *name, efi_guid_t *vendor, u32 *attr,
- unsigned long *data_size, void *data);
-typedef efi_status_t efi_get_next_variable_t (unsigned long *name_size, efi_char16_t *name,
- efi_guid_t *vendor);
-typedef efi_status_t efi_set_variable_t (efi_char16_t *name, efi_guid_t *vendor,
- unsigned long attr, unsigned long data_size,
- void *data);
-typedef efi_status_t efi_get_next_high_mono_count_t (u32 *count);
-typedef void efi_reset_system_t (int reset_type, efi_status_t status,
- unsigned long data_size, efi_char16_t *data);
-typedef efi_status_t efi_set_virtual_address_map_t (unsigned long memory_map_size,
+typedef asmlinkage efi_status_t efi_get_time_t (efi_time_t *tm,
+ efi_time_cap_t *tc);
+typedef asmlinkage efi_status_t efi_set_time_t (efi_time_t *tm);
+typedef asmlinkage efi_status_t efi_get_wakeup_time_t (efi_bool_t *enabled,
+ efi_bool_t *pending,
+ efi_time_t *tm);
+typedef asmlinkage efi_status_t efi_set_wakeup_time_t (efi_bool_t enabled,
+ efi_time_t *tm);
+typedef asmlinkage efi_status_t efi_get_variable_t (efi_char16_t *name,
+ efi_guid_t *vendor,
+ u32 *attr,
+ unsigned long *data_size,
+ void *data);
+typedef asmlinkage efi_status_t efi_get_next_variable_t (unsigned long *name_sz,
+ efi_char16_t *name,
+ efi_guid_t *vendor);
+typedef asmlinkage efi_status_t efi_set_variable_t (efi_char16_t *name,
+ efi_guid_t *vendor,
+ unsigned long attr,
+ unsigned long data_size,
+ void *data);
+typedef asmlinkage efi_status_t efi_get_next_high_mono_count_t (u32 *count);
+typedef asmlinkage void efi_reset_system_t (int reset_type,
+ efi_status_t status,
+ unsigned long data_size,
+ efi_char16_t *data);
+typedef asmlinkage efi_status_t efi_set_virtual_address_map_t (unsigned long memory_map_size,
unsigned long descriptor_size,
u32 descriptor_version,
efi_memory_desc_t *virtual_map);

2007-02-12 07:41:00

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [37/39] x86_64: robustify bad_dma_address handling


From: Muli Ben-Yehuda <[email protected]>

- set bad_dma_address explicitly to 0x0
- reserve 32 pages from bad_dma_address and up
- WARN_ON() a driver feeding us bad_dma_address
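
In numbers (a sketch of the check added below, using the patch's constants):
with bad_dma_address == 0x0 and EMERGENCY_PAGES == 32 4K pages, DMA
addresses 0x0-0x1ffff are reserved in the TCE table and never handed out,
so any dma_addr in that window seen at unmap time must be a failed mapping
the driver ignored:

	badend = bad_dma_address + (EMERGENCY_PAGES * PAGE_SIZE); /* 0x20000 */
	if (dma_addr >= bad_dma_address && dma_addr < badend)
		WARN_ON(1);	/* driver re-used a failed mapping */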

Thanks to Leo Duran <[email protected]> for the suggestion.

Signed-off-by: Muli Ben-Yehuda <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Leo Duran <[email protected]>
Cc: Jon Mason <[email protected]>
---
arch/x86_64/kernel/pci-calgary.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/pci-calgary.c
===================================================================
--- linux.orig/arch/x86_64/kernel/pci-calgary.c
+++ linux/arch/x86_64/kernel/pci-calgary.c
@@ -138,6 +138,8 @@ static const unsigned long phb_debug_off

#define PHB_DEBUG_STUFF_OFFSET 0x0020

+#define EMERGENCY_PAGES 32 /* = 128KB */
+
unsigned int specified_table_size = TCE_TABLE_SIZE_UNSPECIFIED;
static int translate_empty_slots __read_mostly = 0;
static int calgary_detected __read_mostly = 0;
@@ -296,6 +298,16 @@ static void __iommu_free(struct iommu_ta
{
unsigned long entry;
unsigned long badbit;
+ unsigned long badend;
+
+ /* were we called with bad_dma_address? */
+ badend = bad_dma_address + (EMERGENCY_PAGES * PAGE_SIZE);
+ if (unlikely((dma_addr >= bad_dma_address) && (dma_addr < badend))) {
+ printk(KERN_ERR "Calgary: driver tried unmapping bad DMA "
+ "address 0x%Lx\n", dma_addr);
+ WARN_ON(1);
+ return;
+ }

entry = dma_addr >> PAGE_SHIFT;

@@ -656,8 +668,8 @@ static void __init calgary_reserve_regio
u64 start;
struct iommu_table *tbl = dev->sysdata;

- /* reserve bad_dma_address in case it's a legal address */
- iommu_range_reserve(tbl, bad_dma_address, 1);
+ /* reserve EMERGENCY_PAGES from bad_dma_address and up */
+ iommu_range_reserve(tbl, bad_dma_address, EMERGENCY_PAGES);

/* avoid the BIOS/VGA first 640KB-1MB region */
start = (640 * 1024);
@@ -1176,6 +1188,7 @@ int __init calgary_iommu_init(void)
}

force_iommu = 1;
+ bad_dma_address = 0x0;
dma_ops = &calgary_dma_ops;

return 0;

2007-02-12 07:41:01

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [39/39] i386: All Transmeta CPUs have constant TSCs


From: "H. Peter Anvin" <[email protected]>

All Transmeta CPUs ever produced have constant-rate TSCs.

Signed-off-by: H. Peter Anvin <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/i386/kernel/cpu/transmeta.c | 3 +++
1 file changed, 3 insertions(+)

Index: linux/arch/i386/kernel/cpu/transmeta.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/transmeta.c
+++ linux/arch/i386/kernel/cpu/transmeta.c
@@ -72,6 +72,9 @@ static void __cpuinit init_transmeta(str
wrmsr(0x80860004, ~0, uk);
c->x86_capability[0] = cpuid_edx(0x00000001);
wrmsr(0x80860004, cap_mask, uk);
+
+ /* All Transmeta CPUs have a constant TSC */
+ set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability);

/* If we can run i686 user-space code, call us an i686 */
#define USER686 (X86_FEATURE_TSC|X86_FEATURE_CX8|X86_FEATURE_CMOV)

2007-02-12 07:42:15

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [18/39] x86_64: Allow to run a program when a machine check event is detected


When a machine check event is detected (including an AMD RevF threshold
overflow event), a "trigger" program can be run. This allows user space
to react to such events sooner.

The trigger is configured using a new trigger entry in the
machinecheck sysfs interface. It is currently shared between
all CPUs.

I also fixed the AMD threshold handler to run the machine
check polling code immediately to actually log any events
that might have caused the threshold interrupt.

Also added some documentation for the mce sysfs interface.
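
A sketch of the resulting call_usermodehelper() semantics (the trigger
path relies on the new wait < 0 mode; the trigger program path here is
hypothetical):

	/* wait > 0: wait for the program to finish and get its exit code.
	 * wait == 0: return once the child has been spawned.
	 * wait < 0: fire and forget -- no useful error back, but safe to
	 *           call from interrupt context, as the MCE path needs. */
	call_usermodehelper("/etc/mcelog/trigger", argv, envp, -1);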

Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/x86_64/machinecheck | 70 ++++++++++++++++++++++++++++++++++++++
arch/x86_64/kernel/mce.c | 66 +++++++++++++++++++++++++++++------
arch/x86_64/kernel/mce_amd.c | 4 ++
include/asm-x86_64/mce.h | 2 +
kernel/kmod.c | 44 ++++++++++++++++-------
5 files changed, 160 insertions(+), 26 deletions(-)

Index: linux/arch/x86_64/kernel/mce.c
===================================================================
--- linux.orig/arch/x86_64/kernel/mce.c
+++ linux/arch/x86_64/kernel/mce.c
@@ -19,6 +19,7 @@
#include <linux/cpu.h>
#include <linux/percpu.h>
#include <linux/ctype.h>
+#include <linux/kmod.h>
#include <asm/processor.h>
#include <asm/msr.h>
#include <asm/mce.h>
@@ -42,6 +43,10 @@ static unsigned long console_logged;
static int notify_user;
static int rip_msr;
static int mce_bootlog = 1;
+static atomic_t mce_events;
+
+static char trigger[128];
+static char *trigger_argv[2] = { trigger, NULL };

/*
* Lockless MCE logging infrastructure.
@@ -57,6 +62,7 @@ struct mce_log mcelog = {
void mce_log(struct mce *mce)
{
unsigned next, entry;
+ atomic_inc(&mce_events);
mce->finished = 0;
wmb();
for (;;) {
@@ -161,6 +167,17 @@ static inline void mce_get_rip(struct mc
}
}

+static void do_mce_trigger(void)
+{
+ static atomic_t mce_logged;
+ int events = atomic_read(&mce_events);
+ if (events != atomic_read(&mce_logged) && trigger[0]) {
+ /* Small race window, but should be harmless. */
+ atomic_set(&mce_logged, events);
+ call_usermodehelper(trigger, trigger_argv, NULL, -1);
+ }
+}
+
/*
* The actual machine check handler
*/
@@ -234,8 +251,12 @@ void do_machine_check(struct pt_regs * r
}

/* Never do anything final in the polling timer */
- if (!regs)
+ if (!regs) {
+ /* Normal interrupt context here. Call trigger for any new
+ events. */
+ do_mce_trigger();
goto out;
+ }

/* If we didn't find an uncorrectable error, pick
the last one (shouldn't happen, just being safe). */
@@ -606,17 +627,42 @@ DEFINE_PER_CPU(struct sys_device, device
} \
static SYSDEV_ATTR(name, 0644, show_ ## name, set_ ## name);

+/* TBD should generate these dynamically based on number of available banks */
ACCESSOR(bank0ctl,bank[0],mce_restart())
ACCESSOR(bank1ctl,bank[1],mce_restart())
ACCESSOR(bank2ctl,bank[2],mce_restart())
ACCESSOR(bank3ctl,bank[3],mce_restart())
ACCESSOR(bank4ctl,bank[4],mce_restart())
ACCESSOR(bank5ctl,bank[5],mce_restart())
-static struct sysdev_attribute * bank_attributes[NR_BANKS] = {
- &attr_bank0ctl, &attr_bank1ctl, &attr_bank2ctl,
- &attr_bank3ctl, &attr_bank4ctl, &attr_bank5ctl};
+
+static ssize_t show_trigger(struct sys_device *s, char *buf)
+{
+ strcpy(buf, trigger);
+ strcat(buf, "\n");
+ return strlen(trigger) + 1;
+}
+
+static ssize_t set_trigger(struct sys_device *s,const char *buf,size_t siz)
+{
+ char *p;
+ int len;
+ strncpy(trigger, buf, sizeof(trigger));
+ trigger[sizeof(trigger)-1] = 0;
+ len = strlen(trigger);
+ p = strchr(trigger, '\n');
+ if (p) *p = 0;
+ return len;
+}
+
+static SYSDEV_ATTR(trigger, 0644, show_trigger, set_trigger);
ACCESSOR(tolerant,tolerant,)
ACCESSOR(check_interval,check_interval,mce_restart())
+static struct sysdev_attribute *mce_attributes[] = {
+ &attr_bank0ctl, &attr_bank1ctl, &attr_bank2ctl,
+ &attr_bank3ctl, &attr_bank4ctl, &attr_bank5ctl,
+ &attr_tolerant, &attr_check_interval, &attr_trigger,
+ NULL
+};

/* Per cpu sysdev init. All of the cpus still share the same ctl bank */
static __cpuinit int mce_create_device(unsigned int cpu)
@@ -632,11 +678,9 @@ static __cpuinit int mce_create_device(u
err = sysdev_register(&per_cpu(device_mce,cpu));

if (!err) {
- for (i = 0; i < banks; i++)
+ for (i = 0; mce_attributes[i]; i++)
sysdev_create_file(&per_cpu(device_mce,cpu),
- bank_attributes[i]);
- sysdev_create_file(&per_cpu(device_mce,cpu), &attr_tolerant);
- sysdev_create_file(&per_cpu(device_mce,cpu), &attr_check_interval);
+ mce_attributes[i]);
}
return err;
}
@@ -645,11 +689,9 @@ static void mce_remove_device(unsigned i
{
int i;

- for (i = 0; i < banks; i++)
+ for (i = 0; mce_attributes[i]; i++)
sysdev_remove_file(&per_cpu(device_mce,cpu),
- bank_attributes[i]);
- sysdev_remove_file(&per_cpu(device_mce,cpu), &attr_tolerant);
- sysdev_remove_file(&per_cpu(device_mce,cpu), &attr_check_interval);
+ mce_attributes[i]);
sysdev_unregister(&per_cpu(device_mce,cpu));
memset(&per_cpu(device_mce, cpu).kobj, 0, sizeof(struct kobject));
}
Index: linux/arch/x86_64/kernel/mce_amd.c
===================================================================
--- linux.orig/arch/x86_64/kernel/mce_amd.c
+++ linux/arch/x86_64/kernel/mce_amd.c
@@ -220,6 +220,10 @@ asmlinkage void mce_threshold_interrupt(
(high & MASK_LOCKED_HI))
continue;

+ /* Log the machine check that caused the threshold
+ event. */
+ do_machine_check(NULL, 0);
+
if (high & MASK_OVERFLOW_HI) {
rdmsrl(address, m.misc);
rdmsrl(MSR_IA32_MC0_STATUS + bank * 4,
Index: linux/kernel/kmod.c
===================================================================
--- linux.orig/kernel/kmod.c
+++ linux/kernel/kmod.c
@@ -217,7 +217,10 @@ static int wait_for_helper(void *data)
sub_info->retval = ret;
}

- complete(sub_info->complete);
+ if (sub_info->wait < 0)
+ kfree(sub_info);
+ else
+ complete(sub_info->complete);
return 0;
}

@@ -239,6 +242,9 @@ static void __call_usermodehelper(struct
pid = kernel_thread(____call_usermodehelper, sub_info,
CLONE_VFORK | SIGCHLD);

+ if (wait < 0)
+ return;
+
if (pid < 0) {
sub_info->retval = pid;
complete(sub_info->complete);
@@ -253,6 +259,9 @@ static void __call_usermodehelper(struct
* @envp: null-terminated environment list
* @session_keyring: session keyring for process (NULL for an empty keyring)
* @wait: wait for the application to finish and return status.
+ * when -1 don't wait at all, but you get no useful error back when
+ * the program couldn't be exec'ed. This makes it safe to call
+ * from interrupt context.
*
* Runs a user-space application. The application is started
* asynchronously if wait is not set, and runs as a child of keventd.
@@ -265,17 +274,8 @@ int call_usermodehelper_keys(char *path,
struct key *session_keyring, int wait)
{
DECLARE_COMPLETION_ONSTACK(done);
- struct subprocess_info sub_info = {
- .work = __WORK_INITIALIZER(sub_info.work,
- __call_usermodehelper),
- .complete = &done,
- .path = path,
- .argv = argv,
- .envp = envp,
- .ring = session_keyring,
- .wait = wait,
- .retval = 0,
- };
+ struct subprocess_info *sub_info;
+ int retval;

if (!khelper_wq)
return -EBUSY;
@@ -283,9 +283,25 @@ int call_usermodehelper_keys(char *path,
if (path[0] == '\0')
return 0;

- queue_work(khelper_wq, &sub_info.work);
+ sub_info = kzalloc(sizeof(struct subprocess_info), GFP_ATOMIC);
+ if (!sub_info)
+ return -ENOMEM;
+
+ INIT_WORK(&sub_info->work, __call_usermodehelper);
+ sub_info->complete = &done;
+ sub_info->path = path;
+ sub_info->argv = argv;
+ sub_info->envp = envp;
+ sub_info->ring = session_keyring;
+ sub_info->wait = wait;
+
+ queue_work(khelper_wq, &sub_info->work);
+ if (wait < 0) /* task has freed sub_info */
+ return 0;
wait_for_completion(&done);
- return sub_info.retval;
+ retval = sub_info->retval;
+ kfree(sub_info);
+ return retval;
}
EXPORT_SYMBOL(call_usermodehelper_keys);

Index: linux/include/asm-x86_64/mce.h
===================================================================
--- linux.orig/include/asm-x86_64/mce.h
+++ linux/include/asm-x86_64/mce.h
@@ -103,6 +103,8 @@ void mce_log_therm_throt_event(unsigned

extern atomic_t mce_entry;

+extern void do_machine_check(struct pt_regs *, long);
+
#endif

#endif
Index: linux/Documentation/x86_64/machinecheck
===================================================================
--- /dev/null
+++ linux/Documentation/x86_64/machinecheck
@@ -0,0 +1,70 @@
+
+Configurable sysfs parameters for the x86-64 machine check code.
+
+Machine checks report internal hardware error conditions detected
+by the CPU. Uncorrected errors typically cause a machine check
+(often with panic), corrected ones cause a machine check log entry.
+
+Machine checks are organized in banks (normally associated with
+a hardware subsystem) and subevents in a bank. The exact meaning
+of the banks and subevents is CPU specific.
+
+mcelog knows how to decode them.
+
+When you see the "Machine check errors logged" message in the system
+log, mcelog should be run to collect and decode machine check entries
+from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
+
+Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
+(N = CPU number)
+
+The directory contains some configurable entries:
+
+Entries:
+
+bankNctl
+(N bank number)
+ 64bit Hex bitmask enabling/disabling specific subevents for bank N
+ When a bit in the bitmask is zero then the respective
+ subevent will not be reported.
+ By default all events are enabled.
+ Note that the BIOS maintains another mask to disable specific events
+ per bank. This is not visible here.
+
+The following entries appear for each CPU, but they are truly shared
+between all CPUs.
+
+check_interval
+ How often to poll for corrected machine check errors, in seconds
+ (Note output is hexadecimal). Default 5 minutes.
+
+tolerant
+ Tolerance level. When a machine check exception occurs for a
+ non-corrected machine check, the kernel can take different actions.
+ Since machine check exceptions can happen any time it is sometimes
+ risky for the kernel to kill a process because it defies
+ normal kernel locking rules. The tolerance level configures
+ how hard the kernel tries to recover even at some risk of deadlock.
+
+ 0: always panic,
+ 1: panic if deadlock possible,
+ 2: try to avoid panic,
+ 3: never panic or exit (for testing only)
+
+ Default: 1
+
+ Note this only makes a difference if the CPU allows recovery
+ from a machine check exception. Current x86 CPUs generally do not.
+
+trigger
+ Program to run when a machine check event is detected.
+ This is an alternative to running mcelog regularly from cron
+ and allows events to be detected faster.
+
+TBD document entries for AMD threshold interrupt configuration
+
+For more details about the x86 machine check architecture
+see the Intel and AMD architecture manuals from their developer websites.
+
+For more details about the architecture see
+http://one.firstfloor.org/~andi/mce.pdf

2007-02-12 07:42:32

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [35/39] x86_64: Don't reserve ROMs


We trust the e820 table, so explicitly reserving ROMs shouldn't
be needed.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/kernel/setup.c | 130 ---------------------------------------------
1 file changed, 2 insertions(+), 128 deletions(-)

Index: linux/arch/x86_64/kernel/setup.c
===================================================================
--- linux.orig/arch/x86_64/kernel/setup.c
+++ linux/arch/x86_64/kernel/setup.c
@@ -138,128 +138,6 @@ struct resource code_resource = {
.flags = IORESOURCE_RAM,
};

-#define IORESOURCE_ROM (IORESOURCE_BUSY | IORESOURCE_READONLY | IORESOURCE_MEM)
-
-static struct resource system_rom_resource = {
- .name = "System ROM",
- .start = 0xf0000,
- .end = 0xfffff,
- .flags = IORESOURCE_ROM,
-};
-
-static struct resource extension_rom_resource = {
- .name = "Extension ROM",
- .start = 0xe0000,
- .end = 0xeffff,
- .flags = IORESOURCE_ROM,
-};
-
-static struct resource adapter_rom_resources[] = {
- { .name = "Adapter ROM", .start = 0xc8000, .end = 0,
- .flags = IORESOURCE_ROM },
- { .name = "Adapter ROM", .start = 0, .end = 0,
- .flags = IORESOURCE_ROM },
- { .name = "Adapter ROM", .start = 0, .end = 0,
- .flags = IORESOURCE_ROM },
- { .name = "Adapter ROM", .start = 0, .end = 0,
- .flags = IORESOURCE_ROM },
- { .name = "Adapter ROM", .start = 0, .end = 0,
- .flags = IORESOURCE_ROM },
- { .name = "Adapter ROM", .start = 0, .end = 0,
- .flags = IORESOURCE_ROM }
-};
-
-static struct resource video_rom_resource = {
- .name = "Video ROM",
- .start = 0xc0000,
- .end = 0xc7fff,
- .flags = IORESOURCE_ROM,
-};
-
-static struct resource video_ram_resource = {
- .name = "Video RAM area",
- .start = 0xa0000,
- .end = 0xbffff,
- .flags = IORESOURCE_RAM,
-};
-
-#define romsignature(x) (*(unsigned short *)(x) == 0xaa55)
-
-static int __init romchecksum(unsigned char *rom, unsigned long length)
-{
- unsigned char *p, sum = 0;
-
- for (p = rom; p < rom + length; p++)
- sum += *p;
- return sum == 0;
-}
-
-static void __init probe_roms(void)
-{
- unsigned long start, length, upper;
- unsigned char *rom;
- int i;
-
- /* video rom */
- upper = adapter_rom_resources[0].start;
- for (start = video_rom_resource.start; start < upper; start += 2048) {
- rom = isa_bus_to_virt(start);
- if (!romsignature(rom))
- continue;
-
- video_rom_resource.start = start;
-
- /* 0 < length <= 0x7f * 512, historically */
- length = rom[2] * 512;
-
- /* if checksum okay, trust length byte */
- if (length && romchecksum(rom, length))
- video_rom_resource.end = start + length - 1;
-
- request_resource(&iomem_resource, &video_rom_resource);
- break;
- }
-
- start = (video_rom_resource.end + 1 + 2047) & ~2047UL;
- if (start < upper)
- start = upper;
-
- /* system rom */
- request_resource(&iomem_resource, &system_rom_resource);
- upper = system_rom_resource.start;
-
- /* check for extension rom (ignore length byte!) */
- rom = isa_bus_to_virt(extension_rom_resource.start);
- if (romsignature(rom)) {
- length = extension_rom_resource.end - extension_rom_resource.start + 1;
- if (romchecksum(rom, length)) {
- request_resource(&iomem_resource, &extension_rom_resource);
- upper = extension_rom_resource.start;
- }
- }
-
- /* check for adapter roms on 2k boundaries */
- for (i = 0; i < ARRAY_SIZE(adapter_rom_resources) && start < upper;
- start += 2048) {
- rom = isa_bus_to_virt(start);
- if (!romsignature(rom))
- continue;
-
- /* 0 < length <= 0x7f * 512, historically */
- length = rom[2] * 512;
-
- /* but accept any length that fits if checksum okay */
- if (!length || start + length > upper || !romchecksum(rom, length))
- continue;
-
- adapter_rom_resources[i].start = start;
- adapter_rom_resources[i].end = start + length - 1;
- request_resource(&iomem_resource, &adapter_rom_resources[i]);
-
- start = adapter_rom_resources[i++].end & ~2047UL;
- }
-}
-
#ifdef CONFIG_PROC_VMCORE
/* elfcorehdr= specifies the location of elf core header
* stored by the crashed kernel. This option will be passed
@@ -524,15 +402,11 @@ void __init setup_arch(char **cmdline_p)
init_apic_mappings();

/*
- * Request address space for all standard RAM and ROM resources
- * and also for regions reported as reserved by the e820.
- */
- probe_roms();
+ * We trust e820 completely. No explicit ROM probing in memory.
+ */
e820_reserve_resources();
e820_mark_nosave_regions();

- request_resource(&iomem_resource, &video_ram_resource);
-
{
unsigned i;
/* request I/O space for devices used on all i[345]86 PCs */

2007-02-12 07:42:32

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [36/39] x86_64: define dma noncoherent API functions


From: Jeff Garzik <[email protected]>

x86-64 is missing these:

Signed-off-by: Jeff Garzik <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86_64/dma-mapping.h | 3 +++
1 file changed, 3 insertions(+)

Index: linux/include/asm-x86_64/dma-mapping.h
===================================================================
--- linux.orig/include/asm-x86_64/dma-mapping.h
+++ linux/include/asm-x86_64/dma-mapping.h
@@ -66,6 +66,9 @@ static inline int dma_mapping_error(dma_

+#define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f)
+#define dma_free_noncoherent(d, s, v, h) dma_free_coherent(d, s, v, h)
+
extern void *dma_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp);
extern void dma_free_coherent(struct device *dev, size_t size, void *vaddr,
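
For drivers the effect is that the noncoherent calls simply alias the
coherent ones on x86-64, where all memory is cache-coherent anyway. A
minimal usage sketch (the device pointer "dev" is assumed to come from
the driver's probe path; not from this patch):

	/* Allocate a "noncoherent" buffer; on x86-64 the macro expands
	 * to dma_alloc_coherent(), so the buffer is coherent anyway. */
	dma_addr_t handle;
	void *buf = dma_alloc_noncoherent(dev, PAGE_SIZE, &handle,
					  GFP_KERNEL);

	if (buf) {
		/* ... program the device with "handle", run the DMA ... */
		dma_free_noncoherent(dev, PAGE_SIZE, buf, handle);
	}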

2007-02-12 07:42:33

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [19/39] x86_64: remove get_pmd()


From: "Jan Beulich" <[email protected]>
Function is dead.

Signed-off-by: Jan Beulich <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86_64/pgalloc.h | 5 -----
1 file changed, 5 deletions(-)

Index: linux/include/asm-x86_64/pgalloc.h
===================================================================
--- linux.orig/include/asm-x86_64/pgalloc.h
+++ linux/include/asm-x86_64/pgalloc.h
@@ -18,11 +18,6 @@ static inline void pmd_populate(struct m
set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
}

-static inline pmd_t *get_pmd(void)
-{
- return (pmd_t *)get_zeroed_page(GFP_KERNEL);
-}
-
static inline void pmd_free(pmd_t *pmd)
{
BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));

2007-02-12 07:43:46

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [20/39] i386: Small cleanup to TLB flush code


- Remove outdated comment
- Use cpu_relax() in a busy loop

Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/kernel/smp.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux/arch/i386/kernel/smp.c
===================================================================
--- linux.orig/arch/i386/kernel/smp.c
+++ linux/arch/i386/kernel/smp.c
@@ -375,8 +375,7 @@ static void flush_tlb_others(cpumask_t c
/*
* i'm not happy about this global shared spinlock in the
* MM hot path, but we'll see how contended it is.
- * Temporarily this turns IRQs off, so that lockups are
- * detected by the NMI watchdog.
+ * AK: x86-64 has a faster method that could be ported.
*/
spin_lock(&tlbstate_lock);

@@ -401,7 +400,7 @@ static void flush_tlb_others(cpumask_t c

while (!cpus_empty(flush_cpumask))
/* nothing. lockup detection does not belong here */
- mb();
+ cpu_relax();

flush_mm = NULL;
flush_va = 0;

2007-02-12 07:43:46

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [22/39] x86_64: Kconfig typos


From: Nicolas Kaiser <[email protected]>
Some typos in Kconfig.

Signed-off-by: Nicolas Kaiser <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/Kconfig | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/arch/x86_64/Kconfig
===================================================================
--- linux.orig/arch/x86_64/Kconfig
+++ linux/arch/x86_64/Kconfig
@@ -148,18 +148,18 @@ config MPSC
Optimize for Intel Pentium 4 and older Nocona/Dempsey Xeon CPUs
with Intel Extended Memory 64 Technology(EM64T). For details see
<http://www.intel.com/technology/64bitextensions/>.
- Note the the latest Xeons (Xeon 51xx and 53xx) are not based on the
- Netburst core and shouldn't use this option. You can distingush them
+ Note that the latest Xeons (Xeon 51xx and 53xx) are not based on the
+ Netburst core and shouldn't use this option. You can distinguish them
using the cpu family field
- in /proc/cpuinfo. Family 15 is a older Xeon, Family 6 a newer one
- (this rule only applies to system that support EM64T)
+ in /proc/cpuinfo. Family 15 is an older Xeon, Family 6 a newer one
+ (this rule only applies to systems that support EM64T)

config MCORE2
bool "Intel Core2 / newer Xeon"
help
Optimize for Intel Core2 and newer Xeons (51xx)
- You can distingush the newer Xeons from the older ones using
- the cpu family field in /proc/cpuinfo. 15 is a older Xeon
+ You can distinguish the newer Xeons from the older ones using
+ the cpu family field in /proc/cpuinfo. 15 is an older Xeon
(use CONFIG_MPSC then), 6 is a newer one. This rule only
applies to CPUs that support EM64T.

2007-02-12 07:44:23

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [24/39] i386: use smp_call_function_single()


From: Alexey Dobriyan <[email protected]>
It will execute cpuid only on the CPU we need.

Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

arch/i386/kernel/cpuid.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

Index: linux/arch/i386/kernel/cpuid.c
===================================================================
--- linux.orig/arch/i386/kernel/cpuid.c
+++ linux/arch/i386/kernel/cpuid.c
@@ -48,7 +48,6 @@ static struct class *cpuid_class;
#ifdef CONFIG_SMP

struct cpuid_command {
- int cpu;
u32 reg;
u32 *data;
};
@@ -57,8 +56,7 @@ static void cpuid_smp_cpuid(void *cmd_bl
{
struct cpuid_command *cmd = (struct cpuid_command *)cmd_block;

- if (cmd->cpu == smp_processor_id())
- cpuid(cmd->reg, &cmd->data[0], &cmd->data[1], &cmd->data[2],
+ cpuid(cmd->reg, &cmd->data[0], &cmd->data[1], &cmd->data[2],
&cmd->data[3]);
}

@@ -70,11 +68,10 @@ static inline void do_cpuid(int cpu, u32
if (cpu == smp_processor_id()) {
cpuid(reg, &data[0], &data[1], &data[2], &data[3]);
} else {
- cmd.cpu = cpu;
cmd.reg = reg;
cmd.data = data;

- smp_call_function(cpuid_smp_cpuid, &cmd, 1, 1);
+ smp_call_function_single(cpu, cpuid_smp_cpuid, &cmd, 1, 1);
}
preempt_enable();
}
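
The same idiom generalizes to any per-CPU register read; a sketch with
hypothetical helper names, using the 2007-era five-argument signature
seen above (cpu, func, info, nonatomic, wait):

	/* Runs on the target CPU via IPI, or directly if it is local. */
	static void remote_cpuid(void *info)
	{
		u32 *data = info;	/* data[0..3] receive eax..edx */

		cpuid(0, &data[0], &data[1], &data[2], &data[3]);
	}

	static void cpuid_on(int cpu, u32 *data)
	{
		preempt_disable();
		if (cpu == smp_processor_id())
			remote_cpuid(data);
		else
			smp_call_function_single(cpu, remote_cpuid,
						 data, 1, 1);
		preempt_enable();
	}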

2007-02-12 07:44:23

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [25/39] x86_64: Fix preprocessor condition


From: "Josef 'Jeff' Sipek" <[email protected]>
Signed-off-by: Josef 'Jeff' Sipek <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86_64/io.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/asm-x86_64/io.h
===================================================================
--- linux.orig/include/asm-x86_64/io.h
+++ linux/include/asm-x86_64/io.h
@@ -100,7 +100,7 @@ __OUTS(l)

#define IO_SPACE_LIMIT 0xffff

-#if defined(__KERNEL__) && __x86_64__
+#if defined(__KERNEL__) && defined(__x86_64__)

#include <linux/vmalloc.h>

2007-02-12 07:44:24

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [27/39] i386: APM on i386


From: Alexey Dobriyan <[email protected]>
Byte-to-byte identical /proc/apm here.

Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

arch/i386/kernel/apm.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)

Index: linux/arch/i386/kernel/apm.c
===================================================================
--- linux.orig/arch/i386/kernel/apm.c
+++ linux/arch/i386/kernel/apm.c
@@ -211,6 +211,7 @@
#include <linux/slab.h>
#include <linux/stat.h>
#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
#include <linux/miscdevice.h>
#include <linux/apm_bios.h>
#include <linux/init.h>
@@ -1636,9 +1637,8 @@ static int do_open(struct inode * inode,
return 0;
}

-static int apm_get_info(char *buf, char **start, off_t fpos, int length)
+static int proc_apm_show(struct seq_file *m, void *v)
{
- char * p;
unsigned short bx;
unsigned short cx;
unsigned short dx;
@@ -1650,8 +1650,6 @@ static int apm_get_info(char *buf, char
int time_units = -1;
char *units = "?";

- p = buf;
-
if ((num_online_cpus() == 1) &&
!(error = apm_get_power_status(&bx, &cx, &dx))) {
ac_line_status = (bx >> 8) & 0xff;
@@ -1705,7 +1703,7 @@ static int apm_get_info(char *buf, char
-1: Unknown
8) min = minutes; sec = seconds */

- p += sprintf(p, "%s %d.%d 0x%02x 0x%02x 0x%02x 0x%02x %d%% %d %s\n",
+ seq_printf(m, "%s %d.%d 0x%02x 0x%02x 0x%02x 0x%02x %d%% %d %s\n",
driver_version,
(apm_info.bios.version >> 8) & 0xff,
apm_info.bios.version & 0xff,
@@ -1716,10 +1714,22 @@ static int apm_get_info(char *buf, char
percentage,
time_units,
units);
+ return 0;
+}

- return p - buf;
+static int proc_apm_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, proc_apm_show, NULL);
}

+static const struct file_operations apm_file_ops = {
+ .owner = THIS_MODULE,
+ .open = proc_apm_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
static int apm(void *unused)
{
unsigned short bx;
@@ -2341,9 +2351,9 @@ static int __init apm_init(void)
set_base(gdt[APM_DS >> 3],
__va((unsigned long)apm_info.bios.dseg << 4));

- apm_proc = create_proc_info_entry("apm", 0, NULL, apm_get_info);
+ apm_proc = create_proc_entry("apm", 0, NULL);
if (apm_proc)
- apm_proc->owner = THIS_MODULE;
+ apm_proc->proc_fops = &apm_file_ops;

kapmd_task = kthread_create(apm, NULL, "kapmd");
if (IS_ERR(kapmd_task)) {
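
Stripped of the APM specifics, the conversion follows the standard
single_open() pattern; a generic sketch with hypothetical names:

	static int foo_show(struct seq_file *m, void *v)
	{
		seq_printf(m, "value: %d\n", 42);
		return 0;
	}

	static int foo_open(struct inode *inode, struct file *file)
	{
		return single_open(file, foo_show, NULL);
	}

	static const struct file_operations foo_fops = {
		.owner   = THIS_MODULE,
		.open    = foo_open,
		.read    = seq_read,
		.llseek  = seq_lseek,
		.release = single_release,
	};

seq_file then handles buffering and read offsets, which the old
char*-returning callbacks left to each implementation.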

2007-02-12 07:45:09

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [31/39] x86_64: Unexport __supported_pte_mask


The symbol is needed to manipulate page tables, and modules shouldn't
do that.

A leftover from 2.4; no in-tree module should need it now.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/kernel/setup64.c | 1 -
1 file changed, 1 deletion(-)

Index: linux/arch/x86_64/kernel/setup64.c
===================================================================
--- linux.orig/arch/x86_64/kernel/setup64.c
+++ linux/arch/x86_64/kernel/setup64.c
@@ -37,7 +37,6 @@ struct desc_ptr idt_descr = { 256 * 16 -
char boot_cpu_stack[IRQSTACKSIZE] __attribute__((section(".bss.page_aligned")));

unsigned long __supported_pte_mask __read_mostly = ~0UL;
-EXPORT_SYMBOL(__supported_pte_mask);
static int do_not_nx __cpuinitdata = 0;

/* noexec=on|off

2007-02-12 07:45:10

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [32/39] x86_64: x86_64 - Fix FS/GS registers for VT execution


From: Zachary Amsden <[email protected]>

Initialize FS and GS to __KERNEL_DS as well. Their actual value is not
important, but it is important to reload them in protected mode; at this
point they still retain the real-mode values from initial boot. VT
disallows execution of code under such conditions, which means hardware
virtualization cannot be used to boot the kernel on Intel platforms,
making boot painfully slow.

This requires moving the GS load before the load of GS_BASE, so just move
all the segments loads there to keep them together in the code.

Signed-off-by: Zachary Amsden <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/kernel/head.S | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

Index: linux/arch/x86_64/kernel/head.S
===================================================================
--- linux.orig/arch/x86_64/kernel/head.S
+++ linux/arch/x86_64/kernel/head.S
@@ -163,6 +163,20 @@ startup_64:
*/
lgdt cpu_gdt_descr

+ /* set up data segments. actually 0 would do too */
+ movl $__KERNEL_DS,%eax
+ movl %eax,%ds
+ movl %eax,%ss
+ movl %eax,%es
+
+ /*
+ * We don't really need to load %fs or %gs, but load them anyway
+ * to kill any stale realmode selectors. This allows execution
+ * under VT hardware.
+ */
+ movl %eax,%fs
+ movl %eax,%gs
+
/*
* Setup up a dummy PDA. this is just for some early bootup code
* that does in_interrupt()
@@ -173,12 +187,6 @@ startup_64:
shrq $32,%rdx
wrmsr

- /* set up data segments. actually 0 would do too */
- movl $__KERNEL_DS,%eax
- movl %eax,%ds
- movl %eax,%ss
- movl %eax,%es
-
/* esi is pointer to real mode structure with interesting info.
pass it to C */
movl %esi, %edi

2007-02-12 07:45:10

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [33/39] x86_64: Fix off by one error in IOMMU boundary checking


Should be harmless because there is normally no memory there, but
technically it was incorrect: with a 32-bit mask of 0xffffffff, a buffer
whose end (addr + size) equals the mask covers bytes only up to
mask - 1, all addressable by the device, yet the old >= test still
forced it through the IOMMU.

Pointed out by Leo Duran

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/kernel/pci-gart.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/pci-gart.c
===================================================================
--- linux.orig/arch/x86_64/kernel/pci-gart.c
+++ linux/arch/x86_64/kernel/pci-gart.c
@@ -185,7 +185,7 @@ static void iommu_full(struct device *de
static inline int need_iommu(struct device *dev, unsigned long addr, size_t size)
{
u64 mask = *dev->dma_mask;
- int high = addr + size >= mask;
+ int high = addr + size > mask;
int mmu = high;
if (force_iommu)
mmu = 1;
@@ -195,7 +195,7 @@ static inline int need_iommu(struct devi
static inline int nonforced_iommu(struct device *dev, unsigned long addr, size_t size)
{
u64 mask = *dev->dma_mask;
- int high = addr + size >= mask;
+ int high = addr + size > mask;
int mmu = high;
return mmu;
}

2007-02-12 07:46:41

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [14/39] x86_64: cleanup Doc/x86_64/ files


From: Randy Dunlap <[email protected]>

Fix typos.
Lots of whitespace changes for readability and consistency.

Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/x86_64/boot-options.txt | 27 ++++++++++-----------------
Documentation/x86_64/cpu-hotplug-spec | 2 +-
Documentation/x86_64/kernel-stacks | 26 +++++++++++++-------------
Documentation/x86_64/mm.txt | 22 +++++++++++-----------
4 files changed, 35 insertions(+), 42 deletions(-)

Index: linux/Documentation/x86_64/cpu-hotplug-spec
===================================================================
--- linux.orig/Documentation/x86_64/cpu-hotplug-spec
+++ linux/Documentation/x86_64/cpu-hotplug-spec
@@ -2,7 +2,7 @@ Firmware support for CPU hotplug under L
---------------------------------------------------

Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
-know in advance boot time the maximum number of CPUs that could be plugged
+know in advance of boot time the maximum number of CPUs that could be plugged
into the system. ACPI 3.0 currently has no official way to supply
this information from the firmware to the operating system.

Index: linux/Documentation/x86_64/kernel-stacks
===================================================================
--- linux.orig/Documentation/x86_64/kernel-stacks
+++ linux/Documentation/x86_64/kernel-stacks
@@ -9,9 +9,9 @@ zombie. While the thread is in user spac
except for the thread_info structure at the bottom.

In addition to the per thread stacks, there are specialized stacks
-associated with each cpu. These stacks are only used while the kernel
-is in control on that cpu, when a cpu returns to user space the
-specialized stacks contain no useful data. The main cpu stacks is
+associated with each CPU. These stacks are only used while the kernel
+is in control on that CPU; when a CPU returns to user space the
+specialized stacks contain no useful data. The main CPU stacks are:

* Interrupt stack. IRQSTACKSIZE

@@ -32,17 +32,17 @@ x86_64 also has a feature which is not a
to automatically switch to a new stack for designated events such as
double fault or NMI, which makes it easier to handle these unusual
events on x86_64. This feature is called the Interrupt Stack Table
-(IST). There can be up to 7 IST entries per cpu. The IST code is an
-index into the Task State Segment (TSS), the IST entries in the TSS
-point to dedicated stacks, each stack can be a different size.
+(IST). There can be up to 7 IST entries per CPU. The IST code is an
+index into the Task State Segment (TSS). The IST entries in the TSS
+point to dedicated stacks; each stack can be a different size.

-An IST is selected by an non-zero value in the IST field of an
+An IST is selected by a non-zero value in the IST field of an
interrupt-gate descriptor. When an interrupt occurs and the hardware
loads such a descriptor, the hardware automatically sets the new stack
pointer based on the IST value, then invokes the interrupt handler. If
software wants to allow nested IST interrupts then the handler must
adjust the IST values on entry to and exit from the interrupt handler.
-(this is occasionally done, e.g. for debug exceptions)
+(This is occasionally done, e.g. for debug exceptions.)

Events with different IST codes (i.e. with different stacks) can be
nested. For example, a debug interrupt can safely be interrupted by an
@@ -58,17 +58,17 @@ The currently assigned IST stacks are :-

Used for interrupt 12 - Stack Fault Exception (#SS).

- This allows to recover from invalid stack segments. Rarely
+ This allows the CPU to recover from invalid stack segments. Rarely
happens.

* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE).

Used for interrupt 8 - Double Fault Exception (#DF).

- Invoked when handling a exception causes another exception. Happens
- when the kernel is very confused (e.g. kernel stack pointer corrupt)
- Using a separate stack allows to recover from it well enough in many
- cases to still output an oops.
+ Invoked when handling one exception causes another exception. Happens
+ when the kernel is very confused (e.g. kernel stack pointer corrupt).
+ Using a separate stack allows the kernel to recover from it well enough
+ in many cases to still output an oops.

* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE).

Index: linux/Documentation/x86_64/mm.txt
===================================================================
--- linux.orig/Documentation/x86_64/mm.txt
+++ linux/Documentation/x86_64/mm.txt
@@ -3,26 +3,26 @@

Virtual memory map with 4 level page tables:

-0000000000000000 - 00007fffffffffff (=47bits) user space, different per mm
+0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
hole caused by [48:63] sign extension
-ffff800000000000 - ffff80ffffffffff (=40bits) guard hole
-ffff810000000000 - ffffc0ffffffffff (=46bits) direct mapping of all phys. memory
-ffffc10000000000 - ffffc1ffffffffff (=40bits) hole
-ffffc20000000000 - ffffe1ffffffffff (=45bits) vmalloc/ioremap space
+ffff800000000000 - ffff80ffffffffff (=40 bits) guard hole
+ffff810000000000 - ffffc0ffffffffff (=46 bits) direct mapping of all phys. memory
+ffffc10000000000 - ffffc1ffffffffff (=40 bits) hole
+ffffc20000000000 - ffffe1ffffffffff (=45 bits) vmalloc/ioremap space
... unused hole ...
-ffffffff80000000 - ffffffff82800000 (=40MB) kernel text mapping, from phys 0
+ffffffff80000000 - ffffffff82800000 (=40 MB) kernel text mapping, from phys 0
... unused hole ...
-ffffffff88000000 - fffffffffff00000 (=1919MB) module mapping space
+ffffffff88000000 - fffffffffff00000 (=1919 MB) module mapping space

-The direct mapping covers all memory in the system upto the highest
+The direct mapping covers all memory in the system up to the highest
memory address (this means in some cases it can also include PCI memory
-holes)
+holes).

vmalloc space is lazily synchronized into the different PML4 pages of
the processes using the page fault handler, with init_level4_pgt as
reference.

-Current X86-64 implementations only support 40 bit of address space,
-but we support upto 46bits. This expands into MBZ space in the page tables.
+Current X86-64 implementations only support 40 bits of address space,
+but we support up to 46 bits. This expands into MBZ space in the page tables.

-Andi Kleen, Jul 2004
Index: linux/Documentation/x86_64/boot-options.txt
===================================================================
--- linux.orig/Documentation/x86_64/boot-options.txt
+++ linux/Documentation/x86_64/boot-options.txt
@@ -226,9 +226,9 @@ IOMMU (input/output memory management un
is 20.
memaper[=<order>] Allocate an own aperture over RAM with size 32MB<<order.
(default: order=1, i.e. 64MB)
- merge Do scather-gather (SG) merging. Implies "force"
+ merge Do scatter-gather (SG) merging. Implies "force"
(experimental).
- nomerge Don't do scather-gather (SG) merging.
+ nomerge Don't do scatter-gather (SG) merging.
noaperture Ask the IOMMU not to touch the aperture for AGP.
forcesac Force single-address cycle (SAC) mode for masks <40bits
(experimental).
@@ -275,14 +275,14 @@ IOMMU (input/output memory management un

Debugging

- oops=panic Always panic on oopses. Default is to just kill the process,
- but there is a small probability of deadlocking the machine.
- This will also cause panics on machine check exceptions.
- Useful together with panic=30 to trigger a reboot.
+ oops=panic Always panic on oopses. Default is to just kill the process,
+ but there is a small probability of deadlocking the machine.
+ This will also cause panics on machine check exceptions.
+ Useful together with panic=30 to trigger a reboot.

- kstack=N Print that many words from the kernel stack in oops dumps.
+ kstack=N Print N words from the kernel stack in oops dumps.

- pagefaulttrace Dump all page faults. Only useful for extreme debugging
+ pagefaulttrace Dump all page faults. Only useful for extreme debugging
and will create a lot of output.

call_trace=[old|both|newfallback|new]
@@ -292,15 +292,8 @@ Debugging
newfallback: use new unwinder but fall back to old if it gets
stuck (default)

- call_trace=[old|both|newfallback|new]
- old: use old inexact backtracer
- new: use new exact dwarf2 unwinder
- both: print entries from both
- newfallback: use new unwinder but fall back to old if it gets
- stuck (default)
-
-Misc
+Miscellaneous

noreplacement Don't replace instructions with more appropriate ones
for the CPU. This may be useful on asymmetric MP systems
- where some CPU have less capabilities than the others.
+ where some CPUs have less capabilities than others.

2007-02-12 07:46:43

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [26/39] i386: fix 32-bit ioctls on x64_32


From: Giuliano Procida <[email protected]>
[MTRR] fix 32-bit ioctls on x64_32

Signed-off-by: Giuliano Procida <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

Fixes incomplete support for 32-bit compatibility ioctls in 2.6.19.1:
they were unhandled in one of three case statements. Tested using the X
server before and after the change.

---
arch/i386/kernel/cpu/mtrr/if.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

Index: linux/arch/i386/kernel/cpu/mtrr/if.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/mtrr/if.c
+++ linux/arch/i386/kernel/cpu/mtrr/if.c
@@ -211,6 +211,9 @@ mtrr_ioctl(struct file *file, unsigned i
default:
return -ENOTTY;
case MTRRIOC_ADD_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_ADD_ENTRY:
+#endif
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
err =
@@ -218,21 +221,33 @@ mtrr_ioctl(struct file *file, unsigned i
file, 0);
break;
case MTRRIOC_SET_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_SET_ENTRY:
+#endif
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
err = mtrr_add(sentry.base, sentry.size, sentry.type, 0);
break;
case MTRRIOC_DEL_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_DEL_ENTRY:
+#endif
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
err = mtrr_file_del(sentry.base, sentry.size, file, 0);
break;
case MTRRIOC_KILL_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_KILL_ENTRY:
+#endif
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
err = mtrr_del(-1, sentry.base, sentry.size);
break;
case MTRRIOC_GET_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_GET_ENTRY:
+#endif
if (gentry.regnum >= num_var_ranges)
return -EINVAL;
mtrr_if->get(gentry.regnum, &gentry.base, &size, &type);
@@ -249,6 +264,9 @@ mtrr_ioctl(struct file *file, unsigned i

break;
case MTRRIOC_ADD_PAGE_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_ADD_PAGE_ENTRY:
+#endif
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
err =
@@ -256,21 +274,33 @@ mtrr_ioctl(struct file *file, unsigned i
file, 1);
break;
case MTRRIOC_SET_PAGE_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_SET_PAGE_ENTRY:
+#endif
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
err = mtrr_add_page(sentry.base, sentry.size, sentry.type, 0);
break;
case MTRRIOC_DEL_PAGE_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_DEL_PAGE_ENTRY:
+#endif
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
err = mtrr_file_del(sentry.base, sentry.size, file, 1);
break;
case MTRRIOC_KILL_PAGE_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_KILL_PAGE_ENTRY:
+#endif
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
err = mtrr_del_page(-1, sentry.base, sentry.size);
break;
case MTRRIOC_GET_PAGE_ENTRY:
+#ifdef CONFIG_COMPAT
+ case MTRRIOC32_GET_PAGE_ENTRY:
+#endif
if (gentry.regnum >= num_var_ranges)
return -EINVAL;
mtrr_if->get(gentry.regnum, &gentry.base, &size, &type);
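
The MTRRIOC32_* numbers differ from the native ones because an ioctl
command encodes the size of its argument struct, and a 32-bit task
passes a layout whose "unsigned long base" member is only 32 bits wide.
Roughly, following the compat definitions in if.c (a sketch, not
verbatim):

	#ifdef CONFIG_COMPAT
	struct mtrr_sentry32 {
		compat_ulong_t base;	/* 32-bit base address */
		compat_uint_t size;	/* size of region */
		compat_uint_t type;	/* type of region */
	};

	#define MTRRIOC32_ADD_ENTRY \
		_IOW(MTRR_IOCTL_BASE, 0, struct mtrr_sentry32)
	#endif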

2007-02-12 07:47:17

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [6/39] i386: romsignature/checksum cleanup


From: Rene Herman <[email protected]>

Use the addition of __init to romsignature() (it's only called from
probe_roms(), which is itself __init) as an excuse to submit a pedantic
cleanup.

Signed-off-by: Rene Herman <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/i386/kernel/e820.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

Index: linux/arch/i386/kernel/e820.c
===================================================================
--- linux.orig/arch/i386/kernel/e820.c
+++ linux/arch/i386/kernel/e820.c
@@ -157,21 +157,22 @@ static struct resource standard_io_resou
.flags = IORESOURCE_BUSY | IORESOURCE_IO
} };

-static int romsignature(const unsigned char *x)
+#define ROMSIGNATURE 0xaa55
+
+static int __init romsignature(const unsigned char *rom)
{
unsigned short sig;
- int ret = 0;
- if (probe_kernel_address((const unsigned short *)x, sig) == 0)
- ret = (sig == 0xaa55);
- return ret;
+
+ return probe_kernel_address((const unsigned short *)rom, sig) == 0 &&
+ sig == ROMSIGNATURE;
}

static int __init romchecksum(unsigned char *rom, unsigned long length)
{
- unsigned char *p, sum = 0;
+ unsigned char sum;

- for (p = rom; p < rom + length; p++)
- sum += *p;
+ for (sum = 0; length; length--)
+ sum += *rom++;
return sum == 0;
}

2007-02-12 07:47:19

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [3/39] i386: arch/i386/kernel/cpu/mcheck/mce.c should #include <asm/mce.h>


From: Adrian Bunk <[email protected]>

Every file should include the headers containing the prototypes for
its global functions.

Signed-off-by: Adrian Bunk <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

arch/i386/kernel/cpu/mcheck/mce.c | 1 +
1 file changed, 1 insertion(+)

Index: linux/arch/i386/kernel/cpu/mcheck/mce.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/mcheck/mce.c
+++ linux/arch/i386/kernel/cpu/mcheck/mce.c
@@ -12,6 +12,7 @@

#include <asm/processor.h>
#include <asm/system.h>
+#include <asm/mce.h>

#include "mce.h"

2007-02-12 07:47:55

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [7/39] x86_64: Fix fake numa for x86_64 machines with big IO hole


From: Rohit Seth <[email protected]>

This patch resolves the issue of running with numa=fake=X on the kernel
command line on x86_64 machines that have a big IO hole. While
calculating the size of each node, we now look at the total hole size in
that range.

Previously there were nodes that contained only IO holes, causing kernel
boot problems. We now use NODE_MIN_SIZE (64MB) as the minimum size of
memory that any node must have. We reduce the number of allocated nodes
if the number of nodes specified on the kernel command line results in
any node getting memory smaller than NODE_MIN_SIZE.

This change hands out the extra memory in NODE_MIN_SIZE granules,
distributed uniformly among as many nodes (called big nodes) as
possible.
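
A worked example with invented numbers, mirroring the arithmetic in
numa_emulation() below: on a machine with 2500 MB of usable RAM (after
holes) and numa=fake=8,

	/* illustration only -- 2500 MB usable, numa=fake=8 */
	unsigned long usable = 2500UL << 20;
	unsigned long old_sz = usable / 8;			/* ~312 MB */
	unsigned long sz = old_sz & FAKE_NODE_MIN_HASH_MASK;	/* 256 MB */
	int big = ((old_sz - sz) * 8) / FAKE_NODE_MIN_SIZE;	/* 7 */

so seven of the eight nodes receive one extra 64 MB granule and the
final node absorbs whatever remains.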

[[email protected]: build fix]
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Paul Menage <[email protected]>
Signed-off-by: Rohit Seth <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/x86_64/kernel/e820.c | 31 ++++++++++++
arch/x86_64/mm/numa.c | 110 ++++++++++++++++++++++++++++++++++++++------
include/asm-x86_64/e820.h | 1
include/asm-x86_64/mmzone.h | 5 ++
4 files changed, 133 insertions(+), 14 deletions(-)

Index: linux/arch/x86_64/kernel/e820.c
===================================================================
--- linux.orig/arch/x86_64/kernel/e820.c
+++ linux/arch/x86_64/kernel/e820.c
@@ -191,6 +191,37 @@ unsigned long __init e820_end_of_ram(voi
}

/*
+ * Find the hole size in the range.
+ */
+unsigned long __init e820_hole_size(unsigned long start, unsigned long end)
+{
+ unsigned long ram = 0;
+ int i;
+
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *ei = &e820.map[i];
+ unsigned long last, addr;
+
+ if (ei->type != E820_RAM ||
+ ei->addr+ei->size <= start ||
+ ei->addr >= end)
+ continue;
+
+ addr = round_up(ei->addr, PAGE_SIZE);
+ if (addr < start)
+ addr = start;
+
+ last = round_down(ei->addr + ei->size, PAGE_SIZE);
+ if (last >= end)
+ last = end;
+
+ if (last > addr)
+ ram += last - addr;
+ }
+ return ((end - start) - ram);
+}
+
+/*
* Mark e820 reserved areas as busy for the resource manager.
*/
void __init e820_reserve_resources(void)
Index: linux/arch/x86_64/mm/numa.c
===================================================================
--- linux.orig/arch/x86_64/mm/numa.c
+++ linux/arch/x86_64/mm/numa.c
@@ -272,31 +272,113 @@ void __init numa_init_array(void)
}

#ifdef CONFIG_NUMA_EMU
+/* Numa emulation */
int numa_fake __initdata = 0;

-/* Numa emulation */
+/*
+ * This function is used to find out if the start and end correspond to
+ * different zones.
+ */
+int zone_cross_over(unsigned long start, unsigned long end)
+{
+ if ((start < (MAX_DMA32_PFN << PAGE_SHIFT)) &&
+ (end >= (MAX_DMA32_PFN << PAGE_SHIFT)))
+ return 1;
+ return 0;
+}
+
static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
{
- int i;
+ int i, big;
struct bootnode nodes[MAX_NUMNODES];
- unsigned long sz = ((end_pfn - start_pfn)<<PAGE_SHIFT) / numa_fake;
+ unsigned long sz, old_sz;
+ unsigned long hole_size;
+ unsigned long start, end;
+ unsigned long max_addr = (end_pfn << PAGE_SHIFT);
+
+ start = (start_pfn << PAGE_SHIFT);
+ hole_size = e820_hole_size(start, max_addr);
+ sz = (max_addr - start - hole_size) / numa_fake;

/* Kludge needed for the hash function */
- if (hweight64(sz) > 1) {
- unsigned long x = 1;
- while ((x << 1) < sz)
- x <<= 1;
- if (x < sz/2)
- printk(KERN_ERR "Numa emulation unbalanced. Complain to maintainer\n");
- sz = x;
- }

+ old_sz = sz;
+ /*
+ * Round down to the nearest FAKE_NODE_MIN_SIZE.
+ */
+ sz &= FAKE_NODE_MIN_HASH_MASK;
+
+ /*
+ * We ensure that each node is at least 64MB big. Smaller than this
+ * size can cause VM hiccups.
+ */
+ if (sz == 0) {
+ printk(KERN_INFO "Not enough memory for %d nodes. Reducing "
+ "the number of nodes\n", numa_fake);
+ numa_fake = (max_addr - start - hole_size) / FAKE_NODE_MIN_SIZE;
+ printk(KERN_INFO "Number of fake nodes will be = %d\n",
+ numa_fake);
+ sz = FAKE_NODE_MIN_SIZE;
+ }
+ /*
+ * Find out how many nodes can get an extra NODE_MIN_SIZE granule.
+ * This logic ensures the extra memory gets distributed among as many
+ * nodes as possible (as compared to one single node getting all that
+ * extra memory).
+ */
+ big = ((old_sz - sz) * numa_fake) / FAKE_NODE_MIN_SIZE;
+ printk(KERN_INFO "Fake node Size: %luMB hole_size: %luMB big nodes: "
+ "%d\n",
+ (sz >> 20), (hole_size >> 20), big);
memset(&nodes,0,sizeof(nodes));
+ end = start;
for (i = 0; i < numa_fake; i++) {
- nodes[i].start = (start_pfn<<PAGE_SHIFT) + i*sz;
+ /*
+ * In case we are not able to allocate enough memory for all
+ * the nodes, we reduce the number of fake nodes.
+ */
+ if (end >= max_addr) {
+ numa_fake = i - 1;
+ break;
+ }
+ start = nodes[i].start = end;
+ /*
+ * Final node can have all the remaining memory.
+ */
if (i == numa_fake-1)
- sz = (end_pfn<<PAGE_SHIFT) - nodes[i].start;
- nodes[i].end = nodes[i].start + sz;
+ sz = max_addr - start;
+ end = nodes[i].start + sz;
+ /*
+ * The first "big" nodes get an extra granule each.
+ */
+ if (i < big)
+ end += FAKE_NODE_MIN_SIZE;
+ /*
+ * Iterate over the range to ensure that this node gets at
+ * least sz amount of RAM (excluding holes)
+ */
+ while ((end - start - e820_hole_size(start, end)) < sz) {
+ end += FAKE_NODE_MIN_SIZE;
+ if (end >= max_addr)
+ break;
+ }
+ /*
+ * Look at the next node to make sure there is some real memory
+ * to map. Bad things happen when the only memory present
+ * in a zone on a fake node is an IO hole.
+ */
+ while (e820_hole_size(end, end + FAKE_NODE_MIN_SIZE) > 0) {
+ if (zone_cross_over(start, end + sz)) {
+ end = (MAX_DMA32_PFN << PAGE_SHIFT);
+ break;
+ }
+ if (end >= max_addr)
+ break;
+ end += FAKE_NODE_MIN_SIZE;
+ }
+ if (end > max_addr)
+ end = max_addr;
+ nodes[i].end = end;
printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n",
i,
nodes[i].start, nodes[i].end,
Index: linux/include/asm-x86_64/e820.h
===================================================================
--- linux.orig/include/asm-x86_64/e820.h
+++ linux/include/asm-x86_64/e820.h
@@ -46,6 +46,7 @@ extern void e820_mark_nosave_regions(voi
extern void e820_print_map(char *who);
extern int e820_any_mapped(unsigned long start, unsigned long end, unsigned type);
extern int e820_all_mapped(unsigned long start, unsigned long end, unsigned type);
+extern unsigned long e820_hole_size(unsigned long start, unsigned long end);

extern void e820_setup_gap(void);
extern void e820_register_active_regions(int nid,
Index: linux/include/asm-x86_64/mmzone.h
===================================================================
--- linux.orig/include/asm-x86_64/mmzone.h
+++ linux/include/asm-x86_64/mmzone.h
@@ -47,5 +47,10 @@ static inline __attribute__((pure)) int
extern int pfn_valid(unsigned long pfn);
#endif

+#ifdef CONFIG_NUMA_EMU
+#define FAKE_NODE_MIN_SIZE (64*1024*1024)
+#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1ul))
+#endif
+
#endif
#endif

2007-02-12 07:47:55

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [8/39] x86_64: Remove fastcall references in x86_64 code


From: Glauber de Oliveira Costa <[email protected]>

Unlike x86, x86_64 already passes arguments in registers. The use of
regparm attribute makes no difference in produced code, and the use of
fastcall just bloats the code.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/x86_64/kernel/acpi/sleep.c | 2 +-
arch/x86_64/kernel/x8664_ksyms.c | 4 ++--
include/asm-x86_64/hw_irq.h | 2 +-
include/asm-x86_64/mutex.h | 6 +++---
4 files changed, 7 insertions(+), 7 deletions(-)

Index: linux/arch/x86_64/kernel/acpi/sleep.c
===================================================================
--- linux.orig/arch/x86_64/kernel/acpi/sleep.c
+++ linux/arch/x86_64/kernel/acpi/sleep.c
@@ -58,7 +58,7 @@ unsigned long acpi_wakeup_address = 0;
unsigned long acpi_video_flags;
extern char wakeup_start, wakeup_end;

-extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long));
+extern unsigned long acpi_copy_wakeup_routine(unsigned long);

static pgd_t low_ptr;

Index: linux/arch/x86_64/kernel/x8664_ksyms.c
===================================================================
--- linux.orig/arch/x86_64/kernel/x8664_ksyms.c
+++ linux/arch/x86_64/kernel/x8664_ksyms.c
@@ -36,8 +36,8 @@ EXPORT_SYMBOL(copy_page);
EXPORT_SYMBOL(clear_page);

#ifdef CONFIG_SMP
-extern void FASTCALL( __write_lock_failed(rwlock_t *rw));
-extern void FASTCALL( __read_lock_failed(rwlock_t *rw));
+extern void __write_lock_failed(rwlock_t *rw);
+extern void __read_lock_failed(rwlock_t *rw);
EXPORT_SYMBOL(__write_lock_failed);
EXPORT_SYMBOL(__read_lock_failed);
#endif
Index: linux/include/asm-x86_64/hw_irq.h
===================================================================
--- linux.orig/include/asm-x86_64/hw_irq.h
+++ linux/include/asm-x86_64/hw_irq.h
@@ -91,7 +91,7 @@ extern void enable_8259A_irq(unsigned in
extern int i8259A_irq_pending(unsigned int irq);
extern void make_8259A_irq(unsigned int irq);
extern void init_8259A(int aeoi);
-extern void FASTCALL(send_IPI_self(int vector));
+extern void send_IPI_self(int vector);
extern void init_VISWS_APIC_irqs(void);
extern void setup_IO_APIC(void);
extern void disable_IO_APIC(void);
Index: linux/include/asm-x86_64/mutex.h
===================================================================
--- linux.orig/include/asm-x86_64/mutex.h
+++ linux/include/asm-x86_64/mutex.h
@@ -21,7 +21,7 @@ do { \
unsigned long dummy; \
\
typecheck(atomic_t *, v); \
- typecheck_fn(fastcall void (*)(atomic_t *), fail_fn); \
+ typecheck_fn(void (*)(atomic_t *), fail_fn); \
\
__asm__ __volatile__( \
LOCK_PREFIX " decl (%%rdi) \n" \
@@ -47,7 +47,7 @@ do { \
*/
static inline int
__mutex_fastpath_lock_retval(atomic_t *count,
- int fastcall (*fail_fn)(atomic_t *))
+ int (*fail_fn)(atomic_t *))
{
if (unlikely(atomic_dec_return(count) < 0))
return fail_fn(count);
@@ -67,7 +67,7 @@ do { \
unsigned long dummy; \
\
typecheck(atomic_t *, v); \
- typecheck_fn(fastcall void (*)(atomic_t *), fail_fn); \
+ typecheck_fn(void (*)(atomic_t *), fail_fn); \
\
__asm__ __volatile__( \
LOCK_PREFIX " incl (%%rdi) \n" \

2007-02-12 07:47:56

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [9/39] x86_64: Use constant instead of raw number in x86_64 ioperm.c


From: Glauber de Oliveira Costa <[email protected]>

This is a tiny cleanup to increase readability.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/x86_64/kernel/ioport.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/x86_64/kernel/ioport.c
===================================================================
--- linux.orig/arch/x86_64/kernel/ioport.c
+++ linux/arch/x86_64/kernel/ioport.c
@@ -114,6 +114,6 @@ asmlinkage long sys_iopl(unsigned int le
if (!capable(CAP_SYS_RAWIO))
return -EPERM;
}
- regs->eflags = (regs->eflags &~ 0x3000UL) | (level << 12);
+ regs->eflags = (regs->eflags &~ X86_EFLAGS_IOPL) | (level << 12);
return 0;
}

2007-02-12 07:48:59

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [10/39] x86_64: Handle 32 bit PerfMon Counter writes cleanly in x86_64 nmi_watchdog


From: Venkatesh Pallipadi <[email protected]>


P6 CPUs, and Core/Core 2 CPUs which have the 'architectural perf mon'
feature, only support writes of the low 32 bits in the Performance
Monitoring Counters. Bits 32..39 are sign-extended from bit 31, and bits
40..63 are reserved and should be zero.

This patch:

Change x86_64 nmi handler to handle this case cleanly.
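
Condensed, the constraint and the resulting write look like this
(a sketch distilled from the patch below):

	/* The period must fit in 31 bits, since only the low 32 bits
	 * of the counter are writable and bit 31 sign-extends upward;
	 * adjust_for_32bit_ctr() raises nmi_hz to guarantee that. */
	u64 period = (u64)cpu_khz * 1000 / nmi_hz;	/* <= 0x7fffffff */

	/* Write only the low half; the negative value sign-extends. */
	wrmsr(perfctr_msr, (u32)(-period), 0);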

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/kernel/nmi.c | 46 ++++++++++++++++++++++++++++++++--------------
1 file changed, 32 insertions(+), 14 deletions(-)

Index: linux/arch/x86_64/kernel/nmi.c
===================================================================
--- linux.orig/arch/x86_64/kernel/nmi.c
+++ linux/arch/x86_64/kernel/nmi.c
@@ -214,6 +214,23 @@ static __init void nmi_cpu_busy(void *da
}
#endif

+static unsigned int adjust_for_32bit_ctr(unsigned int hz)
+{
+ unsigned int retval = hz;
+
+ /*
+ * On Intel CPUs with ARCH_PERFMON only 32 bits in the counter
+ * are writable, with higher bits sign extending from bit 31.
+ * So we can only program the counter with 31-bit values; bit 31
+ * must be 1 so that the sign-extended bits above it are 1 too.
+ * Find the appropriate nmi_hz
+ */
+ if ((((u64)cpu_khz * 1000) / retval) > 0x7fffffffULL) {
+ retval = ((u64)cpu_khz * 1000) / 0x7fffffffUL + 1;
+ }
+ return retval;
+}
+
int __init check_nmi_watchdog (void)
{
int *counts;
@@ -268,17 +285,8 @@ int __init check_nmi_watchdog (void)
struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);

nmi_hz = 1;
- /*
- * On Intel CPUs with ARCH_PERFMON only 32 bits in the counter
- * are writable, with higher bits sign extending from bit 31.
- * So, we can only program the counter with 31 bit values and
- * 32nd bit should be 1, for 33.. to be 1.
- * Find the appropriate nmi_hz
- */
- if (wd->perfctr_msr == MSR_ARCH_PERFMON_PERFCTR0 &&
- ((u64)cpu_khz * 1000) > 0x7fffffffULL) {
- nmi_hz = ((u64)cpu_khz * 1000) / 0x7fffffffUL + 1;
- }
+ if (wd->perfctr_msr == MSR_ARCH_PERFMON_PERFCTR0)
+ nmi_hz = adjust_for_32bit_ctr(nmi_hz);
}

kfree(counts);
@@ -634,7 +642,9 @@ static int setup_intel_arch_watchdog(voi

/* setup the timer */
wrmsr(evntsel_msr, evntsel, 0);
- wrmsrl(perfctr_msr, -((u64)cpu_khz * 1000 / nmi_hz));
+
+ nmi_hz = adjust_for_32bit_ctr(nmi_hz);
+ wrmsr(perfctr_msr, (u32)(-((u64)cpu_khz * 1000 / nmi_hz)), 0);

apic_write(APIC_LVTPC, APIC_DM_NMI);
evntsel |= ARCH_PERFMON_EVENTSEL0_ENABLE;
@@ -855,15 +865,23 @@ int __kprobes nmi_watchdog_tick(struct p
dummy &= ~P4_CCCR_OVF;
wrmsrl(wd->cccr_msr, dummy);
apic_write(APIC_LVTPC, APIC_DM_NMI);
+ /* start the cycle over again */
+ wrmsrl(wd->perfctr_msr,
+ -((u64)cpu_khz * 1000 / nmi_hz));
} else if (wd->perfctr_msr == MSR_ARCH_PERFMON_PERFCTR0) {
/*
* ArchPerfom/Core Duo needs to re-unmask
* the apic vector
*/
apic_write(APIC_LVTPC, APIC_DM_NMI);
+ /* ARCH_PERFMON has 32 bit counter writes */
+ wrmsr(wd->perfctr_msr,
+ (u32)(-((u64)cpu_khz * 1000 / nmi_hz)), 0);
+ } else {
+ /* start the cycle over again */
+ wrmsrl(wd->perfctr_msr,
+ -((u64)cpu_khz * 1000 / nmi_hz));
}
- /* start the cycle over again */
- wrmsrl(wd->perfctr_msr, -((u64)cpu_khz * 1000 / nmi_hz));
rc = 1;
} else if (nmi_watchdog == NMI_IO_APIC) {
/* don't know how to accurately check for this.

2007-02-12 07:48:59

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [13/39] i386: CONFIG_PHYSICAL_ALIGN limited to 4M?


From: Rene Herman <[email protected]>
A while ago it was remarked on list here that keeping the kernel
physically 4M aligned might be a performance win, since the added 1M (it
normally loads at 1M) meant it would fit on one 4M aligned hugepage
instead of two, and I've been doing such since.

In fact, while I was at it, I ran the kernel at 16M; while admittedly a
bit of a non-issue, having never experienced ZONE_DMA shortage, I am an
ISA user on a >16M machine so this seemed to make sense -- no kernel
eating up "precious" ISA-DMAable memory.

Recently CONFIG_PHYSICAL_START was replaced by CONFIG_PHYSICAL_ALIGN
(commit e69f202d0a1419219198566e1c22218a5c71a9a6) and while 4M alignment
is still possible, that's also the strictest alignment allowed meaning I
can't load my (non-relocatable) kernel at 16M anymore.

If I just apply the following and set it to 16M, things seem to be
working for me. Was there an important reason to limit the alignment to
4M, and if so, even on non relocatable kernels?

Rene.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/i386/Kconfig
===================================================================
--- linux.orig/arch/i386/Kconfig
+++ linux/arch/i386/Kconfig
@@ -843,7 +843,7 @@ config RELOCATABLE
config PHYSICAL_ALIGN
hex "Alignment value to which kernel should be aligned"
default "0x100000"
- range 0x2000 0x400000
+ range 0x2000 0x1000000
help
This value puts the alignment restrictions on physical address
where kernel is loaded and run from. Kernel is compiled for an

2007-02-12 07:49:00

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [12/39] i386: Handle 32 bit PerfMon Counter writes cleanly in oprofile


From: Venkatesh Pallipadi <[email protected]>

Handle these 32 bit perfmon counter MSR writes cleanly in oprofile.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/oprofile/op_model_ppro.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

Index: linux/arch/i386/oprofile/op_model_ppro.c
===================================================================
--- linux.orig/arch/i386/oprofile/op_model_ppro.c
+++ linux/arch/i386/oprofile/op_model_ppro.c
@@ -24,7 +24,8 @@

#define CTR_IS_RESERVED(msrs,c) (msrs->counters[(c)].addr ? 1 : 0)
#define CTR_READ(l,h,msrs,c) do {rdmsr(msrs->counters[(c)].addr, (l), (h));} while (0)
-#define CTR_WRITE(l,msrs,c) do {wrmsr(msrs->counters[(c)].addr, -(u32)(l), -1);} while (0)
+#define CTR_32BIT_WRITE(l,msrs,c) \
+ do {wrmsr(msrs->counters[(c)].addr, -(u32)(l), 0);} while (0)
#define CTR_OVERFLOWED(n) (!((n) & (1U<<31)))

#define CTRL_IS_RESERVED(msrs,c) (msrs->controls[(c)].addr ? 1 : 0)
@@ -79,7 +80,7 @@ static void ppro_setup_ctrs(struct op_ms
for (i = 0; i < NUM_COUNTERS; ++i) {
if (unlikely(!CTR_IS_RESERVED(msrs,i)))
continue;
- CTR_WRITE(1, msrs, i);
+ CTR_32BIT_WRITE(1, msrs, i);
}

/* enable active counters */
@@ -87,7 +88,7 @@ static void ppro_setup_ctrs(struct op_ms
if ((counter_config[i].enabled) && (CTR_IS_RESERVED(msrs,i))) {
reset_value[i] = counter_config[i].count;

- CTR_WRITE(counter_config[i].count, msrs, i);
+ CTR_32BIT_WRITE(counter_config[i].count, msrs, i);

CTRL_READ(low, high, msrs, i);
CTRL_CLEAR(low);
@@ -116,7 +117,7 @@ static int ppro_check_ctrs(struct pt_reg
CTR_READ(low, high, msrs, i);
if (CTR_OVERFLOWED(low)) {
oprofile_add_sample(regs, i);
- CTR_WRITE(reset_value[i], msrs, i);
+ CTR_32BIT_WRITE(reset_value[i], msrs, i);
}
}

2007-02-12 07:49:46

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [11/39] i386: Handle 32 bit PerfMon Counter writes cleanly in i386 nmi_watchdog


From: Venkatesh Pallipadi <[email protected]>

Change i386 nmi handler to handle 32 bit perfmon counter MSR writes cleanly.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/kernel/nmi.c | 64 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 48 insertions(+), 16 deletions(-)

Index: linux/arch/i386/kernel/nmi.c
===================================================================
--- linux.orig/arch/i386/kernel/nmi.c
+++ linux/arch/i386/kernel/nmi.c
@@ -216,6 +216,28 @@ static __init void nmi_cpu_busy(void *da
}
#endif

+static unsigned int adjust_for_32bit_ctr(unsigned int hz)
+{
+ u64 counter_val;
+ unsigned int retval = hz;
+
+ /*
+ * On Intel CPUs with P6/ARCH_PERFMON only 32 bits in the counter
+ * are writable, with higher bits sign extending from bit 31.
+ * So we can only program the counter with 31-bit values; bit 31
+ * must be 1 so that the sign-extended bits above it are 1 too.
+ * Find the appropriate nmi_hz
+ */
+ counter_val = (u64)cpu_khz * 1000;
+ do_div(counter_val, retval);
+ if (counter_val > 0x7fffffffULL) {
+ u64 count = (u64)cpu_khz * 1000;
+ do_div(count, 0x7fffffffUL);
+ retval = count + 1;
+ }
+ return retval;
+}
+
static int __init check_nmi_watchdog(void)
{
unsigned int *prev_nmi_count;
@@ -281,18 +303,10 @@ static int __init check_nmi_watchdog(voi
struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);

nmi_hz = 1;
- /*
- * On Intel CPUs with ARCH_PERFMON only 32 bits in the counter
- * are writable, with higher bits sign extending from bit 31.
- * So, we can only program the counter with 31 bit values and
- * 32nd bit should be 1, for 33.. to be 1.
- * Find the appropriate nmi_hz
- */
- if (wd->perfctr_msr == MSR_ARCH_PERFMON_PERFCTR0 &&
- ((u64)cpu_khz * 1000) > 0x7fffffffULL) {
- u64 count = (u64)cpu_khz * 1000;
- do_div(count, 0x7fffffffUL);
- nmi_hz = count + 1;
+
+ if (wd->perfctr_msr == MSR_P6_PERFCTR0 ||
+ wd->perfctr_msr == MSR_ARCH_PERFMON_PERFCTR0) {
+ nmi_hz = adjust_for_32bit_ctr(nmi_hz);
}
}

@@ -442,6 +456,17 @@ static void write_watchdog_counter(unsig
wrmsrl(perfctr_msr, 0 - count);
}

+static void write_watchdog_counter32(unsigned int perfctr_msr,
+ const char *descr)
+{
+ u64 count = (u64)cpu_khz * 1000;
+
+ do_div(count, nmi_hz);
+ if(descr)
+ Dprintk("setting %s to -0x%08Lx\n", descr, count);
+ wrmsr(perfctr_msr, (u32)(-count), 0);
+}
+
/* Note that these events don't tick when the CPU idles. This means
the frequency varies with CPU load. */

@@ -531,7 +556,8 @@ static int setup_p6_watchdog(void)

/* setup the timer */
wrmsr(evntsel_msr, evntsel, 0);
- write_watchdog_counter(perfctr_msr, "P6_PERFCTR0");
+ nmi_hz = adjust_for_32bit_ctr(nmi_hz);
+ write_watchdog_counter32(perfctr_msr, "P6_PERFCTR0");
apic_write(APIC_LVTPC, APIC_DM_NMI);
evntsel |= P6_EVNTSEL0_ENABLE;
wrmsr(evntsel_msr, evntsel, 0);
@@ -704,7 +730,8 @@ static int setup_intel_arch_watchdog(voi

/* setup the timer */
wrmsr(evntsel_msr, evntsel, 0);
- write_watchdog_counter(perfctr_msr, "INTEL_ARCH_PERFCTR0");
+ nmi_hz = adjust_for_32bit_ctr(nmi_hz);
+ write_watchdog_counter32(perfctr_msr, "INTEL_ARCH_PERFCTR0");
apic_write(APIC_LVTPC, APIC_DM_NMI);
evntsel |= ARCH_PERFMON_EVENTSEL0_ENABLE;
wrmsr(evntsel_msr, evntsel, 0);
@@ -956,6 +983,8 @@ __kprobes int nmi_watchdog_tick(struct p
dummy &= ~P4_CCCR_OVF;
wrmsrl(wd->cccr_msr, dummy);
apic_write(APIC_LVTPC, APIC_DM_NMI);
+ /* start the cycle over again */
+ write_watchdog_counter(wd->perfctr_msr, NULL);
}
else if (wd->perfctr_msr == MSR_P6_PERFCTR0 ||
wd->perfctr_msr == MSR_ARCH_PERFMON_PERFCTR0) {
@@ -964,9 +993,12 @@ __kprobes int nmi_watchdog_tick(struct p
* other P6 variant.
* ArchPerfom/Core Duo also needs this */
apic_write(APIC_LVTPC, APIC_DM_NMI);
+ /* P6/ARCH_PERFMON has 32 bit counter write */
+ write_watchdog_counter32(wd->perfctr_msr, NULL);
+ } else {
+ /* start the cycle over again */
+ write_watchdog_counter(wd->perfctr_msr, NULL);
}
- /* start the cycle over again */
- write_watchdog_counter(wd->perfctr_msr, NULL);
rc = 1;
} else if (nmi_watchdog == NMI_IO_APIC) {
/* don't know how to accurately check for this.

2007-02-12 07:50:09

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [5/39] i386: improve sched_clock() on i686


From: Ingo Molnar <[email protected]>

Clean up sched_clock() on i686: it will use the TSC if available and fall
back to jiffies only if the user asked for it to be disabled via notsc or
the CPU calibration code didn't figure out the right cpu_khz.

This generally makes the scheduler timestamps more fine-grained, on all
hardware. (The current scheduler is pretty resistant to asynchronous
sched_clock() values on different CPUs; it will allow at most a jiffy of
jitter.)

Also simplify sched_clock()'s check for TSC availability: propagate the
desire and ability to use the TSC into the tsc_disable flag; previously
this flag only indicated whether the notsc option was passed. This makes
the rare low-res sched_clock() codepath a single branch off a read-mostly
flag.
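
The resulting fast path is short; a condensed sketch of the new shape
(omitting the custom_sched_clock hook handled at the top of the real
function):

	unsigned long long sched_clock(void)
	{
		unsigned long long cycles;

		if (unlikely(tsc_disable))
			/* coarse fallback: one-jiffy resolution */
			return (jiffies_64 - INITIAL_JIFFIES) *
				(1000000000 / HZ);

		rdtscll(cycles);		/* read the TSC */
		return cycles_2_ns(cycles);	/* scale to nanoseconds */
	}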

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

arch/i386/kernel/tsc.c | 22 ++++++++++++++--------
include/asm-i386/bugs.h | 2 +-
2 files changed, 15 insertions(+), 9 deletions(-)

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -112,13 +112,10 @@ unsigned long long sched_clock(void)
return (*custom_sched_clock)();

/*
- * in the NUMA case we dont use the TSC as they are not
- * synchronized across all CPUs.
+ * Fall back to jiffies if there's no TSC available:
*/
-#ifndef CONFIG_NUMA
- if (!cpu_khz || check_tsc_unstable())
-#endif
- /* no locking but a rare wrong value is not a big deal */
+ if (unlikely(tsc_disable))
+ /* No locking but a rare wrong value is not a big deal: */
return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);

/* read the Time Stamp Counter: */
@@ -198,13 +195,13 @@ EXPORT_SYMBOL(recalibrate_cpu_khz);
void __init tsc_init(void)
{
if (!cpu_has_tsc || tsc_disable)
- return;
+ goto out_no_tsc;

cpu_khz = calculate_cpu_khz();
tsc_khz = cpu_khz;

if (!cpu_khz)
- return;
+ goto out_no_tsc;

printk("Detected %lu.%03lu MHz processor.\n",
(unsigned long)cpu_khz / 1000,
@@ -212,6 +209,15 @@ void __init tsc_init(void)

set_cyc2ns_scale(cpu_khz);
use_tsc_delay();
+ return;
+
+out_no_tsc:
+ /*
+ * Set the tsc_disable flag if there's no TSC support, this
+ * makes it a fast flag for the kernel to see whether it
+ * should be using the TSC.
+ */
+ tsc_disable = 1;
}

#ifdef CONFIG_CPU_FREQ
Index: linux/include/asm-i386/bugs.h
===================================================================
--- linux.orig/include/asm-i386/bugs.h
+++ linux/include/asm-i386/bugs.h
@@ -160,7 +160,7 @@ static void __init check_config(void)
* If we configured ourselves for a TSC, we'd better have one!
*/
#ifdef CONFIG_X86_TSC
- if (!cpu_has_tsc)
+ if (!cpu_has_tsc && !tsc_disable)
panic("Kernel compiled for Pentium+, requires TSC feature!");
#endif

2007-02-12 07:50:49

by Andi Kleen

[permalink] [raw]
Subject: [PATCH x86 for review II] [4/39] i386: add idle notifier


From: Stephane Eranian <[email protected]>

Add a notifier mechanism to the low level idle loop. You can register a
callback function which gets invoked on entry and exit from the low level idle
loop. The low level idle loop is defined as the polling loop, low-power call,
or the mwait instruction. Interrupts processed by the idle thread are not
considered part of the low level loop.

The notifier can be used to measure precisely how much time is spent in
useless execution (or low-power mode). The perfmon subsystem uses it to
turn monitoring on and off.
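
A minimal consumer looks like this (a sketch with hypothetical names,
built on the interface added in include/asm-i386/idle.h below):

	static atomic_t idle_count = ATOMIC_INIT(0);

	static int my_idle_event(struct notifier_block *nb,
				 unsigned long action, void *data)
	{
		if (action == IDLE_START)
			atomic_inc(&idle_count);  /* entering idle */
		/* IDLE_END would be the place to resume monitoring */
		return NOTIFY_OK;
	}

	static struct notifier_block my_idle_nb = {
		.notifier_call = my_idle_event,
	};

	/* from some init path: */
	idle_notifier_register(&my_idle_nb);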

Signed-off-by: stephane eranian <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

arch/i386/kernel/apic.c | 4 ++
arch/i386/kernel/cpu/mcheck/p4.c | 2 +
arch/i386/kernel/irq.c | 3 ++
arch/i386/kernel/process.c | 53 ++++++++++++++++++++++++++++++++++++++-
arch/i386/kernel/smp.c | 2 +
include/asm-i386/idle.h | 14 ++++++++++
include/asm-i386/processor.h | 8 +++++
7 files changed, 85 insertions(+), 1 deletion(-)

Index: linux/arch/i386/kernel/apic.c
===================================================================
--- linux.orig/arch/i386/kernel/apic.c
+++ linux/arch/i386/kernel/apic.c
@@ -36,6 +36,7 @@
#include <asm/hpet.h>
#include <asm/i8253.h>
#include <asm/nmi.h>
+#include <asm/idle.h>

#include <mach_apic.h>
#include <mach_apicdef.h>
@@ -1255,6 +1256,7 @@ fastcall void smp_apic_timer_interrupt(s
* Besides, if we don't timer interrupts ignore the global
* interrupt lock, which is the WrongThing (tm) to do.
*/
+ exit_idle();
irq_enter();
smp_local_timer_interrupt();
irq_exit();
@@ -1305,6 +1307,7 @@ fastcall void smp_spurious_interrupt(str
{
unsigned long v;

+ exit_idle();
irq_enter();
/*
* Check if this really is a spurious interrupt and ACK it
@@ -1329,6 +1332,7 @@ fastcall void smp_error_interrupt(struct
{
unsigned long v, v1;

+ exit_idle();
irq_enter();
/* First tickle the hardware, only then report what went on. -- REW */
v = apic_read(APIC_ESR);
Index: linux/arch/i386/kernel/cpu/mcheck/p4.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/mcheck/p4.c
+++ linux/arch/i386/kernel/cpu/mcheck/p4.c
@@ -12,6 +12,7 @@
#include <asm/system.h>
#include <asm/msr.h>
#include <asm/apic.h>
+#include <asm/idle.h>

#include <asm/therm_throt.h>

@@ -59,6 +60,7 @@ static void (*vendor_thermal_interrupt)(

fastcall void smp_thermal_interrupt(struct pt_regs *regs)
{
+ exit_idle();
irq_enter();
vendor_thermal_interrupt(regs);
irq_exit();
Index: linux/arch/i386/kernel/irq.c
===================================================================
--- linux.orig/arch/i386/kernel/irq.c
+++ linux/arch/i386/kernel/irq.c
@@ -19,6 +19,8 @@
#include <linux/cpu.h>
#include <linux/delay.h>

+#include <asm/idle.h>
+
DEFINE_PER_CPU(irq_cpustat_t, irq_stat) ____cacheline_internodealigned_in_smp;
EXPORT_PER_CPU_SYMBOL(irq_stat);

@@ -61,6 +63,7 @@ fastcall unsigned int do_IRQ(struct pt_r
union irq_ctx *curctx, *irqctx;
u32 *isp;
#endif
+ exit_idle();

if (unlikely((unsigned)irq >= NR_IRQS)) {
printk(KERN_EMERG "%s: cannot handle IRQ %d\n",
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -48,6 +48,7 @@
#include <asm/i387.h>
#include <asm/desc.h>
#include <asm/vm86.h>
+#include <asm/idle.h>
#ifdef CONFIG_MATH_EMULATION
#include <asm/math_emu.h>
#endif
@@ -80,6 +81,42 @@ void (*pm_idle)(void);
EXPORT_SYMBOL(pm_idle);
static DEFINE_PER_CPU(unsigned int, cpu_idle_state);

+static ATOMIC_NOTIFIER_HEAD(idle_notifier);
+
+void idle_notifier_register(struct notifier_block *n)
+{
+ atomic_notifier_chain_register(&idle_notifier, n);
+}
+
+void idle_notifier_unregister(struct notifier_block *n)
+{
+ atomic_notifier_chain_unregister(&idle_notifier, n);
+}
+
+static DEFINE_PER_CPU(volatile unsigned long, idle_state);
+
+void enter_idle(void)
+{
+ /* needs to be atomic w.r.t. interrupts, not against other CPUs */
+ __set_bit(0, &__get_cpu_var(idle_state));
+ atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
+}
+
+static void __exit_idle(void)
+{
+ /* needs to be atomic w.r.t. interrupts, not against other CPUs */
+ if (__test_and_clear_bit(0, &__get_cpu_var(idle_state)) == 0)
+ return;
+ atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
+}
+
+void exit_idle(void)
+{
+ if (current->pid)
+ return;
+ __exit_idle();
+}
+
void disable_hlt(void)
{
hlt_counter++;
@@ -130,6 +167,7 @@ EXPORT_SYMBOL(default_idle);
*/
static void poll_idle (void)
{
+ local_irq_enable();
cpu_relax();
}

@@ -189,7 +227,16 @@ void cpu_idle(void)
play_dead();

__get_cpu_var(irq_stat).idle_timestamp = jiffies;
+
+ /*
+ * Idle routines should keep interrupts disabled
+ * from here on, until they go to idle.
+ * Otherwise, idle callbacks can misfire.
+ */
+ local_irq_disable();
+ enter_idle();
idle();
+ __exit_idle();
}
preempt_enable_no_resched();
schedule();
@@ -243,7 +290,11 @@ void mwait_idle_with_hints(unsigned long
__monitor((void *)&current_thread_info()->flags, 0, 0);
smp_mb();
if (!need_resched())
- __mwait(eax, ecx);
+ __sti_mwait(eax, ecx);
+ else
+ local_irq_enable();
+ } else {
+ local_irq_enable();
}
}

Index: linux/arch/i386/kernel/smp.c
===================================================================
--- linux.orig/arch/i386/kernel/smp.c
+++ linux/arch/i386/kernel/smp.c
@@ -23,6 +23,7 @@

#include <asm/mtrr.h>
#include <asm/tlbflush.h>
+#include <asm/idle.h>
#include <mach_apic.h>

/*
@@ -624,6 +625,7 @@ fastcall void smp_call_function_interrup
/*
* At this point the info structure may be out of scope unless wait==1
*/
+ exit_idle();
irq_enter();
(*func)(info);
irq_exit();
Index: linux/include/asm-i386/idle.h
===================================================================
--- /dev/null
+++ linux/include/asm-i386/idle.h
@@ -0,0 +1,14 @@
+#ifndef _ASM_I386_IDLE_H
+#define _ASM_I386_IDLE_H 1
+
+#define IDLE_START 1
+#define IDLE_END 2
+
+struct notifier_block;
+void idle_notifier_register(struct notifier_block *n);
+void idle_notifier_unregister(struct notifier_block *n);
+
+void exit_idle(void);
+void enter_idle(void);
+
+#endif
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -257,6 +257,14 @@ static inline void __mwait(unsigned long
: :"a" (eax), "c" (ecx));
}

+static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
+{
+ /* "mwait %eax,%ecx;" */
+ asm volatile(
+ "sti; .byte 0x0f,0x01,0xc9;"
+ : :"a" (eax), "c" (ecx));
+}
+
extern void mwait_idle_with_hints(unsigned long eax, unsigned long ecx);

/* from system description table in BIOS. Mostly for MCA use, but
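
One subtlety worth spelling out (my reading of the change, not text from the
patch): cpu_idle() now calls into the idle routine with interrupts disabled,
and __sti_mwait() relies on the x86 guarantee that sti delays interrupt
delivery by one instruction. The resulting pattern in mwait_idle_with_hints(),
simplified and with explanatory comments added:

        local_irq_disable();    /* done in cpu_idle() before idle() runs */
        __monitor((void *)&current_thread_info()->flags, 0, 0);
        smp_mb();
        if (!need_resched())
                /* sti's one-instruction shadow means no interrupt can be
                 * taken between the need_resched() check and the mwait,
                 * so a wakeup cannot be lost: it ends the mwait instead. */
                __sti_mwait(eax, ecx);
        else
                local_irq_enable();     /* work is already pending */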

2007-02-12 07:53:59

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH x86 for review II] [18/39] x86_64: Allow to run a program when a machine check event is detected

On Monday, 12 February 2007 08:38, Andi Kleen wrote:
> When a machine check event is detected (including an AMD RevF threshold
> overflow event) allow to run a "trigger" program. This allows user space
> to react to such events sooner.

Could this not be merged with other reporting mechanisms? This looks like
a new incarnation of the /etc/hotplug code.

Regards
Oliver

2007-02-12 08:04:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH x86 for review II] [18/39] x86_64: Allow to run a program when a machine check event is detected

On Monday 12 February 2007 08:54, Oliver Neukum wrote:
> On Monday, 12 February 2007 08:38, Andi Kleen wrote:
> > When a machine check event is detected (including an AMD RevF threshold
> > overflow event) allow to run a "trigger" program. This allows user space
> > to react to such events sooner.
>
> Could this not be merged with other reporting mechanisms? This looks like
> a new incarnation of the /etc/hotplug code.

I refuse to make mcelog depend on dbus. Just because some desktops want bloat
doesn't mean that everybody else wants it too.

I don't know of any other lightweight mechanism except for this.

-Andi

2007-02-12 08:11:28

by Bauke Jan Douma

[permalink] [raw]
Subject: Re: [PATCH x86 for review II] [18/39] x86_64: Allow to run a program when a machine check event is detected

Andi Kleen wrote on 12-02-07 09:04:
> On Monday 12 February 2007 08:54, Oliver Neukum wrote:
>> On Monday, 12 February 2007 08:38, Andi Kleen wrote:
>>> When a machine check event is detected (including an AMD RevF threshold
>>> overflow event) allow to run a "trigger" program. This allows user space
>>> to react to such events sooner.
>> Could this not be merged with other reporting mechanisms? This looks like
>> a new incarnation of the /etc/hotplug code.
>
> I refuse to make mcelog depend on dbus. Just because some desktops want bloat
> doesn't mean that everybody else wants it too.

Man, how I agree with that!
Refreshing to hear someone just say no.

bjd



2007-02-12 13:24:33

by Giuliano Procida

[permalink] [raw]
Subject: Re: [PATCH x86 for review II] [26/39] i386: fix 32-bit ioctls on x64_32

This is a nicer version of the MTRR compatibility ioctl patch; it compiles
smaller and has also been tested.
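
The idea (paraphrasing the patch; the helper below is hypothetical and only
illustrates the decomposition) is that an ioctl command encodes its type,
number, direction and payload size, so a single switch on _IOC_NR() can serve
both the native and the compat commands, telling them apart by _IOC_SIZE()
alone:

#include <linux/ioctl.h>
#include <linux/errno.h>
#include <asm/mtrr.h>

/* Hypothetical helper, not in the patch: reject foreign commands and
 * expose the fields that the rewritten mtrr_ioctl() dispatches on. */
static int mtrr_cmd_fields(unsigned int cmd, unsigned *nr,
                           unsigned *dir, unsigned *size)
{
        if (_IOC_TYPE(cmd) != MTRR_IOCTL_BASE)
                return -ENOTTY;
        *nr = _IOC_NR(cmd);     /* same for MTRRIOC_GET_ENTRY and
                                 * MTRRIOC32_GET_ENTRY */
        *dir = _IOC_DIR(cmd);   /* _IOC_READ, _IOC_WRITE, or both */
        *size = _IOC_SIZE(cmd); /* sizeof(struct mtrr_gentry) vs.
                                 * sizeof(struct mtrr_gentry32) */
        return 0;
}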

Signed-off-by: Giuliano Procida <[email protected]>

--- linux-source-2.6.19.1.orig/arch/i386/kernel/cpu/mtrr/if.c 2006-12-11
19:32:53.000000000 +0000
+++ linux-source-2.6.19.1/arch/i386/kernel/cpu/mtrr/if.c 2007-01-27
12:25:21.000000000 +0000
@@ -154,150 +154,164 @@
mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
{
int err = 0;
+ const unsigned ioctl_type = _IOC_TYPE(cmd);
+ const unsigned ioctl_dir = _IOC_DIR(cmd);
+ const unsigned ioctl_nr = _IOC_NR(cmd);
+ const unsigned ioctl_size = _IOC_SIZE(cmd);
mtrr_type type;
- struct mtrr_sentry sentry;
- struct mtrr_gentry gentry;
+ union mtrr_data {
+ struct mtrr_sentry sentry;
+ struct mtrr_gentry gentry;
+#ifdef CONFIG_COMPAT
+ struct mtrr_sentry32 sentry32;
+ struct mtrr_gentry32 gentry32;
+#endif
+ } u;
void __user *arg = (void __user *) __arg;

- switch (cmd) {
- case MTRRIOC_ADD_ENTRY:
- case MTRRIOC_SET_ENTRY:
- case MTRRIOC_DEL_ENTRY:
- case MTRRIOC_KILL_ENTRY:
- case MTRRIOC_ADD_PAGE_ENTRY:
- case MTRRIOC_SET_PAGE_ENTRY:
- case MTRRIOC_DEL_PAGE_ENTRY:
- case MTRRIOC_KILL_PAGE_ENTRY:
- if (copy_from_user(&sentry, arg, sizeof sentry))
- return -EFAULT;
- break;
- case MTRRIOC_GET_ENTRY:
- case MTRRIOC_GET_PAGE_ENTRY:
- if (copy_from_user(&gentry, arg, sizeof gentry))
- return -EFAULT;
+ /* check type and max size */
+ if (ioctl_type != MTRR_IOCTL_BASE || ioctl_size > sizeof(u))
+ return -ENOTTY;
+
+ /* copy from user */
+ if (ioctl_dir & _IOC_WRITE && copy_from_user(&u, arg, ioctl_size))
+ return -EFAULT;
+
+ /* check number, direction, size and permission */
+ switch (ioctl_nr) {
+ case _IOC_NR(MTRRIOC_ADD_ENTRY):
+ case _IOC_NR(MTRRIOC_SET_ENTRY):
+ case _IOC_NR(MTRRIOC_DEL_ENTRY):
+ case _IOC_NR(MTRRIOC_KILL_ENTRY):
+ case _IOC_NR(MTRRIOC_ADD_PAGE_ENTRY):
+ case _IOC_NR(MTRRIOC_SET_PAGE_ENTRY):
+ case _IOC_NR(MTRRIOC_DEL_PAGE_ENTRY):
+ case _IOC_NR(MTRRIOC_KILL_PAGE_ENTRY):
+ if (ioctl_dir != _IOC_WRITE)
+ return -ENOTTY;
+ switch (ioctl_size) {
+ case sizeof(struct mtrr_sentry):
break;
#ifdef CONFIG_COMPAT
- case MTRRIOC32_ADD_ENTRY:
- case MTRRIOC32_SET_ENTRY:
- case MTRRIOC32_DEL_ENTRY:
- case MTRRIOC32_KILL_ENTRY:
- case MTRRIOC32_ADD_PAGE_ENTRY:
- case MTRRIOC32_SET_PAGE_ENTRY:
- case MTRRIOC32_DEL_PAGE_ENTRY:
- case MTRRIOC32_KILL_PAGE_ENTRY: {
- struct mtrr_sentry32 __user *s32 = (struct mtrr_sentry32 __user *)__arg;
- err = get_user(sentry.base, &s32->base);
- err |= get_user(sentry.size, &s32->size);
- err |= get_user(sentry.type, &s32->type);
- if (err)
- return err;
- break;
- }
- case MTRRIOC32_GET_ENTRY:
- case MTRRIOC32_GET_PAGE_ENTRY: {
- struct mtrr_gentry32 __user *g32 = (struct mtrr_gentry32 __user *)__arg;
- err = get_user(gentry.regnum, &g32->regnum);
- err |= get_user(gentry.base, &g32->base);
- err |= get_user(gentry.size, &g32->size);
- err |= get_user(gentry.type, &g32->type);
- if (err)
- return err;
+ case sizeof(struct mtrr_sentry32):
+ {
+ struct mtrr_sentry32 s32 = u.sentry32;
+ u.sentry.base = s32.base;
+ u.sentry.size = s32.size;
+ u.sentry.type = s32.type;
+ }
break;
- }
#endif
+ default:
+ return -ENOTTY;
+ }
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ break;
+ case _IOC_NR(MTRRIOC_GET_ENTRY):
+ case _IOC_NR(MTRRIOC_GET_PAGE_ENTRY):
+ if (ioctl_dir != (_IOC_READ|_IOC_WRITE))
+ return -ENOTTY;
+ switch (ioctl_size) {
+ case sizeof(struct mtrr_gentry):
+ break;
+#ifdef CONFIG_COMPAT
+ case sizeof(struct mtrr_gentry32):
+ {
+ struct mtrr_gentry32 g32 = u.gentry32;
+ u.gentry.base = g32.base;
+ u.gentry.size = g32.size;
+ u.gentry.regnum = g32.regnum;
+ u.gentry.type = g32.type;
+ }
+ break;
+#endif
+ default:
+ return -ENOTTY;
+ }
+ break;
+ default:
+ return -ENOTTY;
}

- switch (cmd) {
+ /* perform command */
+ switch (ioctl_nr) {
default:
return -ENOTTY;
- case MTRRIOC_ADD_ENTRY:
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
+ case _IOC_NR(MTRRIOC_ADD_ENTRY):
err =
- mtrr_file_add(sentry.base, sentry.size, sentry.type, 1,
+ mtrr_file_add(u.sentry.base, u.sentry.size, u.sentry.type, 1,
file, 0);
break;
- case MTRRIOC_SET_ENTRY:
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
- err = mtrr_add(sentry.base, sentry.size, sentry.type, 0);
+ case _IOC_NR(MTRRIOC_SET_ENTRY):
+ err = mtrr_add(u.sentry.base, u.sentry.size, u.sentry.type, 0);
break;
- case MTRRIOC_DEL_ENTRY:
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
- err = mtrr_file_del(sentry.base, sentry.size, file, 0);
+ case _IOC_NR(MTRRIOC_DEL_ENTRY):
+ err = mtrr_file_del(u.sentry.base, u.sentry.size, file, 0);
break;
- case MTRRIOC_KILL_ENTRY:
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
- err = mtrr_del(-1, sentry.base, sentry.size);
+ case _IOC_NR(MTRRIOC_KILL_ENTRY):
+ err = mtrr_del(-1, u.sentry.base, u.sentry.size);
break;
- case MTRRIOC_GET_ENTRY:
- if (gentry.regnum >= num_var_ranges)
+ case _IOC_NR(MTRRIOC_GET_ENTRY):
+ if (u.gentry.regnum >= num_var_ranges)
return -EINVAL;
- mtrr_if->get(gentry.regnum, &gentry.base, &gentry.size, &type);
+ mtrr_if->get(u.gentry.regnum, &u.gentry.base, &u.gentry.size, &type);

/* Hide entries that go above 4GB */
- if (gentry.base + gentry.size > 0x100000
- || gentry.size == 0x100000)
- gentry.base = gentry.size = gentry.type = 0;
+ if (u.gentry.base + u.gentry.size > 0x100000
+ || u.gentry.size == 0x100000)
+ u.gentry.base = u.gentry.size = u.gentry.type = 0;
else {
- gentry.base <<= PAGE_SHIFT;
- gentry.size <<= PAGE_SHIFT;
- gentry.type = type;
+ u.gentry.base <<= PAGE_SHIFT;
+ u.gentry.size <<= PAGE_SHIFT;
+ u.gentry.type = type;
}

break;
- case MTRRIOC_ADD_PAGE_ENTRY:
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
+ case _IOC_NR(MTRRIOC_ADD_PAGE_ENTRY):
err =
- mtrr_file_add(sentry.base, sentry.size, sentry.type, 1,
+ mtrr_file_add(u.sentry.base, u.sentry.size, u.sentry.type, 1,
file, 1);
break;
- case MTRRIOC_SET_PAGE_ENTRY:
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
- err = mtrr_add_page(sentry.base, sentry.size, sentry.type, 0);
+ case _IOC_NR(MTRRIOC_SET_PAGE_ENTRY):
+ err = mtrr_add_page(u.sentry.base, u.sentry.size, u.sentry.type, 0);
break;
- case MTRRIOC_DEL_PAGE_ENTRY:
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
- err = mtrr_file_del(sentry.base, sentry.size, file, 1);
+ case _IOC_NR(MTRRIOC_DEL_PAGE_ENTRY):
+ err = mtrr_file_del(u.sentry.base, u.sentry.size, file, 1);
break;
- case MTRRIOC_KILL_PAGE_ENTRY:
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
- err = mtrr_del_page(-1, sentry.base, sentry.size);
+ case _IOC_NR(MTRRIOC_KILL_PAGE_ENTRY):
+ err = mtrr_del_page(-1, u.sentry.base, u.sentry.size);
break;
- case MTRRIOC_GET_PAGE_ENTRY:
- if (gentry.regnum >= num_var_ranges)
+ case _IOC_NR(MTRRIOC_GET_PAGE_ENTRY):
+ if (u.gentry.regnum >= num_var_ranges)
return -EINVAL;
- mtrr_if->get(gentry.regnum, &gentry.base, &gentry.size, &type);
- gentry.type = type;
+ mtrr_if->get(u.gentry.regnum, &u.gentry.base, &u.gentry.size, &type);
+ u.gentry.type = type;
break;
}

if (err)
return err;

- switch(cmd) {
- case MTRRIOC_GET_ENTRY:
- case MTRRIOC_GET_PAGE_ENTRY:
- if (copy_to_user(arg, &gentry, sizeof gentry))
- err = -EFAULT;
- break;
+ /* copy to user */
+ if (ioctl_dir & _IOC_READ) {
+ switch (ioctl_size) {
#ifdef CONFIG_COMPAT
- case MTRRIOC32_GET_ENTRY:
- case MTRRIOC32_GET_PAGE_ENTRY: {
- struct mtrr_gentry32 __user *g32 = (struct mtrr_gentry32 __user *)__arg;
- err = put_user(gentry.base, &g32->base);
- err |= put_user(gentry.size, &g32->size);
- err |= put_user(gentry.regnum, &g32->regnum);
- err |= put_user(gentry.type, &g32->type);
+ case sizeof(struct mtrr_gentry32):
+ {
+ struct mtrr_gentry g64 = u.gentry;
+ u.gentry32.base = g64.base;
+ u.gentry32.size = g64.size;
+ u.gentry32.regnum = g64.regnum;
+ u.gentry32.type = g64.type;
+ }
break;
- }
#endif
+ default:
+ break;
+ }
+ if (copy_to_user(arg, &u, ioctl_size))
+ err = -EFAULT;
}
return err;
}

2007-02-12 15:05:48

by Pavel Machek

[permalink] [raw]
Subject: Re: [patches] [PATCH x86 for review II] [18/39] x86_64: Allow to run a program when a machine check event is detected

On Mon 2007-02-12 09:04:43, Andi Kleen wrote:
> On Monday 12 February 2007 08:54, Oliver Neukum wrote:
> > On Monday, 12 February 2007 08:38, Andi Kleen wrote:
> > > When a machine check event is detected (including an AMD RevF threshold
> > > overflow event) allow to run a "trigger" program. This allows user space
> > > to react to such events sooner.
> >
> > Could this not be merged with other reporting mechanisms? This looks like
> > a new incarnation of the /etc/hotplug code.
>
> I refuse to make mcelog depend on dbus. Just because some desktops want bloat
> doesn't mean that everybody else wants it too.

Hmm... fix the userspace, no? The /etc/hotplug code in the kernel should
work with stuff other than dbus...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-02-12 19:45:33

by Frederic Riss

[permalink] [raw]
Subject: Re: [PATCH x86 for review II] [34/39] i386: Use stack arguments for calling into EFI

On Monday 12 February 2007 at 08:38 +0100, Andi Kleen wrote:
> When calling into the EFI firmware, the parameters need to be passed on
> the stack. The recent change to use -mregparm=3 breaks x86 EFI support.
> This patch is needed to allow the new Intel-based Macs to suspend to ram
> (efi.get_time is called during the suspend phase).
>
> Signed-off-by: Frederic Riss <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>

For 2.6.20, Linus merged a different version touching only
arch/i386/kernel/efi.c. If you really prefer this version, I guess it should
be discussed more thoroughly with him and the EFI guys.

Fred.

2007-02-12 22:38:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH x86 for review II] [26/39] i386: fix 32-bit ioctls on x64_32

On Monday 12 February 2007 14:24, Giuliano Procida wrote:
> This is a nicer version of the MTRR compatibility ioctl patch; it compiles
> smaller and has also been tested.

Perhaps nice, but it doesn't apply. I will stay with the old version for now:

Applying patch patches/fix-32-bit-ioctls-on-x64_32
patching file arch/i386/kernel/cpu/mtrr/if.c
Hunk #1 FAILED at 154.
1 out of 1 hunk FAILED -- rejects in file arch/i386/kernel/cpu/mtrr/if.c


-Andi

2007-02-13 06:37:13

by Rene Herman

[permalink] [raw]
Subject: Re: [PATCH x86 for review II] [13/39] i386: CONFIG_PHYSICAL_ALIGN limited to 4M?

On 02/12/2007 08:38 AM, Andi Kleen wrote:

> From: Rene Herman <[email protected]>

[ ... ]

> --- linux.orig/arch/i386/Kconfig
> +++ linux/arch/i386/Kconfig
> @@ -843,7 +843,7 @@ config RELOCATABLE
> config PHYSICAL_ALIGN
> hex "Alignment value to which kernel should be aligned"
> default "0x100000"
> - range 0x2000 0x400000
> + range 0x2000 0x1000000
> help
> This value puts the alignment restrictions on physical address
> where kernel is loaded and run from. Kernel is compiled for an

Okay I guess, but in reply to this, Vivek Goyal pointed out a patch of his
that restores CONFIG_PHYSICAL_START instead, which was already in -mm since
it seems Xen also wanted it. That also matches what I wanted it for better:

http://lkml.org/lkml/2007/1/2/376

Rene.