2007-08-09 12:42:06

by Andi Kleen

Subject: [PATCH] [0/12] x86 Late merge bug fixes for 2.6.23


Various simple bug fixes, no new features. Please review.
I plan to send them to Linus later.

I don't have anything else critical pending. If you believe
an x86 patch is critical and really needs to make .23 please send email.

There are a few hardware fixes pending too, but they're not regressions
so probably .24 material.

Open problems:
- Still one known reference counting bug in c_p_a()
- x86-64 vdso compilation still broken on old debian binutils
* No satisfactory workaround found so far. Those users might need to update
their binutils


-Andi



2007-08-09 12:42:41

by Andi Kleen

Subject: [PATCH] [1/12] x86: Work around mmio config space quirk on AMD Fam10h


From: dean gaudet <[email protected]>

Some broken devices have been discovered to require %al/%ax/%eax registers
for MMIO config space accesses. Modify mmconfig.c to use these registers
explicitly (rather than modify the global readb/writeb/etc inlines).

AK: also changed i386 to always use eax
AK: moved change to extended space probing to different patch
AK: reworked with inlines according to Linus' requirements.
AK: improve comments.

Signed-off-by: dean gaudet <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>


---
arch/i386/pci/mmconfig.c | 14 ++++++--------
arch/i386/pci/pci.h | 43 +++++++++++++++++++++++++++++++++++++++++++
arch/x86_64/pci/mmconfig.c | 12 ++++++------
3 files changed, 55 insertions(+), 14 deletions(-)

Index: linux/arch/x86_64/pci/mmconfig.c
===================================================================
--- linux.orig/arch/x86_64/pci/mmconfig.c
+++ linux/arch/x86_64/pci/mmconfig.c
@@ -66,13 +66,13 @@ static int pci_mmcfg_read(unsigned int s

switch (len) {
case 1:
- *value = readb(addr + reg);
+ *value = mmio_config_readb(addr + reg);
break;
case 2:
- *value = readw(addr + reg);
+ *value = mmio_config_readw(addr + reg);
break;
case 4:
- *value = readl(addr + reg);
+ *value = mmio_config_readl(addr + reg);
break;
}

@@ -94,13 +94,13 @@ static int pci_mmcfg_write(unsigned int

switch (len) {
case 1:
- writeb(value, addr + reg);
+ mmio_config_writeb(addr + reg, value);
break;
case 2:
- writew(value, addr + reg);
+ mmio_config_writew(addr + reg, value);
break;
case 4:
- writel(value, addr + reg);
+ mmio_config_writel(addr + reg, value);
break;
}

Index: linux/arch/i386/pci/mmconfig.c
===================================================================
--- linux.orig/arch/i386/pci/mmconfig.c
+++ linux/arch/i386/pci/mmconfig.c
@@ -82,16 +82,15 @@ static int pci_mmcfg_read(unsigned int s

switch (len) {
case 1:
- *value = readb(mmcfg_virt_addr + reg);
+ *value = mmio_config_readb(mmcfg_virt_addr + reg);
break;
case 2:
- *value = readw(mmcfg_virt_addr + reg);
+ *value = mmio_config_readw(mmcfg_virt_addr + reg);
break;
case 4:
- *value = readl(mmcfg_virt_addr + reg);
+ *value = mmio_config_readl(mmcfg_virt_addr + reg);
break;
}
-
spin_unlock_irqrestore(&pci_config_lock, flags);

return 0;
@@ -116,16 +115,15 @@ static int pci_mmcfg_write(unsigned int

switch (len) {
case 1:
- writeb(value, mmcfg_virt_addr + reg);
+ mmio_config_writeb(mmcfg_virt_addr + reg, value);
break;
case 2:
- writew(value, mmcfg_virt_addr + reg);
+ mmio_config_writew(mmcfg_virt_addr + reg, value);
break;
case 4:
- writel(value, mmcfg_virt_addr + reg);
+ mmio_config_writel(mmcfg_virt_addr + reg, value);
break;
}
-
spin_unlock_irqrestore(&pci_config_lock, flags);

return 0;
Index: linux/arch/i386/pci/pci.h
===================================================================
--- linux.orig/arch/i386/pci/pci.h
+++ linux/arch/i386/pci/pci.h
@@ -104,3 +104,46 @@ extern DECLARE_BITMAP(pci_mmcfg_fallback
extern int __init pci_mmcfg_arch_reachable(unsigned int seg, unsigned int bus,
unsigned int devfn);
extern int __init pci_mmcfg_arch_init(void);
+
+/*
+ * AMD Fam10h CPUs are buggy, and cannot access MMIO config space
+ * on their northbridge except through the %eax register. As such, you MUST
+ * NOT use normal IOMEM accesses; only use the magic mmio-config
+ * accessor functions below.
+ * In fact, just use pci_config_*, nothing else please.
+ */
+static inline unsigned char mmio_config_readb(void __iomem *pos)
+{
+ u8 val;
+ asm volatile("movb (%1),%%al" : "=a" (val) : "r" (pos));
+ return val;
+}
+
+static inline unsigned short mmio_config_readw(void __iomem *pos)
+{
+ u16 val;
+ asm volatile("movw (%1),%%ax" : "=a" (val) : "r" (pos));
+ return val;
+}
+
+static inline unsigned int mmio_config_readl(void __iomem *pos)
+{
+ u32 val;
+ asm volatile("movl (%1),%%eax" : "=a" (val) : "r" (pos));
+ return val;
+}
+
+static inline void mmio_config_writeb(void __iomem *pos, u8 val)
+{
+ asm volatile("movb %%al,(%1)" :: "a" (val), "r" (pos) : "memory");
+}
+
+static inline void mmio_config_writew(void __iomem *pos, u16 val)
+{
+ asm volatile("movw %%ax,(%1)" :: "a" (val), "r" (pos) : "memory");
+}
+
+static inline void mmio_config_writel(void __iomem *pos, u32 val)
+{
+ asm volatile("movl %%eax,(%1)" :: "a" (val), "r" (pos) : "memory");
+}

2007-08-09 12:43:07

by Andi Kleen

Subject: [PATCH] [3/12] x86_64: Don't mark __exitcall as __cold


gcc currently doesn't support attributes on types, so we can't use
__cold on function pointers. This avoids some warnings on a gcc 4.3 build.

Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/init.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/linux/init.h
===================================================================
--- linux.orig/include/linux/init.h
+++ linux/include/linux/init.h
@@ -43,7 +43,7 @@
#define __init __attribute__ ((__section__ (".init.text"))) __cold
#define __initdata __attribute__ ((__section__ (".init.data")))
#define __exitdata __attribute__ ((__section__(".exit.data")))
-#define __exit_call __attribute_used__ __attribute__ ((__section__ (".exitcall.exit"))) __cold
+#define __exit_call __attribute_used__ __attribute__ ((__section__ (".exitcall.exit")))

/* modpost check for section mismatches during the kernel build.
* A section mismatch happens when there are references from a

2007-08-09 12:43:32

by Andi Kleen

Subject: [PATCH] [2/12] x86_64: Calgary - Fix mis-handled PCI topology


From: Murillo Fernandes Bernardes <[email protected]>
Subject: x86-64: Calgary - Fix mis-handled PCI topology

Current code assumed that devices were directly connected to a Calgary
bridge, as it tried to get the iommu table directly from the parent bus
controller.

When we have another bridge between the Calgary/CalIOC2 bridge and the
device we should look upwards until we get to the top (Calgary/CalIOC2
bridge), where the iommu table resides.

Signed-off-by: Murillo Fernandes Bernardes <[email protected]>
Signed-off-by: Muli Ben-Yehuda <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

arch/x86_64/kernel/pci-calgary.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

Index: linux/arch/x86_64/kernel/pci-calgary.c
===================================================================
--- linux.orig/arch/x86_64/kernel/pci-calgary.c
+++ linux/arch/x86_64/kernel/pci-calgary.c
@@ -367,16 +367,15 @@ static inline struct iommu_table *find_i

pdev = to_pci_dev(dev);

- /* is the device behind a bridge? */
- if (unlikely(pdev->bus->parent))
- pbus = pdev->bus->parent;
- else
- pbus = pdev->bus;
+ pbus = pdev->bus;
+
+ /* is the device behind a bridge? Look for the root bus */
+ while (pbus->parent)
+ pbus = pbus->parent;

tbl = pci_iommu(pbus);

- BUG_ON(pdev->bus->parent &&
- (tbl->it_busno != pdev->bus->parent->number));
+ BUG_ON(tbl && (tbl->it_busno != pbus->number));

return tbl;
}

2007-08-09 12:43:52

by Andi Kleen

Subject: [PATCH] [4/12] x86_64: Disable CLFLUSH support again


It turns out CLFLUSH support is still not complete; we
flush the wrong pages. Disable it again for this release.
Noticed by Jan Beulich.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/mm/pageattr.c | 2 +-
arch/x86_64/mm/pageattr.c | 3 ++-
2 files changed, 3 insertions(+), 2 deletions(-)

Index: linux/arch/i386/mm/pageattr.c
===================================================================
--- linux.orig/arch/i386/mm/pageattr.c
+++ linux/arch/i386/mm/pageattr.c
@@ -82,7 +82,7 @@ static void flush_kernel_map(void *arg)
struct page *p;

/* High level code is not ready for clflush yet */
- if (cpu_has_clflush) {
+ if (0 && cpu_has_clflush) {
list_for_each_entry (p, lh, lru)
cache_flush_page(p);
} else if (boot_cpu_data.x86_model >= 4)
Index: linux/arch/x86_64/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86_64/mm/pageattr.c
+++ linux/arch/x86_64/mm/pageattr.c
@@ -75,7 +75,8 @@ static void flush_kernel_map(void *arg)

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- if (!cpu_has_clflush)
+ /* clflush is still broken. Disable for now. */
+ if (0 && !cpu_has_clflush)
asm volatile("wbinvd" ::: "memory");
else list_for_each_entry(pg, l, lru) {
void *adr = page_address(pg);

2007-08-09 12:44:28

by Andi Kleen

Subject: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue


Commit 19d36ccdc34f5ed444f8a6af0cbfdb6790eb1177 "x86: Fix alternatives
and kprobes to remap write-protected kernel text" uses code which is
being patched for patching.

In particular, paravirt_ops does patching in two stages: first it
calls paravirt_ops.patch, then it fills any remaining instructions
with nop_out(). nop_out calls text_poke() which calls
lookup_address() which calls pgd_val() (aka paravirt_ops.pgd_val):
that call site is one of the places we patch.

If we always do patching as one single call to text_poke(), we only
need to make sure we're not patching the memcpy in text_poke() itself.
This means the prototype to paravirt_ops.patch needs to change, to
marshal the new code into a buffer rather than patching in place as it
does now. It also means all patching goes through text_poke(), which
is known to be safe (apply_alternatives is also changed to make a
single patch).

AK: fix compilation on x86-64 (bad rusty!)
AK: fix boot on x86-64 (sigh)
AK: merged with other patches

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/kernel/alternative.c | 33 ++++++++++++++++----------
arch/i386/kernel/paravirt.c | 52 ++++++++++++++++++++---------------------
arch/i386/kernel/vmi.c | 35 ++++++++++++++++-----------
arch/i386/xen/enlighten.c | 12 +++++----
drivers/lguest/lguest.c | 9 +++----
include/asm-i386/paravirt.h | 16 +++++++-----
6 files changed, 90 insertions(+), 67 deletions(-)

Index: linux/arch/i386/kernel/alternative.c
===================================================================
--- linux.orig/arch/i386/kernel/alternative.c
+++ linux/arch/i386/kernel/alternative.c
@@ -11,6 +11,8 @@
#include <asm/mce.h>
#include <asm/nmi.h>

+#define MAX_PATCH_LEN (255-1)
+
#ifdef CONFIG_HOTPLUG_CPU
static int smp_alt_once;

@@ -148,7 +150,8 @@ static unsigned char** find_nop_table(vo

#endif /* CONFIG_X86_64 */

-static void nop_out(void *insns, unsigned int len)
+/* Use this to add nops to a buffer, then text_poke the whole buffer. */
+static void add_nops(void *insns, unsigned int len)
{
unsigned char **noptable = find_nop_table();

@@ -156,7 +159,7 @@ static void nop_out(void *insns, unsigne
unsigned int noplen = len;
if (noplen > ASM_NOP_MAX)
noplen = ASM_NOP_MAX;
- text_poke(insns, noptable[noplen], noplen);
+ memcpy(insns, noptable[noplen], noplen);
insns += noplen;
len -= noplen;
}
@@ -174,15 +177,15 @@ extern u8 *__smp_locks[], *__smp_locks_e
void apply_alternatives(struct alt_instr *start, struct alt_instr *end)
{
struct alt_instr *a;
- u8 *instr;
- int diff;
+ char insnbuf[MAX_PATCH_LEN];

DPRINTK("%s: alt table %p -> %p\n", __FUNCTION__, start, end);
for (a = start; a < end; a++) {
+ u8 *instr = a->instr;
BUG_ON(a->replacementlen > a->instrlen);
+ BUG_ON(a->instrlen > sizeof(insnbuf));
if (!boot_cpu_has(a->cpuid))
continue;
- instr = a->instr;
#ifdef CONFIG_X86_64
/* vsyscall code is not mapped yet. resolve it manually. */
if (instr >= (u8 *)VSYSCALL_START && instr < (u8*)VSYSCALL_END) {
@@ -191,9 +194,10 @@ void apply_alternatives(struct alt_instr
__FUNCTION__, a->instr, instr);
}
#endif
- memcpy(instr, a->replacement, a->replacementlen);
- diff = a->instrlen - a->replacementlen;
- nop_out(instr + a->replacementlen, diff);
+ memcpy(insnbuf, a->replacement, a->replacementlen);
+ add_nops(insnbuf + a->replacementlen,
+ a->instrlen - a->replacementlen);
+ text_poke(instr, insnbuf, a->instrlen);
}
}

@@ -215,16 +219,18 @@ static void alternatives_smp_lock(u8 **s
static void alternatives_smp_unlock(u8 **start, u8 **end, u8 *text, u8 *text_end)
{
u8 **ptr;
+ char insn[1];

if (noreplace_smp)
return;

+ add_nops(insn, 1);
for (ptr = start; ptr < end; ptr++) {
if (*ptr < text)
continue;
if (*ptr > text_end)
continue;
- nop_out(*ptr, 1);
+ text_poke(*ptr, insn, 1);
};
}

@@ -351,6 +357,7 @@ void apply_paravirt(struct paravirt_patc
struct paravirt_patch_site *end)
{
struct paravirt_patch_site *p;
+ char insnbuf[MAX_PATCH_LEN];

if (noreplace_paravirt)
return;
@@ -358,13 +365,15 @@ void apply_paravirt(struct paravirt_patc
for (p = start; p < end; p++) {
unsigned int used;

- used = paravirt_ops.patch(p->instrtype, p->clobbers, p->instr,
- p->len);
+ BUG_ON(p->len > MAX_PATCH_LEN);
+ used = paravirt_ops.patch(p->instrtype, p->clobbers, insnbuf,
+ (unsigned long)p->instr, p->len);

BUG_ON(used > p->len);

/* Pad the rest with nops */
- nop_out(p->instr + used, p->len - used);
+ add_nops(insnbuf + used, p->len - used);
+ text_poke(p->instr, insnbuf, p->len);
}
}
extern struct paravirt_patch_site __start_parainstructions[],
Index: linux/arch/i386/kernel/paravirt.c
===================================================================
--- linux.orig/arch/i386/kernel/paravirt.c
+++ linux/arch/i386/kernel/paravirt.c
@@ -69,7 +69,8 @@ DEF_NATIVE(read_tsc, "rdtsc");

DEF_NATIVE(ud2a, "ud2a");

-static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
+static unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+ unsigned long addr, unsigned len)
{
const unsigned char *start, *end;
unsigned ret;
@@ -90,7 +91,7 @@ static unsigned native_patch(u8 type, u1
#undef SITE

patch_site:
- ret = paravirt_patch_insns(insns, len, start, end);
+ ret = paravirt_patch_insns(ibuf, len, start, end);
break;

case PARAVIRT_PATCH(make_pgd):
@@ -107,7 +108,7 @@ static unsigned native_patch(u8 type, u1
break;

default:
- ret = paravirt_patch_default(type, clobbers, insns, len);
+ ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
break;
}

@@ -129,68 +130,67 @@ struct branch {
u32 delta;
} __attribute__((packed));

-unsigned paravirt_patch_call(void *target, u16 tgt_clobbers,
- void *site, u16 site_clobbers,
+unsigned paravirt_patch_call(void *insnbuf,
+ const void *target, u16 tgt_clobbers,
+ unsigned long addr, u16 site_clobbers,
unsigned len)
{
- unsigned char *call = site;
- unsigned long delta = (unsigned long)target - (unsigned long)(call+5);
- struct branch b;
+ struct branch *b = insnbuf;
+ unsigned long delta = (unsigned long)target - (addr+5);

if (tgt_clobbers & ~site_clobbers)
return len; /* target would clobber too much for this site */
if (len < 5)
return len; /* call too long for patch site */

- b.opcode = 0xe8; /* call */
- b.delta = delta;
- BUILD_BUG_ON(sizeof(b) != 5);
- text_poke(call, (unsigned char *)&b, 5);
+ b->opcode = 0xe8; /* call */
+ b->delta = delta;
+ BUILD_BUG_ON(sizeof(*b) != 5);

return 5;
}

-unsigned paravirt_patch_jmp(void *target, void *site, unsigned len)
+unsigned paravirt_patch_jmp(const void *target, void *insnbuf,
+ unsigned long addr, unsigned len)
{
- unsigned char *jmp = site;
- unsigned long delta = (unsigned long)target - (unsigned long)(jmp+5);
- struct branch b;
+ struct branch *b = insnbuf;
+ unsigned long delta = (unsigned long)target - (addr+5);

if (len < 5)
return len; /* call too long for patch site */

- b.opcode = 0xe9; /* jmp */
- b.delta = delta;
- text_poke(jmp, (unsigned char *)&b, 5);
+ b->opcode = 0xe9; /* jmp */
+ b->delta = delta;

return 5;
}

-unsigned paravirt_patch_default(u8 type, u16 clobbers, void *site, unsigned len)
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
+ unsigned long addr, unsigned len)
{
void *opfunc = *((void **)&paravirt_ops + type);
unsigned ret;

if (opfunc == NULL)
/* If there's no function, patch it with a ud2a (BUG) */
- ret = paravirt_patch_insns(site, len, start_ud2a, end_ud2a);
+ ret = paravirt_patch_insns(insnbuf, len, start_ud2a, end_ud2a);
else if (opfunc == paravirt_nop)
/* If the operation is a nop, then nop the callsite */
ret = paravirt_patch_nop();
else if (type == PARAVIRT_PATCH(iret) ||
type == PARAVIRT_PATCH(irq_enable_sysexit))
/* If operation requires a jmp, then jmp */
- ret = paravirt_patch_jmp(opfunc, site, len);
+ ret = paravirt_patch_jmp(opfunc, insnbuf, addr, len);
else
/* Otherwise call the function; assume target could
clobber any caller-save reg */
- ret = paravirt_patch_call(opfunc, CLBR_ANY,
- site, clobbers, len);
+ ret = paravirt_patch_call(insnbuf, opfunc, CLBR_ANY,
+ addr, clobbers, len);

return ret;
}

-unsigned paravirt_patch_insns(void *site, unsigned len,
+unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
const char *start, const char *end)
{
unsigned insn_len = end - start;
@@ -198,7 +198,7 @@ unsigned paravirt_patch_insns(void *site
if (insn_len > len || start == NULL)
insn_len = len;
else
- memcpy(site, start, insn_len);
+ memcpy(insnbuf, start, insn_len);

return insn_len;
}
Index: linux/arch/i386/kernel/vmi.c
===================================================================
--- linux.orig/arch/i386/kernel/vmi.c
+++ linux/arch/i386/kernel/vmi.c
@@ -87,12 +87,14 @@ struct vmi_timer_ops vmi_timer_ops;
#define IRQ_PATCH_INT_MASK 0
#define IRQ_PATCH_DISABLE 5

-static inline void patch_offset(unsigned char *eip, unsigned char *dest)
+static inline void patch_offset(void *insnbuf,
+ unsigned long eip, unsigned long dest)
{
- *(unsigned long *)(eip+1) = dest-eip-5;
+ *(unsigned long *)(insnbuf+1) = dest-eip-5;
}

-static unsigned patch_internal(int call, unsigned len, void *insns)
+static unsigned patch_internal(int call, unsigned len, void *insnbuf,
+ unsigned long eip)
{
u64 reloc;
struct vmi_relocation_info *const rel = (struct vmi_relocation_info *)&reloc;
@@ -100,14 +102,14 @@ static unsigned patch_internal(int call,
switch(rel->type) {
case VMI_RELOCATION_CALL_REL:
BUG_ON(len < 5);
- *(char *)insns = MNEM_CALL;
- patch_offset(insns, rel->eip);
+ *(char *)insnbuf = MNEM_CALL;
+ patch_offset(insnbuf, eip, (unsigned long)rel->eip);
return 5;

case VMI_RELOCATION_JUMP_REL:
BUG_ON(len < 5);
- *(char *)insns = MNEM_JMP;
- patch_offset(insns, rel->eip);
+ *(char *)insnbuf = MNEM_JMP;
+ patch_offset(insnbuf, eip, (unsigned long)rel->eip);
return 5;

case VMI_RELOCATION_NOP:
@@ -128,21 +130,26 @@ static unsigned patch_internal(int call,
* Apply patch if appropriate, return length of new instruction
* sequence. The callee does nop padding for us.
*/
-static unsigned vmi_patch(u8 type, u16 clobbers, void *insns, unsigned len)
+static unsigned vmi_patch(u8 type, u16 clobbers, void *insns,
+ unsigned long eip, unsigned len)
{
switch (type) {
case PARAVIRT_PATCH(irq_disable):
- return patch_internal(VMI_CALL_DisableInterrupts, len, insns);
+ return patch_internal(VMI_CALL_DisableInterrupts, len,
+ insns, eip);
case PARAVIRT_PATCH(irq_enable):
- return patch_internal(VMI_CALL_EnableInterrupts, len, insns);
+ return patch_internal(VMI_CALL_EnableInterrupts, len,
+ insns, eip);
case PARAVIRT_PATCH(restore_fl):
- return patch_internal(VMI_CALL_SetInterruptMask, len, insns);
+ return patch_internal(VMI_CALL_SetInterruptMask, len,
+ insns, eip);
case PARAVIRT_PATCH(save_fl):
- return patch_internal(VMI_CALL_GetInterruptMask, len, insns);
+ return patch_internal(VMI_CALL_GetInterruptMask, len,
+ insns, eip);
case PARAVIRT_PATCH(iret):
- return patch_internal(VMI_CALL_IRET, len, insns);
+ return patch_internal(VMI_CALL_IRET, len, insns, eip);
case PARAVIRT_PATCH(irq_enable_sysexit):
- return patch_internal(VMI_CALL_SYSEXIT, len, insns);
+ return patch_internal(VMI_CALL_SYSEXIT, len, insns, eip);
default:
break;
}
Index: linux/arch/i386/xen/enlighten.c
===================================================================
--- linux.orig/arch/i386/xen/enlighten.c
+++ linux/arch/i386/xen/enlighten.c
@@ -842,7 +842,8 @@ void __init xen_setup_vcpu_info_placemen
}
}

-static unsigned xen_patch(u8 type, u16 clobbers, void *insns, unsigned len)
+static unsigned xen_patch(u8 type, u16 clobbers, void *insnbuf,
+ unsigned long addr, unsigned len)
{
char *start, *end, *reloc;
unsigned ret;
@@ -869,7 +870,7 @@ static unsigned xen_patch(u8 type, u16 c
if (start == NULL || (end-start) > len)
goto default_patch;

- ret = paravirt_patch_insns(insns, len, start, end);
+ ret = paravirt_patch_insns(insnbuf, len, start, end);

/* Note: because reloc is assigned from something that
appears to be an array, gcc assumes it's non-null,
@@ -877,8 +878,8 @@ static unsigned xen_patch(u8 type, u16 c
end. */
if (reloc > start && reloc < end) {
int reloc_off = reloc - start;
- long *relocp = (long *)(insns + reloc_off);
- long delta = start - (char *)insns;
+ long *relocp = (long *)(insnbuf + reloc_off);
+ long delta = start - (char *)addr;

*relocp += delta;
}
@@ -886,7 +887,8 @@ static unsigned xen_patch(u8 type, u16 c

default_patch:
default:
- ret = paravirt_patch_default(type, clobbers, insns, len);
+ ret = paravirt_patch_default(type, clobbers, insnbuf,
+ addr, len);
break;
}

Index: linux/drivers/lguest/lguest.c
===================================================================
--- linux.orig/drivers/lguest/lguest.c
+++ linux/drivers/lguest/lguest.c
@@ -933,23 +933,24 @@ static const struct lguest_insns
/* Now our patch routine is fairly simple (based on the native one in
* paravirt.c). If we have a replacement, we copy it in and return how much of
* the available space we used. */
-static unsigned lguest_patch(u8 type, u16 clobber, void *insns, unsigned len)
+static unsigned lguest_patch(u8 type, u16 clobber, void *ibuf,
+ unsigned long addr, unsigned len)
{
unsigned int insn_len;

/* Don't do anything special if we don't have a replacement */
if (type >= ARRAY_SIZE(lguest_insns) || !lguest_insns[type].start)
- return paravirt_patch_default(type, clobber, insns, len);
+ return paravirt_patch_default(type, clobber, ibuf, addr, len);

insn_len = lguest_insns[type].end - lguest_insns[type].start;

/* Similarly if we can't fit replacement (shouldn't happen, but let's
* be thorough). */
if (len < insn_len)
- return paravirt_patch_default(type, clobber, insns, len);
+ return paravirt_patch_default(type, clobber, ibuf, addr, len);

/* Copy in our instructions. */
- memcpy(insns, lguest_insns[type].start, insn_len);
+ memcpy(ibuf, lguest_insns[type].start, insn_len);
return insn_len;
}

Index: linux/include/asm-i386/paravirt.h
===================================================================
--- linux.orig/include/asm-i386/paravirt.h
+++ linux/include/asm-i386/paravirt.h
@@ -47,7 +47,8 @@ struct paravirt_ops
* The patch function should return the number of bytes of code
* generated, as we nop pad the rest in generic code.
*/
- unsigned (*patch)(u8 type, u16 clobber, void *firstinsn, unsigned len);
+ unsigned (*patch)(u8 type, u16 clobber, void *insnbuf,
+ unsigned long addr, unsigned len);

/* Basic arch-specific setup */
void (*arch_setup)(void);
@@ -253,13 +254,16 @@ extern struct paravirt_ops paravirt_ops;

unsigned paravirt_patch_nop(void);
unsigned paravirt_patch_ignore(unsigned len);
-unsigned paravirt_patch_call(void *target, u16 tgt_clobbers,
- void *site, u16 site_clobbers,
+unsigned paravirt_patch_call(void *insnbuf,
+ const void *target, u16 tgt_clobbers,
+ unsigned long addr, u16 site_clobbers,
unsigned len);
-unsigned paravirt_patch_jmp(void *target, void *site, unsigned len);
-unsigned paravirt_patch_default(u8 type, u16 clobbers, void *site, unsigned len);
+unsigned paravirt_patch_jmp(const void *target, void *insnbuf,
+ unsigned long addr, unsigned len);
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
+ unsigned long addr, unsigned len);

-unsigned paravirt_patch_insns(void *site, unsigned len,
+unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
const char *start, const char *end);

int paravirt_disable_iospace(void);

2007-08-09 12:44:56

by Andi Kleen

Subject: [PATCH] [6/12] x86_64: Early segment setup for VT


From: Zachary Amsden <[email protected]>

VT is very picky about when it can enter execution.
Get all segments setup and get LDT and TR into valid state to allow
VT execution under VMware and KVM (untested).

This makes the boot decompression run under VT, which makes it several
orders of magnitude faster on 64-bit Intel hardware.

Before, I was seeing times up to a minute or more to decompress a 1.3MB kernel
on a very fast box.

Signed-off-by: Zachary Amsden <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/boot/compressed/head.S | 7 +++++++
1 file changed, 7 insertions(+)

Index: linux/arch/x86_64/boot/compressed/head.S
===================================================================
--- linux.orig/arch/x86_64/boot/compressed/head.S
+++ linux/arch/x86_64/boot/compressed/head.S
@@ -195,6 +195,11 @@ ENTRY(startup_64)
movl %eax, %ds
movl %eax, %es
movl %eax, %ss
+ movl %eax, %fs
+ movl %eax, %gs
+ lldt %ax
+ movl $0x20, %eax
+ ltr %ax

/* Compute the decompressed kernel start address. It is where
* we were loaded at aligned to a 2M boundary. %rbp contains the
@@ -295,6 +300,8 @@ gdt:
.quad 0x0000000000000000 /* NULL descriptor */
.quad 0x00af9a000000ffff /* __KERNEL_CS */
.quad 0x00cf92000000ffff /* __KERNEL_DS */
+ .quad 0x0080890000000000 /* TS descriptor */
+ .quad 0x0000000000000000 /* TS continued */
gdt_end:
.bss
/* Stack for uncompression */

2007-08-09 12:45:37

by Andi Kleen

Subject: [PATCH] [7/12] x86_64: really stop MCEs during code patching


From: Adrian Bunk <[email protected]>
It's CONFIG_X86_MCE, not CONFIG_MCE.

Signed-off-by: Adrian Bunk <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

arch/i386/kernel/alternative.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/i386/kernel/alternative.c
===================================================================
--- linux.orig/arch/i386/kernel/alternative.c
+++ linux/arch/i386/kernel/alternative.c
@@ -388,7 +388,7 @@ void __init alternative_instructions(voi
that might execute the to be patched code.
Other CPUs are not running. */
stop_nmi();
-#ifdef CONFIG_MCE
+#ifdef CONFIG_X86_MCE
stop_mce();
#endif

@@ -426,7 +426,7 @@ void __init alternative_instructions(voi
local_irq_restore(flags);

restart_nmi();
-#ifdef CONFIG_MCE
+#ifdef CONFIG_X86_MCE
restart_mce();
#endif
}

2007-08-09 12:46:04

by Andi Kleen

Subject: [PATCH] [8/12] x86_64: Use global flag to disable broken local apic timer on AMD CPUs.


The Averatec 2370 and some other Turion laptop BIOSes seem to program the
ENABLE_C1E MSR inconsistently between cores. This confuses the lapic
use heuristics because when C1E is enabled anywhere it seems to affect
the complete chip.

Use a global flag instead of a per-CPU flag to handle this.
If any CPU has C1E enabled, disable lapic use.

Thanks to Cal Peake for debugging.

Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/kernel/apic.c | 8 +++-----
arch/i386/kernel/cpu/amd.c | 7 ++++++-
include/asm-i386/apic.h | 2 ++
include/asm-i386/cpufeature.h | 2 +-
4 files changed, 12 insertions(+), 7 deletions(-)

Index: linux/arch/i386/kernel/apic.c
===================================================================
--- linux.orig/arch/i386/kernel/apic.c
+++ linux/arch/i386/kernel/apic.c
@@ -61,8 +61,9 @@ static int enable_local_apic __initdata

/* Local APIC timer verification ok */
static int local_apic_timer_verify_ok;
-/* Disable local APIC timer from the kernel commandline or via dmi quirk */
-static int local_apic_timer_disabled;
+/* Disable local APIC timer from the kernel commandline or via dmi quirk
+ or using CPU MSR check */
+int local_apic_timer_disabled;
/* Local APIC timer works in C2 */
int local_apic_timer_c2_ok;
EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok);
@@ -370,9 +371,6 @@ void __init setup_boot_APIC_clock(void)
long delta, deltapm;
int pm_referenced = 0;

- if (boot_cpu_has(X86_FEATURE_LAPIC_TIMER_BROKEN))
- local_apic_timer_disabled = 1;
-
/*
* The local apic timer can be disabled via the kernel
* commandline or from the test above. Register the lapic
Index: linux/arch/i386/kernel/cpu/amd.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/amd.c
+++ linux/arch/i386/kernel/cpu/amd.c
@@ -3,6 +3,7 @@
#include <linux/mm.h>
#include <asm/io.h>
#include <asm/processor.h>
+#include <asm/apic.h>

#include "cpu.h"

@@ -22,6 +23,7 @@
extern void vide(void);
__asm__(".align 4\nvide: ret");

+#ifdef CONFIG_X86_LOCAL_APIC
#define ENABLE_C1E_MASK 0x18000000
#define CPUID_PROCESSOR_SIGNATURE 1
#define CPUID_XFAM 0x0ff00000
@@ -52,6 +54,7 @@ static __cpuinit int amd_apic_timer_brok
}
return 0;
}
+#endif

int force_mwait __cpuinitdata;

@@ -282,8 +285,10 @@ static void __cpuinit init_amd(struct cp
num_cache_leaves = 3;
}

+#ifdef CONFIG_X86_LOCAL_APIC
if (amd_apic_timer_broken())
- set_bit(X86_FEATURE_LAPIC_TIMER_BROKEN, c->x86_capability);
+ local_apic_timer_disabled = 1;
+#endif

if (c->x86 == 0x10 && !force_mwait)
clear_bit(X86_FEATURE_MWAIT, c->x86_capability);
Index: linux/include/asm-i386/apic.h
===================================================================
--- linux.orig/include/asm-i386/apic.h
+++ linux/include/asm-i386/apic.h
@@ -116,6 +116,8 @@ extern void enable_NMI_through_LVT0 (voi
extern int timer_over_8254;
extern int local_apic_timer_c2_ok;

+extern int local_apic_timer_disabled;
+
#else /* !CONFIG_X86_LOCAL_APIC */
static inline void lapic_shutdown(void) { }

Index: linux/include/asm-i386/cpufeature.h
===================================================================
--- linux.orig/include/asm-i386/cpufeature.h
+++ linux/include/asm-i386/cpufeature.h
@@ -79,7 +79,7 @@
#define X86_FEATURE_ARCH_PERFMON (3*32+11) /* Intel Architectural PerfMon */
#define X86_FEATURE_PEBS (3*32+12) /* Precise-Event Based Sampling */
#define X86_FEATURE_BTS (3*32+13) /* Branch Trace Store */
-#define X86_FEATURE_LAPIC_TIMER_BROKEN (3*32+ 14) /* lapic timer broken in C1 */
+/* 14 free */
#define X86_FEATURE_SYNC_RDTSC (3*32+15) /* RDTSC synchronizes the CPU */
#define X86_FEATURE_REP_GOOD (3*32+16) /* rep microcode works well on this CPU */

2007-08-09 12:46:29

by Andi Kleen

Subject: [PATCH] [9/12] x86_64: Avoid NMI Watchdog and/or long wait in setup_APIC_timer


From: Aaron Durbin <[email protected]>
In setup_APIC_timer with the HPET in use, a condition can arise while
waiting for the next irq slice to expire on the HPET which will either
cause an NMI watchdog to fire or a 3 minute busy loop if the NMI
watchdog is disabled.

The HPET comparator and the counter keep incrementing during its normal
operation. When a comparison event fires the comparator will increment
by the designated period. If the HPET trigger occurs right after
the 'int trigger = hpet_readl(HPET_T0_CMP);' line, we will spin
for up to 3 minutes (with a clock of 25MHz) waiting for the HPET
counter to wrap around. However, when the NMI watchdog is enabled the
NMI watchdog will detect a lockup and reboot the machine. This
scenario can be exacerbated by the presence of an SMI, which widens
the window of opportunity for the condition to occur.

The fix is to wait for the comparator to change, which signals the
end of the tick slice.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86_64/kernel/apic.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux/arch/x86_64/kernel/apic.c
===================================================================
--- linux.orig/arch/x86_64/kernel/apic.c
+++ linux/arch/x86_64/kernel/apic.c
@@ -791,10 +791,12 @@ static void setup_APIC_timer(unsigned in

/* wait for irq slice */
if (hpet_address && hpet_use_timer) {
+ /*
+ * Wait for the comparator value to change which signals that
+ * the tick slice has expired.
+ */
int trigger = hpet_readl(HPET_T0_CMP);
- while (hpet_readl(HPET_COUNTER) >= trigger)
- /* do nothing */ ;
- while (hpet_readl(HPET_COUNTER) < trigger)
+ while (trigger == hpet_readl(HPET_T0_CMP))
/* do nothing */ ;
} else {
int c1, c2;

2007-08-09 12:46:48

by Andi Kleen

Subject: [PATCH] [10/12] x86_64: Add warning in Documentation that zero-page is not a stable ABI


Some people writing boot loaders seem to falsely believe the 32-bit zero page is a
stable interface for out-of-tree code, like the real mode boot protocol is. Add a comment
clarifying that this is not true.

Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/i386/zero-page.txt | 10 ++++++++++
1 file changed, 10 insertions(+)

Index: linux/Documentation/i386/zero-page.txt
===================================================================
--- linux.orig/Documentation/i386/zero-page.txt
+++ linux/Documentation/i386/zero-page.txt
@@ -1,3 +1,13 @@
+---------------------------------------------------------------------------
+!!!!!!!!!!!!!!!WARNING!!!!!!!!
+The zero page is a kernel internal data structure, not a stable ABI. It might change
+without warning and the kernel has no way to detect old versions of it.
+If you're writing some external code like a boot loader you should only use
+the stable versioned real mode boot protocol described in boot.txt. Otherwise the kernel
+might break you at any time.
+!!!!!!!!!!!!!WARNING!!!!!!!!!!!
+----------------------------------------------------------------------------
+
Summary of boot_params layout (kernel point of view)
( collected by Hans Lermen and Martin Mares )

2007-08-09 12:47:06

by Andi Kleen

Subject: [PATCH] [11/12] x86_64: vdso.lds in arch/x86_64/vdso/.gitignore


From: Pete Zaitcev <[email protected]>
Create arch/x86_64/vdso/.gitignore and put vdso.lds into it.

Signed-off-by: Pete Zaitcev <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---

Utterly trivial but for some reason missing. Very odd. I must be
misunderstanding how this works, but just in case here's the patch.

---
arch/x86_64/vdso/.gitignore | 1 +
1 file changed, 1 insertion(+)

Index: linux/arch/x86_64/vdso/.gitignore
===================================================================
--- /dev/null
+++ linux/arch/x86_64/vdso/.gitignore
@@ -0,0 +1 @@
+vdso.lds

2007-08-09 12:47:36

by Andi Kleen

Subject: [PATCH] [12/12] x86_64: Fix start_kernel warning


Fix

WARNING: vmlinux.o(.text+0x183): Section mismatch: reference to .init.text:start_kernel (between 'is386' and 'check_x87')

Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/kernel/head.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/i386/kernel/head.S
===================================================================
--- linux.orig/arch/i386/kernel/head.S
+++ linux/arch/i386/kernel/head.S
@@ -163,7 +163,7 @@ page_pde_offset = (__PAGE_OFFSET >> 20);
*/

#ifdef CONFIG_HOTPLUG_CPU
-.section .text,"ax",@progbits
+.section .text.init,"ax",@progbits
#else
.section .init.text,"ax",@progbits
#endif

2007-08-09 12:53:22

by Jan Beulich

Subject: Re: [patches] [PATCH] [4/12] x86_64: Disable CLFLUSH support again

The 64-bit conditional seems wrong to me, I'd assume you mean

if (1 || !cpu_has_clflush)

Jan

>>> Andi Kleen <[email protected]> 09.08.07 14:41 >>>

It turns out CLFLUSH support is still not complete; we
flush the wrong pages. Again disable it for the release.
Noticed by Jan Beulich.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/i386/mm/pageattr.c | 2 +-
arch/x86_64/mm/pageattr.c | 3 ++-
2 files changed, 3 insertions(+), 2 deletions(-)

Index: linux/arch/i386/mm/pageattr.c
===================================================================
--- linux.orig/arch/i386/mm/pageattr.c
+++ linux/arch/i386/mm/pageattr.c
@@ -82,7 +82,7 @@ static void flush_kernel_map(void *arg)
struct page *p;

/* High level code is not ready for clflush yet */
- if (cpu_has_clflush) {
+ if (0 && cpu_has_clflush) {
list_for_each_entry (p, lh, lru)
cache_flush_page(p);
} else if (boot_cpu_data.x86_model >= 4)
Index: linux/arch/x86_64/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86_64/mm/pageattr.c
+++ linux/arch/x86_64/mm/pageattr.c
@@ -75,7 +75,8 @@ static void flush_kernel_map(void *arg)

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- if (!cpu_has_clflush)
+ /* clflush is still broken. Disable for now. */
+ if (0 && !cpu_has_clflush)
asm volatile("wbinvd" ::: "memory");
else list_for_each_entry(pg, l, lru) {
void *adr = page_address(pg);
_______________________________________________
patches mailing list
[email protected]
https://www.x86-64.org/mailman/listinfo/patches

2007-08-09 13:06:49

by Muli Ben-Yehuda

Subject: Re: [PATCH] [0/12] x86 Late merge bug fixes for 2.6.23

On Thu, Aug 09, 2007 at 02:41:27PM +0200, Andi Kleen wrote:
>
> Various simple bug fixes, no new features. Please review. I plan to
> send them to Linus later.
>
> I don't have anything else critical pending. If you believe a x86
> patch is critical and really needs to make .23 please send email.

We also need this fix for the current sysdata regression:

Subject: finish i386 and x86-64 sysdata conversion

This patch finishes the i386 and x86-64 ->sysdata conversion and
hopefully also fixes Riku's and Andy's observed bugs. It is based on
Yinghai Lu's and Andy Whitcroft's patches (thanks!) with some changes:

- introduce pci_scan_bus_with_sysdata() and use it instead of
pci_scan_bus() where appropriate. pci_scan_bus_with_sysdata() will
allocate the sysdata structure and then call pci_scan_bus().
- always allocate pci_sysdata dynamically. The whole point of this
sysdata work is to make it easy to do root-bus specific things
(e.g., support PCI domains and IOMMU's). I dislike using a default
struct pci_sysdata in some places and a dynamically allocated
pci_sysdata elsewhere - the potential for someone inadvertently
changing the default structure is too high.
- this patch only makes the minimal changes necessary, i.e., the NUMA node is
always initialized to -1. Patches to do the right thing with regards
to the NUMA node can build on top of this (either add a 'node'
parameter to pci_scan_bus_with_sysdata() or just update the node
when it becomes known).

The patch was compile tested with various configurations (e.g., NUMAQ,
VISWS) and run-time tested on i386 and x86-64. Unfortunately none of
my machines exhibited the bugs so caveat emptor.

Andy, could you please see if this fixes the NUMA issues you've seen?
Riku, does this fix "pci=noacpi" on your laptop?

Comments appreciated. If this looks ok it should go into 2.6.23.

Signed-off-by: Muli Ben-Yehuda <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Chuck Ebbert <[email protected]>
Cc: [email protected]
Cc: Andy Whitcroft <[email protected]>
Cc: Jeff Garzik <[email protected]>
---
arch/i386/pci/common.c | 23 +++++++++++++++++++++++
arch/i386/pci/fixup.c | 6 +++---
arch/i386/pci/irq.c | 5 +++--
arch/i386/pci/legacy.c | 2 +-
arch/i386/pci/numa.c | 15 +++++++++------
arch/i386/pci/visws.c | 4 ++--
include/asm-i386/pci.h | 3 +++
include/asm-x86_64/pci.h | 2 ++
8 files changed, 46 insertions(+), 14 deletions(-)

diff --git a/arch/i386/pci/common.c b/arch/i386/pci/common.c
index 85503de..ebc6f3c 100644
--- a/arch/i386/pci/common.c
+++ b/arch/i386/pci/common.c
@@ -455,3 +455,26 @@ void pcibios_disable_device (struct pci_dev *dev)
if (!dev->msi_enabled && pcibios_disable_irq)
pcibios_disable_irq(dev);
}
+
+struct pci_bus *pci_scan_bus_with_sysdata(int busno)
+{
+ struct pci_bus *bus = NULL;
+ struct pci_sysdata *sd;
+
+ /*
+ * Allocate per-root-bus (not per bus) arch-specific data.
+ * TODO: leak; this memory is never freed.
+ * It's arguable whether it's worth the trouble to care.
+ */
+ sd = kzalloc(sizeof(*sd), GFP_KERNEL);
+ if (!sd) {
+ printk(KERN_ERR "PCI: OOM, skipping PCI bus %02x\n", busno);
+ return NULL;
+ }
+ sd->node = -1;
+ bus = pci_scan_bus(busno, &pci_root_ops, sd);
+ if (!bus)
+ kfree(sd);
+
+ return bus;
+}
diff --git a/arch/i386/pci/fixup.c b/arch/i386/pci/fixup.c
index e7306db..c82cbf4 100644
--- a/arch/i386/pci/fixup.c
+++ b/arch/i386/pci/fixup.c
@@ -25,9 +25,9 @@ static void __devinit pci_fixup_i450nx(struct pci_dev *d)
pci_read_config_byte(d, reg++, &subb);
DBG("i450NX PXB %d: %02x/%02x/%02x\n", pxb, busno, suba, subb);
if (busno)
- pci_scan_bus(busno, &pci_root_ops, NULL); /* Bus A */
+ pci_scan_bus_with_sysdata(busno); /* Bus A */
if (suba < subb)
- pci_scan_bus(suba+1, &pci_root_ops, NULL); /* Bus B */
+ pci_scan_bus_with_sysdata(suba+1); /* Bus B */
}
pcibios_last_bus = -1;
}
@@ -42,7 +42,7 @@ static void __devinit pci_fixup_i450gx(struct pci_dev *d)
u8 busno;
pci_read_config_byte(d, 0x4a, &busno);
printk(KERN_INFO "PCI: i440KX/GX host bridge %s: secondary bus %02x\n", pci_name(d), busno);
- pci_scan_bus(busno, &pci_root_ops, NULL);
+ pci_scan_bus_with_sysdata(busno);
pcibios_last_bus = -1;
}
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82454GX, pci_fixup_i450gx);
diff --git a/arch/i386/pci/irq.c b/arch/i386/pci/irq.c
index f2cb942..665db06 100644
--- a/arch/i386/pci/irq.c
+++ b/arch/i386/pci/irq.c
@@ -138,8 +138,9 @@ static void __init pirq_peer_trick(void)
for(i = 1; i < 256; i++) {
if (!busmap[i] || pci_find_bus(0, i))
continue;
- if (pci_scan_bus(i, &pci_root_ops, NULL))
- printk(KERN_INFO "PCI: Discovered primary peer bus %02x [IRQ]\n", i);
+ if (pci_scan_bus_with_sysdata(i))
+ printk(KERN_INFO "PCI: Discovered primary peer "
+ "bus %02x [IRQ]\n", i);
}
pcibios_last_bus = -1;
}
diff --git a/arch/i386/pci/legacy.c b/arch/i386/pci/legacy.c
index 149a958..5565d70 100644
--- a/arch/i386/pci/legacy.c
+++ b/arch/i386/pci/legacy.c
@@ -26,7 +26,7 @@ static void __devinit pcibios_fixup_peer_bridges(void)
l != 0x0000 && l != 0xffff) {
DBG("Found device at %02x:%02x [%04x]\n", n, devfn, l);
printk(KERN_INFO "PCI: Discovered peer bus %02x\n", n);
- pci_scan_bus(n, &pci_root_ops, NULL);
+ pci_scan_bus_with_sysdata(n);
break;
}
}
diff --git a/arch/i386/pci/numa.c b/arch/i386/pci/numa.c
index adbe17a..f5f165f 100644
--- a/arch/i386/pci/numa.c
+++ b/arch/i386/pci/numa.c
@@ -96,10 +96,14 @@ static void __devinit pci_fixup_i450nx(struct pci_dev *d)
pci_read_config_byte(d, reg++, &suba);
pci_read_config_byte(d, reg++, &subb);
DBG("i450NX PXB %d: %02x/%02x/%02x\n", pxb, busno, suba, subb);
- if (busno)
- pci_scan_bus(QUADLOCAL2BUS(quad,busno), &pci_root_ops, NULL); /* Bus A */
- if (suba < subb)
- pci_scan_bus(QUADLOCAL2BUS(quad,suba+1), &pci_root_ops, NULL); /* Bus B */
+ if (busno) {
+ /* Bus A */
+ pci_scan_bus_with_sysdata(QUADLOCAL2BUS(quad, busno));
+ }
+ if (suba < subb) {
+ /* Bus B */
+ pci_scan_bus_with_sysdata(QUADLOCAL2BUS(quad, suba+1));
+ }
}
pcibios_last_bus = -1;
}
@@ -123,8 +127,7 @@ static int __init pci_numa_init(void)
continue;
printk("Scanning PCI bus %d for quad %d\n",
QUADLOCAL2BUS(quad,0), quad);
- pci_scan_bus(QUADLOCAL2BUS(quad,0),
- &pci_root_ops, NULL);
+ pci_scan_bus_with_sysdata(QUADLOCAL2BUS(quad, 0));
}
return 0;
}
diff --git a/arch/i386/pci/visws.c b/arch/i386/pci/visws.c
index f1b486d..8ecb1c7 100644
--- a/arch/i386/pci/visws.c
+++ b/arch/i386/pci/visws.c
@@ -101,8 +101,8 @@ static int __init pcibios_init(void)
"bridge B (PIIX4) bus: %u\n", pci_bus1, pci_bus0);

raw_pci_ops = &pci_direct_conf1;
- pci_scan_bus(pci_bus0, &pci_root_ops, NULL);
- pci_scan_bus(pci_bus1, &pci_root_ops, NULL);
+ pci_scan_bus_with_sysdata(pci_bus0);
+ pci_scan_bus_with_sysdata(pci_bus1);
pci_fixup_irqs(visws_swizzle, visws_map_irq);
pcibios_resource_survey();
return 0;
diff --git a/include/asm-i386/pci.h b/include/asm-i386/pci.h
index d790343..4fcacc7 100644
--- a/include/asm-i386/pci.h
+++ b/include/asm-i386/pci.h
@@ -8,6 +8,9 @@ struct pci_sysdata {
int node; /* NUMA node */
};

+/* scan a bus after allocating a pci_sysdata for it */
+extern struct pci_bus *pci_scan_bus_with_sysdata(int busno);
+
#include <linux/mm.h> /* for struct page */

/* Can be used to override the logic in pci_scan_bus for skipping
diff --git a/include/asm-x86_64/pci.h b/include/asm-x86_64/pci.h
index 88926eb..5da8cb0 100644
--- a/include/asm-x86_64/pci.h
+++ b/include/asm-x86_64/pci.h
@@ -10,6 +10,8 @@ struct pci_sysdata {
void* iommu; /* IOMMU private data */
};

+extern struct pci_bus *pci_scan_bus_with_sysdata(int busno);
+
#ifdef CONFIG_CALGARY_IOMMU
static inline void* pci_iommu(struct pci_bus *bus)
{
--
1.5.2





2007-08-09 13:11:38

by Muli Ben-Yehuda

Subject: Re: [PATCH] [4/12] x86_64: Disable CLFLUSH support again

On Thu, Aug 09, 2007 at 02:41:31PM +0200, Andi Kleen wrote:
>
> It turns out CLFLUSH support is still not complete; we
> flush the wrong pages. Again disable it for the release.
> Noticed by Jan Beulich.
>
> Signed-off-by: Andi Kleen <[email protected]>

Aside from the bug Jan pointed out with the patch, we're also using
clflush in other places (arch/i386/kernel/alternative.c and
arch/x86_64/kernel/tce.c). Is the issue with clflush support specific
to pageattr.c, or should it be disabled everywhere else too? What
specifically is the issue?

Cheers,
Muli

2007-08-09 13:13:27

by Andi Kleen

Subject: Re: [PATCH] [0/12] x86 Late merge bug fixes for 2.6.23

On Thursday 09 August 2007 15:06:30 Muli Ben-Yehuda wrote:
> On Thu, Aug 09, 2007 at 02:41:27PM +0200, Andi Kleen wrote:
> >
> > Various simple bug fixes, no new features. Please review. I plan to
> > send them to Linus later.
> >
> > I don't have anything else critical pending. If you believe a x86
> > patch is critical and really needs to make .23 please send email.
>
> We also need this fix for the current sysdata regression:

I forgot to mention this -- the plan to address the sysdata issues
was to ask Linus to revert the original patch.

-Andi

2007-08-09 13:13:44

by Andi Kleen

Subject: Re: [PATCH] [4/12] x86_64: Disable CLFLUSH support again

On Thursday 09 August 2007 15:11:21 Muli Ben-Yehuda wrote:

> Aside from the bug Jan pointed out with the patch, we're also using
> clflush in other places (arch/i386/kernel/alternative.c and
> arch/x86_64/kernel/tce.c). Is the issue with clflush support specific
> to pageattr.c

Specific to pageattr.c

-Andi

2007-08-09 13:28:50

by Jan Beulich

Subject: Re: [patches] [PATCH] [4/12] x86_64: Disable CLFLUSH support again

>>> Muli Ben-Yehuda <[email protected]> 09.08.07 15:11 >>>
>On Thu, Aug 09, 2007 at 02:41:31PM +0200, Andi Kleen wrote:
>>
>> It turns out CLFLUSH support is still not complete; we
>> flush the wrong pages. Again disable it for the release.
>> Noticed by Jan Beulich.
>>
>> Signed-off-by: Andi Kleen <[email protected]>
>
>Aside from the bug Jan pointed out with the patch, we're also using
>clflush in other places (arch/i386/kernel/alternative.c and
>arch/x86_64/kernel/tce.c). Is the issue with clflush support specific
>to pageattr.c or should it be disabled everywhere else too? What
>specifically is the issue?

The issue here is not with clflush by itself, but with the pages it is
being applied to. Here, it gets used on the page table pages themselves,
when flushing is really needed on the pages that one or more page table
entries in that page table page point(ed) to.

Jan

2007-08-09 16:24:31

by Muli Ben-Yehuda

Subject: Re: [patches] [PATCH] [4/12] x86_64: Disable CLFLUSH support again

On Thu, Aug 09, 2007 at 02:29:58PM +0100, Jan Beulich wrote:

> The issue here is not with clflush by itself, but with what pages it
> is being applied to. Here, it gets used on page table pages when
> flushing really is needed on what one or more page table entries in
> that page table page point(ed) to.

I see, thanks.

Cheers,
Muli

2007-08-09 16:46:54

by Cal Peake

Subject: Re: [PATCH] [8/12] x86_64: Use global flag to disable broken local apic timer on AMD CPUs.

On Thu, 9 Aug 2007, Andi Kleen wrote:

> Index: linux/arch/i386/kernel/apic.c
> ===================================================================
> --- linux.orig/arch/i386/kernel/apic.c
> +++ linux/arch/i386/kernel/apic.c
> @@ -61,8 +61,9 @@ static int enable_local_apic __initdata
>
> /* Local APIC timer verification ok */
> static int local_apic_timer_verify_ok;
> -/* Disable local APIC timer from the kernel commandline or via dmi quirk */
> -static int local_apic_timer_disabled;
> +/* Disable local APIC timer from the kernel commandline or via dmi quirk
> + or using CPU MSR check */
> +int local_apic_timer_disabled;
> /* Local APIC timer works in C2 */
> int local_apic_timer_c2_ok;
> EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok);
> @@ -370,9 +371,6 @@ void __init setup_boot_APIC_clock(void)
> long delta, deltapm;
> int pm_referenced = 0;
>
> - if (boot_cpu_has(X86_FEATURE_LAPIC_TIMER_BROKEN))
> - local_apic_timer_disabled = 1;
> -
> /*
> * The local apic timer can be disabled via the kernel
> * commandline or from the test above. Register the lapic

Andi, just noticed that the comment above needs to be updated now.

--
Cal Peake

2007-08-09 16:50:50

by Andi Kleen

Subject: Re: [PATCH] [8/12] x86_64: Use global flag to disable broken local apic timer on AMD CPUs.


> > /*
> > * The local apic timer can be disabled via the kernel
> > * commandline or from the test above. Register the lapic
>
> Andi, just noticed that the comment above needs to be updated now.

Too late now. I can do that as a follow-up patch. It's hardly
a showstopper.

-Andi

2007-08-10 10:20:33

by Joachim Deguara

Subject: Re: [patches] [PATCH] [1/12] x86: Work around mmio config space quirk on AMD Fam10h

I was just queuing up an identical patch ;) We haven't run into a problem yet,
but we were going to fix this to match our BKDG documentation. I didn't see
the original email; could you point me to it?

-Joachim

On Thursday 09 August 2007 14:41:28 Andi Kleen wrote:
> From: dean gaudet <[email protected]>
>
> Some broken devices have been discovered to require %al/%ax/%eax registers
> for MMIO config space accesses. Modify mmconfig.c to use these registers
> explicitly (rather than modify the global readb/writeb/etc inlines).
>
> AK: also changed i386 to always use eax
> AK: moved change to extended space probing to different patch
> AK: reworked with inlines according to Linus' requirements.
> AK: improve comments.
>
> Signed-off-by: dean gaudet <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>

Acked-by: Joachim Deguara <[email protected]>

>
>
> ---
> arch/i386/pci/mmconfig.c | 14 ++++++--------
> arch/i386/pci/pci.h | 43 +++++++++++++++++++++++++++++++++++++++++++
> arch/x86_64/pci/mmconfig.c | 12 ++++++------
> 3 files changed, 55 insertions(+), 14 deletions(-)
>
> Index: linux/arch/x86_64/pci/mmconfig.c
> ===================================================================
> --- linux.orig/arch/x86_64/pci/mmconfig.c
> +++ linux/arch/x86_64/pci/mmconfig.c
> @@ -66,13 +66,13 @@ static int pci_mmcfg_read(unsigned int s
>
> switch (len) {
> case 1:
> - *value = readb(addr + reg);
> + *value = mmio_config_readb(addr + reg);
> break;
> case 2:
> - *value = readw(addr + reg);
> + *value = mmio_config_readw(addr + reg);
> break;
> case 4:
> - *value = readl(addr + reg);
> + *value = mmio_config_readl(addr + reg);
> break;
> }
>
> @@ -94,13 +94,13 @@ static int pci_mmcfg_write(unsigned int
>
> switch (len) {
> case 1:
> - writeb(value, addr + reg);
> + mmio_config_writeb(addr + reg, value);
> break;
> case 2:
> - writew(value, addr + reg);
> + mmio_config_writew(addr + reg, value);
> break;
> case 4:
> - writel(value, addr + reg);
> + mmio_config_writel(addr + reg, value);
> break;
> }
>
> Index: linux/arch/i386/pci/mmconfig.c
> ===================================================================
> --- linux.orig/arch/i386/pci/mmconfig.c
> +++ linux/arch/i386/pci/mmconfig.c
> @@ -82,16 +82,15 @@ static int pci_mmcfg_read(unsigned int s
>
> switch (len) {
> case 1:
> - *value = readb(mmcfg_virt_addr + reg);
> + *value = mmio_config_readb(mmcfg_virt_addr + reg);
> break;
> case 2:
> - *value = readw(mmcfg_virt_addr + reg);
> + *value = mmio_config_readw(mmcfg_virt_addr + reg);
> break;
> case 4:
> - *value = readl(mmcfg_virt_addr + reg);
> + *value = mmio_config_readl(mmcfg_virt_addr + reg);
> break;
> }
> -
> spin_unlock_irqrestore(&pci_config_lock, flags);
>
> return 0;
> @@ -116,16 +115,15 @@ static int pci_mmcfg_write(unsigned int
>
> switch (len) {
> case 1:
> - writeb(value, mmcfg_virt_addr + reg);
> + mmio_config_writeb(mmcfg_virt_addr, value);
> break;
> case 2:
> - writew(value, mmcfg_virt_addr + reg);
> + mmio_config_writew(mmcfg_virt_addr, value);
> break;
> case 4:
> - writel(value, mmcfg_virt_addr + reg);
> + mmio_config_writel(mmcfg_virt_addr, value);
> break;
> }
> -
> spin_unlock_irqrestore(&pci_config_lock, flags);
>
> return 0;
> Index: linux/arch/i386/pci/pci.h
> ===================================================================
> --- linux.orig/arch/i386/pci/pci.h
> +++ linux/arch/i386/pci/pci.h
> @@ -104,3 +104,46 @@ extern DECLARE_BITMAP(pci_mmcfg_fallback
> extern int __init pci_mmcfg_arch_reachable(unsigned int seg, unsigned int bus, unsigned int devfn);
> extern int __init pci_mmcfg_arch_init(void);
> +
> +/*
> + * AMD Fam10h CPUs are buggy, and cannot access MMIO config space
> + * on their northbridge except through the %eax register. As such, you MUST
> + * NOT use normal IOMEM accesses, you need to only use the magic mmio-config
> + * accessor functions.
> + * In fact just use pci_config_*, nothing else please.
> + */
> +static inline unsigned char mmio_config_readb(void __iomem *pos)
> +{
> + u8 val;
> + asm volatile("movb (%1),%%al" : "=a" (val) : "r" (pos));
> + return val;
> +}
> +
> +static inline unsigned short mmio_config_readw(void __iomem *pos)
> +{
> + u16 val;
> + asm volatile("movw (%1),%%ax" : "=a" (val) : "r" (pos));
> + return val;
> +}
> +
> +static inline unsigned int mmio_config_readl(void __iomem *pos)
> +{
> + u32 val;
> + asm volatile("movl (%1),%%eax" : "=a" (val) : "r" (pos));
> + return val;
> +}
> +
> +static inline void mmio_config_writeb(void __iomem *pos, u8 val)
> +{
> + asm volatile("movb %%al,(%1)" :: "a" (val), "r" (pos) : "memory");
> +}
> +
> +static inline void mmio_config_writew(void __iomem *pos, u16 val)
> +{
> + asm volatile("movw %%ax,(%1)" :: "a" (val), "r" (pos) : "memory");
> +}
> +
> +static inline void mmio_config_writel(void __iomem *pos, u32 val)
> +{
> + asm volatile("movl %%eax,(%1)" :: "a" (val), "r" (pos) : "memory");
> +}




2007-08-12 09:25:45

by Linus Torvalds

Subject: Re: [PATCH] [1/12] x86: Work around mmio config space quirk on AMD Fam10h



On Thu, 9 Aug 2007, Andi Kleen wrote:
>
> Some broken devices have been discovered to require %al/%ax/%eax registers
> for MMIO config space accesses. Modify mmconfig.c to use these registers
> explicitly (rather than modify the global readb/writeb/etc inlines).
>
> AK: also changed i386 to always use eax
> AK: moved change to extended space probing to different patch
> AK: reworked with inlines according to Linus' requirements.
> AK: improve comments.
>
> Signed-off-by: dean gaudet <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>

Damn you, Andi.

You had obviously never actually tested this on x86.

> --- linux.orig/arch/i386/pci/mmconfig.c
> +++ linux/arch/i386/pci/mmconfig.c
> - writeb(value, mmcfg_virt_addr + reg);
> + mmio_config_writeb(mmcfg_virt_addr, value);

Notice something missing here? The new code cannot work on any machine
that actually uses mmio cfg.

Linus

2007-08-12 10:11:47

by Andi Kleen

Subject: Re: [PATCH] [1/12] x86: Work around mmio config space quirk on AMD Fam10h

On Sunday 12 August 2007 11:25, Linus Torvalds wrote:
> On Thu, 9 Aug 2007, Andi Kleen wrote:
> > Some broken devices have been discovered to require %al/%ax/%eax
> > registers for MMIO config space accesses. Modify mmconfig.c to use these
> > registers explicitly (rather than modify the global readb/writeb/etc
> > inlines).
> >
> > AK: also changed i386 to always use eax
> > AK: moved change to extended space probing to different patch
> > AK: reworked with inlines according to Linus' requirements.
> > AK: improve comments.
> >
> > Signed-off-by: dean gaudet <[email protected]>
> > Signed-off-by: Andi Kleen <[email protected]>
>
> Damn you, Andi.

Thanks for the kind words.

> You had obviously never actually tested this on x86.

Hmm, it booted on my i386 test box but it's possible it didn't use MCFG.

>
> > --- linux.orig/arch/i386/pci/mmconfig.c
> > +++ linux/arch/i386/pci/mmconfig.c
> > - writeb(value, mmcfg_virt_addr + reg);
> > + mmio_config_writeb(mmcfg_virt_addr, value);
>
> Notice something missing here? The new code cannot work on any machine
> that actually uses mmio cfg.

Sigh. Ok I assume you already reverted and I'll send a new one.

-Andi

2007-08-12 19:16:18

by Linus Torvalds

Subject: Re: [PATCH] [1/12] x86: Work around mmio config space quirk on AMD Fam10h



On Sun, 12 Aug 2007, Andi Kleen wrote:
> >
> > Damn you, Andi.
>
> Thanks for the kind words.

Should I be impressed by the fact that 16% of your patches caused bootup
problems, when we're really close to a -rc3, and *long* past the point
where we want to address regressions rather than cause them?

I simply want you to be more careful. *Much* more careful. As it is, I
end up always being afraid of merging your patch-series (and yes, I should
have found this even earlier when reading through the diff rather than
after I had already applied it, but I didn't, and we had a broken kernel
for a few hours as a result. I didn't see any bug-reports about this one,
so hopefully nobody noticed, but it still grates me).

Bugs happen, but (a) they should happen during the merge window, not when
we're in stabilization phase and (b) the percentages here were just not
very good.

Linus

2007-08-13 03:06:50

by Dave Jones

Subject: Re: [PATCH] [1/12] x86: Work around mmio config space quirk on AMD Fam10h

On Sun, Aug 12, 2007 at 12:15:26PM -0700, Linus Torvalds wrote:

> Bugs happen, but (a) they should happen during the merge window, not when
> we're in stabilization phase and (b) the percentages here were just not
> very good.

This does make me wonder, why weren't these caught in -mm?
Andrew lines up Andi's quilt series in each run afaik, or was
the last patchbombing all stuff that bypassed -mm for some reason?

Dave

--
http://www.codemonkey.org.uk

2007-08-13 03:27:07

by Linus Torvalds

Subject: Re: [PATCH] [1/12] x86: Work around mmio config space quirk on AMD Fam10h



On Sun, 12 Aug 2007, Dave Jones wrote:
>
> This does make me wonder, why weren't these caught in -mm?

I'm worried that -mm isn't getting a lot of exposure these days. People do
run it, but I wonder how many..

That said, a lot of machines won't ever use MMCFG (especially the old ones
- and most of the new ones would run x86-64), so that probably explains at
least that one.

But the x86-64 alternates code would hit anybody who had the "use rep movs
for best performance" code, and I'm surprised that one wasn't caught
earlier. X86_FEATURE_REP_GOOD is not uncommon (Core 2 has it). So I assume
that one wasn't really in -mm at all.

Linus

2007-08-13 03:38:31

by dean gaudet

[permalink] [raw]
Subject: Re: [PATCH] [1/12] x86: Work around mmio config space quirk on AMD Fam10h

On Sun, 12 Aug 2007, Linus Torvalds wrote:

> On Sun, 12 Aug 2007, Dave Jones wrote:
> >
> > This does make me wonder, why these weren't caught in -mm ?
>
> I'm worried that -mm isn't getting a lot of exposure these days. People do
> run it, but I wonder how many..

andrew caught it in -mm and reverted it. it crashed his vaio.

-dean

2007-08-18 00:05:31

by Jeremy Fitzhardinge

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

Andi Kleen wrote:
> Commit 19d36ccdc34f5ed444f8a6af0cbfdb6790eb1177 "x86: Fix alternatives
> and kprobes to remap write-protected kernel text" uses code which is
> being patched for patching.
>
> In particular, paravirt_ops does patching in two stages: first it
> calls paravirt_ops.patch, then it fills any remaining instructions
> with nop_out(). nop_out calls text_poke() which calls
> lookup_address() which calls pgd_val() (aka paravirt_ops.pgd_val):
> that call site is one of the places we patch.
>
> If we always do patching as one single call to text_poke(), we only
> need make sure we're not patching the memcpy in text_poke itself.
> This means the prototype to paravirt_ops.patch needs to change, to
> marshal the new code into a buffer rather than patching in place as it
> does now. It also means all patching goes through text_poke(), which
> is known to be safe (apply_alternatives is also changed to make a
> single patch).
>

Hi Andi,

This patch breaks Xen booting. I get infinite recursive faults during
patching when this patch is present. If I boot with
"noreplace-paravirt" it works OK, and it works as expected if I back
this patch out. I haven't tracked down the exact failure mode; it's a
little hard to debug because it overwrites all kernel memory with
recursive fault stackframes and then finally traps out to Xen when it
hits the bottom of memory.

I think we should back this one out before .23.

J

2007-08-18 00:12:18

by Chris Wright

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

* Jeremy Fitzhardinge ([email protected]) wrote:
> This patch breaks Xen booting. I get infinite recursive faults during
> patching when this patch is present. If I boot with
> "noreplace-paravirt" it works OK, and it works as expected if I back
> this patch out. I haven't tracked down the exact failure mode; its a
> little hard to debug because it overwrites all kernel memory with
> recursive fault stackframes and then finally traps out to Xen when it
> hits the bottom of memory.
>
> I think we should back this one out before .23.

I agree (second time this has broken during .23 devel).

thanks,
-chris

2007-08-18 09:50:41

by Andi Kleen

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue


> This patch breaks Xen booting.

Check the latest git head. Does it still break?

-Andi

2007-08-18 20:33:01

by Chris Wright

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

* Andi Kleen ([email protected]) wrote:
> > This patch breaks Xen booting.
>
> Check the latest git head. Does it still break?

Yeah, this is the latest git. The broken commit is Rusty's patch which,
after Linus reverted the write-protected remap changes, is no longer
necessary. AFAICT patching is writing garbage into the insn stream.
I suspect it's copying an uninitialized temp buffer.

thanks,
-chris

2007-08-18 20:56:25

by Linus Torvalds

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue



On Sat, 18 Aug 2007, Chris Wright wrote:
> >
> > Check the latest git head. Does it still break?
>
> Yeah, this is the latest git. The broken commit is Rusty's patch which,
> after Linus reverted the write-protected remap changes, is no longer
> necessary. AFAICT patching is writing garbage into the insn stream.
> I suspect it's copying an uninitialized temp buffer.

Can you send me the revert patch that is verified to work?

Linus

2007-08-18 21:32:37

by Chris Wright

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

* Linus Torvalds ([email protected]) wrote:
> On Sat, 18 Aug 2007, Chris Wright wrote:
> > >
> > > Check the latest git head. Does it still break?
> >
> > Yeah, this is the latest git. The broken commit is Rusty's patch which,
> > after Linus reverted the write-protected remap changes, is no longer
> > necessary. AFAICT patching is writing garbage into the insn stream.
> > I suspect it's copying an uninitialized temp buffer.
>
> Can you send me the revert patch that is verified to work?

Now that I understand the problem, I do have a very simple (slightly
overkill) fix for paravirt patching. This can be cleaned up to avoid
the copies when they aren't needed, but that will take a little more
auditing of the various patchers. If you still prefer a revert I've
got one handy, and we can re-visit this all post .23.

thanks,
-chris
--

Subject: [PATCH] x86: properly initialize temp insn buffer for paravirt patching
From: Chris Wright <[email protected]>

With commit ab144f5ec64c42218a555ec1dbde6b60cf2982d6 the patching code
now collects the complete new instruction stream into a temp buffer
before finally patching in the new insns. In some cases the paravirt
patchers will choose to leave the patch site unpatched (length mismatch,
clobbers mismatch, etc). This causes the new patching code to copy an
uninitialized temp buffer, i.e. garbage, to the callsite. Simply make
sure to always initialize the buffer with the original instruction stream.
A better fix is to audit all the patchers and return proper length so that
apply_paravirt() can skip copies when we leave the patch site untouched.

Signed-off-by: Chris Wright <[email protected]>
---
arch/i386/kernel/alternative.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/i386/kernel/alternative.c b/arch/i386/kernel/alternative.c
index 1b66d5c..9f4ac8b 100644
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -366,6 +366,8 @@ void apply_paravirt(struct paravirt_patch_site *start,
unsigned int used;

BUG_ON(p->len > MAX_PATCH_LEN);
+ /* prep the buffer with the original instructions */
+ memcpy(insnbuf, p->instr, p->len);
used = paravirt_ops.patch(p->instrtype, p->clobbers, insnbuf,
(unsigned long)p->instr, p->len);

2007-08-18 21:58:24

by Jeremy Fitzhardinge

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

Andi Kleen wrote:
>> This patch breaks Xen booting.
>>
>
> Check the latest git head. Does it still break?

Yes, that's with latest git head.

J

2007-08-18 22:28:52

by Chris Wright

Subject: [PATCH] x86: skip paravirt patching when appropriate

* Chris Wright ([email protected]) wrote:
> Now that I understand the problem, I do have a very simple (slightly
> overkill) fix for paravirt patching. This can be cleaned up to avoid
> the copies when they aren't needed, but that will take a little more
> auditing of the various patchers. If you still prefer a revert I've
> got one handy, and we can re-visit this all post .23.

This one avoids the patching when appropriate, but needs some validation on
VMI (and Zach's not avail today). I'll resend when I know it's working
for all paravirt patchers.

thanks,
-chris
--

Subject: [PATCH] x86: skip paravirt patching when appropriate
From: Chris Wright <[email protected]>

commit d34fda4a84c18402640a1a2342d6e6d9829e6db7 was a little overkill
in the case where a paravirt patcher chooses to leave a patch site
unpatched. Instead of copying the original instructions to a temp buffer
and then back to the patch site, simply skip patching those sites altogether.

Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Zach Amsden <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
---
arch/i386/kernel/alternative.c | 4 ++--
arch/i386/kernel/paravirt.c | 10 +++++-----
arch/i386/kernel/vmi.c | 4 ++--
include/asm-i386/paravirt.h | 3 +++
4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/i386/kernel/alternative.c b/arch/i386/kernel/alternative.c
index 9f4ac8b..b81d87e 100644
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -366,10 +366,10 @@ void apply_paravirt(struct paravirt_patch_site *start,
unsigned int used;

BUG_ON(p->len > MAX_PATCH_LEN);
- /* prep the buffer with the original instructions */
- memcpy(insnbuf, p->instr, p->len);
used = paravirt_ops.patch(p->instrtype, p->clobbers, insnbuf,
(unsigned long)p->instr, p->len);
+ if (used == PV_NO_PATCH)
+ continue;

BUG_ON(used > p->len);

diff --git a/arch/i386/kernel/paravirt.c b/arch/i386/kernel/paravirt.c
index 739cfb2..a36ce34 100644
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -122,7 +122,7 @@ unsigned paravirt_patch_nop(void)

unsigned paravirt_patch_ignore(unsigned len)
{
- return len;
+ return PV_NO_PATCH;
}

struct branch {
@@ -139,9 +139,9 @@ unsigned paravirt_patch_call(void *insnbuf,
unsigned long delta = (unsigned long)target - (addr+5);

if (tgt_clobbers & ~site_clobbers)
- return len; /* target would clobber too much for this site */
+ return PV_NO_PATCH; /* target would clobber too much for this site */
if (len < 5)
- return len; /* call too long for patch site */
+ return PV_NO_PATCH; /* call too long for patch site */

b->opcode = 0xe8; /* call */
b->delta = delta;
@@ -157,7 +157,7 @@ unsigned paravirt_patch_jmp(const void *target, void *insnbuf,
unsigned long delta = (unsigned long)target - (addr+5);

if (len < 5)
- return len; /* call too long for patch site */
+ return PV_NO_PATCH; /* call too long for patch site */

b->opcode = 0xe9; /* jmp */
b->delta = delta;
@@ -196,7 +196,7 @@ unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
unsigned insn_len = end - start;

if (insn_len > len || start == NULL)
- insn_len = len;
+ insn_len = PV_NO_PATCH;
else
memcpy(insnbuf, start, insn_len);

diff --git a/arch/i386/kernel/vmi.c b/arch/i386/kernel/vmi.c
index 18673e0..27ae004 100644
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -118,7 +118,7 @@ static unsigned patch_internal(int call, unsigned len, void *insnbuf,

case VMI_RELOCATION_NONE:
/* leave native code in place */
- break;
+ return PV_NO_PATCH;

default:
BUG();
@@ -153,7 +153,7 @@ static unsigned vmi_patch(u8 type, u16 clobbers, void *insns,
default:
break;
}
- return len;
+ return PV_NO_PATCH;
}

/* CPUID has non-C semantics, and paravirt-ops API doesn't match hardware ISA */
diff --git a/include/asm-i386/paravirt.h b/include/asm-i386/paravirt.h
index 9fa3fa9..b26794f 100644
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -252,6 +252,9 @@ extern struct paravirt_ops paravirt_ops;
#define paravirt_alt(insn_string) \
_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")

+enum {
+ PV_NO_PATCH = -1
+};
unsigned paravirt_patch_nop(void);
unsigned paravirt_patch_ignore(unsigned len);
unsigned paravirt_patch_call(void *insnbuf,

2007-08-19 03:13:42

by Zachary Amsden

Subject: Re: [PATCH] x86: skip paravirt patching when appropriate

Chris Wright wrote:
> * Chris Wright ([email protected]) wrote:
>
>> Now that I understand the problem, I do have a very simple (slightly
>> overkill) fix for paravirt patching. This can be cleaned up to avoid
>> the copies when they aren't needed, but that will take a little more
>> auditing of the various patchers. If you still prefer a revert I've
>> got one handy, and we can re-visit this all post .23.
>>
>
> This one avoids the patching when necessary, but needs some validation on
> VMI (and Zach's not avail today). I'll resend when I know it's working
> for all paravirt patchers.
>

I'm back. I'll give this one a spin after dinner tonight.

Zach

2007-08-19 06:00:17

by Zachary Amsden

Subject: Re: [PATCH] x86: skip paravirt patching when appropriate

Chris Wright wrote:
> * Chris Wright ([email protected]) wrote:
>
>> Now that I understand the problem, I do have a very simple (slightly
>> overkill) fix for paravirt patching. This can be cleaned up to avoid
>> the copies when they aren't needed, but that will take a little more
>> auditing of the various patchers. If you still prefer a revert I've
>> got one handy, and we can re-visit this all post .23.
>>
>
> This one avoids the patching when necessary, but needs some validation on
> VMI (and Zach's not avail today). I'll resend when I know it's working
> for all paravirt patchers.
>

Looks like it's already in -git. I tested with your fix and hit a
different bug. Bisecting, will handle.

Zach

2007-08-20 01:50:53

by Rusty Russell

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

On Fri, 2007-08-17 at 17:05 -0700, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Commit 19d36ccdc34f5ed444f8a6af0cbfdb6790eb1177 "x86: Fix alternatives
> > and kprobes to remap write-protected kernel text" uses code which is
> > being patched for patching.
> >
> > In particular, paravirt_ops does patching in two stages: first it
> > calls paravirt_ops.patch, then it fills any remaining instructions
> > with nop_out(). nop_out calls text_poke() which calls
> > lookup_address() which calls pgd_val() (aka paravirt_ops.pgd_val):
> > that call site is one of the places we patch.
> >
> > If we always do patching as one single call to text_poke(), we only
> > need make sure we're not patching the memcpy in text_poke itself.
> > This means the prototype to paravirt_ops.patch needs to change, to
> > marshal the new code into a buffer rather than patching in place as it
> > does now. It also means all patching goes through text_poke(), which
> > is known to be safe (apply_alternatives is also changed to make a
> > single patch).
> >
>
> Hi Andi,
>
> This patch breaks Xen booting. I get infinite recursive faults during
> patching when this patch is present. If I boot with
> "noreplace-paravirt" it works OK, and it works as expected if I back
> this patch out. I haven't tracked down the exact failure mode; it's a
> little hard to debug because it overwrites all kernel memory with
> recursive fault stackframes and then finally traps out to Xen when it
> hits the bottom of memory.
>
> I think we should back this one out before .23.

Then back out 19d36ccdc34f5ed444f8a6af0cbfdb6790eb1177 too, which broke
lguest booting and which this patch tried to fix.

Rusty.

2007-08-20 02:08:10

by Chris Wright

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

* Rusty Russell ([email protected]) wrote:
> Then back out 19d36ccdc34f5ed444f8a6af0cbfdb6790eb1177 too, which broke
> lguest booting, and this tried to fix.

That did get backed out (at least the part that broke paravirt patching)
in 602033ed5907a59ce86f709082a35be047743a86. Linus' tree should be
working fine right now with d34fda4a84c18402640a1a2342d6e6d9829e6db7
committed, and can be further refined with the patch below that's just
waiting on some further testing.

thanks,
-chris
--

Subject: [PATCH] x86: skip paravirt patching when appropriate
From: Chris Wright <[email protected]>

commit d34fda4a84c18402640a1a2342d6e6d9829e6db7 was a little overkill
in the case where a paravirt patcher chooses to leave a patch site
unpatched. Instead of copying the original instructions to a temp buffer
and then back to the patch site, simply skip patching those sites altogether.

Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Zach Amsden <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
---
arch/i386/kernel/alternative.c | 4 ++--
arch/i386/kernel/paravirt.c | 10 +++++-----
arch/i386/kernel/vmi.c | 4 ++--
include/asm-i386/paravirt.h | 3 +++
4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/i386/kernel/alternative.c b/arch/i386/kernel/alternative.c
index 9f4ac8b..b81d87e 100644
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -366,10 +366,10 @@ void apply_paravirt(struct paravirt_patch_site *start,
unsigned int used;

BUG_ON(p->len > MAX_PATCH_LEN);
- /* prep the buffer with the original instructions */
- memcpy(insnbuf, p->instr, p->len);
used = paravirt_ops.patch(p->instrtype, p->clobbers, insnbuf,
(unsigned long)p->instr, p->len);
+ if (used == PV_NO_PATCH)
+ continue;

BUG_ON(used > p->len);

diff --git a/arch/i386/kernel/paravirt.c b/arch/i386/kernel/paravirt.c
index 739cfb2..a36ce34 100644
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -122,7 +122,7 @@ unsigned paravirt_patch_nop(void)

unsigned paravirt_patch_ignore(unsigned len)
{
- return len;
+ return PV_NO_PATCH;
}

struct branch {
@@ -139,9 +139,9 @@ unsigned paravirt_patch_call(void *insnbuf,
unsigned long delta = (unsigned long)target - (addr+5);

if (tgt_clobbers & ~site_clobbers)
- return len; /* target would clobber too much for this site */
+ return PV_NO_PATCH; /* target would clobber too much for this site */
if (len < 5)
- return len; /* call too long for patch site */
+ return PV_NO_PATCH; /* call too long for patch site */

b->opcode = 0xe8; /* call */
b->delta = delta;
@@ -157,7 +157,7 @@ unsigned paravirt_patch_jmp(const void *target, void *insnbuf,
unsigned long delta = (unsigned long)target - (addr+5);

if (len < 5)
- return len; /* call too long for patch site */
+ return PV_NO_PATCH; /* call too long for patch site */

b->opcode = 0xe9; /* jmp */
b->delta = delta;
@@ -196,7 +196,7 @@ unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
unsigned insn_len = end - start;

if (insn_len > len || start == NULL)
- insn_len = len;
+ insn_len = PV_NO_PATCH;
else
memcpy(insnbuf, start, insn_len);

diff --git a/arch/i386/kernel/vmi.c b/arch/i386/kernel/vmi.c
index 18673e0..27ae004 100644
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -118,7 +118,7 @@ static unsigned patch_internal(int call, unsigned len, void *insnbuf,

case VMI_RELOCATION_NONE:
/* leave native code in place */
- break;
+ return PV_NO_PATCH;

default:
BUG();
@@ -153,7 +153,7 @@ static unsigned vmi_patch(u8 type, u16 clobbers, void *insns,
default:
break;
}
- return len;
+ return PV_NO_PATCH;
}

/* CPUID has non-C semantics, and paravirt-ops API doesn't match hardware ISA */
diff --git a/include/asm-i386/paravirt.h b/include/asm-i386/paravirt.h
index 9fa3fa9..b26794f 100644
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -252,6 +252,9 @@ extern struct paravirt_ops paravirt_ops;
#define paravirt_alt(insn_string) \
_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")

+enum {
+ PV_NO_PATCH = -1
+};
unsigned paravirt_patch_nop(void);
unsigned paravirt_patch_ignore(unsigned len);
unsigned paravirt_patch_call(void *insnbuf,

2007-08-20 06:15:31

by Jeremy Fitzhardinge

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

Chris Wright wrote:
> That did get backed out (at least the part that broke paravirt patching)
> in 602033ed5907a59ce86f709082a35be047743a86. Linus' tree should be
> working fine right now with d34fda4a84c18402640a1a2342d6e6d9829e6db7
> committed, and can be further refined with the patch below that's just
> waiting on some further testing.
>

I don't think this is necessary. It isn't worth complicating the
interface to avoid the memcpy.

J

2007-08-21 07:30:27

by Glauber Costa

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

On 8/20/07, Jeremy Fitzhardinge <[email protected]> wrote:
> Chris Wright wrote:
> > That did get backed out (at least the part that broke paravirt patching)
> > in 602033ed5907a59ce86f709082a35be047743a86. Linus' tree should be
> > working fine right now with d34fda4a84c18402640a1a2342d6e6d9829e6db7
> > committed, and can be further refined with the patch below that's just
> > waiting on some further testing.
> >
>
> I don't think this is necessary. It isn't worth complicating the
> interface to avoid the memcpy.
>
> J
Damn,

I can't believe I've just lost a night tracking the issue, without
seeing the discussion here ;-)
I came to this very same conclusion, and was about to send a patch
that fixes it, by doing a memcpy before starting the instruction
replacement.

(I wouldn't say anything, as this is solved, but my night has to get
some value, after all! ;-)

So I'm with Jeremy. We don't lose too much by putting a memcpy there;
this code is not exactly critical. It also seems cleaner, and less
error prone. I have a patch ready here, but I think by this time, you
guys have one too ;-)

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-21 10:29:49

by Andi Kleen

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

On Tue, Aug 21, 2007 at 04:30:10AM -0300, Glauber de Oliveira Costa wrote:
> On 8/20/07, Jeremy Fitzhardinge <[email protected]> wrote:
> > Chris Wright wrote:
> > > That did get backed out (at least the part that broke paravirt patching)
> > > in 602033ed5907a59ce86f709082a35be047743a86. Linus' tree should be
> > > working fine right now with d34fda4a84c18402640a1a2342d6e6d9829e6db7
> > > committed, and can be further refined with the patch below that's just
> > > waiting on some further testing.
> > >
> >
> > I don't think this is necessary. It isn't worth complicating the
> > interface to avoid the memcpy.
> >
> > J
> Damn,
>
> I can't believe I've just lost a night tracking the issue, without
> seeing the discussion here ;-)
> I came out to this very same conclusion, and was about to send a patch
> that fixes it, by doing a memcpy before starting the instruction
> replacement.
>
> (I wouldn't say anything, as this is solved, but my night have to get
> some value, after all! ;-)
>
> So I'm with Jeremy. We don't lose too much by putting a memcpy there,
> this code is not exactly critical. It also seems cleaner, and less
> error prone. I have a patch ready here, but I think by this time, you
> guys have too ;-)

x86-64 also has a __inline_memcpy that is guaranteed to be inlined. It was
originally added for cases like this where a plain memcpy didn't work. It
could be added to i386 too if there is a need.

-Andi

2007-08-21 12:26:53

by Glauber Costa

Subject: Re: [PATCH] [5/12] x86_64: Make patching more robust, fix paravirt issue

On 8/21/07, Andi Kleen <[email protected]> wrote:
> On Tue, Aug 21, 2007 at 04:30:10AM -0300, Glauber de Oliveira Costa wrote:
> > On 8/20/07, Jeremy Fitzhardinge <[email protected]> wrote:
> > > Chris Wright wrote:
> > > > That did get backed out (at least the part that broke paravirt patching)
> > > > in 602033ed5907a59ce86f709082a35be047743a86. Linus' tree should be
> > > > working fine right now with d34fda4a84c18402640a1a2342d6e6d9829e6db7
> > > > committed, and can be further refined with the patch below that's just
> > > > waiting on some further testing.
> > > >
> > >
> > > I don't think this is necessary. It isn't worth complicating the
> > > interface to avoid the memcpy.
> > >
> > > J
> > Damn,
> >
> > I can't believe I've just lost a night tracking the issue, without
> > seeing the discussion here ;-)
> > I came out to this very same conclusion, and was about to send a patch
> > that fixes it, by doing a memcpy before starting the instruction
> > replacement.
> >
> > (I wouldn't say anything, as this is solved, but my night have to get
> > some value, after all! ;-)
> >
> > So I'm with Jeremy. We don't lose too much by putting a memcpy there,
> > this code is not exactly critical. It also seems cleaner, and less
> > error prone. I have a patch ready here, but I think by this time, you
> > guys have too ;-)
>
> x86-64 also has a __inline_memcpy that is guaranteed inlined. It was
> originally for such cases when memcpy didn't work. Could be added to i386
> too if there is need
>
Andi,

this is not a case where memcpy did not work. (I hit the same issue
with paravirt_ops on x86_64, and btw, that's why I have not yet sent a new
patch.)

The thing is that if the paravirt replacement function returns without
patching anything, it returns with the temporary buffer still empty;
text_poke() then writes that buffer out, and thus puts crap into the code.

So adding a memcpy to the buffer guarantees that in the worst case, it
will contain the original instruction. The normal memcpy works neatly.

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-10-11 21:30:45

by H. Peter Anvin

Subject: Re: [PATCH] [6/12] x86_64: Early segment setup for VT

Andi Kleen wrote:
> From: Zachary Amsden <[email protected]>
>
> VT is very picky about when it can enter execution.
> Get all segments setup and get LDT and TR into valid state to allow
> VT execution under VMware and KVM (untested).
>
> This makes the boot decompression run under VT, which makes it several
> orders of magnitude faster on 64-bit Intel hardware.
>
> Before, I was seeing times up to a minute or more to decompress a 1.3MB kernel
> on a very fast box.
>
> Signed-off-by: Zachary Amsden <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>
>

Makes a lot of sense to me.

-hpa