2005-12-02 08:10:36

by Ravikiran G Thirumalai

Subject: [patch 1/3] x86_64: Node local PDA -- early cpu_to_node

The following patchset is to allocate node local memory for the x86_64
processor PDA.

Andrew, can you pls include this in -mm for testing?

Thanks,
Kiran

---

Patch enables early initialization of cpu_to_node. apicid_to_node is built by
reading the SRAT table, from acpi_numa_init, and x86_cpu_to_apicid is built
by parsing the ACPI MADT table, from acpi_boot_init. Combine these two tables
and set up cpu_to_node. Thanks to Andi for suggesting this.

Early initialization helps the static per-cpu areas get pages from the
correct node, and sets up cpu_to_node early for node-local PDA allocations.

Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.15-rc3/arch/x86_64/kernel/setup.c
===================================================================
--- linux-2.6.15-rc3.orig/arch/x86_64/kernel/setup.c 2005-11-30 11:32:53.000000000 -0800
+++ linux-2.6.15-rc3/arch/x86_64/kernel/setup.c 2005-11-30 16:49:19.000000000 -0800
@@ -528,6 +528,7 @@
void __init setup_arch(char **cmdline_p)
{
unsigned long kernel_end;
+ unsigned i;

ROOT_DEV = old_decode_dev(ORIG_ROOT_DEV);
drive_info = DRIVE_INFO;
@@ -669,6 +670,15 @@
acpi_boot_init();
#endif

+#ifdef CONFIG_ACPI_NUMA
+ /*
+ * Setup cpu_to_node using the SRAT lapcis & ACPI MADT table
+ * info.
+ */
+ for (i = 0; i < NR_CPUS; i++)
+ cpu_to_node[i] = apicid_to_node[x86_cpu_to_apicid[i]];
+#endif
+
#ifdef CONFIG_X86_LOCAL_APIC
/*
* get boot-time SMP configuration:
@@ -687,12 +697,9 @@

request_resource(&iomem_resource, &video_ram_resource);

- {
- unsigned i;
/* request I/O space for devices used on all i[345]86 PCs */
for (i = 0; i < STANDARD_IO_RESOURCES; i++)
request_resource(&ioport_resource, &standard_io_resources[i]);
- }

e820_setup_gap();
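
The table composition this patch performs can be sketched in user-space C.
This is a minimal illustration under stated assumptions, not the kernel code:
MAX_CPUS, MAX_APICS, the BAD_APICID and NO_NODE sentinels, and the function
build_cpu_to_node() are hypothetical names, and the fallback to node 0 for
unpopulated entries is an addition that the posted loop does not have.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPUS    4      /* hypothetical stand-in for NR_CPUS */
#define MAX_APICS   256    /* hypothetical size of the APIC ID space */
#define BAD_APICID  0xFFu  /* hypothetical marker: CPU slot not populated */
#define NO_NODE     0xFFu  /* hypothetical marker: node not known */

/*
 * Fill cpu_to_node[] by composing cpu -> APIC ID (built from the MADT)
 * with APIC ID -> node (built from the SRAT). Entries that cannot be
 * resolved fall back to node 0.
 */
static void build_cpu_to_node(unsigned char *cpu_to_node,
                              const unsigned char *cpu_to_apicid,
                              const unsigned char *apicid_to_node)
{
    assert(cpu_to_node && cpu_to_apicid && apicid_to_node);

    for (size_t cpu = 0; cpu < MAX_CPUS; cpu++) {
        unsigned char apicid = cpu_to_apicid[cpu];
        unsigned char node = NO_NODE;

        /* An unsigned char index always fits in a MAX_APICS-entry table. */
        if (apicid != BAD_APICID)
            node = apicid_to_node[apicid];

        cpu_to_node[cpu] = (node == NO_NODE) ? 0 : node;
    }
}
```

The caller is expected to pass an apicid_to_node table with MAX_APICS entries,
initialized to NO_NODE wherever the firmware tables said nothing.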


2005-12-02 08:17:04

by Ravikiran G Thirumalai

Subject: [patch 2/3] x86_64: Node local PDA -- Use macros to access cpu_pda

Helper patch to change cpu_pda users to use macros to access cpu_pda
instead of the cpu_pda[] array.

Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.15-rc1git/arch/x86_64/kernel/irq.c
===================================================================
--- linux-2.6.15-rc1git.orig/arch/x86_64/kernel/irq.c 2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.15-rc1git/arch/x86_64/kernel/irq.c 2005-11-16 14:08:14.000000000 -0800
@@ -69,13 +69,13 @@
seq_printf(p, "NMI: ");
for (j = 0; j < NR_CPUS; j++)
if (cpu_online(j))
- seq_printf(p, "%10u ", cpu_pda[j].__nmi_count);
+ seq_printf(p, "%10u ", cpu_pda(j)->__nmi_count);
seq_putc(p, '\n');
#ifdef CONFIG_X86_LOCAL_APIC
seq_printf(p, "LOC: ");
for (j = 0; j < NR_CPUS; j++)
if (cpu_online(j))
- seq_printf(p, "%10u ", cpu_pda[j].apic_timer_irqs);
+ seq_printf(p, "%10u ", cpu_pda(j)->apic_timer_irqs);
seq_putc(p, '\n');
#endif
seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
Index: linux-2.6.15-rc1git/arch/x86_64/kernel/nmi.c
===================================================================
--- linux-2.6.15-rc1git.orig/arch/x86_64/kernel/nmi.c 2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.15-rc1git/arch/x86_64/kernel/nmi.c 2005-11-16 14:08:14.000000000 -0800
@@ -155,19 +155,19 @@
smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0);

for (cpu = 0; cpu < NR_CPUS; cpu++)
- counts[cpu] = cpu_pda[cpu].__nmi_count;
+ counts[cpu] = cpu_pda(cpu)->__nmi_count;
local_irq_enable();
mdelay((10*1000)/nmi_hz); // wait 10 ticks

for (cpu = 0; cpu < NR_CPUS; cpu++) {
if (!cpu_online(cpu))
continue;
- if (cpu_pda[cpu].__nmi_count - counts[cpu] <= 5) {
+ if (cpu_pda(cpu)->__nmi_count - counts[cpu] <= 5) {
endflag = 1;
printk("CPU#%d: NMI appears to be stuck (%d->%d)!\n",
cpu,
counts[cpu],
- cpu_pda[cpu].__nmi_count);
+ cpu_pda(cpu)->__nmi_count);
nmi_active = 0;
lapic_nmi_owner &= ~LAPIC_NMI_WATCHDOG;
nmi_perfctr_msr = 0;
Index: linux-2.6.15-rc1git/arch/x86_64/kernel/setup64.c
===================================================================
--- linux-2.6.15-rc1git.orig/arch/x86_64/kernel/setup64.c 2005-11-16 12:13:40.000000000 -0800
+++ linux-2.6.15-rc1git/arch/x86_64/kernel/setup64.c 2005-11-16 14:08:14.000000000 -0800
@@ -30,7 +30,7 @@

cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE;

-struct x8664_pda cpu_pda[NR_CPUS] __cacheline_aligned;
+struct x8664_pda _cpu_pda[NR_CPUS] __cacheline_aligned;

struct desc_ptr idt_descr = { 256 * 16, (unsigned long) idt_table };

@@ -110,18 +110,18 @@
}
if (!ptr)
panic("Cannot allocate cpu data for CPU %d\n", i);
- cpu_pda[i].data_offset = ptr - __per_cpu_start;
+ cpu_pda(i)->data_offset = ptr - __per_cpu_start;
memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
}
}

void pda_init(int cpu)
{
- struct x8664_pda *pda = &cpu_pda[cpu];
+ struct x8664_pda *pda = cpu_pda(cpu);

/* Setup up data that may be needed in __get_free_pages early */
asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0));
- wrmsrl(MSR_GS_BASE, cpu_pda + cpu);
+ wrmsrl(MSR_GS_BASE, pda);

pda->cpunumber = cpu;
pda->irqcount = -1;
Index: linux-2.6.15-rc1git/arch/x86_64/kernel/smpboot.c
===================================================================
--- linux-2.6.15-rc1git.orig/arch/x86_64/kernel/smpboot.c 2005-11-16 12:13:40.000000000 -0800
+++ linux-2.6.15-rc1git/arch/x86_64/kernel/smpboot.c 2005-11-16 14:08:14.000000000 -0800
@@ -778,7 +778,7 @@

do_rest:

- cpu_pda[cpu].pcurrent = c_idle.idle;
+ cpu_pda(cpu)->pcurrent = c_idle.idle;

start_rip = setup_trampoline();

Index: linux-2.6.15-rc1git/arch/x86_64/kernel/traps.c
===================================================================
--- linux-2.6.15-rc1git.orig/arch/x86_64/kernel/traps.c 2005-11-16 12:13:40.000000000 -0800
+++ linux-2.6.15-rc1git/arch/x86_64/kernel/traps.c 2005-11-16 14:08:14.000000000 -0800
@@ -158,7 +158,7 @@
{
unsigned long addr;
const unsigned cpu = safe_smp_processor_id();
- unsigned long *irqstack_end = (unsigned long *)cpu_pda[cpu].irqstackptr;
+ unsigned long *irqstack_end = (unsigned long *)cpu_pda(cpu)->irqstackptr;
int i;
unsigned used = 0;

@@ -226,8 +226,8 @@
unsigned long *stack;
int i;
const int cpu = safe_smp_processor_id();
- unsigned long *irqstack_end = (unsigned long *) (cpu_pda[cpu].irqstackptr);
- unsigned long *irqstack = (unsigned long *) (cpu_pda[cpu].irqstackptr - IRQSTACKSIZE);
+ unsigned long *irqstack_end = (unsigned long *) (cpu_pda(cpu)->irqstackptr);
+ unsigned long *irqstack = (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE);

// debugging aid: "show_stack(NULL, NULL);" prints the
// back trace for this cpu.
@@ -275,7 +275,7 @@
int in_kernel = !user_mode(regs);
unsigned long rsp;
const int cpu = safe_smp_processor_id();
- struct task_struct *cur = cpu_pda[cpu].pcurrent;
+ struct task_struct *cur = cpu_pda(cpu)->pcurrent;

rsp = regs->rsp;

Index: linux-2.6.15-rc1git/arch/x86_64/kernel/x8664_ksyms.c
===================================================================
--- linux-2.6.15-rc1git.orig/arch/x86_64/kernel/x8664_ksyms.c 2005-11-16 12:13:40.000000000 -0800
+++ linux-2.6.15-rc1git/arch/x86_64/kernel/x8664_ksyms.c 2005-11-16 14:08:14.000000000 -0800
@@ -109,7 +109,7 @@
EXPORT_SYMBOL(copy_page);
EXPORT_SYMBOL(clear_page);

-EXPORT_SYMBOL(cpu_pda);
+EXPORT_SYMBOL(_cpu_pda);
#ifdef CONFIG_SMP
EXPORT_SYMBOL(cpu_data);
EXPORT_SYMBOL(cpu_online_map);
Index: linux-2.6.15-rc1git/arch/x86_64/mm/numa.c
===================================================================
--- linux-2.6.15-rc1git.orig/arch/x86_64/mm/numa.c 2005-11-16 12:13:40.000000000 -0800
+++ linux-2.6.15-rc1git/arch/x86_64/mm/numa.c 2005-11-16 14:11:41.000000000 -0800
@@ -270,7 +270,7 @@

void __cpuinit numa_set_node(int cpu, int node)
{
- cpu_pda[cpu].nodenumber = node;
+ cpu_pda(cpu)->nodenumber = node;
cpu_to_node[cpu] = node;
}

Index: linux-2.6.15-rc1git/include/asm-x86_64/pda.h
===================================================================
--- linux-2.6.15-rc1git.orig/include/asm-x86_64/pda.h 2005-11-16 12:13:40.000000000 -0800
+++ linux-2.6.15-rc1git/include/asm-x86_64/pda.h 2005-11-16 14:08:14.000000000 -0800
@@ -27,7 +27,9 @@
#define IRQSTACK_ORDER 2
#define IRQSTACKSIZE (PAGE_SIZE << IRQSTACK_ORDER)

-extern struct x8664_pda cpu_pda[];
+extern struct x8664_pda _cpu_pda[];
+
+#define cpu_pda(i) (&_cpu_pda[i])

/*
* There is no fast way to get the base address of the PDA, all the accesses
Index: linux-2.6.15-rc1git/include/asm-x86_64/percpu.h
===================================================================
--- linux-2.6.15-rc1git.orig/include/asm-x86_64/percpu.h 2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.15-rc1git/include/asm-x86_64/percpu.h 2005-11-16 14:08:14.000000000 -0800
@@ -11,7 +11,7 @@

#include <asm/pda.h>

-#define __per_cpu_offset(cpu) (cpu_pda[cpu].data_offset)
+#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
#define __my_cpu_offset() read_pda(data_offset)

/* Separate out the type, so (int[3], foo) works. */
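
The point of routing every access through cpu_pda() is that callers no longer
depend on how the PDAs are stored. A minimal user-space sketch of that idea
follows. The trimmed struct pda, NCPUS, the PDA_POINTER_TABLE switch and
bump_nmi() are hypothetical and exist only for illustration.

```c
#include <assert.h>

#define NCPUS 4  /* hypothetical stand-in for NR_CPUS */

/* Trimmed, hypothetical stand-in for struct x8664_pda. */
struct pda {
    int cpunumber;
    unsigned nmi_count;
};

#ifdef PDA_POINTER_TABLE
/* Representation B: a table of pointers to the PDAs. */
static struct pda pda_storage[NCPUS];
static struct pda *pda_table[NCPUS] = {
    &pda_storage[0], &pda_storage[1], &pda_storage[2], &pda_storage[3]
};
#define cpu_pda(i) (pda_table[i])
#else
/* Representation A: a flat array of PDAs. */
static struct pda pda_table[NCPUS];
#define cpu_pda(i) (&pda_table[i])
#endif

/* Caller code is identical under both representations. */
static unsigned bump_nmi(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    return ++cpu_pda(cpu)->nmi_count;
}
```

Building once with and once without -DPDA_POINTER_TABLE compiles the same
bump_nmi() body against either layout.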

2005-12-02 08:23:16

by Ravikiran G Thirumalai

Subject: [patch 3/3] x86_64: Node local PDA -- allocate node local memory for pda

Patch uses a static PDA array early at boot and reallocates processor PDA
with node local memory when kmalloc is ready, just before pda_init.
The boot_cpu_pda is needed since the cpu_pda is used even before pda_init for
that cpu is called (to set the static per-cpu areas offset table etc.)

Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.15-rc3/arch/x86_64/kernel/head64.c
===================================================================
--- linux-2.6.15-rc3.orig/arch/x86_64/kernel/head64.c 2005-11-30 17:01:18.000000000 -0800
+++ linux-2.6.15-rc3/arch/x86_64/kernel/head64.c 2005-11-30 17:07:14.000000000 -0800
@@ -80,6 +80,7 @@
{
char *s;
int i;
+ extern struct x8664_pda boot_cpu_pda[];

for (i = 0; i < 256; i++)
set_intr_gate(i, early_idt_handler);
@@ -92,6 +93,9 @@
memcpy(init_level4_pgt, boot_level4_pgt, PTRS_PER_PGD*sizeof(pgd_t));
asm volatile("movq %0,%%cr3" :: "r" (__pa_symbol(&init_level4_pgt)));

+ for (i = 0; i < NR_CPUS; i++)
+ cpu_pda(i) = &boot_cpu_pda[i];
+
pda_init(0);
copy_bootdata(real_mode_data);
#ifdef CONFIG_SMP
Index: linux-2.6.15-rc3/arch/x86_64/kernel/setup64.c
===================================================================
--- linux-2.6.15-rc3.orig/arch/x86_64/kernel/setup64.c 2005-11-30 17:01:18.000000000 -0800
+++ linux-2.6.15-rc3/arch/x86_64/kernel/setup64.c 2005-12-01 13:18:21.000000000 -0800
@@ -30,7 +30,8 @@

cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE;

-struct x8664_pda _cpu_pda[NR_CPUS] __cacheline_aligned;
+struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly;
+struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;

struct desc_ptr idt_descr = { 256 * 16, (unsigned long) idt_table };

@@ -119,6 +120,23 @@
{
struct x8664_pda *pda = cpu_pda(cpu);

+ /* Allocate node local memory for AP pdas */
+ if (cpu) {
+ struct x8664_pda *newpda;
+ newpda = kmalloc_node(sizeof (struct x8664_pda), GFP_ATOMIC,
+ cpu_to_node(cpu));
+ if (newpda) {
+ printk("Allocating node local PDA for cpu %d at 0x%lx\n",
+ cpu, (unsigned long) newpda);
+ memcpy(newpda, pda, sizeof (struct x8664_pda));
+ pda = newpda;
+ cpu_pda(cpu) = pda;
+ }
+ else
+ printk("Could not allocate node local PDA for cpu %d\n",
+ cpu);
+ }
+
/* Setup up data that may be needed in __get_free_pages early */
asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0));
wrmsrl(MSR_GS_BASE, pda);
Index: linux-2.6.15-rc3/include/asm-x86_64/pda.h
===================================================================
--- linux-2.6.15-rc3.orig/include/asm-x86_64/pda.h 2005-11-30 17:01:18.000000000 -0800
+++ linux-2.6.15-rc3/include/asm-x86_64/pda.h 2005-11-30 17:07:14.000000000 -0800
@@ -27,9 +27,9 @@
#define IRQSTACK_ORDER 2
#define IRQSTACKSIZE (PAGE_SIZE << IRQSTACK_ORDER)

-extern struct x8664_pda _cpu_pda[];
+extern struct x8664_pda *_cpu_pda[];

-#define cpu_pda(i) (&_cpu_pda[i])
+#define cpu_pda(i) (_cpu_pda[i])

/*
* There is no fast way to get the base address of the PDA, all the accesses
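
The two-phase scheme in this patch (static PDAs first, then a copy into
separately allocated memory) can be sketched in user-space C. This is only an
illustration under assumptions: malloc() stands in for kmalloc_node(), so no
real node placement happens, and NCPUS, the trimmed struct pda,
early_pda_setup() and relocate_pda() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NCPUS 4  /* hypothetical stand-in for NR_CPUS */

/* Trimmed, hypothetical stand-in for struct x8664_pda. */
struct pda {
    int cpunumber;
    long data_offset;
};

static struct pda boot_pda[NCPUS];    /* usable before any allocator exists */
static struct pda *pda_table[NCPUS];  /* what cpu_pda() consults */
#define cpu_pda(i) (pda_table[i])

/* Phase 1: point every slot at its static boot-time PDA. */
static void early_pda_setup(void)
{
    for (int i = 0; i < NCPUS; i++)
        cpu_pda(i) = &boot_pda[i];
}

/*
 * Phase 2: move one secondary CPU's PDA into freshly allocated memory,
 * preserving whatever was written during early boot.
 * Returns 1 if the PDA moved, 0 if the static copy stays in use.
 */
static int relocate_pda(int cpu)
{
    struct pda *newpda;

    assert(cpu >= 0 && cpu < NCPUS);
    if (cpu == 0)
        return 0;                      /* boot CPU keeps its static PDA */

    newpda = malloc(sizeof(*newpda));  /* stand-in for kmalloc_node() */
    if (!newpda)
        return 0;                      /* fall back to the static PDA */

    memcpy(newpda, cpu_pda(cpu), sizeof(*newpda));
    cpu_pda(cpu) = newpda;
    return 1;
}
```

The memcpy() is what makes the early writes (such as data_offset) survive the
move, and the allocation-failure path leaves the CPU on its static PDA, which
matches the behaviour the patch prints a message for.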

2005-12-02 08:55:41

by Eric Dumazet

Subject: Re: [patch 3/3] x86_64: Node local PDA -- allocate node local memory for pda

Ravikiran G Thirumalai wrote:
> Patch uses a static PDA array early at boot and reallocates processor PDA
> with node local memory when kmalloc is ready, just before pda_init.
> The boot_cpu_pda is needed since the cpu_pda is used even before pda_init for
> that cpu is called (to set the static per-cpu areas offset table etc)
>

That sounds great.

I have only one suggestion: if the kernel is not NUMA, then maybe we should
avoid one indirection to get the pda, and avoid some code too.



include/asm-x86_64/pda.h

#if !defined(CONFIG_NUMA)
extern struct x8664_pda _cpu_pda[];
#define cpu_pda(i) (&_cpu_pda[i])
#else
extern struct x8664_pda *_cpu_pda[];
#define cpu_pda(i) (_cpu_pda[i])
#endif

arch/x86_64/kernel/setup64.c

#if !defined(CONFIG_NUMA)
struct x8664_pda _cpu_pda[NR_CPUS] __cacheline_aligned;
#else
struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly;
struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
#endif


...
#if defined(CONFIG_NUMA)
/* Allocate node local memory for AP pdas */
if (cpu) {
struct x8664_pda *newpda;
newpda = kmalloc_node(sizeof (struct x8664_pda), GFP_ATOMIC,
cpu_to_node(cpu));
if (newpda) {
printk("Allocating node local PDA for cpu %d at 0x%lx\n",
cpu, (unsigned long) newpda);
memcpy(newpda, pda, sizeof (struct x8664_pda));
pda = newpda;
cpu_pda(cpu) = pda;
}
else
printk("Could not allocate node local PDA for cpu %d\n",
cpu);
}
#endif



Eric

2005-12-02 09:06:06

by Andrew Morton

Subject: Re: [patch 3/3] x86_64: Node local PDA -- allocate node local memory for pda

Ravikiran G Thirumalai <[email protected]> wrote:
>
> --- linux-2.6.15-rc3.orig/arch/x86_64/kernel/head64.c 2005-11-30 17:01:18.000000000 -0800
> +++ linux-2.6.15-rc3/arch/x86_64/kernel/head64.c 2005-11-30 17:07:14.000000000 -0800
> @@ -80,6 +80,7 @@
> {
> char *s;
> int i;
> + extern struct x8664_pda boot_cpu_pda[];

And what happens if someone later changes the type of boot_cpu_pda?

2005-12-02 10:36:39

by Eric Dumazet

Subject: [RFC] NUMA aware kthread_create() ?

Hi

Are there any plans to make a kthread_create_on_cpu() version of
kthread_create(), so that memory allocated for thread stack/info is allocated
on the node of the target CPU?

There is a mention about kthread_create_on_cpu() in a comment in
include/linux/kthread.h, but no implementation.

The current use pattern is

p = kthread_create(ksoftirqd, hcpu, "ksoftirqd/%d", hotcpu);
if (IS_ERR(p)) { error ... }
kthread_bind(p, hotcpu);

So the thread memory is currently allocated on the node of the current cpu, i.e.
not the target cpu (hotcpu in this example).

Thank you
Eric Dumazet

2005-12-02 11:43:51

by Andi Kleen

Subject: Re: [patch 1/3] x86_64: Node local PDA -- early cpu_to_node

> +#ifdef CONFIG_ACPI_NUMA
> + /*
> + * Setup cpu_to_node using the SRAT lapcis & ACPI MADT table
> + * info.
> + */
> + for (i = 0; i < NR_CPUS; i++)
> + cpu_to_node[i] = apicid_to_node[x86_cpu_to_apicid[i]];
> +#endif

This should be in a separate function in srat.c.

And are you sure it will work with k8topology.c? It doesn't look like
that to me.

-Andi

2005-12-02 11:47:12

by Andi Kleen

Subject: Re: [patch 3/3] x86_64: Node local PDA -- allocate node local memory for pda

On Fri, Dec 02, 2005 at 12:23:09AM -0800, Ravikiran G Thirumalai wrote:
> Patch uses a static PDA array early at boot and reallocates processor PDA
> with node local memory when kmalloc is ready, just before pda_init.
> The boot_cpu_pda is needed since the cpu_pda is used even before pda_init for
> that cpu is called (to set the static per-cpu areas offset table etc)

Where is it needed? Perhaps it should be just allocated in the
CPU triggering the other CPU start instead. Then you could avoid that
or rather only define a __initdata boot_pda for the BP.

>
> Index: linux-2.6.15-rc3/arch/x86_64/kernel/head64.c
> ===================================================================
> --- linux-2.6.15-rc3.orig/arch/x86_64/kernel/head64.c 2005-11-30 17:01:18.000000000 -0800
> +++ linux-2.6.15-rc3/arch/x86_64/kernel/head64.c 2005-11-30 17:07:14.000000000 -0800
> @@ -80,6 +80,7 @@
> {
> char *s;
> int i;
> + extern struct x8664_pda boot_cpu_pda[];

externs only belong in include files.

-Andi

2005-12-02 12:32:11

by Andi Kleen

Subject: Re: [discuss] [RFC] NUMA aware kthread_create() ?

On Fri, Dec 02, 2005 at 11:36:19AM +0100, Eric Dumazet wrote:
> Hi
>
> Are there any plans to make a kthread_create_on_cpu() version of
> kthread_create(), so that memory allocated for thread stack/info is
> allocated on the node of the target CPU?

I don't know of plans. Feel free to do it, although I'm not sure
it will help very much because the stack is relatively small.

-Andi

2005-12-02 18:24:35

by Ravikiran G Thirumalai

Subject: Re: [discuss] Re: [patch 3/3] x86_64: Node local PDA -- allocate node local memory for pda

On Fri, Dec 02, 2005 at 09:54:59AM +0100, Eric Dumazet wrote:
> Ravikiran G Thirumalai wrote:
> >Patch uses a static PDA array early at boot and reallocates processor PDA
> >with node local memory when kmalloc is ready, just before pda_init.
> >The boot_cpu_pda is needed since the cpu_pda is used even before pda_init for
> >that cpu is called (to set the static per-cpu areas offset table etc)
> >
>
> That sounds great.
>
> I have only one suggestion: if the kernel is not NUMA, then maybe we
> should avoid one indirection to get the pda, and avoid some code too.
>

Sure, but there is no extra indirection with the fastpath routines like
read_pda, write_pda and friends with the current patch.
Places where cpu_pda[] is accessed by the array name are not really
important -- except for the static per_cpu_offset of another cpu. But
still it might be worth it (considering that people use
per_cpu(var, smp_processor_id()), instead of __get_cpu_var in many places).
I will incorporate this.

Thanks,
Kiran

2005-12-02 20:02:41

by Ravikiran G Thirumalai

Subject: Re: [patch 3/3] x86_64: Node local PDA -- allocate node local memory for pda

On Fri, Dec 02, 2005 at 12:47:09PM +0100, Andi Kleen wrote:
> On Fri, Dec 02, 2005 at 12:23:09AM -0800, Ravikiran G Thirumalai wrote:
> > Patch uses a static PDA array early at boot and reallocates processor PDA
> > with node local memory when kmalloc is ready, just before pda_init.
> > The boot_cpu_pda is needed since the cpu_pda is used even before pda_init for
> > that cpu is called (to set the static per-cpu areas offset table etc)
>
> Where is it needed? Perhaps it should be just allocated in the
> CPU triggering the other CPU start instead. Then you could avoid that
> or rather only define a __initdata boot_pda for the BP.
>

setup_per_cpu_areas() is invoked quite early in the boot process and it
writes into the cpu_pda.data_offset field for all the cpus. I'd even tried
storing the offset table for cpus in a temporary table (which can be marked
__initdata and discarded later), but there were references
to the static per cpu areas through per_cpu macros (which need to use the
cpu_pda) even before the BP boots up and starts the secondary cpus,
resulting in early exceptions.
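
The dependency described here, that per_cpu() needs a valid offset for every
CPU before anyone touches a per-cpu variable, can be sketched in user-space C.
Everything below is a hypothetical illustration: a single arena stands in for
the static per-cpu section plus its copies, data_offset[] plays the role of
the data_offset field in each PDA, and the sketch_* functions are not kernel
interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NCPUS      2   /* hypothetical stand-in for NR_CPUS */
#define AREA_LONGS 8   /* hypothetical size of the per-cpu section, in longs */

/*
 * One arena keeps the pointer arithmetic well defined in portable C:
 * slot 0 plays the role of the static per-cpu section (the template),
 * slot cpu + 1 is that CPU's private copy.
 */
static long arena[(NCPUS + 1) * AREA_LONGS];

/* Plays the role of cpu_pda(cpu)->data_offset, counted in longs here. */
static ptrdiff_t data_offset[NCPUS];

/* Copy the template once per CPU and record how far away each copy is. */
static void sketch_setup_areas(void)
{
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        long *copy = &arena[(cpu + 1) * AREA_LONGS];

        memcpy(copy, arena, AREA_LONGS * sizeof(long));
        data_offset[cpu] = copy - arena;
    }
}

/* Rough equivalent of &per_cpu(var, cpu): template address plus offset. */
static long *sketch_per_cpu_ptr(long *var, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    return var + data_offset[cpu];
}
```

In this sketch, calling sketch_per_cpu_ptr() before sketch_setup_areas() would
make every CPU alias the template because all offsets are still zero. The
kernel failure mode reported above is different (early exceptions), but the
ordering requirement is the same: the offsets, and the storage that holds
them, must exist before the first per-cpu access.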

> >
> > Index: linux-2.6.15-rc3/arch/x86_64/kernel/head64.c
> > ===================================================================
> > --- linux-2.6.15-rc3.orig/arch/x86_64/kernel/head64.c 2005-11-30 17:01:18.000000000 -0800
> > +++ linux-2.6.15-rc3/arch/x86_64/kernel/head64.c 2005-11-30 17:07:14.000000000 -0800
> > @@ -80,6 +80,7 @@
> > {
> > char *s;
> > int i;
> > + extern struct x8664_pda boot_cpu_pda[];
>
> externs only belong in include files.

Yes, I will change this and resubmit

Thanks,
Kiran

2005-12-02 22:41:08

by Andi Kleen

Subject: Re: [patch 3/3] x86_64: Node local PDA -- allocate node local memory for pda

On Fri, Dec 02, 2005 at 12:02:34PM -0800, Ravikiran G Thirumalai wrote:
> On Fri, Dec 02, 2005 at 12:47:09PM +0100, Andi Kleen wrote:
> > On Fri, Dec 02, 2005 at 12:23:09AM -0800, Ravikiran G Thirumalai wrote:
> > > Patch uses a static PDA array early at boot and reallocates processor PDA
> > > with node local memory when kmalloc is ready, just before pda_init.
> > > The boot_cpu_pda is needed since the cpu_pda is used even before pda_init for
> > > that cpu is called (to set the static per-cpu areas offset table etc)
> >
> > Where is it needed? Perhaps it should be just allocated in the
> > CPU triggering the other CPU start instead. Then you could avoid that
> > or rather only define a __initdata boot_pda for the BP.
> >
>
> setup_per_cpu_areas() is invoked quite early in the boot process and it
> writes into the cpu_pda.data_offset field for all the cpus. I'd even tried
> storing the offset table for cpus in a temporary table (which can be marked
> __initdata and discarded later), but there were references
> to the static per cpu areas through per_cpu macros (which need to use the
> cpu_pda) even before the BP boots up and starts the secondary cpus,
> resulting in early exceptions.

Move it into smpboot.c then on the parent CPU.

-Andi

2005-12-02 22:52:07

by Ravikiran G Thirumalai

Subject: Re: [discuss] Re: [patch 1/3] x86_64: Node local PDA -- early cpu_to_node

On Fri, Dec 02, 2005 at 12:43:49PM +0100, Andi Kleen wrote:
> > +#ifdef CONFIG_ACPI_NUMA
> > + /*
> > + * Setup cpu_to_node using the SRAT lapcis & ACPI MADT table
> > + * info.
> > + */
> > + for (i = 0; i < NR_CPUS; i++)
> > + cpu_to_node[i] = apicid_to_node[x86_cpu_to_apicid[i]];
> > +#endif
>
> This should be in a separate function in srat.c.

OK,

>
> And are you sure it will work with k8topology.c. Doesn't look like
> that to me.

I don't have a K8 box yet :(, so I cannot confirm either way.
But I thought newer opterons need to use ACPI_NUMA instead...

<Kconfig quote>
config K8_NUMA
bool "Old style AMD Opteron NUMA detection"
depends on NUMA
default y
help
Enable K8 NUMA node topology detection. You should say Y here if
you have a multi processor AMD K8 system. This uses an old
method to read the NUMA configurtion directly from the builtin
Northbridge of Opteron. It is recommended to use X86_64_ACPI_NUMA
instead, which also takes priority if both are compiled in.
</quote>

Even if K8 detection is used, cpu_pda will have memory allocated from node0
which is not different from the current state. So this patch helps Opterons
and EM64t boxes which use ACPI_NUMA, right? Also the newer opteron boxes
and em64t NUMA boxes can now get node local memory for static per-cpu areas.

Thanks,
Kiran


2005-12-02 23:02:09

by Andi Kleen

Subject: Re: [discuss] Re: [patch 1/3] x86_64: Node local PDA -- early cpu_to_node

On Fri, Dec 02, 2005 at 02:51:56PM -0800, Ravikiran G Thirumalai wrote:
> On Fri, Dec 02, 2005 at 12:43:49PM +0100, Andi Kleen wrote:
> > > +#ifdef CONFIG_ACPI_NUMA
> > > + /*
> > > + * Setup cpu_to_node using the SRAT lapcis & ACPI MADT table
> > > + * info.
> > > + */
> > > + for (i = 0; i < NR_CPUS; i++)
> > > + cpu_to_node[i] = apicid_to_node[x86_cpu_to_apicid[i]];
> > > +#endif
> >
> > This should be in a separate function in srat.c.
>
> OK,
>
> >
> > And are you sure it will work with k8topology.c. Doesn't look like
> > that to me.
>
> I don't have a K8 box yet :(, so I cannot confirm either ways.
> But I thought newer opterons need to use ACPI_NUMA instead...

k8topology still needs to work - e.g. for LinuxBios and users which use
acpi=off and as a fallback for broken SRAT tables. You can't break it right now.

>
> <Kconfig quote>
> config K8_NUMA
> bool "Old style AMD Opteron NUMA detection"
> depends on NUMA
> default y
> help
> Enable K8 NUMA node topology detection. You should say Y here if
> you have a multi processor AMD K8 system. This uses an old
> method to read the NUMA configurtion directly from the builtin
> Northbridge of Opteron. It is recommended to use X86_64_ACPI_NUMA
> instead, which also takes priority if both are compiled in.
> </quote>
>
> Even if K8 detection is used, cpu_pda will have memory allocated from node0
> which is not different from the current state. So this patch helps Opterons
> and EM64t boxes which use ACPI_NUMA, right? Also the newer opteron boxes
> and em64t NUMA boxes can now get node local memory for static per-cpu areas.

Hmm, good point. However, I would prefer if there was no performance regression
between the two options. However, I guess it can be kept like this for now.
Just make sure to comment it well.

-Andi

2005-12-02 23:43:37

by Ravikiran G Thirumalai

Subject: Re: [discuss] Re: [patch 1/3] x86_64: Node local PDA -- early cpu_to_node

On Sat, Dec 03, 2005 at 12:02:06AM +0100, Andi Kleen wrote:
> On Fri, Dec 02, 2005 at 02:51:56PM -0800, Ravikiran G Thirumalai wrote:
> > On Fri, Dec 02, 2005 at 12:43:49PM +0100, Andi Kleen wrote:
> > > And are you sure it will work with k8topology.c. Doesn't look like
> > > that to me.
> >
> > I don't have a K8 box yet :(, so I cannot confirm either ways.
> > But I thought newer opterons need to use ACPI_NUMA instead...
>
> k8topology still needs to work - e.g. for LinuxBios and users which use
> acpi=off and as a fallback for broken SRAT tables. You can't break it right now.
>

I don't think this breaks K8 per se, because x86_cpu_to_apicid[] is set up if
ACPI is compiled in and k8topology sets up apicid_to_node[] in
k8_scan_nodes. That said, I don't know for sure as I don't have a K8 yet. If
someone can test this patch on an Opteron, compiled with
ACPI_NUMA as well as K8 (but which falls back to K8 at boot),
it will be helpful.

> >
> > Even if K8 detection is used, cpu_pda will have memory allocated from node0
> > which is not different from the current state. So this patch helps Opterons
> > and EM64t boxes which use ACPI_NUMA, right? Also the newer opteron boxes
> > and em64t NUMA boxes can now get node local memory for static per-cpu areas.
>
> Hmm good point. However i would prefer if there was no performance regression
> between the two options. However i guess it can be kept like this now.
> Just make sure to comment it well.

Sure.

Thanks,
Kiran

2005-12-02 23:48:28

by Andi Kleen

Subject: Re: [discuss] Re: [patch 1/3] x86_64: Node local PDA -- early cpu_to_node

On Fri, Dec 02, 2005 at 03:43:30PM -0800, Ravikiran G Thirumalai wrote:
> On Sat, Dec 03, 2005 at 12:02:06AM +0100, Andi Kleen wrote:
> > On Fri, Dec 02, 2005 at 02:51:56PM -0800, Ravikiran G Thirumalai wrote:
> > > On Fri, Dec 02, 2005 at 12:43:49PM +0100, Andi Kleen wrote:
> > > > And are you sure it will work with k8topology.c. Doesn't look like
> > > > that to me.
> > >
> > > I don't have a K8 box yet :(, so I cannot confirm either ways.
> > > But I thought newer opterons need to use ACPI_NUMA instead...
> >
> > k8topology still needs to work - e.g. for LinuxBios and users which use
> > acpi=off and as a fallback for broken SRAT tables. You can't break it right now.
> >
>
> I don't think this breaks K8 per-se, because x86_cpu_to_apicid[] is setup if
> acpi is compiled in and k8topology sets up apicid_to_node[] at
> k8_scan_nodes. That said, I don't know for sure as I don't have a K8 yet. If
> someone can test this patch on a opteron, compiled with
> ACPI_NUMA as well as K8, (but which falls back to K8 at boot),
> it will be helpful.

I can do that with the next patch.

-Andi