2003-08-20 22:03:51

by Andrew Theurer

Subject: CPU boot problem on 2.6.0-test3-bk8

Maybe this is already known, but just in case:
I cannot fully boot on an x440 system with 2.6.0-test3-bk8. The kernel tries
to boot more than the 16 logical processors, and after failing (no response)
on cpus 16, 17, and 18, it still thinks it has 19 cpus total. It finally
gets stuck at "checking TSC synchronization across 19 CPUs:"

Attached is the boot log. Any ideas? I'll try -test3-bk7 next

-Andrew Theurer


Attachments:
(No filename) (424.00 B)
260-test3-bk8 (19.62 kB)

2003-08-21 01:13:46

by Andrew Theurer

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Wednesday 20 August 2003 20:02, Dave Hansen wrote:
> On Wed, 2003-08-20 at 14:58, Andrew Theurer wrote:
> > Maybe this is already known, but just in case:
> > I cannot fully boot on an x440 system with 2.6.0-test3-bk8. The kernel
> > tries to boot more than the 16 logical processors, and after failing (no
> > response) on cpus 16, 17, and 18, it still thinks it has 19 cpus total.
> > It finally gets stuck at "checking TSC synchronization across 19 CPUs:"
> >
> > Attached is the boot log. Any ideas? I'll try -test3-bk7 next
>
> Can you see if it works without HT on? Did it work on plain -test3?
> My 16-way x440 with no HT boots fine on test3.

I'll try without HT to see what happens. FWIW, it boots fine with HT if I set
maxcpus=16. I am wondering if the (apicid == BAD_APICID) test is not working in
smp_boot_cpus.

-Andrew Theurer

2003-08-21 01:03:31

by Dave Hansen

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Wed, 2003-08-20 at 14:58, Andrew Theurer wrote:
> Maybe this is already known, but just in case:
> I cannot fully boot on an x440 system with 2.6.0-test3-bk8. The kernel tries
> to boot more than the 16 logical processors, and after failing (no response)
> on cpus 16, 17, and 18, it still thinks it has 19 cpus total. It finally
> gets stuck at "checking TSC synchronization across 19 CPUs:"
>
> Attached is the boot log. Any ideas? I'll try -test3-bk7 next

Can you see if it works without HT on? Did it work on plain -test3?
My 16-way x440 with no HT boots fine on test3.

--
Dave Hansen

2003-08-21 03:43:22

by Dave Hansen

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Wed, 2003-08-20 at 18:13, Andrew Theurer wrote:
> On Wednesday 20 August 2003 20:02, Dave Hansen wrote:
> > On Wed, 2003-08-20 at 14:58, Andrew Theurer wrote:
> > > Maybe this is already known, but just in case:
> > > I cannot fully boot on an x440 system with 2.6.0-test3-bk8. The kernel
> > > tries to boot more than the 16 logical processors, and after failing (no
> > > response) on cpus 16, 17, and 18, it still thinks it has 19 cpus total.
> > > It finally gets stuck at "checking TSC synchronization across 19 CPUs:"
> > >
> > > Attached is the boot log. Any ideas? I'll try -test3-bk7 next
> >
> > Can you see if it works without HT on? Did it work on plain -test3?
> > My 16-way x440 with no HT boots fine on test3.
>
> I'll try without HT to see what happens. FWIW, it boots fine with HT if I set
> maxcpus=16. I am wondering if the (apicid == BAD_APICID) test is not working in
> smp_boot_cpus.

Hmmm. This is looking like fallout from the massive wli-bomb. Here's
the loop that controls the cpu booting, before and after cpumask_t:

-	for (bit = 0; kicked < NR_CPUS && bit < BITS_PER_LONG; bit++)
+	for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++)
 		apicid = cpu_present_to_apicid(bit);

"kicked" only gets incremented for CPUs that were successfully booted,
so it doesn't help terminate the loop much. MAX_APICS is 256 on summit,
which is *MUCH* bigger than BITS_PER_LONG.
cpu_2_logical_apicid[NR_CPUS], which is referenced from
cpu_present_to_apicid(), is getting indexed all the way up to MAX_APICS,
which is bigger than NR_CPUS. Overflow. Bang. garbage != BAD_APICID :)
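
To see why "kicked" does not bound the scan, here is a toy userspace model
of the loop (a sketch, not the kernel code). NR_CPUS = 32 and a machine where
only the first 16 entries respond are assumptions for illustration, and
scan_boot_loop() and struct scan_result are made-up names. The model only
counts the lookups that would land past an NR_CPUS-sized table; it does not
perform them.

```c
#include <assert.h>

struct scan_result {
	int iterations;    /* loop bodies executed */
	int out_of_range;  /* lookups with bit >= nr_cpus */
	int kicked;        /* final value of the counter */
};

/*
 * Hypothetical model: entries 0..present-1 boot fine, everything above
 * fails to respond. Entry 0 stands for the boot CPU and is skipped.
 */
struct scan_result scan_boot_loop(int nr_cpus, int max_apics, int present)
{
	struct scan_result r = { 0, 0, 1 };  /* kicked starts at 1 (boot CPU) */
	int bit;

	assert(present >= 1 && present <= nr_cpus);

	for (bit = 0; r.kicked < nr_cpus && bit < max_apics; bit++) {
		r.iterations++;
		if (bit >= nr_cpus)
			r.out_of_range++;  /* a table[nr_cpus] would be overrun here */
		if (bit == 0)
			continue;          /* never kick the boot CPU */
		if (bit < present)
			r.kicked++;        /* only successful boots are counted */
	}
	return r;
}
```

With scan_boot_loop(32, 256, 16) the model runs all 256 iterations, 224 of
them with an index at or beyond NR_CPUS, and kicked ends at 16, so the first
loop condition never fires.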

Attached patch fixes it. We sure do have a lot of duplicate code in the
subarches. <sigh>
--
Dave Hansen


Attachments:
cpu_to_logical_apicid-fix-2.6.0-test3-bk8-0.patch (2.27 kB)

2003-08-21 14:11:00

by Andrew Theurer

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Wednesday 20 August 2003 22:42, Dave Hansen wrote:
> On Wed, 2003-08-20 at 18:13, Andrew Theurer wrote:
> > On Wednesday 20 August 2003 20:02, Dave Hansen wrote:
> > > On Wed, 2003-08-20 at 14:58, Andrew Theurer wrote:
> > > > Maybe this is already known, but just in case:
> > > > I cannot fully boot on an x440 system with 2.6.0-test3-bk8. The
> > > > kernel tries to boot more than the 16 logical processors, and after
> > > > failing (no response) on cpus 16, 17, and 18, it still thinks it has
> > > > 19 cpus total. It finally gets stuck at "checking TSC synchronization
> > > > across 19 CPUs:"
> > > >
> > > > Attached is the boot log. Any ideas? I'll try -test3-bk7 next
> > >
> > > Can you see if it works without HT on? Did it work on plain -test3?
> > > My 16-way x440 with no HT boots fine on test3.
> >
> > I'll try without HT to see what happens. FWIW, it boots fine with HT if
> > I set maxcpus=16. I am wondering if the (apicid == BAD_APICID) test is not
> > working in smp_boot_cpus.
>
> Hmmm. This is looking like fallout from the massive wli-bomb. Here's
> the loop that controls the cpu booting, before and after cpumask_t:
>
> -	for (bit = 0; kicked < NR_CPUS && bit < BITS_PER_LONG; bit++)
> +	for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++)
>  		apicid = cpu_present_to_apicid(bit);
>
> "kicked" only gets incremented for CPUs that were successfully booted,
> so it doesn't help terminate the loop much. MAX_APICS is 256 on summit,
> which is *MUCH* bigger than BITS_PER_LONG.
> cpu_2_logical_apicid[NR_CPUS] which is referenced from
> cpu_present_to_apicid() is getting referenced up to MAX_APICs, which is
> bigger than NR_CPUS. Overflow. Bang. garbage != BAD_APICID :)

Still looks like we have a problem (see attached boot log). Maybe we should
change that for loop to:

for (bit = 0; kicked < num_processors && bit < BITS_PER_LONG; bit++)

That way we only loop over the actual number of processors found in
mpparse.c. This seems to work for me.

-Andrew Theurer


Attachments:
(No filename) (1.96 kB)
260test3bk8patch1 (21.71 kB)
patch-boot-cpu.260test3bk8 (1.54 kB)

2003-08-21 15:00:14

by Dave Hansen

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, 2003-08-21 at 07:10, Andrew Theurer wrote:
> So we only loop for the actual number processors found in mpparse.c? This
> seems to work for me.

I think there's a reason it was done that way. I think your patch
breaks the visws subarch, too.

Could you mark up that loop a bit and printk a bit, so we can see which
continue you're missing?

<pasting patch lazily in email because I can't be bothered to actually copy it from the machine I'm working on>
diff -urp linux-2.6.0-test3-clean/arch/i386/kernel/smpboot.c linux-2.6.0-test3-work/arch/i386/kernel/smpboot.c
--- linux-2.6.0-test3-clean/arch/i386/kernel/smpboot.c Wed Aug 20 19:54:29 2003
+++ linux-2.6.0-test3-work/arch/i386/kernel/smpboot.c Wed Aug 20 20:19:41 2003
@@ -1020,24 +1020,30 @@ static void __init smp_boot_cpus(unsigne
 	Dprintk("CPU present map: %lx\n", physids_coerce(phys_cpu_present_map));

 	kicked = 1;
-	for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++) {
+	for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++, kicked++) {
 		apicid = cpu_present_to_apicid(bit);
 		/*
 		 * Don't even attempt to start the boot CPU!
 		 */
-		if ((apicid == boot_cpu_apicid) || (apicid == BAD_APICID))
+		printk("smp_boot_cpus() bit: %d\n", bit);
+		if ((apicid == boot_cpu_apicid) || (apicid == BAD_APICID)) {
+			printk("(apicid == boot_cpu_apicid) || (apicid == BAD_APICID)\n");
+			printk("apicid: %08lx boot_cpu_apicid: %08lx BAD_APICID: %08lx\n", apicid, boot_cpu_apicid, BAD_APICID);
 			continue;
+		}

-		if (!check_apicid_present(bit))
+		if (!check_apicid_present(bit)) {
+			printk("!check_apicid_present(bit)\n");
 			continue;
-		if (max_cpus <= cpucount+1)
+		}
+		if (max_cpus <= cpucount+1) {
+			printk("(max_cpus <= cpucount+1)\n");
 			continue;
+		}

 		if (do_boot_cpu(apicid))
 			printk("CPU #%d not responding - cannot use it.\n",
 								apicid);
-		else
-			++kicked;
 	}

 	/*
--
Dave Hansen

2003-08-21 15:29:38

by Dave Hansen

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, 2003-08-21 at 07:10, Andrew Theurer wrote:
> Still looks like we have a problem (see attached boot log). Maybe we should
> change that for loop to:
>
> for (bit = 0; kicked < num_processors && bit < BITS_PER_LONG; bit++)
>
> So we only loop for the actual number processors found in mpparse.c? This
> seems to work for me.

You have something else wrong too:

[dave@nighthawk temp]$ egrep -c ^CPU\[0-9\]+: 260test3bk8patch1
20

It looks like you booted 20 processors, successfully.

You have 4 "Genuine" cpus and 16 "Xeon" cpus. Are you using plain
summit, or generic arch support?

$ egrep ^CPU\[0-9\]+: 260test3bk8patch1
CPU0: Intel(R) Genuine CPU 1.50GHz stepping 01
CPU1: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU2: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU3: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU4: Intel(R) Genuine CPU 1.50GHz stepping 01
CPU5: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU6: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU7: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU8: Intel(R) Genuine CPU 1.50GHz stepping 01
CPU9: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU10: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU11: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU12: Intel(R) Genuine CPU 1.50GHz stepping 01
CPU13: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU14: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU15: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU16: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU17: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU18: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01
CPU19: Intel(R) Xeon(TM) CPU 1.50GHz stepping 01


--
Dave Hansen

2003-08-21 15:57:47

by Andrew Theurer

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thursday 21 August 2003 09:58, Dave Hansen wrote:
> On Thu, 2003-08-21 at 07:10, Andrew Theurer wrote:
> > So we only loop for the actual number processors found in mpparse.c?
> > This seems to work for me.
>
> I think there's a reason it was done that way. I think your patch
> breaks the visws subarch, too.
>
> Could you mark up that loop a bit and printk a bit, so we can see which
> continue you're missing?
>
> <pasting patch lazily in email because I can't be bothered to actually copy
> it from the machine I"m working on> diff -urp
> linux-2.6.0-test3-clean/arch/i386/kernel/smpboot.c
> linux-2.6.0-test3-work/arch/i386/kernel/smpboot.c ---
> linux-2.6.0-test3-clean/arch/i386/kernel/smpboot.c Wed Aug 20 19:54:29
> 2003 +++ linux-2.6.0-test3-work/arch/i386/kernel/smpboot.c Wed Aug 20
> 20:19:41 2003 @@ -1020,24 +1020,30 @@ static void __init
> smp_boot_cpus(unsigne
> Dprintk("CPU present map: %lx\n",
> physids_coerce(phys_cpu_present_map));
>
> kicked = 1;
> - for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++) {
> + for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++, kicked++)

This patch (plus your first one) seems to work. Perhaps the addition of
kicked++ above helped? Attached is the boot log.



Attachments:
(No filename) (1.24 kB)
dmesg (34.95 kB)

2003-08-21 16:11:35

by Dave Hansen

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, 2003-08-21 at 08:56, Andrew Theurer wrote:
> On Thursday 21 August 2003 09:58, Dave Hansen wrote:
> > On Thu, 2003-08-21 at 07:10, Andrew Theurer wrote:
> > > So we only loop for the actual number processors found in mpparse.c?
> > > This seems to work for me.
> >
> > I think there's a reason it was done that way. I think your patch
> > breaks the visws subarch, too.
> >
> > Could you mark up that loop a bit and printk a bit, so we can see which
> > continue you're missing?
> >
> > <pasting patch lazily in email because I can't be bothered to actually copy
> > it from the machine I"m working on> diff -urp
> > linux-2.6.0-test3-clean/arch/i386/kernel/smpboot.c
> > linux-2.6.0-test3-work/arch/i386/kernel/smpboot.c ---
> > linux-2.6.0-test3-clean/arch/i386/kernel/smpboot.c Wed Aug 20 19:54:29
> > 2003 +++ linux-2.6.0-test3-work/arch/i386/kernel/smpboot.c Wed Aug 20
> > 20:19:41 2003 @@ -1020,24 +1020,30 @@ static void __init
> > smp_boot_cpus(unsigne
> > Dprintk("CPU present map: %lx\n",
> > physids_coerce(phys_cpu_present_map));
> >
> > kicked = 1;
> > - for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++) {
> > + for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++, kicked++)
>
> This patch (plus your first one) seems to work. Perhaps the addition of
> kicked++ above helped? Attached is the boot log.

I missed that. But, it's incorrect. You're doubly incrementing kicked
in the case of CPUs that are booted correctly and getting to kicked >=
NR_CPUS a lot quicker. That's why you're booting correctly.

Secondly, we can actually boot up to NR_CPUS cpus, and we can *fail* to
boot a lot more than that. At least that's what the code is trying to
do. Whether it is "the right thing" is debatable.

--
Dave Hansen

2003-08-21 17:04:01

by Andrew Theurer

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thursday 21 August 2003 11:09, Dave Hansen wrote:
> On Thu, 2003-08-21 at 08:56, Andrew Theurer wrote:
> > On Thursday 21 August 2003 09:58, Dave Hansen wrote:
> > > On Thu, 2003-08-21 at 07:10, Andrew Theurer wrote:
> > > > So we only loop for the actual number processors found in mpparse.c?
> > > > This seems to work for me.
> > >
> > > I think there's a reason it was done that way. I think your patch
> > > breaks the visws subarch, too.
> > >
> > > Could you mark up that loop a bit and printk a bit, so we can see which
> > > continue you're missing?
> > >
> > > <pasting patch lazily in email because I can't be bothered to actually
> > > copy it from the machine I"m working on> diff -urp
> > > linux-2.6.0-test3-clean/arch/i386/kernel/smpboot.c
> > > linux-2.6.0-test3-work/arch/i386/kernel/smpboot.c ---
> > > linux-2.6.0-test3-clean/arch/i386/kernel/smpboot.c Wed Aug 20 19:54:29
> > > 2003 +++ linux-2.6.0-test3-work/arch/i386/kernel/smpboot.c Wed Aug 20
> > > 20:19:41 2003 @@ -1020,24 +1020,30 @@ static void __init
> > > smp_boot_cpus(unsigne
> > > Dprintk("CPU present map: %lx\n",
> > > physids_coerce(phys_cpu_present_map));
> > >
> > > kicked = 1;
> > > - for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++) {
> > > + for (bit = 0; kicked < NR_CPUS && bit < MAX_APICS; bit++,
> > > kicked++)
> >
> > This patch (plus your first one) seems to work. Perhaps the addition of
> > kicked++ above helped? Attached is the boot log.
>
> I missed that. But, it's incorrect. You're doubly incrementing kicked
> in the case of CPUs that are booted correctly and getting to kicked >=
> NR_CPUS a lot quicker. That's why you're booting correctly.
>
> Secondly, we can actually boot up to NR_CPUS cpus, and we can *fail* to
> boot a lot more than that. At least that's what the code is trying to
> do. Whether it is "the right thing" is debatable.

Boot log with extra kicked++ removed...


Attachments:
(No filename) (1.90 kB)
dmesg2 (51.74 kB)

2003-08-21 21:05:19

by William Lee Irwin III

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, Aug 21, 2003 at 08:28:08AM -0700, Dave Hansen wrote:
> It looks like you booted 20 processors, successfully.
> You have 4 "Genuine" cpus and 16 "Xeon" cpus. Are you using plain
> summit, or generic arch support?

AFAICT the only way we can see that is if we kick the same ones twice.
Booting with maxcpus= set to the exact number of cpus you have, or
building with CONFIG_NR_CPUS= set to the exact number of cpus you have,
will let testers boot until it's fixed.

It shouldn't be too hard to find the faulty code; all 4 "Genuine"
entries are bogus and alias the entries we actually want (the Xeons).


-- wli

2003-08-21 21:15:15

by William Lee Irwin III

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, Aug 21, 2003 at 12:02:02PM -0500, Andrew Theurer wrote:
>>> This patch (plus your first one) seems to work. Perhaps the addition of
>>> kicked++ above helped? Attached is the boot log.

No, it didn't. The reasons why it "worked" were:
(a) NR_CPUS is exactly twice the number of cpus you want
(b) all of the bogus entries appear _after_ the legitimate ones
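
Both conditions can be checked in a toy userspace model of the loop (a
sketch under assumptions, not the kernel code). It assumes 16 legitimate
entries at bits 0..15 with bit 0 being the boot CPU, everything above them
failing to boot, and a stray kicked++ that runs on every iteration in
addition to the one taken on a successful boot, as Dave described.
bits_scanned() is a made-up name.

```c
#include <assert.h>

/*
 * Returns how many bits the loop visits before it terminates.
 * extra_increment != 0 models the stray kicked++ in the loop header.
 */
int bits_scanned(int nr_cpus, int max_apics, int present, int extra_increment)
{
	int kicked = 1;  /* the boot CPU is already counted */
	int bit;

	assert(present >= 1 && present <= max_apics);

	for (bit = 0; kicked < nr_cpus && bit < max_apics; bit++) {
		if (bit != 0 && bit < present)
			kicked++;  /* successful boot of a legitimate entry */
		if (extra_increment)
			kicked++;  /* runs every iteration, even after "continue" */
	}
	return bit;
}
```

With NR_CPUS = 32 and 16 legitimate entries, bits_scanned(32, 256, 16, 1)
stops after exactly 16 bits, right where the bogus entries would begin.
Raise NR_CPUS to 64 and the same call visits 48 bits, and without the extra
increment it walks all 256, which matches the point that the apparent fix
was a coincidence.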


-- wli

2003-08-21 21:33:21

by William Lee Irwin III

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, Aug 21, 2003 at 12:02:02PM -0500, Andrew Theurer wrote:
> smp_boot_cpus() bit: 80
> Booting processor 16/114 eip 2000
> Not responding.
> Unmapping cpu 16 from all nodes
> CPU #114 not responding - cannot use it.

cpu_present_to_apicid() needs a similar treatment to dhansen's prior
bits. diff incoming shortly.


-- wli

2003-08-21 22:17:23

by William Lee Irwin III

[permalink] [raw]
Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, Aug 21, 2003 at 12:02:02PM -0500, Andrew Theurer wrote:
>> smp_boot_cpus() bit: 80
>> Booting processor 16/114 eip 2000
>> Not responding.
>> Unmapping cpu 16 from all nodes
>> CPU #114 not responding - cannot use it.

On Thu, Aug 21, 2003 at 02:33:50PM -0700, William Lee Irwin III wrote:
> cpu_present_to_apicid() needs a similar treatment to dhansen's prior
> bits. diff incoming shortly.

Could one of you two try this out on a Summit machine in addition to
Dave's prior patch (or hook me up to one)?

Thanks.


-- wli


===== include/asm-i386/mach-bigsmp/mach_apic.h 1.16 vs edited =====
--- 1.16/include/asm-i386/mach-bigsmp/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-bigsmp/mach_apic.h Thu Aug 21 15:07:42 2003
@@ -86,7 +86,10 @@

 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return (int) bios_cpu_apicid[mps_cpu];
+	if (mps_cpu < NR_CPUS)
+		return (int)bios_cpu_apicid[mps_cpu];
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t apicid_to_cpu_present(int phys_apicid)
===== include/asm-i386/mach-default/mach_apic.h 1.27 vs edited =====
--- 1.27/include/asm-i386/mach-default/mach_apic.h Mon Aug 18 19:46:23 2003
+++ edited/include/asm-i386/mach-default/mach_apic.h Thu Aug 21 15:08:15 2003
@@ -83,7 +83,10 @@

 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return mps_cpu;
+	if (mps_cpu < NR_CPUS)
+		return mps_cpu;
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t apicid_to_cpu_present(int phys_apicid)
===== include/asm-i386/mach-es7000/mach_apic.h 1.3 vs edited =====
--- 1.3/include/asm-i386/mach-es7000/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-es7000/mach_apic.h Thu Aug 21 15:08:41 2003
@@ -106,8 +106,10 @@
 {
 	if (!mps_cpu)
 		return boot_cpu_physical_apicid;
-	else
+	else if (mps_cpu < NR_CPUS)
 		return (int) bios_cpu_apicid[mps_cpu];
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t apicid_to_cpu_present(int phys_apicid)
===== include/asm-i386/mach-numaq/mach_apic.h 1.22 vs edited =====
--- 1.22/include/asm-i386/mach-numaq/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-numaq/mach_apic.h Thu Aug 21 15:10:31 2003
@@ -65,9 +65,17 @@
 	return (int)cpu_2_logical_apicid[cpu];
 }

+/*
+ * Supporting over 60 cpus on NUMA-Q requires a locality-dependent
+ * cpu to APIC ID relation to properly interact with the intelligent
+ * mode of the cluster controller.
+ */
 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3));
+	if (mps_cpu < 60)
+		return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3));
+	else
+		return BAD_APICID;
 }

 static inline int generate_logical_apicid(int quad, int phys_apicid)
===== include/asm-i386/mach-summit/mach_apic.h 1.31 vs edited =====
--- 1.31/include/asm-i386/mach-summit/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-summit/mach_apic.h Thu Aug 21 15:10:57 2003
@@ -87,7 +87,10 @@

 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return (int) bios_cpu_apicid[mps_cpu];
+	if (mps_cpu < NR_CPUS)
+		return (int)bios_cpu_apicid[mps_cpu];
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t ioapic_phys_id_map(physid_mask_t phys_id_map)
===== include/asm-i386/mach-visws/mach_apic.h 1.7 vs edited =====
--- 1.7/include/asm-i386/mach-visws/mach_apic.h Wed Aug 20 22:30:10 2003
+++ edited/include/asm-i386/mach-visws/mach_apic.h Thu Aug 21 15:11:16 2003
@@ -59,7 +59,10 @@

 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return mps_cpu;
+	if (mps_cpu < NR_CPUS)
+		return mps_cpu;
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t apicid_to_cpu_present(int apicid)

2003-08-21 22:45:26

by William Lee Irwin III

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, Aug 21, 2003 at 02:33:50PM -0700, William Lee Irwin III wrote:
>> cpu_present_to_apicid() needs a similar treatment to dhansen's prior
>> bits. diff incoming shortly.

On Thu, Aug 21, 2003 at 03:17:44PM -0700, William Lee Irwin III wrote:
> Could one of you two try this out on a Summit machine in addition to
> Dave's prior patch (or hook me up to one)?

That broke sparse APIC ID's on several subarches. This is a bit less
indiscriminate about who it updates (and should replace the prior patch
wrt. sending anything upstream):


-- wli


===== include/asm-i386/mach-bigsmp/mach_apic.h 1.16 vs edited =====
--- 1.16/include/asm-i386/mach-bigsmp/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-bigsmp/mach_apic.h Thu Aug 21 15:07:42 2003
@@ -86,7 +86,10 @@

 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return (int) bios_cpu_apicid[mps_cpu];
+	if (mps_cpu < NR_CPUS)
+		return (int)bios_cpu_apicid[mps_cpu];
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t apicid_to_cpu_present(int phys_apicid)
===== include/asm-i386/mach-es7000/mach_apic.h 1.3 vs edited =====
--- 1.3/include/asm-i386/mach-es7000/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-es7000/mach_apic.h Thu Aug 21 15:08:41 2003
@@ -106,8 +106,10 @@
 {
 	if (!mps_cpu)
 		return boot_cpu_physical_apicid;
-	else
+	else if (mps_cpu < NR_CPUS)
 		return (int) bios_cpu_apicid[mps_cpu];
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t apicid_to_cpu_present(int phys_apicid)
===== include/asm-i386/mach-numaq/mach_apic.h 1.22 vs edited =====
--- 1.22/include/asm-i386/mach-numaq/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-numaq/mach_apic.h Thu Aug 21 15:10:31 2003
@@ -65,9 +65,17 @@
 	return (int)cpu_2_logical_apicid[cpu];
 }

+/*
+ * Supporting over 60 cpus on NUMA-Q requires a locality-dependent
+ * cpu to APIC ID relation to properly interact with the intelligent
+ * mode of the cluster controller.
+ */
 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3));
+	if (mps_cpu < 60)
+		return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3));
+	else
+		return BAD_APICID;
 }

 static inline int generate_logical_apicid(int quad, int phys_apicid)
===== include/asm-i386/mach-summit/mach_apic.h 1.31 vs edited =====
--- 1.31/include/asm-i386/mach-summit/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-summit/mach_apic.h Thu Aug 21 15:10:57 2003
@@ -87,7 +87,10 @@

 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return (int) bios_cpu_apicid[mps_cpu];
+	if (mps_cpu < NR_CPUS)
+		return (int)bios_cpu_apicid[mps_cpu];
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t ioapic_phys_id_map(physid_mask_t phys_id_map)

2003-08-21 23:09:54

by William Lee Irwin III

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, Aug 21, 2003 at 03:45:43PM -0700, William Lee Irwin III wrote:
> That broke sparse APIC ID's on several subarches. This is a bit less
> indiscriminate about who it updates (and should replace the prior patch
> wrt. sending anything upstream):

This must go in regardless; in the bios_cpu_apicid[] case, it would
walk off the end of bios_cpu_apicid[] and attempt to send APIC INIT
messages to garbage without this patch, and in the NUMA-Q case, it
would attempt to send NMI wakeups to destinations in the broadcast
cluster (which is harmless, but very poor form) without this patch.
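
As a rough userspace illustration of the two cases (a sketch only: the
toy_ names, the table size of 32 and the 0xFF sentinel are assumptions made
for this example, and the extra check for negative indices is not in the
patch):

```c
#include <assert.h>

#define TOY_NR_CPUS    32    /* assumed build-time CPU limit */
#define TOY_BAD_APICID 0xFF  /* assumed "no such CPU" sentinel */

/* Stand-in for the BIOS-provided table; zero-filled here. */
unsigned char toy_bios_cpu_apicid[TOY_NR_CPUS];

/* Table-backed subarches: refuse any index past the end of the table. */
int toy_table_apicid(int mps_cpu)
{
	if (mps_cpu >= 0 && mps_cpu < TOY_NR_CPUS)
		return toy_bios_cpu_apicid[mps_cpu];
	return TOY_BAD_APICID;
}

/* NUMA-Q formula from the patch, with the same cutoff at 60. */
int toy_numaq_apicid(int mps_cpu)
{
	assert(mps_cpu >= 0);
	if (mps_cpu < 60)
		return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3));
	return TOY_BAD_APICID;
}

/* Cluster nibble that the unguarded formula would produce. */
int toy_numaq_cluster(int mps_cpu)
{
	assert(mps_cpu >= 0);
	return (mps_cpu >> 2) & 0xF;
}
```

Here toy_table_apicid(32) returns the sentinel instead of reading past the
table, toy_numaq_apicid(59) gives 0xE8 (cluster 0xE), and index 60 is the
first one whose unguarded cluster nibble would be 0xF, presumably the
broadcast cluster mentioned above.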

vs. current bk as of 4:01PM PDT.

Linus, please apply.


-- wli


===== include/asm-i386/mach-bigsmp/mach_apic.h 1.16 vs edited =====
--- 1.16/include/asm-i386/mach-bigsmp/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-bigsmp/mach_apic.h Thu Aug 21 15:07:42 2003
@@ -86,7 +86,10 @@

 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return (int) bios_cpu_apicid[mps_cpu];
+	if (mps_cpu < NR_CPUS)
+		return (int)bios_cpu_apicid[mps_cpu];
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t apicid_to_cpu_present(int phys_apicid)
===== include/asm-i386/mach-es7000/mach_apic.h 1.3 vs edited =====
--- 1.3/include/asm-i386/mach-es7000/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-es7000/mach_apic.h Thu Aug 21 15:08:41 2003
@@ -106,8 +106,10 @@
 {
 	if (!mps_cpu)
 		return boot_cpu_physical_apicid;
-	else
+	else if (mps_cpu < NR_CPUS)
 		return (int) bios_cpu_apicid[mps_cpu];
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t apicid_to_cpu_present(int phys_apicid)
===== include/asm-i386/mach-numaq/mach_apic.h 1.22 vs edited =====
--- 1.22/include/asm-i386/mach-numaq/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-numaq/mach_apic.h Thu Aug 21 15:10:31 2003
@@ -65,9 +65,17 @@
 	return (int)cpu_2_logical_apicid[cpu];
 }

+/*
+ * Supporting over 60 cpus on NUMA-Q requires a locality-dependent
+ * cpu to APIC ID relation to properly interact with the intelligent
+ * mode of the cluster controller.
+ */
 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3));
+	if (mps_cpu < 60)
+		return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3));
+	else
+		return BAD_APICID;
 }

 static inline int generate_logical_apicid(int quad, int phys_apicid)
===== include/asm-i386/mach-summit/mach_apic.h 1.31 vs edited =====
--- 1.31/include/asm-i386/mach-summit/mach_apic.h Wed Aug 20 22:32:06 2003
+++ edited/include/asm-i386/mach-summit/mach_apic.h Thu Aug 21 15:10:57 2003
@@ -87,7 +87,10 @@

 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	return (int) bios_cpu_apicid[mps_cpu];
+	if (mps_cpu < NR_CPUS)
+		return (int)bios_cpu_apicid[mps_cpu];
+	else
+		return BAD_APICID;
 }

 static inline physid_mask_t ioapic_phys_id_map(physid_mask_t phys_id_map)

2003-08-22 17:15:56

by William Lee Irwin III

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Thu, Aug 21, 2003 at 12:02:02PM -0500, Andrew Theurer wrote:
> Boot log with extra kicked++ removed...

Say, could you try last night's bk snapshot and let me know how it's
doing? I threw in a necessary fix on top of Dave's last night, but I
don't know whether it's sufficient for your purposes yet.


-- wli

2003-08-22 18:17:36

by Andrew Theurer

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Friday 22 August 2003 12:16, William Lee Irwin III wrote:
> On Thu, Aug 21, 2003 at 12:02:02PM -0500, Andrew Theurer wrote:
> > Boot log with extra kicked++ removed...
>
> Say, could you try last night's bk snapshot and let me know how it's
> doing? I threw in a necessary fix on top of Dave's last night, but I
> don't know whether it's sufficient for your purposes yet.

Yes, should have it tested in a few, just backed up right now.

2003-08-22 19:11:52

by Andrew Theurer

Subject: Re: CPU boot problem on 2.6.0-test3-bk8

On Friday 22 August 2003 12:16, William Lee Irwin III wrote:
> On Thu, Aug 21, 2003 at 12:02:02PM -0500, Andrew Theurer wrote:
> > Boot log with extra kicked++ removed...
>
> Say, could you try last night's bk snapshot and let me know how it's
> doing? I threw in a necessary fix on top of Dave's last night, but I
> don't know whether it's sufficient for your purposes yet.

OK, looks like it worked fine.