Hi,
I have produced a config which doesn't boot correctly on either of my
computers with v2.6.26-rc9.
The first problem is with my laptop, which hangs on boot, see
screenshot (sorry for the bad quality, I was in a hurry):
http://folk.uio.no/vegardno/linux/DSCF3026.JPG
I figured there's something wrong with ACPI PCI irq routing as I'm
missing a message that shows up in my usual configs:
calling ide_scan_pcibus+0x0/0xf0
ICH8M: IDE controller (0x8086:0x2850 rev 0x03) at PCI slot 0000:00:1f.1
--- HERE: ---
ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 18
----
ICH8M: not 100% native mode: will probe irqs later
...and as can be seen in the screenshot, ide-cd "lost interrupt".
Unfortunately, I don't have a serial line on this machine, so it's
hard to get the full log. (Maybe I can try netconsole, it seems to be
up by the time this happens.)
(I also tried booting with acpi=off, acpi=noirq, pci=noacpi, to no avail.)
Note that this does not seem to be a regression; I get the same thing
with -rc8 as well. I'm guessing it's config-induced, here are my
config changes that triggered it:
http://folk.uio.no/vegardno/linux/config-diff
I'll make a guess for the NUMA/ACPI stuff. And this is the full
config: http://folk.uio.no/vegardno/linux/config-bisect
I decided to also try this kernel with my desktop machine. This one
boots (probably because ACPI is disabled?), but I get the following
warning:
[ 0.000000] Begin SRAT table scan....
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] WARNING: at arch/x86/mm/ioremap.c:553 early_ioremap+0x1c1/0x200()
[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.26-rc9 #201
[ 0.000000] [<c0157d6f>] warn_on_slowpath+0x4f/0x70
[ 0.000000] [<c01708db>] ? up+0x2b/0x40
[ 0.000000] [<c01586f1>] ? release_console_sem+0x1c1/0x1f0
[ 0.000000] [<c05c0132>] ? _spin_unlock_irqrestore+0x32/0x70
[ 0.000000] [<c05c0143>] ? _spin_unlock_irqrestore+0x43/0x70
[ 0.000000] [<c0158705>] ? release_console_sem+0x1d5/0x1f0
[ 0.000000] [<c0158aff>] ? vprintk+0x1cf/0x430
[ 0.000000] [<c0158b0b>] ? vprintk+0x1db/0x430
[ 0.000000] [<c07c0b24>] ? __early_set_fixmap+0xb4/0xd0
[ 0.000000] [<c07c1231>] early_ioremap+0x1c1/0x200
[ 0.000000] [<c07bd48e>] get_memcfg_from_srat+0x15e/0x6d0
[ 0.000000] [<c01314be>] ? post_set+0x4e/0x50
[ 0.000000] [<c01315fa>] ? generic_get_mtrr+0x2a/0x100
[ 0.000000] [<c07c154d>] setup_memory+0xd/0x280
[ 0.000000] [<c07b3c7e>] setup_arch+0x2ee/0x4e0
[ 0.000000] [<c0158d7b>] ? printk+0x1b/0x20
[ 0.000000] [<c07ab842>] start_kernel+0x82/0x390
[ 0.000000] [<c07ab028>] i386_start_kernel+0x8/0x10
[ 0.000000] =======================
[ 0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.000000] failed to get NUMA memory information from SRAT table
[ 0.000000] NUMA - single node, flat memory mode
What really puzzles me is the line number of the WARN -- it says
arch/x86/mm/ioremap.c:553. But this line does not contain any WARNs.
In fact, it's not even the right function. But this is a clean
v2.6.26-rc9 compile, as can be seen from the kernel version above.
There's also one a bit later:
[ 7.746346] please boot with early_ioremap_debug and report the dmesg.
[ 7.746460] ------------[ cut here ]------------
[ 7.746569] WARNING: at arch/x86/mm/ioremap.c:524
check_early_ioremap_leak+0x3f/0x50()
See full dmesg at: http://folk.uio.no/vegardno/linux/dmesg-2.6.26-rc9
With early_ioremap_debug enabled:
http://folk.uio.no/vegardno/linux/dmesg-early_ioremap_debug
My apologies if I got the recipients wrong; I picked them off the
git-log for relevant files. Please Cc in the right direction if needed
:-)
Thanks!
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
On Sunday 06 July 2008 07:47:19 am Vegard Nossum wrote:
> I have produced a config which doesn't boot correctly on either of my
> computers with v2.6.26-rc9.
>
> The first problem is with my laptop, which hangs on boot, see
> screenshot (sorry for the bad quality, I was in a hurry):
> http://folk.uio.no/vegardno/linux/DSCF3026.JPG
>
> I figured there's something wrong with ACPI PCI irq routing as I'm
> missing a message that shows up in my usual configs:
>
> calling ide_scan_pcibus+0x0/0xf0
> ICH8M: IDE controller (0x8086:0x2850 rev 0x03) at PCI slot 0000:00:1f.1
> --- HERE: ---
> ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 18
> ----
> ICH8M: not 100% native mode: will probe irqs later
>
> ...and as can be seen in the screenshot, ide-cd "lost interrupt".
> Unfortunately, I don't have a serial line on this machine, so it's
> hard to get the full log. (Maybe I can try netconsole, it seems to be
> up by the time this happens.)
If you could get netconsole working and collect the complete
console log from both the working config and the broken one,
that would help a lot.
Bjorn
On Thu, Jul 17, 2008 at 8:45 PM, Bjorn Helgaas <[email protected]> wrote:
> On Sunday 06 July 2008 07:47:19 am Vegard Nossum wrote:
>> I have produced a config which doesn't boot correctly on either of my
>> computers with v2.6.26-rc9.
>>
>> The first problem is with my laptop, which hangs on boot, see
>> screenshot (sorry for the bad quality, I was in a hurry):
>> http://folk.uio.no/vegardno/linux/DSCF3026.JPG
>>
>> I figured there's something wrong with ACPI PCI irq routing as I'm
>> missing a message that shows up in my usual configs:
>>
>> calling ide_scan_pcibus+0x0/0xf0
>> ICH8M: IDE controller (0x8086:0x2850 rev 0x03) at PCI slot 0000:00:1f.1
>> --- HERE: ---
>> ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 18
>> ----
>> ICH8M: not 100% native mode: will probe irqs later
>>
>> ...and as can be seen in the screenshot, ide-cd "lost interrupt".
>> Unfortunately, I don't have a serial line on this machine, so it's
>> hard to get the full log. (Maybe I can try netconsole, it seems to be
>> up by the time this happens.)
>
> If you could get netconsole working and collect the complete
> console log from both the working config and the broken one,
> that would help a lot.
Hah, I actually managed to get netconsole working. This is the log
that fails (v2.6.26):
http://folk.uio.no/vegardno/linux/numa-netconsole.txt
I think this looks really interesting:
[ 0.093562] ExtINT not setup in hardware but reported by MP table
[ 0.093854] ENABLING IO-APIC IRQs
[ 0.094244] ..TIMER: vector=0x31 apic1=-1 pin1=2 apic2=-1 pin2=0
[ 0.094985] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[ 0.094985] ...trying to set up timer (IRQ0) through the 8259A ...
[ 0.094985] ..... (found pin 0) ...<7>APIC error on CPU0: 00(80)
[ 0.094985] failed.
[ 0.094985] ...trying to set up timer as Virtual Wire IRQ...<7>APIC
error on CPU0: 80(80)
[ 0.105761] works.
This error would be:
7: Illegal register address
Looks like APIC was misconfigured, which might explain IRQ routing weirdness?
Will try to provide the booting one later. Now zZz...
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
On Thu, 17 Jul 2008, Vegard Nossum wrote:
> Hah, I actually managed to get netconsole working. This is the log
> that fails (v2.6.26):
>
> http://folk.uio.no/vegardno/linux/numa-netconsole.txt
>
> I think this looks really interesting:
>
> [ 0.093562] ExtINT not setup in hardware but reported by MP table
> [ 0.093854] ENABLING IO-APIC IRQs
> [ 0.094244] ..TIMER: vector=0x31 apic1=-1 pin1=2 apic2=-1 pin2=0
> [ 0.094985] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> [ 0.094985] ...trying to set up timer (IRQ0) through the 8259A ...
> [ 0.094985] ..... (found pin 0) ...<7>APIC error on CPU0: 00(80)
> [ 0.094985] failed.
> [ 0.094985] ...trying to set up timer as Virtual Wire IRQ...<7>APIC
> error on CPU0: 80(80)
> [ 0.105761] works.
>
> This error would be:
> 7: Illegal register address
>
> Looks like APIC was misconfigured, which might explain IRQ routing weirdness?
Here pin1 and pin2 have been correctly initialised, but apic1 and apic2
are both set to -1. This is an "impossible scenario." Because of that
I/O APIC code actually pokes at random registers of the local APIC which
immediately precedes I/O APIC #0 in the fixmap. Results may be arbitrary.
> Will try to provide the booting one later. Now zZz...
Your system uses an MP table and there were quite a lot of changes in its
support code recently. A bug must have crept in. Please try the appended
patch and the "apic=debug" option for both cases and the resulting logs
may shed some light.
Maciej
patch-next-2.6.26-rc9-20080711-apic-dprintk-0
diff -up --recursive --new-file linux-next-2.6.26-rc9-20080711.macro/include/asm-x86/apic.h linux-next-2.6.26-rc9-20080711/include/asm-x86/apic.h
--- linux-next-2.6.26-rc9-20080711.macro/include/asm-x86/apic.h 2008-07-16 14:50:07.000000000 +0000
+++ linux-next-2.6.26-rc9-20080711/include/asm-x86/apic.h 2008-07-18 00:33:19.000000000 +0000
@@ -10,7 +10,7 @@
#define ARCH_APICTIMER_STOPS_ON_C3 1
-#define Dprintk(x...)
+#define Dprintk printk
/*
* Debugging macros
On Fri, Jul 18, 2008 at 2:47 AM, Maciej W. Rozycki <[email protected]> wrote:
>> Looks like APIC was misconfigured, which might explain IRQ routing weirdness?
>
> Here pin1 and pin2 have been correctly initialised, but apic1 and apic2
> are both set to -1. This is an "impossible scenario." Because of that
> I/O APIC code actually pokes at random registers of the local APIC which
> immediately precedes I/O APIC #0 in the fixmap. Results may be arbitrary.
>
>> Will try to provide the booting one later. Now zZz...
>
> Your system uses an MP table and there were quite a lot of changes in its
> support code recently. A bug must have crept in. Please try the appended
> patch and the "apic=debug" option for both cases and the resulting logs
> may shed some light.
I have done this for the bad case:
http://folk.uio.no/vegardno/linux/netconsole2.txt
(I added a warning for certain APIC errors as well. Not that it adds
much in this particular case, but I think it would be good to have in
mainline anyway, to highlight these error cases in the future...)
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
* Maciej W. Rozycki <[email protected]> wrote:
> --- linux-next-2.6.26-rc9-20080711.macro/include/asm-x86/apic.h 2008-07-16 14:50:07.000000000 +0000
> +++ linux-next-2.6.26-rc9-20080711/include/asm-x86/apic.h 2008-07-18 00:33:19.000000000 +0000
> @@ -10,7 +10,7 @@
>
> #define ARCH_APICTIMER_STOPS_ON_C3 1
>
> -#define Dprintk(x...)
> +#define Dprintk printk
Maciej, i picked up this change too, as getting more debug output from
apic=debug seems more than prudent these days ;-)
Ingo
On Fri, 18 Jul 2008, Vegard Nossum wrote:
> I have done this for the bad case:
> http://folk.uio.no/vegardno/linux/netconsole2.txt
Thanks. The MP table looks broken:
Bus #4 is ISA
I/O APIC #32 Version 32 at 0xFEC00000.
Int: type 3, pol 1, trig 1, bus 4, IRQ 00, APIC ID 2, APIC INT 00
etc. (notice the different I/O APIC IDs reported), so any changes to our
code are not at fault. I suggest you ask your vendor to fix it up (did
they test it at all? -- it has no chance to work) or stick to ACPI (I can
see it has been automatically disabled for some reason).
Anyway, the piece of I/O APIC code in question should not poke at the
local APIC, so I'll prepare a fix to make it check for the APIC ID being
-1.
> (I added a warning for certain APIC errors as well. Not that it adds
> much in this particular case, but I think it would be good to have in
> mainline anyway, to highlight these error cases in the future...)
It might be useful indeed.
Maciej