2006-09-12 22:33:01

by Robin Lee Powell

Subject: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM


Please cc me on replies, as I'm not on the list.

I have some moderately old hosts that hang on boot, very early on,
with any kernel newer than 2.6.3. Important basic facts about the
box are dual opteron 244s, 16gb of RAM, and it's a 64-bit build of
the kernel. We've tried both CONFIG_MK8 and
CONFIG_GENERIC_CPU to no avail. Hell, I think I've tried just about
everything at this point.

In most cases the stopping point is:

[snip]
Initializing CPU#0
[snip]
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 0/0(1) -> Node 0 -> Core 0

(see below for complete version)

Whether that last line shows up or not is based on BIOS configs,
AFAICT.

I've been working on these for several days now, and I'm totally
stumped. I'll generate any data any of you ask for that might be
useful, and I'll be appending a lot of data here.

The kernel I'm actually trying to get working is 2.6.17.11, but I've
tried all of the following:

linux-2.6.2
linux-2.6.3
linux-2.6.4
linux-2.6.6
linux-2.6.10
linux-2.6.17.11

Only the first two work.
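
The pass/fail results above already narrow the regression to a single release step. A minimal sketch of how that window can be read off the list, assuming boot results are recorded as (version, booted) pairs; the function names are made up for illustration. Note that versions must be compared numerically, since a plain string sort would put 2.6.10 before 2.6.2.

```python
# Boot results as reported above: only 2.6.2 and 2.6.3 boot.
RESULTS = [
    ("2.6.2", True),
    ("2.6.3", True),
    ("2.6.4", False),
    ("2.6.6", False),
    ("2.6.10", False),
    ("2.6.17.11", False),
]


def version_key(version):
    """Turn '2.6.17.11' into (2, 6, 17, 11) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (last_good, first_bad) from a list of (version, booted) pairs.

    Hypothetical helper: it assumes that once a version fails, every later
    version fails too, which matches the results reported in this thread.
    """
    last_good = None
    for version, booted in sorted(results, key=lambda r: version_key(r[0])):
        if not booted:
            return last_good, version
        last_good = version
    return last_good, None


print(regression_window(RESULTS))
```

With the results listed here, the window is 2.6.3 to 2.6.4, so the changes between those two releases are the place to look.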

For all of them except 2.6.17.11, the config was just the config
from the known-working 2.6.2, make oldconfig, hold down the return
button. 2.6.17.11 I played with a bit more.

The original 2.6.2 working .config is
http://teddyb.org/~rlpowell/media/regular/lkml/orig.config.txt

The current 2.6.17.11 .config is
http://teddyb.org/~rlpowell/media/regular/lkml/current.config.txt

The full boot with acpi on is
http://teddyb.org/~rlpowell/media/regular/lkml/apci_boot.txt

The full boot with acpi off is
http://teddyb.org/~rlpowell/media/regular/lkml/noapci_boot.txt

This version is rather different, as it ends in:

HARDWARE ERROR
CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
RIP 10:<ffffffff80446e3e> {pci_conf1_read+0xbe/0xf0}
TSC 2e7932dbf8 ADDR fdfc000cfc
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check

We believe this to be spurious, as both machines of this type we've
tested showed the same error, and both of them have been running
with 2.6.2 for years.

The motherboard is, apparently, a RioWorks Rhapsody HDAMA, whatever that is.
We are not on the latest BIOS revision, but then the changes between 1.84
(which we're on) and 1.89 don't look relevant.

BIOS information is
http://teddyb.org/~rlpowell/media/regular/lkml/bios_info.txt

lspci -v (generated while up on the 2.6.2 kernel that's been running
for years) is
http://teddyb.org/~rlpowell/media/regular/lkml/lspci_v.txt

dmidecode is
http://teddyb.org/~rlpowell/media/regular/lkml/dmidecode.txt

-Robin


2006-09-14 19:05:50

by Robin Lee Powell

Subject: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

On Tue, Sep 12, 2006 at 03:32:58PM -0700, wrote:
>
> Please cc me on replies, as I'm not on the list.

Never mind; I'm watching the thread at
http://lkml.org/lkml/2006/9/12/300

> I have some moderately old hosts that hang on boot, very early on,
> with any kernel newer than 2.6.3. Important basic facts about the
> box are dual opteron 244s, 16gb of RAM, and it's a 64-bit build of
> the kernel.

This isn't just me. All the Debian kernels hang too. I've tried
all of the following:

Linux version 2.6.8-12-amd64-generic (buildd@bester) (gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)) #1 Mon Jul 17 01:12:05 UTC 2006

Linux version 2.6.8-12-amd64-k8 (buildd@bester) (gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)) #1 Mon Jul 17 01:39:03 UTC 2006

Linux version 2.6.8-12-amd64-k8-smp (buildd@bester) (gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)) #1 SMP Mon Jul 17 00:17:20 UTC 2006

The boot messages differ only in trivial ways, and they all end
with:

----------

NET: Registered protocol family 16
CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
RIP 10:<ffffffff8023a44c> {pci_conf1_read+0xac/0xe0}
TSC d189cea ADDR fdfc000cfe
CPU 0: Machine Check Exception: 7 Bank 0: b40000000000083b
RIP 10:<ffffffff8023a44c> {pci_conf1_read+0xac/0xe0}
TSC 0
Kernel panic: Uncorrected machine check


----------

This happens on 4 different machines! The only way to get them to
boot to a kernel later than 2.6.3 is "nomce".

I assert that an MCE on multiple machines that have been running
successfully on 2.6.2 for *years* is bad kernel behaviour; we have
no reason to believe that 4 different machines that happen to have
identical hardware have all developed the same hardware failure at
the same time.

Is there anything I can do here besides nomce?

Boot messages on Debian kernels from the primary host:

amd64 generic:
http://teddyb.org/~rlpowell/media/regular/lkml/amd64-generic-boot.txt

amd64 k8:
http://teddyb.org/~rlpowell/media/regular/lkml/amd64-k8-boot.txt

amd64 k8 smp:
http://teddyb.org/~rlpowell/media/regular/lkml/amd64-k8-smp-boot.txt

Boot messages from two other hosts booting Debian's 2.6.8-11 amd64
generic kernel; these hosts are used for massive report jobs in
production on a 2.6.2 kernel, so we *know* they work:

http://teddyb.org/~rlpowell/media/regular/lkml/prodbing3-boot.txt

http://teddyb.org/~rlpowell/media/regular/lkml/prodbing4-boot.txt

-Robin

2006-09-14 19:13:17

by Lee Revell

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

On Thu, 2006-09-14 at 12:05 -0700, Robin Lee Powell wrote:
> This isn't just me. All the Debian kernels hang too. I've tried
> all of the following:
>
> Linux version 2.6.8-12-amd64-generic (buildd@bester) (gcc version
> 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)) #1 Mon Jul 17 01:12:05
> UTC 2006
>
> Linux version 2.6.8-12-amd64-k8 (buildd@bester) (gcc version 3.4.4
> 20050314 (prerelease) (Debian 3.4.3-13)) #1 Mon Jul 17 01:39:03 UTC
> 2006
>
> Linux version 2.6.8-12-amd64-k8-smp (buildd@bester) (gcc version 3.4.4
> 20050314 (prerelease) (Debian 3.4.3-13)) #1 SMP Mon Jul 17 00:17:20
> UTC 2006

Have you tried a *recent* 2.6 kernel like 2.6.17 or 2.6.18-rc*?

2.6.8 is way too old to debug.

Lee

2006-09-14 19:15:58

by Robin Lee Powell

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

On Thu, Sep 14, 2006 at 03:14:08PM -0400, Lee Revell wrote:
> On Thu, 2006-09-14 at 12:05 -0700, Robin Lee Powell wrote:
> > This isn't just me. All the Debian kernels hang too. I've tried
> > all of the following:
> >
> > Linux version 2.6.8-12-amd64-generic (buildd@bester) (gcc version
> > 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)) #1 Mon Jul 17 01:12:05
> > UTC 2006
> >
> > Linux version 2.6.8-12-amd64-k8 (buildd@bester) (gcc version 3.4.4
> > 20050314 (prerelease) (Debian 3.4.3-13)) #1 Mon Jul 17 01:39:03 UTC
> > 2006
> >
> > Linux version 2.6.8-12-amd64-k8-smp (buildd@bester) (gcc version 3.4.4
> > 20050314 (prerelease) (Debian 3.4.3-13)) #1 SMP Mon Jul 17 00:17:20
> > UTC 2006
>
> Have you tried a *recent* 2.6 kernel like 2.6.17 or 2.6.18-rc*?
>
> 2.6.8 is way too old to debug.

Yes; that's what my previous post was about. See
http://lkml.org/lkml/2006/9/12/300

I was doing 2.6.17.11, which was kernel.org's latest stable at the
time I started all this.

I tried the Debian kernels just to show that it wasn't just me
screwing up my kernel configs.

These machines will not boot any kernel > 2.6.3 that I have
tried, and I've tried about 8 different ones at this point.

I noted in the release notes for 2.6.4 that the mce code was
entirely replaced; I'm suspecting that's the problem, but I have no
idea how to debug it. Whether the problem is the kernel or the
motherboard is also certainly open to debate.

-Robin

--
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/
Reason #237 To Learn Lojban: "Homonyms: Their Grate!"
Proud Supporter of the Singularity Institute - http://singinst.org/

2006-09-15 05:42:50

by Bharath Ramesh

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

Have you tried booting a newer kernel (post 2.6.13) with the boot option
mce=bootlog to see if it gets past the current failure? Try the same
with noacpi.

Bharath


2006-09-15 11:22:55

by Alan

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

Ar Iau, 2006-09-14 am 12:05 -0700, ysgrifennodd Robin Lee Powell:
> NET: Registered protocol family 16
> CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
> RIP 10:<ffffffff8023a44c> {pci_conf1_read+0xac/0xe0}
> TSC d189cea ADDR fdfc000cfe

We went to do a PCI configuration cycle and your box blew up. That's
pretty clear. Could be down to the various changes in how we do PCI
accesses tripping up a problem box, or triggering a bug.
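
For readers not familiar with what a "conf1" configuration cycle involves, here is a rough sketch of the addressing. It follows the standard PCI configuration mechanism #1 (address port 0xCF8, data port 0xCFC); the Python function names are made up for illustration, and the code only computes port numbers and address words rather than touching hardware.

```python
# Sketch of PCI configuration mechanism #1 ("conf1") addressing.
CONFIG_ADDRESS = 0xCF8  # 32-bit address port
CONFIG_DATA = 0xCFC     # 32-bit data window


def devfn(slot, func):
    """Pack a slot (0-31) and function (0-7) into the usual devfn byte."""
    return ((slot & 0x1F) << 3) | (func & 0x07)


def conf1_address(bus, dev_fn, reg):
    """Word written to port 0xCF8 to select one config-space dword."""
    for value in (bus, dev_fn, reg):
        if not 0 <= value <= 0xFF:
            raise ValueError("bus, devfn and reg must each fit in 8 bits")
    # Bit 31 enables the cycle; the low two register bits are dropped
    # because the address always selects an aligned dword.
    return 0x80000000 | (bus << 16) | (dev_fn << 8) | (reg & 0xFC)


def conf1_data_port(reg, length):
    """I/O port that is actually read for a 1-, 2- or 4-byte access."""
    if length == 1:
        return CONFIG_DATA + (reg & 3)
    if length == 2:
        return CONFIG_DATA + (reg & 2)
    if length == 4:
        return CONFIG_DATA
    raise ValueError("length must be 1, 2 or 4")


# Hypothetical example: 16-bit read of register 0x02 on bus 0, slot 1, func 0.
print(hex(conf1_address(0, devfn(1, 0), 0x02)))  # word for port 0xCF8
print(hex(conf1_data_port(0x02, 2)))             # data port that gets read
```

The low bits of the ADDR values in the reported MCEs (0xcfc and 0xcfe) coincide with these data port numbers. Assuming the upper bits of ADDR are the platform's I/O-space window, that is consistent with the fault being raised on the data read of a conf1 cycle, though it does not say which device was being probed.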

See what effect

pci=bios
pci=conf1
pci=conf2

pci=nommconf
pci=nomsi

have and report back.

What drivers do you have enabled and what pci devices are present ?

Alan

2006-09-15 17:47:07

by Robin Lee Powell

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

I didn't know about mce=bootlog. Neat. It doesn't change anything,
though. I've tried noacpi and many variants thereon; no change.

The most severe set of options I have a record of trying is:

nosmp noapic mem=512M ide=nodma apm=off acpi=off desktop showopts

but there were lots of others.

mce=nobootlog doesn't help either, for the record.

If mce=bootlog actually sticks logs somewhere I should retrieve and
show to you, please tell me; ./Documentation/x86_64/boot-options.txt
doesn't say anything about it.

-Robin



2006-09-15 17:59:17

by Alan

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

Ar Gwe, 2006-09-15 am 10:47 -0700, ysgrifennodd Robin Lee Powell:
> I didn't know about mce=bootlog. Neat. It doesn't change anything,
> though. I've tried noacpi and many variants thereon; no change.
>
> The most severe set of options I have record of trying is:
>
> nosmp noapic mem=512M ide=nodma apm=off acpi=off desktop showopts

What did the various pci= options I suggested do - anything ?

2006-09-15 18:07:17

by Robin Lee Powell

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

Ooops, I'm sorry, I didn't actually follow your instructions; all
the testing I just did was with Debian's 2.6.8-12-amd64-k8-smp. Too many
kernels on the box; I'll re-image.

Installing my own 2.6.17.11 now.

The behaviour is a bit different on my own 2.6.17.11; no idea why,
haven't compared the configs carefully yet. Instead of having the
MCE, it hangs at:

--------------------

CPU 0: aperture @ 0 size 64 MB
No AGP bridge found
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 10000000
Built 1 zonelists
Kernel command line: root=/dev/sda2 ro console=ttyS1
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Disabling vsyscall due to use of PM timer
time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
time.c: Detected 1804.148 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes)
Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes)
Memory: 16320480k/16777216k available (2584k kernel code, 324288k reserved, 1198k data, 220k init)
Calibrating delay using timer specific routine.. 3611.41 BogoMIPS (lpj=18057091)
Security Framework v1.0.0 initialized
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)

--------------------

Note that that's the *first* CPU; it doesn't even get to the second
one.

If I use acpi=off, it gets to the MCE:


---------------------

Initializing CPU#1
Calibrating delay using timer specific routine.. 3608.30 BogoMIPS (lpj=18041501)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
AMD Opteron(tm) Processor 244 stepping 0a
CPU 1: Syncing TSC to CPU 0.
CPU 1: synchronized TSC with CPU 0 (last diff 1 cycles, maxerr 1221 cycles)
Brought up 2 CPUs
testing NMI watchdog ... OK.
migration_cost=621
NET: Registered protocol family 16
PCI: Using configuration type 1
ACPI: Subsystem revision 20060127
ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI: disabled
SCSI subsystem initialized
PCI: Probing PCI hardware

HARDWARE ERROR
CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
RIP 10:<ffffffff80308e7e> {pci_conf1_read+0xbe/0xf0}
TSC 20aa61dfee ADDR fdfc000cfc
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check

---------------------

"mce=bootlog" has the same hang as the first case above.

"mce=bootlog acpi=off" has the same MCE as the second case above.
Specifically:

HARDWARE ERROR
CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
RIP 10:<ffffffff80308e7e> {pci_conf1_read+0xbe/0xf0}
TSC 1a0c706340 ADDR fdfc000cfc
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check

-Robin


2006-09-15 18:12:52

by Robin Lee Powell

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

On Fri, Sep 15, 2006 at 07:22:53PM +0100, Alan Cox wrote:
> Ar Gwe, 2006-09-15 am 10:47 -0700, ysgrifennodd Robin Lee Powell:
> > I didn't know about mce=bootlog. Neat. It doesn't change anything,
> > though. I've tried noacpi and many variants thereon; no change.
> >
> > The most severe set of options I have record of trying is:
> >
> > nosmp noapic mem=512M ide=nodma apm=off acpi=off desktop showopts
>
> What did the various pci= options I suggested do - anything ?

I'm working on it. :-)

It's a lot of options. Be with you momentarily.

-Robin



2006-09-15 18:29:20

by Robin Lee Powell

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

On Fri, Sep 15, 2006 at 12:45:42PM +0100, Alan Cox wrote:
> Ar Iau, 2006-09-14 am 12:05 -0700, ysgrifennodd Robin Lee Powell:
> > NET: Registered protocol family 16
> > CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
> > RIP 10:<ffffffff8023a44c> {pci_conf1_read+0xac/0xe0}
> > TSC d189cea ADDR fdfc000cfe
>
> We went to do a PCI configuration cycle and your box blew up.
> Thats pretty clear. Could be down to the various changes in how we
> do PCI accesses tripping up a problem box, or triggering a bug.

*nod* I'm totally on the fence about that; the company that made
these boxes (Penguin Computing) seems to have some clue issues, and
the motherboard is an Arima (sp?) HDAMA v2, which I gather is one of
the very earliest SMP Opteron boards.

Note that with the answers I give below I'm using the kernel that
hangs at:

Security Framework v1.0.0 initialized
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)

for the first CPU, and that doesn't generate an MCE unless I use
acpi=off, so I'll be doing each option twice (once with just the
option you gave, and once with it plus acpi=off).
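
The test plan described here (each pci= option once alone and once with acpi=off) can be written down as a small matrix so no combination gets skipped. This is only a sketch: the base command line is the one shown in the boot log earlier in the thread, the option list is the one suggested above, and the helper name is made up.

```python
# Base kernel command line as it appears in the 2.6.17.11 boot log.
BASE_CMDLINE = "root=/dev/sda2 ro console=ttyS1"

# The pci= options suggested earlier in the thread.
PCI_OPTIONS = ["pci=bios", "pci=conf1", "pci=conf2", "pci=nommconf", "pci=nomsi"]


def boot_matrix(base, options, toggle="acpi=off"):
    """List every command line to try: each option alone, then with the toggle."""
    cmdlines = []
    for option in options:
        cmdlines.append(f"{base} {option}")
        cmdlines.append(f"{base} {option} {toggle}")
    return cmdlines


for cmdline in boot_matrix(BASE_CMDLINE, PCI_OPTIONS):
    print(cmdline)
```

Five options with and without acpi=off gives ten boots in total.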

This is my 2.6.17.11 kernel; the Debian 2.6.8-12 kernel gets to the
MCE without any options; dunno why yet.

The MCE this kernel gives is:

HARDWARE ERROR
CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
RIP 10:<ffffffff80308e7e> {pci_conf1_read+0xbe/0xf0}
TSC 1a0c706340 ADDR fdfc000cfc
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check

> See what effect
>
> pci=bios

No effect.

> pci=conf1

No effect.

> pci=conf2

No effect without acpi=off.

With acpi=off, it gets rather farther before apparently failing to
talk to the 3-ware card:

----------------------

Brought up 2 CPUs
testing NMI watchdog ... OK.
migration_cost=629
NET: Registered protocol family 16
ACPI: Subsystem revision 20060127
ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI: disabled
SCSI subsystem initialized
PCI: System does not support PCI
PCI: System does not support PCI
PCI-DMA: Disabling AGP.
PCI-DMA: More than 4GB of RAM and no IOMMU
PCI-DMA: 32bit PCI IO may malfunction.
PCI-DMA: Disabling IOMMU.
WARNING more than 4GB of memory but IOMMU not available.
WARNING 32bit PCI may malfunction.
NET: Registered protocol family 2
IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
vga16fb: mapped to 0xffff8100000a0000
Console: switching to colour frame buffer device 80x30
fb0: VGA16 VGA frame buffer device
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 8192K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx
Adaptec aacraid driver (1.1-5[2409]-mh1)
3ware Storage Controller device driver for Linux v1.26.02.001.
3ware 9000 Storage Controller device driver for Linux v2.26.02.007.
PNP: No PS/2 controller found. Probing ports directly.
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
ip_tables: (C) 2000-2006 Netfilter Core Team
TCP bic registered
NET: Registered protocol family 8
NET: Registered protocol family 20
VFS: Cannot open root device "sda2" or unknown-block(0,0)
Please append a correct "root=" boot option
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

------------------------

> pci=nommconf

No effect.

> pci=nomsi

No effect.

> have and report back.
>
> What drivers do you have enabled

I'm not completely certain I know what you're asking there, but I
think this answers it:

http://teddyb.org/~rlpowell/media/regular/lkml/2.6.8.11.non-bi.config.txt

> and what pci devices are present ?

http://teddyb.org/~rlpowell/media/regular/lkml/lspci_v.txt

-Robin


2006-09-15 20:27:10

by Alan

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

Ar Gwe, 2006-09-15 am 11:29 -0700, ysgrifennodd Robin Lee Powell:
> > pci=conf2
>
> No effect without acpi=off.
>
> With acpi=off, it gets rather farther before apparently failing to
> talk the 3-ware card:

That's helpful. The conf2 cycles are the wrong type for the board so with
acpi=off pci=conf2 it doesn't see any PCI devices and doesn't explode. I
see nothing odd in the lspci data at all however.

You also have a lot of RAM; that shouldn't matter, but it means you hit
code paths most users don't. If you boot with mem limited to 1GB, I
assume it still blows up?

2006-09-15 20:32:05

by Robin Lee Powell

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

On Fri, Sep 15, 2006 at 09:50:39PM +0100, Alan Cox wrote:
> Ar Gwe, 2006-09-15 am 11:29 -0700, ysgrifennodd Robin Lee Powell:
> > > pci=conf2
> >
> > No effect without acpi=off.
> >
> > With acpi=off, it gets rather farther before apparently failing to
> > talk the 3-ware card:
>
> Thats helpful. The conf2 cycles are the wrong type for the board
> so with acpi=off pci=conf2 it doesn't see any PCI devices and
> doesn't explode. I see nothing odd in the lspci data at all
> however.
>
> You also have a lot of RAM, that shouldn't matter but it means you
> hit code paths most users don't. If you boot with mem limited to
> 1GB I assume it still blows up ?

I've tried mem=1023M, yes, and it still blows up. Just did acpi=off
mem=1023M to check.

-Robin

2006-09-15 23:18:55

by Robin Lee Powell

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

Found a way to get around the large RAM issue; see below.

On Fri, Sep 15, 2006 at 01:31:59PM -0700, wrote:
> On Fri, Sep 15, 2006 at 09:50:39PM +0100, Alan Cox wrote:
> >
> > You also have a lot of RAM, that shouldn't matter but it means
> > you hit code paths most users don't. If you boot with mem
> > limited to 1GB I assume it still blows up ?
>
> I've tried mem=1023M, yes, and it still blows up. Just did
> acpi=off mem=1023M to check.

I've found a server with the same hardware except only 2GiB of RAM.
The behaviour is slightly different. It restarts instead of
hanging, and the last bit is:

Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Disabling vsyscall due to use of PM timer
time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
time.c: Detected 1804.115 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Memory: 2059540k/2096576k available (2584k kernel code, 36348k reserved, 1198k data, 220k init)
Calibrating delay using timer specific routine.. 3611.41 BogoMIPS (lpj=18057088)
Security Framework v1.0.0 initialized
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
?

Note the weird character at the end; there's always one or two of
them, but that could just be our Cyclades console servers doing
something odd.

At that point, the machine reboots.

I have found no way yet to get any other behaviour; acpi=off, in
particular, doesn't give me the MCE on this box.

I've tried all your pci= options, too, with no effect.

I tried "nosmp noapic mem=512M ide=nodma apm=off acpi=off desktop
showopts".

I tried iommu=off.

I tried Debian's 2.6.8-11-amd64-generic, which on the 16GiB boxes
went straight to the MCE; it stopped at the same place, but seems to
have hung instead of rebooting. Still didn't get as far as the MCE.

Nothing seems to make a difference.

But 2.6.2 boots right up, no troubles.

Just to make sure that the machines really were the same, I pulled
lspci -v from this smaller-RAM one. They are *exactly the same*.
Right down to the IRQs. You can see them at:

16gb: http://teddyb.org/~rlpowell/media/regular/lkml/lspci_v.txt

2gb: http://teddyb.org/~rlpowell/media/regular/lkml/devnutch1-lspci_v.txt

(you can see they're not the same file, because the whitespace came
out differently :-)

Here are some BIOS options that might be relevant, just in case:

4GB Memory Hole Adjust [Auto]
4GB Memory Hole Size [128 MB]
IOMMU: [Enable]
Size: [64 MB]
Multiprocessor Specification: [1.4]
Use PCI Interrupt Entries in MP Table: [Yes]

ACPI Enabled: [Yes]
ACPI SRAT Table [Enabled]
Spread spectrum modulation [No]
Suppress Unused PCI Slot Clocks [No]

-Robin

--
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/
Reason #237 To Learn Lojban: "Homonyms: Their Grate!"
Proud Supporter of the Singularity Institute - http://singinst.org/

2006-09-18 07:50:48

by Andi Kleen

Subject: Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM

Robin Lee Powell <[email protected]> writes:
>
> This version is rather different, as it ends in:
>
> HARDWARE ERROR
> CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
> RIP 10:<ffffffff80446e3e> {pci_conf1_read+0xbe/0xf0}
> TSC 2e7932dbf8 ADDR fdfc000cfc
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor
> Kernel panic - not syncing: Uncorrected machine check

Decoded it gives

..
bus error 'local node origin, request didn't time out
data read mem transaction
i/o access, level generic'
..

It will probably boot with mce=off acpi=off pci=conf1

You got some buggy device that causes a bus timeout when its config space
is read. The old kernel most likely didn't touch it by luck.

Please add the following patch and send the whole log.
This will tell us which device has this problem.

-Andi

diff -u linux-2.6.17-hack/arch/i386/pci/direct.c-o linux-2.6.17-hack/arch/i386/pci/direct.c
--- linux-2.6.17-hack/arch/i386/pci/direct.c-o 2006-04-20 02:17:33.000000000 +0200
+++ linux-2.6.17-hack/arch/i386/pci/direct.c 2006-09-18 09:48:46.000000000 +0200
@@ -19,6 +19,9 @@
 {
 	unsigned long flags;

+	printk("conf1 read bus %x devfn %x reg %x len %u\n",
+	       bus, devfn, reg, len);
+
 	if ((bus > 255) || (devfn > 255) || (reg > 255)) {
 		*value = -1;
 		return -EINVAL;

2006-09-18 07:52:24

by Andi Kleen

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

"Bharath Ramesh" <[email protected]> writes:

> Have you tried booting newer kernel post 2.6.13 with the boot option
> mce=bootlog and see if it goes past the current failure. Try the same
> with noacpi.

Did you mean mce=off? mce=bootlog will just log the leftover MCEs
from the previous boot, but that shouldn't change anything.

-Andi

2006-09-18 18:59:45

by Robin Lee Powell

Subject: Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

On Mon, Sep 18, 2006 at 09:52:18AM +0200, Andi Kleen wrote:
> "Bharath Ramesh" <[email protected]> writes:
>
> > Have you tried booting newer kernel post 2.6.13 with the boot
> > option mce=bootlog and see if it goes past the current failure.
> > Try the same with noacpi.
>
> Did you mean mce=off? mce=bootlog will just log the leftover MCEs
> from the previous boot, but that shouldn't change anything.

mce=off allows some of the kernels with this problem (those that get
as far as an MCE) to boot. The machines with less than 16GiB of RAM
never get an MCE, though.

-Robin


2006-09-18 19:06:31

by Robin Lee Powell

Subject: Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM

On Mon, Sep 18, 2006 at 09:50:41AM +0200, Andi Kleen wrote:
> Robin Lee Powell <[email protected]> writes:
> >
> > This version is rather different, as it ends in:
> >
> > HARDWARE ERROR
> > CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
> > RIP 10:<ffffffff80446e3e> {pci_conf1_read+0xbe/0xf0}
> > TSC 2e7932dbf8 ADDR fdfc000cfc
> > This is not a software problem!
> > Run through mcelog --ascii to decode and contact your hardware vendor
> > Kernel panic - not syncing: Uncorrected machine check
>
> Decoded it gives
>
> ..
> bus error 'local node origin, request didn't time out
> data read mem transaction
> i/o access, level generic'
> ..
>
> It will probably boot with mce=off acpi=off pci=conf1

Indeed! Even on the ones that weren't having an MCE problem.

> You got some buggy device that causes a bus timeout when its
> config space is read. The old kernel most likely didn't touch it
> by luck.
>
> Please add the following patch and send the whole log. This will
> tell us which device has this problem.

OK. I'll post results in a bit.

-Robin


2006-09-18 23:58:57

by Robin Lee Powell

Subject: Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM

On Mon, Sep 18, 2006 at 09:50:41AM +0200, Andi Kleen wrote:
> Robin Lee Powell <[email protected]> writes:
> >
> > This version is rather different, as it ends in:
> >
> > HARDWARE ERROR
> > CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
> > RIP 10:<ffffffff80446e3e> {pci_conf1_read+0xbe/0xf0}
> > TSC 2e7932dbf8 ADDR fdfc000cfc
> > This is not a software problem!
> > Run through mcelog --ascii to decode and contact your hardware vendor
> > Kernel panic - not syncing: Uncorrected machine check
>
> Decoded it gives
>
> ..
> bus error 'local node origin, request didn't time out
> data read mem transaction
> i/o access, level generic'
> ..
>
> It will probably boot with mce=off acpi=off pci=conf1
>
> You got some buggy device that causes a bus timeout when its config space
> is read. The old kernel most likely didn't touch it by luck.
>
> Please add the following patch and send the whole log.
> This will tell us which device has this problem.

Done; it's at
http://teddyb.org/~rlpowell/media/regular/lkml/hacked-boot.txt

Note that I had to use "mce=off acpi=off pci=conf1" to get any of
that hack's output to show up at all; I wasn't clear whether you
intended that or not.

-Robin


2006-09-19 06:21:51

by Andi Kleen

Subject: Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM


> Done; it's at
> http://teddyb.org/~rlpowell/media/regular/lkml/hacked-boot.txt
>
> Note that I had to use "mce=off acpi=off pci=conf1" to get any of
> that hack's output to show up at all; I wasn't clear whether you
> intended that or not.

Unfortunately with mce=off we can't see which device breaks.
Can you please boot with the patch and just

acpi=off pci=conf1 ?

and send the full output?

-Andi

2006-09-19 06:28:32

by Robin Lee Powell

Subject: Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM

On Tue, Sep 19, 2006 at 08:04:14AM +0200, Andi Kleen wrote:
>
> > Done; it's at
> > http://teddyb.org/~rlpowell/media/regular/lkml/hacked-boot.txt
> >
> > Note that I had to use "mce=off acpi=off pci=conf1" to get any of
> > that hack's output to show up at all; I wasn't clear whether you
> > intended that or not.
>
> Unfortunately with mce=off we can't see which device breaks. Can
> you please boot with the patch and just
>
> acpi=off pci=conf1 ?
>
> and send the full output?

The result is a reboot in the middle of bringing up CPU#1. No
output from the patch is printed.

I've printed it below anyways.

-Robin

Bootdata ok (command line is root=/dev/sda2 ro console=ttyS1 acpi=off pci=conf1)
Linux version 2.6.17.11 (root@sv-furldb1i) (gcc version 3.3.5 (Debian 1:3.3.5-13)) #4 SMP Mon Sep 18 12:57:57 PDT 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009b000 (usable)
BIOS-e820: 000000000009b000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000cc000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ff70000 (usable)
BIOS-e820: 000000007ff70000 - 000000007ff76000 (ACPI data)
BIOS-e820: 000000007ff76000 - 000000007ff80000 (ACPI NVS)
BIOS-e820: 000000007ff80000 - 0000000080000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec00400 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
DMI present.
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: AMD Product ID: HAMMER APIC at: 0xFEE00000
Processor #0 15:5 APIC version 16
Processor #1 15:5 APIC version 16
I/O APIC #2 Version 17 at 0xFEC00000.
I/O APIC #3 Version 17 at 0xFB000000.
I/O APIC #4 Version 17 at 0xFB001000.
Setting APIC routing to flat
Processors: 2
Allocating PCI resources starting at 88000000 (gap: 80000000:7ec00000)
Checking aperture...
CPU 0: aperture @ e0000000 size 64 MB
CPU 1: aperture @ e0000000 size 64 MB
Built 1 zonelists
Kernel command line: root=/dev/sda2 ro console=ttyS1 acpi=off pci=conf1
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
time.c: Using 1.193182 MHz WALL PIT GTOD PIT/TSC timer.
time.c: Detected 1804.140 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Memory: 2059504k/2096576k available (2612k kernel code, 36384k reserved, 1205k data, 224k init)
Calibrating delay using timer specific routine.. 3614.04 BogoMIPS (lpj=18070214)
Security Framework v1.0.0 initialized
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
U?

2006-09-19 06:39:57

by Andi Kleen

Subject: Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM

On Tuesday 19 September 2006 08:28, Robin Lee Powell wrote:
> On Tue, Sep 19, 2006 at 08:04:14AM +0200, Andi Kleen wrote:
> >
> > > Done; it's at
> > > http://teddyb.org/~rlpowell/media/regular/lkml/hacked-boot.txt
> > >
> > > Note that I had to use "mce=off acpi=off pci=conf1" to get any of
> > > that hack's output to show up at all; I wasn't clear whether you
> > > intended that or not.
> >
> > Unfortunately with mce=off we can't see which device breaks. Can
> > you please boot with the patch and just
> >
> > acpi=off pci=conf1 ?
> >
> > and send the full output?
>
> The result is a reboot in the middle of bringing up CPU#1. No
> output from the patch is printed.
>
> I've printed it below anyways.


What happens when you additionally add this patch and boot with
the same options again?

-Andi


diff -u linux-2.6.17-hack/include/asm-x86_64/pci-direct.h-o linux-2.6.17-hack/include/asm-x86_64/pci-direct.h
--- linux-2.6.17-hack/include/asm-x86_64/pci-direct.h-o 2006-03-03 08:14:00.000000000 +0100
+++ linux-2.6.17-hack/include/asm-x86_64/pci-direct.h 2006-09-19 08:38:25.000000000 +0200
@@ -7,33 +7,32 @@
 /* Direct PCI access. This is used for PCI accesses in early boot before
    the PCI subsystem works. */

-#define PDprintk(x...)
+#define PDprintk(x...) printk(x)

 static inline u32 read_pci_config(u8 bus, u8 slot, u8 func, u8 offset)
 {
 	u32 v;
+	PDprintk("%x reading 4 from %x: %x\n", slot, offset, v);
 	outl(0x80000000 | (bus<<16) | (slot<<11) | (func<<8) | offset, 0xcf8);
 	v = inl(0xcfc);
-	if (v != 0xffffffff)
-		PDprintk("%x reading 4 from %x: %x\n", slot, offset, v);
 	return v;
 }

 static inline u8 read_pci_config_byte(u8 bus, u8 slot, u8 func, u8 offset)
 {
 	u8 v;
+	PDprintk("%x reading 1 from %x: %x\n", slot, offset, v);
 	outl(0x80000000 | (bus<<16) | (slot<<11) | (func<<8) | offset, 0xcf8);
 	v = inb(0xcfc + (offset&3));
-	PDprintk("%x reading 1 from %x: %x\n", slot, offset, v);
 	return v;
 }

 static inline u16 read_pci_config_16(u8 bus, u8 slot, u8 func, u8 offset)
 {
 	u16 v;
+	PDprintk("%x reading 2 from %x: %x\n", slot, offset, v);
 	outl(0x80000000 | (bus<<16) | (slot<<11) | (func<<8) | offset, 0xcf8);
 	v = inw(0xcfc + (offset&2));
-	PDprintk("%x reading 2 from %x: %x\n", slot, offset, v);
 	return v;
 }

2006-09-19 17:46:52

by Robin Lee Powell

Subject: Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM

On Tue, Sep 19, 2006 at 08:39:51AM +0200, Andi Kleen wrote:
> On Tuesday 19 September 2006 08:28, Robin Lee Powell wrote:
> > On Tue, Sep 19, 2006 at 08:04:14AM +0200, Andi Kleen wrote:
> > >
> > > > Done; it's at
> > > > http://teddyb.org/~rlpowell/media/regular/lkml/hacked-boot.txt
> > > >
> > > > Note that I had to use "mce=off acpi=off pci=conf1" to get any of
> > > > that hack's output to show up at all; I wasn't clear whether you
> > > > intended that or not.
> > >
> > > Unfortunately with mce=off we can't see which device breaks. Can
> > > you please boot with the patch and just
> > >
> > > acpi=off pci=conf1 ?
> > >
> > > and send the full output?
> >
> > The result is a reboot in the middle of bringing up CPU#1. No
> > output from the patch is printed.
> >
> > I've printed it below anyways.
>
>
> What happens when you additionally add this patch and boot with
> the same options again?

Here's acpi=off pci=conf1:

http://teddyb.org/~rlpowell/media/regular/lkml/hacked-boot-2-acpi-pci.txt

Here's nothing:

http://teddyb.org/~rlpowell/media/regular/lkml/hacked-boot-2-none.txt

Unfortunately, the machines that actually got as far as *displaying*
an MCE were the ones with 16GiB in them, and I could no longer
justify holding on to them just for kernel testing; as you can
imagine, they're a tad expensive.

-Robin


2006-09-19 21:07:47

by Robin Lee Powell

Subject: Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM

On Tue, Sep 19, 2006 at 10:46:49AM -0700, Robin Lee Powell wrote:
> Unfortunately, the machines that actually got as far as
> *displaying* an MCE were the ones with 16GiB in them, and I could
> no longer justify holding on to them just for kernel testing; as
> you can imagine, they're a tad expensive.

New developments:

1. I will have access to one of the 16GiB boxes within the next few
days.

2. The *un-released* 2.09 BIOS upgrade from Arima seems to solve
the issue.

Given that, if you'd like to just call it good, I'd totally
understand. I'm also fine with continuing to help debug this; it's
entirely up to you guys.

I *really* appreciate all the help, too, particularly from Alan and
Andi. If you guys have Amazon lists or something I can throw a
thank-you at, let me know.

(Unfortunately, upgrading the BIOS clears all the settings,
including console redirection, and we have over 200 of these
machines... Not your problem of course, just venting.)

-Robin