2003-05-28 11:17:18

by Helge Hafting

Subject: Re: 2.5.70-mm1 bootcrash, possibly IDE or RAID

William Lee Irwin III wrote:
> On Wed, May 28, 2003 at 01:14:28PM +0200, Helge Hafting wrote:
>
>>2.5.69-mm8 is fine, 2.5.67-mm1 dies before mounting anything read-write.
Argh. I meant 2.5.70-mm1. Followup to the wrong message. :-(

The early kernel boot is fine, the penguin appears,
a bunch of the usual messages scroll by too fast to read,
and then it hangs.
The kernel is UP, with preempt & devfs. All filesystems
are ext2. This kernel has no module support.

Root is on raid-1. There are two
IDE disks connected to this controller on separate cables:
00:02.5 IDE interface: Silicon Integrated Systems [SiS] 5513 [IDE]

Here's the decoded crash, written down by hand:
<stuff scrolled off screen>
bio_endio
_end_that_request_first
ide_end_request
ide_dma_intr
ide_intr
ide_dma_intr
handle_IRQ_event
do_IRQ
default_idle
default_idle
common_interrupt
default_idle
default_idle
default_idle
cpu_idle
rest_init
start_kernel
unknown_bootoption
<0>Kernel panic: fatal exception in interrupt
in interrupt - not syncing


2003-05-28 11:22:42

by William Lee Irwin III

Subject: Re: 2.5.70-mm1 bootcrash, possibly IDE or RAID

On Wed, May 28, 2003 at 01:34:16PM +0200, Helge Hafting wrote:
> Here's the decoded crash, written down by hand:
> <stuff scrolled off screen>
> bio_endio
> _end_that_request_first
> ide_end_request
> ide_dma_intr
> ide_intr
> ide_dma_intr
> handle_IRQ_event
> do_IRQ
> default_idle
> default_idle
> common_interrupt

This is unusual; I'm having trouble very close to this area. There is
a remote chance it could be the same problem.

Could you log this to serial and get the rest of the oops/BUG? If it's
where I think it is, I've been looking at end_page_writeback() and so
might have an idea or two.


-- wli

2003-05-28 22:42:15

by Helge Hafting

Subject: Re: 2.5.70-mm1 bootcrash, possibly RAID-1

On Wed, May 28, 2003 at 04:35:44AM -0700, William Lee Irwin III wrote:
>
> This is unusual; I'm having trouble very close to this area. There is
> a remote chance it could be the same problem.
>
> Could you log this to serial and get the rest of the oops/BUG? If it's
> where I think it is, I've been looking at end_page_writeback() and so
> might have an idea or two.

I tried 2.5.70-mm1 on the dual celeron at home. This one has
scsi instead of ide, so I guess it is a RAID-1 problem.
This machine has root on raid-1 too. I believe there were
several oopses in a row, I captured all of the last one
thanks to a framebuffer with a small font. Here it is:

Unable to handle kernel paging request at virtual address 8a8a8ab6
*pde=0 OOPS 0000 [#1]
EIP at put_all_bios+0x47/0x80
(edx was the register containing 8a8a8a8a)
Process swapper pid=0 threadinfo c1352000 task=c13f52d0
Call trace:
raid_end_bio_io
raid1_end_request
scsi_request_fn
bio_endio
_end_that_request_first
scsi_end_request
__wake_up
scsi_io_completion
scsi_delete_timer
sd_rw_intr
sym_wakeup_done
scsi_finish_command
scsi_softirq
timer_interrupt
do_softirq
do_IRQ
default_idle
default_idle
common_interrupt
default_idle
default_idle
default_idle
cpu_idle
printk
<0> Kernel panic:fatal exception in interrupt
in interrupt - not syncing
reboot in 300 seconds

This looks very similar to the partial trace
from the ide machine,
it had everything from _end_that_request_first
down to the three default_idles, but with ide
instead of scsi functions.

Helge Hafting
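
The comparison Helge describes can be checked mechanically. The following is a minimal sketch, not part of the original message: it takes the two hand-transcribed traces from this thread as plain lists of symbol names and reports the frames they share. The function name `common_frames` and the list names are made up for illustration, and because both traces were copied by hand (and the IDE one is missing its top), the result is only a rough indication.

```python
# Symbol names copied from the two traces posted in this thread.
# Both were transcribed by hand, so treat them as approximate.
IDE_TRACE = [
    "bio_endio", "_end_that_request_first", "ide_end_request",
    "ide_dma_intr", "ide_intr", "ide_dma_intr", "handle_IRQ_event",
    "do_IRQ", "default_idle", "default_idle", "common_interrupt",
    "default_idle", "default_idle", "default_idle", "cpu_idle",
    "rest_init", "start_kernel", "unknown_bootoption",
]

SCSI_TRACE = [
    "raid_end_bio_io", "raid1_end_request", "scsi_request_fn",
    "bio_endio", "_end_that_request_first", "scsi_end_request",
    "__wake_up", "scsi_io_completion", "scsi_delete_timer",
    "sd_rw_intr", "sym_wakeup_done", "scsi_finish_command",
    "scsi_softirq", "timer_interrupt", "do_softirq", "do_IRQ",
    "default_idle", "default_idle", "common_interrupt",
    "default_idle", "default_idle", "default_idle", "cpu_idle",
    "printk",
]


def common_frames(first, second):
    """Return the symbols present in both traces.

    The result follows the order of `first` and lists each symbol once.
    """
    in_second = set(second)
    shared = []
    for symbol in first:
        if symbol in in_second and symbol not in shared:
            shared.append(symbol)
    return shared


if __name__ == "__main__":
    for symbol in common_frames(IDE_TRACE, SCSI_TRACE):
        print(symbol)
```

The shared frames are the block-layer completion path (`bio_endio`, `_end_that_request_first`) and the interrupt and idle frames; everything that differs is driver-specific, which fits the guess that the problem is above the IDE and SCSI drivers.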

2003-05-28 23:08:51

by Andrew Morton

Subject: Re: 2.5.70-mm1 bootcrash, possibly RAID-1

Helge Hafting wrote:
>
> On Wed, May 28, 2003 at 04:35:44AM -0700, William Lee Irwin III wrote:
> >
> > This is unusual; I'm having trouble very close to this area. There is
> > a remote chance it could be the same problem.
> >
> > Could you log this to serial and get the rest of the oops/BUG? If it's
> > where I think it is, I've been looking at end_page_writeback() and so
> > might have an idea or two.
>
> I tried 2.5.70-mm1 on the dual celeron at home. This one has
> scsi instead of ide, so I guess it is a RAID-1 problem.
> This machine has root on raid-1 too. I believe there were
> several oopses in a row, I captured all of the last one
> thanks to a framebuffer with a small font. Here it is:
>
> Unable to handle kernel paging request at virtual address 8a8a8ab6
> *pde=0 OOPS 0000 [#1]
> EIP at put_all_bios+0x47/0x80
> (edx was the register containing 8a8a8a8a)
> Process swapper pid=0 threadinfo c1352000 task=c13f52d0
> Call trace:
> raid_end_bio_io
> raid1_end_request

That's POISON_BEFORE: "use of uninitialised memory", not "use of freed
memory".

I fiddled with the slab poisoning values, and shall undo that.
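
The reasoning behind that diagnosis can be sketched as arithmetic on the faulting address. This is an illustrative sketch only, not code from the kernel: the helper name `classify_fault` and the table are hypothetical. The value 0x8a for POISON_BEFORE is an assumption inferred from this thread (edx held 8a8a8a8a and the values had been changed in this -mm release); 0x5a and 0x6b are, as far as I recall, the usual slab.c values for uninitialised and freed objects.

```python
# Poison bytes and what they indicate.
# 0x8a: ASSUMED value of POISON_BEFORE in 2.5.70-mm1, inferred from the oops
#       in this thread; it is not taken from the kernel source.
# 0x5a / 0x6b: the usual POISON_BEFORE / POISON_AFTER values, from memory.
POISON_PATTERNS = {
    0x8A: "uninitialised memory (POISON_BEFORE as apparently set in 2.5.70-mm1)",
    0x5A: "uninitialised memory (usual POISON_BEFORE)",
    0x6B: "freed memory (usual POISON_AFTER)",
}


def classify_fault(addr, max_offset=0x1000):
    """Guess whether a 32-bit faulting address is a poisoned pointer.

    A pointer read from poisoned memory has the same byte repeated four
    times; dereferencing a structure member through it faults at that value
    plus a small offset. Returns (poison_byte, offset) or None.
    """
    for poison_byte in POISON_PATTERNS:
        base = int.from_bytes(bytes([poison_byte]) * 4, "big")
        offset = addr - base
        if 0 <= offset < max_offset:
            return poison_byte, offset
    return None


if __name__ == "__main__":
    hit = classify_fault(0x8A8A8AB6)  # address from the oops above
    if hit is not None:
        poison_byte, offset = hit
        print(f"poison byte 0x{poison_byte:02x}, member offset 0x{offset:x}: "
              f"{POISON_PATTERNS[poison_byte]}")
```

For the address in the oops, 0x8a8a8ab6 - 0x8a8a8a8a = 0x2c, so put_all_bios appears to have read a member at offset 0x2c through a pointer that was loaded from never-initialised slab memory.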

2003-05-28 23:17:41

by Paul Erkkila

Subject: Re: 2.5.70-mm1 bootcrash, possibly RAID-1



I'm having a similar problem here with 2.5.70. I can't
seem to get the entire stack trace, but with a
stripped-down kernel config it seems to happen around
the time MD starts working.

The machine is an Asus p4c8000 with Intel ICH5, using the IDE
part, not SATA. I'm also using /dev/md0 as my root
partition.

Hope that helps. I'm trying to find a null modem to
get a real capture ;).

-pee

Helge Hafting wrote:

>On Wed, May 28, 2003 at 04:35:44AM -0700, William Lee Irwin III wrote:
>
>
>>This is unusual; I'm having trouble very close to this area. There is
>>a remote chance it could be the same problem.
>>
>>Could you log this to serial and get the rest of the oops/BUG? If it's
>>where I think it is, I've been looking at end_page_writeback() and so
>>might have an idea or two.
>>
>>
>
>I tried 2.5.70-mm1 on the dual celeron at home. This one has
>scsi instead of ide, so I guess it is a RAID-1 problem.
>This machine has root on raid-1 too. I believe there were
>several oopses in a row, I captured all of the last one
>thanks to a framebuffer with a small font. Here it is:
>
>Unable to handle kernel paging request at virtual address 8a8a8ab6
>*pde=0 OOPS 0000 [#1]
>EIP at put_all_bios+0x47/0x80
>(edx was the register containing 8a8a8a8a)
>Process swapper pid=0 threadinfo c1352000 task=c13f52d0
>Call trace:
>raid_end_bio_io
>raid1_end_request
>scsi_request_fn
>bio_endio
>_end_that_request_first
>scsi_end_request
>__wake_up
>scsi_io_completion
>scsi_delete_timer
>sd_rw_intr
>sym_wakeup_done
>scsi_finish_command
>scsi_softirq
>timer_interrupt
>do_softirq
>do_IRQ
>default_idle
>default_idle
>common_interrupt
>default_idle
>default_idle
>default_idle
>cpu_idle
>printk
><0> Kernel panic:fatal exception in interrupt
>in interrupt - not syncing
>reboot in 300 seconds
>
>This looks very similar to the partial trace
>from the ide machine,
>it had everything from _end_that_request_first
>down to the three default_idles, but with ide
>instead of scsi functions.
>
>Helge Hafting

2003-05-29 13:09:36

by John Stoffel

Subject: Re: 2.5.70-mm1 bootcrash, possibly RAID-1


Helge> On Wed, May 28, 2003 at 04:35:44AM -0700, William Lee Irwin III wrote:
>>
>> Could you log this to serial and get the rest of the oops/BUG? If it's
>> where I think it is, I've been looking at end_page_writeback() and so
>> might have an idea or two.

Helge> I tried 2.5.70-mm1 on the dual celeron at home. This one has
Helge> scsi instead of ide, so I guess it is a RAID-1 problem.
Helge> This machine has root on raid-1 too. I believe there were
Helge> several oopses in a row, I captured all of the last one
Helge> thanks to a framebuffer with a small font. Here it is:

I've finally gotten 2.5.70-mm1 compiled and bootable on my system, but
with my /home being RAID1, I was getting crashes that looked a lot like
this as well. This was a dual PIII Xeon 550, with a mix of IDE and
SCSI drives. /home was on a pair of 18 GB SCSI disks, RAID1.

I also had problems with the new AIC7xxx driver and had to drop back
to the old one to get a boot. I think. Lots and lots of confusion
here.

John