2005-12-14 13:02:38

by Kalin KOZHUHAROV

Subject: Help track down a freezing machine

Hi, all!

You know there are weeks (or months!) when everything is just plain wrong... While still fighting
with my laptop (see "Help track a memory leak in 2.6.0..14"), now one of the semi-production machines
has started to freeze without any indication...

No Oops.
Nothing in the logs.
No response to ping, all network is dead.
Trying to log in on the console dies at random (sometimes in the middle of the login name entry),
but I can still use Alt+SysRq+{S,U,S,B}.
Sometimes no response to any keyboard press...

It is a DIY P4 machine with Asus P5GDC-V-Deluxe (i915G,LGA775), 2GB RAM, SATA WD740GD disk (using
libata).
Running (now) 2.6.14.3 with sk98lin-8.23.1.3 patched in (the in-kernel one does not recognise the
NIC) and mppe-mppc-1.3.patch (using the box to test VPNs). Software-wise it is a Gentoo machine,
running apache-2.0.54, subversion-1.2.3, bugzilla-2.20, mysql-5.0.16, pptpd-1.2.3, ppp-2.4.3 and
the latest openss{l,h}. No X, no sound, no WiFi, no USB, no NFSv4 (just 3): it is a headless
server-type box (on a KVM).

When it does die, and lately this happens 2-3 times per 24 hours, there is nothing whatsoever to
indicate the cause - just dead.

A strange thing is that after the box is restarted with Alt+SysRq+{S,U,S,B}, most of the time the
BIOS cannot detect the SATA drive, so I need to turn off the power physically.

About the NIC: there are a few posts on the net saying that Asus shipped some motherboards with
broken SPD, so they don't work with Linux. I found some kind of cryptic patch on the Asus site (for
another board) and it reported applying cleanly, but the NIC is still not recognized by the
in-kernel sk98lin at all (the flash was done after the problems began, but might it have made them
appear more frequently?).

02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 Gigabit Ethernet Controller (rev 15)
	Subsystem: ASUSTeK Computer Inc. Marvell 88E8053 Gigabit Ethernet Controller (Asus)
	Flags: bus master, fast devsel, latency 0, IRQ 17
	Memory at cfffc000 (64-bit, non-prefetchable) [size=16K]
	I/O ports at d800 [size=256]
	Expansion ROM at cffc0000 [disabled] [size=128K]
	Capabilities: [48] Power Management version 2
	Capabilities: [50] Vital Product Data
	Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
	Capabilities: [e0] Express Legacy Endpoint IRQ 0
	Capabilities: [100] Advanced Error Reporting

Got another NIC and will try it tomorrow.

Now that I get a reproducible freeze, is there anything I can do to debug the problem?
I guess the point to look at is when the kernel still responds to the keyboard, but I cannot log in.

It sounds really bad, but I put in a cron job to restart the box every 4 hours until I move its
functions off to another one... and it used to run 30+ days...

Any help is appreciated.
Kalin.

--
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|


2005-12-14 14:25:29

by Alex Riesen

Subject: Re: Help track down a freezing machine

On 12/14/05, Kalin KOZHUHAROV wrote:
> Now that I get a repetitive freeze, is there anything to debug the problem?
> I guess, the point when kernel is still responsive to keyboard, but I cannot login.

try to connect a serial console to it and press Alt+SysRq+t

2005-12-14 20:20:24

by Nigel Cunningham

Subject: Re: Help track down a freezing machine

Hi.

On Wed, 2005-12-14 at 22:55, Kalin KOZHUHAROV wrote:
> Hi, all!
>
> You know there are weeks (or months!) when everything is just plain wrong... While still fighting
> with my laptop (See "Help track a memory leak in 2.6.0..14), now one of the semi-production machines
> started to freeze without any indication...

Another suggestion is to patch your kernel with kdb or kgdb and turn the
option on when compiling. You could then do more detailed examination of
the issue.

Regards,

Nigel

2005-12-16 17:57:21

by Kalin KOZHUHAROV

Subject: Re: Help track down a freezing machine

Alex Riesen wrote:
> On 12/14/05, Kalin KOZHUHAROV wrote:
>
>>Now that I get a repetitive freeze, is there anything to debug the problem?
>>I guess, the point when kernel is still responsive to keyboard, but I cannot login.
>
>
> try to connect a serial console to it and press Alt+SysRq+t

Thank you for the suggestion, Alex.

I was always trying to avoid the serial console till now (it just seems difficult, and I DO know it
is not), and didn't even bother with the netconsole...

However, now that I have spent almost an hour trying to OCR and fix a screenshot of an oops, I am
convinced: I DO need a serial console! First thing tomorrow.

So until now, here is an oops, the first I saw in a few months, captured by my camera and then
digitally enhanced: http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.jpg

The OCRed/handwritten text ( http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.txt )
says:

Call trace:
SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
SCSI device sda: drive cache: write back
[<c01ec22f>] kobject_put+0x1f/0x30
[<c028c8fd>] scsi_end_request+0xdd/0xf0
[<c028ccae>] scsi_io_completion+0x26e/0x570
[<c011b623>] load_balance_newidle+0x43/0x110
[<c028d255>] scsi_generic_done+0x35/0x50
[<c02873ee>] scsi_finish_command+0x8e/0xd0
[<c0318dea>] schedule+0x4da/0xd50
[<c0318e1d>] schedule+0x50d/0xd50
[<c028728f>] scsi_softirq+0xdf/0x160
[<c0125836>] __do_softirq+0xd6/0xf0
[<c0125885>] do_softirq+0x35/0x40
[<c0125e35>] ksoftirqd+0x95/0xe0
[<c0125da0>] ksoftirqd+0x0/0xe0
[<c0135b9a>] kthread+0xba/0xc0
[<c0135ae0>] kthread+0x0/0xc0
[<c0101245>] kernel_thread_helper+0x5/0x10
Code: e1 08 00 89 44 24 04 89 1c 24 e8 27 b0 ff ff eb a5 90 8d 74 26 00 55 57 56
53 83 ec 08 8b 44 24 1c 89 44 24 04 8b 80 ec 00 00 00 <8b> 38 f6 80 79 01 00 00
80 0f 85 98 00 00 00 8b 47 2c 8d 6f 20
<0>Kernel panic - not syncing: Fatal exception in interrupt

Unfortunately everything was frozen (KBD too), so I couldn't scroll up to see the beginning. As you
may guess, it was not written to the disk.
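
For anyone who wants to post-process a hand-transcribed trace like the one above, here is a minimal
Python sketch that pulls the address, symbol, offset and function size out of each
"[<addr>] symbol+0xoff/0xsize" line. It is not part of the original report: the names
parse_call_trace, Frame and SAMPLE are made up for illustration, and it assumes the 2.6-era i386
trace format shown above, skipping every line that does not match (such as the interleaved SCSI
messages).

```python
import re
from typing import NamedTuple


class Frame(NamedTuple):
    address: int   # address printed between [< and >]
    symbol: str    # function the address falls in
    offset: int    # byte offset of the address inside that function
    size: int      # total size of the function in bytes


# Matches lines such as "[<c028c8fd>] scsi_end_request+0xdd/0xf0".
FRAME_RE = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_call_trace(text: str) -> list[Frame]:
    """Return one Frame per call-trace line; non-matching lines are skipped."""
    frames = []
    for match in FRAME_RE.finditer(text):
        addr, symbol, offset, size = match.groups()
        frames.append(Frame(int(addr, 16), symbol, int(offset, 16), int(size, 16)))
    return frames


# A few lines copied from the trace above, including one non-trace line.
SAMPLE = """\
SCSI device sda: drive cache: write back
[<c01ec22f>] kobject_put+0x1f/0x30
[<c028c8fd>] scsi_end_request+0xdd/0xf0
[<c0125836>] __do_softirq+0xd6/0xf0
"""

if __name__ == "__main__":
    for frame in parse_call_trace(SAMPLE):
        # address - offset gives the function start, handy for a System.map lookup.
        start = frame.address - frame.offset
        print(f"{frame.symbol:20s} start=0x{start:08x} +0x{frame.offset:x}/0x{frame.size:x}")
```

The computed start addresses can be compared with the System.map of the running kernel to check
that the OCRed symbols and addresses agree with each other.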

The oops happened on boot (after a hard power-off) and is probably related to the SATA subsystem.

The .config is available at http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char.config

Any insights?

I will be "fighting" with the machine this weekend as well and keep posting.
Removed the fcron job (to restart every 4h) and now it has been running almost 11h...

Kalin.
--
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|

2005-12-27 09:30:15

by Kalin KOZHUHAROV

Subject: Re: Help track down a freezing machine, libata?

Kalin KOZHUHAROV wrote:
> Alex Riesen wrote:
>
>>On 12/14/05, Kalin KOZHUHAROV wrote:
>>
>>
>>>Now that I get a repetitive freeze, is there anything to debug the problem?
>>>I guess, the point when kernel is still responsive to keyboard, but I cannot login.
>>
>>
>>try to connect a serial console to it and press Alt+SysRq+t
>
>
> Thank you for the suggestion, Alex.
>
> I was always trying to avoid the serial console till now (it just seems difficult, and I DO know it
> is not), and didn't even bother with the netconsole...

It is, as I have to go and buy a null modem cable... will do it.

> So until now, here is an oops, the first I saw in a few months, captured by my camera and then
> digitally enhanced: http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.jpg
>
> The OCRed/handwritten text ( http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.txt )
> says:
>
> Call trace:
> SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
> SCSI device sda: drive cache: write back
> [<c01ec22f>] kobject_put+0x1f/0x30
> [<c028c8fd>] scsi_end_request+0xdd/0xf0
> [<c028ccae>] scsi_io_completion+0x26e/0x570
> [<c011b623>] load_balance_newidle+0x43/0x110
> [<c028d255>] scsi_generic_done+0x35/0x50
> [<c02873ee>] scsi_finish_command+0x8e/0xd0
> [<c0318dea>] schedule+0x4da/0xd50
> [<c0318e1d>] schedule+0x50d/0xd50
> [<c028728f>] scsi_softirq+0xdf/0x160
> [<c0125836>] __do_softirq+0xd6/0xf0
> [<c0125885>] do_softirq+0x35/0x40
> [<c0125e35>] ksoftirqd+0x95/0xe0
> [<c0125da0>] ksoftirqd+0x0/0xe0
> [<c0135b9a>] kthread+0xba/0xc0
> [<c0135ae0>] kthread+0x0/0xc0
> [<c0101245>] kernel_thread_helper+0x5/0x10
> Code: e1 08 00 89 44 24 04 89 1c 24 e8 27 b0 ff ff eb a5 90 8d 74 26 00 55 57 56
> 53 83 ec 08 8b 44 24 1c 89 44 24 04 8b 80 ec 00 00 00 <8b> 38 f6 80 79 01 00 00
> 80 0f 85 98 00 00 00 8b 47 2c 8d 6f 20
> <0>Kernel panic - not syncing: Fatal exception in interrupt
>
> Unfortunately everything was frozen (KBD too), so I couldn't scroll up to see the beginning. As you
> may guess, it was not written to the disk.
>
> The oops happened on boot (after a hard power-off) and is probably related to the SATA subsystem.

All right, the above has become reproducible, about once every 3 boots: the system freezes when
it tries to initialize the ATA subsystem. (Still don't have a cable for the serial console, sorry.)

> The .config is available at http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char.config

Now this is 2.6.14.4 and the new config is:
http://linux.tar.bz/reports/oopses/char/2.6.14.4-K01_char.config

I added another 250GB SATA HDD and changed the PSU, but neither seems to be related to that bug.
Will try to tweak the IDE parameters in the BIOS.

Kalin.

--
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|

2006-01-08 16:28:30

by Kalin KOZHUHAROV

Subject: Re: Help track down a freezing machine, libata or hardware

OK, now this looks like a hardware failure... I booted 2.6.15 the other day and was happy to see
2 days of uptime :-) However... everything was borked, root was mounted read-only, and the
filesystem was generally screwed.

I don't have physical access to the box right now, so I will post more details tomorrow, but this is
what I got from dmesg:


ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
sd 0:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key=0xb
ASC=0x0 ASCQ=0x0
end_request: I/O error, dev sda, sector 10803956
Buffer I/O error on device sda3, logical block 362497
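
The status and error bytes in the log above can be decoded bit by bit. The following is a small
Python sketch, not taken from the kernel source. The names for the bits that are actually set
(DriveReady, DeviceFault, SeekComplete, Error, DriveStatusError) are the ones libata prints above;
the names for the other bits are labels chosen here after the classic ATA register layout and
should be treated as assumptions. It also assumes that the "tf 471" value holds the status in its
low byte and the error in the next byte, which is at least consistent with the 0x71/04 pair in the
"translated" line.

```python
# Bit masks for the ATA status register. Names of unset bits are assumed labels.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# Bit masks for the ATA error register. Names of unset bits are assumed labels.
ERROR_BITS = {
    0x80: "BadCRC",
    0x40: "UncorrectableError",
    0x20: "MediaChanged",
    0x10: "SectorIdNotFound",
    0x08: "MediaChangeRequested",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}


def decode(value: int, names: dict[int, str]) -> list[str]:
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in sorted(names.items(), reverse=True) if value & mask]


def split_tf(tf: int) -> tuple[int, int]:
    """Split a task-file value into (status, error), assuming status is the low byte."""
    return tf & 0xFF, (tf >> 8) & 0xFF


if __name__ == "__main__":
    status, error = split_tf(0x471)   # "tf 471" from the port reset lines
    print(f"status=0x{status:02x}", decode(status, STATUS_BITS))
    print(f"error=0x{error:02x}", decode(error, ERROR_BITS))
```

Read this way, every reset reports the same condition: the drive is ready but flags a device fault
together with an error, and the only error bit set is the one libata calls DriveStatusError.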


The above was repeated many times (the buffer was full, so I don't know how many). I cannot tell
when it happened, as trying to `cat /var/log/everything` resulted in an I/O error.
I also got a bunch of these:

ReiserFS: sda3: warning: clm-6006: writing inode 56216 on readonly FS
ReiserFS: sda3: warning: clm-6006: writing inode 56216 on readonly FS
ReiserFS: sda3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [25971 27180 0x0 SD]
ReiserFS: sda3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [9064 25871 0x0 SD]
ReiserFS: sda3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [25971 27180 0x0 SD]

Do you see this as a hardware problem?

The drive in question is a WD740GD (SATA, 10k RPM) and I will run the WD tools on it in a day or two
to check if it is a hardware failure.
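
As a rough cross-check before running the vendor tools, the two numbers in the dmesg output (disk
sector 10803956 and logical block 362497 on sda3) can be related to each other. The sketch below
is only an illustration: partition_start_range is a made-up helper, the 512-byte sector size comes
from the boot messages, and the 4096-byte block size is an ASSUMPTION (the usual ReiserFS default)
that is not confirmed anywhere in this thread.

```python
SECTOR_SIZE = 512   # from "512-byte hdwr sectors" in the boot messages
BLOCK_SIZE = 4096   # ASSUMPTION: default ReiserFS block size, not confirmed here


def partition_start_range(disk_sector, logical_block,
                          block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return (earliest, latest) possible start sector of the partition.

    The failing disk sector lies somewhere inside the failing filesystem
    block, so the partition start is only known to within one block.
    """
    sectors_per_block = block_size // sector_size
    first_sector_of_block = logical_block * sectors_per_block
    latest = disk_sector - first_sector_of_block
    earliest = latest - (sectors_per_block - 1)
    return earliest, latest


if __name__ == "__main__":
    earliest, latest = partition_start_range(10803956, 362497)
    print("sda3 would start between sectors", earliest, "and", latest)
    # 255 heads * 63 sectors is the classic translated geometry; this is
    # another assumption, used only as a plausibility check.
    print("latest is a multiple of 255*63:", latest % (255 * 63) == 0)
```

Under these assumptions the upper bound is an exact multiple of 255 * 63 sectors, which is where a
DOS-style partition table with that geometry would normally start a partition. That is consistent
with the 4096-byte block guess but does not prove it; comparing the result with the real start of
sda3 as reported by fdisk would settle it.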

Kalin.

--
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|

2006-01-11 00:08:07

by Esben Stien

Subject: Re: Help track down a freezing machine, libata or hardware

Kalin KOZHUHAROV writes:

> Do you see this as a hardware problem?

Yes, definitely. I wrote a nice mail to the reiserfs list titled
"Close Encounter of the Bad Block Kind". Check it out for how to get
your data before anything else happens.

--
Esben Stien is b0ef@e s a
http://www. s t n m
irc://irc. b - i . e/%23contact
[sip|iax]: e e
jid:b0ef@ n n