Date: Tue, 28 Mar 2006 12:55:42 -0500
From: Gene Heskett <gene.heskett@verizon.net>
Subject: Re: Detecting I/O error and Halting System
In-reply-to: <20060328150712.85169.qmail@web30602.mail.mud.yahoo.com>
To: linux-kernel@vger.kernel.org
Reply-to: gene.heskett@verizononline.net
Message-id: <200603281255.42705.gene.heskett@verizon.net>
Organization: Organization? Absolutely zip.
MIME-version: 1.0
Content-type: text/plain; charset=iso-8859-1
Content-transfer-encoding: 8BIT
Content-disposition: inline
References: <20060328150712.85169.qmail@web30602.mail.mud.yahoo.com>
User-Agent: KMail/1.7
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4338
Lines: 164

On Tuesday 28 March 2006 10:07, zine el abidine Hamid wrote:
>First of all, thank you for your analysis.
>
>I don't think that it's a HDD problem nor a cable
>problem because the servers are new. We have tried
>different HDD (seagate, maxtor) but it has not help
>anyway.
>It's perhaps a temperature problem but we make a lot
>tests in hard condition (high temperature)
>successfuly...
>
>One thinks that the problem comes from the VIA chipset
>VT82c686 (it's also the opinion of Dick Johnson
>(linux-os) whom advised me to try UDMA33 instead of
>UDMA66).
>
>How can I determine the problem?
>
>I want to add that the HDD seems to be disconnected
>(the BIOS can't find any drive for boot) after a
>simple reset. We must switch off the servers to get
>them work again.
>However, it takes a long time (4 mounths and more)
>before the HDD fell down. I want to work around by
>write a module which will supervise the HDD. I know
>how to write a module (I used the lkmpg guide
>(http://www.tldp.org/LDP/lkmpg/) but how can I
>shutdown Linux from inside a module...?
>
>best regards.
>
>Zine.

I take it that you are aware of a drive monitoring utility called 
smartd?  By querying the drive after a new powerup, you may be able to 
extract usefull information about its health.

>--- Alan Cox <alan@lxorguk.ukuu.org.uk> a ?crit :
>> On Llu, 2006-03-27 at 16:55 +0200, zine el abidine
>>
>> Hamid wrote:
>> > hda: status timeout: status=0xd0 { Busy }
>>
>> adapter
>>
>> > disque annonce un status busy du DMA
>>
>> If I'm reading the translation right then your hard
>> disk decided
>> it was busy and then never came back
>>
>> > Feb 12 04:46:23 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > ide0: reset: success
>>
>> So the IDE layer tried to reset it
>>
>> > Feb 12 10:22:38 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > hda: timeout waiting for DMA
>>
>> Which didnt help
>>
>> > Feb 12 10:24:47 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > ide0: reset: success
>>
>> Still trying
>>
>> > Feb 12 10:24:47 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > hda: irq timeout: status=0xd0 { Busy }
>> >
>> >
>> > Feb 12 10:24:47 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > hda: DMA disabled
>>
>> We gave up on DMA to see if PIO would help
>>
>> > Feb 12 10:24:47 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > ide0: reset timed-out, status=0x80
>> > Feb 12 10:24:47 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > hda: status timeout: status=0x80 { Busy }
>> > nouvel ?chec de reset
>> > Feb 12 10:24:47 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > hda: drive not ready for command
>> > Feb 12 10:24:47 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > ide0: reset: success
>>
>> And reset..
>>
>> > Feb 12 13:45:38 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > hda: status timeout: status=0x80 { Busy }
>> > Feb 12 13:45:38 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > hda: drive not ready for command
>> > Feb 12 13:45:38 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > ide0: reset timed-out, status=0x80
>> > Feb 12 13:45:38 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > end_request: I/O error, dev 03:02 (hda), sector
>>
>> 102263
>>
>> > Feb 12 13:45:38 porte_de_clignancourt_nds_b
>>
>> syslogd:
>> > /var/log/maillog: Input/output error
>> > Feb 12 13:45:38 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > end_request: I/O error, dev 03:02 (hda), sector
>>
>> 110720
>>
>> > Feb 12 13:45:38 porte_de_clignancourt_nds_b
>>
>> kernel:
>> > end_request: I/O error, dev 03:02 (hda), sector
>>
>> 110728
>>
>> Eventually we give up.
>>
>>
>> First thing to check would be the disk and the
>> temperature, then the
>> cabling. In particular make sure the *long* part of
>> the cable is between
>> the drive and the controller.

-- 
Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules.  I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/