2003-01-25 00:50:03

by Arindam Dey

Subject: Hard Disk Failure

Hi people,

I have subscribed to this mailing list just to ask this
question. This may not be the right place, but we are
desperate. Somebody please HELP US.

According to the Changes document in the kernel source
tree, the minimum version of e2fsprogs required for
2.4.18 and 2.4.19 is 1.25. We were making a
distribution and, due to a horrendous oversight, the
version of e2fsprogs remained at the old 1.23-2. We are
using the ext3 filesystem.
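
As an illustration of the version gap described above, here is a minimal
sketch that compares the shipped e2fsprogs version against the stated
minimum. The helper names (parse_version, meets_minimum) are hypothetical
and exist only for this example; the two version strings come from the post.

```python
import re


def parse_version(text):
    """Split a version string such as '1.23-2' into a tuple of ints: (1, 23, 2)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum.

    Tuples compare element by element, so (1, 23, 2) < (1, 25).
    """
    return parse_version(installed) >= parse_version(minimum)


MINIMUM_E2FSPROGS = "1.25"    # minimum cited in the post for 2.4.18 and 2.4.19
SHIPPED_E2FSPROGS = "1.23-2"  # version the distribution actually shipped

if __name__ == "__main__":
    print(meets_minimum(SHIPPED_E2FSPROGS, MINIMUM_E2FSPROGS))
```

A check like this only shows that the package is older than required; it
says nothing about whether the mismatch can harm a disk.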

This distribution is bundled with its own hardware, and
the hard disks in about 45% of these PCs are failing
after a period of 2-3 weeks. On reinstallation they
become OK, but after another 2-3 weeks they fail again,
and finally, after 2 months of this, the hard disk
fails COMPLETELY and no distribution can be installed
on it again.

All I want to know is how likely it is that the
above oversight in the e2fsprogs version is responsible
for the HDD failures. We are totally clueless and are
unable to replicate the problem in a controlled
environment.

Thanks in advance,

Arindam Dey

__________________________________________________
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com


2003-01-27 01:27:12

by Arindam Dey

Subject: Re: Hard Disk Failure

--- Mark Hahn <[email protected]> wrote:
> > Now this Distribution is bundled along with its own
> > Hardware and about 45% of these PC's Harddisk are
> > failing after a period of 2-3 weeks. On reinstallation
>
> not IBM DTLA's, I hope.
>
> > they become ok but again after 2-3 weeks they fail
> > again and finally after 2 months of this the Hard Disk
> > fails COMPLETELY and cannot be used again for any
>
> you need to get some actual data here, not this "completely"
> nonsense. do you mean it doesn't spin up? if so, there's
> nothing that the software (including the OS) could have done
> to cause it.
>
> > All I want to know is what is the probability that the
> > above oversight of e2fsprogs version is responsible
> > for the HDD failure thats all. Since we are totally
>
> no. e2fsprogs might cause data loss, but not
> physical damage.
>

I am using kernel 2.4.19. The output of /proc/ide/sis
is as follows:
############# /proc/ide/sis ######################################
SiS 5513 Ultra 66 chipset
--------------- Primary Channel ---------------- Secondary Channel -------------
Channel Status: On                               Off
Operation Mode: Compatible                       Compatible
Cable Type:     80 pins                          80 pins
Prefetch Count: 512                              512
Drive 0:        Postwrite Enabled                Postwrite Disabled
                Prefetch Enabled                 Prefetch Disabled
                UDMA Enabled                     UDMA Disabled
                UDMA Cycle Time    2 CLK         UDMA Cycle Time    Reserved
                Data Active Time   3 PCICLK      Data Active Time   8 PCICLK
                Data Recovery Time 1 PCICLK      Data Recovery Time 12 PCICLK
Drive 1:        Postwrite Disabled               Postwrite Disabled
                Prefetch Disabled                Prefetch Disabled
                UDMA Enabled                     UDMA Disabled
                UDMA Cycle Time    4 CLK         UDMA Cycle Time    Reserved
                Data Active Time   3 PCICLK      Data Active Time   8 PCICLK
                Data Recovery Time 1 PCICLK      Data Recovery Time 12 PCICLK

############# the output of hdparm -i /dev/hda is as follows ############
Model=ExcelStor Technology ES3230, FwRev=ES7CA25A, SerialNo=MA15HAX
Config={ Fixed }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=-66060037, LBA=yes, LBAsects=58615258
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2
AdvancedPM=no
Drive Supports : Reserved : ATA-1 ATA-2 ATA-3 ATA-4 ATA-5
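
For readers checking the numbers in the hdparm output, the sketch below
works out the capacity implied by LBAsects and by the reported CHS geometry.
It assumes 512-byte sectors, which is an assumption on my part (hdparm
prints SectSize=0 here); the function names are hypothetical.

```python
SECTOR_BYTES = 512  # assumed sector size; not stated in the hdparm output


def lba_capacity_bytes(lba_sectors, sector_bytes=SECTOR_BYTES):
    """Capacity in bytes addressed through LBA."""
    return lba_sectors * sector_bytes


def chs_sectors(cylinders, heads, sectors_per_track):
    """Number of sectors addressable through a C/H/S geometry."""
    return cylinders * heads * sectors_per_track


# Values copied from the hdparm -i output above.
lba_bytes = lba_capacity_bytes(58615258)
chs_total = chs_sectors(16383, 16, 63)

print("LBA capacity:", lba_bytes, "bytes")
print("CHS-addressable:", chs_total, "sectors,", chs_total * SECTOR_BYTES, "bytes")
```

Under that assumption the LBA figure comes to roughly 30 GB, while the
16383/16/63 geometry covers only about 8.4 GB, so the CHS values are a
compatibility placeholder rather than the drive's real layout.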

The problem is random in nature: it occurs on its own,
giving DMA errors at boot time when the hard disk is
being checked.

hda: dma_intr: bad DMA status (dma_stat=35) ;
hda: dma_intr: status=0x50 { DriveReady SeekComplete }
hda: dma_intr: bad DMA status (dma_stat=35)
hda: dma_intr: status=0x50 { DriveReady SeekComplete }
hda: dma_intr: bad DMA status (dma_stat=75)
The hexadecimal values of the status are different.
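
To make the log lines easier to read, here is a small sketch that decodes
the two kinds of status byte. The ATA status bit names follow the labels the
kernel prints (the log itself confirms 0x50 = DriveReady + SeekComplete).
Treating dma_stat as a hexadecimal copy of the bus-master IDE status
register is an assumption, and the names in BMIDE_STATUS_BITS are my own
labels for that register's bits, not kernel identifiers.

```python
# ATA status register bits, highest bit first.
ATA_STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# Bus-master IDE status register bits (assumed meaning of dma_stat).
# Bits 0x10 and 0x08 are reserved and therefore not listed.
BMIDE_STATUS_BITS = {
    0x80: "SimplexOnly",
    0x40: "Drive1DmaCapable",
    0x20: "Drive0DmaCapable",
    0x04: "Interrupt",
    0x02: "Error",
    0x01: "Active",
}


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in sorted(table.items(), reverse=True)
            if value & mask]


if __name__ == "__main__":
    print(decode(0x50, ATA_STATUS_BITS))
    print(decode(0x35, BMIDE_STATUS_BITS))
    print(decode(0x75, BMIDE_STATUS_BITS))
```

If that reading of dma_stat is right, both 0x35 and 0x75 show the Active bit
still set when the interrupt arrived, and the Error bit clear.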

I have attached the dmesg and lspci output as well.



Attachments:
dmesg (5.89 kB)
lspci (2.75 kB)

2003-01-27 02:23:56

by Brad Tilley

Subject: Re: Hard Disk Failure

> no. e2fsprogs might cause data loss, but not
> physical damage.

This reminds me of something I read once.

In his book, Takedown, Tsutomu Shimomura (forgive me if that's spelled wrong)
wrote a few short paragraphs about how he was able to move the head-arm of a
magnetic disk drive back and forth with software commands. He could tell the
head-arm to go to any cylinder on the drive, and he wondered what would happen
if he tried to send it to a cylinder that was outside the physical limits of
the drive. He told the drive (a 200-cylinder drive) to go to cylinder 4000. The
drive actually tried to go to that cylinder and caused a hardware failure in
the process.

Is it still possible for software to damage hardware in this fashion or is
hardware smarter now? Do drives know not to try and access a cylinder that is
outside their physical limits?


2003-01-27 11:00:12

by Lionel Bouton

Subject: Re: Hard Disk Failure

On Sun, Jan 26, 2003 at 09:33:11 -0500, rtilley wrote:
> Is it still possible for software to damage hardware in this fashion or is
> hardware smarter now? Do drives know not to try and access a cylinder that is
> outside their physical limits?
>

I guess not:

Recently I moved a drive from an old system to a new one. The new BIOS
couldn't be configured to use the old geometry, so I couldn't boot from the
drive (short of repartitioning the 120GiB beast).

During the migration I tried different geometry settings, and several times
the cylinder number was way above the drive's actual limit.
I could hear loud "bumps" when the drive accessed the higher cylinders.
Each time I rushed to the reset button...

This was with a SiS735 chipset and a Maxtor 4G120J6.

LB.

2003-01-27 19:48:10

by John Bradford

Subject: Re: Hard Disk Failure

> > do you mean it doesn't spin up? if so, there's nothing that the
> > software (including the OS) could have done to cause it.

Unless the firmware has become corrupted.

John

2003-01-27 19:46:18

by John Bradford

Subject: Re: Hard Disk Failure

> > no. e2fsprogs might cause data loss, but not
> > physical damage.
>
> This reminds me of something I read once.
>
> In his book, Takedown, Tsutomu Shimomura (forgive me if that's
> spelled wrong) wrote a few short paragraphs about how he was able to
> move the head-arm of a magnetic disk drive back and forth with
> software commands. He could tell the head-arm to go to any cylinder
> on the drive, he wondered what would happen if he tried to send it
> to a cylinder that was outside the physical limits of the drive. He
> told the drive (a 200 cylinder drive) to goto cylinder 4000. The
> drive actually tried to go to that cylinder and caused a hardware
> failure in the process.

It's actually possible to make some old mainframe hard disks 'walk'
across the floor, by doing various seeks across the disk :-).

> Is it still possible for software to damage hardware in this fashion
> or is hardware smarter now? Do drives know not to try and access a
> cylinder that is outside their physical limits?

Since modern hard disks are not accessed by their physical
geometries, I would imagine that it would be rare to be able to cause
physical damage to a disk by sending a reference to an out-of-range
sector. The disk has to translate the sector you send to it into
its real geometry anyway, so there should be no way to translate an
invalid sector into an invalid physical geometry location, which it
could then not seek to.
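
The translation argument above can be sketched roughly as follows. This is
a simplified model, not how any particular drive's firmware works: real
drives use zoned layouts and remapping, and the function and exception names
are hypothetical. The point is only that a range check sits in front of the
translation, so an invalid sector never becomes a seek target.

```python
class SectorOutOfRange(ValueError):
    """Raised when a requested sector lies beyond the end of the disk."""


def lba_to_chs(lba, total_sectors, heads, sectors_per_track):
    """Translate a logical block address into (cylinder, head, sector).

    The request is rejected before any geometry is computed if the
    address is outside the disk, so no out-of-range location results.
    """
    if not 0 <= lba < total_sectors:
        raise SectorOutOfRange(f"LBA {lba} is outside 0..{total_sectors - 1}")
    cylinder, remainder = divmod(lba, heads * sectors_per_track)
    head, sector_index = divmod(remainder, sectors_per_track)
    return cylinder, head, sector_index + 1  # sectors are numbered from 1


if __name__ == "__main__":
    print(lba_to_chs(1008, 58615258, 16, 63))
    try:
        lba_to_chs(58615258, 58615258, 16, 63)
    except SectorOutOfRange as exc:
        print("rejected:", exc)
```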

John

2003-01-27 21:02:11

by Richard B. Johnson

Subject: Re: Hard Disk Failure

On Mon, 27 Jan 2003, John Bradford wrote:

> > > no. e2fsprogs might cause data loss, but not
> > > physical damage.
> >
> > This reminds me of something I read once.
> >
> > In his book, Takedown, Tsutomu Shimomura (forgive me if that's
> > spelled wrong) wrote a few short paragraphs about how he was able to
> > move the head-arm of a magnetic disk drive back and forth with
> > software commands. He could tell the head-arm to go to any cylinder
> > on the drive, he wondered what would happen if he tried to send it
> > to a cylinder that was outside the physical limits of the drive. He
> > told the drive (a 200 cylinder drive) to goto cylinder 4000. The
> > drive actually tried to go to that cylinder and caused a hardware
> > failure in the process.
>
> It's actually possible to make some old mainframe hard disks 'walk'
> across the floor, by doing various seeks across the disk :-).
>
> > Is it still possible for software to damage hardware in this fashion
> > or is hardware smarter now? Do drives know not to try and access a
> > cylinder that is outside their physical limits?
>
> Since modern hard disks are not accessed by their physical
> geometries', I would imagine that it would be rare to be able to cause
> physical damage to a disk by sending a reference to an out of range
> sector. The disk has to translate the sector you send to it in to
> it's real geometry anyway, so there should be no way to translate an
> invalid sector in to an invalid physical geometry location, which it
> could then not seek to.
>
> John

In a previous life, I wrote a Disk utility called SPINOK.
It was posted on my BBS (The PROGRAM EXCHANGE) and became
the major product of a company that made a (copy, clone,
theft?) called Spin-Rite. This utility would determine
the optimum interleave for a disk drive. It would first
save the data from a track into memory, then format the
track, then write the data back.
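
For readers unfamiliar with interleave, here is a minimal sketch of how the
logical sector numbers could be laid out around a track for a given
interleave factor. It is an illustration of the general idea only, not the
code of SPINOK; the function name and the 17-sector track are assumptions
chosen for the example.

```python
def interleave_order(sectors_per_track, interleave):
    """Return the logical sector number stored in each physical slot.

    Consecutive logical sectors are placed `interleave` slots apart;
    if the target slot is already taken, the next free slot is used.
    """
    slots = [None] * sectors_per_track
    position = 0
    for logical in range(1, sectors_per_track + 1):
        while slots[position] is not None:
            position = (position + 1) % sectors_per_track
        slots[position] = logical
        position = (position + interleave) % sectors_per_track
    return slots


if __name__ == "__main__":
    print(interleave_order(17, 1))  # 1:1, sectors in plain order
    print(interleave_order(17, 2))  # 2:1, every other slot
```

With a 2:1 interleave the controller gets one sector's worth of rotation to
finish processing before the next logical sector passes under the head.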

Most disk drives in those days used the ST-506 interface,
however, when the IDE interface became popular, it was
necessary to determine the geometry of the drive before
attempting to format a track with the new interleave. This
was necessary because the track headers contain the cylinder
as well as the head and sector. If this gets written wrong,
you have the slowest drive you can imagine!

So, the utility first tries to determine the number of cylinders,
then the number of sectors per track, then the number of heads.
It does this by trying to read head 0 (all drives have head 0),
sector 1 (all drives have sector 1), cylinder 1024 (the maximum
possible in those days). It would "work backwards" to determine
the whole geometry. It used Newton's method for quick convergence
so it didn't have to try all combinations.
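
The "work backwards" search can be sketched as below. Note that this sketch
uses plain bisection rather than the Newton's method mentioned above, and it
runs against a simulated drive, not real hardware. The names
find_max_cylinder and make_fake_drive, the 0-based cylinder numbering, and
the upper bound of 1023 are all assumptions made for the example.

```python
def find_max_cylinder(can_read, upper_bound=1023):
    """Return the highest cylinder for which can_read(cylinder) is True.

    Assumes reads succeed for every cylinder up to some limit and fail
    for every cylinder above it, so bisection can be used.
    """
    low, high = 0, upper_bound
    if not can_read(low):
        raise RuntimeError("cylinder 0 is not readable")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the range always shrinks
        if can_read(mid):
            low = mid
        else:
            high = mid - 1
    return low


def make_fake_drive(cylinders):
    """Simulated drive: returns a probe function and the list of probes made."""
    probes = []

    def can_read(cylinder):
        probes.append(cylinder)
        return cylinder < cylinders

    return can_read, probes


if __name__ == "__main__":
    can_read, probes = make_fake_drive(615)
    print(find_max_cylinder(can_read), "found in", len(probes), "probes")
```

The same loop could be repeated for sectors per track and for heads, each
time holding the other two coordinates at values known to exist.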

It was possible for a disk to get "stuck" if it seeked (sought)
to a cylinder that was too high. However, a power-off reset
would fix it because, during power up, the drive always moved
the head-arm throughout the platters to minimize "stiction" and
get the heads flying.

Having mucked with many disk drives, I don't think it's possible
to kill them from "commands" unless the command is "upload firmware".
If you happen to upload Linux instead of the correct firmware, the
drive may "go penguin" and all bets are off!

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.