2002-09-26 05:46:21

by dean gaudet

Subject: SMART *causing* disk lossage?

i suspect this is a controller problem, but i thought i'd mention it
anyhow.

i've had 5 disk problems on one of my promise 100tx2 cards -- one of them
was on /dev/hdk, the other 4 on /dev/hdi... one of the 4 actually could
have been a dead disk (the drive didn't respond elsewhere either). but
the rest seemed to clear up on power off/on.

in 4 of the instances, smartd was the first to log anything about a dead
disk -- it didn't log any change in the smart parameters, just that it
couldn't reach the drive. following smartd's complaint were kernel
messages about resetting the bus, and so forth.

the 5th time, this evening, i was running hddtemp by hand -- and the
failure appeared to occur at exactly that moment.

the drives are maxtor D740X 80GB (6L080J4 or 6L080L4).

is it at all possible that using SMART is causing some sort of screwup in
the kernel or on the drives? (i mean anything is possible, i'm just grasping
at straws here.)

i ended up replacing the controller tonight, 'cause it's just too
coincidental that all this is happening on /dev/hdi (i've replaced the
hdi disk and cable already).

anyhow, kernel rev is 2.4.19-pre7-ac4.

-dean


2002-09-26 08:20:54

by Jan Niehusmann

Subject: Re: SMART *causing* disk lossage?

On Wed, Sep 25, 2002 at 10:51:37PM -0700, dean gaudet wrote:
> in 4 of the instances, smartd was the first to log anything about a dead
> disk -- it didn't log any change in the smart parameters, just that it
> couldn't reach the drive. following smartd's complaint were kernel
> messages about resetting the bus, and so forth.

It may be completely unrelated, but we had similar problems in one
server after installing a new gigabit ethernet card. The server ran fine
for several days, and then the disk became unreachable. After a reboot
all was fine for a few days, and then the problem showed up again.

We 'solved' the problem by moving the ethernet card to a different pci
slot where it didn't share its interrupt with the ide controller.

The mainboard was an asus a7v-133, and the NIC uses the tg3 driver.

Jan
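
A rough way to check for the interrupt sharing described above is to look for IRQ lines in /proc/interrupts that list more than one device. The sketch below is only an illustration and is not from either poster: the function name shared_irqs and the sample text are made up, and the parser assumes the 2.4-era layout (IRQ number, per-CPU counters, one controller-type column such as XT-PIC, then a comma-separated device list). Newer kernels add extra columns that this does not handle.

```python
def shared_irqs(text):
    """Return {irq: [device names]} for IRQ lines listing more than one device.

    Assumes the 2.4-era /proc/interrupts layout:
      IRQ: <count per CPU ...> <controller type> <dev>, <dev>, ...
    """
    shared = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        irq = head.strip()
        if not sep or not irq.isdigit():
            continue  # skip the CPU header and rows such as NMI or ERR
        fields = rest.split()
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1  # step over the per-CPU interrupt counters
        # fields[i] is the controller type; the rest is the device list
        devices = " ".join(fields[i + 1:])
        names = [d.strip() for d in devices.split(",") if d.strip()]
        if len(names) > 1:
            shared[irq] = names
    return shared


# Hypothetical sample, not taken from either machine in this thread.
SAMPLE = """\
           CPU0
  0:    8123456          XT-PIC  timer
 11:     734210          XT-PIC  ide2, ide3, eth1
 14:     102938          XT-PIC  ide0
NMI:          0
ERR:          0
"""

result = shared_irqs(SAMPLE)
```

On a live system the text would come from reading /proc/interrupts instead of SAMPLE. An entry pairing an ide channel with a NIC, as in the sample, is the kind of sharing Jan's card move got rid of; it shows that the IRQ is shared, not that sharing is the cause.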