2003-08-20 15:09:10

by Larry McVoy

Subject: IDE weirdness

The primary drive in our file server started to flake out on us. The
integrity checker we use as part of our backups caught it: files that
hadn't been modified in a couple of years started having different CRCs.
I pulled the data off and stuck in a new drive.
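
A minimal sketch of the kind of check described above, assuming a checker
that records each file's modification time and CRC-32 and flags files whose
CRC changed while the mtime did not. The function names (file_crc32,
snapshot, silent_changes) are hypothetical and not the tool actually used
here:

```python
import os
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file's contents, read in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def snapshot(root):
    """Map each file's path (relative to root) to (mtime_ns, crc32)."""
    result = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            result[rel] = (os.stat(full).st_mtime_ns, file_crc32(full))
    return result


def silent_changes(old, new):
    """Return files whose CRC changed although their mtime did not."""
    return sorted(
        path
        for path, (mtime, crc) in new.items()
        if path in old and old[path][0] == mtime and old[path][1] != crc
    )
```

Comparing a snapshot taken at backup time with one taken later would list
files whose contents changed without a matching change in mtime, which is
the symptom reported here.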

I wanted to see if the old drive could be salvaged and used as a test box
drive. The drive seems to be degenerating fast. When I put that drive
in a 3ware card, the card only sees 1 of the 3 drives. Strange.
When I put all 3 drives in a Promise card, it sees them, but if I try to
copy data from the bad drive to any other drive the system locks up hard:
no console, no pings, no response to the reset switch. It takes a power
cycle to get things back.

I verified that behaviour on two different systems so it isn't the box.
I also cycled through 3 different 3ware cards to make sure that wasn't
the problem (isn't sys admin fun?).

It's clear to me that I don't want to use this drive but I'm wondering if
there is any interest in debugging the lock up. I've only done it on
2.4.18 as shipped by redhat but I could try 2.6 or whatever you like.

If the consensus is that it is OK that bad hardware locks you up then I'll
toss the drive and move on.
--
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm


2003-08-20 15:40:18

by Jeff Garzik

Subject: Re: IDE weirdness

On Wed, Aug 20, 2003 at 08:09:03AM -0700, Larry McVoy wrote:
> If the consensus is that it is OK that bad hardware locks you up then I'll
> toss the drive and move on.

Don't toss bad drives, send them to weirdos like me: I can use them
for testing and debugging error handling paths...

Jeff



2003-08-20 15:41:23

by Alan

Subject: Re: IDE weirdness

On Wed, 2003-08-20 at 16:09, Larry McVoy wrote:
>
> It's clear to me that I don't want to use this drive but I'm wondering if
> there is any interest in debugging the lock up. I've only done it on
> 2.4.18 as shipped by redhat but I could try 2.6 or whatever you like.
>
> If the consensus is that it is OK that bad hardware locks you up then I'll
> toss the drive and move on.

Some PIO transfers are regulated by the drive and the drive can lock the
bus forever. Newer chipsets like the SI680/3112 support watchdog
deadlock breakers for this but we don't really support them right now.

Getting different data off a failing drive is unusual because the blocks
are ECC'd extensively (well, more than ECC'd) and have checks. It could be
the RAM/CPU going, I guess.


2003-08-20 16:52:07

by John Bradford

Subject: Re: IDE weirdness

> It's clear to me that I don't want to use this drive but I'm wondering if
> there is any interest in debugging the lock up. I've only done it on
> 2.4.18 as shipped by redhat but I could try 2.6 or whatever you like.

Out of interest, what does the S.M.A.R.T. data from the drive look like?

John.

2003-08-29 16:16:18

by Geert Uytterhoeven

Subject: Re: IDE weirdness

On 20 Aug 2003, Alan Cox wrote:
> On Wed, 2003-08-20 at 16:09, Larry McVoy wrote:
> > It's clear to me that I don't want to use this drive but I'm wondering if
> > there is any interest in debugging the lock up. I've only done it on
> > 2.4.18 as shipped by redhat but I could try 2.6 or whatever you like.
> >
> > If the consensus is that it is OK that bad hardware locks you up then I'll
> > toss the drive and move on.
>
> Some PIO transfers are regulated by the drive and the drive can lock the
> bus forever. Newer chipsets like the SI680/3112 support watchdog
> deadlock breakers for this but we don't really support them right now.
>
> Getting different data off a failing drive is unusual because the blocks
> are ECC'd extensively (well, more than ECC'd) and have checks. It could be
> the RAM/CPU going, I guess.

Although it can happen. I used to see corrupted data in /etc/motd (which is
rewritten on each boot up) and random SEGVs on an embedded box. A few weeks
later the drive started to report real errors. After mapping out the bad blocks
using e2fsck -c, and replacing the files that were affected, the problem
disappeared.

Looks like ECC is not always ECC...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds