2006-08-06 04:14:12

by Joseph Pingenot

[permalink] [raw]
Subject: IO errors after some uptime on 2.6.17.7

I searched the archives, but was unable to find any information, so
here it is.
I've now seen two boxes with 2.6.17.7 go down with IO errors after some
time running. There doesn't seem to necessarily be a fixed amount of
time, nor can I precisely figure out what's causing it.
A .config file from one of the affected machines is attached.
Unfortunately, the syslogs aren't available, since the filesystem got
mounted readonly right when the problem happened, and I don't see
anything in the logs that might have caused it, no oops, no nothing.
Any ideas what might be going on?
Thanks!

-Joseph


Attachments:
(No filename) (608.00 B)
.config (40.02 kB)
Download all attachments

2006-08-06 07:01:50

by Chuck Ebbert

[permalink] [raw]
Subject: Re: IO errors after some uptime on 2.6.17.7

In-Reply-To: <[email protected]>

On Sat, 5 Aug 2006 23:17:01 -0500, Joseph Pingenot wrote:

> I've now seen two boxes with 2.6.17.7 go down with IO errors after some
> time running. There doesn't seem to necessarily be a fixed amount of
> time, nor can I precisely figure out what's causing it.

How do you know they were IO errors?

Did previous versions of 2.6.17 work?

Please try to get debug output with netconsole if you can.

--
Chuck

2006-08-06 15:32:15

by Joseph Pingenot

[permalink] [raw]
Subject: Re: IO errors after some uptime on 2.6.17.7

>From Chuck Ebbert on Sunday, 06 August, 2006:
>In-Reply-To: <[email protected]>
>On Sat, 5 Aug 2006 23:17:01 -0500, Joseph Pingenot wrote:
>> I've now seen two boxes with 2.6.17.7 go down with IO errors after some
>> time running. There doesn't seem to necessarily be a fixed amount of
>> time, nor can I precisely figure out what's causing it.
>How do you know they were IO errors?

That's what everything was saying, like when I tried to run any program.
Eventually, things that were running stayed running (e.g. I have irssi
on one box showing symptoms and can still chat, but opening up a
new window in screen fails:

Could not write /var/run/utmp: No such process
cannot exec '/bin/bash': Input/output error

On reboot everything was fixed. And the SMART diagnostic on a drive:
# 1 Extended offline Completed without error 00% 12221 -

Some relevant SMART data:
5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 0
1 Raw_Read_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

I'll send the .config from the other downed box when I can get to it
to reboot it.

>Did previous versions of 2.6.17 work?

It seemed to, but I'm far from certain.

>Please try to get debug output with netconsole if you can.

That will be very tough atm. I'll see what I can do, though.

2006-08-13 19:17:15

by Joseph Pingenot

[permalink] [raw]
Subject: Re: IO errors after some uptime on 2.6.17.7

>From Joseph Pingenot on Sunday, 06 August, 2006:
>From Chuck Ebbert on Sunday, 06 August, 2006:
>>In-Reply-To: <[email protected]>
>>On Sat, 5 Aug 2006 23:17:01 -0500, Joseph Pingenot wrote:
>>> I've now seen two boxes with 2.6.17.7 go down with IO errors after some
>>> time running. There doesn't seem to necessarily be a fixed amount of
>>> time, nor can I precisely figure out what's causing it.
>>How do you know they were IO errors?

Well, as an update, I'm becoming convinced that at least one is actually
the disk dying. The other might be too. But the timing sure was
crazy.

Thanks for the help, and sorry to waste your time.

-Joseph

>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/