2011-02-07 18:53:20

by bryan.coleman

Subject: ext4 problems with external RAID array via SAS connection

I am experiencing problems with an ext4 file system.

At first, the drive seemed to work fine. I was primarily copying things
to the drive, migrating data from another server. After many GB of data
had seemingly been transferred successfully, I started seeing ext4
errors in /var/log/messages. I then unmounted the drive and ran fsck on
it (which took multiple hours). When I then ls'ed around, one of the
areas caused the system to throw ext4 errors again.

I did run memtest through one complete pass and it found no problems.

I then went looking for help on the fedora forum and it was suggested that
I increase my journal size. So I recreated the ext4 partition (with
larger journal) and started the migration process again. After several
days of copying, the errors started again.


Here are some of the errors from /var/log/messages:

Feb 2 04:48:30 mdct-00fs kernel: [672021.519914] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22307: 460 blocks in bitmap, 0 in gd
Feb 2 04:48:30 mdct-00fs kernel: [672021.520429] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22308: 1339 blocks in bitmap, 0 in gd
Feb 2 04:48:30 mdct-00fs kernel: [672021.520927] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22309: 3204 blocks in bitmap, 0 in gd
Feb 2 04:48:30 mdct-00fs kernel: [672021.521409] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22310: 2117 blocks in bitmap, 0 in gd
Feb 4 05:08:29 mdct-00fs kernel: [845547.724807] EXT4-fs error (device dm-2): ext4_dx_find_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156 offset=0(9166848), inode=3143403788, rec_len=80864, name_len=168
Feb 4 05:08:29 mdct-00fs kernel: [845547.733034] EXT4-fs error (device dm-2): ext4_add_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156 offset=0(0), inode=3143403788, rec_len=80864, name_len=168
Feb 4 05:19:41 mdct-00fs kernel: [846217.922351] EXT4-fs error (device dm-2): ext4_dx_find_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156 offset=0(9166848), inode=3143403788, rec_len=80864, name_len=168
Feb 4 05:19:41 mdct-00fs kernel: [846217.928922] EXT4-fs error (device dm-2): ext4_add_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156 offset=0(0), inode=3143403788, rec_len=80864, name_len=168


Here is my setup:

Promise VTrak RAID array with 12 drives in a RAID 6 configuration
(over 5 TB).
The Promise array is connected to my server using an external SAS
connection.
OS: Fedora 14

One logical volume on the Promise.
One logical volume at the external SAS level.
One logical volume at the OS level.
So from the OS, I see one logical volume depicting one big drive.

I then set up the ext4 file system using the following command:
'mkfs.ext4 -v -m 1 -J size=1024 -E stride=16,stripe-width=160
/dev/vg_storage/lv_storage'
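For what it's worth, the stride/stripe-width arithmetic can be sanity-checked
against the RAID geometry. This is only a sketch; the 64 KB chunk size is an
assumption (the controller's chunk size isn't stated here), while the other
numbers come from the setup above:

```shell
# Sketch: derive mkfs.ext4 stride/stripe-width from RAID geometry.
CHUNK_KB=64          # per-disk chunk size in KB (ASSUMED, not stated above)
BLOCK_KB=4           # ext4 block size (the mkfs default on a filesystem this large)
DATA_DISKS=10        # 12 drives in RAID 6, minus 2 parity drives
stride=$((CHUNK_KB / BLOCK_KB))
stripe_width=$((stride * DATA_DISKS))
echo "stride=$stride stripe-width=$stripe_width"
```

With those assumptions the result matches the flags used above
(stride=16, stripe-width=160).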


Any thoughts/tips on how to track down the problem?

My thought now is to try ext3 instead; however, my fear is that I will
just run into the same problem with it. Is ext4 production ready?


Thoughts?


2011-02-07 22:54:39

by Theodore Ts'o

Subject: Re: ext4 problems with external RAID array via SAS connection

On Mon, Feb 07, 2011 at 01:53:18PM -0500, [email protected] wrote:
> I am experiencing problems with an ext4 file system.
>
> At first, the drive seemed to work fine. I was primarily copying things
> to the drive, migrating data from another server. After many GB of data
> had seemingly been transferred successfully, I started seeing ext4
> errors in /var/log/messages. I then unmounted the drive and ran fsck on
> it (which took multiple hours). When I then ls'ed around, one of the
> areas caused the system to throw ext4 errors again.

Did fsck report any errors? Do you have a copy of your fsck
transcript?

The errors you've reported do make me suspicious that there's
something unstable with your hardware...

- Ted

2011-02-08 13:19:02

by bryan.coleman

Subject: Re: ext4 problems with external RAID array via SAS connection

When I ran fsck after the first bout of failures, it did report a lot of
errors. I do not have a copy of that fsck transcript; however, I have not
yet run fsck since my second attempt. Is there a preferred method of
capturing the transcript?
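One common approach is to pipe the run through tee so that stdout and stderr
both land in a log file. The sketch below uses a placeholder function instead
of the real e2fsck invocation, which would operate on the device from the
original post:

```shell
# Sketch: capture an fsck transcript with tee. fsck_cmd is a PLACEHOLDER
# standing in for: e2fsck -f -y /dev/mapper/vg_storage-lv_storage
log="fsck-transcript.log"
fsck_cmd() { echo "Pass 1: Checking inodes, blocks, and sizes"; }
fsck_cmd 2>&1 | tee "$log"      # 2>&1 so error messages are captured too
grep -c "Pass 1" "$log"         # confirm the transcript was written
```

The `script` command works as well; `tee` just avoids the terminal control
characters that `script` records in its typescript file.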

Bryan



From: Ted Ts'o <[email protected]>
To: [email protected]
Cc: [email protected]
Date: 02/07/2011 05:55 PM
Subject: Re: ext4 problems with external RAID array via SAS
connection
Sent by: [email protected]



On Mon, Feb 07, 2011 at 01:53:18PM -0500, [email protected] wrote:
> I am experiencing problems with an ext4 file system.
>
> At first, the drive seemed to work fine. I was primarily copying things

> to the drive migrating data from another server. After many GBs of
data,
> that seemingly successfully were done being transferred, I started
seeing
> ext4 errors in /var/log/messages. I then unmounted the drive and ran
fsck
> on it (which took multiple hours to run). I then ls'ed around and one
of
> the areas caused the system to again throw ext4 errors.

Did fsck report any errors? Do you have a copy of your fsck
transcript?

The errors you've reported do make me suspicious that there's
something unstable with your hardware...

- Ted

2011-02-08 14:50:32

by bryan.coleman

Subject: Re: ext4 problems with external RAID array via SAS connection

Well, I attempted to run fsck on the problem drive using the script
command to capture the transcript; however, it failed to read a block from
the file system. The error was "fsck.ext4: Attempt to read block from
filesystem resulted in short read while trying to open
/dev/mapper/vg_storage-lv_storage".

Other messages that are now in /var/log/messages:

Buffer I/O error on device dm-2, logical block 0
lost page write due to I/O error on dm-2
EXT4-fs (dm-2): previous I/O error to superblock detected
Buffer I/O error on device dm-2, logical block 0
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 0
Buffer I/O error on device dm-2, logical block 1
Buffer I/O error on device dm-2, logical block 2
Buffer I/O error on device dm-2, logical block 3
Buffer I/O error on device dm-2, logical block 0
EXT4-fs (dm-2): unable to read superblock


Since it looks like I need to start the process all over again, is there a
good way to quickly determine whether the problem is hardware related? Is
there a preferred method that will stress-test the drive and shed more
light on what might be going wrong?
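A simple first check is a full sequential read of the volume, which will
surface I/O errors without touching the data. On the real array that would
be something like `dd if=/dev/mapper/vg_storage-lv_storage of=/dev/null
bs=1M` (or `badblocks -sv` on the device for a more thorough read test);
the sketch below runs the same pattern against a small scratch file instead
of the real device:

```shell
# Sketch: whole-device read pass, using a scratch image as a stand-in
# for /dev/mapper/vg_storage-lv_storage.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=8 2>/dev/null   # 8 MB scratch "device"
if dd if="$img" of=/dev/null bs=1M 2>/dev/null; then  # read every block back
  result="read pass OK"
else
  result="read errors detected"                       # dd fails on I/O errors
fi
echo "$result"
rm -f "$img"
```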

Thank you,

Bryan



From: [email protected]
To: [email protected], [email protected]
Date: 02/08/2011 08:19 AM
Subject: Re: ext4 problems with external RAID array via SAS
connection
Sent by: [email protected]



When I ran fsck after the first bout of failure, it did report a lot of
errors. I do not have a copy of that fsck transcript; however, I have not

yet run fsck since my second attempt. Is there a method of capturing the
transcript that is preferred?

Bryan



From: Ted Ts'o <[email protected]>
To: [email protected]
Cc: [email protected]
Date: 02/07/2011 05:55 PM
Subject: Re: ext4 problems with external RAID array via SAS
connection
Sent by: [email protected]



On Mon, Feb 07, 2011 at 01:53:18PM -0500, [email protected] wrote:
> I am experiencing problems with an ext4 file system.
>
> At first, the drive seemed to work fine. I was primarily copying things


> to the drive migrating data from another server. After many GBs of
data,
> that seemingly successfully were done being transferred, I started
seeing
> ext4 errors in /var/log/messages. I then unmounted the drive and ran
fsck
> on it (which took multiple hours to run). I then ls'ed around and one
of
> the areas caused the system to again throw ext4 errors.

Did fsck report any errors? Do you have a copy of your fsck
transcript?

The errors you've reported do make me suspicious that there's
something unstable with your hardware...

- Ted

2011-02-08 15:19:49

by Eric Sandeen

Subject: Re: ext4 problems with external RAID array via SAS connection

On 2/8/11 8:50 AM, [email protected] wrote:
> Well, I attempted to run fsck on the problem drive using the script
> command to capture the transcript; however, it failed to read a block from
> the file system. The exception was "fsck.ext4: Attempt to read block from
> filesystem resulted in short read while trying to open
> /dev/mapper/vg_storage-lv_storage".
>
> Other messages that are now in /var/log/messages:
>
> Buffer I/O error on device dm-2, logical block 0
> lost page write due to I/O error on dm-2
> EXT4-fs (dm-2): previous I/O error to superblock detected
> Buffer I/O error on device dm-2, logical block 0
> lost page write due to I/O error on dm-2
> Buffer I/O error on device dm-2, logical block 0
> Buffer I/O error on device dm-2, logical block 1
> Buffer I/O error on device dm-2, logical block 2
> Buffer I/O error on device dm-2, logical block 3
> Buffer I/O error on device dm-2, logical block 0
> EXT4-fs (dm-2): unable to read superblock
>
>
> Since it looks like I need to start the process all over again, is there a
> good way to quickly determine if the problem is hardware related? Is
> there a preferred method that will stress test the drive and shed more
> light on what might be going wrong?

You have a hardware problem... "Buffer I/O error on device dm-2, logical block 0"
means that the kernel failed to read the first block on that device; that's not
something e2fsck can fix, I'm afraid. You'll need to sort out what's wrong with
the storage first.

-Eric


2011-02-08 18:50:35

by bryan.coleman

Subject: Re: ext4 problems with external RAID array via SAS connection

I found that the Promise array had been restarted by its watchdog timer. I
am investigating that avenue with Promise (albeit slowly). Note: the
watchdog reset the controller days after the initial ext4 messages. I'm
not saying they are unrelated; I just want to get all of the facts out
there.

I suspect the connection between the server and the Promise array got hosed
when the controller was reset. When I restarted the server, I could fsck
the drive.

The fsck is currently running (and has been for some time now).

It is printing a ton of messages like "Inode ######## ref count is 2,
should be 1. Fix? yes", "Unattached inode #########", and "Connect to
/lost+found? yes".

I am running fsck in a script session; however, there are currently a ton
of the messages above (current log size: 106M).

Do you think it is still hardware? If so, is there a command that would
stress it enough to break quickly? What is the best way to isolate
hardware problems?
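One way to separate transport problems from filesystem problems is to filter
the kernel log for low-level I/O errors rather than EXT4-fs messages. The
sketch below runs the filter over two sample lines taken from earlier in this
thread instead of a live `dmesg`:

```shell
# Sketch: count transport-level I/O errors in a log excerpt. On the real
# machine the input would be `dmesg` or /var/log/messages, e.g.:
#   dmesg | grep -ci 'i/o error'
count=$(printf '%s\n' \
  "Buffer I/O error on device dm-2, logical block 0" \
  "EXT4-fs (dm-2): unable to read superblock" \
  | grep -ci 'i/o error')
echo "matching lines: $count"
```

Repeated "Buffer I/O error" lines on low logical blocks point at the link or
controller rather than at ext4 itself.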

Bryan




2011-02-08 20:49:38

by Eric Sandeen

Subject: Re: ext4 problems with external RAID array via SAS connection

On 2/8/11 12:50 PM, [email protected] wrote:
> I found that the promise array had been restarted via watchdog timer. I
> am investigating that avenue via promise (albeit slow). Note: the
> watchdog reset the controller days after the initial ext4 messages. I'm
> not saying they are unrelated. I just want to get all of the facts out
> there.
>
> I suspect the connection between the server and the promise got hosed when
> the controller was reset. When I restarted the server, I could fsck the
> drive.
>
> The fsck is currently running (and has been for some time now).
>
> It is doing a ton of "Inode ######## ref count is 2, should be 1. Fix?
> yes" "Unattached inode #########" "Connect to /lost+found? yes"
>
> I am running fsck in a script session; however, there are currently a ton
> of the messages above (current log size: 106M).
>
> Do you think it is still hardware? If so, is there a command that would
> stress it enough to break quickly? What is the best way to isolate
> hardware problems?

My assertion of hardware problems was based solely on the IO error reading
block 0. If you can't read the superblock there's not much to be done.

As for what caused the corruption fsck is now finding, that's harder to say;
you're essentially getting reports that fsck is finding errors which happened
sometime in the past.

My first thought is whether a large write cache on the array was lost when it
was reset; that could certainly cause filesystem corruption.

-Eric



2011-02-09 13:43:58

by bryan.coleman

Subject: Re: ext4 problems with external RAID array via SAS connection

The disk was not in the middle of copying when the array went down.

I did get an fsck transcript; however, it is 14M tgz'd. I don't really
want to send it to the list, but I am willing to send it directly if you
(or Ted) are willing.

The fsck said it completed successfully. I kicked off fsck again just to
make sure and it reported clean. So I mounted the drive and ls'd around
and it started reporting errors. "ls: cannot access 40: Input/output
error" Note: 40 is a directory.

So I unmounted again and started an fsck. It reported errors and started
on its merry way; however, now it was dealing with many "Multiply-claimed
block(s) in inode #########: <many numbers following>" messages.

I am willing to reformat the drive again; however, I would first like to
know the best way to track down the issue.

Any thoughts?







2011-02-09 18:28:33

by Theodore Ts'o

Subject: Re: ext4 problems with external RAID array via SAS connection

On Wed, Feb 09, 2011 at 08:43:56AM -0500, [email protected] wrote:
>
> The fsck said it completed successfully. I kicked off fsck again just to
> make sure and it reported clean. So I mounted the drive and ls'd around
> and it started reporting errors. "ls: cannot access 40: Input/output
> error" Note: 40 is a directory.

Well, we'd need to look at the kernel messages, but the Input/output
error strongly suggests that there are, well, I/O errors talking to
your storage array. Which again suggests hardware problems, or device
driver bugs, or both.

- Ted

2011-02-09 19:46:39

by Ric Wheeler

Subject: Re: ext4 problems with external RAID array via SAS connection

On 02/09/2011 01:28 PM, Ted Ts'o wrote:
> On Wed, Feb 09, 2011 at 08:43:56AM -0500, [email protected] wrote:
>> The fsck said it completed successfully. I kicked off fsck again just to
>> make sure and it reported clean. So I mounted the drive and ls'd around
>> and it started reporting errors. "ls: cannot access 40: Input/output
>> error" Note: 40 is a directory.
> Well, we'd need to look at the kernel messages, but the Input/output
> error strongly suggests that there are, well, I/O errors talking to
> your storage array. Which again suggests hardware problems, or device
> driver bugs, or both.
>
> - Ted

I think that you might want to start testing with a simplified storage
config. Try keeping the RAID card in the loop, but use a simpler RAID
scheme (a single drive? RAID 0 or RAID 1) and see if the issue persists.

Ric