2010-04-16 12:55:46

by Andre Noll

[permalink] [raw]
Subject: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

Hi

I'm seeing the following errors when writing to an ext4 file system:

qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
end_request: I/O error, dev sda, sector 7812771880
Aborting journal on device sda-8.
EXT4-fs (sda): delayed block allocation failed for inode 752 at logical offset 1982464 with max blocks 17588 with error -30

This should not happen!! Data will be lost
EXT4-fs error (device sda) in ext4_da_writepages: Journal has aborted
EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (sda): Remounting filesystem read-only
EXT4-fs (sda): delayed block allocation failed for inode 756 at logical offset 177656 with max blocks 1205 with error -30
EXT4-fs error (device sda) in ext4_da_write_end: IO failure

This should not happen!! Data will be lost
EXT4-fs (sda): ext4_da_writepages: jbd2_start: 1205 pages, ino 756; err -30

EXT4-fs (sda): ext4_da_writepages: jbd2_start: 17588 pages, ino 752; err -30

EXT4-fs (sda): delayed block allocation failed for inode 755 at logical offset 669696 with max blocks 3268 with error -30

This should not happen!! Data will be lost
EXT4-fs (sda): ext4_da_writepages: jbd2_start: 3268 pages, ino 755; err -30

This happened while stress-testing the machine [1] with

stress -d 5 --hdd-bytes 10G --hdd-noclean

This command is supposed to fill the disk and then exit with "no space
left". Instead, the above errors occured when the fs was 100% full.
There's nothing in the logs of the Raid system.

Is this a known issue? Who's to blame, qla2xxx or ext4?

Thanks
Andre

[1] Vanilla 2.6.34-rc4, a 7.3T ext4 fs on a FC Infortrend hardware
raid, connected to a qlogic hba. Let me know if you need more info.
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (1.77 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-16 15:02:36

by Eric Sandeen

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 04/16/2010 07:35 AM, Andre Noll wrote:
> Hi
>
> I'm seeing the following errors when writing to an ext4 file system:
>
> qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
> end_request: I/O error, dev sda, sector 7812771880

Sounds like a hardware failure

> Aborting journal on device sda-8.
> EXT4-fs (sda): delayed block allocation failed for inode 752 at logical offset 1982464 with max blocks 17588 with error -30
>
> This should not happen!! Data will be lost

ext4 is just the messenger here, I think.

-Eric

> EXT4-fs error (device sda) in ext4_da_writepages: Journal has aborted
> EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
> EXT4-fs (sda): Remounting filesystem read-only
> EXT4-fs (sda): delayed block allocation failed for inode 756 at logical offset 177656 with max blocks 1205 with error -30
> EXT4-fs error (device sda) in ext4_da_write_end: IO failure
>
> This should not happen!! Data will be lost
> EXT4-fs (sda): ext4_da_writepages: jbd2_start: 1205 pages, ino 756; err -30
>
> EXT4-fs (sda): ext4_da_writepages: jbd2_start: 17588 pages, ino 752; err -30
>
> EXT4-fs (sda): delayed block allocation failed for inode 755 at logical offset 669696 with max blocks 3268 with error -30
>
> This should not happen!! Data will be lost
> EXT4-fs (sda): ext4_da_writepages: jbd2_start: 3268 pages, ino 755; err -30
>
> This happened while stress-testing the machine [1] with
>
> stress -d 5 --hdd-bytes 10G --hdd-noclean
>
> This command is supposed to fill the disk and then exit with "no space
> left". Instead, the above errors occured when the fs was 100% full.
> There's nothing in the logs of the Raid system.
>
> Is this a known issue? Who's to blame, qla2xxx or ext4?
>
> Thanks
> Andre
>
> [1] Vanilla 2.6.34-rc4, a 7.3T ext4 fs on a FC Infortrend hardware
> raid, connected to a qlogic hba. Let me know if you need more info.


2010-04-16 15:30:48

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 10:02, Eric Sandeen wrote:

> > I'm seeing the following errors when writing to an ext4 file system:
> >
> > qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
> > end_request: I/O error, dev sda, sector 7812771880
>
> Sounds like a hardware failure

Yes, or a problem with the qla2xxx driver. However, it's a bit
surprising that the hardware survived the stress test for more than
one day and went south just at the moment the fs was completely filled.

My first guess was that the problem might be caused by a reported
device size which is slightly larger than the actual size. But then

ddrescue -i 7440Gi /dev/zero /dev/sda

would have resulted in a similar error. This command completed without
errors though.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (839.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-16 15:44:02

by Eric Sandeen

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 04/16/2010 10:30 AM, Andre Noll wrote:
> On 10:02, Eric Sandeen wrote:
>
>>> I'm seeing the following errors when writing to an ext4 file system:
>>>
>>> qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
>>> end_request: I/O error, dev sda, sector 7812771880
>>
>> Sounds like a hardware failure
>
> Yes, or a problem with the qla2xxx driver. However, it's a bit
> surprising that the hardware survived the stress test for more than
> one day and went south just at the moment the fs was completely filled.

Hm, that's true.

Well, the error was on sector 7812771880 of the device ...

can you reliably write directly to that sector w/o the filesystem in
between?

-Eric

> My first guess was that the problem might be caused by a reported
> device size which is slightly larger than the actual size. But then
>
> ddrescue -i 7440Gi /dev/zero /dev/sda
>
> would have resulted in a similar error. This command completed without
> errors though.
>
> Thanks
> Andre


2010-04-16 15:52:22

by Andrew Vasquez

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On Fri, 16 Apr 2010, Eric Sandeen wrote:

> On 04/16/2010 10:30 AM, Andre Noll wrote:
> > On 10:02, Eric Sandeen wrote:
> >
> >>> I'm seeing the following errors when writing to an ext4 file system:
> >>>
> >>> qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
> >>> end_request: I/O error, dev sda, sector 7812771880
> >>
> >> Sounds like a hardware failure
> >
> > Yes, or a problem with the qla2xxx driver. However, it's a bit
> > surprising that the hardware survived the stress test for more than
> > one day and went south just at the moment the fs was completely filled.
>
> Hm, that's true.
>
> Well, the error was on sector 7812771880 of the device ...
>
> can you reliably write directly to that sector w/o the filesystem in
> between?

The driver here is simply reporting that the a command's timeout value
had been exceeded and the mid-layer began error-recovery with a request
to 'abort' the command. The abort-request completed successfully:

qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.

I can't explain why the storage did not complete the request in the
allotted time.

-- av

2010-04-16 15:57:32

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 10:43, Eric Sandeen wrote:

> Well, the error was on sector 7812771880 of the device ...
>
> can you reliably write directly to that sector w/o the filesystem in
> between?

Yes, that does not result in any errors.

Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (305.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-16 16:09:22

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 08:52, Andrew Vasquez wrote:
> The driver here is simply reporting that the a command's timeout value
> had been exceeded and the mid-layer began error-recovery with a request
> to 'abort' the command. The abort-request completed successfully:
>
> qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
>
> I can't explain why the storage did not complete the request in the
> allotted time.

Ah, that's valuable information, thanks. The underlying Infortrend
Raid System is rather old but worked without any problems for several
years. We recently replaced its 400G disks by new 2T WD disks. Maybe
the new disks have longer response times, could that be the reason? And
is there a way to increase the timeout value?

Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (829.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-16 16:42:03

by Andrew Vasquez

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On Fri, 16 Apr 2010, Andre Noll wrote:

> On 08:52, Andrew Vasquez wrote:
> > The driver here is simply reporting that the a command's timeout value
> > had been exceeded and the mid-layer began error-recovery with a request
> > to 'abort' the command. The abort-request completed successfully:
> >
> > qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
> >
> > I can't explain why the storage did not complete the request in the
> > allotted time.
>
> Ah, that's valuable information, thanks. The underlying Infortrend
> Raid System is rather old but worked without any problems for several
> years. We recently replaced its 400G disks by new 2T WD disks. Maybe
> the new disks have longer response times, could that be the reason? And
> is there a way to increase the timeout value?

To update the default timeout value (30 seconds) for commands
submitted to /dev/sdn to 60 seconds:

$ echo 60 > /sys/block/sdn/device/timeout

-- av

2010-04-16 16:47:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On Fri, Apr 16, 2010 at 02:35:26PM +0200, Andre Noll wrote:
> Hi
>
> I'm seeing the following errors when writing to an ext4 file system:
>
> qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
> end_request: I/O error, dev sda, sector 7812771880

These look like hardware errors; it may not be the qla2xxx driver's
fault, but actually the fault of the hard disk....

- Ted

2010-04-16 17:07:31

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 09:36, Andrew Vasquez wrote:
> > > qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
> > >
> > > I can't explain why the storage did not complete the request in the
> > > allotted time.
> >
> > Ah, that's valuable information, thanks. The underlying Infortrend
> > Raid System is rather old but worked without any problems for several
> > years. We recently replaced its 400G disks by new 2T WD disks. Maybe
> > the new disks have longer response times, could that be the reason? And
> > is there a way to increase the timeout value?
>
> To update the default timeout value (30 seconds) for commands
> submitted to /dev/sdn to 60 seconds:
>
> $ echo 60 > /sys/block/sdn/device/timeout

I will re-run the stress test with a 60 seconds timeout value and follow
up if this did not help.

Thanks again
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (923.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-17 16:55:40

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On Friday 16 April 2010, Andre Noll wrote:
> On 09:36, Andrew Vasquez wrote:
> > > > qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73
> > > > 2002.
> > > >
> > > > I can't explain why the storage did not complete the request in the
> > > > allotted time.
> > >
> > > Ah, that's valuable information, thanks. The underlying Infortrend
> > > Raid System is rather old but worked without any problems for several
> > > years. We recently replaced its 400G disks by new 2T WD disks. Maybe
> > > the new disks have longer response times, could that be the reason? And
> > > is there a way to increase the timeout value?
> >
> > To update the default timeout value (30 seconds) for commands
> > submitted to /dev/sdn to 60 seconds:
> >
> > $ echo 60 > /sys/block/sdn/device/timeout
>
> I will re-run the stress test with a 60 seconds timeout value and follow
> up if this did not help.

That will not help if the command is "SYNCHRONIZE_CACHE", as that ignores
device settings, but uses scsi default timeout (30s), which is far too small
for SATA based raid units. Scsi maintainers ignored that and a couple of other
patches I wrote to improve error handling with Infortrend units. Will send the
patches again soon.
Also, if the abort command succeeds, it the command should be re-queued and
should not result in an error. I think my patches also would increase
verbosity to point out what exactly happened (possibly a wrong return code in
the qla2xxx driver, although that should activate the next step in error
handling, I need to find some to go through the code...).

Altogether filesystem unrelated. The filesystem just might be the reason for a
synchronize-cache, e.g. barriers, etc.

Greetings from T?bingen,
Bernd

2010-04-17 18:19:40

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 18:55, Bernd Schubert wrote:
> > > To update the default timeout value (30 seconds) for commands
> > > submitted to /dev/sdn to 60 seconds:
> > >
> > > $ echo 60 > /sys/block/sdn/device/timeout
> >
> > I will re-run the stress test with a 60 seconds timeout value and follow
> > up if this did not help.
>
> That will not help if the command is "SYNCHRONIZE_CACHE", as that ignores
> device settings, but uses scsi default timeout (30s), which is far too small
> for SATA based raid units. Scsi maintainers ignored that and a couple of other
> patches I wrote to improve error handling with Infortrend units. Will send the
> patches again soon.

Please CC me when you do so. The machine I am having trouble with is
only our fallback server. I can use it freely for testing and am willing
to give your patches a try.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (926.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-17 18:43:30

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On Saturday 17 April 2010, Andre Noll wrote:
> On 18:55, Bernd Schubert wrote:
> > > > To update the default timeout value (30 seconds) for commands
> > > > submitted to /dev/sdn to 60 seconds:
> > > >
> > > > $ echo 60 > /sys/block/sdn/device/timeout
> > >
> > > I will re-run the stress test with a 60 seconds timeout value and
> > > follow up if this did not help.
> >
> > That will not help if the command is "SYNCHRONIZE_CACHE", as that ignores
> > device settings, but uses scsi default timeout (30s), which is far too
> > small for SATA based raid units. Scsi maintainers ignored that and a
> > couple of other patches I wrote to improve error handling with Infortrend
> > units. Will send the patches again soon.
>
> Please CC me when you do so. The machine I am having trouble with is
> only our fallback server. I can use it freely for testing and am willing
> to give your patches a try.

There is actually not much to test, as the patches had been the only solution
to stabilize a large Lustre environment with dozens of Infortrend Raids. I
spent months to debug Infortrend Raids, scsi stack and the LSI MPT fusion
driver. Nowadays some patches are also used for DDN customers.
I'm just always out of time to forward port it to more recent linux-git and to
resend.


Cheers,
Bernd


2010-04-17 20:46:22

by Eric Sandeen

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

Bernd Schubert wrote:
...

> Altogether filesystem unrelated. The filesystem just might be the reason for a
> synchronize-cache, e.g. barriers, etc.

Interesting that it only happens at enospc time though - the fs would
be sending barriers in general usage.

Although, as ext4 fills, it does do more syncing/flushing to try to
eke out that last bit of space.... hmm....

Andre, just for fun, you might try mounting -o nodelalloc, fill it again,
and see if you see the same behavior.

If not, you might do it again w/ defaults, and capture blktrace data, which
I think will trace barrier requests as well.

(or, maybe also try testing with -o nobarrier)

I still don't think it's likely a filesystem problem but maybe you can
pinpoint the fs behavior that triggers it.

-Eric

> Greetings from T?bingen,
> Bernd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2010-04-17 22:39:22

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 15:45, Eric Sandeen wrote:
> Interesting that it only happens at enospc time though - the fs would
> be sending barriers in general usage.
>
> Although, as ext4 fills, it does do more syncing/flushing to try to
> eke out that last bit of space.... hmm....
>
> Andre, just for fun, you might try mounting -o nodelalloc, fill it again,
> and see if you see the same behavior.

Will do. This will take some time though as I destroyed the fs by
writing directly to the device, so I had to recreate it and start from
scratch. I'm currently filling the new fs using a 60s device timeout as
suggested by Andrew.

> If not, you might do it again w/ defaults, and capture blktrace data, which
> I think will trace barrier requests as well.
>
> (or, maybe also try testing with -o nobarrier)
>
> I still don't think it's likely a filesystem problem but maybe you can
> pinpoint the fs behavior that triggers it.

I'll try to reproduce the problem using different timeout values and the
ext4 options you suggest. If I can find a reliable reproducer, I'll run
blktrace and post the results.

Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (1.15 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-20 15:37:54

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 00:38, Andre Noll wrote:
> > I still don't think it's likely a filesystem problem but maybe you can
> > pinpoint the fs behavior that triggers it.
>
> I'll try to reproduce the problem using different timeout values and the
> ext4 options you suggest. If I can find a reliable reproducer, I'll run
> blktrace and post the results.

Here are some results. Prior to running the tests I wrote a bunch of
10G files and then filled the fs completely with 2T files containing
zeros. Each of the tests below consisted of three runs of

- remove 5 of the above 10G files to make 50G space available
- run stress -d 5 --hdd-bytes 10G --hdd-noclean until it dies
- run fsck if any fs errors occured

Summary: Increasing the device timeout to 60s _or_ disabling barriers
makes the problem go away. Deactivating delayed allocation makes the
problem worse.

- device timeout 60s, default ext4 parameters
No problems at all, all three runs OK

- device timeout 30s, default ext4 parameters
1. OK
2. dmesg:
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ea270b 2002.
end_request: I/O error, dev sda, sector 7812889640
Aborting journal on device sda-8.
EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (sda): Remounting filesystem read-only

fsck:
Inode 287, i_blocks is 4294918568, should be 416. Fix? yes
Inode 288, i_size is 2198897426432, should be 2199023251456. Fix? yes
Inode 288, i_blocks is 4294721960, should be 416. Fix? yes

3.
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ece6a8 2002.
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ece6dc 2002.
end_request: I/O error, dev sda, sector 7812690136
Aborting journal on device sda-8.
EXT4-fs error (device sda) in ext4_free_blocks: Journal has aborted
EXT4-fs error (device sda) in ext4_ext_remove_space: Journal has aborted
EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device sda) in ext4_ext_truncate: Journal has aborted
EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device sda) in ext4_orphan_del: Journal has aborted
EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device sda) in ext4_delete_inode: Journal has aborted
EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (sda): Remounting filesystem read-only

fsck:
e2fsck 1.41.10 (10-Feb-2009)
/dev/sda: recovering journal
Clearing orphaned inode 179 (uid=0, gid=0, mode=0100600, size=0)
/dev/sda: clean, 301/244158464 files, 1935004235/1953247232 blocks


- device timeout 30s, nodelalloc

This seems to trigger the problem more reliably:

1.
dmesg: same qla, ext4 errors as above
fsck: orphaned inodes as above
2. and 3.
errors already while removing files:
rm: cannot remove `stress.98q1gG': Read-only file system
dmesg: same qla/ext4 errors, but also
JBD2: Detected IO errors while flushing file data on sda-8
fsck: clean

- device timeout 30s, nobarrier
No problem at all, all three runs OK.

Eric, are you still interested in seeing the blktrace output? Suppose,
I should use a 30s timeout, nodealloc and barriers=1 as this triggers
the problem within minutes.


Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (3.33 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-20 17:16:52

by Eric Sandeen

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 04/20/2010 10:37 AM, Andre Noll wrote:
...

> - device timeout 30s, nobarrier
> No problem at all, all three runs OK.
>
> Eric, are you still interested in seeing the blktrace output? Suppose,
> I should use a 30s timeout, nodealloc and barriers=1 as this triggers
> the problem within minutes.

Hm, so something about barriers being issued is causing timeout
problems on the device...?

since nodelalloc made it -worse- I suppose that rules out the problems
with fs flushing near ensopc, that should only be happening with delalloc.

To be honest I'm not sure what I would do with the blktrace data now,
but perhaps someone might be interested to see it; I'd at least gather it.

-Eric

2010-04-20 17:26:36

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On Tuesday 20 April 2010, Eric Sandeen wrote:
> On 04/20/2010 10:37 AM, Andre Noll wrote:
> ...
>
> > - device timeout 30s, nobarrier
> > No problem at all, all three runs OK.
> >
> > Eric, are you still interested in seeing the blktrace output? Suppose,
> > I should use a 30s timeout, nodealloc and barriers=1 as this triggers
> > the problem within minutes.
>
> Hm, so something about barriers being issued is causing timeout
> problems on the device...?

I think interesting at this point would be the exact model of the Infortrend
device. There are some completely broken models (IMHO), which have two
controllers for redundancy. Now with enabled write-back cache, it can happen
that those units run into some kind of firmware bug. It then takes about 2h to
flush 2GB of write-back cache. The telnet interface will show the status of
the cache. More recent IFT dual-controller units do not suffer from this bug
anymore, but as Andre said, they are using an old unit...


Thanks,
Bernd




2010-04-20 17:46:54

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 11:51, Eric Sandeen wrote:
> On 04/20/2010 10:37 AM, Andre Noll wrote:
> ...
>
> > - device timeout 30s, nobarrier
> > No problem at all, all three runs OK.
> >
> > Eric, are you still interested in seeing the blktrace output? Suppose,
> > I should use a 30s timeout, nodealloc and barriers=1 as this triggers
> > the problem within minutes.
>
> Hm, so something about barriers being issued is causing timeout
> problems on the device...?

Yes, that's what it looks like.

> To be honest I'm not sure what I would do with the blktrace data now,
> but perhaps someone might be interested to see it; I'd at least gather it.

OK, will send it tomorrow.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (750.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-20 18:36:04

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 19:26, Bernd Schubert wrote:
> On Tuesday 20 April 2010, Eric Sandeen wrote:
> I think interesting at this point would be the exact model of the Infortrend
> device.

Here's the system information as reported by the telnet interface:

CPU Type PPC750FX
Total Cache Size 2048MB DDR(ECC)
Firmware Version 3.42I.03
Bootrecord Version 1.23A
FW Upgradability Rev. C
Serial Number 6912121
Battery Backup Unit Present
Base Board Rev. ID 0
Base Board ID 81
ID of NVRAM Defaults A16F-G2221 V6.10
Controller Position Slot A

> There are some completely broken models (IMHO), which have two
> controllers for redundancy.

This is a 4 year old system (which does not support Raid6). It has only
a single controller though.

> Now with enabled write-back cache, it can happen that those units run
> into some kind of firmware bug. It then takes about 2h to flush 2GB of
> write-back cache. The telnet interface will show the status of the
> cache.

Hey, I saw this once on a different (newer) infortrend system. However,
it might still be hapening on this system as well and cause the timeout
problems.

Guess I'll have to check if there's a more recent firmware for this
system..

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (1.38 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-20 20:09:49

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On Tuesday 20 April 2010, Andre Noll wrote:
> On 19:26, Bernd Schubert wrote:
> > On Tuesday 20 April 2010, Eric Sandeen wrote:
> > I think interesting at this point would be the exact model of the
> > Infortrend device.
>
> Here's the system information as reported by the telnet interface:
>
> CPU Type PPC750FX
> Total Cache Size 2048MB DDR(ECC)
> Firmware Version 3.42I.03
> Bootrecord Version 1.23A
> FW Upgradability Rev. C
> Serial Number 6912121
> Battery Backup Unit Present
> Base Board Rev. ID 0
> Base Board ID 81
> ID of NVRAM Defaults A16F-G2221 V6.10
> Controller Position Slot A
>
> > There are some completely broken models (IMHO), which have two
> > controllers for redundancy.
>
> This is a 4 year old system (which does not support Raid6). It has only
> a single controller though.

I don't have any experience with that model.

>
> > Now with enabled write-back cache, it can happen that those units run
> > into some kind of firmware bug. It then takes about 2h to flush 2GB of
> > write-back cache. The telnet interface will show the status of the
> > cache.
>
> Hey, I saw this once on a different (newer) infortrend system. However,
> it might still be hapening on this system as well and cause the timeout
> problems.

I think the dual-controller models that work fine have have SAS-interlink.
Infortrend never confirmed the issue, but I guess it is related to cache-
coherency between both controllers.
There are also other cache related firmware bugs, when it fails to flush the
cache at all. Scsi commands then time out, it enters recovery, properly
responds to scsi commands, resumes normal operation and fails those commands
again. Even with software raid out of several of those hardware raids, this
fail-recover-fail loop prevents suitable operation. Also part of my scsi
patches to limit number of recoveries within a time limit. This issue should
be fixed with recent firmware version, though. But depending on your model,
those fixed version might not be available.

>
> Guess I'll have to check if there's a more recent firmware for this
> system..

At least worth a try.

Cheers,
Bernd

2010-04-21 08:58:02

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

Andre Noll <[email protected]> writes:

> On 00:38, Andre Noll wrote:
>> > I still don't think it's likely a filesystem problem but maybe you can
>> > pinpoint the fs behavior that triggers it.
>>
>> I'll try to reproduce the problem using different timeout values and the
>> ext4 options you suggest. If I can find a reliable reproducer, I'll run
>> blktrace and post the results.
>
> Here are some results. Prior to running the tests I wrote a bunch of
> 10G files and then filled the fs completely with 2T files containing
> zeros. Each of the tests below consisted of three runs of
>
> - remove 5 of the above 10G files to make 50G space available
> - run stress -d 5 --hdd-bytes 10G --hdd-noclean until it dies
what 'stress' process do? was it posted already?
> - run fsck if any fs errors occured
>
> Summary: Increasing the device timeout to 60s _or_ disabling barriers
> makes the problem go away. Deactivating delayed allocation makes the
> problem worse.
2Gb cache is really huge.
barriers=0 , result in less disk wcache activity, but more real IO
And nodelaloc result in more real IO due, so imho this is looks like
device issue.
about nodelalloc: It is unlikely to see "This should not happen!!
Data will be lost" because this message appear from writepage
so may happens only when you rewrite an existing file(below i_size).
BTW, you already noted that you have performed some stress on the device
without filesystem. What was they doing?
IMHO you will able to reproduce the issue without fs by performing
random writes/reads to the device from several tasks in parallel.
>
> - device timeout 60s, default ext4 parameters
> No problems at all, all three runs OK
>
> - device timeout 30s, default ext4 parameters
> 1. OK
> 2. dmesg:
> qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ea270b 2002.
> end_request: I/O error, dev sda, sector 7812889640
> Aborting journal on device sda-8.
> EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
> EXT4-fs (sda): Remounting filesystem read-only
>
> fsck:
> Inode 287, i_blocks is 4294918568, should be 416. Fix? yes
> Inode 288, i_size is 2198897426432, should be 2199023251456. Fix? yes
> Inode 288, i_blocks is 4294721960, should be 416. Fix? yes
>
> 3.
> qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ece6a8 2002.
> qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ece6dc 2002.
> end_request: I/O error, dev sda, sector 7812690136
> Aborting journal on device sda-8.
> EXT4-fs error (device sda) in ext4_free_blocks: Journal has aborted
> EXT4-fs error (device sda) in ext4_ext_remove_space: Journal has aborted
> EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted
> EXT4-fs error (device sda) in ext4_ext_truncate: Journal has aborted
> EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted
> EXT4-fs error (device sda) in ext4_orphan_del: Journal has aborted
> EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted
> EXT4-fs error (device sda) in ext4_delete_inode: Journal has aborted
> EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
> EXT4-fs (sda): Remounting filesystem read-only
>
> fsck:
> e2fsck 1.41.10 (10-Feb-2009)
> /dev/sda: recovering journal
> Clearing orphaned inode 179 (uid=0, gid=0, mode=0100600, size=0)
> /dev/sda: clean, 301/244158464 files, 1935004235/1953247232 blocks
>
>
> - device timeout 30s, nodelalloc
>
> This seems to trigger the problem more reliably:
>
> 1.
> dmesg: same qla, ext4 errors as above
> fsck: orphaned inodes as above
> 2. and 3.
> errors already while removing files:
> rm: cannot remove `stress.98q1gG': Read-only file system
> dmesg: same qla/ext4 errors, but also
> JBD2: Detected IO errors while flushing file data on sda-8
> fsck: clean
>
> - device timeout 30s, nobarrier
> No problem at all, all three runs OK.
>
> Eric, are you still interested in seeing the blktrace output? Suppose,
> I should use a 30s timeout, nodealloc and barriers=1 as this triggers
> the problem within minutes.
>
>
> Regards
> Andre

2010-04-21 13:49:06

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 12:57, Dmitry Monakhov wrote:
> > - run stress -d 5 --hdd-bytes 10G --hdd-noclean until it dies
> what 'stress' process do? was it posted already?

stress is a simple, yet useful program which imposes certain types of
stress on a machine. With the above command line options, it simply
writes 5 files in parallel, each 10G large, in an endless loop until
the file system is full (or becomes read-only due to errors). It
helped me more than once to identify hardware or software problems,
_before_ the machine went into production use.

> > Summary: Increasing the device timeout to 60s _or_ disabling barriers
> > makes the problem go away. Deactivating delayed allocation makes the
> > problem worse.
> 2Gb cache is really huge.

Really? This is a four year old el-cheapo hardware raid system with
16 SATA slots. You can easily spend twice the money and get much more
cache memory then.

> barriers=0 , result in less disk wcache activity, but more real IO
> And nodelaloc result in more real IO due, so imho this is looks like
> device issue.

Yes, I think we all agree that the problem is not ext4-related but
is most likely an issue with the infortrend hardware. However, ext4
seems to be very good at triggering that particular problem.

> about nodelalloc: It is unlikely to see "This should not happen!!
> Data will be lost" because this message appear from writepage
> so may happens only when you rewrite an existing file(below i_size).

Nope, this definitely occured while stress was writing new files and
the file system was nearly full.

> BTW, you already noted that you have performed some stress on the device
> without filesystem. What was they doing?

I only ran ddrescue /dev/sda /dev/null once to make sure everything
is readable. This completed with no problems, so I created an ext4
file system and used the above stress command which resulted in write
errors. I then used ddrescue again to rewrite the sector on which
the error occured. This also succeeded which indicates a transient
problem, i.e. no problem with the particular sector.

Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (2.11 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2010-04-22 08:21:53

by Andre Noll

[permalink] [raw]
Subject: Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

On 19:46, Andre Noll wrote:
> > To be honest I'm not sure what I would do with the blktrace data now,
> > but perhaps someone might be interested to see it; I'd at least gather it.
>
> OK, will send it tomorrow.

Here we go.

mount | grep sda:

/dev/sda on /mnt type ext4 (ro,relatime,barrier=1,data=ordered)

command:

ls stress.* | head -n 5 | xargs rm && stress -d 5 --hdd-noclean --hdd-bytes 10G

dmesg:

EXT4-fs (sda): mounted filesystem with ordered data mode
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82d1 2002.
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82d3 2002.
end_request: I/O error, dev sda, sector 7812689752
Aborting journal on device sda-8.
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82d8 2002.
EXT4-fs (sda): delayed block allocation failed for inode 4198406 at logical offset 0 with max blocks 32768 with error -30

This should not happen!! Data will be lost
EXT4-fs error (device sda) in ext4_da_writepages: Journal has aborted
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82da 2002.
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82db 2002.
EXT4-fs (sda): delayed block allocation failed for inode 4198435 at logical offset 0 with max blocks 32768 with error -30

This should not happen!! Data will be lost
EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (sda): Remounting filesystem read-only
EXT4-fs (sda): ext4_da_writepages: jbd2_start: 32768 pages, ino 4198406; err -30

EXT4-fs (sda): ext4_da_writepages: jbd2_start: 32768 pages, ino 4198435; err -30

qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82df 2002.
EXT4-fs (sda): delayed block allocation failed for inode 4198433 at logical offset 0 with max blocks 32768 with error -30

This should not happen!! Data will be lost
EXT4-fs (sda): delayed block allocation failed for inode 4198437 at logical offset 0 with max blocks 32768 with error -30

This should not happen!! Data will be lost
EXT4-fs (sda): ext4_da_writepages: jbd2_start: 32768 pages, ino 4198433; err -30

EXT4-fs (sda): ext4_da_writepages: jbd2_start: 32768 pages, ino 4198437; err -30

qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82e2 2002.
...

The output of "btrace /dev/sda" during the above command is 1.7M
large. It's at

http://systemlinux.org/~maan/btrace.sda

I'll run further tests with barrier=0 and 60s timeout.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (2.70 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments