2003-03-16 14:50:58

by Dave Gilbert (Home)

Subject: 2.4.20: ext3/raid5 - allocating block in system zone/multiple 1 requests for sector

Hi,
I've just built an 800GB RAID5 array and built an ext3 file system
on it; on trying to copy data off the 200GB RAID it is replacing I'm
starting to see errors of the form:

kernel: EXT3-fs error (device md(9,2)): ext3_new_block: Allocating block in
system zone - block = 140509185

and

kernel: EXT3-fs error (device md(9,2)): ext3_add_entry: bad entry in
directory #70254593: rec_len %% 4 != 0 - offset=28, inode=23880564,
rec_len=21587, name_len=76

and

kernel: raid5: multiple 1 requests for sector 281018464

This is on an x86 which has been running fine on the smaller RAID for
years (albeit with Reiser); the array is built from 5 200GB Western Digital
IDE drives on a mix of Promise and HPT controllers (there are no IDE errors
visible). This is a straight 2.4.20 kernel.

The previous messages to the list with this form of error have suggested
the problem is related to >2TB arrays, but this one is a relatively
tiny one.
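
For scale, a quick back-of-envelope check of the sizes involved. This is only a sketch: it assumes decimal gigabytes for the drives and that the 2TB figure corresponds to 2^32 sectors of 512 bytes, neither of which is stated in the thread.

```python
DISKS = 5
DISK_BYTES = 200 * 10**9      # 200GB per drive, decimal units (assumption)
SECTOR_BYTES = 512            # assumed sector size

# RAID5 gives up one drive's worth of space to parity.
usable_bytes = (DISKS - 1) * DISK_BYTES
usable_sectors = usable_bytes // SECTOR_BYTES

# 2 TiB expressed in 512-byte sectors, i.e. 2**32 (assumed meaning of ">2TB").
two_tib_sectors = (2 * 2**40) // SECTOR_BYTES

print(usable_bytes // 10**9, "GB usable")
print("below the 2TB mark:", usable_sectors < two_tib_sectors)
# The sector number from the raid5 message lies inside the array.
print("reported sector in range:", 281018464 < usable_sectors)
```

Under these assumptions the array holds roughly 1.56 billion sectors, well under half of the 2TB mark.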

Help greatly appreciated,

Dave
---------------- Have a happy GNU millennium! ----------------------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/


2003-03-18 03:05:14

by NeilBrown

Subject: Re: 2.4.20: ext3/raid5 - allocating block in system zone/multiple 1 requests for sector

On Sunday March 16, Dave Gilbert wrote:
> Hi,
> I've just built an 800GB RAID5 array and built an ext3 file system
> on it; on trying to copy data off the 200GB RAID it is replacing I'm
> starting to see errors of the form:
>
> kernel: EXT3-fs error (device md(9,2)): ext3_new_block: Allocating block in
> system zone - block = 140509185
>
> and
>
> kernel: EXT3-fs error (device md(9,2)): ext3_add_entry: bad entry in
> directory #70254593: rec_len %% 4 != 0 - offset=28, inode=23880564,
> rec_len=21587, name_len=76
>
> and
>
> kernel: raid5: multiple 1 requests for sector 281018464

I had exactly these symptoms about a year ago in 2.4.18. I found and
fixed the problem, and I have just checked that the fix is definitely in
2.4.20.
So if you really are running 2.4.20 then it looks like a similar bug
has appeared.

These two symptoms strongly suggest a buffer aliasing problem.
i.e. you have two buffers (one for data and one for metadata)
that refer to the same location on disc.
One is part of a file that was recently deleted, but the buffer hasn't
been flushed yet. The other is part of a new directory.
The old buffer and the new buffer both get written to disc at much the
same time (hence the "multiple 1 requests"), but the old buffer hits
the disc second and so corrupts the filesystem.
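
The sequence described above can be sketched as a toy model. This is not kernel code: the `Buffer` class, the `flush` helper and the block number are all hypothetical, and the only point is to show how a stale buffer that is written second overwrites the new directory block.

```python
# Toy model of buffer aliasing: two in-memory buffers map to one disk block.
# Every name here is hypothetical; none of this is a kernel interface.

disk = {}  # block number -> (kind, contents) last written


class Buffer:
    def __init__(self, block, contents, kind):
        self.block = block
        self.contents = contents
        self.kind = kind  # "data" or "metadata"


def flush(buffers):
    """Write buffers in the given order; return blocks written more than once."""
    seen, repeated = set(), []
    for buf in buffers:
        if buf.block in seen:
            repeated.append(buf.block)
        seen.add(buf.block)
        disk[buf.block] = (buf.kind, buf.contents)
    return repeated


BLOCK = 1234  # arbitrary example block number

# Buffer left over from a deleted file, never flushed.
stale = Buffer(BLOCK, b"old file data", "data")
# The same block has since been reallocated to a new directory.
fresh = Buffer(BLOCK, b"new directory entries", "metadata")

# Both are written at much the same time, with the stale one landing second.
repeated = flush([fresh, stale])

print(repeated)     # blocks that received more than one write
print(disk[BLOCK])  # what the directory block now holds
```

In this model the directory block ends up holding file data, which is the kind of damage the ext3 errors report.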

The bug I found was specific to data=journal mode, and this certainly
has more options for buffer aliasing. Were you using data=journal?

NeilBrown

2003-03-18 03:17:26

by Andrew Morton

Subject: Re: 2.4.20: ext3/raid5 - allocating block in system zone/multiple 1 requests for sector

Neil Brown wrote:
>
> These two symptoms strongly suggest a buffer aliasing problem.
> i.e. you have two buffers (one for data and one for metadata)
> that refer to the same location on disc.
> One is part of a file that was recently deleted, but the buffer hasn't
> been flushed yet. The other is part of a new directory.
> The old buffer and the new buffer both get written to disc at much the
> same time (hence the "multiple 1 requests"), but the old buffer hits
> the disc second and so corrupts the filesystem.

This aliasing can happen very easily with direct-io, and it is something
which drivers should be able to cope with.

I hope RAID is not still assuming that all requests are unique in this way?

2003-03-18 05:49:43

by NeilBrown

Subject: Re: 2.4.20: ext3/raid5 - allocating block in system zone/multiple 1 requests for sector

On Monday March 17, Andrew Morton wrote:
> Neil Brown wrote:
> >
> > These two symptoms strongly suggest a buffer aliasing problem.
> > i.e. you have two buffers (one for data and one for metadata)
> > that refer to the same location on disc.
> > One is part of a file that was recently deleted, but the buffer hasn't
> > been flushed yet. The other is part of a new directory.
> > The old buffer and the new buffer both get written to disc at much the
> > same time (hence the "multiple 1 requests"), but the old buffer hits
> > the disc second and so corrupts the filesystem.
>
> This aliasing can happen very easily with direct-io, and it is something
> which drivers should be able to cope with.
>
> I hope RAID is not still assuming that all requests are unique in this way?

No. RAID copes. If raid5 sees a write request for a block that it
already has a pending write request for, it will print a warning and
delay the second until the first completes.
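
As a rough illustration of that behaviour (warn, then hold the second write back until the first has finished), here is a minimal sketch. The `PendingWrites` class and its warning text are invented for the example and do not reflect the actual raid5 data structures.

```python
from collections import defaultdict, deque


class PendingWrites:
    """Toy model: at most one write in flight per sector; later ones wait."""

    def __init__(self):
        self.in_flight = {}                # sector -> payload being written
        self.waiting = defaultdict(deque)  # sector -> delayed payloads
        self.warnings = []                 # warnings that would be logged
        self.disk = {}                     # sector -> payload on disk

    def submit(self, sector, payload):
        if sector in self.in_flight:
            # A write for this sector is already pending: warn and delay.
            self.warnings.append(f"duplicate write for sector {sector}")
            self.waiting[sector].append(payload)
        else:
            self.in_flight[sector] = payload

    def complete(self, sector):
        """Finish the in-flight write, then start the next delayed one."""
        self.disk[sector] = self.in_flight.pop(sector)
        if self.waiting[sector]:
            self.in_flight[sector] = self.waiting[sector].popleft()


q = PendingWrites()
q.submit(281018464, "first")
q.submit(281018464, "second")  # delayed, with a warning
q.complete(281018464)          # "first" reaches the disk
q.complete(281018464)          # "second" follows it
print(q.warnings, q.disk)
```

The writes are serialised rather than rejected, so the driver stays consistent but whichever request arrives last still wins.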

In the case in question I don't think raid5 is contributing to the
problem. It is just providing extra information which might help point
towards the problem - i.e. it is confirming that some sort of aliasing
is happening.
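
Another hint in the same direction can be read from the ext3_add_entry message itself. The sketch below assumes the usual ext2/ext3 on-disk directory entry header (little-endian 32-bit inode, 16-bit rec_len, 8-bit name_len); the field values are the ones in the reported log line, and the reading of them is a guess rather than a diagnosis.

```python
import struct

# Values taken from the reported ext3_add_entry message.
inode, rec_len, name_len = 23880564, 21587, 76

# The check that fired: a directory record length must be a multiple of 4.
print("rec_len % 4 =", rec_len % 4)

# Re-pack the fields the way they would sit on disk (assumed layout).
raw = struct.pack("<IHB", inode, rec_len, name_len)
print(raw)

# Count how many of those bytes are printable ASCII.
printable = sum(32 <= b < 127 for b in raw)
print(printable, "of", len(raw), "bytes are printable ASCII")
```

Most of those bytes being printable text is what one would expect if file data had landed in a directory block, although seven bytes are far too few to be conclusive.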

NeilBrown

2003-03-18 13:55:17

by Dave Gilbert (Home)

Subject: Re: 2.4.20: ext3/raid5 - allocating block in system zone/multiple 1 requests for sector

Neil Brown wrote:

> The bug I found was specific to data=journal mode, and this certainly
> has more options for buffer aliasing. Were you using data=journal?

No.

Dave