2003-08-12 20:45:42

by Tupshin Harper

[permalink] [raw]
Subject: data corruption using raid0+lvm2+jfs with 2.6.0-test3

I have an LVM2 setup with four lvm groups. One of those groups sits on
top of a two disk raid 0 array. When writing to JFS partitions on that
lvm group, I get frequent, reproducible data corruption. This same setup
works fine with 2.4.22-pre kernels. The JFS may or may not be relevant,
since I haven't had a chance to use other filesystems as a control.
There are a number of instances of the following message associated with
the data corruption:

raid0_make_request bug: can't convert block across chunks or bigger than
8k 12436792 8

The 12436792 varies widely, the rest is always the same. The error is
coming from drivers/md/raid0.c.

-Tupshin


2003-08-12 21:12:18

by Tupshin Harper

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

Christophe Saout wrote:

>On Tue, 2003-08-12 at 22:45, Tupshin Harper wrote:
>
>
>
>>I have an LVM2 setup with four lvm groups. One of those groups sits on
>>top of a two disk raid 0 array. When writing to JFS partitions on that
>>lvm group, I get frequent, reproducible data corruption. This same setup
>>works fine with 2.4.22-pre kernels. The JFS may or may not be relevant,
>>since I haven't had a chance to use other filesystems as a control.
>>There are a number of instances of the following message associated with
>>the data corruption:
>>
>>raid0_make_request bug: can't convert block across chunks or bigger than
>>8k 12436792 8
>>
>>The 12436792 varies widely, the rest is always the same. The error is
>>coming from drivers/md/raid0.c.
>>
>>
>
>Why don't you try using an LVM2 stripe? That does the same thing raid0
>does, and I'm sure it doesn't suffer from such problems because it
>handles bios in a very generic and flexible manner.
>
Yes, I'm already converting to such a setup as I type this. I thought
that a data corruption issue was worth mentioning, however. ;-)

-Tupshin

2003-08-12 21:09:18

by Christophe Saout

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Tue, 2003-08-12 at 22:45, Tupshin Harper wrote:

> I have an LVM2 setup with four lvm groups. One of those groups sits on
> top of a two disk raid 0 array. When writing to JFS partitions on that
> lvm group, I get frequent, reproducible data corruption. This same setup
> works fine with 2.4.22-pre kernels. The JFS may or may not be relevant,
> since I haven't had a chance to use other filesystems as a control.
> There are a number of instances of the following message associated with
> the data corruption:
>
> raid0_make_request bug: can't convert block across chunks or bigger than
> 8k 12436792 8
>
> The 12436792 varies widely, the rest is always the same. The error is
> coming from drivers/md/raid0.c.

Why don't you try using an LVM2 stripe? That does the same thing raid0
does, and I'm sure it doesn't suffer from such problems because it
handles bios in a very generic and flexible manner.

Looking at the code:

        chunk_size = mddev->chunk_size >> 10;
        block = bio->bi_sector >> 1;

        if (unlikely(chunk_size < (block & (chunk_size - 1)) + (bio->bi_size >> 10))) {
                ...
                /* Sanity check -- queue functions should prevent this happening */
                if (bio->bi_vcnt != 1 || bio->bi_idx != 0)
                        goto bad_map; /* -> error message */

So, it looks like queue functions don't prevent this from happening.

md.c:
blk_queue_max_sectors(mddev->queue, chunk_size >> 9);
blk_queue_segment_boundary(mddev->queue, (chunk_size>>1) - 1);

I'm wondering, why can't there be more than one bvec?
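
To make the numbers concrete, here is a minimal userspace sketch (plain C,
not kernel code) of the same arithmetic, with made-up example values for an
8k chunk and a 4k bio: max_sectors only caps how large a bio may be, it says
nothing about where the bio starts within a chunk, so a bio that honours the
cap can still straddle a chunk boundary and trip the check above.

#include <stdio.h>

int main(void)
{
        /* made-up numbers: 8k chunks, a 4k bio, an arbitrary start block */
        unsigned int chunk_kb = 8;                 /* mddev->chunk_size >> 10 */
        unsigned int bio_kb = 4;                   /* bio->bi_size >> 10      */
        unsigned long long block_kb = 12436790ULL; /* bio->bi_sector >> 1     */

        /* offset of the bio's start within its chunk, in kilobytes */
        unsigned int offset_kb = block_kb & (chunk_kb - 1);

        if (chunk_kb < offset_kb + bio_kb)
                printf("crosses chunk boundary: offset %uk + size %uk > chunk %uk\n",
                       offset_kb, bio_kb, chunk_kb);
        else
                printf("fits inside one chunk\n");
        return 0;
}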

--
Christophe Saout <[email protected]>
Please avoid sending me Word or PowerPoint attachments.
See http://www.fsf.org/philosophy/no-word-attachments.html

2003-08-12 21:42:40

by Andrew Morton

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

Tupshin Harper <[email protected]> wrote:
>
> raid0_make_request bug: can't convert block across chunks or bigger than
> 8k 12436792 8

There is a fix for this at

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0-test3/2.6.0-test3-mm1/broken-out/bio-too-big-fix.patch

Results of testing are always appreciated...

2003-08-12 23:06:49

by NeilBrown

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Tuesday August 12, [email protected] wrote:
> Tupshin Harper <[email protected]> wrote:
> >
> > raid0_make_request bug: can't convert block across chunks or bigger than
> > 8k 12436792 8
>
> There is a fix for this at
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0-test3/2.6.0-test3-mm1/broken-out/bio-too-big-fix.patch
>
> Results of testing are always appreciated...

I don't think this will help. It is a different problem.

As far as I can tell, the problem is that dm doesn't honour the
merge_bvec_fn of the underlying device (neither does md for that
matter).
I think it does honour the max_sectors restriction, so it will only
allow a request as big as one chunk, but it will allow such a request
to span a chunk boundary.

Probably the simplest solution to this is to put in calls to
bio_split, which will need to be strengthened to handle multi-page bios.

The policy would be:
"a client of a block device *should* honour the various bio size
restrictions, and may suffer performance loss if it doesn't;
a block device driver *must* handle any bio it is passed, and may
call bio_split to help out".
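
To sketch what such a call site might look like (illustration only, not a
patch, and the helper name below is invented), assuming the bio_pair-based
bio_split()/bio_split_pool interface of this era, which can only split
single-page bios - hence the need to strengthen it:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Illustrative helper (invented name): split a bio that would cross a
 * chunk boundary and resubmit the two halves to the block layer. */
static int raid0_split_if_needed(struct bio *bio, unsigned int chunk_sects)
{
        /* sectors left in the chunk that bi_sector falls into */
        sector_t room = chunk_sects - (bio->bi_sector & (chunk_sects - 1));

        if (bio_sectors(bio) > room) {
                struct bio_pair *bp;

                /* bio_split_pool: the mempool fs/bio.c exports for this */
                bp = bio_split(bio, bio_split_pool, room);
                generic_make_request(&bp->bio1);
                generic_make_request(&bp->bio2);
                bio_pair_release(bp);
                return 1;       /* caller must not map the original bio */
        }
        return 0;               /* fits in one chunk: map it as usual */
}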

A better solution, which is too much for 2.6.0, is to have a cleaner
interface wherein the client of the block device uses a two-stage
process to submit requests.
Firstly it says:
I want to do IO at this location, what is the max number of sectors
allowed?
Then it adds pages to the bio up to that limit.
Finally it says:
OK, here is the request.

The first and final stages have to be properly paired so that a
device knows if there are any pending requests and can hold off any
device reconfiguration until all pending requests have completed.
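
Purely to make the proposal concrete, one hypothetical shape such a paired
interface could take; every name below is invented for illustration and
nothing like it exists in the tree:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical two-stage submission interface -- all names invented. */
struct io_reservation {
        struct block_device *bdev;
        sector_t             sector;      /* where the IO will start       */
        unsigned int         max_sectors; /* limit the driver reports back */
};

/* Stage 1: "I want to do IO at this location, what is the max size?"
 * Also marks a request as pending, so the driver can hold off any
 * reconfiguration until the matching submit arrives. */
int blk_reserve_io(struct block_device *bdev, sector_t sector,
                   struct io_reservation *res);

/* Stage 2 (must pair with blk_reserve_io): "OK, here is the request."
 * The bio must not exceed res->max_sectors. */
void blk_submit_reserved(struct io_reservation *res, struct bio *bio);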

NeilBrown

2003-08-13 07:25:47

by Joe Thornber

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> A better solution, which is too much for 2.6.0, is to have a cleaner
> interface wherein the client of the block device uses a two-stage
> process to submit requests.
> Firstly it says:
> I want to do IO at this location, what is the max number of sectors
> allowed?
> Then it adds pages to the bio up to that limit.
> Finally it says:
> OK, here is the request.
>
> The first and final stages have to be properly paired so that a
> device knows if there are any pending requests and can hold off any
> device reconfiguration until all pending requests have completed.

This is exactly what I'd like to do. The merge_bvec_fn is unusable by
dm (and probably md) because this function is mapping specific - so
dm_suspend/dm_resume need to be lifted above it.

- Joe

2003-08-15 21:27:15

by Mike Fedyk

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> On Tuesday August 12, [email protected] wrote:
> > Tupshin Harper <[email protected]> wrote:
> > >
> > > raid0_make_request bug: can't convert block across chunks or bigger than
> > > 8k 12436792 8
> >
> > There is a fix for this at
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0-test3/2.6.0-test3-mm1/broken-out/bio-too-big-fix.patch
> >
> > Results of testing are always appreciated...
>
> I don't think this will help. It is a different problem.
>
> As far as I can tell, the problem is that dm doesn't honour the
> merge_bvec_fn of the underlying device (neither does md for that
> matter).
> I think it does honour the max_sectors restriction, so it will only
> allow a request as big as one chunk, but it will allow such a request
> to span a chunk boundary.
>
> Probably the simplest solution to this is to put in calls to
> bio_split, which will need to be strengthened to handle multi-page bios.
>
> The policy would be:
> "a client of a block device *should* honour the various bio size
> restrictions, and may suffer performance loss if it doesn't;
> a block device driver *must* handle any bio it is passed, and may
> call bio_split to help out".
>

Any progress on this?

2003-08-16 08:01:46

by NeilBrown

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Friday August 15, [email protected] wrote:
> On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> > On Tuesday August 12, [email protected] wrote:
> > > Tupshin Harper <[email protected]> wrote:
> > > >
> > > > raid0_make_request bug: can't convert block across chunks or bigger than
> > > > 8k 12436792 8

> >
> > Probably the simplest solution to this is to put in calls to
> > bio_split, which will need to be strengthened to handle multi-page bios.
> >
> > The policy would be:
> > "a client of a block device *should* honour the various bio size
> > restrictions, and may suffer performance loss if it doesn't;
> > a block device driver *must* handle any bio it is passed, and may
> > call bio_split to help out".
> >
>
> Any progress on this?

No, and I doubt there will be in a big hurry, unless I come up with an
easy way to make lvm-over-raid0 break instantly instead of eventually.

I think that for now you should assume that lvm over raid0 (or raid0
over lvm) simply isn't supported. As lvm (aka dm) supports striping,
it shouldn't be needed.

NeilBrown

2003-08-16 14:19:42

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On 2003-08-16T18:00:21,
Neil Brown <[email protected]> said:

> I think that for now you should assume that lvm over raid0 (or raid0
> over lvm) simply isn't supported. As lvm (aka dm) supports striping,
> it shouldn't be needed.

Can raid0 detect that it is being accessed via DM and 'fail-fast' and
refuse to ever come up?

This probably also suggests that the lvm2 and evms2 folks should refuse
to set this up in their tools...


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
SuSE Labs - Research & Development, SuSE Linux AG

High Availability, n.: Patching up complex systems with even more complexity.



2003-08-16 23:52:50

by Mike Fedyk

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Sat, Aug 16, 2003 at 06:00:21PM +1000, Neil Brown wrote:
> On Friday August 15, [email protected] wrote:
> > On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> > > On Tuesday August 12, [email protected] wrote:
> > > > Tupshin Harper <[email protected]> wrote:
> > > > >
> > > > > raid0_make_request bug: can't convert block across chunks or bigger than
> > > > > 8k 12436792 8
>
> > >
> > > Probably the simplest solution to this is to put in calls to
> > > bio_split, which will need to be strengthened to handle multi-page bios.
> > >
> > > The policy would be:
> > > "a client of a block device *should* honour the various bio size
> > > restrictions, and may suffer performance loss if it doesn't;
> > > a block device driver *must* handle any bio it is passed, and may
> > > call bio_split to help out".
> > >
> >
> > Any progress on this?
>
> No, and I doubt there will be in a big hurry, unless I come up with an
> easy way to make lvm-over-raid0 break instantly instead of eventually.
>
> I think that for now you should assume that lvm over raid0 (or raid0
> over lvm) simply isn't supported. As lvm (aka dm) supports striping,
> it shouldn't be needed.

I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
drives in a linear md "array".

I'm guessing this will hit this bug too?

I have a couple systems that use software raid5 that I'll avoid putting
2.6-test on until I know the raid is more reliable (or is this only with
md+lvm?)

2003-08-17 00:13:15

by NeilBrown

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Saturday August 16, [email protected] wrote:
> On Sat, Aug 16, 2003 at 06:00:21PM +1000, Neil Brown wrote:
> > On Friday August 15, [email protected] wrote:
> > > On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> > > > On Tuesday August 12, [email protected] wrote:
> > > > > Tupshin Harper <[email protected]> wrote:
> > > > > >
> > > > > > raid0_make_request bug: can't convert block across chunks or bigger than
> > > > > > 8k 12436792 8
> >
> > > >
> > > > Probably the simplest solution to this is to put in calls to
> > > > bio_split, which will need to be strengthened to handle multi-page bios.
> > > >
> > > > The policy would be:
> > > > "a client of a block device *should* honour the various bio size
> > > > restrictions, and may suffer performance loss if it doesn't;
> > > > a block device driver *must* handle any bio it is passed, and may
> > > > call bio_split to help out".
> > > >
> > >
> > > Any progress on this?
> >
> > No, and I doubt there will be in a big hurry, unless I come up with an
> > easy way to make lvm-over-raid0 break instantly instead of eventually.
> >
> > I think that for now you should assume that lvm over raid0 (or raid0
> > over lvm) simply isn't supported. As lvm (aka dm) supports striping,
> > it shouldn't be needed.
>
> I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
> drives in a linear md "array".
>
> I'm guessing this will hit this bug too?

This should be safe. raid5 only ever submits 1-page (4K) requests
that are page aligned, and linear arrays will have the boundary
between drives 4k aligned (actually "chunksize" aligned, and chunksize
is at least 4k).

So raid5 should be safe over everything (unless dm allows striping
with a chunk size less than pagesize).

Thinks: as an interim solution for other raid levels - if the
underlying device has a merge_bvec_function which is being ignored, we
could set max_sectors to PAGE_SIZE/512. This should be safe, though
possibly not optimal (but "safe" trumps "optimal" any day).
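
For concreteness, a rough sketch of how that interim clamp might look when
md binds a component device; the helper name is invented, the header choice
is an assumption, and this is untested:

#include <linux/blkdev.h>
#include <linux/raid/md.h>      /* assumed: pulls in mddev_t, mdk_rdev_t */

/* Illustrative only: if the component's queue defines a merge_bvec_fn
 * that md will not honour, never build a bio larger than one page. */
static void md_clamp_for_bvec_fn(mddev_t *mddev, mdk_rdev_t *rdev)
{
        request_queue_t *q = bdev_get_queue(rdev->bdev);

        if (q->merge_bvec_fn)
                blk_queue_max_sectors(mddev->queue, PAGE_SIZE >> 9);
}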

NeilBrown
>
> I have a couple systems that use software raid5 that I'll avoid putting
> 2.6-test on until I know the raid is more reliable (or is this only with
> md+lvm?)

2003-08-17 17:51:01

by Mike Fedyk

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Sun, Aug 17, 2003 at 10:12:27AM +1000, Neil Brown wrote:
> On Saturday August 16, [email protected] wrote:
> > I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
> > drives in a linear md "array".
> >
> > I'm guessing this will hit this bug too?
>
> This should be safe. raid5 only ever submits 1-page (4K) requests
> that are page aligned, and linear arrays will have the boundary
> between drives 4k aligned (actually "chunksize" aligned, and chunksize
> is at least 4k).
>

So why is this hitting with raid0? Is lvm2 on top of md the problem, while
md on top of lvm2 is OK?

> So raid5 should be safe over everything (unless dm allows striping
> with a chunk size less than pagesize).
>
> Thinks: as an interim solution for other raid levels - if the
> underlying device has a merge_bvec_function which is being ignored, we
> could set max_sectors to PAGE_SIZE/512. This should be safe, though
> possibly not optimal (but "safe" trumps "optimal" any day).

Assuming that sectors are always 512 bytes (true for any hard drive I've
seen) that will be 512 * 8 = one 4k page.

Any chance sector != 512?

2003-08-17 23:14:28

by NeilBrown

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Sunday August 17, [email protected] wrote:
> On Sun, Aug 17, 2003 at 10:12:27AM +1000, Neil Brown wrote:
> > On Saturday August 16, [email protected] wrote:
> > > I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
> > > drives in a linear md "array".
> > >
> > > I'm guessing this will hit this bug too?
> >
> > This should be safe. raid5 only ever submits 1-page (4K) requests
> > that are page aligned, and linear arrays will have the boundary
> > between drives 4k aligned (actually "chunksize" aligned, and chunksize
> > is at least 4k).
> >
>
> So why is this hitting with raid0? Is lvm2 on top of md the problem, while
> md on top of lvm2 is OK?
>

The various raid levels under md are in many ways quite independent.
You cannot generalise about "md works" or "md doesn't work"; you have
to talk about the specific raid levels.

The problem happens when
1/ The underlying device defines a merge_bvec_fn, and
2/ the driver (meta-device) that uses that device
2a/ does not honour the merge_bvec_fn, and
2b/ passes on requests larger than one page.

md/linear, md/raid0, and lvm all define a merge_bvec_fn, and so can be
a problem as an underlying device by point (1).

md/* and lvm do not honour merge_bvec_fn and so can be a problem as a
meta-device by 2a.
However md/raid5 never passes on requests larger than one page, so it
escapes being a problem by point 2b.

So the problem can happen with
md/linear, md/raid0, or lvm being a component of
md/linear, md/raid0, md/raid1, md/multipath, lvm.

(I have possibly oversimplified the situation with lvm due to lack of
precise knowledge).

I hope that clarifies the situation.

> > So raid5 should be safe over everything (unless dm allows striping
> > with a chunk size less than pagesize).
> >
> > Thinks: as an interim solution for other raid levels - if the
> > underlying device has a merge_bvec_function which is being ignored, we
> > could set max_sectors to PAGE_SIZE/512. This should be safe, though
> > possibly not optimal (but "safe" trumps "optimal" any day).
>
> Assuming that sectors are always 512 bytes (true for any hard drive I've
> seen) that will be 512 * 8 = one 4k page.
>
> Any chance sector != 512?

No. 'sector' in the kernel means '512 bytes'.
Some devices might require requests to be at least 2 sectors long and
have even sector addresses because they have physical sectors that are
1K, but the parameters like max_sectors are still in multiples of 512
bytes.

NeilBrown

2003-08-18 00:28:41

by Mike Fedyk

[permalink] [raw]
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

On Mon, Aug 18, 2003 at 09:14:07AM +1000, Neil Brown wrote:
> The various raid levels under md are in many ways quite independent.
> You cannot generalise about "md works" or "md doesn't work"; you have
> to talk about the specific raid levels.
>
> The problem happens when
> 1/ The underlying device defines a merge_bvec_fn, and
> 2/ the driver (meta-device) that uses that device
> 2a/ does not honour the merge_bvec_fn, and
> 2b/ passes on requests larger than one page.
>
> md/linear, md/raid0, and lvm all define a merge_bvec_fn, and so can be
> a problem as an underlying device by point (1).
>
> md/* and lvm do not honour merge_bvec_fn and so can be a problem as a
> meta-device by 2a.
> However md/raid5 never passes on requests larger than one page, so it
> escapes being a problem by point 2b.
>
> So the problem can happen with
> md/linear, md/raid0, or lvm being a component of
> md/linear, md/raid0, md/raid1, md/multipath, lvm.
>
> (I have possibly oversimplified the situation with lvm due to lack of
> precise knowledge).
>
> I hope that clarifies the situation.

Thanks Neil. That was very informative. :)

> > On Sun, Aug 17, 2003 at 10:12:27AM +1000, Neil Brown wrote:
> > > So raid5 should be safe over everything (unless dm allows striping
> > > with a chunk size less than pagesize).
> > >
> > > Thinks: as an interim solution for other raid levels - if the
> > > underlying device has a merge_bvec_function which is being ignored, we
> > > could set max_sectors to PAGE_SIZE/512. This should be safe, though
> > > possibly not optimal (but "safe" trumps "optimal" any day).
> >
> > Assuming that sectors are always 512 bytes (true for any hard drive I've
> > seen) that will be 512 * 8 = one 4k page.
> >
> > Any chance sector != 512?
>
> No. 'sector' in the kernel means '512 bytes'.
> Some devices might require requests to be at least 2 sectors long and
> have even sector addresses because they have physical sectors that are
> 1K, but the parameters like max_sectors are still in multiples of 512
> bytes.
>
> NeilBrown

Any idea of the ETA for that nice interim patch?

And if there is already a merge_bvec_function defined and coded, why is it
not being used?! Isn't that supposed to be detected by the BIO subsystem,
and used automatically when it's defined? Or were there some bugs found in
it and it was disabled temporarily? Maybe not everyone agrees on bio
splitting/merging? (I seem to recall a thread about that a while back, but I
thought it was resolved...)