From: Neil Brown
To: Mike Fedyk
Cc: Joe Thornber, Andrew Morton, Tupshin Harper,
    linux-kernel@vger.kernel.org, Jens Axboe
Date: Mon, 18 Aug 2003 09:14:07 +1000
Subject: Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
In-Reply-To: message from Mike Fedyk on Sunday August 17

On Sun, Aug 17, 2003 at 10:12:27AM +1000, Neil Brown wrote:
> > On Saturday August 16, mfedyk@matchmail.com wrote:
> > > I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
> > > drives in a linear md "array".
> > >
> > > I'm guessing this will hit this bug too?
> >
> > This should be safe.  raid5 only ever submits 1-page (4K) requests
> > that are page aligned, and linear arrays will have the boundary
> > between drives 4k aligned (actually "chunksize" aligned, and chunksize
> > is at least 4k).
> >
>
> So why is this hitting with raid0?  Is lvm2 on top of md the problem and md
> on lvm2 is ok?
>

The various raid levels under md are in many ways quite independent.
You cannot generalise about "md works" or "md doesn't work"; you have
to talk about the specific raid levels.

The problem happens when
 1/  the underlying device defines a merge_bvec_fn, and
 2/  the driver (meta-device) that uses that device
 2a/   does not honour that merge_bvec_fn, and
 2b/   passes on requests larger than one page.

md/linear, md/raid0, and lvm all define a merge_bvec_fn, and so can be
a problem as an underlying device by point (1).

md/* and lvm do not honour merge_bvec_fn, and so can be a problem as a
meta-device by 2a.  However md/raid5 never passes on requests larger
than one page, so it escapes being a problem by point 2b.

So the problem can happen with md/linear, md/raid0, or lvm being a
component of md/linear, md/raid0, md/raid1, md/multipath, or lvm.
(I have possibly oversimplified the situation with lvm due to lack of
precise knowledge.)

I hope that clarifies the situation.

> > So raid5 should be safe over everything (unless dm allows striping
> > with a chunk size less than pagesize).
> >
> > Thinks: as an interim solution for other raid levels - if the
> > underlying device has a merge_bvec_function which is being ignored, we
> > could set max_sectors to PAGE_SIZE/512.  This should be safe, though
> > possibly not optimal (but "safe" trumps "optimal" any day).
>
> Assuming that sectors are always 512 bytes (true for any hard drive I've
> seen) that will be 512 * 8 = one 4k page.
>
> Any chance sector != 512?

No.  'sector' in the kernel means '512 bytes'.
Some devices might require requests to be at least 2 sectors long, and
to start at even sector addresses, because their physical sectors are
1K; but parameters like max_sectors are still expressed in multiples of
512 bytes.

NeilBrown
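
To make the raid5 point above concrete, here is a minimal sketch of
building a single-page bio against the 2.6-era block API (bio_alloc(),
bio_add_page(), the bi_bdev/bi_sector fields); the helper name
one_page_bio() is invented for illustration and is not code taken from
the raid5 driver.  A bio carrying exactly one page-aligned page can
never straddle a 4K-aligned chunk boundary of an underlying linear,
raid0 or lvm component, so no merge_bvec_fn restriction can be violated.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Illustrative only: build a bio that covers exactly one page.
 * Chunk boundaries of linear/raid0/lvm components are at least
 * 4K-aligned, so a single page-aligned page can never cross one.
 */
static struct bio *one_page_bio(struct block_device *bdev,
                                sector_t sector, struct page *page)
{
        struct bio *bio = bio_alloc(GFP_NOIO, 1);   /* room for one bio_vec */

        if (!bio)
                return NULL;
        bio->bi_bdev = bdev;
        bio->bi_sector = sector;                    /* units of 512 bytes */
        if (bio_add_page(bio, page, PAGE_SIZE, 0) != PAGE_SIZE) {
                bio_put(bio);
                return NULL;
        }
        return bio;
}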
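
And a sketch of the interim idea quoted above: if a component device
defines a merge_bvec_fn that the stacking driver is going to ignore,
cap the stacking driver's max_sectors at one page.  This assumes the
2.6 interfaces bdev_get_queue() and blk_queue_max_sectors() and the
merge_bvec_fn field of struct request_queue; the helper name and the
point at which a stacking driver would call it are hypothetical.

#include <linux/blkdev.h>

/*
 * Hypothetical helper for a stacking driver (md/linear, md/raid0,
 * md/raid1, md/multipath, dm) that does not honour its components'
 * merge_bvec_fn: if a component defines one, limit our own requests
 * to PAGE_SIZE/512 sectors (8 sectors = one 4K page) so everything
 * passed down fits in a single page - safe, though not optimal.
 */
static void cap_to_one_page(struct request_queue *q,
                            struct block_device *component)
{
        struct request_queue *cq = bdev_get_queue(component);

        if (cq && cq->merge_bvec_fn)
                blk_queue_max_sectors(q, PAGE_SIZE >> 9);
}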