RFCI == Request For Clever Ideas.
Hi all..
I want to be able to partition "md" raid arrays.
e.g. I want to be able to use RAID1 to mirror sda and sdb as whole
drives, and then partition that into root, swap, other (or whatever
suits the particular situation).
I am already doing this on 2.4 based kernels using a local patch
which is unlikely ever to get into the kernel.
This patch declares a new major number and uses it to address up to 15
partitions on each of the first 16 md arrays.
Not only does this limit you to partitioning only 16 md arrays, but
it means that there are two separate devices (major+minor) that
access the same device. In 2.4 this is just untidy. In 2.6 it would
also subvert the exclusive access provided by bd_claim, and that
isn't a good thing.
So I'm looking for a nice clean way to provide partitioning of md
devices in 2.6.
In 2.6 we have 20 bits for minor number and I am quite happy to
require the use of them - i.e. there is no need for the approach to
work equally well in 2.4.
Backwards compatibility is fairly important, and that means that the
first 256 minor numbers for block major number '9' have to still be
whole md arrays.
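
To make the constraint concrete, here is a rough sketch of how a 20-bit minor sits next to the major in a device number, and which md device numbers are pinned by backwards compatibility. It assumes the 12-bit major / 20-bit minor split used inside 2.6 kernels; mkdev/major/minor mirror the kernel's MKDEV/MAJOR/MINOR macros, while is_legacy_md is a name made up for this illustration.

```python
MINORBITS = 20                      # width of the minor number in 2.6
MINORMASK = (1 << MINORBITS) - 1
MD_MAJOR = 9                        # block major used by md
LEGACY_MINORS = 256                 # minors 0..255 must remain whole arrays


def mkdev(major, minor):
    """Pack major and minor into one device number (kernel-internal layout)."""
    return (major << MINORBITS) | minor


def major(dev):
    return dev >> MINORBITS


def minor(dev):
    return dev & MINORMASK


def is_legacy_md(dev):
    """Hypothetical helper: true for the md minors that old setups rely on."""
    return major(dev) == MD_MAJOR and minor(dev) < LEGACY_MINORS
```

Anything for which is_legacy_md() is false is free to be reinterpreted, which is the space the options below are fighting over.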
Some options that occur to me are:
1/ compile time option that redefines major 9 to use 6 bits for
partition information. This throws backwards compatibility out the
window and is a nice clean way forward. I think this would be a
support nightmare and I wouldn't impose it on anyone.
2/ new major number which uses 6 bits for partitioning and provide
some sort of interlock so that you cannot access the same raid
array from both the old and the new major at the same time.
I'm not sure how easy the interlock would be, but it is probably
do-able. The problem is that I would like a well-defined major
number and Linus doesn't seem keen on any more of those (though I
realise that isn't unanimous).
There was once talk of a 'disk' major number and all the things
that looked like discs would come under that somehow, but that
doesn't seem to have eventuated. Maybe it should, but there would
still be the interlock problem.
3/ define minor numbers of block-major-9 that are larger than 255 to
have 6 bits of partitioning information. i.e.
9,0 -> md0
9,1 -> md1
...
9,255 -> md255
9,256 -> md256
9,257 -> md256p1
9,258 -> md256p2
...
9,320 -> md257
9,321 -> md257p1
...
This has the least impact on other systems and is in some ways simplest,
but it has the problem of lack of uniformity. You wouldn't be able
to partition md0, but that isn't a big problem as long as you can
partition some md arrays.
4/ just use 'dm', or write a new 'md' module that can present a
partition of a device. Then leave the setup to user-space.
This has the least impact on the kernel, but the most impact on
user-space. It would not be too hard to create a userspace tool
that made most of this fairly transparent.
Particularly if it was a new 'md' personality, userspace could
then effectively decide how the minor numbers of block-major-9
were used with respect to partitioning.
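
The numbering in option 3 can be sketched roughly as below. This is only an illustration in Python, not md.c code: the names md_name and partitionable_arrays and the constants are invented for the example, and it assumes that partition p of an array sits p minors after that array's whole-device minor.

```python
MINOR_BITS = 20        # minor-number width available in 2.6
LEGACY_ARRAYS = 256    # minors 0..255 stay whole, unpartitionable arrays
PART_BITS = 6          # 64 minors per array: whole device + 63 partitions


def md_name(minor):
    """Map a minor under block major 9 to a device name (option 3)."""
    if not 0 <= minor < (1 << MINOR_BITS):
        raise ValueError("minor out of range")
    if minor < LEGACY_ARRAYS:
        return "md%d" % minor
    offset = minor - LEGACY_ARRAYS
    array = LEGACY_ARRAYS + (offset >> PART_BITS)
    part = offset & ((1 << PART_BITS) - 1)
    if part == 0:
        return "md%d" % array          # the whole partitionable array
    return "md%dp%d" % (array, part)


def partitionable_arrays():
    """How many partitionable arrays fit above the legacy range."""
    return ((1 << MINOR_BITS) - LEGACY_ARRAYS) >> PART_BITS
```

Under these assumptions md_name(320) gives "md257", and the space above minor 255 holds 16380 partitionable arrays, so the non-uniformity costs very little in capacity.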
There are probably other options and I would be happy to hear them.
My personal preference is wavering between 4 (using md) and 2.
Possibly I should learn more about how 'dm' could handle it for me..
Opinions welcome,
Thanks,
NeilBrown
On Fri, Nov 14, 2003 at 02:11:15PM +1100, Neil Brown wrote:
> This patch declares a new major number and uses it to address upto 15
> partitions on each of the first 16 md arrays.
> Not only does this limit you do only partitioning 16 md arrays, but
> it means that there are two separate devices (major+minor) that
> access the same device. In 2.4 this is just untidy. In 2.6 it would
> also subvert the exclusive access provided by bd_claim, and that
> isn't a good thing.
It breaks all sorts of exclusions common for 2.4 and 2.6.
> 2/ new major number which uses 6 bits for partitioning and provide
> some sort of interlock so that you cannot access the same raid
> array from both the old and the new major at the same time.
> I'm not sure how easy the interlock would be, but it is probably
> do-able. The problem is that I would like a well-defined major
Very painful.
> number and Linus doesn't seem keen on any more of those (though I
> realise that isn't unanimous).
> There was once talk of a 'disk' major number and all the things
> that looked like discs would come under that somehow, but that
> doesn't seem to have eventuated. Maybe it should, but there would
> still be the interlock problem
Yup. And it won't be easy (if at all feasible).
> 3/ define minor numbers of block-major-9 that are larger than 255 to
> have 6 bits of partitioning information. i.e.
> 9,0 -> md0
> 9,1 -> md1
> ...
> 9,255 -> md255
> 9,256 -> md256
> 9,257 -> md256p1
> 9,257 -> md256p2
> ...
> 9,320 -> md257
> 9,321 -> md257p1
> ...
> This has least impact on other system and is in some ways simplest,
> but it has the problem of lack of uniformity. You wouldn't be able
> to partition md0, but that isn't a big problem as long as you can
> partition some md arrays.
That works and is trivial to implement.
> 4/ just use 'dm', or write a new 'md' module that can present a
> partition of a device. Then leave the setup to user-space.
> This is least impact on the kernel, but most impact on
> user-space. It would not be too hard to create a userspace tool
> that made most of this fairly transparent.
> Particularly if it was a new 'md' personality, userspace could
> then effectively decide how the minor numbers of block-major-9
> were used with respect to partitioning.
Maybe...
>
> There are probably other options and I would be happy to hear them.
> My personal preference is wavering between 4 (using md) and 2.
> Possibly I should learn more about how 'dm' could handle it for me..
(2) is going to be very nasty. Keep in mind that there is locking
based on having a unique struct block_device. And the entire area is not
too nice to start with - we still have lots of cleanup to do in 2.7.
Try it if you really want to, but I'd expect a lot of hard-to-plug
holes.
(3) is absolutely trivial - will take 10--30 lines in md.c.
No comments on (4).
On Fri, 14 Nov 2003, Neil Brown wrote:
[...]
> 3/ define minor numbers of block-major-9 that are larger than 255 to
> have 6 bits of partitioning information. i.e.
> 9,0 -> md0
> 9,1 -> md1
> ...
> 9,255 -> md255
> 9,256 -> md256
> 9,257 -> md256p1
> 9,257 -> md256p2
> ...
> 9,320 -> md257
> 9,321 -> md257p1
> ...
> This has least impact on other system and is in some ways simplest,
> but it has the problem of lack of uniformity. You wouldn't be able
> to partition md0, but that isn't a big problem as long as you can
> partition some md arrays.
How about assigning the partition space above
9,0 => md0
9,1 => md1
...
9,257 => md0p1
9,258 => md0p2
...
9,320 => md1p1
That should be sensibly backward compatible, I think, and still allow
all the MD devices to be partitioned.
Daniel
--
No, no, you're not thinking, you're just being logical.
-- Niels Bohr
On Thu, 2003-11-13 at 22:11, Neil Brown wrote:
> RFCI == Request For Clever Ideas.
>
> Hi all..
>
> I want be able to partition "md" raid arrays.
> e.g. I want to be able to use RAID1 to mirror sda and sdb as whole
> drives, and then partitions that into root, swap, other (or whatever
> suits the particular situation).
<snip>
Can't LVM do this? I have a raid array (mirror) that is LVM'd into
multiple partitions. It currently runs 2.4, but it should work fine
with 2.6, right? All the rest of my boxes have 2.6 and LVM, but no raid
(no duplicate hard drives).
--
Daniel Gryniewicz <[email protected]>
On Friday November 14, [email protected] wrote:
> On Thu, 2003-11-13 at 22:11, Neil Brown wrote:
> > RFCI == Request For Clever Ideas.
> >
> > Hi all..
> >
> > I want be able to partition "md" raid arrays.
> > e.g. I want to be able to use RAID1 to mirror sda and sdb as whole
> > drives, and then partitions that into root, swap, other (or whatever
> > suits the particular situation).
>
> <snip>
>
> Can't LVM do this? I have a raid array (mirror) that is LVM'd into
> multiple partitions. It currently runs 2.4, but it should work fine
> with 2.6, right? All the rest of my boxes have 2.6 and LVM, but no raid
> (no duplicate hard drives).
Fair question.
I want it to work with "standard" partition tables such as MSDOS
partitions etc.
I would like to be able to take a single drive that is being used and
has partitions on it, and to add an identical drive beside it, mirror
them, and get a mirrored pair that looked much like the original
drive.
There are issues with the raid superblock but assuming they can be
solved, I want partitioning to work easily.
Can LVM work happily with 'legacy' partitioning information?
NeilBrown
On Friday November 14, [email protected] wrote:
>
Thanks for the comments.
> (3) is absolutely trivial - will take 10--30 lines in md.c.
Yes. I've actually started working on that (clearing some stuff up first
so it becomes even more trivial). I was feeling uncomfortable about
the non-uniform interpretation of minor numbers. I know the block
layer can handle it. I'm wondering what I should expect of users
though :-)
NeilBrown
On Fri, Nov 14, 2003 at 04:27:51PM +1100, Daniel Pittman wrote:
> How about assigning the partition space above
>
> 9,0 => md0
> 9,1 => md1
> ...
> 9,257 => md0p1
> 9,258 => md0p2
> ...
> 9,320 => md1p1
>
> That should be sensibly backward compatible, I think, and still allow
> all the MD devices to be partitioned.
The kernel should be mostly OK with partitions having device numbers far from
that of the entire disk. However, mostly != entirely and I can't tell right
now how much work that would take. One obvious problem is boot-time
stuff - the code that parses root= and its ilk does so by digging in sysfs and
that definitely will need changes. There might be more. I'm more or less sure
that all common codepaths are clean and will be OK with having probe do
whatever it wants, but I wouldn't bet a dime on the ioctl side/procfs/sysfs/
devfs.
I would *really* expect breakage in userland, though, and that might be a
killer. Up until now, we had all partitions getting device numbers in a range
that included entire disk and did not include any device numbers of anything
unrelated.
Folks, we are in 2.6.0-test freeze. Neil's #3 fits entirely in md.c, is small
and has minimal damage potential for userland code. Playing games with the
scheme above, nice as it might be, is *not* safe for now and it's not even
safe for 2.6.early.
On Fri, Nov 14, 2003 at 02:11:15PM +1100, Neil Brown wrote:
> 3/ define minor numbers of block-major-9 that are larger than 255 to
> have 6 bits of partitioning information. i.e.
> 9,0 -> md0
> 9,1 -> md1
> ...
> 9,255 -> md255
> 9,256 -> md256
> 9,257 -> md256p1
> 9,257 -> md256p2
> ...
> 9,320 -> md257
> 9,321 -> md257p1
> ...
> This has least impact on other system and is in some ways simplest,
> but it has the problem of lack of uniformity. You wouldn't be able
> to partition md0, but that isn't a big problem as long as you can
> partition some md arrays.
May I write in hex? I feel very uncomfortable having 20-bit numbers in
decimal.
9,0x00000 -> md0
...
9,0x000FF -> md255
9,0x00100 -> md0p1
9,0x00200 -> md0p2
...
one would expect it to be the other way around, but it is still fairly
intuitive, and it keeps uniformity.
Uniformity is important, because we can do binary ops on the minor
number and get consistent results.
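
A minimal sketch of what those binary ops might look like, assuming the array number lives in the low 8 bits of the minor and the partition number in the bits above it (which is what the table above implies). The helper names are invented for the example.

```python
ARRAY_BITS = 8                        # low 8 bits: which md array (0..255)
ARRAY_MASK = (1 << ARRAY_BITS) - 1


def split_minor(minor):
    """Return (array, partition); partition 0 means the whole array."""
    return minor & ARRAY_MASK, minor >> ARRAY_BITS


def make_minor(array, part):
    """Inverse of split_minor."""
    return (part << ARRAY_BITS) | array


def md_name_uniform(minor):
    array, part = split_minor(minor)
    if part == 0:
        return "md%d" % array
    return "md%dp%d" % (array, part)
```

With this layout every minor decodes the same way: the whole array for any partition is simply minor & 0xFF, and md0 is as partitionable as any other array.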
L.
--
Luca Berra -- [email protected]
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \
Hi Neil,
On Fri, 14 Nov 2003, Neil Brown wrote:
> 4/ just use 'dm', or write a new 'md' module that can present a
> partition of a device. Then leave the setup to user-space.
> This is least impact on the kernel, but most impact on
> user-space. It would not be too hard to create a userspace tool
From a user/admin POV, I'd say go with dm. Even the 'virt-partition
by way of dm' isn't needed if you have LVM tools, but I guess it might be
nice to obviate the need for extra tools. (though, I couldn't live without
LVM anymore, it's just /too/ useful.).
regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
warning: do not ever send email to [email protected]
Fortune:
It is against the law for a monster to enter the corporate limits of
Urbana, Illinois.
On 2003-11-14T16:30:42,
Neil Brown <[email protected]> said:
> There are issues with the raid superblock but assuming they can be
> solved, I want partitioning to work easily.
>
> Can LVM work happily with 'legacy' partitioning information?
I'd really suggest to run DM (either LVM2 or EVMS2) on top of md
instead. It's much more flexible; I don't see any benefit in 'old style'
partition information, which has all sorts of problems - ie,
non-transactional updates (_why_ were you running raid again? ;), static
as they can't be modified during runtime etc.
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett
On Fri, Nov 14, 2003 at 11:16:47AM +0100, Lars Marowsky-Bree wrote:
> On 2003-11-14T16:30:42,
> Neil Brown <[email protected]> said:
>
> > There are issues with the raid superblock but assuming they can be
> > solved, I want partitioning to work easily.
> >
> > Can LVM work happily with 'legacy' partitioning information?
>
> I'd really suggest to run DM (either LVM2 or EVMS2) on top of md
> instead. It's much more flexible; I don't see any benefit in 'old style'
> partition information, which has all sorts of problems - ie,
> non-transactional updates (_why_ were you running raid again? ;), static
> as they can't be modified during runtime etc.
This brings up a tangent point... partitions on top of RAID are a new
thing, which means that one has the chance to define the partition
format.
And I kinda like EFI partition format, a lot better than the other
common ones...
Jeff
P.S. No, this isn't a blanket endorsement of EFI as a whole :)
> This brings up a tangent point... partitions on top of RAID are a new
> thing, which means that one has the chance to define the partition
> format.
>
> And I kinda like EFI partition format, a lot better than the other
> common ones...
Any reason why the current partition-mapping code couldn't be extended
to handle partition detection on a generic block device (which is what
MD presents I think) instead of a struct gendisk? Then it wouldn't
matter which scheme someone wanted to use - any scheme provided for in
the kernel (or userspace if partx were extended) could be used.
I'm partial to the EFI format too, but wouldn't want to write that
code a second time, once for normal disks, and once for md.
Thanks,
Matt
--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
On Fri, Nov 14, 2003 at 03:44:23PM -0600, Matt Domsch wrote:
> Any reason why the current partition-mapping code couldn't be extended
> to handle partition detection on a generic block device (which is what
> MD presents I think) instead of a struct gendisk? Then it wouldn't
Any block_device has a gendisk - md.c ones included. The problem is where
to put device numbers of partitions.
My hard disks support some programmable options.
Automatic Write Reallocation Enable (AWRE):
On, drive automatically relocates bad blocks detected during write
operations. Off, drive creates Check Condition status with sense key of
Medium Error if bad blocks are detected during write operations.
Automatic Read Reallocation Enable (ARRE):
On, drive automatically relocates bad blocks detected during read
operations. Off, drive creates Check Condition status with sense key of
Medium Error if bad blocks are detected during read operations.
These options are both off.
Would md be happier if these were on? This would hide recoverable errors
from md.
Any opinions?
My disks are ST118202LC.
Thanks.