2004-01-13 00:35:50

by Scott Long

Subject: Proposed enhancements to MD

--- linux-2.6.1/drivers/md/md.c	2004-01-08 23:59:19.000000000 -0700
+++ md/linux-2.6.1/drivers/md/md.c	2004-01-12 14:46:33.818544376 -0700
@@ -1446,6 +1446,9 @@
 	return 1;
 }
 
+/* MD Partition definitions */
+#define MDP_MINOR_COUNT 16
+#define MDP_MINOR_SHIFT 4
 
 static struct kobject *md_probe(dev_t dev, int *part, void *data)
 {
@@ -1453,6 +1456,7 @@
 	int unit = *part;
 	mddev_t *mddev = mddev_find(unit);
 	struct gendisk *disk;
+	int index;
 
 	if (!mddev)
 		return NULL;
@@ -1463,15 +1467,22 @@
 		mddev_put(mddev);
 		return NULL;
 	}
-	disk = alloc_disk(1);
+	disk = alloc_disk(MDP_MINOR_COUNT);
 	if (!disk) {
 		up(&disks_sem);
 		mddev_put(mddev);
 		return NULL;
 	}
+	index = mdidx(mddev);
 	disk->major = MD_MAJOR;
-	disk->first_minor = mdidx(mddev);
-	sprintf(disk->disk_name, "md%d", mdidx(mddev));
+	disk->first_minor = index << MDP_MINOR_SHIFT;
+	disk->minors = MDP_MINOR_COUNT;
+	if (index >= 26) {
+		sprintf(disk->disk_name, "md%c%c",
+			'a' + index/26-1, 'a' + index % 26);
+	} else {
+		sprintf(disk->disk_name, "md%c", 'a' + index % 26);
+	}
 	disk->fops = &md_fops;
 	disk->private_data = mddev;
 	disk->queue = mddev->queue;
@@ -2512,18 +2523,21 @@
 	 * 4 sectors (with a BIG number of cylinders...). This drives
 	 * dosfs just mad... ;-)
 	 */
+#define MD_HEADS 254
+#define MD_SECTORS 60
 	case HDIO_GETGEO:
 		if (!loc) {
 			err = -EINVAL;
 			goto abort_unlock;
 		}
-		err = put_user (2, (char *) &loc->heads);
+		err = put_user (MD_HEADS, (char *) &loc->heads);
 		if (err)
 			goto abort_unlock;
-		err = put_user (4, (char *) &loc->sectors);
+		err = put_user (MD_SECTORS, (char *) &loc->sectors);
 		if (err)
 			goto abort_unlock;
-		err = put_user(get_capacity(disks[mdidx(mddev)])/8,
+		err = put_user(get_capacity(disks[mdidx(mddev)]) /
+			       (MD_HEADS * MD_SECTORS),
 			       (short *) &loc->cylinders);
 		if (err)
 			goto abort_unlock;
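
To make the new layout concrete, here is a tiny userspace sketch
(illustration only, not part of the patch) of how the minor-number and
naming scheme above behaves: array N owns minors N*16 through N*16+15,
with the first minor being the whole-disk node. The geometry change
likewise keeps HDIO_GETGEO's 16-bit cylinder count in range for arrays
up to roughly 476 GiB (65535 * 254 * 60 sectors).

#include <stdio.h>

#define MDP_MINOR_COUNT 16
#define MDP_MINOR_SHIFT 4

static void show_unit(int index)
{
	char name[8];

	if (index >= 26)
		snprintf(name, sizeof(name), "md%c%c",
			 'a' + index / 26 - 1, 'a' + index % 26);
	else
		snprintf(name, sizeof(name), "md%c", 'a' + index);

	printf("array %2d -> %-4s  minors %3d-%3d\n", index, name,
	       index << MDP_MINOR_SHIFT,
	       (index << MDP_MINOR_SHIFT) + MDP_MINOR_COUNT - 1);
}

int main(void)
{
	show_unit(0);	/* mda,  minors 0-15    */
	show_unit(25);	/* mdz,  minors 400-415 */
	show_unit(26);	/* mdaa, minors 416-431 */
	return 0;
}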


Attachments:
md_partition.diff (1.86 kB)

2004-01-13 16:26:43

by Jakob Oestergaard

Subject: Re: Proposed enhancements to MD

On Mon, Jan 12, 2004 at 05:34:10PM -0700, Scott Long wrote:
> All,
>
> Adaptec has been looking at the MD driver for a foundation for their
> Open-Source software RAID stack. This will help us provide full
> and open support for current and future Adaptec RAID products (as
> opposed to the limited support through closed drivers that we have now).

Interesting...

>
> While MD is fairly functional and clean, there are a number of
> enhancements to it that we have been working on for a while and would
> like to push out to the community for review and integration. These
> include:
>
> - partition support for md devices: MD does not support the concept of
> fdisk partitions; the only way to approximate this right now is by
> creating multiple arrays on the same media. Fixing this is required
> for not only feature-completeness, but to allow our BIOS to recognise
> the partitions on an array and properly boot them as it would boot a
> normal disk.

This change is probably not going to go into 2.6.X anytime soon anyway,
so what are your thoughts on doing this "right" - getting MD moved into
DM ?

That would solve the problem, as I see it.

I'm not currently involved in either of those development efforts, but I
thought I'd bring your attention to the DM/MD issue - there was some
talk about it in the past.

Also, since DM will do on-line resizing and we want MD to do this as
well some day, I really think this is the way to go. Getting
partition support on MD devices will solve a problem now, but for the
long run I really think MD should be a part of DM.

Anyway, that's my 0.02 Euro on that issue.

...
> - Metadata abstraction: We intend to support multiple on-disk metadata
> formats, along with the 'native MD' format. To do this, specific
> knowledge of MD on-disk structures must be abstracted out of the core
> and personalities modules.

I think this one touches the DM issue as well.

So, how about Adaptec and IBM get someone to move MD into DM, and while
you're at it, add hot resizing and hot conversion between RAID levels :)

2.7.1? ;)

Jokes aside, I'd like to hear your opinions on this longer-term
perspective on things...

The RAID conversion/resize code for userspace exists already, and it
works except for some cases with RAID-5 and disks of non-equal size,
where it breaks horribly (fixable bug though).


/ jakob

2004-01-13 18:24:52

by mutex

Subject: Re: Proposed enhancements to MD

On Mon, Jan 12, 2004 at 05:34:10PM -0700 or thereabouts, Scott Long wrote:
> All,
>
> Adaptec has been looking at the MD driver for a foundation for their
> Open-Source software RAID stack. This will help us provide full
> and open support for current and future Adaptec RAID products (as
> opposed to the limited support through closed drivers that we have now).
>
> While MD is fairly functional and clean, there are a number of
> enhancements to it that we have been working on for a while and would
> like to push out to the community for review and integration. These
> include:
>


How about an endian-safe superblock ? Seriously, is that a 'bug' or a
'feature' ? Or do people just not care.

2004-01-13 18:44:35

by Jeff Garzik

Subject: Re: Proposed enhancements to MD

Scott Long wrote:
> I'm going to push these changes out in phases in order to keep the risk
> and churn to a minimum. The attached patch is for the partition
> support. It was originally from Ingo Molnar, but has changed quite a
> bit due to the radical changes in the disk/block layer in 2.6. The 2.4
> version works quite well, while the 2.6 version is fairly fresh. One
> problem that I have with it is that the created partitions show up in
> /proc/partitions after running fdisk, but not after a reboot.

You sorta hit a bad time for 2.4 development. Even though my employer
(Red Hat), Adaptec, and many others must continue to support new
products on 2.4.x kernels, kernel development has shifted to 2.6.x (and
soon 2.7.x).

In general, you want a strategy of "develop on latest, then backport if
needed." Once a solution is merged into the latest kernel, it
automatically appears in many companies' products (and perhaps more
importantly) product roadmaps. Otherwise you will design various things
into your software that have already been handled differently in the
future, thus creating an automatically-obsolete solution and support
nightmare.

Now, addressing your specific issues...

> While MD is fairly functional and clean, there are a number of
> enhancements to it that we have been working on for a while and would
> like to push out to the community for review and integration. These
> include:
>
> - partition support for md devices: MD does not support the concept of
> fdisk partitions; the only way to approximate this right now is by
> creating multiple arrays on the same media. Fixing this is required
> for not only feature-completeness, but to allow our BIOS to recognise
> the partitions on an array and properly boot them as it would boot a
> normal disk.

Neil Brown has already done a significant amount of research into this
topic. Given this, and his general status as md maintainer, you should
definitely make sure he's kept in the loop.

Partitioning for md was discussed in this thread:
http://lkml.org/lkml/2003/11/13/182

In particular note Al Viro's response to Neil, in addition to Neil's own
post.

And I could have _sworn_ that Neil already posted a patch to do
partitions in md, but maybe my memory is playing tricks on me.


> - generic device arrival notification mechanism: This is needed to
> support device hot-plug, and allow arrays to be automatically
> configured regardless of when the md module is loaded or initialized.
> RedHat EL3 has a scaled down version of this already, but it is
> specific to MD and only works if MD is statically compiled into the
> kernel. A general mechanism will benefit MD as well as any other
> storage system that wants hot-arrival notices.

This would be via /sbin/hotplug, in the Linux world. SCSI already does
this, I think, so I suppose something similar would happen for md.


> - RAID-0 fixes: The MD RAID-0 personality is unable to perform I/O
> that spans a chunk boundary. Modifications are needed so that it can
> take a request and break it up into 1 or more per-disk requests.

I thought that raid0 was one of the few that actually did bio splitting
correctly? Hum, maybe this is a 2.4-only issue. Interesting, and
agreed, if so...


> - Metadata abstraction: We intend to support multiple on-disk metadata
> formats, along with the 'native MD' format. To do this, specific
> knowledge of MD on-disk structures must be abstracted out of the core
> and personalities modules.

> - DDF Metadata support: Future products will use the 'DDF' on-disk
> metadata scheme. These products will be bootable by the BIOS, but
> must have DDF support in the OS. This will plug into the abstraction
> mentioned above.

Neil already did the work to make 'md' support multiple types of
superblocks, but I'm not sure if we want to hack 'md' to support the
various vendor RAIDs out there. DDF support we _definitely_ want, of
course. DDF follows a very nice philosophy: open[1] standard with no
vendor lock-in.

IMO, your post/effort all boils down to an open design question: device
mapper or md, for doing stuff like vendor-raid1 or vendor-raid5? And is
it even possible to share (for example) a raid5 engine among all the
various vendor RAID5's?

Jeff


[1] well, developed in secret, but published openly. Not quite up to
Linux's standards, but decent for the h/w world.

2004-01-13 18:56:39

by John Bradford

Subject: Re: Proposed enhancements to MD

> [1] well, developed in secret, but published openly. Not quite up to
> Linux's standards, but decent for the h/w world.

..and patent-free?

John.

2004-01-13 19:06:58

by Jeff Garzik

Subject: Re: Proposed enhancements to MD

mutex wrote:
> On Mon, Jan 12, 2004 at 05:34:10PM -0700 or thereabouts, Scott Long wrote:
>
>>All,
>>
>>Adaptec has been looking at the MD driver for a foundation for their
>>Open-Source software RAID stack. This will help us provide full
>>and open support for current and future Adaptec RAID products (as
>>opposed to the limited support through closed drivers that we have now).
>>
>>While MD is fairly functional and clean, there are a number of
>>enhancements to it that we have been working on for a while and would
>>like to push out to the community for review and integration. These
>>include:
>>
>
>
>
> How about an endian-safe superblock ? Seriously, is that a 'bug' or a
> 'feature' ? Or do people just not care.


There was a thread discussing md's new superblock design, did you
research/follow that? neilb was actively soliciting comments and there
was an amount of discussion.

Jeff



2004-01-13 19:41:24

by Matt Domsch

Subject: Re: Proposed enhancements to MD

On Tue, Jan 13, 2004 at 01:44:05PM -0500, Jeff Garzik wrote:
> You sorta hit a bad time for 2.4 development. Even though my employer
> (Red Hat), Adaptec, and many others must continue to support new
> products on 2.4.x kernels,

Indeed, enterprise class products based on 2.4.x kernels will need
some form of solution here too.

> kernel development has shifted to 2.6.x (and soon 2.7.x).
>
> In general, you want a strategy of "develop on latest, then backport if
> needed."

Ideally in 2.6 one can use device mapper, but DM hasn't been
incorporated into 2.4 stock, I know it's not in RHEL 3, and I don't
believe it's included in SLES8. Can anyone share thoughts on whether,
if a DDF solution were built on top of DM, DM could be included in 2.4
stock, RHEL3, or SLES8? Otherwise, Adaptec will be stuck with two
different solutions anyhow: one for 2.4 (they're proposing enhancing
MD), and DM for 2.6.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2004-01-13 19:30:21

by mutex

Subject: Re: Proposed enhancements to MD

On Tue, Jan 13, 2004 at 02:05:55PM -0500 or thereabouts, Jeff Garzik wrote:
> >How about an endian-safe superblock ? Seriously, is that a 'bug' or a
> >'feature' ? Or do people just not care.
>
>
> There was a thread discussing md's new superblock design, did you
> research/follow that? neilb was actively soliciting comments and there
> was an amount of discussion.
>

hmm I don't remember that... was it on lkml or the raid development
list ? Can you give me a string/date to search around ?

2004-01-13 19:43:43

by Jeff Garzik

Subject: Re: Proposed enhancements to MD

mutex wrote:
> On Tue, Jan 13, 2004 at 02:05:55PM -0500 or thereabouts, Jeff Garzik wrote:
>
>>>How about an endian-safe superblock ? Seriously, is that a 'bug' or a
>>>'feature' ? Or do people just not care.
>>
>>
>>There was a thread discussing md's new superblock design, did you
>>research/follow that? neilb was actively soliciting comments and there
>>was an amount of discussion.
>>
>
>
> hmm I don't remember that... was it on lkml or the raid development
> list ? Can you give me a string/date to search around ?


Other than "neil brown md superblock" don't recall. In the past year or
two :) There were patches, so it wasn't just discussion.

Jeff



2004-01-13 20:01:14

by mutex

Subject: Re: Proposed enhancements to MD

On Tue, Jan 13, 2004 at 02:43:29PM -0500 or thereabouts, Jeff Garzik wrote:
> >hmm I don't remember that... was it on lkml or the raid development
> >list ? Can you give me a string/date to search around ?
>
>
> Other than "neil brown md superblock" don't recall. In the past year or
> two :) There were patches, so it wasn't just discussion.
>

in case anybody else is curious, I think this is the thread:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103776556308924&w=2

2004-01-13 19:59:43

by Cress, Andrew R

Subject: RE: Proposed enhancements to MD

That discussion was mostly in Nov & Dec 2002.
The Subject line was "RFC - new raid superblock layout for md driver".

Andy

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Jeff Garzik
Sent: Tuesday, January 13, 2004 2:43 PM
To: mutex
Cc: Scott Long; [email protected]
Subject: Re: Proposed enhancements to MD


mutex wrote:
> On Tue, Jan 13, 2004 at 02:05:55PM -0500 or thereabouts, Jeff Garzik wrote:
>
>>>How about an endian-safe superblock ? Seriously, is that a 'bug' or a
>>>'feature' ? Or do people just not care.
>>
>>There was a thread discussing md's new superblock design, did you
>>research/follow that? neilb was actively soliciting comments and there
>>was an amount of discussion.
>
> hmm I don't remember that... was it on lkml or the raid development
> list ? Can you give me a string/date to search around ?


Other than "neil brown md superblock" don't recall. In the past year or
two :) There were patches, so it wasn't just discussion.

Jeff



2004-01-13 20:46:10

by Scott Long

Subject: Re: Proposed enhancements to MD

mutex wrote:
> On Mon, Jan 12, 2004 at 05:34:10PM -0700 or thereabouts, Scott Long
> wrote:
>
>>All,
>>
>>Adaptec has been looking at the MD driver for a foundation for their
>>Open-Source software RAID stack. This will help us provide full
>>and open support for current and future Adaptec RAID products (as
>>opposed to the limited support through closed drivers that we have
>>now).
>>
>>While MD is fairly functional and clean, there are a number of
>>enhancements to it that we have been working on for a while and would
>>like to push out to the community for review and integration. These
>>include:
>
> How about an endian-safe superblock ? Seriously, is that a 'bug' or a
> 'feature' ? Or do people just not care.

The DDF metadata module will be endian-safe.

Scott

2004-01-13 20:43:31

by Scott Long

Subject: Re: Proposed enhancements to MD

Jeff Garzik wrote:
> Scott Long wrote:
>
>>I'm going to push these changes out in phases in order to keep the risk
>>and churn to a minimum. The attached patch is for the partition
>>support. It was originally from Ingo Molnar, but has changed quite a
>>bit due to the radical changes in the disk/block layer in 2.6. The 2.4
>>version works quite well, while the 2.6 version is fairly fresh. One
>>problem that I have with it is that the created partitions show up in
>>/proc/partitions after running fdisk, but not after a reboot.
>
> You sorta hit a bad time for 2.4 development. Even though my employer
> (Red Hat), Adaptec, and many others must continue to support new
> products on 2.4.x kernels, kernel development has shifted to 2.6.x (and
> soon 2.7.x).
>
> In general, you want a strategy of "develop on latest, then backport if
> needed." Once a solution is merged into the latest kernel, it
> automatically appears in many companies' products (and perhaps more
> importantly) product roadmaps. Otherwise you will design various things
> into your software that have already been handled differently in the
> future, thus creating an automatically-obsolete solution and support
> nightmare.
>

Oh, I understand completely. This work has actually been going on for a
number of years in an on-and-off fashion. I'm just the latest person to
pick it up, and I happened to pick it up right when the big transition
to 2.6 happened.

> Now, addressing your specific issues...
>
>>While MD is fairly functional and clean, there are a number of
>>enhancements to it that we have been working on for a while and would
>>like to push out to the community for review and integration. These
>>include:
>>
>>- partition support for md devices: MD does not support the concept of
>>  fdisk partitions; the only way to approximate this right now is by
>>  creating multiple arrays on the same media. Fixing this is required
>>  for not only feature-completeness, but to allow our BIOS to recognise
>>  the partitions on an array and properly boot them as it would boot a
>>  normal disk.
>
> Neil Brown has already done a significant amount of research into this
> topic. Given this, and his general status as md maintainer, you should
> definitely make sure he's kept in the loop.
>
> Partitioning for md was discussed in this thread:
> http://lkml.org/lkml/2003/11/13/182
>
> In particular note Al Viro's response to Neil, in addition to Neil's own
> post.
>
> And I could have _sworn_ that Neil already posted a patch to do
> partitions in md, but maybe my memory is playing tricks on me.
>

I thought that I had attached a patch to the end of my last mail, but I
could have messed it up. The work to do partitioning in 2.6 looks to
be far less involved than in 2.4, thankfully =-)

>
>>- generic device arrival notification mechanism: This is needed to
>>  support device hot-plug, and allow arrays to be automatically
>>  configured regardless of when the md module is loaded or
>>  initialized. RedHat EL3 has a scaled down version of this already,
>>  but it is specific to MD and only works if MD is statically
>>  compiled into the kernel. A general mechanism will benefit MD as
>>  well as any other storage system that wants hot-arrival notices.
>
> This would be via /sbin/hotplug, in the Linux world. SCSI already does
> this, I think, so I suppose something similar would happen for md.
>

A problem that we've encountered, though, is the following sequence:

1) md is initialized during boot
2) drives X Y and Z are probed during boot
3) root fs exists on array [X Y Z], but md didn't see them show up,
   so it didn't auto-configure the array

I'm not sure how this can be addressed by a userland daemon. Remember
that we are focused on providing RAID during boot; configuring a
secondary array after boot is a much easier problem.

RHEL3 already has a mechanism to address this via the
md_autodetect_dev() hook. This gets called by the partition code when
partition entities are discovered. However, it is a static method, so
it only works when md is compiled into the kernel. Our proposal is to
turn this into a generic registration mechanism, where md can
register as a listener. When it does that, it gets a list of
previously announced devices, along with future devices as they are
discovered.

The code to do this is pretty small and simple. The biggest question
is whether to implement it by enhancing add_partition(), or create a
new call (i.e. device_register_partition() ), like is done in RHEL3.
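
Roughly, the interface we have in mind would look something like the
sketch below (illustrative only: none of these names exist in any
kernel today, and the locking is deliberately simplistic):

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct dev_arrival_listener {
	struct list_head list;
	void (*announce)(dev_t dev, void *priv);  /* called once per device */
	void *priv;
};

struct announced_dev {
	struct list_head list;
	dev_t dev;
};

static LIST_HEAD(arrival_listeners);
static LIST_HEAD(announced_devs);
static spinlock_t arrival_lock = SPIN_LOCK_UNLOCKED;

/* Called by the partition code for each newly discovered device. */
void device_announce_partition(dev_t dev)
{
	struct announced_dev *d = kmalloc(sizeof(*d), GFP_KERNEL);
	struct dev_arrival_listener *l;

	if (!d)
		return;
	d->dev = dev;
	spin_lock(&arrival_lock);
	list_add_tail(&d->list, &announced_devs);
	list_for_each_entry(l, &arrival_listeners, list)
		l->announce(dev, l->priv);
	spin_unlock(&arrival_lock);
}

/* md (or any other consumer) registers here and is immediately
 * replayed all of the devices announced before it loaded. */
void device_register_arrival_listener(struct dev_arrival_listener *l)
{
	struct announced_dev *d;

	spin_lock(&arrival_lock);
	list_add_tail(&l->list, &arrival_listeners);
	list_for_each_entry(d, &announced_devs, list)
		l->announce(d->dev, l->priv);
	spin_unlock(&arrival_lock);
}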

>
>>- RAID-0 fixes: The MD RAID-0 personality is unable to perform I/O
>>  that spans a chunk boundary. Modifications are needed so that it
>>  can take a request and break it up into 1 or more per-disk
>>  requests.
>
> I thought that raid0 was one of the few that actually did bio splitting
> correctly? Hum, maybe this is a 2.4-only issue. Interesting, and
> agreed, if so...
>

This is definitely still a problem in 2.6.1

>
>>- Metadata abstraction: We intend to support multiple on-disk
>>  metadata formats, along with the 'native MD' format. To do this,
>>  specific knowledge of MD on-disk structures must be abstracted out
>>  of the core and personalities modules.
>>
>>- DDF Metadata support: Future products will use the 'DDF' on-disk
>>  metadata scheme. These products will be bootable by the BIOS, but
>>  must have DDF support in the OS. This will plug into the
>>  abstraction mentioned above.
>
> Neil already did the work to make 'md' support multiple types of
> superblocks, but I'm not sure if we want to hack 'md' to support the
> various vendor RAIDs out there. DDF support we _definitely_ want, of
> course. DDF follows a very nice philosophy: open[1] standard with no
> vendor lock-in.
>
> IMO, your post/effort all boils down to an open design question: device
> mapper or md, for doing stuff like vendor-raid1 or vendor-raid5? And is
> it even possible to share (for example) a raid5 engine among all the
> various vendor RAID5's?
>

The stripe and parity format is not the problem here; md can be enhanced
to support different stripe and parity rotation sequences without much
trouble.

Also, think beyond just DDF. Having pluggable metadata personalities
means that a module can be written for the existing Adaptec RAID
products too (like the HostRAID functionality on our U320 adapters).
It also means that you can write personality modules for other vendors,
and even hardware RAID solutions. Imagine having a PCI RAID card fail,
then plugging the drives directly into your computer and having the
array 'Just Work'.

As for the question of DM vs. MD, I think that you have to consider that
DM right now has no concept of storing configuration data on the disk
(at least that I can find, please correct me if I'm wrong). I think
that DM will make a good LVM-like layer on top of MD, but I don't see it
replacing MD right now.


Scott

2004-01-13 22:06:43

by Arjan van de Ven

Subject: Re: Proposed enhancements to MD

On Tue, 2004-01-13 at 01:34, Scott Long wrote:
> All,
>
> Adaptec has been looking at the MD driver for a foundation for their
> Open-Source software RAID stack.

Hi,

Is there a (good) reason you didn't use Device Mapper for this? It
really sounds like Device Mapper is the way to go for parsing and using
raid-like formats in the kernel, since it's designed to be independent
of on-disk formats, unlike MD.

Greetings,
Arjan van de Ven



2004-01-13 22:11:16

by Arjan van de Ven

Subject: Re: Proposed enhancements to MD


> Ideally in 2.6 one can use device mapper, but DM hasn't been
> incorporated into 2.4 stock, I know it's not in RHEL 3, and I don't
> believe it's included in SLES8. Can anyone share thoughts on whether,
> if a DDF solution were built on top of DM, DM could be included in 2.4
> stock, RHEL3, or SLES8? Otherwise, Adaptec will be stuck with two
> different solutions anyhow: one for 2.4 (they're proposing enhancing
> MD), and DM for 2.6.

Well it's either putting DM into 2.4 or forcing some sort of partitioned
MD into 2.4. My strong preference would be DM in that case, since it's
already in 2.6 and is actually designed for the
multiple-superblock-formats case.




2004-01-13 22:32:19

by Wakko Warner

Subject: Re: Proposed enhancements to MD

> > Adaptec has been looking at the MD driver for a foundation for their
> > Open-Source software RAID stack.
>
> Hi,
>
> Is there a (good) reason you didn't use Device Mapper for this? It
> really sounds like Device Mapper is the way to go for parsing and using
> raid-like formats in the kernel, since it's designed to be independent
> of on-disk formats, unlike MD.

As I've understood it, the configuration for DM is userspace and the kernel
can't do any auto detection. This would be a "put off" for me to use as a
root filesystem. Configurations like this (and lvm too last I looked at it)
require an initrd or some other way of setting up the device. Unfortunately
this means that there's configs in 2 locations (one not easily available, if
using initrd. easily != mounting via loop!)

--
Lab tests show that use of micro$oft causes cancer in lab animals

2004-01-13 22:36:23

by Jure Pečar

Subject: Re: Proposed enhancements to MD

On Tue, 13 Jan 2004 13:41:07 -0700
Scott Long <[email protected]> wrote:

> A problem that we've encountered, though, is the following sequence:
>
> 1) md is initialized during boot
> 2) drives X Y and Z are probed during boot
> 3) root fs exists on array [X Y Z], but md didn't see them show up,
> so it didn't auto-configure the array
>
> I'm not sure how this can be addressed by a userland daemon. Remember
> that we are focused on providing RAID during boot; configuring a
> secondary array after boot is a much easier problem.

Looking at this chicken-and-egg problem of booting from an array from
an administrator's point of view ...

What do you guys think about Intel's EFI? I think it would be the most
appropriate place to put a piece of code that would scan the disks,
assemble any arrays and present them to the OS as bootable devices ...
If we're going to get a common metadata layout, that would be even
easier.

Thoughts?

--

Jure Pečar

2004-01-13 23:10:06

by Andreas Steinmetz

Subject: Re: Proposed enhancements to MD

Wakko Warner wrote:
>
> As I've understood it, the configuration for DM is userspace and the kernel
> can't do any auto detection. This would be a "put off" for me to use as a
> root filesystem. Configurations like this (and lvm too last I looked at it)
> require an initrd or some other way of setting up the device. Unfortunately
> this means that there's configs in 2 locations (one not easily available, if
> using initrd. easily != mounting via loop!)
>

You can always do the following: use a mini root fs on the partition
where the kernel is located that does nothing but vgscan and friends and
then calls pivot_root. '/sbin/init' of the mini root fs looks like:


#!/bin/sh
case "$1" in
	-s|S|single|-a|auto)
		opt=$1
		;;
	-b|emergency)
		export PATH=/bin:/sbin
		/bin/mount /proc
		/bin/loadkeys \
			/keymaps/i386/qwertz/de-latin1-nodeadkeys.map.gz
		exec /bin/sh < /dev/console > /dev/console 2>&1
		;;
esac
cd /
/bin/mount /proc
/bin/mount -o remount,rw,notail,noatime,nodiratime /
/sbin/vgscan > /dev/null
/sbin/vgchange -a y > /dev/null
/bin/mount -o remount,ro,notail,noatime,nodiratime /
/bin/mount /mnt
/bin/umount /proc
cd /mnt
/sbin/pivot_root . boot
exec /bin/chroot . /bin/sh -c \
	"/bin/umount /boot ; exec /sbin/init $opt" \
	< dev/console > dev/console 2>&1



And if you have partitions of the same size on other disks and fiddle a
bit with dd, you have perfectly working backups, including the boot
loader code of the master boot record, on the other disks. No initrd
required. As an add-on you have an on-disk rescue system.
--
Andreas Steinmetz

2004-01-13 22:49:30

by Luca Berra

Subject: Re: Proposed enhancements to MD

On Tue, Jan 13, 2004 at 01:44:05PM -0500, Jeff Garzik wrote:
>And I could have _sworn_ that Neil already posted a patch to do
>partitions in md, but maybe my memory is playing tricks on me.
he did, and a long time ago also.
http://cgi.cse.unsw.edu.au/~neilb/patches/

>IMO, your post/effort all boils down to an open design question: device
>mapper or md, for doing stuff like vendor-raid1 or vendor-raid5? And is
>it even possible to share (for example) a raid5 engine among all the
>various vendor RAID5's?
I would believe the way to go is having md raid personalities turned
into device mapper targets.
The issue is that raid personalities need to be able to constantly
update the metadata, so a callback must be in place to communicate
`exceptions` to a layer that sits above device-mapper and handles
the metadata.

L.


--
Luca Berra -- [email protected]
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \

2004-01-13 22:57:19

by Al Viro

Subject: Re: Proposed enhancements to MD

On Tue, Jan 13, 2004 at 11:33:20PM +0100, Jure Pečar wrote:
> Looking at this chicken-and-egg problem of booting from an array from
> administrator's point of view ...
>
> What do you guys think about Intel's EFI? I think it would be the most
> apropriate place to put a piece of code that would scan the disks, assemble
> any arrays and present them to the OS as bootable devices ... If we're going
> to get a common metadata layout, that would be even easier.
>
> Thoughts?

Why bother? We can have userland code running before any device drivers
are initialized. And have access to
* all normal system calls
* normal writable filesystem already present (ramfs)
* normal multitasking
All of that - within the heavily tested codebase; regular kernel codepaths
that are used all the time by everything. Oh, and it's portable.

What's the benefit of doing that from EFI? Pure masochism?

2004-01-13 22:52:00

by Arjan van de Ven

Subject: Re: Proposed enhancements to MD

On Tue, Jan 13, 2004 at 05:44:22PM -0500, Wakko Warner wrote:
> > > Adaptec has been looking at the MD driver for a foundation for their
> > > Open-Source software RAID stack.
> >
> > Hi,
> >
> > Is there a (good) reason you didn't use Device Mapper for this? It
> > really sounds like Device Mapper is the way to go for parsing and using
> > raid-like formats in the kernel, since it's designed to be independent
> > of on-disk formats, unlike MD.
>
> As I've understood it, the configuration for DM is userspace and the kernel
> can't do any auto detection. This would be a "put off" for me to use as a
> root filesystem. Configurations like this (and lvm too last I looked at it)
> require an initrd or some other way of setting up the device. Unfortunately
> this means that there's configs in 2 locations (one not easily available, if
> using initrd. easily != mounting via loop!)

the kernel is moving in that direction fast, with initramfs etc etc...
It's not like the userspace autodetector needs configuration (although it
can have it of course)



2004-01-13 22:52:00

by Scott Long

Subject: Re: Proposed enhancements to MD

Jure Pečar wrote:
> On Tue, 13 Jan 2004 13:41:07 -0700
> Scott Long <[email protected]> wrote:
>
>>A problem that we've encountered, though, is the following sequence:
>>
>>1) md is initialized during boot
>>2) drives X Y and Z are probed during boot
>>3) root fs exists on array [X Y Z], but md didn't see them show up,
>>   so it didn't auto-configure the array
>>
>>I'm not sure how this can be addressed by a userland daemon. Remember
>>that we are focused on providing RAID during boot; configuring a
>>secondary array after boot is a much easier problem.
>
> Looking at this chicken-and-egg problem of booting from an array from
> an administrator's point of view ...
>
> What do you guys think about Intel's EFI? I think it would be the most
> appropriate place to put a piece of code that would scan the disks,
> assemble any arrays and present them to the OS as bootable devices ...
> If we're going to get a common metadata layout, that would be even
> easier.
>
> Thoughts?
>

The BIOS already scans the disks, assembles the arrays, finds the boot
sector, and presents the arrays to the loader/GRUB. Are you saying that
EFI should be the interface through which the arrays are communicated,
even after the kernel has booted? Is this possible right now?

Scott

2004-01-13 23:26:06

by Wakko Warner

Subject: Re: Proposed enhancements to MD

> > As I've understood it, the configuration for DM is userspace and the kernel
> > can't do any auto detection. This would be a "put off" for me to use as a
> > root filesystem. Configurations like this (and lvm too last I looked at it)
> > require an initrd or some other way of setting up the device. Unfortunately
> > this means that there's configs in 2 locations (one not easily available, if
> > using initrd. easily != mounting via loop!)
>
> You can always do the following: use a mini root fs on the partition
> where the kernel is located that does nothing but vgscan and friends and
> then calls pivot_root. '/sbin/init' of the mini root fs looks like:

What is the advantage of not putting the autodetector/setup in the kernel?
Not everyone is going to use this software (or am I wrong on that?) so that
can be left as an option to compile in (or as a module if possible and if
autodetection is not required). How much work is it to maintain something
like this in the kernel?

I ask because I'm not a kernel hacker, mostly an end user (at least I
can compile my own kernels =)

I must say, the day that kernel level ip configuration via bootp is removed
I'm going to be pissed =)

I like the fact that MD can autodetect raids on boot when compiled in, I
didn't like the fact it can't be partitioned. That's the only thing that
put me off with MD. LVM put me off because it couldn't be auto detected at
boot. I was going to play with DM, but I haven't yet.

--
Lab tests show that use of micro$oft causes cancer in lab animals

2004-01-14 15:54:42

by Kevin Corry

Subject: Re: Proposed enhancements to MD

On Tuesday 13 January 2004 14:41, Scott Long wrote:
> A problem that we've encountered, though, is the following sequence:
>
> 1) md is initialized during boot
> 2) drives X Y and Z are probed during boot
> 3) root fs exists on array [X Y Z], but md didn't see them show up,
> so it didn't auto-configure the array
>
> I'm not sure how this can be addressed by a userland daemon. Remember
> that we are focused on providing RAID during boot; configuring a
> secondary array after boot is a much easier problem.

This can already be accomplished with an init-ramdisk (or initramfs in the
future). These provide the ability to run user-space code before the real
root filesystem is mounted.

> > I thought that raid0 was one of the few that actually did bio splitting
> > correctly? Hum, maybe this is a 2.4-only issue. Interesting, and
> > agreed, if so...
>
> This is definitely still a problem in 2.6.1

Device-Mapper does bio-splitting correctly, and already has a "stripe" module.
It's pretty trivial to set up a raid0 device with DM.
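
For illustration (device names and sizes made up), striping two disks
into one 2 GiB device with 64 KiB chunks is a single table line of the
form <start> <length> striped <#stripes> <chunk_size> <dev> <offset>...,
with all sizes in 512-byte sectors:

  echo "0 4194304 striped 2 128 /dev/sda 0 /dev/sdb 0" | dmsetup create r0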

> As for the question of DM vs. MD, I think that you have to consider that
> DM right now has no concept of storing configuration data on the disk
> (at least that I can find, please correct me if I'm wrong). I think
> that DM will make a good LVM-like layer on top of MD, but I don't see it
> replacing MD right now.

The DM core has no knowledge of any metadata, but that doesn't mean its
sub-modules ("targets" in DM-speak) can't. For example, the dm-snapshot
target has to record enough on-disk metadata for its snapshots to be
persistent across reboots. Same with the persistent dm-mirror target that
Joe Thornber and co. have been working on. You could certainly write a
raid5 target that recorded parity and other state information on disk.

The real key here is keeping the metadata that simply identifies the device
separate from the metadata that keeps track of the device state. Using the
snapshot example again, DM keeps a copy of the remapping table on disk, so an
existing snapshot can be initialized when it's activated at boot-time. But
this remapping table is completely separate from the metadata that identifies
a device/volume as being a snapshot. In fact, EVMS and LVM2 have completely
different ways of identifying snapshots (which is done in user-space), yet
they both use the same kernel snapshot module.

--
Kevin Corry
[email protected]
http://evms.sourceforge.net/

2004-01-14 16:18:27

by Kevin Corry

Subject: Re: Proposed enhancements to MD

On Tuesday 13 January 2004 17:38, Wakko Warner wrote:
> > > As I've understood it, the configuration for DM is userspace and the
> > > kernel can't do any auto detection. This would be a "put off" for me
> > > to use as a root filesystem. Configurations like this (and lvm too
> > > last I looked at it) require an initrd or some other way of setting up
> > > the device. Unfortunately this means that there's configs in 2
> > > locations (one not easily available, if using initrd. easily !=
> > > mounting via loop!)
> >
> > You can always do the following: use a mini root fs on the partition
> > where the kernel is located that does nothing but vgscan and friends and
> > then calls pivot_root. '/sbin/init' of the mini root fs looks like:
>
> What is the advantage of not putting the autodetector/setup in the kernel?

Because it can be incredibly complicated, bloated, and difficult to coordinate
with the corresponding user-space tools.

> Not everyone is going to use this software (or am I wrong on that?) so that
> can be left as an option to compile in (or as a module if possible and if
> autodetection is not required). How much work is it to maintain something
> like this in the kernel?

Enough to have had the idea shot down a year-and-a-half ago. EVMS did
in-kernel volume discovery at one point, but the driver was enormous. Let's
just say we finally "saw the light" and redesigned to do user-space
discovery. Trust me, it works much better that way.

> I like the fact that MD can autodetect raids on boot when compiled in, I
> didn't like the fact it can't be partitioned. That's the only thing that
> put me off with MD. LVM put me off because it couldn't be auto detected at
> boot. I was going to play with DM, but I haven't yet.

I guess I simply don't understand the desire to partition MD devices when
putting LVM on top of MD provides *WAY* more flexibility. You can resize any
volume in your group, as well as add new disks or raid devices in the future
and expand existing volumes across those new devices. All of this is quite a
pain with just partitions.

And setting up an init-ramdisk to run the tools isn't that hard. EVMS even
provides pre-built init-ramdisks with the EVMS tools, which have worked for
virtually all of our users who want their root filesystem on an EVMS volume.
It really is fairly simple, and I run three of my own computers this way. If
you'd like to give it a try, I'd be more than happy to help you out.

--
Kevin Corry
[email protected]
http://evms.sourceforge.net/

2004-01-14 16:54:36

by Kevin P. Fleming

Subject: Re: Proposed enhancements to MD

Kevin Corry wrote:

> I guess I simply don't understand the desire to partition MD devices when
> putting LVM on top of MD provides *WAY* more flexibility. You can resize any
> volume in your group, as well as add new disks or raid devices in the future
> and expand existing volumes across those new devices. All of this is quite a
> pain with just partitions.

In a nutshell: other OS compatibility. Not that I care, but they're
trying to cater to the users that have both Linux and Windows (and other
stuff) installed on a RAID-1 created by their BIOS RAID driver. In that
situation, they can't use logical volumes for the other OS partitions,
they've got to have an MSDOS partition table on top of the RAID device.

However, that does not mean this needs to be done in the kernel, they
can easily use a (future) dm-partx that reads the partition table and
tells DM what devices to make from the RAID device.

2004-01-14 19:07:06

by Jakob Oestergaard

Subject: Re: Proposed enhancements to MD

On Tue, Jan 13, 2004 at 12:10:58PM -0800, Mike Fedyk wrote:
> On Tue, Jan 13, 2004 at 05:26:36PM +0100, Jakob Oestergaard wrote:
> > The RAID conversion/resize code for userspace exists already, and it
>
> That's news to me!
>
> Where is the project that does this?

http://unthought.net/raidreconf/index.shtml

I know of one bug in it which will thoroughly smash user data beyond
recognition - it happens when you resize RAID-5 arrays on disks that are
not of equal size. Should be easy to fix, if one tried :)

If you want it in the kernel doing hot-resizing, you probably want to
add some sort of 'progress log' so that one can resume the
reconfiguration after a reboot - that should be doable, just isn't done
yet.
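
The log wouldn't have to be fancy. Something like this, rewritten after
every batch of stripes is relocated (pure speculation on my part, all
field names invented):

struct reconf_progress {
	__u32	magic;		/* identifies an in-progress reshape */
	__u32	seq;		/* bumped on every update */
	__u64	next_stripe;	/* first stripe not yet relocated */
	__u32	old_raid_disks, new_raid_disks;
	__u32	old_chunk_size, new_chunk_size;
	__u32	csum;		/* guards against torn writes */
};

On restart you'd trust the record only if magic and csum check out, and
resume copying at next_stripe.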

Right now it's entirely a user-space tool and it is not integrated with
the MD code to make it do hot-reconfiguration - integrating it with DM
and MD would make it truly useful.


/ jakob

2004-01-14 21:03:26

by Jakob Oestergaard

Subject: Re: Proposed enhancements to MD

On Wed, Jan 14, 2004 at 11:40:52AM -0800, Mike Fedyk wrote:
> On Wed, Jan 14, 2004 at 08:07:02PM +0100, Jakob Oestergaard wrote:
> > http://unthought.net/raidreconf/index.shtml
> >
> > I know of one bug in it which will thoroughly smash user data beyond
> > recognition - it happens when you resize RAID-5 arrays on disks that are
> > not of equal size. Should be easy to fix, if one tried :)
> >
>
> Hmm, that's if the underlying blockdevs are of differing sizes, right? I
> usually do my best to make the partitions the same size, so hopefully that
> won't hit for me. (though, I don't need to resize any arrays right now)

Make backups anyway :)

>
> > If you want it in the kernel doing hot-resizing, you probably want to
> > add some sort of 'progress log' so that one can resume the
> > reconfiguration after a reboot - that should be doable, just isn't done
> > yet.
>
> IIRC, most filesystems don't support hot shrinking if they support
> hot-resizing, so that would only help with adding a disk to an array.

"only" adding disks... How many people actually shrink stuff nowadays?

I'd say having hot-growth would solve 99% of the problems out there.

And I think that's a good part of the reason why so few FSes can
actually shrink. Shrinking can also be a much harder problem, though -
maybe that's part of the reason as well.

>
> > Right now it's entirely a user-space tool and it is not integrated with
> > the MD code to make it do hot-reconfiguration - integrating it with DM
> > and MD would make it truely useful.
>
> True, but an intermediate step would be to call parted for resizing to the
> exact size needed for a raid0 -> raid5 conversion for example.

Yep.

/ jakob

2004-01-14 23:09:45

by NeilBrown

Subject: Re: Proposed enhancements to MD

On Monday January 12, [email protected] wrote:
> All,
>
> Adaptec has been looking at the MD driver for a foundation for their
> Open-Source software RAID stack. This will help us provide full
> and open support for current and future Adaptec RAID products (as
> opposed to the limited support through closed drivers that we have
> now).

Sounds like a great idea.

>
> While MD is fairly functional and clean, there are a number of
> enhancements to it that we have been working on for a while and would
> like to push out to the community for review and integration. These
> include:

It would help if you said up-front whether you were thinking of 2.4 or
2.6 or 2.7 or all of whatever. I gather from subsequent emails in the
thread that you are thinking of 2.6 and hoping for 2.4.
It is definitely too late for any of this to go into kernel.org 2.4,
but some of it could live in an external patch set that people or
vendors can choose to use or not.

>
> - partition support for md devices: MD does not support the concept of
> fdisk partitions; the only way to approximate this right now is by
> creating multiple arrays on the same media. Fixing this is required
> for not only feature-completeness, but to allow our BIOS to recognise
> the partitions on an array and properly boot them as it would boot a
> normal disk.

Your attached patch is completely unacceptable as it breaks backwards
compatibility. /dev/md1 (blockdev 9,1) changes from being the second
md array to being the first partition of the first md array.

I too would like to support partitions of md devices but there is no
really elegant way to do it.
I'm beginning to think the best approach is to use a new major number
(which will be dynamically allocated because Linus has forbidden new
static allocations). This should be fairly easy to do.
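
Something along these lines should do it (untested sketch; "mdp" is
just a placeholder name and error handling is elided):

#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/init.h>

static int mdp_major;

static int __init mdp_init(void)
{
	/* major == 0 asks the block layer for a free, dynamic major */
	mdp_major = register_blkdev(0, "mdp");
	if (mdp_major < 0)
		return mdp_major;
	/* partitioned arrays would then be allocated with
	 * alloc_disk(1 << shift) and disk->major = mdp_major,
	 * leaving major 9 (/dev/mdX) exactly as it is today */
	return 0;
}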

A reasonable alternative is to use DM. As I understand it, DM can work
with any sort of metadata (as metadata is handled by user-space) so
this should work just fine.

Note that kernel-based autodetection is seriously a thing of the past.
As has been said already, it should be just as easy and much more
manageable to do autodetection in early user-space. If it isn't, then
we need to improve the early user-space tools.

>
> - generic device arrival notification mechanism: This is needed to
> support device hot-plug, and allow arrays to be automatically
> configured regardless of when the md module is loaded or initialized.
> RedHat EL3 has a scaled down version of this already, but it is
> specific to MD and only works if MD is statically compiled into the
> kernel. A general mechanism will benefit MD as well as any other
> storage system that wants hot-arrival notices.

This has largely been covered, but just to add or clarify slightly:

This is not an md issue. This is either a bus controller or
userspace issue.
2.6 has a "hotplug" infrastructure and each bus should report
hotplug events to userspace.
If they don't they should be enhanced so they do.
If they do, then userspace needs to be told what to do with these
events, and when to assemble devices into arrays.


>
> - RAID-0 fixes: The MD RAID-0 personality is unable to perform I/O
> that spans a chunk boundary. Modifications are needed so that it can
> take a request and break it up into 1 or more per-disk requests.

In 2.4 it cannot, but arguably doesn't need to. However I have a
fairly straight-forward patch which supports raid0 request splitting.
In 2.6, this should work properly already.

>
> - Metadata abstraction: We intend to support multiple on-disk metadata
> formats, along with the 'native MD' format. To do this, specific
> knowledge of MD on-disk structures must be abstracted out of the core
> and personalities modules.

In 2.4, this would be a massive amount of work and I don't recommend
it.
In 2.6, most of this is already done - the knowledge about superblock
format is very localised. I would like to extend this so that a
loadable module can add a new format. Patches welcome.
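
The shape of this already exists inside md.c as the (static) super_type
table; roughly, exporting it to modules could look like the following
(sketch only - the registration entry points below do not exist yet):

struct md_metadata_ops {
	char		*name;		/* "0.90.0", "md-1", "ddf", ... */
	struct module	*owner;
	int  (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev,
			   int minor_version);
	int  (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
	void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
};

int md_register_metadata(struct md_metadata_ops *ops);
void md_unregister_metadata(struct md_metadata_ops *ops);

A DDF module would then fill in one of these and register it at load
time.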

Note that the kernel does need to know about the format of the
superblock.
DM can manage without knowing, as its superblock is read-mostly and
the very few updates (for reconfiguration) are managed by userspace.
For raid1 and raid5 (which DM doesn't support), we need to update the
superblock on errors and I think that is best done in the kernel.


>
> - DDF Metadata support: Future products will use the 'DDF' on-disk
> metadata scheme. These products will be bootable by the BIOS, but
> must have DDF support in the OS. This will plug into the abstraction
> mentioned above.

I'm looking forward to seeing the specs for DDF (but isn't it pretty
dumb to develop a standard in a closed forum). If DDF turns out to
have real value I would be happy to have support for it in linux/md.

NeilBrown

2004-01-15 01:45:39

by Jakob Oestergaard

Subject: Re: Proposed enhancements to MD

On Wed, Jan 14, 2004 at 02:24:47PM -0800, Mike Fedyk wrote:
...
> > "only" adding disks... How many people actually shrink stuff nowadays?
> >
>
> Going raid0 -> raid5 would shrink your filesystem.

Not if you add an extra disk :)

>
> > I'd say having hot-growth would solve 99% of the problems out there.
> >
>
> True. Until now I didn't know I could resize my MD raid arrays!
>
> Is it still true that you think it's a good idea to try to test the resizing
> code? It's been around since 1999, so maybe it's a bit further than
> "experemental" now?

I haven't had much need for the program myself since shortly after I
wrote it, but maybe a handful or so of people have tested it and
reported results back to me (and that's since 1999!).

RedHat took the tool and shipped it with some changes. Don't know if
they have had feedback...

From the testing it has had, I wouldn't call it more than experimental.
As it turns out, it was "almost" correct from the beginning, and there
hasn't been much progress since then :)

Now it's just lying on my site, rotting... Mostly, I think the problem
is that the reconfiguration is not on-line. It is not really useful to
do off-line reconfiguration. You need to make a full backup anyway - and
it is simply faster to just re-create the array and restore your data,
than to run the reconfiguration. At least this holds true for most of
the cases I've heard of (except maybe the ones where users didn't back
up data first).

I think it's a pity that no one has taken the code and somehow
(userspace/kernel hybrid or pure kernel?) integrated it with the kernel
to make hot reconfiguration possible.

But I have not had the time to do so myself, and I cannot see myself
getting the time to do it in any foreseeable future.

I aired the idea with the EVMS folks about a year ago, and they liked
the idea but were too busy just getting EVMS into the kernel as it was,
making the necessary changes there...

I think most people agree that hot reconfiguration of RAID arrays would
be a cool feature. It just seems that no one really has the time to do
it. The logic as such should be fairly simple - raidreconf is maybe
not exactly 'trivial', but it's not rocket science either. And if
nothing else, it's a skeleton that works (mostly) :)

>
> Has anyone tried to write a test suite for it?

Not that I know of. But a certain commercial NAS vendor used the tool
in their products, so maybe they wrote a test suite, I don't know.

/ jakob

2004-01-15 21:53:25

by Matt Domsch

Subject: Re: Proposed enhancements to MD

On Thu, Jan 15, 2004 at 10:07:34AM +1100, Neil Brown wrote:
> On Monday January 12, [email protected] wrote:
> > All,
> >
> > Adaptec has been looking at the MD driver for a foundation for their
> > Open-Source software RAID stack. This will help us provide full
> > and open support for current and future Adaptec RAID products (as
> > opposed to the limited support through closed drivers that we have
> > now).
>
> Sounds like a great idea.
>
> > - Metadata abstraction: We intend to support multiple on-disk metadata
> > formats, along with the 'native MD' format. To do this, specific
> > knowledge of MD on-disk structures must be abstracted out of the core
> > and personalities modules.
>
> In 2.4, this would be a massive amount of work and I don't recommend
> it.

Scott has made a decent stab at doing so already in 2.4, and I've
encouraged him to post the code he's got now. Since it's too intrusive
for 2.4, perhaps it could be added in parallel as an "emd" driver, and
one could choose to use emd to get the DDF functionality, or continue
to use md without DDF.

Here are some of the features I know I'm looking for, and I've
compared the suggested solutions. Comments/corrections welcome.

* Solution works in both 2.4 and 2.6 kernels
- less ideal if two different solutions are needed
* RAID 0,1 DDF format
* Bootable from degraded R1
* Online Rebuild
* Mgmt tools/hooks
- online create, delete, modify
* Event notification/logging
* Error Handling
* Installation - simple i.e. without modifying distro installers
significantly or at all; driver disk only is ideal


From what I see about DM at present:
* RAID 0,1 possible, dm-raid1 module in Sistina CVS needs to get merged
* Boot drive - requires setup method early in boot process, either
initrd or kernel code
* Boot from degraded RAID1 requires setup method early in boot
process, either initrd or kernel code.
* Online Rebuild - dm-raid1 has this capability
* mgmt tools/hooks - DM today has a way to communicate the desired
changes to the kernel. What remains is userspace tools that read and
modify DDF metadata and call into these hooks.
* Event notification / logging - doesn't appear to exist in DM
* Error handling - unclear if/how DM handles this. For instance, how
is a disk failure on a dm-raid1 array handled?
* Installation - RHEL3 doesn't include DM yet, significant installer
work necessary for several distros.


From what I see about md:
* RAID 0,1 there today, no DDF
* Boot drive - yes
* Boot from degraded RAID1 - possible but may require manual
intervention depending on BIOS capabilities
* Online Rebuild - there today
* mgmt tools/hooks - mdadm there today
* Event notification / logging - mdadm there today
* Error handling - there today
* Installation - disto installer capable of this today


From what I see about emd:
* RAID 0,1 - code being developed by Adaptec today, DDF capable
* Boot drive - yes
* Boot from degraded RAID1 - possible without intervention due to
Adaptec BIOS
* Online Rebuild - there today
* mgmt tools/hooks - mdadm there today, expect Adaptec to enhance mdadm to support DDF
* Event notification / logging - mdadm there today
* Error handling - there today
* Installation - could be done with only a driver disk which adds the
emd module.

Am I way off base here? :-)

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2004-01-16 09:28:34

by Lars Marowsky-Bree

Subject: Re: Proposed enhancements to MD

On 2004-01-15T15:52:21,
Matt Domsch <[email protected]> said:

> * Solution works in both 2.4 and 2.6 kernels
> - less ideal of two different solutions are needed

Sure, this is important.

> * RAID 0,1 DDF format
> * Bootable from degraded R1

We were looking at extending the boot loader (grub/lilo) to have
additional support for R1 & multipath. (ie, booting from the first
drive/path in the set where a consistent image can be read.) If the BIOS
supports DDF too, this would get even better.

For the boot drive, this is highly desirable!

Do you know whether DDF can also support simple multipathing?

> * Boot from degraded RAID1 requires setup method early in boot
> process, either initrd or kernel code.

This is needed with DDF too; we need to parse the DDF data somewhere
after all.

> From what I see about md:
> * RAID 0,1 there today, no DDF

Supporting additional metadata is desirable. For 2.6, this is already
in the code, and I am looking forward to having this feature.

> Am I way off base here? :-)

I don't think so. But for 2.6, the functionality should go either into
DM or MD, not into emd. I don't care which, really, both sides have good
arguments, none of which _really_ matter from a user-perspective ;-)

(If, in 2.7 time, we rip out MD and fully integrate it all into DM, then
we can see further.)


Sincerely,
Lars Marowsky-Bree <[email protected]>

--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett

2004-01-16 09:43:23

by Lars Marowsky-Bree

Subject: Re: Proposed enhancements to MD

On 2004-01-13T13:41:07,
Matt Domsch <[email protected]> said:

> > You sorta hit a bad time for 2.4 development. Even though my employer
> > (Red Hat), Adaptec, and many others must continue to support new
> > products on 2.4.x kernels,
> Indeed, enterprise class products based on 2.4.x kernels will need
> some form of solution here too.

Yes, namely not supporting this feature and moving onwards to 2.6 in
their next release ;-)


Sincerely,
Lars Marowsky-Bree <[email protected]>

--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett

2004-01-16 09:58:11

by Arjan van de Ven

Subject: Re: Proposed enhancements to MD

On Fri, 2004-01-16 at 10:31, Lars Marowsky-Bree wrote:
> On 2004-01-13T13:41:07,
> Matt Domsch <[email protected]> said:
>
> > > You sorta hit a bad time for 2.4 development. Even though my employer
> > > (Red Hat), Adaptec, and many others must continue to support new
> > > products on 2.4.x kernels,
> > Indeed, enterprise class products based on 2.4.x kernels will need
> > some form of solution here too.
>
> Yes, namely not supporting this feature and moving onwards to 2.6 in
> their next release ;-)

hear hear



2004-01-16 13:43:58

by Matt Domsch

Subject: Re: Proposed enhancements to MD

On Fri, Jan 16, 2004 at 10:24:47AM +0100, Lars Marowsky-Bree wrote:
> Do you know whether DDF can also support simple multipathing?

Yes, the structure info for each physical disk allows for two (and
only 2) paths to be represented. But it's pretty limited, describing
only SCSI-like paths, with just bus/id/lun described in the current
draft. At the same time, there's a per-physical-disk GUID, such
that if you find the same disk by multiple paths you can tell.
There's room for enhancement/feedback in this space for certain.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2004-01-16 13:55:42

by Lars Marowsky-Bree

Subject: Re: Proposed enhancements to MD

On 2004-01-16T07:43:36,
Matt Domsch <[email protected]> said:

> > Do you know whether DDF can also support simple multipathing?
> Yes, the structure info for each physical disk allows for two (and
> only 2) paths to be represented. But it's pretty limited, describing
> only SCSI-like paths, with just bus/id/lun described in the current
> draft. At the same time, there's a per-physical-disk GUID, such
> that if you find the same disk by multiple paths you can tell.
> There's room for enhancement/feedback in this space for certain.

One would guess that for m-p, a mere media UUID would be completely
sufficient; one can simply scan for where those are found.

If it encodes the bus/id/lun, I can foresee bad effects if the device
enumeration changes because the HBAs get swapped in their slots ;-)


Sincerely,
Lars Marowsky-Bree <[email protected]>

--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett

2004-01-16 14:11:41

by Matt Domsch

Subject: Re: Proposed enhancements to MD

On Fri, 16 Jan 2004, Christoph Hellwig wrote:
> On Fri, Jan 16, 2004 at 02:56:46PM +0100, Lars Marowsky-Bree wrote:
> > If it encodes the bus/id/lun, I can foresee bad effects if the device
> > enumeration changes because the HBAs get swapped in their slots ;-)

I believe it's just supposed to be a hint to the firmware that the drive
has roamed from one physical slot to another.

> A bus/id/lun enumeration is completely bogus. Think (S)ATA, FC or
> iSCSI.
>
> So is there a pointer to the current version of the spec? Just reading
> these multi-path enumerations starts to give me the feeling this spec
> is designed rather badly..

http://www.snia.org in the DDF TWG section, but requires you be a member of SNIA
to see at present. The DDF chairperson is trying to make the draft
publicly available, and if/when I see that happen I'll post a link to it
here.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2004-01-16 14:14:08

by Christoph Hellwig

Subject: Re: Proposed enhancements to MD

On Fri, Jan 16, 2004 at 08:11:07AM -0600, Matt Domsch wrote:
> http://www.snia.org in the DDF TWG section, but requires you be a member of SNIA
> to see at present. The DDF chairperson is trying to make the draft
> publicly available, and if/when I see that happen I'll post a link to it
> here.

Oops. That's not a good sign. /me tries to remember a sane spec coming
from SNIA and fails..

2004-01-16 14:06:24

by Christoph Hellwig

Subject: Re: Proposed enhancements to MD

On Fri, Jan 16, 2004 at 02:56:46PM +0100, Lars Marowsky-Bree wrote:
> If it encodes the bus/id/lun, I can foresee bad effects if the device
> enumeration changes because the HBAs get swapped in their slots ;-)

A bus/id/lun enumeration is completely bogus. Think (S)ATA, FC or
iSCSI.

So is there a pointer to the current version of the spec? Just reading
these multi-path enumerations starts to give me the feeling this spec
is designed rather badly..