2002-11-20 04:02:32

by NeilBrown

Subject: RFC - new raid superblock layout for md driver


The md driver in linux uses a 'superblock' written to all devices in a
RAID to record the current state and geometry of a RAID and to allow
the various parts to be re-assembled reliably.

The current superblock layout is sub-optimal. It contains a lot of
redundancy and wastes space. In 4K it can only fit 27 component
devices. It has other limitations.

I (and others) would like to define a new (version 1) format that
resolves the problems in the current (0.90.0) format.

The code in 2.5.latest has all the superblock handling factored out so
that defining a new format is very straightforward.

I would like to propose a new layout, and to receive comment on it..

My current design looks like:
/* constant array information - 128 bytes */
u32 md_magic
u32 major_version == 1
u32 feature_map /* bit map of extra features in superblock */
u32 set_uuid[4]
u32 ctime
u32 level
u32 layout
u64 size /* size of component devices, if they are all
* required to be the same (RAID 1/5) */
u32 chunksize
u32 raid_disks
char name[32]
u32 pad1[10];

/* constant this-device information - 64 bytes */
u64 address of superblock in device
u32 number of this device in array /* constant over reconfigurations */
u32 device_uuid[4]
u32 pad3[9]

/* array state information - 64 bytes */
u32 utime
u32 state /* clean, resync-in-progress */
u32 sb_csum
u64 events
u64 resync-position /* flag in state says if this is valid */
u32 number of devices
u32 pad2[8]

/* device state information, indexed by 'number of device in array'
4 bytes per device */
for each device:
u16 position /* in raid array or 0xffff for a spare. */
u16 state flags /* error detected, in-sync */


This has 128 bytes for essentially constant information about the
array, 64 bytes for constant information about this device, 64 bytes
for changeable state information about the array, and 4 bytes per
device for state information about the devices. This would allow an
array with 192 devices in a 1K superblock, and 960 devices in a 4k
superblock (the current size).
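
For concreteness, the proposal can be rendered as a C structure along these
lines. This is a sketch only: the field names are illustrative, the struct
is packed so the byte counts match the text, and (as comes up below) some of
the 64-bit fields would still need padding or rearranging to be naturally
aligned.

    /* Sketch of the proposed v1 layout; names are illustrative, not final. */
    #include <stdio.h>
    #include <stdint.h>

    struct mdp_sb1 {
        /* constant array information - 128 bytes */
        uint32_t magic;
        uint32_t major_version;     /* == 1 */
        uint32_t feature_map;       /* bit map of extra features */
        uint32_t set_uuid[4];
        uint32_t ctime;
        uint32_t level;
        uint32_t layout;
        uint64_t size;              /* component device size */
        uint32_t chunksize;
        uint32_t raid_disks;
        char     name[32];
        uint32_t pad1[10];

        /* constant this-device information - 64 bytes */
        uint64_t sb_offset;         /* address of superblock in device */
        uint32_t dev_number;        /* constant over reconfigurations */
        uint32_t device_uuid[4];
        uint32_t pad3[9];

        /* array state information - 64 bytes */
        uint32_t utime;
        uint32_t state;             /* clean, resync-in-progress */
        uint32_t sb_csum;
        uint64_t events;
        uint64_t resync_position;   /* flag in state says if this is valid */
        uint32_t nr_devices;
        uint32_t pad2[8];

        /* per-device state, indexed by dev_number - 4 bytes each */
        struct { uint16_t position, state; } devs[];
    } __attribute__((packed));

    int main(void)
    {
        /* 256 fixed bytes leave (1024-256)/4 = 192 device slots in a 1K
         * superblock and (4096-256)/4 = 960 slots in a 4K superblock. */
        printf("fixed part: %zu bytes\n", sizeof(struct mdp_sb1));
        printf("devices in 1K: %zu\n", (1024 - sizeof(struct mdp_sb1)) / 4);
        printf("devices in 4K: %zu\n", (4096 - sizeof(struct mdp_sb1)) / 4);
        return 0;
    }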

Other features:
A feature map instead of a minor version number.
64 bit component device size field.
field for storing current position of resync process if array is
shut down while resync is running.
no "minor" field but a textual "name" field instead.
address of superblock in superblock to avoid misidentifying
superblock. e.g. is it in a partition or a whole device.
uuid for each device. This is not directly used by the md driver,
but it is maintained, even if a drive is moved between arrays,
and user-space can use it for tracking devices.

md would, of course, continue to support the current layout
indefinitely, but this new layout would be available for use by people
who don't need compatibility with 2.4 and do want more than 27 devices
etc.

To create an array with the new superblock layout, the user-space
tool would write directly to the devices, (like mkfs does) and then
assemble the array. Creating an array using the ioctl interface will
still create an array with the old superblock.

When the kernel loads a superblock, it would check the major_version
to see which piece of code to use to handle it.
When it writes out a superblock, it would use the same version as was
read in (of course).

This superblock would *not* support in-kernel auto-assembly as that
requires the "minor" field that I have deliberatly removed. However I
don't think this is a big cost as it looks like in-kernel
auto-assembly is about to disappear with the early-user-space patches.

The interpretation of the 'name' field would be up to the user-space
tools and the system administrator.
I imagine having something like:
host:name
where if "host" isn't the current host name, auto-assembly is not
tried, and if "host" is the current host name then:
if "name" looks like "md[0-9]*" then the array is assembled as that
device
else the array is assembled as /dev/mdN for some large, unused N,
and a symlink is created from /dev/md/name to /dev/mdN
If the "host" part is empty or non-existant, then the array would be
assembled no-matter what the hostname is. This would be important
e.g. for assembling the device that stores the root filesystem, as we
may not know the host name until after the root filesystem were loaded.

This would make auto-assembly much more flexible.
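
As a rough illustration of how a user-space tool might apply those rules
(this is only a sketch; choose_assembly() and its printf placeholders are
invented for the example and do not belong to any existing raid tool):

    /* Sketch: decide whether and how to auto-assemble an array whose
     * superblock carries "name", given the local hostname. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void choose_assembly(const char *name)
    {
        char hostname[256] = "";
        const char *colon = strchr(name, ':');

        gethostname(hostname, sizeof(hostname) - 1);

        if (colon) {
            size_t hlen = (size_t)(colon - name);

            /* "host:name": assemble only if host is empty or matches us */
            if (hlen != 0 &&
                (hlen != strlen(hostname) ||
                 strncmp(name, hostname, hlen) != 0)) {
                printf("%s: wrong host, not auto-assembling\n", name);
                return;
            }
            name = colon + 1;
        }

        if (strncmp(name, "md", 2) == 0 &&
            strspn(name + 2, "0123456789") == strlen(name + 2)) {
            /* "name" looks like "md[0-9]*": assemble as that device */
            printf("assemble as /dev/%s\n", name);
        } else {
            /* otherwise pick a large unused N and create a symlink */
            printf("assemble as /dev/mdN, symlink /dev/md/%s -> /dev/mdN\n",
                   name);
        }
    }

    int main(void)
    {
        choose_assembly("md3");            /* host part non-existent */
        choose_assembly("otherhost:data"); /* skipped unless we are otherhost */
        choose_assembly(":scratch");       /* empty host: always assembled */
        return 0;
    }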

Comments welcome.

NeilBrown


2002-11-20 09:56:23

by Anton Altaparmakov

Subject: Re: RFC - new raid superblock layout for md driver

Hi,

On Wed, 20 Nov 2002, Neil Brown wrote:
> I (and others) would like to define a new (version 1) format that
> resolves the problems in the current (0.90.0) format.
>
> The code in 2.5.lastest has all the superblock handling factored out so
> that defining a new format is very straight forward.
>
> I would like to propose a new layout, and to receive comment on it..

If you are making a new layout anyway, I would suggest actually adding the
complete information about each disk in the array into the raid
superblock of each disk in the array. That way, if a disk blows up, you
can just replace the disk, use some to-be-written (?) utility to write the
correct superblock to the new disk, and add it to the array, which then
reconstructs the disk. Preferably all of this happens without ever
rebooting, given a hotplug ide/scsi controller. (-;

From a quick read of the layout it doesn't seem to be possible to do the
above trivially (or certainly not without help of /etc/raidtab), but
perhaps I missed something...

Also, autoassembly would be greatly helped if the superblock contained the
uuid for each of the devices contained in the array. It is then trivial to
unplug all raid devices and move them around on the controller and it
would still just work. Again I may be missing something and that is
already possible to do trivially.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



2002-11-20 13:34:14

by Alan

Subject: Re: RFC - new raid superblock layout for md driver

On Wed, 2002-11-20 at 04:09, Neil Brown wrote:
> u32 set_uuid[4]

Wouldn't u8 for the uuid avoid a lot of endian mess

> u32 ctime

Use some padding so you can go to 64bit times


2002-11-20 13:51:43

by Bill Rugolsky Jr.

Subject: Re: RFC - new raid superblock layout for md driver

On Wed, Nov 20, 2002 at 03:09:18PM +1100, Neil Brown wrote:
> u32 feature_map /* bit map of extra features in superblock */

Perhaps compat/incompat feature flags, like ext[23]?

Also, journal information, such as a journal UUID?

Regards,

Bill Rugolsky

2002-11-20 15:49:52

by steven pratt

Subject: Re: RFC - new raid superblock layout for md driver


Neil Brown wrote:

>I would like to propose a new layout, and to receive comment on it..


>/* constant this-device information - 64 bytes */
>u64 address of superblock in device
>u32 number of this device in array /* constant over reconfigurations */

Does this mean that there can be holes in the numbering for disks that die
and are replaced?

>u32 device_uuid[4]
>u32 pad3[9]

>/* array state information - 64 bytes */
>u32 utime
>u32 state /* clean, resync-in-progress */
>u32 sb_csum

These next 2 fields are not 64 bit aligned. Either rearrange or add
padding.

>u64 events
>u64 resync-position /* flag in state if this is valid)
>u32 number of devices
>u32 pad2[8]



>Other features:
>A feature map instead of a minor version number.

Good.

>64 bit component device size field.

Size in sectors not blocks please.


>no "minor" field but a textual "name" field instead.

Ok, I assume that there will be some way for userspace to query the minor
which gets dynamically assigned when the array is started.

>address of superblock in superblock to avoid misidentifying superblock.
>e.g. is it in a partition or a whole device.

Really needed this.


>The interpretation of the 'name' field would be up to the user-space
>tools and the system administrator.

Yes, so let's leave this out of this discussion.


EVMS 2.0 with full user-space discovery should be able to support the new
superblock format without any problems. We would like to work together on
this new format.

Keep up the good work, Steve



2002-11-20 15:56:19

by Joel Becker

Subject: Re: RFC - new raid superblock layout for md driver

On Wed, Nov 20, 2002 at 03:09:18PM +1100, Neil Brown wrote:
> The interpretation of the 'name' field would be up to the user-space
> tools and the system administrator.
> I imagine having something like:
> host:name
> where if "host" isn't the current host name, auto-assembly is not
> tried, and if "host" is the current host name then:
> if "name" looks like "md[0-9]*" then the array is assembled as that
> device
> else the array is assembled as /dev/mdN for some large, unused N,
> and a symlink is created from /dev/md/name to /dev/mdN
> If the "host" part is empty or non-existant, then the array would be
> assembled no-matter what the hostname is. This would be important
> e.g. for assembling the device that stores the root filesystem, as we
> may not know the host name until after the root filesystem were loaded.

Hmm, what is the intended future interaction of DM and MD? Two
ways at the same problem? Just curious.
Assuming MD as a continually used feature, the "name" bits above
seem to be preparing to support multiple shared users of the array. If
that is the case, shouldn't the superblock contain everything needed for
"clustered" operation?

Joel

--

"When I am working on a problem I never think about beauty. I
only think about how to solve the problem. But when I have finished, if
the solution is not beautiful, I know it is wrong."
- Buckminster Fuller

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127

2002-11-20 16:57:22

by Steven Dake

Subject: Re: RFC - new raid superblock layout for md driver

Neil,

I would suggest adding a 64 bit field called "unique_identifier" to the
per-device structure. This would allow a RAID volume to be locked to a
specific host, allowing the ability for true multihost operation.

For FibreChannel, we have a patch which places the host's FC WWN into
the superblock structure, and only allows importing an array disk (via
ioctl or autostart) if the superblock's WWN matches the target dev_t's
host fibrechannel WWN. We also use this for environments where slots
are used to house either CPU or disks and lock a RAID array to a
specific cpu slot. WWNs are 64 bit, which is why I suggest such a large
bitsize for this field. This really helps in hotswap environments where
a CPU blade is replaced and should use the same disk, but the disk
naming may have changed between reboots.

This could be done without this field, but then the RAID arrays could be
started unintentionally by the wrong host. Imagine a host starting the
wrong RAID array while it has been already started by some other party
(forcing a rebuild) ugh!

Thanks
-steve

Neil Brown wrote:

>The md driver in linux uses a 'superblock' written to all devices in a
>RAID to record the current state and geometry of a RAID and to allow
>the various parts to be re-assembled reliably.
>
>The current superblock layout is sub-optimal. It contains a lot of
>redundancy and wastes space. In 4K it can only fit 27 component
>devices. It has other limitations.
>
>I (and others) would like to define a new (version 1) format that
>resolves the problems in the current (0.90.0) format.
>
>The code in 2.5.lastest has all the superblock handling factored out so
>that defining a new format is very straight forward.
>
>I would like to propose a new layout, and to receive comment on it..
>
>My current design looks like:
> /* constant array information - 128 bytes */
> u32 md_magic
> u32 major_version == 1
> u32 feature_map /* bit map of extra features in superblock */
> u32 set_uuid[4]
> u32 ctime
> u32 level
> u32 layout
> u64 size /* size of component devices, if they are all
> * required to be the same (Raid 1/5 */
> u32 chunksize
> u32 raid_disks
> char name[32]
> u32 pad1[10];
>
> /* constant this-device information - 64 bytes */
> u64 address of superblock in device
> u32 number of this device in array /* constant over reconfigurations */
> u32 device_uuid[4]
> u32 pad3[9]
>
> /* array state information - 64 bytes */
> u32 utime
> u32 state /* clean, resync-in-progress */
> u32 sb_csum
> u64 events
> u64 resync-position /* flag in state if this is valid)
> u32 number of devices
> u32 pad2[8]
>
> /* device state information, indexed by 'number of device in array'
> 4 bytes per device */
> for each device:
> u16 position /* in raid array or 0xffff for a spare. */
> u16 state flags /* error detected, in-sync */
>
>
>This has 128 bytes for essentially constant information about the
>array, 64 bytes for constant information about this device, 64 bytes
>for changable state information about the array, and 4 bytes per
>device for state information about the devices. This would allow an
>array with 192 devices in a 1K superblock, and 960 devices in a 4k
>superblock (the current size).
>
>Other features:
> A feature map instead of a minor version number.
> 64 bit component device size field.
> field for storing current position of resync process if array is
> shut down while resync is running.
> no "minor" field but a textual "name" field instead.
> address of superblock in superblock to avoid misidentifying
> superblock. e.g. is it in a partition or a whole device.
> uuid for each device. This is not directly used by the md driver,
> but it is maintained, even if a drive is moved between arrays,
> and user-space can use it for tracking devices.
>
>md would, of course, continue to support the current layout
>indefinately, but this new layout would be available for use by people
>who don't need compatability with 2.4 and do want more than 27 devices
>etc.
>
>To create an array with the new superblock layout, the user-space
>tool would write directly to the devices, (like mkfs does) and then
>assemble the array. Creating an array using the ioctl interface will
>still create an array with the old superblock.
>
>When the kernel loads a superblock, it would check the major_version
>to see which piece of code to use to handle it.
>When it writes out a superblock, it would use the same version as was
>read in (of course).
>
>This superblock would *not* support in-kernel auto-assembly as that
>requires the "minor" field that I have deliberatly removed. However I
>don't think this is a big cost as it looks like in-kernel
>auto-assembly is about to disappear with the early-user-space patches.
>
>The interpretation of the 'name' field would be up to the user-space
>tools and the system administrator.
>I imagine having something like:
> host:name
>where if "host" isn't the current host name, auto-assembly is not
>tried, and if "host" is the current host name then:
> if "name" looks like "md[0-9]*" then the array is assembled as that
> device
> else the array is assembled as /dev/mdN for some large, unused N,
> and a symlink is created from /dev/md/name to /dev/mdN
>If the "host" part is empty or non-existant, then the array would be
>assembled no-matter what the hostname is. This would be important
>e.g. for assembling the device that stores the root filesystem, as we
>may not know the host name until after the root filesystem were loaded.
>
>This would make auto-assembly much more flexable.
>
>Comments welcome.
>
>NeilBrown

2002-11-20 22:55:21

by NeilBrown

Subject: Re: RFC - new raid superblock layout for md driver

On Wednesday November 20, [email protected] wrote:
> Hi,
>
> On Wed, 20 Nov 2002, Neil Brown wrote:
> > I (and others) would like to define a new (version 1) format that
> > resolves the problems in the current (0.90.0) format.
> >
> > The code in 2.5.lastest has all the superblock handling factored out so
> > that defining a new format is very straight forward.
> >
> > I would like to propose a new layout, and to receive comment on it..
>
> If you are making a new layout anyway, I would suggest to actually add the
> complete information about each disk which is in the array into the raid
> superblock of each disk in the array. In that way if a disk blows up, you
> can just replace the disk use some to be written (?) utility to write the
> correct superblock to the new disk and add it to the array which then
> reconstructs the disk. Preferably all of this happens without ever
> rebooting given a hotplug ide/scsi controller. (-;

What sort of 'complete information about each disk' are you thinking
of?

Hot-spares already work.
Auto-detecting a new drive that has just been physically plugged in
and adding it to a raid array is an issue that requires configuration
well beyond the scope of the superblock I believe.
But if you could be more concrete, I might be convinced.

>
> From a quick read of the layout it doesn't seem to be possible to do the
> above trivially (or certainly not without help of /etc/raidtab), but
> perhaps I missed something...
>
> Also, autoassembly would be greatly helped if the superblock contained the
> uuid for each of the devices contained in the array. It is then trivial to
> unplug all raid devices and move them around on the controller and it
> would still just work. Again I may be missing something and that is
> already possible to do trivially.

Well... it depends on whether you want a 'name' or an 'address' in the
superblock.
A 'name' is something you can use to recognise the device when you see
it, an 'address' is some way to go and find the device if you don't
have it.

Each superblock already has the 'name' of every other device
implicitly, as a device's 'name' is the set_uuid plus a device number.

I think storing addresses in the superblock is a bad idea as they are
in general not stable, and if you did try to store some sort of
stable address, you would need to allocate quite a lot of space which
I don't think is justified.

Just storing a name is enough for auto-assembly providing you can
enumerate all devices. I think at this stage we have to assume that
userspace can enumerate all devices and so can find the device for
each name. i.e. find all devices with the correct set_uuid.

Does that make sense?

Thank you for your feedback.

NeilBrown

2002-11-20 23:04:51

by NeilBrown

Subject: Re: RFC - new raid superblock layout for md driver

On November 20, [email protected] wrote:
> On Wed, 2002-11-20 at 04:09, Neil Brown wrote:
> > u32 set_uuid[4]
>
> Wouldnt u8 for the uuid avoid a lot of endian mess

Probably....
This makes it very similar to 'name'.
The difference is partly the intent for how user-space would use it,
and partly that set_uuid must *never* change, while you would probably
want name to be allowed to change.


>
> > u32 ctime
>
> Use some padding so you can go to 64bit times
>
Before or after? Or just make it 64 bits of seconds now?
This brings up endian-ness. Should I assert 'little-endian' or should
the code check the endianness of the magic number and convert if
necessary?
The former is less code which will be exercised more often, so it is
probably safe.

So:
All values shall be little-endian and times shall be stored in 64
bits with the top 20 bits representing microseconds (so we & with
(1<<44)-1 to get seconds).
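
A sketch of that encoding in C (the md_time_* helper names are invented for
this example; only the arithmetic is taken from the statement above):

    /* 64-bit time: seconds in the low 44 bits, microseconds in the top 20,
     * the whole value stored little-endian on disk. */
    #include <stdint.h>

    static inline uint64_t md_time_pack(uint64_t sec, uint32_t usec)
    {
        return ((uint64_t)usec << 44) | (sec & ((1ULL << 44) - 1));
    }

    static inline uint64_t md_time_seconds(uint64_t t)
    {
        return t & ((1ULL << 44) - 1);   /* "& with (1<<44)-1 to get seconds" */
    }

    static inline uint32_t md_time_usec(uint64_t t)
    {
        return (uint32_t)(t >> 44);      /* top 20 bits */
    }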

Thanks.

NeilBrown

2002-11-20 23:10:27

by NeilBrown

Subject: Re: RFC - new raid superblock layout for md driver

On Wednesday November 20, [email protected] wrote:
> On Wed, Nov 20, 2002 at 03:09:18PM +1100, Neil Brown wrote:
> > u32 feature_map /* bit map of extra features in superblock */
>
> Perhaps compat/incompat feature flags, like ext[23]?

I thought about that, but am not sure that it makes sense as there is
much less metadata in a raid array than there is in a filesystem.
I think I am happier to have initial code require feature_map == 0 or
it doesn't get loaded, and if it becomes an issue, get user-space to
clear any 'compatible' flags before passing the device to an 'old'
kernel.
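
A sketch of that policy (MD_FEATURES_KNOWN and sb1_can_load() are names
invented for this example):

    #include <stdint.h>

    #define MD_FEATURES_KNOWN  0u   /* this (old) code knows no feature bits */

    static inline int sb1_can_load(uint32_t feature_map)
    {
        /* refuse the superblock if any unknown feature bit is set */
        return (feature_map & ~MD_FEATURES_KNOWN) == 0;
    }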

>
> Also, journal information, such as a journal UUID?

As there is no current code, or serious project that I know of, to add
journalling to md (I have thought about it, but it isn't a priority) I
wouldn't like to pre-empt it at all by defining fields now. I would
rather that presence-of-a-journal be indicated by a bit in the feature map,
and that would imply uuid was stored in one of the current 'pad'
fields. I think there is plenty of space.

Thanks,
NeilBrown


>
> Regards,
>
> Bill Rugolsky

2002-11-20 23:17:39

by NeilBrown

Subject: Re: RFC - new raid superblock layout for md driver

On Wednesday November 20, [email protected] wrote:
>
> Neil Brown wrote;
>
> >I would like to propose a new layout, and to receive comment on it..
>
>
> >/* constant this-device information - 64 bytes */
> >u64 address of superblock in device
> >u32 number of this device in array /* constant over reconfigurations */
>
> Does this mean that there can be holes in the numbering for disks that die
> and are replaced?

Yes. When a drive is added to an array it gets a number which it
keeps for its life in the array. This is completely separate from
the number that says what its role in the array is. This number,
together with the set_uuid, forms the 'name' of the device as long as
it is part of the array. So there could well be holes in the
numbering of devices, but in general the set of numbers would be
fairly dense (max number of holes is max number of hot-spares that you
have had in the array at any one time).

>
> >u32 device_uuid[4]
> >u32 pad3[9]
>
> >/* array state information - 64 bytes */
> >u32 utime
> >u32 state /* clean, resync-in-progress */
> >u32 sb_csum
>
> These next 2 fields are not 64 bit aligned. Either rearrange or add
> padding.

Thanks. I think I did check that once, but then I changed things
again :-(
Actually, making utime a u64 makes this properly aligned again, but I
will group the u64s together at the top.

>
> >u64 events
> >u64 resync-position /* flag in state if this is valid)
> >u32 number of devices
> >u32 pad2[8]
>
>
>
> >Other features:
> >A feature map instead of a minor version number.
>
> Good.
>
> >64 bit component device size field.
>
> Size in sectors not blocks please.

Another possibility that I considered was a size in chunks, but
sectors is less confusing.

>
>
> >no "minor" field but a textual "name" field instead.
>
> Ok, I assume that there will be some way for userspace to query the minor
> which gets dynamically assigned when the array is started.

Well, actually it is user-space which dynamically assigns a minor.
It then has the option of recording, possibly as a symlink in /dev, the
relationship between the 'name' of the array and the dynamically
assigned minor.

>
> >address of superblock in superblock to avoid misidentifying superblock.
> >e.g. is it in a partition or a whole device.
>
> Really needed this.
>
>
> >The interpretation of the 'name' field would be up to the user-space
> >tools and the system administrator.
>
> Yes, so let's leave this out of this discussion.

... except to show that it is sufficient to meet the needs of users.


Thanks for your comments,
NeilBrown

2002-11-20 23:24:52

by NeilBrown

Subject: Re: RFC - new raid superblock layout for md driver

On Wednesday November 20, [email protected] wrote:
> On Wed, Nov 20, 2002 at 03:09:18PM +1100, Neil Brown wrote:
> > The interpretation of the 'name' field would be up to the user-space
> > tools and the system administrator.
> > I imagine having something like:
> > host:name
> > where if "host" isn't the current host name, auto-assembly is not
> > tried, and if "host" is the current host name then:
> > if "name" looks like "md[0-9]*" then the array is assembled as that
> > device
> > else the array is assembled as /dev/mdN for some large, unused N,
> > and a symlink is created from /dev/md/name to /dev/mdN
> > If the "host" part is empty or non-existant, then the array would be
> > assembled no-matter what the hostname is. This would be important
> > e.g. for assembling the device that stores the root filesystem, as we
> > may not know the host name until after the root filesystem were loaded.
>
> Hmm, what is the intended future interaction of DM and MD? Two
> ways at the same problem? Just curious.

I see MD and DM as quite different, though I haven't looked much at DM
so I could be wrong.

I see raid1 and raid5 as being the key elements of MD. i.e. handling
redundancy, rebuilding hot spares, stuff like that. raid0 and linear
are almost optional frills.

DM on the other hand doesn't do redundancy (I don't think) but helps
to chop devices up into little bits and put them back together into
other devices.... a bit like a filesystem really, but it provides block
devices instead of files.

So raid0 and linear are more the domain of DM than MD in my mind.
But they are currently supported by MD and there is no real need to
change that.


> Assuming MD as a continually used feature, the "name" bits above
> seem to be preparing to support multiple shared users of the array. If
> that is the case, shouldn't the superblock contain everything needed for
> "clustered" operation?

Only if I knew what 'everything needed for clustered operation' was....
There is room for expansion in the superblock so stuff could be added.
If there were some specific things that you think would help clustered
operation, I'd be happy to hear the details.

Thanks,
NeilBrown

2002-11-20 23:42:13

by Lars Marowsky-Bree

Subject: Re: RFC - new raid superblock layout for md driver

>The md driver in linux uses a 'superblock' written to all devices in a
>RAID to record the current state and geometry of a RAID and to allow
>the various parts to be re-assembled reliably.
>
>The current superblock layout is sub-optimal. It contains a lot of
>redundancy and wastes space. In 4K it can only fit 27 component
>devices. It has other limitations.

Yes. (In particular, getting all the various counters to agree with each other
is a pain ;-)

Steven raises the valid point that multihost operation isn't currently
possible; I just don't agree with his solution:

- Activating a drive only on one host is already entirely possible.
(can be done by device uuid in initrd for example)
- Activating a RAID device from multiple hosts is still not possible.
(Requires way more sophisticated locking support than we currently have)

However, for non-RAID devices like multipathing I believe that activating a
drive on multiple hosts should be possible; i.e., for these it might not be
necessary to scribble to the superblock every time.

(The md patch for 2.4 I sent you already does that; it reconstructs the
available paths fully dynamically on startup (by activating all paths present);
however it still updates the superblock afterwards)

Anyway, on to the format:

>The code in 2.5.lastest has all the superblock handling factored out so
>that defining a new format is very straight forward.
>
>I would like to propose a new layout, and to receive comment on it..
>
>My current design looks like:
> /* constant array information - 128 bytes */
> u32 md_magic
> u32 major_version == 1
> u32 feature_map /* bit map of extra features in superblock */
> u32 set_uuid[4]
> u32 ctime
> u32 level
> u32 layout
> u64 size /* size of component devices, if they are all
> * required to be the same (Raid 1/5 */
> u32 chunksize
> u32 raid_disks
> char name[32]
> u32 pad1[10];

Looks good so far.

> /* constant this-device information - 64 bytes */
> u64 address of superblock in device
> u32 number of this device in array /* constant over reconfigurations
> */
> u32 device_uuid[4]

What is "address of superblock in device" ? Seems redundant, otherwise you
would have been unable to read it, or am I missing something?

Special case here might be required for multipathing. (ie, device_uuid == 0)

> u32 pad3[9]
>
> /* array state information - 64 bytes */
> u32 utime

Timestamps (also above, ctime) are always difficult. Time might not be set
correctly at any given time, in particular during early bootup. This field
should only be advisory.

> u32 state /* clean, resync-in-progress */
> u32 sb_csum
> u64 events
> u64 resync-position /* flag in state if this is valid)
> u32 number of devices
> u32 pad2[8]
>
> /* device state information, indexed by 'number of device in array'
> 4 bytes per device */
> for each device:
> u16 position /* in raid array or 0xffff for a spare. */
> u16 state flags /* error detected, in-sync */

u16 != u32; your position flags don't match up. I'd like to be able to take
the "position in the superblock" as a mapping here so it can be found in this
list, or what is the proposed relationship between the two?

>The interpretation of the 'name' field would be up to the user-space
>tools and the system administrator.
>I imagine having something like:
> host:name
>where if "host" isn't the current host name, auto-assembly is not
>tried, and if "host" is the current host name then:

Oh, well. You seem to sort of have Steven's idea here too ;-) In that case,
I'd go with the idea of Steven. Make that field a uuid of the host.



Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Principal Squirrel
SuSE Labs - Research & Development, SuSE Linux AG

"If anything can go wrong, it will." "Chance favors the prepared (mind)."
-- Capt. Edward A. Murphy -- Louis Pasteur

2002-11-20 23:41:49

by NeilBrown

Subject: Re: RFC - new raid superblock layout for md driver

On Wednesday November 20, [email protected] wrote:
> Neil,
>
> I would suggest adding a 64 bit field called "unique_identifier" to the
> per-device structure. This would allow a RAID volume to be locked to a
> specific host, allowing the ability for true multihost operation.

You seem to want a unique id in 'per device' which will identify the
'volume'.
That doesn't make sense to me so maybe I am missing something.
If you want to identify the 'volume', you put some sort of id in the
'per-volume' data structure.

This is what the 'name' field is for.

>
> For FibreChannel, we have a patch which places the host's FC WWN into
> the superblock structure, and only allows importing an array disk (via
> ioctl or autostart) if the superblock's WWN matches the target dev_t's
> host fibrechannel WWN. We also use this for environments where slots
> are used to house either CPU or disks and lock a RAID array to a
> specific cpu slot. WWNs are 64 bit, which is why I suggest such a large
> bitsize for this field. This really helps in hotswap environments where
> a CPU blade is replaced and should use the same disk, but the disk
> naming may have changed between reboots.
>
> This could be done without this field, but then the RAID arrays could be
> started unintentionally by the wrong host. Imagine a host starting the
> wrong RAID array while it has been already started by some other party
> (forcing a rebuild) ugh!

Put your 64 bit WWN in the 'name' field, and teach user-space to match
'name' to FC adapter.

Does that work for you?

NeilBrown

2002-11-20 23:42:13

by Lars Marowsky-Bree

Subject: Re: RFC - new raid superblock layout for md driver

On 2002-11-20T10:05:29,
Steven Dake <[email protected]> said:

> This could be done without this field, but then the RAID arrays could be
> started unintentionally by the wrong host. Imagine a host starting the
> wrong RAID array while it has been already started by some other party
> (forcing a rebuild) ugh!

This is already easy and does not require addition of a field to the md
superblock.

Just only explicitly start disks with the proper uuid in the md superblock.
Don't simply start them all.

(I'll reply to Neil's mail momentarily)


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Principal Squirrel
SuSE Labs - Research & Development, SuSE Linux AG

"If anything can go wrong, it will." "Chance favors the prepared (mind)."
-- Capt. Edward A. Murphy -- Louis Pasteur

2002-11-20 23:55:17

by Alan

Subject: Re: RFC - new raid superblock layout for md driver

On Wed, 2002-11-20 at 23:11, Neil Brown wrote:
> This brings up endian-ness? Should I assert 'little-endian' or should
> the code check the endianness of the magic number and convert if
> necessary?
> The former is less code which will be exercised more often, so it is
> probably safe.

From my own experience, pick a single endianness; otherwise some tool will
always get one endian case wrong on one platform with one word size.

>
> So:
> All values shall be little-endian and times shall be stored in 64
> bits with the top 20 bits representing microseconds (so we & with
> (1<<44)-1 to get seconds.

Could do - or struct timeval or whatever

2002-11-21 00:20:51

by Steven Dake

Subject: Re: RFC - new raid superblock layout for md driver



Neil Brown wrote:

>On Wednesday November 20, [email protected] wrote:
>
>
>>Neil,
>>
>>I would suggest adding a 64 bit field called "unique_identifier" to the
>>per-device structure. This would allow a RAID volume to be locked to a
>>specific host, allowing the ability for true multihost operation.
>>
>>
>
>You seem to want a uniq id in 'per device' which will identify the
>'volume'.
>That doesn't make sense to me so maybe I am missing something.
>If you want to identify the 'volume', you put some sort of id in the
>'per-volume' data structure.
>
>This is what the 'name' field is for.
>
>
This is useful, at least in the current raid implementation, because
md_import can be changed to return an error if the device's unique
identifier doesn't match the host identifier. In this way, each device
of a RAID volume is individually locked to the specific host, and
rejection occurs at device-import time.

Perhaps locking using the name field would work except that other
userspace applications may reuse that name field for some other purpose,
not providing any kind of uniqueness.

Thanks for the explanation of how the name field was intended to be used.

-steve


2002-11-21 00:24:46

by NeilBrown

Subject: Re: RFC - new raid superblock layout for md driver

On Thursday November 21, [email protected] wrote:
>
> However, for none-RAID devices like multipathing I believe that activating a
> drive on multiple hosts should be possible; ie, for these it might not be
> necessary to scribble to the superblock every time.
>
> (The md patch for 2.4 I sent you already does that; it reconstructs the
> available paths fully dynamic on startup (by activating all paths present);
> however it still updates the superblock afterwards)

I haven't thought much about multipath, I admit.....

My feeling is that a multipath superblock should never be updated:
just written once at creation and left like that (raid0 and linear are
much the same). The only loss would be the utime update, and I don't
think that is a real loss.

> > /* constant this-device information - 64 bytes */
> > u64 address of superblock in device
> > u32 number of this device in array /* constant over reconfigurations
> > */
> > u32 device_uuid[4]
>
> What is "address of superblock in device" ? Seems redundant, otherwise you
> would have been unable to read it, or am missing something?

Suppose I have a device with a partition that ends at the end of the
device (and starts at a 64k-aligned location). Then if there is a
superblock in the whole device, it will also be in the final
partition... but which is right? Storing the location of the
superblock allows us to disambiguate.
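
A sketch of that check (sb1_belongs_here() is an invented name; the point is
just to compare the recorded offset with where the superblock was read from):

    #include <stdint.h>

    /* A whole-disk superblock seen through a partition that happens to end
     * at the end of the disk records the whole-disk offset, which will not
     * match the offset within the partition, so it is rejected there. */
    static inline int sb1_belongs_here(uint64_t recorded_sb_offset,
                                       uint64_t actual_read_offset)
    {
        return recorded_sb_offset == actual_read_offset;
    }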

>
> Special case here might be required for multipathing. (ie, device_uuid == 0)
>
> > u32 pad3[9]
> >
> > /* array state information - 64 bytes */
> > u32 utime
>
> Timestamps (also above, ctime) are always difficult. Time might not be set
> correctly at any given time, in particular during early bootup. This field
> should only be advisory.

Indeed, they are only advisory.

>
> > u32 state /* clean, resync-in-progress */
> > u32 sb_csum
> > u64 events
> > u64 resync-position /* flag in state if this is valid)
> > u32 number of devices
> > u32 pad2[8]
> >
> > /* device state information, indexed by 'number of device in array'
> > 4 bytes per device */
> > for each device:
> > u16 position /* in raid array or 0xffff for a spare. */
> > u16 state flags /* error detected, in-sync */
>
> u16 != u32; your position flags don't match up. I'd like to be able to take
> the "position in the superblock" as a mapping here so it can be found in this
> list, or what is the proposed relationship between the two?

u16 for device flags. u32 (overkill) for array flags. Is there a
problem that I am missing?

There is an array of
struct {
u16 position; /* aka role. 0xffff for spare */
u16 state; /* error/insync */
}
in each copy of the superblock. It is indexed by 'number of this
device in array' which is constant for any given device despite any
configuration changes (until the device is removed from the array).
If you have two hot spares, then their 'position' (aka role) will
initially be 0xffff. After a failure, one will be swapped in and its
position becomes (say) 3. Once rebuild is complete, the insync flag
is set and the device becomes fully active.
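
A sketch of those semantics (the struct, flag names and helpers are invented
for this example):

    #include <stdint.h>

    #define MD_ROLE_SPARE  0xffff
    #define MD_DEV_FAULTY  (1 << 0)   /* error detected */
    #define MD_DEV_INSYNC  (1 << 1)   /* fully rebuilt */

    struct sb1_dev {
        uint16_t position;   /* role in the array, or MD_ROLE_SPARE */
        uint16_t state;      /* MD_DEV_* flags */
    };

    /* Promote a hot spare into the role left by a failed device; it only
     * becomes in-sync later, once reconstruction has finished. */
    static inline void promote_spare(struct sb1_dev *spare, uint16_t failed_role)
    {
        spare->position = failed_role;
        spare->state &= (uint16_t)~MD_DEV_INSYNC;   /* rebuilding */
    }

    static inline void rebuild_complete(struct sb1_dev *dev)
    {
        dev->state |= MD_DEV_INSYNC;
    }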

Does that make it clear?

NeilBrown

2002-11-21 00:27:46

by Steven Dake

Subject: Re: RFC - new raid superblock layout for md driver



Lars Marowsky-Bree wrote:

>>The md driver in linux uses a 'superblock' written to all devices in a
>>RAID to record the current state and geometry of a RAID and to allow
>>the various parts to be re-assembled reliably.
>>
>>The current superblock layout is sub-optimal. It contains a lot of
>>redundancy and wastes space. In 4K it can only fit 27 component
>>devices. It has other limitations.
>>
>>
>
>Yes. (In particular, getting all the various counters to agree with each other
>is a pain ;-)
>
>Steven raises the valid point that multihost operation isn't currently
>possible; I just don't agree with his solution:
>
>- Activating a drive only on one host is already entirely possible.
> (can be done by device uuid in initrd for example)
>
>
This technique doesn't work if autostart is set (the partition type is
tagged as a RAID volume) or if the user is stupid and starts the wrong
uuid by accident. It also requires the user to keep track of which
uuids are used by which hosts, which is a pain. Trust me, users will
start the wrong RAID volume and have a hard time keeping track of the
right UUIDs to assemble. The technique I use ensures that the RAID
volumes can all be set to autostart and only the correct volumes will be
started on the correct host.

>- Activating a RAID device from multiple hosts is still not possible.
> (Requires way more sophisticated locking support than we currently have)
>
>
The only application where having a RAID volume shareable between two
hosts is useful is for a clustering filesystem (GFS comes to mind).
Since RAID is an important need for GFS (if a disk node fails, you
don't want to lose the entire filesystem as you would on GFS) this
possibility may be worth exploring.

Since GFS isn't GPL at this point and OpenGFS needs a lot of work, I've
not spent the time looking at it.

Neil, have you thought about sharing an active volume between two hosts
and what sort of support would be needed in the superblock?

Thanks
-steve

>However, for none-RAID devices like multipathing I believe that activating a
>drive on multiple hosts should be possible; ie, for these it might not be
>necessary to scribble to the superblock every time.
>
>(The md patch for 2.4 I sent you already does that; it reconstructs the
>available paths fully dynamic on startup (by activating all paths present);
>however it still updates the superblock afterwards)
>
>Anyway, on to the format:
>
>
>
>>The code in 2.5.lastest has all the superblock handling factored out so
>>that defining a new format is very straight forward.
>>
>>I would like to propose a new layout, and to receive comment on it..
>>
>>My current design looks like:
>> /* constant array information - 128 bytes */
>> u32 md_magic
>> u32 major_version == 1
>> u32 feature_map /* bit map of extra features in superblock */
>> u32 set_uuid[4]
>> u32 ctime
>> u32 level
>> u32 layout
>> u64 size /* size of component devices, if they are all
>> * required to be the same (Raid 1/5 */
>> u32 chunksize
>> u32 raid_disks
>> char name[32]
>> u32 pad1[10];
>>
>>
>
>Looks good so far.
>
>
>
>> /* constant this-device information - 64 bytes */
>> u64 address of superblock in device
>> u32 number of this device in array /* constant over reconfigurations
>> */
>> u32 device_uuid[4]
>>
>>
>
>What is "address of superblock in device" ? Seems redundant, otherwise you
>would have been unable to read it, or am missing something?
>
>Special case here might be required for multipathing. (ie, device_uuid == 0)
>
>
>
>> u32 pad3[9]
>>
>> /* array state information - 64 bytes */
>> u32 utime
>>
>>
>
>Timestamps (also above, ctime) are always difficult. Time might not be set
>correctly at any given time, in particular during early bootup. This field
>should only be advisory.
>
>
>
>> u32 state /* clean, resync-in-progress */
>> u32 sb_csum
>> u64 events
>> u64 resync-position /* flag in state if this is valid)
>> u32 number of devices
>> u32 pad2[8]
>>
>> /* device state information, indexed by 'number of device in array'
>> 4 bytes per device */
>> for each device:
>> u16 position /* in raid array or 0xffff for a spare. */
>> u16 state flags /* error detected, in-sync */
>>
>>
>
>u16 != u32; your position flags don't match up. I'd like to be able to take
>the "position in the superblock" as a mapping here so it can be found in this
>list, or what is the proposed relationship between the two?
>
>
>
>>The interpretation of the 'name' field would be up to the user-space
>>tools and the system administrator.
>>I imagine having something like:
>> host:name
>>where if "host" isn't the current host name, auto-assembly is not
>>tried, and if "host" is the current host name then:
>>
>>
>
>Oh, well. You seem to sort of have Steven's idea here too ;-) In that case,
>I'd go with the idea of Steven. Make that field a uuid of the host.
>
>
>
>Sincerely,
> Lars Marowsky-Brée <[email protected]>
>
>
>

2002-11-21 00:34:30

by Alan

Subject: Re: RFC - new raid superblock layout for md driver

On Thu, 2002-11-21 at 00:35, Steven Dake wrote:
> Since GFS isn't GPL at this point and OpenGFS needs alot of work, I've
> not spent the time looking at it.

OCFS is probably the right place to be looking in terms of development
in this area right now

2002-11-21 01:37:52

by Doug Ledford

Subject: Re: RFC - new raid superblock layout for md driver

On Thu, Nov 21, 2002 at 10:31:47AM +1100, Neil Brown wrote:
> I see MD and DM as quite different, though I haven't looked much as DM
> so I could be wrong.

I haven't yet played with the new dm code, but if it's like I expect it to
be, then I predict that in a few years, or maybe much less, md and dm will
be two parts of the same whole. The purpose of md is to map from a single
logical device to all the underlying physical devices. The purpose of LVM
code in general is to handle the creation, organization, and mapping of
multiple physical devices into a single logical device. LVM code is
usually shy on advanced mapping routines like RAID5, relying instead on
underlying hardware to handle things like that, while the LVM code itself
just concatenates the physical volumes in the logical volume, similar to how
linear would do things. But, the things LVM does do that are very handy,
are things like adding a new disk to a volume group and having the volume
group automatically expand to fill the additional space, making it
possible to increase the size of a logical volume on the fly.

When you get right down to it, MD is 95% advanced mapping of physical
disks with different possibilities for redundancy and performance. DM is
95% advanced handling of logical volumes including snapshot support,
shrink/grow on the fly support, labelling, sharing, etc. The best of both
worlds would be to make all of the MD modules be plug-ins in the DM code
so that anyone creating a logical volume from a group of physical disks
could pick which mapping they want used; linear, raid0, raid1, raid5, etc.
You would also want all the md modules inside the DM/LVM core to support
the advanced features of LVM, with the online resizing being the primary
one that the md modules would need to implement and export an interface
for. I would think that the snapshot support would be done at the LVM/DM
level instead of in the individual md modules.

Anyway, that's my take on how the two *should* go over the next year or
so, who knows if that's what will actually happen.


--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606

2002-11-21 15:16:55

by John Stoffel

Subject: Re: RFC - new raid superblock layout for md driver


Steven> This is useful, atleast in the current raid implementation,
Steven> because md_import can be changed to return an error if the
Steven> device's unique identifier doesn't match the host identifier.
Steven> In this way, each device of a RAID volume is individually
Steven> locked to the specific host, and rejection occurs at import of
Steven> the device time.

This is a key issue on SANs as well. I think that having the hosts'
UUID in the RAID superblock will allow rejection to happen
gracefully. If needed, the user-land tools can have a --force
option.

Steven> Perhaps locking using the name field would work except that
Steven> other userspace applications may reuse that name field for
Steven> some other purpose, not providing any kind of uniqueness.

I think there need to be two fields: a UUID field for the host
owning the RAID superblocks, and then a name field so that the host,
along with any other systems which can *view* the RAID superblock, can
know the user-defined name.

John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
[email protected] - http://www.lucent.com - 978-399-0479

2002-11-21 19:27:33

by Joel Becker

Subject: Re: RFC - new raid superblock layout for md driver

On Wed, Nov 20, 2002 at 08:46:25PM -0500, Doug Ledford wrote:
> I haven't yet played with the new dm code, but if it's like I expect it to
> be, then I predict that in a few years, or maybe much less, md and dm will
> be two parts of the same whole. The purpose of md is to map from a single

Most LVMs support mirroring as an essential function. They
don't usually support RAID5, leaving that to hardware.
I certainly don't want to have to deal with two disparate
systems to get my code up and running. I don't want to be limited in my
mirroring options at the block device level.
DM supports mirroring. It's a simple 1:2 map. Imagine this LVM
volume layout, where volume 1 is data and mirrored, and volume 2 is some
scratch space crossing both disks.

[Disk 1] [Disk 2]
[volume 1] [volume 1 copy]
[ volume 2 ]

If DM handles the mirroring, this works great. Disk 1 and disk
2 are handled either as the whole disk (sd[ab]) or one big partition on
each disk (sd[ab]1), with DM handling the sizing and layout, even
dynamically.
If MD is handling this, then the disks have to be partitioned.
sd[ab]1 contain the portions of md0, and sd[ab]2 are managed by DM. I
can't resize the partitions on the fly, I can't break the mirror to add
space to volume 2 quickly, etc.

Joel

--

"There are only two ways to live your life. One is as though nothing
is a miracle. The other is as though everything is a miracle."
- Albert Einstein

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127

2002-11-21 19:29:56

by Joel Becker

Subject: Re: RFC - new raid superblock layout for md driver

On Wed, Nov 20, 2002 at 10:05:29AM -0700, Steven Dake wrote:
> per-device structure. This would allow a RAID volume to be locked to a
> specific host, allowing the ability for true multihost operation.

Locking to a specific host isn't the only thing to do though.
Allowing multiple hosts to share the disk is quite interesting as well.

Joel


--

The zen have a saying:
"When you learn how to listen, ANYONE can be your teacher."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127

2002-11-21 19:32:53

by Joel Becker

Subject: Re: RFC - new raid superblock layout for md driver

On Thu, Nov 21, 2002 at 12:47:43AM +0100, Lars Marowsky-Bree wrote:
> However, for none-RAID devices like multipathing I believe that activating a
> drive on multiple hosts should be possible; ie, for these it might not be
> necessary to scribble to the superblock every time.

Again, if you don't use persistent superblock and merely run the
mkraid from your initscripts (or initrd), raid0 and multipath work just
fine today.

Joel


--

"We will have to repent in this generation not merely for the
vitriolic words and actions of the bad people, but for the
appalling silence of the good people."
- Rev. Dr. Martin Luther King, Jr.

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127

2002-11-21 19:49:20

by Steven Dake

Subject: Re: RFC - new raid superblock layout for md driver

Doug,

EVMS integrates all of this stuff together into one cohesive piece of
technology.

But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD
should be modified to support volume management. Since RAID 1 and RAID
5 are easier to implement, LVM is probably the best place to put all
this stuff.

Doug Ledford wrote:

>On Thu, Nov 21, 2002 at 11:34:24AM -0800, Joel Becker wrote:
>
>
>>On Wed, Nov 20, 2002 at 08:46:25PM -0500, Doug Ledford wrote:
>>
>>
>>>I haven't yet played with the new dm code, but if it's like I expect it to
>>>be, then I predict that in a few years, or maybe much less, md and dm will
>>>be two parts of the same whole. The purpose of md is to map from a single
>>>
>>>
>> Most LVMs support mirroring as an essential function. They
>>don't usually support RAID5, leaving that to hardware.
>> I certainly don't want to have to deal with two disparate
>>systems to get my code up and running. I don't want to be limited in my
>>mirroring options at the block device level.
>> DM supports mirroring. It's a simple 1:2 map. Imagine this LVM
>>volume layout, where volume 1 is data and mirrored, and volume 2 is some
>>scratch space crossing both disks.
>>
>> [Disk 1] [Disk 2]
>> [volume 1] [volume 1 copy]
>> [ volume 2 ]
>>
>> If DM handles the mirroring, this works great. Disk 1 and disk
>>2 are handled either as the whole disk (sd[ab]) or one big partition on
>>each disk (sd[ab]1), with DM handling the sizing and layout, even
>>dynamically.
>> If MD is handling this, then the disks have to be partitioned.
>>sd[ab]1 contain the portions of md0, and sd[ab]2 are managed by DM. I
>>can't resize the partitions on the fly, I can't break the mirror to add
>>space to volume 2 quickly, etc.
>>
>>
>
>Not at all. That was the point of me entire email, that the LVM code
>should handle these types of shuffles of space and simply use md modules
>as the underlying mapper technology. Then, you go to one place to both
>specify how things are laid out and what mapping is used in those laid out
>spaces. Basically, I'm saying how I think things *should* be, and you're
>telling me how they *are*. I know this, and I'm saying how things *are*
>is wrong. There *should* be no md superblocks, there should only be dm
>superblocks on LVM physical devices and those DM superblocks should
>include the data needed to fire up the proper md module on the proper
>physical extents based upon what mapper technology is specified in the
>DM superblock and what layout is specified in the DM superblock. In my
>opinion, the existence of both an MD and DM driver is wrong because they
>are inherently two sides of the same coin, logical device mapping support,
>with one being better at putting physical disks into intelligent arrays
>and one being better at mapping different logical volumes onto one or more
>physical volume groups.
>
>
>

2002-11-21 19:45:33

by Doug Ledford

Subject: Re: RFC - new raid superblock layout for md driver

On Thu, Nov 21, 2002 at 11:34:24AM -0800, Joel Becker wrote:
> On Wed, Nov 20, 2002 at 08:46:25PM -0500, Doug Ledford wrote:
> > I haven't yet played with the new dm code, but if it's like I expect it to
> > be, then I predict that in a few years, or maybe much less, md and dm will
> > be two parts of the same whole. The purpose of md is to map from a single
>
> Most LVMs support mirroring as an essential function. They
> don't usually support RAID5, leaving that to hardware.
> I certainly don't want to have to deal with two disparate
> systems to get my code up and running. I don't want to be limited in my
> mirroring options at the block device level.
> DM supports mirroring. It's a simple 1:2 map. Imagine this LVM
> volume layout, where volume 1 is data and mirrored, and volume 2 is some
> scratch space crossing both disks.
>
> [Disk 1] [Disk 2]
> [volume 1] [volume 1 copy]
> [ volume 2 ]
>
> If DM handles the mirroring, this works great. Disk 1 and disk
> 2 are handled either as the whole disk (sd[ab]) or one big partition on
> each disk (sd[ab]1), with DM handling the sizing and layout, even
> dynamically.
> If MD is handling this, then the disks have to be partitioned.
> sd[ab]1 contain the portions of md0, and sd[ab]2 are managed by DM. I
> can't resize the partitions on the fly, I can't break the mirror to add
> space to volume 2 quickly, etc.

Not at all. That was the point of my entire email, that the LVM code
should handle these types of shuffles of space and simply use md modules
as the underlying mapper technology. Then, you go to one place to both
specify how things are laid out and what mapping is used in those laid out
spaces. Basically, I'm saying how I think things *should* be, and you're
telling me how they *are*. I know this, and I'm saying how things *are*
is wrong. There *should* be no md superblocks, there should only be dm
superblocks on LVM physical devices and those DM superblocks should
include the data needed to fire up the proper md module on the proper
physical extents based upon what mapper technology is specified in the
DM superblock and what layout is specified in the DM superblock. In my
opinion, the existence of both an MD and DM driver is wrong because they
are inherently two sides of the same coin, logical device mapping support,
with one being better at putting physical disks into intelligent arrays
and one being better at mapping different logical volumes onto one or more
physical volume groups.

--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606

2002-11-21 19:59:19

by Joel Becker

Subject: Re: RFC - new raid superblock layout for md driver

On Thu, Nov 21, 2002 at 02:54:06PM -0500, Doug Ledford wrote:
> opinion, the existence of both an MD and DM driver is wrong because they
> are inherently two sides of the same coin

This is exactly my point. I got "MD and DM should be used
together" out of your email, and I guess I didn't get your stance
clearly.

Joel

--

Life's Little Instruction Book #69

"Whistle"

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127

2002-11-21 20:29:57

by Doug Ledford

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Thu, Nov 21, 2002 at 12:57:42PM -0700, Steven Dake wrote:
> Doug,
>
> EVMS integrates all of this stuff together into one cohesive piece of
> technology.
>
> But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD
> should be modified to support volume management. Since RAID 1 and RAID
> 5 are easier to implement, LVM is probably the best place to put all
> this stuff.

Yep. I tend to agree there. A little work to make device mapping modular
in LVM, and a little work to make the md modules plug into LVM, and you
could be done. All that would be left then is adding the right stuff into
the user space tools. Basically, what irks me about the current situation
is that right now in the Red Hat installer, if I want LVM features I have
to create one type of object with a disk, and if I want reasonable
software RAID I have to create another type of object with partitions.
That shouldn't be the case, I should just create an LVM logical volume,
assign physical disks to it, and then additionally assign the redundancy
or performance layout I want (IMNSHO) :-)


--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606

2002-11-21 20:41:16

by Steven Dake

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

Doug,

Yup this would be ideal and I think this is what EVMS tries to do,
although I haven't tried it.

The advantage of doing such a thing would also be that MD could be made
to work with shared LVM VGs for shared storage environments.

now to write the code...

-steve

Doug Ledford wrote:

>On Thu, Nov 21, 2002 at 12:57:42PM -0700, Steven Dake wrote:
>
>
>>Doug,
>>
>>EVMS integrates all of this stuff together into one cohesive piece of
>>technology.
>>
>>But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD
>>should be modified to support volume management. Since RAID 1 and RAID
>>5 are easier to implement, LVM is probably the best place to put all
>>this stuff.
>>
>>
>
>Yep. I tend to agree there. A little work to make device mapping modular
>in LVM, and a little work to make the md modules plug into LVM, and you
>could be done. All that would be left then is adding the right stuff into
>the user space tools. Basically, what irks me about the current situation
>is that right now in the Red Hat installer, if I want LVM features I have
>to create one type of object with a disk, and if I want reasonable
>software RAID I have to create another type of object with partitions.
>That shouldn't be the case, I should just create an LVM logical volume,
>assign physical disks to it, and then additionally assign the redundancy
>or performance layout I want (IMNSHO) :-)
>
>
>
>

2002-11-21 20:53:47

by Alan

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Thu, 2002-11-21 at 19:57, Steven Dake wrote:
> Doug,
>
> EVMS integrates all of this stuff together into one cohesive piece of
> technology.
>
> But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD
> should be modified to support volume management. Since RAID 1 and RAID
> 5 are easier to implement, LVM is probably the best place to put all
> this stuff.

User space issue. It's about the tools' view, not about the kernel drivers.

2002-11-21 21:14:17

by Doug Ledford

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Thu, Nov 21, 2002 at 09:29:36PM +0000, Alan Cox wrote:
> On Thu, 2002-11-21 at 19:57, Steven Dake wrote:
> > Doug,
> >
> > EVMS integrates all of this stuff together into one cohesive piece of
> > technology.
> >
> > But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD
> > should be modified to support volume management. Since RAID 1 and RAID
> > 5 are easier to implement, LVM is probably the best place to put all
> > this stuff.
>
> User space issue. It's about the tools' view, not about the kernel drivers.

Not entirely true. You could do everything in user space except online
resize of raid0/4/5 arrays, that requires specific support in the md
modules and it begs for integration between LVM and MD since the MD is
what has to resize the underlying device yet it's the LVM that usually
handles filesystem resizing.

--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606

2002-11-21 21:11:26

by Kevin Corry

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Thursday 21 November 2002 14:49, Steven Dake wrote:
> Doug,
>
> Yup this would be ideal and I think this is what EVMS tries to do,
> although I haven't tried it.

This is indeed what EVMS's new design does. It has user-space plugins that
understand a variety of on-disk-metadata formats. There are plugins for LVM
volumes, for MD RAID devices, for partitions, as well as others. The plugins
communicate with the MD driver to activate MD devices, and with the
device-mapper driver to activate other devices.

As for whether DM and MD kernel drivers should be merged: I imagine it could
be done, since DM already has support for easily adding new modules, but I
don't see any overwhelming reason to merge them right now. I'm sure it will
be discussed more when 2.7 comes out. For now they seem to work fine as
separate drivers doing what each specializes in. All the integration issues
that have been brought up can usually be dealt with in user-space.

--
Kevin Corry
[email protected]
http://evms.sourceforge.net/

2002-11-21 21:29:07

by Kevin Corry

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Thursday 21 November 2002 15:22, Doug Ledford wrote:
> On Thu, Nov 21, 2002 at 09:29:36PM +0000, Alan Cox wrote:
> > On Thu, 2002-11-21 at 19:57, Steven Dake wrote:
> > > Doug,
> > >
> > > EVMS integrates all of this stuff together into one cohesive piece of
> > > technology.
> > >
> > > But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD
> > > should be modified to support volume management. Since RAID 1 and RAID
> > > 5 are easier to implement, LVM is probably the best place to put all
> > > this stuff.
> >
> > User space issue. It's about the tools' view, not about the kernel drivers.
>
> Not entirely true. You could do everything in user space except online
> resize of raid0/4/5 arrays, that requires specific support in the md
> modules and it begs for integration between LVM and MD since the MD is
> what has to resize the underlying device yet it's the LVM that usually
> handles filesystem resizing.

LVM doesn't handle the filesystem resizing, the filesystem tools do. The only
thing you need is something in user-space to ensure the correct ordering. For
an expand, the MD device must be expanded first. When that is complete,
resizefs is called to expand the filesystem.

MD currently doesn't allow resize of RAID 0, 4 or 5, because expanding
striped devices is way ugly. If it was determined to be possible, the MD
driver may need additional support to allow online resize. But it is just as
easy to add this support to MD rather than have to merge MD and DM.

--
Kevin Corry
[email protected]
http://evms.sourceforge.net/

2002-11-21 21:46:42

by Doug Ledford

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Thu, Nov 21, 2002 at 02:53:23PM -0600, Kevin Corry wrote:
>
> LVM doesn't handle the filesystem resizing, the filesystem tools do. The only
> thing you need is something in user-space to ensure the correct ordering. For
> an expand, the MD device must be expanded first. When that is complete,
> resizefs is called to expand the filesystem.
>
> MD currently doesn't allow resize of RAID 0, 4 or 5, because expanding
> striped devices is way ugly.

MD doesn't; raidreconf does, but not online.

> If it was determined to be possible, the MD
> driver may need additional support to allow online resize.

Yes, it would. It's not impossible, just difficult.

> But it is just as
> easy to add this support to MD rather than have to merge MD and DM.

Well, merging the two would actually be rather a simple task I think since
you would still keep each md mode a separate module, the only difference
might be some inter-communication call backs between LVM and MD, but even
those aren't necessarily required. The prime benefit I would see from
making the two into one is being able to integrate all the disparate
superblocks into a single superblock format that helps to avoid any
possible startup errors between the different logical mapping levels.

--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606

2002-11-21 23:28:41

by Luca Berra

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Thu, Nov 21, 2002 at 02:54:06PM -0500, Doug Ledford wrote:
>is wrong. There *should* be no md superblocks, there should only be dm
>superblocks on LVM physical devices and those DM superblocks should
>include the data needed to fire up the proper md module on the proper
>physical extents based upon what mapper technology is specified in the
>DM superblock and what layout is specified in the DM superblock. In my
There are no DM superblocks; DM only maps sectors of existing devices
into a new (logical) device. The decision of which sectors should be
mapped, and where, rests in user-space, be it LVM2, dmsetup, EVMS or
whatever.

Regards,
L.

--
Luca Berra -- [email protected]
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \

2002-11-21 23:43:11

by Luca Berra

[permalink] [raw]
Subject: DM vs MD (Was: RFC - new raid superblock layout for md driver)

On Thu, Nov 21, 2002 at 09:29:36PM +0000, Alan Cox wrote:
>On Thu, 2002-11-21 at 19:57, Steven Dake wrote:
>> Doug,
>>
>> EVMS integrates all of this stuff together into one cohesive piece of
>> technology.
>>
>> But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD
>> should be modified to support volume management. Since RAID 1 and RAID
>> 5 are easier to implement, LVM is probably the best place to put all
>> this stuff.
>
>User space issue. It's about the tools' view, not about the kernel drivers.
>
OK,
dm should be modified to support raid1 or raid5 or raid-whatever,
probably using code from md (included or as a module), and md
should be modified to use dm for the request mapping work.
The problem is that raid needs some way to keep state, so whether we want
to keep it in the md superblock or in LVM metadata, we need to do this in
kernel space.
And I don't feel dm is the correct place for this, unless Joe has a
very elegant solution :)

L.

--
Luca Berra -- [email protected]
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \

2002-11-22 00:01:48

by Kenneth D. Merry

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Wed, Nov 20, 2002 at 10:03:26 +0000, Anton Altaparmakov wrote:
> Hi,
>
> On Wed, 20 Nov 2002, Neil Brown wrote:
> > I (and others) would like to define a new (version 1) format that
> > resolves the problems in the current (0.90.0) format.
> >
> > The code in 2.5.lastest has all the superblock handling factored out so
> > that defining a new format is very straight forward.
> >
> > I would like to propose a new layout, and to receive comment on it..
>
> If you are making a new layout anyway, I would suggest to actually add the
> complete information about each disk which is in the array into the raid
> superblock of each disk in the array. In that way if a disk blows up, you
> can just replace the disk, use some to-be-written (?) utility to write the
> correct superblock to the new disk, and add it to the array, which then
> reconstructs the disk. Preferably all of this happens without ever
> rebooting given a hotplug ide/scsi controller. (-;
>
> From a quick read of the layout it doesn't seem to be possible to do the
> above trivially (or certainly not without help of /etc/raidtab), but
> perhaps I missed something...
>
> Also, autoassembly would be greatly helped if the superblock contained the
> uuid for each of the devices contained in the array. It is then trivial to
> unplug all raid devices and move them around on the controller and it
> would still just work. Again I may be missing something and that is
> already possible to do trivially.

This is a good idea. Having all of the devices listed in the metadata on
each disk is very helpful. (See below for why.)

Here are some of my ideas about the features you'll want out of a new type
of metadata:

[ these you've already got ]

- each array has a unique identifier (you've got this already)
- each disk/partition/component has a unique identifier (you've got this
already)
- a monotonically increasing serial number that gets incremented every
time you write out the metadata (you've got this, the 'events' field)

[ these are features I think would be good to have ]

- Per-array state that lets you know whether you're doing a resync,
reconstruction, verify, verify and fix, and so on. This is part of the
state you'll need to do checkpointing -- picking up where you left off
after a reboot during the middle of an operation.

- Per-array block number that tells you how far along you are in a verify,
resync, reconstruction, etc. If you reboot, you can, for example, pick
a verify back up where you left off.

- Enough per-disk state so you can determine, if you're doing a resync or
reconstruction, which disk is the target of the operation. When I was
doing a lot of work on md a while back, one of the things I ran into is
that when you do a resync of a RAID-1, it always resyncs from the first
to the second disk, even if the first disk is the one out of sync. (I
changed this, with Adaptec metadata at least, so it would resync onto
the correct disk.)

- Each component knows about every other component in the array. (It
knows by UUID, not just that there are N other devices in the array.)
This is an important piece of information:
- You can compose the array now, using each disk's set_uuid and the
position of the device in the array, and by using the events
field to filter out the older of two disks that claim the same
position.

The problem comes in more complicated scenarios. For example:
- user pulls one disk out of a RAID-1 with a spare
- md reconstructs onto the spare
- user shuts down machine, pulls the (former) spare that is
now part of the machine, and replaces the disk that he
originally pulled.

So now you've got a scenario where you have a disk that claims to
be part of the array (same set_uuid), but its events field is a
little behind. You could just resync the disk since it is out of
date, but still claims to be part of the array. But you'd be
back in the same position if the user pulls the disk again and
puts the former spare back in -- you'd have to resync again.

If each disk had a list of the uuids of every disk in the array,
you could tell from the disk table on the "freshest" disk that
the disk the user stuck back in isn't part of the array, despite
the fact that it claims to be. (It was at one point, and then
was removed.) You can then make the user add it back explicitly,
instead of just resyncing onto it. (A small sketch of this check
appears after this list.)

- Possibly the ability to setup multilevel arrays within a given piece of
metadata. As far as multilevel arrays go, there are two basic
approaches to the metadata:
- Integrated metadata defines all levels of the array in a single
chunk of metadata. So, for example, by reading metadata off of
sdb, you can figure out that it is a component of a RAID-1 array,
and that that RAID-1 array is a component of a RAID-10.

There are a couple of advantages to integrated metadata:
- You can keep state that applies to the whole array
(clean/dirty, for example) in one place.
- It helps in autoconfiguring an array, since you don't
have to go through multiple steps to find out all the
levels of an array. You just read the metadata from one
place on one disk, and you've got it.

There are a couple of disadvantages to integrated metadata:
- Possibly reduced/limited space for defining multiple
array levels or arrays with lots of disks. This is not a
problem, though, given sufficient metadata space.

- Marginally more difficulty handling metadata updates,
depending on how you handle your multilevel arrays. If
you handle them like md currently does (separate block
devices for each level and component of the array), it'll
be pretty difficult to use integrated metadata.

- Recursive metadata defines each level of the array separately.
So, for example, you'd read the metadata from the end of a disk
and determine it is part of a RAID-1 array. Then, you configure
the RAID-1 array, and read the metadata from the end of that
array, and determine it is part of a RAID-0 array. So then you
configure the RAID-0 array, look at the end, fail to find
metadata, and figure out that you've reached the top level of the
array.

This is almost how md currently does things, except that it
really has no mechanism for autoconfiguring multilevel arrays.

There are a couple of advantages to recursive metadata:
- It is easier to handle metadata updates for multilevel
arrays, especially if the various levels of the array are
handled by different block devices, as md does.

- You've potentially got more space for defining disks as
part of the array, since you're only defining one level
at a time.

There are a couple of disadvantages to recursive metadata:
- You have to have multiple copies of any state that
applies to the whole array (e.g. clean/dirty).

- More windows of opportunity for incomplete metadata
writes. Since metadata is in multiple places, there are
more opportunities for scenarios where you'll have
metadata for one part of the array written out, but not
another part before you crash or a disk crashes...etc.
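
To make the membership check described above concrete, here is a small
compilable sketch (the structure and field names are invented for
illustration, not taken from Neil's proposed layout): the freshest
superblock's device table, rather than the newcomer's own claim, decides
whether a disk is still a member.

/*
 * Sketch only: each superblock carries a table of the UUIDs of every
 * device currently in the array.  When a disk shows up with the right
 * set_uuid but an old events count, the freshest disk's table is
 * checked instead of trusting the newcomer's own claim.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MAX_DEVS 16

struct sketch_sb {
    uint32_t set_uuid[4];           /* array identity */
    uint64_t events;                /* freshness counter */
    uint32_t nr_devs;               /* devices currently in the array */
    uint32_t dev_uuid[MAX_DEVS][4]; /* UUID of every current member */
};

/* Is 'candidate_uuid' still a member according to the freshest
 * superblock read off any disk in the set? */
static int is_current_member(const struct sketch_sb *freshest,
                             const uint32_t candidate_uuid[4])
{
    uint32_t i;

    for (i = 0; i < freshest->nr_devs && i < MAX_DEVS; i++)
        if (memcmp(freshest->dev_uuid[i], candidate_uuid,
                   4 * sizeof(uint32_t)) == 0)
            return 1;
    return 0;   /* was removed at some point: require an explicit re-add */
}

int main(void)
{
    struct sketch_sb freshest = { .events = 42, .nr_devs = 2 };
    uint32_t stale_disk[4] = { 7, 7, 7, 7 };  /* pulled and re-inserted */

    freshest.dev_uuid[0][0] = 1;    /* current members are UUIDs 1.. and 2.. */
    freshest.dev_uuid[1][0] = 2;

    printf("stale disk: %s\n",
           is_current_member(&freshest, stale_disk)
           ? "still a member" : "not a member -- do not resync onto it");
    return 0;
}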

I know Neil has philosophical issues with autoconfiguration (or perhaps
in-kernel autoconfiguration), but it really is helpful, especially in
certain situations.

As for recursive versus integrated metadata, it would be nice if md could
handle autoconfiguration with either type of multilevel array. The reason
I say this is that Adaptec HostRAID adapters use integrated metadata.
So if you want to support multilevel arrays with md on HostRAID adapters,
you have to have support for multilevel arrays with integrated metadata.

When I did the first port of md to work on HostRAID, I pretty much had to
skip doing RAID-10 support because it wasn't structurally feasible to
autodetect and configure a multilevel array. (I ended up doing a full
rewrite of md that I was partially done with when I got laid off from
Adaptec.)

Anyway, if you want to see the Adaptec HostRAID support, which includes
metadata definitions:

http://people.freebsd.org/~ken/linux/md.html

The patches are against 2.4.18, but you should be able to get an idea of
what I'm talking about as far as integrated metadata goes.

This is all IMO, maybe it'll be helpful, maybe not, but hopefully it'll be
useful to consider these ideas.

Ken
--
Kenneth Merry
[email protected]

2002-11-22 07:04:21

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Tue, 2002-11-19 at 20:09, Neil Brown wrote:
> My current design looks like:
> /* constant array information - 128 bytes */
> u32 md_magic
> u32 major_version == 1
> u32 feature_map /* bit map of extra features in superblock */
> u32 set_uuid[4]
> u32 ctime
> u32 level
> u32 layout
> u64 size /* size of component devices, if they are all
> * required to be the same (Raid 1/5 */
Can you make 64 bit fields 64 bit aligned? I think PPC will lay this
structure out with padding before size, which may well cause confusion.
If your routines to load and save the header don't depend on structure
layout, then it doesn't matter.
J
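
As an illustration of the alignment concern (this sketch is not from the
mail; the field names are stand-ins for part of the proposed layout),
compare a naturally padded structure with one that keeps 64-bit fields
on 8-byte boundaries:

/*
 * On an ABI that gives u64 8-byte alignment the compiler pads 'size'
 * out to offset 16; on one that only requires 4-byte alignment it sits
 * at offset 12, so the two ends would disagree about the on-disk layout.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct sb_natural {         /* relies on whatever the compiler decides */
    uint32_t level;
    uint32_t layout;
    uint32_t chunksize;     /* 12 bytes so far: 'size' may get padded */
    uint64_t size;
};

struct sb_explicit {        /* 64-bit fields kept on 8-byte boundaries */
    uint32_t level;
    uint32_t layout;
    uint32_t chunksize;
    uint32_t pad0;          /* explicit pad: same layout everywhere */
    uint64_t size;
};

int main(void)
{
    printf("natural:  sizeof=%zu offsetof(size)=%zu\n",
           sizeof(struct sb_natural), offsetof(struct sb_natural, size));
    printf("explicit: sizeof=%zu offsetof(size)=%zu\n",
           sizeof(struct sb_explicit), offsetof(struct sb_explicit, size));
    return 0;
}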

2002-11-22 10:06:06

by Joe Thornber

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Wed, Nov 20, 2002 at 08:03:00AM -0800, Joel Becker wrote:
> Hmm, what is the intended future interaction of DM and MD? Two
> ways at the same problem? Just curious.


There are a couple of good arguments for moving the _mapping_ code
from md into dm targets:

1) Building a mirror is essentially just copying large amounts of data
around, exactly what is needed to implement move functionality for
arbitrarily remapping volumes. (see
http://people.sistina.com/~thornber/pvmove_outline.txt).

So I've always had every intention of implementing a mirror target
for dm.

2) Extending raid 5 volumes becomes very simple if they are dm targets
since you just add another segment, this new segment could even
have different numbers of stripes. eg,


old volume new volume
+--------------------+ +--------------------+--------------------+
| raid5 across 3 LVs | => | raid5 across 3 LVs | raid5 across 5 LVs |
+--------------------+ +--------------------+--------------------+

Of course this could be done ATM by stacking 'bottom LVs' -> mds ->
'top LV', but that does create more intermediate devices and
sacrifices space to the md metadata (eg, LVM has its own metadata
and doesn't need md to duplicate it).

MD would continue to exist as a separate driver; it needs to read and
write the md metadata at the beginning of the physical volumes, and
implement all the nice recovery/hot spare features. i.e. dm does the
mapping, md implements the policies by driving dm appropriately. If
other volume managers such as EVMS or LVM want to implement features
not provided by md, they are free to drive dm directly.

I don't think it's a huge amount of work to refactor the md code, and
now might be the right time if Neil is already changing things. I
would be more than happy to work on this if Neil agrees.

- Joe

2002-12-02 21:31:28

by NeilBrown

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Friday November 22, [email protected] wrote:
> On Wed, Nov 20, 2002 at 08:03:00AM -0800, Joel Becker wrote:
> > Hmm, what is the intended future interaction of DM and MD? Two
> > ways at the same problem? Just curious.
>
>
> There are a couple of good arguments for moving the _mapping_ code
> from md into dm targets:
>
> 1) Building a mirror is essentially just copying large amounts of data
> around, exactly what is needed to implement move functionality for
> arbitrarily remapping volumes. (see
> http://people.sistina.com/~thornber/pvmove_outline.txt).

Building a mirror may be just moving data around. But the interesting
issues in raid1 are more about maintaining a mirror: read balancing,
retry on error, hot spares, etc.

>
> So I've always had every intention of implementing a mirror target
> for dm.
>
> 2) Extending raid 5 volumes becomes very simple if they are dm targets
> since you just add another segment, this new segment could even
> have different numbers of stripes. eg,
>
>
> old volume new volume
> +--------------------+ +--------------------+--------------------+
> | raid5 across 3 LVs | => | raid5 across 3 LVs | raid5 across 5 LVs |
> +--------------------+ +--------------------+--------------------+
>
> Of course this could be done ATM by stacking 'bottom LVs' -> mds ->
> 'top LV', but that does create more intermediate devices and
> sacrifices space to the md metadata (eg, LVM has its own metadata
> and doesn't need md to duplicate it).

But is this something that you would *want* to do???

To my mind, the raid1/raid5 almost always lives below any LVM or
partitioning scheme. You use raid1/raid5 to combine drives (real,
physical drives) into virtual drives that are more reliable, and then
you partition them or whatever you want to do. raid1 and raid5 on top
of LVM bits just doesn't make sense to me.

I say 'almost' above because there is one situation where something
else makes sense. That is when you have a small number of drives in a
machine (3 to 5) and you really want RAID5 for all of these, but
booting only really works for RAID1. So you partition the drives, use
RAID1 for the first partitions, and RAID5 for the rest:
put boot (or maybe root) on the RAID1 bit and all your interesting data
on the RAID5 bit.

[[ I just had this really sick idea of creating a raid level that did
data duplication (aka mirroring) for the first N stripes, and
stripe/parity (aka raid5) for the remaining stripes. Then you just
combine your set of drives together with this level, and depending on
your choice of N, you get all raid1, all raid5, or a mixture which
allows booting off the early sectors....]]

>
> MD would continue to exist as a separate driver; it needs to read and
> write the md metadata at the beginning of the physical volumes, and
> implement all the nice recovery/hot spare features. i.e. dm does the
> mapping, md implements the policies by driving dm appropriately. If
> other volume managers such as EVMS or LVM want to implement features
> not provided by md, they are free to drive dm directly.
>
> I don't think it's a huge amount of work to refactor the md code, and
> now might be the right time if Neil is already changing things. I
> would be more than happy to work on this if Neil agrees.

I would probably need a more concrete proposal before I had something
to agree with :-)

I really think the raid1/raid5 parts of MD are distinctly different
from DM, and they should remain separate. However I am quite happy to
improve the interfaces so that seamless connections can be presented
by user-space tools.

For example, md currently gets its 'super-block' information by
reading the device directly. Soon it will have two separate routines
that get the super-block info, one for each super-block format. I
would be quite happy for there to be a way for DM to give a device to
MD along with some routine that provided super-block info by getting
it out of some near-by LVM superblock rather than out of the device
itself.
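
A rough sketch of the shape of that idea (not the actual 2.5 md code;
the types and names below are invented): each superblock format
supplies a loader, and a volume manager could register one more loader
that answers the same questions from its own metadata.

/*
 * Each superblock format supplies a loader; a volume manager could
 * register one more loader that pulls the same information out of its
 * own metadata instead of reading an md superblock off the device.
 */
#include <stdint.h>
#include <stdio.h>

struct sb_info {            /* what md actually needs to know */
    int      level;
    int      raid_disks;
    uint64_t size;
    uint64_t events;
};

struct sb_handler {
    const char *name;
    /* fill in 'info' for component device 'dev'; return 0 on success */
    int (*load)(const char *dev, struct sb_info *info);
};

static int load_v0_90(const char *dev, struct sb_info *info)
{
    /* would read the 0.90.0 superblock from near the end of 'dev' */
    (void)dev;
    info->level = 1;
    info->raid_disks = 2;
    info->size = 0;
    info->events = 1;
    return 0;
}

static int load_from_lvm(const char *dev, struct sb_info *info)
{
    /* would be supplied by the volume manager: same answers, taken
     * from nearby LVM metadata rather than an md superblock on 'dev' */
    (void)dev;
    (void)info;
    return -1;  /* not implemented in this sketch */
}

static struct sb_handler handlers[] = {
    { "0.90.0", load_v0_90    },
    { "lvm",    load_from_lvm },
};

int main(void)
{
    struct sb_info info;

    if (handlers[0].load("/dev/sda1", &info) == 0)
        printf("%s: level=%d disks=%d\n",
               handlers[0].name, info.level, info.raid_disks);
    return 0;
}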

Similarly, if an API could be designed for MD to provide higher levels
with access to the spare parts of its superblock, e.g. for partition
table information, then that might make sense.

To summarise: If you want tighter integration between MD and DM, do it
by defining useful interfaces, not by trying to tie them both together
into one big lump.

NeilBrown

2002-12-03 08:17:08

by Luca Berra

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Tue, Dec 03, 2002 at 08:38:25AM +1100, Neil Brown wrote:
>> 1) Building a mirror is essentially just copying large amounts of data
>> around, exactly what is needed to implement move functionality for
>> arbitrarily remapping volumes. (see
>> http://people.sistina.com/~thornber/pvmove_outline.txt).
>
>Building a mirror may be just moving data around. But the interesting
>issues in raid1 are more about maintaining a mirror: read balancing,
>retry on error, hot spares, etc.

True, that's why LVM (dm) should use md for the raid work.

>>
>> 2) Extending raid 5 volumes becomes very simple if they are dm targets
>> since you just add another segment, this new segment could even
>> have different numbers of stripes. eg,
>>
>But is this something that you would *want* to do???
>
>To my mind, the raid1/raid5 almost always lives below any LVM or
>partitioning scheme. You use raid1/raid5 to combine drives (real,
>physical drives) into virtual drives that are more reliable, and then
>you partition them or whatever you want to do. raid1 and raid5 on top
>of LVM bits just doesn't make sense to me.
Well, to me it does:
- you might want to split a mirror of a portion of data for backup
purposes (when snapshots won't do) or for safety before attempting a
risky operation.
- you might also want to have different raid strategies for different
data. Think of a medium-sized storage system with Oracle: you might want
to do a fast mirror for online redo logs(1) and raid5 for datafiles.(2)
- you might want to add mirroring after having put data on your disks
and the current way to do it with MD on partitions is complex; with
LVM over MD it is really hard to do right.
- stackable devices are harder to maintain, a single interface to deal
with mirroring and volume management would be easier.
- we won't have any more problems with 'switching cache buffer size' :))))

(1) Yes, I know they are mirrored by Oracle, but having a fs unavailable
due to disk failure is a pita anyway.
(2) A DBA will tell you to use different disks, but I never found
anyone willing to use 4 73GB disks for redo logs.


>[[ I just had this really sick idea of creating a raid level that did
>data duplication (aka mirroring) for the first N stripes, and
I had another sick idea of teaching lilo how to do raid5, but it won't
fit in 512b. Anyway, for the normal MD-on-partitions case, creating one
n-way raid1 for /boot and raid5 for the rest is enough.

>I really think the raid1/raid5 parts of MD are distinctly different
>from DM, and they should remain separate. However I am quite happy to
>improve the interfaces so that seamless connections can be presented
>by user-space tools.

Reading this, it looks like the only way dm could get raid is
reimplementing or duplicating code from existing md, thus duplicating
code in the kernel.

>To summarise: If you want tighter integration between MD and DM, do it
>by defining useful interfaces, not by trying to tie them both together
>into one big lump.

We can think of md as split into these major areas:
1. the superblock interface, which I believe we all agree should go to
   user mode for all the array setup functions, keeping the portion for
   updating the superblock in kernel space
2. the raid logic code
3. the interface to the lower block device layer
4. the interface to the upper block device layer
(in md these three are tightly coupled)

Some of these areas overlap with dm, and it could be possible to merge
the duplicated functionality.

Having said that, and having looked 'briefly' at the code, I believe that
doing something like this would mean completely reworking the logic
behind md, and adding some major parts to dm, or better to a separate
module.

In my idea we will have:
- a core that handles request mapping
- metadata plugins for both the md superblock format and LVM metadata
  (these would deal with keeping the metadata current with the array's
  current status)
- layout plugins for raid?, striping, linear, multipath (does this belong
  here or at a different level?)
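
A hypothetical sketch of that split (none of these interfaces exist in
md or dm; all names are invented): a core that only glues one metadata
plugin and one layout plugin together per logical volume.

#include <stdint.h>
#include <stdio.h>

struct mapped_request {
    uint64_t sector;
    uint32_t nr_sectors;
    int      write;
};

/* keeps on-disk metadata (md superblock, LVM metadata, ...) current */
struct metadata_plugin {
    const char *name;
    int (*write_state)(void *ctx);      /* events, clean/dirty, ... */
};

/* pure mapping/redundancy logic: raid1, raid5, striping, linear, ... */
struct layout_plugin {
    const char *name;
    int (*map)(void *ctx, struct mapped_request *req);
};

struct volume {
    struct metadata_plugin *meta;   void *meta_ctx;
    struct layout_plugin   *layout; void *layout_ctx;
};

static int noop_write_state(void *ctx) { (void)ctx; return 0; }

/* trivial 'linear' layout: shift every request by a fixed start offset */
static int linear_map(void *ctx, struct mapped_request *req)
{
    req->sector += *(const uint64_t *)ctx;
    return 0;
}

int main(void)
{
    uint64_t start = 2048;
    struct metadata_plugin meta   = { "md-superblock", noop_write_state };
    struct layout_plugin   linear = { "linear",        linear_map };
    struct volume v = { &meta, NULL, &linear, &start };
    struct mapped_request req = { 10, 8, 0 };

    v.layout->map(v.layout_ctx, &req);  /* the core would do this per request */
    v.meta->write_state(v.meta_ctx);    /* ... and this when state changes */
    printf("sector 10 mapped to %llu by %s/%s\n",
           (unsigned long long)req.sector, v.layout->name, v.meta->name);
    return 0;
}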

L.

--
Luca Berra -- [email protected]
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \

2002-12-09 00:16:31

by NeilBrown

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver


( sorry for the delay in replying, I had a week off, and then a week
catching up...)

On Wednesday November 20, [email protected] wrote:
> The only application where having a RAID volume shareable between two
> hosts is useful is for a clustering filesystem (GFS comes to mind).
> Since RAID is an important need for GFS (if a disk node fails, you
> don't want to lose the entire filesystem as you would on GFS) this
> possibility may be worth exploring.
>
> Since GFS isn't GPL at this point and OpenGFS needs a lot of work, I've
> not spent the time looking at it.
>
> Neil have you thought about sharing an active volume between two hosts
> and what sort of support would be needed in the superblock?
>

I think that the only way shared access could work is if different
hosts controlled different slices of the device. The hosts would have
to somehow negotiate and record who was managing which bit. It is
quite appropriate that this information be stored on the raid array,
and quite possibly in a superblock. But I think that this is a
sufficiently major departure from how md/raid normally works that I
would want it to go in a secondary superblock.
There is 60K free at the end of each device on an MD array. Whoever
was implementing this scheme could just have a flag in the main
superblock to say "there is a secondary superblock" and then read the
info about who owns what from somewhere in that extra 60K.

So in short, I think the metadata needed for this sort of thing is
sufficiently large and sufficiently unknown that I wouldn't make any
allowance for it in the primary superblock.

Does that sound reasonable?

NeilBrown

2002-12-09 03:45:27

by NeilBrown

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Thursday November 21, [email protected] wrote:
>
> This is a good idea. Having all of the devices listed in the metadata on
> each disk is very helpful. (See below for why.)
>
> Here are some of my ideas about the features you'll want out of a new type
> of metadata:
...
>
> [ these are features I think would be good to have ]
>
> - Per-array state that lets you know whether you're doing a resync,
> reconstruction, verify, verify and fix, and so on. This is part of the
> state you'll need to do checkpointing -- picking up where you left off
> after a reboot during the middle of an operation.
>

Yes, a couple of flags in the 'state' field could do this.

> - Per-array block number that tells you how far along you are in a verify,
> resync, reconstruction, etc. If you reboot, you can, for example, pick
> a verify back up where you left off.

Got that, called "resync-position" (though I guess I have to change
the hyphen...).
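
A minimal sketch of how the field could be used ('state' and
'resync_position' are from the proposed layout; the SB_STATE_RESYNC_VALID
bit is an invented name for the "flag in state if this is valid"
mentioned there):

#include <stdint.h>
#include <stdio.h>

#define SB_STATE_RESYNC_VALID  (1u << 1)    /* hypothetical bit in 'state' */

struct sb_state {
    uint32_t state;
    uint64_t resync_position;   /* sectors already resynced */
};

static uint64_t resync_start(const struct sb_state *sb)
{
    /* resume where the previous run left off, if the position is valid */
    if (sb->state & SB_STATE_RESYNC_VALID)
        return sb->resync_position;
    return 0;                   /* otherwise start from the beginning */
}

int main(void)
{
    struct sb_state sb = {
        .state = SB_STATE_RESYNC_VALID,
        .resync_position = 1048576,
    };

    printf("resync resumes at sector %llu\n",
           (unsigned long long)resync_start(&sb));
    return 0;
}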

>
> - Enough per-disk state so you can determine, if you're doing a resync or
> reconstruction, which disk is the target of the operation. When I was
> doing a lot of work on md a while back, one of the things I ran into is
> that when you do a resync of a RAID-1, it always resyncs from the first
> to the second disk, even if the first disk is the one out of sync. (I
> changed this, with Adaptec metadata at least, so it would resync onto
> the correct disk.)

When a raid1 array is out of sync, it doesn't mean anything to say
which disc is out of sync. They all are, with each other...
Nonetheless, the per-device stateflags have an 'in-sync' bit which can
be set or cleared as appropriate.

>
> - Each component knows about every other component in the array. (It
> knows by UUID, not just that there are N other devices in the array.)
> This is an important piece of information:
> - You can compose the array now, using each disk's set_uuid and the
> position of the device in the array, and by using the events
> field to filter out the older of two disks that claim the same
> position.
>
> The problem comes in more complicated scenarios. For example:
> - user pulls one disk out of a RAID-1 with a spare
> - md reconstructs onto the spare
> - user shuts down machine, pulls the (former) spare that is
> now part of the machine, and replaces the disk that he
> originally pulled.
>
> So now you've got a scenario where you have a disk that claims to
> be part of the array (same set_uuid), but its events field is a
> little behind. You could just resync the disk since it is out of
> date, but still claims to be part of the array. But you'd be
> back in the same position if the user pulls the disk again and
> puts the former spare back in -- you'd have to resync again.
>
> If each disk had a list of the uuids of every disk in the array,
> you could tell from the disk table on the "freshest" disk that
> the disk the user stuck back in isn't part of the array, despite
> the fact that it claims to be. (It was at one point, and then
> was removed.) You can then make the user add it back explicitly,
> instead of just resyncing onto it.

The event counter is enough to determine if a device is really part of
the current array or not, and I cannot see why you need more than
that.
As far as I can tell, everything that you have said can be achieved
with setuuid/devnumber/event.

>
> - Possibly the ability to setup multilevel arrays within a given piece of
> metadata. As far as multilevel arrays go, there are two basic
> approaches to the metadata:

How many actual uses of multi-level arrays are there??

The most common one is raid0 over raid1, and I think there is a strong
case for implementing a 'raid10' module that does that, but also
allows a raid10 of an odd number of drives and things like that.

I don't think anything else is sufficiently common to really deserve
special treatment: recursive metadata is adequate I think.

Concerning the auto-assembly of multi-level arrays, that is not
particularly difficult, it just needs to be described precisely, and
coded.
It is a user-space thing and easily solved at that level.

>
> I know Neil has philosophical issues with autoconfiguration (or perhaps
> in-kernel autoconfiguration), but it really is helpful, especially in
> certain situations.

I have issues with autoconfiguration that is not adequately
configurable, and current linux in-kernel autoconfiguration is not
adequately configurable. With mdadm autoconfiguration is (very
nearly) adequately configurable and is fine. There is still room for
some improvement, but not much.

Thanks for your input,
NeilBrown

2002-12-10 06:21:49

by Kenneth D. Merry

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Mon, Dec 09, 2002 at 14:52:11 +1100, Neil Brown wrote:
> > - Enough per-disk state so you can determine, if you're doing a resync or
> > reconstruction, which disk is the target of the operation. When I was
> > doing a lot of work on md a while back, one of the things I ran into is
> > that when you do a resync of a RAID-1, it always resyncs from the first
> > to the second disk, even if the first disk is the one out of sync. (I
> > changed this, with Adaptec metadata at least, so it would resync onto
> > the correct disk.)
>
> When a raid1 array is out of sync, it doesn't mean anything to say
> which disc is out of sync. They all are, with each other...
> Nonetheless, the per-device stateflags have an 'in-sync' bit which can
> be set or cleared as appropriate.

This sort of information (if it is used) would be very useful for dealing
with Adaptec metadata. Adaptec HostRAID adapters let you build a RAID-1 by
copying one disk onto the other, with the state set to indicate the source
and target disks.

Since the BIOS on those adapters takes a long time to do a copy, it's
easier to break out of the build after it gets started, and let the kernel
pick back up where the BIOS left off. To do that, you need checkpointing
support (i.e. be able to figure out where we left off with a particular
operation) and you need to be able to determine which disk is the source
and which is the target.

To do this with the first set of Adaptec metadata patches I wrote for
md, I had to kinda "bolt on" some extra state in the kernel, so I could
figure out which disk was the target, since md doesn't really pay attention
to the current per-disk in sync flags.

I solved this the second time around by making the in-core metadata generic
(and thus a superset of all the metadata types I planned on supporting),
and each metadata personality could supply target disk information if
possible.
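
Roughly the shape of that approach (a sketch, not Ken's actual patch;
all names are invented): the in-core description is a superset of what
the different on-disk formats can express, and each metadata personality
fills in what it knows.

#include <stdio.h>

#define TARGET_UNKNOWN  (-1)

struct incore_array {
    int level;
    int nr_disks;
    int rebuild_target;     /* disk index being rebuilt, or TARGET_UNKNOWN */
    int checkpoint_valid;
    unsigned long long checkpoint;  /* how far the BIOS/kernel got */
};

struct metadata_personality {
    const char *name;
    /* translate this format's on-disk state into the generic form */
    void (*fill)(struct incore_array *a);
};

static void fill_hostraid(struct incore_array *a)
{
    a->rebuild_target = 1;          /* this format records the target disk */
    a->checkpoint_valid = 1;
    a->checkpoint = 123456;
}

static void fill_md_0_90(struct incore_array *a)
{
    a->rebuild_target = TARGET_UNKNOWN;  /* 0.90.0 does not record it */
    a->checkpoint_valid = 0;
}

int main(void)
{
    struct metadata_personality p[] = {
        { "hostraid", fill_hostraid },
        { "0.90.0",   fill_md_0_90  },
    };
    struct incore_array a = { .level = 1, .nr_disks = 2 };
    int i;

    for (i = 0; i < 2; i++) {
        p[i].fill(&a);
        printf("%-8s: rebuild target %d\n", p[i].name, a.rebuild_target);
    }
    return 0;
}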

> > - Each component knows about every other component in the array. (It
> > knows by UUID, not just that there are N other devices in the array.)
> > This is an important piece of information:
> > - You can compose the array now, using each disk's set_uuid and the
> > position of the device in the array, and by using the events
> > field to filter out the older of two disks that claim the same
> > position.
> >
> > The problem comes in more complicated scenarios. For example:
> > - user pulls one disk out of a RAID-1 with a spare
> > - md reconstructs onto the spare
> > - user shuts down machine, pulls the (former) spare that is
> > now part of the machine, and replaces the disk that he
> > originally pulled.
> >
> > So now you've got a scenario where you have a disk that claims to
> > be part of the array (same set_uuid), but its events field is a
> > little behind. You could just resync the disk since it is out of
> > date, but still claims to be part of the array. But you'd be
> > back in the same position if the user pulls the disk again and
> > puts the former spare back in -- you'd have to resync again.
> >
> > If each disk had a list of the uuids of every disk in the array,
> > you could tell from the disk table on the "freshest" disk that
> > the disk the user stuck back in isn't part of the array, despite
> > the fact that it claims to be. (It was at one point, and then
> > was removed.) You can then make the user add it back explicitly,
> > instead of just resyncing onto it.
>
> The event counter is enough to determine if a device is really part of
> the current array or not, and I cannot see why you need more than
> that.
> As far as I can tell, everything that you have said can be achieved
> with setuuid/devnumber/event.

It'll work with just the setuuid/devnumber/event, but as I mentioned in the
last paragraph above, you'll end up resyncing onto the disk that is pulled
and then reinserted, because you don't really have any way of knowing it is
no longer a part of the array. All you know is that it is out of date.

> > - Possibly the ability to setup multilevel arrays within a given piece of
> > metadata. As far as multilevel arrays go, there are two basic
> > approaches to the metadata:
>
> How many actual uses of multi-level arrays are there??
>
> The most common one is raid0 over raid1, and I think there is a strong
> case for implementing a 'raid10' module that does that, but also
> allows a raid10 of an odd number of drives and things like that.

RAID-10 is the most common, but RAID-50 is found in the "wild" as well.

It would be more flexible if you could stack personalities on top of each
other. This would give people the option of combining whatever
personalities they want (within reason; the multipath personality doesn't
make a whole lot of sense to stack).

> I don't think anything else is sufficiently common to really deserve
> special treatment: recursive metadata is adequate I think.

Recursive metadata is fine, but I would encourage you to think about how
you would (structurally) support multilevel arrays that use integrated
metadata. (e.g. like RAID-10 on an Adaptec HostRAID board)

> Concerning the auto-assembly of multi-level arrays, that is not
> particularly difficult, it just needs to be described precisely, and
> coded.
> It is a user-space thing and easily solved at that level.

How does it work if you're trying to boot off the array? The kernel needs
to know how to auto-assemble the array in order to run init and everything
else that makes a userland program run.

> >
> > I know Neil has philosophical issues with autoconfiguration (or perhaps
> > in-kernel autoconfiguration), but it really is helpful, especially in
> > certain situations.
>
> I have issues with autoconfiguration that is not adequately
> configurable, and current linux in-kernel autoconfiguration is not
> adequately configurable. With mdadm autoconfiguration is (very
> nearly) adequately configurable and is fine. There is still room for
> some improvement, but not much.

I agree that userland configuration is very flexible, but I think there is
a place for kernel-level autoconfiguration as well. With something like an
Adaptec HostRAID board (i.e. something you can boot from), you need kernel
level autoconfiguration in order for it to work smoothly.

Ken
--
Kenneth Merry
[email protected]

2002-12-10 23:59:50

by NeilBrown

[permalink] [raw]
Subject: Re: RFC - new raid superblock layout for md driver

On Monday December 9, [email protected] wrote:
> On Mon, Dec 09, 2002 at 14:52:11 +1100, Neil Brown wrote:
> > > - Enough per-disk state so you can determine, if you're doing a resync or
> > > reconstruction, which disk is the target of the operation. When I was
> > > doing a lot of work on md a while back, one of the things I ran into is
> > > that when you do a resync of a RAID-1, it always resyncs from the first
> > > to the second disk, even if the first disk is the one out of sync. (I
> > > changed this, with Adaptec metadata at least, so it would resync onto
> > > the correct disk.)
> >
> > When a raid1 array is out of sync, it doesn't mean anything to say
> > which disc is out of sync. They all are, with each other...
> > Nonetheless, the per-device stateflags have an 'in-sync' bit which can
> > be set or cleared as appropriate.
>
> This sort of information (if it is used) would be very useful for dealing
> with Adaptec metadata. Adaptec HostRAID adapters let you build a RAID-1 by
> copying one disk onto the other, with the state set to indicate the source
> and target disks.
>
> Since the BIOS on those adapters takes a long time to do a copy, it's
> easier to break out of the build after it gets started, and let the kernel
> pick back up where the BIOS left off. To do that, you need checkpointing
> support (i.e. be able to figure out where we left off with a particular
> operation) and you need to be able to determine which disk is the source
> and which is the target.
>
> To do this with the first set of Adaptec metadata patches I wrote for
> md, I had to kinda "bolt on" some extra state in the kernel, so I could
> figure out which disk was the target, since md doesn't really pay attention
> to the current per-disk in sync flags.

The way to solve this that would be most in-keeping with the raid code
in 2.4 would be for the drives that were not yet in-sync to appear as
'spare' drives. On array assembly, the first spare would get rebuilt
by md and then fully incorporated into the array. I agree that this is
not a very good conceptual fit.

The 2.5 code is a lot tidier with respect to this. Each device has an
'in-sync' flag so when an array has a missing drive, a spare is added
and marked not-in-sync. When recovery finishes, the drive that was
spare has the in-sync flag set.

2.4 code has an in-sync flag too, but it is not used sensibly.

Note that this relates to a drive being out-of-sync (as in a
reconstruction or recovery operation). It is quite different to the
array being out-of-sync which requires a resync operation.


> > >
> > > If each disk had a list of the uuids of every disk in the array,
> > > you could tell from the disk table on the "freshest" disk that
> > > the disk the user stuck back in isn't part of the array, despite
> > > the fact that it claims to be. (It was at one point, and then
> > > was removed.) You can then make the user add it back explicitly,
> > > instead of just resyncing onto it.
> >
> > The event counter is enough to determine if a device is really part of
> > the current array or not, and I cannot see why you need more than
> > that.
> > As far as I can tell, everything that you have said can be achieved
> > with setuid/devnumber/event.
>
> It'll work with just the setuuid/devnumber/event, but as I mentioned in the
> last paragraph above, you'll end up resyncing onto the disk that is pulled
> and then reinserted, because you don't really have any way of knowing it is
> no longer a part of the array. All you know is that it is out of
> date.

If you pull drive N, then it will appear to fail and all other drives
will be marked to say that 'drive N is faulty'.
If you plug drive N back in, the md code simply won't notice.
If you tell it to 'hot-add' the drive, it will rebuild onto it, but
that is what you asked to do.
If you shut down and restart, the auto-detection may well find drive
N, but even if it's event number is sufficiently recent (which would
require an unclean shutdown of the array), the fact that the most
recent superblocks will say that drive N is failed will mean that it
doesn't get incorporated into the array. You still have to explicitly
hot-add it before it will resync.

I still don't see the problem, sorry.
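
As a compile-and-run illustration of that rule (a sketch with simplified
fields, not the md assembly code): the freshest superblock in the set is
believed, and a slot it marks faulty is not auto-incorporated.

#include <stdint.h>
#include <stdio.h>

#define MAX_DEVS     8
#define FLAG_FAULTY  0x1

struct dev_sb {
    int      present;           /* did we find this device at all? */
    uint64_t events;
    uint16_t flags[MAX_DEVS];   /* per-slot flags as this sb sees them */
};

static int incorporate(const struct dev_sb sbs[], int n, int slot)
{
    int i, freshest = -1;

    for (i = 0; i < n; i++)
        if (sbs[i].present &&
            (freshest < 0 || sbs[i].events > sbs[freshest].events))
            freshest = i;

    if (freshest < 0 || !sbs[slot].present)
        return 0;
    return !(sbs[freshest].flags[slot] & FLAG_FAULTY);
}

int main(void)
{
    struct dev_sb sbs[2] = {
        { .present = 1, .events = 100 },    /* drive that was pulled */
        { .present = 1, .events = 120 },    /* freshest superblock */
    };

    sbs[1].flags[0] = FLAG_FAULTY;  /* freshest sb says slot 0 failed */

    printf("slot 0: %s\n", incorporate(sbs, 2, 0)
           ? "auto-incorporated" : "needs an explicit hot-add");
    return 0;
}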

>
> > > - Possibly the ability to setup multilevel arrays within a given piece of
> > > metadata. As far as multilevel arrays go, there are two basic
> > > approaches to the metadata:
> >
> > How many actual uses of multi-level arrays are there??
> >
> > The most common one is raid0 over raid1, and I think there is a strong
> > case for implementing a 'raid10' module that does that, but also
> > allows a raid10 of an odd number of drives and things like that.
>
> RAID-10 is the most common, but RAID-50 is found in the "wild" as well.
>
> It would be more flexible if you could stack personalities on top of each
> other. This would give people the option of combining whatever
> personalities they want (within reason; the multipath personality doesn't
> make a whole lot of sense to stack).
>
> > I don't think anything else is sufficiently common to really deserve
> > special treatment: recursive metadata is adequate I think.
>
> Recursive metadata is fine, but I would encourage you to think about how
> you would (structurally) support multilevel arrays that use integrated
> metadata. (e.g. like RAID-10 on an Adaptec HostRAID board)

How about this:
Option 1:
Assertion: The only sensible raid stacks involve two levels: A
level that provides redundancy (Raid1/raid5) on the bottom, and a
level that combines capacity on the top (raid0/linear).

Observation: in-kernel knowledge of superblock is only needed for
levels that provide redundancy (raid1/raid5) and so need to
update the superblock after errors, etc. raid0/linear can
be managed fine without any in-kernel knowledge of superblocks.

Approach:
Teach the kernel to read your adaptec raid10 superblock and
present it to md as N separate raid1 arrays.
Have a user-space tool that assembles the array as follows:
1/ read the superblocks
2/ build the raid1 arrays
3/ build the raid0 on top using non-persistent superblocks.

There may need to be small changes to the md code to make this work
properly, but I feel that it is a good approach.

Option 2:
Possibly you disagree with the above assertion. Possibly you
think that a raid5 built from a number of raid1's is a good idea.
And maybe you are right.

Approach:
Add an ioctl, or possibly a 'magic' address, so that it is
possible to read a raw superblock from an md array.
Define two in-kernel superblock reading methods. One reads the
superblock and presents it as the bottom level only. The other
reads the raw superblock out of the underlying device, using the
ioctl or magic address (e.g. read from MAX_SECTOR-8) and
presents it as the next level of the raid stack.

I think this approach, possibly with some refinement, would be
adequate to support any sort of stacking and any sort of raid
superblock, and it would be my preferred way to go, if this were
necessary.

>
> > Concerning the auto-assembly of multi-level arrays, that is not
> > particularly difficult, it just needs to be described precisely, and
> > coded.
> > It is a user-space thing and easily solved at that level.
>
> How does it work if you're trying to boot off the array? The kernel needs
> to know how to auto-assemble the array in order to run init and everything
> else that makes a userland program run.

initramdisk or initramfs or whatever is appropriate for the kernel you
are using.

Also, remember to keep the concepts of 'boot' and 'root' distinct.

To boot off an array, your BIOS needs to know about the array. There
are no two ways about that. It doesn't need to know a lot about the
array, and for raid1 all it needs to know is 'try this device, and if
it fails, try that device'.

To have root on an array, you need to be able to assemble the array
before root is mounted. md= kernel parameters are one option, but not
a very extensible one.
initramfs will be the preferred approach in 2.6. i.e. an initial root
is loaded along with the kernel, and it has the user-space tools for
finding, assembling and mounting the root device.

>
> > >
> > > I know Neil has philosophical issues with autoconfiguration (or perhaps
> > > in-kernel autoconfiguration), but it really is helpful, especially in
> > > certain situations.
> >
> > I have issues with autoconfiguration that is not adequately
> > configurable, and current linux in-kernel autoconfiguration is not
> > adequately configurable. With mdadm autoconfiguration is (very
> > nearly) adequately configurable and is fine. There is still room for
> > some improvement, but not much.
>
> I agree that userland configuration is very flexible, but I think there is
> a place for kernel-level autoconfiguration as well. With something like an
> Adaptec HostRAID board (i.e. something you can boot from), you need kernel
> level autoconfiguration in order for it to work smoothly.

I disagree, and the development directions of 2.5 tend to support me.
You certainly need something before root is mounted, but 2.5 is
leading us to 'early-user-space configuration' rather than 'in-kernel
configuration'.

NeilBrown