2004-03-24 02:27:09

by NeilBrown

Subject: Re: "Enhanced" MD code avaible for review

On Monday March 22, [email protected] wrote:
> >> o Any successful solution will have to have "meta-data modules" for
> >> active arrays "core resident" in order to be robust. This
>
> ...
>
> > I agree.
> > 'Linear' and 'raid0' arrays don't really need metadata support in the
> > kernel as their metadata is essentially read-only.
> > There are interesting applications for raid1 without metadata, but I
> > think that for all raid personalities where metadata might need to be
> > updated in an error condition to preserve data integrity, the kernel
> > should know enough about the metadata to perform that update.
> >
> > It would be nice to keep the in-kernel knowledge to a minimum, though
> > some metadata formats probably make that hard.
>
> Can you further explain why you want to limit the kernel's knowledge
> and where you would separate the roles between kernel and userland?

General caution.
It is generally harder to change mistakes in the kernel than it is to
change mistakes in userspace, and similarly it is easier to add
functionality and configurability in userspace. A design that puts
the control in userspace is therefore preferred. A design that ties
you to working through a narrow user-kernel interface is disliked.
A design that gives easy control to user-space, and allows the kernel
to do simple things simply is probably best.

I'm not particularly concerned with code size and code duplication. A
clean, expressive design is paramount.

> 2) Solution Complexity
>
> Two entities understand how to read and manipulate the meta-data.
> Policies and APIs must be created to ensure that only one entity
> is performing operations on the meta-data at a time. This is true
> even if one entity is primarily a read-only "client". For example,
> a meta-data module may defer meta-data updates in some instances
> (e.g. rebuild checkpointing) until the meta-data is closed (writing
> the checkpoint sooner doesn't make sense considering that you should
> restart your scrub, rebuild or verify if the system is not safely
> shut down). How does the userland client get the most up-to-date
> information? This is just one of the problems in this area.

If the kernel and userspace both need to know about metadata, then the
design must make clear how they communicate.

>
> > Currently partitions are (sufficiently) needs-driven. It is true that
> > any partitionable device has its partitions presented. However the
> > existence of partitions does not affect access to the whole device at
> > all. Only once the partitions are claimed is the whole-device
> > blocked.
>
> This seems a slight digression from your earlier argument. Is your
> concern that the arrays are auto-enumerated, or that the act of enumerating
> them prevents the component devices from being accessed (due to
> bd_claim)?

Primarily the latter. But also that the act of enumerating them may
cause an update to an underlying device (e.g. metadata update or
resync). That is what I am particularly uncomfortable about.

>
> > Providing that auto-assembly of arrays works the same way (needs
> > driven), I am happy for arrays to auto-assemble.
> > I happen to think this most easily done in user-space.
>
> I don't know how to reconcile a needs based approach with system
> features that require arrays to be exported as soon as they are
> detected.
>

Maybe if arrays were auto-assembled in a read-only mode that guaranteed
not to write to the devices *at all* and did not bd_claim them.

When they are needed (either through some explicit set-writable command
or through an implicit first-write) then the underlying components are
bd_claimed. If that succeeds, the array becomes "live". If it fails,
it stays read-only.
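
A minimal sketch of that transition, with invented structure and function
names (only bd_claim()/bd_release() are the real 2.6 block-layer calls):

#include <linux/fs.h>      /* bd_claim(), bd_release() */
#include <linux/errno.h>

struct component {
        struct block_device *bdev;
};

struct array {
        int nr_components;
        struct component *component;
        int read_only;
};

/* Called on an explicit set-writable command or on the first write. */
static int array_make_writable(struct array *a)
{
        int i;

        for (i = 0; i < a->nr_components; i++) {
                if (bd_claim(a->component[i].bdev, a) < 0) {
                        while (--i >= 0)
                                bd_release(a->component[i].bdev);
                        return -EBUSY;  /* stay read-only */
                }
        }
        a->read_only = 0;               /* array goes "live" */
        return 0;
}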

>
> > But back to your original post: I suspect there is lots of valuable
> > stuff in your emd patch, but as you have probably gathered, big
> > patches are not the way we work around here, and with good reason.
> >
> > If you would like to identify isolated pieces of functionality, create
> > patches to implement them, and submit them for review I will be quite
> > happy to review them and, when appropriate, forward them to
> > Andrew/Linus.
> > I suggest you start with less controversial changes and work your way
> > forward.
>
> One suggestion that was recently raised was to present these changes
> in the form of an alternate "EMD" driver to avoid any potential
> breakage of the existing MD. Do you have any opinion on this?

Choice is good. Competition is good. I would not try to interfere
with you creating a new "emd" driver that didn't interfere with "md".
What Linus would think of it I really don't know. It is certainly not
impossible that he would accept it.

However I'm not sure that having three separate device-array systems
(dm, md, emd) is actually a good idea. It would probably be really
good to unite md and dm somehow, but no-one seems really keen on
actually doing the work.

I seriously think the best long-term approach for your emd work is to
get it integrated into md. I do listen to reason and I am not
completely head-strong, but I do have opinions, and you would need to
put in the effort to convince me.

NeilBrown


2004-03-24 19:10:36

by Matt Domsch

Subject: Re: "Enhanced" MD code avaible for review

On Wed, Mar 24, 2004 at 01:26:47PM +1100, Neil Brown wrote:
> On Monday March 22, [email protected] wrote:
> > One suggestion that was recently raised was to present these changes
> > in the form of an alternate "EMD" driver to avoid any potential
> > breakage of the existing MD. Do you have any opinion on this?
>
> I seriously think the best long-term approach for your emd work is to
> get it integrated into md. I do listen to reason and I am not
> completely head-strong, but I do have opinions, and you would need to
> put in the effort to convince me.

I completely agree that long-term, md and emd need to be the same.
However, watching the pain that the IDE changes took in early 2.5, I'd
like to see emd be merged alongside md for the short-term while the
kinks get worked out, keeping in mind the desire to merge them
together again as soon as that happens.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2004-03-25 02:21:59

by Jeff Garzik

Subject: Re: "Enhanced" MD code avaible for review

Neil Brown wrote:
> Choice is good. Competition is good. I would not try to interfere
> with you creating a new "emd" driver that didn't interfere with "md".
> What Linus would think of it I really don't know. It is certainly not
> impossible that he would accept it.

Agreed.

Independent DM efforts have already started supporting MD raid0/1
metadata from what I understand, though these efforts don't seem to post
to linux-kernel or linux-raid much at all. :/


> However I'm not sure that having three separate device-array systems
> (dm, md, emd) is actually a good idea. It would probably be really
> good to unite md and dm somehow, but no-one seems really keen on
> actually doing the work.

I would be disappointed if all the work that has gone into the MD driver
is simply obsoleted by new DM targets. Particularly RAID 1/5/6.

You pretty much echoed my sentiments exactly... ideally md and dm can
be bound much more tightly to each other. For example, convert md's
raid[0156].c into device mapper targets... but indeed, nobody has
stepped up to do that so far.

Jeff



2004-03-25 18:02:22

by Kevin Corry

Subject: Re: "Enhanced" MD code avaible for review

On Wednesday 24 March 2004 8:21 pm, Jeff Garzik wrote:
> Neil Brown wrote:
> > Choice is good. Competition is good. I would not try to interfere
> > with you creating a new "emd" driver that didn't interfere with "md".
> > What Linus would think of it I really don't know. It is certainly not
> > impossible that he would accept it.
>
> Agreed.
>
> Independent DM efforts have already started supporting MD raid0/1
> metadata from what I understand, though these efforts don't seem to post
> to linux-kernel or linux-raid much at all. :/

I post on lkml.....occasionally. :)

I'm guessing you're referring to EVMS in that comment, since we have done
*part* of what you just described. EVMS has always had a plugin to recognize
MD devices, and has been using the MD driver for quite some time (along with
using Device-Mapper for non-MD stuff). However, as of our most recent release
(earlier this month), we switched to using Device-Mapper for MD RAID-linear
and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped"
module (both required to support LVM volumes), and it was a rather trivial
exercise to switch to activating these RAID devices using DM instead of MD.

This decision was not based on any real dislike of the MD driver, but rather
for the benefits that are gained by using Device-Mapper. In particular,
Device-Mapper provides the ability to change out the device mapping on the
fly, by temporarily suspending I/O, changing the table, and resuming the I/O.
I'm sure many of you know this already. But I'm not sure everyone fully
understands how powerful a feature this is. For instance, it means EVMS can
now expand RAID-linear devices online. While that particular example may not
sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to
Device-Mapper, this feature would then allow you to do stuff like add new
"active" members to a RAID-1 online (think changing from 2-way mirror to
3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online
simply by adding a new disk (assuming other limitations, e.g. a single
stripe-zone). Unfortunately, these are things the MD driver can't do online,
because you need to completely stop the MD device before making such changes
(to prevent the kernel and user-space from trampling on the same metadata),
and MD won't stop the device if it's open (i.e. if it's mounted or if you
have other devices (LVM) built on top of MD). Oftentimes this means you need
to boot to a rescue-CD to make these types of configuration changes.
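
For illustration, the suspend/reload/resume cycle looks roughly like this
through libdevmapper (the device name and table lines are made up, and
error handling is omitted for brevity):

#include <libdevmapper.h>

static int dm_simple(int type, const char *name)
{
        struct dm_task *dmt = dm_task_create(type);
        int r = 0;

        if (dmt && dm_task_set_name(dmt, name))
                r = dm_task_run(dmt);
        if (dmt)
                dm_task_destroy(dmt);
        return r;
}

int main(void)
{
        const char *dev = "lineardev";  /* hypothetical DM device */
        struct dm_task *dmt;

        /* 1. Suspend: new I/O is queued, in-flight I/O is flushed. */
        dm_simple(DM_DEVICE_SUSPEND, dev);

        /* 2. Load a replacement table, e.g. a grown linear mapping. */
        dmt = dm_task_create(DM_DEVICE_RELOAD);
        dm_task_set_name(dmt, dev);
        dm_task_add_target(dmt, 0,       1048576, "linear", "/dev/sda5 0");
        dm_task_add_target(dmt, 1048576, 1048576, "linear", "/dev/sdb7 0");
        dm_task_run(dmt);
        dm_task_destroy(dmt);

        /* 3. Resume: the new table goes live and queued I/O is released. */
        dm_simple(DM_DEVICE_RESUME, dev);
        return 0;
}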

As for not posting this information on lkml and/or linux-raid, I do apologize
if this is something you would like to have been informed of. Most of the
recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken
that to mean the folks on the list aren't terribly interested in EVMS
developments. And since EVMS is a completely user-space tool and this
decision didn't affect any kernel components, I didn't think it was really
relevant to mention here. We usually discuss such things on
[email protected] or [email protected], but I'll be happy to
cross-post to lkml more often if it's something that might be pertinent.

> > However I'm not sure that having three separate device-array systems
> > (dm, md, emd) is actually a good idea. It would probably be really
> > good to unite md and dm somehow, but no-one seems really keen on
> > actually doing the work.
>
> I would be disappointed if all the work that has gone into the MD driver
> is simply obsoleted by new DM targets. Particularly RAID 1/5/6.
>
> You pretty much echoed my sentiments exactly... ideally md and dm can
> be bound much more tightly to each other. For example, convert md's
> raid[0156].c into device mapper targets... but indeed, nobody has
> stepped up to do that so far.

We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some
point in the future, primarily for some of the reasons I mentioned above.
Obviously linear.c and raid0.c don't really need to be ported. DM provides
equivalent functionality, the discovery/activation can be driven from
user-space, and no in-kernel status updating is necessary (unlike RAID-1 and
-5). And we've talked for a long time about wanting to port RAID-1 and RAID-5
(and now RAID-6) to Device-Mapper targets, but we haven't started on any such
work, or even had any significant discussions about *how* to do it. I can't
imagine we would try this without at least involving Neil and other folks
from linux-raid, since it would be nice to actually reuse as much of the
existing MD code as possible (especially for RAID-5 and -6). I have no desire
to try to rewrite those from scratch.

Device-Mapper does currently contain a mirroring module (still just in Joe's
-udm tree), which has primarily been used to provide online-move
functionality in LVM2 and EVMS. They've recently added support for persistent
logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1
has some additional requirements for updating status in its superblock at
runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring
module could still be used, with the possibility of either adding
MD-RAID-1-specific information to the persistent-log module, or simply as an
additional log type.

So, if this is the direction everyone else would like to see MD and DM take,
we'd be happy to help out.

--
Kevin Corry
[email protected]
http://evms.sourceforge.net/

2004-03-25 18:42:42

by Jeff Garzik

Subject: Re: "Enhanced" MD code avaible for review

Kevin Corry wrote:
> I'm guessing you're referring to EVMS in that comment, since we have done
> *part* of what you just described. EVMS has always had a plugin to recognize
> MD devices, and has been using the MD driver for quite some time (along with
> using Device-Mapper for non-MD stuff). However, as of our most recent release
> (earlier this month), we switched to using Device-Mapper for MD RAID-linear
> and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped"
> module (both required to support LVM volumes), and it was a rather trivial
> exercise to switch to activating these RAID devices using DM instead of MD.

nod


> This decision was not based on any real dislike of the MD driver, but rather
> for the benefits that are gained by using Device-Mapper. In particular,
> Device-Mapper provides the ability to change out the device mapping on the
> fly, by temporarily suspending I/O, changing the table, and resuming the I/O.
> I'm sure many of you know this already. But I'm not sure everyone fully
> understands how powerful a feature this is. For instance, it means EVMS can
> now expand RAID-linear devices online. While that particular example may not
[...]

Sounds interesting but is mainly an implementation detail for the
purposes of this discussion...

emd may want to use some of this, for example.


> As for not posting this information on lkml and/or linux-raid, I do apologize
> if this is something you would like to have been informed of. Most of the
> recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken
> that to mean the folks on the list aren't terribly interested in EVMS
> developments. And since EVMS is a completely user-space tool and this
> decision didn't affect any kernel components, I didn't think it was really
> relevant to mention here. We usually discuss such things on
> [email protected] or [email protected], but I'll be happy to
> cross-post to lkml more often if it's something that might be pertinent.

Understandable... for the stuff that impacts MD some mention of the
work, on occasion, to linux-raid and/or linux-kernel would be useful.

I'm mainly looking at it from a standpoint of making sure that all the
various RAID efforts are not independent of each other.


> We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some
> point in the future, primarily for some of the reasons I mentioned above.
> Obviously linear.c and raid0.c don't really need to be ported. DM provides
> equivalent functionality, the discovery/activation can be driven from
> user-space, and no in-kernel status updating is necessary (unlike RAID-1 and
> -5). And we've talked for a long time about wanting to port RAID-1 and RAID-5
> (and now RAID-6) to Device-Mapper targets, but we haven't started on any such
> work, or even had any significant discussions about *how* to do it. I can't

let's have that discussion :)

> imagine we would try this without at least involving Neil and other folks
> from linux-raid, since it would be nice to actually reuse as much of the
> existing MD code as possible (especially for RAID-5 and -6). I have no desire
> to try to rewrite those from scratch.

<cheers>


> Device-Mapper does currently contain a mirroring module (still just in Joe's
> -udm tree), which has primarily been used to provide online-move
> functionality in LVM2 and EVMS. They've recently added support for persistent
> logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1
> has some additional requirements for updating status in its superblock at
> runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring
> module could still be used, with the possibility of either adding
> MD-RAID-1-specific information to the persistent-log module, or simply as an
> additional log type.

WRT specific implementation, I would hope for the reverse -- that the
existing, known, well-tested MD raid1 code would be used. But perhaps
that's a naive impression... Folks with more knowledge of the
implementation can make that call better than I.


I'd like to focus on the "additional requirements" you mention, as I
think that is a key area for consideration.

There is a certain amount of metadata that -must- be updated at runtime,
as you recognize. Over and above what MD already cares about, DDF and
its cousins introduce more items along those lines: event logs, bad
sector logs, controller-level metadata... these are some of the areas I
think Justin/Scott are concerned about.

My take on things... the configuration of RAID arrays got a lot more
complex with DDF and "host RAID" in general. Association of RAID arrays
based on specific hardware controllers. Silently building RAID0+1
stacked arrays out of non-RAID block devices the kernel presents.
Failing over when one of the drives the kernel presents does not respond.

All that just screams "do it in userland".

OTOH, once the devices are up and running, the kernel needs to update some of
that configuration itself. Hot spare lists are an easy example, but any
time the state of the overall RAID array changes, some host RAID
formats, more closely tied to hardware than MD, may require
configuration metadata changes when some hardware condition(s) change.

I respectfully disagree with the EMD folks that a userland approach is
impossible, given all the failure scenarios. In a userland approach,
there -will- be some duplicated metadata-management code between
userland and the kernel. But for configuration _above_ the
single-raid-array level, I think that's best left to userspace.

There will certainly be a bit of intra-raid-array management code in the
kernel, including configuration updating. I agree to its necessity...
but that doesn't mean that -all- configuration/autorun stuff needs to be
in the kernel.

Jeff



2004-03-25 18:50:31

by Jeff Garzik

Subject: Re: "Enhanced" MD code avaible for review

Jeff Garzik wrote:
> My take on things... the configuration of RAID arrays got a lot more
> complex with DDF and "host RAID" in general. Association of RAID arrays
> based on specific hardware controllers. Silently building RAID0+1
> stacked arrays out of non-RAID block devices the kernel presents.
> Failing over when one of the drives the kernel presents does not respond.
>
> All that just screams "do it in userland".

Just so there is no confusion... the "failing over...in userland" thing
I mention is _only_ during discovery of the root disk.

Similar code would need to go into the bootloader, for controllers that
do not present the entire RAID array as a faked BIOS INT drive.

Jeff



2004-03-25 22:04:43

by Lars Marowsky-Bree

Subject: Re: "Enhanced" MD code avaible for review

On 2004-03-25T13:42:12,
Jeff Garzik <[email protected]> said:

> >and -5). And we've talked for a long time about wanting to port RAID-1 and
> >RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't started
> >on any such work, or even had any significant discussions about *how* to
> >do it. I can't
> let's have that discussion :)

Nice 2.7 material, and parts I've always wanted to work on. (Including
making the entire partition scanning user-space on top of DM too.)

KS material?

> My take on things... the configuration of RAID arrays got a lot more
> complex with DDF and "host RAID" in general.

And then add all the other stuff, like scenarios where half of your RAID
is "somewhere" on the network via nbd, iSCSI or whatever and all the
other possible stackings... Definitely user-space material, and partly
because it /needs/ to have the input from the volume managers to do the
sane things.

The point about this implying that the superblock parsing/updating logic
needs to be duplicated between userspace and kernel land is valid too
though, and I'm keen on resolving this in a way which doesn't suck...


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett

2004-03-25 23:11:09

by Justin T. Gibbs

Subject: Re: "Enhanced" MD code avaible for review

>> Independent DM efforts have already started supporting MD raid0/1
>> metadata from what I understand, though these efforts don't seem to post
>> to linux-kernel or linux-raid much at all. :/
>
> I post on lkml.....occasionally. :)

...

> This decision was not based on any real dislike of the MD driver, but rather
> for the benefits that are gained by using Device-Mapper. In particular,
> Device-Mapper provides the ability to change out the device mapping on the
> fly, by temporarily suspending I/O, changing the table, and resuming the I/O.
> I'm sure many of you know this already. But I'm not sure everyone fully
> understands how powerful a feature this is. For instance, it means EVMS can
> now expand RAID-linear devices online. While that particular example may not
> sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to
> Device-Mapper, this feature would then allow you to do stuff like add new
> "active" members to a RAID-1 online (think changing from 2-way mirror to
> 3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online
> simply by adding a new disk (assuming other limitations, e.g. a single
> stripe-zone). Unfortunately, these are things the MD driver can't do online,
> because you need to completely stop the MD device before making such changes
> (to prevent the kernel and user-space from trampling on the same metadata),
> and MD won't stop the device if it's open (i.e. if it's mounted or if you
> have other devices (LVM) built on top of MD). Oftentimes this means you need
> to boot to a rescue-CD to make these types of configuration changes.

We should be clear about your argument here. It is not that DM makes
generic morphing easy and possible, it is that with DM the most basic
types of morphing (no data striping or de-striping) are easily accomplished.
You cite two examples:

1) Adding another member to a RAID-1. While MD may not allow this to
occur while the array is operational, EMD does. This is possible
because there is only one entity controlling the meta-data.

2) Converting a RAID0 to a RAID4, while possible with DM, is not particularly
   interesting from an end-user perspective.

The fact of the matter is that neither EMD nor DM provide a generic
morphing capability. If this is desirable, we can discuss how it could
be achieved, but my initial belief is that attempting any type of
complicated morphing from userland would be slow, prone to deadlocks,
and thus difficult to achieve in a fashion that guaranteed no loss of
data in the face of unexpected system restarts.

--
Justin

2004-03-25 23:36:35

by Justin T. Gibbs

Subject: Re: "Enhanced" MD code avaible for review

> I respectfully disagree with the EMD folks that a userland approach is
> impossible, given all the failure scenarios.

I've never said that it was impossible, just unwise. I believe
that a userland approach offers no benefit over allowing the kernel
to perform all meta-data operations. The end result of such an
approach (given feature and robustness parity with the EMD solution)
is a larger resident size, code duplication, and more complicated
configuration/management interfaces.

--
Justin

2004-03-25 23:43:33

by Lars Marowsky-Bree

Subject: Re: "Enhanced" MD code avaible for review

On 2004-03-25T15:59:00,
"Justin T. Gibbs" <[email protected]> said:

> The fact of the matter is that neither EMD nor DM provide a generic
> morphing capability. If this is desirable, we can discuss how it could
> be achieved, but my initial belief is that attempting any type of
> complicated morphing from userland would be slow, prone to deadlocks,
> and thus difficult to achieve in a fashion that guaranteed no loss of
> data in the face of unexpected system restarts.

Uhm. DM sort of does (at least where the morphing amounts to resyncing a
part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc).
Freeze, load new mapping, continue.

I agree that more complex morphings (RAID1->RAID5 or vice-versa in
particular) are more difficult to get right, but are not that often
needed online - or if they are, typically such scenarios will have
enough temporary storage to create the new target, RAID1 over,
disconnect the old part and free it, which will work just fine with DM.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett

2004-03-25 23:47:57

by Justin T. Gibbs

Subject: Re: "Enhanced" MD code avaible for review

> Jeff Garzik wrote:
>
> Just so there is no confusion... the "failing over...in userland" thing I
> mention is _only_ during discovery of the root disk.

None of the solutions being talked about perform "failing over" in
userland. The RAID transforms which perform this operation are kernel
resident in DM, MD, and EMD. Perhaps you are talking about spare
activation and rebuild?

> Similar code would need to go into the bootloader, for controllers that do
> not present the entire RAID array as a faked BIOS INT drive.

None of the solutions presented here are attempting to make RAID
transforms operate from the boot loader environment without BIOS
support. I see this as a completely tangential problem to what is
being discussed.

--
Justin

2004-03-26 00:08:49

by Jeff Garzik

Subject: Re: "Enhanced" MD code avaible for review

Justin T. Gibbs wrote:
>>Jeff Garzik wrote:
>>
>>Just so there is no confusion... the "failing over...in userland" thing I
>>mention is _only_ during discovery of the root disk.
>
>
> None of the solutions being talked about perform "failing over" in
> userland. The RAID transforms which perform this operation are kernel
> resident in DM, MD, and EMD. Perhaps you are talking about spare
> activation and rebuild?

This is precisely why I sent the second email, and made the
qualification I did :)

For a "do it in userland" solution, an initrd or initramfs piece
examines the system configuration, and assembles physical disks into
RAID arrays based on the information it finds. I was mainly implying
that an initrd solution would have to provide some primitive failover
initially, before the kernel is bootstrapped... much like a bootloader
that supports booting off a RAID1 array would need to do.

Jeff



2004-03-26 00:24:29

by Justin T. Gibbs

Subject: Re: "Enhanced" MD code avaible for review

> Uhm. DM sort of does (at least where the morphing amounts to resyncing a
> part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc).
> Freeze, load new mapping, continue.

The point is that these trivial "morphings" can be achieved with limited
effort regardless of whether you do it via EMD or DM. Implementing this
in EMD could be achieved with perhaps 8 hours work with no significant
increase in code size or complexity. This is part of why I find them
"uninteresting". If we really want to talk about generic morphing,
I think you'll find that DM is no better suited to this task than MD or
its derivatives.

> I agree that more complex morphings (RAID1->RAID5 or vice-versa in
> particular) are more difficult to get right, but are not that often
> needed online - or if they are, typically such scenarios will have
> enough temporary storage to create the new target, RAID1 over,
> disconnect the old part and free it, which will work just fine with DM.

The most common requests that we hear from customers are:

o single -> R1

Equally possible with MD or DM assuming your singles are
accessed via a volume manager. Without that support the
user will have to dismount and remount storage.

o R1 -> R10

This should just require doubling the number of active members.
This is not possible today with either DM or MD. Only
"migration" is possible.

o R1 -> R5
o R5 -> R1

These typically occur when data access patterns change for
the customer. Again not possible with DM or MD today.

All of these are important to some subset of customers and are, to
my mind, required if you want to claim even basic morphing capability.
If you are allowing the "cop-out" of using a volume manager to substitute
data-migration for true morphing, then MD is almost as well suited to
that task as DM.

--
Justin

2004-03-26 00:33:23

by Jeff Garzik

Subject: Re: "Enhanced" MD code avaible for review

Justin T. Gibbs wrote:
>>>None of the solutions being talked about perform "failing over" in
>>>userland. The RAID transforms which perform this operation are kernel
>>>resident in DM, MD, and EMD. Perhaps you are talking about spare
>>>activation and rebuild?
>>
>>This is precisely why I sent the second email, and made the qualification
>>I did :)
>>
>>For a "do it in userland" solution, an initrd or initramfs piece examines
>>the system configuration, and assembles physical disks into RAID arrays
>>based on the information it finds. I was mainly implying that an initrd
>>solution would have to provide some primitive failover initially, before
>>the kernel is bootstrapped... much like a bootloader that supports booting
>>off a RAID1 array would need to do.
>
>
> "Failover" (i.e. redirecting a read to a viable member) will not occur
> via userland at all. The initrd solution just has to present all available
> members to the kernel interface performing the RAID transform. There
> is no need for "special failover handling" during bootstrap in either
> case.

hmmm, yeah, agreed.

Jeff




2004-03-26 01:04:22

by Justin T. Gibbs

Subject: Re: "Enhanced" MD code avaible for review

>> None of the solutions being talked about perform "failing over" in
>> userland. The RAID transforms which perform this operation are kernel
>> resident in DM, MD, and EMD. Perhaps you are talking about spare
>> activation and rebuild?
>
> This is precisely why I sent the second email, and made the qualification
> I did :)
>
> For a "do it in userland" solution, an initrd or initramfs piece examines
> the system configuration, and assembles physical disks into RAID arrays
> based on the information it finds. I was mainly implying that an initrd
> solution would have to provide some primitive failover initially, before
> the kernel is bootstrapped... much like a bootloader that supports booting
> off a RAID1 array would need to do.

"Failover" (i.e. redirecting a read to a viable member) will not occur
via userland at all. The initrd solution just has to present all available
members to the kernel interface performing the RAID transform. There
is no need for "special failover handling" during bootstrap in either
case.

--
Justin

2004-03-26 00:32:46

by Jeff Garzik

Subject: Re: "Enhanced" MD code avaible for review

Justin T. Gibbs wrote:
>>I respectfully disagree with the EMD folks that a userland approach is
>>impossible, given all the failure scenarios.
>
>
> I've never said that it was impossible, just unwise. I believe
> that a userland approach offers no benefit over allowing the kernel
> to perform all meta-data operations. The end result of such an
> approach (given feature and robustness parity with the EMD solution)
> is a larger resident size, code duplication, and more complicated
> configuration/management interfaces.

There is some code duplication, yes. But the right userspace solution
does not have a larger RSS, and has _less_ complicated management
interfaces. A key benefit of "do it in userland" is a clear gain in
flexibility, simplicity, and debuggability (if that's a word).

But it's hard. It requires some deep thinking. It's a whole lot easier
to do everything in the kernel -- but that doesn't offer you the
protections of userland, particularly separate address spaces from the
kernel, and having to try harder to crash the kernel. :)

Jeff



2004-03-26 17:44:31

by Justin T. Gibbs

Subject: Re: "Enhanced" MD code avaible for review

>>> I respectfully disagree with the EMD folks that a userland approach is
>>> impossible, given all the failure scenarios.
>>
>>
>> I've never said that it was impossible, just unwise. I believe
>> that a userland approach offers no benefit over allowing the kernel
>> to perform all meta-data operations. The end result of such an
>> approach (given feature and robustness parity with the EMD solution)
>> is a larger resident size, code duplication, and more complicated
>> configuration/management interfaces.
>
> There is some code duplication, yes. But the right userspace solution
> does not have a larger RSS, and has _less_ complicated management
> interfaces.
>
> A key benefit of "do it in userland" is a clear gain in flexibility,
> simplicity, and debuggability (if that's a word).

This is just as much hand waving as, 'All that just screams "do it in
userland".' <sigh>

I posted a rather detailed, technical, analysis of what I believe would
be required to make this work correctly using a userland approach. The
only response I've received is from Neil Brown. Please, point out, in
a technical fashion, how you would address the feature set being proposed:

o Rebuilds
o Auto-array enumeration
o Meta-data updates for topology changes (failed members, spare activation)
o Meta-data updates for "safe mode"
o Array creation/deletion
o "Hot member addition"

Only then can a true comparative analysis of which solution is "less
complex", "more maintainable", and "smaller" be performed.

> But it's hard. It requires some deep thinking. It's a whole lot easier
> to do everything in the kernel -- but that doesn't offer you the
> protections of userland, particularly separate address spaces from the
> kernel, and having to try harder to crash the kernel. :)

A crash in any component of a RAID solution that prevents automatic
failover and rebuilds without customer intervention is unacceptable.
Whether it crashes your kernel or not is really not that important other
than the customer will probably notice that their data is no longer
protected *sooner* if the system crashes. In other words, the solution
must be *correct* regardless of where it resides. Saying that doing
a portion of it in userland allows it to safely be buggier seems a very
strange argument.

--
Justin

2004-03-26 19:15:59

by Kevin Corry

Subject: Re: "Enhanced" MD code avaible for review

On Thursday 25 March 2004 12:42 pm, Jeff Garzik wrote:
> > We're obviously pretty keen on seeing MD and Device-Mapper "merge" at
> > some point in the future, primarily for some of the reasons I mentioned
> > above. Obviously linear.c and raid0.c don't really need to be ported. DM
> > provides equivalent functionality, the discovery/activation can be driven
> > from user-space, and no in-kernel status updating is necessary (unlike
> > RAID-1 and -5). And we've talked for a long time about wanting to port
> > RAID-1 and RAID-5 (and now RAID-6) to Device-Mapper targets, but we
> > haven't started on any such work, or even had any significant discussions
> > about *how* to do it. I can't
>
> let's have that discussion :)

Great! Where do we begin? :)

> I'd like to focus on the "additional requirements" you mention, as I
> think that is a key area for consideration.
>
> There is a certain amount of metadata that -must- be updated at runtime,
> as you recognize. Over and above what MD already cares about, DDF and
> its cousins introduce more items along those lines: event logs, bad
> sector logs, controller-level metadata... these are some of the areas I
> think Justin/Scott are concerned about.

I'm sure these things could be accommodated within DM. Nothing in DM prevents
having some sort of in-kernel metadata knowledge. In fact, other DM modules
already do - dm-snapshot and the above mentioned dm-mirror both need to do
some amount of in-kernel status updating. But I see this as completely
separate from in-kernel device discovery (which we seem to agree is the wrong
direction). And IMO, well designed metadata will make this "split" very
obvious, so it's clear which parts of the metadata the kernel can use for
status, and which parts are purely for identification (which the kernel thus
ought to be able to ignore).

The main point I'm trying to get across here is that DM provides a simple yet
extensible kernel framework for a variety of storage management tasks,
including a lot more than just RAID. I think it would be a huge benefit for
the RAID drivers to make use of this framework to provide functionality
beyond what is currently available.

> My take on things... the configuration of RAID arrays got a lot more
> complex with DDF and "host RAID" in general. Association of RAID arrays
> based on specific hardware controllers. Silently building RAID0+1
> stacked arrays out of non-RAID block devices the kernel presents.

By this I assume you mean RAID devices that don't contain any type of on-disk
metadata (e.g. MD superblocks). I don't see this as a huge hurdle. As long as
the device drivers (SCSI, IDE, etc.) export the necessary identification info
through sysfs, user-space tools can contain the policies necessary to allow
them to detect which disks belong together in a RAID device, and then tell
the kernel to activate said RAID device. This sounds a lot like how
Christophe Varoqui has been doing things in his new multipath tools.

> Failing over when one of the drives the kernel presents does not respond.
>
> All that just screams "do it in userland".
>
> OTOH, once the devices are up and running, the kernel needs to update some of
> that configuration itself. Hot spare lists are an easy example, but any
> time the state of the overall RAID array changes, some host RAID
> formats, more closely tied to hardware than MD, may require
> configuration metadata changes when some hardware condition(s) change.

Certainly. Of course, I see things like adding and removing hot-spares and
removing stale/faulty disks as something that can be driven from user-space.
For example, for adding a new hot-spare, with DM it's as simple as loading a
new mapping that contains the new disk, then telling DM to switch the device
mapping (which implies a suspend/resume of I/O). And if necessary, such a
user-space tool can be activated by hotplug events triggered by the insertion
of a new disk into the system, making the process effectively transparent to
the user.

--
Kevin Corry
[email protected]
http://evms.sourceforge.net/

2004-03-26 19:20:45

by Kevin Corry

Subject: Re: "Enhanced" MD code avaible for review

On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote:
> On 2004-03-25T13:42:12,
>
> Jeff Garzik <[email protected]> said:
> > >and -5). And we've talked for a long time about wanting to port RAID-1
> > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't
> > > started on any such work, or even had any significant discussions about
> > > *how* to do it. I can't
> >
> > let's have that discussion :)
>
> Nice 2.7 material, and parts I've always wanted to work on. (Including
> making the entire partition scanning user-space on top of DM too.)

Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think
we've already proved this is possible. We really only need to work on making
early-userspace a little easier to use.

> KS material?

Sounds good to me.

--
Kevin Corry
[email protected]
http://evms.sourceforge.net/

2004-03-26 20:45:38

by Justin T. Gibbs

Subject: Re: "Enhanced" MD code avaible for review

>> There is a certain amount of metadata that -must- be updated at runtime,
>> as you recognize. Over and above what MD already cares about, DDF and
>> its cousins introduce more items along those lines: event logs, bad
>> sector logs, controller-level metadata... these are some of the areas I
>> think Justin/Scott are concerned about.
>
> I'm sure these things could be accommodated within DM. Nothing in DM prevents
> having some sort of in-kernel metadata knowledge. In fact, other DM modules
> already do - dm-snapshot and the above mentioned dm-mirror both need to do
> some amount of in-kernel status updating. But I see this as completely
> separate from in-kernel device discovery (which we seem to agree is the wrong
> direction). And IMO, well designed metadata will make this "split" very
> obvious, so it's clear which parts of the metadata the kernel can use for
> status, and which parts are purely for identification (which the kernel thus
> ought to be able to ignore).

We don't have control over the meta-data formats being used by the industry.
Coming up with a solution that only works for "Linux Engineered Meta-data
formats" removes any possibility of supporting things like DDF, Adaptec
ASR, and a host of other meta-data formats that can be plugged into things
like EMD. In the two cases we are supporting today with EMD, the records
required for doing discovery reside in the same sectors as those that need
to be updated at runtime from some "in-core" context.

> The main point I'm trying to get across here is that DM provides a simple yet
> extensible kernel framework for a variety of storage management tasks,
> including a lot more than just RAID. I think it would be a huge benefit for
> the RAID drivers to make use of this framework to provide functionality
> beyond what is currently available.

DM is a transform layer that has the ability to pause I/O while that
transform is updated from userland. That's all it provides. As such,
it is perfectly suited to some types of logical volume management
applications. But that is as far as it goes. It does not have any
support for doing "sync/resync/scrub" type operations or any generic
support for doing anything with meta-data. In all of the examples you
have presented so far, you have not explained how this part of the equation
is handled. Sure, adding a member to a RAID1 is trivial. Just pause the
I/O, update the transform, and let it go. Unfortunately, that new member
is not in sync with the rest. The transform must be aware of this and only
trust the member below the sync mark. How is this information communicated
to the transform? Who updates the sync mark? Who copies the data to the
new member while guaranteeing that an in-flight write does not occur to the
area being synced? If you intend to add all of this to DM, then it is no
longer any "simpler" or more extensible than EMD.

Don't take my arguments the wrong way. I believe that DM is useful
for what it was designed for: LVM. It does not, however, provide the
machinery required for it to replace a generic RAID stack. Could
you merge a RAID stack into DM? Sure. It's only software. But for
it to be robust, the same types of operations MD/EMD perform in kernel
space will have to be done there too.

The simplicity of DM is part of why it is compelling. My belief is that
merging RAID into DM will compromise this simplicity and divert DM from
what it was designed to do - provide LVM transforms.

As for RAID discovery, this is the trivial portion of RAID. For an extra
10% or less of code in a meta-data module, you get RAID discovery. You
also get a single point of access to the meta-data, and you avoid duplicated
code and complex kernel/user interfaces. There seems to be a consistent feeling
that it is worth compromising all of these benefits just to push this 10%
of the meta-data handling code out of the kernel (and inflate it by 5 or
6 X duplicating code already in the kernel). Where are the benefits of
this userland approach?

--
Justin

2004-03-27 15:40:26

by Kevin Corry

Subject: Re: "Enhanced" MD code avaible for review

On Friday 26 March 2004 2:45 pm, Justin T. Gibbs wrote:
> We don't have control over the meta-data formats being used by the
> industry. Coming up with a solution that only works for "Linux Engineered
> Meta-data formats" removes any possibility of supporting things like DDF,
> Adaptec ASR, and a host of other meta-data formats that can be plugged into
> things like EMD. In the two cases we are supporting today with EMD, the
> records required for doing discovery reside in the same sectors as those
> that need to be updated at runtime from some "in-core" context.

Well, there's certainly no guarantee that the "industry" will get it right. In
this case, it seems that they didn't. But even given that we don't have ideal
metadata formats, it's still possible to do discovery and a number of other
management tasks from user-space.

> > The main point I'm trying to get across here is that DM provides a simple
> > yet extensible kernel framework for a variety of storage management
> > tasks, including a lot more than just RAID. I think it would be a huge
> > benefit for the RAID drivers to make use of this framework to provide
> > functionality beyond what is currently available.
>
> DM is a transform layer that has the ability to pause I/O while that
> transform is updated from userland. That's all it provides.

I think the DM developers would disagree with you on this point.

> As such,
> it is perfectly suited to some types of logical volume management
> applications. But that is as far as it goes. It does not have any
> support for doing "sync/resync/scrub" type operations or any generic
> support for doing anything with meta-data.

The core DM driver would not and should not be handling these operations.
These are handled in modules specific to one type of mapping. There's no
need for the DM core to know anything about any metadata. If one particular
module (e.g. dm-mirror) needs to support one or more metadata formats, it's
free to do so.

On the other hand, DM *does* provide services that make "sync/resync" a great
deal simpler for such a module. It provides simple services for performing
synchronous or asynchronous I/O to pages or vm areas. It provides a service
for performing copies from one block-device area to another. The dm-mirror
module uses these for this very purpose. If we need additional "libraries"
for common RAID tasks (e.g. parity calculations) we can certainly add them.
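
As a rough sketch of how a resync might use that copy service, this is
roughly the early-2.6 kcopyd interface from drivers/md (treat the exact
signatures as approximate; region_copied() and the region argument are
invented for illustration):

#include <linux/blkdev.h>
#include "dm-io.h"      /* struct io_region */
#include "kcopyd.h"

/* Completion callback: mark the region in-sync, kick off the next one. */
static void region_copied(int read_err, unsigned int write_err, void *context)
{
        /* ... */
}

static int sync_one_region(struct kcopyd_client *kc,
                           struct block_device *clean_bdev,
                           struct block_device *dirty_bdev,
                           sector_t start, sector_t len, void *region)
{
        struct io_region from = { .bdev = clean_bdev, .sector = start, .count = len };
        struct io_region to   = { .bdev = dirty_bdev, .sector = start, .count = len };

        /* Asynchronous copy from a clean member to the out-of-sync member. */
        return kcopyd_copy(kc, &from, 1, &to, 0, region_copied, region);
}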

> In all of the examples you
> have presented so far, you have not explained how this part of the equation
> is handled. Sure, adding a member to a RAID1 is trivial. Just pause the
> I/O, update the transform, and let it go. Unfortunately, that new member
> is not in sync with the rest. The transform must be aware of this and only
> trust the member below the sync mark. How is this information communicated
> to the transform? Who updates the sync mark? Who copies the data to the
> new member while guaranteeing that an in-flight write does not occur to the
> area being synced?

Before the new disk is added to the raid1, user-space is responsible for
writing an initial state to that disk, effectively marking it as completely
dirty and unsynced. When the new table is loaded, part of the "resume" is for
the module to read any metadata and do any initial setup that's necessary. In
this particular example, it means the new disk would start with all of its
"regions" marked "dirty", and all the regions would need to be synced from
corresponding "clean" regions on another disk in the set.

If the previously-existing disks were part-way through a sync when the table
was switched, their metadata would indicate where the current "sync mark" was
located. The module could then continue the sync from where it left off,
including the new disk that was just added. When the sync completed, it might
have to scan back to the beginning of the new disk to see if it had any remaining
dirty regions that needed to be synced before that disk was completely clean.

And of course the I/O-mapping path just has to be smart enough to know which
regions are dirty and avoid sending live I/O to those.

(And I'm sure Joe or Alasdair could provide a better in-depth explanation of
the current dm-mirror module than I'm trying to. This is obviously a very
high-level overview.)
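
Very roughly, the write-path decision described above looks something like
this (invented structures and helpers, not the actual dm-mirror code; the
real module also has to handle writes that land in regions currently being
recovered):

#include <linux/bio.h>
#include <linux/bitops.h>

struct member {
        struct block_device *bdev;
        unsigned long *dirty_bitmap;    /* one bit per region */
};

struct mirror_set {
        unsigned int region_shift;      /* log2(sectors per region) */
        int nr_members;
        struct member *member;
};

/* invented helper: queue a copy of the bio to this member */
static void write_to_member(struct member *m, struct bio *bio)
{
}

/* Map a live write: only members whose region is clean receive a copy;
 * dirty regions on the new member are left to the resync thread. */
static void map_write(struct mirror_set *ms, struct bio *bio)
{
        unsigned long region = bio->bi_sector >> ms->region_shift;
        int i;

        for (i = 0; i < ms->nr_members; i++)
                if (!test_bit(region, ms->member[i].dirty_bitmap))
                        write_to_member(&ms->member[i], bio);
}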

This process is somewhat similar to how dm-snapshot works. If it reads an
empty header structure, it assumes it's a new snapshot, and starts with an
empty hash table. If it reads a previously existing header, it continues to
read the on-disk COW tables and constructs the necessary in-memory hash-table
to represent that initial state.

> If you intend to add all of this to DM, then it is no
> longer any "simpler" or more extensible than EMD.

Sure it is. Because very little (if any) of this needs to affect the core DM
driver, that core remains as simple and extensible as it currently is. The
extra complexity only really affects the new modules that would handle RAID.

> Don't take my arguments the wrong way. I believe that DM is useful
> for what it was designed for: LVM. It does not, however, provide the
> machinery required for it to replace a generic RAID stack. Could
> you merge a RAID stack into DM. Sure. Its only software. But for
> it to be robust, the same types of operations MD/EMD perform in kernel
> space will have to be done there too.
>
> The simplicity of DM is part of why it is compelling. My belief is that
> merging RAID into DM will compromise this simplicity and divert DM from
> what it was designed to do - provide LVM transforms.

I disagree. The simplicity of the core DM driver really isn't at stake here.
We're only talking about adding a few relatively complex target modules. And
with DM you get the benefit of a very simple user/kernel interface.

> As for RAID discovery, this is the trivial portion of RAID. For an extra
> 10% or less of code in a meta-data module, you get RAID discovery. You
> also get a single point of access to the meta-data, and you avoid duplicated
> code and complex kernel/user interfaces. There seems to be a consistent feeling
> that it is worth compromising all of these benefits just to push this 10%
> of the meta-data handling code out of the kernel (and inflate it by 5 or
> 6 X duplicating code already in the kernel). Where are the benefits of
> this userland approach?

I've got to admit, this whole discussion is very ironic. Two years ago I
was exactly where you are today, pushing for in-kernel discovery, a variety of
metadata modules, internal opaque device stacking, etc, etc. I can only
imagine that hch is laughing his ass off now that I'm the one arguing for
moving all this stuff to user-space.

I don't honestly expect to suddenly change your mind on all these issues.
A lot of work has obviously gone into EMD, and I definitely know how hard it
can be when the community isn't greeting your suggestions with open arms. And
I'm certainly not saying the EMD method isn't a potentially viable approach.
But it doesn't seem to be the approach the community is looking for. We faced
the same resistance two years ago. It took months of arguing with the
community and arguing amongst ourselves before we finally decided to move
EVMS to user-space and use MD and DM. It was a decision that meant
essentially throwing away an enormous amount of work from several people. It
was an incredibly hard choice, but I really believe now that it was the right
decision. It was the direction the community wanted to move in, and the only
way for our project to truly survive was to move with them.

So feel free to continue to develop and promote EMD. I'm not trying to stop
you and I don't mind having competition for finding the best way to do RAID
in Linux. But I can tell you from experience that EMD is going to face a good
bit of opposition based on its current design and you might want to take that
into consideration.

I am interested in discussing if and how RAID could be supported under
Device-Mapper (or some other "merging" of these two drivers). Jeff and Lars
have shown some interest, and I certainly hope we can convince Neil and Joe
that this is a good direction. Maybe it can be done and maybe it can't. I
personally think it can be, and I'd at least like to have that discussion
and find out.

--
Kevin Corry
[email protected]
http://evms.sourceforge.net/

2004-03-28 00:07:42

by Lincoln Dale

Subject: Re: "Enhanced" MD code avaible for review

At 03:43 AM 27/03/2004, Justin T. Gibbs wrote:
>I posted a rather detailed, technical, analysis of what I believe would
>be required to make this work correctly using a userland approach. The
>only response I've received is from Neil Brown. Please, point out, in
>a technical fashion, how you would address the feature set being proposed:

i'll have a go.

your position is one of "put it all in the kernel".
Jeff, Neil, Kevin et al take the position of "it can live in userspace".

to that end, i agree with the userspace approach.
the way i personally believe that it SHOULD happen is that you tie your
metadata format (and RAID format, if it's different from others) into DM.

you boot up using an initrd from which you can start some form of userspace
management daemon.
you can have your binary (userspace) tools started from initrd which can
populate the tables for all disks/filesystems, including pivoting to a new
root filesystem if need be.

the only thing your BIOS/int13h redirection needs to do is provide
sufficient information to load the kernel and the initial ramdisk.
perhaps that means that you guys could provide enhancements to grub/lilo if
they are insufficient for things like finding a secondary copy of
initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things
the "open source way" and help improve the overall tools, if the end goal
ends up being the same: enabling YOUR system to work better?)

moving forward, perhaps initrd will be deprecated in favour of initramfs -
but until then, there isn't any downside to this approach that i can see.

with all this in mind, and with the basic premise being that, as a minimum,
the kernel has booted and initrd is working, here are answers to your other
points:

> o Rebuilds

userspace is running.
rebuilds are simply a process of your userspace tools recognising that
there are disk groups in an inconsistent state, not bringing them online,
but rather doing whatever is necessary to rebuild them.
nothing says that you cannot have a KERNEL-space 'helper' to help do the
rebuild.

> o Auto-array enumeration

your userspace tool can receive notification (via udev/hotplug) when new
disks/devices appear. from there, your userspace tool can read whatever
metadata exists on the disk, and use that to enumerate whatever block
devices exist.

perhaps DM needs some hooks to be able to do this - but i believe that the
DM v4 ioctls cover this already.
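
as a toy example, a /sbin/hotplug-style helper driving that flow might look
like this (the metadata-scan and DM-activation steps are stubbed out; ACTION
and DEVPATH are the environment variables the hotplug mechanism passes):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
        const char *action  = getenv("ACTION");
        const char *devpath = getenv("DEVPATH");

        /* /sbin/hotplug is invoked with the subsystem as argv[1]. */
        if (argc < 2 || strcmp(argv[1], "block") != 0)
                return 0;

        if (action && devpath && strcmp(action, "add") == 0) {
                /* 1. read any RAID metadata on the new device (stub) */
                /* 2. if it completes a known array, load a DM table for it (stub) */
                printf("would scan %s and assemble any array it completes\n", devpath);
        }
        return 0;
}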

> o Meta-data updates for topology changes (failed members, spare activation)

a failed member may be as a result of a disk being pulled out. for such an
event, udev/hotplug should tell your userspace daemon.
a failed member may be as a result of lots of I/O errors. perhaps there is
work needed in the linux block layer to indicate some form of hotplug event
such as 'excessive errors', perhaps its something needed in the DM
layer. in either case, it isn't out of the question that userspace can be
notified.

for a "spare activation", once again, that can be done entirely from userspace.

> o Meta-data updates for "safe mode"

seems implementation specific to me.

> o Array creation/deletion

the short answer here is "how does one create or remove DM/LVM/MD
partitions today?"
it certainly isn't in the kernel ...

> o "Hot member addition"

this should also be possible today.
i haven't looked too closely at whether there are sufficient interfaces for
quiescence of I/O or not - but once again, if not, why not implement
something that can be used for all?

>Only then can a true comparative analysis of which solution is "less
>complex", "more maintainable", and "smaller" be performed.

there may be less lines of code involved in "entirely in kernel" for YOUR
hardware --
but what about when 4 other storage vendors come out with such a card?
what if someone wants to use your card in conjunction with the storage
being multipathed or replicated automatically?
what about when someone wants to create snapshots for backups?

all that functionality has to then go into your EMD driver.

Adaptec may decide all that is too hard -- at which point, your product may
become obsolete as the storage paradigms have moved beyond what your EMD
driver is capable of.
if you could tie it into DM -- which i believe to be the de facto path
forward for lots of this cool functionality -- you gain this kind of
functionality gratis -- or at least with minimal effort to integrate.

better yet, Linux as a whole benefits from your involvement -- your
time/effort isn't put into something specific to your hardware -- but
rather your time/effort is put into something that can be used by all.

this conversation really sounds like the same one you had with James about
the SCSI Mid layer and why you just have to bypass items there and do your
own proprietary things. in summary, i don't believe you should be
focussing on a short-term view of "but it's more lines of code", but rather
a more big-picture view of "overall, there will be LESS lines of code" and
"it will fit better into the overall device-mapper/block-remapper
functionality" within the kernel.


cheers,

lincoln.

2004-03-28 00:30:58

by Jeff Garzik

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

Justin T. Gibbs wrote:
> o Rebuilds

90% kernel, AFAICS, otherwise you have races with
requests that the driver is actively satisfying


> o Auto-array enumeration

userspace


> o Meta-data updates for "safe mode"

unsure of the definition of safe mode


> o Array creation/deletion


of entire arrays? can mostly be done in userspace, but deletion
also needs to update controller-wide metadata, which might be
stored on active arrays.


> o "Hot member addition"

userspace prepares, kernel completes

[moved this down in your list]
> o Meta-data updates for topology changes (failed members, spare activation)

[warning: this is a tangent from the userspace sub-thread/topic]

the kernel, of course, must manage topology, otherwise things
Don't Get Done, and requests don't go where they should. :)

Part of the value of device mapper is that it provides container
objects for multi-disk groups, and a common method of messing
around with those container objects. You clearly recognized the
same need in emd... but I don't think we want two different
pieces of code doing the same basic thing.


I do think that metadata management needs to be fairly cleanly
separated (I like what emd did there), such that a user needs
three in-kernel pieces:
* device mapper
* generic raid1 engine
* personality module

"personality" would be where the specifics of the metadata
management lived, and it would be responsible for handling the
specifics of non-hot-path events that nonetheless still need
to be in the kernel.
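
As a sketch only -- none of these names exist in the kernel today, they
merely illustrate the split -- such a personality might boil down to a
small ops table the generic engine calls whenever metadata has to be
examined or rewritten:

/* none of these names exist in the kernel; they only sketch the split:
 * the generic raid engine calls out through a small ops table whenever
 * on-disk metadata has to be examined or rewritten. */

struct raid_member;                     /* one component device */
struct raid_array;                      /* the assembled array */

struct raid_personality {
        const char *name;               /* e.g. "ddf", "vendor-foo" */

        /* recognise and load the on-disk metadata from one member */
        int (*load_super)(struct raid_member *m);

        /* record a topology change: failed member, spare activated */
        int (*update_super)(struct raid_array *a);

        /* checkpoint resync progress / mark the array clean or dirty */
        int (*sync_state)(struct raid_array *a, int clean);
};

int register_raid_personality(struct raid_personality *p);
int unregister_raid_personality(struct raid_personality *p);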



2004-03-28 09:11:49

by christophe varoqui

[permalink] [raw]
Subject: Re: [dm-devel] Re: "Enhanced" MD code avaible for review

Justin,

I direct you to http://christophe.varoqui.free.fr/ for a well documented
example of coordination between the device-mapper and the userspace
multipath tools.

I hope you'll see how robust and elegant the solution can be.

regards,
cvaroqui

2004-03-30 17:05:45

by Justin T. Gibbs

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

> Well, there's certainly no guarantee that the "industry" will get it right. In
> this case, it seems that they didn't. But even given that we don't have ideal
> metadata formats, it's still possible to do discovery and a number of other
> management tasks from user-space.

I have never proposed that management activities be performed solely
within the kernel. My position has been that meta-data parsing and
updating have to be core-resident for any solution that handles advanced
RAID functionality, and that splitting out any portion of those roles
to userland just complicates the solution.

>> it is perfectly suited to some types of logical volume management
>> applications. But that is as far as it goes. It does not have any
>> support for doing "sync/resync/scrub" type operations or any generic
>> support for doing anything with meta-data.
>
> The core DM driver would not and should not be handling these operations.
> These are handled in modules specific to one type of mapping. There's no
> need for the DM core to know anything about any metadata. If one particular
> module (e.g. dm-mirror) needs to support one or more metadata formats, it's
> free to do so.

That's unfortunate considering that the meta-data formats we are talking
about already have the capability of expressing RAID 1(E),4,5,6. There has
to be a common meta-data framework in order to avoid this duplication.

>> In all of the examples you
>> have presented so far, you have not explained how this part of the equation
>> is handled.

...

> Before the new disk is added to the raid1, user-space is responsible for
> writing an initial state to that disk, effectively marking it as completely
> dirty and unsynced. When the new table is loaded, part of the "resume" is for
> the module to read any metadata and do any initial setup that's necessary. In
> this particular example, it means the new disk would start with all of its
> "regions" marked "dirty", and all the regions would need to be synced from
> corresponding "clean" regions on another disk in the set.
>
> If the previously-existing disks were part-way through a sync when the table
> was switched, their metadata would indicate where the current "sync mark" was
> located. The module could then continue the sync from where it left off,
> including the new disk that was just added. When the sync completed, it might
> have to scan back to the beginning of the new disk to see if it had any remaining
> dirty regions that needed to be synced before that disk was completely clean.
>
> And of course the I/O-mapping path just has to be smart enough to know which
> regions are dirty and avoid sending live I/O to those.
>
> (And I'm sure Joe or Alasdair could provide a better in-depth explanation of
> the current dm-mirror module than I'm trying to. This is obviously a very
> high-level overview.)

So all of this complexity is still in the kernel. The only difference is
that the meta-data can *also* be manipulated from userspace. In order
for this to be safe, the mirror must be suspended (meta-data becomes stable),
the meta-data must be re-read by the userland program, the meta-data must be
updated, the mapping must be updated, the mirror must be resumed, and the
mirror must revalidate all meta-data. How do you avoid deadlock in this
process? Does the userland daemon, which must be core resident in this case,
pre-allocate buffers for reading and writing the meta-data?
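
For concreteness, the handshake being described would look roughly like the
following through libdevmapper; the device name, target parameters and
sector count are placeholders rather than anything taken from dm-raid1:

#include <stdint.h>
#include <libdevmapper.h>

static int dm_simple(int type, const char *name)
{
        struct dm_task *dmt = dm_task_create(type);
        int rc = 0;

        if (!dmt)
                return 0;
        if (dm_task_set_name(dmt, name))
                rc = dm_task_run(dmt);
        dm_task_destroy(dmt);
        return rc;
}

/* suspend -> (userland edits metadata) -> reload table -> resume */
int swap_table(const char *name, const char *params, uint64_t sectors)
{
        struct dm_task *dmt;
        int rc = 0;

        /* 1. quiesce I/O; the on-disk metadata is now stable */
        if (!dm_simple(DM_DEVICE_SUSPEND, name))
                return 0;

        /* 2. ...userland reads and rewrites the metadata here... */

        /* 3. load the replacement mapping */
        if (!(dmt = dm_task_create(DM_DEVICE_RELOAD)))
                goto resume;
        if (dm_task_set_name(dmt, name) &&
            dm_task_add_target(dmt, 0, sectors, "mirror", params))
                rc = dm_task_run(dmt);
        dm_task_destroy(dmt);

resume:
        /* 4. resume; the target re-reads its metadata as it comes back */
        dm_simple(DM_DEVICE_RESUME, name);
        return rc;
}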

The dm-raid1 module also appears to intrinsically trust its mapping and the
contents of its meta-data (simple magic number check). It seems to me that
the kernel should validate all of its inputs regardless of whether the
ioctls that are used to present them are only supposed to be used by a
"trusted daemon".

All of this adds up to more complexity. Your argument seems to be that,
since DM avoids this complexity in its core, this is a better solution,
but I am more interested in the least complex, most easily maintained
total solution.

>> The simplicity of DM is part of why it is compelling. My belief is that
>> merging RAID into DM will compromise this simplicity and divert DM from
>> what it was designed to do - provide LVM transforms.
>
> I disagree. The simplicity of the core DM driver really isn't at stake here.
> We're only talking about adding a few relatively complex target modules. And
> with DM you get the benefit of a very simple user/kernel interface.

The simplicity of the user/kernel interface is not what is at stake here.
With EMD, you can perform all of the same operations talked about above,
in just as few ioctl calls. The only difference is that the kernel, and
only the kernel, reads and modifies the metadata. There are actually
fewer steps for the userland application than before. This becomes even
more evident as more meta-data modules are added.

> I don't honestly expect to suddenly change your mind on all these issues.
> A lot of work has obviously gone into EMD, and I definitely know how hard it
> can be when the community isn't greeting your suggestions with open arms.

I honestly don't care if the final solution is EMD, DM, or XYZ so long
as that solution is correct, supportable, and covers all of the scenarios
required for robust RAID support. That is the crux of the argument, not
"please love my code".

--
Justin

2004-03-30 17:17:07

by Jeff Garzik

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

Justin T. Gibbs wrote:
> The dm-raid1 module also appears to intrinsically trust its mapping and the
> contents of its meta-data (simple magic number check). It seems to me that
> the kernel should validate all of its inputs regardless of whether the
> ioctls that are used to present them are only supposed to be used by a
> "trusted daemon".

The kernel should not be validating -trusted- userland inputs. Root is
allowed to scrag the disk, violate limits, and/or crash his own machine.

A simple example is requiring userland, when submitting ATA taskfiles
via an ioctl, to specify the data phase (pio read, dma write, no-data,
etc.). If the data phase is specified incorrectly, you kill the OS
driver's ATA host state machine, and the results are very unpredictable.
Since this is a trusted operation, requiring CAP_RAW_IO, it's up to
userland to get the required details right (just like following a spec).


> I honestly don't care if the final solution is EMD, DM, or XYZ so long
> as that solution is correct, supportable, and covers all of the scenarios
> required for robust RAID support. That is the crux of the argument, not
> "please love my code".

hehe. I think we all agree here...

Jeff




2004-03-30 17:36:59

by Justin T. Gibbs

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

> The kernel should not be validating -trusted- userland inputs. Root is
> allowed to scrag the disk, violate limits, and/or crash his own machine.
>
> A simple example is requiring userland, when submitting ATA taskfiles via
> an ioctl, to specify the data phase (pio read, dma write, no-data, etc.).
> If the data phase is specified incorrectly, you kill the OS driver's ATA
> host state machine, and the results are very unpredictable. Since this
> is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the
> required details right (just like following a spec).

That's unfortunate for those using ATA. A command submitted from userland
to the SCSI drivers I've written that causes a protocol violation will
be detected, result in appropriate recovery, and a nice diagnostic that
can be used to diagnose the problem. Part of this is because I cannot know
if the protocol violation stems from a target defect, the input from the
user or, for that matter, from the kernel. The main reason is for robustness
and ease of debugging. In the SCSI case, there is almost no run-time cost, and
the system will stop before data corruption occurs. In the meta-data case
we've been discussing in terms of EMD, there is no runtime cost, the
validation has to occur somewhere anyway, and in many cases some validation
is already required to avoid races with external events. If the validation
is done in the kernel, then you get the benefit of nice diagnostics instead
of strange crashes that are difficult to debug.

--
Justin

2004-03-30 17:48:03

by Jeff Garzik

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

Justin T. Gibbs wrote:
>>The kernel should not be validating -trusted- userland inputs. Root is
>>allowed to scrag the disk, violate limits, and/or crash his own machine.
>>
>>A simple example is requiring userland, when submitting ATA taskfiles via
>>an ioctl, to specify the data phase (pio read, dma write, no-data, etc.).
>>If the data phase is specified incorrectly, you kill the OS driver's ATA
>>host state machine, and the results are very unpredictable. Since this
>>is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the
>>required details right (just like following a spec).
>
>
> That's unfortunate for those using ATA. A command submitted from userland

Required, since one cannot know the data phase of vendor-specific commands.


> to the SCSI drivers I've written that causes a protocol violation will
> be detected, result in appropriate recovery, and a nice diagnostic that
> can be used to diagnose the problem. Part of this is because I cannot know
> if the protocol violation stems from a target defect, the input from the
> user or, for that matter, from the kernel. The main reason is for robustness

Well,
* the target is not _issuing_ commands,
* any user issuing incorrect commands/cdbs is not your bug,
* and kernel code issuing incorrect commands/cdbs isn't your bug either

Particularly, checking whether the kernel is doing something wrong, or
not, just wastes cycles. That's not a scalable way to code... if
every driver and Linux subsystem did that, things would be unbearably slow.

Jeff



2004-03-30 17:56:03

by Justin T. Gibbs

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

> At 03:43 AM 27/03/2004, Justin T. Gibbs wrote:
>> I posted a rather detailed, technical, analysis of what I believe would
>> be required to make this work correctly using a userland approach. The
>> only response I've received is from Neil Brown. Please, point out, in
>> a technical fashion, how you would address the feature set being proposed:
>
> i'll have a go.
>
> your position is one of "put it all in the kernel".
> Jeff, Neil, Kevin et al is one of "it can live in userspace".

Please don't misrepresent or oversimplify my statements. What
I have said is that meta-data reading and writing should occur in
only one place. Since, as has already been acknowledged by many,
meta-data updates are required in the kernel, that means this support
should be handled in the kernel. Any other solution adds complexity
and size to the solution.

> to that end, i agree with the userspace approach.
> the way i personally believe that it SHOULD happen is that you tie
> your metadata format (and RAID format, if it's different to others) into DM.

Saying how you think something should happen without any technical
argument for it, doesn't help me to understand the benefits of your
approach.

...

> perhaps that means that you guys could provide enhancements to grub/lilo
> if they are insufficient for things like finding a secondary copy of
> initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things
> the "open source way" and help improve the overall tools, if the end goal
> ends up being the same: enabling YOUR system to work better?)

I don't understand your argument. We have improved an already existing
open-source driver to provide this functionality. Is this not the
OpenSource way?

> then answering your other points:

Again, you have presented strategies that may or may not work, but
no technical arguments for their superiority over placing meta-data
in the kernel.

> there may be less lines of code involved in "entirely in kernel" for YOUR
> hardware -- but what about when 4 other storage vendors come out with such
> a card?

There will be less lines of code total for any vendor that decides to
add a new meta-data type. All the vendor has to do is provide a meta-data
module. There are no changes to the userland utilities (they know nothing
about specific meta-data formats), to the RAID transform modules, or to
the core of EMD. If this were not the case, there would be little point
to the EMD work.

> what if someone wants to use your card in conjunction with the storage
> being multipathed or replicated automatically?
> what about when someone wants to create snapshots for backups?
>
> all that functionality has to then go into your EMD driver.

No. DM already works on any block device exported to the kernel.
EMD exports its devices as block devices. Thus, all of the DM
functionality you are talking about is also available for EMD.

--
Justin

Subject: Re: "Enhanced" MD code avaible for review

On Tuesday 30 of March 2004 19:35, Justin T. Gibbs wrote:
> > The kernel should not be validating -trusted- userland inputs. Root is
> > allowed to scrag the disk, violate limits, and/or crash his own machine.
> >
> > A simple example is requiring userland, when submitting ATA taskfiles via
> > an ioctl, to specify the data phase (pio read, dma write, no-data, etc.).
> > If the data phase is specified incorrectly, you kill the OS driver's ATA
> > host state machine, and the results are very unpredictable. Since this
> > is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get
> > the required details right (just like following a spec).
>
> That's unfortunate for those using ATA. A command submitted from userland
> to the SCSI drivers I've written that causes a protocol violation will
> be detected, result in appropriate recovery, and a nice diagnostic that
> can be used to diagnose the problem. Part of this is because I cannot know
> if the protocol violation stems from a target defect, the input from the
> user or, for that matter, from the kernel. The main reason is for
> robustness and ease of debugging. In the SCSI case, there is almost no
> run-time cost, and the system will stop before data corruption occurs. In

In the ATA case, detection of a protocol violation is not possible without
checking every possible command opcode. Even if implemented (notice that
checking commands coming from the kernel is out of the question, for
performance reasons), this breaks for future and vendor-specific commands.

> the meta-data case we've been discussing in terms of EMD, there is no
> runtime cost, the validation has to occur somewhere anyway, and in many
> cases some validation is already required to avoid races with external
> events. If the validation is done in the kernel, then you get the benefit
> of nice diagnostics instead of strange crashes that are difficult to debug.

Unless the code that crashes is the one doing the validation. ;-)

Bartlomiej

2004-03-30 18:06:05

by Justin T. Gibbs

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

>> That's unfortunate for those using ATA. A command submitted from userland
>
> Required, since one cannot know the data phase of vendor-specific commands.

So you are saying that this presents an unrecoverable situation?

> Particularly, checking whether the kernel is doing something wrong, or not,
> just wastes cycles. That's not a scalable way to code... if every driver
> and Linux subsystem did that, things would be unbearably slow.

Hmm. I've never had someone tell me that my SCSI drivers are slow.

I don't think that your statement is true in the general case. My
belief is that validation should occur where it is cheap and efficient
to do so. More expensive checks should be pushed into diagnostic code
that is disabled by default, but the code *should be there*. In any event,
for RAID meta-data, we're talking about code that is *not* in the common
or time critical path of the kernel. A few dozen lines of validation code
there has almost no impact on the size of the kernel and yields huge
benefits for debugging and maintaining the code. This is even more
the case in Linux, where the end user is often your test lab.
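
As a sketch of that split (all names below are invented): the cheap test
stays inline in the I/O path, while the expensive consistency walk hides
behind a diagnostic switch that is off by default:

#include <assert.h>

struct emd_request {
        unsigned int state;             /* already in a hot cache line */
        unsigned int nr_sectors;
};

enum { EMD_REQ_ACTIVE = 1, EMD_REQ_DONE = 2 };

/* stand-in for an expensive consistency walk over the whole array */
static int emd_request_consistent(struct emd_request *req)
{
        (void)req;
        return 1;
}

#ifdef EMD_DEBUG
#define EMD_DIAG(expr)  assert(expr)    /* heavyweight, off by default */
#else
#define EMD_DIAG(expr)  do { } while (0)
#endif

int emd_complete_request(struct emd_request *req)
{
        /* cheap check: one compare against a constant, on data that had
         * to be touched anyway -- fail with a diagnostic rather than
         * wandering off into a crash */
        if (req->state != EMD_REQ_ACTIVE)
                return -1;

        /* the expensive walk only exists in diagnostic builds */
        EMD_DIAG(emd_request_consistent(req));

        req->state = EMD_REQ_DONE;
        return 0;
}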

--
Justin

2004-03-30 21:48:13

by Jeff Garzik

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

Justin T. Gibbs wrote:
>>>That's unfortunate for those using ATA. A command submitted from userland
>>
>>Required, since one cannot know the data phase of vendor-specific commands.
>
>
> So you are saying that this presents an unrecoverable situation?

No, I'm saying that the data phase need not have a bunch of in-kernel
checks, it should be generated correctly from the source.


>>Particularly, checking whether the kernel is doing something wrong, or not,
>>just wastes cycles. That's not a scalable way to code... if every driver
>>and Linux subsystem did that, things would be unbearably slow.
>
>
> Hmm. I've never had someone tell me that my SCSI drivers are slow.

This would be noticed in the CPU utilization area. Your drivers are
probably a long way from being CPU-bound.


> I don't think that your statement is true in the general case. My
> belief is that validation should occur where it is cheap and efficient
> to do so. More expensive checks should be pushed into diagnostic code
> that is disabled by default, but the code *should be there*. In any event,
> for RAID meta-data, we're talking about code that is *not* in the common
> or time critical path of the kernel. A few dozen lines of validation code
> there has almost no impact on the size of the kernel and yields huge
> benefits for debugging and maintaining the code. This is even more
> the case in Linux, where the end user is often your test lab.

It doesn't scale terribly well, because the checks themselves become a
source of bugs.

Jeff




2004-03-30 22:15:53

by Justin T. Gibbs

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

>> So you are saying that this presents an unrecoverable situation?
>
> No, I'm saying that the data phase need not have a bunch of in-kernel
> checks, it should be generated correctly from the source.

The SCSI drivers validate the controller's data phase based on the
expected phase presented to them from an upper layer. I never talked
about adding checks that make little sense or are overly expensive. You
seem to equate validation with huge expense. That is just not the
general case.

>> Hmm. I've never had someone tell me that my SCSI drivers are slow.
>
> This would be noticed in the CPU utilization area. Your drivers are
> probably a long way from being CPU-bound.

I very much doubt that. There are perhaps four or five tests in the
I/O path where some value already in a cache line that has to be accessed
anyway is compared against a constant. We're talking about something
down in the noise of any type of profiling you could perform. As I said,
validation makes sense where there is basically no-cost to do it.

>> I don't think that your statement is true in the general case. My
>> belief is that validation should occur where it is cheap and efficient
>> to do so. More expensive checks should be pushed into diagnostic code
>> that is disabled by default, but the code *should be there*. In any event,
>> for RAID meta-data, we're talking about code that is *not* in the common
>> or time critical path of the kernel. A few dozen lines of validation code
>> there has almost no impact on the size of the kernel and yields huge
>> benefits for debugging and maintaining the code. This is even more
>> the case in Linux, where the end user is often your test lab.
>
> It doesn't scale terribly well, because the checks themselves become a
> source of bugs.

So now the complaint is that validation code is somehow harder to write
and maintain than the rest of the code?

--
Justin

2004-03-30 22:35:33

by Jeff Garzik

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

Justin T. Gibbs wrote:
>>>So you are saying that this presents an unrecoverable situation?
>>
>>No, I'm saying that the data phase need not have a bunch of in-kernel
>>checks, it should be generated correctly from the source.
>
>
> The SCSI drivers validate the controller's data phase based on the
> expected phase presented to them from an upper layer. I never talked
> about adding checks that make little sense or are overly expensive. You
> seem to equate validation with huge expense. That is just not the
> general case.
>
>
>>>Hmm. I've never had someone tell me that my SCSI drivers are slow.
>>
>>This would be noticed in the CPU utilization area. Your drivers are
>>probably a long way from being CPU-bound.
>
>
> I very much doubt that. There are perhaps four or five tests in the
> I/O path where some value already in a cache line that has to be accessed
> anyway is compared against a constant. We're talking about something
> down in the noise of any type of profiling you could perform. As I said,
> validation makes sense where there is basically no-cost to do it.
>
>
>>>I don't think that your statement is true in the general case. My
>>>belief is that validation should occur where it is cheap and efficient
>>>to do so. More expensive checks should be pushed into diagnostic code
>>>that is disabled by default, but the code *should be there*. In any event,
>>>for RAID meta-data, we're talking about code that is *not* in the common
>>>or time critical path of the kernel. A few dozen lines of validation code
>>>there has almost no impact on the size of the kernel and yields huge
>>>benefits for debugging and maintaining the code. This is even more
>>>the case in Linux, where the end user is often your test lab.
>>
>>It doesn't scale terribly well, because the checks themselves become a
>>source of bugs.
>
>
> So now the complaint is that validation code is somehow harder to write
> and maintain than the rest of the code?

Actually, yes. Validation of random user input has always been a source
of bugs (usually in edge cases), in Linux and in other operating
systems. It is often the area where security bugs are found.

Basically you want to avoid adding checks for conditions that don't occur
in properly written software, and make sure that the kernel always
generates correct requests. Obviously that excludes anything on the
target side, but other than that... in userland, a privileged user is
free to do anything they wish, including violate protocols, cook their
disk, etc.

Jeff



2004-03-31 17:11:28

by Randy.Dunlap

[permalink] [raw]
Subject: Re: "Enhanced" MD code avaible for review

On Fri, 26 Mar 2004 13:19:28 -0600 Kevin Corry wrote:

| On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote:
| > On 2004-03-25T13:42:12,
| >
| > Jeff Garzik <[email protected]> said:
| > > >and -5). And we've talked for a long time about wanting to port RAID-1
| > > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't
| > > > started on any such work, or even had any significant discussions about
| > > > *how* to do it. I can't
| > >
| > > let's have that discussion :)
| >
| > Nice 2.7 material, and parts I've always wanted to work on. (Including
| > making the entire partition scanning user-space on top of DM too.)
|
| Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think
| we've already proved this is possible. We really only need to work on making
| early-userspace a little easier to use.
|
| > KS material?
|
| Sounds good to me.

Ditto.

I didn't see much conclusion to this thread, other than Neil's
good suggestions. (maybe on some other list that I don't read?)

I wouldn't want this or any other projects to have to wait for the
kernel summit. Email has worked well for many years...let's
try to keep it working. :)

--
~Randy
"You can't do anything without having to do something else first."
-- Belefant's Law