2004-03-17 19:19:00

by Jeff Garzik

Subject: Re: "Enhanced" MD code avaible for review

Justin T. Gibbs wrote:
> [ I tried sending this last night from my Adaptec email address and have
> yet to see it on the list. Sorry if this is dup for any of you. ]

Included linux-kernel in the CC (and also bounced this post there).


> For the past few months, Adaptec Inc, has been working to enhance MD.

The FAQ from several corners is going to be "why not DM?", so I would
humbly request that you (or Scott Long) re-post some of that rationale
here...


> The goals of this project are:
>
> o Allow fully pluggable meta-data modules

yep, needed


> o Add support for Adaptec ASR (aka HostRAID) and DDF
> (Disk Data Format) meta-data types. Both of these
> formats are understood natively by certain vendor
> BIOSes meaning that arrays can be booted from transparently.

yep, needed

For those who don't know, DDF is particularly interesting. A storage
industry association, "SNIA", has gotten most of the software and
hardware RAID folks to agree on a common, vendor-neutral on-disk format.
Pretty historic, IMO :) Since this will be appearing on most of the
future RAID hardware, Linux users will be left out in a big way if this
isn't supported.

EARLY DRAFT spec for DDF was posted on snia.org at
http://www.snia.org/tech_activities/ddftwg/DDFTrial-UseDraft_0_45.pdf


> o Improve the ability of MD to auto-configure arrays.

hmmmm. Maybe in my language this means "improve ability for low-level
drivers to communicate RAID support to upper layers"?


> o Support multi-level arrays transparently yet allow
> proper event notification across levels when the
> topology is known to MD.

I'll need to see the code to understand what this means, much less
whether it is needed ;-)


> o Create a more generic "work item" framework which is
> used to support array initialization, rebuild, and
> verify operations as well as miscellaneous tasks that
> a meta-data or RAID personality may need to perform
> from a thread context (e.g. spare activation where
> meta-data records may need to be sequenced carefully).

This is interesting. (guessing) sort of like a pluggable finite state
machine?


> o Modify the MD ioctl interface to allow the creation
> of management utilities that are meta-data format
> agnostic.

I'm thinking that for 2.6, it is much better to use a more tightly
defined interface via a Linux character driver. Userland write(2)'s
packets of data (h/w raid commands or software raid configuration
commands), and read(2)'s the responses.

ioctl's are a pain for 32->64-bit translation layers. Using a
read/write interface allows one to create an interface that requires no
translation layer -- a big deal for AMD64 and IA32e processors moving
forward -- and it also gives one a lot more control over the interface.

See, we need what I described _anyway_, as a chrdev-based interface to
sending and receiving ATA taskfiles or SCSI cdb's.

It would be IMO simple to extend this to a looks-a-lot-like-ioctl
raid_op interface.
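
Roughly the shape I have in mind -- purely a sketch, every name below is
invented and nothing here is from the EMD patch:

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <asm/uaccess.h>

/* userland write(2)'s one of these, then read(2)'s the response */
struct raidctl_request {
	__u32	opcode;		/* create array, query member, ...    */
	__u32	payload_len;	/* bytes of payload after the header  */
	__u64	payload[0];	/* 64-bit aligned, explicitly sized   */
};

static ssize_t raidctl_write(struct file *filp, const char *buf,
			     size_t len, loff_t *off)
{
	struct raidctl_request req;

	if (len < sizeof(req))
		return -EINVAL;
	if (copy_from_user(&req, buf, sizeof(req)))
		return -EFAULT;

	/* dispatch on req.opcode, queue a response for the next read() */
	return len;
}

Since every field has an explicit size, the same layout works for 32-bit
and 64-bit userland without a compat translation layer.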


> A snapshot of this work is now available here:
>
> http://people.freebsd.org/~gibbs/linux/SRC/emd-0.7.0-tar.gz

Your email didn't say... this appears to be for 2.6, correct?


> This snapshot includes support for RAID0, RAID1, and the Adaptec
> ASR and DDF meta-data formats. Additional RAID personalities and
> support for the Super90 and Super 1 meta-data formats will be added
> in the coming weeks, the end goal being to provide a superset of
> the functionality in the current MD.

groovy


> Since the current MD notification scheme does not allow MD to receive
> notifications unless it is statically compiled into the kernel, we
> would like to work with the community to develop a more generic
> notification scheme to which modules, such as MD, can dynamically
> register. Until that occurs, these EMD snapshots will require at
> least md.c to be a static component of the kernel.

You would just need a small stub that holds a notifier pointer, yes?
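
Something along these lines is all I mean -- a minimal sketch, names
made up:

#include <linux/module.h>
#include <linux/notifier.h>

/* tiny always-built-in stub; md itself can then stay modular */
static struct notifier_block *md_event_chain;

int md_event_register(struct notifier_block *nb)
{
	return notifier_chain_register(&md_event_chain, nb);
}
EXPORT_SYMBOL(md_event_register);

int md_event_notify(unsigned long event, void *data)
{
	return notifier_call_chain(&md_event_chain, event, data);
}
EXPORT_SYMBOL(md_event_notify);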


> Architectural Notes
> ===================
> The major areas of change in "EMD" can be categorized into:
>
> 1) "Object Oriented" Data structure changes
>
> These changes are the basis for allowing RAID personalities
> to transparently operate on "disks" or "arrays" as member
> objects. While it has always been possible to create
> multi-level arrays in MD using block layer stacking, our
> approach allows MD to also stack internally. Once a given
> RAID or meta-data personality is converted to the new
> structures, this "feature" comes at no cost. The benefit
> to stacking internally, which requires a meta-data format
> that supports this, is that array state can propagate up
> and down the topology without the loss of information
> inherent in using the block layer to traverse levels of an
> array.

I have a feeling that consensus will prefer that we fix the block layer,
and then figure out the best way to support "automatic stacking" --
since DDF and presumably other RAID formats will require automatic
setup of raid0+1, etc.

Are there RAID-specific issues here, that do not apply to e.g.
multipathing, which I've heard needs more information at the block layer?


> 2) Opcode based interfaces.
>
> Rather than add additional method vectors to either the
> RAID personality or meta-data personality objects, the new
> code uses only a few methods that are parameterized. This
> has allowed us to create a fairly rich interface between
> the core and the personalities without overly bloating
> personality "classes".

Modulo what I said above, about the chrdev userland interface, we want
to avoid this. You're already going down the wrong road by creating
more untyped interfaces...

static int raid0_raidop(mdk_member_t *member, int op, void *arg)
{
	switch (op) {
	case MDK_RAID_OP_MSTATE_CHANGED:

The preferred model is to create a single marshalling module (a la
net/core/ethtool.c) that converts the ioctls we must support into a
fully typed function call interface (a la struct ethtool_ops).
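
In other words, something shaped like this -- hypothetical names, just to
show the idea:

/* one marshalling layer decodes the user-visible command, then calls
 * into a fully typed per-personality table */
struct mdk_raid_ops {
	int	(*member_state_changed)(mdk_member_t *member, int new_state);
	int	(*spare_activate)(mdk_member_t *member);
	int	(*resize)(mdk_member_t *member, sector_t new_sectors);
};

static int raid0_member_state_changed(mdk_member_t *member, int new_state);

static struct mdk_raid_ops raid0_ops = {
	.member_state_changed	= raid0_member_state_changed,
};

raid0 then fills in only the operations it actually implements, and the
compiler checks every argument.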


> 3) WorkItems
>
> Workitems provide a generic framework for queuing work to
> a thread context. Workitems include a "control" method as
> well as a "handler" method. This separation allows, for
> example, a RAID personality to use the generic sync handler
> while trapping the "open", "close", and "free" of any sync
> workitems. Since both handlers can be tailored to the
> individual workitem that is queued, this removes the need
> to overload one or more interfaces in the personalities.
> It also means that any code in MD can make use of this
> framework - it is not tied to particular objects or modules
> in the system.

Makes sense, though I wonder if we'll want to make this more generic.
hardware RAID drivers might want to use this sort of stuff internally?
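
Guessing at the shape from the description -- all names speculative:

#include <linux/list.h>

enum mdk_wi_ctl { MDK_WI_OPEN, MDK_WI_CLOSE, MDK_WI_FREE };

struct mdk_workitem {
	struct list_head  queue;	/* daemon's pending-work list        */
	void		 *context;	/* personality-private data          */
	int		(*handler)(struct mdk_workitem *wi);  /* thread ctx  */
	int		(*control)(struct mdk_workitem *wi, enum mdk_wi_ctl op);
};

A personality could then reuse a generic sync handler while supplying its
own control hook for open/close/free, which is how I read the text above.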


> 4) "Syncable Volume" Support
>
> All of the transaction accounting necessary to support
> redundant arrays has been abstracted out into a few inline
> functions. With the inclusion of a "sync support" structure
> in a RAID personality's private data structure area and the
> use of these functions, the generic sync framework is fully
> available. The sync algorithm is also now more like that
> in 2.4.X - with some updates to improve performance. Two
> contiguous sync ranges are employed so that sync I/O can
> be pending while the lock range is extended and new sync
> I/O is stalled waiting for normal I/O writes that might
> conflict with the new range complete. The syncer updates
> its stats more frequently than in the past so that it can
> more quickly react to changes in the normal I/O load. Syncer
> backoff is also disabled anytime there is pending I/O blocked
> on the syncer's locked region. RAID personalities have
> full control over the size of the sync windows used so that
> they can be optimized based on RAID layout policy.

interesting. makes sense on the surface, I'll have to think some more...
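
If I'm reading it right, the accounting looks roughly like this --
illustrative only, not the EMD structures:

#include <linux/types.h>
#include <linux/wait.h>
#include <asm/atomic.h>

struct sync_window {
	sector_t	start;		/* first locked sector              */
	sector_t	end;		/* one past the last locked sector  */
	atomic_t	pending;	/* sync I/Os still in flight        */
};

struct sync_support {
	struct sync_window	active;	/* sync I/O being issued from here  */
	struct sync_window	staged;	/* next range, waiting for normal   */
					/* writes that conflict to drain    */
	wait_queue_head_t	wait;	/* normal I/O blocked on the ranges */
};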


> 5) IOCTL Interface
>
> "EMD" now performs all of its configuration via an "mdctl"
> character device. Since one of our goals is to remove any
> knowledge of meta-data type in the user control programs,
> initial meta-data stamping and configuration validation
> occurs in the kernel. In general, the meta-data modules
> already need this validation code in order to support
> auto-configuration, so adding this capability adds little
> to the overall size of EMD. It does, however, require a
> few additional ioctls to support things like querying the
> maximum "coerced" size of a disk targeted for a new array,
> or enumerating the names of installed meta-data modules,
> etc.
>
> This area of EMD is still in very active development and we expect
> to provide a drop of an "emdadm" utility later this week.

I haven't yet evaluated the ioctl interface. I do understand the need
to play alongside the existing md interface, but if there are huge
numbers of additions, it would be preferred to just use the chrdev
straightaway. Such a chrdev would be easily portable to 2.4.x kernels
too :)


> 7) Correction of RAID0 Transform
>
> The RAID0 transform's "merge function" assumes that the
> incoming bio's starting sector is the same as what will be
> presented to its make_request function. In the case of a
> partitioned MD device, the starting sector is shifted by
> the partition offset for the target offset. Unfortunately,
> the merge functions are not notified of the partition
> transform, so RAID0 would often reject requests that span
> "chunk" boundaries once shifted. The fix employed here is
> to determine if a partition transform will occur and take
> this into account in the merge function.

interesting
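
So the arithmetic is basically this (hypothetical helper, not the actual
EMD code):

/* the sector the merge function sees must be shifted by the partition
 * start before the chunk-boundary check; raid0 chunks are a power of
 * two, so a mask gives the offset within the chunk */
static int raid0_fits_in_chunk(sector_t bio_sector, sector_t part_offset,
			       unsigned int len_sectors,
			       unsigned int chunk_sectors)
{
	sector_t real_sector = bio_sector + part_offset;
	unsigned int offset = real_sector & (chunk_sectors - 1);

	return offset + len_sectors <= chunk_sectors;
}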


> Adaptec is currently validating EMD through formal testing while
> continuing the build-out of new features. Our hope is to gather
> feedback from the Linux community and adjust our approach to satisfy
> the community's requirements. We look forward to your comments,
> suggestions, and review of this project.

Thanks much for working with the Linux community.

One overall comment on merging into 2.6: the patch will need to be
broken up into pieces. It's OK if each piece is dependent on the prior
one, and it's OK if there are 20, 30, even 100 pieces. It helps a lot
for review to see the evolution, and it also helps flush out problems
you might not have even noticed. e.g.
- add concept of member, and related helper functions
- use member functions/structs in raid drivers raid0.c, etc.
- fix raid0 transform
- add ioctls needed in order for DDF to be useful
- add DDF format
etc.



2004-03-17 19:32:21

by Christoph Hellwig

Subject: Re: "Enhanced" MD code avaible for review

On Wed, Mar 17, 2004 at 02:18:25PM -0500, Jeff Garzik wrote:
> > o Allow fully pluggable meta-data modules
>
> yep, needed

Well, this is pretty much the EVMS route we all heavily argued against.
Most of the metadata shouldn't be visible in the kernel at all.

> > o Improve the ability of MD to auto-configure arrays.
>
> hmmmm. Maybe in my language this means "improve ability for low-level
> drivers to communicate RAID support to upper layers"?

I think he's talking about the deprecated raid autorun feature. Again
something that is completely misplaced in the kernel. (again EVMS light)

> > o Support multi-level arrays transparently yet allow
> > proper event notification across levels when the
> > topology is known to MD.
>
> I'll need to see the code to understand what this means, much less
> whether it is needed ;-)

I think he means the broken inter-driver raid stacking mentioned below.
Why do I have to think of EVMS for each feature?..

2004-03-17 20:03:10

by Jeff Garzik

Subject: Re: "Enhanced" MD code avaible for review

Christoph Hellwig wrote:
> On Wed, Mar 17, 2004 at 02:18:25PM -0500, Jeff Garzik wrote:
>
>>> o Allow fully pluggable meta-data modules
>>
>>yep, needed
>
>
> Well, this is pretty much the EVMS route we all heavily argued against.
> Most of the metadata shouldn't be visible in the kernel at all.

_some_ metadata is required at runtime, and must be in the kernel. I
agree that a lot of configuration doesn't necessarily need to be in the
kernel. But stuff like bad sector and event logs, and other bits are
still needed at runtime.


>>> o Improve the ability of MD to auto-configure arrays.
>>
>>hmmmm. Maybe in my language this means "improve ability for low-level
>>drivers to communicate RAID support to upper layers"?
>
>
> I think he's talking about the deprecated raid autorun feature. Again
> something that is completely misplaced in the kernel. (again EVMS light)

Indeed, but I'll let him and the code illuminate the meaning :)

Jeff



2004-03-17 21:21:35

by Scott Long

Subject: Re: "Enhanced" MD code avaible for review

Jeff Garzik wrote:
> Justin T. Gibbs wrote:
> > [ I tried sending this last night from my Adaptec email address and have
> > yet to see it on the list. Sorry if this is dup for any of you. ]
>
> Included linux-kernel in the CC (and also bounced this post there).
>
>
> > For the past few months, Adaptec Inc, has been working to enhance MD.
>
> The FAQ from several corners is going to be "why not DM?", so I would
> humbly request that you (or Scott Long) re-post some of that rationale
> here...
>
>
> > The goals of this project are:
> >
> > o Allow fully pluggable meta-data modules
>
> yep, needed
>
>
> > o Add support for Adaptec ASR (aka HostRAID) and DDF
> > (Disk Data Format) meta-data types. Both of these
> > formats are understood natively by certain vendor
> > BIOSes meaning that arrays can be booted from transparently.
>
> yep, needed
>
> For those who don't know, DDF is particularly interesting. A storage
> industry association, "SNIA", has gotten most of the software and
> hardware RAID folks to agree on a common, vendor-neutral on-disk format.
> Pretty historic, IMO :) Since this will be appearing on most of the
> future RAID hardware, Linux users will be left out in a big way if this
> isn't supported.
>
> EARLY DRAFT spec for DDF was posted on snia.org at
> http://www.snia.org/tech_activities/ddftwg/DDFTrial-UseDraft_0_45.pdf
>
>
> > o Improve the ability of MD to auto-configure arrays.
>
> hmmmm. Maybe in my language this means "improve ability for low-level
> drivers to communicate RAID support to upper layers"?
>

No, this is full auto-configuration support at boot-time, and when
drives are hot-added. I think that your comment applies to the next
item, and yes, you are correct.

>
> > o Support multi-level arrays transparently yet allow
> > proper event notification across levels when the
> > topology is known to MD.
>
> I'll need to see the code to understand what this means, much less
> whether it is needed ;-)
>
>
> > o Create a more generic "work item" framework which is
> > used to support array initialization, rebuild, and
> > verify operations as well as miscellaneous tasks that
> > a meta-data or RAID personality may need to perform
> > from a thread context (e.g. spare activation where
> > meta-data records may need to be sequenced carefully).
>
> This is interesting. (guessing) sort of like a pluggable finite state
> machine?
>

More or less, yes. We needed a way to bridge the gap from an error
being reported in an interrupt context to being able to allocate memory
and do blocking I/O from a thread context. The md_error() interface
already existed to do this, but was way too primitive for our needs. It
had no way to handle cascading or compound events.
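
As a rough picture of the gap (invented names, not the actual EMD code):
the completion handler runs in interrupt context where you can't sleep,
so all it can do is record the event and poke the daemon; the daemon
thread then does the allocations and the blocking meta-data I/O.

#include <linux/wait.h>
#include <asm/bitops.h>

struct emd_array {
	unsigned long		flags;
	wait_queue_head_t	daemon_wait;
};
#define EMD_MEMBER_FAILED	0

static void emd_member_error(struct emd_array *array)	/* interrupt context */
{
	set_bit(EMD_MEMBER_FAILED, &array->flags);
	wake_up(&array->daemon_wait);		/* no allocation, no sleeping */
}

static void emd_daemon(struct emd_array *array)		/* thread context */
{
	wait_event(array->daemon_wait,
		   test_bit(EMD_MEMBER_FAILED, &array->flags));
	/* now it is safe to kmalloc(GFP_KERNEL), activate a spare, and
	 * issue carefully sequenced meta-data writes */
}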

>
> > o Modify the MD ioctl interface to allow the creation
> > of management utilities that are meta-data format
> > agnostic.
>
> I'm thinking that for 2.6, it is much better to use a more tightly
> defined interface via a Linux character driver. Userland write(2)'s
> packets of data (h/w raid commands or software raid configuration
> commands), and read(2)'s the responses.
>
> ioctl's are a pain for 32->64-bit translation layers. Using a
> read/write interface allows one to create an interface that requires no
> translation layer -- a big deal for AMD64 and IA32e processors moving
> forward -- and it also gives one a lot more control over the interface.
>

I'm not exactly sure what the difference is here. Both the ioctl and
read/write paths copy data in and out of the kernel. The ioctl
method is a little bit easier since you don't have to stream in a chunk
of data before knowing what to do with it. And I also don't see how
read/write protects you from endian and 64/32-bit issues better than
ioctl. If you write your code cleanly and correctly, it's a moot point.
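
To make the 32/64-bit point concrete (hypothetical structs, not from
either patch): the real difference is whether the fields are explicitly
sized, and that discipline applies to an ioctl argument just as well as
to a write(2) packet.

#include <linux/types.h>

struct raid_cmd_sloppy {
	unsigned long	array_size;	/* 4 bytes on i386, 8 on x86-64   */
	void		*result_buf;	/* ditto -- needs a compat layer  */
};

struct raid_cmd_fixed {
	__u64		array_size;	/* one layout for every userland  */
	__u64		result_buf;	/* user pointer carried as a u64  */
};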

> See, we need what I described _anyway_, as a chrdev-based interface to
> sending and receiving ATA taskfiles or SCSI cdb's.
>
> It would be IMO simple to extend this to a looks-a-lot-like-ioctl
> raid_op interface.
>
>
> > A snapshot of this work is now available here:
> >
> > http://people.freebsd.org/~gibbs/linux/SRC/emd-0.7.0-tar.gz
>
> Your email didn't say... this appears to be for 2.6, correct?
>
>
> > This snapshot includes support for RAID0, RAID1, and the Adaptec
> > ASR and DDF meta-data formats. Additional RAID personalities and
> > support for the Super90 and Super 1 meta-data formats will be added
> > in the coming weeks, the end goal being to provide a superset of
> > the functionality in the current MD.
>
> groovy
>
>
> > Since the current MD notification scheme does not allow MD to receive
> > notifications unless it is statically compiled into the kernel, we
> > would like to work with the community to develop a more generic
> > notification scheme to which modules, such as MD, can dynamically
> > register. Until that occurs, these EMD snapshots will require at
> > least md.c to be a static component of the kernel.
>
> You would just need a small stub that holds a notifier pointer, yes?
>

I think that we are flexible on this. We have an implementation from
several years ago that records partition type information and passes it
around in the notification message so that consumers can register for
distinct types of disks/partitions/etc. Our needs aren't that complex,
but we would be happy to share it anyways since it is useful.

>
> > Architectural Notes
> > ===================
> > The major areas of change in "EMD" can be categorized into:
> >
> > 1) "Object Oriented" Data structure changes
> >
> > These changes are the basis for allowing RAID personalities
> > to transparently operate on "disks" or "arrays" as member
> > objects. While it has always been possible to create
> > multi-level arrays in MD using block layer stacking, our
> > approach allows MD to also stack internally. Once a given
> > RAID or meta-data personality is converted to the new
> > structures, this "feature" comes at no cost. The benefit
> > to stacking internally, which requires a meta-data format
> > that supports this, is that array state can propagate up
> > and down the topology without the loss of information
> > inherent in using the block layer to traverse levels of an
> > array.
>
> I have a feeling that consensus will prefer that we fix the block layer,
> and then figure out the best way to support "automatic stacking" --
> since DDF and presumably other RAID formats will require automatic
> setup of raid0+1, etc.
>
> Are there RAID-specific issues here, that do not apply to e.g.
> multipathing, which I've heard needs more information at the block layer?
>

No, the issue is, how do you propagate events through the block layer?
EIO/EINVAL/etc error codes just don't cut it. Also, many metadata
formats are unified, in that even though the arrays are stacked, the
metadata sees the entire picture. Updates might need to touch every
disk in the compound array, not just a certain sub-array.

The stacking that we do internal to MD is still fairly clean and doesn't
prevent one from stacking outside of MD.
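
As a sketch of what stacking internally buys (names invented, not the
actual EMD structures): a member is either a bare disk or another array
and keeps a pointer to its containing array, so a failure can walk the
topology directly instead of being squeezed into an -EIO coming back up
through the block layer.

struct emd_array;
struct block_device;

struct emd_member {
	enum { EMD_MEMBER_DISK, EMD_MEMBER_ARRAY } type;
	struct emd_array	*parent;	/* array containing this member */
	struct emd_member	*parent_slot;	/* member describing "parent"   */
						/* in its own parent, or NULL   */
	union {
		struct block_device	*bdev;	/* leaf disk        */
		struct emd_array	*array;	/* nested sub-array */
	} u;
};

extern void emd_array_member_failed(struct emd_array *a, struct emd_member *m);

/* a leaf failure walks straight up the topology with full context */
static void emd_member_failed(struct emd_member *m)
{
	while (m != NULL) {
		emd_array_member_failed(m->parent, m);
		m = m->parent_slot;
	}
}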

>
> > 2) Opcode based interfaces.
> >
> > Rather than add additional method vectors to either the
> > RAID personality or meta-data personality objects, the new
> > code uses only a few methods that are parameterized. This
> > has allowed us to create a fairly rich interface between
> > the core and the personalities without overly bloating
> > personality "classes".
>
> Modulo what I said above, about the chrdev userland interface, we want
> to avoid this. You're already going down the wrong road by creating
> more untyped interfaces...
>
> static int raid0_raidop(mdk_member_t *member, int op, void *arg)
> {
> switch (op) {
> case MDK_RAID_OP_MSTATE_CHANGED:
>
> The preferred model is to create a single marshalling module (a la
> net/core/ethtool.c) that converts the ioctls we must support into a
> fully typed function call interface (a la struct ethtool_ops).
>

These ops don't exist solely for the userland app. They also exist for
communicating between the raid transform and metadata modules.

>
> > 3) WorkItems
> >
> > Workitems provide a generic framework for queuing work to
> > a thread context. Workitems include a "control" method as
> > well as a "handler" method. This separation allows, for
> > example, a RAID personality to use the generic sync handler
> > while trapping the "open", "close", and "free" of any sync
> > workitems. Since both handlers can be tailored to the
> > individual workitem that is queued, this removes the need
> > to overload one or more interfaces in the personalities.
> > It also means that any code in MD can make use of this
> > framework - it is not tied to particular objects or modules
> > in the system.
>
> Makes sense, though I wonder if we'll want to make this more generic.
> hardware RAID drivers might want to use this sort of stuff internally?
>

If you want to make it into a more generic kernel service, that's fine.
However, I'm not quite sure what kind of work items a hardware raid
driver will need. The whole point there is to hide what's going on ;-)

>
> > 4) "Syncable Volume" Support
> >
> > All of the transaction accounting necessary to support
> > redundant arrays has been abstracted out into a few inline
> > functions. With the inclusion of a "sync support" structure
> > in a RAID personality's private data structure area and the
> > use of these functions, the generic sync framework is fully
> > available. The sync algorithm is also now more like that
> > in 2.4.X - with some updates to improve performance. Two
> > contiguous sync ranges are employed so that sync I/O can
> > be pending while the lock range is extended and new sync
> > I/O is stalled waiting for normal I/O writes that might
> > conflict with the new range complete. The syncer updates
> > its stats more frequently than in the past so that it can
> > more quickly react to changes in the normal I/O load. Syncer
> > backoff is also disabled anytime there is pending I/O blocked
> > on the syncer's locked region. RAID personalities have
> > full control over the size of the sync windows used so that
> > they can be optimized based on RAID layout policy.
>
> interesting. makes sense on the surface, I'll have to think some more...
>
>
> > 5) IOCTL Interface
> >
> > "EMD" now performs all of its configuration via an "mdctl"
> > character device. Since one of our goals is to remove any
> > knowledge of meta-data type in the user control programs,
> > initial meta-data stamping and configuration validation
> > occurs in the kernel. In general, the meta-data modules
> > already need this validation code in order to support
> > auto-configuration, so adding this capability adds little
> > to the overall size of EMD. It does, however, require a
> > few additional ioctls to support things like querying the
> > maximum "coerced" size of a disk targeted for a new array,
> > or enumerating the names of installed meta-data modules,
> > etc.
> >
> > This area of EMD is still in very active development and we expect
> > to provide a drop of an "emdadm" utility later this week.
>
> I haven't evaluated yet the ioctl interface. I do understand the need
> to play alongside the existing md interface, but if there are huge
> numbers of additions, it would be preferred to just use the chrdev
> straightaway. Such a chrdev would be easily portable to 2.4.x kernels
> too :)
>
>
> > 7) Correction of RAID0 Transform
> >
> > The RAID0 transform's "merge function" assumes that the
> > incoming bio's starting sector is the same as what will be
> > presented to its make_request function. In the case of a
> > partitioned MD device, the starting sector is shifted by
> > the partition offset for the target offset. Unfortunately,
> > the merge functions are not notified of the partition
> > transform, so RAID0 would often reject requests that span
> > "chunk" boundaries once shifted. The fix employed here is
> > to determine if a partition transform will occur and take
> > this into account in the merge function.
>
> interesting
>
>
> > Adaptec is currently validating EMD through formal testing while
> > continuing the build-out of new features. Our hope is to gather
> > feedback from the Linux community and adjust our approach to satisfy
> > the community's requirements. We look forward to your comments,
> > suggestions, and review of this project.
>
> Thanks much for working with the Linux community.
>
> One overall comment on merging into 2.6: the patch will need to be
> broken up into pieces. It's OK if each piece is dependent on the prior
> one, and it's OK if there are 20, 30, even 100 pieces. It helps a lot
> for review to see the evolution, and it also helps flush out problems
> you might not have even noticed. e.g.
> - add concept of member, and related helper functions
> - use member functions/structs in raid drivers raid0.c, etc.
> - fix raid0 transform
> - add ioctls needed in order for DDF to be useful
> - add DDF format
> etc.
>

We can provide our Perforce changelogs (just like we do for SCSI).

Scott

2004-03-17 21:35:57

by Jeff Garzik

Subject: Re: "Enhanced" MD code avaible for review

Scott Long wrote:
> Jeff Garzik wrote:
>> Modulo what I said above, about the chrdev userland interface, we want
>> to avoid this. You're already going down the wrong road by creating
>> more untyped interfaces...
>>
>> static int raid0_raidop(mdk_member_t *member, int op, void *arg)
>> {
>> switch (op) {
>> case MDK_RAID_OP_MSTATE_CHANGED:
>>
>> The preferred model is to create a single marshalling module (a la
>> net/core/ethtool.c) that converts the ioctls we must support into a
>> fully typed function call interface (a la struct ethtool_ops).
>>
>
> These ops don't exist solely for the userland app. They also exist for
> communicating between the raid transform and metadata modules.

Nod -- kernel internal calls should _especially_ be type-explicit, not
typeless ioctl-like APIs.


>> One overall comment on merging into 2.6: the patch will need to be
>> broken up into pieces. It's OK if each piece is dependent on the prior
>> one, and it's OK if there are 20, 30, even 100 pieces. It helps a lot
>> for review to see the evolution, and it also helps flush out problems
>> you might not have even noticed. e.g.
>> - add concept of member, and related helper functions
>> - use member functions/structs in raid drivers raid0.c, etc.
>> - fix raid0 transform
>> - add ioctls needed in order for DDF to be useful
>> - add DDF format
>> etc.
>>
>
> We can provide our Perforce changelogs (just like we do for SCSI).

What I'm saying is, emd needs to be submitted to the kernel just like
Neil Brown submits patches to Andrew, etc. This is how everybody else
submits and maintains Linux kernel code. There needs to be N patches,
one patch per email, that successively introduces new code, or modifies
existing code.

Absent of all other issues, one huge patch that completely updates md
isn't going to be acceptable, no matter how nifty or well-tested it is...

Jeff



Subject: Re: "Enhanced" MD code avaible for review

On Wednesday 17 of March 2004 22:18, Scott Long wrote:
> Jeff Garzik wrote:
> > Justin T. Gibbs wrote:
> > > [ I tried sending this last night from my Adaptec email address and
> > > have yet to see it on the list. Sorry if this is dup for any of you.
> > > ]
> >
> > Included linux-kernel in the CC (and also bounced this post there).
> >
> > > For the past few months, Adaptec Inc, has been working to enhance MD.
> >
> > The FAQ from several corners is going to be "why not DM?", so I would
> > humbly request that you (or Scott Long) re-post some of that rationale
> > here...

This is #1 question so... why not DM? 8)

Regards,
Bartlomiej

2004-03-18 00:24:31

by Scott Long

Subject: Re: "Enhanced" MD code avaible for review

Bartlomiej Zolnierkiewicz wrote:
> On Wednesday 17 of March 2004 22:18, Scott Long wrote:
> > Jeff Garzik wrote:
> > > Justin T. Gibbs wrote:
> > > > [ I tried sending this last night from my Adaptec email address and
> > > > have yet to see it on the list. Sorry if this is dup for any of
> you.
> > > > ]
> > >
> > > Included linux-kernel in the CC (and also bounced this post there).
> > >
> > > > For the past few months, Adaptec Inc, has been working to
> enhance MD.
> > >
> > > The FAQ from several corners is going to be "why not DM?", so I would
> > > humbly request that you (or Scott Long) re-post some of that rationale
> > > here...
>
> This is #1 question so... why not DM? 8)
>
> Regards,
> Bartlomiej
>


The primary feature of any RAID implementation is reliability.
Reliability is a surprisingly hard goal. Making sure that your
data is available and trustworthy under real-world scenarios is
a lot harder than it sounds. This has been a significant focus
of ours on MD, and is the primary reason why we chose MD as the
foundation of our work.

Storage is the foundation of everything that you do with your
computer. It needs to work regardless of what happened to your
filesystem on the last crash, regardless of whether or not you
have the latest initrd tools, regardless of what rpms you've kept
up to date on, regardless if your userland works, regardless of
what libc you are using this week, etc.

With DM, what happens when your initrd gets accidentally corrupted?
What happens when the kernel and userland pieces get out of sync?
Maybe you are booting off of a single drive and only using DM arrays
for secondary storage, but maybe you're not. If something goes wrong
with DM, how do you boot?

Secondly, our target here is to interoperate with hardware components
that run outside the scope of Linux. The HostRAID or DDF BIOS is
going to create an array using its own format. It's not going to
have any knowledge of DM config files, initrd, ramfs, etc. However,
the end user is still going to expect to be able to seamlessly install
onto that newly created array, maybe move that array to another system,
whatever, and have it all Just Work. Has anyone heard of a hardware
RAID card that requires you to run OS-specific commands in order to
access the arrays on it? Of course not. The point here is to make
software raid just as easy to the end user.

The third, and arguably most important issue is the need for reliable
error recovery. With the DM model, error recovery would be done in
userland. Errors generated during I/O would be kicked to a userland
app that would then drive the recovery-spare activation-rebuild
sequence. That's fine, but what if something happens that prevents
the userland tool from running? Maybe it was a daemon that became
idle and got swapped out to disk, but now you can't swap it back in
because your I/O is failing. Or maybe it needs to activate a helper
module or read a config file, but again it can't because i/o is
failing. What if it crashes. What if the source code gets out of sync
with the kernel interface. What if you upgrade glibc and it stops
working for whatever unknown reason.

Some have suggested in the past that these userland tools get put into
ramfs and locked into memory. If you do that, then it might as well be
part of the kernel anyways. It's consuming the same memory, if not
more, than the equivalent code in the kernel (likely a lot more since
you'd have to static link it). And you still have the downsides of it
possibly getting out of date with the kernel. So what are the upsides?

MD is not terribly heavy-weight. As a monolithic module of
DDF+ASR+R0+R1 it's about 65k in size. That's 1/2 the size of your
average SCSI driver these days, and no one is advocating putting those
into userland. It just doesn't make sense to sacrifice reliability
for the phantom goal of 'reducing kernel bloat'.

Scott

Subject: Re: "Enhanced" MD code avaible for review

On Thursday 18 of March 2004 01:23, Scott Long wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > On Wednesday 17 of March 2004 22:18, Scott Long wrote:
> > > Jeff Garzik wrote:
> > > > Justin T. Gibbs wrote:
> > > > > [ I tried sending this last night from my Adaptec email address
> > > > > and have yet to see it on the list. Sorry if this is dup for any
> > > > > of
> >
> > you.
> >
> > > > > ]
> > > >
> > > > Included linux-kernel in the CC (and also bounced this post there).
> > > >
> > > > > For the past few months, Adaptec Inc, has been working to
> >
> > enhance MD.
> >
> > > > The FAQ from several corners is going to be "why not DM?", so I
> > > > would humbly request that you (or Scott Long) re-post some of that
> > > > rationale here...
> >
> > This is #1 question so... why not DM? 8)
> >
> > Regards,
> > Bartlomiej
>
> The primary feature of any RAID implementation is reliability.
> Reliability is a surprisingly hard goal. Making sure that your
> data is available and trustworthy under real-world scenarios is
> a lot harder than it sounds. This has been a significant focus
> of ours on MD, and is the primary reason why we chose MD as the
> foundation of our work.

Okay.

> Storage is the foundation of everything that you do with your
> computer. It needs to work regardless of what happened to your
> filesystem on the last crash, regardless of whether or not you
> have the latest initrd tools, regardless of what rpms you've kept
> up to date on, regardless if your userland works, regardless of
> what libc you are using this week, etc.

I'm thinking about initrd+klibc, not rpms+libc;
DM sits below the fs, so an fs crash is not a problem here.

> With DM, what happens when your initrd gets accidentally corrupted?

The same as what happens when your kernel image gets corrupted;
the probability is similar.

> What happens when the kernel and userland pieces get out of sync?

The same as what happens when your kernel driver gets out of sync.

> Maybe you are booting off of a single drive and only using DM arrays
> for secondary storage, but maybe you're not. If something goes wrong
> with DM, how do you boot?

The same as what happens when "something" goes wrong with the kernel.

> Secondly, our target here is to interoperate with hardware components
> that run outside the scope of Linux. The HostRAID or DDF BIOS is
> going to create an array using its own format. It's not going to
> have any knowledge of DM config files, initrd, ramfs, etc. However,

It doesn't need any knowledge of config files, initrd, ramfs etc.

> the end user is still going to expect to be able to seamlessly install
> onto that newly created array, maybe move that array to another system,
> whatever, and have it all Just Work. Has anyone heard of a hardware
> RAID card that requires you to run OS-specific commands in order to
> access the arrays on it? Of course not. The point here is to make
> software raid just as easy to the end user.

It won't require the user to run any commands.

RAID card gets detected and initialized -> hotplug event happens ->
user-land configuration tools executed etc.

> The third, and arguably most important issue is the need for reliable
> error recovery. With the DM model, error recovery would be done in
> userland. Errors generated during I/O would be kicked to a userland
> app that would then drive the recovery-spare activation-rebuild
> sequence. That's fine, but what if something happens that prevents
> the userland tool from running? Maybe it was a daemon that became
> idle and got swapped out to disk, but now you can't swap it back in
> because your I/O is failing. Or maybe it needs to activate a helper
> module or read a config file, but again it can't because i/o is

I see valid points here but ramfs can be used etc.

> failing. What if it crashes. What if the source code gets out of sync
> with the kernel interface. What if you upgrade glibc and it stops
> working for whatever unknown reason.

glibc is not needed/recommended here.

> Some have suggested in the past that these userland tools get put into
> ramfs and locked into memory. If you do that, then it might as well be
> part of the kernel anyways. It's consuming the same memory, if not
> more, than the equivalent code in the kernel (likely a lot more since
> you'd have to static link it). And you still have the downsides of it
> possibly getting out of date with the kernel. So what are the upsides?

Faster/easier development - user-space apps don't OOPS. :-)
Somebody other than the kernel people has to update user-land. :-)

> MD is not terribly heavy-weight. As a monolithic module of
> DDF+ASR+R0+R1 it's about 65k in size. That's 1/2 the size of your
> average SCSI driver these days, and no one is advocating putting those

A SCSI driver is low-level stuff - it needs direct hardware access.

Even 65k is still bloat - think about a vendor kernel including support
for all possible RAID flavors. If they are modular, they require an initrd,
so they may as well be put in user-land.

> into userland. It just doesn't make sense to sacrifice reliability
> for the phantom goal of 'reducing kernel bloat'.

ATARAID drivers are just moving in this direction...
ASR+DDF will also follow this way... sooner or later...

Regards,
Bartlomiej

2004-03-18 01:56:57

by Al Viro

Subject: Re: "Enhanced" MD code avaible for review

On Wed, Mar 17, 2004 at 02:18:01PM -0700, Scott Long wrote:
> >One overall comment on merging into 2.6: the patch will need to be
> >broken up into pieces. It's OK if each piece is dependent on the prior
> >one, and it's OK if there are 20, 30, even 100 pieces. It helps a lot
> >for review to see the evolution, and it also helps flush out problems
> >you might not have even noticed. e.g.
> > - add concept of member, and related helper functions
> > - use member functions/structs in raid drivers raid0.c, etc.
> > - fix raid0 transform
> > - add ioctls needed in order for DDF to be useful
> > - add DDF format
> > etc.
> >
>
> We can provide our Perforce changelogs (just like we do for SCSI).

TA: "you must submit a solution, not just an answer"
CALC101 student: "but I've checked the answer, it's OK"
TA: "I'm sorry, it's not enough"
<student hands a pile of paper covered with snippets of text and calculations>
Student: "All right, here are all notes I've made while solving the problem.
Happy now?"
TA: <exasperated sigh> "Not really"

2004-03-18 06:39:17

by Stefan Smietanowski

Subject: Re: "Enhanced" MD code avaible for review

Hi.

<snip beginning of discussion about DDF, etc>

> With DM, what happens when your initrd gets accidentally corrupted?
> What happens when the kernel and userland pieces get out of sync?
> Maybe you are booting off of a single drive and only using DM arrays
> for secondary storage, but maybe you're not. If something goes wrong
> with DM, how do you boot?

Tell me something... Do you guys release a driver for WinXP as an
example? You don't have to answer that really as it's obvious that
you do. Do you in the installation program recompile the windows
kernel so that your driver is monolithic? The answer is presumably
no - that's not how it's done there.

Ok. Your example states "what if initrd gets corrupted" and my example
is "what if your driver file(s) get corrupted?", and my example
is equally important for a module in linux as it is for a driver in windows.

Now, since you do supply a windows driver and that driver is NOT
statically linked to the windows kernel why is it that you believe
a meta driver (which MD really is in a sense) needs special treatment
(static linking into the kernel) when for instance a driver for a piece
of hardware doesn't? If you have disk corruption so severe that your
initrd is corrupted, I would seriously suggest NOT booting that OS
that's on that drive regardless of anything else and sticking it
in another box OR booting from rescue media of some sort.

// Stefan

2004-03-20 13:07:32

by Arjan van de Ven

Subject: Re: "Enhanced" MD code avaible for review


> With DM, what happens when your initrd gets accidentally corrupted?

What happens if your vmlinuz accidentally gets corrupted? If your initrd
is toast the module for your root fs doesn't load either. Duh.

> What happens when the kernel and userland pieces get out of sync?
> Maybe you are booting off of a single drive and only using DM arrays
> for secondary storage, but maybe you're not. If something goes wrong
> with DM, how do you boot?

If you lose 10 disks out of your raid array, how do you boot?

>
> Secondly, our target here is to interoperate with hardware components
> that run outside the scope of Linux. The HostRAID or DDF BIOS is
> going to create an array using it's own format. It's not going to
> have any knowledge of DM config files,

DM doesn't need/use config files.
> initrd, ramfs, etc. However,
> the end user is still going to expect to be able to seamlessly install
> onto that newly created array, maybe move that array to another system,
> whatever, and have it all Just Work. Has anyone heard of a hardware
> RAID card that requires you to run OS-specific commands in order to
> access the arrays on it? Of course not. The point here is to make
> software raid just as easy to the end user.

And that is an easy task for distribution makers (or actually the people
who make the initrd creation software).

I'm sorry, I'm not buying your arguments and consider this 100% the wrong
direction. I'm hoping that someone with a bit more time than me will
write the DDF device mapper target so that I can use it for my
kernels... ;)



2004-03-21 23:44:18

by Scott Long

Subject: Re: "Enhanced" MD code avaible for review

Arjan van de Ven wrote:
>>With DM, what happens when your initrd gets accidentally corrupted?
>
>
> What happens if your vmlinuz accidentally gets corrupted? If your initrd
> is toast the module for your root fs doesn't load either. Duh.

The point here is to minimize points of failure.

>
>
>>What happens when the kernel and userland pieces get out of sync?
>>Maybe you are booting off of a single drive and only using DM arrays
>>for secondary storage, but maybe you're not. If something goes wrong
>>with DM, how do you boot?
>
>
> If you lose 10 disks out of your raid array, how do you boot?

That's a silly statement and has nothing to do with the argument.

>
>
>>Secondly, our target here is to interoperate with hardware components
>>that run outside the scope of Linux. The HostRAID or DDF BIOS is
>>going to create an array using its own format. It's not going to
>>have any knowledge of DM config files,
>
>
> DM doesn't need/use config files.
>
>>initrd, ramfs, etc. However,
>>the end user is still going to expect to be able to seamlessly install
>>onto that newly created array, maybe move that array to another system,
>>whatever, and have it all Just Work. Has anyone heard of a hardware
>>RAID card that requires you to run OS-specific commands in order to
>>access the arrays on it? Of course not. The point here is to make
>>software raid just as easy to the end user.
>
>
> And that is an easy task for distribution makers (or actually the people
> who make the initrd creation software).
>
> I'm sorry, I'm not buying your arguments and consider this 100% the wrong
> direction. I'm hoping that someone with a bit more time than me will
> write the DDF device mapper target so that I can use it for my
> kernels... ;)
>

Well, code speaks louder than words, as this group loves to say. I
eagerly await your code. Barring that, I eagerly await a technical
argument, rather than an emotional "you're wrong because I'm right"
argument.

Scott

2004-03-22 09:06:05

by Arjan van de Ven

Subject: Re: "Enhanced" MD code avaible for review

On Mon, 2004-03-22 at 00:42, Scott Long wrote:

> Well, code speaks louder than words, as this group loves to say. I
> eagerly await your code. Barring that, I eagerly await a technical
> argument, rather than an emotional "you're wrong because I'm right"
> argument.

I think that all the arguments for using DM are technical arguments, not
emotional ones. Oh well.. you're free to write your code, I'm free to not
use it in my kernels ;)



2004-03-22 22:01:30

by Scott Long

Subject: Re: "Enhanced" MD code avaible for review

Arjan van de Ven wrote:
> On Mon, 2004-03-22 at 00:42, Scott Long wrote:
>
>
>>Well, code speaks louder than words, as this group loves to say. I
>>eagerly await your code. Barring that, I eagerly await a technical
>>argument, rather than an emotional "you're wrong because I'm right"
>>argument.
>
>
> I think that all the arguments for using DM are technical arguments, not
> emotional ones. Oh well.. you're free to write your code, I'm free to not
> use it in my kernels ;)

Ok, the technical argument I've heard in favor of the DM approach is
that it reduces kernel bloat. That's fair, and I certainly agree with not
putting the kitchen sink into the kernel. Our position on EMD is that
it's a special case because you want to reduce the number of failure
modes, and that it doesn't contribute in a significant way to the kernel
size. Your response to that is that our arguments don't matter since your
mind is already made up. That's the barrier I'm trying to break through
and have a technical discussion on.

Scott

2004-03-22 22:21:15

by Lars Marowsky-Bree

Subject: Re: "Enhanced" MD code avaible for review

On 2004-03-22T14:59:29,
Scott Long <[email protected]> said:

> Ok, the technical argument I've heard in favor of the DM approach is
> that it reduces kernel bloat. That's fair, and I certainly agree with not
> putting the kitchen sink into the kernel. Our position on EMD is that
> it's a special case because you want to reduce the number of failure
> modes, and that it doesn't contribute in a significant way to the kernel
> size. Your response to that is that our arguments don't matter since your
> mind is already made up. That's the barrier I'm trying to break through
> and have a technical discussion on.

The problematic point is that the failure modes which you want to
protect against all basically amount to -EUSERTOOSTUPID (if he forgot to
update the initrd and thus basically missed a vital part of the kernel
update), or -EFUBAR (in which case even the kernel image itself won't
help you). In those cases, not even being linked into the kernel helps
you any.

All of these cases are well understood, and have been problematic in the
past already, and will fuck the user up whether he has EMD enabled or
not. That EMD is coming up is not going to help him much, because he
won't be able to mount the root filesystem w/o the filesystem module,
or without the LVM2/EVMS2 stuff etc. initrd has long been mostly
mandatory already for such scenarios.

This is the way how the kernel has been developing for a while. Your
patch does something different, and the reasons you give are not
convincing.

In particular, if EMD is going to be stacked with other stuff (ie, EMD
RAID1 on top of multipath or whatever), having the autodiscovery in the
kernel is actually cumbersome. And yes, right now you have only one
format. But bet on it, the spec will change, vendors will not 100%
adhere to it, new formats will be supported by the same code etc, and
thus the discovery logic will become bigger. Having such complexity
outside the kernel is good, and it's also not time-critical, because it
is only done once.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett

2004-03-23 06:48:59

by Arjan van de Ven

Subject: Re: "Enhanced" MD code avaible for review

On Mon, Mar 22, 2004 at 02:59:29PM -0700, Scott Long wrote:
> >I think that all the arguments for using DM are techinical arguments not
> >emotional ones. oh well.. you're free to write your code I'm free to not
> >use it in my kernels ;)
>
> Ok, the technical argument I've heard in favor of the DM approach is
> that it reduces kernel bloat. That's fair, and I certainly agree with not
> putting the kitchen sink into the kernel. Our position on EMD is that
> it's a special case because you want to reduce the number of failure
> modes, and that it doesn't contribute in a significant way to the kernel
> size.

There are several dozen formats like DDF; should those be put in too?
And then the next step is built-in multipathing or stacking or .. or ....
And pretty soon you're back at the EVMS 1.0 situation. I see the general
kernel direction being to move such autodetection to early userland (there's
a reason DM and not EVMS 1.0 is in the kernel; afaics even the EVMS guys now
agree that this was the right move); EMD is a step in the opposite direction.

