2006-01-17 06:56:15

by NeilBrown

[permalink] [raw]
Subject: [PATCH 000 of 5] md: Introduction


Greetings.

In line with the principle of "release early", following are 5 patches
against md in 2.6.latest which implement reshaping of a raid5 array.
By this I mean adding 1 or more drives to the array and then re-laying
out all of the data.

This is still EXPERIMENTAL and could easily eat your data. Don't use it on
valuable data. Only use it for review and testing.

This release does not make ANY attempt to record how far the reshape
has progressed on stable storage. That means that if the process is
interrupted either by a crash or by "mdadm -S", then you completely
lose your data. All of it.
So don't use it on valuable data.

There are 5 patches to (hopefully) ease review. Comments are most
welcome, as are test results (providing they aren't done on valuable data:-).

You will need to enable the experimental MD_RAID5_RESHAPE config option
for this to work. Please read the help message that comes with it.
It gives an example mdadm command to effect a reshape (you do not need
a new mdadm; any vaguely recent version should work).
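
For the curious, the whole thing is driven with ordinary mdadm commands; a
rough sketch, assuming an existing 3-drive /dev/md0 and a hypothetical new
partition /dev/sdd1 (the config option's help text has the authoritative
example):

    # add the new device as a spare, then ask md to restripe onto it
    mdadm /dev/md0 --add /dev/sdd1
    mdadm --grow /dev/md0 --raid-devices=4

    # the reshape progress shows up in /proc/mdstat
    cat /proc/mdstat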

This code is based in part on earlier work by
"Steinar H. Gunderson" <[email protected]>
Though little of his code remains, having access to it, and having
discussed the issues with him greatly eased the process of creating
these patches. Thanks Steinar.

NeilBrown


[PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow.
[PATCH 002 of 5] md: Allow stripes to be expanded in preparation for expanding an array.
[PATCH 003 of 5] md: Infrastructure to allow normal IO to continue while array is expanding.
[PATCH 004 of 5] md: Core of raid5 resize process
[PATCH 005 of 5] md: Final stages of raid5 expand code.


2006-01-17 08:17:21

by Michael Tokarev

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

NeilBrown wrote:
> Greetings.
>
> In line with the principle of "release early", following are 5 patches
> against md in 2.6.latest which implement reshaping of a raid5 array.
> By this I mean adding 1 or more drives to the array and then re-laying
> out all of the data.

Neil, is this online resizing/reshaping really needed? I understand
all those words mean a lot to marketing people - zero downtime,
online resizing etc, but it is much safer and easier to do that stuff
'offline', on an inactive array, like raidreconf does - safer, easier,
faster, and one has more possibilities for more complex changes. It
isn't like you want to add/remove drives to/from your arrays every day...
A lot of good hw raid cards are unable to perform such reshaping too.

/mjt

2006-01-17 09:50:14

by Sander

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Michael Tokarev wrote (ao):
> NeilBrown wrote:
> > Greetings.
> >
> > In line with the principle of "release early", following are 5
> > patches against md in 2.6.latest which implement reshaping of a
> > raid5 array. By this I mean adding 1 or more drives to the array and
> > then re-laying out all of the data.
>
> Neil, is this online resizing/reshaping really needed? I understand
> all those words mean a lot to marketing people - zero downtime,
> online resizing etc, but it is much safer and easier to do that stuff
> 'offline', on an inactive array, like raidreconf does - safer, easier,
> faster, and one has more possibilities for more complex changes. It
> isn't like you want to add/remove drives to/from your arrays every
> day... A lot of good hw raid cards are unable to perform such reshaping
> too.

I like the feature. Not only marketing prefers zero downtime you know :-)

Actually, I don't understand why you bother at all. One writes the
feature. Another uses it. How would this feature harm you?

Kind regards, Sander

--
Humilis IT Services and Solutions
http://www.humilis.net

2006-01-17 11:26:16

by Michael Tokarev

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Sander wrote:
> Michael Tokarev wrote (ao):
[]
>>Neil, is this online resizing/reshaping really needed? I understand
>>all those words mean a lot to marketing people - zero downtime,
>>online resizing etc, but it is much safer and easier to do that stuff
>>'offline', on an inactive array, like raidreconf does - safer, easier,
>>faster, and one has more possibilities for more complex changes. It
>>isn't like you want to add/remove drives to/from your arrays every
>>day... A lot of good hw raid cards are unable to perform such reshaping
>>too.
[]
> Actually, I don't understand why you bother at all. One writes the
> feature. Another uses it. How would this feature harm you?

This is about code complexity/bloat. It's already complex enough.
I rely on the stability of the linux softraid subsystem, and want
it to be reliable. Adding more features, especially non-trivial
ones, does not buy you a bug-free raid subsystem, just the opposite:
it will have more chances to crash, to eat your data etc., and will
be harder to find and fix bugs in.

Raid code is already too fragile; I'm afraid "simple" I/O errors
(which is what we need raid for) may crash the system already, and
I am waiting for the next whole-system crash due to e.g. a superblock
update error or whatnot. I saw all sorts of failures due to
linux softraid already (we use it here a lot), including ones
which required a complete array rebuild with heavy data loss.

Any "unnecessary bloat" (note the quotes: I understand some
people like this and other features) makes the whole system even
more fragile than it is already.

Compare this with my statement about "offline" "reshaper" above:
separate userspace (easier to write/debug compared with kernel
space) program which operates on an inactive array (no locking
needed, no need to worry about other I/O operations going to the
array at the time of reshaping etc.), with an ability to plan its
I/O strategy in a far more efficient and safer way... Yes, this
approach has one downside: the array has to be inactive. But in
my opinion it's worth it, compared to more possibilities to lose
your data, even if you do NOT use that feature at all...

/mjt

2006-01-17 14:04:25

by Kyle Moffett

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Jan 17, 2006, at 06:26, Michael Tokarev wrote:
> This is about code complexity/bloat. It's already complex enough.
> I rely on the stability of the linux softraid subsystem, and want
> it to be reliable. Adding more features, especially non-trivial
> ones, does not buy you a bug-free raid subsystem, just the opposite:
> it will have more chances to crash, to eat your data etc., and will
> be harder to find and fix bugs in.

What part of: "You will need to enable the experimental
MD_RAID5_RESHAPE config option for this to work." isn't obvious? If
you don't want this feature, either don't turn on
CONFIG_MD_RAID5_RESHAPE, or don't use the raid5 mdadm reshaping
command. This feature might be extremely useful for some people
(including me on occasion), but I would not trust it even on my
family's fileserver (let alone a corporate one) until it's been
through several generations of testing and bugfixing.


Cheers,
Kyle Moffett

--
There is no way to make Linux robust with unreliable memory
subsystems, sorry. It would be like trying to make a human more
robust with an unreliable O2 supply. Memory just has to work.
-- Andi Kleen


2006-01-17 14:10:26

by Steinar H. Gunderson

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Tue, Jan 17, 2006 at 11:17:15AM +0300, Michael Tokarev wrote:
> Neil, is this online resizing/reshaping really needed? I understand
> all those words mean a lot to marketing people - zero downtime,
> online resizing etc, but it is much safer and easier to do that stuff
> 'offline', on an inactive array, like raidreconf does - safer, easier,
> faster, and one has more possibilities for more complex changes.

Try the scenario where the resize takes a week, and you don't have enough
spare disks to move it onto another server -- besides, that would take
several days alone... This is the kind of use-case for which I wrote the
original patch, and I'm grateful that Neil has picked it up again so we can
finally get something working in.

/* Steinar */
--
Homepage: http://www.sesse.net/

2006-01-17 16:08:31

by ross

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Tue, Jan 17, 2006 at 02:26:11PM +0300, Michael Tokarev wrote:
> Raid code is already too fragile; I'm afraid "simple" I/O errors
> (which is what we need raid for) may crash the system already, and
> I am waiting for the next whole-system crash due to e.g. a superblock
> update error or whatnot.

I think you've got some other issue if simple I/O errors cause issues.
I've managed hundreds of MD arrays over the past ~ten years. MD is
rock solid. I'd guess that I've recovered at least a hundred disk failures
where data was saved by mdadm.

What is your setup like? It's also possible that you've found a bug.

> I saw all sorts of failures due to
> linux softraid already (we use it here a lot), including ones
> which required a complete array rebuild with heavy data loss.

Are you sure? The one thing that's not always intuitive about MD - a
failed array often still has your data and you can recover it. Unlike
hardware RAID solutions, you have a lot of control over how the disks
are assembled and used - this can be a major advantage.

I'd say once a week someone comes on the linux-raid list and says "Oh no!
I accidentally ruined my RAID array!". Neil almost always responds "Well,
don't do that! But since you did, this might help...".

--
Ross Vandegrift
[email protected]

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

2006-01-17 18:12:11

by Michael Tokarev

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Ross Vandegrift wrote:
> On Tue, Jan 17, 2006 at 02:26:11PM +0300, Michael Tokarev wrote:
>
>>Raid code is already too fragile; I'm afraid "simple" I/O errors
>>(which is what we need raid for) may crash the system already, and
>>I am waiting for the next whole-system crash due to e.g. a superblock
>>update error or whatnot.
>
> I think you've got some other issue if simple I/O errors cause issues.
> I've managed hundreds of MD arrays over the past ~ten years. MD is
> rock solid. I'd guess that I've recovered at least a hundred disk failures
> where data was saved by mdadm.
>
> What is your setup like? It's also possible that you've found a bug.

We've had about 500 systems with raid1, raid5 and raid10 running for about
5 or 6 years (since the 0.90 beta patched into the 2.2 kernel -- I don't think
linux softraid existed before that, or rather, I can't say it was something
which was possible to use in production).

Most problematic case so far, which I described numerous times (like,
"why linux raid isn't Raid really, why it can be worse than plain disk")
is when, after single sector read failure, md kicks the whole disk off
the array, and when you start resync (after replacing the "bad" drive or
just remapping that bad sector or even doing nothing, as it will be
remapped in almost all cases during write, on real drives anyway),
you find another "bad sector" on another drive. After this, the array
can't be started anymore, at least not w/o --force (i.e., it requires some
user intervention, which is sometimes quite difficult if the server
is several hundred miles away). Worse, it's quite difficult to recover
it even manually (after --force'ing it to start) without fixing that
bad sector somehow -- if the first drive failure is "recent enough" we have
a hope that this very sector can still be read from that first drive; if
a lot of filesystem activity has happened since that time, the chances
are quite small, and with raid5 it's quite difficult to say where the
error is in the filesystem, due to the complex layout of raid5.
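
(For the record, the manual intervention I mean is roughly the following;
the device names are only an illustration:

    # try to reassemble, accepting members with stale event counts
    mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

    # if one member really is dead, start the array degraded and
    # resync onto a fresh disk
    mdadm --assemble --force --run /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
    mdadm /dev/md0 --add /dev/sde1
)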

But this has been described here numerous times, and - hopefully -
with current changes (re-writing of bad blocks) this very issue will
go away, at least in its most common scenario (I'd try to keep even a
"bad" drive, even after some write errors, because it still contains
some data which can be read; but that's problematic to say the least,
because one has to store a list of bad blocks somewhere...).

(And no, I don't have all bad/cheap drives - it's just that when you have
hundreds or 1000s of drives, you have quite a high probability that some
of them will fail or develop a bad sector at some point, etc.)

>>I saw all sorts of failures due to
>>linux softraid already (we use it here a lot), including ones
>>which required a complete array rebuild with heavy data loss.
>
> Are you sure? The one thing that's not always intuitive about MD - a
> failed array often still has your data and you can recover it. Unlike
> hardware RAID solutions, you have a lot of control over how the disks
> are assembled and used - this can be a major advantage.
>
> I'd say once a week someone comes on the linux-raid list and says "Oh no!
> I accidentally ruined my RAID array!". Neil almost always responds "Well,
> don't do that! But since you did, this might help...".

I know that. And I've got quite some experience too, and I have studied
the mdadm source.

There were in fact two cases like that, not one.

The first was mostly due to operator error, or the lack of a better choice
in 2.2 (or early 2.4) times -- I relied on raid autodetection (which I
don't do anymore, and I strongly suggest others switch to mdassemble
or something like that). A drive failed (for real, not bad blocks)
and needed to be replaced, and I forgot to clear the partition table
on the replacement drive (which had been in our testing box) - as a result,
the kernel assembled a raid5 out of components which belonged to different
arrays. I only vaguely remember what happened at that time -- maybe
the kernel or I started reconstruction (not noticing the wrong array),
or I mounted the filesystem - I can't say for sure anymore, but the
result was that I wasn't able to restore the filesystem, because I
didn't have that filesystem anymore. (It should have been assembling the
boot raid1 array but assembled a degraded raid5 instead.)

And the second case was when, after an attempt to resync the array (after
that famous 'bad block kicked off the whole disk' scenario), which resulted
in an OOPS (which I didn't notice immediately, as it continued the resync),
it wrote garbage all over, resulting in a badly broken filesystem,
and a somewhat broken nearby partition too (which I was able to recover).
It was at about 2.4.19 or so, and I had that situation only once.
Granted, I can't blame the raid code for all this, because I don't even
know what was in the oops (the machine locked hard, but someone who was
near the server noticed it OOPSed) - it may well be a bug somewhere
else.

As a sort of conclusion.

There are several features that can be implemented in linux softraid
code to make it real Raid, with data safety as the goal. One example is
being able to replace a "to-be-failed" drive (think SMART failure
predictions, for example) with a (hot)spare (or just a replacement),
without removing it from the array beforehand -- by adding the new drive
to the array *first*, and removing the to-be-replaced one only after the
new one is fully synced. Another example is to implement some NVRAM-like
storage for metadata (this will require the necessary hardware as well,
e.g. a flash card -- I don't know how safe it can be). And so on.

The current MD code is "almost here", almost real. It still has some
(maybe minor) problems, it still lacks some (again maybe minor) features
wrt data safety. I.e., it can still fail, but it's almost here.

Meanwhile, current development is going to implement some new and non-trivial
features which are of little use in real life. Face it: yes, it's good
when you're able to reshape your array online while continuing to serve your
users, but I'd go for even 12 hours of downtime if I know my data is safe,
instead of unknown downtime after I realize the reshape failed for some
reason and I don't have my data anymore. And yes, it's very rarely used
(which adds to the problem - rarely used code paths have bugs which stay
unfound for a long time, and bite you at a very unexpected moment, when
you think it's all ok...)

Well, not all is that bad really. I really appreciate Neil's work; it's
all his baby after all, and I owe him a lot because of all our
machines which, thanks to the raid code, are running fine (most of them anyway).
I had a hopefully small question, whether the new features are really
useful, and just described my point of view on the topic... And answered
your questions as well, Ross... ;)

Thank you.

/mjt

2006-01-17 21:38:16

by Lincoln Dale

[permalink] [raw]
Subject: RE: [PATCH 000 of 5] md: Introduction

> Neil, is this online resizing/reshaping really needed? I understand
> all those words mean a lot to marketing people - zero downtime,
> online resizing etc, but it is much safer and easier to do that stuff
> 'offline', on an inactive array, like raidreconf does - safer, easier,
> faster, and one has more possibilities for more complex changes. It
> isn't like you want to add/remove drives to/from your arrays every
> day...
> A lot of good hw raid cards are unable to perform such reshaping too.

RAID resize/restripe may not be so common with cheap / PC-based RAID
systems, but it is common with midrange and enterprise storage
subsystems from vendors such as EMC, HDS, IBM & HP.
In fact, I'd say it's the exception to the rule _if_ a
midrange/enterprise storage subsystem doesn't have an _online_ resize
capability.

personally, I think this is useful functionality, but my personal
preference is that this would be in DM/LVM2 rather than MD. but given
Neil is the MD author/maintainer, I can see why he'd prefer to do it in
MD. :)


cheers,

lincoln.

2006-01-17 22:39:15

by Phillip Susi

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Michael Tokarev wrote:
<snip>
> Compare this with my statement about "offline" "reshaper" above:
> separate userspace (easier to write/debug compared with kernel
> space) program which operates on an inactive array (no locking
> needed, no need to worry about other I/O operations going to the
> array at the time of reshaping etc.), with an ability to plan its
> I/O strategy in a far more efficient and safer way... Yes, this
> approach has one downside: the array has to be inactive. But in
> my opinion it's worth it, compared to more possibilities to lose
> your data, even if you do NOT use that feature at all...
>
>
I also like the idea of this kind of thing going in user space. I was
also under the impression that md was going to be phased out and
replaced by the device mapper. I've been kicking around the idea of a
user space utility that manipulates the device mapper tables and
performs block moves itself to reshape a raid array. It doesn't seem
like it would be that difficult and would not require modifying the
kernel at all. The basic idea is something like this:

/dev/mapper/raid is your raid array, which is mapped to a stripe across
/dev/sda and /dev/sdb. When you want to expand the stripe to add /dev/sdc
to the array, you create three new devices:

/dev/mapper/raid-old: copy of the old mapper table, striping sda and sdb
/dev/mapper/raid-progress: linear map with size = new stripe width, and
pointing to raid-new
/dev/mapper/raid-new: what the raid will look like when done, i.e.
stripe of sda, sdb, and sdc

Then you replace /dev/mapper/raid with a linear map to raid-new,
raid-progress, and raid-old, in that order. Initially the lengths of the
chunks from raid-progress and raid-new are zero, so you will still be
entirely accessing raid-old. For each stripe in the array, you change
raid-progress to point to the corresponding blocks in raid-new, but
suspended, so IO to this stripe will block. Then you update the raid
map so raid-progress overlays the stripe you are working on to catch IO
instead of allowing it to go to raid-old. After you read that stripe
from raid-old and write it to raid-new, resume raid-progress to flush
any blocked writes to the raid-new stripe. Finally update raid so the
previously in progress stripe now maps to raid-new.

Repeat for each stripe in the array, and finally replace the raid table
with raid-new's table, and delete the 3 temporary devices.
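
To make the table juggling concrete, here is a rough sketch of one step,
with made-up device names and sizes (two 1048576-sector disks, 128-sector
chunks, growing onto /dev/sdc). It is only my reading of the scheme above,
not a finished tool:

    DONE=384   # sectors already restriped, a multiple of the new stripe width (384)

    # helper devices
    dmsetup create raid-old --table "0 2097152 striped 2 128 /dev/sda 0 /dev/sdb 0"
    dmsetup create raid-new --table "0 3145728 striped 3 128 /dev/sda 0 /dev/sdb 0 /dev/sdc 0"
    dmsetup create raid-progress --table "0 384 linear /dev/mapper/raid-new 0"

    # per stripe: point raid-progress at the stripe being moved, keep it suspended
    dmsetup suspend raid-progress
    dmsetup reload raid-progress --table "0 384 linear /dev/mapper/raid-new $DONE"

    # swap in a new top-level map: finished part -> raid-new, stripe in
    # flight -> raid-progress (IO to it queues), the rest -> raid-old
    printf '%s\n' \
      "0 $DONE linear /dev/mapper/raid-new 0" \
      "$DONE 384 linear /dev/mapper/raid-progress 0" \
      "$((DONE + 384)) $((2097152 - DONE - 384)) linear /dev/mapper/raid-old $((DONE + 384))" \
      | dmsetup reload raid
    dmsetup suspend raid
    dmsetup resume raid

    # copy the stripe, then release any writes queued against raid-progress
    dd if=/dev/mapper/raid-old of=/dev/mapper/raid-new bs=512 skip=$DONE seek=$DONE count=384
    dmsetup resume raid-progress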


Adding transaction logging to the user mode utility wouldn't be very
hard either.


2006-01-17 22:57:23

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Tuesday January 17, [email protected] wrote:
> I was
> also under the impression that md was going to be phased out and
> replaced by the device mapper.

I wonder where this sort of idea comes from....

Obviously individual distributions are free to support or not support
whatever bits of code they like. And developers are free to add
duplicate functionality to the kernel (I believe someone is working on
a raid5 target for dm). But that doesn't mean that anything is going
to be 'phased out'.

md and dm, while similar, are quite different. They can both
comfortably co-exist even if they have similar functionality.
What I expect will happen (in line with what normally happens in
Linux) is that both will continue to evolve as long as there is
interest and developer support. They will quite possibly borrow ideas
from each other where that is relevant. Parts of one may lose
support and eventually die (as md/multipath is on the way to doing)
but there is no wholesale 'phasing out' going to happen in either
direction.

NeilBrown

2006-01-18 08:14:09

by Sander

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Michael Tokarev wrote (ao):
> Most problematic case so far, which I described numerous times (like,
> "why linux raid isn't Raid really, why it can be worse than plain
> disk") is when, after single sector read failure, md kicks the whole
> disk off the array, and when you start resync (after replacing the
> "bad" drive or just remapping that bad sector or even doing nothing,
> as it will be remapped in almost all cases during write, on real
> drives anyway),

If the (harddisk internal) remap succeeded, the OS doesn't see the bad
sector at all I believe.

If you (the OS) do see a bad sector, the disk couldn't remap, and goes
downhill from there, right?

Sander

--
Humilis IT Services and Solutions
http://www.humilis.net

2006-01-18 09:05:44

by Alan

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Mer, 2006-01-18 at 09:14 +0100, Sander wrote:
> If the (harddisk internal) remap succeeded, the OS doesn't see the bad
> sector at all I believe.

True for ATA; in the SCSI case you may be told about the remap having
occurred, but it's a "by the way" type message, not an error proper.

> If you (the OS) do see a bad sector, the disk couldn't remap, and goes
> downhill from there, right?

If a hot spare is configured it will be dropped into the configuration
at that point.

2006-01-18 13:28:08

by Jan Engelhardt

[permalink] [raw]
Subject: RE: [PATCH 000 of 5] md: Introduction


>personally, I think this is useful functionality, but my personal
>preference is that this would be in DM/LVM2 rather than MD. but given
>Neil is the MD author/maintainer, I can see why he'd prefer to do it in
>MD. :)

Why don't MD and DM merge some bits?



Jan Engelhardt
--

2006-01-18 23:19:37

by NeilBrown

[permalink] [raw]
Subject: RE: [PATCH 000 of 5] md: Introduction

On Wednesday January 18, [email protected] wrote:
>
> >personally, I think this is useful functionality, but my personal
> >preference is that this would be in DM/LVM2 rather than MD. but given
> >Neil is the MD author/maintainer, I can see why he'd prefer to do it in
> >MD. :)
>
> Why don't MD and DM merge some bits?
>

Which bits?
Why?

My current opinion is that you should:

Use md for raid1, raid5, raid6 - anything with redundancy.
Use dm for multipath, crypto, linear, LVM, snapshot
Use either for raid0 (I don't think dm has particular advantages
over md, or md over dm).

These can be mixed together quite effectively:
You can have dm/lvm over md/raid1 over dm/multipath
with no problems.
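
For instance, a minimal sketch of that particular stack with purely
illustrative device names (the two multipath devices are assumed to
already exist as /dev/mapper/mpath0 and /dev/mapper/mpath1):

    # md/raid1 on top of two dm/multipath devices
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          /dev/mapper/mpath0 /dev/mapper/mpath1

    # dm/lvm on top of the raid1
    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate -L 10G -n data vg0
    mkfs -t ext3 /dev/vg0/data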

If there is functionality missing from any of these recommended
components, then make a noise about it, preferably but not necessarily
with code, and it will quite possibly be fixed.

NeilBrown

2006-01-19 00:22:41

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Tuesday January 17, [email protected] wrote:
>
> As a sort of conclusion.
>
> There are several features that can be implemented in linux softraid
> code to make it real Raid, with data safety as the goal. One example is
> being able to replace a "to-be-failed" drive (think SMART failure
> predictions, for example) with a (hot)spare (or just a replacement),
> without removing it from the array beforehand -- by adding the new drive
> to the array *first*, and removing the to-be-replaced one only after the
> new one is fully synced. Another example is to implement some NVRAM-like
> storage for metadata (this will require the necessary hardware as well,
> e.g. a flash card -- I don't know how safe it can be). And so on.

proactive replacement before complete failure is a good idea and is
(just recently) on my todo list. It shouldn't be too hard.

>
> The current MD code is "almost here", almost real. It still has some
> (maybe minor) problems, it still lacks some (again maybe minor) features
> wrt data safety. I.e., it can still fail, but it's almost here.

concrete suggestions are always welcome (though sometimes you might
have to put some effort into convincing me...)

>
> Meanwhile, current development is going to implement some new and non-trivial
> features which are of little use in real life. Face it: yes, it's good
> when you're able to reshape your array online while continuing to serve your
> users, but I'd go for even 12 hours of downtime if I know my data is safe,
> instead of unknown downtime after I realize the reshape failed for some
> reason and I don't have my data anymore. And yes, it's very rarely used
> (which adds to the problem - rarely used code paths have bugs which stay
> unfound for a long time, and bite you at a very unexpected moment, when
> you think it's all ok...)

If you look at the amount of code in the 'reshape raid5' patch you
will notice that it isn't really very much. It reuses a lot of the
infrastructure that is already present in md/raid5. So a reshape
actually uses a lot of code that is used very often.

Compare this to an offline solution (raidreconf) where all the code
is only used occasionally. You could argue that the online version
has more code safety than the offline version....

NeilBrown

2006-01-19 00:28:13

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Tuesday January 17, [email protected] wrote:
> On Jan 17, 2006, at 06:26, Michael Tokarev wrote:
> > This is about code complexity/bloat. It's already complex enough.
> > I rely on the stability of the linux softraid subsystem, and want
> > it to be reliable. Adding more features, especially non-trivial
> > ones, does not buy you a bug-free raid subsystem, just the opposite:
> > it will have more chances to crash, to eat your data etc., and will
> > be harder to find and fix bugs in.
>
> What part of: "You will need to enable the experimental
> MD_RAID5_RESHAPE config option for this to work." isn't obvious? If
> you don't want this feature, either don't turn on
> CONFIG_MD_RAID5_RESHAPE, or don't use the raid5 mdadm reshaping
> command.

This isn't really a fair comment. CONFIG_MD_RAID5_RESHAPE just
enables the code. All the code is included whether this config option
is set or not. So if code-bloat were an issue, the config option
wouldn't answer it.

NeilBrown

2006-01-19 09:01:16

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Thu, Jan 19, 2006 at 11:22:31AM +1100, Neil Brown wrote:
...
> Compare this to an offline solution (raidreconf) where all the code
> is only used occasionally. You could argue that the online version
> has more code safety than the offline version....

Correct.

raidreconf, however, can convert a 2 disk RAID-0 to a 4 disk RAID-5 for
example - the whole design of raidreconf is fundamentally different (of
course) from the on-line reshape. The on-line reshape can be (and
should be) much simpler.
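
From memory, the invocation was along these lines, with two raidtab files
describing the layout before and after (treat the exact flags as
approximate and check the documentation of whatever raidreconf you have):

    # old raidtab describes the 2-disk raid0, new raidtab the 4-disk raid5
    raidreconf -o /etc/raidtab.old -n /etc/raidtab.new -m /dev/md0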

Now, back when I wrote raidreconf, my thoughts were that md would be
merged into dm, and that raidreconf should evolve into something like
'pvmove' - a user-space tool that moves blocks around, interfacing with
the kernel only as much as strictly necessary, allowing hot reconfiguration
of RAID setups.

That was the idea.

Reality, however, seems to be that MD is not moving quickly into DM (for
whatever reasons). Also, I haven't had the time to actually just move on
this myself. Today, raidreconf is used by some, but it is not
maintained, and it is often too slow for comfortable off-line usage
(reconfiguration of TB sized arrays is slow - not so much because of
raidreconf, but because there simply is a lot of data that needs to be
moved around).

I still think that putting MD into DM and extending pvmove to include
raidreconf functionality would be the way to go. The final solution
should also be tolerant (like pvmove is today) of power cycles during
reconfiguration - the operation should be re-startable.

Anyway - this is just me dreaming - I don't have time to do this and it
seems that currently no one else has either.

Great initiative with the reshape, Neil - hot reconfiguration is much
needed. Personally I still hope to see MD move into DM and pvmove gain
raidreconf functionality, but I guess that when we're eating
an elephant we should be satisfied with taking one bite at a time :)

--

/ jakob

2006-01-19 15:34:12

by Mark Hahn

[permalink] [raw]
Subject: RE: [PATCH 000 of 5] md: Introduction

> Use either for raid0 (I don't think dm has particular advantages
> over md, or md over dm).

I measured this a few months ago, and was surprised to find that
DM raid0 was very noticeably slower than MD raid0. Same machine,
same disks/controller/kernel/settings/stripe-size. I didn't try
to find out why, since I usually need redundancy...

regards, mark hahn.

2006-01-19 20:12:28

by Jan Engelhardt

[permalink] [raw]
Subject: RE: [PATCH 000 of 5] md: Introduction

>> >personally, I think this is useful functionality, but my personal
>> >preference is that this would be in DM/LVM2 rather than MD. but given
>> >Neil is the MD author/maintainer, I can see why he'd prefer to do it in
>> >MD. :)
>>
>> Why don't MD and DM merge some bits?
>
>Which bits?
>Why?
>
>My current opinion is that you should:
>
> Use md for raid1, raid5, raid6 - anything with redundancy.
> Use dm for multipath, crypto, linear, LVM, snapshot

There are pairs of files that look like they would do the same thing:

raid1.c <-> dm-raid1.c
linear.c <-> dm-linear.c



Jan Engelhardt
--

2006-01-19 21:23:23

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On 2006-01-19T21:12:02, Jan Engelhardt <[email protected]> wrote:

> > Use md for raid1, raid5, raid6 - anything with redundancy.
> > Use dm for multipath, crypto, linear, LVM, snapshot
> There are pairs of files that look like they would do the same thing:
>
> raid1.c <-> dm-raid1.c
> linear.c <-> dm-linear.c

Sure there's some historical overlap. It'd make sense if DM used the md
raid personalities, yes.


Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

2006-01-19 22:17:57

by Phillip Susi

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

I'm currently of the opinion that dm needs a raid5 and raid6 module
added, then the user land lvm tools fixed to use them, and then you
could use dm instead of md. The benefit being that dm pushes things
like volume autodetection and management out of the kernel to user space
where it belongs. But that's just my opinion...


I'm using dm at home because I have a sata hardware fakeraid raid-0
between two WD 10,000 rpm raptors, and the dmraid utility correctly
recognizes that and configures device mapper to use it.


Neil Brown wrote:
> Which bits?
> Why?
>
> My current opinion is that you should:
>
> Use md for raid1, raid5, raid6 - anything with redundancy.
> Use dm for multipath, crypto, linear, LVM, snapshot
> Use either for raid0 (I don't think dm has particular advantages
> over md, or md over dm).
>
> These can be mixed together quite effectively:
> You can have dm/lvm over md/raid1 over dm/multipath
> with no problems.
>
> If there is functionality missing from any of these recommended
> components, then make a noise about it, preferably but not necessarily
> with code, and it will quite possibly be fixed.

2006-01-19 22:33:09

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Thursday January 19, [email protected] wrote:
> I'm currently of the opinion that dm needs a raid5 and raid6 module
> added, then the user land lvm tools fixed to use them, and then you
> could use dm instead of md. The benefit being that dm pushes things
> like volume autodetection and management out of the kernel to user space
> where it belongs. But that's just my opinion...

The in-kernel autodetection in md is purely legacy support as far as I
am concerned. md does volume detection in user space via 'mdadm'.

What other "things like" were you thinking of.

NeilBrown

2006-01-19 23:27:11

by Phillip Susi

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Neil Brown wrote:
>
> The in-kernel autodetection in md is purely legacy support as far as I
> am concerned. md does volume detection in user space via 'mdadm'.
>
> What other "things like" were you thinking of.
>

Oh, I suppose that's true. Well, another thing is your new mods to
support on the fly reshaping, which dm could do from user space. Then
of course, there's multipath and snapshots and other lvm things which
you need dm for, so why use both when one will do? That's my take on it.


2006-01-19 23:43:28

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Thursday January 19, [email protected] wrote:
> Neil Brown wrote:
> >
> > The in-kernel autodetection in md is purely legacy support as far as I
> > am concerned. md does volume detection in user space via 'mdadm'.
> >
> > What other "things like" were you thinking of.
> >
>
> Oh, I suppose that's true. Well, another thing is your new mods to
> support on the fly reshaping, which dm could do from user space. Then
> of course, there's multipath and snapshots and other lvm things which
> you need dm for, so why use both when one will do? That's my take on it.

Maybe the problem here is thinking of md and dm as different things.
Try just not thinking of them at all.
Think about it like this:
The linux kernel supports lvm
The linux kernel supports multipath
The linux kernel supports snapshots
The linux kernel supports raid0
The linux kernel supports raid1
The linux kernel supports raid5

Use the bits that you want, and not the bits that you don't.

dm and md are just two different interface styles to various bits of
this. Neither is clearly better than the other, partly because
different people have different tastes.

Maybe what you really want is for all of these functions to be managed
under the one umbrella application. I think that is what EVMS tried to
do.

One big selling point that 'dm' has is 'dmraid' - a tool that allows
you to use a lot of 'fakeraid' cards. People would like dmraid to
work with raid5 as well, and that is a good goal.
However it doesn't mean that dm needs to get its own raid5
implementation or that md/raid5 needs to be merged with dm.
It can be achieved by giving md/raid5 the right interfaces so that
metadata can be managed from userspace (and I am nearly there).
Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
raid levels and 'md' interfaces for others.

NeilBrown

2006-01-20 02:17:34

by Phillip Susi

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Neil Brown wrote:

>Maybe the problem here is thinking of md and dm as different things.
>Try just not thinking of them at all.
>Think about it like this:
> The linux kernel supports lvm
> The linux kernel supports multipath
> The linux kernel supports snapshots
> The linux kernel supports raid0
> The linux kernel supports raid1
> The linux kernel supports raid5
>
>Use the bits that you want, and not the bits that you don't.
>
>dm and md are just two different interface styles to various bits of
>this. Neither is clearly better than the other, partly because
>different people have different tastes.
>
>Maybe what you really want is for all of these functions to be managed
>under the one umbrella application. I think that is what EVMS tried to
>do.
>
>
>

I am under the impression that dm is simpler/cleaner than md. That
impression very well may be wrong, but if it is simpler, then that's a
good thing.


>One big selling point that 'dm' has is 'dmraid' - a tool that allows
>you to use a lot of 'fakeraid' cards. People would like dmraid to
>work with raid5 as well, and that is a good goal.
>
>

AFAIK, the hardware fakeraid solutions on the market don't support raid5
anyhow ( at least mine doesn't ), so dmraid won't either.

>However it doesn't mean that dm needs to get its own raid5
>implementation or that md/raid5 needs to be merged with dm.
>It can be achieved by giving md/raid5 the right interfaces so that
>metadata can be managed from userspace (and I am nearly there).
>Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
>raid levels and 'md' interfaces for others.
>

Having two sets of interfaces and retrofitting a new interface onto a
system that wasn't designed for it seems likely to bloat the kernel with
complex code. I don't really know if that is the case because I have
not studied the code, but that's the impression I get, and if it's
right, then I'd say it is better to stick with dm rather than retrofit
md. In either case, it seems overly complex to have to deal with both.


2006-01-20 10:54:17

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On 2006-01-19T21:17:12, Phillip Susi <[email protected]> wrote:

> I am under the impression that dm is simpler/cleaner than md. That
> impression very well may be wrong, but if it is simpler, then that's a
> good thing.

That impression is wrong in that general form. Both have advantages and
disadvantages.

I've been an advocate of seeing both of them merged, mostly because I
think it would be beneficial if they'd share the same interface to
user-space to make the tools easier to write and maintain.

However, rewriting the RAID personalities for DM is a thing only a fool
would do without really good cause. Sure, everybody can write a
RAID5/RAID6 parity algorithm. But getting the failure/edge cases stable
is not trivial and requires years of maturing.

Which is why I think gentle evolution of both source bases towards some
common API (for example) is much preferable to reinventing one within
the other.

Oversimplifying to "dm is better than md" is just stupid.



Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

2006-01-20 12:04:51

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Fri, Jan 20 2006, Lars Marowsky-Bree wrote:
> Oversimplifying to "dm is better than md" is just stupid.

Indeed. But "generally" md is faster and more efficient in the way it
handles I/Os; it doesn't do any splitting unless it has to.

--
Jens Axboe

2006-01-20 15:28:45

by Hubert Tonneau

[permalink] [raw]
Subject: RE: [PATCH 000 of 5] md: Introduction

Neil Brown wrote:
>
> These can be mixed together quite effectively:
> You can have dm/lvm over md/raid1 over dm/multipath
> with no problems.
>
> If there is functionality missing from any of these recommended
> components, then make a noise about it, preferably but not necessarily
> with code, and it will quite possibly be fixed.

Cheapest high capacity is now provided through USB-connected external disks.
Of course, it's for very low load.

So, what would be helpful is, let's say, to have 7 useful disks plus 1 for
parity (just like RAID4), but with the result being not one large partition
but seven partitions, one on each disk.

So, in case of one disk failure, you lose no data,
in case of two disk failures, you lose one of the seven partitions,
in case of three disk failures, you lose two of the seven partitions,
etc., because even if the RAID4 is unusable, you can still read each
partition as a non-raid partition.

Somebody suggested that it could be done through LVM, but I failed to find
a way to configure LVM on top of RAID4 or RAID5 that guarantees that each
partition's sectors are all consecutive on a single physical disk.

2006-01-20 15:41:00

by Hubert Tonneau

[permalink] [raw]
Subject: RE: [PATCH 000 of 5] md: Introduction

Neil Brown wrote:
>
> These can be mixed together quite effectively:
> You can have dm/lvm over md/raid1 over dm/multipath
> with no problems.
>
> If there is functionality missing from any of these recommended
> components, then make a noise about it, preferably but not necessarily
> with code, and it will quite possibly be fixed.

Also, it's not Neil's direct problem, but since we are at it, the weakest
point of Linux MD is currently that ...
there is no production-quality U320 SCSI driver for Linux to run MD over!

In the U160 category, the symbios driver passed all possible stress tests
(partly bad drives that require the driver to properly reset and restart),
but in the U320 category, neither the Fusion nor the AIC79xx did.

2006-01-20 16:15:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Fri, Jan 20, 2006 at 05:01:06PM +0000, Hubert Tonneau wrote:
> In the U160 category, the symbios driver passed all possible stress tests
> (partly bad drives that require the driver to properly reset and restart),
> but in the U320 category, neither the Fusion not the AIC79xx did.

Please report any fusion problems to Eric Moore at LSI, the Adaptec driver
must unfortunately be considered unmaintained.

2006-01-20 16:45:19

by Hubert Tonneau

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Christoph Hellwig wrote:
>
> Please report any fusion problems to Eric Moore at LSI, the Adaptec driver
> must unfortunately be considered unmaintained.

Done several times over more than two years, but new versions did not solve
the problem, even if in 2.6.13 it forwards the problem to MD instead of just
locking the bus.
Also, production quality is something really hard to verify,
since even on production servers it takes several months to trigger some
very rare situations, and on the production server I've finally replaced the
partially faulty disk, so the problem may well never happen again on
that box.
So the problem is probably still there in the driver, maybe not, but I have
no way to validate new drivers.

To be more precise, the last 2.4.xx kernels have a production-quality fusion
driver; only the 2.6.xx driver has problems.

The last point is that if you look at the latest changes in the fusion driver,
they are moving everything around to introduce SAS, so after more than a
year of unsuccessful reports about a single bug that happens on a production
server, you can understand that my willingness to run tests with potential
production consequences has vanished.

The fusion maintainer is responsive and did his best, but could not achieve
the result, so he may not have received enough help from general kernel
maintainers, or the kernel or the fusion driver might be getting too
complicated. I'll stop here because I do not want to start flames. Just a report.

2006-01-20 17:29:34

by ross

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Fri, Jan 20, 2006 at 10:43:13AM +1100, Neil Brown wrote:
> dm and md are just two different interface styles to various bits of
> this. Neither is clearly better than the other, partly because
> different people have different tastes.

Here's why it's great to have both: they have different toolkits. I'm
really familiar with md's toolkit. I can do most anything I need.
But I'll bet that I've never gotten a pvmove to finish successfully
because I am doing something wrong and I don't know it.

Because we're talking about data integrity, the toolkit issue alone
makes it worth keeping both code paths. md does 90% of what I need,
so why should I spend the time to learn a new system that doesn't
offer any advantages?

[1] I'm intentionally neglecting the 4k stack issue

--
Ross Vandegrift
[email protected]

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

2006-01-20 18:37:24

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Fri, Jan 20, 2006 at 10:43:13AM +1100, Neil Brown wrote:
> On Thursday January 19, [email protected] wrote:
> > Neil Brown wrote:
> > >
> > > The in-kernel autodetection in md is purely legacy support as far as I
> > > am concerned. md does volume detection in user space via 'mdadm'.
> > >
> > > What other "things like" were you thinking of.
> > >
> >
> > Oh, I suppose that's true. Well, another thing is your new mods to
> > support on the fly reshaping, which dm could do from user space. Then
> > of course, there's multipath and snapshots and other lvm things which
> > you need dm for, so why use both when one will do? That's my take on it.
>
> Maybe the problem here is thinking of md and dm as different things.
> Try just not thinking of them at all.
> Think about it like this:
> The linux kernel supports lvm
> The linux kernel supports multipath
> The linux kernel supports snapshots
> The linux kernel supports raid0
> The linux kernel supports raid1
> The linux kernel supports raid5
>
> Use the bits that you want, and not the bits that you don't.
>
> dm and md are just two different interface styles to various bits of
> this. Neither is clearly better than the other, partly because
> different people have different tastes.
>
> Maybe what you really want is for all of these functions to be managed
> under the one umbrella application. I think that is what EVMS tried to
> do.
>
> One big selling point that 'dm' has is 'dmraid' - a tool that allows
> you to use a lot of 'fakeraid' cards. People would like dmraid to
> work with raid5 as well, and that is a good goal.
> However it doesn't mean that dm needs to get its own raid5
> implementation or that md/raid5 needs to be merged with dm.

That's a valid point to make but it can ;)

> It can be achieved by giving md/raid5 the right interfaces so that
> metadata can be managed from userspace (and I am nearly there).

Yeah, and I'm nearly there to have a RAID4 and RAID5 target for dm
(which took advantage of the raid address calculation and the bio to
stripe cache copy code of md raid5).

See http://people.redhat.com/heinzm/sw/dm/dm-raid45/dm-raid45_2.6.15_200601201914.patch.bz2 (no Makefile / no Kconfig changes) for early code reference.

> Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
> raid levels and 'md' interfaces for others.

Yes, that's possible, but there are recommendations to have a native target
for dm to do RAID5, so I started to implement it.

>
> NeilBrown

--

Regards,
Heinz -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-20 18:39:24

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Fri, Jan 20, 2006 at 11:53:06AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-19T21:17:12, Phillip Susi <[email protected]> wrote:
>
> > I am under the impression that dm is simpler/cleaner than md. That
> > impression very well may be wrong, but if it is simpler, then that's a
> > good thing.
>
> That impression is wrong in that general form. Both have advantages and
> disadvantages.
>
> I've been an advocate of seeing both of them merged, mostly because I
> think it would be beneficial if they'd share the same interface to
> user-space to make the tools easier to write and maintain.
>
> However, rewriting the RAID personalities for DM is a thing only a fool
> would do without really good cause.

Thanks Lars ;)

> Sure, everybody can write a
> RAID5/RAID6 parity algorithm. But getting the failure/edge cases stable
> is not trivial and requires years of maturing.
>
> Which is why I think gentle evolution of both source bases towards some
> common API (for example) is much preferable to reinventing one within
> the other.
>
> Oversimplifying to "dm is better than md" is just stupid.
>
>
>
> Sincerely,
> Lars Marowsky-Brée
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"
>

Regards,
Heinz -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-20 18:41:42

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Thu, Jan 19, 2006 at 09:17:12PM -0500, Phillip Susi wrote:
> Neil Brown wrote:
>
> >Maybe the problem here is thinking of md and dm as different things.
> >Try just not thinking of them at all.
> >Think about it like this:
> > The linux kernel supports lvm
> > The linux kernel supports multipath
> > The linux kernel supports snapshots
> > The linux kernel supports raid0
> > The linux kernel supports raid1
> > The linux kernel supports raid5
> >
> >Use the bits that you want, and not the bits that you don't.
> >
> >dm and md are just two different interface styles to various bits of
> >this. Neither is clearly better than the other, partly because
> >different people have different tastes.
> >
> >Maybe what you really want is for all of these functions to be managed
> >under the one umbrella application. I think that is what EVMS tried to
> >do.
> >
> >
> >
>
> I am under the impression that dm is simpler/cleaner than md. That
> impression very well may be wrong, but if it is simpler, then that's a
> good thing.
>
>
> >One big selling point that 'dm' has is 'dmraid' - a tool that allows
> >you to use a lot of 'fakeraid' cards. People would like dmraid to
> >work with raid5 as well, and that is a good goal.
> >
> >
>
> AFAIK, the hardware fakeraid solutions on the market don't support raid5
> anyhow ( at least mine doesn't ), so dmraid won't either.

Well, some do (eg, Nvidia).

>
> >However it doesn't mean that dm needs to get its own raid5
> >implementation or that md/raid5 needs to be merged with dm.
> >It can be achieved by giving md/raid5 the right interfaces so that
> >metadata can be managed from userspace (and I am nearly there).
> >Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
> >raid levels and 'md' interfaces for others.
> >
>
> Having two sets of interfaces and retrofitting a new interface onto a
> system that wasn't designed for it seems likely to bloat the kernel with
> complex code. I don't really know if that is the case because I have
> not studied the code, but that's the impression I get, and if it's
> right, then I'd say it is better to stick with dm rather than retrofit
> md. In either case, it seems overly complex to have to deal with both.

I agree, but dm will need to mature before it'll be able to substitute for md.

>
>

--

Regards,
Heinz -- The LVM Guy --

*** Software bugs are stupid.
Nevertheless it needs not so stupid people to solve them ***

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-20 22:10:56

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On 2006-01-20T19:38:40, Heinz Mauelshagen <[email protected]> wrote:

> > However, rewriting the RAID personalities for DM is a thing only a fool
> > would do without really good cause.
>
> Thanks Lars ;)

Well, I assume you have a really good cause then, don't you? ;-)


Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

2006-01-20 22:58:23

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On 2006-01-20T19:36:21, Heinz Mauelshagen <[email protected]> wrote:

> > Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
> > raid levels and 'md' interfaces for others.
> Yes, that's possible, but there are recommendations to have a native target
> for dm to do RAID5, so I started to implement it.

Can you answer me what the recommendations are based on?

I understand wanting to manage both via the same framework, but
duplicating the code is just ... wrong.

What's gained by it? Why not provide a dm-md wrapper which could then
load/interface to all md personalities?


Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

2006-01-21 00:02:13

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Fri, Jan 20, 2006 at 11:57:24PM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-20T19:36:21, Heinz Mauelshagen <[email protected]> wrote:
>
> > > Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
> > > raid levels and 'md' interfaces for others.
> > Yes, that's possible, but there are recommendations to have a native target
> > for dm to do RAID5, so I started to implement it.
>
> Can you answer me what the recommendations are based on?

Partner requests.

>
> I understand wanting to manage both via the same framework, but
> duplicating the code is just ... wrong.
>
> What's gained by it?
>
> Why not provide a dm-md wrapper which could then
> load/interface to all md personalities?
>

As we want to enrich the mapping flexibility (i.e., multi-segment fine-grained
mappings) of dm by adding targets as we go, a certain degree of transitional
duplicate code is the price to gain that flexibility.

>
> Sincerely,
> Lars Marowsky-Brée
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

Warm regards,
Heinz -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-21 00:04:27

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On 2006-01-21T01:01:42, Heinz Mauelshagen <[email protected]> wrote:

> > Why not provide a dm-md wrapper which could then
> > load/interface to all md personalities?
> As we want to enrich the mapping flexibility (ie, multi-segment fine-grained
> mappings) of dm by adding targets as we go, a certain degree of transitional
> code duplication is the price of gaining that flexibility.

A dm-md wrapper would give you the same?


Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

2006-01-21 00:06:52

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Fri, Jan 20, 2006 at 11:09:51PM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-20T19:38:40, Heinz Mauelshagen <[email protected]> wrote:
>
> > > However, rewriting the RAID personalities for DM is a thing only a fool
> > > would do without really good cause.
> >
> > Thanks Lars ;)
>
> Well, I assume you have a really good cause then, don't you? ;-)

Well, I'll share your assumption ;-)

>
>
> Sincerely,
> Lars Marowsky-Brée
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

--

Regards,
Heinz -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-21 00:08:26

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Sat, Jan 21, 2006 at 01:03:44AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-21T01:01:42, Heinz Mauelshagen <[email protected]> wrote:
>
> > > Why not provide a dm-md wrapper which could then
> > > load/interface to all md personalities?
> > As we want to enrich the mapping flexibility (ie, multi-segment fine-grained
> > mappings) of dm by adding targets as we go, a certain degree of transitional
> > code duplication is the price of gaining that flexibility.
>
> A dm-md wrapper would give you the same?

No, we'd need more complex stacking to achieve such mappings.
Think lvm2 and logical-volume-level raid5.

>
>
> Sincerely,
> Lars Marowsky-Brée
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

--

Regards,
Heinz -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-21 00:14:04

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On 2006-01-21T01:08:06, Heinz Mauelshagen <[email protected]> wrote:

> > A dm-md wrapper would give you the same?
> No, we'd need more complex stacking to achieve such mappings.
> Think lvm2 and logical-volume-level raid5.

How would you not get that if you had a wrapper around md which made it
into an dm personality/target?

Besides, stacking between dm devices so far (ie, if I look how kpartx
does it, or LVM2 on top of MPIO etc, which works just fine) is via the
block device layer anyway - and nothing stops you from putting md on top
of LVM2 LVs either.

I use them regularly to play with md and other stuff...

So I remain unconvinced that code duplication is worth it for more than
"hark we want it so!" ;-)



2006-01-22 03:54:23

by Adam Kropelin

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

NeilBrown <[email protected]> wrote:
> In line with the principle of "release early", following are 5 patches
> against md in 2.6.latest which implement reshaping of a raid5 array.
> By this I mean adding 1 or more drives to the array and then re-laying
> out all of the data.

I've been looking forward to a feature like this, so I took the
opportunity to set up a vmware session and give the patches a try. I
encountered both success and failure, and here are the details of both.

On the first try I neglected to read the directions and increased the
number of devices first (which worked) and then attempted to add the
physical device (which didn't work; at least not the way I intended).
The result was an array of size 4, operating in degraded mode, with
three active drives and one spare. I was unable to find a way to coax
mdadm into adding the 4th drive as an active device instead of a
spare. I'm not an mdadm guru, so there may be a method I overlooked.
Here's what I did, interspersed with trimmed /proc/mdstat output:

mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc

md0 : active raid5 sda[0] sdc[2] sdb[1]
2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

mdadm --grow -n4 /dev/md0

md0 : active raid5 sda[0] sdc[2] sdb[1]
3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

mdadm --manage --add /dev/md0 /dev/sdd

md0 : active raid5 sdd[3](S) sda[0] sdc[2] sdb[1]
3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

mdadm --misc --stop /dev/md0
mdadm --assemble /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdd

md0 : active raid5 sdd[3](S) sda[0] sdc[2] sdb[1]
3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

For my second try I actually read the directions and things went much
better, aside from a possible /proc/mdstat glitch shown below.

mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc

md0 : active raid5 sda[0] sdc[2] sdb[1]
2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

mdadm --manage --add /dev/md0 /dev/sdd

md0 : active raid5 sdd[3](S) sdc[2] sdb[1] sda[0]
2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

mdadm --grow -n4 /dev/md0

md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
2097024 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
...should this be... --> [4/3] [UUU_] perhaps?
[>....................] recovery = 0.4% (5636/1048512) finish=9.1min speed=1878K/sec

[...time passes...]

md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
3145536 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

My final test was a repeat of #2, but with data actively being written
to the array during the reshape (the previous tests were on an idle,
unmounted array). This one failed pretty hard, with several processes
ending up in the D state. I repeated it twice and sysrq-t dumps can be
found at <http://www.kroptech.com/~adk0212/md-raid5-reshape-wedge.txt>.
The writeout load was a kernel tree untar started shortly before the
'mdadm --grow' command was given. mdadm hung, as did tar. Any process
which subsequently attempted to access the array hung as well. A second
attempt at the same thing hung similarly, although only pdflush shows up
hung in that trace. mdadm and tar are missing for some reason.

I'm happy to do more tests. It's easy to conjure up virtual disks and
load them with irrelevant data (like kernel trees ;)

--Adam

2006-01-22 06:45:49

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Fri, Jan 20, 2006 at 04:15:50PM +0000, Christoph Hellwig wrote:
> On Fri, Jan 20, 2006 at 05:01:06PM +0000, Hubert Tonneau wrote:
> > In the U160 category, the symbios driver passed all possible stress tests
> > (partly bad drives that require the driver to properly reset and restart),
> > but in the U320 category, neither the Fusion not the AIC79xx did.
>
> Please report any fusion problems to Eric Moore at LSI, the Adaptec
> driver must unfortunately be considered unmaintained.

wasn't Justin T. Gibbs maintaining this driver for
some time, and who is doing the drivers/updates
published on the adaptec site?

http://www.adaptec.com/worldwide/support/driversbycat.jsp?sess=no&language=English+US&cat=%2FOperating+System%2FLinux+Driver+Source+Code

best,
Herbert

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2006-01-22 22:53:03

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Saturday January 21, [email protected] wrote:
> NeilBrown <[email protected]> wrote:
> > In line with the principle of "release early", following are 5 patches
> > against md in 2.6.latest which implement reshaping of a raid5 array.
> > By this I mean adding 1 or more drives to the array and then re-laying
> > out all of the data.
>
> I've been looking forward to a feature like this, so I took the
> opportunity to set up a vmware session and give the patches a try. I
> encountered both success and failure, and here are the details of both.
>
> On the first try I neglected to read the directions and increased the
> number of devices first (which worked) and then attempted to add the
> physical device (which didn't work; at least not the way I intended).
> The result was an array of size 4, operating in degraded mode, with
> three active drives and one spare. I was unable to find a way to coax
> mdadm into adding the 4th drive as an active device instead of a
> spare. I'm not an mdadm guru, so there may be a method I overlooked.
> Here's what I did, interspersed with trimmed /proc/mdstat output:

Thanks, this is exactly the sort of feedback I was hoping for - people
testing things that I didn't think to...

>
> mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc
>
> md0 : active raid5 sda[0] sdc[2] sdb[1]
> 2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>
> mdadm --grow -n4 /dev/md0
>
> md0 : active raid5 sda[0] sdc[2] sdb[1]
> 3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

I assume that no "resync" started at this point? It should have done.

>
> mdadm --manage --add /dev/md0 /dev/sdd
>
> md0 : active raid5 sdd[3](S) sda[0] sdc[2] sdb[1]
> 3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
>
> mdadm --misc --stop /dev/md0
> mdadm --assemble /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdd
>
> md0 : active raid5 sdd[3](S) sda[0] sdc[2] sdb[1]
> 3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

This really should have started a recovery.... I'll look into that
too.


>
> For my second try I actually read the directions and things went much
> better, aside from a possible /proc/mdstat glitch shown below.
>
> mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc
>
> md0 : active raid5 sda[0] sdc[2] sdb[1]
> 2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>
> mdadm --manage --add /dev/md0 /dev/sdd
>
> md0 : active raid5 sdd[3](S) sdc[2] sdb[1] sda[0]
> 2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>
> mdadm --grow -n4 /dev/md0
>
> md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
> 2097024 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
> ...should this be... --> [4/3] [UUU_] perhaps?

Well, part of the array is "4/4 UUUU" and part is "3/3 UUU". How do
you represent that? I think "4/4 UUUU" is best.


> [>....................] recovery = 0.4% (5636/1048512) finish=9.1min speed=1878K/sec
>
> [...time passes...]
>
> md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
> 3145536 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>
> My final test was a repeat of #2, but with data actively being written
> to the array during the reshape (the previous tests were on an idle,
> unmounted array). This one failed pretty hard, with several processes
> ending up in the D state. I repeated it twice and sysrq-t dumps can be
> found at <http://www.kroptech.com/~adk0212/md-raid5-reshape-wedge.txt>.
> The writeout load was a kernel tree untar started shortly before the
> 'mdadm --grow' command was given. mdadm hung, as did tar. Any process
> which subsequently attempted to access the array hung as well. A second
> attempt at the same thing hung similarly, although only pdflush shows up
> hung in that trace. mdadm and tar are missing for some reason.

Hmmm... I tried similar things but didn't get this deadlock. Somehow
the fact that mdadm is holding the reconfig_sem semaphore means that
some IO cannot proceed and so mdadm cannot grab and resize all the
stripe heads... I'll have to look more deeply into this.

>
> I'm happy to do more tests. It's easy to conjure up virtual disks and
> load them with irrelevant data (like kernel trees ;)

Great. I'll probably be putting out a new patch set late this week
or early next. Hopefully it will fix the issues you found and you
can try it again.


Thanks again,
NeilBrown

2006-01-23 01:09:05

by John Hendrikx

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

NeilBrown wrote:
> In line with the principle of "release early", following are 5 patches
> against md in 2.6.latest which implement reshaping of a raid5 array.
> By this I mean adding 1 or more drives to the array and then re-laying
> out all of the data.
>
I think my question is already answered by this, but...

Would this also allow changing the size of each raid device? Let's say
I currently have 160 GB x 6, could I change that to 300 GB x 6 or am I
only allowed to add more 160 GB devices?

2006-01-23 01:26:08

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Monday January 23, [email protected] wrote:
> NeilBrown wrote:
> > In line with the principle of "release early", following are 5 patches
> > against md in 2.6.latest which implement reshaping of a raid5 array.
> > By this I mean adding 1 or more drives to the array and then re-laying
> > out all of the data.
> >
> I think my question is already answered by this, but...
>
> Would this also allow changing the size of each raid device? Let's say
> I currently have 160 GB x 6, could I change that to 300 GB x 6 or am I
> only allowed to add more 160 GB devices?

Changing the size of the devices is a separate operation that has been
supported for a while.
For each device in turn, you fail it and replace it with a larger
device. (This means the array runs degraded for a while, which isn't
ideal and might be fixed one day).

Once all the devices in the array are of the desired size, you run
mdadm --grow /dev/mdX --size=max
and the array (raid1, raid5, raid6) will use up all available space on
the devices, and a resync will start to make sure that extra space is
in-sync.
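
For illustration, a minimal sketch of that sequence on a hypothetical
three-drive /dev/md0, where /dev/sda is an old small member and /dev/sde is
its larger replacement (device names are placeholders):

# repeat for each member in turn, waiting for the resync to finish each time
mdadm /dev/md0 --fail /dev/sda --remove /dev/sda
mdadm /dev/md0 --add /dev/sde

# once every member has been replaced with a larger device
mdadm --grow /dev/md0 --size=max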

NeilBrown

2006-01-23 01:54:17

by Kyle Moffett

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Jan 22, 2006, at 20:25, Neil Brown wrote:
> Changing the size of the devices is a separate operation that has
> been supported for a while. For each device in turn, you fail it
> and replace it with a larger device. (This means the array runs
> degraded for a while, which isn't ideal and might be fixed one day).
>
> Once all the devices in the array are of the desired size, you run
> mdadm --grow /dev/mdX --size=max
> and the array (raid1, raid5, raid6) will use up all available space
> on the devices, and a resync will start to make sure that extra
> space is in-sync.

One option I can think of that would make it much safer would be to
originally set up your RAID like this:

                     md3 (RAID-5)
           __________/    |    \__________
          /               |               \
   md0 (RAID-1)     md1 (RAID-1)     md2 (RAID-1)

Each of md0-2 would only have a single drive, and therefore provide
no redundancy. When you wanted to grow the RAID-5, you would first
add a new larger disk to each of md0-md2 and trigger each resync.
Once that is complete, remove the old drives from md0-2 and run:
mdadm --grow /dev/md0 --size=max
mdadm --grow /dev/md1 --size=max
mdadm --grow /dev/md2 --size=max

Then once all that has completed, run:
mdadm --grow /dev/md3 --size=max

This will enlarge the top-level array. If you have LVM on the top-
level, you can allocate new LVs, resize existing ones, etc.

With the newly added code, you could also add new drives dynamically
by creating a /dev/md4 out of the single drive, and adding that as a
new member of /dev/md3.
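
For concreteness, a rough sketch of how such a layout might be assembled
(device names are hypothetical; each RAID-1 is created with a "missing" slot
so the larger drive can simply be mirrored in later):

# one single-drive RAID-1 per disk, leaving the second slot empty
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda missing
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb missing
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc missing

# RAID-5 across the three RAID-1s
mdadm --create /dev/md3 --level=5 --raid-devices=3 /dev/md0 /dev/md1 /dev/md2

# later, to grow: mirror each member onto a bigger disk, drop the old one,
# then grow the RAID-1 and finally the RAID-5, as described above
mdadm /dev/md0 --add /dev/sdd
mdadm /dev/md0 --fail /dev/sda --remove /dev/sda    # after the resync completes
mdadm --grow /dev/md0 --size=max
mdadm --grow /dev/md3 --size=max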

Cheers,
Kyle Moffett

--
I lost interest in "blade servers" when I found they didn't throw
knives at people who weren't supposed to be in your machine room.
-- Anthony de Boer


2006-01-23 09:45:28

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Sat, Jan 21, 2006 at 01:13:11AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-21T01:08:06, Heinz Mauelshagen <[email protected]> wrote:
>
> > > A dm-md wrapper would give you the same?
> > No, we'd need more complex stacking to achieve such mappings.
> > Think lvm2 and logical-volume-level raid5.
>
> How would you not get that if you had a wrapper around md which made it
> into an dm personality/target?

You could with deeper stacking. That's why I mentioned it above.

>
> Besides, stacking between dm devices so far (ie, if I look how kpartx
> does it, or LVM2 on top of MPIO etc, which works just fine) is via the
> block device layer anyway - and nothing stops you from putting md on top
> of LVM2 LVs either.
>
> I use them regularly to play with md and other stuff...

Me too but for production, I want to avoid the
additional stacking overhead and complexity.

>
> So I remain unconvinced that code duplication is worth it for more than
> "hark we want it so!" ;-)

Shall I remove you from the list of potential testers of dm-raid45 then ;-)

>
>

--

Regards,
Heinz -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-23 10:26:50

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On 2006-01-23T10:44:18, Heinz Mauelshagen <[email protected]> wrote:

> > Besides, stacking between dm devices so far (ie, if I look how kpartx
> > does it, or LVM2 on top of MPIO etc, which works just fine) is via the
> > block device layer anyway - and nothing stops you from putting md on top
> > of LVM2 LVs either.
> >
> > I use them regularly to play with md and other stuff...
>
> Me too but for production, I want to avoid the
> additional stacking overhead and complexity.

Ok, I still didn't get that. I must be slow.

Did you implement some DM-internal stacking now to avoid the above
mentioned complexity?

Otherwise, even DM-on-DM is still stacked via the block device
abstraction...


Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

2006-01-23 10:40:21

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Mon, Jan 23, 2006 at 11:26:01AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-23T10:44:18, Heinz Mauelshagen <[email protected]> wrote:
>
> > > Besides, stacking between dm devices so far (ie, if I look how kpartx
> > > does it, or LVM2 on top of MPIO etc, which works just fine) is via the
> > > block device layer anyway - and nothing stops you from putting md on top
> > > of LVM2 LVs either.
> > >
> > > > I use them regularly to play with md and other stuff...
> >
> > Me too but for production, I want to avoid the
> > additional stacking overhead and complexity.
>
> Ok, I still didn't get that. I must be slow.
>
> Did you implement some DM-internal stacking now to avoid the above
> mentioned complexity?
>
> Otherwise, even DM-on-DM is still stacked via the block device
> abstraction...

No, not necessary because a single-level raid4/5 mapping will do it.
Ie. it supports <offset> parameters in the constructor as other targets
do as well (eg. mirror or linear).
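
A trivial dmsetup example of such an offset parameter, using the existing
linear target (the device name, table name and sector numbers are made up):

# map sectors 0..409599 of the new device onto /dev/sdb starting at sector 2048
echo "0 409600 linear /dev/sdb 2048" | dmsetup create example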

>
>
> Sincerely,
> Lars Marowsky-Brée
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

--

Regards,
Heinz -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-23 10:46:15

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On 2006-01-23T11:38:51, Heinz Mauelshagen <[email protected]> wrote:

> > Ok, I still didn't get that. I must be slow.
> >
> > Did you implement some DM-internal stacking now to avoid the above
> > mentioned complexity?
> >
> > Otherwise, even DM-on-DM is still stacked via the block device
> > abstraction...
>
> No, not necessary because a single-level raid4/5 mapping will do it.
> Ie. it supports <offset> parameters in the constructor as other targets
> do as well (eg. mirror or linear).

A dm-md wrapper would not support such a basic feature (which is easily
added to md too) how?

I mean, "I'm rewriting it because I want to and because I understand and
own the code then" is a perfectly legitimate reason, but let's please
not pretend there's really sound and good technical reasons ;-)


Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

2006-01-23 11:01:24

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Mon, Jan 23, 2006 at 11:45:22AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-23T11:38:51, Heinz Mauelshagen <[email protected]> wrote:
>
> > > Ok, I still didn't get that. I must be slow.
> > >
> > > Did you implement some DM-internal stacking now to avoid the above
> > > mentioned complexity?
> > >
> > > Otherwise, even DM-on-DM is still stacked via the block device
> > > abstraction...
> >
> > No, not necessary because a single-level raid4/5 mapping will do it.
> > Ie. it supports <offset> parameters in the constructor as other targets
> > do as well (eg. mirror or linear).
>
> A dm-md wrapper would not support such a basic feature (which is easily
> added to md too) how?
>
> I mean, "I'm rewriting it because I want to and because I understand and
> own the code then" is a perfectly legitimate reason

Sure :-)

>, but let's please
> not pretend there's really sound and good technical reasons ;-)

Mind you, there's no need to argue about that:
this is based on requests to do it.

>
>
> Sincerely,
> Lars Marowsky-Brée
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

--

Regards,
Heinz -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-23 12:54:25

by Ville Herva

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Mon, Jan 23, 2006 at 10:44:18AM +0100, you [Heinz Mauelshagen] wrote:
> >
> > I use them regularly to play with md and other stuff...
>
> Me too but for production, I want to avoid the
> additional stacking overhead and complexity.
>
> > So I remain unconvinced that code duplication is worth it for more than
> > "hark we want it so!" ;-)
>
> Shall I remove you from the list of potential testers of dm-raid45 then ;-)

Heinz,

If you really want the rest of us to convert from md to lvm, you should
perhaps give some attention to the brittle userland (scripts and
binaries).

It is very tedious to have to debug a production system for a few hours in
order to get the rootfs mounted after each kernel update.

The lvm error messages give almost no clue on the problem.

Worse yet, problem reports on these issues are completely ignored on the lvm
mailing list, even when a patch is attached.

(See
http://marc.theaimsgroup.com/?l=linux-lvm&m=113775502821403&w=2
http://linux.msede.com/lvm_mlist/archive/2001/06/0205.html
http://linux.msede.com/lvm_mlist/archive/2001/06/0271.html
for reference.)

Such experience gives an impression lvm is not yet ready for serious
production use.

No offense intended, lvm kernel (lvm1 nor lvm2) code has never given me
trouble, and is probably as solid as anything.


-- v --

[email protected]

PS: Speaking of debugging failing initrd init scripts; it would be nice if
the kernel gave an error message on wrong initrd format rather than silently
failing... Yes, I forgot to make the cpio with the "-H newc" option :-/.

2006-01-23 13:00:51

by Steinar H. Gunderson

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Mon, Jan 23, 2006 at 02:54:20PM +0200, Ville Herva wrote:
> If you really want the rest of us to convert from md to lvm, you should
> perhaps give some attention to the brittle userland (scripts and
> binaries).

If you do not like the LVM userland, you might want to try the EVMS userland,
which uses the same kernel code and (mostly) the same on-disk formats, but
has a different front-end.

> It is very tedious to have to debug a production system for a few hours in
> order to get the rootfs mounted after each kernel update.

This sounds a bit like an issue with your distribution, which should normally
fix initrd/initramfs issues for you.

/* Steinar */
--
Homepage: http://www.sesse.net/

2006-01-23 13:55:09

by Heinz Mauelshagen

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Mon, Jan 23, 2006 at 02:54:20PM +0200, Ville Herva wrote:
> On Mon, Jan 23, 2006 at 10:44:18AM +0100, you [Heinz Mauelshagen] wrote:
> > >
> > > I use them regularly to play with md and other stuff...
> >
> > Me too but for production, I want to avoid the
> > additional stacking overhead and complexity.
> >
> > > So I remain unconvinced that code duplication is worth it for more than
> > > "hark we want it so!" ;-)
> >
> > Shall I remove you from the list of potential testers of dm-raid45 then ;-)
>
> Heinz,
>
> If you really want the rest of us to convert from md to lvm, you should
> perhaps give some attention to the brittle userland (scripts and
> binaries).

Sure :-)

>
> It is very tedious to have to debug a production system for a few hours in
> order to get the rootfs mounted after each kernel update.
>
> The lvm error messages give almost no clue on the problem.
>
> Worse yet, problem reports on these issues are completely ignored on the lvm
> mailing list, even when a patch is attached.
>
> (See
> http://marc.theaimsgroup.com/?l=linux-lvm&m=113775502821403&w=2
> http://linux.msede.com/lvm_mlist/archive/2001/06/0205.html
> http://linux.msede.com/lvm_mlist/archive/2001/06/0271.html
> for reference.)

Hrm, those are initscripts related, not lvm directly

>
> Such experience gives an impression lvm is not yet ready for serious
> production use.

initscripts/initramfs surely need to do the right thing
in case root is on lvm.

>
> No offense intended, lvm kernel (lvm1 nor lvm2) code has never given me
> trouble, and is probably as solid as anything.

Alright.
Is the initscript issue fixed now or still open?
Had you filed a bug against the distro's initscripts?

>
>
> -- v --
>
> [email protected]
>
> PS: Speaking of debugging failing initrd init scripts; it would be nice if
> the kernel gave an error message on wrong initrd format rather than silently
> failing... Yes, I forgot to make the cpio with the "-H newc" option :-/.

--

Regards,
Heinz -- The LVM Guy --

*** Software bugs are stupid.
Nevertheless it needs not so stupid people to solve them ***

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer Am Sonnenhang 11
Cluster and Storage Development 56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

2006-01-23 17:33:39

by Ville Herva

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

On Mon, Jan 23, 2006 at 02:54:28PM +0100, you [Heinz Mauelshagen] wrote:
> >
> > It is very tedious to have to debug a production system for a few hours in
> > order to get the rootfs mounted after each kernel update.
> >
> > The lvm error messages give almost no clue on the problem.
> >
> > Worse yet, problem reports on these issues are completely ignored on the lvm
> > mailing list, even when a patch is attached.
> >
> > (See
> > http://marc.theaimsgroup.com/?l=linux-lvm&m=113775502821403&w=2
> > http://linux.msede.com/lvm_mlist/archive/2001/06/0205.html
> > http://linux.msede.com/lvm_mlist/archive/2001/06/0271.html
> > for reference.)
>
> Hrm, those are initscripts related, not lvm directly

With the ancient LVM1 issue, my main problem was indeed that mkinitrd did
not reserve enough space for the initrd. The LVM issue I posted to the LVM
list was that LVM userland (vg_cfgbackup.c) did not check for errors while
writing to the fs. The (ignored) patch added some error checking.

But that's ancient, I think we can forget about that.

The current issue (please see the first link) is about the need to add
a "sleep 5" between
lvm vgmknodes
and
mount -o defaults --ro -t ext3 /dev/root /sysroot
.

Otherwise, mounting fails. (Actually, I added "sleep 5" after every lvm
command in the init script and did not narrow it down any more, since this
was a production system, each boot took ages, and I had to get the system up
as soon as possible.)

To me it seemed some kind of problem with the lvm utilities, not with the
initscripts. At least, the correct solution cannot be adding "sleep 5" here
and there in the initscripts...
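
Reconstructed roughly from the description above, the relevant initrd
fragment looks like this (the vgscan/vgchange lines are assumed; only
vgmknodes, the sleep and the mount line come from the report):

lvm vgscan
lvm vgchange -ay
lvm vgmknodes
sleep 5        # workaround: without this pause the mount below fails
mount -o defaults --ro -t ext3 /dev/root /sysroot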

> Alright.
> Is the initscript issue fixed now or still open?

It is still open.

Sadly, the only two systems this currently happens are production boxes and
I cannot boot them at will for debugging. It is, however, 100% reproducible
and I can try reasonable suggestions when I boot them the next time. Sorry
about this.

> Had you filed a bug against the distro's initscripts?

No, since I wasn't sure the problem actually was in the initscript. Perhaps
it does do something wrong, but the "sleep 5" workaround is pretty
suspicious.

Thanks for the reply.



-- v --

[email protected]

2006-01-23 23:02:26

by Adam Kropelin

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Neil Brown wrote:
> On Saturday January 21, [email protected] wrote:
>> On the first try I neglected to read the directions and increased the
>> number of devices first (which worked) and then attempted to add the
>> physical device (which didn't work; at least not the way I intended).
>
> Thanks, this is exactly the sort of feedback I was hoping for - people
> testing things that I didn't think to...
>
>> mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc
>>
>> md0 : active raid5 sda[0] sdc[2] sdb[1]
>> 2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>>
>> mdadm --grow -n4 /dev/md0
>>
>> md0 : active raid5 sda[0] sdc[2] sdb[1]
>> 3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
>
> I assume that no "resync" started at this point? It should have done.

Actually, it did start a resync. Sorry, I should have mentioned that. I
waited until the resync completed before I issued the 'mdadm --add'
command.

>> md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
>> 2097024 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>> ...should this be... --> [4/3] [UUU_] perhaps?
>
> Well, part of the array is "4/4 UUUU" and part is "3/3 UUU". How do
> you represent that? I think "4/4 UUUU" is best.

I see your point. I was expecting some indication that my array was
vulnerable and that the new disk was not fully utilized yet. I guess the
resync in progress indicator is sufficient.

>> My final test was a repeat of #2, but with data actively being written
>> to the array during the reshape (the previous tests were on an idle,
>> unmounted array). This one failed pretty hard, with several processes
>> ending up in the D state.
>
> Hmmm... I tried similar things but didn't get this deadlock. Somehow
> the fact that mdadm is holding the reconfig_sem semaphore means that
> some IO cannot proceed and so mdadm cannot grab and resize all the
> stripe heads... I'll have to look more deeply into this.

For what it's worth, I'm using the Buslogic SCSI driver for the disks in
the array.

>> I'm happy to do more tests. It's easy to conjure up virtual disks and
>> load them with irrelevant data (like kernel trees ;)
>
> Great. I'll probably be putting out a new patch set late this week
> or early next. Hopefully it will fix the issues you found and you
> can try it again.

Looking forward to it...

--Adam

2006-01-24 02:02:50

by Phillip Susi

[permalink] [raw]
Subject: Re: [PATCH 000 of 5] md: Introduction

Ville Herva wrote:
> PS: Speaking of debugging failing initrd init scripts; it would be nice if
> the kernel gave an error message on wrong initrd format rather than silently
> failing... Yes, I forgot to make the cpio with the "-H newc" option :-/.
>

LOL, yea, that one got me too when I was first getting back into linux a
few months ago and had to customize my initramfs to include dmraid to
recognize my hardware fakeraid raid0. Then I discovered the mkinitramfs
utility which makes things much nicer ;)


2006-01-24 07:26:12

by Ville Herva

[permalink] [raw]
Subject: Error message for invalid initramfs cpio format?

On Mon, Jan 23, 2006 at 09:02:16PM -0500, you [Phillip Susi] wrote:
> Ville Herva wrote:
> >PS: Speaking of debugging failing initrd init scripts; it would be nice if
> >the kernel gave an error message on wrong initrd format rather than
> >silently
> >failing... Yes, I forgot to make the cpio with the "-H newc" option :-/.
> >
>
> LOL, yea, that one got me too when I was first getting back into linux a
> few months ago and had to customize my initramfs to include dmraid to
> recognize my hardware fakeraid raid0. Then I discovered the mkinitramfs
> utility which makes things much nicer ;)

Sure does, that's what I first used, too. But then I had to hack with the
init script and it seemed quicker to

gzip -d < /boot/initrd-2.6.15.1.img | cpio --extract --verbose --make-directories --no-absolute-filenames
vi init
...
find . | cpio -H newc --create --verbose | gzip -9 > /boot/initrd-2.6.15.1.img

It seems do_header() in init/initramfs.c checks for the "070701" magic (that
is specific to the newc format [1]), and populate_rootfs() should then
panic() with a "no cpio magic" error message, but I'm fairly sure I didn't see
an error about wrong initramfs format when booting with an initrd made with
cpio without the -H newc option.

This is what I see:

RAMDISK: Couldn't find valid RAM disk starting at 0.
VFS: Cannot open root device "LABEL=/" or unknown-block(0,0)
Please append correct "root=" boot option
Kernel panic - not syncing: VFS: Unable to mount root fs on
unknown-block(0,0)

It seems the "no cpio magic" message is somehow lost. It would be useful.
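
A quick way to check which format an existing image is in (reusing the image
path from the commands above) is to look at the first bytes after
decompression; a newc-format cpio archive begins with the ASCII magic "070701":

# prints "070701" for a valid newc cpio archive
gzip -dc /boot/initrd-2.6.15.1.img | head -c 6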


-- v --

[email protected]



[1] "The new (SVR4) portable format, which supports file systems having more
than 65536 i-nodes."