Joe Thornber wrote:
>On Wed, Jan 30, 2002 at 10:03:40PM +0000, Jim McDonald wrote:
>> Also, does/where does this fit in with EVMS?
>EVMS differs from us in that they seem to be trying to move the whole
>application into the kernel,
No, not really. We only put in the kernel the things that make sense to be
in the kernel, discovery logic, ioctl support, I/O path. All configuration
is handled in user space.
> whereas we've taken the opposite route
>and stripped down the kernel side to just provide services.
Then why does snapshot.c in device mapper have a read_metadata function
which populates the exception table from on disk metadata? Seems like you
agree with us that having metadata knowledge in the kernel is a GOOD thing.
>This is fine, I think there's room for both projects. But it is worth
>noting that EVMS could be changed to use device-mapper for its low
>level functionality. That way they could take advantage of the cool
>work we're doing with snapshots and pvmove, and we could take
>advantage of having more eyes on the core driver.
Since device_mapper does not support in kernel discovery, and EVMS relies
on this, it would be very difficult to change EVMS to use device_mapper.
Besides, EVMS already has all the capabilities provided by device mapper,
including a complete LVM1 compatibility package.
>LVM2 may not seem that exciting initially, since the first release is
>just concentrating on reproducing LVM1 functionality. But a lot of
>the reason for this rewrite is to enable us to add in the new features
>that we want (such as a transaction based disk format). It's on this
>new feature list that we'll be mainly competing with EVMS.
Why compete, come on over and help us :-)
Steve
EVMS Development - http://www.sf.net/projects/evms
Linux Technology Center - IBM Corporation
(512) 838-9763 EMAIL: [email protected]
On Thu, Jan 31, 2002 at 01:52:29PM -0600, Steve Pratt wrote:
> Joe Thornber wrote:
> >On Wed, Jan 30, 2002 at 10:03:40PM +0000, Jim McDonald wrote:
> >> Also, does/where does this fit in with EVMS?
>
> >EVMS differs from us in that they seem to be trying to move the whole
> >application into the kernel,
>
> No, not really. We only put in the kernel the things that make sense to be
> in the kernel, discovery logic, ioctl support, I/O path. All configuration
> is handled in user space.
The notion of "volume group" and "physical volume", etc. does
not exist in the device mapper.
It is really much lower-level.
As it is, it could be used for partitioning as well, with a single
LOC changed. In fact, I imagine it's a 100 LOC or so in libparted to
get partprobe to use it. Then we could retire all partition code
from the kernel :)
> > whereas we've taken the opposite route
> >and stripped down the kernel side to just provide services.
>
> Then why does snapshot.c in device mapper have a read_metadata function
> which populates the exception table from on disk metadata? Seems like you
> agree with us that having metadata knowledge in the kernel is a GOOD thing.
It's good to *support* reading metadata, but better to not require it.
> Since device_mapper does not support in kernel discovery, and EVMS relies
> on this, it would be very difficult to change EVMS to use device_mapper.
Why? Just because device_mapper itself doesn't support discovery
doesn't mean you can't do the discovery for it, and set up the
devices in-kernel.
I think a bigger issue will be the difference in interface for
EVMS stuff, like all the plugin ioctls... it would be great if
you could:
* decouple the evms runtime (kernel code) from the evms
engine (userland). Something like libparted's PedArchitecture
would be nice.
* get everyone in the linux community to agree on some standard
ioctls that provide an interface for everything that evms
(and everyone else) needs.
How hard is this to do?
Andrew
On Thu, Jan 31, 2002 at 01:52:29PM -0600, Steve Pratt wrote:
> Joe Thornber wrote:
> >On Wed, Jan 30, 2002 at 10:03:40PM +0000, Jim McDonald wrote:
> >> Also, does/where does this fit in with EVMS?
>
> >EVMS differs from us in that they seem to be trying to move the whole
> >application into the kernel,
>
> No, not really. We only put in the kernel the things that make sense to be
> in the kernel, discovery logic, ioctl support, I/O path. All configuration
> is handled in user space.
There's still a *lot* of code in there; > 26,000 lines in fact.
Whereas device-mapper weighs in at ~2,600 lines. This is just because
you've decided to take a different route from us; you may be proven to
be correct.
> > whereas we've taken the opposite route
> >and stripped down the kernel side to just provide services.
>
> Then why does snapshot.c in device mapper have a read_metadata function
> which populates the exception table from on disk metadata? Seems like you
> agree with us that having metadata knowledge in the kernel is a GOOD thing.
In the case of snapshots the exception data has to be written by the
kernel for performance reasons, as you know. This is still a far cry
from understanding the LVM1 metadata format.
> Since device_mapper does not support in kernel discovery, and EVMS relies
> on this, it would be very difficult to change EVMS to use device_mapper.
So do the discovery on the EVMS side, and then pass the tables across
to device-mapper to activate the LVs.
> Why compete, come on over and help us :-)
I would like the two projects to help each other, but not to the point
where one group of people has to say 'you are completely right, we
will stop developing our project'. It's unlikely that either of us is
100% correct; but I do think device-mapper splits off a nice chunk of
services that is useful to *all* people who want to do volume
management. As such I see that as one area where we may eventually
work together.
Similarly I expect to be providing an *optional* kernel module for LVM
users who wish to do in kernel discovery of a root LV, so if the EVMS
team has managed to get a nice generic way of iterating block devices
etc. into the kernel, we would be able to take advantage of that.
Are you trying to break out functionality so it benefits other Linux
projects? Or is EVMS just one monolithic application embedded in the
kernel?
- Joe
On Fri, Feb 01, 2002 at 04:55:18AM -0500, Arjan van de Ven wrote:
> There is one thing that might spoil the device-mapper "just simple stuff
> only" thing: moving active volumes around. Doing that in userspace reliably
> is impossible and basically needs to be done in kernelspace (it's an
> operation comparable with raid1 resync, and not even that hard in kernel
> space). However, that sort of automatically requires kernelspace to know
> about volumes, and from there it's a small step....
I think you're talking about pvmove and friends, in which case I hope
the description below helps.
- Joe
Let's say we have an LV made up of three segments on different PVs.
I've also added in the device major:minor, as this will be useful
later:
+----------+---------+--------+
|   PV1    |   PV2   |  PV3   | 254:3
+----------+---------+--------+
Now our hero decides to pvmove PV2 to PV4:
1. Suspend our LV (254:3); this starts queueing all io and flushes
all pending io. Once the suspend has completed we are free to change
the mapping table.
2. Set up *another* (254:4) device with the mapping table of our LV.
3. Load a new mapping table into (254:3) that has identity targets for
parts that aren't moving, and a mirror target for parts that are.
4. Unsuspend (254:3)
So now we have:
              destination of copy
           +--------------------->---------------+
           |                                      |
+----------+---------+--------+            +------------+
| Identity |  mirror | Ident. | 254:3      |    PV4     |
+----------+---------+--------+            +------------+
     |          |         |
     \/         \/        \/
+----------+---------+--------+
|   PV1    |   PV2   |  PV3   | 254:4
+----------+---------+--------+
Any writes to segment 2 of the LV get intercepted by the mirror target,
which checks whether that chunk has been copied to the new destination;
if it hasn't, it queues the initial copy and defers the current io until
that copy has finished. The current io is then written to *both* PV2 and
PV4 (a rough sketch of this write path follows step 8 below).
5. When the copying has completed, 254:3 is suspended and its pending io flushed.
6. 254:4 is taken down.
7. The metadata is updated on disk.
8. 254:3 has the new mapping table loaded:
+----------+---------+--------+
|   PV1    |   PV4   |  PV3   | 254:3
+----------+---------+--------+
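For reference, the mirror target's write path mentioned above boils down
to something like the pseudocode below. None of these helper names are
real device-mapper functions; they only restate the steps already
described.

/* Pseudocode only: chunk_is_copied(), queue_chunk_copy(), defer_io()
 * and write_to_device() are invented names, not dm functions. */
static int mirror_write(struct mirror_ctx *mc, struct buffer_head *bh)
{
        chunk_t chunk = sector_to_chunk(mc, bh->b_rsector);

        if (!chunk_is_copied(mc, chunk)) {
                /* First write to this chunk: start the background copy
                 * (old PV -> new PV) and park this io until it's done. */
                queue_chunk_copy(mc, chunk);
                defer_io(mc, chunk, bh);
                return 0;       /* dm will resubmit the io later */
        }

        /* Chunk already copied: the write must go to *both* devices so
         * that neither the source nor the destination goes stale. */
        write_to_device(mc->src, bh);
        write_to_device(mc->dest, bh);
        return 1;
}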
On Fri, Feb 01, 2002 at 05:12:51AM -0500, Arjan van de Ven wrote:
> On Thu, Jan 31, 2002 at 01:09:13PM +0000, Joe Thornber wrote:
> >
> > Now our hero decides to PV move PV2 to PV4:
> >
> > 1. Suspend our LV (254:3), this starts queueing all io, and flushes
> > all pending io.
>
> But "flushes all pending io" is *far* from trivial. there's no current
> kernel functionality for this, so you'll have to do "weird shit" that will
> break easy and often.
Here's the weird shit. If you can see how to break it, I'd like to
know.
Whenever the dm driver maps a buffer_head, I increment a 'pending'
counter for that device, and hook the bh->b_end_io and bh->b_private
fields so that this counter is decremented when the io completes.
This doesn't work with ext3 on 2.4 kernels, since ext3 believes the
b_private pointer is for general filesystem use rather than belonging
to b_end_io; however, Stephen Tweedie and I have been discussing ways
to get round this. On 2.5 this works fine, since bio->bi_private
isn't abused in this way.
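Roughly, the hook looks like this. The struct names and the GFP flag
here are illustrative; only the buffer_head b_end_io/b_private fields
and their 2.4 signatures are real.

#include <linux/fs.h>
#include <linux/slab.h>

struct mapped_dev {
        atomic_t pending;               /* ios mapped but not yet completed */
        /* ... */
};

struct io_hook {
        struct mapped_dev *md;          /* device whose counter we bump */
        void (*old_end_io)(struct buffer_head *bh, int uptodate);
        void *old_private;
};

static void dec_pending(struct buffer_head *bh, int uptodate)
{
        struct io_hook *ih = bh->b_private;

        atomic_dec(&ih->md->pending);   /* io complete */

        /* restore the original completion hook and chain to it */
        bh->b_end_io = ih->old_end_io;
        bh->b_private = ih->old_private;
        if (bh->b_end_io)
                bh->b_end_io(bh, uptodate);

        kfree(ih);
}

static int hook_bh(struct mapped_dev *md, struct buffer_head *bh)
{
        struct io_hook *ih = kmalloc(sizeof(*ih), GFP_NOIO);

        if (!ih)
                return -ENOMEM;

        ih->md = md;
        ih->old_end_io = bh->b_end_io;
        ih->old_private = bh->b_private;

        atomic_inc(&md->pending);       /* io now in flight through dm */
        bh->b_end_io = dec_pending;
        bh->b_private = ih;
        return 0;
}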
> Also "suspending" is rather dangerous because it can deadlock the machine
> (think about the VM needing to write back dirty data on this LV in order to
> make memory available for your move)...
You are correct; this is the main flaw, IMO, with the LVM1 version of
pvmove (which was done in userland, with locking on a per-extent basis).
However, for LVM2 the device will only be suspended while a table
is loaded, *not* while the move takes place. I will, however, allocate
the struct deferred_io objects from a mempool in 2.5.
- Joe
On Fri, Feb 01, 2002 at 03:59:06PM -0600, Kevin Corry wrote:
> I have been thinking about this today and looking over some of the
> device-mapper interfaces. I will agree that, in concept, EVMS could be
> modified to use device-mapper for I/O remapping. However, as things stand
> today, I don't think the transition would be easy.
>
> As I'm trying to envision it, the EVMS runtime would become a "volume
> recognition" framework (see tanget below). Every current EVMS plugin would
> then probe all available devices and communicate the necessary volume
> remapping info to device-mapper through the ioctl interface. (An in-kernel
> API set might be nice to avoid the overhead of the ioctl path).
The in kernel API already exists in drivers/md/dm.h, see dm-ioctl.c
for example code that uses this.
> A new device would then be created for every node that every plugin
> recognizes. This brings up my first objection. With this approach,
> there would be an exposed device for every node in a volume stack,
> whereas the current EVMS design only exposes nodes for the final
> volumes.
True, all the devices are exposed in /dev/device-mapper. What we do
with LVM2 is stick with the /dev/<vg_name>/<lv_name> scheme, where the
lv is a symlink to the relevant node in the device-mapper dir. That
way people only see the top level devices, unless they peek under the
hood by looking at /dev/device-mapper.
> Next, from what I've seen, device-mapper provides static remappings
> from one device to another. It seems to be a good approach for
> setting up things like disk partitions and LVM linear LVs. There is
> also striping support in device-mapper, but I'm assuming it uses one
> notion of striping. For instance, RAID-0 striping in MD is handled
> differently than striped LVs in LVM, and I think AIX striping is
> also different. I'm not sure if one striping module could be generic
> enough to handle all of these cases. But, maybe it can. I'll have to
> think more about that one.
The thing that you've missed is that ranges of sectors are mapped by
dm onto 'targets'; these targets are instances of a 'target_type'
virtual class. You can define your own target types in a separate
module and register them using the
int dm_register_target(struct target_type *t);
function (see <linux/device-mapper.h>). We originally intended all
the standard target types to be defined in separate modules, but that
seemed silly considering their tiny size (e.g. < 200 lines for striping).
The target types that we will be using in LVM2 are:
io-err - errors every io
linear
striped
snapshot, and snapshot origin
mirror (used to provide pvmove support as well as mirroring).
(both the snapshot targets and mirroring target will use kcopyd for
efficient copying using kiobufs.)
So if you invent a new form of striping, just implement a new target type.
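For example, a new target boils down to a skeleton like the one below.
dm_register_target()/dm_unregister_target() and struct target_type come
straight from the description above; the exact ctr/dtr/map signatures
are my approximation of <linux/device-mapper.h> and may not match the
header precisely.

#include <linux/device-mapper.h>
#include <linux/module.h>

/* Signatures are approximate; the real header may use different types. */
static int my_stripe_ctr(struct dm_table *t, unsigned long start,
                         unsigned long len, int argc, char **argv,
                         void **context)
{
        /* Parse the table arguments, look up the underlying devices,
         * and build a private context describing the striping layout. */
        return 0;
}

static void my_stripe_dtr(struct dm_table *t, void *context)
{
        /* Drop the device references and free the context. */
}

static int my_stripe_map(struct buffer_head *bh, int rw, void *context)
{
        /* Rewrite bh->b_rdev / bh->b_rsector according to the new
         * striping scheme, then return non-zero so dm submits the io. */
        return 1;
}

static struct target_type my_stripe_target = {
        name:   "my-stripe",
        module: THIS_MODULE,
        ctr:    my_stripe_ctr,
        dtr:    my_stripe_dtr,
        map:    my_stripe_map,
};

static int __init my_stripe_init(void)
{
        return dm_register_target(&my_stripe_target);
}

static void __exit my_stripe_exit(void)
{
        dm_unregister_target(&my_stripe_target);
}

module_init(my_stripe_init);
module_exit(my_stripe_exit);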
> How about mirroring? Does the linear module in device-mapper allow
> for 1-to-n mappings? This would be similar to the way AIX does
> mirroring, where each LP of an LV can map to up to three PPs on
> different PVs.
No, there's a separate mirror target that Patrick Caulfield is working
on at the moment.
> How would device-mapper handle a remapping that is dynamic at runtime?
Suspending a device flushes all pending io to the device, i.e. all io
that has been through the driver but not yet completed. Once you have
suspended, you are free to load a new mapping table into the device
(this is essential for implementing snapshots and pvmove). When the
new mapping is in place you just have to resume the device and away
you go.
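In other words, a dynamic remapping driven from inside the kernel would
follow a sequence along these lines. The function names here are
hypothetical, written in the style of drivers/md/dm.h rather than copied
from it.

/* Hypothetical sketch: dm_suspend(), dm_swap_table() and dm_resume()
 * stand in for whatever the real in-kernel interface provides. */
static int remap_device(struct mapped_device *md, struct dm_table *new_table)
{
        int r;

        /* Queue new io and wait for everything in flight to complete. */
        r = dm_suspend(md);
        if (r)
                return r;

        /* The device is now quiescent, so the mapping can be replaced. */
        r = dm_swap_table(md, new_table);
        if (r) {
                dm_resume(md);  /* put the old mapping back in service */
                return r;
        }

        /* New mapping is live; release the queued io against it. */
        return dm_resume(md);
}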
- Joe
On Thu, Jan 31, 2002 at 12:52:11PM +0000, Joe Thornber wrote:
> On Thu, Jan 31, 2002 at 01:52:29PM -0600, Steve Pratt wrote:
> > Joe Thornber wrote:
> > >On Wed, Jan 30, 2002 at 10:03:40PM +0000, Jim McDonald wrote:
> > >> Also, does/where does this fit in with EVMS?
> >
> > >EVMS differs from us in that they seem to be trying to move the whole
> > >application into the kernel,
> >
> > No, not really. We only put in the kernel the things that make sense to be
> > in the kernel, discovery logic, ioctl support, I/O path. All configuration
> > is handled in user space.
>
> There's still a *lot* of code in there; > 26,000 lines in fact.
> Whereas device-mapper weighs in at ~2,600 lines. This is just because
> you've decided to take a different route from us, you may be proven to
> be correct.
There is one thing that might spoil the device-mapper "just simple stuff
only" thing: moving active volumes around. Doing that in userspace reliably
is impossible and basically needs to be done in kernelspace (it's an
operation comparable with raid1 resync, and not even that hard in kernel
space). However, that sort of automatically requires kernelspace to know
about volumes, and from there it's a small step....
Greetings,
Arjan van de Ven
On Thu, Jan 31, 2002 at 01:09:13PM +0000, Joe Thornber wrote:
>
> Now our hero decides to PV move PV2 to PV4:
>
> 1. Suspend our LV (254:3), this starts queueing all io, and flushes
> all pending io.
But "flushes all pending io" is *far* from trivial. there's no current
kernel functionality for this, so you'll have to do "weird shit" that will
break easy and often.
Also "suspending" is rather dangerous because it can deadlock the machine
(think about the VM needing to write back dirty data on this LV in order to
make memory available for your move)...
Greetings,
Arjan van de Ven
Hi,
On Fri, Feb 01, 2002 at 05:12:51AM -0500, Arjan van de Ven wrote:
> On Thu, Jan 31, 2002 at 01:09:13PM +0000, Joe Thornber wrote:
> >
> > Now our hero decides to PV move PV2 to PV4:
> >
> > 1. Suspend our LV (254:3), this starts queueing all io, and flushes
> > all pending io.
>
> But "flushes all pending io" is *far* from trivial. there's no current
> kernel functionality for this, so you'll have to do "weird shit" that will
> break easy and often.
I've been all through this with Joe. He *does* track pending IO in
device_mapper, and he's got a layered device_mirror driver which can
be overlaid on top of the segments of the device that you want to
copy. His design looks solid and the necessary infrastructure for
getting the locking right is all there.
> Also "suspending" is rather dangerous because it can deadlock the machine
> (think about the VM needing to write back dirty data on this LV in order to
> make memory available for your move)...
There's a copy thread which preallocates a fully-populated kiobuf for
the data. There's no VM pressure, and since it uses unbuffered IO,
there are no cache coherency problems like current LVM has.
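Schematically, that copy thread amounts to something like this. Every
helper name below is invented for illustration; only the general shape,
a preallocated kiobuf driving unbuffered io, comes from the design being
described.

/* Illustrative pseudocode only; none of these helpers exist as such. */
static int copy_thread(void *data)
{
        /* Allocate the kiobuf and populate its pages once, up front,
         * so the copy never has to allocate memory under VM pressure. */
        struct kiobuf *iobuf = preallocate_populated_kiobuf(CHUNK_SIZE);

        for (;;) {
                struct copy_job *job = wait_for_next_copy_job();

                if (!job)
                        break;

                /* Unbuffered io straight to the block devices, so the
                 * buffer cache never holds a stale copy of the data. */
                kio_read_chunk(iobuf, job->src_dev, job->src_sector);
                kio_write_chunk(iobuf, job->dest_dev, job->dest_sector);

                complete_copy_job(job);  /* wake any io deferred on it */
        }

        free_populated_kiobuf(iobuf);
        return 0;
}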
Arjan, I *told* you they have thought this stuff through. :-)
Cheers,
Stephen
> But "flushes all pending io" is *far* from trivial. there's no current
> kernel functionality for this, so you'll have to do "weird shit" that will
> break easy and often.
>
> Also "suspending" is rather dangerous because it can deadlock the machine
> (think about the VM needing to write back dirty data on this LV in order to
> make memory available for your move)...
I don't think you need to suspend I/O except momentarily. I don't use LVM and
while I can't resize volumes I migrate them like this
mdhotadd /dev/md1 /dev/newvolume
[wait for it to resync]
mdhotremove /dev/md1 /dev/oldvolume
The situation here seems analogous. You never need to suspend I/O to the
volume until you actually kill it, by which time you can just skip the write
to the dead volume.
(The above procedure with ext3 does btw make a great backup system 8))
Alan
Hi,
On Fri, Feb 01, 2002 at 02:44:24PM +0000, Alan Cox wrote:
> > But "flushes all pending io" is *far* from trivial. there's no current
> > kernel functionality for this, so you'll have to do "weird shit" that will
> > break easy and often.
> >
> > Also "suspending" is rather dangerous because it can deadlock the machine
> > (think about the VM needing to write back dirty data on this LV in order to
> > make memory available for your move)...
>
> I don't think you need to suspend I/O except momentarily. I don't use LVM and
> while I can't resize volumes I migrate them like this
LVM1 has some problems here. First, when it needs to flush IO as
part of its locking it does so with fsync_dev, which is not a valid
way of flushing certain types of IO. Second, its copy is done in user
space, so there is no cache coherence with the logical device contents
and there is enough VM pressure to give a good chance of deadlocking.
However, it _does_ do its locking at a finer granularity than the
whole disk (it locks an extent --- 4MB by default --- at a time), so
even with LVM1 it is possible to do the move on a live volume without
locking up all IO for the duration of the entire copy.
LVM2's device-mirror code is much closer to the raid1 mechanism in
design, so it doesn't even have to lock down an extent during the
copy.
> the situation here seems analogous. You never need to suspend I/O to the
> volume until you actually kill it, by which time you can just skip the write
> to the dead volume.
Right. LVM1 doesn't actually suspend IO to the volume, just to an
extent. What it does volume-wide is to flush IO, which is different.
The problem is that when we come to copy a chunk of the volume,
however large that chunk is, we need to make sure both that no new IOs
arrive on it, AND that we have waited for all outstanding IOs against
that chunk. It's the latter part which is the problem. It is
expensive to keep track of all outstanding IOs on a per-stripe basis,
so when we place a lock on a stripe and come to wait for
already-submitted IOs to complete, it is much easier just to do that
flush volume-wide. It's not a complete lock on the whole volume, just
a temporary mutex to ensure that there are no IOs left outstanding on
the stripe we're locking.
Cheers,
Stephen
On Thursday 31 January 2002 06:52, Joe Thornber wrote:
> On Thu, Jan 31, 2002 at 01:52:29PM -0600, Steve Pratt wrote:
> > No, not really. We only put in the kernel the things that make sense to
> > be in the kernel, discovery logic, ioctl support, I/O path. All
> > configuration is handled in user space.
>
> There's still a *lot* of code in there; > 26,000 lines in fact.
> Whereas device-mapper weighs in at ~2,600 lines. This is just because
> you've decided to take a different route from us, you may be proven to
> be correct.
Just so everyone is clear on the amount of code we are talking about, here
are my current counts of the different kernel drivers (based on code for
2.4.17):
EVMS: 17685 (this includes drivers/evms and include/linux/evms)
device-mapper: 2895 (this includes device-mapper/kernel/common and
device-mapper/kernel/ioctl - I've left out device-mapper/kernel/fs since only
one interface can be active at a time)
Current MD and LVM1: 11105 (this includes drivers/md, include/linux/lvm.h,
and include/linux/raid/)
Linux 2.4.17: 2519386 (a clean kernel, without EVMS or the latest LVM1
updates, and not counting asm code)
See http://www.dwheeler.com/sloccount/ for the tool that I used to get these
counts.
So I will agree - device-mapper does provide a nice, general-purpose I/O
remapping service in a small amount of code. Kernel bloat is obviously a big
concern as more functionality is added to Linux, and achieving a desired set
of functionality with less code is generally a good thing.
However, I don't think that the size of EVMS should just be written off as
kernel bloat. (I don't think any of the LVM guys have specifically said this,
but I have heard this comment from others, and I don't think they are looking
at the whole picture). We are talking about seven-tenths of one percent
increase in the total size of the kernel. And if you consider that EVMS has
implemented support for LVM1 and MD, then EVMS is really only adding 6580
lines of code to the kernel. On top of that, EVMS has support for several
disk partitioning formats. (This support does not yet fully duplicate the
advanced partition support in the kernel, so I can't yet give any definite
numeric comparisons.) There is also support for the AIX LVM and the OS/2 LVM,
as well as general bad-block-relocation and snapshotting support. For all of
this extra functionality, I don't believe the extra code is unwarranted.
> I would like the two projects to help each other, but not to the point
> where one group of people has to say 'you are completely right, we
> will stop developing our project'. It's unlikely that either of us is
> 100% correct; but I do think device-mapper splits off a nice chunk of
> services that is useful to *all* people who want to do volume
> management. As such I see that as one area where we may eventually
> work together.
>
> Similarly I expect to be providing an *optional* kernel module for LVM
> users who wish to do in kernel discovery of a root LV, so if the EVMS
> team has managed to get a nice generic way of iterating block devices
> etc. into the kernel, we would be able to take advantage of that.
> Are you trying to break out functionality so it benefits other Linux
> projects ? or is EVMS just one monolithic application embedded in the
> kernel ?
I have been thinking about this today and looking over some of the
device-mapper interfaces. I will agree that, in concept, EVMS could be
modified to use device-mapper for I/O remapping. However, as things stand
today, I don't think the transition would be easy.
As I'm trying to envision it, the EVMS runtime would become a "volume
recognition" framework (see tanget below). Every current EVMS plugin would
then probe all available devices and communicate the necessary volume
remapping info to device-mapper through the ioctl interface. (An in-kernel
API set might be nice to avoid the overhead of the ioctl path). A new device
would then be created for every node that every plugin recognizes. This
brings up my first objection. With this approach, there would be an exposed
device for every node in a volume stack, whereas the current EVMS design only
exposes nodes for the final volumes. Ignoring the dwindling minor-number
issue which should go away in 2.5, you still take up space in the
buffer-cache for every one of these devices, which introduces the possibility
of cache incoherence.
Maybe this example will help: Say we have four disks. These four disks have
one partition each, and are striped together with RAID-0 (using MD). This MD
device is then made into an LVM PV and put in a volume group, and an LV is
created from part of the space in that group. Then down the line, you decide
to do some backups, and create another LV to use as a snapshot of the first
LV. (For those who might be wondering, this is a very realistic scenario.)
     Snap_of_LV1
          |
         LV1          LV2
         _|____________|_
        |  Volume Group  |
         ----------------
                |
               md0
     ___________|___________
    |      |       |       |
   sda1   sdb1    sdc1    sdd1
    |      |       |       |
   sda    sdb     sdc     sdd
In this scenario, we would wind up with exposed devices for every item in
this graph (except the volume group). But in reality, we don't want someone
coming along and mucking with md0 or with LV2 or with any of the disk
partitions, because they are all in use by the two volumes at the top.
<tangent>
As we know, EVMS does volume discovery in the kernel. LVM1 does discovery in
user-space, but Joe has hinted at an in-kernel LVM1 discovery module to work
with device-mapper. Back when we started on EVMS, people were basically
shouting "we need in-kernel discovery!", so that's the route we took. This is
why it looks like EVMS has so much code. I'd say 50-75% of each plugin is
devoted to discovery. Of course, today there seem to be people shouting,
"let's move all discovery into user-space!". Well, I suppose that approach is
feasible, but I personally don't agree with it. My belief is that it's the
kernel's job to tell user-space what the hardware looks like, not the other
way around. If we move partition/volume/etc discovery into user-space, at
what point do we move device recognition into user-space? Looking down that
path just seems more and more like a micro-kernel approach, and I'm sure we
don't want to rehash that discussion.
</tangent>
Next, from what I've seen, device-mapper provides static remappings from one
device to another. It seems to be a good approach for setting up things like
disk partitions and LVM linear LVs. There is also striping support in
device-mapper, but I'm assuming it uses one notion of striping. For instance,
RAID-0 striping in MD is handled differently than striped LVs in LVM, and I
think AIX striping is also different. I'm not sure if one striping module
could be generic enough to handle all of these cases. But, maybe it can. I'll
have to think more about that one.
How about mirroring? Does the linear module in device-mapper allow for 1-to-n
mappings? This would be similar to the way AIX does mirroring, where each LP
of an LV can map to up to three PPs on different PVs.
How would device-mapper handle a remapping that is dynamic at runtime? For
instance, how would device-mapper handle bad-block-relocation? Actually, it
seems you have dealt with this from one point of view in the snapshotting
code in device-mapper. In order for persistent snapshots (or bad-block) to
work, device-mapper needs a module which knows the metadata format, because
that metadata has to be written at runtime. So another device-mapper "module"
would need to be written to handle bad-block. This implicitly limits the
capabilities of device-mapper, or else ties it directly to the recognition
code. For EVMS and device-mapper to work together, we would have to agree on
metadata formats for these types of modules. Other similar examples that come
to mind are RAID-5 and block-level encryption.
Now, don't get me totally wrong. I'm not saying using device-mapper in EVMS
is impossible. I'm just pointing out some of the issues I currently see with
making such a transition. Perhaps some of these issues can be addressed, from
either side.
Ultimately, I agree with Joe and Alasdair - I think there is room for both
projects. There are plenty of other examples of so-called competing projects
that co-exist just fine - KDE/Gnome, ReiserFS/JFS/XFS/ext3 - hell, there's
even two virtual memory managers to choose from! So if it just turns out that
Linux has a choice of two volume managers, then I don't have any problem with
it. I will say that it is somewhat unfortunate that we couldn't have worked
together more, but it seems to me that timing is what kept it from happening.
EVMS was under development when LVM was getting ready for their 1.0 release,
and now EVMS is trying to get a final release out as the new device-mapper is
coming along. Unfortunately we each have our own deadlines to think about.
Maybe down the line there will be more time to get together and figure out
ways to get these different technologies to work together.
-Kevin
> As I'm trying to envision it, the EVMS runtime would become a "volume
> recognition" framework (see tanget below). Every current EVMS plugin would
Volume recognition is definitely a user-space area. There is a huge array
of things you want to do in some environments that you cannot do from
kernel space.
Simple example: we have mount by label. Imagine trying to extend that in
kernel space to automatically do LDAP queries to find a remote handle to
the volume and NFS-mount it. It's easy in user space.
Alan
On Fri, Feb 01, 2002 at 03:59:06PM -0600, Kevin Corry wrote:
> Current MD and LVM1: 11105 (this includes drivers/md, include/linux/lvm.h,
> and include/linux/raid/)
Why are you counting MD? It should be trivial to implement raid
with device-mapper.
> So I will agree - device-mapper does provide a nice, general-purpose I/O
> remapping service in a small amount of code.
It provides everything you need (from a kernel) to do Volume
Management in general. (What's missing?)
> And if you consider that EVMS has
> implemented support for LVM1 and MD, then EVMS is really only adding 6580
> lines of code to the kernel.
Yep, it's not the size, but the complexity.
device-mapper is *simple*. It couldn't be simpler!
Simple is *good*, as long as it's not "too simple", in the sense
that it doesn't get the job done. So, tell us what the disadvantages
of device-mapper's "simplemindedness" are!
> In this scenario, we would wind up with exposed devices for every item in
> this graph (except the volume group). But in reality, we don't want someone
> coming along and mucking with md0 or with LV2 or with any of the disk
> partitions, because they are all in use by the two volumes at the top.
It's the user's fault if they choose to write on such a device.
> As we know, EVMS does volume discovery in the kernel. LVM1 does discovery in
> user-space, but Joe has hinted at an in-kernel LVM1 discovery module to work
> with device-mapper. Back when we started on EVMS, people were basically
> shouting "we need in-kernel discovery!",
Not me! Anyway, device-mapper is compatible with in-kernel discovery.
But why not just stick it in initramfs? (Are there any issues?)
> My belief is that it's the
> kernel's job to tell user-space what the hardware looks like, not the other
> way around. If we move partition/volume/etc discovery into user-space, at
> what point do we move device recognition into user-space?
There's a big difference: partition discovery isn't actually a hardware
thing, it's a software thing.
Some types of device recognition (iSCSI?) are also software things.
WRT: beliefs, I think the kernel should provide all the services that
can't be implemented well in userspace.
> Looking down that
> path just seems more and more like a micro-kernel approach, and I'm sure we
> don't want to rehash that discussion.
What's your point?
> Next, from what I've seen, device-mapper provides static remappings from one
> device to another. It seems to be a good approach for setting up things like
> disk partitions and LVM linear LVs. There is also striping support in
> device-mapper, but I'm assuming it uses one notion of striping. For instance,
> RAID-0 striping in MD is handled differently than striped LVs in LVM, and I
> think AIX striping is also different. I'm not sure if one striping module
> could be generic enough to handle all of these cases.
Why do you need a single striping module?
> How about mirroring? Does the linear module in device-mapper allow for 1-to-n
> mappings?
Wouldn't you use the striped module, with the chunk size set to
a good approximation of infinity? Haven't tried it...
Anyway, this is just a small detail... if there was no way of doing it,
it would be easy to add a module...
> So another device-mapper "module"
> would need to be written to handle bad-block. This implicitly limits the
> capabilities of device-mapper, or else ties it directly to the recognition
> code. For EVMS and device-mapper to work together, we would have to agree on
> metadata formats for these types of modules. Other similar example that come
> to mind are RAID-5 and block-level encryption.
Agreed... but these shouldn't be big problems? (raid-5 is already
a standard...)
Andrew
> > this graph (except the volume group). But in reality, we don't want someone
> > coming along and mucking with md0 or with LV2 or with any of the disk
> > partitions, because they are all in use by the two volumes at the top.
>
> It's the user's fault if they choose to write on such a device.
You want access to the raw devices as well as the virtual volumes. You
try doing SMART diagnostics online on an md volume if you can't get
at /dev/hd* and /dev/sd*.
You don't want to mount both layers at once I suspect, but even that
may be questionable for a read only mirror.
Alan