2007-08-12 10:35:19

by Al Boldi

[permalink] [raw]
Subject: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

Lars Ellenberg wrote:
> meanwhile, please, anyone interested,
> the drbd paper for LinuxConf Eu 2007 is finalized.
> http://www.drbd.org/fileadmin/drbd/publications/
> drbd8.linux-conf.eu.2007.pdf
>
> it does not give too much implementation detail (would be inappropriate
> for conference proceedings, imo; some paper commenting on the source
> code should follow).
>
> but it does give a good overview about what DRBD actually is,
> what exact problems it tries to solve,
> and what developments to expect in the near future.
>
> so you can make up your mind about
> "Do we need it?", and
> "Why DRBD? Why not NBD + MD-RAID?"

Ok, conceptually your driver sounds really interesting, but when I read the
pdf I got completely turned off. The problem is that the concepts are not
clearly implemented, when in fact they are really simple:

Allow shared access to remote block storage with fault tolerance.

The first thing to tackle here would be write serialization. Then start
thinking about fault tolerance.
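
To make "write serialization" concrete, here is a minimal single-host sketch in
Python (the /dev/nbd0 path is hypothetical); flock() only coordinates writers on
one node, which is exactly the gap a cluster filesystem's distributed lock
manager, as in GFS or OCFS, is meant to fill:

import fcntl
import os

DEVICE = "/dev/nbd0"      # hypothetical shared block device
BLOCK_SIZE = 4096

def serialized_write(block_no: int, payload: bytes) -> None:
    """Write one block while holding an exclusive (advisory) lock."""
    assert len(payload) == BLOCK_SIZE
    fd = os.open(DEVICE, os.O_WRONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)            # serialize against other writers
        os.pwrite(fd, payload, block_no * BLOCK_SIZE)
        os.fsync(fd)                              # make it durable before unlocking
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

if __name__ == "__main__":
    serialized_write(0, b"\0" * BLOCK_SIZE)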

Now, shared remote block access should theoretically be handled by a block
layer driver, as DRBD does, but realistically it may be more appropriate to
let it be handled by the combining end user, such as OCFS or GFS.

The idea here is to simplify lower-layer implementations while removing any
preconceived dependencies, and to let upper layers operate freely without
incurring redundant overhead.

Look at ZFS; it violates layering by combining md/dm/lvm with the fs, but it
does this based on a realistic understanding of the problems involved, which
enables it to improve performance, flexibility, and functionality specific to
its use case.

This implies that there are two distinct forces at work here:

1. Layer components
2. Use-Case composers

Layer components should technically not implement any use case (other than
providing a plumbing framework), as that would incur unnecessary
dependencies, which could reduce their generality and thus reusability.

Use-Case composers can now leverage layer components from across the layering
hierarchy, to yield a specific use case implementation.

DRBD is such a Use-Case composer, as are md / dm / lvm and any fs in
general, whereas aoe / nbd / loop and the VFS / FUSE are examples of layer
components.

It follows that Use-Case composers, like DRBD, share common functionality
that should be factored out into layer components, which can then be
recomposed to implement a specific use case.
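
As a toy userland illustration of this split (all names are invented for the
example, and this is plain Python, not kernel code): the layer components
expose bare block access with no policy, while the Use-Case composer, here a
mirror standing in for DRBD or md raid1, builds a specific behaviour out of
them.

from typing import Protocol

class BlockComponent(Protocol):
    """A layer component: plain block access, no policy."""
    def read(self, block_no: int) -> bytes: ...
    def write(self, block_no: int, data: bytes) -> None: ...

class MemoryBlocks:
    """Stand-in for a local disk or an nbd/aoe/loop device."""
    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}
    def read(self, block_no: int) -> bytes:
        return self.blocks.get(block_no, b"")
    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data

class Mirror:
    """Use-Case composer: replicate writes across two components and read
    from the first one. The policy lives here, not in the components."""
    def __init__(self, local: BlockComponent, remote: BlockComponent) -> None:
        self.local, self.remote = local, remote
    def write(self, block_no: int, data: bytes) -> None:
        self.local.write(block_no, data)
        self.remote.write(block_no, data)   # synchronous replication
    def read(self, block_no: int) -> bytes:
        return self.local.read(block_no)

if __name__ == "__main__":
    m = Mirror(MemoryBlocks(), MemoryBlocks())
    m.write(0, b"hello")
    assert m.read(0) == b"hello"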


Thanks!

--
Al


2007-08-12 11:28:53

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])


On Aug 12 2007 13:35, Al Boldi wrote:
>Lars Ellenberg wrote:
>> meanwhile, please, anyone interessted,
>> the drbd paper for LinuxConf Eu 2007 is finalized.
>> http://www.drbd.org/fileadmin/drbd/publications/
>> drbd8.linux-conf.eu.2007.pdf
>>
>> but it does give a good overview about what DRBD actually is,
>> what exact problems it tries to solve,
>> and what developments to expect in the near future.
>>
>> so you can make up your mind about
>> "Do we need it?", and
>> "Why DRBD? Why not NBD + MD-RAID?"

I may have made a mistake when asking how it compares to NBD+MD.
Let me retry: what's the functional difference between
GFS2 on DRBD vs. GFS2 on a DAS SAN?

>Now, shared remote block access should theoretically be handled, as does
>DRBD, by a block layer driver, but realistically it may be more appropriate
>to let it be handled by the combining end user, like OCFS or GFS.


Jan
--

2007-08-12 11:51:46

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

On Sun, Aug 12, 2007 at 01:35:17PM +0300, Al Boldi ([email protected]) wrote:
> Lars Ellenberg wrote:
> > meanwhile, please, anyone interessted,
> > the drbd paper for LinuxConf Eu 2007 is finalized.
> > http://www.drbd.org/fileadmin/drbd/publications/
> > drbd8.linux-conf.eu.2007.pdf
> >
> > it does not give too much implementation detail (would be inapropriate
> > for conference proceedings, imo; some paper commenting on the source
> > code should follow).
> >
> > but it does give a good overview about what DRBD actually is,
> > what exact problems it tries to solve,
> > and what developments to expect in the near future.
> >
> > so you can make up your mind about
> > "Do we need it?", and
> > "Why DRBD? Why not NBD + MD-RAID?"
>
> Ok, conceptually your driver sounds really interesting, but when I read the
> pdf I got completely turned off. The problem is that the concepts are not
> clearly implemented, when in fact the concepts are really simple:
>
> Allow shared access to remote block storage with fault tolerance.
>
> The first thing to tackle here would be write serialization. Then start
> thinking about fault tolerance.
>
> Now, shared remote block access should theoretically be handled, as does
> DRBD, by a block layer driver, but realistically it may be more appropriate
> to let it be handled by the combining end user, like OCFS or GFS.
>
> The idea here is to simplify lower layer implementations while removing any
> preconceived dependencies, and let upper layers reign free without incurring
> redundant overhead.
>
> Look at ZFS; it illegally violates layering by combining md/dm/lvm with the
> fs, but it does this based on a realistic understanding of the problems
> involved, which enables it to improve performance, flexibility, and
> functionality specific to its use case.
>
> This implies that there are two distinct forces at work here:
>
> 1. Layer components
> 2. Use-Case composers
>
> Layer components should technically not implement any use case (other than
> providing a plumbing framework), as that would incur unnecessary
> dependencies, which could reduce its generality and thus reusability.
>
> Use-Case composers can now leverage layer components from across the layering
> hierarchy, to yield a specific use case implementation.
>
> DRBD is such a Use-Case composer, as is mdm / dm / lvm and any fs in general,
> whereas aoe / nbd / loop and the VFS / FUSE are examples of layer
> components.
>
> It follows that Use-case composers, like DRBD, need common functionality that
> should be factored out into layer components, and then recompose to
> implement a specific use case.

Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs
on top of distributed storage (which is a surprise to me, that holy zfs
supports that)?
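
For reference, that nbd+raid1 composition can be driven from userland roughly
as follows; the host, port and device names are made up, and the
nbd-client/mdadm invocations follow their commonly documented syntax, so check
them against the installed versions:

import subprocess

REMOTE_HOST = "backup.example.com"  # hypothetical peer exporting a device
NBD_PORT = "2000"
NBD_DEV = "/dev/nbd0"
LOCAL_DEV = "/dev/sdb1"             # hypothetical local mirror leg
MD_DEV = "/dev/md0"

def run(*argv: str) -> None:
    print("+", " ".join(argv))
    subprocess.run(argv, check=True)

if __name__ == "__main__":
    # 1. attach the remote export as a local block device
    run("nbd-client", REMOTE_HOST, NBD_PORT, NBD_DEV)
    # 2. mirror the local disk with the network leg; marking the nbd side
    #    write-mostly keeps reads on the local disk
    run("mdadm", "--create", MD_DEV, "--level=1", "--raid-devices=2",
        LOCAL_DEV, "--write-mostly", NBD_DEV)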

> Thanks!
>
> --
> Al

--
Evgeniy Polyakov

2007-08-12 15:28:35

by Al Boldi

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

Evgeniy Polyakov wrote:
> Al Boldi ([email protected]) wrote:
> > Look at ZFS; it illegally violates layering by combining md/dm/lvm with
> > the fs, but it does this based on a realistic understanding of the
> > problems involved, which enables it to improve performance, flexibility,
> > and functionality specific to its use case.
> >
> > This implies that there are two distinct forces at work here:
> >
> > 1. Layer components
> > 2. Use-Case composers
> >
> > Layer components should technically not implement any use case (other
> > than providing a plumbing framework), as that would incur unnecessary
> > dependencies, which could reduce its generality and thus reusability.
> >
> > Use-Case composers can now leverage layer components from across the
> > layering hierarchy, to yield a specific use case implementation.
> >
> > DRBD is such a Use-Case composer, as is mdm / dm / lvm and any fs in
> > general, whereas aoe / nbd / loop and the VFS / FUSE are examples of
> > layer components.
> >
> > It follows that Use-case composers, like DRBD, need common functionality
> > that should be factored out into layer components, and then recompose to
> > implement a specific use case.
>
> Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs
> on top of distributed storage (which is a surprise to me, that holy zfs
> supports that)?

Actually, I may not have been very clear in my Use-Case composer description:
I mean an internal in-kernel Use-Case composer, as opposed to an external
Userland Use-Case composer.

So, nbd+dm+raid1 would be an external Userland Use-Case composition, which
obviously could have some drastic performance issues.

DRBD and ZFS are examples of internal in-kernel Use-Case composers, which
obviously could show some drastic performance improvements.

Although you could allow in-kernel Use-Case composers to be run on top of
Userland Use-Case composers, that wouldn't be the preferred mode of
operation. Instead, you would for example recompose ZFS to incorporate an
in-kernel distributed storage layer component, like nbd.

All this boils down to refactoring Use-Case composers to produce layer
components with both in-kernel and userland interfaces. Once we have that,
it becomes a matter of plug-and-play to produce something awesome like ZFS.


Thanks!

--
Al

2007-08-12 16:44:18

by David Lang

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

On Sun, 12 Aug 2007, Jan Engelhardt wrote:

> On Aug 12 2007 13:35, Al Boldi wrote:
>> Lars Ellenberg wrote:
>>> meanwhile, please, anyone interessted,
>>> the drbd paper for LinuxConf Eu 2007 is finalized.
>>> http://www.drbd.org/fileadmin/drbd/publications/
>>> drbd8.linux-conf.eu.2007.pdf
>>>
>>> but it does give a good overview about what DRBD actually is,
>>> what exact problems it tries to solve,
>>> and what developments to expect in the near future.
>>>
>>> so you can make up your mind about
>>> "Do we need it?", and
>>> "Why DRBD? Why not NBD + MD-RAID?"
>
> I may have made a mistake when asking for how it compares to NBD+MD.
> Let me retry: what's the functional difference between
> GFS2 on DRBD vs. GFS2 on a DAS SAN?

GFS is a distributed filesystem, DRBD is a replicated block device. you
wouldn't do GFS on top of DRBD, you would do ext2/3, XFS, etc.

DRBD is much closer to the NBD+MD option.

now, I am not an expert on either option, but there are a couple of things
that I would question about the NBD+MD option:

1. when the remote machine is down, how does MD deal with it for reads and
writes?

2. MD over local drives will alternate reads between mirrors (or so I've
been told); doing so over the network is wrong.

3. when writing, will MD wait for the network I/O to get the data saved on
the backup before returning from the syscall? or can it sync the data out
lazily?

>> Now, shared remote block access should theoretically be handled, as does
>> DRBD, by a block layer driver, but realistically it may be more appropriate
>> to let it be handled by the combining end user, like OCFS or GFS.

there are times when you want to replicate at the block layer, and there
are times when you want to have a filesystem do the work. don't force a
filesystem on use-cases where a block device is the right answer.

David Lang

2007-08-12 17:04:37

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])


On Aug 12 2007 09:39, [email protected] wrote:
>
> now, I am not an expert on either option, but there are a couple of things that I
> would question about the NBD+MD option
>
> 1. when the remote machine is down, how does MD deal with it for reads and
> writes?

I suppose it kicks the drive and you'd have to re-add it by hand unless done by
a cronjob.
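
A sketch of what such a cronjob might look like, assuming hypothetical device
and peer names and deliberately crude /proc/mdstat parsing:

import subprocess

MD_DEV = "/dev/md0"     # hypothetical mirror
NBD_DEV = "/dev/nbd0"   # hypothetical network leg
PEER = "backup.example.com"

def peer_reachable() -> bool:
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", PEER],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def nbd_leg_needs_readd() -> bool:
    """Crude check: the nbd member is gone from /proc/mdstat, or still
    listed but marked failed, e.g. "nbd0[1](F)"."""
    with open("/proc/mdstat") as f:
        stat = f.read()
    if "nbd0" not in stat:
        return True
    return any("nbd0" in line and "(F)" in line for line in stat.splitlines())

if __name__ == "__main__":
    if peer_reachable() and nbd_leg_needs_readd():
        # with a write-intent bitmap this resyncs only the stale regions
        subprocess.run(["mdadm", MD_DEV, "--re-add", NBD_DEV], check=False)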

> 2. MD over local drive will alternate reads between mirrors (or so I've been
> told), doing so over the network is wrong.

Certainly. In which case you set "write_mostly" (or even write_only, not sure
of its name) on the raid component that is nbd.

> 3. when writing, will MD wait for the network I/O to get the data saved on the
> backup before returning from the syscall? or can it sync the data out lazily

Can't answer this one - ask Neil :)




Jan
--

2007-08-12 18:06:27

by Iustin Pop

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
>
> On Aug 12 2007 09:39, [email protected] wrote:
> >
> > now, I am not an expert on either option, but three are a couple things that I
> > would question about the DRDB+MD option
> >
> > 1. when the remote machine is down, how does MD deal with it for reads and
> > writes?
>
> I suppose it kicks the drive and you'd have to re-add it by hand unless done by
> a cronjob.

From my tests, since NBD doesn't have a timeout option, MD hangs in the
write to that mirror indefinitely, somewhat like when dealing with a
broken IDE driver/chipset/disk.

> > 2. MD over local drive will alternate reads between mirrors (or so I've been
> > told), doing so over the network is wrong.
>
> Certainly. In which case you set "write_mostly" (or even write_only, not sure
> of its name) on the raid component that is nbd.
>
> > 3. when writing, will MD wait for the network I/O to get the data saved on the
> > backup before returning from the syscall? or can it sync the data out lazily
>
> Can't answer this one - ask Neil :)

MD has the write-mostly/write-behind options - which help in this case
but only up to a certain amount.


In my experience DRBD wins hands-down over MD+NBD because MD doesn't
know about (or handle) a component that never returns from a write, which is
quite different from returning with an error. Furthermore, DRBD was
designed to handle transient errors in the connection to the peer due to
its network-oriented design, whereas MD is mostly designed with local or
at least high-reliability disks in mind (where the disk can be SAN, SCSI,
etc.), and such a failure is not normal for MD. Hence the need for manual
reconnects in the MD case and the automated handling of reconnects in the
case of DRBD.

I'm just a happy user of both MD over local disks and DRBD for networked
raid.

regards,
iustin

2007-08-13 01:41:31

by Paul Clements

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

Iustin Pop wrote:
> On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
>> On Aug 12 2007 09:39, [email protected] wrote:
>>> now, I am not an expert on either option, but three are a couple things that I
>>> would question about the DRDB+MD option
>>>
>>> 1. when the remote machine is down, how does MD deal with it for reads and
>>> writes?
>> I suppose it kicks the drive and you'd have to re-add it by hand unless done by
>> a cronjob.

Yes, and with a bitmap configured on the raid1, you just resync the
blocks that have been written while the connection was down.
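
As an example, an internal write-intent bitmap can be added to an existing
raid1 roughly like this (the /dev/md0 name is hypothetical):

import subprocess

# Afterwards a --re-add of the nbd leg resyncs only the regions dirtied
# while the mirror was degraded, instead of the whole device.
subprocess.run(["mdadm", "--grow", "/dev/md0", "--bitmap=internal"], check=True)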


> From my tests, since NBD doesn't have a timeout option, MD hangs in the
> write to that mirror indefinitely, somewhat like when dealing with a
> broken IDE driver/chipset/disk.

Well, if people would like to see a timeout option, I actually coded up
a patch a couple of years ago to do just that, but I never got it into
mainline because you can do almost as well by doing a check at
user-level (I basically ping the nbd connection periodically and if it
fails, I kill -9 the nbd-client).
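
A minimal sketch of that kind of user-level check, assuming a hypothetical
peer name and check interval (Paul's actual script is not shown in the thread,
and it may well differ):

import subprocess
import time

PEER = "backup.example.com"   # hypothetical nbd server
CHECK_INTERVAL = 10           # seconds between checks

def peer_alive() -> bool:
    """One ping with a short timeout; non-zero exit means unreachable."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", PEER],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def kill_nbd_client() -> None:
    # SIGKILL the nbd-client process; md then sees the mirror leg fail
    # instead of hanging on a write that never completes.
    subprocess.run(["pkill", "-9", "-x", "nbd-client"], check=False)

if __name__ == "__main__":
    while True:
        if not peer_alive():
            kill_nbd_client()
        time.sleep(CHECK_INTERVAL)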


>>> 2. MD over local drive will alternate reads between mirrors (or so I've been
>>> told), doing so over the network is wrong.
>> Certainly. In which case you set "write_mostly" (or even write_only, not sure
>> of its name) on the raid component that is nbd.
>>
>>> 3. when writing, will MD wait for the network I/O to get the data saved on the
>>> backup before returning from the syscall? or can it sync the data out lazily
>> Can't answer this one - ask Neil :)
>
> MD has the write-mostly/write-behind options - which help in this case
> but only up to a certain amount.

You can configure write_behind (aka, asynchronous writes) to buffer as
much data as you have RAM to hold. At a certain point, presumably, you'd
want to just break the mirror and take the hit of doing a resync once
your network leg falls too far behind.
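
For illustration, a mirror with write-behind enabled could be created along
these lines; the limit of 8192 outstanding writes and the device names are
arbitrary examples, not taken from the thread (write-behind needs a bitmap and
applies to the write-mostly leg):

import subprocess

subprocess.run(
    ["mdadm", "--create", "/dev/md0", "--level=1", "--raid-devices=2",
     "--bitmap=internal", "--write-behind=8192",   # let the nbd leg lag behind
     "/dev/sdb1", "--write-mostly", "/dev/nbd0"],
    check=True,
)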

--
Paul

2007-08-13 03:25:45

by David Lang

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

per the message below MD (or DM) would need to be modified to work
reasonably well with one of the disk components being over an unreliable
link (like a network link)

are the MD/DM maintainers interested in extending their code in this
direction? or would they prefer to keep it simpler by being able to
continue to assume that the raid components are connected over a highly
reliable connection?

if they are interested in adding (and maintaining) this functionality then
there is a real possibility that NBD+MD/DM could eliminate the need for
DRBD. however, if they are not interested in adding all the code to deal
with the network-type issues, then the argument that DRBD should not be
merged because you can do the same thing with MD/DM + NBD is invalid and
can be dropped/ignored.

David Lang

On Sun, 12 Aug 2007, Paul Clements wrote:

> Iustin Pop wrote:
>> On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
>> > On Aug 12 2007 09:39, [email protected] wrote:
>> > > now, I am not an expert on either option, but three are a couple
>> > > things that I
>> > > would question about the DRDB+MD option
>> > >
>> > > 1. when the remote machine is down, how does MD deal with it for reads
>> > > and
>> > > writes?
>> > I suppose it kicks the drive and you'd have to re-add it by hand unless
>> > done by
>> > a cronjob.
>
> Yes, and with a bitmap configured on the raid1, you just resync the blocks
> that have been written while the connection was down.
>
>
>> From my tests, since NBD doesn't have a timeout option, MD hangs in the
>> write to that mirror indefinitely, somewhat like when dealing with a
>> broken IDE driver/chipset/disk.
>
> Well, if people would like to see a timeout option, I actually coded up a
> patch a couple of years ago to do just that, but I never got it into mainline
> because you can do almost as well by doing a check at user-level (I basically
> ping the nbd connection periodically and if it fails, I kill -9 the
> nbd-client).
>
>
>> > > 2. MD over local drive will alternate reads between mirrors (or so
>> > > I've been
>> > > told), doing so over the network is wrong.
>> > Certainly. In which case you set "write_mostly" (or even write_only, not
>> > sure
>> > of its name) on the raid component that is nbd.
>> >
>> > > 3. when writing, will MD wait for the network I/O to get the data
>> > > saved on the
>> > > backup before returning from the syscall? or can it sync the data out
>> > > lazily
>> > Can't answer this one - ask Neil :)
>>
>> MD has the write-mostly/write-behind options - which help in this case
>> but only up to a certain amount.
>
> You can configure write_behind (aka, asynchronous writes) to buffer as much
> data as you have RAM to hold. At a certain point, presumably, you'd want to
> just break the mirror and take the hit of doing a resync once your network
> leg falls too far behind.
>
> --
> Paul
>

2007-08-13 12:57:16

by David Greaves

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

Paul Clements wrote:
> Well, if people would like to see a timeout option, I actually coded up
> a patch a couple of years ago to do just that, but I never got it into
> mainline because you can do almost as well by doing a check at
> user-level (I basically ping the nbd connection periodically and if it
> fails, I kill -9 the nbd-client).


Yes please.

David

2007-08-13 13:00:04

by David Greaves

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

[email protected] wrote:
> per the message below MD (or DM) would need to be modified to work
> reasonably well with one of the disk components being over an unreliable
> link (like a network link)
>
> are the MD/DM maintainers interested in extending their code in this
> direction? or would they prefer to keep it simpler by being able to
> continue to assume that the raid components are connected over a highly
> reliable connection?
>
> if they are interested in adding (and maintaining) this functionality
> then there is a real possibility that NBD+MD/DM could eliminate the need
> for DRBD. however if they are not interested in adding all the code to
> deal with the network type issues, then the argument that DRBD should
> not be merged because you can do the same thing with MD/DM + NBD is
> invalid and can be dropped/ignored
>
> David Lang

As a user I'd like to see md/nbd extended to cope with unreliable links.
I think md could be better at handling link exceptions. My unreliable memory
recalls sporadic issues with hot-plug leaving md hanging, and certain lower-level
errors (or even very high latency) causing unsatisfactory behaviour in what is
supposed to be a fault 'tolerant' subsystem.


Would this just be relevant to network devices, or would it improve support for
jostled USB and SATA hot-plugging, I wonder?

David

2007-08-13 13:11:32

by David Lang

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

On Mon, 13 Aug 2007, David Greaves wrote:

> [email protected] wrote:
>> per the message below MD (or DM) would need to be modified to work
>> reasonably well with one of the disk components being over an unreliable
>> link (like a network link)
>>
>> are the MD/DM maintainers interested in extending their code in this
>> direction? or would they prefer to keep it simpler by being able to
>> continue to assume that the raid components are connected over a highly
>> reliable connection?
>>
>> if they are interested in adding (and maintaining) this functionality then
>> there is a real possibility that NBD+MD/DM could eliminate the need for
>> DRDB. however if they are not interested in adding all the code to deal
>> with the network type issues, then the argument that DRDB should not be
>> merged becouse you can do the same thing with MD/DM + NBD is invalid and
>> can be dropped/ignored
>>
>> David Lang
>
> As a user I'd like to see md/nbd be extended to cope with unreliable links.
> I think md could be better in handling link exceptions. My unreliable memory
> recalls sporadic issues with hot-plug leaving md hanging and certain lower
> level errors (or even very high latency) causing unsatisfactory behaviour in
> what is supposed to be a fault 'tolerant' subsystem.
>
>
> Would this just be relevant to network devices or would it improve support
> for jostled usb and sata hot-plugging I wonder?

good question. I suspect that some of the error handling would be similar
(for example, not hanging the system when a device is unreachable), but
a lot of the rest would be different (do you really want to try to
auto-resync to a drive that you _think_ just reappeared? what if it's a
different drive? how can you be sure?). the error rate of a network is going
to be significantly higher than for USB or SATA drives (although I suppose
iSCSI would be similar)

David Lang

2007-08-13 13:18:50

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])


On Aug 12 2007 20:21, [email protected] wrote:
>
> per the message below MD (or DM) would need to be modified to work
> reasonably well with one of the disk components being over an
> unreliable link (like a network link)

Does not dm-multipath do something like that?

> are the MD/DM maintainers interested in extending their code in this direction?
> or would they prefer to keep it simpler by being able to continue to assume
> that the raid components are connected over a highly reliable connection?

Jan
--

2007-08-13 15:05:48

by David Greaves

[permalink] [raw]
Subject: Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

[email protected] wrote:
>> Would this just be relevant to network devices or would it improve
>> support for jostled usb and sata hot-plugging I wonder?
>
> good question, I suspect that some of the error handling would be
> similar (for devices that are unreachable not hanging the system for
> example), but a lot of the rest would be different (do you really want
> to try to auto-resync to a drive that you _think_ just reappeared,
Well, omit 'think' and the answer may be "yes". A lot of systems are quite
simple and RAID is common on the desktop now. If jostled USB fits into this
category - then "yes".

> what
> if it's a different drive? how can you be sure?
And that's the key, isn't it. We have the RAID device UUID and the superblock
info. Isn't that enough? If not, then given the work involved an extended
superblock wouldn't be unreasonable.
And I suspect the capabilities of devices would need recording in the superblock
too, e.g. 'retry-on-fail'.
I can see how md would fail a device but might then periodically retry it. If a
retry shows that it's back then it would validate it (UUID) and then resync it.
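
A userland sketch of that retry-validate-resync loop, purely as a statement of
the wished-for policy rather than current md behaviour; the device names are
hypothetical and the UUID comparison leans on mdadm's --detail and --examine
output:

import re
import subprocess
import time

MD_DEV = "/dev/md0"
MEMBER = "/dev/nbd0"
RETRY_INTERVAL = 30  # seconds

def reported_uuid(mode: str, dev: str) -> str | None:
    """Return the array UUID that `mdadm <mode> <dev>` reports, or None."""
    out = subprocess.run(["mdadm", mode, dev], capture_output=True, text=True)
    match = re.search(r"UUID\s*:\s*(\S+)", out.stdout)
    return match.group(1) if match else None

if __name__ == "__main__":
    while True:
        running = reported_uuid("--detail", MD_DEV)     # UUID of the active array
        candidate = reported_uuid("--examine", MEMBER)  # UUID in the member's superblock
        if running is not None and candidate == running:
            # same array, so re-add; with a bitmap this is only a partial resync
            subprocess.run(["mdadm", MD_DEV, "--re-add", MEMBER], check=False)
        time.sleep(RETRY_INTERVAL)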

> ) the error rate of a
> network is going to be significantly higher than for USB or SATA drives
> (although I suppose iSCSI would be similar)

I do agree - I was looking for value-add for the existing subsystem. If this
benefits existing RAID users then it's more likely to be attractive.

David