LinuxLists.cc - Distributed storage. Move away from char device ioctls.

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

Evgeniy Polyakov wrote:
> Hi.
>
> I'm pleased to announce fourth release of the distributed storage
> subsystem, which allows to form a storage on top of remote and local
> nodes, which in turn can be exported to another storage as a node to
> form tree-like storages.
>
> This release includes new configuration interface (kernel connector over
> netlink socket) and number of fixes of various bugs found during move
> to it (in error path).
>
> Further TODO list includes:
> * implement optional saving of mirroring/linear information on the remote
> nodes (simple)
> * new redundancy algorithm (complex)
> * some thoughts about distributed filesystem tightly connected to DST
> (far-far planes so far)
>
> Homepage:
> http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
>
> Signed-off-by: Evgeniy Polyakov <[email protected]>

My thoughts. But first a disclaimer: Perhaps you will recall me as
one of the people who really reads all your patches, and examines your
code and proposals closely. So, with that in mind...

I question the value of distributed block services (DBS), whether its
your version or the others out there. DBS are not very useful, because
it still relies on a useful filesystem sitting on top of the DBS. It
devolves into one of two cases: (1) multi-path much like today's SCSI,
with distributed filesystem arbitrarion to ensure coherency, or (2) the
filesystem running on top of the DBS is on a single host, and thus, a
single point of failure (SPOF).

It is quite logical to extend the concepts of RAID across the network,
but ultimately you are still bound by the inflexibility and simplicity
of the block device.

In contrast, a distributed filesystem offers far more scalability,
eliminates single points of failure, and offers more room for
optimization and redundancy across the cluster.

A distributed filesystem is also much more complex, which is why
distributed block devices are so appealing :)

With a redundant, distributed filesystem, you simply do not need any
complexity at all at the block device level. You don't even need RAID.

It is my hope that you will put your skills towards a distributed
filesystem :) Of the current solutions, GFS (currently in kernel)
scales poorly, and NFS v4.1 is amazingly bloated and overly complex.

I've been waiting for years for a smart person to come along and write a
POSIX-only distributed filesystem.

Jeff

2007-09-14 20:48:18

by Al Boldi

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

Jeff Garzik wrote:
> Evgeniy Polyakov wrote:
> > Hi.
> >
> > I'm pleased to announce fourth release of the distributed storage
> > subsystem, which allows to form a storage on top of remote and local
> > nodes, which in turn can be exported to another storage as a node to
> > form tree-like storages.
> >
> > This release includes new configuration interface (kernel connector over
> > netlink socket) and number of fixes of various bugs found during move
> > to it (in error path).
> >
> > Further TODO list includes:
> > * implement optional saving of mirroring/linear information on the
> > remote nodes (simple)
> > * new redundancy algorithm (complex)
> > * some thoughts about distributed filesystem tightly connected to DST
> > (far-far planes so far)
> >
> > Homepage:
> > http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
> >
> > Signed-off-by: Evgeniy Polyakov <[email protected]>
>
> My thoughts. But first a disclaimer: Perhaps you will recall me as
> one of the people who really reads all your patches, and examines your
> code and proposals closely. So, with that in mind...
>
> I question the value of distributed block services (DBS), whether its
> your version or the others out there. DBS are not very useful, because
> it still relies on a useful filesystem sitting on top of the DBS. It
> devolves into one of two cases: (1) multi-path much like today's SCSI,
> with distributed filesystem arbitrarion to ensure coherency, or (2) the
> filesystem running on top of the DBS is on a single host, and thus, a
> single point of failure (SPOF).
>
> It is quite logical to extend the concepts of RAID across the network,
> but ultimately you are still bound by the inflexibility and simplicity
> of the block device.
>
> In contrast, a distributed filesystem offers far more scalability,
> eliminates single points of failure, and offers more room for
> optimization and redundancy across the cluster.
>
> A distributed filesystem is also much more complex, which is why
> distributed block devices are so appealing :)
>
> With a redundant, distributed filesystem, you simply do not need any
> complexity at all at the block device level. You don't even need RAID.
>
> It is my hope that you will put your skills towards a distributed
> filesystem :) Of the current solutions, GFS (currently in kernel)
> scales poorly, and NFS v4.1 is amazingly bloated and overly complex.
>
> I've been waiting for years for a smart person to come along and write a
> POSIX-only distributed filesystem.

This http://lkml.org/lkml/2007/8/12/159 may provide a fast-path to reaching
that goal.

Thanks!

--
Al

2007-09-14 21:12:30

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
> My thoughts. But first a disclaimer: Perhaps you will recall me as one
> of the people who really reads all your patches, and examines your code and
> proposals closely. So, with that in mind...
>
> I question the value of distributed block services (DBS), whether its your
> version or the others out there. DBS are not very useful, because it still
> relies on a useful filesystem sitting on top of the DBS. It devolves into
> one of two cases: (1) multi-path much like today's SCSI, with distributed
> filesystem arbitrarion to ensure coherency, or (2) the filesystem running
> on top of the DBS is on a single host, and thus, a single point of failure
> (SPOF).
>
> It is quite logical to extend the concepts of RAID across the network, but
> ultimately you are still bound by the inflexibility and simplicity of the
> block device.
>
> In contrast, a distributed filesystem offers far more scalability,
> eliminates single points of failure, and offers more room for optimization
> and redundancy across the cluster.
>
> A distributed filesystem is also much more complex, which is why
> distributed block devices are so appealing :)
>
> With a redundant, distributed filesystem, you simply do not need any
> complexity at all at the block device level. You don't even need RAID.
>
> It is my hope that you will put your skills towards a distributed
> filesystem :) Of the current solutions, GFS (currently in kernel) scales
> poorly, and NFS v4.1 is amazingly bloated and overly complex.
>
> I've been waiting for years for a smart person to come along and write a
> POSIX-only distributed filesystem.

What exactly do you mean by "POSIX-only"?

--b.

2007-09-14 21:15:12

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

J. Bruce Fields wrote:
> On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
>> I've been waiting for years for a smart person to come along and write a
>> POSIX-only distributed filesystem.

> What exactly do you mean by "POSIX-only"?

Don't bother supporting attributes, file modes, and other details not
supported by POSIX. The prime example being NFSv4, which is larded down
with Windows features.

NFSv4.1 adds to the fun, by throwing interoperability completely out the
window.

Jeff

2007-09-14 21:18:45

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:
> J. Bruce Fields wrote:
>> On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
>>> I've been waiting for years for a smart person to come along and write a
>>> POSIX-only distributed filesystem.
>
>> What exactly do you mean by "POSIX-only"?
>
> Don't bother supporting attributes, file modes, and other details not
> supported by POSIX. The prime example being NFSv4, which is larded down
> with Windows features.

I am sympathetic.... Cutting those out may still leave you with
something pretty complicated, though.

> NFSv4.1 adds to the fun, by throwing interoperability completely out the
> window.

What parts are you worried about in particular?

--b.

2007-09-14 22:32:32

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

J. Bruce Fields wrote:
> On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:
>> J. Bruce Fields wrote:
>>> On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
>>>> I've been waiting for years for a smart person to come along and write a
>>>> POSIX-only distributed filesystem.
>>> What exactly do you mean by "POSIX-only"?
>> Don't bother supporting attributes, file modes, and other details not
>> supported by POSIX. The prime example being NFSv4, which is larded down
>> with Windows features.
>
> I am sympathetic.... Cutting those out may still leave you with
> something pretty complicated, though.

Far less complicated than NFSv4.1 though (which is easy :))

>> NFSv4.1 adds to the fun, by throwing interoperability completely out the
>> window.
>
> What parts are you worried about in particular?

I'm not worried; I'm stating facts as they exist today (draft 13):

NFS v4.1 does something completely without precedent in the history of
NFS: the specification is defined such that interoperability is
-impossible- to guarantee.

pNFS permits private and unspecified layout types. This means it is
impossible to guarantee that one NFSv4.1 implementation will be able to
talk another NFSv4.1 implementation.

Even if Linux supports the entire NFSv4.1 RFC (as it stands in draft 13
anyway), there is no guarantee at all that Linux will be able to store
and retrieve data, since it's entirely possible that a proprietary
protocol is required to access your data.

NFSv4.1 is no longer a completely open architecture.

Jeff

2007-09-14 22:42:32

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote:
> J. Bruce Fields wrote:
>> On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:
>>> J. Bruce Fields wrote:
>>>> On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
>>>>> I've been waiting for years for a smart person to come along and write
>>>>> a POSIX-only distributed filesystem.
>>>> What exactly do you mean by "POSIX-only"?
>>> Don't bother supporting attributes, file modes, and other details not
>>> supported by POSIX. The prime example being NFSv4, which is larded down
>>> with Windows features.
>> I am sympathetic.... Cutting those out may still leave you with
>> something pretty complicated, though.
>
> Far less complicated than NFSv4.1 though (which is easy :))

One would hope so.

>>> NFSv4.1 adds to the fun, by throwing interoperability completely out the
>>> window.
>> What parts are you worried about in particular?
>
> I'm not worried; I'm stating facts as they exist today (draft 13):
>
> NFS v4.1 does something completely without precedent in the history of NFS:
> the specification is defined such that interoperability is -impossible- to
> guarantee.
>
> pNFS permits private and unspecified layout types. This means it is
> impossible to guarantee that one NFSv4.1 implementation will be able to
> talk another NFSv4.1 implementation.

No, servers are required to support ordinary nfs operations to the
metadata server.

At least, that's the way it was last I heard, which was a while ago. I
agree that it'd stink (for any number of reasons) if you ever *had* to
get a layout to access some file.

Was that your main concern?

--b.

2007-09-15 02:55:14

by Mike Snitzer

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On 9/14/07, Jeff Garzik <[email protected]> wrote:
> Evgeniy Polyakov wrote:
> > Hi.
> >
> > I'm pleased to announce fourth release of the distributed storage
> > subsystem, which allows to form a storage on top of remote and local
> > nodes, which in turn can be exported to another storage as a node to
> > form tree-like storages.
> >
> > This release includes new configuration interface (kernel connector over
> > netlink socket) and number of fixes of various bugs found during move
> > to it (in error path).
> >
> > Further TODO list includes:
> > * implement optional saving of mirroring/linear information on the remote
> > nodes (simple)
> > * new redundancy algorithm (complex)
> > * some thoughts about distributed filesystem tightly connected to DST
> > (far-far planes so far)
> >
> > Homepage:
> > http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
> >
> > Signed-off-by: Evgeniy Polyakov <[email protected]>
>
> My thoughts. But first a disclaimer: Perhaps you will recall me as
> one of the people who really reads all your patches, and examines your
> code and proposals closely. So, with that in mind...
>
> I question the value of distributed block services (DBS), whether its
> your version or the others out there. DBS are not very useful, because
> it still relies on a useful filesystem sitting on top of the DBS. It
> devolves into one of two cases: (1) multi-path much like today's SCSI,
> with distributed filesystem arbitrarion to ensure coherency, or (2) the
> filesystem running on top of the DBS is on a single host, and thus, a
> single point of failure (SPOF).

This distributed storage is very much needed; even if it were to act
as a more capable/performant replacement for NBD (or MD+NBD) in the
near term. Many high availability applications don't _need_ all the
additional complexity of a full distributed filesystem. So given
that, its discouraging to see you trying to gently push Evgeniy away
from all the promising work he has published.

Evgeniy, please continue your current work.

Mike

2007-09-15 04:08:56

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

J. Bruce Fields wrote:
> On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote:
>> J. Bruce Fields wrote:
>>> On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:
>>>> NFSv4.1 adds to the fun, by throwing interoperability completely out the
>>>> window.

>>> What parts are you worried about in particular?

>> I'm not worried; I'm stating facts as they exist today (draft 13):
>>
>> NFS v4.1 does something completely without precedent in the history of NFS:
>> the specification is defined such that interoperability is -impossible- to
>> guarantee.
>>
>> pNFS permits private and unspecified layout types. This means it is
>> impossible to guarantee that one NFSv4.1 implementation will be able to
>> talk another NFSv4.1 implementation.

> No, servers are required to support ordinary nfs operations to the
> metadata server.
>
> At least, that's the way it was last I heard, which was a while ago. I
> agree that it'd stink (for any number of reasons) if you ever *had* to
> get a layout to access some file.
>
> Was that your main concern?

I just sorta assumed you could fall back to the NFSv4.0 mode of
operation, going through the metadata server for all data accesses.

But look at that choice in practice: you can either ditch pNFS
completely, or use a proprietary solution. The market incentives are
CLEARLY tilted in favor of makers of proprietary solutions. But it's a
poor choice (really little choice at all).

Overall, my main concern is that NFSv4.1 is no longer an open
architecture solution. The "no-pNFS or proprietary platform" choice
merely illustrate one of many negative aspects of this architecture.

One of NFS's biggest value propositions is its interoperability. To
quote some Wall Street guys, "NFS is like crack. It Just Works. We
love it."

Now, for the first time in NFS's history (AFAIK), the protocol is no
longer completely specified, completely known. No longer a "closed
loop." Private layout types mean that it is _highly_ unlikely that any
OS or appliance or implementation will be able to claim "full NFS
compatibility."

And when the proprietary portion of the spec involves something as basic
as accessing one's own data, I consider that a fundamental flaw. NFS is
no longer completely open.

Jeff

2007-09-15 04:40:43

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Sat, Sep 15, 2007 at 12:08:42AM -0400, Jeff Garzik wrote:
> J. Bruce Fields wrote:
>> No, servers are required to support ordinary nfs operations to the
>> metadata server.
>> At least, that's the way it was last I heard, which was a while ago. I
>> agree that it'd stink (for any number of reasons) if you ever *had* to
>> get a layout to access some file.
>> Was that your main concern?
>
> I just sorta assumed you could fall back to the NFSv4.0 mode of operation,
> going through the metadata server for all data accesses.

Right. So any two pNFS implementations *will* be able to talk to each
other; they just may not be able to use the (possibly higher-bandwidth)
read/write path that pNFS gives them.

> But look at that choice in practice: you can either ditch pNFS completely,
> or use a proprietary solution. The market incentives are CLEARLY tilted in
> favor of makers of proprietary solutions.

I doubt somebody would go to all the trouble to implement pNFS and then
present their customers with that kind of choice. But maybe I'm missing
something. What market incentives do you see that would make that more
attractive than either 1) using a standard fully-specified layout type,
or 2) just implementing your own proprietary protocol instead of pNFS?

> Overall, my main concern is that NFSv4.1 is no longer an open architecture
> solution. The "no-pNFS or proprietary platform" choice merely illustrate
> one of many negative aspects of this architecture.

It's always been possible to extend NFS in various ways if you want.
You could use sideband protocols with v2 and v3, for example. People
have done that. Some of them have been standardized and widely
implemented, some haven't. You could probably add your own compound ops
to v4 if you wanted, I guess.

And there's advantages to experimenting with extensions first and then
standardizing when you figure out what works. I wish it happened that
way more often.

> Now, for the first time in NFS's history (AFAIK), the protocol is no longer
> completely specified, completely known. No longer a "closed loop."
> Private layout types mean that it is _highly_ unlikely that any OS or
> appliance or implementation will be able to claim "full NFS compatibility."

Do you know of any such "private layout types"?

This is kind of a boring argument, isn't it? I'd rather hear whatever
ideas you have for a new distributed filesystem protocol.

--b.

2007-09-15 12:30:30

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

Hi Jeff.

On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik ([email protected]) wrote:
> >Further TODO list includes:
> >* implement optional saving of mirroring/linear information on the remote
> > nodes (simple)
> >* new redundancy algorithm (complex)
> >* some thoughts about distributed filesystem tightly connected to DST
> > (far-far planes so far)
> >
> >Homepage:
> >http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
> >
> >Signed-off-by: Evgeniy Polyakov <[email protected]>
>
> My thoughts. But first a disclaimer: Perhaps you will recall me as
> one of the people who really reads all your patches, and examines your
> code and proposals closely. So, with that in mind...

:)

> I question the value of distributed block services (DBS), whether its
> your version or the others out there. DBS are not very useful, because
> it still relies on a useful filesystem sitting on top of the DBS. It
> devolves into one of two cases: (1) multi-path much like today's SCSI,
> with distributed filesystem arbitrarion to ensure coherency, or (2) the
> filesystem running on top of the DBS is on a single host, and thus, a
> single point of failure (SPOF).
>
> It is quite logical to extend the concepts of RAID across the network,
> but ultimately you are still bound by the inflexibility and simplicity
> of the block device.

Yes, block device itself is not able to scale well, but it is the place
for redundancy, since filesystem will just fail if underlying device
does not work correctly and FS actually does not know about where it
should place redundancy bits - it might happen to be the same broken
disk, so I created a low-level device which distribute requests itself.
It is not allowed to mount it via multiple points, that is where
distributed filesystem must enter the show - multiple remote nodes
export its devices via network, each client gets address of the remote
node to work with, connect to it and process requests. All those bits
are already in the DST, next logical step is to connect it with
higher-layer filesystem.

> In contrast, a distributed filesystem offers far more scalability,
> eliminates single points of failure, and offers more room for
> optimization and redundancy across the cluster.
>
> A distributed filesystem is also much more complex, which is why
> distributed block devices are so appealing :)
>
> With a redundant, distributed filesystem, you simply do not need any
> complexity at all at the block device level. You don't even need RAID.
>
> It is my hope that you will put your skills towards a distributed
> filesystem :) Of the current solutions, GFS (currently in kernel)
> scales poorly, and NFS v4.1 is amazingly bloated and overly complex.
>
> I've been waiting for years for a smart person to come along and write a
> POSIX-only distributed filesystem.

Well, originally (about half a year ago) I started to draft a generic
filesystem which would be just superior to existing designs, not
overbloated like zfs, and just faster.
I do believe it can be implemented.
Further I added network capabilities (since what I saw that time
(AFS was proposed) I did not like - I'm not saying it is bad or
something like that at all, but I would implement things differently)
into design drafts.

When Chris Mason announced btrfs, I found that quite a few new ideas
are already implemented there, so I postponed project (although
direction of the developement of the btrfs seems to move to the zfs side
with some questionable imho points, so I think I can jump to the wagon
of new filesystems right now).

DST is low level for my (theoretical so far) filesystem (actually its
network part) like kevent was a low level system for network AIO (originally).

No matter what filesystem works with network it implements some kind
of logic completed in DST.
Sometimes it is very simple, sometimes a bit more complex, but
eventually it is a network entity with parts of stuff I put into DST.
Since I postponed the project (looking at btrfs and its results), I
completed DST as a standalone block device.

So, essentially, a filesystem with simple distributed facilities is on
(my) radar, but so far you are first who requested it :)

--
Evgeniy Polyakov

2007-09-15 12:34:52

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

Hi Mike.

On Fri, Sep 14, 2007 at 10:54:56PM -0400, Mike Snitzer ([email protected]) wrote:
> This distributed storage is very much needed; even if it were to act
> as a more capable/performant replacement for NBD (or MD+NBD) in the
> near term. Many high availability applications don't _need_ all the
> additional complexity of a full distributed filesystem. So given
> that, its discouraging to see you trying to gently push Evgeniy away
> from all the promising work he has published.
>
> Evgeniy, please continue your current work.

Thanks Mike, I work on this and will until feel it is completed.

Distributed filesystem is a logical continuation of the whoe idea of
storing data on the several remote nodes - DST and FS must exist
together for the maximum performance, but that does not mean that block
layer itself should be abandoned. As you probably noticed from my mail
to Jeff, distributed storage was originally part of the overall
filesystem design.

--
Evgeniy Polyakov

2007-09-15 14:20:25

by Robin Humble

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
>It is my hope that you will put your skills towards a distributed
>filesystem :) Of the current solutions, GFS (currently in kernel)
>scales poorly, and NFS v4.1 is amazingly bloated and overly complex.
>
>I've been waiting for years for a smart person to come along and write a
>POSIX-only distributed filesystem.

it's called Lustre.
works well, scales well, is widely used, is GPL.
sadly it's not in mainline.

cheers,
robin

2007-09-15 14:35:30

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

Robin Humble wrote:
> On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
>> It is my hope that you will put your skills towards a distributed
>> filesystem :) Of the current solutions, GFS (currently in kernel)
>> scales poorly, and NFS v4.1 is amazingly bloated and overly complex.
>>
>> I've been waiting for years for a smart person to come along and write a
>> POSIX-only distributed filesystem.
>
> it's called Lustre.
> works well, scales well, is widely used, is GPL.
> sadly it's not in mainline.

Lustre is tilted far too much towards high-priced storage, and needs
improvement before it could be considered for mainline.

Jeff

2007-09-15 16:21:33

by Robin Humble

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Sat, Sep 15, 2007 at 10:35:16AM -0400, Jeff Garzik wrote:
>Robin Humble wrote:
>>On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
>>>I've been waiting for years for a smart person to come along and write a
>>>POSIX-only distributed filesystem.
>>it's called Lustre.
>>works well, scales well, is widely used, is GPL.
>>sadly it's not in mainline.
>Lustre is tilted far too much towards high-priced storage,

many (most?) Lustre deployments are with SATA and md raid5 and GigE -
can't get much cheaper than that.

if you want storage node failover capabilities (which larger sites often
do) or want to saturate an IB link then the price of the storage goes
up but this is a consequence of wanting more reliability or performance,
not anything to do with lustre.

interestingly, one of the ways to provide dual-attached storage behind
a failover pair of lustre servers (apart from buying SAS) would be via
a networked-raid-1 device like Evgeniy's, so I don't see distributed
block devices and distributed filesystems as being mutually exclusive.
iSER (almost in http://stgt.berlios.de/) is also intriguing.

>and needs
>improvement before it could be considered for mainline.

quite likely.
from what I understand (hopefully I am mistaken) they consider a merge
task to be too daunting as the number of kernel subsystems that any
scalable distributed filesystem touches is necessarily large.

roadmaps indicate that parts of lustre are likely to move to userspace
(partly to ease solaris and ZFS ports) so perhaps those performance
critical parts that remain kernel space will be easier to merge.

cheers,
robin

2007-09-15 17:24:34

by Andreas Dilger

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote:
> Yes, block device itself is not able to scale well, but it is the place
> for redundancy, since filesystem will just fail if underlying device
> does not work correctly and FS actually does not know about where it
> should place redundancy bits - it might happen to be the same broken
> disk, so I created a low-level device which distribute requests itself.

I actually think there is a place for this - and improvements are
definitely welcome. Even Lustre needs block-device level redundancy
currently, though we will be working to make Lustre-level redundancy
available in the future (the problem is WAY harder than it seems at
first glance, if you allow writeback caches at the clients and servers).

> When Chris Mason announced btrfs, I found that quite a few new ideas
> are already implemented there, so I postponed project (although
> direction of the developement of the btrfs seems to move to the zfs side
> with some questionable imho points, so I think I can jump to the wagon
> of new filesystems right now).

This is an area I'm always a bit sad about in OSS development - the need
everyone has to make a new {fs, editor, gui, etc} themselves instead of
spending more time improving the work we already have. Imagine where the
internet would be (or not) if there were 50 different network protocols
instead of TCP/IP? If you don't like some things about btrfs, maybe you
can fix them?

To be honest, developing a new filesystem that is actually widely useful
and used is a very time consuming task (see Reiserfs and Reiser4). It
takes many years before the code is reliable enough for people to trust it,
so most likely any effort you put into this would be wasted unless you can
come up with something that is dramatically better than something existing.

The part that bothers me is that this same effort could have been used to
improve something that more people would use (btrfs in this case). Of
course, sometimes the new code is substantially better than what currently
exists, and I think btrfs may have laid claim to the current generation of
filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-09-15 17:50:57

by Andreas Dilger

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Sep 15, 2007 12:20 -0400, Robin Humble wrote:
> On Sat, Sep 15, 2007 at 10:35:16AM -0400, Jeff Garzik wrote:
> >Lustre is tilted far too much towards high-priced storage,
>
> many (most?) Lustre deployments are with SATA and md raid5 and GigE -
> can't get much cheaper than that.

I have to agree - while Lustre CAN scale up to huge servers and fat pipes,
it can definitely also scale down (which is a LOT easier to do :-). I can
run a client + MDS + 5 OSTs in a single UML instance using loop devices
for testing w/o problems.

> interestingly, one of the ways to provide dual-attached storage behind
> a failover pair of lustre servers (apart from buying SAS) would be via
> a networked-raid-1 device like Evgeniy's, so I don't see distributed
> block devices and distributed filesystems as being mutually exclusive.

That is definitely true, and there are a number of users who run in
this mode. We're also working to make Lustre handle the replication
internally (RAID5/6+ at the OST level) so you wouldn't need any kind of
block-level redundancy at all. I suspect some sites may still use RAID5/6
back-ends anyways to avoid performance loss from taking out a whole OST
due to a single disk failure, but that would definitely not be required.

> >and needs improvement before it could be considered for mainline.

It's definitely true, and we are always working at improving it. It
used to be in the past that one of the reasons we DIDN'T want to go
into mainline was because this would restrict our ability to make
network protocol changes. Because our install base is large enough
and many of the large sites with mutliple supercomputers mounting
multiple global filesystems we aren't at liberty to change the network
protocol at will anymore. That said, we also have network protocol
versioning that is akin to the ext3 COMPAT/INCOMPAT feature flags, so
we are able to add/change features without breaking old clients

> from what I understand (hopefully I am mistaken) they consider a merge
> task to be too daunting as the number of kernel subsystems that any
> scalable distributed filesystem touches is necessarily large.

That's partly true - Lustre has its own RDMA RPC mechanism, but it does
not need kernel patches anymore (we removed the zero-copy callback and
do this at the protocol level because there was too much resistance to it).
We are now also able to run a client filesystem that doesn't require any
kernel patches, since we've given up on trying to get the intents and
raw operations into the VFS, and have worked out other ways to improve
the performance to compensate. Likewise with parallel directory operations.

It's a bit sad, in a way, because these are features that other filesystems
(especially network fs) could have benefitted from also.

> roadmaps indicate that parts of lustre are likely to move to userspace
> (partly to ease solaris and ZFS ports) so perhaps those performance
> critical parts that remain kernel space will be easier to merge.

This is also true - when that is done the only parts that will remain
in the kernel are the network drivers. With some network stacks there
is even direct userspace acceleration. We'll use RDMA and direct IO to
avoid doing any user<->kernel data copies.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-09-16 07:07:38

by Kyle Moffett

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Sep 15, 2007, at 13:24:46, Andreas Dilger wrote:
> On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote:
>> Yes, block device itself is not able to scale well, but it is the
>> place for redundancy, since filesystem will just fail if
>> underlying device does not work correctly and FS actually does not
>> know about where it should place redundancy bits - it might happen
>> to be the same broken disk, so I created a low-level device which
>> distribute requests itself.
>
> I actually think there is a place for this - and improvements are
> definitely welcome. Even Lustre needs block-device level
> redundancy currently, though we will be working to make Lustre-
> level redundancy available in the future (the problem is WAY harder
> than it seems at first glance, if you allow writeback caches at the
> clients and servers).

I really think that to get proper non-block-device-level filesystem
redundancy you need to base it on something similar to the GIT
model. Data replication is done in specific-sized chunks indexed by
SHA-1 sum and you actually have a sort of "merge algorithm" for when
local and remote changes differ. The OS would only implement a very
limited list of merge algorithms, IE one of:

(A) Don't merge, each client gets its own branch and merges are manual
(B) Most recent changed version is made the master every X-seconds/
open/close/write/other-event.
(C) The tree at X (usually a particular client/server) is always
used as the master when there are conflicts.

This lets you implement whatever replication policy you want: You
can require that some files are replicated (cached) on *EVERY*
system, you can require that other files are cached on at least X
systems. You can say "this needs to be replicated on at least X% of
the online systems, or at most Y". Moreover, the replication could
be done pretty easily from userspace via a couple syscalls. You also
automatically keep track of history with some default purge policy.

The main point is that for efficiency and speed things are *not*
always replicated; this also allows for offline operation. You would
of course have "userspace" merge drivers which notice that the tree
on your laptop is not a subset/superset of the tree on your desktop
and do various merges based on per-file metadata. My address-book,
for example, would have a custom little merge program which knows
about how to merge changes between two address book files, asking me
useful questions along the way. Since a lot of this merging is
mechanical, some of the code from GIT could easily be made into a
"merge library" which knows how to do such things.

Moreover, this would allow me to have a "shared" root filesystem on
my laptop and desktop. It would have 'sub-project'-type trees, so
that "/" would be an independent branch on each system. "/etc" would
be separate branches but manually merged git-style as I make
changes. "/home/*" folders would be auto-created as separate
subtrees so each user can version their own individually. Specific
subfolders (like address-book, email, etc) would be adjusted by the
GUI programs that manage them to be separate subtrees with manual-
merging controlled by that GUI program.

Backups/dumps/archival of such a system would be easy. You would
just need to clone the significant commits/trees/etc to a DVD and
replace the old SHA-1-indexed objects to tiny "object-deleted" stubs;
to rollback to an archived version you insert the DVD, "mount" it
into the existing kernel SHA-1 index, and then mount the appropriate
commit as a read-only volume somewhere to access. The same procedure
would also work for wide-area-network backups and such.

The effective result would be the ability to do things like the
following:
(A) Have my homedir synced between both systems mostly-
automatically as I make changes to different files on both systems
(B) Easily have 2 copies of all my files, so if one system's disk
goes kaput I can just re-clone from the other.
(C) Keep archived copies of the last 5 years worth of work,
including change history, on a stack of DVDs.
(D) Synchronize work between locations over a relatively slow
link without much work.

As long as files were indirectly indexed by sub-block SHA1 (with the
index depth based on the size of the file), and each individually-
SHA1-ed object could have references, you could trivially have a 4TB-
sized file where you modify 4 bytes at a thousand random locations
throughout the file and only have to update about 5MB worth of on-
disk data. The actual overhead for that kind of operation under any
existing filesystem would be 100% seek-dominated regardless whereas
with this mechanism you would not directly be overwriting data and so
you could append all the updates as a single 5MB chunk. Data reads
would be much more seek-y, but you could trivially have an on-line
defragmenter tool which notices fragmented commonly-accessed inode
objects and creates non-fragmented copies before deleting the old ones.

There's a lot of other technical details which would need resolution
in an actual implementation, but this is enough of a summary to give
you the gist of the concept. Most likely there will be some major
flaw which makes it impossible to produce reliably, but the concept
contains the things I would be interested in for a real "networked
filesystem".

Cheers,
Kyle Moffett

2007-09-16 13:44:37

[permalink] [raw]

Subject: Re: Distributed storage. Move away from char device ioctls.

On Sat, Sep 15, 2007 at 11:24:46AM -0600, Andreas Dilger ([email protected]) wrote:
> > When Chris Mason announced btrfs, I found that quite a few new ideas
> > are already implemented there, so I postponed project (although
> > direction of the developement of the btrfs seems to move to the zfs side
> > with some questionable imho points, so I think I can jump to the wagon
> > of new filesystems right now).
>
> This is an area I'm always a bit sad about in OSS development - the need
> everyone has to make a new {fs, editor, gui, etc} themselves instead of
> spending more time improving the work we already have. Imagine where the

If that would be true, we would be still in the stone age.
Or not, actually I think the first cell in the universe would not bother
itself dividing into the two just because it could spent infinite time
trying to make itself better.

> internet would be (or not) if there were 50 different network protocols
> instead of TCP/IP? If you don't like some things about btrfs, maybe you
> can fix them?

When some idea is implemented it is virtually impossible to change it,
only recreate new one with fixed issues. So, we have multiple ext,
reiser and many others. I do not say btrfs is broken or has design
problems, it is really interesting filesystem, but all we have our own
opinions about how things should be done, that's it.

Btw, we do have so many network protocols for different purposes, that
number of (storage) filesystems is negligebly small compared to it.
Internet as is popular today is just a subset of where network is used.

And we do invent new protocols each time we need something new, which
does not fit into existing models (for example TCP by design can not
work with very long-distance links with tooo long RTT). We have sctp to
fix some tcp issues. Number of IP layer 'neighbours' is even more.
Physical media layer has many different protocols too.
And that is just what exists in the linux tree...

> To be honest, developing a new filesystem that is actually widely useful
> and used is a very time consuming task (see Reiserfs and Reiser4). It
> takes many years before the code is reliable enough for people to trust it,
> so most likely any effort you put into this would be wasted unless you can
> come up with something that is dramatically better than something existing.

Yep, I know.
Wasting my time is one of the most pleasant things I ever tried in my life.

> The part that bothers me is that this same effort could have been used to
> improve something that more people would use (btrfs in this case). Of
> course, sometimes the new code is substantially better than what currently
> exists, and I think btrfs may have laid claim to the current generation of
> filesystems.

Call me greedy bastard, but I do not care about world happiness, it is
just impossible to achieve. So I like what I do right now.
If it will be rest under the layer of dust I do not care, I like the
process of creating, so if it will fail, I just will get new knowledge.

:)

> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

--
Evgeniy Polyakov

2007-10-26 10:45:58