2002-09-04 07:11:33

by Helge Hafting

Subject: Re: [RFC] mount flag "direct" (fwd)

"Peter T. Breuer" wrote:
>
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> > On 2002-09-03T18:29:02,
> > "Peter T. Breuer" <[email protected]> said:
> >
> > > > Let's say you have a perfect locking mechanism, a fake SCSI layer
> > > OK.
> >
> > BTW, I would like to see your perfect distributed locking mechanism.
>
> That bit's easy and is done. The "trick" is NOT to distribute the lock,
> but to have it in one place - on the driver that guards the remote
> disk resource.
>
> > > The directory entry would certainly have to be reread after a write
> > > operation on disk that touched it - or more simply, the directory entry
> > > would have to be reread every time it were needed, i.e. be uncached.
> >
> > *ouch* Sure. Right. You just have to read it from scratch every time. How
> > would you make readdir work?
>
> Well, one has to read it from scratch. I'll set about seeing how to do it.
> Clues welcome.
>
> > > If that presently is not possible, then I would like to think about
> > > making it possible.
> >
> > Just please, tell us why.
>
> You don't really want the whole rationale. It concerns certain
> European (nay, world ..) scientific projects and the calculations of the
> technologists about the progress in hardware over the next few years.
> We/they foresee that we will have to move to multiple relatively small
> distributed disks per node in order to keep the bandwidth per unit of
> storage at the levels that they will have to be at to keep the farms
> fed. We are talking petabytes of data storage in thousands of nodes
> moving over gigabit networks.
>
> The "big view" calculations indicate that we must have distributed
> shared writable data.
>
Increasing demands for performance may indeed force a need
for shared writeable data someday. Several solutions for that are
being developed.
Your idea about re-reading stuff over and over isn't going to help
because that sort of thing consumes much more bandwidth. Caches help
because they _avoid_ data transfers. So shared writeable data
will happen, and it will use some sort of cache coherency,
for performance reasons.
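
The effect is easy to see from userspace. A rough sketch, assuming a
reasonably current Linux/glibc with posix_fadvise(2) available (the
file name is just a placeholder): read a file twice, then throw its
pages away and read it a third time. The second read costs almost
nothing; the third pays for the disk all over again.

/* Time three reads of the same file; drop the page cache before the
 * third one to show what forcing a re-read from disk costs. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double read_all(int fd)
{
        static char buf[1 << 20];
        struct timespec t0, t1;

        lseek(fd, 0, SEEK_SET);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (read(fd, buf, sizeof(buf)) > 0)
                ;       /* just pull the data in */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(int argc, char **argv)
{
        int fd = open(argc > 1 ? argv[1] : "bigfile", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        printf("1st read          : %.3fs\n", read_all(fd));
        printf("2nd read (cached) : %.3fs\n", read_all(fd));

        /* Throw the cached pages away, as an uncached mount would do
         * implicitly on every access. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        printf("3rd read (dropped): %.3fs\n", read_all(fd));
        close(fd);
        return 0;
}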

> These calculations affect us all. They show us what way computing
> will evolve under the price and technology pressures. The calculations
> are only looking to 2006, but that's what they show. For example
> if we think about a 5PB system made of 5000 disks of 1TB each in a GE
> net, we calculate the aggregate bandwidth available in the topology as
> 50GB/s, which is less than we need in order to keep the nodes fed
> at the rates they could be fed at (yes, a few % loss translates into
> time and money). To increase available bandwidth we must have more
> channels to the disks, and more disks, ... well, you catch my drift.
>
> So, start thinking about general mechanisms to do distributed storage.
> Not particular FS solutions.
Distributed systems will need somewhat different solutions, because
they are fundamentally different. Existing fs'es like ext2 are built
around a single-node assumption. I claim that making a new fs from
scratch for the distributed case is easier than tweaking ext2
and 10-20 other existing fs'es to work in such an environment.
Making a new fs from scratch isn't such a big deal after all.

To make a historical parallel:
Data used to be stored on sequential media like tapes (or
even stacks of punched cards), and filesystems were developed
for tapes. Then disks came along.
Using a disk as a tape with the existing tape-fs'es
worked, but didn't give much benefit. So we got something
new - block-based filesystems designed to take advantage
of the new random-access media.

The case of distributed storage is similar: it is fundamentally
different from the one-node case, just as random-access media
were different from sequential media.

I think a new design that considers both the benefits and
problems of many nodes will be much better than trying to
patch the existing fs'es. An approach that starts with
throwing away the thousand-fold speedup provided by caching
isn't very convincing.

If you merely proposed making the VFS and existing fs'es
cache-coherent, then I'd agree it might work well, but
it'd be a _lot_ of work. Which is no problem
if you volunteer to do the work. But simplification
by throwing away caching _will_ be too slow; it certainly
doesn't fit the idea of getting more bandwidth.

More bandwidth won't help if you throw all of it, and then some,
away on massive re-reading of data.

Wanting a generic mechanism instead of a special fs
might be the way to go, but it'd be a generic mechanism
used by a bunch of new fs'es designed to work distributed.

There will probably be different needs for which people
will build different distributed fs'es. So a
"VDFS" makes sense for those fs'es, putting the common stuff
in one place. But I am sure the VDFS will contain cache
coherency calls for dropping pages from cache when
necessary, instead of dropping the cache unconditionally
in every case.
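
Something along these lines - purely hypothetical, none of these
names exist anywhere - is what I mean by targeted coherency calls
rather than a blanket "never cache anything":

/* Hypothetical sketch only; not an existing kernel interface.
 * Invalidation is targeted (this inode, this range, this directory)
 * instead of "drop everything, every time". */
#include <stdint.h>

struct vdfs_coherency_ops {
        /* Another node wrote [off, off+len) of this inode: drop those
         * pages from our cache. */
        void (*invalidate_range)(uint64_t ino, uint64_t off, uint64_t len);

        /* Another node changed this directory: drop its cached entries. */
        void (*invalidate_dir)(uint64_t dir_ino);

        /* We are about to write: warn the other nodes first. */
        int  (*announce_write)(uint64_t ino, uint64_t off, uint64_t len);
};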

Helge Hafting


2002-09-04 08:37:27

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct" (fwd)

"A month of sundays ago Helge Hafting wrote:"
> > The "big view" calculations indicate that we must have distributed
> > shared writable data.
> >
> Increasing demands for performance may indeed force a need
> for shared writeable data someday. Several solutions for that are
> being developed.
> Your idea about re-reading stuff over and over isn't going to help

I really don't see why you people don't get it. Rereading is a RARE
operation. Normally we write once and read once. That's all. Once
the data's in memory we use it.

And if we ever have to reread something, it will very very rarely be
metadata.

> because that sort of thing consumes much more bandwidth. Caches help
> because they _avoid_ data transfers. So shared writeable data

Tough. Data transfers are inevitable in this scenario. There's no
sense in trying to avoid them. Data comes in at A and goes out at B.
Ergo it's transfered.

> > So, start thinking about general mechanisms to do distributed storage.
> > Not particular FS solutions.
> Distributed systems will need somewhat different solutions, because
> they are fundamentally different. Existing fs'es like ext2 are built
> around a single-node assumption. I claim that making a new fs from

I am still getting a feel for the problem. Only avoiding directory
caching (and inode caching) has worried me. I looked at the name
lookup routines on the train and I don't see why one can't force a
reread from root every time, or a reread every time there is a
"changed" bit set in the sb.

> scratch for the distributed case is easier than tweaking ext2

No tweak. But I'm looking.

> The case of distributed storage is similar: it is fundamentally
> different from the one-node case, just as random-access media

I agree. But the case of one FS accessed from different nodes is not
fundamentally different from the situation we have now. It requires
locking. It also requires either explicit sharing of cached
information, or no caching (which is the same thing :-). I merely
opine that the latter is easier to try first and may not be so bad.

> If you merely proposed making the VFS and existing fs'es
> cache-coherent, then I'd agree it might work well, but

I'm proposing making no caching _possible_. Not mandatory, but
_possible_. If you like, you can see it as a trivial case of cache
sharing.

> by throwing away caching _will_ be too slow, it certainly

Why? The only thing I've seen mentioned that might slow things down
is that at every open we have to trace the full path anew. So what?
OK, so there are also objections about what happens if one kernel frees
the data and another adds to it. I'm thinking about what that implies.

> There will probably be different needs for which people
> will build different distributed fs'es. So a
> "VDFS" makes sense for those fs'es, putting the common stuff
> in one place. But I am sure the VDFS will contain cache
> coherency calls for dropping pages from cache when
> necessary, instead of dropping the cache unconditionally
> in every case.

That's possible, but right now I don't know any way of saying to the
kernel "I just stepped all over X on disk, please invalidate anything
you have cached that points to X". I'd like it very much in the
buffering layer too (i.e. the VM).
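
The closest thing I know of today is the BLKFLSBUF ioctl, which
flushes and throws away the cached buffers for a whole block device.
That is far too coarse - I want "everything that points to X", not
"everything" - but it is the flavour of call I'm after. Roughly (the
device name is just an example, and it needs root):

/* Drop the kernel's cached buffers for an entire block device. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* BLKFLSBUF */

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "/dev/sdb";
        int fd = open(dev, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (ioctl(fd, BLKFLSBUF, 0) < 0) {
                perror("ioctl(BLKFLSBUF)");
                return 1;
        }
        printf("flushed and invalidated cached buffers for %s\n", dev);
        close(fd);
        return 0;
}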

Peter

2002-09-04 08:36:41

by Andreas Dilger

Subject: Re: [RFC] mount flag "direct" (fwd)

On Sep 04, 2002 09:16 +0200, Helge Hafting wrote:
> Your idea about re-reading stuff over and over isn't going to help
> because that sort of thing consumes much more bandwidth. Caches help
> because they _avoid_ data transfers. So shared writeable data
> will happen, and it will use some sort of cache coherency,
> for performance reasons.

You assume too much about the applications. For example, Oracle
does not want _any_ caching to be done by the OS, because it
manages the cache itself, and would rather allocate the full amount
of RAM itself than have the OS duplicate data it is already caching
internally.

Similarly, there are many "write only" applications that are only
hindered by the OS cache, such as any kind of high-speed data recording
(video, particle accelerators, scientific computing, etc.) which uses
most of its RAM for internal structures and wants the data it
writes to go directly to disk at the highest possible speed.
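
This is exactly what O_DIRECT is for: the application does its own
suitably aligned I/O and the OS keeps no copy of the data. A rough
sketch (file name, chunk size and alignment are only illustrative;
the real requirements depend on the filesystem and device):

/* Write-only recording without polluting the OS cache: open with
 * O_DIRECT and write page-aligned, block-multiple sized chunks. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (512 * 1024)    /* 512 KB aligned writes */
#define COUNT   64

int main(void)
{
        void *buf;
        int fd, i;

        if (posix_memalign(&buf, 4096, CHUNK) != 0) {   /* O_DIRECT needs alignment */
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }
        memset(buf, 0xab, CHUNK);       /* pretend this is freshly recorded data */

        fd = open("capture.dat", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < COUNT; i++) {
                if (write(fd, buf, CHUNK) != CHUNK) {   /* bypasses the page cache */
                        perror("write");
                        return 1;
                }
        }
        close(fd);
        free(buf);
        return 0;
}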

> I claim that making a new fs from scratch for the distributed
> case is easier than tweaking ext2 and 10-20 other existing fs'es
> to work in such an environment. Making a new fs from scratch
> isn't such a big deal after all.

The problem isn't making a new fs, the problem is making a _good_
new fs. It takes at least several years of development, testing,
tuning, etc. to get just a local fs right, if not longer (e.g.
reiserfs, JFS, XFS, ext3, etc.). Add in the complexity of the
network side of things and it just gets that much harder to do
it all well.

We have taken the approach that local filesystems do a good job
with the "one node" assumption, so just use them as-is to
do a job they are good at. All of the network and locking code
for Lustre is outside of the filesystem, and the "local" filesystems
are used for storing either the directory structure + attributes
(for the metadata server), or file data (for the storage targets).

Local filesystems can do both of those jobs very well already, so
no need to re-invent the wheel.

See http://www.lustre.org/docs.html for lots of papers and
documentation on the design of Lustre.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/