2002-09-07 19:54:43

by Peter T. Breuer

Subject: Re: [[email protected]: Re: [RFC] mount flag "direct" (fwd)]

"A month of sundays ago Lars Marowsky-Bree wrote:"
> as per your request, I am forwarding this mail to you again. The main point

Thanks.

> you'll find is that yes, I believe that your idea _can_ be made to work. Quite
> frankly, there are very few ideas which _cannot_ be made to work. The
> interesting question is whether it is worth it to go a particular route or not.
>
> And let me say that I find it at least slightly rude to "miss" mail in a
> discussion; if you are so important that you get so much mail every day, maybe
> a public discussion on a mailing list isn't the proper way how to go about
> something...

Well, I'm sorry. I did explain that I am travelling, I think! It is
often very hard even to connect (it requires finding an airport kiosk
or an internet cafe), and then I have to _pay_ for the time to compose
a reply, and so on! If I don't read your mail for one day, it gets
filed somewhere for me by procmail, and I haven't been able to check
any of those filings ..

> > > *ouch* Sure. Right. You just have to read it from scratch every time. How
> > > would you make readdir work?
> > Well, one has to read it from scratch. I'll set about seeing how to do.
> > CLues welcome.
>
> Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
> OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
> PvFS etc.

Eh, I thought I saw this - didn't I reply?

> Any of them will appreciate the good work of a bright fellow.

Well, I know of some of these. Intermezzo I've tried lately and found
near impossible to set up and work with (still, a great improvement
over Coda, which was absolutely impossible, to within an atom's
breadth). And it's nowhere near got the right orientation. Lustre is
what people have been pointing me at. What happened to Petal?

> Noone appreciates reinventing the wheel another time, especially if - for
> simplification - it starts out as a square.

But what I suggest is finding a simple way to turn an existing FS into a
distributed one. I.e. NOT reinventing the wheel. All those other people
are reinventing a wheel, for some reason :-).

> You tell me why Distributed Filesystems are important. I fully agree.
>
> You fail to give a convincing reason why that must be made to work with
> "all" conventional filesystems, especially given the constraints this implies.

Because that's the simplest thing to do.

> Conventional wisdom seems to be that this can much better be handled specially
> by special filesystems, who can do finer grained locking etc because they
> understand the on disk structures, can do distributed journal recovery etc.

Well, how about allowing get_block to return an extra argument, which
is the on-disk placement of the inode(s) concerned, so that the VFS can
issue a lock request for them before the i/o starts? Let the FS return
the list of metadata things to lock, and maybe a semaphore to start the
i/o with.

There you are: instant distribution. It works for those FS's which
cooperate. Make sure the FS can indicate whether it replied or not.
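
Something of this shape is what I mean - a sketch only, with every name
made up for illustration rather than taken from the kernel:

    /* Illustrative only: a get_block variant that also reports which
     * on-disk metadata the VFS should lock before starting the i/o.
     * None of these names exist in the kernel; they just sketch the idea.
     */
    struct inode;                       /* the usual VFS inode */

    struct meta_extent {                /* one piece of on-disk metadata */
            unsigned long block;        /* device block holding it */
            unsigned int  len;          /* length in blocks */
    };

    struct lock_request {
            struct meta_extent extents[8];  /* metadata to lock before i/o */
            int                nr_extents;  /* -1 means "FS did not reply" */
            void              *start_sem;   /* optional FS-supplied gate */
    };

    /* A cooperating FS fills in 'lr'; one that leaves nr_extents at -1
     * gets the fallback treatment (whole-FS lock held on the server of
     * the disk). */
    int fs_get_block_locked(struct inode *inode, long iblock,
                            unsigned long *result_block,
                            struct lock_request *lr);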

> What you are starting would need at least 3-5 years to catch up with what
> people currently already can do, and they'll improve in this time too.

Maybe 3-4 weeks more like. The discussion is helping me get a picture,
and when I'm back next week I'll try something. Then, unfortunately I
am away again from the 18th ...

> I've seen your academic track record and it is surely impressive. I am not

I didn't even know it was available anywhere! (or that it was
impressive - thank you).

> saying that your approach won't work within the constraints. Given enough
> thrust, pigs fly. I'm just saying that it would be nice to learn what reasons
> you have for this, because I believe that "within the constraints" makes your
> proposal essentially useless (see the other mails).
>
> In particular, they make them useless for the requirements you seem to have. A
> petabyte filesystem without journaling? A petabyte filesystem with a single
> write lock? Gimme a break.

Journalling? Well, now you mention it, that would seem to be nice. But
my experience with journalling FS's so far tells me that they break
more horribly than normal. Also, 1PB or so is the aggregate, not the
size of each FS on the local nodes. I don't think you can diagnose
"journalling" from the numbers. I am even rather loath to journal,
given what I have seen.


> Please, do the research and tell us what features you desire to have which are
> currently missing, and why implementing them essentially from scratch is

No features. Just take any FS that currently works, and see if you can
distribute it. Get rid of all fancy features along the way. You mean
"what's wrong with X"? Well, it won't be mainstream, for a start, and
that's surely enough. The projects involved are huge, and they need to
minimize risk and maximize flexibility. This is CERN, by the way.


> preferrable to extending existing solutions.
>
> You are dancing around all the hard parts. "Don't have a distributed lock
> manager, have one central lock." Yeah, right, has scaled _really_ well in the
> past. Then you figure this one out, and come up with a lock-bitmap on the
> device itself for locking subtrees of the fs. Next you are going to realize
> that a single block is not scalable either because one needs exclusive write

I made suggestions, hoping that the suggestions would elicit a response
of some kind. I need to explore as much as I can and get as much as I
can back without "doing it first", because I need the insight you can
offer. I don't have the experience in this area, but I have enough
experience to know that I would need years with that code to be able
to generate the semantics from scratch. I'm happy with what I'm
getting. I hope I'll be able to return soon with a trial patch.


> lock to it, 'cause you can't just rewrite a single bit. You might then begin
> to explore that a single bit won't cut it, because for recovery you'll need to
> be able to pinpoint all locks a node had and recover them. Then you might
> begin to think about the difficulties in distributed lock management and

There is no difficulty with that - there are no distributed locks. All
locks are held on the server of the disk (I decided not to be
complicated to begin with, as a matter of principle early in life ;-).

> recovery. ("Transaction processing" is an exceptionally good book on that I
> believe)

Thanks but I don't feel like rolling it out and rolling it back!

> I bet you a dinner that what you are going to come up with will look
> frighteningly like one of the solutions which already exist; so why not

Maybe.

> research them first in depth and start working on the one you like most,
> instead of wasting time on an academic exercise?

Because I don't agree with your assessment of what I should waste my
time on. Though I'm happy to take it into account!

Maybe twenty years ago now I wrote my first disk based file system (for
functional databases) and I didn't like debugging it then! I positively
hate the thought of flattening trees and relating indices and pointers
now :-).

> > So, start thinking about general mechanisms to do distributed storage.
> > Not particular FS solutions.
>
> Distributed storage needs a way to access it; in the Unix paradigm,
> "everything is a file", that implies a distributed filesystem. Other
> approaches would include accessing raw blocks and doing the locking in the
> application / via a DLM (ie, what Oracle RAC does).

Yep, but we want "normality", just normality writ a bit larger than
normal.

> Lars Marowsky-Brée <[email protected]>

Thanks for the input. I don't know what I was supposed to take away
from it though!

Peter


2002-09-07 20:22:50

by Rik van Riel

Subject: Re: [[email protected]: Re: [RFC] mount flag "direct" (fwd)]

On Sat, 7 Sep 2002, Peter T. Breuer wrote:

> > Noone appreciates reinventing the wheel another time, especially if - for
> > simplification - it starts out as a square.
>
> But what I suggest is finding a simple way to turn an existing FS into a
> distributed one. I.e. NOT reinventing the wheel. All those other people
> are reinventing a wheel, for some reason :-).

To stick with the wheel analogy, while bicycle wheels will
fit on a 40-tonne truck, they won't even get you out of the
parking lot.

> > You tell me why Distributed Filesystems are important. I fully agree.
> >
> > You fail to give a convincing reason why that must be made to work with
> > "all" conventional filesystems, especially given the constraints this implies.
>
> Because that's the simplest thing to do.

You've already admitted that you would need to modify the
existing filesystems in order to create "filesystem independent"
clustered filesystem functionality.

If you're modifying filesystems, surely it no longer is filesystem
independent and you might as well design your filesystem so it can
do clustering in an _efficient_ way.

> > What you are starting would need at least 3-5 years to catch up with what
> > people currently already can do, and they'll improve in this time too.
>
> Maybe 3-4 weeks more like. The discussion is helping me get a picture,
> and when I'm back next week I'll try something. Then, unfortunately I
> am away again from the 18th ...

If you'd only spent 3-4 _days_ looking at clustered filesystems
you would see that it'd take months or years to get something
working decently.


> No features. Just take any FS that currently works, and see if you can
> distribute it. Get rid of all fancy features along the way. You mean
> "what's wrong with X"? Well, it won't be mainstream, for a start, and
> that's surely enough. The projects involved are huge, and they need to
> minimize risk, and maximize flexibility. This is CERN, by the way.

All you can hope for now is that CERN doesn't care about data
integrity or performance ;)

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-07 21:09:23

by Lars Marowsky-Bree

Subject: Re: [RFC] mount flag "direct"

On 2002-09-07T21:59:20,
"Peter T. Breuer" <[email protected]> said:

> > Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
> > OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
> > PvFS etc.
> Eh, I thought I saw this - didn't I reply?

No, you didn't.

> > Noone appreciates reinventing the wheel another time, especially if - for
> > simplification - it starts out as a square.
> But what I suggest is finding a simple way to turn an existing FS into a
> distributed one. I.e. NOT reinventing the wheel. All those other people
> are reinventing a wheel, for some reason :-).

Well, actually they aren't, exactly. The hard part in a "distributed
filesystem" isn't the filesystem itself, necessary though that of course is.
The locking, synchronization and cluster infrastructure is where the real
difficulty tends to arise.

Yes, it can be argued whether it is in fact easier to create a filesystem from
scratch with clustering in mind (so it is "optimised" for being able to do
fine-grained locking etc), or whether it is easier to prop a generic
clustering layer on top of existing ones.

The guesstimates of those involved in the past seem to suggest that the
first is the case. I also tend to think so, but I've been wrong before.

That would - indeed - be very helpful research to do. I would start by
comparing the places where those specialized fs's actually are doing cluster
related stuff and checking whether it can be abstracted, generalized and
improved. In any case, trying to pick apart OpenGFS for example will provide
you more insight into the problem area than a discussion on l-k.

If you want to look into "turn a local fs into a cluster fs", SGI has a
"clustered XFS"; however I'm not too sure how public that extension is. The
hooks might be in the common XFS core, though.

Now, going on with the gedankenexperiment, given a distributed lock manager
(IBM open-sourced one of theirs, though it is not currently perfectly working
;), the locking primitives in the filesystems could "simply" be changed from
local-node SMP spinlocks to cluster-wide locks.

That _should_ to a large degree take care of the locking.
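
Concretely, something like the following wrapper is what I mean;
dlm_lock()/dlm_unlock() here stand in for whatever DLM is actually used,
they are not a real API:

    /* Sketch only: the filesystem's locking primitive becomes a thin
     * wrapper over a cluster-wide lock instead of a local spinlock.
     * dlm_lock()/dlm_unlock() are placeholders for the DLM in use.
     */
    enum dlm_mode { DLM_SHARED, DLM_EXCLUSIVE };

    int dlm_lock(unsigned long long resource_id, enum dlm_mode mode);
    int dlm_unlock(unsigned long long resource_id);

    struct cluster_lock {
            unsigned long long resource_id;  /* e.g. inode number or block group */
    };

    static inline void cluster_lock_excl(struct cluster_lock *l)
    {
            dlm_lock(l->resource_id, DLM_EXCLUSIVE);  /* was: spin_lock(...) */
    }

    static inline void cluster_unlock(struct cluster_lock *l)
    {
            dlm_unlock(l->resource_id);               /* was: spin_unlock(...) */
    }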

What remains is the invalidation of cache pages; I would expect similar
problems must have arisen in NC-NUMA style systems, so looking there should
provide hints.

> > You fail to give a convincing reason why that must be made to work with
> > "all" conventional filesystems, especially given the constraints this
> > implies.
> Because that's the simplest thing to do.

Why? I disagree.

You will have to modify existing file systems quite a bit to work
_efficiently_ in a cluster environment; not even the on-disk layout is
guaranteed to stay consistent as soon as you add per-node journals etc. The
real complexity is in the distributed nature, in particular the recovery (see
below).

"Simplest thing to do" might be to take your funding and give it to the
OpenGFS group or have someone fix the Oracle Cluster FS.

> > In particular, they make them useless for the requirements you seem to
> > have. A petabyte filesystem without journaling? A petabyte filesystem with
> > a single write lock? Gimme a break.
> Journalling? Well, now you mention it, that would seem to be nice.

"Nice" ? ;-) You gotta be kidding. If you don't have journaling, distributed
recovery becomes near impossible - at least I don't have a good idea on how to
do it if you don't know what the node had been working on prior to its
failure.

If "take down the entire filesystem on all nodes, run fsck" is your answer to
that, I will start laughing in your face. Because then your requirements are
kind of from outer space and will certainly not reflect a large part of the
user base.

> > Please, do the research and tell us what features you desire to have which
> > are currently missing, and why implementing them essentially from scratch
> > is
> No features.

So they implement what you need, but you don't like them because there are
just so few of them to choose from? Interesting.

> Just take any FS that corrently works, and see if you can distribute it.
> Get rid of all fancy features along the way. The projects involved are
> huge, and they need to minimize risk, and maximize flexibility. This is
> CERN, by the way.

Well, you are taking quite a risk trying to run a
not-aimed-at-distributed-environments fs and trying to make it distributed by
force. I _believe_ that you are missing where the real trouble lurks.

You maximize flexibility for mediocre solutions; little caching, no journaling
etc.

What does this supposed "flexibility" buy you? Is there any real value in it
or is it a "because!" ?

> You mean "what's wrong with X"? Well, it won't be mainstream, for a start,
> and that's surely enough.

I have pulled these two sentences out because I don't get them. What "X" are
you referring to?

> of some kind. I need to explore as much as I can and get as much as I
> can back without "doing it first", because I need the insight you can
> offer.

The insight I can offer you is: look at OpenGFS, see and understand what it
does, why and how. Then try to come up with a generic approach on how to put
this on top of a generic filesystem without making it useless.

Then I shall be amazed.

> There is no difficulty with that - there are no distributed locks. All locks
> are held on the server of the disk (I decided not to be complicated to
> begin with as a matter of principle early in life ;-).

Maybe you and I have a different idea of "distributed fs". I thought you had a
central pool of disks.

You want there to be local disks at each server, and other nodes can read
locally and have it appear as a big, single filesystem? You'll still have to
deal with node failure though.

Interesting.

One might consider peeling apart metadata (which always goes through the
"home" node) and data (which goes directly to disk via the SAN); if necessary,
the reply to the metadata request to the home node could tell the node where
to write/read. This smells a lot like cXFS and co with a central metadata
server.
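
The reply from the home node need carry little more than something like
this (purely illustrative, not taken from cXFS or any real protocol):

    /* Illustrative reply from the "home" (metadata) node, telling the
     * requesting node where on the shared SAN device the data lives,
     * so the data i/o itself bypasses the home node.
     */
    struct data_map_reply {
            unsigned long long lock_token;   /* proof the range may be touched */
            unsigned int       device_id;    /* which SAN device to talk to */
            unsigned long long start_block;  /* where the extent begins */
            unsigned int       nr_blocks;    /* extent length */
    };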

> > recovery. ("Transaction processing" is an exceptionally good book on that
> > I believe)
> Thanks but I don't feel like rolling it out and rolling it back!

Please explain how you'll recover anywhere close to "fast" or even
"acceptable" without transactions. Even if you don't have to fsck the petabyte
filesystem completely, do a benchmark on how long e2fsck takes on, oh, 50GB
only.

> Thanks for the input. I don't know what I was supposed to take away
> from it though!

I apologize and am sorry if you didn't notice.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister

2002-09-07 23:16:13

by Andreas Dilger

Subject: Re: [[email protected]: Re: [RFC] mount flag "direct" (fwd)]

On Sep 07, 2002 21:59 +0200, Peter T. Breuer wrote:
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> > Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
> > OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
> > PvFS etc.
> > Any of them will appreciate the good work of a bright fellow.
>
> Well, I know of some of these. Intermezzo I've tried lately and found
> near impossible to set up and work with (still, a great improvement
> over coda, which was absolutely impossible, to within an atom's
> breadth). And it's nowhere near got the right orientation. Lustre
> people have been pointing me at. What happened to petal?

Well, Intermezzo has _some_ of what you are looking for, but isn't
really geared to your needs. It is a distributed _replicated_
filesystem, so it doesn't necessarily scale as well as possible for
the many-nodes-writing case.

Lustre is actually very close to what you are talking about, which I
have mentioned a couple of times before. It has distributed storage,
so each node could write to its own disk, but it also has a coherent
namespace across all client nodes (clients can also be data servers),
so that all files are accessible from all clients over the network.

It is designed with high performance networking in mind (Quadrics Elan
is what we are working with now) which supports remote DMA and such.

> But what I suggest is finding a simple way to turn an existing FS into a
> distributed one. I.e. NOT reinventing the wheel. All those other people
> are reinventing a wheel, for some reason :-).

We are not re-inventing the on-disk filesystem, only adding the layers
on top (networking, locking), which are absolutely _required_ if you are
going to have a distributed filesystem. The locking is distributed,
in the sense that there is one node in charge of the filesystem layout
(the metadata server, MDS), which acts as the lock manager for that, but
all of the storage nodes (called object storage targets, OSTs) are in
charge of locking (and block allocation and such) for the files stored
on them.

You can't tell me you are going to have a distributed network filesystem
that does not have network support or locking.

> Well, how about allowing get_block to return an extra argument, which
> is the ondisk placement of the inode(s) concerned, so that the vfs can
> issue a lock request for them before the i/o starts. Let the FS return
> the list of metadata things to lock, and maybe a semaphore to start the
> i/o with.

In fact, you can go one better (as Lustre does) - the layout of the data
blocks for a file is totally internal to the storage node. The clients
only deal with object IDs (inode numbers, essentially) and offsets in
that file. How the OST filesystem lays out the data in that object is
not visible to the clients at all, so there is no need to lock the whole
filesystem across nodes to do the allocation. The clients do not write
directly to the OST block device EVER.
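
Schematically, the client-side interface has this shape (the names here
are illustrative, not the actual Lustre API):

    /* The client names an object and an offset; the storage target does
     * its own block allocation internally.  Illustrative shape only.
     */
    struct ost_conn;                     /* opaque handle to one storage target */

    long ost_object_write(struct ost_conn *ost,
                          unsigned long long object_id,  /* "inode number" */
                          unsigned long long offset,
                          const void *buf, unsigned long len);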

> > What you are starting would need at least 3-5 years to catch up with what
> > people currently already can do, and they'll improve in this time too.
>
> Maybe 3-4 weeks more like.

LOL. This has been my full-time job for the last 6 months, and I'm just a babe
in the woods. Sure you could do _something_, but nothing that would get
the performance you want.

> > A petabyte filesystem without journaling? A petabyte filesystem with a
> > single write lock? Gimme a break.
>
> Journalling? Well, now you mention it, that would seem to be nice. But
> my experience with journalling FS's so far tells me that they break
> more horribly than normal. Also, 1PB or so is the aggregate, not the
> size of each FS on the local nodes. I don't think you can diagnose
> "journalling" from the numbers. I am even rather loath to journal,
> given what I have seen.

Lustre is what you describe - dozens (hundreds, thousands?) of independent
storage targets, each controlling part of the total storage space.
Even so, journaling is crucial for recovery of metadata state, and
coherency between the clients and the servers, unless you don't think
that hardware or networks ever fail. Even with distributed storage,
a PB is 1024 nodes with 1TB of storage each, and that will still take
a long time to fsck just one client, let alone return to filesystem
wide coherency.

> No features. Just take any FS that currently works, and see if you can
> distribute it. Get rid of all fancy features along the way. You mean
> "what's wrong with X"?

Again, you are preaching to the choir here. In principle, Lustre does
what you want, but not with the "one big lock for the whole system"
approach, and it doesn't intrude into the VFS or need no-cache operation
because the clients DO NOT write directly onto the block device on the
OST. They DO communicate directly with the OST (so you have basically
linear I/O bandwidth scaling with OSTs and clients), but the OST uses
the normal VFS/filesystem to handle block allocation internally.

> Well, it won't be mainstream, for a start, and
> that's surely enough. The projects involved are huge, and they need to
> minimize risk, and maximize flexibility. This is CERN, by the way.

We are working on the filesystem for MCR (http://www.llnl.gov/linux/mcr/),
a _very large_ cluster at LLNL, 1000 4way 2.4GHz P4 client nodes, 100 TB
of disk, etc. (as an aside, sys_statfs wraps at 16TB for one filesystem,
I already saw, but I think I can work around it... ;-)

> There is no difficulty with that - there are no distributed locks. All
> locks are held on the server of the disk (I decided not to be
> complicated to begin with, as a matter of principle early in life ;-).

See above. Even if the server holds all of the locks for its "area" (as
we are doing) you are still "distributing" the locks to the clients when
they want to do things. The server still has to revoke those locks when
another client wants them, or your application ends up doing something
similar.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-09-08 09:19:04

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct"

"A month of sundays ago Lars Marowsky-Bree wrote:"
> > > In particular, they make them useless for the requirements you seem to
> > > have. A petabyte filesystem without journaling? A petabyte filesystem with
> > > a single write lock? Gimme a break.
> > Journalling? Well, now you mention it, that would seem to be nice.
>
> "Nice" ? ;-) You gotta be kidding. If you don't have journaling, distributed
> recovery becomes near impossible - at least I don't have a good idea on how to

It's OK. The calculations are duplicated and the FS's are too. The
calculation is highly parallel.

> do it if you don't know what the node had been working on prior to its
> failure.

Yes we do. Its place in the topology of the network dictates what it was
working on, and anyway that's just a standard parallelism "barrier"
problem.

> Well, you are taking quite a risk trying to run a
> not-aimed-at-distributed-environments fs and trying to make it distributed by
> force. I _believe_ that you are missing where the real trouble lurks.

There is no risk, because, as you say, we can always use nfs or another
off the shelf solution. But 10% better is 10% more experiment for
each timeslot for each group of investigators.

> What does this supposed "flexibility" buy you? Is there any real value in it

Ask the people who might scream for 10% more experiment in their 2
weeks.

> > You mean "what's wrong with X"? Well, it won't be mainstream, for a start,
> > and that's surely enough.
>
> I have pulled these two sentences out because I don't get them. What "X" are
> you referring to?

Any X that is not a standard FS. Yes, I agree, not exact.

> The insight I can offer you is look at OpenGFS, see and understand what it
> does, why and how. Then try to come up with a generic approach on how to put
> this on top of a generic filesystem, without making it useless.
>
> Then I shall be amazed.

I have to catch a plane ..

Peter

2002-09-08 09:55:11

by Lars Marowsky-Bree

Subject: Re: [RFC] mount flag "direct"

On 2002-09-08T11:23:39,
"Peter T. Breuer" <[email protected]> said:

> > do it if you don't know what the node had been working on prior to its
> > failure.
> Yes we do. Its place in the topology of the network dictates what it was
> working on, and anyway that's just a standard parallelism "barrier"
> problem.

I meant wrt what it had been working on in the filesystem. You'll need to do a
full fsck locally if it isn't journaled. Oh well.

Maybe it would help if you outlined your architecture as you see it right now.

> > Well, you are taking quite a risk trying to run a
> > not-aimed-at-distributed-environments fs and trying to make it distributed
> > by force. I _believe_ that you are missing where the real trouble lurks.
> There is no risk, because, as you say, we can always use nfs or another off
> the shelf solution.

Oh, so the discussion is a purely academic mind experiment; it would have been
helpful if you had told us that at the beginning.

> But 10% better is 10% more experiment for each timeslot
> for each group of investigators.

> > What does this supposed "flexibility" buy you? Is there any real value in
> > it
> Ask the people who might scream for 10% more experiment in their 2 weeks.

> > > You mean "what's wrong with X"? Well, it won't be mainstream, for a start,
> > > and that's surely enough.
> > I have pulled these two sentences out because I don't get them. What "X" are
> > you referring to?
> Any X that is not a standard FS. Yes, I agree, not exact.

So, your extensions are going to be "more" mainstream than OpenGFS / OCFS etc?
What the hell have you been smoking?

It has become apparent in the discussion that you are optimizing for a very
rare special case. OpenGFS, Lustre etc at least try to remain useable for
generic filesystem operation.

That "it won't be mainstream" applies to _your_ approach, not to those
"off the shelf" solutions.

And your special "optimisations" (like, no caching, no journaling...) are
supposed to be 10% _faster_ overall than these, which are - to a certain
extent - optimised for this case from the ground up?

One of us isn't listening while clue is knocking.

Now it might be me, but then I apologize for having wasted your time and will
stand corrected as soon as you have produced working code.

Until then, have fun. I feel like I am wasting both your and my time, and this
isn't strictly necessary.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister

2002-09-08 16:41:47

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct"

"A month of sundays ago Lars Marowsky-Bree wrote:"
> On 2002-09-08T11:23:39,
> "Peter T. Breuer" <[email protected]> said:
>
> > > do it if you don't know what the node had been working on prior to its
> > > failure.
> > Yes we do. Its place in the topology of the network dictates what it was
> > working on, and anyway that's just a standard parallelism "barrier"
> > problem.
>
> I meant wrt what it had been working on in the filesystem. You'll need to do a
> full fsck locally if it isn't journaled. Oh well.

Well, something like that anyway.

> Maybe it would help if you outlined your architecture as you see it right now.

I did in another post, I think. A torus with local 4-way direct
connectivity, with each node connected to three neighbours, exporting
one local resource and importing three more from them. All shared. Add
raid to taste.
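
For concreteness, neighbour addressing on a torus of W x H nodes is just
wraparound arithmetic; a sketch only (which three of the neighbours a
node imports from is an arbitrary choice here):

    /* Sketch only: which node sits next to which on a W x H torus.
     * Each node would export its own disk and import from neighbours.
     */
    struct node_id { int x, y; };

    static struct node_id torus_neighbour(struct node_id n, int dx, int dy,
                                          int width, int height)
    {
            struct node_id r;
            r.x = (n.x + dx + width)  % width;   /* wrap around */
            r.y = (n.y + dy + height) % height;
            return r;
    }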

> > There is no risk, because, as you say, we can always use nfs or another off
> > the shelf solution.
>
> Oh, so the discussion is a purely academic mind experiment; it would have been

Puhleeese try not to go off the deep end at an innocent observation.
Take the novocaine or something. I am just pointing out that there are
obvious safe fallbacks, AND ...

> helpful if you told us in the beginning.
>
> > But 10% better is 10% more experiment for each timeslot
> > for each group of investigators.

You see?

> > > you referring to?
> > Any X that is not a standard FS. Yes, I agree, not exact.
>
> So, your extensions are going to be "more" mainstream than OpenGFS / OCFS etc?

Quite possibly/probably. Let's see how it goes, shall we?
Do you want to shoot down returning the index of the inode in get_block
in order that we can do a wlock on that index before the io to
the file takes place? Not sufficient in itself, but enough to be going
on with, and enough for FS's that are reasonable in what they do.
Then we need to drop the dcache entry nonlocally.
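
On the server of the disk the wlock need be nothing more than a table
keyed by inode index; a userspace mockup of the idea (all names made up):

    /* Minimal sketch of the write lock held on the server of the disk,
     * keyed by inode index.  Hash collisions just serialize unrelated
     * inodes, which is safe if slow.  Userspace mockup, not kernel code.
     */
    #include <pthread.h>

    #define WLOCK_BUCKETS 1024

    static pthread_mutex_t wlock_table[WLOCK_BUCKETS];

    static void wlock_table_init(void)
    {
            int i;
            for (i = 0; i < WLOCK_BUCKETS; i++)
                    pthread_mutex_init(&wlock_table[i], NULL);
    }

    static void wlock_inode(unsigned long ino)
    {
            pthread_mutex_lock(&wlock_table[ino % WLOCK_BUCKETS]);
    }

    static void wunlock_inode(unsigned long ino)
    {
            pthread_mutex_unlock(&wlock_table[ino % WLOCK_BUCKETS]);
    }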

> What the hell have you been smoking?

Unfortunately nothing at all, let alone worthwhile.

> It has become apparent in the discussion that you are optimizing for a very

To you, perhaps, not to me. What I am thinking about is a data analysis
farm, handling about 20GB/s of input data in real time, with numbers
of nodes measured in the thousands, and network raided internally. Well,
you'd need a thousand nodes on the first ring alone just to stream
to disk at 20MB/s per node, and that will generate three to six times
that amount of internal traffic just from the raid. So the aggregate
bandwidth in the first analysis ring has to be of the order of 100GB/s.
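
A back-of-envelope check of those figures (sketch only):

    /* Back-of-envelope check of the figures above. */
    #include <stdio.h>

    int main(void)
    {
            double input_mb_s = 20000.0;            /* 20GB/s of incoming data */
            int    nodes      = 1000;               /* first analysis ring */
            double per_node   = input_mb_s / nodes; /* MB/s to disk per node */
            double agg_lo     = 3 * input_mb_s;     /* raid traffic, 3x ... */
            double agg_hi     = 6 * input_mb_s;     /* ... to 6x */

            printf("%.0f MB/s per node, %.0f-%.0f GB/s aggregate\n",
                   per_node, agg_lo / 1000, agg_hi / 1000);
            return 0;
    }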

If the needs are special, it's because of the magnitude of the numbers,
not because of any special quality.

> rare special case. OpenGFS, Lustre etc at least try to remain useable for
> generic filesystem operation.
>
> That it won't be mainstream is wrong about _your_ approach, not about those
> "off the shelves" solutions.

I'm willing to look at everything.

> And your special "optimisations" (like, no caching, no journaling...) are
> supposed to be 10% _faster_ overall than these which are - to a certain extent

Yep. Caching looks irrelevant because we read once and write once, by
and large. You could argue that we write once and read once, which
would make caching sensible, but the data streams are so large as to
make it likely that caches would be flooded out anyway. Buffering would
be irrelevant except inasmuch as it allows for asynchronous operation.

And the network is so involved in this that I would really like to get
rid of the current VMS however I could (it causes pulsing behaviour,
which is most disagreeable).

> - from the ground up optimised for this case?
>
> One of us isn't listening while clue is knocking.

You have an interesting bedtime story manner.

> Now it might be me, but then I apologize for having wasted your time and will
> stand corrected as soon as you have produced working code.

Shrug.

> Until then, have fun. I feel like I am wasting both your and my time, and this
> isn't strictly necessary.

!!

There's no argument. I'm simply looking for entry points to the code.
I've got a lot of good information, especially from Anton (and other
people!), that I can use straight off. My thanks for the insights.

Peter