2002-09-03 15:35:01

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct" (fwd)

Hi!

Thanks for the comment!

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> > Rationale:
> > No caching means that each kernel doesn't go off with its own idea of
> > what is on the disk in a file, at least. Dunno about directories and
> > metadata.

> And what if they both allocate the same disk block to another
> file, simultaneously ?

I see - yes, that's a good one.

I assumed that I would need to make several VFS operations atomic
or revertable, or simply forbid things like new file allocations or
extensions (i.e. the above), depending on what is possible or not.

This is precisely the kind of objection that I want to hear about.

OK - reply:
It appears that in order to allocate away free space, one must first
"grab" that free space using a shared lock. That's perfectly feasible.

Thank you.

Where could I intercept the block allocation in VFS?

> A mount option isn't enough to achieve your goal.
>
> It looks like you want GFS or OCFS. Info about GFS can be found at:

No, I don't want ANY FS. Thanks, I know about these, but they're not
it. I want support for /any/ FS at all at the VFS level.

> http://www.opengfs.org/
> http://www.sistina.com/ (commercial GFS)

> Dunno where Oracle's cluster fs is documented.

I know about that too, but no, I do not want ANY FS, I want /any/ FS.
:-)

Peter


2002-09-03 15:40:26

by Rik van Riel

Subject: Re: [RFC] mount flag "direct" (fwd)

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> I assumed that I would need to make several VFS operations atomic
> or revertable, or simply forbid things like new file allocations or
> extensions (i.e. the above), depending on what is possible or not.

> No, I don't want ANY FS. Thanks, I know about these, but they're not
> it. I want support for /any/ FS at all at the VFS level.

You can't. Even if each operation is fully atomic on one node,
you still don't have synchronisation between the different nodes
sharing one disk.

You really need filesystem support.

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-03 15:46:11

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct" (fwd)

"A month of sundays ago Rik van Riel wrote:"
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
>
> > I assumed that I would need to make several VFS operations atomic
> > or revertable, or simply forbid things like new file allocations or
> > extensions (i.e. the above), depending on what is possible or not.
>
> > No, I don't want ANY FS. Thanks, I know about these, but they're not
> > it. I want support for /any/ FS at all at the VFS level.
>
> You can't. Even if each operation is fully atomic on one node,
> you still don't have synchronisation between the different nodes
> sharing one disk.

Yes, I do have synchronization - locks are/can be shared between both
kernels using a device driver mechanism that I implemented. That is
to say, I can guarantee that atomic operations by each kernel do not
overlap "on the device", and remain locally ordered at least (and
hopefully globally, if I get the time thing right).

It's not that hard - the locks are held on the remote disk by a
"guardian" driver, to which the drivers on both of the kernels
communicate. A fake "scsi adapter", if you prefer.
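
To make that concrete: the lock traffic between a client kernel and the
guardian could look roughly like the sketch below. All of the names here
(guard_msg, guard_send and so on) are hypothetical, invented for the
example; this is not existing code.

enum guard_op {
	GUARD_LOCK,		/* acquire exclusive access to a block range */
	GUARD_UNLOCK		/* release it again */
};

struct guard_msg {
	enum guard_op op;
	unsigned int node_id;	/* which client kernel is asking */
	unsigned long start;	/* first block of the range */
	unsigned long count;	/* number of blocks */
};

/* hypothetical transport down to the guardian driver */
extern int guard_send(struct guard_msg *m);

/* client side: bracket an atomic read-modify-write with lock/unlock */
static int guarded_rmw(unsigned int node, unsigned long start,
		       unsigned long count)
{
	struct guard_msg m = { GUARD_LOCK, node, start, count };
	int err;

	err = guard_send(&m);
	if (err)
		return err;
	/* ... go direct to the metal on blocks [start, start+count) ... */
	m.op = GUARD_UNLOCK;
	return guard_send(&m);
}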

> You really need filesystem support.

I don't think so. I think you're not convinced either! But
I would really like it if you could put your finger on an
overriding objection.

Peter

2002-09-03 15:52:07

by Chris Wedgwood

Subject: Re: [RFC] mount flag "direct" (fwd)

On Tue, Sep 03, 2002 at 05:50:42PM +0200, Peter T. Breuer wrote:

Yes, I do have synchronization - locks are/can be shared between both
kernels using a device driver mechanism that I implemented.

What happens if one of the kernels/nodes dies?


--cw

2002-09-03 15:57:18

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct" (fwd)

"A month of sundays ago Chris Wedgwood wrote:"
> On Tue, Sep 03, 2002 at 05:50:42PM +0200, Peter T. Breuer wrote:
>
> Yes, I do have synchronization - locks are/can be shared between both
> kernels using a device driver mechanism that I implemented.
>
> What happens if one of the kernels/nodes dies?

With the lock held, you mean? Depends on policy. There are two
implemented at present:

a) show all errors
b) hide all errors

In case b) the lock will continue to be held until the other
node comes back up. In case a) the lock will be abandoned after
timeout, and pending requests will be errored.

I'll explore the ramifications later.
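
Roughly, the guardian's handling of a dead lock-holder might look like
this sketch (policy names and helpers are made up for illustration only):

enum guard_policy {
	SHOW_ERRORS,	/* (a): drop the lock after a timeout, error pending requests */
	HIDE_ERRORS	/* (b): keep the lock held until the node comes back */
};

struct guard_lock;	/* opaque here */

extern void fail_pending_requests(struct guard_lock *l);	/* hypothetical */
extern void release_lock(struct guard_lock *l);			/* hypothetical */

static void lock_holder_timed_out(enum guard_policy policy,
				  struct guard_lock *l)
{
	if (policy == HIDE_ERRORS)
		return;			/* lock stays held; waiters keep waiting */
	fail_pending_requests(l);	/* error out whatever was queued */
	release_lock(l);		/* and abandon the lock */
}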

Peter

2002-09-03 16:05:30

by Richard B. Johnson

Subject: Re: [RFC] mount flag "direct" (fwd)

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> "A month of sundays ago Rik van Riel wrote:"
> > On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> >
> > > I assumed that I would need to make several VFS operations atomic
> > > or revertable, or simply forbid things like new file allocations or
> > > extensions (i.e. the above), depending on what is possible or not.
> >
> > > No, I don't want ANY FS. Thanks, I know about these, but they're not
> > > it. I want support for /any/ FS at all at the VFS level.
> >
> > You can't. Even if each operation is fully atomic on one node,
> > you still don't have synchronisation between the different nodes
> > sharing one disk.
>
> Yes, I do have synchronization - locks are/can be shared between both
> kernels using a device driver mechanism that I implemented. That is
> to say, I can guarantee that atomic operations by each kernel do not
> overlap "on the device", and remain locally ordered at least (and
> hopefully globally, if I get the time thing right).
>
> It's not that hard - the locks are held on the remote disk by a
> "guardian" driver, to which the drivers on both of the kernels
> communicate. A fake "scsi adapter", if you prefer.
>
> > You really need filesystem support.
>
> I don't think so. I think you're not convinced either! But
> I would really like it if you could put your finger on an
> overriding objection.
>
> Peter

Let's say you have a perfect locking mechanism, a fake SCSI layer
as you state. You are now going to create a new file on the
shared block device. You are careful that you use only space
that you "own", etc., so you perfectly create a new file on
your VFS.

How do the other users of this device "know" that there is
a new file, so they can update their notion of the block-device state?

You have created perfect isolation so, by definition, the other
isolated users don't know that you have just used space that they
think they own.

Now, the notion of a complete 'file-system' for support may not be
required. What you need is something like a file-system without all the
frills. It needs to act like a "hard disk malloc" or slab allocator. That
way, you can have independence between the systems that are accessing the
device.

So, if you made this, you are still stuck with the problem of duplicate
file-names, but this could be resolved by using a "librarian" layer
so that a new file-name and its meta-data become known to all the
users of the device.

FYI, the "librarian" layer is the file-system so, I have shown that
you need file-system support.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-03 16:24:32

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct" (fwd)

"Richard B. Johnson wrote:"
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > It's not that hard - the locks are held on the remote disk by a
> > "guardian" driver, to which the drivers on both of the kernels
> > communicate. A fake "scsi adapter", if you prefer.
> >
> > > You really need filesystem support.

> Lets say you have a perfect locking mechanism, a fake SCSI layer

OK.

> as you state. You are now going to create a new file on the
> shared block device. You are careful that you use only space
> that you "own", etc., so you perfectly create a new file on
> your VFS.

OK.

> How does the other user's of this device "know" that there is
> a new file so it can update its notion of the block-device state?

The block device itself is stateless at the block level. Every block
access goes "direct to the metal".

The question is how much FS state is cached on either kernel.
If it is too much, then I will ask how I can cause less to be cached,
perhaps by use of a flag that parallels how O_DIRECT works. I thought that
new files were entries in a directory's inode, and I agree that inodes are
held in memory! But I don't know when they are first read or reread.
The directory entry would certainly have to be reread after a write
operation on disk that touched it - or more simply, the directory entry
would have to be reread every time it were needed, i.e. be uncached.

If that presently is not possible, then I would like to think about
making it possible. Isn't there some kind of inode reading that goes on
at mount? Can I cause it to happen (or unhappen) at will?

> You have created perfect isolation so, by definition, the other
> isolated user's don't know that you have just used space that they
> think that they own.

Well, I don't think that's a fair analogy .. if a "reserve_blocks"
call is added to VFS, then I can use it to prelock the "space that
they think they own", and prevent contention. The question is how
each FS does the block reservation, and why it should not go through
a generic method in the VFS layer.
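
For illustration, such a hook might be sketched along the following
lines. Nothing here exists in the real VFS; reserve_blocks,
release_blocks and shared_disk_lock are assumptions made up for the
example.

struct super_block;	/* the usual VFS superblock */

/* hypothetical extra operations a filesystem would have to provide */
struct reserve_ops {
	/* set aside 'count' free blocks; return the first one in *first */
	int (*reserve_blocks)(struct super_block *sb, unsigned long count,
			      unsigned long *first);
	/* give back a reservation that was not used after all */
	void (*release_blocks)(struct super_block *sb, unsigned long first,
			       unsigned long count);
};

/* hypothetical cross-node lock provided by the shared-disk driver */
extern void shared_disk_lock(struct super_block *sb);
extern void shared_disk_unlock(struct super_block *sb);

/* generic wrapper: this is where the "prelock" would happen, before
 * the fs touches its own free-space bitmap */
int vfs_reserve_blocks(struct super_block *sb, struct reserve_ops *ops,
		       unsigned long count, unsigned long *first)
{
	int err;

	shared_disk_lock(sb);
	err = ops->reserve_blocks(sb, count, first);
	shared_disk_unlock(sb);
	return err;
}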

> Now, the notion of a complete 'file-system' for support may not be
> required. What you need is like a file-system without all the frills.

I think that's the wrong tack, though simply _disabling_ some
operations initially (such as making new files!) may be the way to go.
Just enable more ops as generic support is added.

> FYI, the "librarian" layer is the file-system so, I have shown that
> you need file-system support.

Nice try - your argument reduces to saying that the state of the
directory inodes must be shared. I agree and suggest two remedies

1) maintain no directory inode state, but reread them every time
(how?)
2) force rereading of a particular inode or all inodes when
signalled to do so.

I would prefer (1). It seems in the spirit of O_DIRECT. I imagine that
(2) is presently easy to do (but of course horrible).
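
Remedy (2) might amount to little more than a hook like the following
(a sketch with made-up helpers; drop_cached_state and
read_inode_from_disk do not exist as such):

struct inode;

extern void drop_cached_state(struct inode *dir);	/* hypothetical */
extern int read_inode_from_disk(struct inode *dir);	/* hypothetical */

/* called when another node signals that it has touched this directory */
int reread_directory(struct inode *dir)
{
	drop_cached_state(dir);			/* forget cached pages/dentries */
	return read_inode_from_disk(dir);	/* pull the on-disk copy back in */
}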

Peter

2002-09-03 16:29:05

by Rik van Riel

Subject: Re: [RFC] mount flag "direct" (fwd)

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> > How does the other user's of this device "know" that there is
> > a new file so it can update its notion of the block-device state?
>
> The block device itself is stateless at the block level. Every block
> access goes "direct to the metal".
>
> The question is how much FS state is cached on either kernel.
> If it is too much, then I will ask how I can cause to be less, perhaps
> by use of a flag that parallels how O_DIRECT works. I thought that new
> files were entries in a directories inode and I agree that inodes are
> held in memory! But I don't know when they are first read or reread.

And neither can you know. After all, this is filesystem dependent.

You cannot decide whether filesystem-independent clustering is
possible unless you know that all the filesystems play by your
rules. So much for filesystem independence.

regards,

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-03 16:54:36

by Anton Altaparmakov

Subject: Re: [RFC] mount flag "direct" (fwd)

On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> "A month of sundays ago Rik van Riel wrote:"
> > On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> >
> > > I assumed that I would need to make several VFS operations atomic
> > > or revertable, or simply forbid things like new file allocations or
> > > extensions (i.e. the above), depending on what is possible or not.
> >
> > > No, I don't want ANY FS. Thanks, I know about these, but they're not
> > > it. I want support for /any/ FS at all at the VFS level.
> >
> > You can't. Even if each operation is fully atomic on one node,
> > you still don't have synchronisation between the different nodes
> > sharing one disk.
>
> Yes, I do have synchronization - locks are/can be shared between both
> kernels using a device driver mechanism that I implemented. That is
> to say, I can guarantee that atomic operations by each kernel do not
> overlap "on the device", and remain locally ordered at least (and
> hopefully globally, if I get the time thing right).
>
> It's not that hard - the locks are held on the remote disk by a
> "guardian" driver, to which the drivers on both of the kernels
> communicate. A fake "scsi adapter", if you prefer.

You have synchronisation at the block layer level, which is completely
insufficient.

> > You really need filesystem support.
>
> I don't think so. I think you're not convinced either! But
> I would really like it if you could put your finger on an
> overriding objection.

You think wrong... (-;

I will give you a few examples of why you are wrong:

1) Neither the block layer nor the VFS has anything to do with block
allocations, and hence you cannot solve this problem at the VFS or block
layer level. The only thing the VFS does is tell the file system driver "write X
number of bytes to the file F at offset Y". Nothing more than that! The
file system then goes off and allocates blocks in its own disk block
bitmap and then writes the data. The only locking used is file system
specific. For example NTFS has a per mounted volume rw_semaphore to
synchronize accesses to the disk block bitmap. But other file systems most
certainly implement this differently...

2) Some file systems cache the metadata. For example in NTFS the
disk block bitmap is stored inside a normal file called $Bitmap. Thus NTFS
uses the page cache to access the block bitmap and this means that when
new blocks are allocated, we take the volume specific rw_semaphore and
then we search the page cache of $Bitmap for zero bits, set the
required number of bits to one, and then we drop the rw_semaphore and
return which blocks were allocated to the calling ntfs function.
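
In outline the pattern is something like this heavily simplified,
self-contained sketch (a userspace rwlock and a plain array stand in for
the volume rw_semaphore and the $Bitmap page cache; this is not the
actual ntfs driver code):

#include <pthread.h>

#define NBLOCKS 1024

static unsigned char bitmap[NBLOCKS];	/* stands in for the cached $Bitmap pages */
static pthread_rwlock_t vol_lock = PTHREAD_RWLOCK_INITIALIZER;	/* per-volume lock */

/* mark the first free block used and return it, or -1 if the volume is full */
static long alloc_block(void)
{
	long i, ret = -1;

	pthread_rwlock_wrlock(&vol_lock);
	for (i = 0; i < NBLOCKS; i++) {
		if (!bitmap[i]) {
			bitmap[i] = 1;	/* only the in-memory copy is updated */
			ret = i;
			break;
		}
	}
	pthread_rwlock_unlock(&vol_lock);
	return ret;	/* the on-disk bitmap is still stale here, which is why a
			   second host sharing the disk never sees this allocation */
}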

Even if you modified the ntfs driver so that the two hosts accessing the
same device would share the same rw_semaphore, it still wouldn't work,
because there is no synchronisation between the disk block bitmaps on the
two hosts. When one has gone through the above procedure and has dropped
the lock, the allocated clusters are held in memory only, thus the other
host doesn't see that some blocks have been allocated and goes off and
allocates the same blocks to a different file, as Rik and I described
already.

And this is just the tip of the iceberg. The only way you could get
something like this to work is by modifying each and every file system
driver to use some VFS-provided mechanism for all (de-)allocations, both
disk block and inode ones. Further, you would need to provide shared
memory, i.e. the two hosts need to share the same page cache / address
space mappings. So basically, it can only work if the two hosts are
virtually the same host, i.e. if the two hosts are part of a Single System
Image Cluster...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-03 17:21:55

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct" (fwd)

"A month of sundays ago Anton Altaparmakov wrote:"
> > It's not that hard - the locks are held on the remote disk by a
> > "guardian" driver, to which the drivers on both of the kernels
> > communicate. A fake "scsi adapter", if you prefer.
>
> You have synchronisation at block layer level which is completely
> insufficient.

No, I have synchronization whenever one cares to ask for it (the level
is purely notional), but I suggest that one adds a "tag" request type
to the block layer in order that one may ask for a lock at the VFS level
by issuing a "tag block request", which does nothing except stop
anybody else from processing the named notional resource until the
corresponding "untag block request" is issued.

> 1) Neither the block layer nor the VFS have anything to do with block
> allocations and hence you cannot solve this problem at VFS nor block layer

That's OK. We've already agreed that the fs's need to reserve blocks
before they make an allocation, and that they need to do that by calling
up to the VFS to reserve it, and that the VFS ought to call back down to
let them reserve it the way they like, but take the opportunity to notice
the reserve call.

> level. The only thing the VFS does is tell the file system driver "write X
> number of bytes to the file F at offset Y". Nothing more than that! The
> file system then goes off and allocates blocks in its own disk block

Well, it needs to be altered to call back up first, telling the VFS not
to allow any allocations for a moment (that's a lock), and then the
VFS calls back down and finds out what it feels like reserving, and
now we get to the tricky bit, because each kernel has its own bitmap
... well you tell me. I can see several generic implementations:

1) the bitmap is required to be held on disk by a FS and to be reread
each time any kernel wants to make a new file allocation (that's not
so expensive - new files are generally rare and we don't care).

2) the VFS holds the bitmap and we add ops to read and write the
bitmap in VFS, and intercept those calls and share them (somehow -
details to be arranged).

3) .. any or all of this behavior could be forced by a METADIRECT
flag that forbids metadata from being cached in memory without being
synced to disk.

> bitmap and then writes the data. The only locking used is file system
> specific. For example NTFS has a per mounted volume rw_semaphore to
> synchronize accesses to the disk block bitmap. But other file systems most
> certainly implement this differently...

Then they will have to be patched to do it generically ..?

> 2) Some file systems cache the metadata. For example in NTFS the

This seems like a pretty valid objection!

> disk block bitmap is stored inside a normal file called $Bitmap. Thus NTFS
> uses the page cache to access the block bitmap and this means that when

This is the same objection as your first objection, I think, except
made particular. My response must therefore be the same - make the
bitmap operations pass through the VFS at least, and add a METADIRECT
flag that forces the information to be reread when it is needed.

The question is how best to force it, or if the data should be shared
via the VFSs directly (I can handle that - I can make a fake device
that contains the bitmap data, for example).

> new blocks are allocated, we take the volume specific rw_semaphore and
> then we search the page cache of $Bitmap for zero bits, set the
> required number of bits to one, and then we drop the rw_semaphore and
> return which blocks were allocated to the calling ntfs function.

I'm not sure what relevance the semaphore has. I'm advocating that the
bitmap ops become generic, which automatically gives the opportunity
for generic locking mechanisms.

> Even if you modified the ntfs driver so that the two hosts accessing the
> same device would share the same rw_semaphore, it still wouldn't work,

I won't modify it except to use new generic ops instead of fs
particular ones. One could say that only FS's which use the
generic VFS ops are suitable candidates to BE fs's on a shared device.
Then it ceases to be a problem, and becomes a desired goal.

> And this is just the tip of the iceberg. The only way you could get

Well, how much more is there? What you mentioned didn't worry me
because it wasn't a generic strategic objection.

> something like this to work is by modifying each and every file system
> driver to use some VFS provided mechanism for all (de-)allocations, both

Yes. Precisely. There is nothing wrong with that.

> disk block, and inode ones. Further you would need to provide shared
> memory, i.e. the two hosts need to share the same page cache / address

Well, that I don't know about. Can you elaborate a bit on that? I'm not
at all sure that is the case. Can you provide another of your very
useful concretizations?

> space mappings. So basically, it can only work if the two hosts are
> virtually the same host, i.e. if the two hosts are part of a Single System
> Image Cluster...

Thank you! I find that input very enlightening.

Peter

2002-09-03 17:27:50

by Richard B. Johnson

Subject: Re: [RFC] mount flag "direct" (fwd)

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> "Richard B. Johnson wrote:"
> > On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > > It's not that hard - the locks are held on the remote disk by a
> > > "guardian" driver, to which the drivers on both of the kernels
> > > communicate. A fake "scsi adapter", if you prefer.
> > >
> > > > You really need file-system support.
>
> > Lets say you have a perfect locking mechanism, a fake SCSI layer
>
> OK.
>
> > as you state. You are now going to create a new file on the
> > shared block device. You are careful that you use only space
> > that you "own", etc., so you perfectly create a new file on
> > your VFS.
>
> OK.
>
> > How does the other user's of this device "know" that there is
> > a new file so it can update its notion of the block-device state?
>
> The block device itself is stateless at the block level. Every block
> access goes "direct to the metal".
>

Well, it doesn't. In particular, SCSI and FireWire drives have data
queued and, to give the CPU something to do while the writes are
occurring, the block layer sleeps. So you can have some other
task with a wrong idea about the state of the machine.

> The question is how much FS state is cached on either kernel.
> If it is too much, then I will ask how I can cause to be less, perhaps
> by use of a flag that parallels how O_DIRECT works. I thought that new
> files were entries in a directories inode and I agree that inodes are
> held in memory! But I don't know when they are first read or reread.

Unless you unmount/re-mount, they will not be re-read. That's why you
need to "share" at the file-system level. FYI, it's already being
done: clustered disks were first done by DEC under RSX-11, then
under VAX/VMS. It's truly "old-hat".

> The directory entry would certainly have to be reread after a write
> operation on disk that touched it - or more simply, the directory entry
> would have to be reread every time it were needed, i.e. be uncached.
>
> If that presently is not possible, then I would like to think about
> making it possible. Isn't there some kind of inode reading that goes on
> at mount? Can I cause it to happen (or unhappen) at will?
>

Yes, but you have a problem with synchronization. You need to synchronize
a file-system at the file-system level so that one process accessing the
file-system obtains the exact same image as any other process.

> > You have created perfect isolation so, by definition, the other
> > isolated user's don't know that you have just used space that they
> > think that they own.
>
> Well, I don't think that's a fair analogy .. if a "reserve_blocks"
> call is added to VFS, then I can use it to prelock the "space that
> they think they own", and prevent contention. The question is how
> each FS does the block reservation, and why it should not go through
> a generic method in the VFS layer.
>
> > Now, the notion of a complete 'file-system' for support may not be
> > required. What you need is like a file-system without all the frills.
>
> I think that's the wrong tack, though simply _disabling_ some
> operations initially (such as making new files!) may be the way to go.
> Just enable more ops as generic support is added.

Well, if you don't make new files, and you don't update any file-data,
then you just mount R/O and be done with it. When a FS is mounted
R/O, one doesn't care about atomicity anymore, only performance.

Once you allow a file's contents to be altered, you have the problem
of making certain that every process's notion of the file contents
is identical. Again, that's done at the file-system layer, not at
some block layer.

>
> > FYI, the "librarian" layer is the file-system so, I have shown that
> > you need file-system support.
>
> Nice try - your argument reduces to saying that the state of the
> directory inodes must be shared. I agree and suggest two remedies
>
> 1) maintain no directory inode state, but reread them every time
> (how?)

If you don't maintain some kind of state, you end up reading all
directory inodes. I don't think you want that. You need to maintain
that "directory inode state" and that's what a file-system does.

> 2) force rereading of a particular inode or all inodes when
> signalled to do so.

The signaler needs to "know", which means that somebody is maintaining
the file-system state. You shouldn't have to re-invent file-systems to
do that. You just maintain synchrony at the file-system level and
be done with it.

>
> I would prefer (1). It seems in the spirit of O_DIRECT. I imagine that
> (2) is presently easy to do (but of course horrible).
>
> Peter

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-03 18:48:24

by Lars Marowsky-Bree

Subject: Re: [RFC] mount flag "direct" (fwd)

On 2002-09-03T18:29:02,
"Peter T. Breuer" <[email protected]> said:

> > Lets say you have a perfect locking mechanism, a fake SCSI layer
> OK.

BTW, I would like to see your perfect distributed locking mechanism.


> The directory entry would certainly have to be reread after a write
> operation on disk that touched it - or more simply, the directory entry
> would have to be reread every time it were needed, i.e. be uncached.

*ouch* Sure. Right. You just have to read it from scratch every time. How
would you make readdir work?

> If that presently is not possible, then I would like to think about
> making it possible.

Just please, tell us why.



Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister

2002-09-03 21:02:31

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct" (fwd)

"A month of sundays ago Lars Marowsky-Bree wrote:"
> On 2002-09-03T18:29:02,
> "Peter T. Breuer" <[email protected]> said:
>
> > > Lets say you have a perfect locking mechanism, a fake SCSI layer
> > OK.
>
> BTW, I would like to see your perfect distributed locking mechanism.

That bit's easy and is done. The "trick" is NOT to distribute the lock,
but to have it in one place - on the driver that guards the remote
disk resource.

> > The directory entry would certainly have to be reread after a write
> > operation on disk that touched it - or more simply, the directory entry
> > would have to be reread every time it were needed, i.e. be uncached.
>
> *ouch* Sure. Right. You just have to read it from scratch every time. How
> would you make readdir work?

Well, one has to read it from scratch. I'll set about seeing how to do it.
Clues welcome.

> > If that presently is not possible, then I would like to think about
> > making it possible.
>
> Just please, tell us why.

You don't really want the whole rationale. It concerns certain
European (nay, world ..) scientific projects and the calculations of the
technologists about the progress in hardware over the next few years.
We/they foresee that we will have to move to multiple relatively small
distributed disks per node in order to keep the bandwidth per unit of
storage at the levels that they will have to be at to keep the farms
fed. We are talking petabytes of data storage in thousands of nodes
moving over gigabit networks.

The "big view" calculations indicate that we must have distributed
shared writable data.

These calculations affect us all. They show us what way computing
will evolve under the price and technology pressures. The calculations
are only looking to 2006, but that's what they show. For example
if we think about a 5PB system made of 5000 disks of 1TB each in a GE
net, we calculate the aggregate bandwidth available in the topology as
50GB/s, which is less than we need in order to keep the nodes fed
at the rates they could be fed at (yes, a few % loss translates into
time and money). To increase available bandwidth we must have more
channels to the disks, and more disks, ... well, you catch my drift.

So, start thinking about general mechanisms to do distributed storage.
Not particular FS solutions.

Peter

2002-09-03 21:11:51

by Rik van Riel

Subject: Re: [RFC] mount flag "direct" (fwd)

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> The "big view" calculations indicate that we must have distributed
> shared writable data.

Agreed. Note that the same big view also dictates that any such
solution must have good performance.

Do you need any more reasons for having special cluster filesystems
instead of trying to add clustering to already existing filesystems?

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-03 21:12:47

by Andreas Dilger

Subject: Re: [RFC] mount flag "direct" (fwd)

On Sep 03, 2002 23:07 +0200, Peter T. Breuer wrote:
> You don't really want the whole rationale. It concerns certain
> european (nay, world ..) scientific projects and the calculations of the
> technologists about the progress in hardware over the next few years.
> We/they foresee that we will have to move to multiple relatively small
> distributed disks per node in order to keep the bandwidth per unit of
> storage at the levels that they will have to be at to keep the farms
> fed. We are talking petabytes of data storage in thousands of nodes
> moving over gigabit networks.
>
> The "big view" calculations indicate that we must have distributed
> shared writable data.
>
> These calculations affect us all. They show us what way computing
> will evolve under the price and technology pressures. The calculations
> are only looking to 2006, but that's what they show. For example
> if we think about a 5PB system made of 5000 disks of 1TB each in a GE
> net, we calculate the aggregate bandwidth available in the topology as
> 50GB/s, which is less than we need in order to keep the nodes fed
> at the rates they could be fed at (yes, a few % loss translates into
> time and money). To increase available bandwidth we must have more
> channels to the disks, and more disks, ... well, you catch my drift.
>
> So, start thinking about general mechanisms to do distributed storage.
> Not particular FS solutions.

Please see lustre.org.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-09-03 21:49:46

by Anton Altaparmakov

Subject: Re: [RFC] mount flag "direct" (fwd)

At 22:07 03/09/02, Peter T. Breuer wrote:
>"A month of sundays ago Lars Marowsky-Bree wrote:"
> > On 2002-09-03T18:29:02,
> > "Peter T. Breuer" <[email protected]> said:
> > > If that presently is not possible, then I would like to think about
> > > making it possible.
> >
> > Just please, tell us why.
>
>You don't really want the whole rationale. It concerns certain
>european (nay, world ..) scientific projects and the calculations of the
>technologists about the progress in hardware over the next few years.
>We/they foresee that we will have to move to multiple relatively small
>distributed disks per node in order to keep the bandwidth per unit of
>storage at the levels that they will have to be at to keep the farms
>fed. We are talking petabytes of data storage in thousands of nodes
>moving over gigabit networks.
>
>The "big view" calculations indicate that we must have distributed
>shared writable data.
>
>These calculations affect us all. They show us what way computing
>will evolve under the price and technology pressures. The calculations
>are only looking to 2006, but that's what they show. For example
>if we think about a 5PB system made of 5000 disks of 1TB each in a GE
>net, we calculate the aggregate bandwidth available in the topology as
>50GB/s, which is less than we need in order to keep the nodes fed
>at the rates they could be fed at (yes, a few % loss translates into
>time and money). To increase available bandwidth we must have more
>channels to the disks, and more disks, ... well, you catch my drift.
>
>So, start thinking about general mechanisms to do distributed storage.
>Not particular FS solutions.

Hm, I believe you are barking up the wrong tree. Either you are omitting
too much information in your statement above or you are contradicting
yourself.

What you are looking for is _exactly_ a particular FS solution (or
several)! And in particular you are looking for a truly distributed file
system.

I just get the impression you are not fully aware of what a distributed FS
(call it DFS for short) actually is.

In my understanding a DFS offers exactly what you need: each node has disks
and all disks on all nodes are part of the very same file system. Of course
each node maintains the local disks, i.e. the local part of the file system,
and certain operations require that the nodes communicate with the "DFS
master node(s)", for example in order to reserve disk blocks or to
create/rename files (to make sure no duplicate filenames are
instantiated, for example). -- Sound familiar so far? You wanted to do
exactly the same things, but at the block layer and VFS layer levels
instead of the FS layer...

The difference between a DFS and your proposal is that a DFS maintains all
the caching benefits of a normal FS at the local node level, while your
proposal completely and entirely disables caching, which is debatably
impossible (due to the need to load things into RAM to read them, modify
them and then write them back), and certainly no FS author will accept
having their FS driver crippled in such a way. The performance loss
incurred by removing caching completely is going to make sure you will only
be dreaming of those 50GiB/sec. More likely you will be getting a few
bytes/sec... (OK, I exaggerate a bit.) The seek times on the disks together
with the read/write timings are going to completely annihilate performance.
A DFS maintains caching at the local node level, so you can still keep open
inodes in memory for example (just don't allow any other node to open the
same file at the same time, or you need to do some juggling via the "Master
DFS node").

To give you an analogy, you can think of a DFS like a NUMA machine, where
you have different access speeds to different parts of memory (for a DFS
the "storage device", same thing really) and where decisions on where to
store things are made depending on the resource/time cost involved.
Simplest example: a file created on node A will be allocated/written to a
disk (or multiple disks) located on node A, because accessing the local
disks has a lower time cost compared to going to a different node over the
slower wire.

Your time would be much better spent in creating the _one_ true DFS, or
helping improve one of the existing ones instead of trying to hack up the
VFS/block layers to pieces. It almost certainly will be a hell of a lot
less work to implement a decent DFS in comparison to changing the block
layer, the VFS, _and_ every single FS driver out there to comply with the
block layer and VFS changes. And at the same time you get exactly the same
features you wanted to have but with hugely boosted performance.

I hope my ramblings made some kind of sense...

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-03 22:44:51

by Andreas Dilger

Subject: Re: [RFC] mount flag "direct" (fwd)

On Sep 03, 2002 22:54 +0100, Anton Altaparmakov wrote:
> In my understanding a DFS offers exactly what you need: each node has disks
> and all disks on all nodes are part of the very same file system. Of course
> each node maintains the local disks, i.e. the local part of the file system
> and certain operations require that the nodes communicates with the "DFS
> master node(s)" in order for example to reserve blocks of disks or to
> create/rename files (need to make sure no duplicate filenames are
> instantiated for example). -- Sound familiar so far? You wanted to do
> exactly the same things but at the block layer and the VFS layer levels
> instead of the FS layer...
>
> The difference between a DFS and your proposal is that a DFS maintains all
> the caching benefits of a normal FS at the local node level, while your
> proposal completely and entirely disables caching, which is debatably
> impossible (due to need to load things into ram to read them and to modify
> them and then write them back) and certainly no FS author will accept their
> FS driver to be crippled in such a way. The performance loss incurred by
> removing caching completely is going to make sure you will only be dreaming
> of those 50GiB/sec. More likely you will be getting a few bytes/sec... (OK,
> I exaggerate a bit.) The seek times on the disks together with the
> read/write timings are going to completely annihilate performance. A DFS
> maintains caching at local node level, so you can still keep open inodes in
> memory for example (just don't allow any other node to open the same file
> at the same time or you need to do some juggling via the "Master DFS node").
>
> Your time would be much better spent in creating the _one_ true DFS, or
> helping improve one of the existing ones instead of trying to hack up the
> VFS/block layers to pieces. It almost certainly will be a hell of a lot
> less work to implement a decent DFS in comparison to changing the block
> layer, the VFS, _and_ every single FS driver out there to comply with the
> block layer and VFS changes. And at the same time you get exactly the same
> features you wanted to have but with hugely boosted performance.

This is exactly what Lustre is supposed to be. Many nodes, each with
local storage, and clients are able to do I/O directly to the storage
nodes (for non-local storage, or if they have no local storage at all).

There is (currently) a single metadata server (MDS) which controls the
directory tree locking, and the storage nodes control the locking of
inodes (objects) local to their storage.

It's not quite in a robust state yet, but we're working on it.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-09-03 23:12:49

by Daniel Phillips

Subject: Re: [RFC] mount flag "direct" (fwd)

On Tuesday 03 September 2002 23:54, Anton Altaparmakov wrote:
> The difference between a DFS and your proposal is that a DFS maintains all
> the caching benefits of a normal FS at the local node level, while your
> proposal completely and entirely disables caching, which is debatably
> impossible (due to need to load things into ram to read them and to modify
> them and then write them back) and certainly no FS author will accept their
> FS driver to be crippled in such a way. The performance loss incurred by
> removing caching completely is going to make sure you will only be dreaming
> of those 50GiB/sec. More likely you will be getting a few bytes/sec... (OK,
> I exaggerate a bit.) The seek times on the disks together with the
> read/write timings are going to completely annihilate performance. A DFS
> maintains caching at local node level, so you can still keep open inodes in
> memory for example (just don't allow any other node to open the same file
> at the same time or you need to do some juggling via the "Master DFS node").

You're well wide of the mark here, in that you're relying on the assumption
that caching is important to the application he has in mind. The raw transfer
bandwidth may well be sufficient, especially if it is unimpeded by being
funneled through a bottleneck like our vfs cache.

--
Daniel

2002-09-04 00:14:23

by Anton Altaparmakov

Subject: Re: [RFC] mount flag "direct" (fwd)

At 00:19 04/09/02, Daniel Phillips wrote:
>On Tuesday 03 September 2002 23:54, Anton Altaparmakov wrote:
> > The difference between a DFS and your proposal is that a DFS maintains all
> > the caching benefits of a normal FS at the local node level, while your
> > proposal completely and entirely disables caching, which is debatably
> > impossible (due to need to load things into ram to read them and to modify
> > them and then write them back) and certainly no FS author will accept
> their
> > FS driver to be crippled in such a way. The performance loss incurred by
> > removing caching completely is going to make sure you will only be
> dreaming
> > of those 50GiB/sec. More likely you will be getting a few bytes/sec...
> (OK,
> > I exaggerate a bit.) The seek times on the disks together with the
> > read/write timings are going to completely annihilate performance. A DFS
> > maintains caching at local node level, so you can still keep open
> inodes in
> > memory for example (just don't allow any other node to open the same file
> > at the same time or you need to do some juggling via the "Master DFS
> node").
>
>You're well wide of the mark here, in that you're relying on the assumption
>that caching is important to the application he has in mind. The raw transfer
>bandwidth may well be sufficient, especially if it is unimpeded by being
>funneled through a bottleneck like our vfs cache.

I don't think I am. I think we just define "caching" differently. The "raw
transfer bandwidth" will be close to zero if no caching happens at all. I
agree with you if you define caching as data caching. But both Peter and I
are talking about metadata caching + data caching. Sure, you can throw data
caching out the window and actually gain performance. I would never dispute
that. But if you throw away metadata caching you destroy performance. Maybe
not on "simplistic" file systems like ext2 but certainly so on complex ones
like ntfs... I described already what a single read in ntfs entails if no
metadata caching happens. I doubt very much that there is a possible
scenario where not doing any metadata caching would improve performance (on
ntfs and at a guess many other fs). Even a sequential read or write from
start of file to end of file would be really killed without caching of the
logical to physical block mapping table for the inode being read/written on
ntfs...

So we aren't in disagreement I think. (-:

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-04 05:26:48

by David Lang

Subject: Re: [RFC] mount flag "direct" (fwd)

On Wed, 4 Sep 2002, Daniel Phillips wrote:

>
> You're well wide of the mark here, in that you're relying on the assumption
> that caching is important to the application he has in mind. The raw transfer
> bandwidth may well be sufficient, especially if it is unimpeded by being
> funneled through a bottleneck like our vfs cache.
>

The fact that he is saying that this needs to run on normal filesystems
tells us that.

If you need a filesystem to max out the transfer rate and don't want to have
it cache things, that is a VERY specialized thing and not something that
will match what NTFS/XFS/JFS/ReiserFS/ext2 etc. are going to be used for.

Either he has a very specialized need (in which case a specialized
filesystem is probably the best bet anyway) or he is trying to support
normal uses (in which case caching is important).

However, the point is that the read-modify-write cycle is a form of cache:
it is only safe if you acquire a lock at the beginning of it and release it
at the end. A standard filesystem won't do this; doing so is what makes a DFS.

David Lang

2002-09-04 07:46:09

by Joachim Breuer

Subject: Re: [RFC] mount flag "direct" (fwd)

"Peter T. Breuer" <[email protected]> writes:

> "A month of sundays ago Lars Marowsky-Bree wrote:"
>> On 2002-09-03T18:29:02,
>> "Peter T. Breuer" <[email protected]> said:
>>
>> > The directory entry would certainly have to be reread after a write
>> > operation on disk that touched it - or more simply, the directory entry
>> > would have to be reread every time it were needed, i.e. be uncached.
>>
>> *ouch* Sure. Right. You just have to read it from scratch every time. How
>> would you make readdir work?
>
> Well, one has to read it from scratch. I'll set about seeing how to do.
> Clues welcome.

Just an idea, I don't know how well this works what with the 'IDE
can't do write barriers right' and related effects:

- Allow all nodes to cache as many blocks as they wish
- The atomic operation "update this block" includes "invalidate this
block, if cached" broadcast to all nodes

Performance would certainly become an issue; depending on the
architecture, bus sniffing as in certain MP cache-consistency protocols
might be feasible (I, node 3, see a transfer from node 1 going to
block #42, which is in my cache; so I update my cache using the data
part of the block transfer as it comes by on the bus).


So long,
Joe

--
"I use emacs, which might be thought of as a thermonuclear
word processor."
-- Neal Stephenson, "In the beginning... was the command line"

2002-09-04 09:21:27

by Lars Marowsky-Bree

Subject: Re: [RFC] mount flag "direct" (fwd)

On 2002-09-03T23:07:01,
"Peter T. Breuer" <[email protected]> said:

> > *ouch* Sure. Right. You just have to read it from scratch every time. How
> > would you make readdir work?
> Well, one has to read it from scratch. I'll set about seeing how to do.
> Clues welcome.

Yes, use a distributed filesystem. There are _many_ out there: GFS, OCFS,
OpenGFS, Compaq has one as part of their SSI, InterMezzo (sort of), Lustre,
PVFS, etc.

Any of them will appreciate the good work of a bright fellow.

No one appreciates reinventing the wheel another time, especially if - for
simplification - it starts out as a square.

> > Just please, tell us why.
> You don't really want the whole rationale.

Yes, I do.

You tell me why Distributed Filesystems are important. I fully agree.

You fail to give a convincing reason why that must be made to work with
"all" conventional filesystems, especially given the constraints this implies.

Conventional wisdom seems to be that this can much better be handled
by special filesystems, which can do finer-grained locking etc. because
they understand the on-disk structures, can do distributed journal
recovery, etc.

What you are starting would need at least 3-5 years to catch up with what
people currently already can do, and they'll improve in this time too.

I've seen your academic track record and it is surely impressive. I am not
saying that your approach won't work within the constraints. Given enough
thrust, pigs fly. I'm just saying that it would be nice to learn what reasons
you have for this, because I believe that "within the constraints" makes your
proposal essentially useless (see the other mails).

In particular, the constraints make it useless for the requirements you seem to have. A
petabyte filesystem without journaling? A petabyte filesystem with a single
write lock? Gimme a break.

Please, do the research and tell us what features you desire to have which are
currently missing, and why implementing them essentially from scratch is
preferable to extending existing solutions.

You are dancing around all the hard parts. "Don't have a distributed lock
manager, have one central lock." Yeah, right, that has scaled _really_ well
in the past. Then you figure this one out, and come up with a lock-bitmap on
the device itself for locking subtrees of the fs. Next you are going to
realize that a single block is not scalable either, because one needs an
exclusive write lock on it, 'cause you can't just rewrite a single bit. You
might then begin to see that a single bit won't cut it, because for recovery
you'll need to be able to pinpoint all locks a node had and recover them.
Then you might begin to think about the difficulties in distributed lock
management and recovery. ("Transaction Processing" is an exceptionally good
book on that, I believe.)

I bet you a dinner that what you are going to come up with will look
frighteningly like one of the solutions which already exist; so why not
research them first in depth and start working on the one you like most,
instead of wasting time on an academic exercise?

> So, start thinking about general mechanisms to do distributed storage.
> Not particular FS solutions.

Distributed storage needs a way to access it; in the Unix paradigm,
"everything is a file", that implies a distributed filesystem. Other
approaches would include accessing raw blocks and doing the locking in the
application / via a DLM (ie, what Oracle RAC does).


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister