2002-09-03 14:56:59

by Peter T. Breuer

Subject: [RFC] mount flag "direct"

I'll rephrase this as an RFC, since I want help and comments.

Scenario:
I have a driver which accesses a "disk" at the block level, to which
another driver on another machine is also writing. I want to have
an arbitrary FS on this device which can be read from and written to
from both kernels, and I want support at the block level for this idea.

Question:
What do people think of adding a "direct" option to mount, with the
semantics that the VFS then makes all opens on files on the FS mounted
"direct" use O_DIRECT, which means that file r/w is not cached in VMS,
but instead goes straight to and from the real device? Is this enough
or nearly enough for what I have in mind?

Rationale:
No caching means that each kernel doesn't go off with its own idea of
what is on the disk in a file, at least. Dunno about directories and
metadata.

Wish:
If that mount option looks promising, can somebody make provision for
it in the kernel? Details to be ironed out later?

What I have explored or will explore:
1) I have put shared zoned read/write locks on the remote resource, so each
kernel request locks precisely the "disk" area that it should, in
precisely the mode it should, for precisely the duration of each block
layer request.

2) I have maintained request write order from individual kernels.

3) IMO I should also intercept and share the FS superblock lock, but that's
for later, and please tell me about it. What about dentries? Does
O_DIRECT get rid of them? What happens with mkdir?

4) I would LIKE the kernel to emit a "tag request" on the underlying
device before and after every atomic FS operation, so that I can maintain
FS atomicity at the block level. Please comment. Can somebody make this
happen, please? Or do I add the functionality to VFS myself? Where?
(A rough sketch of what I mean by a "tag request" follows.)
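
To make point 4) concrete, here is a rough sketch of the information such a
tag request might carry. None of these names exist in the kernel; the struct
is invented purely to show the shape of the idea.

/* Invented illustration of point 4) -- nothing here exists in the kernel.
 * A "tag request" is just a marker sent down the block device's request
 * queue, bracketing an atomic FS operation so that a shared-disk block
 * driver can turn the bracket into a remote lock/unlock. */
struct fs_tag_request {
	enum { FS_TAG_LOCK, FS_TAG_UNLOCK } op;
	unsigned long long start_sector;	/* first sector the lock covers */
	unsigned long long nr_sectors;		/* length of the locked extent  */
	unsigned long tag_id;			/* pairs a LOCK with its UNLOCK */
};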

I have patched the kernel to support mount -o direct, creating MS_DIRECT
and MNT_DIRECT flags for the purpose. And it works. But I haven't
dared do too much to the remote FS by way of testing yet. I have
confirmed that individual file contents can be changed without problem
when the file size does not change.

Comments?

Here is the tiny proof of concept patch for VFS that implements the
"direct" mount option.


Peter

The idea embodied in this patch is that if we get the MS_DIRECT flag when
the vfs do_mount() is called, we pass it across into the mnt flags used
by do_add_mount() as MNT_DIRECT and thus make it a permanent part of the
vfsmnt object that is the mounted fs. Then, in the generic
dentry_open() call for any file, we examine the flags on the mnt
parameter and set the O_DIRECT flag on the file pointer if MNT_DIRECT
is set on the vfsmnt object.

That makes all file opens O_DIRECT on the file system in question,
and makes all file accesses uncached by the VM.

The patch in itself works fine.
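
For testing, here is a minimal user-space sketch that mounts with the new
flag by calling mount(2) directly, until mount(8) learns about "-o direct".
The device, mount point and fs type below are just placeholders; MS_DIRECT
must match the value the patch adds to include/linux/fs.h.

/* Hypothetical test harness, not part of the patch. */
#include <stdio.h>
#include <sys/mount.h>

#ifndef MS_DIRECT
#define MS_DIRECT 256	/* must agree with the patched include/linux/fs.h */
#endif

int main(void)
{
	/* placeholders: any block device and any fs the patched kernel knows */
	if (mount("/dev/hdb1", "/mnt/shared", "ext2", MS_DIRECT, NULL) < 0) {
		perror("mount");
		return 1;
	}
	return 0;
}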

--- linux-2.5.31/fs/open.c.pre-o_direct Mon Sep 2 20:36:11 2002
+++ linux-2.5.31/fs/open.c Mon Sep 2 17:12:08 2002
@@ -643,6 +643,9 @@
 		if (error)
 			goto cleanup_file;
 	}
+	if (mnt->mnt_flags & MNT_DIRECT)
+		f->f_flags |= O_DIRECT;
+
 	f->f_ra.ra_pages = inode->i_mapping->backing_dev_info->ra_pages;
 	f->f_dentry = dentry;
 	f->f_vfsmnt = mnt;
--- linux-2.5.31/fs/namespace.c.pre-o_direct Mon Sep 2 20:37:39 2002
+++ linux-2.5.31/fs/namespace.c Mon Sep 2 17:12:04 2002
@@ -201,6 +201,7 @@
 	{ MS_MANDLOCK, ",mand" },
 	{ MS_NOATIME, ",noatime" },
 	{ MS_NODIRATIME, ",nodiratime" },
+	{ MS_DIRECT, ",direct" },
 	{ 0, NULL }
 };
 static struct proc_fs_info mnt_info[] = {
@@ -734,7 +741,9 @@
 		mnt_flags |= MNT_NODEV;
 	if (flags & MS_NOEXEC)
 		mnt_flags |= MNT_NOEXEC;
-	flags &= ~(MS_NOSUID|MS_NOEXEC|MS_NODEV);
+	if (flags & MS_DIRECT)
+		mnt_flags |= MNT_DIRECT;
+	flags &= ~(MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_DIRECT);
 
 	/* ... and get the mountpoint */
 	retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
--- linux-2.5.31/include/linux/mount.h.pre-o_direct Mon Sep 2 20:31:16 2002
+++ linux-2.5.31/include/linux/mount.h Mon Sep 2 18:06:14 2002
@@ -17,6 +17,7 @@
 #define MNT_NOSUID	1
 #define MNT_NODEV	2
 #define MNT_NOEXEC	4
+#define MNT_DIRECT	256
 
 struct vfsmount
 {
--- linux-2.5.31/include/linux/fs.h.pre-o_direct Mon Sep 2 20:32:05 2002
+++ linux-2.5.31/include/linux/fs.h Mon Sep 2 18:05:57 2002
@@ -104,6 +104,9 @@
 #define MS_REMOUNT	32	/* Alter flags of a mounted FS */
 #define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
 #define MS_DIRSYNC	128	/* Directory modifications are synchronous */
+
+#define MS_DIRECT	256	/* Make all opens be O_DIRECT */
+
 #define MS_NOATIME	1024	/* Do not update access times. */
 #define MS_NODIRATIME	2048	/* Do not update directory access times */
 #define MS_BIND		4096


2002-09-03 15:04:36

by jbradford

Subject: Re: [RFC] mount flag "direct"

> Rationale:
> No caching means that each kernel doesn't go off with its own idea of
> what is on the disk in a file, at least. Dunno about directories and
> metadata.

Somewhat related to this: is there currently, or could what you're working on now include, a sane way for two or more machines to access a SCSI drive on a shared SCSI bus? In other words, several host adaptors in different machines are all connected to one SCSI bus and can all access a single hard disk. At the moment, you can only do this if all machines mount the disk read-only.

John.

2002-09-03 15:09:11

by Rik van Riel

Subject: Re: [RFC] mount flag "direct"

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> Rationale:
> No caching means that each kernel doesn't go off with its own idea of
> what is on the disk in a file, at least. Dunno about directories and
> metadata.

And what if they both allocate the same disk block to another
file, simultaneously?

A mount option isn't enough to achieve your goal.

It looks like you want GFS or OCFS. Info about GFS can be found at:

http://www.opengfs.org/
http://www.sistina.com/ (commercial GFS)

Dunno where Oracle's cluster fs is documented.

regards,

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-03 15:33:12

by Anton Altaparmakov

Subject: Re: [RFC] mount flag "direct"

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> I'll rephrase this as an RFC, since I want help and comments.
>
> Scenario:
> I have a driver which accesses a "disk" at the block level, to which
> another driver on another machine is also writing. I want to have
> an arbitrary FS on this device which can be read from and written to
> from both kernels, and I want support at the block level for this idea.

You cannot have an arbitrary fs. The two fs drivers must coordinate with
each other in order for your scheme to work. Just think about what happens
if the two fs drivers work on the same file simultaneously and both start
growing it at the same time. All hell would break loose.

For your scheme to work, the fs drivers need to communicate with each
other in order to attain atomicity of cluster and inode (de-)allocations,
etc.

Basically you need a clustered fs for this to work. GFS springs to
mind but I never really looked at it...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-09-03 15:39:40

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct"

"A month of sundays ago Anton Altaparmakov wrote:"
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
>
> > I'll rephrase this as an RFC, since I want help and comments.
> >
> > Scenario:
> > I have a driver which accesses a "disk" at the block level, to which
> > another driver on another machine is also writing. I want to have
> > an arbitrary FS on this device which can be read from and written to
> > from both kernels, and I want support at the block level for this idea.
>
> You cannot have an arbitrary fs. The two fs drivers must coordinate with
> each other in order for your scheme to work. Just think about if the two
> fs drivers work on the same file simultaneously and both start growing the
> file at the same time. All hell would break lose.

Thanks!

Rik also mentioned that objection! That's good. You both "only" see
the same problem, so there can't be many more like it..

I replied thusly:

OK - reply:
It appears that in order to allocate away free space, one must first
"grab" that free space using a shared lock. That's perfectly
feasible.


> For your scheme to work, the fs drivers need to communicate with each
> other in order to attain atomicity of cluster and inode (de-)allocations,
> etc.

Yes. They must create atomic FS operations at the VFS level (grabbing
unallocated space is one of them) and I must share the locks for those
ops.

> Basically you need a clustered fs for this to work. GFS springs to

No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
support necessary :-).

That's really what my question is driving at. I see that I need to
make VFS ops communicate "tag requests" to the block layer, in
order to implement locking. Now you and Rik have pointed out one
operation that needs locking. My next question is obviously: can you
point me more or less precisely at this operation in the VFS layer?
I've only started studying it and I am relatively unfamiliar with it.

Thanks.

Peter

2002-09-03 15:49:00

by Maciej W. Rozycki

Subject: Re: [RFC] mount flag "direct"

On Tue, 3 Sep 2002, Rik van Riel wrote:

> > Rationale:
> > No caching means that each kernel doesn't go off with its own idea of
> > what is on the disk in a file, at least. Dunno about directories and
> > metadata.
>
> And what if they both allocate the same disk block to another
> file, simultaneously ?

You need a mutex then. For SCSI devices a reservation is the way to go
-- the RESERVE/RELEASE commands are mandatory for direct-access devices,
so they should work universally for disks.
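
For concreteness, here is a minimal user-space sketch of issuing
RESERVE(6)/RELEASE(6) through the sg driver's SG_IO ioctl, used as a crude
whole-device mutex. /dev/sg0 is just a placeholder and error handling is
kept to a minimum.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

static int simple_cdb(int fd, unsigned char opcode)
{
	unsigned char cdb[6] = { opcode, 0, 0, 0, 0, 0 };
	unsigned char sense[32];
	struct sg_io_hdr hdr;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.cmd_len = sizeof(cdb);
	hdr.cmdp = cdb;
	hdr.dxfer_direction = SG_DXFER_NONE;	/* no data phase */
	hdr.sbp = sense;
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 5000;			/* milliseconds */
	return ioctl(fd, SG_IO, &hdr);
}

int main(void)
{
	int fd = open("/dev/sg0", O_RDWR);	/* placeholder sg node */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (simple_cdb(fd, 0x16))		/* RESERVE(6) */
		perror("RESERVE");
	/* ... exclusive access to the disk here ... */
	if (simple_cdb(fd, 0x17))		/* RELEASE(6) */
		perror("RELEASE");
	return 0;
}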

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-09-03 16:00:24

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct"

"A month of sundays ago Maciej W. Rozycki wrote:"
> On Tue, 3 Sep 2002, Rik van Riel wrote:
> > And what if they both allocate the same disk block to another
> > file, simultaneously ?
>
> You need a mutex then. For SCSI devices a reservation is the way to go
> -- the RESERVE/RELEASE commands are mandatory for direct-access devices,
> so thy should work universally for disks.

Is there provision in VFS for this operation?

(i.e. care to point me at an entry point? I just grepped for "reserve"
and came up with nothing useful).

Peter

2002-09-03 16:04:45

by Rik van Riel

Subject: Re: [RFC] mount flag "direct"

On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> "A month of sundays ago Maciej W. Rozycki wrote:"
> > On Tue, 3 Sep 2002, Rik van Riel wrote:
> > > And what if they both allocate the same disk block to another
> > > file, simultaneously ?
> >
> > You need a mutex then. For SCSI devices a reservation is the way to go
> > -- the RESERVE/RELEASE commands are mandatory for direct-access devices,
> > so thy should work universally for disks.
>
> Is there provision in VFS for this operation?

No. Everybody but you seems to agree these things should be
filesystem specific and not in the VFS.

> (i.e. care to point me at an entry point? I just grepped for "reserve"
> and came up with nothing useful).

Good.

cheers,

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-03 16:17:44

by Lars Marowsky-Bree

Subject: Re: [RFC] mount flag "direct"

On 2002-09-03T17:44:10,
"Peter T. Breuer" <[email protected]> said:

> No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
> support necessary :-).
>
> That's really what my question is driving at. I see that I need to
> make VFS ops communicate "tag requests" to the block layer, in
> order to implement locking. Now you and Rik have pointed out one
> operation that needs locking. My next question is obviously: can you
> point me more or less precisely at this operation in the VFS layer?
> I've only started studying it and I am relatively unfamiliar with it.

Your approach is not feasible.

Distributed filesystems have a lot of subtle pitfalls - locking, cache
coherency, journal replay to name a few - which you can hardly solve at the
VFS layer.

Good reading would be any sort of entry literature on clustering; I would
recommend "In Search of Clusters" and many of the whitepapers Google will turn
up for you, as well as the OpenGFS source.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister

2002-09-03 16:37:24

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct"

"A month of sundays ago Lars Marowsky-Bree wrote:"
> "Peter T. Breuer" <[email protected]> said:
>
> > No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
> > support necessary :-).
> >
> > That's really what my question is driving at. I see that I need to
> > make VFS ops communicate "tag requests" to the block layer, in
> > order to implement locking. Now you and Rik have pointed out one
> > operation that needs locking. My next question is obviously: can you
> > point me more or less precisely at this operation in the VFS layer?
> > I've only started studying it and I am relatively unfamiliar with it.
>
> Your approach is not feasible.

But you have to be specific about why not. I've responded to the
particular objections so far.

> Distributed filesystems have a lot of subtle pitfalls - locking, cache

Yes, thanks, I know.

> coherency, journal replay to name a few - which you can hardly solve at the

My simple suggestion is not to cache. I am of the opinion that in
principle that solves all coherency problems, since there would be no
stored state that needs to "cohere". The question is how to identify
and remove the state that is currently cached.

As to journal replay, there will be no journalling - if it breaks it
breaks and somebody (fsck) can go fix it. I don't want to get anywhere
near complicated.

> VFS layer.
>
> Good reading would be any sort of entry literature on clustering, I would

Please don't condescend! I am honestly not in need of education :-).

> recommend "In search of clusters" and many of the whitepapers Google will turn
> up for you, as well as the OpenGFS source.

(Puhleeese!)

We already know that we can have a perfectly fine and arbitrary
shared file system, shared only at the block level if we

1) permit no new dirs or files to be made (disable O_CREAT or something
like)
2) do all r/w on files with O_DIRECT
3) do file extensions via a new generic VFS "reserve" operation
4) have shared mutexes on all vfs op, implemented by passing
down a special "tag" request to the block layer.
5) maintain read+write order at the shared resource.

I have already implemented 2,4,5.

The question is how to extend the range of useful operations. For the
moment I would be happy simply to go ahead and implement 1) and 3),
while taking serious strong advice on what to do about directories.



Peter

2002-09-03 17:10:54

by David Lang

Subject: Re: [RFC] mount flag "direct"

Peter, the thing that you seem to be missing is that direct mode only
works for writes; it doesn't force a filesystem to go to the hardware for
reads.

for many filesystems you cannot turn off their internal caching of data
(metadata for some, all data for others)

so to implement what you are after you will have to modify the filesystem
to not cache anything; since you aren't going to do this for every
filesystem, you end up only having this option on the one(s) that you
modify.

if you have a single (or even just a few) filesystems that have this
option you may as well include the locking/syncing software in them rather
than modifying the VFS layer.

David Lang


On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> Date: Tue, 3 Sep 2002 18:41:49 +0200 (MET DST)
> From: Peter T. Breuer <[email protected]>
> To: Lars Marowsky-Bree <[email protected]>
> Cc: Peter T. Breuer <[email protected]>,
> linux kernel <[email protected]>
> Subject: Re: [RFC] mount flag "direct"
>
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> > "Peter T. Breuer" <[email protected]> said:
> >
> > > No! I do not want /A/ fs, but /any/ fs, and I want to add the vfs
> > > support necessary :-).
> > >
> > > That's really what my question is driving at. I see that I need to
> > > make VFS ops communicate "tag requests" to the block layer, in
> > > order to implement locking. Now you and Rik have pointed out one
> > > operation that needs locking. My next question is obviously: can you
> > > point me more or less precisely at this operation in the VFS layer?
> > > I've only started studying it and I am relatively unfamiliar with it.
> >
> > Your approach is not feasible.
>
> But you have to be specific about why not. I've responded to the
> particular objections so far.
>
> > Distributed filesystems have a lot of subtle pitfalls - locking, cache
>
> Yes, thanks, I know.
>
> > coherency, journal replay to name a few - which you can hardly solve at the
>
> My simple suggestion is not to cache. I am of the opinion that in
> principle that solves all coherency problems, since there would be no
> stored state that needs to "cohere". The question is how to identify
> and remove the state that is currently cached.
>
> As to journal replay, there will be no journalling - if it breaks it
> breaks and somebody (fsck) can go fix it. I don't want to get anywhere
> near complicated.
>
> > VFS layer.
> >
> > Good reading would be any sort of entry literature on clustering, I would
>
> Please don't condescend! I am honestly not in need of education :-).
>
> > recommend "In search of clusters" and many of the whitepapers Google will turn
> > up for you, as well as the OpenGFS source.
>
> (Puhleeese!)
>
> We already know that we can have a perfectly fine and arbitrary
> shared file system, shared only at the block level if we
>
> 1) permit no new dirs or files to be made (disable O_CREAT or something
> like)
> 2) do all r/w on files with O_DIRECT
> 3) do file extensions via a new generic VFS "reserve" operation
> 4) have shared mutexes on all vfs op, implemented by passing
> down a special "tag" request to the block layer.
> 5) maintain read+write order at the shared resource.
>
> I have already implemented 2,4,5.
>
> The question is how to extend the range of useful operations. For the
> moment I would be happy simply to go ahead and implement 1) and 3),
> while taking serious strong advice on what to do about directories.
>
>
>
> Peter

2002-09-03 17:25:54

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct"

"A month of sundays ago David Lang wrote:"
> Peter, the thing that you seem to be missing is that direct mode only
> works for writes, it doesn't force a filesystem to go to the hardware for
> reads.

Yes it does. I've checked! Well, at least I've checked that writing
then reading causes the reads to get to the device driver. I haven't
checked what reading twice does.

If it doesn't cause the data to be read twice, then it ought to, and
I'll fix it (given half a clue as extra pay ..:-)

> for many filesystems you cannot turn off their internal caching of data
> (metadata for some, all data for others)

Well, let's take things one at a time. Put in a VFS mechanism and then
convert some FSs to use it.

> so to implement what you are after you will have to modify the filesystem
> to not cache anything, since you aren't going to do this for every

Yes.

> filesystem you end up only haivng this option on the one(s) that you
> modify.

I intend to make the generic mechanism attractive.

> if you have a single (or even just a few) filesystems that have this
> option you may as well include the locking/syncing software in them rather
> then modifying the VFS layer.

Why? Are you advocating a particular approach? Yes, I agree that that
is a possible way to go - but I will want the extra VFS ops anyway,
and will want to modify the particular fs to use them, no?

Peter

2002-09-03 17:21:34

by Rik van Riel

Subject: Re: [RFC] mount flag "direct"

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> > Your approach is not feasible.
>
> But you have to be specific about why not. I've responded to the
> particular objections so far.

[snip]

> Please don't condescend! I am honestly not in need of education :-).

You make it sound like you bet your master's degree on
doing a distributed filesystem without filesystem support ;)

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-03 17:25:37

by Jan Harkes

Subject: Re: [RFC] mount flag "direct"

On Tue, Sep 03, 2002 at 06:41:49PM +0200, Peter T. Breuer wrote:
> "A month of sundays ago Lars Marowsky-Bree wrote:"
> > Your approach is not feasible.
>
> But you have to be specific about why not. I've responded to the
> particular objections so far.
>
> > Distributed filesystems have a lot of subtle pitfalls - locking, cache
>
> Yes, thanks, I know.
>
> > coherency, journal replay to name a few - which you can hardly solve at the
>
> My simple suggestion is not to cache. I am of the opinion that in
> principle that solves all coherency problems, since there would be no
> stored state that needs to "cohere". The question is how to identify
> and remove the state that is currently cached.

That is a very simple suggestion, but not feasible, because there will
always be 'cached copies' floating around. Even if you remove the dcache
(directory lookups) and icache (inode cache) in the kernel, both
filesystems will still need to look at the data in order to modify it.
Looking at the data involves creating an in-memory representation of the
object. Without locking, if one filesystem modifies the object, the other
filesystem is looking at (and possibly modifying) stale data, which causes
consistency problems.

> > Good reading would be any sort of entry literature on clustering, I would
>
> Please don't condescend! I am honestly not in need of education :-).

I'm afraid that all of this has been very well documented; another
example would be Tanenbaum's "Distributed Systems", and the chapter on
various consistency models is an especially nice read.

> We already know that we can have a perfectly fine and arbitrary
> shared file system, shared only at the block level if we
>
> 1) permit no new dirs or files to be made (disable O_CREAT or something
> like)
> 2) do all r/w on files with O_DIRECT
> 3) do file extensions via a new generic VFS "reserve" operation
> 4) have shared mutexes on all vfs op, implemented by passing
> down a special "tag" request to the block layer.
> 5) maintain read+write order at the shared resource.

Can I quote your 'puhleese' here? Inodes share the same on-disk blocks
(with ext2's 128-byte inodes and 1 KB blocks, eight inodes live in each
block), so when one inode is changed (setattr, truncate) and written back
to disk, it affects all the other inodes stored in the same block. So the
shared mutexes at the VFS level don't cover the necessary locking.

Each time you add another point to work around the latest argument,
someone will surely give you another argument, until you end up with a
system that is no longer practical, and probably even slower, because you
absolutely cannot allow the FS to trust _any_ data without a locked read
or write off the disk (or across the network). And since you seem to like
CPU consistency that much, this even involves the data that happens to be
'cached' in the CPU.

> I have already implemented 2,4,5.
>
> The question is how to extend the range of useful operations. For the
> moment I would be happy simply to go ahead and implement 1) and 3),
> while taking serious strong advice on what to do about directories.

Perhaps the fact that directories (and journalled filesystems) aren't
already solved is an indication that the proposed 'solution' is flawed?

Filesystems were designed to trust the disk as 'stable storage', i.e.
anything that was read or recently written will be the same. NFS already
weakens this model slightly. AFS and Coda go even further: we only
guarantee that changes are propagated when a file is closed. There is
a callback mechanism to invalidate cached copies. But even when we open
a file, it could still have been changed within the past 1/2 RTT. This
is a window we intentionally live with because it avoids the full RTT
hit we would have if we had to go to the server on every file open.

It is the latency that kills you when you can't cache.

Jan

2002-09-03 17:43:56

by David Lang

Subject: Re: [RFC] mount flag "direct"

On Tue, 3 Sep 2002, Peter T. Breuer wrote:

> "A month of sundays ago David Lang wrote:"
> > Peter, the thing that you seem to be missing is that direct mode only
> > works for writes, it doesn't force a filesystem to go to the hardware for
> > reads.
>
> Yes it does. I've checked! Well, at least I've checked that writing
> then reading causes the reads to get to the device driver. I haven't
> checked what reading twice does.
>
> If it doesn't cause the data to be read twice, then it ought to, and
> I'll fix it (given half a clue as extra pay ..:-)

writing then reading the same file may cause it to be read from the disk,
but reading /foo/bar then reading /foo/bar again will not cause two reads
of all data.

some filesystems go to a lot of work to organize the metadata (in
particular in memory) to access things more efficiently; you will have to
go into each filesystem and modify them not to do this.

in addition you will have lots of potential races as one system reads a
block of data, modifies it, then writes it while the other system does the
same thing. you cannot easily detect this in the low level drivers as
these are separate calls from the filesystem, and even if you do, what
error message will you send to the second system? there's no error that
says 'the disk has changed under you, back up and re-read it before you
modify it'

yes this is stuff that could be added to all filesystems, but will the
filesystem maintainers let you do this major surgery to their systems?

for example the XFS and JFS teams are going to a lot of effort to maintain
their systems to be compatible with other OS's; they probably won't
appreciate all the extra conditionals that you will need to put in to
do all of this.

even for ext2 there are people (including Linus, I believe) saying
that major new features should not be added to ext2, but to a new
filesystem forked off of ext2 (ext3 for example, or a fork of it).

David Lang

> > for many filesystems you cannot turn off their internal caching of data
> > (metadata for some, all data for others)
>
> Well, let's take things one at a time. Put in a VFS mechanism and then
> convert some FSs to use it.
>
> > so to implement what you are after you will have to modify the filesystem
> > to not cache anything, since you aren't going to do this for every
>
> Yes.
>
> > filesystem you end up only haivng this option on the one(s) that you
> > modify.
>
> I intend to make the generic mechanism attractive.
>
> > if you have a single (or even just a few) filesystems that have this
> > option you may as well include the locking/syncing software in them rather
> > then modifying the VFS layer.
>
> Why? Are you advocating a particular approach? Yes, I agree that that
> is a possible way to go - but I will want the extra VFS ops anyway,
> and will want to modify the particular fs to use them, no?
>
> Peter
>

2002-09-03 18:01:49

by Andreas Dilger

Subject: Re: [RFC] mount flag "direct"

On Sep 03, 2002 14:26 -0300, Rik van Riel wrote:
> On Tue, 3 Sep 2002, Peter T. Breuer wrote:
> > > Your approach is not feasible.
> >
> > But you have to be specific about why not. I've responded to the
> > particular objections so far.
>
> You make it sound like you bet your masters degree on
> doing a distributed filesystem without filesystem support ;)

Actually, we are using ext3 pretty much as-is for our backing-store
for Lustre. The same is true of InterMezzo, and NFS, for that matter.
All of them live on top of a standard "local" filesystem, which doesn't
know the things that happen above it to make it a network filesystem
(locking, etc).

That isn't to say that I agree with just taking a local filesystem and
putting it on a shared block device and expecting it to work with only
the normal filesystem code. We do all of our locking above the fs
level, but we do have some help in the VFS (intent-based lookup, patch
in the Lustre CVS repository, if people are interested).

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-09-03 18:13:46

by Daniel Phillips

Subject: Re: [RFC] mount flag "direct"

On Tuesday 03 September 2002 17:44, Peter T. Breuer wrote:
> > > Scenario:
> > > I have a driver which accesses a "disk" at the block level, to which
> > > another driver on another machine is also writing. I want to have
> > > an arbitrary FS on this device which can be read from and written to
> > > from both kernels, and I want support at the block level for this idea.
> >
> > You cannot have an arbitrary fs. The two fs drivers must coordinate with
> > each other in order for your scheme to work. Just think about if the two
> > fs drivers work on the same file simultaneously and both start growing the
> > file at the same time. All hell would break lose.
>
> Thanks!
>
> Rik also mentioned that objection! That's good. You both "only" see
> the same problem, so there can't be many more like it..

(intentionally misinterpreting) No indeed, there aren't many problems
like it, in terms of sheer complexity.

--
Daniel

2002-09-03 18:24:09

by Daniel Phillips

Subject: Re: [RFC] mount flag "direct"

On Tuesday 03 September 2002 18:41, Peter T. Breuer wrote:
> > Distributed filesystems have a lot of subtle pitfalls - locking, cache
>
> Yes, thanks, I know.
>
> > coherency, journal replay to name a few - which you can hardly solve at
> > the
>
> My simple suggestion is not to cache. I am of the opinion that in
> principle that solves all coherency problems, since there would be no
> stored state that needs to "cohere". The question is how to identify
> and remove the state that is currently cached.

Well, for example, you would not be able to have the same file open in two
different kernels because the inode would be cached. So you'd have to close
the root directory on one kernel before the other could access any file. Not
only would that be horribly inefficient, you would *still* need to implement
a locking protocol between the two kernels to make it work.

There's no magic way of making this easy.

--
Daniel

2002-09-03 18:37:30

by Daniel Phillips

Subject: Re: [RFC] mount flag "direct"

On Tuesday 03 September 2002 20:02, Andreas Dilger wrote:
> Actually, we are using ext3 pretty much as-is for our backing-store
> for Lustre. The same is true of InterMezzo, and NFS, for that matter.
> All of them live on top of a standard "local" filesystem, which doesn't
> know the things that happen above it to make it a network filesystem
> (locking, etc).

To put this in simplistic terms, this works because you treat the
underlying filesystem simply as a storage device, a slightly funky
kind of disk.

--
Daniel

2002-09-04 05:53:02

by Helge Hafting

Subject: Re: [RFC] mount flag "direct"

"Peter T. Breuer" wrote:
>
> "A month of sundays ago David Lang wrote:"
> > Peter, the thing that you seem to be missing is that direct mode only
> > works for writes, it doesn't force a filesystem to go to the hardware for
> > reads.
>
> Yes it does. I've checked! Well, at least I've checked that writing
> then reading causes the reads to get to the device driver. I haven't
> checked what reading twice does.

You tried reading from a file? For how long are you going to
work on that data you read? The other machine may ruin it anytime,
even instantly after you read it.

Now, try "ls -l" twice instead of reading from a file. Notice
that no io happens the second time. Here we're reading
metadata instead of file data. This sort of stuff
is cached in separate caches that assumes nothing
else modifies the disk.



>
> If it doesn't cause the data to be read twice, then it ought to, and
> I'll fix it (given half a clue as extra pay ..:-)
>
> > for many filesystems you cannot turn off their internal caching of data
> > (metadata for some, all data for others)
>
> Well, let's take things one at a time. Put in a VFS mechanism and then
> convert some FSs to use it.
>
> > so to implement what you are after you will have to modify the filesystem
> > to not cache anything, since you aren't going to do this for every
>
> Yes.
>
> > filesystem you end up only haivng this option on the one(s) that you
> > modify.
>
> I intend to make the generic mechanism attractive.

It won't be attractive, for the simple reason that a no-cache fs
will be devastatingly slow. A program that reads a file one byte at
a time will do 1024 disk accesses to read a single kilobyte, and
it will do that again if you run it again.

Nobody will have time to wait for this, and this alone makes your
idea useless. To get an idea, try booting with mem=4M and suffer.
A cacheless fs will be much, much worse than that.

Using NFS or similar will be so much faster. Existing
network fs'es work around the complexities by using one machine as the
disk server; the others simply transfer requests to and from that machine
and let it sort things out alone.

The main reason I can imagine for letting two machines write to
the *same* disk is performance. Going cacheless won't give you
that. But you *can* beat NFS and friends by going for
a "distributed ext2" or similar, where the participating machines
talk to each other about who writes where.
Each machine locks down the blocks it wants to cache, with
either a shared read lock or an exclusive write lock.

There are a lot of performance tricks you may use, such as
pre-reserving some free blocks for each machine, some ranges
of inodes and so on, so each can modify those without
asking the others. Then redistribute stuff occasionally so
nobody runs out while the others have plenty.


Helge Hafting

2002-09-04 06:17:01

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct"

"A month of sundays ago Helge Hafting wrote:"
> "Peter T. Breuer" wrote:
> > "A month of sundays ago David Lang wrote:"
> > > Peter, the thing that you seem to be missing is that direct mode only
> > > works for writes, it doesn't force a filesystem to go to the hardware for
> > > reads.
> >
> > Yes it does. I've checked! Well, at least I've checked that writing
> > then reading causes the reads to get to the device driver. I haven't
> > checked what reading twice does.
>
> You tried reading from a file? For how long are you going to

Yes I did. And I tried reading twice too, and it reads twice at device
level.

> work on that data you read? The other machine may ruin it anytime,

Well, as long as I want to. What's the problem? I read file X at time
T and got data Y. That's all I need.

> even instantly after you read it.

So what?

> Now, try "ls -l" twice instead of reading from a file. Notice
> that no io happens the second time. Here we're reading

Directory data is cached.

> metadata instead of file data. This sort of stuff
> is cached in separate caches that assumes nothing
> else modifies the disk.

True, and I'm happy to change it. I don't think we always had a
directory cache.

> > > filesystem you end up only haivng this option on the one(s) that you
> > > modify.
> >
> > I intend to make the generic mechanism attractive.
>
> It won't be attractive, for the simple reason that a no-cache fs
> will be devastatingly slow. A program that read a file one byte at

A generic mechanism is not a "no cache fs". It's a generic mechanism.

> Nobody will have time to wait for this, and this alone makes your

Try arguing logically. I really don't like it when people invent their
own straw men and then proceed to reason as though it were *mine*.

> The main reason I can imagine for letting two machines write to
> the *same* disk is performance. Going cacheless won't give you

Then imagine some more. I'm not responsible for your imagination ...

> that. But you *can* beat nfs and friends by going for
> a "distributed ext2" or similiar where the participating machines
> talks to each other about who writes where.
> Each machine locks down the blocks they want to cache, with
> either a shared read lock or a exclusive write lock.

That's already done.

> There is a lot of performance tricks you may use, such as

No tricks. Let's be simple.

Peter

2002-09-04 06:44:43

by Helge Hafting

Subject: Re: [RFC] mount flag "direct"

"Peter T. Breuer" wrote:
>
> "A month of sundays ago Helge Hafting wrote:"
> > "Peter T. Breuer" wrote:
> > > "A month of sundays ago David Lang wrote:"
> > > > Peter, the thing that you seem to be missing is that direct mode only
> > > > works for writes, it doesn't force a filesystem to go to the hardware for
> > > > reads.
> > >
> > > Yes it does. I've checked! Well, at least I've checked that writing
> > > then reading causes the reads to get to the device driver. I haven't
> > > checked what reading twice does.
> >
> > You tried reading from a file? For how long are you going to
>
> Yes I did. And I tried readingtwice too, and it reads twice at device
> level.
>
> > work on that data you read? The other machine may ruin it anytime,
>
> Well, as long as I want to. What's the problem? I read file X at time
> T and got data Y. That's all I need.

No problem if all you do is use file data. A serious problem if
the stuff you read is used to make a decision about where
to write something else on that shared disk. For example:
the fs needs to extend a file. It reads the free block bitmap
and finds a free block. Then it overwrites that free block,
and also writes back a changed block bitmap. Unfortunately
some other machine just did the same thing and you
now have a crosslinked and corrupt file.
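
To make the window concrete, here is a rough user-space sketch of that
read-modify-write race. The device path, offset and size are placeholders,
not real filesystem code; run it on two machines against the same shared
disk and both can end up claiming the same bit, i.e. the same block.

#define _XOPEN_SOURCE 500	/* for pread/pwrite */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BITMAP_DEV	"/dev/shared_disk"	/* placeholder device */
#define BITMAP_OFFSET	1024L			/* placeholder bitmap location */
#define BITMAP_SIZE	1024			/* one bitmap block */

int main(void)
{
	unsigned char bitmap[BITMAP_SIZE];
	int fd = open(BITMAP_DEV, O_RDWR);
	int i, bit = -1;

	if (fd < 0 || pread(fd, bitmap, sizeof(bitmap), BITMAP_OFFSET) < 0) {
		perror(BITMAP_DEV);
		return 1;
	}
	for (i = 0; i < BITMAP_SIZE * 8 && bit < 0; i++)	/* find a free bit */
		if (!(bitmap[i / 8] & (1 << (i % 8))))
			bit = i;
	/* ... the other machine can run the same search right now ... */
	if (bit >= 0) {
		bitmap[bit / 8] |= 1 << (bit % 8);		/* claim it */
		pwrite(fd, bitmap, sizeof(bitmap), BITMAP_OFFSET);
		printf("claimed block %d\n", bit);	/* so, perhaps, did the other node */
	}
	close(fd);
	return 0;
}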

There are several similar scenarios. You can't really talk
about "not caching". Once you read something into
memory it is "cached" in memory, even if you only use it once
and then re-read it whenever you need it later.
>
> > even instantly after you read it.
>
> So what?
See above.

> > Now, try "ls -l" twice instead of reading from a file. Notice
> > that no io happens the second time. Here we're reading
>
> Directory data is cached.
>
> > metadata instead of file data. This sort of stuff
> > is cached in separate caches that assumes nothing
> > else modifies the disk.
>
> True, and I'm happy to change it. I don't think we always had a
> directory cache.

>
> > > > filesystem you end up only haivng this option on the one(s) that you
> > > > modify.
> > >
> > > I intend to make the generic mechanism attractive.
> >
> > It won't be attractive, for the simple reason that a no-cache fs
> > will be devastatingly slow. A program that read a file one byte at
>
> A generic mechanism is not a "no cache fs". It's a generic mechanism.
>
> > Nobody will have time to wait for this, and this alone makes your
>
> Try arguing logically. I really don't like it when people invent their
> own straw men and then procede to reason as though it were *mine*.
>
Maybe I wasn't clear. What I am saying is that an fs that doesn't cache
anything in order to avoid cache coherency problems will be
too slow for generic use (such as two desktop computers
sharing a single main disk with applications and data).
Perhaps it is really useful for some special purpose; I haven't
seen you say what you want this for, so I assumed general use.

There is nothing illogical about performance problems. A
cacheless system may _work_ and it might be simple, but
it is also _useless_ for a lot of common situations
where cached fs'es work fine.


> > The main reason I can imagine for letting two machines write to
> > the *same* disk is performance. Going cacheless won't give you
>
> Then imagine some more. I'm not responsible for your imagination ...

You tell. You keep asking why your idea won't work and I
give you "performance problems" _even_ if you sort out the
correctness issues with no other cost than the lack of cache.

Please tell what you think it can be used for. I do not say
it is useless for everything, although it certainly is useless
for the purposes I can come up with. The only uses *I* find
for a shared writeable (but uncachable) disk is so special that
I wouldn't bother putting a fs like ext2 on it. Sharing a
raw block device is doable today if you let the programs
using it keep track of data themselves instead of using
a fs. This isn't what you want though. It could be interesting
to know what you want, considering what your solution looks like.

Helge Hafting

2002-09-04 09:11:28

by Peter T. Breuer

Subject: Re: [RFC] mount flag "direct"

"A month of sundays ago Helge Hafting wrote:"
> No problem if all you do is use file data. A serious problem if
> the stuff you read is used to make a decision about where
> to write something else on that shared disk. For example:
> The fs need to extend a file. It reads the free block bitmap,
> and finds a free block. Then it overwrites that free block,
> and also write back a changed block bitmap. Unfortunately

That's the exact problem that's already been mentioned twice,
and I'm confident of that one being solved. Lock the whole
FS if necessary, but read the bitmap and lock the bitmap on disk
until the extension is finished and the bitmap is written back.
It has been suggested that the VFS support a "reserve/release blocks"
operation. It would simply mark the on-disk bitmap bits as used
and add them to our available list. Then every file extension
or creation would need to be preceded by a reserve command,
or fail, according to policy.
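
Purely as a sketch of the shape such an interface might take (these names
are invented; nothing like them exists in the kernel today):

/* Invented illustration only.  Under the block-level lock, the VFS asks
 * the filesystem to claim nr blocks in the on-disk bitmap for this node;
 * later file extensions and creations draw on that pool, or fail. */
struct super_block;	/* opaque here */

struct shared_block_ops {
	/* mark nr free bits used on disk and add them to this node's pool */
	int (*reserve_blocks)(struct super_block *sb, unsigned long nr);
	/* give unused blocks from this node's pool back to the on-disk bitmap */
	int (*release_blocks)(struct super_block *sb, unsigned long nr);
};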

> some other machine just did the same thing and you
> know have a crosslinked and corrupt file.

There is no problem locking and serializing groups of
read/write accesses. Please stop harping on about THAT at
least :-). What is a problem is marking the groups of accesses.

> There are several similiar scenarios. You can't really talk
> about "not caching". Once you read something into
> memory it is "cached" in memory, even if you only use it once
> and then re-read it whenever you need it later.

That's fine. And I don't see what needs to be reread. You had this
problem once with SMP, and you beat it with locks.

> > A generic mechanism is not a "no cache fs". It's a generic mechanism.
> >
> > > Nobody will have time to wait for this, and this alone makes your
> >
> > Try arguing logically. I really don't like it when people invent their
> > own straw men and then procede to reason as though it were *mine*.
> >
> Maybe I wasn't clear. What I say is that a fs that don't cache
> anything in order to avoid cache coherency problems will be
> too slow for generic use. (Such as two desktop computers

Quite possibly, but not too slow for reading data in and writing data
out, at gigabyte/s rates overall, which is what the intention is.
That's not general use. And even if it were general use, it would still
be pretty acceptable _in general_.

> > Then imagine some more. I'm not responsible for your imagination ...
>
> You tell. You keep asking why your idea won't work and I
> give you "performance problems" _even_ if you sort out the
> correctness issues with no other cost than the lack of cache.

The correctness issues are the only important ones, once we have
correct and fast shared read and write to (existing) files.

> it is useless for everything, although it certainly is useless
> for the purposes I can come up with. The only uses *I* find
> for a shared writeable (but uncachable) disk is so special that
> I wouldn't bother putting a fs like ext2 on it. Sharing a
> raw block device is doable today if you let the programs

It's far too inconvenient to be totally without a FS. What we
want is a normal FS, but slower at some things, and faster at others,
but correct and shared. It's an approach. The calculations show
clearly that r/w (once!) to existing files is the only performance
issue. The rest is decor. But decor that is very nice to have around.

Peter

2002-09-04 11:29:30

by Helge Hafting

Subject: Re: [RFC] mount flag "direct"

"Peter T. Breuer" wrote:

> There is no problem locking and serializing groups of
> read/write accesses. Please stop harping on about THAT at
> least :-). What is a problem is marking the groups of accesses.

Sorry, I now see you dealt with that in other threads.
>
> That's fine. And I don't see what needs to be reread. You had this
> problem once with smp, and you beat it with locks.
>
Consider that taking a lock on an SMP machine is a fairly fast
operation. Taking a lock shared over a network probably
takes about 100-1000 times as long.

People submit patches for shaving a single instruction
off the SMP locks, for performance. The locking is removed
on UP, because it makes a difference even though the
lock never is busy in the UP case. A much slower lock will
either hurt performance a lot, or force a coarse granularity.
The time spent on locking had better be a small fraction
of total time, or you won't get your high performance.
A coarse granularity will limit your software so the
different machines mostly use different parts of the
shared disks, or you'll lose the parallelism.
I guess that is fine with you then.


> > it is useless for everything, although it certainly is useless
> > for the purposes I can come up with. The only uses *I* find
> > for a shared writeable (but uncachable) disk is so special that
> > I wouldn't bother putting a fs like ext2 on it.
>
> It's far too inconvenient to be totally without a FS. What we
> want is a normal FS, but slower at some things, and faster at others,
> but correct and shared. It's an approach. The caclulations show
> clearly that r/w (once!) to existing files are the only performance
> issues. The rest is decor. But decor that is very nice to have around.

Ok. If r/w _once_ is what matters, then surely you don't
need cache. I consider that a rather unusual case though,
which is why you'll have a hard time getting this into
the standard kernel. But maybe you don't need that?

Still, you should consider writing a fs of your own.
It is a _small_ job compared to implementing your locking
system in existing filesystems. Remember that those
filesystems are optimized for a common case of a few
CPUs, where you may take and release hundreds or
thousands of locks per second, and where data transfers
often are small and repetitive. Caching is so
useful for this case that current fs code is designed
around it.

With a fs of your own you won't have to worry about
maintainers changing the rest of the fs
code. That sort of thing is hard to keep up with
given the massive changes you'll need for your sort
of distributed fs. A single-purpose fs isn't such a
big job; you can leave out design considerations
that don't apply to your case.

Helge Hafting