2004-09-06 00:21:33

by NeilBrown

Subject: [PATCH - EXPERIMENTAL] files with forks in the VFS


As a followup to the multi-branching threads about reiser4, I would
like to present this patch for discussion and exploration.
It implements files with forks (which are quite different to files that
provide different views via a subdirectory structure).

See Documentation/filesystems/forks.txt (after applying the patch) for more detail.

This is not "how it should be done" but rather "how it could be done",
and is intended primarily to provide a base for experimentation and
exploration.

Below is a sample of what can be done, and then the patch.

NeilBrown


neilb@adams:/tmp$ ls -l
total 724
drws------ 2 neilb neilb 4096 Sep 6 10:07 echo
-rw------- 1 neilb neilb 732560 Sep 6 10:05 portrait.jpg
neilb@adams:/tmp$ ./echo hello there
hello there
neilb@adams:/tmp$ cat echo/man
Here is some doco
neilb@adams:/tmp$ file portrait.jpg
portrait.jpg: JPEG image data, JFIF standard 1.01, "CREATOR: XV version 3.10a-jumbo"
neilb@adams:/tmp$ file portrait.jpg/thumbnail
portrait.jpg/thumbnail: Netpbm PGM "rawbits" image data
neilb@adams:/tmp$ pnmfile portrait.jpg/thumbnail
portrait.jpg/thumbnail: PGM raw, 121 by 167 maxval 255
neilb@adams:/tmp$ ls -la echo
total 24
drws------ 2 neilb neilb 4096 Sep 6 10:07 .
drwxrwxrwt 7 root root 4096 Sep 6 10:05 ..
-rwx------ 1 neilb neilb 12216 Sep 6 10:05 ...
-rw------- 1 neilb neilb 18 Sep 6 10:07 man
neilb@adams:/tmp$ ls -la portrait.jpg
-rw------- 1 neilb neilb 732560 Sep 6 10:05 portrait.jpg
neilb@adams:/tmp$ ls -la portrait.jpg/
total 748
drws--S--- 2 neilb neilb 4096 Sep 6 10:07 .
drwxrwxrwt 7 root root 4096 Sep 6 10:05 ..
-rw------- 1 neilb neilb 732560 Sep 6 10:05 ...
-rw------- 1 neilb neilb 20347 Sep 6 10:07 thumbnail
neilb@adams:/tmp$

.....

neilb@adams:/tmp$ ls -la emacs
total 28
drws------ 2 neilb neilb 4096 Sep 6 10:16 .
drwxrwxrwt 8 root root 4096 Sep 6 10:17 ..
lrwx------ 1 neilb neilb 14 Sep 6 10:16 ... -> /usr/bin/emacs
-rw------- 1 neilb neilb 20347 Sep 6 10:16 icon
neilb@adams:/tmp$ ./emacs
... emacs runs ...
neilb@adams:/tmp$

Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./Documentation/filesystems/forks.txt | 49 +++++++++++++++++++++
./fs/namei.c | 79 ++++++++++++++++++++++++++++++++++
2 files changed, 128 insertions(+)

diff ./Documentation/filesystems/forks.txt~current~ ./Documentation/filesystems/forks.txt
--- ./Documentation/filesystems/forks.txt~current~ 2004-09-06 09:21:57.000000000 +1000
+++ ./Documentation/filesystems/forks.txt 2004-09-06 10:19:05.000000000 +1000
@@ -0,0 +1,49 @@
+Files with Forks.
+
+
+This kernel contains experimental support for files with forks.
+
+A "file with forks" is a directory which contains a number of files,
+one of which is a distinguished file. If the directory is opened as a
+file, the distinguished file is returned. If it is opened as a
+directory, then the real directory is returned.
+
+The behaviour when the directory is "stat"ed is configurable. Either
+the information about the actual directory or the distinguished file
+can be returned. This allows the behaviour of various programs and
+tools to be tested with the different behaviours to see what works
+best.
+
+A directory is flagged as being a "file with forks" by
+ 1/ setting the setuid bit (chmod u+s)
+ 2/ creating a file or symlink within the directory called
+ "..."
+
+Once this is done, any attempt to open the directory without
+O_DIRECTORY will result in "..." being opened. This works for
+ cat directoryname
+ exec directoryname
+ echo hello > directoryname # ... must already exist
+as examples.
+
+"stat" calls, such as "ls -l", will still show the directory.
+
+If the setgid bit is also set
+ chmod ug+s
+then stat calls will show the distinguished file. This can be
+more confusing to some programs (rm won't like it), but less confusing
+to others (smart editors won't try to list the directory).
+
+ cd directoryname
+will always change into the actual directory, and accessing any path
+name with "." as the last component will always show the directory,
+not the distinguished file.
+
+The functionality is provided for experimentation and discussion
+only. It is not presented as "the right way to do it" but as "an
+interesting experiment to explore".
+
+Note: if "..." is a directory, then it may not be very useful. If it
+is a relative symlink and g+s is set, then it can be confusing.
+
+NeilBrown Sept2004

diff ./fs/namei.c~current~ ./fs/namei.c
--- ./fs/namei.c~current~ 2004-09-03 14:18:08.000000000 +1000
+++ ./fs/namei.c 2004-09-06 09:21:20.000000000 +1000
@@ -818,6 +818,50 @@ last_component:
 			err = -ENOTDIR;
 			if (!inode->i_op || !inode->i_op->lookup)
 				break;
+		} else if ( (inode->i_mode & S_ISUID) &&
+			    ((inode->i_mode & S_ISGID) || (lookup_flags & LOOKUP_OPEN)) &&
+			    inode->i_op && inode->i_op->lookup) {
+			/* opening a setuid directory as a file, open '...'
+			 * within it
+			 */
+			this.name = "...";
+			this.len = 3;
+			hash = init_name_hash();
+			hash = partial_name_hash('.', hash);
+			hash = partial_name_hash('.', hash);
+			hash = partial_name_hash('.', hash);
+			this.hash = end_name_hash(hash);
+
+			if (nd->dentry->d_op && nd->dentry->d_op->d_hash) {
+				err = nd->dentry->d_op->d_hash(nd->dentry, &this);
+				if (err < 0)
+					break;
+			}
+			err = do_lookup(nd, &this, &next);
+			if (!err && next.dentry->d_inode == NULL) {
+				dput(next.dentry);
+				/* abort "..." lookup as no entry */
+			} else {
+				follow_mount(&next.mnt, &next.dentry);
+				inode = next.dentry->d_inode;
+				if ((lookup_flags & LOOKUP_FOLLOW)
+				    && inode->i_op && inode->i_op->follow_link) {
+					mntget(next.mnt);
+					err = do_follow_link(next.dentry, nd);
+					dput(next.dentry);
+					mntput(next.mnt);
+					if (err)
+						goto return_err;
+					inode = nd->dentry->d_inode;
+				} else {
+					dput(nd->dentry);
+					nd->mnt = next.mnt;
+					nd->dentry = next.dentry;
+				}
+				err = -ENOENT;
+				if (!inode)
+					break;
+			}
 		}
 		goto return_base;
 lookup_parent:
@@ -1413,6 +1457,41 @@ do_last:
 	error = -ENOENT;
 	if (!dentry->d_inode)
 		goto exit_dput;
+	if ((dentry->d_inode->i_mode & S_ISUID) &&
+	    dentry->d_inode->i_op && dentry->d_inode->i_op->lookup) {
+		struct path next;
+		int err;
+		struct qstr this;
+		unsigned long hash;
+		this.name = "...";
+		this.len = 3;
+		hash = init_name_hash();
+		hash = partial_name_hash('.', hash);
+		hash = partial_name_hash('.', hash);
+		hash = partial_name_hash('.', hash);
+		this.hash = end_name_hash(hash);
+
+		err = 0;
+		if (dentry->d_op && dentry->d_op->d_hash) {
+			err = dentry->d_op->d_hash(nd->dentry, &this);
+		}
+		if (err >= 0) {
+			struct dentry *tmp = nd->dentry;
+			nd->dentry = dentry;
+			err = do_lookup(nd, &this, &next);
+			if (err) {
+				nd->dentry = tmp;
+			} else if (next.dentry->d_inode == NULL) {
+				dput(next.dentry);
+				nd->dentry = tmp;
+				/* abort "..." lookup as no entry */
+			} else if (!err) {
+				dput(tmp);
+				dentry = next.dentry;
+			}
+		}
+	}
+
 	if (dentry->d_inode->i_op && dentry->d_inode->i_op->follow_link)
 		goto do_link;

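For readers without a kernel tree to hand, the open-coded hash of "..."
that appears in both hunks can be reproduced in user space. The sketch
below assumes the 2.6-era definitions of init_name_hash(),
partial_name_hash() and end_name_hash() from include/linux/dcache.h as
recalled here (check your own tree); name_hash() and fork_name_hash()
are illustrative names, not kernel functions. It only shows that the
three unrolled calls amount to hashing the three-character name in a
loop.

```c
#include <assert.h>
#include <stddef.h>

/*
 * User-space copies of the dcache name-hash helpers. ASSUMPTION: these
 * bodies follow 2.6-era include/linux/dcache.h as recalled; they are
 * not taken from the patch itself.
 */
static unsigned long init_name_hash(void)
{
	return 0;
}

static unsigned long partial_name_hash(unsigned long c, unsigned long prev)
{
	return (prev + (c << 4) + (c >> 4)) * 11;
}

static unsigned long end_name_hash(unsigned long hash)
{
	return (unsigned int)hash;
}

/* Hypothetical helper: hash any name one character at a time. */
static unsigned long name_hash(const unsigned char *name, size_t len)
{
	unsigned long hash = init_name_hash();

	assert(name != NULL);
	while (len--)
		hash = partial_name_hash(*name++, hash);
	return end_name_hash(hash);
}

/* Hypothetical helper: the three unrolled calls used for "...". */
static unsigned long fork_name_hash(void)
{
	unsigned long hash = init_name_hash();

	hash = partial_name_hash('.', hash);
	hash = partial_name_hash('.', hash);
	hash = partial_name_hash('.', hash);
	return end_name_hash(hash);
}
```

Under these assumed definitions fork_name_hash() and
name_hash("...", 3) return the same value, so the unrolled form in the
patch is a convenience rather than a different algorithm.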

2004-09-06 05:59:14

by Hans Reiser

Subject: Re: [PATCH - EXPERIMENTAL] files with forks in the VFS

Neil Brown wrote:

>As a followup to the multi-branching threads about reiser4, I would
>like to present this patch for discussion and exploration.
>It implements files with forks (which are quite different to files that
>provide different views via a subdirectory structure).
>
>
How are they different? Having a distinguished file is consistent with
the reiser4 approach.

>See Documentation/filesystems/forks.txt (after applying the patch) for more detail.
>
>This is not "how it should be done" but rather "how it could be done",
>and is intended primarily to provide a base for experimentation and
>exploration.
>
>Below is a sample of what can be done, and then the patch.
>
>NeilBrown
>
>

2004-09-06 23:04:48

by NeilBrown

Subject: Re: [PATCH - EXPERIMENTAL] files with forks in the VFS

On Sunday September 5, [email protected] wrote:
> Neil Brown wrote:
>
> >As a followup to the multi-branching threads about reiser4, I would
> >like to present this patch for discussion and exploration.
> >It implements files with forks (which are quite different to files that
> >provide different views via a subdirectory structure).
> >
> >
> How are they different? Having a distinguished file is consistent with
> the reiser4 approach.
>

They are different at least in my perception. It is possible that a
common abstraction and a common implementation could support them
both, though I am slightly sceptical.

On the one hand, you have a name space within a file which provides
access to information that is not part of that file but is only
loosely associated with it: an icon for a desktop app, documentation
for a program, a collection of fonts that a document uses.

On the other hand, you have a name space within a file which provides
alternate views onto information that already exists within that
file: "unzip" which presents the file uncompressed, "tar" which
explodes a tar archive, "tag" which shows tags in a multi-media
file, "elf" which exposes sections of an ELF executable.

In the first case, the subordinate files should clearly be writable,
and should be backed up along with the main file.
In the second case, it is not clear that subordinate files should or
could be writable in general (though there may well be specific
cases), and the data does not need to be backed up.

In the first case, the extra semantic only applies to files, not
directories (allowing a directory to have extra streams is nothing
new).
In the second case, the extra semantic should apply to directories as
well (as there may well be different views you might want on a
directory).

NeilBrown

2004-09-07 19:55:07

by Hans Reiser

Subject: Re: [PATCH - EXPERIMENTAL] files with forks in the VFS

Neil Brown wrote:

>On Sunday September 5, [email protected] wrote:
>
>
>>Neil Brown wrote:
>>
>>
>>
>>>As a followup to the multi-branching threads about reiser4, I would
>>>like to present this patch for discussion and exploration.
>>>It implements files with forks (which are quite different to files that
>>>provide different views via a subdirectory structure).
>>>
>>>
>>>
>>>
>>How are they different? Having a distinguished file is consistent with
>>the reiser4 approach.
>>
>>
>>
>
>They are different at least in my perception. It is possible that a
>common abstraction and a common implementation could support them
>both, though I am slightly sceptical.
>
>On the one hand, you have a name space within a file which provides
>access to information that is not part of that file but is only
>loosely associated with it: an icon for a desktop app, documentation
>for a program, a collection of fonts that a document uses.
>
>On the other hand, you have a name space within a file which provides
>alternate views onto information that already exists within that
>file: "unzip" which presents the file uncompressed, "tar" which
>explodes a tar archive, "tag" which shows tags in a multi-media
>file, "elf" which exposes sections of an ELF executable.
>
>In the first case, the subordinate files should clearly be writable,
>and should be backed up along with the main file.
>In the second case, it is not clear that subordinate files should or
>could be writable in general (though there may well be specific
>cases), and the data does not need to be backed up.
>
>
After the file compression plugin we should consider creating a
directory compression plugin for directories with lots of small files....

>In the first case, the extra semantic only applies to files, not
>directories (allowing a directory to have extra streams is nothing
>new).
>In the second case, the extra semantic should apply to directories as
>well (as there may well be different views you might want on a
>directory).
>
>
I don't understand the paragraph above. Can you say with fewer
indirections (e.g. define extra semantic)?

>NeilBrown
>
>
>
>

2004-09-08 00:28:18

by NeilBrown

Subject: Re: [PATCH - EXPERIMENTAL] files with forks in the VFS

On Tuesday September 7, [email protected] wrote:
> Neil Brown wrote:
> >In the first case, the extra semantic only applies to files, not
> >directories (allowing a directory to have extra streams is nothing
> >new).
> >In the second case, the extra semantic should apply to directories as
> >well (as there may well be different views you might want on a
> >directory).
> >
> >
> I don't understand the paragraph above. Can you say with fewer
> indirections (e.g. define extra semantic)?

Sorry. I'll be more verbose (probably too verbose. Stop reading when
you've had enough:-)

The topic of discussion is "adding namespace below files", and I want
to be sure I know *why* we are doing that - what the purpose is.
Knowing the purpose helps a lot with reasoning.

I claim there are two distinct purposes that have been discussed.
There may be more, but there seem to be two main ones.

The first purpose can be called "attributes" or "forks". This allows
arbitrary pieces of data to be associated with a pre-existing
filesystem object. Here the "extra semantic" is "extra information
can be stored as name-value pairs". This is already the case for
directories. Directories have files with names. Adding some other
sorts of name-value pairs is nothing more or less than adding a wart
to the name space.
Other filesystem objects (files, device-special-files, pipes ...) do
not currently have a way to store extra name-value pairs, so they
could conceivably benefit from the "extra semantic".

My patch showed an example of how to provide this extra semantic, so
that it could be explored in a concrete way.
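
As a rough illustration of that semantic, the behaviour described in
forks.txt can be written down as a small decision table. This is a
user-space sketch only, not the kernel code path: fork_resolve(), the
enum names and the has_dots parameter (whether the directory contains
an entry called "...") are all invented for the example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/* How the name is being used (hypothetical classification). */
enum fork_access {
	FORK_OPEN,		/* open without O_DIRECTORY, exec, cat ... */
	FORK_OPEN_DIRECTORY,	/* O_DIRECTORY, cd, or a trailing "." */
	FORK_STAT		/* stat(2) and friends, e.g. ls -l */
};

/* What the caller ends up seeing. */
enum fork_target {
	FORK_TARGET_SELF,		/* the object that was named */
	FORK_TARGET_DISTINGUISHED	/* the "..." entry inside it */
};

static enum fork_target
fork_resolve(mode_t mode, bool has_dots, enum fork_access how)
{
	assert(how == FORK_OPEN || how == FORK_OPEN_DIRECTORY ||
	       how == FORK_STAT);

	/* Only a setuid directory containing "..." is a file with forks. */
	if (!S_ISDIR(mode) || !(mode & S_ISUID) || !has_dots)
		return FORK_TARGET_SELF;

	switch (how) {
	case FORK_OPEN:
		return FORK_TARGET_DISTINGUISHED;
	case FORK_STAT:
		/* With g+s as well, stat reports the distinguished file. */
		return (mode & S_ISGID) ? FORK_TARGET_DISTINGUISHED
					: FORK_TARGET_SELF;
	case FORK_OPEN_DIRECTORY:
	default:
		return FORK_TARGET_SELF;
	}
}
```

For a directory with mode drws------ that contains "...", this gives
the distinguished file for a plain open and the directory itself for
stat; adding the setgid bit switches stat over to the distinguished
file as well.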

The second purpose can be called "views". It allows a pre-existing
filesystem object to be seen in a different way than normal. Here the
"extra semantic" is "alternate ways to look at what you already have".
Some examples are "untar" and "uncompress".
You wouldn't normally expect a "view" to show information that is not
already in the object, but only to show already-existing information
in a different way. (Note that this isn't an absolute rule. e.g. a
file is a container as well as a content. Some information about the
container is visible in stat(2) [such as st_blocks], but not all
[e.g. depth of btree] so a view could reasonably show other
information).

Views could reasonably apply to directories just as much as to files
or other filesystem objects.

So this is one element of the difference between "attributes" and
"views". There is no point in giving directories "attributes" because
they already have them. They are called "files".
There is point in giving directories "views" - just as much as for
files.

One example of a need for views on directories that came up a while
ago, but was not (I think) resolved, was that the NFS client (NFSv4 in
particular) wanted to present statistics for each mountpoint (request
counts, errors, retries etc). This is most naturally done as a view on
the root directory, but Unix has no easy way to present views. xattrs
were suggested, as were magic names in the directory. Neither was
universally accepted.

Implementing "attributes" is very like implementing a directory. I
believe that the one should be leveraged to support the other (if the
support is actually needed/wanted).

Implementing "views" is much more like implementing filesystems. A
filesystem is in some sense a filter that maps some underlying object
(typically a block device, but sometimes a network connection or even
a file) into some other filesystem object (typically a directory
tree).
I think if views are wanted, they should, as much as possible,
leverage mounts and filesystems.

One could imagine using filenames like:
/dev/sda1/ext3,rw,data=journal,nodev/some/path

to transparently mount a block device with appropriate options and
provide access to it. Among the several problems with this (many of
which could be resolved) is the fact that you cannot use it to mount a
directory or a file-with-attributes as accessing names within those
objects already has another meaning.
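
To make the idea concrete, a component such as
"ext3,rw,data=journal,nodev" would have to be broken into a filesystem
type plus options before anything could be mounted. The sketch below is
one hedged way to do that in user space. parse_view_spec(), struct
view_spec and the VIEW_* flags are hypothetical names, and treating
ro/rw/nodev as generic flags while passing everything else through as
filesystem-specific data is an assumption modelled on how mount options
are usually split.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VIEW_RDONLY 0x1u	/* hypothetical flag bits for this sketch */
#define VIEW_NODEV  0x2u

struct view_spec {
	char fstype[32];	/* first comma-separated field */
	unsigned int flags;	/* generic options recognised below */
	char data[128];		/* remaining options, comma separated */
};

/* Returns 0 on success, -1 if the spec is empty or a field is too long. */
static int parse_view_spec(const char *spec, struct view_spec *out)
{
	const char *p = spec;
	int first = 1;

	assert(spec != NULL && out != NULL);
	memset(out, 0, sizeof(*out));

	while (*p) {
		size_t n = strcspn(p, ",");

		if (first) {
			if (n == 0 || n >= sizeof(out->fstype))
				return -1;
			memcpy(out->fstype, p, n); /* already zero-filled */
			first = 0;
		} else if (n == 2 && memcmp(p, "ro", 2) == 0) {
			out->flags |= VIEW_RDONLY;
		} else if (n == 2 && memcmp(p, "rw", 2) == 0) {
			out->flags &= ~VIEW_RDONLY;
		} else if (n == 5 && memcmp(p, "nodev", 5) == 0) {
			out->flags |= VIEW_NODEV;
		} else if (n > 0) {
			size_t used = strlen(out->data);

			/* room for a separating comma and the terminator */
			if (used + n + 2 > sizeof(out->data))
				return -1;
			if (used)
				out->data[used++] = ',';
			memcpy(out->data + used, p, n);
			out->data[used + n] = '\0';
		}
		p += n;
		if (*p == ',')
			p++;
	}
	return first ? -1 : 0;
}
```

For the example name above this yields fstype "ext3", the nodev flag
set, read-only clear, and "data=journal" left over as
filesystem-specific data.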

So here is the problem. We seem to want name1/name2 to mean two
different things
1/ an attribute/fork/file named "name2" within the object named "name1"
2/ a view called "name2" on the object called "name1"
and that just doesn't work in the kernel.

Now it is worth noting that in user-space, we cope quite well with
filenames meaning different things in different contexts.
'~' at the start of a filename often indicates a user's home
directory.
* is often a wildcard
{,} allows one filename to become several

These are all dealt with quite effectively in user-space using
convention and quotes.
Maybe the right answer to the current problem is to leave it to
user-space.
e.g. x/y always means to the kernel "the attribute/fork/file in 'x'
that is called 'y'"
but allow userspace to treat (say) a '~' at the start of a
non-initial component to mean "mount the view". So the above
example simply becomes
/dev/sda1/~ext3,rw,data=journal,nodev/some/path

[As there are already problems with filenames beginning with ~, this
might not be too much of a further problem]
Then the shell, or any other program that currently handles
wildcards, can perform the implied mount, and substitute the
chosen mountpoint ($HOME/.mounts/xxxx) for the initial part of the
path.
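
The first step such a shell would need is to find the view component in
a path. The sketch below shows one possible way, under the convention
suggested above that a '~' at the start of a non-initial component
names a view. split_view_path() and struct view_path are hypothetical
names, and the fixed 256-byte buffers are an arbitrary choice for the
example; performing the mount and substituting the mountpoint are left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct view_path {
	char object[256];	/* path of the underlying object */
	char view[256];		/* view spec, without the leading '~' */
	char rest[256];		/* path inside the view, may be empty */
};

/*
 * Returns 1 and fills *out if a non-initial component starts with '~',
 * 0 if the path names no view, -1 if a piece does not fit.
 */
static int split_view_path(const char *path, struct view_path *out)
{
	const char *slash, *p, *end;
	size_t n;

	assert(path != NULL && out != NULL);

	slash = strstr(path, "/~");
	if (slash == path)	/* "/~x" is still the initial component */
		slash = strstr(path + 1, "/~");
	if (slash == NULL)
		return 0;

	/* Everything before "/~" names the object being viewed. */
	n = (size_t)(slash - path);
	if (n >= sizeof(out->object))
		return -1;
	memcpy(out->object, path, n);
	out->object[n] = '\0';

	/* The view spec runs up to the next '/' or the end of the path. */
	p = slash + 2;
	end = strchr(p, '/');
	n = end ? (size_t)(end - p) : strlen(p);
	if (n >= sizeof(out->view))
		return -1;
	memcpy(out->view, p, n);
	out->view[n] = '\0';

	/* Whatever follows is resolved inside the mounted view. */
	p = end ? end + 1 : p + n;
	if (strlen(p) >= sizeof(out->rest))
		return -1;
	strcpy(out->rest, p);
	return 1;
}
```

Applied to /dev/sda1/~ext3,rw,data=journal,nodev/some/path this
separates "/dev/sda1", "ext3,rw,data=journal,nodev" and "some/path",
while a leading ~user component is left alone for the usual
home-directory expansion.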

This would provide most of what you want with relatively little
ugliness in the kernel.

Some simple cases of this could be done today:
cat ~mirror/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2/~unarch/linux-2.6.9/Makefile

could be interpreted by your shell to run some "unarch" program on the
kernel archive, asking for "linux-2.6.9/Makefile". "unarch" would
notice that it is a tar.bz2, would unbzip and extract just the
Makefile and store it in some temp directory. It would then run
cat $TEMPDIR/Makefile

all transparently.

Ok, I'll stop rambling now.

NeilBrown