2010-08-26 18:35:18

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 0/5] hybrid union filesystem prototype

This is the result of my experimentation with trying to do
union-mounts like semantics with a filesystem. The implementation is
far from perfect, but I think the concept does sort of work. See the
patch header for the union filesystem.

VFS modifications necessary to make it work:

- allow f_op->open() to return a different file
- pass dentry to i_op->permission() instead of inode
- hack to vfs_rename() to allow rename to same inode

Git tree is here:

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git union-hybrid

Thanks,
Miklos
--


2010-08-27 07:06:05

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Thu, 26 Aug 2010 20:33:40 +0200
Miklos Szeredi <[email protected]> wrote:

> This is the result of my experimentation with trying to do
> union-mounts like semantics with a filesystem. The implementation is
> far from perfect, but I think the concept does sort of work. See the
> patch header for the union filesystem.
>
> VFS modifications necessary to make it work:
>
> - allow f_op->open() to return a different file
> - pass dentry to i_op->permission() instead of inode
> - hack to vfs_rename() to allow rename to same inode

Hi Miklos,

My first problem with this that there isn't nearly enough documentation.
So I offer the following to fix this problem. Please correct anything that I
got glaringly wrong. I don't claim it is at all complete, but touches on the
things that I thought were interesting.

NeilBrown



Hybrid Union Filesystem
=======================

This document describes a prototype for a new approach to providing
union-filesystem functionality in Linux.
A union-filesystem tries to present the union of two different filesystems as
though it were a single filesystem. The result will inevitably fail to look
exactly like a normal filesystem for various technical reasons. The
expectation is that many use cases will be able to ignore these differences.

This approach is 'hybrid' because the objects that appear in the filesystem
do not all appear to belong to that filesystem. In many case an object
accessed in the hybrid-union will be indistinguishable from accessing the
corresponding object from the original filesystem. This is most obvious
from the 'st_dev' field returned by stat(2). Some objects will report an
st_dev from one original filesystem, some from the other, none will report an
st_dev from the union itself. Similarly st_ino will only be unique when
combined with st_dev, and both of these can change over the lifetime of an
object.
Many applications and tools ignore these values and will not be affected.

Upper and Lower
---------------

A union filesystem combines two filesystems - an 'upper' filesystem and a
'lower' filesystem. Note that while in set theory, 'union' is a commutative
operation, in filesystems it is not - the two filesystems are treated
differently. When a name exists in both filesystems, the object in the
'upper' filesystem is visible while the object in the 'lower' filesystem is
either hidden or, in the case of directories, merged with the 'upper' object.

It would be more correct to refer to an upper and lower 'directory tree'
rather than 'filesystem' as it is quite possible for both directory trees to
be in the same filesystem and there is no requirement that the root of a
filesystem be given for either upper or lower.

The lower filesystem can be any filesystem supported by Linux and does not
need to be writable. It could even be another union.
The upper filesystem will normally be writeable and if it is it must support
the creation of trusted.* extended attributes, and must provide valid d_type
in readdir responses, at least for symbolic links - so NFS is not suitable.

A read-only union of two read-only filesystems is support for any filesystem
type.

Directories
-----------

Unioning mainly involved directories. If a given name appears in both upper
ad lower filesystems and refers to a non-directory in either, then the lower
object is hidden - the name refers only to the upper object.

Where both upper and lower objects are directories, a merged directory is
formed.

At mount time, the two directories given as mount options are combined
into a merged directory. Then whenever a lookup is requested in such a
merged directory, the lookup is performed in each actual directory and the
combined result is cached in the dentry belonging to the union filesystem.
If both actual lookups find directories, both are stored and a merged
directory is create, otherwise only one is stored: the upper if it exists,
else the lower.

Only the lists of names from directories are merged. Other content such as
metadata and extended attributes are reported for the upper directory only.
These attributes of the lower directory are hidden.

whiteouts and opaque directories
--------------------------------

In order to support rm and rmdir without changing the lower filesystem, a
union filesystem needs to record in the upper filesystem that files have been
removed. This is done using whiteouts and opaque directories (non-directories
are always opaque).

The hybrid union filesystem uses extended attributes with a "trusted.union."
prefix to record these details.

A whiteout is created as a symbolic link with target "(union-whiteout)" and
with xattr "trusted.union.whiteout" set to "y". When a whiteout is found in
the upper level of a merged directory, any matching name in the lower level
is ignored, and the whiteout itself is also hidden.

A directory is made opaque by setting the xattr "trusted.union.opaque" to "y".
Where the upper filesystem contains an opaque directory, any directory in the
lower filesystem with the same name is ignored.

readdir
-------

When a 'readdir' request is made on a merged directory, the upper and lower
directories are each read and the name lists merged in the obvious way (upper
is read first, then lower - entries that already exist are not re-added).
This merged name list is cached in the 'struct file' and so remains as long as
the file is kept open. If the directory is opened and read by two processes
at the same time, they will each have separate caches.
A seekdir to the start of the directory (offset 0) followed by a readdir will
cause the cache to be discarded and rebuilt.

This means that changes to the merged directory do not appear while a
directory is being read. This is unlikely to be noticed by many programs.

seek offsets are assigned sequentially when the directories are read. Thus
if
- read part of a directory
- remember an offset, and close the directory
- re-open the directory some time later
- seek to the remembered offset

there may be little correlation between the old and new locations in the list
of filenames, particularly if anything has changed in the directory.

Readdir on directories that are not merged is simply handled by the
underlying directory (upper or lower).


Non-directories
---------------

Objects that are not directories (files, symlinks, device-special files etc)
are presented either from the upper or lower filesystem as appropriate.
When a file in the lower filesystem is accessed in a way the requires
write-access; such as opening for write access, changing some metadata etc,
the file is first copied from the lower filesystem to the upper filesystem
(copy_up). Note that creating a hard-link also requires copy-up, though of
course creation of a symlink does not.

The copy_up process first makes sure that the containing directory exists
in the upper filesystem - creating it and any parents as necessary. It then
creates the object with the same metadata (owner, mode, mtime, symlink-target
etc) and then if the object is a file, the data is copied from the lower to
the upper filesystem. Finally any extended attributes are copied up.

Once the copy_up is complete, the hybrid-union filesystem simply provides
direct access to the newly created file in the upper filesystem - future
operations on the file are barely noticed by the hybrid-union (though an
operation on the name of the file such as rename or unlink will of course be
noticed and handled).

Changes to underlying filesystems
---------------------------------

The hybrid-union filesystem does not insist that the upper and lower
filesystems that it combines remain read-only. This is certainly preferred
but it is non-trivial to enforce (e.g. for NFS) and so cannot be depended on.
A system integrator should ensure the underlying filesystems are not changed
- except through the hybrid union - if they want predictable behaviour, or
must accept the consequences.
The hybrid-union filesystem only ensures that unexpected changes do not cause
a system crash or serious corruption to either underlying filesystem.

The exact behaviour of the hybrid-union is best described in terms from the
'dcache' and as objects can expire from the 'dcache' at unpredictable times,
the exact behaviour when the underlying filesystems change is in
general unpredictable.

Each dcache entry in the hybrid union holds a reference to a corresponding
dcache entry in upper or lower filesystem or, in the case of merged
directories, in both the upper and lower filesystems.

If a referenced file is rename directly in one of the filesystems, that
renaming will not be noticed in the hybrid union until the referencing dcache
entry expires.

If a referenced file is unlinked from a local filesystem, it could similarly
still be accessibly through the hybrid while the dcache entry persists. If a
referenced file is on an NFS filesystem and is unlinked remotely, the file in
the hybrid-union will become 'stale' returning ESTALE for most accesses. It
will not disappear until the dcache entry expires from the dcache.

If a referenced file is changed, either contents or metadata, that change
will be immediately visible in the hybrid union.

If the contents of a directory are changed, then a readdir will not see that
change unless it opens the file or seeks to the start of the file after the
change was made. Then it will be visible.

Changes will *not* be visible to a lookup if the name that is changed is
currently cached in the the dcache, whether positive or negative.
e.g. If a name is looked for in the hybrid union and not found, a negative
dcache entry will be created. If the same name is directly created in the
upper filesystem it will not immediately become visible to the hybrid, though
it will be visible in a readdir. This could be confusing, and could
probably be fixed using dcache revalidation.

2010-08-27 08:47:53

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

Hi Neil,

On Fri, 27 Aug 2010, Neil Brown wrote:
> My first problem with this that there isn't nearly enough documentation.
> So I offer the following to fix this problem. Please correct anything that I
> got glaringly wrong. I don't claim it is at all complete, but touches on the
> things that I thought were interesting.

Whoa, thank you very much. I'll add it to the patchset with some
fixes (see below).

> Hybrid Union Filesystem
> =======================
>
> This document describes a prototype for a new approach to providing
> union-filesystem functionality in Linux.
> A union-filesystem tries to present the union of two different filesystems as
> though it were a single filesystem. The result will inevitably fail to look
> exactly like a normal filesystem for various technical reasons. The
> expectation is that many use cases will be able to ignore these differences.
>
> This approach is 'hybrid' because the objects that appear in the filesystem
> do not all appear to belong to that filesystem. In many case an object
> accessed in the hybrid-union will be indistinguishable from accessing the
> corresponding object from the original filesystem. This is most obvious
> from the 'st_dev' field returned by stat(2). Some objects will report an
> st_dev from one original filesystem, some from the other, none will report an
> st_dev from the union itself.

Hmm, that's a bug. Directories actually come from the union itself
for various reasons, and it does report the correct st_ino but not
st_dev. Fixed.

> Similarly st_ino will only be unique when
> combined with st_dev, and both of these can change over the lifetime of an
> object.
> Many applications and tools ignore these values and will not be affected.
>
> Upper and Lower
> ---------------
>
> A union filesystem combines two filesystems - an 'upper' filesystem and a
> 'lower' filesystem. Note that while in set theory, 'union' is a commutative
> operation, in filesystems it is not - the two filesystems are treated
> differently. When a name exists in both filesystems, the object in the
> 'upper' filesystem is visible while the object in the 'lower' filesystem is
> either hidden or, in the case of directories, merged with the 'upper' object.
>
> It would be more correct to refer to an upper and lower 'directory tree'
> rather than 'filesystem' as it is quite possible for both directory trees to
> be in the same filesystem and there is no requirement that the root of a
> filesystem be given for either upper or lower.
>
> The lower filesystem can be any filesystem supported by Linux and does not
> need to be writable. It could even be another union.

Almost. Xattr namespace issues currently prevent that, but with
escaping it could be worked around if necessary.

> The upper filesystem will normally be writeable and if it is it must support
> the creation of trusted.* extended attributes, and must provide valid d_type
> in readdir responses, at least for symbolic links - so NFS is not suitable.
>
> A read-only union of two read-only filesystems is support for any filesystem
> type.

s/is support for/may use/

>
> Directories
> -----------
>
> Unioning mainly involved directories. If a given name appears in both upper
> ad lower filesystems and refers to a non-directory in either, then the lower
> object is hidden - the name refers only to the upper object.
>
> Where both upper and lower objects are directories, a merged directory is
> formed.
>
> At mount time, the two directories given as mount options are combined
> into a merged directory. Then whenever a lookup is requested in such a
> merged directory, the lookup is performed in each actual directory and the
> combined result is cached in the dentry belonging to the union filesystem.
> If both actual lookups find directories, both are stored and a merged
> directory is create, otherwise only one is stored: the upper if it exists,
> else the lower.
>
> Only the lists of names from directories are merged. Other content such as
> metadata and extended attributes are reported for the upper directory only.
> These attributes of the lower directory are hidden.
>
> whiteouts and opaque directories
> --------------------------------
>
> In order to support rm and rmdir without changing the lower filesystem, a
> union filesystem needs to record in the upper filesystem that files have been
> removed. This is done using whiteouts and opaque directories (non-directories
> are always opaque).
>
> The hybrid union filesystem uses extended attributes with a "trusted.union."
> prefix to record these details.
>
> A whiteout is created as a symbolic link with target "(union-whiteout)" and
> with xattr "trusted.union.whiteout" set to "y". When a whiteout is found in
> the upper level of a merged directory, any matching name in the lower level
> is ignored, and the whiteout itself is also hidden.
>
> A directory is made opaque by setting the xattr "trusted.union.opaque" to "y".
> Where the upper filesystem contains an opaque directory, any directory in the
> lower filesystem with the same name is ignored.
>
> readdir
> -------
>
> When a 'readdir' request is made on a merged directory, the upper and lower
> directories are each read and the name lists merged in the obvious way (upper
> is read first, then lower - entries that already exist are not re-added).
> This merged name list is cached in the 'struct file' and so remains as long as
> the file is kept open. If the directory is opened and read by two processes
> at the same time, they will each have separate caches.
> A seekdir to the start of the directory (offset 0) followed by a readdir will
> cause the cache to be discarded and rebuilt.
>
> This means that changes to the merged directory do not appear while a
> directory is being read. This is unlikely to be noticed by many programs.
>
> seek offsets are assigned sequentially when the directories are read. Thus
> if
> - read part of a directory
> - remember an offset, and close the directory
> - re-open the directory some time later
> - seek to the remembered offset
>
> there may be little correlation between the old and new locations in the list
> of filenames, particularly if anything has changed in the directory.
>
> Readdir on directories that are not merged is simply handled by the
> underlying directory (upper or lower).
>
>
> Non-directories
> ---------------
>
> Objects that are not directories (files, symlinks, device-special files etc)
> are presented either from the upper or lower filesystem as appropriate.
> When a file in the lower filesystem is accessed in a way the requires
> write-access; such as opening for write access, changing some metadata etc,
> the file is first copied from the lower filesystem to the upper filesystem
> (copy_up). Note that creating a hard-link also requires copy-up, though of
> course creation of a symlink does not.
>
> The copy_up process first makes sure that the containing directory exists
> in the upper filesystem - creating it and any parents as necessary. It then
> creates the object with the same metadata (owner, mode, mtime, symlink-target
> etc) and then if the object is a file, the data is copied from the lower to
> the upper filesystem. Finally any extended attributes are copied up.
>
> Once the copy_up is complete, the hybrid-union filesystem simply provides
> direct access to the newly created file in the upper filesystem - future
> operations on the file are barely noticed by the hybrid-union (though an
> operation on the name of the file such as rename or unlink will of course be
> noticed and handled).
>
> Changes to underlying filesystems
> ---------------------------------
>

For now I refuse to even think about what happens in this case.

The easiest way out of this mess might simply be to enforce exclusive
modification to the underlying filesystems on a local level, same as
the union mount strategy. For NFS and other remote filesystems we
either

a) add some way to enforce it,
b) live with the consequences if not enforced on the system level, or
c) disallow them to be part of the union.


> The hybrid-union filesystem does not insist that the upper and lower
> filesystems that it combines remain read-only. This is certainly preferred
> but it is non-trivial to enforce (e.g. for NFS) and so cannot be depended on.
> A system integrator should ensure the underlying filesystems are not changed
> - except through the hybrid union - if they want predictable behaviour, or
> must accept the consequences.
> The hybrid-union filesystem only ensures that unexpected changes do not cause
> a system crash or serious corruption to either underlying filesystem.
>
> The exact behaviour of the hybrid-union is best described in terms from the
> 'dcache' and as objects can expire from the 'dcache' at unpredictable times,
> the exact behaviour when the underlying filesystems change is in
> general unpredictable.
>
> Each dcache entry in the hybrid union holds a reference to a corresponding
> dcache entry in upper or lower filesystem or, in the case of merged
> directories, in both the upper and lower filesystems.
>
> If a referenced file is rename directly in one of the filesystems, that
> renaming will not be noticed in the hybrid union until the referencing dcache
> entry expires.
>
> If a referenced file is unlinked from a local filesystem, it could similarly
> still be accessibly through the hybrid while the dcache entry persists. If a
> referenced file is on an NFS filesystem and is unlinked remotely, the file in
> the hybrid-union will become 'stale' returning ESTALE for most accesses. It
> will not disappear until the dcache entry expires from the dcache.
>
> If a referenced file is changed, either contents or metadata, that change
> will be immediately visible in the hybrid union.
>
> If the contents of a directory are changed, then a readdir will not see that
> change unless it opens the file or seeks to the start of the file after the
> change was made. Then it will be visible.
>
> Changes will *not* be visible to a lookup if the name that is changed is
> currently cached in the the dcache, whether positive or negative.
> e.g. If a name is looked for in the hybrid union and not found, a negative
> dcache entry will be created. If the same name is directly created in the
> upper filesystem it will not immediately become visible to the hybrid, though
> it will be visible in a readdir. This could be confusing, and could
> probably be fixed using dcache revalidation.
>

Thanks,
Miklos

2010-08-27 11:35:17

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Fri, 27 Aug 2010 10:47:39 +0200
Miklos Szeredi <[email protected]> wrote:

> Hi Neil,
>
> On Fri, 27 Aug 2010, Neil Brown wrote:
> > My first problem with this that there isn't nearly enough documentation.
> > So I offer the following to fix this problem. Please correct anything that I
> > got glaringly wrong. I don't claim it is at all complete, but touches on the
> > things that I thought were interesting.
>
> Whoa, thank you very much. I'll add it to the patchset with some
> fixes (see below).
>
> > Hybrid Union Filesystem
> > =======================
> >
> > This document describes a prototype for a new approach to providing
> > union-filesystem functionality in Linux.
> > A union-filesystem tries to present the union of two different filesystems as
> > though it were a single filesystem. The result will inevitably fail to look
> > exactly like a normal filesystem for various technical reasons. The
> > expectation is that many use cases will be able to ignore these differences.
> >
> > This approach is 'hybrid' because the objects that appear in the filesystem
> > do not all appear to belong to that filesystem. In many case an object
> > accessed in the hybrid-union will be indistinguishable from accessing the
> > corresponding object from the original filesystem. This is most obvious
> > from the 'st_dev' field returned by stat(2). Some objects will report an
> > st_dev from one original filesystem, some from the other, none will report an
> > st_dev from the union itself.
>
> Hmm, that's a bug. Directories actually come from the union itself
> for various reasons, and it does report the correct st_ino but not
> st_dev. Fixed.

I was wondering why you even bothered to fiddle st_ino. Given that lots of
other things come from one fs or the other, having the merged directories
appear to be from the upper filesystem doesn't seem like a problem. I don't
really care either way though.

> >
> > The lower filesystem can be any filesystem supported by Linux and does not
> > need to be writable. It could even be another union.
>
> Almost. Xattr namespace issues currently prevent that, but with
> escaping it could be worked around if necessary.

But you never access the xattrs for the lower directory so that shouldn't
matter.
Have a union for the upper fs would certainly be sufficiently 'interesting'
to explicitly forbid.

> >
> > Changes to underlying filesystems
> > ---------------------------------
> >
>
> For now I refuse to even think about what happens in this case.
>
> The easiest way out of this mess might simply be to enforce exclusive
> modification to the underlying filesystems on a local level, same as
> the union mount strategy. For NFS and other remote filesystems we
> either
>
> a) add some way to enforce it,
> b) live with the consequences if not enforced on the system level, or
> c) disallow them to be part of the union.
>

I actually think that your approach can work quite will with either the
upper or lower changing independently. Certainly it can produce some odd
situations, but even NFS can do that (though maybe not quite so odd).

It think the best approach would be to handle the few that can be handled and
explicitly document the rest - people might even find them useful.


Anyway, here is my next instalment which are the review comments, now that I
have some documentation to read :-)

In no particular order:

1/ You call union_remove_whiteouts on an upper directory once you have
determined that the merged directory is empty, which implies that the only
things in the the upper directory are whiteouts.
union_remove_whiteouts calls union_unlink_writeout on every entry.
It checks that each entry is a DT_LNK - which we assume it must be - but
doesn't check the the inode there is really a whiteout.
It seem inconsistent to do the one check but not the other.

As this could race with independent changes on the upper filesystem, I
thick it would be safest to at least check the i_mode and i_size of the
dentry->d_inode that is found, and possible do a readlink as well to
ensure we only delete whiteouts.

1a/ A minor optimisation for union_is_white would be to check i_size matches
size of (union-whiteout)

2/ It bothers me that the 'dev_name' arg to union_get_sb is unused, yet there
are mandatory options. I think it would be nice if dev_name were required
to be lower,upper and options were left for other things.
So the typical usage would be:

mount -t union /path/to/lower,/path/to/upper /mount/point

3/ I think it would be great if you made use of d_revalidate to handle some
of the worst problems caused by underlying filesystem changes. If you
cache i_mtime and i_version of the parent in the dentry and re-do the
lookup if either is different from the directory you could hide most of
the issues mentioned in the doco about dentries expiring. And if you
called d_revalidate in the underlying filesystem at the same time you could
probably hide all the rest.

4/ The problem you mentioned about ->i_mutex which is taken during unlink
being quasi-global could be eased somewhat by simply having a small pool
of inodes and assigning them to dentries pseudo-randomly. Or possibly
having one 'file' inode per directory (that might be a bit wasteful
though).

5/ I wonder if it would be useful to recognise 'fallthrough' objects (much
like whiteouts but inverted) and to optionally (based on mount option) copy
up any directory on readdir and fill it with fall-throughs for any lower
name that isn't in the upper. This helps with enormous directories
(though not if upper is RAMFS of course) and ensures a stable directory
cookie.


While I like the idea of being able to work with changeable filesystems and
think most scenarios can be handled adequately, there is one that I'm not
comfortable with.
If you open a file readonly and get a handle on the file in the lower
filesystem, and then you fchmod the handle, it will work but will change the
lower filesystem - which you don't really want (I think).
The only way to avoid this currently is to require the lower vfsmnt to be
mounted readonly. That is probably acceptable (it can be mounted read-write
elsewhere) but I'd like it to be easier than that.
An alternate would be to teach mnt_want_write_file about some new flag which
marks the 'struct file' as 'really-readonly' and have union_open set that
flag. Note sure if I really like that or not.
Probably for now just:

6/ require __mnt_is_readonly(lowerpath.mnt) at mount time.

Finally, I think it is really important to document all the non-standard
aspects of the unioned filesystem in some detail and suggest work-around for
potentially problematic behaviour.
The fact that a read-only open can get a lower-filesystem filehandle has a
lot of interesting consequences.
If you take a read-lock, it doesn't stop an other process writing to the file
(though it does stop the file from changing!).
If you request a DNOTIFY on a directory, you will not see when things get
added.
If you take a lease on a file it won't be broken by someone trying to write.

So if you:
- create a union.
- start a daemon that watches for changes
- make changes

The daemon could never notice. Of course if you stop everything and restart
it will then work smoothly, as the 'make-changes' will have copied-up the
directory.

Most such confusion can be easily avoided by 'touching' any file or directory
that will be monitored for changes or will be involved in locking. Having to
do that after creating a union is a hassle, but only a minor hassle I think,
and may not actually be needed at all in many cases.

On the whole, I really like this approach. It strikes a good balance between
simplicity and functionality. It doesn't provide perfect semantics, but I
think it provides good-enough semantics and a very moderate cost.

Thanks,
NeilBrown

2010-08-27 16:53:55

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Fri, 27 Aug 2010, Neil Brown wrote:
> I was wondering why you even bothered to fiddle st_ino. Given that lots of
> other things come from one fs or the other, having the merged directories
> appear to be from the upper filesystem doesn't seem like a problem. I don't
> really care either way though.

"rm -rf" complains if st_ino of a directory changes spontaneously.

> > > The lower filesystem can be any filesystem supported by Linux and does not
> > > need to be writable. It could even be another union.
> >
> > Almost. Xattr namespace issues currently prevent that, but with
> > escaping it could be worked around if necessary.
>
> But you never access the xattrs for the lower directory so that shouldn't
> matter.

Ah, right. Another small issue is that currently unionfs accesses
inode->i_* from the underlying filesystems instead of calling
->getattr(), which will break if the underlying fs is a union with its
dummy inodes. But that should be easy to fix.

> I actually think that your approach can work quite will with either the
> upper or lower changing independently. Certainly it can produce some odd
> situations, but even NFS can do that (though maybe not quite so odd).
>
> It think the best approach would be to handle the few that can be handled and
> explicitly document the rest - people might even find them useful.

I think it's best to leave that stuff until someone actually cares.
The "people might find it useful" argument is not strong enough to put
nontrivial effort into thinking about all the weird corner cases.

> Anyway, here is my next instalment which are the review comments, now that I
> have some documentation to read :-)
>
> In no particular order:
>
> 1/ You call union_remove_whiteouts on an upper directory once you have
> determined that the merged directory is empty, which implies that the only
> things in the the upper directory are whiteouts.
> union_remove_whiteouts calls union_unlink_writeout on every entry.
> It checks that each entry is a DT_LNK - which we assume it must be - but
> doesn't check the the inode there is really a whiteout.
> It seem inconsistent to do the one check but not the other.

The DT_LNK check is done to filter out "." and "..". Added comment.

> As this could race with independent changes on the upper filesystem, I
> thick it would be safest to at least check the i_mode and i_size of the
> dentry->d_inode that is found, and possible do a readlink as well to
> ensure we only delete whiteouts.
>
> 1a/ A minor optimisation for union_is_white would be to check i_size matches
> size of (union-whiteout)

I'm not fond of relying on inode->i_* members directly as unionfs
itself doesn't play by those rules. But maybe it's OK here as
anything wanting to be an upper filesystem will be sufficiently
"normal" for this to work.

Fixed.

> 2/ It bothers me that the 'dev_name' arg to union_get_sb is unused, yet there
> are mandatory options. I think it would be nice if dev_name were required
> to be lower,upper and options were left for other things.
> So the typical usage would be:
>
> mount -t union /path/to/lower,/path/to/upper /mount/point
>

I think that's a matter of taste. The 'dev_name' argument is just a
specialized option, and when that option needs a structure like your
example then IMO it's better to just move it to normal options.

> 3/ I think it would be great if you made use of d_revalidate to handle some
> of the worst problems caused by underlying filesystem changes. If you
> cache i_mtime and i_version of the parent in the dentry and re-do the
> lookup if either is different from the directory you could hide most of
> the issues mentioned in the doco about dentries expiring. And if you
> called d_revalidate in the underlying filesystem at the same time you could
> probably hide all the rest.

As I said, I'd leave it until someone actually needs this.

>
> 4/ The problem you mentioned about ->i_mutex which is taken during unlink
> being quasi-global could be eased somewhat by simply having a small pool
> of inodes and assigning them to dentries pseudo-randomly. Or possibly
> having one 'file' inode per directory (that might be a bit wasteful
> though).

That's an idea, but I'm inclined just to add some hacks to the VFS to
omit the locking if some inode flag is set.

> 5/ I wonder if it would be useful to recognise 'fallthrough' objects (much
> like whiteouts but inverted) and to optionally (based on mount option) copy
> up any directory on readdir and fill it with fall-throughs for any lower
> name that isn't in the upper. This helps with enormous directories
> (though not if upper is RAMFS of course) and ensures a stable directory
> cookie.

I'm not sure if the stable directory cookie problem is important
enough. AFAIR some ancient versions of libc relied on directory
seeking and also some weird apps might, but anything sane will not
touch that interface (and I'm hoping someday we can get rid of it for
good).

As for caching large directories, I think that's best done with the
page cache, not by permanently copying up the contents to the upper
directory.

> While I like the idea of being able to work with changeable filesystems and
> think most scenarios can be handled adequately, there is one that I'm not
> comfortable with.
> If you open a file readonly and get a handle on the file in the lower
> filesystem, and then you fchmod the handle, it will work but will change the
> lower filesystem - which you don't really want (I think).
> The only way to avoid this currently is to require the lower vfsmnt to be
> mounted readonly. That is probably acceptable (it can be mounted read-write
> elsewhere) but I'd like it to be easier than that.
> An alternate would be to teach mnt_want_write_file about some new flag which
> marks the 'struct file' as 'really-readonly' and have union_open set that
> flag. Note sure if I really like that or not.
> Probably for now just:
>
> 6/ require __mnt_is_readonly(lowerpath.mnt) at mount time.

Right, that's at the front of the todo list.

> Finally, I think it is really important to document all the non-standard
> aspects of the unioned filesystem in some detail and suggest work-around for
> potentially problematic behaviour.

I agree completely. I just tend to write code first and documentation
later (or as late as possibly can) so your contribution in this area
really warms my heart :)

Thanks,
Miklos

2010-08-29 04:42:20

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Fri, 27 Aug 2010 18:53:45 +0200
Miklos Szeredi <[email protected]> wrote:

> On Fri, 27 Aug 2010, Neil Brown wrote:
> > I was wondering why you even bothered to fiddle st_ino. Given that lots of
> > other things come from one fs or the other, having the merged directories
> > appear to be from the upper filesystem doesn't seem like a problem. I don't
> > really care either way though.
>
> "rm -rf" complains if st_ino of a directory changes spontaneously.

I see...
Doesn't that mean that you must always present a merged-directory if the
lower directory exists.
Otherwise if you "rm -r" in a directory that only exists in the lower
filesystem, the inode will change on the first 'unlink' call.
??

>
> > > > The lower filesystem can be any filesystem supported by Linux and does not
> > > > need to be writable. It could even be another union.
> > >
> > > Almost. Xattr namespace issues currently prevent that, but with
> > > escaping it could be worked around if necessary.
> >
> > But you never access the xattrs for the lower directory so that shouldn't
> > matter.
>
> Ah, right. Another small issue is that currently unionfs accesses
> inode->i_* from the underlying filesystems instead of calling
> ->getattr(), which will break if the underlying fs is a union with its
> dummy inodes. But that should be easy to fix.
>
> > I actually think that your approach can work quite will with either the
> > upper or lower changing independently. Certainly it can produce some odd
> > situations, but even NFS can do that (though maybe not quite so odd).
> >
> > It think the best approach would be to handle the few that can be handled and
> > explicitly document the rest - people might even find them useful.
>
> I think it's best to leave that stuff until someone actually cares.
> The "people might find it useful" argument is not strong enough to put
> nontrivial effort into thinking about all the weird corner cases.

My thought on this is that in order to describe the behaviour of the
filesystem accurately you need to think about all the weird corner cases.

Then it becomes a question of: is it easier to document the weird behaviour,
or change the code so the documentation will be easier to understand?
Some cases will go one way, some the other.

But as you suggest this isn't urgent.

>
> > Anyway, here is my next instalment which are the review comments, now that I
> > have some documentation to read :-)
> >
> > In no particular order:
> >
> > 1/ You call union_remove_whiteouts on an upper directory once you have
> > determined that the merged directory is empty, which implies that the only
> > things in the the upper directory are whiteouts.
> > union_remove_whiteouts calls union_unlink_writeout on every entry.
> > It checks that each entry is a DT_LNK - which we assume it must be - but
> > doesn't check the the inode there is really a whiteout.
> > It seem inconsistent to do the one check but not the other.
>
> The DT_LNK check is done to filter out "." and "..". Added comment.
>
> > As this could race with independent changes on the upper filesystem, I
> > thick it would be safest to at least check the i_mode and i_size of the
> > dentry->d_inode that is found, and possible do a readlink as well to
> > ensure we only delete whiteouts.
> >
> > 1a/ A minor optimisation for union_is_white would be to check i_size matches
> > size of (union-whiteout)
>
> I'm not fond of relying on inode->i_* members directly as unionfs
> itself doesn't play by those rules. But maybe it's OK here as
> anything wanting to be an upper filesystem will be sufficiently
> "normal" for this to work.
>
> Fixed.

I see your point and I think you are right. In most cases the filesystem or
some support-library it calls should be the only thing to look at i_*
directly.
An exception seems to be that (i_mode & S_IFMT) is often treated and publicly
viewable. But I agree that assuming i_size means something particular is
dangerous.


> >
> > 4/ The problem you mentioned about ->i_mutex which is taken during unlink
> > being quasi-global could be eased somewhat by simply having a small pool
> > of inodes and assigning them to dentries pseudo-randomly. Or possibly
> > having one 'file' inode per directory (that might be a bit wasteful
> > though).
>
> That's an idea, but I'm inclined just to add some hacks to the VFS to
> omit the locking if some inode flag is set.

Every new flag you add is another sign of weakness in the VFS....

Hmmm... vfs_unlink checks if the victim is a mountpoint, which suggests that
you can mount on a non-directory. I'm not sure if I knew that.
I wonder how this union fs would cope if a file in the upper filesystem were
mounted-on....I suspect it would work correctly.



Two other thoughts.
My comment about set-theory unions being commutative set me thinking. I
really don't think "union" is the right name for this thing. There is
nothing about it which really fits that proper definition of a union.
whiteouts mean that even the list of names in a directory is not the union of
the lists of names in the upper and lower directories.
"overlay" is a much more accurate name. But union seems to be the name
that is most used. I wonder if it is too late to change that.

Also, dnotify (and presumably inotifiy) doesn't work reliably in the current
implementation.
It works for opaque directories and those which don't have a namesake in the
lower filesystem, but for others it never notifies of changes to any files in
the directory.
This is because dnotify will set an fsnotifier on the merged-directory in the
union-fs, but the change happens to the file in the upper fs, so
fsnotify_parent will only call notifiers on the parent in the upper fs.

I think the way to fix this would involve the union-fs putting a notifier on
the upper dir to match whatever is on the merged-dir. However the filesystem
currently isn't told when notifiers are attach to an inode. I think it would
be good to add an inode_operation which is called whenever the notifiers are
changed. This would also allow a networked filesystem to report notification
if the protocol supported it.
Does that seem reasonable?

NeilBrown

2010-08-30 10:18:24

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Sun, 29 Aug 2010, Neil Brown wrote:
> On Fri, 27 Aug 2010 18:53:45 +0200
> Miklos Szeredi <[email protected]> wrote:
>
> > On Fri, 27 Aug 2010, Neil Brown wrote:
> > > I was wondering why you even bothered to fiddle st_ino. Given
> > > that lots of other things come from one fs or the other, having
> > > the merged directories appear to be from the upper filesystem
> > > doesn't seem like a problem. I don't really care either way
> > > though.
> >
> > "rm -rf" complains if st_ino of a directory changes spontaneously.
>
> I see...
> Doesn't that mean that you must always present a merged-directory if the
> lower directory exists.

Directory opens are never "forwarded" to the lower filesystem like
other opens. One reason is that lookups continuing from the file's
path need to be on the union filesystem instead on one of the
underlying filesystems.

> Otherwise if you "rm -r" in a directory that only exists in the lower
> filesystem, the inode will change on the first 'unlink' call.
> ??

Right.

Since the union filesystem doesn't have "real" inodes the st_ino on
union directories are stable only as long as the inode is in the
cache. That's not a problem for "rm -rf" because the ancestor
directory will always have a reference through one of its children.

> > I think it's best to leave that stuff until someone actually cares.
> > The "people might find it useful" argument is not strong enough to put
> > nontrivial effort into thinking about all the weird corner cases.
>
> My thought on this is that in order to describe the behaviour of the
> filesystem accurately you need to think about all the weird corner cases.
>
> Then it becomes a question of: is it easier to document the weird behaviour,
> or change the code so the documentation will be easier to understand?
> Some cases will go one way, some the other.
>
> But as you suggest this isn't urgent.

You didn't mention one possibility: add limitations that prevent the
weird corner cases arising. I believe this is the simplest option.


> > > 4/ The problem you mentioned about ->i_mutex which is taken during unlink
> > > being quasi-global could be eased somewhat by simply having a small pool
> > > of inodes and assigning them to dentries pseudo-randomly. Or possibly
> > > having one 'file' inode per directory (that might be a bit wasteful
> > > though).
> >
> > That's an idea, but I'm inclined just to add some hacks to the VFS to
> > omit the locking if some inode flag is set.
>
> Every new flag you add is another sign of weakness in the VFS....
>
> Hmmm... vfs_unlink checks if the victim is a mountpoint, which suggests that
> you can mount on a non-directory. I'm not sure if I knew that.

mount --bind has always worked on non-directories too. It's a useful
feature.

> I wonder how this union fs would cope if a file in the upper filesystem were
> mounted-on....I suspect it would work correctly.

Mounts on the union fs work as expected, they don't care about inodes
except that directories cannot cover non-directories and vice-versa.

Mounts on the upper and lower fs are currenly ignored. It's an
interesting question whether union fs should handle mount trees on the
underlying filesystems or not.


> My comment about set-theory unions being commutative set me thinking. I
> really don't think "union" is the right name for this thing. There is
> nothing about it which really fits that proper definition of a union.
> whiteouts mean that even the list of names in a directory is not the union of
> the lists of names in the upper and lower directories.
> "overlay" is a much more accurate name. But union seems to be the name
> that is most used. I wonder if it is too late to change that.

We could call it overlayfs. People learn new names quickly :)

> Also, dnotify (and presumably inotifiy) doesn't work reliably in the current
> implementation.
> It works for opaque directories and those which don't have a namesake in the
> lower filesystem, but for others it never notifies of changes to any files in
> the directory.
> This is because dnotify will set an fsnotifier on the merged-directory in the
> union-fs, but the change happens to the file in the upper fs, so
> fsnotify_parent will only call notifiers on the parent in the upper fs.

I think *notify will work correctly, every modificaton will be
notified on both the union fs and the upper fs. But I haven't tested
this yet.

Thanks,
Miklos

2010-08-30 11:40:38

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Mon, 30 Aug 2010 12:18:11 +0200
Miklos Szeredi <[email protected]> wrote:

> On Sun, 29 Aug 2010, Neil Brown wrote:
> > On Fri, 27 Aug 2010 18:53:45 +0200
> > Miklos Szeredi <[email protected]> wrote:
> >
> > > On Fri, 27 Aug 2010, Neil Brown wrote:
> > > > I was wondering why you even bothered to fiddle st_ino. Given
> > > > that lots of other things come from one fs or the other, having
> > > > the merged directories appear to be from the upper filesystem
> > > > doesn't seem like a problem. I don't really care either way
> > > > though.
> > >
> > > "rm -rf" complains if st_ino of a directory changes spontaneously.
> >
> > I see...
> > Doesn't that mean that you must always present a merged-directory if the
> > lower directory exists.
>
> Directory opens are never "forwarded" to the lower filesystem like
> other opens. One reason is that lookups continuing from the file's
> path need to be on the union filesystem instead on one of the
> underlying filesystems.

You're right - I misread the code there.

> > > I think it's best to leave that stuff until someone actually cares.
> > > The "people might find it useful" argument is not strong enough to put
> > > nontrivial effort into thinking about all the weird corner cases.
> >
> > My thought on this is that in order to describe the behaviour of the
> > filesystem accurately you need to think about all the weird corner cases.
> >
> > Then it becomes a question of: is it easier to document the weird behaviour,
> > or change the code so the documentation will be easier to understand?
> > Some cases will go one way, some the other.
> >
> > But as you suggest this isn't urgent.
>
> You didn't mention one possibility: add limitations that prevent the
> weird corner cases arising. I believe this is the simplest option.

Val has been following that approach and asking if it is possible to make an
NFS filesystem really-truly read-only. i.e. no changes.
I don't believe it is.

But I won't pursue this further without patches.


> > My comment about set-theory unions being commutative set me thinking. I
> > really don't think "union" is the right name for this thing. There is
> > nothing about it which really fits that proper definition of a union.
> > whiteouts mean that even the list of names in a directory is not the union of
> > the lists of names in the upper and lower directories.
> > "overlay" is a much more accurate name. But union seems to be the name
> > that is most used. I wonder if it is too late to change that.
>
> We could call it overlayfs. People learn new names quickly :)

+1

>
> > Also, dnotify (and presumably inotifiy) doesn't work reliably in the current
> > implementation.
> > It works for opaque directories and those which don't have a namesake in the
> > lower filesystem, but for others it never notifies of changes to any files in
> > the directory.
> > This is because dnotify will set an fsnotifier on the merged-directory in the
> > union-fs, but the change happens to the file in the upper fs, so
> > fsnotify_parent will only call notifiers on the parent in the upper fs.
>
> I think *notify will work correctly, every modificaton will be
> notified on both the union fs and the upper fs. But I haven't tested
> this yet.

I tried. It doesn't.
To be precise: directory changes like file creation (even creation of a file
that already exists) get notified, but purely file-based event like modifying
the contents of the file don't generate events back to the overlayfs
directory.
Because an open (for write) of a file is passed through to the upper
filesystem, the notifications of modification through that open only go to the
upper filesystem.

Thanks,
NeilBrown

2010-08-30 12:20:59

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Mon, 30 Aug 2010, Neil Brown wrote:

> > You didn't mention one possibility: add limitations that prevent the
> > weird corner cases arising. I believe this is the simplest option.
>
> Val has been following that approach and asking if it is possible to make an
> NFS filesystem really-truly read-only. i.e. no changes.
> I don't believe it is.

Perhaps it doesn't matter. The nasty cases can be prevented by just
disallowing local modification. For the rest NFS will return ESTALE:
"though luck, why didn't you follow the rules?"

> > I think *notify will work correctly, every modificaton will be
> > notified on both the union fs and the upper fs. But I haven't tested
> > this yet.
>
> I tried. It doesn't.
> To be precise: directory changes like file creation (even creation of a file
> that already exists) get notified, but purely file-based event like modifying
> the contents of the file don't generate events back to the overlayfs
> directory.
> Because an open (for write) of a file is passed through to the upper
> filesystem, the notifications of modification through that open only go to the
> upper filesystem.

Ah, right.

> > > I think the way to fix this would involve the union-fs putting a
> > > notifier on the upper dir to match whatever is on the
> > > merged-dir. However the filesystem currently isn't told when
> > > notifiers are attach to an inode. I think it would be good to
> > > add an inode_operation which is called whenever the notifiers
> > > are changed. This would also allow a networked filesystem to
> > > report notification if the protocol supported it. Does that
> > > seem reasonable?

In that light this sounds entirely reasonable.

Thanks,
Miklos

2010-08-30 18:39:08

by Valerie Aurora

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Fri, Aug 27, 2010 at 09:35:02PM +1000, Neil Brown wrote:
> On Fri, 27 Aug 2010 10:47:39 +0200
> Miklos Szeredi <[email protected]> wrote:
> > > Changes to underlying filesystems
> > > ---------------------------------
> > >
> >
> > For now I refuse to even think about what happens in this case.
> >
> > The easiest way out of this mess might simply be to enforce exclusive
> > modification to the underlying filesystems on a local level, same as
> > the union mount strategy. For NFS and other remote filesystems we
> > either
> >
> > a) add some way to enforce it,
> > b) live with the consequences if not enforced on the system level, or
> > c) disallow them to be part of the union.
> >
>
> I actually think that your approach can work quite will with either the
> upper or lower changing independently. Certainly it can produce some odd
> situations, but even NFS can do that (though maybe not quite so odd).

I'm very curious about your thoughts on how to handle the lower layer
changing. Al Viro's comments:

http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html

Do you see something we're missing?

-VAL

2010-08-30 23:12:23

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Mon, 30 Aug 2010 14:38:43 -0400
Valerie Aurora <[email protected]> wrote:

> On Fri, Aug 27, 2010 at 09:35:02PM +1000, Neil Brown wrote:
> > On Fri, 27 Aug 2010 10:47:39 +0200
> > Miklos Szeredi <[email protected]> wrote:
> > > > Changes to underlying filesystems
> > > > ---------------------------------
> > > >
> > >
> > > For now I refuse to even think about what happens in this case.
> > >
> > > The easiest way out of this mess might simply be to enforce exclusive
> > > modification to the underlying filesystems on a local level, same as
> > > the union mount strategy. For NFS and other remote filesystems we
> > > either
> > >
> > > a) add some way to enforce it,
> > > b) live with the consequences if not enforced on the system level, or
> > > c) disallow them to be part of the union.
> > >
> >
> > I actually think that your approach can work quite will with either the
> > upper or lower changing independently. Certainly it can produce some odd
> > situations, but even NFS can do that (though maybe not quite so odd).
>
> I'm very curious about your thoughts on how to handle the lower layer
> changing. Al Viro's comments:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
>
> Do you see something we're missing?
>

That would be:

> If you allow a mix of old and new mappings, you can easily run into the
> situations when at some moment X1 covers Y1, X2 covers Y2, X2 is a descendent
> of X1 and Y1 is a descendent of Y2. You *really* don't want to go there -
> if nothing else, defining behaviour of copyup in face of that insanity
> will be very painful.

in particular - right?

The current proposal doesn't do copy-up of directory trees - or even of
directories - which is what I assume the reference to copyup refers to, so
that is a non-issue (i.e. where a tree copy-up might be needed, EXDEV is
returned).

For the [XY][12] issue, let's try to make it a bit more concrete.

Suppose /L is the lower tree, /U is the upper tree and /O is the overlay
mount point.
Suppose further that /U/a/b/c/d/e/f/g exists and (for now) /L/a doesn't.

Then /O/a/b/c/d/e/f/g can appear to 'cover' the above path.

So I open
/U/a/b/c and /U/a/b/c/d/e/f
and
/O/a/b/c and /O/a/b/c/d/e/f

These are Y1 Y2 X1 X2. Currently Xn covers Yn.

To get the situation Al describes we would need to perform some manipulations
in /U e.g.
mv /U/a/b/c/d/e /U/a/e
mv /U/a/b /U/a/e/f/g/b
Now each dir still has the same basename as before and there is only one
child per dir, and the longest path is
/U/a/e/f/g/b/c/d
So now Y1 (/U/../c) is a descendent of Y2 (/U/../f), and the dentries
attached to the 4 open fds haven't seen any local change.

So: is this a problem? It may seem a bit confusing to someone who doesn't
understand what is happening, but we define that as not being a problem (to
avoid confusion: don't change U or L).
The important questions are: Can it cause corruption, and can it cause a
deadlock?

If we walk down the directory tree from either X1 or X2, we will see
c/d or f/g/b/c/d
respectively. Both paths will terminate - no loops. The 'd' below X2 will be
a different dentry than the 'd' above X2 despite the fact that they both
reference the same 'd' under /U.
If we try a 'renameat' - the only thing that I can imagine that would cause
fs corruption, we use the same lock_rename and vfs_rename calls on the /U
filesystem as a direct syscall would, so any attempt to create a disconnected
loop will fail.
It is probable that the upperpath.dentry should be revalidated after getting
the lock to ensure it is still hashed etc, but that would be part of the
revalidation code that I would propose to add. So I fail to see how any
filesystem corruption could happen.

For locking: we take locks in the /U filesystem while holding locks in the /O
overlay. There is a clear ordering here so there should be no room for
deadlock.
If the overlay filesystem were to support overlaying a mount-tree on a
mount-tree rather than just a filesystem on a filesystem I imagine that it
would not be too hard to create some weird loops. However the current
proposal ignores mountpoints and just overlays one filesystem on another, so
the overlay is certain to be separate from, and well-ordered with respect to,
the upper and lower filesystems.

If the long path were on /L rather than /U I don't think anything would be
different.
If the paths were shorter and you managed to directly swap a parent and a
child, overlayfs would be able to notice that during revalidate and - if
necessary due to unpleasant races - would simply return -EBUSY.

It is a fairly key aspect of this proposal that we feel free to return errors
for situations that are too hard. -EXDEV for non-opaque directory renames is
one such case. -EBUSY when racing with changes in the covered filesystems
would be another.

I haven't got revalidation code yet, but when {upper,lower}path.dentry were
non-null, it would check they are still hashed, have the same name as the
main dentry, and have the correct corresponding parent. If they are NULL, we
check that the corresponding parent inode is still NULL.
If not, we repeat union_lookup - or fail or whatever seems appropriate.

Inside any locks that we take on the upper filesystem to perform some
directory manipulations we would repeat the checks (on upperpath only) and
return -EBUSY if the dentries have become 'invalid' since the recent
revalidate.

So I don't see a problem. Maybe I'm not seeing something that Al does, or
maybe the current proposal is sufficiently different from the one Al was
reviewing that the problems he observed simply don't exist.

I've reviewed your recent discussion with J. R. Okajima and it doesn't help me
see any locking issues more clearly. Maybe if you could identify a specific
vfs_* call which could issue lock requests in a dangerous order and is not
protected by the revalidation I described above, that might help me see more
clearly.

Thanks,
NeilBrown

2010-08-31 11:00:58

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Tue, 31 Aug 2010, Neil Brown wrote:
> So: is this a problem? It may seem a bit confusing to someone who doesn't
> understand what is happening, but we define that as not being a problem (to
> avoid confusion: don't change U or L).
> The important questions are: Can it cause corruption, and can it cause a
> deadlock?

No, I don't think this design will do that. So it might be enough
just to document that online modification of upper or lower
filesystems results in undefined behavior.

But to prevent accidental damage, it's prudent (at least by default)
to enforce the no-modification policy.

Why do you think this feature of allowing modification is important?
Lets take some typical use cases:

- live cd: lower layer is hard r/o, upper layer makes no sense to
modify online

- thin client: lower layer is static except upgrades, which need
special tools to support and is done offline, upper layer makes no
sense to modify online

Do you have some cases in mind where it makes at least a little sense
to allow online modification of the underlying filesystems?

Thanks,
Miklos

2010-08-31 11:25:01

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Tue, 31 Aug 2010 13:00:45 +0200
Miklos Szeredi <[email protected]> wrote:

> On Tue, 31 Aug 2010, Neil Brown wrote:
> > So: is this a problem? It may seem a bit confusing to someone who doesn't
> > understand what is happening, but we define that as not being a problem (to
> > avoid confusion: don't change U or L).
> > The important questions are: Can it cause corruption, and can it cause a
> > deadlock?
>
> No, I don't think this design will do that. So it might be enough
> just to document that online modification of upper or lower
> filesystems results in undefined behavior.
>
> But to prevent accidental damage, it's prudent (at least by default)
> to enforce the no-modification policy.
>
> Why do you think this feature of allowing modification is important?
> Lets take some typical use cases:
>
> - live cd: lower layer is hard r/o, upper layer makes no sense to
> modify online
>
> - thin client: lower layer is static except upgrades, which need
> special tools to support and is done offline, upper layer makes no
> sense to modify online
>
> Do you have some cases in mind where it makes at least a little sense
> to allow online modification of the underlying filesystems?

No, I don't have a particular use case in mind that would take advantage of
the layers being directly modifiable. But I know that sys-admins can be very
ingenious and may well come up with something clever.

My point is more that I don't think that is it *possible* to prevent changes
to the underlying filesystem (NFS being the prime example) so if there are
easy steps we can take to make the behaviour of overlayfs more predictable in
those cases, we should.

Further I think that insisting that the underlying filesystems remain
unchangeable is overly restrictive. If I were not allowed to perform an
overlay mount on a read/write lower filesystem, that would make it
significantly harder to explore the possibilities of overlayfs and experiment
with it.

I certainly don't think we should put a lot of work or a lot of code into
making it work "perfectly" in any sense at all. But if there are *easy*
things to do that allow us to avoid some weird behaviours, then I think it is
worth that effort to do it.

NeilBrown

2010-08-31 15:22:43

by Kyle Moffett

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Tue, Aug 31, 2010 at 07:24, Neil Brown <[email protected]> wrote:
> On Tue, 31 Aug 2010 13:00:45 +0200 Miklos Szeredi <[email protected]> wrote:
>> No, I don't think this design will do that.  So it might be enough
>> just to document that online modification of upper or lower
>> filesystems results in undefined behavior.
>>
>> But to prevent accidental damage, it's prudent (at least by default)
>> to enforce the no-modification policy.
>>
>> Why do you think this feature of allowing modification is important?
>> Lets take some typical use cases:
>>
>>  - live cd: lower layer is hard r/o, upper layer makes no sense to
>>    modify online
>>
>>  - thin client: lower layer is static except upgrades, which need
>>    special tools to support and is done offline, upper layer makes no
>>    sense to modify online
>>
>> Do you have some cases in mind where it makes at least a little sense
>> to allow online modification of the underlying filesystems?
>
> No, I don't have a particular use case in mind that would take advantage of
> the layers being directly modifiable.  But I know that sys-admins can be very
> ingenious and may well come up with something clever.
>
> My point is more that I don't think that is it *possible* to prevent changes
> to the underlying filesystem (NFS being the prime example) so if there are
> easy steps we can take to make the behaviour of overlayfs more predictable in
> those cases, we should.

There's certainly already weird behaviors you can cause by regular
filesystem over-mounts on NFS. For example, I have an NFS server that
exports a "/srv/git" directory; if I was to do the following actions
on a client:

# mkdir /srv/git
# mount -t nfs myserver:/srv/git /srv/git
# mkdir /srv/git/mnt
# mount -t ext3 /dev/sda3 /srv/git/mnt

And then from the server I were to:
# rmdir /srv/git/mnt

Terrible terrible things would happen... by which I mean I can no
longer access or unmount that filesystem from the client. That use
case in particular seems to be much worse than your regular unionfs
example even, and it's easily possible today (even by accident).

Cheers,
Kyle Moffett

2010-08-31 19:19:55

by Valerie Aurora

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Mon, Aug 30, 2010 at 02:20:47PM +0200, Miklos Szeredi wrote:
> On Mon, 30 Aug 2010, Neil Brown wrote:
>
> > Val has been following that approach and asking if it is possible to make an
> > NFS filesystem really-truly read-only. i.e. no changes.
> > I don't believe it is.
>
> Perhaps it doesn't matter. The nasty cases can be prevented by just
> disallowing local modification. For the rest NFS will return ESTALE:
> "though luck, why didn't you follow the rules?"

I agree: Ask the server to keep it read-only, but also detect if it
lied to prevent kernel bugs on the client.

Is detecting ESTALE and failing the mount sufficient to detect all
cases of a cached directory being altered? I keep trying to trap an
NFS developer and beat the answer out of him but they usually get hung
up on the impossibility of 100% enforcement of the read-only server
option. (Agreed, impossible, just give the sysadmin a mount option so
that it doesn't happen accidentally.)

-VAL

2010-08-31 19:29:39

by Valerie Aurora

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Mon, Aug 30, 2010 at 12:18:11PM +0200, Miklos Szeredi wrote:
> On Sun, 29 Aug 2010, Neil Brown wrote:
>
> > My comment about set-theory unions being commutative set me thinking. I
> > really don't think "union" is the right name for this thing. There is
> > nothing about it which really fits that proper definition of a union.
> > whiteouts mean that even the list of names in a directory is not the union of
> > the lists of names in the upper and lower directories.
> > "overlay" is a much more accurate name. But union seems to be the name
> > that is most used. I wonder if it is too late to change that.
>
> We could call it overlayfs. People learn new names quickly :)

Union mounts was named "writable overlays" for one release in an
attempt to get away from the "arbitrary union of file systems" idea.
I think it helped, but went back to union mounts since it was more
familiar and made prettier function names.

The config option for union mounts says:

Union mounts allow you to mount a transparent writable layer over a
read-only file system, for example, an ext3 partition on a hard drive
over a CD-ROM root file system image.

-VAL

2010-08-31 20:20:28

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Tue, 2010-08-31 at 15:18 -0400, Valerie Aurora wrote:
> On Mon, Aug 30, 2010 at 02:20:47PM +0200, Miklos Szeredi wrote:
> > On Mon, 30 Aug 2010, Neil Brown wrote:
> >
> > > Val has been following that approach and asking if it is possible to make an
> > > NFS filesystem really-truly read-only. i.e. no changes.
> > > I don't believe it is.
> >
> > Perhaps it doesn't matter. The nasty cases can be prevented by just
> > disallowing local modification. For the rest NFS will return ESTALE:
> > "though luck, why didn't you follow the rules?"
>
> I agree: Ask the server to keep it read-only, but also detect if it
> lied to prevent kernel bugs on the client.
>
> Is detecting ESTALE and failing the mount sufficient to detect all
> cases of a cached directory being altered?

No. Files can be altered without being unlinked.

> I keep trying to trap an
> NFS developer and beat the answer out of him but they usually get hung
> up on the impossibility of 100% enforcement of the read-only server
> option. (Agreed, impossible, just give the sysadmin a mount option so
> that it doesn't happen accidentally.)

Remind me again why mounting the filesystem '-oro' on the server (and
possibly exporting it 'ro') isn't an option?

Trond

2010-08-31 20:37:23

by Valerie Aurora

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Tue, Aug 31, 2010 at 11:05:18AM -0400, Kyle Moffett wrote:
> On Tue, Aug 31, 2010 at 07:24, Neil Brown <[email protected]> wrote:
> > On Tue, 31 Aug 2010 13:00:45 +0200 Miklos Szeredi <[email protected]> wrote:
> >> No, I don't think this design will do that. ??So it might be enough
> >> just to document that online modification of upper or lower
> >> filesystems results in undefined behavior.
> >>
> >> But to prevent accidental damage, it's prudent (at least by default)
> >> to enforce the no-modification policy.
> >>
> >> Why do you think this feature of allowing modification is important?
> >> Lets take some typical use cases:
> >>
> >> ??- live cd: lower layer is hard r/o, upper layer makes no sense to
> >> ?? ??modify online
> >>
> >> ??- thin client: lower layer is static except upgrades, which need
> >> ?? ??special tools to support and is done offline, upper layer makes no
> >> ?? ??sense to modify online
> >>
> >> Do you have some cases in mind where it makes at least a little sense
> >> to allow online modification of the underlying filesystems?
> >
> > No, I don't have a particular use case in mind that would take advantage of
> > the layers being directly modifiable. ??But I know that sys-admins can be very
> > ingenious and may well come up with something clever.
> >
> > My point is more that I don't think that is it *possible* to prevent changes
> > to the underlying filesystem (NFS being the prime example) so if there are
> > easy steps we can take to make the behaviour of overlayfs more predictable in
> > those cases, we should.
>
> There's certainly already weird behaviors you can cause by regular
> filesystem over-mounts on NFS. For example, I have an NFS server that
> exports a "/srv/git" directory; if I was to do the following actions
> on a client:
>
> # mkdir /srv/git
> # mount -t nfs myserver:/srv/git /srv/git
> # mkdir /srv/git/mnt
> # mount -t ext3 /dev/sda3 /srv/git/mnt
>
> And then from the server I were to:
> # rmdir /srv/git/mnt
>
> Terrible terrible things would happen... by which I mean I can no
> longer access or unmount that filesystem from the client. That use
> case in particular seems to be much worse than your regular unionfs
> example even, and it's easily possible today (even by accident).

While this definitely sucks, the concern in this case with unioning
file systems is a deadlock or kernel panic, not just "weird" behavior
or inability to unmount a file system. Although in general I like the
standard for union behavior as "not as bad as NFS." :)

-VAL

2010-09-01 01:56:52

by Valerie Aurora

[permalink] [raw]
Subject: Re: [PATCH 0/5] hybrid union filesystem prototype

On Tue, Aug 31, 2010 at 04:19:47PM -0400, Trond Myklebust wrote:
> On Tue, 2010-08-31 at 15:18 -0400, Valerie Aurora wrote:
> > On Mon, Aug 30, 2010 at 02:20:47PM +0200, Miklos Szeredi wrote:
> > > On Mon, 30 Aug 2010, Neil Brown wrote:
> > >
> > > > Val has been following that approach and asking if it is possible to make an
> > > > NFS filesystem really-truly read-only. i.e. no changes.
> > > > I don't believe it is.
> > >
> > > Perhaps it doesn't matter. The nasty cases can be prevented by just
> > > disallowing local modification. For the rest NFS will return ESTALE:
> > > "though luck, why didn't you follow the rules?"
> >
> > I agree: Ask the server to keep it read-only, but also detect if it
> > lied to prevent kernel bugs on the client.
> >
> > Is detecting ESTALE and failing the mount sufficient to detect all
> > cases of a cached directory being altered?
>
> No. Files can be altered without being unlinked.

Argh. Do generation numbers and/or mtime help us here?

> > I keep trying to trap an
> > NFS developer and beat the answer out of him but they usually get hung
> > up on the impossibility of 100% enforcement of the read-only server
> > option. (Agreed, impossible, just give the sysadmin a mount option so
> > that it doesn't happen accidentally.)
>
> Remind me again why mounting the filesystem '-oro' on the server (and
> possibly exporting it 'ro') isn't an option?

The "hard read only" export option is to, at minimum, make it
difficult to *accidentally* remount the file system on the server as
read-write while it is exported as a union lower layer.

True, the NFS server can mount the file system "-o ro" - and then any
random other mount(8) on the server can come along and "-o
remount,rw". So if we have an NFS server option that uses the new
hard read-only feature, this at least makes the admin go change the
NFS mount options or unexport it before writing to the exported union
lower layer and hosing all the union mount clients.

It's the difference between:

# mount -o remount,rw /exports/union_mount_root
[thousands of union mounted clients hang]

And:

# mount -o remount,rw /exports/union_mount_root
mount: /exports/union_mount_root is hard read-only
# emacs /etc/exports
[edit out union/hard readonly option]
# exportfs -r
[thousands of union mounted clients hang]

But heck, system administration is hard, what's a little more rope?
Here, hold this gun while I position your foot...

-VAL