2011-06-10 06:57:15

by Valerie Aurora

[permalink] [raw]
Subject: Re: Fw: Re: [PATCH 0/7] overlay filesystem: request for inclusion

Andrew Morton <[email protected]> wrote:
> Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion
>
>
> On Thu, 09 Jun 2011 14:47:49 +0200
> Miklos Szeredi <[email protected]> wrote:
>
>> Andrew Morton <[email protected]> writes:
>> > Another issue: there have been numerous attempts at Linux overlay
>> > filesystems from numerous parties. ?Does (or will) this implementation
>> > satisfy all their requirements?
>>
>> Overlayfs aims to be the simplest possible but not simpler.
>>
>> I think the reason why "aufs" never had a real chance at getting merged
>> is because of feature creep.
>>
>> Of course I expect new features to be added to overlayfs after the
>> merge, but I beleive some of the features in those other solutions are
>> simply unnecessary.
>
> This is my main worry. ?If overlayfs doesn't appreciably decrease the
> motivation to merge other unioned filesystems then we might end up with
> two similar-looking things. ?And, I assume, the later and more
> fully-blown implementation might make overlayfs obsolete but by that
> time it will be hard to remove.
>
> So it would be interesting to hear the thoughts of the people who have
> been working on the other implementations.

1. Al Viro and Christoph Hellwig bring up the same locking problems
*every single time* someone proposes a copy-up in-kernel file system,
and *every single time* they are dismissed or hand-waved. I'd like to
see the thread in which one of them says, "Why yes, you have
understood and solved that problem to my satisfaction," before even
considering merging something.

2. Overlayfs is not the simplest possible solution at present. For
example, it currently does not prevent modification of the underlying
file system directories, which is absolutely required to prevent bugs
according to Al. Al proposed a solution he was happy with (read-only
superblocks), I implemented it for union mounts, and I believe it can
be ported to overlayfs. But that should happen *before* merging.

3. To my knowledge, Al is not currently able to reply to email on a
regular basis and Christoph doesn't frequently comment on the subject
these days. Don't take their silence as approval.

I have many more possible comments but I don't think they should be
relevant to the discussion about merging. Al and Christoph's
judgements should be sufficient.

-VAL


2011-06-10 09:00:15

by Alan

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion

> 1. Al Viro and Christoph Hellwig bring up the same locking problems
> *every single time* someone proposes a copy-up in-kernel file system,
> and *every single time* they are dismissed or hand-waved.

Perhaps you can detail the locking problem in question ?

Alan

2011-06-15 11:18:44

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion

Alan Cox <[email protected]> writes:

>> 1. Al Viro and Christoph Hellwig bring up the same locking problems
>> *every single time* someone proposes a copy-up in-kernel file system,
>> and *every single time* they are dismissed or hand-waved.
>
> Perhaps you can detail the locking problem in question ?

Pretty bizarre things can happen when the topology of the underlying
layers change after overlayfs acquired refs to underlying dentries. I
think this is the case Val is talking about.

Example:

# mount -toverlayfs x -oupperdir=/upper,lowerdir=/lower /ovl
# mkdir -p /upper/a/b
# ls /ovl/a/b
# mv /upper/a/b /upper/
# mv /upper/a /upper/b/
# ls /ovl/a/b
a

Apparently "a" became its own ancestor.

Overlayfs is careful not to assume anything about child/parent
relationships of underlying dentries.

For example "rmdir /ovl/a/b" will do the following:

1. find the underlying dentry for "a" -> upper-a
2. lock upper-a
3. find the underlying dentry for "b" -> upper-b
4. verify that upper-b is a child of upper-a
5. remove upper-b

With the above example it will fail on step 4. Changes to the
underlying filesystems are not supported and result in undefined
behavior. But it should never result in BUGs or deadlocks.

The overall locking order is:

-> overlayfs locks
-> upper fs locks
-> lower fs locks

Within each filesystem the usual locking rules apply.

One more difficulty is copy-up. This happens without being protected by
i_mutex on overlayfs. The rules here are:

A. directory renames only succeed if both source and destination are
only on the upper fs (never copied up)
B. non-directory renames start with copy-up of source if necessary
C. copy-up takes i_mutex on upper parent

During copy-up no ancestor will be renamed because of A.

The file being copied up may be moved concurrently, however. If this
happens then copy-up will acquire i_mutex for either the old or the new
upper parent. In the latter case the file has already been copied up.
In the former case the file may or may not have been copied up. The
state of the file is checked after having locked the upper parent and
the copy-up skipped if it has already succeeded.

There may be flaws in the above reasoning or the implementation and
reviews are very welcome.

Thanks,
Miklos

2011-06-15 14:33:05

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion


Miklos Szeredi:
> For example "rmdir /ovl/a/b" will do the following:
>
> 1. find the underlying dentry for "a" -> upper-a
> 2. lock upper-a
> 3. find the underlying dentry for "b" -> upper-b
> 4. verify that upper-b is a child of upper-a
> 5. remove upper-b

It is good to verify in step 4.
Essentially (or ideally) this verification should be equivalent to all
of what VFS does before vfs_rmdir(). I know overlayfs makes the upper
mnt_want_write()-ed in early stage and keeps it. So it might be better
to lookup again (as step 3 and 4) instead of comparing d_parent
simply. If you think it is unnecessary to lookup here, then I'd suggest
you to make it option (choosable by user).

I see ovl_rmdir() does,
- lookup and unlink all whiteouts
- rmdir the target dir
- create a whiteout for the target
Right?
But I am afraid that any error can happen in every step on the upper
dir. And if it happens, then ovl_rmdir() returns the error but the dir
left in incomplete status. It may be one of these.
- some whiteouts are unlinked but others are left
- all whiteouts are gone but the target dir remains
- the target dir is removed but the whiteout is not created
Of course, it is bad and makes users really confused, since it will show
users things which should not be. At the same time, I don't know how
possible it can happen.

Anyway if you have read aufs, then you would know how aufs solves these
problems. I don't think the approaches in aufs is best or one and
only. I just could not get another good idea.


J. R. Okajima

2011-06-15 15:48:37

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion

"J. R. Okajima" <[email protected]> writes:

> Miklos Szeredi:
>> For example "rmdir /ovl/a/b" will do the following:
>>
>> 1. find the underlying dentry for "a" -> upper-a
>> 2. lock upper-a
>> 3. find the underlying dentry for "b" -> upper-b
>> 4. verify that upper-b is a child of upper-a
>> 5. remove upper-b
>
> It is good to verify in step 4.
> Essentially (or ideally) this verification should be equivalent to all
> of what VFS does before vfs_rmdir(). I know overlayfs makes the upper
> mnt_want_write()-ed in early stage and keeps it. So it might be better
> to lookup again (as step 3 and 4) instead of comparing d_parent
> simply. If you think it is unnecessary to lookup here, then I'd suggest
> you to make it option (choosable by user).

The parent verification is only to make sure the locking is correct.
It's not to make sure that modifications of underlying filesystems will
have sane semantics.

Until someone comes up with a sane use case for allowing modification of
underlying filesystem I won't bother with that.

> I see ovl_rmdir() does,
> - lookup and unlink all whiteouts
> - rmdir the target dir
> - create a whiteout for the target
> Right?

Not quite.

- checks if directory is empty (all lower entries are whiteouted)
- marks directory opaque
- unlinks whiteouts
- rmdir
- create whiteout

> But I am afraid that any error can happen in every step on the upper
> dir. And if it happens, then ovl_rmdir() returns the error but the dir
> left in incomplete status. It may be one of these.
> - some whiteouts are unlinked but others are left
> - all whiteouts are gone but the target dir remains
> - the target dir is removed but the whiteout is not created
> Of course, it is bad and makes users really confused, since it will show
> users things which should not be. At the same time, I don't know how
> possible it can happen.

Atomic whiteout and atomic copy-up would be nice, that's one feature I'm
willing to think about.

> Anyway if you have read aufs, then you would know how aufs solves these
> problems. I don't think the approaches in aufs is best or one and
> only. I just could not get another good idea.

Rollback on failure is an incomplete solution, rollback itself can fail.
And it doesn't protect against machine crashing in the middle of
operation.

Thanks,
Miklos

2011-06-15 16:15:08

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion


Miklos Szeredi:
> Rollback on failure is an incomplete solution, rollback itself can fail.
> And it doesn't protect against machine crashing in the middle of
> operation.

Maybe you are right.
But do you think rollback is unnecessary since it is an incomplete
solution?

And you might not have read about the approach in aufs, which tries
reducing the operations in rollback.

(from '[RFC 2/8] Aufs2: structure' in 2009
<http://marc.info/?l=linux-kernel&m=123537453514896&w=2>)
----------------------------------------
In aufs, rmdir(2) and rename(2) for dir uses whiteout alternatively.
In order to make several functions in a single systemcall to be
revertible, aufs adopts an approach to rename a directory to a temporary
unique whiteouted name.
For example, in rename(2) dir where the target dir already existed, aufs
renames the target dir to a temporary unique whiteouted name before the
actual rename on a branch and then handles other actions (make it opaque,
update the attributes, etc). If an error happens in these actions, aufs
simply renames the whiteouted name back and returns an error. If all are
succeeded, aufs registers a function to remove the whiteouted unique
temporary name completely and asynchronously to the system global
workqueue.
----------------------------------------


J. R. Okajima

2011-06-15 17:20:29

by Michal Suchanek

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion

On 15 June 2011 18:14, J. R. Okajima <[email protected]> wrote:
>
> Miklos Szeredi:
>> Rollback on failure is an incomplete solution, rollback itself can fail.
>> And it doesn't protect against machine crashing in the middle of
>> operation.
>
> Maybe you are right.
> But do you think rollback is unnecessary since it is an incomplete
> solution?
>
> And you might not have read about the approach in aufs, which tries
> reducing the operations in rollback.
>
> (from '[RFC 2/8] Aufs2: structure' in 2009
>      <http://marc.info/?l=linux-kernel&m=123537453514896&w=2>)
> ----------------------------------------
> In aufs, rmdir(2) and rename(2) for dir uses whiteout alternatively.
> In order to make several functions in a single systemcall to be
> revertible, aufs adopts an approach to rename a directory to a temporary
> unique whiteouted name.
> For example, in rename(2) dir where the target dir already existed, aufs
> renames the target dir to a temporary unique whiteouted name before the

This is generally not possible in solutions that don't reserve any filenames.

However, it should be possible to create whiteout of a non-existent
entry in a directory while it is locked without affecting userspace.

> actual rename on a branch and then handles other actions (make it opaque,
> update the attributes, etc). If an error happens in these actions, aufs
> simply renames the whiteouted name back and returns an error. If all are
> succeeded, aufs registers a function to remove the whiteouted unique
> temporary name completely and asynchronously to the system global
> workqueue.

Removing the whiteout asynchronously does not seem like a good idea.
It should be gone before the directory containing the whiteout is
unlocked. Otherwise there might be an entry created which conflicts
with this whiteout that did not exist when the operation started. Also
if you unlock the directory while the artifical whiteout exists an
asynchronous process might replace the whiteout and the rollback would
fail.

As an alternative way to perform atomic renames I would suggest
"fallthrough symlinks". If you want to rename an entry which is
"fallthrough" (ie pointing to the entry with the same name in the
lower layer in the same directory) you can replace it with a
"fallthrough symlink" which points to the lower layer and does not
just implicitly say "here" but specifies a path relative to the
mountpoint instead. This can then be moved like any other entry. it is
in no way special anymore. Moving a directory tree which is partially
in the upper layer is still time-consuming but can be performed with
reasonable semantics imho. You perform a preparation step during which
nothing seems to change from the user's point of view and at the very
end you just move the directory.

Thanks

Michal

2011-06-15 18:12:17

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion

Michal Suchanek <[email protected]> writes:

> On 15 June 2011 18:14, J. R. Okajima <[email protected]> wrote:
>> For example, in rename(2) dir where the target dir already existed, aufs
>> renames the target dir to a temporary unique whiteouted name before the
>
> This is generally not possible in solutions that don't reserve any filenames.
>
> However, it should be possible to create whiteout of a non-existent
> entry in a directory while it is locked without affecting userspace.

Yes, creation of whiteout and renaming it to target or vice versa works
if target is non-directory.

Cases where this trick could make operations atomic:

- create/mknod/symlink/link over whiteout
- rename non-directory to whiteout
- remove of non-directory with whiteout creation
- copy up

Cases where atomicity is not possible with this:

- mkdir over whiteout
- rename directory to whiteout
- rename where source needs whiteout
- rmdir with whiteout creation


>> actual rename on a branch and then handles other actions (make it opaque,
>> update the attributes, etc). If an error happens in these actions, aufs
>> simply renames the whiteouted name back and returns an error. If all are
>> succeeded, aufs registers a function to remove the whiteouted unique
>> temporary name completely and asynchronously to the system global
>> workqueue.
>
> Removing the whiteout asynchronously does not seem like a good idea.
> It should be gone before the directory containing the whiteout is
> unlocked. Otherwise there might be an entry created which conflicts
> with this whiteout that did not exist when the operation started. Also
> if you unlock the directory while the artifical whiteout exists an
> asynchronous process might replace the whiteout and the rollback would
> fail.
>
> As an alternative way to perform atomic renames I would suggest
> "fallthrough symlinks". If you want to rename an entry which is
> "fallthrough" (ie pointing to the entry with the same name in the
> lower layer in the same directory) you can replace it with a
> "fallthrough symlink" which points to the lower layer and does not
> just implicitly say "here" but specifies a path relative to the
> mountpoint instead. This can then be moved like any other entry. it is
> in no way special anymore.

This is a nice idea, but doesn't have a lot to do with atomicity. It
allows rename of non-pure upper directory (they return EXDEV currently).

> Moving a directory tree which is partially
> in the upper layer is still time-consuming but can be performed with
> reasonable semantics imho.

Shouldn't be time consuming, really. The upper, mixed directory is
renamed and given a "trusted.overlay.redirect" attribute to show where
its lower directory resides.

Thanks,
Miklos

2011-06-16 02:44:00

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion


Michal Suchanek:
> This is generally not possible in solutions that don't reserve any filename=
> s.
>
> However, it should be possible to create whiteout of a non-existent
> entry in a directory while it is locked without affecting userspace.

Actually aufs generates a doubly whiteouted unique name dynamically for
the target dir. For instance, when rmdir("dirA") aufs does,
- lock i_mutex of the parent dir of dirA on the real fs
- some verifycations for the parent-child relationship
- some tests whether we can do rmdir
- create whiteout for dirA
- rename dirA to .wh..wh.XXXXXXXX (random value in hex), after making
sure the name doesn't exist
- unlock the parent dir
- return to VFS
And then the async workqueue removes the .wh..wh.XXXXXXXX dir with some
whiteouts under it.

It means the temporary whiteout name is,
- always unique
- always hidden (from users), even if it remains accidentally
So even if an error happens in the async work, it doesn't matter.

Additionally there is a userspace script called "auchk" which is like
fsck for real fs. auchk script checks the logical consistency on the
(writable) real fs, and removes the illegal whiteouts, remained
pseudo-links, and remained temp files.


> As an alternative way to perform atomic renames I would suggest
> "fallthrough symlinks". If you want to rename an entry which is

Symlink?
Is it a different thing from DCACHE_FALLTHRU in UnionMount?
I am afraid a special symlink is fragile or dangerous.
Its special meaning is valid in inner union world only, is it? If
something in outer world gets changed, we may not follow the symlink
anymore or follow something different unexpectedly. Is it acceptable?


J. R. Okajima

2011-06-16 10:35:45

by Michal Suchanek

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion

On 16 June 2011 04:43, J. R. Okajima <[email protected]> wrote:
>
> Michal Suchanek:
>> This is generally not possible in solutions that don't reserve any filename=
>> s.
>>
>> However, it should be possible to create whiteout of a non-existent
>> entry in a directory while it is locked without affecting userspace.
>
> Actually aufs generates a doubly whiteouted unique name dynamically for
> the target dir. For instance, when rmdir("dirA") aufs does,
> - lock i_mutex of the parent dir of dirA on the real fs
> - some verifycations for the parent-child relationship
> - some tests whether we can do rmdir
> - create whiteout for dirA
> - rename dirA to .wh..wh.XXXXXXXX (random value in hex), after making

Probably swap the two above, you can't make a whiteout in presence of
the directory, right?
Anyway, you could just mark dirA as whiteout and remove any whiteouts
contained in it asynchronously, and only jump through these hoops when
trying to create a new entry in place of non-empty whiteout, or sync
on emptying the old whiteout before making a new entry.

>  sure the name doesn't exist
> - unlock the parent dir
> - return to VFS
> And then the async workqueue removes the .wh..wh.XXXXXXXX dir with some
> whiteouts under it.
>
> It means the temporary whiteout name is,
> - always unique
> - always hidden (from users), even if it remains accidentally
> So even if an error happens in the async work, it doesn't matter.

Yes, it can only cause pollution with whiteouts unrelated to any files
that ever existed which is not too much of an issue unless people want
to add random stuff to the lower layer and see it in the union when
they reconstruct it again.

>
> Additionally there is a userspace script called "auchk" which is like
> fsck for real fs. auchk script checks the logical consistency on the
> (writable) real fs, and removes the illegal whiteouts, remained
> pseudo-links, and remained temp files.
>
>
>> As an alternative way to perform atomic renames I would suggest
>> "fallthrough symlinks". If you want to rename an entry which is
>
> Symlink?
> Is it a different thing from DCACHE_FALLTHRU in UnionMount?

Yes, the fallthru in unionmount only says "look below here", it cannot
point to a different place in the filesystem.

> I am afraid a special symlink is fragile or dangerous.
> Its special meaning is valid in inner union world only, is it? If

It is only valid when in the upper layer of a union. However, so is
whiteout, and so are files that were visible in the union but are not
visible in the top layer if examined separately, outside of the union.

It must be accepted that the top layer is different from the union,
otherwise you want a copy, not a union.

> something in outer world gets changed, we may not follow the symlink
> anymore or follow something different unexpectedly. Is it acceptable?

That' the whole idea behind symlinks, and also unions which implicitly
link the lower layer into the upper to present the result as a single
directory tree.

Anyway, the motivation behind the "fallthru symlink" is that you need
not copy-up on seemingly trivial operations like rename, touch, etc.
which both makes them more efficient and easier to get atomic. As I
understand it copy-up is the operation that causes the most issues and
with "fallthru symlinks" you need it only for operations that are
expected to modify something non-trivially.

Obviously, this is not so nice for zero sized files but they should be
handled the same way for consistency I guess. Also metada that can be
conveniently recorded on the fallthru entry would make touch fast but
would hide possible later updates to the lower layer so it might be
not good solution for all use cases. For throwaway tmpfs, however, any
optimization counts.

Seriously, the overlayfs documents that it can have opaque directories
but I don't see what they would be used for. There is no way to turn a
directory opaque with normal userspace operation afaict.
It has no explicit fallthrus, at least not documented so to have any
level of consistency it should always check the lower layer because it
can grow some new directories when the union is deconstructed, offline
modified, and reconstructed (which is supported use case according to
the docs).

Thanks

Michal

2011-06-16 15:15:59

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion


Michal Suchanek:
> Probably swap the two above, you can't make a whiteout in presence of
> the directory, right?
> Anyway, you could just mark dirA as whiteout and remove any whiteouts
> contained in it asynchronously, and only jump through these hoops when
> trying to create a new entry in place of non-empty whiteout, or sync
> on emptying the old whiteout before making a new entry.

Unfortunately I cannot understand what you wrote.

First, the order of
> - create whiteout for dirA
> - rename dirA to .wh..wh.XXXXXXXX
is correct and I think it should be, in order to make a little help for
fsck/auchk.
And what is "non-empty whiteout" and "emptying the old whiteout"?
The whiteout is a size zero-ed and hardlinked regular file in aufs.


> Yes, it can only cause pollution with whiteouts unrelated to any files
> that ever existed which is not too much of an issue unless people want
> to add random stuff to the lower layer and see it in the union when
> they reconstruct it again.

??
Do you think that the .wh..wh.XXXXXXXX hides something on the lower
layer? If so, it is wrong. Such doubly whiteout hides nothing except
itself.


> It is only valid when in the upper layer of a union. However, so is
> whiteout, and so are files that were visible in the union but are not
> visible in the top layer if examined separately, outside of the union.

Do you mean that your special symlink has totally different file-type
from a symlink?
Anyway what I want to say is, what such symlink refers may differ
from what users originally expect. But I may misunderstand what you call
"fallthru symlink".


J. R. Okajima

2011-06-17 07:39:00

by Michal Suchanek

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion

On 16 June 2011 17:15, J. R. Okajima <[email protected]> wrote:
>
> Michal Suchanek:
>> Probably swap the two above, you can't make a whiteout in presence of
>> the directory, right?
>> Anyway, you could just mark dirA as whiteout and remove any whiteouts
>> contained in it asynchronously, and only jump through these hoops when
>> trying to create a new entry in place of non-empty whiteout, or sync
>> on emptying the old whiteout before making a new entry.
>
> Unfortunately I cannot understand what you wrote.
>
> First, the order of
>> - create whiteout for dirA
>> - rename dirA to .wh..wh.XXXXXXXX
> is correct and I think it should be, in order to make a little help for

Yes, it's correct for aufs which uses reserved file names for whiteouts.

Filesystems that don't reserve filenames cannot make whiteout for an
existing entry but aufs can.

> fsck/auchk.
> And what is "non-empty whiteout" and "emptying the old whiteout"?
> The whiteout is a size zero-ed and hardlinked regular file in aufs.

Is there any reason why a directory cannot be whiteout?

>
>
>> Yes, it can only cause pollution with whiteouts unrelated to any files
>> that ever existed which is not too much of an issue unless people want
>> to add random stuff to the lower layer and see it in the union when
>> they reconstruct it again.
>
> ??
> Do you think that the .wh..wh.XXXXXXXX hides something on the lower
> layer? If so, it is wrong. Such doubly whiteout hides nothing except
> itself.

It may possibly hide a XXXXXXXX file if it is later added to the lower layer.

But if .wh.XXXXXXXX is in itself a reserved filename that is never
brought up from the lower layer then this is a non-issue, it works
consistently regardless of existence of the superfluous whiteout.

>
>
>> It is only valid when in the upper layer of a union. However, so is
>> whiteout, and so are files that were visible in the union but are not
>> visible in the top layer if examined separately, outside of the union.
>
> Do you mean that your special symlink has totally different file-type
> from a symlink?

Just as whiteout has totally different file-type from a file. It's
specific to the union.

> Anyway what I want to say is, what such symlink refers may differ
> from what users originally expect. But I may misunderstand what you call
> "fallthru symlink".

How is this different from other files that are taken from the lower
layer and not copied into the upper layer?

If you are concerned about that you want a full copy, not a union.

Thanks

Michal

2011-06-20 00:43:59

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/7] overlay filesystem: request for inclusion


Michal Suchanek:
> Is there any reason why a directory cannot be whiteout?

Just to reduce consuming inodes.


> It may possibly hide a XXXXXXXX file if it is later added to the lower layer.

No, because it is "doubly" whiteouted.


> Just as whiteout has totally different file-type from a file. It's
> specific to the union.

Ok, we are talking about different whiteouts.


J. R. Okajima