2008-06-01 04:01:55

by Phillip Lougher

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

Arnd Bergmann wrote:
> On Saturday 31 May 2008, David Newall wrote:
>> I don't agree that it is nicer to do this in cramfs. I prefer the
>> technique of union of a tmpfs over some other fs because a single
>> solution that works with all filesystems is better than re-implementing
>> the same idea in multiple filesystems. Multiple implementations is a
>> recipe for bugs and feature mismatch.
>
> You're right in principle, but unfortunately there is to date no working
> implementation of union mounts. Giving users the option of using an
> existing file system with a few tweaks can only be better than
> forcing them to use hacks like unionfs.
>

I tend to agree with Arnd Bergmann. While I prefer the aesthetic
cleanliness of stackable filesystems, the lack of proper stacking
support in the Linux VFS makes other techniques necessary. Unionfs is
complex and for many embedded systems with constrained resources Unionfs
adds a lot of extra overhead.

If I read the patches correctly, when a file page is written to, only
that page gets copied into the page cache and locked, the other pages
continue to be read off disk from cramfs? With Unionfs a page write
causes the entire file to be copied up to the r/w tmpfs and locked into
the page cache causing unnecessary RAM overhead.

Phillip
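
The contrast Phillip draws can be made concrete with a toy model. This is only a sketch: the 4 KiB page size and both function names are assumptions for illustration, and nothing here is taken from the cramfs patches or from Unionfs.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def pages_in(file_size):
    """Number of pages needed to hold file_size bytes."""
    return (file_size + PAGE_SIZE - 1) // PAGE_SIZE


def ram_pages_page_cow(file_size, dirty_offsets):
    """Page-granular copy-on-write: only the touched pages stay in RAM."""
    touched = {off // PAGE_SIZE for off in dirty_offsets if 0 <= off < file_size}
    return len(touched)


def ram_pages_whole_file_copyup(file_size, dirty_offsets):
    """Whole-file copy-up to a RAM-backed fs: any write pulls in every page."""
    if any(0 <= off < file_size for off in dirty_offsets):
        return pages_in(file_size)
    return 0


size = 10 * 1024 * 1024        # a 10 MiB file
writes = [0, 100, 5_000_000]   # three small writes that touch two pages
print(ram_pages_page_cow(size, writes))           # 2
print(ram_pages_whole_file_copyup(size, writes))  # 2560
```

The model ignores whether the copied pages are pinned or merely cached, which is a point the thread goes on to debate.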


2008-06-01 08:52:34

by Arnd Bergmann

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

On Sunday 01 June 2008, Phillip Lougher wrote:
> If I read the patches correctly, when a file page is written to, only
> that page gets copied into the page cache and locked, the other pages
> continue to be read off disk from cramfs? With Unionfs a page write
> causes the entire file to be copied up to the r/w tmpfs and locked into
> the page cache causing unnecessary RAM overhead.

Yes, that's right.

Arnd <><

2008-06-01 12:28:45

by Jamie Lokier

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

Phillip Lougher wrote:
> If I read the patches correctly, when a file page is written to, only
> that page gets copied into the page cache and locked, the other pages
> continue to be read off disk from cramfs? With Unionfs a page write
> causes the entire file to be copied up to the r/w tmpfs and locked into
> the page cache causing unnecessary RAM overhead.

Ok, so why not fix that in unionfs? An option so that holes in the
overlay file let through data from the underlying file sounds like it
would be generally useful, and quite easy to implement.

If not unionfs, a "union-tmpfs" combination would be good. Many
filesystems aren't well suited to being the overlay filesystem -
adding to the implementation's complexity - but a modified tmpfs could
be very well suited.

-- Jamie

2008-06-01 21:50:03

by Arnd Bergmann

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

On Sunday 01 June 2008, Jamie Lokier wrote:
> Ok, so why not fix that in unionfs? An option so that holes in the
> overlay file let through data from the underlying file sounds like it
> would be generally useful, and quite easy to implement.

I can imagine a lot of unexpected effects with that. Think of e.g.
someone replacing the underlying file with a new one. Then enlarge
the file using truncate() and read from it -- suddenly you see
the old contents instead of zeroes. Probably fixable as well, but
certainly not in a nice way.
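
The truncate() scenario can be reproduced in a toy model. This is a sketch under stated assumptions: the class, its method names, and the one-object-per-page layout are all hypothetical, and it models the proposed "holes let lower data through" option rather than anything unionfs actually does.

```python
class HolePassThroughFile:
    """Toy overlay file: pages absent from `upper` are read from `lower`."""

    def __init__(self, lower_pages):
        self.lower = list(lower_pages)  # read-only underlying file, one item per page
        self.upper = {}                 # page index -> data written in the overlay
        self.size = len(self.lower)     # logical size, in pages

    def replace_contents(self, pages):
        """The user replaces the file with new (possibly shorter) contents."""
        self.upper = dict(enumerate(pages))
        self.size = len(pages)

    def truncate(self, new_size):
        """Grow or shrink the file; newly exposed pages are overlay holes."""
        self.upper = {i: p for i, p in self.upper.items() if i < new_size}
        self.size = new_size

    def read_page(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        if index in self.upper:
            return self.upper[index]
        if index < len(self.lower):
            return self.lower[index]    # the hole lets the old data through
        return b"\0"


f = HolePassThroughFile([b"old0", b"old1", b"old2"])
f.replace_contents([b"new0"])
f.truncate(3)              # enlarge the file again
print(f.read_page(1))      # b'old1', where truncate() semantics call for zeroes
```

Fixing this in the model would need an extra record of which holes are "real" zeroes, which is the kind of added state the reply calls not nice.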

Besides, there are many more problems with unionfs, which have
all been mentioned in the previous review cycles. Aufs doesn't
address those either AFAIK, with the exception of at least
not making additional copies in the page cache when writing to
a file.

The real solution, of course, is VFS-based union mounts (think
'mount --union -t tmpfs none /'), but the patches for that
are not stable enough for inclusion in mainline yet.

> If not unionfs, a "union-tmpfs" combination would be good. Many
> filesystems aren't well suited to being the overlay filesystem -
> adding to the implementation's complexity - but a modified tmpfs could
> be very well suited.

Yes, that is similar to one of my earlier ideas as well. Christoph
managed to convince me that it's not as easy as I thought, though
I can't remember the exact arguments any more. I'll try to think
about that some more.

One of the problems is certainly the complexity involved in tmpfs
to start with, which is the reason I based the code on ramfs instead.

Arnd <><

2008-06-02 02:49:15

by J. R. Okajima

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support


Arnd Bergmann:
> Besides, there are many more problems with unionfs, which have
> all been mentioned in the previous review cycles. Aufs doesn't
> address those either AFAIK, with the exception of at least
> not making additional copies in the page cache when writing to
> a file.

Hello Arnd,

While I don't have particular objection to your idea and approach to
cramfs, I'd point out that modern LiveCDs tend to save their
modifications to disk.
And AUFS did address all known problems. If anything is left, please
let me know.


Junjiro Okajima

2008-06-02 03:26:31

by Erez Zadok

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

Arnd Bergmann:
> Besides, there are many more problems with unionfs, which have
> all been mentioned in the previous review cycles. Aufs doesn't
> address those either AFAIK, with the exception of at least
> not making additional copies in the page cache when writing to
> a file.

Correction: Unionfs doesn't make additional copies in the page cache.

Arnd, I favor a more generic approach, one that will work with the vast
majority of file systems that people use w/ unioning, preferably all of
them. Supporting copy-on-write in cramfs will only help a small subset of
users. Yes, it might be simple, but I fear it won't be useful enough to
convince existing users of unioning to switch over. And I don't think we
should add CoW support in every file system -- the complexity will be much
more than using unionfs or some other VFS-based solution.

I can see some advantages (re: cache coherency) by hacking CoW support
directly into a f/s. If you want to use a filesystem-specific solution,
then I suggest you don't modify a file system used as a source in a union,
but one used as a destination. You'll have better coverage that way. The
vast majority of times, unionfs users will either write to tmpfs or ext2;
but the source readonly f/s can be a lot of different ones (most popular are
ext*, nfs*, isofs, and cramfs/squashfs).

I find it somewhat ironic to hear the argument that "union mounts isn't
stable yet, so let's come up with a new solution inside cramfs." Why should
your solution become stable much faster than union mounts (which also had
patches floating around for a long time already)?

If you have cycles to spare, why not help Bharata and Jan?

Cheers,
Erez.

2008-06-02 03:51:39

by Erez Zadok

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support


> Jamie Lokier wrote:
> > Phillip Lougher wrote:
> > If I read the patches correctly, when a file page is written to, only
> > that page gets copied into the page cache and locked, the other pages
> > continue to be read off disk from cramfs? With Unionfs a page write
> > causes the entire file to be copied up to the r/w tmpfs and locked into
> > the page cache causing unnecessary RAM overhead.

Yes, unionfs does copyup whole files, but it doesn't lock the entire file
into the page cache. But I agree, that copying up large files to a tmpfs
partition adds more memory pressure, at least temporarily (until pdflush
kicks in).

> Ok, so why not fix that in unionfs? An option so that holes in the
> overlay file let through data from the underlying file sounds like it
> would be generally useful, and quite easy to implement.

If I understand you right, you want to copyup one page at a time, right?
That's not nearly as easy as one might imagine. First, you can't do it on
file systems which don't support holes. Second, holes are a file-system-specific
implementation issue, and the knowledge of holes, AFAIC, is hidden
from the VFS (IIRC, FreeBSD has a specific "zfod" page flag, which is turned
on when the VM has a page that came out of a f/s hole).

You'll need a way to tell if a given page was copied up or not, and
distinguish b/t pages which are naturally filled with zeros vs. those which
came from f/s holes.

Copyup is also providing persistency: you can copyup to a persistent f/s
such as ext2. So you'll need a bitmap or some sort of record that will
survive file system remount and system reboot; such a bitmap will have to
tell which pages of a file have been copied up or not.
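
A per-file record of this kind could look roughly like the following. It is a minimal sketch, assuming one bit per page and a flat byte string as the on-disk form; the class name and its methods are hypothetical and do not correspond to any unionfs structure.

```python
class CopyupBitmap:
    """One bit per page: 1 means the page lives in the writable layer."""

    def __init__(self, npages, raw=None):
        self.npages = npages
        nbytes = (npages + 7) // 8
        self.bits = bytearray(raw) if raw is not None else bytearray(nbytes)
        if len(self.bits) != nbytes:
            raise ValueError("bitmap size does not match page count")

    def _check(self, page):
        if not 0 <= page < self.npages:
            raise IndexError(page)

    def mark_copied(self, page):
        self._check(page)
        self.bits[page // 8] |= 1 << (page % 8)

    def is_copied(self, page):
        self._check(page)
        return bool(self.bits[page // 8] & (1 << (page % 8)))

    def to_bytes(self):
        """Serialized form that could be stored next to the copied-up file."""
        return bytes(self.bits)


bm = CopyupBitmap(20)
bm.mark_copied(3)
bm.mark_copied(17)
saved = bm.to_bytes()                    # would go to persistent storage
restored = CopyupBitmap(20, raw=saved)   # e.g. after a remount or reboot
print(restored.is_copied(3), restored.is_copied(4))  # True False
```

Because the bit is set on copy-up rather than derived from page contents, a copied-up page that happens to be all zeroes is still distinguishable from a hole.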

I'm not saying it's not possible, but it's harder to do this page-wise caching at a
stackable layer than inside a native f/s such as ext2. Now, if there was a
generic VFS op that allowed me to query a file system whether a page in a
given file is a hole or not, then unionfs would be able to do page-wise
copyup easily.

Frankly, I think something like support for a copied-up file, page-by-page,
should probably be supported by a block layer virtual driver (this might be
easier in a BSD-like geom layer.)

BTW, I believe FSCache has page-wise caching, right? Caching is a
copy-on-read operation, and it doesn't take much to make it cache (read:
copy) on writes. So FScache might be a good starting point for such an
effort.

> If not unionfs, a "union-tmpfs" combination would be good. Many
> filesystems aren't well suited to being the overlay filesystem -
> adding to the implementation's complexity - but a modified tmpfs could
> be very well suited.

I think a union-tmpfs is a better solution than a cramfs-specific one, b/c
at least with union-tmpfs, many more users could use it. Even if you
restrict yourself to using tmpfs as the r-w layer, and read-only from just
one other source f/s, that still will cover a large portion of unioning
users.

> -- Jamie

Cheers,
Erez.

2008-06-02 04:38:16

by Erez Zadok

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

> Jan Engelhardt wrote:
> > On Sunday 2008-06-01 08:02, David Newall wrote:
> >>
> >>> I prefer the technique of union of a tmpfs over some other fs
> >>
> >> You're right in principle, but unfortunately there is to date no working
> >> implementation of union mounts. Giving users the option of using an
> >> existing file system with a few tweaks can only be better than
> >> forcing them to use hacks like unionfs.
> >
> >I've not used unionfs (nor aufs) so I'm not aware of its foibles, but I
> >can say that it's the right kind of solution. Rather than spend effort
> >implementing write support for read-only filesystems, why not put your
> >time into fixing whatever you see wrong with one or both of those?
>
> I have to join in. Unionfs and AUFS may be bigger in bytes than the
> embedded developer wants to sacrifice, but that is what it takes for
> a solid implementation that has to deal with things like NFS and
> mmap. Even so, there is a fs called mini_fo you can try using if
> you disagree with the size of unionfs/aufs, at the cost of not having
> support for all corner cases.

I agree w/ Jan E.

Folks, I've said it before: unioning is a deceptively simple idea in
principle, and &^@%*$&^@ hard in practice. And anyone who thinks otherwise
is welcome to write a *versatile* unioning implementation on their own. Once
you get through all corner cases and satisfy all the features which users
want, you have a complex large file system.

I believe that implementing unioning inside actual filesystems is totally the
wrong direction: going to lower layers is wrong, instead of going up to a
VFS-based solution. Unioning is a namespace operation that should not be
done deep inside a lower f/s.

People often wonder why FScache is (reportedly) so complex and big. It's
b/c in some part it has to deal with similar issues: unioning is
copy-on-write, whereas caching is copy-on-read.

Nevertheless, I can understand if the embedded community wants lightweight
unioning. Union Mounts initially may not support everything that unionfs
does, but it should be smaller, and it should be enough I believe for the
basic unioning uses --- perhaps even for the embedded community. If so,
then I suggest people offer to help Bharata and Jan Blunk's efforts, rather
than [sic] cramming unioning into a single file system.

Erez.

2008-06-02 06:07:25

by Bharata B Rao

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

On Mon, Jun 2, 2008 at 10:07 AM, Erez Zadok wrote:
>
> Nevertheless, I can understand if the embedded community wants lightweight
> unioning. Union Mounts initially may not support everything that unionfs
> does, but it should be smaller, and it should be enough I believe for the
> basic unioning uses --- perhaps even for the embedded community. If so,
> then I suggest people offer to help Bharata and Jan Blunk's efforts, rather
> than [sic] cramming unioning into a single file system.
>

Though the Union Mount effort has become slow and silent lately, some of
us are still working on it. While I have worked on readdir support lately,
Jan Blunck and David Woodhouse are working on generic
whiteout support for Linux.

Talking about help, the Union Mount effort could use some generous help in
getting the directory listing implementation right. We first tried to
handle duplicate elimination (during readdir) inside the kernel
entirely. The outcome was neither clean nor efficient.
(http://lkml.org/lkml/2007/12/5/147). Then there was a suggestion to
push the duplicate elimination to userspace. When that was tried out
(http://lkml.org/lkml/2008/4/29/248), we were told that NFS support is
going to be an issue. (BTW NFS support is going to be an issue
irrespective of where directory listing is implemented: kernel or
userspace). Some insights into feasibility of supporting NFS with
Union Mount from people who understand NFS better would be very
helpful.
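
For readers unfamiliar with the problem, duplicate elimination during readdir amounts to something like the sketch below. It is an illustration only: the function name, the (names, whiteouts) layer representation, and the in-memory sets are assumptions, not the Union Mount implementation.

```python
def union_readdir(layers):
    """Merge directory listings, topmost layer first.

    Each layer is a (names, whiteouts) pair: `names` are the entries present
    in that layer, `whiteouts` are names that layer hides in lower layers.
    """
    seen = set()     # names already emitted (duplicate elimination)
    hidden = set()   # names whited out by a higher layer
    result = []
    for names, whiteouts in layers:
        for name in names:
            if name not in seen and name not in hidden:
                seen.add(name)
                result.append(name)
        # A layer's whiteouts apply only to the layers below it.
        hidden |= set(whiteouts)
    return result


top = (["etc", "tmp"], ["var"])
bottom = (["etc", "usr", "var"], [])
print(union_readdir([top, bottom]))   # ['etc', 'tmp', 'usr']
```

The `seen` set grows with the number of directory entries, so whoever performs the merge, kernel or userspace, has to hold that state across readdir calls.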

Regards,
Bharata.
--
http://bharata.sulekha.com/blog/posts.htm

2008-06-02 07:13:33

by Arnd Bergmann

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

On Monday 02 June 2008, J. R. Okajima wrote:
> While I don't have particular objection to your idea and approach to
> cramfs, I'd point out that modern LiveCDs tend to save their
> modifications to disk.

Sure, and I wasn't trying to address those of course. I have a rather
specific setup in mind myself, and I figured the same would be useful
for others as well, while we are waiting for a generic union mount
implementation in the mainline kernel.

> And AUFS did address all known problems. If anything is left, please
> let me know.

Ok, I'm sorry for not having looked at it myself. I saw an older version
and assumed it was not going to improve much. I'll have another look
when I find the time. Unionfs was suffering from severe feature creep
(multiple writable branches, runtime branch modification), and aufs
seemed to add even more features instead of removing them.

Without reading either again, the top problems in unionfs at the time were:
* data inconsistency problems when simultaneously accessing the underlying
  fs and the union.
* duplication of dentry and inode data structures in the union wastes
  memory and cpu cycles.
* whiteouts are in the same namespace as regular files, so conflicts are
  possible.
* mounting a large number of aufs on top of each other eventually
  overflows the kernel stack, e.g. in readdir.
* allowing multiple writable branches (instead of just stacking
  one rw copy on a number of ro file systems) is confusing to the user
  and complicates the implementation a lot.

With the exception of the last two, I assumed that these were all
unfixable with a file system based approach (including the hypothetical
union-tmpfs). If you have addressed them, how?

Arnd <><

2008-06-02 07:18:18

by Jan Engelhardt

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support


On Monday 2008-06-02 06:37, Erez Zadok wrote:
>> Jan Engelhardt wrote:
>> > On Sunday 2008-06-01 08:02, David Newall wrote:
>> >>
>> >>> I prefer the technique of union of a tmpfs over some other fs
>> >>
>> >> You're right in principle, but unfortunately there is to date no working
>> >> implementation of union mounts. Giving users the option of using an
>> >> existing file system with a few tweaks can only be better than
>> >> forcing them to use hacks like unionfs.

>Folks, I've said it before: unioning is a deceptively simple idea in
>principle, and &^@%*$&^@ hard in practice. And anyone who thinks otherwise
>is welcome to write a *versatile* unioning implementation on their own. Once
>you get through all corner cases and satisfy all the features which users
>want, you have a complex large file system.
>[...]

To the original posters:

I urge those who do believe {au,union}fs is too fat to go and build
their unioning into their on-disk filesystems, then let users run it
(remark: iff you can convince (or force) them why they should not be
using existing fs), let users report issues and iron it out for
perhaps 2-3 years, and then see how much your implementation has
grown. That is, if you actually added code (see remark 1).

About last year (June 2007), SLAX sought a solution that enhances
VFAT with UNIX permissions -- much like the old umsdosfs. A kernel
solution was initially preferred by Tomas (SLAX developer), yet I
(who got to write posixovl then) went for FUSE. It was about 20 KB
when it was moderately usable. The end result? Posixovl is a 46 KB C
file today. For userspace code. I bet it would be much more if it was
in-kernel.

Take that as a hint when developing your fs-specific unioning.

2008-06-02 07:52:10

by Arnd Bergmann

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

On Monday 02 June 2008, Erez Zadok wrote:
> Correction: Unionfs doesn't make additional copies in the page cache.

Ok, I must have misunderstood something there. Sorry about that.

> Arnd, I favor a more generic approach, one that will work with the vast
> majority of file systems that people use w/ unioning, preferably all of
> them. Supporting copy-on-write in cramfs will only help a small subset of
> users. Yes, it might be simple, but I fear it won't be useful enough to
> convince existing users of unioning to switch over. And I don't think we
> should add CoW support in every file system -- the complexity will be much
> more than using unionfs or some other VFS-based solution.

My idea was to have it in cramfs, squashfs and iso9660 at most, I agree
that doing it in even a single writable file system would add far too
much complexity. I did not mean to start a fundamental discussion about
how to do it the right way, just noticed that there are half a dozen
implementations that have been around for years without getting close to
inclusion in the mainline kernel, while a much simpler approach gives
you sane semantics for a subset of users.

> I can see some advantages (re: cache coherency) by hacking CoW support
> directly into a f/s. If you want to use a filesystem-specific solution,
> then I suggest you don't modify a file system used as a source in a union,
> but one used as a destination. You'll have better coverage that way. The
> vast majority of times, unionfs users will either write to tmpfs or ext2;
> but the source readonly f/s can be a lot of different ones (most popular are
> ext*, nfs*, isofs, and cramfs/squashfs).

Yes, that absolutely makes sense. I don't care much about a persistent
storage for the overlay, so tmpfs (if not ramfs) should be the only place
to do it in. It does introduce some of the same old problems though,
because you could still write to a bind mounted copy of the underlying
file system (unlike cramfs, which is guaranteed to be read-only), which
forces you to either do a full copy-up, or can result in inconsistent
file contents. Also, stacking multiple union-tmpfs copies on top of each
other would be hard to do without the potential to overflow the kernel
stack.

I'll probably try implementing a '-o union' option for tmpfs anyway, just
to see how hard it is and what the problems are.

> I find it somewhat ironic to hear the argument that "union mounts isn't
> stable yet, so let's come up with a new solution inside cramfs." Why should
> your solution become stable much faster than union mounts (which also had
> patches floating around for a long time already)?

Because the patches are not trying to solve any of the hard problems at all:
Persistent storage of overlays, readdir traversal through more than two
layers, stable inode numbers, opening a file through two different overlays,
copyup, and so on. I'm sure you know more about these problems than I do,
but as long as I don't have to care about them, I don't see a problem
with my patches (other than the bugs I already described).

> If you have cycles to spare, why not help Bharata and Jan?

I spent a lot of time on discussing the initial implementation with Jan
years ago, and will keep reviewing their patches, but I have neither the
time nor the brains to really contribute much to them. As you mentioned
in your reply to Jan E., it's on an entirely different scale than doing
a small hack to cramfs or tmpfs.

Arnd <><

2008-06-02 10:37:15

by J. R. Okajima

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support


Arnd Bergmann:
> Without reading either again, the top problems in unionfs at the time were:
> * data inconsistency problems when simultaneously accessing the underlying
> fs and the union.
> * duplication of dentry and inode data structures in the union wastes
> memory and cpu cycles.
> * whiteouts are in the same namespace as regular files, so conflicts are
> possible.
> * mounting a large number of aufs on top of each other eventually
> overflows the kernel stack, e.g. in readdir.
> * allowing multiple writable branches (instead of just stacking
> one rw copy on a number of ro file systems) is confusing to the user
> and complicates the implementation a lot.
>
> With the exception of the last two, I assumed that these were all
> unfixable with a file system based approach (including the hypothetical
> union-tmpfs). If you have addressed them, how?

I will try to explain each point individually.
Here is what I implemented in AUFS.
Any comments are welcome.

> * data inconsistency problems when simultaneously accessing the underlying
> fs and the union.
Aufs has three levels of detecting direct access to the lower
(branch) filesystems (i.e. bypassing aufs). I guess the strictest level
is a good answer to your question. It is based on the inotify
feature. Aufs sets an inotify watch on every accessed directory on the lower
fs. While those inodes are cached, aufs receives the inotify events for
their children/files and marks the aufs data for the file as
obsolete. When the file is accessed later, aufs retrieves the latest
inode (or dentry) again.
The inotify watch is removed when the aufs dir inode is discarded
from cache.
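
The invalidate-then-refetch pattern described here can be sketched as follows. This is a toy model, not aufs code: the class, the `on_lower_event` callback, and the dictionary standing in for the lower filesystem are all hypothetical, and the real mechanism lives in the kernel.

```python
class AttrCache:
    """Toy cache of lower-layer attributes, invalidated by change events."""

    def __init__(self, fetch_lower):
        self.fetch_lower = fetch_lower  # callable: name -> attributes on the lower fs
        self.cache = {}                 # name -> cached attributes
        self.stale = set()              # names flagged by a change event

    def on_lower_event(self, name):
        """Called when the watch on the parent directory reports a change."""
        if name in self.cache:
            self.stale.add(name)

    def lookup(self, name):
        if name not in self.cache or name in self.stale:
            self.cache[name] = self.fetch_lower(name)  # re-read from the lower fs
            self.stale.discard(name)
        return self.cache[name]


lower = {"motd": {"size": 10}}
cache = AttrCache(lambda name: dict(lower[name]))
print(cache.lookup("motd"))    # {'size': 10}
lower["motd"]["size"] = 42     # someone writes to the lower fs directly
print(cache.lookup("motd"))    # {'size': 10}, stale until an event arrives
cache.on_lower_event("motd")
print(cache.lookup("motd"))    # {'size': 42}
```

The middle lookup shows the window in which the union and the lower filesystem disagree, which is the inconsistency the watch is meant to close.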


> * duplication of dentry and inode data structures in the union wastes
> memory and cpu cycles.

Aufs has its own dentry and inode object as normal fs has. And they have
pointers to the corresponding ones on the lower fs. If you make a union
from two real filesystems, then aufs inode will have (at most) two
pointers as its private data.
Do you mean having pointers is a duplication?


> * whiteouts are in the same namespace as regular files, so conflicts are
> possible.

Yes, that's right.
Aufs reserves ".wh." as a whiteout prefix, and prohibits users from handling
such filenames inside aufs. It might be a problem as you wrote, but users
can create/remove them directly on the lower fs and I have never
received a request about this reserved prefix.
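
A reserved-prefix scheme of this kind can be sketched as below. Only the ".wh." prefix comes from the message; the function names, the choice of PermissionError, and the two-layer listing are assumptions for illustration.

```python
WHITEOUT_PREFIX = ".wh."   # the prefix named in the message above


def check_name(name):
    """Reject names a user may not create through the union (error type assumed)."""
    if name.startswith(WHITEOUT_PREFIX):
        raise PermissionError(f"{name!r} uses the reserved whiteout prefix")
    return name


def visible_entries(upper_names, lower_names):
    """Hide lower entries that have a matching whiteout in the upper layer."""
    whiteouts = {n[len(WHITEOUT_PREFIX):] for n in upper_names
                 if n.startswith(WHITEOUT_PREFIX)}
    upper = [n for n in upper_names if not n.startswith(WHITEOUT_PREFIX)]
    lower = [n for n in lower_names if n not in whiteouts and n not in upper]
    return upper + lower


print(visible_entries([".wh.passwd", "hosts"], ["passwd", "hosts", "fstab"]))
# ['hosts', 'fstab']
```

The namespace conflict is visible in `check_name`: a regular file that legitimately starts with ".wh." cannot be created through the union at all.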


> * mounting a large number of aufs on top of each other eventually
> overflows the kernel stack, e.g. in readdir.

The aufs readdir operation consumes memory, but not stack. If it were
implemented as a recursive function, it might cause a stack
overflow. But actually it is a loop.
The memory is used for storing entry names and eliminating whiteout-ed
ones, and the result is cached for a specified time. So memory
(other than stack) will be consumed.


> * allowing multiple writable branches (instead of just stacking
> one rw copy on a number of ro file systems) is confusing to the user
> and complicates the implementation a lot.

Probably you are right. Initially aufs had only one policy to select the
writable branch. But several users requested other policies such as
round-robin or most-free-space, and aufs has implemented them.
I don't think users will be confused by these policies. While I tried to
keep it simple, I guess some people will say it is complex.
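
To make the policies concrete, here is a minimal sketch of the three selection strategies mentioned. The function names, branch paths, and the `free_bytes` mapping are hypothetical; aufs's actual option names and behaviour are not shown in this thread.

```python
import itertools


def top_branch(branches):
    """Always write to the first (topmost) writable branch."""
    return branches[0]


def round_robin(branches):
    """Return a picker that cycles through the writable branches."""
    cycle = itertools.cycle(branches)
    return lambda: next(cycle)


def most_free_space(branches, free_bytes):
    """Pick the branch with the most free space; free_bytes maps branch -> bytes."""
    return max(branches, key=lambda b: free_bytes[b])


branches = ["/rw1", "/rw2"]
pick = round_robin(branches)
print(top_branch(branches))                                    # /rw1
print([pick() for _ in range(3)])                              # ['/rw1', '/rw2', '/rw1']
print(most_free_space(branches, {"/rw1": 100, "/rw2": 500}))   # /rw2
```

With anything other than `top_branch`, where a new file lands depends on call history or on current disk usage rather than on the path alone.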


Junjiro Okajima

2008-06-02 11:07:40

by Jamie Lokier

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

Erez Zadok wrote:
>
> > Jamie Lokier wrote:
> > > Phillip Lougher wrote:
> > > If I read the patches correctly, when a file page is written to, only
> > > that page gets copied into the page cache and locked, the other pages
> > > continue to be read off disk from cramfs? With Unionfs a page write
> > > causes the entire file to be copied up to the r/w tmpfs and locked into
> > > the page cache causing unnecessary RAM overhead.
>
> Yes, unionfs does copyup whole files, but it doesn't lock the entire file
> into the page cache. But I agree, that copying up large files to a tmpfs
> partition adds more memory pressure, at least temporarily (until pdflush
> kicks in).

1: I'm thinking systems which have union-over-cramfs probably don't have
swap at all...

2: It's a problem when you modify a very large file, even on a fast PC
with plenty of RAM. LVM snapshots might be better for this sort of
thing.

> > Ok, so why not fix that in unionfs? An option so that holes in the
> > overlay file let through data from the underlying file sounds like it
> > would be generally useful, and quite easy to implement.
>
> If I understand you right, you want to copyup one page at a time, right?
> That's not nearly as easy as one might imagine. First, you can't do it on
> file systems which don't support holes. Second, holes is a file-systems
> specific implementation issue, and the knowledge of holes AFAIC, is hidden
> from the VFS (IIRC, FreeBSD has a specific "zfod" page flag, which is turned
> on when the VM has a page that came out of a f/s hole).

True, although the new FIEMAP ioctl is supposed to make holes more
filesystem independent, when they are supported.

> You'll need a way to tell if a given page was copied up or not, and
> distinguish b/t pages which are naturally filled with zeros vs. those which
> came from f/s holes.

Metadata. Don't you have other metadata anyway, like whiteouts? :-)

> Copyup is also providing persistency: you can copyup to a persistent f/s
> such as ext2. So you'll need a bitmap or some sort of record that will
> survive file system remount and system reboot; such a bitmap will have to
> tell which pages of a file have been copied up or not.

Yes.

> I'm not saying it's not possible, but it's harder to do this page-wise caching at a
> stackable layer than inside a native f/s such as ext2. Now, if there was a
> generic VFS op that allowed me to query a file system whether a page in a
> given file is a hole or not, then unionfs would be able to do page-wise
> copyup easily.

See FIEMAP. Is it any use?
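
As a rough userspace illustration of asking a filesystem where a file's holes are, the sketch below walks a file's data extents. Note the swap: it uses lseek() with SEEK_DATA and SEEK_HOLE, a different interface from the FIEMAP ioctl and one that reached Linux after this thread. The function name `data_extents` is hypothetical, and on filesystems without hole support the whole file is typically reported as one data extent.

```python
import errno
import os
import tempfile


def data_extents(fd):
    """Return [(start, end), ...] byte ranges that hold data rather than holes."""
    size = os.fstat(fd).st_size
    extents = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:   # only a hole remains up to EOF
                break
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)  # EOF counts as an implicit hole
        extents.append((start, end))
        offset = end
    return extents


with tempfile.TemporaryFile() as f:
    f.truncate(1 << 20)            # 1 MiB file, initially one big hole
    f.seek(512 * 1024)
    f.write(b"copied-up page")
    f.flush()
    print(data_extents(f.fileno()))   # extent boundaries depend on the filesystem
```

A page-wise copy-up scheme would still need its own record, since a hole in the upper file says nothing about whether the page was deliberately zeroed.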

-- Jamie

2008-06-02 11:16:25

by Arnd Bergmann

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

On Monday 02 June 2008, J. R. Okajima wrote:
> > * data inconsistency problems when simultaneously accessing the underlying
> >   fs and the union.
> Aufs has three levels of detecting the direct-access to the lower
> (branch) filesystems (ie. bypassing aufs). I guess the most strict level
> is a good answer for your question. It is based on the inotify
> feature. Aufs sets inotify-watch to every accessed directories on lower
> fs. During those inodes are cached, aufs receives the inotify event for
> their children/files and marks the aufs data for the file is
> obsoleted. When the file is accessed later, aufs retrieves the latest
> inode (or dentry) again.
> The inotify-watch will be removed when the aufs dir inode is discarded
> from cache.

This is a very complicated approach, and I'm not sure if it even addresses
the case where you have a shared mmap on both files. With VFS based union
mounts, they share one inode, so you don't need to use idiotify in the first
place, and it automatically works on shared mmaps.

> > * duplication of dentry and inode data structures in the union wastes
> >   memory and cpu cycles.
>
> Aufs has its own dentry and inode object as normal fs has. And they have
> pointers to the corresponding ones on the lower fs. If you make a union
> from two real filesystems, then aufs inode will have (at most) two
> pointers as its private data.
> Do you mean having pointers is a duplication?

I mean having your own dentry and inode object is duplication. The
underlying file system already has them, so if you have your own,
you need to keep them synchronized. I guess that in order to do
a lookup on a file, you need the steps of

 1. lookup in aufs dentry cache -> fail
 2. lookup in underlying dentry cache -> fail
 3. try to read dentry from disk -> fail
 4. repeat 2-3 until found, or arrive at lowest level
 5. create an inode in memory for the lower file system
 6. create dentry in memory on lower file system, pointing
    to that
 7. create an aufs specific inode pointing to the underlying
    inode
 8. create an aufs specific dentry object to point to that
 9. create a struct inode representing the aufs inode
10. create another VFS dentry to point to that

when you really should just return the dentry found by the
lower file system.

> > * whiteouts are in the same namespace as regular files, so conflicts are
> >   possible.
>
> Yes, that's right.
> Aufs reserves ".wh." as a whiteout prefix, and prohibits users to handle
> such filename inside aufs. It might be a problem as you wrote, but users
> can create/remove them directly on the lower fs and I have never
> received request about this reserved prefix.

It's not so much a practical limitation as an exploitable feature.
E.g. an unprivileged user may use this to get an application into
an error condition by asking for an invalid file name.

POSIX reserves a well-defined set of invalid file names, and
deviation from this means that you are not compliant, and that
in a potentially unexpected way.

> > * mounting a large number of aufs on top of each other eventually
> >   overflows the kernel stack, e.g. in readdir.
>
> Aufs readdir operation consumes memory, but it is not stack. If it was
> implemented as a recursive function, it might cause the stack
> overflow. But actually it is a loop.
> The memory is used for storing entry names and eliminating whiteout-ed
> ones, and the result will be cached for a specified time. So the memory
> (other than stack) will be consumed.

How does aufs know that one of its branches is an aufs itself?
If you detect this, do you fold it into a single aufs instance with
more branches?
In case you don't do it, I don't see how you get around the stack
overflow, but if you do it, you have again added a whole lot of
complexity for something that should be trivial when done right.

> > * allowing multiple writable branches (instead of just stacking
> >   one rw copy on a number of ro file systems) is confusing to the user
> >   and complicates the implementation a lot.
>
> Probably you are right. Initially aufs had only one policy to select the
> writable branch. But several users requested another policy such as
> round-robin or most-free-space, and aufs has implemented them.
> I don't guess users will be confused by these policies. While I tried it
> should be simple, I guess some people will say it is complex.

I personally think that a policy other than writing to the top is crazy
enough, but randomly writing to multiple places is much worse, as it
becomes unpredictable what the file system does, not just unexpected.

Arnd <><

2008-06-02 12:58:47

by J. R. Okajima

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support


Arnd Bergmann:
> This is a very complicated approach, and I'm not sure if it even addresses
> the case where you have a shared mmap on both files. With VFS based union
> mounts, they share one inode, so you don't need to use idiotify in the first
> place, and it automatically works on shared mmaps.

As you might know, aufs doesn't have its own file mapped pages. Aufs
overrides vm_operations and redirects the page fault to the lower file's
vm_operation. So the shared mmap has no problem.
I am afraid that I should have written "marks the attributes in aufs is
obsoleted" instead of "marks the aufs data for the file is obsoleted" in
my previous mail.


> I mean having your own dentry and inode object is duplication. The

I see.
Then the solution must be union-mount.
Your 10 steps seem to be rather verbose. Generally, 'lookup' means to
create (or get) inode and dentry, and the fs inode and VFS inode are
allocated at the same time.
Aufs does 'lookup' for the lower dentry (yes, it must be repeated if
necessary), and sets it to the aufs dentry/inode private data.


> It's not so much a practical limitation as an exploitable feature.
> E.g. an unprivileged user may use this to get an application into
> an error condition by asking for an invalid file name.

If a user specifies the prohibited filename, then he will get an error.


> Posix reserves a well-defined set of invalid file names, and
> deviation from this means that you are not compliant, and that
> in a potentially unexpected way.

Yes, the whiteout prefix is a limitation (or a feature).


> How does aufs know that one of its branches is an aufs itself?
> If you detect this, do you fold it into a single aufs instance with
> more branches?
> In case you don't do it, I don't see how you get around the stack
> overflow, but if you do it, you have again added a whole lot of
> complexity for something that should be trivial when done right.

- To detect the filesystem type is easy. Aufs can know whether the
branch is aufs or not by checking s_magic or s_type->name.
- aufs doesn't fold? expand? the nested aufs branch.

You might be pointing out a general matter of stacking filesystems.
When one of branches is a stacking fs, and it is nested deeper and
deeper,
- /aufs1 = /rw1 + /aufs2
- /aufs2 = /rw2 + /aufs3
- /aufs3 = /rw3 + /aufs4
:::
then the stack-overflow may happen. It is not limited to readdir, it can
happen in every operation. Basically aufs rejects 'aufs/unionfs branch',
in other words an "aufs branch of another aufs mount."
But aufs has a configuration option to enable this. When a user enables it
and sets up deeply nested aufs branches, it could happen. But this is the
same thing even if you use union-mount (and if UnionMount supports such
branches).
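
As a rough illustration of the nesting check described here, the
following user-space sketch models a mount by its filesystem type name
(similar in spirit to looking at s_type->name) and counts how many
stacking filesystems sit below a proposed branch. Everything in it is an
assumption made for the example: struct mount_model, the single "lower"
pointer per mount (the /rwN branches are left out), and the allow_nested
and max_depth parameters. It is not how aufs implements its option.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a mount: its type name and one lower branch. */
struct mount_model {
    const char *fs_type;               /* e.g. "aufs", "tmpfs", "cramfs" */
    const struct mount_model *lower;   /* next branch down, or NULL */
};

/* Treat aufs and unionfs as stacking filesystems, as in the mail above. */
static int is_stacking_fs(const char *fs_type)
{
    return strcmp(fs_type, "aufs") == 0 || strcmp(fs_type, "unionfs") == 0;
}

/* Count the stacking filesystems in the chain starting at M. */
static int nesting_depth(const struct mount_model *m)
{
    int depth = 0;

    for (; m != NULL; m = m->lower)
        if (is_stacking_fs(m->fs_type))
            depth++;
    return depth;
}

/*
 * Decide whether BRANCH may be used as a branch of a new union mount.
 * A plain filesystem is always accepted. A stacking branch is rejected
 * unless ALLOW_NESTED is set, and even then only while the new mount
 * (depth + 1) would not exceed MAX_DEPTH.
 */
static int branch_allowed(const struct mount_model *branch,
                          int allow_nested, int max_depth)
{
    int depth = nesting_depth(branch);

    if (depth == 0)
        return 1;
    if (!allow_nested)
        return 0;
    return depth < max_depth;
}
```

With the default of rejecting stacking branches the chain /aufs1, /aufs2,
/aufs3 cannot be built at all; once nesting is enabled, only an explicit
depth limit like max_depth keeps the per-operation stack usage bounded.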


> I personally think that a policy other than writing to the top is crazy
> enough, but randomly writing to multiple places is much worse, as it
> becomes unpredictable what the file system does, not just unexpected.

I don't want you to call aufs users crazy who are using such policies.
By the way, what do you think about link(2) or rename(2)? When the source
file exists on the lower writable branch, do you think copy-up is the best
way? Or do you think all lower branches should be readonly?
There is an exception in aufs's branch-select policy. That is the
link/rename case. When the source file exists on a writable branch, aufs
tries to link/rename it on that branch under every policy. Do you think it
is best to do it on the top branch only?


Junjiro Okajima

2008-06-02 14:14:27

by Arnd Bergmann

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

On Monday 02 June 2008, [email protected] wrote:
> I don't want you to call aufs users crazy who are using such policies.
> By the way, what do you think about link(2) or rename(2)? When the source
> file exists on the lower writable branch, do you think copy-up is the best
> way? Or do you think all lower branches should be readonly?
> There is an exception in aufs's branch-select policy. That is the
> link/rename case. When the source file exists on a writable branch, aufs
> tries to link/rename it on that branch under every policy. Do you think it
> is best to do it on the top branch only?

Yes, I tend to consider the union case identical to the cross-mount
move or link, so I'd expect the kernel to return errno=EXDEV and user
space to handle this by doing the appropriate copy/unlink as it does
for other cases already.
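
The user-space handling referred to here follows a well-known pattern:
try rename(2), and if it fails with EXDEV, copy the data and unlink the
source. A minimal sketch for regular files is shown below. The helper
names copy_file() and move_file() are made up for this example, and
unlike mv(1) the sketch does not preserve permissions, ownership or
timestamps and does not handle directories.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Copy SRC to DST byte for byte; returns 0 on success, -1 on error. */
static int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (in == NULL)
        return -1;

    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    char buf[4096];
    size_t n;
    int rc = 0;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;
    fclose(in);
    if (fclose(out) != 0)
        rc = -1;
    return rc;
}

/*
 * Move SRC to DST. rename(2) is tried first; only when it reports
 * EXDEV (source and target on different file systems) does the
 * function fall back to copy followed by unlink.
 */
static int move_file(const char *src, const char *dst)
{
    if (rename(src, dst) == 0)
        return 0;
    if (errno != EXDEV)
        return -1;
    if (copy_file(src, dst) != 0)
        return -1;
    return unlink(src);
}
```

Under the behaviour proposed above, a rename across union layers would
take the same EXDEV path that a move between two separate mounts takes
today.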

Arnd <><

2008-06-02 14:35:34

by J. R. Okajima

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support


Arnd Bergmann:
> > By the way, what do you think about link(2) or rename(2)? When the source
> > file exists on the lower writable branch, do you think copy-up is the best
> > way? Or do you think all lower branches should be readonly?
> > There is an exception in aufs's branch-select policy. That is the
> > link/rename case. When the source file exists on a writable branch, aufs
> > tries to link/rename it on that branch under every policy. Do you think it
> > is best to do it on the top branch only?
>
> Yes, I tend to consider the union case identical to the cross-mount
> move or link, so I'd expect the kernel to return errno=EXDEV and user
> space to handle this by doing the appropriate copy/unlink as it does
> for other cases already.

Aure rename returns EXDEV when the target is a dir and it has child
entr(y|ies) on lower branch(es). And mv(1) handles this case.
My English might be misunderstood. Do you think link(2) should return an
error when the target exists on a lower writable branch?


Junjiro Okajima

2008-06-02 15:01:43

by Arnd Bergmann

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

On Monday 02 June 2008, [email protected] wrote:
> Aure rename returns EXDEV when the target is a dir and it has child
> entr(y|ies) on lower branch(es). And mv(1) handles this case.
> My English might be misunderstood. Do you think link(2) should return an
> error when the target exists on a lower writable branch?

Any writes should always just go to the top level. If the source file
for link() exists on the top level, link should succeed even if a target
exists on a lower level (given that the user has permissions to
unlink that file), but should return EXDEV if the source comes from
a lower level.
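
The rule stated above can be written down as a small decision function.
This is only a sketch of the proposal as worded in this mail, not of any
existing implementation. The EXDEV case and the "succeeds if the target
is only on a lower level and may be unlinked" case come from the text;
the enum, the function name union_link_check(), the order of the checks
and the ENOENT, EEXIST and EACCES results are assumptions added to make
the example complete.

```c
#include <errno.h>

/* Where a name currently exists in the union (hypothetical model). */
enum union_layer { LAYER_NONE, LAYER_TOP, LAYER_LOWER };

/*
 * Outcome of link(source, target) when all writes go to the top level.
 * Returns 0 on success or a negative errno value.
 */
static int union_link_check(enum union_layer source,
                            enum union_layer target,
                            int may_unlink_target)
{
    if (source == LAYER_NONE)
        return -ENOENT;          /* assumption: usual link(2) result */
    if (source == LAYER_LOWER)
        return -EXDEV;           /* from the mail: treat as cross-mount */
    if (target == LAYER_TOP)
        return -EEXIST;          /* assumption: usual link(2) result */
    if (target == LAYER_LOWER && !may_unlink_target)
        return -EACCES;          /* assumption: error code not specified */
    return 0;
}
```

For example, union_link_check(LAYER_LOWER, LAYER_NONE, 1) yields -EXDEV,
so user space would fall back to copying, exactly as for a link attempt
across two mounts.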

Arnd <><

2008-06-02 15:07:50

by Evgeniy Polyakov

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

Hi Arnd.

On Mon, Jun 02, 2008 at 01:15:40PM +0200, Arnd Bergmann ([email protected]) wrote:
> This is a very complicated approach, and I'm not sure if it even addresses
> the case where you have a shared mmap on both files. With VFS based union
> mounts, they share one inode, so you don't need to use idiotify in the first
> place, and it automatically works on shared mmaps.

Inotify has nothing in common with that - it notifies about inode updates,
which is the only thing needed for unionfs. VM and aufs vmops will take care of
reads and writes, since there is no duplication of the data here.

> I mean having your own dentry and inode object is duplication. The
> underlying file system already has them, so if you have your own,
> you need to keep them synchronized. I guess that in order to do
> a lookup on a file, you need the steps of
>
> 1. lookup in aufs dentry cache -> fail
> 2. lookup in underlying dentry cache -> fail
> 3. try to read dentry from disk -> fail
> 4. repeat 2-3 until found, or arrive at lowest level
> 5. create an inode in memory for the lower file system
> 6. create dentry in memory on lower file system, pointing
> to that
> 7. create an aufs specific inode pointing to the underlying
> inode
> 8. create an aufs specific dentry object to point to that
> 9. create a struct inode representing the aufs inode
> 10. create another VFS dentry to point to that
>
> when you really should just return the dentry found by the
> lower file system.

Or it is a feature, and you should not return the dentry of the lower file
system, when you can have different objects pointing to the
same object.
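
Leaving the dentry and inode bookkeeping aside, the layered search in
steps 1-4 of the quoted list boils down to walking the branches from top
to bottom until the name or a whiteout for it is found. A user-space
sketch of just that search is given below. It assumes the ".wh." prefix
mentioned earlier in the thread, models each branch as a flat array of
names, and lets a real entry win over a whiteout in the same branch;
struct layer and union_lookup() are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define WH_PREFIX ".wh."

/* One branch of the union: a flat list of directory entry names. */
struct layer {
    const char *const *names;
    size_t count;
};

static int layer_has(const struct layer *l, const char *name)
{
    for (size_t i = 0; i < l->count; i++)
        if (strcmp(l->names[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Look NAME up in LAYERS, index 0 being the top branch.
 * Returns the index of the branch that provides the name, or -1 if
 * the name does not exist or is hidden by a whiteout further up.
 */
static int union_lookup(const struct layer *layers, size_t nlayers,
                        const char *name)
{
    char wh[256];
    int len = snprintf(wh, sizeof wh, "%s%s", WH_PREFIX, name);

    if (len < 0 || (size_t)len >= sizeof wh)
        return -1;               /* name too long for this sketch */

    for (size_t i = 0; i < nlayers; i++) {
        if (layer_has(&layers[i], name))
            return (int)i;
        if (layer_has(&layers[i], wh))
            return -1;           /* whited out: stop descending */
    }
    return -1;
}
```

Whether the result of such a search is then wrapped in a second set of
dentry and inode objects, or handed back directly from the lower file
system, is the design question being argued in this part of the thread.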

> It's not so much a practical limitation as an exploitable feature.
> E.g. an unprivileged user may use this to get an application into
> an error condition by asking for an invalid file name.

Hmm... I believe if an exploit wants to do bad things and the system
prevents it, that is actually the right decision? But since you asked, I'm
not sure anymore...

> Posix reserves a well-defined set of invalid file names, and
> deviation from this means that you are not compliant, and that
> in a potentially unexpected way.

Everything has its own limitations. 256 bytes per name is a much stronger
problem, but everyone works with that.
It is a limitation, but a rather insignificant one IMO.

> I personally think that a policy other than writing to the top is crazy
> enough, but randomly writing to multiple places is much worse, as it
> becomes unpredictable what the file system does, not just unexpected.

Is this a double rot13 encoded "people will never use computers with
more than 640 kb of ram" phrase? :)

While working VFS union mounting does not exist, AUFS does work.
It is just another filesystem, which works and has a big userbase. Any VFS
approach (when implemented) will work on its own and its implementation
does not depend on this particular fs.

--
Evgeniy Polyakov

2008-06-02 15:48:57

by Erez Zadok

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

In message <[email protected]>, Arnd Bergmann writes:
> On Monday 02 June 2008, [email protected] wrote:
[...]
> Ok, I'm sorry for not having looked at it myself. I saw an older version
> and assumed it was not going to improve much. I'll have another look
> when I find the time. Unionfs was suffering from severe feature creep
> (multiple writable branches, runtime branch modification), and aufs
> seemed to add even more features instead of removing them.

Re: feature creep. Unionfs had more features initially, but we removed
those that users didn't seem to want/use. The bottom line, we've been
maintaining unionfs publicly for 5+ years now, so the set of features we
have is based exactly on what users want. If anyone can give the users what
they want/need in a different, more elegant way, that's great; if not, users
just won't switch to another solution.

> Without reading either again, the top problems in unionfs at the time were:
> * data inconsistency problems when simultaneously accessing the underlying
> fs and the union.

That's not an issue when using vm_ops->fault for data.

There is still an issue wrt dentries and topology changes, as Al mentioned
here before. Al suggested to me (at OLS 08) that the superblock struct
might need the same writers-count as has been done for vfsmounts recently;
then you can prevent topology changes during union'ed operations
(esp. copyup).

> * duplication of dentry and inode data structures in the union wastes
> memory and cpu cycles.

Yes, but I don't think it's much more than any other layered solution will
have (including ecryptfs and union mounts). This is a general problem in
stackable file systems. Union Mounts, being in the VFS, has the chance to
use less memory indeed, but at a possible cost of increased VFS complexity.

> * whiteouts are in the same namespace as regular files, so conflicts are
> possible.

Agreed. We have a different version of unionfs, called unionfs-odf, which
moves the whiteouts and all unioning-related meta-data to a separate, small
persistent partition.

But the better long-term solution is to get WH support in every native f/s.
These patches had been floating around for a while now, and they seem simple
enough that I don't see why it had taken so long to get basic WH support
into mainline (or at least -mm). (Bharata, can you ask akpm to add just the
WH support into -mm perhaps?)

> * mounting a large number of aufs on top of each other eventually
> overflows the kernel stack, e.g. in readdir.

Yes. That's a general problem with stackable file systems. Each layer you
add increases the depth of the stack. There are already known paths
(involving xfs/nfs/dm, etc.) which overrun the stack and the solution I've
heard was "don't do it." That seems silly to me. Instead, the kernel stack
should be growable dynamically, at the cost of performance.

However, the vast majority of unioning users use just one layer, so even for
us, blowing up the stack has been a rather rare user complaint. And we've
been very mindful of stack usage (i.e., checking and optimizing based on
checkstack.pl).

> * allowing multiple writable branches (instead of just stacking
> one rw copy on a number of ro file systems) is confusing to the user
> and complicates the implementation a lot.

I agree that it does complicate the implementation, but again, this is
something that _some_ users really want: they want to merge multiple
"packages" together, and ensure that modifications to files/dirs of a given
package stay in their logical location.

I disagree with you that it's confusing to the user. I've never had
complaints that people didn't know how to change the branch configurations
dynamically. Heck, people come up with creative ways of using dynamic
branch configurations in all sorts of funky environments that make even my
head spin :-) -- chroot, pivot_root, nfs exports, etc.

> With the exception of the last two, I assumed that these were all
> unfixable with a file system based approach (including the hypothetical
> union-tmpfs). If you have addressed them, how?
>
> Arnd <><

Erez.

2008-06-02 17:47:14

by Arnd Bergmann

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

On Monday 02 June 2008, Evgeniy Polyakov wrote:
> > I personally think that a policy other than writing to the top is crazy
> > enough, but randomly writing to multiple places is much worse, as it
> > becomes unpredictable what the file system does, not just unexpected.
>
> Is this a double rot13 encoded "people will never use computers with
> more than 640 kb of ram" phrase? :)

No, it's more the "people don't need variable block size drives" argument.
They've been working fine for decades on mainframes, are incredibly
complicated to build and entirely pointless in practice ;-)

Arnd <><

2008-06-02 18:14:27

by Erez Zadok

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

In message <[email protected]>, Arnd Bergmann writes:
> On Monday 02 June 2008, Erez Zadok wrote:

> > Arnd, I favor a more generic approach, one that will work with the vast
> > majority of file systems that people use w/ unioning, preferably all of
> > them.  Supporting copy-on-write in cramfs will only help a small subset of
> > users.  Yes, it might be simple, but I fear it won't be useful enough to
> > convince existing users of unioning to switch over.  And I don't think we
> > should add CoW support in every file system -- the complexity will be much
> > more than using unionfs or some other VFS-based solution.
>
> My idea was to have it in cramfs, squashfs and iso9660 at most, I agree
[...]

Ah, ok. Doing those 3 will get better coverage for existing users. The
question may come down to how much code complexity it adds to each, and
whether some common code can be excised into generic helpers?

Arnd, my concern is that it might take a long time to see those in mainline.
Look at the status of whiteouts support in native file systems (just
whiteouts, not duplicate elimination): after months of trials and several
posts, those patches aren't even in -mm. And those are relatively simple
patches. I can search for Viro's posting when he said he could hack it all
in one weekend; ok so maybe *he* can :-), but the point is that even with
Viro's tentative support of whiteouts, we're still not closer to having WH
support in mainline.

Who knows, maybe if you managed to get _something_ into mainline, it'll help
the overall effort move along; right now I fear there are too many strong
opinions on all sides that the effort is stuck.

[...]
> I'll probably try implementing a '-o union' option tmpfs anyway, just
> to see how hard it is and what the problems are.

And I'll be happy to test it for you (read: find bugs :-). I've built a
large set of unioning-related regression tests over the years.

> Arnd <><

Erez.

2008-06-03 02:03:36

by Phillip Lougher

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support

Erez Zadok wrote:
> In message <[email protected]>, Arnd Bergmann writes:
>> On Monday 02 June 2008, Erez Zadok wrote:
>
>>> Arnd, I favor a more generic approach, one that will work with the vast
>>> majority of file systems that people use w/ unioning, preferably all of
>>> them.  Supporting copy-on-write in cramfs will only help a small subset of
>>> users.  Yes, it might be simple, but I fear it won't be useful enough to
>>> convince existing users of unioning to switch over.  And I don't think we
>>> should add CoW support in every file system -- the complexity will be much
>>> more than using unionfs or some other VFS-based solution.
>> My idea was to have it in cramfs, squashfs and iso9660 at most, I agree
> [...]
>
> Ah, ok. Doing those 3 will get better coverage for existing users. The
> question may come down to how much code complexity it adds to each, and
> whether some common code can be excised into generic helpers?
>

Yes, that's what I'm interested in. From my reading of the patches, the
general approach and a lot of the code should be directly useable in a
fake-writable Squashfs. The first step (a very big first step) is to
get readonly Squashfs mainlined, which is what I'm working on at the
moment. After that I'll be very interested in looking at fake-write
support and factoring any common code into generic helpers.

Phillip

2008-06-03 11:05:29

by J. R. Okajima

Subject: Re: [RFC 0/7] [RFC] cramfs: fake write support


Arnd Bergmann:
> Any writes should always just go to the top level. If the source file
> for link() exists on the top level, link should succeed even if a target
> exists on a lower level (given that the user has permissions to
> unlink that file), but should return EXDEV if the source comes from
> a lower level.

Then what will happen when a user builds a union by "empty tmpfs" +
"cramfs"? Following your design, link(2) becomes useless in stacking fs.

You may be considering implementing a new dynamic link library for
stacking.
Hmm, that is interesting. It may be worth thinking about.


Junjiro Okajima