LinuxLists.cc - unionfs

Subject: Re: unionfs

> I have never tested it, but from other reports I read that it works.
>
> http://translucency.sourceforge.net/

Thanks, but I'm not looking for a working unionfs, rather for design
ideas for doing this sort of 'layered' filesystem _properly_, not
with hacks like system call table modification (like the above
solution).

Al, can you please give some guidance? Have you any code for a
unionfs design, or only ideas, or was this just a myth?

Miklos

2004-03-11 15:44:14

Subject: Re: unionfs

> If you get a response from Al, could you let me know?

OK.

> I've been wondering about this myself, and beyond simple
> coolness/usefulness, we may also need the unionfs for MLS
> polyinstantiation.
>
> If you don't hear from Al, please let me know whether you plan to tackle
> it yourself or not.

I'll have to, as this is needed for AVFS. Not unionfs, but something
similar, that allows file/directory lookups for special filenames to
be redirected to another filesystem.

For the time being I plan to take the easy way, and create a filesystem
specifically for AVFS that doesn't need modifications to the kernel.
This will be inefficient in a number of ways: doubling the memory use
by the dentry/inode caches, deeper call chains for all filesystem
operations (even those that require no intervention).

The next step will be to try to optimize and generalize it to be
usable for other filesystems like unionfs. I'd really love to have
some input about this from Al or anybody who has any ideas.

Miklos

2004-03-12 23:38:44

by Herbert Poetzl

Subject: Re: unionfs

On Mon, Mar 08, 2004 at 10:52:54AM +0100, Miklos Szeredi wrote:
> Hi Al,
>
> I'm interested in any (even if very incomplete) work you did on
> unionfs. But I can only find an empty directory at
> ftp://ftp.math.psu.edu/pub/viro/ linked from old kernel status pages.

> Can you point me to any information regarding this?

FWIW, have a look at http://vserver.13thfloor.at/TBVFS/

HTH,
Herbert

> Thanks,
> Miklos

2004-03-14 02:32:41

Subject: Re: unionfs

On Thu, 11 Mar 2004, Miklos Szeredi wrote:

>
> I'll have to, as this is needed for AVFS. Not unionfs, but something
> similar, that allows file/directory lookups for special filenames to
> be redirected to another filesystem.

I have a need for this in autofs4 also.

Ian

2004-03-14 04:23:02

Subject: Re: unionfs

Ian Kent <[email protected]> said:
> On Thu, 11 Mar 2004, Miklos Szeredi wrote:
>
> >
> > I'll have to, as this is needed for AVFS. Not unionfs, but something
> > similar, that allows file/directory lookups for special filenames to
> > be redirected to another filesystem.
>
> I have a need for this in autofs4 also.

At least some Unices have context-dependent symlinks, and AFAIR there was
something like this in Linux a _long_ while back (perhaps just in Red Hat,
must have been the second half of the '90s). It was discarded as too much
mess (in kernel, in userspace, and in wetware) for little gain, IIRC.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-03-15 11:35:32

by Carsten Otte

Subject: Re: unionfs

Herbert Poetzl wrote:
>FWIW, have a look at http://vserver.13thfloor.at/TBVFS
I do really think this problem needs to be solved a different way: BSD-style
union mount in VFS, no redirecting filesystem.
I am planning to work on that during the 2.7. series. I do hope I will be able
to write code clean enough for inclusion, let's see...

2004-03-15 12:19:53

Subject: Re: unionfs

On Mon, 15 March 2004 12:35:25 +0100, Carsten Otte wrote:
> Herbert Poetzl wrote:
> >FWIW, have a look at http://vserver.13thfloor.at/TBVFS
> I do really think this problem needs to be solved a different way: BSD-style
> union mount in VFS, no redirecting filesystem.
> I am planning to work on that during the 2.7. series. I do hope I will be able
> to write code clean enough for inclusion, let's see...

You could also have some sort of 'hidden symlink', i.e. something that
behaves just like a file but is in fact a link to some other
filesystem. If that other filesystem is not accessible, all
operations return -EIO.

Not sure if this is a sane solution, but it would make my cow-stuff
work across filesystems as well.

Jörn

--
Schrödinger's cat is <BLINK>not</BLINK> dead.
-- Illiad

2004-03-15 12:46:30

Subject: Re: unionfs

On Mon, 15 Mar 2004, Jörn Engel wrote:

> On Mon, 15 March 2004 12:35:25 +0100, Carsten Otte wrote:
> > Herbert Poetzl wrote:
> > >FWIW, have a look at http://vserver.13thfloor.at/TBVFS
> > I do really think this problem needs to be solved a different way: BSD-style
> > union mount in VFS, no redirecting filesystem.
> > I am planning to work on that during the 2.7. series. I do hope I will be able
> > to write code clean enough for inclusion, let's see...
>
> You could also have some sort of 'hidden symlink', i.e. something that
> behaves just like a file but is in fact a link to some other
> filesystem. If that other filesystem is not accessible, all
> operations return -EIO.

Sounds a bit untidy.

Has anyone checked http://www.filesystems.org/

What do you think?

Ian

2004-03-15 13:16:19

Subject: Re: unionfs

On Mon, 15 March 2004 20:47:05 +0800, Ian Kent wrote:
> On Mon, 15 Mar 2004, Jörn Engel wrote:
> >
> > You could also have some sort of 'hidden symlink', i.e. something that
> > behaves just like a file but is in fact a link to some other
> > filesystem. If that other filesystem is not accessible, all
> > operations return -EIO.
>
> Sounds a bit untidy.

If you have a cleaner idea, I'm open for suggestions.

> Has anyone checked http://www.filesystems.org/
>
> What do you think?

Looks like an abstraction layer that still assumes a 1:1 mapping
between filesystems and devices, so it doesn't help. Did I miss
something?

Jörn

--
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong

2004-03-15 14:33:58

Subject: Re: unionfs

On Mon, 15 Mar 2004, Jörn Engel wrote:

> On Mon, 15 March 2004 20:47:05 +0800, Ian Kent wrote:
> > On Mon, 15 Mar 2004, Jörn Engel wrote:
> > >
> > > You could also have some sort of 'hidden symlink', i.e. something that
> > > behaves just like a file but is in fact a link to some other
> > > filesystem. If that other filesystem is not accessible, all
> > > operations return -EIO.
> >
> > Sounds a bit untidy.
>
> If you have a cleaner idea, I'm open for suggestions.
>
> > Has anyone checked http://www.filesystems.org/
> >
> > What do you think?
>
> Looks like an abstraction layer that still assumes a 1:1 mapping
> between filesystems and devices, so it doesn't help. Did I miss
> something?

I don't understand the requirement properly. Sorry.

Ian

2004-03-15 14:36:20

Subject: Re: unionfs

On Mon, 15 Mar 2004, Jörn Engel wrote:

> On Mon, 15 March 2004 20:47:05 +0800, Ian Kent wrote:
> > On Mon, 15 Mar 2004, Jörn Engel wrote:
> > >
> > > You could also have some sort of 'hidden symlink', i.e. something that
> > > behaves just like a file but is in fact a link to some other
> > > filesystem. If that other filesystem is not accessible, all
> > > operations return -EIO.
> >
> > Sounds a bit untidy.
>
> If you have a cleaner idea, I'm open for suggestions.
>
> > Has anyone checked http://www.filesystems.org/
> >
> > What do you think?
>
> Looks like an abstraction layer that still assumes a 1:1 mapping
> between filesystems and devices, so it doesn't help. Did I miss
> something?

There was talk on the mailing list that they were close to releasing a
unionfs filesystem for their fistgen generator. But it has been a while
and still nothing.

Ian

2004-03-15 16:16:18

Subject: Re: unionfs

On Mon, 15 March 2004 14:16:01 +0100, Jörn Engel wrote:
> On Mon, 15 March 2004 20:47:05 +0800, Ian Kent wrote:
> > On Mon, 15 Mar 2004, Jörn Engel wrote:
> > >
> > > You could also have some sort of 'hidden symlink', i.e. something that
> > > behaves just like a file but is in fact a link to some other
> > > filesystem. If that other filesystem is not accessible, all
> > > operations return -EIO.
> >
> > Sounds a bit untidy.
>
> If you have a cleaner idea, I'm open for suggestions.

Stupid me. Simply open a file in the cached fs, so it can't be
unmounted.

Jörn

--
Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.
-- Rob Pike

2004-03-15 16:17:21

Subject: Re: unionfs

On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:
>
> I don't understand the requirement properly. Sorry.

Depends on who you ask, but imo it boils down to this:
- Use one filesystem as backing store, usually ro.
- Have another filesystem on top for extra functionality, usually rw
access.

Famous example is a rw-CDROM, where writes go to hard drive and
unchanged data is read from CDROM. But it makes sense for other
things as well.
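
A minimal user-space sketch of that split, under loud assumptions: the
layers are plain in-memory tables, files are overwritten whole, and every
name here (toy_layer, stack_read, stack_write) is hypothetical rather than
any kernel interface. Reads fall through to the backing store, writes only
ever land on the top layer.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_FILES 8
#define TOY_MAX_DATA  64

/* Hypothetical in-memory stand-in for a file on one layer. */
struct toy_file {
    char name[32];
    char data[TOY_MAX_DATA];
    int  used;
};

/* Hypothetical layer: a flat table of files plus a read-only flag. */
struct toy_layer {
    struct toy_file files[TOY_MAX_FILES];
    int read_only;
};

/* Find a file by name on a single layer, or return NULL. */
static struct toy_file *layer_find(struct toy_layer *l, const char *name)
{
    for (int i = 0; i < TOY_MAX_FILES; i++)
        if (l->files[i].used && strcmp(l->files[i].name, name) == 0)
            return &l->files[i];
    return NULL;
}

/* Store data under name on one layer; returns -1 if the layer is
 * read-only or has no free slot. */
static int layer_store(struct toy_layer *l, const char *name, const char *data)
{
    struct toy_file *f;

    if (l->read_only)
        return -1;
    f = layer_find(l, name);
    for (int i = 0; !f && i < TOY_MAX_FILES; i++)
        if (!l->files[i].used)
            f = &l->files[i];
    if (!f)
        return -1;
    f->used = 1;
    strncpy(f->name, name, sizeof(f->name) - 1);
    f->name[sizeof(f->name) - 1] = '\0';
    strncpy(f->data, data, sizeof(f->data) - 1);
    f->data[sizeof(f->data) - 1] = '\0';
    return 0;
}

/* Reads prefer the top layer and fall back to the backing store. */
const char *stack_read(struct toy_layer *top, struct toy_layer *backing,
                       const char *name)
{
    struct toy_file *f = layer_find(top, name);

    if (!f)
        f = layer_find(backing, name);
    return f ? f->data : NULL;
}

/* Writes only ever touch the top layer; the backing store is untouched. */
int stack_write(struct toy_layer *top, const char *name, const char *data)
{
    return layer_store(top, name, data);
}
```

In the CDROM example, the backing layer would be the disc and the top
layer the hard drive; deletion is deliberately left out of this sketch.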

Jörn

--
Data expands to fill the space available for storage.
-- Parkinson's Law

2004-03-15 17:09:27

by Claudio Martins

Subject: Re: unionfs

On Monday 15 March 2004 16:13, Jörn Engel wrote:
> On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:
> > I don't understand the requirement properly. Sorry.
>
> Depends on who you ask, but imo it boils down to this:
> - Use one filesystem as backing store, usually ro.
> - Have another filesystem on top for extra functionality, usually rw
> access.
>
> Famous example is a rw-CDROM, where writes go to hard drive and
> unchanged data is read from CDROM. But it makes sense for other
> things as well.

If I understand correctly, this unionfs feature would also be the cleanest
way of changing the root filesystem after using an initrd ramdisk on boot.
Currently the pivot_root call is used to change root, but that still implies a
bit of a hack. You can read about it in this fine paper:

http://www.cis.udel.edu/~zhi/www.docshow.net/linux/ols.zip

It's also a good read if you want to understand the Linux bootloaders and
the boot process in general.

Regards

Claudio

2004-03-15 19:22:58

Subject: Re: unionfs

Jörn Engel <[email protected]> said:
> On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:

> > I don't understand the requirement properly. Sorry.

> Depends on who you ask, but imo it boils down to this:
> - Use one filesystem as backing store, usually ro.
> - Have another filesystem on top for extra functionality, usually rw
> access.
>
> Famous example is a rw-CDROM, where writes go to hard drive and
> unchanged data is read from CDROM. But it makes sense for other
> things as well.

And what if the underlying filesystem is RW too? What should happen if you
unite several (>= 3) filesystems? What if some are RO, others RW? What do
you do if a file shows up several times, each different?

Assuming one RW on top of a RO only now: What should happen when a
file/directory is missing from the top? If the bottom one "shows through",
you can't delete anything; if it doesn't, you win nothing (because you will
have to keep a complete copy RW on top).

IIRC, this has been discussed a couple of times before, and the consensus
each time was that it isn't /that hard/ to do, but it is /hard or impossible/
to find sensible, simple semantics for this. The idea was then dropped...
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-03-15 23:20:29

by Chris Friesen

Subject: Re: unionfs

Horst von Brand wrote:

> Assuming one RW on top of a RO only now: What should happen when a
> file/directory is missing from the top? If the bottom one "shows through",
> you can't delete anything; if it doesn't, you win nothing (because you will
> have to keep a complete copy RW on top).

I don't see how you win nothing. I create an overlay filesystem. I
delete a bunch of files in the overlay and it doesn't show through. All
my other files are still links to the originals, with the

I would dearly love to use something like this to make it easy to track
changes made all over a source tree. If I could sync them up at the
beginning, then make all my changes in the overlay, then doing a diff is
really easy since you just look for places where the inodes are
different between the two filesystems. Like having hard links, but the
filesystem breaks them for you when you write.
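
The inode comparison described here can already be tried on a tree made
with hard links. The following is a small sketch using the POSIX stat()
call; the helper name same_inode is made up for illustration, and it
assumes an overlay would expose unmodified files with the same device and
inode number as the original, the way hard links do.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Returns 1 if both paths refer to the same inode on the same device
 * (i.e. the file is still shared), 0 if they differ (the link has been
 * broken by a write), and -1 if either stat() call fails. */
int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

A diff tool would walk both trees and only compare contents for the pairs
where this returns 0.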

Chris

--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2004-03-15 23:52:56

Subject: Re: unionfs

On Mon, 15 March 2004 15:22:41 -0400, Horst von Brand wrote:
> Jörn Engel <[email protected]> said:
> > On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:
>
> > > I don't understand the requirement properly. Sorry.
>
> > Depends on who you ask, but imo it boils down to this:
> > - Use one filesystem as backing store, usually ro.
> > - Have another filesystem on top for extra functionality, usually rw
> > access.
> >
> > Famous example is a rw-CDROM, where writes go to hard drive and
> > unchanged data is read from CDROM. But it makes sense for other
> > things as well.
>
> And what if the underlying filesystem is RW too? What should happen if you
> unite several (>= 3) filesystems? What if some are RO, others RW? What do
> you do if a file shows up several times, each different?
>
> Assuming one RW on top of a RO only now: What should happen when a
> file/directory is missing from the top? If the bottom one "shows through",
> you can't delete anything; if it doesn't, you win nothing (because you will
> have to keep a complete copy RW on top).

What looks like a promising idea for this problem and others is to
have visible and invisible inodes. All current filesystems know only
visible inodes. Invisible ones have no dentry linking to them
directly, only indirectly through files/links with cow semantics.

Ok, when the underlying filesystem is rw, each file linked from the
caching fs has to be broken up into visible and invisible inodes. The
visible link from both filesystems is to the invisible inode and
writes to either one have to cow.

Three or more filesystems? No problem, same as above.
Mixed ro and rw? No problem, same as above.
Files "showing through"? Doesn't happen if you do the equivalent of
"cp -l" - directories are copied, files are linked.

Solves all of your problems so far. Do you have more?

> IIRC, this has been discussed a couple of times before, and the consensus
> each time was that it isn't /that hard/ to do, it is /hard or impossible/
> to find a sensible, simple semantics for this. The idea was then dropped...

Yeah, maybe. My personal consensus right now is that this actually
looks very simple. Not sure how much time I will find, but it should
definitely be finished for 2.8.

Jörn

--
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham

2004-03-16 12:42:54

Subject: Re: unionfs

> > I'll have to, as this is needed for AVFS. Not unionfs, but something
> > similar, that allows file/directory lookups for special filenames to
> > be redirected to another filesystem.
>
> I have a need for this in autofs4 also.

What are your exact requirements? I mean, do you want to check every
lookup, or only if the lookup returned a negative dentry? Is it a
fixed set of names that you need to check or is it dynamic?

Thanks,
Miklos

2004-03-16 13:44:45

Subject: Re: unionfs

> I was just re-reading the linux-fsdevel archives from June of 2000.
> I'm guessing the reason you're not getting a response from Al is that
> he never did unionfs. He did union mounts, and made clear that they
> are not related. Union mounts would only work at the top level, the
> union would not recurse down the (union-)mounted trees.

(Reading that thread...)

OK, now I understand better. Although I can't find any code/patch for
union mount either.

> Performance-wise this could become very, very slow very quickly, but
> if we replace the vfsmount which is found by using
> mount_hashtable + hash(vfsmnt, dentry) to find what is mounted on top
> of a particular dentry with a vfsmount_stackable struct, which contains,
> say,
>
> struct vfsmount_stackable {
>         struct vfsmount *vfsmnt;
>         int mount_flags; /* 1 = read, 2 = write, 3 = hide */
>         struct vfsmount_stackable *next;
> };
>
> then perhaps it might be reasonably simple to have __follow_down and
> follow_mount make use of this structure. We make sure to keep the
> vfsmount_stackable list in order of mount priority, so that when we
> come to one of these lists, we can just do something like
>
>         while (vfsmnt_stacked) {
>                 ret = stacked_lookup(vfsmnt_stacked->vfsmnt, vfsmnt_stacked->dentry,
>                                      remaining_pathname);
>                 if (ret)
>                         return ret;
>                 vfsmnt_stacked = vfsmnt_stacked->next;
>         }
>
>         return NULL;
>
> Thoughts?

Yes, this sounds like a good way to implement a unionfs-like
functionality. Something a bit more general would be to have a
path_walk(const char *remaining_path, struct nameidata *nd) operation
of vfsmount, which if non-null would be used to perform the rest of
the lookup. This could then perform the looped lookup trials you
describe, but could be used for other special lookup methods.
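
As a rough illustration of that idea, here is a user-space sketch of an
optional per-mount hook. Everything in it is an assumption made for the
example: toy_vfsmount and toy_nameidata are stand-ins and not the kernel's
structures, and treating names that contain '#' as "special" is only a
placeholder rule.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the kernel's struct nameidata
 * or struct vfsmount. */
struct toy_nameidata {
    const char *resolved;   /* what the walk ended up pointing at */
    int         used_hook;  /* 1 if a mount-specific hook redirected it */
};

struct toy_vfsmount {
    /* Optional per-mount hook: if non-NULL it performs the rest of
     * the lookup instead of the generic walker. */
    int (*path_walk)(const char *remaining_path, struct toy_nameidata *nd);
};

/* Generic walker: in this toy it just records the path unchanged. */
static int generic_walk(const char *remaining_path, struct toy_nameidata *nd)
{
    nd->resolved = remaining_path;
    nd->used_hook = 0;
    return 0;
}

/* Example hook: redirect "special" names (here: any name containing
 * '#', an arbitrary choice) and defer everything else. */
static int special_walk(const char *remaining_path, struct toy_nameidata *nd)
{
    if (strchr(remaining_path, '#') == NULL)
        return generic_walk(remaining_path, nd);
    nd->resolved = "redirected";
    nd->used_hook = 1;
    return 0;
}

/* Dispatch: use the mount's hook when present, else the generic walk. */
int toy_path_walk(const struct toy_vfsmount *mnt, const char *remaining_path,
                  struct toy_nameidata *nd)
{
    if (mnt->path_walk)
        return mnt->path_walk(remaining_path, nd);
    return generic_walk(remaining_path, nd);
}
```

Mounts that leave the hook NULL pay only one pointer test, which is the
attraction of making the operation optional.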

Thanks for your comments,
Miklos

2004-03-16 16:07:57

Subject: Re: unionfs

Chris Friesen <[email protected]> said:
> Horst von Brand wrote:
> > Assuming one RW on top of a RO only now: What should happen when a
> > file/directory is missing from the top? If the bottom one "shows through",
> > you can't delete anything; if it doesn't, you win nothing (because you will
> > have to keep a complete copy RW on top).

> I don't see how you win nothing. I create an overlay filesystem.

Completely empty is what you get then... and you have to explicitly link in
each file. Or everything shows up here.

> I
> delete a bunch of files in the overlay and it doesn't show through.

Next time you mount it, what happens? How do you know the "top files" were
deleted, and should not show up?

What happens if I mount the live 2.6.4 kernel source over a CD containing
2.5.30? What happens to identical files, files that moved, changed files,
deleted files? Pray tell, how does the kernel find out which is which?

How do you back up a beast like this?

> All
> my other files are still links to the originals, with the

Something missing here?

In any case, there are tools that create a farm of symlinks, and when you
try to write to a file (pointing to a RO area/file), you get an error. This
gives you 90% of what you want, _without_ aggravating the filesystem
hackers.

> I would dearly love to use something like this to make it easy to track
> changes made all over a source tree. If I could sync them up at the
> beginning, then make all my changes in the overlay, then doing a diff is
> really easy since you just look for places where the inodes are
> different between the two filesystems. Like having hard links, but the
> filesystem breaks them for you when you write.

This is called BitKeeper, CVS, Subversion, arch, RCS, SCCS, ... Better yet,
it keeps the history of each file (not just the one version on RO media),
with annotations. You decide when a version is ready for archiving.

Sure, this would save disk space. But at today's prices, it just is not
worth the trouble.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-03-16 16:27:57

Subject: Re: unionfs

Jörn Engel <[email protected]> said:
> Horst von Brand <[email protected]> said:

[...]

> What looks like a promising idea for this problem and others is to
> have visible and invisible inodes. All current filesystems know only
> visible inodes. Invisible ones have no dentry linking to them
> directly, only indirectly through files/links with cow semantics.

But this is then _one_ filesystem, not a stack of them added/deleted in
random order while running. _So_ it is easy... and mostly useless.

[...]

> > IIRC, this has been discussed a couple of times before, and the consensus
> > each time was that it isn't /that hard/ to do, it is /hard or impossible/
> > to find a sensible, simple semantics for this. The idea was then dropped...

> Yeah, maybe. My personal consensus right now is that this actually
> looks very simple. Not sure how much time I will find, but it should
> definitely be finished for 2.8.

As I said: Not too hard, doable. But not sensibly. And needs to mess with
_all_ filesystems (on disk and kernel guts) if they want to someday perhaps
somewhere participate... Besides, the people asking for this mostly really
want version control, or get what they want from symlink farms.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-03-16 17:14:06

Subject: Re: unionfs

On Tue, 16 March 2004 12:18:29 -0400, Horst von Brand wrote:
> Jörn Engel <[email protected]> said:
> > Horst von Brand <[email protected]> said:
>
> > What looks like a promising idea for this problem and others is to
> > have visible and invisible inodes. All current filesystems know only
> > visible inodes. Invisible ones have no dentry linking to them
> > directly, only indirectly through files/links with cow semantics.
>
> But this is then _one_ filesystem, not a stack of them added/deleted in
> random order while running. _So_ it is easy... and mostly useless.

Maybe. I personally don't care much about links between filesystems,
but some people seem to, so there should be some use.

BTW: Why would you want to mount/umount filesystems in a stack in
random order?

> > Yeah, maybe. My personal consensus right now is that this actually
> > looks very simple. Not sure how much time I will find, but it should
> > definitely be finished for 2.8.
>
> As I said: Not too hard, doable. But not sensibly. And needs to mess with
> _all_ filesystems (on disk and kernel guts) if they want to someday perhaps
> somewhere participate... Besides, the people asking for this mostly really
> want version control, or get what they want from symlink farms.

So what? Yes, I have to tweak vfs, but mainly to save tons of memory.
Cannot imagine too many complaints against that. All filesystems that
want stacking capability have to be changed, the rest can set a couple
of pointers to NULL. Effectively it will come down to ext[23], maybe
reiser and xfs plus those fifty special purpose filesystems that never
make it into Linus' tree anyway.

And version control is something I actually want to be done inside the
kernel, at least to some degree. People already use kernel support,
although it sucks (cp -lr anyone?). Looks like the alternatives suck
even more, so your point is void.

Jörn

--
/* Keep these two variables together */
int bar;

2004-03-16 17:34:13

Subject: Re: unionfs

On Tue, 16 March 2004 12:04:30 -0400, Horst von Brand wrote:
> Chris Friesen <[email protected]> said:
>
> > I don't see how you win nothing. I create an overlay filesystem.
>
> Completely empty is what you get then... and you have to explicitly link in
> each file. Or everything shows up here.

Correct. Is that a problem?

> > delete a bunch of files in the overlay and it doesn't show through.
>
> Next time you mount it, what happens? How do you know the "top files" were
> deleted, and should not show up?
>
> What happens if I mount the live 2.6.4 kernel source over a CD containing
> 2.5.30? What happens to identical files, files that moved, changed files,
> deleted files? Pray tell, how does the kernel find out which is which?

What happens if I write to /dev/hda while having my rootfs /dev/hda1?
Bad things, damn right. But why would anyone do that?

Can you tell me what the point behind your examples is? It escapes
me.

> How do you back up a beast like this?

- Use a really large tape (stupid).
- cp /dev/... backup_medium
- Backup software with a clue about the underlying fs.

> In any case, there are tools that create a farm of symlinks, and when you
> try to write to a file (pointing to a RO area/file), you get an error. This
> gives you 90% of what you want, _without_ aggravating the filesystem
> hackers.

Great, so you found *your* solution already. I've done the same
without the need for symlinks in a 90-line patch, good enough for my
immediate needs right now. But someday I'd like to have the remaining
10% as well. :)

> > I would dearly love to use something like this to make it easy to track
> > changes made all over a source tree. If I could sync them up at the
> > beginning, then make all my changes in the overlay, then doing a diff is
> > really easy since you just look for places where the inodes are
> > different between the two filesystems. Like having hard links, but the
> > filesystem breaks them for you when you write.
>
> This is called BitKeeper, CVS, Subversion, arch, RCS, SCCS, ... Better yet,
> it keeps the history of each file (not just the one version on RO media),
> with annotations. You decide when a version is ready for archiving.
>
> Sure, this would save disk space. But at today's prices, it just is not
> worth the trouble.

Not true:
- Even with BitKeeper, people copy their complete tree before making
  changes, at least Linus says he does. Go back to start, do not
  collect $2000.
- Copying the kernel tree is not just a question of space and money,
  but also about time.
- When the time and disk hit of identical copies approaches zero,
  people will do this a lot more, they have new possibilities. *That*
  is really important, not doing the same as before, just slightly
  optimized.

Jörn

--
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein

2004-03-16 18:04:31

Subject: Re: unionfs

On Tue, 16 March 2004 18:31:47 +0100, Jörn Engel wrote:
> On Tue, 16 March 2004 12:04:30 -0400, Horst von Brand wrote:
> >
> > What happens if I mount the live 2.6.4 kernel source over a CD containing
> > 2.5.30? What happens to identical files, files that moved, changed files,
> > deleted files? Pray tell, how does the kernel find out which is which?
>
> What happens if I write to /dev/hda while having my rootfs /dev/hda1?
> Bad things, damn right. But why would anyone do that?

There is really no point to this discussion, as it looks like a big
misunderstanding. Maybe you object less if you see the design:

Variant 1 (just a single filesystem):

- Introduce a new variant of links, which I call COW.
- COWs can only link to hidden inodes.
- Hidden inodes cannot be accessed directly.
- COWs look like regular files to userspace.
- Read access to COWs goes to the hidden inode.
- Write access to COWs copies the hidden inode before writing to it.
- Copying file1 to file2 does four things:
  - Create a new hidden inode.
  - Move data from file1 to hidden inode.
  - Turn file1 into COW and link to hidden inode.
  - Create COW for file2 and link to hidden inode.
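
The rules above can be modelled in a few lines of user-space C. This is
only a sketch under simplifying assumptions that are not in the design
itself: every file is already a COW link to a reference-counted hidden
inode, writes replace the whole contents, and all names (hidden_inode,
cow_link, cow_copy, ...) are invented for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hidden inode: holds the data, reachable only through COW links. */
struct hidden_inode {
    char *data;
    int   refs;   /* number of COW links pointing here */
};

/* COW link: what userspace would see as a regular file. */
struct cow_link {
    struct hidden_inode *target;
};

static struct hidden_inode *hidden_new(const char *data)
{
    struct hidden_inode *h = malloc(sizeof(*h));

    if (!h)
        return NULL;
    h->data = malloc(strlen(data) + 1);
    if (!h->data) {
        free(h);
        return NULL;
    }
    strcpy(h->data, data);
    h->refs = 1;
    return h;
}

static void hidden_put(struct hidden_inode *h)
{
    if (--h->refs == 0) {
        free(h->data);
        free(h);
    }
}

/* Create a file: one COW link to a fresh hidden inode. */
int cow_create(struct cow_link *l, const char *data)
{
    l->target = hidden_new(data);
    return l->target ? 0 : -1;
}

/* "cp -c": the copy shares the hidden inode, no data is duplicated. */
void cow_copy(const struct cow_link *src, struct cow_link *dst)
{
    dst->target = src->target;
    dst->target->refs++;
}

/* Reads go straight to the hidden inode. */
const char *cow_read(const struct cow_link *l)
{
    return l->target->data;
}

/* Writes break the sharing first: a shared hidden inode is left alone
 * and this link gets a private one holding the new contents. */
int cow_write(struct cow_link *l, const char *data)
{
    struct hidden_inode *priv = hidden_new(data);

    if (!priv)
        return -1;
    hidden_put(l->target);   /* frees the old inode if we were the last user */
    l->target = priv;
    return 0;
}

/* "rm": drop this link; other links to the same inode are unaffected. */
void cow_unlink(struct cow_link *l)
{
    hidden_put(l->target);
    l->target = NULL;
}
```

A real implementation would copy the old data and then apply a partial
write; replacing the whole contents keeps the sketch short without
changing the sharing behaviour.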

There are some more special cases, but this is basically it. So let's
use the stuff a little:

$ cp -cr dir1 dir2

Behaves similar to 'cp -lr', but creates COWs instead of hard links.
Can take a few seconds to create the directories, but not minutes.

$ vi dir2/bunch*of*files

Writing to those files makes a real copy for each. dir1/* remains
unaffected, no matter how careless you are. We've made it foolproof,
so the universe has to create greater fools again, right?

$ rm -rf dir2

Scraps one of the copies along with all modifications. dir1/* remains
unaffected again.

$ cp -cr /fs1/dir1 /fs2/dir2

Fails, since links between different filesystems don't work.

Variant 2 (across multiple filesystems now):

- COWs contain a filesystem identifier as well.
- Accessing COWs linking to unavailable filesystems returns -E...
Alternatively:
- Mounting such an fs fails, unless all links work.

Usage is as above.

$ mkfs /dev/fs2
$ mount /dev/fs2 /fs2
$ cp -cr /fs1 /fs2

Creates an identical copy of one filesystem on another one. fs2 has
to support COWs and fs1 has to be RO or support COWs. A rw-fs mounted
ro means trouble, as you know.
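
The two policies listed for Variant 2 (fail on access, or refuse the
mount) could look roughly like the sketch below. The types and function
names are hypothetical, and since the error code is left open above,
-EIO is borrowed from the earlier 'hidden symlink' message purely as an
assumption.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical registry entry for a filesystem known to the system. */
struct toy_fs {
    int fs_id;
    int available;   /* non-zero while the filesystem is reachable */
};

/* Hypothetical cross-filesystem COW link: filesystem id plus the
 * number of the hidden inode on that filesystem. */
struct remote_cow {
    int           fs_id;
    unsigned long hidden_ino;
};

/* Find an available filesystem by id, or return NULL. */
static const struct toy_fs *find_fs(const struct toy_fs *table, size_t n,
                                    int fs_id)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].fs_id == fs_id && table[i].available)
            return &table[i];
    return NULL;
}

/* Policy 1: fail at access time if the target filesystem is gone.
 * -EIO is an assumed error code, see the note above. */
int remote_cow_access(const struct remote_cow *link,
                      const struct toy_fs *table, size_t n)
{
    return find_fs(table, n, link->fs_id) ? 0 : -EIO;
}

/* Policy 2: allow the mount only if every link resolves. */
int remote_cow_mount_ok(const struct remote_cow *links, size_t nlinks,
                        const struct toy_fs *table, size_t n)
{
    for (size_t i = 0; i < nlinks; i++)
        if (remote_cow_access(&links[i], table, n) != 0)
            return 0;
    return 1;
}
```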

Maybe I'm just stupid and missed some important detail, but this
design looks like it can solve a bunch of problems. Do you still
think it is useless?

Jörn

--
The cheapest, fastest and most reliable components of a computer
system are those that aren't there.
-- Gordon Bell, DEC laboratories

2004-03-16 19:03:55

by Andy Isaacson

Subject: Re: unionfs

Your "what are the semantics?" arguments are mysterious to me, Horst. I
don't know that unionfs is a good idea, but there are trivial solutions
to the problems you suggest. The fact that a facility can be used to
create untenable situations does not mean that the facility is useless.

On Mon, Mar 15, 2004 at 03:22:41PM -0400, Horst von Brand wrote:
> Jörn Engel <[email protected]> said:
> > On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:
> > > I don't understand the requirement properly. Sorry.
> > Depends on who you ask, but imo it boils down to this:
> > - Use one filesystem as backing store, usually ro.
> > - Have another filesystem on top for extra functionality, usually rw
> > access.
> >
> > Famous example is a rw-CDROM, where writes go to hard drive and
> > unchanged data is read from CDROM. But it makes sense for other
> > things as well.
>
> And what if the underlying filesystem is RW too?

Only the topmost layer of a "union stack" should be RW. If you manage
to write to an underlying FS, it is akin to writing to the block device
underlying a normal FS -- the behavior is undefined.

> What should happen if you unite several (>= 3) filesystems? What if
> some are RO, others RW?

Given that only the topmost is RW, it Just Works.

> What do you do if a file shows up several times, each different?

The topmost entry wins.

> Assuming one RW on top of a RO only now: What should happen when a
> file/directory is missing from the top? If the bottom one "shows through",
> you can't delete anything; if it doesn't, you win nothing (because you will
> have to keep a complete copy RW on top).

If a directory entry is missing, the next lower layer is consulted.
Delete is implemented with "white-out" directory entries -- a directory
entry in the topmost FS which has special meaning, "return -ENOENT
immediately without consulting FSs underlying me".
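
A toy version of that lookup rule, for concreteness. It is a sketch and
not the BSD code: the layer and entry types are invented, the stack is an
array ordered topmost first, and a white-out is modelled as a distinct
entry kind.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum entry_kind { ENTRY_FILE, ENTRY_WHITEOUT };

/* Hypothetical directory entry on one layer. */
struct union_entry {
    const char     *name;
    enum entry_kind kind;
    const char     *data;   /* unused for white-outs */
};

/* Hypothetical layer: a flat list of entries. */
struct union_layer {
    const struct union_entry *entries;
    size_t                    count;
};

/* Look up name in a stack ordered topmost first.
 * Returns 0 and sets *out on success, -ENOENT otherwise. */
int union_lookup(const struct union_layer *stack, size_t nlayers,
                 const char *name, const struct union_entry **out)
{
    for (size_t i = 0; i < nlayers; i++) {
        for (size_t j = 0; j < stack[i].count; j++) {
            const struct union_entry *e = &stack[i].entries[j];

            if (strcmp(e->name, name) != 0)
                continue;
            /* A white-out hides everything below it. */
            if (e->kind == ENTRY_WHITEOUT)
                return -ENOENT;
            /* The topmost real entry wins. */
            *out = e;
            return 0;
        }
        /* Not present on this layer: consult the next lower one. */
    }
    return -ENOENT;
}
```

Because the white-out is stored as an entry in the topmost filesystem, a
deletion survives an unmount and remount of the stack.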

> IIRC, this has been discussed a couple of times before, and the consensus
> each time was that it isn't /that hard/ to do, it is /hard or impossible/
> to find a sensible, simple semantics for this. The idea was then dropped...

The semantics of BSD unionfs are fairly well-defined and useful in at
least some circumstances.

References:

J. S. Pendry and M. K. McKusick. Union mounts in 4.4BSD-Lite.
In Proceedings of the USENIX Technical Conference on UNIX and Advanced
Computing Systems, pages 25-33, December 1995.
http://www.usenix.org/publications/library/proceedings/neworl/full_papers/mckusick.a

-andy

2004-03-16 19:27:56

Subject: Re: unionfs

Jörn Engel <[email protected]> said:
> On Tue, 16 March 2004 12:04:30 -0400, Horst von Brand wrote:
> > Chris Friesen <[email protected]> said:
> >
> > > I don't see how you win nothing. I create an overlay filesystem.
> >
> > Completely empty is what you get then... and you have to explicitly link in
> > each file. Or everything shows up here.
>
> Correct. Is that a problem?

Yes. Use it for a kernel tree (some 18.000 files by now), and do it each
time you mount it.

> > > delete a bunch of files in the overlay and it doesn't show through.
> >
> > Next time you mount it, what happens? How do you know the "top files" were
> > deleted, and should not show up?
> >
> > What happens if I mount the live 2.6.4 kernel source over a CD containing
> > 2.5.30? What happens to identical files, files that moved, changed files,
> > deleted files? Pray tell, how does the kernel find out which is which?

> What happens if I write to /dev/hda while having my rootfs /dev/hda1?
> Bad things, damn right. But why would anyone do that?

> Can you tell me what the point behind your examples is? It escapes
> me.

OK, let's see... I've got a laptop, on CD is the "original" kernel tree, on
HD is my modified stuff. I delete a file (or move it). Then I pack up, go
home. There I start up again. How is the fact that the file is gone recorded?

> > How do you back up a beast like this?
>
> - Use a really large tape (stupid).

All layers? Urgh...

> - cp /dev/... backup_medium
> - Backup software with a clue about the underlying fs.

Another special, non-POSIX piece that needs to be written and maintained.

> > In any case, there are tools that create a farm of symlinks, and when you
> > try to write to a file (pointing to a RO area/file), you get an error. This
> > gives you 90% of what you want, _without_ aggravating the filesystem
> > hackers.
>
> Great, so you found *your* solution already. I've done the same
> without the need for symlinks in a 90-line patch, good enough for my
> immediate needs right now. But someday I'd like to have the remaining
> 10% as well. :)

I had my solution with _no_ kernel patch. Better still, it also works on
proprietary Unix systems. Even better yet, any newbie Unix (even without any
Linux-with-funky-patch background) user understands what is going on here.
Fully POSIX compliant.

> > > I would dearly love to use something like to make it easy to track
> > > changes made all over a source tree. If I could sync them up at the
> > > beginning, then make all my changes in the overlay, then doing a diff is
> > > really easy since you just look for places where the inodes are
> > > different between the two filesystems. Like having hard links, but the
> > > filesystem breaks them for you when you write.
> >
> > This is called BitKeeper, CVS, Subversion, arch, RCS, SCCS, ... Better yet,
> > it keeps the history of each file (not just the one version on RO media),
> > with annotations. You decide when a version is ready for archiving.
> >
> > Sure, this would save disk space. But at today's prices, it just is not
> > worth the trouble.
>
> Not true:
> - Even with bitkeeper, people copy their complete tree before making
> changes, at least Linus says he does. Go back to start, do not
> collect $2000.

Not needed at all. Sure, if you have enough disk... now hand over the $2000

> - Copying the kernel tree is not just a question of space and money,
> but also about time.

And both copies slowly diverge, and need to be synchronized sometime. You
owe me another $2000

> - When the time and disk hit of identical copies approaches zero,
> people will do this a lot more, they have new possibilities. *That*
> is really important, not doing the same as before, just slightly
> optimized.

What you are talking about is some kind of (modifiable) disk cache of data on
ro media...

> --
> Everything should be made as simple as possible, but not simpler.
> -- Albert Einstein

This stuff definitely fails this, IMHO.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-03-16 19:41:44

[permalink] [raw]

Subject: Re: unionfs

Jörn Engel <[email protected]> said:
> On Tue, 16 March 2004 18:31:47 +0100, Jörn Engel wrote:
> > On Tue, 16 March 2004 12:04:30 -0400, Horst von Brand wrote:
> > >
> > > What happens if I mount the live 2.6.4 kernel source over a CD containing
> > > 2.5.30? What happens to identical files, files that moved, changed files,
> > > deleted files? Pray tell, how does the kernel find out which is which?
> >
> > What happens if I write to /dev/hda while having my rootfs /dev/hda1?
> > Bad things, damn right. But why would anyone do that?
>
> There is really no point to this discussion, as it looks like a big
> misunderstanding. Maybe you object less if you see the design:
>
> Variant 1 (just a single filesystem):
>
> - Introduce a new variant of links, which I call COW.
> - COWs can only link to hidden inodes.
> - Hidden inodes cannot be accessed directly.
> - COWs look like regular files to userspace.
> - Read access to COWs goes to the hidden inode.
> - Write access to COWs copies the hidden inode before writing to it.
> - Copying file1 to file2 does four things:
> - Create a new hidden inode.
> - Move data from file1 to hidden inode.
> - Turn file1 into COW and link to hidden inode.
> - Create COW for file2 and link to hidden inode.

Looks an awful lot like symlinks...

> There are some more special cases, but this is basically it. So let's
> use the stuff a little:
>
> $ cp -cr dir1 dir2
>
> Behaves similar to 'cp -lr', but creates COWs instead of hard links.
> Can take a few seconds to create the directories, but not minutes.

Why does it magically take less time? The work done (recursing, fiddling
with directories, syscalls, ...) is nearly the same.

> $ vi dir2/bunch*of*files
>
> Writing to those files makes a real copy for each. dir1/* remains
> unaffected, no matter how careless you are. We've made it foolproof,
> so the universe has to create greater fools again, right?

Just do it again, thinking _these_ versions are the ones safe from
fat-fingering...

> $ rm -rf dir2
>
> Scraps one of the copies along with all modifications. dir1/* remains
> unaffected again.
>
> $ cp -cr /fs1/dir1 /fs2/dir2
>
> Fails, since links between different filesystems don't work.

Why?

How do you push changes back if needed? How do you get the version 3
changes back? Oops, can't do...

You do want version control.

> Variant 2 (across multiple filesystems now):
>
> - COWs contain a filesystem identifier as well.
> - Accessing COWs linking to unavailable filesystems returns -E...
> Alternatively:
> - Mounting such an fs fails, unless all links work.
>
> Usage is as above.
>
> $ mkfs /dev/fs2
> $ mount /dev/fs2 /fs2
> $ cp -cr /fs1 /fs2
>
> Creates an identical copy of one filesystem on another one. fs2 has
> to support COWs and fs1 has to be RO or support COWs. A rw-fs mounted
> ro means trouble, as you know.
>
This is just one filesystem. Where are the others?

> Maybe I'm just stupid and missed some important detail, but this
> design looks like it can solve a bunch of problems. Do you still
> think it is useless?

I still think there are much better solutions to the problems you mention.

What I'd love to see is something like, say /usr for each package (complete
with binaries in /usr/bin/, manuals in /usr/share/man/, ...) that you can
mount together in any combination wanted (even per-user, a la Plan 9) over
/usr and have it fully populated. But that gets horribly messy when you
want files from different versions (say, I've got an overlay for vi(1) that
fixes a horrible bug, but the manual is in Czech, and I prefer English) to
show up on top, or have files one on top of the other (source code
versions?) and you delete/modify/move the top one. What happens then? If
you can't come up with a sensible interpretation of Unix file operations in
this scenario (and you can't, trust me), the idea is doomed.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-03-16 20:01:08

by Chris Friesen

[permalink] [raw]

Subject: Re: unionfs

Horst von Brand wrote:

> OK, let's see... I've got a laptop, on CD is the "original" kernel tree, on
> HD is my modified stuff. I delete a file (or move it). Then I pack up, go
> home. There I start up again. How is the fact that the file is gone recorded?

If I recall, directory inodes list the files in that directory.
Presumably you had to write to the directory inode to do the delete.
The new version of the directory is now stored on the HD. It lists the
original files on the CD, minus the one that was deleted.

Chris

--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2004-03-16 20:46:07

[permalink] [raw]

Subject: Re: unionfs

On Tue, 16 March 2004 15:40:24 -0400, Horst von Brand wrote:
> >
> > $ cp -cr dir1 dir2
> >
> > Behaves similar to 'cp -lr', but creates COWs instead of hard links.
> > Can take a few seconds to create the directories, but not minutes.
>
> Why does it magically take less time? The work done (recursing, fiddling
> with directories, syscalls, ...) is nearly the same.

joern@limerick:~$ time cp -lr /usr/src/linux/ /tmp/linux

real 0m22.356s
user 0m0.167s
sys 0m1.480s
joern@limerick:~$ rm -r /tmp/linux/
joern@limerick:~$ time cp -r /usr/src/linux/ /tmp/linux

real 1m44.147s
user 0m0.499s
sys 0m7.987s

'nuf said, eot.
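
For anyone who wants to repeat the comparison on their own tree, here is a rough Python equivalent of the two `cp` runs above. It is a sketch, not a benchmark harness: the function names are made up for this example, `scratch_parent` is assumed to be a writable directory on the same filesystem as the source (hard links cannot cross filesystems), and the numbers will depend heavily on what is already in the page cache.

```python
import os
import shutil
import tempfile
import time


def timed_copytree(src, dst, copy_function):
    """Copy the tree at src to dst and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    # symlinks=True recreates symlinks as symlinks instead of following them.
    shutil.copytree(src, dst, symlinks=True, copy_function=copy_function)
    return time.perf_counter() - start


def compare(src, scratch_parent):
    """Return (link_seconds, copy_seconds) for one source tree."""
    with tempfile.TemporaryDirectory(dir=scratch_parent) as scratch:
        # os.link makes each file a hard link, like 'cp -lr'.
        linked = timed_copytree(src, os.path.join(scratch, "linked"), os.link)
        # shutil.copy2 copies data and metadata, like a plain 'cp -r'.
        copied = timed_copytree(src, os.path.join(scratch, "copied"), shutil.copy2)
    return linked, copied
```

Both runs still have to walk the tree and create every directory; the difference is that the hard-link run never reads or writes file data.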

Jörn

--
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham

2004-03-19 09:04:47

by Jamie Lokier

[permalink] [raw]

Subject: Re: unionfs

Horst von Brand wrote:
> Besides, the people asking for this mostly really
> want version control, or get what they want from symlink farms.

No. Version control does not address the requirement: to have 30
checked out kernel trees, each with compiled images, because you're
actually working on 30 trees, sharing files to save time and space,
and normal shell commands in each directory not accidentally affecting
the others.

I have not heard of any version control system which offers that.
Perhaps one based around a virtual filesystem could.

Symlink farms do not solve it either. They have the same problem as
hard links: namely, it is too easy to accidentally modify a file in
one tree while intending to modify only in another, plus they
introduce a whole bunch of other problems.

This idea of COW links is to solve one quite specific problem:
creating the illusion that large trees are copied and independent, so
that editors and compilers and makefiles and so on affect them
independently, while doing so fast and small, and allowing programs
which compare files (such as version control and diff) to know when
two files' contents are identical efficiently.
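
A toy in-memory model may make the intended behaviour concrete. It assumes the hidden-inode design outlined earlier in the thread, simplified so that every file is a COW link from the moment it is created; the class `CowFS` and its method names are invented for this sketch and do not correspond to any kernel code.

```python
import itertools


class CowFS:
    """Toy model of COW links pointing at hidden inodes."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.hidden = {}  # hidden inode number -> bytes; never named directly
        self.names = {}   # file name -> hidden inode number (the COW link)

    def create(self, name, data):
        ino = next(self._ids)
        self.hidden[ino] = bytes(data)
        self.names[name] = ino

    def copy(self, src, dst):
        # Both names now link to the same hidden inode; no data is copied.
        self.names[dst] = self.names[src]

    def read(self, name):
        return self.hidden[self.names[name]]

    def append(self, name, data):
        ino = self.names[name]
        if list(self.names.values()).count(ino) > 1:
            # Shared inode: copy it before writing, then relink this name.
            new = next(self._ids)
            self.hidden[new] = self.hidden[ino]
            self.names[name] = ino = new
        self.hidden[ino] += bytes(data)

    def unlink(self, name):
        ino = self.names.pop(name)
        if ino not in self.names.values():
            del self.hidden[ino]  # last link gone, free the data

    def known_identical(self, a, b):
        # Same hidden inode proves equal contents without reading any data.
        return self.names[a] == self.names[b]
```

Note that `known_identical` is one-sided: two names sharing a hidden inode are certainly identical, while two names with different inodes may or may not have equal contents and would still need a real comparison.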

-- Jamie

2004-03-19 09:11:54

by Jamie Lokier

[permalink] [raw]

Subject: Re: unionfs

Jörn Engel wrote:
> And version control is something I actually want to be done inside the
> kernel, at least to some degree. People already use kernel support,
> although it sucks (cp -lr anyone?). Looks like the alternatives suck
> even more, so your point is void.

Fwiw, I much prefer your COW hard links to something where I have to
mount a new filesystem every time I "copy" a tree, and have to redo
those mounts each time I reboot, have a big ugly mess in "df" output,
where "du" gets confused, and "rsync" has no hope of dealing with them
sensibly.

I also don't mind if copying isn't implemented in the kernel. I'm ok
with programs reporting an error that they couldn't write to a file
because it was linked readonly. At least that removes the danger of
accidental overwriting, and I can either fix it by hand or use an
LD_PRELOAD library which detects that error code from open() and
copies the file.
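
The userspace half could look roughly like the following sketch. The read-only link attribute and its error code do not exist yet, so this uses the hard link count (`st_nlink`) as a stand-in trigger, and it is written as a Python wrapper rather than an LD_PRELOAD shim around open(). The names `break_link` and `open_for_write` are invented for this example.

```python
import os
import shutil
import tempfile


def break_link(path):
    """Give path a private copy if it shares its inode with other names."""
    if os.stat(path).st_nlink < 2:
        return False  # not shared, nothing to do
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        shutil.copy2(path, tmp)  # copy data and metadata to a new inode
        os.replace(tmp, path)    # atomically swap the copy into place
    except BaseException:
        os.unlink(tmp)           # do not leave the temporary file behind
        raise
    return True


def open_for_write(path, mode="r+b"):
    """Open path for writing, breaking any hard link first."""
    break_link(path)
    return open(path, mode)
```

The copy is created in the same directory as the original so that `os.replace` stays on one filesystem and the swap is atomic; the other names keep pointing at the untouched original inode.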

Even if vi and Emacs, which make it temptingly easy to ignore normal
read-only protection, were changed to be aware of and bypass the
read-only link attribute, they'd do the right thing: the attribute
expresses the _intent_ that removing it should always be done by
copying the file, whereas with hard links that intent isn't clear.
(Emacs has backup-by-copying-when-linked, but that isn't too helpful
because sometimes you want writing to a linked file to change both places).

So my vote is for the very simple COW hard link attribute, and leave
the rest to userspace.

Thanks!
-- Jamie

2004-03-19 09:43:05