Hi Al,
I'm interested in any (even if very incomplete) work you did on
unionfs. But I can only find an empty directory at
ftp://ftp.math.psu.edu/pub/viro/ linked from old kernel status pages.
Can you point me to any information regarding this?
Thanks,
Miklos
On Monday 08 March 2004 10:52, Miklos Szeredi wrote:
> Hi Al,
>
> I'm interested in any (even if very incomplete) work you did on
> unionfs. But I can only find an empty directory at
> ftp://ftp.math.psu.edu/pub/viro/ linked from old kernel status pages.
>
> Can you point me to any information regarding this?
>
> Thanks,
> Miklos
Hi,
I have never tested it, but from other reports I read that it works.
http://translucency.sourceforge.net/
Cheers,
Bernd
> I have never tested it, but from other reports I read that it works.
>
> http://translucency.sourceforge.net/
Thanks, but I'm not looking for a working unionfs, rather for design
ideas for doing this sort of 'layered' filesystem _properly_, not
with hacks like system call table modification (as the above
solution does).
Al, can you please give some guidance? Do you have any code for a
unionfs design, or only ideas, or was this just a myth?
Miklos
> If you get a response from Al, could you let me know?
OK.
> I've been wondering about this myself, and beyond simple
> coolness/usefulness, we may also need the unionfs for MLS
> polyinstantiation.
>
> If you don't hear from Al, please let me know whether you plan to tackle
> it yourself or not.
I'll have to, as this is needed for AVFS. Not unionfs, but something
similar, that allows file/directory lookups for special filenames to
be redirected to another filesystem.
For the time being I plan to go the easy way and create a filesystem
specifically for AVFS that doesn't need modifications to the kernel.
This will be inefficient in a number of ways: it doubles the memory used
by the dentry/inode caches and deepens the call chains for all filesystem
operations (even those that require no intervention).
The next step will be to try to optimize and generalize it to be
usable for other filesystems like unionfs. I'd really love to have
some input about this from Al or anybody who has any ideas.
Miklos
On Mon, Mar 08, 2004 at 10:52:54AM +0100, Miklos Szeredi wrote:
> Hi Al,
>
> I'm interested in any (even if very incomplete) work you did on
> unionfs. But I can only find an empty directory at
> ftp://ftp.math.psu.edu/pub/viro/ linked from old kernel status pages.
> Can you point me to any information regarding this?
FWIW, have a look at http://vserver.13thfloor.at/TBVFS/
HTH,
Herbert
> Thanks,
> Miklos
On Thu, 11 Mar 2004, Miklos Szeredi wrote:
>
> I'll have to, as this is needed for AVFS. Not unionfs, but something
> similar, that allows file/directory lookups for special filenames to
> be redirected to another filesystem.
I have a need for this in autofs4 also.
Ian
Ian Kent <[email protected]> said:
> On Thu, 11 Mar 2004, Miklos Szeredi wrote:
>
> >
> > I'll have to, as this is needed for AVFS. Not unionfs, but something
> > similar, that allows file/directory lookups for special filenames to
> > be redirected to another filesystem.
>
> I have a need for this in autofs4 also.
At least some Unices have context-dependent symlinks, and AFAIR there was
something like this in Linux a _long_ while back (perhaps just in Red Hat,
must have been the second half of the '90s). It was discarded as too much
mess (in kernel, in userspace, and in wetware) for little gain, IIRC.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
Herbert Poetzl wrote:
>FWIW, have a look at http://vserver.13thfloor.at/TBVFS
I really do think this problem needs to be solved a different way: BSD-style
union mount in VFS, no redirecting filesystem.
I am planning to work on that during the 2.7 series. I do hope I will be able
to write code clean enough for inclusion, let's see...
On Mon, 15 March 2004 12:35:25 +0100, Carsten Otte wrote:
> Herbert Poetzl wrote:
> >FWIW, have a look at http://vserver.13thfloor.at/TBVFS
> I do really think this problem needs to be solved a different way: BSD-style
> union mount in VFS, no redirecting filesystem.
> I am planning to work on that during the 2.7. series. I do hope I will be able
> to write code clean enough for inclusion, let's see...
You could also have some sort of 'hidden symlink', i.e. something that
behaves just like a file but is in fact a link to some other
filesystem. If that other filesystem is not accessible, all
operations return -EIO.
Not sure if this is a sane solution, but it would make my cow-stuff
work across filesystems as well.
Jörn
--
Schrödinger's cat is <BLINK>not</BLINK> dead.
-- Illiad
On Mon, 15 Mar 2004, Jörn Engel wrote:
> On Mon, 15 March 2004 12:35:25 +0100, Carsten Otte wrote:
> > Herbert Poetzl wrote:
> > >FWIW, have a look at http://vserver.13thfloor.at/TBVFS
> > I do really think this problem needs to be solved a different way: BSD-style
> > union mount in VFS, no redirecting filesystem.
> > I am planning to work on that during the 2.7. series. I do hope I will be able
> > to write code clean enough for inclusion, let's see...
>
> You could also have some sort of 'hidden symlink', i.e. something that
> behaves just like a file but is in fact a link to some other
> filesystem. If that other filesystem is not accessible, all
> operations return -EIO.
Sounds a bit untidy.
Has anyone checked http://www.filesystems.org/
What do you think?
Ian
On Mon, 15 March 2004 20:47:05 +0800, Ian Kent wrote:
> On Mon, 15 Mar 2004, Jörn Engel wrote:
> >
> > You could also have some sort of 'hidden symlink', i.e. something that
> > behaves just like a file but is in fact a link to some other
> > filesystem. If that other filesystem is not accessible, all
> > operations return -EIO.
>
> Sounds a bit untidy.
If you have a cleaner idea, I'm open for suggestions.
> Has anyone checked http://www.filesystems.org/
>
> What do you think?
Looks like an abstraction layer that still assumes a 1:1 mapping
between filesystems and devices, so it doesn't help. Did I miss
something?
Jörn
--
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong
On Mon, 15 Mar 2004, Jörn Engel wrote:
> On Mon, 15 March 2004 20:47:05 +0800, Ian Kent wrote:
> > On Mon, 15 Mar 2004, Jörn Engel wrote:
> > >
> > > You could also have some sort of 'hidden symlink', i.e. something that
> > > behaves just like a file but is in fact a link to some other
> > > filesystem. If that other filesystem is not accessible, all
> > > operations return -EIO.
> >
> > Sounds a bit untidy.
>
> If you have a cleaner idea, I'm open for suggestions.
>
> > Has anyone checked http://www.filesystems.org/
> >
> > What do you think?
>
> Looks like an abstraction layer that still assumes a 1:1 mapping
> between filesystems and devices, so it doesn't help. Did I miss
> something?
I don't understand the requirement properly. Sorry.
Ian
On Mon, 15 Mar 2004, Jörn Engel wrote:
> On Mon, 15 March 2004 20:47:05 +0800, Ian Kent wrote:
> > On Mon, 15 Mar 2004, Jörn Engel wrote:
> > >
> > > You could also have some sort of 'hidden symlink', i.e. something that
> > > behaves just like a file but is in fact a link to some other
> > > filesystem. If that other filesystem is not accessible, all
> > > operations return -EIO.
> >
> > Sounds a bit untidy.
>
> If you have a cleaner idea, I'm open for suggestions.
>
> > Has anyone checked http://www.filesystems.org/
> >
> > What do you think?
>
> Looks like an abstraction layer that still assumes a 1:1 mapping
> between filesystems and devices, so it doesn't help. Did I miss
> something?
There was talk on the mailing list that they were close to releasing a
unionfs filesystem for their fistgen generator. But it has been a while
and still nothing.
Ian
On Mon, 15 March 2004 14:16:01 +0100, Jörn Engel wrote:
> On Mon, 15 March 2004 20:47:05 +0800, Ian Kent wrote:
> > On Mon, 15 Mar 2004, Jörn Engel wrote:
> > >
> > > You could also have some sort of 'hidden symlink', i.e. something that
> > > behaves just like a file but is in fact a link to some other
> > > filesystem. If that other filesystem is not accessible, all
> > > operations return -EIO.
> >
> > Sounds a bit untidy.
>
> If you have a cleaner idea, I'm open for suggestions.
Stupid me. Simply open a file in the cached fs, so it can't be
unmounted.
Jörn
--
Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.
-- Rob Pike
On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:
>
> I don't understand the requirement properly. Sorry.
Depends on who you ask, but imo it boils down to this:
- Use one filesystem as backing store, usually ro.
- Have another filesystem on top for extra functionality, usually rw
access.
Famous example is a rw-CDROM, where writes go to hard drive and
unchanged data is read from CDROM. But it makes sense for other
things as well.
Jörn
--
Data expands to fill the space available for storage.
-- Parkinson's Law
On Monday 15 March 2004 16:13, Jörn Engel wrote:
> On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:
> > I don't understand the requirement properly. Sorry.
>
> Depends on who you ask, but imo it boils down to this:
> - Use one filesystem as backing store, usually ro.
> - Have another filesystem on top for extra functionality, usually rw
> access.
>
> Famous example is a rw-CDROM, where writes go to hard drive and
> unchanged data is read from CDROM. But it makes sense for other
> things as well.
If I understand correctly, this unionfs feature would also be the cleanest
way of changing the root filesystem after using an initrd ramdisk on boot.
Currently the pivot_root call is used to change root, but that still implies a
bit of a hack. You can read about it in this fine paper:
http://www.cis.udel.edu/~zhi/www.docshow.net/linux/ols.zip
It's also a good read if you want to understand the Linux bootloaders and
the boot process in general.
Regards
Claudio
Jörn Engel <[email protected]> said:
> On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:
> > I don't understand the requirement properly. Sorry.
> Depends on who you ask, but imo it boils down to this:
> - Use one filesystem as backing store, usually ro.
> - Have another filesystem on top for extra functionality, usually rw
> access.
>
> Famous example is a rw-CDROM, where writes go to hard drive and
> unchanged data is read from CDROM. But it makes sense for other
> things as well.
And what if the underlying filesystem is RW too? What should happen if you
unite several (>= 3) filesystems? What if some are RO, others RW? What do
you do if a file shows up several times, each different?
Assuming one RW on top of a RO only now: What should happen when a
file/directory is missing from the top? If the bottom one "shows through",
you can't delete anything; if it doesn't, you win nothing (because you will
have to keep a complete copy RW on top).
IIRC, this has been discussed a couple of times before, and the consensus
each time was that it isn't /that hard/ to do, but it is /hard or impossible/
to find sensible, simple semantics for this. The idea was then dropped...
Horst von Brand wrote:
> Assuming one RW on top of a RO only now: What should happen when a
> file/directory is missing from the top? If the bottom one "shows through",
> you can't delete anything; if it doesn't, you win nothing (because you will
> have to keep a complete copy RW on top).
I don't see how you win nothing. I create an overlay filesystem. I
delete a bunch of files in the overlay and it doesn't show through. All
my other files are still links to the originals, with the
I would dearly love to use something like this to make it easy to track
changes made all over a source tree. If I could sync them up at the
beginning, then make all my changes in the overlay, then doing a diff is
really easy since you just look for places where the inodes are
different between the two filesystems. Like having hard links, but the
filesystem breaks them for you when you write.
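To make the inode comparison concrete, here is a minimal userspace sketch.
It assumes the overlay would expose matching device and inode numbers for
unmodified files, the way a 'cp -l' tree does today; same_inode() is a
made-up helper name, not an existing tool.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if both paths resolve to the same inode
 * on the same device (file untouched), 0 if they differ (file was
 * modified or copied), and -1 if either path cannot be stat()ed.
 */
int same_inode(const char *path_a, const char *path_b)
{
	struct stat a, b;

	assert(path_a != NULL && path_b != NULL);

	if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
		return -1;

	/* An inode number is only meaningful together with its device. */
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

A diff tool would walk both trees and only compare file contents where
same_inode() returns 0.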
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
On Mon, 15 March 2004 15:22:41 -0400, Horst von Brand wrote:
> Jörn Engel <[email protected]> said:
> > On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:
>
> > > I don't understand the requirement properly. Sorry.
>
> > Depends on who you ask, but imo it boils down to this:
> > - Use one filesystem as backing store, usually ro.
> > - Have another filesystem on top for extra functionality, usually rw
> > access.
> >
> > Famous example is a rw-CDROM, where writes go to hard drive and
> > unchanged data is read from CDROM. But it makes sense for other
> > things as well.
>
> And what if the underlying filesystem is RW too? What should happen if you
> unite several (>= 3) filesystems? What if some are RO, others RW? What do
> you do if a file shows up several times, each different?
>
> Assuming one RW on top of a RO only now: What should happen when a
> file/directory is missing from the top? If the bottom one "shows through",
> you can't delete anything; if it doesn't, you win nothing (because you will
> have to keep a complete copy RW on top).
What looks like a promising idea for this problem and others is to
have visible and invisible inodes. All current filesystems know only
visible inodes. Invisible ones have no dentry linking to them
directly, only indirectly through files/links with cow semantics.
Ok, when the underlying filesystem is rw, each file linked from the
caching fs has to be broken up into visible and invisible inodes. The
visible link from both filesystems is to the invisible inode and
writes to either one have to cow.
Three or more filesystems? No problem, same as above.
Mixed ro and rw? No problem, same as above.
Files "showing through"? Doesn't happen if you do the equivalent of
"cp -l" - directories are copied, files are linked.
Solves all of your problems so far. Do you have more?
> IIRC, this has been discussed a couple of times before, and the consensus
> each time was that it isn't /that hard/ to do, it is /hard or impossible/
> to find a sensible, simple semantics for this. The idea was then dropped...
Yeah, maybe. My personal consensus right now is that this actually
looks very simple. Not sure how much time I will find, but it should
definitely be finished for 2.8.
Jörn
--
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham
> > I'll have to, as this is needed for AVFS. Not unionfs, but something
> > similar, that allows file/directory lookups for special filenames to
> > be redirected to another filesystem.
>
> I have a need for this in autofs4 also.
What are your exact requirements? I mean, do you want to check every
lookup, or only if the lookup returned a negative dentry? Is it a
fixed set of names that you need to check or is it dynamic?
Thanks,
Miklos
> I was just re-reading the linux-fsdevel archives from june of 2000.
> I'm guessing the reason you're not getting a response from Al is that
> he never did unionfs. He did union mounts, and made clear that they
> are not related. Union mounts would only work at the top level, the
> union would not recurse down the (union-)mounted trees.
(Reading that thread...)
OK, now I understand better. Although I can't find any code/patch for
union mount either.
> Performance-wise this could become very, very slow very quickly, but
> if we replace the vfsmount which is found by using
> mount_hashtable + hash(vfsmnt, dentry) to find what is mounted on top
> of a particular dentry with a vfsmount_stackable struct, which contains,
> say,
>
> struct vfsmount_stackable {
>         struct vfsmount *vfsmnt;
>         struct dentry *dentry;  /* root dentry the lookup starts from */
>         int mount_flags;        /* 1 = read, 2 = write, 3 = hide */
>         struct vfsmount_stackable *next;
> };
>
> then perhaps it might be reasonably simple to have __follow_down and
> follow_mount make use of this structure. We make sure to keep the
> vfsmount_stackable list in order of mount priority, so that when we
> come to one of these lists, we can just do something like
>
> /* Try each stacked mount in priority order; the first hit wins. */
> while (vfsmnt_stacked) {
>         ret = stacked_lookup(vfsmnt_stacked->vfsmnt,
>                              vfsmnt_stacked->dentry,
>                              remaining_pathname);
>         if (ret)
>                 return ret;
>         vfsmnt_stacked = vfsmnt_stacked->next;
> }
>
> return NULL;
>
> Thoughts?
Yes, this sounds like a good way to implement a unionfs-like
functionality. Something a bit more general would be to have a
path_walk(const char *remaining_path, struct nameidata *nd) operation
of vfsmount, which if non-null would be used to perform the rest of
the lookup. This could then perform the looped lookup trials you
describe, but could be used for other special lookup methods.
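A rough userspace model of that idea, just to show the control flow. All
names here (struct mnt_model, walk_rest, union_path_walk, the demo_*
fixtures) are invented for illustration and are not kernel interfaces;
real code would work on struct vfsmount and struct nameidata.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a mount; not the kernel's struct vfsmount. */
struct mnt_model {
	const char *label;                     /* returned on a successful lookup */
	const char *const *names;              /* NULL-terminated paths, may be NULL */
	const struct mnt_model *const *layers; /* NULL-terminated, highest priority first */
	/* Optional hook: if non-NULL it performs the rest of the lookup. */
	const char *(*path_walk)(const struct mnt_model *mnt,
				 const char *remaining_path);
};

/* Default behaviour for mounts that have no path_walk hook. */
const char *default_lookup(const struct mnt_model *mnt,
			   const char *remaining_path)
{
	if (mnt->names == NULL)
		return NULL;
	for (size_t i = 0; mnt->names[i] != NULL; i++)
		if (strcmp(mnt->names[i], remaining_path) == 0)
			return mnt->label;
	return NULL;
}

/* Generic entry point: use the hook when present, else the default. */
const char *walk_rest(const struct mnt_model *mnt, const char *remaining_path)
{
	assert(mnt != NULL && remaining_path != NULL);
	if (mnt->path_walk != NULL)
		return mnt->path_walk(mnt, remaining_path);
	return default_lookup(mnt, remaining_path);
}

/* Union-style hook: try each layer in priority order, first hit wins. */
const char *union_path_walk(const struct mnt_model *mnt,
			    const char *remaining_path)
{
	if (mnt->layers == NULL)
		return NULL;
	for (size_t i = 0; mnt->layers[i] != NULL; i++) {
		const char *hit = walk_rest(mnt->layers[i], remaining_path);
		if (hit != NULL)
			return hit;
	}
	return NULL;
}

/* Made-up fixtures: a disk layer stacked over a CD-ROM layer. */
const char *const demo_disk_names[] = { "Makefile", NULL };
const char *const demo_cdrom_names[] = { "Makefile", "README", NULL };
const struct mnt_model demo_disk = { "disk", demo_disk_names, NULL, NULL };
const struct mnt_model demo_cdrom = { "cdrom", demo_cdrom_names, NULL, NULL };
const struct mnt_model *const demo_layers[] = { &demo_disk, &demo_cdrom, NULL };
const struct mnt_model demo_union = { "union", NULL, demo_layers, union_path_walk };
```

With these fixtures, walk_rest(&demo_union, "README") falls through to the
cdrom layer, while "Makefile" is answered by the disk layer on top. Other
special lookup methods would simply install a different hook.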
Thanks for your comments,
Miklos
Chris Friesen <[email protected]> said:
> Horst von Brand wrote:
> > Assuming one RW on top of a RO only now: What should happen when a
> > file/directory is missing from the top? If the bottom one "shows through",
> > you can't delete anything; if it doesn't, you win nothing (because you will
> > have to keep a complete copy RW on top).
> I don't see how you win nothing. I create an overlay filesystem.
Completely empty is what you get then... and you have to explicitly link in
each file. Or everything shows up here.
> I
> delete a bunch of files in the overlay and it doesn't show through.
Next time you mount it, what happens? How do you know the "top files" were
deleted, and should not show up?
What happens if I mount the live 2.6.4 kernel source over a CD containing
2.5.30? What happens to identical files, files that moved, changed files,
deleted files? Pray tell, how does the kernel find out which is which?
How do you back up a beast like this?
> All
> my other files are still links to the originals, with the
Something missing here?
In any case, there are tools that create a farm of symlinks, and when you
try to write to a file (pointing to a RO area/file), you get an error. This
gives you 90% of what you want, _without_ aggravating the filesystem
hackers.
> I would dearly love to use something like this to make it easy to track
> changes made all over a source tree. If I could sync them up at the
> beginning, then make all my changes in the overlay, then doing a diff is
> really easy since you just look for places where the inodes are
> different between the two filesystems. Like having hard links, but the
> filesystem breaks them for you when you write.
This is called BitKeeper, CVS, Subversion, arch, RCS, SCCS, ... Better yet,
it keeps the history of each file (not just the one version on RO media),
with annotations. You decide when a version is ready for archiving.
Sure, this would save disk space. But at today's prices, it just is not
worth the trouble.
Jörn Engel <[email protected]> said:
> Horst von Brand <[email protected]> said:
[...]
> What looks like a promising idea for this problem and others is to
> have visible and invisible inodes. All current filesystems know only
> visible inodes. Invisible ones have no dentry linking to them
> directly, only indirectly through files/links with cow semantics.
But this is then _one_ filesystem, not a stack of them added/deleted in
random order while running. _So_ it is easy... and mostly useless.
[...]
> > IIRC, this has been discussed a couple of times before, and the consensus
> > each time was that it isn't /that hard/ to do, it is /hard or impossible/
> > to find a sensible, simple semantics for this. The idea was then dropped...
> Yeah, maybe. My personal consensus right now is that this actually
> looks very simple. Not sure how much time I will find, but it should
> definitely be finished for 2.8.
As I said: Not too hard, doable. But not sensibly. And needs to mess with
_all_ filesystems (on disk and kernel guts) if they want to someday perhaps
somewhere participate... Besides, the people asking for this mostly really
want version control, or get what they want from symlink farms.
On Tue, 16 March 2004 12:18:29 -0400, Horst von Brand wrote:
> Jörn Engel <[email protected]> said:
> > Horst von Brand <[email protected]> said:
>
> > What looks like a promising idea for this problem and others is to
> > have visible and invisible inodes. All current filesystems know only
> > visible inodes. Invisible ones have no dentry linking to them
> > directly, only indirectly through files/links with cow semantics.
>
> But this is then _one_ filesystem, not a stack of them added/deleted in
> random order while running. _So_ it is easy... and mostly useless.
Maybe. I personally don't care much about links between filesystems,
but some people seem to, so there should be some use.
BTW: Why would you want to mount/umount filesystems in a stack in
random order?
> > Yeah, maybe. My personal consensus right now is that this actually
> > looks very simple. Not sure how much time I will find, but it should
> > definitely be finished for 2.8.
>
> As I said: Not too hard, doable. But not sensibly. And needs to mess with
> _all_ filesystems (on disk and kernel guts) if they want to someday perhaps
> somewhere participate... Besides, the people asking for this mostly really
> want version control, or get what they want from symlink farms.
So what? Yes, I have to tweak vfs, but mainly to save tons of memory.
Cannot imagine too many complaints against that. All filesystems that
want stacking capability have to be changed, the rest can set a couple
of pointers to NULL. Effectively it will come down to ext[23], maybe
reiser and xfs plus those fifty special purpose filesystems that never
make it into Linus' tree anyway.
And version control is something I actually want to be done inside the
kernel, at least to some degree. People already use kernel support,
although it sucks (cp -lr anyone?). Looks like the alternatives suck
even more, so your point is void.
Jörn
--
/* Keep these two variables together */
int bar;
On Tue, 16 March 2004 12:04:30 -0400, Horst von Brand wrote:
> Chris Friesen <[email protected]> said:
>
> > I don't see how you win nothing. I create an overlay filesystem.
>
> Completely empty is what you get then... and you have to explicitly link in
> each file. Or everything shows up here.
Correct. Is that a problem?
> > delete a bunch of files in the overlay and it doesn't show through.
>
> Next time you mount it, what happens? How do you know the "top files" were
> deleted, and should not show up?
>
> What happens if I mount the live 2.6.4 kernel source over a CD containing
> 2.5.30? What happens to identical files, files that moved, changed files,
> deleted files? Pray tell, how does the kernel find out which is which?
What happens if I write to /dev/hda while having my rootfs /dev/hda1?
Bad things, damn right. But why would anyone do that?
Can you tell me what the point behind your examples is? It escapes
me.
> How do you back up a beast like this?
- Use a really large tape (stupid).
- cp /dev/... backup_medium
- Backup software with a clue about the underlying fs.
> In any case, there are tools that create a farm of symlinks, and when you
> try to write to a file (pointing to a RO area/file), you get an error. This
> gives you 90% of what you want, _without_ aggravating the filesystem
> hackers.
Great, so you found *your* solution already. I've done the same
without the need for symlinks in a 90-line patch, good enough for my
immediate needs right now. But someday I'd like to have the remaining
10% as well. :)
> > I would dearly love to use something like this to make it easy to track
> > changes made all over a source tree. If I could sync them up at the
> > beginning, then make all my changes in the overlay, then doing a diff is
> > really easy since you just look for places where the inodes are
> > different between the two filesystems. Like having hard links, but the
> > filesystem breaks them for you when you write.
>
> This is called BitKeeper, CVS, Subversion, arch, RCS, SCCS, ... Better yet,
> it keeps the history of each file (not just the one version on RO media),
> with annotations. You decide when a version is ready for archiving.
>
> Sure, this would save disk space. But at today's prices, it just is not
> worth the trouble.
Not true:
- Even with bitkeeper, people copy their complete tree before making
changes, at least Linus says he does. Go back to start, do not
collect $2000.
- Copying the kernel tree is not just a question of space and money,
but also about time.
- When the time and disk hit of identical copies approaches zero,
people will do this a lot more, they have new possibilities. *That*
is really important, not doing the same as before, just slightly
optimized.
Jörn
--
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein
On Tue, 16 March 2004 18:31:47 +0100, Jörn Engel wrote:
> On Tue, 16 March 2004 12:04:30 -0400, Horst von Brand wrote:
> >
> > What happens if I mount the live 2.6.4 kernel source over a CD containing
> > 2.5.30? What happens to identical files, files that moved, changed files,
> > deleted files? Pray tell, how does the kernel find out which is which?
>
> What happens if I write to /dev/hda while having my rootfs /dev/hda1?
> Bad things, damn right. But why would anyone do that?
There is really no point to this discussion, as it looks like a big
misunderstanding. Maybe you object less if you see the design:
Variant 1 (just a single filesystem):
- Introduce a new variant of links, which I call COW.
- COWs can only link to hidden inodes.
- Hidden inodes cannot be accessed directly.
- COWs look like regular files to userspace.
- Read access to COWs goes to the hidden inode.
- Write access to COWs copies the hidden inode before writing to it.
- Copying file1 to file2 does four things:
- Create a new hidden inode.
- Move data from file1 to hidden inode.
- Turn file1 into COW and link to hidden inode.
- Create COW for file2 and link to hidden inode.
There are some more special cases, but this is basically it. So let's
use the stuff a little:
$ cp -cr dir1 dir2
Behaves similar to 'cp -lr', but creates COWs instead of hard links.
Can take a few seconds to create the directories, but not minutes.
$ vi dir2/bunch*of*files
Writing to those files makes a real copy for each. dir1/* remains
unaffected, no matter how careless you are. We've made it foolproof,
so the universe has to create greater fools again, right?
$ rm -rf dir2
Scraps one of the copies along with all modifications. dir1/* remains
unaffected again.
$ cp -cr /fs1/dir1 /fs2/dir2
Fails, since links between different filesystems don't work.
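In case the Variant 1 rules are easier to read as code, here is a toy
userspace model. It is only a sketch: struct hidden_inode, struct
file_model and the cow_copy/file_read/file_append helpers are names made
up for illustration, a fixed-size buffer stands in for file data, and
reusing an existing hidden inode when the source is already a COW is an
assumption about one of the special cases.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DATA_MAX 64

/* Hidden inode: never reachable directly, only through COW links. */
struct hidden_inode {
	int refs;             /* number of COW links pointing here */
	char data[DATA_MAX];  /* NUL-terminated contents */
};

/* What userspace sees: either a regular file or a COW link. */
struct file_model {
	struct hidden_inode *cow; /* non-NULL while the file is a COW link */
	char data[DATA_MAX];      /* private contents of a regular file */
};

/* Reads on a COW go to the hidden inode. */
const char *file_read(const struct file_model *f)
{
	return f->cow != NULL ? f->cow->data : f->data;
}

/* Copy src to dst; dst must be a fresh file. Returns 0, or -1 if out of memory. */
int cow_copy(struct file_model *src, struct file_model *dst)
{
	assert(src != NULL && dst != NULL && src != dst && dst->cow == NULL);

	if (src->cow == NULL) {
		struct hidden_inode *h = calloc(1, sizeof(*h)); /* new hidden inode */
		if (h == NULL)
			return -1;
		memcpy(h->data, src->data, DATA_MAX); /* move data into it */
		h->refs = 1;
		src->cow = h;                         /* src becomes a COW */
	}
	dst->cow = src->cow;                          /* dst links to the same inode */
	dst->cow->refs++;
	return 0;
}

/* Writes first take a private copy, so other links never see the change. */
int file_append(struct file_model *f, const char *text)
{
	size_t used = strlen(file_read(f));
	size_t len = strlen(text);

	if (used + len >= DATA_MAX)
		return -1;
	if (f->cow != NULL) {
		struct hidden_inode *h = f->cow;

		memcpy(f->data, h->data, DATA_MAX); /* copy before writing */
		f->cow = NULL;
		if (--h->refs == 0)
			free(h);
	}
	memcpy(f->data + used, text, len + 1);
	return 0;
}
```

After cow_copy(&file1, &file2), appending to file2 leaves file1 reading
the old contents, which is the behaviour the 'vi dir2/...' step relies on.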
Variant 2 (across multiple filesystems now):
- COWs contain a filesystem identifier as well.
- Accessing COWs linking to unavailable filesystems returns -E...
Alternatively:
- Mounting such an fs fails, unless all links work.
Usage is as above.
$ mkfs /dev/fs2
$ mount /dev/fs2 /fs2
$ cp -cr /fs1 /fs2
Creates an identical copy of one filesystem on another one. fs2 has
to support COWs and fs1 has to be RO or support COWs. A rw-fs mounted
ro means trouble, as you know.
Maybe I'm just stupid and missed some important detail, but this
design looks like it can solve a bunch of problems. Do you still
think it is useless?
Jörn
--
The cheapest, fastest and most reliable components of a computer
system are those that aren't there.
-- Gordon Bell, DEC laboratories
Your "what are the semantics?" arguments are mysterious to me, Horst. I
don't know that unionfs is a good idea, but there are trivial solutions
to the problems you suggest. The fact that a facility can be used to
create untenable situations does not mean that the facility is useless.
On Mon, Mar 15, 2004 at 03:22:41PM -0400, Horst von Brand wrote:
> Jörn Engel <[email protected]> said:
> > On Mon, 15 March 2004 22:35:20 +0800, Ian Kent wrote:
> > > I don't understand the requirement properly. Sorry.
> > Depends on who you ask, but imo it boils down to this:
> > - Use one filesystem as backing store, usually ro.
> > - Have another filesystem on top for extra functionality, usually rw
> > access.
> >
> > Famous example is a rw-CDROM, where writes go to hard drive and
> > unchanged data is read from CDROM. But it makes sense for other
> > things as well.
>
> And what if the underlying filesystem is RW too?
Only the topmost layer of a "union stack" should be RW. If you manage
to write to an underlying FS, it is akin to writing to the block device
underlying a normal FS -- the behavior is undefined.
> What should happen if you unite several (>= 3) filesystems? What if
> some are RO, others RW?
Given that only the topmost is RW, it Just Works.
> What do you do if a file shows up several times, each different?
The topmost entry wins.
> Assuming one RW on top of a RO only now: What should happen when a
> file/directory is missing from the top? If the bottom one "shows through",
> you can't delete anything; if it doesn't, you win nothing (because you will
> have to keep a complete copy RW on top).
If a directory entry is missing, the next lower layer is consulted.
Delete is implemented with "white-out" directory entries -- a directory
entry in the topmost FS which has special meaning, "return -ENOENT
immediately without consulting FSs underlying me".
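A minimal userspace sketch of that lookup rule, assuming layers are
ordered topmost first. The types, union_lookup() and the demo_* data are
illustrative names, not the BSD implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum entry_kind { ENTRY_FILE, ENTRY_WHITEOUT };

struct dir_entry {
	const char *name;
	enum entry_kind kind;
};

struct layer {
	const struct dir_entry *entries;
	size_t count;
};

/*
 * layers[0] is the topmost (RW) layer. Returns the index of the layer
 * that provides the name, or -ENOENT if no layer has it or a white-out
 * is hit before a real entry.
 */
int union_lookup(const struct layer *layers, size_t nlayers, const char *name)
{
	assert(layers != NULL && name != NULL);

	for (size_t i = 0; i < nlayers; i++) {
		for (size_t j = 0; j < layers[i].count; j++) {
			const struct dir_entry *e = &layers[i].entries[j];

			if (strcmp(e->name, name) != 0)
				continue;
			if (e->kind == ENTRY_WHITEOUT)
				return -ENOENT; /* stop: do not consult lower layers */
			return (int)i;          /* the topmost entry wins */
		}
		/* Entry missing in this layer: consult the next lower one. */
	}
	return -ENOENT;
}

/* Made-up example: old.c was deleted on top of a read-only bottom layer. */
const struct dir_entry demo_top[] = {
	{ "fs.c", ENTRY_FILE },
	{ "old.c", ENTRY_WHITEOUT },
};
const struct dir_entry demo_bottom[] = {
	{ "fs.c", ENTRY_FILE },
	{ "old.c", ENTRY_FILE },
	{ "README", ENTRY_FILE },
};
const struct layer demo_stack[] = {
	{ demo_top, 2 },
	{ demo_bottom, 3 },
};
```

Here "fs.c" resolves to layer 0, "README" shows through from layer 1, and
"old.c" stays deleted across remounts because the white-out is stored in
the top filesystem itself.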
> IIRC, this has been discussed a couple of times before, and the consensus
> each time was that it isn't /that hard/ to do, it is /hard or impossible/
> to find a sensible, simple semantics for this. The idea was then dropped...
The semantics of BSD unionfs are fairly well-defined and useful in at
least some circumstances.
References:
J. S. Pendry and M. K. McKusick. Union mounts in 4.4BSD-Lite.
In Proceedings of the USENIX Technical Conference on UNIX and Advanced
Computing Systems, pages 25-33, December 1995.
http://www.usenix.org/publications/library/proceedings/neworl/full_papers/mckusick.a
-andy
Jörn Engel <[email protected]> said:
> On Tue, 16 March 2004 12:04:30 -0400, Horst von Brand wrote:
> > Chris Friesen <[email protected]> said:
> >
> > > I don't see how you win nothing. I create an overlay filesystem.
> >
> > Completely empty is what you get then... and you have to explicitly link in
> > each file. Or everything shows up here.
>
> Correct. Is that a problem?
Yes. Use it for a kernel tree (some 18.000 files by now), and do it each
time you mount it.
> > > delete a bunch of files in the overlay and it doesn't show through.
> >
> > Next time you mount it, what happens? How do you know the "top files" were
> > deleted, and should not show up?
> >
> > What happens if I mount the live 2.6.4 kernel source over a CD containing
> > 2.5.30? What happens to identical files, files that moved, changed files,
> > deleted files? Pray tell, how does the kernel find out which is which?
> What happens if I write to /dev/hda while having my rootfs /dev/hda1?
> Bad things, damn right. But why would anyone do that?
> Can you tell me what the point behind your examples is? It escapes
> me.
OK, let's see... I've got a laptop, on CD is the "original" kernel tree, on
HD is my modified stuff. I delete a file (or move it). Then I pack up, go
home. There I start up again. How is the fact that the file is gone recorded?
> > How do you back up a beast like this?
>
> - Use a really large tape (stupid).
All layers? Urgh...
> - cp /dev/... backup_medium
> - Backup software with a clue about the underlying fs.
Another special, non-POSIX piece that needs to be written and maintained.
> > In any case, there are tools that create a farm of symlinks, and when you
> > try to write to a file (pointing to a RO area/file), you get an error. This
> > gives you 90% of what you want, _without_ aggravating the filesystem
> > hackers.
>
> Great, so you found *your* solution already. I've done the same
> without the need for symlinks in a 90-line patch, good enough for my
> immediate needs right now. But someday I'd like to have the remaining
> 10% as well. :)
I had my solution with _no_ kernel patch. Better still, it also works on
proprietary Unix systems. Even better yet, any newbie Unix (even without any
Linux-with-funky-patch background) user understands what is going on here.
Fully POSIX compliant.
> > > I would dearly love to use something like this to make it easy to track
> > > changes made all over a source tree. If I could sync them up at the
> > > beginning, then make all my changes in the overlay, then doing a diff is
> > > really easy since you just look for places where the inodes are
> > > different between the two filesystems. Like having hard links, but the
> > > filesystem breaks them for you when you write.
> >
> > This is called BitKeeper, CVS, Subversion, arch, RCS, SCCS, ... Better yet,
> > it keeps the history of each file (not just the one version on RO media),
> > with annotations. You decide when a version is ready for archiving.
> >
> > Sure, this would save disk space. But at today's prices, it just is not
> > worth the trouble.
>
> Not true:
> - Even with bitkeeper, people copy their complete tree before making
> changes, at least Linus says he does. Go back to start, do not
> collect $2000.
Not needed at all. Sure, if you have enough disk... now hand over the $2000
> - Copying the kernel tree is not just a question of space and money,
> but also about time.
And both copies slowly diverge, and need to be synchronized sometime. You
owe me another $2000
> - When the time and disk hit of identical copies approaches zero,
> people will do this a lot more, they have new possibilities. *That*
> is really important, not doing the same as before, just slightly
> optimized.
What you are talking about is some kind of (modifiable) disk cache of data on
ro media...
> --
> Everything should be made as simple as possible, but not simpler.
> -- Albert Einstein
This stuff definitely fails this, IMHO.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
Jörn Engel <[email protected]> said:
> On Tue, 16 March 2004 18:31:47 +0100, Jörn Engel wrote:
> > On Tue, 16 March 2004 12:04:30 -0400, Horst von Brand wrote:
> > >
> > > What happens if I mount the live 2.6.4 kernel source over a CD containing
> > > 2.5.30? What happens to identical files, files that moved, changed files,
> > > deleted files? Pray tell, how does the kernel find out which is which?
> >
> > What happens if I write to /dev/hda while having my rootfs /dev/hda1?
> > Bad things, damn right. But why would anyone do that?
>
> There is really no point to this discussion, as it looks like a big
> misunderstanding. Maybe you object less if you see the design:
>
> Variant 1 (just a single filesystem):
>
> - Introduce a new variant of links, which I call COW.
> - COWs can only link to hidden inodes.
> - Hidden inodes cannot be accessed directly.
> - COWs look like regular files to userspace.
> - Read access to COWs goes to the hidden inode.
> - Write access to COWs copies the hidden inode before writing to it.
> - Copying file1 to file2 does four things:
> - Create a new hidden inode.
> - Move data from file1 to hidden inode.
> - Turn file1 into COW and link to hidden inode.
> - Create COW for file2 and link to hidden inode.
Looks an awful lot like symlinks...
> There are some more special cases, but this is basically it. So let's
> use the stuff a little:
>
> $ cp -cr dir1 dir2
>
> Behaves similar to 'cp -lr', but creates COWs instead of hard links.
> Can take a few seconds to create the directories, but not minutes.
Why does it magically take less time? The work done (recursing, fiddling
with directories, syscalls, ...) is nearly the same.
> $ vi dir2/bunch*of*files
>
> Writing to those files makes a real copy for each. dir1/* remains
> unaffected, no matter how careless you are. We've made it foolproof,
> so the universe has to create greater fools again, right?
Just do it again, thinking _these_ versions are the ones safe from
fat-fingering...
> $ rm -rf dir2
>
> Scraps one of the copies along with all modifications. dir1/* remains
> unaffected again.
>
> $ cp -cr /fs1/dir1 /fs2/dir2
>
> Fails, since links between different filesystems don't work.
Why?
How do you push changes back if needed? How do you get back the version 3
changes back? Oops, can't do...
You do want version control.
> Variant 2 (across multiple filesystems now):
>
> - COWs contain a filesystem identifier as well.
> - Accessing COWs linking to unavailable filesystems returns -E...
> Alternatively:
> - Mounting such an fs fails, unless all links work.
>
> Usage is as above.
>
> $ mkfs /dev/fs2
> $ mount /dev/fs2 /fs2
> $ cp -cr /fs1 /fs2
>
> Creates an identical copy of one filesystem on another one. fs2 has
> to support COWs and fs1 has to be RO or support COWs. A rw-fs mounted
> ro means trouble, as you know.
>
This is just one filesystem. Where are the others?
> Maybe I'm just stupid and missed some important detail, but this
> design looks like it can solve a bunch of problems. Do you still
> think, it is useless?
I still think there are much better solutions to the problems you mention.
What I'd love to see is something like, say /usr for each package (complete
with binaries in /usr/bin/, manuals in /usr/share/man/, ...) that you can
mount together in any combination wanted (even per-user, a la Plan 9) over
/usr and have it fully populated. But that gets horribly messy when you
want files from different versions (say, I've got an overlay for vi(1) that
fixes a horrible bug, but the manual is in Czech, and I prefer English) to
show up on top, or have files one on top of the other (source code
versions?) and you delete/modify/move the top one. What happens then? If
you can't come up with a sensible interpretation of Unix file operations in
this scenario (and you can't, trust me), the idea is doomed.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
Horst von Brand wrote:
> OK, let's see... I've got a laptop, on CD is the "original" kernel tree, on
> HD is my modified stuff. I delete a file (or move it). Then I pack up, go
> home. There I start up again. How is the fact that the file is gone recorded?
If I recall, directory inodes list the files in that directory.
Presumably you had to write to the directory inode to do the delete.
The new version of the directory is now stored on the HD. It lists the
original files on the CD, minus the one that was deleted.
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
On Tue, 16 March 2004 15:40:24 -0400, Horst von Brand wrote:
> >
> > $ cp -cr dir1 dir2
> >
> > Behaves similar to 'cp -lr', but creates COWs instead of hard links.
> > Can take a few seconds to create the directories, but not minutes.
>
> Why does it magically take less time? The work done (recursing, fiddling
> with directories, syscalls, ...) is nearly the same.
joern@limerick:~$ time cp -lr /usr/src/linux/ /tmp/linux
real 0m22.356s
user 0m0.167s
sys 0m1.480s
joern@limerick:~$ rm -r /tmp/linux/
joern@limerick:~$ time cp -r /usr/src/linux/ /tmp/linux
real 1m44.147s
user 0m0.499s
sys 0m7.987s
'nuf said, eot.
Jörn
--
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham
Horst von Brand wrote:
> Besides, the people asking for this mostly really
> want version control, or get what they want from symlink farms.
No. Version control does not address the requirement: to have 30
checked out kernel trees, each with compiled images, because you're
actually working on 30 trees, sharing files to save time and space,
and normal shell commands in each directory not accidentally affecting
the others.
I have not heard of any version control system which offers that.
Perhaps one based around a virtual filesystem could.
Symlink farms do not solve it either. They have the same problem as
hard links: namely, it is too easy to accidentally modify a file in
one tree while intending to modify only in another, plus they
introduce a whole bunch of other problems.
This idea of COW links is to solve one quite specific problem:
creating the illusion that large trees are copied and independent, so
that editors and compilers and makefiles and so on affect them
independently, while doing so fast and small, and allowing programs
which compare files (such as version control and diff) to know when
two files' contents are identical efficiently.
-- Jamie
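The "know when two files' contents are identical efficiently" part can already be approximated for hard-linked trees by comparing device and inode numbers instead of reading file data. The sketch below is only an illustration of that idea; shares_inode, changed_files and demo are hypothetical names, and the demo assumes the temporary directory lives on a filesystem that supports hard links.

```python
import os
import tempfile


def shares_inode(path_a, path_b):
    """True if both paths refer to the same inode on the same device."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def changed_files(tree_a, tree_b):
    """Relative paths under tree_a whose counterpart in tree_b is
    missing or no longer shares an inode with the original."""
    changed = []
    for root, _dirs, files in os.walk(tree_a):
        for name in files:
            path_a = os.path.join(root, name)
            rel = os.path.relpath(path_a, tree_a)
            path_b = os.path.join(tree_b, rel)
            if not os.path.exists(path_b) or not shares_inode(path_a, path_b):
                changed.append(rel)
    return sorted(changed)


def demo():
    """Build two linked trees, break one link, and list what changed."""
    with tempfile.TemporaryDirectory() as d:
        tree_a = os.path.join(d, "a")
        tree_b = os.path.join(d, "b")
        os.mkdir(tree_a)
        os.mkdir(tree_b)
        for name in ("a.c", "b.c"):
            with open(os.path.join(tree_a, name), "w") as f:
                f.write("original\n")
            os.link(os.path.join(tree_a, name), os.path.join(tree_b, name))
        # Simulate an editor that writes a fresh file instead of
        # overwriting in place: the link to tree_a is broken.
        os.unlink(os.path.join(tree_b, "b.c"))
        with open(os.path.join(tree_b, "b.c"), "w") as f:
            f.write("modified\n")
        return changed_files(tree_a, tree_b)


print(demo())
```

A differing inode only means the files may differ; a real diff still has to read those candidates. The gain is that everything still sharing an inode can be skipped without reading it.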
Jörn Engel wrote:
> And version control is something I actually want to be done inside the
> kernel, at least to some degree. People already use kernel support,
> although it sucks (cp -lr anyone?). Looks like the alternatives suck
> even more, so your point is void.
Fwiw, I much prefer your COW hard links to something where I have to
mount a new filesystem every time I "copy" a tree, and have to redo
those mounts each time I reboot, have a big ugly mess in "df" output,
have "du" get confused, and leave "rsync" with no hope of dealing with
them sensibly.
I also don't mind if copying isn't implemented in the kernel. I'm ok
with programs reporting an error that they couldn't write to a file
because it was linked readonly. At least that removes the danger of
accidental overwriting, and I can either fix it by hand or use an
LD_PRELOAD library which detects that error code from open() and
copies the file.
Even if vi and Emacs, which make it temptingly easy to ignore normal
read-only protection, were changed to be aware of and bypass the
read-only link attribute, they'd do the right thing: the attribute
expresses the _intent_ that removing it should always be done by
copying the file, whereas with hard links that intent isn't clear.
(Emacs has backup-by-copying-when-linked, but that isn't too helpful
because sometimes you want writing to a linked file to change both places).
So my vote is for the very simple COW hard link attribute, and leave
the rest to userspace.
Thanks!
-- Jamie
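The userspace half of this could look roughly like the following. It is a sketch under a loud assumption: the proposed COW link attribute and its error code do not exist in a stock kernel, so a link count above one stands in for "this file is a COW link". The names break_link, open_for_write and demo are hypothetical.

```python
import os
import shutil
import tempfile


def break_link(path):
    """Replace `path` with a private copy so other links are unaffected."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        shutil.copy2(path, tmp)  # copy data and metadata
        os.replace(tmp, path)    # atomically swap in the private copy
    except BaseException:
        os.unlink(tmp)
        raise


def open_for_write(path, mode="r+"):
    """Open `path` for writing, copying it first if it is shared.

    ASSUMPTION: st_nlink > 1 stands in for the COW attribute check
    (or the error from open()) that the proposal describes.
    """
    if os.stat(path).st_nlink > 1:
        break_link(path)
    return open(path, mode)


def demo():
    """Write through one link and show the other link is untouched."""
    with tempfile.TemporaryDirectory() as d:
        orig = os.path.join(d, "orig.c")
        link = os.path.join(d, "copy.c")
        with open(orig, "w") as f:
            f.write("original\n")
        os.link(orig, link)
        with open_for_write(link, "w") as f:
            f.write("modified\n")
        with open(orig) as f:
            orig_text = f.read()
        with open(link) as f:
            link_text = f.read()
        return orig_text, link_text


print(demo())
```

An LD_PRELOAD wrapper around open() would do the same thing in C; the point here is only that the copy step fits comfortably in userspace once the kernel refuses the write.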
On Fri, 19 March 2004 09:11:48 +0000, Jamie Lokier wrote:
> [...]
>
> So my vote is for the very simple COW hard link attribute, and leave
> the rest to userspace.
That makes an overwhelming majority of 100% so far, congratulations!
I'll merge Sytse's fixes and post a new patch soon, followed by a
userspace copyfile() implementation.
Jörn
--
The cost of changing business rules is much more expensive for software
than for a secretary.
-- unknown
Forgetting his purposes, here are my valuable cases. Note that none of
these cases are about source control or versioning. (IMHO that is what
version control is for, and that task has been dealt with in its place.) I
have thought about this a lot because of my direct need a-la scenario 1.
This is a fairly direct description, and the utility has become painfully
desirable by its absence.
Scenario 1: (my real world)
I have a product where there is a central (full sized PC) controller and
some number of satellite boxes. The satellites are small custom hardware
jobbies running a linux instance of their own. The number and disposition
of these sub-boxes varies by installation.
Ideally I would use dhcp and provide one single root image for the satellite
boxes to nfs-mount regardless of their dynamically assigned host names or IP
addresses. The boxes are small enough that transcribing an initrd (etc)
into ram is too resource hungry even with busybox.
Right now, it is not possible to do this because if two running instances
share the same image they will share some critically variable files. (It is
disastrous to have two boxes share an image containing /etc/ld.so.cache
{and hugely problematic for them to share /var/run/ and /var/log/} but I
want them to have all the rest of /etc held in common.)
I have used various tricks to overcome most of this, but it would be (would
have been) ideal to be able to mount host:/export/satellite/root as the root
directory. Then the startup script could union that root under a ramfs
partition. (I have no need to maintain the changes between boots, so in
this case discarding the changes is ideal.)
Scenario 2: (theoretical)
Take scenario 1 and re-cast it in a manufacturing or test/maintenance
environment. The target machines are (network) booted anonymously into a
test network. The tests are performed, the results are logged to an
infrastructure machine, and the box is powered off and removed to its next
location. Again, the single image is used, and the union overlay is
discarded after use.
Scenario 3: (theoretical, from my two-back previous employer)
Take scenario 1 again and apply it to education. A lab full of student
machines. Students are expected to either supply their own home directories
on USB/FireWire hard disks they bring with them; or their home directories
are mounted across a network from a homing farm. These lab machines need to
be quite accessible but you don't want the students to corrupt the installs.
Again the union overlay is discarded after use.
OK so I have been discarding the overlays. But are there scenarios
where someone would want to keep the overlays or analyze them separately?
You can tell by the way I asked that I think the answer is yes, can't you?
8-)
Scenario 4: (past, real world trauma 8-)
You want to install/replace a package on your existing box. You would like
to try it "in place" but you don't want to hammer your install. You mount
up the union with a "spare" partition, and pivot that in as root. Now you
do the install and try things out. (Think system wide utilities like a new
init, or glibc.) If things don't work well you want to reboot or pivot back
to your real root and then look through just what the install task did to
your machine. In this scenario the preservation of the binaries in the
"original" location, as reflected on the now dis-unioned partition provides
important information. You can check log fragments, partial deletes, and so
on. You can also tweak the dis-unioned image and then try it out to see if
your problems are addressed. With this information, and once you are
satisfied you have an install/replace that will work for you, you can
proceed to perform the install "for real" on your root. Then you can
double-check against the dis-unioned overlay.
Scenario 5: (theoretical)
You want to roll your own distribution, or build package images (rpm's etc)
out of packages that don't have build semantics that allow build-for-root
install-to-scratch construction. Take your near-virgin machine, pivot the
union over the root, do the install, pivot back or reboot, pack the
dis-unioned image up as the package tree.
I would say that to do these tasks you would need to meet the following
requirements:
1) Overlay trump semantics. Any file with a given name in the overlay must
completely hide the same file in the backing file system. It must be
possible to remove a file and then create a directory of that same name.
2) copy whole-file on write. This would probably require a kernel daemon.
Files opened read-only are used in place. Files opened for writing need to
wholly migrate from the backing file system to the overlay, either at open
time or on first write. This would involve waits and stuff of potentially
extreme severity, hence the daemon. Block-differences are not interesting,
that is what later diffs are for anyway. Block-wise copying would make mmap
(etc) just painfully impossible anyway. So all-or-nothing migration is the
ticket 8-).
3) White-out list. There would need to be a database, by path name, of
files that had been "removed" (unlinked but not "overwritten") and that
database would need to be expressed in an accessible format when the overlay
file system was not part of the union. Presumably there would be a file
existent on the overlay, that was opened by the kernel daemon at mount time
and which didn't appear in the resulting mounted image. This file would be
the white-out list. The natural effect would be that a file of that name in
the backing file system would be inaccessible. If that name were a
mount-time option (etc) then this one point of contention could be worked
around on a mount-by-mount basis, making it immaterial. Being available at
open() time, this white-out list removes the need for invisible inodes in
the exceptional sense. We don't care what "/bob" used to be, there is now
no file or directory (etc) named "/bob" and there will persist in being no
such thing until "/bob" is created in the overlay or through the union.
Thanks for playing... 8-)
4) A dumb-guy-gets-the-shaft provision would be that changes to the backing
file system, or mounting an overlay that doesn't match the backing file
system produces exactly the mess that you get from overlaying the overlay
(with its white out list) over the backing system. The Cartesian product of
possible combinations is interesting but not pejorative. If I have a
directory in the overlay that doesn't exist in the backing system, then I
have a directory with exactly those elements from the overlay. If I
white-out "/bob" then there is no /bob whether the backing system wants one
there or not.
5) No backing-fs updates at all. Under no circumstances is there to be any
requirement of writing to the backing file system to keep things current.
6) No write-through. No checks are made for the write-through case. All
writes take place as per item 2 and on the overlay fs regardless of the
writeability of the backing file system.
7) The MMap conundrum... In the case of a read-only mmap of a file, the file
should *probably* be speculatively "copied" to the overlay file system.
The user will expect that the overlay file system remains consistent on
update, so you don't want under-writes of the backing file system to
give you a moving target. Such copies would take place as if a write were
going to happen, but the inode would be unlinked-after-open (or more
accurately unlinked at close if not written) in the overlay file system so
that in the zero-change case you don't produce excessively large
dis-union(ed) image sizes. (In truth this speculative copy behavior would
be done via cached semi-inodes in the union driver itself, the real inode
only being written to the overlay in the update case. This is practical
since we are writing these specialized open/release/write/etc routines
anyway. So you are doing the phantom inode thing just like
unlink-after-open in all the other file systems anyway, but you will later
link-down the inode if it warrants saving instead of doing things the other
way around.)
8) Speculative copy thresholds. There probably needs to be some tunable(s)
for sizes and types of speculative copying into the phantom inodes. It is
just bound to come up.
None of these requirements are particularly complex or onerous. Some of the
actions are potentially expensive, but in the foreseeable uses, they would
not be prohibitively so.
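To make requirements 1, 2, 3, 5 and 6 concrete, here is a small userspace model built on two ordinary directories. It is a sketch, not a design: the class name Overlay, the white-out file name ".whiteouts" and the demo are all assumptions made for illustration, a whited-out directory does not hide its children here, and the mmap and speculative-copy items (7 and 8) are left out entirely.

```python
import os
import shutil
import tempfile


class Overlay:
    WHITEOUT_FILE = ".whiteouts"  # hypothetical name; item 3 suggests a mount option

    def __init__(self, backing, overlay):
        self.backing, self.overlay = backing, overlay
        self.whiteouts = set()
        wl = os.path.join(overlay, self.WHITEOUT_FILE)
        if os.path.exists(wl):  # read the white-out list at "mount" time
            with open(wl) as f:
                self.whiteouts = {line.rstrip("\n") for line in f if line.strip()}

    def _save(self):
        with open(os.path.join(self.overlay, self.WHITEOUT_FILE), "w") as f:
            f.writelines(p + "\n" for p in sorted(self.whiteouts))

    def resolve(self, rel):
        """Return the real path that serves `rel` for reading."""
        upper = os.path.join(self.overlay, rel)
        if os.path.lexists(upper):
            return upper                   # item 1: the overlay trumps
        if rel in self.whiteouts:
            raise FileNotFoundError(rel)   # item 3: whited out
        lower = os.path.join(self.backing, rel)
        if os.path.lexists(lower):
            return lower
        raise FileNotFoundError(rel)

    def open(self, rel, mode="r"):
        if mode in ("r", "rb"):
            return open(self.resolve(rel), mode)  # read-only: used in place
        upper = os.path.join(self.overlay, rel)
        os.makedirs(os.path.dirname(upper), exist_ok=True)
        if not os.path.lexists(upper):
            try:
                shutil.copy2(self.resolve(rel), upper)  # item 2: whole-file copy
            except FileNotFoundError:
                pass                       # nothing visible to copy: new file
        f = open(upper, mode)              # items 5/6: writes only hit the overlay
        if rel in self.whiteouts:          # creating the name ends the white-out
            self.whiteouts.discard(rel)
            self._save()
        return f

    def unlink(self, rel):
        self.resolve(rel)                  # raises if the name is not visible
        upper = os.path.join(self.overlay, rel)
        if os.path.lexists(upper):
            os.unlink(upper)
        self.whiteouts.add(rel)
        self._save()


def demo():
    with tempfile.TemporaryDirectory() as d:
        backing, overlay = os.path.join(d, "backing"), os.path.join(d, "overlay")
        os.mkdir(backing)
        os.mkdir(overlay)
        for name, text in (("bob", "backing bob\n"), ("motd", "hello\n")):
            with open(os.path.join(backing, name), "w") as f:
                f.write(text)
        union = Overlay(backing, overlay)
        with union.open("motd", "a") as f:
            f.write("world\n")
        union.unlink("bob")
        remounted = Overlay(backing, overlay)  # white-outs survive a "remount"
        with remounted.open("motd") as f:
            union_motd = f.read()
        with open(os.path.join(backing, "motd")) as f:
            backing_motd = f.read()
        try:
            remounted.resolve("bob")
            bob_hidden = False
        except FileNotFoundError:
            bob_hidden = True
        return union_motd, backing_motd, bob_hidden


print(demo())
```

Because the white-out list is a plain text file of path names in the overlay directory, it stays readable when the overlay is inspected on its own, which is what scenarios 4 and 5 rely on.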
Rob.