2009-10-01 14:56:16

by Valerie Aurora

Subject: [RFC] Union mounts/writable overlays design

Hi all,

As Al and Christoph have requested, here is the design document for
writable overlays (a.k.a. union mounts). It includes a description of
our locking strategy. Please read and comment!

To go along with this doc, I have rebased our kernel patches against
2.6.31, e2fsprogs against 1.40.9, and util-linux-ng against latest
git. Pointers to all these git repositories and a complete UML-based
union mounts dev kit can be found here:

http://valerieaurora.org/union/

We will post the patches for review soon, but don't let that stop you
from reviewing and testing them now. :) Thanks to everyone who already
sent patches, tested, or reviewed. A list of everyone who has
contributed so far is on the union mounts web page.

Thanks,

-VAL

State of writable overlays (formerly union mounts)
==================================================

This version of union mounts is renamed "writable overlays." The goal
of this patch set is to support a single read-write file system
overlaid on a single read-only file system. "Union mounts" suggests
that we support unions of arbitrary numbers and types of file systems,
which is not the goal of this patch set.

The most recent version of writable overlays can boot to multi-user
mode with a writable overlay root file system. open(), truncate(),
creat(), unlink(), mkdir(), rmdir(), and rename() work. link(),
chmod(), chown(), and chattr() don't work yet.

This document describes the architecture and current status of
writable overlays, including an item-by-item todo list.

Writable overlays (formerly union mounts)
=========================================

In this document:
- Overview of writable overlays
- Terminology
- VFS implementation
- Locking strategy
- VFS/file system interface
- Userland interface
- NFS interaction
- Status
- Contributing to writable overlays

Overview
========

Writable overlays (formerly known as union mounts) are used to layer a
single writable file system over a single read-only file system, with
all writes going to the writable file system. The namespace of both
file systems appears as a combined whole to userland, with those on
the writable file system covering up any matching pathnames on the
read-only file system. A few use cases:

- Root file system on CD with writes saved to hard drive (LiveCD)
- Multiple virtual machines with the same starting root file system
- Cluster with NFS mounted root on clients

Most if not all of these problems could be solved with a COW block
device; however, sharing at the file system level has higher
performance and uses less disk space.

What writable overlays are not
------------------------------

Writable overlays are not a general-purpose unioning file system.
They do not provide a generic "union of namespaces" operation for an
arbitrary number of file systems. Many interesting features can be
implemented with a generic unioning facility: unioning of more than
two file systems, dynamic insertion and removal of branches, online
upgrade, etc. Some unioning file systems that do this are UnionFS and
AUFS. Unfortunately, the complexity of these feature sets leads to
difficult corner cases which so far have been unsolvable in the
context of the Linux VFS.

Writable overlays avoid these corner cases by reducing the feature set
to the bare minimum of most-requested features: one writable file system
layered over one read-only file system. Despite the limitations of
writable overlays, the VFS infrastructure they use is generic enough
to be reused by more full-featured unioning file systems.

Terminology
===========

The main analogy for writable overlays is that a writable file system
is mounted "on top" of a read-only file system. Lookups start at the
"top" read-write file system and travel "down" to the "bottom"
read-only file system only if no blocking entry exists on the top
layer.

Top layer: The read-write file system. Lookups begin here.

Bottom layer: The read-only file system. Lookups end here.

Path: Combination of the vfsmount and dentry structure.

Follow down: Given a path from the top layer, find the corresponding
path on the bottom layer.

Follow up: Given a path from the bottom layer, find the corresponding
path on the top layer.

Whiteout: A directory entry in the top layer that prevents lookups
from travelling down to the bottom layer. Created on unlink()/rmdir()
if a corresponding directory entry exists in the bottom layer.

Opaque: A flag on a directory in the top layer that prevents lookups
of entries in this directory from travelling down to the bottom
layer (unless there is an explicit fallthru entry allowing that for a
particular entry). Set on creation of a directory that replaces a
whiteout, and after a directory copyup.

Fallthru: A directory entry which allows lookups to "fall through" to
the bottom layer for that exact directory entry. This serves as a
placeholder for directory entries from the bottom layer during
readdir(). Fallthrus override opaque flags.

File copyup: Create a file on the top layer that has the same properties
and contents as the file with the same pathname on the bottom layer.

Directory copyup: Copy up the visible directory entries from the
bottom layer as fallthrus in the matching top layer directory. Mark
the directory opaque to avoid unnecessary negative lookups on the
bottom layer.
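
The interaction between whiteouts, opaque directories, and fallthrus
can be sketched as a small user-space model. This is only an
illustration of the rules defined above, not the kernel code: every
type and function name here (entry_kind, dir, union_lookup_model) is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum entry_kind { ENT_FILE, ENT_WHITEOUT, ENT_FALLTHRU };

struct entry {
	const char *name;
	enum entry_kind kind;
};

struct dir {
	const struct entry *entries;
	size_t count;
	int opaque;	/* only meaningful for a top layer directory */
};

enum lookup_result { FOUND_TOP, FOUND_BOTTOM, NOT_FOUND };

static const struct entry *find(const struct dir *d, const char *name)
{
	for (size_t i = 0; i < d->count; i++)
		if (strcmp(d->entries[i].name, name) == 0)
			return &d->entries[i];
	return NULL;
}

/* Resolve one name in a directory that exists on both layers. */
enum lookup_result union_lookup_model(const struct dir *top,
				      const struct dir *bottom,
				      const char *name)
{
	const struct entry *e;

	assert(top && bottom && name);
	e = find(top, name);

	if (e && e->kind == ENT_FILE)
		return FOUND_TOP;	/* top layer covers the bottom */
	if (e && e->kind == ENT_WHITEOUT)
		return NOT_FOUND;	/* whiteout blocks the bottom layer */
	if (!e && top->opaque)
		return NOT_FOUND;	/* opaque, and no fallthru for this name */

	/* Explicit fallthru, or a non-opaque directory with no top entry. */
	return find(bottom, name) ? FOUND_BOTTOM : NOT_FOUND;
}
```

In this model a fallthru is the only way for a name in an opaque
directory to reach the bottom layer, which matches "Fallthrus override
opaque flags" above.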

Examples
========

What happens when I...

- creat() /newfile -> creates on top layer
- unlink() /oldfile -> creates a whiteout on top layer
- Edit /existingfile -> copies up to top layer at open(O_WRONLY/O_RDWR) time
- truncate /existingfile -> copies up to top layer + N bytes if specified
- touch()/chmod()/chown()/etc. -> copies up to top layer
- mkdir() /newdir -> creates on top layer
- rmdir() /olddir -> creates a whiteout on top layer
- mkdir() /olddir after above -> creates on top layer w/ opaque flag
- readdir() /shareddir -> copies up entries from bottom layer as fallthrus
- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on top layer
- symlink() /oldfile /symlink -> nothing special
- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
- rename() dir -> EXDEV

Getting to a root file system with a writable overlay:

- Mount the base read-only file system as the root file system
- Mount the read-only file system again on /newroot
- Mount the writable overlay on /newroot:
# mount -o union /dev/sda /newroot
- pivot_root to /newroot
- Start init

See scripts/pivot.sh in the UML devkit linked to from:

http://valerieaurora.org/union/

VFS implementation
==================

Writable overlays are implemented as an integral part of the VFS,
rather than as a VFS client file system (i.e., a stacked file system
like unionfs or ecryptfs). Implementing writable overlays inside the
VFS eliminates the need for duplicate copies of VFS data structures,
unnecessary indirection, and code duplication, but requires very
maintainable, low-to-zero overhead code. Writable overlays require no
change to file systems serving as the read-only layer, and require
some minor support from file systems serving as the read-write layer.
File systems that want to be the writable layer must implement the new
->whiteout() and ->fallthru() inode operations, which create special
dummy directory entries.

union_mount structure
---------------------

The primary data structure for writable overlays is the union_mount
structure, which connects overlapping directory dentries into a "union
stack":

struct union_mount {
	atomic_t u_count;		/* reference count */
	struct mutex u_mutex;
	struct list_head u_unions;	/* list head for d_unions */
	struct list_head u_list;	/* list head for mnt_unions */
	struct hlist_node u_hash;	/* list head for searching */
	struct hlist_node u_rhash;	/* list head for reverse searching */

	struct path u_this;		/* this is me */
	struct path u_next;		/* this is what I overlay */
};

The union_mount is referenced from the corresponding directory's
dentry:

struct dentry {
	[...]
#ifdef CONFIG_UNION_MOUNT
	/*
	 * The following fields are used by the VFS based union mount
	 * implementation. Both are protected by union_lock!
	 */
	struct list_head d_unions;	/* list of union_mounts */
	unsigned int d_unionized;	/* unions referencing this dentry */
#endif
	[...]
};

Each top layer directory with the potential for a lookup to fall
through to the bottom layer has a union_mount structure stored in a
union_mount hash table. The union_mount's can be looked up both by the
top layer's path (via union_lookup()) and the bottom layer's path (via
union_rlookup()). Once you have the path (vfsmount and dentry pair)
of a file, the union stack can be followed down, layer by layer, with
follow_union_down(), and up with follow_union_mount().

All union_mount's are allocated from a kmem cache when the
corresponding dentries are created. union_mount's are allocated when
the first referencing dentry is allocated and freed when all of the
referencing dentries are freed - that is, the dcache drives the union
cache. While writable overlays only use two layers, the union stack
infrastructure is capable of supporting an arbitrary number of file
system layers (leaving aside locking issues).
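
A rough user-space sketch of the forward and reverse lookups described
above, assuming a path can be reduced to a pair of integer ids. The
names path_model, union_link, follow_down_model, and follow_up_model
are hypothetical, and a linear scan stands in for the real hash table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct path: (vfsmount, dentry) as plain ids. */
struct path_model {
	int mnt;
	int dentry;
};

/* Hypothetical stand-in for union_mount: one link in a union stack. */
struct union_link {
	struct path_model this_path;	/* top layer directory */
	struct path_model next;		/* directory it overlays */
};

static int same_path(const struct path_model *a, const struct path_model *b)
{
	return a->mnt == b->mnt && a->dentry == b->dentry;
}

/* Forward lookup: top layer path -> path it overlays, or NULL. */
const struct path_model *follow_down_model(const struct union_link *links,
					   size_t n,
					   const struct path_model *top)
{
	assert(links && top);
	for (size_t i = 0; i < n; i++)
		if (same_path(&links[i].this_path, top))
			return &links[i].next;
	return NULL;
}

/* Reverse lookup: bottom layer path -> path that covers it, or NULL. */
const struct path_model *follow_up_model(const struct union_link *links,
					 size_t n,
					 const struct path_model *bottom)
{
	assert(links && bottom);
	for (size_t i = 0; i < n; i++)
		if (same_path(&links[i].next, bottom))
			return &links[i].this_path;
	return NULL;
}
```

The model returns only the first match in the reverse direction; it
does not try to represent a bottom layer shared by several overlays.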

Todo:

- Rename union_mount structure - it's per directory, not per mount

Code paths
----------

Writable overlays modify the following key code paths in the VFS:

- mount()/umount()
- Path lookup
- Any path that modifies an existing file

Mount
-----

Writable overlays are created in two steps:

1. Mount the bottom layer file system read-only in the usual manner.
2. Mount the top layer with the "-o union" option at the same mountpoint.

The bottom layer must be read-only and the top layer must be
read-write and support whiteouts and fallthrus (indicated by setting
the MS_WHITEOUT flag). Currently, the top layer is forced to
"noatime" to avoid a copyup on every access of a file. Supporting
atime with the current infrastructure would require a copyup on every
open().

Currently, the top layer covers all submounts on the read-only file
system. This can be inconvenient; e.g., mounting a writable overlay
on the root file system after procfs has been mounted. It's not clear
what the right behavior is. Also, it may be smarter to mount both
read-only and read-write layers in one step, but the mount options get
pretty ugly.

pivot_root() is supported and is the recommended way to get to a root
file system with a writable overlay.

Todo:

- Rename "-o union" mount option - "overlay"?
- Don't permit mounting over read-write submounts
- Choose submount covering behavior
- Allow atime?

Really really read-only file systems: In Linux, any individual file
system may be mounted at multiple places in the namespace. The file
system may change from read-only to read-write while still mounted.
Thus, simply checking that the bottom layer is read-only at the time
the writable overlay is mounted over it is pointless, since at any
time the bottom layer may become read-write.

We need to guarantee that a file system will be read-only for as long
as it is the bottom layer of a writable overlay. To do this, we track
the number of "read-only users" of a file system in its VFS superblock
structure. When we mount a writable overlay over a file system, we
increment its read-only user count. The file system can only be
mounted read-write if its read-only users count is zero.
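
The read-only user count can be modelled in a few lines. This is a
sketch under stated assumptions, not the patch itself: sb_model and
its fields are hypothetical stand-ins for the superblock, the error
codes are guesses, and the model has no locking.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the VFS superblock. */
struct sb_model {
	int read_only;		/* currently mounted read-only? */
	int readonly_users;	/* overlays using this fs as bottom layer */
};

/* Called when a writable overlay is mounted over this file system. */
int overlay_attach(struct sb_model *sb)
{
	assert(sb);
	if (!sb->read_only)
		return -EINVAL;	/* assumed errno: bottom must be read-only */
	sb->readonly_users++;
	return 0;
}

/* Called when the writable overlay is unmounted. */
void overlay_detach(struct sb_model *sb)
{
	assert(sb && sb->readonly_users > 0);
	sb->readonly_users--;
}

/* A remount read-write is refused while any overlay depends on us. */
int remount_read_write(struct sb_model *sb)
{
	assert(sb);
	if (sb->readonly_users > 0)
		return -EBUSY;	/* assumed errno */
	sb->read_only = 0;
	return 0;
}
```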

Todo:

- Support really really read-only NFS mounts. See discussion here:

http://markmail.org/message/3mkgnvo4pswxd7lp

Path lookup
-----------

Much of the action in writable overlays happens during lookup().
First, if we look up a directory on the bottom layer that doesn't yet
exist on the top layer, __link_path_walk() always creates a matching
directory on the top layer. This way, we never have to walk back up a
path, creating directories as we go, before we can copyup a file.
Second, if we need to copy up a file, we first (re)look it up with the
LOOKUP_TOPMOST flag, which instructs __link_path_walk() to create it
on the top layer. Neither directory entries nor file data are copied
up in __link_path_walk() - that happens after the lookup, in the
caller.

The main cut-out to writable overlay code is in do_lookup():

static int do_lookup(struct nameidata *nd, struct qstr *name,
		     struct path *path)
{
	int err;

	if (IS_MNT_UNION(nd->path.mnt))
		goto need_union_lookup;
[...]
need_union_lookup:
	err = cache_lookup_union(nd, name, path);
	if (!err && path->dentry)
		goto done;

	err = real_lookup_union(nd, name, path);
	if (err)
		goto fail;
	goto done;

cache_lookup_union() looks for the dentry in the dcache, starting at
the top layer and following down. If it finds nothing, it returns a
negative dentry from the top layer. If it finds a directory, it looks
for the same directory in the bottom layer; if that exists, it
allocates a union_mount struct and hangs the bottom layer dentry off
of it. real_lookup_union() does the same for uncached entries.

Todo:

- Reorganize cache/hash/real lookup code - lots of code duplication
- Turn create-on-topmost test into #ifdef'able function
- Rewrite with assumption that topmost directory always exists
- Remove duplicated tests and other duplicated code

File copyup
-----------

Any system call that alters an existing file on the bottom layer
(including creating or moving a hard link to it) will trigger a copyup
of the target file to the top layer (via union_copyup() or
__union_copyup()). This includes:

- open(O_WRONLY | O_RDWR | O_APPEND | O_DIRECT)
- truncate()/ftruncate()/open(O_TRUNC)
- link()
- rename()
- chmod()
- chattr()

Copyup of a file DOES NOT occur on:

- open(O_RDONLY) if noatime
- stat() if noatime
- creat()/mkdir()/mknod()
- symlink()
- unlink()/rmdir()

From an application's point of view, the result of an in-kernel file
copyup is the logical equivalent of another application updating the
file via the rename() pattern: creat() a new file, copy the data over,
make changes to the copy, and rename() over the old version. Any
existing open file descriptors for that file (including those in the
same application) refer to a now invisible and unreferenced object
that used to have the same pathname. Only opens that occur after the
copyup will see updates to the file.
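
The rename() pattern referred to above looks roughly like this in
user space. It is a minimal sketch using ordinary POSIX calls;
update_via_rename() and its parameters are hypothetical, and a short
write is simply treated as an error.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the current contents of `path` into `tmp`,
 * append `extra` to the copy, then rename the copy over `path`.
 * Returns 0 on success, -1 on error.
 */
int update_via_rename(const char *path, const char *tmp, const char *extra)
{
	char buf[4096];
	size_t len;
	ssize_t n;
	int in, out, ok;

	assert(path && tmp && extra);
	len = strlen(extra);

	in = open(path, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}

	/* Copy the data over; stop on a read error or a short write. */
	while ((n = read(in, buf, sizeof(buf))) > 0)
		if (write(out, buf, (size_t)n) != n)
			break;
	ok = (n == 0);

	/* Make changes to the copy. */
	if (ok && write(out, extra, len) != (ssize_t)len)
		ok = 0;

	close(in);
	if (close(out) != 0)
		ok = 0;

	/* Replace the old name; already-open descriptors keep the old file. */
	if (!ok || rename(tmp, path) != 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

A descriptor opened on `path` before the call keeps reading the old
contents afterwards, which is the behavior an application sees after a
file copyup.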

Todo:

- copyup on chown()/chmod()/chattr()
- copyup if atime is enabled?

Permission checks
-----------------

We want to be sure we have the correct permissions to actually succeed
in a system call before copying a file up to avoid unnecessary IO. At
present, the permission check for a single system call may be spread
out over many hundreds of lines of code (e.g., open()). In order to
check permissions, we occasionally need to determine if there is a
writable overlay on top of this inode. This requires a full path, but
often we only have the inode at this point. In particular,
inode_permission() returns EROFS if the inode is on a read-only file
system, which is the wrong answer if there is a writable overlay
mounted on top of it.

Another trouble-maker is may_open(), which both checks permissions for
open AND truncates the file if O_TRUNC is specified. It doesn't make
any sense to copy up the file and then let may_open() truncate it, but
we can't copy it after may_open() truncates it either. The current
ugly hack is to pass the full nameidata to may_open() and copyup
inside may_open().

Some solutions:

- Create __inode_permission() and pass it a flag telling it whether or
not to check for a read-only fs. Create union_permission() which
takes a path, checks for a union mount, and sets the rofs flag.
Place the file copyup call after all the permission checks are
completed. Push down the full path into the functions that need it
and currently only take the dentry or inode.

- For each instance in which we might want to copyup, move permission
checks into a new function and call it from a level at which we
still have the full path. Pass it an "ignore read-only fs" flag if
the file is on a union mount. Pass around the ignore-rofs flag
inside the function doing permission checks. If all the permission
checks complete successfully, copyup the file. Would require moving
truncate out of may_open().
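
The first option can be sketched as follows. This is a user-space
model only: inode_model, inode_permission_model(), and
union_permission_model() are hypothetical stand-ins for the functions
named above, and the permission check is reduced to a single
"writable by caller" bit.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the inode state the check looks at. */
struct inode_model {
	int on_readonly_fs;	/* inode lives on a read-only file system */
	int writable_by_caller;	/* result of the usual mode/ACL checks */
};

/* Models __inode_permission(): the caller decides whether EROFS applies. */
int inode_permission_model(const struct inode_model *inode,
			   int want_write, int ignore_rofs)
{
	assert(inode);
	if (want_write && inode->on_readonly_fs && !ignore_rofs)
		return -EROFS;
	if (want_write && !inode->writable_by_caller)
		return -EACCES;
	return 0;
}

/*
 * Models union_permission(): with a full path we can tell whether a
 * writable overlay covers this inode and, if so, skip the EROFS test.
 */
int union_permission_model(const struct inode_model *inode,
			   int has_overlay, int want_write)
{
	return inode_permission_model(inode, want_write, has_overlay);
}
```

Only if the call returns 0 would the copyup be started, so no IO is
spent on a write that would have been refused anyway.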

Todo:
- On truncate, only copy up the N bytes of file data requested
- Make sure above handles truncate beyond EOF correctly
- File copyup on chown()/chmod()/chattr() etc.
- File copyup on open(O_APPEND)
- File copyup on open(O_DIRECT)

Impact on non-union kernels and mounts
--------------------------------------

Union-related data structures, extra fields, and function calls are
#ifdef'd out at the function/macro level with CONFIG_UNION_MOUNT in
nearly all cases (see include/linux/union.h). The union-specific code
in the cache lookup path is out of line.

Currently, is_unionized() is pretty heavy-weight: it walks up the
mount hierarchy, grabbing the vfsmount lock at each level. It may be
possible to simplify this greatly if a writable layer can only cover
exactly one mount, rather than a tree of mounts.

Todo:

- Turn copyup in __link_path_walk() into #ifdef'd function
- Do performance tests
- Optimize is_unionized()
- Properly #ifdef out mount path code

Locking strategy
================

The current writable overlay locking strategy is based on the
following rules:

* Exactly two file systems are unioned
* The bottom file system is always read-only
* The top file system is always read-write
=> A file system can never be a top and a bottom layer at the same time

Additionally, the top layer (the writable overlay) may only be mounted
exactly once. Don't think of the writable overlay as a separate
independent file system; when it is mounted as a writable overlay, it
is only a file system in conjunction with the read-only bottom layer.
The read-only bottom layer is an independent file system in and of
itself and can be mounted elsewhere, including as the bottom layer for
another writable overlay.

Thus, we may define a stable locking order in terms of top layer and
bottom layer locks, since a top layer is never a bottom layer and a
bottom layer is never a top layer. Objects from the bottom layer are
never changed (so don't need write locks) and only require atomic
operations to manage kernel data structures (ref counts, etc.).

Another simplifying assumption is that all directories in a pathname
exist on the top layer, as they are created step-by-step during
lookup. This prevents us from ever having to walk backwards up the
path creating directory entries, which can get complicated especially
when you consider the need to prevent topology changes. By
implication, parent directories during any operation (rename(),
unlink(), etc.) are from the top layer. Dentries for directories from
the bottom layer are only ever used by lookup code.

The two major problems we avoid with the above rules are:

Lock ordering: Imagine two union stacks with the same two file
systems: A mounted over B, and B mounted over A. Sometimes locks on
objects in both A and B will have to be held simultaneously. What
order should they be acquired in? Simply acquiring them from top to
bottom will create a lock-ordering problem - one thread acquires lock
on object from A and then tries for a lock on object from B, while
another thread grabs the lock on object from B and then waits for the
lock on object from A. Some other lock ordering must be defined.

Movement/change/disappearance of objects on multiple layers: A variety
of nasty corner cases arise when more than one layer is changing at
the same time. Changes in the directory topology and their effect on
inheritance are of special concern. Al Viro's canonical email on the
subject:

http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html

We don't try to solve any of these cases, just avoid them in the first
place.

Todo: Prevent top layer from being mounted more than once.

Cross-layer interactions
------------------------

The VFS code simultaneously holds references to and/or modifies
objects from both the top and bottom layers in the following cases:

Path lookup:

Holds i_mutex on top layer directory inode while doing lookups on
bottom layer. Grabs i_mutex on bottom layer off and on.

Todo:
- Is i_mutex on lower directory necessary?

File copyup in general:

File copyup occurs while holding i_mutex on the parent directory of
the top layer. As noted before, an in-kernel file copyup is the
logical equivalent of a userspace rename() of an identical file on to
this pathname.

link():

File copyup of target while holding i_mutex on parent directory on top
layer. Followed by a normal link() operation.

rename():

First, renaming of directories returns EXDEV. It's not at all
reasonable to recursively copy directory trees and userspace has to
handle this case anyway.

Rename involves two operations on a writable overlay: (1) creation of
a whiteout covering the source of the rename, (2) a copyup of the file
from the bottom layer. The file copyup does not need to happen
atomically, only the whiteout and the new link to the file.

I propose that we copyup the source file to the "old" name (rather
than directly to the "new" name), and then perform the normal file
system rename operation. The only addition is creation of whiteout
for the old name.

The current rename() implementation is just a hack to get things
working and doesn't work at all as described above.

Lock order: The file copyup happens before the rename() lock. When we
create the whiteout, we will already have the directory i_mutex.
Otherwise, locking as usual.

Directory copyup:

Directory entries are copied up on the first readdir(). We hold the
top layer directory i_mutex throughout. A fallthru is created for
each entry that appears only on the lower layer.

Current patch takes the i_mutex on the bottom layer directory, which
doesn't seem to be necessary.
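
A user-space sketch of the directory copyup step, assuming directory
contents fit in a small fixed array. All names (dent, top_dir,
dir_copyup_model) are hypothetical, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum kind { K_REAL, K_WHITEOUT, K_FALLTHRU };

struct dent {
	const char *name;
	enum kind kind;
};

#define MAX_ENTS 16

/* Hypothetical model of a top layer directory. */
struct top_dir {
	struct dent ents[MAX_ENTS];
	size_t n;
	int opaque;
};

static int has_entry(const struct top_dir *d, const char *name)
{
	for (size_t i = 0; i < d->n; i++)
		if (strcmp(d->ents[i].name, name) == 0)
			return 1;
	return 0;
}

/*
 * On first readdir(): add a fallthru for every bottom layer name that
 * has no entry of any kind on the top layer, then mark the directory
 * opaque. Returns the number of fallthrus added, or -1 if out of room.
 */
int dir_copyup_model(struct top_dir *top,
		     const char *const *bottom, size_t nbottom)
{
	int added = 0;

	assert(top && (bottom || nbottom == 0));
	if (top->opaque)
		return 0;	/* already copied up, or nothing to inherit */

	for (size_t i = 0; i < nbottom; i++) {
		if (has_entry(top, bottom[i]))
			continue;	/* covered, or hidden by a whiteout */
		if (top->n == MAX_ENTS)
			return -1;
		top->ents[top->n].name = bottom[i];
		top->ents[top->n].kind = K_FALLTHRU;
		top->n++;
		added++;
	}
	top->opaque = 1;
	return added;
}
```

After the first call the directory is opaque, so later readdir() calls
in this model never consult the bottom layer again.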

VFS-fs interface
================

Read-only layer: No support necessary other than enforcement of really
really read-only semantics (done by VFS for local file systems).

Writable layer: Must implement two new inode operations:

int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*fallthru) (struct inode *, struct dentry *);

And set the MS_WHITEOUT flag.

Whiteouts and fallthrus are most similar to symlinks, since they
redirect to an object possibly located in another file system without
keeping a reference on it.

Todo:

- Return correct inode number in d_ino member of struct dirent by one of:
- Save inode number of target in fallthru entry itself
- Lookup inode number during readdir()
- Try re-implementing ext2 as special symlinks - may be much simpler
- Implement ext3 (also as symlinks?)
- Implement btrfs

Supported file systems
----------------------

Any file system can be a read-only layer. File systems must
explicitly support whiteouts and fallthrus in order to be a read-write
layer. This patch set implements whiteouts for ext2, tmpfs, and
jffs2. We have tested ext2, tmpfs, and iso9660 as the read-only
layer.

Todo:
- Test corner cases of case-insensitive/oversensitive file systems

NFS interaction
===============

NFS is currently not supported as either type of layer. NFS as
read-only layer requires support from the server to honor the
read-only guarantee needed for the bottom layer. To do this, the
server needs to revoke access to clients requesting read-only file
systems if the exported file system is remounted read-write or
unmounted (during which arbitrary changes can occur). Some recent
discussion:

http://markmail.org/message/3mkgnvo4pswxd7lp

NFS as the read-write layer would require implementation of the
->whiteout() and ->fallthru() methods. DT_WHT directory entries are
theoretically already supported.

Also, technically the requirement for a readdir() cookie that is
stable across reboots comes only from file systems exported via NFSv2:

http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html

Todo:

- Implement whiteout()/fallthru() for NFS
- Guarantee really really read-only on NFS exports

Userland support
================

The mount command must support the "-o union" mount option and pass
the corresponding MS_UNION flag to the kernel. A util-linux git
tree with writable overlay support is here:

git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git

File system utilities must support whiteouts and fallthrus. An
e2fsprogs git tree with writable overlay support is here:

git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git

Currently, whiteout directory entries are not returned to userland.
While the directory type for whiteouts, DT_WHT, has been defined for
many years, very little userland code handles them. Userland will
never see fallthru directory entries.
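
If whiteouts were returned to userland, an application could detect
them through the d_type field. The following is only a sketch of what
such handling might look like: count_whiteouts() is a hypothetical
helper, and DT_WHT is checked only where the C library headers define
it.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/*
 * Hypothetical helper: count directory entries reported as whiteouts.
 * Returns -1 if the directory cannot be opened.
 */
int count_whiteouts(const char *path)
{
	struct dirent *e;
	int n = 0;
	DIR *d;

	assert(path);
	d = opendir(path);
	if (!d)
		return -1;

	while ((e = readdir(d)) != NULL) {
#ifdef DT_WHT
		if (e->d_type == DT_WHT)
			n++;
#endif
	}
	closedir(d);
	return n;
}
```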

Known non-POSIX behaviors
-------------------------

- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or EIO
- Link count may be wrong for files on bottom layer with > 1 link count
- Link count on directories will be wrong before readdir() (fixable)
- File copyup is the logical equivalent of an update via copy +
rename(). Any existing open file descriptors will continue to refer
to the read-only copy on the bottom layer and will not see any
changes that occur after the copy-up.
- rename() of directory fails with EXDEV

Status
======

The current writable overlays patch set varies between RFC/prototype
and pretty stable, depending on the particular patch. The current
patch set boots to multi-user mode with a writable overlay root file
system (albeit with some complaints). Some parts of the code were
written years ago and have been reviewed, rewritten and tested many
times. Other parts were written last month and need review,
rewriting, and testing. The commit messages note the state of each
patch.

The current patch set is against 2.6.31. You can find it here, in the
branch "overlay":

git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git

Non-features
------------

Features we do not currently plan to support as part of writable
overlays:

Online upgrade: E.g., installing software on a file system NFS
exported to clients while the clients are still up and running.
Allowing the read-only bottom layer to change while the writable
overlay file system is mounted invalidates our locking strategy.

Recursive copying of directories: E.g., implementing rename() across
layers for directories. Doing an in-kernel copy of a single file is
bad enough. Recursively copying a directory is a big no-no.

Read-only top layer: The readdir() strategy fundamentally requires the
ability to create persistent directory entries on the top layer file
system (which may be tmpfs). Numerous alternatives (including
in-kernel or in-application caching) exist and are compatible with
writable overlays with its writing-readdir() implementation disabled.
Creating a readdir() cookie that is stable across multiple readdir()s
requires one of:

- Write to stable storage (e.g., fallthru dentries)
- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
- Per-application caching by glibc readdir()

Aggregation of multiple read-only file systems: While perfectly
reasonable from a user perspective, we just aren't smart enough to
figure out the locking problems from a kernel perspective. Sorry!

Often these features are supported by other unioning file systems or
by other versions of union mounts.

Contributing to writable overlays
=================================

The writable overlays web page is here:

http://valerieaurora.org/union/

It links to:

- All git repositories
- Documentation
- An entire self-contained UML-based dev kit with README, etc.

The mailing list for discussing writable overlays is:

linux-fsdevel@vger.kernel.org

http://vger.kernel.org/vger-lists.html#linux-fsdevel

Thank you for reading!


2009-10-01 15:55:12

by kevin granade

Subject: Re: [RFC] Union mounts/writable overlays design

My apologies to anyone who has received this twice now, re-sending
after gmail added a html rider to the previous email, which was
rejected by lkml. (A pox on "rich text" emails!)


Wow, amazingly thorough writeup, was a very interesting read and I'm
looking forward to trying this out.

> Examples
> ========
>
> What happens when I...
>
> - creat() /newfile -> creates on top layer
> - unlink() /oldfile -> creates a whiteout on top layer
> - Edit /existingfile -> copies up to top layer at open(O_WR) time
> - truncate /existingfile -> copies up to top layer + N bytes if specified
> - touch()/chmod()/chown()/etc. -> copies up to top layer
> - mkdir() /newdir -> creates on top layer
> - rmdir() /olddir -> creates a whiteout on top layer
> - mkdir() /olddir after above -> creates on top layer w/ opaque flag
> - readdir() /shareddir -> copies up entries from bottom layer as fallthrus
> - link() /oldfile /newlink -> copies up /oldfile, creates /newlink on top layer
> - symlink() /oldfile /symlink -> nothing special
> - rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer

Minor quibble here, rename should also whiteout /oldfile, of course
you have it explained correctly in the detailed description of
rename() below.
Or am I misunderstanding and the above is what it does now and the
detailed description is what it will do once implemented properly?


> Non-features
> ------------
>
> Features we do not currently plan to support as part of writable
> overlays:
>
> Online upgrade: E.g., installing software on a file system NFS
> exported to clients while the clients are still up and running.
> Allowing the read-only bottom layer to change while the writable
> overlay file system is mounted invalidates our locking strategy.


So as long as the RO filesystem is NOT mounted as part of an overlay,
you could modify it and then re-construct the previous overlay and
things will work as expected?
For example could one create a hard drive over CD overlay, then
periodically (requiring a reboot probably) replace the underlying CD
with a new version and automagically have new versions of software
available? ( obviously there are additional complexities in packaging
to make this work, but having support in the kernel is the first step.
)

One last thing, I don't see this in either the "features" or the
"non-features". Will there be a way to "revert" a file to the RO
version once it has been copied up, either by just removing the
overlay entry or by somehow forcing the open of the underlying file
when it has an overlay? Now that I think of it, one could just mount
the underlying filesystem elsewhere and copy the file, but I'd still
be interested to know if there is any desire to provide the more
"direct" operation.


> Thank you for reading!


Thank you for writing!!!

-Kevin Granade

2009-10-01 17:16:21

by Valerie Aurora

Subject: Re: [RFC] Union mounts/writable overlays design

On Thu, Oct 01, 2009 at 10:38:27AM -0500, kevin granade wrote:
> Wow, amazingly thorough writeup, was a very interesting read and I'm looking
> forward to trying this out.
>
> > Examples
> > ========
> >
> > What happens when I...
> >
> > - creat() /newfile -> creates on top layer
> > - unlink() /oldfile -> creates a whiteout on top layer
> > - Edit /existingfile -> copies up to top layer at open(O_WR) time
> > - truncate /existingfile -> copies up to top layer + N bytes if specified
> > - touch()/chmod()/chown()/etc. -> copies up to top layer
> > - mkdir() /newdir -> creates on top layer
> > - rmdir() /olddir -> creates a whiteout on top layer
> > - mkdir() /olddir after above -> creates on top layer w/ opaque flag
> > - readdir() /shareddir -> copies up entries from bottom layer as fallthrus
> > - link() /oldfile /newlink -> copies up /oldfile, creates /newlink on top
> > layer
> > - symlink() /oldfile /symlink -> nothing special
> > - rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
> >
>
> Minor quibble here, rename should also whiteout /oldfile, of course you have
> it explained correctly in the detailed description of rename() below.
> Or am I misunderstanding and the above is what it does now and the detailed
> description is what it will do once implemented properly?

Hi Kevin,

You are correct, it whites out the original name. Thanks for pointing
that out!
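
The corrected rename() behaviour can be modelled in userland roughly as
follows. This is only a sketch: struct overlay, overlay_exists(), and
overlay_rename() are hypothetical names for illustration and do not come
from the patch set. For simplicity the model always leaves a whiteout
for the old name, even when the old name exists only on the top layer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userland model; none of these names come from the patches. */
enum entry_kind { ENTRY_FILE, ENTRY_WHITEOUT };

struct entry {
    const char *name;
    enum entry_kind kind;
};

#define TOP_MAX 16

struct overlay {
    const char *const *bottom;  /* read-only layer: NULL-terminated names */
    struct entry top[TOP_MAX];  /* writable layer: files and whiteouts */
    size_t top_len;
};

static struct entry *top_find(struct overlay *ov, const char *name)
{
    for (size_t i = 0; i < ov->top_len; i++)
        if (strcmp(ov->top[i].name, name) == 0)
            return &ov->top[i];
    return NULL;
}

static bool bottom_has(const struct overlay *ov, const char *name)
{
    for (size_t i = 0; ov->bottom[i] != NULL; i++)
        if (strcmp(ov->bottom[i], name) == 0)
            return true;
    return false;
}

/* Create or overwrite the top-layer entry for name. */
static void top_set(struct overlay *ov, const char *name, enum entry_kind kind)
{
    struct entry *e = top_find(ov, name);
    if (e == NULL) {
        assert(ov->top_len < TOP_MAX);
        e = &ov->top[ov->top_len++];
        e->name = name;
    }
    e->kind = kind;
}

/* The top layer decides if it has an entry; otherwise fall through. */
bool overlay_exists(struct overlay *ov, const char *name)
{
    struct entry *e = top_find(ov, name);
    if (e != NULL)
        return e->kind == ENTRY_FILE;
    return bottom_has(ov, name);
}

/* rename(): new name appears on top, old name is hidden by a whiteout. */
bool overlay_rename(struct overlay *ov, const char *oldname,
                    const char *newname)
{
    if (!overlay_exists(ov, oldname))
        return false;
    if (strcmp(oldname, newname) == 0)
        return true;
    top_set(ov, newname, ENTRY_FILE);      /* copy up under the new name */
    top_set(ov, oldname, ENTRY_WHITEOUT);  /* white out the old name */
    return true;
}
```

After overlay_rename(&ov, "oldfile", "newfile") on an overlay whose
bottom layer contains "oldfile", a lookup of "oldfile" fails and a
lookup of "newfile" succeeds, without the bottom layer being touched.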

> > Non-features
> > ------------
> >
> > Features we do not currently plan to support as part of writable
> > overlays:
> >
> > Online upgrade: E.g., installing software on a file system NFS
> > exported to clients while the clients are still up and running.
> > Allowing the read-only bottom layer to change while the writable
> > overlay file system is mounted invalidates our locking strategy.
> >
>
> So as long as the RO filesystem is NOT mounted as part of an overlay, you
> could modify it and then re-construct the previous overlay and things will
> work as expected?
> For example could one create a hard drive over CD overlay, then periodically
> (requiring a reboot probably) replace the underlying CD with a new version
> and automagically have new versions of software available? ( obviously
> there are additional complexities in packaging to make this work, but having
> support in the kernel is the first step. )

This could theoretically work, but the main problem is resolving
differences between files (always the big problem in upgrade). Say
you have /etc/passwd, you add a new user and write to it on the top
layer, and then the next upgrade adds a new user to the read-only
version. You're not going to see the new user.
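
To make the /etc/passwd case concrete, here is a small userland sketch
of the lookup rule, assuming whole-file copy-up and no merging between
layers. struct layer_file and overlay_read() are hypothetical names used
only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one file in one layer; not from the patches. */
struct layer_file {
    const char *path;
    const char *contents;
};

static const struct layer_file *layer_find(const struct layer_file *files,
                                           size_t n, const char *path)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(files[i].path, path) == 0)
            return &files[i];
    return NULL;
}

/*
 * Return the contents visible through the overlay: the top layer's copy
 * if one exists, otherwise the bottom layer's. Nothing is merged, so a
 * copied-up file hides every later change to the bottom version.
 */
const char *overlay_read(const struct layer_file *top, size_t ntop,
                         const struct layer_file *bottom, size_t nbottom,
                         const char *path)
{
    assert(path != NULL);
    const struct layer_file *f = layer_find(top, ntop, path);
    if (f == NULL)
        f = layer_find(bottom, nbottom, path);
    return f != NULL ? f->contents : NULL;
}
```

If the top layer holds a copied-up /etc/passwd with a locally added
user, and the bottom layer is then replaced by one whose /etc/passwd has
a different new user, overlay_read() still returns the top copy.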

> One last thing, I don't see this in either the "features" or the
> "non-features". Will there be a way to "revert" a file to the RO version
> once it has been copied up, either by just removing the overlay entry or by
> somehow forcing the open of the underlying file when it has an overlay? Now
> that I think of it, one could just mount the underlying filesystem elsewhere
> and copy the file, but I'd still be interested to know if there is any
> desire to provide the more "direct" operation.

I think that people are calling this "punch-through." I don't see a
problem with it, other than slightly more kernel support.

-VAL

2009-10-01 17:55:30

by Jan Blunck

Subject: Re: [RFC] Union mounts/writable overlays design

On 01.10.2009 at 19:15, Valerie Aurora <[email protected]> wrote:

> On Thu, Oct 01, 2009 at 10:38:27AM -0500, kevin granade wrote:
>> Wow, amazingly thorough writeup, was a very interesting read and I'm
>> looking forward to trying this out.
>>
>>> Examples
>>> ========
>>>
>>> What happens when I...
>>>
>>> - creat() /newfile -> creates on top layer
>>> - unlink() /oldfile -> creates a whiteout on top layer
>>> - Edit /existingfile -> copies up to top layer at open(O_WR) time
>>> - truncate /existingfile -> copies up to top layer + N bytes if specified
>>> - touch()/chmod()/chown()/etc. -> copies up to top layer
>>> - mkdir() /newdir -> creates on top layer
>>> - rmdir() /olddir -> creates a whiteout on top layer
>>> - mkdir() /olddir after above -> creates on top layer w/ opaque flag
>>> - readdir() /shareddir -> copies up entries from bottom layer as fallthrus
>>> - link() /oldfile /newlink -> copies up /oldfile, creates /newlink on top
>>> layer
>>> - symlink() /oldfile /symlink -> nothing special
>>> - rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
>>>
>>
>> Minor quibble here, rename should also whiteout /oldfile, of course
>> you have it explained correctly in the detailed description of
>> rename() below.
>> Or am I misunderstanding and the above is what it does now and the
>> detailed description is what it will do once implemented properly?
>
> Hi Kevin,
>
> You are correct, it whites out the original name. Thanks for pointing
> that out!
>
>>> Non-features
>>> ------------
>>>
>>> Features we do not currently plan to support as part of writable
>>> overlays:
>>>
>>> Online upgrade: E.g., installing software on a file system NFS
>>> exported to clients while the clients are still up and running.
>>> Allowing the read-only bottom layer to change while the writable
>>> overlay file system is mounted invalidates our locking strategy.
>>>
>>
>> So as long as the RO filesystem is NOT mounted as part of an overlay,
>> you could modify it and then re-construct the previous overlay and
>> things will work as expected?
>> For example could one create a hard drive over CD overlay, then
>> periodically (requiring a reboot probably) replace the underlying CD
>> with a new version and automagically have new versions of software
>> available? ( obviously there are additional complexities in packaging
>> to make this work, but having support in the kernel is the first
>> step. )
>
> This could theoretically work, but the main problem is resolving
> differences between files (always the big problem in upgrade). Say
> you have /etc/passwd, you add a new user and write to it on the top
> layer, and then the next upgrade adds a new user to the read-only
> version. You're not going to see the new user.
>

No. Scripts that come with updated packages still need to run on the
union. Otherwise this is just asking for problems. You could probably
come up with a clever merger if both the update and the base image are
still available.

>> One last thing, I don't see this in either the "features" or the
>> "non-features". Will there be a way to "revert" a file to the RO
>> version once it has been copied up, either by just removing the
>> overlay entry or by somehow forcing the open of the underlying file
>> when it has an overlay? Now that I think of it, one could just mount
>> the underlying filesystem elsewhere and copy the file, but I'd still
>> be interested to know if there is any desire to provide the more
>> "direct" operation.
>
> I think that people are calling this "punch-through." I don't see a
> problem with it, other than slightly more kernel support.
>
> -VAL

2009-10-01 19:34:43

by Brad Boyer

Subject: Re: [RFC] Union mounts/writable overlays design

On Thu, Oct 01, 2009 at 10:55:48AM -0400, Valerie Aurora wrote:
> We need to guarantee that a file system will be read-only for as long
> as it is the bottom layer of a writable overlay. To do this, we track
> the number of "read-only users" of a file system in its VFS superblock
> structure. When we mount a writable overlay over a file system, we
> increment its read-only user count. The file system can only be
> mounted read-write if its read-only users count is zero.
>
> Todo:
>
> - Support really really read-only NFS mounts. See discussion here:
>
> http://markmail.org/message/3mkgnvo4pswxd7lp

Is there any way for a file system driver to just come out and say
"I can't guarantee that this mount is really read-only"? I can imagine
this might be an issue for things other than NFS. I think it would be
worthwhile to have a flag maybe on a per sb level that says that even
if it is mounted with the "ro" option that it might not really be stable.
I don't think this is essential, but it would be a good feature as long
as it doesn't damage the design or performance too much.
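
A rough userland model of the read-only user count described in the
quoted design text, plus the per-superblock flag suggested here, might
look like the sketch below. struct model_sb and every function and field
name in it are hypothetical. Refusing to attach an overlay when the flag
is set is only one possible policy and is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the VFS superblock; names are illustrative. */
struct model_sb {
    int  ro_users;     /* overlays using this fs as their bottom layer */
    bool mounted_rw;   /* currently mounted read-write */
    bool ro_unstable;  /* suggested flag: "ro" cannot be guaranteed */
};

/* Mounting an overlay on top takes a read-only reference. */
bool model_overlay_attach(struct model_sb *sb)
{
    if (sb->mounted_rw || sb->ro_unstable)
        return false;
    sb->ro_users++;
    return true;
}

/* Unmounting the overlay drops the read-only reference. */
void model_overlay_detach(struct model_sb *sb)
{
    assert(sb->ro_users > 0);
    sb->ro_users--;
}

/* A read-write (re)mount is allowed only with zero read-only users. */
bool model_remount_rw(struct model_sb *sb)
{
    if (sb->ro_users > 0)
        return false;
    sb->mounted_rw = true;
    return true;
}
```

With this model, a read-write remount fails while an overlay is
attached and succeeds again once the overlay has been detached.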

Brad Boyer
[email protected]

2009-10-01 20:08:45

by kevin granade

Subject: Re: [RFC] Union mounts/writable overlays design

On Thu, Oct 1, 2009 at 12:55 PM, Jan Blunck <[email protected]> wrote:
> On 01.10.2009 at 19:15, Valerie Aurora <[email protected]> wrote:
>
>> On Thu, Oct 01, 2009 at 10:38:27AM -0500, kevin granade wrote:


>>>> Non-features
>>>> ------------
>>>>
>>>> Features we do not currently plan to support as part of writable
>>>> overlays:
>>>>
>>>> Online upgrade: E.g., installing software on a file system NFS
>>>> exported to clients while the clients are still up and running.
>>>> Allowing the read-only bottom layer to change while the writable
>>>> overlay file system is mounted invalidates our locking strategy.
>>>>
>>>
>>> So as long as the RO filesystem is NOT mounted as part of an overlay,
>>> you could modify it and then re-construct the previous overlay and
>>> things will work as expected?
>>> For example could one create a hard drive over CD overlay, then
>>> periodically (requiring a reboot probably) replace the underlying CD
>>> with a new version and automagically have new versions of software
>>> available? ( obviously there are additional complexities in packaging
>>> to make this work, but having support in the kernel is the first
>>> step. )
>>
>> This could theoretically work, but the main problem is resolving
>> differences between files (always the big problem in upgrade). Say
>> you have /etc/passwd, you add a new user and write to it on the top
>> layer, and then the next upgrade adds a new user to the read-only
>> version. You're not going to see the new user.
>>
>
> No. Scripts that come with updated packages still need to run on the union.
> Otherwise this is just asking for problems. Probably you could come up with
> a clever merger if the update and the base image is still available.
>

Yes, that sort of thing is what I meant by "additional complexities in
packaging", and I understand that they are in no way trivial, but I
was mostly interested in whether that kind of behavior would be
supported at the kernel level at all.

For example a simpler use, once again with the HD-over-CD distro, is
one could build in an option to "flatten" the entire contents of the
overlay onto a new CD, at which point the CD itself contains the
logical contents of the overlay at that point in time. The user could
then wipe the HD (or ramdrive in some scenarios) and continue with all
their customizations in place, but no space being used on the RW
filesystem.

Puppy Linux did this in one installation mode by using Unionfs to
progressively "flatten" the user's additions onto successive tracks on
the DVD being used as the RO device. When the disk ran out of room
the logical "top layer" could be copied to a new disk and the process
restarted. Obviously this system won't support the iterative addition
of filesystem changes, but it could support more occasional
"flattening" of the overlay onto a new RO media.

Another scenario would be a liveCD that allows the user to make
(potentially extensive) customizations on a ramdrive, then burn a new
liveCD with all the user's customizations included on the disc.

-Kevin Granade

2009-10-02 15:42:14

by J. R. Okajima

Subject: Re: [RFC] Union mounts/writable overlays design


Hi,

Valerie Aurora:
> Writable overlays (formerly union mounts)
> =========================================
>
> In this document:
> - Overview of writable overlays
> - Terminology
> - VFS implementation
:::

While I don't remember exactly when I first read the source files of
UnionMount, I think it is promising. And I have written to Val and Jan
some of my comments or reviews about UnionMount.
Recently I noticed another issue about stat(2) and mountpoint(1). The
latter is part of the 'initscripts' package.

For example,
- you have a union-ed directory, /u = /rw + /ro
- /ro/usr dir exists
- /rw/usr dir does NOT exist
- of course, /u/usr exists

As far as I know, UnionMount is expected to handle /u/usr directory
as if it exists under /u dir.
(I may be wrong since it totally depends upon the design of UnionMount)

In this case, stat(2) for /u and /u/usr will return different st_dev
values, e.g. stat(/u/usr) returns the st_dev value of /ro, while
stat(/u) returns the one for /rw.
This behaviour may confuse /bin/mountpoint, particularly in a
chroot/switch_root-ed environment.
/bin/mountpoint issues stat(2) for the specified dir and its parent, and
compares their st_dev. If they differ from each other, the utility
treats the specified dir as a "mountpoint". I am afraid it will make
some init-scripts crazy because /u/usr is NOT actually a mountpoint.
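
The check described above can be sketched in a few lines of C. This is
not the source of /bin/mountpoint: looks_like_mountpoint() is a
hypothetical name, and the extra inode comparison, which makes "/"
count as a mountpoint because "/" and "/.." are the same directory, is
an addition in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Returns 1 if dir looks like a mountpoint according to the st_dev
 * heuristic, 0 if it does not, and -1 on error or if dir is not a
 * directory.
 */
int looks_like_mountpoint(const char *dir)
{
    char parent[4096];
    struct stat st, pst;

    assert(dir != NULL);

    /* Build "<dir>/.." to name the parent directory. */
    int n = snprintf(parent, sizeof parent, "%s/..", dir);
    if (n < 0 || (size_t)n >= sizeof parent)
        return -1;

    if (stat(dir, &st) != 0 || stat(parent, &pst) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode))
        return -1;

    /* Different device than the parent, or dir is its own parent. */
    return st.st_dev != pst.st_dev || st.st_ino == pst.st_ino;
}
```

In the scenario above, if stat(/u/usr) reported the st_dev of /ro while
stat(/u/usr/..) reported the st_dev of /rw, this function would return 1
for /u/usr even though nothing is mounted there.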

One possible solution would be to set a hook in vfs_stat() that
handles a vfsmount with the UNION flag set differently and returns a
pseudo st_dev for the entries in UnionMount. But it may lead to
duplicated inode numbers, which may make applications crazy.
For instance,
- /ro/fileA is hardlinked to /ro/fileB.
- the inode number of them is i100.
- /rw/fileC is hardlinked to /rw/fileD.
- the inode number of them is i100 too.

Since /ro and /rw are different, the same inode number is not a
problem natively. But if UnionMount takes the approach above, they all
have the same st_dev value. And I am afraid some applications may
handle them as a single hardlink unexpectedly.

So should UnionMount maintain its own inode numbers?
No, that would turn it into a filesystem-type implementation, which
should not be the way of UnionMount.
Are there any ideas to solve this problem?


J. R. Okajima

2009-10-02 19:15:53

by Valerie Aurora

Subject: Re: [RFC] Union mounts/writable overlays design

On Thu, Oct 01, 2009 at 03:08:46PM -0500, kevin granade wrote:
> On Thu, Oct 1, 2009 at 12:55 PM, Jan Blunck <[email protected]> wrote:
> > On 01.10.2009 at 19:15, Valerie Aurora <[email protected]> wrote:
> >
> >> On Thu, Oct 01, 2009 at 10:38:27AM -0500, kevin granade wrote:
>
>
> >>>> Non-features
> >>>> ------------
> >>>>
> >>>> Features we do not currently plan to support as part of writable
> >>>> overlays:
> >>>>
> >>>> Online upgrade: E.g., installing software on a file system NFS
> >>>> exported to clients while the clients are still up and running.
> >>>> Allowing the read-only bottom layer to change while the writable
> >>>> overlay file system is mounted invalidates our locking strategy.
> >>>>
> >>>
> >>> So as long as the RO filesystem is NOT mounted as part of an overlay,
> >>> you could modify it and then re-construct the previous overlay and
> >>> things will work as expected?
> >>> For example could one create a hard drive over CD overlay, then
> >>> periodically (requiring a reboot probably) replace the underlying CD
> >>> with a new version and automagically have new versions of software
> >>> available? ( obviously there are additional complexities in packaging
> >>> to make this work, but having support in the kernel is the first
> >>> step. )
> >>
> >> This could theoretically work, but the main problem is resolving
> >> differences between files (always the big problem in upgrade). Say
> >> you have /etc/passwd, you add a new user and write to it on the top
> >> layer, and then the next upgrade adds a new user to the read-only
> >> version. You're not going to see the new user.
> >>
> >
> > No. Scripts that come with updated packages still need to run on the union.
> > Otherwise this is just asking for problems. Probably you could come up with
> > a clever merger if the update and the base image is still available.
> >
>
> Yes, that sort of thing is what I meant by "additional complexities in
> packaging", and I understand that they are in no way trivial, but I
> was mostly interested in whether that kind of behavior would be
> supported at the kernel level at all.

To a first approximation, if your question looks anything like:

"Can I do [...] packaging [...] with writable overlays?"

The answer is, no, you can't, and if you could, it wouldn't have the
results you expected.

Writable overlays are to packaging as file system snapshots were to
source control - check out all the old papers, people were always
saying, "And you can use it to make an easy diff of your changes to
your source code!" Now that we have decent source control, no one
would dream of versioning their source code with file system level
snapshots - what about commit grouping? What about commit messages?
What about annotation?

I think packaging is at the same level today. Seriously, go check out
Puppet before hacking around with writable overlays. It will probably
fix your problems and give you more features.

-VAL

2009-10-11 01:44:33

by Valerie Aurora

Subject: Re: [RFC] Union mounts/writable overlays design

On Sat, Oct 03, 2009 at 12:30:42AM +0900, [email protected] wrote:
>
> Hi,
>
> Valerie Aurora:
> > Writable overlays (formerly union mounts)
> > =========================================
> >
> > In this document:
> > - Overview of writable overlays
> > - Terminology
> > - VFS implementation
> :::
>
> While I don't remember exactly when I first read the source files of
> UnionMount, I think it is promising. And I have written to Val and Jan
> some of my comments or reviews about UnionMount.
> Recently I noticed another issue about stat(2) and mountpoint(1). The
> latter is part of the 'initscripts' package.
>
> For example,
> - you have a union-ed directory, /u = /rw + /ro
> - /ro/usr dir exists
> - /rw/usr dir does NOT exist
> - of course, /u/usr exists
>
> As far as I know, UnionMount is expected to handle /u/usr directory
> as if it exists under /u dir.
> (I may be wrong since it totally depends upon the design of UnionMount)
>
> In this case, stat(2) for /u and /u/usr will return different st_dev
> values, e.g. stat(/u/usr) returns the st_dev value of /ro, while
> stat(/u) returns the one for /rw.
>
> This behaviour may confuse /bin/mountpoint, particularly in a
> chroot/switch_root-ed environment.
> /bin/mountpoint issues stat(2) for the specified dir and its parent, and
> compares their st_dev. If they differ from each other, the utility
> treats the specified dir as a "mountpoint". I am afraid it will make
> some init-scripts crazy because /u/usr is NOT actually a mountpoint.

In writable overlays, every directory will be copied up to the top
writable overlay, so /u and /u/usr will both have the same st_dev.
The copy up happens on lookup, so a stat() will trigger this copy up.
A directory and a regular file in it will have different st_dev's,
though. Can you foresee any problems with that?

Thanks again,

-VAL

> One possible solution would be to set a hook in vfs_stat() that
> handles a vfsmount with the UNION flag set differently and returns a
> pseudo st_dev for the entries in UnionMount. But it may lead to
> duplicated inode numbers, which may make applications crazy.
> For instance,
> - /ro/fileA is hardlinked to /ro/fileB.
> - the inode number of them is i100.
> - /rw/fileC is hardlinked to /rw/fileD.
> - the inode number of them is i100 too.
>
> Since /ro and /rw are different, the same inode number is not a
> problem natively. But if UnionMount takes the approach above, they all
> have the same st_dev value. And I am afraid some applications may
> handle them as a single hardlink unexpectedly.
>
> So should UnionMount maintain its own inode numbers?
> No, that would turn it into a filesystem-type implementation, which
> should not be the way of UnionMount.
> Are there any ideas to solve this problem?
>
>
> J. R. Okajima

2009-10-12 05:20:59

by J. R. Okajima

Subject: Re: [RFC] Union mounts/writable overlays design


Valerie Aurora:
> In writable overlays, every directory will be copied up to the top
> writable overlay, so /u and /u/usr will both have the same st_dev.
> The copy up happens on lookup, so a stat() will trigger this copy up.
> A directory and a regular file in it will have different st_dev's,
> though. Can you foresee any problems with that?

You're right.
I was confusing this with another implementation of UnionMount, sorry.
I think I needed more sleep.


J. R. Okajima