2007-05-14 09:30:28

by Bharata B Rao

Subject: [RFC][PATCH 0/14] VFS based Union Mount(v1)

Here is another post of the VFS based union mount implementation.
Union mount provides the filesystem namespace unification feature.
Unlike traditional mounts, which hide the contents of the mount point,
a union mount presents a merged view of the mount point and the
mounted filesystem.

These patches were originally written by Jan Blunck. The current patches are
against 2.6.21-mm1. The main change from the previous post is a different
implementation of union mount readdir, which is heavily inspired by the unionfs
implementation. The earlier version had two serious drawbacks: it worked only
for filesystems which had flat file directories, and it built and destroyed the
readdir cache on every readdir() call. This version addresses both shortcomings.

The code is still in an experimental stage. The intention of posting it
now is to get some initial feedback on the design and on how this should
be taken forward.

You can find more details about union mount in the documentation
included in the patchset.

Kindly review and let us know your comments.

Regards,
Bharata.


2007-05-14 09:31:19

by Bharata B Rao

Subject: [RFC][PATCH 1/14] Add union mount documentation

From: Bharata B Rao <[email protected]>
Subject: Add union mount documentation.

This is an attempt to document some of the implementation details
and issues of union mount.

Signed-off-by: Bharata B Rao <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
---
Documentation/union-mounts.txt | 538 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 538 insertions(+)

--- /dev/null
+++ b/Documentation/union-mounts.txt
@@ -0,0 +1,538 @@
+VFS BASED UNION MOUNT
+=====================
+
+1. Overview
+2. Union stack
+3. Lookup
+4. Readdir
+ 4.1 Duplicate elimination
+ 4.2 Preserving state
+ 4.3 File offset problem
+ 4.4 Altered lseek behaviour
+ 4.5 TODO
+5. Copyup
+6. Whiteout
+ 6.1. Creation and deletion
+ 6.2. Whiteout filetype support
+ 6.3. Directory renaming
+7. Usage
+8. State of the code
+9. Extracted (old)mail comments
+
+1. Overview
+-----------
+Union mount allows mounting of two or more filesystems transparently on
+a single mount point. The contents (files or directories) of all the
+filesystems become visible at the mount point after a union mount. If
+there are files of the same name in multiple layers, only the topmost file
+remains visible in a union mount. However, (currently) directories with a
+common name are unioned again to present a unified view at the subdir level.
+
+In this approach of unioning filesystems, the layering information of
+different components of the union mount is maintained at the VFS layer.
+Hence we call this a VFS based union mount.
+
+2. Union stack
+--------------
+Union stack reflects the stacking of two or more filesystems of the
+union mount. The stacking or the layering information is maintained
+as part of dentry structures of the mountpoint and mount root.
+
+The union stack information in the dentry structure looks like this:
+
+struct dentry {
+ ...
+
+#ifdef CONFIG_UNION_MOUNT
+ struct dentry *d_overlaid; /* overlaid directory */
+ struct dentry *d_topmost; /* topmost directory */
+ struct union_info *d_union; /* union stack info */
+#endif
+ ...
+};
+
+struct union_info {
+ struct mutex u_mutex;
+ atomic_t u_count;
+};
+
+There is one union_info shared by all dentries which are part of
+a union, and the u_count member holds the number of references to the union
+stack. When this reaches zero, the union stack ceases to exist and
+the union_info is freed.
+
+Union stack is essentially a singly linked list of dentries of the union
+with d_topmost as the head of the list and d_overlaid pointing
+to the next member of the stack. The walking of union stack is guarded by
+the u_mutex member.
+
+dget() references every dentry of the overlaid union stack to make sure
+that no dentry of the stack is discarded from memory while others are
+still in use. Since walking of union stack is protected by a mutex,
+dget() can now sleep.
+
+dput() also walks the union stack and releases references to all the
+dentries that are part of the union. If a dentry's reference count
+in a union stack reaches zero, it implies that the dentries above it
+in the stack must also be unused and the union stack can be safely
+destroyed at this point.
+
+Since dget() can sleep with union mount, it becomes necessary to fix many
+callers of dget() to release any spinlocks they are holding before they
+acquire the union lock (mutex) and to re-acquire them afterwards.
+
+3. Lookup
+---------
+With union mount, it becomes necessary to look up pathnames not only
+in the topmost filesystem but also in the underlying filesystems.
+
+In case of looking up a filename, the lookup routines as a rule return
+the match from the topmost layer. However if the file is not found
+in the topmost layer, the lookup routines have been modified to
+find the file in the underlying filesystems of the union stack.
+
+When looking up a directory under a union mount point, the lookup
+code has been modified to build a union stack (if necessary).
+
+When looking up a name in a union directory, it is necessary to
+guarantee that the returned union stack remains valid. Hence
+concurrent lookups are prevented by obtaining the mutex lock during
+lookups.
+
+4. Readdir
+----------
+The core functionality of union mount, viz., the merged view of
+multiple directories is provided by the readdir()/getdents() routines.
+This is achieved by reading the contents of every directory of the union
+stack and by merging the result.
+
+4.1 Duplicate elimination
+
+The directory entries are read starting from the top layer and they
+are maintained in a cache. Subsequently when the entries from the bottom layers
+of the union stack are read, they are checked for duplicates (in the cache)
+before being passed out to user space. Since there can be multiple
+readdir()/getdents() calls to read a single directory, the cache is made to
+persist across these calls. So we need to maintain this cache and the
+associated state across readdir calls.
+
+4.2 Preserving state
+
+If the readdir()/getdents() routine returns midway for any reason (the most
+likely reason is the exhaustion of the user-supplied buffer), the state at
+which it stopped reading is preserved in the union_info structure which hangs
+off the unioned dentries.
+
+The state consists of the following information:
+
+- The directory (dentry) which was being read (this can be the upper or any of
+ the lower directories).
+- The file offset at which the directory entries were being fetched from.
+
+These two form the readdir state or rdstate. The next readdir call on the same
+unioned directory would start from the right directory (dentry) and offset by
+getting the same from the preserved rdstate.
+
+When two processes issue readdir()/getdents() calls on the same unioned
+directory, both of them would be referring to the same dentries via their
+file structures. So it becomes necessary to maintain rdstate separately for
+these two instances. This is achieved by using a cookie variable in the
+rdstate. Each of these rdstate instances would get a different cookie thereby
+differentiating them.
+
+4.3 File offset problem
+
+readdir() is issued on a unioned directory by referring to the topmost
+directory of the union. But since internally readdir has to read lower level
+directories also, there needs to be a mapping between the file->f_pos of the
+topmost directory and the file offsets of the lower directories. Given a
+file->f_pos of the topmost directory, there needs to be a way to determine which
+lower directory, and which offset within it, this file->f_pos specifies. As
+already mentioned, the directory (dentry) and offset are maintained as part of
+rdstate. The file->f_pos is made a function of the rdstate instance (cookie)
+and the actual file offset of the directory (see rdstate2offset() routine).
+After a read from a directory, the file->f_pos of the topmost file is updated
+with the new offset information.
+
+Though we modify the file->f_pos of the topmost file, when it comes to reading
+the directory entries, we always use the original file offsets stored in the
+rdstate. Hence this should work with all filesystems irrespective of how they
+maintain and use file->f_pos.
+
+4.4 Altered lseek behaviour
+
+Since we have modified the meaning of file->f_pos of the topmost directory, an
+lseek on it wouldn't work as expected. It is not even clear if a fully working
+lseek is expected on a unioned directory, because it is not a single directory
+we are seeking in here; there are underlying directories.
+
+So with this scheme, at the moment, it is only possible to support two kinds of
+seek operations:
+
+- seeking to the beginning of the file (which involves destroying the
+ associated rdstate)
+- seeking to the current position :)
+
+All other seek operations return -EINVAL.
+
+4.5 TODO
+
+Handle the case of a directory getting modified (addition/deletion) while
+its contents are cached.
+
+5. Copyup
+---------
+In this implementation of union mount, only the files residing in
+the topmost layer are writable. With this restriction, when a file residing
+in a bottom layer is opened for writing, it is copied up to the topmost layer
+and the write is allowed there. The copyup is done by first creating the
+file in the topmost layer and then copying the contents of the file.
+
+If it becomes necessary to create a directory structure in the top layer
+while copying up a file, that structure is created as part of the copyup.
+
+Every time a file is opened for writing, a newly introduced check determines
+whether the file belongs to a union and, if so, whether it resides in the
+bottom layer of the union stack. Only then is the copyup operation performed.
+VFS routines are used directly to create the file in the topmost layer.
+However, to copy the contents of the file from within the kernel, the splice
+routines are used.
+
+6. Whiteout
+-----------
+A whiteout file is a placeholder for a file that does not exist from a
+logical point of view. VFS returns -ENOENT for any reference to whiteouts.
+
+Typically whiteouts are created in the topmost layer when a file in
+the lower layer is deleted. The whiteout essentially masks out the file
+in the lower layer.
+
+6.1 Creation and deletion
+
+With union mount, a top layer whiteout is created in the following scenarios:
+- A file/directory which resides only in the bottom layer is removed.
+- A file/directory which resides in both the layers is removed.
+
+The VFS calls like unlink(), rename() and rmdir() have been modified to create
+a whiteout automatically when either of the above situations occurs.
+
+A whiteout is automatically deleted whenever a new file or directory
+with a corresponding name is created. This happens in calls like
+create(), mknod(), symlink(), link() and mkdir().
+
+There is a special case in mkdir(). When a whiteout is replaced by a
+directory, the directory is marked opaque (using the new S_OPAQUE inode
+flag). Lookup does not descend into the lower directories if a directory
+is marked opaque. This is needed in the following scenario:
+
+# rm -rf dir/
+# mkdir dir
+
+The newly created dir/ has to be marked opaque, otherwise the contents
+of the union stack would become visible again. And one does not expect to
+find a non-empty directory immediately after its creation.
+
+6.2. Whiteout filetype support
+
+Creation or deletion of whiteouts is a persistent operation and hence it
+needs support from the underlying filesystem.
+
+Linux already defines DT_WHT (include/linux/fs.h) for the whiteout directory
+entry (file)type. In addition, we need to define the whiteout filetype,
+for which we make use of an unused value in the filetype bitmask and
+define S_IFWHT (include/linux/stat.h).
+
+Filesystems which support the whiteout filetype should set the FS_WHT
+flag (include/linux/fs.h) in .fs_flags of their file_system_type structure.
+
+Additionally they have to implement the whiteout inode operation.
+
+int (*whiteout)(struct inode *dir, struct dentry *dentry);
+
+where 'dentry' is the negative dentry to be masked out under the parent 'dir'.
+
+In the current implementation, there is an inode for every whiteout in the
+filesystem. But since a whiteout doesn't have any usable attribute apart
+from its name (the name of the whiteout file is stored as a directory entry
+in the parent directory), it is an ideal candidate for being replaced by
+a singleton object. We have plans to explore this option at a later point
+in time.
+
+In ext2 and ext3 filesystems, whiteout is introduced as an incompatible
+feature and only readonly mounts are allowed without whiteout support.
+tune2fs(8) from e2fsprogs has been modified to add whiteout support to
+ext2/3.
+
+6.3. Directory renaming
+<TODO>
+
+7. Usage
+--------
+The way to union mount filesystems on two devices /dev/sda1 and /dev/sda2,
+on a mountpoint union/ is like this:
+
+- Mount the first filesystem normally and this becomes the lower layer
+of the union stack.
+# mount /dev/sda1 union/
+
+- Mount the second filesystem as a union on top of the first.
+# mount --union /dev/sda2 union/
+
+The mount(8) command from util-linux needs to be modified to make it
+interpret the --union option.
+
+After this, union/ will have the merged contents of /dev/sda1
+and /dev/sda2.
+
+8. State of the code
+--------------------
+The entire code is in an experimental stage at present.
+
+There are a number of (un)known issues/shortcomings:
+
+- Unstable, might crash any time. Hasn't undergone any decent levels
+ of testing.
+- We are touching some fastpaths in the lookup code and introducing the
+ latency of obtaining a mutex in dget() (only for union mount cases).
+ We haven't yet benchmarked this to check the (adverse) effects.
+- Unioning of subdirectories within a union mount is working, but is buggy.
+- The side effects of union mount changes on other subsystems
+ (eg cpuset, aio, dnotify, inotify etc which are touched by union
+ mount changes) haven't been tested yet.
+- bind/move vs union mount not yet handled.
+- Some lockdep warnings need to be addressed still.
+- In general some code cleanliness issues are yet to be handled. There
+ are still some #ifdefs in the .c files.
+- The union locking is not robust and needs some fixing.
+
+9. Extracted (old)mail comments
+-------------------------------
+
+These are some of the extracts from an old linux-fsdevel post.
+
+----
+Andries Brouwer wrote:
+>
+> On "union mounts".
+> We must first have a theory on what "union mount" means.
+> Union is a commutative operator, but here there is no symmetry
+> at all, so "union" is a misnomer. There is an order.
+>
+> One might consider partial orders, so that one obtains a tree of mounts,
+> but I do not know any applications, and there is the problem of naming.
+> So, for simplicity, maybe there is a linear order.
+>
+> Things happen in the top one. All others are read-only.
+>
+
+Yes, that is correct. This is natural, since the stacking of vfsmount objects
+has been like this before.
+
+----
+
+Alexander Viro wrote:
+>
+> > Does not same thing apply also for common subdirectories?
+>
+> Not. union-mount != unionfs, it does not descend into subdirectories.
+> There is no way in hell to do that and permit sharing the union-mount
+> components between several mountpoints. unionfs is very different animal
+> and there the main point is that you are getting real, honest
+> copy-on-write, i.e. if you have foo/bar/baz on underlying filesystem than
+> any attempt to access foo will create a shadowing directory in the upper
+> layer, any attempt to access foo/bar will do the same for foo/bar and
+> attempt to write into the foo/bar/baz will lead to copying the thing into
+> the upper layer and changing it there. _Very_ useful when you have a
+> read-only fs and want to run make on it, for one thing - everything
+> new/modified gets into the covering layer, along with the accessed part of
+> directory tree. Very nice, but completely different - there are things
+> impossible for one and doable on another.
+>
+
+----
+
+Werner Almesberger wrote:
+>
+> Hmm, now I'm throughly confused :-( What is the "union" in here then ?
+> Is it that a lookup for a top-level component searches all file system
+> in that list, or does it simply mean that all the file systems are
+> internally linked to the same place, but only one of them is truly
+> visible ?
+>
+> E.g., given
+>
+> # mount /dev/a /mnt
+> # mkdir -p /mnt/foo/blah /mnt/bar
+> # umount /dev/a
+> # mount /dev/b /mnt
+> # mkdir -p /mnt/foo/zulu /mnt/baz
+> # mount -o union /dev/a /mnt
+>
+> # cd /mnt/foo/blah works ?
+> # cd /mnt/foo/zulu works too ? (no, I guess)
+> # cd /mnt/baz works ?
+> # cd /mnt/bar works too ?
+> # cd /mnt; touch file works ? on which device is the file created ?
+> # cd /mnt/foo; touch file works ?
+> # cd /mnt/foo/blah; touch file works ?
+> # cd /mnt/foo/zulu; touch file works too ? (no, I guess)
+>
+
+# cd /mnt/foo/blah works !
+# cd /mnt/foo/zulu works !
+# cd /mnt/baz works !
+# cd /mnt/bar works !
+# cd /mnt; touch file file created on /dev/a
+# cd /mnt/foo; touch file file created on /dev/a
+# cd /mnt/foo/blah; touch file file created on /dev/a
+# cd /mnt/foo/zulu; touch file zulu copied to /dev/a and file created on it
+
+----
+
+Alexander Viro wrote:
+>
+> A) suppose we have a bunch of filesystems union-mounted on /foo/bar. We do
+> chdir("/foo/bar"), what should become busy? Variants:
+> mountpoint, first element, last element, all of them.
+> B) after the action in (A) we add another filesystem to the set. Again, what
+> should happen to the busy/not busy status of the components?
+> C) we start with the normal mount and union-mount something else.
+> Question: what is the desired result (almost definitely the set of old
+> and new mounted stuff) and who should become busy?
+> D) In the cases above, what do we want to get from stat(2)?
+> E) What do we want to do if we do normal mount atop of the union-mount?
+> Variants: try to replace, return -EBUSY. Doing replace (i.e. if
+> everything can be umounted - do it and mount the new fs in place of the
+> union) is attractive - we probably might treat the normal mount same way,
+> which kills the "I've clicked in my point'n'drool krapplication ten times
+> and it mounted CD ten times, waaaaaah" bug reports.
+> Disadvantage: may need small fixes to mount(8) (basically, "if we already
+> have mtab entry for this mountpoint and mount succeeds - discard the old
+> one").
+>
+
+I don't understand the union mount as a set of mounts because we also need a
+strict order to remove duplicate filenames from the directory
+listing. Therefore, after union mounting a filesystem, the mount-point's
+filesystem is busy. A chdir() to the mount-point makes the last mounted
+filesystem busy since a lookup returns the root directory of the topmost
+filesystem.
+
+----
+
+Alexander Viro wrote:
+> >
+> > > A) suppose we have a bunch of filesystems union-mounted on
+> > > /foo/bar. We do chdir("/foo/bar"), what should become busy? Variants:
+> > > mountpoint, first element, last element, all of them.
+> >
+> > I believe that all of them. Or, we can make alternative and mark
+> > none of them busy (together with Tigran yet-to-write force unmount) -
+> > if there is reason why cwd should make filesystem busy at all...
+>
+> Ouch. "All" means that we can't, e.g expire elements of union.
+>
+
+
+----
+
+Andries Brouwer wrote:
+>
+> > A) suppose we have a bunch of filesystems union-mounted on
+> > /foo/bar. We do chdir("/foo/bar"), what should become busy? Variants:
+> > mountpoint, first element, last element, all of them.
+>
+> Last element.
+>
+> > B) after the action in (A) we add another filesystem to the set.
+> > Again, what should happen to the busy/not busy status of the components?
+>
+> Previous top one has now become busy. All other were busy already.
+>
+> > C) we start with the normal mount and union-mount something else.
+> > Question: what is the desired result (almost definitely the set of old and
+> > new mounted stuff) and who should become busy?
+>
+> First element now is busy.
+>
+> > D) In the cases above, what do we want to get from stat(2)?
+>
+> stat(2) on this directory looks at the top one
+>
+> > E) What do we want to do if we do normal mount atop of the
+> > union-mount? Variants: try to replace,
+>
+> No. Very strange semantics for a mount.
+>
+> > return -EBUSY.
+>
+> Yes, quite reasonable. But I would prefer the third: just succeed.
+> We have a file hierarchy, and do a mount - well, we already know what that
+> means, and we just do it.
+>
+> [I would prefer to return -EBUSY only when the same filesystem was already
+> mounted (in the same way) on the same mount point.]
+>
+
+
+----
+
+Neil Brown wrote:
+>
+> A "mount" is an ordered list (pile) of directories.
+> One of these elements is the "mountpoint", and it is particularly
+> distiguished because ".." from the "mount" goes through ".." of the
+> "mountpoint". ".." of all other directories is not accessable.
+>
+> Each directory in the pile has two flags (well, three if you count
+> IS_MOUNTPOINT):
+>
+> IS_WRITABLE: You can create things in here.
+> IS_VISIBLE: You can see inside this.
+>
+> Thus, a traditional mount has two directories in the pile.
+> The bottom one IS_MOUNTPOINT
+> The top one IS_WRITABLE|IS_VISIBLE
+>
+> With mount -o union, you can set what ever flags you like, though
+> having IS_WRITABLE and not IS_VISIBLE would be a problem.
+> However you can only have one IS_MOUNTPOINT directory.
+>
+> Now the rules:
+>
+> 1/ on "lookup", you do a lookup in each IS_VISIBLE directory from the
+> top down until you find a match or you hit the bottom.
+>
+> 2/ If you decide to create something (*) then it goes in the uppermost
+> IS_WRITABLE directory.
+>
+> 3/ "stat" (of ".") sees the IS_MOUNTPOINT directory if it IS_VISIBLE,
+> otherwise the lowest IS_VISIBLE directory.
+> Possibly n_links could be fiddled, but I don't know how important
+> that is.
+>
+> 4/ The "mount" keeps only the IS_MOUNTPOINT directory busy.
+>
+> 5/ An open or cd to the mount makes the directory which "stat" sees
+> busy.
+>
+> 6/ A mount is not allowed if it would change 'the directory which
+> "stat" sees', and that directory is "busy".
+>
+> (*) It is unclear to me when creation should be allowed.
+> If I say "mkdir fred", and fred does not exist in or above the
+> uppermost IS_WRITABLE directory, but does exist is a lower
+> IS_VISIBLE directory, should the create succeed or fail?
+> Would that same be true for
+> open("fred", O_CREAT) which is "create if it doesn't exist"
+> or open("fred", O_CREAT|O_EXCL) which is "create and it mustn't exist".
+>
+
+For the complete thread refer to:
+http://marc.theaimsgroup.com/?l=linux-fsdevel&m=96035682927821&w=2
+
+---
+- Bharata B Rao <[email protected]>
+- Jan Blunck <[email protected]>
+
+May 2007
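
The duplicate elimination of section 4.1 and the whiteout masking of section 6
in the documentation above can be modelled in user space. The following is only
a sketch under stated assumptions, not the code in this patchset: each layer is
a plain array of names, the cache is a fixed-size array searched linearly, and
the names sk_entry, sk_seen() and sk_merge_layers() are made up for this
illustration.

```c
#include <stddef.h>
#include <string.h>

#define SK_MAX_NAMES 64	/* hypothetical cache size for this sketch */

/* One directory entry in a layer; 'whiteout' marks a masking entry. */
struct sk_entry {
	const char *name;
	int whiteout;
};

/* Return 1 if 'name' is already in the cache of names seen so far. */
static int sk_seen(const char **cache, size_t n, const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(cache[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Merge 'nlayers' layers, topmost first, into 'out'.
 * Every name is cached when first seen, so a later (lower) entry with
 * the same name is dropped. Whiteouts are cached but never emitted,
 * which hides the lower entries they mask.
 * Returns the number of names written to 'out'.
 */
size_t sk_merge_layers(const struct sk_entry *const *layers,
		       const size_t *counts, size_t nlayers,
		       const char **out, size_t out_max)
{
	const char *cache[SK_MAX_NAMES];
	size_t ncache = 0, nout = 0, l, i;

	for (l = 0; l < nlayers; l++) {
		for (i = 0; i < counts[l]; i++) {
			const struct sk_entry *e = &layers[l][i];

			if (sk_seen(cache, ncache, e->name))
				continue;	/* duplicate or masked */
			if (ncache < SK_MAX_NAMES)
				cache[ncache++] = e->name;
			if (!e->whiteout && nout < out_max)
				out[nout++] = e->name;
		}
	}
	return nout;
}
```

With a top layer holding "a" and a whiteout for "b", and a bottom layer holding
"a", "b" and "c", this model emits "a" and "c". The real implementation also
has to keep the cache alive across several getdents() calls, which the sketch
leaves out.
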

2007-05-14 09:31:47

by Bharata B Rao

Subject: [RFC][PATCH 2/14] Add a new mount flag (MNT_UNION) for union mount

From: Jan Blunck <[email protected]>
Subject: Add a new mount flag (MNT_UNION) for union mount.

Introduce MNT_UNION, MS_UNION and FS_WHT flags. These are the necessary flags
for doing

mount /dev/hda3 /mnt -o union

You need additional patches for util-linux for that to work.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
fs/namespace.c | 14 +++++++++++++-
include/linux/fs.h | 2 ++
include/linux/mount.h | 1 +
3 files changed, 16 insertions(+), 1 deletion(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -442,6 +442,7 @@ static int show_vfsmnt(struct seq_file *
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
{ MNT_NOMNT, ",nomnt" },
+ { MNT_UNION, ",union" },
{ 0, NULL }
};
struct proc_fs_info *fs_infop;
@@ -1256,6 +1257,14 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;

+ /* Unions couldn't be writable if the filesystem
+ * doesn't know about whiteouts */
+ err = -ENOTSUPP;
+ if ((mnt_flags & MNT_UNION) &&
+ !(newmnt->mnt_sb->s_flags & MS_RDONLY) &&
+ !(newmnt->mnt_sb->s_type->fs_flags & FS_WHT))
+ goto unlock;
+
/* some flags may have been set earlier */
newmnt->mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
@@ -1562,9 +1571,12 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_RELATIME;
if (flags & MS_NOMNT)
mnt_flags |= MNT_NOMNT;
+ if (flags & MS_UNION)
+ mnt_flags |= MNT_UNION;

flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
- MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT);
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT |
+ MS_UNION);

/* ... and get the mountpoint */
retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -97,6 +97,7 @@ extern int dir_notify_enable;
#define FS_BINARY_MOUNTDATA 2
#define FS_HAS_SUBTYPE 4
#define FS_SAFE 8 /* Safe to mount by unprivileged users */
+#define FS_WHT 16
#define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move()
* during rename() internally.
@@ -113,6 +114,7 @@ extern int dir_notify_enable;
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_UNION 256 /* Union mount */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -36,6 +36,7 @@ struct mnt_namespace;
#define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
#define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
#define MNT_PNODE_MASK 0x3000 /* propogation flag mask */
+#define MNT_UNION 0x4000 /* if the vfsmount is a union mount */

struct vfsmount {
struct list_head mnt_hash;
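
The flag handling in this patch can be tried out in user space. The sketch
below is an assumption-laden model, not kernel code: the SK_ constants copy the
values from the hunks above (SK_MS_RDONLY is the usual read-only bit), the
function names are invented, and the portable ENOTSUP stands in for the
kernel-internal ENOTSUPP used by do_add_mount().

```c
#include <errno.h>

/* Values copied from this patch; prefixed SK_ to avoid clashing with libc. */
#define SK_MS_RDONLY	1	/* assumed: standard read-only mount flag */
#define SK_MS_UNION	256
#define SK_MNT_UNION	0x4000
#define SK_FS_WHT	16

/* Translate the user-visible MS_UNION bit into the per-mount MNT_UNION bit. */
int sk_mnt_flags(unsigned long ms_flags)
{
	int mnt_flags = 0;

	if (ms_flags & SK_MS_UNION)
		mnt_flags |= SK_MNT_UNION;
	return mnt_flags;
}

/*
 * Model of the new check in do_add_mount(): a writable union mount
 * needs a filesystem that advertises whiteout support.
 * Returns 0 if the mount may proceed, a negative errno otherwise.
 */
int sk_check_union(int mnt_flags, unsigned long sb_flags, int fs_flags)
{
	if ((mnt_flags & SK_MNT_UNION) &&
	    !(sb_flags & SK_MS_RDONLY) &&
	    !(fs_flags & SK_FS_WHT))
		return -ENOTSUP;	/* stand-in for the kernel's ENOTSUPP */
	return 0;
}
```

In this model a read-only union mount is accepted on any filesystem, while a
writable one is refused unless the filesystem type sets FS_WHT.
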

2007-05-14 09:32:30

by Bharata B Rao

Subject: [RFC][PATCH 3/14] Add the whiteout file type

From: Jan Blunck <[email protected]>
Subject: Add the whiteout file type

A white-out stops the VFS from further lookups of the white-out's name and
returns -ENOENT. This is the same behaviour as if the filename isn't
found. This can be used in combination with union mounts to virtually
delete (white-out) files by creating a file with this file type.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
include/linux/stat.h | 2 ++
1 files changed, 2 insertions(+)

--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -10,6 +10,7 @@
#if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)

#define S_IFMT 00170000
+#define S_IFWHT 0160000 /* whiteout */
#define S_IFSOCK 0140000
#define S_IFLNK 0120000
#define S_IFREG 0100000
@@ -28,6 +29,7 @@
#define S_ISBLK(m) (((m) & S_IFMT) == S_IFBLK)
#define S_ISFIFO(m) (((m) & S_IFMT) == S_IFIFO)
#define S_ISSOCK(m) (((m) & S_IFMT) == S_IFSOCK)
+#define S_ISWHT(m) (((m) & S_IFMT) == S_IFWHT)

#define S_IRWXU 00700
#define S_IRUSR 00400
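
For illustration, the new file type test can be exercised outside the kernel.
This is a sketch only: the SK_ macros repeat the octal values from the hunk
above under different names so they do not collide with <sys/stat.h>, and
sk_lookup_result() is a hypothetical helper that mimics the "whiteout means
-ENOENT" rule from the description.

```c
#include <errno.h>

/* Octal values as in include/linux/stat.h after this patch. */
#define SK_S_IFMT	00170000
#define SK_S_IFWHT	 0160000	/* whiteout, new in this patch */
#define SK_S_IFSOCK	 0140000
#define SK_S_IFREG	 0100000

#define SK_S_ISWHT(m)	(((m) & SK_S_IFMT) == SK_S_IFWHT)

/* A lookup that hits a whiteout behaves as if the name did not exist. */
int sk_lookup_result(unsigned int mode)
{
	return SK_S_ISWHT(mode) ? -ENOENT : 0;
}
```

Because the test compares the whole type field against S_IFWHT, a socket
(0140000) is not mistaken for a whiteout even though the two values share bits.
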

2007-05-14 09:33:00

by Bharata B Rao

Subject: [RFC][PATCH 4/14] Add config options for union mount

From: Jan Blunck <[email protected]>
Subject: Add config options for union mount

Introduces two new config options for union mount:

CONFIG_UNION_MOUNT - Enables union mount
CONFIG_UNION_MOUNT_DEBUG - Enables debugging support for union mount.

Also adds debugging routines.

FIXME: this needs some work. printk'ing isn't the right method for getting
good debugging output.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
fs/Kconfig | 16 +++++++++
include/linux/union_debug.h | 76 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+)

--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -551,6 +551,22 @@ config INOTIFY_USER

If unsure, say Y.

+config UNION_MOUNT
+ bool "Union mount support (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ ---help---
+ If you say Y here, you will be able to mount file systems as
+ union mount stacks. This is a VFS based implementation and
+ should work with all file systems. If unsure, say N.
+
+config UNION_MOUNT_DEBUG
+ bool "Union mount debugging output"
+ depends on UNION_MOUNT
+ ---help---
+ If you say Y here, the union mount debugging code will be
+ compiled in. You have to activate the appropriate UNION_MOUNT_DEBUG
+ flags in <file:include/linux/union.h>, too.
+
config QUOTA
bool "Quota support"
help
--- /dev/null
+++ b/include/linux/union_debug.h
@@ -0,0 +1,76 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation
+ * Author(s): Jan Blunck ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_DEBUG_H
+#define __LINUX_UNION_DEBUG_H
+
+#ifdef __KERNEL__
+
+#ifdef CONFIG_UNION_MOUNT_DEBUG
+
+#include <linux/sched.h>
+
+#ifndef UNION_MOUNT_DEBUG
+#define UNION_MOUNT_DEBUG 0
+#endif /* UNION_MOUNT_DEBUG */
+#ifndef UNION_MOUNT_DEBUG_DCACHE
+#define UNION_MOUNT_DEBUG_DCACHE 0
+#endif /* UNION_MOUNT_DEBUG_DCACHE */
+#ifndef UNION_MOUNT_DEBUG_LOCK
+#define UNION_MOUNT_DEBUG_LOCK 0
+#endif /* UNION_MOUNT_DEBUG_LOCK */
+#ifndef UNION_MOUNT_DEBUG_READDIR
+#define UNION_MOUNT_DEBUG_READDIR 0
+#endif /* UNION_MOUNT_DEBUG_READDIR */
+
+/*
+ * The really excessive debugging output is triggered by
+ * the user id (7777) which is accessing the union stack
+ */
+#define UM_DEBUG(fmt, args...) \
+do { \
+ if (UNION_MOUNT_DEBUG) \
+ printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args); \
+} while (0)
+#define UM_DEBUG_UID(fmt, args...) \
+do { \
+ if (UNION_MOUNT_DEBUG && (current->uid == 7777)) \
+ printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args); \
+} while (0)
+#define UM_DEBUG_DCACHE(fmt, args...) \
+do { \
+ if (UNION_MOUNT_DEBUG_DCACHE && (current->uid == 7777)) \
+ printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args); \
+} while (0)
+#define UM_DEBUG_LOCK(fmt, args...) \
+do { \
+ if (UNION_MOUNT_DEBUG_LOCK && (current->uid == 7777)) \
+ printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args); \
+} while (0)
+#define UM_DEBUG_READDIR(fmt, args...) \
+do { \
+ if (UNION_MOUNT_DEBUG_READDIR && (current->uid == 7777)) \
+ printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args); \
+} while (0)
+
+#else /* CONFIG_UNION_MOUNT_DEBUG */
+
+#define UM_DEBUG(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_UID(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_DCACHE(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_LOCK(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_READDIR(fmt, args...) do { /* empty */ } while (0)
+
+#endif /* CONFIG_UNION_MOUNT_DEBUG */
+
+#endif /* __KERNEL__ */
+#endif /* __LINUX_UNION_DEBUG_H */
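
The pattern used by the UM_DEBUG() family, a constant flag tested inside a
do { } while (0) wrapper, can be tried in user space. The sketch below is not
the header above: it writes into a buffer with snprintf() instead of calling
printk(), uses C99 __VA_ARGS__ and __func__ rather than the GNU "args..." and
__FUNCTION__ forms, and the names SK_UNION_DEBUG, SK_UM_DEBUG, sk_log and
sk_debug_demo() are invented for the example.

```c
#include <stdio.h>
#include <string.h>

#define SK_UNION_DEBUG 1	/* set to 0 and the compiler can drop the body */

static char sk_log[128];	/* stands in for the kernel log */

/* Prefix the message with the calling function, as UM_DEBUG() does. */
#define SK_UM_DEBUG(...)						\
do {									\
	if (SK_UNION_DEBUG) {						\
		int n_ = snprintf(sk_log, sizeof(sk_log), "%s: ", __func__); \
		if (n_ > 0 && (size_t)n_ < sizeof(sk_log))		\
			snprintf(sk_log + n_, sizeof(sk_log) - (size_t)n_, \
				 __VA_ARGS__);				\
	}								\
} while (0)

void sk_debug_demo(int count)
{
	SK_UM_DEBUG("count=%d\n", count);
}
```

Because the condition is a compile-time constant, the arguments are still
type-checked when debugging is switched off, which an empty macro would not
give.
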

2007-05-14 09:33:39

by Bharata B Rao

Subject: [RFC][PATCH 5/14] Introduce union stack

From: Jan Blunck <[email protected]>
Subject: Introduce union stack.

Adds union stack infrastructure to the dentry structure and provides
locking routines to walk the union stack.
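
As a rough user-space model of the reference counting and the stack walk (a
sketch under assumptions, not the code below): C11 atomics replace atomic_t,
malloc()/free() replace kmalloc()/kfree(), the u_mutex is left out entirely,
and all sk_ names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct sk_union_info {
	atomic_int u_count;		/* references to the union stack */
};

struct sk_dentry {
	struct sk_dentry *d_overlaid;	/* next lower layer, or NULL */
	struct sk_dentry *d_topmost;	/* head of the stack */
	struct sk_union_info *d_union;	/* shared by the whole stack */
};

/* Allocate a union_info holding one reference. */
struct sk_union_info *sk_union_alloc(void)
{
	struct sk_union_info *info = malloc(sizeof(*info));

	if (!info)
		return NULL;
	atomic_init(&info->u_count, 1);
	return info;
}

/* Take an additional reference on a live union_info. */
struct sk_union_info *sk_union_get(struct sk_union_info *info)
{
	assert(info && atomic_load(&info->u_count) > 0);
	atomic_fetch_add(&info->u_count, 1);
	return info;
}

/* Returns 1 if this put dropped the last reference and freed 'info'. */
int sk_union_put(struct sk_union_info *info)
{
	assert(info);
	/* fetch_sub returns the old value, so 1 means we were the last user */
	if (atomic_fetch_sub(&info->u_count, 1) == 1) {
		free(info);
		return 1;
	}
	return 0;
}

/* Count the layers reachable from the topmost dentry of d's stack. */
int sk_stack_depth(const struct sk_dentry *d)
{
	int depth = 0;

	for (d = d->d_topmost; d; d = d->d_overlaid)
		depth++;
	return depth;
}
```

The decrement and the zero test are done in one atomic operation here, so two
concurrent puts cannot both conclude that they dropped the last reference.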

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
fs/Makefile | 2
fs/dcache.c | 5
fs/union.c | 53 +++++++++
include/linux/dcache.h | 6 +
include/linux/dcache_union.h | 248 +++++++++++++++++++++++++++++++++++++++++++
5 files changed, 314 insertions(+)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -49,6 +49,8 @@ obj-$(CONFIG_FS_POSIX_ACL) += posix_acl.
obj-$(CONFIG_NFS_COMMON) += nfs_common/
obj-$(CONFIG_GENERIC_ACL) += generic_acl.o

+obj-$(CONFIG_UNION_MOUNT) += union.o
+
obj-$(CONFIG_QUOTA) += dquot.o
obj-$(CONFIG_QFMT_V1) += quota_v1.o
obj-$(CONFIG_QFMT_V2) += quota_v2.o
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -936,6 +936,11 @@ struct dentry *d_alloc(struct dentry * p
#ifdef CONFIG_PROFILING
dentry->d_cookie = NULL;
#endif
+#ifdef CONFIG_UNION_MOUNT
+ dentry->d_overlaid = NULL;
+ dentry->d_topmost = NULL;
+ dentry->d_union = NULL;
+#endif
INIT_HLIST_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
--- /dev/null
+++ b/fs/union.c
@@ -0,0 +1,53 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation
+ * Author(s): Jan Blunck ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/fs.h>
+
+struct union_info * union_alloc(void)
+{
+ struct union_info *info;
+
+ info = kmalloc(sizeof(*info), GFP_ATOMIC);
+ if (!info)
+ return NULL;
+
+ mutex_init(&info->u_mutex);
+ mutex_lock(&info->u_mutex);
+ atomic_set(&info->u_count, 1);
+ UM_DEBUG_LOCK("allocate union %p\n", info);
+ return info;
+}
+
+struct union_info * union_get(struct union_info *info)
+{
+ BUG_ON(!info);
+ BUG_ON(!atomic_read(&info->u_count));
+ atomic_inc(&info->u_count);
+ UM_DEBUG_LOCK("get union %p (count=%d)\n", info,
+ atomic_read(&info->u_count));
+ return info;
+}
+
+void union_put(struct union_info *info)
+{
+ BUG_ON(!info);
+ UM_DEBUG_LOCK("put union %p (count=%d)\n", info,
+ atomic_read(&info->u_count));
+ atomic_dec(&info->u_count);
+
+ if (!atomic_read(&info->u_count)) {
+ UM_DEBUG_LOCK("free union %p\n", info);
+ kfree(info);
+ }
+
+ return;
+}
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -93,6 +93,12 @@ struct dentry {
struct dentry *d_parent; /* parent directory */
struct qstr d_name;

+#ifdef CONFIG_UNION_MOUNT
+ struct dentry *d_overlaid; /* overlaid directory */
+ struct dentry *d_topmost; /* topmost directory */
+ struct union_info *d_union; /* union directory info */
+#endif
+
struct list_head d_lru; /* LRU list */
/*
* d_child and d_rcu can share memory
--- /dev/null
+++ b/include/linux/dcache_union.h
@@ -0,0 +1,248 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation
+ * Author(s): Jan Blunck ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_DCACHE_UNION_H
+#define __LINUX_DCACHE_UNION_H
+#ifdef __KERNEL__
+
+#include <linux/union_debug.h>
+#include <linux/fs_struct.h>
+#include <asm/atomic.h>
+#include <linux/mutex.h>
+
+#ifdef CONFIG_UNION_MOUNT
+
+/*
+ * This is the union info object, that describes general information about this
+ * union directory
+ *
+ * u_mutex protects the union stack against modification. You can reach it
+ * through the d_union field in struct dentry. Hold it when you are walking
+ * or modifying the union stack!
+ */
+struct union_info {
+ atomic_t u_count;
+ struct mutex u_mutex;
+};
+
+/* allocate/de-allocate */
+extern struct union_info *union_alloc(void);
+extern struct union_info *union_get(struct union_info *);
+extern void union_put(struct union_info *);
+
+/*
+ * These are the functions for locking a dentry's union. When one
+ * wants to acquire a dentry's union lock, use:
+ *
+ * - union_lock() when you can sleep,
+ * - union_lock_spinlock() when you are holding a spinlock (that
+ * you CAN safely give up and reacquire again)
+ * - union_lock_readlock() when you are holding a readlock (that
+ * you CAN safely give up and reacquire again)
+ *
+ * Otherwise get the union lock early before you enter your
+ * "no sleeping here" code.
+ *
+ * NOTES: The union_info structure is reference counted using its u_count
+ * member. union_get() and union_put(), which get and put references on
+ * union_info, should be called under union_info's u_mutex. Since the last
+ * union_put() frees the union_info structure itself, it obviously can't be
+ * done under u_mutex. union_release() should be used in such cases (e.g.
+ * dput(), umount()), where union_info is disassociated from the dentries
+ * and it becomes safe to free the union_info.
+ */
+static inline void __union_lock(struct union_info *uinfo)
+{
+ BUG_ON(!atomic_read(&uinfo->u_count));
+ mutex_lock(&uinfo->u_mutex);
+}
+
+static inline void union_lock(struct dentry *dentry)
+{
+ if (unlikely(dentry && dentry->d_union)) {
+ struct union_info *ui = dentry->d_union;
+
+ UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
+ dentry->d_name.name, ui,
+ atomic_read(&ui->u_count));
+ __union_lock(dentry->d_union);
+ }
+}
+
+static inline void __union_unlock(struct union_info *uinfo)
+{
+ BUG_ON(!atomic_read(&uinfo->u_count));
+ mutex_unlock(&uinfo->u_mutex);
+}
+
+static inline void union_unlock(struct dentry *dentry)
+{
+ if (unlikely(dentry && dentry->d_union)) {
+ struct union_info *ui = dentry->d_union;
+
+ UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
+ dentry->d_name.name, ui,
+ atomic_read(&ui->u_count));
+ __union_unlock(dentry->d_union);
+ }
+}
+
+static inline void union_alloc_dentry(struct dentry *dentry)
+{
+ spin_lock(&dentry->d_lock);
+ if (!dentry->d_union) {
+ dentry->d_union = union_alloc();
+ spin_unlock(&dentry->d_lock);
+ } else {
+ spin_unlock(&dentry->d_lock);
+ union_lock(dentry);
+ }
+}
+
+static inline struct union_info *union_lock_and_get(struct dentry *dentry)
+{
+ union_lock(dentry);
+ return union_get(dentry->d_union);
+}
+
+/* Shouldn't be called with last reference to union_info */
+static inline void union_put_and_unlock(struct union_info *uinfo)
+{
+ union_put(uinfo);
+ __union_unlock(uinfo);
+}
+
+/*
+ * Called when we know for sure that there is no reference
+ * to this union_info from any dentry and it can be safely
+ * destroyed.
+ */
+static inline void union_release(struct union_info *uinfo)
+{
+ if (!uinfo)
+ return;
+
+ mutex_unlock(&uinfo->u_mutex);
+ union_put(uinfo);
+}
+
+/*
+ * Immediately return ZERO if the lock is contended, NON-ZERO if it's acquired.
+ */
+static inline int union_trylock(struct dentry *dentry)
+{
+ int locked = 1;
+
+ if (unlikely(dentry && dentry->d_union)) {
+ UM_DEBUG_LOCK("\"%s\" try locking %p (count=%d)\n",
+ dentry->d_name.name, dentry->d_union,
+ atomic_read(&dentry->d_union->u_count));
+ BUG_ON(!atomic_read(&dentry->d_union->u_count));
+ locked = mutex_trylock(&dentry->d_union->u_mutex);
+ UM_DEBUG_LOCK("\"%s\" trylock %p %s\n", dentry->d_name.name,
+ dentry->d_union,
+ locked ? "succeeded" : "failed");
+ }
+ return (locked ? 1 : 0);
+}
+
+/*
+ * The following functions are locking helpers to guarantee the locking order
+ * in some situations.
+ */
+
+static inline void union_lock_spinlock(struct dentry *dentry, spinlock_t *lock)
+{
+ while (!union_trylock(dentry)) {
+ spin_unlock(lock);
+ cpu_relax();
+ spin_lock(lock);
+ }
+}
+
+static inline void union_lock_readlock(struct dentry *dentry, rwlock_t *lock)
+{
+ while (!union_trylock(dentry)) {
+ read_unlock(lock);
+ cpu_relax();
+ read_lock(lock);
+ }
+}
+
+/*
+ * This is a *I can't get no sleep* helper which is called when we try
+ * to access the struct fs_struct *fs field of a struct task_struct.
+ *
+ * Yes, this can starve, but only if root, altroot or pwd are changed
+ * at the frequency of this while loop. That should not happen
+ * very often ;)
+ *
+ * This is called while holding the rwlock_t fs->lock
+ *
+ * TODO: Unlocking side of union_lock_fs() needs 3 union_unlock()s.
+ * Maybe introduce union_unlock_fs().
+ *
+ * FIXME: This routine is used when the caller wants to dget one or
+ * more of fs->[root, altroot, pwd]. When the caller doesn't want to
+ * dget _all_ of these, it is strictly not necessary to get union_locks
+ * on all of these. Check.
+ */
+static inline void union_lock_fs(struct fs_struct *fs)
+{
+ int locked;
+
+ while (fs) {
+ locked = union_trylock(fs->root);
+ if (!locked)
+ goto loop1;
+ locked = union_trylock(fs->altroot);
+ if (!locked)
+ goto loop2;
+ locked = union_trylock(fs->pwd);
+ if (!locked)
+ goto loop3;
+ break;
+ loop3:
+ union_unlock(fs->altroot);
+ loop2:
+ union_unlock(fs->root);
+ loop1:
+ read_unlock(&fs->lock);
+ UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
+ cpu_relax();
+ read_lock(&fs->lock);
+ continue;
+ }
+ BUG_ON(!fs);
+ return;
+}
+
+#define IS_UNION(dentry) ((dentry)->d_overlaid || (dentry)->d_topmost || \
+ (dentry)->d_overlaid)
+
+#else /* CONFIG_UNION_MOUNT */
+
+#define union_lock(dentry) do { /* empty */ } while (0)
+#define union_trylock(dentry) ({ (1); })
+#define union_unlock(dentry) do { /* empty */ } while (0)
+#define union_lock_spinlock(dentry, lock) do { /* empty */ } while (0)
+#define union_lock_readlock(dentry, lock) do { /* empty */ } while (0)
+#define union_lock_fs(fs) do { /* empty */ } while (0)
+#define IS_UNION(dentry) ({ (0); })
+#define union_alloc_dentry(x) ({ BUG(); (0); })
+#define union_lock_and_get(dentry) ({ (NULL); })
+#define union_unlock_and_put(dentry) do { /* empty */ } while (0)
+#define union_release(x) do { BUG(); } while (0)
+
+#endif /* CONFIG_UNION_MOUNT */
+#endif /* __KERNEL__ */
+#endif /* __LINUX_DCACHE_UNION_H */

2007-05-14 09:34:17

by Bharata B Rao

Subject: [RFC][PATCH 6/14] Union-mount dentry reference counting

From: Jan Blunck <[email protected]>
Subject: Union-mount dentry reference counting

dget() is modified to walk the union stack, taking a reference on every
dentry that is part of it. This is necessary to ensure that parts of the
union stack don't go away from under us. Since dget() takes a mutex to
walk the stack, dget() can now sleep.
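
The stack walk can be pictured with the following user-space sketch. It is
only an approximation of __dget_union() and __dput_union(): struct
demo_dentry and the demo_* functions are hypothetical, the counters are plain
ints because the caller is assumed to hold the union lock, and the teardown
of stack parts whose count drops too low is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the union-related part of struct dentry. */
struct demo_dentry {
	int count;			/* like d_count */
	struct demo_dentry *overlaid;	/* next lower layer, like d_overlaid */
};

/*
 * Take a reference on @d and on every layer beneath it.
 * Layers above @d are not touched, matching the downward walk
 * via d_overlaid in the patch.
 */
static struct demo_dentry *demo_dget_union(struct demo_dentry *d)
{
	struct demo_dentry *tmp;

	for (tmp = d; tmp; tmp = tmp->overlaid) {
		assert(tmp->count > 0);	/* never revive a dead dentry */
		tmp->count++;
	}
	return d;
}

/* Drop the references again, walking down from @d. */
static void demo_dput_union(struct demo_dentry *d)
{
	struct demo_dentry *next;

	for (; d; d = next) {
		next = d->overlaid;	/* read the link before dropping */
		assert(d->count > 0);
		d->count--;
	}
}
```

Taking a reference on a middle layer therefore pins that layer and everything
below it, but not the layers above.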

dput() also walks the union stack and releases the references to all the
dentries that are part of the union.

Since dget() can now sleep, make sure that dget() doesn't go to sleep with
any spinlocks held while it tries to get the mutex.
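
The back-off pattern used by helpers such as union_lock_spinlock() can be
sketched in user space as follows. This is an illustration under assumptions,
not the kernel code: two pthread mutexes stand in for the spinlock and the
union mutex, sched_yield() stands in for cpu_relax(), and
demo_lock_with_backoff() is a hypothetical name.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Acquire @sleeping while already holding @outer, without ever blocking
 * on @sleeping while @outer is held. On contention, @outer is dropped,
 * the CPU is yielded, and @outer is retaken before trying again.
 * Returns the number of back-off rounds that were needed.
 */
static int demo_lock_with_backoff(pthread_mutex_t *sleeping,
				  pthread_mutex_t *outer)
{
	int retries = 0;

	/* pthread_mutex_trylock() returns 0 on success, non-zero if busy. */
	while (pthread_mutex_trylock(sleeping) != 0) {
		pthread_mutex_unlock(outer);
		sched_yield();
		pthread_mutex_lock(outer);
		retries++;
	}
	return retries;
}
```

Because the outer lock may have been dropped and retaken, a caller of such a
helper has to assume that anything protected by the outer lock may have
changed by the time the helper returns.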

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
fs/dcache.c | 43 ++++++-
fs/dnotify.c | 5
fs/inotify.c | 8 +
fs/namei.c | 42 +++++--
fs/namespace.c | 12 +-
fs/proc/base.c | 17 ++
fs/union.c | 248 +++++++++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 110 ++++++++++++++++++-
include/linux/dcache_union.h | 65 +++++++++++
kernel/auditsc.c | 4
kernel/cpuset.c | 4
kernel/fork.c | 10 +
net/unix/af_unix.c | 5
13 files changed, 537 insertions(+), 36 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -171,8 +171,7 @@ static struct dentry *d_kill(struct dent
*
* no dcache lock, please.
*/
-
-void dput(struct dentry *dentry)
+void __dput_single(struct dentry *dentry)
{
if (!dentry)
return;
@@ -190,6 +189,13 @@ repeat:
return;
}

+ if (!__dput_single_destroy_union(dentry)) {
+ atomic_inc(&dentry->d_count);
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ goto repeat;
+ }
+
/*
* AV: ->d_delete() is _NOT_ allowed to block now.
*/
@@ -224,6 +230,14 @@ kill_it:
goto repeat;
}

+void dput(struct dentry *dentry)
+{
+ if (!dentry)
+ return;
+
+ dput_common(dentry);
+}
+
/**
* d_invalidate - invalidate a dentry
* @dentry: dentry to invalidate
@@ -285,6 +299,15 @@ int d_invalidate(struct dentry * dentry)

static inline struct dentry * __dget_locked(struct dentry *dentry)
{
+ /*
+ * TODO: We come here with dcache_lock held and can't
+ * afford to sleep now to acquire the union_lock. We should
+ * change all the callers to acquire union_lock first using
+ * the union_lock_spinlock() helper. Until that is done,
+ * BUG() here.
+ */
+ BUG_ON(IS_UNION(dentry));
+
atomic_inc(&dentry->d_count);
if (!list_empty(&dentry->d_lru)) {
dentry_stat.nr_unused--;
@@ -392,7 +415,7 @@ static void prune_one_dentry(struct dent
__d_drop(dentry);
dentry = d_kill(dentry);
if (!prune_parents) {
- dput(dentry);
+ __dput_single(dentry);
spin_lock(&dcache_lock);
return;
}
@@ -947,7 +970,7 @@ struct dentry *d_alloc(struct dentry * p
INIT_LIST_HEAD(&dentry->d_alias);

if (parent) {
- dentry->d_parent = dget(parent);
+ dentry->d_parent = __dget_single(parent);
dentry->d_sb = parent->d_sb;
} else {
INIT_LIST_HEAD(&dentry->d_u.d_child);
@@ -1879,8 +1902,10 @@ char * d_path(struct dentry *dentry, str
return dentry->d_op->d_dname(dentry, buf, buflen);

read_lock(&current->fs->lock);
+ union_lock_readlock(current->fs->root, &current->fs->lock);
rootmnt = mntget(current->fs->rootmnt);
- root = dget(current->fs->root);
+ root = __dget(current->fs->root);
+ union_unlock(current->fs->root);
read_unlock(&current->fs->lock);
spin_lock(&dcache_lock);
res = __d_path(dentry, vfsmnt, root, rootmnt, buf, buflen);
@@ -1940,10 +1965,14 @@ asmlinkage long sys_getcwd(char __user *
return -ENOMEM;

read_lock(&current->fs->lock);
+ union_lock_fs(current->fs);
pwdmnt = mntget(current->fs->pwdmnt);
- pwd = dget(current->fs->pwd);
+ pwd = __dget(current->fs->pwd);
rootmnt = mntget(current->fs->rootmnt);
- root = dget(current->fs->root);
+ root = __dget(current->fs->root);
+ union_unlock(current->fs->pwd);
+ union_unlock(current->fs->altroot);
+ union_unlock(current->fs->root);
read_unlock(&current->fs->lock);

error = -ENOENT;
--- a/fs/dnotify.c
+++ b/fs/dnotify.c
@@ -161,13 +161,16 @@ void dnotify_parent(struct dentry *dentr
return;

spin_lock(&dentry->d_lock);
+ union_lock_spinlock(dentry->d_parent, &dentry->d_lock);
parent = dentry->d_parent;
if (parent->d_inode->i_dnotify_mask & event) {
- dget(parent);
+ __dget(parent);
+ union_unlock(parent);
spin_unlock(&dentry->d_lock);
__inode_dir_notify(parent->d_inode, event);
dput(parent);
} else {
+ union_unlock(parent);
spin_unlock(&dentry->d_lock);
}
}
--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -325,17 +325,21 @@ void inotify_dentry_parent_queue_event(s
return;

spin_lock(&dentry->d_lock);
+ union_lock_spinlock(dentry->d_parent, &dentry->d_lock);
parent = dentry->d_parent;
inode = parent->d_inode;

if (inotify_inode_watched(inode)) {
- dget(parent);
+ __dget(parent);
+ union_unlock(parent);
spin_unlock(&dentry->d_lock);
inotify_inode_queue_event(inode, mask, cookie, name,
dentry->d_inode);
dput(parent);
- } else
+ } else {
+ union_unlock(parent);
spin_unlock(&dentry->d_lock);
+ }
}
EXPORT_SYMBOL_GPL(inotify_dentry_parent_queue_event);

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -522,16 +522,23 @@ walk_init_root(const char *name, struct
struct fs_struct *fs = current->fs;

read_lock(&fs->lock);
+ union_lock_fs(fs);
if (fs->altroot && !(nd->flags & LOOKUP_NOALT)) {
nd->mnt = mntget(fs->altrootmnt);
- nd->dentry = dget(fs->altroot);
+ nd->dentry = __dget(fs->altroot);
+ union_unlock(fs->pwd);
+ union_unlock(fs->altroot);
+ union_unlock(fs->root);
read_unlock(&fs->lock);
if (__emul_lookup_dentry(name,nd))
return 0;
read_lock(&fs->lock);
}
nd->mnt = mntget(fs->rootmnt);
- nd->dentry = dget(fs->root);
+ nd->dentry = __dget(fs->root);
+ union_unlock(fs->pwd);
+ union_unlock(fs->altroot);
+ union_unlock(fs->root);
read_unlock(&fs->lock);
return 1;
}
@@ -654,13 +661,16 @@ int follow_up(struct vfsmount **mnt, str
struct vfsmount *parent;
struct dentry *mountpoint;
spin_lock(&vfsmount_lock);
+ union_lock_spinlock((*mnt)->mnt_mountpoint, &vfsmount_lock);
parent=(*mnt)->mnt_parent;
if (parent == *mnt) {
+ union_unlock((*mnt)->mnt_mountpoint);
spin_unlock(&vfsmount_lock);
return 0;
}
mntget(parent);
- mountpoint=dget((*mnt)->mnt_mountpoint);
+ mountpoint=__dget((*mnt)->mnt_mountpoint);
+ union_unlock((*mnt)->mnt_mountpoint);
spin_unlock(&vfsmount_lock);
dput(*dentry);
*dentry = mountpoint;
@@ -736,21 +746,27 @@ static __always_inline void follow_dotdo
}
read_unlock(&fs->lock);
spin_lock(&dcache_lock);
+ union_lock_spinlock(nd->dentry->d_parent, &dcache_lock);
if (nd->dentry != nd->mnt->mnt_root) {
- nd->dentry = dget(nd->dentry->d_parent);
+ nd->dentry = __dget(nd->dentry->d_parent);
+ union_unlock(nd->dentry->d_parent);
spin_unlock(&dcache_lock);
dput(old);
break;
}
+ union_unlock(nd->dentry->d_parent);
spin_unlock(&dcache_lock);
spin_lock(&vfsmount_lock);
+ union_lock_spinlock(nd->mnt->mnt_mountpoint, &vfsmount_lock);
parent = nd->mnt->mnt_parent;
if (parent == nd->mnt) {
+ union_unlock(nd->mnt->mnt_mountpoint);
spin_unlock(&vfsmount_lock);
break;
}
mntget(parent);
- nd->dentry = dget(nd->mnt->mnt_mountpoint);
+ nd->dentry = __dget(nd->mnt->mnt_mountpoint);
+ union_unlock(nd->mnt->mnt_mountpoint);
spin_unlock(&vfsmount_lock);
dput(old);
mntput(nd->mnt);
@@ -1050,8 +1066,10 @@ static int __emul_lookup_dentry(const ch
*/
nd->last_type = LAST_ROOT;
read_lock(&fs->lock);
+ union_lock_readlock(fs->root, &fs->lock);
nd->mnt = mntget(fs->rootmnt);
- nd->dentry = dget(fs->root);
+ nd->dentry = __dget(fs->root);
+ union_unlock(fs->root);
read_unlock(&fs->lock);
if (path_walk(name, nd) == 0) {
if (nd->dentry->d_inode) {
@@ -1114,20 +1132,26 @@ static int fastcall do_path_lookup(int d
if (*name=='/') {
read_lock(&fs->lock);
if (fs->altroot && !(nd->flags & LOOKUP_NOALT)) {
+ union_lock_readlock(fs->altroot, &fs->lock);
nd->mnt = mntget(fs->altrootmnt);
- nd->dentry = dget(fs->altroot);
+ nd->dentry = __dget(fs->altroot);
+ union_unlock(fs->altroot);
read_unlock(&fs->lock);
if (__emul_lookup_dentry(name,nd))
goto out; /* found in altroot */
read_lock(&fs->lock);
}
+ union_lock_readlock(fs->root, &fs->lock);
nd->mnt = mntget(fs->rootmnt);
- nd->dentry = dget(fs->root);
+ nd->dentry = __dget(fs->root);
+ union_unlock(fs->root);
read_unlock(&fs->lock);
} else if (dfd == AT_FDCWD) {
read_lock(&fs->lock);
+ union_lock_readlock(fs->pwd, &fs->lock);
nd->mnt = mntget(fs->pwdmnt);
- nd->dentry = dget(fs->pwd);
+ nd->dentry = __dget(fs->pwd);
+ union_unlock(fs->pwd);
read_unlock(&fs->lock);
} else {
struct dentry *dentry;
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1743,12 +1743,14 @@ void set_fs_root(struct fs_struct *fs, s
{
struct dentry *old_root;
struct vfsmount *old_rootmnt;
+ union_lock(dentry);
write_lock(&fs->lock);
old_root = fs->root;
old_rootmnt = fs->rootmnt;
fs->rootmnt = mntget(mnt);
- fs->root = dget(dentry);
+ fs->root = __dget(dentry);
write_unlock(&fs->lock);
+ union_unlock(dentry);
if (old_root) {
dput(old_root);
mntput(old_rootmnt);
@@ -1765,12 +1767,14 @@ void set_fs_pwd(struct fs_struct *fs, st
struct dentry *old_pwd;
struct vfsmount *old_pwdmnt;

+ union_lock(dentry);
write_lock(&fs->lock);
old_pwd = fs->pwd;
old_pwdmnt = fs->pwdmnt;
fs->pwdmnt = mntget(mnt);
- fs->pwd = dget(dentry);
+ fs->pwd = __dget(dentry);
write_unlock(&fs->lock);
+ union_unlock(dentry);

if (old_pwd) {
dput(old_pwd);
@@ -1859,8 +1863,10 @@ asmlinkage long sys_pivot_root(const cha
}

read_lock(&current->fs->lock);
+ union_lock_readlock(current->fs->root, &current->fs->lock);
user_nd.mnt = mntget(current->fs->rootmnt);
- user_nd.dentry = dget(current->fs->root);
+ user_nd.dentry = __dget(current->fs->root);
+ union_unlock(current->fs->root);
read_unlock(&current->fs->lock);
down_write(&namespace_sem);
mutex_lock(&old_nd.dentry->d_inode->i_mutex);
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -162,8 +162,10 @@ static int proc_cwd_link(struct inode *i
}
if (fs) {
read_lock(&fs->lock);
+ union_lock_readlock(fs->pwd, &fs->lock);
*mnt = mntget(fs->pwdmnt);
- *dentry = dget(fs->pwd);
+ *dentry = __dget(fs->pwd);
+ union_unlock(fs->pwd);
read_unlock(&fs->lock);
result = 0;
put_fs_struct(fs);
@@ -183,8 +185,10 @@ static int proc_root_link(struct inode *
}
if (fs) {
read_lock(&fs->lock);
+ union_lock_readlock(fs->root, &fs->lock);
*mnt = mntget(fs->rootmnt);
- *dentry = dget(fs->root);
+ *dentry = __dget(fs->root);
+ union_unlock(fs->root);
read_unlock(&fs->lock);
result = 0;
put_fs_struct(fs);
@@ -1199,19 +1203,26 @@ static int proc_fd_info(struct inode *in
* We are not taking a ref to the file structure, so we must
* hold ->file_lock.
*/
+repeat:
spin_lock(&files->file_lock);
file = fcheck_files(files, fd);
if (file) {
+ if (!union_trylock(file->f_path.dentry)) {
+ spin_unlock(&files->file_lock);
+ cpu_relax();
+ goto repeat;
+ }
if (mnt)
*mnt = mntget(file->f_path.mnt);
if (dentry)
- *dentry = dget(file->f_path.dentry);
+ *dentry = __dget(file->f_path.dentry);
if (info)
snprintf(info, PROC_FDINFO_MAX,
"pos:\t%lli\n"
"flags:\t0%o\n",
(long long) file->f_pos,
file->f_flags);
+ union_unlock(file->f_path.dentry);
spin_unlock(&files->file_lock);
put_files_struct(files);
return 0;
--- a/fs/union.c
+++ b/fs/union.c
@@ -11,6 +11,9 @@
*/

#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/mount.h>

struct union_info * union_alloc(void)
{
@@ -51,3 +54,248 @@ void union_put(struct union_info *info)

return;
}
+
+void __union_check(struct dentry *dentry)
+{
+ if (likely(!(dentry->d_topmost || dentry->d_overlaid))) {
+ if (unlikely(dentry->d_union)) {
+ printk(KERN_ERR "%s: \"%s\" stale union reference\n" \
+ "\tdentry=%p, inode=%p, count=%d, u_count=%d\n",
+ __FUNCTION__,
+ dentry->d_name.name,
+ dentry,
+ dentry->d_inode,
+ atomic_read(&dentry->d_count),
+ atomic_read(&dentry->d_union->u_count));
+ dump_stack();
+ }
+ return;
+ }
+
+ BUG_ON(!dentry->d_union);
+
+ if ((dentry == dentry->d_topmost) || (dentry == dentry->d_overlaid)) {
+ printk(KERN_ERR "%s: \"%s\" loop in union stack\n",
+ __FUNCTION__, dentry->d_name.name);
+ BUG();
+ }
+
+ if (dentry->d_inode && !S_ISDIR(dentry->d_inode->i_mode)) {
+ printk(KERN_ERR "%s: \"%s\" isn't a directory!\n",
+ __FUNCTION__, dentry->d_name.name);
+ BUG();
+ }
+
+ if (dentry->d_topmost && !dentry->d_topmost->d_inode) {
+ printk(KERN_ERR "%s: \"%s\" has a negative topmost dentry!\n",
+ __FUNCTION__, dentry->d_name.name);
+ BUG();
+ }
+
+ if (!dentry->d_inode && !dentry->d_topmost) {
+ printk(KERN_ERR "%s: \"%s\" is a negative topmost dentry!\n",
+ __FUNCTION__, dentry->d_name.name);
+ BUG();
+ }
+}
+EXPORT_SYMBOL_GPL(__union_check);
+
+/*
+ * Check if the given @parent dentry is really a parent of @dentry
+ */
+static int union_is_parent(struct dentry *dentry, struct dentry *parent)
+{
+ struct dentry *tmp = dentry;
+
+ if (parent->d_sb != dentry->d_sb) {
+ UM_DEBUG("%s and %s have different superblocks\n",
+ dentry->d_name.name, parent->d_name.name);
+ return 0;
+ }
+
+ do {
+ if (tmp == parent)
+ return 1;
+ } while (tmp != tmp->d_parent && (tmp = tmp->d_parent));
+
+ return 0;
+}
+
+/*
+ * Check if the @dentry is part of a union
+ */
+int union_is_member(struct dentry *dentry, struct vfsmount *mnt)
+{
+ struct list_head *tmp;
+ struct vfsmount *p, *m_tmp = mntget(mnt);
+ struct dentry *d_tmp = __dget(dentry);
+
+ UM_DEBUG_UID("dentry=%s\n", dentry->d_name.name);
+
+ do {
+ UM_DEBUG_UID("device=%s\n", mnt->mnt_devname);
+ list_for_each(tmp, &m_tmp->mnt_mounts) {
+ p = list_entry(tmp, struct vfsmount, mnt_child);
+ UM_DEBUG_UID("child=%s\n", p->mnt_devname);
+ if (p->mnt_flags & MNT_UNION) {
+ UM_DEBUG_UID("is union=%s\n", p->mnt_devname);
+ if (union_is_parent(d_tmp, p->mnt_mountpoint)) {
+ __dput(d_tmp);
+ mntput(m_tmp);
+ return 1;
+ }
+ }
+ }
+
+ __dput(d_tmp);
+ d_tmp = __dget(m_tmp->mnt_mountpoint);
+ p = mntget(m_tmp->mnt_parent);
+ mntput(m_tmp);
+ m_tmp = p;
+ } while (m_tmp != m_tmp->mnt_parent);
+
+ __dput(d_tmp);
+ mntput(m_tmp);
+ return 0;
+}
+
+int __destroy_union(struct dentry * dentry)
+{
+ struct dentry *next;
+ struct dentry *topmost;
+ struct union_info *uinfo;
+
+ if (!union_trylock(dentry))
+ return 0;
+
+ uinfo = union_get(dentry->d_union);
+
+ UM_DEBUG_DCACHE("destroying \"%s\" (%p) union stack %p\n",
+ dentry->d_name.name, dentry->d_inode, uinfo);
+
+ next = dentry->d_topmost ? dentry->d_topmost : dentry;
+ while (next) {
+ struct dentry *tmp = next;
+ next = next->d_overlaid;
+
+ UM_DEBUG_DCACHE("\"%s\", inode=%p, count=%d\n",
+ tmp->d_name.name, tmp->d_inode,
+ atomic_read(&tmp->d_count));
+ if (tmp != dentry)
+ spin_lock(&tmp->d_lock);
+ tmp->d_topmost = NULL;
+ tmp->d_overlaid = NULL;
+ union_put(tmp->d_union);
+ tmp->d_union = NULL;
+ if (tmp != dentry)
+ spin_unlock(&tmp->d_lock);
+ if (tmp == dentry)
+ goto rebuild_stack;
+ }
+
+ mutex_unlock(&uinfo->u_mutex);
+ union_put(uinfo);
+ return 1;
+
+rebuild_stack:
+ if (next) {
+ topmost = next;
+ UM_DEBUG_DCACHE("\"%s\", inode=%p, count=%d\n",
+ next->d_name.name, next->d_inode,
+ atomic_read(&next->d_count));
+ spin_lock(&next->d_lock);
+ next->d_topmost = NULL;
+ if (!next->d_overlaid) {
+ union_put(next->d_union);
+ next->d_union = NULL;
+ }
+ spin_unlock(&next->d_lock);
+ next = next->d_overlaid;
+ }
+
+ while (next) {
+ struct dentry *tmp = next;
+ next = next->d_overlaid;
+ UM_DEBUG_DCACHE("\"%s\", inode=%p, count=%d\n",
+ tmp->d_name.name, tmp->d_inode,
+ atomic_read(&tmp->d_count));
+ tmp->d_topmost = topmost;
+ }
+
+ mutex_unlock(&uinfo->u_mutex);
+ union_put(uinfo);
+ return 1;
+}
+
+static void __destroy_stack_part(struct dentry * first, struct dentry * last)
+{
+ struct dentry * next = first;
+
+ while (next) {
+ struct dentry * tmp = next;
+ next = next->d_overlaid;
+
+ spin_lock(&tmp->d_lock);
+ tmp->d_topmost = NULL;
+ tmp->d_overlaid = NULL;
+ union_put(tmp->d_union);
+ tmp->d_union = NULL;
+ spin_unlock(&tmp->d_lock);
+ if (tmp == last)
+ break;
+ }
+}
+
+/*
+ * This is the union-mount dput(). For union mount dentries it walks DOWN
+ * the union stack and puts every dentry in it. If the usage count of one
+ * of the dentries reaches zero, that dentry is removed from the stack.
+ */
+void __dput_union(struct dentry *dentry)
+{
+ struct dentry *topmost; // the new topmost after dput()
+ struct dentry *next;
+
+ union_check(dentry);
+
+ if (dentry->d_topmost) {
+ UM_DEBUG_DCACHE("we are not the topmost dentry\n");
+ topmost = dentry->d_topmost;
+ } else
+ topmost = NULL;
+
+ next = dentry;
+ while (next) {
+ struct dentry *tmp = next; // the dentry we dput now
+ next = next->d_overlaid; // the dentry we dput next
+
+ UM_DEBUG_DCACHE("name=\"%s\", inode=%p, count=%d\n",
+ tmp->d_name.name, tmp->d_inode,
+ atomic_read(&tmp->d_count));
+
+ if (atomic_read(&tmp->d_count) < 2) {
+ __destroy_stack_part(topmost ? topmost : tmp, tmp);
+ topmost = NULL;
+ } else {
+ tmp->d_topmost = topmost;
+ if (!topmost)
+ topmost = tmp;
+ }
+
+ /* We are the last one using d_union */
+ spin_lock(&tmp->d_lock);
+ if (tmp->d_union
+ && (atomic_read(&tmp->d_union->u_count) == 1)) {
+ BUG_ON(next);
+ tmp->d_overlaid = NULL;
+ tmp->d_topmost = NULL;
+ union_put(tmp->d_union);
+ tmp->d_union = NULL;
+ }
+ spin_unlock(&tmp->d_lock);
+
+ __dput_single(tmp);
+ }
+
+ return;
+}
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -322,7 +322,7 @@ extern char * d_path(struct dentry *, st
* and call dget_locked() instead of dget().
*/

-static inline struct dentry *dget(struct dentry *dentry)
+static inline struct dentry *__dget_single(struct dentry *dentry)
{
if (dentry) {
BUG_ON(!atomic_read(&dentry->d_count));
@@ -333,6 +333,108 @@ static inline struct dentry *dget(struct

extern struct dentry * dget_locked(struct dentry *);

+/*
+ * Reference counting for union mounts
+ */
+#include <linux/dcache_union.h>
+extern void __dput_single(struct dentry *);
+extern void dput(struct dentry *);
+
+#ifdef CONFIG_UNION_MOUNT
+/*
+ * Called with dentry's union lock held
+ */
+static inline struct dentry * __dget(struct dentry *dentry)
+{
+ if (unlikely(IS_UNION(dentry)))
+ return __dget_union(dentry);
+ else
+ return __dget_single(dentry);
+}
+
+static inline struct dentry * dget(struct dentry *dentry)
+{
+ if (!dentry)
+ return dentry;
+
+ /*
+ * Yes, dget() can sleep now, if the union struct isn't yet read
+ * in completely. This is symmetric to dput() which can sleep too.
+ */
+ might_sleep();
+
+ union_lock(dentry);
+ __dget(dentry);
+ union_unlock(dentry);
+ return dentry;
+}
+
+/*
+ * Called with dentry's union lock held
+ */
+static inline void __dput(struct dentry *dentry)
+{
+ if (unlikely(IS_UNION(dentry)))
+ __dput_union(dentry);
+ else
+ __dput_single(dentry);
+}
+
+#else /* CONFIG_UNION_MOUNT */
+
+/* Allocation counts.. */
+
+/**
+ * dget, dget_locked - get a reference to a dentry
+ * @dentry: dentry to get a reference to
+ *
+ * Given a dentry or %NULL pointer increment the reference count
+ * if appropriate and return the dentry. A dentry will not be
+ * destroyed when it has references. dget() should never be
+ * called for dentries with zero reference counter. For these cases
+ * (preferably none, functions in dcache.c are sufficient for normal
+ * needs and they take necessary precautions) you should hold dcache_lock
+ * and call dget_locked() instead of dget().
+ */
+
+static inline struct dentry *dget(struct dentry *dentry)
+{
+ if (dentry) {
+ BUG_ON(!atomic_read(&dentry->d_count));
+ atomic_inc(&dentry->d_count);
+ }
+ return dentry;
+}
+
+#define __dget(dentry) dget(dentry)
+#define __dput(dentry) dput(dentry)
+
+#endif /* CONFIG_UNION_MOUNT */
+
+static inline void dput_common(struct dentry *dentry)
+{
+#ifdef CONFIG_UNION_MOUNT
+ if (unlikely(IS_UNION(dentry))) {
+ struct union_info *uinfo;
+
+ /*
+ * Grab a reference to the union_info which might get detached
+ * from the dentries in __dput_union().
+ */
+ uinfo = union_lock_and_get(dentry);
+ __dput_union(dentry);
+ if (uinfo) {
+ if (atomic_read(&uinfo->u_count) == 1)
+ /* We are the last user of this union_info */
+ union_release(uinfo);
+ else
+ union_put_and_unlock(uinfo);
+ }
+ } else
+#endif
+ return __dput_single(dentry);
+}
+
/**
* d_unhashed - is dentry hashed
* @dentry: entry to check
@@ -350,13 +452,13 @@ static inline struct dentry *dget_parent
struct dentry *ret;

spin_lock(&dentry->d_lock);
- ret = dget(dentry->d_parent);
+ union_lock_spinlock(dentry->d_parent, &dentry->d_lock);
+ ret = __dget(dentry->d_parent);
+ union_unlock(dentry->d_parent);
spin_unlock(&dentry->d_lock);
return ret;
}

-extern void dput(struct dentry *);
-
static inline int d_mountpoint(struct dentry *dentry)
{
return dentry->d_mounted;
--- a/include/linux/dcache_union.h
+++ b/include/linux/dcache_union.h
@@ -39,6 +39,17 @@ extern struct union_info *union_alloc(vo
extern struct union_info *union_get(struct union_info *);
extern void union_put(struct union_info *);

+
+extern void __union_check(struct dentry *);
+
+static inline void union_check(struct dentry *dentry)
+{
+ if (!dentry)
+ return;
+ if (unlikely(dentry->d_union))
+ __union_check(dentry);
+}
+
/*
* These are the functions for locking a dentry's union. When one
* wants to acquire a dentry's union lock, use:
@@ -74,6 +85,7 @@ static inline void union_lock(struct den
UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
dentry->d_name.name, ui,
atomic_read(&ui->u_count));
+ union_check(dentry);
__union_lock(dentry->d_union);
}
}
@@ -92,6 +104,7 @@ static inline void union_unlock(struct d
UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
dentry->d_name.name, ui,
atomic_read(&ui->u_count));
+ union_check(dentry);
__union_unlock(dentry->d_union);
}
}
@@ -146,6 +159,7 @@ static inline int union_trylock(struct d
UM_DEBUG_LOCK("\"%s\" try locking %p (count=%d)\n",
dentry->d_name.name, dentry->d_union,
atomic_read(&dentry->d_union->u_count));
+ union_check(dentry);
BUG_ON(!atomic_read(&dentry->d_union->u_count));
locked = mutex_trylock(&dentry->d_union->u_mutex);
UM_DEBUG_LOCK("\"%s\" trylock %p %s\n", dentry->d_name.name,
@@ -227,10 +241,57 @@ static inline void union_lock_fs(struct
}

#define IS_UNION(dentry) ((dentry)->d_overlaid || (dentry)->d_topmost || \
- (dentry)->d_overlaid)
+ (dentry)->d_union)
+
+/* dentry reference counting */
+static inline struct dentry * __dget_union(struct dentry *dentry)
+{
+ if (!dentry)
+ return dentry;
+
+ union_check(dentry);
+ __dget_single(dentry);
+
+ if (likely(!dentry->d_overlaid && !dentry->d_topmost))
+ return dentry;
+
+ if (dentry->d_topmost)
+ UM_DEBUG_DCACHE("we are not the topmost dentry\n");
+
+ UM_DEBUG_DCACHE("name=\"%s\", inode=%p, count=%d\n",
+ dentry->d_name.name, dentry->d_inode,
+ atomic_read(&dentry->d_count));
+
+ if (dentry->d_overlaid) {
+ struct dentry * tmp = dentry->d_overlaid;
+
+ while (tmp) {
+ __dget_single(tmp);
+ UM_DEBUG_DCACHE("name=\"%s\", inode=%p, count=%d\n",
+ tmp->d_name.name, tmp->d_inode,
+ atomic_read(&tmp->d_count));
+ tmp = tmp->d_overlaid;
+ }
+ }
+
+ return dentry;
+}
+
+extern void __dput_union(struct dentry *);
+extern int __destroy_union(struct dentry *dentry);
+
+static inline int __dput_single_destroy_union(struct dentry *dentry)
+{
+ if (!dentry->d_union)
+ return 1;
+
+ return __destroy_union(dentry);
+}

#else /* CONFIG_UNION_MOUNT */

+#define union_check(dentry) do { /* empty */ } while (0)
+
#define union_lock(dentry) do { /* empty */ } while (0)
#define union_trylock(dentry) ({ (1); })
#define union_unlock(dentry) do { /* empty */ } while (0)
@@ -243,6 +304,8 @@ static inline void union_lock_fs(struct
#define union_unlock_and_put(dentry) do { /* empty */ } while (0)
#define union_release(x) do { BUG(); } while (0)

+#define __dput_single_destroy_union(x) ({ (1); })
+
#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
#endif /* __LINUX_DCACHE_UNION_H */
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1229,8 +1229,10 @@ void __audit_getname(const char *name)
++context->name_count;
if (!context->pwd) {
read_lock(&current->fs->lock);
- context->pwd = dget(current->fs->pwd);
+ union_lock_readlock(current->fs->pwd, &current->fs->lock);
+ context->pwd = __dget(current->fs->pwd);
context->pwdmnt = mntget(current->fs->pwdmnt);
+ union_unlock(current->fs->pwd);
read_unlock(&current->fs->lock);
}

--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1908,9 +1908,11 @@ static int cpuset_rmdir(struct inode *un
mutex_lock(&callback_mutex);
set_bit(CS_REMOVED, &cs->flags);
list_del(&cs->sibling); /* delete my sibling from parent->children */
+ union_lock(cs->dentry);
spin_lock(&cs->dentry->d_lock);
- d = dget(cs->dentry);
+ d = __dget(cs->dentry);
cs->dentry = NULL;
+ union_unlock(d);
spin_unlock(&d->d_lock);
cpuset_d_remove_dir(d);
dput(d);
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -580,17 +580,21 @@ static inline struct fs_struct *__copy_f
rwlock_init(&fs->lock);
fs->umask = old->umask;
read_lock(&old->lock);
+ union_lock_fs(old);
fs->rootmnt = mntget(old->rootmnt);
- fs->root = dget(old->root);
+ fs->root = __dget(old->root);
fs->pwdmnt = mntget(old->pwdmnt);
- fs->pwd = dget(old->pwd);
+ fs->pwd = __dget(old->pwd);
if (old->altroot) {
fs->altrootmnt = mntget(old->altrootmnt);
- fs->altroot = dget(old->altroot);
+ fs->altroot = __dget(old->altroot);
} else {
fs->altrootmnt = NULL;
fs->altroot = NULL;
}
+ union_unlock(old->pwd);
+ union_unlock(old->altroot);
+ union_unlock(old->root);
read_unlock(&old->lock);
}
return fs;
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1082,8 +1082,11 @@ restart:
newu->addr = otheru->addr;
}
if (otheru->dentry) {
- newu->dentry = dget(otheru->dentry);
+ /* Is this safe here? I don't know ... */
+ union_lock_spinlock(otheru->dentry, &otheru->lock);
+ newu->dentry = __dget(otheru->dentry);
newu->mnt = mntget(otheru->mnt);
+ union_unlock(otheru->dentry);
}

/* Set credentials */

2007-05-14 09:34:52

by Bharata B Rao

Subject: [RFC][PATCH 7/14] Union-mount mounting

From: Jan Blunck <[email protected]>
Subject: Union-mount mounting

Adds union mount support to the mount() and umount() system calls.
Sets up the union stack during mount and destroys it during unmount.

TODO: bind and move mounts aren't yet supported with union mounts.
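
As a rough illustration of the stack setup and teardown described above, here
is a minimal userspace sketch. It only models the d_overlaid/d_topmost links
that attach_mnt_union() and detach_mnt_union() maintain; struct toy_dentry,
toy_attach() and toy_detach() are hypothetical names made up for this example,
and locking, reference counting and the union_info handling are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dentry; only the union links are kept. */
struct toy_dentry {
	const char *name;
	struct toy_dentry *overlaid;	/* next lower layer, NULL at the bottom */
	struct toy_dentry *topmost;	/* top of the stack, NULL if we are the top */
};

/*
 * Rough analogue of attach_mnt_union(): the root of the new mount goes on
 * top of the mount point, and every lower layer is pointed at the new top.
 */
static void toy_attach(struct toy_dentry *root, struct toy_dentry *mountpoint)
{
	struct toy_dentry *tmp;

	assert(root && mountpoint);
	root->overlaid = mountpoint;
	root->topmost = NULL;
	for (tmp = mountpoint; tmp; tmp = tmp->overlaid)
		tmp->topmost = root;
}

/*
 * Rough analogue of detach_mnt_union(): unlink the top layer and make the
 * former mount point the top of whatever stack remains below it.
 */
static void toy_detach(struct toy_dentry *root)
{
	struct toy_dentry *mountpoint;
	struct toy_dentry *tmp;

	assert(root);
	mountpoint = root->overlaid;
	root->overlaid = NULL;
	if (!mountpoint)
		return;
	mountpoint->topmost = NULL;
	for (tmp = mountpoint->overlaid; tmp; tmp = tmp->overlaid)
		tmp->topmost = mountpoint;
}
```

Attaching twice builds a three-layer stack in which both lower layers point at
the newest root; detaching that root restores the previous two-layer stack.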

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
fs/namespace.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++----
fs/union.c | 71 +++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +
include/linux/union.h | 33 ++++++++++++++++++
4 files changed, 190 insertions(+), 7 deletions(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -169,7 +169,7 @@ void mnt_set_mountpoint(struct vfsmount
struct vfsmount *child_mnt)
{
child_mnt->mnt_parent = mntget(mnt);
- child_mnt->mnt_mountpoint = dget(dentry);
+ child_mnt->mnt_mountpoint = __dget(dentry);
dentry->d_mounted++;
}

@@ -294,6 +294,10 @@ static struct vfsmount *clone_mnt(struct
if (!mnt)
goto alloc_failed;

+ /*
+ * As of now, cloning of union mounted mnt isn't permitted.
+ */
+ BUG_ON(old->mnt_flags & MNT_UNION);
mnt->mnt_flags = old->mnt_flags;
atomic_inc(&sb->s_active);
mnt->mnt_sb = sb;
@@ -579,16 +583,20 @@ void release_mounts(struct list_head *he
mnt = list_first_entry(head, struct vfsmount, mnt_hash);
list_del_init(&mnt->mnt_hash);
if (mnt->mnt_parent != mnt) {
- struct dentry *dentry;
- struct vfsmount *m;
+ struct path old_nd;
spin_lock(&vfsmount_lock);
- dentry = mnt->mnt_mountpoint;
- m = mnt->mnt_parent;
+ old_nd.dentry = mnt->mnt_mountpoint;
+ old_nd.mnt = mnt->mnt_parent;
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
+ detach_mnt_union(mnt, &old_nd);
spin_unlock(&vfsmount_lock);
- dput(dentry);
- mntput(m);
+ if (mnt->mnt_flags & MNT_UNION) {
+ UM_DEBUG("shrink the mountpoint's dcache\n");
+ shrink_dcache_sb(old_nd.dentry->d_sb);
+ }
+ __dput(old_nd.dentry);
+ mntput(old_nd.mnt);
}
mntput(mnt);
}
@@ -621,6 +629,9 @@ static int do_umount(struct vfsmount *mn
struct super_block *sb = mnt->mnt_sb;
int retval;
LIST_HEAD(umount_list);
+#ifdef CONFIG_UNION_MOUNT
+ struct union_info *uinfo = NULL;
+#endif

retval = security_sb_umount(mnt, flags);
if (retval)
@@ -685,6 +696,14 @@ static int do_umount(struct vfsmount *mn
}

down_write(&namespace_sem);
+#ifdef CONFIG_UNION_MOUNT
+ /*
+ * Grab a reference to the union_info which gets detached
+ * from the dentries in release_mounts().
+ */
+ if (mnt->mnt_flags & MNT_UNION)
+ uinfo = union_lock_and_get(mnt->mnt_root);
+#endif
spin_lock(&vfsmount_lock);
event++;

@@ -699,6 +718,15 @@ static int do_umount(struct vfsmount *mn
security_sb_umount_busy(mnt);
up_write(&namespace_sem);
release_mounts(&umount_list);
+#ifdef CONFIG_UNION_MOUNT
+ if (uinfo) {
+ if (atomic_read(&uinfo->u_count) == 1)
+ /* We are the last user of this union_info */
+ union_release(uinfo);
+ else
+ union_put_and_unlock(uinfo);
+ }
+#endif
return retval;
}

@@ -941,6 +969,9 @@ static int attach_recursive_mnt(struct v
set_mnt_shared(p);
}

+ if (source_mnt->mnt_flags & MNT_UNION)
+ union_alloc_dentry(nd->dentry);
+
spin_lock(&vfsmount_lock);
if (parent_nd) {
detach_mnt(source_mnt, parent_nd);
@@ -948,6 +979,7 @@ static int attach_recursive_mnt(struct v
touch_mnt_namespace(current->nsproxy->mnt_ns);
} else {
mnt_set_mountpoint(dest_mnt, dest_dentry, source_mnt);
+ attach_mnt_union(source_mnt, nd);
commit_tree(source_mnt);
}

@@ -956,6 +988,7 @@ static int attach_recursive_mnt(struct v
commit_tree(child);
}
spin_unlock(&vfsmount_lock);
+ union_unlock(nd->dentry);
return 0;
}

@@ -1003,6 +1036,12 @@ static int do_change_type(struct nameida
if (nd->dentry != nd->mnt->mnt_root)
return -EINVAL;

+ /*
+ * Don't change the type of union mounts
+ */
+ if (nd->mnt->mnt_flags & MNT_UNION)
+ return -EINVAL;
+
down_write(&namespace_sem);
spin_lock(&vfsmount_lock);
for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
@@ -1031,6 +1070,15 @@ static int do_loopback(struct nameidata
if (err)
return err;

+ /*
+ * bind mounting to or from union mounts is not supported
+ */
+ err = -EINVAL;
+ if (nd->mnt->mnt_flags & MNT_UNION)
+ goto out_unlocked;
+ if (old_nd.mnt->mnt_flags & MNT_UNION)
+ goto out_unlocked;
+
down_write(&namespace_sem);
err = -EINVAL;
if (IS_MNT_UNBINDABLE(old_nd.mnt))
@@ -1064,6 +1112,7 @@ static int do_loopback(struct nameidata

out:
up_write(&namespace_sem);
+out_unlocked:
path_release(&old_nd);
return err;
}
@@ -1125,6 +1174,15 @@ static int do_move_mount(struct nameidat
if (err)
return err;

+ /*
+ * moving to or from a union mount is not supported
+ */
+ err = -EINVAL;
+ if (nd->mnt->mnt_flags & MNT_UNION)
+ goto exit;
+ if (old_nd.mnt->mnt_flags & MNT_UNION)
+ goto exit;
+
down_write(&namespace_sem);
while (d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry))
;
@@ -1180,6 +1238,7 @@ out:
up_write(&namespace_sem);
if (!err)
path_release(&parent_nd);
+exit:
path_release(&old_nd);
return err;
}
@@ -1223,6 +1282,9 @@ static int do_new_mount(struct nameidata
if (flags & MS_SETUSER)
__set_mnt_user(mnt, current->fsuid);

+ UM_DEBUG("dentry=%s, device=%s\n", nd->dentry->d_name.name,
+ mnt->mnt_devname);
+
return do_add_mount(mnt, nd, mnt_flags, NULL);

out_put_filesystem:
@@ -1257,6 +1319,12 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;

+ /* Unions couldn't include shared mounts */
+ err = -EINVAL;
+ if ((mnt_flags & MNT_UNION) &&
+ IS_MNT_SHARED(nd->mnt))
+ goto unlock;
+
/* Unions couldn't be writable if the filesystem
* doesn't know about whiteouts */
err = -ENOTSUPP;
@@ -1276,6 +1344,14 @@ int do_add_mount(struct vfsmount *newmnt
list_add_tail(&newmnt->mnt_expire, fslist);
spin_unlock(&vfsmount_lock);
}
+
+ UM_DEBUG("mntpoint->d_count=%d/%p\n",
+ atomic_read(&nd->dentry->d_count),
+ &nd->dentry->d_count);
+ UM_DEBUG("mntroot->d_count=%d/%p\n",
+ atomic_read(&newmnt->mnt_root->d_count),
+ &newmnt->mnt_root->d_count);
+
up_write(&namespace_sem);
return 0;

--- a/fs/union.c
+++ b/fs/union.c
@@ -299,3 +299,74 @@ void __dput_union(struct dentry *dentry)

return;
}
+
+void attach_mnt_union(struct vfsmount *mnt, struct nameidata *nd)
+{
+ struct dentry *tmp;
+
+ if (!(mnt->mnt_flags & MNT_UNION))
+ return;
+
+ UM_DEBUG("MNT_UNION set for dentry \"%s\", devname=%s\n",
+ mnt->mnt_root->d_name.name, mnt->mnt_devname);
+ UM_DEBUG("mountpoint \"%s\", inode=%p\n",
+ nd->dentry->d_name.name, nd->dentry->d_inode);
+
+ spin_lock(&mnt->mnt_root->d_lock);
+ mnt->mnt_root->d_overlaid = __dget(nd->dentry);
+ mnt->mnt_root->d_topmost = NULL;
+ mnt->mnt_root->d_union = union_get(nd->dentry->d_union);
+ spin_unlock(&mnt->mnt_root->d_lock);
+
+ tmp = nd->dentry;
+ while (tmp) {
+ tmp->d_topmost = mnt->mnt_root;
+ tmp = tmp->d_overlaid;
+ }
+}
+
+void detach_mnt_union(struct vfsmount *mnt, struct path *path)
+{
+ struct dentry *tmp;
+
+ if (!(mnt->mnt_flags & MNT_UNION))
+ return;
+
+ UM_DEBUG("MNT_UNION set for dentry \"%s\", devname=%s\n",
+ mnt->mnt_root->d_name.name, mnt->mnt_devname);
+ UM_DEBUG("mountpoint \"%s\", inode=%p\n",
+ path->dentry->d_name.name, path->dentry->d_inode);
+ BUG_ON(mnt->mnt_root->d_topmost);
+
+ /* put reference to the underlying union stack */
+ __dput(mnt->mnt_root->d_overlaid);
+ spin_lock(&mnt->mnt_root->d_lock);
+ mnt->mnt_root->d_overlaid = NULL;
+ union_put(mnt->mnt_root->d_union);
+ mnt->mnt_root->d_union = NULL;
+ spin_unlock(&mnt->mnt_root->d_lock);
+
+ /* rearrange the union stack */
+ path->dentry->d_topmost = NULL;
+ tmp = path->dentry->d_overlaid;
+ while (tmp) {
+ tmp->d_topmost = path->dentry;
+ tmp = tmp->d_overlaid;
+ }
+
+ /* If the mount point is the last component in the union,
+ * put the reference to the union struct */
+ if (!path->dentry->d_overlaid) {
+ spin_lock(&path->dentry->d_lock);
+ union_put(path->dentry->d_union);
+ path->dentry->d_union = NULL;
+ spin_unlock(&path->dentry->d_lock);
+ }
+
+ /* when we looked up the mountpoint to be unmounted
+ * we dget() a union-mount dentry struct so we have
+ * to dput() parts of it by hand before we remove the
+ * topmost dentry (which is mnt->mnt_root) from the
+ * union stack */
+ __dput(path->dentry);
+}
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1984,6 +1984,9 @@ static inline ino_t parent_ino(struct de
/* kernel/fork.c */
extern int unshare_files(void);

+/* fs/union.c */
+#include <linux/union.h>
+
/* Transaction based IO helpers */

/*
--- /dev/null
+++ b/include/linux/union.h
@@ -0,0 +1,33 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright © 2004-2007 IBM Corporation
+ * Author(s): Jan Blunck ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_H
+#define __LINUX_UNION_H
+#ifdef __KERNEL__
+
+#ifdef CONFIG_UNION_MOUNT
+
+#include <linux/fs_struct.h>
+
+/* namespace stuff used at mount time */
+extern void attach_mnt_union(struct vfsmount *, struct nameidata *);
+extern void detach_mnt_union(struct vfsmount *, struct path *);
+
+#else /* CONFIG_UNION_MOUNT */
+
+#define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
+#define detach_mnt_union(mnt,nd) do { /* empty */ } while (0)
+
+#endif /* CONFIG_UNION_MOUNT */
+
+#endif /* __KERNEL__ */
+#endif /* __LINUX_UNION_H */

2007-05-14 09:35:29

by Bharata B Rao

Subject: [RFC][PATCH 8/14] Union-mount lookup

From: Jan Blunck <[email protected]>
Subject: Union-mount lookup

Modifies the vfs lookup routines to work with union mounted directories.

The existing lookup routines generally look up a pathname only in the
topmost or given directory. The changed versions of the lookup routines
search for the pathname in the entire union-mounted stack. They have also
been modified to set up the union stack during lookups from the dcache and
from real_lookup().
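
The layered search can be sketched in userspace as follows. This is only a
toy model of the rules used by the union lookup routines in this patch:
negative entries are skipped, the first positive entry wins, and lower layers
are stacked only while they are directories. The toy_* types and functions
are hypothetical names invented for this example; whiteouts, revalidation and
locking are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_kind { TOY_NEGATIVE, TOY_FILE, TOY_DIR };

struct toy_entry {
	const char *name;
	enum toy_kind kind;
};

/* One layer of the union: a flat list of directory entries. */
struct toy_layer {
	const struct toy_entry *entries;
	size_t nr;
};

/* Single-layer lookup; a missing name is treated as a negative entry. */
static enum toy_kind toy_lookup_single(const struct toy_layer *layer,
				       const char *name)
{
	size_t i;

	for (i = 0; i < layer->nr; i++)
		if (strcmp(layer->entries[i].name, name) == 0)
			return layer->entries[i].kind;
	return TOY_NEGATIVE;
}

/*
 * layers[0] is the topmost layer. The indices of the layers that form the
 * union stack for @name are written to @stack (which must have room for
 * nr_layers entries). Returns the stack depth, 0 if the name is not found.
 */
static size_t toy_lookup_union(const struct toy_layer *layers,
			       size_t nr_layers, const char *name,
			       size_t *stack)
{
	size_t depth = 0;
	size_t i;

	assert(layers && name && stack);
	for (i = 0; i < nr_layers; i++) {
		enum toy_kind kind = toy_lookup_single(&layers[i], name);

		if (kind == TOY_NEGATIVE)
			continue;	/* ignore negative entries */
		if (depth > 0 && kind != TOY_DIR)
			break;		/* type mismatch: do not see through it */
		stack[depth++] = i;
		if (kind != TOY_DIR)
			break;		/* non-directories are never stacked */
	}
	return depth;
}
```

A directory present in the top and bottom layers yields a two-entry stack,
while a regular file in the top layer hides anything of the same name below.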

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
fs/dcache.c | 16 +
fs/namei.c | 78 +++++-
fs/namespace.c | 35 ++
fs/union.c | 598 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 17 +
include/linux/namei.h | 4
include/linux/union.h | 49 ++++
7 files changed, 786 insertions(+), 11 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1286,7 +1286,7 @@ struct dentry * d_lookup(struct dentry *
return dentry;
}

-struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
+struct dentry * __d_lookup_single(struct dentry *parent, struct qstr *name)
{
unsigned int len = name->len;
unsigned int hash = name->hash;
@@ -1371,6 +1371,20 @@ out:
return dentry;
}

+struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
+{
+ struct dentry *dentry;
+ unsigned long seq;
+
+ do {
+ seq = read_seqbegin(&rename_lock);
+ dentry = __d_lookup_single(parent, name);
+ if (dentry)
+ break;
+ } while (read_seqretry(&rename_lock, seq));
+ return dentry;
+}
+
/**
* d_validate - verify dentry provided from insecure source
* @dentry: The dentry alleged to be valid child of @dparent
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -374,6 +374,33 @@ void release_open_intent(struct nameidat
}

static inline struct dentry *
+do_revalidate_single(struct dentry *dentry, struct nameidata *nd)
+{
+ int status = dentry->d_op->d_revalidate(dentry, nd);
+ if (unlikely(status <= 0)) {
+ /*
+ * The dentry failed validation.
+ * If d_revalidate returned 0 attempt to invalidate
+ * the dentry otherwise d_revalidate is asking us
+ * to return a fail status.
+ */
+ if (!status) {
+ if (!d_invalidate(dentry)) {
+ __dput_single(dentry);
+ dentry = NULL;
+ }
+ } else {
+ __dput_single(dentry);
+ dentry = ERR_PTR(status);
+ }
+ }
+ return dentry;
+}
+
+/*
+ * FIXME: We need a union aware revalidate here!
+ */
+static inline struct dentry *
do_revalidate(struct dentry *dentry, struct nameidata *nd)
{
int status = dentry->d_op->d_revalidate(dentry, nd);
@@ -403,16 +430,16 @@ do_revalidate(struct dentry *dentry, str
*/
static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
{
- struct dentry * dentry = __d_lookup(parent, name);
+ struct dentry *dentry = __d_lookup_single(parent, name);

/* lockess __d_lookup may fail due to concurrent d_move()
* in some unrelated directory, so try with d_lookup
*/
if (!dentry)
- dentry = d_lookup(parent, name);
+ dentry = d_lookup_single(parent, name);

if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
- dentry = do_revalidate(dentry, nd);
+ dentry = do_revalidate_single(dentry, nd);

return dentry;
}
@@ -465,7 +492,7 @@ ok:
* make sure that nobody added the entry to the dcache in the meantime..
* SMP-safe
*/
-static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
+struct dentry * real_lookup_single(struct dentry *parent, struct qstr *name, struct nameidata *nd)
{
struct dentry * result;
struct inode *dir = parent->d_inode;
@@ -485,7 +512,7 @@ static struct dentry * real_lookup(struc
*
* so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
*/
- result = d_lookup(parent, name);
+ result = d_lookup_single(parent, name);
if (!result) {
struct dentry * dentry = d_alloc(parent, name);
result = ERR_PTR(-ENOMEM);
@@ -506,7 +533,7 @@ static struct dentry * real_lookup(struc
*/
mutex_unlock(&dir->i_mutex);
if (result->d_op && result->d_op->d_revalidate) {
- result = do_revalidate(result, nd);
+ result = do_revalidate_single(result, nd);
if (!result)
result = ERR_PTR(-ENOENT);
}
@@ -699,7 +726,7 @@ static int __follow_mount(struct path *p
return res;
}

-static void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
+void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
{
while (d_mountpoint(*dentry)) {
struct vfsmount *mounted = lookup_mnt(*mnt, *dentry);
@@ -773,6 +800,7 @@ static __always_inline void follow_dotdo
nd->mnt = parent;
}
follow_mount(&nd->mnt, &nd->dentry);
+ follow_union_mount(&nd->mnt, &nd->dentry);
}

/*
@@ -784,7 +812,15 @@ static int do_lookup(struct nameidata *n
struct path *path)
{
struct vfsmount *mnt = nd->mnt;
- struct dentry *dentry = __d_lookup(nd->dentry, name);
+ struct dentry *dentry;
+
+ UM_DEBUG_UID("lookup \"%s\" in \"%s\" (inode=%p,dev=%s)\n",
+ name->name,
+ nd->dentry->d_name.name,
+ nd->dentry->d_inode,
+ nd->mnt->mnt_devname);
+
+ dentry = __d_lookup(nd->dentry, name);

if (!dentry)
goto need_lookup;
@@ -793,7 +829,17 @@ static int do_lookup(struct nameidata *n
done:
path->mnt = mnt;
path->dentry = dentry;
+
+ if (nd->dentry->d_sb != dentry->d_sb)
+ path->mnt = find_mnt(dentry);
+
__follow_mount(path);
+ follow_union_mount(&path->mnt, &path->dentry);
+
+ UM_DEBUG_UID("found \"%s\" (inode=%p,dev=%s)\n",
+ path->dentry->d_name.name,
+ path->dentry->d_inode,
+ path->mnt->mnt_devname);
return 0;

need_lookup:
@@ -838,6 +884,9 @@ static fastcall int __link_path_walk(con
if (nd->depth)
lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);

+ UM_DEBUG_UID("begin walking for %s\n", name);
+ follow_union_mount(&nd->mnt, &nd->dentry);
+
/* At this point we know we have a real path component. */
for(;;) {
unsigned long hash;
@@ -931,6 +980,7 @@ static fastcall int __link_path_walk(con
last_with_slashes:
lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
last_component:
+ UM_DEBUG_UID("last component %s\n", this.name);
/* Clear LOOKUP_CONTINUE iff it was previously unset */
nd->flags &= lookup_flags | ~LOOKUP_CONTINUE;
if (lookup_flags & LOOKUP_PARENT)
@@ -1266,7 +1316,15 @@ int __user_path_lookup_open(const char _
return err;
}

-static inline struct dentry *__lookup_hash_kern(struct qstr *name, struct dentry *base, struct nameidata *nd)
+/*
+ * NOTE: On union mounts it is important that the overlaid dentries are
+ * correct. Therefore we need to follow mounts. Take a look at
+ * __lookup_hash_kern_union() how it is done.
+ *
+ * Called with union already locked (before the parent inode is locked !!!)
+ */
+struct dentry * __lookup_hash_kern_single(struct qstr *name,
+ struct dentry *base, struct nameidata *nd)
{
struct dentry *dentry;
struct inode *inode;
@@ -1298,6 +1356,8 @@ static inline struct dentry *__lookup_ha
dput(new);
}
out:
+ UM_DEBUG_UID("name=\"%s\", inode=%p\n",
+ dentry->d_name.name, dentry->d_inode);
return dentry;
}

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -133,6 +133,41 @@ struct vfsmount *lookup_mnt(struct vfsmo
return child_mnt;
}

+/*
+ * find_mnt - find a vfsmount struct
+ * @dentry: a dentry
+ *
+ * This searches the namespace for a given dentry's
+ * vfsmount struct. This is used by union-mount.
+ */
+struct vfsmount * find_mnt(struct dentry *dentry)
+{
+ struct list_head *tmp;
+ struct vfsmount *p, *mnt = NULL;
+
+ down_read(&namespace_sem);
+ spin_lock(&vfsmount_lock);
+ if (list_empty(&current->nsproxy->mnt_ns->list)) {
+ spin_unlock(&vfsmount_lock);
+ up_read(&namespace_sem);
+ return NULL;
+ }
+ list_for_each(tmp, &current->nsproxy->mnt_ns->list) {
+ p = list_entry(tmp, struct vfsmount, mnt_list);
+ if (dentry->d_sb == p->mnt_sb) {
+ mnt = mntget(p);
+ break;
+ }
+ }
+ spin_unlock(&vfsmount_lock);
+ up_read(&namespace_sem);
+
+ BUG_ON(!mnt);
+// UM_DEBUG_UID("found %s/%p in %s\n", dentry->d_name.name,
+// dentry->d_inode, mnt->mnt_devname);
+ return mnt;
+}
+
static inline int check_mnt(struct vfsmount *mnt)
{
return mnt->mnt_ns == current->nsproxy->mnt_ns;
--- a/fs/union.c
+++ b/fs/union.c
@@ -370,3 +370,601 @@ void detach_mnt_union(struct vfsmount *m
* union stack */
__dput(path->dentry);
}
+
+static noinline int revalidate_union(struct dentry * dentry)
+{
+ union_check(dentry);
+
+ spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
+ if (atomic_read(&dentry->d_count) < 2) {
+ UM_DEBUG_DCACHE("dentry unused, count=%d\n",
+ atomic_read(&dentry->d_count));
+ __d_drop(dentry);
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ return 0;
+ }
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+
+ return 1;
+}
+
+static noinline void replace_union_info(struct dentry *dentry,
+ struct union_info *lock)
+{
+ struct dentry *tmp = dentry;
+ struct union_info *old_lock = union_get(dentry->d_union);
+
+ BUG_ON(!lock);
+ BUG_ON(dentry->d_union == lock);
+
+ while (tmp) {
+ spin_lock(&tmp->d_lock);
+ union_put(tmp->d_union);
+ tmp->d_union = union_get(lock);
+ spin_unlock(&tmp->d_lock);
+ tmp = tmp->d_overlaid;
+ }
+
+ BUG_ON(atomic_read(&old_lock->u_count) != 1);
+ union_put(old_lock);
+ return;
+}
+
+static void __dput_from_to(struct dentry *from, struct dentry *to,
+ struct union_info *lock)
+{
+ struct dentry *next = from;
+ struct union_info *mylock = union_get(from->d_union);
+
+ while (next) {
+ struct dentry *tmp = next;
+ next = next->d_overlaid;
+
+ UM_DEBUG_UID("dput_all dentry=\"%s\", inode=\"%p\"\n",
+ tmp->d_name.name, tmp->d_inode);
+
+ if (lock) {
+ spin_lock(&tmp->d_lock);
+ tmp->d_topmost = NULL;
+ tmp->d_overlaid = NULL;
+ union_put(tmp->d_union);
+ tmp->d_union = NULL;
+ spin_unlock(&tmp->d_lock);
+ }
+
+ __dput_single(tmp);
+
+ if (tmp == to)
+ break;
+ }
+
+ UM_DEBUG_LOCK("\"??\" unlocking union %p\n", lock);
+ mutex_unlock(&mylock->u_mutex);
+ union_put(mylock);
+}
+
+/*
+ * Look up @name in the dentry cache. Look through the lower layers
+ * of the parent's union stack and build a union stack for the child if
+ * necessary.
+ * TODO: This shares a considerable amount of code with __lookup_union().
+ */
+struct dentry * __d_lookup_union(struct dentry *base, struct qstr *name)
+{
+ struct dentry *parent = base->d_overlaid;
+ struct dentry *dentry = NULL;
+ struct dentry *topmost;
+ struct dentry *last;
+ struct qstr this;
+ struct union_info *lock = NULL;
+ int err;
+
+ union_lock(base);
+ topmost = __d_lookup_single(base, name);
+ last = topmost;
+
+ /*
+ * - If dcache lookup returns a NULL dentry, return to force a real
+ * lookup. The union mount version of real lookup will end up doing a real
+ * or dcache lookup for this name in the lower layers as well. OR
+ * - If parent is not a union mounted directory, we are done
+ * with the lookup, return.
+ */
+ if (!topmost || !base->d_overlaid)
+ goto out;
+
+ this.name = name->name;
+ this.len = name->len;
+ this.hash = name->hash;
+
+ /*
+ * If dcache lookup returned a non-negative dentry from the top layer,
+ * continue the lookup into the lower layers and re-build the union
+ * stack if necessary.
+ */
+ if (topmost->d_inode)
+ goto lookup_union;
+
+ /*
+ * dcache lookup in the top layer returned a negative dentry. Look
+ * through the lower layers to find the first non-negative dentry.
+ */
+ while (parent) {
+ if (parent->d_op && parent->d_op->d_hash) {
+ err = parent->d_op->d_hash(parent, &this);
+ if (err < 0) {
+ __dput_single(topmost);
+ topmost = NULL;
+ goto out;
+ }
+ }
+ dentry = __d_lookup_single(parent, &this);
+ /*
+ * Force a real lookup if parts of the union stack are not in
+ * dcache
+ */
+ if (!dentry) {
+ __dput_single(topmost);
+ topmost = NULL;
+ goto out;
+ }
+ if (dentry->d_inode)
+ break;
+ __dput_single(dentry);
+ dentry = NULL;
+ parent = parent->d_overlaid;
+ }
+
+ if (!dentry)
+ goto out;
+
+ __dput_single(topmost);
+ topmost = dentry;
+ last = dentry;
+lookup_union:
+ do {
+ struct vfsmount *mnt = find_mnt(topmost);
+ UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
+ topmost->d_name.name, topmost->d_inode,
+ mnt->mnt_devname);
+ mntput(mnt);
+ } while (0);
+
+ /* If this is not a directory, no need to look beyond this layer */
+ if (!S_ISDIR(topmost->d_inode->i_mode))
+ goto out;
+
+ if (!revalidate_union(topmost)) {
+ __dput_single(topmost);
+ topmost = NULL;
+ goto out;
+ }
+
+ spin_lock(&topmost->d_lock);
+ if (topmost->d_union) {
+ union_lock_spinlock(topmost, &topmost->d_lock);
+ }
+ spin_unlock(&topmost->d_lock);
+
+ parent = topmost->d_parent->d_overlaid;
+ while (parent) {
+ if (parent->d_op && parent->d_op->d_hash) {
+ err = parent->d_op->d_hash(parent, &this);
+ if (err < 0) {
+ UM_DEBUG("failed to hash the qstr\n");
+ goto dput_all;
+ }
+ }
+ dentry = __d_lookup_single(parent, &this);
+ if (!dentry) {
+ /*
+ * No dentry for this name in this lower layer.
+ * CHECK: Why return like this? Shouldn't we look
+ * for this name in the next lower layer?
+ */
+ __dput_single(dentry);
+ goto dput_all;
+ }
+
+ if (!dentry->d_inode) {
+ __dput_single(dentry);
+ parent = parent->d_overlaid;
+ continue;
+ }
+ if (!S_ISDIR(dentry->d_inode->i_mode)) {
+ __dput_single(dentry);
+ break;
+ }
+ if (last->d_overlaid
+ && (last->d_overlaid != dentry)) {
+ printk(KERN_ERR "%s: strange stack layout " \
+ "(\"%s\" overlays \"%s\")\n",
+ __FUNCTION__, last->d_name.name,
+ dentry->d_name.name);
+ dump_stack();
+ __dput_single(dentry);
+ goto dput_all;
+ }
+ spin_lock(&topmost->d_lock);
+ if (!topmost->d_union) {
+ UM_DEBUG_LOCK("allocate union for \"%s\"\n",
+ topmost->d_name.name);
+ topmost->d_union = union_alloc();
+ lock = topmost->d_union;
+ }
+ spin_unlock(&topmost->d_lock);
+ spin_lock(&dentry->d_lock);
+ if (!dentry->d_union)
+ dentry->d_union = union_get(topmost->d_union);
+ spin_unlock(&dentry->d_lock);
+ if (dentry->d_union != topmost->d_union) {
+ union_lock(dentry);
+ replace_union_info(topmost, dentry->d_union);
+ }
+ dentry->d_topmost = topmost;
+ last->d_overlaid = dentry;
+ last = dentry;
+ parent = parent->d_overlaid;
+ }
+
+ spin_lock(&topmost->d_lock);
+ if (topmost->d_union && atomic_read(&topmost->d_union->u_count) == 1) {
+ union_put(topmost->d_union);
+ topmost->d_union = NULL;
+ } else
+ union_unlock(topmost);
+ spin_unlock(&topmost->d_lock);
+out:
+ union_unlock(base);
+ return topmost;
+
+dput_all:
+ __dput_from_to(topmost, last, lock);
+ union_unlock(base);
+ return NULL;
+}
+
+/*
+ * FIXME: export this from fs/namei.c ???
+ */
+extern void follow_mount(struct vfsmount **, struct dentry **);
+extern struct dentry * __lookup_hash_kern_single(struct qstr *, struct dentry *,
+ struct nameidata *);
+extern struct dentry * real_lookup_single(struct dentry *, struct qstr *,
+ struct nameidata *);
+
+static inline void copy_nd(struct nameidata *old_nd, struct nameidata *new_nd)
+{
+ if (old_nd) {
+ new_nd->last.name = NULL; /* handled in __link_path_walk */
+ new_nd->last.len = 0;
+ new_nd->last.hash = 0;
+ new_nd->flags = old_nd->flags;
+ new_nd->um_flags = 0; /* ditto */
+ new_nd->last_type = -1; /* ditto */
+ new_nd->depth = 0; /* handled in do_follow_link */
+ memcpy(&new_nd->intent, &old_nd->intent, sizeof(new_nd->intent));
+ }
+}
+
+/*
+ * This is called when a dentry's parent is union-mounted and we have
+ * to look up the overlaid dentries. The lookup starts at the first
+ * overlaid dentry of the given dentry's parent. Negative dentries are
+ * ignored and not included in the overlaid list.
+ *
+ * If we reach a dentry with restricted access, we just stop the lookup
+ * because we shouldn't see through that dentry. Same thing for dentry
+ * type mismatch and whiteouts.
+ *
+ * FIXME:
+ * - handle DT_WHT
+ * - handle union stacks in use
+ * - handle union stacks mounted upon union stacks
+ * - avoid unnecessary allocations of union locks
+ */
+static int __lookup_union(struct dentry *topmost, struct qstr *name,
+ struct nameidata *__nd)
+{
+ struct dentry *parent;
+ struct dentry *last;
+ struct dentry *dentry;
+ unsigned int hash = name->hash;
+ struct nameidata nd;
+ int err;
+
+ /* we may also be called via lookup_hash with a NULLed nd argument */
+ copy_nd(__nd, &nd);
+
+ spin_lock(&topmost->d_lock);
+ if (topmost->d_union) {
+ union_lock_spinlock(topmost, &topmost->d_lock);
+ }
+ spin_unlock(&topmost->d_lock);
+
+ parent = topmost->d_parent->d_overlaid;
+ last = topmost;
+
+ while (parent) {
+ /*
+ * the hash could be changed in the last
+ * __lookup_hash_single() so we need to reset it here
+ */
+ name->hash = hash;
+ nd.dentry = __dget(parent);
+ nd.mnt = find_mnt(parent);
+
+ mutex_lock(&parent->d_inode->i_mutex);
+ dentry = __lookup_hash_single(name, parent,
+ __nd ? &nd : NULL);
+ mutex_unlock(&parent->d_inode->i_mutex);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ goto out;
+ }
+
+ if (!dentry->d_inode) {
+ __dput_single(dentry);
+ goto loop;
+ }
+
+ if (!S_ISDIR(dentry->d_inode->i_mode)) {
+ __dput_single(dentry);
+ err = 0;
+ goto out;
+ }
+
+ /* Now we know, we found something real */
+ follow_mount(&nd.mnt, &dentry);
+
+ do {
+ struct vfsmount *mnt = find_mnt(dentry);
+ UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
+ dentry->d_name.name, dentry->d_inode,
+ mnt->mnt_devname);
+ mntput(mnt);
+ } while (0);
+
+ if (last->d_overlaid && (last->d_overlaid != dentry)) {
+ printk(KERN_ERR "%s: strange stack layout " \
+ "(\"%s\" overlays \"%s\")\n",
+ __FUNCTION__, last->d_name.name,
+ dentry->d_name.name);
+ dump_stack();
+ __dput_single(dentry);
+ /* lets try to make a clean ending */
+ last->d_overlaid = NULL;
+ err = -EFAULT; // FIXME: something better?
+ goto out;
+ }
+
+ spin_lock(&topmost->d_lock);
+ if (!topmost->d_union) {
+ UM_DEBUG_LOCK("allocate union for \"%s\"\n",
+ topmost->d_name.name);
+ topmost->d_union = union_alloc();
+ }
+ spin_unlock(&topmost->d_lock);
+
+ spin_lock(&dentry->d_lock);
+ if (!dentry->d_union)
+ dentry->d_union = union_get(topmost->d_union);
+ spin_unlock(&dentry->d_lock);
+
+ if (topmost->d_union != dentry->d_union) {
+ union_lock(dentry);
+ replace_union_info(topmost, dentry->d_union);
+ }
+
+ dentry->d_topmost = topmost;
+ last->d_overlaid = dentry;
+ last = dentry;
+ loop:
+ __dput(nd.dentry);
+ mntput(nd.mnt);
+ parent = parent->d_overlaid;
+ }
+
+ err = 0;
+ union_unlock(topmost);
+ return err;
+out:
+ __dput(nd.dentry);
+ mntput(nd.mnt);
+ union_unlock(topmost);
+ return err;
+}
+
+/*
+ * Union mount version of real_lookup().
+ * Looks through the lower layers of the union and builds a union stack
+ * if necessary (i.e., for directories)
+ * TODO: This routine is very similar to __lookup_hash_kern_union() which
+ * uses __lookup_hash_kern_single() (instead of real_lookup_single()). Check
+ * if some code can be shared here.
+ */
+struct dentry * real_lookup_union(struct dentry *base, struct qstr *name,
+ struct nameidata *__nd)
+{
+ struct dentry *parent;
+ struct dentry *topmost;
+ unsigned int hash = name->hash;
+ struct nameidata nd;
+ int err;
+
+ union_lock(base);
+ topmost = real_lookup_single(base, name, __nd);
+ if (IS_ERR(topmost))
+ goto out;
+
+ /*
+ * If real_lookup returns a valid dentry from the topmost layer,
+ * continue the lookup into the lower layers and build a union
+ * stack in case of directories.
+ */
+ if (topmost->d_inode) {
+ parent = base;
+ goto lookup_union;
+ }
+
+ copy_nd(__nd, &nd);
+
+ /*
+ * real_lookup returned a negative dentry, walk through the lower
+ * layers looking for the given name.
+ */
+ parent = base->d_overlaid;
+ while (parent) {
+ struct dentry * dentry;
+
+ name->hash = hash;
+ nd.dentry = __dget(parent);
+ nd.mnt = find_mnt(parent);
+
+ dentry = real_lookup_single(nd.dentry, name, &nd);
+ __dput(nd.dentry);
+ mntput(nd.mnt);
+ if (IS_ERR(dentry))
+ goto out;
+
+ /*
+ * If a dentry is found in a lower layer, continue to lookup
+ * in the lower layers and build a union stack if needed.
+ */
+ if (dentry->d_inode) {
+ __dput_single(topmost);
+ topmost = dentry;
+ goto lookup_union;
+ }
+ __dput_single(dentry);
+ parent = parent->d_overlaid;
+ }
+
+out:
+ union_unlock(base);
+ return topmost;
+
+lookup_union:
+ /*
+ * If our parent doesn't have a union stack or we are not a directory,
+ * the lookup ends here.
+ */
+ if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
+ goto out;
+
+ do {
+ struct vfsmount *mnt = find_mnt(topmost);
+ UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
+ topmost->d_name.name, topmost->d_inode,
+ mnt->mnt_devname);
+ mntput(mnt);
+ } while (0);
+
+ name->hash = hash;
+ err = __lookup_union(topmost, name, &nd);
+ if (err) {
+ dput(topmost);
+ topmost = ERR_PTR(err);
+ }
+ goto out;
+}
+
+/*
+ * Union mount version of __lookup_hash_kern().
+ * Looks through the lower layers of the union and builds a union stack
+ * if necessary (i.e., for directories)
+ */
+struct dentry * __lookup_hash_kern_union(struct qstr *name,
+ struct dentry *base, struct nameidata *__nd)
+{
+ struct dentry *topmost;
+ struct dentry *parent;
+ unsigned int hash = name->hash;
+ struct nameidata nd;
+ int err;
+
+ union_lock(base);
+ topmost = __lookup_hash_kern_single(name, base, __nd);
+ if (IS_ERR(topmost))
+ goto out;
+
+ if (topmost->d_inode) {
+ parent = base;
+ goto lookup_union;
+ }
+
+ copy_nd(__nd, &nd);
+
+ parent = base->d_overlaid;
+ while (parent) {
+ struct dentry *dentry;
+
+ name->hash = hash;
+ nd.dentry = __dget(parent);
+ nd.mnt = find_mnt(parent);
+
+ mutex_lock(&parent->d_inode->i_mutex);
+ dentry = __lookup_hash_kern_single(name, nd.dentry, &nd);
+ mutex_unlock(&parent->d_inode->i_mutex);
+ __dput(nd.dentry);
+ mntput(nd.mnt);
+ if (IS_ERR(dentry))
+ goto out;
+
+ if (dentry->d_inode) {
+ __dput_single(topmost);
+ topmost = dentry;
+ goto lookup_union;
+ }
+ __dput_single(dentry);
+ parent = parent->d_overlaid;
+ }
+
+out:
+ union_unlock(base);
+ return topmost;
+
+lookup_union:
+ if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
+ goto out;
+
+ do {
+ struct vfsmount *mnt = find_mnt(topmost);
+ UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
+ topmost->d_name.name, topmost->d_inode,
+ mnt->mnt_devname);
+ mntput(mnt);
+ } while (0);
+
+ name->hash = hash;
+ err = __lookup_union(topmost, name, &nd);
+ if (err) {
+ dput(topmost);
+ topmost = ERR_PTR(err);
+ }
+ goto out;
+}
+
+int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)
+{
+ int res = 0;
+
+ while ((*dentry)->d_topmost) {
+ struct dentry *d_tmp = dget((*dentry)->d_topmost);
+ struct vfsmount *m_tmp = find_mnt((*dentry)->d_topmost);
+
+ UM_DEBUG_UID("name=\"%s\", follow union from %s to %s\n",
+ (*dentry)->d_name.name, (*mnt)->mnt_devname,
+ m_tmp->mnt_devname);
+ mntput(*mnt);
+ *mnt = m_tmp;
+ dput(*dentry);
+ *dentry = d_tmp;
+ res = 1;
+ }
+
+ return res;
+}
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -294,9 +294,23 @@ extern void d_move(struct dentry *, stru

/* appendix may either be NULL or be used for transname suffixes */
extern struct dentry * d_lookup(struct dentry *, struct qstr *);
-extern struct dentry * __d_lookup(struct dentry *, struct qstr *);
+extern struct dentry * d_lookup_single(struct dentry *, struct qstr *);
+extern struct dentry * __d_lookup_single(struct dentry *, struct qstr *);
extern struct dentry * d_hash_and_lookup(struct dentry *, struct qstr *);

+#ifdef CONFIG_UNION_MOUNT
+extern struct dentry * __d_lookup_union(struct dentry *, struct qstr *);
+#endif
+
+static inline struct dentry * __d_lookup(struct dentry *parent, struct qstr *name)
+{
+#ifdef CONFIG_UNION_MOUNT
+ return __d_lookup_union(parent, name);
+#else
+ return __d_lookup_single(parent, name);
+#endif
+}
+
/* validate "insecure" dentry pointer */
extern int d_validate(struct dentry *, struct dentry *);

@@ -467,6 +481,7 @@ static inline int d_mountpoint(struct de
extern struct vfsmount *lookup_mnt(struct vfsmount *, struct dentry *);
extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
extern struct dentry *lookup_create(struct nameidata *nd, int is_dir);
+extern struct vfsmount *find_mnt(struct dentry *);

extern int sysctl_vfs_cache_pressure;

--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -20,6 +20,7 @@ struct nameidata {
struct vfsmount *mnt;
struct qstr last;
unsigned int flags;
+ unsigned int um_flags;
int last_type;
unsigned depth;
char *saved_names[MAX_NESTED_LINKS + 1];
@@ -40,6 +41,9 @@ struct path {
*/
enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};

+#define LAST_UNION 0x01
+#define LAST_LOWLEVEL 0x02
+
/*
* The bitmask for a lookup event:
* - follow links at the end
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -17,17 +17,66 @@
#ifdef CONFIG_UNION_MOUNT

#include <linux/fs_struct.h>
+#include <linux/dcache_union.h>

/* namespace stuff used at mount time */
extern void attach_mnt_union(struct vfsmount *, struct nameidata *);
extern void detach_mnt_union(struct vfsmount *, struct path *);

+/* lookup stuff */
+extern int follow_union_mount(struct vfsmount **, struct dentry **);
+extern struct dentry * real_lookup_union(struct dentry *, struct qstr *,
+ struct nameidata *);
+extern struct dentry * __lookup_hash_kern_union(struct qstr *, struct dentry *,
+ struct nameidata *);
+
#else /* CONFIG_UNION_MOUNT */

#define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
#define detach_mnt_union(mnt,nd) do { /* empty */ } while (0)
+#define follow_union_mount(x,y) do { /* empty */ } while (0)

#endif /* CONFIG_UNION_MOUNT */

+extern struct dentry * real_lookup_single(struct dentry *, struct qstr *,
+ struct nameidata *);
+extern struct dentry * __lookup_hash_kern_single(struct qstr *, struct dentry *,
+ struct nameidata *);
+
+static inline struct dentry * real_lookup(struct dentry *parent, struct qstr *name, struct nameidata *nd)
+{
+#ifdef CONFIG_UNION_MOUNT
+ return real_lookup_union(parent, name, nd);
+#else
+ return real_lookup_single(parent, name, nd);
+#endif
+}
+
+static inline struct dentry * __lookup_hash_kern(struct qstr *name, struct dentry *base, struct nameidata *nd)
+{
+#ifdef CONFIG_UNION_MOUNT
+ return __lookup_hash_kern_union(name, base, nd);
+#else
+ return __lookup_hash_kern_single(name, base, nd);
+#endif
+}
+
+static inline struct dentry * __lookup_hash_single(struct qstr *name, struct dentry *base, struct nameidata *nd)
+{
+ struct dentry *dentry;
+ struct inode *inode;
+ int err;
+
+ inode = base->d_inode;
+
+ err = permission(inode, MAY_EXEC, nd);
+ dentry = ERR_PTR(err);
+ if (err)
+ goto out;
+
+ dentry = __lookup_hash_kern_single(name, base, nd);
+out:
+ return dentry;
+}
#endif /* __KERNEL __ */
#endif /* __LINUX_UNION_H */

2007-05-14 09:36:18

by Bharata B Rao

Subject: [RFC][PATCH 9/14] Union-mount readdir

From: Bharata B Rao <[email protected]>
Subject: Union mount readdir

This modifies the readdir()/getdents() routines to read directory
entries from the top-level and lower directories of a union and present
a merged view.

The directory entries are read starting from the top layer and are
recorded in a cache. When entries from the lower layers of the union stack
are read later, they are checked against the cache for duplicates
before being passed to user space. Reading a single directory can take
multiple calls to the readdir/getdents routines, so the union directory
cache is maintained across these calls.
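
As an illustration only (not part of the patch), the duplicate elimination
described above can be sketched in userspace C. The names seen_cache,
seen_find and merged_readdir are hypothetical and the fixed-size array is an
assumption made for brevity; the patch itself keeps a linked list of names
per rdstate and filters entries in filldir_union().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SEEN 64

/* Hypothetical stand-in for the readdir cache: names already emitted. */
struct seen_cache {
	const char *names[MAX_SEEN];
	size_t count;
};

/* Return 1 if @name was already emitted from a higher layer, else 0. */
static int seen_find(const struct seen_cache *c, const char *name)
{
	for (size_t i = 0; i < c->count; i++)
		if (strcmp(c->names[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Merge the listings of @n layers (layers[0] is the topmost) into @out.
 * Each layer is a NULL-terminated array of names. Returns the number of
 * entries written; stops early if @out or the cache is full.
 */
static size_t merged_readdir(const char ***layers, size_t n,
			     const char **out, size_t out_max)
{
	struct seen_cache cache = { .count = 0 };
	size_t written = 0;

	assert(layers != NULL && out != NULL);

	for (size_t l = 0; l < n; l++) {
		for (const char **p = layers[l]; *p; p++) {
			if (seen_find(&cache, *p))
				continue;	/* hidden by a higher layer */
			if (written == out_max || cache.count == MAX_SEEN)
				return written;
			out[written++] = *p;
			cache.names[cache.count++] = *p;
		}
	}
	return written;
}
```

A linear search keeps the sketch short; like the linked-list cache in the
patch, it costs time proportional to the number of names already seen for
every new entry.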

Signed-off-by: Bharata B Rao <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
---
fs/aio.c | 8
fs/file_table.c | 14 -
fs/read_write.c | 7
fs/readdir.c | 2
fs/union.c | 404 +++++++++++++++++++++++++++++++++++++++++++
include/linux/dcache_union.h | 27 ++
include/linux/union.h | 22 ++
7 files changed, 475 insertions(+), 9 deletions(-)

--- a/fs/aio.c
+++ b/fs/aio.c
@@ -21,6 +21,7 @@

#include <linux/sched.h>
#include <linux/fs.h>
+#include <linux/mount.h>
#include <linux/file.h>
#include <linux/mm.h>
#include <linux/mman.h>
@@ -486,6 +487,13 @@ static void aio_fput_routine(struct work
/* Complete the fput */
__fput(req->ki_filp);

+ /*
+ * __fput no longer releases the dentry and vfsmnt, thanks to
+ * union mount. Hence do this manually.
+ */
+ dput(req->ki_filp->f_path.dentry);
+ mntput(req->ki_filp->f_path.mnt);
+
/* Link the iocb into the context's free list */
spin_lock_irq(&ctx->ctx_lock);
really_put_req(ctx, req);
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -141,8 +141,14 @@ EXPORT_SYMBOL(get_empty_filp);

void fastcall fput(struct file *file)
{
- if (atomic_dec_and_test(&file->f_count))
+ struct dentry *dentry = file->f_path.dentry;
+ struct vfsmount *mnt = file->f_path.mnt;
+
+ if (atomic_dec_and_test(&file->f_count)) {
__fput(file);
+ dput(dentry);
+ mntput(mnt);
+ }
}

EXPORT_SYMBOL(fput);
@@ -152,9 +158,7 @@ EXPORT_SYMBOL(fput);
*/
void fastcall __fput(struct file *file)
{
- struct dentry *dentry = file->f_path.dentry;
- struct vfsmount *mnt = file->f_path.mnt;
- struct inode *inode = dentry->d_inode;
+ struct inode *inode = file->f_path.dentry->d_inode;

might_sleep();

@@ -180,8 +184,6 @@ void fastcall __fput(struct file *file)
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
file_free(file);
- dput(dentry);
- mntput(mnt);
}

struct file fastcall *fget(unsigned int fd)
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/syscalls.h>
#include <linux/pagemap.h>
+#include <linux/union.h>
#include "read_write.h"

#include <asm/uaccess.h>
@@ -123,6 +124,12 @@ loff_t vfs_llseek(struct file *file, lof
if (file->f_op && file->f_op->llseek)
fn = file->f_op->llseek;
}
+
+#ifdef CONFIG_UNION_MOUNT
+ if (S_ISDIR(file->f_path.dentry->d_inode->i_mode) &&
+ unlikely(file->f_path.dentry->d_overlaid))
+ return union_dir_llseek(file, offset, origin);
+#endif
return fn(file, offset, origin);
}
EXPORT_SYMBOL(vfs_llseek);
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -33,7 +33,7 @@ int vfs_readdir(struct file *file, filld
mutex_lock(&inode->i_mutex);
res = -ENOENT;
if (!IS_DEADDIR(inode)) {
- res = file->f_op->readdir(file, buf, filler);
+ res = do_readdir(file, buf, filler);
file_accessed(file);
}
mutex_unlock(&inode->i_mutex);
--- a/fs/union.c
+++ b/fs/union.c
@@ -14,6 +14,7 @@
#include <linux/namei.h>
#include <linux/module.h>
#include <linux/mount.h>
+#include <linux/file.h>

struct union_info * union_alloc(void)
{
@@ -26,6 +27,8 @@ struct union_info * union_alloc(void)
mutex_init(&info->u_mutex);
mutex_lock(&info->u_mutex);
atomic_set(&info->u_count, 1);
+ INIT_LIST_HEAD(&info->u_rdcache);
+ info->u_cookie = 0;
UM_DEBUG_LOCK("allocate union %p\n", info);
return info;
}
@@ -40,6 +43,7 @@ struct union_info * union_get(struct uni
return info;
}

+static void release_rdstates(struct union_info *info);
void union_put(struct union_info *info)
{
BUG_ON(!info);
@@ -49,6 +53,7 @@ void union_put(struct union_info *info)

if (!atomic_read(&info->u_count)) {
UM_DEBUG_LOCK("free union %p\n", info);
+ release_rdstates(info);
kfree(info);
}

@@ -968,3 +973,402 @@ int follow_union_mount(struct vfsmount *

return res;
}
+
+/*
+ * Union mounts support for readdir.
+ */
+
+/* This is a copy from fs/readdir.c */
+struct getdents_callback {
+ struct linux_dirent __user *current_dir;
+ struct linux_dirent __user *previous;
+ int count;
+ int error;
+};
+
+/*
+ * The readdir union cache object
+ */
+struct union_cache_entry {
+ struct list_head list;
+ struct qstr name;
+};
+
+struct union_cache_callback {
+ struct getdents_callback *buf; /* original getdents_callback */
+ filldir_t filldir; /* the filldir() we should call */
+ int error; /* stores filldir error */
+ struct rdstate *rdstate; /* readdir state */
+};
+
+static int union_cache_add_entry(struct list_head *list,
+ const char *name, int namelen)
+{
+ struct union_cache_entry *this;
+ char *tmp_name;
+
+ this = kmalloc(sizeof(*this), GFP_KERNEL);
+ if (!this) {
+ printk(KERN_CRIT
+ "union_cache_add_entry(): out of kernel memory\n");
+ return -ENOMEM;
+ }
+
+ tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
+ if (!tmp_name) {
+ printk(KERN_CRIT
+ "union_cache_add_entry(): out of kernel memory\n");
+ kfree(this);
+ return -ENOMEM;
+ }
+
+ this->name.name = tmp_name;
+ this->name.len = namelen;
+ this->name.hash = 0;
+ memcpy(tmp_name, name, namelen);
+ tmp_name[namelen] = 0;
+ INIT_LIST_HEAD(&this->list);
+ list_add(&this->list, list);
+ return 0;
+}
+
+static void union_cache_free(struct list_head *list)
+{
+ struct list_head *p;
+ struct list_head *ptmp;
+ int count = 0;
+
+ list_for_each_safe(p, ptmp, list) {
+ struct union_cache_entry *this;
+
+ this = list_entry(p, struct union_cache_entry, list);
+ list_del_init(&this->list);
+ kfree(this->name.name);
+ kfree(this);
+ count++;
+ }
+ UM_DEBUG_READDIR("freed %d entries\n", count);
+ return;
+}
+
+static int union_cache_find_entry(struct list_head *uc_list,
+ const char *name, int namelen)
+{
+ struct union_cache_entry *p;
+ int ret = 0;
+
+ list_for_each_entry(p, uc_list, list) {
+ if (p->name.len != namelen)
+ continue;
+ if (strncmp(p->name.name, name, namelen) == 0) {
+ ret = 1;
+ break;
+ }
+ }
+ return ret;
+}
+
+static void fastcall fput_union(struct file *file)
+{
+ struct dentry *dentry = file->f_dentry;
+ struct vfsmount *mnt = file->f_vfsmnt;
+
+ if (atomic_dec_and_test(&file->f_count)) {
+ __fput(file);
+ __dput(dentry);
+ mntput(mnt);
+ }
+}
+
+/*
+ * This is the same as __dentry_open(). But since this is called with the
+ * union lock held, the corresponding release involves doing fput_union()
+ * rather than the usual fput().
+ */
+static struct file * __dentry_open_read(struct dentry *dentry,
+ struct vfsmount *mnt, int flags)
+{
+ struct file *f;
+ struct inode *inode;
+ int error;
+
+ error = -ENFILE;
+ f = get_empty_filp();
+ if (!f)
+ goto out;
+ f->f_flags = flags;
+ f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK |
+ FMODE_PREAD | FMODE_PWRITE;
+ inode = dentry->d_inode;
+ BUG_ON(f->f_mode & FMODE_WRITE);
+ f->f_mapping = inode->i_mapping;
+ f->f_dentry = dentry;
+ f->f_vfsmnt = mnt;
+ f->f_pos = 0;
+ f->f_op = fops_get(inode->i_fop);
+ file_move(f, &inode->i_sb->s_files);
+
+ if (f->f_op && f->f_op->open) {
+ error = f->f_op->open(inode,f);
+ if (error)
+ goto cleanup;
+ }
+ f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
+
+ file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
+
+ /* NB: we're sure to have correct a_ops only after f_op->open */
+ if (f->f_flags & O_DIRECT) {
+ if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO) {
+ fput_union(f);
+ f = ERR_PTR(-EINVAL);
+ }
+ }
+
+ return f;
+
+cleanup:
+ fops_put(f->f_op);
+ file_kill(f);
+ f->f_path.dentry = NULL;
+ f->f_path.mnt = NULL;
+ put_filp(f);
+out:
+ return ERR_PTR(error);
+}
+
+/*
+ * filldir routine for union mounted directories.
+ * Handles duplicate elimination by building a readdir cache.
+ */
+static int filldir_union(void *buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct union_cache_callback *cb = buf;
+ int err;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ if (union_cache_find_entry(&cb->rdstate->dirent_cache, name, namlen))
+ return 0;
+
+ err = cb->filldir(cb->buf, name, namlen, offset, ino, d_type);
+ if (err >= 0)
+ union_cache_add_entry(&cb->rdstate->dirent_cache, name, namlen);
+ cb->error = err;
+ return err;
+}
+
+/* rdstate - readdir state */
+#define DIREOF 0xffffff
+#define RDOFFBITS 20
+#define MAXRDCOOKIE 0xfff
+
+static inline off_t rdstate2offset(struct rdstate *r)
+{
+ return ((r->cookie & MAXRDCOOKIE) << RDOFFBITS) | (r->off & DIREOF);
+}
+
+static void put_rdstate(struct rdstate *rdstate)
+{
+ union_cache_free(&rdstate->dirent_cache);
+ list_del(&rdstate->list);
+ kfree(rdstate);
+}
+
+static void release_rdstates(struct union_info *info)
+{
+ struct list_head *pos, *tmp;
+
+ list_for_each_safe(pos, tmp, &info->u_rdcache) {
+ struct rdstate *r = list_entry(pos, struct rdstate, list);
+ put_rdstate(r);
+ }
+}
+
+static struct rdstate *get_rdstate(struct file *file)
+{
+ struct dentry *dentry = file->f_path.dentry;
+ struct list_head *pos;
+ struct rdstate *r;
+
+ /*
+ * Do we already have a rdstate for this file at this
+ * corresponding offset ?
+ */
+ list_for_each(pos, &dentry->d_union->u_rdcache) {
+ r = list_entry(pos, struct rdstate, list);
+ if (file->f_pos == rdstate2offset(r))
+ return r;
+ }
+
+ /*
+ * We have read the dirents from this earlier but now
+ * don't have a corresponding rdstate. AFAICS this can't
+ * happen.
+ */
+ if (file->f_pos)
+ return ERR_PTR(-ESTALE);
+
+ /*
+ * Create a new instance of rdstate for this request.
+ */
+ r = kmalloc(sizeof(struct rdstate), GFP_KERNEL);
+ if (!r)
+ return ERR_PTR(-ENOMEM);
+ r->dentry = dentry;
+ r->off = 0;
+ if (dentry->d_union->u_cookie >= (MAXRDCOOKIE - 1))
+ dentry->d_union->u_cookie = 1;
+ else
+ dentry->d_union->u_cookie++;
+ r->cookie = dentry->d_union->u_cookie;
+ INIT_LIST_HEAD(&r->dirent_cache);
+ list_add(&r->list, &dentry->d_union->u_rdcache);
+ return r;
+}
+
+/**
+ * readdir_union - A wrapper around ->readdir()
+ *
+ * This is a wrapper around the filesystem's readdir(). It walks
+ * the union stack and calls ->readdir() for every directory in the stack.
+ * The directory entries are recorded in the union readdir cache to
+ * support whiteouts and duplicate removal.
+ */
+int readdir_union(struct file *file, void *buf, filldir_t filldir)
+{
+ struct dentry *topmost = file->f_path.dentry;
+ struct rdstate *rdstate;
+ struct dentry *dentry;
+ loff_t offset = 0;
+ struct union_cache_callback cb;
+ int err = 0;
+
+ BUG_ON(!topmost->d_union);
+
+ if (file->f_pos == DIREOF)
+ goto out;
+
+ rdstate = get_rdstate(file);
+ if (IS_ERR(rdstate)) {
+ err = PTR_ERR(rdstate);
+ return err;
+ }
+
+ cb.buf = buf;
+ cb.filldir = filldir;
+ cb.rdstate = rdstate;
+
+ dentry = rdstate->dentry;
+ offset = rdstate->off;
+
+ /* Read from the topmost directory */
+ if (dentry == topmost) {
+ file->f_pos = offset;
+ err = file->f_op->readdir(file, &cb, filldir_union);
+ rdstate->off = file->f_pos;
+ file->f_pos = rdstate2offset(rdstate);
+ if (err < 0 || cb.error < 0)
+ goto out;
+
+ /*
+ * Reading from topmost dir complete, start reading the lower
+ * dir from the beginning.
+ */
+ offset = 0;
+ } else
+ goto read_lower;
+
+ dentry = dentry->d_overlaid;
+ BUG_ON(dentry->d_topmost != topmost);
+
+read_lower:
+ /* Read from the underlying directories */
+ while (dentry) {
+ struct vfsmount *mnt;
+ struct file *ftmp;
+
+ mnt = find_mnt(dentry);
+ __dget(dentry);
+ ftmp = __dentry_open_read(dentry, mnt, file->f_flags);
+ if (IS_ERR(ftmp)) {
+ __dput(dentry);
+ mntput(mnt);
+ err = PTR_ERR(ftmp);
+ goto out;
+ }
+
+ mutex_lock(&dentry->d_inode->i_mutex);
+ ftmp->f_pos = offset;
+
+ err = ftmp->f_op->readdir(ftmp, &cb, filldir_union);
+ file_accessed(ftmp);
+ mutex_unlock(&dentry->d_inode->i_mutex);
+ rdstate->off = ftmp->f_pos;
+ rdstate->dentry = dentry;
+ file->f_pos = rdstate2offset(rdstate);
+ fput_union(ftmp);
+ if (err < 0 || cb.error < 0)
+ goto out;
+
+ /*
+ * Reading from a lower dir complete, start reading the
+ * next lower dir from the beginning.
+ */
+ offset = 0;
+ dentry = dentry->d_overlaid;
+ }
+
+ /*
+ * We reached the end of lowermost directory of the union,
+ * we can now release this rdstate.
+ */
+ put_rdstate(rdstate);
+ file->f_pos = DIREOF;
+ return 0;
+out:
+ return err;
+}
+
+/*
+ * lseek operations on a union directory are restricted. We only allow
+ * seeking to the beginning of the file and to the current position.
+ *
+ * Since union mount gives a new meaning to file->f_pos (only for
+ * union mounted directories), it doesn't make much sense to allow all
+ * seek operations after all.
+ */
+loff_t union_dir_llseek(struct file *file, loff_t offset, int origin)
+{
+ struct dentry *topmost;
+ struct rdstate *rdstate;
+
+ if (offset)
+ return -EINVAL;
+
+ switch (origin) {
+ case SEEK_SET:
+ topmost = file->f_path.dentry;
+ union_lock(topmost);
+ BUG_ON(!topmost->d_union);
+ /* Flush the readdir cache if one exists */
+ rdstate = get_rdstate(file);
+ put_rdstate(rdstate);
+ union_unlock(topmost);
+ return 0;
+ case SEEK_CUR:
+ return file->f_pos;
+ case SEEK_END:
+ return -EINVAL;
+ }
+ return -EINVAL;
+}
--- a/include/linux/dcache_union.h
+++ b/include/linux/dcache_union.h
@@ -21,6 +21,27 @@

#ifdef CONFIG_UNION_MOUNT

+struct rdstate {
+ /* readdir read the entries last time from this directory */
+ struct dentry *dentry;
+
+ /* and stopped reading at this offset */
+ loff_t off;
+
+ /* different rdstates are linked thro' this */
+ struct list_head list;
+
+ /* cache of directory entries. As of now, cache is just a linked list */
+ struct list_head dirent_cache;
+
+ /*
+ * there can be multiple readdir()/getdents() callers reading a
+ * directory at the same time. This cookie identifies which caller's
+ * state this rdstate holds.
+ */
+ unsigned int cookie;
+};
+
/*
* This is the union info object, that describes general information about this
* union directory
@@ -30,8 +51,10 @@
* or modifing the union stack !
*/
struct union_info {
- atomic_t u_count;
- struct mutex u_mutex;
+ atomic_t u_count; /* number of users of this union */
+ struct mutex u_mutex; /* mutex guarding this union stack */
+ unsigned int u_cookie; /* new rdstates get cookies from here */
+ struct list_head u_rdcache; /* list of rdstates for this union */
};

/* allocate/de-allocate */
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -30,6 +30,11 @@ extern struct dentry * real_lookup_union
extern struct dentry * __lookup_hash_kern_union(struct qstr *, struct dentry *,
struct nameidata *);

+/* readdir */
+extern int readdir_union(struct file *, void *, filldir_t);
+
+extern loff_t union_dir_llseek(struct file *, loff_t, int);
+
#else /* CONFIG_UNION_MOUNT */

#define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
@@ -78,5 +83,22 @@ static inline struct dentry * __lookup_h
out:
return dentry;
}
+
+static inline int do_readdir(struct file *file, void *buf, filldir_t filler)
+{
+ int res;
+
+#ifdef CONFIG_UNION_MOUNT
+ if (file->f_path.dentry->d_overlaid) {
+ union_lock(file->f_path.dentry);
+ res = readdir_union(file, buf, filler);
+ union_unlock(file->f_path.dentry);
+ } else
+#endif
+ res = file->f_op->readdir(file, buf, filler);
+
+ return res;
+}
+
#endif /* __KERNEL __ */
#endif /* __LINUX_UNION_H */

2007-05-14 09:36:45

by Bharata B Rao

Subject: [RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems

From: Jan Blunck <[email protected]>
Subject: In-kernel file copy between union mounted filesystems

This patch introduces in-kernel file copy between union mounted
filesystems. When a file is opened for writing but resides on a lower (and
thus read-only) layer of the union stack, it is first copied to the topmost
union layer.

This patch uses do_splice_direct() to do the in-kernel file copy.
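
As an illustration only (not part of the patch), the copy-up rule above can
be modelled with a toy in-memory union. Everything here (toy_file, toy_layer,
layer_find, open_for_write and the fixed sizes) is hypothetical; the real
code works on dentries and copies the data with do_splice_direct().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FILES 8
#define NAME_LEN  32
#define DATA_LEN  64

/* Toy file: a name and a small data buffer. */
struct toy_file {
	char name[NAME_LEN];
	char data[DATA_LEN];
};

/* Toy layer: a flat set of files. Only the topmost layer is writable. */
struct toy_layer {
	struct toy_file files[MAX_FILES];
	size_t nfiles;
};

/* Find @name in layer @l, or return NULL. */
static struct toy_file *layer_find(struct toy_layer *l, const char *name)
{
	for (size_t i = 0; i < l->nfiles; i++)
		if (strcmp(l->files[i].name, name) == 0)
			return &l->files[i];
	return NULL;
}

/*
 * Open @name for writing in a union of @n layers (layers[0] is topmost).
 * If the file exists only in a lower layer, copy it to the topmost layer
 * first. Returns the file in the topmost layer, or NULL if the name does
 * not exist anywhere or the topmost layer is full.
 */
static struct toy_file *open_for_write(struct toy_layer **layers, size_t n,
				       const char *name)
{
	struct toy_layer *top;
	struct toy_file *f;

	assert(layers != NULL && n > 0);
	top = layers[0];

	f = layer_find(top, name);
	if (f)
		return f;		/* already on the writable layer */

	for (size_t l = 1; l < n; l++) {
		struct toy_file *lower = layer_find(layers[l], name);

		if (!lower)
			continue;
		if (top->nfiles == MAX_FILES)
			return NULL;
		f = &top->files[top->nfiles++];
		*f = *lower;		/* copy-up: name and data */
		return f;
	}
	return NULL;
}
```

Writes through the returned file change only the topmost copy, so the lower
layer stays untouched.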

Signed-off-by: Bharata B Rao <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
---
fs/namei.c | 46 +++++
fs/union.c | 415 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/namei.h | 2
include/linux/union.h | 14 +
4 files changed, 476 insertions(+), 1 deletion(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -830,8 +830,17 @@ done:
path->mnt = mnt;
path->dentry = dentry;

- if (nd->dentry->d_sb != dentry->d_sb)
+ /*
+ * This should be checked after the following of unions.
+ * Otherwise we might run into trouble creating directories
+ * on mountpoints. :(
+ * But maybe we shouldn't set the LAST_LOWLEVEL flag here
+ * at all ... */
+ if (nd->dentry->d_sb != dentry->d_sb) {
path->mnt = find_mnt(dentry);
+ UM_DEBUG_UID("Setting LAST_LOWLEVEL for %s\n", name->name);
+ nd->um_flags |= LAST_LOWLEVEL;
+ }

__follow_mount(path);
follow_union_mount(&path->mnt, &path->dentry);
@@ -950,6 +959,14 @@ static fastcall int __link_path_walk(con
if (err)
break;

+ if ((nd->flags & LOOKUP_TOPMOST) &&
+ (nd->um_flags & LAST_LOWLEVEL)) {
+ err = union_create_topdir(nd,&next.dentry,&next.mnt);
+ if (err)
+ goto out_dput;
+ nd->um_flags &= ~LAST_LOWLEVEL;
+ }
+
err = -ENOENT;
inode = next.dentry->d_inode;
if (!inode)
@@ -1005,6 +1022,15 @@ last_component:
err = do_lookup(nd, &this, &next);
if (err)
break;
+
+ if ((nd->flags & LOOKUP_TOPMOST) &&
+ (nd->um_flags & LAST_LOWLEVEL)) {
+ err = union_create_topdir(nd,&next.dentry,&next.mnt);
+ if (err)
+ goto out_dput;
+ nd->um_flags &= ~LAST_LOWLEVEL;
+ }
+
inode = next.dentry->d_inode;
if ((lookup_flags & LOOKUP_FOLLOW)
&& inode && inode->i_op && inode->i_op->follow_link) {
@@ -1177,6 +1203,7 @@ static int fastcall do_path_lookup(int d

nd->last_type = LAST_ROOT; /* if there are only slashes... */
nd->flags = flags;
+ nd->um_flags = 0;
nd->depth = 0;

if (*name=='/') {
@@ -1756,9 +1783,18 @@ int open_namei(int dfd, const char *path
nd, flag);
if (error)
return error;
+ /* test for WRONLY and RDWR - flag's special lower bits */
+ if (flag & 0x2) {
+ UM_DEBUG_UID("\"%s\" opened for writing\n", pathname);
+ error = union_copyup(nd, flag);
+ if (error)
+ return error;
+ }
goto ok;
}

+ UM_DEBUG_UID("open called with O_CREAT\n");
+
/*
* Create - we need to know the parent.
*/
@@ -1775,6 +1811,8 @@ int open_namei(int dfd, const char *path
if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
goto exit;

+ UM_DEBUG_UID("do_last now\n");
+
dir = nd->dentry;
nd->flags &= ~LOOKUP_PARENT;
mutex_lock(&dir->d_inode->i_mutex);
@@ -1828,6 +1866,12 @@ do_last:
error = -EISDIR;
if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
goto exit;
+
+ if (flag & 0x2) {
+ error = union_copyup(nd, flag);
+ if (error)
+ goto exit;
+ }
ok:
error = may_open(nd, acc_mode, flag);
if (error)
--- a/fs/union.c
+++ b/fs/union.c
@@ -15,6 +15,11 @@
#include <linux/module.h>
#include <linux/mount.h>
#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/quotaops.h>
+#include <linux/dnotify.h>
+#include <linux/security.h>
+#include <linux/pipe_fs_i.h>

struct union_info * union_alloc(void)
{
@@ -305,6 +310,53 @@ void __dput_union(struct dentry *dentry)
return;
}

+/*
+ * union_relookup_topmost - lookup and create the topmost path to dentry
+ * @nd: pointer to nameidata
+ * @flags: lookup flags
+ */
+int union_relookup_topmost(struct nameidata *nd, int flags)
+{
+ int err;
+ char *kbuf, *name;
+ struct nameidata this;
+
+ UM_DEBUG_UID("relookup the topmost dir for %s\n",
+ nd->dentry->d_name.name);
+
+ kbuf = (char *)__get_free_page(GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ name = d_path(nd->dentry, nd->mnt, kbuf, PAGE_SIZE);
+ err = PTR_ERR(name);
+ if (IS_ERR(name))
+ goto free_page;
+
+ err = path_lookup(name, flags|LOOKUP_CREATE|LOOKUP_TOPMOST, &this);
+ if (err)
+ goto free_page;
+
+ path_release(nd);
+ nd->dentry = this.dentry;
+ nd->mnt = this.mnt;
+
+ /* If we are looking up the parent, copy the child details also */
+ if (flags & LOOKUP_PARENT) {
+ nd->last = this.last;
+ nd->last_type = this.last_type;
+ }
+
+ /*
+ * the nd->flags should be unchanged
+ */
+ BUG_ON(this.um_flags & LAST_LOWLEVEL);
+ nd->um_flags &= ~LAST_LOWLEVEL;
+ free_page:
+ free_page((unsigned long)kbuf);
+ return err;
+}
+
void attach_mnt_union(struct vfsmount *mnt, struct nameidata *nd)
{
struct dentry *tmp;
@@ -975,6 +1027,37 @@ int follow_union_mount(struct vfsmount *
}

/*
+ * @stack is an already existing union stack, @new is a dentry or a union stack
+ * which is overlaid by @stack. So the topmost dentry is in @stack.
+ */
+static void append_to_stack(struct dentry *stack, struct dentry *new)
+{
+ struct dentry *topmost;
+ struct dentry *prev = stack;
+ struct dentry *next = new;
+
+ BUG_ON(!stack);
+ BUG_ON(!new);
+
+ while (prev->d_overlaid)
+ prev = prev->d_overlaid;
+
+ if (prev->d_topmost)
+ topmost = prev->d_topmost;
+ else
+ topmost = stack;
+
+ while (next) {
+ next->d_topmost = topmost;
+ prev->d_overlaid = next;
+ prev = next;
+ next = next->d_overlaid;
+ }
+
+ return;
+}
+
+/*
* Union mounts support for readdir.
*/

@@ -1372,3 +1455,335 @@ loff_t union_dir_llseek(struct file *fil
}
return -EINVAL;
}
+
+/*
+ * Union mount copyup support
+ */
+
+/*
+ * Just do what vfs_create() would do, but the union mount way
+ */
+static struct dentry * union_create(struct dentry *parent, struct dentry *old,
+ struct nameidata *nd)
+{
+ struct dentry *dentry;
+ int err, mode;
+
+ dentry = __lookup_hash_single(&old->d_name, parent, NULL);
+ if (IS_ERR(dentry))
+ goto exit;
+
+ err = -EEXIST;
+ if (dentry->d_inode) {
+ dput(dentry);
+ goto error;
+ }
+
+ err = -ENOENT;
+ if (IS_DEADDIR(parent->d_inode))
+ goto error;
+ err = -EACCES; /* shouldn't it be ENOSYS? */
+ if (!parent->d_inode->i_op || !parent->d_inode->i_op->create)
+ goto error;
+
+ mode = old->d_inode->i_mode & S_IALLUGO;
+ mode |= S_IFREG;
+
+ err = security_inode_create(parent->d_inode, dentry, mode);
+ if (err)
+ goto error;
+
+ DQUOT_INIT(parent->d_inode);
+ err = parent->d_inode->i_op->create(parent->d_inode, dentry, mode, nd);
+ if (err)
+ goto error;
+
+ dentry->d_inode->i_uid = old->d_inode->i_uid;
+ dentry->d_inode->i_gid = old->d_inode->i_gid;
+ mark_inode_dirty(dentry->d_inode);
+exit:
+ return dentry;
+error:
+ return ERR_PTR(err);
+}
+
+/*
+ * Just do what vfs_mkdir() would do, but the union mount way
+ */
+static struct dentry * union_mkdir(struct dentry *parent, struct dentry *dir)
+{
+ struct dentry *dentry;
+ int err, mode;
+
+ dentry = __lookup_hash_single(&dir->d_name, parent, NULL);
+ if (IS_ERR(dentry))
+ goto exit;
+
+ err = -EEXIST;
+ if (dentry->d_inode) {
+ dput(dentry);
+ goto error;
+ }
+
+ err = -ENOENT;
+ if (IS_DEADDIR(parent->d_inode))
+ goto error;
+ err = -EPERM;
+ if (!parent->d_inode->i_op || !parent->d_inode->i_op->mkdir)
+ goto error;
+
+ mode = dir->d_inode->i_mode & (S_IRWXUGO|S_ISVTX);
+
+ err = security_inode_mkdir(parent->d_inode, dentry, mode);
+ if (err)
+ goto error;
+
+ DQUOT_INIT(parent->d_inode);
+ err = parent->d_inode->i_op->mkdir(parent->d_inode, dentry, mode);
+ if (err)
+ goto error;
+
+ dentry->d_inode->i_uid = dir->d_inode->i_uid;
+ dentry->d_inode->i_gid = dir->d_inode->i_gid;
+ mark_inode_dirty(dentry->d_inode);
+exit:
+ return dentry;
+error:
+ return ERR_PTR(err);
+}
+
+static void __update_fs_pwd(struct dentry *old, struct dentry *new)
+{
+ struct dentry *old_pwd = NULL;
+ struct vfsmount *old_pwdmnt = NULL;
+ struct vfsmount *new_pwdmnt = find_mnt(new);
+
+ write_lock(&current->fs->lock);
+ if (current->fs->pwd == old) {
+ old_pwd = current->fs->pwd;
+ old_pwdmnt = current->fs->pwdmnt;
+ current->fs->pwdmnt = mntget(new_pwdmnt);
+ current->fs->pwd = __dget(new);
+ UM_DEBUG_UID("replacing fs->pwd\n");
+ UM_DEBUG_UID("oldpwd: name=\"%s\", inode=%p, devname=%s\n",
+ old_pwd->d_name.name, old_pwd->d_inode,
+ old_pwdmnt->mnt_devname);
+ UM_DEBUG_UID("newpwd: name=\"%s\", inode=%p, devname=%s\n",
+ new->d_name.name, new->d_inode,
+ new_pwdmnt->mnt_devname);
+ }
+ write_unlock(&current->fs->lock);
+
+ if (old_pwd) {
+ __dput(old_pwd);
+ mntput(old_pwdmnt);
+ }
+
+ mntput(new_pwdmnt);
+
+ return;
+}
+
+struct dentry * union_create_topmost(struct nameidata *nd, struct dentry *old)
+{
+ struct dentry *dentry;
+ struct dentry *parent = nd->dentry;
+
+ UM_DEBUG_UID("dentry=%s\n", old->d_name.name);
+
+ BUG_ON(parent->d_sb == old->d_sb);
+ if (!S_ISREG(old->d_inode->i_mode)) {
+ UM_DEBUG("This filetype isn't supported!\n");
+ dentry = ERR_PTR(-EINVAL);
+ goto exit;
+ }
+
+ /*
+ * Create the topmost regular file here.
+ */
+ mutex_lock(&parent->d_inode->i_mutex);
+ dentry = union_create(parent, old, nd);
+ mutex_unlock(&parent->d_inode->i_mutex);
+ if (IS_ERR(dentry)) {
+ UM_DEBUG("some error occurred\n");
+ goto exit;
+ }
+
+exit:
+ return dentry;
+}
+
+int union_create_topdir(struct nameidata *nd,
+ struct dentry **dentry, struct vfsmount **mnt)
+{
+ struct dentry *topdir;
+ struct dentry *parent = nd->dentry;
+
+ UM_DEBUG_UID("dentry=%s\n", (*dentry)->d_name.name);
+
+ if (parent->d_sb == (*dentry)->d_sb)
+ return 0;
+
+ if (!S_ISDIR((*dentry)->d_inode->i_mode)) {
+ UM_DEBUG("Unsupported filetype!\n");
+ BUG();
+ }
+
+ /*
+ * Create the topmost directory here.
+ */
+ spin_lock(&(*dentry)->d_lock);
+ if (!(*dentry)->d_union) {
+ UM_DEBUG_LOCK("Allocate lock for \"%s\"\n",
+ (*dentry)->d_name.name);
+ (*dentry)->d_union = union_alloc();
+ spin_unlock(&(*dentry)->d_lock);
+ } else {
+ spin_unlock(&(*dentry)->d_lock);
+ union_lock(*dentry);
+ }
+ mutex_lock(&parent->d_inode->i_mutex);
+ topdir = union_mkdir(parent, *dentry);
+ if (IS_ERR(topdir)) {
+ UM_DEBUG("some error occurred\n");
+ mutex_unlock(&parent->d_inode->i_mutex);
+ union_unlock(*dentry);
+ return PTR_ERR(topdir);
+ }
+
+ spin_lock(&topdir->d_lock);
+ if (topdir->d_union) {
+ UM_DEBUG("Aaargh! topdir \"%s\" already has a lock?!\n",
+ topdir->d_name.name);
+ dump_stack();
+ }
+ topdir->d_union = union_get((*dentry)->d_union);
+ spin_unlock(&topdir->d_lock);
+ append_to_stack(topdir, *dentry);
+ __update_fs_pwd(*dentry, topdir);
+ *dentry = topdir;
+ mutex_unlock(&parent->d_inode->i_mutex);
+ union_unlock(*dentry);
+
+ if (nd->mnt != *mnt) {
+ mntput(*mnt);
+ *mnt = mntget(nd->mnt);
+ }
+
+ return 0;
+}
+
+int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
+ struct dentry *new_dentry, struct vfsmount *new_mnt)
+{
+ int ret;
+ size_t size;
+ loff_t offset;
+ struct file *old_file, *new_file;
+
+ dget(old_dentry);
+ mntget(old_mnt);
+ old_file = dentry_open(old_dentry, old_mnt, O_RDONLY);
+ if (IS_ERR(old_file))
+ return PTR_ERR(old_file);
+
+ dget(new_dentry);
+ mntget(new_mnt);
+ new_file = dentry_open(new_dentry, new_mnt, O_WRONLY);
+ ret = PTR_ERR(new_file);
+ if (IS_ERR(new_file))
+ goto fput_old;
+
+ size = i_size_read(old_file->f_path.dentry->d_inode);
+ if (((size_t)size != size) || ((ssize_t)size != size)) {
+ ret = -EFBIG;
+ goto fput_new;
+ }
+
+ offset = 0;
+ ret = do_splice_direct(old_file, &offset, new_file, size,
+ SPLICE_F_MOVE);
+ if (ret >= 0)
+ ret = 0;
+ fput_new:
+ fput(new_file);
+ fput_old:
+ fput(old_file);
+ return ret;
+}
+
+/**
+ * union_copyup - copy a file to the topmost layer of the union stack
+ * @nd: nameidata pointer to the file
+ * @flags: flags given to open_namei
+ */
+int union_copyup(struct nameidata *nd, int flags)
+{
+ struct dentry *dir;
+ struct dentry *dentry;
+ int err;
+
+ if (!union_is_member(nd->dentry, nd->mnt))
+ return 0;
+ if (!S_ISREG(nd->dentry->d_inode->i_mode))
+ return 0;
+
+ err = union_relookup_topmost(nd, nd->flags|LOOKUP_PARENT);
+ if (err)
+ return err;
+
+ dir = nd->dentry;
+ nd->flags &= ~LOOKUP_PARENT;
+ union_lock(nd->dentry);
+ mutex_lock(&dir->d_inode->i_mutex);
+ dentry = __lookup_hash_kern_union(&nd->last, nd->dentry, nd);
+ err = PTR_ERR(dentry);
+ if (IS_ERR(dentry)) {
+ mutex_unlock(&dir->d_inode->i_mutex);
+ union_unlock(nd->dentry);
+ return err;
+ }
+
+ mutex_unlock(&dir->d_inode->i_mutex);
+ union_unlock(nd->dentry);
+
+ err = -ENOENT;
+ if (!dentry->d_inode)
+ goto exit_dput;
+
+ follow_mount(&nd->mnt, &dentry);
+
+ err = -ENOENT;
+ if (!dentry->d_inode)
+ goto exit_dput;
+
+ if (dentry->d_parent != dir) {
+ struct dentry *tmp;
+ struct vfsmount *old_mnt;
+
+ UM_DEBUG_UID("already exists -> copy file\n");
+ tmp = union_create_topmost(nd, dentry);
+ if (IS_ERR(tmp))
+ goto exit_dput;
+
+ old_mnt = find_mnt(dentry);
+ err = union_copy_file(dentry, old_mnt, tmp, nd->mnt);
+ if (err) {
+ int ret = vfs_unlink(tmp->d_inode, tmp);
+ BUG_ON(ret);
+ /* FIXME: not sure if there are return values
+ * we should not BUG() on */
+ }
+ dput(dentry);
+ dentry = tmp;
+ mntput(old_mnt);
+ }
+
+ dput(nd->dentry);
+ nd->dentry = dentry;
+ return 0;
+
+exit_dput:
+ dput(dentry);
+ return err;
+}
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -59,6 +59,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA
#define LOOKUP_PARENT 16
#define LOOKUP_NOALT 32
#define LOOKUP_REVAL 64
+#define LOOKUP_TOPMOST 128
+#define LOOKUP_WHT 256
/*
* Intent data
*/
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -35,11 +35,25 @@ extern int readdir_union(struct file *,

extern loff_t union_dir_llseek(struct file *, loff_t, int);

+/* copy-up support */
+extern struct dentry * union_create_topmost(struct nameidata *, struct dentry *);
+extern int union_create_topdir(struct nameidata *, struct dentry **, struct vfsmount **);
+extern int union_is_member(struct dentry *, struct vfsmount *);
+extern int union_copy_file(struct dentry *, struct vfsmount *, struct dentry *, struct vfsmount *);
+extern int union_copyup(struct nameidata *, int);
+extern int union_relookup_topmost(struct nameidata *, int);
+
#else /* CONFIG_UNION_MOUNT */

#define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
#define detach_mnt_union(mnt,nd) do { /* empty */ } while (0)
#define follow_union_mount(x,y) do { /* empty */ } while (0)
+#define union_create_topmost(x,y) ({ BUG(); ERR_PTR(-EINVAL); })
+#define union_create_topdir(x,y,z) ({ (0); })
+#define union_is_member(x,y) ({ (0); })
+#define union_copy_file(dentry1,mnt1,dentry2,mnt2) ({ (0); })
+#define union_copyup(x,y) ({ (0); })
+#define union_relookup_topmost(x,y) ({ (0); })

#endif /* CONFIG_UNION_MOUNT */

2007-05-14 09:37:07

by Bharata B Rao

Subject: [RFC][PATCH 11/14] VFS whiteout handling

From: Jan Blunck <[email protected]>
Subject: VFS whiteout handling

Introduce white-out handling in the VFS.
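
For readers new to whiteouts, the masking rule can be sketched in userspace. This is only a model under stated assumptions, not the kernel code from this patch: `struct layer`, `struct entry`, `layer_find()` and `union_lookup()` are hypothetical names, and a union stack is reduced to an array of layers with index 0 as the topmost.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry kinds for this model; not the kernel's types. */
enum entry_kind { ENT_FILE, ENT_DIR, ENT_WHITEOUT };

struct entry {
	const char *name;
	enum entry_kind kind;
};

/* One layer of the union stack; layers[0] is the topmost. */
struct layer {
	const struct entry *entries;
	size_t count;
};

/* Find a name inside a single layer, or return NULL. */
static const struct entry *layer_find(const struct layer *l, const char *name)
{
	assert(l != NULL && name != NULL);
	for (size_t i = 0; i < l->count; i++)
		if (strcmp(l->entries[i].name, name) == 0)
			return &l->entries[i];
	return NULL;
}

/*
 * Walk the stack from top to bottom. The first layer that knows the
 * name decides the result: a whiteout means "does not exist" (the
 * patch reports -ENOENT in that case), anything else is the visible
 * entry.
 */
static const struct entry *union_lookup(const struct layer *layers,
					size_t nlayers, const char *name)
{
	for (size_t i = 0; i < nlayers; i++) {
		const struct entry *e = layer_find(&layers[i], name);

		if (!e)
			continue;	/* not here, try the next layer down */
		if (e->kind == ENT_WHITEOUT)
			return NULL;	/* masked: lower layers stay hidden */
		return e;
	}
	return NULL;
}
```

In this model, removing a name that also exists in a lower layer amounts to placing an `ENT_WHITEOUT` entry in the top layer, and creating a file over a whiteout amounts to replacing that entry first.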

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
fs/inode.c | 17 +
fs/namei.c | 476 ++++++++++++++++++++++++++++++++++++++++++++++++--
fs/readdir.c | 10 +
fs/union.c | 104 ++++++++++
include/linux/fs.h | 4
include/linux/union.h | 6
6 files changed, 605 insertions(+), 12 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1421,6 +1421,21 @@ void __init inode_init(unsigned long mem
INIT_HLIST_HEAD(&inode_hashtable[loop]);
}

+/*
+ * Dummy default file-operations:
+ * Never open a whiteout. This is always a bug.
+ */
+static int whiteout_no_open(struct inode *irrelevant, struct file *dontcare)
+{
+ printk("Attempt to open a whiteout!\n");
+ WARN_ON(1);
+ return -ENXIO;
+}
+
+static struct file_operations def_wht_fops = {
+ .open = whiteout_no_open,
+};
+
void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
inode->i_mode = mode;
@@ -1434,6 +1449,8 @@ void init_special_inode(struct inode *in
inode->i_fop = &def_fifo_fops;
else if (S_ISSOCK(mode))
inode->i_fop = &bad_sock_fops;
+ else if (S_ISWHT(mode))
+ inode->i_fop = &def_wht_fops;
else
printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o)\n",
mode);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -969,7 +969,7 @@ static fastcall int __link_path_walk(con

err = -ENOENT;
inode = next.dentry->d_inode;
- if (!inode)
+ if (!inode || S_ISWHT(inode->i_mode))
goto out_dput;
err = -ENOTDIR;
if (!inode->i_op)
@@ -1043,6 +1043,12 @@ last_component:
err = -ENOENT;
if (!inode)
break;
+ if (S_ISWHT(inode->i_mode)) {
+ UM_DEBUG_UID("found a whiteout\n");
+ break;
+ //if (!(nd->flags & LOOKUP_WHT))
+ // break;
+ }
if (lookup_flags & LOOKUP_DIRECTORY) {
err = -ENOTDIR;
if (!inode->i_op || !inode->i_op->lookup)
@@ -1556,7 +1562,7 @@ static int may_delete(struct inode *dir,
static inline int may_create(struct inode *dir, struct dentry *child,
struct nameidata *nd)
{
- if (child->d_inode)
+ if (child->d_inode && !S_ISWHT(child->d_inode->i_mode))
return -EEXIST;
if (IS_DEADDIR(dir))
return -ENOENT;
@@ -1623,6 +1629,82 @@ void unlock_rename(struct dentry *p1, st
}
}

+/*
+ * __vfs_unlink_whiteout - Unlink a single whiteout from the system
+ * @dir: parent directory
+ * @dentry: the whiteout itself
+ *
+ * Unlink a single whiteout. This does the same work as vfs_unlink() but
+ * skips the notification calls, which we don't want for whiteouts.
+ */
+static int
+__vfs_unlink_whiteout(struct inode *dir, struct dentry *dentry)
+{
+ int error = may_delete(dir, dentry, 0);
+
+ if (error)
+ return error;
+
+ if (!dir->i_op || !dir->i_op->unlink)
+ return -EPERM;
+
+ DQUOT_INIT(dir);
+
+ mutex_lock(&dentry->d_inode->i_mutex);
+ if (d_mountpoint(dentry))
+ error = -EBUSY;
+ else {
+ error = security_inode_unlink(dir, dentry);
+ if (!error)
+ error = dir->i_op->unlink(dir, dentry);
+ }
+ mutex_unlock(&dentry->d_inode->i_mutex);
+
+ /* We don't d_delete() NFS sillyrenamed files--they still exist. */
+ if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
+ d_delete(dentry);
+ //inode_dir_notify(dir, DN_DELETE);
+ }
+ return error;
+}
+
+/*
+ * vfs_unlink_whiteout - unlink and relookup the whiteout
+ *
+ * Call this from vfs_* functions to remove a whiteout. It unlinks the
+ * whiteout dentry and looks the name up again afterwards.
+ */
+static int
+vfs_unlink_whiteout(struct inode *dir, struct dentry **dp)
+{
+ struct dentry *dentry = *dp;
+ struct dentry *parent = dentry->d_parent;
+ struct qstr name;
+ int error;
+
+ BUG_ON(dir != parent->d_inode);
+
+ error = -ENOMEM;
+ name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
+ if (!name.name)
+ goto out;
+ strncpy((char *)name.name, dentry->d_name.name, dentry->d_name.len);
+ name.len = dentry->d_name.len;
+ name.hash = dentry->d_name.hash;
+
+ error = __vfs_unlink_whiteout(dir, dentry);
+ if (error)
+ goto out_freename;
+
+ __dput_single(dentry);
+ *dp = __lookup_hash_single(&name, parent, NULL);
+ BUG_ON(IS_ERR(*dp)); /* Hmm, very hard response here */
+out_freename:
+ kfree(name.name);
+out:
+ return error;
+}
+
int vfs_create(struct inode *dir, struct dentry *dentry, int mode,
struct nameidata *nd)
{
@@ -1638,6 +1720,13 @@ int vfs_create(struct inode *dir, struct
error = security_inode_create(dir, dentry, mode);
if (error)
return error;
+
+ if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+ error = vfs_unlink_whiteout(dir, &dentry);
+ if (error)
+ return error;
+ }
+
DQUOT_INIT(dir);
error = dir->i_op->create(dir, dentry, mode, nd);
if (!error)
@@ -1833,7 +1922,14 @@ do_last:
}

/* Negative dentry, just create the file */
- if (!path.dentry->d_inode) {
+ if (!path.dentry->d_inode || S_ISWHT(path.dentry->d_inode->i_mode)) {
+ if (path.dentry->d_parent != dir) {
+ UM_DEBUG_UID("found a lower layer's whiteout\n");
+ dput(path.dentry);
+ path.dentry = __lookup_hash_single(&nd->last, dir, nd);
+ goto do_last;
+ }
+
error = open_namei_create(nd, &path, flag, mode);
if (error)
goto exit;
@@ -1949,6 +2045,17 @@ do_link:
struct dentry *lookup_create(struct nameidata *nd, int is_dir)
{
struct dentry *dentry = ERR_PTR(-EEXIST);
+ int error;
+
+ if (union_is_member(nd->dentry, nd->mnt)) {
+ error = union_relookup_topmost(nd, nd->flags & ~LOOKUP_PARENT);
+ if (error) {
+ /* FIXME: This really sucks */
+ mutex_lock_nested(&nd->dentry->d_inode->i_mutex,
+ I_MUTEX_PARENT);
+ goto fail;
+ }
+ }

mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
/*
@@ -1968,6 +2075,15 @@ struct dentry *lookup_create(struct name
if (IS_ERR(dentry))
goto fail;

+ /* Special case - we found a whiteout */
+ if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+ if (dentry->d_parent != nd->dentry) {
+ UM_DEBUG_UID("found a lower layer's whiteout\n");
+ dput(dentry);
+ dentry = __lookup_hash_single(&nd->last,nd->dentry,nd);
+ }
+ }
+
/*
* Special case - lookup gave negative, but... we had foo/bar/
* From the vfs_mknod() POV we just have a negative dentry -
@@ -2002,6 +2118,12 @@ int vfs_mknod(struct inode *dir, struct
if (error)
return error;

+ if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+ error = vfs_unlink_whiteout(dir, &dentry);
+ if (error)
+ return error;
+ }
+
DQUOT_INIT(dir);
error = dir->i_op->mknod(dir, dentry, mode, dev);
if (!error)
@@ -2067,6 +2189,7 @@ asmlinkage long sys_mknod(const char __u
int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
int error = may_create(dir, dentry, NULL);
+ int opaque;

if (error)
return error;
@@ -2079,10 +2202,23 @@ int vfs_mkdir(struct inode *dir, struct
if (error)
return error;

+ if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+ error = vfs_unlink_whiteout(dir, &dentry);
+ if (error)
+ return error;
+ opaque = 1;
+ } else
+ opaque = 0;
+
DQUOT_INIT(dir);
error = dir->i_op->mkdir(dir, dentry, mode);
- if (!error)
+ if (!error) {
fsnotify_mkdir(dir, dentry);
+#ifdef CONFIG_UNION_MOUNT
+ if (opaque && dentry->d_parent->d_overlaid)
+ dentry->d_inode->i_flags |= S_OPAQUE;
+#endif
+ }
return error;
}

@@ -2124,6 +2260,225 @@ asmlinkage long sys_mkdir(const char __u
return sys_mkdirat(AT_FDCWD, pathname, mode);
}

+/* Checks on the victim for whiteout */
+static inline int may_whiteout(struct dentry *victim, int isdir)
+{
+ if (!victim->d_inode || S_ISWHT(victim->d_inode->i_mode))
+ return -ENOENT;
+ if (IS_APPEND(victim->d_inode) || IS_IMMUTABLE(victim->d_inode))
+ return -EPERM;
+ if (isdir) {
+ if (!S_ISDIR(victim->d_inode->i_mode))
+ return -ENOTDIR;
+ if (IS_ROOT(victim))
+ return -EBUSY;
+ if (!union_dir_is_empty(victim))
+ return -ENOTEMPTY;
+ } else if (S_ISDIR(victim->d_inode->i_mode))
+ return -EISDIR;
+ if (victim->d_flags & DCACHE_NFSFS_RENAMED)
+ return -EBUSY;
+ return 0;
+}
+
+/*
+ * We try to whiteout a dentry. dir is the parent of the whiteout.
+ * Whiteouts can be vfs_unlink'ed.
+ */
+int vfs_whiteout(struct inode *dir, struct dentry *dentry)
+{
+ int err;
+
+ BUG_ON(dentry->d_parent->d_inode != dir);
+
+ /* from may_create() */
+ if (dentry->d_inode)
+ return -EEXIST;
+ if (IS_DEADDIR(dir))
+ return -ENOENT;
+ err = permission(dir, MAY_WRITE | MAY_EXEC, NULL);
+ if (err)
+ return err;
+
+ /* from may_delete() */
+ if (IS_APPEND(dir))
+ return -EPERM;
+ /* We don't call check_sticky() here because d_inode == NULL */
+
+ if (!dir->i_op || !dir->i_op->whiteout)
+ return -EOPNOTSUPP;
+
+ err = dir->i_op->whiteout(dir, dentry);
+ /* Ignore quota and fsnotify */
+ return err;
+}
+
+/*
+ * do_whiteout - whiteout a dentry, either when removing or renaming
+ * @dentry: the dentry to whiteout
+ *
+ * This is called by the VFS when removing or renaming files on a union mount.
+ */
+static int do_whiteout(struct dentry *parent, struct dentry *dentry, int isdir)
+{
+ int err;
+ struct qstr name;
+
+ UM_DEBUG_UID("parent=\"%s\", dentry=\"%s\", isdir=%d\n",
+ parent->d_name.name, dentry->d_name.name, isdir);
+
+ err = may_whiteout(dentry, isdir);
+ if (err)
+ goto out;
+
+ err = -ENOMEM;
+ name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
+ if (!name.name)
+ goto out;
+ strncpy((char *)name.name, dentry->d_name.name, dentry->d_name.len);
+ name.len = dentry->d_name.len;
+ name.hash = dentry->d_name.hash;
+
+ /*
+ * TODO: Should we BUG_ON(dentry->d_parent != parent) ?
+ */
+ if (dentry->d_parent == parent) {
+ if (isdir)
+ err = vfs_rmdir(parent->d_inode, dentry);
+ else
+ err = vfs_unlink(parent->d_inode, dentry);
+ dput(dentry);
+ if (err)
+ goto out_freename;
+ }
+
+ /*
+ * Relookup the dentry to whiteout now. By this time, the dentry is
+ * dput'ed in vfs_rmdir or vfs_unlink and we should find a fresh
+ * negative dentry.
+ */
+ dentry = __lookup_hash_single(&name, parent, NULL);
+ err = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto out_freename;
+
+ err = vfs_whiteout(parent->d_inode, dentry);
+ __dput_single(dentry);
+out_freename:
+ kfree(name.name);
+out:
+ return err;
+}
+
+static int
+__hash_one_len(const char *name, int len, struct qstr *this)
+{
+ unsigned long hash;
+ unsigned int c;
+
+ hash = init_name_hash();
+ while (len--) {
+ c = *(const unsigned char *)name++;
+ if (c == '/' || c == '\0')
+ return -EINVAL;
+ hash = partial_name_hash(c, hash);
+ }
+ this->hash = end_name_hash(hash);
+ return 0;
+}
+
+static int unlink_whiteouts_filldir(void *buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct dentry *parent = buf;
+ struct dentry *dentry;
+ struct qstr this;
+ int res;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ UM_DEBUG_UID("name=\"%s\", d_type=%d\n", name, d_type);
+
+ if (d_type != DT_WHT)
+ return 0;
+
+ this.name = name;
+ this.len = namlen;
+ res = __hash_one_len(name, namlen, &this);
+ if (res)
+ return res;
+
+ dentry = __lookup_hash_single(&this, parent, NULL);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ res = __vfs_unlink_whiteout(parent->d_inode, dentry);
+ __dput_single(dentry);
+ return res;
+}
+
+/*
+ * do_unlink_whiteouts - remove all whiteouts of an "empty" directory
+ * @dentry: the directory's dentry
+ *
+ * Before removing a directory from the file system, we have to make sure
+ * that there are no stale whiteouts in it. Therefore we call readdir() with
+ * a special filldir helper to remove all the whiteouts.
+ *
+ * XXX: Don't call any security and permission checks here (If we aren't
+ * allowed to go here, we shouldn't be here at all). Same with i_mutex, don't
+ * touch it here.
+ */
+static int do_unlink_whiteouts(struct dentry *dentry)
+{
+ struct file *file;
+ struct vfsmount *mnt;
+ struct inode *inode;
+ int res;
+
+ dget(dentry);
+ mnt = find_mnt(dentry);
+
+ /*
+ * FIXME: This is bad, because we really don't want to open a new
+ * file in the kernel but readdir needs a file pointer
+ */
+ file = dentry_open(dentry, mnt, O_RDWR);
+ if (IS_ERR(file)) {
+ printk(KERN_ERR "%s: dentry_open failed (%ld)\n",
+ __FUNCTION__, PTR_ERR(file));
+ return PTR_ERR(file);
+ }
+
+ inode = file->f_path.dentry->d_inode;
+
+ res = -ENOTDIR;
+ if (!file->f_op || !file->f_op->readdir)
+ goto out_fput;
+
+ res = -ENOENT;
+ if (!IS_DEADDIR(inode)) {
+ res = file->f_op->readdir(file, (void *)file->f_path.dentry,
+ unlink_whiteouts_filldir);
+ file_accessed(file);
+ }
+out_fput:
+ fput(file);
+ if (unlikely(res))
+ printk(KERN_ERR "%s: readdir failed (%d)\n",
+ __FUNCTION__, res);
+ return res;
+}
+
+
/*
* We try to drop the dentry early: we should have
* a usage count of 2 if we're the only user of this
@@ -2153,8 +2508,12 @@ void dentry_unhash(struct dentry *dentry

int vfs_rmdir(struct inode *dir, struct dentry *dentry)
{
- int error = may_delete(dir, dentry, 1);
+ int error;

+ if (!dentry->d_inode || S_ISWHT(dentry->d_inode->i_mode))
+ return -ENOENT;
+
+ error = may_delete(dir, dentry, 1);
if (error)
return error;

@@ -2170,11 +2529,15 @@ int vfs_rmdir(struct inode *dir, struct
else {
error = security_inode_rmdir(dir, dentry);
if (!error) {
+ error = do_unlink_whiteouts(dentry);
+ if (error)
+ goto out;
error = dir->i_op->rmdir(dir, dentry);
if (!error)
dentry->d_inode->i_flags |= S_DEAD;
}
}
+ out:
mutex_unlock(&dentry->d_inode->i_mutex);
if (!error) {
d_delete(dentry);
@@ -2215,8 +2578,41 @@ static long do_rmdir(int dfd, const char
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit2;
- error = vfs_rmdir(nd.dentry->d_inode, dentry);
- dput(dentry);
+
+ if (!union_is_member(nd.dentry, nd.mnt)) {
+ /* Not a member of union, normal removal */
+ error = vfs_rmdir(nd.dentry->d_inode, dentry);
+ dput(dentry);
+ goto exit2;
+ }
+
+ if (dentry->d_parent == nd.dentry) {
+ /*
+ * Topmost dentry of the union. Check if there
+ * is a dentry of the same name in the lower layers.
+ * If so, remove it and create a whiteout in its place.
+ * Else normal removal.
+ */
+ if (present_in_lower(dentry, &nd))
+ error = do_whiteout(nd.dentry, dentry, 1);
+ else {
+ error = vfs_rmdir(nd.dentry->d_inode, dentry);
+ dput(dentry);
+ }
+ } else {
+ /*
+ * Lower layer dentry of the union. Relookup
+ * the dentry in the top layer (which should return
+ * a negative dentry) and create a whiteout there.
+ */
+ dput(dentry);
+ dentry = __lookup_hash_single(&nd.last, nd.dentry, &nd);
+ error = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto exit2;
+ error = vfs_whiteout(nd.dentry->d_inode, dentry);
+ __dput_single(dentry);
+ }
exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
exit1:
@@ -2295,10 +2691,44 @@ static long do_unlinkat(int dfd, const c
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
- error = vfs_unlink(nd.dentry->d_inode, dentry);
- exit2:
- dput(dentry);
+
+ if (!union_is_member(nd.dentry, nd.mnt)) {
+ /* Not a member of union, normal removal */
+ error = vfs_unlink(nd.dentry->d_inode, dentry);
+ dput(dentry);
+ goto exit2;
+ }
+
+ /* TODO: fix this code duplication with do_rmdir() */
+ if (dentry->d_parent == nd.dentry) {
+ /*
+ * Topmost dentry of the union. Check if there
+ * is a dentry of the same name in the lower layers.
+ * If so, remove it and create a whiteout in its place.
+ * Else normal removal.
+ */
+ if (present_in_lower(dentry, &nd))
+ error = do_whiteout(nd.dentry, dentry, 0);
+ else {
+ error = vfs_unlink(nd.dentry->d_inode, dentry);
+ dput(dentry);
+ }
+ } else {
+ /*
+ * Lower layer dentry of the union. Relookup
+ * the dentry in the top layer (which should return
+ * a negative dentry) and create a whiteout there.
+ */
+ dput(dentry);
+ dentry = __lookup_hash_single(&nd.last, nd.dentry, &nd);
+ error = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto exit2;
+ error = vfs_whiteout(nd.dentry->d_inode, dentry);
+ __dput_single(dentry);
+ }
}
+exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
if (inode)
iput(inode); /* truncate the inode here */
@@ -2311,6 +2741,7 @@ exit:
slashes:
error = !dentry->d_inode ? -ENOENT :
S_ISDIR(dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
+ dput(dentry);
goto exit2;
}

@@ -2344,6 +2775,12 @@ int vfs_symlink(struct inode *dir, struc
if (error)
return error;

+ if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+ error = vfs_unlink_whiteout(dir, &dentry);
+ if (error)
+ return error;
+ }
+
DQUOT_INIT(dir);
error = dir->i_op->symlink(dir, dentry, oldname);
if (!error)
@@ -2398,7 +2835,7 @@ int vfs_link(struct dentry *old_dentry,
struct inode *inode = old_dentry->d_inode;
int error;

- if (!inode)
+ if (!inode || S_ISWHT(inode->i_mode))
return -ENOENT;

error = may_create(dir, new_dentry, NULL);
@@ -2674,7 +3111,7 @@ static int do_rename(int olddfd, const c
goto exit3;
/* source must exist */
error = -ENOENT;
- if (!old_dentry->d_inode)
+ if (!old_dentry->d_inode || S_ISWHT(old_dentry->d_inode->i_mode))
goto exit4;
/* unless the source is a directory trailing slashes give -ENOTDIR */
if (!S_ISDIR(old_dentry->d_inode->i_mode)) {
@@ -2696,6 +3133,21 @@ static int do_rename(int olddfd, const c
error = -ENOTEMPTY;
if (new_dentry == trap)
goto exit5;
+ error = -EXDEV;
+ /* renaming of directories on unions isn't implemented, yet */
+ if (union_is_member(old_dentry, oldnd.mnt)) {
+ error = -EOPNOTSUPP;
+ if (S_ISDIR(old_dentry->d_inode->i_mode))
+ goto exit5;
+ error = -EXDEV;
+ if (oldnd.um_flags & LAST_LOWLEVEL)
+ goto exit5;
+ }
+ if (union_is_member(new_dentry, newnd.mnt)) {
+ error = -EXDEV;
+ if (newnd.um_flags & LAST_LOWLEVEL)
+ goto exit5;
+ }

error = vfs_rename(old_dir->d_inode, old_dentry,
new_dir->d_inode, new_dentry);
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -148,6 +148,11 @@ static int filldir(void * __buf, const c
unsigned long d_ino;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));

+#ifdef CONFIG_UNION_MOUNT
+ if (d_type == DT_WHT)
+ return 0;
+#endif /* CONFIG_UNION_MOUNT */
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
@@ -233,6 +238,11 @@ static int filldir64(void * __buf, const
struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));

+#ifdef CONFIG_UNION_MOUNT
+ if (d_type == DT_WHT)
+ return 0;
+#endif /* CONFIG_UNION_MOUNT */
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
--- a/fs/union.c
+++ b/fs/union.c
@@ -594,6 +594,9 @@ lookup_union:
if (!S_ISDIR(topmost->d_inode->i_mode))
goto out;

+ if (IS_OPAQUE(topmost->d_inode))
+ goto out;
+
if (!revalidate_union(topmost)) {
__dput_single(topmost);
topmost = NULL;
@@ -664,6 +667,8 @@ lookup_union:
dentry->d_topmost = topmost;
last->d_overlaid = dentry;
last = dentry;
+ if (IS_OPAQUE(last->d_inode))
+ break;
parent = parent->d_overlaid;
}

@@ -822,6 +827,8 @@ static int __lookup_union(struct dentry
loop:
__dput(nd.dentry);
mntput(nd.mnt);
+ if (IS_OPAQUE(last->d_inode))
+ break;
parent = parent->d_overlaid;
}

@@ -912,6 +919,9 @@ lookup_union:
if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
goto out;

+ if (IS_OPAQUE(topmost->d_inode))
+ goto out;
+
do {
struct vfsmount *mnt = find_mnt(topmost);
UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
@@ -988,6 +998,9 @@ lookup_union:
if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
goto out;

+ if (IS_OPAQUE(topmost->d_inode))
+ goto out;
+
do {
struct vfsmount *mnt = find_mnt(topmost);
UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
@@ -1787,3 +1800,94 @@ exit_dput:
dput(dentry);
return err;
}
+
+static int
+filldir_dummy(void *__buf, const char *name, int namlen, loff_t offset,
+ u64 ino, unsigned int d_type)
+{
+ int *is_empty = (int *)__buf;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ if (d_type == DT_WHT)
+ return 0;
+
+ (*is_empty) = 0;
+ return 0;
+}
+
+int
+union_dir_is_empty(struct dentry *dentry)
+{
+ struct file *file;
+ struct vfsmount *mnt;
+ int err;
+ int is_empty = 1;
+
+ BUG_ON(!S_ISDIR(dentry->d_inode->i_mode));
+
+ dget(dentry);
+ mnt = find_mnt(dentry);
+
+ file = dentry_open(dentry, mnt, O_RDONLY);
+ if (IS_ERR(file))
+ return 0;
+
+ err = vfs_readdir(file, filldir_dummy, &is_empty);
+ UM_DEBUG("err=%d, is_empty=%d\n", err, is_empty);
+
+ fput(file);
+ return is_empty;
+}
+
+int present_in_lower(struct dentry *dentry, struct nameidata *nd)
+{
+ int err = 0;
+ struct dentry *parent = nd->dentry->d_overlaid;
+ struct dentry *tmp;
+ struct nameidata nd_tmp;
+ struct qstr this;
+
+ this.name = nd->last.name;
+ this.len = nd->last.len;
+
+ while (parent) {
+ this.hash = nd->last.hash;
+ nd_tmp.dentry = dget(parent);
+ nd_tmp.mnt = find_mnt(parent);
+ mutex_lock(&parent->d_inode->i_mutex);
+ tmp = __lookup_hash_single(&this, nd_tmp.dentry, &nd_tmp);
+ mutex_unlock(&parent->d_inode->i_mutex);
+ /*
+ * If there is an error in lookup, we return 0 concluding
+ * that this dentry is not present in lower layers.
+ */
+ if (IS_ERR(tmp))
+ goto out;
+
+ if (tmp->d_inode) {
+ __dput_single(tmp);
+ err = 1;
+ goto out;
+ }
+
+ __dput_single(tmp);
+ mntput(nd_tmp.mnt);
+ dput(nd_tmp.dentry);
+ parent = parent->d_overlaid;
+ }
+
+ return err;
+out:
+ mntput(nd_tmp.mnt);
+ dput(nd_tmp.dentry);
+ return err;
+}
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -157,6 +157,7 @@ extern int dir_notify_enable;
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_OPAQUE 1024 /* Directory is opaque */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -190,6 +191,7 @@ extern int dir_notify_enable;
#define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
+#define IS_OPAQUE(inode) ((inode)->i_flags & S_OPAQUE)

/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
@@ -1050,6 +1052,7 @@ extern int vfs_link(struct dentry *, str
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_whiteout(struct inode *, struct dentry *);

/*
* VFS dentry helper functions.
@@ -1175,6 +1178,7 @@ struct inode_operations {
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*whiteout) (struct inode *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -43,6 +43,10 @@ extern int union_copy_file(struct dentry
extern int union_copyup(struct nameidata *, int);
extern int union_relookup_topmost(struct nameidata *, int);

+/* vfs whiteout support */
+extern int union_dir_is_empty(struct dentry *);
+extern int present_in_lower(struct dentry *, struct nameidata *);
+
#else /* CONFIG_UNION_MOUNT */

#define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
@@ -54,6 +58,8 @@ extern int union_relookup_topmost(struct
#define union_copy_file(dentry1,mnt1,dentry2,mnt2) ({ (0); })
#define union_copyup(x,y) ({ (0); })
#define union_relookup_topmost(x,y) ({ (0); })
+#define union_dir_is_empty(x) ({ (1); })
+#define present_in_lower(x, y) ({ (0); })

#endif /* CONFIG_UNION_MOUNT */

2007-05-14 09:37:41

by Bharata B Rao

Subject: [RFC][PATCH 12/14] ext2 whiteout support

From: Jan Blunck <[email protected]>
Subject: ext2 whiteout support

Introduce whiteout support to ext2.
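
The mount-time policy in the super.c hunk can be summarised with a small userspace sketch: a writable union mount of a filesystem that lacks the whiteout feature bit is degraded to read-only rather than refused. The `MODEL_*` names and `effective_mount_flags()` are hypothetical; in particular the value used for `MODEL_MS_UNION` is an assumption, since the real MS_UNION is defined elsewhere in the series. Only the 0x0020 feature bit is taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in mount flags for this sketch only (values are assumptions). */
#define MODEL_MS_RDONLY		0x0001u
#define MODEL_MS_UNION		0x0100u
/* Same bit as EXT2_FEATURE_INCOMPAT_WHITEOUT in this patch. */
#define MODEL_INCOMPAT_WHITEOUT	0x0020u

/*
 * Return the mount flags that would be in effect after the check:
 * a read-write union mount without the whiteout feature becomes
 * read-only; every other combination is left untouched.
 */
static uint32_t effective_mount_flags(uint32_t s_flags, uint32_t incompat)
{
	uint32_t out = s_flags;
	int writable_union = (s_flags & MODEL_MS_UNION) &&
			     !(s_flags & MODEL_MS_RDONLY);

	if (writable_union && !(incompat & MODEL_INCOMPAT_WHITEOUT))
		out |= MODEL_MS_RDONLY;	/* degrade instead of failing */

	/* The check only ever adds flags, it never clears any. */
	assert((out & s_flags) == s_flags);
	return out;
}
```

The design choice worth noting is that the mount still succeeds: the lower layers remain readable, and only operations that would need a whiteout are ruled out.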

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
fs/ext2/dir.c | 2 ++
fs/ext2/namei.c | 17 +++++++++++++++++
fs/ext2/super.c | 11 ++++++++++-
include/linux/ext2_fs.h | 4 ++++
4 files changed, 33 insertions(+), 1 deletion(-)

--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -218,6 +218,7 @@ static unsigned char ext2_filetype_table
[EXT2_FT_FIFO] = DT_FIFO,
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
+ [EXT2_FT_WHT] = DT_WHT,
};

#define S_SHIFT 12
@@ -229,6 +230,7 @@ static unsigned char ext2_type_by_mode[S
[S_IFIFO >> S_SHIFT] = EXT2_FT_FIFO,
[S_IFSOCK >> S_SHIFT] = EXT2_FT_SOCK,
[S_IFLNK >> S_SHIFT] = EXT2_FT_SYMLINK,
+ [S_IFWHT >> S_SHIFT] = EXT2_FT_WHT,
};

static inline void ext2_set_de_type(ext2_dirent *de, struct inode *inode)
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -288,6 +288,22 @@ static int ext2_rmdir (struct inode * di
return err;
}

+static int ext2_whiteout(struct inode *dir, struct dentry *dentry)
+{
+ struct inode *inode;
+ int err;
+
+ inode = ext2_new_inode (dir, S_IFWHT | S_IRUGO);
+ err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+ goto out;
+
+ mark_inode_dirty(inode);
+ err = ext2_add_nondir(dentry, inode);
+out:
+ return err;
+}
+
static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
{
@@ -382,6 +398,7 @@ const struct inode_operations ext2_dir_i
.mkdir = ext2_mkdir,
.rmdir = ext2_rmdir,
.mknod = ext2_mknod,
+ .whiteout = ext2_whiteout,
.rename = ext2_rename,
#ifdef CONFIG_EXT2_FS_XATTR
.setxattr = generic_setxattr,
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -754,6 +754,15 @@ static int ext2_fill_super(struct super_
ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
EXT2_MOUNT_XIP if not */

+ if ((sb->s_flags & MS_UNION) && !(sb->s_flags & MS_RDONLY)) {
+ if (!EXT2_HAS_INCOMPAT_FEATURE(sb,
+ EXT2_FEATURE_INCOMPAT_WHITEOUT)) {
+ sb->s_flags |= MS_RDONLY;
+ ext2_warning(sb, __FUNCTION__,
+ "no whiteout support, mounting filesystem read-only");
+ }
+ }
+
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
(EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -1292,7 +1301,7 @@ static struct file_system_type ext2_fs_t
.name = "ext2",
.get_sb = ext2_get_sb,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .fs_flags = FS_REQUIRES_DEV | FS_WHT,
};

static int __init init_ext2_fs(void)
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -61,6 +61,7 @@
#define EXT2_ROOT_INO 2 /* Root inode */
#define EXT2_BOOT_LOADER_INO 5 /* Boot loader inode */
#define EXT2_UNDEL_DIR_INO 6 /* Undelete directory inode */
+#define EXT2_WHT_INO 7 /* Whiteout inode */

/* First non-reserved inode for old ext2 filesystems */
#define EXT2_GOOD_OLD_FIRST_INO 11
@@ -479,10 +480,12 @@ struct ext2_super_block {
#define EXT3_FEATURE_INCOMPAT_RECOVER 0x0004
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008
#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
+#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
#define EXT2_FEATURE_INCOMPAT_ANY 0xffffffff

#define EXT2_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
+ EXT2_FEATURE_INCOMPAT_WHITEOUT| \
EXT2_FEATURE_INCOMPAT_META_BG)
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -549,6 +552,7 @@ enum {
EXT2_FT_FIFO,
EXT2_FT_SOCK,
EXT2_FT_SYMLINK,
+ EXT2_FT_WHT,
EXT2_FT_MAX
};

2007-05-14 09:38:14

by Bharata B Rao

Subject: [RFC][PATCH 13/14] ext3 whiteout support

From: Bharata B Rao <[email protected]>
Subject: ext3 whiteout support

Introduce whiteout support for ext3.

Signed-off-by: Bharata B Rao <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
---
fs/ext3/dir.c | 2 -
fs/ext3/namei.c | 62 ++++++++++++++++++++++++++++++++++++++++++++----
fs/ext3/super.c | 11 +++++++-
include/linux/ext3_fs.h | 5 +++
4 files changed, 72 insertions(+), 8 deletions(-)

--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -29,7 +29,7 @@
#include <linux/rbtree.h>

static unsigned char ext3_filetype_table[] = {
- DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
+ DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK, DT_WHT
};

static int ext3_readdir(struct file *, void *, filldir_t);
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1071,6 +1071,7 @@ static unsigned char ext3_type_by_mode[S
[S_IFIFO >> S_SHIFT] = EXT3_FT_FIFO,
[S_IFSOCK >> S_SHIFT] = EXT3_FT_SOCK,
[S_IFLNK >> S_SHIFT] = EXT3_FT_SYMLINK,
+ [S_IFWHT >> S_SHIFT] = EXT3_FT_WHT,
};

static inline void ext3_set_de_type(struct super_block *sb,
@@ -1786,7 +1787,7 @@ out_stop:
/*
* routine to check that the specified directory is empty (for rmdir)
*/
-static int empty_dir (struct inode * inode)
+static int empty_dir (handle_t *handle, struct inode * inode)
{
unsigned long offset;
struct buffer_head * bh;
@@ -1848,8 +1849,28 @@ static int empty_dir (struct inode * ino
continue;
}
if (le32_to_cpu(de->inode)) {
- brelse (bh);
- return 0;
+ /* If this is a whiteout, remove it */
+ if (de->file_type == EXT3_FT_WHT) {
+ unsigned long ino = le32_to_cpu(de->inode);
+ struct inode *tmp_inode = iget(inode->i_sb, ino);
+ if (!tmp_inode) {
+ brelse (bh);
+ return 0;
+ }
+
+ if (ext3_delete_entry(handle, inode, de, bh)) {
+ iput(tmp_inode);
+ brelse (bh);
+ return 0;
+ }
+
+ tmp_inode->i_ctime = inode->i_ctime;
+ tmp_inode->i_nlink--;
+ iput(tmp_inode);
+ } else {
+ brelse (bh);
+ return 0;
+ }
}
offset += le16_to_cpu(de->rec_len);
de = (struct ext3_dir_entry_2 *)
@@ -2031,7 +2052,7 @@ static int ext3_rmdir (struct inode * di
goto end_rmdir;

retval = -ENOTEMPTY;
- if (!empty_dir (inode))
+ if (!empty_dir (handle, inode))
goto end_rmdir;

retval = ext3_delete_entry(handle, dir, de, bh);
@@ -2060,6 +2081,36 @@ end_rmdir:
return retval;
}

+static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
+{
+ struct inode * inode;
+ int err, retries = 0;
+ handle_t *handle;
+
+retry:
+ handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
+ EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+ 2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ if (IS_DIRSYNC(dir))
+ handle->h_sync = 1;
+
+ inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
+ err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+ goto out_stop;
+
+ err = ext3_add_nondir(handle, dentry, inode);
+
+out_stop:
+ ext3_journal_stop(handle);
+ if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
+ goto retry;
+ return err;
+}
+
static int ext3_unlink(struct inode * dir, struct dentry *dentry)
{
int retval;
@@ -2261,7 +2312,7 @@ static int ext3_rename (struct inode * o
if (S_ISDIR(old_inode->i_mode)) {
if (new_inode) {
retval = -ENOTEMPTY;
- if (!empty_dir (new_inode))
+ if (!empty_dir (handle, new_inode))
goto end_rename;
}
retval = -EIO;
@@ -2377,6 +2428,7 @@ const struct inode_operations ext3_dir_i
.mkdir = ext3_mkdir,
.rmdir = ext3_rmdir,
.mknod = ext3_mknod,
+ .whiteout = ext3_whiteout,
.rename = ext3_rename,
.setattr = ext3_setattr,
#ifdef CONFIG_EXT3_FS_XATTR
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1492,6 +1492,15 @@ static int ext3_fill_super (struct super
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
((sbi->s_mount_opt & EXT3_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

+ if ((sb->s_flags & MS_UNION) && !(sb->s_flags & MS_RDONLY)) {
+ if (!EXT3_HAS_INCOMPAT_FEATURE(sb,
+ EXT3_FEATURE_INCOMPAT_WHITEOUT)) {
+ sb->s_flags |= MS_RDONLY;
+ ext3_warning(sb, __FUNCTION__,
+ "no whiteout support, mounting filesystem read-only");
+ }
+ }
+
if (le32_to_cpu(es->s_rev_level) == EXT3_GOOD_OLD_REV &&
(EXT3_HAS_COMPAT_FEATURE(sb, ~0U) ||
EXT3_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -2748,7 +2757,7 @@ static struct file_system_type ext3_fs_t
.name = "ext3",
.get_sb = ext3_get_sb,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .fs_flags = FS_REQUIRES_DEV | FS_WHT,
};

static int __init init_ext3_fs(void)
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -63,6 +63,7 @@
#define EXT3_UNDEL_DIR_INO 6 /* Undelete directory inode */
#define EXT3_RESIZE_INO 7 /* Reserved group descriptors inode */
#define EXT3_JOURNAL_INO 8 /* Journal inode */
+#define EXT3_WHT_INO 9 /* Whiteout inode */

/* First non-reserved inode for old ext3 filesystems */
#define EXT3_GOOD_OLD_FIRST_INO 11
@@ -582,6 +583,7 @@ static inline int ext3_valid_inum(struct
#define EXT3_FEATURE_INCOMPAT_RECOVER 0x0004 /* Needs recovery */
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008 /* Journal device */
#define EXT3_FEATURE_INCOMPAT_META_BG 0x0010
+#define EXT3_FEATURE_INCOMPAT_WHITEOUT 0x0020

#define EXT3_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT3_FEATURE_INCOMPAT_SUPP (EXT3_FEATURE_INCOMPAT_FILETYPE| \
@@ -648,8 +650,9 @@ struct ext3_dir_entry_2 {
#define EXT3_FT_FIFO 5
#define EXT3_FT_SOCK 6
#define EXT3_FT_SYMLINK 7
+#define EXT3_FT_WHT 8

-#define EXT3_FT_MAX 8
+#define EXT3_FT_MAX 9

/*
* EXT3_DIR_PAD defines the directory entries boundaries

2007-05-14 09:38:32

by Bharata B Rao

[permalink] [raw]
Subject: [RFC][PATCH 14/14] tmpfs whiteout support

From: Jan Blunck <[email protected]>
Subject: tmpfs whiteout support

Introduce whiteout support to tmpfs.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
mm/shmem.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletion(-)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -74,7 +74,7 @@
#define LATENCY_LIMIT 64

/* Pretend that each entry is of this size in directory's i_size */
-#define BOGO_DIRENT_SIZE 20
+#define BOGO_DIRENT_SIZE 1

/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
enum sgp_type {
@@ -1772,6 +1772,11 @@ static int shmem_create(struct inode *di
return shmem_mknod(dir, dentry, mode | S_IFREG, 0);
}

+static int shmem_whiteout(struct inode *dir, struct dentry *dentry)
+{
+ return shmem_mknod(dir, dentry, S_IRUGO | S_IWUGO | S_IFWHT, 0);
+}
+
/*
* Link a file..
*/
@@ -2399,6 +2404,7 @@ static const struct inode_operations shm
.rmdir = shmem_rmdir,
.mknod = shmem_mknod,
.rename = shmem_rename,
+ .whiteout = shmem_whiteout,
#endif
#ifdef CONFIG_TMPFS_POSIX_ACL
.setattr = shmem_notify_change,
@@ -2453,6 +2459,7 @@ static struct file_system_type tmpfs_fs_
.name = "tmpfs",
.get_sb = shmem_get_sb,
.kill_sb = kill_litter_super,
+ .fs_flags = FS_WHT,
};
static struct vfsmount *shm_mnt;

2007-05-14 10:43:53

by Carsten Otte

Subject: Re: [RFC][PATCH 9/14] Union-mount readdir

On 5/14/07, Bharata B Rao <[email protected]> wrote:
> +/* This is a copy from fs/readdir.c */
> +struct getdents_callback {
> + struct linux_dirent __user *current_dir;
> + struct linux_dirent __user *previous;
> + int count;
> + int error;
> +};
This should go into a header file.

> +static int union_cache_find_entry(struct list_head *uc_list,
> + const char *name, int namelen)
> +{
> + struct union_cache_entry *p;
> + int ret = 0;
> +
> + list_for_each_entry(p, uc_list, list) {
> + if (p->name.len != namelen)
> + continue;
> + if (strncmp(p->name.name, name, namelen) == 0) {
> + ret = 1;
> + break;
> + }
> + }
> + return ret;
> +}
Why not use strlen() instead of passing both the string and its length as parameters?

> +static struct file * __dentry_open_read(struct dentry *dentry,
> + struct vfsmount *mnt, int flags)
> +{
> + struct file *f;
> + struct inode *inode;
> + int error;
> +
> + error = -ENFILE;
> + f = get_empty_filp();
> + if (!f)
> + goto out;
This is the only case where error is not explicitly set to a different
value before hitting out/cleanup => consider setting conditionally.

so long,
Carsten

2007-05-14 11:08:18

by Bharata B Rao

Subject: Re: [RFC][PATCH 9/14] Union-mount readdir

On Mon, May 14, 2007 at 12:43:43PM +0200, Carsten Otte wrote:
> On 5/14/07, Bharata B Rao <[email protected]> wrote:
> >+/* This is a copy from fs/readdir.c */
> >+struct getdents_callback {
> >+ struct linux_dirent __user *current_dir;
> >+ struct linux_dirent __user *previous;
> >+ int count;
> >+ int error;
> >+};
> This should go into a header file.

Yes, ideally. As the comment above says, it is copied from fs/readdir.c and
we should be using the definition from there. But that means touching
additional files, and we wanted to avoid that for this initial RFC post.

>
> >+static int union_cache_find_entry(struct list_head *uc_list,
> >+ const char *name, int namelen)
> >+{
> >+ struct union_cache_entry *p;
> >+ int ret = 0;
> >+
> >+ list_for_each_entry(p, uc_list, list) {
> >+ if (p->name.len != namelen)
> >+ continue;
> >+ if (strncmp(p->name.name, name, namelen) == 0) {
> >+ ret = 1;
> >+ break;
> >+ }
> >+ }
> >+ return ret;
> >+}
> Why not use strlen instead of having both string and length as parameter?
>

None of the generic filldir routines in fs/readdir.c (filldir, fillonedir
and filldir64) depends on dirent->d_name being NUL-terminated; each of them
writes the terminating 0 itself. Hence we don't depend on the name string
being NUL-terminated either.
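
As an illustration of the point above, here is a minimal user-space sketch of
a length-based name comparison. The struct and function names are
hypothetical stand-ins, not the ones used in the patch, and memcmp() is used
here where the patch uses strncmp(). Neither string needs a terminating NUL:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cached directory entry name. */
struct cached_name {
	const char *name;	/* not necessarily NUL-terminated */
	size_t len;		/* number of valid bytes in name */
};

/*
 * Return 1 if the first namelen bytes of name match the cached entry
 * exactly, 0 otherwise. Lengths are compared first, so a prefix of a
 * longer name never matches.
 */
static int cached_name_matches(const struct cached_name *c,
			       const char *name, size_t namelen)
{
	if (c->len != namelen)
		return 0;
	return memcmp(c->name, name, namelen) == 0;
}
```

With strlen() the caller would have to guarantee termination, which the
filldir callbacks do not promise.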

> >+static struct file * __dentry_open_read(struct dentry *dentry,
> >+ struct vfsmount *mnt, int flags)
> >+{
> >+ struct file *f;
> >+ struct inode *inode;
> >+ int error;
> >+
> >+ error = -ENFILE;
> >+ f = get_empty_filp();
> >+ if (!f)
> >+ goto out;
> This is the only case where error is not explicitly set to a different
> value before hitting out/cleanup => consider setting conditionally.

Sure, that can be done. Again, this routine is copied from dentry_open(),
which is why it looks like that at the moment.

Thanks for your review.

Regards,
Bharata.

2007-05-14 16:13:41

by Hugh Dickins

Subject: Re: [RFC][PATCH 14/14] tmpfs whiteout support

On Mon, 14 May 2007, Bharata B Rao wrote:
> From: Jan Blunck <[email protected]>
> Subject: tmpfs whiteout support
>
> Introduce whiteout support to tmpfs.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Bharata B Rao <[email protected]>
> ---
> mm/shmem.c | 9 ++++++++-
> 1 files changed, 8 insertions(+), 1 deletion(-)
>
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -74,7 +74,7 @@
> #define LATENCY_LIMIT 64
>
> /* Pretend that each entry is of this size in directory's i_size */
> -#define BOGO_DIRENT_SIZE 20
> +#define BOGO_DIRENT_SIZE 1

Why would that change be needed for whiteout support?

Hugh

2007-05-14 19:21:07

by Jan Blunck

Subject: Re: [RFC][PATCH 14/14] tmpfs whiteout support

On 5/14/07, Hugh Dickins <[email protected]> wrote:
> On Mon, 14 May 2007, Bharata B Rao wrote:
> > From: Jan Blunck <[email protected]>
> > Subject: tmpfs whiteout support
> >
> > Introduce whiteout support to tmpfs.
> >
> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: Bharata B Rao <[email protected]>
> > ---
> > mm/shmem.c | 9 ++++++++-
> > 1 files changed, 8 insertions(+), 1 deletion(-)
> >
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -74,7 +74,7 @@
> > #define LATENCY_LIMIT 64
> >
> > /* Pretend that each entry is of this size in directory's i_size */
> > -#define BOGO_DIRENT_SIZE 20
> > +#define BOGO_DIRENT_SIZE 1
>
> Why would that change be needed for whiteout support?
>

Good question. It seems that this is a leftover from the changes needed
for union readdir. It isn't necessary for white-outs.

BTW: Why do we claim this to be 20??? Is there any meaning behind this?

Cheers,
Jan

2007-05-14 19:35:51

by Hugh Dickins

Subject: Re: [RFC][PATCH 14/14] tmpfs whiteout support

On Mon, 14 May 2007, Jan Blunck wrote:
> On 5/14/07, Hugh Dickins <[email protected]> wrote:
> > >
> > > /* Pretend that each entry is of this size in directory's i_size */
> > > -#define BOGO_DIRENT_SIZE 20
> > > +#define BOGO_DIRENT_SIZE 1
> >
> > Why would that change be needed for whiteout support?
>
> Good question. It seems that this a survivor of the changes necessary
> for union readdir.

(I'd be asking the same question in that case, but don't worry about it!)

> This isn't necessary for white-outs.

Phew, thanks, please drop that hunk.

> BTW: Why do we claim this to be 20??? Is there any meaning behind this?

No great meaning, hence "BOGO". I put that in when hpa (IIRC) found
tmpfs directory size 0 didn't suit some apps. I thought it would be
nice to have a size which indicates the current number of entries
(which your 1 would do), looks plausible (for short filenames),
and easy to make sense of in an "ls -l". Bogus, yes; but I'd
resist changing it after all this time, without very good reason.
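
A rough user-space model of the accounting described above, assuming only
that each directory entry adds BOGO_DIRENT_SIZE bytes to the directory's
apparent size. The struct and helpers are hypothetical and are not taken
from mm/shmem.c:

```c
/* Pretend size of one directory entry, as in the existing tmpfs code. */
#define BOGO_DIRENT_SIZE 20

/* Hypothetical model of a directory that tracks only its apparent size. */
struct model_dir {
	long i_size;
};

/* Creating an entry grows the apparent size by one bogus dirent. */
static void model_add_entry(struct model_dir *d)
{
	d->i_size += BOGO_DIRENT_SIZE;
}

/* Removing an entry shrinks it again by the same amount. */
static void model_remove_entry(struct model_dir *d)
{
	d->i_size -= BOGO_DIRENT_SIZE;
}

/* The number of entries can be read back from the size. */
static long model_entry_count(const struct model_dir *d)
{
	return d->i_size / BOGO_DIRENT_SIZE;
}
```

With a value of 1 the size would equal the entry count directly; with 20 it
stays proportional to it while looking plausible in "ls -l".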

Hugh

2007-05-14 20:16:30

by Badari Pulavarty

Subject: Re: [RFC][PATCH 13/14] ext3 whiteout support

On Mon, 2007-05-14 at 15:14 +0530, Bharata B Rao wrote:
> From: Bharata B Rao <[email protected]>
> Subject: ext3 whiteout support
>
> Introduce whiteout support for ext3.
>
> Signed-off-by: Bharata B Rao <[email protected]>
> Signed-off-by: Jan Blunck <[email protected]>
> ---
> fs/ext3/dir.c | 2 -
> fs/ext3/namei.c | 62 ++++++++++++++++++++++++++++++++++++++++++++----
> fs/ext3/super.c | 11 +++++++-
> include/linux/ext3_fs.h | 5 +++
> 4 files changed, 72 insertions(+), 8 deletions(-)
>
> --- a/fs/ext3/dir.c
> +++ b/fs/ext3/dir.c
> @@ -29,7 +29,7 @@
> #include <linux/rbtree.h>
>
> static unsigned char ext3_filetype_table[] = {
> - DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
> + DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK, DT_WHT
> };
>
> static int ext3_readdir(struct file *, void *, filldir_t);
> --- a/fs/ext3/namei.c
> +++ b/fs/ext3/namei.c
> @@ -1071,6 +1071,7 @@ static unsigned char ext3_type_by_mode[S
> [S_IFIFO >> S_SHIFT] = EXT3_FT_FIFO,
> [S_IFSOCK >> S_SHIFT] = EXT3_FT_SOCK,
> [S_IFLNK >> S_SHIFT] = EXT3_FT_SYMLINK,
> + [S_IFWHT >> S_SHIFT] = EXT3_FT_WHT,
> };
>
> static inline void ext3_set_de_type(struct super_block *sb,
> @@ -1786,7 +1787,7 @@ out_stop:
> /*
> * routine to check that the specified directory is empty (for rmdir)
> */
> -static int empty_dir (struct inode * inode)
> +static int empty_dir (handle_t *handle, struct inode * inode)

Is there a reason for passing the handle? Couldn't you get it from
journal_current_handle() when it is needed to delete the whiteout?

> {
> unsigned long offset;
> struct buffer_head * bh;
> @@ -1848,8 +1849,28 @@ static int empty_dir (struct inode * ino
> continue;
> }
> if (le32_to_cpu(de->inode)) {
> - brelse (bh);
> - return 0;
> + /* If this is a whiteout, remove it */
> + if (de->file_type == EXT3_FT_WHT) {
> + unsigned long ino = le32_to_cpu(de->inode);
> + struct inode *tmp_inode = iget(inode->i_sb, ino);
> + if (!tmp_inode) {
> + brelse (bh);
> + return 0;
> + }
> +
> + if (ext3_delete_entry(handle, inode, de, bh)) {
> + iput(tmp_inode);
> + brelse (bh);
> + return 0;
> + }
> +
> + tmp_inode->i_ctime = inode->i_ctime;
> + tmp_inode->i_nlink--;
> + iput(tmp_inode);
> + } else {
> + brelse (bh);
> + return 0;
> + }
> }
> offset += le16_to_cpu(de->rec_len);
> de = (struct ext3_dir_entry_2 *)
> @@ -2031,7 +2052,7 @@ static int ext3_rmdir (struct inode * di
> goto end_rmdir;
>
> retval = -ENOTEMPTY;
> - if (!empty_dir (inode))
> + if (!empty_dir (handle, inode))
> goto end_rmdir;
>
> retval = ext3_delete_entry(handle, dir, de, bh);
> @@ -2060,6 +2081,36 @@ end_rmdir:
> return retval;
> }
>
> +static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
> +{
> + struct inode * inode;
> + int err, retries = 0;
> + handle_t *handle;
> +
> +retry:
> + handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
> + EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
> + 2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + if (IS_DIRSYNC(dir))
> + handle->h_sync = 1;
> +
> + inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
> + err = PTR_ERR(inode);
> + if (IS_ERR(inode))
> + goto out_stop;

Don't you need to call init_special_inode() here?
Or is this handled somewhere else?

> +
> + err = ext3_add_nondir(handle, dentry, inode);
> +
> +out_stop:
> + ext3_journal_stop(handle);
> + if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
> + goto retry;
> + return err;
> +}
> +

Thanks,
Badari

2007-05-14 20:17:32

by Andreas Dilger

Subject: Re: [RFC][PATCH 13/14] ext3 whiteout support

On May 14, 2007 15:14 +0530, Bharata B Rao wrote:
> #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008 /* Journal device */
> #define EXT3_FEATURE_INCOMPAT_META_BG 0x0010
> +#define EXT3_FEATURE_INCOMPAT_WHITEOUT 0x0020

Is this flag reserved with Ted? It isn't listed in the e2fsprogs repo.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-05-14 20:22:49

by Badari Pulavarty

Subject: Re: [RFC][PATCH 5/14] Introduce union stack

On Mon, 2007-05-14 at 15:10 +0530, Bharata B Rao wrote:
> From: Jan Blunck <[email protected]>
> Subject: Introduce union stack.
>
> Adds union stack infrastructure to the dentry structure and provides
> locking routines to walk the union stack.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Bharata B Rao <[email protected]>
...

> +/*
> + * This is a *I can't get no sleep* helper which is called when we try
> + * to access the struct fs_struct *fs field of a struct task_struct.
> + *
> + * Yes, this is possibly starving but we have to change root, altroot
> + * or pwd in the frequency of this while loop. Don't think that this
> + * happens really often ;)
> + *
> + * This is called while holding the rwlock_t fs->lock
> + *
> + * TODO: Unlocking side of union_lock_fs() needs 3 union_unlock()s.
> + * May be introduce union_unlock_fs().
> + *
> + * FIXME: This routine is used when the caller wants to dget one or
> + * more of fs->[root, altroot, pwd]. When the caller doesn't want to
> + * dget _all_ of these, it is strictly not necessary to get union_locks
> + * on all of these. Check.
> + */
> +static inline void union_lock_fs(struct fs_struct *fs)
> +{
> + int locked;
> +
> + while (fs) {
> + locked = union_trylock(fs->root);
> + if (!locked)
> + goto loop1;
> + locked = union_trylock(fs->altroot);
> + if (!locked)
> + goto loop2;
> + locked = union_trylock(fs->pwd);
> + if (!locked)
> + goto loop3;
> + break;
> + loop3:
> + union_unlock(fs->altroot);
> + loop2:
> + union_unlock(fs->root);
> + loop1:
> + read_unlock(&fs->lock);
> + UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
> + cpu_relax();
> + read_lock(&fs->lock);
> + continue;

Nit.. why "continue" ?

> + }
> + BUG_ON(!fs);

What's the use of BUG_ON() here? It would be more useful at the top of
the function.

Thanks,
Badari

2007-05-14 20:35:59

by Jan Blunck

Subject: Re: [RFC][PATCH 13/14] ext3 whiteout support

On 5/14/07, Andreas Dilger <[email protected]> wrote:
> On May 14, 2007 15:14 +0530, Bharata B Rao wrote:
> > #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008 /* Journal device */
> > #define EXT3_FEATURE_INCOMPAT_META_BG 0x0010
> > +#define EXT3_FEATURE_INCOMPAT_WHITEOUT 0x0020
>
> Is this flag reserved with Ted? It isn't listed in the e2fsprogs repo.
>

I don't know. I tried to contact him a few weeks ago but failed.
I guess he may no longer be reading the @thunk.org address that was
referenced in the e2fsprogs source I used.

Ted,
from ext2_fs.h I learn that the value 0x0020 is left unused.

#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
#define EXT3_FEATURE_INCOMPAT_EXTENTS 0x0040
#define EXT4_FEATURE_INCOMPAT_64BIT 0x0080

Is this intentional?
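
The gap can be checked mechanically. A small sketch, using only the three
values quoted above plus the proposed 0x0020; the X_ prefixed macro names and
the helper are illustrative, and nothing here says whether 0x0020 is actually
reserved in e2fsprogs:

```c
/* Incompat feature bits as quoted above (X_ prefix: local copies). */
#define X_INCOMPAT_META_BG   0x0010u
#define X_INCOMPAT_EXTENTS   0x0040u
#define X_INCOMPAT_64BIT     0x0080u
/* Value proposed by the patch; whether it is reserved is the open question. */
#define X_INCOMPAT_WHITEOUT  0x0020u

/* Return 1 if candidate shares no bit with the already assigned mask. */
static int x_bit_is_unassigned(unsigned int used, unsigned int candidate)
{
	return (used & candidate) == 0;
}
```

This only shows that 0x0020 does not collide with the three listed flags;
a reservation made elsewhere would not be visible to such a check.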

Cheers,
Jan

2007-05-14 20:41:41

by Jan Engelhardt

Subject: Re: [RFC][PATCH 3/14] Add the whiteout file type


On May 14 2007 15:09, Bharata B Rao wrote:
>
>A white-out stops the VFS from further lookups of the white-outs name and
>returns -ENOENT. This is the same behaviour as if the filename isn't
>found. This can be used in combination with union mounts to virtually
>delete (white-out) files by creating a file with this file type.
>
>Signed-off-by: Jan Blunck <[email protected]>
>Signed-off-by: Bharata B Rao <[email protected]>
>---
> include/linux/stat.h | 2 ++
> 1 files changed, 2 insertions(+)
>
>--- a/include/linux/stat.h
>+++ b/include/linux/stat.h
>@@ -10,6 +10,7 @@
> #if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)
>
> #define S_IFMT 00170000
>+#define S_IFWHT 0160000 /* whiteout */
> #define S_IFSOCK 0140000
> #define S_IFLNK 0120000
> #define S_IFREG 0100000

I wonder why 110000, 130000 or 150000 could not also be used?


Jan
--

2007-05-14 20:45:52

by Jan Engelhardt

Subject: Re: [RFC][PATCH 2/14] Add a new mount flag (MNT_UNION) for union mount


On May 14 2007 15:09, Bharata B Rao wrote:
>
>Introduce MNT_UNION, MS_UNION and FS_WHT flags. These are the necessary flags
>for doing
>
> mount /dev/hda3 /mnt -o union
>
>You need additional patches for util-linux for that to work.
>
>Signed-off-by: Jan Blunck <[email protected]>
>Signed-off-by: Bharata B Rao <[email protected]>
>---
>
>+ /* Unions couldn't be writable if the filesystem
>+ * doesn't know about whiteouts */
>+ err = -ENOTSUPP;
>+ if ((mnt_flags & MNT_UNION) &&
>+ !(newmnt->mnt_sb->s_flags & MS_RDONLY) &&
>+ !(newmnt->mnt_sb->s_type->fs_flags & FS_WHT))
>+ goto unlock;
>+

Maybe I am too biased towards unionfs/aufs, but if I have an {rw,rw} union
with whiteouts disabled (delete=all in unionfs speak), then FS_WHT
does not need to be supported. Your patches do not seem to do
delete=all semantics, do they?

Jan
--

2007-05-14 20:50:23

by Jan Engelhardt

Subject: Re: [RFC][PATCH 5/14] Introduce union stack


On May 14 2007 15:10, Bharata B Rao wrote:
>+struct union_info * union_alloc(void)

Ultimate nitpick: try s/\* /*/; (also elsewhere)

>+static inline void union_lock(struct dentry *dentry)
>+{
>+ if (unlikely(dentry && dentry->d_union)) {
>+ struct union_info *ui = dentry->d_union;
>+
>+ UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
>+ dentry->d_name.name, ui,
>+ atomic_read(&ui->u_count));
>+ __union_lock(dentry->d_union);
>+ }
>+}
>+
>+static inline void union_unlock(struct dentry *dentry)
>+{
>+ if (unlikely(dentry && dentry->d_union)) {
>+ struct union_info *ui = dentry->d_union;
>+
>+ UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
>+ dentry->d_name.name, ui,
>+ atomic_read(&ui->u_count));
>+ __union_unlock(dentry->d_union);
>+ }
>+}

Do we really need the unlikely()? d_union may be a new feature,
but it may very well be possible that someone puts the bigger
part of his/her files under a union. And when d_unions get
stable, people will probably begin making their root filesystem
unioned for livecds, and then unlikely() will rather be a
likely penalty. My stance: just
if (dentry != NULL && dentry->d_union != NULL)
This also goes for union_trylock.

>+static inline int union_trylock(struct dentry *dentry)
>+{
>+ int locked = 1;
>+
>+ if (unlikely(dentry && dentry->d_union)) {
>+ UM_DEBUG_LOCK("\"%s\" try locking %p (count=%d)\n",
>+ dentry->d_name.name, dentry->d_union,
>+ atomic_read(&dentry->d_union->u_count));
>+ BUG_ON(!atomic_read(&dentry->d_union->u_count));
>+ locked = mutex_trylock(&dentry->d_union->u_mutex);
>+ UM_DEBUG_LOCK("\"%s\" trylock %p %s\n", dentry->d_name.name,
>+ dentry->d_union,
>+ locked ? "succeeded" : "failed");
>+ }
>+ return (locked ? 1 : 0);
>+}

return locked ? 1 : 0
or even
return !!locked;
or since we're just passing up from mutex_trylock:
return locked;
?

>+/*
>+ * This is a *I can't get no sleep* helper

More commonly known as "insomnia". :)


Jan
--

2007-05-14 20:53:58

by Jan Engelhardt

Subject: Re: [RFC][PATCH 5/14] Introduce union stack


On May 14 2007 13:23, Badari Pulavarty wrote:
>> +static inline void union_lock_fs(struct fs_struct *fs)
>> +{
>> + int locked;
>> +
>> + while (fs) {
>> + locked = union_trylock(fs->root);
>> + if (!locked)
>> + goto loop1;
>> + locked = union_trylock(fs->altroot);
>> + if (!locked)
>> + goto loop2;
>> + locked = union_trylock(fs->pwd);
>> + if (!locked)
>> + goto loop3;
>> + break;
^^^^^^
>> + loop3:
>> + union_unlock(fs->altroot);
>> + loop2:
>> + union_unlock(fs->root);
>> + loop1:
>> + read_unlock(&fs->lock);
>> + UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
>> + cpu_relax();
>> + read_lock(&fs->lock);
>> + continue;
>
>Nit.. why "continue" ?

There's your break. Oh right, the continue is superfluous.

At the risk of using yet another goto, the conditional jump
could be turned into an unconditional one, since 'fs' will
remain valid. (Is the compiler smart enough to figure that out?)

>> + while (fs) {
loop0:
>> + locked = union_trylock(fs->root);
>> + if (!locked)
>> + goto loop1;
>> + locked = union_trylock(fs->altroot);
>> + if (!locked)
>> + goto loop2;
>> + locked = union_trylock(fs->pwd);
>> + if (!locked)
>> + goto loop3;
>> + break;
>> + loop3:
>> + union_unlock(fs->altroot);
>> + loop2:
>> + union_unlock(fs->root);
>> + loop1:
>> + read_unlock(&fs->lock);
>> + UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
>> + cpu_relax();
>> + read_lock(&fs->lock);
/* continue */
goto loop0;
}
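
For readers following along, the all-or-nothing pattern in that loop can be
sketched in user space. This is only a model under stated assumptions: the
toy_ names are hypothetical, a C11 atomic_flag stands in for the union mutex,
and the fs->lock drop/reacquire and cpu_relax() around each retry are left
out:

```c
#include <stdatomic.h>

/* Toy lock built on a C11 atomic flag; stands in for the union mutex. */
struct toy_lock {
	atomic_flag held;
};

/* Return 1 if the lock was acquired, 0 if it was already held. */
static int toy_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Try to take all three locks. On any failure, release what was already
 * taken (in reverse order) and report failure, so that the caller can
 * back off and retry from the start.
 */
static int toy_trylock_all(struct toy_lock *a, struct toy_lock *b,
			   struct toy_lock *c)
{
	if (!toy_trylock(a))
		return 0;
	if (!toy_trylock(b))
		goto drop_a;
	if (!toy_trylock(c))
		goto drop_b;
	return 1;

drop_b:
	toy_unlock(b);
drop_a:
	toy_unlock(a);
	return 0;
}
```

A caller would loop on toy_trylock_all() until it returns 1, which is the
same shape as the while loop quoted above, whether the back edge is spelled
"continue" or "goto".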


Jan
--

2007-05-14 20:59:35

by Jan Engelhardt

Subject: Re: [RFC][PATCH 6/14] Union-mount dentry reference counting


On May 14 2007 15:11, Bharata B Rao wrote:
>+void __union_check(struct dentry *dentry)
>+{
>+ if (likely(!(dentry->d_topmost || dentry->d_overlaid))) {

This could be simplified to

if (likely(!dentry->d_topmost && !dentry->d_overlaid))

(I prefer x==NULL over !x for pointers, though)
And then again, do not assume everything is (un)likely [also elsewhere].
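
The two spellings are equivalent by De Morgan's law. A tiny sketch, with
hypothetical helper names and plain pointers standing in for the dentry
fields:

```c
/* Original form from the patch: negate the OR of both pointers. */
static int not_in_union_original(const void *topmost, const void *overlaid)
{
	return !(topmost || overlaid);
}

/* Suggested form: AND of the two negations. */
static int not_in_union_simplified(const void *topmost, const void *overlaid)
{
	return !topmost && !overlaid;
}
```

Both return 1 only when neither pointer is set, so the change is purely one
of readability.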

>+ if (unlikely(dentry->d_union)) {
>+ printk(KERN_ERR "%s: \"%s\" stale union reference\n" \
>+ "\tdentry=%p, inode=%p, count=%d, u_count=%d\n",
>+ __FUNCTION__,
>+ dentry->d_name.name,
>+ dentry,
>+ dentry->d_inode,
>+ atomic_read(&dentry->d_count),
>+ atomic_read(&dentry->d_union->u_count));
>+ dump_stack();
>+ }
>+ return;
>+ }
>+
>--- a/net/unix/af_unix.c
>+++ b/net/unix/af_unix.c
>@@ -1082,8 +1082,11 @@ restart:
> newu->addr = otheru->addr;
> }
> if (otheru->dentry) {
>- newu->dentry = dget(otheru->dentry);
>+ /* Is this safe here? I don't know ... */

Figure it out :)


Jan
--

2007-05-14 22:40:28

by Badari Pulavarty

Subject: Re: [RFC][PATCH 5/14] Introduce union stack

On Mon, 2007-05-14 at 15:10 +0530, Bharata B Rao wrote:
> From: Jan Blunck <[email protected]>
> Subject: Introduce union stack.
>
> Adds union stack infrastructure to the dentry structure and provides
> locking routines to walk the union stack.
...

> --- /dev/null
> +++ b/include/linux/dcache_union.h
> @@ -0,0 +1,248 @@
> +/*
> + * VFS based union mount for Linux
> + *
> + * Copyright © 2004-2007 IBM Corporation
> + * Author(s): Jan Blunck ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + *
> + */
> +#ifndef __LINUX_DCACHE_UNION_H
> +#define __LINUX_DCACHE_UNION_H
> +#ifdef __KERNEL__
> +
> +#include <linux/union_debug.h>
> +#include <linux/fs_struct.h>
> +#include <asm/atomic.h>
> +#include <asm/semaphore.h>
> +
> +#ifdef CONFIG_UNION_MOUNT
> +
> +/*
> + * This is the union info object, that describes general information about this
> + * union directory
> + *
> + * u_mutex protects the union stack against modification. You can reach it
> + * through the d_union field in struct dentry. Hold it when you are walking
> + * or modifying the union stack!
> + */
> +struct union_info {
> + atomic_t u_count;
> + struct mutex u_mutex;
> +};
> +
> +/* allocate/de-allocate */
> +extern struct union_info *union_alloc(void);
> +extern struct union_info *union_get(struct union_info *);
> +extern void union_put(struct union_info *);
> +
> +/*
> + * These are the functions for locking a dentry's union. When one
> + * wants to acquire a dentry's union lock, use:
> + *
> + * - union_lock() when you can sleep,
> + * - union_lock_spinlock() when you are holding a spinlock (that
> + * you CAN safely give up and reacquire again)
> + * - union_lock_readlock() when you are holding a readlock (that
> + * you CAN safely give up and reacquire again)
> + *
> + * Otherwise get the union lock early before you enter your
> + * "no sleeping here" code.
> + *
> + * NOTES: union_info structure is reference counted using u_count member.
> + * union_get() and union_put() which get and put references on union_info
> + * should be done under union_info's u_mutex. Since the last union_put() frees
> + * the union_info structure itself it can't obviously be done under u_mutex.
> + * union_release() should be used in such cases (Eg. dput(), umount()) where
> + * union_info is disassociated from the dentries, and it becomes safe
> + * to free the union_info.
> + */
> +static inline void __union_lock(struct union_info *uinfo)
> +{
> + BUG_ON(!atomic_read(&uinfo->u_count));
> + mutex_lock(&uinfo->u_mutex);
> +}
> +
> +static inline void union_lock(struct dentry *dentry)
> +{
> + if (unlikely(dentry && dentry->d_union)) {
> + struct union_info *ui = dentry->d_union;
> +
> + UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
> + dentry->d_name.name, ui,
> + atomic_read(&ui->u_count));
> + __union_lock(dentry->d_union);
> + }
> +}
> +
> +static inline void __union_unlock(struct union_info *uinfo)
> +{
> + BUG_ON(!atomic_read(&uinfo->u_count));
> + mutex_unlock(&uinfo->u_mutex);
> +}
> +
> +static inline void union_unlock(struct dentry *dentry)
> +{
> + if (unlikely(dentry && dentry->d_union)) {
> + struct union_info *ui = dentry->d_union;
> +
> + UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
> + dentry->d_name.name, ui,
> + atomic_read(&ui->u_count));
> + __union_unlock(dentry->d_union);
> + }
> +}
> +
> +static inline void union_alloc_dentry(struct dentry *dentry)
> +{
> + spin_lock(&dentry->d_lock);
> + if (!dentry->d_union) {
> + dentry->d_union = union_alloc();
> + spin_unlock(&dentry->d_lock);
> + } else {
> + spin_unlock(&dentry->d_lock);
> + union_lock(dentry);
> + }
> +}
> +
> +static inline struct union_info *union_lock_and_get(struct dentry *dentry)
> +{
> + union_lock(dentry);
> + return union_get(dentry->d_union);
> +}
> +
> +/* Shouldn't be called with last reference to union_info */
> +static inline void union_put_and_unlock(struct union_info *uinfo)
> +{
> + union_put(uinfo);
> + __union_unlock(&uinfo->u_mutex);
^^^^^^^^^^^^^^^^^^^

It should be

__union_unlock(uinfo);

Thanks,
Badari



2007-05-15 06:00:27

by Jan Blunck

Subject: Re: [RFC][PATCH 3/14] Add the whiteout file type

On 5/14/07, Jan Engelhardt <[email protected]> wrote:
>
> On May 14 2007 15:09, Bharata B Rao wrote:
> >
> >A white-out stops the VFS from further lookups of the white-outs name and
> >returns -ENOENT. This is the same behaviour as if the filename isn't
> >found. This can be used in combination with union mounts to virtually
> >delete (white-out) files by creating a file with this file type.
> >
> >Signed-off-by: Jan Blunck <[email protected]>
> >Signed-off-by: Bharata B Rao <[email protected]>
> >---
> > include/linux/stat.h | 2 ++
> > 1 files changed, 2 insertions(+)
> >
> >--- a/include/linux/stat.h
> >+++ b/include/linux/stat.h
> >@@ -10,6 +10,7 @@
> > #if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)
> >
> > #define S_IFMT 00170000
> >+#define S_IFWHT 0160000 /* whiteout */
> > #define S_IFSOCK 0140000
> > #define S_IFLNK 0120000
> > #define S_IFREG 0100000
>
> I wonder why 110000, 130000 or 150000 could not also be used?
>

I used the S_IFWHT definition as it is referenced in stat(2). I
guess it would be a good idea to use the same flag on BSD and Linux.

As you can see in stat(2), other OSes use 011, 013 and 015.
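
To make the octal layout concrete, here is a minimal user-space sketch.
The values are the ones from the quoted stat.h hunk; the UM_ prefix, the
helper names and the 12-bit shift are assumptions made only for
illustration and are not part of the patch.

```c
#include <assert.h>

/* Octal file type bits from the quoted stat.h hunk; the UM_ prefix is
 * hypothetical and only avoids clashing with <sys/stat.h>. */
#define UM_S_IFMT   00170000
#define UM_S_IFWHT  0160000	/* whiteout, as proposed in the patch */
#define UM_S_IFSOCK 0140000
#define UM_S_IFLNK  0120000
#define UM_S_IFREG  0100000
#define UM_S_SHIFT  12		/* assumed: type bits start at bit 12 */

/* Nonzero if the type bits of mode select the whiteout type. */
static int um_is_whiteout(unsigned int mode)
{
	return (mode & UM_S_IFMT) == UM_S_IFWHT;
}

/* Index of the type in a table keyed by the shifted type bits. */
static unsigned int um_type_index(unsigned int mode)
{
	return (mode & UM_S_IFMT) >> UM_S_SHIFT;
}
```

With these values the regular file, symlink and socket types land on
indexes 010, 012 and 014, and the whiteout on 016. The odd slots 011, 013
and 015 are the ones the question was about.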

2007-05-15 06:19:35

by Bharata B Rao

Subject: Re: [RFC][PATCH 13/14] ext3 whiteout support

On Mon, May 14, 2007 at 01:16:57PM -0700, Badari Pulavarty wrote:
> On Mon, 2007-05-14 at 15:14 +0530, Bharata B Rao wrote:
> > From: Bharata B Rao <[email protected]>
> > Subject: ext3 whiteout support
> >
> > Introduce whiteout support for ext3.
> >
> > Signed-off-by: Bharata B Rao <[email protected]>
> > Signed-off-by: Jan Blunck <[email protected]>
> > ---
> > fs/ext3/dir.c | 2 -
> > fs/ext3/namei.c | 62 ++++++++++++++++++++++++++++++++++++++++++++----
> > fs/ext3/super.c | 11 +++++++-
> > include/linux/ext3_fs.h | 5 +++
> > 4 files changed, 72 insertions(+), 8 deletions(-)
> >
> > --- a/fs/ext3/dir.c
> > +++ b/fs/ext3/dir.c
> > @@ -29,7 +29,7 @@
> > #include <linux/rbtree.h>
> >
> > static unsigned char ext3_filetype_table[] = {
> > - DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
> > + DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK, DT_WHT
> > };
> >
> > static int ext3_readdir(struct file *, void *, filldir_t);
> > --- a/fs/ext3/namei.c
> > +++ b/fs/ext3/namei.c
> > @@ -1071,6 +1071,7 @@ static unsigned char ext3_type_by_mode[S
> > [S_IFIFO >> S_SHIFT] = EXT3_FT_FIFO,
> > [S_IFSOCK >> S_SHIFT] = EXT3_FT_SOCK,
> > [S_IFLNK >> S_SHIFT] = EXT3_FT_SYMLINK,
> > + [S_IFWHT >> S_SHIFT] = EXT3_FT_WHT,
> > };
> >
> > static inline void ext3_set_de_type(struct super_block *sb,
> > @@ -1786,7 +1787,7 @@ out_stop:
> > /*
> > * routine to check that the specified directory is empty (for rmdir)
> > */
> > -static int empty_dir (struct inode * inode)
> > +static int empty_dir (handle_t *handle, struct inode * inode)
>
> Is there a reason for passing the handle? Why couldn't you get it from
> journal_current_handle() if needed to delete the whiteout?

Yes, using journal_current_handle() is possible; I didn't realize it earlier.

>
> > {
> > unsigned long offset;
> > struct buffer_head * bh;
> > @@ -1848,8 +1849,28 @@ static int empty_dir (struct inode * ino
> > continue;
> > }
> > if (le32_to_cpu(de->inode)) {
> > - brelse (bh);
> > - return 0;
> > + /* If this is a whiteout, remove it */
> > + if (de->file_type == EXT3_FT_WHT) {
> > + unsigned long ino = le32_to_cpu(de->inode);
> > + struct inode *tmp_inode = iget(inode->i_sb, ino);
> > + if (!tmp_inode) {
> > + brelse (bh);
> > + return 0;
> > + }
> > +
> > + if (ext3_delete_entry(handle, inode, de, bh)) {
> > + iput(tmp_inode);
> > + brelse (bh);
> > + return 0;
> > + }
> > +
> > + tmp_inode->i_ctime = inode->i_ctime;
> > + tmp_inode->i_nlink--;
> > + iput(tmp_inode);
> > + } else {
> > + brelse (bh);
> > + return 0;
> > + }
> > }
> > offset += le16_to_cpu(de->rec_len);
> > de = (struct ext3_dir_entry_2 *)
> > @@ -2031,7 +2052,7 @@ static int ext3_rmdir (struct inode * di
> > goto end_rmdir;
> >
> > retval = -ENOTEMPTY;
> > - if (!empty_dir (inode))
> > + if (!empty_dir (handle, inode))
> > goto end_rmdir;
> >
> > retval = ext3_delete_entry(handle, dir, de, bh);
> > @@ -2060,6 +2081,36 @@ end_rmdir:
> > return retval;
> > }
> >
> > +static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
> > +{
> > + struct inode * inode;
> > + int err, retries = 0;
> > + handle_t *handle;
> > +
> > +retry:
> > + handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
> > + EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
> > + 2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
> > + if (IS_ERR(handle))
> > + return PTR_ERR(handle);
> > +
> > + if (IS_DIRSYNC(dir))
> > + handle->h_sync = 1;
> > +
> > + inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
> > + err = PTR_ERR(inode);
> > + if (IS_ERR(inode))
> > + goto out_stop;
>
> Don't you need to call init_special_inode() here ?
> Or this is handled somewhere else ?

Whiteout doesn't have any attributes and hence we are not explicitly
doing init_special_inode() on this. Accesses to whiteout files are trapped
at the VFS lookup itself and creation and deletion of whiteouts are handled
automatically by VFS. So I believe init_special_inode() isn't necessary
on a whiteout file.

Regards,
Bharata.

2007-05-15 06:21:23

by Bharata B Rao

Subject: Re: [RFC][PATCH 5/14] Introduce union stack

On Mon, May 14, 2007 at 03:40:57PM -0700, Badari Pulavarty wrote:
> On Mon, 2007-05-14 at 15:10 +0530, Bharata B Rao wrote:
< snip >
> > +
> > +/* Shouldn't be called with last reference to union_info */
> > +static inline void union_put_and_unlock(struct union_info *uinfo)
> > +{
> > + union_put(uinfo);
> > + __union_unlock(&uinfo->u_mutex);
> ^^^^^^^^^^^^^^^^^^^
>
> It should be
>
> __union_unlock(uinfo);
>

True, thanks for pointing this out. Will fix this.

Regards,
Bharata.
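
For reference, a minimal user-space sketch of the corrected call. The
struct layout, the fake_mutex type and the plain int reference count are
stand-ins assumed for illustration (the patch uses a kernel mutex and an
atomic_t); only the argument passed to __union_unlock() is the point.

```c
#include <assert.h>

/* Stand-in for the kernel's struct mutex (assumption for this sketch). */
struct fake_mutex {
	int locked;
};

/* Simplified stand-in for the patch's struct union_info. */
struct union_info {
	int u_count;			/* atomic_t in the patch */
	struct fake_mutex u_mutex;
};

/* Takes the union_info itself, not a pointer to its mutex. */
static void __union_unlock(struct union_info *uinfo)
{
	assert(uinfo->u_count > 0);	/* plays the role of BUG_ON() */
	uinfo->u_mutex.locked = 0;
}

/* Shouldn't be called with the last reference to union_info. */
static void union_put_and_unlock(struct union_info *uinfo)
{
	uinfo->u_count--;		/* stand-in for union_put() */
	__union_unlock(uinfo);		/* not &uinfo->u_mutex */
}
```

Passing &uinfo->u_mutex would hand a mutex pointer to a function that
expects a struct union_info pointer, so the u_count check would read the
wrong memory.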

2007-05-15 07:19:49

by Jan Blunck

Subject: Re: [RFC][PATCH 5/14] Introduce union stack

On 5/14/07, Jan Engelhardt <[email protected]> wrote:
>
> >+static inline void union_lock(struct dentry *dentry)
> >+{
> >+ if (unlikely(dentry && dentry->d_union)) {
> >+ struct union_info *ui = dentry->d_union;
> >+
> >+ UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
> >+ dentry->d_name.name, ui,
> >+ atomic_read(&ui->u_count));
> >+ __union_lock(dentry->d_union);
> >+ }
> >+}
> >+
> >+static inline void union_unlock(struct dentry *dentry)
> >+{
> >+ if (unlikely(dentry && dentry->d_union)) {
> >+ struct union_info *ui = dentry->d_union;
> >+
> >+ UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
> >+ dentry->d_name.name, ui,
> >+ atomic_read(&ui->u_count));
> >+ __union_unlock(dentry->d_union);
> >+ }
> >+}
>
> Do we really need the unlikely()? d_union may be a new feature,
> but it may very well be possible that someone puts the bigger
> part of his/her files under a union. And when d_unions get
> stable, people will probably begin making their root filesystem
> unioned for livecds, and then unlikely() will rather be a
> likely penalty. My stance: just
> if (dentry != NULL && dentry->d_union != NULL)
> This also goes for union_trylock.

Good question. My intention was that, since most of the union code
costs performance (stack traversal, readdir), I optimize for the normal
(not unified) case.

> >+static inline int union_trylock(struct dentry *dentry)
> >+{
> >+ int locked = 1;
> >+
> >+ if (unlikely(dentry && dentry->d_union)) {
> >+ UM_DEBUG_LOCK("\"%s\" try locking %p (count=%d)\n",
> >+ dentry->d_name.name, dentry->d_union,
> >+ atomic_read(&dentry->d_union->u_count));
> >+ BUG_ON(!atomic_read(&dentry->d_union->u_count));
> >+ locked = mutex_trylock(&dentry->d_union->u_mutex);
> >+ UM_DEBUG_LOCK("\"%s\" trylock %p %s\n", dentry->d_name.name,
> >+ dentry->d_union,
> >+ locked ? "succeeded" : "failed");
> >+ }
> >+ return (locked ? 1 : 0);
> >+}
>
> return locked ? 1 : 0
> or even
> return !!locked;
> or since we're just passing up from mutex_trylock:
> return locked;
> ?

Ahh, this seems to be a left-over of the semaphore -> mutex conversion.

> >+/*
> >+ * This is a *I can't get no sleep* helper
>
> More commonly known as "insomnia". :)
>

:)


Before I forget this: thank you (and Badari) for reviewing the patches!

Cheers,
Jan
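
A rough user-space sketch of the two points discussed above: the
branch hint and the simplified return value. The fake_mutex type and its
trylock helper are assumptions standing in for the kernel mutex API
(which returns 1 on success and 0 on contention); the struct layouts are
reduced to the fields used here.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hint built on the GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Stand-in for the kernel mutex (assumption for this sketch). */
struct fake_mutex {
	int locked;
};

/* Returns 1 if the lock was taken, 0 if it was already held. */
static int fake_mutex_trylock(struct fake_mutex *m)
{
	if (m->locked)
		return 0;
	m->locked = 1;
	return 1;
}

struct union_info {
	int u_count;
	struct fake_mutex u_mutex;
};

struct dentry {
	struct union_info *d_union;	/* NULL when not part of a union */
};

/* Dentries outside a union count as trivially locked. */
static int union_trylock(struct dentry *dentry)
{
	int locked = 1;

	if (unlikely(dentry && dentry->d_union))
		locked = fake_mutex_trylock(&dentry->d_union->u_mutex);
	return locked;			/* already 0 or 1, no ?: needed */
}
```

Whether the hint should stay is a policy question; dropping it only
changes the macro use in the condition, not the logic.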

2007-05-15 07:31:25

by Jan Engelhardt

Subject: Re: [RFC][PATCH 7/14] Union-mount mounting


On May 14 2007 15:11, Bharata B Rao wrote:
>
>TODO: bind and move mounts aren't yet supported with union mounts.

Are the semantics already set?

>@@ -294,6 +294,10 @@ static struct vfsmount *clone_mnt(struct
> if (!mnt)
> goto alloc_failed;
>
>+ /*
>+ * As of now, cloning of union mounted mnt isn't permitted.
>+ */
>+ BUG_ON(mnt->mnt_flags & MNT_UNION);

One, please avoid BUG_ONs. Now I am not sure if clone_mnt is called
as part of kthread creation when CLONE_FS is set. If so, get rid of
this one real fast. Also see chunk "@@ -1031,.. @@ do_loopback"
below.

	if (mnt->mnt_flags & MNT_UNION)
		goto return_einval;

or something.

>+#ifdef CONFIG_UNION_MOUNT
>+ struct union_info *uinfo = NULL;
>+#endif
>
> retval = security_sb_umount(mnt, flags);
> if (retval)
>@@ -685,6 +696,14 @@ static int do_umount(struct vfsmount *mn
> }
>
> down_write(&namespace_sem);
>+#ifdef CONFIG_UNION_MOUNT
>+ /*
>+ * Grab a reference to the union_info which gets detached
>+ * from the dentries in release_mounts().
>+ */
>+ if (mnt->mnt_flags & MNT_UNION)
>+ uinfo = union_lock_and_get(mnt->mnt_root);
>+#endif
> spin_lock(&vfsmount_lock);
> event++;
>
>@@ -699,6 +718,15 @@ static int do_umount(struct vfsmount *mn
> security_sb_umount_busy(mnt);
> up_write(&namespace_sem);
> release_mounts(&umount_list);
>+#ifdef CONFIG_UNION_MOUNT
>+ if (uinfo) {
>+ if (atomic_read(&uinfo->u_count) == 1)
>+ /* We are the last user of this union_info */
>+ union_release(uinfo);
>+ else
>+ union_put_and_unlock(uinfo);
>+ }
>+#endif
> return retval;
> }
>

Is it feasible to do this with somewhat less #if/#endif magic?

>@@ -1031,6 +1070,15 @@ static int do_loopback(struct nameidata
> if (err)
> return err;
>
>+ /*
>+ * bind mounting to or from union mounts is not supported
>+ */
>+ err = -EINVAL;
>+ if (nd->mnt->mnt_flags & MNT_UNION)
>+ goto out_unlocked;
>+ if (old_nd.mnt->mnt_flags & MNT_UNION)
>+ goto out_unlocked;
>+

Do the same in clone_mnt.

> down_write(&namespace_sem);
> err = -EINVAL;
> if (IS_MNT_UNBINDABLE(old_nd.mnt))
>@@ -1064,6 +1112,7 @@ static int do_loopback(struct nameidata
>
> out:
> up_write(&namespace_sem);
>+out_unlocked:
> path_release(&old_nd);
> return err;
> }

>+++ b/include/linux/fs.h
>@@ -1984,6 +1984,9 @@ static inline ino_t parent_ino(struct de
> /* kernel/fork.c */
> extern int unshare_files(void);
>
>+/* fs/union.c */
>+#include <linux/union.h>
>+
> /* Transaction based IO helpers */
>
> /*

This raises a big eyebrow. If linux/fs.h can compile without the
inclusion of linux/union.h, do not put linux/union.h in fs.h.

>+#ifdef CONFIG_UNION_MOUNT
>+
>+#include <linux/fs_struct.h>
>+
>+/* namespace stuff used at mount time */
>+extern void attach_mnt_union(struct vfsmount *, struct nameidata *);
>+extern void detach_mnt_union(struct vfsmount *, struct path *);

You do not need that #include I suppose. Just predeclare the structs.

struct path;
struct vfsmount;
extern void ...

Saves us the "compiler slurps in so many .h files" problem.


Jan
--
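
One common way to keep #ifdef blocks out of callers, sketched in
user-space C: the header provides empty inline stubs when the option is
off, and forward declarations replace the extra #include. The helper
union_release_or_put() and the function do_umount_sketch() are
hypothetical names for this illustration and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declarations are enough when only pointers are passed. */
struct union_info;
struct dentry;

#ifdef CONFIG_UNION_MOUNT
struct union_info *union_lock_and_get(struct dentry *dentry);
void union_release_or_put(struct union_info *uinfo);
#else
/* Stubs: with the option off, callers compile to nothing extra. */
static inline struct union_info *union_lock_and_get(struct dentry *dentry)
{
	(void)dentry;
	return NULL;
}

static inline void union_release_or_put(struct union_info *uinfo)
{
	(void)uinfo;
}
#endif

/* Caller without any #ifdef; is_union models the MNT_UNION flag test. */
static int do_umount_sketch(struct dentry *root, int is_union)
{
	struct union_info *uinfo = NULL;

	if (is_union)
		uinfo = union_lock_and_get(root);

	/* ... the actual unmount work would happen here ... */

	if (uinfo)
		union_release_or_put(uinfo);
	return 0;
}
```

The "last user or not" decision from the quoted hunk would then live
inside the helper rather than in do_umount() itself.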

2007-05-15 07:59:15

by Jan Engelhardt

Subject: Re: [RFC][PATCH 8/14] Union-mount lookup


On May 14 2007 15:12, Bharata B Rao wrote:
>
>+struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
>+{
>+ struct dentry *dentry;
>+ unsigned long seq;
>+
>+ do {
>+ seq = read_seqbegin(&rename_lock);
>+ dentry = __d_lookup_single(parent, name);
>+ if (dentry)
>+ break;
>+ } while (read_seqretry(&rename_lock, seq));
>+ return dentry;
>+}

Replace with tabs.

>+lookup_union:
>+ do {
>+ struct vfsmount *mnt = find_mnt(topmost);
>+ UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
>+ topmost->d_name.name, topmost->d_inode,
>+ mnt->mnt_devname);
>+ mntput(mnt);
>+ } while (0);

Why the extra do{}while? [elsewhere too]

>+ if (topmost->d_union) {
>+ union_lock_spinlock(topmost, &topmost->d_lock);
>+ }

Extra {} could go [elsewhere too].

>+ if (last->d_overlaid
>+ && (last->d_overlaid != dentry)) {

As can these extra () [elsewhere too].

>+static inline struct dentry * __lookup_hash_single(struct qstr *name, struct dentry *base, struct nameidata *nd)
>+{
>+ struct dentry *dentry;
>+ struct inode *inode;
>+ int err;
>+
>+ inode = base->d_inode;
>+
>+ err = permission(inode, MAY_EXEC, nd);
>+ dentry = ERR_PTR(err);
>+ if (err)
>+ goto out;
>+
>+ dentry = __lookup_hash_kern_single(name, base, nd);
>+out:
>+ return dentry;
>+}

This looks a little big for being inline.



Jan
--

2007-05-15 08:09:43

by Bharata B Rao

Subject: Re: [RFC][PATCH 2/14] Add a new mount flag (MNT_UNION) for union mount

On Mon, May 14, 2007 at 10:38:46PM +0200, Jan Engelhardt wrote:
>
> On May 14 2007 15:09, Bharata B Rao wrote:
> >
> >Introduce MNT_UNION, MS_UNION and FS_WHT flags. There are the necessary flags
> >for doing
> >
> > mount /dev/hda3 /mnt -o union
> >
> >You need additional patches for util-linux for that to work.
> >
> >Signed-off-by: Jan Blunck <[email protected]>
> >Signed-off-by: Bharata B Rao <[email protected]>
> >---
> >
> >+ /* Unions couldn't be writable if the filesystem
> >+ * doesn't know about whiteouts */
> >+ err = -ENOTSUPP;
> >+ if ((mnt_flags & MNT_UNION) &&
> >+ !(newmnt->mnt_sb->s_flags & MS_RDONLY) &&
> >+ !(newmnt->mnt_sb->s_type->fs_flags & FS_WHT))
> >+ goto unlock;
> >+
>
> Maybe I am too biased towards unionfs/aufs, but if I have an {rw,rw} union
> with whiteouts disabled (delete=all in unionfs speak), then FS_WHT
> does not need to be supported. Your patches do not seem to do
> delete=all semantics, do they?
>

No, they don't support delete=all semantics.

Correct me if I am wrong, but a key difference from unionfs is that in
union mounts only the topmost layer is writable (and it doesn't matter if
the lower layers are mounted rw), while this is not so with unionfs.
Hence when we delete any file which is present in a lower layer, we create
a whiteout for it on the topmost layer to mask it out.

So there can be two cases in union mounts:
1. A file exists in topmost layer and also in one or more lower layers. Deleting
the file would result in the top layer file being deleted and a whiteout being
created in the top layer.

2. A file exists in one or more of lower layers, but not in the topmost layer.
Deleting this file would result in just a whiteout being created in the
topmost layer.

Regards,
Bharata.
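
The two cases can be modelled with a small user-space sketch. All names
here (um_entry, um_name, um_unlink) are hypothetical and exist only for
this illustration; the behaviour for a file that exists only in the
topmost layer (plain removal, no whiteout) is an assumption, since the
description above does not cover it.

```c
#include <assert.h>

enum um_entry { UM_ABSENT, UM_FILE, UM_WHITEOUT };

#define UM_MAX_LAYERS 4

/* State of one name across the stack; layer[0] is the topmost layer. */
struct um_name {
	enum um_entry layer[UM_MAX_LAYERS];
	int nlayers;
};

/* Nonzero if a lower layer still provides the file. */
static int um_lower_has_file(const struct um_name *n)
{
	int i;

	for (i = 1; i < n->nlayers; i++) {
		if (n->layer[i] == UM_WHITEOUT)
			return 0;	/* masked further down */
		if (n->layer[i] == UM_FILE)
			return 1;
	}
	return 0;
}

/* Returns 0 on success, -1 if the name is not visible in the union. */
static int um_unlink(struct um_name *n)
{
	int lower = um_lower_has_file(n);

	if (n->layer[0] == UM_WHITEOUT)
		return -1;
	if (n->layer[0] != UM_FILE && !lower)
		return -1;

	/*
	 * Case 1: the top copy goes away and a whiteout takes its place.
	 * Case 2: nothing to remove on top, only a whiteout is created.
	 * Only the topmost layer is ever modified.
	 */
	n->layer[0] = lower ? UM_WHITEOUT : UM_ABSENT;
	return 0;
}
```

In both cases the lower layers are left untouched, which is why the
topmost filesystem has to know about whiteouts when it is writable.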

2007-05-15 08:31:59

by Jan Blunck

Subject: Re: [RFC][PATCH 13/14] ext3 whiteout support

On 5/15/07, Bharata B Rao <[email protected]> wrote:
> On Mon, May 14, 2007 at 01:16:57PM -0700, Badari Pulavarty wrote:
> > On Mon, 2007-05-14 at 15:14 +0530, Bharata B Rao wrote:
> > > From: Bharata B Rao <[email protected]>
> > >
> > > +static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
> > > +{
> > > + struct inode * inode;
> > > + int err, retries = 0;
> > > + handle_t *handle;
> > > +
> > > +retry:
> > > + handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
> > > + EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
> > > + 2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
> > > + if (IS_ERR(handle))
> > > + return PTR_ERR(handle);
> > > +
> > > + if (IS_DIRSYNC(dir))
> > > + handle->h_sync = 1;
> > > +
> > > + inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
> > > + err = PTR_ERR(inode);
> > > + if (IS_ERR(inode))
> > > + goto out_stop;
> >
> > Don't you need to call init_special_inode() here ?
> > Or this is handled somewhere else ?
>
> Whiteout doesn't have any attributes and hence we are not explicitly
> doing init_special_inode() on this. Accesses to whiteout files are trapped
> at the VFS lookup itself and creation and deletion of whiteouts are handled
> automatically by VFS. So I believe init_special_inode() isn't necessary
> on a whiteout file.
>

I added default whiteout file operations. So calling
init_special_inode() seems to make sense.

I know the ext2/ext3 whiteout patches are not really where they should
be. I plan to use a reserved inode number to reflect the fact that the
inode doesn't have any attributes itself. It makes sense to
have a singleton whiteout inode per superblock.

2007-05-15 12:06:40

by Eric Van Hensbergen

Subject: Re: [RFC][PATCH 2/14] Add a new mount flag (MNT_UNION) for union mount

On 5/15/07, Bharata B Rao <[email protected]> wrote:
>
> So there can be two cases in union mounts:
> 1. A file exists in topmost layer and also in one or more lower layers. Deleting
> the file would result in the top layer file being deleted and a whiteout being
> created in the top layer.
>
> 2. A file exists in one or more of lower layers, but not in the topmost layer.
> Deleting this file would result in just a whiteout being created in the
> topmost layer.
>

I'd imagine there is a third potential option, which I'll admit strays
a bit from the conventional UNIX semantic. If only one layer is
marked as writable, then any changes (including delete) only affect
that layer. I could imagine this would be useful in situations like
overlaying a sandbox on an otherwise read-only source code tree (you
might want to just get rid of a modification by removing your file and
have it replaced by the original underlying source).

I suppose a further extension would be to have multiple layers marked
as mutable and functions such as delete would affect all mutable
layers, but functions like create would only affect the top mutable
layer.

As an aside, perhaps it would be useful to mark the mutable layer at
mount time (instead of having it always be the top layer). Again this
could lead to some optional non-conventional file system semantics,
but it's proven useful in Plan 9 union mount semantics and it seems a
fairly trivial extension to what you currently have.

-eric

2007-05-15 12:53:25

by Jan Blunck

Subject: Re: [RFC][PATCH 2/14] Add a new mount flag (MNT_UNION) for union mount

On 5/15/07, Eric Van Hensbergen <[email protected]> wrote:
> On 5/15/07, Bharata B Rao <[email protected]> wrote:
> >
> > So there can be two cases in union mounts:
> > 1. A file exists in topmost layer and also in one or more lower layers. Deleting
> > the file would result in the top layer file being deleted and a whiteout being
> > created in the top layer.
> >
> > 2. A file exists in one or more of lower layers, but not in the topmost layer.
> > Deleting this file would result in just a whiteout being created in the
> > topmost layer.
> >
>
> I'd imagine there is a third potential option, which I'll admit strays
> a bit from the conventional UNIX semantic.

And this is exactly why I designed it as is: NOT to do anything that
anyone can imagine. This is the reason why most of the filesystem
unification approaches fail: too much unnecessary complexity.

> If only one layer is
> marked as writable, then any changes (including delete) only affect
> that layer. I could imagine this would be useful in situations like
> overlaying a sandbox on an otherwise read-only source code tree (you
> might want to just get rid of a modification by removing your file and
> have it replaced by the original underlying source).

You just unmount the topmost writable layer and replace it by a clean
tmpfs. There you go. Removing a file and not creating a whiteout
totally breaks what the user of the filesystem expects.

BTW: Undoing your changes to source code is solved by many
applications: patch, backup files, version control systems ...

> I suppose a further extension would be to have multiple layers marked
> as mutable and functions such as delete would affect all mutable
> layers, but functions like create would only affect the top mutable
> layer.

You want per-syscall policies for which layer is affected??? Crazy
idea, but how do you do this without letting the user know which
layer the file is in, in the first place? This doesn't work without a
major extension of the VFS/syscall interface.

> As an aside, perhaps it would be useful to mark the mutable layer at
> mount time (instead of having it always be the top layer). Again this
> could lead to some optional non-conventional file system semantics,
> but it's proven useful in Plan 9 union mount semantics and it seems a
> fairly trivial extension to what you currently have.

I don't think so. Plan9 union directory semantics are different. Plan9
is different.

I'll stick to the conventional UNIX semantics since I think that
otherwise it will just make this thing too complex. First we need to
solve the existing problems like union readdir, bind mounts and
whiteout handling. If this is all solved and you are disappointed with
what you can achieve with only the topmost layer writable, feel free to
come up with trivial extensions :)

Cheers,
Jan

2007-05-15 14:01:03

by Trond Myklebust

Subject: Re: [RFC][PATCH 8/14] Union-mount lookup

On Mon, 2007-05-14 at 15:12 +0530, Bharata B Rao wrote:
> From: Jan Blunck <[email protected]>
> Subject: Union-mount lookup
>
> Modifies the vfs lookup routines to work with union mounted directories.
>
> The existing lookup routines generally lookup for a pathname only in the
> topmost or given directory. The changed versions of the lookup routines
> search for the pathname in the entire union mounted stack. Also they have been
> modified to setup the union stack during lookup from dcache cache and from
> real_lookup().
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Bharata B Rao <[email protected]>
> ---
> fs/dcache.c | 16 +
> fs/namei.c | 78 +++++-
> fs/namespace.c | 35 ++
> fs/union.c | 598 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/dcache.h | 17 +
> include/linux/namei.h | 4
> include/linux/union.h | 49 ++++
> 7 files changed, 786 insertions(+), 11 deletions(-)
>
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1286,7 +1286,7 @@ struct dentry * d_lookup(struct dentry *
> return dentry;
> }
>
> -struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
> +struct dentry * __d_lookup_single(struct dentry *parent, struct qstr *name)
> {
> unsigned int len = name->len;
> unsigned int hash = name->hash;
> @@ -1371,6 +1371,20 @@ out:
> return dentry;
> }
>
> +struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
> +{
> + struct dentry *dentry;
> + unsigned long seq;
> +
> + do {
> + seq = read_seqbegin(&rename_lock);
> + dentry = __d_lookup_single(parent, name);
> + if (dentry)
> + break;
> + } while (read_seqretry(&rename_lock, seq));
> + return dentry;
> +}
> +
> /**
> * d_validate - verify dentry provided from insecure source
> * @dentry: The dentry alleged to be valid child of @dparent
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -374,6 +374,33 @@ void release_open_intent(struct nameidat
> }
>
> static inline struct dentry *
> +do_revalidate_single(struct dentry *dentry, struct nameidata *nd)
> +{
> + int status = dentry->d_op->d_revalidate(dentry, nd);
> + if (unlikely(status <= 0)) {

d_revalidate() returns a 0 or 1 result, not an error.

> + /*
> + * The dentry failed validation.
> + * If d_revalidate returned 0 attempt to invalidate
> + * the dentry otherwise d_revalidate is asking us
> + * to return a fail status.
> + */
> + if (!status) {
> + if (!d_invalidate(dentry)) {
> + __dput_single(dentry);
> + dentry = NULL;
> + }
> + } else {
> + __dput_single(dentry);
> + dentry = ERR_PTR(status);

See above

> + }
> + }
> + return dentry;
> +}
> +
> +/*
> + * FIXME: We need a union aware revalidate here!
> + */
> +static inline struct dentry *
> do_revalidate(struct dentry *dentry, struct nameidata *nd)
> {
> int status = dentry->d_op->d_revalidate(dentry, nd);
> @@ -403,16 +430,16 @@ do_revalidate(struct dentry *dentry, str
> */
> static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
> {
> - struct dentry * dentry = __d_lookup(parent, name);
> + struct dentry *dentry = __d_lookup_single(parent, name);
>
> /* lockess __d_lookup may fail due to concurrent d_move()
> * in some unrelated directory, so try with d_lookup
> */
> if (!dentry)
> - dentry = d_lookup(parent, name);
> + dentry = d_lookup_single(parent, name);
>
> if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
> - dentry = do_revalidate(dentry, nd);
> + dentry = do_revalidate_single(dentry, nd);
>
> return dentry;
> }
> @@ -465,7 +492,7 @@ ok:
> * make sure that nobody added the entry to the dcache in the meantime..
> * SMP-safe
> */
> -static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
> +struct dentry * real_lookup_single(struct dentry *parent, struct qstr *name, struct nameidata *nd)
> {
> struct dentry * result;
> struct inode *dir = parent->d_inode;
> @@ -485,7 +512,7 @@ static struct dentry * real_lookup(struc
> *
> * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
> */
> - result = d_lookup(parent, name);
> + result = d_lookup_single(parent, name);
> if (!result) {
> struct dentry * dentry = d_alloc(parent, name);
> result = ERR_PTR(-ENOMEM);
> @@ -506,7 +533,7 @@ static struct dentry * real_lookup(struc
> */
> mutex_unlock(&dir->i_mutex);
> if (result->d_op && result->d_op->d_revalidate) {
> - result = do_revalidate(result, nd);
> + result = do_revalidate_single(result, nd);
> if (!result)
> result = ERR_PTR(-ENOENT);
> }
> @@ -699,7 +726,7 @@ static int __follow_mount(struct path *p
> return res;
> }
>
> -static void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
> +void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
> {
> while (d_mountpoint(*dentry)) {
> struct vfsmount *mounted = lookup_mnt(*mnt, *dentry);
> @@ -773,6 +800,7 @@ static __always_inline void follow_dotdo
> nd->mnt = parent;
> }
> follow_mount(&nd->mnt, &nd->dentry);
> + follow_union_mount(&nd->mnt, &nd->dentry);
> }
>
> /*
> @@ -784,7 +812,15 @@ static int do_lookup(struct nameidata *n
> struct path *path)
> {
> struct vfsmount *mnt = nd->mnt;
> - struct dentry *dentry = __d_lookup(nd->dentry, name);
> + struct dentry *dentry;
> +
> + UM_DEBUG_UID("lookup \"%s\" in \"%s\" (inode=%p,dev=%s)\n",
> + name->name,
> + nd->dentry->d_name.name,
> + nd->dentry->d_inode,
> + nd->mnt->mnt_devname);

Ugh! Please don't pollute generic VFS code with this sort of private
debugging crap.

> +
> + dentry = __d_lookup(nd->dentry, name);
>
> if (!dentry)
> goto need_lookup;
> @@ -793,7 +829,17 @@ static int do_lookup(struct nameidata *n
> done:
> path->mnt = mnt;
> path->dentry = dentry;
> +
> + if (nd->dentry->d_sb != dentry->d_sb)
> + path->mnt = find_mnt(dentry);
> +
> __follow_mount(path);
> + follow_union_mount(&path->mnt, &path->dentry);
> +
> + UM_DEBUG_UID("found \"%s\" (inode=%p,dev=%s)\n",
> + path->dentry->d_name.name,
> + path->dentry->d_inode,
> + path->mnt->mnt_devname);
> return 0;
>
> need_lookup:
> @@ -838,6 +884,9 @@ static fastcall int __link_path_walk(con
> if (nd->depth)
> lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
>
> + UM_DEBUG_UID("begin walking for %s\n", name);
> + follow_union_mount(&nd->mnt, &nd->dentry);
> +
> /* At this point we know we have a real path component. */
> for(;;) {
> unsigned long hash;
> @@ -931,6 +980,7 @@ static fastcall int __link_path_walk(con
> last_with_slashes:
> lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> last_component:
> + UM_DEBUG_UID("last component %s\n", this.name);
> /* Clear LOOKUP_CONTINUE iff it was previously unset */
> nd->flags &= lookup_flags | ~LOOKUP_CONTINUE;
> if (lookup_flags & LOOKUP_PARENT)
> @@ -1266,7 +1316,15 @@ int __user_path_lookup_open(const char _
> return err;
> }
>
> -static inline struct dentry *__lookup_hash_kern(struct qstr *name, struct dentry *base, struct nameidata *nd)
> +/*
> + * NOTE: On union mounts it is important that the overlaid dentries are
> + * correct. Therefore we need to follow mounts. Take a look at
> + * __lookup_hash_kern_union() how it is done.
> + *
> + * Called with union already locked (before the parent inode is locked !!!)
> + */
> +struct dentry * __lookup_hash_kern_single(struct qstr *name,
> + struct dentry *base, struct nameidata *nd)
> {
> struct dentry *dentry;
> struct inode *inode;
> @@ -1298,6 +1356,8 @@ static inline struct dentry *__lookup_ha
> dput(new);
> }
> out:
> + UM_DEBUG_UID("name=\"%s\", inode=%p\n",
> + dentry->d_name.name, dentry->d_inode);
> return dentry;
> }
>
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -133,6 +133,41 @@ struct vfsmount *lookup_mnt(struct vfsmo
> return child_mnt;
> }
>
> +/*
> + * find_mnt - find a vfsmount struct
> + * @dentry: a dentry
> + *
> + * This searches the namespace for a given dentries
> + * vfsmount struct. This is used by union-mount.
> + */
> +struct vfsmount * find_mnt(struct dentry *dentry)
> +{
> + struct list_head *tmp;
> + struct vfsmount *p, *mnt = NULL;
> +
> + down_read(&namespace_sem);
> + spin_lock(&vfsmount_lock);
> + if (list_empty(&current->nsproxy->mnt_ns->list)) {
> + spin_unlock(&vfsmount_lock);
> + up_read(&namespace_sem);
> + return NULL;
> + }
> + list_for_each(tmp, &current->nsproxy->mnt_ns->list) {
> + p = list_entry(tmp, struct vfsmount, mnt_list);
> + if (dentry->d_sb == p->mnt_sb) {
> + mnt = mntget(p);
> + break;
> + }
> + }
> + spin_unlock(&vfsmount_lock);
> + up_read(&namespace_sem);
> +
> + BUG_ON(!mnt);
> +// UM_DEBUG_UID("found %s/%p in %s\n", dentry->d_name.name,
> +// dentry->d_inode, mnt->mnt_devname);

^^^^ Definitely remove

> + return mnt;
> +}
> +
> static inline int check_mnt(struct vfsmount *mnt)
> {
> return mnt->mnt_ns == current->nsproxy->mnt_ns;
> --- a/fs/union.c
> +++ b/fs/union.c
> @@ -370,3 +370,601 @@ void detach_mnt_union(struct vfsmount *m
> * union stack */
> __dput(path->dentry);
> }
> +
> +static noinline int revalidate_union(struct dentry * dentry)
^^^^^^^^ this is just as bad as 'inline'. I thought we had all
agreed to leave it up to the compiler to optimise.
> +{
> + union_check(dentry);
> +
> + spin_lock(&dcache_lock);
> + spin_lock(&dentry->d_lock);
> + if (atomic_read(&dentry->d_count) < 2) {
> + UM_DEBUG_DCACHE("dentry unused, count=%d\n",
> + atomic_read(&dentry->d_count));
> + __d_drop(dentry);
> + spin_unlock(&dentry->d_lock);
> + spin_unlock(&dcache_lock);
> + return 0;
> + }
> + spin_unlock(&dentry->d_lock);
> + spin_unlock(&dcache_lock);
> +
> + return 1;
> +}
> +
> +static noinline void replace_union_info(struct dentry *dentry,
> + struct union_info *lock)
> +{
> + struct dentry *tmp = dentry;
> + struct union_info *old_lock = union_get(dentry->d_union);
> +
> + BUG_ON(!lock);
> + BUG_ON(dentry->d_union == lock);
> +
> + while (tmp) {
> + spin_lock(&tmp->d_lock);
> + union_put(tmp->d_union);
> + tmp->d_union = union_get(lock);
> + spin_unlock(&tmp->d_lock);
> + tmp = tmp->d_overlaid;
> + }
> +
> + BUG_ON(atomic_read(&old_lock->u_count) != 1);
> + union_put(old_lock);
> + return;
> +}
> +
> +static void __dput_from_to(struct dentry *from, struct dentry *to,
> + struct union_info *lock)
> +{
> + struct dentry *next = from;
> + struct union_info *mylock = union_get(from->d_union);
> +
> + while (next) {
> + struct dentry *tmp = next;
> + next = next->d_overlaid;
> +
> + UM_DEBUG_UID("dput_all dentry=\"%s\", inode=\"%p\"\n",
> + tmp->d_name.name, tmp->d_inode);
> +
> + if (lock) {
> + spin_lock(&tmp->d_lock);
> + tmp->d_topmost = NULL;
> + tmp->d_overlaid = NULL;
> + union_put(tmp->d_union);
> + tmp->d_union = NULL;
> + spin_unlock(&tmp->d_lock);
> + }
> +
> + __dput_single(tmp);
> +
> + if (tmp == to)
> + break;
> + }
> +
> + UM_DEBUG_LOCK("\"??\" unlocking union %p\n", lock);
> + mutex_unlock(&mylock->u_mutex);
> + union_put(mylock);
> +}
> +
> +/*
> + * Lookup for the @name in the dentry cache. Look through the lower layers
> + * of the parent's union stack and build a union stack for the child if
> + * necessary.
> + * TODO: This shares a considerable amount of code with __lookup_union().
> + */
> +struct dentry * __d_lookup_union(struct dentry *base, struct qstr *name)
> +{
> + struct dentry *parent = base->d_overlaid;
> + struct dentry *dentry = NULL;
> + struct dentry *topmost;
> + struct dentry *last;
> + struct qstr this;
> + struct union_info *lock = NULL;
> + int err;
> +
> + union_lock(base);
> + topmost = __d_lookup_single(base, name);
> + last = topmost;
> +
> + /*
> + * - If dcache lookup returns a NULL dentry, return to force a real
> + * lookup. Union mount version of real lookup will end up doing real or
> + * dcache lookup for this in the lower layers also. OR
> + * - If parent is not a union mounted directory, we are done
> + * with the lookup, return.
> + */
> + if (!topmost || !base->d_overlaid)
> + goto out;
> +
> + this.name = name->name;
> + this.len = name->len;
> + this.hash = name->hash;
> +
> + /*
> + * If dcache lookup returned a non-negative dentry from the top layer,
> + * continue the lookup in to the lower layers and re-build the union
> + * stack if necessary.
> + */
> + if (topmost->d_inode)
> + goto lookup_union;
> +
> + /*
> + * dcache lookup in the top layer returned a negative dentry. Look
> + * through the lower layers to find the first non-negative dentry.
> + */
> + while (parent) {
> + if (parent->d_op && parent->d_op->d_hash) {
> + err = parent->d_op->d_hash(parent, &this);
> + if (err < 0) {
> + __dput_single(topmost);
> + topmost = NULL;
> + goto out;
> + }
> + }
> + dentry = __d_lookup_single(parent, &this);
> + /*
> + * Force a real lookup if parts of the union stack are not in
> + * dcache
> + */
> + if (!dentry) {
> + __dput_single(topmost);
> + topmost = NULL;
> + goto out;
> + }
> + if (dentry->d_inode)
> + break;
> + __dput_single(dentry);
> + dentry = NULL;
> + parent = parent->d_overlaid;
> + }
> +
> + if (!dentry)
> + goto out;
> +
> + __dput_single(topmost);
> + topmost = dentry;
> + last = dentry;
> +lookup_union:
> + do {
> + struct vfsmount *mnt = find_mnt(topmost);
> + UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
> + topmost->d_name.name, topmost->d_inode,
> + mnt->mnt_devname);
> + mntput(mnt);
> + } while (0);
> +
> + /* If this is not a directory, no need to look beyond this layer */
> + if (!S_ISDIR(topmost->d_inode->i_mode))
> + goto out;
> +
> + if (!revalidate_union(topmost)) {
> + __dput_single(topmost);
> + topmost = NULL;
> + goto out;
> + }
> +
> + spin_lock(&topmost->d_lock);
> + if (topmost->d_union) {
> + union_lock_spinlock(topmost, &topmost->d_lock);
> + }
> + spin_unlock(&topmost->d_lock);
> +
> + parent = topmost->d_parent->d_overlaid;
> + while (parent) {
> + if (parent->d_op && parent->d_op->d_hash) {
> + err = parent->d_op->d_hash(parent, &this);
> + if (err < 0) {
> + UM_DEBUG("failed to hash the qstr\n");
> + goto dput_all;
> + }
> + }
> + dentry = __d_lookup_single(parent, &this);
> + if (!dentry) {
> + /*
> + * No dentry for this name in this lower layer.
> + * CHECK: Why return like this? Shouldn't we look
> + * for this name in the next lower layer ?
> + */
> + __dput_single(dentry);
> + goto dput_all;
> + }
> +
> + if (!dentry->d_inode) {
> + __dput_single(dentry);
> + parent = parent->d_overlaid;
> + continue;
> + }
> + if (!S_ISDIR(dentry->d_inode->i_mode)) {
> + __dput_single(dentry);
> + break;
> + }
> + if (last->d_overlaid
> + && (last->d_overlaid != dentry)) {
> + printk(KERN_ERR "%s: strange stack layout " \
> + "(\"%s\" overlays \"%s\")\n",
> + __FUNCTION__, last->d_name.name,
> + dentry->d_name.name);
> + dump_stack();
> + __dput_single(dentry);
> + goto dput_all;
> + }
> + spin_lock(&topmost->d_lock);
> + if (!topmost->d_union) {
> + UM_DEBUG_LOCK("allocate union for \"%s\"\n",
> + topmost->d_name.name);
> + topmost->d_union = union_alloc();
> + lock = topmost->d_union;
> + }
> + spin_unlock(&topmost->d_lock);
> + spin_lock(&dentry->d_lock);
> + if (!dentry->d_union)
> + dentry->d_union = union_get(topmost->d_union);
> + spin_unlock(&dentry->d_lock);
> + if (dentry->d_union != topmost->d_union) {
> + union_lock(dentry);
> + replace_union_info(topmost, dentry->d_union);
> + }
> + dentry->d_topmost = topmost;
> + last->d_overlaid = dentry;
> + last = dentry;
> + parent = parent->d_overlaid;
> + }
> +
> + spin_lock(&topmost->d_lock);
> + if (topmost->d_union && atomic_read(&topmost->d_union->u_count) == 1) {
> + union_put(topmost->d_union);
> + topmost->d_union = NULL;
> + } else
> + union_unlock(topmost);
> + spin_unlock(&topmost->d_lock);
> +out:
> + union_unlock(base);
> + return topmost;
> +
> +dput_all:
> + __dput_from_to(topmost, last, lock);
> + union_unlock(base);
> + return NULL;
> +}
> +
> +/*
> + * FIXME: export this from fs/namei.c ???
> + */
> +extern int follow_mount(struct vfsmount **, struct dentry **);
> +extern struct dentry * __lookup_hash_kern_single(struct qstr *, struct dentry *,
> + struct nameidata *);
> +extern struct dentry * real_lookup_single(struct dentry *, struct qstr *,
> + struct nameidata *);
> +
> +static inline void copy_nd(struct nameidata *old_nd, struct nameidata *new_nd)
> +{
> + if (old_nd) {
> + new_nd->last.name = NULL; /* handled in __link_path_walk */
> + new_nd->last.len = 0;
> + new_nd->last.hash = 0;
> + new_nd->flags = old_nd->flags;
> + new_nd->um_flags = 0; /* ditto */
> + new_nd->last_type = -1; /* ditto */
> + new_nd->depth = 0; /* handled in do_follow_link */
> + memcpy(&new_nd->intent, &old_nd->intent, sizeof(new_nd->intent));
> + }
> +}
> +
> +/*
> + * This is called when a dentry's parent is union-mounted and we have
> + * to look up the overlaid dentries. The lookup starts at the parent's
> + * first overlaid dentry of the given dentry. Negative dentries are
> + * ignored and not included in the overlaid list.
> + *
> + * If we reach a dentry with restricted access, we just stop the lookup
> + * because we shouldn't see through that dentry. Same thing for dentry
> + * type mismatch and whiteouts.
> + *
> + * FIXME:
> + * - handle DT_WHT
> + * - handle union stacks in use
> + * - handle union stacks mounted upon union stacks
> + * - avoid unnecessary allocations of union locks
> + */
> +static int __lookup_union(struct dentry *topmost, struct qstr *name,
> + struct nameidata *__nd)
> +{
> + struct dentry *parent;
> + struct dentry *last;
> + struct dentry *dentry;
> + unsigned int hash = name->hash;
> + struct nameidata nd;
> + int err;
> +
> + /* we may also be called via lookup_hash with a NULLed nd argument */
> + copy_nd(__nd, &nd);
> +
> + spin_lock(&topmost->d_lock);
> + if (topmost->d_union) {
> + union_lock_spinlock(topmost, &topmost->d_lock);
> + }
> + spin_unlock(&topmost->d_lock);
> +
> + parent = topmost->d_parent->d_overlaid;
> + last = topmost;
> +
> + while (parent) {
> + /*
> + * the hash could be changed in the last
> + * __lookup_hash_single() so we need to reset it here
> + */
> + name->hash = hash;
> + nd.dentry = __dget(parent);
> + nd.mnt = find_mnt(parent);
> +
> + mutex_lock(&parent->d_inode->i_mutex);
> + dentry = __lookup_hash_single(name, parent,
> + __nd ? &nd : NULL);
> + mutex_unlock(&parent->d_inode->i_mutex);
> + if (IS_ERR(dentry)) {
> + err = PTR_ERR(dentry);
> + goto out;
> + }
> +
> + if (!dentry->d_inode) {
> + __dput_single(dentry);
> + goto loop;
> + }
> +
> + if (!S_ISDIR(dentry->d_inode->i_mode)) {
> + __dput_single(dentry);
> + err = 0;
> + goto out;
> + }
> +
> + /* Now we know, we found something real */
> + follow_mount(&nd.mnt, &dentry);
> +
> + do {
> + struct vfsmount *mnt = find_mnt(dentry);
> + UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
> + dentry->d_name.name, dentry->d_inode,
> + mnt->mnt_devname);
> + mntput(mnt);
> + } while (0);
> +
> + if (last->d_overlaid && (last->d_overlaid != dentry)) {
> + printk(KERN_ERR "%s: strange stack layout " \
> + "(\"%s\" overlays \"%s\")\n",
> + __FUNCTION__, last->d_name.name,
> + dentry->d_name.name);
> + dump_stack();
> + __dput_single(dentry);
> + /* lets try to make a clean ending */
> + last->d_overlaid = NULL;
> + err = -EFAULT; // FIXME: something better?
> + goto out;
> + }
> +
> + spin_lock(&topmost->d_lock);
> + if (!topmost->d_union) {
> + UM_DEBUG_LOCK("allocate union for \"%s\"\n",
> + topmost->d_name.name);
> + topmost->d_union = union_alloc();
> + }
> + spin_unlock(&topmost->d_lock);
> +
> + spin_lock(&dentry->d_lock);
> + if (!dentry->d_union)
> + dentry->d_union = union_get(topmost->d_union);
> + spin_unlock(&dentry->d_lock);
> +
> + if (topmost->d_union != dentry->d_union) {
> + union_lock(dentry);
> + replace_union_info(topmost, dentry->d_union);
> + }
> +
> + dentry->d_topmost = topmost;
> + last->d_overlaid = dentry;
> + last = dentry;
> + loop:
> + __dput(nd.dentry);
> + mntput(nd.mnt);
> + parent = parent->d_overlaid;
> + }
> +
> + err = 0;
> + union_unlock(topmost);
> + return err;
> +out:
> + __dput(nd.dentry);
> + mntput(nd.mnt);
> + union_unlock(topmost);
> + return err;
> +}
> +
> +/*
> + * Union mount version of real_lookup().
> + * Looks through the lower layers of the union and builds a union stack
> + * if necessary (i.e., for directories)
> + * TODO: This routine is almost similar to __lookup_hash_kern_union() which
> + * uses __lookup_hash_kern_single() (instead of real_lookup_single()). Check
> + * if some code can be shared here.
> + */
> +struct dentry * real_lookup_union(struct dentry *base, struct qstr *name,
> + struct nameidata *__nd)
> +{
> + struct dentry *parent;
> + struct dentry *topmost;
> + unsigned int hash = name->hash;
> + struct nameidata nd;
> + int err;
> +
> + union_lock(base);
> + topmost = real_lookup_single(base, name, __nd);
> + if (IS_ERR(topmost))
> + goto out;
> +
> + /*
> + * If real_lookup returns a valid dentry from the topmost layer,
> + * continue the lookup into the lower layers and build a union
> + * stack in case of directories.
> + */
> + if (topmost->d_inode) {
> + parent = base;
> + goto lookup_union;
> + }
> +
> + copy_nd(__nd, &nd);
> +
> + /*
> + * real_lookup returned a negative dentry, walk through the lower
> + * layers looking for the given name.
> + */
> + parent = base->d_overlaid;
> + while (parent) {
> + struct dentry * dentry;
> +
> + name->hash = hash;
> + nd.dentry = __dget(parent);
> + nd.mnt = find_mnt(parent);
> +
> + dentry = real_lookup_single(nd.dentry, name, &nd);
> + __dput(nd.dentry);
> + mntput(nd.mnt);
> + if (IS_ERR(dentry))
> + goto out;
> +
> + /*
> + * If a dentry is found in a lower layer, continue to lookup
> + * in the lower layers and build a union stack if needed.
> + */
> + if (dentry->d_inode) {
> + __dput_single(topmost);
> + topmost = dentry;
> + goto lookup_union;
> + }
> + __dput_single(dentry);
> + parent = parent->d_overlaid;
> + }
> +
> +out:
> + union_unlock(base);
> + return topmost;
> +
> +lookup_union:
> + /*
> + * If our parent doesn't have a union stack or we are not a directory,
> + * the lookup ends here.
> + */
> + if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
> + goto out;
> +
> + do {
> + struct vfsmount *mnt = find_mnt(topmost);
> + UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
> + topmost->d_name.name, topmost->d_inode,
> + mnt->mnt_devname);
> + mntput(mnt);
> + } while (0);
> +
> + name->hash = hash;
> + err = __lookup_union(topmost, name, &nd);
> + if (err) {
> + dput(topmost);
> + topmost = ERR_PTR(err);
> + }
> + goto out;
> +}
> +
> +/*
> + * Union mount version of __lookup_hash_kern().
> + * Looks through the lower layers of the union and builds a union stack
> + * if necessary (i.e., for directories)
> + */
> +struct dentry * __lookup_hash_kern_union(struct qstr *name,
> + struct dentry *base, struct nameidata *__nd)
> +{
> + struct dentry *topmost;
> + struct dentry *parent;
> + unsigned int hash = name->hash;
> + struct nameidata nd;
> + int err;
> +
> + union_lock(base);
> + topmost = __lookup_hash_kern_single(name, base, __nd);
> + if (IS_ERR(topmost))
> + goto out;
> +
> + if (topmost->d_inode) {
> + parent = base;
> + goto lookup_union;
> + }
> +
> + copy_nd(__nd, &nd);
> +
> + parent = base->d_overlaid;
> + while (parent) {
> + struct dentry *dentry;
> +
> + name->hash = hash;
> + nd.dentry = __dget(parent);
> + nd.mnt = find_mnt(parent);
> +
> + mutex_lock(&parent->d_inode->i_mutex);
> + dentry = __lookup_hash_kern_single(name, nd.dentry, &nd);
> + mutex_unlock(&parent->d_inode->i_mutex);
> + __dput(nd.dentry);
> + mntput(nd.mnt);
> + if (IS_ERR(dentry))
> + goto out;
> +
> + if (dentry->d_inode) {
> + __dput_single(topmost);
> + topmost = dentry;
> + goto lookup_union;
> + }
> + __dput_single(dentry);
> + parent = parent->d_overlaid;
> + }
> +
> +out:
> + union_unlock(base);
> + return topmost;
> +
> +lookup_union:
> + if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
> + goto out;
> +
> + do {
> + struct vfsmount *mnt = find_mnt(topmost);
> + UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
> + topmost->d_name.name, topmost->d_inode,
> + mnt->mnt_devname);
> + mntput(mnt);
> + } while (0);
> +
> + name->hash = hash;
> + err = __lookup_union(topmost, name, &nd);
> + if (err) {
> + dput(topmost);
> + topmost = ERR_PTR(err);
> + }
> + goto out;
> +}
> +
> +int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)
> +{
> + int res = 0;
> +
> + while ((*dentry)->d_topmost) {
> + struct dentry *d_tmp = dget((*dentry)->d_topmost);
> + struct vfsmount *m_tmp = find_mnt((*dentry)->d_topmost);
> +
> + UM_DEBUG_UID("name=\"%s\", follow union from %s to %s\n",
> + (*dentry)->d_name.name, (*mnt)->mnt_devname,
> + m_tmp->mnt_devname);
> + mntput(*mnt);
> + *mnt = m_tmp;
> + dput(*dentry);
> + *dentry = d_tmp;
> + res = 1;
> + }
> +
> + return res;
> +}
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -294,9 +294,23 @@ extern void d_move(struct dentry *, stru
>
> /* appendix may either be NULL or be used for transname suffixes */
> extern struct dentry * d_lookup(struct dentry *, struct qstr *);
> -extern struct dentry * __d_lookup(struct dentry *, struct qstr *);
> +extern struct dentry * d_lookup_single(struct dentry *, struct qstr *);
> +extern struct dentry * __d_lookup_single(struct dentry *, struct qstr *);
> extern struct dentry * d_hash_and_lookup(struct dentry *, struct qstr *);
>
> +#ifdef CONFIG_UNION_MOUNT
> +extern struct dentry * __d_lookup_union(struct dentry *, struct qstr *);
> +#endif
> +
> +static inline struct dentry * __d_lookup(struct dentry *parent, struct qstr *name)
> +{
> +#ifdef CONFIG_UNION_MOUNT
> + return __d_lookup_union(parent, name);
> +#else
> + return __d_lookup_single(parent, name);
> +#endif
> +}
> +
> /* validate "insecure" dentry pointer */
> extern int d_validate(struct dentry *, struct dentry *);
>
> @@ -467,6 +481,7 @@ static inline int d_mountpoint(struct de
> extern struct vfsmount *lookup_mnt(struct vfsmount *, struct dentry *);
> extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
> extern struct dentry *lookup_create(struct nameidata *nd, int is_dir);
> +extern struct vfsmount *find_mnt(struct dentry *);
>
> extern int sysctl_vfs_cache_pressure;
>
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -20,6 +20,7 @@ struct nameidata {
> struct vfsmount *mnt;
> struct qstr last;
> unsigned int flags;
> + unsigned int um_flags;
> int last_type;
> unsigned depth;
> char *saved_names[MAX_NESTED_LINKS + 1];
> @@ -40,6 +41,9 @@ struct path {
> */
> enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
>
> +#define LAST_UNION 0x01
> +#define LAST_LOWLEVEL 0x02
> +
> /*
> * The bitmask for a lookup event:
> * - follow links at the end
> --- a/include/linux/union.h
> +++ b/include/linux/union.h
> @@ -17,17 +17,66 @@
> #ifdef CONFIG_UNION_MOUNT
>
> #include <linux/fs_struct.h>
> +#include <linux/dcache_union.h>
>
> /* namespace stuff used at mount time */
> extern void attach_mnt_union(struct vfsmount *, struct nameidata *);
> extern void detach_mnt_union(struct vfsmount *, struct path *);
>
> +/* lookup stuff */
> +extern int follow_union_mount(struct vfsmount **, struct dentry **);
> +extern struct dentry * real_lookup_union(struct dentry *, struct qstr *,
> + struct nameidata *);
> +extern struct dentry * __lookup_hash_kern_union(struct qstr *, struct dentry *,
> + struct nameidata *);
> +
> #else /* CONFIG_UNION_MOUNT */
>
> #define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
> #define detach_mnt_union(mnt,nd) do { /* empty */ } while (0)
> +#define follow_union_mount(x,y) do { /* empty */ } while (0)
>
> #endif /* CONFIG_UNION_MOUNT */
>
> +extern struct dentry * real_lookup_single(struct dentry *, struct qstr *,
> + struct nameidata *);
> +extern struct dentry * __lookup_hash_kern_single(struct qstr *, struct dentry *,
> + struct nameidata *);
> +
> +static inline struct dentry * real_lookup(struct dentry *parent, struct qstr *name, struct nameidata *nd)
> +{
> +#ifdef CONFIG_UNION_MOUNT
> + return real_lookup_union(parent, name, nd);
> +#else
> + return real_lookup_single(parent, name, nd);
> +#endif
> +}
> +
> +static inline struct dentry * __lookup_hash_kern(struct qstr *name, struct dentry *base, struct nameidata *nd)
> +{
> +#ifdef CONFIG_UNION_MOUNT
> + return __lookup_hash_kern_union(name, base, nd);
> +#else
> + return __lookup_hash_kern_single(name, base, nd);
> +#endif
> +}
> +
> +static inline struct dentry * __lookup_hash_single(struct qstr *name, struct dentry *base, struct nameidata *nd)
> +{
> + struct dentry *dentry;
> + struct inode *inode;
> + int err;
> +
> + inode = base->d_inode;
> +
> + err = permission(inode, MAY_EXEC, nd);
> + dentry = ERR_PTR(err);
> + if (err)
> + goto out;
> +
> + dentry = __lookup_hash_kern_single(name, base, nd);
> +out:
> + return dentry;
> +}
> #endif /* __KERNEL __ */
> #endif /* __LINUX_UNION_H */

2007-05-15 14:29:19

by Theodore Ts'o

Subject: Re: [RFC][PATCH 13/14] ext3 whiteout support

On Mon, May 14, 2007 at 10:35:47PM +0200, Jan Blunck wrote:
>
> I don't know. I tried to contact him a few weeks ago but failed.
> I guess he isn't reading the @thunk.org address anymore, which was
> referenced in the e2fsprogs source I used.

I do pay more attention to mail sent to the @mit.edu address, but I
don't recall any reservation requests sent to my @thunk.org address
(nope, don't see it; maybe it got caught in my spam filters? When did
you send it, precisely?)

> Ted,
> from ext2_fs.h I learn that the value 0x0020 is left unused.
>
> #define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
> #define EXT3_FEATURE_INCOMPAT_EXTENTS 0x0040
> #define EXT4_FEATURE_INCOMPAT_64BIT 0x0080
>
> Is this intentional?

I think 0x0020 was used at one point, but I can't remember what it was
used for. It's almost certainly not in wide use at this point, so
it's probably safe to use it; OTOH it's not like we're running out of
bits at this point.

- Ted
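
For readers unfamiliar with incompat feature bits, here is a minimal sketch of how such a bit is typically checked at mount time: a filesystem with any incompat bit the driver does not understand must not be mounted. The three constants are the ones quoted above; `FEATURE_INCOMPAT_WHITEOUT` at 0x0020, `SUPPORTED_INCOMPAT` and `can_mount()` are hypothetical names used only for illustration, not the actual reservation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values quoted from ext2_fs.h in the message above. */
#define EXT2_FEATURE_INCOMPAT_META_BG  0x0010
#define EXT3_FEATURE_INCOMPAT_EXTENTS  0x0040
#define EXT4_FEATURE_INCOMPAT_64BIT    0x0080

/* Hypothetical: the unused 0x0020 slot under discussion. */
#define FEATURE_INCOMPAT_WHITEOUT      0x0020

/* Incompat bits this imaginary driver understands. */
#define SUPPORTED_INCOMPAT \
	(EXT2_FEATURE_INCOMPAT_META_BG | FEATURE_INCOMPAT_WHITEOUT)

/* A filesystem may be mounted only if every incompat bit is understood. */
static bool can_mount(uint32_t incompat_flags)
{
	return (incompat_flags & ~(uint32_t)SUPPORTED_INCOMPAT) == 0;
}
```

With this check, a driver that does not know the whiteout bit would refuse the filesystem rather than misread whiteout entries.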

2007-05-16 04:57:18

by Bharata B Rao

Subject: Re: [RFC][PATCH 7/14] Union-mount mounting

On Tue, May 15, 2007 at 09:29:39AM +0200, Jan Engelhardt wrote:
>
> On May 14 2007 15:11, Bharata B Rao wrote:
> >
> >TODO: bind and move mounts aren't yet supported with union mounts.
>
> Are the semantics already set?

Not yet.

>
> >@@ -294,6 +294,10 @@ static struct vfsmount *clone_mnt(struct
> > if (!mnt)
> > goto alloc_failed;
> >
> >+ /*
> >+ * As of now, cloning of union mounted mnt isn't permitted.
> >+ */
> >+ BUG_ON(mnt->mnt_flags & MNT_UNION);
>
> One, please avoid BUG_ONs. I am not sure whether clone_mnt is called
> as part of kthread creation when CLONE_FS is set. If so, get rid of
> this one real fast. Also see chunk "@@ -1031,.. @@ do_loopback"
> below.

It looks like CLONE_NEWNS, not CLONE_FS, ends up calling clone_mnt.

>
> if(mnt->mnt_flags & MNT_UNION)
> goto return_einval;
>
> or something.

Will do this for now, but eventually we need to get this working sanely anyway.
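
A minimal user-space sketch of the change agreed on above: report an error to the caller instead of halting with BUG_ON(). `struct mount_model`, the flag value and `clone_mount_model()` are stand-ins invented for this sketch; the real clone_mnt() has a different signature.

```c
#include <errno.h>

#define MNT_UNION 0x01   /* hypothetical flag value for this sketch */

struct mount_model {     /* stand-in for struct vfsmount */
	int mnt_flags;
};

/*
 * Refuse to clone a union mount by returning -EINVAL to the caller
 * instead of stopping the machine. On success *out receives the copy.
 */
static int clone_mount_model(const struct mount_model *old,
			     struct mount_model *out)
{
	if (old->mnt_flags & MNT_UNION)
		return -EINVAL;
	*out = *old;
	return 0;
}
```

The caller can then turn the error into a failed mount or unshare request, which is recoverable, whereas a BUG_ON() is not.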

>
> >+#ifdef CONFIG_UNION_MOUNT
> >+ struct union_info *uinfo = NULL;
> >+#endif
> >
> > retval = security_sb_umount(mnt, flags);
> > if (retval)
> >@@ -685,6 +696,14 @@ static int do_umount(struct vfsmount *mn
> > }
> >
> > down_write(&namespace_sem);
> >+#ifdef CONFIG_UNION_MOUNT
> >+ /*
> >+ * Grab a reference to the union_info which gets detached
> >+ * from the dentries in release_mounts().
> >+ */
> >+ if (mnt->mnt_flags & MNT_UNION)
> >+ uinfo = union_lock_and_get(mnt->mnt_root);
> >+#endif
> > spin_lock(&vfsmount_lock);
> > event++;
> >
> >@@ -699,6 +718,15 @@ static int do_umount(struct vfsmount *mn
> > security_sb_umount_busy(mnt);
> > up_write(&namespace_sem);
> > release_mounts(&umount_list);
> >+#ifdef CONFIG_UNION_MOUNT
> >+ if (uinfo) {
> >+ if (atomic_read(&uinfo->u_count) == 1)
> >+ /* We are the last user of this union_info */
> >+ union_release(uinfo);
> >+ else
> >+ union_put_and_unlock(uinfo);
> >+ }
> >+#endif
> > return retval;
> > }
> >
>
> Is it feasible to do this with less #if/#endif magic?
>

Will try. We need union_info here which is available only with
CONFIG_UNION_MOUNT.
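
One common way to reduce the #if/#endif blocks at call sites is to provide empty inline stubs when the option is off. The sketch below assumes hypothetical helpers `um_grab()` and `um_release()` and a stand-in `struct mount_model`; it is not the patch's API, only an illustration of the pattern.

```c
#include <stddef.h>

struct union_info;                 /* opaque; only pointers are used here */
struct mount_model { int mnt_flags; };

#ifdef CONFIG_UNION_MOUNT
/* Real versions, implemented elsewhere when the option is enabled. */
struct union_info *um_grab(struct mount_model *mnt);
void um_release(struct union_info *uinfo);
#else
/* Stubs: callers need no #ifdef, and the calls can be inlined away. */
static inline struct union_info *um_grab(struct mount_model *mnt)
{
	(void)mnt;
	return NULL;
}
static inline void um_release(struct union_info *uinfo)
{
	(void)uinfo;
}
#endif

/* The call site reads the same with or without CONFIG_UNION_MOUNT. */
static int umount_model(struct mount_model *mnt)
{
	struct union_info *uinfo = um_grab(mnt);

	/* ... the actual unmount work would happen here ... */
	um_release(uinfo);
	return 0;
}
```

Because only a pointer to `struct union_info` is handled, the forward declaration is enough and the type can stay private to the union mount code.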

> >@@ -1031,6 +1070,15 @@ static int do_loopback(struct nameidata
> > if (err)
> > return err;
> >
> >+ /*
> >+ * bind mounting to or from union mounts is not supported
> >+ */
> >+ err = -EINVAL;
> >+ if (nd->mnt->mnt_flags & MNT_UNION)
> >+ goto out_unlocked;
> >+ if (old_nd.mnt->mnt_flags & MNT_UNION)
> >+ goto out_unlocked;
> >+
>
> Do the same in clone_mnt.
>
> > down_write(&namespace_sem);
> > err = -EINVAL;
> > if (IS_MNT_UNBINDABLE(old_nd.mnt))
> >@@ -1064,6 +1112,7 @@ static int do_loopback(struct nameidata
> >
> > out:
> > up_write(&namespace_sem);
> >+out_unlocked:
> > path_release(&old_nd);
> > return err;
> > }
>
> >+++ b/include/linux/fs.h
> >@@ -1984,6 +1984,9 @@ static inline ino_t parent_ino(struct de
> > /* kernel/fork.c */
> > extern int unshare_files(void);
> >
> >+/* fs/union.c */
> >+#include <linux/union.h>
> >+
> > /* Transaction based IO helpers */
> >
> > /*
>
> This raises a big eyebrow. If linux/fs.h can compile without the
> inclusion of linux/union.h, do not put linux/union.h in fs.h.
>

OK, it is better to include union.h in the .c files that need it.

> >+#ifdef CONFIG_UNION_MOUNT
> >+
> >+#include <linux/fs_struct.h>
> >+
> >+/* namespace stuff used at mount time */
> >+extern void attach_mnt_union(struct vfsmount *, struct nameidata *);
> >+extern void detach_mnt_union(struct vfsmount *, struct path *);
>
> You do not need that #include I suppose. Just predeclare the structs.
>
> struct path;
> struct vfsmount;
> extern void ...
>
> Saves us the "compiler slurps in so many .h files" problem.

Sure. And thanks for the review.
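
The forward-declaration suggestion can be sketched in a single file as follows. The "header part" needs no #include because it only mentions pointers; the full definitions are needed only where members are accessed. The struct layouts, the flag value and `um_is_union()` are simplified stand-ins made up for this sketch.

```c
/* Header part: forward declarations are enough for pointer parameters. */
struct path;
struct vfsmount;
int um_is_union(const struct vfsmount *mnt, const struct path *path);

/* Source part: full definitions are needed only where members are used. */
struct vfsmount { int mnt_flags; };          /* simplified stand-ins */
struct path { const struct vfsmount *mnt; };

#define MNT_UNION 0x01                        /* hypothetical value */

int um_is_union(const struct vfsmount *mnt, const struct path *path)
{
	return path->mnt == mnt && (mnt->mnt_flags & MNT_UNION) != 0;
}
```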

Regards,
Bharata.

2007-05-16 05:01:43

by Bharata B Rao

Subject: Re: [RFC][PATCH 8/14] Union-mount lookup

On Tue, May 15, 2007 at 09:57:24AM +0200, Jan Engelhardt wrote:
>
> On May 14 2007 15:12, Bharata B Rao wrote:
> >
> >+struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
> >+{
> >+ struct dentry *dentry;
> >+ unsigned long seq;
> >+
> >+ do {
> >+ seq = read_seqbegin(&rename_lock);
> >+ dentry = __d_lookup_single(parent, name);
> >+ if (dentry)
> >+ break;
> >+ } while (read_seqretry(&rename_lock, seq));
> >+ return dentry;
> >+}
>
> Replace with tabs.

This is copied from fs/dcache.c:d_lookup(), and the whitespace came from there.
That is no excuse, though; will fix.

>
> >+lookup_union:
> >+ do {
> >+ struct vfsmount *mnt = find_mnt(topmost);
> >+ UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
> >+ topmost->d_name.name, topmost->d_inode,
> >+ mnt->mnt_devname);
> >+ mntput(mnt);
> >+ } while (0);
>
> Why the extra do{}while? [elsewhere too]

Not sure, maybe to get a scope in which to define 'mnt' here. Jan?

>
> >+ if (topmost->d_union) {
> >+ union_lock_spinlock(topmost, &topmost->d_lock);
> >+ }
>
> Extra {} could go [elsewhere too].
>
> >+ if (last->d_overlaid
> >+ && (last->d_overlaid != dentry)) {
>
> As can these extra () [elsewhere too].
>

Sure, will fix all these.

> >+static inline struct dentry * __lookup_hash_single(struct qstr *name, struct dentry *base, struct nameidata *nd)
> >+{
> >+ struct dentry *dentry;
> >+ struct inode *inode;
> >+ int err;
> >+
> >+ inode = base->d_inode;
> >+
> >+ err = permission(inode, MAY_EXEC, nd);
> >+ dentry = ERR_PTR(err);
> >+ if (err)
> >+ goto out;
> >+
> >+ dentry = __lookup_hash_kern_single(name, base, nd);
> >+out:
> >+ return dentry;
> >+}
>
> This looks a little big for being inline.

ok.

Regards,
Bharata.

2007-05-16 07:59:22

by Jan Engelhardt

Subject: Re: [RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems


On May 14 2007 15:13, Bharata B Rao wrote:
>+
>+ if (flag & 0x2) {
>+ error = union_copyup(nd, flag);
>+ if (error)
>+ goto exit;
>+ }

What I dislike (and that also goes for fs/namei.c and such) is that they use
numeric constants, i.e. 0x2. That seems error-prone. Could this (and
the in-kernel users of 0x1/0x2/0x4) be turned into named constants?
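
A small sketch of what named constants could look like here, assuming 0x1 and 0x2 denote read and write access as in the lookup code's internal flag encoding. The names `UM_ACC_READ`, `UM_ACC_WRITE` and `needs_copyup()` are hypothetical and chosen only for this example.

```c
#include <stdbool.h>

/* Hypothetical names for the bare 0x1/0x2 values seen in the patch. */
enum um_access {
	UM_ACC_READ  = 0x1,
	UM_ACC_WRITE = 0x2,
};

/* A copy-up is needed only when the open asks for write access. */
static bool needs_copyup(int flag)
{
	return (flag & UM_ACC_WRITE) != 0;
}
```

The test `flag & UM_ACC_WRITE` then documents itself, where `flag & 0x2` requires the reader to remember the encoding.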

>+ if (IS_DEADDIR(parent->d_inode))
>+ goto error;
>+ err = -EACCES; /* shouldn't it be ENOSYS? */

I do not think so. ENOSYS means "syscall not implemented", but the syscall
is implemented. A missing ->i_op does not imply ENOSYS.

Though, now that I grep through fs/*, I see that namei.c also
has that comment "shouldn't it be ENOSYS", so it's all at odds.

>+ if (!parent->d_inode->i_op || !parent->d_inode->i_op->create)
>+ goto error;

>+struct dentry * union_create_topmost(struct nameidata *nd, struct dentry *old)
>+{
>+ struct dentry *dentry;
>+ struct dentry *parent = nd->dentry;
>+
>+ UM_DEBUG_UID("dentry=%s\n", old->d_name.name);
>+
>+ BUG_ON(parent->d_sb == old->d_sb);
>+ if (!S_ISREG(old->d_inode->i_mode)) {
>+ UM_DEBUG("This filetype isn't supported!\n");

Does that mean I cannot create block devices, etc.?

>+int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
>+ struct dentry *new_dentry, struct vfsmount *new_mnt)
>+{
>+ int ret;
>+ size_t size;
>+ loff_t offset;
>+ struct file *old_file, *new_file;
>[...]
>+ size = i_size_read(old_file->f_path.dentry->d_inode);
>+ if (((size_t)size != size) || ((ssize_t)size != size)) {

No need to cast; size is already size_t, which makes the left part of
the condition superfluous.
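
A hedged sketch of how such a range check can be made meaningful: test the 64-bit value before narrowing it, since a check made after the assignment to size_t only sees the already truncated value. This assumes the size arrives as a signed 64-bit quantity (as loff_t is); `size_fits()` is a hypothetical helper, and SIZE_MAX / 2 stands in for the largest ssize_t value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check a 64-bit file size before narrowing it. Returns false when the
 * size is negative or would not fit into a signed size type; otherwise
 * stores the narrowed value in *out.
 */
static bool size_fits(int64_t file_size, size_t *out)
{
	if (file_size < 0 || (uint64_t)file_size > SIZE_MAX / 2)
		return false;
	*out = (size_t)file_size;
	return true;
}
```

A caller would return -EFBIG when `size_fits()` fails, matching the error the patch already uses.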

>+ ret = -EFBIG;
>+ goto fput_new;
>+ }
>+


Jan
--

2007-05-16 08:07:54

by Jan Engelhardt

Subject: Re: [RFC][PATCH 11/14] VFS whiteout handling


On May 14 2007 15:13, Bharata B Rao wrote:
>
>+/*
>+ * Dummy default file-operations:
>+ * Never open a whiteout. This is always a bug.
>+ */
>+static int whiteout_no_open(struct inode *irrelevant, struct file *dontcare)
>+{
>+ printk("Attemp to open a whiteout!\n");
>+ WARN_ON(1);
>+ return -ENXIO;
>+}

Eww, really WARN_ON? ENXIO should be enough.
(FWIW, spello fix -> "Attempted to open a whiteout")

>+static int
>+vfs_unlink_whiteout(struct inode *dir, struct dentry **dp)

One line.

>+static int
>+__hash_one_len(const char *name, int len, struct qstr *this)
>+{
>+ unsigned long hash;
>+ unsigned int c;

I doubt it will make a difference - "unsigned char c";

>+
>+ hash = init_name_hash();
>+ while (len--) {
>+ c = *(const unsigned char *)name++;
>+ if (c == '/' || c == '\0')
>+ return -EINVAL;
>+ hash = partial_name_hash(c, hash);
>+ }
>+ this->hash = end_name_hash(hash);
>+ return 0;
>+}
>+
>+static int do_unlink_whiteouts(struct dentry *dentry)
>+{
>[...]
>+ if (!IS_DEADDIR(inode)) {
>+ res = file->f_op->readdir(file, (void *)file->f_path.dentry,
>+ unlink_whiteouts_filldir);

I think this cast can go.

>--- a/fs/readdir.c
>+++ b/fs/readdir.c
>@@ -148,6 +148,11 @@ static int filldir(void * __buf, const c
> unsigned long d_ino;
> int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));
>
>+#ifdef CONFIG_UNION_MOUNT
>+ if (d_type == DT_WHT)
>+ return 0;
>+#endif /* CONFIG_UNION_MOUNT */
>+

DT_WHT is always available, isn't it? In that case, wouldn't it
be simpler to get rid of the #if/#endif?
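
To illustrate the point, here is a user-space model of a filldir-style callback that skips whiteouts unconditionally, with no #ifdef around the test. The constants carry the same values as DT_REG and DT_WHT in <dirent.h>; `struct listing` and `filldir_model()` are invented for this sketch.

```c
#include <stddef.h>

#define DT_REG_MODEL 8    /* same value as DT_REG */
#define DT_WHT_MODEL 14   /* same value as DT_WHT */

struct listing { size_t count; };   /* hypothetical callback state */

/* filldir-style callback: silently skip whiteouts, count everything else. */
static int filldir_model(struct listing *buf, const char *name,
			 unsigned int d_type)
{
	(void)name;
	if (d_type == DT_WHT_MODEL)
		return 0;          /* not an error, just not reported */
	buf->count++;
	return 0;
}
```

On a kernel without union mounts no filesystem would emit DT_WHT entries, so the extra comparison is harmless there.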

>@@ -233,6 +238,11 @@ static int filldir64(void * __buf, const
> struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
> int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));
>
>+#ifdef CONFIG_UNION_MOUNT
>+ if (d_type == DT_WHT)
>+ return 0;
>+#endif /* CONFIG_UNION_MOUNT */
>+
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;

>--- a/fs/union.c
>+++ b/fs/union.c
>
> #define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
>@@ -54,6 +58,8 @@ extern int union_relookup_topmost(struct
> #define union_copy_file(dentry1,mnt1,dentry2,mnt2) ({ (0); })
> #define union_copyup(x,y) ({ (0); })
> #define union_relookup_topmost(x,y) ({ (0); })
>+#define union_dir_is_empty(x) ({ (1); })
>+#define present_in_lower(x, y) ({ (0); })

These could just be
#define something 0
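
As a sketch of that suggestion, the two stubs from the quoted hunk can be written as plain constants. The macro names come from the patch; `may_remove_dir()` is a hypothetical caller added only to show that the call site is unchanged.

```c
/* Stubs for the !CONFIG_UNION_MOUNT case, written as plain constants. */
#define union_dir_is_empty(x)	1
#define present_in_lower(x, y)	0

/* Hypothetical caller: compiles identically against the real functions. */
static int may_remove_dir(const void *dir, const char *name)
{
	(void)dir;    /* the stubs never evaluate their arguments */
	(void)name;
	return union_dir_is_empty(dir) && !present_in_lower(dir, name);
}
```

One trade-off worth noting: because macro stubs never evaluate their arguments, variables used only as arguments can trigger unused-variable warnings, which static inline stubs avoid.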


Jan
--

2007-05-16 08:09:33

by Jan Engelhardt

Subject: Re: [RFC][PATCH 12/14] ext2 whiteout support


On May 14 2007 15:14, Bharata B Rao wrote:
>
>--- a/fs/ext2/dir.c
>+++ b/fs/ext2/dir.c
>@@ -218,6 +218,7 @@ static unsigned char ext2_filetype_table
> [EXT2_FT_FIFO] = DT_FIFO,
> [EXT2_FT_SOCK] = DT_SOCK,
> [EXT2_FT_SYMLINK] = DT_LNK,
>+ [EXT2_FT_WHT] = DT_WHT,
> };
>
> #define S_SHIFT 12
>@@ -1292,7 +1301,7 @@ static struct file_system_type ext2_fs_t
> .name = "ext2",
> .get_sb = ext2_get_sb,
> .kill_sb = kill_block_super,
>- .fs_flags = FS_REQUIRES_DEV,
>+ .fs_flags = FS_REQUIRES_DEV | FS_WHT,
> };
>

Hum. It's always so short. Would it offend someone to make that
DT_WHITEOUT and/or FS_WHITEOUT?



Jan
--

2007-05-16 19:35:00

by Jan Engelhardt

Subject: Re: [RFC][PATCH 8/14] Union-mount lookup


On May 16 2007 10:38, Bharata B Rao wrote:
>>
>> >+lookup_union:
>> >+ do {
>> >+ struct vfsmount *mnt = find_mnt(topmost);
>> >+ UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
>> >+ topmost->d_name.name, topmost->d_inode,
>> >+ mnt->mnt_devname);
>> >+ mntput(mnt);
>> >+ } while (0);
>>
>> Why the extra do{}while? [elsewhere too]
>
>Not sure, maybe to get a scope in which to define 'mnt' here. Jan?

What I was implicitly suggesting was that mnt could be moved into the
normal 'function scope'.
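
The two alternatives to a do { } while (0) wrapper can be sketched in user space as follows. Both functions are hypothetical and compute the same value; only the placement of the temporary differs.

```c
#include <stddef.h>
#include <string.h>

/* Variant 1: a bare block already opens a scope; no loop is needed. */
static size_t debug_len_block(const char *devname)
{
	size_t total = 0;
	{
		size_t len = strlen(devname);   /* visible only in this block */
		total += len;
	}
	return total;
}

/* Variant 2: declare the temporary at function scope, as suggested. */
static size_t debug_len_function_scope(const char *devname)
{
	size_t len;

	len = strlen(devname);
	return len;
}
```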


Jan
--

2007-05-16 20:06:59

by Serge E. Hallyn

Subject: Re: [RFC][PATCH 8/14] Union-mount lookup

Quoting Jan Engelhardt ([email protected]):
>
> On May 16 2007 10:38, Bharata B Rao wrote:
> >>
> >> >+lookup_union:
> >> >+ do {
> >> >+ struct vfsmount *mnt = find_mnt(topmost);
> >> >+ UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
> >> >+ topmost->d_name.name, topmost->d_inode,
> >> >+ mnt->mnt_devname);
> >> >+ mntput(mnt);
> >> >+ } while (0);
> >>
> >> Why the extra do{}while? [elsewhere too]
> >
> >Not sure, maybe to get a scope in which to define 'mnt' here. Jan?
>
> What I was implicitly suggesting was that mnt could be moved into the
> normal 'function scope'.
>
>
> Jan

This code can't stay anyway so it's kind of moot. find_mnt() is bogus,
and the topmost and overlaid mappings need to be changed from
dentry->dentry to (vfsmnt,dentry)->(vfsmnt,dentry) in order to cope with
bind mounts and mount namespaces.
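
A rough user-space model of the mapping described above, in which each location in the union stack is a (mount, dentry) pair instead of a dentry alone. All types and names here (`um_path`, `um_link`, `um_lower()`) are hypothetical; the sketch only shows why the same dentry reached through a different mount must not inherit the union.

```c
#include <stdbool.h>
#include <stddef.h>

struct dentry_model { const char *name; };
struct mount_model  { const char *devname; };

/* A location is a (mount, dentry) pair, not a dentry alone. */
struct um_path {
	struct mount_model  *mnt;
	struct dentry_model *dentry;
};

/* One link of the union stack: an upper location overlays a lower one. */
struct um_link {
	struct um_path upper;
	struct um_path lower;
};

static bool um_path_equal(struct um_path a, struct um_path b)
{
	return a.mnt == b.mnt && a.dentry == b.dentry;
}

/* Find what lies below @upper; NULL if this location is not in a union. */
static const struct um_path *um_lower(const struct um_link *links, size_t n,
				      struct um_path upper)
{
	for (size_t i = 0; i < n; i++)
		if (um_path_equal(links[i].upper, upper))
			return &links[i].lower;
	return NULL;
}
```

With a dentry-only mapping, a bind mount of the same directory would see the lower layer as well, because both mounts share the dentry; keying on the pair keeps the two views separate.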

-serge

2007-05-18 11:01:38

by Bharata B Rao

Subject: Re: [RFC][PATCH 8/14] Union-mount lookup

On Tue, May 15, 2007 at 10:00:45AM -0400, Trond Myklebust wrote:
> On Mon, 2007-05-14 at 15:12 +0530, Bharata B Rao wrote:
> > From: Jan Blunck <[email protected]>
> > Subject: Union-mount lookup
> >
> > Modifies the vfs lookup routines to work with union mounted directories.
> >
> > The existing lookup routines generally lookup for a pathname only in the
> > topmost or given directory. The changed versions of the lookup routines
> > search for the pathname in the entire union mounted stack. Also they have been
> > modified to setup the union stack during lookup from dcache cache and from
> > real_lookup().
> >
> > Signed-off-by: Jan Blunck <[email protected]>
> > Signed-off-by: Bharata B Rao <[email protected]>
> > ---
> > fs/dcache.c | 16 +
> > fs/namei.c | 78 +++++-
> > fs/namespace.c | 35 ++
> > fs/union.c | 598 +++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/dcache.h | 17 +
> > include/linux/namei.h | 4
> > include/linux/union.h | 49 ++++
> > 7 files changed, 786 insertions(+), 11 deletions(-)
> >
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -1286,7 +1286,7 @@ struct dentry * d_lookup(struct dentry *
> > return dentry;
> > }
> >
> > -struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
> > +struct dentry * __d_lookup_single(struct dentry *parent, struct qstr *name)
> > {
> > unsigned int len = name->len;
> > unsigned int hash = name->hash;
> > @@ -1371,6 +1371,20 @@ out:
> > return dentry;
> > }
> >
> > +struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
> > +{
> > + struct dentry *dentry;
> > + unsigned long seq;
> > +
> > + do {
> > + seq = read_seqbegin(&rename_lock);
> > + dentry = __d_lookup_single(parent, name);
> > + if (dentry)
> > + break;
> > + } while (read_seqretry(&rename_lock, seq));
> > + return dentry;
> > +}
> > +
> > /**
> > * d_validate - verify dentry provided from insecure source
> > * @dentry: The dentry alleged to be valid child of @dparent
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -374,6 +374,33 @@ void release_open_intent(struct nameidat
> > }
> >
> > static inline struct dentry *
> > +do_revalidate_single(struct dentry *dentry, struct nameidata *nd)
> > +{
> > + int status = dentry->d_op->d_revalidate(dentry, nd);
> > + if (unlikely(status <= 0)) {
>
> d_revalidate() returns a 0 or 1 result, not an error.

It doesn't look that way (see the comment below); this is copied
as-is from do_revalidate().
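
For reference, the three-way convention described by the quoted comment
can be sketched in userspace as below. This is only an illustration:
`reval_action` and `classify_revalidate()` are made-up names, not kernel
API.

```c
#include <assert.h>	/* for the checks in the usage note */
#include <errno.h>

enum reval_action {
	REVAL_KEEP,		/* status > 0: dentry is still valid */
	REVAL_INVALIDATE,	/* status == 0: try to invalidate the dentry */
	REVAL_FAIL		/* status < 0: hand the error back to the caller */
};

/* Map a d_revalidate()-style return value to what the caller should do. */
static enum reval_action classify_revalidate(int status)
{
	if (status > 0)
		return REVAL_KEEP;
	if (status == 0)
		return REVAL_INVALIDATE;
	return REVAL_FAIL;
}
```

For example, `assert(classify_revalidate(-ESTALE) == REVAL_FAIL);` holds,
which is the case the `status <= 0` test followed by `if (!status)` is
separating out.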

>
> > + /*
> > + * The dentry failed validation.
> > + * If d_revalidate returned 0 attempt to invalidate
> > + * the dentry otherwise d_revalidate is asking us
> > + * to return a fail status.
> > + */
> > + if (!status) {
> > + if (!d_invalidate(dentry)) {
> > + __dput_single(dentry);
> > + dentry = NULL;
> > + }
> > + } else {
> > + __dput_single(dentry);
> > + dentry = ERR_PTR(status);

Regards,
Bharata.

2007-05-18 11:04:21

by Bharata B Rao

Subject: Re: [RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems

On Wed, May 16, 2007 at 09:57:28AM +0200, Jan Engelhardt wrote:
>
> On May 14 2007 15:13, Bharata B Rao wrote:
> >+
> >+ if (flag & 0x2) {
> >+ error = union_copyup(nd, flag);
> >+ if (error)
> >+ goto exit;
> >+ }
>
> What I dislike (and that also goes for fs/namei.c and such) that they use
> numeral constants, i.e. 0x2. That seems error-prone. Could this (and
> the in-kernel users of 0x1/0x2/0x4) be turned into some constant?
>
> >+ if (IS_DEADDIR(parent->d_inode))
> >+ goto error;
> >+ err = -EACCES; /* shouldn't it be ENOSYS? */
>
> I do not think so. ENOSYS means Syscall not implemented. But it is
> implemented. If ->i_op is not there does not imply ENOSYS.
>
> Though, now that I grep through fs/*, I see that namei.c also
> has that comment "shouldn't it be ENOSYS", so it's all at odds.
>
> >+ if (!parent->d_inode->i_op || !parent->d_inode->i_op->create)
> >+ goto error;
>
> >+struct dentry * union_create_topmost(struct nameidata *nd, struct dentry *old)
> >+{
> >+ struct dentry *dentry;
> >+ struct dentry *parent = nd->dentry;
> >+
> >+ UM_DEBUG_UID("dentry=%s\n", old->d_name.name);
> >+
> >+ BUG_ON(parent->d_sb == old->d_sb);
> >+ if (!S_ISREG(old->d_inode->i_mode)) {
> >+ UM_DEBUG("This filetype isn't supported!\n");
>
> Does that mean I cannot create block devices, etc.?
>

Not really. This is called during copyup of a file residing in a lower
layer. And that is done only for regular files.

Regards,
Bharata.

2007-05-18 14:31:32

by Shaya Potter

Subject: Re: [RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems

Bharata B Rao wrote:

>
> Not really. This is called during copyup of a file residing in a lower
> layer. And that is done only for regular files.

That is broken.

You should be able to change the permissions on a device node on a layer
that is RO.

So it would copy it up (1. mknod, 2. copy attributes), and then the
appropriate attribute change notification would be called.
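
A rough userspace sketch of those two steps, assuming plain POSIX calls
rather than the in-kernel helpers. `copyup_node()` is a hypothetical name,
and creating block or character devices this way needs privilege (an
unprivileged caller can only try it with a FIFO):

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Recreate the special file @lower at @upper: 1. mknod, 2. copy attributes.
 * Returns 0 on success, -1 on failure.
 */
static int copyup_node(const char *lower, const char *upper)
{
	struct stat st;

	assert(lower != NULL && upper != NULL);
	if (lstat(lower, &st) != 0)
		return -1;
	if (S_ISREG(st.st_mode) || S_ISDIR(st.st_mode) || S_ISLNK(st.st_mode))
		return -1;	/* not a special file; handled elsewhere */

	/* Step 1: same file type and device number as the lower node. */
	if (mknod(upper, st.st_mode, st.st_rdev) != 0)
		return -1;

	/* Step 2: copy attributes; mknod() applied the umask, so redo the mode. */
	if (chmod(upper, st.st_mode & 07777) != 0)
		return -1;
	if (chown(upper, st.st_uid, st.st_gid) != 0) {
		/* Needs privilege when the owner differs; ignored in this sketch. */
	}
	return 0;
}
```

After this, the requested permission change would be applied to the upper
node only, leaving the read-only layer untouched.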

2007-05-19 10:33:51

by Paul Dickson

Subject: Re: [RFC][PATCH 5/14] Introduce union stack

On Mon, 14 May 2007 13:23:06 -0700, Badari Pulavarty wrote:

> > + while (fs) {
> > + locked = union_trylock(fs->root);
> > + if (!locked)
> > + goto loop1;
> > + locked = union_trylock(fs->altroot);
> > + if (!locked)
> > + goto loop2;
> > + locked = union_trylock(fs->pwd);
> > + if (!locked)
> > + goto loop3;
> > + break;
> > + loop3:
> > + union_unlock(fs->altroot);
> > + loop2:
> > + union_unlock(fs->root);
> > + loop1:
> > + read_unlock(&fs->lock);
> > + UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
> > + cpu_relax();
> > + read_lock(&fs->lock);
> > + continue;
>
> Nit.. why "continue" ?
>
> > + }
> > + BUG_ON(!fs);

How about getting rid of the gotos:

	while (fs) {
		locked = union_trylock(fs->root);
		if (locked) {
			locked = union_trylock(fs->altroot);
			if (locked) {
				locked = union_trylock(fs->pwd);
				if (locked)
					break;
				else {
					union_unlock(fs->altroot);
					union_unlock(fs->root);
				}
			} else {
				union_unlock(fs->root);
			}
		}
		read_unlock(&fs->lock);
		UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
		cpu_relax();
		read_lock(&fs->lock);
	}
	BUG_ON(!fs);

It's the same number of lines. Shorter if you get rid of the "locked"
variable.

-Paul

2007-05-22 03:07:53

by Bharata B Rao

Subject: Re: [RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems

On Fri, May 18, 2007 at 09:47:31AM -0400, Shaya Potter wrote:
> Bharata B Rao wrote:
>
> >
> >Not really. This is called during copyup of a file residing in a lower
> >layer. And that is done only for regular files.
>
> That is broken.

But it only breaks the semantics (in other cases we allow writes only to the
top layer files). So the question is: why do we have to copy up the device
node? What difference does it make to writing to the device itself? Currently
we allow writes to the device through the lower layer device node itself.

>
> You should be able to change the permissions on a device node on a layer
> that is RO.
>

Hmm, not sure why we need to touch the permissions of the device. See below.

> so it would copy it up (1. mknod, 2. copy attributes) and then the
> appropriate attribute notification change would be called.

With union mount, when a regular file is opened for write, we check whether
it resides in a lower layer; if so, it is copied up to the topmost layer
and the fd returned from open refers to the new copy. Any subsequent writes
using this fd go to the newly created topmost file. (We are aware that we do
not yet copy the (extended) attributes to the newly created topmost file,
which we have to do.)
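
Roughly, the check described above amounts to the following userspace
sketch. It is not the patch code: `needs_copyup()` and the convention that
layer 0 is the topmost layer are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Layer 0 is taken to be the topmost (writable) layer in this sketch. */
static int needs_copyup(int layer, mode_t mode, int open_flags)
{
	int acc = open_flags & O_ACCMODE;

	assert(layer >= 0);
	if (layer == 0)
		return 0;	/* already in the topmost layer */
	if (!S_ISREG(mode))
		return 0;	/* only regular files are copied up */
	return acc == O_WRONLY || acc == O_RDWR;
}
```

So a read-only open of a lower-layer file, or any open of a lower-layer
device node, leaves the file where it is.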

Regards,
Bharata.

2007-05-22 06:28:15

by Jan Engelhardt

Subject: Re: [RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems


On May 22 2007 08:43, Bharata B Rao wrote:
>On Fri, May 18, 2007 at 09:47:31AM -0400, Shaya Potter wrote:
>> Bharata B Rao wrote:
>>
>> >
>> >Not really. This is called during copyup of a file residing in a lower
>> >layer. And that is done only for regular files.
>>
>> That is broken.
>
>But it only breaks the semantics (in other cases we allow writes only to the
>top layer files). So the question is why do we have to copy up the device
>node ? What difference it makes to writing to the device itself ?

Because `chmod 666 blockdevnode` is not the same as writing
to the device itself?

>Currently we allow write to the device using the lower layer device node
>itself.


Jan
--

2007-05-22 08:36:11

by Bharata B Rao

Subject: Re: [RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems

On Tue, May 22, 2007 at 08:25:16AM +0200, Jan Engelhardt wrote:
>
> On May 22 2007 08:43, Bharata B Rao wrote:
> >On Fri, May 18, 2007 at 09:47:31AM -0400, Shaya Potter wrote:
> >> Bharata B Rao wrote:
> >>
> >> >
> >> >Not really. This is called during copyup of a file residing in a lower
> >> >layer. And that is done only for regular files.
> >>
> >> That is broken.
> >
> >But it only breaks the semantics (in other cases we allow writes only to the
> >top layer files). So the question is why do we have to copy up the device
> >node ? What difference it makes to writing to the device itself ?
>
> Because `chmod 666 blockdevnode` is not the same as writing
> to the device itself?

What if that chmod is applied to the lower-level device node? This is what
we do currently, even for regular files. Copyup happens only when the file
is opened for writing.

Let me rephrase my earlier question:

In the case of regular files, when we copy up a file, we are actually
preventing any writes to the lower layers (which we have designated as
read-only).

Applying the same logic to devices, what do we achieve by copying them up?
How does it matter whether we write to the device through a node in the upper
layer or in the lower layer? Both writes eventually do the same thing.

What I am trying to understand is whether the need for copyup is purely a
matter of conforming to semantics (not allowing writes to the lower layers
in case of union mount), or whether we achieve anything else by doing a
device copyup. Are there any cases where copying up device nodes is
absolutely essential for sane behaviour?

Regards,
Bharata.

2007-05-22 12:46:22

by Shaya Potter

Subject: Re: [RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems

Bharata B Rao wrote:
> On Tue, May 22, 2007 at 08:25:16AM +0200, Jan Engelhardt wrote:
>> On May 22 2007 08:43, Bharata B Rao wrote:
>>> On Fri, May 18, 2007 at 09:47:31AM -0400, Shaya Potter wrote:
>>>> Bharata B Rao wrote:
>>>>
>>>>> Not really. This is called during copyup of a file residing in a lower
>>>>> layer. And that is done only for regular files.
>>>> That is broken.
>>> But it only breaks the semantics (in other cases we allow writes only to the
>>> top layer files). So the question is why do we have to copy up the device
>>> node ? What difference it makes to writing to the device itself ?
>> Because `chmod 666 blockdevnode` is not the same as writing
>> to the device itself?
>
> What if that chmod is applied on the lower level device node ? This is what
> we do currently, even for regular files. Copyup happens only when the file
> is opened for writing.
>
> Let me rephrase my earlier question:
>
> In case of regular files, when we copyup a file, we are actually preventing
> any writes to the lower layers (which we have designated as read only).
>
> Applying the same logic to devices, what do we achieve by copying them up ?
> How does it matter if we write to the device through a node in the upper
> layer or in the lower layer. Both the writes eventually do the same thing.

What happens if the lower layer is on a read-only medium, but the top
layer is RW? Why can't one change permissions? In your model, one can't.

What happens if one wants to share a lower layer read-only (I'm doing
this in my research into uses of union file systems)? One doesn't want
a permission change in one use of the lower layer to affect any of the
other uses.


> What I am trying to understand is, if the need for copyup is purely a matter
> of conforming to semantics (of not allowing writes to the lower layers in
> case of union mount) or do we achieve anything else by doing a device
> copyup ? Are there any cases where copying up of device nodes are absolutely
> essential for sane behaviour ?

If the lower layer is relatively immutable (ignoring atime) it can be
shared in a RO manner by multiple unions. If it's not, it can't be.
Also, copyup is needed in general as the RO union layer can be on a RO
file system but the union will not be RO.
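
A toy model of that sharing argument, as a sketch under stated
assumptions: `node`, `mock_union`, `union_mode()` and `union_chmod()` are
hypothetical, and each union is reduced to one private top layer over one
shared lower layer.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	unsigned int mode;	/* permission bits */
	int present;		/* 0 if this layer has no copy of the node */
};

/* A two-layer union: a private top layer over a shared, read-only lower one. */
struct mock_union {
	struct node top;
	const struct node *lower;	/* may be shared by several unions */
};

/* The topmost layer that has the node wins. */
static unsigned int union_mode(const struct mock_union *u)
{
	assert(u != NULL && (u->top.present || u->lower != NULL));
	return u->top.present ? u->top.mode : u->lower->mode;
}

/* chmod through the union: copy up if needed, then change only the top copy. */
static void union_chmod(struct mock_union *u, unsigned int mode)
{
	assert(u != NULL);
	if (!u->top.present) {
		assert(u->lower != NULL);
		u->top = *u->lower;	/* copy-up of the attributes */
		u->top.present = 1;
	}
	u->top.mode = mode;
}
```

Two unions built over the same lower node can then disagree about its
mode without either of them writing to the shared layer.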

2007-05-22 16:44:33

by Jan Engelhardt

Subject: Re: [RFC][PATCH 5/14] Introduce union stack


On May 19 2007 03:18, Paul Dickson wrote:
>
>How about getting rid of the gotos:
>
> while (fs) {
> locked = union_trylock(fs->root);
> if (locked) {
> locked = union_trylock(fs->altroot);
> if (locked) {
> locked = union_trylock(fs->pwd);
> if (locked)
> break;

Suppose we break here...

> else {
> union_unlock(fs->altroot);
> union_unlock(fs->root);
> }
> else
> union_unlock(fs->root);
> }
> }
> read_unlock(&fs->lock);
> UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
> cpu_relax();
> read_lock(&fs->lock);
> }
> BUG_ON(!fs);

Then no lock is released. Boom.



Jan
--

2007-05-23 10:34:22

by Bharata B Rao

Subject: Re: [RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems

On Tue, May 22, 2007 at 08:35:17AM -0400, Shaya Potter wrote:
> Bharata B Rao wrote:
> >
> >In case of regular files, when we copyup a file, we are actually preventing
> >any writes to the lower layers (which we have designated as read only).
> >
> >Applying the same logic to devices, what do we achieve by copying them up ?
> >How does it matter if we write to the device through a node in the upper
> >layer or in the lower layer. Both the writes eventually do the same thing.
>
> What happens if the lower layer is on a read only medium. But the top
> layer is RW. Why can't one change permissions? In your model, one can't.
>
> What happens if one wants to share a lower layer read-only (I'm doing
> this with my research into uses of union file systems), one doesn't want
> permission change in one use of the lower layer to affect any of the
> other uses.

Ok, makes sense. Thanks for the clarification.

So it looks like, in addition to copyup on open (which is what we do
currently), there is a case for doing copyups in other situations as well.

Regards,
Bharata.

2007-05-23 13:26:19

by Serge E. Hallyn

Subject: Re: [RFC][PATCH 5/14] Introduce union stack

Quoting Paul Dickson ([email protected]):
> On Mon, 14 May 2007 13:23:06 -0700, Badari Pulavarty wrote:
>
> > > + while (fs) {
> > > + locked = union_trylock(fs->root);
> > > + if (!locked)
> > > + goto loop1;
> > > + locked = union_trylock(fs->altroot);
> > > + if (!locked)
> > > + goto loop2;
> > > + locked = union_trylock(fs->pwd);
> > > + if (!locked)
> > > + goto loop3;
> > > + break;
> > > + loop3:
> > > + union_unlock(fs->altroot);
> > > + loop2:
> > > + union_unlock(fs->root);
> > > + loop1:
> > > + read_unlock(&fs->lock);
> > > + UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
> > > + cpu_relax();
> > > + read_lock(&fs->lock);
> > > + continue;
> >
> > Nit.. why "continue" ?
> >
> > > + }
> > > + BUG_ON(!fs);
>
> How about getting rid of the gotos:
>
> while (fs) {
> locked = union_trylock(fs->root);
> if (locked) {
> locked = union_trylock(fs->altroot);
> if (locked) {
> locked = union_trylock(fs->pwd);
> if (locked)
> break;
> else {
> union_unlock(fs->altroot);
> union_unlock(fs->root);
> }
> else
> union_unlock(fs->root);
> }
> }
> read_unlock(&fs->lock);
> UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
> cpu_relax();
> read_lock(&fs->lock);
> }
> BUG_ON(!fs);
>
> It's the same number of lines. Shorter if you get rid of the "locked"
> variable.

I dunno, I thought the goto version was cleaner and easier to tell that
the right locks are getting unlocked. The worst part in the second
version is the break in the middle!
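
As an illustration of the unwind pattern, here is a self-contained
userspace sketch. `mock_lock`, `mock_trylock()`, `mock_unlock()` and
`trylock_all()` are made-up stand-ins for the union_trylock() and
union_unlock() calls in the patch, and the retry loop is left to the
caller:

```c
#include <assert.h>

/* Hypothetical stand-in for the lock taken by union_trylock(). */
struct mock_lock { int held; };

static int mock_trylock(struct mock_lock *l)
{
	if (l->held)
		return 0;
	l->held = 1;
	return 1;
}

static void mock_unlock(struct mock_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/*
 * One all-or-nothing attempt: returns 1 with all three locks held,
 * or 0 with none of them held.
 */
static int trylock_all(struct mock_lock *root, struct mock_lock *altroot,
		       struct mock_lock *pwd)
{
	if (!mock_trylock(root))
		goto fail_root;
	if (!mock_trylock(altroot))
		goto fail_altroot;
	if (!mock_trylock(pwd))
		goto fail_pwd;
	return 1;

fail_pwd:
	mock_unlock(altroot);
fail_altroot:
	mock_unlock(root);
fail_root:
	return 0;
}
```

Each label undoes exactly the locks taken before the failing step, in
reverse order, so it is easy to check by eye that nothing is leaked.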

-serge