2007-10-18 15:57:08

by Jaroslav Sykora

Subject: [RFC PATCH 0/5] Shadow directories

Hello,

Let's say we have an archive file "hello.zip" containing the source code of a
hello world program. We want to do this:
cat hello.zip^/hello.c
gcc hello.zip^/hello.c -o hello
etc.

The '^' is an escape character and it tells the computer to treat the file as a directory.
[Note: We can't do "cat hello.zip/hello.c" because of http://lwn.net/Articles/100148/ ]
The kernel patch only redirects the request to another directory
("shadow directory") where a FUSE server must be mounted. The decompression of
archives is handled entirely in user space. More info can be found in the documentation
patch in the series.
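The lookup order can be pictured with a small user-space model. The sketch below only illustrates the three steps listed in the documentation patch; it is not the kernel code. The struct tree type and all function names are made up for the example, and each tree is reduced to a flat list of names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a VFS tree: just a flat list of names. */
struct tree {
	const char *const *names;
	size_t count;
};

enum lookup_result { FOUND_STD, FOUND_SHADOW, NOT_FOUND };

/* Return true if @name is present in tree @t. */
static bool tree_has(const struct tree *t, const char *name)
{
	for (size_t i = 0; i < t->count; i++)
		if (strcmp(t->names[i], name) == 0)
			return true;
	return false;
}

/*
 * Two-stage lookup of one path component:
 *  1. try the standard tree;
 *  2. in escape mode, give up unless @name contains @esc;
 *  3. try the shadow tree.
 */
static enum lookup_result two_stage_lookup(const struct tree *std,
					    const struct tree *shadow,
					    const char *name,
					    bool use_esc, char esc)
{
	assert(std && shadow && name);

	if (tree_has(std, name))
		return FOUND_STD;
	if (use_esc && strchr(name, esc) == NULL)
		return NOT_FOUND;	/* filtered, shadow tree not consulted */
	return tree_has(shadow, name) ? FOUND_SHADOW : NOT_FOUND;
}
```

With "hello.zip" in the standard list and "hello.zip^" only in the shadow list, a lookup of "hello.zip^" with '^' as the escape character falls through to the shadow tree, while a missing name without '^' is rejected before the shadow tree is consulted.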

The shadow directories are used in the RheaVFS project [ http://rheavfs.sourceforge.net/ ],
and they can also be used with the original AVFS [ http://www.inf.bme.hu/~mszeredi/avfs/ ].

The patches are against vanilla 2.6.23.
This is my first bigger contribution to the kernel so please be gentle ;-)

Jara

--
"Elves and Dragons!" I says to him. "Cabbages and potatoes are better
for you and me." -- J. R. R. Tolkien


2007-10-18 15:27:31

by Jaroslav Sykora

Subject: [RFC PATCH 4/5] Shadow directories: procfs

Procfs interface: /proc/<pid>/status, /proc/<pid>/{root-shdw, cwd-shdw}.

Signed-off-by: Jaroslav Sykora <[email protected]>

fs/proc/array.c | 23 +++++++++++++++++++
fs/proc/base.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 76 insertions(+)

--- orig/fs/proc/base.c 2007-10-07 19:00:20.000000000 +0200
+++ new/fs/proc/base.c 2007-10-07 13:39:08.000000000 +0200
@@ -171,6 +171,32 @@ static int proc_cwd_link(struct inode *i
return result;
}

+static int proc_shdwcwd_link(struct inode *inode, struct dentry **dentry,
+ struct vfsmount **mnt)
+{
+ struct task_struct *task = get_proc_task(inode);
+ struct fs_struct *fs = NULL;
+ int result = -ENOENT;
+
+ if (task) {
+ fs = get_fs_struct(task);
+ put_task_struct(task);
+ }
+ if (fs) {
+ read_lock(&fs->lock);
+ *dentry = dget(fs->shdwpwd);
+ if (fs->shdwpwd)
+ *mnt = mntget(fs->shdwpwdmnt);
+ else
+ *mnt = NULL;
+ read_unlock(&fs->lock);
+ if (*dentry)
+ result = 0;
+ put_fs_struct(fs);
+ }
+ return result;
+}
+
static int proc_root_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
{
struct task_struct *task = get_proc_task(inode);
@@ -192,6 +218,29 @@ static int proc_root_link(struct inode *
return result;
}

+static int proc_shdwroot_link(struct inode *inode, struct dentry **dentry,
+ struct vfsmount **mnt)
+{
+ struct task_struct *task = get_proc_task(inode);
+ struct fs_struct *fs = NULL;
+ int result = -ENOENT;
+
+ if (task) {
+ fs = get_fs_struct(task);
+ put_task_struct(task);
+ }
+ if (fs) {
+ read_lock(&fs->lock);
+ *mnt = mntget(fs->shdwrootmnt);
+ *dentry = dget(fs->shdwroot);
+ read_unlock(&fs->lock);
+ if (*dentry)
+ result = 0;
+ put_fs_struct(fs);
+ }
+ return result;
+}
+
#define MAY_PTRACE(task) \
(task == current || \
(task->parent == current && \
@@ -2094,6 +2143,8 @@ static const struct pid_entry tgid_base_
#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
REG("coredump_filter", S_IRUGO|S_IWUSR, coredump_filter),
#endif
+ LNK("root-shdw", shdwroot),
+ LNK("cwd-shdw", shdwcwd),
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, pid_io_accounting),
#endif
@@ -2377,6 +2428,8 @@ static const struct pid_entry tid_base_s
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, fault_inject),
#endif
+ LNK("root-shdw", shdwroot),
+ LNK("cwd-shdw", shdwcwd),
};

static int proc_tid_base_readdir(struct file * filp,
--- orig/fs/proc/array.c 2007-10-07 19:00:20.000000000 +0200
+++ new/fs/proc/array.c 2007-10-07 19:57:03.000000000 +0200
@@ -298,6 +298,28 @@ static inline char *task_context_switch_
p->nivcsw);
}

+static inline char *task_fsinfo(struct task_struct *p, char *buffer)
+{
+ int enabled = 0, use_esc = 0, esc_ch = 0;
+
+ rcu_read_lock();
+ task_lock(p);
+ if (p->fs) {
+ read_lock(&p->fs->lock);
+ enabled = (p->fs->flags & SHDW_ENABLED) ? 1 : 0;
+ use_esc = (p->fs->flags & SHDW_USE_ESC) ? 1 : 0;
+ esc_ch = p->fs->shdw_escch;
+ read_unlock(&p->fs->lock);
+ }
+ task_unlock(p);
+ rcu_read_unlock();
+
+ return buffer + sprintf(buffer, "Shdw_Enabled:\t%d\n"
+ "Shdw_UseEscChar:\t%d\n"
+ "Shdw_EscChar:\t%u\n",
+ enabled, use_esc, (unsigned int)esc_ch);
+}
+
int proc_pid_status(struct task_struct *task, char *buffer)
{
char *orig = buffer;
@@ -317,6 +339,7 @@ int proc_pid_status(struct task_struct *
buffer = task_show_regs(task, buffer);
#endif
buffer = task_context_switch_counts(task, buffer);
+ buffer = task_fsinfo(task, buffer);
return buffer - orig;
}

2007-10-18 15:28:58

by Jaroslav Sykora

Subject: [RFC PATCH 5/5] Shadow directories: documentation

Documentation of the shadow directories.

Signed-off-by: Jaroslav Sykora <[email protected]>

Documentation/filesystems/shadow-directories.txt | 177 +++++++++++++
1 file changed, 177 insertions(+)

--- /dev/null 2007-10-18 09:34:42.624413454 +0200
+++ new/Documentation/filesystems/shadow-directories.txt 2007-10-18 17:03:06.000000000 +0200
@@ -0,0 +1,177 @@
+Shadow directories
+==================
+
+The Goal
+--------
+
+Let's say we have an archive file "hello.zip" containing the source code of a
+hello world program. We want to do this:
+ cat hello.zip^/hello.c
+
+The '^' is an escape character and it tells the computer to treat the file
+as a directory.
+[Note: We can't do "cat hello.zip/hello.c" because of http://lwn.net/Articles/100148/ ]
+
+One way to implement the scenario above is to create a FUSE VFS server and chroot
+everything into it. This will work, but poorly. The performance will be low
+and many things, like setuid binaries, will not work at all (unless the server
+has root privileges).
+
+
+The Principle
+-------------
+
+For every process we define two VFS trees:
+(1) the standard system-wide tree, managed by mount/umount, implemented by native
+ filesystems like ext3, reiserfs, etc..;
+(2) a per-process shadow tree, usually implemented by FUSE.
+
+The main change is within the VFS lookup code: a file name is looked up in the
+standard tree and if it's found we're done. If not, the name is transparently
+looked up in the shadow tree.
+
+[Picture: A standard and a shadow tree. The shadow tree will be in fact mounted
+ on some point in the standard tree, e.g. "/home/jara/.vfs/mnt". ]
+
+       Standard                       Shadow
+          "/"                           "/"
+    ,------|-------,               ,-----|------,
+   bin   home     usr             bin  home    usr
+           |                             |
+         jara                          jara
+      ,----|-----,                  ,----|-----,-------------,
+     tmp     hello.zip             tmp     hello.zip    hello.zip^
+                                                             |
+                                                     ,---------------,
+                                                  hello.c        Makefile
+
+
+Generally speaking a shadow tree is a superset of a standard tree -- everything
+we can find in the standard tree can be found in the shadow tree.
+But the standard tree is faster (it's a native FS), so we want to take most
+files from it and only the rest from the other tree (see the directory
+hello.zip^ in the picture above which is not in the std. tree).
+
+In a task the standard tree is primarily defined by its root directory
+(fs_struct.root). Secondarily it's represented by current working directory
+and by opened directory handles. To map all these directories to corresponding
+shadow directories we add shadow root, shadow current directory and shadow
+directories for all the opened directories (in the struct file).
+
+The user needs to set only the shadow root directory for his/her login shell. The
+settings will be inherited by all child processes. Although we provide a system
+call to set up the shadow current directory (SHDW_FD_PWD, below) and shadow directories
+of opened directories (@fd >= 0, below), this information can be automatically
+deduced from the standard directories.
+
+Example 1: See the picture above:
+A process has root=/ and pwd=/home/jara. The user's FUSE VFS server is mounted
+on "/home/jara/.vfs/mnt". We set up the shadow root directory of the process with
+a system call:
+ setshdwpath(pid, SHDW_FD_ROOT, "/home/jara/.vfs/mnt");
+The kernel knows that pwd=/home/jara, so it can deduce that shadow pwd will
+be "/home/jara/.vfs/mnt/home/jara" (absolute path).
+
+
+The Escape Character Mode
+-------------------------
+
+As has been said above, a file name lookup is now a two-stage process: first
+we try to look up the name in the standard tree and if we fail we try in the
+shadow tree. The problem is that there are hundreds of failed lookups on
+normal session start -- a few dozen for every starting process. All these
+bogus lookups will make it to the shadow root and will be processed by the user
+space VFS server implemented in FUSE. The lookups will be rejected and everything
+works as usual, but it's slow.
+
+To speed things up and to be practical we define an _escape character_. It's
+simply any character which can be used in a file name but which isn't used
+very often -- like '#' or '^'. We choose the '^' in this document.
+
+The escape character is set by the system call described below. All the
+lookups going to the shadow tree are filtered against the escape character.
+The VFS lookup procedure is thus:
+   1. a component of the path (a name) is looked up in the standard tree.
+      If it's found, we're done.
+   2. if the escape character mode is enabled, the name is checked for the
+      escape character. If it is absent, the lookup fails (file not found).
+   3. the name is looked up in the shadow tree.
+
+
+Example 2: settings as in Example 1:
+The user wants to read the file hello.c in the archive hello.zip in his
+home directory [see picture above]. The escape character is '^'. He runs:
+ cat hello.zip^/hello.c
+The name "hello.zip^" will be looked up in the pwd=/home/jara and won't be found.
+The component contains the escape character '^' so it gets a second chance
+and will be looked up in the shadow pwd=/home/jara/.vfs/mnt/home/jara.
+The next component of the path (hello.c in this example) is looked up
+starting from the point where the previous one finished, so it will be found right
+in the first step of the two-stage lookup process.
+
+
+
+The Syscalls
+------------
+
+Synopsis
+
+int getshdwinfo(int pid, int func, int *data);
+int setshdwinfo(int pid, int func, int data);
+int setshdwpath(int pid, int fd, const char *path);
+
+
+/* functions (parameter @func) */
+#define FSI_SHDW_ENABLE 1 /* enable shadow directories */
+#define FSI_SHDW_ESC_EN 2 /* enable use of escape character */
+#define FSI_SHDW_ESC_CHAR 3 /* specify escape character */
+
+/* pseudo file descriptors (parameter @fd) */
+#define SHDW_FD_ROOT -1 /* pseudo FD for root shadow dir */
+#define SHDW_FD_PWD -2 /* pseudo FD for pwd shadow dir */
+
+Description
+
+getshdwinfo() reads attributes of process @pid regarding shadow directories
+into @data, while setshdwinfo() sets the attributes.
+setshdwpath() sets path of shadow directory for a file descriptor,
+root directory or working directory of process @pid.
+
+@pid is the PID of the target process. The special value of 0 means 'current process'.
+Thus getshdwinfo(0, ...) is equivalent to getshdwinfo(getpid(), ...).
+
+@func determines the attribute being read or written. It may be:
+ FSI_SHDW_ENABLE -- shadow directories for the process are enabled
+ iff @data != 0
+ FSI_SHDW_ESC_EN -- the escape character is used iff @data != 0
+ (escape mode enabled)
+ FSI_SHDW_ESC_CHAR -- the @data specifies ASCII character used in the escape mode
+
+@fd is a file descriptor number in the target process whose shadow directory
+is being set. It may be:
+ @fd >= 0 -- a file descriptor number
+ SHDW_FD_ROOT -- the root directory
+ SHDW_FD_PWD -- the working directory
+
+@path is the path to the directory. It may be NULL, in which case:
+  (a) if @fd == SHDW_FD_ROOT -- the shadow directories will be switched off;
+  (b) else try to deduce the path from the standard path and shadow root
+      on demand.
+
+
+Return value
+
+On success zero is returned. On failure a negative error code is returned.
+
+
+Errors
+
+EINVAL -- Parameter is invalid: @func out of range.
+ESRCH -- Process @pid not found.
+EPERM -- Access to process @pid is denied.
+EFAULT -- Bad pointer @data or @path.
+EBADF -- File descriptor @fd does not exist.
+EACCES -- Access to VFS path @path denied.
+ENOENT -- VFS path @path does not exist.
+ENOTDIR -- VFS path @path points to non-directory object.
+
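The deduction in Example 1 can be pictured as plain string composition. This is only a user-space sketch under that simplifying assumption: the patch itself (validate_shdwpwd() in patch 2/5) obtains the standard path with d_path() and then walks it from the shadow root with path_lookup_shdw(), it does not concatenate strings. The helper name deduce_shadow_path() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: compose the shadow path that corresponds to the
 * absolute standard path @std_path under the shadow root @shdw_root.
 * Returns 0 on success, -1 if @std_path is not absolute or @out is too small.
 */
static int deduce_shadow_path(const char *shdw_root, const char *std_path,
			      char *out, size_t outlen)
{
	size_t rootlen;
	int n;

	assert(shdw_root && std_path && out);

	if (std_path[0] != '/')
		return -1;

	/* Drop one trailing '/' on the root so the result has no "//". */
	rootlen = strlen(shdw_root);
	if (rootlen > 0 && shdw_root[rootlen - 1] == '/')
		rootlen--;

	n = snprintf(out, outlen, "%.*s%s", (int)rootlen, shdw_root, std_path);
	if (n < 0 || (size_t)n >= outlen)
		return -1;	/* output error or truncated */
	return 0;
}
```

For the values in Example 1, shadow root "/home/jara/.vfs/mnt" and pwd "/home/jara" compose to "/home/jara/.vfs/mnt/home/jara".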

2007-10-18 15:57:36

by Jaroslav Sykora

Subject: [RFC PATCH 1/5] Shadow directories: headers

Header file changes for shadow directories.
Adds pointers to shadow dirs to the struct file and struct fs_struct.
Defines internal lookup flags and syscall flags.

Signed-off-by: Jaroslav Sykora <[email protected]>

include/linux/file.h | 2 ++
include/linux/fs.h | 18 ++++++++++++++++++
include/linux/fs_struct.h | 25 +++++++++++++++++++++++++
include/linux/namei.h | 16 ++++++++++++++++
4 files changed, 61 insertions(+)

--- orig/include/linux/fs.h 2007-10-07 19:00:24.000000000 +0200
+++ new/include/linux/fs.h 2007-10-07 13:39:08.000000000 +0200
@@ -266,6 +266,14 @@ extern int dir_notify_enable;
#define SYNC_FILE_RANGE_WRITE 2
#define SYNC_FILE_RANGE_WAIT_AFTER 4

+/* sys_setshdwinfo(), sys_getshdwinfo(): */
+#define FSI_SHDW_ENABLE 1 /* enable shadow directories */
+#define FSI_SHDW_ESC_EN 2 /* enable use of escape character */
+#define FSI_SHDW_ESC_CHAR 3 /* specify escape character */
+/* sys_setshdwpath */
+#define SHDW_FD_ROOT -1 /* pseudo FD for root shadow dir */
+#define SHDW_FD_PWD -2 /* pseudo FD for pwd shadow dir */
+
#ifdef __KERNEL__

#include <linux/linkage.h>
@@ -752,6 +760,16 @@ struct file {
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
+
+ /* the following fields are protected by f_owner.lock */
+ /* | f_shdw | f_shdwmnt | result
+ +----------+-------------+------------
+ | NULL | NULL | delayed
+ | NULL | !NULL | invalid
+ | !NULL | NULL | BUG
+ | !NULL | !NULL | valid */
+ struct dentry *f_shdw;
+ struct vfsmount *f_shdwmnt;
};
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
--- orig/include/linux/fs_struct.h 2007-07-09 01:32:17.000000000 +0200
+++ new/include/linux/fs_struct.h 2007-10-07 13:39:08.000000000 +0200
@@ -10,8 +10,31 @@ struct fs_struct {
int umask;
struct dentry * root, * pwd, * altroot;
struct vfsmount * rootmnt, * pwdmnt, * altrootmnt;
+
+ int flags;
+ /* shadow dirs: root and pwd */
+ /* | shdwroot | shdwrootmnt | result
+ +----------+-------------+------------
+ | NULL | NULL | BUG_ON(flags&SHDW_ENABLED)
+ | !NULL | !NULL | ok
+ +==========+=============+============
+ | shdwpwd | shdwpwdmnt | result
+ +----------+-------------+------------
+ | NULL | NULL | delayed
+ | NULL | !NULL | invalid
+ | !NULL | NULL | BUG
+ | !NULL | !NULL | valid */
+ struct dentry *shdwroot, *shdwpwd;
+ struct vfsmount *shdwrootmnt, *shdwpwdmnt;
+ /* shadow dirs: escape character */
+ unsigned char shdw_escch;
};

+/* bitflags for fs_struct.flags */
+#define SHDW_ENABLED 1 /* are shadow dirs enabled? */
+#define SHDW_USE_ESC 2 /* use escape char in shadow dirs? */
+
+
#define INIT_FS { \
.count = ATOMIC_INIT(1), \
.lock = RW_LOCK_UNLOCKED, \
@@ -24,6 +47,8 @@ extern void exit_fs(struct task_struct *
extern void set_fs_altroot(void);
extern void set_fs_root(struct fs_struct *, struct vfsmount *, struct dentry *);
extern void set_fs_pwd(struct fs_struct *, struct vfsmount *, struct dentry *);
+extern void set_fs_shdwpwd(struct fs_struct *fs,
+ struct vfsmount *mnt, struct dentry *dentry);
extern struct fs_struct *copy_fs_struct(struct fs_struct *);
extern void put_fs_struct(struct fs_struct *);

--- orig/include/linux/namei.h 2007-10-07 19:00:25.000000000 +0200
+++ new/include/linux/namei.h 2007-10-07 20:03:11.000000000 +0200
@@ -22,6 +22,7 @@ struct nameidata {
int last_type;
unsigned depth;
char *saved_names[MAX_NESTED_LINKS + 1];
+ unsigned char find_char;

/* Intent data */
union {
@@ -54,6 +55,16 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA
#define LOOKUP_PARENT 16
#define LOOKUP_NOALT 32
#define LOOKUP_REVAL 64
+
+/* don't fallback to lookup in shadow directory */
+#define LOOKUP_NOSHDW 128
+/* try to find nameidata.find_char in pathname,
+ * set LOOKUP_CHARFOUND in nameidata.flags if found */
+#define LOOKUP_FINDCHAR (1<<16)
+#define LOOKUP_CHARFOUND (1<<17)
+/* (dentry,mnt) was found in shadow dir */
+#define LOOKUP_INSHDW (1<<18)
+
/*
* Intent data
*/
@@ -68,6 +79,8 @@ extern int FASTCALL(__user_walk_fd(int d
__user_walk_fd(AT_FDCWD, name, LOOKUP_FOLLOW, nd)
#define user_path_walk_link(name,nd) \
__user_walk_fd(AT_FDCWD, name, 0, nd)
+extern int FASTCALL(path_lookup_shdw(int dfd, const char *name,
+ unsigned int flags, struct nameidata *nd));
extern int FASTCALL(path_lookup(const char *, unsigned, struct nameidata *));
extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
const char *, unsigned int, struct nameidata *);
@@ -90,6 +103,9 @@ extern int follow_up(struct vfsmount **,
extern struct dentry *lock_rename(struct dentry *, struct dentry *);
extern void unlock_rename(struct dentry *, struct dentry *);

+extern int get_file_shdwdir(struct file *file, struct dentry **dentry,
+ struct vfsmount **mnt);
+
static inline void nd_set_link(struct nameidata *nd, char *path)
{
nd->saved_names[nd->depth] = path;
--- orig/include/linux/file.h 2007-10-07 19:00:24.000000000 +0200
+++ new/include/linux/file.h 2007-10-16 21:06:51.000000000 +0200
@@ -68,6 +68,8 @@ static inline void fput_light(struct fil
fput(file);
}

+extern struct file *FASTCALL(__fget(struct files_struct *files,
+ unsigned int fd));
extern struct file * FASTCALL(fget(unsigned int fd));
extern struct file * FASTCALL(fget_light(unsigned int fd, int *fput_needed));
extern void FASTCALL(set_close_on_exec(unsigned int fd, int flag));
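The state tables in the comments for f_shdw/f_shdwmnt and shdwpwd/shdwpwdmnt can be read as a small classifier. The sketch below is a user-space illustration only; the enum and function names are hypothetical and the pointers are treated as opaque.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the four rows of the (dentry, mnt) table. */
enum pair_state { PAIR_DELAYED, PAIR_INVALID, PAIR_BUG, PAIR_VALID };

/* Classify a (dentry, mnt) pair the way the header comments describe. */
static enum pair_state pair_classify(const void *dentry, const void *mnt)
{
	if (!dentry)
		return mnt ? PAIR_INVALID : PAIR_DELAYED;
	return mnt ? PAIR_VALID : PAIR_BUG;
}

/*
 * Return 1 if the pair can be used directly, 0 if the caller has to
 * validate it first (delayed) or give up (invalid).
 */
static int pair_usable(const void *dentry, const void *mnt)
{
	enum pair_state st = pair_classify(dentry, mnt);

	/* mirrors BUG_ON(dentry != NULL && mnt == NULL) in the patches */
	assert(st != PAIR_BUG);
	return st == PAIR_VALID;
}
```

The "invalid" row is the case where the mnt field carries an ERR_PTR() marker rather than a real vfsmount, which is why a non-NULL mnt with a NULL dentry is not an error in itself.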


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2007-10-18 15:57:58

by Jaroslav Sykora

Subject: [RFC PATCH 3/5] Shadow directories: chdir, fchdir

sys_chdir and sys_fchdir changes.

Signed-off-by: Jaroslav Sykora <[email protected]>

fs/open.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 73 insertions(+), 6 deletions(-)

--- orig/fs/open.c 2007-10-07 19:00:19.000000000 +0200
+++ new/fs/open.c 2007-10-16 21:04:56.000000000 +0200
@@ -476,13 +476,51 @@ asmlinkage long sys_access(const char __
return sys_faccessat(AT_FDCWD, filename, mode);
}

+static inline int read_fs_flags(void)
+{
+ int res;
+ read_lock(&current->fs->lock);
+ res = current->fs->flags;
+ read_unlock(&current->fs->lock);
+ return res;
+}
+
+void set_fs_shdwpwd(struct fs_struct *fs,
+ struct vfsmount *mnt, struct dentry *dentry)
+{
+ struct dentry *old_dentry;
+ struct vfsmount *old_mnt;
+
+ BUG_ON(dentry != NULL && mnt == NULL);
+ write_lock(&fs->lock);
+ /* set shadow pwd */
+ old_dentry = fs->shdwpwd;
+ old_mnt = fs->shdwpwdmnt;
+ fs->shdwpwd = dget(dentry);
+ if (dentry)
+ fs->shdwpwdmnt = mntget(mnt);
+ else
+ /* PTR_ERR flag */
+ fs->shdwpwdmnt = mnt;
+ write_unlock(&fs->lock);
+
+ if (old_dentry) {
+ mntput(old_mnt);
+ dput(old_dentry);
+ }
+}
+
asmlinkage long sys_chdir(const char __user * filename)
{
struct nameidata nd;
- int error;
+ char *tmp = getname(filename);
+ int error = PTR_ERR(tmp);
+
+ if (IS_ERR(tmp))
+ goto out_badname;

- error = __user_walk(filename,
- LOOKUP_FOLLOW|LOOKUP_DIRECTORY|LOOKUP_CHDIR, &nd);
+ error = path_lookup(tmp, LOOKUP_FOLLOW | LOOKUP_DIRECTORY
+ | LOOKUP_CHDIR, &nd);
if (error)
goto out;

@@ -490,11 +528,23 @@ asmlinkage long sys_chdir(const char __u
if (error)
goto dput_and_out;

- set_fs_pwd(current->fs, nd.mnt, nd.dentry);
+ if (!(read_fs_flags() & SHDW_ENABLED))
+ goto set_std;

+ if (!(nd.flags & LOOKUP_INSHDW))
+ set_fs_shdwpwd(current->fs, NULL, NULL);
+ else
+ /* shadow == std */
+ set_fs_shdwpwd(current->fs, nd.mnt, nd.dentry);
+
+set_std:
+ /* set std cwd */
+ set_fs_pwd(current->fs, nd.mnt, nd.dentry);
dput_and_out:
path_release(&nd);
out:
+ putname(tmp);
+out_badname:
return error;
}

@@ -520,8 +570,25 @@ asmlinkage long sys_fchdir(unsigned int
goto out_putf;

error = file_permission(file, MAY_EXEC);
- if (!error)
- set_fs_pwd(current->fs, mnt, dentry);
+ if (error)
+ goto out_putf;
+
+ set_fs_pwd(current->fs, mnt, dentry);
+
+ if (!(read_fs_flags() & SHDW_ENABLED))
+ /* shadow dirs aren't enabled */
+ goto out_putf;
+
+ if (get_file_shdwdir(file, &dentry, &mnt))
+ /* some error occurred */
+ set_fs_shdwpwd(current->fs, NULL, NULL);
+ else {
+ /* ok */
+ set_fs_shdwpwd(current->fs, mnt, dentry);
+ mntput(mnt);
+ dput(dentry);
+ }
+
out_putf:
fput(file);
out:

2007-10-18 15:58:34

by Jaroslav Sykora

Subject: [RFC PATCH 2/5] Shadow directories: core

Implements two-stage lookup with escape character filtering
and system calls for i386.
Changes the lookup path, namely do_path_lookup(). This function is split
into path_lookup_norm(), which performs standard name lookup,
and path_lookup_shdw(), which performs name lookup in an associated shadow directory.
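The decision whether do_path_lookup() retries in the shadow directory can be isolated as a pure predicate. The sketch below is a user-space restatement of that condition; the flag values are copied from the namei.h hunk of patch 1/5, while the function name should_try_shadow() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in the include/linux/namei.h hunk of patch 1/5. */
#define LOOKUP_NOSHDW		128
#define LOOKUP_FINDCHAR		(1 << 16)
#define LOOKUP_CHARFOUND	(1 << 17)
#define LOOKUP_INSHDW		(1 << 18)

/*
 * Return true if a failed normal lookup (@retval != 0) should be retried
 * in the shadow directory, given the nameidata flags after the first walk.
 */
static bool should_try_shadow(int retval, unsigned int nd_flags)
{
	/* CHARFOUND is only ever set while FINDCHAR is requested. */
	assert(!(nd_flags & LOOKUP_CHARFOUND) || (nd_flags & LOOKUP_FINDCHAR));

	if (retval == 0)
		return false;	/* normal lookup succeeded */
	if (nd_flags & (LOOKUP_NOSHDW | LOOKUP_INSHDW))
		return false;	/* shadow disabled, or already inside it */
	if ((nd_flags & LOOKUP_FINDCHAR) && !(nd_flags & LOOKUP_CHARFOUND))
		return false;	/* escape mode, but no escape char in path */
	return true;
}
```

Without escape mode every failed lookup is retried in the shadow tree; with escape mode only paths in which the escape character was seen are retried.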

Signed-off-by: Jaroslav Sykora <[email protected]>

arch/i386/kernel/syscall_table.S | 6
fs/exec.c | 4
fs/file_table.c | 19
fs/namei.c | 610 ++++++++++++++++++++++++++++-
fs/namespace.c | 13
include/linux/syscalls.h | 6
kernel/exit.c | 8
kernel/fork.c | 20
8 files changed, 672 insertions(+), 14 deletions(-)

--- orig/fs/namei.c 2007-10-07 19:00:19.000000000 +0200
+++ new/fs/namei.c 2007-10-18 15:35:54.000000000 +0200
@@ -31,6 +31,7 @@
#include <linux/file.h>
#include <linux/fcntl.h>
#include <linux/namei.h>
+#include <linux/ptrace.h>
#include <asm/namei.h>
#include <asm/uaccess.h>

@@ -515,6 +516,25 @@ static struct dentry * real_lookup(struc
return result;
}

+static inline int use_shadow(struct fs_struct *fs, struct nameidata *nd)
+{
+ /* assert: fs->lock held */
+ return (fs->flags & SHDW_ENABLED) && (nd->flags & LOOKUP_INSHDW);
+}
+
+static inline struct dentry *fs_root(struct fs_struct *fs, struct nameidata *nd)
+{
+ /* assert: current->fs->lock held */
+ return (use_shadow(fs, nd)) ? fs->shdwroot : fs->root;
+}
+
+static inline struct vfsmount *fs_rootmnt(struct fs_struct *fs,
+ struct nameidata *nd)
+{
+ /* assert: current->fs->lock held */
+ return (use_shadow(fs, nd)) ? fs->shdwrootmnt : fs->rootmnt;
+}
+
static int __emul_lookup_dentry(const char *, struct nameidata *);

/* SMP-safe */
@@ -532,8 +552,8 @@ walk_init_root(const char *name, struct
return 0;
read_lock(&fs->lock);
}
- nd->mnt = mntget(fs->rootmnt);
- nd->dentry = dget(fs->root);
+ nd->mnt = mntget(fs_rootmnt(fs, nd));
+ nd->dentry = dget(fs_root(fs, nd));
read_unlock(&fs->lock);
return 1;
}
@@ -730,9 +750,9 @@ static __always_inline void follow_dotdo
struct vfsmount *parent;
struct dentry *old = nd->dentry;

- read_lock(&fs->lock);
- if (nd->dentry == fs->root &&
- nd->mnt == fs->rootmnt) {
+ read_lock(&fs->lock);
+ if (nd->dentry == fs_root(fs, nd) &&
+ nd->mnt == fs_rootmnt(fs, nd)) {
read_unlock(&fs->lock);
break;
}
@@ -842,6 +862,11 @@ static fastcall int __link_path_walk(con

hash = init_name_hash();
do {
+ if (unlikely((nd->flags & LOOKUP_FINDCHAR) &&
+ (c == nd->find_char))) {
+ /* shadow control char found */
+ nd->flags |= LOOKUP_CHARFOUND;
+ }
name++;
hash = partial_name_hash(c, hash);
c = *(const unsigned char *)name;
@@ -1100,8 +1125,8 @@ set_it:
}
}

-/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
-static int fastcall do_path_lookup(int dfd, const char *name,
+/* Lookup @name, starting at @dfd, use normal (non-shadow) root and pwd */
+static int fastcall path_lookup_norm(int dfd, const char *name,
unsigned int flags, struct nameidata *nd)
{
int retval = 0;
@@ -1168,6 +1193,313 @@ fput_fail:
goto out_fail;
}

+/*
+ * Set @filp->f_shdw, @filp->f_shdwmnt to @mnt,@dentry.
+ * Takes @filp->f_owner->lock.
+ * Note: if @dentry == NULL then @mnt may be ERR_PTR(-EINVAL).
+ */
+static void set_fileshdw(struct file *filp, struct vfsmount *mnt,
+ struct dentry *dentry)
+{
+ struct dentry *old_dentry;
+ struct vfsmount *old_mnt;
+
+ BUG_ON(dentry != NULL && mnt == NULL);
+ write_lock(&filp->f_owner.lock);
+ old_dentry = filp->f_shdw;
+ old_mnt = filp->f_shdwmnt;
+ filp->f_shdw = dget(dentry);
+ if (dentry)
+ filp->f_shdwmnt = mntget(mnt);
+ else
+ /* mnt is ERR_PTR */
+ filp->f_shdwmnt = mnt;
+ write_unlock(&filp->f_owner.lock);
+
+ if (old_dentry) {
+ dput(old_dentry);
+ mntput(old_mnt);
+ }
+}
+
+/*
+ * Determine @filp->f_shdw,f_shdwmnt from @filp->dentry,mnt
+ * and current->fs->shdwroot.
+ * Also check whether it's a directory and we have permission.
+ * Called only from get_file_shdwdir().
+ */
+static int validate_shdwfile(struct file *filp)
+{
+ struct nameidata nd;
+ char *buf, *name;
+ int res = -ENOMEM;
+
+ buf = (char *)__get_free_page(GFP_KERNEL);
+ if (!buf)
+ goto fail;
+
+ /* doesn't need a lock for reading f_dentry, f_vfsmnt */
+ name = d_path(filp->f_dentry, filp->f_vfsmnt, buf, PAGE_SIZE);
+ res = PTR_ERR(name);
+ if (IS_ERR(name))
+ goto fail_free;
+
+ BUG_ON(*name != '/');
+ res = path_lookup_shdw(AT_FDCWD, name,
+ LOOKUP_FOLLOW|LOOKUP_DIRECTORY, &nd);
+ if (res)
+ goto fail_free;
+
+ res = permission(nd.dentry->d_inode, MAY_EXEC, NULL);
+ if (res)
+ goto fail_put;
+
+ /* ok -> valid */
+ set_fileshdw(filp, nd.mnt, nd.dentry);
+ path_release(&nd);
+ free_page((unsigned long)buf);
+out:
+ /* current->fs->lock is not held on exit */
+ return res;
+
+fail_put:
+ path_release(&nd);
+fail_free:
+ free_page((unsigned long)buf);
+fail:
+ /* error -> invalid */
+ set_fileshdw(filp, ERR_PTR(-EINVAL), NULL);
+ goto out;
+}
+
+/*
+ * Set *@dentry,*@mnt to @file->f_shdw,f_shdwmnt, try to validate
+ * them if needed.
+ */
+int get_file_shdwdir(struct file *file, struct dentry **dentry,
+ struct vfsmount **mnt)
+{
+ int retval = -ENOENT;
+
+ read_lock(&file->f_owner.lock);
+ while (!file->f_shdw) {
+ if (!file->f_shdwmnt) {
+ /* delayed, try to validate */
+ read_unlock(&file->f_owner.lock);
+ if (validate_shdwfile(file))
+ goto out;
+ /* ok but continue loop to avoid races */
+ read_lock(&file->f_owner.lock);
+ } else
+ /* invalid */
+ goto out_unlock;
+ /* continue loop to avoid races */
+ }
+ /* get the shadow dir */
+ *dentry = dget(file->f_shdw);
+ *mnt = mntget(file->f_shdwmnt);
+ retval = 0;
+out_unlock:
+ read_unlock(&file->f_owner.lock);
+out:
+ return retval;
+}
+
+/*
+ * Determine current->fs->shdwpwd,shdwpwdmnt from current->fs->pwd,pwdmnt.
+ * Also check whether it's a directory and we have permission.
+ */
+static int validate_shdwpwd(void)
+{
+ /* called with current->fs->lock held */
+ struct dentry *pwd = dget(current->fs->pwd);
+ struct vfsmount *mnt = mntget(current->fs->pwdmnt);
+ struct nameidata nd;
+ char *buf, *name;
+ int res = -ENOMEM;
+
+ read_unlock(&current->fs->lock);
+ buf = (char *)__get_free_page(GFP_KERNEL);
+ if (!buf)
+ goto fail;
+
+ name = d_path(pwd, mnt, buf, PAGE_SIZE);
+ res = PTR_ERR(name);
+ if (IS_ERR(name))
+ goto fail_free;
+
+ BUG_ON(*name != '/');
+ /* won't recurse here because @name starts with '/' */
+ res = path_lookup_shdw(AT_FDCWD, name,
+ LOOKUP_FOLLOW|LOOKUP_DIRECTORY, &nd);
+ if (res)
+ goto fail_free;
+
+ res = permission(nd.dentry->d_inode, MAY_EXEC, NULL);
+ if (res)
+ goto fail_put;
+
+ /* ok -> valid */
+ set_fs_shdwpwd(current->fs, nd.mnt, nd.dentry);
+ path_release(&nd);
+ free_page((unsigned long)buf);
+out:
+ dput(pwd);
+ mntput(mnt);
+ /* current->fs->lock is NOT held on exit */
+ return res;
+
+fail_put:
+ path_release(&nd);
+fail_free:
+ free_page((unsigned long)buf);
+fail:
+ /* error -> invalidate */
+ set_fs_shdwpwd(current->fs, ERR_PTR(-EINVAL), NULL);
+ goto out;
+}
+
+/*
+ * Set *@dentry,*@mnt to current->fs->shdwpwd,shdwpwdmnt, try to validate
+ * them if needed.
+ */
+static int get_shdwpwd(struct dentry **dentry, struct vfsmount **mnt)
+{
+ int retval = -ENOENT;
+ /* assert: current->fs->lock is held */
+ while (!current->fs->shdwpwd) {
+ if (current->fs->shdwpwdmnt)
+ /* ERR_PTR - invalid */
+ goto out_unlock;
+
+ /* it's delayed -> validate */
+ if (validate_shdwpwd())
+ /* (current->fs->lock is unlocked
+ * in validate_shdwpwd()) */
+ goto out;
+
+ read_lock(&current->fs->lock);
+ /* continue loop to avoid races */
+ }
+
+ *mnt = mntget(current->fs->shdwpwdmnt);
+ *dentry = dget(current->fs->shdwpwd);
+ retval = 0;
+out_unlock:
+ read_unlock(&current->fs->lock);
+out:
+ /* current->fs->lock is NOT held on exit */
+ return retval;
+}
+
+/*
+ * Lookup @name, starting at @dfd, use shadow root and pwd.
+ * Try to validate current->fs->shdwpwd/filp->f_shdwmnt if needed.
+ */
+int fastcall path_lookup_shdw(int dfd, const char *name,
+ unsigned int flags, struct nameidata *nd)
+{
+ int retval = -ENOENT;
+
+ nd->last_type = LAST_ROOT; /* if there are only slashes... */
+ nd->flags = flags | LOOKUP_INSHDW | LOOKUP_NOALT;
+ nd->depth = 0;
+
+ read_lock(&current->fs->lock);
+ if (!(current->fs->flags & SHDW_ENABLED))
+ goto unlock_fail;
+
+ if (*name == '/') {
+ /* start at the shadow root */
+ if (!current->fs->shdwroot)
+ goto unlock_fail;
+ nd->mnt = mntget(current->fs->shdwrootmnt);
+ nd->dentry = dget(current->fs->shdwroot);
+ read_unlock(&current->fs->lock);
+ } else if (dfd == AT_FDCWD) {
+ /* start at the shadow pwd */
+ retval = get_shdwpwd(&nd->dentry, &nd->mnt);
+ /* current->fs->lock is not held here */
+ if (retval)
+ goto out_fail;
+ } else {
+ int fput_needed;
+ struct file *file;
+
+ read_unlock(&current->fs->lock);
+ /* start at file's shadow dir */
+ file = fget_light(dfd, &fput_needed);
+ retval = -EBADF;
+ if (!file)
+ goto out_fail;
+
+ retval = get_file_shdwdir(file, &nd->dentry, &nd->mnt);
+ fput_light(file, fput_needed);
+
+ if (retval)
+ goto out_fail;
+ }
+
+ current->total_link_count = 0;
+ retval = link_path_walk(name, nd);
+
+ if (likely(retval == 0)) {
+ if (unlikely(!audit_dummy_context() && nd && nd->dentry &&
+ nd->dentry->d_inode))
+ audit_inode(name, nd->dentry->d_inode);
+ }
+
+out_fail:
+ return retval;
+
+unlock_fail:
+ read_unlock(&current->fs->lock);
+ goto out_fail;
+}
+
+/*
+ * Perform full lookup of @name starting at @dfd.
+ * 1. do a normal lookup
+ * 2. if it fails try to lookup in shadow dir
+ * Returns 0 and nd will be valid on success; returns an error otherwise.
+ */
+static int fastcall do_path_lookup(int dfd, const char *name,
+ unsigned int flags, struct nameidata *nd)
+{
+ int retval;
+
+ if (!(flags & LOOKUP_NOSHDW)) {
+ /* shadow dir isn't disabled in the current lookup session */
+ read_lock(&current->fs->lock);
+ if (current->fs->flags & SHDW_ENABLED) {
+ /* shadow is enabled */
+ if (current->fs->flags & SHDW_USE_ESC) {
+ flags |= LOOKUP_FINDCHAR;
+ nd->find_char = current->fs->shdw_escch;
+ }
+ } else
+ /* shadow is disabled - disable it in lookup session */
+ flags |= LOOKUP_NOSHDW;
+ read_unlock(&current->fs->lock);
+ }
+
+ retval = path_lookup_norm(dfd, name, flags, nd);
+
+ /*
+ * Do another lookup in the shadow dir iff:
+ * normal lookup failed
+ * && shadow is enabled
+ * && the last lookup was not already going within shadows
+ * && user asked for the escape character and we found it
+ */
+ if (unlikely(retval && !(nd->flags & (LOOKUP_NOSHDW|LOOKUP_INSHDW))
+ && !((nd->flags & LOOKUP_FINDCHAR)
+ && !(nd->flags & LOOKUP_CHARFOUND))))
+ retval = path_lookup_shdw(dfd, name, flags, nd);
+
+ return retval;
+}
+
int fastcall path_lookup(const char *name, unsigned int flags,
struct nameidata *nd)
{
@@ -1225,6 +1557,16 @@ static int __path_lookup_intent_open(int
}
} else if (err != 0)
release_open_intent(nd);
+ else if (!(nd->flags & LOOKUP_NOSHDW) &&
+ S_ISDIR(nd->dentry->d_inode->i_mode)) {
+ /* setup file's shadow dir */
+ /* default: filp->f_shdw = filp->f_shdwmnt = NULL */
+ if (nd->flags & LOOKUP_INSHDW) {
+ filp->f_shdw = dget(nd->dentry);
+ filp->f_shdwmnt = mntget(nd->mnt);
+ }
+ }
+
return err;
}

@@ -2792,6 +3134,260 @@ const struct inode_operations page_symli
.put_link = page_put_link,
};

+
+/*
+ * Find task by @pid, check permissions.
+ * @pid == 0 -> current.
+ */
+static struct task_struct *tsk_by_pid(pid_t pid)
+{
+ struct task_struct *tsk = current;
+
+ if (pid) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (tsk)
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ if (!tsk)
+ tsk = ERR_PTR(-ESRCH);
+ else if (!ptrace_may_attach(tsk)) {
+ put_task_struct(tsk);
+ tsk = ERR_PTR(-EPERM);
+ }
+ }
+ return tsk;
+}
+
+asmlinkage long sys_getshdwinfo(pid_t pid, int func, int __user *data)
+{
+ struct task_struct *tsk = tsk_by_pid(pid);
+ long ret = PTR_ERR(tsk);
+
+ if (IS_ERR(tsk))
+ goto out_noput;
+ ret = -EINVAL;
+
+ switch (func) {
+ case FSI_SHDW_ENABLE:
+ read_lock(&tsk->fs->lock);
+ ret = (tsk->fs->flags & SHDW_ENABLED) ? 1 : 0;
+ read_unlock(&tsk->fs->lock);
+ ret = put_user(ret, data);
+ break;
+
+ case FSI_SHDW_ESC_EN:
+ read_lock(&tsk->fs->lock);
+ ret = (tsk->fs->flags & SHDW_USE_ESC) ? 1 : 0;
+ read_unlock(&tsk->fs->lock);
+ ret = put_user(ret, data);
+ break;
+
+ case FSI_SHDW_ESC_CHAR:
+ read_lock(&tsk->fs->lock);
+ ret = tsk->fs->shdw_escch;
+ read_unlock(&tsk->fs->lock);
+ ret = put_user((char)ret, (char __user *)data);
+ break;
+ }
+
+ if (pid)
+ put_task_struct(tsk);
+out_noput:
+ /* avoid REGPARM breakage on x86: */
+ prevent_tail_call(ret);
+ return ret;
+}
+
+/*
+ * Set fs->shdwpwd,shdwpwdmnt according to @pathname.
+ * @pathname is NOT looked up in shadow dir.
+ */
+static int do_setshdwpwd(struct fs_struct *fs, const char __user *pathname)
+{
+ struct nameidata nd;
+ int error = __user_walk(pathname,
+ LOOKUP_FOLLOW|LOOKUP_DIRECTORY|LOOKUP_NOSHDW, &nd);
+ if (error)
+ goto out;
+
+ error = vfs_permission(&nd, MAY_EXEC);
+ if (error)
+ goto dput_and_out;
+
+ set_fs_shdwpwd(fs, nd.mnt, nd.dentry);
+
+dput_and_out:
+ path_release(&nd);
+out:
+ return error;
+}
+
+/*
+ * Set fs->shdwroot,shdwrootmnt according to @pathname.
+ * @pathname is NOT looked up in shadow dir.
+ * If @pathname == NULL then disable shadow dir.
+ */
+static int do_setshdwroot(struct fs_struct *fs, const char __user *pathname)
+{
+ struct dentry *old_dentry;
+ struct vfsmount *old_mnt;
+ struct nameidata nd;
+ int error = 0;
+
+ if (pathname) {
+ error = __user_walk(pathname,
+ LOOKUP_FOLLOW|LOOKUP_DIRECTORY|LOOKUP_NOSHDW, &nd);
+ if (error)
+ goto out;
+
+ error = vfs_permission(&nd, MAY_EXEC);
+ if (error)
+ goto dput_and_out;
+ } else {
+ /* remove shadow root */
+ nd.dentry = NULL;
+ nd.mnt = NULL;
+ }
+
+ write_lock(&fs->lock);
+ old_dentry = fs->shdwroot;
+ old_mnt = fs->shdwrootmnt;
+ fs->shdwroot = dget(nd.dentry);
+ fs->shdwrootmnt = mntget(nd.mnt);
+ if (!nd.dentry)
+ /* disable shadow dir */
+ fs->flags &= ~SHDW_ENABLED;
+ write_unlock(&fs->lock);
+
+ dput(old_dentry);
+ mntput(old_mnt);
+
+dput_and_out:
+ path_release(&nd);
+out:
+ return error;
+}
+
+/*
+ * Set file->f_shdw,f_shdwmnt according to @pathname.
+ * @pathname is NOT looked up in shadow dir.
+ * If @pathname == NULL then set file->f_shdw,f_shdwmnt as delayed.
+ */
+static int do_setshdwfd(struct task_struct *tsk, int fd,
+ const char __user *pathname)
+{
+ struct nameidata nd;
+ struct file *filp = __fget(tsk->files, fd);
+ int error = 0;
+
+ if (!filp)
+ return -EBADF;
+
+ if (pathname) {
+ error = __user_walk(pathname,
+ LOOKUP_FOLLOW|LOOKUP_DIRECTORY|LOOKUP_NOSHDW, &nd);
+ if (error)
+ goto out;
+
+ error = vfs_permission(&nd, MAY_EXEC);
+ if (!error)
+ set_fileshdw(filp, nd.mnt, nd.dentry);
+ /* drop the lookup reference even if the permission check failed */
+ path_release(&nd);
+ } else {
+ /* set delayed */
+ set_fileshdw(filp, NULL, NULL);
+ }
+out:
+ fput(filp);
+ return error;
+}
+
+asmlinkage long sys_setshdwpath(pid_t pid, int fd, const char __user *path)
+{
+ struct task_struct *tsk = tsk_by_pid(pid);
+ long ret = PTR_ERR(tsk);
+
+ if (IS_ERR(tsk))
+ goto out_noput;
+
+ ret = -EINVAL;
+
+ if (fd >= 0)
+ /* a normal file's shadow */
+ ret = do_setshdwfd(tsk, fd, path);
+ else if (fd == SHDW_FD_ROOT)
+ /* root shadow */
+ ret = do_setshdwroot(tsk->fs, path);
+ else if (fd == SHDW_FD_PWD) {
+ /* pwd shadow */
+ if (path)
+ ret = do_setshdwpwd(tsk->fs, path);
+ else {
+ /* set delayed */
+ set_fs_shdwpwd(tsk->fs, NULL, NULL);
+ ret = 0;
+ }
+ }
+
+ if (pid)
+ put_task_struct(tsk);
+out_noput:
+ /* avoid REGPARM breakage on x86: */
+ prevent_tail_call(ret);
+ return ret;
+}
+
+asmlinkage long sys_setshdwinfo(pid_t pid, int func, int data)
+{
+ struct task_struct *tsk = tsk_by_pid(pid);
+ long ret = PTR_ERR(tsk);
+
+ if (IS_ERR(tsk))
+ goto out_noput;
+
+ ret = -EINVAL;
+ switch (func) {
+ case FSI_SHDW_ENABLE:
+ ret = 0;
+ write_lock(&tsk->fs->lock);
+ tsk->fs->flags &= ~SHDW_ENABLED;
+ if (data) {
+ /* may enable shadow? */
+ if (tsk->fs->shdwroot && tsk->fs->shdwrootmnt)
+ tsk->fs->flags |= SHDW_ENABLED;
+ else
+ ret = -EPERM;
+ }
+ write_unlock(&tsk->fs->lock);
+ break;
+
+ case FSI_SHDW_ESC_EN:
+ ret = 0;
+ write_lock(&tsk->fs->lock);
+ tsk->fs->flags &= ~SHDW_USE_ESC;
+ if (data)
+ tsk->fs->flags |= SHDW_USE_ESC;
+ write_unlock(&tsk->fs->lock);
+ break;
+
+ case FSI_SHDW_ESC_CHAR:
+ ret = 0;
+ write_lock(&tsk->fs->lock);
+ tsk->fs->shdw_escch = (unsigned char)data;
+ write_unlock(&tsk->fs->lock);
+ break;
+ }
+
+ if (pid)
+ put_task_struct(tsk);
+out_noput:
+ /* avoid REGPARM breakage on x86: */
+ prevent_tail_call(ret);
+ return ret;
+}
+
EXPORT_SYMBOL(__user_walk);
EXPORT_SYMBOL(__user_walk_fd);
EXPORT_SYMBOL(follow_down);
--- orig/fs/exec.c 2007-10-07 19:00:18.000000000 +0200
+++ new/fs/exec.c 2007-10-07 19:53:16.000000000 +0200
@@ -1076,11 +1076,15 @@ int flush_old_exec(struct linux_binprm *
if (bprm->e_uid != current->euid || bprm->e_gid != current->egid) {
suid_keys(current);
set_dumpable(current->mm, suid_dumpable);
+ /* switch off the shadow directories for a suid exec */
+ current->fs->flags &= ~SHDW_ENABLED;
current->pdeath_signal = 0;
} else if (file_permission(bprm->file, MAY_READ) ||
(bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP)) {
suid_keys(current);
set_dumpable(current->mm, suid_dumpable);
+ /* switch off the shadow directories for an unreadable or non-dumpable binary */
+ current->fs->flags &= ~SHDW_ENABLED;
}

/* An exec changes our domain. We are no longer part of the thread
--- orig/fs/namespace.c 2007-10-07 19:00:19.000000000 +0200
+++ new/fs/namespace.c 2007-10-07 13:39:08.000000000 +0200
@@ -1448,6 +1448,7 @@ static struct mnt_namespace *dup_mnt_ns(
{
struct mnt_namespace *new_ns;
struct vfsmount *rootmnt = NULL, *pwdmnt = NULL, *altrootmnt = NULL;
+ struct vfsmount *shdwrootmnt = NULL, *shdwpwdmnt = NULL;
struct vfsmount *p, *q;

new_ns = kmalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
@@ -1494,6 +1495,14 @@ static struct mnt_namespace *dup_mnt_ns(
altrootmnt = p;
fs->altrootmnt = mntget(q);
}
+ if (p == fs->shdwrootmnt) {
+ shdwrootmnt = p;
+ fs->shdwrootmnt = mntget(q);
+ }
+ if (p == fs->shdwpwdmnt) {
+ shdwpwdmnt = p;
+ fs->shdwpwdmnt = mntget(q);
+ }
}
p = next_mnt(p, mnt_ns->root);
q = next_mnt(q, new_ns->root);
@@ -1506,6 +1515,10 @@ static struct mnt_namespace *dup_mnt_ns(
mntput(pwdmnt);
if (altrootmnt)
mntput(altrootmnt);
+ if (shdwrootmnt)
+ mntput(shdwrootmnt);
+ if (shdwpwdmnt)
+ mntput(shdwpwdmnt);

return new_ns;
}
--- orig/fs/file_table.c 2007-07-09 01:32:17.000000000 +0200
+++ new/fs/file_table.c 2007-10-07 13:39:08.000000000 +0200
@@ -151,8 +151,8 @@ EXPORT_SYMBOL(fput);
*/
void fastcall __fput(struct file *file)
{
- struct dentry *dentry = file->f_path.dentry;
- struct vfsmount *mnt = file->f_path.mnt;
+ struct dentry *dentry = file->f_path.dentry, *s_dentry = file->f_shdw;
+ struct vfsmount *mnt = file->f_path.mnt, *s_mnt = file->f_shdwmnt;
struct inode *inode = dentry->d_inode;

might_sleep();
@@ -177,15 +177,21 @@ void fastcall __fput(struct file *file)
file_kill(file);
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
+ file->f_shdw = NULL;
+ file->f_shdwmnt = NULL;
file_free(file);
dput(dentry);
mntput(mnt);
+ if (s_dentry) {
+ /* NOTE: if s_dentry == NULL then s_mnt may be ERR_PTR */
+ dput(s_dentry);
+ mntput(s_mnt);
+ }
}

-struct file fastcall *fget(unsigned int fd)
+struct file fastcall *__fget(struct files_struct *files, unsigned int fd)
{
struct file *file;
- struct files_struct *files = current->files;

rcu_read_lock();
file = fcheck_files(files, fd);
@@ -201,6 +207,11 @@ struct file fastcall *fget(unsigned int
return file;
}

+struct file fastcall *fget(unsigned int fd)
+{
+ return __fget(current->files, fd);
+}
+
EXPORT_SYMBOL(fget);

/*
--- orig/kernel/exit.c 2007-10-07 19:00:26.000000000 +0200
+++ new/kernel/exit.c 2007-10-07 13:39:08.000000000 +0200
@@ -522,6 +522,14 @@ static inline void __put_fs_struct(struc
dput(fs->altroot);
mntput(fs->altrootmnt);
}
+ if (fs->shdwroot) {
+ dput(fs->shdwroot);
+ mntput(fs->shdwrootmnt);
+ }
+ if (fs->shdwpwd) {
+ dput(fs->shdwpwd);
+ mntput(fs->shdwpwdmnt);
+ }
kmem_cache_free(fs_cachep, fs);
}
}
--- orig/kernel/fork.c 2007-10-07 19:00:26.000000000 +0200
+++ new/kernel/fork.c 2007-10-07 13:39:08.000000000 +0200
@@ -586,6 +586,9 @@ static inline struct fs_struct *__copy_f
fs->root = dget(old->root);
fs->pwdmnt = mntget(old->pwdmnt);
fs->pwd = dget(old->pwd);
+ fs->flags = old->flags;
+ fs->shdw_escch = old->shdw_escch;
+
if (old->altroot) {
fs->altrootmnt = mntget(old->altrootmnt);
fs->altroot = dget(old->altroot);
@@ -593,6 +596,23 @@ static inline struct fs_struct *__copy_f
fs->altrootmnt = NULL;
fs->altroot = NULL;
}
+
+ if (old->shdwroot) {
+ fs->shdwrootmnt = mntget(old->shdwrootmnt);
+ fs->shdwroot = dget(old->shdwroot);
+ } else {
+ fs->shdwrootmnt = NULL;
+ fs->shdwroot = NULL;
+ }
+
+ if (old->shdwpwd) {
+ fs->shdwpwdmnt = mntget(old->shdwpwdmnt);
+ fs->shdwpwd = dget(old->shdwpwd);
+ } else {
+ fs->shdwpwdmnt = NULL;
+ fs->shdwpwd = NULL;
+ }
+
read_unlock(&old->lock);
}
return fs;
--- orig/include/linux/syscalls.h 2007-10-07 19:00:26.000000000 +0200
+++ new/include/linux/syscalls.h 2007-10-07 13:39:08.000000000 +0200
@@ -614,4 +614,10 @@ asmlinkage long sys_fallocate(int fd, in

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

+asmlinkage long sys_getshdwinfo(pid_t pid, int func, int __user *data);
+
+asmlinkage long sys_setshdwinfo(pid_t pid, int func, int data);
+
+asmlinkage long sys_setshdwpath(pid_t pid, int fd, const char __user *path);
+
#endif
--- orig/arch/i386/kernel/syscall_table.S 2007-10-07 18:59:54.000000000 +0200
+++ new/arch/i386/kernel/syscall_table.S 2007-10-07 20:40:40.000000000 +0200
@@ -222,7 +222,7 @@ ENTRY(sys_call_table)
.long sys_getdents64 /* 220 */
.long sys_fcntl64
.long sys_ni_syscall /* reserved for TUX */
- .long sys_ni_syscall
+ .long sys_getshdwinfo
.long sys_gettid
.long sys_readahead /* 225 */
.long sys_setxattr
@@ -250,7 +250,7 @@ ENTRY(sys_call_table)
.long sys_io_submit
.long sys_io_cancel
.long sys_fadvise64 /* 250 */
- .long sys_ni_syscall
+ .long sys_setshdwinfo
.long sys_exit_group
.long sys_lookup_dcookie
.long sys_epoll_create
@@ -284,7 +284,7 @@ ENTRY(sys_call_table)
.long sys_mq_getsetattr
.long sys_kexec_load
.long sys_waitid
- .long sys_ni_syscall /* 285 */ /* available */
+ .long sys_setshdwpath /* 285 */
.long sys_add_key
.long sys_request_key
.long sys_keyctl
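
The condition that makes do_path_lookup() in the patch above fall back to path_lookup_shdw() can be modelled in user space. This is only a sketch under stated assumptions: the DEMO_* flag values and demo_should_retry_in_shadow() are invented for illustration and are not part of the patch; the real LOOKUP_* constants are defined elsewhere in the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this sketch. */
enum {
	DEMO_NOSHDW    = 1 << 0,	/* shadow lookup disabled for this session */
	DEMO_INSHDW    = 1 << 1,	/* lookup already went through the shadow */
	DEMO_FINDCHAR  = 1 << 2,	/* caller asked to watch for the escape char */
	DEMO_CHARFOUND = 1 << 3,	/* escape char was seen in the path */
};

/* Mirrors the test that guards path_lookup_shdw() in do_path_lookup(). */
static bool demo_should_retry_in_shadow(int retval, unsigned int nd_flags)
{
	/* kernel convention: 0 on success, negative errno on failure */
	assert(retval <= 0);

	if (!retval)
		return false;		/* normal lookup succeeded */
	if (nd_flags & (DEMO_NOSHDW | DEMO_INSHDW))
		return false;		/* disabled, or already in the shadow */
	if ((nd_flags & DEMO_FINDCHAR) && !(nd_flags & DEMO_CHARFOUND))
		return false;		/* escape char required but absent */
	return true;
}
```

Note that, as the code is written, a failed lookup is also retried in the shadow tree when the escape character was not requested at all; only "requested but not found" suppresses the retry.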

2007-10-18 16:05:35

by Jan Engelhardt

Subject: Re: [RFC PATCH 0/5] Shadow directories


On Oct 18 2007 17:21, Jaroslav Sykora wrote:
>Hello,
>
>Let's say we have an archive file "hello.zip" with a hello world program source
>code. We want to do this:
> cat hello.zip^/hello.c
> gcc hello.zip^/hello.c -o hello
> etc..
>
>The '^' is an escape character and it tells the computer to treat the file as a directory.

Too bad, since ^ is a valid character in a *file*name. Everything is, with
the exception of '\0' and '/'. At the end of the day, there are no control
characters you could use.

But what you could do is: write a FUSE fs that mirrors the lower content
(lofs/fuseloop/however it was named) and expands .zip files as
directories are readdir'ed or the zip files stat'ed. That saves us
from cluttering up the Linux VFS with such stuff.
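
The point about '^' being an ordinary filename character is easy to check from user space. A minimal sketch, assuming a POSIX system with a writable /tmp; demo_caret_is_valid_name() is a name invented for this illustration.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create an ordinary file literally named "hello.zip^" in a fresh
 * temporary directory, confirm it exists, then clean up.
 * Returns 0 if the name was accepted, -1 otherwise.
 */
static int demo_caret_is_valid_name(void)
{
	char dir[] = "/tmp/caret-demo-XXXXXX";
	char path[64];
	struct stat st;
	int fd, ok;

	if (!mkdtemp(dir))
		return -1;
	snprintf(path, sizeof(path), "%s/hello.zip^", dir);

	fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd >= 0)
		close(fd);
	ok = fd >= 0 && stat(path, &st) == 0 && S_ISREG(st.st_mode);

	/* best-effort cleanup of the file and the temporary directory */
	unlink(path);
	rmdir(dir);
	return ok ? 0 : -1;
}
```

So a file called "hello.zip^" can legitimately sit next to "hello.zip", which is exactly the ambiguity an escape character in the path has to live with.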

2007-10-18 16:29:59

by David Newall

Subject: Re: [RFC PATCH 0/5] Shadow directories

Jaroslav Sykora wrote:
> Let's say we have an archive file "hello.zip" with a hello world program source
> code. We want to do this:
> cat hello.zip^/hello.c
> gcc hello.zip^/hello.c -o hello
> etc..
>

Wouldn't you do this as a user space filesystem?

2007-10-18 16:33:21

by David Newall

Subject: Re: [RFC PATCH 0/5] Shadow directories

David Newall wrote:
> Jaroslav Sykora wrote:
>> Let's say we have an archive file "hello.zip" with a hello world
>> program source
>> code. We want to do this:
>> cat hello.zip^/hello.c
>> gcc hello.zip^/hello.c -o hello
>> etc..
>>
>
> Wouldn't you do this as a user space filesystem?
Which is what you were saying.

*SMACK* I so stupid.

2007-10-18 16:53:19

by David Newall

Subject: Re: [RFC PATCH 0/5] Shadow directories

David Newall wrote:
> David Newall wrote:
>> Jaroslav Sykora wrote:
>>> Let's say we have an archive file "hello.zip" with a hello world
>>> program source
>>> code. We want to do this:
>>> cat hello.zip^/hello.c
>>> gcc hello.zip^/hello.c -o hello
>>> etc..
>>>
>>
>> Wouldn't you do this as a user space filesystem?
> Which is what you were saying.
>
> *SMACK* I so stupid.

On third thoughts, what's the reason for this?

2007-10-18 17:08:19

by Jaroslav Sykora

Subject: Re: [RFC PATCH 0/5] Shadow directories

On Thursday 18 of October 2007, Jan Engelhardt wrote:
>
> On Oct 18 2007 17:21, Jaroslav Sykora wrote:
> >Hello,
> >
> >Let's say we have an archive file "hello.zip" with a hello world program source
> >code. We want to do this:
> > cat hello.zip^/hello.c
> > gcc hello.zip^/hello.c -o hello
> > etc..
> >
> >The '^' is an escape character and it tells the computer to treat the file as a directory.
>
> Too bad, since ^ is a valid character in a *file*name. Everything is, with
> the exception of '\0' and '/'. At the end of the day, there are no control
> characters you could use.
>
> But what you could do is: write a FUSE fs that mirrors the lower content
> (lofs/fuseloop/however it was named) and expands .zip files as
> directories are readdir'ed or the zip files stat'ed. That saves us
> from cluttering up the Linux VFS with such stuff.
>

Yes, that's exactly what RheaVFS and AVFS do. Except that they both use an escape
character because:
1. without it some programs may break [ http://lwn.net/Articles/100148/ ]
2. it's very useful to pass additional parameters after the escape char to the server.

We can start VFS servers (mentioned above) and chroot the whole user session into
the mount directory of the server. It works but it's very slow, practically unusable.
So both servers need some kind of VFS redirector. In the past there were many
different approaches -- LD_PRELOAD hack, CodaFS hack, NFS hack (?), proof-of-concept
kernel hacks (project podfuk) etc.

If anybody can think of any other solution to the "redirector problem", possibly
even a non-kernel-based one, let me know and I'd be glad :-)

--
I find television very educating. Every time somebody turns on the set,
I go into the other room and read a book.

2007-10-18 17:10:30

by Jan Engelhardt

Subject: Re: [RFC PATCH 0/5] Shadow directories


On Oct 18 2007 19:07, Jaroslav Sykora wrote:
>> On Oct 18 2007 17:21, Jaroslav Sykora wrote:
>> >Hello,
>> >
>> >Let's say we have an archive file "hello.zip" with a hello world program source
>> >code. We want to do this:
>> > cat hello.zip^/hello.c
>> > gcc hello.zip^/hello.c -o hello
>> > etc..
>> >
>> >The '^' is an escape character and it tells the computer to treat the file as a directory.
>>
>> But what you could do is: write a FUSE fs that mirrors the lower content
>> (lofs/fuseloop/however it was named) and expands .zip files as
>> directories are readdir'ed or the zip files stat'ed. That saves us
>> from cluttering up the Linux VFS with such stuff.
>
>Yes, that's exactly what RheaVFS and AVFS do. Except that they both use an escape
>character because:
>1. without it some programs may break [ http://lwn.net/Articles/100148/ ]
>2. it's very useful to pass additional parameters after the escape char to the server.
>
>We can start VFS servers (mentioned above) and chroot the whole user session into
>the mount directory of the server. It works but it's very slow, practically unusable.

Sounds like a program bug, since NTFS-3G is proof of concept that FUSE
can be fast.

>If anybody can think of any other solution of the "redirector
>problem", possibly even non-kernel based one, let me know and I'd be
>glad :-)

2007-10-18 20:09:21

by Jan Engelhardt

Subject: Re: [RFC PATCH 0/5] Shadow directories


On Oct 19 2007 05:32, David Newall wrote:
>
> The claim is wrong. UNIX systems have traditionally allowed the
> superuser to create hard links to directories. See link(2) for
> 2.10BSD
> <http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD>.
> Having got that wrong throws doubt on the argument; perhaps a path
> can simultaneously be a file and a directory.

But hell will break loose if you allow hardlinking directories.

mkdir /tmp/a
ln /tmp/a /tmp/a/b

And you would not be able to rmdir /tmp/a/b because the directory is
not empty (it contains "b" [full path: /tmp/a/b/b]).
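
What a system does with that "ln" today can be probed with link(2) directly. A minimal sketch, assuming a POSIX system with a writable /tmp; demo_link_dir_into_itself() is a name invented for this illustration. On Linux the call is expected to be refused with EPERM, even for root.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Try to hard-link a fresh temporary directory to a name inside itself,
 * the same shape as "ln /tmp/a /tmp/a/b".
 * Returns 0 if link(2) succeeded, otherwise the errno it failed with
 * (-1 if the temporary directory could not be set up).
 */
static int demo_link_dir_into_itself(void)
{
	char dir[] = "/tmp/dirlink-demo-XXXXXX";
	char inner[64];
	int err = 0;

	if (!mkdtemp(dir))
		return -1;
	snprintf(inner, sizeof(inner), "%s/b", dir);

	if (link(dir, inner) != 0)
		err = errno;
	else
		unlink(inner);	/* unexpected success: best-effort cleanup */

	rmdir(dir);
	return err;
}
```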

2007-10-18 20:10:30

by Jaroslav Sykora

Subject: Re: [RFC PATCH 0/5] Shadow directories

On Thursday 18 of October 2007, Jan Engelhardt wrote:
> >> >
> >> >The '^' is an escape character and it tells the computer to treat the file as a directory.
> >>
> >> But what you could do is: write a FUSE fs that mirrors the lower content
> >> (lofs/fuseloop/however it was named) and expands .zip files as
> >> directories are readdir'ed or the zip files stat'ed. That saves us
> >> from cluttering up the Linux VFS with such stuff.
> >
> >Yes, that's exactly what RheaVFS and AVFS do. Except that they both use an escape
> >character because:
> >1. without it some programs may break [ http://lwn.net/Articles/100148/ ]
> >2. it's very useful to pass additional parameters after the escape char to the server.
> >
> >We can start VFS servers (mentioned above) and chroot the whole user session into
> >the mount directory of the server. It works but it's very slow, practically unusable.
>
> Sounds like a program bug, since NTFS-3G is proof of concept that FUSE
> can be fast.
>

Good point, I'll look into it.

A minor implementation problem with a chrooted environment is that the FUSE VFS server
must be run with root privileges to allow setuid programs on the mounted filesystems.
But it's certainly doable.


--
"Elves and Dragons!" I says to him. "Cabbages and potatoes are better
for you and me." -- J. R. R. Tolkien

2007-10-18 20:12:23

by Jan Engelhardt

Subject: Re: [RFC PATCH 0/5] Shadow directories


On Oct 18 2007 22:10, Jaroslav Sykora wrote:
>
>A minor implementation problem with chrooted environment is that the
>FUSE VFS server must be run with root privileges to allow setuid
>programs on the mounted filesystems. But it's certainly doable.

You would not want user-supplied filesystems to carry SUID bits...

2007-10-18 20:37:57

by David Newall

Subject: Re: [RFC PATCH 0/5] Shadow directories

Jaroslav Sykora wrote:
> If anybody can think of any other solution of the "redirector problem", possibly
> even non-kernel based one, let me know and I'd be glad :-)

If I understand your problem, you wish to treat an archive file as if it
was a directory. Thus, in the ideal situation, you could do the following:

cat hello.zip/hello.c
gcc hello.zip/hello.c -o hello
etc..


Rather than complicate matters with a second tree, use FUSE with an
explicit directory. For example, ~/expand could be your shadow, thus to
compile hello.c from ~/hello.zip:

gcc ~/expand/hello.zip^/hello.c -o hello


I think no kernel change would be required.
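
With an explicit mount like this, the FUSE server only has to split each path it receives at the escaped component. A minimal sketch of that split, assuming the escape character is always the last character of the archive's name (it ignores the case of extra parameters after the escape character); demo_split_at_escape() is a name invented for this illustration, not an AVFS or RheaVFS function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Find the first path component that ends in the escape character @esc,
 * i.e. the first "<esc>/" pair or a trailing <esc>.
 * On a match, returns a pointer to the member path inside the archive
 * (possibly "") and stores the length of the archive path, without the
 * escape character, in *archive_len. Returns NULL if nothing is escaped.
 */
static const char *demo_split_at_escape(const char *path, char esc,
					size_t *archive_len)
{
	const char *p;

	assert(path && archive_len && esc != '\0' && esc != '/');

	for (p = strchr(path, esc); p; p = strchr(p + 1, esc)) {
		if (p[1] == '/' || p[1] == '\0') {
			*archive_len = (size_t)(p - path);
			return p[1] == '/' ? p + 2 : p + 1;
		}
	}
	return NULL;
}
```

For "/hello.zip^/hello.c" and '^' this yields the archive "/hello.zip" and the member "hello.c"; a name such as "/a^b/c", where the caret sits in the middle of a component, is left alone.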

I'm not keen on the caret. One of the early claims made in
http://lwn.net/Articles/100148/ is:
> Another branch, led by Al Viro, worries about the locking
> considerations of this whole scheme. Linux, like most Unix systems,
> has never allowed hard links to directories for a number of reasons;

The claim is wrong. UNIX systems have traditionally allowed the
superuser to create hard links to directories. See link(2) for 2.10BSD
<http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD>.
Having got that wrong throws doubt on the argument; perhaps a path can
simultaneously be a file and a directory.

2007-10-18 20:47:21

by Al Viro

Subject: Re: [RFC PATCH 0/5] Shadow directories

On Fri, Oct 19, 2007 at 06:07:45AM +0930, David Newall wrote:
> >considerations of this whole scheme. Linux, like most Unix systems,
> >has never allowed hard links to directories for a number of reasons;
>
> The claim is wrong. UNIX systems have traditionally allowed the
> superuser to create hard links to directories. See link(2) for 2.10BSD
> <http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD>.
> Having got that wrong throws doubt on the argument; perhaps a path can
> simultaneously be a file and a directory.

Learn to read. Linux has never allowed that. Most of the Unix systems
do not allow that. Original _did_ allow that, but at the cost of very
easily triggered fs corruption (and it didn't have things like rename(2) -
it _did_ have userland implementation, of course, in suid-root mv(1),
but that sucker had been extremely racy and could be easily used to
screw filesystem to hell and back; adding rename(2) to the set of primitives
combined with multiple links to directories leads to very nasty issues on
_any_ system).

2007-10-19 02:57:44

by David Newall

Subject: Re: [RFC PATCH 0/5] Shadow directories

Al Viro wrote:
> On Fri, Oct 19, 2007 at 06:07:45AM +0930, David Newall wrote:
>
>>> considerations of this whole scheme. Linux, like most Unix systems,
>>> has never allowed hard links to directories for a number of reasons;
>>>
>> The claim is wrong. UNIX systems have traditionally allowed the
>> superuser to create hard links to directories. See link(2) for 2.10BSD
>> <http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD>.
>> Having got that wrong throws doubt on the argument; perhaps a path can
>> simultaneously be a file and a directory.
>>
>
> Learn to read. Linux has never allowed that. Most of the Unix systems
> do not allow that.

I did read the claim and it is ambiguous, in that it can reasonably be
read to mean that most UNIX systems never allowed such links, which is
wrong. All UNIX systems allowed it until relatively recently.

2007-10-19 05:37:36

by Al Viro

Subject: Re: [RFC PATCH 0/5] Shadow directories

On Fri, Oct 19, 2007 at 12:27:16PM +0930, David Newall wrote:

> >Learn to read. Linux has never allowed that. Most of the Unix systems
> >do not allow that.
>
> I did read the claim and it is ambiguous, in that it can reasonably be
> read to mean that most UNIX systems never allowed such links, which is
> wrong. All UNIX systems allowed it until relatively recently.

FVO "relatively recently" exceeding a decade and a half. In any case,
it's _trivial_ to get fs corruption on any system with such links -
play with rename() races a bit and you'll get it. And yes, it does
include 4.4BSD and quite a chunk of even later history.

Anyway, you are quite welcome to propose a sane locking scheme capable
of dealing with that mess.

As for the posted patch, AFAICS it's FUBAR in handling of .. in such
directories. Moreover, how are you going to keep that shadow tree
in sync with the main one if somebody starts doing renames in the
latter? Or mount --move, or...