This is a repost of the v2 patch updated for the d_real changes
For those who want to test it out, there's a git tree here
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc.git
on the shiftfs-v3 branch
v2:
This is a rewrite of the original shiftfs code to make use of super
block user namespaces. I've also removed the mappings passed in as
mount options in favour of using the mappings in s_user_ns. The upshot
is that it probably needs retesting for all the bugs people found,
since there's a lot of new code, and the use case has changed. Now, to
use it, you have to mark the filesystems you want to be mountable
inside a user namespace as root:
mount -t shiftfs -o mark <origin> <mark location>
The origin should be inaccessible to the unprivileged user, and the
access to the <mark location> can be controlled by the usual filesystem
permissions. Once this is done, any user who can get access to the
<mark location> can do (as the local user namespace root):
mount -t shiftfs <mark location> <somewhere in my local mount ns>
They will then be able to write at their user namespace shifts, while
the interior view of the uid/gid is what appears on the <origin>.
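
The two-step rule above (real root marks a subtree, user namespace root
mounts a mark) can be modelled in userspace as a small decision function.
This is only a sketch, not the kernel code: the function name and its
boolean parameters are hypothetical stand-ins for the
capable(CAP_SYS_ADMIN), ns_capable() and marked-superblock checks made in
shiftfs_fill_super().

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the shiftfs mount-time checks.
 *
 * mark             - "-o mark" was passed
 * host_admin       - caller has CAP_SYS_ADMIN in the initial user namespace
 * userns_admin     - caller has CAP_SYS_ADMIN in its own user namespace
 * source_is_marked - the source path sits on a shiftfs "mark" mount
 *
 * Returns 0 if the mount would be allowed, -EPERM otherwise.
 */
int shiftfs_mount_allowed(bool mark, bool host_admin,
                          bool userns_admin, bool source_is_marked)
{
    /* Marking a subtree is reserved for real root. */
    if (mark)
        return host_admin ? 0 : -EPERM;

    /* Mounting a mark needs admin rights in the caller's user namespace. */
    if (!userns_admin)
        return -EPERM;

    /* Only subtrees that real root has marked may be shift mounted. */
    return source_is_marked ? 0 : -EPERM;
}
```

In this model an unprivileged user namespace root can never create a mark,
and can only shift mount something that has already been marked.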
In using the s_user_ns, a lot of the code actually simplified, because
our credential shifting code now simply uses the <origin> s_user_ns
and the shifted uid/gid. The updated d_real() code from overlayfs is
also used, so shiftfs no longer needs its own file operations.
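
To make the shifting concrete, here is a minimal userspace sketch of the
two id translations the patch relies on: from_kuid() when building the
credentials used on the <origin> (shiftfs_get_up_creds) and make_kuid()
when reporting ownership (shiftfs_getattr). The struct, the model_*
function names and the example_map table are hypothetical; the table
simply reuses the uid_map quoted in the original blurb below, and the
real kernel helpers operate on kuid_t rather than plain integers.

```c
#include <stddef.h>

/*
 * One uid_map line: ids [ns_first, ns_first + count) inside the namespace
 * correspond to [host_first, host_first + count) outside it.
 */
struct id_extent {
    long ns_first;
    long host_first;
    long count;
};

/* The uid_map from the original blurb, assumed here for illustration. */
const struct id_extent example_map[] = {
    { 0, 100000, 1000 },
    { 1000, 1000, 1 },
    { 65534, 101000, 1 },
};
const size_t example_map_len = sizeof(example_map) / sizeof(example_map[0]);

/* Model of from_kuid(): host id -> id inside the namespace, -1 if unmapped. */
long model_from_kuid(const struct id_extent *map, size_t n, long host_id)
{
    for (size_t i = 0; i < n; i++) {
        if (host_id >= map[i].host_first &&
            host_id < map[i].host_first + map[i].count)
            return map[i].ns_first + (host_id - map[i].host_first);
    }
    return -1;
}

/* Model of make_kuid(): id inside the namespace -> host id, -1 if unmapped. */
long model_make_kuid(const struct id_extent *map, size_t n, long ns_id)
{
    for (size_t i = 0; i < n; i++) {
        if (ns_id >= map[i].ns_first &&
            ns_id < map[i].ns_first + map[i].count)
            return map[i].host_first + (ns_id - map[i].ns_first);
    }
    return -1;
}
```

With this table, a process running as host uid 100000 writes to the
<origin> as uid 0, and a file owned by uid 0 on the <origin> is reported
through the shift mount as uid 100000.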
---
[original blurb]
My use case for this is that I run a lot of unprivileged architectural
emulation containers on my system using user namespaces. Details here:
http://blog.hansenpartnership.com/unprivileged-build-containers/
They're mostly for building non-x86 stuff (like aarch64 and arm secure
boot and mips images). For builds, I have all the environments in my
home directory with downshifted uids; however, sometimes I need to use
them to administer real images that run on systems, meaning the uids
are the usual privileged ones not the downshifted ones. The only
current choice I have is to start the emulation as root so the uid/gids
match. The reason for this filesystem is to use my standard
unprivileged containers to maintain these images. The way I do this is
to crack the image with a loop mount and then shift the uids before
bringing up the container. I usually loop mount into /var/tmp/images/,
so it's owned by real root there:
jarvis:~ # ls -l /var/tmp/images/mips|head -4
total 0
drwxr-xr-x 1 root root 8192 May 12 08:33 bin
drwxr-xr-x 1 root root 6 May 12 08:33 boot
drwxr-xr-x 1 root root 167 May 12 08:33 dev
And I usually run my build containers with a uid_map of
0 100000 1000
1000 1000 1
65534 101000 1
(maps 0-999 shifted, then shifts nobody to 101000 and keeps my uid [1000]
fixed so I can mount my home directory into the namespace) and
something similar with gid_map. So I shift mount the mips image with
mount -t shiftfs -o \
  uidmap=0:100000:1000,uidmap=65534:101000:1,gidmap=0:100000:100,gidmap=101:100101:899,gidmap=65533:101000:2 \
  /var/tmp/images/mips /home/jejb/containers/mips
and I now see it as
jejb@jarvis:~> ls -l containers/mips|head -4
total 0
drwxr-xr-x 1 100000 100000 8192 May 12 08:33 bin/
drwxr-xr-x 1 100000 100000 6 May 12 08:33 boot/
drwxr-xr-x 1 100000 100000 167 May 12 08:33 dev/
like my usual unprivileged build roots, and I can now use an
unprivileged container to enter and administer the image.
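
The relationship between the uid_map text and the owners shown in the
listing can be checked with a short userspace sketch. It parses text in
the three-column uid_map format ("<id in ns> <id outside> <count>" per
line) and returns the owner a shift mount would display for a given
on-disk id. The names shifted_owner and blurb_uid_map are hypothetical,
and the parser assumes well-formed input.

```c
#include <stdio.h>
#include <string.h>

/* The uid_map quoted above, one extent per line. */
const char blurb_uid_map[] =
    "0 100000 1000\n"
    "1000 1000 1\n"
    "65534 101000 1\n";

/*
 * Return the id that an on-disk owner `disk_id` is displayed as once
 * shifted through `map_text`, or -1 if no extent covers it.
 */
long shifted_owner(const char *map_text, long disk_id)
{
    const char *p = map_text;

    while (*p) {
        long ns_first, host_first, count;

        /* Each line is "<first id in ns> <first id outside> <count>". */
        if (sscanf(p, "%ld %ld %ld", &ns_first, &host_first, &count) == 3 &&
            count > 0 &&
            disk_id >= ns_first && disk_id < ns_first + count)
            return host_first + (disk_id - ns_first);

        /* Advance to the start of the next line, if there is one. */
        p = strchr(p, '\n');
        if (!p)
            break;
        p++;
    }
    return -1;
}
```

For instance, shifted_owner(blurb_uid_map, 0) yields 100000, which matches
the owner shown for bin/ in the shifted listing.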
It seems like a lot of container systems need to do something similar
when they try and provide unprivileged access to standard images.
Right at the moment, the security mechanism only allows root in the
host to use this, but it's not impossible to come up with a scheme for
marking trees that can safely be shift mounted by unprivileged user
namespaces.
James
---
James Bottomley (1):
shiftfs: uid/gid shifting bind mount
fs/Kconfig | 8 +
fs/Makefile | 1 +
fs/shiftfs.c | 783 +++++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/magic.h | 2 +
4 files changed, 794 insertions(+)
create mode 100644 fs/shiftfs.c
--
2.13.7
This allows any subtree to be uid/gid shifted and bound elsewhere. It
does this by operating similarly to overlayfs. Its primary use is for
shifting the underlying uids of filesystems used to support
unprivileged (uid shifted) containers. The usual use case here is
that the container is operating with a uid shifted unprivileged root
but sometimes needs to make use of or work with a filesystem image
that has root at real uid 0.
The mechanism is to allow any subordinate mount namespace to mount a
shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only allowing
it to mount marked subtrees (using the -o mark option as root). Once
mounted, the subtree is mapped via the super block user namespace so
that the interior ids of the mounting user namespace are the ids
written to the filesystem.
Signed-off-by: James Bottomley <[email protected]>
---
v3 - update to 4.14 (d_real changes)
v2 - fix revalidation of dentries
     add inode aliasing
v1 - based on original shiftfs with uid mappings now done via s_user_ns
---
fs/Kconfig | 8 +
fs/Makefile | 1 +
fs/shiftfs.c | 783 +++++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/magic.h | 2 +
4 files changed, 794 insertions(+)
create mode 100644 fs/shiftfs.c
diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..8b9c2b8566a9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -105,6 +105,14 @@ source "fs/autofs4/Kconfig"
source "fs/fuse/Kconfig"
source "fs/overlayfs/Kconfig"
+config SHIFT_FS
+ tristate "UID/GID shifting overlay filesystem for containers"
+ help
+ This filesystem can overlay any mounted filesystem and shift
+ the uid/gid the files appear at. The idea is that
+ unprivileged containers can use this technique to mount
+ root volumes.
+
menu "Caches"
source "fs/fscache/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..2aa3ad47a286 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,3 +128,4 @@ obj-y += exofs/ # Multiple modules
obj-$(CONFIG_CEPH_FS) += ceph/
obj-$(CONFIG_PSTORE) += pstore/
obj-$(CONFIG_EFIVAR_FS) += efivarfs/
+obj-$(CONFIG_SHIFT_FS) += shiftfs.o
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
new file mode 100644
index 000000000000..7984a93745d2
--- /dev/null
+++ b/fs/shiftfs.c
@@ -0,0 +1,783 @@
+#include <linux/cred.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/magic.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+#include <linux/statfs.h>
+#include <linux/slab.h>
+#include <linux/user_namespace.h>
+#include <linux/uidgid.h>
+#include <linux/xattr.h>
+
+struct shiftfs_super_info {
+ struct vfsmount *mnt;
+ struct user_namespace *userns;
+ bool mark;
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+ struct dentry *dentry);
+
+enum {
+ OPT_MARK,
+ OPT_LAST,
+};
+
+/* global filesystem options */
+static const match_table_t tokens = {
+ { OPT_MARK, "mark" },
+ { OPT_LAST, NULL }
+};
+
+static const struct cred *shiftfs_get_up_creds(struct super_block *sb)
+{
+ struct shiftfs_super_info *ssi = sb->s_fs_info;
+ struct cred *cred = prepare_creds();
+
+ if (!cred)
+ return NULL;
+
+ cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
+ cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
+ put_user_ns(cred->user_ns);
+ cred->user_ns = get_user_ns(ssi->userns);
+
+ return cred;
+}
+
+static const struct cred *shiftfs_new_creds(const struct cred **newcred,
+ struct super_block *sb)
+{
+ const struct cred *cred = shiftfs_get_up_creds(sb);
+
+ *newcred = cred;
+
+ if (cred)
+ cred = override_creds(cred);
+ else
+ printk(KERN_ERR "shiftfs: Credential override failed: no memory\n");
+
+ return cred;
+}
+
+static void shiftfs_old_creds(const struct cred *oldcred,
+ const struct cred **newcred)
+{
+ if (!*newcred)
+ return;
+
+ revert_creds(oldcred);
+ put_cred(*newcred);
+}
+
+static int shiftfs_parse_options(struct shiftfs_super_info *ssi, char *options)
+{
+ char *p;
+ substring_t args[MAX_OPT_ARGS];
+
+ ssi->mark = false;
+
+ while ((p = strsep(&options, ",")) != NULL) {
+ int token;
+
+ if (!*p)
+ continue;
+
+ token = match_token(p, tokens, args);
+ switch (token) {
+ case OPT_MARK:
+ ssi->mark = true;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
+static void shiftfs_d_release(struct dentry *dentry)
+{
+ struct dentry *real = dentry->d_fsdata;
+
+ dput(real);
+}
+
+static struct dentry *shiftfs_d_real(struct dentry *dentry,
+ const struct inode *inode,
+ unsigned int open_flags,
+ unsigned int dreal_flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+
+ if (unlikely(real->d_flags & DCACHE_OP_REAL))
+ return real->d_op->d_real(real, real->d_inode,
+ open_flags, dreal_flags);
+
+ return real;
+}
+
+static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+
+ if (d_unhashed(real))
+ return 0;
+
+ if (!(real->d_flags & DCACHE_OP_WEAK_REVALIDATE))
+ return 1;
+
+ return real->d_op->d_weak_revalidate(real, flags);
+}
+
+static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int ret;
+
+ if (d_unhashed(real))
+ return 0;
+
+ /*
+ * inode state of underlying changed from positive to negative
+ * or vice versa; force a lookup to update our view
+ */
+ if (d_is_negative(real) != d_is_negative(dentry))
+ return 0;
+
+ if (!(real->d_flags & DCACHE_OP_REVALIDATE))
+ return 1;
+
+ ret = real->d_op->d_revalidate(real, flags);
+
+ if (ret == 0 && !(flags & LOOKUP_RCU))
+ d_invalidate(real);
+
+ return ret;
+}
+
+static const struct dentry_operations shiftfs_dentry_ops = {
+ .d_release = shiftfs_d_release,
+ .d_real = shiftfs_d_real,
+ .d_revalidate = shiftfs_d_revalidate,
+ .d_weak_revalidate = shiftfs_d_weak_revalidate,
+};
+
+static int shiftfs_readlink(struct dentry *dentry, char __user *data,
+ int flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+ const struct inode_operations *iop = real->d_inode->i_op;
+
+ if (iop->readlink)
+ return iop->readlink(real, data, flags);
+
+ return -EINVAL;
+}
+
+static const char *shiftfs_get_link(struct dentry *dentry, struct inode *inode,
+ struct delayed_call *done)
+{
+ if (dentry) {
+ struct dentry *real = dentry->d_fsdata;
+ struct inode *reali = real->d_inode;
+ const struct inode_operations *iop = reali->i_op;
+ const char *res = ERR_PTR(-EPERM);
+
+ if (iop->get_link)
+ res = iop->get_link(real, reali, done);
+
+ return res;
+ } else {
+ /* RCU lookup not supported */
+ return ERR_PTR(-ECHILD);
+ }
+}
+
+static int shiftfs_setxattr(struct dentry *dentry, struct inode *inode,
+ const char *name, const void *value,
+ size_t size, int flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int err = -EOPNOTSUPP;
+ const struct cred *oldcred, *newcred;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ err = vfs_setxattr(real, name, value, size, flags);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static int shiftfs_xattr_get(const struct xattr_handler *handler,
+ struct dentry *dentry, struct inode *inode,
+ const char *name, void *value, size_t size)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ err = vfs_getxattr(real, name, value, size);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static ssize_t shiftfs_listxattr(struct dentry *dentry, char *list,
+ size_t size)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ err = vfs_listxattr(real, list, size);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static int shiftfs_removexattr(struct dentry *dentry, const char *name)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ err = vfs_removexattr(real, name);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static int shiftfs_xattr_set(const struct xattr_handler *handler,
+ struct dentry *dentry, struct inode *inode,
+ const char *name, const void *value, size_t size,
+ int flags)
+{
+ if (!value)
+ return shiftfs_removexattr(dentry, name);
+ return shiftfs_setxattr(dentry, inode, name, value, size, flags);
+}
+
+static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
+{
+ struct inode *reali;
+
+ if (!dentry)
+ return;
+
+ reali = dentry->d_inode;
+
+ if (!reali->i_op->get_link)
+ inode->i_opflags |= IOP_NOFOLLOW;
+
+ inode->i_mapping = reali->i_mapping;
+ inode->i_private = dentry;
+}
+
+static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
+ umode_t mode, const char *symlink,
+ struct dentry *hardlink, bool excl)
+{
+ struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+ struct inode *reali = real->d_inode, *newi;
+ const struct inode_operations *iop = reali->i_op;
+ int err;
+ const struct cred *oldcred, *newcred;
+ bool op_ok = false;
+
+ if (hardlink) {
+ op_ok = iop->link;
+ } else {
+ switch (mode & S_IFMT) {
+ case S_IFDIR:
+ op_ok = iop->mkdir;
+ break;
+ case S_IFREG:
+ op_ok = iop->create;
+ break;
+ case S_IFLNK:
+ op_ok = iop->symlink;
+ }
+ }
+ if (!op_ok)
+ return -EINVAL;
+
+
+ newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+ if (!newi)
+ return -ENOMEM;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+ inode_lock_nested(reali, I_MUTEX_PARENT);
+
+ err = -EINVAL; /* shut gcc up about uninit var */
+ if (hardlink) {
+ struct dentry *realhardlink = hardlink->d_fsdata;
+
+ err = vfs_link(realhardlink, reali, new, NULL);
+ } else {
+ switch (mode & S_IFMT) {
+ case S_IFDIR:
+ err = vfs_mkdir(reali, new, mode);
+ break;
+ case S_IFREG:
+ err = vfs_create(reali, new, mode, excl);
+ break;
+ case S_IFLNK:
+ err = vfs_symlink(reali, new, symlink);
+ }
+ }
+
+ shiftfs_old_creds(oldcred, &newcred);
+
+ if (err)
+ goto out_dput;
+
+ shiftfs_fill_inode(newi, new);
+
+ d_instantiate(dentry, newi);
+
+ new = NULL;
+ newi = NULL;
+
+ out_dput:
+ dput(new);
+ iput(newi);
+ inode_unlock(reali);
+
+ return err;
+}
+
+static int shiftfs_create(struct inode *dir, struct dentry *dentry,
+ umode_t mode, bool excl)
+{
+ mode |= S_IFREG;
+
+ return shiftfs_make_object(dir, dentry, mode, NULL, NULL, excl);
+}
+
+static int shiftfs_mkdir(struct inode *dir, struct dentry *dentry,
+ umode_t mode)
+{
+ mode |= S_IFDIR;
+
+ return shiftfs_make_object(dir, dentry, mode, NULL, NULL, false);
+}
+
+static int shiftfs_link(struct dentry *hardlink, struct inode *dir,
+ struct dentry *dentry)
+{
+ return shiftfs_make_object(dir, dentry, 0, NULL, hardlink, false);
+}
+
+static int shiftfs_symlink(struct inode *dir, struct dentry *dentry,
+ const char *symlink)
+{
+ return shiftfs_make_object(dir, dentry, S_IFLNK, symlink, NULL, false);
+}
+
+static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir)
+{
+ struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+ struct inode *reali = real->d_inode;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ inode_lock_nested(reali, I_MUTEX_PARENT);
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+ if (rmdir)
+ err = vfs_rmdir(reali, new);
+ else
+ err = vfs_unlink(reali, new, NULL);
+
+ shiftfs_old_creds(oldcred, &newcred);
+ inode_unlock(reali);
+
+ return err;
+}
+
+static int shiftfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+ return shiftfs_rm(dir, dentry, false);
+}
+
+static int shiftfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+ return shiftfs_rm(dir, dentry, true);
+}
+
+static int shiftfs_rename(struct inode *olddir, struct dentry *old,
+ struct inode *newdir, struct dentry *new,
+ unsigned int flags)
+{
+ struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
+ *realold = old->d_fsdata,
+ *realnew = new->d_fsdata, *trap;
+ struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
+ int err = -EINVAL;
+ const struct cred *oldcred, *newcred;
+
+ trap = lock_rename(rndd, rodd);
+
+ if (trap == realold || trap == realnew)
+ goto out_unlock;
+
+ oldcred = shiftfs_new_creds(&newcred, old->d_sb);
+
+ err = vfs_rename(realolddir, realold, realnewdir,
+ realnew, NULL, flags);
+
+ shiftfs_old_creds(oldcred, &newcred);
+
+ out_unlock:
+ unlock_rename(rndd, rodd);
+
+ return err;
+}
+
+static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ struct dentry *real = dir->i_private, *new;
+ struct inode *reali = real->d_inode, *newi;
+ const struct cred *oldcred, *newcred;
+
+ inode_lock(reali);
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+ shiftfs_old_creds(oldcred, &newcred);
+ inode_unlock(reali);
+
+ if (IS_ERR(new))
+ return new;
+
+ dentry->d_fsdata = new;
+
+ newi = NULL;
+ if (!new->d_inode)
+ goto out;
+
+ newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
+ if (!newi) {
+ dput(new);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ out:
+ return d_splice_alias(newi, dentry);
+}
+
+static int shiftfs_permission(struct inode *inode, int mask)
+{
+ struct dentry *real = inode->i_private;
+ struct inode *reali = real->d_inode;
+ const struct inode_operations *iop = reali->i_op;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ if (mask & MAY_NOT_BLOCK)
+ return -ECHILD;
+
+ oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
+ if (iop->permission)
+ err = iop->permission(reali, mask);
+ else
+ err = generic_permission(reali, mask);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
+{
+ struct dentry *real = dentry->d_fsdata;
+ struct inode *reali = real->d_inode;
+ const struct inode_operations *iop = reali->i_op;
+ struct iattr newattr = *attr;
+ const struct cred *oldcred, *newcred;
+ struct super_block *sb = dentry->d_sb;
+ int err;
+
+ newattr.ia_uid = KUIDT_INIT(from_kuid(sb->s_user_ns, attr->ia_uid));
+ newattr.ia_gid = KGIDT_INIT(from_kgid(sb->s_user_ns, attr->ia_gid));
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ inode_lock(reali);
+ if (iop->setattr)
+ err = iop->setattr(real, &newattr);
+ else
+ err = simple_setattr(real, &newattr);
+ inode_unlock(reali);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ if (err)
+ return err;
+
+ /* all OK, reflect the change on our inode */
+ setattr_copy(d_inode(dentry), attr);
+ return 0;
+}
+
+static int shiftfs_getattr(const struct path *path, struct kstat *stat,
+ u32 request_mask, unsigned int query_flags)
+{
+ struct inode *inode = path->dentry->d_inode;
+ struct dentry *real = path->dentry->d_fsdata;
+ struct inode *reali = real->d_inode;
+ const struct inode_operations *iop = reali->i_op;
+ struct path newpath = { path->dentry->d_sb->s_fs_info, real };
+ int err = 0;
+
+ if (iop->getattr)
+ err = iop->getattr(&newpath, stat, request_mask, query_flags);
+ else
+ generic_fillattr(reali, stat);
+
+ if (err)
+ return err;
+
+ /* transform the underlying id */
+ stat->uid = make_kuid(inode->i_sb->s_user_ns, __kuid_val(stat->uid));
+ stat->gid = make_kgid(inode->i_sb->s_user_ns, __kgid_val(stat->gid));
+ return 0;
+}
+
+static const struct inode_operations shiftfs_inode_ops = {
+ .lookup = shiftfs_lookup,
+ .getattr = shiftfs_getattr,
+ .setattr = shiftfs_setattr,
+ .permission = shiftfs_permission,
+ .mkdir = shiftfs_mkdir,
+ .symlink = shiftfs_symlink,
+ .get_link = shiftfs_get_link,
+ .readlink = shiftfs_readlink,
+ .unlink = shiftfs_unlink,
+ .rmdir = shiftfs_rmdir,
+ .rename = shiftfs_rename,
+ .link = shiftfs_link,
+ .create = shiftfs_create,
+ .mknod = NULL, /* no special files currently */
+ .listxattr = shiftfs_listxattr,
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+ struct dentry *dentry)
+{
+ struct inode *inode;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return NULL;
+
+ /*
+ * our inode is completely vestigial. All lookups, getattr
+ * and permission checks are done on the underlying inode, so
+ * what the user sees is entirely from the underlying inode.
+ */
+ mode &= S_IFMT;
+
+ inode->i_ino = get_next_ino();
+ inode->i_mode = mode;
+ inode->i_flags |= S_NOATIME | S_NOCMTIME;
+
+ inode->i_op = &shiftfs_inode_ops;
+
+ shiftfs_fill_inode(inode, dentry);
+
+ return inode;
+}
+
+static int shiftfs_show_options(struct seq_file *m, struct dentry *dentry)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+ if (ssi->mark)
+ seq_show_option(m, "mark", NULL);
+
+ return 0;
+}
+
+static int shiftfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct shiftfs_super_info *ssi = sb->s_fs_info;
+ struct dentry *root = sb->s_root;
+ struct dentry *realroot = root->d_fsdata;
+ struct path realpath = { .mnt = ssi->mnt, .dentry = realroot };
+ int err;
+
+ err = vfs_statfs(&realpath, buf);
+ if (err)
+ return err;
+
+ buf->f_type = sb->s_magic;
+
+ return 0;
+}
+
+static void shiftfs_put_super(struct super_block *sb)
+{
+ struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+ mntput(ssi->mnt);
+ put_user_ns(ssi->userns);
+ kfree(ssi);
+}
+
+static const struct xattr_handler shiftfs_xattr_handler = {
+ .prefix = "",
+ .get = shiftfs_xattr_get,
+ .set = shiftfs_xattr_set,
+};
+
+const struct xattr_handler *shiftfs_xattr_handlers[] = {
+ &shiftfs_xattr_handler,
+ NULL
+};
+
+static const struct super_operations shiftfs_super_ops = {
+ .put_super = shiftfs_put_super,
+ .show_options = shiftfs_show_options,
+ .statfs = shiftfs_statfs,
+};
+
+struct shiftfs_data {
+ void *data;
+ const char *path;
+};
+
+static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
+ int silent)
+{
+ struct shiftfs_data *data = raw_data;
+ char *name = kstrdup(data->path, GFP_KERNEL);
+ int err = -ENOMEM;
+ struct shiftfs_super_info *ssi = NULL;
+ struct path path;
+ struct dentry *dentry;
+
+ if (!name)
+ goto out;
+
+ ssi = kzalloc(sizeof(*ssi), GFP_KERNEL);
+ if (!ssi)
+ goto out;
+
+ err = shiftfs_parse_options(ssi, data->data);
+ if (err)
+ goto out;
+
+ err = -EPERM;
+ /* to mark a mount point, must be real root */
+ if (ssi->mark && !capable(CAP_SYS_ADMIN))
+ goto out;
+
+ /* else to mount a mark, must be userns admin */
+ if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ goto out;
+
+ err = kern_path(name, LOOKUP_FOLLOW, &path);
+ if (err)
+ goto out;
+
+ err = -EPERM;
+
+ if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
+ err = -ENOTDIR;
+ goto out_put;
+ }
+
+ sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
+ if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+ printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
+ err = -EINVAL;
+ goto out_put;
+ }
+
+ if (ssi->mark) {
+ /*
+ * this part is visible unshifted, so make sure no
+ * executables that could be used to give suid
+ * privileges
+ */
+ sb->s_iflags |= SB_I_NOEXEC;
+ ssi->mnt = path.mnt;
+ dentry = path.dentry;
+ } else {
+ struct shiftfs_super_info *mp_ssi;
+
+ /*
+ * this leg executes if we're admin capable in
+ * the namespace, so be very careful
+ */
+ if (path.dentry->d_sb->s_magic != SHIFTFS_MAGIC)
+ goto out_put;
+ mp_ssi = path.dentry->d_sb->s_fs_info;
+ if (!mp_ssi->mark)
+ goto out_put;
+ ssi->mnt = mntget(mp_ssi->mnt);
+ dentry = dget(path.dentry->d_fsdata);
+ path_put(&path);
+ }
+ ssi->userns = get_user_ns(dentry->d_sb->s_user_ns);
+ sb->s_fs_info = ssi;
+ sb->s_magic = SHIFTFS_MAGIC;
+ sb->s_op = &shiftfs_super_ops;
+ sb->s_xattr = shiftfs_xattr_handlers;
+ sb->s_d_op = &shiftfs_dentry_ops;
+ sb->s_root = d_make_root(shiftfs_new_inode(sb, S_IFDIR, dentry));
+ sb->s_root->d_fsdata = dentry;
+
+ return 0;
+
+ out_put:
+ path_put(&path);
+ out:
+ kfree(name);
+ kfree(ssi);
+ return err;
+}
+
+static struct dentry *shiftfs_mount(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data)
+{
+ struct shiftfs_data d = { data, dev_name };
+
+ return mount_nodev(fs_type, flags, &d, shiftfs_fill_super);
+}
+
+static struct file_system_type shiftfs_type = {
+ .owner = THIS_MODULE,
+ .name = "shiftfs",
+ .mount = shiftfs_mount,
+ .kill_sb = kill_anon_super,
+ .fs_flags = FS_USERNS_MOUNT,
+};
+
+static int __init shiftfs_init(void)
+{
+ return register_filesystem(&shiftfs_type);
+}
+
+static void __exit shiftfs_exit(void)
+{
+ unregister_filesystem(&shiftfs_type);
+}
+
+MODULE_ALIAS_FS("shiftfs");
+MODULE_AUTHOR("James Bottomley");
+MODULE_DESCRIPTION("uid/gid shifting bind filesystem");
+MODULE_LICENSE("GPL v2");
+module_init(shiftfs_init)
+module_exit(shiftfs_exit)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index e439565df838..1b3db66384a2 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -88,4 +88,6 @@
#define BALLOON_KVM_MAGIC 0x13661366
#define ZSMALLOC_MAGIC 0x58295829
+#define SHIFTFS_MAGIC 0x6a656a62
+
#endif /* __LINUX_MAGIC_H__ */
--
2.13.7
+Cc David
On Fri, Jun 15, 2018 at 02:35:14PM -0700, James Bottomley wrote:
> This is a repost of the v2 patch updated for the d_real changes
>
> For those who want to test it out, there's a git tree here
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc.git
>
> on the shiftfs-v3 branch
>
> v2:
>
> This is a rewrite of the original shiftfs code to make use of super
> block user namespaces. I've also removed the mappings passed in as
> mount options in favour of using the mappings in s_user_ns. The upshot
> is that it probably needs retesting for all the bugs people found,
> since there's a lot of new code, and the use case has changed. Now, to
> use it, you have to mark the filesystems you want to be mountable
> inside a user namespace as root:
>
> mount -t shiftfs -o mark <origin> <mark location>
>
> The origin should be inaccessible to the unprivileged user, and the
> access to the <mark location> can be controlled by the usual filesystem
> permissions. Once this is done, any user who can get access to the
> <mark location> can do (as the local user namespace root):
>
> mount -t shiftfs <mark location> <somewhere in my local mount ns>
David, I wanted to pull you in here based on something you said on the
most recent filesystem context thread (I thought it would make more sense
here rather than piggybacking on that already massive thread).
> I want to be able to add support for a bunch of things:
>
> (1) UID, GID and Project ID mapping/translation. I want to be able to
> install a translation table of some sort on the superblock to translate
> source identifiers (which may be foreign numeric UIDs/GIDs, text names,
> GUIDs) into system identifiers. This needs to be done before the
> superblock is published[*].
>
> Note that this may, for example, involve using the context and the
> superblock held therein to issue an RPC to a server to look up
> translations.
>
> [*] By "published" I mean made available through mount so that other
> userspace processes can access it by path.
>
> Maybe specifying a translation range element with something like:
>
> write(fd, "t uid <srcuid> <nsuid> <count>");
>
> The translation information also needs to propagate over an automount in
> some circumstances.
>
> (2) Namespace configuration. I want to be able to tell the superblock
> creation process what namespaces should be applied when it is created (in
> particular the userns and netns) for containerisation purposes, e.g.:
>
> write(fd, "n user=<fd> net=<fd>");
There's some obvious overlap between shiftfs and (1), but also important
differences, primarily that shiftfs tries to make something that looks
like a bind mount rather than applying the mappings to new superblocks
for arbitrary filesystems.
I've already been playing with shiftfs on top of the filesystem context
patches, because I thought it would allow getting rid of the
intermediate "mark" mount described above. I have a hacky proof of
concept implementation that I've pushed to the shiftfs-fscontext branch
of
git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/linux.git
Basically the idea is that the more privileged "host" context blesses the
subtree being id shifted before the mount is actually executed in the
less privileged "client" context. I'm doing this by having the host
create the fs fd and set the source on it, then having the client use
the fd to create the superblock and do the mount (test program below).
This leads to some undesirable changing of the fs context's user ns
(fc->user_ns) in shiftfs so that s_user_ns is the client's namespace,
which could likely be eliminated by doing something like what's
described in (2) and having the super block created on the host side.
However, maybe these things are similar enough to settle on a common
solution, such as supporting id mapping at the vfsmount level.
Seth
---
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <limits.h>
#define __NR_move_mount 336
#define __NR_fsopen 337
#define __NR_fsmount 338
#define FSOPEN_CLOEXEC 0x00000001
#define FSMOUNT_CLOEXEC 0x00000001
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
static int move_mount(int from_dfd, const char *from_pathname,
int to_dfd, const char *to_pathname, unsigned int flags)
{
return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname,
flags);
}
static int fsopen(const char *fs_name, unsigned int flags)
{
return syscall(__NR_fsopen, fs_name, flags);
}
static int fsmount(int fsfd, unsigned int flags, unsigned int ms_flags)
{
return syscall(__NR_fsmount, fsfd, flags, ms_flags);
}
static void write_idmap(char *path, char *map)
{
int fd;
size_t map_len;
map_len = strlen(map);
fd = open(path, O_RDWR);
if (fd == -1) {
perror("open");
exit(1);
}
if (write(fd, map, map_len) != (ssize_t)map_len) {
perror("write");
exit(1);
}
close(fd);
}
#define CHILD_STACK_SIZE (1024 * 1024)
static char child_stack[CHILD_STACK_SIZE];
struct child_args {
int inp[2];
int outp[2];
int fsfd;
char *dest;
uid_t uid;
gid_t gid;
};
static int child_func(void *arg)
{
struct child_args *args = arg;
char ch;
int fsfd = args->fsfd, mfd;
int ret;
close(args->inp[1]);
close(args->outp[0]);
/* Change to uid/gid for root in the user ns */
if (setgid(args->gid) == -1) {
perror("setgid");
exit(-1);
}
if (setuid(args->uid) == -1) {
perror("setuid");
exit(-1);
}
if (unshare(CLONE_NEWNS | CLONE_NEWUSER) == -1) {
perror("unshare");
exit(1);
}
/* Close write pipe to signal unshare has completed */
close(args->outp[1]);
/* Wait for uid/gid maps to be written */
if (read(args->inp[0], &ch, 1) != 0) {
perror("read");
exit(1);
}
close(args->inp[0]);
/* Now we are root in user ns; proceed with mount */
ret = write(fsfd, "x create", 9);
if (ret == -1) {
perror("write \"x create\"");
exit(1);
}
mfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
if (mfd < 0) {
perror("fsmount");
exit(1);
}
ret = move_mount(mfd, "", AT_FDCWD, args->dest, MOVE_MOUNT_F_EMPTY_PATH);
if (ret < 0) {
perror("move_mount");
exit(1);
}
close(fsfd);
close(mfd);
execl("/bin/sh", "/bin/sh", NULL);
perror("execl");
exit(1);
}
int main(int argc, char *argv[])
{
char *src, *dest;
uid_t root_uid;
gid_t root_gid;
int fsfd;
char buf[PATH_MAX + 2];
int len, ret;
struct child_args args;
pid_t child_pid;
int *inp, *outp;
char map_buf[100];
char map_path[PATH_MAX];
char ch;
if (argc != 5) {
printf("Usage: %s <src> <dest> <root_uid> <root_gid>\n", argv[0]);
exit(1);
}
src = argv[1];
dest = argv[2];
root_uid = atoi(argv[3]);
root_gid = atoi(argv[4]);
fsfd = fsopen("shiftfs", FSOPEN_CLOEXEC);
if (fsfd < 0) {
perror("fsopen");
exit(1);
}
/*
* Set source subtree; we do it on this side of clone(2) so that
* the kernel can check for permissions wrt src. The rest of the
* mount will happen in the child process after unsharing the
* user/mount namespaces.
*/
len = snprintf(buf, sizeof(buf), "s %s", src);
if (len >= sizeof(buf)) {
fprintf(stderr, "src too large\n");
exit(1);
}
ret = write(fsfd, buf, len);
if (ret == -1) {
perror("write \"s src\"");
exit(1);
}
if (pipe(args.inp) == -1) {
perror("pipe");
exit(1);
}
if (pipe(args.outp) == -1) {
perror("pipe");
exit(1);
}
args.fsfd = fsfd;
args.dest = dest;
args.uid = root_uid;
args.gid = root_gid;
child_pid = clone(child_func, child_stack + CHILD_STACK_SIZE, SIGCHLD, &args);
if (child_pid == -1) {
perror("clone");
exit(1);
}
/* Pipe directions reversed wrt child */
inp = args.outp;
outp = args.inp;
close(inp[1]);
close(outp[0]);
/* Wait for child to set ids and unshare */
if (read(inp[0], &ch, 1) != 0) {
perror("read");
exit(1);
}
snprintf(map_buf, sizeof(map_buf), "0 %ld 1", (long)root_uid);
snprintf(map_path, sizeof(map_path), "/proc/%ld/uid_map",
(long)child_pid);
write_idmap(map_path, map_buf);
snprintf(map_buf, sizeof(map_buf), "0 %ld 1", (long)root_gid);
snprintf(map_path, sizeof(map_path), "/proc/%ld/gid_map",
(long)child_pid);
write_idmap(map_path, map_buf);
/* Signal child that id maps have been updated */
close(outp[1]);
if (waitpid(child_pid, NULL, 0) == -1) {
perror("waitpid");
exit(1);
}
exit(0);
}