2018-11-01 21:51:05

by Seth Forshee

[permalink] [raw]
Subject: [RFC PATCH 0/6] shiftfs fixes and enhancements

I've done some work to fix and enhance shiftfs for a number of use
cases, so that we would have an idea what a more full-featured shiftfs
would look like. I'm intending for these to serve as a point of
reference for discussing id shifting mounts/filesystems at plumbers in a
couple of weeks [1].

Note that these are based on 4.18, and I've added a small fix to James'
most recent patch to fix a build issue there. To work with 4.19 they
will need a number of updates due to changes in the vfs.

The features I focused on fixing in or adding to shiftfs in these
patches are inotify, file capabilities, posix acls, and nesting. These
are all now working for at least simple use cases, but further testing
and cleanups are needed before I'd consider these finished. I also kept
all the changes within shiftfs, but some of the code might belong in the
vfs instead (in particular some of the posix acl code).

I've also pushed these patches to:

git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/linux.git shiftfs

Thanks,
Seth

[1] https://linuxplumbersconf.org/event/2/contributions/212/

---

James Bottomley (1):
shiftfs: uid/gid shifting bind mount

Seth Forshee (5):
shiftfs: map inodes to lower fs inodes instead of dentries
shiftfs: copy inode attrs up from underlying fs
shiftfs: translate uids using s_user_ns from lower fs
shiftfs: add support for posix acls
shiftfs: support nested shiftfs mounts

fs/Kconfig | 18 +
fs/Makefile | 1 +
fs/shiftfs.c | 1075 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/magic.h | 2 +
4 files changed, 1096 insertions(+)
create mode 100644 fs/shiftfs.c



2018-11-01 21:49:51

by Seth Forshee

[permalink] [raw]
Subject: [RFC PATCH 1/6] shiftfs: uid/gid shifting bind mount

From: James Bottomley <[email protected]>

This allows any subtree to be uid/gid shifted and bound elsewhere. It
does this by operating simlarly to overlayfs. Its primary use is for
shifting the underlying uids of filesystems used to support
unpriviliged (uid shifted) containers. The usual use case here is
that the container is operating with an uid shifted unprivileged root
but sometimes needs to make use of or work with a filesystem image
that has root at real uid 0.

The mechanism is to allow any subordinate mount namespace to mount a
shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only allowing
it to mount marked subtrees (using the -o mark option as root). Once
mounted, the subtree is mapped via the super block user namespace so
that the interior ids of the mounting user namespace are the ids
written to the filesystem.

Signed-off-by: James Bottomley <[email protected]>
[ saf: use designated initializers for path declarations to fix errors
with struct randomization ]
Signed-off-by: Seth Forshee <[email protected]>

---

v3 - update to 4.14 (d_real changes)
v1 - based on original shiftfs with uid mappings now done via s_user_ns
v2 - fix revalidation of dentries
add inode aliasing
---
fs/Kconfig | 8 +
fs/Makefile | 1 +
fs/shiftfs.c | 783 +++++++++++++++++++++++++++++++++++++
include/uapi/linux/magic.h | 2 +
4 files changed, 794 insertions(+)
create mode 100644 fs/shiftfs.c

diff --git a/fs/Kconfig b/fs/Kconfig
index ac474a61be37..392c5a41a9f9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -113,6 +113,14 @@ source "fs/autofs/Kconfig"
source "fs/fuse/Kconfig"
source "fs/overlayfs/Kconfig"

+config SHIFT_FS
+ tristate "UID/GID shifting overlay filesystem for containers"
+ help
+ This filesystem can overlay any mounted filesystem and shift
+ the uid/gid the files appear at. The idea is that
+ unprivileged containers can use this to mount root volumes
+ using this technique.
+
menu "Caches"

source "fs/fscache/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..d0222f3816bd 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,3 +128,4 @@ obj-y += exofs/ # Multiple modules
obj-$(CONFIG_CEPH_FS) += ceph/
obj-$(CONFIG_PSTORE) += pstore/
obj-$(CONFIG_EFIVAR_FS) += efivarfs/
+obj-$(CONFIG_SHIFT_FS) += shiftfs.o
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
new file mode 100644
index 000000000000..6028244c2f42
--- /dev/null
+++ b/fs/shiftfs.c
@@ -0,0 +1,783 @@
+#include <linux/cred.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/magic.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+#include <linux/statfs.h>
+#include <linux/slab.h>
+#include <linux/user_namespace.h>
+#include <linux/uidgid.h>
+#include <linux/xattr.h>
+
+struct shiftfs_super_info {
+ struct vfsmount *mnt;
+ struct user_namespace *userns;
+ bool mark;
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+ struct dentry *dentry);
+
+enum {
+ OPT_MARK,
+ OPT_LAST,
+};
+
+/* global filesystem options */
+static const match_table_t tokens = {
+ { OPT_MARK, "mark" },
+ { OPT_LAST, NULL }
+};
+
+static const struct cred *shiftfs_get_up_creds(struct super_block *sb)
+{
+ struct shiftfs_super_info *ssi = sb->s_fs_info;
+ struct cred *cred = prepare_creds();
+
+ if (!cred)
+ return NULL;
+
+ cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
+ cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
+ put_user_ns(cred->user_ns);
+ cred->user_ns = get_user_ns(ssi->userns);
+
+ return cred;
+}
+
+static const struct cred *shiftfs_new_creds(const struct cred **newcred,
+ struct super_block *sb)
+{
+ const struct cred *cred = shiftfs_get_up_creds(sb);
+
+ *newcred = cred;
+
+ if (cred)
+ cred = override_creds(cred);
+ else
+ printk(KERN_ERR "shiftfs: Credential override failed: no memory\n");
+
+ return cred;
+}
+
+static void shiftfs_old_creds(const struct cred *oldcred,
+ const struct cred **newcred)
+{
+ if (!*newcred)
+ return;
+
+ revert_creds(oldcred);
+ put_cred(*newcred);
+}
+
+static int shiftfs_parse_options(struct shiftfs_super_info *ssi, char *options)
+{
+ char *p;
+ substring_t args[MAX_OPT_ARGS];
+
+ ssi->mark = false;
+
+ while ((p = strsep(&options, ",")) != NULL) {
+ int token;
+
+ if (!*p)
+ continue;
+
+ token = match_token(p, tokens, args);
+ switch (token) {
+ case OPT_MARK:
+ ssi->mark = true;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
+static void shiftfs_d_release(struct dentry *dentry)
+{
+ struct dentry *real = dentry->d_fsdata;
+
+ dput(real);
+}
+
+static struct dentry *shiftfs_d_real(struct dentry *dentry,
+ const struct inode *inode,
+ unsigned int open_flags,
+ unsigned int dreal_flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+
+ if (unlikely(real->d_flags & DCACHE_OP_REAL))
+ return real->d_op->d_real(real, real->d_inode,
+ open_flags, dreal_flags);
+
+ return real;
+}
+
+static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+
+ if (d_unhashed(real))
+ return 0;
+
+ if (!(real->d_flags & DCACHE_OP_WEAK_REVALIDATE))
+ return 1;
+
+ return real->d_op->d_weak_revalidate(real, flags);
+}
+
+static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int ret;
+
+ if (d_unhashed(real))
+ return 0;
+
+ /*
+ * inode state of underlying changed from positive to negative
+ * or vice versa; force a lookup to update our view
+ */
+ if (d_is_negative(real) != d_is_negative(dentry))
+ return 0;
+
+ if (!(real->d_flags & DCACHE_OP_REVALIDATE))
+ return 1;
+
+ ret = real->d_op->d_revalidate(real, flags);
+
+ if (ret == 0 && !(flags & LOOKUP_RCU))
+ d_invalidate(real);
+
+ return ret;
+}
+
+static const struct dentry_operations shiftfs_dentry_ops = {
+ .d_release = shiftfs_d_release,
+ .d_real = shiftfs_d_real,
+ .d_revalidate = shiftfs_d_revalidate,
+ .d_weak_revalidate = shiftfs_d_weak_revalidate,
+};
+
+static int shiftfs_readlink(struct dentry *dentry, char __user *data,
+ int flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+ const struct inode_operations *iop = real->d_inode->i_op;
+
+ if (iop->readlink)
+ return iop->readlink(real, data, flags);
+
+ return -EINVAL;
+}
+
+static const char *shiftfs_get_link(struct dentry *dentry, struct inode *inode,
+ struct delayed_call *done)
+{
+ if (dentry) {
+ struct dentry *real = dentry->d_fsdata;
+ struct inode *reali = real->d_inode;
+ const struct inode_operations *iop = reali->i_op;
+ const char *res = ERR_PTR(-EPERM);
+
+ if (iop->get_link)
+ res = iop->get_link(real, reali, done);
+
+ return res;
+ } else {
+ /* RCU lookup not supported */
+ return ERR_PTR(-ECHILD);
+ }
+}
+
+static int shiftfs_setxattr(struct dentry *dentry, struct inode *inode,
+ const char *name, const void *value,
+ size_t size, int flags)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int err = -EOPNOTSUPP;
+ const struct cred *oldcred, *newcred;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ err = vfs_setxattr(real, name, value, size, flags);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static int shiftfs_xattr_get(const struct xattr_handler *handler,
+ struct dentry *dentry, struct inode *inode,
+ const char *name, void *value, size_t size)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ err = vfs_getxattr(real, name, value, size);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static ssize_t shiftfs_listxattr(struct dentry *dentry, char *list,
+ size_t size)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ err = vfs_listxattr(real, list, size);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static int shiftfs_removexattr(struct dentry *dentry, const char *name)
+{
+ struct dentry *real = dentry->d_fsdata;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ err = vfs_removexattr(real, name);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static int shiftfs_xattr_set(const struct xattr_handler *handler,
+ struct dentry *dentry, struct inode *inode,
+ const char *name, const void *value, size_t size,
+ int flags)
+{
+ if (!value)
+ return shiftfs_removexattr(dentry, name);
+ return shiftfs_setxattr(dentry, inode, name, value, size, flags);
+}
+
+static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
+{
+ struct inode *reali;
+
+ if (!dentry)
+ return;
+
+ reali = dentry->d_inode;
+
+ if (!reali->i_op->get_link)
+ inode->i_opflags |= IOP_NOFOLLOW;
+
+ inode->i_mapping = reali->i_mapping;
+ inode->i_private = dentry;
+}
+
+static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
+ umode_t mode, const char *symlink,
+ struct dentry *hardlink, bool excl)
+{
+ struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+ struct inode *reali = real->d_inode, *newi;
+ const struct inode_operations *iop = reali->i_op;
+ int err;
+ const struct cred *oldcred, *newcred;
+ bool op_ok = false;
+
+ if (hardlink) {
+ op_ok = iop->link;
+ } else {
+ switch (mode & S_IFMT) {
+ case S_IFDIR:
+ op_ok = iop->mkdir;
+ break;
+ case S_IFREG:
+ op_ok = iop->create;
+ break;
+ case S_IFLNK:
+ op_ok = iop->symlink;
+ }
+ }
+ if (!op_ok)
+ return -EINVAL;
+
+
+ newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+ if (!newi)
+ return -ENOMEM;
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+ inode_lock_nested(reali, I_MUTEX_PARENT);
+
+ err = -EINVAL; /* shut gcc up about uninit var */
+ if (hardlink) {
+ struct dentry *realhardlink = hardlink->d_fsdata;
+
+ err = vfs_link(realhardlink, reali, new, NULL);
+ } else {
+ switch (mode & S_IFMT) {
+ case S_IFDIR:
+ err = vfs_mkdir(reali, new, mode);
+ break;
+ case S_IFREG:
+ err = vfs_create(reali, new, mode, excl);
+ break;
+ case S_IFLNK:
+ err = vfs_symlink(reali, new, symlink);
+ }
+ }
+
+ shiftfs_old_creds(oldcred, &newcred);
+
+ if (err)
+ goto out_dput;
+
+ shiftfs_fill_inode(newi, new);
+
+ d_instantiate(dentry, newi);
+
+ new = NULL;
+ newi = NULL;
+
+ out_dput:
+ dput(new);
+ iput(newi);
+ inode_unlock(reali);
+
+ return err;
+}
+
+static int shiftfs_create(struct inode *dir, struct dentry *dentry,
+ umode_t mode, bool excl)
+{
+ mode |= S_IFREG;
+
+ return shiftfs_make_object(dir, dentry, mode, NULL, NULL, excl);
+}
+
+static int shiftfs_mkdir(struct inode *dir, struct dentry *dentry,
+ umode_t mode)
+{
+ mode |= S_IFDIR;
+
+ return shiftfs_make_object(dir, dentry, mode, NULL, NULL, false);
+}
+
+static int shiftfs_link(struct dentry *hardlink, struct inode *dir,
+ struct dentry *dentry)
+{
+ return shiftfs_make_object(dir, dentry, 0, NULL, hardlink, false);
+}
+
+static int shiftfs_symlink(struct inode *dir, struct dentry *dentry,
+ const char *symlink)
+{
+ return shiftfs_make_object(dir, dentry, S_IFLNK, symlink, NULL, false);
+}
+
+static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir)
+{
+ struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+ struct inode *reali = real->d_inode;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ inode_lock_nested(reali, I_MUTEX_PARENT);
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+ if (rmdir)
+ err = vfs_rmdir(reali, new);
+ else
+ err = vfs_unlink(reali, new, NULL);
+
+ shiftfs_old_creds(oldcred, &newcred);
+ inode_unlock(reali);
+
+ return err;
+}
+
+static int shiftfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+ return shiftfs_rm(dir, dentry, false);
+}
+
+static int shiftfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+ return shiftfs_rm(dir, dentry, true);
+}
+
+static int shiftfs_rename(struct inode *olddir, struct dentry *old,
+ struct inode *newdir, struct dentry *new,
+ unsigned int flags)
+{
+ struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
+ *realold = old->d_fsdata,
+ *realnew = new->d_fsdata, *trap;
+ struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
+ int err = -EINVAL;
+ const struct cred *oldcred, *newcred;
+
+ trap = lock_rename(rndd, rodd);
+
+ if (trap == realold || trap == realnew)
+ goto out_unlock;
+
+ oldcred = shiftfs_new_creds(&newcred, old->d_sb);
+
+ err = vfs_rename(realolddir, realold, realnewdir,
+ realnew, NULL, flags);
+
+ shiftfs_old_creds(oldcred, &newcred);
+
+ out_unlock:
+ unlock_rename(rndd, rodd);
+
+ return err;
+}
+
+static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ struct dentry *real = dir->i_private, *new;
+ struct inode *reali = real->d_inode, *newi;
+ const struct cred *oldcred, *newcred;
+
+ inode_lock(reali);
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+ shiftfs_old_creds(oldcred, &newcred);
+ inode_unlock(reali);
+
+ if (IS_ERR(new))
+ return new;
+
+ dentry->d_fsdata = new;
+
+ newi = NULL;
+ if (!new->d_inode)
+ goto out;
+
+ newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
+ if (!newi) {
+ dput(new);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ out:
+ return d_splice_alias(newi, dentry);
+}
+
+static int shiftfs_permission(struct inode *inode, int mask)
+{
+ struct dentry *real = inode->i_private;
+ struct inode *reali = real->d_inode;
+ const struct inode_operations *iop = reali->i_op;
+ int err;
+ const struct cred *oldcred, *newcred;
+
+ if (mask & MAY_NOT_BLOCK)
+ return -ECHILD;
+
+ oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
+ if (iop->permission)
+ err = iop->permission(reali, mask);
+ else
+ err = generic_permission(reali, mask);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ return err;
+}
+
+static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
+{
+ struct dentry *real = dentry->d_fsdata;
+ struct inode *reali = real->d_inode;
+ const struct inode_operations *iop = reali->i_op;
+ struct iattr newattr = *attr;
+ const struct cred *oldcred, *newcred;
+ struct super_block *sb = dentry->d_sb;
+ int err;
+
+ newattr.ia_uid = KUIDT_INIT(from_kuid(sb->s_user_ns, attr->ia_uid));
+ newattr.ia_gid = KGIDT_INIT(from_kgid(sb->s_user_ns, attr->ia_gid));
+
+ oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+ inode_lock(reali);
+ if (iop->setattr)
+ err = iop->setattr(real, &newattr);
+ else
+ err = simple_setattr(real, &newattr);
+ inode_unlock(reali);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ if (err)
+ return err;
+
+ /* all OK, reflect the change on our inode */
+ setattr_copy(d_inode(dentry), attr);
+ return 0;
+}
+
+static int shiftfs_getattr(const struct path *path, struct kstat *stat,
+ u32 request_mask, unsigned int query_flags)
+{
+ struct inode *inode = path->dentry->d_inode;
+ struct dentry *real = path->dentry->d_fsdata;
+ struct inode *reali = real->d_inode;
+ const struct inode_operations *iop = reali->i_op;
+ struct path newpath = { .mnt = path->dentry->d_sb->s_fs_info, .dentry = real };
+ int err = 0;
+
+ if (iop->getattr)
+ err = iop->getattr(&newpath, stat, request_mask, query_flags);
+ else
+ generic_fillattr(reali, stat);
+
+ if (err)
+ return err;
+
+ /* transform the underlying id */
+ stat->uid = make_kuid(inode->i_sb->s_user_ns, __kuid_val(stat->uid));
+ stat->gid = make_kgid(inode->i_sb->s_user_ns, __kgid_val(stat->gid));
+ return 0;
+}
+
+static const struct inode_operations shiftfs_inode_ops = {
+ .lookup = shiftfs_lookup,
+ .getattr = shiftfs_getattr,
+ .setattr = shiftfs_setattr,
+ .permission = shiftfs_permission,
+ .mkdir = shiftfs_mkdir,
+ .symlink = shiftfs_symlink,
+ .get_link = shiftfs_get_link,
+ .readlink = shiftfs_readlink,
+ .unlink = shiftfs_unlink,
+ .rmdir = shiftfs_rmdir,
+ .rename = shiftfs_rename,
+ .link = shiftfs_link,
+ .create = shiftfs_create,
+ .mknod = NULL, /* no special files currently */
+ .listxattr = shiftfs_listxattr,
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+ struct dentry *dentry)
+{
+ struct inode *inode;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return NULL;
+
+ /*
+ * our inode is completely vestigial. All lookups, getattr
+ * and permission checks are done on the underlying inode, so
+ * what the user sees is entirely from the underlying inode.
+ */
+ mode &= S_IFMT;
+
+ inode->i_ino = get_next_ino();
+ inode->i_mode = mode;
+ inode->i_flags |= S_NOATIME | S_NOCMTIME;
+
+ inode->i_op = &shiftfs_inode_ops;
+
+ shiftfs_fill_inode(inode, dentry);
+
+ return inode;
+}
+
+static int shiftfs_show_options(struct seq_file *m, struct dentry *dentry)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+ if (ssi->mark)
+ seq_show_option(m, "mark", NULL);
+
+ return 0;
+}
+
+static int shiftfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct shiftfs_super_info *ssi = sb->s_fs_info;
+ struct dentry *root = sb->s_root;
+ struct dentry *realroot = root->d_fsdata;
+ struct path realpath = { .mnt = ssi->mnt, .dentry = realroot };
+ int err;
+
+ err = vfs_statfs(&realpath, buf);
+ if (err)
+ return err;
+
+ buf->f_type = sb->s_magic;
+
+ return 0;
+}
+
+static void shiftfs_put_super(struct super_block *sb)
+{
+ struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+ mntput(ssi->mnt);
+ put_user_ns(ssi->userns);
+ kfree(ssi);
+}
+
+static const struct xattr_handler shiftfs_xattr_handler = {
+ .prefix = "",
+ .get = shiftfs_xattr_get,
+ .set = shiftfs_xattr_set,
+};
+
+const struct xattr_handler *shiftfs_xattr_handlers[] = {
+ &shiftfs_xattr_handler,
+ NULL
+};
+
+static const struct super_operations shiftfs_super_ops = {
+ .put_super = shiftfs_put_super,
+ .show_options = shiftfs_show_options,
+ .statfs = shiftfs_statfs,
+};
+
+struct shiftfs_data {
+ void *data;
+ const char *path;
+};
+
+static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
+ int silent)
+{
+ struct shiftfs_data *data = raw_data;
+ char *name = kstrdup(data->path, GFP_KERNEL);
+ int err = -ENOMEM;
+ struct shiftfs_super_info *ssi = NULL;
+ struct path path;
+ struct dentry *dentry;
+
+ if (!name)
+ goto out;
+
+ ssi = kzalloc(sizeof(*ssi), GFP_KERNEL);
+ if (!ssi)
+ goto out;
+
+ err = -EPERM;
+ err = shiftfs_parse_options(ssi, data->data);
+ if (err)
+ goto out;
+
+ /* to mark a mount point, must be real root */
+ if (ssi->mark && !capable(CAP_SYS_ADMIN))
+ goto out;
+
+ /* else to mount a mark, must be userns admin */
+ if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ goto out;
+
+ err = kern_path(name, LOOKUP_FOLLOW, &path);
+ if (err)
+ goto out;
+
+ err = -EPERM;
+
+ if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
+ err = -ENOTDIR;
+ goto out_put;
+ }
+
+ sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
+ if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+ printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
+ err = -EINVAL;
+ goto out_put;
+ }
+
+ if (ssi->mark) {
+ /*
+ * this part is visible unshifted, so make sure no
+ * executables that could be used to give suid
+ * privileges
+ */
+ sb->s_iflags = SB_I_NOEXEC;
+ ssi->mnt = path.mnt;
+ dentry = path.dentry;
+ } else {
+ struct shiftfs_super_info *mp_ssi;
+
+ /*
+ * this leg executes if we're admin capable in
+ * the namespace, so be very careful
+ */
+ if (path.dentry->d_sb->s_magic != SHIFTFS_MAGIC)
+ goto out_put;
+ mp_ssi = path.dentry->d_sb->s_fs_info;
+ if (!mp_ssi->mark)
+ goto out_put;
+ ssi->mnt = mntget(mp_ssi->mnt);
+ dentry = dget(path.dentry->d_fsdata);
+ path_put(&path);
+ }
+ ssi->userns = get_user_ns(dentry->d_sb->s_user_ns);
+ sb->s_fs_info = ssi;
+ sb->s_magic = SHIFTFS_MAGIC;
+ sb->s_op = &shiftfs_super_ops;
+ sb->s_xattr = shiftfs_xattr_handlers;
+ sb->s_d_op = &shiftfs_dentry_ops;
+ sb->s_root = d_make_root(shiftfs_new_inode(sb, S_IFDIR, dentry));
+ sb->s_root->d_fsdata = dentry;
+
+ return 0;
+
+ out_put:
+ path_put(&path);
+ out:
+ kfree(name);
+ kfree(ssi);
+ return err;
+}
+
+static struct dentry *shiftfs_mount(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data)
+{
+ struct shiftfs_data d = { data, dev_name };
+
+ return mount_nodev(fs_type, flags, &d, shiftfs_fill_super);
+}
+
+static struct file_system_type shiftfs_type = {
+ .owner = THIS_MODULE,
+ .name = "shiftfs",
+ .mount = shiftfs_mount,
+ .kill_sb = kill_anon_super,
+ .fs_flags = FS_USERNS_MOUNT,
+};
+
+static int __init shiftfs_init(void)
+{
+ return register_filesystem(&shiftfs_type);
+}
+
+static void __exit shiftfs_exit(void)
+{
+ unregister_filesystem(&shiftfs_type);
+}
+
+MODULE_ALIAS_FS("shiftfs");
+MODULE_AUTHOR("James Bottomley");
+MODULE_DESCRIPTION("uid/gid shifting bind filesystem");
+MODULE_LICENSE("GPL v2");
+module_init(shiftfs_init)
+module_exit(shiftfs_exit)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 1a6fee974116..671b0c6d0754 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -90,4 +90,6 @@
#define BALLOON_KVM_MAGIC 0x13661366
#define ZSMALLOC_MAGIC 0x58295829

+#define SHIFTFS_MAGIC 0x6a656a62
+
#endif /* __LINUX_MAGIC_H__ */
--
2.19.1


2018-11-01 21:49:57

by Seth Forshee

[permalink] [raw]
Subject: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

shiftfs mounts cannot be nested for two reasons -- global
CAP_SYS_ADMIN is required to set up a mark mount, and a single
functional shiftfs mount meets the filesystem stacking depth
limit.

The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
ids in a mount must be within that mount's s_user_ns, so all that
is needed is CAP_SYS_ADMIN within that s_user_ns.

The stack depth issue can be worked around with a couple of
adjustments. First, a mark mount doesn't really need to count
against the stacking depth as it doesn't contribute to the call
stack depth during filesystem operations. Therefore the mount
over the mark mount only needs to count as one more than the
lower filesystems stack depth.

Second, when the lower mount is shiftfs we can just skip down to
that mount's lower filesystem and shift ids relative to that.
There is no reason to shift ids twice, and the lower path has
already been marked safe for id shifting by a user privileged
towards all ids in that mount's user ns.

Signed-off-by: Seth Forshee <[email protected]>
---
fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
1 file changed, 46 insertions(+), 22 deletions(-)

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index b19af7b2fe75..008ace2842b9 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
struct shiftfs_data *data = raw_data;
char *name = kstrdup(data->path, GFP_KERNEL);
int err = -ENOMEM;
- struct shiftfs_super_info *ssi = NULL;
+ struct shiftfs_super_info *ssi = NULL, *mp_ssi;
struct path path;
struct dentry *dentry;

@@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
if (err)
goto out;

- /* to mark a mount point, must be real root */
- if (ssi->mark && !capable(CAP_SYS_ADMIN))
- goto out;
-
- /* else to mount a mark, must be userns admin */
+ /* to mount a mark, must be userns admin */
if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
goto out;

@@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,

if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
err = -ENOTDIR;
- goto out_put;
- }
-
- sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
- err = -EINVAL;
- goto out_put;
+ goto out_put_path;
}

if (ssi->mark) {
+ struct super_block *lower_sb = path.mnt->mnt_sb;
+
+ /* to mark a mount point, must root wrt lower s_user_ns */
+ if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
+ goto out_put_path;
+
+
/*
* this part is visible unshifted, so make sure no
* executables that could be used to give suid
* privileges
*/
sb->s_iflags = SB_I_NOEXEC;
- ssi->mnt = path.mnt;
- dentry = path.dentry;
- } else {
- struct shiftfs_super_info *mp_ssi;

+ /*
+ * Handle nesting of shiftfs mounts by referring this mark
+ * mount back to the original mark mount. This is more
+ * efficient and alleviates concerns about stack depth.
+ */
+ if (lower_sb->s_magic == SHIFTFS_MAGIC) {
+ mp_ssi = lower_sb->s_fs_info;
+
+ /* Doesn't make sense to mark a mark mount */
+ if (mp_ssi->mark) {
+ err = -EINVAL;
+ goto out_put_path;
+ }
+
+ ssi->mnt = mntget(mp_ssi->mnt);
+ dentry = dget(path.dentry->d_fsdata);
+ } else {
+ ssi->mnt = mntget(path.mnt);
+ dentry = dget(path.dentry);
+ }
+ } else {
/*
* this leg executes if we're admin capable in
* the namespace, so be very careful
*/
if (path.dentry->d_sb->s_magic != SHIFTFS_MAGIC)
- goto out_put;
+ goto out_put_path;
mp_ssi = path.dentry->d_sb->s_fs_info;
if (!mp_ssi->mark)
- goto out_put;
+ goto out_put_path;
ssi->mnt = mntget(mp_ssi->mnt);
dentry = dget(path.dentry->d_fsdata);
- path_put(&path);
}
+
+ sb->s_stack_depth = dentry->d_sb->s_stack_depth + 1;
+ if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+ printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
+ err = -EINVAL;
+ goto out_put_mnt;
+ }
+
+ path_put(&path);
ssi->userns = get_user_ns(dentry->d_sb->s_user_ns);
sb->s_fs_info = ssi;
sb->s_magic = SHIFTFS_MAGIC;
@@ -1009,7 +1030,10 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,

return 0;

- out_put:
+ out_put_mnt:
+ mntput(ssi->mnt);
+ dput(dentry);
+ out_put_path:
path_put(&path);
out:
kfree(name);
--
2.19.1


2018-11-01 21:50:12

by Seth Forshee

[permalink] [raw]
Subject: [RFC PATCH 3/6] shiftfs: copy inode attrs up from underlying fs

Not all inode permission checks go through the permission
callback, e.g. some checks related to file capabilities. Always
copy up the inode attrs to ensure these checks work as expected.

Also introduce helpers helpers for shifting kernel ids from one
user ns to another, as this is an operation that is going to be
repeated.

Signed-off-by: Seth Forshee <[email protected]>
---
fs/shiftfs.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index b179a1be7bc1..556594988dd2 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -266,6 +266,33 @@ static int shiftfs_xattr_set(const struct xattr_handler *handler,
return shiftfs_setxattr(dentry, inode, name, value, size, flags);
}

+static kuid_t shift_kuid(struct user_namespace *from, struct user_namespace *to,
+ kuid_t kuid)
+{
+ uid_t uid = from_kuid(from, kuid);
+ return make_kuid(to, uid);
+}
+
+static kgid_t shift_kgid(struct user_namespace *from, struct user_namespace *to,
+ kgid_t kgid)
+{
+ gid_t gid = from_kgid(from, kgid);
+ return make_kgid(to, gid);
+}
+
+static void shiftfs_copyattr(struct inode *from, struct inode *to)
+{
+ struct user_namespace *from_ns = from->i_sb->s_user_ns;
+ struct user_namespace *to_ns = to->i_sb->s_user_ns;
+
+ to->i_uid = shift_kuid(from_ns, to_ns, from->i_uid);
+ to->i_gid = shift_kgid(from_ns, to_ns, from->i_gid);
+ to->i_mode = from->i_mode;
+ to->i_atime = from->i_atime;
+ to->i_mtime = from->i_mtime;
+ to->i_ctime = from->i_ctime;
+}
+
static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
{
struct inode *reali;
@@ -278,6 +305,7 @@ static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
if (!reali->i_op->get_link)
inode->i_opflags |= IOP_NOFOLLOW;

+ shiftfs_copyattr(reali, inode);
inode->i_mapping = reali->i_mapping;
inode->i_private = reali;
set_nlink(inode, reali->i_nlink);
@@ -573,7 +601,7 @@ static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
return err;

/* all OK, reflect the change on our inode */
- setattr_copy(d_inode(dentry), attr);
+ shiftfs_copyattr(reali, d_inode(dentry));
return 0;
}

--
2.19.1


2018-11-01 21:50:17

by Seth Forshee

[permalink] [raw]
Subject: [RFC PATCH 4/6] shiftfs: translate uids using s_user_ns from lower fs

Do not assume that ids from the lower filesystem are from
init_user_ns. Instead, translate them from that filesystem's
s_user_ns and then to the shiftfs user ns.

Signed-off-by: Seth Forshee <[email protected]>
---
fs/shiftfs.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index 556594988dd2..226c03d8588b 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -613,6 +613,8 @@ static int shiftfs_getattr(const struct path *path, struct kstat *stat,
struct inode *reali = real->d_inode;
const struct inode_operations *iop = reali->i_op;
struct path newpath = { .mnt = path->dentry->d_sb->s_fs_info, .dentry = real };
+ struct user_namespace *from_ns = reali->i_sb->s_user_ns;
+ struct user_namespace *to_ns = inode->i_sb->s_user_ns;
int err = 0;

if (iop->getattr)
@@ -624,8 +626,8 @@ static int shiftfs_getattr(const struct path *path, struct kstat *stat,
return err;

/* transform the underlying id */
- stat->uid = make_kuid(inode->i_sb->s_user_ns, __kuid_val(stat->uid));
- stat->gid = make_kgid(inode->i_sb->s_user_ns, __kgid_val(stat->gid));
+ stat->uid = shift_kuid(from_ns, to_ns, stat->uid);
+ stat->gid = shift_kgid(from_ns, to_ns, stat->gid);
return 0;
}

--
2.19.1


2018-11-01 21:51:05

by Seth Forshee

[permalink] [raw]
Subject: [RFC PATCH 2/6] shiftfs: map inodes to lower fs inodes instead of dentries

Since shiftfs inodes map to dentries in the lower fs, two links
to the same lowerfs inode create separate inodes in shiftfs. This
causes problems for inotify, as a watch on one of these files in
shiftfs will not see changes made to the underlying inode via the
other file.

Fix this by updating shiftfs to map its inodes to corresponding
inodes in the lower fs. Inodes are cached using the pointer to
the lower fs inode as the hash value. This fixes a second inotify
problem whereby a watch is set on an inode, the dentry is evicted
from the cache, and events on a new dentry are not reported back
to the watch original inode.

Signed-off-by: Seth Forshee <[email protected]>
---
fs/shiftfs.c | 105 ++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 79 insertions(+), 26 deletions(-)

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index 6028244c2f42..b179a1be7bc1 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -22,6 +22,7 @@ struct shiftfs_super_info {

static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
struct dentry *dentry);
+static void shiftfs_init_inode(struct inode *inode, umode_t mode);

enum {
OPT_MARK,
@@ -278,15 +279,27 @@ static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
inode->i_opflags |= IOP_NOFOLLOW;

inode->i_mapping = reali->i_mapping;
- inode->i_private = dentry;
+ inode->i_private = reali;
+ set_nlink(inode, reali->i_nlink);
+}
+
+static int shiftfs_inode_test(struct inode *inode, void *data)
+{
+ return inode->i_private == data;
+}
+
+static int shiftfs_inode_set(struct inode *inode, void *data)
+{
+ inode->i_private = data;
+ return 0;
}

static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
umode_t mode, const char *symlink,
struct dentry *hardlink, bool excl)
{
- struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
- struct inode *reali = real->d_inode, *newi;
+ struct dentry *new = dentry->d_fsdata;
+ struct inode *reali = dir->i_private, *inode, *newi;
const struct inode_operations *iop = reali->i_op;
int err;
const struct cred *oldcred, *newcred;
@@ -310,9 +323,14 @@ static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
return -EINVAL;


- newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
- if (!newi)
- return -ENOMEM;
+ if (hardlink) {
+ inode = d_inode(hardlink);
+ ihold(inode);
+ } else {
+ inode = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+ if (!inode)
+ return -ENOMEM;
+ }

oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);

@@ -341,16 +359,33 @@ static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
if (err)
goto out_dput;

- shiftfs_fill_inode(newi, new);
+ if (hardlink) {
+ WARN_ON(inode->i_private != new->d_inode);
+ inc_nlink(inode);
+ } else {
+ shiftfs_fill_inode(inode, new);
+
+ newi = inode_insert5(inode, (unsigned long)new->d_inode,
+ shiftfs_inode_test, shiftfs_inode_set,
+ new->d_inode);
+ if (newi != inode) {
+ pr_warn_ratelimited("shiftfs: newly created inode found in cache\n");
+ iput(inode);
+ inode = newi;
+ }
+ }
+
+ if (inode->i_state & I_NEW)
+ unlock_new_inode(inode);

- d_instantiate(dentry, newi);
+ d_instantiate(dentry, inode);

new = NULL;
- newi = NULL;
+ inode = NULL;

out_dput:
dput(new);
- iput(newi);
+ iput(inode);
inode_unlock(reali);

return err;
@@ -386,8 +421,8 @@ static int shiftfs_symlink(struct inode *dir, struct dentry *dentry,

static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir)
{
- struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
- struct inode *reali = real->d_inode;
+ struct dentry *new = dentry->d_fsdata;
+ struct inode *reali = dir->i_private;
int err;
const struct cred *oldcred, *newcred;

@@ -400,6 +435,13 @@ static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir)
else
err = vfs_unlink(reali, new, NULL);

+ if (!err) {
+ if (rmdir)
+ clear_nlink(d_inode(dentry));
+ else
+ drop_nlink(d_inode(dentry));
+ }
+
shiftfs_old_creds(oldcred, &newcred);
inode_unlock(reali);

@@ -420,7 +462,8 @@ static int shiftfs_rename(struct inode *olddir, struct dentry *old,
struct inode *newdir, struct dentry *new,
unsigned int flags)
{
- struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
+ struct dentry *rodd = old->d_parent->d_fsdata,
+ *rndd = new->d_parent->d_fsdata,
*realold = old->d_fsdata,
*realnew = new->d_fsdata, *trap;
struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
@@ -448,8 +491,8 @@ static int shiftfs_rename(struct inode *olddir, struct dentry *old,
static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
unsigned int flags)
{
- struct dentry *real = dir->i_private, *new;
- struct inode *reali = real->d_inode, *newi;
+ struct dentry *real = dentry->d_parent->d_fsdata, *new;
+ struct inode *reali = real->d_inode, *newi, *inode;
const struct cred *oldcred, *newcred;

inode_lock(reali);
@@ -463,24 +506,30 @@ static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,

dentry->d_fsdata = new;

- newi = NULL;
- if (!new->d_inode)
+ inode = NULL;
+ newi = new->d_inode;
+ if (!newi)
goto out;

- newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
- if (!newi) {
+ inode = iget5_locked(dentry->d_sb, (unsigned long)newi,
+ shiftfs_inode_test, shiftfs_inode_set, newi);
+ if (!inode) {
dput(new);
return ERR_PTR(-ENOMEM);
}
+ if (inode->i_state & I_NEW) {
+ shiftfs_init_inode(inode, newi->i_mode);
+ shiftfs_fill_inode(inode, new);
+ unlock_new_inode(inode);
+ }

out:
- return d_splice_alias(newi, dentry);
+ return d_splice_alias(inode, dentry);
}

static int shiftfs_permission(struct inode *inode, int mask)
{
- struct dentry *real = inode->i_private;
- struct inode *reali = real->d_inode;
+ struct inode *reali = inode->i_private;
const struct inode_operations *iop = reali->i_op;
int err;
const struct cred *oldcred, *newcred;
@@ -579,6 +628,14 @@ static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
if (!inode)
return NULL;

+ shiftfs_init_inode(inode, mode);
+ shiftfs_fill_inode(inode, dentry);
+
+ return inode;
+}
+
+static void shiftfs_init_inode(struct inode *inode, umode_t mode)
+{
/*
* our inode is completely vestigial. All lookups, getattr
* and permission checks are done on the underlying inode, so
@@ -591,10 +648,6 @@ static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
inode->i_flags |= S_NOATIME | S_NOCMTIME;

inode->i_op = &shiftfs_inode_ops;
-
- shiftfs_fill_inode(inode, dentry);
-
- return inode;
}

static int shiftfs_show_options(struct seq_file *m, struct dentry *dentry)
--
2.19.1


2018-11-01 21:51:07

by Seth Forshee

[permalink] [raw]
Subject: [RFC PATCH 5/6] shiftfs: add support for posix acls

Signed-off-by: Seth Forshee <[email protected]>
---
fs/Kconfig | 10 +++
fs/shiftfs.c | 185 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 195 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 392c5a41a9f9..691f3c4fc7eb 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -121,6 +121,16 @@ config SHIFT_FS
unprivileged containers can use this to mount root volumes
using this technique.

+config SHIFT_FS_POSIX_ACL
+ bool "shiftfs POSIX Access Control Lists"
+ depends on SHIFT_FS
+ select FS_POSIX_ACL
+ help
+ POSIX Access Control Lists (ACLs) support permissions for users and
+ groups beyond the owner/group/world scheme.
+
+ If you don't know what Access Control Lists are, say N.
+
menu "Caches"

source "fs/fscache/Kconfig"
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index 226c03d8588b..b19af7b2fe75 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -13,6 +13,8 @@
#include <linux/user_namespace.h>
#include <linux/uidgid.h>
#include <linux/xattr.h>
+#include <linux/posix_acl.h>
+#include <linux/posix_acl_xattr.h>

struct shiftfs_super_info {
struct vfsmount *mnt;
@@ -631,6 +633,183 @@ static int shiftfs_getattr(const struct path *path, struct kstat *stat,
return 0;
}

+#ifdef CONFIG_SHIFT_FS_POSIX_ACL
+
+static int
+shift_acl_ids(struct user_namespace *from, struct user_namespace *to,
+ struct posix_acl *acl)
+{
+ int i;
+
+ for (i = 0; i < acl->a_count; i++) {
+ struct posix_acl_entry *e = &acl->a_entries[i];
+ switch(e->e_tag) {
+ case ACL_USER:
+ e->e_uid = shift_kuid(from, to, e->e_uid);
+ if (!uid_valid(e->e_uid))
+ return -EOVERFLOW;
+ break;
+ case ACL_GROUP:
+ e->e_gid = shift_kgid(from, to, e->e_gid);
+ if (!gid_valid(e->e_gid))
+ return -EOVERFLOW;
+ break;
+ }
+ }
+ return 0;
+}
+
+static void
+shift_acl_xattr_ids(struct user_namespace *from, struct user_namespace *to,
+ void *value, size_t size)
+{
+ struct posix_acl_xattr_header *header = value;
+ struct posix_acl_xattr_entry *entry = (void *)(header + 1), *end;
+ int count;
+ kuid_t kuid;
+ kgid_t kgid;
+
+ if (!value)
+ return;
+ if (size < sizeof(struct posix_acl_xattr_header))
+ return;
+ if (header->a_version != cpu_to_le32(POSIX_ACL_XATTR_VERSION))
+ return;
+
+ count = posix_acl_xattr_count(size);
+ if (count < 0)
+ return;
+ if (count == 0)
+ return;
+
+ for (end = entry + count; entry != end; entry++) {
+ switch(le16_to_cpu(entry->e_tag)) {
+ case ACL_USER:
+ kuid = make_kuid(&init_user_ns, le32_to_cpu(entry->e_id));
+ kuid = shift_kuid(from, to, kuid);
+ entry->e_id = cpu_to_le32(from_kuid(&init_user_ns, kuid));
+ break;
+ case ACL_GROUP:
+ kgid = make_kgid(&init_user_ns, le32_to_cpu(entry->e_id));
+ kgid = shift_kgid(from, to, kgid);
+ entry->e_id = cpu_to_le32(from_kgid(&init_user_ns, kgid));
+ break;
+ default:
+ break;
+ }
+ }
+}
+
+static struct posix_acl *shiftfs_get_acl(struct inode *inode, int type)
+{
+ struct inode *reali = inode->i_private;
+ const struct cred *oldcred, *newcred;
+ struct posix_acl *real_acl, *acl = NULL;
+ struct user_namespace *from_ns = reali->i_sb->s_user_ns;
+ struct user_namespace *to_ns = inode->i_sb->s_user_ns;
+ int size;
+ int err;
+
+ if (!IS_POSIXACL(reali))
+ return NULL;
+
+ oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
+ real_acl = get_acl(reali, type);
+ shiftfs_old_creds(oldcred, &newcred);
+
+ if (real_acl && !IS_ERR(acl)) {
+ /* XXX: export posix_acl_clone? */
+ size = sizeof(struct posix_acl) +
+ real_acl->a_count * sizeof(struct posix_acl_entry);
+ acl = kmemdup(acl, size, GFP_KERNEL);
+ posix_acl_release(real_acl);
+
+ if (!acl)
+ return ERR_PTR(-ENOMEM);
+
+ refcount_set(&acl->a_refcount, 1);
+
+ err = shift_acl_ids(from_ns, to_ns, acl);
+ if (err) {
+ kfree(acl);
+ return ERR_PTR(err);
+ }
+ }
+
+ return acl;
+}
+
+static int
+shiftfs_posix_acl_xattr_get(const struct xattr_handler *handler,
+ struct dentry *dentry, struct inode *inode,
+ const char *name, void *buffer, size_t size)
+{
+ struct inode *reali = inode->i_private;
+ int ret;
+
+ ret = shiftfs_xattr_get(NULL, dentry, inode, handler->name,
+ buffer, size);
+ if (ret < 0)
+ return ret;
+
+ shift_acl_xattr_ids(reali->i_sb->s_user_ns, inode->i_sb->s_user_ns,
+ buffer, size);
+ return ret;
+}
+
+static int
+shiftfs_posix_acl_xattr_set(const struct xattr_handler *handler,
+ struct dentry *dentry, struct inode *inode,
+ const char *name, const void *value,
+ size_t size, int flags)
+{
+ struct inode *reali = inode->i_private;
+ int err;
+
+ if (!IS_POSIXACL(reali) || !reali->i_op->set_acl)
+ return -EOPNOTSUPP;
+ if (handler->flags == ACL_TYPE_DEFAULT && !S_ISDIR(inode->i_mode))
+ return value ? -EACCES : 0;
+ if (!inode_owner_or_capable(inode))
+ return -EPERM;
+
+ if (value) {
+ shift_acl_xattr_ids(inode->i_sb->s_user_ns,
+ reali->i_sb->s_user_ns,
+ (void *)value, size);
+ err = shiftfs_setxattr(dentry, inode, handler->name, value,
+ size, flags);
+ } else {
+ err = shiftfs_removexattr(dentry, handler->name);
+ }
+
+ if (!err)
+ shiftfs_copyattr(reali, inode);
+ return err;
+}
+
+static const struct xattr_handler
+shiftfs_posix_acl_access_xattr_handler = {
+ .name = XATTR_NAME_POSIX_ACL_ACCESS,
+ .flags = ACL_TYPE_ACCESS,
+ .get = shiftfs_posix_acl_xattr_get,
+ .set = shiftfs_posix_acl_xattr_set,
+};
+
+static const struct xattr_handler
+shiftfs_posix_acl_default_xattr_handler = {
+ .name = XATTR_NAME_POSIX_ACL_DEFAULT,
+ .flags = ACL_TYPE_DEFAULT,
+ .get = shiftfs_posix_acl_xattr_get,
+ .set = shiftfs_posix_acl_xattr_set,
+};
+
+#else /* !CONFIG_SHIFT_FS_POSIX_ACL */
+
+#define shiftfs_get_acl NULL
+
+#endif /* CONFIG_SHIFT_FS_POSIX_ACL */
+
static const struct inode_operations shiftfs_inode_ops = {
.lookup = shiftfs_lookup,
.getattr = shiftfs_getattr,
@@ -647,6 +826,7 @@ static const struct inode_operations shiftfs_inode_ops = {
.create = shiftfs_create,
.mknod = NULL, /* no special files currently */
.listxattr = shiftfs_listxattr,
+ .get_acl = shiftfs_get_acl,
};

static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
@@ -725,6 +905,10 @@ static const struct xattr_handler shiftfs_xattr_handler = {
};

const struct xattr_handler *shiftfs_xattr_handlers[] = {
+#ifdef CONFIG_SHIFT_FS_POSIX_ACL
+ &shiftfs_posix_acl_access_xattr_handler,
+ &shiftfs_posix_acl_default_xattr_handler,
+#endif
&shiftfs_xattr_handler,
NULL
};
@@ -819,6 +1003,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
sb->s_op = &shiftfs_super_ops;
sb->s_xattr = shiftfs_xattr_handlers;
sb->s_d_op = &shiftfs_dentry_ops;
+ sb->s_flags |= SB_POSIXACL;
sb->s_root = d_make_root(shiftfs_new_inode(sb, S_IFDIR, dentry));
sb->s_root->d_fsdata = dentry;

--
2.19.1


2018-11-02 09:00:28

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] shiftfs fixes and enhancements

[cc: linux-unionfs
It should the mailing list for *all* "stacking fs".
We have a lot of common problems I think ;-) ]

On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <[email protected]> wrote:
>
> I've done some work to fix and enhance shiftfs for a number of use
> cases, so that we would have an idea what a more full-featured shiftfs
> would look like. I'm intending for these to serve as a point of
> reference for discussing id shifting mounts/filesystems at plumbers in a
> couple of weeks [1].
>
> Note that these are based on 4.18, and I've added a small fix to James'
> most recent patch to fix a build issue there. To work with 4.19 they
> will need a number of updates due to changes in the vfs.
>

Seth,

I like the way you addressed my concerns about nesting and stacking depth.
Will provide specific nits on patch.

In preparation to the Plumbers talk (which I will not be attending), I wanted to
get your opinion on the matters I brought up last time:
https://marc.info/?l=linux-fsdevel&m=153013920904844&w=2

1) Having seen what it takes to catch up with overlayfs w.r.t inotify bugs
and having peeked into 4.19 to see what work you still have lined up for you
to bring shitfs up to speed with vfs, did you have time to look into my proposal
for sharing code with overlayfs in the manner that I have implemented the
snapshotfs POC?
https://github.com/amir73il/linux/commit/25416757f2ca47759f59b115e6461b11898c4f06

Even if you end up not saving a single line of code for shiftfs v1
meaning that all shiftfs_inode_ops are completely separate implementation
from overlayfs inode ops, this may still be beneficial to shitfs in
the long run.
For example, you may, in fact, won't need to change anything to work with v4.19.
shittfs (as an overlayfs alias) would use ovl_file_operations and
shiftfs_inode_ops.

Another example, from the top of my head, see what it took to add NFS export
support to snapshotfs, because of the code reuse with overlayfs:
https://github.com/amir73il/linux/commit/d082eb615133490ec26fa2efaa80ed4723860893
Its practically the exact same implementation shiftfs would need,
so in the far future, shitfs and snapshotfs can share the same
export_operations.

2) Regarding this part:
+ /*
+ * this part is visible unshifted, so make sure no
+ * executables that could be used to give suid
+ * privileges
+ */
+ sb->s_iflags = SB_I_NOEXEC;

Why would you want to make the unshifted fs visible at all?
Is there a requirement for container users to access the unshifted fs
content? Is there a requirement for container admin to mount shitfted fs
NOT from the root of the marked mount?

If those are not required, then I propose NOOP inode operations for
the unshifted fs, specifically, empty readdir, just enough ops to be able
to use the mark mount point as the shitfed mount source - no more.

Thanks,
Amir.

2018-11-02 10:05:30

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <[email protected]> wrote:
>
> shiftfs mounts cannot be nested for two reasons -- global
> CAP_SYS_ADMIN is required to set up a mark mount, and a single
> functional shiftfs mount meets the filesystem stacking depth
> limit.
>
> The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> ids in a mount must be within that mount's s_user_ns, so all that
> is needed is CAP_SYS_ADMIN within that s_user_ns.
>
> The stack depth issue can be worked around with a couple of
> adjustments. First, a mark mount doesn't really need to count
> against the stacking depth as it doesn't contribute to the call
> stack depth during filesystem operations. Therefore the mount
> over the mark mount only needs to count as one more than the
> lower filesystems stack depth.

That's true, but it also highlights the point that the "mark" sb is
completely unneeded and it really is just a nice little hack.
All the information it really stores is a lower mount reference,
a lower root dentry and a declarative statement "I am shiftable!".

Come to think about it, "container shiftable" really isn't that different from
NFS export with "no_root_squash" and auto mounted USB drive.
I mean the shifting itself is different of course, but the
declaration, not so much.
If I am allowing sudoers on another machine to mess with root owned
files visible
on my machine, I am pretty much have the same issues as container admins
accessing root owned files on my init_user_ns filesystem. In all those cases,
I'd better not be executing suid binaries from the untrusted "external" source.

Instead of mounting a dummy filesystem to make the declaration, you could
get the same thing with:
mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
or whatnot) constant to uapi and manage to come up good man page description.

Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
avoid the extra bind mount step (for a full filesystem tree export).
Declaring a mounted image MS_EXTERN has merits on its own even without
containers and shitfs, for example with pluggable storage. Other LSMs could make
good use of that declaration.

>
> Second, when the lower mount is shiftfs we can just skip down to
> that mount's lower filesystem and shift ids relative to that.
> There is no reason to shift ids twice, and the lower path has
> already been marked safe for id shifting by a user privileged
> towards all ids in that mount's user ns.
>
> Signed-off-by: Seth Forshee <[email protected]>
> ---
> fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> index b19af7b2fe75..008ace2842b9 100644
> --- a/fs/shiftfs.c
> +++ b/fs/shiftfs.c
> @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> struct shiftfs_data *data = raw_data;
> char *name = kstrdup(data->path, GFP_KERNEL);
> int err = -ENOMEM;
> - struct shiftfs_super_info *ssi = NULL;
> + struct shiftfs_super_info *ssi = NULL, *mp_ssi;
> struct path path;
> struct dentry *dentry;
>
> @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> if (err)
> goto out;
>
> - /* to mark a mount point, must be real root */
> - if (ssi->mark && !capable(CAP_SYS_ADMIN))
> - goto out;
> -
> - /* else to mount a mark, must be userns admin */
> + /* to mount a mark, must be userns admin */
> if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> goto out;

Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget()

>
> @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>
> if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
> err = -ENOTDIR;
> - goto out_put;
> - }
> -
> - sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> - printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
> - err = -EINVAL;
> - goto out_put;
> + goto out_put_path;
> }
>
> if (ssi->mark) {
> + struct super_block *lower_sb = path.mnt->mnt_sb;
> +
> + /* to mark a mount point, must root wrt lower s_user_ns */
> + if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
> + goto out_put_path;
> +
> +
> /*
> * this part is visible unshifted, so make sure no
> * executables that could be used to give suid
> * privileges
> */
> sb->s_iflags = SB_I_NOEXEC;

As commented on cover letter, why allow access to any files besides root at all?
In fact, the only justification for a dummy sb (instead of bind mount with
MS_EXTERN flag) would be in order to override inode operations with noop ops
to prevent access to unshifted files from within container.

Thanks,
Amir.

2018-11-02 12:27:22

by Seth Forshee

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] shiftfs fixes and enhancements

On Fri, Nov 02, 2018 at 10:59:38AM +0200, Amir Goldstein wrote:
> [cc: linux-unionfs
> It should the mailing list for *all* "stacking fs".
> We have a lot of common problems I think ;-) ]
>
> On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <[email protected]> wrote:
> >
> > I've done some work to fix and enhance shiftfs for a number of use
> > cases, so that we would have an idea what a more full-featured shiftfs
> > would look like. I'm intending for these to serve as a point of
> > reference for discussing id shifting mounts/filesystems at plumbers in a
> > couple of weeks [1].
> >
> > Note that these are based on 4.18, and I've added a small fix to James'
> > most recent patch to fix a build issue there. To work with 4.19 they
> > will need a number of updates due to changes in the vfs.
> >
>
> Seth,
>
> I like the way you addressed my concerns about nesting and stacking depth.
> Will provide specific nits on patch.
>
> In preparation to the Plumbers talk (which I will not be attending), I wanted to
> get your opinion on the matters I brought up last time:
> https://marc.info/?l=linux-fsdevel&m=153013920904844&w=2

I want the session at plumbers to not be a "talk" but more of a
discussion of the sorts of things you raise below. But I'm also happy to
talk about them here.

> 1) Having seen what it takes to catch up with overlayfs w.r.t inotify bugs
> and having peeked into 4.19 to see what work you still have lined up for you
> to bring shitfs up to speed with vfs, did you have time to look into my proposal
> for sharing code with overlayfs in the manner that I have implemented the
> snapshotfs POC?
> https://github.com/amir73il/linux/commit/25416757f2ca47759f59b115e6461b11898c4f06
>
> Even if you end up not saving a single line of code for shiftfs v1
> meaning that all shiftfs_inode_ops are completely separate implementation
> from overlayfs inode ops, this may still be beneficial to shitfs in
> the long run.
> For example, you may, in fact, won't need to change anything to work with v4.19.
> shittfs (as an overlayfs alias) would use ovl_file_operations and
> shiftfs_inode_ops.

I don't recall seeing the shapshotfs patches before. If id shifting
remains an overlay-style fs and not a feature of the vfs, then I
absolutely think something like this will make life much easier.

> Another example, from the top of my head, see what it took to add NFS export
> support to snapshotfs, because of the code reuse with overlayfs:
> https://github.com/amir73il/linux/commit/d082eb615133490ec26fa2efaa80ed4723860893
> Its practically the exact same implementation shiftfs would need,
> so in the far future, shitfs and snapshotfs can share the same
> export_operations.
>
> 2) Regarding this part:
> + /*
> + * this part is visible unshifted, so make sure no
> + * executables that could be used to give suid
> + * privileges
> + */
> + sb->s_iflags = SB_I_NOEXEC;
>
> Why would you want to make the unshifted fs visible at all?
> Is there a requirement for container users to access the unshifted fs
> content? Is there a requirement for container admin to mount shitfted fs
> NOT from the root of the marked mount?
>
> If those are not required, then I propose NOOP inode operations for
> the unshifted fs, specifically, empty readdir, just enough ops to be able
> to use the mark mount point as the shitfed mount source - no more.

This is part of the original implementation that I didn't touch with
these updates. Imo the mark mount is kind of kludgy, and I'd like to see
it done a different way.

A couple of alternatives have been suggested. One was to use xattrs for
marking, or I did a PoC with an older version of the new mount API
patches where an fsfd was passed to the less privileged context that it
could attach to its mount tree:

https://lkml.kernel.org/r/20180717133847.GB15620@ubuntu-xps13

Either of these can accomplish the same things as the mark mount with
better control over who can create an id-shifted mount of the subtree.

However if the mark mount is kept then no-op inode operations seems
reasonable to me.

Thanks,
Seth

2018-11-02 12:45:03

by Seth Forshee

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <[email protected]> wrote:
> >
> > shiftfs mounts cannot be nested for two reasons -- global
> > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > functional shiftfs mount meets the filesystem stacking depth
> > limit.
> >
> > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > ids in a mount must be within that mount's s_user_ns, so all that
> > is needed is CAP_SYS_ADMIN within that s_user_ns.
> >
> > The stack depth issue can be worked around with a couple of
> > adjustments. First, a mark mount doesn't really need to count
> > against the stacking depth as it doesn't contribute to the call
> > stack depth during filesystem operations. Therefore the mount
> > over the mark mount only needs to count as one more than the
> > lower filesystems stack depth.
>
> That's true, but it also highlights the point that the "mark" sb is
> completely unneeded and it really is just a nice little hack.
> All the information it really stores is a lower mount reference,
> a lower root dentry and a declarative statement "I am shiftable!".

Seems I should have saved some of the things I said in my previous
response for this one. As you no doubt gleaned from that email, I agree
with this.

> Come to think about it, "container shiftable" really isn't that different from
> NFS export with "no_root_squash" and auto mounted USB drive.
> I mean the shifting itself is different of course, but the
> declaration, not so much.
> If I am allowing sudoers on another machine to mess with root owned
> files visible
> on my machine, I am pretty much have the same issues as container admins
> accessing root owned files on my init_user_ns filesystem. In all those cases,
> I'd better not be executing suid binaries from the untrusted "external" source.
>
> Instead of mounting a dummy filesystem to make the declaration, you could
> get the same thing with:
> mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> or whatnot) constant to uapi and manage to come up good man page description.
>
> Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> avoid the extra bind mount step (for a full filesystem tree export).
> Declaring a mounted image MS_EXTERN has merits on its own even without
> containers and shitfs, for example with pluggable storage. Other LSMs could make
> good use of that declaration.

I'm missing how we figure out the target user ns in this scheme. We need
a context with privileges towards the source path's s_user_ns to say
it's okay to shift ids for the files under the source path, and then we
need a target user ns for the id shifts. Currently the target is
current_user_ns when the final shiftfs mount is created.

So, how do we determine the target s_user_ns in your scheme?

>
> >
> > Second, when the lower mount is shiftfs we can just skip down to
> > that mount's lower filesystem and shift ids relative to that.
> > There is no reason to shift ids twice, and the lower path has
> > already been marked safe for id shifting by a user privileged
> > towards all ids in that mount's user ns.
> >
> > Signed-off-by: Seth Forshee <[email protected]>
> > ---
> > fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
> > 1 file changed, 46 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> > index b19af7b2fe75..008ace2842b9 100644
> > --- a/fs/shiftfs.c
> > +++ b/fs/shiftfs.c
> > @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> > struct shiftfs_data *data = raw_data;
> > char *name = kstrdup(data->path, GFP_KERNEL);
> > int err = -ENOMEM;
> > - struct shiftfs_super_info *ssi = NULL;
> > + struct shiftfs_super_info *ssi = NULL, *mp_ssi;
> > struct path path;
> > struct dentry *dentry;
> >
> > @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> > if (err)
> > goto out;
> >
> > - /* to mark a mount point, must be real root */
> > - if (ssi->mark && !capable(CAP_SYS_ADMIN))
> > - goto out;
> > -
> > - /* else to mount a mark, must be userns admin */
> > + /* to mount a mark, must be userns admin */
> > if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > goto out;
>
> Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget()

Yeah, I noticed that too. I left it in for the moment to emphasize the
change I was making, but it can be removed.

>
> >
> > @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >
> > if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
> > err = -ENOTDIR;
> > - goto out_put;
> > - }
> > -
> > - sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> > - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > - printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
> > - err = -EINVAL;
> > - goto out_put;
> > + goto out_put_path;
> > }
> >
> > if (ssi->mark) {
> > + struct super_block *lower_sb = path.mnt->mnt_sb;
> > +
> > + /* to mark a mount point, must root wrt lower s_user_ns */
> > + if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
> > + goto out_put_path;
> > +
> > +
> > /*
> > * this part is visible unshifted, so make sure no
> > * executables that could be used to give suid
> > * privileges
> > */
> > sb->s_iflags = SB_I_NOEXEC;
>
> As commented on cover letter, why allow access to any files besides root at all?
> In fact, the only justification for a dummy sb (instead of bind mount with
> MS_EXTERN flag) would be in order to override inode operations with noop ops
> to prevent access to unshifted files from within container.

Summarizing my response to the other message, if the mark mount is kept
(and I would prefer that it isn't kept) that seems reasonable to me.

Thanks,
Seth

2018-11-02 13:17:43

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee <[email protected]> wrote:
>
> On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <[email protected]> wrote:
> > >
> > > shiftfs mounts cannot be nested for two reasons -- global
> > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > functional shiftfs mount meets the filesystem stacking depth
> > > limit.
> > >
> > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > ids in a mount must be within that mount's s_user_ns, so all that
> > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > >
> > > The stack depth issue can be worked around with a couple of
> > > adjustments. First, a mark mount doesn't really need to count
> > > against the stacking depth as it doesn't contribute to the call
> > > stack depth during filesystem operations. Therefore the mount
> > > over the mark mount only needs to count as one more than the
> > > lower filesystems stack depth.
> >
> > That's true, but it also highlights the point that the "mark" sb is
> > completely unneeded and it really is just a nice little hack.
> > All the information it really stores is a lower mount reference,
> > a lower root dentry and a declarative statement "I am shiftable!".
>
> Seems I should have saved some of the things I said in my previous
> response for this one. As you no doubt gleaned from that email, I agree
> with this.
>
> > Come to think about it, "container shiftable" really isn't that different from
> > NFS export with "no_root_squash" and auto mounted USB drive.
> > I mean the shifting itself is different of course, but the
> > declaration, not so much.
> > If I am allowing sudoers on another machine to mess with root owned
> > files visible
> > on my machine, I am pretty much have the same issues as container admins
> > accessing root owned files on my init_user_ns filesystem. In all those cases,
> > I'd better not be executing suid binaries from the untrusted "external" source.
> >
> > Instead of mounting a dummy filesystem to make the declaration, you could
> > get the same thing with:
> > mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> > or whatnot) constant to uapi and manage to come up good man page description.
> >
> > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> > avoid the extra bind mount step (for a full filesystem tree export).
> > Declaring a mounted image MS_EXTERN has merits on its own even without
> > containers and shitfs, for example with pluggable storage. Other LSMs could make
> > good use of that declaration.
>
> I'm missing how we figure out the target user ns in this scheme. We need
> a context with privileges towards the source path's s_user_ns to say
> it's okay to shift ids for the files under the source path, and then we
> need a target user ns for the id shifts. Currently the target is
> current_user_ns when the final shiftfs mount is created.
>
> So, how do we determine the target s_user_ns in your scheme?
>

Unless I am completely misunderstanding shiftfs, I think we are saying the
same thing. You said you wish to get rid of the "mark" fs and that you had
a POC of implementing the "mark" using xattr.
I'm just saying another option to implement the mark is using a super block
flag and you get the target s_user_ns from mnt_sb.
I did miss the fact that a mount flag is not enough, so that makes the bind
mount concept fail. Unless, maybe, the mount in the container is a slave
mount and the "shiftable" mark is set on the master.
I know too little about mount ns vs. user ns to tell if any of this makes sense.

Feel free to ignore MS_EXTERN idea.

Hopefully, mount API v2 will provide the proper facility to implement mark.

Thanks,
Amir.

2018-11-02 13:49:22

by Seth Forshee

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

On Fri, Nov 02, 2018 at 03:16:05PM +0200, Amir Goldstein wrote:
> On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee <[email protected]> wrote:
> >
> > On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <[email protected]> wrote:
> > > >
> > > > shiftfs mounts cannot be nested for two reasons -- global
> > > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > > functional shiftfs mount meets the filesystem stacking depth
> > > > limit.
> > > >
> > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > > ids in a mount must be within that mount's s_user_ns, so all that
> > > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > > >
> > > > The stack depth issue can be worked around with a couple of
> > > > adjustments. First, a mark mount doesn't really need to count
> > > > against the stacking depth as it doesn't contribute to the call
> > > > stack depth during filesystem operations. Therefore the mount
> > > > over the mark mount only needs to count as one more than the
> > > > lower filesystems stack depth.
> > >
> > > That's true, but it also highlights the point that the "mark" sb is
> > > completely unneeded and it really is just a nice little hack.
> > > All the information it really stores is a lower mount reference,
> > > a lower root dentry and a declarative statement "I am shiftable!".
> >
> > Seems I should have saved some of the things I said in my previous
> > response for this one. As you no doubt gleaned from that email, I agree
> > with this.
> >
> > > Come to think about it, "container shiftable" really isn't that different from
> > > NFS export with "no_root_squash" and auto mounted USB drive.
> > > I mean the shifting itself is different of course, but the
> > > declaration, not so much.
> > > If I am allowing sudoers on another machine to mess with root owned
> > > files visible
> > > on my machine, I am pretty much have the same issues as container admins
> > > accessing root owned files on my init_user_ns filesystem. In all those cases,
> > > I'd better not be executing suid binaries from the untrusted "external" source.
> > >
> > > Instead of mounting a dummy filesystem to make the declaration, you could
> > > get the same thing with:
> > > mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> > > or whatnot) constant to uapi and manage to come up good man page description.
> > >
> > > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> > > avoid the extra bind mount step (for a full filesystem tree export).
> > > Declaring a mounted image MS_EXTERN has merits on its own even without
> > > containers and shitfs, for example with pluggable storage. Other LSMs could make
> > > good use of that declaration.
> >
> > I'm missing how we figure out the target user ns in this scheme. We need
> > a context with privileges towards the source path's s_user_ns to say
> > it's okay to shift ids for the files under the source path, and then we
> > need a target user ns for the id shifts. Currently the target is
> > current_user_ns when the final shiftfs mount is created.
> >
> > So, how do we determine the target s_user_ns in your scheme?
> >
>
> Unless I am completely misunderstanding shiftfs, I think we are saying the
> same thing. You said you wish to get rid of the "mark" fs and that you had
> a POC of implementing the "mark" using xattr.

The PoC was using filesystem contexts and (an earlier version of) the
new mount API, but yes I do think we're in agreement about the mark fs
being awkward.

> I'm just saying another option to implement the mark is using a super block
> flag and you get the target s_user_ns from mnt_sb.
> I did miss the fact that a mount flag is not enough, so that makes the bind
> mount concept fail. Unless, maybe, the mount in the container is a slave
> mount and the "shiftable" mark is set on the master.
> I know too little about mount ns vs. user ns to tell if any of this makes sense.

We need a source and target user ns for the shifting, currently they are
the lower filesystem's s_user_ns and the s_user_ns of the shiftfs fs
(which is current_user_ns at the time the real shiftfs fs is mounted). I
think doing it this way makes sense.

The problem with the slave mount idea is that s_user_ns for the shiftfs
sb is still not the target ns for the id shifting, which is going to
cause all kinds of problems. Unless all of this is done in the vfs, then
it could really be a bind mount. The shifting could be done based on the
mount namespace's user_ns and all permission checks in the vfs could
take this into account. This approach certainly has benefits and is one
of the things I want to discuss.

> Feel free to ignore MS_EXTERN idea.
>
> Hopefully, mount API v2 will provide the proper facility to implement mark.

The approach I took was to have the source user_ns context set up an fd
that was passed to the target user_ns context. Then that fd was used to
attach the mount to the mount tree, and when that happened the target
s_user_ns was set. It worked, but there were some awkware bits. I
haven't kept up with the changes since then to know if things are any
better or worse with the current patches.

Thanks,
Seth

2018-11-02 16:59:53

by James Bottomley

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

On Fri, 2018-11-02 at 15:16 +0200, Amir Goldstein wrote:
> On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee <[email protected]
> om> wrote:
> >
> > On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canoni
> > > cal.com> wrote:
> > > >
> > > > shiftfs mounts cannot be nested for two reasons -- global
> > > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > > functional shiftfs mount meets the filesystem stacking depth
> > > > limit.
> > > >
> > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > > ids in a mount must be within that mount's s_user_ns, so all
> > > > that
> > > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > > >
> > > > The stack depth issue can be worked around with a couple of
> > > > adjustments. First, a mark mount doesn't really need to count
> > > > against the stacking depth as it doesn't contribute to the call
> > > > stack depth during filesystem operations. Therefore the mount
> > > > over the mark mount only needs to count as one more than the
> > > > lower filesystems stack depth.
> > >
> > > That's true, but it also highlights the point that the "mark" sb
> > > is
> > > completely unneeded and it really is just a nice little hack.
> > > All the information it really stores is a lower mount reference,
> > > a lower root dentry and a declarative statement "I am
> > > shiftable!".
> >
> > Seems I should have saved some of the things I said in my previous
> > response for this one. As you no doubt gleaned from that email, I
> > agree
> > with this.
> >
> > > Come to think about it, "container shiftable" really isn't that
> > > different from
> > > NFS export with "no_root_squash" and auto mounted USB drive.
> > > I mean the shifting itself is different of course, but the
> > > declaration, not so much.
> > > If I am allowing sudoers on another machine to mess with root
> > > owned
> > > files visible
> > > on my machine, I am pretty much have the same issues as container
> > > admins
> > > accessing root owned files on my init_user_ns filesystem. In all
> > > those cases,
> > > I'd better not be executing suid binaries from the untrusted
> > > "external" source.
> > >
> > > Instead of mounting a dummy filesystem to make the declaration,
> > > you could
> > > get the same thing with:
> > > mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > > and all you need to do is add MS_EXTERN (MS_SHIFTABLE
> > > MS_UNTRUSTED
> > > or whatnot) constant to uapi and manage to come up good man page
> > > description.
> > >
> > > Then users could actually mount a filesystem in init_user_ns
> > > MS_EXTERN and
> > > avoid the extra bind mount step (for a full filesystem tree
> > > export).
> > > Declaring a mounted image MS_EXTERN has merits on its own even
> > > without
> > > containers and shitfs, for example with pluggable storage. Other
> > > LSMs could make
> > > good use of that declaration.
> >
> > I'm missing how we figure out the target user ns in this scheme. We
> > need
> > a context with privileges towards the source path's s_user_ns to
> > say
> > it's okay to shift ids for the files under the source path, and
> > then we
> > need a target user ns for the id shifts. Currently the target is
> > current_user_ns when the final shiftfs mount is created.
> >
> > So, how do we determine the target s_user_ns in your scheme?
> >
>
> Unless I am completely misunderstanding shiftfs, I think we are
> saying the same thing. You said you wish to get rid of the "mark" fs
> and that you had a POC of implementing the "mark" using xattr.

I've got one of these too ... it works nicely.

> I'm just saying another option to implement the mark is using a super
> block flag and you get the target s_user_ns from mnt_sb.

This works a lot less well because the entire mount becomes shiftable,
not just a subtree, which is suboptimal for the unprivileged use case.
The idea for getting around this was the one Ted mentioned of attaching
properties to the vfsmount structure (he'd like to use this for case
insensitive name comparisons in subtrees), but that requires
rethreading quite a few vfs calls to take a struct path instead of a
struct dentry.

James