2017-10-06 15:49:09

by David Howells

Subject: [PATCH 00/14] VFS: Introduce filesystem context [ver #6]


Here is a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount. This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

fd = fsopen("nfs");
write(fd, "option=val", ...);
fsmount(fd, "/mnt");

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).
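
As a rough illustration of the sequence described above (create a context,
populate it with options, create the superblock, then effect the mount), here
is a small user-space model. It is not kernel code and does not call the new
system calls; every model_* name is hypothetical and exists only to show the
ordering that the context enforces.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space model of the context lifecycle; not kernel code. */
enum model_phase { MODEL_CONFIGURING, MODEL_HAVE_TREE, MODEL_MOUNTED };

struct model_fs_context {
	const char *fs_name;		/* filesystem type, e.g. "nfs" */
	char options[128];		/* accumulated comma-separated options */
	enum model_phase phase;		/* where we are in the procedure */
};

/* Step 1: create a context for a filesystem type. */
static void model_open(struct model_fs_context *fc, const char *fs_name)
{
	assert(fc && fs_name);
	fc->fs_name = fs_name;
	fc->options[0] = '\0';
	fc->phase = MODEL_CONFIGURING;
}

/* Step 2: supply one option; only allowed while still configuring. */
static int model_add_option(struct model_fs_context *fc, const char *opt)
{
	size_t used = strlen(fc->options);
	size_t need = strlen(opt) + (used ? 1 : 0);

	if (fc->phase != MODEL_CONFIGURING)
		return -EBUSY;
	if (used + need >= sizeof(fc->options))
		return -ENOSPC;
	if (used)
		strcat(fc->options, ",");
	strcat(fc->options, opt);
	return 0;
}

/* Step 3: create the superblock from the collected configuration. */
static int model_get_tree(struct model_fs_context *fc)
{
	if (fc->phase != MODEL_CONFIGURING)
		return -EBUSY;
	fc->phase = MODEL_HAVE_TREE;
	return 0;
}

/* Step 4: attach the prepared superblock at a mountpoint. */
static int model_mount(struct model_fs_context *fc, const char *where)
{
	if (fc->phase != MODEL_HAVE_TREE)
		return -EINVAL;
	printf("%s mounted on %s with \"%s\"\n", fc->fs_name, where, fc->options);
	fc->phase = MODEL_MOUNTED;
	return 0;
}
```

In this model, calling model_mount() before model_get_tree() fails, as does
adding an option after the superblock exists; the real series may enforce
the ordering differently.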

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented mount context handling for procfs, nfs, mqueue, cpuset,
kernfs, sysfs and cgroup filesystems.

Non-converted filesystems are handled by the legacy filesystem wrapper.

Significant changes:

ver #6:

(*) Dropped the supplementary error string facility for the moment.

(*) Dropped the NFS patches for the moment.

(*) Dropped the reserved file descriptor argument from fsopen() and
replaced it with three reserved pointers that must be NULL.

ver #5:

(*) Renamed sb_config -> fs_context and adjusted variable names.

(*) Differentiated the flags in sb->s_flags (now named SB_*) from those
passed to mount(2) (named MS_*).

(*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
caller always provide a struct file_system_type pointer and the
parameters required.

(*) Got rid of vfs_submount_fc() in favour of passing
FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context(). The purpose is now
used more.

(*) Call ->validate() on the remount path.

(*) Got rid of the inode locking in sys_fsmount().

(*) Call security_sb_mountpoint() in the mount(2) path.

ver #4:

(*) Split the sb_config patch up somewhat.

(*) Made the supplementary error string facility something attached to the
task_struct rather than the sb_config so that error messages can be
obtained from NFS doing a mount-root-and-pathwalk inside the
nfs_get_tree() operation.

Further, made this managed and read by prctl rather than through the
mount fd so that it's more generally available.

ver #3:

(*) Rebased on 4.12-rc1.

(*) Split the NFS patch up somewhat.

ver #2:

(*) Removed the ->fill_super() from sb_config_operations and passed it in
directly to functions that want to call it. NFS now calls
nfs_fill_super() directly rather than jumping through a pointer to it
since there's only the one option at the moment.

(*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
proc_sb_config.

(*) Renamed create_super -> get_tree.

(*) Renamed struct mount_context to struct sb_config and amended various
variable names.

(*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
arguments.

ver #1:

(*) Split the sb_config stuff out into its own header.

(*) Support non-context aware filesystems through a special set of
sb_config operations.

(*) Stored the created superblock and root dentry into the sb_config after
creation rather than directly into a vfsmount. This allows some
arguments to be removed to various NFS functions.

(*) Added an explicit superblock-creation step. This allows a created
superblock to then be mounted multiple times.

(*) Added a flag to say that the sb_config is degraded and cannot be used
for another attempt at superblock creation, and got rid of the flag
that says it's already mounted.

Further developments:

(*) Implement sb reconfiguration (for now it returns ENOANO).

(*) Implement mount context support in more filesystems, ext4 being next
on my list.

(*) Move the walk-from-root stuff that nfs has to generic code so that you
can do something akin to:

mount /dev/sda1:/foo/bar /mnt

See nfs_follow_remote_path() and mount_subtree(). This is slightly
tricky in NFS as we have to prevent referral loops.

(*) Work out how to get at the error message incurred by submounts
encountered during nfs_follow_remote_path().

Should the error message be moved to task_struct and made more
general, perhaps retrieved with a prctl() function?

(*) Clean up/consolidate the security functions. Possibly add a
validation hook to be called at the same time as the mount context
validate op.

The patches can be found here also:

http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=mount-context

David
---
David Howells (14):
VFS: Introduce the structs and doc for a filesystem context
VFS: Add LSM hooks for filesystem context
VFS: Implement a filesystem superblock creation/configuration context
VFS: Remove unused code after filesystem context changes
VFS: Implement fsopen() to prepare for a mount
VFS: Implement fsmount() to effect a pre-configured mount
VFS: Add a sample program for fsopen/fsmount
procfs: Move proc_fill_super() to fs/proc/root.c
proc: Add fs_context support to procfs
ipc: Convert mqueue fs to fs_context
cpuset: Use fs_context
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
hugetlbfs: Convert to fs_context
VFS: Remove kern_mount_data()


Documentation/filesystems/mounting.txt | 433 +++++++++++++++++++++++++
arch/x86/entry/syscalls/syscall_32.tbl | 2
arch/x86/entry/syscalls/syscall_64.tbl | 2
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 97 +++---
fs/Makefile | 3
fs/fs_context.c | 526 ++++++++++++++++++++++++++++++
fs/fsopen.c | 273 ++++++++++++++++
fs/hugetlbfs/inode.c | 327 ++++++++++---------
fs/internal.h | 4
fs/kernfs/mount.c | 88 +++--
fs/libfs.c | 17 +
fs/namespace.c | 413 +++++++++++++++++-------
fs/proc/inode.c | 50 ---
fs/proc/internal.h | 6
fs/proc/root.c | 212 +++++++++---
fs/super.c | 347 ++++++++++++++++----
fs/sysfs/mount.c | 59 ++-
include/linux/cgroup.h | 3
include/linux/fs.h | 15 +
include/linux/fs_context.h | 105 ++++++
include/linux/kernfs.h | 37 +-
include/linux/lsm_hooks.h | 47 +++
include/linux/mount.h | 2
include/linux/security.h | 39 ++
include/linux/syscalls.h | 4
include/uapi/linux/magic.h | 1
ipc/mqueue.c | 90 ++++-
kernel/cgroup/cgroup-internal.h | 42 +-
kernel/cgroup/cgroup-v1.c | 293 ++++++++---------
kernel/cgroup/cgroup.c | 216 +++++++-----
kernel/cgroup/cpuset.c | 58 +++
kernel/sys_ni.c | 4
samples/fsmount/test-fsmount.c | 94 +++++
security/security.c | 35 ++
security/selinux/hooks.c | 194 ++++++++++-
security/smack/smack_lsm.c | 32 --
36 files changed, 3248 insertions(+), 922 deletions(-)
create mode 100644 Documentation/filesystems/mounting.txt
create mode 100644 fs/fs_context.c
create mode 100644 fs/fsopen.c
create mode 100644 include/linux/fs_context.h
create mode 100644 samples/fsmount/test-fsmount.c



2017-10-06 15:50:38

by David Howells

Subject: [PATCH 11/14] cpuset: Use fs_context [ver #6]

Make the cpuset filesystem use the filesystem context. This is potentially
tricky as the cpuset fs is almost an alias for the cgroup filesystem, but
with some special parameters.

This can, however, be handled by setting up an appropriate cgroup
filesystem and returning the root directory of that as the root dir of this
one.
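
The delegation described above can be pictured with a small user-space model:
the alias filesystem builds a temporary context for the real filesystem, asks
it for a tree, takes its own reference on the resulting root and then drops
the temporary context. All model_* names are hypothetical stand-ins, and the
sketch attaches the fixed option string to the inner context, which is an
assumption about the intent rather than a statement of what the patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these are kernel types. */
struct model_root {
	int refcount;
};

struct model_ctx {
	const char *fs_name;
	const char *options;		/* monolithic option string, if any */
	struct model_root *root;	/* set once the tree has been created */
};

static struct model_root cgroup_root_model;	/* stands in for the cgroup root */

/* Stand-in for the target filesystem's get_tree: hands out its root. */
static int model_cgroup_get_tree(struct model_ctx *cg)
{
	cgroup_root_model.refcount++;
	cg->root = &cgroup_root_model;
	return 0;
}

/* Release a context, dropping any root reference it holds. */
static void model_put_ctx(struct model_ctx *ctx)
{
	if (ctx->root) {
		assert(ctx->root->refcount > 0);
		ctx->root->refcount--;
		ctx->root = NULL;
	}
}

/* The alias filesystem's get_tree: borrow the root of another filesystem. */
static int model_alias_get_tree(struct model_ctx *alias)
{
	struct model_ctx inner = {
		.fs_name = "cgroup",
		.options = "cpuset,noprefix,release_agent=/sbin/cpuset_release_agent",
	};
	int ret;

	ret = model_cgroup_get_tree(&inner);
	if (ret == 0) {
		/* Take our own reference before the inner context goes away. */
		inner.root->refcount++;
		alias->root = inner.root;
	}

	model_put_ctx(&inner);
	return ret;
}
```

The point of the extra reference is that the root must outlive the temporary
context that produced it.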

Signed-off-by: David Howells <[email protected]>
cc: Tejun Heo <[email protected]>
---

kernel/cgroup/cpuset.c | 58 +++++++++++++++++++++++++++++++++++++-----------
1 file changed, 45 insertions(+), 13 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4657e2924ecb..78c61822a99e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -38,7 +38,7 @@
#include <linux/mm.h>
#include <linux/memory.h>
#include <linux/export.h>
-#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/namei.h>
#include <linux/pagemap.h>
#include <linux/proc_fs.h>
@@ -315,25 +315,57 @@ static inline bool is_in_v2_mode(void)
* users. If someone tries to mount the "cpuset" filesystem, we
* silently switch it to mount "cgroup" instead
*/
-static struct dentry *cpuset_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name, void *data)
+static int cpuset_get_tree(struct fs_context *fc)
{
- struct file_system_type *cgroup_fs = get_fs_type("cgroup");
- struct dentry *ret = ERR_PTR(-ENODEV);
+ struct file_system_type *cgroup_fs;
+ struct fs_context *cg_fc;
+ int ret = -ENODEV;
+
+ cgroup_fs = get_fs_type("cgroup");
if (cgroup_fs) {
- char mountopts[] =
- "cpuset,noprefix,"
- "release_agent=/sbin/cpuset_release_agent";
- ret = cgroup_fs->mount(cgroup_fs, flags,
- unused_dev_name, mountopts);
- put_filesystem(cgroup_fs);
+ ret = PTR_ERR(cgroup_fs);
+ goto out;
+ }
+
+ cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->purpose);
+ put_filesystem(cgroup_fs);
+ if (IS_ERR(cg_fc)) {
+ ret = PTR_ERR(cg_fc);
+ goto out;
}
+
+ ret = generic_parse_monolithic(
+ fc, "cpuset,noprefix,release_agent=/sbin/cpuset_release_agent");
+ if (ret < 0)
+ goto out_fc;
+
+ ret = vfs_get_tree(cg_fc);
+ if (ret < 0)
+ goto out_fc;
+
+ fc->root = dget(cg_fc->root);
+ ret = 0;
+
+out_fc:
+ put_fs_context(cg_fc);
+out:
return ret;
}

+static const struct fs_context_operations cpuset_fs_context_ops = {
+ .get_tree = cpuset_get_tree,
+};
+
+static int cpuset_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ fc->ops = &cpuset_fs_context_ops;
+ return 0;
+}
+
static struct file_system_type cpuset_fs_type = {
- .name = "cpuset",
- .mount = cpuset_mount,
+ .name = "cpuset",
+ .fs_context_size = sizeof(struct fs_context),
+ .init_fs_context = cpuset_init_fs_context,
};

/*


2017-10-06 15:50:30

by David Howells

Subject: [PATCH 10/14] ipc: Convert mqueue fs to fs_context [ver #6]

Convert the mqueue filesystem to use the filesystem context stuff.

Notes:

(1) The relevant ipc namespace is selected when the context is
initialised (and it defaults to the current task's ipc namespace).
The caller can override this before calling vfs_get_tree().

(2) Rather than simply calling kern_mount_data(), mq_init_ns() creates a
context, adjusts it and then does the rest of the mount procedure.
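
To make the two notes concrete, here is a user-space sketch of the pattern: a
filesystem-specific context embeds the generic one, the init hook pins a
default namespace, and a caller may swap in a different namespace before the
tree is created. The container_of() macro is a plain user-space equivalent of
the kernel one, and the model_* types and helpers are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* User-space equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for the kernel types involved. */
struct model_ipc_ns { int refcount; };
struct model_fs_context { void *s_fs_info; };

struct model_mqueue_ctx {
	struct model_fs_context fc;	/* generic part, embedded */
	struct model_ipc_ns *ipc_ns;	/* filesystem-specific part */
};

static struct model_ipc_ns *model_get_ns(struct model_ipc_ns *ns)
{
	ns->refcount++;
	return ns;
}

static void model_put_ns(struct model_ipc_ns *ns)
{
	assert(ns->refcount > 0);
	ns->refcount--;
}

/* Init hook: default to the caller's namespace and hold a reference. */
static void model_init_ctx(struct model_fs_context *fc,
			   struct model_ipc_ns *current_ns)
{
	struct model_mqueue_ctx *ctx =
		container_of(fc, struct model_mqueue_ctx, fc);

	ctx->ipc_ns = model_get_ns(current_ns);
}

/* Override the default before creating the tree, as mq_init_ns() does. */
static void model_set_ns(struct model_fs_context *fc, struct model_ipc_ns *ns)
{
	struct model_mqueue_ctx *ctx =
		container_of(fc, struct model_mqueue_ctx, fc);

	model_put_ns(ctx->ipc_ns);
	ctx->ipc_ns = model_get_ns(ns);
}

/* Free hook: drop whatever namespace the context still holds. */
static void model_free_ctx(struct model_fs_context *fc)
{
	struct model_mqueue_ctx *ctx =
		container_of(fc, struct model_mqueue_ctx, fc);

	if (ctx->ipc_ns) {
		model_put_ns(ctx->ipc_ns);
		ctx->ipc_ns = NULL;
	}
}
```

Because the context owns a reference from initialisation onwards, the override
has to release the default before taking the replacement, which is the
put/get pair visible in mq_init_ns() in the diff below.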

Signed-off-by: David Howells <[email protected]>
---

ipc/mqueue.c | 90 ++++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 68 insertions(+), 22 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 9649ecd8a73a..561460675734 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -18,6 +18,7 @@
#include <linux/pagemap.h>
#include <linux/file.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/namei.h>
#include <linux/sysctl.h>
#include <linux/poll.h>
@@ -42,6 +43,11 @@
#include <net/sock.h>
#include "util.h"

+struct mqueue_fs_context {
+ struct fs_context fc;
+ struct ipc_namespace *ipc_ns;
+};
+
#define MQUEUE_MAGIC 0x19800202
#define DIRENT_SIZE 20
#define FILENT_SIZE 80
@@ -90,6 +96,7 @@ struct mqueue_inode_info {
static const struct inode_operations mqueue_dir_inode_operations;
static const struct file_operations mqueue_file_operations;
static const struct super_operations mqueue_super_ops;
+static const struct fs_context_operations mqueue_fs_context_ops;
static void remove_notification(struct mqueue_inode_info *info);

static struct kmem_cache *mqueue_inode_cachep;
@@ -305,7 +312,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
return ERR_PTR(ret);
}

-static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
+static int mqueue_fill_super(struct super_block *sb, struct fs_context *fc)
{
struct inode *inode;
struct ipc_namespace *ns = sb->s_fs_info;
@@ -326,18 +333,29 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
return 0;
}

-static struct dentry *mqueue_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data)
+static int mqueue_get_tree(struct fs_context *fc)
{
- struct ipc_namespace *ns;
- if (flags & SB_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = current->nsproxy->ipc_ns;
- }
- return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
+ struct mqueue_fs_context *ctx = container_of(fc, struct mqueue_fs_context, fc);
+
+ ctx->fc.s_fs_info = ctx->ipc_ns;
+ return vfs_get_super(fc, vfs_get_keyed_super, mqueue_fill_super);
+}
+
+static void mqueue_fs_context_free(struct fs_context *fc)
+{
+ struct mqueue_fs_context *ctx = container_of(fc, struct mqueue_fs_context, fc);
+
+ if (ctx->ipc_ns)
+ put_ipc_ns(ctx->ipc_ns);
+}
+
+static int mqueue_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct mqueue_fs_context *ctx = container_of(fc, struct mqueue_fs_context, fc);
+
+ ctx->ipc_ns = get_ipc_ns(current->nsproxy->ipc_ns);
+ ctx->fc.ops = &mqueue_fs_context_ops;
+ return 0;
}

static void init_once(void *foo)
@@ -1574,15 +1592,26 @@ static const struct super_operations mqueue_super_ops = {
.statfs = simple_statfs,
};

+static const struct fs_context_operations mqueue_fs_context_ops = {
+ .free = mqueue_fs_context_free,
+ .get_tree = mqueue_get_tree,
+};
+
static struct file_system_type mqueue_fs_type = {
- .name = "mqueue",
- .mount = mqueue_mount,
- .kill_sb = kill_litter_super,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "mqueue",
+ .fs_context_size = sizeof(struct mqueue_fs_context),
+ .init_fs_context = mqueue_init_fs_context,
+ .kill_sb = kill_litter_super,
+ .fs_flags = FS_USERNS_MOUNT,
};

int mq_init_ns(struct ipc_namespace *ns)
{
+ struct mqueue_fs_context *ctx;
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
ns->mq_queues_count = 0;
ns->mq_queues_max = DFLT_QUEUESMAX;
ns->mq_msg_max = DFLT_MSGMAX;
@@ -1590,13 +1619,30 @@ int mq_init_ns(struct ipc_namespace *ns)
ns->mq_msg_default = DFLT_MSG;
ns->mq_msgsize_default = DFLT_MSGSIZE;

- ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
- if (IS_ERR(ns->mq_mnt)) {
- int err = PTR_ERR(ns->mq_mnt);
- ns->mq_mnt = NULL;
- return err;
+ fc = vfs_new_fs_context(&mqueue_fs_type, NULL, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ctx = container_of(fc, struct mqueue_fs_context, fc);
+ put_ipc_ns(ctx->ipc_ns);
+ ctx->ipc_ns = get_ipc_ns(ns);
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto out_fc;
+
+ mnt = vfs_create_mount(fc);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto out_fc;
}
- return 0;
+
+ ns->mq_mnt = mnt;
+ ret = 0;
+out_fc:
+ put_fs_context(fc);
+ return ret;
}

void mq_clear_sbinfo(struct ipc_namespace *ns)


2017-10-06 15:51:00

by David Howells

Subject: [PATCH 14/14] VFS: Remove kern_mount_data() [ver #6]

kern_mount_data() isn't used any more, so remove it.

Signed-off-by: David Howells <[email protected]>
---

fs/namespace.c | 6 ------
include/linux/fs.h | 1 -
2 files changed, 7 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 8676658b6b2c..091a63a63fa5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3181,12 +3181,6 @@ struct vfsmount *kern_mount(struct file_system_type *type)
}
EXPORT_SYMBOL_GPL(kern_mount);

-struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
-{
- return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data);
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
/*
* Mount a new, prepared superblock (specified by fs_fd) on the location
* specified by dfd and dir_name. dfd can be AT_FDCWD, a dir fd or a container
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f391263c62a1..b1433076b8e2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2168,7 +2168,6 @@ mount_pseudo(struct file_system_type *fs_type, char *name,
extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
extern struct vfsmount *kern_mount(struct file_system_type *);
-extern struct vfsmount *kern_mount_data(struct file_system_type *, void *);
extern void kern_unmount(struct vfsmount *mnt);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);


2017-10-06 15:50:53

by David Howells

Subject: [PATCH 13/14] hugetlbfs: Convert to fs_context [ver #6]

Convert hugetlbfs to use the fs_context during mount.
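
The main structural change is that option handling is split into a per-option
parser, which only records values in the context, and a separate validation
step that runs once all options are known. A user-space sketch of that split
follows; the model_* names, the two options and the plain decimal parsing are
simplifying assumptions, not the hugetlbfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical context holding the raw option values. */
struct model_ctx {
	long max_size;	/* -1 means "no limit" */
	long min_size;	/* -1 means "no minimum" */
};

static void model_init(struct model_ctx *ctx)
{
	ctx->max_size = -1;
	ctx->min_size = -1;
}

/* Parse one "key=value" option and record it; no cross-option checks here. */
static int model_parse_option(struct model_ctx *ctx, const char *opt)
{
	const char *eq = strchr(opt, '=');
	char *end;
	long val;

	if (!eq || eq[1] < '0' || eq[1] > '9')
		return -EINVAL;
	val = strtol(eq + 1, &end, 10);
	if (*end != '\0')
		return -EINVAL;

	if (strncmp(opt, "size=", 5) == 0)
		ctx->max_size = val;
	else if (strncmp(opt, "min_size=", 9) == 0)
		ctx->min_size = val;
	else
		return -EINVAL;
	return 0;
}

/* Check constraints that involve more than one option. */
static int model_validate(const struct model_ctx *ctx)
{
	if (ctx->max_size != -1 && ctx->min_size > ctx->max_size)
		return -EINVAL;
	return 0;
}

/* Split a comma-separated string in place and feed each piece to the parser. */
static int model_parse_monolithic(struct model_ctx *ctx, char *options)
{
	int ret;

	while (options && *options) {
		char *comma = strchr(options, ',');

		if (comma)
			*comma++ = '\0';
		if (*options) {
			ret = model_parse_option(ctx, options);
			if (ret < 0)
				return ret;
		}
		options = comma;
	}
	return 0;
}
```

Keeping the min/max comparison out of the per-option parser means the result
does not depend on the order in which the options arrive.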

Signed-off-by: David Howells <[email protected]>
---

fs/hugetlbfs/inode.c | 327 ++++++++++++++++++++++++++++----------------------
1 file changed, 184 insertions(+), 143 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 59073e9f01a4..56bb851f0641 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -45,11 +45,18 @@ const struct file_operations hugetlbfs_file_operations;
static const struct inode_operations hugetlbfs_dir_inode_operations;
static const struct inode_operations hugetlbfs_inode_operations;

-struct hugetlbfs_config {
+enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
+
+struct hugetlbfs_fs_context {
+ struct fs_context fc;
struct hstate *hstate;
+ unsigned long long max_size_opt;
+ unsigned long long min_size_opt;
long max_hpages;
long nr_inodes;
long min_hpages;
+ enum hugetlbfs_size_type max_val_type;
+ enum hugetlbfs_size_type min_val_type;
kuid_t uid;
kgid_t gid;
umode_t mode;
@@ -682,16 +689,16 @@ static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
}

static struct inode *hugetlbfs_get_root(struct super_block *sb,
- struct hugetlbfs_config *config)
+ struct hugetlbfs_fs_context *ctx)
{
struct inode *inode;

inode = new_inode(sb);
if (inode) {
inode->i_ino = get_next_ino();
- inode->i_mode = S_IFDIR | config->mode;
- inode->i_uid = config->uid;
- inode->i_gid = config->gid;
+ inode->i_mode = S_IFDIR | ctx->mode;
+ inode->i_uid = ctx->uid;
+ inode->i_gid = ctx->gid;
inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
inode->i_op = &hugetlbfs_dir_inode_operations;
inode->i_fop = &simple_dir_operations;
@@ -1049,8 +1056,6 @@ static const struct super_operations hugetlbfs_ops = {
.show_options = hugetlbfs_show_options,
};

-enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
-
/*
* Convert size option passed from command line to number of huge pages
* in the pool specified by hstate. Size option could be in bytes
@@ -1073,170 +1078,156 @@ hugetlbfs_size_to_hpages(struct hstate *h, unsigned long long size_opt,
return size_opt;
}

-static int
-hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig)
+/*
+ * Parse one mount option.
+ */
+static int hugetlbfs_parse_option(struct fs_context *fc, char *p)
{
- char *p, *rest;
+ struct hugetlbfs_fs_context *ctx = container_of(fc, struct hugetlbfs_fs_context, fc);
+ char *rest;
+ unsigned long ps;
substring_t args[MAX_OPT_ARGS];
- int option;
- unsigned long long max_size_opt = 0, min_size_opt = 0;
- enum hugetlbfs_size_type max_val_type = NO_SIZE, min_val_type = NO_SIZE;
-
- if (!options)
+ int token, option;
+
+ token = match_token(p, tokens, args);
+ switch (token) {
+ case Opt_uid:
+ if (match_int(&args[0], &option))
+ goto bad_val;
+ ctx->uid = make_kuid(current_user_ns(), option);
+ if (!uid_valid(ctx->uid))
+ goto bad_val;
return 0;

- while ((p = strsep(&options, ",")) != NULL) {
- int token;
- if (!*p)
- continue;
+ case Opt_gid:
+ if (match_int(&args[0], &option))
+ goto bad_val;
+ ctx->gid = make_kgid(current_user_ns(), option);
+ if (!gid_valid(ctx->gid))
+ goto bad_val;
+ return 0;

- token = match_token(p, tokens, args);
- switch (token) {
- case Opt_uid:
- if (match_int(&args[0], &option))
- goto bad_val;
- pconfig->uid = make_kuid(current_user_ns(), option);
- if (!uid_valid(pconfig->uid))
- goto bad_val;
- break;
+ case Opt_mode:
+ if (match_octal(&args[0], &option))
+ goto bad_val;
+ ctx->mode = option & 01777U;
+ return 0;

- case Opt_gid:
- if (match_int(&args[0], &option))
- goto bad_val;
- pconfig->gid = make_kgid(current_user_ns(), option);
- if (!gid_valid(pconfig->gid))
- goto bad_val;
- break;
+ case Opt_size:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(*args[0].from))
+ goto bad_val;
+ ctx->max_size_opt = memparse(args[0].from, &rest);
+ ctx->max_val_type = SIZE_STD;
+ if (*rest == '%')
+ ctx->max_val_type = SIZE_PERCENT;
+ return 0;

- case Opt_mode:
- if (match_octal(&args[0], &option))
- goto bad_val;
- pconfig->mode = option & 01777U;
- break;
+ case Opt_nr_inodes:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(*args[0].from))
+ goto bad_val;
+ ctx->nr_inodes = memparse(args[0].from, &rest);
+ return 0;

- case Opt_size: {
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- max_size_opt = memparse(args[0].from, &rest);
- max_val_type = SIZE_STD;
- if (*rest == '%')
- max_val_type = SIZE_PERCENT;
- break;
+ case Opt_pagesize:
+ ps = memparse(args[0].from, &rest);
+ ctx->hstate = size_to_hstate(ps);
+ if (!ctx->hstate) {
+ pr_err("Unsupported page size %lu MB\n", ps >> 20);
+ return -EINVAL;
}
+ return 0;

- case Opt_nr_inodes:
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- pconfig->nr_inodes = memparse(args[0].from, &rest);
- break;
+ case Opt_min_size:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(*args[0].from))
+ goto bad_val;
+ ctx->min_size_opt = memparse(args[0].from, &rest);
+ ctx->min_val_type = SIZE_STD;
+ if (*rest == '%')
+ ctx->min_val_type = SIZE_PERCENT;
+ return 0;

- case Opt_pagesize: {
- unsigned long ps;
- ps = memparse(args[0].from, &rest);
- pconfig->hstate = size_to_hstate(ps);
- if (!pconfig->hstate) {
- pr_err("Unsupported page size %lu MB\n",
- ps >> 20);
- return -EINVAL;
- }
- break;
- }
+ default:
+ pr_err("Bad mount option: \"%s\"\n", p);
+ return -EINVAL;
+ }

- case Opt_min_size: {
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- min_size_opt = memparse(args[0].from, &rest);
- min_val_type = SIZE_STD;
- if (*rest == '%')
- min_val_type = SIZE_PERCENT;
- break;
- }
+bad_val:
+ pr_err("Bad value '%s' for mount option '%s'\n", args[0].from, p);
+ return -EINVAL;
+}

- default:
- pr_err("Bad mount option: \"%s\"\n", p);
- return -EINVAL;
- break;
- }
- }
+/*
+ * Validate the parsed options.
+ */
+static int hugetlbfs_validate(struct fs_context *fc)
+{
+ struct hugetlbfs_fs_context *ctx = container_of(fc, struct hugetlbfs_fs_context, fc);

/*
* Use huge page pool size (in hstate) to convert the size
* options to number of huge pages. If NO_SIZE, -1 is returned.
*/
- pconfig->max_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
- max_size_opt, max_val_type);
- pconfig->min_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
- min_size_opt, min_val_type);
+ ctx->max_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+ ctx->max_size_opt,
+ ctx->max_val_type);
+ ctx->min_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+ ctx->min_size_opt,
+ ctx->min_val_type);

/*
* If max_size was specified, then min_size must be smaller
*/
- if (max_val_type > NO_SIZE &&
- pconfig->min_hpages > pconfig->max_hpages) {
- pr_err("minimum size can not be greater than maximum size\n");
+ if (ctx->max_val_type > NO_SIZE &&
+ ctx->min_hpages > ctx->max_hpages) {
+ pr_err("Minimum size can not be greater than maximum size\n");
return -EINVAL;
}

return 0;
-
-bad_val:
- pr_err("Bad value '%s' for mount option '%s'\n", args[0].from, p);
- return -EINVAL;
}

static int
-hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
+hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
- int ret;
- struct hugetlbfs_config config;
+ struct hugetlbfs_fs_context *ctx =
+ container_of(fc, struct hugetlbfs_fs_context, fc);
struct hugetlbfs_sb_info *sbinfo;

- config.max_hpages = -1; /* No limit on size by default */
- config.nr_inodes = -1; /* No limit on number of inodes by default */
- config.uid = current_fsuid();
- config.gid = current_fsgid();
- config.mode = 0755;
- config.hstate = &default_hstate;
- config.min_hpages = -1; /* No default minimum size */
- ret = hugetlbfs_parse_options(data, &config);
- if (ret)
- return ret;
-
sbinfo = kmalloc(sizeof(struct hugetlbfs_sb_info), GFP_KERNEL);
if (!sbinfo)
return -ENOMEM;
sb->s_fs_info = sbinfo;
- sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
- sbinfo->max_inodes = config.nr_inodes;
- sbinfo->free_inodes = config.nr_inodes;
- sbinfo->spool = NULL;
- sbinfo->uid = config.uid;
- sbinfo->gid = config.gid;
- sbinfo->mode = config.mode;
+ sbinfo->hstate = ctx->hstate;
+ sbinfo->max_inodes = ctx->nr_inodes;
+ sbinfo->free_inodes = ctx->nr_inodes;
+ sbinfo->spool = NULL;
+ sbinfo->uid = ctx->uid;
+ sbinfo->gid = ctx->gid;
+ sbinfo->mode = ctx->mode;

/*
* Allocate and initialize subpool if maximum or minimum size is
* specified. Any needed reservations (for minimim size) are taken
* taken when the subpool is created.
*/
- if (config.max_hpages != -1 || config.min_hpages != -1) {
- sbinfo->spool = hugepage_new_subpool(config.hstate,
- config.max_hpages,
- config.min_hpages);
+ if (ctx->max_hpages != -1 || ctx->min_hpages != -1) {
+ sbinfo->spool = hugepage_new_subpool(ctx->hstate,
+ ctx->max_hpages,
+ ctx->min_hpages);
if (!sbinfo->spool)
goto out_free;
}
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = huge_page_size(config.hstate);
- sb->s_blocksize_bits = huge_page_shift(config.hstate);
+ sb->s_blocksize = huge_page_size(ctx->hstate);
+ sb->s_blocksize_bits = huge_page_shift(ctx->hstate);
sb->s_magic = HUGETLBFS_MAGIC;
sb->s_op = &hugetlbfs_ops;
sb->s_time_gran = 1;
- sb->s_root = d_make_root(hugetlbfs_get_root(sb, &config));
+ sb->s_root = d_make_root(hugetlbfs_get_root(sb, ctx));
if (!sb->s_root)
goto out_free;
return 0;
@@ -1246,16 +1237,39 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
return -ENOMEM;
}

-static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+static int hugetlbfs_get_tree(struct fs_context *fc)
{
- return mount_nodev(fs_type, flags, data, hugetlbfs_fill_super);
+ return vfs_get_super(fc, vfs_get_independent_super, hugetlbfs_fill_super);
+}
+
+static const struct fs_context_operations hugetlbfs_fs_context_ops = {
+ .parse_option = hugetlbfs_parse_option,
+ .validate = hugetlbfs_validate,
+ .get_tree = hugetlbfs_get_tree,
+};
+
+static int hugetlbfs_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct hugetlbfs_fs_context *ctx = container_of(fc, struct hugetlbfs_fs_context, fc);
+
+ ctx->max_hpages = -1; /* No limit on size by default */
+ ctx->nr_inodes = -1; /* No limit on number of inodes by default */
+ ctx->uid = current_fsuid();
+ ctx->gid = current_fsgid();
+ ctx->mode = 0755;
+ ctx->hstate = &default_hstate;
+ ctx->min_hpages = -1; /* No default minimum size */
+ ctx->max_val_type = NO_SIZE;
+ ctx->min_val_type = NO_SIZE;
+ ctx->fc.ops = &hugetlbfs_fs_context_ops;
+ return 0;
}

static struct file_system_type hugetlbfs_fs_type = {
- .name = "hugetlbfs",
- .mount = hugetlbfs_mount,
- .kill_sb = kill_litter_super,
+ .name = "hugetlbfs",
+ .fs_context_size = sizeof(struct hugetlbfs_fs_context),
+ .init_fs_context = hugetlbfs_init_fs_context,
+ .kill_sb = kill_litter_super,
};

static struct vfsmount *hugetlbfs_vfsmount[HUGE_MAX_HSTATE];
@@ -1362,8 +1376,47 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
return file;
}

+static struct vfsmount *__init mount_one_hugetlbfs(struct hstate *h)
+{
+ struct hugetlbfs_fs_context *ctx;
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&hugetlbfs_fs_type, NULL, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc)) {
+ ret = PTR_ERR(fc);
+ goto err;
+ }
+
+ ctx = container_of(fc, struct hugetlbfs_fs_context, fc);
+ ctx->hstate = h;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto err_fc;
+ }
+
+ put_fs_context(fc);
+ return mnt;
+
+err_fc:
+ put_fs_context(fc);
+err:
+ pr_err("Cannot mount internal hugetlbfs for page size %uK",
+ 1U << (h->order + PAGE_SHIFT - 10));
+ return ERR_PTR(ret);
+}
+
static int __init init_hugetlbfs_fs(void)
{
+ struct vfsmount *mnt;
struct hstate *h;
int error;
int i;
@@ -1386,24 +1439,12 @@ static int __init init_hugetlbfs_fs(void)

i = 0;
for_each_hstate(h) {
- char buf[50];
- unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
-
- snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
- hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
- buf);
-
- if (IS_ERR(hugetlbfs_vfsmount[i])) {
- pr_err("Cannot mount internal hugetlbfs for "
- "page size %uK", ps_kb);
- error = PTR_ERR(hugetlbfs_vfsmount[i]);
- hugetlbfs_vfsmount[i] = NULL;
- }
+ mnt = mount_one_hugetlbfs(h);
+ if (IS_ERR(mnt) && i == 0)
+ goto out;
+ hugetlbfs_vfsmount[i] = mnt;
i++;
}
- /* Non default hstates are optional */
- if (!IS_ERR_OR_NULL(hugetlbfs_vfsmount[default_hstate_idx]))
- return 0;

out:
kmem_cache_destroy(hugetlbfs_inode_cachep);


2017-10-06 15:50:45

by David Howells

Subject: [PATCH 12/14] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #6]

Make kernfs support superblock creation/mount/remount with fs_context.

This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.

Notes:

(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).

(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired.

(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.

(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.

(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.

(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.

(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
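
The layering in notes (1) and (5) means a filesystem built on kernfs wraps the
kernfs context, which in turn wraps the generic one, and recovers its own
struct from the generic pointer in a single step. A user-space sketch of that
nesting follows, modelled on the rdt_fs_context in this patch; the model_*
types are hypothetical stand-ins and the option handling assumes only the
"cdp" option seen in the removed parser.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* User-space equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Three layers: generic context, kernfs wrapper, filesystem wrapper. */
struct model_fs_context {
	int sb_flags;
};

struct model_kernfs_fs_context {
	struct model_fs_context fc;
	unsigned long magic;
};

struct model_rdt_fs_context {
	struct model_kernfs_fs_context kfc;
	bool enable_cdp;
};

/* Recover the outermost wrapper via the nested member path kfc.fc. */
static struct model_rdt_fs_context *model_rdt_ctx(struct model_fs_context *fc)
{
	return container_of(fc, struct model_rdt_fs_context, kfc.fc);
}

/* Recover the middle layer from the same generic pointer. */
static struct model_kernfs_fs_context *
model_kernfs_ctx(struct model_fs_context *fc)
{
	return container_of(fc, struct model_kernfs_fs_context, fc);
}

/* Record an option in the context; acting on it is left to get_tree. */
static int model_rdt_parse_option(struct model_fs_context *fc, const char *p)
{
	struct model_rdt_fs_context *ctx = model_rdt_ctx(fc);

	assert(p != NULL);
	if (strcmp(p, "cdp") == 0) {
		ctx->enable_cdp = true;
		return 0;
	}
	return -EINVAL;
}
```

This only works because each wrapper embeds the layer below it as a member
rather than pointing to it, so the offsets are known at compile time.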

Weirdies:

(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.

(*) The cgroup refcount web. This really needs documenting.

(*) cgroup2 only has one root?

Signed-off-by: David Howells <[email protected]>
cc: Greg Kroah-Hartman <[email protected]>
cc: Tejun Heo <[email protected]>
cc: Li Zefan <[email protected]>
cc: Johannes Weiner <[email protected]>
cc: [email protected]
cc: [email protected]
---

arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 97 ++++++----
fs/kernfs/mount.c | 88 +++++----
fs/sysfs/mount.c | 59 ++++--
include/linux/cgroup.h | 3
include/linux/kernfs.h | 37 ++--
kernel/cgroup/cgroup-internal.h | 42 +++-
kernel/cgroup/cgroup-v1.c | 293 ++++++++++++++----------------
kernel/cgroup/cgroup.c | 216 +++++++++++++---------
8 files changed, 454 insertions(+), 381 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index a869d4a073c5..e9f409097a11 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -35,6 +35,11 @@
#include <asm/intel_rdt_sched.h>
#include "intel_rdt.h"

+struct rdt_fs_context {
+ struct kernfs_fs_context kfc;
+ bool enable_cdp;
+};
+
DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
@@ -988,22 +993,6 @@ static void cdp_disable(void)
}
}

-static int parse_rdtgroupfs_options(char *data)
-{
- char *token, *o = data;
- int ret = 0;
-
- while ((token = strsep(&o, ",")) != NULL) {
- if (!*token)
- return -EINVAL;
-
- if (!strcmp(token, "cdp"))
- ret = cdp_enable();
- }
-
- return ret;
-}
-
/*
* We don't allow rdtgroup directories to be created anywhere
* except the root directory. Thus when looking for the rdtgroup
@@ -1072,13 +1061,11 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
struct rdtgroup *prgrp,
struct kernfs_node **mon_data_kn);

-static struct dentry *rdt_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data)
+static int rdt_get_tree(struct fs_context *fc)
{
+ struct rdt_fs_context *ctx = container_of(fc, struct rdt_fs_context, kfc.fc);
struct rdt_domain *dom;
struct rdt_resource *r;
- struct dentry *dentry;
int ret;

mutex_lock(&rdtgroup_mutex);
@@ -1086,47 +1073,40 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
* resctrl file system can only be mounted once.
*/
if (static_branch_unlikely(&rdt_enable_key)) {
- dentry = ERR_PTR(-EBUSY);
+ ret = -EBUSY;
goto out;
}

- ret = parse_rdtgroupfs_options(data);
- if (ret) {
- dentry = ERR_PTR(ret);
- goto out_cdp;
+ if (ctx->enable_cdp) {
+ ret = cdp_enable();
+ if (ret < 0)
+ goto out_cdp;
}

closid_init();

ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_cdp;
- }

if (rdt_mon_capable) {
ret = mongroup_create_dir(rdtgroup_default.kn,
NULL, "mon_groups",
&kn_mongrp);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_info;
- }
kernfs_get(kn_mongrp);

ret = mkdir_mondata_all(rdtgroup_default.kn,
&rdtgroup_default, &kn_mondata);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_mongrp;
- }
kernfs_get(kn_mondata);
rdtgroup_default.mon.mon_data_kn = kn_mondata;
}

- dentry = kernfs_mount(fs_type, flags, rdt_root,
- RDTGROUP_SUPER_MAGIC, NULL);
- if (IS_ERR(dentry))
+ ret = kernfs_get_tree(&ctx->kfc);
+ if (ret < 0)
goto out_mondata;

if (rdt_alloc_capable)
@@ -1157,8 +1137,42 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
cdp_disable();
out:
mutex_unlock(&rdtgroup_mutex);
+ return ret;
+}
+
+static int rdt_parse_option(struct fs_context *fc, char *p)
+{
+ struct rdt_fs_context *ctx = container_of(fc, struct rdt_fs_context, kfc.fc);

- return dentry;
+ if (strcmp(p, "cdp") == 0) {
+ ctx->enable_cdp = true;
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+static void rdt_fs_context_free(struct fs_context *fc)
+{
+ struct rdt_fs_context *ctx = container_of(fc, struct rdt_fs_context, kfc.fc);
+
+ kernfs_free_fs_context(&ctx->kfc);
+}
+
+static const struct fs_context_operations rdt_fs_context_ops = {
+ .free = rdt_fs_context_free,
+ .parse_option = rdt_parse_option,
+ .get_tree = rdt_get_tree,
+};
+
+static int rdt_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct rdt_fs_context *ctx = container_of(fc, struct rdt_fs_context, kfc.fc);
+
+ ctx->kfc.root = rdt_root;
+ ctx->kfc.magic = RDTGROUP_SUPER_MAGIC;
+ ctx->kfc.fc.ops = &rdt_fs_context_ops;
+ return 0;
}

static int reset_all_ctrls(struct rdt_resource *r)
@@ -1323,9 +1337,10 @@ static void rdt_kill_sb(struct super_block *sb)
}

static struct file_system_type rdt_fs_type = {
- .name = "resctrl",
- .mount = rdt_mount,
- .kill_sb = rdt_kill_sb,
+ .name = "resctrl",
+ .fs_context_size = sizeof(struct rdt_fs_context),
+ .init_fs_context = rdt_init_fs_context,
+ .kill_sb = rdt_kill_sb,
};

static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 26dd9a50f383..fffa71137a13 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -22,13 +22,14 @@

struct kmem_cache *kernfs_node_cache;

-static int kernfs_sop_remount_fs(struct super_block *sb, int *flags, char *data)
+static int kernfs_sop_remount_fs(struct super_block *sb, struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
struct kernfs_root *root = kernfs_info(sb)->root;
struct kernfs_syscall_ops *scops = root->syscall_ops;

if (scops && scops->remount_fs)
- return scops->remount_fs(root, flags, data);
+ return scops->remount_fs(root, kfc);
return 0;
}

@@ -60,7 +61,7 @@ const struct super_operations kernfs_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = kernfs_evict_inode,

- .remount_fs = kernfs_sop_remount_fs,
+ .remount_fs_fc = kernfs_sop_remount_fs,
.show_options = kernfs_sop_show_options,
.show_path = kernfs_sop_show_path,
};
@@ -218,7 +219,7 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
} while (true);
}

-static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
+static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
{
struct kernfs_super_info *info = kernfs_info(sb);
struct inode *inode;
@@ -229,7 +230,7 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
- sb->s_magic = magic;
+ sb->s_magic = kfc->magic;
sb->s_op = &kernfs_sops;
sb->s_xattr = kernfs_xattr_handlers;
if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP)
@@ -256,20 +257,25 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
return 0;
}

-static int kernfs_test_super(struct super_block *sb, void *data)
+static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
struct kernfs_super_info *sb_info = kernfs_info(sb);
- struct kernfs_super_info *info = data;
+ struct kernfs_super_info *info = kfc->info;

return sb_info->root == info->root && sb_info->ns == info->ns;
}

-static int kernfs_set_super(struct super_block *sb, void *data)
+static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
int error;
- error = set_anon_super(sb, data);
- if (!error)
- sb->s_fs_info = data;
+
+ error = set_anon_super(sb, kfc->info);
+ if (!error) {
+ sb->s_fs_info = kfc->info;
+ kfc->info = NULL;
+ }
return error;
}

@@ -287,24 +293,15 @@ const void *kernfs_super_ns(struct super_block *sb)
}

/**
- * kernfs_mount_ns - kernfs mount helper
- * @fs_type: file_system_type of the fs being mounted
- * @flags: mount flags specified for the mount
- * @root: kernfs_root of the hierarchy being mounted
- * @magic: file system specific magic number
- * @new_sb_created: tell the caller if we allocated a new superblock
- * @ns: optional namespace tag of the mount
- *
- * This is to be called from each kernfs user's file_system_type->mount()
- * implementation, which should pass through the specified @fs_type and
- * @flags, and specify the hierarchy and namespace tag to mount via @root
- * and @ns, respectively.
+ * kernfs_get_tree - kernfs filesystem access/retrieval helper
+ * @kfc: The filesystem context.
*
- * The return value can be passed to the vfs layer verbatim.
+ * This is to be called from each kernfs user's fs_context->ops->get_tree()
+ * implementation, which should first set the hierarchy, namespace tag and
+ * magic number to use in @kfc->root, @kfc->ns_tag and @kfc->magic,
+ * respectively.
*/
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
+int kernfs_get_tree(struct kernfs_fs_context *kfc)
{
struct super_block *sb;
struct kernfs_super_info *info;
@@ -312,37 +309,42 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,

info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
- return ERR_PTR(-ENOMEM);
-
- info->root = root;
- info->ns = ns;
+ return -ENOMEM;

- sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags,
- &init_user_ns, info);
- if (IS_ERR(sb) || sb->s_fs_info != info)
- kfree(info);
+ info->root = kfc->root;
+ info->ns = kfc->ns_tag;
+
+ kfc->info = info;
+ sb = sget_fc(&kfc->fc, kernfs_test_super, kernfs_set_super);
+ if (kfc->info) {
+ kfree(kfc->info);
+ kfc->info = NULL;
+ } else {
+ kfc->ns_tag = NULL;
+ kfc->fc.degraded = true;
+ }
if (IS_ERR(sb))
- return ERR_CAST(sb);
-
- if (new_sb_created)
- *new_sb_created = !sb->s_root;
+ return PTR_ERR(sb);

if (!sb->s_root) {
struct kernfs_super_info *info = kernfs_info(sb);

- error = kernfs_fill_super(sb, magic);
+ kfc->new_sb_created = true;
+
+ error = kernfs_fill_super(sb, kfc);
if (error) {
deactivate_locked_super(sb);
- return ERR_PTR(error);
+ return error;
}
sb->s_flags |= SB_ACTIVE;

mutex_lock(&kernfs_mutex);
- list_add(&info->node, &root->supers);
+ list_add(&info->node, &info->root->supers);
mutex_unlock(&kernfs_mutex);
}

- return dget(sb->s_root);
+ kfc->fc.root = dget(sb->s_root);
+ return 0;
}

/**
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index fb49510c5dcf..cfe900d43663 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -23,27 +23,45 @@
static struct kernfs_root *sysfs_root;
struct kernfs_node *sysfs_root_kn;

-static struct dentry *sysfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+static int sysfs_get_tree(struct fs_context *fc)
{
- struct dentry *root;
- void *ns;
- bool new_sb;
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
+ int ret;

- if (!(flags & SB_KERNMOUNT)) {
+ ret = kernfs_get_tree(kfc);
+ if (ret == 0 && kfc->new_sb_created)
+ fc->root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
+ return ret;
+}
+
+static void sysfs_fs_context_free(struct fs_context *fc)
+{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
+
+ if (kfc->ns_tag)
+ kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
+ kernfs_free_fs_context(kfc);
+}
+
+static const struct fs_context_operations sysfs_fs_context_ops = {
+ .free = sysfs_fs_context_free,
+ .get_tree = sysfs_get_tree,
+};
+
+static int sysfs_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
+
+ if (!(fc->sb_flags & SB_KERNMOUNT)) {
if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
- return ERR_PTR(-EPERM);
+ return -EPERM;
}

- ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
- root = kernfs_mount_ns(fs_type, flags, sysfs_root,
- SYSFS_MAGIC, &new_sb, ns);
- if (IS_ERR(root) || !new_sb)
- kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
- else if (new_sb)
- root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
-
- return root;
+ kfc->ns_tag = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+ kfc->root = sysfs_root;
+ kfc->magic = SYSFS_MAGIC;
+ kfc->fc.ops = &sysfs_fs_context_ops;
+ return 0;
}

static void sysfs_kill_sb(struct super_block *sb)
@@ -55,10 +73,11 @@ static void sysfs_kill_sb(struct super_block *sb)
}

static struct file_system_type sysfs_fs_type = {
- .name = "sysfs",
- .mount = sysfs_mount,
- .kill_sb = sysfs_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "sysfs",
+ .fs_context_size = sizeof(struct kernfs_fs_context),
+ .init_fs_context = sysfs_init_fs_context,
+ .kill_sb = sysfs_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

int __init sysfs_init(void)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d023ac5e377f..cc932e7e292d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -762,10 +762,11 @@ copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,

#endif /* !CONFIG_CGROUPS */

-static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+static inline struct cgroup_namespace *get_cgroup_ns(struct cgroup_namespace *ns)
{
if (ns)
refcount_inc(&ns->count);
+ return ns;
}

static inline void put_cgroup_ns(struct cgroup_namespace *ns)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index ab25c8b6d9e3..b8bfa4fe0d48 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -16,6 +16,7 @@
#include <linux/rbtree.h>
#include <linux/atomic.h>
#include <linux/wait.h>
+#include <linux/fs_context.h>

struct file;
struct dentry;
@@ -25,6 +26,7 @@ struct vm_area_struct;
struct super_block;
struct file_system_type;

+struct kernfs_fs_context;
struct kernfs_open_node;
struct kernfs_iattrs;

@@ -166,7 +168,7 @@ struct kernfs_node {
* kernfs_node parameter.
*/
struct kernfs_syscall_ops {
- int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
+ int (*remount_fs)(struct kernfs_root *root, struct kernfs_fs_context *kfc);
int (*show_options)(struct seq_file *sf, struct kernfs_root *root);

int (*mkdir)(struct kernfs_node *parent, const char *name,
@@ -267,6 +269,20 @@ struct kernfs_ops {
#endif
};

+/*
+ * The kernfs superblock creation/mount parameter context.
+ */
+struct kernfs_fs_context {
+ struct fs_context fc;
+ struct kernfs_root *root; /* Root of the hierarchy being mounted */
+ void *ns_tag; /* Namespace tag of the mount (or NULL) */
+ unsigned long magic; /* File system specific magic number */
+
+ /* The following are set/used by kernfs_get_tree() */
+ struct kernfs_super_info *info; /* The new superblock info */
+ bool new_sb_created; /* Set to T if we allocated a new sb */
+};
+
#ifdef CONFIG_KERNFS

static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
@@ -350,9 +366,7 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
void kernfs_notify(struct kernfs_node *kn);

const void *kernfs_super_ns(struct super_block *sb);
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns);
+int kernfs_get_tree(struct kernfs_fs_context *fc);
void kernfs_kill_sb(struct super_block *sb);
struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);

@@ -454,11 +468,8 @@ static inline void kernfs_notify(struct kernfs_node *kn) { }
static inline const void *kernfs_super_ns(struct super_block *sb)
{ return NULL; }

-static inline struct dentry *
-kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
-{ return ERR_PTR(-ENOSYS); }
+static inline int kernfs_get_tree(struct kernfs_fs_context *fc)
+{ return -ENOSYS; }

static inline void kernfs_kill_sb(struct super_block *sb) { }

@@ -535,13 +546,9 @@ static inline int kernfs_rename(struct kernfs_node *kn,
return kernfs_rename_ns(kn, new_parent, new_name, NULL);
}

-static inline struct dentry *
-kernfs_mount(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created)
+static inline void kernfs_free_fs_context(struct kernfs_fs_context *kfc)
{
- return kernfs_mount_ns(fs_type, flags, root,
- magic, new_sb_created, NULL);
+ /* Note that we don't deal with kfc->ns_tag here. */
}

#endif /* __LINUX_KERNFS_H */
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 5151ff256c29..2ab58effc6f0 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -8,6 +8,26 @@
#include <linux/refcount.h>

/*
+ * The cgroup filesystem superblock creation/mount context.
+ */
+struct cgroup_fs_context {
+ struct kernfs_fs_context kfc;
+ struct cgroup_root *root;
+ struct cgroup_namespace *ns;
+ u8 version; /* cgroups version */
+ unsigned int flags; /* CGRP_ROOT_* flags */
+
+ /* cgroup1 bits */
+ bool cpuset_clone_children;
+ bool none; /* User explicitly requested empty subsystem */
+ bool all_ss; /* Seen 'all' option */
+ bool one_ss; /* Seen a subsystem name option */
+ u16 subsys_mask; /* Selected subsystems */
+ char *name; /* Hierarchy name */
+ char *release_agent; /* Path for release notifications */
+};
+
+/*
* A cgroup can be associated with multiple css_sets as different tasks may
* belong to different cgroups on different hierarchies. In the other
* direction, a css_set is naturally associated with multiple cgroups.
@@ -88,16 +108,6 @@ struct cgroup_mgctx {
#define DEFINE_CGROUP_MGCTX(name) \
struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)

-struct cgroup_sb_opts {
- u16 subsys_mask;
- unsigned int flags;
- char *release_agent;
- bool cpuset_clone_children;
- char *name;
- /* User explicitly requested empty subsystem */
- bool none;
-};
-
extern struct mutex cgroup_mutex;
extern spinlock_t css_set_lock;
extern struct cgroup_subsys *cgroup_subsys[];
@@ -168,12 +178,10 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
struct cgroup_namespace *ns);

void cgroup_free_root(struct cgroup_root *root);
-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts);
+void init_cgroup_root(struct cgroup_fs_context *ctx);
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
- struct cgroup_root *root, unsigned long magic,
- struct cgroup_namespace *ns);
+int cgroup_do_get_tree(struct cgroup_fs_context *ctx);

int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
@@ -215,8 +223,8 @@ bool cgroup1_ssid_disabled(int ssid);
void cgroup1_pidlist_destroy_all(struct cgroup *cgrp);
void cgroup1_release_agent(struct work_struct *work);
void cgroup1_check_for_release(struct cgroup *cgrp);
-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
- void *data, unsigned long magic,
- struct cgroup_namespace *ns);
+int cgroup1_parse_option(struct cgroup_fs_context *ctx, char *p);
+int cgroup1_validate(struct cgroup_fs_context *ctx);
+int cgroup1_get_tree(struct cgroup_fs_context *ctx);

#endif /* __CGROUP_INTERNAL_H */
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 024085daab1a..6163d19f30df 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -16,6 +16,8 @@

#include <trace/events/cgroup.h>

+#define cg_invalf(fmt, ...) ({ pr_err(fmt "\n", ## __VA_ARGS__); -EINVAL; })
+
/*
* pidlists linger the following amount before being destroyed. The goal
* is avoiding frequent destruction in the middle of consecutive read calls
@@ -911,168 +913,166 @@ static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root *kf_roo
return 0;
}

-static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
+int cgroup1_parse_option(struct cgroup_fs_context *ctx, char *token)
{
- char *token, *o = data;
- bool all_ss = false, one_ss = false;
- u16 mask = U16_MAX;
struct cgroup_subsys *ss;
- int nr_opts = 0;
int i;

-#ifdef CONFIG_CPUSETS
- mask = ~((u16)1 << cpuset_cgrp_id);
-#endif
-
- memset(opts, 0, sizeof(*opts));
-
- while ((token = strsep(&o, ",")) != NULL) {
- nr_opts++;
+ if (!strcmp(token, "none")) {
+ /* Explicitly have no subsystems */
+ ctx->none = true;
+ return 0;
+ }
+ if (!strcmp(token, "all")) {
+ /* Mutually exclusive option 'all' + subsystem name */
+ if (ctx->one_ss)
+ return cg_invalf("cgroup1: all conflicts with subsys name");
+ ctx->all_ss = true;
+ return 0;
+ }
+ if (!strcmp(token, "noprefix")) {
+ ctx->flags |= CGRP_ROOT_NOPREFIX;
+ return 0;
+ }
+ if (!strcmp(token, "clone_children")) {
+ ctx->cpuset_clone_children = true;
+ return 0;
+ }
+ if (!strcmp(token, "xattr")) {
+ ctx->flags |= CGRP_ROOT_XATTR;
+ return 0;
+ }
+ if (!strncmp(token, "release_agent=", 14)) {
+ /* Specifying two release agents is forbidden */
+ if (ctx->release_agent)
+ return cg_invalf("cgroup1: release_agent respecified");
+ ctx->release_agent =
+ kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
+ if (!ctx->release_agent)
+ return -ENOMEM;
+ return 0;
+ }

- if (!*token)
- return -EINVAL;
- if (!strcmp(token, "none")) {
- /* Explicitly have no subsystems */
- opts->none = true;
- continue;
- }
- if (!strcmp(token, "all")) {
- /* Mutually exclusive option 'all' + subsystem name */
- if (one_ss)
- return -EINVAL;
- all_ss = true;
- continue;
- }
- if (!strcmp(token, "noprefix")) {
- opts->flags |= CGRP_ROOT_NOPREFIX;
- continue;
+ if (!strncmp(token, "name=", 5)) {
+ const char *name = token + 5;
+ /* Can't specify an empty name */
+ if (!strlen(name))
+ return cg_invalf("cgroup1: Empty name");
+ /* Must match [\w.-]+ */
+ for (i = 0; i < strlen(name); i++) {
+ char c = name[i];
+ if (isalnum(c))
+ continue;
+ if ((c == '.') || (c == '-') || (c == '_'))
+ continue;
+ return cg_invalf("cgroup1: Invalid name");
}
- if (!strcmp(token, "clone_children")) {
- opts->cpuset_clone_children = true;
+ /* Specifying two names is forbidden */
+ if (ctx->name)
+ return cg_invalf("cgroup1: name respecified");
+ ctx->name = kstrndup(name,
+ MAX_CGROUP_ROOT_NAMELEN - 1,
+ GFP_KERNEL);
+ if (!ctx->name)
+ return -ENOMEM;
+
+ return 0;
+ }
+
+ for_each_subsys(ss, i) {
+ if (strcmp(token, ss->legacy_name))
continue;
- }
if (!strcmp(token, "cpuset_v2_mode")) {
- opts->flags |= CGRP_ROOT_CPUSET_V2_MODE;
+ ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE;
continue;
}
if (!strcmp(token, "xattr")) {
- opts->flags |= CGRP_ROOT_XATTR;
+ ctx->flags |= CGRP_ROOT_XATTR;
continue;
}
- if (!strncmp(token, "release_agent=", 14)) {
- /* Specifying two release agents is forbidden */
- if (opts->release_agent)
- return -EINVAL;
- opts->release_agent =
- kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
- if (!opts->release_agent)
- return -ENOMEM;
+ if (cgroup1_ssid_disabled(i))
continue;
- }
- if (!strncmp(token, "name=", 5)) {
- const char *name = token + 5;
- /* Can't specify an empty name */
- if (!strlen(name))
- return -EINVAL;
- /* Must match [\w.-]+ */
- for (i = 0; i < strlen(name); i++) {
- char c = name[i];
- if (isalnum(c))
- continue;
- if ((c == '.') || (c == '-') || (c == '_'))
- continue;
- return -EINVAL;
- }
- /* Specifying two names is forbidden */
- if (opts->name)
- return -EINVAL;
- opts->name = kstrndup(name,
- MAX_CGROUP_ROOT_NAMELEN - 1,
- GFP_KERNEL);
- if (!opts->name)
- return -ENOMEM;

- continue;
- }
+ /* Mutually exclusive option 'all' + subsystem name */
+ if (ctx->all_ss)
+ return cg_invalf("cgroup1: subsys name conflicts with all");
+ ctx->subsys_mask |= (1 << i);
+ ctx->one_ss = true;
+ return 0;
+ }

- for_each_subsys(ss, i) {
- if (strcmp(token, ss->legacy_name))
- continue;
- if (!cgroup_ssid_enabled(i))
- continue;
- if (cgroup1_ssid_disabled(i))
- continue;
+ if (i == CGROUP_SUBSYS_COUNT)
+ return -ENOENT;
+
+ return 0;
+}

- /* Mutually exclusive option 'all' + subsystem name */
- if (all_ss)
- return -EINVAL;
- opts->subsys_mask |= (1 << i);
- one_ss = true;
+/*
+ * Validate the options that have been parsed.
+ */
+int cgroup1_validate(struct cgroup_fs_context *ctx)
+{
+ struct cgroup_subsys *ss;
+ u16 mask = U16_MAX;
+ int i;

- break;
- }
- if (i == CGROUP_SUBSYS_COUNT)
- return -ENOENT;
- }
+#ifdef CONFIG_CPUSETS
+ mask = ~((u16)1 << cpuset_cgrp_id);
+#endif

/*
* If the 'all' option was specified select all the subsystems,
* otherwise if 'none', 'name=' and a subsystem name options were
* not specified, let's default to 'all'
*/
- if (all_ss || (!one_ss && !opts->none && !opts->name))
+ if (ctx->all_ss || (!ctx->one_ss && !ctx->none && !ctx->name))
for_each_subsys(ss, i)
if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i))
- opts->subsys_mask |= (1 << i);
+ ctx->subsys_mask |= (1 << i);

/*
* We either have to specify by name or by subsystems. (So all
* empty hierarchies must have a name).
*/
- if (!opts->subsys_mask && !opts->name)
- return -EINVAL;
+ if (!ctx->subsys_mask && !ctx->name)
+ return cg_invalf("cgroup1: Need name or subsystem set");

/*
* Option noprefix was introduced just for backward compatibility
* with the old cpuset, so we allow noprefix only if mounting just
* the cpuset subsystem.
*/
- if ((opts->flags & CGRP_ROOT_NOPREFIX) && (opts->subsys_mask & mask))
- return -EINVAL;
+ if ((ctx->flags & CGRP_ROOT_NOPREFIX) && (ctx->subsys_mask & mask))
+ return cg_invalf("cgroup1: noprefix used incorrectly");

/* Can't specify "none" and some subsystems */
- if (opts->subsys_mask && opts->none)
- return -EINVAL;
+ if (ctx->subsys_mask && ctx->none)
+ return cg_invalf("cgroup1: none used incorrectly");

return 0;
}

-static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
+static int cgroup1_remount(struct kernfs_root *kf_root, struct kernfs_fs_context *kfc)
{
- int ret = 0;
+ struct cgroup_fs_context *ctx = container_of(kfc, struct cgroup_fs_context, kfc);
struct cgroup_root *root = cgroup_root_from_kf(kf_root);
- struct cgroup_sb_opts opts;
u16 added_mask, removed_mask;
+ int ret = 0;

cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);

- /* See what subsystems are wanted */
- ret = parse_cgroupfs_options(data, &opts);
- if (ret)
- goto out_unlock;
-
- if (opts.subsys_mask != root->subsys_mask || opts.release_agent)
+ if (ctx->subsys_mask != root->subsys_mask || ctx->release_agent)
pr_warn("option changes via remount are deprecated (pid=%d comm=%s)\n",
task_tgid_nr(current), current->comm);

- added_mask = opts.subsys_mask & ~root->subsys_mask;
- removed_mask = root->subsys_mask & ~opts.subsys_mask;
+ added_mask = ctx->subsys_mask & ~root->subsys_mask;
+ removed_mask = root->subsys_mask & ~ctx->subsys_mask;

/* Don't allow flags or name to change at remount */
- if ((opts.flags ^ root->flags) ||
- (opts.name && strcmp(opts.name, root->name))) {
- pr_err("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"\n",
- opts.flags, opts.name ?: "", root->flags, root->name);
+ if ((ctx->flags ^ root->flags) ||
+ (ctx->name && strcmp(ctx->name, root->name))) {
+ cg_invalf("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"",
+ ctx->flags, ctx->name ?: "", root->flags, root->name);
ret = -EINVAL;
goto out_unlock;
}
@@ -1089,17 +1089,15 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)

WARN_ON(rebind_subsystems(&cgrp_dfl_root, removed_mask));

- if (opts.release_agent) {
+ if (ctx->release_agent) {
spin_lock(&release_agent_path_lock);
- strcpy(root->release_agent_path, opts.release_agent);
+ strcpy(root->release_agent_path, ctx->release_agent);
spin_unlock(&release_agent_path_lock);
}

trace_cgroup_remount(root);

out_unlock:
- kfree(opts.release_agent);
- kfree(opts.name);
mutex_unlock(&cgroup_mutex);
return ret;
}
@@ -1113,25 +1111,19 @@ struct kernfs_syscall_ops cgroup1_kf_syscall_ops = {
.show_path = cgroup_show_path,
};

-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
- void *data, unsigned long magic,
- struct cgroup_namespace *ns)
+/*
+ * Find or create a v1 cgroups superblock.
+ */
+int cgroup1_get_tree(struct cgroup_fs_context *ctx)
{
struct super_block *pinned_sb = NULL;
- struct cgroup_sb_opts opts;
struct cgroup_root *root;
struct cgroup_subsys *ss;
- struct dentry *dentry;
int i, ret;
bool new_root = false;

cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);

- /* First find the desired set of subsystems */
- ret = parse_cgroupfs_options(data, &opts);
- if (ret)
- goto out_unlock;
-
/*
* Destruction of cgroup root is asynchronous, so subsystems may
* still be dying after the previous unmount. Let's drain the
@@ -1140,15 +1132,13 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* starting. Testing ref liveliness is good enough.
*/
for_each_subsys(ss, i) {
- if (!(opts.subsys_mask & (1 << i)) ||
+ if (!(ctx->subsys_mask & (1 << i)) ||
ss->root == &cgrp_dfl_root)
continue;

if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
- msleep(10);
- ret = restart_syscall();
- goto out_free;
+ goto err_restart;
}
cgroup_put(&ss->root->cgrp);
}
@@ -1164,8 +1154,8 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* name matches but sybsys_mask doesn't, we should fail.
* Remember whether name matched.
*/
- if (opts.name) {
- if (strcmp(opts.name, root->name))
+ if (ctx->name) {
+ if (strcmp(ctx->name, root->name))
continue;
name_match = true;
}
@@ -1174,15 +1164,15 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* If we asked for subsystems (or explicitly for no
* subsystems) then they must match.
*/
- if ((opts.subsys_mask || opts.none) &&
- (opts.subsys_mask != root->subsys_mask)) {
+ if ((ctx->subsys_mask || ctx->none) &&
+ (ctx->subsys_mask != root->subsys_mask)) {
if (!name_match)
continue;
ret = -EBUSY;
- goto out_unlock;
+ goto err_unlock;
}

- if (root->flags ^ opts.flags)
+ if (root->flags ^ ctx->flags)
pr_warn("new mount options do not match the existing superblock, will be ignored\n");

/*
@@ -1203,9 +1193,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
mutex_unlock(&cgroup_mutex);
if (!IS_ERR_OR_NULL(pinned_sb))
deactivate_super(pinned_sb);
- msleep(10);
- ret = restart_syscall();
- goto out_free;
+ goto err_restart;
}

ret = 0;
@@ -1217,41 +1205,35 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* specification is allowed for already existing hierarchies but we
* can't create new one without subsys specification.
*/
- if (!opts.subsys_mask && !opts.none) {
- ret = -EINVAL;
- goto out_unlock;
+ if (!ctx->subsys_mask && !ctx->none) {
+ ret = cg_invalf("cgroup1: No subsys list or none specified");
+ goto err_unlock;
}

/* Hierarchies may only be created in the initial cgroup namespace. */
- if (ns != &init_cgroup_ns) {
+ if (ctx->ns != &init_cgroup_ns) {
ret = -EPERM;
- goto out_unlock;
+ goto err_unlock;
}

root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
- goto out_unlock;
+ goto err_unlock;
}
new_root = true;
+ ctx->root = root;

- init_cgroup_root(root, &opts);
+ init_cgroup_root(ctx);

- ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD);
+ ret = cgroup_setup_root(root, ctx->subsys_mask, PERCPU_REF_INIT_DEAD);
if (ret)
cgroup_free_root(root);

out_unlock:
mutex_unlock(&cgroup_mutex);
-out_free:
- kfree(opts.release_agent);
- kfree(opts.name);
-
- if (ret)
- return ERR_PTR(ret);

- dentry = cgroup_do_mount(&cgroup_fs_type, flags, root,
- CGROUP_SUPER_MAGIC, ns);
+ ret = cgroup_do_get_tree(ctx);

/*
* There's a race window after we release cgroup_mutex and before
@@ -1272,7 +1254,14 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
if (pinned_sb)
deactivate_super(pinned_sb);

- return dentry;
+ return ret;
+
+err_restart:
+ msleep(10);
+ return restart_syscall();
+err_unlock:
+ mutex_unlock(&cgroup_mutex);
+ return ret;
}

static int __init cgroup1_wq_init(void)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 44857278eb8a..e3425ca9df3b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1686,25 +1686,21 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
return len;
}

-static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
+static int cgroup2_parse_option(struct cgroup_fs_context *ctx, char *token)
{
- char *token;
-
- *root_flags = 0;
-
- if (!data)
+ if (!strcmp(token, "nsdelegate")) {
+ ctx->flags |= CGRP_ROOT_NS_DELEGATE;
return 0;
-
- while ((token = strsep(&data, ",")) != NULL) {
- if (!strcmp(token, "nsdelegate")) {
- *root_flags |= CGRP_ROOT_NS_DELEGATE;
- continue;
- }
-
- pr_err("cgroup2: unknown option \"%s\"\n", token);
- return -EINVAL;
}

+ return -EINVAL;
+}
+
+static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
+{
+ if (current->nsproxy->cgroup_ns == &init_cgroup_ns &&
+ cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
+ seq_puts(seq, ",nsdelegate");
return 0;
}

@@ -1718,23 +1714,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
}
}

-static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
-{
- if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
- seq_puts(seq, ",nsdelegate");
- return 0;
-}
-
-static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data)
+static int cgroup_remount(struct kernfs_root *kf_root, struct kernfs_fs_context *kfc)
{
- unsigned int root_flags;
- int ret;
+ struct cgroup_fs_context *ctx = container_of(kfc, struct cgroup_fs_context, kfc);

- ret = parse_cgroup_root_flags(data, &root_flags);
- if (ret)
- return ret;
-
- apply_cgroup_root_flags(root_flags);
+ apply_cgroup_root_flags(ctx->flags);
return 0;
}

@@ -1820,8 +1804,9 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
}

-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
+void init_cgroup_root(struct cgroup_fs_context *ctx)
{
+ struct cgroup_root *root = ctx->root;
struct cgroup *cgrp = &root->cgrp;

INIT_LIST_HEAD(&root->root_list);
@@ -1830,12 +1815,12 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
init_cgroup_housekeeping(cgrp);
idr_init(&root->cgroup_idr);

- root->flags = opts->flags;
- if (opts->release_agent)
- strcpy(root->release_agent_path, opts->release_agent);
- if (opts->name)
- strcpy(root->name, opts->name);
- if (opts->cpuset_clone_children)
+ root->flags = ctx->flags;
+ if (ctx->release_agent)
+ strcpy(root->release_agent_path, ctx->release_agent);
+ if (ctx->name)
+ strcpy(root->name, ctx->name);
+ if (ctx->cpuset_clone_children)
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}

@@ -1937,57 +1922,50 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
return ret;
}

-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
- struct cgroup_root *root, unsigned long magic,
- struct cgroup_namespace *ns)
+int cgroup_do_get_tree(struct cgroup_fs_context *ctx)
{
- struct dentry *dentry;
- bool new_sb;
+ int ret;

- dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
+ ctx->kfc.root = ctx->root->kf_root;
+
+ ret = kernfs_get_tree(&ctx->kfc);
+ if (ret < 0)
+ goto out_cgrp;

/*
* In non-init cgroup namespace, instead of root cgroup's dentry,
* we return the dentry corresponding to the cgroupns->root_cgrp.
*/
- if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+ if (ctx->ns != &init_cgroup_ns) {
struct dentry *nsdentry;
struct cgroup *cgrp;

mutex_lock(&cgroup_mutex);
spin_lock_irq(&css_set_lock);

- cgrp = cset_cgroup_from_root(ns->root_cset, root);
+ cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);

spin_unlock_irq(&css_set_lock);
mutex_unlock(&cgroup_mutex);

- nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
- dput(dentry);
- dentry = nsdentry;
+ nsdentry = kernfs_node_dentry(cgrp->kn, ctx->kfc.fc.root->d_sb);
+ dput(ctx->kfc.fc.root);
+ ctx->kfc.fc.root = nsdentry;
}

- if (IS_ERR(dentry) || !new_sb)
- cgroup_put(&root->cgrp);
+ ret = 0;
+ if (ctx->kfc.new_sb_created)
+ goto out_cgrp;
+ apply_cgroup_root_flags(ctx->flags);
+ return 0;

- return dentry;
+out_cgrp:
+ return ret;
}

-static struct dentry *cgroup_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data)
+static int cgroup_get_tree(struct fs_context *fc)
{
- struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
- struct dentry *dentry;
- int ret;
-
- get_cgroup_ns(ns);
-
- /* Check if the caller has permission to mount. */
- if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
- put_cgroup_ns(ns);
- return ERR_PTR(-EPERM);
- }
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);

/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -1996,29 +1974,80 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
if (!use_task_css_set_links)
cgroup_enable_task_cg_lists();

- if (fs_type == &cgroup2_fs_type) {
- unsigned int root_flags;
-
- ret = parse_cgroup_root_flags(data, &root_flags);
- if (ret) {
- put_cgroup_ns(ns);
- return ERR_PTR(ret);
- }
+ switch (ctx->version) {
+ case 1:
+ return cgroup1_get_tree(ctx);

+ case 2:
cgrp_dfl_visible = true;
cgroup_get_live(&cgrp_dfl_root.cgrp);

- dentry = cgroup_do_mount(&cgroup2_fs_type, flags, &cgrp_dfl_root,
- CGROUP2_SUPER_MAGIC, ns);
- if (!IS_ERR(dentry))
- apply_cgroup_root_flags(root_flags);
- } else {
- dentry = cgroup1_mount(&cgroup_fs_type, flags, data,
- CGROUP_SUPER_MAGIC, ns);
+ ctx->root = &cgrp_dfl_root;
+ return cgroup_do_get_tree(ctx);
+
+ default:
+ BUG();
}
+}
+
+static int cgroup_parse_option(struct fs_context *fc, char *p)
+{
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);
+
+ if (ctx->version == 1)
+ return cgroup1_parse_option(ctx, p);
+
+ return cgroup2_parse_option(ctx, p);
+}
+
+static int cgroup_validate(struct fs_context *fc)
+{
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);
+
+ if (ctx->version == 1)
+ return cgroup1_validate(ctx);
+ return 0;
+}

- put_cgroup_ns(ns);
- return dentry;
+/*
+ * Destroy a cgroup filesystem context.
+ */
+static void cgroup_fs_context_free(struct fs_context *fc)
+{
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);
+
+ kfree(ctx->name);
+ kfree(ctx->release_agent);
+ cgroup_put(&ctx->root->cgrp);
+ put_cgroup_ns(ctx->ns);
+ kernfs_free_fs_context(&ctx->kfc);
+}
+
+static const struct fs_context_operations cgroup_fs_context_ops = {
+ .free = cgroup_fs_context_free,
+ .parse_option = cgroup_parse_option,
+ .validate = cgroup_validate,
+ .get_tree = cgroup_get_tree,
+};
+
+/*
+ * Initialise the cgroup filesystem creation/reconfiguration context. Notably,
+ * we select the namespace we're going to use.
+ */
+static int cgroup_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);
+ struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ctx->ns = get_cgroup_ns(ns);
+ ctx->version = (fc->fs_type == &cgroup2_fs_type) ? 2 : 1;
+ ctx->kfc.magic = (ctx->version == 2) ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC;
+ ctx->kfc.fc.ops = &cgroup_fs_context_ops;
+ return 0;
}

static void cgroup_kill_sb(struct super_block *sb)
@@ -2043,17 +2072,19 @@ static void cgroup_kill_sb(struct super_block *sb)
}

struct file_system_type cgroup_fs_type = {
- .name = "cgroup",
- .mount = cgroup_mount,
- .kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "cgroup",
+ .fs_context_size = sizeof(struct cgroup_fs_context),
+ .init_fs_context = cgroup_init_fs_context,
+ .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

static struct file_system_type cgroup2_fs_type = {
- .name = "cgroup2",
- .mount = cgroup_mount,
- .kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "cgroup2",
+ .fs_context_size = sizeof(struct cgroup_fs_context),
+ .init_fs_context = cgroup_init_fs_context,
+ .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
@@ -5110,11 +5141,12 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
*/
int __init cgroup_init_early(void)
{
- static struct cgroup_sb_opts __initdata opts;
+ static struct cgroup_fs_context __initdata ctx;
struct cgroup_subsys *ss;
int i;

- init_cgroup_root(&cgrp_dfl_root, &opts);
+ ctx.root = &cgrp_dfl_root;
+ init_cgroup_root(&ctx);
cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;

RCU_INIT_POINTER(init_task.cgroups, &init_css_set);


2017-10-06 15:50:23

by David Howells

Subject: [PATCH 09/14] proc: Add fs_context support to procfs [ver #6]

Add fs_context support to procfs.
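
As a rough userspace illustration of the pattern this conversion adopts: each
option is parsed into a context on its own, a bit is recorded for every option
actually seen, and only those options are applied to the live state later (at
superblock setup or on remount). This is a sketch under stated assumptions,
not the kernel code. All sketch_* names are hypothetical, strtol() stands in
for the kernel's match_int(), and struct sketch_state stands in for the
pid_namespace fields.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical option indices, one mask bit per option. */
enum sketch_opt { SKETCH_OPT_GID, SKETCH_OPT_HIDEPID };

/* Stands in for struct proc_fs_context: parsed values plus a "seen" mask. */
struct sketch_ctx {
	unsigned long mask;
	int gid;
	int hidepid;
};

/* Stands in for the live settings that the options eventually modify. */
struct sketch_state {
	int gid;
	int hide_pid;
};

/* Parse a whole decimal string into an int; returns 0 or -EINVAL. */
int sketch_parse_int(const char *s, int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 10);
	if (errno || end == s || *end != '\0' || v < INT_MIN || v > INT_MAX)
		return -EINVAL;
	*out = (int)v;
	return 0;
}

/* Parse one "key=value" option into the context; nothing is applied yet. */
int sketch_parse_option(struct sketch_ctx *ctx, const char *p)
{
	int token, val;

	assert(ctx != NULL && p != NULL);

	if (strncmp(p, "gid=", 4) == 0) {
		if (sketch_parse_int(p + 4, &val))
			return -EINVAL;
		ctx->gid = val;
		token = SKETCH_OPT_GID;
	} else if (strncmp(p, "hidepid=", 8) == 0) {
		/* The patch accepts values between 0 and 2. */
		if (sketch_parse_int(p + 8, &val) || val < 0 || val > 2)
			return -EINVAL;
		ctx->hidepid = val;
		token = SKETCH_OPT_HIDEPID;
	} else {
		return -EINVAL;
	}

	ctx->mask |= 1UL << token;	/* remember that this option was given */
	return 0;
}

/* Apply only the options that were seen; everything else is left alone. */
void sketch_apply_options(const struct sketch_ctx *ctx, struct sketch_state *st)
{
	if (ctx->mask & (1UL << SKETCH_OPT_GID))
		st->gid = ctx->gid;
	if (ctx->mask & (1UL << SKETCH_OPT_HIDEPID))
		st->hide_pid = ctx->hidepid;
}
```

A caller would feed each comma-separated option to sketch_parse_option() and
call sketch_apply_options() once at the end, so an option that was never
supplied cannot clobber an existing setting.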

Signed-off-by: David Howells <[email protected]>
---

fs/proc/inode.c | 2 -
fs/proc/internal.h | 2 -
fs/proc/root.c | 176 ++++++++++++++++++++++++++++++++++------------------
3 files changed, 118 insertions(+), 62 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index a4bf66af0ba9..a642bee67a53 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -118,7 +118,7 @@ const struct super_operations proc_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = proc_evict_inode,
.statfs = simple_statfs,
- .remount_fs = proc_remount,
+ .remount_fs_fc = proc_remount,
.show_options = proc_show_options,
};

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9cc6c2516803..aa53174758fc 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -262,7 +262,7 @@ static inline void proc_tty_init(void) {}
extern struct proc_dir_entry proc_root;

extern void proc_self_init(void);
-extern int proc_remount(struct super_block *, int *, char *);
+extern int proc_remount(struct super_block *, struct fs_context *);

/*
* task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 0ef44f31d045..0830e296443c 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -18,14 +18,24 @@
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/user_namespace.h>
+#include <linux/fs_context.h>
#include <linux/mount.h>
#include <linux/pid_namespace.h>
#include <linux/parser.h>
#include <linux/cred.h>
#include <linux/magic.h>
+#include <linux/slab.h>

#include "internal.h"

+struct proc_fs_context {
+ struct fs_context fc;
+ struct pid_namespace *pid_ns;
+ unsigned long mask;
+ int hidepid;
+ int gid;
+};
+
enum {
Opt_gid, Opt_hidepid, Opt_err,
};
@@ -36,56 +46,60 @@ static const match_table_t tokens = {
{Opt_err, NULL},
};

-static int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_option(struct fs_context *fc, char *p)
{
- char *p;
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
substring_t args[MAX_OPT_ARGS];
- int option;
-
- if (!options)
- return 1;
-
- while ((p = strsep(&options, ",")) != NULL) {
- int token;
- if (!*p)
- continue;
-
- args[0].to = args[0].from = NULL;
- token = match_token(p, tokens, args);
- switch (token) {
- case Opt_gid:
- if (match_int(&args[0], &option))
- return 0;
- pid->pid_gid = make_kgid(current_user_ns(), option);
- break;
- case Opt_hidepid:
- if (match_int(&args[0], &option))
- return 0;
- if (option < HIDEPID_OFF ||
- option > HIDEPID_INVISIBLE) {
- pr_err("proc: hidepid value must be between 0 and 2.\n");
- return 0;
- }
- pid->hide_pid = option;
- break;
- default:
- pr_err("proc: unrecognized mount option \"%s\" "
- "or missing value\n", p);
- return 0;
+ int token;
+
+ args[0].to = args[0].from = NULL;
+ token = match_token(p, tokens, args);
+ switch (token) {
+ case Opt_gid:
+ if (match_int(&args[0], &ctx->gid))
+ return -EINVAL;
+ break;
+
+ case Opt_hidepid:
+ if (match_int(&args[0], &ctx->hidepid))
+ return -EINVAL;
+ if (ctx->hidepid < HIDEPID_OFF ||
+ ctx->hidepid > HIDEPID_INVISIBLE) {
+ pr_err("proc: hidepid value must be between 0 and 2.\n");
+ return -EINVAL;
}
+ break;
+
+ default:
+ pr_err("proc: unrecognized mount option \"%s\" "
+ "or missing value\n", p);
+ return -EINVAL;
}

- return 1;
+ ctx->mask |= 1 << token;
+ return 0;
+}
+
+static void proc_set_options(struct super_block *s,
+ struct fs_context *fc,
+ struct pid_namespace *pid_ns,
+ struct user_namespace *user_ns)
+{
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
+
+ if (ctx->mask & (1 << Opt_gid))
+ pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
+ if (ctx->mask & (1 << Opt_hidepid))
+ pid_ns->hide_pid = ctx->hidepid;
}

-static int proc_fill_super(struct super_block *s, void *data, int silent)
+static int proc_fill_super(struct super_block *s, struct fs_context *fc)
{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
struct inode *root_inode;
int ret;

- if (!proc_parse_options(data, ns))
- return -EINVAL;
+ proc_set_options(s, fc, pid_ns, current_user_ns());

/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -102,7 +116,7 @@ static int proc_fill_super(struct super_block *s, void *data, int silent)
* top of it
*/
s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
+
pde_get(&proc_root);
root_inode = proc_get_inode(s, &proc_root);
if (!root_inode) {
@@ -123,27 +137,46 @@ static int proc_fill_super(struct super_block *s, void *data, int silent)
return proc_setup_thread_self(s);
}

-int proc_remount(struct super_block *sb, int *flags, char *data)
+int proc_remount(struct super_block *sb, struct fs_context *fc)
{
struct pid_namespace *pid = sb->s_fs_info;

sync_filesystem(sb);
- return !proc_parse_options(data, pid);
+
+ if (fc)
+ proc_set_options(sb, fc, pid, current_user_ns());
+ return 0;
}

-static struct dentry *proc_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+static int proc_get_tree(struct fs_context *fc)
{
- struct pid_namespace *ns;
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);

- if (flags & SB_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = task_active_pid_ns(current);
- }
+ ctx->fc.s_fs_info = ctx->pid_ns;
+ return vfs_get_super(fc, vfs_get_keyed_super, proc_fill_super);
+}

- return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+static void proc_fs_context_free(struct fs_context *fc)
+{
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
+
+ if (ctx->pid_ns)
+ put_pid_ns(ctx->pid_ns);
+}
+
+static const struct fs_context_operations proc_fs_context_ops = {
+ .free = proc_fs_context_free,
+ .parse_option = proc_parse_option,
+ .get_tree = proc_get_tree,
+};
+
+static int proc_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
+
+ ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ ctx->fc.ops = &proc_fs_context_ops;
+ return 0;
}

static void proc_kill_sb(struct super_block *sb)
@@ -161,7 +194,8 @@ static void proc_kill_sb(struct super_block *sb)

static struct file_system_type proc_fs_type = {
.name = "proc",
- .mount = proc_mount,
+ .fs_context_size = sizeof(struct proc_fs_context),
+ .init_fs_context = proc_init_fs_context,
.kill_sb = proc_kill_sb,
.fs_flags = FS_USERNS_MOUNT,
};
@@ -209,7 +243,7 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr
{
if (!proc_pid_lookup(dir, dentry, flags))
return NULL;
-
+
return proc_lookup(dir, dentry, flags);
}

@@ -248,12 +282,12 @@ static const struct inode_operations proc_root_inode_operations = {
* This is the root "inode" in the /proc tree..
*/
struct proc_dir_entry proc_root = {
- .low_ino = PROC_ROOT_INO,
- .namelen = 5,
- .mode = S_IFDIR | S_IRUGO | S_IXUGO,
- .nlink = 2,
+ .low_ino = PROC_ROOT_INO,
+ .namelen = 5,
+ .mode = S_IFDIR | S_IRUGO | S_IXUGO,
+ .nlink = 2,
.count = ATOMIC_INIT(1),
- .proc_iops = &proc_root_inode_operations,
+ .proc_iops = &proc_root_inode_operations,
.proc_fops = &proc_root_operations,
.parent = &proc_root,
.subdir = RB_ROOT_CACHED,
@@ -262,9 +296,31 @@ struct proc_dir_entry proc_root = {

int pid_ns_prepare_proc(struct pid_namespace *ns)
{
+ struct proc_fs_context *ctx;
+ struct fs_context *fc;
struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ctx = container_of(fc, struct proc_fs_context, fc);
+ if (ctx->pid_ns != ns) {
+ put_pid_ns(ctx->pid_ns);
+ get_pid_ns(ns);
+ ctx->pid_ns = ns;
+ }
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0) {
+ put_fs_context(fc);
+ return ret;
+ }

- mnt = kern_mount_data(&proc_fs_type, ns);
+ mnt = vfs_create_mount(fc);
+ put_fs_context(fc);
if (IS_ERR(mnt))
return PTR_ERR(mnt);



2017-10-06 15:50:16

by David Howells

Subject: [PATCH 08/14] procfs: Move proc_fill_super() to fs/proc/root.c [ver #6]

Move proc_fill_super() to fs/proc/root.c as that's where the other
superblock stuff is.

Signed-off-by: David Howells <[email protected]>
---

fs/proc/inode.c | 48 +-----------------------------------------------
fs/proc/internal.h | 4 +---
fs/proc/root.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 49 insertions(+), 51 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 0163d71d5887..a4bf66af0ba9 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -22,7 +22,6 @@
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/mount.h>
-#include <linux/magic.h>

#include <linux/uaccess.h>

@@ -113,7 +112,7 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
return 0;
}

-static const struct super_operations proc_sops = {
+const struct super_operations proc_sops = {
.alloc_inode = proc_alloc_inode,
.destroy_inode = proc_destroy_inode,
.drop_inode = generic_delete_inode,
@@ -470,48 +469,3 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
pde_put(de);
return inode;
}
-
-int proc_fill_super(struct super_block *s, void *data, int silent)
-{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
- struct inode *root_inode;
- int ret;
-
- if (!proc_parse_options(data, ns))
- return -EINVAL;
-
- /* User space would break if executables or devices appear on proc */
- s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
- s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
- s->s_blocksize = 1024;
- s->s_blocksize_bits = 10;
- s->s_magic = PROC_SUPER_MAGIC;
- s->s_op = &proc_sops;
- s->s_time_gran = 1;
-
- /*
- * procfs isn't actually a stacking filesystem; however, there is
- * too much magic going on inside it to permit stacking things on
- * top of it
- */
- s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
- pde_get(&proc_root);
- root_inode = proc_get_inode(s, &proc_root);
- if (!root_inode) {
- pr_err("proc_fill_super: get root inode failed\n");
- return -ENOMEM;
- }
-
- s->s_root = d_make_root(root_inode);
- if (!s->s_root) {
- pr_err("proc_fill_super: allocate dentry failed\n");
- return -ENOMEM;
- }
-
- ret = proc_setup_self(s);
- if (ret) {
- return ret;
- }
- return proc_setup_thread_self(s);
-}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index a34195e92b20..9cc6c2516803 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -197,13 +197,12 @@ struct pde_opener {
struct completion *c;
};
extern const struct inode_operations proc_link_inode_operations;
-
extern const struct inode_operations proc_pid_link_inode_operations;
+extern const struct super_operations proc_sops;

extern void proc_init_inodecache(void);
void set_proc_pid_nlink(void);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
-extern int proc_fill_super(struct super_block *, void *data, int flags);
extern void proc_entry_rundown(struct proc_dir_entry *);

/*
@@ -261,7 +260,6 @@ static inline void proc_tty_init(void) {}
* root.c
*/
extern struct proc_dir_entry proc_root;
-extern int proc_parse_options(char *options, struct pid_namespace *pid);

extern void proc_self_init(void);
extern int proc_remount(struct super_block *, int *, char *);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index d2a1bb608820..0ef44f31d045 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -22,6 +22,7 @@
#include <linux/pid_namespace.h>
#include <linux/parser.h>
#include <linux/cred.h>
+#include <linux/magic.h>

#include "internal.h"

@@ -35,7 +36,7 @@ static const match_table_t tokens = {
{Opt_err, NULL},
};

-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_options(char *options, struct pid_namespace *pid)
{
char *p;
substring_t args[MAX_OPT_ARGS];
@@ -77,6 +78,51 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
return 1;
}

+static int proc_fill_super(struct super_block *s, void *data, int silent)
+{
+ struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct inode *root_inode;
+ int ret;
+
+ if (!proc_parse_options(data, ns))
+ return -EINVAL;
+
+ /* User space would break if executables or devices appear on proc */
+ s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
+ s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
+ s->s_blocksize = 1024;
+ s->s_blocksize_bits = 10;
+ s->s_magic = PROC_SUPER_MAGIC;
+ s->s_op = &proc_sops;
+ s->s_time_gran = 1;
+
+ /*
+ * procfs isn't actually a stacking filesystem; however, there is
+ * too much magic going on inside it to permit stacking things on
+ * top of it
+ */
+ s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
+
+ pde_get(&proc_root);
+ root_inode = proc_get_inode(s, &proc_root);
+ if (!root_inode) {
+ pr_err("proc_fill_super: get root inode failed\n");
+ return -ENOMEM;
+ }
+
+ s->s_root = d_make_root(root_inode);
+ if (!s->s_root) {
+ pr_err("proc_fill_super: allocate dentry failed\n");
+ return -ENOMEM;
+ }
+
+ ret = proc_setup_self(s);
+ if (ret) {
+ return ret;
+ }
+ return proc_setup_thread_self(s);
+}
+
int proc_remount(struct super_block *sb, int *flags, char *data)
{
struct pid_namespace *pid = sb->s_fs_info;


2017-10-06 15:50:07

by David Howells

Subject: [PATCH 07/14] VFS: Add a sample program for fsopen/fsmount [ver #6]

Add a sample program for driving fsopen/fsmount.

Signed-off-by: David Howells <[email protected]>
---

samples/fsmount/test-fsmount.c | 94 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 94 insertions(+)
create mode 100644 samples/fsmount/test-fsmount.c

diff --git a/samples/fsmount/test-fsmount.c b/samples/fsmount/test-fsmount.c
new file mode 100644
index 000000000000..75f91d272a19
--- /dev/null
+++ b/samples/fsmount/test-fsmount.c
@@ -0,0 +1,94 @@
+/* fd-based mount test.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#define PR_ERRMSG_ENABLE 48
+#define PR_ERRMSG_READ 49
+
+#define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
+
+static __attribute__((noreturn))
+void mount_error(int fd, const char *s)
+{
+ char buf[4096];
+ int err, n, perr;
+
+ do {
+ err = errno;
+ errno = 0;
+ n = prctl(PR_ERRMSG_READ, buf, sizeof(buf));
+ perr = errno;
+ errno = err;
+ if (n > 0) {
+ fprintf(stderr, "Error: '%s': %*.*s: %m\n", s, n, n, buf);
+ } else {
+ fprintf(stderr, "%s: %m\n", s);
+ }
+ } while (perr == 0);
+ exit(1);
+}
+
+#define E_write(fd, s) \
+ do { \
+ if (write(fd, s, sizeof(s) - 1) == -1) \
+ mount_error(fd, s); \
+ } while (0)
+
+static inline int fsopen(const char *fs_name, int flags,
+ void *reserved3, void *reserved4, void *reserved5)
+
+{
+ return syscall(333, fs_name, flags, reserved3, reserved4, reserved5);
+}
+
+static inline int fsmount(int fsfd, int dfd, const char *path,
+ unsigned int at_flags, unsigned int flags)
+{
+ return syscall(334, fsfd, dfd, path, at_flags, flags);
+}
+
+int main()
+{
+ int mfd;
+
+ if (prctl(PR_ERRMSG_ENABLE, 1) < 0) {
+ perror("prctl/en");
+ exit(1);
+ }
+
+ /* Mount an NFS filesystem */
+ mfd = fsopen("nfs4", 0, NULL, NULL, NULL);
+ if (mfd == -1) {
+ perror("fsopen");
+ exit(1);
+ }
+
+ E_write(mfd, "s warthog:/data");
+ E_write(mfd, "o fsc");
+ E_write(mfd, "o sync");
+ E_write(mfd, "o intr");
+ E_write(mfd, "o vers=4.2");
+ E_write(mfd, "o addr=90.155.74.18");
+ E_write(mfd, "o clientaddr=90.155.74.21");
+ E_write(mfd, "x create");
+ if (fsmount(mfd, AT_FDCWD, "/mnt", 0, 0) < 0)
+ mount_error(mfd, "fsmount");
+ E(close(mfd));
+
+ exit(0);
+}


2017-10-06 15:49:55

by David Howells

Subject: [PATCH 06/14] VFS: Implement fsmount() to effect a pre-configured mount [ver #6]

Provide a system call by which a filesystem opened with fsopen() and
configured by a series of writes can be mounted:

int ret = fsmount(int fsfd, int dfd, const char *path,
unsigned int at_flags, unsigned int flags);

where fsfd is the fd returned by fsopen(); dfd, path and at_flags locate
the mountpoint; and flags carries the applicable MS_* flags. dfd can be
AT_FDCWD or an fd open to a directory.

In the event that fsmount() fails, it may be possible to get an error
message by calling read(). If no message is available, ENODATA will be
reported.
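
The handling of the flags argument can be sketched in userspace as follows.
This is a minimal illustration, assuming the checks shown in the diff below:
unknown bits are rejected, MS_STRICTATIME combined with MS_NOATIME is
rejected, and relatime is the default when neither is given. The SK_MS_*
values mirror the mount(2) bits from <sys/mount.h>; the SK_MNT_* values are
stand-ins for the kernel-internal MNT_* bits, and sketch_ms_to_mnt() is a
hypothetical name.

```c
#include <assert.h>
#include <errno.h>

/* mount(2) flag bits, mirroring <sys/mount.h> on Linux. */
#define SK_MS_RDONLY		(1u << 0)
#define SK_MS_NOSUID		(1u << 1)
#define SK_MS_NODEV		(1u << 2)
#define SK_MS_NOEXEC		(1u << 3)
#define SK_MS_NOATIME		(1u << 10)
#define SK_MS_NODIRATIME	(1u << 11)
#define SK_MS_RELATIME		(1u << 21)
#define SK_MS_STRICTATIME	(1u << 24)

/* Stand-ins for the kernel-internal MNT_* bits (illustrative values). */
#define SK_MNT_NOSUID		0x01
#define SK_MNT_NODEV		0x02
#define SK_MNT_NOEXEC		0x04
#define SK_MNT_NOATIME		0x08
#define SK_MNT_NODIRATIME	0x10
#define SK_MNT_RELATIME		0x20
#define SK_MNT_READONLY		0x40

/*
 * Translate fsmount()-style MS_* flags into MNT_* flags.
 * Returns the MNT_* mask (>= 0) on success or -EINVAL on bad input.
 */
int sketch_ms_to_mnt(unsigned int flags)
{
	int mnt_flags = 0;

	/* Reject any bit outside the accepted set. */
	if (flags & ~(SK_MS_RDONLY | SK_MS_NOSUID | SK_MS_NODEV |
		      SK_MS_NOEXEC | SK_MS_NOATIME | SK_MS_NODIRATIME |
		      SK_MS_RELATIME | SK_MS_STRICTATIME))
		return -EINVAL;

	if (flags & SK_MS_RDONLY)
		mnt_flags |= SK_MNT_READONLY;
	if (flags & SK_MS_NOSUID)
		mnt_flags |= SK_MNT_NOSUID;
	if (flags & SK_MS_NODEV)
		mnt_flags |= SK_MNT_NODEV;
	if (flags & SK_MS_NOEXEC)
		mnt_flags |= SK_MNT_NOEXEC;
	if (flags & SK_MS_NODIRATIME)
		mnt_flags |= SK_MNT_NODIRATIME;

	/* atime policy: strictatime excludes noatime; relatime is the default. */
	if (flags & SK_MS_STRICTATIME) {
		if (flags & SK_MS_NOATIME)
			return -EINVAL;
	} else if (flags & SK_MS_NOATIME) {
		mnt_flags |= SK_MNT_NOATIME;
	} else {
		mnt_flags |= SK_MNT_RELATIME;
	}

	/* At most one atime policy bit may end up set. */
	assert(!((mnt_flags & SK_MNT_NOATIME) && (mnt_flags & SK_MNT_RELATIME)));
	return mnt_flags;
}
```

Under these assumptions, passing 0 yields only the relatime bit, while
SK_MS_STRICTATIME on its own yields an empty mask.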

Signed-off-by: David Howells <[email protected]>
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/namespace.c | 82 ++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 2 +
kernel/sys_ni.c | 1
5 files changed, 87 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 9bf8d4c62f85..abe6ea95e0e6 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -392,3 +392,4 @@
383 i386 statx sys_statx
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
385 i386 fsopen sys_fsopen
+386 i386 fsmount sys_fsmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9b198c5fc412..0977c5079831 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -340,6 +340,7 @@
331 common pkey_free sys_pkey_free
332 common statx sys_statx
333 common fsopen sys_fsopen
+334 common fsmount sys_fsmount

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index d6b0b0067f6d..8676658b6b2c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3188,6 +3188,88 @@ struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
EXPORT_SYMBOL_GPL(kern_mount_data);

/*
+ * Mount a new, prepared superblock (specified by fs_fd) on the location
+ * specified by dfd and dir_name. dfd can be AT_FDCWD, a dir fd or a container
+ * fd. This cannot be used for binding, moving or remounting mounts.
+ */
+SYSCALL_DEFINE5(fsmount, int, fs_fd, int, dfd, const char __user *, dir_name,
+ unsigned int, at_flags, unsigned int, flags)
+{
+ struct fs_context *fc;
+ struct path mountpoint;
+ struct fd f;
+ unsigned int lookup_flags, mnt_flags = 0;
+ long ret;
+
+ if ((at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
+ AT_EMPTY_PATH)) != 0)
+ return -EINVAL;
+
+ if (flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_STRICTATIME))
+ return -EINVAL;
+
+ if (flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
+ if (flags & MS_NOSUID)
+ mnt_flags |= MNT_NOSUID;
+ if (flags & MS_NODEV)
+ mnt_flags |= MNT_NODEV;
+ if (flags & MS_NOEXEC)
+ mnt_flags |= MNT_NOEXEC;
+ if (flags & MS_NODIRATIME)
+ mnt_flags |= MNT_NODIRATIME;
+
+ if (flags & MS_STRICTATIME) {
+ if (flags & MS_NOATIME)
+ return -EINVAL;
+ } else if (flags & MS_NOATIME) {
+ mnt_flags |= MNT_NOATIME;
+ } else {
+ mnt_flags |= MNT_RELATIME;
+ }
+
+ f = fdget(fs_fd);
+ if (!f.file)
+ return -EBADF;
+
+ ret = -EINVAL;
+ if (f.file->f_op != &fs_fs_fops)
+ goto err_fsfd;
+
+ fc = f.file->private_data;
+
+ ret = -EPERM;
+ if (!may_mount() ||
+ ((fc->sb_flags & MS_MANDLOCK) && !may_mandlock()))
+ goto err_fsfd;
+
+ /* There must be a valid superblock or we can't mount it */
+ ret = -EINVAL;
+ if (!fc->root)
+ goto err_fsfd;
+
+ /* Find the mountpoint. A container can be specified in dfd. */
+ lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+ if (at_flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (at_flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (at_flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+ ret = user_path_at(dfd, dir_name, lookup_flags, &mountpoint);
+ if (ret < 0)
+ goto err_fsfd;
+
+ ret = do_new_mount_fc(fc, &mountpoint, mnt_flags);
+
+ path_put(&mountpoint);
+err_fsfd:
+ fdput(f);
+ return ret;
+}
+
+/*
* Return true if path is reachable from root
*
* namespace_sem or mount_lock is held
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7cd1b65a4152..e82dde171ce8 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -942,5 +942,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_fsopen(const char *fs_name, unsigned int flags,
void *reserved3, void *reserved4, void *reserved5);
+asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
+ unsigned int flags);

#endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index de1dc63e7e47..a0fe764bd5dd 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -261,3 +261,4 @@ cond_syscall(sys_pkey_free);

/* fd-based mount */
cond_syscall(sys_fsopen);
+cond_syscall(sys_fsmount);


2017-10-06 15:49:40

by David Howells

Subject: [PATCH 04/14] VFS: Remove unused code after filesystem context changes [ver #6]

Remove code that is now unused after the filesystem context changes.

Signed-off-by: David Howells <[email protected]>
---

fs/internal.h | 2 --
fs/super.c | 53 --------------------------------------------
include/linux/lsm_hooks.h | 2 --
include/linux/security.h | 6 -----
security/security.c | 5 ----
security/selinux/hooks.c | 20 -----------------
security/smack/smack_lsm.c | 32 ---------------------------
7 files changed, 120 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index e7fb460e7ca4..83ac57f72ce0 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -91,8 +91,6 @@ extern struct file *get_empty_filp(void);
*/
extern int do_remount_sb(struct super_block *, int, void *, int, struct fs_context *);
extern bool trylock_super(struct super_block *sb);
-extern struct dentry *mount_fs(struct file_system_type *,
- int, const char *, void *);
extern struct super_block *user_get_super(dev_t);

/*
diff --git a/fs/super.c b/fs/super.c
index e7d411d1d435..b456239d7e69 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1389,59 +1389,6 @@ struct dentry *mount_single(struct file_system_type *fs_type,
}
EXPORT_SYMBOL(mount_single);

-struct dentry *
-mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
-{
- struct dentry *root;
- struct super_block *sb;
- char *secdata = NULL;
- int error = -ENOMEM;
-
- if (data && !(type->fs_flags & FS_BINARY_MOUNTDATA)) {
- secdata = alloc_secdata();
- if (!secdata)
- goto out;
-
- error = security_sb_copy_data(data, secdata);
- if (error)
- goto out_free_secdata;
- }
-
- root = type->mount(type, flags, name, data);
- if (IS_ERR(root)) {
- error = PTR_ERR(root);
- goto out_free_secdata;
- }
- sb = root->d_sb;
- BUG_ON(!sb);
- WARN_ON(!sb->s_bdi);
- sb->s_flags |= SB_BORN;
-
- error = security_sb_kern_mount(sb, flags, secdata);
- if (error)
- goto out_sb;
-
- /*
- * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
- * but s_maxbytes was an unsigned long long for many releases. Throw
- * this warning for a little while to try and catch filesystems that
- * violate this rule.
- */
- WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
- "negative value (%lld)\n", type->name, sb->s_maxbytes);
-
- up_write(&sb->s_umount);
- free_secdata(secdata);
- return root;
-out_sb:
- dput(root);
- deactivate_locked_super(sb);
-out_free_secdata:
- free_secdata(secdata);
-out:
- return ERR_PTR(error);
-}
-
/*
* Setup private BDI for given superblock. It gets automatically cleaned up
* in generic_shutdown_super().
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 74aeccb041a2..f5acc9f1b107 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1427,7 +1427,6 @@ union security_list_options {
void (*sb_free_security)(struct super_block *sb);
int (*sb_copy_data)(char *orig, char *copy);
int (*sb_remount)(struct super_block *sb, void *data);
- int (*sb_kern_mount)(struct super_block *sb, int flags, void *data);
int (*sb_show_options)(struct seq_file *m, struct super_block *sb);
int (*sb_statfs)(struct dentry *dentry);
int (*sb_mount)(const char *dev_name, const struct path *path,
@@ -1752,7 +1751,6 @@ struct security_hook_heads {
struct list_head sb_free_security;
struct list_head sb_copy_data;
struct list_head sb_remount;
- struct list_head sb_kern_mount;
struct list_head sb_show_options;
struct list_head sb_statfs;
struct list_head sb_mount;
diff --git a/include/linux/security.h b/include/linux/security.h
index 4a47c732d7b8..fdd2d921c858 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -244,7 +244,6 @@ int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
int security_sb_copy_data(char *orig, char *copy);
int security_sb_remount(struct super_block *sb, void *data);
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data);
int security_sb_show_options(struct seq_file *m, struct super_block *sb);
int security_sb_statfs(struct dentry *dentry);
int security_sb_mount(const char *dev_name, const struct path *path,
@@ -591,11 +590,6 @@ static inline int security_sb_remount(struct super_block *sb, void *data)
return 0;
}

-static inline int security_sb_kern_mount(struct super_block *sb, int flags, void *data)
-{
- return 0;
-}
-
static inline int security_sb_show_options(struct seq_file *m,
struct super_block *sb)
{
diff --git a/security/security.c b/security/security.c
index 7826a493c02a..92af63d3a791 100644
--- a/security/security.c
+++ b/security/security.c
@@ -402,11 +402,6 @@ int security_sb_remount(struct super_block *sb, void *data)
return call_int_hook(sb_remount, 0, sb, data);
}

-int security_sb_kern_mount(struct super_block *sb, int flags, void *data)
-{
- return call_int_hook(sb_kern_mount, 0, sb, flags, data);
-}
-
int security_sb_show_options(struct seq_file *m, struct super_block *sb)
{
return call_int_hook(sb_show_options, 0, m, sb);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index ca57e61f9c43..2da5ed29fc15 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2811,25 +2811,6 @@ static int selinux_sb_remount(struct super_block *sb, void *data)
goto out_free_opts;
}

-static int selinux_sb_kern_mount(struct super_block *sb, int flags, void *data)
-{
- const struct cred *cred = current_cred();
- struct common_audit_data ad;
- int rc;
-
- rc = superblock_doinit(sb, data);
- if (rc)
- return rc;
-
- /* Allow all mounts performed by the kernel */
- if (flags & SB_KERNMOUNT)
- return 0;
-
- ad.type = LSM_AUDIT_DATA_DENTRY;
- ad.u.dentry = sb->s_root;
- return superblock_has_perm(cred, sb, FILESYSTEM__MOUNT, &ad);
-}
-
static int selinux_sb_statfs(struct dentry *dentry)
{
const struct cred *cred = current_cred();
@@ -6453,7 +6434,6 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
LSM_HOOK_INIT(sb_remount, selinux_sb_remount),
- LSM_HOOK_INIT(sb_kern_mount, selinux_sb_kern_mount),
LSM_HOOK_INIT(sb_show_options, selinux_sb_show_options),
LSM_HOOK_INIT(sb_statfs, selinux_sb_statfs),
LSM_HOOK_INIT(sb_mount, selinux_mount),
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 286171a16ed2..2b8eec8085dd 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -848,37 +848,6 @@ static int smack_set_mnt_opts(struct super_block *sb,
}

/**
- * smack_sb_kern_mount - Smack specific mount processing
- * @sb: the file system superblock
- * @flags: the mount flags
- * @data: the smack mount options
- *
- * Returns 0 on success, an error code on failure
- */
-static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
-{
- int rc = 0;
- char *options = data;
- struct security_mnt_opts opts;
-
- security_init_mnt_opts(&opts);
-
- if (!options)
- goto out;
-
- rc = smack_parse_opts_str(options, &opts);
- if (rc)
- goto out_err;
-
-out:
- rc = smack_set_mnt_opts(sb, &opts, 0, NULL);
-
-out_err:
- security_free_mnt_opts(&opts);
- return rc;
-}
-
-/**
* smack_sb_statfs - Smack check on statfs
* @dentry: identifies the file system in question
*
@@ -4608,7 +4577,6 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),
- LSM_HOOK_INIT(sb_kern_mount, smack_sb_kern_mount),
LSM_HOOK_INIT(sb_statfs, smack_sb_statfs),
LSM_HOOK_INIT(sb_set_mnt_opts, smack_set_mnt_opts),
LSM_HOOK_INIT(sb_parse_opts_str, smack_parse_opts_str),


2017-10-06 15:49:47

by David Howells

Subject: [PATCH 05/14] VFS: Implement fsopen() to prepare for a mount [ver #6]

Provide an fsopen() system call that starts the process of preparing to
mount, using an fd as a context handle. fsopen() is given the name of the
filesystem that will be used:

int mfd = fsopen(const char *fsname, int open_flags,
void *reserved3, void *reserved4,
void *reserved5);

where open_flags can be 0 or O_CLOEXEC and reserved* should all be NULL for
the moment.
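
The argument rules above can be modelled in userspace as a minimal sketch. The helper name fsopen_check_args() is hypothetical and not part of the patch; it only mirrors the check made at the top of the syscall (any flag other than O_CLOEXEC, or any non-NULL reserved pointer, yields -EINVAL):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for O_CLOEXEC on older toolchains */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/*
 * Userspace model of the argument check performed by fsopen().
 * Returns 0 if the arguments would be accepted, -EINVAL otherwise.
 * fsopen_check_args() is an illustrative name, not a kernel function.
 */
int fsopen_check_args(unsigned int flags, const void *reserved3,
		      const void *reserved4, const void *reserved5)
{
	if ((flags & ~(unsigned int)O_CLOEXEC) ||
	    reserved3 || reserved4 || reserved5)
		return -EINVAL;
	return 0;
}
```

Rejecting non-NULL reserved pointers now leaves room to give them a meaning later without breaking callers that already pass NULL.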

For example:

mfd = fsopen("ext4", O_CLOEXEC, NULL, NULL, NULL);
write(mfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
write(mfd, "o noatime");
write(mfd, "o acl");
write(mfd, "o user_attr");
write(mfd, "o iversion");
write(mfd, "o ");
write(mfd, "r /my/container"); // root inside the fs
write(mfd, "x create"); // create the superblock
fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW);

mfd = fsopen("afs", 0, NULL, NULL, NULL);
write(mfd, "s %grand.central.org:root.cell");
write(mfd, "o cell=grand.central.org");
write(mfd, "r /");
write(mfd, "x create");
fsmount(mfd, AT_FDCWD, "/mnt", 0);
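
The record format used in these examples can be checked in userspace before writing. The following is a sketch only: fs_cmd_check() is a hypothetical name, and the accepted command letters ('s', 'o', 'x') and the 3..4095 byte length limit are taken from fs_fs_write() as posted in this patch. Under those rules the "r ..." records and the bare "o " record shown above would be rejected.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Userspace model of the checks made on each record written to the fd:
 * the length must be 3..4095 bytes, the first byte must be a known
 * command letter and the second byte must be a space.
 * Returns the offset of the argument (always 2) or -EINVAL.
 */
int fs_cmd_check(const char *buf, size_t len)
{
	if (len < 3 || len > 4095)
		return -EINVAL;

	switch (buf[0]) {
	case 's':	/* source, e.g. a device name */
	case 'o':	/* mount option, "key" or "key=val" */
	case 'x':	/* command, e.g. "create" */
		break;
	default:
		return -EINVAL;
	}

	if (buf[1] != ' ')
		return -EINVAL;
	return 2;
}
```

A caller could run this over each record and only then issue the write(), which keeps malformed records from consuming a syscall round trip.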

If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:

"e <subsys>:<problem>"
"e SELinux:Mount on mountpoint not permitted"

Once fsmount() has been called, further write() calls will incur EBUSY,
even if the fsmount() fails. read() is still possible to retrieve error
information.

The fsopen() syscall creates a mount context and hangs it off the fd that it
returns.

Netlink is not used because it is an optional part of the kernel and the
core VFS must not depend on it.

Signed-off-by: David Howells <[email protected]>
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/Makefile | 2
fs/fsopen.c | 273 ++++++++++++++++++++++++++++++++
include/linux/fs_context.h | 1
include/linux/syscalls.h | 2
include/uapi/linux/magic.h | 1
kernel/sys_ni.c | 3
8 files changed, 283 insertions(+), 1 deletion(-)
create mode 100644 fs/fsopen.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..9bf8d4c62f85 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
382 i386 pkey_free sys_pkey_free
383 i386 statx sys_statx
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
+385 i386 fsopen sys_fsopen
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..9b198c5fc412 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
330 common pkey_alloc sys_pkey_alloc
331 common pkey_free sys_pkey_free
332 common statx sys_statx
+333 common fsopen sys_fsopen

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index ffe728cc15e1..c42d1d9351a6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -12,7 +12,7 @@ obj-y := open.o read_write.o file_table.o super.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o \
stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
- fs_context.o
+ fs_context.o fsopen.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fsopen.c b/fs/fsopen.c
new file mode 100644
index 000000000000..6ca7e1979273
--- /dev/null
+++ b/fs/fsopen.c
@@ -0,0 +1,273 @@
+/* Filesystem access-by-fd.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs_context.h>
+#include <linux/mount.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/syscalls.h>
+
+static struct vfsmount *fs_fs_mnt __read_mostly;
+
+static int fs_fs_release(struct inode *inode, struct file *file)
+{
+ struct fs_context *fc = file->private_data;
+
+ file->private_data = NULL;
+
+ put_fs_context(fc);
+ return 0;
+}
+
+/*
+ * Userspace writes configuration data and commands to the fd and we parse it
+ * here. For the moment, we assume a single option or command per write. Each
+ * line written is of the form
+ *
+ * <option_type><space><stuff...>
+ *
+ * s /dev/sda1 -- Source device name
+ * o noatime -- Option without value
+ * o cell=grand.central.org -- Option with value
+ * r / -- Dir within device to mount
+ * x create -- Create a superblock
+ */
+static ssize_t fs_fs_write(struct file *file,
+ const char __user *_buf, size_t len, loff_t *pos)
+{
+ struct fs_context *fc = file->private_data;
+ struct inode *inode = file_inode(file);
+ char opt[2], *data;
+ ssize_t ret;
+
+ if (len < 3 || len > 4095)
+ return -EINVAL;
+
+ if (copy_from_user(opt, _buf, 2) != 0)
+ return -EFAULT;
+ switch (opt[0]) {
+ case 's':
+ case 'o':
+ case 'x':
+ break;
+ default:
+ goto err_bad_cmd;
+ }
+ if (opt[1] != ' ')
+ goto err_bad_cmd;
+
+ data = memdup_user_nul(_buf + 2, len - 2);
+ if (IS_ERR(data))
+ return PTR_ERR(data);
+
+ /* From this point onwards we need to lock the fd against someone
+ * trying to mount it.
+ */
+ ret = inode_lock_killable(inode);
+ if (ret < 0)
+ goto err_free;
+
+ ret = -EINVAL;
+ switch (opt[0]) {
+ case 's':
+ ret = vfs_set_fs_source(fc, data, len - 2);
+ if (ret < 0)
+ goto err_unlock;
+ data = NULL;
+ break;
+
+ case 'o':
+ ret = vfs_parse_mount_option(fc, data);
+ if (ret < 0)
+ goto err_unlock;
+ break;
+
+ case 'x':
+ if (strcmp(data, "create") == 0) {
+ ret = vfs_get_tree(fc);
+ } else {
+ ret = -EOPNOTSUPP;
+ }
+ if (ret < 0)
+ goto err_unlock;
+ break;
+
+ default:
+ goto err_unlock;
+ }
+
+ ret = len;
+err_unlock:
+ inode_unlock(inode);
+err_free:
+ kfree(data);
+ return ret;
+err_bad_cmd:
+ return -EINVAL;
+}
+
+const struct file_operations fs_fs_fops = {
+ .write = fs_fs_write,
+ .release = fs_fs_release,
+ .llseek = no_llseek,
+};
+
+/*
+ * Indicate the name we want to display the filesystem file as.
+ */
+static char *fs_fs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+ return dynamic_dname(dentry, buffer, buflen, "fs:[%lu]",
+ d_inode(dentry)->i_ino);
+}
+
+static const struct dentry_operations fs_fs_dentry_operations = {
+ .d_dname = fs_fs_dname,
+};
+
+/*
+ * Create a file that can be used to configure a new mount.
+ */
+static struct file *create_fs_file(struct fs_context *fc)
+{
+ struct inode *inode;
+ struct file *f;
+ struct path path;
+ int ret;
+
+ inode = alloc_anon_inode(fs_fs_mnt->mnt_sb);
+ if (!inode)
+ return ERR_PTR(-ENFILE);
+ inode->i_fop = &fs_fs_fops;
+
+ ret = -ENOMEM;
+ path.dentry = d_alloc_pseudo(fs_fs_mnt->mnt_sb, &empty_name);
+ if (!path.dentry)
+ goto err_inode;
+ path.mnt = mntget(fs_fs_mnt);
+
+ d_instantiate(path.dentry, inode);
+
+ f = alloc_file(&path, FMODE_READ | FMODE_WRITE, &fs_fs_fops);
+ if (IS_ERR(f)) {
+ ret = PTR_ERR(f);
+ goto err_file;
+ }
+
+ f->private_data = fc;
+ return f;
+
+err_file:
+ path_put(&path);
+ return ERR_PTR(ret);
+
+err_inode:
+ iput(inode);
+ return ERR_PTR(ret);
+}
+
+ const struct super_operations fs_fs_ops = {
+ .drop_inode = generic_delete_inode,
+ .destroy_inode = free_inode_nonrcu,
+ .statfs = simple_statfs,
+};
+
+static struct dentry *fs_fs_mount(struct file_system_type *fs_type,
+ int flags, const char *dev_name,
+ void *data)
+{
+ return mount_pseudo(fs_type, "fs_fs:", &fs_fs_ops,
+ &fs_fs_dentry_operations, FS_FS_MAGIC);
+}
+
+static struct file_system_type fs_fs_type = {
+ .name = "fs_fs",
+ .mount = fs_fs_mount,
+ .kill_sb = kill_anon_super,
+};
+
+static int __init init_fs_fs(void)
+{
+ int ret;
+
+ ret = register_filesystem(&fs_fs_type);
+ if (ret < 0)
+ panic("Cannot register fs_fs\n");
+
+ fs_fs_mnt = kern_mount(&fs_fs_type);
+ if (IS_ERR(fs_fs_mnt))
+ panic("Cannot mount fs_fs: %ld\n", PTR_ERR(fs_fs_mnt));
+ return 0;
+}
+
+fs_initcall(init_fs_fs);
+
+/*
+ * Open a filesystem by name so that it can be configured for mounting.
+ *
+ * We are allowed to specify a container in which the filesystem will be
+ * opened, thereby indicating which namespaces will be used (notably, which
+ * network namespace will be used for network filesystems).
+ */
+SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,
+ void *, reserved3, void *, reserved4, void *, reserved5)
+{
+ struct file_system_type *fs_type;
+ struct fs_context *fc;
+ struct file *file;
+ const char *fs_name;
+ int fd, ret;
+
+ if (flags & ~O_CLOEXEC || reserved3 || reserved4 || reserved5)
+ return -EINVAL;
+
+ fs_name = strndup_user(_fs_name, PAGE_SIZE);
+ if (IS_ERR(fs_name))
+ return PTR_ERR(fs_name);
+
+ fs_type = get_fs_type(fs_name);
+ kfree(fs_name);
+ if (!fs_type)
+ return -ENODEV;
+
+ fc = vfs_new_fs_context(fs_type, NULL, 0, FS_CONTEXT_FOR_USER_MOUNT);
+ put_filesystem(fs_type);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ret = -ENOTSUPP;
+ if (!fc->ops)
+ goto err_fc;
+
+ file = create_fs_file(fc);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto err_fc;
+ }
+
+ ret = get_unused_fd_flags(flags & O_CLOEXEC);
+ if (ret < 0)
+ goto err_file;
+
+ fd = ret;
+ fd_install(fd, file);
+ return fd;
+
+err_file:
+ fput(file);
+ return ret;
+
+err_fc:
+ put_fs_context(fc);
+ return ret;
+}
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 8af6ff0e869e..3244b231ede0 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -101,4 +101,5 @@ extern int vfs_get_super(struct fs_context *fc,
int (*fill_super)(struct super_block *sb,
struct fs_context *fc));

+extern const struct file_operations fs_fs_fops;
#endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a78186d826d7..7cd1b65a4152 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -940,5 +940,7 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
asmlinkage long sys_pkey_free(int pkey);
asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
+asmlinkage long sys_fsopen(const char *fs_name, unsigned int flags,
+ void *reserved3, void *reserved4, void *reserved5);

#endif
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index e439565df838..722bf42f9564 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -87,5 +87,6 @@
#define UDF_SUPER_MAGIC 0x15013346
#define BALLOON_KVM_MAGIC 0x13661366
#define ZSMALLOC_MAGIC 0x58295829
+#define FS_FS_MAGIC 0x66736673

#endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 8acef8576ce9..de1dc63e7e47 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -258,3 +258,6 @@ cond_syscall(sys_membarrier);
cond_syscall(sys_pkey_mprotect);
cond_syscall(sys_pkey_alloc);
cond_syscall(sys_pkey_free);
+
+/* fd-based mount */
+cond_syscall(sys_fsopen);


2017-10-06 15:49:24

by David Howells

Subject: [PATCH 02/14] VFS: Add LSM hooks for filesystem context [ver #6]

Add LSM hooks for use by the filesystem context code. This includes:

(1) Hooks to handle allocation, duplication and freeing of the security
record attached to a filesystem context.

(2) A hook to snoop a mount option in key[=val] form. If the LSM decides
it wants to handle it, it can suppress the option being passed to the
filesystem. Note that 'val' may include commas and binary data with
the fsopen patch.

(3) A hook to transfer the security from the context to a newly created
superblock.

(4) A hook to rule on whether a path can be used as a mountpoint.
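
The return convention of the option-snooping hook in (2) can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the SELinux implementation: lsm_snoop_option() and the SEEN_* bits are hypothetical names, and only three option keys are recognised. It returns 1 when the LSM consumes the option, 0 when the option should go on to the filesystem, and -EINVAL for a duplicate or for the context/defcontext conflict.

```c
#include <errno.h>
#include <string.h>

/* Bits recording which security options have been seen (illustrative). */
enum {
	SEEN_CONTEXT    = 1 << 0,
	SEEN_FSCONTEXT  = 1 << 1,
	SEEN_DEFCONTEXT = 1 << 2,
};

/*
 * Model of an LSM parse hook.  Returns 1 if the LSM consumed the option,
 * 0 if it should be passed on to the filesystem, or -EINVAL if it is a
 * duplicate or conflicts with an option already seen.
 */
int lsm_snoop_option(unsigned int *seen, const char *opt)
{
	unsigned int bit, conflict = 0;

	if (strncmp(opt, "context=", 8) == 0) {
		bit = SEEN_CONTEXT;
		conflict = SEEN_DEFCONTEXT;
	} else if (strncmp(opt, "fscontext=", 10) == 0) {
		bit = SEEN_FSCONTEXT;
	} else if (strncmp(opt, "defcontext=", 11) == 0) {
		bit = SEEN_DEFCONTEXT;
		conflict = SEEN_CONTEXT;
	} else {
		return 0;	/* not ours; the filesystem gets it */
	}

	if (*seen & (bit | conflict))
		return -EINVAL;
	*seen |= bit;
	return 1;
}
```

Keeping the "seen" state in the caller's context mirrors how the hook stores its state in the security pointer of the filesystem context rather than in globals.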

These are intended to replace:

security_sb_copy_data
security_sb_kern_mount
security_sb_mount
security_sb_set_mnt_opts
security_sb_clone_mnt_opts
security_sb_parse_opts_str

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---

include/linux/lsm_hooks.h | 45 ++++++++++++
include/linux/security.h | 33 +++++++++
security/security.c | 30 ++++++++
security/selinux/hooks.c | 174 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 282 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index c9258124e417..85398ba0b533 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -76,6 +76,38 @@
* changes on the process such as clearing out non-inheritable signal
* state. This is called immediately after commit_creds().
*
+ * Security hooks for mount using fd context.
+ *
+ * @fs_context_alloc:
+ * Allocate and attach a security structure to fc->security. This pointer
+ * is initialised to NULL by the caller.
+ * @fc indicates the new filesystem context.
+ * @src_sb indicates the source superblock of a submount.
+ * @fs_context_dup:
+ * Allocate and attach a security structure to fc->security. This pointer
+ * is initialised to NULL by the caller.
+ * @fc indicates the new filesystem context.
+ * @src_fc indicates the original filesystem context.
+ * @fs_context_free:
+ * Clean up a filesystem context.
+ * @fc indicates the filesystem context.
+ * @fs_context_parse_one:
+ * Userspace provided an option to configure a superblock. The LSM may
+ * reject it with an error and may use it for itself, in which case it
+ * should return 1; otherwise it should return 0 to pass it on to the
+ * filesystem.
+ * @fc indicates the filesystem context.
+ * @p indicates the option in "key[=val]" form.
+ * @sb_get_tree:
+ * Assign the security to a newly created superblock.
+ * @fc indicates the filesystem context.
+ * @fc->root indicates the root that will be mounted.
+ * @fc->root->d_sb points to the superblock.
+ * @sb_mountpoint:
+ * Equivalent of sb_mount, but with an fs_context.
+ * @fc indicates the filesystem context.
+ * @mountpoint indicates the path on which the mount will take place.
+ *
* Security hooks for filesystem operations.
*
* @sb_alloc_security:
@@ -1384,6 +1416,13 @@ union security_list_options {
void (*bprm_committing_creds)(struct linux_binprm *bprm);
void (*bprm_committed_creds)(struct linux_binprm *bprm);

+ int (*fs_context_alloc)(struct fs_context *fc, struct super_block *src_sb);
+ int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_sc);
+ void (*fs_context_free)(struct fs_context *fc);
+ int (*fs_context_parse_one)(struct fs_context *fc, char *opt);
+ int (*sb_get_tree)(struct fs_context *fc);
+ int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint);
+
int (*sb_alloc_security)(struct super_block *sb);
void (*sb_free_security)(struct super_block *sb);
int (*sb_copy_data)(char *orig, char *copy);
@@ -1703,6 +1742,12 @@ struct security_hook_heads {
struct list_head bprm_check_security;
struct list_head bprm_committing_creds;
struct list_head bprm_committed_creds;
+ struct list_head fs_context_alloc;
+ struct list_head fs_context_dup;
+ struct list_head fs_context_free;
+ struct list_head fs_context_parse_one;
+ struct list_head sb_get_tree;
+ struct list_head sb_mountpoint;
struct list_head sb_alloc_security;
struct list_head sb_free_security;
struct list_head sb_copy_data;
diff --git a/include/linux/security.h b/include/linux/security.h
index ce6265960d6c..4a47c732d7b8 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -56,6 +56,7 @@ struct msg_queue;
struct xattr;
struct xfrm_sec_ctx;
struct mm_struct;
+struct fs_context;

/* If capable should audit the security request */
#define SECURITY_CAP_NOAUDIT 0
@@ -233,6 +234,12 @@ int security_bprm_set_creds(struct linux_binprm *bprm);
int security_bprm_check(struct linux_binprm *bprm);
void security_bprm_committing_creds(struct linux_binprm *bprm);
void security_bprm_committed_creds(struct linux_binprm *bprm);
+int security_fs_context_alloc(struct fs_context *fc, struct super_block *sb);
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc);
+void security_fs_context_free(struct fs_context *fc);
+int security_fs_context_parse_option(struct fs_context *fc, char *opt);
+int security_sb_get_tree(struct fs_context *fc);
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint);
int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
int security_sb_copy_data(char *orig, char *copy);
@@ -540,6 +547,32 @@ static inline void security_bprm_committed_creds(struct linux_binprm *bprm)
{
}

+static inline int security_fs_context_alloc(struct fs_context *fc,
+ struct super_block *src_sb)
+{
+ return 0;
+}
+static inline int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ return 0;
+}
+static inline void security_fs_context_free(struct fs_context *fc)
+{
+}
+static inline int security_fs_context_parse_option(struct fs_context *fc, char *opt)
+{
+ return 0;
+}
+static inline int security_sb_get_tree(struct fs_context *fc)
+{
+ return 0;
+}
+static inline int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint)
+{
+ return 0;
+}
+
static inline int security_sb_alloc(struct super_block *sb)
{
return 0;
diff --git a/security/security.c b/security/security.c
index 4bf0f571b4ef..55383a0e764d 100644
--- a/security/security.c
+++ b/security/security.c
@@ -351,6 +351,36 @@ void security_bprm_committed_creds(struct linux_binprm *bprm)
call_void_hook(bprm_committed_creds, bprm);
}

+int security_fs_context_alloc(struct fs_context *fc, struct super_block *src_sb)
+{
+ return call_int_hook(fs_context_alloc, 0, fc, src_sb);
+}
+
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ return call_int_hook(fs_context_dup, 0, fc, src_fc);
+}
+
+void security_fs_context_free(struct fs_context *fc)
+{
+ call_void_hook(fs_context_free, fc);
+}
+
+int security_fs_context_parse_one(struct fs_context *fc, char *opt)
+{
+ return call_int_hook(fs_context_parse_one, 0, fc, opt);
+}
+
+int security_sb_get_tree(struct fs_context *fc)
+{
+ return call_int_hook(sb_get_tree, 0, fc);
+}
+
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint)
+{
+ return call_int_hook(sb_mountpoint, 0, fc, mountpoint);
+}
+
int security_sb_alloc(struct super_block *sb)
{
return call_int_hook(sb_alloc_security, 0, sb);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 6f37f7e5b9a8..0dda7350b5af 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -48,6 +48,7 @@
#include <linux/fdtable.h>
#include <linux/namei.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/netfilter_ipv4.h>
#include <linux/netfilter_ipv6.h>
#include <linux/tty.h>
@@ -2862,6 +2863,172 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
FILESYSTEM__UNMOUNT, NULL);
}

+/* fsopen mount context operations */
+
+static int selinux_fs_context_alloc(struct fs_context *fc,
+ struct super_block *src_sb)
+{
+ struct security_mnt_opts *opts;
+
+ opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+
+ fc->security = opts;
+ return 0;
+}
+
+static int selinux_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ const struct security_mnt_opts *src = src_fc->security;
+ struct security_mnt_opts *opts;
+ int i, n;
+
+ opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+ fc->security = opts;
+
+ if (!src || !src->num_mnt_opts)
+ return 0;
+ n = opts->num_mnt_opts = src->num_mnt_opts;
+
+ if (src->mnt_opts) {
+ opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
+ if (!opts->mnt_opts)
+ return -ENOMEM;
+
+ for (i = 0; i < n; i++) {
+ if (src->mnt_opts[i]) {
+ opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
+ GFP_KERNEL);
+ if (!opts->mnt_opts[i])
+ return -ENOMEM;
+ }
+ }
+ }
+
+ if (src->mnt_opts_flags) {
+ opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
+ n * sizeof(int), GFP_KERNEL);
+ if (!opts->mnt_opts_flags)
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static void selinux_fs_context_free(struct fs_context *fc)
+{
+ struct security_mnt_opts *opts = fc->security;
+
+ security_free_mnt_opts(opts);
+ fc->security = NULL;
+}
+
+static int selinux_fs_context_parse_one(struct fs_context *fc, char *opt)
+{
+ struct security_mnt_opts *opts = fc->security;
+ substring_t args[MAX_OPT_ARGS];
+ unsigned int have;
+ char *c, **oo;
+ int token, ctx, i, *of;
+
+ token = match_token(opt, tokens, args);
+ if (token == Opt_error)
+ return 0; /* Doesn't belong to us. */
+
+ have = 0;
+ for (i = 0; i < opts->num_mnt_opts; i++)
+ have |= 1 << opts->mnt_opts_flags[i];
+ if (have & (1 << token))
+ return -EINVAL;
+
+ switch (token) {
+ case Opt_context:
+ if (have & (1 << Opt_defcontext))
+ goto incompatible;
+ ctx = CONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_fscontext:
+ ctx = FSCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_rootcontext:
+ ctx = ROOTCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_defcontext:
+ if (have & (1 << Opt_context))
+ goto incompatible;
+ ctx = DEFCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_labelsupport:
+ return 1;
+
+ default:
+ return -EINVAL;
+ }
+
+copy_context_string:
+ if (opts->num_mnt_opts > 3)
+ return -EINVAL;
+
+ of = krealloc(opts->mnt_opts_flags,
+ (opts->num_mnt_opts + 1) * sizeof(int), GFP_KERNEL);
+ if (!of)
+ return -ENOMEM;
+ of[opts->num_mnt_opts] = 0;
+ opts->mnt_opts_flags = of;
+
+ oo = krealloc(opts->mnt_opts,
+ (opts->num_mnt_opts + 1) * sizeof(char *), GFP_KERNEL);
+ if (!oo)
+ return -ENOMEM;
+ oo[opts->num_mnt_opts] = NULL;
+ opts->mnt_opts = oo;
+
+ c = match_strdup(&args[0]);
+ if (!c)
+ return -ENOMEM;
+ opts->mnt_opts[opts->num_mnt_opts] = c;
+ opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
+ opts->num_mnt_opts++;
+ return 1;
+
+incompatible:
+ return -EINVAL;
+}
+
+static int selinux_sb_get_tree(struct fs_context *fc)
+{
+ const struct cred *cred = current_cred();
+ struct common_audit_data ad;
+ int rc;
+
+ rc = selinux_set_mnt_opts(fc->root->d_sb, fc->security, 0, NULL);
+ if (rc)
+ return rc;
+
+ /* Allow all mounts performed by the kernel */
+ if (fc->sb_flags & MS_KERNMOUNT)
+ return 0;
+
+ ad.type = LSM_AUDIT_DATA_DENTRY;
+ ad.u.dentry = fc->root;
+ return superblock_has_perm(cred, fc->root->d_sb, FILESYSTEM__MOUNT, &ad);
+}
+
+static int selinux_sb_mountpoint(struct fs_context *fc, struct path *mountpoint)
+{
+ const struct cred *cred = current_cred();
+
+ return path_has_perm(cred, mountpoint, FILE__MOUNTON);
+}
+
/* inode security operations */

static int selinux_inode_alloc_security(struct inode *inode)
@@ -6275,6 +6442,13 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),

+ LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_one, selinux_fs_context_parse_one),
+ LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
+ LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
+
LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),


2017-10-06 15:49:32

by David Howells

Subject: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

Implement a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.

The mounting procedure then becomes:

(1) Allocate a new fs_context.

(2) Configure the context.

(3) Create superblock.

(4) Mount the superblock any number of times.

(5) Destroy the context.

Rather than calling fs_type->mount(), an fs_context struct is created and
fs_type->init_fs_context() is called to set it up.
fs_type->fs_context_size says how much space should be allocated for the
config context. The fs_context struct is placed at the beginning and any
extra space is for the filesystem's use.
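
The layout described above, with the generic struct first and filesystem-private space after it, can be sketched in userspace as follows. Everything here is illustrative: the struct fs_context shown is a cut-down stand-in, struct myfs_fs_context and both helpers are hypothetical, and rounding a too-small size up to the base struct is an assumption of this sketch rather than something the patch states.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct fs_context. */
struct fs_context {
	unsigned int sb_flags;
	void *security;
};

/* Hypothetical filesystem-private context; fc must be the first member. */
struct myfs_fs_context {
	struct fs_context fc;
	char *server_name;
	unsigned int port;
};

/* Allocate a zeroed context of the size the filesystem asked for. */
struct fs_context *alloc_fs_context(size_t fs_context_size)
{
	/* Assumption: never allocate less than the generic part. */
	if (fs_context_size < sizeof(struct fs_context))
		fs_context_size = sizeof(struct fs_context);
	return calloc(1, fs_context_size);
}

/* Recover the filesystem's view; valid because fc sits at offset 0. */
struct myfs_fs_context *to_myfs(struct fs_context *fc)
{
	return (struct myfs_fs_context *)fc;
}
```

Because the generic struct sits at offset zero, the VFS can pass a plain struct fs_context pointer around while the filesystem converts it back to its own type without any extra bookkeeping.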

A set of operations has to be set by ->init_fs_context() to provide
freeing, duplication, option parsing, binary data parsing, validation,
mounting and superblock filling.

Legacy filesystems are supported by the provision of a set of legacy
fs_context operations that build up a list of mount options and then invoke
fs_type->mount() from within the fs_context ->get_tree() operation. This
allows all filesystems to be accessed using fs_context.
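
One way the legacy wrapper could build up its option list is to append each option to a comma-separated buffer of the kind mount(2) hands to fs_type->mount(). The sketch below is a userspace model under that assumption; struct legacy_data, legacy_add_option() and the 4096-byte limit are hypothetical and are not taken from fs/fs_context.c.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define LEGACY_DATA_SIZE 4096	/* assumed: one page of option data */

struct legacy_data {
	char buf[LEGACY_DATA_SIZE];
	size_t usage;		/* bytes used, excluding the trailing NUL */
};

/* Append one "key" or "key=val" option, separated by a comma. */
int legacy_add_option(struct legacy_data *ld, const char *opt)
{
	size_t len = strlen(opt);
	size_t sep = ld->usage ? 1 : 0;

	if (len == 0)
		return -EINVAL;
	/* Leave room for the separator and the terminating NUL. */
	if (ld->usage + sep + len + 1 > sizeof(ld->buf))
		return -ENOMEM;

	if (sep)
		ld->buf[ld->usage++] = ',';
	memcpy(ld->buf + ld->usage, opt, len + 1);
	ld->usage += len;
	return 0;
}
```

Note that a value containing a comma cannot be represented unambiguously in such a buffer, which is one limitation of the monolithic format that per-option parsing avoids.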

It should be noted that, whilst this patch adds a lot of lines of code,
there is quite a bit of duplication with existing code that can be
eliminated should all filesystems be converted over.

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/mounting.txt | 7
fs/Makefile | 3
fs/fs_context.c | 526 ++++++++++++++++++++++++++++++++
fs/internal.h | 2
fs/libfs.c | 17 +
fs/namespace.c | 337 ++++++++++++++-------
fs/super.c | 294 +++++++++++++++++-
include/linux/fs.h | 16 +
include/linux/fs_context.h | 37 ++
include/linux/lsm_hooks.h | 6
include/linux/mount.h | 2
security/security.c | 4
security/selinux/hooks.c | 6
13 files changed, 1107 insertions(+), 150 deletions(-)
create mode 100644 fs/fs_context.c

diff --git a/Documentation/filesystems/mounting.txt b/Documentation/filesystems/mounting.txt
index 8c0b0351e949..ba73066c151c 100644
--- a/Documentation/filesystems/mounting.txt
+++ b/Documentation/filesystems/mounting.txt
@@ -192,7 +192,7 @@ structure is not refcounted.

VFS, security and filesystem mount options are set individually with
vfs_parse_mount_option(). Options provided by the old mount(2) system call as
-a page of data can be parsed with generic_monolithic_mount_data().
+a page of data can be parsed with generic_parse_monolithic().

When mounting, the filesystem is allowed to take data from any of the pointers
and attach it to the superblock (or whatever), provided it clears the pointer
@@ -264,7 +264,7 @@ manage the filesystem context. They are as follows:

If the filesystem (eg. NFS) needs to examine the data first and then finds
it's the standard key-val list then it may pass it off to
- generic_monolithic_mount_data().
+ generic_parse_monolithic().

(*) int (*validate)(struct fs_context *fc);

@@ -407,9 +407,10 @@ returned.
[NOTE] ->validate() could perhaps be rolled into ->get_tree() and
->remount_fs_fc().

- (*) struct vfsmount *vfs_kern_mount_fc(struct fs_context *fc);
+ (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);

Create a mount given the parameters in the specified filesystem context.
+ Note that this does not attach the mount to anything.

(*) int vfs_set_fs_source(struct fs_context *fc, char *source);

diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..ffe728cc15e1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,8 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o \
- stack.o fs_struct.o statfs.o fs_pin.o nsfs.o
+ stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
+ fs_context.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_context.c b/fs/fs_context.c
new file mode 100644
index 000000000000..a3a7ccb4323d
--- /dev/null
+++ b/fs/fs_context.c
@@ -0,0 +1,526 @@
+/* Provide a way to create a superblock configuration context within the kernel
+ * that allows a superblock to be set up prior to mounting.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/fs_context.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/nsproxy.h>
+#include <linux/slab.h>
+#include <linux/magic.h>
+#include <linux/security.h>
+#include <linux/parser.h>
+#include <linux/mnt_namespace.h>
+#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
+#include <net/net_namespace.h>
+#include "mount.h"
+
+struct legacy_fs_context {
+ struct fs_context fc;
+ char *legacy_data; /* Data page for legacy filesystems */
+ char *secdata;
+ unsigned int data_usage;
+};
+
+static const struct fs_context_operations legacy_fs_context_ops;
+
+static const match_table_t common_set_sb_flag = {
+ { SB_DIRSYNC, "dirsync" },
+ { SB_LAZYTIME, "lazytime" },
+ { SB_MANDLOCK, "mand" },
+ { SB_POSIXACL, "posixacl" },
+ { SB_RDONLY, "ro" },
+ { SB_SYNCHRONOUS, "sync" },
+ { },
+};
+
+static const match_table_t common_clear_sb_flag = {
+ { SB_LAZYTIME, "nolazytime" },
+ { SB_MANDLOCK, "nomand" },
+ { SB_RDONLY, "rw" },
+ { SB_SILENT, "silent" },
+ { SB_SYNCHRONOUS, "async" },
+ { },
+};
+
+static const match_table_t forbidden_sb_flag = {
+ { 1, "bind" },
+ { 1, "move" },
+ { 1, "private" },
+ { 1, "remount" },
+ { 1, "shared" },
+ { 1, "slave" },
+ { 1, "unbindable" },
+ { 1, "rec" },
+ { 1, "noatime" },
+ { 1, "relatime" },
+ { 1, "norelatime" },
+ { 1, "strictatime" },
+ { 1, "nostrictatime" },
+ { 1, "nodiratime" },
+ { 1, "dev" },
+ { 1, "nodev" },
+ { 1, "exec" },
+ { 1, "noexec" },
+ { 1, "suid" },
+ { 1, "nosuid" },
+ { },
+};
+
+/*
+ * Check for a common mount option that manipulates s_flags.
+ */
+static int vfs_parse_sb_flag_option(struct fs_context *fc, char *data)
+{
+ substring_t args[MAX_OPT_ARGS];
+ unsigned int token;
+
+ token = match_token(data, common_set_sb_flag, args);
+ if (token) {
+ fc->sb_flags |= token;
+ return 1;
+ }
+
+ token = match_token(data, common_clear_sb_flag, args);
+ if (token) {
+ fc->sb_flags &= ~token;
+ return 1;
+ }
+
+ token = match_token(data, forbidden_sb_flag, args);
+ if (token)
+ return -EINVAL;
+
+ return 0;
+}
+
+/**
+ * vfs_parse_mount_option - Add a single mount option to a superblock config
+ * @fc: The filesystem context to modify
+ * @p: The option to apply.
+ *
+ * A single mount option in string form is applied to the filesystem context
+ * being set up. Certain standard options (for example "ro") are translated
+ * into flag bits without going to the filesystem. The active security module
+ * is allowed to observe and poach options. Any other options are passed over
+ * to the filesystem to parse.
+ *
+ * This may be called multiple times for a context.
+ *
+ * Returns 0 on success and a negative error code on failure (for instance
+ * -EINVAL if the filesystem provides no ->parse_option() operation).
+ */
+int vfs_parse_mount_option(struct fs_context *fc, char *p)
+{
+ int ret;
+
+ ret = vfs_parse_sb_flag_option(fc, p);
+ if (ret < 0)
+ return ret;
+ if (ret == 1)
+ return 0;
+
+ ret = security_fs_context_parse_option(fc, p);
+ if (ret < 0)
+ return ret;
+ if (ret == 1)
+ return 0;
+
+ if (fc->ops->parse_option)
+ return fc->ops->parse_option(fc, p);
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL(vfs_parse_mount_option);
+
+/**
+ * vfs_set_fs_source - Set the source/device name in a filesystem context
+ * @fc: The filesystem context to alter
+ * @source: The name of the source
+ * @slen: Length of @source string
+ */
+int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t slen)
+{
+ if (fc->source)
+ return -EINVAL;
+ if (source) {
+ fc->source = kmemdup_nul(source, slen, GFP_KERNEL);
+ if (!fc->source)
+ return -ENOMEM;
+ }
+
+ if (fc->ops->parse_source)
+ return fc->ops->parse_source(fc);
+ return 0;
+}
+EXPORT_SYMBOL(vfs_set_fs_source);
+
+/**
+ * generic_parse_monolithic - Parse key[=val][,key[=val]]* mount data
+ * @ctx: The filesystem context to fill in.
+ * @data: The data to parse
+ *
+ * Parse a blob of data that's in key[=val][,key[=val]]* form. This can be
+ * called from the ->parse_monolithic() fs_context operation.
+ *
+ * Returns 0 on success or the error returned by the ->parse_option() fs_context
+ * operation on failure.
+ */
+int generic_parse_monolithic(struct fs_context *ctx, void *data)
+{
+ char *options = data, *p;
+ int ret;
+
+ if (!options)
+ return 0;
+
+ while ((p = strsep(&options, ",")) != NULL) {
+ if (*p) {
+ ret = vfs_parse_mount_option(ctx, p);
+ if (ret < 0)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(generic_parse_monolithic);
+
+/**
+ * vfs_new_fs_context - Create a filesystem context.
+ * @fs_type: The filesystem type.
+ * @src_sb: A superblock from which this one derives (or NULL)
+ * @sb_flags: Superblock flags and op flags (such as MS_REMOUNT)
+ * @purpose: The purpose that this configuration shall be used for.
+ *
+ * Open a filesystem and create a filesystem context. The context is
+ * initialised with the supplied flags and, if this is a submount/automount
+ * from another superblock (@src_sb), may have parameters such as namespaces
+ * copied across from that superblock.
+ */
+struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct super_block *src_sb,
+ unsigned int sb_flags,
+ enum fs_context_purpose purpose)
+{
+ struct fs_context *fc;
+ size_t fc_size = fs_type->fs_context_size;
+ int ret;
+
+ BUG_ON(fs_type->init_fs_context && fc_size < sizeof(*fc));
+
+ if (!fs_type->init_fs_context)
+ fc_size = sizeof(struct legacy_fs_context);
+
+ fc = kzalloc(fc_size, GFP_KERNEL);
+ if (!fc)
+ return ERR_PTR(-ENOMEM);
+
+ fc->purpose = purpose;
+ fc->sb_flags = sb_flags;
+ fc->fs_type = get_filesystem(fs_type);
+ fc->cred = get_current_cred();
+
+ switch (purpose) {
+ case FS_CONTEXT_FOR_KERNEL_MOUNT:
+ fc->sb_flags |= SB_KERNMOUNT;
+ /* Fallthrough */
+ case FS_CONTEXT_FOR_USER_MOUNT:
+ fc->user_ns = get_user_ns(fc->cred->user_ns);
+ fc->net_ns = get_net(current->nsproxy->net_ns);
+ break;
+ case FS_CONTEXT_FOR_SUBMOUNT:
+ fc->user_ns = get_user_ns(src_sb->s_user_ns);
+ fc->net_ns = get_net(current->nsproxy->net_ns);
+ break;
+ case FS_CONTEXT_FOR_REMOUNT:
+ /* We don't pin any namespaces as the superblock's
+ * subscriptions cannot be changed at this point.
+ */
+ break;
+ }
+
+
+ /* TODO: Make all filesystems support this unconditionally */
+ if (fc->fs_type->init_fs_context) {
+ ret = fc->fs_type->init_fs_context(fc, src_sb);
+ if (ret < 0)
+ goto err_fc;
+ } else {
+ fc->ops = &legacy_fs_context_ops;
+ }
+
+ /* Do the security check last because ->init_fs_context may change the
+ * namespace subscriptions.
+ */
+ ret = security_fs_context_alloc(fc, src_sb);
+ if (ret < 0)
+ goto err_fc;
+
+ return fc;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_new_fs_context);
+
+/**
+ * vfs_sb_reconfig - Create a filesystem context for remount/reconfiguration
+ * @mnt: The mountpoint to open
+ * @sb_flags: Superblock flags and op flags (such as MS_REMOUNT)
+ *
+ * Open a mounted filesystem and create a filesystem context such that a
+ * remount can be effected.
+ */
+struct fs_context *vfs_sb_reconfig(struct vfsmount *mnt,
+ unsigned int sb_flags)
+{
+ return vfs_new_fs_context(mnt->mnt_sb->s_type, mnt->mnt_sb,
+ sb_flags, FS_CONTEXT_FOR_REMOUNT);
+}
+
+/**
+ * vfs_dup_fs_context - Duplicate a filesystem context.
+ * @src_fc: The context to copy.
+ */
+struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
+{
+ struct fs_context *fc;
+ size_t fc_size;
+ int ret;
+
+ if (!src_fc->ops->dup)
+ return ERR_PTR(-ENOTSUPP);
+
+ fc_size = src_fc->fs_type->fs_context_size;
+ if (!src_fc->fs_type->init_fs_context)
+ fc_size = sizeof(struct legacy_fs_context);
+
+ fc = kmemdup(src_fc, fc_size, GFP_KERNEL);
+ if (!fc)
+ return ERR_PTR(-ENOMEM);
+
+ fc->source = NULL;
+ fc->security = NULL;
+ get_filesystem(fc->fs_type);
+ get_net(fc->net_ns);
+ get_user_ns(fc->user_ns);
+ get_cred(fc->cred);
+
+ /* Can't call put until we've called ->dup */
+ ret = fc->ops->dup(fc, src_fc);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = security_fs_context_dup(fc, src_fc);
+ if (ret < 0)
+ goto err_fc;
+ return fc;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_dup_fs_context);
+
+/**
+ * put_fs_context - Dispose of a superblock configuration context.
+ * @fc: The context to dispose of.
+ */
+void put_fs_context(struct fs_context *fc)
+{
+ struct super_block *sb;
+
+ if (fc->root) {
+ sb = fc->root->d_sb;
+ dput(fc->root);
+ fc->root = NULL;
+ deactivate_super(sb);
+ }
+
+ if (fc->ops && fc->ops->free)
+ fc->ops->free(fc);
+
+ security_fs_context_free(fc);
+ if (fc->net_ns)
+ put_net(fc->net_ns);
+ put_user_ns(fc->user_ns);
+ if (fc->cred)
+ put_cred(fc->cred);
+ kfree(fc->subtype);
+ put_filesystem(fc->fs_type);
+ kfree(fc->source);
+ kfree(fc);
+}
+EXPORT_SYMBOL(put_fs_context);
+
+/*
+ * Free the config for a filesystem that doesn't support fs_context.
+ */
+static void legacy_fs_context_free(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+
+ free_secdata(ctx->secdata);
+ kfree(ctx->legacy_data);
+}
+
+/*
+ * Duplicate a legacy config.
+ */
+static int legacy_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+ struct legacy_fs_context *src_ctx = container_of(src_fc, struct legacy_fs_context, fc);
+
+ ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ memcpy(ctx->legacy_data, src_ctx->legacy_data, PAGE_SIZE);
+ return 0;
+}
+
+/*
+ * Add an option to a legacy config. We build up a comma-separated list of
+ * options.
+ */
+static int legacy_parse_option(struct fs_context *fc, char *p)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+ unsigned int usage = ctx->data_usage;
+ size_t len = strlen(p);
+
+ if (len > PAGE_SIZE - 2 - usage)
+ return -EINVAL;
+ if (memchr(p, ',', len) != NULL)
+ return -EINVAL;
+ if (!ctx->legacy_data) {
+ ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ }
+
+ ctx->legacy_data[usage++] = ',';
+ memcpy(ctx->legacy_data + usage, p, len);
+ usage += len;
+ ctx->legacy_data[usage] = '\0';
+ ctx->data_usage = usage;
+ return 0;
+}
+
+/*
+ * Add monolithic mount data.
+ */
+static int legacy_parse_monolithic(struct fs_context *fc, void *data)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+
+ if (ctx->data_usage != 0) {
+ pr_warn("VFS: Can't mix monolithic and individual options\n");
+ return -EINVAL;
+ }
+ if (!data)
+ return 0;
+ if (!ctx->legacy_data) {
+ ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ }
+
+ memcpy(ctx->legacy_data, data, PAGE_SIZE);
+ ctx->data_usage = PAGE_SIZE;
+ return 0;
+}
+
+/*
+ * Use the legacy mount validation step to strip out and process security
+ * config options.
+ */
+static int legacy_validate(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+
+ if (!ctx->legacy_data || ctx->fc.fs_type->fs_flags & FS_BINARY_MOUNTDATA)
+ return 0;
+
+ ctx->secdata = alloc_secdata();
+ if (!ctx->secdata)
+ return -ENOMEM;
+
+ return security_sb_copy_data(ctx->legacy_data, ctx->secdata);
+}
+
+/*
+ * Determine the superblock subtype.
+ */
+static int legacy_set_subtype(struct fs_context *fc)
+{
+ const char *subtype = strchr(fc->fs_type->name, '.');
+
+ if (subtype) {
+ subtype++;
+ if (!subtype[0])
+ return -EINVAL;
+ } else {
+ subtype = "";
+ }
+
+ fc->subtype = kstrdup(subtype, GFP_KERNEL);
+ if (!fc->subtype)
+ return -ENOMEM;
+ return 0;
+}
+
+/*
+ * Get a mountable root with the legacy mount command.
+ */
+static int legacy_get_tree(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+ struct super_block *sb;
+ struct dentry *root;
+ int ret;
+
+ root = ctx->fc.fs_type->mount(ctx->fc.fs_type, ctx->fc.sb_flags,
+ ctx->fc.source, ctx->legacy_data);
+ if (IS_ERR(root))
+ return PTR_ERR(root);
+
+ sb = root->d_sb;
+ BUG_ON(!sb);
+
+ if ((ctx->fc.fs_type->fs_flags & FS_HAS_SUBTYPE) &&
+ !fc->subtype) {
+ ret = legacy_set_subtype(fc);
+ if (ret < 0)
+ goto err_sb;
+ }
+
+ ctx->fc.root = root;
+ return 0;
+
+err_sb:
+ dput(root);
+ deactivate_locked_super(sb);
+ return ret;
+}
+
+static const struct fs_context_operations legacy_fs_context_ops = {
+ .free = legacy_fs_context_free,
+ .dup = legacy_fs_context_dup,
+ .parse_option = legacy_parse_option,
+ .parse_monolithic = legacy_parse_monolithic,
+ .validate = legacy_validate,
+ .get_tree = legacy_get_tree,
+};
diff --git a/fs/internal.h b/fs/internal.h
index 48cee21b4f14..e7fb460e7ca4 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -89,7 +89,7 @@ extern struct file *get_empty_filp(void);
/*
* super.c
*/
-extern int do_remount_sb(struct super_block *, int, void *, int);
+extern int do_remount_sb(struct super_block *, int, void *, int, struct fs_context *);
extern bool trylock_super(struct super_block *sb);
extern struct dentry *mount_fs(struct file_system_type *,
int, const char *, void *);
diff --git a/fs/libfs.c b/fs/libfs.c
index 7ff3cb904acd..756e552709fa 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -9,6 +9,7 @@
#include <linux/slab.h>
#include <linux/cred.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/vfs.h>
#include <linux/quotaops.h>
#include <linux/mutex.h>
@@ -574,13 +575,27 @@ static DEFINE_SPINLOCK(pin_fs_lock);

int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *count)
{
+ struct fs_context *fc;
struct vfsmount *mnt = NULL;
+ int ret;
+
spin_lock(&pin_fs_lock);
if (unlikely(!*mount)) {
spin_unlock(&pin_fs_lock);
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL);
+
+ fc = vfs_new_fs_context(type, NULL, 0, FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ mnt = ERR_PTR(ret);
+ else
+ mnt = vfs_create_mount(fc);
+ put_fs_context(fc); /* Drop the context on success and failure alike */
if (IS_ERR(mnt))
return PTR_ERR(mnt);
+
spin_lock(&pin_fs_lock);
if (!*mount)
*mount = mnt;
diff --git a/fs/namespace.c b/fs/namespace.c
index a6508e4c0a90..d6b0b0067f6d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -25,8 +25,10 @@
#include <linux/magic.h>
#include <linux/bootmem.h>
#include <linux/task_work.h>
+#include <linux/file.h>
#include <linux/sched/task.h>
#include <uapi/linux/mount.h>
+#include <linux/fs_context.h>

#include "pnode.h"
#include "internal.h"
@@ -1017,55 +1019,6 @@ static struct mount *skip_mnt_tree(struct mount *p)
return p;
}

-struct vfsmount *
-vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
-{
- struct mount *mnt;
- struct dentry *root;
-
- if (!type)
- return ERR_PTR(-ENODEV);
-
- mnt = alloc_vfsmnt(name);
- if (!mnt)
- return ERR_PTR(-ENOMEM);
-
- if (flags & SB_KERNMOUNT)
- mnt->mnt.mnt_flags = MNT_INTERNAL;
-
- root = mount_fs(type, flags, name, data);
- if (IS_ERR(root)) {
- mnt_free_id(mnt);
- free_vfsmnt(mnt);
- return ERR_CAST(root);
- }
-
- mnt->mnt.mnt_root = root;
- mnt->mnt.mnt_sb = root->d_sb;
- mnt->mnt_mountpoint = mnt->mnt.mnt_root;
- mnt->mnt_parent = mnt;
- lock_mount_hash();
- list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
- unlock_mount_hash();
- return &mnt->mnt;
-}
-EXPORT_SYMBOL_GPL(vfs_kern_mount);
-
-struct vfsmount *
-vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
- const char *name, void *data)
-{
- /* Until it is worked out how to pass the user namespace
- * through from the parent mount to the submount don't support
- * unprivileged mounts with submounts.
- */
- if (mountpoint->d_sb->s_user_ns != &init_user_ns)
- return ERR_PTR(-EPERM);
-
- return vfs_kern_mount(type, SB_SUBMOUNT, name, data);
-}
-EXPORT_SYMBOL_GPL(vfs_submount);
-
static struct mount *clone_mnt(struct mount *old, struct dentry *root,
int flag)
{
@@ -1592,7 +1545,7 @@ static int do_umount(struct mount *mnt, int flags)
return -EPERM;
down_write(&sb->s_umount);
if (!sb_rdonly(sb))
- retval = do_remount_sb(sb, SB_RDONLY, NULL, 0);
+ retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, NULL);
up_write(&sb->s_umount);
return retval;
}
@@ -2275,6 +2228,20 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
}

/*
+ * Parse the monolithic page of mount data given to sys_mount().
+ */
+static int parse_monolithic_mount_data(struct fs_context *fc, void *data)
+{
+ int (*monolithic_mount_data)(struct fs_context *, void *);
+
+ monolithic_mount_data = fc->ops->parse_monolithic;
+ if (!monolithic_mount_data)
+ monolithic_mount_data = generic_parse_monolithic;
+
+ return monolithic_mount_data(fc, data);
+}
+
+/*
* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
* on it - tough luck.
@@ -2282,9 +2249,11 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
static int do_remount(struct path *path, int ms_flags, int sb_flags,
int mnt_flags, void *data)
{
+ struct fs_context *fc = NULL;
int err;
struct super_block *sb = path->mnt->mnt_sb;
struct mount *mnt = real_mount(path->mnt);
+ struct file_system_type *type = sb->s_type;

if (!check_mnt(mnt))
return -EINVAL;
@@ -2319,9 +2288,25 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
return -EPERM;
}

- err = security_sb_remount(sb, data);
- if (err)
- return err;
+ if (type->init_fs_context) {
+ fc = vfs_sb_reconfig(path->mnt, sb_flags);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ err = parse_monolithic_mount_data(fc, data);
+ if (err < 0)
+ goto err_fc;
+
+ if (fc->ops->validate) {
+ err = fc->ops->validate(fc);
+ if (err < 0)
+ goto err_fc;
+ }
+ } else {
+ err = security_sb_remount(sb, data);
+ if (err)
+ return err;
+ }

down_write(&sb->s_umount);
if (ms_flags & MS_BIND)
@@ -2329,7 +2314,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
else if (!capable(CAP_SYS_ADMIN))
err = -EPERM;
else
- err = do_remount_sb(sb, sb_flags, data, 0);
+ err = do_remount_sb(sb, sb_flags, data, 0, fc);
if (!err) {
lock_mount_hash();
mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
@@ -2338,6 +2323,9 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
unlock_mount_hash();
}
up_write(&sb->s_umount);
+err_fc:
+ if (fc)
+ put_fs_context(fc);
return err;
}

@@ -2421,29 +2409,6 @@ static int do_move_mount(struct path *path, const char *old_name)
return err;
}

-static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
-{
- int err;
- const char *subtype = strchr(fstype, '.');
- if (subtype) {
- subtype++;
- err = -EINVAL;
- if (!subtype[0])
- goto err;
- } else
- subtype = "";
-
- mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
- err = -ENOMEM;
- if (!mnt->mnt_sb->s_subtype)
- goto err;
- return mnt;
-
- err:
- mntput(mnt);
- return ERR_PTR(err);
-}
-
/*
* add a mount into a namespace's mount tree
*/
@@ -2491,40 +2456,89 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);

/*
- * create a new mount for userspace and request it to be added into the
- * namespace's tree
+ * Create a new mount using a superblock configuration and request it
+ * be added to the namespace tree.
*/
-static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
- int mnt_flags, const char *name, void *data)
+static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
{
- struct file_system_type *type;
struct vfsmount *mnt;
- int err;
-
- if (!fstype)
- return -EINVAL;
-
- type = get_fs_type(fstype);
- if (!type)
- return -ENODEV;
+ int ret;

- mnt = vfs_kern_mount(type, sb_flags, name, data);
- if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
- !mnt->mnt_sb->s_subtype)
- mnt = fs_set_subtype(mnt, fstype);
+ ret = security_sb_mountpoint(fc, mountpoint);
+ if (ret < 0)
+ return ret;

- put_filesystem(type);
+ mnt = vfs_create_mount(fc);
if (IS_ERR(mnt))
return PTR_ERR(mnt);

+ ret = -EPERM;
if (mount_too_revealing(mnt, &mnt_flags)) {
- mntput(mnt);
- return -EPERM;
+ pr_warn("VFS: Mount too revealing\n");
+ goto err_mnt;
+ }
+
+ ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+ if (ret < 0)
+ goto err_mnt;
+ return ret;
+
+err_mnt:
+ mntput(mnt);
+ return ret;
+}
+
+/*
+ * create a new mount for userspace and request it to be added into the
+ * namespace's tree
+ */
+static int do_new_mount(struct path *mountpoint, const char *fstype,
+ int sb_flags, int mnt_flags, const char *name,
+ void *data)
+{
+ struct file_system_type *fs_type;
+ struct fs_context *fc;
+ int err = -EINVAL;
+
+ if (!fstype)
+ goto err;
+
+ err = -ENODEV;
+ fs_type = get_fs_type(fstype);
+ if (!fs_type)
+ goto err;
+
+ fc = vfs_new_fs_context(fs_type, NULL, sb_flags,
+ FS_CONTEXT_FOR_USER_MOUNT);
+ put_filesystem(fs_type);
+ if (IS_ERR(fc)) {
+ err = PTR_ERR(fc);
+ goto err;
}

- err = do_add_mount(real_mount(mnt), path, mnt_flags);
+ err = vfs_set_fs_source(fc, name, name ? strlen(name) : 0);
+ if (err < 0)
+ goto err_fc;
+
+ err = parse_monolithic_mount_data(fc, data);
+ if (err < 0)
+ goto err_fc;
+
+ err = vfs_get_tree(fc);
+ if (err < 0)
+ goto err_fc;
+
+ err = do_new_mount_fc(fc, mountpoint, mnt_flags);
if (err)
- mntput(mnt);
+ goto err_fc;
+
+ put_fs_context(fc);
+ return 0;
+
+err_fc:
+ put_fs_context(fc);
+err:
return err;
}

@@ -3063,6 +3077,116 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
return ret;
}

+/**
+ * vfs_create_mount - Create a mount for a configured superblock
+ * @fc: The configuration context with the superblock attached
+ *
+ * Create a mount to an already configured superblock. If necessary, the
+ * caller should invoke vfs_get_tree() before calling this.
+ *
+ * Note that this does not attach the mount to anything.
+ */
+struct vfsmount *vfs_create_mount(struct fs_context *fc)
+{
+ struct mount *mnt;
+
+ if (!fc->root)
+ return ERR_PTR(-EINVAL);
+
+ mnt = alloc_vfsmnt(fc->source ?: "none");
+ if (!mnt)
+ return ERR_PTR(-ENOMEM);
+
+ if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+ /* It's a long-term mount; don't release mnt until it is unmounted,
+ * which must happen before the filesystem is unregistered.
+ */
+ mnt->mnt.mnt_flags = MNT_INTERNAL;
+
+ atomic_inc(&fc->root->d_sb->s_active);
+ mnt->mnt.mnt_sb = fc->root->d_sb;
+ mnt->mnt.mnt_root = dget(fc->root);
+ mnt->mnt_mountpoint = mnt->mnt.mnt_root;
+ mnt->mnt_parent = mnt;
+
+ lock_mount_hash();
+ list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
+ unlock_mount_hash();
+ return &mnt->mnt;
+}
+EXPORT_SYMBOL(vfs_create_mount);
+
+struct vfsmount *vfs_kern_mount(struct file_system_type *type,
+ int sb_flags, const char *devname, void *data)
+{
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ if (!type)
+ return ERR_PTR(-EINVAL);
+
+ fc = vfs_new_fs_context(type, NULL, sb_flags,
+ sb_flags & SB_KERNMOUNT ?
+ FS_CONTEXT_FOR_KERNEL_MOUNT :
+ FS_CONTEXT_FOR_USER_MOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ ret = vfs_set_fs_source(fc, devname, devname ? strlen(devname) : 0);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = parse_monolithic_mount_data(fc, data);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto err_fc;
+ }
+
+ put_fs_context(fc);
+ return mnt;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(vfs_kern_mount);
+
+struct vfsmount *
+vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
+ const char *name, void *data)
+{
+ /* Until it is worked out how to pass the user namespace
+ * through from the parent mount to the submount don't support
+ * unprivileged mounts with submounts.
+ */
+ if (mountpoint->d_sb->s_user_ns != &init_user_ns)
+ return ERR_PTR(-EPERM);
+
+ return vfs_kern_mount(type, SB_SUBMOUNT, name, data);
+}
+EXPORT_SYMBOL_GPL(vfs_submount);
+
+struct vfsmount *kern_mount(struct file_system_type *type)
+{
+ return vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL);
+}
+EXPORT_SYMBOL_GPL(kern_mount);
+
+struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
+{
+ return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data);
+}
+EXPORT_SYMBOL_GPL(kern_mount_data);
+
/*
* Return true if path is reachable from root
*
@@ -3283,21 +3407,6 @@ void put_mnt_ns(struct mnt_namespace *ns)
free_mnt_ns(ns);
}

-struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
-{
- struct vfsmount *mnt;
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data);
- if (!IS_ERR(mnt)) {
- /*
- * it is a longterm mount, don't release mnt until
- * we unmount before file sys is unregistered
- */
- real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
- }
- return mnt;
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
void kern_unmount(struct vfsmount *mnt)
{
/* release long term mount so mount point can be released */
diff --git a/fs/super.c b/fs/super.c
index 02da00410de8..e7d411d1d435 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -35,6 +35,7 @@
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
#include <uapi/linux/mount.h>
+#include <linux/fs_context.h>
#include "internal.h"


@@ -173,16 +174,13 @@ static void destroy_super(struct super_block *s)
}

/**
- * alloc_super - create new superblock
- * @type: filesystem type superblock should belong to
- * @flags: the mount flags
- * @user_ns: User namespace for the super_block
+ * alloc_super - Create new superblock
+ * @fc: The filesystem configuration context
*
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
-static struct super_block *alloc_super(struct file_system_type *type, int flags,
- struct user_namespace *user_ns)
+static struct super_block *alloc_super(struct fs_context *fc)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static const struct super_operations default_op;
@@ -192,7 +190,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
return NULL;

INIT_LIST_HEAD(&s->s_mounts);
- s->s_user_ns = get_user_ns(user_ns);
+ s->s_user_ns = get_user_ns(fc->user_ns);

if (security_sb_alloc(s))
goto fail;
@@ -200,12 +198,12 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
for (i = 0; i < SB_FREEZE_LEVELS; i++) {
if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
sb_writers_name[i],
- &type->s_writers_key[i]))
+ &fc->fs_type->s_writers_key[i]))
goto fail;
}
init_waitqueue_head(&s->s_writers.wait_unfrozen);
s->s_bdi = &noop_backing_dev_info;
- s->s_flags = flags;
+ s->s_flags = fc->sb_flags;
if (s->s_user_ns != &init_user_ns)
s->s_iflags |= SB_I_NODEV;
INIT_HLIST_NODE(&s->s_instances);
@@ -222,7 +220,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
goto fail;

init_rwsem(&s->s_umount);
- lockdep_set_class(&s->s_umount, &type->s_umount_key);
+ lockdep_set_class(&s->s_umount, &fc->fs_type->s_umount_key);
/*
* sget() can have s_umount recursion.
*
@@ -242,7 +240,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_count = 1;
atomic_set(&s->s_active, 1);
mutex_init(&s->s_vfs_rename_mutex);
- lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
+ lockdep_set_class(&s->s_vfs_rename_mutex, &fc->fs_type->s_vfs_rename_key);
init_rwsem(&s->s_dquot.dqio_sem);
s->s_maxbytes = MAX_NON_LFS;
s->s_op = &default_op;
@@ -455,6 +453,96 @@ void generic_shutdown_super(struct super_block *sb)
EXPORT_SYMBOL(generic_shutdown_super);

/**
+ * sget_fc - Find or create a superblock
+ * @fc: Filesystem context.
+ * @test: Comparison callback
+ * @set: Setup callback
+ *
+ * Find or create a superblock using the parameters stored in the filesystem
+ * context and the two callback functions.
+ *
+ * If an extant superblock is matched, then that will be returned with an
+ * elevated reference count that the caller must transfer or discard.
+ *
+ * If no match is made, a new superblock will be allocated and basic
+ * initialisation will be performed (s_type, s_fs_info and s_id will be set and
+ * the set() callback will be invoked), the superblock will be published and it
+ * will be returned in a partially constructed state with SB_BORN and SB_ACTIVE
+ * as yet unset.
+ */
+struct super_block *sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *, struct fs_context *),
+ int (*set)(struct super_block *, struct fs_context *))
+{
+ struct super_block *s = NULL;
+ struct super_block *old;
+ int err;
+
+ if (!(fc->sb_flags & SB_KERNMOUNT) &&
+ fc->purpose != FS_CONTEXT_FOR_SUBMOUNT) {
+ /* Don't allow mounting unless the caller has CAP_SYS_ADMIN
+ * over the namespace.
+ */
+ if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT) &&
+ !capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ else if (!ns_capable(fc->user_ns, CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ }
+
+retry:
+ spin_lock(&sb_lock);
+ if (test) {
+ hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
+ if (!test(old, fc))
+ continue;
+ if (fc->user_ns != old->s_user_ns) {
+ spin_unlock(&sb_lock);
+ if (s) {
+ up_write(&s->s_umount);
+ destroy_super(s);
+ }
+ return ERR_PTR(-EBUSY);
+ }
+ if (!grab_super(old))
+ goto retry;
+ if (s) {
+ up_write(&s->s_umount);
+ destroy_super(s);
+ s = NULL;
+ }
+ return old;
+ }
+ }
+ if (!s) {
+ spin_unlock(&sb_lock);
+ s = alloc_super(fc);
+ if (!s)
+ return ERR_PTR(-ENOMEM);
+ goto retry;
+ }
+
+ s->s_fs_info = fc->s_fs_info;
+ err = set(s, fc);
+ if (err) {
+ s->s_fs_info = NULL;
+ spin_unlock(&sb_lock);
+ up_write(&s->s_umount);
+ destroy_super(s);
+ return ERR_PTR(err);
+ }
+ s->s_type = fc->fs_type;
+ strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
+ list_add_tail(&s->s_list, &super_blocks);
+ hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
+ spin_unlock(&sb_lock);
+ get_filesystem(s->s_type);
+ register_shrinker(&s->s_shrink);
+ return s;
+}
+EXPORT_SYMBOL(sget_fc);
+
+/**
* sget_userns - find or create a superblock
* @type: filesystem type superblock should belong to
* @test: comparison callback
@@ -503,7 +591,14 @@ struct super_block *sget_userns(struct file_system_type *type,
}
if (!s) {
spin_unlock(&sb_lock);
- s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
+ {
+ struct fs_context fc = {
+ .fs_type = type,
+ .sb_flags = flags & ~SB_SUBMOUNT,
+ .user_ns = user_ns,
+ };
+ s = alloc_super(&fc);
+ }
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -805,10 +900,13 @@ struct super_block *user_get_super(dev_t dev)
* @sb_flags: revised superblock flags
* @data: the rest of options
* @force: whether or not to force the change
+ * @fc: the superblock config for filesystems that support it
+ * (NULL if called from emergency or umount)
*
* Alters the mount options of a mounted file system.
*/
-int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force)
+int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force,
+ struct fs_context *fc)
{
int retval;
int remount_ro;
@@ -850,8 +948,14 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force)
}
}

- if (sb->s_op->remount_fs) {
- retval = sb->s_op->remount_fs(sb, &sb_flags, data);
+ if (sb->s_op->remount_fs_fc ||
+ sb->s_op->remount_fs) {
+ if (sb->s_op->remount_fs_fc) {
+ retval = sb->s_op->remount_fs_fc(sb, fc);
+ sb_flags = fc->sb_flags;
+ } else {
+ retval = sb->s_op->remount_fs(sb, &sb_flags, data);
+ }
if (retval) {
if (!force)
goto cancel_readonly;
@@ -898,7 +1002,7 @@ static void do_emergency_remount(struct work_struct *work)
/*
* What lock protects sb->s_flags??
*/
- do_remount_sb(sb, SB_RDONLY, NULL, 1);
+ do_remount_sb(sb, SB_RDONLY, NULL, 1, NULL);
}
up_write(&sb->s_umount);
spin_lock(&sb_lock);
@@ -1048,6 +1152,89 @@ struct dentry *mount_ns(struct file_system_type *fs_type,

EXPORT_SYMBOL(mount_ns);

+static int set_anon_super_fc(struct super_block *sb, struct fs_context *fc)
+{
+ return set_anon_super(sb, NULL);
+}
+
+static int test_keyed_super(struct super_block *sb, struct fs_context *fc)
+{
+ return sb->s_fs_info == fc->s_fs_info;
+}
+
+static int test_single_super(struct super_block *s, struct fs_context *fc)
+{
+ return 1;
+}
+
+/**
+ * vfs_get_super - Get a superblock with a search key set in s_fs_info.
+ * @fc: The filesystem context holding the parameters
+ * @keying: How to distinguish superblocks
+ * @fill_super: Helper to initialise a new superblock
+ *
+ * Search for a superblock and create a new one if not found. The search
+ * criterion is controlled by @keying. If the search fails, a new superblock
+ * is created and @fill_super() is called to initialise it.
+ *
+ * @keying can take one of a number of values:
+ *
+ * (1) vfs_get_single_super - Only one superblock of this type may exist on the
+ * system. This is typically used for special system filesystems.
+ *
+ * (2) vfs_get_keyed_super - Multiple superblocks may exist, but they must have
+ * distinct keys (where the key is in s_fs_info). Searching for the same
+ * key again will turn up the superblock for that key.
+ *
+ * (3) vfs_get_independent_super - Multiple superblocks may exist and are
+ * unkeyed. Each call will get a new superblock.
+ *
+ * A permissions check is made by sget_fc() unless we're getting a superblock
+ * for a kernel-internal mount or a submount.
+ */
+int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc))
+{
+ int (*test)(struct super_block *, struct fs_context *);
+ struct super_block *sb;
+
+ switch (keying) {
+ case vfs_get_single_super:
+ test = test_single_super;
+ break;
+ case vfs_get_keyed_super:
+ test = test_keyed_super;
+ break;
+ case vfs_get_independent_super:
+ test = NULL;
+ break;
+ default:
+ BUG();
+ }
+
+ sb = sget_fc(fc, test, set_anon_super_fc);
+ if (IS_ERR(sb))
+ return PTR_ERR(sb);
+
+ if (!sb->s_root) {
+ int err;
+ err = fill_super(sb, fc);
+ if (err) {
+ deactivate_locked_super(sb);
+ return err;
+ }
+
+ sb->s_flags |= SB_ACTIVE;
+ }
+
+ if (!fc->root)
+ fc->root = dget(sb->s_root);
+ return 0;
+}
+EXPORT_SYMBOL(vfs_get_super);
+
#ifdef CONFIG_BLOCK
static int set_bdev_super(struct super_block *s, void *data)
{
@@ -1196,7 +1383,7 @@ struct dentry *mount_single(struct file_system_type *fs_type,
}
s->s_flags |= SB_ACTIVE;
} else {
- do_remount_sb(s, flags, data, 0);
+ do_remount_sb(s, flags, data, 0, NULL);
}
return dget(s->s_root);
}
@@ -1529,3 +1716,76 @@ int thaw_super(struct super_block *sb)
return 0;
}
EXPORT_SYMBOL(thaw_super);
+
+/**
+ * vfs_get_tree - Get the mountable root
+ * @fc: The superblock configuration context.
+ *
+ * The filesystem is invoked to get or create a superblock which can then later
+ * be used for mounting. The filesystem places a pointer to the root to be
+ * used for mounting in @fc->root.
+ */
+int vfs_get_tree(struct fs_context *fc)
+{
+ struct super_block *sb;
+ int ret;
+
+ if (fc->root)
+ return -EBUSY;
+
+ if (fc->ops->validate) {
+ ret = fc->ops->validate(fc);
+ if (ret < 0)
+ return ret;
+ }
+
+ /* The filesystem may transfer preallocated resources from the
+ * configuration context to the superblock, thereby rendering the
+ * config unusable for another attempt at creation if this one fails.
+ */
+ if (fc->degraded)
+ return -EBUSY;
+
+ /* Get the mountable root in fc->root, with a ref on the root and a ref
+ * on the superblock.
+ */
+ ret = fc->ops->get_tree(fc);
+ if (ret < 0)
+ return ret;
+
+ BUG_ON(!fc->root);
+ sb = fc->root->d_sb;
+ WARN_ON(!sb->s_bdi);
+
+ ret = security_sb_get_tree(fc);
+ if (ret < 0)
+ goto err_sb;
+
+ ret = -ENOMEM;
+ if (fc->subtype && !sb->s_subtype) {
+ sb->s_subtype = kstrdup(fc->subtype, GFP_KERNEL);
+ if (!sb->s_subtype)
+ goto err_sb;
+ }
+
+ sb->s_flags |= SB_BORN;
+
+ /* Filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
+ * but s_maxbytes was an unsigned long long for many releases. Throw
+ * this warning for a little while to try and catch filesystems that
+ * violate this rule.
+ */
+ WARN(sb->s_maxbytes < 0,
+ "%s set sb->s_maxbytes to negative value (%lld)\n",
+ fc->fs_type->name, sb->s_maxbytes);
+
+ up_write(&sb->s_umount);
+ return 0;
+
+err_sb:
+ dput(fc->root);
+ fc->root = NULL;
+ deactivate_locked_super(sb);
+ return ret;
+}
+EXPORT_SYMBOL(vfs_get_tree);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bd2ee00e03ff..f391263c62a1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -58,6 +58,7 @@ struct workqueue_struct;
struct iov_iter;
struct fscrypt_info;
struct fscrypt_operations;
+struct fs_context;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -717,6 +718,11 @@ static inline void inode_unlock(struct inode *inode)
up_write(&inode->i_rwsem);
}

+static inline int inode_lock_killable(struct inode *inode)
+{
+ return down_write_killable(&inode->i_rwsem);
+}
+
static inline void inode_lock_shared(struct inode *inode)
{
down_read(&inode->i_rwsem);
@@ -1814,6 +1820,7 @@ struct super_operations {
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
+ int (*remount_fs_fc) (struct super_block *, struct fs_context *);
void (*umount_begin) (struct super_block *);

int (*show_options)(struct seq_file *, struct dentry *);
@@ -2072,8 +2079,10 @@ struct file_system_type {
#define FS_HAS_SUBTYPE 4
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
+ unsigned short fs_context_size; /* Size of superblock config context to allocate */
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *);
+ int (*init_fs_context)(struct fs_context *, struct super_block *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
@@ -2113,6 +2122,9 @@ void deactivate_locked_super(struct super_block *sb);
int set_anon_super(struct super_block *s, void *data);
int get_anon_bdev(dev_t *);
void free_anon_bdev(dev_t);
+struct super_block *sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *, struct fs_context *),
+ int (*set)(struct super_block *, struct fs_context *));
struct super_block *sget_userns(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
@@ -2155,8 +2167,8 @@ mount_pseudo(struct file_system_type *fs_type, char *name,

extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
-extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
-#define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount(struct file_system_type *);
+extern struct vfsmount *kern_mount_data(struct file_system_type *, void *);
extern void kern_unmount(struct vfsmount *mnt);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 645c57e10764..8af6ff0e869e 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -27,9 +27,10 @@ struct user_namespace;
struct vfsmount;

enum fs_context_purpose {
- FS_CONTEXT_FOR_NEW, /* New superblock for direct mount */
+ FS_CONTEXT_FOR_USER_MOUNT, /* New superblock for user-specified mount */
+ FS_CONTEXT_FOR_KERNEL_MOUNT, /* New superblock for kernel-internal mount */
FS_CONTEXT_FOR_SUBMOUNT, /* New superblock for automatic submount */
- FS_CONTEXT_FOR_REMOUNT, /* Superblock reconfiguration for remount */
+ FS_CONTEXT_FOR_REMOUNT, /* Superblock reconfiguration for remount */
};

/*
@@ -53,7 +54,8 @@ struct fs_context {
char *source; /* The source name (eg. device) */
char *subtype; /* The subtype to set on the superblock */
void *security; /* The LSM context */
- unsigned int sb_flags; /* The superblock flags (MS_*) */
+ void *s_fs_info; /* Proposed s_fs_info */
+ unsigned int sb_flags; /* Proposed superblock flags (SB_*) */
bool sloppy; /* Unrecognised options are okay */
bool silent;
bool degraded; /* True if the context can't be reused */
@@ -70,4 +72,33 @@ struct fs_context_operations {
int (*get_tree)(struct fs_context *fc);
};

+/*
+ * fs_context manipulation functions.
+ */
+extern struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct super_block *src_sb,
+ unsigned int ms_flags,
+ enum fs_context_purpose purpose);
+extern struct fs_context *vfs_sb_reconfig(struct vfsmount *mnt,
+ unsigned int ms_flags);
+extern struct fs_context *vfs_dup_fs_context(struct fs_context *src);
+extern int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t slen);
+extern int vfs_parse_mount_option(struct fs_context *fc, char *data);
+extern int generic_parse_monolithic(struct fs_context *fc, void *data);
+extern int vfs_get_tree(struct fs_context *fc);
+extern void put_fs_context(struct fs_context *fc);
+
+/*
+ * sget() wrapper to be called from the ->get_tree() op.
+ */
+enum vfs_get_super_keying {
+ vfs_get_single_super, /* Only one such superblock may exist */
+ vfs_get_keyed_super, /* Superblocks with different s_fs_info keys may exist */
+ vfs_get_independent_super, /* Multiple independent superblocks may exist */
+};
+extern int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc));
+
#endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 85398ba0b533..74aeccb041a2 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -91,7 +91,7 @@
* @fs_context_free:
* Clean up a filesystem context.
* @fc indicates the filesystem context.
- * @fs_context_parse_one:
+ * @fs_context_parse_option:
* Userspace provided an option to configure a superblock. The LSM may
* reject it with an error and may use it for itself, in which case it
* should return 1; otherwise it should return 0 to pass it on to the
@@ -1419,7 +1419,7 @@ union security_list_options {
int (*fs_context_alloc)(struct fs_context *fc, struct super_block *src_sb);
int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_sc);
void (*fs_context_free)(struct fs_context *fc);
- int (*fs_context_parse_one)(struct fs_context *fc, char *opt);
+ int (*fs_context_parse_option)(struct fs_context *fc, char *opt);
int (*sb_get_tree)(struct fs_context *fc);
int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint);

@@ -1745,7 +1745,7 @@ struct security_hook_heads {
struct list_head fs_context_alloc;
struct list_head fs_context_dup;
struct list_head fs_context_free;
- struct list_head fs_context_parse_one;
+ struct list_head fs_context_parse_option;
struct list_head sb_get_tree;
struct list_head sb_mountpoint;
struct list_head sb_alloc_security;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1ce85e6fd95f..f47306b4bf72 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -20,6 +20,7 @@ struct super_block;
struct vfsmount;
struct dentry;
struct mnt_namespace;
+struct fs_context;

#define MNT_NOSUID 0x01
#define MNT_NODEV 0x02
@@ -87,6 +88,7 @@ struct path;
extern struct vfsmount *clone_private_mount(const struct path *path);

struct file_system_type;
+extern struct vfsmount *vfs_create_mount(struct fs_context *fc);
extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
int flags, const char *name,
void *data);
diff --git a/security/security.c b/security/security.c
index 55383a0e764d..7826a493c02a 100644
--- a/security/security.c
+++ b/security/security.c
@@ -366,9 +366,9 @@ void security_fs_context_free(struct fs_context *fc)
call_void_hook(fs_context_free, fc);
}

-int security_fs_context_parse_one(struct fs_context *fc, char *opt)
+int security_fs_context_parse_option(struct fs_context *fc, char *p)
{
- return call_int_hook(fs_context_parse_one, 0, fc, opt);
+ return call_int_hook(fs_context_parse_option, 0, fc, p);
}

int security_sb_get_tree(struct fs_context *fc)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 0dda7350b5af..ca57e61f9c43 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2927,7 +2927,7 @@ static void selinux_fs_context_free(struct fs_context *fc)
fc->security = NULL;
}

-static int selinux_fs_context_parse_one(struct fs_context *fc, char *opt)
+static int selinux_fs_context_parse_option(struct fs_context *fc, char *opt)
{
struct security_mnt_opts *opts = fc->security;
substring_t args[MAX_OPT_ARGS];
@@ -3014,7 +3014,7 @@ static int selinux_sb_get_tree(struct fs_context *fc)
return rc;

/* Allow all mounts performed by the kernel */
- if (fc->sb_flags & MS_KERNMOUNT)
+ if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
return 0;

ad.type = LSM_AUDIT_DATA_DENTRY;
@@ -6445,7 +6445,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
- LSM_HOOK_INIT(fs_context_parse_one, selinux_fs_context_parse_one),
+ LSM_HOOK_INIT(fs_context_parse_option, selinux_fs_context_parse_option),
LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
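
Editorial sketch: the three keying modes that vfs_get_super() selects between in the patch above can be modelled in userspace roughly as follows. This is not kernel code. Every mock_* name and the GET_* enumerators are hypothetical stand-ins, and the step that copies the proposed key into the new superblock is an assumption about what sget_fc() does, since that function's body is not shown here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the kernel types; not kernel code. */
enum mock_keying { GET_SINGLE, GET_KEYED, GET_INDEPENDENT };

struct mock_sb {
	void *s_fs_info;		/* search key, as in the patch */
	struct mock_sb *next;
};

struct mock_ctx {
	void *s_fs_info;		/* proposed key */
};

static struct mock_sb *sb_list;		/* all superblocks of one fs type */

/* Single mode: any existing superblock matches. */
static int test_single(struct mock_sb *sb, struct mock_ctx *fc)
{
	(void)sb;
	(void)fc;
	return 1;
}

/* Keyed mode: match only on an identical s_fs_info key. */
static int test_keyed(struct mock_sb *sb, struct mock_ctx *fc)
{
	return sb->s_fs_info == fc->s_fs_info;
}

/* Return a matching superblock or allocate a new one (NULL on ENOMEM). */
static struct mock_sb *mock_get_super(struct mock_ctx *fc,
				      enum mock_keying keying)
{
	int (*test)(struct mock_sb *, struct mock_ctx *) = NULL;
	struct mock_sb *sb;

	switch (keying) {
	case GET_SINGLE:
		test = test_single;
		break;
	case GET_KEYED:
		test = test_keyed;
		break;
	case GET_INDEPENDENT:
		test = NULL;		/* never search, always create */
		break;
	}

	if (test)
		for (sb = sb_list; sb; sb = sb->next)
			if (test(sb, fc))
				return sb;

	sb = calloc(1, sizeof(*sb));
	if (!sb)
		return NULL;
	sb->s_fs_info = fc->s_fs_info;	/* assumption: key copied on creation */
	sb->next = sb_list;
	sb_list = sb;
	return sb;
}
```

With this model, two keyed lookups using the same key return the same object, whereas an independent lookup always yields a fresh one, which mirrors the behaviour described in the vfs_get_super() kerneldoc.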



2017-10-06 15:49:16

by David Howells

Subject: [PATCH 01/14] VFS: Introduce the structs and doc for a filesystem context [ver #6]

Introduce a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount. This is
allocated at the beginning of the mount procedure and into it is placed:

(1) Filesystem type.

(2) Namespaces.

(3) Device name.

(4) Superblock flags (MS_*).

(5) Security details.

(6) Filesystem-specific data, as set by the mount options.

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/mounting.txt | 432 ++++++++++++++++++++++++++++++++
include/linux/fs_context.h | 73 +++++
2 files changed, 505 insertions(+)
create mode 100644 Documentation/filesystems/mounting.txt
create mode 100644 include/linux/fs_context.h
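
Editorial sketch: the documentation added by this patch asks each filesystem to embed struct fs_context at the start of its own context struct and to advertise the total size in fs_context_size. A minimal userspace model of that arrangement might look like the following. The mock_* types, "examplefs" and its example_option field are all hypothetical, and the allocation helper is only an assumption about how the VFS side could use the size field.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace approximation of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mock_fs_context {		/* stand-in for struct fs_context */
	unsigned int sb_flags;
};

struct examplefs_fs_context {		/* hypothetical filesystem wrapper */
	struct mock_fs_context fc;	/* must be the first member */
	int example_option;		/* filesystem-private data */
};

struct mock_fs_type {			/* stand-in for struct file_system_type */
	unsigned short fs_context_size;
	int (*init_fs_context)(struct mock_fs_context *fc);
};

/* Filesystem side: recover the wrapper and set up the private part. */
static int examplefs_init_fs_context(struct mock_fs_context *fc)
{
	struct examplefs_fs_context *ctx =
		container_of(fc, struct examplefs_fs_context, fc);

	ctx->example_option = 42;
	return 0;
}

static struct mock_fs_type examplefs = {
	.fs_context_size = sizeof(struct examplefs_fs_context),
	.init_fs_context = examplefs_init_fs_context,
};

/* VFS side (assumed): allocate fs_context_size zeroed bytes, then call in. */
static struct mock_fs_context *mock_new_fs_context(struct mock_fs_type *type)
{
	struct mock_fs_context *fc = calloc(1, type->fs_context_size);

	if (!fc)
		return NULL;
	if (type->init_fs_context(fc) < 0) {
		free(fc);
		return NULL;
	}
	return fc;
}
```

Because the generic struct sits at offset zero, the pointer handed to the filesystem and the pointer to the wrapper are the same address, so a single allocation covers both parts.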

diff --git a/Documentation/filesystems/mounting.txt b/Documentation/filesystems/mounting.txt
new file mode 100644
index 000000000000..8c0b0351e949
--- /dev/null
+++ b/Documentation/filesystems/mounting.txt
@@ -0,0 +1,432 @@
+ ===================
+ FILESYSTEM MOUNTING
+ ===================
+
+CONTENTS
+
+ (1) Overview.
+
+ (2) The filesystem context.
+
+ (3) The filesystem context operations.
+
+ (4) Filesystem context security.
+
+ (5) VFS filesystem context operations.
+
+
+========
+OVERVIEW
+========
+
+The creation of new mounts is now to be done in a multistep process:
+
+ (1) Create a filesystem context.
+
+ (2) Parse the options and attach them to the context. Options may be passed
+ individually from userspace.
+
+ (3) Validate and pre-process the context.
+
+ (4) Get or create a superblock and mountable root.
+
+ (5) Perform the mount.
+
+ (6) Return an error message attached to the context.
+
+ (7) Destroy the context.
+
+To support this, the file_system_type struct gains two new fields:
+
+ unsigned short fs_context_size;
+
+which indicates the total amount of space that should be allocated for context
+data (see the Filesystem Context section), and:
+
+ int (*init_fs_context)(struct fs_context *fc, struct super_block *src_sb);
+
+which is invoked to set up the filesystem-specific parts of a filesystem
+context, including the additional space. The src_sb parameter is used to
+convey the superblock from which the filesystem may draw extra information
+(such as namespaces) for submount (FS_CONTEXT_FOR_SUBMOUNT) or remount
+(FS_CONTEXT_FOR_REMOUNT) purposes - otherwise it will be NULL.
+
+Note that security initialisation is done *after* the filesystem is called so
+that the namespaces may be adjusted first.
+
+And the super_operations struct gains one field:
+
+ int (*remount_fs_fc) (struct super_block *, struct fs_context *);
+
+This shadows the ->remount_fs() operation and takes a prepared filesystem
+context instead of the mount flags and data page. It may modify the sb_flags
+in the context for the caller to pick up.
+
+[NOTE] remount_fs_fc is intended as a replacement for remount_fs.
+
+
+======================
+THE FILESYSTEM CONTEXT
+======================
+
+The creation and reconfiguration of a superblock is governed by a filesystem
+context. This is represented by the fs_context structure:
+
+ struct fs_context {
+ const struct fs_context_operations *ops;
+ struct file_system_type *fs_type;
+ struct dentry *root;
+ struct user_namespace *user_ns;
+ struct net *net_ns;
+ const struct cred *cred;
+ char *source;
+ char *subtype;
+ void *security;
+ unsigned int sb_flags;
+ bool sloppy;
+ bool silent;
+ bool degraded;
+ enum fs_context_purpose purpose : 8;
+ };
+
+When the VFS creates this, it allocates ->fs_context_size bytes (as specified
+by the file_system_type object) to hold both the fs_context struct and any
+extra data required by the filesystem. The fs_context struct is placed at the
+beginning of this space. Any extra space beyond that is for use by the
+filesystem. The filesystem should wrap the struct in its own, e.g.:
+
+ struct nfs_fs_context {
+ struct fs_context fc;
+ ...
+ };
+
+placing the fs_context struct first. container_of() can then be used. The
+file_system_type would be initialised thus:
+
+ struct file_system_type nfs = {
+ ...
+ .fs_context_size = sizeof(struct nfs_fs_context),
+ .init_fs_context = nfs_init_fs_context,
+ ...
+ };
+
+The fs_context fields are as follows:
+
+ (*) const struct fs_context_operations *ops
+
+ These are operations that can be done on a filesystem context (see
+ below). This must be set by the ->init_fs_context() file_system_type
+ operation.
+
+ (*) struct file_system_type *fs_type
+
+ A pointer to the file_system_type of the filesystem that is being
+ constructed or reconfigured. This retains a reference on the type owner.
+
+ (*) struct dentry *root
+
+ A pointer to the root of the mountable tree (and indirectly, the
+ superblock thereof). This is filled in by the ->get_tree() op.
+
+ (*) struct user_namespace *user_ns
+ (*) struct net *net_ns
+
+ These are a subset of the namespaces in use by the invoking process. They
+ retain references on each namespace. The subscribed namespaces may be
+ replaced by the filesystem to reflect other sources, such as the parent
+ mount superblock on an automount.
+
+ (*) struct cred *cred
+
+ The mounter's credentials. This retains a reference on the credentials.
+
+ (*) char *source
+
+ This is the device to be mounted. It may be a block device
+ (e.g. /dev/sda1) or something more exotic, such as the "host:/path" that
+ NFS desires.
+
+ (*) char *subtype
+
+ This is a string to be added to the type displayed in /proc/mounts to
+ qualify it (used by FUSE). This is available for the filesystem to set if
+ desired.
+
+ (*) void *security
+
+ A place for the LSMs to hang their security data for the superblock. The
+ relevant security operations are described below.
+
+ (*) unsigned int sb_flags
+
+ This holds the SB_* flags to be set in super_block::s_flags.
+
+ (*) bool sloppy
+ (*) bool silent
+
+ These are set if the sloppy or silent mount options are given.
+
+ [NOTE] sloppy is probably unnecessary when userspace passes over one
+ option at a time since the error can just be ignored if userspace deems it
+ to be unimportant.
+
+ [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
+
+ (*) bool degraded
+
+ This is set if any preallocated resources in the context have been used
+ up, thereby rendering it unreusable for the ->get_tree() op.
+
+ (*) enum fs_context_purpose
+
+ This indicates the purpose for which the context is intended. The
+ available values are:
+
+ FS_CONTEXT_FOR_NEW -- New mount
+ FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
+ FS_CONTEXT_FOR_REMOUNT -- Change an existing mount
+
+The mount context is created by calling vfs_new_fs_context(), vfs_sb_reconfig()
+or vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
+structure is not refcounted.
+
+VFS, security and filesystem mount options are set individually with
+vfs_parse_mount_option(). Options provided by the old mount(2) system call as
+a page of data can be parsed with generic_parse_monolithic().
+
+When mounting, the filesystem is allowed to take data from any of the pointers
+and attach it to the superblock (or whatever), provided it clears the pointer
+in the mount context.
+
+The filesystem is also allowed to allocate resources and pin them with the
+mount context. For instance, NFS might pin the appropriate protocol version
+module.
+
+
+=================================
+THE FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+The filesystem context points to a table of operations:
+
+ struct fs_context_operations {
+ void (*free)(struct fs_context *fc);
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ int (*parse_source)(struct fs_context *fc);
+ int (*parse_option)(struct fs_context *fc, char *p);
+ int (*parse_monolithic)(struct fs_context *fc, void *data);
+ int (*validate)(struct fs_context *fc);
+ int (*get_tree)(struct fs_context *fc);
+ };
+
+These operations are invoked by the various stages of the mount procedure to
+manage the filesystem context. They are as follows:
+
+ (*) void (*free)(struct fs_context *fc);
+
+ Called to clean up the filesystem-specific part of the filesystem context
+ when the context is destroyed. It should be aware that parts of the
+ context may have been removed and NULL'd out by ->get_tree().
+
+ (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+
+ Called when a filesystem context has been duplicated to get any refs or
+ copy any non-referenced resources held in the filesystem-specific part of
+ the filesystem context. An error may be returned to indicate failure to
+ do this.
+
+ [!] Note that even if this fails, put_fs_context() will be called
+ immediately thereafter, so ->dup() *must* make the
+ filesystem-specific part safe for ->free().
+
+ (*) int (*parse_source)(struct fs_context *fc);
+
+ Called when the source or device is specified for a filesystem context.
+ The string will have been stored in fc->source prior to calling. If
+ successful, 0 should be returned and a negative error code otherwise.
+
+ (*) int (*parse_option)(struct fs_context *fc, char *p);
+
+ Called when an option is to be added to the filesystem context. p points
+ to the option string, likely in "key[=val]" format. VFS-specific options
+ will have been weeded out and fc->sb_flags updated in the context.
+ Security options will also have been weeded out and fc->security updated.
+
+ If successful, 0 should be returned and a negative error code otherwise.
+
+ (*) int (*parse_monolithic)(struct fs_context *fc, void *data);
+
+ Called when the mount(2) system call is invoked to pass the entire data
+ page in one go. If this is expected to be just a list of "key[=val]"
+ items separated by commas, then this may be set to NULL.
+
+ The return value is as for ->parse_option().
+
+ If the filesystem (eg. NFS) needs to examine the data first and then finds
+ it's the standard key-val list then it may pass it off to
+ generic_parse_monolithic().
+
+ (*) int (*validate)(struct fs_context *fc);
+
+ Called when all the options have been applied and the mount is about to
+ take place. It should check for inconsistencies from mount options and
+ it is also allowed to do preliminary resource acquisition. For instance,
+ the core NFS module could load the NFS protocol module here.
+
+ Note that if fc->purpose == FS_CONTEXT_FOR_REMOUNT, some of the options
+ necessary for a new mount may not be set.
+
+ The return value is as for ->parse_option().
+
+ (*) int (*get_tree)(struct fs_context *fc);
+
+ Called to get or create the mountable root and superblock, using the
+ information stored in the filesystem context (remounts go
+ via a different vector). It may detach any resources it desires from the
+ filesystem context and transfer them to the superblock it
+ creates.
+
+ On success it should set fc->root to the mountable root and return 0. In
+ the case of an error, it should return a negative error code.
+
+
+===========================
+FILESYSTEM CONTEXT SECURITY
+===========================
+
+The filesystem context contains a security pointer that the LSMs can use for
+building up a security context for the superblock to be mounted. There are a
+number of operations used by the new mount code for this purpose:
+
+ (*) int security_fs_context_alloc(struct fs_context *fc,
+ struct super_block *src_sb);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. It should return 0 on success and a negative error
+ code on failure.
+
+ src_sb is non-NULL in the case of a remount (FS_CONTEXT_FOR_REMOUNT) in
+ which case it indicates the superblock to be remounted or in the case of a
+ submount (FS_CONTEXT_FOR_SUBMOUNT) in which case it indicates the parent
+ superblock.
+
+ (*) int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. The original filesystem context is pointed to by
+ src_fc and may be used for reference. It should return 0 on success and a
+ negative error code on failure.
+
+ (*) void security_fs_context_free(struct fs_context *fc);
+
+ Called to clean up anything attached to fc->security. Note that the
+ contents may have been transferred to a superblock and the pointer NULL'd
+ out during mount.
+
+ (*) int security_fs_context_parse_one(struct fs_context *fc, char *opt);
+
+ Called for each mount option. The arguments are as for the ->parse_option()
+ method. An active LSM may reject one with an error, pass one over and
+ return 0 or consume one and return 1. If consumed, the option isn't
+ passed on to the filesystem.
+
+ (*) int security_sb_get_tree(struct fs_context *fc);
+
+ Called during the mount procedure to verify that the specified superblock
+ is allowed to be mounted and to transfer the security data there. It
+ should return 0 or a negative error code.
+
+ [NOTE] Should I add a security_fs_context_validate() operation so that the
+ LSM has the opportunity to allocate stuff and check the options as a
+ whole?
+
+ (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint)
+
+ Called during the mount procedure to verify that the root dentry attached
+ to the context is permitted to be attached to the specified mountpoint.
+ It should return 0 on success and a negative error code on failure.
+
+
+=================================
+VFS FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+There are three operations for creating a filesystem context and
+one for destroying a context:
+
+ (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct super_block *src_sb,
+ unsigned int sb_flags);
+
+ Create a filesystem context for a given filesystem type. This allocates
+ the filesystem context, sets the flags, initialises the security and calls
+ fs_type->init_fs_context() to initialise the filesystem context.
+
+ src_sb can be NULL or it may indicate a superblock that is going to be
+ remounted (FS_CONTEXT_FOR_REMOUNT) or a superblock that is the parent of a
+ submount (FS_CONTEXT_FOR_SUBMOUNT). This superblock is provided as a
+ source of namespace information.
+
+ (*) struct fs_context *vfs_sb_reconfig(struct vfsmount *mnt,
+ unsigned int sb_flags);
+
+ Create a filesystem context from the same filesystem as an extant mount
+ and initialise the mount parameters from the superblock underlying that
+ mount. This is for use by remount.
+
+ (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
+
+ Duplicate a filesystem context, copying any options noted and duplicating
+ or additionally referencing any resources held therein. This is available
+ for use where a filesystem has to get a mount within a mount, such as NFS4
+ does by internally mounting the root of the target server and then doing a
+ private pathwalk to the target directory.
+
+ (*) void put_fs_context(struct fs_context *fc);
+
+ Destroy a filesystem context, releasing any resources it holds. This
+ calls the ->free() operation. This is intended to be called by anyone who
+ created a filesystem context.
+
+ [!] filesystem contexts are not refcounted, so this causes unconditional
+ destruction.
+
+In all the above operations, apart from the put op, the return is a mount
+context pointer or a negative error code.
+
+For the remaining operations, if an error occurs, a negative error code will be
+returned.
+
+ (*) int vfs_get_tree(struct fs_context *fc);
+
+ Get or create the mountable root and superblock, using the parameters in
+ the filesystem context to select/configure the superblock. This invokes
+ the ->validate() op and then the ->get_tree() op.
+
+ [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
+ ->remount_fs_fc().
+
+ (*) struct vfsmount *vfs_kern_mount_fc(struct fs_context *fc);
+
+ Create a mount given the parameters in the specified filesystem context.
+
+ (*) int vfs_set_fs_source(struct fs_context *fc, char *source);
+
+ Supply the source name or device name for the mount. This may cause the
+ filesystem to access the device.
+
+ (*) int vfs_parse_mount_option(struct fs_context *fc, char *data);
+
+ Supply a single mount option to the filesystem context. The mount option
+ should likely be in a "key[=val]" string form. The option is first
+ checked to see if it corresponds to a standard mount flag (in which case
+ it is used to set an SB_xxx flag and consumed) or a security option (in
+ which case the LSM consumes it) before it is passed on to the filesystem.
+
+ (*) int generic_parse_monolithic(struct fs_context *fc, void *data);
+
+ Parse a sys_mount() data page, assuming the form to be a text list
+ consisting of key[=val] options separated by commas. Each item in the
+ list is passed to vfs_parse_mount_option(). This is the default when the
+ ->parse_monolithic() operation is NULL.
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
new file mode 100644
index 000000000000..645c57e10764
--- /dev/null
+++ b/include/linux/fs_context.h
@@ -0,0 +1,73 @@
+/* Filesystem superblock creation and reconfiguration context.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FS_CONTEXT_H
+#define _LINUX_FS_CONTEXT_H
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+
+struct cred;
+struct dentry;
+struct file_operations;
+struct file_system_type;
+struct mnt_namespace;
+struct net;
+struct pid_namespace;
+struct super_block;
+struct user_namespace;
+struct vfsmount;
+
+enum fs_context_purpose {
+ FS_CONTEXT_FOR_NEW, /* New superblock for direct mount */
+ FS_CONTEXT_FOR_SUBMOUNT, /* New superblock for automatic submount */
+ FS_CONTEXT_FOR_REMOUNT, /* Superblock reconfiguration for remount */
+};
+
+/*
+ * Filesystem context as allocated and constructed by the ->init_fs_context()
+ * file_system_type operation. The size of the object allocated is specified
+ * in struct file_system_type::fs_context_size and this must include sufficient
+ * space for the fs_context struct.
+ *
+ * Superblock creation fills in ->root whereas reconfiguration begins with this
+ * already set.
+ *
+ * See Documentation/filesystems/mounting.txt
+ */
+struct fs_context {
+ const struct fs_context_operations *ops;
+ struct file_system_type *fs_type;
+ struct dentry *root; /* The root and superblock */
+ struct user_namespace *user_ns; /* The user namespace for this mount */
+ struct net *net_ns; /* The network namespace for this mount */
+ const struct cred *cred; /* The mounter's credentials */
+ char *source; /* The source name (eg. device) */
+ char *subtype; /* The subtype to set on the superblock */
+ void *security; /* The LSM context */
+ unsigned int sb_flags; /* The superblock flags (MS_*) */
+ bool sloppy; /* Unrecognised options are okay */
+ bool silent;
+ bool degraded; /* True if the context can't be reused */
+ enum fs_context_purpose purpose : 8;
+};
+
+struct fs_context_operations {
+ void (*free)(struct fs_context *fc);
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ int (*parse_source)(struct fs_context *fc);
+ int (*parse_option)(struct fs_context *fc, char *p);
+ int (*parse_monolithic)(struct fs_context *fc, void *data);
+ int (*validate)(struct fs_context *fc);
+ int (*get_tree)(struct fs_context *fc);
+};
+
+#endif /* _LINUX_FS_CONTEXT_H */


2017-10-06 20:34:22

by Randy Dunlap

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

On 10/06/17 08:49, David Howells wrote:
>
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> new file mode 100644
> index 000000000000..a3a7ccb4323d
> --- /dev/null
> +++ b/fs/fs_context.c

> +/**
> + * generic_parse_monolithic - Parse key[=val][,key[=val]]* mount data
> + * @mc: The superblock configuration to fill in.

The function argument is @ctx (a struct fs_context *), not @mc.

> + * @data: The data to parse
> + *
> + * Parse a blob of data that's in key[=val][,key[=val]]* form. This can be
> + * called from the ->monolithic_mount_data() fs_context operation.
> + *
> + * Returns 0 on success or the error returned by the ->parse_option() fs_context
> + * operation on failure.
> + */
> +int generic_parse_monolithic(struct fs_context *ctx, void *data)
> +{
> + char *options = data, *p;
> + int ret;
> +
> + if (!options)
> + return 0;
> +
> + while ((p = strsep(&options, ",")) != NULL) {
> + if (*p) {
> + ret = vfs_parse_mount_option(ctx, p);
> + if (ret < 0)
> + return ret;
> + }
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(generic_parse_monolithic);
> +

> +
> +/**
> + * put_fs_context - Dispose of a superblock configuration context.
> + * @sc: The context to dispose of.

@fc:

> + */
> +void put_fs_context(struct fs_context *fc)
> +{
> + struct super_block *sb;
> +
> + if (fc->root) {
> + sb = fc->root->d_sb;
> + dput(fc->root);
> + fc->root = NULL;
> + deactivate_super(sb);
> + }
> +
> + if (fc->ops && fc->ops->free)
> + fc->ops->free(fc);
> +
> + security_fs_context_free(fc);
> + if (fc->net_ns)
> + put_net(fc->net_ns);
> + put_user_ns(fc->user_ns);
> + if (fc->cred)
> + put_cred(fc->cred);
> + kfree(fc->subtype);
> + put_filesystem(fc->fs_type);
> + kfree(fc->source);
> + kfree(fc);
> +}
> +EXPORT_SYMBOL(put_fs_context);


(in fs/namespace.c:)

> +/**
> + * vfs_create_mount - Create a mount for a configured superblock
> + * fc: The configuration context with the superblock attached

@fc:

> + *
> + * Create a mount to an already configured superblock. If necessary, the
> + * caller should invoke vfs_get_tree() before calling this.
> + *
> + * Note that this does not attach the mount to anything.
> + */
> +struct vfsmount *vfs_create_mount(struct fs_context *fc)
> +{
> + struct mount *mnt;
> +
> + if (!fc->root)
> + return ERR_PTR(-EINVAL);
> +
> + mnt = alloc_vfsmnt(fc->source ?: "none");
> + if (!mnt)
> + return ERR_PTR(-ENOMEM);
> +
> + if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
> + /* It's a longterm mount, don't release mnt until we unmount
> + * before file sys is unregistered
> + */
> + mnt->mnt.mnt_flags = MNT_INTERNAL;
> +
> + atomic_inc(&fc->root->d_sb->s_active);
> + mnt->mnt.mnt_sb = fc->root->d_sb;
> + mnt->mnt.mnt_root = dget(fc->root);
> + mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> + mnt->mnt_parent = mnt;
> +
> + lock_mount_hash();
> + list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
> + unlock_mount_hash();
> + return &mnt->mnt;
> +}
> +EXPORT_SYMBOL(vfs_create_mount);


--
~Randy

2017-10-06 20:37:45

by Randy Dunlap

Subject: Re: [PATCH 02/14] VFS: Add LSM hooks for filesystem context [ver #6]

add cc: [email protected]

On 10/06/17 08:49, David Howells wrote:
> Add LSM hooks for use by the filesystem context code. This includes:
>
> (1) Hooks to handle allocation, duplication and freeing of the security
> record attached to a filesystem context.
>
> (2) A hook to snoop a mount options in key[=val] form. If the LSM decides
> it wants to handle it, it can suppress the option being passed to the
> filesystem. Note that 'val' may include commas and binary data with
> the fsopen patch.
>
> (3) A hook to transfer the security from the context to a newly created
> superblock.
>
> (4) A hook to rule on whether a path point can be used as a mountpoint.
>
> These are intended to replace:
>
> security_sb_copy_data
> security_sb_kern_mount
> security_sb_mount
> security_sb_set_mnt_opts
> security_sb_clone_mnt_opts
> security_sb_parse_opts_str
>
> Signed-off-by: David Howells <[email protected]>
> cc: [email protected]
> ---
>
> include/linux/lsm_hooks.h | 45 ++++++++++++
> include/linux/security.h | 33 +++++++++
> security/security.c | 30 ++++++++
> security/selinux/hooks.c | 174 +++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 282 insertions(+)
>
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index c9258124e417..85398ba0b533 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -76,6 +76,38 @@
> * changes on the process such as clearing out non-inheritable signal
> * state. This is called immediately after commit_creds().
> *
> + * Security hooks for mount using fd context.
> + *
> + * @fs_context_alloc:
> + * Allocate and attach a security structure to sc->security. This pointer
> + * is initialised to NULL by the caller.
> + * @fc indicates the new filesystem context.
> + * @src_sb indicates the source superblock of a submount.
> + * @fs_context_dup:
> + * Allocate and attach a security structure to sc->security. This pointer
> + * is initialised to NULL by the caller.
> + * @fc indicates the new filesystem context.
> + * @src_fc indicates the original filesystem context.
> + * @fs_context_free:
> + * Clean up a filesystem context.
> + * @fc indicates the filesystem context.
> + * @fs_context_parse_one:
> + * Userspace provided an option to configure a superblock. The LSM may
> + * reject it with an error and may use it for itself, in which case it
> + * should return 1; otherwise it should return 0 to pass it on to the
> + * filesystem.
> + * @fc indicates the filesystem context.
> + * @p indicates the option in "key[=val]" form.
> + * @sb_get_tree:
> + * Assign the security to a newly created superblock.
> + * @fc indicates the filesystem context.
> + * @fc->root indicates the root that will be mounted.
> + * @fc->root->d_sb points to the superblock.
> + * @sb_mountpoint:
> + * Equivalent of sb_mount, but with an fs_context.
> + * @fc indicates the filesystem context.
> + * @mountpoint indicates the path on which the mount will take place.
> + *
> * Security hooks for filesystem operations.
> *
> * @sb_alloc_security:
> @@ -1384,6 +1416,13 @@ union security_list_options {
> void (*bprm_committing_creds)(struct linux_binprm *bprm);
> void (*bprm_committed_creds)(struct linux_binprm *bprm);
>
> + int (*fs_context_alloc)(struct fs_context *fc, struct super_block *src_sb);
> + int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_sc);
> + void (*fs_context_free)(struct fs_context *fc);
> + int (*fs_context_parse_one)(struct fs_context *fc, char *opt);
> + int (*sb_get_tree)(struct fs_context *fc);
> + int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint);
> +
> int (*sb_alloc_security)(struct super_block *sb);
> void (*sb_free_security)(struct super_block *sb);
> int (*sb_copy_data)(char *orig, char *copy);
> @@ -1703,6 +1742,12 @@ struct security_hook_heads {
> struct list_head bprm_check_security;
> struct list_head bprm_committing_creds;
> struct list_head bprm_committed_creds;
> + struct list_head fs_context_alloc;
> + struct list_head fs_context_dup;
> + struct list_head fs_context_free;
> + struct list_head fs_context_parse_one;
> + struct list_head sb_get_tree;
> + struct list_head sb_mountpoint;
> struct list_head sb_alloc_security;
> struct list_head sb_free_security;
> struct list_head sb_copy_data;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index ce6265960d6c..4a47c732d7b8 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -56,6 +56,7 @@ struct msg_queue;
> struct xattr;
> struct xfrm_sec_ctx;
> struct mm_struct;
> +struct fs_context;
>
> /* If capable should audit the security request */
> #define SECURITY_CAP_NOAUDIT 0
> @@ -233,6 +234,12 @@ int security_bprm_set_creds(struct linux_binprm *bprm);
> int security_bprm_check(struct linux_binprm *bprm);
> void security_bprm_committing_creds(struct linux_binprm *bprm);
> void security_bprm_committed_creds(struct linux_binprm *bprm);
> +int security_fs_context_alloc(struct fs_context *fc, struct super_block *sb);
> +int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc);
> +void security_fs_context_free(struct fs_context *fc);
> +int security_fs_context_parse_option(struct fs_context *fc, char *opt);
> +int security_sb_get_tree(struct fs_context *fc);
> +int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint);
> int security_sb_alloc(struct super_block *sb);
> void security_sb_free(struct super_block *sb);
> int security_sb_copy_data(char *orig, char *copy);
> @@ -540,6 +547,32 @@ static inline void security_bprm_committed_creds(struct linux_binprm *bprm)
> {
> }
>
> +static inline int security_fs_context_alloc(struct fs_context *fc,
> + struct super_block *src_sb)
> +{
> + return 0;
> +}
> +static inline int security_fs_context_dup(struct fs_context *fc,
> + struct fs_context *src_fc)
> +{
> + return 0;
> +}
> +static inline void security_fs_context_free(struct fs_context *fc)
> +{
> +}
> +static inline int security_fs_context_parse_option(struct fs_context *fc, char *opt)
> +{
> + return 0;
> +}
> +static inline int security_sb_get_tree(struct fs_context *fc)
> +{
> + return 0;
> +}
> +static inline int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint)
> +{
> + return 0;
> +}
> +
> static inline int security_sb_alloc(struct super_block *sb)
> {
> return 0;
> diff --git a/security/security.c b/security/security.c
> index 4bf0f571b4ef..55383a0e764d 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -351,6 +351,36 @@ void security_bprm_committed_creds(struct linux_binprm *bprm)
> call_void_hook(bprm_committed_creds, bprm);
> }
>
> +int security_fs_context_alloc(struct fs_context *fc, struct super_block *src_sb)
> +{
> + return call_int_hook(fs_context_alloc, 0, fc, src_sb);
> +}
> +
> +int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
> +{
> + return call_int_hook(fs_context_dup, 0, fc, src_fc);
> +}
> +
> +void security_fs_context_free(struct fs_context *fc)
> +{
> + call_void_hook(fs_context_free, fc);
> +}
> +
> +int security_fs_context_parse_one(struct fs_context *fc, char *opt)
> +{
> + return call_int_hook(fs_context_parse_one, 0, fc, opt);
> +}
> +
> +int security_sb_get_tree(struct fs_context *fc)
> +{
> + return call_int_hook(sb_get_tree, 0, fc);
> +}
> +
> +int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint)
> +{
> + return call_int_hook(sb_mountpoint, 0, fc, mountpoint);
> +}
> +
> int security_sb_alloc(struct super_block *sb)
> {
> return call_int_hook(sb_alloc_security, 0, sb);
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 6f37f7e5b9a8..0dda7350b5af 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -48,6 +48,7 @@
> #include <linux/fdtable.h>
> #include <linux/namei.h>
> #include <linux/mount.h>
> +#include <linux/fs_context.h>
> #include <linux/netfilter_ipv4.h>
> #include <linux/netfilter_ipv6.h>
> #include <linux/tty.h>
> @@ -2862,6 +2863,172 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
> FILESYSTEM__UNMOUNT, NULL);
> }
>
> +/* fsopen mount context operations */
> +
> +static int selinux_fs_context_alloc(struct fs_context *fc,
> + struct super_block *src_sb)
> +{
> + struct security_mnt_opts *opts;
> +
> + opts = kzalloc(sizeof(*opts), GFP_KERNEL);
> + if (!opts)
> + return -ENOMEM;
> +
> + fc->security = opts;
> + return 0;
> +}
> +
> +static int selinux_fs_context_dup(struct fs_context *fc,
> + struct fs_context *src_fc)
> +{
> + const struct security_mnt_opts *src = src_fc->security;
> + struct security_mnt_opts *opts;
> + int i, n;
> +
> + opts = kzalloc(sizeof(*opts), GFP_KERNEL);
> + if (!opts)
> + return -ENOMEM;
> + fc->security = opts;
> +
> + if (!src || !src->num_mnt_opts)
> + return 0;
> + n = opts->num_mnt_opts = src->num_mnt_opts;
> +
> + if (src->mnt_opts) {
> + opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
> + if (!opts->mnt_opts)
> + return -ENOMEM;
> +
> + for (i = 0; i < n; i++) {
> + if (src->mnt_opts[i]) {
> + opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
> + GFP_KERNEL);
> + if (!opts->mnt_opts[i])
> + return -ENOMEM;
> + }
> + }
> + }
> +
> + if (src->mnt_opts_flags) {
> + opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
> + n * sizeof(int), GFP_KERNEL);
> + if (!opts->mnt_opts_flags)
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +static void selinux_fs_context_free(struct fs_context *fc)
> +{
> + struct security_mnt_opts *opts = fc->security;
> +
> + security_free_mnt_opts(opts);
> + fc->security = NULL;
> +}
> +
> +static int selinux_fs_context_parse_one(struct fs_context *fc, char *opt)
> +{
> + struct security_mnt_opts *opts = fc->security;
> + substring_t args[MAX_OPT_ARGS];
> + unsigned int have;
> + char *c, **oo;
> + int token, ctx, i, *of;
> +
> + token = match_token(opt, tokens, args);
> + if (token == Opt_error)
> + return 0; /* Doesn't belong to us. */
> +
> + have = 0;
> + for (i = 0; i < opts->num_mnt_opts; i++)
> + have |= 1 << opts->mnt_opts_flags[i];
> + if (have & (1 << token))
> + return -EINVAL;
> +
> + switch (token) {
> + case Opt_context:
> + if (have & (1 << Opt_defcontext))
> + goto incompatible;
> + ctx = CONTEXT_MNT;
> + goto copy_context_string;
> +
> + case Opt_fscontext:
> + ctx = FSCONTEXT_MNT;
> + goto copy_context_string;
> +
> + case Opt_rootcontext:
> + ctx = ROOTCONTEXT_MNT;
> + goto copy_context_string;
> +
> + case Opt_defcontext:
> + if (have & (1 << Opt_context))
> + goto incompatible;
> + ctx = DEFCONTEXT_MNT;
> + goto copy_context_string;
> +
> + case Opt_labelsupport:
> + return 1;
> +
> + default:
> + return -EINVAL;
> + }
> +
> +copy_context_string:
> + if (opts->num_mnt_opts > 3)
> + return -EINVAL;
> +
> + of = krealloc(opts->mnt_opts_flags,
> + (opts->num_mnt_opts + 1) * sizeof(int), GFP_KERNEL);
> + if (!of)
> + return -ENOMEM;
> + of[opts->num_mnt_opts] = 0;
> + opts->mnt_opts_flags = of;
> +
> + oo = krealloc(opts->mnt_opts,
> + (opts->num_mnt_opts + 1) * sizeof(char *), GFP_KERNEL);
> + if (!oo)
> + return -ENOMEM;
> + oo[opts->num_mnt_opts] = NULL;
> + opts->mnt_opts = oo;
> +
> + c = match_strdup(&args[0]);
> + if (!c)
> + return -ENOMEM;
> + opts->mnt_opts[opts->num_mnt_opts] = c;
> + opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
> + opts->num_mnt_opts++;
> + return 1;
> +
> +incompatible:
> + return -EINVAL;
> +}
> +
> +static int selinux_sb_get_tree(struct fs_context *fc)
> +{
> + const struct cred *cred = current_cred();
> + struct common_audit_data ad;
> + int rc;
> +
> + rc = selinux_set_mnt_opts(fc->root->d_sb, fc->security, 0, NULL);
> + if (rc)
> + return rc;
> +
> + /* Allow all mounts performed by the kernel */
> + if (fc->sb_flags & MS_KERNMOUNT)
> + return 0;
> +
> + ad.type = LSM_AUDIT_DATA_DENTRY;
> + ad.u.dentry = fc->root;
> + return superblock_has_perm(cred, fc->root->d_sb, FILESYSTEM__MOUNT, &ad);
> +}
> +
> +static int selinux_sb_mountpoint(struct fs_context *fc, struct path *mountpoint)
> +{
> + const struct cred *cred = current_cred();
> +
> + return path_has_perm(cred, mountpoint, FILE__MOUNTON);
> +}
> +
> /* inode security operations */
>
> static int selinux_inode_alloc_security(struct inode *inode)
> @@ -6275,6 +6442,13 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
> LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
> LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),
>
> + LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
> + LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
> + LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
> + LSM_HOOK_INIT(fs_context_parse_one, selinux_fs_context_parse_one),
> + LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
> + LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
> +
> LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
> LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
> LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
>


--
~Randy

2017-10-07 00:08:11

by Randy Dunlap

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

On 10/06/17 16:13, David Howells wrote:
> Randy Dunlap <[email protected]> wrote:
>
>> (in fs/namespace.c:)
>
> Ummm?

in fs/namespace.c:

+/**
+ * vfs_create_mount - Create a mount for a configured superblock
+ * fc: The configuration context with the superblock attached
+ *
+ * Create a mount to an already configured superblock. If necessary, the
+ * caller should invoke vfs_get_tree() before calling this.
+ *
+ * Note that this does not attach the mount to anything.
+ */
+struct vfsmount *vfs_create_mount(struct fs_context *fc)
+{


in the kernel-doc notation:
s/fc:/@fc:/


ta.
--
~Randy

2017-10-10 07:49:55

by Miklos Szeredi

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

On Fri, Oct 6, 2017 at 5:49 PM, David Howells <[email protected]> wrote:
> Implement a filesystem context concept to be used during superblock
> creation for mount and superblock reconfiguration for remount.
>
> The mounting procedure then becomes:
>
> (1) Allocate new fs_context context.
>
> (2) Configure the context.
>
> (3) Create superblock.
>
> (4) Mount the superblock any number of times.
>
> (5) Destroy the context.
>
> Rather than calling fs_type->mount(), an fs_context struct is created and
> fs_type->init_fs_context() is called to set it up.
> fs_type->fs_context_size says how much space should be allocated for the
> config context. The fs_context struct is placed at the beginning and any
> extra space is for the filesystem's use.
>
> A set of operations has to be set by ->init_fs_context() to provide
> freeing, duplication, option parsing, binary data parsing, validation,
> mounting and superblock filling.
>
> Legacy filesystems are supported by the provision of a set of legacy
> fs_context operations that build up a list of mount options and then invoke
> fs_type->mount() from within the fs_context ->get_tree() operation. This
> allows all filesystems to be accessed using fs_context.
>
> It should be noted that, whilst this patch adds a lot of lines of code,
> there is quite a bit of duplication with existing code that can be
> eliminated should all filesystems be converted over.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> Documentation/filesystems/mounting.txt | 7
> fs/Makefile | 3
> fs/fs_context.c | 526 ++++++++++++++++++++++++++++++++
> fs/internal.h | 2
> fs/libfs.c | 17 +
> fs/namespace.c | 337 ++++++++++++++-------
> fs/super.c | 294 +++++++++++++++++-
> include/linux/fs.h | 16 +
> include/linux/fs_context.h | 37 ++
> include/linux/lsm_hooks.h | 6
> include/linux/mount.h | 2
> security/security.c | 4
> security/selinux/hooks.c | 6
> 13 files changed, 1107 insertions(+), 150 deletions(-)
> create mode 100644 fs/fs_context.c
>
> diff --git a/Documentation/filesystems/mounting.txt b/Documentation/filesystems/mounting.txt
> index 8c0b0351e949..ba73066c151c 100644
> --- a/Documentation/filesystems/mounting.txt
> +++ b/Documentation/filesystems/mounting.txt
> @@ -192,7 +192,7 @@ structure is not refcounted.
>
> VFS, security and filesystem mount options are set individually with
> vfs_parse_mount_option(). Options provided by the old mount(2) system call as
> -a page of data can be parsed with generic_monolithic_mount_data().
> +a page of data can be parsed with generic_parse_monolithic().
>
> When mounting, the filesystem is allowed to take data from any of the pointers
> and attach it to the superblock (or whatever), provided it clears the pointer
> @@ -264,7 +264,7 @@ manage the filesystem context. They are as follows:
>
> If the filesystem (eg. NFS) needs to examine the data first and then finds
> it's the standard key-val list then it may pass it off to
> - generic_monolithic_mount_data().
> + generic_parse_monolithic().
>
> (*) int (*validate)(struct fs_context *fc);
>
> @@ -407,9 +407,10 @@ returned.
> [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
> ->remount_fs_fc().
>
> - (*) struct vfsmount *vfs_kern_mount_fc(struct fs_context *fc);
> + (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
>
> Create a mount given the parameters in the specified filesystem context.
> + Note that this does not attach the mount to anything.
>
> (*) int vfs_set_fs_source(struct fs_context *fc, char *source);
>
> diff --git a/fs/Makefile b/fs/Makefile
> index 7bbaca9c67b1..ffe728cc15e1 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -11,7 +11,8 @@ obj-y := open.o read_write.o file_table.o super.o \
> attr.o bad_inode.o file.o filesystems.o namespace.o \
> seq_file.o xattr.o libfs.o fs-writeback.o \
> pnode.o splice.o sync.o utimes.o \
> - stack.o fs_struct.o statfs.o fs_pin.o nsfs.o
> + stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
> + fs_context.o
>
> ifeq ($(CONFIG_BLOCK),y)
> obj-y += buffer.o block_dev.o direct-io.o mpage.o
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> new file mode 100644
> index 000000000000..a3a7ccb4323d
> --- /dev/null
> +++ b/fs/fs_context.c
> @@ -0,0 +1,526 @@
> +/* Provide a way to create a superblock configuration context within the kernel
> + * that allows a superblock to be set up prior to mounting.
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/fs_context.h>
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/nsproxy.h>
> +#include <linux/slab.h>
> +#include <linux/magic.h>
> +#include <linux/security.h>
> +#include <linux/parser.h>
> +#include <linux/mnt_namespace.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/user_namespace.h>
> +#include <net/net_namespace.h>
> +#include "mount.h"
> +
> +struct legacy_fs_context {
> + struct fs_context fc;
> + char *legacy_data; /* Data page for legacy filesystems */
> + char *secdata;
> + unsigned int data_usage;
> +};
> +
> +static const struct fs_context_operations legacy_fs_context_ops;
> +
> +static const match_table_t common_set_sb_flag = {
> + { SB_DIRSYNC, "dirsync" },
> + { SB_LAZYTIME, "lazytime" },
> + { SB_MANDLOCK, "mand" },
> + { SB_POSIXACL, "posixacl" },
> + { SB_RDONLY, "ro" },
> + { SB_SYNCHRONOUS, "sync" },
> + { },
> +};
> +
> +static const match_table_t common_clear_sb_flag = {
> + { SB_LAZYTIME, "nolazytime" },
> + { SB_MANDLOCK, "nomand" },
> + { SB_RDONLY, "rw" },
> + { SB_SILENT, "silent" },
> + { SB_SYNCHRONOUS, "async" },
> + { },
> +};
> +
> +static const match_table_t forbidden_sb_flag = {
> + { 0, "bind" },
> + { 0, "move" },
> + { 0, "private" },
> + { 0, "remount" },
> + { 0, "shared" },
> + { 0, "slave" },
> + { 0, "unbindable" },
> + { 0, "rec" },
> + { 0, "noatime" },
> + { 0, "relatime" },
> + { 0, "norelatime" },
> + { 0, "strictatime" },
> + { 0, "nostrictatime" },
> + { 0, "nodiratime" },
> + { 0, "dev" },
> + { 0, "nodev" },
> + { 0, "exec" },
> + { 0, "noexec" },
> + { 0, "suid" },
> + { 0, "nosuid" },
> + { },
> +};
> +
> +/*
> + * Check for a common mount option that manipulates s_flags.
> + */
> +static int vfs_parse_sb_flag_option(struct fs_context *fc, char *data)
> +{
> + substring_t args[MAX_OPT_ARGS];
> + unsigned int token;
> +
> + token = match_token(data, common_set_sb_flag, args);
> + if (token) {
> + fc->sb_flags |= token;
> + return 1;
> + }
> +
> + token = match_token(data, common_clear_sb_flag, args);
> + if (token) {
> + fc->sb_flags &= ~token;
> + return 1;
> + }
> +
> + token = match_token(data, forbidden_sb_flag, args);
> + if (token)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +/**
> + * vfs_parse_mount_option - Add a single mount option to a superblock config

Mount options are those that refer to the mount itself
(nosuid, nodev, noatime, etc.); this function is not parsing those,
AFAICT.

How about vfs_parse_fs_option()?

> + * @fc: The filesystem context to modify
> + * @p: The option to apply.
> + *
> + * A single mount option in string form is applied to the filesystem context
> + * being set up. Certain standard options (for example "ro") are translated
> + * into flag bits without going to the filesystem. The active security module
> + * is allowed to observe and poach options. Any other options are passed over
> + * to the filesystem to parse.
> + *
> + * This may be called multiple times for a context.
> + *
> + * Returns 0 on success and a negative error code on failure. In the event of
> + * failure, supplementary error information may have been set.
> + */
> +int vfs_parse_mount_option(struct fs_context *fc, char *p)
> +{
> + int ret;
> +
> + ret = vfs_parse_sb_flag_option(fc, p);
> + if (ret < 0)
> + return ret;

We probably also need a "reset" type of option that clears all bits
and is also passed on to the filesystem's parsing routine so it can
reset all options as well.

The set/clear behavior should also be documented very clearly, because
I see lots of confusion regarding this, and it's something that legacy
option parsing cannot even handle consistently.

So what are the rules?

1/a) New sb:
- start with zero sb_flags and set/clear the specified ones
- the filesystem starts with its default set of options and sets/clears
the specified ones

1/b) New sb for legacy mount(2)
- same as 1/a.

2/a) Shared sb:
- this is tricky, I think it would be correct to require a
matching config (sb_flags as well as filesystem options) and error out
in case of a mismatch. But AFAICS this patchset doesn't have anything
related to this.

2/b) Shared sb for legacy mount(2)
- same as 1/a, and ignore an sb_flags mismatch, except for "ro", where
we error out
- ignore any filesystem options (mount_bdev() does that, at least).

3/a) Reconfig
- start with current sb_flags and set/clear specified ones, reset
to zero on reset
- start with current filesystem options and set/clear specified
ones, reset to default on reset

3/b) Reconfig for legacy mount(2) (i.e. MS_REMOUNT)
- reset sb_flags to the newly specified ones
- most filesystems then go on to set/clear/modify the specified ones from
the current set of options, but there are probably exceptions. And if
there are, then we are in trouble because we must convert those
filesystems up-front, before the new interface goes live, and handle
those exceptions in some way (e.g. an FS_CONTEXT_LEGACY flag).
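
A small userspace sketch of rules 1/a and 3/a may make the proposed
semantics concrete. The flag values, the two name tables and the
apply_flag_option() helper are all made up for illustration and are not
part of the patch; "reset" is the hypothetical option suggested above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bits for illustration; not the kernel's SB_* values. */
#define X_RDONLY      0x01u
#define X_SYNCHRONOUS 0x02u
#define X_MANDLOCK    0x04u

struct flag_name {
	unsigned int bit;
	const char *name;
};

static const struct flag_name set_names[] = {
	{ X_RDONLY, "ro" }, { X_SYNCHRONOUS, "sync" }, { X_MANDLOCK, "mand" },
};

static const struct flag_name clear_names[] = {
	{ X_RDONLY, "rw" }, { X_SYNCHRONOUS, "async" }, { X_MANDLOCK, "nomand" },
};

/*
 * Apply one option to *flags.  Returns 1 if the option was consumed here,
 * or 0 if the caller should still hand it to the filesystem.  "reset"
 * zeroes the flags and returns 0 so that the filesystem sees it as well.
 */
static int apply_flag_option(unsigned int *flags, const char *opt)
{
	size_t i;

	assert(flags && opt);

	if (strcmp(opt, "reset") == 0) {
		*flags = 0;
		return 0;
	}

	for (i = 0; i < sizeof(set_names) / sizeof(set_names[0]); i++) {
		if (strcmp(opt, set_names[i].name) == 0) {
			*flags |= set_names[i].bit;
			return 1;
		}
	}

	for (i = 0; i < sizeof(clear_names) / sizeof(clear_names[0]); i++) {
		if (strcmp(opt, clear_names[i].name) == 0) {
			*flags &= ~clear_names[i].bit;
			return 1;
		}
	}

	return 0;
}
```

Under these assumptions a new sb would call the helper starting from
flags == 0, and a reconfig would start from the current flags, so the
same helper serves both cases.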


> + if (ret == 1)
> + return 0;
> +
> + ret = security_fs_context_parse_option(fc, p);
> + if (ret < 0)
> + return ret;
> + if (ret == 1)
> + return 0;
> +
> + if (fc->ops->parse_option)
> + return fc->ops->parse_option(fc, p);
> +
> + return -EINVAL;
> +}
> +EXPORT_SYMBOL(vfs_parse_mount_option);
> +
> +/**
> + * vfs_set_fs_source - Set the source/device name in a filesystem context
> + * @fc: The filesystem context to alter
> + * @source: The name of the source
> + * @slen: Length of @source string
> + */
> +int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t slen)
> +{
> + if (fc->source)
> + return -EINVAL;
> + if (source) {
> + fc->source = kmemdup_nul(source, slen, GFP_KERNEL);
> + if (!fc->source)
> + return -ENOMEM;
> + }
> +
> + if (fc->ops->parse_source)
> + return fc->ops->parse_source(fc);
> + return 0;
> +}
> +EXPORT_SYMBOL(vfs_set_fs_source);
> +
> +/**
> + * generic_parse_monolithic - Parse key[=val][,key[=val]]* mount data
> + * @mc: The superblock configuration to fill in.
> + * @data: The data to parse
> + *
> + * Parse a blob of data that's in key[=val][,key[=val]]* form. This can be
> + * called from the ->monolithic_mount_data() fs_context operation.
> + *
> + * Returns 0 on success or the error returned by the ->parse_option() fs_context
> + * operation on failure.
> + */
> +int generic_parse_monolithic(struct fs_context *ctx, void *data)
> +{
> + char *options = data, *p;
> + int ret;
> +
> + if (!options)
> + return 0;
> +
> + while ((p = strsep(&options, ",")) != NULL) {
> + if (*p) {
> + ret = vfs_parse_mount_option(ctx, p);

The monolithic option block is the legacy thing. It shouldn't be parsing
the common flags. It should instead be treating them as forbidden
(although it probably doesn't really matter, since no filesystem will
accept these anyway).

So probably best to expand vfs_parse_mount_option() here and skip the
sb flag parsing part.
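
Roughly what I have in mind, as a userspace sketch rather than kernel
code. The name list, the callback type and parse_monolithic_sketch() are
hypothetical; the callback stands in for the LSM hook followed by
->parse_option(), and count_option() is only an example callback.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical list of common flag names the legacy path would refuse. */
static const char *const common_flag_names[] = {
	"ro", "rw", "sync", "async", "dirsync", "mand", "nomand",
	"lazytime", "nolazytime",
};

static int is_common_flag(const char *opt)
{
	size_t i;

	for (i = 0; i < sizeof(common_flag_names) / sizeof(common_flag_names[0]); i++)
		if (strcmp(opt, common_flag_names[i]) == 0)
			return 1;
	return 0;
}

/* Stand-in for "LSM hook, then the filesystem's ->parse_option()". */
typedef int (*fs_option_fn)(void *ctx, char *opt);

/* Example callback: just counts the options it is handed. */
static int count_option(void *ctx, char *opt)
{
	(void)opt;
	++*(int *)ctx;
	return 0;
}

/*
 * Split a comma-separated option block in place.  Empty items are
 * skipped, common sb flags are rejected, everything else goes to @parse.
 */
static int parse_monolithic_sketch(char *data, fs_option_fn parse, void *ctx)
{
	char *p = data;
	int ret;

	assert(parse);

	while (p) {
		char *next = strchr(p, ',');

		if (next)
			*next++ = '\0';
		if (*p) {
			if (is_common_flag(p))
				return -EINVAL;
			ret = parse(ctx, p);
			if (ret < 0)
				return ret;
		}
		p = next;
	}

	return 0;
}
```

Note that the sketch modifies the buffer it is given, as the strsep()
loop in the patch does.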

> + if (ret < 0)
> + return ret;
> + }
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(generic_parse_monolithic);
> +
> +/**
> + * vfs_new_fs_context - Create a filesystem context.
> + * @fs_type: The filesystem type.
> + * @src_sb: A superblock from which this one derives (or NULL)
> + * @sb_flags: Superblock flags and op flags (such as MS_REMOUNT)

I'm confused: MS_REMOUNT in sb_flags and FS_CONTEXT_FOR_REMOUNT in purpose?

I hope that's just a stale comment, sb_flags should really be just the
superblock flags and not any op flags.

Also, can FS_CONTEXT_FOR_REMOUNT be renamed to ..._RECONFIG?
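
To illustrate the separation I'm asking for, here is a sketch in which
the legacy flags word is split once at the entry point, so that op flags
never reach sb_flags. The bit values, the enum and split_mount_flags()
are invented for this example and do not correspond to the real MS_* or
FS_CONTEXT_* definitions.

```c
#include <assert.h>

/* Hypothetical op-flag bits, chosen only for this sketch. */
#define OPF_REMOUNT 0x100u
#define OPF_BIND    0x200u
#define OPF_MOVE    0x400u
#define OPF_MASK    (OPF_REMOUNT | OPF_BIND | OPF_MOVE)

enum sketch_purpose {
	SKETCH_FOR_NEW,
	SKETCH_FOR_RECONFIG,
};

struct split_flags {
	unsigned int sb_flags;		/* superblock flags only */
	enum sketch_purpose purpose;	/* derived from the op flags */
};

/* Turn a combined legacy flags word into a purpose plus pure sb flags. */
static struct split_flags split_mount_flags(unsigned int flags)
{
	struct split_flags out;

	out.purpose = (flags & OPF_REMOUNT) ? SKETCH_FOR_RECONFIG
					    : SKETCH_FOR_NEW;
	out.sb_flags = flags & ~OPF_MASK;

	/* No op flag may leak into the superblock flags. */
	assert((out.sb_flags & OPF_MASK) == 0);
	return out;
}
```

With something like this, the context constructor would only ever see
out.sb_flags and out.purpose.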

> + * @purpose: The purpose that this configuration shall be used for.
> + *
> + * Open a filesystem and create a mount context. The mount context is
> + * initialised with the supplied flags and, if a submount/automount from
> + * another superblock (@src_sb), may have parameters such as namespaces copied
> + * across from that superblock.
> + */
> +struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
> + struct super_block *src_sb,
> + unsigned int sb_flags,
> + enum fs_context_purpose purpose)
> +{
> + struct fs_context *fc;
> + size_t fc_size = fs_type->fs_context_size;
> + int ret;
> +
> + BUG_ON(fs_type->init_fs_context && fc_size < sizeof(*fc));
> +
> + if (!fs_type->init_fs_context)
> + fc_size = sizeof(struct legacy_fs_context);
> +
> + fc = kzalloc(fc_size, GFP_KERNEL);
> + if (!fc)
> + return ERR_PTR(-ENOMEM);
> +
> + fc->purpose = purpose;
> + fc->sb_flags = sb_flags;
> + fc->fs_type = get_filesystem(fs_type);
> + fc->cred = get_current_cred();
> +
> + switch (purpose) {
> + case FS_CONTEXT_FOR_KERNEL_MOUNT:
> + fc->sb_flags |= SB_KERNMOUNT;
> + /* Fallthrough */
> + case FS_CONTEXT_FOR_USER_MOUNT:
> + fc->user_ns = get_user_ns(fc->cred->user_ns);
> + fc->net_ns = get_net(current->nsproxy->net_ns);
> + break;
> + case FS_CONTEXT_FOR_SUBMOUNT:
> + fc->user_ns = get_user_ns(src_sb->s_user_ns);
> + fc->net_ns = get_net(current->nsproxy->net_ns);
> + break;
> + case FS_CONTEXT_FOR_REMOUNT:
> + /* We don't pin any namespaces as the superblock's
> + * subscriptions cannot be changed at this point.
> + */
> + break;
> + }
> +
> +
> + /* TODO: Make all filesystems support this unconditionally */
> + if (fc->fs_type->init_fs_context) {
> + ret = fc->fs_type->init_fs_context(fc, src_sb);
> + if (ret < 0)
> + goto err_fc;
> + } else {
> + fc->ops = &legacy_fs_context_ops;
> + }
> +
> + /* Do the security check last because ->init_fs_context may change the
> + * namespace subscriptions.
> + */
> + ret = security_fs_context_alloc(fc, src_sb);
> + if (ret < 0)
> + goto err_fc;
> +
> + return fc;
> +
> +err_fc:
> + put_fs_context(fc);
> + return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL(vfs_new_fs_context);
> +
> +/**
> + * vfs_sb_reconfig - Create a filesystem context for remount/reconfiguration
> + * @mnt: The mountpoint to open
> + * @sb_flags: Superblock flags and op flags (such as MS_REMOUNT)

Here again op flags make no sense.

Also, it should be made clear that the old sb flags will be overridden
with these. As such, new code should probably call this with the
current flags (sb->s_flags?) and let the option parsing override them
with new ones.
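
A rough userspace model of that calling convention, assuming nothing
about the real implementation: struct demo_sb, struct demo_fc and the
demo_*() helpers are hypothetical, and the option names are picked only
for illustration. The context is seeded from the superblock's current
flags, each parsed option overrides just its own bit, and the result is
committed at the end.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag values, mirroring SB_RDONLY and SB_SYNCHRONOUS. */
#define DEMO_SB_RDONLY      0x01u
#define DEMO_SB_SYNCHRONOUS 0x10u

struct demo_sb { unsigned int s_flags; };	/* stands in for the superblock */
struct demo_fc { unsigned int sb_flags; };	/* stands in for the context */

/* Seed the context with whatever the superblock has right now. */
static void demo_reconfig_begin(struct demo_fc *fc, const struct demo_sb *sb)
{
	assert(fc && sb);
	fc->sb_flags = sb->s_flags;
}

/* Override a single flag; returns 0 on success, -1 for an unknown option. */
static int demo_parse_option(struct demo_fc *fc, const char *opt)
{
	if (strcmp(opt, "ro") == 0)
		fc->sb_flags |= DEMO_SB_RDONLY;
	else if (strcmp(opt, "rw") == 0)
		fc->sb_flags &= ~DEMO_SB_RDONLY;
	else if (strcmp(opt, "sync") == 0)
		fc->sb_flags |= DEMO_SB_SYNCHRONOUS;
	else if (strcmp(opt, "async") == 0)
		fc->sb_flags &= ~DEMO_SB_SYNCHRONOUS;
	else
		return -1;
	return 0;
}

/* Apply the reconfigured flags back to the superblock. */
static void demo_reconfig_commit(struct demo_sb *sb, const struct demo_fc *fc)
{
	sb->s_flags = fc->sb_flags;
}
```

Starting from a read-only, synchronous superblock and parsing only "rw"
leaves the synchronous bit untouched, which is the behaviour a caller
would lose if it passed a fresh set of flags instead.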

> + *
> + * Open a mounted filesystem and create a filesystem context such that a
> + * remount can be effected.
> + */
> +struct fs_context *vfs_sb_reconfig(struct vfsmount *mnt,
> + unsigned int sb_flags)
> +{
> + return vfs_new_fs_context(mnt->mnt_sb->s_type, mnt->mnt_sb,
> + sb_flags, FS_CONTEXT_FOR_REMOUNT);
> +}
> +
> +/**
> + * vfs_dup_fs_context - Duplicate a filesystem context.
> + * @src_fc: The context to copy.
> + */

Can we introduce these before they actually get used?

> +struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
> +{
> + struct fs_context *fc;
> + size_t fc_size;
> + int ret;
> +
> + if (!src_fc->ops->dup)
> + return ERR_PTR(-ENOTSUPP);
> +
> + fc_size = src_fc->fs_type->fs_context_size;
> + if (!src_fc->fs_type->init_fs_context)
> + fc_size = sizeof(struct legacy_fs_context);
> +
> + fc = kmemdup(src_fc, fc_size, GFP_KERNEL);
> + if (!fc)
> + return ERR_PTR(-ENOMEM);
> +
> + fc->source = NULL;
> + fc->security = NULL;
> + get_filesystem(fc->fs_type);
> + get_net(fc->net_ns);
> + get_user_ns(fc->user_ns);
> + get_cred(fc->cred);
> +
> + /* Can't call put until we've called ->dup */
> + ret = fc->ops->dup(fc, src_fc);
> + if (ret < 0)
> + goto err_fc;
> +
> + ret = security_fs_context_dup(fc, src_fc);
> + if (ret < 0)
> + goto err_fc;
> + return fc;
> +
> +err_fc:
> + put_fs_context(fc);
> + return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL(vfs_dup_fs_context);
> +
> +/**
> + * put_fs_context - Dispose of a superblock configuration context.
> + * @fc: The context to dispose of.
> + */
> +void put_fs_context(struct fs_context *fc)
> +{
> + struct super_block *sb;
> +
> + if (fc->root) {
> + sb = fc->root->d_sb;
> + dput(fc->root);
> + fc->root = NULL;
> + deactivate_super(sb);
> + }
> +
> + if (fc->ops && fc->ops->free)
> + fc->ops->free(fc);
> +
> + security_fs_context_free(fc);
> + if (fc->net_ns)
> + put_net(fc->net_ns);
> + put_user_ns(fc->user_ns);
> + if (fc->cred)
> + put_cred(fc->cred);
> + kfree(fc->subtype);
> + put_filesystem(fc->fs_type);
> + kfree(fc->source);
> + kfree(fc);
> +}
> +EXPORT_SYMBOL(put_fs_context);
> +
> +/*
> + * Free the config for a filesystem that doesn't support fs_context.
> + */
> +static void legacy_fs_context_free(struct fs_context *fc)
> +{
> + struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
> +
> + free_secdata(ctx->secdata);
> + kfree(ctx->legacy_data);
> +}
> +
> +/*
> + * Duplicate a legacy config.
> + */
> +static int legacy_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
> +{
> + struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
> + struct legacy_fs_context *src_ctx = container_of(src_fc, struct legacy_fs_context, fc);
> +
> + ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!ctx->legacy_data)
> + return -ENOMEM;
> + memcpy(ctx->legacy_data, src_ctx->legacy_data, PAGE_SIZE);
> + return 0;
> +}
> +
> +/*
> + * Add an option to a legacy config. We build up a comma-separated list of
> + * options.
> + */
> +static int legacy_parse_option(struct fs_context *fc, char *p)
> +{
> + struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
> + unsigned int usage = ctx->data_usage;
> + size_t len = strlen(p);
> +
> + if (len > PAGE_SIZE - 2 - usage)
> + return -EINVAL;
> + if (memchr(p, ',', len) != NULL)
> + return -EINVAL;
> + if (!ctx->legacy_data) {
> + ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!ctx->legacy_data)
> + return -ENOMEM;
> + }
> +
> + ctx->legacy_data[usage++] = ',';
> + memcpy(ctx->legacy_data + usage, p, len);
> + usage += len;
> + ctx->legacy_data[usage] = '\0';
> + ctx->data_usage = usage;
> + return 0;
> +}
> +
> +/*
> + * Add monolithic mount data.
> + */
> +static int legacy_parse_monolithic(struct fs_context *fc, void *data)
> +{
> + struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
> +
> + if (ctx->data_usage != 0) {
> + pr_warn("VFS: Can't mix monolithic and individual options\n");
> + return -EINVAL;
> + }
> + if (!data)
> + return 0;
> + if (!ctx->legacy_data) {
> + ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!ctx->legacy_data)
> + return -ENOMEM;
> + }
> +
> + memcpy(ctx->legacy_data, data, PAGE_SIZE);
> + ctx->data_usage = PAGE_SIZE;
> + return 0;
> +}
> +
> +/*
> + * Use the legacy mount validation step to strip out and process security
> + * config options.
> + */
> +static int legacy_validate(struct fs_context *fc)
> +{
> + struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
> +
> + if (!ctx->legacy_data || ctx->fc.fs_type->fs_flags & FS_BINARY_MOUNTDATA)
> + return 0;
> +
> + ctx->secdata = alloc_secdata();
> + if (!ctx->secdata)
> + return -ENOMEM;
> +
> + return security_sb_copy_data(ctx->legacy_data, ctx->secdata);
> +}
> +
> +/*
> + * Determine the superblock subtype.
> + */
> +static int legacy_set_subtype(struct fs_context *fc)
> +{
> + const char *subtype = strchr(fc->fs_type->name, '.');
> +
> + if (subtype) {
> + subtype++;
> + if (!subtype[0])
> + return -EINVAL;
> + } else {
> + subtype = "";
> + }
> +
> + fc->subtype = kstrdup(subtype, GFP_KERNEL);
> + if (!fc->subtype)
> + return -ENOMEM;
> + return 0;
> +}
> +
> +/*
> + * Get a mountable root with the legacy mount command.
> + */
> +static int legacy_get_tree(struct fs_context *fc)
> +{
> + struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
> + struct super_block *sb;
> + struct dentry *root;
> + int ret;
> +
> + root = ctx->fc.fs_type->mount(ctx->fc.fs_type, ctx->fc.sb_flags,
> + ctx->fc.source, ctx->legacy_data);
> + if (IS_ERR(root))
> + return PTR_ERR(root);
> +
> + sb = root->d_sb;
> + BUG_ON(!sb);
> +
> + if ((ctx->fc.fs_type->fs_flags & FS_HAS_SUBTYPE) &&
> + !fc->subtype) {
> + ret = legacy_set_subtype(fc);
> + if (ret < 0)
> + goto err_sb;
> + }
> +
> + ctx->fc.root = root;
> + return 0;
> +
> +err_sb:
> + dput(root);
> + deactivate_locked_super(sb);
> + return ret;
> +}
> +
> +static const struct fs_context_operations legacy_fs_context_ops = {
> + .free = legacy_fs_context_free,
> + .dup = legacy_fs_context_dup,
> + .parse_option = legacy_parse_option,
> + .parse_monolithic = legacy_parse_monolithic,
> + .validate = legacy_validate,
> + .get_tree = legacy_get_tree,
> +};
> diff --git a/fs/internal.h b/fs/internal.h
> index 48cee21b4f14..e7fb460e7ca4 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -89,7 +89,7 @@ extern struct file *get_empty_filp(void);
> /*
> * super.c
> */
> -extern int do_remount_sb(struct super_block *, int, void *, int);
> +extern int do_remount_sb(struct super_block *, int, void *, int, struct fs_context *);
> extern bool trylock_super(struct super_block *sb);
> extern struct dentry *mount_fs(struct file_system_type *,
> int, const char *, void *);
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 7ff3cb904acd..756e552709fa 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -9,6 +9,7 @@
> #include <linux/slab.h>
> #include <linux/cred.h>
> #include <linux/mount.h>
> +#include <linux/fs_context.h>
> #include <linux/vfs.h>
> #include <linux/quotaops.h>
> #include <linux/mutex.h>
> @@ -574,13 +575,27 @@ static DEFINE_SPINLOCK(pin_fs_lock);
>
> int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *count)
> {
> + struct fs_context *fc;
> struct vfsmount *mnt = NULL;
> + int ret;
> +
> spin_lock(&pin_fs_lock);
> if (unlikely(!*mount)) {
> spin_unlock(&pin_fs_lock);
> - mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL);
> +
> + fc = vfs_new_fs_context(type, NULL, 0, FS_CONTEXT_FOR_KERNEL_MOUNT);
> + if (IS_ERR(fc))
> + return PTR_ERR(fc);
> +
> + ret = vfs_get_tree(fc);
> + if (ret < 0) {
> + put_fs_context(fc);
> + return ret;
> + }
> +
> + mnt = vfs_create_mount(fc);
> + put_fs_context(fc);
> if (IS_ERR(mnt))
> return PTR_ERR(mnt);
> +
> spin_lock(&pin_fs_lock);
> if (!*mount)
> *mount = mnt;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index a6508e4c0a90..d6b0b0067f6d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -25,8 +25,10 @@
> #include <linux/magic.h>
> #include <linux/bootmem.h>
> #include <linux/task_work.h>
> +#include <linux/file.h>
> #include <linux/sched/task.h>
> #include <uapi/linux/mount.h>
> +#include <linux/fs_context.h>
>
> #include "pnode.h"
> #include "internal.h"
> @@ -1017,55 +1019,6 @@ static struct mount *skip_mnt_tree(struct mount *p)
> return p;
> }
>
> -struct vfsmount *
> -vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
> -{
> - struct mount *mnt;
> - struct dentry *root;
> -
> - if (!type)
> - return ERR_PTR(-ENODEV);
> -
> - mnt = alloc_vfsmnt(name);
> - if (!mnt)
> - return ERR_PTR(-ENOMEM);
> -
> - if (flags & SB_KERNMOUNT)
> - mnt->mnt.mnt_flags = MNT_INTERNAL;
> -
> - root = mount_fs(type, flags, name, data);
> - if (IS_ERR(root)) {
> - mnt_free_id(mnt);
> - free_vfsmnt(mnt);
> - return ERR_CAST(root);
> - }
> -
> - mnt->mnt.mnt_root = root;
> - mnt->mnt.mnt_sb = root->d_sb;
> - mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> - mnt->mnt_parent = mnt;
> - lock_mount_hash();
> - list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
> - unlock_mount_hash();
> - return &mnt->mnt;
> -}
> -EXPORT_SYMBOL_GPL(vfs_kern_mount);
> -
> -struct vfsmount *
> -vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
> - const char *name, void *data)
> -{
> - /* Until it is worked out how to pass the user namespace
> - * through from the parent mount to the submount don't support
> - * unprivileged mounts with submounts.
> - */
> - if (mountpoint->d_sb->s_user_ns != &init_user_ns)
> - return ERR_PTR(-EPERM);
> -
> - return vfs_kern_mount(type, SB_SUBMOUNT, name, data);
> -}
> -EXPORT_SYMBOL_GPL(vfs_submount);
> -
> static struct mount *clone_mnt(struct mount *old, struct dentry *root,
> int flag)
> {
> @@ -1592,7 +1545,7 @@ static int do_umount(struct mount *mnt, int flags)
> return -EPERM;
> down_write(&sb->s_umount);
> if (!sb_rdonly(sb))
> - retval = do_remount_sb(sb, SB_RDONLY, NULL, 0);
> + retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, NULL);
> up_write(&sb->s_umount);
> return retval;
> }
> @@ -2275,6 +2228,20 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
> }
>
> /*
> + * Parse the monolithic page of mount data given to sys_mount().
> + */
> +static int parse_monolithic_mount_data(struct fs_context *fc, void *data)
> +{
> + int (*monolithic_mount_data)(struct fs_context *, void *);
> +
> + monolithic_mount_data = fc->ops->parse_monolithic;
> + if (!monolithic_mount_data)
> + monolithic_mount_data = generic_parse_monolithic;
> +
> + return monolithic_mount_data(fc, data);
> +}
> +
> +/*
> * change filesystem flags. dir should be a physical root of filesystem.
> * If you've mounted a non-root directory somewhere and want to do remount
> * on it - tough luck.
> @@ -2282,9 +2249,11 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
> static int do_remount(struct path *path, int ms_flags, int sb_flags,
> int mnt_flags, void *data)
> {
> + struct fs_context *fc = NULL;
> int err;
> struct super_block *sb = path->mnt->mnt_sb;
> struct mount *mnt = real_mount(path->mnt);
> + struct file_system_type *type = sb->s_type;
>
> if (!check_mnt(mnt))
> return -EINVAL;
> @@ -2319,9 +2288,25 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
> return -EPERM;
> }
>
> - err = security_sb_remount(sb, data);
> - if (err)
> - return err;
> + if (type->init_fs_context) {
> + fc = vfs_sb_reconfig(path->mnt, sb_flags);
> + if (IS_ERR(fc))
> + return PTR_ERR(fc);
> +
> + err = parse_monolithic_mount_data(fc, data);
> + if (err < 0)
> + goto err_fc;
> +
> + if (fc->ops->validate) {
> + err = fc->ops->validate(fc);
> + if (err < 0)
> + goto err_fc;
> + }
> + } else {
> + err = security_sb_remount(sb, data);
> + if (err)
> + return err;
> + }
>
> down_write(&sb->s_umount);
> if (ms_flags & MS_BIND)
> @@ -2329,7 +2314,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
> else if (!capable(CAP_SYS_ADMIN))
> err = -EPERM;
> else
> - err = do_remount_sb(sb, sb_flags, data, 0);
> + err = do_remount_sb(sb, sb_flags, data, 0, fc);
> if (!err) {
> lock_mount_hash();
> mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
> @@ -2338,6 +2323,9 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
> unlock_mount_hash();
> }
> up_write(&sb->s_umount);
> +err_fc:
> + if (fc)
> + put_fs_context(fc);
> return err;
> }
>
> @@ -2421,29 +2409,6 @@ static int do_move_mount(struct path *path, const char *old_name)
> return err;
> }
>
> -static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
> -{
> - int err;
> - const char *subtype = strchr(fstype, '.');
> - if (subtype) {
> - subtype++;
> - err = -EINVAL;
> - if (!subtype[0])
> - goto err;
> - } else
> - subtype = "";
> -
> - mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
> - err = -ENOMEM;
> - if (!mnt->mnt_sb->s_subtype)
> - goto err;
> - return mnt;
> -
> - err:
> - mntput(mnt);
> - return ERR_PTR(err);
> -}
> -
> /*
> * add a mount into a namespace's mount tree
> */
> @@ -2491,40 +2456,89 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
> static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
>
> /*
> - * create a new mount for userspace and request it to be added into the
> - * namespace's tree
> + * Create a new mount using a superblock configuration and request it
> + * be added to the namespace tree.
> */
> -static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
> - int mnt_flags, const char *name, void *data)
> +static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
> + unsigned int mnt_flags)
> {
> - struct file_system_type *type;
> struct vfsmount *mnt;
> - int err;
> -
> - if (!fstype)
> - return -EINVAL;
> -
> - type = get_fs_type(fstype);
> - if (!type)
> - return -ENODEV;
> + int ret;
>
> - mnt = vfs_kern_mount(type, sb_flags, name, data);
> - if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
> - !mnt->mnt_sb->s_subtype)
> - mnt = fs_set_subtype(mnt, fstype);
> + ret = security_sb_mountpoint(fc, mountpoint);
> + if (ret < 0)
> + return ret;
>
> - put_filesystem(type);
> + mnt = vfs_create_mount(fc);
> if (IS_ERR(mnt))
> return PTR_ERR(mnt);
>
> + ret = -EPERM;
> if (mount_too_revealing(mnt, &mnt_flags)) {
> - mntput(mnt);
> - return -EPERM;
> + pr_warn("VFS: Mount too revealing\n");
> + goto err_mnt;
> + }
> +
> + ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
> + if (ret < 0)
> + goto err_mnt;
> + return ret;
> +
> +err_mnt:
> + mntput(mnt);
> + return ret;
> +}
> +
> +/*
> + * create a new mount for userspace and request it to be added into the
> + * namespace's tree
> + */
> +static int do_new_mount(struct path *mountpoint, const char *fstype,
> + int sb_flags, int mnt_flags, const char *name,
> + void *data)
> +{
> + struct file_system_type *fs_type;
> + struct fs_context *fc;
> + int err = -EINVAL;
> +
> + if (!fstype)
> + goto err;
> +
> + err = -ENODEV;
> + fs_type = get_fs_type(fstype);
> + if (!fs_type)
> + goto err;
> +
> + fc = vfs_new_fs_context(fs_type, NULL, sb_flags,
> + FS_CONTEXT_FOR_USER_MOUNT);
> + put_filesystem(fs_type);
> + if (IS_ERR(fc)) {
> + err = PTR_ERR(fc);
> + goto err;
> }
>
> - err = do_add_mount(real_mount(mnt), path, mnt_flags);
> + err = vfs_set_fs_source(fc, name, name ? strlen(name) : 0);
> + if (err < 0)
> + goto err_fc;
> +
> + err = parse_monolithic_mount_data(fc, data);
> + if (err < 0)
> + goto err_fc;
> +
> + err = vfs_get_tree(fc);
> + if (err < 0)
> + goto err_fc;
> +
> + err = do_new_mount_fc(fc, mountpoint, mnt_flags);
> if (err)
> - mntput(mnt);
> + goto err_fc;
> +
> + put_fs_context(fc);
> + return 0;
> +
> +err_fc:
> + put_fs_context(fc);
> +err:
> return err;
> }
>
> @@ -3063,6 +3077,116 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
> return ret;
> }
>
> +/**
> + * vfs_create_mount - Create a mount for a configured superblock
> + * @fc: The configuration context with the superblock attached
> + *
> + * Create a mount to an already configured superblock. If necessary, the
> + * caller should invoke vfs_get_tree() before calling this.
> + *
> + * Note that this does not attach the mount to anything.
> + */
> +struct vfsmount *vfs_create_mount(struct fs_context *fc)
> +{
> + struct mount *mnt;
> +
> + if (!fc->root)
> + return ERR_PTR(-EINVAL);
> +
> + mnt = alloc_vfsmnt(fc->source ?: "none");
> + if (!mnt)
> + return ERR_PTR(-ENOMEM);
> +
> + if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
> + /* It's a longterm mount, don't release mnt until we unmount
> + * before file sys is unregistered
> + */
> + mnt->mnt.mnt_flags = MNT_INTERNAL;
> +
> + atomic_inc(&fc->root->d_sb->s_active);
> + mnt->mnt.mnt_sb = fc->root->d_sb;
> + mnt->mnt.mnt_root = dget(fc->root);
> + mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> + mnt->mnt_parent = mnt;
> +
> + lock_mount_hash();
> + list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
> + unlock_mount_hash();
> + return &mnt->mnt;
> +}
> +EXPORT_SYMBOL(vfs_create_mount);
> +
> +struct vfsmount *vfs_kern_mount(struct file_system_type *type,
> + int sb_flags, const char *devname, void *data)
> +{
> + struct fs_context *fc;
> + struct vfsmount *mnt;
> + int ret;
> +
> + if (!type)
> + return ERR_PTR(-EINVAL);
> +
> + fc = vfs_new_fs_context(type, NULL, sb_flags,
> + sb_flags & SB_KERNMOUNT ?
> + FS_CONTEXT_FOR_KERNEL_MOUNT :
> + FS_CONTEXT_FOR_USER_MOUNT);
> + if (IS_ERR(fc))
> + return ERR_CAST(fc);
> +
> + ret = vfs_set_fs_source(fc, devname, devname ? strlen(devname) : 0);
> + if (ret < 0)
> + goto err_fc;
> +
> + ret = parse_monolithic_mount_data(fc, data);
> + if (ret < 0)
> + goto err_fc;
> +
> + ret = vfs_get_tree(fc);
> + if (ret < 0)
> + goto err_fc;
> +
> + mnt = vfs_create_mount(fc);
> + if (IS_ERR(mnt)) {
> + ret = PTR_ERR(mnt);
> + goto err_fc;
> + }
> +
> + put_fs_context(fc);
> + return mnt;
> +
> +err_fc:
> + put_fs_context(fc);
> + return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL_GPL(vfs_kern_mount);
> +
> +struct vfsmount *
> +vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
> + const char *name, void *data)
> +{
> + /* Until it is worked out how to pass the user namespace
> + * through from the parent mount to the submount don't support
> + * unprivileged mounts with submounts.
> + */
> + if (mountpoint->d_sb->s_user_ns != &init_user_ns)
> + return ERR_PTR(-EPERM);
> +
> + return vfs_kern_mount(type, SB_SUBMOUNT, name, data);
> +}
> +EXPORT_SYMBOL_GPL(vfs_submount);
> +
> +struct vfsmount *kern_mount(struct file_system_type *type)
> +{
> + return vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL);
> +}
> +EXPORT_SYMBOL_GPL(kern_mount);
> +
> +struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
> +{
> + return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data);
> +}
> +EXPORT_SYMBOL_GPL(kern_mount_data);
> +
> /*
> * Return true if path is reachable from root
> *
> @@ -3283,21 +3407,6 @@ void put_mnt_ns(struct mnt_namespace *ns)
> free_mnt_ns(ns);
> }
>
> -struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
> -{
> - struct vfsmount *mnt;
> - mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data);
> - if (!IS_ERR(mnt)) {
> - /*
> - * it is a longterm mount, don't release mnt until
> - * we unmount before file sys is unregistered
> - */
> - real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
> - }
> - return mnt;
> -}
> -EXPORT_SYMBOL_GPL(kern_mount_data);
> -
> void kern_unmount(struct vfsmount *mnt)
> {
> /* release long term mount so mount point can be released */
> diff --git a/fs/super.c b/fs/super.c
> index 02da00410de8..e7d411d1d435 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -35,6 +35,7 @@
> #include <linux/lockdep.h>
> #include <linux/user_namespace.h>
> #include <uapi/linux/mount.h>
> +#include <linux/fs_context.h>
> #include "internal.h"
>
>
> @@ -173,16 +174,13 @@ static void destroy_super(struct super_block *s)
> }
>
> /**
> - * alloc_super - create new superblock
> - * @type: filesystem type superblock should belong to
> - * @flags: the mount flags
> - * @user_ns: User namespace for the super_block
> + * alloc_super - Create new superblock
> + * @fc: The filesystem configuration context
> *
> * Allocates and initializes a new &struct super_block. alloc_super()
> * returns a pointer new superblock or %NULL if allocation had failed.
> */
> -static struct super_block *alloc_super(struct file_system_type *type, int flags,
> - struct user_namespace *user_ns)
> +static struct super_block *alloc_super(struct fs_context *fc)
> {
> struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
> static const struct super_operations default_op;
> @@ -192,7 +190,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> return NULL;
>
> INIT_LIST_HEAD(&s->s_mounts);
> - s->s_user_ns = get_user_ns(user_ns);
> + s->s_user_ns = get_user_ns(fc->user_ns);
>
> if (security_sb_alloc(s))
> goto fail;
> @@ -200,12 +198,12 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> for (i = 0; i < SB_FREEZE_LEVELS; i++) {
> if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
> sb_writers_name[i],
> - &type->s_writers_key[i]))
> + &fc->fs_type->s_writers_key[i]))
> goto fail;
> }
> init_waitqueue_head(&s->s_writers.wait_unfrozen);
> s->s_bdi = &noop_backing_dev_info;
> - s->s_flags = flags;
> + s->s_flags = fc->sb_flags;
> if (s->s_user_ns != &init_user_ns)
> s->s_iflags |= SB_I_NODEV;
> INIT_HLIST_NODE(&s->s_instances);
> @@ -222,7 +220,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> goto fail;
>
> init_rwsem(&s->s_umount);
> - lockdep_set_class(&s->s_umount, &type->s_umount_key);
> + lockdep_set_class(&s->s_umount, &fc->fs_type->s_umount_key);
> /*
> * sget() can have s_umount recursion.
> *
> @@ -242,7 +240,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> s->s_count = 1;
> atomic_set(&s->s_active, 1);
> mutex_init(&s->s_vfs_rename_mutex);
> - lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
> + lockdep_set_class(&s->s_vfs_rename_mutex, &fc->fs_type->s_vfs_rename_key);
> init_rwsem(&s->s_dquot.dqio_sem);
> s->s_maxbytes = MAX_NON_LFS;
> s->s_op = &default_op;
> @@ -455,6 +453,96 @@ void generic_shutdown_super(struct super_block *sb)
> EXPORT_SYMBOL(generic_shutdown_super);
>
> /**
> + * sget_fc - Find or create a superblock
> + * @fc: Filesystem context.
> + * @test: Comparison callback
> + * @set: Setup callback
> + *
> + * Find or create a superblock using the parameters stored in the filesystem
> + * context and the two callback functions.
> + *
> + * If an extant superblock is matched, then that will be returned with an
> + * elevated reference count that the caller must transfer or discard.
> + *
> + * If no match is made, a new superblock will be allocated and basic
> + * initialisation will be performed (s_type, s_fs_info and s_id will be set and
> + * the set() callback will be invoked), the superblock will be published and it
> + * will be returned in a partially constructed state with SB_BORN and SB_ACTIVE
> + * as yet unset.
> + */
> +struct super_block *sget_fc(struct fs_context *fc,
> + int (*test)(struct super_block *, struct fs_context *),
> + int (*set)(struct super_block *, struct fs_context *))
> +{
> + struct super_block *s = NULL;
> + struct super_block *old;
> + int err;
> +
> + if (!(fc->sb_flags & SB_KERNMOUNT) &&
> + fc->purpose != FS_CONTEXT_FOR_SUBMOUNT) {
> + /* Don't allow mounting unless the caller has CAP_SYS_ADMIN
> + * over the namespace.
> + */
> + if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT) &&
> + !capable(CAP_SYS_ADMIN))
> + return ERR_PTR(-EPERM);
> + else if (!ns_capable(fc->user_ns, CAP_SYS_ADMIN))
> + return ERR_PTR(-EPERM);
> + }
> +
> +retry:
> + spin_lock(&sb_lock);
> + if (test) {
> + hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
> + if (!test(old, fc))
> + continue;
> + if (fc->user_ns != old->s_user_ns) {
> + spin_unlock(&sb_lock);
> + if (s) {
> + up_write(&s->s_umount);
> + destroy_super(s);
> + }
> + return ERR_PTR(-EBUSY);
> + }
> + if (!grab_super(old))
> + goto retry;
> + if (s) {
> + up_write(&s->s_umount);
> + destroy_super(s);
> + s = NULL;
> + }
> + return old;
> + }
> + }
> + if (!s) {
> + spin_unlock(&sb_lock);
> + s = alloc_super(fc);
> + if (!s)
> + return ERR_PTR(-ENOMEM);
> + goto retry;
> + }
> +
> + s->s_fs_info = fc->s_fs_info;
> + err = set(s, fc);
> + if (err) {
> + s->s_fs_info = NULL;
> + spin_unlock(&sb_lock);
> + up_write(&s->s_umount);
> + destroy_super(s);
> + return ERR_PTR(err);
> + }
> + s->s_type = fc->fs_type;
> + strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
> + list_add_tail(&s->s_list, &super_blocks);
> + hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
> + spin_unlock(&sb_lock);
> + get_filesystem(s->s_type);
> + register_shrinker(&s->s_shrink);
> + return s;
> +}
> +EXPORT_SYMBOL(sget_fc);
> +
> +/**
> * sget_userns - find or create a superblock
> * @type: filesystem type superblock should belong to
> * @test: comparison callback
> @@ -503,7 +591,14 @@ struct super_block *sget_userns(struct file_system_type *type,
> }
> if (!s) {
> spin_unlock(&sb_lock);
> - s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
> + {
> + struct fs_context fc = {
> + .fs_type = type,
> + .sb_flags = flags & ~SB_SUBMOUNT,
> + .user_ns = user_ns,
> + };
> + s = alloc_super(&fc);
> + }
> if (!s)
> return ERR_PTR(-ENOMEM);
> goto retry;
> @@ -805,10 +900,13 @@ struct super_block *user_get_super(dev_t dev)
> * @sb_flags: revised superblock flags
> * @data: the rest of options
> * @force: whether or not to force the change
> + * @fc: the superblock config for filesystems that support it
> + * (NULL if called from emergency or umount)
> *
> * Alters the mount options of a mounted file system.
> */
> -int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force)
> +int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force,
> + struct fs_context *fc)
> {
> int retval;
> int remount_ro;
> @@ -850,8 +948,14 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force)
> }
> }
>
> - if (sb->s_op->remount_fs) {
> - retval = sb->s_op->remount_fs(sb, &sb_flags, data);
> + if (sb->s_op->remount_fs_fc ||
> + sb->s_op->remount_fs) {
> + if (sb->s_op->remount_fs_fc) {
> + retval = sb->s_op->remount_fs_fc(sb, fc);
> + sb_flags = fc->sb_flags;
> + } else {
> + retval = sb->s_op->remount_fs(sb, &sb_flags, data);
> + }
> if (retval) {
> if (!force)
> goto cancel_readonly;
> @@ -898,7 +1002,7 @@ static void do_emergency_remount(struct work_struct *work)
> /*
> * What lock protects sb->s_flags??
> */
> - do_remount_sb(sb, SB_RDONLY, NULL, 1);
> + do_remount_sb(sb, SB_RDONLY, NULL, 1, NULL);
> }
> up_write(&sb->s_umount);
> spin_lock(&sb_lock);
> @@ -1048,6 +1152,89 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
>
> EXPORT_SYMBOL(mount_ns);
>
> +static int set_anon_super_fc(struct super_block *sb, struct fs_context *fc)
> +{
> + return set_anon_super(sb, NULL);
> +}
> +
> +static int test_keyed_super(struct super_block *sb, struct fs_context *fc)
> +{
> + return sb->s_fs_info == fc->s_fs_info;
> +}
> +
> +static int test_single_super(struct super_block *s, struct fs_context *fc)
> +{
> + return 1;
> +}
> +
> +/**
> + * vfs_get_super - Get a superblock with a search key set in s_fs_info.
> + * @fc: The filesystem context holding the parameters
> + * @keying: How to distinguish superblocks
> + * @fill_super: Helper to initialise a new superblock
> + *
> + * Search for a superblock and create a new one if not found. The search
> + * criterion is controlled by @keying. If the search fails, a new superblock
> + * is created and @fill_super() is called to initialise it.
> + *
> + * @keying can take one of a number of values:
> + *
> + * (1) vfs_get_single_super - Only one superblock of this type may exist on the
> + * system. This is typically used for special system filesystems.
> + *
> + * (2) vfs_get_keyed_super - Multiple superblocks may exist, but they must have
> + * distinct keys (where the key is in s_fs_info). Searching for the same
> + * key again will turn up the superblock for that key.
> + *
> + * (3) vfs_get_independent_super - Multiple superblocks may exist and are
> + * unkeyed. Each call will get a new superblock.
> + *
> + * A permissions check is made by sget_fc() unless we're getting a superblock
> + * for a kernel-internal mount or a submount.
> + */
> +int vfs_get_super(struct fs_context *fc,
> + enum vfs_get_super_keying keying,
> + int (*fill_super)(struct super_block *sb,
> + struct fs_context *fc))
> +{
> + int (*test)(struct super_block *, struct fs_context *);
> + struct super_block *sb;
> +
> + switch (keying) {
> + case vfs_get_single_super:
> + test = test_single_super;
> + break;
> + case vfs_get_keyed_super:
> + test = test_keyed_super;
> + break;
> + case vfs_get_independent_super:
> + test = NULL;
> + break;
> + default:
> + BUG();
> + }
> +
> + sb = sget_fc(fc, test, set_anon_super_fc);
> + if (IS_ERR(sb))
> + return PTR_ERR(sb);
> +
> + if (!sb->s_root) {
> + int err;
> + err = fill_super(sb, fc);
> + if (err) {
> + deactivate_locked_super(sb);
> + return err;
> + }
> +
> + sb->s_flags |= SB_ACTIVE;
> + }
> +
> + if (!fc->root)
> + fc->root = dget(sb->s_root);
> + return 0;
> +}
> +EXPORT_SYMBOL(vfs_get_super);
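
The three keying modes described in the comment above can be modelled
in userspace roughly as follows. This is only a sketch: struct
demo_fs_type, struct demo_sb and demo_get_super() are hypothetical, the
fixed-size table stands in for the per-type superblock list, and
locking, refcounting and fill_super() are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

enum demo_keying { DEMO_SINGLE, DEMO_KEYED, DEMO_INDEPENDENT };

struct demo_sb { const void *key; };	/* key plays the role of s_fs_info */

#define DEMO_MAX_SB 8
struct demo_fs_type {
	struct demo_sb supers[DEMO_MAX_SB];	/* existing superblocks */
	size_t nr;				/* number in use */
};

/* Find a matching superblock according to @keying, else create a new one. */
static struct demo_sb *demo_get_super(struct demo_fs_type *type,
				      const void *key,
				      enum demo_keying keying)
{
	size_t i;

	assert(type != NULL);

	/* Independent superblocks are never searched for. */
	if (keying != DEMO_INDEPENDENT) {
		for (i = 0; i < type->nr; i++) {
			/* Single: anything matches. Keyed: keys must be equal. */
			if (keying == DEMO_SINGLE || type->supers[i].key == key)
				return &type->supers[i];
		}
	}

	if (type->nr == DEMO_MAX_SB)
		return NULL;	/* table full in this toy model */

	type->supers[type->nr].key = key;
	return &type->supers[type->nr++];
}
```

In this model, asking twice for the same key in keyed mode returns the
same entry, single mode returns the first entry regardless of key, and
independent mode creates a new entry on every call.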
> +
> #ifdef CONFIG_BLOCK
> static int set_bdev_super(struct super_block *s, void *data)
> {
> @@ -1196,7 +1383,7 @@ struct dentry *mount_single(struct file_system_type *fs_type,
> }
> s->s_flags |= SB_ACTIVE;
> } else {
> - do_remount_sb(s, flags, data, 0);
> + do_remount_sb(s, flags, data, 0, NULL);
> }
> return dget(s->s_root);
> }
> @@ -1529,3 +1716,76 @@ int thaw_super(struct super_block *sb)
> return 0;
> }
> EXPORT_SYMBOL(thaw_super);
> +
> +/**
> + * vfs_get_tree - Get the mountable root
> + * @fc: The superblock configuration context.
> + *
> + * The filesystem is invoked to get or create a superblock which can then later
> + * be used for mounting. The filesystem places a pointer to the root to be
> + * used for mounting in @fc->root.
> + */
> +int vfs_get_tree(struct fs_context *fc)
> +{
> + struct super_block *sb;
> + int ret;
> +
> + if (fc->root)
> + return -EBUSY;
> +
> + if (fc->ops->validate) {
> + ret = fc->ops->validate(fc);
> + if (ret < 0)
> + return ret;
> + }
> +
> + /* The filesystem may transfer preallocated resources from the
> + * configuration context to the superblock, thereby rendering the
> + * config unusable for another attempt at creation if this one fails.
> + */
> + if (fc->degraded)
> + return -EBUSY;
> +
> + /* Get the mountable root in fc->root, with a ref on the root and a ref
> + * on the superblock.
> + */
> + ret = fc->ops->get_tree(fc);
> + if (ret < 0)
> + return ret;
> +
> + BUG_ON(!fc->root);
> + sb = fc->root->d_sb;
> + WARN_ON(!sb->s_bdi);
> +
> + ret = security_sb_get_tree(fc);
> + if (ret < 0)
> + goto err_sb;
> +
> + ret = -ENOMEM;
> + if (fc->subtype && !sb->s_subtype) {
> + sb->s_subtype = kstrdup(fc->subtype, GFP_KERNEL);
> + if (!sb->s_subtype)
> + goto err_sb;
> + }
> +
> + sb->s_flags |= SB_BORN;
> +
> + /* Filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
> + * but s_maxbytes was an unsigned long long for many releases. Throw
> + * this warning for a little while to try and catch filesystems that
> + * violate this rule.
> + */
> + WARN(sb->s_maxbytes < 0,
> + "%s set sb->s_maxbytes to negative value (%lld)\n",
> + fc->fs_type->name, sb->s_maxbytes);
> +
> + up_write(&sb->s_umount);
> + return 0;
> +
> +err_sb:
> + dput(fc->root);
> + fc->root = NULL;
> + deactivate_locked_super(sb);
> + return ret;
> +}
> +EXPORT_SYMBOL(vfs_get_tree);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bd2ee00e03ff..f391263c62a1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -58,6 +58,7 @@ struct workqueue_struct;
> struct iov_iter;
> struct fscrypt_info;
> struct fscrypt_operations;
> +struct fs_context;
>
> extern void __init inode_init(void);
> extern void __init inode_init_early(void);
> @@ -717,6 +718,11 @@ static inline void inode_unlock(struct inode *inode)
> up_write(&inode->i_rwsem);
> }
>
> +static inline int inode_lock_killable(struct inode *inode)
> +{
> + return down_write_killable(&inode->i_rwsem);
> +}
> +
> static inline void inode_lock_shared(struct inode *inode)
> {
> down_read(&inode->i_rwsem);
> @@ -1814,6 +1820,7 @@ struct super_operations {
> int (*unfreeze_fs) (struct super_block *);
> int (*statfs) (struct dentry *, struct kstatfs *);
> int (*remount_fs) (struct super_block *, int *, char *);
> + int (*remount_fs_fc) (struct super_block *, struct fs_context *);
> void (*umount_begin) (struct super_block *);
>
> int (*show_options)(struct seq_file *, struct dentry *);
> @@ -2072,8 +2079,10 @@ struct file_system_type {
> #define FS_HAS_SUBTYPE 4
> #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
> + unsigned short fs_context_size; /* Size of superblock config context to allocate */
> struct dentry *(*mount) (struct file_system_type *, int,
> const char *, void *);
> + int (*init_fs_context)(struct fs_context *, struct super_block *);
> void (*kill_sb) (struct super_block *);
> struct module *owner;
> struct file_system_type * next;
> @@ -2113,6 +2122,9 @@ void deactivate_locked_super(struct super_block *sb);
> int set_anon_super(struct super_block *s, void *data);
> int get_anon_bdev(dev_t *);
> void free_anon_bdev(dev_t);
> +struct super_block *sget_fc(struct fs_context *fc,
> + int (*test)(struct super_block *, struct fs_context *),
> + int (*set)(struct super_block *, struct fs_context *));
> struct super_block *sget_userns(struct file_system_type *type,
> int (*test)(struct super_block *,void *),
> int (*set)(struct super_block *,void *),
> @@ -2155,8 +2167,8 @@ mount_pseudo(struct file_system_type *fs_type, char *name,
>
> extern int register_filesystem(struct file_system_type *);
> extern int unregister_filesystem(struct file_system_type *);
> -extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
> -#define kern_mount(type) kern_mount_data(type, NULL)
> +extern struct vfsmount *kern_mount(struct file_system_type *);
> +extern struct vfsmount *kern_mount_data(struct file_system_type *, void *);
> extern void kern_unmount(struct vfsmount *mnt);
> extern int may_umount_tree(struct vfsmount *);
> extern int may_umount(struct vfsmount *);
> diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
> index 645c57e10764..8af6ff0e869e 100644
> --- a/include/linux/fs_context.h
> +++ b/include/linux/fs_context.h
> @@ -27,9 +27,10 @@ struct user_namespace;
> struct vfsmount;
>
> enum fs_context_purpose {
> - FS_CONTEXT_FOR_NEW, /* New superblock for direct mount */
> + FS_CONTEXT_FOR_USER_MOUNT, /* New superblock for user-specified mount */
> + FS_CONTEXT_FOR_KERNEL_MOUNT, /* New superblock for kernel-internal mount */
> FS_CONTEXT_FOR_SUBMOUNT, /* New superblock for automatic submount */
> - FS_CONTEXT_FOR_REMOUNT, /* Superblock reconfiguration for remount */
> + FS_CONTEXT_FOR_REMOUNT, /* Superblock reconfiguration for remount */
> };
>
> /*
> @@ -53,7 +54,8 @@ struct fs_context {
> char *source; /* The source name (eg. device) */
> char *subtype; /* The subtype to set on the superblock */
> void *security; /* The LSM context */
> - unsigned int sb_flags; /* The superblock flags (MS_*) */
> + void *s_fs_info; /* Proposed s_fs_info */
> + unsigned int sb_flags; /* Proposed superblock flags (SB_*) */
> bool sloppy; /* Unrecognised options are okay */
> bool silent;
> bool degraded; /* True if the context can't be reused */
> @@ -70,4 +72,33 @@ struct fs_context_operations {
> int (*get_tree)(struct fs_context *fc);
> };
>
> +/*
> + * fs_context manipulation functions.
> + */
> +extern struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
> + struct super_block *src_sb,
> + unsigned int ms_flags,
> + enum fs_context_purpose purpose);
> +extern struct fs_context *vfs_sb_reconfig(struct vfsmount *mnt,
> + unsigned int ms_flags);
> +extern struct fs_context *vfs_dup_fs_context(struct fs_context *src);
> +extern int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t slen);
> +extern int vfs_parse_mount_option(struct fs_context *fc, char *data);
> +extern int generic_parse_monolithic(struct fs_context *fc, void *data);
> +extern int vfs_get_tree(struct fs_context *fc);
> +extern void put_fs_context(struct fs_context *fc);
> +
> +/*
> + * sget() wrapper to be called from the ->get_tree() op.
> + */
> +enum vfs_get_super_keying {
> + vfs_get_single_super, /* Only one such superblock may exist */
> + vfs_get_keyed_super, /* Superblocks with different s_fs_info keys may exist */
> + vfs_get_independent_super, /* Multiple independent superblocks may exist */
> +};
> +extern int vfs_get_super(struct fs_context *fc,
> + enum vfs_get_super_keying keying,
> + int (*fill_super)(struct super_block *sb,
> + struct fs_context *fc));
> +
> #endif /* _LINUX_FS_CONTEXT_H */
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 85398ba0b533..74aeccb041a2 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -91,7 +91,7 @@
> * @fs_context_free:
> * Clean up a filesystem context.
> * @fc indicates the filesystem context.
> - * @fs_context_parse_one:
> + * @fs_context_parse_option:
> * Userspace provided an option to configure a superblock. The LSM may
> * reject it with an error and may use it for itself, in which case it
> * should return 1; otherwise it should return 0 to pass it on to the
> @@ -1419,7 +1419,7 @@ union security_list_options {
> int (*fs_context_alloc)(struct fs_context *fc, struct super_block *src_sb);
> int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_sc);
> void (*fs_context_free)(struct fs_context *fc);
> - int (*fs_context_parse_one)(struct fs_context *fc, char *opt);
> + int (*fs_context_parse_option)(struct fs_context *fc, char *opt);
> int (*sb_get_tree)(struct fs_context *fc);
> int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint);
>
> @@ -1745,7 +1745,7 @@ struct security_hook_heads {
> struct list_head fs_context_alloc;
> struct list_head fs_context_dup;
> struct list_head fs_context_free;
> - struct list_head fs_context_parse_one;
> + struct list_head fs_context_parse_option;
> struct list_head sb_get_tree;
> struct list_head sb_mountpoint;
> struct list_head sb_alloc_security;
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 1ce85e6fd95f..f47306b4bf72 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -20,6 +20,7 @@ struct super_block;
> struct vfsmount;
> struct dentry;
> struct mnt_namespace;
> +struct fs_context;
>
> #define MNT_NOSUID 0x01
> #define MNT_NODEV 0x02
> @@ -87,6 +88,7 @@ struct path;
> extern struct vfsmount *clone_private_mount(const struct path *path);
>
> struct file_system_type;
> +extern struct vfsmount *vfs_create_mount(struct fs_context *fc);
> extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
> int flags, const char *name,
> void *data);
> diff --git a/security/security.c b/security/security.c
> index 55383a0e764d..7826a493c02a 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -366,9 +366,9 @@ void security_fs_context_free(struct fs_context *fc)
> call_void_hook(fs_context_free, fc);
> }
>
> -int security_fs_context_parse_one(struct fs_context *fc, char *opt)
> +int security_fs_context_parse_option(struct fs_context *fc, char *p)
> {
> - return call_int_hook(fs_context_parse_one, 0, fc, opt);
> + return call_int_hook(fs_context_parse_option, 0, fc, p);
> }
>
> int security_sb_get_tree(struct fs_context *fc)
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 0dda7350b5af..ca57e61f9c43 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2927,7 +2927,7 @@ static void selinux_fs_context_free(struct fs_context *fc)
> fc->security = NULL;
> }
>
> -static int selinux_fs_context_parse_one(struct fs_context *fc, char *opt)
> +static int selinux_fs_context_parse_option(struct fs_context *fc, char *opt)
> {
> struct security_mnt_opts *opts = fc->security;
> substring_t args[MAX_OPT_ARGS];
> @@ -3014,7 +3014,7 @@ static int selinux_sb_get_tree(struct fs_context *fc)
> return rc;
>
> /* Allow all mounts performed by the kernel */
> - if (fc->sb_flags & MS_KERNMOUNT)
> + if (fc->purpose & FS_CONTEXT_FOR_KERNEL_MOUNT)
> return 0;
>
> ad.type = LSM_AUDIT_DATA_DENTRY;
> @@ -6445,7 +6445,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
> LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
> LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
> LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
> - LSM_HOOK_INIT(fs_context_parse_one, selinux_fs_context_parse_one),
> + LSM_HOOK_INIT(fs_context_parse_option, selinux_fs_context_parse_option),
> LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
> LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
>
>
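
An illustrative aside on the fc->purpose test in the SELinux hunk above: enum fs_context_purpose, as defined in the fs_context.h hunk, is a plain enumeration, so its members take the consecutive values 0 to 3 rather than independent bits. The following is a minimal standalone sketch (the enum is a local copy, and both helper names are made up for this example) of how a bitwise test and an equality test can differ on such values:

```c
#include <stdbool.h>

/* Local copy of the enum from the fs_context.h hunk quoted above. */
enum fs_context_purpose {
	FS_CONTEXT_FOR_USER_MOUNT,	/* 0 */
	FS_CONTEXT_FOR_KERNEL_MOUNT,	/* 1 */
	FS_CONTEXT_FOR_SUBMOUNT,	/* 2 */
	FS_CONTEXT_FOR_REMOUNT,		/* 3 */
};

/* Bitwise test, in the style of the SELinux hunk (hypothetical helper). */
bool is_kernel_mount_bitwise(enum fs_context_purpose purpose)
{
	return (purpose & FS_CONTEXT_FOR_KERNEL_MOUNT) != 0;
}

/* Equality test: true for exactly one enumerator (hypothetical helper). */
bool is_kernel_mount_exact(enum fs_context_purpose purpose)
{
	return purpose == FS_CONTEXT_FOR_KERNEL_MOUNT;
}
```

With these values the bitwise form also reports FS_CONTEXT_FOR_REMOUNT (3) as a kernel mount, whereas the equality form matches only FS_CONTEXT_FOR_KERNEL_MOUNT. Whether that matters on this path depends on which purposes can actually reach selinux_sb_get_tree().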

2017-10-10 08:00:02

by Miklos Szeredi

Subject: Re: [PATCH 06/14] VFS: Implement fsmount() to effect a pre-configured mount [ver #6]

On Fri, Oct 6, 2017 at 5:49 PM, David Howells <[email protected]> wrote:
> Provide a system call by which a filesystem opened with fsopen() and
> configured by a series of writes can be mounted:
>
> int ret = fsmount(int fsfd, int dfd, const char *path,
> unsigned int at_flags, unsigned int flags);
>
> where fsfd is the fd returned by fsopen(), dfd, path and at_flags locate
> the mountpoint and flags are the applicable MS_* flags. dfd can be
> AT_FDCWD or an fd open to a directory.
>
> In the event that fsmount() fails, it may be possible to get an error
> message by calling read(). If no message is available, ENODATA will be
> reported.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> arch/x86/entry/syscalls/syscall_32.tbl | 1
> arch/x86/entry/syscalls/syscall_64.tbl | 1
> fs/namespace.c | 82 ++++++++++++++++++++++++++++++++
> include/linux/syscalls.h | 2 +
> kernel/sys_ni.c | 1
> 5 files changed, 87 insertions(+)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 9bf8d4c62f85..abe6ea95e0e6 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -392,3 +392,4 @@
> 383 i386 statx sys_statx
> 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
> 385 i386 fsopen sys_fsopen
> +386 i386 fsmount sys_fsmount
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 9b198c5fc412..0977c5079831 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -340,6 +340,7 @@
> 331 common pkey_free sys_pkey_free
> 332 common statx sys_statx
> 333 common fsopen sys_fsopen
> +334 common fsmount sys_fsmount
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/namespace.c b/fs/namespace.c
> index d6b0b0067f6d..8676658b6b2c 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3188,6 +3188,88 @@ struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
> EXPORT_SYMBOL_GPL(kern_mount_data);
>
> /*
> + * Mount a new, prepared superblock (specified by fs_fd) on the location
> + * specified by dfd and dir_name. dfd can be AT_FDCWD, a dir fd or a container
> + * fd. This cannot be used for binding, moving or remounting mounts.
> + */
> +SYSCALL_DEFINE5(fsmount, int, fs_fd, int, dfd, const char __user *, dir_name,
> + unsigned int, at_flags, unsigned int, flags)
> +{
> + struct fs_context *fc;
> + struct path mountpoint;
> + struct fd f;
> + unsigned int lookup_flags, mnt_flags = 0;
> + long ret;
> +
> + if ((at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
> + AT_EMPTY_PATH)) != 0)
> + return -EINVAL;
> +
> + if (flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
> + MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_STRICTATIME))
> + return -EINVAL;

How about propagation flags? Those are also mount specific.

> +
> + if (flags & MS_RDONLY)
> + mnt_flags |= MNT_READONLY;
> + if (flags & MS_NOSUID)
> + mnt_flags |= MNT_NOSUID;
> + if (flags & MS_NODEV)
> + mnt_flags |= MNT_NODEV;
> + if (flags & MS_NOEXEC)
> + mnt_flags |= MNT_NOEXEC;
> + if (flags & MS_NODIRATIME)
> + mnt_flags |= MNT_NODIRATIME;
> +
> + if (flags & MS_STRICTATIME) {
> + if (flags & MS_NOATIME)
> + return -EINVAL;
> + } else if (flags & MS_NOATIME) {
> + mnt_flags |= MNT_NOATIME;
> + } else {
> + mnt_flags |= MNT_RELATIME;
> + }

I'm not sure reusing the MS_FLAGS is the right choice. Why not export
MNT_* to userspace? That would get us a clean namespace without
confusion with sb flags and no need to convert back and forth.
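
To make the "convert back and forth" point concrete, here is a minimal userspace sketch of the MS_* to MNT_* translation that the quoted hunk performs. The numeric values are copied in locally for illustration and are assumed to match the kernel headers of the time; ms_to_mnt_flags() and FSMOUNT_VALID_MS_FLAGS are hypothetical names, not something from the patch.

```c
#include <errno.h>

/* mount(2) flag bits; local copies, assumed to match the uapi headers. */
#define MS_RDONLY	(1u << 0)
#define MS_NOSUID	(1u << 1)
#define MS_NODEV	(1u << 2)
#define MS_NOEXEC	(1u << 3)
#define MS_NOATIME	(1u << 10)
#define MS_NODIRATIME	(1u << 11)
#define MS_RELATIME	(1u << 21)
#define MS_STRICTATIME	(1u << 24)

/* Per-mount flag bits; local copies, assumed to match include/linux/mount.h. */
#define MNT_NOSUID	0x01u
#define MNT_NODEV	0x02u
#define MNT_NOEXEC	0x04u
#define MNT_NOATIME	0x08u
#define MNT_NODIRATIME	0x10u
#define MNT_RELATIME	0x20u
#define MNT_READONLY	0x40u

/* The set of MS_* bits the quoted fsmount() hunk accepts. */
#define FSMOUNT_VALID_MS_FLAGS \
	(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC | \
	 MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_STRICTATIME)

/* Translate MS_* bits to MNT_* bits, mirroring the logic of the hunk. */
int ms_to_mnt_flags(unsigned int flags, unsigned int *mnt_flags)
{
	unsigned int out = 0;

	if (flags & ~FSMOUNT_VALID_MS_FLAGS)
		return -EINVAL;

	if (flags & MS_RDONLY)
		out |= MNT_READONLY;
	if (flags & MS_NOSUID)
		out |= MNT_NOSUID;
	if (flags & MS_NODEV)
		out |= MNT_NODEV;
	if (flags & MS_NOEXEC)
		out |= MNT_NOEXEC;
	if (flags & MS_NODIRATIME)
		out |= MNT_NODIRATIME;

	/* strictatime and noatime are mutually exclusive; relatime is the default */
	if (flags & MS_STRICTATIME) {
		if (flags & MS_NOATIME)
			return -EINVAL;
	} else if (flags & MS_NOATIME) {
		out |= MNT_NOATIME;
	} else {
		out |= MNT_RELATIME;
	}

	*mnt_flags = out;
	return 0;
}
```

If MNT_* were exported to userspace as suggested, a translation step of this kind would not be needed.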

> +
> + f = fdget(fs_fd);
> + if (!f.file)
> + return -EBADF;
> +
> + ret = -EINVAL;
> + if (f.file->f_op != &fs_fs_fops)
> + goto err_fsfd;
> +
> + fc = f.file->private_data;
> +
> + ret = -EPERM;
> + if (!may_mount() ||
> + ((fc->sb_flags & MS_MANDLOCK) && !may_mandlock()))
> + goto err_fsfd;
> +
> + /* There must be a valid superblock or we can't mount it */
> + ret = -EINVAL;
> + if (!fc->root)
> + goto err_fsfd;
> +
> + /* Find the mountpoint. A container can be specified in dfd. */
> + lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
> + if (at_flags & AT_SYMLINK_NOFOLLOW)
> + lookup_flags &= ~LOOKUP_FOLLOW;
> + if (at_flags & AT_NO_AUTOMOUNT)
> + lookup_flags &= ~LOOKUP_AUTOMOUNT;
> + if (at_flags & AT_EMPTY_PATH)
> + lookup_flags |= LOOKUP_EMPTY;
> + ret = user_path_at(dfd, dir_name, lookup_flags, &mountpoint);
> + if (ret < 0)
> + goto err_fsfd;
> +
> + ret = do_new_mount_fc(fc, &mountpoint, mnt_flags);
> +
> + path_put(&mountpoint);
> +err_fsfd:
> + fdput(f);
> + return ret;
> +}
> +
> +/*
> * Return true if path is reachable from root
> *
> * namespace_sem or mount_lock is held
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 7cd1b65a4152..e82dde171ce8 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -942,5 +942,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
> unsigned mask, struct statx __user *buffer);
> asmlinkage long sys_fsopen(const char *fs_name, unsigned int flags,
> void *reserved3, void *reserved4, void *reserved5);
> +asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
> + unsigned int flags);
>
> #endif
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index de1dc63e7e47..a0fe764bd5dd 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -261,3 +261,4 @@ cond_syscall(sys_pkey_free);
>
> /* fd-based mount */
> cond_syscall(sys_fsopen);
> +cond_syscall(sys_fsmount);
>

2017-10-10 09:51:12

by Karel Zak

Subject: Re: [PATCH 06/14] VFS: Implement fsmount() to effect a pre-configured mount [ver #6]

On Tue, Oct 10, 2017 at 10:00:01AM +0200, Miklos Szeredi wrote:
> > +
> > + if (flags & MS_RDONLY)
> > + mnt_flags |= MNT_READONLY;
> > + if (flags & MS_NOSUID)
> > + mnt_flags |= MNT_NOSUID;
> > + if (flags & MS_NODEV)
> > + mnt_flags |= MNT_NODEV;
> > + if (flags & MS_NOEXEC)
> > + mnt_flags |= MNT_NOEXEC;
> > + if (flags & MS_NODIRATIME)
> > + mnt_flags |= MNT_NODIRATIME;
> > +
> > + if (flags & MS_STRICTATIME) {
> > + if (flags & MS_NOATIME)
> > + return -EINVAL;
> > + } else if (flags & MS_NOATIME) {
> > + mnt_flags |= MNT_NOATIME;
> > + } else {
> > + mnt_flags |= MNT_RELATIME;
> > + }
>
> I'm not sure reusing the MS_FLAGS is the right choice. Why not export
> MNT_* to userspace? That would get us a clean namespace without
> confusion with sb flags and no need to convert back and forth.

Well, if you think of it as two separate things -- VFS-flags
and FS-flags (and for example /proc/#/mountinfo already uses two
columns for the flags) -- then the question is why the API uses one
variable?

Would it be better to use two variables everywhere? (mostly for the
syscall).

It would be nice to keep for example propagation flags only in
vfs_flags, or use MS_RDONLY according to context (for FS or for VFS)
without extra MS_BIND, etc.

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2017-10-10 13:38:23

by Miklos Szeredi

Subject: Re: [PATCH 06/14] VFS: Implement fsmount() to effect a pre-configured mount [ver #6]

On Tue, Oct 10, 2017 at 11:51 AM, Karel Zak <[email protected]> wrote:
> On Tue, Oct 10, 2017 at 10:00:01AM +0200, Miklos Szeredi wrote:
>> > +
>> > + if (flags & MS_RDONLY)
>> > + mnt_flags |= MNT_READONLY;
>> > + if (flags & MS_NOSUID)
>> > + mnt_flags |= MNT_NOSUID;
>> > + if (flags & MS_NODEV)
>> > + mnt_flags |= MNT_NODEV;
>> > + if (flags & MS_NOEXEC)
>> > + mnt_flags |= MNT_NOEXEC;
>> > + if (flags & MS_NODIRATIME)
>> > + mnt_flags |= MNT_NODIRATIME;
>> > +
>> > + if (flags & MS_STRICTATIME) {
>> > + if (flags & MS_NOATIME)
>> > + return -EINVAL;
>> > + } else if (flags & MS_NOATIME) {
>> > + mnt_flags |= MNT_NOATIME;
>> > + } else {
>> > + mnt_flags |= MNT_RELATIME;
>> > + }
>>
>> I'm not sure reusing the MS_FLAGS is the right choice. Why not export
>> MNT_* to userspace? That would get us a clean namespace without
>> confusion with sb flags and no need to convert back and forth.
>
> Well, if you think of it as two separate things -- VFS-flags
> and FS-flags (and for example /proc/#/mountinfo already uses two
> columns for the flags) -- then the question is why the API uses one
> variable?
>
> Would it be better to use two variables everywhere? (mostly for the
> syscall).
>
> It would be nice to keep for example propagation flags only in
> vfs_flags, or use MS_RDONLY according to context (for FS or for VFS)
> without extra MS_BIND, etc.

MS_BIND will be gone in the new API. The two separate columns in
/proc/#/mountinfo are going to be two separate things on the new
interface (one is writes to the fsfd provided by fsopen(2), the other
in flags for fsmount(2)). The question is what to call the mount flags
(what you call vfs flags) in the uAPI: "MS_RDONLY" or "MNT_RDONLY".
Either is probably fine, but I feel that "MNT_FOO" is better, because
it's a relatively clean namespace concerned with mount flags and not
polluted with all the scum that mount(2) collected.
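
The two columns mentioned here are the per-mount options (field 6) and the per-superblock options (the last field, after the "-" separator) of each /proc/<pid>/mountinfo line, as documented in proc(5). A small sketch of pulling the two apart, assuming a well-formed line; copy_field() and mountinfo_split_opts() are hypothetical names:

```c
#include <stddef.h>
#include <string.h>

/* Copy the n-th space-separated field (0-based) of s into out. */
static int copy_field(const char *s, int n, char *out, size_t outlen)
{
	size_t len;

	while (n-- > 0) {
		s = strchr(s, ' ');
		if (!s)
			return -1;
		s++;
	}
	len = strcspn(s, " \n");
	if (len == 0 || len >= outlen)
		return -1;
	memcpy(out, s, len);
	out[len] = '\0';
	return 0;
}

/*
 * Extract the per-mount options (field 6) and the per-superblock options
 * (third field after the " - " separator) from one mountinfo line.
 * Returns 0 on success and -1 if the line does not have the expected shape.
 */
int mountinfo_split_opts(const char *line,
			 char *mnt_opts, size_t mnt_len,
			 char *sb_opts, size_t sb_len)
{
	const char *sep = strstr(line, " - ");

	if (!sep)
		return -1;
	if (copy_field(line, 5, mnt_opts, mnt_len) < 0)
		return -1;
	return copy_field(sep + 3, 2, sb_opts, sb_len);
}
```

In the proposed interface the first string would correspond to the flags argument of fsmount(2) and the second to what is written to the fd returned by fsopen(2).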

BTW, I think <[email protected]> should be CC-d on all patches
that concern the userspace API.

Thanks,
Miklos

2017-10-10 15:24:38

by David Howells

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

Randy Dunlap <[email protected]> wrote:

> > + * generic_parse_monolithic - Parse key[=val][,key[=val]]* mount data
> > + * @mc: The superblock configuration to fill in.
>
> function argument is &struct fs_context *ctx, not @mc

Yeah, Miklós and Al kept changing their minds about what I was allowed to call
this struct.

David

2017-10-11 08:54:19

by Karel Zak

Subject: Re: [PATCH 06/14] VFS: Implement fsmount() to effect a pre-configured mount [ver #6]

On Tue, Oct 10, 2017 at 03:38:21PM +0200, Miklos Szeredi wrote:
> On Tue, Oct 10, 2017 at 11:51 AM, Karel Zak <[email protected]> wrote:
> > On Tue, Oct 10, 2017 at 10:00:01AM +0200, Miklos Szeredi wrote:
> >> > +
> >> > + if (flags & MS_RDONLY)
> >> > + mnt_flags |= MNT_READONLY;
> >> > + if (flags & MS_NOSUID)
> >> > + mnt_flags |= MNT_NOSUID;
> >> > + if (flags & MS_NODEV)
> >> > + mnt_flags |= MNT_NODEV;
> >> > + if (flags & MS_NOEXEC)
> >> > + mnt_flags |= MNT_NOEXEC;
> >> > + if (flags & MS_NODIRATIME)
> >> > + mnt_flags |= MNT_NODIRATIME;
> >> > +
> >> > + if (flags & MS_STRICTATIME) {
> >> > + if (flags & MS_NOATIME)
> >> > + return -EINVAL;
> >> > + } else if (flags & MS_NOATIME) {
> >> > + mnt_flags |= MNT_NOATIME;
> >> > + } else {
> >> > + mnt_flags |= MNT_RELATIME;
> >> > + }
> >>
> >> I'm not sure reusing the MS_FLAGS is the right choice. Why not export
> >> MNT_* to userspace? That would get us a clean namespace without
> >> confusion with sb flags and no need to convert back and forth.
> >
> > Well, if you think of it as two separate things -- VFS-flags
> > and FS-flags (and for example /proc/#/mountinfo already uses two
> > columns for the flags) -- then the question is why the API uses one
> > variable?
> >
> > Would it be better to use two variables everywhere? (mostly for the
> > syscall).
> >
> > It would be nice to keep for example propagation flags only in
> > vfs_flags, or use MS_RDONLY according to context (for FS or for VFS)
> > without extra MS_BIND, etc.
>
> MS_BIND will be gone in the new API. The two separate columns in
> /proc/#/mountinfo are going to be two separate things on the new
> interface (one is writes to the fsfd provided by fsopen(2), the other
> in flags for fsmount(2)).

Ah, nice.

> The question is what to call the mount flags
> (what you call vfs flags) in the uAPI: "MS_RDONLY" or "MNT_RDONLY".
> Either is probably fine, but I feel that "MNT_FOO" is better, because
> it's a relatively clean namespace concerned with mount flags and not
> polluted with all the scum that mount(2) collected.

Hmm.. for example libmount already uses the MNT_ namespace in header
files for all macros. So, I won't be happy with MNT_ in userspace ;-(

I like the clone, epoll, etc. flags ... there is no abbreviation and
the prefix follows the syscall or API name (CLONE_xxx, EPOLLxxx). What
about MOUNT_FOO?

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2017-10-26 16:24:21

by David Howells

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

Miklos Szeredi <[email protected]> wrote:

> > +/**
> > + * vfs_parse_mount_option - Add a single mount option to a superblock config
>
> Mount options are those that refer to the mount
> (nosuid,nodev,noatime,etc..); this function is not parsing those,
> AFAICT.

I'd quibble that those are "mountpoint" options, not "mount" options. Mount
options are the options you can pass to mount and are a mixed bag. But fair
enough, it's probably worth avoiding such terminology where we can.

> How about vfs_parse_fs_option()?

Sure. Can we please not rename it again?

> We probably also need a "reset" type of option that clears all bits
> and is also passed onto the filesystem's parsing routine so it can
> reset all options as well.

Reset what? To what? To a blank slate? To the state on the medium? What if
it's a netfs?

This operation isn't well defined and I'm not sure it's useful because:

(1) Unless we can preload options from some source, the starting context is
blank, so why do you need a reset on a new mount?

(2) We need to find out what state the options are currently in. Reset today
doesn't necessarily mean the same as reset tomorrow.

(3) Not all options are simple on/off switches. Some of them are multistate,
some are strings/numbers that have non-zero defaults and some have
dependencies on other options.

(4) Not all options can be simply reset to "0", particularly if the
filesystem is live. Take an option that points to a network server or a
separate journalling device for example.
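
Points (3) and (4) can be illustrated with a small sketch. Everything in it is hypothetical (the option table, the names and the defaults are invented for the example and belong to no real filesystem); it only shows that a reset has to consult per-option defaults and may have to leave some options alone once the filesystem is live:

```c
#include <stddef.h>

enum opt_kind { OPT_FLAG, OPT_NUMBER, OPT_STRING };

/* Hypothetical description of one option and its default. */
struct opt_desc {
	const char *name;
	enum opt_kind kind;
	long def_number;	/* default for OPT_FLAG and OPT_NUMBER */
	const char *def_string;	/* default for OPT_STRING, may be NULL */
	int fixed_when_live;	/* non-zero: cannot change on a live fs */
};

/* Current value of one option. */
struct opt_value {
	long number;
	const char *string;
};

/* Invented example table; not taken from any real filesystem. */
const struct opt_desc demo_opts[3] = {
	{ "acl",             OPT_FLAG,   0, NULL, 0 },
	{ "commit_interval", OPT_NUMBER, 5, NULL, 0 },
	{ "journal_path",    OPT_STRING, 0, NULL, 1 },
};

/*
 * Reset every option to its own default.  Options that are fixed on a live
 * filesystem are left untouched; the return value counts how many were
 * skipped for that reason.
 */
size_t reset_options(const struct opt_desc *desc, struct opt_value *val,
		     size_t n, int fs_is_live)
{
	size_t skipped = 0;

	for (size_t i = 0; i < n; i++) {
		if (fs_is_live && desc[i].fixed_when_live) {
			skipped++;
			continue;
		}
		if (desc[i].kind == OPT_STRING)
			val[i].string = desc[i].def_string;
		else
			val[i].number = desc[i].def_number;
	}
	return skipped;
}
```

Even in this toy form, "reset" yields a non-zero number for one option and cannot be completed for another, which is the ambiguity being pointed out.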

> 1/a) New sb:
> 1/b) New sb for legacy mount(2)

Looking at this in terms of ext4, I would make the parser create an "option
change" script prior to loading the superblock. The reason for that with ext4
is that ext4 stores an additional option string that must be parsed and
applied first - except that we potentially need some of the mount-supplied
options to be able to mount the fs.

So in the new-mount-of-new-sb case, I would actually create two scripts, one
for the options written to the context fd, then one for the on-disk script,
then validate the context and then apply them both atomically.
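
One way to picture the two-script idea is as a validate-then-commit pass over lists of pending changes. The sketch below is an assumption about the shape such code could take, not code from the series; struct fs_config, struct change and apply_scripts() are invented names. The stored (on-disk) script is run first, and letting the script from the context fd override it afterwards is likewise an assumption:

```c
#include <stddef.h>

/* Hypothetical configuration that the scripts modify. */
struct fs_config {
	int commit_interval;
	int read_only;
};

enum change_key { CHANGE_COMMIT_INTERVAL, CHANGE_READ_ONLY };

/* One entry of an "option change" script. */
struct change {
	enum change_key key;
	int value;
};

/* Check every entry of a script without touching any configuration. */
static int script_is_valid(const struct change *s, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		switch (s[i].key) {
		case CHANGE_COMMIT_INTERVAL:
			if (s[i].value <= 0)
				return 0;
			break;
		case CHANGE_READ_ONLY:
			if (s[i].value != 0 && s[i].value != 1)
				return 0;
			break;
		default:
			return 0;
		}
	}
	return 1;
}

/* Apply an already validated script to cfg. */
static void run_script(struct fs_config *cfg, const struct change *s, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (s[i].key == CHANGE_COMMIT_INTERVAL)
			cfg->commit_interval = s[i].value;
		else
			cfg->read_only = s[i].value;
	}
}

/*
 * Validate both scripts first, then apply them to a staged copy and commit
 * the copy in one assignment.  On failure *cfg is left untouched.
 */
int apply_scripts(struct fs_config *cfg,
		  const struct change *on_disk, size_t n_disk,
		  const struct change *from_fd, size_t n_fd)
{
	struct fs_config staged = *cfg;

	if (!script_is_valid(on_disk, n_disk) || !script_is_valid(from_fd, n_fd))
		return -1;

	run_script(&staged, on_disk, n_disk);
	run_script(&staged, from_fd, n_fd);
	*cfg = staged;
	return 0;
}
```

The single assignment at the end stands in for "apply them both atomically"; a real filesystem would need its own locking to get the same effect.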

> 2/a) Shared sb:
> 2/b) Shared sb for legacy mount(2)

In the new-mount-of-live-sb case, I would validate the context script and
ignore any options that try to change things that can't be changed because the
fs is live.

It might be nice to report them also, but that requires a mechanism to do so.

> 3/a) Reconfig
> 3/b) Reconfig for legacy mount(2) (i.e. MS_REMOUNT)

In the reconfigure case, I only need to create one script, validate it and
then apply it atomically (well, as atomically as possible, given the fs is
actually live at this point).

There's the question of how far you allow a happens-to-share mount to effect a
reconfigure. Seems a reasonable distinction to say that in your case 2 you
just ignore conflicts but possibly warn or reject in case 3.

> > +int generic_parse_monolithic(struct fs_context *ctx, void *data)
> > +{
> > + char *options = data, *p;
> > + int ret;
> > +
> > + if (!options)
> > + return 0;
> > +
> > + while ((p = strsep(&options, ",")) != NULL) {
> > + if (*p) {
> > + ret = vfs_parse_mount_option(ctx, p);
>
> Monolithic option block is the legacy thing.

Yes, I know.

> It shouldn't be parsing the common flags. It should instead be treating
> them as forbidden (although it probably doesn't really matter, since no
> filesystem will accept these anyway).

Except that ext4, f2fs, 9p, ... do take at least some of them. I'm not sure
whether they ever see them, but without auditing userspace, there's no way to
know.

> So probably best to expand vfs_parse_mount_option() here and skip the
> sb flag parsing part.

You need to prove they are never seen here :-/
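
For reference, the splitting step of the quoted loop can be tried in userspace. This sketch uses strcspn() instead of strsep() to stay within ISO C, and unlike the kernel function it also splits each item at the first '=' before handing it to a callback; parse_monolithic(), struct opt_seen and record_option() are illustrative names only:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef int (*opt_cb)(const char *key, const char *val, void *ctx);

/* Example callback state: how many options were seen, and the last one. */
struct opt_seen {
	int count;
	char last_key[32];
	char last_val[32];
};

/* Example callback: record the option and accept it. */
int record_option(const char *key, const char *val, void *ctx)
{
	struct opt_seen *seen = ctx;

	seen->count++;
	snprintf(seen->last_key, sizeof(seen->last_key), "%s", key);
	snprintf(seen->last_val, sizeof(seen->last_val), "%s", val ? val : "");
	return 0;
}

/*
 * Split "key[=val][,key[=val]]*" in place and call cb once per non-empty
 * item.  Empty items (",,") are skipped, as in the quoted loop.  Stops at
 * the first negative return value from cb and passes it on.
 */
int parse_monolithic(char *options, opt_cb cb, void *ctx)
{
	char *p = options;

	if (!options)
		return 0;

	while (*p) {
		size_t len = strcspn(p, ",");
		char *next = p[len] ? p + len + 1 : p + len;

		p[len] = '\0';
		if (*p) {
			char *eq = strchr(p, '=');
			int ret;

			if (eq)
				*eq++ = '\0';
			ret = cb(p, eq, ctx);
			if (ret < 0)
				return ret;
		}
		p = next;
	}
	return 0;
}
```

A callback that returned an error for the common sb flags would be one way to treat them as forbidden here, which is the alternative being debated.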

> > + * @sb_flags: Superblock flags and op flags (such as MS_REMOUNT)
>
> I'm confused: MS_REMOUNT in sb_flags and FS_CONTEXT_FOR_REMOUNT in purpose?
>
> I hope that's just a stale comment, sb_flags should really be just the
> superblock flags and not any op flags.

Yeah - that's stale.

> Also, can FS_CONTEXT_FOR_REMOUNT be renamed to ..._RECONFIG?

If you really want ;-)

> > + * vfs_sb_reconfig - Create a filesystem context for remount/reconfiguration
> > + * @mnt: The mountpoint to open
> > + * @sb_flags: Superblock flags and op flags (such as MS_REMOUNT)
>
> Here again op flags make no sense.

Also stale.

> Also it should be made clear that the old sb flags will be overridden
> with these.

That's not necessarily the case. The filesystem can override the override.

if (sb->s_op->remount_fs_fc) {
retval = sb->s_op->remount_fs_fc(sb, fc);
---> sb_flags = fc->sb_flags;
} else {
---> retval = sb->s_op->remount_fs(sb, &sb_flags, data);
}

> > +/**
> > + * vfs_dup_fc_config: Duplicate a filesystem context.
> > + * @src_fc: The context to copy.
> > + */
>
> Can we introduce these before they actually get used.

I would rather not. Any fix I have to make then has to be distributed
backwards over a bunch of patches that have stuff that doesn't get compiled,
especially if a change touches code divided up between multiple patches.

David

2017-10-26 17:11:22

by Jeff Layton

Subject: Re: [PATCH 05/14] VFS: Implement fsopen() to prepare for a mount [ver #6]

On Fri, 2017-10-06 at 16:49 +0100, David Howells wrote:
> Provide an fsopen() system call that starts the process of preparing to
> mount, using an fd as a context handle. fsopen() is given the name of the
> filesystem that will be used:
>
> int mfd = fsopen(const char *fsname, int open_flags,
> void *reserved3, void *reserved4,
> void *reserved5);
>
> where open_flags can be 0 or O_CLOEXEC and reserved* should all be NULL for
> the moment.
>
> For example:
>
> mfd = fsopen("ext4", O_CLOEXEC, NULL, NULL, NULL);
> write(mfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> write(mfd, "o noatime");
> write(mfd, "o acl");
> write(mfd, "o user_attr");
> write(mfd, "o iversion");
> write(mfd, "o ");
> write(mfd, "r /my/container"); // root inside the fs
> write(mfd, "x create"); // create the superblock
> fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW);
>
> mfd = fsopen("afs", -1);
> write(mfd, "s %grand.central.org:root.cell");
> write(mfd, "o cell=grand.central.org");
> write(mfd, "r /");
> write(mfd, "x create");
> fsmount(mfd, AT_FDCWD, "/mnt", 0);
>
> If an error is reported at any step, an error message may be available to be
> read() back (ENODATA will be reported if there isn't an error available) in
> the form:
>
> "e <subsys>:<problem>"
> "e SELinux:Mount on mountpoint not permitted"
>
> Once fsmount() has been called, further write() calls will incur EBUSY,
> even if the fsmount() fails. read() is still possible to retrieve error
> information.
>
> The fsopen() syscall creates a mount context and hangs it off the fd that it
> returns.
>
> Netlink is not used because it is optional.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> arch/x86/entry/syscalls/syscall_32.tbl | 1
> arch/x86/entry/syscalls/syscall_64.tbl | 1
> fs/Makefile | 2
> fs/fsopen.c | 273 ++++++++++++++++++++++++++++++++
> include/linux/fs_context.h | 1
> include/linux/syscalls.h | 2
> include/uapi/linux/magic.h | 1
> kernel/sys_ni.c | 3
> 8 files changed, 283 insertions(+), 1 deletion(-)
> create mode 100644 fs/fsopen.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 448ac2161112..9bf8d4c62f85 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
> 382 i386 pkey_free sys_pkey_free
> 383 i386 statx sys_statx
> 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
> +385 i386 fsopen sys_fsopen
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..9b198c5fc412 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
> 330 common pkey_alloc sys_pkey_alloc
> 331 common pkey_free sys_pkey_free
> 332 common statx sys_statx
> +333 common fsopen sys_fsopen
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/Makefile b/fs/Makefile
> index ffe728cc15e1..c42d1d9351a6 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -12,7 +12,7 @@ obj-y := open.o read_write.o file_table.o super.o \
> seq_file.o xattr.o libfs.o fs-writeback.o \
> pnode.o splice.o sync.o utimes.o \
> stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
> - fs_context.o
> + fs_context.o fsopen.o
>
> ifeq ($(CONFIG_BLOCK),y)
> obj-y += buffer.o block_dev.o direct-io.o mpage.o
> diff --git a/fs/fsopen.c b/fs/fsopen.c
> new file mode 100644
> index 000000000000..6ca7e1979273
> --- /dev/null
> +++ b/fs/fsopen.c
> @@ -0,0 +1,273 @@
> +/* Filesystem access-by-fd.
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#include <linux/fs_context.h>
> +#include <linux/mount.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/syscalls.h>
> +
> +static struct vfsmount *fs_fs_mnt __read_mostly;
> +
> +static int fs_fs_release(struct inode *inode, struct file *file)
> +{
> + struct fs_context *fc = file->private_data;
> +
> + file->private_data = NULL;
> +
> + put_fs_context(fc);
> + return 0;
> +}
> +
> +/*
> + * Userspace writes configuration data and commands to the fd and we parse it
> + * here. For the moment, we assume a single option or command per write. Each
> + * line written is of the form
> + *
> + * <option_type><space><stuff...>
> + *
> + * d /dev/sda1 -- Device name

nit: I think you mean "s /dev/sda1", according to the sample program.

> + * o noatime -- Option without value
> + * o cell=grand.central.org -- Option with value
> + * r / -- Dir within device to mount
> + * x create -- Create a superblock
> + */
> +static ssize_t fs_fs_write(struct file *file,
> + const char __user *_buf, size_t len, loff_t *pos)
> +{
> + struct fs_context *fc = file->private_data;
> + struct inode *inode = file_inode(file);
> + char opt[2], *data;
> + ssize_t ret;
> +
> + if (len < 3 || len > 4095)
> + return -EINVAL;
> +
> + if (copy_from_user(opt, _buf, 2) != 0)
> + return -EFAULT;
> + switch (opt[0]) {
> + case 's':
> + case 'o':
> + case 'x':
> + break;
> + default:
> + goto err_bad_cmd;
> + }
> + if (opt[1] != ' ')
> + goto err_bad_cmd;
> +
> + data = memdup_user_nul(_buf + 2, len - 2);
> + if (IS_ERR(data))
> + return PTR_ERR(data);
> +
> + /* From this point onwards we need to lock the fd against someone
> + * trying to mount it.
> + */
> + ret = inode_lock_killable(inode);
> + if (ret < 0)
> + goto err_free;
> +

^^^
Should that be interruptible instead of killable? Allowing someone to ^c
a stuck mount program without a SIGKILL seems reasonable.

As a general design goal, it'd be nice to really try to keep as much of
this as responsive to signals as possible. Mounting and unmounting are
often something that can easily end up stuck.

> + ret = -EINVAL;
> + switch (opt[0]) {
> + case 's':
> + ret = vfs_set_fs_source(fc, data, len - 2);
> + if (ret < 0)
> + goto err_unlock;
> + data = NULL;
> + break;
> +
> + case 'o':
> + ret = vfs_parse_mount_option(fc, data);
> + if (ret < 0)
> + goto err_unlock;
> + break;
> +
> + case 'x':
> + if (strcmp(data, "create") == 0) {
> + ret = vfs_get_tree(fc);
> + } else {
> + ret = -EOPNOTSUPP;
> + }
> + if (ret < 0)
> + goto err_unlock;
> + break;
> +
> + default:
> + goto err_unlock;
> + }
> +
> + ret = len;
> +err_unlock:
> + inode_unlock(inode);
> +err_free:
> + kfree(data);
> + return ret;
> +err_bad_cmd:
> + return -EINVAL;
> +}
> +
> +const struct file_operations fs_fs_fops = {
> + .write = fs_fs_write,
> + .release = fs_fs_release,
> + .llseek = no_llseek,
> +};
> +
> +/*
> + * Indicate the name we want to display the filesystem file as.
> + */
> +static char *fs_fs_dname(struct dentry *dentry, char *buffer, int buflen)
> +{
> + return dynamic_dname(dentry, buffer, buflen, "fs:[%lu]",
> + d_inode(dentry)->i_ino);
> +}
> +
> +static const struct dentry_operations fs_fs_dentry_operations = {
> + .d_dname = fs_fs_dname,
> +};
> +
> +/*
> + * Create a file that can be used to configure a new mount.
> + */
> +static struct file *create_fs_file(struct fs_context *fc)
> +{
> + struct inode *inode;
> + struct file *f;
> + struct path path;
> + int ret;
> +
> + inode = alloc_anon_inode(fs_fs_mnt->mnt_sb);
> + if (!inode)
> + return ERR_PTR(-ENFILE);
> + inode->i_fop = &fs_fs_fops;
> +
> + ret = -ENOMEM;
> + path.dentry = d_alloc_pseudo(fs_fs_mnt->mnt_sb, &empty_name);
> + if (!path.dentry)
> + goto err_inode;
> + path.mnt = mntget(fs_fs_mnt);
> +
> + d_instantiate(path.dentry, inode);
> +
> + f = alloc_file(&path, FMODE_READ | FMODE_WRITE, &fs_fs_fops);
> + if (IS_ERR(f)) {
> + ret = PTR_ERR(f);
> + goto err_file;
> + }
> +
> + f->private_data = fc;
> + return f;
> +
> +err_file:
> + path_put(&path);
> + return ERR_PTR(ret);
> +
> +err_inode:
> + iput(inode);
> + return ERR_PTR(ret);
> +}
> +
> + const struct super_operations fs_fs_ops = {
> + .drop_inode = generic_delete_inode,
> + .destroy_inode = free_inode_nonrcu,
> + .statfs = simple_statfs,
> +};
> +
> +static struct dentry *fs_fs_mount(struct file_system_type *fs_type,
> + int flags, const char *dev_name,
> + void *data)
> +{
> + return mount_pseudo(fs_type, "fs_fs:", &fs_fs_ops,
> + &fs_fs_dentry_operations, FS_FS_MAGIC);
> +}
> +
> +static struct file_system_type fs_fs_type = {
> + .name = "fs_fs",
> + .mount = fs_fs_mount,
> + .kill_sb = kill_anon_super,
> +};
> +
> +static int __init init_fs_fs(void)
> +{
> + int ret;
> +
> + ret = register_filesystem(&fs_fs_type);
> + if (ret < 0)
> + panic("Cannot register fs_fs\n");
> +
> + fs_fs_mnt = kern_mount(&fs_fs_type);
> + if (IS_ERR(fs_fs_mnt))
> + panic("Cannot mount fs_fs: %ld\n", PTR_ERR(fs_fs_mnt));
> + return 0;
> +}
> +
> +fs_initcall(init_fs_fs);
> +
> +/*
> + * Open a filesystem by name so that it can be configured for mounting.
> + *
> + * We are allowed to specify a container in which the filesystem will be
> + * opened, thereby indicating which namespaces will be used (notably, which
> + * network namespace will be used for network filesystems).
> + */
> +SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,
> + void *, reserved3, void *, reserved4, void *, reserved5)
> +{
> + struct file_system_type *fs_type;
> + struct fs_context *fc;
> + struct file *file;
> + const char *fs_name;
> + int fd, ret;
> +
> + if (flags & ~O_CLOEXEC || reserved3 || reserved4 || reserved5)
> + return -EINVAL;
> +
> + fs_name = strndup_user(_fs_name, PAGE_SIZE);
> + if (IS_ERR(fs_name))
> + return PTR_ERR(fs_name);
> +
> + fs_type = get_fs_type(fs_name);
> + kfree(fs_name);
> + if (!fs_type)
> + return -ENODEV;
> +
> + fc = vfs_new_fs_context(fs_type, NULL, 0, FS_CONTEXT_FOR_USER_MOUNT);
> + put_filesystem(fs_type);
> + if (IS_ERR(fc))
> + return PTR_ERR(fc);
> +
> + ret = -ENOTSUPP;
> + if (!fc->ops)
> + goto err_fc;
> +
> + file = create_fs_file(fc);
> + if (IS_ERR(file)) {
> + ret = PTR_ERR(file);
> + goto err_fc;
> + }
> +
> + ret = get_unused_fd_flags(flags & O_CLOEXEC);
> + if (ret < 0)
> + goto err_file;
> +
> + fd = ret;
> + fd_install(fd, file);
> + return fd;
> +
> +err_file:
> + fput(file);
> + return ret;
> +
> +err_fc:
> + put_fs_context(fc);
> + return ret;
> +}
> diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
> index 8af6ff0e869e..3244b231ede0 100644
> --- a/include/linux/fs_context.h
> +++ b/include/linux/fs_context.h
> @@ -101,4 +101,5 @@ extern int vfs_get_super(struct fs_context *fc,
> int (*fill_super)(struct super_block *sb,
> struct fs_context *fc));
>
> +extern const struct file_operations fs_fs_fops;
> #endif /* _LINUX_FS_CONTEXT_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a78186d826d7..7cd1b65a4152 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -940,5 +940,7 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
> asmlinkage long sys_pkey_free(int pkey);
> asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
> unsigned mask, struct statx __user *buffer);
> +asmlinkage long sys_fsopen(const char *fs_name, unsigned int flags,
> + void *reserved3, void *reserved4, void *reserved5);
>
> #endif
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index e439565df838..722bf42f9564 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -87,5 +87,6 @@
> #define UDF_SUPER_MAGIC 0x15013346
> #define BALLOON_KVM_MAGIC 0x13661366
> #define ZSMALLOC_MAGIC 0x58295829
> +#define FS_FS_MAGIC 0x66736673
>
> #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 8acef8576ce9..de1dc63e7e47 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -258,3 +258,6 @@ cond_syscall(sys_membarrier);
> cond_syscall(sys_pkey_mprotect);
> cond_syscall(sys_pkey_alloc);
> cond_syscall(sys_pkey_free);
> +
> +/* fd-based mount */
> +cond_syscall(sys_fsopen);
>

--
Jeff Layton <[email protected]>

2017-10-26 17:21:56

by Jeff Layton

Subject: Re: [PATCH 07/14] VFS: Add a sample program for fsopen/fsmount [ver #6]

On Fri, 2017-10-06 at 16:50 +0100, David Howells wrote:
> Add a sample program for driving fsopen/fsmount.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> samples/fsmount/test-fsmount.c | 94 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 94 insertions(+)
> create mode 100644 samples/fsmount/test-fsmount.c
>
> diff --git a/samples/fsmount/test-fsmount.c b/samples/fsmount/test-fsmount.c
> new file mode 100644
> index 000000000000..75f91d272a19
> --- /dev/null
> +++ b/samples/fsmount/test-fsmount.c
> @@ -0,0 +1,94 @@
> +/* fd-based mount test.
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <sys/prctl.h>
> +#include <sys/wait.h>
> +
> +#define PR_ERRMSG_ENABLE 48
> +#define PR_ERRMSG_READ 49
> +
> +#define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
> +
> +static __attribute__((noreturn))
> +void mount_error(int fd, const char *s)
> +{
> + char buf[4096];
> + int err, n, perr;
> +
> + do {
> + err = errno;
> + errno = 0;
> + n = prctl(PR_ERRMSG_READ, buf, sizeof(buf));
> + perr = errno;
> + errno = err;
> + if (n > 0) {
> + fprintf(stderr, "Error: '%s': %*.*s: %m\n", s, n, n, buf);
> + } else {
> + fprintf(stderr, "%s: %m\n", s);
> + }
> + } while (perr == 0);
> + exit(1);
> +}
> +
> +#define E_write(fd, s) \
> + do { \
> + if (write(fd, s, sizeof(s) - 1) == -1) \
> + mount_error(fd, s); \
> + } while (0)
> +
> +static inline int fsopen(const char *fs_name, int flags,
> + void *reserved3, void *reserved4, void *reserved5)
> +
> +{
> + return syscall(333, fs_name, flags, reserved3, reserved4, reserved5);
> +}
> +
> +static inline int fsmount(int fsfd, int dfd, const char *path,
> + unsigned int at_flags, unsigned int flags)
> +{
> + return syscall(334, fsfd, dfd, path, at_flags, flags);
> +}
> +
> +int main()
> +{
> + int mfd;
> +
> + if (prctl(PR_ERRMSG_ENABLE, 1) < 0) {
> + perror("prctl/en");
> + exit(1);
> + }
> +
> + /* Mount an NFS filesystem */
> + mfd = fsopen("nfs4", 0, NULL, NULL, NULL);
> + if (mfd == -1) {
> + perror("fsopen");
> + exit(1);
> + }
> +
> + E_write(mfd, "s warthog:/data");
> + E_write(mfd, "o fsc");
> + E_write(mfd, "o sync");
> + E_write(mfd, "o intr");
> + E_write(mfd, "o vers=4.2");
> + E_write(mfd, "o addr=90.155.74.18");
> + E_write(mfd, "o clientaddr=90.155.74.21");
> + E_write(mfd, "x create");
> + if (fsmount(mfd, AT_FDCWD, "/mnt", 0, 0) < 0)
> + mount_error(mfd, "fsmount");
> + E(close(mfd));
> +
> + exit(0);
> +}
>

So to make sure I understand....

Suppose I want to do a bind mount with the new API. Would I do something
like this?

mfd = fsopen("???");
write(mfd, "s /path/to/old/mount");
write(mfd, "o bind");
fsmount(mfd, ...);

That seems a bit klunkier than before as I now need to pay attention to
the fstype. I guess I'd have to scrape /proc/mounts for that info?
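
For what it's worth, the scraping itself would be short. A minimal userspace sketch, assuming the standard getmntent(3) interface; lookup_fstype() is a made-up helper name and not part of the patch set:

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up the filesystem type of the mount at 'dir' in a mount table
 * file such as /proc/mounts. Returns 0 and fills 'type' on success,
 * -1 if the table cannot be opened or no entry matches.
 * The last matching entry wins, since later mounts shadow earlier ones.
 */
static int lookup_fstype(const char *table, const char *dir,
			 char *type, size_t typelen)
{
	struct mntent *m;
	FILE *f;
	int ret = -1;

	assert(type != NULL && typelen > 0);

	f = setmntent(table, "r");
	if (!f)
		return -1;

	while ((m = getmntent(f)) != NULL) {
		if (strcmp(m->mnt_dir, dir) == 0) {
			snprintf(type, typelen, "%s", m->mnt_type);
			ret = 0;
		}
	}
	endmntent(f);
	return ret;
}
```

The result of lookup_fstype("/proc/mounts", "/path/to/old/mount", buf, sizeof(buf)) would then be what gets passed to fsopen() in place of the "???" above.
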
--
Jeff Layton <[email protected]>

2017-10-26 19:01:59

by Jeff Layton

Subject: Re: [PATCH 05/14] VFS: Implement fsopen() to prepare for a mount [ver #6]

On Fri, 2017-10-06 at 16:49 +0100, David Howells wrote:
> Provide an fsopen() system call that starts the process of preparing to
> mount, using an fd as a context handle. fsopen() is given the name of the
> filesystem that will be used:
>
> int mfd = fsopen(const char *fsname, int open_flags,

Can we make open_flags unsigned?

> void *reserved3, void *reserved4,
> void *reserved5);
>
> where open_flags can be 0 or O_CLOEXEC and reserved* should all be NULL for
> the moment.
>
> For example:
>
> mfd = fsopen("ext4", O_CLOEXEC, NULL, NULL, NULL);

While I understand the appeal of reusing O_CLOEXEC, I think we'd be
better off with a completely new set of flags here. It's not a "real"
open.

You can define FSO_CLOEXEC and then you have another 31 bits to play
with later should you need to do so.
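
A minimal sketch of what such a dedicated flag space could look like. The FSO_CLOEXEC value and the FSO_VALID_FLAGS and fso_check_flags() names are assumptions for illustration only; nothing like this exists in the posted patch:

```c
#include <errno.h>

/* Hypothetical flag space for fsopen(); the values are illustrative only. */
#define FSO_CLOEXEC	0x00000001u
#define FSO_VALID_FLAGS	(FSO_CLOEXEC)

/* Reject any bit that has not been assigned a meaning yet. */
static int fso_check_flags(unsigned int flags)
{
	if (flags & ~FSO_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits up front is what keeps the remaining 31 bits usable later, since no existing caller can have been passing garbage in them.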

> write(mfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
> write(mfd, "o noatime");
> write(mfd, "o acl");
> write(mfd, "o user_attr");
> write(mfd, "o iversion");
> write(mfd, "o ");
> write(mfd, "r /my/container"); // root inside the fs
> write(mfd, "x create"); // create the superblock
> fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW);
>
> mfd = fsopen("afs", -1);
> write(mfd, "s %grand.central.org:root.cell");
> write(mfd, "o cell=grand.central.org");
> write(mfd, "r /");
> write(mfd, "x create");
> fsmount(mfd, AT_FDCWD, "/mnt", 0);
>
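
Since the examples above ignore write's length argument, here is a rough userspace sketch of a helper that builds one command and writes it with its exact length. It mirrors the checks in fs_fs_write() from this patch (command types 's', 'o' and 'x', total length 3..4095); format_fs_cmd(), send_fs_cmd() and FS_CMD_MAX are hypothetical names:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FS_CMD_MAX 4095	/* upper bound on one write, per fs_fs_write() */

/*
 * Build "<type><space><arg>" in buf. Returns the command length, or -1 if
 * the type is not one of 's', 'o', 'x' or the result would fall outside
 * the 3..4095 byte range that fs_fs_write() accepts.
 */
static int format_fs_cmd(char *buf, size_t buflen, char type, const char *arg)
{
	size_t len = strlen(arg) + 2;

	if (type != 's' && type != 'o' && type != 'x')
		return -1;
	if (len < 3 || len > FS_CMD_MAX || len >= buflen)
		return -1;
	snprintf(buf, buflen, "%c %s", type, arg);
	return (int)len;
}

/* Write one command to the context fd with its exact length. */
static int send_fs_cmd(int fd, char type, const char *arg)
{
	char buf[FS_CMD_MAX + 1];
	int len = format_fs_cmd(buf, sizeof(buf), type, arg);

	if (len < 0)
		return -1;
	return write(fd, buf, (size_t)len) == len ? 0 : -1;
}
```

With that, the ext4 example would become send_fs_cmd(mfd, 's', "/dev/sdb1"), send_fs_cmd(mfd, 'o', "noatime") and so on. Note that an "r" command would be rejected here, just as the posted fs_fs_write() rejects it.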

We chatted a bit about this on IRC, but I'll reply here too for public
consumption:

I think you may need some other stuff to fully emulate what we call bind
mounting today:

1) a way to attach a new fs_context to an existing superblock. Maybe a
mntopen() syscall? Or maybe we can use a new FSO_* flag in conjunction
with a string in one of the reserved fields?

2) a way to walk down to a particular dentry inside the superblock and
mount it instead of the actual root. For the interface you could just
define a new "d /path/inside/superblock" command. Then, do a pathwalk
from the existing root dentry and replace the fscontext root dentry with
it.

> If an error is reported at any step, an error message may be available to be
> read() back (ENODATA will be reported if there isn't an error available) in
> the form:
>
> "e <subsys>:<problem>"
> "e SELinux:Mount on mountpoint not permitted"
>
> Once fsmount() has been called, further write() calls will incur EBUSY,
> even if the fsmount() fails. read() is still possible to retrieve error
> information.
>
> The fsopen() syscall creates a mount context and hangs it off the fd that it
> returns.
>
> Netlink is not used because it is optional.
>
> Signed-off-by: David Howells <[email protected]>
> ---

--
Jeff Layton <[email protected]>

2017-10-26 22:40:26

by David Howells

Subject: Re: [PATCH 07/14] VFS: Add a sample program for fsopen/fsmount [ver #6]

Jeff Layton <[email protected]> wrote:

> So to make sure I understand....
>
> Suppose I want to do a bind mount with the new API. Would I do something
> like this?
>
> mfd = fsopen("???");
> write(mfd, "s /path/to/old/mount");

You would have to use something other than "s" as that indicates the medium
source, but there are plenty of other options. Alternatively, something like:

mfd = mntopen("/path/to/old/mount", ...);

You would need some way to retrieve the fs type, though. Maybe the first
read() you do will return it:

char fstype[256];
read(mfd, fstype, 256);

ends up with:

"fs <type>"

in the buffer, though I think I'd prefer it to be manually elicited.

> write(mfd, "o bind");

This is unnecessary as you can just do:

> fsmount(mfd, ...);

at this point to achieve the effect.

> That seems a bit klunkier than before as I now need to pay attention to
> the fstype. I guess I'd have to scrape /proc/mounts for that info?

I haven't worked out this interface yet. I'm not sure it's actually necessary
at this point, though it'd be nice to have.

David

2017-10-27 09:24:22

by Miklos Szeredi

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

On Thu, Oct 26, 2017 at 6:24 PM, David Howells <[email protected]> wrote:
> Miklos Szeredi <[email protected]> wrote:
>
>> > +/**
>> > + * vfs_parse_mount_option - Add a single mount option to a superblock config
>>
>> Mount options are those that refer to the mount
>> (nosuid,nodev,noatime,etc..); this function is not parsing those,
>> AFAICT.
>
> I'd quibble that those are "mountpoint" options, not "mount" options. Mount
> options are the options you can pass to mount and are a mixed bag. But fair
> enough, it's probably worth avoiding such terminology where we can.
>
>> How about vfs_parse_fs_option()?
>
> Sure. Can we please not rename it again?

I promise ;)

Also, how about moving calls to vfs_parse_fs_option() into filesystem
code? Even those options are not generic: some filesystems want
this, some that. It's just a historical accident that those are set
with MS_FOO and not "foo". Filesystems that don't have any option
parsing could have a generic version that deals with "ro/rw", the
others can handle these options along with the rest.

>
>> We probably also need a "reset" type of option that clears all bits
>> and is also passed onto the filesystem's parsing routine so it can
>> reset all options as well.
>
> Reset what? To what? To a blank slate? To the state on the medium? What if
> it's a netfs?
>
> This operation isn't well defined and I'm not sure it's useful because:
>
> (1) Unless we can preload options from some source, the starting context is
> blank, so why do you need a reset on a new mount?

Reset only makes sense in the context of reconfig (fka. remount).

> (2) We need to find out what state the options are currently in. Reset today
> doesn't necessarily mean the same as reset tomorrow.
>
> (3) Not all options are simple on/off switches. Some of them are multistate,
> some are strings/numbers that have non-zero defaults and some have
> dependencies on other options.
>
> (4) Not all options can be simply reset to "0", particularly if the
> filesystem is live. Take an option that points to a network server or a
> separate journalling device for example.

I'd think reset restores the state to a default. Default state is
what a new fs instance without any specified options starts out with.
Yes it's different today and different tomorrow. Yes, some options
are not binary. Yes, some options are not mutable on a live
filesystem. Regardless of those, I think it makes sense to allow a
reconfig that results in a configuration that would have been reached
by setting the same options on a new mount. I.e. have a "replace
configuration" as well as a "change bits of current configuration"
mode.
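
To make the two modes concrete, here is a toy userspace sketch. Everything in it (struct demo_opts, the DEMO_SET_* mask bits, the default values) is invented for illustration and is not taken from any real filesystem or from the patch set:

```c
#include <stdbool.h>

/* Hypothetical per-filesystem options. */
struct demo_opts {
	bool read_only;
	unsigned int commit_secs;
};

/* Which fields of a request were actually specified by the user. */
#define DEMO_SET_RO	0x1u
#define DEMO_SET_COMMIT	0x2u

struct demo_request {
	unsigned int set;	/* mask of DEMO_SET_* */
	struct demo_opts val;
};

/* What a new instance with no options specified would start out with. */
static const struct demo_opts demo_defaults = {
	.read_only = false,
	.commit_secs = 5,
};

/* "Change bits" mode: only the specified fields are touched. */
static void demo_reconfig_change(struct demo_opts *cur,
				 const struct demo_request *req)
{
	if (req->set & DEMO_SET_RO)
		cur->read_only = req->val.read_only;
	if (req->set & DEMO_SET_COMMIT)
		cur->commit_secs = req->val.commit_secs;
}

/* "Replace" mode: reset to defaults, then apply, as on a new mount. */
static void demo_reconfig_replace(struct demo_opts *cur,
				  const struct demo_request *req)
{
	*cur = demo_defaults;
	demo_reconfig_change(cur, req);
}
```

The sketch deliberately ignores the hard part raised above, namely options that cannot be changed on a live filesystem.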

But let's leave that for later if it's not something trivial.

>> 1/a) New sb:
>> 1/b) New sb for legacy mount(2)
>
> Looking at this in terms of ext4, I would make the parser create an "option
> change" script prior to loading the superblock. The reason for that with ext4
> is that ext4 stores an additional option string that must be parsed and
> applied first - except that we potentially need some of the mount-supplied
> options to be able to mount the fs.
>
> So in the new-mount-of-new-sb case, I would actually create two scripts, one
> for the options written to the context fd, then one for the on-disk script,
> then validate the context and then apply them both atomically.

Agree.

>
>> 2/a) Shared sb:
>> 2/b) Shared sb for legacy mount(2)
>
> In the new-mount-of-live-sb case, I would validate the context script and
> ignore any options that try to change things that can't be changed because the
> fs is live.

Your sentence seems to imply that we do change those that can be
changed. That's not what legacy does, it ignores *all* options
(except rw/ro for which it errors out on mismatch). I don't think
that's a nice behavior, but we definitely need to keep it for legacy.

For non-legacy, do we want to extend the "error out on mismatch"
behavior to all options, rather than ignoring them?

> It might be nice to report them also, but that requires a mechanism to do so.
>
>> 3/a) Reconfig
>> 3/b) Reconfig for legacy mount(2) (i.e. MS_REMOUNT)
>
> In the reconfigure case, I only need to create one script, validate it and
> then apply it atomically (well, as atomically as possible, given the fs is
> actually live at this point).

Yep.

>
> There's the question of how far you allow a happens-to-share mount to effect a
> reconfigure. Seems a reasonable distinction to say that in your case 2 you
> just ignore conflicts but possibly warn or reject in case 3.

Not sure I understand why we'd want to ignore conflicts in case 2 and
not in 3. Can we not have consistency (error out on all conflicts)?

>> > +int generic_parse_monolithic(struct fs_context *ctx, void *data)
>> > +{
>> > + char *options = data, *p;
>> > + int ret;
>> > +
>> > + if (!options)
>> > + return 0;
>> > +
>> > + while ((p = strsep(&options, ",")) != NULL) {
>> > + if (*p) {
>> > + ret = vfs_parse_mount_option(ctx, p);
>>
>> Monolithic option block is the legacy thing.
>
> Yes, I know.
>
>> It shouldn't be parsing the common flags. It should instead be treating
>> them as forbidden (although it probably doesn't really matter, since no
>> filesystem will accept these anyway).
>
> Except that ext4, f2fs, 9p, ... do take at least some of them. I'm not sure
> whether they ever see them, but without auditing userspace, there's no way to
> know.

So moving possibly dead code to the level of VFS fixes things how?

Let filesystems deal with that crap and make sure they deal with it
only for legacy mount and not for the new, supposedly clean one.
Making it generic also possibly breaks uABI by allowing an option that
was rejected previously for some other fs.
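
Roughly what I have in mind, as a userspace sketch: the generic splitter only splits, and every item goes straight to the filesystem's own parser, which decides what it accepts. The names (demo_parse_fn, demo_parse_monolithic, demo_fs_parse) and the accepted option set are made up, and strcspn() stands in for the kernel's strsep() loop so that the example stays within ISO C:

```c
#include <errno.h>
#include <string.h>

/* Hypothetical per-filesystem hook: 0 to accept, negative errno to reject. */
typedef int (*demo_parse_fn)(void *ctx, const char *opt);

/*
 * Split a comma-separated legacy option string and pass every non-empty
 * item to the filesystem's parser. 'options' is modified in place.
 * Stops at the first error and returns it.
 */
static int demo_parse_monolithic(void *ctx, char *options, demo_parse_fn parse)
{
	int ret;

	if (!options)
		return 0;

	while (*options) {
		size_t n = strcspn(options, ",");
		char *next = options + n + (options[n] == ',');

		options[n] = '\0';
		if (n > 0) {
			ret = parse(ctx, options);
			if (ret < 0)
				return ret;
		}
		options = next;
	}
	return 0;
}

/* Toy filesystem parser: counts what it accepts, rejects everything else. */
struct demo_ctx {
	int seen;
};

static int demo_fs_parse(void *ctx, const char *opt)
{
	struct demo_ctx *c = ctx;

	if (strcmp(opt, "ro") == 0 || strcmp(opt, "rw") == 0 ||
	    strncmp(opt, "commit=", 7) == 0) {
		c->seen++;
		return 0;
	}
	return -EINVAL;
}
```

Here "ro" and "rw" are accepted only because this particular toy parser chooses to accept them; another filesystem's hook could reject them exactly as it does today.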

>
>> So probably best to expand vfs_parse_mount_option() here and skip the
>> sb flag parsing part.
>
> You need to prove they are never seen here :-/
>
>> > + * @sb_flags: Superblock flags and op flags (such as MS_REMOUNT)
>>
>> I'm confused: MS_REMOUNT in sb_flags and FS_CONTEXT_FOR_REMOUNT in purpose?
>>
>> I hope that's just a stale comment, sb_flags should really be just the
>> superblock flags and not any op flags.
>
> Yeah - that's stale.
>
>> Also, can FS_CONTEXT_FOR_REMOUNT be renamed to ..._RECONFIG?
>
> If you really want ;-)

Yes. I think clean naming results in clean concepts in one's head,
which results in clean interfaces. Which is *the* purpose of this
exercise.

Thanks,
Miklos

2017-10-27 14:35:35

by David Howells

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

Miklos Szeredi <[email protected]> wrote:

> Also, how about moving calls to vfs_parse_fs_option() into filesystem
> code? Even those options are not generic: some filesystems want
> this, some that. It's just a historical accident that those are set
> with MS_FOO and not "foo". Filesystems that don't have any option
> parsing could have a generic version that deals with "ro/rw", the
> others can handle these options along with the rest.

Ummm... I don't see how that would work. vfs_parse_mount_option() (or
vfs_parse_fs_option() as it will become) is the way into the filesystem from
write(mfd, "o foo") and also applies the security policy before the filesystem
gets its hands on the option.

Did you mean vfs_parse_sb_flag_option()? The point of that function is so
that the name->flag mapping tables don't have to be replicated in every
filesystem.
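
To illustrate the kind of shared name->flag table meant here, a minimal userspace sketch follows. It is not the actual vfs_parse_sb_flag_option(); the demo_* names and the bit values are assumptions chosen for the example and do not match the kernel's SB_* constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative bit values only; these are not the kernel's SB_* constants. */
#define DEMO_SB_RDONLY		(1u << 0)
#define DEMO_SB_SYNCHRONOUS	(1u << 1)
#define DEMO_SB_DIRSYNC		(1u << 2)
#define DEMO_SB_LAZYTIME	(1u << 3)

struct demo_flag_name {
	const char *name;
	unsigned int flag;
	bool set;	/* true: set the bit, false: clear it */
};

/* One table shared by every caller instead of one copy per filesystem. */
static const struct demo_flag_name demo_flag_names[] = {
	{ "ro",		DEMO_SB_RDONLY,		true  },
	{ "rw",		DEMO_SB_RDONLY,		false },
	{ "sync",	DEMO_SB_SYNCHRONOUS,	true  },
	{ "async",	DEMO_SB_SYNCHRONOUS,	false },
	{ "dirsync",	DEMO_SB_DIRSYNC,	true  },
	{ "lazytime",	DEMO_SB_LAZYTIME,	true  },
	{ "nolazytime",	DEMO_SB_LAZYTIME,	false },
};

/*
 * Returns true and updates *flags if name is a common flag option;
 * returns false (leaving *flags alone) if the option belongs to the
 * filesystem's own parser.
 */
static bool demo_parse_sb_flag(const char *name, unsigned int *flags)
{
	size_t i;

	assert(name != NULL && flags != NULL);
	for (i = 0; i < sizeof(demo_flag_names) / sizeof(demo_flag_names[0]); i++) {
		if (strcmp(name, demo_flag_names[i].name) != 0)
			continue;
		if (demo_flag_names[i].set)
			*flags |= demo_flag_names[i].flag;
		else
			*flags &= ~demo_flag_names[i].flag;
		return true;
	}
	return false;
}
```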

Also, filesystems can supply a ->validate() method that rejects any SB_* flags
they don't want to support, but for legacy purposes we probably can't do that.

> Reset only makes sense in the context of reconfig (fka. remount).

Okay, that makes more sense.

> But let's leave it to later if it's not something trivial.

I don't think it is trivial - and it's something that would have to be dealt
with on an fs-by-fs basis and very well documented.

Btw, how would it affect the LSM?

Also, how do you propose to use it? I presume you're not thinking of someone
talking to the socket with a telnet-like interface.

> >> 2/a) Shared sb:
> >> 2/b) Shared sb for legacy mount(2)
> >
> > In the new-mount-of-live-sb case, I would validate the context script and
> > ignore any options that try to change things that can't be changed because
> > the fs is live.
>
> Your sentence seems to imply that we do change those that can be
> changed. That's not what legacy does, it ignores *all* options
> (except rw/ro for which it errors out on mismatch). I don't think
> that's a nice behavior, but we definitely need to keep it for legacy.
>
> For non-legacy, do we want to extend the "error out on mismatch"
> behavior to all options, rather than ignoring them?

Actually, we might want to ignore all the options. That might itself be an
option, kind of like O_CREAT/O_EXCL. I think someone suggested this before.

> > There's the question of how far you allow a happens-to-share mount to
> > effect a reconfigure. Seems a reasonable distinction to say that in your
> > case 2 you just ignore conflicts but possibly warn or reject in case 3.
>
> Not sure I understand why we'd want to ignore conflicts in case 2 and
> not in 3. Can we not have consistency (error out on all conflicts)?

I was thinking that if you mount a source that's already mounted, it would do
a reconfigure instead, but I think this is addressed above as "2) shared sb".

> > Except that ext4, f2fs, 9p, ... do take at least some of them. I'm not
> > sure whether they ever see them, but without auditing userspace, there's
> > no way to know.
>
> So moving possibly dead code to the level of VFS fixes things how?

It's not dead code. You can call the mount() syscall directly, and something
like busybox might well do so. Normally these are weeded out by userspace.

It's possible, even, in the ext4 case that you might store these options on
disk in the options string in the superblock.

> Let filesystems deal with that crap and make sure they deal with it
> only for legacy mount and not for the new, supposedly clean one.

Sorry, how does the new, clean one do it without handling these options?
There is no MS_* mask passed in, except to fsmount().

> Making it generic also possibly breaks uABI by allowing an option that
> was rejected previously for some other fs.

That's not a particularly serious break, I wouldn't've thought. Further, the
set of options that a filesystem will take evolves over time, and what was
rejected yesterday might be accepted today.

All the UAPI SB_* options can be passed in to mount(2) from userspace, and
filesystems all just ignore them if they don't want to support them as far as
I know. If this is the case, I don't see a problem with letting generic code
parse these common options.

David

2017-10-27 15:33:07

by Miklos Szeredi

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

Adding linux-api@vger (I think you should add this Cc for patches
which add/change userspace APIs).

On Fri, Oct 27, 2017 at 4:35 PM, David Howells <[email protected]> wrote:
> Miklos Szeredi <[email protected]> wrote:
>
>> Also, how about moving calls to vfs_parse_fs_option() into filesystem
>> code? Even those options are not generic: some filesystems want
>> this, some that. It's just a historical accident that those are set
>> with MS_FOO and not "foo". Filesystems that don't have any option
>> parsing could have a generic version that deals with "ro/rw", the
>> others can handle these options along with the rest.
>
> Ummm... I don't see how that would work. vfs_parse_mount_option() (or
> vfs_parse_fs_option() as it will become) is the way into the filesystem from
> write(mfd, "o foo") and also applies the security policy before the filesystem
> gets its hands on the option.
>
> Did you mean vfs_parse_sb_flag_option()? The point of that function is so
> that the name->flag mapping tables don't have to be replicated in every
> filesystem.

Yes I did mean vfs_parse_sb_flag_option().

Yes, I understand its purpose, but it would be cleaner if all the
option parsing was done in fc->ops->parse_option().

It might be worth introducing the vfs_parse_sb_flag_option(), to be
called from ->parse_option().

>
> Also, filesystems can supply a ->validate() method that rejects any SB_* flags
> they don't want to support, but for legacy purposes we probably can't do that.
>
>> Reset only makes sense in the context of reconfig (fka. remount).
>
> Okay, that makes more sense.
>
>> But let's leave it to later if it's not something trivial.
>
> I don't think it is trivial - and it's something that would have to be dealt
> with on an fs-by-fs basis and very well documented.
>
> Btw, how would it affect the LSM?

LSM would have to reject a "reset" if not enough privileges to
*create* a new fs instance, since it essentially requires creating a
new config, which is what is done when creating an fs instance.

>
> Also, how do you propose to use it? I presume you're not thinking of someone
> talking to the socket with a telnet-like interface.

No. It would be a command-line option for the relevant userspace utility:

fs-reconfig /mnt/foo --reset "ro"

as opposed to

fs-reconfig /mnt/foo "ro"

The former would change the options to default + "ro".

The latter would change "rw"->"ro" and leave all other options alone.
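
The difference between the two invocations can be sketched in userspace C as follows. This is only an illustration under assumed names: struct demo_opts, its fields, its defaults and the "compress" option are hypothetical and stand in for whatever a real filesystem would carry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option set of some filesystem instance. */
struct demo_opts {
	bool read_only;
	bool compress;
};

/* Hypothetical defaults used for a fresh instance and for --reset. */
static const struct demo_opts demo_defaults = { false, false };

/* Apply a single option string on top of an existing option set. */
static int demo_apply(struct demo_opts *o, const char *opt)
{
	if (strcmp(opt, "ro") == 0)
		o->read_only = true;
	else if (strcmp(opt, "rw") == 0)
		o->read_only = false;
	else if (strcmp(opt, "compress") == 0)
		o->compress = true;
	else if (strcmp(opt, "nocompress") == 0)
		o->compress = false;
	else
		return -EINVAL;
	return 0;
}

/*
 * reset == true:  new config = defaults + opt
 * reset == false: new config = current config + opt
 * The live config is only replaced once the option has been accepted.
 */
static int demo_reconfig(struct demo_opts *live, const char *opt, bool reset)
{
	struct demo_opts next;
	int ret;

	assert(live != NULL && opt != NULL);
	next = reset ? demo_defaults : *live;
	ret = demo_apply(&next, opt);
	if (ret < 0)
		return ret;
	*live = next;
	return 0;
}
```

Starting from "rw,compress", a plain reconfigure with "ro" yields "ro,compress", while the reset variant yields "ro" with compression back at its default.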

>
>> >> 2/a) Shared sb:
>> >> 2/b) Shared sb for legacy mount(2)
>> >
>> > In the new-mount-of-live-sb case, I would validate the context script and
>> > ignore any options that try to change things that can't be changed because
>> > the fs is live.
>>
>> Your sentence seems to imply that we do change those that can be
>> changed. That's not what legacy does, it ignores *all* options
>> (except rw/ro for which it errors out on mismatch). I don't think
>> that's a nice behavior, but we definitely need to keep it for legacy.
>>
>> For non-legacy, do we want to extend the "error out on mismatch"
>> behavior to all options, rather than ignoring them?
>
> Actually, we might want to ignore all the options. That might itself be an
> option, kind of like O_CREAT/O_EXCL. I think someone suggested this before.

Okay, that makes sense.

>
>> > There's the question of how far you allow a happens-to-share mount to
>> > effect a reconfigure. Seems a reasonable distinction to say that in your
>> > case 2 you just ignore conflicts but possibly warn or reject in case 3.
>>
>> Not sure I understand why we'd want to ignore conflicts in case 2 and
>> not in 3. Can we not have consistency (error out on all conflicts)?
>
> I was thinking that if you mount a source that's already mounted, it would do
> a reconfigure instead, but I think this is addressed above as "2) shared sb".
>
>> > Except that ext4, f2fs, 9p, ... do take at least some of them. I'm not
>> > sure whether they ever see them, but without auditing userspace, there's
>> > no way to know.
>>
>> So moving possibly dead code to the level of VFS fixes things how?
>
> It's not dead code. You can call the mount() syscall directly, and something
> like busybox might well do so. Normally these are weeded out by userspace.
>
> It's possible, even, in the ext4 case that you might store these options on
> disk in the options string in the superblock.
>
>> Let filesystems deal with that crap and make sure they deal with it
>> only for legacy mount and not for the new, supposedly clean one.
>
> Sorry, how does the new, clean one do it without handling these options?
> There is no MS_* mask passed in, except to fsmount().

The new one certainly should.

>
>> Making it generic also possibly breaks uABI by allowing an option that
>> was rejected previously for some other fs.
>
> That's not a particularly serious break, I wouldn't've thought. Further, the
> set of options that a filesystem will take evolves over time, and what was
> rejected yesterday might be accepted today.
>
> All the UAPI SB_* options can be passed in to mount(2) from userspace, and
> filesystems all just ignore them if they don't want to support them as far as
> I know. If this is the case, I don't see a problem with letting generic code
> parse these common options.

Ignoring unknown flags/options is generally a bad idea.

Thanks,
Miklos

2017-10-27 16:03:06

by David Howells

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

Miklos Szeredi <[email protected]> wrote:

> Yes I did mean vfs_parse_sb_flag_option().
>
> Yes, I understand its purpose, but it would be cleaner if all the
> option parsing was done in fc->ops->parse_option().
>
> It might be worth introducing the vfs_parse_sb_flag_option(), to be
> called from ->parse_option().

I was trying to relieve the filesystem of the requirement to have to deal with
common stuff and also the need to talk directly to the LSM.

> > Btw, how would it affect the LSM?
>
> LSM would have to reject a "reset" if not enough privileges to
> *create* a new fs instance, since it essentially requires creating a
> new config, which is what is done when creating an fs instance.

That's not what I'm asking. Would the reset change LSM state? Reset security
labels and options?

> > Sorry, how does the new, clean one do it without handling these options?
> > There is no MS_* mask passed in, except to fsmount().
>
> The new one certainly should.

Should what?

> Ignoring unknown flags/options is generally a bad idea.

They're not unknown - just not of interest to the filesystem.

David

2017-10-30 08:44:52

by Miklos Szeredi

Subject: Re: [PATCH 03/14] VFS: Implement a filesystem superblock creation/configuration context [ver #6]

On Fri, Oct 27, 2017 at 6:03 PM, David Howells <[email protected]> wrote:
> Miklos Szeredi <[email protected]> wrote:
>
>> Yes I did mean vfs_parse_sb_flag_option().
>>
>> Yes, I understand its purpose, but it would be cleaner if all the
>> option parsing was done in fc->ops->parse_option().
>>
>> It might be worth introducing the vfs_parse_sb_flag_option(), to be
>> called from ->parse_option().
>
> I was trying to relieve the filesystem of the requirement to have to deal with
> common stuff and also the need to talk directly to the LSM.

No need to talk directly to the LSM:
security_fs_context_parse_option() will do that in VFS code.

How common is common stuff?

dirsync/sync/rw: not handled by all filesystems, those that don't
handle it should reject the option on the new interface

lazytime: handled by generic code, AFAICS, but makes no sense on
read-only fs so those should probably reject it

mand: handled by generic code, but does not make sense for some
filesystems (e.g. those that don't have all the unixy permission
bits).

posixacl: there's no such mount option now. The option is "acl" and
does not get translated to MS_POSIXACL in mount(8). Makes zero sense
to add a previously nonexistent option to the new interface.

silent: makes no sense on the new interface, since we should no longer
be putting error messages into the kernel log.

So that leaves async/ro/nolazytime/nomand options to be handled by all
filesystems.

Not sure how to best handle these differences, but the current code
definitely seems lacking, and I cannot imagine a better way than to
pass all options to filesystem's ->parse_option() and add helper(s) to
handle the generic options.
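
A rough userspace sketch of that arrangement, under stated assumptions: every option goes to the filesystem's own parser, which calls a shared helper for the generic names and tells it which of them it is willing to accept. The demo_* names, the flag values and the read-only example filesystem with its "compress" option are all hypothetical; nothing here is taken from the patch series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative flag bits; not the kernel's SB_* values. */
#define DEMO_SB_RDONLY		(1u << 0)
#define DEMO_SB_SYNCHRONOUS	(1u << 1)
#define DEMO_SB_LAZYTIME	(1u << 2)

/* One bit per generic option name, so "ro" and "rw" can be allowed separately. */
enum {
	DEMO_G_RO		= 1u << 0,
	DEMO_G_RW		= 1u << 1,
	DEMO_G_SYNC		= 1u << 2,
	DEMO_G_ASYNC		= 1u << 3,
	DEMO_G_LAZYTIME		= 1u << 4,
	DEMO_G_NOLAZYTIME	= 1u << 5,
};

struct demo_generic {
	const char *name;
	unsigned int id;	/* DEMO_G_* bit for this name */
	unsigned int flag;	/* DEMO_SB_* bit it affects */
	bool set;
};

static const struct demo_generic demo_generic_opts[] = {
	{ "ro",		DEMO_G_RO,		DEMO_SB_RDONLY,		true  },
	{ "rw",		DEMO_G_RW,		DEMO_SB_RDONLY,		false },
	{ "sync",	DEMO_G_SYNC,		DEMO_SB_SYNCHRONOUS,	true  },
	{ "async",	DEMO_G_ASYNC,		DEMO_SB_SYNCHRONOUS,	false },
	{ "lazytime",	DEMO_G_LAZYTIME,	DEMO_SB_LAZYTIME,	true  },
	{ "nolazytime",	DEMO_G_NOLAZYTIME,	DEMO_SB_LAZYTIME,	false },
};

struct demo_ctx {
	unsigned int sb_flags;
	bool compress;		/* hypothetical fs-private option */
};

/*
 * Shared helper: returns 1 if opt was a generic option and was applied,
 * 0 if opt is not generic, -EINVAL if it is generic but not in `allowed`.
 */
static int demo_parse_generic(struct demo_ctx *ctx, const char *opt,
			      unsigned int allowed)
{
	size_t i;

	for (i = 0; i < sizeof(demo_generic_opts) / sizeof(demo_generic_opts[0]); i++) {
		const struct demo_generic *g = &demo_generic_opts[i];

		if (strcmp(opt, g->name) != 0)
			continue;
		if (!(g->id & allowed))
			return -EINVAL;
		if (g->set)
			ctx->sb_flags |= g->flag;
		else
			ctx->sb_flags &= ~g->flag;
		return 1;
	}
	return 0;
}

/* Parser of a hypothetical read-only filesystem: no rw, sync or lazytime. */
static int demo_rofs_parse_option(struct demo_ctx *ctx, const char *opt)
{
	int ret;

	assert(ctx != NULL && opt != NULL);
	ret = demo_parse_generic(ctx, opt,
				 DEMO_G_RO | DEMO_G_ASYNC | DEMO_G_NOLAZYTIME);
	if (ret < 0)
		return ret;
	if (ret > 0)
		return 0;

	if (strcmp(opt, "compress") == 0) {
		ctx->compress = true;
		return 0;
	}
	return -EINVAL;
}
```

The name table still lives in one place, but the decision about which generic options make sense rests with each filesystem.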

>> > Btw, how would it affect the LSM?
>>
>> LSM would have to reject a "reset" if not enough privileges to
>> *create* a new fs instance, since it essentially requires creating a
>> new config, which is what is done when creating an fs instance.
>
> That's not what I'm asking. Would the reset change LSM state? Reset security
> labels and options?

No. And it wouldn't reset any other option that is immutable (e.g.
server IP address).

Thanks,
Miklos