Here is a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount. This is also used for remount
since much of the parsing stuff is common in many filesystems.
This allows namespaces and other information to be conveyed through the
mount procedure.
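The parse-then-create split described above can be modelled in userspace
roughly as follows. This is only a sketch: demo_context, demo_super,
demo_parse_option() and demo_get_tree() are hypothetical names invented for
illustration, not interfaces from these patches. It assumes two integer
options and records which ones were seen in a mask, so that only those are
applied when the superblock stand-in is created.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-ins; none of these names are kernel API. */
enum { OPT_GID, OPT_HIDEPID };

struct demo_context {
	unsigned long mask;	/* which options have been seen */
	int gid;
	int hidepid;
};

struct demo_super {
	int gid;
	int hidepid;
	int created;
};

/* Step 1: parse a single "key=value" option into the context only. */
static int demo_parse_option(struct demo_context *ctx, const char *opt)
{
	char *end;
	long v;

	if (strncmp(opt, "gid=", 4) == 0) {
		v = strtol(opt + 4, &end, 10);
		if (end == opt + 4 || *end)
			return -EINVAL;
		ctx->gid = (int)v;
		ctx->mask |= 1UL << OPT_GID;
		return 0;
	}
	if (strncmp(opt, "hidepid=", 8) == 0) {
		v = strtol(opt + 8, &end, 10);
		if (end == opt + 8 || *end || v < 0 || v > 2)
			return -EINVAL;
		ctx->hidepid = (int)v;
		ctx->mask |= 1UL << OPT_HIDEPID;
		return 0;
	}
	return -EINVAL;
}

/* Step 2: create the "superblock", applying only options that were set. */
static void demo_get_tree(const struct demo_context *ctx,
			  struct demo_super *sb)
{
	if (ctx->mask & (1UL << OPT_GID))
		sb->gid = ctx->gid;
	if (ctx->mask & (1UL << OPT_HIDEPID))
		sb->hidepid = ctx->hidepid;
	sb->created = 1;
}
```

Because parsing touches nothing but the context, a failed option leaves any
existing state alone, which is what lets the same parser serve both mount and
remount.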
This also allows Miklós Szeredi's idea of doing:
fd = fsopen("nfs");
write(fd, "option=val", ...);
fsmount(fd, "/mnt");
that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).
I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.
I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.
Non-converted filesystems are handled by the legacy filesystem wrapper.
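One plausible shape for such a wrapper, sketched in userspace, is to
accumulate the individual options back into the comma-separated string that
an old-style ->mount() expects. Everything here is an assumption made for
illustration: legacy_ctx, legacy_add_option() and the 4096-byte buffer are
invented names and sizes, not code from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define LEGACY_BUF_SIZE 4096	/* assumed page-sized option buffer */

/* Hypothetical accumulator for options destined for a legacy ->mount(). */
struct legacy_ctx {
	char data[LEGACY_BUF_SIZE];
	size_t len;		/* bytes used, excluding the trailing NUL */
};

/* Append one option, separated from the previous one by a comma. */
static int legacy_add_option(struct legacy_ctx *ctx, const char *opt)
{
	size_t n = strlen(opt);
	size_t need = n + (ctx->len ? 1 : 0);

	if (n == 0)
		return -EINVAL;
	/* Leave room for the terminating NUL. */
	if (ctx->len + need + 1 > sizeof(ctx->data))
		return -ENOMEM;

	if (ctx->len)
		ctx->data[ctx->len++] = ',';
	memcpy(ctx->data + ctx->len, opt, n);
	ctx->len += n;
	ctx->data[ctx->len] = '\0';
	return 0;
}
```

The accumulated string and its length could then be handed to the
filesystem's existing mount routine unchanged.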
This post is mostly about the internal filesystem context and the special
kernel interface filesystems. I've included the fsopen() and fsmount()
syscall implementations for reference, but I expect these to undergo some
reconsideration during LSF. The last five patches relate to the AFS
conversion and are included as an example.
Significant changes:
ver #7:
(*) Undo an incorrect MS_* -> SB_* conversion.
(*) Pass the mount data buffer size to all the mount-related functions that
take the data pointer. This fixes a problem where someone (say SELinux)
tries to copy the mount data, assuming it to be a page in size, and
overruns the buffer - thereby incurring an oops by hitting a guard page.
(*) Made the AFS filesystem use the new interfaces as an example. This is
much easier to deal with than NFS or Ext4 as there are very few mount options.
ver #6:
(*) Dropped the supplementary error string facility for the moment.
(*) Dropped the NFS patches for the moment.
(*) Dropped the reserved file descriptor argument from fsopen() and
replaced it with three reserved pointers that must be NULL.
ver #5:
(*) Renamed sb_config -> fs_context and adjusted variable names.
(*) Differentiated the flags in sb->s_flags (now named SB_*) from those
passed to mount(2) (named MS_*).
(*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
caller always provide a struct file_system_type pointer and the
parameters required.
(*) Got rid of vfs_submount_fc() in favour of passing
FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context(). The purpose value is
now used more widely.
(*) Call ->validate() on the remount path.
(*) Got rid of the inode locking in sys_fsmount().
(*) Call security_sb_mountpoint() in the mount(2) path.
ver #4:
(*) Split the sb_config patch up somewhat.
(*) Made the supplementary error string facility something attached to the
task_struct rather than the sb_config so that error messages can be
obtained from NFS doing a mount-root-and-pathwalk inside the
nfs_get_tree() operation.
Further, made this managed and read by prctl rather than through the
mount fd so that it's more generally available.
ver #3:
(*) Rebased on 4.12-rc1.
(*) Split the NFS patch up somewhat.
ver #2:
(*) Removed the ->fill_super() from sb_config_operations and passed it in
directly to functions that want to call it. NFS now calls
nfs_fill_super() directly rather than jumping through a pointer to it
since there's only the one option at the moment.
(*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
proc_sb_config.
(*) Renamed create_super -> get_tree.
(*) Renamed struct mount_context to struct sb_config and amended various
variable names.
(*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
arguments.
ver #1:
(*) Split the sb_config stuff out into its own header.
(*) Support non-context aware filesystems through a special set of
sb_config operations.
(*) Stored the created superblock and root dentry into the sb_config after
creation rather than directly into a vfsmount. This allows some
arguments to be removed to various NFS functions.
(*) Added an explicit superblock-creation step. This allows a created
superblock to then be mounted multiple times.
(*) Added a flag to say that the sb_config is degraded and cannot be used
for another attempt at superblock creation, and got rid of the flag
that says it's already mounted.
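The ver #7 change that passes the mount data buffer size alongside the data
pointer guards against the kind of overrun sketched below. This is a
userspace illustration only; copy_mount_data() is a hypothetical helper, not
a function from the patches, and it assumes the data is textual. The point is
that the reader is bounded by the caller-supplied size rather than by an
assumed page.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy textual mount data into copy[], never reading more than data_size
 * bytes from data and always NUL-terminating the result.
 * Hypothetical helper for illustration.
 */
static int copy_mount_data(const char *data, size_t data_size,
			   char *copy, size_t copy_size)
{
	size_t n = 0;

	if (copy_size == 0)
		return -EINVAL;
	if (!data || data_size == 0) {
		copy[0] = '\0';
		return 0;
	}

	/* Stop at the first NUL, but never look past data_size. */
	while (n < data_size && data[n] != '\0')
		n++;
	if (n >= copy_size)
		return -E2BIG;

	memcpy(copy, data, n);
	copy[n] = '\0';
	return 0;
}
```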
Possible further developments:
(*) Implement sb reconfiguration (for now it returns ENOANO).
(*) Implement mount context support in more filesystems, ext4 being next
on my list.
(*) Move the walk-from-root stuff that nfs has to generic code so that you
can do something akin to:
mount /dev/sda1:/foo/bar /mnt
See nfs_follow_remote_path() and mount_subtree(). This is slightly
tricky in NFS as we have to prevent referral loops.
(*) Work out how to get at the error message incurred by submounts
encountered during nfs_follow_remote_path().
Should the error message be moved to task_struct and made more
general, perhaps retrieved with a prctl() function?
(*) Clean up/consolidate the security functions. Possibly add a
validation hook to be called at the same time as the mount context
validate op.
The patches can be found here also:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=mount-context
David
---
David Howells (24):
vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion
VFS: Suppress MS_* flag defs within the kernel unless explicitly enabled
VFS: Introduce the structs and doc for a filesystem context
VFS: Add LSM hooks for filesystem context
apparmor: Implement security hooks for the new mount API
tomoyo: Implement security hooks for the new mount API
smack: Implement filesystem context security hooks
VFS: Require specification of size of mount data for internal mounts
VFS: Implement a filesystem superblock creation/configuration context
VFS: Remove unused code after filesystem context changes
procfs: Move proc_fill_super() to fs/proc/root.c
proc: Add fs_context support to procfs
ipc: Convert mqueue fs to fs_context
cpuset: Use fs_context
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
hugetlbfs: Convert to fs_context
VFS: Remove kern_mount_data()
VFS: Implement fsopen() to prepare for a mount
VFS: Implement fsmount() to effect a pre-configured mount
afs: Fix server record deletion
net: Export get_proc_net()
afs: Add fs_context support
afs: Implement namespacing
afs: Use fs_context to pass parameters over automount
Documentation/filesystems/mounting.txt | 445 +++++++++++++++
arch/arc/kernel/setup.c | 1
arch/arm/kernel/atags_parse.c | 1
arch/ia64/kernel/perfmon.c | 3
arch/powerpc/platforms/cell/spufs/inode.c | 6
arch/s390/hypfs/inode.c | 7
arch/sh/kernel/setup.c | 1
arch/sparc/kernel/setup_32.c | 1
arch/sparc/kernel/setup_64.c | 1
arch/x86/entry/syscalls/syscall_32.tbl | 2
arch/x86/entry/syscalls/syscall_64.tbl | 2
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 125 ++--
arch/x86/kernel/setup.c | 1
drivers/base/devtmpfs.c | 7
drivers/dax/super.c | 2
drivers/gpu/drm/drm_drv.c | 3
drivers/gpu/drm/i915/i915_gemfs.c | 2
drivers/infiniband/hw/qib/qib_fs.c | 7
drivers/misc/ibmasm/ibmasmfs.c | 11
drivers/mtd/mtdsuper.c | 26 +
drivers/oprofile/oprofilefs.c | 8
.../staging/lustre/lustre/llite/llite_internal.h | 2
drivers/staging/lustre/lustre/llite/llite_lib.c | 3
drivers/staging/lustre/lustre/obdclass/obd_mount.c | 7
drivers/staging/ncpfs/inode.c | 10
drivers/usb/gadget/function/f_fs.c | 7
drivers/usb/gadget/legacy/inode.c | 7
drivers/virtio/virtio_balloon.c | 2
drivers/xen/xenfs/super.c | 7
fs/9p/vfs_super.c | 2
fs/Makefile | 3
fs/adfs/super.c | 9
fs/affs/super.c | 13
fs/afs/cell.c | 4
fs/afs/internal.h | 46 +-
fs/afs/main.c | 33 +
fs/afs/mntpt.c | 151 +++--
fs/afs/proc.c | 89 ++-
fs/afs/server.c | 9
fs/afs/super.c | 438 ++++++++-------
fs/afs/volume.c | 4
fs/aio.c | 3
fs/anon_inodes.c | 3
fs/autofs4/autofs_i.h | 2
fs/autofs4/init.c | 4
fs/autofs4/inode.c | 3
fs/befs/linuxvfs.c | 11
fs/bfs/inode.c | 8
fs/binfmt_misc.c | 7
fs/block_dev.c | 2
fs/btrfs/super.c | 30 +
fs/btrfs/tests/btrfs-tests.c | 2
fs/ceph/super.c | 3
fs/cifs/cifs_dfs_ref.c | 3
fs/cifs/cifsfs.c | 5
fs/coda/inode.c | 11
fs/configfs/mount.c | 7
fs/cramfs/inode.c | 17 -
fs/debugfs/inode.c | 14
fs/devpts/inode.c | 10
fs/ecryptfs/main.c | 2
fs/efivarfs/super.c | 9
fs/efs/super.c | 14
fs/exofs/super.c | 7
fs/ext2/super.c | 14
fs/ext4/super.c | 16 -
fs/f2fs/super.c | 13
fs/fat/inode.c | 3
fs/fat/namei_msdos.c | 8
fs/fat/namei_vfat.c | 8
fs/freevxfs/vxfs_super.c | 12
fs/fs_context.c | 593 ++++++++++++++++++++
fs/fsopen.c | 304 ++++++++++
fs/fuse/control.c | 9
fs/fuse/inode.c | 16 -
fs/gfs2/ops_fstype.c | 6
fs/gfs2/super.c | 4
fs/hfs/super.c | 12
fs/hfsplus/super.c | 12
fs/hostfs/hostfs_kern.c | 7
fs/hpfs/super.c | 11
fs/hugetlbfs/inode.c | 327 ++++++-----
fs/internal.h | 5
fs/isofs/inode.c | 11
fs/jffs2/super.c | 10
fs/jfs/super.c | 11
fs/kernfs/mount.c | 90 ++-
fs/libfs.c | 17 +
fs/minix/inode.c | 14
fs/namespace.c | 422 ++++++++++----
fs/nfs/internal.h | 4
fs/nfs/namespace.c | 3
fs/nfs/nfs4namespace.c | 3
fs/nfs/nfs4super.c | 27 +
fs/nfs/super.c | 22 -
fs/nfsd/nfsctl.c | 8
fs/nilfs2/super.c | 10
fs/nsfs.c | 3
fs/ntfs/super.c | 13
fs/ocfs2/dlmfs/dlmfs.c | 5
fs/ocfs2/super.c | 14
fs/omfs/inode.c | 9
fs/openpromfs/inode.c | 11
fs/orangefs/orangefs-kernel.h | 2
fs/orangefs/super.c | 5
fs/overlayfs/super.c | 11
fs/pipe.c | 3
fs/pnode.c | 1
fs/proc/inode.c | 50 --
fs/proc/internal.h | 6
fs/proc/proc_net.c | 3
fs/proc/root.c | 202 +++++--
fs/pstore/inode.c | 10
fs/qnx4/inode.c | 14
fs/qnx6/inode.c | 14
fs/ramfs/inode.c | 6
fs/reiserfs/super.c | 14
fs/romfs/super.c | 13
fs/squashfs/super.c | 12
fs/super.c | 389 ++++++++++---
fs/sysfs/mount.c | 59 +-
fs/sysv/inode.c | 3
fs/sysv/super.c | 16 -
fs/tracefs/inode.c | 10
fs/ubifs/super.c | 5
fs/udf/super.c | 16 -
fs/ufs/super.c | 11
fs/xfs/xfs_super.c | 10
include/linux/cgroup.h | 3
include/linux/debugfs.h | 8
include/linux/fs.h | 40 +
include/linux/fs_context.h | 106 ++++
include/linux/kernfs.h | 37 +
include/linux/lsm_hooks.h | 74 ++
include/linux/mount.h | 7
include/linux/mtd/super.h | 4
include/linux/proc_fs.h | 2
include/linux/ramfs.h | 4
include/linux/security.h | 62 ++
include/linux/shmem_fs.h | 3
include/linux/syscalls.h | 4
include/uapi/linux/fs.h | 56 --
include/uapi/linux/magic.h | 1
include/uapi/linux/mount.h | 58 ++
init/do_mounts.c | 5
init/do_mounts_initrd.c | 1
ipc/mqueue.c | 115 +++-
kernel/bpf/inode.c | 7
kernel/cgroup/cgroup-internal.h | 42 +
kernel/cgroup/cgroup-v1.c | 295 +++++-----
kernel/cgroup/cgroup.c | 219 ++++---
kernel/cgroup/cpuset.c | 65 ++
kernel/sys_ni.c | 4
kernel/trace/trace.c | 7
mm/shmem.c | 10
mm/zsmalloc.c | 3
net/socket.c | 3
net/sunrpc/rpc_pipe.c | 7
security/apparmor/apparmorfs.c | 8
security/apparmor/include/mount.h | 11
security/apparmor/lsm.c | 84 +++
security/apparmor/mount.c | 47 ++
security/inode.c | 7
security/security.c | 60 ++
security/selinux/hooks.c | 292 +++++++++-
security/selinux/selinuxfs.c | 8
security/smack/smack_lsm.c | 344 ++++++++++--
security/smack/smackfs.c | 9
security/tomoyo/common.h | 3
security/tomoyo/mount.c | 43 +
security/tomoyo/tomoyo.c | 19 +
171 files changed, 5105 insertions(+), 1739 deletions(-)
create mode 100644 Documentation/filesystems/mounting.txt
create mode 100644 fs/fs_context.c
create mode 100644 fs/fsopen.c
create mode 100644 include/linux/fs_context.h
create mode 100644 include/uapi/linux/mount.h
Add fs_context support to procfs.
Signed-off-by: David Howells <[email protected]>
---
fs/proc/inode.c | 2 -
fs/proc/internal.h | 2 -
fs/proc/root.c | 169 ++++++++++++++++++++++++++++++++++------------------
3 files changed, 113 insertions(+), 60 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 0b13cf6eb6d7..7aa86dd65ba8 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -128,7 +128,7 @@ const struct super_operations proc_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = proc_evict_inode,
.statfs = simple_statfs,
- .remount_fs = proc_remount,
+ .reconfigure = proc_reconfigure,
.show_options = proc_show_options,
};
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 3182e1b636d3..a5ab9504768a 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -254,7 +254,7 @@ static inline void proc_tty_init(void) {}
extern struct proc_dir_entry proc_root;
extern void proc_self_init(void);
-extern int proc_remount(struct super_block *, int *, char *, size_t);
+extern int proc_reconfigure(struct super_block *, struct fs_context *);
/*
* task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 2fbc177f37a8..e6bd31fbc714 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -19,14 +19,24 @@
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/user_namespace.h>
+#include <linux/fs_context.h>
#include <linux/mount.h>
#include <linux/pid_namespace.h>
#include <linux/parser.h>
#include <linux/cred.h>
#include <linux/magic.h>
+#include <linux/slab.h>
#include "internal.h"
+struct proc_fs_context {
+ struct fs_context fc;
+ struct pid_namespace *pid_ns;
+ unsigned long mask;
+ int hidepid;
+ int gid;
+};
+
enum {
Opt_gid, Opt_hidepid, Opt_err,
};
@@ -37,56 +47,60 @@ static const match_table_t tokens = {
{Opt_err, NULL},
};
-static int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_option(struct fs_context *fc, char *opt, size_t len)
{
- char *p;
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
substring_t args[MAX_OPT_ARGS];
- int option;
-
- if (!options)
- return 1;
-
- while ((p = strsep(&options, ",")) != NULL) {
- int token;
- if (!*p)
- continue;
-
- args[0].to = args[0].from = NULL;
- token = match_token(p, tokens, args);
- switch (token) {
- case Opt_gid:
- if (match_int(&args[0], &option))
- return 0;
- pid->pid_gid = make_kgid(current_user_ns(), option);
- break;
- case Opt_hidepid:
- if (match_int(&args[0], &option))
- return 0;
- if (option < HIDEPID_OFF ||
- option > HIDEPID_INVISIBLE) {
- pr_err("proc: hidepid value must be between 0 and 2.\n");
- return 0;
- }
- pid->hide_pid = option;
- break;
- default:
- pr_err("proc: unrecognized mount option \"%s\" "
- "or missing value\n", p);
- return 0;
+ int token;
+
+ args[0].to = args[0].from = NULL;
+ token = match_token(opt, tokens, args);
+ switch (token) {
+ case Opt_gid:
+ if (match_int(&args[0], &ctx->gid))
+ return -EINVAL;
+ break;
+
+ case Opt_hidepid:
+ if (match_int(&args[0], &ctx->hidepid))
+ return -EINVAL;
+ if (ctx->hidepid < HIDEPID_OFF ||
+ ctx->hidepid > HIDEPID_INVISIBLE) {
+ pr_err("proc: hidepid value must be between 0 and 2.\n");
+ return -EINVAL;
}
+ break;
+
+ default:
+ pr_err("proc: unrecognized mount option \"%s\" or missing value\n",
+ opt);
+ return -EINVAL;
}
- return 1;
+ ctx->mask |= 1 << token;
+ return 0;
+}
+
+static void proc_set_options(struct super_block *s,
+ struct fs_context *fc,
+ struct pid_namespace *pid_ns,
+ struct user_namespace *user_ns)
+{
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
+
+ if (ctx->mask & (1 << Opt_gid))
+ pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
+ if (ctx->mask & (1 << Opt_hidepid))
+ pid_ns->hide_pid = ctx->hidepid;
}
-static int proc_fill_super(struct super_block *s, void *data, size_t data_size, int silent)
+static int proc_fill_super(struct super_block *s, struct fs_context *fc)
{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
struct inode *root_inode;
int ret;
- if (!proc_parse_options(data, ns))
- return -EINVAL;
+ proc_set_options(s, fc, pid_ns, current_user_ns());
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -103,7 +117,7 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
* top of it
*/
s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
+
pde_get(&proc_root);
root_inode = proc_get_inode(s, &proc_root);
if (!root_inode) {
@@ -124,30 +138,46 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
return proc_setup_thread_self(s);
}
-int proc_remount(struct super_block *sb, int *flags,
- char *data, size_t data_size)
+int proc_reconfigure(struct super_block *sb, struct fs_context *fc)
{
struct pid_namespace *pid = sb->s_fs_info;
sync_filesystem(sb);
- return !proc_parse_options(data, pid);
+
+ if (fc)
+ proc_set_options(sb, fc, pid, current_user_ns());
+ return 0;
}
-static struct dentry *proc_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, size_t data_size)
+static int proc_get_tree(struct fs_context *fc)
{
- struct pid_namespace *ns;
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
- if (flags & SB_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = task_active_pid_ns(current);
- }
+ ctx->fc.s_fs_info = ctx->pid_ns;
+ return vfs_get_super(fc, vfs_get_keyed_super, proc_fill_super);
+}
- return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
- proc_fill_super);
+static void proc_fs_context_free(struct fs_context *fc)
+{
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
+
+ if (ctx->pid_ns)
+ put_pid_ns(ctx->pid_ns);
+}
+
+static const struct fs_context_operations proc_fs_context_ops = {
+ .free = proc_fs_context_free,
+ .parse_option = proc_parse_option,
+ .get_tree = proc_get_tree,
+};
+
+static int proc_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
+
+ ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ ctx->fc.ops = &proc_fs_context_ops;
+ return 0;
}
static void proc_kill_sb(struct super_block *sb)
@@ -165,7 +195,8 @@ static void proc_kill_sb(struct super_block *sb)
static struct file_system_type proc_fs_type = {
.name = "proc",
- .mount = proc_mount,
+ .fs_context_size = sizeof(struct proc_fs_context),
+ .init_fs_context = proc_init_fs_context,
.kill_sb = proc_kill_sb,
.fs_flags = FS_USERNS_MOUNT,
};
@@ -205,7 +236,7 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr
{
if (!proc_pid_lookup(dir, dentry, flags))
return NULL;
-
+
return proc_lookup(dir, dentry, flags);
}
@@ -259,9 +290,31 @@ struct proc_dir_entry proc_root = {
int pid_ns_prepare_proc(struct pid_namespace *ns)
{
+ struct proc_fs_context *ctx;
+ struct fs_context *fc;
struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ctx = container_of(fc, struct proc_fs_context, fc);
+ if (ctx->pid_ns != ns) {
+ put_pid_ns(ctx->pid_ns);
+ get_pid_ns(ns);
+ ctx->pid_ns = ns;
+ }
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0) {
+ put_fs_context(fc);
+ return ret;
+ }
- mnt = kern_mount_data(&proc_fs_type, ns, 0);
+ mnt = vfs_create_mount(fc);
+ put_fs_context(fc);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
Provide an fsopen() system call that starts the process of preparing to
mount, using an fd as a context handle. fsopen() is given the name of the
filesystem that will be used:
int mfd = fsopen(const char *fsname, int open_flags,
void *reserved3, void *reserved4,
void *reserved5);
where open_flags can be 0 or O_CLOEXEC and reserved* should all be NULL for
the moment.
For example:
mfd = fsopen("ext4", O_CLOEXEC, NULL, NULL, NULL);
write(mfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
write(mfd, "o noatime");
write(mfd, "o acl");
write(mfd, "o user_attr");
write(mfd, "o iversion");
write(mfd, "o ");
write(mfd, "r /my/container"); // root inside the fs
write(mfd, "x create"); // create the superblock
fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW);
mfd = fsopen("afs", 0, NULL, NULL, NULL);
write(mfd, "s %grand.central.org:root.cell");
write(mfd, "o cell=grand.central.org");
write(mfd, "r /");
write(mfd, "x create");
fsmount(mfd, AT_FDCWD, "/mnt", 0);
If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:
"e <subsys>:<problem>"
"e SELinux:Mount on mountpoint not permitted"
Once fsmount() has been called, further write() calls will incur EBUSY,
even if the fsmount() fails. read() is still possible to retrieve error
information.
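The command framing and the EBUSY rule can be modelled in userspace as
below. This is a sketch under stated assumptions rather than the kernel code:
demo_handle, demo_write() and demo_mount() are hypothetical names, only the
's', 'o' and 'x' commands are modelled, and the 64-byte source field is an
arbitrary size chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the context hanging off the fd. */
struct demo_handle {
	int mounted;		/* set once a mount has been attempted */
	int created;		/* set by "x create" */
	char source[64];
};

/* Returns len on success or a negative errno, like a write() handler. */
static long demo_write(struct demo_handle *h, const char *buf, size_t len)
{
	if (h->mounted)
		return -EBUSY;
	/* One command letter, one space, at least one byte of payload. */
	if (len < 3 || len > 4095)
		return -EINVAL;
	if (buf[1] != ' ')
		return -EINVAL;

	switch (buf[0]) {
	case 's':	/* source */
		if (len - 2 >= sizeof(h->source))
			return -EINVAL;
		memcpy(h->source, buf + 2, len - 2);
		h->source[len - 2] = '\0';
		break;
	case 'o':	/* option: accepted but not interpreted here */
		break;
	case 'x':	/* command: only "create" is understood */
		if (len - 2 != 6 || memcmp(buf + 2, "create", 6) != 0)
			return -EOPNOTSUPP;
		h->created = 1;
		break;
	default:
		return -EINVAL;
	}
	return (long)len;
}

/* Attempt the mount; the handle is marked busy whether or not it works. */
static int demo_mount(struct demo_handle *h)
{
	h->mounted = 1;
	return h->created ? 0 : -EINVAL;
}
```

Marking the handle busy before checking the outcome is what gives the "even
if the fsmount() fails" behaviour.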
The fsopen() syscall creates a mount context and hangs it off the fd that it
returns.
Netlink is not used because it is optional: using it would make the core
kernel depend on CONFIG_NET.
Signed-off-by: David Howells <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/Makefile | 2
fs/fsopen.c | 304 ++++++++++++++++++++++++++++++++
fs/super.c | 3
include/linux/fs_context.h | 1
include/linux/syscalls.h | 2
include/uapi/linux/magic.h | 1
kernel/sys_ni.c | 3
9 files changed, 315 insertions(+), 3 deletions(-)
create mode 100644 fs/fsopen.c
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d6b27dab1b30..d02346692c3f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -396,3 +396,4 @@
382 i386 pkey_free sys_pkey_free __ia32_sys_pkey_free
383 i386 statx sys_statx __ia32_sys_statx
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
+385 i386 fsopen sys_fsopen __ia32_sys_fsopen
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 4dfe42666d0c..6708847571e2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -341,6 +341,7 @@
330 common pkey_alloc __x64_sys_pkey_alloc
331 common pkey_free __x64_sys_pkey_free
332 common statx __x64_sys_statx
+333 common fsopen __x64_sys_fsopen
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 6f2dae3c32da..ee3c8b31cc58 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y := open.o read_write.o file_table.o super.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
- fs_context.o
+ fs_context.o fsopen.o
ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fsopen.c b/fs/fsopen.c
new file mode 100644
index 000000000000..2d115bad13bb
--- /dev/null
+++ b/fs/fsopen.c
@@ -0,0 +1,304 @@
+/* Filesystem access-by-fd.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs_context.h>
+#include <linux/mount.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/syscalls.h>
+
+static struct vfsmount *fscontext_fs_mnt __read_mostly;
+
+static int fscontext_fs_release(struct inode *inode, struct file *file)
+{
+ struct fs_context *fc = file->private_data;
+
+ file->private_data = NULL;
+
+ put_fs_context(fc);
+ return 0;
+}
+
+/*
+ * Userspace writes configuration data and commands to the fd and we parse it
+ * here. For the moment, we assume a single option or command per write. Each
+ * line written is of the form
+ *
+ * <option_type><space><stuff...>
+ *
+ * s /dev/sda1 -- Source device name
+ * o noatime -- Option without value
+ * o cell=grand.central.org -- Option with value
+ * r / -- Dir within device to mount
+ * x create -- Create a superblock
+ */
+static ssize_t fscontext_fs_write(struct file *file,
+ const char __user *_buf, size_t len, loff_t *pos)
+{
+ struct fs_context *fc = file->private_data;
+ struct inode *inode = file_inode(file);
+ char opt[2], *data;
+ ssize_t ret;
+
+ if (len < 3 || len > 4095)
+ return -EINVAL;
+
+ if (copy_from_user(opt, _buf, 2) != 0)
+ return -EFAULT;
+ switch (opt[0]) {
+ case 's':
+ case 'o':
+ case 'x':
+ break;
+ default:
+ goto err_bad_cmd;
+ }
+ if (opt[1] != ' ')
+ goto err_bad_cmd;
+
+ data = memdup_user_nul(_buf + 2, len - 2);
+ if (IS_ERR(data))
+ return PTR_ERR(data);
+
+ /* From this point onwards we need to lock the fd against someone
+ * trying to mount it.
+ */
+ ret = inode_lock_killable(inode);
+ if (ret < 0)
+ goto err_free;
+
+ ret = -EINVAL;
+ switch (opt[0]) {
+ case 's':
+ ret = vfs_set_fs_source(fc, data, len - 2);
+ if (ret < 0)
+ goto err_unlock;
+ data = NULL;
+ break;
+
+ case 'o':
+ ret = vfs_parse_fs_option(fc, data, len - 2);
+ if (ret < 0)
+ goto err_unlock;
+ break;
+
+ case 'x':
+ if (strcmp(data, "create") == 0) {
+ ret = vfs_get_tree(fc);
+ } else {
+ ret = -EOPNOTSUPP;
+ }
+ if (ret < 0)
+ goto err_unlock;
+ break;
+
+ default:
+ goto err_unlock;
+ }
+
+ ret = len;
+err_unlock:
+ inode_unlock(inode);
+err_free:
+ kfree(data);
+ return ret;
+err_bad_cmd:
+ return -EINVAL;
+}
+
+const struct file_operations fscontext_fs_fops = {
+ .write = fscontext_fs_write,
+ .release = fscontext_fs_release,
+ .llseek = no_llseek,
+};
+
+/*
+ * Indicate the name we want to display the filesystem file as.
+ */
+static char *fscontext_fs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+ return dynamic_dname(dentry, buffer, buflen, "fs:[%lu]",
+ d_inode(dentry)->i_ino);
+}
+
+static const struct dentry_operations fscontext_fs_dentry_operations = {
+ .d_dname = fscontext_fs_dname,
+};
+
+/*
+ * Create a file that can be used to configure a new mount.
+ */
+static struct file *create_fscontext_file(struct fs_context *fc)
+{
+ struct inode *inode;
+ struct file *f;
+ struct path path;
+ int ret;
+
+ inode = alloc_anon_inode(fscontext_fs_mnt->mnt_sb);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+ inode->i_fop = &fscontext_fs_fops;
+
+ ret = -ENOMEM;
+ path.dentry = d_alloc_pseudo(fscontext_fs_mnt->mnt_sb, &empty_name);
+ if (!path.dentry)
+ goto err_inode;
+ path.mnt = mntget(fscontext_fs_mnt);
+
+ d_instantiate(path.dentry, inode);
+
+ f = alloc_file(&path, FMODE_READ | FMODE_WRITE, &fscontext_fs_fops);
+ if (IS_ERR(f)) {
+ ret = PTR_ERR(f);
+ goto err_file;
+ }
+
+ f->private_data = fc;
+ return f;
+
+err_file:
+ path_put(&path);
+ return ERR_PTR(ret);
+
+err_inode:
+ iput(inode);
+ return ERR_PTR(ret);
+}
+
+static const struct super_operations fscontext_fs_super_ops = {
+ .drop_inode = generic_delete_inode,
+ .destroy_inode = free_inode_nonrcu,
+ .statfs = simple_statfs,
+};
+
+/*
+ * Finish filling in the superblock and allocate the root dentry.
+ */
+static int fscontext_fs_fill_super(struct super_block *sb,
+ struct fs_context *fc)
+{
+ struct dentry *root;
+ struct inode *inode;
+
+ sb->s_op = &fscontext_fs_super_ops;
+ inode = alloc_anon_inode(sb);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+ inode->i_fop = &fscontext_fs_fops;
+
+ root = d_make_root(inode);
+ if (!root)
+ return -ENOMEM; /* inode is put by d_make_root() */
+ sb->s_root = root;
+ return 0;
+}
+
+static int fscontext_fs_get_tree(struct fs_context *fc)
+{
+ return vfs_get_super(fc, vfs_get_single_super, fscontext_fs_fill_super);
+}
+
+static const struct fs_context_operations fscontext_fs_context_ops = {
+ .get_tree = fscontext_fs_get_tree,
+};
+
+static int fs_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ fc->ops = &fscontext_fs_context_ops;
+ return 0;
+}
+
+static struct file_system_type fscontext_fs_type = {
+ .name = "fscontext",
+ .fs_context_size = sizeof(struct fs_context),
+ .init_fs_context = fs_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static int __init init_fscontext_fs(void)
+{
+ int ret;
+
+ ret = register_filesystem(&fscontext_fs_type);
+ if (ret < 0)
+ panic("Cannot register fscontext_fs\n");
+
+ fscontext_fs_mnt = kern_mount(&fscontext_fs_type);
+ if (IS_ERR(fscontext_fs_mnt))
+ panic("Cannot mount fscontext_fs: %ld\n",
+ PTR_ERR(fscontext_fs_mnt));
+ return 0;
+}
+
+fs_initcall(init_fscontext_fs);
+
+/*
+ * Open a filesystem by name so that it can be configured for mounting.
+ *
+ * We are allowed to specify a container in which the filesystem will be
+ * opened, thereby indicating which namespaces will be used (notably, which
+ * network namespace will be used for network filesystems).
+ */
+SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,
+ void *, reserved3, void *, reserved4, void *, reserved5)
+{
+ struct file_system_type *fs_type;
+ struct fs_context *fc;
+ struct file *file;
+ const char *fs_name;
+ int fd, ret;
+
+ if (flags & ~O_CLOEXEC || reserved3 || reserved4 || reserved5)
+ return -EINVAL;
+
+ fs_name = strndup_user(_fs_name, PAGE_SIZE);
+ if (IS_ERR(fs_name))
+ return PTR_ERR(fs_name);
+
+ fs_type = get_fs_type(fs_name);
+ kfree(fs_name);
+ if (!fs_type)
+ return -ENODEV;
+
+ fc = vfs_new_fs_context(fs_type, NULL, 0, FS_CONTEXT_FOR_USER_MOUNT);
+ put_filesystem(fs_type);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ret = -EOPNOTSUPP;
+ if (!fc->ops)
+ goto err_fc;
+
+ file = create_fscontext_file(fc);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto err_fc;
+ }
+
+ ret = get_unused_fd_flags(flags & O_CLOEXEC);
+ if (ret < 0)
+ goto err_file;
+
+ fd = ret;
+ fd_install(fd, file);
+ return fd;
+
+err_file:
+ fput(file);
+ return ret;
+
+err_fc:
+ put_fs_context(fc);
+ return ret;
+}
diff --git a/fs/super.c b/fs/super.c
index a27487e34ea4..1e2942f81bc9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1270,8 +1270,7 @@ int vfs_get_super(struct fs_context *fc,
return PTR_ERR(sb);
if (!sb->s_root) {
- int err;
- err = fill_super(sb, fc);
+ int err = fill_super(sb, fc);
if (err) {
deactivate_locked_super(sb);
return err;
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 1914eef0a88f..536ae7d60f1f 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -102,4 +102,5 @@ extern int vfs_get_super(struct fs_context *fc,
int (*fill_super)(struct super_block *sb,
struct fs_context *fc));
+extern const struct file_operations fscontext_fs_fops;
#endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 70fcda1a9049..3c9b10e92015 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -890,6 +890,8 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
asmlinkage long sys_pkey_free(int pkey);
asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
+asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags,
+ void *reserved3, void *reserved4, void *reserved5);
/*
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 1a6fee974116..2fe02277fb32 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -89,5 +89,6 @@
#define UDF_SUPER_MAGIC 0x15013346
#define BALLOON_KVM_MAGIC 0x13661366
#define ZSMALLOC_MAGIC 0x58295829
+#define FSCONTEXT_FS_MAGIC 0x66736673
#endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 9791364925dc..c113fc9d5e77 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -430,3 +430,6 @@ COND_SYSCALL(setresgid16);
COND_SYSCALL(setresuid16);
COND_SYSCALL(setreuid16);
COND_SYSCALL(setuid16);
+
+/* fd-based mount */
+COND_SYSCALL(fsopen);
Remove code that is now unused after the filesystem context changes.
Signed-off-by: David Howells <[email protected]>
---
fs/internal.h | 2 --
fs/super.c | 54 --------------------------------------------
include/linux/lsm_hooks.h | 3 --
include/linux/security.h | 7 ------
security/security.c | 5 ----
security/selinux/hooks.c | 20 ----------------
security/smack/smack_lsm.c | 33 ---------------------------
7 files changed, 124 deletions(-)
diff --git a/fs/internal.h b/fs/internal.h
index 91a990234488..f47ede6ace5a 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -101,8 +101,6 @@ extern struct file *get_empty_filp(void);
extern int do_remount_sb(struct super_block *, int, void *, size_t, int,
struct fs_context *);
extern bool trylock_super(struct super_block *sb);
-extern struct dentry *mount_fs(struct file_system_type *,
- int, const char *, void *, size_t);
extern struct super_block *user_get_super(dev_t);
/*
diff --git a/fs/super.c b/fs/super.c
index 5d65a45ca6db..a27487e34ea4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1441,60 +1441,6 @@ struct dentry *mount_single(struct file_system_type *fs_type,
}
EXPORT_SYMBOL(mount_single);
-struct dentry *
-mount_fs(struct file_system_type *type, int flags, const char *name,
- void *data, size_t data_size)
-{
- struct dentry *root;
- struct super_block *sb;
- char *secdata = NULL;
- int error = -ENOMEM;
-
- if (data && !(type->fs_flags & FS_BINARY_MOUNTDATA)) {
- secdata = alloc_secdata();
- if (!secdata)
- goto out;
-
- error = security_sb_copy_data(data, data_size, secdata);
- if (error)
- goto out_free_secdata;
- }
-
- root = type->mount(type, flags, name, data, data_size);
- if (IS_ERR(root)) {
- error = PTR_ERR(root);
- goto out_free_secdata;
- }
- sb = root->d_sb;
- BUG_ON(!sb);
- WARN_ON(!sb->s_bdi);
- sb->s_flags |= SB_BORN;
-
- error = security_sb_kern_mount(sb, flags, secdata, data_size);
- if (error)
- goto out_sb;
-
- /*
- * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
- * but s_maxbytes was an unsigned long long for many releases. Throw
- * this warning for a little while to try and catch filesystems that
- * violate this rule.
- */
- WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
- "negative value (%lld)\n", type->name, sb->s_maxbytes);
-
- up_write(&sb->s_umount);
- free_secdata(secdata);
- return root;
-out_sb:
- dput(root);
- deactivate_locked_super(sb);
-out_free_secdata:
- free_secdata(secdata);
-out:
- return ERR_PTR(error);
-}
-
/*
* Setup private BDI for given superblock. It gets automatically cleaned up
* in generic_shutdown_super().
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index d533ca038604..1d6481c4f965 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1511,8 +1511,6 @@ union security_list_options {
void (*sb_free_security)(struct super_block *sb);
int (*sb_copy_data)(char *orig, size_t orig_size, char *copy);
int (*sb_remount)(struct super_block *sb, void *data, size_t data_size);
- int (*sb_kern_mount)(struct super_block *sb, int flags,
- void *data, size_t data_size);
int (*sb_show_options)(struct seq_file *m, struct super_block *sb);
int (*sb_statfs)(struct dentry *dentry);
int (*sb_mount)(const char *dev_name, const struct path *path,
@@ -1858,7 +1856,6 @@ struct security_hook_heads {
struct hlist_head sb_free_security;
struct hlist_head sb_copy_data;
struct hlist_head sb_remount;
- struct hlist_head sb_kern_mount;
struct hlist_head sb_show_options;
struct hlist_head sb_statfs;
struct hlist_head sb_mount;
diff --git a/include/linux/security.h b/include/linux/security.h
index 22b83dc28bd3..5b9d9ffa6abd 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -245,7 +245,6 @@ int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
int security_sb_copy_data(char *orig, size_t orig_size, char *copy);
int security_sb_remount(struct super_block *sb, void *data, size_t data_size);
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size);
int security_sb_show_options(struct seq_file *m, struct super_block *sb);
int security_sb_statfs(struct dentry *dentry);
int security_sb_mount(const char *dev_name, const struct path *path,
@@ -601,12 +600,6 @@ static inline int security_sb_remount(struct super_block *sb, void *data, size_t
return 0;
}
-static inline int security_sb_kern_mount(struct super_block *sb, int flags,
- void *data, size_t data_size)
-{
- return 0;
-}
-
static inline int security_sb_show_options(struct seq_file *m,
struct super_block *sb)
{
diff --git a/security/security.c b/security/security.c
index 35abdc964724..9693a175587d 100644
--- a/security/security.c
+++ b/security/security.c
@@ -420,11 +420,6 @@ int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
return call_int_hook(sb_remount, 0, sb, data, data_size);
}
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
-{
- return call_int_hook(sb_kern_mount, 0, sb, flags, data, data_size);
-}
-
int security_sb_show_options(struct seq_file *m, struct super_block *sb)
{
return call_int_hook(sb_show_options, 0, m, sb);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 46d31858eaba..cb76a2428926 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2909,25 +2909,6 @@ static int selinux_sb_remount(struct super_block *sb, void *data, size_t data_si
goto out_free_opts;
}
-static int selinux_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
-{
- const struct cred *cred = current_cred();
- struct common_audit_data ad;
- int rc;
-
- rc = superblock_doinit(sb, data);
- if (rc)
- return rc;
-
- /* Allow all mounts performed by the kernel */
- if (flags & MS_KERNMOUNT)
- return 0;
-
- ad.type = LSM_AUDIT_DATA_DENTRY;
- ad.u.dentry = sb->s_root;
- return superblock_has_perm(cred, sb, FILESYSTEM__MOUNT, &ad);
-}
-
static int selinux_sb_statfs(struct dentry *dentry)
{
const struct cred *cred = current_cred();
@@ -7138,7 +7119,6 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
LSM_HOOK_INIT(sb_remount, selinux_sb_remount),
- LSM_HOOK_INIT(sb_kern_mount, selinux_sb_kern_mount),
LSM_HOOK_INIT(sb_show_options, selinux_sb_show_options),
LSM_HOOK_INIT(sb_statfs, selinux_sb_statfs),
LSM_HOOK_INIT(sb_mount, selinux_mount),
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 5dc31a940961..5fe9e7948fde 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -1150,38 +1150,6 @@ static int smack_set_mnt_opts(struct super_block *sb,
return 0;
}
-/**
- * smack_sb_kern_mount - Smack specific mount processing
- * @sb: the file system superblock
- * @flags: the mount flags
- * @data: the smack mount options
- *
- * Returns 0 on success, an error code on failure
- */
-static int smack_sb_kern_mount(struct super_block *sb, int flags,
- void *data, size_t data_size)
-{
- int rc = 0;
- char *options = data;
- struct security_mnt_opts opts;
-
- security_init_mnt_opts(&opts);
-
- if (!options)
- goto out;
-
- rc = smack_parse_opts_str(options, &opts);
- if (rc)
- goto out_err;
-
-out:
- rc = smack_set_mnt_opts(sb, &opts, 0, NULL);
-
-out_err:
- security_free_mnt_opts(&opts);
- return rc;
-}
-
/**
* smack_sb_statfs - Smack check on statfs
* @dentry: identifies the file system in question
@@ -4942,7 +4910,6 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),
- LSM_HOOK_INIT(sb_kern_mount, smack_sb_kern_mount),
LSM_HOOK_INIT(sb_statfs, smack_sb_statfs),
LSM_HOOK_INIT(sb_set_mnt_opts, smack_set_mnt_opts),
LSM_HOOK_INIT(sb_parse_opts_str, smack_parse_opts_str),
Alter the AFS automounting code to create and modify an fs_context struct
when parameterising a new mount triggered by an AFS mountpoint rather than
constructing device name and option strings.
Also remove the cell=, vol= and rwpath options as they are then redundant.
They existed because the 'device name' may be derived literally from a
mountpoint object in the filesystem, so the default cell and parent-type
information had to be passed in from the automount routines by some other
means. The vol= option was never actually used.
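As a rough userspace illustration of the pseudo-directory case handled by afs_mntpt_set_params() in the diff below, the following sketch derives the cell name and volume type from a dentry name. All pd_* names are hypothetical stand-ins, the value of PD_MAXCELLNAME is assumed, and the cell lookup itself is left out:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define PD_MAXCELLNAME 64	/* stand-in for AFS_MAXCELLNAME; value assumed */

enum pd_voltype { PD_RWVOL, PD_ROVOL };	/* stand-ins for AFSVL_* */

struct pd_params {
	const char *cellname;	/* points into the dentry name, not NUL-terminated */
	size_t cellnamesz;
	const char *volname;
	enum pd_voltype type;
	bool force;
};

/*
 * Work out mount parameters from a pseudo-directory name.  A leading '.'
 * selects the R/W volume and forces the type; the volume is always the
 * cell's root volume.
 */
static int pd_set_params(const char *d_name, size_t len, struct pd_params *out)
{
	const char *p = d_name;

	if (len < 2)
		return -ENOENT;

	out->type = PD_ROVOL;	/* default, as set up by afs_init_fs_context() */
	out->force = false;
	if (p[0] == '.') {
		len--;
		p++;
		out->type = PD_RWVOL;
		out->force = true;
	}
	if (len > PD_MAXCELLNAME)
		return -ENAMETOOLONG;

	out->cellname = p;
	out->cellnamesz = len;
	out->volname = "root.cell";
	return 0;
}
```

The point of the sketch is that the R/W selection now travels as typed fields rather than being encoded into a '%' or '#' prefix and an rwpath option string that the mount code then has to parse again.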
Signed-off-by: David Howells <[email protected]>
cc: Eric W. Biederman <[email protected]>
---
fs/afs/internal.h | 1
fs/afs/mntpt.c | 152 ++++++++++++++++++++++++++++-------------------------
fs/afs/super.c | 42 +--------------
3 files changed, 83 insertions(+), 112 deletions(-)
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index a5161c0ae3ab..589e5356c560 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -37,7 +37,6 @@ struct afs_call;
struct afs_fs_context {
struct fs_context fc;
struct afs_super_info *as;
- bool rwpath; /* T if the parent should be considered R/W */
bool force; /* T to force cell type */
bool autocell; /* T if set auto mount operation */
bool dyn_root; /* T if dynamic root */
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index c45aa1776591..9c4ad0565154 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -47,6 +47,8 @@ static DECLARE_DELAYED_WORK(afs_mntpt_expiry_timer, afs_mntpt_expiry_timed_out);
static unsigned long afs_mntpt_expiry_timeout = 10 * 60;
+static const char afs_root_volume[] = "root.cell";
+
/*
* no valid lookup procedure on this sort of dir
*/
@@ -68,107 +70,111 @@ static int afs_mntpt_open(struct inode *inode, struct file *file)
}
/*
- * create a vfsmount to be automounted
+ * Set the parameters for the proposed superblock.
*/
-static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
+static int afs_mntpt_set_params(struct fs_context *fc, struct dentry *mntpt)
{
- struct afs_super_info *as;
- struct vfsmount *mnt;
- struct afs_vnode *vnode;
- struct page *page;
- char *devname, *options;
- bool rwpath = false;
+ struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
+ struct afs_vnode *vnode = AFS_FS_I(d_inode(mntpt));
+ struct afs_cell *cell;
+ const char *p;
int ret;
- _enter("{%pd}", mntpt);
-
- BUG_ON(!d_inode(mntpt));
-
- ret = -ENOMEM;
- devname = (char *) get_zeroed_page(GFP_KERNEL);
- if (!devname)
- goto error_no_devname;
-
- options = (char *) get_zeroed_page(GFP_KERNEL);
- if (!options)
- goto error_no_options;
-
- vnode = AFS_FS_I(d_inode(mntpt));
if (test_bit(AFS_VNODE_PSEUDODIR, &vnode->flags)) {
/* if the directory is a pseudo directory, use the d_name */
- static const char afs_root_cell[] = ":root.cell.";
unsigned size = mntpt->d_name.len;
- ret = -ENOENT;
- if (size < 2 || size > AFS_MAXCELLNAME)
- goto error_no_page;
+ if (size < 2)
+ return -ENOENT;
+ p = mntpt->d_name.name;
if (mntpt->d_name.name[0] == '.') {
- devname[0] = '%';
- memcpy(devname + 1, mntpt->d_name.name + 1, size - 1);
- memcpy(devname + size, afs_root_cell,
- sizeof(afs_root_cell));
- rwpath = true;
- } else {
- devname[0] = '#';
- memcpy(devname + 1, mntpt->d_name.name, size);
- memcpy(devname + size + 1, afs_root_cell,
- sizeof(afs_root_cell));
+ size--;
+ p++;
+ ctx->type = AFSVL_RWVOL;
+ ctx->force = true;
+ }
+ if (size > AFS_MAXCELLNAME)
+ return -ENAMETOOLONG;
+
+ cell = afs_lookup_cell(ctx->net, p, size, NULL, false);
+ if (IS_ERR(cell)) {
+ pr_err("kAFS: unable to lookup cell '%pd'\n", mntpt);
+ return PTR_ERR(cell);
}
+ afs_put_cell(ctx->net, ctx->cell);
+ ctx->cell = cell;
+
+ ctx->volname = afs_root_volume;
+ ctx->volnamesz = sizeof(afs_root_volume) - 1;
} else {
/* read the contents of the AFS special symlink */
+ struct page *page;
loff_t size = i_size_read(d_inode(mntpt));
char *buf;
- ret = -EINVAL;
if (size > PAGE_SIZE - 1)
- goto error_no_page;
+ return -EINVAL;
page = read_mapping_page(d_inode(mntpt)->i_mapping, 0, NULL);
- if (IS_ERR(page)) {
- ret = PTR_ERR(page);
- goto error_no_page;
- }
+ if (IS_ERR(page))
+ return PTR_ERR(page);
- ret = -EIO;
- if (PageError(page))
- goto error;
+ if (PageError(page)) {
+ put_page(page);
+ return -EIO;
+ }
- buf = kmap_atomic(page);
- memcpy(devname, buf, size);
- kunmap_atomic(buf);
+ buf = kmap(page);
+ ctx->fc.source = kmemdup_nul(buf, size, GFP_KERNEL);
+ kunmap(page);
put_page(page);
- page = NULL;
- }
+ if (!ctx->fc.source)
+ return -ENOMEM;
- /* work out what options we want */
- as = AFS_FS_S(mntpt->d_sb);
- if (as->cell) {
- memcpy(options, "cell=", 5);
- strcpy(options + 5, as->cell->name);
- if ((as->volume && as->volume->type == AFSVL_RWVOL) || rwpath)
- strcat(options, ",rwpath");
+ ret = ctx->fc.ops->parse_source(fc);
+ if (ret < 0)
+ return ret;
}
- /* try and do the mount */
- _debug("--- attempting mount %s -o %s ---", devname, options);
- mnt = vfs_submount(mntpt, &afs_fs_type, devname,
- options, strlen(options) + 1);
- _debug("--- mount result %p ---", mnt);
+ return 0;
+}
+
+/*
+ * create a vfsmount to be automounted
+ */
+static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
+{
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ BUG_ON(!d_inode(mntpt));
+
+ fc = vfs_new_fs_context(&afs_fs_type, mntpt->d_sb, 0,
+ FS_CONTEXT_FOR_SUBMOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ ret = afs_mntpt_set_params(fc, mntpt);
+ if (ret < 0)
+ goto error_fc;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto error_fc;
+
+ mnt = vfs_create_mount(fc);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto error_fc;
+ }
- free_page((unsigned long) devname);
- free_page((unsigned long) options);
- _leave(" = %p", mnt);
+ put_fs_context(fc);
return mnt;
-error:
- put_page(page);
-error_no_page:
- free_page((unsigned long) options);
-error_no_options:
- free_page((unsigned long) devname);
-error_no_devname:
- _leave(" = %d", ret);
+error_fc:
+ put_fs_context(fc);
return ERR_PTR(ret);
}
diff --git a/fs/afs/super.c b/fs/afs/super.c
index f56070a9c606..5f9d225e32d9 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -65,18 +65,12 @@ static atomic_t afs_count_active_inodes;
enum {
afs_no_opt,
- afs_opt_cell,
afs_opt_dyn,
- afs_opt_rwpath,
- afs_opt_vol,
afs_opt_autocell,
};
static const match_table_t afs_options_list = {
- { afs_opt_cell, "cell=%s" },
{ afs_opt_dyn, "dyn" },
- { afs_opt_rwpath, "rwpath" },
- { afs_opt_vol, "vol=%s" },
{ afs_opt_autocell, "autocell" },
{ afs_no_opt, NULL },
};
@@ -195,37 +189,13 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
static int afs_parse_option(struct fs_context *fc, char *opt, size_t len)
{
struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
- struct afs_cell *cell;
substring_t args[MAX_OPT_ARGS];
- int token, size;
+ int token;
_enter("%s", opt);
token = match_token(opt, afs_options_list, args);
switch (token) {
- case afs_opt_cell:
- size = args[0].to - args[0].from;
- if (size <= 0)
- return -EINVAL;
- if (size > AFS_MAXCELLNAME)
- return -ENAMETOOLONG;
-
- rcu_read_lock();
- cell = afs_lookup_cell_rcu(ctx->net, args[0].from, size);
- rcu_read_unlock();
- if (IS_ERR(cell))
- return PTR_ERR(cell);
- afs_put_cell(ctx->net, ctx->cell);
- ctx->cell = cell;
- break;
-
- case afs_opt_rwpath:
- ctx->rwpath = true;
- break;
-
- case afs_opt_vol:
- return -EINVAL; /* Not required for automount */
-
case afs_opt_autocell:
ctx->autocell = true;
break;
@@ -249,8 +219,8 @@ static int afs_parse_option(struct fs_context *fc, char *opt, size_t len)
*
* This can be one of the following:
* "%[cell:]volume[.]" R/W volume
- * "#[cell:]volume[.]" R/O or R/W volume (rwpath=0),
- * or R/W (rwpath=1) volume
+ * "#[cell:]volume[.]" R/O or R/W volume (R/O parent),
+ * or R/W (R/W parent) volume
* "%[cell:]volume.readonly" R/O volume
* "#[cell:]volume.readonly" R/O volume
* "%[cell:]volume.backup" Backup volume
@@ -281,9 +251,7 @@ static int afs_parse_source(struct fs_context *fc)
}
/* determine the type of volume we're looking for */
- ctx->type = AFSVL_ROVOL;
- ctx->force = false;
- if (ctx->rwpath || name[0] == '%') {
+ if (name[0] == '%') {
ctx->type = AFSVL_RWVOL;
ctx->force = true;
}
@@ -599,8 +567,6 @@ static int afs_init_fs_context(struct fs_context *fc, struct super_block *src_sb
struct afs_cell *cell;
struct net *net_ns;
- if (current->nsproxy->net_ns != &init_net)
- return -EINVAL;
ctx->type = AFSVL_ROVOL;
switch (ctx->fc.purpose) {
Add fs_context support to the AFS filesystem, converting the parameter
parsing to store options there.
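The conversion relies on embedding the generic context inside a filesystem-specific one and recovering the outer struct with container_of(). A minimal userspace sketch of that pattern follows; the emb_* names are hypothetical, the macro is a simplified offsetof-based version of the kernel's, and only the two flag options that survive the series are handled:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified userspace equivalent of the kernel's container_of(). */
#define emb_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for the generic struct fs_context. */
struct emb_fs_context {
	const char *source;
};

/* Stand-in for struct afs_fs_context: generic part plus parsed options. */
struct emb_afs_context {
	struct emb_fs_context fc;
	bool autocell;
	bool dyn_root;
};

/* Called with the generic context; stores the result in the outer struct. */
static int emb_parse_option(struct emb_fs_context *fc, const char *opt)
{
	struct emb_afs_context *ctx =
		emb_container_of(fc, struct emb_afs_context, fc);

	if (strcmp(opt, "autocell") == 0)
		ctx->autocell = true;
	else if (strcmp(opt, "dyn") == 0)
		ctx->dyn_root = true;
	else
		return -EINVAL;
	return 0;
}
```

Because the core code only ever sees the generic pointer, each option can be handed to the filesystem one at a time and the parsed state accumulates in the context until the superblock is created.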
This will form the basis for namespace propagation over mountpoints within
the AFS model, thereby allowing AFS to be used in containers more easily.
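For reference, the source-name grammar that afs_parse_source() accepts can be exercised in userspace with a sketch like the one below. It mirrors the prefix, cell and suffix handling in the diff, but the src_* names are hypothetical, and the rwpath flag, the "none" source and the cell lookup are deliberately omitted:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum src_voltype { SRC_RWVOL, SRC_ROVOL, SRC_BACKVOL };	/* stand-ins for AFSVL_* */

struct src_params {
	const char *cellname;	/* NULL if the source names no cell */
	size_t cellnamesz;
	const char *volname;
	size_t volnamesz;	/* excludes any recognised suffix */
	enum src_voltype type;
	bool force;
};

/* Split "%[cell:]volume[.suffix]" or "#[cell:]volume[.suffix]". */
static int src_parse(const char *name, struct src_params *out)
{
	const char *colon, *suffix;

	if (!name || (name[0] != '%' && name[0] != '#') || !name[1])
		return -EINVAL;

	/* '%' forces R/W; '#' starts as R/O and leaves the choice open. */
	out->type = SRC_ROVOL;
	out->force = false;
	if (name[0] == '%') {
		out->type = SRC_RWVOL;
		out->force = true;
	}
	name++;

	/* Split the cell name out if there is one. */
	colon = strchr(name, ':');
	if (colon) {
		out->cellname = name;
		out->cellnamesz = (size_t)(colon - name);
		out->volname = colon + 1;
	} else {
		out->cellname = NULL;
		out->cellnamesz = 0;
		out->volname = name;
	}

	/* A recognised suffix overrides the type; a bare "." is just dropped. */
	suffix = strrchr(out->volname, '.');
	if (suffix) {
		if (strcmp(suffix, ".readonly") == 0) {
			out->type = SRC_ROVOL;
			out->force = true;
		} else if (strcmp(suffix, ".backup") == 0) {
			out->type = SRC_BACKVOL;
			out->force = true;
		} else if (suffix[1] != '\0') {
			suffix = NULL;	/* the dot is part of the volume name */
		}
	}
	out->volnamesz = suffix ? (size_t)(suffix - out->volname)
				: strlen(out->volname);
	return 0;
}
```

With this, "%example.org:root.cell." yields cell "example.org", volume "root.cell" and a forced R/W type, whereas "#root.afs.readonly" yields no cell, volume "root.afs" and a forced R/O type.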
Signed-off-by: David Howells <[email protected]>
---
fs/afs/internal.h | 9 +
fs/afs/super.c | 426 +++++++++++++++++++++++++++++------------------------
fs/afs/volume.c | 4
3 files changed, 242 insertions(+), 197 deletions(-)
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index f8086ec95e24..0266730b3ad7 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -32,13 +32,16 @@
struct pagevec;
struct afs_call;
-struct afs_mount_params {
+struct afs_fs_context {
+ struct fs_context fc;
+ struct afs_super_info *as;
bool rwpath; /* T if the parent should be considered R/W */
bool force; /* T to force cell type */
bool autocell; /* T if set auto mount operation */
bool dyn_root; /* T if dynamic root */
+ bool no_cell; /* T if the source is "none" (for dynroot) */
afs_voltype_t type; /* type of volume requested */
- int volnamesz; /* size of volume name */
+ unsigned int volnamesz; /* size of volume name */
const char *volname; /* name of volume to mount */
struct afs_net *net; /* Network namespace in effect */
struct afs_cell *cell; /* cell in which to find volume */
@@ -1007,7 +1010,7 @@ static inline struct afs_volume *__afs_get_volume(struct afs_volume *volume)
return volume;
}
-extern struct afs_volume *afs_create_volume(struct afs_mount_params *);
+extern struct afs_volume *afs_create_volume(struct afs_fs_context *);
extern void afs_activate_volume(struct afs_volume *);
extern void afs_deactivate_volume(struct afs_volume *);
extern void afs_put_volume(struct afs_cell *, struct afs_volume *);
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 7d17d01ca0cd..6ab0b79e061e 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -30,22 +30,21 @@
#include "internal.h"
static void afs_i_init_once(void *foo);
-static struct dentry *afs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, size_t data_size);
static void afs_kill_super(struct super_block *sb);
static struct inode *afs_alloc_inode(struct super_block *sb);
static void afs_destroy_inode(struct inode *inode);
static int afs_statfs(struct dentry *dentry, struct kstatfs *buf);
static int afs_show_devname(struct seq_file *m, struct dentry *root);
static int afs_show_options(struct seq_file *m, struct dentry *root);
+static int afs_init_fs_context(struct fs_context *fc, struct super_block *src_sb);
struct file_system_type afs_fs_type = {
- .owner = THIS_MODULE,
- .name = "afs",
- .mount = afs_mount,
- .kill_sb = afs_kill_super,
- .fs_flags = 0,
+ .owner = THIS_MODULE,
+ .name = "afs",
+ .fs_context_size = sizeof(struct afs_fs_context),
+ .init_fs_context = afs_init_fs_context,
+ .kill_sb = afs_kill_super,
+ .fs_flags = 0,
};
MODULE_ALIAS_FS("afs");
@@ -189,61 +188,53 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
}
/*
- * parse the mount options
- * - this function has been shamelessly adapted from the ext3 fs which
- * shamelessly adapted it from the msdos fs
+ * Parse a single mount option.
*/
-static int afs_parse_options(struct afs_mount_params *params,
- char *options, const char **devname)
+static int afs_parse_option(struct fs_context *fc, char *opt, size_t len)
{
+ struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
struct afs_cell *cell;
substring_t args[MAX_OPT_ARGS];
- char *p;
- int token;
-
- _enter("%s", options);
-
- options[PAGE_SIZE - 1] = 0;
-
- while ((p = strsep(&options, ","))) {
- if (!*p)
- continue;
-
- token = match_token(p, afs_options_list, args);
- switch (token) {
- case afs_opt_cell:
- rcu_read_lock();
- cell = afs_lookup_cell_rcu(params->net,
- args[0].from,
- args[0].to - args[0].from);
- rcu_read_unlock();
- if (IS_ERR(cell))
- return PTR_ERR(cell);
- afs_put_cell(params->net, params->cell);
- params->cell = cell;
- break;
-
- case afs_opt_rwpath:
- params->rwpath = true;
- break;
-
- case afs_opt_vol:
- *devname = args[0].from;
- break;
-
- case afs_opt_autocell:
- params->autocell = true;
- break;
-
- case afs_opt_dyn:
- params->dyn_root = true;
- break;
-
- default:
- printk(KERN_ERR "kAFS:"
- " Unknown or invalid mount option: '%s'\n", p);
+ int token, size;
+
+ _enter("%s", opt);
+
+ token = match_token(opt, afs_options_list, args);
+ switch (token) {
+ case afs_opt_cell:
+ size = args[0].to - args[0].from;
+ if (size <= 0)
return -EINVAL;
- }
+ if (size > AFS_MAXCELLNAME)
+ return -ENAMETOOLONG;
+
+ rcu_read_lock();
+ cell = afs_lookup_cell_rcu(ctx->net, args[0].from, size);
+ rcu_read_unlock();
+ if (IS_ERR(cell))
+ return PTR_ERR(cell);
+ afs_put_cell(ctx->net, ctx->cell);
+ ctx->cell = cell;
+ break;
+
+ case afs_opt_rwpath:
+ ctx->rwpath = true;
+ break;
+
+ case afs_opt_vol:
+ return -EINVAL; /* Not required for automount */
+
+ case afs_opt_autocell:
+ ctx->autocell = true;
+ break;
+
+ case afs_opt_dyn:
+ ctx->dyn_root = true;
+ break;
+
+ default:
+ printk(KERN_ERR "kAFS: Unknown or invalid mount option: '%s'\n", opt);
+ return -EINVAL;
}
_leave(" = 0");
@@ -251,9 +242,10 @@ static int afs_parse_options(struct afs_mount_params *params,
}
/*
- * parse a device name to get cell name, volume name, volume type and R/W
- * selector
- * - this can be one of the following:
+ * Parse the source name to get cell name, volume name, volume type and R/W
+ * selector.
+ *
+ * This can be one of the following:
* "%[cell:]volume[.]" R/W volume
* "#[cell:]volume[.]" R/O or R/W volume (rwpath=0),
* or R/W (rwpath=1) volume
@@ -262,11 +254,11 @@ static int afs_parse_options(struct afs_mount_params *params,
* "%[cell:]volume.backup" Backup volume
* "#[cell:]volume.backup" Backup volume
*/
-static int afs_parse_device_name(struct afs_mount_params *params,
- const char *name)
+static int afs_parse_source(struct fs_context *fc)
{
+ struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
struct afs_cell *cell;
- const char *cellname, *suffix;
+ const char *cellname, *suffix, *name = fc->source;
int cellnamesz;
_enter(",%s", name);
@@ -277,69 +269,116 @@ static int afs_parse_device_name(struct afs_mount_params *params,
}
if ((name[0] != '%' && name[0] != '#') || !name[1]) {
+ /* To use dynroot, we don't want to have to provide a source */
+ if (strcmp(name, "none") == 0) {
+ ctx->no_cell = true;
+ return 0;
+ }
printk(KERN_ERR "kAFS: unparsable volume name\n");
return -EINVAL;
}
/* determine the type of volume we're looking for */
- params->type = AFSVL_ROVOL;
- params->force = false;
- if (params->rwpath || name[0] == '%') {
- params->type = AFSVL_RWVOL;
- params->force = true;
+ ctx->type = AFSVL_ROVOL;
+ ctx->force = false;
+ if (ctx->rwpath || name[0] == '%') {
+ ctx->type = AFSVL_RWVOL;
+ ctx->force = true;
}
name++;
/* split the cell name out if there is one */
- params->volname = strchr(name, ':');
- if (params->volname) {
+ ctx->volname = strchr(name, ':');
+ if (ctx->volname) {
cellname = name;
- cellnamesz = params->volname - name;
- params->volname++;
+ cellnamesz = ctx->volname - name;
+ ctx->volname++;
} else {
- params->volname = name;
+ ctx->volname = name;
cellname = NULL;
cellnamesz = 0;
}
/* the volume type is further affected by a possible suffix */
- suffix = strrchr(params->volname, '.');
+ suffix = strrchr(ctx->volname, '.');
if (suffix) {
if (strcmp(suffix, ".readonly") == 0) {
- params->type = AFSVL_ROVOL;
- params->force = true;
+ ctx->type = AFSVL_ROVOL;
+ ctx->force = true;
} else if (strcmp(suffix, ".backup") == 0) {
- params->type = AFSVL_BACKVOL;
- params->force = true;
+ ctx->type = AFSVL_BACKVOL;
+ ctx->force = true;
} else if (suffix[1] == 0) {
} else {
suffix = NULL;
}
}
- params->volnamesz = suffix ?
- suffix - params->volname : strlen(params->volname);
+ ctx->volnamesz = suffix ?
+ suffix - ctx->volname : strlen(ctx->volname);
_debug("cell %*.*s [%p]",
- cellnamesz, cellnamesz, cellname ?: "", params->cell);
+ cellnamesz, cellnamesz, cellname ?: "", ctx->cell);
/* lookup the cell record */
- if (cellname || !params->cell) {
- cell = afs_lookup_cell(params->net, cellname, cellnamesz,
+ if (cellname) {
+ cell = afs_lookup_cell(ctx->net, cellname, cellnamesz,
NULL, false);
if (IS_ERR(cell)) {
- printk(KERN_ERR "kAFS: unable to lookup cell '%*.*s'\n",
+ pr_err("kAFS: unable to lookup cell '%*.*s'\n",
cellnamesz, cellnamesz, cellname ?: "");
return PTR_ERR(cell);
}
- afs_put_cell(params->net, params->cell);
- params->cell = cell;
+ afs_put_cell(ctx->net, ctx->cell);
+ ctx->cell = cell;
}
_debug("CELL:%s [%p] VOLUME:%*.*s SUFFIX:%s TYPE:%d%s",
- params->cell->name, params->cell,
- params->volnamesz, params->volnamesz, params->volname,
- suffix ?: "-", params->type, params->force ? " FORCE" : "");
+ ctx->cell->name, ctx->cell,
+ ctx->volnamesz, ctx->volnamesz, ctx->volname,
+ suffix ?: "-", ctx->type, ctx->force ? " FORCE" : "");
+
+ return 0;
+}
+
+/*
+ * Validate the options, get the cell key and look up the volume.
+ */
+static int afs_validate_fc(struct fs_context *fc)
+{
+ struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
+ struct afs_volume *volume;
+ struct key *key;
+
+ if (!ctx->dyn_root) {
+ if (ctx->no_cell) {
+ pr_warn("kAFS: Can only specify source 'none' with -o dyn\n");
+ return -EINVAL;
+ }
+
+ if (!ctx->cell) {
+ pr_warn("kAFS: No cell specified\n");
+ return -EDESTADDRREQ;
+ }
+
+ /* We try to do the mount securely. */
+ key = afs_request_key(ctx->cell);
+ if (IS_ERR(key))
+ return PTR_ERR(key);
+
+ ctx->key = key;
+
+ if (ctx->volume) {
+ afs_put_volume(ctx->cell, ctx->volume);
+ ctx->volume = NULL;
+ }
+
+ volume = afs_create_volume(ctx);
+ if (IS_ERR(volume))
+ return PTR_ERR(volume);
+
+ ctx->volume = volume;
+ }
return 0;
}
@@ -347,34 +386,34 @@ static int afs_parse_device_name(struct afs_mount_params *params,
/*
* check a superblock to see if it's the one we're looking for
*/
-static int afs_test_super(struct super_block *sb, void *data)
+static int afs_test_super(struct super_block *sb, struct fs_context *fc)
{
- struct afs_super_info *as1 = data;
+ struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
struct afs_super_info *as = AFS_FS_S(sb);
- return (as->net == as1->net &&
+ return (as->net == ctx->net &&
as->volume &&
- as->volume->vid == as1->volume->vid);
+ as->volume->vid == ctx->volume->vid);
}
-static int afs_dynroot_test_super(struct super_block *sb, void *data)
+static int afs_dynroot_test_super(struct super_block *sb, struct fs_context *fc)
{
return false;
}
-static int afs_set_super(struct super_block *sb, void *data)
+static int afs_set_super(struct super_block *sb, struct fs_context *fc)
{
- struct afs_super_info *as = data;
+ struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
- sb->s_fs_info = as;
+ sb->s_fs_info = ctx->as;
+ ctx->as = NULL;
return set_anon_super(sb, NULL);
}
/*
* fill in the superblock
*/
-static int afs_fill_super(struct super_block *sb,
- struct afs_mount_params *params)
+static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
{
struct afs_super_info *as = AFS_FS_S(sb);
struct afs_fid fid;
@@ -405,13 +444,13 @@ static int afs_fill_super(struct super_block *sb,
fid.vid = as->volume->vid;
fid.vnode = 1;
fid.unique = 1;
- inode = afs_iget(sb, params->key, &fid, NULL, NULL, NULL);
+ inode = afs_iget(sb, ctx->key, &fid, NULL, NULL, NULL);
}
if (IS_ERR(inode))
return PTR_ERR(inode);
- if (params->autocell || params->dyn_root)
+ if (ctx->autocell || as->dyn_root)
set_bit(AFS_VNODE_AUTOCELL, &AFS_FS_I(inode)->flags);
ret = -ENOMEM;
@@ -419,7 +458,7 @@ static int afs_fill_super(struct super_block *sb,
if (!sb->s_root)
goto error;
- if (params->dyn_root)
+ if (as->dyn_root)
sb->s_d_op = &afs_dynroot_dentry_operations;
else
sb->s_d_op = &afs_fs_dentry_operations;
@@ -432,17 +471,19 @@ static int afs_fill_super(struct super_block *sb,
return ret;
}
-static struct afs_super_info *afs_alloc_sbi(struct afs_mount_params *params)
+static struct afs_super_info *afs_alloc_sbi(struct afs_fs_context *ctx)
{
struct afs_super_info *as;
as = kzalloc(sizeof(struct afs_super_info), GFP_KERNEL);
if (as) {
- as->net = afs_get_net(params->net);
- if (params->dyn_root)
+ as->net = afs_get_net(ctx->net);
+ if (ctx->dyn_root) {
as->dyn_root = true;
- else
- as->cell = afs_get_cell(params->cell);
+ } else {
+ as->cell = afs_get_cell(ctx->cell);
+ as->volume = __afs_get_volume(ctx->volume);
+ }
}
return as;
}
@@ -457,127 +498,128 @@ static void afs_destroy_sbi(struct afs_super_info *as)
}
}
+static void afs_kill_super(struct super_block *sb)
+{
+ struct afs_super_info *as = AFS_FS_S(sb);
+
+ /* Clear the callback interests (which will do ilookup5) before
+ * deactivating the superblock.
+ */
+ if (as->volume)
+ afs_clear_callback_interests(as->net, as->volume->servers);
+ kill_anon_super(sb);
+ if (as->volume)
+ afs_deactivate_volume(as->volume);
+ afs_destroy_sbi(as);
+}
+
/*
- * get an AFS superblock
+ * Get an AFS superblock and root directory.
*/
-static struct dentry *afs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *options, size_t data_size)
+static int afs_get_tree(struct fs_context *fc)
{
- struct afs_mount_params params;
+ struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
struct super_block *sb;
- struct afs_volume *candidate;
- struct key *key;
struct afs_super_info *as;
int ret;
- _enter(",,%s,%p", dev_name, options);
-
- memset(&params, 0, sizeof(params));
- params.net = &__afs_net;
-
- ret = -EINVAL;
- if (current->nsproxy->net_ns != &init_net)
- goto error;
-
- /* parse the options and device name */
- if (options) {
- ret = afs_parse_options(&params, options, &dev_name);
- if (ret < 0)
- goto error;
- }
-
- if (!params.dyn_root) {
- ret = afs_parse_device_name(&params, dev_name);
- if (ret < 0)
- goto error;
-
- /* try and do the mount securely */
- key = afs_request_key(params.cell);
- if (IS_ERR(key)) {
- _leave(" = %ld [key]", PTR_ERR(key));
- ret = PTR_ERR(key);
- goto error;
- }
- params.key = key;
- }
+ _enter("%s", fc->source);
/* allocate a superblock info record */
- ret = -ENOMEM;
- as = afs_alloc_sbi(&params);
- if (!as)
- goto error_key;
-
- if (!params.dyn_root) {
- /* Assume we're going to need a volume record; at the very
- * least we can use it to update the volume record if we have
- * one already. This checks that the volume exists within the
- * cell.
- */
- candidate = afs_create_volume(&params);
- if (IS_ERR(candidate)) {
- ret = PTR_ERR(candidate);
- goto error_as;
- }
-
- as->volume = candidate;
+ as = ctx->as;
+ if (!as) {
+ ret = -ENOMEM;
+ as = afs_alloc_sbi(ctx);
+ if (!as)
+ goto error;
+ ctx->as = as;
}
/* allocate a deviceless superblock */
- sb = sget(fs_type,
- as->dyn_root ? afs_dynroot_test_super : afs_test_super,
- afs_set_super, flags, as);
+ sb = sget_fc(fc,
+ as->dyn_root ? afs_dynroot_test_super : afs_test_super,
+ afs_set_super);
if (IS_ERR(sb)) {
ret = PTR_ERR(sb);
- goto error_as;
+ goto error;
}
if (!sb->s_root) {
/* initial superblock/root creation */
_debug("create");
- ret = afs_fill_super(sb, &params);
+ ret = afs_fill_super(sb, ctx);
if (ret < 0)
goto error_sb;
- as = NULL;
sb->s_flags |= SB_ACTIVE;
} else {
_debug("reuse");
ASSERTCMP(sb->s_flags, &, SB_ACTIVE);
- afs_destroy_sbi(as);
- as = NULL;
}
- afs_put_cell(params.net, params.cell);
- key_put(params.key);
+ ctx->fc.root = dget(sb->s_root);
+ ctx->fc.drop_sb = true;
_leave(" = 0 [%p]", sb);
- return dget(sb->s_root);
+ return 0;
error_sb:
deactivate_locked_super(sb);
- goto error_key;
-error_as:
- afs_destroy_sbi(as);
-error_key:
- key_put(params.key);
error:
- afs_put_cell(params.net, params.cell);
_leave(" = %d", ret);
- return ERR_PTR(ret);
+ return ret;
}
-static void afs_kill_super(struct super_block *sb)
+static void afs_free_fc(struct fs_context *fc)
{
- struct afs_super_info *as = AFS_FS_S(sb);
+ struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
- /* Clear the callback interests (which will do ilookup5) before
- * deactivating the superblock.
- */
- if (as->volume)
- afs_clear_callback_interests(as->net, as->volume->servers);
- kill_anon_super(sb);
- if (as->volume)
- afs_deactivate_volume(as->volume);
- afs_destroy_sbi(as);
+ afs_put_volume(ctx->cell, ctx->volume);
+ afs_put_cell(ctx->net, ctx->cell);
+ afs_put_net(ctx->net);
+ afs_destroy_sbi(ctx->as);
+ key_put(ctx->key);
+}
+
+static const struct fs_context_operations afs_context_ops = {
+ .free = afs_free_fc,
+ .parse_source = afs_parse_source,
+ .parse_option = afs_parse_option,
+ .validate = afs_validate_fc,
+ .get_tree = afs_get_tree,
+};
+
+/*
+ * Set up the filesystem mount context.
+ */
+static int afs_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
+ struct afs_super_info *src_as;
+ struct afs_cell *cell;
+
+ if (current->nsproxy->net_ns != &init_net)
+ return -EINVAL;
+
+ if (src_sb) {
+ src_as = AFS_FS_S(src_sb);
+ if (src_as) {
+ ctx->net = afs_get_net(src_as->net);
+ ctx->cell = afs_get_cell(src_as->cell);
+ ctx->volume = __afs_get_volume(src_as->volume);
+ }
+ } else {
+ ctx->net = afs_get_net(&__afs_net);
+
+ /* Default to the workstation cell. */
+ rcu_read_lock();
+ cell = afs_lookup_cell_rcu(ctx->net, NULL, 0);
+ rcu_read_unlock();
+ if (IS_ERR(cell))
+ cell = NULL;
+ ctx->cell = cell;
+ }
+
+ ctx->fc.ops = &afs_context_ops;
+ return 0;
}
/*
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index 3037bd01f617..7adcddf02e66 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -21,7 +21,7 @@ static const char *const afs_voltypes[] = { "R/W", "R/O", "BAK" };
/*
* Allocate a volume record and load it up from a vldb record.
*/
-static struct afs_volume *afs_alloc_volume(struct afs_mount_params *params,
+static struct afs_volume *afs_alloc_volume(struct afs_fs_context *params,
struct afs_vldb_entry *vldb,
unsigned long type_mask)
{
@@ -149,7 +149,7 @@ static struct afs_vldb_entry *afs_vl_lookup_vldb(struct afs_cell *cell,
* - Rule 3: If parent volume is R/W, then only mount R/W volume unless
* explicitly told otherwise
*/
-struct afs_volume *afs_create_volume(struct afs_mount_params *params)
+struct afs_volume *afs_create_volume(struct afs_fs_context *params)
{
struct afs_vldb_entry *vldb;
struct afs_volume *volume;
AFS server records get removed from the net->fs_servers tree when they're
deleted, but not from the net->fs_addresses{4,6} lists, which can lead to
an oops in afs_find_server() when a server record has been removed, for
instance during rmmod.
Fix this by deleting the record from the by-address lists before posting it
for RCU destruction.
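The ordering described above can be modelled in userspace as a rough
sketch. All demo_* names are hypothetical, plain singly linked lists stand
in for the kernel's tree and hlists, and the RCU grace period is not
modelled at all. The only point is that a record is unlinked from every
index that can reach it before its memory is released:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a server record reachable through two indexes. */
struct demo_server {
	int id;
	struct demo_server *next_by_id;   /* link in the primary index */
	struct demo_server *next_by_addr; /* link in the by-address index */
};

struct demo_net {
	struct demo_server *by_id;
	struct demo_server *by_addr;
};

/* Insert a new record into both indexes. */
static struct demo_server *demo_add_server(struct demo_net *net, int id)
{
	struct demo_server *srv = calloc(1, sizeof(*srv));

	if (!srv)
		return NULL;
	srv->id = id;
	srv->next_by_id = net->by_id;
	net->by_id = srv;
	srv->next_by_addr = net->by_addr;
	net->by_addr = srv;
	return srv;
}

/* Look a record up through the by-address index only. */
static struct demo_server *demo_find_by_addr(struct demo_net *net, int id)
{
	struct demo_server *srv;

	for (srv = net->by_addr; srv; srv = srv->next_by_addr)
		if (srv->id == id)
			return srv;
	return NULL;
}

/* Unlink from BOTH indexes first, and only then free the record. */
static void demo_destroy_server(struct demo_net *net, struct demo_server *srv)
{
	struct demo_server **pp;

	for (pp = &net->by_id; *pp; pp = &(*pp)->next_by_id) {
		if (*pp == srv) {
			*pp = srv->next_by_id;
			break;
		}
	}
	for (pp = &net->by_addr; *pp; pp = &(*pp)->next_by_addr) {
		if (*pp == srv) {
			*pp = srv->next_by_addr;
			break;
		}
	}
	free(srv);
}
```

If the second loop were omitted, demo_find_by_addr() could still return a
pointer to freed memory, which is the shape of the bug being fixed here.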
The reason this hasn't been noticed before is that the fileserver keeps
probing the local cache manager, thereby keeping the service record alive,
so the oops would only happen when a fileserver eventually gets bored and
stops pinging or if the module gets rmmod'd and a call comes in from the
fileserver during the window between the server records being destroyed and
the socket being closed.
The oops looks something like:
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
...
Workqueue: kafsd afs_process_async_call [kafs]
RIP: 0010:afs_find_server+0x271/0x36f [kafs]
...
Call Trace:
? worker_thread+0x230/0x2ac
? worker_thread+0x230/0x2ac
afs_deliver_cb_init_call_back_state3+0x1f2/0x21f [kafs]
afs_deliver_to_call+0x1ee/0x5e8 [kafs]
? worker_thread+0x230/0x2ac
afs_process_async_call+0x5b/0xd0 [kafs]
process_one_work+0x2c2/0x504
? worker_thread+0x230/0x2ac
worker_thread+0x1d4/0x2ac
? rescuer_thread+0x29b/0x29b
kthread+0x11f/0x127
? kthread_create_on_node+0x3f/0x3f
ret_from_fork+0x24/0x30
Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <[email protected]>
---
fs/afs/server.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/afs/server.c b/fs/afs/server.c
index e23be63998a8..629c74986cff 100644
--- a/fs/afs/server.c
+++ b/fs/afs/server.c
@@ -428,8 +428,15 @@ static void afs_gc_servers(struct afs_net *net, struct afs_server *gc_list)
}
write_sequnlock(&net->fs_lock);
- if (deleted)
+ if (deleted) {
+ write_seqlock(&net->fs_addr_lock);
+ if (!hlist_unhashed(&server->addr4_link))
+ hlist_del_rcu(&server->addr4_link);
+ if (!hlist_unhashed(&server->addr6_link))
+ hlist_del_rcu(&server->addr6_link);
+ write_sequnlock(&net->fs_addr_lock);
afs_destroy_server(net, server);
+ }
}
}
Export get_proc_net() so that write() routines attached to net-namespaced
proc files can find the network namespace that they're in. Currently, this
is only accessible via seqfile routines.
This will permit AFS to have a separate cell database per-network namespace
and to manage each one independently of the others.
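As a rough userspace illustration of why a write() routine wants this
helper: get_proc_net() goes through maybe_get_net(), which only hands back
a namespace whose refcount has not already reached zero. The sketch below
models that "get unless zero" step with C11 atomics. The demo_* names are
hypothetical, and the -ESTALE return mirrors what the AFS proc write
routines later in this series do when no namespace can be pinned:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted network namespace. */
struct demo_ns {
	atomic_int count;
};

/* Take a reference only if the count has not already dropped to zero. */
static struct demo_ns *demo_maybe_get(struct demo_ns *ns)
{
	int old = atomic_load(&ns->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&ns->count, &old, old + 1))
			return ns;
	}
	return NULL;
}

static void demo_put(struct demo_ns *ns)
{
	atomic_fetch_sub(&ns->count, 1);
}

/* Shape of a write() handler: pin the namespace, work, then unpin. */
static int demo_write_handler(struct demo_ns *ns)
{
	struct demo_ns *pinned = demo_maybe_get(ns);

	if (!pinned)
		return -ESTALE;	/* namespace is already being torn down */

	/* ... operate on per-namespace state here ... */

	demo_put(pinned);
	return 0;
}
```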
Signed-off-by: David Howells <[email protected]>
---
fs/proc/proc_net.c | 3 ++-
include/linux/proc_fs.h | 2 ++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 1763f370489d..23d56146e38f 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -33,10 +33,11 @@ static inline struct net *PDE_NET(struct proc_dir_entry *pde)
return pde->parent->data;
}
-static struct net *get_proc_net(const struct inode *inode)
+struct net *get_proc_net(const struct inode *inode)
{
return maybe_get_net(PDE_NET(PDE(inode)));
}
+EXPORT_SYMBOL_GPL(get_proc_net);
int seq_open_net(struct inode *ino, struct file *f,
const struct seq_operations *ops, int size)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 928ef9e4d912..47a87121d30c 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -79,6 +79,8 @@ static inline struct proc_dir_entry *proc_net_mkdir(
return proc_mkdir_data(name, 0, parent, net);
}
+extern struct net *get_proc_net(const struct inode *inode);
+
struct ns_common;
int open_related_ns(struct ns_common *ns,
struct ns_common *(*get_ns)(struct ns_common *ns));
Implement namespacing within AFS, but don't yet let mounts occur outside
the init namespace. An additional patch will be required to propagate the
network namespace across automounts.
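The lookup pattern this patch switches to can be pictured with a small
userspace model. This is a sketch only: the demo_* names, the fixed-size
slot array and the slot number are all assumptions. In the patch itself
the slot is sized by pernet_operations.size, indexed by afs_net_id and
fetched with net_generic():

```c
#include <stddef.h>

#define DEMO_MAX_SUBSYS 8

/* Hypothetical stand-in for struct net: one private slot per subsystem. */
struct demo_netns {
	void *gen[DEMO_MAX_SUBSYS];
};

/* Hypothetical stand-in for struct afs_net. */
struct demo_afs_net {
	struct demo_netns *net;	/* back-pointer to the owning namespace */
	int live;
};

/* Slot index; in the kernel this is assigned at registration time. */
static int demo_afs_net_id = 3;

/* Model of net_generic(): fetch a subsystem's per-namespace record. */
static void *demo_net_generic(const struct demo_netns *ns, int id)
{
	return ns->gen[id];
}

/* Model of the afs_net() helper added to internal.h. */
static struct demo_afs_net *demo_afs_net(struct demo_netns *ns)
{
	return demo_net_generic(ns, demo_afs_net_id);
}

/* Model of the per-namespace .init hook, given a pre-allocated slot. */
static int demo_afs_net_init(struct demo_netns *ns, struct demo_afs_net *slot)
{
	ns->gen[demo_afs_net_id] = slot;
	slot->net = ns;
	slot->live = 1;
	return 0;
}
```

Each namespace thus resolves to its own record, which is what replaces the
single global __afs_net.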
Signed-off-by: David Howells <[email protected]>
---
fs/afs/cell.c | 4 +-
fs/afs/internal.h | 36 ++++++++++++---------
fs/afs/main.c | 33 ++++++++++++++++----
fs/afs/proc.c | 89 +++++++++++++++++++++++++++++++++++------------------
fs/afs/super.c | 58 +++++++++++++++++++++++++----------
5 files changed, 149 insertions(+), 71 deletions(-)
diff --git a/fs/afs/cell.c b/fs/afs/cell.c
index fdf4c36cff79..a98a8a3d5544 100644
--- a/fs/afs/cell.c
+++ b/fs/afs/cell.c
@@ -528,7 +528,7 @@ static int afs_activate_cell(struct afs_net *net, struct afs_cell *cell)
NULL, 0,
cell, 0, true);
#endif
- ret = afs_proc_cell_setup(net, cell);
+ ret = afs_proc_cell_setup(cell);
if (ret < 0)
return ret;
spin_lock(&net->proc_cells_lock);
@@ -544,7 +544,7 @@ static void afs_deactivate_cell(struct afs_net *net, struct afs_cell *cell)
{
_enter("%s", cell->name);
- afs_proc_cell_remove(net, cell);
+ afs_proc_cell_remove(cell);
spin_lock(&net->proc_cells_lock);
list_del_init(&cell->proc_link);
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 0266730b3ad7..a5161c0ae3ab 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -22,6 +22,8 @@
#include <linux/backing-dev.h>
#include <linux/uuid.h>
#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/sock.h>
#include <net/af_rxrpc.h>
#include "afs.h"
@@ -192,7 +194,7 @@ struct afs_read {
* - there's one superblock per volume
*/
struct afs_super_info {
- struct afs_net *net; /* Network namespace */
+ struct net *net_ns; /* Network namespace */
struct afs_cell *cell; /* The cell in which the volume resides */
struct afs_volume *volume; /* volume record */
bool dyn_root; /* True if dynamic root */
@@ -221,6 +223,7 @@ struct afs_sysnames {
* AFS network namespace record.
*/
struct afs_net {
+ struct net *net; /* Backpointer to the owning net namespace */
struct afs_uuid uuid;
bool live; /* F if this namespace is being removed */
@@ -283,7 +286,6 @@ struct afs_net {
};
extern const char afs_init_sysname[];
-extern struct afs_net __afs_net;// Dummy AFS network namespace; TODO: replace with real netns
enum afs_cell_state {
AFS_CELL_UNSET,
@@ -790,34 +792,36 @@ extern int afs_drop_inode(struct inode *);
* main.c
*/
extern struct workqueue_struct *afs_wq;
+extern int afs_net_id;
-static inline struct afs_net *afs_d2net(struct dentry *dentry)
+static inline struct afs_net *afs_net(struct net *net)
{
- return &__afs_net;
+ return net_generic(net, afs_net_id);
}
-static inline struct afs_net *afs_i2net(struct inode *inode)
+static inline struct afs_net *afs_sb2net(struct super_block *sb)
{
- return &__afs_net;
+ return afs_net(AFS_FS_S(sb)->net_ns);
}
-static inline struct afs_net *afs_v2net(struct afs_vnode *vnode)
+static inline struct afs_net *afs_d2net(struct dentry *dentry)
{
- return &__afs_net;
+ return afs_sb2net(dentry->d_sb);
}
-static inline struct afs_net *afs_sock2net(struct sock *sk)
+static inline struct afs_net *afs_i2net(struct inode *inode)
{
- return &__afs_net;
+ return afs_sb2net(inode->i_sb);
}
-static inline struct afs_net *afs_get_net(struct afs_net *net)
+static inline struct afs_net *afs_v2net(struct afs_vnode *vnode)
{
- return net;
+ return afs_i2net(&vnode->vfs_inode);
}
-static inline void afs_put_net(struct afs_net *net)
+static inline struct afs_net *afs_sock2net(struct sock *sk)
{
+ return net_generic(sock_net(sk), afs_net_id);
}
static inline void __afs_stat(atomic_t *s)
@@ -852,8 +856,8 @@ extern int afs_get_ipv4_interfaces(struct afs_interface *, size_t, bool);
*/
extern int __net_init afs_proc_init(struct afs_net *);
extern void __net_exit afs_proc_cleanup(struct afs_net *);
-extern int afs_proc_cell_setup(struct afs_net *, struct afs_cell *);
-extern void afs_proc_cell_remove(struct afs_net *, struct afs_cell *);
+extern int afs_proc_cell_setup(struct afs_cell *);
+extern void afs_proc_cell_remove(struct afs_cell *);
extern void afs_put_sysnames(struct afs_sysnames *);
/*
@@ -986,7 +990,7 @@ extern bool afs_annotate_server_list(struct afs_server_list *, struct afs_server
* super.c
*/
extern int __init afs_fs_init(void);
-extern void __exit afs_fs_exit(void);
+extern void afs_fs_exit(void);
/*
* vlclient.c
diff --git a/fs/afs/main.c b/fs/afs/main.c
index d7560168b3bf..7d2c1354e2ca 100644
--- a/fs/afs/main.c
+++ b/fs/afs/main.c
@@ -15,6 +15,7 @@
#include <linux/completion.h>
#include <linux/sched.h>
#include <linux/random.h>
+#include <linux/proc_fs.h>
#define CREATE_TRACE_POINTS
#include "internal.h"
@@ -32,7 +33,7 @@ module_param(rootcell, charp, 0);
MODULE_PARM_DESC(rootcell, "root AFS cell name and VL server IP addr list");
struct workqueue_struct *afs_wq;
-struct afs_net __afs_net;
+static struct proc_dir_entry *afs_proc_symlink;
#if defined(CONFIG_ALPHA)
const char afs_init_sysname[] = "alpha_linux26";
@@ -67,11 +68,13 @@ const char afs_init_sysname[] = "unknown_linux26";
/*
* Initialise an AFS network namespace record.
*/
-static int __net_init afs_net_init(struct afs_net *net)
+static int __net_init afs_net_init(struct net *net_ns)
{
struct afs_sysnames *sysnames;
+ struct afs_net *net = afs_net(net_ns);
int ret;
+ net->net = net_ns;
net->live = true;
generate_random_uuid((unsigned char *)&net->uuid);
@@ -142,8 +145,10 @@ static int __net_init afs_net_init(struct afs_net *net)
/*
* Clean up and destroy an AFS network namespace record.
*/
-static void __net_exit afs_net_exit(struct afs_net *net)
+static void __net_exit afs_net_exit(struct net *net_ns)
{
+ struct afs_net *net = afs_net(net_ns);
+
net->live = false;
afs_cell_purge(net);
afs_purge_servers(net);
@@ -152,6 +157,13 @@ static void __net_exit afs_net_exit(struct afs_net *net)
afs_put_sysnames(net->sysnames);
}
+static struct pernet_operations afs_net_ops = {
+ .init = afs_net_init,
+ .exit = afs_net_exit,
+ .id = &afs_net_id,
+ .size = sizeof(struct afs_net),
+};
+
/*
* initialise the AFS client FS module
*/
@@ -178,7 +190,7 @@ static int __init afs_init(void)
goto error_cache;
#endif
- ret = afs_net_init(&__afs_net);
+ ret = register_pernet_subsys(&afs_net_ops);
if (ret < 0)
goto error_net;
@@ -187,10 +199,18 @@ static int __init afs_init(void)
if (ret < 0)
goto error_fs;
+ afs_proc_symlink = proc_symlink("fs/afs", NULL, "../self/net/afs");
+ if (IS_ERR(afs_proc_symlink)) {
+ ret = PTR_ERR(afs_proc_symlink);
+ goto error_proc;
+ }
+
return ret;
+error_proc:
+ afs_fs_exit();
error_fs:
- afs_net_exit(&__afs_net);
+ unregister_pernet_subsys(&afs_net_ops);
error_net:
#ifdef CONFIG_AFS_FSCACHE
fscache_unregister_netfs(&afs_cache_netfs);
@@ -219,8 +239,9 @@ static void __exit afs_exit(void)
{
printk(KERN_INFO "kAFS: Red Hat AFS client v0.1 unregistering.\n");
+ proc_remove(afs_proc_symlink);
afs_fs_exit();
- afs_net_exit(&__afs_net);
+ unregister_pernet_subsys(&afs_net_ops);
#ifdef CONFIG_AFS_FSCACHE
fscache_unregister_netfs(&afs_cache_netfs);
#endif
diff --git a/fs/afs/proc.c b/fs/afs/proc.c
index 839a22280606..cc7c48a5b743 100644
--- a/fs/afs/proc.c
+++ b/fs/afs/proc.c
@@ -17,14 +17,16 @@
#include <linux/uaccess.h>
#include "internal.h"
-static inline struct afs_net *afs_proc2net(struct file *f)
+static inline struct afs_net *afs_proc2net_get(struct file *f)
{
- return &__afs_net;
+ struct net *net_ns = get_proc_net(file_inode(f));
+
+ return net_ns ? afs_net(net_ns) : NULL;
}
static inline struct afs_net *afs_seq2net(struct seq_file *m)
{
- return &__afs_net; // TODO: use seq_file_net(m)
+ return afs_net(seq_file_net(m));
}
static int afs_proc_cells_open(struct inode *inode, struct file *file);
@@ -161,7 +163,7 @@ int afs_proc_init(struct afs_net *net)
{
_enter("");
- net->proc_afs = proc_mkdir("fs/afs", NULL);
+ net->proc_afs = proc_net_mkdir(net->net, "afs", net->net->proc_net);
if (!net->proc_afs)
goto error_dir;
@@ -196,16 +198,8 @@ void afs_proc_cleanup(struct afs_net *net)
*/
static int afs_proc_cells_open(struct inode *inode, struct file *file)
{
- struct seq_file *m;
- int ret;
-
- ret = seq_open(file, &afs_proc_cells_ops);
- if (ret < 0)
- return ret;
-
- m = file->private_data;
- m->private = PDE_DATA(inode);
- return 0;
+ return seq_open_net(inode, file, &afs_proc_cells_ops,
+ sizeof(struct seq_net_private));
}
/*
@@ -266,7 +260,8 @@ static int afs_proc_cells_show(struct seq_file *m, void *v)
static ssize_t afs_proc_cells_write(struct file *file, const char __user *buf,
size_t size, loff_t *_pos)
{
- struct afs_net *net = afs_proc2net(file);
+ struct afs_net *net;
+ struct net *net_ns = NULL;
char *kbuf, *name, *args;
int ret;
@@ -305,6 +300,12 @@ static ssize_t afs_proc_cells_write(struct file *file, const char __user *buf,
/* determine command to perform */
_debug("cmd=%s name=%s args=%s", kbuf, name, args);
+ ret = -ESTALE;
+ net_ns = get_proc_net(file_inode(file));
+ if (!net_ns)
+ goto done;
+ net = afs_net(net_ns);
+
if (strcmp(kbuf, "add") == 0) {
struct afs_cell *cell;
@@ -324,6 +325,7 @@ static ssize_t afs_proc_cells_write(struct file *file, const char __user *buf,
ret = size;
done:
+ put_net(net_ns);
kfree(kbuf);
_leave(" = %d", ret);
return ret;
@@ -338,15 +340,24 @@ static ssize_t afs_proc_rootcell_read(struct file *file, char __user *buf,
size_t size, loff_t *_pos)
{
struct afs_cell *cell;
- struct afs_net *net = afs_proc2net(file);
+ struct afs_net *net;
+ struct net *net_ns = NULL;
unsigned int seq = 0;
char name[AFS_MAXCELLNAME + 1];
int len;
if (*_pos > 0)
return 0;
- if (!net->ws_cell)
- return 0;
+
+ net_ns = get_proc_net(file_inode(file));
+ if (!net_ns)
+ return -ESTALE;
+ net = afs_net(net_ns);
+
+ if (!net->ws_cell) {
+ len = 0;
+ goto out;
+ }
rcu_read_lock();
do {
@@ -362,14 +373,18 @@ static ssize_t afs_proc_rootcell_read(struct file *file, char __user *buf,
rcu_read_unlock();
if (!len)
- return 0;
+ goto out;
name[len++] = '\n';
if (len > size)
len = size;
- if (copy_to_user(buf, name, len) != 0)
- return -EFAULT;
+ if (copy_to_user(buf, name, len) != 0) {
+ len = -EFAULT;
+ goto out;
+ }
*_pos = 1;
+out:
+ put_net(net_ns);
return len;
}
@@ -381,7 +396,8 @@ static ssize_t afs_proc_rootcell_write(struct file *file,
const char __user *buf,
size_t size, loff_t *_pos)
{
- struct afs_net *net = afs_proc2net(file);
+ struct afs_net *net;
+ struct net *net_ns = NULL;
char *kbuf, *s;
int ret;
@@ -407,6 +423,12 @@ static ssize_t afs_proc_rootcell_write(struct file *file,
/* determine command to perform */
_debug("rootcell=%s", kbuf);
+ ret = -ESTALE;
+ net_ns = get_proc_net(file_inode(file));
+ if (!net_ns)
+ goto out;
+ net = afs_net(net_ns);
+
ret = afs_cell_init(net, kbuf);
if (ret >= 0)
ret = size; /* consume everything, always */
@@ -420,13 +442,14 @@ static ssize_t afs_proc_rootcell_write(struct file *file,
/*
* initialise /proc/fs/afs/<cell>/
*/
-int afs_proc_cell_setup(struct afs_net *net, struct afs_cell *cell)
+int afs_proc_cell_setup(struct afs_cell *cell)
{
struct proc_dir_entry *dir;
+ struct afs_net *net = cell->net;
_enter("%p{%s},%p", cell, cell->name, net->proc_afs);
- dir = proc_mkdir(cell->name, net->proc_afs);
+ dir = proc_net_mkdir(net->net, cell->name, net->proc_afs);
if (!dir)
goto error_dir;
@@ -449,12 +472,12 @@ int afs_proc_cell_setup(struct afs_net *net, struct afs_cell *cell)
/*
* remove /proc/fs/afs/<cell>/
*/
-void afs_proc_cell_remove(struct afs_net *net, struct afs_cell *cell)
+void afs_proc_cell_remove(struct afs_cell *cell)
{
- _enter("");
+ struct afs_net *net = cell->net;
+ _enter("");
remove_proc_subtree(cell->name, net->proc_afs);
-
_leave("");
}
@@ -471,7 +494,8 @@ static int afs_proc_cell_volumes_open(struct inode *inode, struct file *file)
if (!cell)
return -ENOENT;
- ret = seq_open(file, &afs_proc_cell_volumes_ops);
+ ret = seq_open_net(inode, file, &afs_proc_cell_volumes_ops,
+ sizeof(struct seq_net_private));
if (ret < 0)
return ret;
@@ -560,7 +584,8 @@ static int afs_proc_cell_vlservers_open(struct inode *inode, struct file *file)
if (!cell)
return -ENOENT;
- ret = seq_open(file, &afs_proc_cell_vlservers_ops);
+ ret = seq_open_net(inode, file, &afs_proc_cell_vlservers_ops,
+ sizeof(struct seq_net_private));
if (ret<0)
return ret;
@@ -649,7 +674,8 @@ static int afs_proc_cell_vlservers_show(struct seq_file *m, void *v)
*/
static int afs_proc_servers_open(struct inode *inode, struct file *file)
{
- return seq_open(file, &afs_proc_servers_ops);
+ return seq_open_net(inode, file, &afs_proc_servers_ops,
+ sizeof(struct seq_net_private));
}
/*
@@ -729,7 +755,8 @@ static int afs_proc_sysname_open(struct inode *inode, struct file *file)
struct seq_file *m;
int ret;
- ret = seq_open(file, &afs_proc_sysname_ops);
+ ret = seq_open_net(inode, file, &afs_proc_sysname_ops,
+ sizeof(struct seq_net_private));
if (ret < 0)
return ret;
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 6ab0b79e061e..f56070a9c606 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -48,6 +48,8 @@ struct file_system_type afs_fs_type = {
};
MODULE_ALIAS_FS("afs");
+int afs_net_id;
+
static const struct super_operations afs_super_ops = {
.statfs = afs_statfs,
.alloc_inode = afs_alloc_inode,
@@ -117,7 +119,7 @@ int __init afs_fs_init(void)
/*
* clean up the filesystem
*/
-void __exit afs_fs_exit(void)
+void afs_fs_exit(void)
{
_enter("");
@@ -391,7 +393,7 @@ static int afs_test_super(struct super_block *sb, struct fs_context *fc)
struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
struct afs_super_info *as = AFS_FS_S(sb);
- return (as->net == ctx->net &&
+ return (as->net_ns == ctx->fc.net_ns &&
as->volume &&
as->volume->vid == ctx->volume->vid);
}
@@ -477,7 +479,7 @@ static struct afs_super_info *afs_alloc_sbi(struct afs_fs_context *ctx)
as = kzalloc(sizeof(struct afs_super_info), GFP_KERNEL);
if (as) {
- as->net = afs_get_net(ctx->net);
+ as->net_ns = get_net(ctx->fc.net_ns);
if (ctx->dyn_root) {
as->dyn_root = true;
} else {
@@ -492,8 +494,8 @@ static void afs_destroy_sbi(struct afs_super_info *as)
{
if (as) {
afs_put_volume(as->cell, as->volume);
- afs_put_cell(as->net, as->cell);
- afs_put_net(as->net);
+ afs_put_cell(afs_net(as->net_ns), as->cell);
+ put_net(as->net_ns);
kfree(as);
}
}
@@ -506,7 +508,8 @@ static void afs_kill_super(struct super_block *sb)
* deactivating the superblock.
*/
if (as->volume)
- afs_clear_callback_interests(as->net, as->volume->servers);
+ afs_clear_callback_interests(afs_net(as->net_ns),
+ as->volume->servers);
kill_anon_super(sb);
if (as->volume)
afs_deactivate_volume(as->volume);
@@ -574,7 +577,6 @@ static void afs_free_fc(struct fs_context *fc)
afs_put_volume(ctx->cell, ctx->volume);
afs_put_cell(ctx->net, ctx->cell);
- afs_put_net(ctx->net);
afs_destroy_sbi(ctx->as);
key_put(ctx->key);
}
@@ -595,19 +597,19 @@ static int afs_init_fs_context(struct fs_context *fc, struct super_block *src_sb
struct afs_fs_context *ctx = container_of(fc, struct afs_fs_context, fc);
struct afs_super_info *src_as;
struct afs_cell *cell;
+ struct net *net_ns;
if (current->nsproxy->net_ns != &init_net)
return -EINVAL;
+ ctx->type = AFSVL_ROVOL;
- if (src_sb) {
- src_as = AFS_FS_S(src_sb);
- if (src_as) {
- ctx->net = afs_get_net(src_as->net);
- ctx->cell = afs_get_cell(src_as->cell);
- ctx->volume = __afs_get_volume(src_as->volume);
- }
- } else {
- ctx->net = afs_get_net(&__afs_net);
+ switch (ctx->fc.purpose) {
+ case FS_CONTEXT_FOR_USER_MOUNT:
+ case FS_CONTEXT_FOR_KERNEL_MOUNT:
+ ctx->fc.net_ns = maybe_get_net(current->nsproxy->net_ns);
+ if (!ctx->fc.net_ns)
+ return -ESTALE;
+ ctx->net = afs_net(ctx->fc.net_ns);
/* Default to the workstation cell. */
rcu_read_lock();
@@ -616,6 +618,30 @@ static int afs_init_fs_context(struct fs_context *fc, struct super_block *src_sb
if (IS_ERR(cell))
cell = NULL;
ctx->cell = cell;
+ break;
+
+ case FS_CONTEXT_FOR_SUBMOUNT:
+ if (!src_sb)
+ return -EINVAL;
+
+ src_as = AFS_FS_S(src_sb);
+ ASSERT(src_as);
+
+ net_ns = maybe_get_net(src_as->net_ns);
+ if (!net_ns)
+ return -ESTALE;
+ ctx->fc.net_ns = net_ns;
+ ctx->net = afs_net(net_ns);
+ if (src_as->cell)
+ ctx->cell = afs_get_cell(src_as->cell);
+ if (src_as->volume && src_as->volume->type == AFSVL_RWVOL) {
+ ctx->type = AFSVL_RWVOL;
+ ctx->force = true;
+ }
+ break;
+
+ case FS_CONTEXT_FOR_RECONFIGURE:
+ break;
}
ctx->fc.ops = &afs_context_ops;
Provide a system call by which a filesystem opened with fsopen() and
configured by a series of writes can be mounted:
int ret = fsmount(int fsfd, int dfd, const char *path,
unsigned int at_flags, unsigned int flags);
where fsfd is the fd returned by fsopen(); dfd, path and at_flags locate
the mountpoint; and flags holds the applicable MS_* flags. dfd can be
AT_FDCWD or an fd open to a directory.
In the event that fsmount() fails, it may be possible to get an error
message by calling read(). If no message is available, ENODATA will be
reported.
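For reference, the way the flags argument is checked and turned into
per-mount flags can be sketched in userspace as below. The DEMO_* constants
are illustrative stand-ins, not the real MS_* and MNT_* values, and
demo_fsmount_flags() is a hypothetical name. The logic follows the
SYSCALL_DEFINE5() body in this patch: unknown bits are rejected,
strictatime and noatime are mutually exclusive, and relatime is the
default:

```c
#include <errno.h>

/* Illustrative stand-ins for the MS_* flags accepted by fsmount(). */
#define DEMO_MS_RDONLY		(1u << 0)
#define DEMO_MS_NOSUID		(1u << 1)
#define DEMO_MS_NODEV		(1u << 2)
#define DEMO_MS_NOEXEC		(1u << 3)
#define DEMO_MS_NOATIME		(1u << 4)
#define DEMO_MS_NODIRATIME	(1u << 5)
#define DEMO_MS_RELATIME	(1u << 6)
#define DEMO_MS_STRICTATIME	(1u << 7)
#define DEMO_MS_ALLOWED		((1u << 8) - 1)

/* Illustrative stand-ins for the MNT_* per-mount flags. */
#define DEMO_MNT_READONLY	(1u << 0)
#define DEMO_MNT_NOSUID		(1u << 1)
#define DEMO_MNT_NODEV		(1u << 2)
#define DEMO_MNT_NOEXEC		(1u << 3)
#define DEMO_MNT_NOATIME	(1u << 4)
#define DEMO_MNT_NODIRATIME	(1u << 5)
#define DEMO_MNT_RELATIME	(1u << 6)

/* Returns 0 and fills *mnt_flags, or -EINVAL for a bad combination. */
static int demo_fsmount_flags(unsigned int flags, unsigned int *mnt_flags)
{
	unsigned int m = 0;

	if (flags & ~DEMO_MS_ALLOWED)
		return -EINVAL;

	if (flags & DEMO_MS_RDONLY)
		m |= DEMO_MNT_READONLY;
	if (flags & DEMO_MS_NOSUID)
		m |= DEMO_MNT_NOSUID;
	if (flags & DEMO_MS_NODEV)
		m |= DEMO_MNT_NODEV;
	if (flags & DEMO_MS_NOEXEC)
		m |= DEMO_MNT_NOEXEC;
	if (flags & DEMO_MS_NODIRATIME)
		m |= DEMO_MNT_NODIRATIME;

	if (flags & DEMO_MS_STRICTATIME) {
		if (flags & DEMO_MS_NOATIME)
			return -EINVAL;	/* contradictory atime policies */
	} else if (flags & DEMO_MS_NOATIME) {
		m |= DEMO_MNT_NOATIME;
	} else {
		m |= DEMO_MNT_RELATIME;	/* default atime policy */
	}

	*mnt_flags = m;
	return 0;
}
```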
Signed-off-by: David Howells <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/namespace.c | 82 ++++++++++++++++++++++++++++++++
include/linux/fs_context.h | 2 -
include/linux/syscalls.h | 2 +
kernel/sys_ni.c | 1
6 files changed, 88 insertions(+), 1 deletion(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d02346692c3f..5efec45b5ecb 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -397,3 +397,4 @@
383 i386 statx sys_statx __ia32_sys_statx
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
385 i386 fsopen sys_fsopen __ia32_sys_fsopen
+386 i386 fsmount sys_fsmount __ia32_sys_fsmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 6708847571e2..f602389e1406 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -342,6 +342,7 @@
331 common pkey_free __x64_sys_pkey_free
332 common statx __x64_sys_statx
333 common fsopen __x64_sys_fsopen
+334 common fsmount __x64_sys_fsmount
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index dff482ad87b4..14d110901bbe 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3192,6 +3192,88 @@ struct vfsmount *kern_mount(struct file_system_type *type)
}
EXPORT_SYMBOL_GPL(kern_mount);
+/*
+ * Mount a new, prepared superblock (specified by fs_fd) on the location
+ * specified by dfd and dir_name. dfd can be AT_FDCWD, a dir fd or a container
+ * fd. This cannot be used for binding, moving or remounting mounts.
+ */
+SYSCALL_DEFINE5(fsmount, int, fs_fd, int, dfd, const char __user *, dir_name,
+ unsigned int, at_flags, unsigned int, flags)
+{
+ struct fs_context *fc;
+ struct path mountpoint;
+ struct fd f;
+ unsigned int lookup_flags, mnt_flags = 0;
+ long ret;
+
+ if ((at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
+ AT_EMPTY_PATH)) != 0)
+ return -EINVAL;
+
+ if (flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_STRICTATIME))
+ return -EINVAL;
+
+ if (flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
+ if (flags & MS_NOSUID)
+ mnt_flags |= MNT_NOSUID;
+ if (flags & MS_NODEV)
+ mnt_flags |= MNT_NODEV;
+ if (flags & MS_NOEXEC)
+ mnt_flags |= MNT_NOEXEC;
+ if (flags & MS_NODIRATIME)
+ mnt_flags |= MNT_NODIRATIME;
+
+ if (flags & MS_STRICTATIME) {
+ if (flags & MS_NOATIME)
+ return -EINVAL;
+ } else if (flags & MS_NOATIME) {
+ mnt_flags |= MNT_NOATIME;
+ } else {
+ mnt_flags |= MNT_RELATIME;
+ }
+
+ f = fdget(fs_fd);
+ if (!f.file)
+ return -EBADF;
+
+ ret = -EINVAL;
+ if (f.file->f_op != &fscontext_fs_fops)
+ goto err_fsfd;
+
+ fc = f.file->private_data;
+
+ ret = -EPERM;
+ if (!may_mount() ||
+ ((fc->sb_flags & MS_MANDLOCK) && !may_mandlock()))
+ goto err_fsfd;
+
+ /* There must be a valid superblock or we can't mount it */
+ ret = -EINVAL;
+ if (!fc->root)
+ goto err_fsfd;
+
+ /* Find the mountpoint. A container can be specified in dfd. */
+ lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+ if (at_flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (at_flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (at_flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+ ret = user_path_at(dfd, dir_name, lookup_flags, &mountpoint);
+ if (ret < 0)
+ goto err_fsfd;
+
+ ret = do_new_mount_fc(fc, &mountpoint, mnt_flags);
+
+ path_put(&mountpoint);
+err_fsfd:
+ fdput(f);
+ return ret;
+}
+
/*
* Return true if path is reachable from root
*
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 536ae7d60f1f..dd79acddabec 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -102,5 +102,5 @@ extern int vfs_get_super(struct fs_context *fc,
int (*fill_super)(struct super_block *sb,
struct fs_context *fc));
-extern const struct file_operations fs_fs_fops;
+extern const struct file_operations fscontext_fs_fops;
#endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3c9b10e92015..e5f68788a096 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -892,6 +892,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_fsopen(const char *fs_name, unsigned int flags,
void *reserved3, void *reserved4, void *reserved5);
+asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
+ unsigned int flags);
/*
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c113fc9d5e77..2c236aad9b80 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -433,3 +433,4 @@ COND_SYSCALL(setuid16);
/* fd-based mount */
COND_SYSCALL(sys_fsopen);
+COND_SYSCALL(sys_fsmount);
kern_mount_data() isn't used any more, so remove it.
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 7 -------
include/linux/fs.h | 1 -
2 files changed, 8 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index c61ff2ab090a..dff482ad87b4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3192,13 +3192,6 @@ struct vfsmount *kern_mount(struct file_system_type *type)
}
EXPORT_SYMBOL_GPL(kern_mount);
-struct vfsmount *kern_mount_data(struct file_system_type *type,
- void *data, size_t data_size)
-{
- return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
/*
* Return true if path is reachable from root
*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c1f1428f6c67..9bc78e2c3ce5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2187,7 +2187,6 @@ mount_pseudo(struct file_system_type *fs_type, char *name,
extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
extern struct vfsmount *kern_mount(struct file_system_type *);
-extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
extern void kern_unmount(struct vfsmount *mnt);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);
Make the cpuset filesystem use the filesystem context. This is potentially
tricky as the cpuset fs is almost an alias for the cgroup filesystem, but
with some special parameters.
This can, however, be handled by setting up an appropriate cgroup
filesystem and returning the root directory of that as the root dir of this
one.
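The "special parameters" are just a fixed option string handed to the
cgroup context. As a rough userspace sketch of how such a string breaks
down, assuming the monolithic parser splits on commas and then on the
first '=' (the demo_* names and the callback shape are hypothetical, not
the kernel interface):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical per-option callback: val is NULL for a bare flag. */
typedef int (*demo_opt_fn)(void *ctx, const char *key, const char *val);

/* Split "a,b=c" in place and feed each option to fn; stop on error. */
static int demo_parse_monolithic(char *opts, demo_opt_fn fn, void *ctx)
{
	char *p = opts;

	while (p && *p) {
		char *next = strchr(p, ',');
		char *eq;
		int ret;

		if (next)
			*next++ = '\0';
		if (*p) {
			eq = strchr(p, '=');
			if (eq)
				*eq++ = '\0';
			ret = fn(ctx, p, eq);
			if (ret < 0)
				return ret;
		}
		p = next;
	}
	return 0;
}

/* Hypothetical parsed form of the options cpuset passes to cgroup. */
struct demo_cpuset_opts {
	int cpuset;
	int noprefix;
	char release_agent[64];
};

static int demo_cgroup_option(void *ctx, const char *key, const char *val)
{
	struct demo_cpuset_opts *o = ctx;

	if (strcmp(key, "cpuset") == 0 && !val) {
		o->cpuset = 1;
	} else if (strcmp(key, "noprefix") == 0 && !val) {
		o->noprefix = 1;
	} else if (strcmp(key, "release_agent") == 0 && val) {
		snprintf(o->release_agent, sizeof(o->release_agent), "%s", val);
	} else {
		return -1;	/* unknown or malformed option */
	}
	return 0;
}
```

The parser writes into the buffer it is given, which is why the patch
duplicates the constant string with kstrdup() before parsing it.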
Signed-off-by: David Howells <[email protected]>
cc: Tejun Heo <[email protected]>
---
kernel/cgroup/cpuset.c | 66 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 52 insertions(+), 14 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3c8ef37879f0..fce0584f161e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -38,7 +38,7 @@
#include <linux/mm.h>
#include <linux/memory.h>
#include <linux/export.h>
-#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/namei.h>
#include <linux/pagemap.h>
#include <linux/proc_fs.h>
@@ -315,26 +315,64 @@ static inline bool is_in_v2_mode(void)
* users. If someone tries to mount the "cpuset" filesystem, we
* silently switch it to mount "cgroup" instead
*/
-static struct dentry *cpuset_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data, size_t data_size)
+static int cpuset_get_tree(struct fs_context *fc)
{
- struct file_system_type *cgroup_fs = get_fs_type("cgroup");
- struct dentry *ret = ERR_PTR(-ENODEV);
+ static const char opts[] = "cpuset,noprefix,release_agent=/sbin/cpuset_release_agent";
+ struct file_system_type *cgroup_fs;
+ struct fs_context *cg_fc;
+ char *p;
+ int ret = -ENODEV;
+
+ cgroup_fs = get_fs_type("cgroup");
if (cgroup_fs) {
- char mountopts[] =
- "cpuset,noprefix,"
- "release_agent=/sbin/cpuset_release_agent";
- ret = cgroup_fs->mount(cgroup_fs, flags, unused_dev_name,
- mountopts, data_size);
- put_filesystem(cgroup_fs);
+ ret = PTR_ERR(cgroup_fs);
+ goto out;
+ }
+
+ cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->purpose);
+ put_filesystem(cgroup_fs);
+ if (IS_ERR(cg_fc)) {
+ ret = PTR_ERR(cg_fc);
+ goto out;
}
+
+ ret = -ENOMEM;
+ p = kstrdup(opts, GFP_KERNEL);
+ if (!p)
+ goto out_fc;
+
+ ret = generic_parse_monolithic(cg_fc, p, sizeof(opts) - 1);
+ kfree(p);
+ if (ret < 0)
+ goto out_fc;
+
+ ret = vfs_get_tree(cg_fc);
+ if (ret < 0)
+ goto out_fc;
+
+ fc->root = dget(cg_fc->root);
+ ret = 0;
+
+out_fc:
+ put_fs_context(cg_fc);
+out:
return ret;
}
+static const struct fs_context_operations cpuset_fs_context_ops = {
+ .get_tree = cpuset_get_tree,
+};
+
+static int cpuset_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ fc->ops = &cpuset_fs_context_ops;
+ return 0;
+}
+
static struct file_system_type cpuset_fs_type = {
- .name = "cpuset",
- .mount = cpuset_mount,
+ .name = "cpuset",
+ .fs_context_size = sizeof(struct fs_context),
+ .init_fs_context = cpuset_init_fs_context,
};
/*
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired.
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now routed through
cg_invalf(), currently a wrapper around pr_err(), so that they can
later be handed to userspace directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Signed-off-by: David Howells <[email protected]>
cc: Greg Kroah-Hartman <[email protected]>
cc: Tejun Heo <[email protected]>
cc: Li Zefan <[email protected]>
cc: Johannes Weiner <[email protected]>
cc: [email protected]
cc: [email protected]
---
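For reviewers unfamiliar with the wrapping described in notes (1), (4) and
(5), here is a minimal userspace sketch of how a filesystem gets from the
generic fs_context pointer back to its own wrapper. The struct layouts are
cut-down, hypothetical mirrors of the ones in this patch, container_of() is
a simplified stand-in for the kernel macro, and demo_wrapping() exists only
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, cut-down mirrors of the structs used in this patch. */
struct fs_context {
	int sb_flags;
};

struct kernfs_fs_context {
	struct fs_context fc;		/* generic part */
	unsigned long magic;		/* kernfs-level parameter */
};

struct rdt_fs_context {
	struct kernfs_fs_context kfc;	/* kernfs part */
	int enable_cdpl3;		/* filesystem-specific parameter */
};

/* Recover the outermost wrapper from the generic context pointer. */
struct rdt_fs_context *rdt_ctx(struct fs_context *fc)
{
	return container_of(fc, struct rdt_fs_context, kfc.fc);
}

/* Returns 1 if the round trip through the embedded fs_context works. */
int demo_wrapping(void)
{
	struct rdt_fs_context ctx = { .kfc.magic = 0x1234, .enable_cdpl3 = 1 };
	struct fs_context *fc = &ctx.kfc.fc;

	return rdt_ctx(fc) == &ctx &&
	       rdt_ctx(fc)->kfc.magic == 0x1234 &&
	       rdt_ctx(fc)->enable_cdpl3 == 1;
}
```

The core only ever hands around the innermost struct fs_context pointer, so
each layer adds its own fields around the one it wraps. Setting
.fs_context_size to the size of the outermost struct is presumably what lets
the core allocate the whole wrapper in one go.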
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 125 +++++++------
fs/kernfs/mount.c | 91 +++++----
fs/sysfs/mount.c | 59 ++++--
include/linux/cgroup.h | 3
include/linux/kernfs.h | 37 ++--
kernel/cgroup/cgroup-internal.h | 42 +++-
kernel/cgroup/cgroup-v1.c | 295 ++++++++++++++----------------
kernel/cgroup/cgroup.c | 219 +++++++++++++---------
8 files changed, 469 insertions(+), 402 deletions(-)
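The cgroup1 conversion splits the old monolithic parser into a per-option
step, a validation step and a destructor (notes (6) and (7)). A rough
userspace sketch of that shape follows. Everything in it (demo_ctx, the
demo_*() helpers, the "cpu" stand-in for a subsystem name) is hypothetical
and much simpler than the real cgroup1 code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down context; flag names follow cgroup_fs_context. */
struct demo_ctx {
	int none;	/* saw "none" */
	int all_ss;	/* saw "all" */
	int one_ss;	/* saw a subsystem name */
	char *name;	/* owned by the context, freed in demo_free() */
};

/* Parse exactly one token; conflicts are rejected as soon as seen. */
int demo_parse_option(struct demo_ctx *ctx, const char *token)
{
	if (!strcmp(token, "none")) {
		ctx->none = 1;
		return 0;
	}
	if (!strcmp(token, "all")) {
		if (ctx->one_ss)
			return -EINVAL;
		ctx->all_ss = 1;
		return 0;
	}
	if (!strncmp(token, "name=", 5)) {
		size_t len = strlen(token + 5);

		if (!len || ctx->name)
			return -EINVAL;
		ctx->name = malloc(len + 1);
		if (!ctx->name)
			return -ENOMEM;
		memcpy(ctx->name, token + 5, len + 1);
		return 0;
	}
	if (!strcmp(token, "cpu")) {	/* stand-in for a subsystem name */
		if (ctx->all_ss)
			return -EINVAL;
		ctx->one_ss = 1;
		return 0;
	}
	return -ENOENT;
}

/* Cross-option checks that only make sense once every token is in. */
int demo_validate(const struct demo_ctx *ctx)
{
	if ((ctx->one_ss || ctx->all_ss) && ctx->none)
		return -EINVAL;
	return 0;
}

/* Cleanup lives in one place, whatever the parse outcome was. */
void demo_free(struct demo_ctx *ctx)
{
	free(ctx->name);
	ctx->name = NULL;
}

/* Drive the three steps over an array of n tokens. */
int demo_run(const char **tokens, size_t n)
{
	struct demo_ctx ctx = { 0 };
	int ret = 0;
	size_t i;

	for (i = 0; i < n && ret == 0; i++)
		ret = demo_parse_option(&ctx, tokens[i]);
	if (ret == 0)
		ret = demo_validate(&ctx);
	demo_free(&ctx);
	return ret;
}
```

Keeping the allocated strings in the context means the error paths in the
parse and get_tree steps no longer need their own kfree() calls.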
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 3584ef8de1fd..121584d9d544 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -36,6 +36,12 @@
#include <asm/intel_rdt_sched.h>
#include "intel_rdt.h"
+struct rdt_fs_context {
+ struct kernfs_fs_context kfc;
+ bool enable_cdpl2;
+ bool enable_cdpl3;
+};
+
DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
@@ -1104,39 +1110,6 @@ static void cdp_disable_all(void)
cdpl2_disable();
}
-static int parse_rdtgroupfs_options(char *data)
-{
- char *token, *o = data;
- int ret = 0;
-
- while ((token = strsep(&o, ",")) != NULL) {
- if (!*token) {
- ret = -EINVAL;
- goto out;
- }
-
- if (!strcmp(token, "cdp")) {
- ret = cdpl3_enable();
- if (ret)
- goto out;
- } else if (!strcmp(token, "cdpl2")) {
- ret = cdpl2_enable();
- if (ret)
- goto out;
- } else {
- ret = -EINVAL;
- goto out;
- }
- }
-
- return 0;
-
-out:
- pr_err("Invalid mount option \"%s\"\n", token);
-
- return ret;
-}
-
/*
* We don't allow rdtgroup directories to be created anywhere
* except the root directory. Thus when looking for the rdtgroup
@@ -1205,13 +1178,11 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
struct rdtgroup *prgrp,
struct kernfs_node **mon_data_kn);
-static struct dentry *rdt_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data, size_t data_size)
+static int rdt_get_tree(struct fs_context *fc)
{
+ struct rdt_fs_context *ctx = container_of(fc, struct rdt_fs_context, kfc.fc);
struct rdt_domain *dom;
struct rdt_resource *r;
- struct dentry *dentry;
int ret;
cpus_read_lock();
@@ -1220,47 +1191,46 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
* resctrl file system can only be mounted once.
*/
if (static_branch_unlikely(&rdt_enable_key)) {
- dentry = ERR_PTR(-EBUSY);
+ ret = -EBUSY;
goto out;
}
- ret = parse_rdtgroupfs_options(data);
- if (ret) {
- dentry = ERR_PTR(ret);
- goto out_cdp;
+ if (ctx->enable_cdpl2) {
+ ret = cdpl2_enable();
+ if (ret < 0)
+ goto out_cdp;
+ }
+
+ if (ctx->enable_cdpl3) {
+ ret = cdpl3_enable();
+ if (ret < 0)
+ goto out_cdp;
}
closid_init();
ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_cdp;
- }
if (rdt_mon_capable) {
ret = mongroup_create_dir(rdtgroup_default.kn,
NULL, "mon_groups",
&kn_mongrp);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_info;
- }
kernfs_get(kn_mongrp);
ret = mkdir_mondata_all(rdtgroup_default.kn,
&rdtgroup_default, &kn_mondata);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_mongrp;
- }
kernfs_get(kn_mondata);
rdtgroup_default.mon.mon_data_kn = kn_mondata;
}
- dentry = kernfs_mount(fs_type, flags, rdt_root,
- RDTGROUP_SUPER_MAGIC, NULL);
- if (IS_ERR(dentry))
+ ret = kernfs_get_tree(&ctx->kfc);
+ if (ret < 0)
goto out_mondata;
if (rdt_alloc_capable)
@@ -1293,8 +1263,46 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
rdt_last_cmd_clear();
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
+ return ret;
+}
+
+static int rdt_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct rdt_fs_context *ctx = container_of(fc, struct rdt_fs_context, kfc.fc);
+
+ if (strcmp(opt, "cdp") == 0) {
+ ctx->enable_cdpl3 = true;
+ return 0;
+ }
+ if (strcmp(opt, "cdpl2") == 0) {
+ ctx->enable_cdpl2 = true;
+ return 0;
+ }
- return dentry;
+ return -EINVAL;
+}
+
+static void rdt_fs_context_free(struct fs_context *fc)
+{
+ struct rdt_fs_context *ctx = container_of(fc, struct rdt_fs_context, kfc.fc);
+
+ kernfs_free_fs_context(&ctx->kfc);
+}
+
+static const struct fs_context_operations rdt_fs_context_ops = {
+ .free = rdt_fs_context_free,
+ .parse_option = rdt_parse_option,
+ .get_tree = rdt_get_tree,
+};
+
+static int rdt_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct rdt_fs_context *ctx = container_of(fc, struct rdt_fs_context, kfc.fc);
+
+ ctx->kfc.root = rdt_root;
+ ctx->kfc.magic = RDTGROUP_SUPER_MAGIC;
+ ctx->kfc.fc.ops = &rdt_fs_context_ops;
+ return 0;
}
static int reset_all_ctrls(struct rdt_resource *r)
@@ -1459,9 +1467,10 @@ static void rdt_kill_sb(struct super_block *sb)
}
static struct file_system_type rdt_fs_type = {
- .name = "resctrl",
- .mount = rdt_mount,
- .kill_sb = rdt_kill_sb,
+ .name = "resctrl",
+ .fs_context_size = sizeof(struct rdt_fs_context),
+ .init_fs_context = rdt_init_fs_context,
+ .kill_sb = rdt_kill_sb,
};
static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index c05efbd1822e..725874dd6c5b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -22,14 +22,14 @@
struct kmem_cache *kernfs_node_cache;
-static int kernfs_sop_remount_fs(struct super_block *sb, int *flags,
- char *data, size_t data_size)
+static int kernfs_sop_reconfigure(struct super_block *sb, struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
struct kernfs_root *root = kernfs_info(sb)->root;
struct kernfs_syscall_ops *scops = root->syscall_ops;
- if (scops && scops->remount_fs)
- return scops->remount_fs(root, flags, data);
+ if (scops && scops->reconfigure)
+ return scops->reconfigure(root, kfc);
return 0;
}
@@ -61,7 +61,7 @@ const struct super_operations kernfs_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = kernfs_evict_inode,
- .remount_fs = kernfs_sop_remount_fs,
+ .reconfigure = kernfs_sop_reconfigure,
.show_options = kernfs_sop_show_options,
.show_path = kernfs_sop_show_path,
};
@@ -219,7 +219,7 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
} while (true);
}
-static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
+static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
{
struct kernfs_super_info *info = kernfs_info(sb);
struct inode *inode;
@@ -230,7 +230,7 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
- sb->s_magic = magic;
+ sb->s_magic = kfc->magic;
sb->s_op = &kernfs_sops;
sb->s_xattr = kernfs_xattr_handlers;
if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP)
@@ -257,20 +257,25 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
return 0;
}
-static int kernfs_test_super(struct super_block *sb, void *data)
+static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
struct kernfs_super_info *sb_info = kernfs_info(sb);
- struct kernfs_super_info *info = data;
+ struct kernfs_super_info *info = kfc->info;
return sb_info->root == info->root && sb_info->ns == info->ns;
}
-static int kernfs_set_super(struct super_block *sb, void *data)
+static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
int error;
- error = set_anon_super(sb, data);
- if (!error)
- sb->s_fs_info = data;
+
+ error = set_anon_super(sb, kfc->info);
+ if (!error) {
+ sb->s_fs_info = kfc->info;
+ kfc->info = NULL;
+ }
return error;
}
@@ -288,24 +293,15 @@ const void *kernfs_super_ns(struct super_block *sb)
}
/**
- * kernfs_mount_ns - kernfs mount helper
- * @fs_type: file_system_type of the fs being mounted
- * @flags: mount flags specified for the mount
- * @root: kernfs_root of the hierarchy being mounted
- * @magic: file system specific magic number
- * @new_sb_created: tell the caller if we allocated a new superblock
- * @ns: optional namespace tag of the mount
- *
- * This is to be called from each kernfs user's file_system_type->mount()
- * implementation, which should pass through the specified @fs_type and
- * @flags, and specify the hierarchy and namespace tag to mount via @root
- * and @ns, respectively.
+ * kernfs_get_tree - kernfs filesystem access/retrieval helper
+ * @kfc: The filesystem context.
*
- * The return value can be passed to the vfs layer verbatim.
+ * This is to be called from each kernfs user's fs_context->ops->get_tree()
+ * implementation, which should set the specified ->@fs_type and ->@flags, and
+ * specify the hierarchy and namespace tag to mount via ->@root and ->@ns_tag,
+ * respectively.
*/
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
+int kernfs_get_tree(struct kernfs_fs_context *kfc)
{
struct super_block *sb;
struct kernfs_super_info *info;
@@ -313,37 +309,42 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
- return ERR_PTR(-ENOMEM);
-
- info->root = root;
- info->ns = ns;
+ return -ENOMEM;
- sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags,
- &init_user_ns, info);
- if (IS_ERR(sb) || sb->s_fs_info != info)
- kfree(info);
+ info->root = kfc->root;
+ info->ns = kfc->ns_tag;
+
+ kfc->info = info;
+ sb = sget_fc(&kfc->fc, kernfs_test_super, kernfs_set_super);
+ if (kfc->info) {
+ kfree(kfc->info);
+ kfc->info = NULL;
+ } else {
+ kfc->ns_tag = NULL;
+ kfc->fc.degraded = true;
+ }
if (IS_ERR(sb))
- return ERR_CAST(sb);
-
- if (new_sb_created)
- *new_sb_created = !sb->s_root;
+ return PTR_ERR(sb);
if (!sb->s_root) {
struct kernfs_super_info *info = kernfs_info(sb);
- error = kernfs_fill_super(sb, magic);
+ kfc->new_sb_created = true;
+
+ error = kernfs_fill_super(sb, kfc);
if (error) {
deactivate_locked_super(sb);
- return ERR_PTR(error);
+ return error;
}
sb->s_flags |= SB_ACTIVE;
mutex_lock(&kernfs_mutex);
- list_add(&info->node, &root->supers);
+ list_add(&info->node, &info->root->supers);
mutex_unlock(&kernfs_mutex);
}
- return dget(sb->s_root);
+ kfc->fc.root = dget(sb->s_root);
+ return 0;
}
/**
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index b2b7d9ae4aba..d81f117562f5 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -20,27 +20,45 @@
static struct kernfs_root *sysfs_root;
struct kernfs_node *sysfs_root_kn;
-static struct dentry *sysfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data, size_t data_size)
+static int sysfs_get_tree(struct fs_context *fc)
{
- struct dentry *root;
- void *ns;
- bool new_sb;
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
+ int ret;
- if (!(flags & SB_KERNMOUNT)) {
+ ret = kernfs_get_tree(kfc);
+ if (ret)
+ return ret;
+
+ if (kfc->new_sb_created)
+ fc->root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
+ return 0;
+}
+
+static void sysfs_fs_context_free(struct fs_context *fc)
+{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
+
+ if (kfc->ns_tag)
+ kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
+ kernfs_free_fs_context(kfc);
+}
+
+static const struct fs_context_operations sysfs_fs_context_ops = {
+ .free = sysfs_fs_context_free,
+ .get_tree = sysfs_get_tree,
+};
+
+static int sysfs_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct kernfs_fs_context *kfc = container_of(fc, struct kernfs_fs_context, fc);
+
+ if (!(fc->sb_flags & SB_KERNMOUNT)) {
if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
- return ERR_PTR(-EPERM);
+ return -EPERM;
}
- ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
- root = kernfs_mount_ns(fs_type, flags, sysfs_root,
- SYSFS_MAGIC, &new_sb, ns);
- if (IS_ERR(root) || !new_sb)
- kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
- else if (new_sb)
- root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
-
- return root;
+ kfc->ns_tag = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+ kfc->root = sysfs_root;
+ kfc->magic = SYSFS_MAGIC;
+ kfc->fc.ops = &sysfs_fs_context_ops;
+ return 0;
}
static void sysfs_kill_sb(struct super_block *sb)
@@ -52,10 +70,11 @@ static void sysfs_kill_sb(struct super_block *sb)
}
static struct file_system_type sysfs_fs_type = {
- .name = "sysfs",
- .mount = sysfs_mount,
- .kill_sb = sysfs_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "sysfs",
+ .fs_context_size = sizeof(struct kernfs_fs_context),
+ .init_fs_context = sysfs_init_fs_context,
+ .kill_sb = sysfs_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};
int __init sysfs_init(void)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 473e0c0abb86..50771a7d0be9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -821,10 +821,11 @@ copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
#endif /* !CONFIG_CGROUPS */
-static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+static inline struct cgroup_namespace *get_cgroup_ns(struct cgroup_namespace *ns)
{
if (ns)
refcount_inc(&ns->count);
+ return ns;
}
static inline void put_cgroup_ns(struct cgroup_namespace *ns)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index ab25c8b6d9e3..29f6e15254bc 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -16,6 +16,7 @@
#include <linux/rbtree.h>
#include <linux/atomic.h>
#include <linux/wait.h>
+#include <linux/fs_context.h>
struct file;
struct dentry;
@@ -25,6 +26,7 @@ struct vm_area_struct;
struct super_block;
struct file_system_type;
+struct kernfs_fs_context;
struct kernfs_open_node;
struct kernfs_iattrs;
@@ -166,7 +168,7 @@ struct kernfs_node {
* kernfs_node parameter.
*/
struct kernfs_syscall_ops {
- int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
+ int (*reconfigure)(struct kernfs_root *root, struct kernfs_fs_context *kfc);
int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
int (*mkdir)(struct kernfs_node *parent, const char *name,
@@ -267,6 +269,20 @@ struct kernfs_ops {
#endif
};
+/*
+ * The kernfs superblock creation/mount parameter context.
+ */
+struct kernfs_fs_context {
+ struct fs_context fc;
+ struct kernfs_root *root; /* Root of the hierarchy being mounted */
+ void *ns_tag; /* Namespace tag of the mount (or NULL) */
+ unsigned long magic; /* File system specific magic number */
+
+ /* The following are set/used by kernfs_get_tree() */
+ struct kernfs_super_info *info; /* The new superblock info */
+ bool new_sb_created; /* Set to T if we allocated a new sb */
+};
+
#ifdef CONFIG_KERNFS
static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
@@ -350,9 +366,7 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
void kernfs_notify(struct kernfs_node *kn);
const void *kernfs_super_ns(struct super_block *sb);
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns);
+int kernfs_get_tree(struct kernfs_fs_context *fc);
void kernfs_kill_sb(struct super_block *sb);
struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
@@ -454,11 +468,8 @@ static inline void kernfs_notify(struct kernfs_node *kn) { }
static inline const void *kernfs_super_ns(struct super_block *sb)
{ return NULL; }
-static inline struct dentry *
-kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
-{ return ERR_PTR(-ENOSYS); }
+static inline int kernfs_get_tree(struct kernfs_fs_context *fc)
+{ return -ENOSYS; }
static inline void kernfs_kill_sb(struct super_block *sb) { }
@@ -535,13 +546,9 @@ static inline int kernfs_rename(struct kernfs_node *kn,
return kernfs_rename_ns(kn, new_parent, new_name, NULL);
}
-static inline struct dentry *
-kernfs_mount(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created)
+static inline void kernfs_free_fs_context(struct kernfs_fs_context *kfc)
{
- return kernfs_mount_ns(fs_type, flags, root,
- magic, new_sb_created, NULL);
+ /* Note that we don't deal with kfc->ns_tag here. */
}
#endif /* __LINUX_KERNFS_H */
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b928b27050c6..1aa3176779c9 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -8,6 +8,26 @@
#include <linux/list.h>
#include <linux/refcount.h>
+/*
+ * The cgroup filesystem superblock creation/mount context.
+ */
+struct cgroup_fs_context {
+ struct kernfs_fs_context kfc;
+ struct cgroup_root *root;
+ struct cgroup_namespace *ns;
+ u8 version; /* cgroups version */
+ unsigned int flags; /* CGRP_ROOT_* flags */
+
+ /* cgroup1 bits */
+ bool cpuset_clone_children;
+ bool none; /* User explicitly requested empty subsystem */
+ bool all_ss; /* Seen 'all' option */
+ bool one_ss; /* Seen a subsystem name option */
+ u16 subsys_mask; /* Selected subsystems */
+ char *name; /* Hierarchy name */
+ char *release_agent; /* Path for release notifications */
+};
+
/*
* A cgroup can be associated with multiple css_sets as different tasks may
* belong to different cgroups on different hierarchies. In the other
@@ -89,16 +109,6 @@ struct cgroup_mgctx {
#define DEFINE_CGROUP_MGCTX(name) \
struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
-struct cgroup_sb_opts {
- u16 subsys_mask;
- unsigned int flags;
- char *release_agent;
- bool cpuset_clone_children;
- char *name;
- /* User explicitly requested empty subsystem */
- bool none;
-};
-
extern struct mutex cgroup_mutex;
extern spinlock_t css_set_lock;
extern struct cgroup_subsys *cgroup_subsys[];
@@ -169,12 +179,10 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
struct cgroup_namespace *ns);
void cgroup_free_root(struct cgroup_root *root);
-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts);
+void init_cgroup_root(struct cgroup_fs_context *ctx);
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
- struct cgroup_root *root, unsigned long magic,
- struct cgroup_namespace *ns);
+int cgroup_do_get_tree(struct cgroup_fs_context *ctx);
int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
@@ -225,8 +233,8 @@ bool cgroup1_ssid_disabled(int ssid);
void cgroup1_pidlist_destroy_all(struct cgroup *cgrp);
void cgroup1_release_agent(struct work_struct *work);
void cgroup1_check_for_release(struct cgroup *cgrp);
-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
- void *data, unsigned long magic,
- struct cgroup_namespace *ns);
+int cgroup1_parse_option(struct cgroup_fs_context *ctx, char *p);
+int cgroup1_validate(struct cgroup_fs_context *ctx);
+int cgroup1_get_tree(struct cgroup_fs_context *ctx);
#endif /* __CGROUP_INTERNAL_H */
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index a2c05d2476ac..e149ff63b35a 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -16,6 +16,8 @@
#include <trace/events/cgroup.h>
+#define cg_invalf(fmt, ...) ({ pr_err(fmt, ## __VA_ARGS__); -EINVAL; })
+
/*
* pidlists linger the following amount before being destroyed. The goal
* is avoiding frequent destruction in the middle of consecutive read calls
@@ -915,168 +917,166 @@ static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root *kf_roo
return 0;
}
-static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
+int cgroup1_parse_option(struct cgroup_fs_context *ctx, char *token)
{
- char *token, *o = data;
- bool all_ss = false, one_ss = false;
- u16 mask = U16_MAX;
struct cgroup_subsys *ss;
- int nr_opts = 0;
int i;
-#ifdef CONFIG_CPUSETS
- mask = ~((u16)1 << cpuset_cgrp_id);
-#endif
-
- memset(opts, 0, sizeof(*opts));
-
- while ((token = strsep(&o, ",")) != NULL) {
- nr_opts++;
+ if (!strcmp(token, "none")) {
+ /* Explicitly have no subsystems */
+ ctx->none = true;
+ return 0;
+ }
+ if (!strcmp(token, "all")) {
+ /* Mutually exclusive option 'all' + subsystem name */
+ if (ctx->one_ss)
+ return cg_invalf("cgroup1: all conflicts with subsys name");
+ ctx->all_ss = true;
+ return 0;
+ }
+ if (!strcmp(token, "noprefix")) {
+ ctx->flags |= CGRP_ROOT_NOPREFIX;
+ return 0;
+ }
+ if (!strcmp(token, "clone_children")) {
+ ctx->cpuset_clone_children = true;
+ return 0;
+ }
+ if (!strcmp(token, "xattr")) {
+ ctx->flags |= CGRP_ROOT_XATTR;
+ return 0;
+ }
+ if (!strncmp(token, "release_agent=", 14)) {
+ /* Specifying two release agents is forbidden */
+ if (ctx->release_agent)
+ return cg_invalf("cgroup1: release_agent respecified");
+ ctx->release_agent =
+ kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
+ if (!ctx->release_agent)
+ return -ENOMEM;
+ return 0;
+ }
- if (!*token)
- return -EINVAL;
- if (!strcmp(token, "none")) {
- /* Explicitly have no subsystems */
- opts->none = true;
- continue;
- }
- if (!strcmp(token, "all")) {
- /* Mutually exclusive option 'all' + subsystem name */
- if (one_ss)
- return -EINVAL;
- all_ss = true;
- continue;
- }
- if (!strcmp(token, "noprefix")) {
- opts->flags |= CGRP_ROOT_NOPREFIX;
- continue;
+ if (!strncmp(token, "name=", 5)) {
+ const char *name = token + 5;
+ /* Can't specify an empty name */
+ if (!strlen(name))
+ return cg_invalf("cgroup1: Empty name");
+ /* Must match [\w.-]+ */
+ for (i = 0; i < strlen(name); i++) {
+ char c = name[i];
+ if (isalnum(c))
+ continue;
+ if ((c == '.') || (c == '-') || (c == '_'))
+ continue;
+ return cg_invalf("cgroup1: Invalid name");
}
- if (!strcmp(token, "clone_children")) {
- opts->cpuset_clone_children = true;
+ /* Specifying two names is forbidden */
+ if (ctx->name)
+ return cg_invalf("cgroup1: name respecified");
+ ctx->name = kstrndup(name,
+ MAX_CGROUP_ROOT_NAMELEN - 1,
+ GFP_KERNEL);
+ if (!ctx->name)
+ return -ENOMEM;
+
+ return 0;
+ }
+
- continue;
- }
- if (!strcmp(token, "cpuset_v2_mode")) {
- opts->flags |= CGRP_ROOT_CPUSET_V2_MODE;
- continue;
- }
- if (!strcmp(token, "xattr")) {
- opts->flags |= CGRP_ROOT_XATTR;
- continue;
- }
- if (!strncmp(token, "release_agent=", 14)) {
- /* Specifying two release agents is forbidden */
- if (opts->release_agent)
- return -EINVAL;
- opts->release_agent =
- kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
- if (!opts->release_agent)
- return -ENOMEM;
- continue;
- }
+ if (!strcmp(token, "cpuset_v2_mode")) {
+ ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE;
+ return 0;
+ }
+
+ for_each_subsys(ss, i) {
+ if (strcmp(token, ss->legacy_name))
+ continue;
+ if (cgroup1_ssid_disabled(i))
+ continue;
- if (!strncmp(token, "name=", 5)) {
- const char *name = token + 5;
- /* Can't specify an empty name */
- if (!strlen(name))
- return -EINVAL;
- /* Must match [\w.-]+ */
- for (i = 0; i < strlen(name); i++) {
- char c = name[i];
- if (isalnum(c))
- continue;
- if ((c == '.') || (c == '-') || (c == '_'))
- continue;
- return -EINVAL;
- }
- /* Specifying two names is forbidden */
- if (opts->name)
- return -EINVAL;
- opts->name = kstrndup(name,
- MAX_CGROUP_ROOT_NAMELEN - 1,
- GFP_KERNEL);
- if (!opts->name)
- return -ENOMEM;
- continue;
- }
+ /* Mutually exclusive option 'all' + subsystem name */
+ if (ctx->all_ss)
+ return cg_invalf("cgroup1: subsys name conflicts with all");
+ ctx->subsys_mask |= (1 << i);
+ ctx->one_ss = true;
+ return 0;
+ }
- for_each_subsys(ss, i) {
- if (strcmp(token, ss->legacy_name))
- continue;
- if (!cgroup_ssid_enabled(i))
- continue;
- if (cgroup1_ssid_disabled(i))
- continue;
+ if (i == CGROUP_SUBSYS_COUNT)
+ return -ENOENT;
+
+ return 0;
+}
- /* Mutually exclusive option 'all' + subsystem name */
- if (all_ss)
- return -EINVAL;
- opts->subsys_mask |= (1 << i);
- one_ss = true;
+/*
+ * Validate the options that have been parsed.
+ */
+int cgroup1_validate(struct cgroup_fs_context *ctx)
+{
+ struct cgroup_subsys *ss;
+ u16 mask = U16_MAX;
+ int i;
- break;
- }
- if (i == CGROUP_SUBSYS_COUNT)
- return -ENOENT;
- }
+#ifdef CONFIG_CPUSETS
+ mask = ~((u16)1 << cpuset_cgrp_id);
+#endif
/*
* If the 'all' option was specified select all the subsystems,
* otherwise if 'none', 'name=' and a subsystem name options were
* not specified, let's default to 'all'
*/
- if (all_ss || (!one_ss && !opts->none && !opts->name))
+ if (ctx->all_ss || (!ctx->one_ss && !ctx->none && !ctx->name))
for_each_subsys(ss, i)
if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i))
- opts->subsys_mask |= (1 << i);
+ ctx->subsys_mask |= (1 << i);
/*
* We either have to specify by name or by subsystems. (So all
* empty hierarchies must have a name).
*/
- if (!opts->subsys_mask && !opts->name)
- return -EINVAL;
+ if (!ctx->subsys_mask && !ctx->name)
+ return cg_invalf("cgroup1: Need name or subsystem set");
/*
* Option noprefix was introduced just for backward compatibility
* with the old cpuset, so we allow noprefix only if mounting just
* the cpuset subsystem.
*/
- if ((opts->flags & CGRP_ROOT_NOPREFIX) && (opts->subsys_mask & mask))
- return -EINVAL;
+ if ((ctx->flags & CGRP_ROOT_NOPREFIX) && (ctx->subsys_mask & mask))
+ return cg_invalf("cgroup1: noprefix used incorrectly");
/* Can't specify "none" and some subsystems */
- if (opts->subsys_mask && opts->none)
- return -EINVAL;
+ if (ctx->subsys_mask && ctx->none)
+ return cg_invalf("cgroup1: none used incorrectly");
return 0;
}
-static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
+static int cgroup1_reconfigure(struct kernfs_root *kf_root, struct kernfs_fs_context *kfc)
{
- int ret = 0;
+ struct cgroup_fs_context *ctx = container_of(kfc, struct cgroup_fs_context, kfc);
struct cgroup_root *root = cgroup_root_from_kf(kf_root);
- struct cgroup_sb_opts opts;
u16 added_mask, removed_mask;
+ int ret = 0;
cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
- /* See what subsystems are wanted */
- ret = parse_cgroupfs_options(data, &opts);
- if (ret)
- goto out_unlock;
-
- if (opts.subsys_mask != root->subsys_mask || opts.release_agent)
+ if (ctx->subsys_mask != root->subsys_mask || ctx->release_agent)
pr_warn("option changes via remount are deprecated (pid=%d comm=%s)\n",
task_tgid_nr(current), current->comm);
- added_mask = opts.subsys_mask & ~root->subsys_mask;
- removed_mask = root->subsys_mask & ~opts.subsys_mask;
+ added_mask = ctx->subsys_mask & ~root->subsys_mask;
+ removed_mask = root->subsys_mask & ~ctx->subsys_mask;
/* Don't allow flags or name to change at remount */
- if ((opts.flags ^ root->flags) ||
- (opts.name && strcmp(opts.name, root->name))) {
- pr_err("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"\n",
- opts.flags, opts.name ?: "", root->flags, root->name);
+ if ((ctx->flags ^ root->flags) ||
+ (ctx->name && strcmp(ctx->name, root->name))) {
+ cg_invalf("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"",
+ ctx->flags, ctx->name ?: "", root->flags, root->name);
ret = -EINVAL;
goto out_unlock;
}
@@ -1093,17 +1093,15 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
WARN_ON(rebind_subsystems(&cgrp_dfl_root, removed_mask));
- if (opts.release_agent) {
+ if (ctx->release_agent) {
spin_lock(&release_agent_path_lock);
- strcpy(root->release_agent_path, opts.release_agent);
+ strcpy(root->release_agent_path, ctx->release_agent);
spin_unlock(&release_agent_path_lock);
}
trace_cgroup_remount(root);
out_unlock:
- kfree(opts.release_agent);
- kfree(opts.name);
mutex_unlock(&cgroup_mutex);
return ret;
}
@@ -1111,31 +1109,25 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
struct kernfs_syscall_ops cgroup1_kf_syscall_ops = {
.rename = cgroup1_rename,
.show_options = cgroup1_show_options,
- .remount_fs = cgroup1_remount,
+ .reconfigure = cgroup1_reconfigure,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.show_path = cgroup_show_path,
};
-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
- void *data, unsigned long magic,
- struct cgroup_namespace *ns)
+/*
+ * Find or create a v1 cgroups superblock.
+ */
+int cgroup1_get_tree(struct cgroup_fs_context *ctx)
{
struct super_block *pinned_sb = NULL;
- struct cgroup_sb_opts opts;
struct cgroup_root *root;
struct cgroup_subsys *ss;
- struct dentry *dentry;
int i, ret;
bool new_root = false;
cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
- /* First find the desired set of subsystems */
- ret = parse_cgroupfs_options(data, &opts);
- if (ret)
- goto out_unlock;
-
/*
* Destruction of cgroup root is asynchronous, so subsystems may
* still be dying after the previous unmount. Let's drain the
@@ -1144,15 +1136,13 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* starting. Testing ref liveliness is good enough.
*/
for_each_subsys(ss, i) {
- if (!(opts.subsys_mask & (1 << i)) ||
+ if (!(ctx->subsys_mask & (1 << i)) ||
ss->root == &cgrp_dfl_root)
continue;
if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
- msleep(10);
- ret = restart_syscall();
- goto out_free;
+ goto err_restart;
}
cgroup_put(&ss->root->cgrp);
}
@@ -1168,8 +1158,8 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* name matches but sybsys_mask doesn't, we should fail.
* Remember whether name matched.
*/
- if (opts.name) {
- if (strcmp(opts.name, root->name))
+ if (ctx->name) {
+ if (strcmp(ctx->name, root->name))
continue;
name_match = true;
}
@@ -1178,15 +1168,15 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* If we asked for subsystems (or explicitly for no
* subsystems) then they must match.
*/
- if ((opts.subsys_mask || opts.none) &&
- (opts.subsys_mask != root->subsys_mask)) {
+ if ((ctx->subsys_mask || ctx->none) &&
+ (ctx->subsys_mask != root->subsys_mask)) {
if (!name_match)
continue;
ret = -EBUSY;
- goto out_unlock;
+ goto err_unlock;
}
- if (root->flags ^ opts.flags)
+ if (root->flags ^ ctx->flags)
pr_warn("new mount options do not match the existing superblock, will be ignored\n");
/*
@@ -1207,9 +1197,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
mutex_unlock(&cgroup_mutex);
if (!IS_ERR_OR_NULL(pinned_sb))
deactivate_super(pinned_sb);
- msleep(10);
- ret = restart_syscall();
- goto out_free;
+ goto err_restart;
}
ret = 0;
@@ -1221,41 +1209,35 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* specification is allowed for already existing hierarchies but we
* can't create new one without subsys specification.
*/
- if (!opts.subsys_mask && !opts.none) {
- ret = -EINVAL;
- goto out_unlock;
+ if (!ctx->subsys_mask && !ctx->none) {
+ ret = cg_invalf("cgroup1: No subsys list or none specified");
+ goto err_unlock;
}
/* Hierarchies may only be created in the initial cgroup namespace. */
- if (ns != &init_cgroup_ns) {
+ if (ctx->ns != &init_cgroup_ns) {
ret = -EPERM;
- goto out_unlock;
+ goto err_unlock;
}
root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
- goto out_unlock;
+ goto err_unlock;
}
new_root = true;
+ ctx->root = root;
- init_cgroup_root(root, &opts);
+ init_cgroup_root(ctx);
- ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD);
+ ret = cgroup_setup_root(root, ctx->subsys_mask, PERCPU_REF_INIT_DEAD);
if (ret)
cgroup_free_root(root);
out_unlock:
mutex_unlock(&cgroup_mutex);
-out_free:
- kfree(opts.release_agent);
- kfree(opts.name);
-
- if (ret)
- return ERR_PTR(ret);
- dentry = cgroup_do_mount(&cgroup_fs_type, flags, root,
- CGROUP_SUPER_MAGIC, ns);
+ if (!ret)
+ ret = cgroup_do_get_tree(ctx);
/*
* There's a race window after we release cgroup_mutex and before
@@ -1276,7 +1258,14 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
if (pinned_sb)
deactivate_super(pinned_sb);
- return dentry;
+ return ret;
+
+err_restart:
+ msleep(10);
+ return restart_syscall();
+err_unlock:
+ mutex_unlock(&cgroup_mutex);
+ return ret;
}
static int __init cgroup1_wq_init(void)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c2678e4800fc..93d544e3286c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1718,25 +1718,21 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
return len;
}
-static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
+static int cgroup2_parse_option(struct cgroup_fs_context *ctx, char *token)
{
- char *token;
-
- *root_flags = 0;
-
- if (!data)
+ if (!strcmp(token, "nsdelegate")) {
+ ctx->flags |= CGRP_ROOT_NS_DELEGATE;
return 0;
-
- while ((token = strsep(&data, ",")) != NULL) {
- if (!strcmp(token, "nsdelegate")) {
- *root_flags |= CGRP_ROOT_NS_DELEGATE;
- continue;
- }
-
- pr_err("cgroup2: unknown option \"%s\"\n", token);
- return -EINVAL;
}
+ return -EINVAL;
+}
+
+static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
+{
+ if (current->nsproxy->cgroup_ns == &init_cgroup_ns &&
+ cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
+ seq_puts(seq, ",nsdelegate");
return 0;
}
@@ -1750,23 +1746,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
}
}
-static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
+static int cgroup_reconfigure(struct kernfs_root *kf_root, struct kernfs_fs_context *kfc)
{
- if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
- seq_puts(seq, ",nsdelegate");
- return 0;
-}
+ struct cgroup_fs_context *ctx = container_of(kfc, struct cgroup_fs_context, kfc);
-static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data)
-{
- unsigned int root_flags;
- int ret;
-
- ret = parse_cgroup_root_flags(data, &root_flags);
- if (ret)
- return ret;
-
- apply_cgroup_root_flags(root_flags);
+ apply_cgroup_root_flags(ctx->flags);
return 0;
}
@@ -1852,8 +1836,9 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
}
-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
+void init_cgroup_root(struct cgroup_fs_context *ctx)
{
+ struct cgroup_root *root = ctx->root;
struct cgroup *cgrp = &root->cgrp;
INIT_LIST_HEAD(&root->root_list);
@@ -1862,12 +1847,12 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
init_cgroup_housekeeping(cgrp);
idr_init(&root->cgroup_idr);
- root->flags = opts->flags;
- if (opts->release_agent)
- strscpy(root->release_agent_path, opts->release_agent, PATH_MAX);
- if (opts->name)
- strscpy(root->name, opts->name, MAX_CGROUP_ROOT_NAMELEN);
- if (opts->cpuset_clone_children)
+ root->flags = ctx->flags;
+ if (ctx->release_agent)
+ strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
+ if (ctx->name)
+ strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN);
+ if (ctx->cpuset_clone_children)
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}
@@ -1972,57 +1957,50 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
return ret;
}
-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
- struct cgroup_root *root, unsigned long magic,
- struct cgroup_namespace *ns)
+int cgroup_do_get_tree(struct cgroup_fs_context *ctx)
{
- struct dentry *dentry;
- bool new_sb;
+ int ret;
+
+ ctx->kfc.root = ctx->root->kf_root;
- dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
+ ret = kernfs_get_tree(&ctx->kfc);
+ if (ret < 0)
+ goto out_cgrp;
/*
* In non-init cgroup namespace, instead of root cgroup's dentry,
* we return the dentry corresponding to the cgroupns->root_cgrp.
*/
- if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+ if (ctx->ns != &init_cgroup_ns) {
struct dentry *nsdentry;
struct cgroup *cgrp;
mutex_lock(&cgroup_mutex);
spin_lock_irq(&css_set_lock);
- cgrp = cset_cgroup_from_root(ns->root_cset, root);
+ cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);
spin_unlock_irq(&css_set_lock);
mutex_unlock(&cgroup_mutex);
- nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
- dput(dentry);
- dentry = nsdentry;
+ nsdentry = kernfs_node_dentry(cgrp->kn, ctx->kfc.fc.root->d_sb);
+ dput(ctx->kfc.fc.root);
+ ctx->kfc.fc.root = nsdentry;
}
- if (IS_ERR(dentry) || !new_sb)
- cgroup_put(&root->cgrp);
+ ret = 0;
+ if (ctx->kfc.new_sb_created)
+ goto out_cgrp;
+ apply_cgroup_root_flags(ctx->flags);
+ return 0;
- return dentry;
+out_cgrp:
+ return ret;
}
-static struct dentry *cgroup_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data, size_t data_size)
+static int cgroup_get_tree(struct fs_context *fc)
{
- struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
- struct dentry *dentry;
- int ret;
-
- get_cgroup_ns(ns);
-
- /* Check if the caller has permission to mount. */
- if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
- put_cgroup_ns(ns);
- return ERR_PTR(-EPERM);
- }
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);
/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -2031,29 +2009,81 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
if (!use_task_css_set_links)
cgroup_enable_task_cg_lists();
- if (fs_type == &cgroup2_fs_type) {
- unsigned int root_flags;
-
- ret = parse_cgroup_root_flags(data, &root_flags);
- if (ret) {
- put_cgroup_ns(ns);
- return ERR_PTR(ret);
- }
+ switch (ctx->version) {
+ case 1:
+ return cgroup1_get_tree(ctx);
+ case 2:
cgrp_dfl_visible = true;
cgroup_get_live(&cgrp_dfl_root.cgrp);
- dentry = cgroup_do_mount(&cgroup2_fs_type, flags, &cgrp_dfl_root,
- CGROUP2_SUPER_MAGIC, ns);
- if (!IS_ERR(dentry))
- apply_cgroup_root_flags(root_flags);
- } else {
- dentry = cgroup1_mount(&cgroup_fs_type, flags, data,
- CGROUP_SUPER_MAGIC, ns);
+ ctx->root = &cgrp_dfl_root;
+ return cgroup_do_get_tree(ctx);
+
+ default:
+ BUG();
}
+}
+
+static int cgroup_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);
+
+ if (ctx->version == 1)
+ return cgroup1_parse_option(ctx, opt);
+
+ return cgroup2_parse_option(ctx, opt);
+}
+
+static int cgroup_validate(struct fs_context *fc)
+{
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);
+
+ if (ctx->version == 1)
+ return cgroup1_validate(ctx);
+ return 0;
+}
+
+/*
+ * Destroy a cgroup filesystem context.
+ */
+static void cgroup_fs_context_free(struct fs_context *fc)
+{
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);
- put_cgroup_ns(ns);
- return dentry;
+ kfree(ctx->name);
+ kfree(ctx->release_agent);
+ if (ctx->root)
+ cgroup_put(&ctx->root->cgrp);
+ put_cgroup_ns(ctx->ns);
+ kernfs_free_fs_context(&ctx->kfc);
+}
+
+static const struct fs_context_operations cgroup_fs_context_ops = {
+ .free = cgroup_fs_context_free,
+ .parse_option = cgroup_parse_option,
+ .validate = cgroup_validate,
+ .get_tree = cgroup_get_tree,
+};
+
+/*
+ * Initialise the cgroup filesystem creation/reconfiguration context. Notably,
+ * we select the namespace we're going to use.
+ */
+static int cgroup_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct cgroup_fs_context *ctx = container_of(fc, struct cgroup_fs_context, kfc.fc);
+ struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ctx->ns = get_cgroup_ns(ns);
+ ctx->version = (fc->fs_type == &cgroup2_fs_type) ? 2 : 1;
+ ctx->kfc.magic = (ctx->version == 2) ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC;
+ ctx->kfc.fc.ops = &cgroup_fs_context_ops;
+ return 0;
}
static void cgroup_kill_sb(struct super_block *sb)
@@ -2078,17 +2108,19 @@ static void cgroup_kill_sb(struct super_block *sb)
}
struct file_system_type cgroup_fs_type = {
- .name = "cgroup",
- .mount = cgroup_mount,
- .kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "cgroup",
+ .fs_context_size = sizeof(struct cgroup_fs_context),
+ .init_fs_context = cgroup_init_fs_context,
+ .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};
static struct file_system_type cgroup2_fs_type = {
- .name = "cgroup2",
- .mount = cgroup_mount,
- .kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "cgroup2",
+ .fs_context_size = sizeof(struct cgroup_fs_context),
+ .init_fs_context = cgroup_init_fs_context,
+ .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};
int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
@@ -5132,7 +5164,7 @@ int cgroup_rmdir(struct kernfs_node *kn)
static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
.show_options = cgroup_show_options,
- .remount_fs = cgroup_remount,
+ .reconfigure = cgroup_reconfigure,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.show_path = cgroup_show_path,
@@ -5199,11 +5231,12 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
*/
int __init cgroup_init_early(void)
{
- static struct cgroup_sb_opts __initdata opts;
+ static struct cgroup_fs_context __initdata ctx;
struct cgroup_subsys *ss;
int i;
- init_cgroup_root(&cgrp_dfl_root, &opts);
+ ctx.root = &cgrp_dfl_root;
+ init_cgroup_root(&ctx);
cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;
RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
Convert hugetlbfs to use the fs_context during mount.
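As a rough illustration of the split this conversion makes between parsing one option at a time and a later validation pass, here is a minimal userspace sketch. It is not the kernel code: struct demo_ctx and the demo_* helpers are hypothetical names, sizes are plain byte counts without K/M/G suffixes or percentages, and the default page size is an assumed 2 MiB. The point it shows is that the per-option hook only records values, and anything that depends on more than one option (here, bytes to pages) is computed once all options have been seen.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed default for this sketch only; not taken from the patch. */
#define DEMO_DEFAULT_PAGE_SIZE (2UL * 1024 * 1024)

/* Hypothetical userspace stand-in for a filesystem's private context. */
struct demo_ctx {
	unsigned long page_size;	/* bytes */
	unsigned long long max_size;	/* bytes, meaningful if have_max */
	unsigned long long min_size;	/* bytes, meaningful if have_min */
	int have_max;
	int have_min;
	long max_pages;			/* -1 means no limit */
	long min_pages;			/* -1 means no minimum */
};

/* Set the defaults once, as an init_fs_context hook would. */
static void demo_init(struct demo_ctx *ctx)
{
	memset(ctx, 0, sizeof(*ctx));
	ctx->page_size = DEMO_DEFAULT_PAGE_SIZE;
	ctx->max_pages = -1;
	ctx->min_pages = -1;
}

/* Accept only a plain decimal number; reject anything else. */
static int demo_parse_number(const char *val, unsigned long long *out)
{
	char *end;

	if (*val < '0' || *val > '9')
		return -1;
	*out = strtoull(val, &end, 10);
	return *end == '\0' ? 0 : -1;
}

/* Handle exactly one "key=value" option; only record what was asked for. */
static int demo_parse_option(struct demo_ctx *ctx, const char *opt)
{
	unsigned long long v;

	if (!strncmp(opt, "size=", 5)) {
		if (demo_parse_number(opt + 5, &v))
			return -1;
		ctx->max_size = v;
		ctx->have_max = 1;
		return 0;
	}
	if (!strncmp(opt, "min_size=", 9)) {
		if (demo_parse_number(opt + 9, &v))
			return -1;
		ctx->min_size = v;
		ctx->have_min = 1;
		return 0;
	}
	if (!strncmp(opt, "pagesize=", 9)) {
		if (demo_parse_number(opt + 9, &v) || v == 0)
			return -1;
		ctx->page_size = (unsigned long)v;
		return 0;
	}
	return -1;	/* unknown option */
}

/* Cross-option work runs once, after every option has been recorded. */
static int demo_validate(struct demo_ctx *ctx)
{
	if (ctx->have_max)
		ctx->max_pages = (long)(ctx->max_size / ctx->page_size);
	if (ctx->have_min)
		ctx->min_pages = (long)(ctx->min_size / ctx->page_size);
	if (ctx->have_max && ctx->min_pages > ctx->max_pages)
		return -1;
	return 0;
}
```

A caller would run demo_init(), feed each option string to demo_parse_option(), and then call demo_validate() once. Because the conversion is deferred, "size=" followed by "pagesize=" gives the same page count as the reverse order.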
Signed-off-by: David Howells <[email protected]>
---
fs/hugetlbfs/inode.c | 330 ++++++++++++++++++++++++++++----------------------
1 file changed, 184 insertions(+), 146 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 76fb8eb2bea8..11056af43e66 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -45,11 +45,18 @@ const struct file_operations hugetlbfs_file_operations;
static const struct inode_operations hugetlbfs_dir_inode_operations;
static const struct inode_operations hugetlbfs_inode_operations;
-struct hugetlbfs_config {
+enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
+
+struct hugetlbfs_fs_context {
+ struct fs_context fc;
struct hstate *hstate;
+ unsigned long long max_size_opt;
+ unsigned long long min_size_opt;
long max_hpages;
long nr_inodes;
long min_hpages;
+ enum hugetlbfs_size_type max_val_type;
+ enum hugetlbfs_size_type min_val_type;
kuid_t uid;
kgid_t gid;
umode_t mode;
@@ -708,16 +715,16 @@ static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
}
static struct inode *hugetlbfs_get_root(struct super_block *sb,
- struct hugetlbfs_config *config)
+ struct hugetlbfs_fs_context *ctx)
{
struct inode *inode;
inode = new_inode(sb);
if (inode) {
inode->i_ino = get_next_ino();
- inode->i_mode = S_IFDIR | config->mode;
- inode->i_uid = config->uid;
- inode->i_gid = config->gid;
+ inode->i_mode = S_IFDIR | ctx->mode;
+ inode->i_uid = ctx->uid;
+ inode->i_gid = ctx->gid;
inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
inode->i_op = &hugetlbfs_dir_inode_operations;
inode->i_fop = &simple_dir_operations;
@@ -1081,8 +1088,6 @@ static const struct super_operations hugetlbfs_ops = {
.show_options = hugetlbfs_show_options,
};
-enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
-
/*
* Convert size option passed from command line to number of huge pages
* in the pool specified by hstate. Size option could be in bytes
@@ -1105,171 +1110,156 @@ hugetlbfs_size_to_hpages(struct hstate *h, unsigned long long size_opt,
return size_opt;
}
-static int
-hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig)
+/*
+ * Parse one mount option.
+ */
+static int hugetlbfs_parse_option(struct fs_context *fc, char *opt, size_t len)
{
- char *p, *rest;
+ struct hugetlbfs_fs_context *ctx = container_of(fc, struct hugetlbfs_fs_context, fc);
+ char *rest;
+ unsigned long ps;
substring_t args[MAX_OPT_ARGS];
- int option;
- unsigned long long max_size_opt = 0, min_size_opt = 0;
- enum hugetlbfs_size_type max_val_type = NO_SIZE, min_val_type = NO_SIZE;
-
- if (!options)
+ int token, option;
+
+ token = match_token(opt, tokens, args);
+ switch (token) {
+ case Opt_uid:
+ if (match_int(&args[0], &option))
+ goto bad_val;
+ ctx->uid = make_kuid(current_user_ns(), option);
+ if (!uid_valid(ctx->uid))
+ goto bad_val;
return 0;
- while ((p = strsep(&options, ",")) != NULL) {
- int token;
- if (!*p)
- continue;
+ case Opt_gid:
+ if (match_int(&args[0], &option))
+ goto bad_val;
+ ctx->gid = make_kgid(current_user_ns(), option);
+ if (!gid_valid(ctx->gid))
+ goto bad_val;
+ return 0;
- token = match_token(p, tokens, args);
- switch (token) {
- case Opt_uid:
- if (match_int(&args[0], &option))
- goto bad_val;
- pconfig->uid = make_kuid(current_user_ns(), option);
- if (!uid_valid(pconfig->uid))
- goto bad_val;
- break;
+ case Opt_mode:
+ if (match_octal(&args[0], &option))
+ goto bad_val;
+ ctx->mode = option & 01777U;
+ return 0;
- case Opt_gid:
- if (match_int(&args[0], &option))
- goto bad_val;
- pconfig->gid = make_kgid(current_user_ns(), option);
- if (!gid_valid(pconfig->gid))
- goto bad_val;
- break;
+ case Opt_size:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(*args[0].from))
+ goto bad_val;
+ ctx->max_size_opt = memparse(args[0].from, &rest);
+ ctx->max_val_type = SIZE_STD;
+ if (*rest == '%')
+ ctx->max_val_type = SIZE_PERCENT;
+ return 0;
- case Opt_mode:
- if (match_octal(&args[0], &option))
- goto bad_val;
- pconfig->mode = option & 01777U;
- break;
+ case Opt_nr_inodes:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(*args[0].from))
+ goto bad_val;
+ ctx->nr_inodes = memparse(args[0].from, &rest);
+ return 0;
- case Opt_size: {
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- max_size_opt = memparse(args[0].from, &rest);
- max_val_type = SIZE_STD;
- if (*rest == '%')
- max_val_type = SIZE_PERCENT;
- break;
+ case Opt_pagesize:
+ ps = memparse(args[0].from, &rest);
+ ctx->hstate = size_to_hstate(ps);
+ if (!ctx->hstate) {
+ pr_err("Unsupported page size %lu MB\n", ps >> 20);
+ return -EINVAL;
}
+ return 0;
- case Opt_nr_inodes:
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- pconfig->nr_inodes = memparse(args[0].from, &rest);
- break;
+ case Opt_min_size:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(*args[0].from))
+ goto bad_val;
+ ctx->min_size_opt = memparse(args[0].from, &rest);
+ ctx->min_val_type = SIZE_STD;
+ if (*rest == '%')
+ ctx->min_val_type = SIZE_PERCENT;
+ return 0;
- case Opt_pagesize: {
- unsigned long ps;
- ps = memparse(args[0].from, &rest);
- pconfig->hstate = size_to_hstate(ps);
- if (!pconfig->hstate) {
- pr_err("Unsupported page size %lu MB\n",
- ps >> 20);
- return -EINVAL;
- }
- break;
- }
+ default:
+ pr_err("Bad mount option: \"%s\"\n", opt);
+ return -EINVAL;
+ }
- case Opt_min_size: {
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- min_size_opt = memparse(args[0].from, &rest);
- min_val_type = SIZE_STD;
- if (*rest == '%')
- min_val_type = SIZE_PERCENT;
- break;
- }
+bad_val:
+ pr_err("Bad value '%s' for mount option '%s'\n", args[0].from, opt);
+ return -EINVAL;
+}
- default:
- pr_err("Bad mount option: \"%s\"\n", p);
- return -EINVAL;
- break;
- }
- }
+/*
+ * Validate the parsed options.
+ */
+static int hugetlbfs_validate(struct fs_context *fc)
+{
+ struct hugetlbfs_fs_context *ctx = container_of(fc, struct hugetlbfs_fs_context, fc);
/*
* Use huge page pool size (in hstate) to convert the size
* options to number of huge pages. If NO_SIZE, -1 is returned.
*/
- pconfig->max_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
- max_size_opt, max_val_type);
- pconfig->min_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
- min_size_opt, min_val_type);
+ ctx->max_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+ ctx->max_size_opt,
+ ctx->max_val_type);
+ ctx->min_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+ ctx->min_size_opt,
+ ctx->min_val_type);
/*
* If max_size was specified, then min_size must be smaller
*/
- if (max_val_type > NO_SIZE &&
- pconfig->min_hpages > pconfig->max_hpages) {
- pr_err("minimum size can not be greater than maximum size\n");
+ if (ctx->max_val_type > NO_SIZE &&
+ ctx->min_hpages > ctx->max_hpages) {
+ pr_err("Minimum size can not be greater than maximum size\n");
return -EINVAL;
}
return 0;
-
-bad_val:
- pr_err("Bad value '%s' for mount option '%s'\n", args[0].from, p);
- return -EINVAL;
}
static int
-hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
- int silent)
+hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
- int ret;
- struct hugetlbfs_config config;
+ struct hugetlbfs_fs_context *ctx =
+ container_of(fc, struct hugetlbfs_fs_context, fc);
struct hugetlbfs_sb_info *sbinfo;
- config.max_hpages = -1; /* No limit on size by default */
- config.nr_inodes = -1; /* No limit on number of inodes by default */
- config.uid = current_fsuid();
- config.gid = current_fsgid();
- config.mode = 0755;
- config.hstate = &default_hstate;
- config.min_hpages = -1; /* No default minimum size */
- ret = hugetlbfs_parse_options(data, &config);
- if (ret)
- return ret;
-
sbinfo = kmalloc(sizeof(struct hugetlbfs_sb_info), GFP_KERNEL);
if (!sbinfo)
return -ENOMEM;
sb->s_fs_info = sbinfo;
- sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
- sbinfo->max_inodes = config.nr_inodes;
- sbinfo->free_inodes = config.nr_inodes;
- sbinfo->spool = NULL;
- sbinfo->uid = config.uid;
- sbinfo->gid = config.gid;
- sbinfo->mode = config.mode;
+ sbinfo->hstate = ctx->hstate;
+ sbinfo->max_inodes = ctx->nr_inodes;
+ sbinfo->free_inodes = ctx->nr_inodes;
+ sbinfo->spool = NULL;
+ sbinfo->uid = ctx->uid;
+ sbinfo->gid = ctx->gid;
+ sbinfo->mode = ctx->mode;
/*
* Allocate and initialize subpool if maximum or minimum size is
* specified. Any needed reservations (for minimim size) are taken
* taken when the subpool is created.
*/
- if (config.max_hpages != -1 || config.min_hpages != -1) {
- sbinfo->spool = hugepage_new_subpool(config.hstate,
- config.max_hpages,
- config.min_hpages);
+ if (ctx->max_hpages != -1 || ctx->min_hpages != -1) {
+ sbinfo->spool = hugepage_new_subpool(ctx->hstate,
+ ctx->max_hpages,
+ ctx->min_hpages);
if (!sbinfo->spool)
goto out_free;
}
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = huge_page_size(config.hstate);
- sb->s_blocksize_bits = huge_page_shift(config.hstate);
+ sb->s_blocksize = huge_page_size(ctx->hstate);
+ sb->s_blocksize_bits = huge_page_shift(ctx->hstate);
sb->s_magic = HUGETLBFS_MAGIC;
sb->s_op = &hugetlbfs_ops;
sb->s_time_gran = 1;
- sb->s_root = d_make_root(hugetlbfs_get_root(sb, &config));
+ sb->s_root = d_make_root(hugetlbfs_get_root(sb, ctx));
if (!sb->s_root)
goto out_free;
return 0;
@@ -1279,17 +1269,39 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
return -ENOMEM;
}
-static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data, size_t data_size)
+static int hugetlbfs_get_tree(struct fs_context *fc)
+{
+ return vfs_get_super(fc, vfs_get_independent_super, hugetlbfs_fill_super);
+}
+
+static const struct fs_context_operations hugetlbfs_fs_context_ops = {
+ .parse_option = hugetlbfs_parse_option,
+ .validate = hugetlbfs_validate,
+ .get_tree = hugetlbfs_get_tree,
+};
+
+static int hugetlbfs_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
{
- return mount_nodev(fs_type, flags, data, data_size,
- hugetlbfs_fill_super);
+ struct hugetlbfs_fs_context *ctx = container_of(fc, struct hugetlbfs_fs_context, fc);
+
+ ctx->max_hpages = -1; /* No limit on size by default */
+ ctx->nr_inodes = -1; /* No limit on number of inodes by default */
+ ctx->uid = current_fsuid();
+ ctx->gid = current_fsgid();
+ ctx->mode = 0755;
+ ctx->hstate = &default_hstate;
+ ctx->min_hpages = -1; /* No default minimum size */
+ ctx->max_val_type = NO_SIZE;
+ ctx->min_val_type = NO_SIZE;
+ ctx->fc.ops = &hugetlbfs_fs_context_ops;
+ return 0;
}
static struct file_system_type hugetlbfs_fs_type = {
- .name = "hugetlbfs",
- .mount = hugetlbfs_mount,
- .kill_sb = kill_litter_super,
+ .name = "hugetlbfs",
+ .fs_context_size = sizeof(struct hugetlbfs_fs_context),
+ .init_fs_context = hugetlbfs_init_fs_context,
+ .kill_sb = kill_litter_super,
};
static struct vfsmount *hugetlbfs_vfsmount[HUGE_MAX_HSTATE];
@@ -1396,8 +1408,47 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
return file;
}
+static struct vfsmount *__init mount_one_hugetlbfs(struct hstate *h)
+{
+ struct hugetlbfs_fs_context *ctx;
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&hugetlbfs_fs_type, NULL, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc)) {
+ ret = PTR_ERR(fc);
+ goto err;
+ }
+
+ ctx = container_of(fc, struct hugetlbfs_fs_context, fc);
+ ctx->hstate = h;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto err_fc;
+ }
+
+ put_fs_context(fc);
+ return mnt;
+
+err_fc:
+ put_fs_context(fc);
+err:
+ pr_err("Cannot mount internal hugetlbfs for page size %uK\n",
+ 1U << (h->order + PAGE_SHIFT - 10));
+ return ERR_PTR(ret);
+}
+
static int __init init_hugetlbfs_fs(void)
{
+ struct vfsmount *mnt;
struct hstate *h;
int error;
int i;
@@ -1420,25 +1471,12 @@ static int __init init_hugetlbfs_fs(void)
i = 0;
for_each_hstate(h) {
- char buf[50];
- unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
- int n;
-
- n = snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
- hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
- buf, n + 1);
-
- if (IS_ERR(hugetlbfs_vfsmount[i])) {
- pr_err("Cannot mount internal hugetlbfs for "
- "page size %uK", ps_kb);
- error = PTR_ERR(hugetlbfs_vfsmount[i]);
- hugetlbfs_vfsmount[i] = NULL;
- }
+ mnt = mount_one_hugetlbfs(h);
+ if (IS_ERR(mnt) && i == 0) {
+ error = PTR_ERR(mnt);
+ goto out;
+ }
+ hugetlbfs_vfsmount[i] = mnt;
i++;
}
- /* Non default hstates are optional */
- if (!IS_ERR_OR_NULL(hugetlbfs_vfsmount[default_hstate_idx]))
- return 0;
+ return 0;
out:
kmem_cache_destroy(hugetlbfs_inode_cachep);
Convert the mqueue filesystem to use the filesystem context stuff.
Notes:
(1) The relevant ipc namespace is selected when the context is
initialised (and it defaults to the current task's ipc namespace).
The caller can override this before calling vfs_get_tree().
(2) Rather than simply calling kern_mount_data(), mq_init_ns() and
mq_internal_mount() create a context, adjust it and then do the rest
of the mount procedure.
(3) The lazy mqueue mounting on creation of a new namespace is retained
from a previous patch, but the avoidance of sget() if no superblock
yet exists is reverted and the superblock is again keyed on the
namespace pointer.
Yes, there was a performance gain in not searching the superblock
hash, but the cost of that search is only paid once per ipc namespace,
and only if someone uses mqueue within that namespace, so I'm not sure
it's worth it, especially as calling sget() allows avoidance of recursion.
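To make notes (1) and (3) a little more concrete, here is a minimal userspace sketch of the idea, not the kernel implementation. All demo_* names are hypothetical stand-ins: the context defaults to a "current" namespace when it is initialised, the caller may swap in another namespace before asking for the tree, and the superblock is looked up by namespace pointer so that two mounts for the same namespace share one superblock. The real code also takes and drops namespace references, which this sketch leaves out.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct demo_ns { int id; };
struct demo_sb { const struct demo_ns *key; int refs; };

/* Generic part of the context, shared by every filesystem. */
struct demo_fs_context {
	struct demo_sb *root;		/* filled in by demo_get_tree() */
};

/* Filesystem-private context that embeds the generic part. */
struct demo_mq_context {
	struct demo_fs_context fc;
	const struct demo_ns *ns;
};

/* Recover the outer structure from a pointer to its embedded member. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define DEMO_MAX_SB 8
static struct demo_sb demo_sbs[DEMO_MAX_SB];
static size_t demo_nr_sbs;

/* Plays the role of the calling task's namespace. */
static struct demo_ns demo_current_ns = { 0 };

/* Default to the "current" namespace when the context is initialised. */
static void demo_init_context(struct demo_fs_context *fc)
{
	struct demo_mq_context *ctx =
		demo_container_of(fc, struct demo_mq_context, fc);

	fc->root = NULL;
	ctx->ns = &demo_current_ns;
}

/* Find or create the superblock keyed on the namespace pointer. */
static int demo_get_tree(struct demo_fs_context *fc)
{
	struct demo_mq_context *ctx =
		demo_container_of(fc, struct demo_mq_context, fc);
	size_t i;

	for (i = 0; i < demo_nr_sbs; i++) {
		if (demo_sbs[i].key == ctx->ns) {
			demo_sbs[i].refs++;
			fc->root = &demo_sbs[i];
			return 0;
		}
	}
	if (demo_nr_sbs == DEMO_MAX_SB)
		return -1;
	demo_sbs[demo_nr_sbs].key = ctx->ns;
	demo_sbs[demo_nr_sbs].refs = 1;
	fc->root = &demo_sbs[demo_nr_sbs++];
	return 0;
}
```

An in-kernel style caller would call demo_init_context(), overwrite ctx->ns if it wants a namespace other than the default, and only then call demo_get_tree(); a second context pointing at the same namespace ends up with the same superblock.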
Signed-off-by: David Howells <[email protected]>
---
ipc/mqueue.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 94 insertions(+), 22 deletions(-)
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 910c3c7532e6..2f2e7d73b13d 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -18,6 +18,7 @@
#include <linux/pagemap.h>
#include <linux/file.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/namei.h>
#include <linux/sysctl.h>
#include <linux/poll.h>
@@ -42,6 +43,11 @@
#include <net/sock.h>
#include "util.h"
+struct mqueue_fs_context {
+ struct fs_context fc;
+ struct ipc_namespace *ipc_ns;
+};
+
#define MQUEUE_MAGIC 0x19800202
#define DIRENT_SIZE 20
#define FILENT_SIZE 80
@@ -87,9 +93,11 @@ struct mqueue_inode_info {
unsigned long qsize; /* size of queue in memory (sum of all msgs) */
};
+static struct file_system_type mqueue_fs_type;
static const struct inode_operations mqueue_dir_inode_operations;
static const struct file_operations mqueue_file_operations;
static const struct super_operations mqueue_super_ops;
+static const struct fs_context_operations mqueue_fs_context_ops;
static void remove_notification(struct mqueue_inode_info *info);
static struct kmem_cache *mqueue_inode_cachep;
@@ -322,7 +330,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
return ERR_PTR(ret);
}
-static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
+static int mqueue_fill_super(struct super_block *sb, struct fs_context *fc)
{
struct inode *inode;
struct ipc_namespace *ns = sb->s_fs_info;
@@ -343,19 +351,77 @@ static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_siz
return 0;
}
-static struct dentry *mqueue_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, size_t data_size)
+static int mqueue_get_tree(struct fs_context *fc)
{
- struct ipc_namespace *ns;
- if (flags & SB_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = current->nsproxy->ipc_ns;
+ struct mqueue_fs_context *ctx = container_of(fc, struct mqueue_fs_context, fc);
+
+ /* As a shortcut, if the namespace already has a superblock created,
+ * use the root from that directly rather than invoking sget() again.
+ */
+ spin_lock(&mq_lock);
+ if (ctx->ipc_ns->mq_mnt) {
+ fc->root = dget(ctx->ipc_ns->mq_mnt->mnt_sb->s_root);
+ atomic_inc(&fc->root->d_sb->s_active);
+ }
+ spin_unlock(&mq_lock);
+ if (fc->root) {
+ down_write(&fc->root->d_sb->s_umount);
+ return 0;
}
- return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
- mqueue_fill_super);
+
+ ctx->fc.s_fs_info = ctx->ipc_ns;
+ return vfs_get_super(fc, vfs_get_keyed_super, mqueue_fill_super);
+}
+
+static void mqueue_fs_context_free(struct fs_context *fc)
+{
+ struct mqueue_fs_context *ctx = container_of(fc, struct mqueue_fs_context, fc);
+
+ if (ctx->ipc_ns)
+ put_ipc_ns(ctx->ipc_ns);
+}
+
+static int mqueue_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct mqueue_fs_context *ctx = container_of(fc, struct mqueue_fs_context, fc);
+
+ ctx->ipc_ns = get_ipc_ns(current->nsproxy->ipc_ns);
+ ctx->fc.ops = &mqueue_fs_context_ops;
+ return 0;
+}
+
+static struct vfsmount *mq_create_mount(struct ipc_namespace *ns)
+{
+ struct mqueue_fs_context *ctx;
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&mqueue_fs_type, NULL, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ ctx = container_of(fc, struct mqueue_fs_context, fc);
+ put_ipc_ns(ctx->ipc_ns);
+ ctx->ipc_ns = get_ipc_ns(ns);
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto err_fc;
+ }
+
+ put_fs_context(fc);
+ return mnt;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
}
static void init_once(void *foo)
@@ -1521,15 +1587,23 @@ static const struct super_operations mqueue_super_ops = {
.statfs = simple_statfs,
};
+static const struct fs_context_operations mqueue_fs_context_ops = {
+ .free = mqueue_fs_context_free,
+ .get_tree = mqueue_get_tree,
+};
+
static struct file_system_type mqueue_fs_type = {
- .name = "mqueue",
- .mount = mqueue_mount,
- .kill_sb = kill_litter_super,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "mqueue",
+ .fs_context_size = sizeof(struct mqueue_fs_context),
+ .init_fs_context = mqueue_init_fs_context,
+ .kill_sb = kill_litter_super,
+ .fs_flags = FS_USERNS_MOUNT,
};
int mq_init_ns(struct ipc_namespace *ns)
{
+ struct vfsmount *m;
+
ns->mq_queues_count = 0;
ns->mq_queues_max = DFLT_QUEUESMAX;
ns->mq_msg_max = DFLT_MSGMAX;
@@ -1537,12 +1611,10 @@ int mq_init_ns(struct ipc_namespace *ns)
ns->mq_msg_default = DFLT_MSG;
ns->mq_msgsize_default = DFLT_MSGSIZE;
- ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns, 0);
- if (IS_ERR(ns->mq_mnt)) {
- int err = PTR_ERR(ns->mq_mnt);
- ns->mq_mnt = NULL;
- return err;
- }
+ m = mq_create_mount(ns);
+ if (IS_ERR(m))
+ return PTR_ERR(m);
+ ns->mq_mnt = m;
return 0;
}
Move proc_fill_super() to fs/proc/root.c as that's where the other
superblock stuff is.
Signed-off-by: David Howells <[email protected]>
---
fs/proc/inode.c | 49 +------------------------------------------------
fs/proc/internal.h | 4 +---
fs/proc/root.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 49 insertions(+), 52 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index df65431c00be..0b13cf6eb6d7 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -24,7 +24,6 @@
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/mount.h>
-#include <linux/magic.h>
#include <linux/uaccess.h>
@@ -123,7 +122,7 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
return 0;
}
-static const struct super_operations proc_sops = {
+const struct super_operations proc_sops = {
.alloc_inode = proc_alloc_inode,
.destroy_inode = proc_destroy_inode,
.drop_inode = generic_delete_inode,
@@ -489,49 +488,3 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
pde_put(de);
return inode;
}
-
-int proc_fill_super(struct super_block *s, void *data, size_t data_size,
- int silent)
-{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
- struct inode *root_inode;
- int ret;
-
- if (!proc_parse_options(data, ns))
- return -EINVAL;
-
- /* User space would break if executables or devices appear on proc */
- s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
- s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
- s->s_blocksize = 1024;
- s->s_blocksize_bits = 10;
- s->s_magic = PROC_SUPER_MAGIC;
- s->s_op = &proc_sops;
- s->s_time_gran = 1;
-
- /*
- * procfs isn't actually a stacking filesystem; however, there is
- * too much magic going on inside it to permit stacking things on
- * top of it
- */
- s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
- pde_get(&proc_root);
- root_inode = proc_get_inode(s, &proc_root);
- if (!root_inode) {
- pr_err("proc_fill_super: get root inode failed\n");
- return -ENOMEM;
- }
-
- s->s_root = d_make_root(root_inode);
- if (!s->s_root) {
- pr_err("proc_fill_super: allocate dentry failed\n");
- return -ENOMEM;
- }
-
- ret = proc_setup_self(s);
- if (ret) {
- return ret;
- }
- return proc_setup_thread_self(s);
-}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 3362732fffa3..3182e1b636d3 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -189,13 +189,12 @@ struct pde_opener {
struct completion *c;
} __randomize_layout;
extern const struct inode_operations proc_link_inode_operations;
-
extern const struct inode_operations proc_pid_link_inode_operations;
+extern const struct super_operations proc_sops;
void proc_init_kmemcache(void);
void set_proc_pid_nlink(void);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
-extern int proc_fill_super(struct super_block *, void *, size_t, int);
extern void proc_entry_rundown(struct proc_dir_entry *);
/*
@@ -253,7 +252,6 @@ static inline void proc_tty_init(void) {}
* root.c
*/
extern struct proc_dir_entry proc_root;
-extern int proc_parse_options(char *options, struct pid_namespace *pid);
extern void proc_self_init(void);
extern int proc_remount(struct super_block *, int *, char *, size_t);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 99ce06c4e1a2..2fbc177f37a8 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,6 +23,7 @@
#include <linux/pid_namespace.h>
#include <linux/parser.h>
#include <linux/cred.h>
+#include <linux/magic.h>
#include "internal.h"
@@ -36,7 +37,7 @@ static const match_table_t tokens = {
{Opt_err, NULL},
};
-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_options(char *options, struct pid_namespace *pid)
{
char *p;
substring_t args[MAX_OPT_ARGS];
@@ -78,6 +79,51 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
return 1;
}
+static int proc_fill_super(struct super_block *s, void *data, size_t data_size, int silent)
+{
+ struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct inode *root_inode;
+ int ret;
+
+ if (!proc_parse_options(data, ns))
+ return -EINVAL;
+
+ /* User space would break if executables or devices appear on proc */
+ s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
+ s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
+ s->s_blocksize = 1024;
+ s->s_blocksize_bits = 10;
+ s->s_magic = PROC_SUPER_MAGIC;
+ s->s_op = &proc_sops;
+ s->s_time_gran = 1;
+
+ /*
+ * procfs isn't actually a stacking filesystem; however, there is
+ * too much magic going on inside it to permit stacking things on
+ * top of it
+ */
+ s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
+
+ pde_get(&proc_root);
+ root_inode = proc_get_inode(s, &proc_root);
+ if (!root_inode) {
+ pr_err("proc_fill_super: get root inode failed\n");
+ return -ENOMEM;
+ }
+
+ s->s_root = d_make_root(root_inode);
+ if (!s->s_root) {
+ pr_err("proc_fill_super: allocate dentry failed\n");
+ return -ENOMEM;
+ }
+
+ ret = proc_setup_self(s);
+ if (ret) {
+ return ret;
+ }
+ return proc_setup_thread_self(s);
+}
+
int proc_remount(struct super_block *sb, int *flags,
char *data, size_t data_size)
{
Implement a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.
The mounting procedure then becomes:
(1) Allocate a new fs_context.
(2) Configure the context.
(3) Create superblock.
(4) Mount the superblock any number of times.
(5) Destroy the context.
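As a rough illustration of the ordering above, here is a minimal userspace
sketch. It is a model only, not kernel code: every model_* name is a
hypothetical stand-in for the corresponding step (in the patch the steps are
carried out by vfs_new_fs_context(), vfs_set_fs_source(), vfs_get_tree(),
vfs_create_mount() and put_fs_context()), and the superblock is reduced to a
flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for struct fs_context; NOT the kernel structure. */
struct model_fc {
	char *source;        /* source/device name, set while configuring */
	unsigned int flags;  /* models fc->sb_flags */
	int have_root;       /* models fc->root having been set by step (3) */
	int nr_mounts;       /* number of mounts created from this context */
};

/* (1) Allocate a new context. */
struct model_fc *model_new_context(unsigned int flags)
{
	struct model_fc *fc = calloc(1, sizeof(*fc));

	if (fc)
		fc->flags = flags;
	return fc;
}

/* (2) Configure the context; the source may only be set once. */
int model_set_source(struct model_fc *fc, const char *source)
{
	size_t len;

	assert(fc != NULL && source != NULL);
	if (fc->source)
		return -EINVAL;
	len = strlen(source) + 1;
	fc->source = malloc(len);
	if (!fc->source)
		return -ENOMEM;
	memcpy(fc->source, source, len);
	return 0;
}

/* (3) Create the superblock (modelled as just setting a flag). */
int model_get_tree(struct model_fc *fc)
{
	assert(fc != NULL);
	fc->have_root = 1;
	return 0;
}

/* (4) Mount the superblock; refused if step (3) has not been done. */
int model_create_mount(struct model_fc *fc)
{
	assert(fc != NULL);
	if (!fc->have_root)
		return -EINVAL;
	fc->nr_mounts++;
	return 0;
}

/* (5) Destroy the context. */
void model_put_context(struct model_fc *fc)
{
	if (!fc)
		return;
	free(fc->source);
	free(fc);
}
```

A caller would run the functions in the numbered order; step (4) may be
repeated before step (5), which is what "any number of times" refers to.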
Rather than calling fs_type->mount(), an fs_context struct is created and
fs_type->init_fs_context() is called to set it up.
fs_type->fs_context_size specifies how much space to allocate for the
context. The fs_context struct is placed at the beginning of the
allocation and any extra space is for the filesystem's use.
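The layout described above can be sketched in userspace as follows. This is
an illustrative model under assumptions: struct model_fs_context stands in
for the real struct fs_context, "myfs" and its timeout field are invented for
the example, and the model returns NULL for an undersized request where the
patch uses BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the generic context; the real struct has more fields. */
struct model_fs_context {
	unsigned int sb_flags;
};

/* Hypothetical filesystem context: generic part first, then private data. */
struct myfs_context {
	struct model_fs_context fc; /* placed at the beginning */
	int timeout;                /* example filesystem-specific option */
};

/* Userspace equivalent of the kernel's container_of(). */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Models the core allocating fs_context_size zeroed bytes. */
struct model_fs_context *model_alloc_context(size_t fs_context_size)
{
	if (fs_context_size < sizeof(struct model_fs_context))
		return NULL;
	return calloc(1, fs_context_size);
}

/* Models a filesystem operation reaching its private data. */
void myfs_set_timeout(struct model_fs_context *fc, int timeout)
{
	struct myfs_context *ctx;

	assert(fc != NULL);
	ctx = model_container_of(fc, struct myfs_context, fc);
	ctx->timeout = timeout;
}

int myfs_get_timeout(struct model_fs_context *fc)
{
	assert(fc != NULL);
	return model_container_of(fc, struct myfs_context, fc)->timeout;
}
```

Because the generic struct is the first member, the core can allocate and
free the whole object knowing only its size, while the filesystem recovers
its own type from the pointer it is handed.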
->init_fs_context() must install a set of operations that provide
freeing, duplication, option parsing, binary data parsing, validation,
mounting and superblock filling.
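A minimal userspace sketch of that arrangement, assuming a hypothetical
filesystem "myfs" with a single "debug" option. The model_* types are
stand-ins rather than the kernel's definitions, and only two of the
operations are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct model_ctx;

/* Subset of the operations listed above; the real table has more. */
struct model_ctx_ops {
	int (*parse_option)(struct model_ctx *fc, const char *opt, size_t len);
	int (*validate)(struct model_ctx *fc);
};

struct model_ctx {
	const struct model_ctx_ops *ops;
	int debug; /* example option owned by the hypothetical filesystem */
};

/* Hypothetical filesystem option parser: accepts only "debug". */
static int myfs_parse_option(struct model_ctx *fc, const char *opt, size_t len)
{
	if (len == 5 && memcmp(opt, "debug", 5) == 0) {
		fc->debug = 1;
		return 0;
	}
	return -EINVAL;
}

static const struct model_ctx_ops myfs_ctx_ops = {
	.parse_option = myfs_parse_option,
	/* .validate is left NULL; the caller checks before using it */
};

/* Models ->init_fs_context(): its job here is to install the ops table. */
int myfs_init_context(struct model_ctx *fc)
{
	assert(fc != NULL);
	fc->ops = &myfs_ctx_ops;
	return 0;
}

/* Models the core: call the operation if present, else reject the option. */
int model_parse_option(struct model_ctx *fc, const char *opt, size_t len)
{
	assert(fc != NULL);
	if (fc->ops && fc->ops->parse_option)
		return fc->ops->parse_option(fc, opt, len);
	return -EINVAL;
}
```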
Legacy filesystems are supported by the provision of a set of legacy
fs_context operations that build up a list of mount options and then invoke
fs_type->mount() from within the fs_context ->get_tree() operation. This
allows all filesystems to be accessed using fs_context.
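The option-list accumulation done by the legacy wrapper can be modelled in
userspace roughly as below. This is a sketch under assumptions:
MODEL_PAGE_SIZE stands in for PAGE_SIZE, the buffer is a fixed array rather
than a kmalloc()'d page, and the bounds check is written in additive form so
that it cannot wrap around. As in the patch, each option is prefixed with a
comma and options that themselves contain a comma are rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096 /* stands in for PAGE_SIZE */

/* Stand-in for the legacy context's data page and its fill level. */
struct model_legacy_ctx {
	char data[MODEL_PAGE_SIZE];
	size_t size;
};

/* Append ",<opt>" to the accumulated list, keeping it NUL-terminated. */
int model_legacy_parse_option(struct model_legacy_ctx *ctx,
			      const char *opt, size_t len)
{
	assert(ctx != NULL && opt != NULL);

	/* Room is needed for the comma, the option and the trailing NUL. */
	if (ctx->size + len + 2 > MODEL_PAGE_SIZE)
		return -EINVAL;
	/* A comma inside an option would corrupt the list. */
	if (memchr(opt, ',', len) != NULL)
		return -EINVAL;

	ctx->data[ctx->size++] = ',';
	memcpy(ctx->data + ctx->size, opt, len);
	ctx->size += len;
	ctx->data[ctx->size] = '\0';
	return 0;
}
```

The resulting string is what would later be handed to fs_type->mount() as
the mount data.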
It should be noted that, whilst this patch adds a lot of lines of code,
there is quite a bit of duplication with existing code that can be
eliminated should all filesystems be converted over.
Signed-off-by: David Howells <[email protected]>
---
fs/Makefile | 3
fs/fs_context.c | 593 ++++++++++++++++++++++++++++++++++++++++++++
fs/internal.h | 3
fs/libfs.c | 17 +
fs/namespace.c | 332 ++++++++++++++++---------
fs/super.c | 309 ++++++++++++++++++++++-
include/linux/fs.h | 14 +
include/linux/fs_context.h | 31 ++
include/linux/mount.h | 2
9 files changed, 1167 insertions(+), 137 deletions(-)
create mode 100644 fs/fs_context.c
diff --git a/fs/Makefile b/fs/Makefile
index c9375fd2c8c4..6f2dae3c32da 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -12,7 +12,8 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
- stack.o fs_struct.o statfs.o fs_pin.o nsfs.o
+ stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
+ fs_context.o
ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_context.c b/fs/fs_context.c
new file mode 100644
index 000000000000..0e3f561a8219
--- /dev/null
+++ b/fs/fs_context.c
@@ -0,0 +1,593 @@
+/* Provide a way to create a superblock configuration context within the kernel
+ * that allows a superblock to be set up prior to mounting.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/fs_context.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/nsproxy.h>
+#include <linux/slab.h>
+#include <linux/magic.h>
+#include <linux/security.h>
+#include <linux/parser.h>
+#include <linux/mnt_namespace.h>
+#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
+#include <net/net_namespace.h>
+#include "mount.h"
+
+enum legacy_fs_param {
+ LEGACY_FS_UNSET_PARAMS,
+ LEGACY_FS_NO_PARAMS,
+ LEGACY_FS_MONOLITHIC_PARAMS,
+ LEGACY_FS_INDIVIDUAL_PARAMS,
+ LEGACY_FS_MAGIC_PARAMS,
+};
+
+struct legacy_fs_context {
+ struct fs_context fc;
+ char *legacy_data; /* Data page for legacy filesystems */
+ char *secdata;
+ size_t data_size;
+ enum legacy_fs_param param_type;
+};
+
+static const struct fs_context_operations legacy_fs_context_ops;
+
+static const match_table_t common_set_sb_flag = {
+ { SB_DIRSYNC, "dirsync" },
+ { SB_LAZYTIME, "lazytime" },
+ { SB_MANDLOCK, "mand" },
+ { SB_POSIXACL, "posixacl" },
+ { SB_RDONLY, "ro" },
+ { SB_SYNCHRONOUS, "sync" },
+ { },
+};
+
+static const match_table_t common_clear_sb_flag = {
+ { SB_LAZYTIME, "nolazytime" },
+ { SB_MANDLOCK, "nomand" },
+ { SB_RDONLY, "rw" },
+ { SB_SILENT, "silent" },
+ { SB_SYNCHRONOUS, "async" },
+ { },
+};
+
+static const match_table_t forbidden_sb_flag = {
+ { 1, "bind" },
+ { 1, "move" },
+ { 1, "private" },
+ { 1, "remount" },
+ { 1, "shared" },
+ { 1, "slave" },
+ { 1, "unbindable" },
+ { 1, "rec" },
+ { 1, "noatime" },
+ { 1, "relatime" },
+ { 1, "norelatime" },
+ { 1, "strictatime" },
+ { 1, "nostrictatime" },
+ { 1, "nodiratime" },
+ { 1, "dev" },
+ { 1, "nodev" },
+ { 1, "exec" },
+ { 1, "noexec" },
+ { 1, "suid" },
+ { 1, "nosuid" },
+ { },
+};
+
+/*
+ * Check for a common mount option that manipulates s_flags.
+ */
+static int vfs_parse_sb_flag_option(struct fs_context *fc, char *data)
+{
+ substring_t args[MAX_OPT_ARGS];
+ unsigned int token;
+
+ token = match_token(data, common_set_sb_flag, args);
+ if (token) {
+ fc->sb_flags |= token;
+ return 1;
+ }
+
+ token = match_token(data, common_clear_sb_flag, args);
+ if (token) {
+ fc->sb_flags &= ~token;
+ return 1;
+ }
+
+ token = match_token(data, forbidden_sb_flag, args);
+ if (token)
+ return -EINVAL;
+
+ return 0;
+}
+
+/**
+ * vfs_parse_fs_option - Add a single mount option to a superblock config
+ * @fc: The filesystem context to modify
+ * @opt: The option to apply.
+ * @len: The length of the option.
+ *
+ * A single mount option in string form is applied to the filesystem context
+ * being set up. Certain standard options (for example "ro") are translated
+ * into flag bits without going to the filesystem. The active security module
+ * is allowed to observe and poach options. Any other options are passed over
+ * to the filesystem to parse.
+ *
+ * This may be called multiple times for a context.
+ *
+ * Returns 0 on success and a negative error code on failure. In the event of
+ * failure, supplementary error information may have been set.
+ */
+int vfs_parse_fs_option(struct fs_context *fc, char *opt, size_t len)
+{
+ int ret;
+
+ ret = vfs_parse_sb_flag_option(fc, opt);
+ if (ret < 0)
+ return ret;
+ if (ret == 1)
+ return 0;
+
+ ret = security_fs_context_parse_option(fc, opt, len);
+ if (ret < 0)
+ return ret;
+ if (ret == 1)
+ return 0;
+
+ if (fc->ops->parse_option)
+ return fc->ops->parse_option(fc, opt, len);
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL(vfs_parse_fs_option);
+
+/**
+ * vfs_set_fs_source - Set the source/device name in a filesystem context
+ * @fc: The filesystem context to alter
+ * @source: The name of the source
+ * @slen: Length of @source string
+ */
+int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t slen)
+{
+ if (fc->source)
+ return -EINVAL;
+ if (source) {
+ fc->source = kmemdup_nul(source, slen, GFP_KERNEL);
+ if (!fc->source)
+ return -ENOMEM;
+ }
+
+ if (fc->ops->parse_source)
+ return fc->ops->parse_source(fc);
+ return 0;
+}
+EXPORT_SYMBOL(vfs_set_fs_source);
+
+/**
+ * generic_parse_monolithic - Parse key[=val][,key[=val]]* mount data
+ * @ctx: The superblock configuration to fill in.
+ * @data: The data to parse
+ * @data_size: The amount of data
+ *
+ * Parse a blob of data that's in key[=val][,key[=val]]* form. This can be
+ * called from the ->parse_monolithic() fs_context operation.
+ *
+ * Returns 0 on success or the error returned by the ->parse_option() fs_context
+ * operation on failure.
+ */
+int generic_parse_monolithic(struct fs_context *ctx, void *data, size_t data_size)
+{
+ char *options = data, *opt;
+ int ret;
+
+ if (!options)
+ return 0;
+
+ while ((opt = strsep(&options, ",")) != NULL) {
+ if (*opt) {
+ ret = vfs_parse_fs_option(ctx, opt, strlen(opt));
+ if (ret < 0)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(generic_parse_monolithic);
+
+/**
+ * vfs_new_fs_context - Create a filesystem context.
+ * @fs_type: The filesystem type.
+ * @src_sb: A superblock from which this one derives (or NULL)
+ * @sb_flags: Filesystem/superblock flags (SB_*)
+ * @purpose: The purpose that this configuration shall be used for.
+ *
+ * Open a filesystem and create a mount context. The mount context is
+ * initialised with the supplied flags and, if a submount/automount from
+ * another superblock (@src_sb) is supplied, may have parameters such as
+ * namespaces copied across from that superblock.
+ */
+struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct super_block *src_sb,
+ unsigned int sb_flags,
+ enum fs_context_purpose purpose)
+{
+ struct fs_context *fc;
+ size_t fc_size = fs_type->fs_context_size;
+ int ret;
+
+ BUG_ON(fs_type->init_fs_context && fc_size < sizeof(*fc));
+
+ if (!fs_type->init_fs_context)
+ fc_size = sizeof(struct legacy_fs_context);
+
+ fc = kzalloc(fc_size, GFP_KERNEL);
+ if (!fc)
+ return ERR_PTR(-ENOMEM);
+
+ fc->purpose = purpose;
+ fc->sb_flags = sb_flags;
+ fc->fs_type = get_filesystem(fs_type);
+ fc->cred = get_current_cred();
+
+ switch (purpose) {
+ case FS_CONTEXT_FOR_KERNEL_MOUNT:
+ fc->sb_flags |= SB_KERNMOUNT;
+ /* Fallthrough */
+ case FS_CONTEXT_FOR_USER_MOUNT:
+ fc->user_ns = get_user_ns(fc->cred->user_ns);
+ fc->net_ns = get_net(current->nsproxy->net_ns);
+ break;
+ case FS_CONTEXT_FOR_SUBMOUNT:
+ fc->user_ns = get_user_ns(src_sb->s_user_ns);
+ fc->net_ns = get_net(current->nsproxy->net_ns);
+ break;
+ case FS_CONTEXT_FOR_RECONFIGURE:
+ /* We don't pin any namespaces as the superblock's
+ * subscriptions cannot be changed at this point.
+ */
+ break;
+ }
+
+
+ /* TODO: Make all filesystems support this unconditionally */
+ if (fc->fs_type->init_fs_context) {
+ ret = fc->fs_type->init_fs_context(fc, src_sb);
+ if (ret < 0)
+ goto err_fc;
+ } else {
+ fc->ops = &legacy_fs_context_ops;
+ }
+
+ /* Do the security check last because ->init_fs_context may change the
+ * namespace subscriptions.
+ */
+ ret = security_fs_context_alloc(fc, src_sb);
+ if (ret < 0)
+ goto err_fc;
+
+ return fc;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_new_fs_context);
+
+/**
+ * vfs_sb_reconfig - Create a filesystem context for remount/reconfiguration
+ * @mountpoint: The mountpoint to open
+ * @sb_flags: Filesystem/superblock flags (SB_*)
+ *
+ * Open a mounted filesystem and create a filesystem context such that a
+ * remount can be effected.
+ */
+struct fs_context *vfs_sb_reconfig(struct path *mountpoint,
+ unsigned int sb_flags)
+{
+ struct fs_context *fc;
+
+ fc = vfs_new_fs_context(mountpoint->mnt->mnt_sb->s_type,
+ mountpoint->mnt->mnt_sb,
+ sb_flags, FS_CONTEXT_FOR_RECONFIGURE);
+ if (IS_ERR(fc))
+ return fc;
+
+ fc->root = dget(mountpoint->dentry);
+ return fc;
+}
+
+/**
+ * vfs_dup_fs_context - Duplicate a filesystem context.
+ * @src_fc: The context to copy.
+ */
+struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
+{
+ struct fs_context *fc;
+ size_t fc_size;
+ int ret;
+
+ if (!src_fc->ops->dup)
+ return ERR_PTR(-ENOTSUPP);
+
+ fc_size = src_fc->fs_type->fs_context_size;
+ if (!src_fc->fs_type->init_fs_context)
+ fc_size = sizeof(struct legacy_fs_context);
+
+ fc = kmemdup(src_fc, fc_size, GFP_KERNEL);
+ if (!fc)
+ return ERR_PTR(-ENOMEM);
+
+ fc->source = NULL;
+ fc->security = NULL;
+ get_filesystem(fc->fs_type);
+ get_net(fc->net_ns);
+ get_user_ns(fc->user_ns);
+ get_cred(fc->cred);
+
+ /* Can't call put until we've called ->dup */
+ ret = fc->ops->dup(fc, src_fc);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = security_fs_context_dup(fc, src_fc);
+ if (ret < 0)
+ goto err_fc;
+ return fc;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_dup_fs_context);
+
+/**
+ * put_fs_context - Dispose of a superblock configuration context.
+ * @fc: The context to dispose of.
+ */
+void put_fs_context(struct fs_context *fc)
+{
+ struct super_block *sb;
+
+ if (fc->root) {
+ sb = fc->root->d_sb;
+ dput(fc->root);
+ fc->root = NULL;
+ if (fc->drop_sb)
+ deactivate_super(sb);
+ }
+
+ if (fc->ops && fc->ops->free)
+ fc->ops->free(fc);
+
+ security_fs_context_free(fc);
+ if (fc->net_ns)
+ put_net(fc->net_ns);
+ put_user_ns(fc->user_ns);
+ if (fc->cred)
+ put_cred(fc->cred);
+ kfree(fc->subtype);
+ put_filesystem(fc->fs_type);
+ kfree(fc->source);
+ kfree(fc);
+}
+EXPORT_SYMBOL(put_fs_context);
+
+/*
+ * Free the config for a filesystem that doesn't support fs_context.
+ */
+static void legacy_fs_context_free(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+
+ free_secdata(ctx->secdata);
+ switch (ctx->param_type) {
+ case LEGACY_FS_UNSET_PARAMS:
+ case LEGACY_FS_NO_PARAMS:
+ break;
+ case LEGACY_FS_MAGIC_PARAMS:
+ break; /* ctx->legacy_data is a weird pointer */
+ default:
+ kfree(ctx->legacy_data);
+ break;
+ }
+}
+
+/*
+ * Duplicate a legacy config.
+ */
+static int legacy_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+ struct legacy_fs_context *src_ctx = container_of(src_fc, struct legacy_fs_context, fc);
+
+ switch (ctx->param_type) {
+ case LEGACY_FS_MONOLITHIC_PARAMS:
+ case LEGACY_FS_INDIVIDUAL_PARAMS:
+ ctx->legacy_data = kmemdup(src_ctx->legacy_data,
+ src_ctx->data_size, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ /* Fall through */
+ default:
+ break;
+ }
+ return 0;
+}
+
+/*
+ * Add an option to a legacy config. We build up a comma-separated list of
+ * options.
+ */
+static int legacy_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+ unsigned int size = ctx->data_size;
+
+ if (ctx->param_type != LEGACY_FS_UNSET_PARAMS &&
+ ctx->param_type != LEGACY_FS_INDIVIDUAL_PARAMS) {
+ pr_warn("VFS: Can't mix monolithic and individual options\n");
+ return -EINVAL;
+ }
+
+ if (size + len + 2 > PAGE_SIZE)
+ return -EINVAL;
+ if (memchr(opt, ',', len) != NULL)
+ return -EINVAL;
+ if (!ctx->legacy_data) {
+ ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ }
+
+ ctx->legacy_data[size++] = ',';
+ memcpy(ctx->legacy_data + size, opt, len);
+ size += len;
+ ctx->legacy_data[size] = '\0';
+ ctx->data_size = size;
+ ctx->param_type = LEGACY_FS_INDIVIDUAL_PARAMS;
+ return 0;
+}
+
+/*
+ * Add monolithic mount data.
+ */
+static int legacy_parse_monolithic(struct fs_context *fc, void *data, size_t data_size)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+
+ if (ctx->param_type != LEGACY_FS_UNSET_PARAMS) {
+ pr_warn("VFS: Can't mix monolithic and individual options\n");
+ return -EINVAL;
+ }
+
+ if (!data) {
+ ctx->param_type = LEGACY_FS_NO_PARAMS;
+ return 0;
+ }
+
+ ctx->data_size = data_size;
+ if (data_size > 0) {
+ ctx->legacy_data = kmemdup(data, data_size, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ ctx->param_type = LEGACY_FS_MONOLITHIC_PARAMS;
+ } else {
+ /* Some filesystems pass weird pointers through that we don't
+ * want to copy. They can indicate this by setting data_size
+ * to 0.
+ */
+ ctx->legacy_data = data;
+ ctx->param_type = LEGACY_FS_MAGIC_PARAMS;
+ }
+
+ return 0;
+}
+
+/*
+ * Use the legacy mount validation step to strip out and process security
+ * config options.
+ */
+static int legacy_validate(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+
+ switch (ctx->param_type) {
+ case LEGACY_FS_UNSET_PARAMS:
+ ctx->param_type = LEGACY_FS_NO_PARAMS;
+ /* Fall through */
+ case LEGACY_FS_NO_PARAMS:
+ case LEGACY_FS_MAGIC_PARAMS:
+ return 0;
+ default:
+ break;
+ }
+
+ if (ctx->fc.fs_type->fs_flags & FS_BINARY_MOUNTDATA)
+ return 0;
+
+ ctx->secdata = alloc_secdata();
+ if (!ctx->secdata)
+ return -ENOMEM;
+
+ return security_sb_copy_data(ctx->legacy_data, ctx->data_size,
+ ctx->secdata);
+}
+
+/*
+ * Determine the superblock subtype.
+ */
+static int legacy_set_subtype(struct fs_context *fc)
+{
+ const char *subtype = strchr(fc->fs_type->name, '.');
+
+ if (subtype) {
+ subtype++;
+ if (!subtype[0])
+ return -EINVAL;
+ } else {
+ subtype = "";
+ }
+
+ fc->subtype = kstrdup(subtype, GFP_KERNEL);
+ if (!fc->subtype)
+ return -ENOMEM;
+ return 0;
+}
+
+/*
+ * Get a mountable root with the legacy mount command.
+ */
+static int legacy_get_tree(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+ struct super_block *sb;
+ struct dentry *root;
+ int ret;
+
+ root = ctx->fc.fs_type->mount(ctx->fc.fs_type, ctx->fc.sb_flags,
+ ctx->fc.source, ctx->legacy_data,
+ ctx->data_size);
+ if (IS_ERR(root))
+ return PTR_ERR(root);
+
+ sb = root->d_sb;
+ BUG_ON(!sb);
+
+ if ((ctx->fc.fs_type->fs_flags & FS_HAS_SUBTYPE) &&
+ !fc->subtype) {
+ ret = legacy_set_subtype(fc);
+ if (ret < 0)
+ goto err_sb;
+ }
+
+ ctx->fc.root = root;
+ ctx->fc.drop_sb = true;
+ return 0;
+
+err_sb:
+ dput(root);
+ deactivate_locked_super(sb);
+ return ret;
+}
+
+static const struct fs_context_operations legacy_fs_context_ops = {
+ .free = legacy_fs_context_free,
+ .dup = legacy_fs_context_dup,
+ .parse_option = legacy_parse_option,
+ .parse_monolithic = legacy_parse_monolithic,
+ .validate = legacy_validate,
+ .get_tree = legacy_get_tree,
+};
diff --git a/fs/internal.h b/fs/internal.h
index 1afa522c5f30..91a990234488 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -98,7 +98,8 @@ extern struct file *get_empty_filp(void);
/*
* super.c
*/
-extern int do_remount_sb(struct super_block *, int, void *, size_t, int);
+extern int do_remount_sb(struct super_block *, int, void *, size_t, int,
+ struct fs_context *);
extern bool trylock_super(struct super_block *sb);
extern struct dentry *mount_fs(struct file_system_type *,
int, const char *, void *, size_t);
diff --git a/fs/libfs.c b/fs/libfs.c
index 9f1f4884b7cc..0bbe1ff1d09e 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -9,6 +9,7 @@
#include <linux/slab.h>
#include <linux/cred.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/vfs.h>
#include <linux/quotaops.h>
#include <linux/mutex.h>
@@ -574,13 +575,27 @@ static DEFINE_SPINLOCK(pin_fs_lock);
int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *count)
{
+ struct fs_context *fc;
struct vfsmount *mnt = NULL;
+ int ret;
+
spin_lock(&pin_fs_lock);
if (unlikely(!*mount)) {
spin_unlock(&pin_fs_lock);
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
+
+ fc = vfs_new_fs_context(type, NULL, 0, FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ mnt = ERR_PTR(ret);
+ else
+ mnt = vfs_create_mount(fc);
+ put_fs_context(fc);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
+
spin_lock(&pin_fs_lock);
if (!*mount)
*mount = mnt;
diff --git a/fs/namespace.c b/fs/namespace.c
index 8fc4f3459b80..c61ff2ab090a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -25,8 +25,10 @@
#include <linux/magic.h>
#include <linux/bootmem.h>
#include <linux/task_work.h>
+#include <linux/file.h>
#include <linux/sched/task.h>
#include <uapi/linux/mount.h>
+#include <linux/fs_context.h>
#include "pnode.h"
#include "internal.h"
@@ -1019,56 +1021,6 @@ static struct mount *skip_mnt_tree(struct mount *p)
return p;
}
-struct vfsmount *
-vfs_kern_mount(struct file_system_type *type, int flags, const char *name,
- void *data, size_t data_size)
-{
- struct mount *mnt;
- struct dentry *root;
-
- if (!type)
- return ERR_PTR(-ENODEV);
-
- mnt = alloc_vfsmnt(name);
- if (!mnt)
- return ERR_PTR(-ENOMEM);
-
- if (flags & SB_KERNMOUNT)
- mnt->mnt.mnt_flags = MNT_INTERNAL;
-
- root = mount_fs(type, flags, name, data, data_size);
- if (IS_ERR(root)) {
- mnt_free_id(mnt);
- free_vfsmnt(mnt);
- return ERR_CAST(root);
- }
-
- mnt->mnt.mnt_root = root;
- mnt->mnt.mnt_sb = root->d_sb;
- mnt->mnt_mountpoint = mnt->mnt.mnt_root;
- mnt->mnt_parent = mnt;
- lock_mount_hash();
- list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
- unlock_mount_hash();
- return &mnt->mnt;
-}
-EXPORT_SYMBOL_GPL(vfs_kern_mount);
-
-struct vfsmount *
-vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
- const char *name, void *data, size_t data_size)
-{
- /* Until it is worked out how to pass the user namespace
- * through from the parent mount to the submount don't support
- * unprivileged mounts with submounts.
- */
- if (mountpoint->d_sb->s_user_ns != &init_user_ns)
- return ERR_PTR(-EPERM);
-
- return vfs_kern_mount(type, SB_SUBMOUNT, name, data, data_size);
-}
-EXPORT_SYMBOL_GPL(vfs_submount);
-
static struct mount *clone_mnt(struct mount *old, struct dentry *root,
int flag)
{
@@ -1595,7 +1547,7 @@ static int do_umount(struct mount *mnt, int flags)
return -EPERM;
down_write(&sb->s_umount);
if (!sb_rdonly(sb))
- retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0);
+ retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0, NULL);
up_write(&sb->s_umount);
return retval;
}
@@ -2282,6 +2234,20 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
return error;
}
+/*
+ * Parse the monolithic page of mount data given to sys_mount().
+ */
+static int parse_monolithic_mount_data(struct fs_context *fc, void *data, size_t data_size)
+{
+ int (*monolithic_mount_data)(struct fs_context *, void *, size_t);
+
+ monolithic_mount_data = fc->ops->parse_monolithic;
+ if (!monolithic_mount_data)
+ monolithic_mount_data = generic_parse_monolithic;
+
+ return monolithic_mount_data(fc, data, data_size);
+}
+
/*
* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
@@ -2290,9 +2256,11 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
static int do_remount(struct path *path, int ms_flags, int sb_flags,
int mnt_flags, void *data, size_t data_size)
{
+ struct fs_context *fc = NULL;
int err;
struct super_block *sb = path->mnt->mnt_sb;
struct mount *mnt = real_mount(path->mnt);
+ struct file_system_type *type = sb->s_type;
if (!check_mnt(mnt))
return -EINVAL;
@@ -2327,9 +2295,29 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
return -EPERM;
}
- err = security_sb_remount(sb, data, data_size);
- if (err)
- return err;
+ if (type->init_fs_context) {
+ fc = vfs_sb_reconfig(path, sb_flags);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ err = parse_monolithic_mount_data(fc, data, data_size);
+ if (err < 0)
+ goto err_fc;
+
+ if (fc->ops->validate) {
+ err = fc->ops->validate(fc);
+ if (err < 0)
+ goto err_fc;
+ }
+
+ err = security_fs_context_validate(fc);
+ if (err)
+ goto err_fc;
+ } else {
+ err = security_sb_remount(sb, data, data_size);
+ if (err)
+ return err;
+ }
down_write(&sb->s_umount);
if (ms_flags & MS_BIND)
@@ -2337,7 +2325,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
else if (!capable(CAP_SYS_ADMIN))
err = -EPERM;
else
- err = do_remount_sb(sb, sb_flags, data, data_size, 0);
+ err = do_remount_sb(sb, sb_flags, data, data_size, 0, fc);
if (!err) {
lock_mount_hash();
mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
@@ -2346,6 +2334,9 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
unlock_mount_hash();
}
up_write(&sb->s_umount);
+err_fc:
+ if (fc)
+ put_fs_context(fc);
return err;
}
@@ -2429,29 +2420,6 @@ static int do_move_mount(struct path *path, const char *old_name)
return err;
}
-static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
-{
- int err;
- const char *subtype = strchr(fstype, '.');
- if (subtype) {
- subtype++;
- err = -EINVAL;
- if (!subtype[0])
- goto err;
- } else
- subtype = "";
-
- mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
- err = -ENOMEM;
- if (!mnt->mnt_sb->s_subtype)
- goto err;
- return mnt;
-
- err:
- mntput(mnt);
- return ERR_PTR(err);
-}
-
/*
* add a mount into a namespace's mount tree
*/
@@ -2498,42 +2466,85 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
+/*
+ * Create a new mount using a superblock configuration and request it
+ * be added to the namespace tree.
+ */
+static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ struct vfsmount *mnt;
+ int ret;
+
+ ret = security_sb_mountpoint(fc, mountpoint,
+ mnt_flags & ~MNT_INTERNAL_FLAGS);
+ if (ret < 0)
+ return ret;
+
+ mnt = vfs_create_mount(fc);
+ if (IS_ERR(mnt))
+ return PTR_ERR(mnt);
+
+ ret = -EPERM;
+ if (mount_too_revealing(mnt, &mnt_flags)) {
+ pr_warn("VFS: Mount too revealing\n");
+ goto err_mnt;
+ }
+
+ ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+ if (ret < 0)
+ goto err_mnt;
+ return ret;
+
+err_mnt:
+ mntput(mnt);
+ return ret;
+}
+
/*
* create a new mount for userspace and request it to be added into the
* namespace's tree
*/
-static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
- int mnt_flags, const char *name,
+static int do_new_mount(struct path *mountpoint, const char *fstype,
+ int sb_flags, int mnt_flags, const char *name,
void *data, size_t data_size)
{
- struct file_system_type *type;
- struct vfsmount *mnt;
+ struct file_system_type *fs_type;
+ struct fs_context *fc;
int err;
if (!fstype)
return -EINVAL;
- type = get_fs_type(fstype);
- if (!type)
- return -ENODEV;
+ err = -ENODEV;
+ fs_type = get_fs_type(fstype);
+ if (!fs_type)
+ goto out;
- mnt = vfs_kern_mount(type, sb_flags, name, data, data_size);
- if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
- !mnt->mnt_sb->s_subtype)
- mnt = fs_set_subtype(mnt, fstype);
+ fc = vfs_new_fs_context(fs_type, NULL, sb_flags,
+ FS_CONTEXT_FOR_USER_MOUNT);
+ put_filesystem(fs_type);
+ if (IS_ERR(fc)) {
+ err = PTR_ERR(fc);
+ goto out;
+ }
- put_filesystem(type);
- if (IS_ERR(mnt))
- return PTR_ERR(mnt);
+ err = vfs_set_fs_source(fc, name, name ? strlen(name) : 0);
+ if (err < 0)
+ goto out_fc;
- if (mount_too_revealing(mnt, &mnt_flags)) {
- mntput(mnt);
- return -EPERM;
- }
+ err = parse_monolithic_mount_data(fc, data, data_size);
+ if (err < 0)
+ goto out_fc;
- err = do_add_mount(real_mount(mnt), path, mnt_flags);
- if (err)
- mntput(mnt);
+ err = vfs_get_tree(fc);
+ if (err < 0)
+ goto out_fc;
+
+ err = do_new_mount_fc(fc, mountpoint, mnt_flags);
+out_fc:
+ put_fs_context(fc);
+out:
return err;
}
@@ -3081,6 +3092,113 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
return ksys_mount(dev_name, dir_name, type, flags, data);
}
+/**
+ * vfs_create_mount - Create a mount for a configured superblock
+ * @fc: The configuration context with the superblock attached
+ *
+ * Create a mount to an already configured superblock. If necessary, the
+ * caller should invoke vfs_get_tree() before calling this.
+ *
+ * Note that this does not attach the mount to anything.
+ */
+struct vfsmount *vfs_create_mount(struct fs_context *fc)
+{
+ struct mount *mnt;
+
+ if (!fc->root)
+ return ERR_PTR(-EINVAL);
+
+ mnt = alloc_vfsmnt(fc->source ?: "none");
+ if (!mnt)
+ return ERR_PTR(-ENOMEM);
+
+ if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+ /* It's a longterm mount, don't release mnt until we unmount
+ * before file sys is unregistered
+ */
+ mnt->mnt.mnt_flags = MNT_INTERNAL;
+
+ atomic_inc(&fc->root->d_sb->s_active);
+ mnt->mnt.mnt_sb = fc->root->d_sb;
+ mnt->mnt.mnt_root = dget(fc->root);
+ mnt->mnt_mountpoint = mnt->mnt.mnt_root;
+ mnt->mnt_parent = mnt;
+
+ lock_mount_hash();
+ list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
+ unlock_mount_hash();
+ return &mnt->mnt;
+}
+EXPORT_SYMBOL(vfs_create_mount);
+
+struct vfsmount *vfs_kern_mount(struct file_system_type *type,
+ int sb_flags, const char *devname,
+ void *data, size_t data_size)
+{
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ if (!type)
+ return ERR_PTR(-EINVAL);
+
+ fc = vfs_new_fs_context(type, NULL, sb_flags,
+ sb_flags & SB_KERNMOUNT ?
+ FS_CONTEXT_FOR_KERNEL_MOUNT :
+ FS_CONTEXT_FOR_USER_MOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ ret = vfs_set_fs_source(fc, devname, devname ? strlen(devname) : 0);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = parse_monolithic_mount_data(fc, data, data_size);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc);
+out:
+ put_fs_context(fc);
+ return mnt;
+err_fc:
+ mnt = ERR_PTR(ret);
+ goto out;
+}
+EXPORT_SYMBOL_GPL(vfs_kern_mount);
+
+struct vfsmount *
+vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
+ const char *name, void *data, size_t data_size)
+{
+ /* Until it is worked out how to pass the user namespace
+ * through from the parent mount to the submount don't support
+ * unprivileged mounts with submounts.
+ */
+ if (mountpoint->d_sb->s_user_ns != &init_user_ns)
+ return ERR_PTR(-EPERM);
+
+ return vfs_kern_mount(type, MS_SUBMOUNT, name, data, data_size);
+}
+EXPORT_SYMBOL_GPL(vfs_submount);
+
+struct vfsmount *kern_mount(struct file_system_type *type)
+{
+ return vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
+}
+EXPORT_SYMBOL_GPL(kern_mount);
+
+struct vfsmount *kern_mount_data(struct file_system_type *type,
+ void *data, size_t data_size)
+{
+ return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
+}
+EXPORT_SYMBOL_GPL(kern_mount_data);
+
/*
* Return true if path is reachable from root
*
@@ -3301,22 +3419,6 @@ void put_mnt_ns(struct mnt_namespace *ns)
free_mnt_ns(ns);
}
-struct vfsmount *kern_mount_data(struct file_system_type *type,
- void *data, size_t data_size)
-{
- struct vfsmount *mnt;
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
- if (!IS_ERR(mnt)) {
- /*
- * it is a longterm mount, don't release mnt until
- * we unmount before file sys is unregistered
- */
- real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
- }
- return mnt;
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
void kern_unmount(struct vfsmount *mnt)
{
/* release long term mount so mount point can be released */
diff --git a/fs/super.c b/fs/super.c
index 9117e3447837..5d65a45ca6db 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,7 @@
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
#include <uapi/linux/mount.h>
+#include <linux/fs_context.h>
#include "internal.h"
static int thaw_super_locked(struct super_block *sb);
@@ -173,16 +174,13 @@ static void destroy_unused_super(struct super_block *s)
}
/**
- * alloc_super - create new superblock
- * @type: filesystem type superblock should belong to
- * @flags: the mount flags
- * @user_ns: User namespace for the super_block
+ * alloc_super - Create new superblock
+ * @fc: The filesystem configuration context
*
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
-static struct super_block *alloc_super(struct file_system_type *type, int flags,
- struct user_namespace *user_ns)
+static struct super_block *alloc_super(struct fs_context *fc)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static const struct super_operations default_op;
@@ -192,9 +190,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
return NULL;
INIT_LIST_HEAD(&s->s_mounts);
- s->s_user_ns = get_user_ns(user_ns);
+ s->s_user_ns = get_user_ns(fc->user_ns);
init_rwsem(&s->s_umount);
- lockdep_set_class(&s->s_umount, &type->s_umount_key);
+ lockdep_set_class(&s->s_umount, &fc->fs_type->s_umount_key);
/*
* sget() can have s_umount recursion.
*
@@ -218,12 +216,12 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
for (i = 0; i < SB_FREEZE_LEVELS; i++) {
if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
sb_writers_name[i],
- &type->s_writers_key[i]))
+ &fc->fs_type->s_writers_key[i]))
goto fail;
}
init_waitqueue_head(&s->s_writers.wait_unfrozen);
s->s_bdi = &noop_backing_dev_info;
- s->s_flags = flags;
+ s->s_flags = fc->sb_flags;
if (s->s_user_ns != &init_user_ns)
s->s_iflags |= SB_I_NODEV;
INIT_HLIST_NODE(&s->s_instances);
@@ -241,7 +239,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_count = 1;
atomic_set(&s->s_active, 1);
mutex_init(&s->s_vfs_rename_mutex);
- lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
+ lockdep_set_class(&s->s_vfs_rename_mutex, &fc->fs_type->s_vfs_rename_key);
init_rwsem(&s->s_dquot.dqio_sem);
s->s_maxbytes = MAX_NON_LFS;
s->s_op = &default_op;
@@ -459,6 +457,96 @@ void generic_shutdown_super(struct super_block *sb)
EXPORT_SYMBOL(generic_shutdown_super);
+/**
+ * sget_fc - Find or create a superblock
+ * @fc: Filesystem context.
+ * @test: Comparison callback
+ * @set: Setup callback
+ *
+ * Find or create a superblock using the parameters stored in the filesystem
+ * context and the two callback functions.
+ *
+ * If an extant superblock is matched, then that will be returned with an
+ * elevated reference count that the caller must transfer or discard.
+ *
+ * If no match is made, a new superblock will be allocated and basic
+ * initialisation will be performed (s_type, s_fs_info and s_id will be set and
+ * the set() callback will be invoked), the superblock will be published and it
+ * will be returned in a partially constructed state with SB_BORN and SB_ACTIVE
+ * as yet unset.
+ */
+struct super_block *sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *, struct fs_context *),
+ int (*set)(struct super_block *, struct fs_context *))
+{
+ struct super_block *s = NULL;
+ struct super_block *old;
+ int err;
+
+ if (!(fc->sb_flags & SB_KERNMOUNT) &&
+ fc->purpose != FS_CONTEXT_FOR_SUBMOUNT) {
+ /* Don't allow mounting unless the caller has CAP_SYS_ADMIN
+ * over the namespace.
+ */
+ if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT) &&
+ !capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ else if (!ns_capable(fc->user_ns, CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ }
+
+retry:
+ spin_lock(&sb_lock);
+ if (test) {
+ hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
+ if (!test(old, fc))
+ continue;
+ if (fc->user_ns != old->s_user_ns) {
+ spin_unlock(&sb_lock);
+ if (s) {
+ up_write(&s->s_umount);
+ destroy_unused_super(s);
+ }
+ return ERR_PTR(-EBUSY);
+ }
+ if (!grab_super(old))
+ goto retry;
+ if (s) {
+ up_write(&s->s_umount);
+ destroy_unused_super(s);
+ s = NULL;
+ }
+ return old;
+ }
+ }
+ if (!s) {
+ spin_unlock(&sb_lock);
+ s = alloc_super(fc);
+ if (!s)
+ return ERR_PTR(-ENOMEM);
+ goto retry;
+ }
+
+ s->s_fs_info = fc->s_fs_info;
+ err = set(s, fc);
+ if (err) {
+ s->s_fs_info = NULL;
+ spin_unlock(&sb_lock);
+ up_write(&s->s_umount);
+ destroy_unused_super(s);
+ return ERR_PTR(err);
+ }
+ s->s_type = fc->fs_type;
+ strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
+ list_add_tail(&s->s_list, &super_blocks);
+ hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
+ spin_unlock(&sb_lock);
+ get_filesystem(s->s_type);
+ register_shrinker(&s->s_shrink);
+ return s;
+}
+EXPORT_SYMBOL(sget_fc);
+
/**
* sget_userns - find or create a superblock
* @type: filesystem type superblock should belong to
@@ -501,7 +589,14 @@ struct super_block *sget_userns(struct file_system_type *type,
}
if (!s) {
spin_unlock(&sb_lock);
- s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
+ {
+ struct fs_context fc = {
+ .fs_type = type,
+ .sb_flags = flags & ~SB_SUBMOUNT,
+ .user_ns = user_ns,
+ };
+ s = alloc_super(&fc);
+ }
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -829,11 +924,13 @@ struct super_block *user_get_super(dev_t dev)
* @data: the rest of options
* @data_size: The size of the data
* @force: whether or not to force the change
+ * @fc: the superblock config for filesystems that support it
+ * (NULL if called from emergency or umount)
*
* Alters the mount options of a mounted file system.
*/
int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
- size_t data_size, int force)
+ size_t data_size, int force, struct fs_context *fc)
{
int retval;
int remount_ro;
@@ -875,8 +972,17 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
}
}
- if (sb->s_op->remount_fs) {
- retval = sb->s_op->remount_fs(sb, &sb_flags, data, data_size);
+ if (sb->s_op->reconfigure ||
+ sb->s_op->remount_fs) {
+ if (sb->s_op->reconfigure) {
+ retval = sb->s_op->reconfigure(sb, fc);
+ sb_flags = fc->sb_flags;
+ if (retval == 0)
+ security_sb_reconfigure(fc);
+ } else {
+ retval = sb->s_op->remount_fs(sb, &sb_flags,
+ data, data_size);
+ }
if (retval) {
if (!force)
goto cancel_readonly;
@@ -915,7 +1021,7 @@ static void do_emergency_remount_callback(struct super_block *sb)
/*
* What lock protects sb->s_flags??
*/
- do_remount_sb(sb, SB_RDONLY, NULL, 0, 1);
+ do_remount_sb(sb, SB_RDONLY, NULL, 0, 1, NULL);
}
up_write(&sb->s_umount);
}
@@ -1097,6 +1203,90 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
EXPORT_SYMBOL(mount_ns);
+static int set_anon_super_fc(struct super_block *sb, struct fs_context *fc)
+{
+ return set_anon_super(sb, NULL);
+}
+
+static int test_keyed_super(struct super_block *sb, struct fs_context *fc)
+{
+ return sb->s_fs_info == fc->s_fs_info;
+}
+
+static int test_single_super(struct super_block *s, struct fs_context *fc)
+{
+ return 1;
+}
+
+/**
+ * vfs_get_super - Get a superblock with a search key set in s_fs_info.
+ * @fc: The filesystem context holding the parameters
+ * @keying: How to distinguish superblocks
+ * @fill_super: Helper to initialise a new superblock
+ *
+ * Search for a superblock and create a new one if not found. The search
+ * criterion is controlled by @keying. If the search fails, a new superblock
+ * is created and @fill_super() is called to initialise it.
+ *
+ * @keying can take one of a number of values:
+ *
+ * (1) vfs_get_single_super - Only one superblock of this type may exist on the
+ * system. This is typically used for special system filesystems.
+ *
+ * (2) vfs_get_keyed_super - Multiple superblocks may exist, but they must have
+ * distinct keys (where the key is in s_fs_info). Searching for the same
+ * key again will turn up the superblock for that key.
+ *
+ * (3) vfs_get_independent_super - Multiple superblocks may exist and are
+ * unkeyed. Each call will get a new superblock.
+ *
+ * A permissions check is made by sget_fc() unless we're getting a superblock
+ * for a kernel-internal mount or a submount.
+ */
+int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc))
+{
+ int (*test)(struct super_block *, struct fs_context *);
+ struct super_block *sb;
+
+ switch (keying) {
+ case vfs_get_single_super:
+ test = test_single_super;
+ break;
+ case vfs_get_keyed_super:
+ test = test_keyed_super;
+ break;
+ case vfs_get_independent_super:
+ test = NULL;
+ break;
+ default:
+ BUG();
+ }
+
+ sb = sget_fc(fc, test, set_anon_super_fc);
+ if (IS_ERR(sb))
+ return PTR_ERR(sb);
+
+ if (!sb->s_root) {
+ int err;
+ err = fill_super(sb, fc);
+ if (err) {
+ deactivate_locked_super(sb);
+ return err;
+ }
+
+ sb->s_flags |= SB_ACTIVE;
+ }
+
+ BUG_ON(fc->root);
+ fc->root = dget(sb->s_root);
+ fc->drop_sb = true;
+ return 0;
+}
+EXPORT_SYMBOL(vfs_get_super);
+
#ifdef CONFIG_BLOCK
static int set_bdev_super(struct super_block *s, void *data)
{
@@ -1245,7 +1435,7 @@ struct dentry *mount_single(struct file_system_type *fs_type,
}
s->s_flags |= SB_ACTIVE;
} else {
- do_remount_sb(s, flags, data, data_size, 0);
+ do_remount_sb(s, flags, data, data_size, 0, NULL);
}
return dget(s->s_root);
}
@@ -1584,3 +1774,88 @@ int thaw_super(struct super_block *sb)
return thaw_super_locked(sb);
}
EXPORT_SYMBOL(thaw_super);
+
+/**
+ * vfs_get_tree - Get the mountable root
+ * @fc: The superblock configuration context.
+ *
+ * The filesystem is invoked to get or create a superblock which can then later
+ * be used for mounting. The filesystem places a pointer to the root to be
+ * used for mounting in @fc->root.
+ */
+int vfs_get_tree(struct fs_context *fc)
+{
+ struct super_block *sb;
+ int ret;
+
+ if (fc->root)
+ return -EBUSY;
+
+ if (fc->ops->validate) {
+ ret = fc->ops->validate(fc);
+ if (ret < 0)
+ return ret;
+ }
+
+ ret = security_fs_context_validate(fc);
+ if (ret < 0)
+ return ret;
+
+ /* The filesystem may transfer preallocated resources from the
+ * configuration context to the superblock, thereby rendering the
+ * config unusable for another attempt at creation if this one fails.
+ */
+ if (fc->degraded)
+ return -EBUSY;
+
+ /* Get the mountable root in fc->root, with a ref on the root and a ref
+ * on the superblock.
+ */
+ ret = fc->ops->get_tree(fc);
+ if (ret < 0)
+ return ret;
+
+ if (!fc->root) {
+ pr_err("Filesystem %s get_tree() didn't set fc->root\n",
+ fc->fs_type->name);
+ /* We don't know what the locking state of the superblock is -
+ * if there is a superblock.
+ */
+ BUG();
+ }
+
+ sb = fc->root->d_sb;
+ WARN_ON(!sb->s_bdi);
+
+ ret = security_sb_get_tree(fc);
+ if (ret < 0)
+ goto err_sb;
+
+ ret = -ENOMEM;
+ if (fc->subtype && !sb->s_subtype) {
+ sb->s_subtype = kstrdup(fc->subtype, GFP_KERNEL);
+ if (!sb->s_subtype)
+ goto err_sb;
+ }
+
+ sb->s_flags |= SB_BORN;
+
+ /* Filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
+ * but s_maxbytes was an unsigned long long for many releases. Throw
+ * this warning for a little while to try and catch filesystems that
+ * violate this rule.
+ */
+ WARN(sb->s_maxbytes < 0,
+ "%s set sb->s_maxbytes to negative value (%lld)\n",
+ fc->fs_type->name, sb->s_maxbytes);
+
+ up_write(&sb->s_umount);
+ return 0;
+
+err_sb:
+ dput(fc->root);
+ fc->root = NULL;
+ deactivate_locked_super(sb);
+ return ret;
+}
+EXPORT_SYMBOL(vfs_get_tree);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9703931cf095..c1f1428f6c67 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -60,6 +60,7 @@ struct workqueue_struct;
struct iov_iter;
struct fscrypt_info;
struct fscrypt_operations;
+struct fs_context;
extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -718,6 +719,11 @@ static inline void inode_unlock(struct inode *inode)
up_write(&inode->i_rwsem);
}
+static inline int inode_lock_killable(struct inode *inode)
+{
+ return down_write_killable(&inode->i_rwsem);
+}
+
static inline void inode_lock_shared(struct inode *inode)
{
down_read(&inode->i_rwsem);
@@ -1828,6 +1834,7 @@ struct super_operations {
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *, size_t);
+ int (*reconfigure) (struct super_block *, struct fs_context *);
void (*umount_begin) (struct super_block *);
int (*show_options)(struct seq_file *, struct dentry *);
@@ -2074,8 +2081,10 @@ struct file_system_type {
#define FS_HAS_SUBTYPE 4
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
+ unsigned short fs_context_size; /* Size of superblock config context to allocate */
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *, size_t);
+ int (*init_fs_context)(struct fs_context *, struct super_block *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
@@ -2132,6 +2141,9 @@ void deactivate_locked_super(struct super_block *sb);
int set_anon_super(struct super_block *s, void *data);
int get_anon_bdev(dev_t *);
void free_anon_bdev(dev_t);
+struct super_block *sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *, struct fs_context *),
+ int (*set)(struct super_block *, struct fs_context *));
struct super_block *sget_userns(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
@@ -2174,8 +2186,8 @@ mount_pseudo(struct file_system_type *fs_type, char *name,
extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
+extern struct vfsmount *kern_mount(struct file_system_type *);
extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
-#define kern_mount(type) kern_mount_data(type, NULL, 0)
extern void kern_unmount(struct vfsmount *mnt);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 732a11898242..1914eef0a88f 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -25,6 +25,7 @@ struct pid_namespace;
struct super_block;
struct user_namespace;
struct vfsmount;
+struct path;
enum fs_context_purpose {
FS_CONTEXT_FOR_USER_MOUNT, /* New superblock for user-specified mount */
@@ -68,9 +69,37 @@ struct fs_context_operations {
int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
int (*parse_source)(struct fs_context *fc);
int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
- int (*parse_monolithic)(struct fs_context *fc, void *data);
+ int (*parse_monolithic)(struct fs_context *fc, void *data, size_t data_size);
int (*validate)(struct fs_context *fc);
int (*get_tree)(struct fs_context *fc);
};
+/*
+ * fs_context manipulation functions.
+ */
+extern struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct super_block *src_sb,
+ unsigned int ms_flags,
+ enum fs_context_purpose purpose);
+extern struct fs_context *vfs_sb_reconfig(struct path *path, unsigned int ms_flags);
+extern struct fs_context *vfs_dup_fs_context(struct fs_context *src);
+extern int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t slen);
+extern int vfs_parse_fs_option(struct fs_context *fc, char *data, size_t opt);
+extern int generic_parse_monolithic(struct fs_context *fc, void *data, size_t data_size);
+extern int vfs_get_tree(struct fs_context *fc);
+extern void put_fs_context(struct fs_context *fc);
+
+/*
+ * sget() wrapper to be called from the ->get_tree() op.
+ */
+enum vfs_get_super_keying {
+ vfs_get_single_super, /* Only one such superblock may exist */
+ vfs_get_keyed_super, /* Superblocks with different s_fs_info keys may exist */
+ vfs_get_independent_super, /* Multiple independent superblocks may exist */
+};
+extern int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc));
+
#endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 8a1031a511c9..5f7e994614b1 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -21,6 +21,7 @@ struct super_block;
struct vfsmount;
struct dentry;
struct mnt_namespace;
+struct fs_context;
#define MNT_NOSUID 0x01
#define MNT_NODEV 0x02
@@ -88,6 +89,7 @@ struct path;
extern struct vfsmount *clone_private_mount(const struct path *path);
struct file_system_type;
+extern struct vfsmount *vfs_create_mount(struct fs_context *fc);
extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
int flags, const char *name,
void *data, size_t data_size);
Implement filesystem context security hooks for the smack LSM.
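
The option storage in this patch overlays five named string pointers on an
array so that they can be freed or duplicated in a loop, and each option may
be given only once. The following is a minimal userspace sketch of that idea,
not the kernel code. All demo_* names and the option keys are made up for
illustration (the real option strings come from smk_mount_tokens, which is
not shown in this patch), malloc()/free() stand in for the kernel allocators,
and the anonymous struct/union members assume a C11 compiler.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NR_OPTS 5

/* Named fields and ptrs[] alias the same five pointers. */
struct demo_smack_opts {
	union {
		struct {
			char *fsdefault;
			char *fsfloor;
			char *fshat;
			char *fsroot;
			char *fstransmute;
		};
		char *ptrs[DEMO_NR_OPTS];
	};
};

/* Hypothetical keys, listed in the same order as the fields above. */
static const char *const demo_keys[DEMO_NR_OPTS] = {
	"fsdefault", "fsfloor", "fshat", "fsroot", "fstransmute",
};

/* Store one "key=value" option; a repeated or unknown key is rejected. */
int demo_parse_option(struct demo_smack_opts *ctx, const char *opt)
{
	const char *eq = strchr(opt, '=');
	size_t klen, vlen;
	char *val;
	int i;

	assert(ctx != NULL);
	if (!eq)
		return -EINVAL;
	klen = (size_t)(eq - opt);

	for (i = 0; i < DEMO_NR_OPTS; i++) {
		if (strlen(demo_keys[i]) != klen ||
		    strncmp(opt, demo_keys[i], klen) != 0)
			continue;
		if (ctx->ptrs[i])
			return -EINVAL;		/* duplicate option */
		vlen = strlen(eq + 1);
		val = malloc(vlen + 1);
		if (!val)
			return -ENOMEM;
		memcpy(val, eq + 1, vlen + 1);	/* includes the NUL */
		ctx->ptrs[i] = val;
		return 0;
	}
	return -EINVAL;				/* unknown option */
}

/* Free every stored value by walking the array view. */
void demo_free_options(struct demo_smack_opts *ctx)
{
	int i;

	for (i = 0; i < DEMO_NR_OPTS; i++) {
		free(ctx->ptrs[i]);
		ctx->ptrs[i] = NULL;
	}
}
```

The array view keeps the free and duplicate paths to a single loop, while the
named view keeps the per-option code readable.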
Signed-off-by: David Howells <[email protected]>
cc: Casey Schaufler <[email protected]>
cc: [email protected]
---
security/smack/smack_lsm.c | 309 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 309 insertions(+)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 0b414836bebd..549aaa46353b 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -42,6 +42,7 @@
#include <linux/shm.h>
#include <linux/binfmts.h>
#include <linux/parser.h>
+#include <linux/fs_context.h>
#include "smack.h"
#define TRANS_TRUE "TRUE"
@@ -521,6 +522,307 @@ static int smack_syslog(int typefrom_file)
return rc;
}
+/*
+ * Mount context operations
+ */
+
+struct smack_fs_context {
+ union {
+ struct {
+ char *fsdefault;
+ char *fsfloor;
+ char *fshat;
+ char *fsroot;
+ char *fstransmute;
+ };
+ char *ptrs[5];
+
+ };
+ struct superblock_smack *sbsp;
+ struct inode_smack *isp;
+ bool transmute;
+};
+
+/**
+ * smack_fs_context_free - Free the security data from a filesystem context
+ * @fc: The filesystem context to be cleaned up.
+ */
+static void smack_fs_context_free(struct fs_context *fc)
+{
+ struct smack_fs_context *ctx = fc->security;
+ int i;
+
+ if (ctx) {
+ for (i = 0; i < ARRAY_SIZE(ctx->ptrs); i++)
+ kfree(ctx->ptrs[i]);
+ kfree(ctx->isp);
+ kfree(ctx->sbsp);
+ kfree(ctx);
+ fc->security = NULL;
+ }
+}
+
+/**
+ * smack_fs_context_alloc - Allocate security data for a filesystem context
+ * @fc: The filesystem context.
+ * @src_sb: Reference superblock (automount/reconfigure) or NULL
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_alloc(struct fs_context *fc,
+ struct super_block *src_sb)
+{
+ struct smack_fs_context *ctx;
+ struct superblock_smack *sbsp;
+ struct inode_smack *isp;
+ struct smack_known *skp;
+
+ ctx = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+ if (!ctx)
+ goto nomem;
+ fc->security = ctx;
+
+ sbsp = kzalloc(sizeof(struct superblock_smack), GFP_KERNEL);
+ if (!sbsp)
+ goto nomem_free;
+ ctx->sbsp = sbsp;
+
+ isp = new_inode_smack(NULL);
+ if (!isp)
+ goto nomem_free;
+ ctx->isp = isp;
+
+ if (src_sb) {
+ if (src_sb->s_security)
+ memcpy(sbsp, src_sb->s_security, sizeof(*sbsp));
+ } else if (!smack_privileged(CAP_MAC_ADMIN)) {
+ /* Unprivileged mounts get root and default from the caller. */
+ skp = smk_of_current();
+ sbsp->smk_root = skp;
+ sbsp->smk_default = skp;
+ } else {
+ sbsp->smk_root = &smack_known_floor;
+ sbsp->smk_default = &smack_known_floor;
+ sbsp->smk_floor = &smack_known_floor;
+ sbsp->smk_hat = &smack_known_hat;
+ /* SMK_SB_INITIALIZED will be zero from kzalloc. */
+ }
+
+ return 0;
+
+nomem_free:
+ smack_fs_context_free(fc);
+nomem:
+ return -ENOMEM;
+}
+
+/**
+ * smack_fs_context_dup - Duplicate the security data on fs_context duplication
+ * @fc: The new filesystem context.
+ * @src_fc: The source filesystem context being duplicated.
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ struct smack_fs_context *dst, *src = src_fc->security;
+ int i;
+
+ dst = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+ if (!dst)
+ goto nomem;
+ fc->security = dst;
+
+ dst->sbsp = kmemdup(src->sbsp, sizeof(struct superblock_smack),
+ GFP_KERNEL);
+ if (!dst->sbsp)
+ goto nomem_free;
+
+ for (i = 0; i < ARRAY_SIZE(dst->ptrs); i++) {
+ if (src->ptrs[i]) {
+ dst->ptrs[i] = kstrdup(src->ptrs[i], GFP_KERNEL);
+ if (!dst->ptrs[i])
+ goto nomem_free;
+ }
+ }
+
+ return 0;
+
+nomem_free:
+ smack_fs_context_free(fc);
+nomem:
+ return -ENOMEM;
+}
+
+/**
+ * smack_fs_context_parse_option - Parse a single mount option
+ * @fc: The new filesystem context being constructed.
+ * @p: The option text buffer.
+ * @len: The length of the text.
+ *
+ * Returns 0 on success or -EPERM, -EINVAL or -ENOMEM on error.
+ */
+static int smack_fs_context_parse_option(struct fs_context *fc, char *p, size_t len)
+{
+ struct smack_fs_context *ctx = fc->security;
+ substring_t args[MAX_OPT_ARGS];
+ int rc = -ENOMEM;
+ int token;
+
+ /* Unprivileged mounts don't get to specify Smack values. */
+ if (!smack_privileged(CAP_MAC_ADMIN))
+ return -EPERM;
+
+ token = match_token(p, smk_mount_tokens, args);
+ switch (token) {
+ case Opt_fsdefault:
+ if (ctx->fsdefault)
+ goto error_dup;
+ ctx->fsdefault = match_strdup(&args[0]);
+ if (!ctx->fsdefault)
+ goto error;
+ break;
+ case Opt_fsfloor:
+ if (ctx->fsfloor)
+ goto error_dup;
+ ctx->fsfloor = match_strdup(&args[0]);
+ if (!ctx->fsfloor)
+ goto error;
+ break;
+ case Opt_fshat:
+ if (ctx->fshat)
+ goto error_dup;
+ ctx->fshat = match_strdup(&args[0]);
+ if (!ctx->fshat)
+ goto error;
+ break;
+ case Opt_fsroot:
+ if (ctx->fsroot)
+ goto error_dup;
+ ctx->fsroot = match_strdup(&args[0]);
+ if (!ctx->fsroot)
+ goto error;
+ break;
+ case Opt_fstransmute:
+ if (ctx->fstransmute)
+ goto error_dup;
+ ctx->fstransmute = match_strdup(&args[0]);
+ if (!ctx->fstransmute)
+ goto error;
+ break;
+ default:
+ pr_warn("Smack: unknown mount option\n");
+ goto error_inval;
+ }
+
+ return 0;
+
+error_dup:
+ pr_warn("Smack: duplicate mount option\n");
+error_inval:
+ rc = -EINVAL;
+error:
+ return rc;
+}
+
+/**
+ * smack_fs_context_validate - Validate the filesystem context security data
+ * @fc: The filesystem context.
+ *
+ * Returns 0 on success or a negative error code if a label cannot be imported.
+ */
+static int smack_fs_context_validate(struct fs_context *fc)
+{
+ struct smack_fs_context *ctx = fc->security;
+ struct superblock_smack *sbsp = ctx->sbsp;
+ struct inode_smack *isp = ctx->isp;
+ struct smack_known *skp;
+
+ if (ctx->fsdefault) {
+ skp = smk_import_entry(ctx->fsdefault, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_default = skp;
+ }
+
+ if (ctx->fsfloor) {
+ skp = smk_import_entry(ctx->fsfloor, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_floor = skp;
+ }
+
+ if (ctx->fshat) {
+ skp = smk_import_entry(ctx->fshat, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_hat = skp;
+ }
+
+ if (ctx->fsroot || ctx->fstransmute) {
+ skp = smk_import_entry(ctx->fstransmute ?: ctx->fsroot, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_root = skp;
+ ctx->transmute = !!ctx->fstransmute;
+ }
+
+ isp->smk_inode = sbsp->smk_root;
+ return 0;
+}
+
+/**
+ * smack_sb_get_tree - Assign the context to a newly created superblock
+ * @fc: The new filesystem context.
+ *
+ * Always returns 0 at present.
+ */
+static int smack_sb_get_tree(struct fs_context *fc)
+{
+ struct smack_fs_context *ctx = fc->security;
+ struct superblock_smack *sbsp = ctx->sbsp;
+ struct dentry *root = fc->root;
+ struct inode *inode = d_backing_inode(root);
+ struct super_block *sb = root->d_sb;
+ struct inode_smack *isp;
+ bool transmute = ctx->transmute;
+
+ if (sb->s_security)
+ return 0;
+
+ if (!smack_privileged(CAP_MAC_ADMIN)) {
+ /*
+ * For a handful of fs types with no user-controlled
+ * backing store it's okay to trust security labels
+ * in the filesystem. The rest are untrusted.
+ */
+ if (fc->user_ns != &init_user_ns &&
+ sb->s_magic != SYSFS_MAGIC && sb->s_magic != TMPFS_MAGIC &&
+ sb->s_magic != RAMFS_MAGIC) {
+ transmute = true;
+ sbsp->smk_flags |= SMK_SB_UNTRUSTED;
+ }
+ }
+
+ sbsp->smk_flags |= SMK_SB_INITIALIZED;
+ sb->s_security = sbsp;
+ ctx->sbsp = NULL;
+
+ /* Initialize the root inode. */
+ isp = inode->i_security;
+ if (isp == NULL) {
+ isp = ctx->isp;
+ ctx->isp = NULL;
+ inode->i_security = isp;
+ } else
+ isp->smk_inode = sbsp->smk_root;
+
+ if (transmute)
+ isp->smk_flags |= SMK_INODE_TRANSMUTE;
+
+ return 0;
+}
/*
* Superblock Hooks.
@@ -4628,6 +4930,13 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(ptrace_traceme, smack_ptrace_traceme),
LSM_HOOK_INIT(syslog, smack_syslog),
+ LSM_HOOK_INIT(fs_context_alloc, smack_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, smack_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, smack_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_option, smack_fs_context_parse_option),
+ LSM_HOOK_INIT(fs_context_validate, smack_fs_context_validate),
+ LSM_HOOK_INIT(sb_get_tree, smack_sb_get_tree),
+
LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),
Implement the security hook to check the creation of a new mountpoint for
Tomoyo.
As far as I can tell, Tomoyo doesn't make use of the mount data or parse
any mount options, so I haven't implemented any of the fs_context hooks for
it.
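
The new hook is handed MNT_* mount flags, whereas the existing Tomoyo ACL
check works in terms of MS_* flags, so the patch translates one set into the
other. A table-driven userspace sketch of that translation is shown here for
illustration only. The DEMO_* constants are copies of what I believe the
kernel values to be, and demo_mnt_to_ms() is a hypothetical name, not
something this patch adds.

```c
#include <stddef.h>

/* Copies of the kernel's MNT_* values, for illustration only. */
#define DEMO_MNT_NOSUID		0x01
#define DEMO_MNT_NODEV		0x02
#define DEMO_MNT_NOEXEC		0x04
#define DEMO_MNT_NOATIME	0x08
#define DEMO_MNT_NODIRATIME	0x10
#define DEMO_MNT_RELATIME	0x20
#define DEMO_MNT_READONLY	0x40

/* Copies of the MS_* values, for illustration only. */
#define DEMO_MS_RDONLY		1UL
#define DEMO_MS_NOSUID		2UL
#define DEMO_MS_NODEV		4UL
#define DEMO_MS_NOEXEC		8UL
#define DEMO_MS_NOATIME		1024UL
#define DEMO_MS_NODIRATIME	2048UL
#define DEMO_MS_RELATIME	(1UL << 21)

struct demo_flag_map {
	unsigned int	mnt;	/* per-mount flag */
	unsigned long	ms;	/* equivalent mount(2) flag */
};

static const struct demo_flag_map demo_map[] = {
	{ DEMO_MNT_NOSUID,	DEMO_MS_NOSUID },
	{ DEMO_MNT_NODEV,	DEMO_MS_NODEV },
	{ DEMO_MNT_NOEXEC,	DEMO_MS_NOEXEC },
	{ DEMO_MNT_NOATIME,	DEMO_MS_NOATIME },
	{ DEMO_MNT_NODIRATIME,	DEMO_MS_NODIRATIME },
	{ DEMO_MNT_RELATIME,	DEMO_MS_RELATIME },
	{ DEMO_MNT_READONLY,	DEMO_MS_RDONLY },
};

/* Translate MNT_* bits to MS_* bits; bits with no entry are dropped. */
unsigned long demo_mnt_to_ms(unsigned int mnt_flags)
{
	unsigned long ms_flags = 0;
	size_t i;

	for (i = 0; i < sizeof(demo_map) / sizeof(demo_map[0]); i++)
		if (mnt_flags & demo_map[i].mnt)
			ms_flags |= demo_map[i].ms;
	return ms_flags;
}
```

The patch itself uses a run of if-statements rather than a table; the table
form just makes the mapping easier to audit at a glance.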
Signed-off-by: David Howells <[email protected]>
cc: Tetsuo Handa <[email protected]>
cc: [email protected]
cc: [email protected]
---
security/tomoyo/common.h | 3 +++
security/tomoyo/mount.c | 42 ++++++++++++++++++++++++++++++++++++++++++
security/tomoyo/tomoyo.c | 15 +++++++++++++++
3 files changed, 60 insertions(+)
diff --git a/security/tomoyo/common.h b/security/tomoyo/common.h
index 539bcdd30bb8..e637ce73f7f9 100644
--- a/security/tomoyo/common.h
+++ b/security/tomoyo/common.h
@@ -971,6 +971,9 @@ int tomoyo_init_request_info(struct tomoyo_request_info *r,
const u8 index);
int tomoyo_mkdev_perm(const u8 operation, const struct path *path,
const unsigned int mode, unsigned int dev);
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+ const struct path *mountpoint,
+ unsigned int mnt_flags);
int tomoyo_mount_permission(const char *dev_name, const struct path *path,
const char *type, unsigned long flags,
void *data_page);
diff --git a/security/tomoyo/mount.c b/security/tomoyo/mount.c
index 7dc7f59b7dde..e2e1cb203775 100644
--- a/security/tomoyo/mount.c
+++ b/security/tomoyo/mount.c
@@ -6,6 +6,7 @@
*/
#include <linux/slab.h>
+#include <linux/fs_context.h>
#include <uapi/linux/mount.h>
#include "common.h"
@@ -236,3 +237,44 @@ int tomoyo_mount_permission(const char *dev_name, const struct path *path,
tomoyo_read_unlock(idx);
return error;
}
+
+/**
+ * tomoyo_mount_permission_fc - Check permission to create a new mount.
+ * @fc: Context describing the object to be mounted.
+ * @mountpoint: The target object to mount on.
+ * @mnt_flags: The MNT_* flags to be set on the mountpoint.
+ *
+ * Check the permission to create a mount of the object described in @fc. Note
+ * that the source object may be a newly created superblock or may be an
+ * existing one picked from the filesystem (bind mount).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+ const struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ struct tomoyo_request_info r;
+ unsigned int ms_flags = 0;
+ int error;
+ int idx;
+
+ if (tomoyo_init_request_info(&r, NULL, TOMOYO_MAC_FILE_MOUNT) ==
+ TOMOYO_CONFIG_DISABLED)
+ return 0;
+
+ /* Convert MNT_* flags to MS_* equivalents. */
+ if (mnt_flags & MNT_NOSUID) ms_flags |= MS_NOSUID;
+ if (mnt_flags & MNT_NODEV) ms_flags |= MS_NODEV;
+ if (mnt_flags & MNT_NOEXEC) ms_flags |= MS_NOEXEC;
+ if (mnt_flags & MNT_NOATIME) ms_flags |= MS_NOATIME;
+ if (mnt_flags & MNT_NODIRATIME) ms_flags |= MS_NODIRATIME;
+ if (mnt_flags & MNT_RELATIME) ms_flags |= MS_RELATIME;
+ if (mnt_flags & MNT_READONLY) ms_flags |= MS_RDONLY;
+
+ idx = tomoyo_read_lock();
+ error = tomoyo_mount_acl(&r, fc->source, mountpoint, fc->fs_type->name,
+ ms_flags);
+ tomoyo_read_unlock(idx);
+ return error;
+}
diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index 213b8c593668..31fd6bd4f657 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -391,6 +391,20 @@ static int tomoyo_path_chroot(const struct path *path)
return tomoyo_path_perm(TOMOYO_TYPE_CHROOT, path, NULL);
}
+/**
+ * tomoyo_sb_mountpoint - Target for security_sb_mountpoint().
+ * @fc: Context describing the object to be mounted.
+ * @mountpoint: The target object to mount on.
+ * @mnt_flags: Mountpoint specific options (as MNT_* flags).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+static int tomoyo_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ return tomoyo_mount_permission_fc(fc, mountpoint, mnt_flags);
+}
+
/**
* tomoyo_sb_mount - Target for security_sb_mount().
*
@@ -519,6 +533,7 @@ static struct security_hook_list tomoyo_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(path_chmod, tomoyo_path_chmod),
LSM_HOOK_INIT(path_chown, tomoyo_path_chown),
LSM_HOOK_INIT(path_chroot, tomoyo_path_chroot),
+ LSM_HOOK_INIT(sb_mountpoint, tomoyo_sb_mountpoint),
LSM_HOOK_INIT(sb_mount, tomoyo_sb_mount),
LSM_HOOK_INIT(sb_umount, tomoyo_sb_umount),
LSM_HOOK_INIT(sb_pivotroot, tomoyo_sb_pivotroot),
Introduce a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount. The context
is allocated at the beginning of the mount procedure and holds the following:
(1) Filesystem type.
(2) Namespaces.
(3) Device name.
(4) Superblock flags (MS_*).
(5) Security details.
(6) Filesystem-specific data, as set by the mount options.
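
To make the list above concrete, here is a rough userspace mock of how such a
context might be allocated, filled in and released, with the
filesystem-specific part wrapped around the generic part. It is only a sketch
under stated assumptions: every demo_* name is hypothetical, namespaces and
security data are reduced to placeholders, and calloc()/free() stand in for
the kernel allocators.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the generic context. */
struct demo_fs_context {
	const char	*fs_type;	/* (1) filesystem type name */
	const char	*user_ns;	/* (2) namespace, reduced to a label */
	char		*source;	/* (3) device name */
	unsigned int	sb_flags;	/* (4) superblock flags */
	void		*security;	/* (5) security details, unused here */
};

/* (6) Filesystem-specific data; the generic context must come first. */
struct demo_nfs_context {
	struct demo_fs_context	fc;
	unsigned int		rsize;
};

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Allocate ctx_size zeroed bytes with the generic part at the start. */
struct demo_fs_context *demo_new_context(const char *fs_type,
					 size_t ctx_size,
					 unsigned int sb_flags)
{
	struct demo_fs_context *fc;

	assert(ctx_size >= sizeof(*fc));
	fc = calloc(1, ctx_size);
	if (!fc)
		return NULL;
	fc->fs_type = fs_type;
	fc->user_ns = "init";
	fc->sb_flags = sb_flags;
	return fc;
}

/* Record the source once; a second attempt is refused. */
int demo_set_source(struct demo_fs_context *fc, const char *source)
{
	size_t len = strlen(source);
	char *copy;

	if (fc->source)
		return -EBUSY;
	copy = malloc(len + 1);
	if (!copy)
		return -ENOMEM;
	memcpy(copy, source, len + 1);	/* includes the NUL */
	fc->source = copy;
	return 0;
}

/* Release the context and everything it owns. */
void demo_put_context(struct demo_fs_context *fc)
{
	if (!fc)
		return;
	free(fc->source);
	free(fc);
}
```

Because the generic struct sits at offset zero, the filesystem can recover its
own wrapper from the generic pointer with the container_of-style macro.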
Signed-off-by: David Howells <[email protected]>
---
Documentation/filesystems/mounting.txt | 445 ++++++++++++++++++++++++++++++++
include/linux/fs_context.h | 76 +++++
2 files changed, 521 insertions(+)
create mode 100644 Documentation/filesystems/mounting.txt
create mode 100644 include/linux/fs_context.h
diff --git a/Documentation/filesystems/mounting.txt b/Documentation/filesystems/mounting.txt
new file mode 100644
index 000000000000..805135a66b64
--- /dev/null
+++ b/Documentation/filesystems/mounting.txt
@@ -0,0 +1,445 @@
+ ===================
+ FILESYSTEM MOUNTING
+ ===================
+
+CONTENTS
+
+ (1) Overview.
+
+ (2) The filesystem context.
+
+ (3) The filesystem context operations.
+
+ (4) Filesystem context security.
+
+ (5) VFS filesystem context operations.
+
+
+========
+OVERVIEW
+========
+
+The creation of new mounts is now to be done in a multistep process:
+
+ (1) Create a filesystem context.
+
+ (2) Parse the options and attach them to the context. Options may be passed
+ individually from userspace.
+
+ (3) Validate and pre-process the context.
+
+ (4) Get or create a superblock and mountable root.
+
+ (5) Perform the mount.
+
+ (6) Return an error message attached to the context.
+
+ (7) Destroy the context.
+
+To support this, the file_system_type struct gains two new fields:
+
+ unsigned short fs_context_size;
+
+which indicates the total amount of space that should be allocated for context
+data (see the Filesystem Context section), and:
+
+ int (*init_fs_context)(struct fs_context *fc, struct super_block *src_sb);
+
+which is invoked to set up the filesystem-specific parts of a filesystem
+context, including the additional space. The src_sb parameter is used to
+convey the superblock from which the filesystem may draw extra information
+(such as namespaces) for submount (FS_CONTEXT_FOR_SUBMOUNT) or reconfiguration
+(FS_CONTEXT_FOR_RECONFIGURE) purposes - otherwise it will be NULL.
+
+Note that security initialisation is done *after* the filesystem is called so
+that the namespaces may be adjusted first.
+
+And the super_operations struct gains one field:
+
+ int (*reconfigure) (struct super_block *, struct fs_context *);
+
+This shadows the ->remount_fs() operation and takes a prepared filesystem
+context instead of the mount flags and data page. It may modify the sb_flags
+in the context for the caller to pick up.
+
+[NOTE] reconfigure is intended as a replacement for remount_fs.
+
+
+======================
+THE FILESYSTEM CONTEXT
+======================
+
+The creation and reconfiguration of a superblock is governed by a filesystem
+context. This is represented by the fs_context structure:
+
+ struct fs_context {
+ const struct fs_context_operations *ops;
+ struct file_system_type *fs_type;
+ struct dentry *root;
+ struct user_namespace *user_ns;
+ struct net *net_ns;
+ const struct cred *cred;
+ char *source;
+ char *subtype;
+ void *security;
+ void *s_fs_info;
+ unsigned int sb_flags;
+ bool sloppy;
+ bool silent;
+ bool degraded;
+ bool drop_sb;
+ enum fs_context_purpose purpose : 8;
+ };
+
+When the VFS creates this, it allocates ->fs_context_size bytes (as specified
+by the file_system_type object) to hold both the fs_context struct and any
+extra data required by the filesystem. The fs_context struct is placed at the
+beginning of this space. Any extra space beyond that is for use by the
+filesystem. The filesystem should wrap the struct in its own, e.g.:
+
+ struct nfs_fs_context {
+ struct fs_context fc;
+ ...
+ };
+
+placing the fs_context struct first. container_of() can then be used. The
+file_system_type would be initialised thus:
+
+ struct file_system_type nfs = {
+ ...
+ .fs_context_size = sizeof(struct nfs_fs_context),
+ .init_fs_context = nfs_init_fs_context,
+ ...
+ };
+
+The fs_context fields are as follows:
+
+ (*) const struct fs_context_operations *ops
+
+ These are operations that can be done on a filesystem context (see
+ below). This must be set by the ->init_fs_context() file_system_type
+ operation.
+
+ (*) struct file_system_type *fs_type
+
+ A pointer to the file_system_type of the filesystem that is being
+ constructed or reconfigured. This retains a reference on the type owner.
+
+ (*) struct dentry *root
+
+ A pointer to the root of the mountable tree (and indirectly, the
+ superblock thereof). This is filled in by the ->get_tree() op.
+
+ (*) struct user_namespace *user_ns
+ (*) struct net *net_ns
+
+ These are a subset of the namespaces in use by the invoking process. They
+ retain references on each namespace. The subscribed namespaces may be
+ replaced by the filesystem to reflect other sources, such as the parent
+ mount superblock on an automount.
+
+ (*) const struct cred *cred
+
+ The mounter's credentials. This retains a reference on the credentials.
+
+ (*) char *source
+
+ This is the source to be mounted. It may be a block device
+ (e.g. /dev/sda1) or something more exotic, such as the "host:/path" that
+ NFS desires.
+
+ (*) char *subtype
+
+ This is a string to be added to the type displayed in /proc/mounts to
+ qualify it (used by FUSE). This is available for the filesystem to set if
+ desired.
+
+ (*) void *security
+
+ A place for the LSMs to hang their security data for the superblock. The
+ relevant security operations are described below.
+
+ (*) void *s_fs_info
+
+ The proposed s_fs_info for a new superblock, set in the superblock by
+ sget_fc(). This can be used to distinguish superblocks.
+
+ (*) unsigned int sb_flags
+
+ This holds the SB_* flags to be set in super_block::s_flags.
+
+ (*) bool sloppy
+ (*) bool silent
+
+ These are set if the sloppy or silent mount options are given.
+
+ [NOTE] sloppy is probably unnecessary when userspace passes over one
+ option at a time since the error can just be ignored if userspace deems it
+ to be unimportant.
+
+ [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
+
+ (*) bool degraded
+
+ This is set if any preallocated resources in the context have been used
+ up, thereby rendering it unreusable for the ->get_tree() op.
+
+ (*) bool drop_sb
+
+ This is set if a superblock reference needs to be deactivated when the
+ context is put.
+
+ (*) enum fs_context_purpose purpose
+
+ This indicates the purpose for which the context is intended. The
+ available values are:
+
+ FS_CONTEXT_FOR_USER_MOUNT -- New superblock for user-specified mount
+ FS_CONTEXT_FOR_KERNEL_MOUNT -- New superblock for kernel-internal mount
+ FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
+ FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
+
+The mount context is created by calling vfs_new_fs_context(),
+vfs_sb_reconfigure() or vfs_dup_fs_context() and is destroyed with
+put_fs_context(). Note that the structure is not refcounted.
+
+VFS, security and filesystem mount options are set individually with
+vfs_parse_fs_option(). Options provided by the old mount(2) system call as
+a page of data can be parsed with generic_parse_monolithic().
+
+When mounting, the filesystem is allowed to take data from any of the pointers
+and attach it to the superblock (or whatever), provided it clears the pointer
+in the mount context.
+
+The filesystem is also allowed to allocate resources and pin them with the
+mount context. For instance, NFS might pin the appropriate protocol version
+module.
+
+
+=================================
+THE FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+The filesystem context points to a table of operations:
+
+ struct fs_context_operations {
+ void (*free)(struct fs_context *fc);
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ int (*parse_source)(struct fs_context *fc);
+ int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
+ int (*parse_monolithic)(struct fs_context *fc, void *data);
+ int (*validate)(struct fs_context *fc);
+ int (*get_tree)(struct fs_context *fc);
+ };
+
+These operations are invoked by the various stages of the mount procedure to
+manage the filesystem context. They are as follows:
+
+ (*) void (*free)(struct fs_context *fc);
+
+ Called to clean up the filesystem-specific part of the filesystem context
+ when the context is destroyed. It should be aware that parts of the
+ context may have been removed and NULL'd out by ->get_tree().
+
+ (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+
+ Called when a filesystem context has been duplicated to get any refs or
+ copy any non-referenced resources held in the filesystem-specific part of
+ the filesystem context. An error may be returned to indicate failure to
+ do this.
+
+ [!] Note that even if this fails, put_fs_context() will be called
+ immediately thereafter, so ->dup() *must* make the
+ filesystem-specific part safe for ->free().
+
+ (*) int (*parse_source)(struct fs_context *fc);
+
+ Called when the source or device is specified for a filesystem context.
+ The string will have been stored in fc->source prior to calling. If
+ successful, 0 should be returned and a negative error code otherwise.
+
+ (*) int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
+
+ Called when an option is to be added to the filesystem context. opt points
+ to the option string, likely in "key[=val]" format, and len is its size.
+ VFS-specific options will have been weeded out and fc->sb_flags updated.
+ Security options will also have been weeded out and fc->security updated.
+
+ If successful, 0 should be returned and a negative error code otherwise.
+
+ (*) int (*parse_monolithic)(struct fs_context *fc, void *data);
+
+ Called when the mount(2) system call is invoked to pass the entire data
+ page in one go. If this is expected to be just a list of "key[=val]"
+ items separated by commas, then this may be set to NULL.
+
+ The return value is as for ->parse_option().
+
+ If the filesystem (eg. NFS) needs to examine the data first and then finds
+ it's the standard key-val list then it may pass it off to
+ generic_parse_monolithic().
+
+ (*) int (*validate)(struct fs_context *fc);
+
+ Called when all the options have been applied and the mount is about to
+ take place. It should check for inconsistencies from mount options and
+ it is also allowed to do preliminary resource acquisition. For instance,
+ the core NFS module could load the NFS protocol module here.
+
+ Note that if fc->purpose == FS_CONTEXT_FOR_RECONFIGURE, some of the
+ options necessary for a new mount may not be set.
+
+ The return value is as for ->parse_option().
+
+ (*) int (*get_tree)(struct fs_context *fc);
+
+ Called to get or create the mountable root and superblock, using the
+ information stored in the filesystem context (reconfiguration goes via a
+ different vector). It may detach any resources it desires from the
+ filesystem context and transfer them to the superblock it creates.
+
+ On success it should set fc->root to the mountable root and return 0. In
+ the case of an error, it should return a negative error code.
+
+
+===========================
+FILESYSTEM CONTEXT SECURITY
+===========================
+
+The filesystem context contains a security pointer that the LSMs can use for
+building up a security context for the superblock to be mounted. There are a
+number of operations used by the new mount code for this purpose:
+
+ (*) int security_fs_context_alloc(struct fs_context *fc,
+ struct super_block *src_sb);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. It should return 0 on success and a negative error
+ code on failure.
+
+ src_sb is non-NULL in the case of reconfiguration
+ (FS_CONTEXT_FOR_RECONFIGURE) in which case it indicates the superblock to
+ be reconfigured or in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in
+ which case it indicates the parent superblock.
+
+ (*) int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. The original filesystem context is pointed to by
+ src_fc and may be used for reference. It should return 0 on success and a
+ negative error code on failure.
+
+ (*) void security_fs_context_free(struct fs_context *fc);
+
+ Called to clean up anything attached to fc->security. Note that the
+ contents may have been transferred to a superblock and the pointer NULL'd
+ out during mount.
+
+ (*) int security_fs_context_parse_option(struct fs_context *fc, char *opt);
+
+ Called for each mount option. The arguments are as for the
+ ->parse_option() method. An active LSM may reject one with an error, pass
+ one over and return 0 or consume one and return 1. If consumed, the
+ option isn't passed on to the filesystem.
+
+ (*) int security_sb_get_tree(struct fs_context *fc);
+
+ Called during the mount procedure to verify that the specified superblock
+ is allowed to be mounted and to transfer the security data there. It
+ should return 0 or a negative error code.
+
+ [NOTE] Should I add a security_fs_context_validate() operation so that the
+ LSM has the opportunity to allocate stuff and check the options as a
+ whole?
+
+ (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint);
+
+ Called during the mount procedure to verify that the root dentry attached
+ to the context is permitted to be attached to the specified mountpoint.
+ It should return 0 on success and a negative error code on failure.
+
+
+=================================
+VFS FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+There are three operations for creating a filesystem context and
+one for destroying a context:
+
+ (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct super_block *src_sb,
+ unsigned int sb_flags);
+
+ Create a filesystem context for a given filesystem type. This allocates
+ the filesystem context, sets the flags, initialises the security and calls
+ fs_type->init_fs_context() to initialise the filesystem context.
+
+ src_sb can be NULL or it may indicate a superblock that is going to be
+ reconfigured (FS_CONTEXT_FOR_RECONFIGURE) or a superblock that is the
+ parent of a submount (FS_CONTEXT_FOR_SUBMOUNT). This superblock is
+ provided as a source of namespace information.
+
+ (*) struct fs_context *vfs_sb_reconfigure(struct vfsmount *mnt,
+ unsigned int sb_flags);
+
+ Create a filesystem context from the same filesystem as an extant mount
+ and initialise the mount parameters from the superblock underlying that
+ mount. This is for use by superblock parameter reconfiguration.
+
+ (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
+
+ Duplicate a filesystem context, copying any options noted and duplicating
+ or additionally referencing any resources held therein. This is available
+ for use where a filesystem has to get a mount within a mount, such as NFS4
+ does by internally mounting the root of the target server and then doing a
+ private pathwalk to the target directory.
+
+ (*) void put_fs_context(struct fs_context *fc);
+
+ Destroy a filesystem context, releasing any resources it holds. This
+ calls the ->free() operation. This is intended to be called by anyone who
+ created a filesystem context.
+
+ [!] filesystem contexts are not refcounted, so this causes unconditional
+ destruction.
+
+In all the above operations, apart from the put op, the return is a mount
+context pointer or a negative error code.
+
+For the remaining operations, if an error occurs, a negative error code will be
+returned.
+
+ (*) int vfs_get_tree(struct fs_context *fc);
+
+ Get or create the mountable root and superblock, using the parameters in
+ the filesystem context to select/configure the superblock. This invokes
+ the ->validate() op and then the ->get_tree() op.
+
+ [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
+ ->reconfigure().
+
+ (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+
+ Create a mount given the parameters in the specified filesystem context.
+ Note that this does not attach the mount to anything.
+
+ (*) int vfs_set_fs_source(struct fs_context *fc, char *source);
+
+ Supply the source name or device name for the mount. This may cause the
+ filesystem to access the device.
+
+ (*) int vfs_parse_fs_option(struct fs_context *fc, char *data);
+
+ Supply a single mount option to the filesystem context. The mount option
+ should likely be in a "key[=val]" string form. The option is first
+ checked to see if it corresponds to a standard mount flag (in which case
+ it is used to set an SB_xxx flag and consumed) or a security option (in
+ which case the LSM consumes it) before it is passed on to the filesystem.
+
+ (*) int generic_parse_monolithic(struct fs_context *fc, void *data);
+
+ Parse a sys_mount() data page, assuming the form to be a text list
+ consisting of key[=val] options separated by commas. Each item in the
+ list is passed to vfs_parse_fs_option(). This is the default when the
+ ->parse_monolithic() operation is NULL.
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
new file mode 100644
index 000000000000..732a11898242
--- /dev/null
+++ b/include/linux/fs_context.h
@@ -0,0 +1,76 @@
+/* Filesystem superblock creation and reconfiguration context.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FS_CONTEXT_H
+#define _LINUX_FS_CONTEXT_H
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+
+struct cred;
+struct dentry;
+struct file_operations;
+struct file_system_type;
+struct mnt_namespace;
+struct net;
+struct pid_namespace;
+struct super_block;
+struct user_namespace;
+struct vfsmount;
+
+enum fs_context_purpose {
+ FS_CONTEXT_FOR_USER_MOUNT, /* New superblock for user-specified mount */
+ FS_CONTEXT_FOR_KERNEL_MOUNT, /* New superblock for kernel-internal mount */
+ FS_CONTEXT_FOR_SUBMOUNT, /* New superblock for automatic submount */
+ FS_CONTEXT_FOR_RECONFIGURE, /* Superblock reconfiguration (remount) */
+};
+
+/*
+ * Filesystem context as allocated and constructed by the ->init_fs_context()
+ * file_system_type operation. The size of the object allocated is specified
+ * in struct file_system_type::fs_context_size and this must include sufficient
+ * space for the fs_context struct.
+ *
+ * Superblock creation fills in ->root whereas reconfiguration begins with this
+ * already set.
+ *
+ * See Documentation/filesystems/mounting.txt
+ */
+struct fs_context {
+ const struct fs_context_operations *ops;
+ struct file_system_type *fs_type;
+ struct dentry *root; /* The root and superblock */
+ struct user_namespace *user_ns; /* The user namespace for this mount */
+ struct net *net_ns; /* The network namespace for this mount */
+ const struct cred *cred; /* The mounter's credentials */
+ char *source; /* The source name (eg. device) */
+ char *subtype; /* The subtype to set on the superblock */
+ void *security; /* The LSM context */
+ void *s_fs_info; /* Proposed s_fs_info */
+ unsigned int sb_flags; /* Proposed superblock flags (SB_*) */
+ bool sloppy; /* Unrecognised options are okay */
+ bool silent;
+ bool degraded; /* True if the context can't be reused */
+ bool drop_sb; /* T if need to drop an SB reference */
+ enum fs_context_purpose purpose : 8;
+};
+
+struct fs_context_operations {
+ void (*free)(struct fs_context *fc);
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ int (*parse_source)(struct fs_context *fc);
+ int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
+ int (*parse_monolithic)(struct fs_context *fc, void *data);
+ int (*validate)(struct fs_context *fc);
+ int (*get_tree)(struct fs_context *fc);
+};
+
+#endif /* _LINUX_FS_CONTEXT_H */
Only the mount namespace code that implements mount(2) should be using the
MS_* flags. Suppress them inside the kernel unless uapi/linux/mount.h is
included.
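
To make the intended split concrete, here is a small userspace sketch of
the boundary: the mount(2) entry code is the only place that sees MS_*
values, and it hands the filesystem a set of superblock flags instead. The
MS_* values are those listed in include/uapi/linux/mount.h in this patch;
the EX_SB_* values and both function names are assumptions made for the
sketch and do not claim to match the kernel's SB_* definitions.

```c
#include <assert.h>

/* mount(2) flag values, as listed in include/uapi/linux/mount.h. */
#define MS_RDONLY	1
#define MS_SYNCHRONOUS	16
#define MS_REMOUNT	32
#define MS_BIND		4096

/* Hypothetical superblock flag values, chosen for this sketch only. */
#define EX_SB_RDONLY		0x01u
#define EX_SB_SYNCHRONOUS	0x10u

/* Core mount code: translate once, dropping flags that are not sb flags. */
static unsigned int example_ms_to_sb_flags(unsigned long ms_flags)
{
	unsigned int sb_flags = 0;

	if (ms_flags & MS_RDONLY)
		sb_flags |= EX_SB_RDONLY;
	if (ms_flags & MS_SYNCHRONOUS)
		sb_flags |= EX_SB_SYNCHRONOUS;

	/* Nothing outside the superblock flag set should leak through. */
	assert(!(sb_flags & ~(EX_SB_RDONLY | EX_SB_SYNCHRONOUS)));
	return sb_flags;
}

/* Filesystem code: tests superblock flags only, never MS_* values. */
static int example_fs_is_readonly(unsigned int sb_flags)
{
	return (sb_flags & EX_SB_RDONLY) != 0;
}
```

Operation selectors such as MS_REMOUNT and MS_BIND describe what mount(2)
should do rather than a property of the superblock, so in this sketch they
are dropped at the boundary and never reach the filesystem.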
Signed-off-by: David Howells <[email protected]>
---
arch/arc/kernel/setup.c | 1 +
arch/arm/kernel/atags_parse.c | 1 +
arch/sh/kernel/setup.c | 1 +
arch/sparc/kernel/setup_32.c | 1 +
arch/sparc/kernel/setup_64.c | 1 +
arch/x86/kernel/setup.c | 1 +
drivers/base/devtmpfs.c | 1 +
fs/f2fs/super.c | 2 +
fs/namespace.c | 1 +
fs/pnode.c | 1 +
fs/super.c | 1 +
include/uapi/linux/fs.h | 56 ++++------------------------------------
include/uapi/linux/mount.h | 58 +++++++++++++++++++++++++++++++++++++++++
init/do_mounts.c | 1 +
init/do_mounts_initrd.c | 1 +
security/apparmor/lsm.c | 1 +
security/apparmor/mount.c | 1 +
security/selinux/hooks.c | 1 +
security/tomoyo/mount.c | 1 +
19 files changed, 80 insertions(+), 52 deletions(-)
create mode 100644 include/uapi/linux/mount.h
diff --git a/arch/arc/kernel/setup.c b/arch/arc/kernel/setup.c
index b2cae79a25d7..714dc5c2baf1 100644
--- a/arch/arc/kernel/setup.c
+++ b/arch/arc/kernel/setup.c
@@ -19,6 +19,7 @@
#include <linux/of_fdt.h>
#include <linux/of.h>
#include <linux/cache.h>
+#include <uapi/linux/mount.h>
#include <asm/sections.h>
#include <asm/arcregs.h>
#include <asm/tlb.h>
diff --git a/arch/arm/kernel/atags_parse.c b/arch/arm/kernel/atags_parse.c
index c10a3e8ee998..a8a4333929f5 100644
--- a/arch/arm/kernel/atags_parse.c
+++ b/arch/arm/kernel/atags_parse.c
@@ -24,6 +24,7 @@
#include <linux/root_dev.h>
#include <linux/screen_info.h>
#include <linux/memblock.h>
+#include <uapi/linux/mount.h>
#include <asm/setup.h>
#include <asm/system_info.h>
diff --git a/arch/sh/kernel/setup.c b/arch/sh/kernel/setup.c
index d34e998b809f..d60c7f794d7a 100644
--- a/arch/sh/kernel/setup.c
+++ b/arch/sh/kernel/setup.c
@@ -33,6 +33,7 @@
#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/uaccess.h>
+#include <uapi/linux/mount.h>
#include <asm/io.h>
#include <asm/page.h>
#include <asm/elf.h>
diff --git a/arch/sparc/kernel/setup_32.c b/arch/sparc/kernel/setup_32.c
index 13664c377196..7df3d704284c 100644
--- a/arch/sparc/kernel/setup_32.c
+++ b/arch/sparc/kernel/setup_32.c
@@ -34,6 +34,7 @@
#include <linux/kdebug.h>
#include <linux/export.h>
#include <linux/start_kernel.h>
+#include <uapi/linux/mount.h>
#include <asm/io.h>
#include <asm/processor.h>
diff --git a/arch/sparc/kernel/setup_64.c b/arch/sparc/kernel/setup_64.c
index 7944b3ca216a..206bf81eedaf 100644
--- a/arch/sparc/kernel/setup_64.c
+++ b/arch/sparc/kernel/setup_64.c
@@ -33,6 +33,7 @@
#include <linux/module.h>
#include <linux/start_kernel.h>
#include <linux/bootmem.h>
+#include <uapi/linux/mount.h>
#include <asm/io.h>
#include <asm/processor.h>
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6285697b6e56..29b43f69ae55 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
#include <linux/init_ohci1394_dma.h>
#include <linux/kvm_para.h>
#include <linux/dma-contiguous.h>
+#include <uapi/linux/mount.h>
#include <linux/errno.h>
#include <linux/kernel.h>
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index f7768077e817..79a235184fb5 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -25,6 +25,7 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/kthread.h>
+#include <uapi/linux/mount.h>
#include "base.h"
static struct task_struct *thread;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 42d564c5ccd0..a31cc49b7295 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1450,7 +1450,7 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
err = dquot_suspend(sb, -1);
if (err < 0)
goto restore_opts;
- } else if (f2fs_readonly(sb) && !(*flags & MS_RDONLY)) {
+ } else if (f2fs_readonly(sb) && !(*flags & SB_RDONLY)) {
/* dquot_resume needs RW */
sb->s_flags &= ~SB_RDONLY;
if (sb_any_quota_suspended(sb)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index 6f720ebca133..3f98e1a36b84 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -26,6 +26,7 @@
#include <linux/bootmem.h>
#include <linux/task_work.h>
#include <linux/sched/task.h>
+#include <uapi/linux/mount.h>
#include "pnode.h"
#include "internal.h"
diff --git a/fs/pnode.c b/fs/pnode.c
index 53d411a371ce..1100e810d855 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -10,6 +10,7 @@
#include <linux/mount.h>
#include <linux/fs.h>
#include <linux/nsproxy.h>
+#include <uapi/linux/mount.h>
#include "internal.h"
#include "pnode.h"
diff --git a/fs/super.c b/fs/super.c
index 5fa9a8d8d865..f7c5629bbbda 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -35,6 +35,7 @@
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
+#include <uapi/linux/mount.h>
#include "internal.h"
static int thaw_super_locked(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index d2a8313fabd7..5da6c2d96af5 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -14,6 +14,11 @@
#include <linux/ioctl.h>
#include <linux/types.h>
+/* Use of MS_* flags within the kernel is restricted to core mount(2) code. */
+#if !defined(__KERNEL__)
+#include <linux/mount.h>
+#endif
+
/*
* It's silly to have NR_OPEN bigger than NR_FILE, but you can change
* the file limit at runtime and only root can increase the per-process
@@ -101,57 +106,6 @@ struct inodes_stat_t {
#define NR_FILE 8192 /* this can well be larger on a larger system */
-
-/*
- * These are the fs-independent mount-flags: up to 32 flags are supported
- */
-#define MS_RDONLY 1 /* Mount read-only */
-#define MS_NOSUID 2 /* Ignore suid and sgid bits */
-#define MS_NODEV 4 /* Disallow access to device special files */
-#define MS_NOEXEC 8 /* Disallow program execution */
-#define MS_SYNCHRONOUS 16 /* Writes are synced at once */
-#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
-#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
-#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
-#define MS_NOATIME 1024 /* Do not update access times. */
-#define MS_NODIRATIME 2048 /* Do not update directory access times */
-#define MS_BIND 4096
-#define MS_MOVE 8192
-#define MS_REC 16384
-#define MS_VERBOSE 32768 /* War is peace. Verbosity is silence.
- MS_VERBOSE is deprecated. */
-#define MS_SILENT 32768
-#define MS_POSIXACL (1<<16) /* VFS does not apply the umask */
-#define MS_UNBINDABLE (1<<17) /* change to unbindable */
-#define MS_PRIVATE (1<<18) /* change to private */
-#define MS_SLAVE (1<<19) /* change to slave */
-#define MS_SHARED (1<<20) /* change to shared */
-#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
-#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
-#define MS_I_VERSION (1<<23) /* Update inode I_version field */
-#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
-#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
-
-/* These sb flags are internal to the kernel */
-#define MS_SUBMOUNT (1<<26)
-#define MS_NOREMOTELOCK (1<<27)
-#define MS_NOSEC (1<<28)
-#define MS_BORN (1<<29)
-#define MS_ACTIVE (1<<30)
-#define MS_NOUSER (1<<31)
-
-/*
- * Superblock flags that can be altered by MS_REMOUNT
- */
-#define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
- MS_LAZYTIME)
-
-/*
- * Old magic mount flag and mask
- */
-#define MS_MGC_VAL 0xC0ED0000
-#define MS_MGC_MSK 0xffff0000
-
/*
* Structure for FS_IOC_FSGETXATTR[A] and FS_IOC_FSSETXATTR.
*/
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
new file mode 100644
index 000000000000..3f9ec42510b0
--- /dev/null
+++ b/include/uapi/linux/mount.h
@@ -0,0 +1,58 @@
+#ifndef _UAPI_LINUX_MOUNT_H
+#define _UAPI_LINUX_MOUNT_H
+
+/*
+ * These are the fs-independent mount-flags: up to 32 flags are supported
+ *
+ * Usage of these is restricted within the kernel to core mount(2) code and
+ * callers of sys_mount() only. Filesystems should be using the SB_*
+ * equivalent instead.
+ */
+#define MS_RDONLY 1 /* Mount read-only */
+#define MS_NOSUID 2 /* Ignore suid and sgid bits */
+#define MS_NODEV 4 /* Disallow access to device special files */
+#define MS_NOEXEC 8 /* Disallow program execution */
+#define MS_SYNCHRONOUS 16 /* Writes are synced at once */
+#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
+#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
+#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_NOATIME 1024 /* Do not update access times. */
+#define MS_NODIRATIME 2048 /* Do not update directory access times */
+#define MS_BIND 4096
+#define MS_MOVE 8192
+#define MS_REC 16384
+#define MS_VERBOSE 32768 /* War is peace. Verbosity is silence.
+ MS_VERBOSE is deprecated. */
+#define MS_SILENT 32768
+#define MS_POSIXACL (1<<16) /* VFS does not apply the umask */
+#define MS_UNBINDABLE (1<<17) /* change to unbindable */
+#define MS_PRIVATE (1<<18) /* change to private */
+#define MS_SLAVE (1<<19) /* change to slave */
+#define MS_SHARED (1<<20) /* change to shared */
+#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
+#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
+#define MS_I_VERSION (1<<23) /* Update inode I_version field */
+#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
+
+/* These sb flags are internal to the kernel */
+#define MS_SUBMOUNT (1<<26)
+#define MS_NOREMOTELOCK (1<<27)
+#define MS_NOSEC (1<<28)
+#define MS_BORN (1<<29)
+#define MS_ACTIVE (1<<30)
+#define MS_NOUSER (1<<31)
+
+/*
+ * Superblock flags that can be altered by MS_REMOUNT
+ */
+#define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
+ MS_LAZYTIME)
+
+/*
+ * Old magic mount flag and mask
+ */
+#define MS_MGC_VAL 0xC0ED0000
+#define MS_MGC_MSK 0xffff0000
+
+#endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 2c71dabe5626..ea6f21bb9440 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -32,6 +32,7 @@
#include <linux/nfs_fs.h>
#include <linux/nfs_fs_sb.h>
#include <linux/nfs_mount.h>
+#include <uapi/linux/mount.h>
#include "do_mounts.h"
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index 5a91aefa7305..65de0412f80f 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -18,6 +18,7 @@
#include <linux/sched.h>
#include <linux/freezer.h>
#include <linux/kmod.h>
+#include <uapi/linux/mount.h>
#include "do_mounts.h"
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index ce2b89e9ad94..9ebc9e9c3854 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -24,6 +24,7 @@
#include <linux/audit.h>
#include <linux/user_namespace.h>
#include <net/sock.h>
+#include <uapi/linux/mount.h>
#include "include/apparmor.h"
#include "include/apparmorfs.h"
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index 6e8c7ac0b33d..45bb769d6cd7 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -15,6 +15,7 @@
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <uapi/linux/mount.h>
#include "include/apparmor.h"
#include "include/audit.h"
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 4cafe6a19167..1f0316bf7e29 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -88,6 +88,7 @@
#include <linux/msg.h>
#include <linux/shm.h>
#include <linux/bpf.h>
+#include <uapi/linux/mount.h>
#include "avc.h"
#include "objsec.h"
diff --git a/security/tomoyo/mount.c b/security/tomoyo/mount.c
index 807fd91dbb54..7dc7f59b7dde 100644
--- a/security/tomoyo/mount.c
+++ b/security/tomoyo/mount.c
@@ -6,6 +6,7 @@
*/
#include <linux/slab.h>
+#include <uapi/linux/mount.h>
#include "common.h"
/* String table for special mount operations. */
Add LSM hooks for use by the filesystem context code. This includes:
(1) Hooks to handle allocation, duplication and freeing of the security
record attached to a filesystem context.
(2) A hook to snoop a mount option in key[=val] form. If the LSM decides
it wants to handle it, it can suppress the option being passed to the
filesystem. Note that 'val' may include commas and binary data with
the fsopen patch.
(3) A hook to transfer the security from the context to a newly created
superblock.
(4) A hook to rule on whether a path point can be used as a mountpoint.
These are intended to replace:
security_sb_copy_data
security_sb_kern_mount
security_sb_mount
security_sb_set_mnt_opts
security_sb_clone_mnt_opts
security_sb_parse_opts_str
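
The return convention for the option-snooping hook in (2) can be sketched
in userspace as follows: a negative value rejects the option, 1 means the
LSM consumed it, and 0 passes it on to the filesystem. The struct and all
three function names are hypothetical, and the choice of "context=" as the
option the LSM claims is only an assumption for the example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Stand-in context that just counts who accepted each option. */
struct fs_context {
	int lsm_opts;
	int fs_opts;
};

/* Hypothetical LSM hook: <0 = reject, 1 = consumed, 0 = pass on. */
static int example_lsm_parse_option(struct fs_context *fc,
				    const char *opt, size_t len)
{
	if (strncmp(opt, "context=", 8) == 0) {
		if (len == 8)
			return -EINVAL;	/* key given without a value */
		fc->lsm_opts++;
		return 1;
	}
	return 0;
}

/* Hypothetical filesystem ->parse_option(): accepts whatever it is given. */
static int example_fs_parse_option(struct fs_context *fc,
				   const char *opt, size_t len)
{
	(void)opt;
	(void)len;
	fc->fs_opts++;
	return 0;
}

/* Dispatcher: offer the option to the LSM first, then to the filesystem. */
static int example_parse_option(struct fs_context *fc, const char *opt)
{
	size_t len;
	int ret;

	assert(fc != NULL && opt != NULL);
	len = strlen(opt);

	ret = example_lsm_parse_option(fc, opt, len);
	if (ret < 0)
		return ret;	/* rejected by the LSM */
	if (ret == 1)
		return 0;	/* consumed; the filesystem never sees it */
	return example_fs_parse_option(fc, opt, len);
}
```

Note that strlen() is only adequate here because the sketch uses text
options; since 'val' may contain binary data, a real caller would need to
carry the length separately.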
Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---
include/linux/lsm_hooks.h | 62 +++++++++++
include/linux/security.h | 44 ++++++++
security/security.c | 41 +++++++
security/selinux/hooks.c | 262 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 409 insertions(+)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 9d0b286f3dba..da20f90d40bb 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -76,6 +76,50 @@
* changes on the process such as clearing out non-inheritable signal
* state. This is called immediately after commit_creds().
*
+ * Security hooks for mount using fd context.
+ *
+ * @fs_context_alloc:
+ * Allocate and attach a security structure to fc->security. This pointer
+ * is initialised to NULL by the caller.
+ * @fc indicates the new filesystem context.
+ * @src_sb indicates the source superblock of a submount.
+ * @fs_context_dup:
+ * Allocate and attach a security structure to fc->security. This pointer
+ * is initialised to NULL by the caller.
+ * @fc indicates the new filesystem context.
+ * @src_fc indicates the original filesystem context.
+ * @fs_context_free:
+ * Clean up a filesystem context.
+ * @fc indicates the filesystem context.
+ * @fs_context_parse_option:
+ * Userspace provided an option to configure a superblock. The LSM may
+ * reject it with an error and may use it for itself, in which case it
+ * should return 1; otherwise it should return 0 to pass it on to the
+ * filesystem.
+ * @fc indicates the filesystem context.
+ * @opt indicates the option in "key[=val]" form. It is NUL-terminated,
+ * but val may be binary data.
+ * @len indicates the size of the option.
+ * @fs_context_validate:
+ * Validate the filesystem context preparatory to applying it. This is
+ * done after all the options have been parsed.
+ * @fc indicates the filesystem context.
+ * @sb_get_tree:
+ * Assign the security to a newly created superblock.
+ * @fc indicates the filesystem context.
+ * @fc->root indicates the root that will be mounted.
+ * @fc->root->d_sb points to the superblock.
+ * @sb_reconfigure:
+ * Apply reconfiguration to the security on a superblock.
+ * @fc indicates the filesystem context.
+ * @fc->root indicates a dentry in the mount.
+ * @fc->root->d_sb points to the superblock.
+ * @sb_mountpoint:
+ * Equivalent of sb_mount, but with an fs_context.
+ * @fc indicates the filesystem context.
+ * @mountpoint indicates the path on which the mount will take place.
+ * @mnt_flags indicates the MNT_* flags specified.
+ *
* Security hooks for filesystem operations.
*
* @sb_alloc_security:
@@ -1450,6 +1494,16 @@ union security_list_options {
void (*bprm_committing_creds)(struct linux_binprm *bprm);
void (*bprm_committed_creds)(struct linux_binprm *bprm);
+ int (*fs_context_alloc)(struct fs_context *fc, struct super_block *src_sb);
+ int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_sc);
+ void (*fs_context_free)(struct fs_context *fc);
+ int (*fs_context_parse_option)(struct fs_context *fc, char *opt, size_t len);
+ int (*fs_context_validate)(struct fs_context *fc);
+ int (*sb_get_tree)(struct fs_context *fc);
+ void (*sb_reconfigure)(struct fs_context *fc);
+ int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags);
+
int (*sb_alloc_security)(struct super_block *sb);
void (*sb_free_security)(struct super_block *sb);
int (*sb_copy_data)(char *orig, char *copy);
@@ -1787,6 +1841,14 @@ struct security_hook_heads {
struct hlist_head bprm_check_security;
struct hlist_head bprm_committing_creds;
struct hlist_head bprm_committed_creds;
+ struct hlist_head fs_context_alloc;
+ struct hlist_head fs_context_dup;
+ struct hlist_head fs_context_free;
+ struct hlist_head fs_context_parse_option;
+ struct hlist_head fs_context_validate;
+ struct hlist_head sb_get_tree;
+ struct hlist_head sb_reconfigure;
+ struct hlist_head sb_mountpoint;
struct hlist_head sb_alloc_security;
struct hlist_head sb_free_security;
struct hlist_head sb_copy_data;
diff --git a/include/linux/security.h b/include/linux/security.h
index 200920f521a1..60a85bd9dfef 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -53,6 +53,7 @@ struct msg_msg;
struct xattr;
struct xfrm_sec_ctx;
struct mm_struct;
+struct fs_context;
/* If capable should audit the security request */
#define SECURITY_CAP_NOAUDIT 0
@@ -231,6 +232,15 @@ int security_bprm_set_creds(struct linux_binprm *bprm);
int security_bprm_check(struct linux_binprm *bprm);
void security_bprm_committing_creds(struct linux_binprm *bprm);
void security_bprm_committed_creds(struct linux_binprm *bprm);
+int security_fs_context_alloc(struct fs_context *fc, struct super_block *sb);
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc);
+void security_fs_context_free(struct fs_context *fc);
+int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len);
+int security_fs_context_validate(struct fs_context *fc);
+int security_sb_get_tree(struct fs_context *fc);
+void security_sb_reconfigure(struct fs_context *fc);
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags);
int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
int security_sb_copy_data(char *orig, char *copy);
@@ -539,6 +549,40 @@ static inline void security_bprm_committed_creds(struct linux_binprm *bprm)
{
}
+static inline int security_fs_context_alloc(struct fs_context *fc,
+ struct super_block *src_sb)
+{
+ return 0;
+}
+static inline int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ return 0;
+}
+static inline void security_fs_context_free(struct fs_context *fc)
+{
+}
+static inline int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ return 0;
+}
+static inline int security_fs_context_validate(struct fs_context *fc)
+{
+ return 0;
+}
+static inline int security_sb_get_tree(struct fs_context *fc)
+{
+ return 0;
+}
+static inline void security_sb_reconfigure(struct fs_context *fc)
+{
+}
+static inline int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ return 0;
+}
+
static inline int security_sb_alloc(struct super_block *sb)
{
return 0;
diff --git a/security/security.c b/security/security.c
index 7bc2fde023a7..42e4ea19b61c 100644
--- a/security/security.c
+++ b/security/security.c
@@ -358,6 +358,47 @@ void security_bprm_committed_creds(struct linux_binprm *bprm)
call_void_hook(bprm_committed_creds, bprm);
}
+int security_fs_context_alloc(struct fs_context *fc, struct super_block *src_sb)
+{
+ return call_int_hook(fs_context_alloc, 0, fc, src_sb);
+}
+
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ return call_int_hook(fs_context_dup, 0, fc, src_fc);
+}
+
+void security_fs_context_free(struct fs_context *fc)
+{
+ call_void_hook(fs_context_free, fc);
+}
+
+int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ return call_int_hook(fs_context_parse_option, 0, fc, opt, len);
+}
+
+int security_fs_context_validate(struct fs_context *fc)
+{
+ return call_int_hook(fs_context_validate, 0, fc);
+}
+
+int security_sb_get_tree(struct fs_context *fc)
+{
+ return call_int_hook(sb_get_tree, 0, fc);
+}
+
+void security_sb_reconfigure(struct fs_context *fc)
+{
+ call_void_hook(sb_reconfigure, fc);
+}
+
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ return call_int_hook(sb_mountpoint, 0, fc, mountpoint, mnt_flags);
+}
+
int security_sb_alloc(struct super_block *sb)
{
return call_int_hook(sb_alloc_security, 0, sb);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 1f0316bf7e29..969a2a0dc582 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -48,6 +48,7 @@
#include <linux/fdtable.h>
#include <linux/namei.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/netfilter_ipv4.h>
#include <linux/netfilter_ipv6.h>
#include <linux/tty.h>
@@ -2960,6 +2961,259 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
FILESYSTEM__UNMOUNT, NULL);
}
+/* fsopen mount context operations */
+
+static int selinux_fs_context_alloc(struct fs_context *fc,
+ struct super_block *src_sb)
+{
+ struct security_mnt_opts *opts;
+
+ opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+
+ fc->security = opts;
+ return 0;
+}
+
+static int selinux_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ const struct security_mnt_opts *src = src_fc->security;
+ struct security_mnt_opts *opts;
+ int i, n;
+
+ opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+ fc->security = opts;
+
+ if (!src || !src->num_mnt_opts)
+ return 0;
+ n = opts->num_mnt_opts = src->num_mnt_opts;
+
+ if (src->mnt_opts) {
+ opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
+ if (!opts->mnt_opts)
+ return -ENOMEM;
+
+ for (i = 0; i < n; i++) {
+ if (src->mnt_opts[i]) {
+ opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
+ GFP_KERNEL);
+ if (!opts->mnt_opts[i])
+ return -ENOMEM;
+ }
+ }
+ }
+
+ if (src->mnt_opts_flags) {
+ opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
+ n * sizeof(int), GFP_KERNEL);
+ if (!opts->mnt_opts_flags)
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static void selinux_fs_context_free(struct fs_context *fc)
+{
+ struct security_mnt_opts *opts = fc->security;
+
+ security_free_mnt_opts(opts);
+ fc->security = NULL;
+}
+
+static int selinux_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct security_mnt_opts *opts = fc->security;
+ substring_t args[MAX_OPT_ARGS];
+ unsigned int have;
+ char *c, **oo;
+ int token, ctx, i, *of;
+
+ token = match_token(opt, tokens, args);
+ if (token == Opt_error)
+ return 0; /* Doesn't belong to us. */
+
+ have = 0;
+ for (i = 0; i < opts->num_mnt_opts; i++)
+ have |= 1 << opts->mnt_opts_flags[i];
+ if (have & (1 << token))
+ return -EINVAL;
+
+ switch (token) {
+ case Opt_context:
+ if (have & (1 << Opt_defcontext))
+ goto incompatible;
+ ctx = CONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_fscontext:
+ ctx = FSCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_rootcontext:
+ ctx = ROOTCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_defcontext:
+ if (have & (1 << Opt_context))
+ goto incompatible;
+ ctx = DEFCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_labelsupport:
+ return 1;
+
+ default:
+ return -EINVAL;
+ }
+
+copy_context_string:
+ if (opts->num_mnt_opts > 3)
+ return -EINVAL;
+
+ of = krealloc(opts->mnt_opts_flags,
+ (opts->num_mnt_opts + 1) * sizeof(int), GFP_KERNEL);
+ if (!of)
+ return -ENOMEM;
+ of[opts->num_mnt_opts] = 0;
+ opts->mnt_opts_flags = of;
+
+ oo = krealloc(opts->mnt_opts,
+ (opts->num_mnt_opts + 1) * sizeof(char *), GFP_KERNEL);
+ if (!oo)
+ return -ENOMEM;
+ oo[opts->num_mnt_opts] = NULL;
+ opts->mnt_opts = oo;
+
+ c = match_strdup(&args[0]);
+ if (!c)
+ return -ENOMEM;
+ opts->mnt_opts[opts->num_mnt_opts] = c;
+ opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
+ opts->num_mnt_opts++;
+ return 1;
+
+incompatible:
+ return -EINVAL;
+}
+
+/*
+ * Validate the security parameters supplied for a reconfiguration/remount
+ * event.
+ */
+static int selinux_validate_for_sb_reconfigure(struct fs_context *fc)
+{
+ struct super_block *sb = fc->root->d_sb;
+ struct superblock_security_struct *sbsec = sb->s_security;
+ struct security_mnt_opts *opts = fc->security;
+ int rc, i, *flags;
+ char **mount_options;
+
+ if (!(sbsec->flags & SE_SBINITIALIZED))
+ return 0;
+
+ mount_options = opts->mnt_opts;
+ flags = opts->mnt_opts_flags;
+
+ for (i = 0; i < opts->num_mnt_opts; i++) {
+ u32 sid;
+
+ if (flags[i] == SBLABEL_MNT)
+ continue;
+
+ rc = security_context_str_to_sid(&selinux_state, mount_options[i],
+ &sid, GFP_KERNEL);
+ if (rc) {
+ pr_warn("SELinux: security_context_str_to_sid"
+ "(%s) failed for (dev %s, type %s) errno=%d\n",
+ mount_options[i], sb->s_id, sb->s_type->name, rc);
+ goto inval;
+ }
+
+ switch (flags[i]) {
+ case FSCONTEXT_MNT:
+ if (bad_option(sbsec, FSCONTEXT_MNT, sbsec->sid, sid))
+ goto bad_option;
+ break;
+ case CONTEXT_MNT:
+ if (bad_option(sbsec, CONTEXT_MNT, sbsec->mntpoint_sid, sid))
+ goto bad_option;
+ break;
+ case ROOTCONTEXT_MNT: {
+ struct inode_security_struct *root_isec;
+ root_isec = backing_inode_security(sb->s_root);
+
+ if (bad_option(sbsec, ROOTCONTEXT_MNT, root_isec->sid, sid))
+ goto bad_option;
+ break;
+ }
+ case DEFCONTEXT_MNT:
+ if (bad_option(sbsec, DEFCONTEXT_MNT, sbsec->def_sid, sid))
+ goto bad_option;
+ break;
+ default:
+ goto inval;
+ }
+ }
+
+ rc = 0;
+out:
+ return rc;
+
+bad_option:
+ pr_warn("SELinux: unable to change security options "
+ "during remount (dev %s, type=%s)\n",
+ sb->s_id, sb->s_type->name);
+inval:
+ rc = -EINVAL;
+ goto out;
+}
+
+/*
+ * Validate the security context assembled from the option data supplied to
+ * mount.
+ */
+static int selinux_fs_context_validate(struct fs_context *fc)
+{
+ if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)
+ return selinux_validate_for_sb_reconfigure(fc);
+ return 0;
+}
+
+/*
+ * Set the security context on a superblock.
+ */
+static int selinux_sb_get_tree(struct fs_context *fc)
+{
+ const struct cred *cred = current_cred();
+ struct common_audit_data ad;
+ int rc;
+
+ rc = selinux_set_mnt_opts(fc->root->d_sb, fc->security, 0, NULL);
+ if (rc)
+ return rc;
+
+ /* Allow all mounts performed by the kernel */
+ if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+ return 0;
+
+ ad.type = LSM_AUDIT_DATA_DENTRY;
+ ad.u.dentry = fc->root;
+ return superblock_has_perm(cred, fc->root->d_sb, FILESYSTEM__MOUNT, &ad);
+}
+
+static int selinux_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ const struct cred *cred = current_cred();
+
+ return path_has_perm(cred, mountpoint, FILE__MOUNTON);
+}
+
/* inode security operations */
static int selinux_inode_alloc_security(struct inode *inode)
@@ -6871,6 +7125,14 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),
+ LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_option, selinux_fs_context_parse_option),
+ LSM_HOOK_INIT(fs_context_validate, selinux_fs_context_validate),
+ LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
+ LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
+
LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
Implement hooks to check the creation of new mountpoints for AppArmor.
Unfortunately, the DFA evaluation puts the option data in last, after the
details of the mountpoint, so we have to cache the mount options in the
fs_context using those hooks till we get to the new mountpoint hook.
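
The caching step can be sketched in userspace roughly as follows. This is
only an illustration under stated assumptions: struct saved_opts and
save_option() are hypothetical names, realloc() stands in for krealloc(),
and the size field here counts only the bytes before the terminating NUL so
that each append overwrites the previous terminator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the options cached in the fs_context. */
struct saved_opts {
	char *buf;	/* NUL-terminated, space-separated options */
	size_t size;	/* bytes in buf, excluding the terminating NUL */
};

/* Append one option of length len; returns 0 on success, -1 on failure. */
static int save_option(struct saved_opts *so, const char *opt, size_t len)
{
	size_t space = so->size > 0 ? 1 : 0;	/* separator after the first */
	char *p, *q;

	p = realloc(so->buf, so->size + space + len + 1);
	if (!p)
		return -1;	/* the old buffer is still valid */

	q = p + so->size;	/* points at the old NUL, if any */
	if (space)
		*q++ = ' ';
	memcpy(q, opt, len);
	q[len] = '\0';

	so->buf = p;
	so->size += space + len;
	return 0;
}
```

Saving "ro" and then "noexec" leaves the single string "ro noexec" in the
buffer, ready to be handed to the matcher once the mountpoint is known.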
Signed-off-by: David Howells <[email protected]>
cc: John Johansen <[email protected]>
cc: [email protected]
cc: [email protected]
---
security/apparmor/include/mount.h | 11 +++++
security/apparmor/lsm.c | 80 +++++++++++++++++++++++++++++++++++++
security/apparmor/mount.c | 46 +++++++++++++++++++++
3 files changed, 135 insertions(+), 2 deletions(-)
diff --git a/security/apparmor/include/mount.h b/security/apparmor/include/mount.h
index 25d6067fa6ef..0441bfae30fa 100644
--- a/security/apparmor/include/mount.h
+++ b/security/apparmor/include/mount.h
@@ -16,6 +16,7 @@
#include <linux/fs.h>
#include <linux/path.h>
+#include <linux/fs_context.h>
#include "domain.h"
#include "policy.h"
@@ -27,7 +28,13 @@
#define AA_AUDIT_DATA 0x40
#define AA_MNT_CONT_MATCH 0x40
-#define AA_MS_IGNORE_MASK (MS_KERNMOUNT | MS_NOSEC | MS_ACTIVE | MS_BORN)
+#define AA_SB_IGNORE_MASK (SB_KERNMOUNT | SB_NOSEC | SB_ACTIVE | SB_BORN)
+
+struct apparmor_fs_context {
+ struct fs_context fc;
+ char *saved_options;
+ size_t saved_size;
+};
int aa_remount(struct aa_label *label, const struct path *path,
unsigned long flags, void *data);
@@ -45,6 +52,8 @@ int aa_move_mount(struct aa_label *label, const struct path *path,
int aa_new_mount(struct aa_label *label, const char *dev_name,
const struct path *path, const char *type, unsigned long flags,
void *data);
+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+ const struct path *mountpoint);
int aa_umount(struct aa_label *label, struct vfsmount *mnt, int flags);
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 9ebc9e9c3854..14398dec2e38 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -518,6 +518,78 @@ static int apparmor_file_mprotect(struct vm_area_struct *vma,
!(vma->vm_flags & VM_SHARED) ? MAP_PRIVATE : 0);
}
+static int apparmor_fs_context_alloc(struct fs_context *fc, struct super_block *src_sb)
+{
+ struct apparmor_fs_context *afc;
+
+ afc = kzalloc(sizeof(*afc), GFP_KERNEL);
+ if (!afc)
+ return -ENOMEM;
+
+ fc->security = afc;
+ return 0;
+}
+
+static int apparmor_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ fc->security = NULL;
+ return 0;
+}
+
+static void apparmor_fs_context_free(struct fs_context *fc)
+{
+ struct apparmor_fs_context *afc = fc->security;
+
+ if (afc) {
+ kfree(afc->saved_options);
+ kfree(afc);
+ }
+}
+
+/*
+ * As a temporary hack, we buffer all the options. The problem is that we need
+ * to pass them to the DFA evaluator *after* mount point parameters, which
+ * means deferring the entire check to the sb_mountpoint hook.
+ */
+static int apparmor_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct apparmor_fs_context *afc = fc->security;
+ size_t space = 0;
+ char *p, *q;
+
+ if (afc->saved_size > 0)
+ space = 1;
+
+ p = krealloc(afc->saved_options, afc->saved_size + space + len + 1, GFP_KERNEL);
+ if (!p)
+ return -ENOMEM;
+
+ q = p + afc->saved_size;
+ if (q != p)
+ *q++ = ' ';
+ memcpy(q, opt, len);
+ q += len;
+ *q = 0;
+
+ afc->saved_options = p;
+ afc->saved_size += space + len;
+ return 0;
+}
+
+static int apparmor_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ struct aa_label *label;
+ int error = 0;
+
+ label = __begin_current_label_crit_section();
+ if (!unconfined(label))
+ error = aa_new_mount_fc(label, fc, mountpoint);
+ __end_current_label_crit_section(label);
+
+ return error;
+}
+
static int apparmor_sb_mount(const char *dev_name, const struct path *path,
const char *type, unsigned long flags, void *data)
{
@@ -528,7 +600,7 @@ static int apparmor_sb_mount(const char *dev_name, const struct path *path,
if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
flags &= ~MS_MGC_MSK;
- flags &= ~AA_MS_IGNORE_MASK;
+ flags &= ~AA_SB_IGNORE_MASK;
label = __begin_current_label_crit_section();
if (!unconfined(label)) {
@@ -1124,6 +1196,12 @@ static struct security_hook_list apparmor_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(capget, apparmor_capget),
LSM_HOOK_INIT(capable, apparmor_capable),
+ LSM_HOOK_INIT(fs_context_alloc, apparmor_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, apparmor_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, apparmor_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_option, apparmor_fs_context_parse_option),
+ LSM_HOOK_INIT(sb_mountpoint, apparmor_sb_mountpoint),
+
LSM_HOOK_INIT(sb_mount, apparmor_sb_mount),
LSM_HOOK_INIT(sb_umount, apparmor_sb_umount),
LSM_HOOK_INIT(sb_pivotroot, apparmor_sb_pivotroot),
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index 45bb769d6cd7..3d477d288627 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -554,6 +554,52 @@ int aa_new_mount(struct aa_label *label, const char *dev_name,
return error;
}
+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+ const struct path *mountpoint)
+{
+ struct apparmor_fs_context *afc = fc->security;
+ struct aa_profile *profile;
+ char *buffer = NULL, *dev_buffer = NULL;
+ bool binary;
+ int error;
+ struct path tmp_path, *dev_path = NULL;
+
+ AA_BUG(!label);
+ AA_BUG(!mountpoint);
+
+ binary = fc->fs_type->fs_flags & FS_BINARY_MOUNTDATA;
+
+ if (fc->fs_type->fs_flags & FS_REQUIRES_DEV) {
+ if (!fc->source)
+ return -ENOENT;
+
+ error = kern_path(fc->source, LOOKUP_FOLLOW, &tmp_path);
+ if (error)
+ return error;
+ dev_path = &tmp_path;
+ }
+
+ get_buffers(buffer, dev_buffer);
+ if (dev_path) {
+ error = fn_for_each_confined(label, profile,
+ match_mnt(profile, mountpoint, buffer, dev_path, dev_buffer,
+ fc->fs_type->name,
+ fc->sb_flags & ~AA_SB_IGNORE_MASK,
+ afc->saved_options, binary));
+ } else {
+ error = fn_for_each_confined(label, profile,
+ match_mnt_path_str(profile, mountpoint, buffer, fc->source,
+ fc->fs_type->name,
+ fc->sb_flags & ~AA_SB_IGNORE_MASK,
+ afc->saved_options, binary, NULL));
+ }
+ put_buffers(buffer, dev_buffer);
+ if (dev_path)
+ path_put(dev_path);
+
+ return error;
+}
+
static int profile_umount(struct aa_profile *profile, struct path *path,
char *buffer)
{
In do_mount() when the MS_* flags are being converted to MNT_* flags,
MS_RDONLY got accidentally converted to SB_RDONLY.
Undo this change.
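
For illustration only, the conversion step looks roughly like the sketch
below. The DEMO_* constants and demo_ms_to_mnt() are hypothetical stand-ins
for the kernel's MS_* and MNT_* definitions and for the relevant part of
do_mount(). The point is that the input is tested against the mount(2) flag
namespace (MS_*) and not the superblock flag namespace (SB_*).

```c
#include <assert.h>

/* Stand-ins for the userspace-visible mount(2) flags (MS_*). */
enum {
	DEMO_MS_RDONLY = 0x01,
	DEMO_MS_NOSUID = 0x02,
	DEMO_MS_NODEV  = 0x04,
	DEMO_MS_NOEXEC = 0x08,
};

/* Stand-ins for the per-mount flags kept on the vfsmount (MNT_*). */
enum {
	DEMO_MNT_NOSUID   = 0x01,
	DEMO_MNT_NODEV    = 0x02,
	DEMO_MNT_NOEXEC   = 0x04,
	DEMO_MNT_READONLY = 0x40,
};

/* Translate mount(2)-style flags into per-mount flags, one bit at a time. */
static unsigned int demo_ms_to_mnt(unsigned long flags)
{
	unsigned int mnt_flags = 0;

	if (flags & DEMO_MS_NOSUID)
		mnt_flags |= DEMO_MNT_NOSUID;
	if (flags & DEMO_MS_NODEV)
		mnt_flags |= DEMO_MNT_NODEV;
	if (flags & DEMO_MS_NOEXEC)
		mnt_flags |= DEMO_MNT_NOEXEC;
	if (flags & DEMO_MS_RDONLY)	/* MS_* namespace, not SB_* */
		mnt_flags |= DEMO_MNT_READONLY;

	return mnt_flags;
}
```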
Fixes: e462ec50cb5f ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index e398f32d7541..6f720ebca133 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2814,7 +2814,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
mnt_flags |= MNT_NODIRATIME;
if (flags & MS_STRICTATIME)
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
- if (flags & SB_RDONLY)
+ if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
/* The default atime for remount is preservation */
On Thu, Apr 19, 2018 at 9:31 AM, David Howells <[email protected]> wrote:
> Add LSM hooks for use by the filesystem context code. This includes:
>
> (1) Hooks to handle allocation, duplication and freeing of the security
> record attached to a filesystem context.
>
> (2) A hook to snoop a mount options in key[=val] form. If the LSM decides
> it wants to handle it, it can suppress the option being passed to the
> filesystem. Note that 'val' may include commas and binary data with
> the fsopen patch.
>
> (3) A hook to transfer the security from the context to a newly created
> superblock.
>
> (4) A hook to rule on whether a path point can be used as a mountpoint.
>
> These are intended to replace:
>
> security_sb_copy_data
> security_sb_kern_mount
> security_sb_mount
> security_sb_set_mnt_opts
> security_sb_clone_mnt_opts
> security_sb_parse_opts_str
>
> Signed-off-by: David Howells <[email protected]>
> cc: [email protected]
> ---
>
> include/linux/lsm_hooks.h | 62 +++++++++++
> include/linux/security.h | 44 ++++++++
> security/security.c | 41 +++++++
> security/selinux/hooks.c | 262 +++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 409 insertions(+)
Adding the SELinux mailing list to the CC line; in the future please
include the SELinux mailing list on patches like this. It would also
be very helpful to include "selinux" somewhere in the subject line
when the patch is predominately SELinux related (much like you did for
the other LSMs in this patchset).
I can't say I've digested all of this yet, but what SELinux testing
have you done with this patchset?
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 9d0b286f3dba..da20f90d40bb 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -76,6 +76,50 @@
> * changes on the process such as clearing out non-inheritable signal
> * state. This is called immediately after commit_creds().
> *
> + * Security hooks for mount using fd context.
> + *
> + * @fs_context_alloc:
> + * Allocate and attach a security structure to sc->security. This pointer
> + * is initialised to NULL by the caller.
> + * @fc indicates the new filesystem context.
> + * @src_sb indicates the source superblock of a submount.
> + * @fs_context_dup:
> + * Allocate and attach a security structure to sc->security. This pointer
> + * is initialised to NULL by the caller.
> + * @fc indicates the new filesystem context.
> + * @src_fc indicates the original filesystem context.
> + * @fs_context_free:
> + * Clean up a filesystem context.
> + * @fc indicates the filesystem context.
> + * @fs_context_parse_option:
> + * Userspace provided an option to configure a superblock. The LSM may
> + * reject it with an error and may use it for itself, in which case it
> + * should return 1; otherwise it should return 0 to pass it on to the
> + * filesystem.
> + * @fc indicates the filesystem context.
> + * @opt indicates the option in "key[=val]" form. It is NUL-terminated,
> + * but val may be binary data.
> + * @len indicates the size of the option.
> + * @fs_context_validate:
> + * Validate the filesystem context preparatory to applying it. This is
> + * done after all the options have been parsed.
> + * @fc indicates the filesystem context.
> + * @sb_get_tree:
> + * Assign the security to a newly created superblock.
> + * @fc indicates the filesystem context.
> + * @fc->root indicates the root that will be mounted.
> + * @fc->root->d_sb points to the superblock.
> + * @sb_reconfigure:
> + * Apply reconfiguration to the security on a superblock.
> + * @fc indicates the filesystem context.
> + * @fc->root indicates a dentry in the mount.
> + * @fc->root->d_sb points to the superblock.
> + * @sb_mountpoint:
> + * Equivalent of sb_mount, but with an fs_context.
> + * @fc indicates the filesystem context.
> + * @mountpoint indicates the path on which the mount will take place.
> + * @mnt_flags indicates the MNT_* flags specified.
> + *
> * Security hooks for filesystem operations.
> *
> * @sb_alloc_security:
> @@ -1450,6 +1494,16 @@ union security_list_options {
> void (*bprm_committing_creds)(struct linux_binprm *bprm);
> void (*bprm_committed_creds)(struct linux_binprm *bprm);
>
> + int (*fs_context_alloc)(struct fs_context *fc, struct super_block *src_sb);
> + int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_sc);
> + void (*fs_context_free)(struct fs_context *fc);
> + int (*fs_context_parse_option)(struct fs_context *fc, char *opt, size_t len);
> + int (*fs_context_validate)(struct fs_context *fc);
> + int (*sb_get_tree)(struct fs_context *fc);
> + void (*sb_reconfigure)(struct fs_context *fc);
> + int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint,
> + unsigned int mnt_flags);
> +
> int (*sb_alloc_security)(struct super_block *sb);
> void (*sb_free_security)(struct super_block *sb);
> int (*sb_copy_data)(char *orig, char *copy);
> @@ -1787,6 +1841,14 @@ struct security_hook_heads {
> struct hlist_head bprm_check_security;
> struct hlist_head bprm_committing_creds;
> struct hlist_head bprm_committed_creds;
> + struct hlist_head fs_context_alloc;
> + struct hlist_head fs_context_dup;
> + struct hlist_head fs_context_free;
> + struct hlist_head fs_context_parse_option;
> + struct hlist_head fs_context_validate;
> + struct hlist_head sb_get_tree;
> + struct hlist_head sb_reconfigure;
> + struct hlist_head sb_mountpoint;
> struct hlist_head sb_alloc_security;
> struct hlist_head sb_free_security;
> struct hlist_head sb_copy_data;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 200920f521a1..60a85bd9dfef 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -53,6 +53,7 @@ struct msg_msg;
> struct xattr;
> struct xfrm_sec_ctx;
> struct mm_struct;
> +struct fs_context;
>
> /* If capable should audit the security request */
> #define SECURITY_CAP_NOAUDIT 0
> @@ -231,6 +232,15 @@ int security_bprm_set_creds(struct linux_binprm *bprm);
> int security_bprm_check(struct linux_binprm *bprm);
> void security_bprm_committing_creds(struct linux_binprm *bprm);
> void security_bprm_committed_creds(struct linux_binprm *bprm);
> +int security_fs_context_alloc(struct fs_context *fc, struct super_block *sb);
> +int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc);
> +void security_fs_context_free(struct fs_context *fc);
> +int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len);
> +int security_fs_context_validate(struct fs_context *fc);
> +int security_sb_get_tree(struct fs_context *fc);
> +void security_sb_reconfigure(struct fs_context *fc);
> +int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
> + unsigned int mnt_flags);
> int security_sb_alloc(struct super_block *sb);
> void security_sb_free(struct super_block *sb);
> int security_sb_copy_data(char *orig, char *copy);
> @@ -539,6 +549,40 @@ static inline void security_bprm_committed_creds(struct linux_binprm *bprm)
> {
> }
>
> +static inline int security_fs_context_alloc(struct fs_context *fc,
> + struct super_block *src_sb)
> +{
> + return 0;
> +}
> +static inline int security_fs_context_dup(struct fs_context *fc,
> + struct fs_context *src_fc)
> +{
> + return 0;
> +}
> +static inline void security_fs_context_free(struct fs_context *fc)
> +{
> +}
> +static inline int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
> +{
> + return 0;
> +}
> +static inline int security_fs_context_validate(struct fs_context *fc)
> +{
> + return 0;
> +}
> +static inline int security_sb_get_tree(struct fs_context *fc)
> +{
> + return 0;
> +}
> +static inline void security_sb_reconfigure(struct fs_context *fc)
> +{
> +}
> +static inline int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
> + unsigned int mnt_flags)
> +{
> + return 0;
> +}
> +
> static inline int security_sb_alloc(struct super_block *sb)
> {
> return 0;
> diff --git a/security/security.c b/security/security.c
> index 7bc2fde023a7..42e4ea19b61c 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -358,6 +358,47 @@ void security_bprm_committed_creds(struct linux_binprm *bprm)
> call_void_hook(bprm_committed_creds, bprm);
> }
>
> +int security_fs_context_alloc(struct fs_context *fc, struct super_block *src_sb)
> +{
> + return call_int_hook(fs_context_alloc, 0, fc, src_sb);
> +}
> +
> +int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
> +{
> + return call_int_hook(fs_context_dup, 0, fc, src_fc);
> +}
> +
> +void security_fs_context_free(struct fs_context *fc)
> +{
> + call_void_hook(fs_context_free, fc);
> +}
> +
> +int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
> +{
> + return call_int_hook(fs_context_parse_option, 0, fc, opt, len);
> +}
> +
> +int security_fs_context_validate(struct fs_context *fc)
> +{
> + return call_int_hook(fs_context_validate, 0, fc);
> +}
> +
> +int security_sb_get_tree(struct fs_context *fc)
> +{
> + return call_int_hook(sb_get_tree, 0, fc);
> +}
> +
> +void security_sb_reconfigure(struct fs_context *fc)
> +{
> + call_void_hook(sb_reconfigure, fc);
> +}
> +
> +int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
> + unsigned int mnt_flags)
> +{
> + return call_int_hook(sb_mountpoint, 0, fc, mountpoint, mnt_flags);
> +}
> +
> int security_sb_alloc(struct super_block *sb)
> {
> return call_int_hook(sb_alloc_security, 0, sb);
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 1f0316bf7e29..969a2a0dc582 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -48,6 +48,7 @@
> #include <linux/fdtable.h>
> #include <linux/namei.h>
> #include <linux/mount.h>
> +#include <linux/fs_context.h>
> #include <linux/netfilter_ipv4.h>
> #include <linux/netfilter_ipv6.h>
> #include <linux/tty.h>
> @@ -2960,6 +2961,259 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
> FILESYSTEM__UNMOUNT, NULL);
> }
>
> +/* fsopen mount context operations */
> +
> +static int selinux_fs_context_alloc(struct fs_context *fc,
> + struct super_block *src_sb)
> +{
> + struct security_mnt_opts *opts;
> +
> + opts = kzalloc(sizeof(*opts), GFP_KERNEL);
> + if (!opts)
> + return -ENOMEM;
> +
> + fc->security = opts;
> + return 0;
> +}
> +
> +static int selinux_fs_context_dup(struct fs_context *fc,
> + struct fs_context *src_fc)
> +{
> + const struct security_mnt_opts *src = src_fc->security;
> + struct security_mnt_opts *opts;
> + int i, n;
> +
> + opts = kzalloc(sizeof(*opts), GFP_KERNEL);
> + if (!opts)
> + return -ENOMEM;
> + fc->security = opts;
> +
> + if (!src || !src->num_mnt_opts)
> + return 0;
> + n = opts->num_mnt_opts = src->num_mnt_opts;
> +
> + if (src->mnt_opts) {
> + opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
> + if (!opts->mnt_opts)
> + return -ENOMEM;
> +
> + for (i = 0; i < n; i++) {
> + if (src->mnt_opts[i]) {
> + opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
> + GFP_KERNEL);
> + if (!opts->mnt_opts[i])
> + return -ENOMEM;
> + }
> + }
> + }
> +
> + if (src->mnt_opts_flags) {
> + opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
> + n * sizeof(int), GFP_KERNEL);
> + if (!opts->mnt_opts_flags)
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +static void selinux_fs_context_free(struct fs_context *fc)
> +{
> + struct security_mnt_opts *opts = fc->security;
> +
> + security_free_mnt_opts(opts);
> + fc->security = NULL;
> +}
> +
> +static int selinux_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
> +{
> + struct security_mnt_opts *opts = fc->security;
> + substring_t args[MAX_OPT_ARGS];
> + unsigned int have;
> + char *c, **oo;
> + int token, ctx, i, *of;
> +
> + token = match_token(opt, tokens, args);
> + if (token == Opt_error)
> + return 0; /* Doesn't belong to us. */
> +
> + have = 0;
> + for (i = 0; i < opts->num_mnt_opts; i++)
> + have |= 1 << opts->mnt_opts_flags[i];
> + if (have & (1 << token))
> + return -EINVAL;
> +
> + switch (token) {
> + case Opt_context:
> + if (have & (1 << Opt_defcontext))
> + goto incompatible;
> + ctx = CONTEXT_MNT;
> + goto copy_context_string;
> +
> + case Opt_fscontext:
> + ctx = FSCONTEXT_MNT;
> + goto copy_context_string;
> +
> + case Opt_rootcontext:
> + ctx = ROOTCONTEXT_MNT;
> + goto copy_context_string;
> +
> + case Opt_defcontext:
> + if (have & (1 << Opt_context))
> + goto incompatible;
> + ctx = DEFCONTEXT_MNT;
> + goto copy_context_string;
> +
> + case Opt_labelsupport:
> + return 1;
> +
> + default:
> + return -EINVAL;
> + }
> +
> +copy_context_string:
> + if (opts->num_mnt_opts > 3)
> + return -EINVAL;
> +
> + of = krealloc(opts->mnt_opts_flags,
> + (opts->num_mnt_opts + 1) * sizeof(int), GFP_KERNEL);
> + if (!of)
> + return -ENOMEM;
> + of[opts->num_mnt_opts] = 0;
> + opts->mnt_opts_flags = of;
> +
> + oo = krealloc(opts->mnt_opts,
> + (opts->num_mnt_opts + 1) * sizeof(char *), GFP_KERNEL);
> + if (!oo)
> + return -ENOMEM;
> + oo[opts->num_mnt_opts] = NULL;
> + opts->mnt_opts = oo;
> +
> + c = match_strdup(&args[0]);
> + if (!c)
> + return -ENOMEM;
> + opts->mnt_opts[opts->num_mnt_opts] = c;
> + opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
> + opts->num_mnt_opts++;
> + return 1;
> +
> +incompatible:
> + return -EINVAL;
> +}
> +
> +/*
> + * Validate the security parameters supplied for a reconfiguration/remount
> + * event.
> + */
> +static int selinux_validate_for_sb_reconfigure(struct fs_context *fc)
> +{
> + struct super_block *sb = fc->root->d_sb;
> + struct superblock_security_struct *sbsec = sb->s_security;
> + struct security_mnt_opts *opts = fc->security;
> + int rc, i, *flags;
> + char **mount_options;
> +
> + if (!(sbsec->flags & SE_SBINITIALIZED))
> + return 0;
> +
> + mount_options = opts->mnt_opts;
> + flags = opts->mnt_opts_flags;
> +
> + for (i = 0; i < opts->num_mnt_opts; i++) {
> + u32 sid;
> +
> + if (flags[i] == SBLABEL_MNT)
> + continue;
> +
> + rc = security_context_str_to_sid(&selinux_state, mount_options[i],
> + &sid, GFP_KERNEL);
> + if (rc) {
> + pr_warn("SELinux: security_context_str_to_sid"
> + "(%s) failed for (dev %s, type %s) errno=%d\n",
> + mount_options[i], sb->s_id, sb->s_type->name, rc);
> + goto inval;
> + }
> +
> + switch (flags[i]) {
> + case FSCONTEXT_MNT:
> + if (bad_option(sbsec, FSCONTEXT_MNT, sbsec->sid, sid))
> + goto bad_option;
> + break;
> + case CONTEXT_MNT:
> + if (bad_option(sbsec, CONTEXT_MNT, sbsec->mntpoint_sid, sid))
> + goto bad_option;
> + break;
> + case ROOTCONTEXT_MNT: {
> + struct inode_security_struct *root_isec;
> + root_isec = backing_inode_security(sb->s_root);
> +
> + if (bad_option(sbsec, ROOTCONTEXT_MNT, root_isec->sid, sid))
> + goto bad_option;
> + break;
> + }
> + case DEFCONTEXT_MNT:
> + if (bad_option(sbsec, DEFCONTEXT_MNT, sbsec->def_sid, sid))
> + goto bad_option;
> + break;
> + default:
> + goto inval;
> + }
> + }
> +
> + rc = 0;
> +out:
> + return rc;
> +
> +bad_option:
> + pr_warn("SELinux: unable to change security options "
> + "during remount (dev %s, type=%s)\n",
> + sb->s_id, sb->s_type->name);
> +inval:
> + rc = -EINVAL;
> + goto out;
> +}
> +
> +/*
> + * Validate the security context assembled from the option data supplied to
> + * mount.
> + */
> +static int selinux_fs_context_validate(struct fs_context *fc)
> +{
> + if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)
> + return selinux_validate_for_sb_reconfigure(fc);
> + return 0;
> +}
> +
> +/*
> + * Set the security context on a superblock.
> + */
> +static int selinux_sb_get_tree(struct fs_context *fc)
> +{
> + const struct cred *cred = current_cred();
> + struct common_audit_data ad;
> + int rc;
> +
> + rc = selinux_set_mnt_opts(fc->root->d_sb, fc->security, 0, NULL);
> + if (rc)
> + return rc;
> +
> + /* Allow all mounts performed by the kernel */
> + if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
> + return 0;
> +
> + ad.type = LSM_AUDIT_DATA_DENTRY;
> + ad.u.dentry = fc->root;
> + return superblock_has_perm(cred, fc->root->d_sb, FILESYSTEM__MOUNT, &ad);
> +}
> +
> +static int selinux_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
> + unsigned int mnt_flags)
> +{
> + const struct cred *cred = current_cred();
> +
> + return path_has_perm(cred, mountpoint, FILE__MOUNTON);
> +}
> +
> /* inode security operations */
>
> static int selinux_inode_alloc_security(struct inode *inode)
> @@ -6871,6 +7125,14 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
> LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
> LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),
>
> + LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
> + LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
> + LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
> + LSM_HOOK_INIT(fs_context_parse_option, selinux_fs_context_parse_option),
> + LSM_HOOK_INIT(fs_context_validate, selinux_fs_context_validate),
> + LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
> + LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
> +
> LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
> LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
> LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
>
--
paul moore
http://www.paul-moore.com
Paul Moore <[email protected]> wrote:
> Adding the SELinux mailing list to the CC line; in the future please
> include the SELinux mailing list on patches like this. It would also
> be very helpful to include "selinux" somewhere in the subject line
> when the patch is predominately SELinux related (much like you did for
> the other LSMs in this patchset).
I should probably evict the SELinux bits into their own patch since the point
of this patch is the LSM hooks, not specifically SELinux's implementation
thereof.
> I can't say I've digested all of this yet, but what SELinux testing
> have you done with this patchset?
The fsopen()/fsmount() syscalls exercise these hooks, say for NFS (which I
haven't included in this list). Even sys_mount() makes some use of them, so
just booting the system exercises them too.
Note that for SELinux these hooks don't change very much except how the
parameters are handled. It doesn't actually change the checks that are made -
at least, not yet. There are some additional syscalls under consideration
(such as the ability to pick a live mounted filesystem into a context) that
might require additional permissions.
David
Hi David,
On 04/19/18 06:31, David Howells wrote:
> Introduce a filesystem context concept to be used during superblock
> creation for mount and superblock reconfiguration for remount. This is
> allocated at the beginning of the mount procedure and into it is placed:
>
> (1) Filesystem type.
>
> (2) Namespaces.
>
> (3) Device name.
>
> (4) Superblock flags (MS_*).
>
> (5) Security details.
>
> (6) Filesystem-specific data, as set by the mount options.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> Documentation/filesystems/mounting.txt | 445 ++++++++++++++++++++++++++++++++
> include/linux/fs_context.h | 76 +++++
> 2 files changed, 521 insertions(+)
> create mode 100644 Documentation/filesystems/mounting.txt
> create mode 100644 include/linux/fs_context.h
> diff --git a/Documentation/filesystems/mounting.txt b/Documentation/filesystems/mounting.txt
> new file mode 100644
> index 000000000000..805135a66b64
> --- /dev/null
> +++ b/Documentation/filesystems/mounting.txt
> @@ -0,0 +1,445 @@
> + ===================
> + FILESYSTEM MOUNTING
> + ===================
> +
> +CONTENTS
> +
> + (1) Overview.
> +
> + (2) The filesystem context.
> +
> + (3) The filesystem context operations.
> +
> + (4) Filesystem context security.
> +
> + (5) VFS filesystem context operations.
> +
> +
> +========
> +OVERVIEW
> +========
> +
> +The creation of new mounts is now to be done in a multistep process:
> +
> + (1) Create a filesystem context.
> +
> + (2) Parse the options and attach them to the context. Options may be passed
> + individually from userspace.
Does this say that step (2) can be multiple small steps? How does step (2) know
when userspace has completed sending individual options?
> +
> + (3) Validate and pre-process the context.
> +
> + (4) Get or create a superblock and mountable root.
> +
> + (5) Perform the mount.
> +
> + (6) Return an error message attached to the context.
where/how is this done?
> +
> + (7) Destroy the context.
> +
> +To support this, the file_system_type struct gains two new fields:
> +
> + unsigned short fs_context_size;
> +
> +which indicates the total amount of space that should be allocated for context
> +data (see the Filesystem Context section), and:
> +
> + int (*init_fs_context)(struct fs_context *fc, struct super_block *src_sb);
> +
> +which is invoked to set up the filesystem-specific parts of a filesystem
> +context, including the additional space. The src_sb parameter is used to
> +convey the superblock from which the filesystem may draw extra information
> +(such as namespaces) for submount (FS_CONTEXT_FOR_SUBMOUNT) or reconfiguration
> +(FS_CONTEXT_FOR_RECONFIGURE) purposes - otherwise it will be NULL.
> +
> +Note that security initialisation is done *after* the filesystem is called so
> +that the namespaces may be adjusted first.
> +
> +And the super_operations struct gains one field:
> +
> + int (*reconfigure) (struct super_block *, struct fs_context *);
> +
> +This shadows the ->reconfigure() operation and takes a prepared filesystem
> +context instead of the mount flags and data page. It may modify the sb_flags
> +in the context for the caller to pick up.
> +
> +[NOTE] reconfigure is intended as a replacement for remount_fs.
> +
> +
> +======================
> +THE FILESYSTEM CONTEXT
> +======================
> +
> +The creation and reconfiguration of a superblock is governed by a filesystem
> +context. This is represented by the fs_context structure:
> +
> + struct fs_context {
> + const struct fs_context_operations *ops;
> + struct file_system_type *fs;
> + struct dentry *root;
> + struct user_namespace *user_ns;
> + struct net *net_ns;
> + const struct cred *cred;
> + char *device;
> + char *subtype;
> + void *security;
> + void *s_fs_info;
> + unsigned int sb_flags;
> + bool sloppy;
> + bool silent;
> + bool degraded;
> + bool drop_sb;
> + enum fs_context_purpose purpose : 8;
> + };
> +
> +When the VFS creates this, it allocates ->fs_context_size bytes (as specified
> +by the file_system_type object) to hold both the fs_context struct and any
> +extra data required by the filesystem. The fs_context struct is placed at the
> +beginning of this space. Any extra space beyond that is for use by the
> +filesystem. The filesystem should wrap the struct in its own, e.g.:
in its own struct, e.g.:
> +
> + struct nfs_fs_context {
> + struct fs_context fc;
> + ...
> + };
> +
> +placing the fs_context struct first. container_of() can then be used. The
> +file_system_type would be initialised thus:
> +
> + struct file_system_type nfs = {
> + ...
> + .fs_context_size = sizeof(struct nfs_fs_context),
> + .init_fs_context = nfs_init_fs_context,
> + ...
> + };
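A minimal userspace sketch of the wrapping idiom described above, for illustration only. The struct fs_context here is a cut-down stand-in, and example_fs_context, its timeout field and to_example_ctx() are made-up names, not anything from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel's struct fs_context (illustration only). */
struct fs_context {
	unsigned int sb_flags;
};

/* Hypothetical filesystem-private wrapper; the field names are made up. */
struct example_fs_context {
	struct fs_context fc;	/* Placed first, as the text asks */
	int timeout;
};

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the wrapper from the generic context pointer. */
struct example_fs_context *to_example_ctx(struct fs_context *fc)
{
	assert(fc != NULL);
	return container_of(fc, struct example_fs_context, fc);
}
```

Since the embedded struct is the first member, the two pointers share an address; container_of() just keeps the conversion type-checked and survives a later reordering of the wrapper.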
> +
> +The fs_context fields are as follows:
> +
> + (*) const struct fs_context_operations *ops
> +
> + These are operations that can be done on a filesystem context (see
> + below). This must be set by the ->init_fs_context() file_system_type
> + operation.
> +
> + (*) struct file_system_type *fs
> +
> + A pointer to the file_system_type of the filesystem that is being
> + constructed or reconfigured. This retains a reference on the type owner.
> +
> + (*) struct dentry *root
> +
> + A pointer to the root of the mountable tree (and indirectly, the
> + superblock thereof). This is filled in by the ->get_tree() op.
> +
> + (*) struct user_namespace *user_ns
> + (*) struct net *net_ns
> +
> + There are a subset of the namespaces in use by the invoking process. They
> + retain references on each namespace. The subscribed namespaces may be
> + replaced by the filesystem to reflect other sources, such as the parent
> + mount superblock on an automount.
> +
> + (*) struct cred *cred
> +
> + The mounter's credentials. This retains a reference on the credentials.
> +
> + (*) char *device
> +
> + This is the device to be mounted. It may be a block device
> + (e.g. /dev/sda1) or something more exotic, such as the "host:/path" that
> + NFS desires.
> +
> + (*) char *subtype
> +
> + This is a string to be added to the type displayed in /proc/mounts to
> + qualify it (used by FUSE). This is available for the filesystem to set if
> + desired.
> +
> + (*) void *security
> +
> + A place for the LSMs to hang their security data for the superblock. The
> + relevant security operations are described below.
> +
> + (*) void *s_fs_info
> +
> + The proposed s_fs_info for a new superblock, set in the superblock by
> + sget_fc(). This can be used to distinguish superblocks.
> +
> + (*) unsigned int sb_flags
> +
> + This holds the SB_* flags to be set in super_block::s_flags.
> +
> + (*) bool sloppy
> + (*) bool silent
> +
> + These are set if the sloppy or silent mount options are given.
> +
> + [NOTE] sloppy is probably unnecessary when userspace passes over one
> + option at a time since the error can just be ignored if userspace deems it
> + to be unimportant.
> +
> + [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
> +
> + (*) bool degraded
> +
> + This is set if any preallocated resources in the context have been used
> + up, thereby rendering it unreusable for the ->get_tree() op.
> +
> + (*) bool drop_sb
> +
> + This is set if a superblock reference needs to be deactivated when the
> + context is put.
> +
> + (*) enum fs_context_purpose
> +
> + This indicates the purpose for which the context is intended. The
> + available values are:
> +
> + FS_CONTEXT_FOR_USER_MOUNT, -- New superblock for user-specified mount
> + FS_CONTEXT_FOR_KERNEL_MOUNT, -- New superblock for kernel-internal mount
> + FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
> + FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
> +
> +The mount context is created by calling vfs_new_fs_context(), vfs_sb_reconfig()
> +or vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
> +structure is not refcounted.
> +
> +VFS, security and filesystem mount options are set individually with
> +vfs_parse_mount_option(). Options provided by the old mount(2) system call as
> +a page of data can be parsed with generic_parse_monolithic().
> +
> +When mounting, the filesystem is allowed to take data from any of the pointers
> +and attach it to the superblock (or whatever), provided it clears the pointer
> +in the mount context.
> +
> +The filesystem is also allowed to allocate resources and pin them with the
> +mount context. For instance, NFS might pin the appropriate protocol version
> +module.
> +
> +
> +=================================
> +THE FILESYSTEM CONTEXT OPERATIONS
> +=================================
> +
> +The filesystem context points to a table of operations:
> +
> + struct fs_context_operations {
> + void (*free)(struct fs_context *fc);
> + int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
> + int (*parse_source)(struct fs_context *fc);
> + int (*parse_option)(struct fs_context *fc, char *opt);
> + int (*parse_monolithic)(struct fs_context *fc, void *data);
> + int (*validate)(struct fs_context *fc);
> + int (*get_tree)(struct fs_context *fc);
> + };
> +
> +These operations are invoked by the various stages of the mount procedure to
> +manage the filesystem context. They are as follows:
> +
> + (*) void (*free)(struct fs_context *fc);
> +
> + Called to clean up the filesystem-specific part of the filesystem context
> + when the context is destroyed. It should be aware that parts of the
> + context may have been removed and NULL'd out by ->get_tree().
> +
> + (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
> +
> + Called when a filesystem context has been duplicated to get any refs or
> + copy any non-referenced resources held in the filesystem-specific part of
> + the filesystem context. An error may be returned to indicate failure to
> + do this.
> +
> + [!] Note that even if this fails, put_fs_context() will be called
> + immediately thereafter, so ->dup() *must* make the
> + filesystem-specific part safe for ->free().
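To make the "safe for ->free()" requirement concrete, here is a userspace sketch under stated assumptions: example_opts, example_dup() and example_free() are hypothetical names, and malloc/calloc stand in for the kernel allocators. The point is only the ordering: the new object is published before anything that can fail, and the array is zeroed, so the free routine copes with a half-built copy.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private data: an array of option strings. */
struct example_opts {
	char **opts;
	int num_opts;
};

/* Portable stand-in for kstrdup(): copy a NUL-terminated string. */
char *example_strdup(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Free whatever is present; tolerates NULL and partially filled arrays. */
void example_free(struct example_opts *o)
{
	int i;

	if (!o)
		return;
	for (i = 0; o->opts && i < o->num_opts; i++)
		free(o->opts[i]);	/* free(NULL) is a no-op */
	free(o->opts);
	free(o);
}

/*
 * Duplicate src into *dst (which the caller presets to NULL).  *dst is set
 * before any later allocation, so example_free(*dst) is safe on failure.
 */
int example_dup(struct example_opts **dst, const struct example_opts *src)
{
	struct example_opts *o;
	int i;

	assert(dst != NULL);
	o = calloc(1, sizeof(*o));
	if (!o)
		return -ENOMEM;
	*dst = o;
	if (!src || !src->num_opts)
		return 0;

	o->opts = calloc(src->num_opts, sizeof(char *));
	if (!o->opts)
		return -ENOMEM;
	o->num_opts = src->num_opts;
	for (i = 0; i < src->num_opts; i++) {
		if (!src->opts[i])
			continue;
		o->opts[i] = example_strdup(src->opts[i]);
		if (!o->opts[i])
			return -ENOMEM;
	}
	return 0;
}
```

On any error return the caller can hand *dst straight to example_free() without knowing how far the copy got.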
> +
> + (*) int (*parse_source)(struct fs_context *fc);
> +
> + Called when the source or device is specified for a filesystem context.
> + The string will have been stored in fc->source prior to calling. If
"source" is called "device" above but "source" in the header file.
Please change one of them to be consistent.
> + successful, 0 should be returned and a negative error code otherwise.
or a
> +
> + (*) int (*parse_option)(struct fs_context *fc, char *p);
> +
> + Called when an option is to be added to the filesystem context. p points
> + to the option string, likely in "key[=val]" format. VFS-specific options
> + will have been weeded out and fc->sb_flags updated in the context.
> + Security options will also have been weeded out and fc->security updated.
> +
> + If successful, 0 should be returned and a negative error code otherwise.
or a
> +
> + (*) int (*parse_monolithic)(struct fs_context *fc, void *data);
> +
> + Called when the mount(2) system call is invoked to pass the entire data
> + page in one go. If this is expected to be just a list of "key[=val]"
> + items separated by commas, then this may be set to NULL.
> +
> + The return value is as for ->parse_option().
> +
> + If the filesystem (eg. NFS) needs to examine the data first and then finds
e.g.
> + it's the standard key-val list then it may pass it off to
> + generic_parse_monolithic().
> +
> + (*) int (*validate)(struct fs_context *fc);
> +
> + Called when all the options have been applied and the mount is about to
> + take place. It is should check for inconsistencies from mount options and
> + it is also allowed to do preliminary resource acquisition. For instance,
> + the core NFS module could load the NFS protocol module here.
> +
> + Note that if fc->purpose == FS_CONTEXT_FOR_RECONFIGURE, some of the
> + options necessary for a new mount may not be set.
> +
> + The return value is as for ->parse_option().
> +
> + (*) int (*get_tree)(struct fs_context *fc);
> +
> + Called to get or create the mountable root and superblock, using the
> + information stored in the filesystem context (reconfiguration goes via a
> + different vector). It may detach any resources it desires from the
> + filesystem context and transfer them to the superblock it creates.
> +
> + On success it should set fc->root to the mountable root and return 0. In
> + the case of an error, it should return a negative error code.
> +
> +
> +===========================
> +FILESYSTEM CONTEXT SECURITY
> +===========================
> +
> +The filesystem context contains a security pointer that the LSMs can use for
> +building up a security context for the superblock to be mounted. There are a
> +number of operations used by the new mount code for this purpose:
> +
> + (*) int security_fs_context_alloc(struct fs_context *fc,
> + struct super_block *src_sb);
> +
> + Called to initialise fc->security (which is preset to NULL) and allocate
> + any resources needed. It should return 0 on success and a negative error
or a
> + code on failure.
> +
> + src_sb is non-NULL in the case of reconfiguration
> + (FS_CONTEXT_FOR_RECONFIGURE) in which case it indicates the superblock to
> + be reconfigured or in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in
> + which case it indicates the parent superblock.
I seem to recall that you were going to rewrite that long sentence above.
-ETOOMANYCASES
> +
> + (*) int security_fs_context_dup(struct fs_context *fc,
> + struct fs_context *src_fc);
> +
> + Called to initialise fc->security (which is preset to NULL) and allocate
> + any resources needed. The original filesystem context is pointed to by
> + src_fc and may be used for reference. It should return 0 on success and a
or a
> + negative error code on failure.
> +
> + (*) void security_fs_context_free(struct fs_context *fc);
> +
> + Called to clean up anything attached to fc->security. Note that the
> + contents may have been transferred to a superblock and the pointer NULL'd
> + out during mount.
[Here we have evidence that in English any noun can be verbed.] :)
> +
> + (*) int security_fs_context_parse_option(struct fs_context *fc, char *opt);
> +
> + Called for each mount option. The arguments are as for the
> + ->parse_option() method. An active LSM may reject one with an error, pass
> + one over and return 0 or consume one and return 1. If consumed, the
What does "pass one over" mean?
> + option isn't passed on to the filesystem.
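One reading of the three outcomes, sketched in userspace C. This is an assumption about the intent, not the patch's code: "pass one over" is taken to mean "not mine, let the filesystem see it". All function names are hypothetical, and "context=" is used only as a sample security option.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical LSM hook: <0 = reject, 0 = not ours, 1 = consumed. */
int example_lsm_parse_option(const char *opt, int *lsm_seen)
{
	if (strncmp(opt, "context=", 8) != 0)
		return 0;		/* Doesn't belong to the LSM */
	if (opt[8] == '\0')
		return -EINVAL;		/* Ours, but malformed */
	(*lsm_seen)++;
	return 1;			/* Ours; hide it from the filesystem */
}

/* Hypothetical filesystem hook: just counts what reaches it. */
int example_fs_parse_option(const char *opt, int *fs_seen)
{
	(void)opt;
	(*fs_seen)++;
	return 0;
}

/* Dispatcher: offer the option to the LSM first, then to the filesystem. */
int example_parse_option(const char *opt, int *lsm_seen, int *fs_seen)
{
	int ret;

	assert(opt && lsm_seen && fs_seen);
	ret = example_lsm_parse_option(opt, lsm_seen);
	if (ret < 0)
		return ret;	/* Rejected with an error */
	if (ret == 1)
		return 0;	/* Consumed; the filesystem never sees it */
	return example_fs_parse_option(opt, fs_seen);
}
```

If that reading is right, the documentation could say "ignore one and return 0" instead of "pass one over".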
> +
> + (*) int security_sb_get_tree(struct fs_context *fc);
> +
> + Called during the mount procedure to verify that the specified superblock
> + is allowed to be mounted and to transfer the security data there. It
> + should return 0 or a negative error code.
> +
> + [NOTE] Should I add a security_fs_context_validate() operation so that the
> + LSM has the opportunity to allocate stuff and check the options as a
> + whole?
> +
> + (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint)
end line with ';' like the other prototypes.
> +
> + Called during the mount procedure to verify that the root dentry attached
> + to the context is permitted to be attached to the specified mountpoint.
> + It should return 0 on success and a negative error code on failure.
or a
> +
> +
> +=================================
> +VFS FILESYSTEM CONTEXT OPERATIONS
> +=================================
> +
> +There are four operations for creating a filesystem context and
> +one for destroying a context:
> +
> + (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
> + struct super_block *src_sb;
s/;/,/ above
> + unsigned int sb_flags);
> +
> + Create a filesystem context for a given filesystem type. This allocates
> + the filesystem context, sets the flags, initialises the security and calls
> + fs_type->init_fs_context() to initialise the filesystem context.
> +
> + src_sb can be NULL or it may indicate a superblock that is going to be
> + reconfigured (FS_CONTEXT_FOR_RECONFIGURE) or a superblock that is the
> + parent of a submount (FS_CONTEXT_FOR_SUBMOUNT). This superblock is
> + provided as a source of namespace information.
> +
> + (*) struct fs_context *vfs_sb_reconfigure(struct vfsmount *mnt,
> + unsigned int sb_flags);
> +
> + Create a filesystem context from the same filesystem as an extant mount
> + and initialise the mount parameters from the superblock underlying that
> + mount. This is for use by superblock parameter reconfiguration.
> +
> + (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
> +
> + Duplicate a filesystem context, copying any options noted and duplicating
> + or additionally referencing any resources held therein. This is available
> + for use where a filesystem has to get a mount within a mount, such as NFS4
> + does by internally mounting the root of the target server and then doing a
> + private pathwalk to the target directory.
> +
> + (*) void put_fs_context(struct fs_context *fc);
> +
> + Destroy a filesystem context, releasing any resources it holds. This
> + calls the ->free() operation. This is intended to be called by anyone who
> + created a filesystem context.
> +
> + [!] filesystem contexts are not refcounted, so this causes unconditional
> + destruction.
> +
> +In all the above operations, apart from the put op, the return is a mount
> +context pointer or a negative error code.
> +
> +For the remaining operations, if an error occurs, a negative error code will be
> +returned.
> +
> + (*) int vfs_get_tree(struct fs_context *fc);
> +
> + Get or create the mountable root and superblock, using the parameters in
> + the filesystem context to select/configure the superblock. This invokes
> + the ->validate() op and then the ->get_tree() op.
> +
> + [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
> + ->reconfigure().
> +
> + (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
> +
> + Create a mount given the parameters in the specified filesystem context.
> + Note that this does not attach the mount to anything.
> +
> + (*) int vfs_set_fs_source(struct fs_context *fc, char *source);
> +
> + Supply the source name or device name for the mount. This may cause the
> + filesystem to access the device.
> +
> + (*) int vfs_parse_fs_option(struct fs_context *fc, char *data);
> +
> + Supply a single mount option to the filesystem context. The mount option
> + should likely be in a "key[=val]" string form. The option is first
> + checked to see if it corresponds to a standard mount flag (in which case
> + it is used to set an SB_xxx flag and consumed) or a security option (in
> + which case the LSM consumes it) before it is passed on to the filesystem.
> +
> + (*) int generic_parse_monolithic(struct fs_context *fc, void *data);
> +
> + Parse a sys_mount() data page, assuming the form to be a text list
> + consisting of key[=val] options separated by commas. Each item in the
> + list is passed to vfs_mount_option(). This is the default when the
> + ->parse_monolithic() operation is NULL.
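A rough userspace sketch of the comma-splitting behaviour described for the monolithic case. It assumes a writable, NUL-terminated buffer and takes a callback in place of the per-option parser; example_parse_monolithic(), example_opt_fn and example_count() are hypothetical names. It does not attempt quoting, so a value that itself contains a comma would be split.

```c
#include <assert.h>
#include <string.h>

/* Callback type for one "key[=val]" item; hypothetical, for illustration. */
typedef int (*example_opt_fn)(char *opt, void *ctx);

/*
 * Split a writable, NUL-terminated option string on commas and hand each
 * non-empty item to fn.  Stops at the first negative return and passes it up.
 */
int example_parse_monolithic(char *data, example_opt_fn fn, void *ctx)
{
	char *p = data;

	assert(fn != NULL);
	while (p && *p) {
		char *comma = strchr(p, ',');
		int ret;

		if (comma)
			*comma = '\0';	/* Terminate the current item in place */
		if (*p) {		/* Skip empty items such as ",," */
			ret = fn(p, ctx);
			if (ret < 0)
				return ret;
		}
		p = comma ? comma + 1 : NULL;
	}
	return 0;
}

/* Example callback: count the items seen. */
int example_count(char *opt, void *ctx)
{
	(void)opt;
	(*(int *)ctx)++;
	return 0;
}
```

A NULL data pointer is treated as "no options", which matches mount(2) callers that pass no data page.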
> diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
> new file mode 100644
> index 000000000000..732a11898242
> --- /dev/null
> +++ b/include/linux/fs_context.h
> @@ -0,0 +1,76 @@
> +/* Filesystem superblock creation and reconfiguration context.
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _LINUX_FS_CONTEXT_H
> +#define _LINUX_FS_CONTEXT_H
> +
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +
> +struct cred;
> +struct dentry;
> +struct file_operations;
> +struct file_system_type;
> +struct mnt_namespace;
> +struct net;
> +struct pid_namespace;
> +struct super_block;
> +struct user_namespace;
> +struct vfsmount;
> +
> +enum fs_context_purpose {
> + FS_CONTEXT_FOR_USER_MOUNT, /* New superblock for user-specified mount */
> + FS_CONTEXT_FOR_KERNEL_MOUNT, /* New superblock for kernel-internal mount */
> + FS_CONTEXT_FOR_SUBMOUNT, /* New superblock for automatic submount */
> + FS_CONTEXT_FOR_RECONFIGURE, /* Superblock reconfiguration (remount) */
> +};
> +
> +/*
> + * Filesystem context as allocated and constructed by the ->init_fs_context()
> + * file_system_type operation. The size of the object allocated is specified
> + * in struct file_system_type::fs_context_size and this must include sufficient
> + * space for the fs_context struct.
> + *
> + * Superblock creation fills in ->root whereas reconfiguration begins with this
> + * already set.
> + *
> + * See Documentation/filesystems/mounting.txt
> + */
> +struct fs_context {
> + const struct fs_context_operations *ops;
> + struct file_system_type *fs_type;
> + struct dentry *root; /* The root and superblock */
> + struct user_namespace *user_ns; /* The user namespace for this mount */
> + struct net *net_ns; /* The network namespace for this mount */
> + const struct cred *cred; /* The mounter's credentials */
> + char *source; /* The source name (eg. device) */
> + char *subtype; /* The subtype to set on the superblock */
> + void *security; /* The LSM context */
> + void *s_fs_info; /* Proposed s_fs_info */
> + unsigned int sb_flags; /* Proposed superblock flags (SB_*) */
> + bool sloppy; /* Unrecognised options are okay */
> + bool silent;
> + bool degraded; /* True if the context can't be reused */
> + bool drop_sb; /* T if need to drop an SB reference */
s/T /True /
> + enum fs_context_purpose purpose : 8;
> +};
> +
> +struct fs_context_operations {
> + void (*free)(struct fs_context *fc);
> + int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
> + int (*parse_source)(struct fs_context *fc);
> + int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
> + int (*parse_monolithic)(struct fs_context *fc, void *data);
> + int (*validate)(struct fs_context *fc);
> + int (*get_tree)(struct fs_context *fc);
> +};
> +
> +#endif /* _LINUX_FS_CONTEXT_H */
>
--
~Randy
On 04/20/2018 11:35 AM, David Howells wrote:
> Paul Moore <[email protected]> wrote:
>
>> Adding the SELinux mailing list to the CC line; in the future please
>> include the SELinux mailing list on patches like this. It would also
>> be very helpful to include "selinux" somewhere in the subject line
>> when the patch is predominately SELinux related (much like you did for
>> the other LSMs in this patchset).
>
> I should probably evict the SELinux bits into their own patch since the point
> of this patch is the LSM hooks, not specifically SELinux's implementation
> thereof.
>
>> I can't say I've digested all of this yet, but what SELinux testing
>> have you done with this patchset?
>
> Using the fsopen()/fsmount() syscalls, these hooks will be made use of, say
> for NFS (which I haven't included in this list). Even sys_mount() will make
> use of them a bit, so just booting the system does that.
>
> Note that for SELinux these hooks don't change very much except how the
> parameters are handled. It doesn't actually change the checks that are made -
> at least, not yet. There are some additional syscalls under consideration
> (such as the ability to pick a live mounted filesystem into a context) that
> might require additional permits.
Neither fsopen() nor fscontext_fs_write() appears to perform any kind of
up-front permission checking (DAC or MAC), although some security hooks may
ultimately be called to allocate structures, parse security options, etc.
Is there a reason not to apply a may_mount() or similar check up front?
Stephen Smalley <[email protected]> wrote:
> Neither fsopen() nor fscontext_fs_write() appears to perform any kind of
> up-front permission checking (DAC or MAC), although some security hooks may
> ultimately be called to allocate structures, parse security options, etc.
> Is there a reason not to apply a may_mount() or similar check up front?
may_mount() is called by fsmount() at the moment. It may make sense to move
this earlier to fsopen(). Note that there's also going to be something that
looks like:
fd = fspick("/mnt");
fsmount(fd, "/a", MNT_NOEXEC); // ie. bind mount
or:
fd = fspick("/mnt");
write(fd, "o intr");
write(fd, "x reconfigure"); // ie. something like remount
close(fd);
I guess we'd want to call may_mount() in fspick() too. But there's also the
possibility of using this to create a query interface:
fd = fspick("/mnt");
write(fd, "q intr");
read(fd, value_buffer);
David
On 04/24/2018 11:22 AM, David Howells wrote:
> Stephen Smalley <[email protected]> wrote:
>
>> Neither fsopen() nor fscontext_fs_write() appears to perform any kind of
>> up-front permission checking (DAC or MAC), although some security hooks may
>> ultimately be called to allocate structures, parse security options, etc.
>> Is there a reason not to apply a may_mount() or similar check up front?
>
> may_mount() is called by fsmount() at the moment. It may make sense to move
> this earlier to fsopen(). Note that there's also going to be something that
> looks like:
>
> fd = fspick("/mnt");
> fsmount(fd, "/a", MNT_NOEXEC); // ie. bind mount
>
> or:
>
> fd = fspick("/mnt");
> write(fd, "o intr");
> write(fd, "x reconfigure"); // ie. something like remount
> close(fd);
>
> I guess we'd want to call may_mount() in fspick() too. But there's also the
> possibility of using this to create a query interface:
>
> fd = fspick("/mnt");
> write(fd, "q intr");
> read(fd, value_buffer);
My concern was that fsopen()/fscontext_fs_write() may expose attack surface
(e.g. mount option parsing code) that, prior to your changes, would not
normally be accessible to unprivileged userspace because it is gated by
may_mount() and security_sb_mount().
Randy Dunlap <[email protected]> wrote:
> > + (2) Parse the options and attach them to the context. Options may be passed
> > + individually from userspace.
>
> Does this say that step (2) can be multiple small steps?
Perhaps "phase (2)" would be a better name than "step (2)". During (2),
multiple argument-supplying calls may be made.
> How does step (2) know when userspace has completed sending individual
> options?
Moving to phase (3) terminates phase (2). This is triggered by userspace
writing "x create" or "x reconfigure" to the fd as things stand.
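To make the phase transition concrete, here is a small user-space sketch.
The phase names, struct fsctx_state and fsctx_write() are hypothetical, and
returning -EBUSY for a late option is an assumption made for the example,
not something the patches specify:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical phases; the names are illustrative only. */
enum fsctx_phase {
	PHASE_PARSE_OPTIONS,	/* phase (2): options may still arrive */
	PHASE_GET_TREE,		/* phase (3): create or reconfigure */
};

struct fsctx_state {
	enum fsctx_phase phase;
	unsigned int nr_options;	/* options accepted so far */
};

/* Handle one write to the context fd; returns 0 or a negative errno. */
static int fsctx_write(struct fsctx_state *st, const char *cmd)
{
	assert(st && cmd);

	if (st->phase != PHASE_PARSE_OPTIONS)
		return -EBUSY;		/* assumed: phase (2) already ended */

	if (strncmp(cmd, "o ", 2) == 0 && cmd[2] != '\0') {
		st->nr_options++;	/* still in phase (2) */
		return 0;
	}

	if (strcmp(cmd, "x create") == 0 ||
	    strcmp(cmd, "x reconfigure") == 0) {
		st->phase = PHASE_GET_TREE;	/* this ends phase (2) */
		return 0;
	}

	return -EINVAL;
}
```

Any number of "o ..." writes is accepted until the first "x create" or
"x reconfigure", which is the only thing that ends the phase.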
> > + (6) Return an error message attached to the context.
>
> where/how is this done?
That got taken out and made general - which Linus then objected to. I need to
reinsert this and make it fscontext-specific as most people would really like
to have it, the mount process being able to produce so many weird and
wonderful errors.
> > +When the VFS creates this, it allocates ->fs_context_size bytes (as specified
> > +by the file_system_type object) to hold both the fs_context struct and any
> > +extra data required by the filesystem. The fs_context struct is placed at the
> > +beginning of this space. Any extra space beyond that is for use by the
> > +filesystem. The filesystem should wrap the struct in its own, e.g.:
>
> in its own struct, e.g.:
How about "... The filesystem should wrap the struct in one of its own, e.g."?
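As a user-space illustration of the wrapping (not kernel code): the struct
fs_context below is a one-field stand-in, and myfs_fs_context,
alloc_fs_context() and to_myfs() are hypothetical names. The allocator
plays the role of the VFS allocating ->fs_context_size zeroed bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in; the real struct fs_context has many more fields. */
struct fs_context {
	const char *source;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical filesystem wrapper; the fs_context must come first. */
struct myfs_fs_context {
	struct fs_context fc;
	int hidepid;		/* filesystem-private data follows */
};

/* Stand-in for the VFS allocating fs_context_size zeroed bytes. */
static struct fs_context *alloc_fs_context(size_t fs_context_size)
{
	assert(fs_context_size >= sizeof(struct fs_context));
	return calloc(1, fs_context_size);
}

/* Recover the wrapper from the embedded fs_context pointer. */
static struct myfs_fs_context *to_myfs(struct fs_context *fc)
{
	return container_of(fc, struct myfs_fs_context, fc);
}
```

Because the fs_context sits at offset 0, the wrapper and the embedded struct
share an address, and the extra space begins right after it.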
> > + (*) int security_fs_context_parse_option(struct fs_context *fc, char *opt);
> > +
> > + Called for each mount option. The arguments are as for the
> > + ->parse_option() method. An active LSM may reject one with an error, pass
> > + one over and return 0 or consume one and return 1. If consumed, the
>
> What does "pass one over" mean?
How about:
An active LSM may return 0 to pass the option on to the filesystem, 1
to cause the option to be discarded or an error to cause the option to
be rejected.
David
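A user-space sketch of that return convention, for illustration. The three
functions are hypothetical stand-ins for the LSM hook, the filesystem's
->parse_option() and the VFS dispatcher; the option names they recognise
are made up for the example:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical LSM hook: 0 = pass on, 1 = discard, <0 = reject. */
static int lsm_parse_option(const char *opt)
{
	if (strncmp(opt, "context=", 8) == 0)
		return 1;		/* consumed by the LSM */
	if (strcmp(opt, "forbidden") == 0)
		return -EACCES;		/* rejected by the LSM */
	return 0;			/* not ours; pass it on */
}

/* Hypothetical filesystem parser that only understands "intr". */
static int fs_parse_option(const char *opt)
{
	return strcmp(opt, "intr") == 0 ? 0 : -EINVAL;
}

/* Dispatcher: the filesystem only sees options the LSM passed on. */
static int vfs_parse_option(const char *opt)
{
	int ret;

	assert(opt);
	ret = lsm_parse_option(opt);
	if (ret < 0)
		return ret;	/* rejected with an error */
	if (ret == 1)
		return 0;	/* discarded; the fs never sees it */
	return fs_parse_option(opt);
}
```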
On 05/01/2018 07:29 AM, David Howells wrote:
> Randy Dunlap <[email protected]> wrote:
>
>>> + (2) Parse the options and attach them to the context. Options may be passed
>>> + individually from userspace.
>>
>> Does this say that step (2) can be multiple small steps?
>
> Perhaps "phase (2)" would be a better name than "step (2)". During (2),
> multiple argument-supplying calls may be made.
Ack.
>> How does step (2) know when userspace has completed sending individual
>> options?
>
> Moving to phase (3) terminates phase (2). This is triggered by userspace
> writing "x create" or "x reconfigure" to the fd as things stand.
>
>>> + (6) Return an error message attached to the context.
>>
>> where/how is this done?
>
> That got taken out and made general - which Linus then objected to. I need to
> reinsert this and make it fscontext-specific as most people would really like
> to have it, the mount process being able to produce so many weird and
> wonderful errors.
>
>>> +When the VFS creates this, it allocates ->fs_context_size bytes (as specified
>>> +by the file_system_type object) to hold both the fs_context struct and any
>>> +extra data required by the filesystem. The fs_context struct is placed at the
>>> +beginning of this space. Any extra space beyond that is for use by the
>>> +filesystem. The filesystem should wrap the struct in its own, e.g.:
>>
>> in its own struct, e.g.:
>
> How about "... The filesystem should wrap the struct in one of its own, e.g."?
OK.
>>> + (*) int security_fs_context_parse_option(struct fs_context *fc, char *opt);
>>> +
>>> + Called for each mount option. The arguments are as for the
>>> + ->parse_option() method. An active LSM may reject one with an error, pass
>>> + one over and return 0 or consume one and return 1. If consumed, the
>>
>> What does "pass one over" mean?
>
> How about:
>
> An active LSM may return 0 to pass the option on to the filesystem, 1
> to cause the option to be discarded or an error to cause the option to
> be rejected.
Much better. Thanks.
--
~Randy
On 04/19/2018 06:31 AM, David Howells wrote:
> Implement hooks to check the creation of new mountpoints for AppArmor.
>
> Unfortunately, the DFA evaluation puts the option data in last, after the
> details of the mountpoint, so we have to cache the mount options in the
> fs_context using those hooks till we get to the new mountpoint hook.
>
> Signed-off-by: David Howells <[email protected]>
thanks David,
this looks good, and has passed the testing that I have done so far. I
have started on the work that will allow us to reorder the match but
it's not ready yet and shouldn't hold this up.
Acked-by: John Johansen <[email protected]>
> cc: John Johansen <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
>
> security/apparmor/include/mount.h | 11 +++++
> security/apparmor/lsm.c | 80 +++++++++++++++++++++++++++++++++++++
> security/apparmor/mount.c | 46 +++++++++++++++++++++
> 3 files changed, 135 insertions(+), 2 deletions(-)
>
> diff --git a/security/apparmor/include/mount.h b/security/apparmor/include/mount.h
> index 25d6067fa6ef..0441bfae30fa 100644
> --- a/security/apparmor/include/mount.h
> +++ b/security/apparmor/include/mount.h
> @@ -16,6 +16,7 @@
>
> #include <linux/fs.h>
> #include <linux/path.h>
> +#include <linux/fs_context.h>
>
> #include "domain.h"
> #include "policy.h"
> @@ -27,7 +28,13 @@
> #define AA_AUDIT_DATA 0x40
> #define AA_MNT_CONT_MATCH 0x40
>
> -#define AA_MS_IGNORE_MASK (MS_KERNMOUNT | MS_NOSEC | MS_ACTIVE | MS_BORN)
> +#define AA_SB_IGNORE_MASK (SB_KERNMOUNT | SB_NOSEC | SB_ACTIVE | SB_BORN)
> +
> +struct apparmor_fs_context {
> + struct fs_context fc;
> + char *saved_options;
> + size_t saved_size;
> +};
>
> int aa_remount(struct aa_label *label, const struct path *path,
> unsigned long flags, void *data);
> @@ -45,6 +52,8 @@ int aa_move_mount(struct aa_label *label, const struct path *path,
> int aa_new_mount(struct aa_label *label, const char *dev_name,
> const struct path *path, const char *type, unsigned long flags,
> void *data);
> +int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
> + const struct path *mountpoint);
>
> int aa_umount(struct aa_label *label, struct vfsmount *mnt, int flags);
>
> diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
> index 9ebc9e9c3854..14398dec2e38 100644
> --- a/security/apparmor/lsm.c
> +++ b/security/apparmor/lsm.c
> @@ -518,6 +518,78 @@ static int apparmor_file_mprotect(struct vm_area_struct *vma,
> !(vma->vm_flags & VM_SHARED) ? MAP_PRIVATE : 0);
> }
>
> +static int apparmor_fs_context_alloc(struct fs_context *fc, struct super_block *src_sb)
> +{
> + struct apparmor_fs_context *afc;
> +
> + afc = kzalloc(sizeof(*afc), GFP_KERNEL);
> + if (!afc)
> + return -ENOMEM;
> +
> + fc->security = afc;
> + return 0;
> +}
> +
> +static int apparmor_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
> +{
> + fc->security = NULL;
> + return 0;
> +}
> +
> +static void apparmor_fs_context_free(struct fs_context *fc)
> +{
> + struct apparmor_fs_context *afc = fc->security;
> +
> + if (afc) {
> + kfree(afc->saved_options);
> + kfree(afc);
> + }
> +}
> +
> +/*
> + * As a temporary hack, we buffer all the options. The problem is that we need
> + * to pass them to the DFA evaluator *after* mount point parameters, which
> + * means deferring the entire check to the sb_mountpoint hook.
> + */
> +static int apparmor_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
> +{
> + struct apparmor_fs_context *afc = fc->security;
> + size_t space = 0;
> + char *p, *q;
> +
> + if (afc->saved_size > 0)
> + space = 1;
> +
> + p = krealloc(afc->saved_options, afc->saved_size + space + len + 1, GFP_KERNEL);
> + if (!p)
> + return -ENOMEM;
> +
> + q = p + afc->saved_size;
> + if (q != p)
> + *q++ = ' ';
> + memcpy(q, opt, len);
> + q += len;
> + *q = 0;
> +
> + afc->saved_options = p;
> + afc->saved_size += 1 + len;
> + return 0;
> +}
> +
> +static int apparmor_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
> + unsigned int mnt_flags)
> +{
> + struct aa_label *label;
> + int error = 0;
> +
> + label = __begin_current_label_crit_section();
> + if (!unconfined(label))
> + error = aa_new_mount_fc(label, fc, mountpoint);
> + __end_current_label_crit_section(label);
> +
> + return error;
> +}
> +
> static int apparmor_sb_mount(const char *dev_name, const struct path *path,
> const char *type, unsigned long flags, void *data)
> {
> @@ -528,7 +600,7 @@ static int apparmor_sb_mount(const char *dev_name, const struct path *path,
> if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
> flags &= ~MS_MGC_MSK;
>
> - flags &= ~AA_MS_IGNORE_MASK;
> + flags &= ~AA_SB_IGNORE_MASK;
>
> label = __begin_current_label_crit_section();
> if (!unconfined(label)) {
> @@ -1124,6 +1196,12 @@ static struct security_hook_list apparmor_hooks[] __lsm_ro_after_init = {
> LSM_HOOK_INIT(capget, apparmor_capget),
> LSM_HOOK_INIT(capable, apparmor_capable),
>
> + LSM_HOOK_INIT(fs_context_alloc, apparmor_fs_context_alloc),
> + LSM_HOOK_INIT(fs_context_dup, apparmor_fs_context_dup),
> + LSM_HOOK_INIT(fs_context_free, apparmor_fs_context_free),
> + LSM_HOOK_INIT(fs_context_parse_option, apparmor_fs_context_parse_option),
> + LSM_HOOK_INIT(sb_mountpoint, apparmor_sb_mountpoint),
> +
> LSM_HOOK_INIT(sb_mount, apparmor_sb_mount),
> LSM_HOOK_INIT(sb_umount, apparmor_sb_umount),
> LSM_HOOK_INIT(sb_pivotroot, apparmor_sb_pivotroot),
> diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
> index 45bb769d6cd7..3d477d288627 100644
> --- a/security/apparmor/mount.c
> +++ b/security/apparmor/mount.c
> @@ -554,6 +554,52 @@ int aa_new_mount(struct aa_label *label, const char *dev_name,
> return error;
> }
>
> +int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
> + const struct path *mountpoint)
> +{
> + struct apparmor_fs_context *afc = fc->security;
> + struct aa_profile *profile;
> + char *buffer = NULL, *dev_buffer = NULL;
> + bool binary;
> + int error;
> + struct path tmp_path, *dev_path = NULL;
> +
> + AA_BUG(!label);
> + AA_BUG(!mountpoint);
> +
> + binary = fc->fs_type->fs_flags & FS_BINARY_MOUNTDATA;
> +
> + if (fc->fs_type->fs_flags & FS_REQUIRES_DEV) {
> + if (!fc->source)
> + return -ENOENT;
> +
> + error = kern_path(fc->source, LOOKUP_FOLLOW, &tmp_path);
> + if (error)
> + return error;
> + dev_path = &tmp_path;
> + }
> +
> + get_buffers(buffer, dev_buffer);
> + if (dev_path) {
> + error = fn_for_each_confined(label, profile,
> + match_mnt(profile, mountpoint, buffer, dev_path, dev_buffer,
> + fc->fs_type->name,
> + fc->sb_flags & ~AA_SB_IGNORE_MASK,
> + afc->saved_options, binary));
> + } else {
> + error = fn_for_each_confined(label, profile,
> + match_mnt_path_str(profile, mountpoint, buffer, fc->source,
> + fc->fs_type->name,
> + fc->sb_flags & ~AA_SB_IGNORE_MASK,
> + afc->saved_options, binary, NULL));
> + }
> + put_buffers(buffer, dev_buffer);
> + if (dev_path)
> + path_put(dev_path);
> +
> + return error;
> +}
> +
> static int profile_umount(struct aa_profile *profile, struct path *path,
> char *buffer)
> {
>
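For reference, a user-space sketch of the option buffering done by the
parse_option hook, with realloc() standing in for krealloc(). The names
saved_opts and save_option() are hypothetical. One deliberate difference
from the quoted hunk: the running size here advances by the separator (if
any) plus the option length, so that it always equals the string length and
the next option overwrites the old terminator. The quoted hook advances by
1 + len unconditionally; whether that is intended is not something this
sketch can settle.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct saved_opts {
	char *buf;	/* space-separated options, NUL-terminated, or NULL */
	size_t size;	/* string length of buf, excluding the terminator */
};

/* Append one option; returns 0 on success, -1 on allocation failure. */
static int save_option(struct saved_opts *so, const char *opt, size_t len)
{
	size_t space = so->size > 0 ? 1 : 0;
	char *p, *q;

	assert(opt);
	p = realloc(so->buf, so->size + space + len + 1);
	if (!p)
		return -1;

	q = p + so->size;	/* the old terminator, if there was one */
	if (space)
		*q++ = ' ';
	memcpy(q, opt, len);
	q[len] = '\0';

	so->buf = p;
	so->size += space + len;
	return 0;
}
```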
John Johansen <[email protected]> wrote:
> this looks good, and has passed the testing that I have done so far. I
> have started on the work that will allow us to reorder the match but
> it's not ready yet and shouldn't hold this up.
Excellent, thanks!
One thing to consider: Kent Overstreet mentioned the possibility of adding
support for multiple sources - something that his bcachefs would require.
Hi David,
We run CRIU tests for vfs/for-next, and today a few of these tests failed. I
found that the problem appears after this patch.
https://travis-ci.org/avagin/linux/jobs/393766778
The reproducer is attached. It creates a process in a new set of namespaces
(user, mount, etc) and then this process fails to mount procfs, the mount
syscall returns EBUSY.
666 pipe([3, 4]) = 0
666 clone(child_stack=0x7ffc23a89400, flags=CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = 667
666 openat(AT_FDCWD, "/proc/667/uid_map", O_WRONLY <unfinished ...>
667 close(4 <unfinished ...>
666 <... openat resumed> ) = 5
666 write(5, "0 100000 100000\n100000 200000 50"..., 36 <unfinished ...>
667 <... close resumed> ) = 0
666 <... write resumed> ) = 36
666 close(5 <unfinished ...>
667 read(3, <unfinished ...>
666 <... close resumed> ) = 0
666 openat(AT_FDCWD, "/proc/667/gid_map", O_WRONLY) = 5
666 write(5, "0 400000 50000\n50000 500000 1000"..., 35) = 35
666 close(5) = 0
666 write(4, " \225\250#", 4) = 4
667 <... read resumed> " \225\250#", 4) = 4
666 wait4(667, <unfinished ...>
667 setsid() = 1
667 setuid(0) = 0
667 setgid(0) = 0
667 setgroups(0, NULL) = 0
667 mount("proc", "/mnt", "proc", MS_MGC_VAL|MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EBUSY (Device or resource busy)
Thanks,
Andrei
On Thu, Apr 19, 2018 at 02:32:28PM +0100, David Howells wrote:
> Add fs_context support to procfs.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> fs/proc/inode.c | 2 -
> fs/proc/internal.h | 2 -
> fs/proc/root.c | 169 ++++++++++++++++++++++++++++++++++------------------
> 3 files changed, 113 insertions(+), 60 deletions(-)
>
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index 0b13cf6eb6d7..7aa86dd65ba8 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -128,7 +128,7 @@ const struct super_operations proc_sops = {
> .drop_inode = generic_delete_inode,
> .evict_inode = proc_evict_inode,
> .statfs = simple_statfs,
> - .remount_fs = proc_remount,
> + .reconfigure = proc_reconfigure,
> .show_options = proc_show_options,
> };
>
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 3182e1b636d3..a5ab9504768a 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -254,7 +254,7 @@ static inline void proc_tty_init(void) {}
> extern struct proc_dir_entry proc_root;
>
> extern void proc_self_init(void);
> -extern int proc_remount(struct super_block *, int *, char *, size_t);
> +extern int proc_reconfigure(struct super_block *, struct fs_context *);
>
> /*
> * task_[no]mmu.c
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 2fbc177f37a8..e6bd31fbc714 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -19,14 +19,24 @@
> #include <linux/module.h>
> #include <linux/bitops.h>
> #include <linux/user_namespace.h>
> +#include <linux/fs_context.h>
> #include <linux/mount.h>
> #include <linux/pid_namespace.h>
> #include <linux/parser.h>
> #include <linux/cred.h>
> #include <linux/magic.h>
> +#include <linux/slab.h>
>
> #include "internal.h"
>
> +struct proc_fs_context {
> + struct fs_context fc;
> + struct pid_namespace *pid_ns;
> + unsigned long mask;
> + int hidepid;
> + int gid;
> +};
> +
> enum {
> Opt_gid, Opt_hidepid, Opt_err,
> };
> @@ -37,56 +47,60 @@ static const match_table_t tokens = {
> {Opt_err, NULL},
> };
>
> -static int proc_parse_options(char *options, struct pid_namespace *pid)
> +static int proc_parse_option(struct fs_context *fc, char *opt, size_t len)
> {
> - char *p;
> + struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
> substring_t args[MAX_OPT_ARGS];
> - int option;
> -
> - if (!options)
> - return 1;
> -
> - while ((p = strsep(&options, ",")) != NULL) {
> - int token;
> - if (!*p)
> - continue;
> -
> - args[0].to = args[0].from = NULL;
> - token = match_token(p, tokens, args);
> - switch (token) {
> - case Opt_gid:
> - if (match_int(&args[0], &option))
> - return 0;
> - pid->pid_gid = make_kgid(current_user_ns(), option);
> - break;
> - case Opt_hidepid:
> - if (match_int(&args[0], &option))
> - return 0;
> - if (option < HIDEPID_OFF ||
> - option > HIDEPID_INVISIBLE) {
> - pr_err("proc: hidepid value must be between 0 and 2.\n");
> - return 0;
> - }
> - pid->hide_pid = option;
> - break;
> - default:
> - pr_err("proc: unrecognized mount option \"%s\" "
> - "or missing value\n", p);
> - return 0;
> + int token;
> +
> + args[0].to = args[0].from = NULL;
> + token = match_token(opt, tokens, args);
> + switch (token) {
> + case Opt_gid:
> + if (match_int(&args[0], &ctx->gid))
> + return -EINVAL;
> + break;
> +
> + case Opt_hidepid:
> + if (match_int(&args[0], &ctx->hidepid))
> + return -EINVAL;
> + if (ctx->hidepid < HIDEPID_OFF ||
> + ctx->hidepid > HIDEPID_INVISIBLE) {
> + pr_err("proc: hidepid value must be between 0 and 2.\n");
> + return -EINVAL;
> }
> + break;
> +
> + default:
> + pr_err("proc: unrecognized mount option \"%s\" or missing value\n",
> + opt);
> + return -EINVAL;
> }
>
> - return 1;
> + ctx->mask |= 1 << token;
> + return 0;
> +}
> +
> +static void proc_set_options(struct super_block *s,
> + struct fs_context *fc,
> + struct pid_namespace *pid_ns,
> + struct user_namespace *user_ns)
> +{
> + struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
> +
> + if (ctx->mask & (1 << Opt_gid))
> + pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
> + if (ctx->mask & (1 << Opt_hidepid))
> + pid_ns->hide_pid = ctx->hidepid;
> }
>
> -static int proc_fill_super(struct super_block *s, void *data, size_t data_size, int silent)
> +static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> {
> - struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
> + struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
> struct inode *root_inode;
> int ret;
>
> - if (!proc_parse_options(data, ns))
> - return -EINVAL;
> + proc_set_options(s, fc, pid_ns, current_user_ns());
>
> /* User space would break if executables or devices appear on proc */
> s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
> @@ -103,7 +117,7 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
> * top of it
> */
> s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
> -
> +
> pde_get(&proc_root);
> root_inode = proc_get_inode(s, &proc_root);
> if (!root_inode) {
> @@ -124,30 +138,46 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
> return proc_setup_thread_self(s);
> }
>
> -int proc_remount(struct super_block *sb, int *flags,
> - char *data, size_t data_size)
> +int proc_reconfigure(struct super_block *sb, struct fs_context *fc)
> {
> struct pid_namespace *pid = sb->s_fs_info;
>
> sync_filesystem(sb);
> - return !proc_parse_options(data, pid);
> +
> + if (fc)
> + proc_set_options(sb, fc, pid, current_user_ns());
> + return 0;
> }
>
> -static struct dentry *proc_mount(struct file_system_type *fs_type,
> - int flags, const char *dev_name,
> - void *data, size_t data_size)
> +static int proc_get_tree(struct fs_context *fc)
> {
> - struct pid_namespace *ns;
> + struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
>
> - if (flags & SB_KERNMOUNT) {
> - ns = data;
> - data = NULL;
> - } else {
> - ns = task_active_pid_ns(current);
> - }
> + ctx->fc.s_fs_info = ctx->pid_ns;
> + return vfs_get_super(fc, vfs_get_keyed_super, proc_fill_super);
> +}
>
> - return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
> - proc_fill_super);
> +static void proc_fs_context_free(struct fs_context *fc)
> +{
> + struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
> +
> + if (ctx->pid_ns)
> + put_pid_ns(ctx->pid_ns);
> +}
> +
> +static const struct fs_context_operations proc_fs_context_ops = {
> + .free = proc_fs_context_free,
> + .parse_option = proc_parse_option,
> + .get_tree = proc_get_tree,
> +};
> +
> +static int proc_init_fs_context(struct fs_context *fc, struct super_block *src_sb)
> +{
> + struct proc_fs_context *ctx = container_of(fc, struct proc_fs_context, fc);
> +
> + ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
> + ctx->fc.ops = &proc_fs_context_ops;
> + return 0;
> }
>
> static void proc_kill_sb(struct super_block *sb)
> @@ -165,7 +195,8 @@ static void proc_kill_sb(struct super_block *sb)
>
> static struct file_system_type proc_fs_type = {
> .name = "proc",
> - .mount = proc_mount,
> + .fs_context_size = sizeof(struct proc_fs_context),
> + .init_fs_context = proc_init_fs_context,
> .kill_sb = proc_kill_sb,
> .fs_flags = FS_USERNS_MOUNT,
> };
> @@ -205,7 +236,7 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr
> {
> if (!proc_pid_lookup(dir, dentry, flags))
> return NULL;
> -
> +
> return proc_lookup(dir, dentry, flags);
> }
>
> @@ -259,9 +290,31 @@ struct proc_dir_entry proc_root = {
>
> int pid_ns_prepare_proc(struct pid_namespace *ns)
> {
> + struct proc_fs_context *ctx;
> + struct fs_context *fc;
> struct vfsmount *mnt;
> + int ret;
> +
> + fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
> + FS_CONTEXT_FOR_KERNEL_MOUNT);
> + if (IS_ERR(fc))
> + return PTR_ERR(fc);
> +
> + ctx = container_of(fc, struct proc_fs_context, fc);
> + if (ctx->pid_ns != ns) {
> + put_pid_ns(ctx->pid_ns);
> + get_pid_ns(ns);
> + ctx->pid_ns = ns;
> + }
> +
> + ret = vfs_get_tree(fc);
> + if (ret < 0) {
> + put_fs_context(fc);
> + return ret;
> + }
>
> - mnt = kern_mount_data(&proc_fs_type, ns, 0);
> + mnt = vfs_create_mount(fc);
> + put_fs_context(fc);
> if (IS_ERR(mnt))
> return PTR_ERR(mnt);
>
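The patch above splits option handling into a parse step that records
values plus a bit in ->mask, and an apply step that copies only the options
that were actually supplied. A user-space sketch of that pattern follows;
the struct names parsed_opts and live_opts and both functions are
hypothetical stand-ins, not the kernel types:

```c
#include <assert.h>

/* Token numbers double as bit positions in the mask. */
enum { Opt_gid, Opt_hidepid, Opt_max };

/* What the parser saw; stands in for struct proc_fs_context. */
struct parsed_opts {
	unsigned long mask;
	int gid;
	int hidepid;
};

/* Live settings; stands in for the fields in struct pid_namespace. */
struct live_opts {
	int pid_gid;
	int hide_pid;
};

/* Parse step: remember the value and the fact that it was supplied. */
static void record_option(struct parsed_opts *p, int token, int value)
{
	assert(p && token >= 0 && token < Opt_max);

	if (token == Opt_gid)
		p->gid = value;
	else
		p->hidepid = value;
	p->mask |= 1UL << token;
}

/* Apply step: touch only the settings that were actually supplied. */
static void apply_options(const struct parsed_opts *p, struct live_opts *live)
{
	if (p->mask & (1UL << Opt_gid))
		live->pid_gid = p->gid;
	if (p->mask & (1UL << Opt_hidepid))
		live->hide_pid = p->hidepid;
}
```

On a reconfigure that supplies only gid=, the existing hidepid setting is
left alone because its mask bit is clear.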
On Mon, Jun 18, 2018 at 08:34:50PM -0700, Andrei Vagin wrote:
> Hi David,
>
> We run CRIU tests for vfs/for-next, and today a few of these tests failed. I
> found that the problem appears after this patch.
>
> > int pid_ns_prepare_proc(struct pid_namespace *ns)
> > {
> > + struct proc_fs_context *ctx;
> > + struct fs_context *fc;
> > struct vfsmount *mnt;
> > + int ret;
> > +
> > + fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
> > + FS_CONTEXT_FOR_KERNEL_MOUNT);
> > + if (IS_ERR(fc))
> > + return PTR_ERR(fc);
> > +
> > + ctx = container_of(fc, struct proc_fs_context, fc);
> > + if (ctx->pid_ns != ns) {
> > + put_pid_ns(ctx->pid_ns);
> > + get_pid_ns(ns);
> > + ctx->pid_ns = ns;
> > + }
> > +
> > + ret = vfs_get_tree(fc);
> > + if (ret < 0) {
> > + put_fs_context(fc);
> > + return ret;
> > + }
> >
> > - mnt = kern_mount_data(&proc_fs_type, ns, 0);
Here ns->user_ns and get_current_cred()->user_ns are not always equal
> > + mnt = vfs_create_mount(fc);
> > + put_fs_context(fc);
> > if (IS_ERR(mnt))
> > return PTR_ERR(mnt);
> >
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sched.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <sys/mount.h>
> #include <sys/wait.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <stdlib.h>
> #include <grp.h>
> #include <linux/limits.h>
>
>
> #define NS_STACK_SIZE 4096
>
> #define __stack_aligned__ __attribute__((aligned(16)))
>
> /* All arguments should be above stack, because it grows down */
> struct ns_exec_args {
> char stack[NS_STACK_SIZE] __stack_aligned__;
> char stack_ptr[0];
> int pfd[2];
> };
>
> static int ns_exec(void *_arg)
> {
> struct ns_exec_args *args = (struct ns_exec_args *) _arg;
> int ret;
>
> close(args->pfd[1]);
> if (read(args->pfd[0], &ret, sizeof(ret)) != sizeof(ret))
> return -1;
>
> setsid();
>
> if (setuid(0) || setgid(0) || setgroups(0, NULL)) {
> fprintf(stderr, "set*id failed: %m\n");
> return -1;
> }
>
> if (mount("proc", "/mnt", "proc", MS_MGC_VAL | MS_NOSUID | MS_NOEXEC | MS_NODEV, NULL)) {
> fprintf(stderr, "mount(/proc) failed: %m\n");
> return -1;
> }
>
> return 0;
> }
>
> #define UID_MAP "0 100000 100000\n100000 200000 50000"
> #define GID_MAP "0 400000 50000\n50000 500000 100000"
> int main()
> {
> pid_t pid;
> int ret, status;
> struct ns_exec_args args;
> int flags;
> char pname[PATH_MAX];
> int fd, pfd[2];
>
> if (pipe(pfd))
> return 1;
>
> args.pfd[0] = pfd[0];
> args.pfd[1] = pfd[1];
>
> flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS |
> CLONE_NEWNET | CLONE_NEWIPC | CLONE_NEWUSER | SIGCHLD;
>
> pid = clone(ns_exec, args.stack_ptr, flags, &args);
> if (pid < 0) {
> fprintf(stderr, "clone() failed: %m\n");
> exit(1);
> }
>
>
> snprintf(pname, sizeof(pname), "/proc/%d/uid_map", pid);
> fd = open(pname, O_WRONLY);
> if (fd < 0) {
> fprintf(stderr, "open(%s): %m\n", pname);
> exit(1);
> }
> if (write(fd, UID_MAP, sizeof(UID_MAP)) < 0) {
> fprintf(stderr, "write(" UID_MAP "): %m\n");
> exit(1);
> }
> close(fd);
>
> snprintf(pname, sizeof(pname), "/proc/%d/gid_map", pid);
> fd = open(pname, O_WRONLY);
> if (fd < 0) {
> fprintf(stderr, "open(%s): %m\n", pname);
> exit(1);
> }
> if (write(fd, GID_MAP, sizeof(GID_MAP)) < 0) {
> fprintf(stderr, "write(" GID_MAP "): %m\n");
> exit(1);
> }
> close(fd);
>
> if (write(pfd[1], &ret, sizeof(ret)) != sizeof(ret))
> return 1;
>
> if (waitpid(pid, &status, 0) != pid)
> return 1;
> if (status)
> return 1;
>
> return 0;
> }
On Mon, Jun 25, 2018 at 11:13:20PM -0700, Andrei Vagin wrote:
> On Mon, Jun 18, 2018 at 08:34:50PM -0700, Andrei Vagin wrote:
> > Hi David,
> >
> > We run CRIU tests for vfs/for-next, and today a few of these tests failed. I
> > found that the problem appears after this patch.
> >
> > > int pid_ns_prepare_proc(struct pid_namespace *ns)
> > > {
> > > + struct proc_fs_context *ctx;
> > > + struct fs_context *fc;
> > > struct vfsmount *mnt;
> > > + int ret;
> > > +
> > > + fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
> > > + FS_CONTEXT_FOR_KERNEL_MOUNT);
> > > + if (IS_ERR(fc))
> > > + return PTR_ERR(fc);
> > > +
> > > + ctx = container_of(fc, struct proc_fs_context, fc);
> > > + if (ctx->pid_ns != ns) {
> > > + put_pid_ns(ctx->pid_ns);
> > > + get_pid_ns(ns);
> > > + ctx->pid_ns = ns;
> > > + }
> > > +
> > > + ret = vfs_get_tree(fc);
> > > + if (ret < 0) {
> > > + put_fs_context(fc);
> > > + return ret;
> > > + }
> > >
> > > - mnt = kern_mount_data(&proc_fs_type, ns, 0);
>
> Here ns->user_ns and get_current_cred()->user_ns are not always equal
What do you think about the attached patch?
>
> > > + mnt = vfs_create_mount(fc);
> > > + put_fs_context(fc);
> > > if (IS_ERR(mnt))
> > > return PTR_ERR(mnt);
> > >
Andrei Vagin <[email protected]> wrote:
> > > > - mnt = kern_mount_data(&proc_fs_type, ns, 0);
> >
> > Here ns->user_ns and get_current_cred()->user_ns are not always equal
>
> What do you think about the attached patch?
> ...
> - fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
> - FS_CONTEXT_FOR_KERNEL_MOUNT);
> + fc = vfs_new_fs_context_userns(&proc_fs_type, NULL, 0,
> + FS_CONTEXT_FOR_KERNEL_MOUNT, ns->user_ns);
Or you could just change fc->user_ns immediately after calling
vfs_new_fs_context(). This is what network filesystems should do with
fc->net_ns, for example.
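A user-space mock of what replacing the namespace on a freshly created
context involves, assuming the context holds a counted reference as it does
for ctx->pid_ns in the patch. All of the mock_* types and helpers are
hypothetical; the real code would use the kernel's own get/put helpers:

```c
#include <assert.h>
#include <stddef.h>

struct mock_user_ns {
	int refcount;
};

struct mock_fs_context {
	struct mock_user_ns *user_ns;	/* holds one reference */
};

static struct mock_user_ns *mock_get_user_ns(struct mock_user_ns *ns)
{
	ns->refcount++;
	return ns;
}

static void mock_put_user_ns(struct mock_user_ns *ns)
{
	assert(ns->refcount > 0);
	ns->refcount--;
}

/* Point the context at ns, keeping the reference counts balanced. */
static void mock_fc_set_user_ns(struct mock_fs_context *fc,
				struct mock_user_ns *ns)
{
	assert(fc && ns);

	if (fc->user_ns == ns)
		return;			/* nothing to do */

	mock_get_user_ns(ns);		/* take the new ref first */
	if (fc->user_ns)
		mock_put_user_ns(fc->user_ns);
	fc->user_ns = ns;
}
```

Taking the new reference before dropping the old one mirrors the usual
ordering for this kind of swap.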
> -struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
> +struct fs_context *vfs_new_fs_context_userns(struct file_system_type *fs_type,
> struct dentry *reference,
> unsigned int sb_flags,
> - enum fs_context_purpose purpose)
> + enum fs_context_purpose purpose,
> + struct user_namespace *user_ns)
If you'd really rather add a new parameter, please don't rename the function
to vfs_new_fs_context_userns() - just add a new parameter. There don't need
to be two versions of it.
This brings me to another thought: I want to add the ability to let
namespaces be configured by userspace, for example:
fd = fsopen("nfs");
sprintf(buf, "ns user %d", my_user_ns_fd);
write(fd, buf);
sprintf(buf, "ns net %d", my_net_ns_fd);
write(fd, buf);
write(fd, "s fedoraproject.org:/pub");
write(fd, "o intr");
...
I think therefore, I might need to insert another phase between creating the
context and calling the filesystem initialiser:
fc = vfs_new_fs_context(&afs_fs_type, mntpt, 0,
FS_CONTEXT_FOR_SUBMOUNT);
followed by:
vfs_sb_set_namespace(fc, THIS_IS_USER_NS, user_ns);
vfs_sb_set_namespace(fc, THIS_IS_NET_NS, net_ns);
but then we'd need to do:
vfs_begin_options(fc);
before continuing (unless we made this happen automatically on the receipt of
the first option):
afs_mntpt_set_params(fc, mntpt);
vfs_get_tree(fc);
mnt = vfs_create_mount(fc, 0);
Alternatively, we could do the namespace setting after initialisation and let
the fs apply the changes itself.
David
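On the userspace side, the "ns user %d" style of command shown earlier
could be built with a bounds-checked helper instead of a bare sprintf().
This is only a sketch: format_ns_cmd() is a hypothetical name, and the
"ns <kind> <fd>" syntax is the proposal from this mail, not a settled
interface:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "ns <kind> <fd>" into buf.  Returns the string length, or -1 if
 * the kind is empty or the result would not fit in size bytes.
 */
static int format_ns_cmd(char *buf, size_t size, const char *kind, int ns_fd)
{
	int n;

	assert(buf && kind);
	if (strlen(kind) == 0)
		return -1;

	n = snprintf(buf, size, "ns %s %d", kind, ns_fd);
	if (n < 0 || (size_t)n >= size)
		return -1;	/* encoding error or truncated */
	return n;
}
```

The returned length can then be passed straight to write() on the context
fd.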
On Tue, Jun 26, 2018 at 09:57:07AM +0100, David Howells wrote:
> Andrei Vagin <[email protected]> wrote:
>
> > > > > - mnt = kern_mount_data(&proc_fs_type, ns, 0);
> > >
> > > Here ns->user_ns and get_current_cred()->user_ns are not always equal
> >
> > What do you think about the attached patch?
> > ...
> > - fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
> > - FS_CONTEXT_FOR_KERNEL_MOUNT);
> > + fc = vfs_new_fs_context_userns(&proc_fs_type, NULL, 0,
> > + FS_CONTEXT_FOR_KERNEL_MOUNT, ns->user_ns);
>
> Or you could just change fc->user_ns immediately after calling
> vfs_new_fs_context(). This is what network filesystems should do with
> fc->net_ns, for example.
Ok, it works for me. The patch is attached.
>
> > -struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
> > +struct fs_context *vfs_new_fs_context_userns(struct file_system_type *fs_type,
> > struct dentry *reference,
> > unsigned int sb_flags,
> > - enum fs_context_purpose purpose)
> > + enum fs_context_purpose purpose,
> > + struct user_namespace *user_ns)
>
>
> If you'd really rather add a new parameter, please don't rename the function
> to vfs_new_fs_context_userns() - just add a new parameter. There don't need
> to be two versions of it.
>
>
> This brings me to another thought: I want to add the ability to let
> namespaces be configured by userspace, for example:
It may be a good feature, but I am not sure about procfs. A procfs
instance is created per pidns, so they should have the same owner
userns.
>
> fd = fsopen("nfs");
> sprintf(buf, "ns user %d", my_user_ns_fd);
> write(fd, buf);
> sprintf(buf, "ns net %d", my_net_ns_fd);
> write(fd, buf);
> write(fd, "s fedoraproject.org:/pub");
> write(fd, "o intr");
> ...
>
> I think therefore, I might need to insert another phase between creating the
> context and calling the filesystem initialiser:
>
> fc = vfs_new_fs_context(&afs_fs_type, mntpt, 0,
> FS_CONTEXT_FOR_SUBMOUNT);
>
> followed by:
>
> vfs_sb_set_namespace(fc, THIS_IS_USER_NS, user_ns);
> vfs_sb_set_namespace(fc, THIS_IS_NET_NS, net_ns);
>
> but then we'd need to do:
>
> vfs_begin_options(fc);
>
> before continuing (unless we made this happen automatically on the receipt of
> the first option):
>
> afs_mntpt_set_params(fc, mntpt);
> vfs_get_tree(fc);
> mnt = vfs_create_mount(fc, 0);
>
> Alternatively, we could do the namespace setting after initialisation and let
> the fs apply the changes itself.
>
> David