2018-09-21 16:30:42

by David Howells

Subject: [PATCH 00/34] VFS: Introduce filesystem context [ver #12]


Hi Al,

Here is a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount. This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

fd = fsopen("nfs");
fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, MS_NODEV);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).
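
For illustration only, here is a minimal userspace sketch of how a legacy
comma-separated option string could be broken into the individual key/value
pairs that would each be handed to a separate fsconfig() call. The struct
and the helper name split_mount_option() are hypothetical and not part of
this series; quoting and values containing commas are deliberately not
handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for one parsed parameter; not a kernel structure. */
struct mount_param {
	char key[64];
	char value[64];	/* empty string for flag-style options such as "ro" */
};

/*
 * Pull the next "key[=value]" element out of a comma-separated option
 * string and advance *cursor past it.
 *
 * Returns 1 if an element was produced, 0 at the end of the string and
 * -1 if the key or value does not fit in the fixed-size buffers.
 */
int split_mount_option(const char **cursor, struct mount_param *out)
{
	const char *p, *end, *eq;
	size_t klen, vlen;

	assert(cursor && *cursor && out);

	p = *cursor;
	while (*p == ',')		/* skip empty elements */
		p++;
	if (!*p) {
		*cursor = p;
		return 0;
	}

	end = strchr(p, ',');
	if (!end)
		end = p + strlen(p);
	eq = memchr(p, '=', (size_t)(end - p));

	klen = eq ? (size_t)(eq - p) : (size_t)(end - p);
	vlen = eq ? (size_t)(end - eq - 1) : 0;
	if (klen >= sizeof(out->key) || vlen >= sizeof(out->value))
		return -1;

	memcpy(out->key, p, klen);
	out->key[klen] = '\0';
	if (vlen)
		memcpy(out->value, eq + 1, vlen);
	out->value[vlen] = '\0';

	*cursor = end;
	return 1;
}
```

Each pair produced this way would map onto one call of the form shown
above, e.g. fsconfig(fd, FSCONFIG_SET_STRING, key, value, 0), so that a
rejected parameter can be reported individually.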

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by a legacy filesystem wrapper.


====================
WHY DO WE WANT THIS?
====================

Firstly, there's a bunch of problems with the mount(2) syscall:

(1) It's actually six or seven different interfaces rolled into one and
weird combinations of flags make it do different things beyond the
original specification of the syscall.

(2) It produces a particularly large and diverse set of errors, which have
to be mapped back to a small error code. Yes, there's dmesg - if you
have it configured - but you can't necessarily see that if you're
doing a mount inside of a container.

(3) It copies a PAGE_SIZE block of data for each of the type, device name
and options.

(4) The size of the buffers is PAGE_SIZE - and this is arch dependent.

(5) You can't mount into another mount namespace. If that were possible, I
could, for example, build a container from the outside without having to
be in that container's namespace.

(6) It's not really geared for the specification of multiple sources, but
some filesystems really want that - overlayfs, for example.

and some problems in the internal kernel api:

(1) There's no defined way to supply namespace configuration for the
superblock - so, for instance, I can't say that I want to create a
superblock in a particular network namespace (on automount, say).

NFS hacks around this by creating multiple shadow file_system_types
with different ->mount() ops.

(2) When calling mount internally, unless you have NFS-like hacks, you
have to generate or otherwise provide text config data which then gets
parsed, when some of the time you could bypass the parsing stage
entirely.

(3) The amount of data in the data buffer is not known, but the data
buffer might be on a kernel stack somewhere, leading to the
possibility of tripping the stack underrun guard.

and other issues too:

(1) Superblock remount in some filesystems applies options on an as-parsed
basis, so if there's a parse failure, a partial alteration with no
rollback is effected.

(2) Under some circumstances, the mount data may get copied multiple times
so that it can have multiple parsers applied to it or because it has
to be parsed multiple times - for instance, once to get the
preliminary info required to access the on-disk superblock and then
again to update the superblock record in the kernel.
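
The partial-alteration problem in (1) can be illustrated with a small
userspace sketch: parse everything into a scratch copy first and only
commit the copy if every parameter was accepted. The struct, the option
names and both function names are made up for the example; this is not the
kernel's fs_context API, just the general parse-then-commit idea.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical live superblock settings. */
struct sb_settings {
	int read_only;
	unsigned int timeout;
};

/* Apply one key/value pair to *s; -EINVAL if unknown or malformed. */
static int parse_one(struct sb_settings *s, const char *key, const char *value)
{
	if (strcmp(key, "ro") == 0) {
		s->read_only = 1;
		return 0;
	}
	if (strcmp(key, "rw") == 0) {
		s->read_only = 0;
		return 0;
	}
	if (strcmp(key, "timeout") == 0 && value && *value) {
		char *end;
		unsigned long v;

		errno = 0;
		v = strtoul(value, &end, 10);
		if (errno || *end || v > 3600)
			return -EINVAL;
		s->timeout = (unsigned int)v;
		return 0;
	}
	return -EINVAL;
}

/*
 * Parse all n parameters into a scratch copy and commit only on full
 * success, so a parse failure leaves *live exactly as it was.
 */
int reconfigure(struct sb_settings *live, const char *const keys[],
		const char *const values[], size_t n)
{
	struct sb_settings scratch;
	size_t i;
	int ret;

	assert(live);

	scratch = *live;
	for (i = 0; i < n; i++) {
		ret = parse_one(&scratch, keys[i], values[i]);
		if (ret < 0)
			return ret;	/* nothing has been applied yet */
	}
	*live = scratch;
	return 0;
}
```

With the as-parsed approach, "ro" in a request of "ro,timeout=abc" would
already have taken effect by the time the bad timeout is rejected; here the
live settings are untouched.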

I want to be able to add support for a bunch of things:

(1) UID, GID and Project ID mapping/translation. I want to be able to
install a translation table of some sort on the superblock to
translate source identifiers (which may be foreign numeric UIDs/GIDs,
text names, GUIDs) into system identifiers. This needs to be done
before the superblock is published[*].

Note that this may, for example, involve using the context and the
superblock held therein to issue an RPC to a server to look up
translations.

[*] By "published" I mean made available through mount so that other
userspace processes can access it by path.

Maybe specifying a translation range element with something like:

fsconfig(fd, fsconfig_translate_uid, "<srcuid> <nsuid> <count>", 0, 0);

The translation information also needs to propagate over an automount
in some circumstances.

(2) Namespace configuration. I want to be able to tell the superblock
creation process what namespaces should be applied when it is created (in
particular the userns and netns) for containerisation purposes, e.g.:

fsconfig(fd, FSCONFIG_SET_NAMESPACE, "user", 0, userns_fd);
fsconfig(fd, FSCONFIG_SET_NAMESPACE, "net", 0, netns_fd);

(3) Namespace propagation. I want to have a properly defined mechanism
for propagating namespace configuration over automounts within the
kernel. This will be particularly useful for network filesystems.

(4) Pre-mount attribute query. A chunk of the changes is actually the
fsinfo() syscall to query attributes of the filesystem beyond what's
available in statx() and statfs(). This will allow a created
superblock to be queried before it is published.

(5) Upcall for configuration. I would like to be able to query
configuration that's stored in userspace when an automount is made.
For instance, to look up network parameters for NFS or to find a cache
selector for fscache.

The internal fs_context could be passed to the upcall process or the
kernel could read a config file directly if named appropriately for the
superblock, perhaps:

[/etc/fscontext.d/afs/example.com/cell.cfg]
realm = EXAMPLE.COM
translation = uid,3000,4000,100
fscache = tag=fred

(6) Event notifications. I want to be able to install a watch on a
superblock before it is published to catch things like quota events
and EIO.

(7) Large and binary parameters. There might be at some point a need to
pass large/binary objects like Microsoft PACs around. If I understand
PACs correctly, you can obtain these from the Kerberos server and then
pass them to the file server when you connect.

Making it possible to pass large or binary objects as individual
fsconfig calls makes parsing these trivial. OTOH, some or all of this
can potentially be handled with the use of the keyrings interface - as
the afs filesystem does for passing kerberos tokens around; it's just
that that seems overkill for a parameter you may only need once.
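
To make the translation range element floated in item (1) a little more
concrete, here is a rough userspace sketch that parses a
"<srcuid> <nsuid> <count>" string and uses the result to translate an ID.
The struct layout, the function names and the choice of -ENOENT for an
unmapped ID are all assumptions made for the example; nothing of the sort
is implemented in this series.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical translation table entry: [src, src+count) -> [ns, ns+count). */
struct uid_range {
	unsigned int src;
	unsigned int ns;
	unsigned int count;
};

/* Parse "<srcuid> <nsuid> <count>"; returns 0 or -EINVAL. */
int parse_uid_range(const char *spec, struct uid_range *r)
{
	unsigned int src, ns, count;
	char tail;

	assert(spec && r);

	/* The trailing %c only matches if there is junk after the count. */
	if (sscanf(spec, "%u %u %u %c", &src, &ns, &count, &tail) != 3)
		return -EINVAL;
	if (count == 0)
		return -EINVAL;
	/* Reject ranges that would wrap around the ID space. */
	if (count - 1 > UINT_MAX - src || count - 1 > UINT_MAX - ns)
		return -EINVAL;

	r->src = src;
	r->ns = ns;
	r->count = count;
	return 0;
}

/* Translate a source ID through the table; -ENOENT if no range covers it. */
int translate_uid(const struct uid_range *tab, size_t n,
		  unsigned int src, unsigned int *out)
{
	size_t i;

	assert(tab && out);

	for (i = 0; i < n; i++) {
		if (src >= tab[i].src && src - tab[i].src < tab[i].count) {
			*out = tab[i].ns + (src - tab[i].src);
			return 0;
		}
	}
	return -ENOENT;
}
```

The same table shape would cover the "translation = uid,3000,4000,100"
line in the config file example in item (5), give or take the separator.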


===================
SIGNIFICANT CHANGES
===================

ver #12:

(*) Rebased on v4.19-rc3.

(*) Added three new context purposes: mount for hidden root, reconfigure
for unmount, reconfigure for emergency remount.

(*) Added a parameter for the new purpose into vfs_dup_fs_context().

(*) Moved the reconfiguration hook from struct super_operations to struct
fs_context_operations so that it can be handled through the legacy
wrapper. mount -o remount now goes through that.

(*) Changed the parameter description in the following ways:

- Nominated one master name for each parameter, held in a simple
string pointer array. This makes it easy to simply look up a name
for that parameter for logging.

- Added a table of additional names for parameters. The name chosen
can be used to influence the action of the parameter.

- Noted which parameter is the source specifier, if there is one.

(*) Use correct user_ns for a new pidns superblock.

(*) Fix mqueue to not crash on mounting.

(*) Make VFS sample programs dependent on X86 to avoid errors in
autobuilders due to unset syscall IDs in other arches.

(*) [Miklós] Fixed subtype handling.
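
A rough sketch of the parameter-description shape described in the ver #12
notes above: one master name per parameter in a plain string pointer array,
a separate table of additional names, and a note of which parameter is the
source specifier. All identifiers and parameter names below are invented
for illustration and are not meant to reflect the actual layout in
include/linux/fs_parser.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum example_param { PARAM_SOURCE, PARAM_RO, PARAM_TIMEOUT, NR_PARAMS };

/* One master name per parameter, indexed by parameter number. */
static const char *const master_names[NR_PARAMS] = {
	[PARAM_SOURCE]	= "source",
	[PARAM_RO]	= "ro",
	[PARAM_TIMEOUT]	= "timeout",
};

/* Additional names; the name chosen can influence the action taken. */
struct alt_name {
	const char *name;
	enum example_param param;
	int negated;
};

static const struct alt_name alt_names[] = {
	{ "rw",    PARAM_RO,      1 },
	{ "timeo", PARAM_TIMEOUT, 0 },
};

/* Which parameter is the source specifier. */
static const enum example_param source_param = PARAM_SOURCE;

/* Name to use in log messages, whatever alias the user typed. */
const char *param_log_name(int param)
{
	assert(param >= 0 && param < NR_PARAMS);
	return master_names[param];
}

int is_source_param(int param)
{
	return param == (int)source_param;
}

/* Returns the parameter number, or -1 if the name is not known. */
int lookup_param(const char *name, int *negated)
{
	size_t i;

	assert(name && negated);

	for (i = 0; i < NR_PARAMS; i++) {
		if (strcmp(name, master_names[i]) == 0) {
			*negated = 0;
			return (int)i;
		}
	}
	for (i = 0; i < sizeof(alt_names) / sizeof(alt_names[0]); i++) {
		if (strcmp(name, alt_names[i].name) == 0) {
			*negated = alt_names[i].negated;
			return (int)alt_names[i].param;
		}
	}
	return -1;
}
```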

ver #11:

(*) Fixed AppArmor.

(*) Capitalised all the UAPI constants.

(*) Explicitly numbered the FSCONFIG_* UAPI constants.

(*) Removed all the places ANON_INODES is selected.

(*) Fixed a bug whereby the context gets freed twice (which broke mounts of
procfs).

(*) Split fsinfo() off into its own patch series.

ver #10:

(*) Renamed "option" to "parameter" in a number of places.

(*) Replaced the use of write() to drive the configuration with an fsconfig()
syscall. This also allows at-style paths and fds to be presented as typed
objects.

(*) Routed the key=value parameter concept all the way through from the
fsconfig() system call to the LSM and filesystem.

(*) Added a parameter-description concept and helper functions to help
interpret a parameter and possibly convert the value.

(*) Made it possible to query the parameter description using the fsinfo()
syscall. Added a test-fs-query sample to dump the parameters used by a
filesystem.

ver #9:

(*) Dropped the fd cookie stuff and the FMODE_*/O_* split stuff.

(*) Al added an open_tree() system call to allow a mount tree to be picked
and then referenced or cloned into an O_PATH-style fd. This can then be used
with sys_move_mount(). Dropped the O_CLONE_MOUNT and O_NON_RECURSIVE
open() flags.

(*) Brought error logging back in, though only in the fs_context and not
in the task_struct.

(*) Separated MS_REMOUNT|MS_BIND handling from MS_REMOUNT handling.

(*) Used anon_inodes for the fd returned by fsopen() and fspick(). This
requires making it unconditional.

(*) Fixed lots of bugs. Especial thanks to Al and Eric Biggers for
finding them and providing patches.

(*) Wrote manual pages, which I'll post separately.

ver #8:

(*) Changed the way fsmount() mounts into the namespace according to some
of Al's ideas.

(*) Put better typing on the fd cookie obtained from __fdget() & co.

(*) Stored the fd cookie in struct nameidata rather than the dfd number.

(*) Changed sys_fsmount() to return an O_PATH-style fd rather than
actually mounting into the mount namespace.

(*) Separated internal FMODE_* handling from O_* handling to free up
certain O_* flag numbers.

(*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

(*) Added a new syscall, sys_move_mount(), to move a mount from a
dfd+path source to a dfd+path destination.

(*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
vfsmount attached to file->f_path needs 'unmounting' if set.

(*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

[!] This doesn't work quite right.

(*) Added a new syscall, fsinfo(), to query information about a
filesystem. The idea being that this will, in future, work with the
fd from fsopen() too and permit querying of the parameters and
metadata before fsmount() is called.

ver #7:

(*) Undo an incorrect MS_* -> SB_* conversion.

(*) Pass the mount data buffer size to all the mount-related functions that
take the data pointer. This fixes a problem where someone (say SELinux)
tries to copy the mount data, assuming it to be a page in size, and
overruns the buffer - thereby incurring an oops by hitting a guard page.

(*) Made the AFS filesystem use them as an example. This is much easier to
deal with than NFS or Ext4 as there are very few mount options.
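
The overrun fixed in ver #7 comes from a callee assuming that the data blob
is a full page long. A minimal userspace sketch of the sized variant is
shown here; the helper name and the stand-in page size are made up for the
example and this is not the code from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define EXAMPLE_PAGE_SIZE 4096	/* stand-in; the real value is arch dependent */

/*
 * Copy exactly data_size bytes of mount data into dst and zero-fill the
 * remainder, never assuming that the source is a whole page long.
 *
 * Returns the number of bytes copied, -E2BIG if the data does not fit or
 * -EINVAL if a size was given without a buffer.
 */
long copy_mount_data(void *dst, size_t dst_size,
		     const void *data, size_t data_size)
{
	assert(dst);

	if (!data)
		return data_size ? -EINVAL : 0;
	if (data_size > dst_size)
		return -E2BIG;

	memcpy(dst, data, data_size);
	memset((char *)dst + data_size, 0, dst_size - data_size);
	return (long)data_size;
}
```

A caller holding a seven-byte option string on its stack passes 7 as
data_size, so the copy stops there instead of running on towards a guard
page.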

ver #6:

(*) Dropped the supplementary error string facility for the moment.

(*) Dropped the NFS patches for the moment.

(*) Dropped the reserved file descriptor argument from fsopen() and
replaced it with three reserved pointers that must be NULL.

ver #5:

(*) Renamed sb_config -> fs_context and adjusted variable names.

(*) Differentiated the flags in sb->s_flags (now named SB_*) from those
passed to mount(2) (named MS_*).

(*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
caller always provide a struct file_system_type pointer and the
parameters required.

(*) Got rid of vfs_submount_fc() in favour of passing
FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context(). The purpose is now
used more.

(*) Call ->validate() on the remount path.

(*) Got rid of the inode locking in sys_fsmount().

(*) Call security_sb_mountpoint() in the mount(2) path.

ver #4:

(*) Split the sb_config patch up somewhat.

(*) Made the supplementary error string facility something attached to the
task_struct rather than the sb_config so that error messages can be
obtained from NFS doing a mount-root-and-pathwalk inside the
nfs_get_tree() operation.

Further, made this managed and read by prctl rather than through the
mount fd so that it's more generally available.

ver #3:

(*) Rebased on 4.12-rc1.

(*) Split the NFS patch up somewhat.

ver #2:

(*) Removed the ->fill_super() from sb_config_operations and passed it in
directly to functions that want to call it. NFS now calls
nfs_fill_super() directly rather than jumping through a pointer to it
since there's only the one option at the moment.

(*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
proc_sb_config.

(*) Renamed create_super -> get_tree.

(*) Renamed struct mount_context to struct sb_config and amended various
variable names.

(*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
arguments.

ver #1:

(*) Split the sb_config stuff out into its own header.

(*) Support non-context aware filesystems through a special set of
sb_config operations.

(*) Stored the created superblock and root dentry into the sb_config after
creation rather than directly into a vfsmount. This allows some
arguments to be removed to various NFS functions.

(*) Added an explicit superblock-creation step. This allows a created
superblock to then be mounted multiple times.

(*) Added a flag to say that the sb_config is degraded and cannot have
another go at creating a superblock, and got rid of the flag that says
it's already mounted.

Possible further developments:

(*) Implement sb reconfiguration (for now it returns ENOANO).

(*) Implement mount context support in more filesystems, ext4 being next
on my list.

(*) Move the walk-from-root stuff that nfs has to generic code so that you
can do something akin to:

mount /dev/sda1:/foo/bar /mnt

See nfs_follow_remote_path() and mount_subtree(). This is slightly
tricky in NFS as we have to prevent referral loops.

(*) Work out how to get at the error message incurred by submounts
encountered during nfs_follow_remote_path().

Should the error message be moved to task_struct and made more
general, perhaps retrieved with a prctl() function?

(*) Clean up/consolidate the security functions. Possibly add a
validation hook to be called at the same time as the mount context
validate op.

The patches can be found here also:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

mount-api

David
---
Al Viro (2):
vfs: syscall: Add open_tree(2) to reference or clone a mount
teach move_mount(2) to work with OPEN_TREE_CLONE

David Howells (32):
vfs: syscall: Add move_mount(2) to move mounts around
vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
vfs: Introduce the basic header for the new mount API's filesystem context
vfs: Introduce logging functions
vfs: Add configuration parser helpers
vfs: Add LSM hooks for the new mount API
vfs: Put security flags into the fs_context struct
selinux: Implement the new mount API LSM hooks
smack: Implement filesystem context security hooks
apparmor: Implement security hooks for the new mount API
tomoyo: Implement security hooks for the new mount API
vfs: Separate changing mount flags full remount
vfs: Implement a filesystem superblock creation/configuration context
vfs: Remove unused code after filesystem context changes
procfs: Move proc_fill_super() to fs/proc/root.c
proc: Add fs_context support to procfs
ipc: Convert mqueue fs to fs_context
cpuset: Use fs_context
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
hugetlbfs: Convert to fs_context
vfs: Remove kern_mount_data()
vfs: Provide documentation for new mount API
Make anon_inodes unconditional
vfs: syscall: Add fsopen() to prepare for superblock creation
vfs: Implement logging through fs_context
vfs: Add some logging to the core users of the fs_context log
vfs: syscall: Add fsconfig() for configuring and managing a context
vfs: syscall: Add fsmount() to create a mount for a superblock
vfs: syscall: Add fspick() to select a superblock for reconfiguration
afs: Add fs_context support
afs: Use fs_context to pass parameters over automount
vfs: Add a sample program for the new mount API


Documentation/filesystems/mount_api.txt | 741 +++++++++++++++++++++++++
arch/arc/kernel/setup.c | 1
arch/arm/kernel/atags_parse.c | 1
arch/arm/kvm/Kconfig | 1
arch/arm64/kvm/Kconfig | 1
arch/mips/kvm/Kconfig | 1
arch/powerpc/kvm/Kconfig | 1
arch/s390/kvm/Kconfig | 1
arch/sh/kernel/setup.c | 1
arch/sparc/kernel/setup_32.c | 1
arch/sparc/kernel/setup_64.c | 1
arch/x86/Kconfig | 1
arch/x86/entry/syscalls/syscall_32.tbl | 6
arch/x86/entry/syscalls/syscall_64.tbl | 6
arch/x86/kernel/cpu/intel_rdt.h | 15 +
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 183 ++++--
arch/x86/kernel/setup.c | 1
arch/x86/kvm/Kconfig | 1
drivers/base/Kconfig | 1
drivers/base/devtmpfs.c | 1
drivers/char/tpm/Kconfig | 1
drivers/dma-buf/Kconfig | 1
drivers/gpio/Kconfig | 1
drivers/iio/Kconfig | 1
drivers/infiniband/Kconfig | 1
drivers/vfio/Kconfig | 1
fs/Kconfig | 7
fs/Makefile | 5
fs/afs/internal.h | 9
fs/afs/mntpt.c | 148 +++--
fs/afs/super.c | 470 +++++++++-------
fs/afs/volume.c | 4
fs/f2fs/super.c | 2
fs/file_table.c | 9
fs/filesystems.c | 4
fs/fs_context.c | 769 ++++++++++++++++++++++++++
fs/fs_parser.c | 555 +++++++++++++++++++
fs/fsopen.c | 563 +++++++++++++++++++
fs/hugetlbfs/inode.c | 391 ++++++++-----
fs/internal.h | 19 +
fs/kernfs/mount.c | 88 +--
fs/libfs.c | 20 +
fs/namei.c | 4
fs/namespace.c | 895 +++++++++++++++++++++++-------
fs/notify/fanotify/Kconfig | 1
fs/notify/inotify/Kconfig | 1
fs/pnode.c | 1
fs/proc/inode.c | 50 --
fs/proc/internal.h | 5
fs/proc/root.c | 256 ++++++---
fs/super.c | 464 ++++++++++++----
fs/sysfs/mount.c | 67 ++
include/linux/cgroup.h | 3
include/linux/errno.h | 1
include/linux/fs.h | 23 +
include/linux/fs_context.h | 215 +++++++
include/linux/fs_parser.h | 119 ++++
include/linux/kernfs.h | 41 +
include/linux/lsm_hooks.h | 79 ++-
include/linux/module.h | 6
include/linux/mount.h | 5
include/linux/security.h | 65 ++
include/linux/syscalls.h | 9
include/uapi/linux/fcntl.h | 2
include/uapi/linux/fs.h | 82 +--
include/uapi/linux/mount.h | 75 +++
init/Kconfig | 10
init/do_mounts.c | 1
init/do_mounts_initrd.c | 1
ipc/mqueue.c | 107 +++-
ipc/namespace.c | 2
kernel/cgroup/cgroup-internal.h | 50 +-
kernel/cgroup/cgroup-v1.c | 345 ++++++------
kernel/cgroup/cgroup.c | 264 ++++++---
kernel/cgroup/cpuset.c | 69 ++
samples/Kconfig | 10
samples/Makefile | 2
samples/statx/Makefile | 7
samples/statx/test-statx.c | 258 ---------
samples/vfs/Makefile | 10
samples/vfs/test-fsmount.c | 118 ++++
samples/vfs/test-statx.c | 258 +++++++++
security/apparmor/include/mount.h | 11
security/apparmor/lsm.c | 108 ++++
security/apparmor/mount.c | 47 ++
security/security.c | 56 ++
security/selinux/hooks.c | 387 ++++++++++---
security/selinux/include/security.h | 16 -
security/smack/smack.h | 21 -
security/smack/smack_lsm.c | 365 +++++++++++-
security/tomoyo/common.h | 3
security/tomoyo/mount.c | 46 ++
security/tomoyo/tomoyo.c | 15 +
93 files changed, 7191 insertions(+), 1900 deletions(-)
create mode 100644 Documentation/filesystems/mount_api.txt
create mode 100644 fs/fs_context.c
create mode 100644 fs/fs_parser.c
create mode 100644 fs/fsopen.c
create mode 100644 include/linux/fs_context.h
create mode 100644 include/linux/fs_parser.h
create mode 100644 include/uapi/linux/mount.h
delete mode 100644 samples/statx/Makefile
delete mode 100644 samples/statx/test-statx.c
create mode 100644 samples/vfs/Makefile
create mode 100644 samples/vfs/test-fsmount.c
create mode 100644 samples/vfs/test-statx.c



2018-09-21 16:30:57

by David Howells

Subject: [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #12]

From: Al Viro <[email protected]>

open_tree(dfd, pathname, flags)

Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)). flags should be an OR of
some of the following:
* AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at the location specified by
dfd/pathname. With AT_RECURSIVE the entire subtree is cloned;
without it, only the part within the mount containing the
location in question. In other words, the same as mount --rbind
or mount --bind would've taken. The detached tree will be
dissolved on the final close of the obtained file. Creation of such
detached trees requires the same capabilities as doing mount --bind.
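
As a reading aid, the flag handling described above can be modelled in a
few lines of userspace C. This is only a sketch mirroring the checks at the
top of sys_open_tree() in the diff below: the EX_AT_* and EX_OPEN_TREE_*
values are copied from the UAPI headers (with O_CLOEXEC taken as its x86
value), while the EX_LOOKUP_* bits are arbitrary stand-ins for the kernel's
internal LOOKUP_* flags and both function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* UAPI values as used by this patch (O_CLOEXEC as on x86). */
#define EX_AT_SYMLINK_NOFOLLOW	0x100
#define EX_AT_NO_AUTOMOUNT	0x800
#define EX_AT_EMPTY_PATH	0x1000
#define EX_AT_RECURSIVE		0x8000
#define EX_OPEN_TREE_CLONE	1
#define EX_OPEN_TREE_CLOEXEC	02000000

/* Stand-in bits only; not the kernel's internal LOOKUP_* numbers. */
#define EX_LOOKUP_FOLLOW	0x1
#define EX_LOOKUP_AUTOMOUNT	0x2
#define EX_LOOKUP_EMPTY		0x4

/* Returns 0 if the flag combination is acceptable, -EINVAL otherwise. */
int open_tree_check_flags(unsigned int flags)
{
	if (flags & ~(EX_AT_EMPTY_PATH | EX_AT_NO_AUTOMOUNT | EX_AT_RECURSIVE |
		      EX_AT_SYMLINK_NOFOLLOW | EX_OPEN_TREE_CLONE |
		      EX_OPEN_TREE_CLOEXEC))
		return -EINVAL;

	/* AT_RECURSIVE is only meaningful together with OPEN_TREE_CLONE. */
	if ((flags & (EX_AT_RECURSIVE | EX_OPEN_TREE_CLONE)) == EX_AT_RECURSIVE)
		return -EINVAL;

	return 0;
}

/* Derive the path lookup flags; the caller must have validated flags. */
unsigned int open_tree_lookup_flags(unsigned int flags)
{
	unsigned int lookup_flags = EX_LOOKUP_AUTOMOUNT | EX_LOOKUP_FOLLOW;

	assert(open_tree_check_flags(flags) == 0);

	if (flags & EX_AT_NO_AUTOMOUNT)
		lookup_flags &= ~EX_LOOKUP_AUTOMOUNT;
	if (flags & EX_AT_SYMLINK_NOFOLLOW)
		lookup_flags &= ~EX_LOOKUP_FOLLOW;
	if (flags & EX_AT_EMPTY_PATH)
		lookup_flags |= EX_LOOKUP_EMPTY;
	return lookup_flags;
}
```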

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/file_table.c | 9 +-
fs/internal.h | 1
fs/namespace.c | 132 +++++++++++++++++++++++++++-----
include/linux/fs.h | 7 +-
include/linux/syscalls.h | 1
include/uapi/linux/fcntl.h | 2
include/uapi/linux/mount.h | 10 ++
9 files changed, 137 insertions(+), 27 deletions(-)
create mode 100644 include/uapi/linux/mount.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..ea1b413afd47 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
386 i386 rseq sys_rseq __ia32_sys_rseq
+387 i386 open_tree sys_open_tree __ia32_sys_open_tree
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..0545bed581dc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
332 common statx __x64_sys_statx
333 common io_pgetevents __x64_sys_io_pgetevents
334 common rseq __x64_sys_rseq
+335 common open_tree __x64_sys_open_tree

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/file_table.c b/fs/file_table.c
index e49af4caf15d..e03c8d121c6c 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -255,6 +255,7 @@ static void __fput(struct file *file)
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = file->f_inode;
+ fmode_t mode = file->f_mode;

if (unlikely(!(file->f_mode & FMODE_OPENED)))
goto out;
@@ -277,18 +278,20 @@ static void __fput(struct file *file)
if (file->f_op->release)
file->f_op->release(inode, file);
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
- !(file->f_mode & FMODE_PATH))) {
+ !(mode & FMODE_PATH))) {
cdev_put(inode->i_cdev);
}
fops_put(file->f_op);
put_pid(file->f_owner.pid);
- if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
+ if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_dec(inode);
- if (file->f_mode & FMODE_WRITER) {
+ if (mode & FMODE_WRITER) {
put_write_access(inode);
__mnt_drop_write(mnt);
}
dput(dentry);
+ if (unlikely(mode & FMODE_NEED_UNMOUNT))
+ dissolve_on_fput(mnt);
mntput(mnt);
out:
file_free(file);
diff --git a/fs/internal.h b/fs/internal.h
index 364c20b5ea2d..17029b30e196 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -85,6 +85,7 @@ extern int __mnt_want_write_file(struct file *);
extern void __mnt_drop_write(struct vfsmount *);
extern void __mnt_drop_write_file(struct file *);

+extern void dissolve_on_fput(struct vfsmount *);
/*
* fs_struct.c
*/
diff --git a/fs/namespace.c b/fs/namespace.c
index 8a7e1a7d1d06..ded1a970ec40 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -20,12 +20,14 @@
#include <linux/init.h> /* init_rootfs */
#include <linux/fs_struct.h> /* get_fs_root et.al. */
#include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */
+#include <linux/file.h>
#include <linux/uaccess.h>
#include <linux/proc_ns.h>
#include <linux/magic.h>
#include <linux/bootmem.h>
#include <linux/task_work.h>
#include <linux/sched/task.h>
+#include <uapi/linux/mount.h>

#include "pnode.h"
#include "internal.h"
@@ -1779,6 +1781,16 @@ struct vfsmount *collect_mounts(const struct path *path)
return &tree->mnt;
}

+void dissolve_on_fput(struct vfsmount *mnt)
+{
+ namespace_lock();
+ lock_mount_hash();
+ mntget(mnt);
+ umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
+ unlock_mount_hash();
+ namespace_unlock();
+}
+
void drop_collected_mounts(struct vfsmount *mnt)
{
namespace_lock();
@@ -2138,6 +2150,30 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
return false;
}

+static struct mount *__do_loopback(struct path *old_path, int recurse)
+{
+ struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
+
+ if (IS_MNT_UNBINDABLE(old))
+ return mnt;
+
+ if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
+ return mnt;
+
+ if (!recurse && has_locked_children(old, old_path->dentry))
+ return mnt;
+
+ if (recurse)
+ mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
+ else
+ mnt = clone_mnt(old, old_path->dentry, 0);
+
+ if (!IS_ERR(mnt))
+ mnt->mnt.mnt_flags &= ~MNT_LOCKED;
+
+ return mnt;
+}
+
/*
* do loopback mount.
*/
@@ -2145,7 +2181,7 @@ static int do_loopback(struct path *path, const char *old_name,
int recurse)
{
struct path old_path;
- struct mount *mnt = NULL, *old, *parent;
+ struct mount *mnt = NULL, *parent;
struct mountpoint *mp;
int err;
if (!old_name || !*old_name)
@@ -2159,38 +2195,21 @@ static int do_loopback(struct path *path, const char *old_name,
goto out;

mp = lock_mount(path);
- err = PTR_ERR(mp);
- if (IS_ERR(mp))
+ if (IS_ERR(mp)) {
+ err = PTR_ERR(mp);
goto out;
+ }

- old = real_mount(old_path.mnt);
parent = real_mount(path->mnt);
-
- err = -EINVAL;
- if (IS_MNT_UNBINDABLE(old))
- goto out2;
-
if (!check_mnt(parent))
goto out2;

- if (!check_mnt(old) && old_path.dentry->d_op != &ns_dentry_operations)
- goto out2;
-
- if (!recurse && has_locked_children(old, old_path.dentry))
- goto out2;
-
- if (recurse)
- mnt = copy_tree(old, old_path.dentry, CL_COPY_MNT_NS_FILE);
- else
- mnt = clone_mnt(old, old_path.dentry, 0);
-
+ mnt = __do_loopback(&old_path, recurse);
if (IS_ERR(mnt)) {
err = PTR_ERR(mnt);
goto out2;
}

- mnt->mnt.mnt_flags &= ~MNT_LOCKED;
-
err = graft_tree(mnt, parent, mp);
if (err) {
lock_mount_hash();
@@ -2204,6 +2223,75 @@ static int do_loopback(struct path *path, const char *old_name,
return err;
}

+SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
+{
+ struct file *file;
+ struct path path;
+ int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+ bool detached = flags & OPEN_TREE_CLONE;
+ int error;
+ int fd;
+
+ BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
+
+ if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
+ AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
+ OPEN_TREE_CLOEXEC))
+ return -EINVAL;
+
+ if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
+ return -EINVAL;
+
+ if (flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+
+ if (detached && !may_mount())
+ return -EPERM;
+
+ fd = get_unused_fd_flags(flags & O_CLOEXEC);
+ if (fd < 0)
+ return fd;
+
+ error = user_path_at(dfd, filename, lookup_flags, &path);
+ if (error)
+ goto out;
+
+ if (detached) {
+ struct mount *mnt = __do_loopback(&path, flags & AT_RECURSIVE);
+ if (IS_ERR(mnt)) {
+ error = PTR_ERR(mnt);
+ goto out2;
+ }
+ mntput(path.mnt);
+ path.mnt = &mnt->mnt;
+ }
+
+ file = dentry_open(&path, O_PATH, current_cred());
+ if (IS_ERR(file)) {
+ error = PTR_ERR(file);
+ goto out3;
+ }
+
+ if (detached)
+ file->f_mode |= FMODE_NEED_UNMOUNT;
+ path_put(&path);
+ fd_install(fd, file);
+ return fd;
+
+out3:
+ if (detached)
+ dissolve_on_fput(path.mnt);
+out2:
+ path_put(&path);
+out:
+ put_unused_fd(fd);
+ return error;
+}
+
static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
{
int error = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4323b8fe353d..6dc32507762f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -157,10 +157,13 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define FMODE_NONOTIFY ((__force fmode_t)0x4000000)

/* File is capable of returning -EAGAIN if I/O will block */
-#define FMODE_NOWAIT ((__force fmode_t)0x8000000)
+#define FMODE_NOWAIT ((__force fmode_t)0x8000000)
+
+/* File represents mount that needs unmounting */
+#define FMODE_NEED_UNMOUNT ((__force fmode_t)0x10000000)

/* File does not contribute to nr_files count */
-#define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)
+#define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)

/*
* Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ff814c92f7f..6978f3c76d41 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -906,6 +906,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
+asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..594b85f7cb86 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -90,5 +90,7 @@
#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */
#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */

+#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+

#endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
new file mode 100644
index 000000000000..e8db2911adca
--- /dev/null
+++ b/include/uapi/linux/mount.h
@@ -0,0 +1,10 @@
+#ifndef _UAPI_LINUX_MOUNT_H
+#define _UAPI_LINUX_MOUNT_H
+
+/*
+ * open_tree() flags.
+ */
+#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
+#define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
+
+#endif /* _UAPI_LINUX_MOUNT_H */


2018-09-21 16:31:11

by David Howells

Subject: [PATCH 02/34] vfs: syscall: Add move_mount(2) to move mounts around [ver #12]

Add a move_mount() system call that will move a mount from one place to
another and, in the next commit, allow an unattached mount tree to be attached.

The new system call looks like the following:

int move_mount(int from_dfd, const char *from_path,
int to_dfd, const char *to_path,
unsigned int flags);

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/namespace.c | 102 ++++++++++++++++++++++++++------
include/linux/lsm_hooks.h | 6 ++
include/linux/security.h | 7 ++
include/linux/syscalls.h | 3 +
include/uapi/linux/mount.h | 11 +++
security/security.c | 5 ++
8 files changed, 118 insertions(+), 18 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ea1b413afd47..76d092b7d1b0 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -399,3 +399,4 @@
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
386 i386 rseq sys_rseq __ia32_sys_rseq
387 i386 open_tree sys_open_tree __ia32_sys_open_tree
+388 i386 move_mount sys_move_mount __ia32_sys_move_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 0545bed581dc..37ba4e65eee6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -344,6 +344,7 @@
333 common io_pgetevents __x64_sys_io_pgetevents
334 common rseq __x64_sys_rseq
335 common open_tree __x64_sys_open_tree
+336 common move_mount __x64_sys_move_mount

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index ded1a970ec40..dd38141b1723 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2386,43 +2386,37 @@ static inline int tree_contains_unbindable(struct mount *mnt)
return 0;
}

-static int do_move_mount(struct path *path, const char *old_name)
+static int do_move_mount(struct path *old_path, struct path *new_path)
{
- struct path old_path, parent_path;
+ struct path parent_path = {.mnt = NULL, .dentry = NULL};
struct mount *p;
struct mount *old;
struct mountpoint *mp;
int err;
- if (!old_name || !*old_name)
- return -EINVAL;
- err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
- if (err)
- return err;

- mp = lock_mount(path);
+ mp = lock_mount(new_path);
err = PTR_ERR(mp);
if (IS_ERR(mp))
goto out;

- old = real_mount(old_path.mnt);
- p = real_mount(path->mnt);
+ old = real_mount(old_path->mnt);
+ p = real_mount(new_path->mnt);

err = -EINVAL;
if (!check_mnt(p) || !check_mnt(old))
goto out1;

- if (old->mnt.mnt_flags & MNT_LOCKED)
+ if (!mnt_has_parent(old))
goto out1;

- err = -EINVAL;
- if (old_path.dentry != old_path.mnt->mnt_root)
+ if (old->mnt.mnt_flags & MNT_LOCKED)
goto out1;

- if (!mnt_has_parent(old))
+ if (old_path->dentry != old_path->mnt->mnt_root)
goto out1;

- if (d_is_dir(path->dentry) !=
- d_is_dir(old_path.dentry))
+ if (d_is_dir(new_path->dentry) !=
+ d_is_dir(old_path->dentry))
goto out1;
/*
* Don't move a mount residing in a shared parent.
@@ -2440,7 +2434,8 @@ static int do_move_mount(struct path *path, const char *old_name)
if (p == old)
goto out1;

- err = attach_recursive_mnt(old, real_mount(path->mnt), mp, &parent_path);
+ err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
+ &parent_path);
if (err)
goto out1;

@@ -2452,6 +2447,22 @@ static int do_move_mount(struct path *path, const char *old_name)
out:
if (!err)
path_put(&parent_path);
+ return err;
+}
+
+static int do_move_mount_old(struct path *path, const char *old_name)
+{
+ struct path old_path;
+ int err;
+
+ if (!old_name || !*old_name)
+ return -EINVAL;
+
+ err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
+ if (err)
+ return err;
+
+ err = do_move_mount(&old_path, path);
path_put(&old_path);
return err;
}
@@ -2873,7 +2884,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
retval = do_change_type(&path, flags);
else if (flags & MS_MOVE)
- retval = do_move_mount(&path, dev_name);
+ retval = do_move_mount_old(&path, dev_name);
else
retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
dev_name, data_page, data_size);
@@ -3108,6 +3119,61 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
return ksys_mount(dev_name, dir_name, type, flags, data);
}

+/*
+ * Move a mount from one place to another.
+ *
+ * Note the flags value is a combination of MOVE_MOUNT_* flags.
+ */
+SYSCALL_DEFINE5(move_mount,
+ int, from_dfd, const char *, from_pathname,
+ int, to_dfd, const char *, to_pathname,
+ unsigned int, flags)
+{
+ struct path from_path, to_path;
+ unsigned int lflags;
+ int ret = 0;
+
+ if (!may_mount())
+ return -EPERM;
+
+ if (flags & ~MOVE_MOUNT__MASK)
+ return -EINVAL;
+
+ /* If someone gives a pathname, they aren't permitted to move
+ * from an fd that requires unmount as we can't get at the flag
+ * to clear it afterwards.
+ */
+ lflags = 0;
+ if (flags & MOVE_MOUNT_F_SYMLINKS) lflags |= LOOKUP_FOLLOW;
+ if (flags & MOVE_MOUNT_F_AUTOMOUNTS) lflags |= LOOKUP_AUTOMOUNT;
+ if (flags & MOVE_MOUNT_F_EMPTY_PATH) lflags |= LOOKUP_EMPTY;
+
+ ret = user_path_at(from_dfd, from_pathname, lflags, &from_path);
+ if (ret < 0)
+ return ret;
+
+ lflags = 0;
+ if (flags & MOVE_MOUNT_T_SYMLINKS) lflags |= LOOKUP_FOLLOW;
+ if (flags & MOVE_MOUNT_T_AUTOMOUNTS) lflags |= LOOKUP_AUTOMOUNT;
+ if (flags & MOVE_MOUNT_T_EMPTY_PATH) lflags |= LOOKUP_EMPTY;
+
+ ret = user_path_at(to_dfd, to_pathname, lflags, &to_path);
+ if (ret < 0)
+ goto out_from;
+
+ ret = security_move_mount(&from_path, &to_path);
+ if (ret < 0)
+ goto out_to;
+
+ ret = do_move_mount(&from_path, &to_path);
+
+out_to:
+ path_put(&to_path);
+out_from:
+ path_put(&from_path);
+ return ret;
+}
+
/*
* Return true if path is reachable from root
*
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 8a44075347ac..d052db1a1565 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -147,6 +147,10 @@
* Parse a string of security data filling in the opts structure
* @options string containing all mount options known by the LSM
* @opts binary data structure usable by the LSM
+ * @move_mount:
+ * Check permission before a mount is moved.
+ * @from_path indicates the mount that is going to be moved.
+ * @to_path indicates the mountpoint that will be mounted upon.
* @dentry_init_security:
* Compute a context for a dentry as the inode is not yet available
* since NFSv4 has no label backed by an EA anyway.
@@ -1484,6 +1488,7 @@ union security_list_options {
unsigned long kern_flags,
unsigned long *set_kern_flags);
int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
+ int (*move_mount)(const struct path *from_path, const struct path *to_path);
int (*dentry_init_security)(struct dentry *dentry, int mode,
const struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -1816,6 +1821,7 @@ struct security_hook_heads {
struct hlist_head sb_set_mnt_opts;
struct hlist_head sb_clone_mnt_opts;
struct hlist_head sb_parse_opts_str;
+ struct hlist_head move_mount;
struct hlist_head dentry_init_security;
struct hlist_head dentry_create_files_as;
#ifdef CONFIG_SECURITY_PATH
diff --git a/include/linux/security.h b/include/linux/security.h
index 30a3db9f284b..a306061d2197 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -266,6 +266,7 @@ int security_sb_clone_mnt_opts(const struct super_block *oldsb,
unsigned long kern_flags,
unsigned long *set_kern_flags);
int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
+int security_move_mount(const struct path *from_path, const struct path *to_path);
int security_dentry_init_security(struct dentry *dentry, int mode,
const struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -621,6 +622,12 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
return 0;
}

+static inline int security_move_mount(const struct path *from_path,
+ const struct path *to_path)
+{
+ return 0;
+}
+
static inline int security_inode_alloc(struct inode *inode)
{
return 0;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 6978f3c76d41..79042396f7e5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -907,6 +907,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
+asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
+ int to_dfd, const char __user *to_path,
+ unsigned int ms_flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index e8db2911adca..89adf0d731ab 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -7,4 +7,15 @@
#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
#define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */

+/*
+ * move_mount() flags.
+ */
+#define MOVE_MOUNT_F_SYMLINKS 0x00000001 /* Follow symlinks on from path */
+#define MOVE_MOUNT_F_AUTOMOUNTS 0x00000002 /* Follow automounts on from path */
+#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
+#define MOVE_MOUNT_T_SYMLINKS 0x00000010 /* Follow symlinks on to path */
+#define MOVE_MOUNT_T_AUTOMOUNTS 0x00000020 /* Follow automounts on to path */
+#define MOVE_MOUNT_T_EMPTY_PATH 0x00000040 /* Empty to path permitted */
+#define MOVE_MOUNT__MASK 0x00000077
+
#endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/security/security.c b/security/security.c
index 3d99ed8d9ddd..96a061cecb39 100644
--- a/security/security.c
+++ b/security/security.c
@@ -444,6 +444,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
}
EXPORT_SYMBOL(security_sb_parse_opts_str);

+int security_move_mount(const struct path *from_path, const struct path *to_path)
+{
+ return call_int_hook(move_mount, 0, from_path, to_path);
+}
+
int security_inode_alloc(struct inode *inode)
{
inode->i_security = NULL;


2018-09-21 16:31:29

by David Howells

Subject: [PATCH 04/34] vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled [ver #12]

Only the mount namespace code that implements mount(2) should be using the
MS_* flags. Suppress them inside the kernel unless uapi/linux/mount.h is
included.

Signed-off-by: David Howells <[email protected]>
---

arch/arc/kernel/setup.c | 1 +
arch/arm/kernel/atags_parse.c | 1 +
arch/sh/kernel/setup.c | 1 +
arch/sparc/kernel/setup_32.c | 1 +
arch/sparc/kernel/setup_64.c | 1 +
arch/x86/kernel/setup.c | 1 +
drivers/base/devtmpfs.c | 1 +
fs/f2fs/super.c | 2 +
fs/pnode.c | 1 +
fs/super.c | 1 +
include/uapi/linux/fs.h | 56 ++++-------------------------------------
include/uapi/linux/mount.h | 54 ++++++++++++++++++++++++++++++++++++++++
init/do_mounts.c | 1 +
init/do_mounts_initrd.c | 1 +
security/apparmor/lsm.c | 1 +
security/apparmor/mount.c | 1 +
security/selinux/hooks.c | 1 +
security/tomoyo/mount.c | 1 +
18 files changed, 75 insertions(+), 52 deletions(-)

diff --git a/arch/arc/kernel/setup.c b/arch/arc/kernel/setup.c
index b2cae79a25d7..714dc5c2baf1 100644
--- a/arch/arc/kernel/setup.c
+++ b/arch/arc/kernel/setup.c
@@ -19,6 +19,7 @@
#include <linux/of_fdt.h>
#include <linux/of.h>
#include <linux/cache.h>
+#include <uapi/linux/mount.h>
#include <asm/sections.h>
#include <asm/arcregs.h>
#include <asm/tlb.h>
diff --git a/arch/arm/kernel/atags_parse.c b/arch/arm/kernel/atags_parse.c
index c10a3e8ee998..a8a4333929f5 100644
--- a/arch/arm/kernel/atags_parse.c
+++ b/arch/arm/kernel/atags_parse.c
@@ -24,6 +24,7 @@
#include <linux/root_dev.h>
#include <linux/screen_info.h>
#include <linux/memblock.h>
+#include <uapi/linux/mount.h>

#include <asm/setup.h>
#include <asm/system_info.h>
diff --git a/arch/sh/kernel/setup.c b/arch/sh/kernel/setup.c
index c286cf5da6e7..2c0e0f37a318 100644
--- a/arch/sh/kernel/setup.c
+++ b/arch/sh/kernel/setup.c
@@ -32,6 +32,7 @@
#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/uaccess.h>
+#include <uapi/linux/mount.h>
#include <asm/io.h>
#include <asm/page.h>
#include <asm/elf.h>
diff --git a/arch/sparc/kernel/setup_32.c b/arch/sparc/kernel/setup_32.c
index 13664c377196..7df3d704284c 100644
--- a/arch/sparc/kernel/setup_32.c
+++ b/arch/sparc/kernel/setup_32.c
@@ -34,6 +34,7 @@
#include <linux/kdebug.h>
#include <linux/export.h>
#include <linux/start_kernel.h>
+#include <uapi/linux/mount.h>

#include <asm/io.h>
#include <asm/processor.h>
diff --git a/arch/sparc/kernel/setup_64.c b/arch/sparc/kernel/setup_64.c
index 7944b3ca216a..206bf81eedaf 100644
--- a/arch/sparc/kernel/setup_64.c
+++ b/arch/sparc/kernel/setup_64.c
@@ -33,6 +33,7 @@
#include <linux/module.h>
#include <linux/start_kernel.h>
#include <linux/bootmem.h>
+#include <uapi/linux/mount.h>

#include <asm/io.h>
#include <asm/processor.h>
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index b4866badb235..e493202bf265 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -51,6 +51,7 @@
#include <linux/kvm_para.h>
#include <linux/dma-contiguous.h>
#include <xen/xen.h>
+#include <uapi/linux/mount.h>

#include <linux/errno.h>
#include <linux/kernel.h>
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 9c0126ad7de1..1b87a1e03b45 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -25,6 +25,7 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/kthread.h>
+#include <uapi/linux/mount.h>
#include "base.h"

static struct task_struct *thread;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 2b110139420c..89970dd81b0e 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1486,7 +1486,7 @@ static int f2fs_remount(struct super_block *sb, int *flags,
err = dquot_suspend(sb, -1);
if (err < 0)
goto restore_opts;
- } else if (f2fs_readonly(sb) && !(*flags & MS_RDONLY)) {
+ } else if (f2fs_readonly(sb) && !(*flags & SB_RDONLY)) {
/* dquot_resume needs RW */
sb->s_flags &= ~SB_RDONLY;
if (sb_any_quota_suspended(sb)) {
diff --git a/fs/pnode.c b/fs/pnode.c
index 53d411a371ce..1100e810d855 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -10,6 +10,7 @@
#include <linux/mount.h>
#include <linux/fs.h>
#include <linux/nsproxy.h>
+#include <uapi/linux/mount.h>
#include "internal.h"
#include "pnode.h"

diff --git a/fs/super.c b/fs/super.c
index 3941f19828b4..67f88c055967 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -35,6 +35,7 @@
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
+#include <uapi/linux/mount.h>
#include "internal.h"

static int thaw_super_locked(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 73e01918f996..1c982eb44ff4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -14,6 +14,11 @@
#include <linux/ioctl.h>
#include <linux/types.h>

+/* Use of MS_* flags within the kernel is restricted to core mount(2) code. */
+#if !defined(__KERNEL__)
+#include <linux/mount.h>
+#endif
+
/*
* It's silly to have NR_OPEN bigger than NR_FILE, but you can change
* the file limit at runtime and only root can increase the per-process
@@ -101,57 +106,6 @@ struct inodes_stat_t {

#define NR_FILE 8192 /* this can well be larger on a larger system */

-
-/*
- * These are the fs-independent mount-flags: up to 32 flags are supported
- */
-#define MS_RDONLY 1 /* Mount read-only */
-#define MS_NOSUID 2 /* Ignore suid and sgid bits */
-#define MS_NODEV 4 /* Disallow access to device special files */
-#define MS_NOEXEC 8 /* Disallow program execution */
-#define MS_SYNCHRONOUS 16 /* Writes are synced at once */
-#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
-#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
-#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
-#define MS_NOATIME 1024 /* Do not update access times. */
-#define MS_NODIRATIME 2048 /* Do not update directory access times */
-#define MS_BIND 4096
-#define MS_MOVE 8192
-#define MS_REC 16384
-#define MS_VERBOSE 32768 /* War is peace. Verbosity is silence.
- MS_VERBOSE is deprecated. */
-#define MS_SILENT 32768
-#define MS_POSIXACL (1<<16) /* VFS does not apply the umask */
-#define MS_UNBINDABLE (1<<17) /* change to unbindable */
-#define MS_PRIVATE (1<<18) /* change to private */
-#define MS_SLAVE (1<<19) /* change to slave */
-#define MS_SHARED (1<<20) /* change to shared */
-#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
-#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
-#define MS_I_VERSION (1<<23) /* Update inode I_version field */
-#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
-#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
-
-/* These sb flags are internal to the kernel */
-#define MS_SUBMOUNT (1<<26)
-#define MS_NOREMOTELOCK (1<<27)
-#define MS_NOSEC (1<<28)
-#define MS_BORN (1<<29)
-#define MS_ACTIVE (1<<30)
-#define MS_NOUSER (1<<31)
-
-/*
- * Superblock flags that can be altered by MS_REMOUNT
- */
-#define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
- MS_LAZYTIME)
-
-/*
- * Old magic mount flag and mask
- */
-#define MS_MGC_VAL 0xC0ED0000
-#define MS_MGC_MSK 0xffff0000
-
/*
* Structure for FS_IOC_FSGETXATTR[A] and FS_IOC_FSSETXATTR.
*/
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 89adf0d731ab..3634e065836c 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -1,6 +1,60 @@
#ifndef _UAPI_LINUX_MOUNT_H
#define _UAPI_LINUX_MOUNT_H

+/*
+ * These are the fs-independent mount-flags: up to 32 flags are supported
+ *
+ * Usage of these is restricted within the kernel to core mount(2) code and
+ * callers of sys_mount() only. Filesystems should be using the SB_*
+ * equivalent instead.
+ */
+#define MS_RDONLY 1 /* Mount read-only */
+#define MS_NOSUID 2 /* Ignore suid and sgid bits */
+#define MS_NODEV 4 /* Disallow access to device special files */
+#define MS_NOEXEC 8 /* Disallow program execution */
+#define MS_SYNCHRONOUS 16 /* Writes are synced at once */
+#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
+#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
+#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_NOATIME 1024 /* Do not update access times. */
+#define MS_NODIRATIME 2048 /* Do not update directory access times */
+#define MS_BIND 4096
+#define MS_MOVE 8192
+#define MS_REC 16384
+#define MS_VERBOSE 32768 /* War is peace. Verbosity is silence.
+ MS_VERBOSE is deprecated. */
+#define MS_SILENT 32768
+#define MS_POSIXACL (1<<16) /* VFS does not apply the umask */
+#define MS_UNBINDABLE (1<<17) /* change to unbindable */
+#define MS_PRIVATE (1<<18) /* change to private */
+#define MS_SLAVE (1<<19) /* change to slave */
+#define MS_SHARED (1<<20) /* change to shared */
+#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
+#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
+#define MS_I_VERSION (1<<23) /* Update inode I_version field */
+#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
+
+/* These sb flags are internal to the kernel */
+#define MS_SUBMOUNT (1<<26)
+#define MS_NOREMOTELOCK (1<<27)
+#define MS_NOSEC (1<<28)
+#define MS_BORN (1<<29)
+#define MS_ACTIVE (1<<30)
+#define MS_NOUSER (1<<31)
+
+/*
+ * Superblock flags that can be altered by MS_REMOUNT
+ */
+#define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
+ MS_LAZYTIME)
+
+/*
+ * Old magic mount flag and mask
+ */
+#define MS_MGC_VAL 0xC0ED0000
+#define MS_MGC_MSK 0xffff0000
+
/*
* open_tree() flags.
*/
diff --git a/init/do_mounts.c b/init/do_mounts.c
index d512dd615682..d95435fd37b5 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -22,6 +22,7 @@
#include <linux/nfs_fs.h>
#include <linux/nfs_fs_sb.h>
#include <linux/nfs_mount.h>
+#include <uapi/linux/mount.h>

#include "do_mounts.h"

diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index d1a5d885ce13..56a557403d39 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -8,6 +8,7 @@
#include <linux/sched.h>
#include <linux/freezer.h>
#include <linux/kmod.h>
+#include <uapi/linux/mount.h>

#include "do_mounts.h"

diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index a7bb2f5377f7..3d98ace5b898 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -24,6 +24,7 @@
#include <linux/audit.h>
#include <linux/user_namespace.h>
#include <net/sock.h>
+#include <uapi/linux/mount.h>

#include "include/apparmor.h"
#include "include/apparmorfs.h"
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index c1da22482bfb..8c3787399356 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -15,6 +15,7 @@
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <uapi/linux/mount.h>

#include "include/apparmor.h"
#include "include/audit.h"
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 48704d5a15af..9102a8fecb15 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -88,6 +88,7 @@
#include <linux/msg.h>
#include <linux/shm.h>
#include <linux/bpf.h>
+#include <uapi/linux/mount.h>

#include "avc.h"
#include "objsec.h"
diff --git a/security/tomoyo/mount.c b/security/tomoyo/mount.c
index 807fd91dbb54..7dc7f59b7dde 100644
--- a/security/tomoyo/mount.c
+++ b/security/tomoyo/mount.c
@@ -6,6 +6,7 @@
*/

#include <linux/slab.h>
+#include <uapi/linux/mount.h>
#include "common.h"

/* String table for special mount operations. */


2018-09-21 16:31:38

by David Howells

Subject: [PATCH 05/34] vfs: Introduce the basic header for the new mount API's filesystem context [ver #12]

Introduce a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount. This is
allocated at the beginning of the mount procedure and into it is placed:

(1) Filesystem type.

(2) Namespaces.

(3) Source/Device names (there may be multiple).

(4) Superblock flags (SB_*).

(5) Security details.

(6) Filesystem-specific data, as set by the mount options.

Also introduce a struct representing a typed key=value parameter, with which
configuration data will be passed to filesystems. This allows not only
ordinary string values but also more exotic values, such as binary blobs,
paths and fds, to be passed with greater ease.

Signed-off-by: David Howells <[email protected]>
---

include/linux/fs_context.h | 108 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 108 insertions(+)
create mode 100644 include/linux/fs_context.h

diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
new file mode 100644
index 000000000000..56e406a96b80
--- /dev/null
+++ b/include/linux/fs_context.h
@@ -0,0 +1,108 @@
+/* Filesystem superblock creation and reconfiguration context.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FS_CONTEXT_H
+#define _LINUX_FS_CONTEXT_H
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+
+struct cred;
+struct dentry;
+struct file_operations;
+struct file_system_type;
+struct mnt_namespace;
+struct net;
+struct pid_namespace;
+struct super_block;
+struct user_namespace;
+struct vfsmount;
+struct path;
+
+enum fs_context_purpose {
+ FS_CONTEXT_FOR_USER_MOUNT, /* New superblock for user-specified mount */
+ FS_CONTEXT_FOR_KERNEL_MOUNT, /* New superblock for kernel-internal mount */
+ FS_CONTEXT_FOR_SUBMOUNT, /* New superblock for automatic submount */
+ FS_CONTEXT_FOR_ROOT_MOUNT, /* New superblock for internal root mount */
+ FS_CONTEXT_FOR_RECONFIGURE, /* Superblock reconfiguration (remount) */
+ FS_CONTEXT_FOR_UMOUNT, /* Reconfiguration to R/O for unmount */
+ FS_CONTEXT_FOR_EMERGENCY_RO, /* Emergency reconfiguration to R/O */
+};
+
+/*
+ * Type of parameter value.
+ */
+enum fs_value_type {
+ fs_value_is_undefined,
+ fs_value_is_flag, /* Value not given a value */
+ fs_value_is_string, /* Value is a string */
+ fs_value_is_blob, /* Value is a binary blob */
+ fs_value_is_filename, /* Value is a filename* + dirfd */
+ fs_value_is_filename_empty, /* Value is a filename* + dirfd + AT_EMPTY_PATH */
+ fs_value_is_file, /* Value is a file* */
+};
+
+/*
+ * Configuration parameter.
+ */
+struct fs_parameter {
+ const char *key; /* Parameter name */
+ enum fs_value_type type:8; /* The type of value here */
+ union {
+ char *string;
+ void *blob;
+ struct filename *name;
+ struct file *file;
+ };
+ size_t size;
+ int dirfd;
+};
+
+/*
+ * Filesystem context for holding the parameters used in the creation or
+ * reconfiguration of a superblock.
+ *
+ * Superblock creation fills in ->root whereas reconfiguration begins with this
+ * already set.
+ *
+ * See Documentation/filesystems/mounting.txt
+ */
+struct fs_context {
+ const struct fs_context_operations *ops;
+ struct file_system_type *fs_type;
+ void *fs_private; /* The filesystem's context */
+ struct dentry *root; /* The root and superblock */
+ struct user_namespace *user_ns; /* The user namespace for this mount */
+ struct net *net_ns; /* The network namespace for this mount */
+ const struct cred *cred; /* The mounter's credentials */
+ char *source; /* The source name (eg. dev path) */
+ char *subtype; /* The subtype to set on the superblock */
+ void *security; /* The LSM context */
+ void *s_fs_info; /* Proposed s_fs_info */
+ unsigned int sb_flags; /* Proposed superblock flags (SB_*) */
+ unsigned int sb_flags_mask; /* Superblock flags that were changed */
+ enum fs_context_purpose purpose:8;
+ bool sloppy:1; /* T if unrecognised options are okay */
+ bool silent:1; /* T if "o silent" specified */
+ bool need_free:1; /* Need to call ops->free() */
+};
+
+struct fs_context_operations {
+ void (*free)(struct fs_context *fc);
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ int (*parse_param)(struct fs_context *fc, struct fs_parameter *param);
+ int (*parse_monolithic)(struct fs_context *fc, void *data, size_t data_size);
+ int (*validate)(struct fs_context *fc);
+ int (*get_tree)(struct fs_context *fc);
+ int (*reconfigure)(struct fs_context *fc);
+};
+
+#endif /* _LINUX_FS_CONTEXT_H */


2018-09-21 16:31:50

by David Howells

Subject: [PATCH 06/34] vfs: Introduce logging functions [ver #12]

Introduce a set of logging functions through which informational messages,
warnings and error messages incurred by the mount procedure can be logged
and, in a future patch, passed to userspace instead by way of the
filesystem configuration context file descriptor.

There are four functions:

(1) infof(const char *fmt, ...);

Logs an informational message.

(2) warnf(const char *fmt, ...);

Logs a warning message.

(3) errorf(const char *fmt, ...);

Logs an error message.

(4) invalf(const char *fmt, ...);

As errorf(), but returns -EINVAL so that it can be used in a return statement.

Signed-off-by: David Howells <[email protected]>
---

include/linux/fs_context.h | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)

diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 56e406a96b80..4b7327852b7f 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -105,4 +105,46 @@ struct fs_context_operations {
int (*reconfigure)(struct fs_context *fc);
};

+#define logfc(FC, FMT, ...) pr_notice(FMT, ## __VA_ARGS__)
+
+/**
+ * infof - Store supplementary informational message
+ * @fc: The context in which to log the informational message
+ * @fmt: The format string
+ *
+ * Store the supplementary informational message for the process if the process
+ * has enabled the facility.
+ */
+#define infof(fc, fmt, ...) ({ logfc(fc, fmt, ## __VA_ARGS__); })
+
+/**
+ * warnf - Store supplementary warning message
+ * @fc: The context in which to log the error message
+ * @fmt: The format string
+ *
+ * Store the supplementary warning message for the process if the process has
+ * enabled the facility.
+ */
+#define warnf(fc, fmt, ...) ({ logfc(fc, fmt, ## __VA_ARGS__); })
+
+/**
+ * errorf - Store supplementary error message
+ * @fc: The context in which to log the error message
+ * @fmt: The format string
+ *
+ * Store the supplementary error message for the process if the process has
+ * enabled the facility.
+ */
+#define errorf(fc, fmt, ...) ({ logfc(fc, fmt, ## __VA_ARGS__); })
+
+/**
+ * invalf - Store supplementary invalid argument error message
+ * @fc: The context in which to log the error message
+ * @fmt: The format string
+ *
+ * Store the supplementary error message for the process if the process has
+ * enabled the facility and return -EINVAL.
+ */
+#define invalf(fc, fmt, ...) ({ errorf(fc, fmt, ## __VA_ARGS__); -EINVAL; })
+
#endif /* _LINUX_FS_CONTEXT_H */


2018-09-21 16:32:06

by David Howells

Subject: [PATCH 08/34] vfs: Add LSM hooks for the new mount API [ver #12]

Add LSM hooks for use by the new mount API and filesystem context code.
This includes:

(1) Hooks to handle allocation, duplication and freeing of the security
record attached to a filesystem context.

(2) A hook to snoop source specifications. There may be multiple of these
if the filesystem supports it. They will be local files/devices if
fs_context::source_is_dev is true and will be something else, possibly
remote server specifications, if false.

(3) A hook to snoop superblock configuration options in key[=val] form.
If the LSM decides it wants to handle it, it can suppress the option
being passed to the filesystem. Note that 'val' may include commas
and binary data with the fsopen patch.

(4) A hook to perform validation and allocation after the configuration
has been done but before the superblock is allocated and set up.

(5) A hook to transfer the security from the context to a newly created
superblock.

(6) A hook to rule on whether a path point can be used as a mountpoint.

These are intended to replace:

security_sb_copy_data
security_sb_kern_mount
security_sb_mount
security_sb_set_mnt_opts
security_sb_clone_mnt_opts
security_sb_parse_opts_str

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---

include/linux/lsm_hooks.h | 61 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/security.h | 47 +++++++++++++++++++++++++++++++++++
security/security.c | 41 ++++++++++++++++++++++++++++++
3 files changed, 149 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index d052db1a1565..7e50bfa1aee0 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -76,6 +76,49 @@
* changes on the process such as clearing out non-inheritable signal
* state. This is called immediately after commit_creds().
*
+ * Security hooks for mount using fs_context.
+ * [See also Documentation/filesystems/mounting.txt]
+ *
+ * @fs_context_alloc:
+ * Allocate and attach a security structure to sc->security. This pointer
+ * is initialised to NULL by the caller.
+ * @fc indicates the new filesystem context.
+ * @reference indicates the source dentry of a submount or start of reconfig.
+ * @fs_context_dup:
+ * Allocate and attach a security structure to sc->security. This pointer
+ * is initialised to NULL by the caller.
+ * @fc indicates the new filesystem context.
+ * @src_fc indicates the original filesystem context.
+ * @fs_context_free:
+ * Clean up a filesystem context.
+ * @fc indicates the filesystem context.
+ * @fs_context_parse_param:
+ * Userspace provided a parameter to configure a superblock. The LSM may
+ * reject it with an error and may use it for itself, in which case it
+ * should return 0; otherwise it should return -ENOPARAM to pass it on to
+ * the filesystem.
+ * @fc indicates the filesystem context.
+ * @param The parameter
+ * @fs_context_validate:
+ * Validate the filesystem context preparatory to applying it. This is
+ * done after all the options have been parsed.
+ * @fc indicates the filesystem context.
+ * @sb_get_tree:
+ * Assign the security to a newly created superblock.
+ * @fc indicates the filesystem context.
+ * @fc->root indicates the root that will be mounted.
+ * @fc->root->d_sb points to the superblock.
+ * @sb_reconfigure:
+ * Apply reconfiguration to the security on a superblock.
+ * @fc indicates the filesystem context.
+ * @fc->root indicates a dentry in the mount.
+ * @fc->root->d_sb points to the superblock.
+ * @sb_mountpoint:
+ * Equivalent of sb_mount, but with an fs_context.
+ * @fc indicates the filesystem context.
+ * @mountpoint indicates the path on which the mount will take place.
+ * @mnt_flags indicates the MNT_* flags specified.
+ *
* Security hooks for filesystem operations.
*
* @sb_alloc_security:
@@ -1466,6 +1509,16 @@ union security_list_options {
void (*bprm_committing_creds)(struct linux_binprm *bprm);
void (*bprm_committed_creds)(struct linux_binprm *bprm);

+ int (*fs_context_alloc)(struct fs_context *fc, struct dentry *reference);
+ int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_sc);
+ void (*fs_context_free)(struct fs_context *fc);
+ int (*fs_context_parse_param)(struct fs_context *fc, struct fs_parameter *param);
+ int (*fs_context_validate)(struct fs_context *fc);
+ int (*sb_get_tree)(struct fs_context *fc);
+ void (*sb_reconfigure)(struct fs_context *fc);
+ int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags);
+
int (*sb_alloc_security)(struct super_block *sb);
void (*sb_free_security)(struct super_block *sb);
int (*sb_copy_data)(char *orig, size_t orig_size, char *copy);
@@ -1808,6 +1861,14 @@ struct security_hook_heads {
struct hlist_head bprm_check_security;
struct hlist_head bprm_committing_creds;
struct hlist_head bprm_committed_creds;
+ struct hlist_head fs_context_alloc;
+ struct hlist_head fs_context_dup;
+ struct hlist_head fs_context_free;
+ struct hlist_head fs_context_parse_param;
+ struct hlist_head fs_context_validate;
+ struct hlist_head sb_get_tree;
+ struct hlist_head sb_reconfigure;
+ struct hlist_head sb_mountpoint;
struct hlist_head sb_alloc_security;
struct hlist_head sb_free_security;
struct hlist_head sb_copy_data;
diff --git a/include/linux/security.h b/include/linux/security.h
index a306061d2197..636215bf4c1b 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -53,6 +53,9 @@ struct msg_msg;
struct xattr;
struct xfrm_sec_ctx;
struct mm_struct;
+struct fs_context;
+struct fs_parameter;
+enum fs_value_type;

/* If capable should audit the security request */
#define SECURITY_CAP_NOAUDIT 0
@@ -246,6 +249,15 @@ int security_bprm_set_creds(struct linux_binprm *bprm);
int security_bprm_check(struct linux_binprm *bprm);
void security_bprm_committing_creds(struct linux_binprm *bprm);
void security_bprm_committed_creds(struct linux_binprm *bprm);
+int security_fs_context_alloc(struct fs_context *fc, struct dentry *reference);
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc);
+void security_fs_context_free(struct fs_context *fc);
+int security_fs_context_parse_param(struct fs_context *fc, struct fs_parameter *param);
+int security_fs_context_validate(struct fs_context *fc);
+int security_sb_get_tree(struct fs_context *fc);
+void security_sb_reconfigure(struct fs_context *fc);
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags);
int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
int security_sb_copy_data(char *orig, size_t orig_size, char *copy);
@@ -548,6 +560,41 @@ static inline void security_bprm_committed_creds(struct linux_binprm *bprm)
{
}

+static inline int security_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference)
+{
+ return 0;
+}
+static inline int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ return 0;
+}
+static inline void security_fs_context_free(struct fs_context *fc)
+{
+}
+static inline int security_fs_context_parse_param(struct fs_context *fc,
+ struct fs_parameter *param)
+{
+ return -ENOPARAM;
+}
+static inline int security_fs_context_validate(struct fs_context *fc)
+{
+ return 0;
+}
+static inline int security_sb_get_tree(struct fs_context *fc)
+{
+ return 0;
+}
+static inline void security_sb_reconfigure(struct fs_context *fc)
+{
+}
+static inline int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ return 0;
+}
+
static inline int security_sb_alloc(struct super_block *sb)
{
return 0;
diff --git a/security/security.c b/security/security.c
index 96a061cecb39..64304d20aae1 100644
--- a/security/security.c
+++ b/security/security.c
@@ -363,6 +363,47 @@ void security_bprm_committed_creds(struct linux_binprm *bprm)
call_void_hook(bprm_committed_creds, bprm);
}

+int security_fs_context_alloc(struct fs_context *fc, struct dentry *reference)
+{
+ return call_int_hook(fs_context_alloc, 0, fc, reference);
+}
+
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ return call_int_hook(fs_context_dup, 0, fc, src_fc);
+}
+
+void security_fs_context_free(struct fs_context *fc)
+{
+ call_void_hook(fs_context_free, fc);
+}
+
+int security_fs_context_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ return call_int_hook(fs_context_parse_param, -ENOPARAM, fc, param);
+}
+
+int security_fs_context_validate(struct fs_context *fc)
+{
+ return call_int_hook(fs_context_validate, 0, fc);
+}
+
+int security_sb_get_tree(struct fs_context *fc)
+{
+ return call_int_hook(sb_get_tree, 0, fc);
+}
+
+void security_sb_reconfigure(struct fs_context *fc)
+{
+ call_void_hook(sb_reconfigure, fc);
+}
+
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ return call_int_hook(sb_mountpoint, 0, fc, mountpoint, mnt_flags);
+}
+
int security_sb_alloc(struct super_block *sb)
{
return call_int_hook(sb_alloc_security, 0, sb);


2018-09-21 16:32:17

by David Howells

Subject: [PATCH 09/34] vfs: Put security flags into the fs_context struct [ver #12]

Put security flags, such as SECURITY_LSM_NATIVE_LABELS, into the filesystem
context so that the filesystem can communicate them to the LSM more easily.

Signed-off-by: David Howells <[email protected]>
---

include/linux/fs_context.h | 1 +
include/linux/security.h | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 4b7327852b7f..83c40d30868e 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -89,6 +89,7 @@ struct fs_context {
void *s_fs_info; /* Proposed s_fs_info */
unsigned int sb_flags; /* Proposed superblock flags (SB_*) */
unsigned int sb_flags_mask; /* Superblock flags that were changed */
+ unsigned int lsm_flags; /* Information flags from the fs to the LSM */
enum fs_context_purpose purpose:8;
bool sloppy:1; /* T if unrecognised options are okay */
bool silent:1; /* T if "o silent" specified */
diff --git a/include/linux/security.h b/include/linux/security.h
index 636215bf4c1b..bae191a96c73 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -61,7 +61,7 @@ enum fs_value_type;
#define SECURITY_CAP_NOAUDIT 0
#define SECURITY_CAP_AUDIT 1

-/* LSM Agnostic defines for sb_set_mnt_opts */
+/* LSM Agnostic defines for fs_context::lsm_flags */
#define SECURITY_LSM_NATIVE_LABELS 1

struct ctl_table;


2018-09-21 16:32:17

by David Howells

Subject: [PATCH 07/34] vfs: Add configuration parser helpers [ver #12]

Because the new API passes in key,value parameters, match_token() cannot be
used with it. Instead, provide three new helpers to aid with parsing:

(1) fs_parse(). This takes a parameter and a simple static description of
all the parameters and maps the key name to an ID. It returns the
parameter ID on a match, -ENOPARAM if the key is not recognised and
some other negative error code on a parse error.

The parameter description includes a list of key names to IDs, desired
parameter types and a list of enumeration name -> ID mappings.

[!] Note that for the moment the key->ID mapping array is required to be
sorted and unterminated. The size of the array is noted in the
fs_parameter_description struct. This allows me to use bsearch(), but
I'm not sure any performance gain is worth the hassle of requiring
people to keep the array sorted.

The parameter type array is sized according to the number of parameter
IDs and is indexed directly. The optional enum mapping array is an
unterminated, unsorted list and the size goes into the
fs_parameter_description struct.

The function can do some additional things:

(a) If it's not ambiguous and no value is given, the prefix "no" on
a key name is permitted to indicate that the parameter should
be considered negated.

(b) If the desired type is a single simple integer, it will perform
an appropriate conversion and store the result in a union in
the parse result.

(c) If the desired type is an enumeration, {key ID, name} will be
looked up in the enumeration list and the matching value will
be stored in the parse result union.

(d) Optionally generate an error if the key is unrecognised.

This is called something like:

enum rdt_param {
Opt_cdp,
Opt_cdpl2,
Opt_mba_mpbs,
nr__rdt_params
};

const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
[Opt_cdp] = { fs_param_is_bool },
[Opt_cdpl2] = { fs_param_is_bool },
[Opt_mba_mpbs] = { fs_param_is_bool },
};

const char *const rdt_param_keys[nr__rdt_params] = {
[Opt_cdp] = "cdp",
[Opt_cdpl2] = "cdpl2",
[Opt_mba_mpbs] = "mba_mbps",
};

const struct fs_parameter_description rdt_parser = {
.name = "rdt",
.nr_params = nr__rdt_params,
.keys = rdt_param_keys,
.specs = rdt_param_specs,
.no_source = true,
};

int rdt_parse_param(struct fs_context *fc,
struct fs_parameter *param)
{
struct fs_parse_result parse;
struct rdt_fs_context *ctx = rdt_fc2context(fc);
int ret;

ret = fs_parse(fc, &rdt_parser, param, &parse);
if (ret < 0)
return ret;

switch (parse.key) {
case Opt_cdp:
ctx->enable_cdpl3 = true;
return 0;
case Opt_cdpl2:
ctx->enable_cdpl2 = true;
return 0;
case Opt_mba_mpbs:
ctx->enable_mba_mbps = true;
return 0;
}

return -EINVAL;
}

(2) fs_lookup_param(). This takes a { dirfd, path, LOOKUP_EMPTY? } or
string value and performs an appropriate path lookup to convert it
into a path object, which it will then return.

If the desired type was a blockdev, the type of the looked up inode
will be checked to make sure it is one.

This can be used like:

enum foo_param {
Opt_source,
nr__foo_params
};

const struct fs_parameter_spec foo_param_specs[nr__foo_params] = {
[Opt_source] = { fs_param_is_blockdev },
};

const char *const foo_param_keys[nr__foo_params] = {
[Opt_source] = "source",
};

const struct constant_table foo_param_alt_keys[] = {
{ "device", Opt_source },
};

const struct fs_parameter_description foo_parser = {
.name = "foo",
.nr_params = nr__foo_params,
.nr_alt_keys = ARRAY_SIZE(foo_param_alt_keys),
.keys = foo_param_keys,
.alt_keys = foo_param_alt_keys,
.specs = foo_param_specs,
};

int foo_parse_param(struct fs_context *fc,
struct fs_parameter *param)
{
struct fs_parse_result parse;
struct foo_fs_context *ctx = foo_fc2context(fc);
int ret;

ret = fs_parse(fc, &foo_parser, param, &parse);
if (ret < 0)
return ret;

switch (parse.key) {
case Opt_source:
return fs_lookup_param(fc, param, true, &ctx->source);
default:
return -EINVAL;
}
}

(3) lookup_constant(). This takes a table of named constants and looks up
the given name within it. The table is expected to be sorted so that
bsearch() can be used upon it.

Possibly I should require the table be terminated and just use a
for-loop to scan it instead of using bsearch() to reduce hassle.

Tables look something like:

static const struct constant_table bool_names[] = {
{ "0", false },
{ "1", true },
{ "false", false },
{ "no", false },
{ "true", true },
{ "yes", true },
};

and a lookup is done with something like:

b = lookup_constant(bool_names, param->string, -1);
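The lookup in (3) can be tried out in userspace. The following is only a sketch under stated assumptions: struct constant_table is redeclared locally, lookup_constant_sketch() and BOOL_NAMES_COUNT are hypothetical names standing in for the kernel helper and ARRAY_SIZE(), and libc's bsearch() replaces the kernel one. It shows why the table has to be kept in strcmp() order.

```c
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct constant_table. */
struct constant_table {
	const char *name;
	int value;
};

/* Must stay sorted by name in strcmp() order, or bsearch() may miss entries. */
static const struct constant_table bool_names[] = {
	{ "0",     0 },
	{ "1",     1 },
	{ "false", 0 },
	{ "no",    0 },
	{ "true",  1 },
	{ "yes",   1 },
};

#define BOOL_NAMES_COUNT (sizeof(bool_names) / sizeof(bool_names[0]))

/* bsearch() comparator: the key is the name string, the element is an entry. */
static int cmp_constant(const void *name, const void *entry)
{
	const struct constant_table *e = entry;

	return strcmp(name, e->name);
}

/* Hypothetical helper: return the value for @name, or @not_found if absent. */
static int lookup_constant_sketch(const struct constant_table *tbl, size_t n,
				  const char *name, int not_found)
{
	const struct constant_table *e;

	e = bsearch(name, tbl, n, sizeof(tbl[0]), cmp_constant);
	return e ? e->value : not_found;
}
```

A caller would pass the fallback it wants for unknown names, for example lookup_constant_sketch(bool_names, BOOL_NAMES_COUNT, "maybe", -1), and treat -1 as a bad value.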

Additionally, optional validation routines for the parameter description
are provided that can be enabled at compile time. A later patch will
invoke these when a filesystem is registered.
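The core of that validation, checking that a table really is sorted and free of duplicates, can be sketched in userspace as below. This is an illustration only, assuming a locally redeclared struct constant_table; table_is_sorted(), good_tbl and bad_tbl are hypothetical names, and the sketch returns a single verdict where the routines in the patch log each problem and continue.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct constant_table. */
struct constant_table {
	const char *name;
	int value;
};

/*
 * Return true if every name is non-NULL and the names are in strictly
 * ascending strcmp() order, i.e. sorted with no duplicates.
 */
static bool table_is_sorted(const struct constant_table *tbl, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!tbl[i].name)
			return false;
		/* tbl[i - 1].name was checked on the previous iteration. */
		if (i > 0 && strcmp(tbl[i - 1].name, tbl[i].name) >= 0)
			return false;
	}
	return true;
}

/* Correctly sorted example table. */
static const struct constant_table good_tbl[] = {
	{ "false", 0 },
	{ "no",    0 },
	{ "yes",   1 },
};

/* Missorted example table: "yes" sorts after "no". */
static const struct constant_table bad_tbl[] = {
	{ "yes", 1 },
	{ "no",  0 },
};
```

Running such a check once at registration time keeps the cost out of the parse path, which is presumably why the validation is a compile-time option rather than part of fs_parse().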

Signed-off-by: David Howells <[email protected]>
---

fs/Kconfig | 7 +
fs/Makefile | 3
fs/fs_parser.c | 555 +++++++++++++++++++++++++++++++++++++++++++++
fs/internal.h | 2
fs/namei.c | 4
include/linux/errno.h | 1
include/linux/fs_parser.h | 119 ++++++++++
7 files changed, 688 insertions(+), 3 deletions(-)
create mode 100644 fs/fs_parser.c
create mode 100644 include/linux/fs_parser.h

diff --git a/fs/Kconfig b/fs/Kconfig
index ac474a61be37..25700b152c75 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -8,6 +8,13 @@ menu "File systems"
config DCACHE_WORD_ACCESS
bool

+config VALIDATE_FS_PARSER
+ bool "Validate filesystem parameter description"
+ default y
+ help
+ Enable this to perform validation of the parameter description for a
+ filesystem when it is registered.
+
if BLOCK

config FS_IOMAP
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..07b894227dce 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -12,7 +12,8 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
- stack.o fs_struct.o statfs.o fs_pin.o nsfs.o
+ stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
+ fs_parser.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_parser.c b/fs/fs_parser.c
new file mode 100644
index 000000000000..cee210eddd10
--- /dev/null
+++ b/fs/fs_parser.c
@@ -0,0 +1,555 @@
+/* Filesystem parameter parser.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/export.h>
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
+#include <linux/slab.h>
+#include <linux/security.h>
+#include <linux/namei.h>
+#include <linux/bsearch.h>
+#include "internal.h"
+
+static const struct constant_table bool_names[] = {
+ { "0", false },
+ { "1", true },
+ { "false", false },
+ { "no", false },
+ { "true", true },
+ { "yes", true },
+};
+
+static int cmp_constant(const void *name, const void *entry)
+{
+ const struct constant_table *e = entry;
+ return strcmp(name, e->name);
+}
+
+/**
+ * __lookup_constant - Look up a constant by name in an ordered table
+ * @tbl: The table of constants to search.
+ * @tbl_size: The size of the table.
+ * @name: The name to look up.
+ * @not_found: The value to return if the name is not found.
+ */
+int __lookup_constant(const struct constant_table *tbl, size_t tbl_size,
+ const char *name, int not_found)
+{
+ const struct constant_table *e;
+
+ e = bsearch(name, tbl, tbl_size, sizeof(tbl[0]), cmp_constant);
+ if (!e)
+ return not_found;
+ return e->value;
+}
+EXPORT_SYMBOL(__lookup_constant);
+
+static int cmp_key(const void *name, const void *entry)
+{
+ const char *const *e = entry;
+ return strcmp(name, *e);
+}
+
+static int fs_lookup_key(const struct fs_parameter_description *desc,
+ const char *name)
+{
+ const char *const *e;
+
+ e = bsearch(name, desc->keys, desc->nr_params,
+ sizeof(const char *), cmp_key);
+ if (e)
+ return e - desc->keys;
+
+ return __lookup_constant(desc->alt_keys, desc->nr_alt_keys, name,
+ -ENOPARAM);
+}
+
+/*
+ * fs_parse - Parse a filesystem configuration parameter
+ * @fc: The filesystem context to log errors through.
+ * @desc: The parameter description to use.
+ * @param: The parameter.
+ * @result: Where to place the result of the parse
+ *
+ * Parse a filesystem configuration parameter and attempt a conversion for a
+ * simple parameter for which this is requested. If successful, the determined
+ * parameter ID is placed into @result->key, the desired type is indicated in
+ * @result->t and any converted value is placed into an appropriate member of
+ * the union in @result.
+ *
+ * The function returns the parameter number if the parameter was matched,
+ * -ENOPARAM if the key wasn't recognised (including an unmatched "noxxx"
+ * form) and -EINVAL if there was a conversion issue or the parameter name
+ * has been removed.
+ */
+int fs_parse(struct fs_context *fc,
+ const struct fs_parameter_description *desc,
+ struct fs_parameter *param,
+ struct fs_parse_result *result)
+{
+ int ret, k, i, b;
+
+ result->has_value = !!param->string;
+
+ k = fs_lookup_key(desc, param->key);
+ if (k == -ENOPARAM) {
+ /* If we didn't find something that looks like "noxxx", see if
+ * "xxx" takes the "no"-form negative - but only if there
+ * wasn't a value.
+ */
+ if (result->has_value)
+ goto unknown_parameter;
+ if (param->key[0] != 'n' || param->key[1] != 'o' || !param->key[2])
+ goto unknown_parameter;
+
+ k = fs_lookup_key(desc, param->key + 2);
+ if (k == -ENOPARAM)
+ goto unknown_parameter;
+ if (!(desc->specs[k].flags & fs_param_neg_with_no))
+ goto unknown_parameter;
+ result->key = k;
+ result->uint_32 = 0;
+ result->negated = true;
+ goto okay;
+ }
+
+ result->key = k;
+ result->negated = false;
+ if (result->key == fsconfig_key_removed)
+ return invalf(fc, "%s: Unsupported parameter name '%s'",
+ desc->name, param->key);
+
+ result->t = desc->specs[result->key];
+ if (result->t.flags & fs_param_deprecated)
+ warnf(fc, "%s: Deprecated parameter '%s'",
+ desc->name, param->key);
+
+ /* Certain parameter types only take a string and convert it. */
+ switch (result->t.type) {
+ case __fs_param_wasnt_defined:
+ return -EINVAL;
+ case fs_param_is_u32:
+ case fs_param_is_u32_octal:
+ case fs_param_is_u32_hex:
+ case fs_param_is_s32:
+ case fs_param_is_u64:
+ case fs_param_is_enum:
+ case fs_param_is_string:
+ if (param->type != fs_value_is_string)
+ goto bad_value;
+ if (!result->has_value) {
+ if (desc->specs[k].flags & fs_param_v_optional)
+ goto okay;
+ goto bad_value;
+ }
+ /* Fall through */
+ default:
+ break;
+ }
+
+ /* Try to turn the type we were given into the type desired by the
+ * parameter and give an error if we can't.
+ */
+ switch (result->t.type) {
+ case fs_param_is_flag:
+ if (param->type != fs_value_is_flag &&
+ (param->type != fs_value_is_string || result->has_value))
+ return invalf(fc, "%s: Unexpected value for '%s'",
+ desc->name, param->key);
+ result->boolean = true;
+ goto okay;
+
+ case fs_param_is_bool:
+ switch (param->type) {
+ case fs_value_is_flag:
+ result->boolean = true;
+ goto okay;
+ case fs_value_is_string:
+ if (param->size == 0) {
+ result->boolean = true;
+ goto okay;
+ }
+ b = lookup_constant(bool_names, param->string, -1);
+ if (b == -1)
+ goto bad_value;
+ result->boolean = b;
+ goto okay;
+ default:
+ goto bad_value;
+ }
+
+ case fs_param_is_u32:
+ ret = kstrtouint(param->string, 0, &result->uint_32);
+ goto maybe_okay;
+ case fs_param_is_u32_octal:
+ ret = kstrtouint(param->string, 8, &result->uint_32);
+ goto maybe_okay;
+ case fs_param_is_u32_hex:
+ ret = kstrtouint(param->string, 16, &result->uint_32);
+ goto maybe_okay;
+ case fs_param_is_s32:
+ ret = kstrtoint(param->string, 0, &result->int_32);
+ goto maybe_okay;
+ case fs_param_is_u64:
+ ret = kstrtoull(param->string, 0, &result->uint_64);
+ goto maybe_okay;
+
+ case fs_param_is_enum:
+ for (i = 0; i < desc->nr_enums; i++) {
+ if (desc->enums[i].param_id == result->key &&
+ strcmp(desc->enums[i].name, param->string) == 0) {
+ result->uint_32 = desc->enums[i].value;
+ goto okay;
+ }
+ }
+ goto bad_value;
+
+ case fs_param_is_string:
+ goto okay;
+ case fs_param_is_blob:
+ if (param->type != fs_value_is_blob)
+ goto bad_value;
+ goto okay;
+
+ case fs_param_is_fd: {
+ if (param->type != fs_value_is_file)
+ goto bad_value;
+ goto okay;
+ }
+
+ case fs_param_is_blockdev:
+ case fs_param_is_path:
+ goto okay;
+ default:
+ BUG();
+ }
+
+maybe_okay:
+ if (ret < 0)
+ goto bad_value;
+okay:
+ return result->key;
+
+bad_value:
+ return invalf(fc, "%s: Bad value for '%s'", desc->name, param->key);
+unknown_parameter:
+ return -ENOPARAM;
+}
+EXPORT_SYMBOL(fs_parse);
+
+/**
+ * fs_lookup_param - Look up a path referred to by a parameter
+ * @fc: The filesystem context to log errors through.
+ * @param: The parameter.
+ * @want_bdev: T if want a blockdev
+ * @_path: The result of the lookup
+ */
+int fs_lookup_param(struct fs_context *fc,
+ struct fs_parameter *param,
+ bool want_bdev,
+ struct path *_path)
+{
+ struct filename *f;
+ unsigned int flags = 0;
+ bool put_f;
+ int ret;
+
+ switch (param->type) {
+ case fs_value_is_string:
+ f = getname_kernel(param->string);
+ if (IS_ERR(f))
+ return PTR_ERR(f);
+ put_f = true;
+ break;
+ case fs_value_is_filename_empty:
+ flags = LOOKUP_EMPTY;
+ /* Fall through */
+ case fs_value_is_filename:
+ f = param->name;
+ put_f = false;
+ break;
+ default:
+ return invalf(fc, "%s: not usable as path", param->key);
+ }
+
+ ret = filename_lookup(param->dirfd, f, flags, _path, NULL);
+ if (ret < 0) {
+ errorf(fc, "%s: Lookup failure for '%s'", param->key, f->name);
+ goto out;
+ }
+
+ if (want_bdev &&
+ !S_ISBLK(d_backing_inode(_path->dentry)->i_mode)) {
+ path_put(_path);
+ _path->dentry = NULL;
+ _path->mnt = NULL;
+ errorf(fc, "%s: Non-blockdev passed as '%s'",
+ param->key, f->name);
+ ret = -ENOTBLK;
+ }
+
+out:
+ if (put_f)
+ putname(f);
+ return ret;
+}
+EXPORT_SYMBOL(fs_lookup_param);
+
+#ifdef CONFIG_VALIDATE_FS_PARSER
+/**
+ * validate_constant_table - Validate a constant table
+ * @name: Name to use in reporting
+ * @tbl: The constant table to validate.
+ * @tbl_size: The size of the table.
+ * @low: The lowest permissible value.
+ * @high: The highest permissible value.
+ * @special: One special permissible value outside of the range.
+ */
+bool validate_constant_table(const struct constant_table *tbl, size_t tbl_size,
+ int low, int high, int special)
+{
+ size_t i;
+ bool good = true;
+
+ if (tbl_size == 0) {
+ pr_warn("VALIDATE C-TBL: Empty\n");
+ return true;
+ }
+
+ for (i = 0; i < tbl_size; i++) {
+ if (!tbl[i].name) {
+ pr_err("VALIDATE C-TBL[%zu]: Null\n", i);
+ good = false;
+ } else if (i > 0 && tbl[i - 1].name) {
+ int c = strcmp(tbl[i-1].name, tbl[i].name);
+
+ if (c == 0) {
+ pr_err("VALIDATE C-TBL[%zu]: Duplicate %s\n",
+ i, tbl[i].name);
+ good = false;
+ }
+ if (c > 0) {
+ pr_err("VALIDATE C-TBL[%zu]: Missorted %s>=%s\n",
+ i, tbl[i-1].name, tbl[i].name);
+ good = false;
+ }
+ }
+
+ if (tbl[i].value != special &&
+ (tbl[i].value < low || tbl[i].value > high)) {
+ pr_err("VALIDATE C-TBL[%zu]: %s->%d const out of range (%d-%d)\n",
+ i, tbl[i].name, tbl[i].value, low, high);
+ good = false;
+ }
+ }
+
+ return good;
+}
+
+static bool validate_list(const char *const *tbl, size_t tbl_size)
+{
+ size_t i;
+ bool good = true;
+
+ for (i = 0; i < tbl_size; i++) {
+ if (!tbl[i]) {
+ pr_err("VALIDATE LIST[%zu]: Null\n", i);
+ good = false;
+ } else if (i > 0 && tbl[i - 1]) {
+ int c = strcmp(tbl[i-1], tbl[i]);
+
+ if (c == 0) {
+ pr_err("VALIDATE LIST[%zu]: Duplicate %s\n",
+ i, tbl[i]);
+ good = false;
+ }
+ if (c > 0) {
+ pr_err("VALIDATE LIST[%zu]: Missorted %s>=%s\n",
+ i, tbl[i-1], tbl[i]);
+ good = false;
+ }
+ }
+ }
+
+ return good;
+}
+
+/**
+ * fs_validate_description - Validate a parameter description
+ * @desc: The parameter description to validate.
+ */
+bool fs_validate_description(const struct fs_parameter_description *desc)
+{
+ const char *name = desc->name;
+ bool good = true, enums = false;
+ int i, j;
+
+ pr_notice("*** VALIDATE %s ***\n", name);
+
+ if (!name[0]) {
+ pr_err("VALIDATE Parser: No name\n");
+ name = "Unknown";
+ good = false;
+ }
+
+ if (desc->nr_params) {
+ if (!desc->specs) {
+ pr_err("VALIDATE %s: Missing types table\n", name);
+ good = false;
+ goto no_specs;
+ }
+
+ for (i = 0; i < desc->nr_params; i++) {
+ enum fs_parameter_type t = desc->specs[i].type;
+ if (t == __fs_param_wasnt_defined) {
+ pr_err("VALIDATE %s: [%u] Undefined type\n",
+ name, i);
+ good = false;
+ } else if (t >= nr__fs_parameter_type) {
+ pr_err("VALIDATE %s: [%u] Bad type %u\n",
+ name, i, t);
+ good = false;
+ } else if (t == fs_param_is_enum) {
+ enums = true;
+ }
+ }
+ }
+
+no_specs:
+ if (desc->nr_params) {
+ if (!desc->keys) {
+ pr_err("VALIDATE %s: Missing keys list\n", name);
+ good = false;
+ goto no_keys;
+ }
+
+ if (!validate_list(desc->keys, desc->nr_params)) {
+ pr_err("VALIDATE %s: Bad keys table\n", name);
+ good = false;
+ }
+
+ /* The "source" parameter is used to convey the device/source
+ * information.
+ */
+ if (desc->no_source) {
+ if (bsearch("source", desc->keys, desc->nr_params,
+ sizeof(const char *), cmp_key)) {
+ pr_err("VALIDATE %s: Source key, but marked no_source\n",
+ name);
+ good = false;
+ }
+
+ if (desc->source_param != 0) {
+ pr_err("VALIDATE %s: source_param not zero\n",
+ name);
+ good = false;
+ }
+ } else {
+ if (desc->source_param >= desc->nr_params) {
+ pr_err("VALIDATE %s: source_param is out of range\n",
+ name);
+ good = false;
+ goto no_keys;
+ }
+
+ if (strcmp(desc->keys[desc->source_param], "source") != 0) {
+ pr_err("VALIDATE %s: No source key, but not marked no_source\n",
+ name);
+ good = false;
+ }
+ }
+ } else {
+ if (desc->source_param) {
+ pr_err("VALIDATE %s: source_param not zero\n", name);
+ good = false;
+ }
+ }
+
+no_keys:
+ if (desc->nr_alt_keys) {
+ if (!desc->nr_params) {
+ pr_err("VALIDATE %s: %u alt_keys but no params\n",
+ name, desc->nr_alt_keys);
+ good = false;
+ goto no_alt_keys;
+ }
+ if (!desc->alt_keys) {
+ pr_err("VALIDATE %s: Missing alt_keys table\n", name);
+ good = false;
+ goto no_alt_keys;
+ }
+
+ if (!validate_constant_table(desc->alt_keys, desc->nr_alt_keys,
+ 0, desc->nr_params - 1,
+ fsconfig_key_removed)) {
+ pr_err("VALIDATE %s: Bad alt_keys table\n", name);
+ good = false;
+ }
+ }
+
+no_alt_keys:
+ if (desc->nr_enums) {
+ if (!enums) {
+ pr_err("VALIDATE %s: Enum table but no enum-type values\n",
+ name);
+ good = false;
+ goto no_enums;
+ }
+ if (!desc->enums) {
+ pr_err("VALIDATE %s: Missing enums table\n", name);
+ good = false;
+ goto no_enums;
+ }
+
+ for (j = 0; j < desc->nr_enums; j++) {
+ const struct fs_parameter_enum *e = &desc->enums[j];
+
+ if (!e->name[0]) {
+ pr_err("VALIDATE %s: e[%u] no name\n", name, j);
+ good = false;
+ }
+ if (e->param_id >= desc->nr_params) {
+ pr_err("VALIDATE %s: e[%u] bad param %u\n",
+ name, j, e->param_id);
+ good = false;
+ }
+ if (desc->specs[e->param_id].type != fs_param_is_enum) {
+ pr_err("VALIDATE %s: e[%u] enum val for non-enum type %u\n",
+ name, j, e->param_id);
+ good = false;
+ }
+ }
+
+ for (i = 0; i < desc->nr_params; i++) {
+ if (desc->specs[i].type != fs_param_is_enum)
+ continue;
+ for (j = 0; j < desc->nr_enums; j++)
+ if (desc->enums[j].param_id == i)
+ break;
+ if (j == desc->nr_enums) {
+ pr_err("VALIDATE %s: t[%u] enum with no vals\n",
+ name, i);
+ good = false;
+ }
+ }
+ } else {
+ if (enums) {
+ pr_err("VALIDATE %s: enum-type values, but no enum table\n",
+ name);
+ good = false;
+ goto no_enums;
+ }
+ }
+
+no_enums:
+ return good;
+}
+#endif /* CONFIG_VALIDATE_FS_PARSER */
diff --git a/fs/internal.h b/fs/internal.h
index 17029b30e196..63b6840de8c1 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -54,6 +54,8 @@ extern void __init chrdev_init(void);
/*
* namei.c
*/
+extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
+ struct path *path, struct path *root);
extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
const char *, unsigned int, struct path *);
diff --git a/fs/namei.c b/fs/namei.c
index 0cab6494978c..fb913148d4d1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2333,8 +2333,8 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path
return err;
}

-static int filename_lookup(int dfd, struct filename *name, unsigned flags,
- struct path *path, struct path *root)
+int filename_lookup(int dfd, struct filename *name, unsigned flags,
+ struct path *path, struct path *root)
{
int retval;
struct nameidata nd;
diff --git a/include/linux/errno.h b/include/linux/errno.h
index 3cba627577d6..d73f597a2484 100644
--- a/include/linux/errno.h
+++ b/include/linux/errno.h
@@ -18,6 +18,7 @@
#define ERESTART_RESTARTBLOCK 516 /* restart by calling sys_restart_syscall */
#define EPROBE_DEFER 517 /* Driver requests probe retry */
#define EOPENSTALE 518 /* open found a stale dentry */
+#define ENOPARAM 519 /* Parameter not supported */

/* Defined for the NFSv3 protocol */
#define EBADHANDLE 521 /* Illegal NFS file handle */
diff --git a/include/linux/fs_parser.h b/include/linux/fs_parser.h
new file mode 100644
index 000000000000..e21792a6fc33
--- /dev/null
+++ b/include/linux/fs_parser.h
@@ -0,0 +1,119 @@
+/* Filesystem parameter description and parser
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FS_PARSER_H
+#define _LINUX_FS_PARSER_H
+
+#include <linux/fs_context.h>
+
+struct path;
+
+struct constant_table {
+ const char *name;
+ int value;
+};
+
+#define fsconfig_key_removed 0xff /* Parameter name is no longer valid */
+
+/*
+ * The type of parameter expected.
+ */
+enum fs_parameter_type {
+ __fs_param_wasnt_defined,
+ fs_param_is_flag,
+ fs_param_is_bool,
+ fs_param_is_u32,
+ fs_param_is_u32_octal,
+ fs_param_is_u32_hex,
+ fs_param_is_s32,
+ fs_param_is_u64,
+ fs_param_is_enum,
+ fs_param_is_string,
+ fs_param_is_blob,
+ fs_param_is_blockdev,
+ fs_param_is_path,
+ fs_param_is_fd,
+ nr__fs_parameter_type,
+};
+
+/*
+ * Specification of the type of value a parameter wants.
+ */
+struct fs_parameter_spec {
+ enum fs_parameter_type type:8; /* The desired parameter type */
+ u8 flags;
+#define fs_param_v_optional 0x01 /* The value is optional */
+#define fs_param_neg_with_no 0x02 /* "noxxx" is negative param */
+#define fs_param_neg_with_empty 0x04 /* "xxx=" is negative param */
+#define fs_param_deprecated 0x08 /* The param is deprecated */
+};
+
+struct fs_parameter_enum {
+ u8 param_id;
+ char name[14];
+ u8 value;
+};
+
+struct fs_parameter_description {
+ const char name[16]; /* Name for logging purposes */
+ u8 nr_params; /* Number of parameter IDs */
+ u8 nr_alt_keys; /* Number of alt_keys[] */
+ u8 nr_enums; /* Number of enum value names */
+ u8 source_param; /* Index of source parameter */
+ bool no_source; /* Set if no source is expected */
+ const char *const *keys; /* Sorted list of key names, one per nr_params */
+ const struct constant_table *alt_keys; /* Sorted list of alternate key names */
+ const struct fs_parameter_spec *specs; /* List of param specifications */
+ const struct fs_parameter_enum *enums; /* Enum values */
+};
+
+/*
+ * Result of parse.
+ */
+struct fs_parse_result {
+ struct fs_parameter_spec t;
+ u8 key; /* Looked up key ID */
+ bool negated; /* T if param was "noxxx" */
+ bool has_value; /* T if value supplied to param */
+ union {
+ bool boolean; /* For spec_bool */
+ int int_32; /* For spec_s32/spec_enum */
+ unsigned int uint_32; /* For spec_u32{,_octal,_hex}/spec_enum */
+ u64 uint_64; /* For spec_u64 */
+ };
+};
+
+extern int fs_parse(struct fs_context *fc,
+ const struct fs_parameter_description *desc,
+ struct fs_parameter *value,
+ struct fs_parse_result *result);
+extern int fs_lookup_param(struct fs_context *fc,
+ struct fs_parameter *param,
+ bool want_bdev,
+ struct path *_path);
+
+extern int __lookup_constant(const struct constant_table tbl[], size_t tbl_size,
+ const char *name, int not_found);
+#define lookup_constant(t, n, nf) __lookup_constant(t, ARRAY_SIZE(t), (n), (nf))
+
+#ifdef CONFIG_VALIDATE_FS_PARSER
+extern bool validate_constant_table(const struct constant_table *tbl, size_t tbl_size,
+ int low, int high, int special);
+extern bool fs_validate_description(const struct fs_parameter_description *desc);
+#else
+static inline bool validate_constant_table(const struct constant_table *tbl, size_t tbl_size,
+ int low, int high, int special)
+{ return true; }
+static inline bool fs_validate_description(const struct fs_parameter_description *desc)
+{ return true; }
+#endif
+
+#endif /* _LINUX_FS_PARSER_H */


2018-09-21 16:32:31

by David Howells

Subject: [PATCH 10/34] selinux: Implement the new mount API LSM hooks [ver #12]

Implement the new mount API LSM hooks for SELinux. At some point the old
hooks will need to be removed.

Question: Should the ->fs_context_parse_source() hook be implemented to
check the labels on any source devices specified?

Signed-off-by: David Howells <[email protected]>
cc: Paul Moore <[email protected]>
cc: Stephen Smalley <[email protected]>
cc: [email protected]
cc: [email protected]
---

security/selinux/hooks.c | 336 ++++++++++++++++++++++++++++++++---
security/selinux/include/security.h | 16 +-
2 files changed, 319 insertions(+), 33 deletions(-)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 9102a8fecb15..5f2af9dd44fa 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -48,6 +48,8 @@
#include <linux/fdtable.h>
#include <linux/namei.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
#include <linux/netfilter_ipv4.h>
#include <linux/netfilter_ipv6.h>
#include <linux/tty.h>
@@ -439,24 +441,23 @@ static inline int inode_doinit(struct inode *inode)
}

enum {
- Opt_error = -1,
- Opt_context = 1,
+ Opt_context = 0,
+ Opt_defcontext = 1,
Opt_fscontext = 2,
- Opt_defcontext = 3,
- Opt_rootcontext = 4,
- Opt_labelsupport = 5,
- Opt_nextmntopt = 6,
+ Opt_rootcontext = 3,
+ Opt_seclabel = 4,
+ nr__selinux_params
};

-#define NUM_SEL_MNT_OPTS (Opt_nextmntopt - 1)
+#define NUM_SEL_MNT_OPTS (nr__selinux_params - 1)

static const match_table_t tokens = {
- {Opt_context, CONTEXT_STR "%s"},
- {Opt_fscontext, FSCONTEXT_STR "%s"},
- {Opt_defcontext, DEFCONTEXT_STR "%s"},
- {Opt_rootcontext, ROOTCONTEXT_STR "%s"},
- {Opt_labelsupport, LABELSUPP_STR},
- {Opt_error, NULL},
+ {Opt_context, CONTEXT_STR "=%s"},
+ {Opt_fscontext, FSCONTEXT_STR "=%s"},
+ {Opt_defcontext, DEFCONTEXT_STR "=%s"},
+ {Opt_rootcontext, ROOTCONTEXT_STR "=%s"},
+ {Opt_seclabel, SECLABEL_STR},
+ {-1, NULL},
};

#define SEL_MOUNT_FAIL_MSG "SELinux: duplicate or incompatible mount options\n"
@@ -615,15 +616,11 @@ static int selinux_get_mnt_opts(const struct super_block *sb,
if (!selinux_state.initialized)
return -EINVAL;

- /* make sure we always check enough bits to cover the mask */
- BUILD_BUG_ON(SE_MNTMASK >= (1 << NUM_SEL_MNT_OPTS));
-
tmp = sbsec->flags & SE_MNTMASK;
/* count the number of mount options for this sb */
for (i = 0; i < NUM_SEL_MNT_OPTS; i++) {
- if (tmp & 0x01)
+ if (tmp & (1 << i))
opts->num_mnt_opts++;
- tmp >>= 1;
}
/* Check if the Label support flag is set */
if (sbsec->flags & SBLABEL_MNT)
@@ -1154,7 +1151,7 @@ static int selinux_parse_opts_str(char *options,
goto out_err;
}
break;
- case Opt_labelsupport:
+ case Opt_seclabel:
break;
default:
rc = -EINVAL;
@@ -1259,7 +1256,7 @@ static void selinux_write_opts(struct seq_file *m,
break;
case SBLABEL_MNT:
seq_putc(m, ',');
- seq_puts(m, LABELSUPP_STR);
+ seq_puts(m, SECLABEL_STR);
continue;
default:
BUG();
@@ -1268,6 +1265,7 @@ static void selinux_write_opts(struct seq_file *m,
/* we need a comma before each option */
seq_putc(m, ',');
seq_puts(m, prefix);
+ seq_putc(m, '=');
if (has_comma)
seq_putc(m, '\"');
seq_escape(m, opts->mnt_opts[i], "\"\n\\");
@@ -2753,11 +2751,11 @@ static inline int match_prefix(char *prefix, int plen, char *option, int olen)

static inline int selinux_option(char *option, int len)
{
- return (match_prefix(CONTEXT_STR, sizeof(CONTEXT_STR)-1, option, len) ||
- match_prefix(FSCONTEXT_STR, sizeof(FSCONTEXT_STR)-1, option, len) ||
- match_prefix(DEFCONTEXT_STR, sizeof(DEFCONTEXT_STR)-1, option, len) ||
- match_prefix(ROOTCONTEXT_STR, sizeof(ROOTCONTEXT_STR)-1, option, len) ||
- match_prefix(LABELSUPP_STR, sizeof(LABELSUPP_STR)-1, option, len));
+ return (match_prefix(CONTEXT_STR"=", sizeof(CONTEXT_STR)-1, option, len) ||
+ match_prefix(FSCONTEXT_STR"=", sizeof(FSCONTEXT_STR)-1, option, len) ||
+ match_prefix(DEFCONTEXT_STR"=", sizeof(DEFCONTEXT_STR)-1, option, len) ||
+ match_prefix(ROOTCONTEXT_STR"=", sizeof(ROOTCONTEXT_STR)-1, option, len) ||
+ match_prefix(SECLABEL_STR"=", sizeof(SECLABEL_STR)-1, option, len));
}

static inline void take_option(char **to, char *from, int *first, int len)
@@ -2972,6 +2970,284 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
FILESYSTEM__UNMOUNT, NULL);
}

+/* fsopen mount context operations */
+
+static int selinux_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference)
+{
+ struct security_mnt_opts *opts;
+
+ opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+
+ fc->security = opts;
+ return 0;
+}
+
+static int selinux_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ const struct security_mnt_opts *src = src_fc->security;
+ struct security_mnt_opts *opts;
+ int i, n;
+
+ opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+ fc->security = opts;
+
+ if (!src || !src->num_mnt_opts)
+ return 0;
+ n = opts->num_mnt_opts = src->num_mnt_opts;
+
+ if (src->mnt_opts) {
+ opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
+ if (!opts->mnt_opts)
+ return -ENOMEM;
+
+ for (i = 0; i < n; i++) {
+ if (src->mnt_opts[i]) {
+ opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
+ GFP_KERNEL);
+ if (!opts->mnt_opts[i])
+ return -ENOMEM;
+ }
+ }
+ }
+
+ if (src->mnt_opts_flags) {
+ opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
+ n * sizeof(int), GFP_KERNEL);
+ if (!opts->mnt_opts_flags)
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static void selinux_fs_context_free(struct fs_context *fc)
+{
+ struct security_mnt_opts *opts = fc->security;
+
+ if (opts) {
+ security_free_mnt_opts(opts);
+ fc->security = NULL;
+ }
+}
+
+static const struct fs_parameter_spec selinux_param_specs[nr__selinux_params] = {
+ [Opt_context] = { fs_param_is_string },
+ [Opt_defcontext] = { fs_param_is_string },
+ [Opt_fscontext] = { fs_param_is_string },
+ [Opt_rootcontext] = { fs_param_is_string },
+ [Opt_seclabel] = { fs_param_is_flag },
+};
+
+static const char *const selinux_param_keys[nr__selinux_params] = {
+ [Opt_context] = CONTEXT_STR,
+ [Opt_defcontext] = DEFCONTEXT_STR,
+ [Opt_fscontext] = FSCONTEXT_STR,
+ [Opt_rootcontext] = ROOTCONTEXT_STR,
+ [Opt_seclabel] = SECLABEL_STR,
+};
+
+static const struct fs_parameter_description selinux_fs_parameters = {
+ .name = "SELinux",
+ .nr_params = nr__selinux_params,
+ .keys = selinux_param_keys,
+ .specs = selinux_param_specs,
+ .no_source = true,
+};
+
+static int selinux_fs_context_parse_param(struct fs_context *fc,
+ struct fs_parameter *param)
+{
+ struct security_mnt_opts *opts = fc->security;
+ struct fs_parse_result result;
+ unsigned int have;
+ char **oo;
+ int opt, ctx, i, *of;
+
+ opt = fs_parse(fc, &selinux_fs_parameters, param, &result);
+ if (opt < 0)
+ return opt;
+
+ have = 0;
+ for (i = 0; i < opts->num_mnt_opts; i++)
+ have |= opts->mnt_opts_flags[i]; /* *_MNT flags are already 1 << Opt_* */
+ if (have & (1 << opt))
+ return -EINVAL;
+
+ switch (opt) {
+ case Opt_context:
+ if (have & (1 << Opt_defcontext))
+ goto incompatible;
+ ctx = CONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_fscontext:
+ ctx = FSCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_rootcontext:
+ ctx = ROOTCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_defcontext:
+ if (have & (1 << Opt_context))
+ goto incompatible;
+ ctx = DEFCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_seclabel:
+ return 1;
+
+ default:
+ return -EINVAL;
+ }
+
+copy_context_string:
+ if (opts->num_mnt_opts > 3)
+ return -EINVAL;
+
+ of = krealloc(opts->mnt_opts_flags,
+ (opts->num_mnt_opts + 1) * sizeof(int), GFP_KERNEL);
+ if (!of)
+ return -ENOMEM;
+ of[opts->num_mnt_opts] = 0;
+ opts->mnt_opts_flags = of;
+
+ oo = krealloc(opts->mnt_opts,
+ (opts->num_mnt_opts + 1) * sizeof(char *), GFP_KERNEL);
+ if (!oo)
+ return -ENOMEM;
+ oo[opts->num_mnt_opts] = NULL;
+ opts->mnt_opts = oo;
+
+ opts->mnt_opts[opts->num_mnt_opts] = param->string;
+ opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
+ opts->num_mnt_opts++;
+ param->string = NULL;
+ return 1;
+
+incompatible:
+ return -EINVAL;
+}
+
+/*
+ * Validate the security parameters supplied for a reconfiguration/remount
+ * event.
+ */
+static int selinux_validate_for_sb_reconfigure(struct fs_context *fc)
+{
+ struct super_block *sb = fc->root->d_sb;
+ struct superblock_security_struct *sbsec = sb->s_security;
+ struct security_mnt_opts *opts = fc->security;
+ int rc, i, *flags;
+ char **mount_options;
+
+ if (!(sbsec->flags & SE_SBINITIALIZED))
+ return 0;
+
+ mount_options = opts->mnt_opts;
+ flags = opts->mnt_opts_flags;
+
+ for (i = 0; i < opts->num_mnt_opts; i++) {
+ u32 sid;
+
+ if (flags[i] == SBLABEL_MNT)
+ continue;
+
+ rc = security_context_str_to_sid(&selinux_state, mount_options[i],
+ &sid, GFP_KERNEL);
+ if (rc) {
+ pr_warn("SELinux: security_context_str_to_sid"
+ "(%s) failed for (dev %s, type %s) errno=%d\n",
+ mount_options[i], sb->s_id, sb->s_type->name, rc);
+ goto inval;
+ }
+
+ switch (flags[i]) {
+ case FSCONTEXT_MNT:
+ if (bad_option(sbsec, FSCONTEXT_MNT, sbsec->sid, sid))
+ goto bad_option;
+ break;
+ case CONTEXT_MNT:
+ if (bad_option(sbsec, CONTEXT_MNT, sbsec->mntpoint_sid, sid))
+ goto bad_option;
+ break;
+ case ROOTCONTEXT_MNT: {
+ struct inode_security_struct *root_isec;
+ root_isec = backing_inode_security(sb->s_root);
+
+ if (bad_option(sbsec, ROOTCONTEXT_MNT, root_isec->sid, sid))
+ goto bad_option;
+ break;
+ }
+ case DEFCONTEXT_MNT:
+ if (bad_option(sbsec, DEFCONTEXT_MNT, sbsec->def_sid, sid))
+ goto bad_option;
+ break;
+ default:
+ goto inval;
+ }
+ }
+
+ rc = 0;
+out:
+ return rc;
+
+bad_option:
+ pr_warn("SELinux: unable to change security options "
+ "during remount (dev %s, type=%s)\n",
+ sb->s_id, sb->s_type->name);
+inval:
+ rc = -EINVAL;
+ goto out;
+}
+
+/*
+ * Validate the security context assembled from the option data supplied to
+ * mount.
+ */
+static int selinux_fs_context_validate(struct fs_context *fc)
+{
+ if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)
+ return selinux_validate_for_sb_reconfigure(fc);
+ return 0;
+}
+
+/*
+ * Set the security context on a superblock.
+ */
+static int selinux_sb_get_tree(struct fs_context *fc)
+{
+ const struct cred *cred = current_cred();
+ struct common_audit_data ad;
+ int rc;
+
+ rc = selinux_set_mnt_opts(fc->root->d_sb, fc->security, 0, NULL);
+ if (rc)
+ return rc;
+
+ /* Allow all mounts performed by the kernel */
+ if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+ return 0;
+
+ ad.type = LSM_AUDIT_DATA_DENTRY;
+ ad.u.dentry = fc->root;
+ return superblock_has_perm(cred, fc->root->d_sb, FILESYSTEM__MOUNT, &ad);
+}
+
+static int selinux_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ const struct cred *cred = current_cred();
+
+ return path_has_perm(cred, mountpoint, FILE__MOUNTON);
+}
+
/* inode security operations */

static int selinux_inode_alloc_security(struct inode *inode)
@@ -6918,6 +7194,14 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),

+ LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_param, selinux_fs_context_parse_param),
+ LSM_HOOK_INIT(fs_context_validate, selinux_fs_context_validate),
+ LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
+ LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
+
LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
@@ -7185,6 +7469,8 @@ static __init int selinux_init(void)
else
pr_debug("SELinux: Starting in permissive mode\n");

+ fs_validate_description(&selinux_fs_parameters);
+
return 0;
}

diff --git a/security/selinux/include/security.h b/security/selinux/include/security.h
index 23e762d529fa..7c100283b66f 100644
--- a/security/selinux/include/security.h
+++ b/security/selinux/include/security.h
@@ -50,20 +50,20 @@
/* Super block security struct flags for mount options */
/* BE CAREFUL, these need to be the low order bits for selinux_get_mnt_opts */
#define CONTEXT_MNT 0x01
-#define FSCONTEXT_MNT 0x02
-#define ROOTCONTEXT_MNT 0x04
-#define DEFCONTEXT_MNT 0x08
+#define DEFCONTEXT_MNT 0x02
+#define FSCONTEXT_MNT 0x04
+#define ROOTCONTEXT_MNT 0x08
#define SBLABEL_MNT 0x10
/* Non-mount related flags */
#define SE_SBINITIALIZED 0x0100
#define SE_SBPROC 0x0200
#define SE_SBGENFS 0x0400

-#define CONTEXT_STR "context="
-#define FSCONTEXT_STR "fscontext="
-#define ROOTCONTEXT_STR "rootcontext="
-#define DEFCONTEXT_STR "defcontext="
-#define LABELSUPP_STR "seclabel"
+#define CONTEXT_STR "context"
+#define FSCONTEXT_STR "fscontext"
+#define ROOTCONTEXT_STR "rootcontext"
+#define DEFCONTEXT_STR "defcontext"
+#define SECLABEL_STR "seclabel"

struct netlbl_lsm_secattr;
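
The duplicate and incompatibility checks in selinux_fs_context_parse_param() depend on the reordered *_MNT flags lining up with the Opt_* indices (CONTEXT_MNT == 1 << Opt_context, and so on). The following is a small userspace model of that check, written as a sketch only; model_check_option() is a hypothetical name and the constants are copied from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Option indices and flag bits mirroring the values used in this patch. */
enum { Opt_context, Opt_defcontext, Opt_fscontext, Opt_rootcontext };

#define CONTEXT_MNT	0x01
#define DEFCONTEXT_MNT	0x02
#define FSCONTEXT_MNT	0x04
#define ROOTCONTEXT_MNT	0x08

/*
 * Decide whether option @opt may be added given the flags already stored.
 * Returns 0 if acceptable, -EINVAL for a duplicate or an incompatible pair.
 */
static int model_check_option(const int *stored_flags, int nr_stored, int opt)
{
	unsigned int have = 0;
	int i;

	assert(opt >= Opt_context && opt <= Opt_rootcontext);

	/* The stored values are single-bit masks, so OR them in directly. */
	for (i = 0; i < nr_stored; i++)
		have |= (unsigned int)stored_flags[i];

	if (have & (1u << opt))
		return -EINVAL;		/* same option given twice */
	if (opt == Opt_context && (have & DEFCONTEXT_MNT))
		return -EINVAL;		/* context= excludes defcontext= */
	if (opt == Opt_defcontext && (have & CONTEXT_MNT))
		return -EINVAL;		/* and vice versa */
	return 0;
}
```

The model only holds while each flag value equals 1 << its option index; if the defines in security.h were reordered again, the mask test would need an explicit mapping instead.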



2018-09-21 16:32:38

by David Howells

Subject: [PATCH 11/34] smack: Implement filesystem context security hooks [ver #12]

Implement filesystem context security hooks for the smack LSM.

Question: Should the ->fs_context_parse_source() hook be implemented to
check the labels on any source devices specified?

Signed-off-by: David Howells <[email protected]>
cc: Casey Schaufler <[email protected]>
cc: [email protected]
---

security/smack/smack.h | 21 +--
security/smack/smack_lsm.c | 332 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 338 insertions(+), 15 deletions(-)

diff --git a/security/smack/smack.h b/security/smack/smack.h
index f7db791fb566..891a307a2029 100644
--- a/security/smack/smack.h
+++ b/security/smack/smack.h
@@ -195,21 +195,22 @@ struct smack_known_list_elem {

enum {
Opt_error = -1,
- Opt_fsdefault = 1,
- Opt_fsfloor = 2,
- Opt_fshat = 3,
- Opt_fsroot = 4,
- Opt_fstransmute = 5,
+ Opt_fsdefault = 0,
+ Opt_fsfloor = 1,
+ Opt_fshat = 2,
+ Opt_fsroot = 3,
+ Opt_fstransmute = 4,
+ nr__smack_params
};

/*
* Mount options
*/
-#define SMK_FSDEFAULT "smackfsdef="
-#define SMK_FSFLOOR "smackfsfloor="
-#define SMK_FSHAT "smackfshat="
-#define SMK_FSROOT "smackfsroot="
-#define SMK_FSTRANS "smackfstransmute="
+#define SMK_FSDEFAULT "smackfsdef"
+#define SMK_FSFLOOR "smackfsfloor"
+#define SMK_FSHAT "smackfshat"
+#define SMK_FSROOT "smackfsroot"
+#define SMK_FSTRANS "smackfstransmute"

#define SMACK_DELETE_OPTION "-DELETE"
#define SMACK_CIPSO_OPTION "-CIPSO"
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 03a2f0213d57..da7121d24bce 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -43,6 +43,8 @@
#include <linux/shm.h>
#include <linux/binfmts.h>
#include <linux/parser.h>
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
#include "smack.h"

#define TRANS_TRUE "TRUE"
@@ -60,11 +62,11 @@ static struct kmem_cache *smack_inode_cache;
int smack_enabled;

static const match_table_t smk_mount_tokens = {
- {Opt_fsdefault, SMK_FSDEFAULT "%s"},
- {Opt_fsfloor, SMK_FSFLOOR "%s"},
- {Opt_fshat, SMK_FSHAT "%s"},
- {Opt_fsroot, SMK_FSROOT "%s"},
- {Opt_fstransmute, SMK_FSTRANS "%s"},
+ {Opt_fsdefault, SMK_FSDEFAULT "=%s"},
+ {Opt_fsfloor, SMK_FSFLOOR "=%s"},
+ {Opt_fshat, SMK_FSHAT "=%s"},
+ {Opt_fsroot, SMK_FSROOT "=%s"},
+ {Opt_fstransmute, SMK_FSTRANS "=%s"},
{Opt_error, NULL},
};

@@ -522,6 +524,319 @@ static int smack_syslog(int typefrom_file)
return rc;
}

+/*
+ * Mount context operations
+ */
+
+struct smack_fs_context {
+ union {
+ struct {
+ char *fsdefault;
+ char *fsfloor;
+ char *fshat;
+ char *fsroot;
+ char *fstransmute;
+ };
+ char *ptrs[5];
+
+ };
+ struct superblock_smack *sbsp;
+ struct inode_smack *isp;
+ bool transmute;
+};
+
+/**
+ * smack_fs_context_free - Free the security data from a filesystem context
+ * @fc: The filesystem context to be cleaned up.
+ */
+static void smack_fs_context_free(struct fs_context *fc)
+{
+ struct smack_fs_context *ctx = fc->security;
+ int i;
+
+ if (ctx) {
+ for (i = 0; i < ARRAY_SIZE(ctx->ptrs); i++)
+ kfree(ctx->ptrs[i]);
+ kfree(ctx->isp);
+ kfree(ctx->sbsp);
+ kfree(ctx);
+ fc->security = NULL;
+ }
+}
+
+/**
+ * smack_fs_context_alloc - Allocate security data for a filesystem context
+ * @fc: The filesystem context.
+ * @reference: Reference dentry (automount/reconfigure) or NULL
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference)
+{
+ struct smack_fs_context *ctx;
+ struct superblock_smack *sbsp;
+ struct inode_smack *isp;
+ struct smack_known *skp;
+
+ ctx = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+ if (!ctx)
+ goto nomem;
+ fc->security = ctx;
+
+ sbsp = kzalloc(sizeof(struct superblock_smack), GFP_KERNEL);
+ if (!sbsp)
+ goto nomem_free;
+ ctx->sbsp = sbsp;
+
+ isp = new_inode_smack(NULL);
+ if (!isp)
+ goto nomem_free;
+ ctx->isp = isp;
+
+ if (reference) {
+ if (reference->d_sb->s_security)
+ memcpy(sbsp, reference->d_sb->s_security, sizeof(*sbsp));
+ } else if (!smack_privileged(CAP_MAC_ADMIN)) {
+ /* Unprivileged mounts get root and default from the caller. */
+ skp = smk_of_current();
+ sbsp->smk_root = skp;
+ sbsp->smk_default = skp;
+ } else {
+ sbsp->smk_root = &smack_known_floor;
+ sbsp->smk_default = &smack_known_floor;
+ sbsp->smk_floor = &smack_known_floor;
+ sbsp->smk_hat = &smack_known_hat;
+ /* SMK_SB_INITIALIZED will be zero from kzalloc. */
+ }
+
+ return 0;
+
+nomem_free:
+ smack_fs_context_free(fc);
+nomem:
+ return -ENOMEM;
+}
+
+/**
+ * smack_fs_context_dup - Duplicate the security data on fs_context duplication
+ * @fc: The new filesystem context.
+ * @src_fc: The source filesystem context being duplicated.
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ struct smack_fs_context *dst, *src = src_fc->security;
+ int i;
+
+ dst = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+ if (!dst)
+ goto nomem;
+ fc->security = dst;
+
+ dst->sbsp = kmemdup(src->sbsp, sizeof(struct superblock_smack),
+ GFP_KERNEL);
+ if (!dst->sbsp)
+ goto nomem_free;
+
+ for (i = 0; i < ARRAY_SIZE(dst->ptrs); i++) {
+ if (src->ptrs[i]) {
+ dst->ptrs[i] = kstrdup(src->ptrs[i], GFP_KERNEL);
+ if (!dst->ptrs[i])
+ goto nomem_free;
+ }
+ }
+
+ return 0;
+
+nomem_free:
+ smack_fs_context_free(fc);
+nomem:
+ return -ENOMEM;
+}
+
+static const struct fs_parameter_spec smack_param_specs[nr__smack_params] = {
+ [Opt_fsdefault] = { fs_param_is_string },
+ [Opt_fsfloor] = { fs_param_is_string },
+ [Opt_fshat] = { fs_param_is_string },
+ [Opt_fsroot] = { fs_param_is_string },
+ [Opt_fstransmute] = { fs_param_is_string },
+};
+
+static const char *const smack_param_keys[nr__smack_params] = {
+ [Opt_fsdefault] = SMK_FSDEFAULT,
+ [Opt_fsfloor] = SMK_FSFLOOR,
+ [Opt_fshat] = SMK_FSHAT,
+ [Opt_fsroot] = SMK_FSROOT,
+ [Opt_fstransmute] = SMK_FSTRANS,
+};
+
+static const struct fs_parameter_description smack_fs_parameters = {
+ .name = "smack",
+ .nr_params = nr__smack_params,
+ .keys = smack_param_keys,
+ .specs = smack_param_specs,
+ .no_source = true,
+};
+
+/**
+ * smack_fs_context_parse_param - Parse a single mount parameter
+ * @fc: The new filesystem context being constructed.
+ * @param: The parameter.
+ *
+ * Returns 0 on success, -EPERM without CAP_MAC_ADMIN or another negative error.
+ */
+static int smack_fs_context_parse_param(struct fs_context *fc,
+ struct fs_parameter *param)
+{
+ struct smack_fs_context *ctx = fc->security;
+ struct fs_parse_result result;
+ int opt;
+
+ /* Unprivileged mounts don't get to specify Smack values. */
+ if (!smack_privileged(CAP_MAC_ADMIN))
+ return -EPERM;
+
+ opt = fs_parse(fc, &smack_fs_parameters, param, &result);
+ if (opt < 0)
+ return opt;
+
+ switch (opt) {
+ case Opt_fsdefault:
+ if (ctx->fsdefault)
+ goto error_dup;
+ ctx->fsdefault = param->string;
+ break;
+ case Opt_fsfloor:
+ if (ctx->fsfloor)
+ goto error_dup;
+ ctx->fsfloor = param->string;
+ break;
+ case Opt_fshat:
+ if (ctx->fshat)
+ goto error_dup;
+ ctx->fshat = param->string;
+ break;
+ case Opt_fsroot:
+ if (ctx->fsroot)
+ goto error_dup;
+ ctx->fsroot = param->string;
+ break;
+ case Opt_fstransmute:
+ if (ctx->fstransmute)
+ goto error_dup;
+ ctx->fstransmute = param->string;
+ break;
+ default:
+ return invalf(fc, "Smack: unknown mount option\n");
+ }
+
+ param->string = NULL;
+ return 0;
+
+error_dup:
+ return invalf(fc, "Smack: duplicate mount option\n");
+}
+
+/**
+ * smack_fs_context_validate - Validate the filesystem context security data
+ * @fc: The filesystem context.
+ *
+ * Returns 0 on success or the error from smk_import_entry() on failure.
+ */
+static int smack_fs_context_validate(struct fs_context *fc)
+{
+ struct smack_fs_context *ctx = fc->security;
+ struct superblock_smack *sbsp = ctx->sbsp;
+ struct inode_smack *isp = ctx->isp;
+ struct smack_known *skp;
+
+ if (ctx->fsdefault) {
+ skp = smk_import_entry(ctx->fsdefault, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_default = skp;
+ }
+
+ if (ctx->fsfloor) {
+ skp = smk_import_entry(ctx->fsfloor, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_floor = skp;
+ }
+
+ if (ctx->fshat) {
+ skp = smk_import_entry(ctx->fshat, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_hat = skp;
+ }
+
+ if (ctx->fsroot || ctx->fstransmute) {
+ skp = smk_import_entry(ctx->fstransmute ?: ctx->fsroot, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_root = skp;
+ ctx->transmute = !!ctx->fstransmute;
+ }
+
+ isp->smk_inode = sbsp->smk_root;
+ return 0;
+}
+
+/**
+ * smack_sb_get_tree - Assign the context to a newly created superblock
+ * @fc: The new filesystem context.
+ *
+ * Always returns 0.
+ */
+static int smack_sb_get_tree(struct fs_context *fc)
+{
+ struct smack_fs_context *ctx = fc->security;
+ struct superblock_smack *sbsp = ctx->sbsp;
+ struct dentry *root = fc->root;
+ struct inode *inode = d_backing_inode(root);
+ struct super_block *sb = root->d_sb;
+ struct inode_smack *isp;
+ bool transmute = ctx->transmute;
+
+ if (sb->s_security)
+ return 0;
+
+ if (!smack_privileged(CAP_MAC_ADMIN)) {
+ /*
+ * For a handful of fs types with no user-controlled
+ * backing store it's okay to trust security labels
+ * in the filesystem. The rest are untrusted.
+ */
+ if (fc->user_ns != &init_user_ns &&
+ sb->s_magic != SYSFS_MAGIC && sb->s_magic != TMPFS_MAGIC &&
+ sb->s_magic != RAMFS_MAGIC) {
+ transmute = true;
+ sbsp->smk_flags |= SMK_SB_UNTRUSTED;
+ }
+ }
+
+ sbsp->smk_flags |= SMK_SB_INITIALIZED;
+ sb->s_security = sbsp;
+ ctx->sbsp = NULL;
+
+ /* Initialize the root inode. */
+ isp = inode->i_security;
+ if (isp == NULL) {
+ isp = ctx->isp;
+ ctx->isp = NULL;
+ inode->i_security = isp;
+ } else
+ isp->smk_inode = sbsp->smk_root;
+
+ if (transmute)
+ isp->smk_flags |= SMK_INODE_TRANSMUTE;
+
+ return 0;
+}

/*
* Superblock Hooks.
@@ -4660,6 +4975,13 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(ptrace_traceme, smack_ptrace_traceme),
LSM_HOOK_INIT(syslog, smack_syslog),

+ LSM_HOOK_INIT(fs_context_alloc, smack_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, smack_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, smack_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_param, smack_fs_context_parse_param),
+ LSM_HOOK_INIT(fs_context_validate, smack_fs_context_validate),
+ LSM_HOOK_INIT(sb_get_tree, smack_sb_get_tree),
+
LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),
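
Two idioms in the Smack patch above may be worth spelling out: the union that overlays the named option strings with a ptrs[] array so that they can be freed in a loop, and the transfer of ownership of param->string into the context (the source pointer is cleared so that the string is not freed twice). Below is a userspace sketch of both, assuming a C11 compiler; struct model_ctx, model_set_option() and model_ctx_free() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum { Opt_fsdefault, Opt_fsfloor, Opt_fshat, Opt_fsroot, Opt_fstransmute,
       nr_opts };

struct model_ctx {
	union {
		struct {
			char *fsdefault;
			char *fsfloor;
			char *fshat;
			char *fsroot;
			char *fstransmute;
		};
		char *ptrs[nr_opts];	/* same storage, indexable by Opt_* */
	};
};

_Static_assert(sizeof(struct model_ctx) == nr_opts * sizeof(char *),
	       "named fields must overlay ptrs[] exactly");

/*
 * Store *@value under @opt and take ownership of it.  On success the
 * caller's pointer is cleared so that only the context frees the string.
 */
static int model_set_option(struct model_ctx *ctx, int opt, char **value)
{
	if (opt < 0 || opt >= nr_opts)
		return -EINVAL;
	if (ctx->ptrs[opt])
		return -EINVAL;		/* duplicate; caller keeps *value */
	ctx->ptrs[opt] = *value;
	*value = NULL;
	return 0;
}

/* Release every stored string; free(NULL) is a no-op. */
static void model_ctx_free(struct model_ctx *ctx)
{
	int i;

	for (i = 0; i < nr_opts; i++) {
		free(ctx->ptrs[i]);
		ctx->ptrs[i] = NULL;
	}
}
```

The enum order has to match the field order inside the anonymous struct, otherwise indexing ptrs[] by option number would touch the wrong string.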


2018-09-21 16:32:49

by David Howells

Subject: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

From: Al Viro <[email protected]>

Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
attached by move_mount(2).

If the tree is no longer detached by the time of the final fput() of the
OPEN_TREE_CLONE-opened file, it won't be dissolved. move_mount(2) is
adjusted to handle a detached source.

That gives us equivalents of mount --bind and mount --rbind.

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: David Howells <[email protected]>
---

fs/namespace.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index dd38141b1723..caf5c55ef555 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1785,8 +1785,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
{
namespace_lock();
lock_mount_hash();
- mntget(mnt);
- umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
+ if (!real_mount(mnt)->mnt_ns) {
+ mntget(mnt);
+ umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
+ }
unlock_mount_hash();
namespace_unlock();
}
@@ -2393,6 +2395,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
struct mount *old;
struct mountpoint *mp;
int err;
+ bool attached;

mp = lock_mount(new_path);
err = PTR_ERR(mp);
@@ -2403,10 +2406,19 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
p = real_mount(new_path->mnt);

err = -EINVAL;
- if (!check_mnt(p) || !check_mnt(old))
+ /* The mountpoint must be in our namespace. */
+ if (!check_mnt(p))
+ goto out1;
+ /* The thing moved should be either ours or completely unattached. */
+ if (old->mnt_ns && !check_mnt(old))
goto out1;

- if (!mnt_has_parent(old))
+ attached = mnt_has_parent(old);
+ /*
+ * We need to allow open_tree(OPEN_TREE_CLONE) followed by
+ * move_mount(), but mustn't allow "/" to be moved.
+ */
+ if (old->mnt_ns && !attached)
goto out1;

if (old->mnt.mnt_flags & MNT_LOCKED)
@@ -2421,7 +2433,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
/*
* Don't move a mount residing in a shared parent.
*/
- if (IS_MNT_SHARED(old->mnt_parent))
+ if (attached && IS_MNT_SHARED(old->mnt_parent))
goto out1;
/*
* Don't move a mount tree containing unbindable mounts to a destination
@@ -2435,7 +2447,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
goto out1;

err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
- &parent_path);
+ attached ? &parent_path : NULL);
if (err)
goto out1;

@@ -3121,6 +3133,8 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,

/*
* Move a mount from one place to another.
+ * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
+ * used to copy a mount subtree.
*
* Note the flags value is a combination of MOVE_MOUNT_* flags.
*/


2018-09-21 16:33:04

by David Howells

Subject: [PATCH 13/34] tomoyo: Implement security hooks for the new mount API [ver #12]

Implement the security hook to check the creation of a new mountpoint for
Tomoyo.

As far as I can tell, Tomoyo doesn't make use of the mount data or parse
any mount options, so I haven't implemented any of the fs_context hooks for
it.

Signed-off-by: David Howells <[email protected]>
cc: Tetsuo Handa <[email protected]>
cc: [email protected]
cc: [email protected]
---

security/tomoyo/common.h | 3 +++
security/tomoyo/mount.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
security/tomoyo/tomoyo.c | 15 +++++++++++++++
3 files changed, 63 insertions(+)

diff --git a/security/tomoyo/common.h b/security/tomoyo/common.h
index 539bcdd30bb8..e637ce73f7f9 100644
--- a/security/tomoyo/common.h
+++ b/security/tomoyo/common.h
@@ -971,6 +971,9 @@ int tomoyo_init_request_info(struct tomoyo_request_info *r,
const u8 index);
int tomoyo_mkdev_perm(const u8 operation, const struct path *path,
const unsigned int mode, unsigned int dev);
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+ const struct path *mountpoint,
+ unsigned int mnt_flags);
int tomoyo_mount_permission(const char *dev_name, const struct path *path,
const char *type, unsigned long flags,
void *data_page);
diff --git a/security/tomoyo/mount.c b/security/tomoyo/mount.c
index 7dc7f59b7dde..9ec84ab6f5e1 100644
--- a/security/tomoyo/mount.c
+++ b/security/tomoyo/mount.c
@@ -6,6 +6,7 @@
*/

#include <linux/slab.h>
+#include <linux/fs_context.h>
#include <uapi/linux/mount.h>
#include "common.h"

@@ -236,3 +237,47 @@ int tomoyo_mount_permission(const char *dev_name, const struct path *path,
tomoyo_read_unlock(idx);
return error;
}
+
+/**
+ * tomoyo_mount_permission_fc - Check permission to create a new mount.
+ * @fc: Context describing the object to be mounted.
+ * @mountpoint: The target object to mount on.
+ * @mnt_flags: The MNT_* flags to be set on the mountpoint.
+ *
+ * Check the permission to create a mount of the object described in @fc. Note
+ * that the source object may be a newly created superblock or may be an
+ * existing one picked from the filesystem (bind mount).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+ const struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ struct tomoyo_request_info r;
+ unsigned int ms_flags = 0;
+ int error;
+ int idx;
+
+ if (tomoyo_init_request_info(&r, NULL, TOMOYO_MAC_FILE_MOUNT) ==
+ TOMOYO_CONFIG_DISABLED)
+ return 0;
+
+ /* Convert MNT_* flags to MS_* equivalents. */
+ if (mnt_flags & MNT_NOSUID) ms_flags |= MS_NOSUID;
+ if (mnt_flags & MNT_NODEV) ms_flags |= MS_NODEV;
+ if (mnt_flags & MNT_NOEXEC) ms_flags |= MS_NOEXEC;
+ if (mnt_flags & MNT_NOATIME) ms_flags |= MS_NOATIME;
+ if (mnt_flags & MNT_NODIRATIME) ms_flags |= MS_NODIRATIME;
+ if (mnt_flags & MNT_RELATIME) ms_flags |= MS_RELATIME;
+ if (mnt_flags & MNT_READONLY) ms_flags |= MS_RDONLY;
+
+ idx = tomoyo_read_lock();
+ /* TODO: There may be multiple sources; for the moment, just pick the
+ * first if there is one.
+ */
+ error = tomoyo_mount_acl(&r, fc->source, mountpoint, fc->fs_type->name,
+ ms_flags);
+ tomoyo_read_unlock(idx);
+ return error;
+}
diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index 07f1a0d3dd32..e1cf21e481ce 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -391,6 +391,20 @@ static int tomoyo_path_chroot(const struct path *path)
return tomoyo_path_perm(TOMOYO_TYPE_CHROOT, path, NULL);
}

+/**
+ * tomoyo_sb_mountpoint - Target for security_sb_mountpoint().
+ * @fc: Context describing the object to be mounted.
+ * @mountpoint: The target object to mount on.
+ * @mnt_flags: Mountpoint specific options (as MNT_* flags).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+static int tomoyo_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ return tomoyo_mount_permission_fc(fc, mountpoint, mnt_flags);
+}
+
/**
* tomoyo_sb_mount - Target for security_sb_mount().
*
@@ -521,6 +535,7 @@ static struct security_hook_list tomoyo_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(path_chmod, tomoyo_path_chmod),
LSM_HOOK_INIT(path_chown, tomoyo_path_chown),
LSM_HOOK_INIT(path_chroot, tomoyo_path_chroot),
+ LSM_HOOK_INIT(sb_mountpoint, tomoyo_sb_mountpoint),
LSM_HOOK_INIT(sb_mount, tomoyo_sb_mount),
LSM_HOOK_INIT(sb_umount, tomoyo_sb_umount),
LSM_HOOK_INIT(sb_pivotroot, tomoyo_sb_pivotroot),
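
The MNT_* to MS_* conversion in tomoyo_mount_permission_fc() can also be written as a table, which keeps the pairs in one place. The following is a userspace sketch only: the K_MNT_* constants are local copies of the kernel-internal MNT_* values from include/linux/mount.h, renamed here because userspace headers do not export them, and mnt_to_ms_flags() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>	/* MS_* flags */

/* Local copies of the kernel-internal MNT_* values (illustration only). */
#define K_MNT_NOSUID		0x01
#define K_MNT_NODEV		0x02
#define K_MNT_NOEXEC		0x04
#define K_MNT_NOATIME		0x08
#define K_MNT_NODIRATIME	0x10
#define K_MNT_RELATIME		0x20
#define K_MNT_READONLY		0x40

static const struct {
	unsigned int mnt;	/* per-mount flag */
	unsigned long ms;	/* mount(2) equivalent */
} flag_map[] = {
	{ K_MNT_NOSUID,		MS_NOSUID },
	{ K_MNT_NODEV,		MS_NODEV },
	{ K_MNT_NOEXEC,		MS_NOEXEC },
	{ K_MNT_NOATIME,	MS_NOATIME },
	{ K_MNT_NODIRATIME,	MS_NODIRATIME },
	{ K_MNT_RELATIME,	MS_RELATIME },
	{ K_MNT_READONLY,	MS_RDONLY },
};

/* Translate MNT_* bits to MS_* bits; bits without a mapping are dropped. */
static unsigned long mnt_to_ms_flags(unsigned int mnt_flags)
{
	unsigned long ms_flags = 0;
	size_t i;

	for (i = 0; i < sizeof(flag_map) / sizeof(flag_map[0]); i++)
		if (mnt_flags & flag_map[i].mnt)
			ms_flags |= flag_map[i].ms;
	return ms_flags;
}
```

Whether a table is preferable to the open-coded if statements in the patch is a matter of taste; the behaviour is intended to be the same.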


2018-09-21 16:33:12

by David Howells

Subject: [PATCH 14/34] vfs: Separate changing mount flags from full remount [ver #12]

Separate just the changing of mount flags (MS_REMOUNT|MS_BIND) from full
remount because the mount data will get parsed with the new fs_context
stuff prior to doing a remount - and this causes the syscall to fail under
some circumstances.

To quote Eric's explanation:

[...] mount(..., MS_REMOUNT|MS_BIND, ...) now validates the mount options
string, which breaks systemd unit files with ProtectControlGroups=yes
(e.g. systemd-networkd.service) when systemd does the following to
change a cgroup (v1) mount to read-only:

mount(NULL, "/run/systemd/unit-root/sys/fs/cgroup/systemd", NULL,
MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL)

... when the kernel has CONFIG_CGROUPS=y but no cgroup subsystems
enabled, since in that case the error "cgroup1: Need name or subsystem
set" is hit when the mount options string is empty.

Probably it doesn't make sense to validate the mount options string at
all in the MS_REMOUNT|MS_BIND case, though maybe you had something else
in mind.

This is also worthwhile doing because we will need to add a mount_setattr()
syscall to take over the remount-bind function.

Reported-by: Eric Biggers <[email protected]>
Signed-off-by: David Howells <[email protected]>
---

fs/namespace.c | 146 +++++++++++++++++++++++++++++++------------------
include/linux/mount.h | 2 -
2 files changed, 93 insertions(+), 55 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index caf5c55ef555..059a13e1ae09 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -247,13 +247,9 @@ static struct mount *alloc_vfsmnt(const char *name)
* mnt_want/drop_write() will _keep_ the filesystem
* r/w.
*/
-int __mnt_is_readonly(struct vfsmount *mnt)
+bool __mnt_is_readonly(struct vfsmount *mnt)
{
- if (mnt->mnt_flags & MNT_READONLY)
- return 1;
- if (sb_rdonly(mnt->mnt_sb))
- return 1;
- return 0;
+ return (mnt->mnt_flags & MNT_READONLY) || sb_rdonly(mnt->mnt_sb);
}
EXPORT_SYMBOL_GPL(__mnt_is_readonly);

@@ -509,11 +505,12 @@ static int mnt_make_readonly(struct mount *mnt)
return ret;
}

-static void __mnt_unmake_readonly(struct mount *mnt)
+static int __mnt_unmake_readonly(struct mount *mnt)
{
lock_mount_hash();
mnt->mnt.mnt_flags &= ~MNT_READONLY;
unlock_mount_hash();
+ return 0;
}

int sb_prepare_remount_readonly(struct super_block *sb)
@@ -2294,21 +2291,91 @@ SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
return error;
}

-static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+/*
+ * Don't allow locked mount flags to be cleared.
+ *
+ * No locks need to be held here while testing the various MNT_LOCK
+ * flags because those flags can never be cleared once they are set.
+ */
+static bool can_change_locked_flags(struct mount *mnt, unsigned int mnt_flags)
+{
+ unsigned int fl = mnt->mnt.mnt_flags;
+
+ if ((fl & MNT_LOCK_READONLY) &&
+ !(mnt_flags & MNT_READONLY))
+ return false;
+
+ if ((fl & MNT_LOCK_NODEV) &&
+ !(mnt_flags & MNT_NODEV))
+ return false;
+
+ if ((fl & MNT_LOCK_NOSUID) &&
+ !(mnt_flags & MNT_NOSUID))
+ return false;
+
+ if ((fl & MNT_LOCK_NOEXEC) &&
+ !(mnt_flags & MNT_NOEXEC))
+ return false;
+
+ if ((fl & MNT_LOCK_ATIME) &&
+ ((fl & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK)))
+ return false;
+
+ return true;
+}
+
+static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
{
- int error = 0;
- int readonly_request = 0;
+ bool readonly_request = (mnt_flags & MNT_READONLY);

- if (ms_flags & MS_RDONLY)
- readonly_request = 1;
- if (readonly_request == __mnt_is_readonly(mnt))
+ if (readonly_request == __mnt_is_readonly(&mnt->mnt))
return 0;

if (readonly_request)
- error = mnt_make_readonly(real_mount(mnt));
- else
- __mnt_unmake_readonly(real_mount(mnt));
- return error;
+ return mnt_make_readonly(mnt);
+
+ return __mnt_unmake_readonly(mnt);
+}
+
+/*
+ * Update the user-settable attributes on a mount. The caller must hold
+ * sb->s_umount for writing.
+ */
+static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
+{
+ lock_mount_hash();
+ mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
+ mnt->mnt.mnt_flags = mnt_flags;
+ touch_mnt_namespace(mnt->mnt_ns);
+ unlock_mount_hash();
+}
+
+/*
+ * Handle reconfiguration of the mountpoint only without alteration of the
+ * superblock it refers to. This is triggered by specifying MS_REMOUNT|MS_BIND
+ * to mount(2).
+ */
+static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
+{
+ struct super_block *sb = path->mnt->mnt_sb;
+ struct mount *mnt = real_mount(path->mnt);
+ int ret;
+
+ if (!check_mnt(mnt))
+ return -EINVAL;
+
+ if (path->dentry != mnt->mnt.mnt_root)
+ return -EINVAL;
+
+ if (!can_change_locked_flags(mnt, mnt_flags))
+ return -EPERM;
+
+ down_write(&sb->s_umount);
+ ret = change_mount_ro_state(mnt, mnt_flags);
+ if (ret == 0)
+ set_mount_attributes(mnt, mnt_flags);
+ up_write(&sb->s_umount);
+ return ret;
}

/*
@@ -2329,50 +2396,19 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
if (path->dentry != path->mnt->mnt_root)
return -EINVAL;

- /* Don't allow changing of locked mnt flags.
- *
- * No locks need to be held here while testing the various
- * MNT_LOCK flags because those flags can never be cleared
- * once they are set.
- */
- if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
- !(mnt_flags & MNT_READONLY)) {
- return -EPERM;
- }
- if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
- !(mnt_flags & MNT_NODEV)) {
- return -EPERM;
- }
- if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
- !(mnt_flags & MNT_NOSUID)) {
- return -EPERM;
- }
- if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
- !(mnt_flags & MNT_NOEXEC)) {
+ if (!can_change_locked_flags(mnt, mnt_flags))
return -EPERM;
- }
- if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
- ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) {
- return -EPERM;
- }

err = security_sb_remount(sb, data, data_size);
if (err)
return err;

down_write(&sb->s_umount);
- if (ms_flags & MS_BIND)
- err = change_mount_flags(path->mnt, ms_flags);
- else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
- err = -EPERM;
- else
+ err = -EPERM;
+ if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
err = do_remount_sb(sb, sb_flags, data, data_size, 0);
- if (!err) {
- lock_mount_hash();
- mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
- mnt->mnt.mnt_flags = mnt_flags;
- touch_mnt_namespace(mnt->mnt_ns);
- unlock_mount_hash();
+ if (!err)
+ set_mount_attributes(mnt, mnt_flags);
}
up_write(&sb->s_umount);
return err;
@@ -2888,7 +2924,9 @@ long do_mount(const char *dev_name, const char __user *dir_name,
SB_LAZYTIME |
SB_I_VERSION);

- if (flags & MS_REMOUNT)
+ if ((flags & (MS_REMOUNT | MS_BIND)) == (MS_REMOUNT | MS_BIND))
+ retval = do_reconfigure_mnt(&path, mnt_flags);
+ else if (flags & MS_REMOUNT)
retval = do_remount(&path, flags, sb_flags, mnt_flags,
data_page, data_size);
else if (flags & MS_BIND)
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 8a1031a511c9..c9edd284f0af 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -81,7 +81,7 @@ extern void mnt_drop_write_file(struct file *file);
extern void mntput(struct vfsmount *mnt);
extern struct vfsmount *mntget(struct vfsmount *mnt);
extern struct vfsmount *mnt_clone_internal(const struct path *path);
-extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool __mnt_is_readonly(struct vfsmount *mnt);
extern bool mnt_may_suid(struct vfsmount *mnt);

struct path;


2018-09-21 16:33:37

by David Howells

Subject: [PATCH 16/34] vfs: Remove unused code after filesystem context changes [ver #12]

Remove code that is now unused after the filesystem context changes.

Signed-off-by: David Howells <[email protected]>
---

fs/internal.h | 2 -
fs/super.c | 62 --------------------------
include/linux/lsm_hooks.h | 12 -----
include/linux/security.h | 13 -----
security/security.c | 10 ----
security/selinux/hooks.c | 106 --------------------------------------------
security/smack/smack_lsm.c | 33 --------------
7 files changed, 238 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index fc2da60abbcd..73942ff5aa09 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -116,8 +116,6 @@ extern struct file *alloc_empty_file_noaccount(int, const struct cred *);
*/
extern int reconfigure_super(struct fs_context *);
extern bool trylock_super(struct super_block *sb);
-extern struct dentry *mount_fs(struct file_system_type *,
- int, const char *, void *, size_t);
extern struct super_block *user_get_super(dev_t);

/*
diff --git a/fs/super.c b/fs/super.c
index df8c4cebd000..de43b140bbb1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1478,68 +1478,6 @@ struct dentry *mount_single(struct file_system_type *fs_type,
}
EXPORT_SYMBOL(mount_single);

-struct dentry *
-mount_fs(struct file_system_type *type, int flags, const char *name,
- void *data, size_t data_size)
-{
- struct dentry *root;
- struct super_block *sb;
- char *secdata = NULL;
- int error = -ENOMEM;
-
- if (data && !(type->fs_flags & FS_BINARY_MOUNTDATA)) {
- secdata = alloc_secdata();
- if (!secdata)
- goto out;
-
- error = security_sb_copy_data(data, data_size, secdata);
- if (error)
- goto out_free_secdata;
- }
-
- root = type->mount(type, flags, name, data, data_size);
- if (IS_ERR(root)) {
- error = PTR_ERR(root);
- goto out_free_secdata;
- }
- sb = root->d_sb;
- BUG_ON(!sb);
- WARN_ON(!sb->s_bdi);
-
- /*
- * Write barrier is for super_cache_count(). We place it before setting
- * SB_BORN as the data dependency between the two functions is the
- * superblock structure contents that we just set up, not the SB_BORN
- * flag.
- */
- smp_wmb();
- sb->s_flags |= SB_BORN;
-
- error = security_sb_kern_mount(sb, flags, secdata, data_size);
- if (error)
- goto out_sb;
-
- /*
- * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
- * but s_maxbytes was an unsigned long long for many releases. Throw
- * this warning for a little while to try and catch filesystems that
- * violate this rule.
- */
- WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
- "negative value (%lld)\n", type->name, sb->s_maxbytes);
-
- up_write(&sb->s_umount);
- free_secdata(secdata);
- return root;
-out_sb:
- dput(root);
- deactivate_locked_super(sb);
-out_free_secdata:
- free_secdata(secdata);
-out:
- return ERR_PTR(error);
-}
-
/*
* Setup private BDI for given superblock. It gets automatically cleaned up
* in generic_shutdown_super().
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 7e50bfa1aee0..fff43b0523a9 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -160,13 +160,6 @@
* @orig_data is the size of the original data
* @copy copied data which will be passed to the security module.
* Returns 0 if the copy was successful.
- * @sb_remount:
- * Extracts security system specific mount options and verifies no changes
- * are being made to those options.
- * @sb superblock being remounted
- * @data contains the filesystem-specific data.
- * @data_size contains the size of the data.
- * Return 0 if permission is granted.
* @sb_umount:
* Check permission before the @mnt file system is unmounted.
* @mnt contains the mounted file system.
@@ -1522,9 +1515,6 @@ union security_list_options {
int (*sb_alloc_security)(struct super_block *sb);
void (*sb_free_security)(struct super_block *sb);
int (*sb_copy_data)(char *orig, size_t orig_size, char *copy);
- int (*sb_remount)(struct super_block *sb, void *data, size_t data_size);
- int (*sb_kern_mount)(struct super_block *sb, int flags,
- void *data, size_t data_size);
int (*sb_show_options)(struct seq_file *m, struct super_block *sb);
int (*sb_statfs)(struct dentry *dentry);
int (*sb_mount)(const char *dev_name, const struct path *path,
@@ -1872,8 +1862,6 @@ struct security_hook_heads {
struct hlist_head sb_alloc_security;
struct hlist_head sb_free_security;
struct hlist_head sb_copy_data;
- struct hlist_head sb_remount;
- struct hlist_head sb_kern_mount;
struct hlist_head sb_show_options;
struct hlist_head sb_statfs;
struct hlist_head sb_mount;
diff --git a/include/linux/security.h b/include/linux/security.h
index bae191a96c73..11157798d4f8 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -261,8 +261,6 @@ int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
int security_sb_copy_data(char *orig, size_t orig_size, char *copy);
-int security_sb_remount(struct super_block *sb, void *data, size_t data_size);
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size);
int security_sb_show_options(struct seq_file *m, struct super_block *sb);
int security_sb_statfs(struct dentry *dentry);
int security_sb_mount(const char *dev_name, const struct path *path,
@@ -608,17 +606,6 @@ static inline int security_sb_copy_data(char *orig, size_t orig_size, char *copy
return 0;
}

-static inline int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
-{
- return 0;
-}
-
-static inline int security_sb_kern_mount(struct super_block *sb, int flags,
- void *data, size_t data_size)
-{
- return 0;
-}
-
static inline int security_sb_show_options(struct seq_file *m,
struct super_block *sb)
{
diff --git a/security/security.c b/security/security.c
index 64304d20aae1..d902810f2749 100644
--- a/security/security.c
+++ b/security/security.c
@@ -420,16 +420,6 @@ int security_sb_copy_data(char *orig, size_t data_size, char *copy)
}
EXPORT_SYMBOL(security_sb_copy_data);

-int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
-{
- return call_int_hook(sb_remount, 0, sb, data, data_size);
-}
-
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
-{
- return call_int_hook(sb_kern_mount, 0, sb, flags, data, data_size);
-}
-
int security_sb_show_options(struct seq_file *m, struct super_block *sb)
{
return call_int_hook(sb_show_options, 0, m, sb);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 5f2af9dd44fa..99c2c40c5d7a 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2832,110 +2832,6 @@ static int selinux_sb_copy_data(char *orig, size_t data_size, char *copy)
return rc;
}

-static int selinux_sb_remount(struct super_block *sb, void *data, size_t data_size)
-{
- int rc, i, *flags;
- struct security_mnt_opts opts;
- char *secdata, **mount_options;
- struct superblock_security_struct *sbsec = sb->s_security;
-
- if (!(sbsec->flags & SE_SBINITIALIZED))
- return 0;
-
- if (!data)
- return 0;
-
- if (sb->s_type->fs_flags & FS_BINARY_MOUNTDATA)
- return 0;
-
- security_init_mnt_opts(&opts);
- secdata = alloc_secdata();
- if (!secdata)
- return -ENOMEM;
- rc = selinux_sb_copy_data(data, data_size, secdata);
- if (rc)
- goto out_free_secdata;
-
- rc = selinux_parse_opts_str(secdata, &opts);
- if (rc)
- goto out_free_secdata;
-
- mount_options = opts.mnt_opts;
- flags = opts.mnt_opts_flags;
-
- for (i = 0; i < opts.num_mnt_opts; i++) {
- u32 sid;
-
- if (flags[i] == SBLABEL_MNT)
- continue;
- rc = security_context_str_to_sid(&selinux_state,
- mount_options[i], &sid,
- GFP_KERNEL);
- if (rc) {
- pr_warn("SELinux: security_context_str_to_sid"
- "(%s) failed for (dev %s, type %s) errno=%d\n",
- mount_options[i], sb->s_id, sb->s_type->name, rc);
- goto out_free_opts;
- }
- rc = -EINVAL;
- switch (flags[i]) {
- case FSCONTEXT_MNT:
- if (bad_option(sbsec, FSCONTEXT_MNT, sbsec->sid, sid))
- goto out_bad_option;
- break;
- case CONTEXT_MNT:
- if (bad_option(sbsec, CONTEXT_MNT, sbsec->mntpoint_sid, sid))
- goto out_bad_option;
- break;
- case ROOTCONTEXT_MNT: {
- struct inode_security_struct *root_isec;
- root_isec = backing_inode_security(sb->s_root);
-
- if (bad_option(sbsec, ROOTCONTEXT_MNT, root_isec->sid, sid))
- goto out_bad_option;
- break;
- }
- case DEFCONTEXT_MNT:
- if (bad_option(sbsec, DEFCONTEXT_MNT, sbsec->def_sid, sid))
- goto out_bad_option;
- break;
- default:
- goto out_free_opts;
- }
- }
-
- rc = 0;
-out_free_opts:
- security_free_mnt_opts(&opts);
-out_free_secdata:
- free_secdata(secdata);
- return rc;
-out_bad_option:
- pr_warn("SELinux: unable to change security options "
- "during remount (dev %s, type=%s)\n", sb->s_id,
- sb->s_type->name);
- goto out_free_opts;
-}
-
-static int selinux_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
-{
- const struct cred *cred = current_cred();
- struct common_audit_data ad;
- int rc;
-
- rc = superblock_doinit(sb, data);
- if (rc)
- return rc;
-
- /* Allow all mounts performed by the kernel */
- if (flags & MS_KERNMOUNT)
- return 0;
-
- ad.type = LSM_AUDIT_DATA_DENTRY;
- ad.u.dentry = sb->s_root;
- return superblock_has_perm(cred, sb, FILESYSTEM__MOUNT, &ad);
-}
-
static int selinux_sb_statfs(struct dentry *dentry)
{
const struct cred *cred = current_cred();
@@ -7205,8 +7101,6 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
- LSM_HOOK_INIT(sb_remount, selinux_sb_remount),
- LSM_HOOK_INIT(sb_kern_mount, selinux_sb_kern_mount),
LSM_HOOK_INIT(sb_show_options, selinux_sb_show_options),
LSM_HOOK_INIT(sb_statfs, selinux_sb_statfs),
LSM_HOOK_INIT(sb_mount, selinux_mount),
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index da7121d24bce..1f51a8ac11d7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -1164,38 +1164,6 @@ static int smack_set_mnt_opts(struct super_block *sb,
return 0;
}

-/**
- * smack_sb_kern_mount - Smack specific mount processing
- * @sb: the file system superblock
- * @flags: the mount flags
- * @data: the smack mount options
- *
- * Returns 0 on success, an error code on failure
- */
-static int smack_sb_kern_mount(struct super_block *sb, int flags,
- void *data, size_t data_size)
-{
- int rc = 0;
- char *options = data;
- struct security_mnt_opts opts;
-
- security_init_mnt_opts(&opts);
-
- if (!options)
- goto out;
-
- rc = smack_parse_opts_str(options, &opts);
- if (rc)
- goto out_err;
-
-out:
- rc = smack_set_mnt_opts(sb, &opts, 0, NULL);
-
-out_err:
- security_free_mnt_opts(&opts);
- return rc;
-}
-
/**
* smack_sb_statfs - Smack check on statfs
* @dentry: identifies the file system in question
@@ -4985,7 +4953,6 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),
- LSM_HOOK_INIT(sb_kern_mount, smack_sb_kern_mount),
LSM_HOOK_INIT(sb_statfs, smack_sb_statfs),
LSM_HOOK_INIT(sb_set_mnt_opts, smack_set_mnt_opts),
LSM_HOOK_INIT(sb_parse_opts_str, smack_parse_opts_str),


2018-09-21 16:33:45

by David Howells

Subject: [PATCH 17/34] procfs: Move proc_fill_super() to fs/proc/root.c [ver #12]

Move proc_fill_super() to fs/proc/root.c as that's where the other
superblock stuff is.

Signed-off-by: David Howells <[email protected]>
---

fs/proc/inode.c | 49 +------------------------------------------------
fs/proc/internal.h | 4 +---
fs/proc/root.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 49 insertions(+), 52 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 8f121c476c07..9fdda2946554 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -24,7 +24,6 @@
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/mount.h>
-#include <linux/magic.h>

#include <linux/uaccess.h>

@@ -124,7 +123,7 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
return 0;
}

-static const struct super_operations proc_sops = {
+const struct super_operations proc_sops = {
.alloc_inode = proc_alloc_inode,
.destroy_inode = proc_destroy_inode,
.drop_inode = generic_delete_inode,
@@ -490,49 +489,3 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
pde_put(de);
return inode;
}
-
-int proc_fill_super(struct super_block *s, void *data, size_t data_size,
- int silent)
-{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
- struct inode *root_inode;
- int ret;
-
- if (!proc_parse_options(data, ns))
- return -EINVAL;
-
- /* User space would break if executables or devices appear on proc */
- s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
- s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
- s->s_blocksize = 1024;
- s->s_blocksize_bits = 10;
- s->s_magic = PROC_SUPER_MAGIC;
- s->s_op = &proc_sops;
- s->s_time_gran = 1;
-
- /*
- * procfs isn't actually a stacking filesystem; however, there is
- * too much magic going on inside it to permit stacking things on
- * top of it
- */
- s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
- pde_get(&proc_root);
- root_inode = proc_get_inode(s, &proc_root);
- if (!root_inode) {
- pr_err("proc_fill_super: get root inode failed\n");
- return -ENOMEM;
- }
-
- s->s_root = d_make_root(root_inode);
- if (!s->s_root) {
- pr_err("proc_fill_super: allocate dentry failed\n");
- return -ENOMEM;
- }
-
- ret = proc_setup_self(s);
- if (ret) {
- return ret;
- }
- return proc_setup_thread_self(s);
-}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 3b88db52d206..912cb2cd29dd 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -205,13 +205,12 @@ struct pde_opener {
struct completion *c;
} __randomize_layout;
extern const struct inode_operations proc_link_inode_operations;
-
extern const struct inode_operations proc_pid_link_inode_operations;
+extern const struct super_operations proc_sops;

void proc_init_kmemcache(void);
void set_proc_pid_nlink(void);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
-extern int proc_fill_super(struct super_block *, void *, size_t, int);
extern void proc_entry_rundown(struct proc_dir_entry *);

/*
@@ -269,7 +268,6 @@ static inline void proc_tty_init(void) {}
* root.c
*/
extern struct proc_dir_entry proc_root;
-extern int proc_parse_options(char *options, struct pid_namespace *pid);

extern void proc_self_init(void);
extern int proc_remount(struct super_block *, int *, char *, size_t);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 28fadb0c51ab..15da85cefd3f 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,6 +23,7 @@
#include <linux/pid_namespace.h>
#include <linux/parser.h>
#include <linux/cred.h>
+#include <linux/magic.h>

#include "internal.h"

@@ -36,7 +37,7 @@ static const match_table_t tokens = {
{Opt_err, NULL},
};

-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_options(char *options, struct pid_namespace *pid)
{
char *p;
substring_t args[MAX_OPT_ARGS];
@@ -78,6 +79,51 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
return 1;
}

+static int proc_fill_super(struct super_block *s, void *data, size_t data_size, int silent)
+{
+ struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct inode *root_inode;
+ int ret;
+
+ if (!proc_parse_options(data, ns))
+ return -EINVAL;
+
+ /* User space would break if executables or devices appear on proc */
+ s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
+ s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
+ s->s_blocksize = 1024;
+ s->s_blocksize_bits = 10;
+ s->s_magic = PROC_SUPER_MAGIC;
+ s->s_op = &proc_sops;
+ s->s_time_gran = 1;
+
+ /*
+ * procfs isn't actually a stacking filesystem; however, there is
+ * too much magic going on inside it to permit stacking things on
+ * top of it
+ */
+ s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
+
+ pde_get(&proc_root);
+ root_inode = proc_get_inode(s, &proc_root);
+ if (!root_inode) {
+ pr_err("proc_fill_super: get root inode failed\n");
+ return -ENOMEM;
+ }
+
+ s->s_root = d_make_root(root_inode);
+ if (!s->s_root) {
+ pr_err("proc_fill_super: allocate dentry failed\n");
+ return -ENOMEM;
+ }
+
+ ret = proc_setup_self(s);
+ if (ret) {
+ return ret;
+ }
+ return proc_setup_thread_self(s);
+}
+
int proc_remount(struct super_block *sb, int *flags,
char *data, size_t data_size)
{


2018-09-21 16:33:57

by David Howells

Subject: [PATCH 19/34] ipc: Convert mqueue fs to fs_context [ver #12]

Convert the mqueue filesystem to use the filesystem context stuff.

Notes:

(1) The relevant ipc namespace is selected when the context is
initialised (and it defaults to the current task's ipc namespace).
The caller can override this before calling vfs_get_tree().

(2) Rather than simply calling kern_mount_data(), mq_init_ns() and
mq_internal_mount() create a context, adjust it and then do the rest
of the mount procedure.

(3) The lazy mqueue mounting on creation of a new namespace is retained
from a previous patch, but the avoidance of sget() if no superblock
yet exists is reverted and the superblock is again keyed on the
namespace pointer.

Yes, there was a performance gain in not searching the superblock
hash, but that cost is only paid once per ipc namespace - and only if
someone uses mqueue within that namespace, so I'm not sure it's worth
it, especially as calling sget() allows avoidance of recursion.
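
The namespace handling in note (1) can be sketched in user space roughly
as follows. Everything here is an assumption made for illustration: the
sk_ types and helpers are hypothetical, and a plain integer counter stands
in for the real ipc namespace refcount.

```c
#include <stdlib.h>

/* Toy reference-counted namespace standing in for struct ipc_namespace. */
struct sk_ns {
	int refs;
};

struct sk_ns *sk_get_ns(struct sk_ns *ns)
{
	ns->refs++;
	return ns;
}

void sk_put_ns(struct sk_ns *ns)
{
	ns->refs--;
}

/* Toy context standing in for struct mqueue_fs_context. */
struct sk_ctx {
	struct sk_ns *ns;
};

/* Context creation pins the caller's current namespace by default. */
struct sk_ctx *sk_ctx_new(struct sk_ns *current_ns)
{
	struct sk_ctx *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return NULL;
	ctx->ns = sk_get_ns(current_ns);
	return ctx;
}

/*
 * Override the namespace before the superblock is obtained.  The new
 * reference is taken before the old one is dropped so that passing the
 * same namespace again is harmless.
 */
void sk_ctx_set_ns(struct sk_ctx *ctx, struct sk_ns *ns)
{
	struct sk_ns *old = ctx->ns;

	ctx->ns = sk_get_ns(ns);
	sk_put_ns(old);
}

/* Freeing the context drops whichever namespace it still pins. */
void sk_ctx_free(struct sk_ctx *ctx)
{
	if (!ctx)
		return;
	if (ctx->ns)
		sk_put_ns(ctx->ns);
	free(ctx);
}
```

In this model a kernel-internal mount would call sk_ctx_new(), then
sk_ctx_set_ns() with the namespace being set up, and only then go on to
create the superblock.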

Signed-off-by: David Howells <[email protected]>
---

ipc/mqueue.c | 107 ++++++++++++++++++++++++++++++++++++++++++++-----------
ipc/namespace.c | 2 +
2 files changed, 86 insertions(+), 23 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 4671d215cb84..869687d586a2 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -18,6 +18,7 @@
#include <linux/pagemap.h>
#include <linux/file.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/namei.h>
#include <linux/sysctl.h>
#include <linux/poll.h>
@@ -42,6 +43,10 @@
#include <net/sock.h>
#include "util.h"

+struct mqueue_fs_context {
+ struct ipc_namespace *ipc_ns;
+};
+
#define MQUEUE_MAGIC 0x19800202
#define DIRENT_SIZE 20
#define FILENT_SIZE 80
@@ -87,9 +92,11 @@ struct mqueue_inode_info {
unsigned long qsize; /* size of queue in memory (sum of all msgs) */
};

+static struct file_system_type mqueue_fs_type;
static const struct inode_operations mqueue_dir_inode_operations;
static const struct file_operations mqueue_file_operations;
static const struct super_operations mqueue_super_ops;
+static const struct fs_context_operations mqueue_fs_context_ops;
static void remove_notification(struct mqueue_inode_info *info);

static struct kmem_cache *mqueue_inode_cachep;
@@ -322,7 +329,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
return ERR_PTR(ret);
}

-static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
+static int mqueue_fill_super(struct super_block *sb, struct fs_context *fc)
{
struct inode *inode;
struct ipc_namespace *ns = sb->s_fs_info;
@@ -343,19 +350,70 @@ static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_siz
return 0;
}

-static struct dentry *mqueue_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, size_t data_size)
+static int mqueue_get_tree(struct fs_context *fc)
{
- struct ipc_namespace *ns;
- if (flags & SB_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = current->nsproxy->ipc_ns;
+ struct mqueue_fs_context *ctx = fc->fs_private;
+
+ fc->s_fs_info = ctx->ipc_ns;
+ return vfs_get_super(fc, vfs_get_keyed_super, mqueue_fill_super);
+}
+
+static void mqueue_fs_context_free(struct fs_context *fc)
+{
+ struct mqueue_fs_context *ctx = fc->fs_private;
+
+ if (ctx->ipc_ns)
+ put_ipc_ns(ctx->ipc_ns);
+ kfree(ctx);
+}
+
+static int mqueue_init_fs_context(struct fs_context *fc,
+ struct dentry *reference)
+{
+ struct mqueue_fs_context *ctx;
+
+ ctx = kzalloc(sizeof(struct mqueue_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->ipc_ns = get_ipc_ns(current->nsproxy->ipc_ns);
+ fc->fs_private = ctx;
+ fc->ops = &mqueue_fs_context_ops;
+ return 0;
+}
+
+static struct vfsmount *mq_create_mount(struct ipc_namespace *ns)
+{
+ struct mqueue_fs_context *ctx;
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&mqueue_fs_type, NULL, 0, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ ctx = fc->fs_private;
+ put_ipc_ns(ctx->ipc_ns);
+ ctx->ipc_ns = get_ipc_ns(ns);
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc, 0);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto err_fc;
}
- return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
- mqueue_fill_super);
+
+ put_fs_context(fc);
+ return mnt;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
}

static void init_once(void *foo)
@@ -1523,15 +1581,22 @@ static const struct super_operations mqueue_super_ops = {
.statfs = simple_statfs,
};

+static const struct fs_context_operations mqueue_fs_context_ops = {
+ .free = mqueue_fs_context_free,
+ .get_tree = mqueue_get_tree,
+};
+
static struct file_system_type mqueue_fs_type = {
- .name = "mqueue",
- .mount = mqueue_mount,
- .kill_sb = kill_litter_super,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "mqueue",
+ .init_fs_context = mqueue_init_fs_context,
+ .kill_sb = kill_litter_super,
+ .fs_flags = FS_USERNS_MOUNT,
};

int mq_init_ns(struct ipc_namespace *ns)
{
+ struct vfsmount *m;
+
ns->mq_queues_count = 0;
ns->mq_queues_max = DFLT_QUEUESMAX;
ns->mq_msg_max = DFLT_MSGMAX;
@@ -1539,12 +1604,10 @@ int mq_init_ns(struct ipc_namespace *ns)
ns->mq_msg_default = DFLT_MSG;
ns->mq_msgsize_default = DFLT_MSGSIZE;

- ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns, 0);
- if (IS_ERR(ns->mq_mnt)) {
- int err = PTR_ERR(ns->mq_mnt);
- ns->mq_mnt = NULL;
- return err;
- }
+ m = mq_create_mount(ns);
+ if (IS_ERR(m))
+ return PTR_ERR(m);
+ ns->mq_mnt = m;
return 0;
}

diff --git a/ipc/namespace.c b/ipc/namespace.c
index 21607791d62c..b3ca1476ca51 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
goto fail;

err = -ENOMEM;
- ns = kmalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+ ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
if (ns == NULL)
goto fail_dec;



2018-09-21 16:34:06

by David Howells

Subject: [PATCH 18/34] proc: Add fs_context support to procfs [ver #12]

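Add fs_context support to procfs. Mount options are now parsed one
parameter at a time into a proc_fs_context by proc_parse_param() and are
applied to the pid namespace by proc_apply_options() when the superblock
is filled in.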
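
The option handling in this patch records which options were supplied in
a bitmask and later applies only those to the pid namespace. A minimal
user-space sketch of that pattern is given below. The sk_ names are
hypothetical, key-string matching is left out, and, unlike the patch, the
sketch validates hidepid before storing it.

```c
#include <errno.h>

enum sk_param { SK_OPT_GID, SK_OPT_HIDEPID };

/* Parsed-but-not-yet-applied options; stands in for struct proc_fs_context. */
struct sk_proc_ctx {
	unsigned long mask;	/* bit n is set if option n was supplied */
	unsigned int gid;
	unsigned int hidepid;
};

/* The object the options finally land on; stands in for the pid namespace. */
struct sk_pid_ns {
	unsigned int gid;
	unsigned int hide_pid;
};

/* Record one already-tokenised option.  Returns 0 or -EINVAL. */
int sk_parse_param(struct sk_proc_ctx *ctx, enum sk_param opt,
		   unsigned int value)
{
	switch (opt) {
	case SK_OPT_GID:
		ctx->gid = value;
		break;
	case SK_OPT_HIDEPID:
		if (value > 2)	/* hidepid must be between 0 and 2 */
			return -EINVAL;
		ctx->hidepid = value;
		break;
	default:
		return -EINVAL;
	}

	ctx->mask |= 1UL << opt;
	return 0;
}

/* Copy across only the options that were actually given. */
void sk_apply_options(const struct sk_proc_ctx *ctx, struct sk_pid_ns *ns)
{
	if (ctx->mask & (1UL << SK_OPT_GID))
		ns->gid = ctx->gid;
	if (ctx->mask & (1UL << SK_OPT_HIDEPID))
		ns->hide_pid = ctx->hidepid;
}
```

Keeping the mask means that an option which was not mentioned leaves the
existing namespace setting alone rather than resetting it to zero.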

Signed-off-by: David Howells <[email protected]>
---

fs/proc/inode.c | 1
fs/proc/internal.h | 1
fs/proc/root.c | 220 ++++++++++++++++++++++++++++++++++++----------------
3 files changed, 151 insertions(+), 71 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 9fdda2946554..4e38156e2531 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -129,7 +129,6 @@ const struct super_operations proc_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = proc_evict_inode,
.statfs = simple_statfs,
- .remount_fs = proc_remount,
.show_options = proc_show_options,
};

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 912cb2cd29dd..40f905143d39 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -270,7 +270,6 @@ static inline void proc_tty_init(void) {}
extern struct proc_dir_entry proc_root;

extern void proc_self_init(void);
-extern int proc_remount(struct super_block *, int *, char *, size_t);

/*
* task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 15da85cefd3f..8912a8b57ac3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -19,74 +19,97 @@
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/user_namespace.h>
+#include <linux/fs_context.h>
#include <linux/mount.h>
#include <linux/pid_namespace.h>
-#include <linux/parser.h>
+#include <linux/fs_parser.h>
#include <linux/cred.h>
#include <linux/magic.h>
+#include <linux/slab.h>

#include "internal.h"

-enum {
- Opt_gid, Opt_hidepid, Opt_err,
+struct proc_fs_context {
+ struct pid_namespace *pid_ns;
+ unsigned long mask;
+ int hidepid;
+ int gid;
};

-static const match_table_t tokens = {
- {Opt_hidepid, "hidepid=%u"},
- {Opt_gid, "gid=%u"},
- {Opt_err, NULL},
+enum proc_param {
+ Opt_gid,
+ Opt_hidepid,
+ nr__proc_params
};

-static int proc_parse_options(char *options, struct pid_namespace *pid)
+static const struct fs_parameter_spec proc_param_specs[nr__proc_params] = {
+ [Opt_gid] = { fs_param_is_u32 },
+ [Opt_hidepid] = { fs_param_is_u32 },
+};
+
+static const char *const proc_param_keys[nr__proc_params] = {
+ [Opt_gid] = "gid",
+ [Opt_hidepid] = "hidepid",
+};
+
+static const struct fs_parameter_description proc_fs_parameters = {
+ .name = "proc",
+ .nr_params = nr__proc_params,
+ .keys = proc_param_keys,
+ .specs = proc_param_specs,
+ .no_source = true,
+};
+
+static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
{
- char *p;
- substring_t args[MAX_OPT_ARGS];
- int option;
-
- if (!options)
- return 1;
-
- while ((p = strsep(&options, ",")) != NULL) {
- int token;
- if (!*p)
- continue;
-
- args[0].to = args[0].from = NULL;
- token = match_token(p, tokens, args);
- switch (token) {
- case Opt_gid:
- if (match_int(&args[0], &option))
- return 0;
- pid->pid_gid = make_kgid(current_user_ns(), option);
- break;
- case Opt_hidepid:
- if (match_int(&args[0], &option))
- return 0;
- if (option < HIDEPID_OFF ||
- option > HIDEPID_INVISIBLE) {
- pr_err("proc: hidepid value must be between 0 and 2.\n");
- return 0;
- }
- pid->hide_pid = option;
- break;
- default:
- pr_err("proc: unrecognized mount option \"%s\" "
- "or missing value\n", p);
- return 0;
- }
+ struct proc_fs_context *ctx = fc->fs_private;
+ struct fs_parse_result result;
+ int opt;
+
+ opt = fs_parse(fc, &proc_fs_parameters, param, &result);
+ if (opt < 0)
+ return opt;
+
+ switch (opt) {
+ case Opt_gid:
+ ctx->gid = result.uint_32;
+ break;
+
+ case Opt_hidepid:
+ ctx->hidepid = result.uint_32;
+ if (ctx->hidepid < HIDEPID_OFF ||
+ ctx->hidepid > HIDEPID_INVISIBLE)
+ return invalf(fc, "proc: hidepid value must be between 0 and 2.\n");
+ break;
+
+ default:
+ return -EINVAL;
}

- return 1;
+ ctx->mask |= 1 << result.key;
+ return 0;
}

-static int proc_fill_super(struct super_block *s, void *data, size_t data_size, int silent)
+static void proc_apply_options(struct super_block *s,
+ struct fs_context *fc,
+ struct pid_namespace *pid_ns,
+ struct user_namespace *user_ns)
{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct proc_fs_context *ctx = fc->fs_private;
+
+ if (ctx->mask & (1 << Opt_gid))
+ pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
+ if (ctx->mask & (1 << Opt_hidepid))
+ pid_ns->hide_pid = ctx->hidepid;
+}
+
+static int proc_fill_super(struct super_block *s, struct fs_context *fc)
+{
+ struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
struct inode *root_inode;
int ret;

- if (!proc_parse_options(data, ns))
- return -EINVAL;
+ proc_apply_options(s, fc, pid_ns, current_user_ns());

/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -103,7 +126,7 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
* top of it
*/
s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
+
pde_get(&proc_root);
root_inode = proc_get_inode(s, &proc_root);
if (!root_inode) {
@@ -124,30 +147,61 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
return proc_setup_thread_self(s);
}

-int proc_remount(struct super_block *sb, int *flags,
- char *data, size_t data_size)
+static int proc_reconfigure(struct fs_context *fc)
{
+ struct super_block *sb = fc->root->d_sb;
struct pid_namespace *pid = sb->s_fs_info;

sync_filesystem(sb);
- return !proc_parse_options(data, pid);
+
+ proc_apply_options(sb, fc, pid, current_user_ns());
+ return 0;
}

-static struct dentry *proc_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, size_t data_size)
+static int proc_get_tree(struct fs_context *fc)
{
- struct pid_namespace *ns;
+ struct proc_fs_context *ctx = fc->fs_private;

- if (flags & SB_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = task_active_pid_ns(current);
+ fc->s_fs_info = ctx->pid_ns;
+ return vfs_get_super(fc, vfs_get_keyed_super, proc_fill_super);
+}
+
+static void proc_fs_context_free(struct fs_context *fc)
+{
+ struct proc_fs_context *ctx = fc->fs_private;
+
+ if (ctx->pid_ns)
+ put_pid_ns(ctx->pid_ns);
+ kfree(ctx);
+}
+
+static const struct fs_context_operations proc_fs_context_ops = {
+ .free = proc_fs_context_free,
+ .parse_param = proc_parse_param,
+ .get_tree = proc_get_tree,
+ .reconfigure = proc_reconfigure,
+};
+
+static int proc_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+ struct proc_fs_context *ctx;
+
+ switch (fc->purpose) {
+ case FS_CONTEXT_FOR_UMOUNT:
+ case FS_CONTEXT_FOR_EMERGENCY_RO:
+ return -EOPNOTSUPP;
+ default:
+ break;
}

- return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
- proc_fill_super);
+ ctx = kzalloc(sizeof(struct proc_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ fc->fs_private = ctx;
+ fc->ops = &proc_fs_context_ops;
+ return 0;
}

static void proc_kill_sb(struct super_block *sb)
@@ -164,10 +218,11 @@ static void proc_kill_sb(struct super_block *sb)
}

static struct file_system_type proc_fs_type = {
- .name = "proc",
- .mount = proc_mount,
- .kill_sb = proc_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "proc",
+ .init_fs_context = proc_init_fs_context,
+ .parameters = &proc_fs_parameters,
+ .kill_sb = proc_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

void __init proc_root_init(void)
@@ -205,7 +260,7 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr
{
if (!proc_pid_lookup(dir, dentry, flags))
return NULL;
-
+
return proc_lookup(dir, dentry, flags);
}

@@ -258,9 +313,36 @@ struct proc_dir_entry proc_root = {

int pid_ns_prepare_proc(struct pid_namespace *ns)
{
+ struct proc_fs_context *ctx;
+ struct fs_context *fc;
struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&proc_fs_type, NULL, 0, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ if (fc->user_ns != ns->user_ns) {
+ put_user_ns(fc->user_ns);
+ fc->user_ns = get_user_ns(ns->user_ns);
+ }
+
+ ctx = fc->fs_private;
+ if (ctx->pid_ns != ns) {
+ put_pid_ns(ctx->pid_ns);
+ get_pid_ns(ns);
+ ctx->pid_ns = ns;
+ }
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0) {
+ put_fs_context(fc);
+ return ret;
+ }

- mnt = kern_mount_data(&proc_fs_type, ns, 0);
+ mnt = vfs_create_mount(fc, 0);
+ put_fs_context(fc);
if (IS_ERR(mnt))
return PTR_ERR(mnt);



2018-09-21 16:34:14

by David Howells

Subject: [PATCH 12/34] apparmor: Implement security hooks for the new mount API [ver #12]

Implement hooks to check the creation of new mountpoints for AppArmor.

Unfortunately, the DFA evaluation takes the option data last, after the
details of the mountpoint, so the mount options have to be cached in the
fs_context by the parameter hooks until the new mountpoint hook is reached.
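
The caching described above can be modelled in userspace roughly as follows.
This is only a sketch under stated assumptions: struct opt_cache,
opt_cache_add() and opt_cache_free() are hypothetical names invented for
illustration, realloc() stands in for krealloc(), and the separator and
"key=value" layout simply mirror what the hook in this patch appears to build.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-context LSM state (not the kernel struct). */
struct opt_cache {
	char *saved;	/* "key key2=val2 ...", always NUL-terminated */
	size_t size;	/* length of saved, excluding the trailing NUL */
};

/*
 * Append "key" or "key=value" to the cache, separated by a single space.
 * Returns 0 on success, -1 on allocation failure (cache left unchanged).
 */
static int opt_cache_add(struct opt_cache *c, const char *key, const char *value)
{
	size_t k_len = strlen(key);
	size_t v_len = value ? strlen(value) : 0;
	size_t sep = c->size > 0 ? 1 : 0;
	size_t add = sep + k_len + (value ? 1 + v_len : 0);
	char *p, *q;

	p = realloc(c->saved, c->size + add + 1);
	if (!p)
		return -1;

	q = p + c->size;		/* points at the old NUL, if any */
	if (sep)
		*q++ = ' ';
	memcpy(q, key, k_len);
	q += k_len;
	if (value) {
		*q++ = '=';
		memcpy(q, value, v_len);
		q += v_len;
	}
	*q = '\0';

	/* The bytes written must match the size we are about to record. */
	assert((size_t)(q - p) == c->size + add);
	c->saved = p;
	c->size += add;
	return 0;
}

/* Release the cached string and reset the cache to its empty state. */
static void opt_cache_free(struct opt_cache *c)
{
	free(c->saved);
	c->saved = NULL;
	c->size = 0;
}
```

Adding "ro" and then "uid"/"1000" to an empty cache leaves the string
"ro uid=1000", which is the kind of buffer the deferred check would hand to
the DFA once the mountpoint is known.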

Signed-off-by: David Howells <[email protected]>
Acked-by: John Johansen <[email protected]>
cc: [email protected]
cc: [email protected]
---

security/apparmor/include/mount.h | 11 +++-
security/apparmor/lsm.c | 107 +++++++++++++++++++++++++++++++++++++
security/apparmor/mount.c | 46 ++++++++++++++++
3 files changed, 162 insertions(+), 2 deletions(-)

diff --git a/security/apparmor/include/mount.h b/security/apparmor/include/mount.h
index 25d6067fa6ef..0441bfae30fa 100644
--- a/security/apparmor/include/mount.h
+++ b/security/apparmor/include/mount.h
@@ -16,6 +16,7 @@

#include <linux/fs.h>
#include <linux/path.h>
+#include <linux/fs_context.h>

#include "domain.h"
#include "policy.h"
@@ -27,7 +28,13 @@
#define AA_AUDIT_DATA 0x40
#define AA_MNT_CONT_MATCH 0x40

-#define AA_MS_IGNORE_MASK (MS_KERNMOUNT | MS_NOSEC | MS_ACTIVE | MS_BORN)
+#define AA_SB_IGNORE_MASK (SB_KERNMOUNT | SB_NOSEC | SB_ACTIVE | SB_BORN)
+
+struct apparmor_fs_context {
+ struct fs_context fc;
+ char *saved_options;
+ size_t saved_size;
+};

int aa_remount(struct aa_label *label, const struct path *path,
unsigned long flags, void *data);
@@ -45,6 +52,8 @@ int aa_move_mount(struct aa_label *label, const struct path *path,
int aa_new_mount(struct aa_label *label, const char *dev_name,
const struct path *path, const char *type, unsigned long flags,
void *data);
+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+ const struct path *mountpoint);

int aa_umount(struct aa_label *label, struct vfsmount *mnt, int flags);

diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 3d98ace5b898..416204ea713d 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -520,6 +520,105 @@ static int apparmor_file_mprotect(struct vm_area_struct *vma,
!(vma->vm_flags & VM_SHARED) ? MAP_PRIVATE : 0);
}

+static int apparmor_fs_context_alloc(struct fs_context *fc, struct dentry *reference)
+{
+ struct apparmor_fs_context *afc;
+
+ afc = kzalloc(sizeof(*afc), GFP_KERNEL);
+ if (!afc)
+ return -ENOMEM;
+
+ fc->security = afc;
+ return 0;
+}
+
+static int apparmor_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ fc->security = NULL;
+ return 0;
+}
+
+static void apparmor_fs_context_free(struct fs_context *fc)
+{
+ struct apparmor_fs_context *afc = fc->security;
+
+ if (afc) {
+ kfree(afc->saved_options);
+ kfree(afc);
+ }
+}
+
+/*
+ * As a temporary hack, we buffer all the options. The problem is that we need
+ * to pass them to the DFA evaluator *after* mount point parameters, which
+ * means deferring the entire check to the sb_mountpoint hook.
+ */
+static int apparmor_fs_context_parse_param(struct fs_context *fc,
+ struct fs_parameter *param)
+{
+ struct apparmor_fs_context *afc = fc->security;
+ const char *value;
+ size_t space = 0, k_len = strlen(param->key), len = k_len, v_len;
+ char *p, *q;
+
+ if (afc->saved_size > 0)
+ space = 1;
+
+ switch (param->type) {
+ case fs_value_is_string:
+ value = param->string;
+ v_len = param->size;
+ len += 1 + v_len;
+ break;
+ case fs_value_is_filename:
+ case fs_value_is_filename_empty: {
+ value = param->name->name;
+ v_len = param->size;
+ len += 1 + v_len;
+ break;
+ }
+ default:
+ value = NULL;
+ v_len = 0;
+ break;
+ }
+
+ p = krealloc(afc->saved_options, afc->saved_size + space + len + 1,
+ GFP_KERNEL);
+ if (!p)
+ return -ENOMEM;
+
+ q = p + afc->saved_size;
+ if (q != p)
+ *q++ = ' ';
+ memcpy(q, param->key, k_len);
+ q += k_len;
+ if (value) {
+ *q++ = '=';
+ memcpy(q, value, v_len);
+ q += v_len;
+ }
+ *q = 0;
+
+ afc->saved_options = p;
+ afc->saved_size += space + len;
+ return -ENOPARAM;
+}
+
+static int apparmor_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ struct aa_label *label;
+ int error = 0;
+
+ label = __begin_current_label_crit_section();
+ if (!unconfined(label))
+ error = aa_new_mount_fc(label, fc, mountpoint);
+ __end_current_label_crit_section(label);
+
+ return error;
+}
+
static int apparmor_sb_mount(const char *dev_name, const struct path *path,
const char *type, unsigned long flags,
void *data, size_t data_size)
@@ -531,7 +630,7 @@ static int apparmor_sb_mount(const char *dev_name, const struct path *path,
if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
flags &= ~MS_MGC_MSK;

- flags &= ~AA_MS_IGNORE_MASK;
+ flags &= ~AA_SB_IGNORE_MASK;

label = __begin_current_label_crit_section();
if (!unconfined(label)) {
@@ -1134,6 +1233,12 @@ static struct security_hook_list apparmor_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(capget, apparmor_capget),
LSM_HOOK_INIT(capable, apparmor_capable),

+ LSM_HOOK_INIT(fs_context_alloc, apparmor_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, apparmor_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, apparmor_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_param, apparmor_fs_context_parse_param),
+ LSM_HOOK_INIT(sb_mountpoint, apparmor_sb_mountpoint),
+
LSM_HOOK_INIT(sb_mount, apparmor_sb_mount),
LSM_HOOK_INIT(sb_umount, apparmor_sb_umount),
LSM_HOOK_INIT(sb_pivotroot, apparmor_sb_pivotroot),
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index 8c3787399356..3c95fffb76ac 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -554,6 +554,52 @@ int aa_new_mount(struct aa_label *label, const char *dev_name,
return error;
}

+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+ const struct path *mountpoint)
+{
+ struct apparmor_fs_context *afc = fc->security;
+ struct aa_profile *profile;
+ char *buffer = NULL, *dev_buffer = NULL;
+ bool binary;
+ int error;
+ struct path tmp_path, *dev_path = NULL;
+
+ AA_BUG(!label);
+ AA_BUG(!mountpoint);
+
+ binary = fc->fs_type->fs_flags & FS_BINARY_MOUNTDATA;
+
+ if (fc->fs_type->fs_flags & FS_REQUIRES_DEV) {
+ if (!fc->source)
+ return -ENOENT;
+
+ error = kern_path(fc->source, LOOKUP_FOLLOW, &tmp_path);
+ if (error)
+ return error;
+ dev_path = &tmp_path;
+ }
+
+ get_buffers(buffer, dev_buffer);
+ if (dev_path) {
+ error = fn_for_each_confined(label, profile,
+ match_mnt(profile, mountpoint, buffer, dev_path, dev_buffer,
+ fc->fs_type->name,
+ fc->sb_flags & ~AA_SB_IGNORE_MASK,
+ afc->saved_options, binary));
+ } else {
+ error = fn_for_each_confined(label, profile,
+ match_mnt_path_str(profile, mountpoint, buffer,
+ fc->source, fc->fs_type->name,
+ fc->sb_flags & ~AA_SB_IGNORE_MASK,
+ afc->saved_options, binary, NULL));
+ }
+ put_buffers(buffer, dev_buffer);
+ if (dev_path)
+ path_put(dev_path);
+
+ return error;
+}
+
static int profile_umount(struct aa_profile *profile, struct path *path,
char *buffer)
{


2018-09-21 16:34:19

by David Howells

Subject: [PATCH 21/34] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #12]

Make kernfs support superblock creation/mount/remount with fs_context.

This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.

Notes:

(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).

(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired.

(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.

(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.

(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.

(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.

(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
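
The wrapping described in notes (1) and (5) can be illustrated with a small
userspace sketch. The *_model struct names, rdt_ctx_attach() and rdt_fc2ctx()
are made up for this example and the structs are cut down to a couple of
fields; only the container_of() pattern itself is what the patch relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins; the real structs carry many more fields. */
struct fs_context_model {
	void *fs_private;		/* points at the embedded kernfs part */
};

struct kernfs_ctx_model {
	unsigned long magic;
};

struct rdt_ctx_model {
	struct kernfs_ctx_model kfc;	/* generic part, embedded by value */
	bool enable_cdpl2;		/* filesystem-specific parameters */
	bool enable_cdpl3;
};

/* Store a pointer to the embedded generic part, as an init hook would. */
static void rdt_ctx_attach(struct fs_context_model *fc,
			   struct rdt_ctx_model *ctx)
{
	fc->fs_private = &ctx->kfc;
}

/* Recover the wrapping context from the generic pointer. */
static struct rdt_ctx_model *rdt_fc2ctx(struct fs_context_model *fc)
{
	struct kernfs_ctx_model *kfc = fc->fs_private;

	assert(kfc != NULL);
	return container_of(kfc, struct rdt_ctx_model, kfc);
}
```

Generic code only ever sees the embedded part through fs_private, while the
filesystem converts back to its own wrapper whenever it needs its private
parameters.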

Weirdies:

(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.

(*) The cgroup refcount web. This really needs documenting.

(*) cgroup2 only has one root?

Also incorporate a suggestion from Thomas Gleixner to place the RDT
enablement code in its own function.
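
As a rough model of that split: parameter parsing only records what was asked
for, and a separate enable step applies the requests in order, stopping at the
first failure. Everything below is a hypothetical userspace sketch; the
*_model types, hw_call() and the fake return codes are assumptions made for
illustration, and the option spellings are the ones used by the parser being
removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* What the parse step records; no hardware state is touched yet. */
struct rdt_opts_model {
	bool cdpl2;
	bool cdpl3;
	bool mba_mbps;
};

/* Fake hardware: canned return codes plus a call counter. */
struct rdt_hw_model {
	int cdpl2_ret;
	int cdpl3_ret;
	int mba_ret;
	int calls;
};

/* Parse step: map an option name to a flag. Returns -1 if unknown. */
static int rdt_parse_model(struct rdt_opts_model *o, const char *key)
{
	assert(o != NULL && key != NULL);

	if (!strcmp(key, "cdp"))
		o->cdpl3 = true;
	else if (!strcmp(key, "cdpl2"))
		o->cdpl2 = true;
	else if (!strcmp(key, "mba_MBps"))
		o->mba_mbps = true;
	else
		return -1;
	return 0;
}

/* Count the call and hand back the canned result. */
static int hw_call(struct rdt_hw_model *hw, int ret)
{
	hw->calls++;
	return ret;
}

/* Enable step: apply recorded requests, skipping the rest after an error. */
static int rdt_enable_model(const struct rdt_opts_model *o,
			    struct rdt_hw_model *hw)
{
	int ret = 0;

	if (o->cdpl2)
		ret = hw_call(hw, hw->cdpl2_ret);
	if (!ret && o->cdpl3)
		ret = hw_call(hw, hw->cdpl3_ret);
	if (!ret && o->mba_mbps)
		ret = hw_call(hw, hw->mba_ret);
	return ret;
}
```

Keeping the parse step free of side effects means a rejected option leaves
nothing to unwind; only a failure in the enable step needs cleanup.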

Signed-off-by: David Howells <[email protected]>
cc: Greg Kroah-Hartman <[email protected]>
cc: Tejun Heo <[email protected]>
cc: Li Zefan <[email protected]>
cc: Johannes Weiner <[email protected]>
cc: [email protected]
cc: [email protected]
---

arch/x86/kernel/cpu/intel_rdt.h | 15 +
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 183 ++++++++++------
fs/kernfs/mount.c | 88 ++++----
fs/sysfs/mount.c | 67 ++++--
include/linux/cgroup.h | 3
include/linux/kernfs.h | 39 ++-
kernel/cgroup/cgroup-internal.h | 50 +++-
kernel/cgroup/cgroup-v1.c | 345 ++++++++++++++++--------------
kernel/cgroup/cgroup.c | 264 +++++++++++++++--------
kernel/cgroup/cpuset.c | 4
10 files changed, 640 insertions(+), 418 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 4e588f36228f..1461adc2c5e8 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -33,6 +33,21 @@
#define RMID_VAL_ERROR BIT_ULL(63)
#define RMID_VAL_UNAVAIL BIT_ULL(62)

+
+struct rdt_fs_context {
+ struct kernfs_fs_context kfc;
+ bool enable_cdpl2;
+ bool enable_cdpl3;
+ bool enable_mba_mbps;
+};
+
+static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
+{
+ struct kernfs_fs_context *kfc = fc->fs_private;
+
+ return container_of(kfc, struct rdt_fs_context, kfc);
+}
+
DECLARE_STATIC_KEY_FALSE(rdt_enable_key);

/**
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index d6cb04c3a28b..34733a221669 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -24,6 +24,7 @@
#include <linux/cpu.h>
#include <linux/debugfs.h>
#include <linux/fs.h>
+#include <linux/fs_parser.h>
#include <linux/sysfs.h>
#include <linux/kernfs.h>
#include <linux/seq_buf.h>
@@ -1707,43 +1708,6 @@ static void cdp_disable_all(void)
cdpl2_disable();
}

-static int parse_rdtgroupfs_options(char *data)
-{
- char *token, *o = data;
- int ret = 0;
-
- while ((token = strsep(&o, ",")) != NULL) {
- if (!*token) {
- ret = -EINVAL;
- goto out;
- }
-
- if (!strcmp(token, "cdp")) {
- ret = cdpl3_enable();
- if (ret)
- goto out;
- } else if (!strcmp(token, "cdpl2")) {
- ret = cdpl2_enable();
- if (ret)
- goto out;
- } else if (!strcmp(token, "mba_MBps")) {
- ret = set_mba_sc(true);
- if (ret)
- goto out;
- } else {
- ret = -EINVAL;
- goto out;
- }
- }
-
- return 0;
-
-out:
- pr_err("Invalid mount option \"%s\"\n", token);
-
- return ret;
-}
-
/*
* We don't allow rdtgroup directories to be created anywhere
* except the root directory. Thus when looking for the rdtgroup
@@ -1815,13 +1779,27 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
struct rdtgroup *prgrp,
struct kernfs_node **mon_data_kn);

-static struct dentry *rdt_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data, size_t data_size)
+static int rdt_enable_ctx(struct rdt_fs_context *ctx)
+{
+ int ret = 0;
+
+ if (ctx->enable_cdpl2)
+ ret = cdpl2_enable();
+
+ if (!ret && ctx->enable_cdpl3)
+ ret = cdpl3_enable();
+
+ if (!ret && ctx->enable_mba_mbps)
+ ret = set_mba_sc(true);
+
+ return ret;
+}
+
+static int rdt_get_tree(struct fs_context *fc)
{
+ struct rdt_fs_context *ctx = rdt_fc2context(fc);
struct rdt_domain *dom;
struct rdt_resource *r;
- struct dentry *dentry;
int ret;

cpus_read_lock();
@@ -1830,53 +1808,42 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
* resctrl file system can only be mounted once.
*/
if (static_branch_unlikely(&rdt_enable_key)) {
- dentry = ERR_PTR(-EBUSY);
+ ret = -EBUSY;
goto out;
}

- ret = parse_rdtgroupfs_options(data);
- if (ret) {
- dentry = ERR_PTR(ret);
+ ret = rdt_enable_ctx(ctx);
+ if (ret < 0)
goto out_cdp;
- }

closid_init();

ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
- if (ret) {
- dentry = ERR_PTR(ret);
- goto out_cdp;
- }
+ if (ret < 0)
+ goto out_mba;

if (rdt_mon_capable) {
ret = mongroup_create_dir(rdtgroup_default.kn,
NULL, "mon_groups",
&kn_mongrp);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_info;
- }
kernfs_get(kn_mongrp);

ret = mkdir_mondata_all(rdtgroup_default.kn,
&rdtgroup_default, &kn_mondata);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_mongrp;
- }
kernfs_get(kn_mondata);
rdtgroup_default.mon.mon_data_kn = kn_mondata;
}

ret = rdt_pseudo_lock_init();
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret)
goto out_mondata;
- }

- dentry = kernfs_mount(fs_type, flags, rdt_root,
- RDTGROUP_SUPER_MAGIC, NULL);
- if (IS_ERR(dentry))
+ ret = kernfs_get_tree(fc);
+ if (ret < 0)
goto out_psl;

if (rdt_alloc_capable)
@@ -1905,14 +1872,97 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
kernfs_remove(kn_mongrp);
out_info:
kernfs_remove(kn_info);
+out_mba:
+ if (ctx->enable_mba_mbps)
+ set_mba_sc(false);
out_cdp:
cdp_disable_all();
out:
rdt_last_cmd_clear();
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
+ return ret;
+}
+
+enum rdt_param {
+ Opt_cdp,
+ Opt_cdpl2,
+ Opt_mba_mpbs,
+ nr__rdt_params
+};
+
+static const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
+ [Opt_cdp] = { fs_param_is_flag },
+ [Opt_cdpl2] = { fs_param_is_flag },
+ [Opt_mba_mpbs] = { fs_param_is_flag },
+};
+
+static const char *const rdt_param_keys[nr__rdt_params] = {
+ [Opt_cdp] = "cdp",
+ [Opt_cdpl2] = "cdpl2",
+ [Opt_mba_mpbs] = "mba_MBps",
+};
+
+static const struct fs_parameter_description rdt_fs_parameters = {
+ .name = "rdt",
+ .nr_params = nr__rdt_params,
+ .keys = rdt_param_keys,
+ .specs = rdt_param_specs,
+ .no_source = true,
+};
+
+static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ struct rdt_fs_context *ctx = rdt_fc2context(fc);
+ struct fs_parse_result result;
+ int opt;

- return dentry;
+ opt = fs_parse(fc, &rdt_fs_parameters, param, &result);
+ if (opt < 0)
+ return opt;
+
+ switch (opt) {
+ case Opt_cdp:
+ ctx->enable_cdpl3 = true;
+ return 0;
+ case Opt_cdpl2:
+ ctx->enable_cdpl2 = true;
+ return 0;
+ case Opt_mba_mpbs:
+ ctx->enable_mba_mbps = true;
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+static void rdt_fs_context_free(struct fs_context *fc)
+{
+ struct rdt_fs_context *ctx = rdt_fc2context(fc);
+
+ kernfs_free_fs_context(fc);
+ kfree(ctx);
+}
+
+static const struct fs_context_operations rdt_fs_context_ops = {
+ .free = rdt_fs_context_free,
+ .parse_param = rdt_parse_param,
+ .get_tree = rdt_get_tree,
+};
+
+static int rdt_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+ struct rdt_fs_context *ctx;
+
+ ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->kfc.root = rdt_root;
+ ctx->kfc.magic = RDTGROUP_SUPER_MAGIC;
+ fc->fs_private = &ctx->kfc;
+ fc->ops = &rdt_fs_context_ops;
+ return 0;
}

static int reset_all_ctrls(struct rdt_resource *r)
@@ -2085,9 +2135,10 @@ static void rdt_kill_sb(struct super_block *sb)
}

static struct file_system_type rdt_fs_type = {
- .name = "resctrl",
- .mount = rdt_mount,
- .kill_sb = rdt_kill_sb,
+ .name = "resctrl",
+ .init_fs_context = rdt_init_fs_context,
+ .parameters = &rdt_fs_parameters,
+ .kill_sb = rdt_kill_sb,
};

static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f70e0b69e714..56742632956c 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -22,14 +22,13 @@

struct kmem_cache *kernfs_node_cache;

-static int kernfs_sop_remount_fs(struct super_block *sb, int *flags,
- char *data, size_t data_size)
+int kernfs_reconfigure(struct fs_context *fc)
{
- struct kernfs_root *root = kernfs_info(sb)->root;
+ struct kernfs_root *root = kernfs_info(fc->root->d_sb)->root;
struct kernfs_syscall_ops *scops = root->syscall_ops;

- if (scops && scops->remount_fs)
- return scops->remount_fs(root, flags, data);
+ if (scops && scops->reconfigure)
+ return scops->reconfigure(root, fc);
return 0;
}

@@ -61,7 +60,6 @@ const struct super_operations kernfs_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = kernfs_evict_inode,

- .remount_fs = kernfs_sop_remount_fs,
.show_options = kernfs_sop_show_options,
.show_path = kernfs_sop_show_path,
};
@@ -219,7 +217,7 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
} while (true);
}

-static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
+static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
{
struct kernfs_super_info *info = kernfs_info(sb);
struct inode *inode;
@@ -230,7 +228,7 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
- sb->s_magic = magic;
+ sb->s_magic = kfc->magic;
sb->s_op = &kernfs_sops;
sb->s_xattr = kernfs_xattr_handlers;
if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP)
@@ -257,21 +255,20 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
return 0;
}

-static int kernfs_test_super(struct super_block *sb, void *data)
+static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
{
struct kernfs_super_info *sb_info = kernfs_info(sb);
- struct kernfs_super_info *info = data;
+ struct kernfs_super_info *info = fc->s_fs_info;

return sb_info->root == info->root && sb_info->ns == info->ns;
}

-static int kernfs_set_super(struct super_block *sb, void *data)
+static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
{
- int error;
- error = set_anon_super(sb, data);
- if (!error)
- sb->s_fs_info = data;
- return error;
+ struct kernfs_fs_context *kfc = fc->fs_private;
+
+ kfc->ns_tag = NULL;
+ return set_anon_super_fc(sb, fc);
}

/**
@@ -288,63 +285,60 @@ const void *kernfs_super_ns(struct super_block *sb)
}

/**
- * kernfs_mount_ns - kernfs mount helper
- * @fs_type: file_system_type of the fs being mounted
- * @flags: mount flags specified for the mount
- * @root: kernfs_root of the hierarchy being mounted
- * @magic: file system specific magic number
- * @new_sb_created: tell the caller if we allocated a new superblock
- * @ns: optional namespace tag of the mount
- *
- * This is to be called from each kernfs user's file_system_type->mount()
- * implementation, which should pass through the specified @fs_type and
- * @flags, and specify the hierarchy and namespace tag to mount via @root
- * and @ns, respectively.
+ * kernfs_get_tree - kernfs filesystem access/retrieval helper
+ * @fc: The filesystem context.
*
- * The return value can be passed to the vfs layer verbatim.
+ * This is to be called from each kernfs user's fs_context->ops->get_tree()
+ * implementation, which should have set up the kernfs_fs_context beforehand,
+ * specifying the hierarchy and namespace tag to mount via ->root and
+ * ->ns_tag, respectively.
*/
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
+int kernfs_get_tree(struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = fc->fs_private;
struct super_block *sb;
struct kernfs_super_info *info;
int error;

info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
- return ERR_PTR(-ENOMEM);
+ return -ENOMEM;

- info->root = root;
- info->ns = ns;
+ info->root = kfc->root;
+ info->ns = kfc->ns_tag;
INIT_LIST_HEAD(&info->node);

- sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags,
- &init_user_ns, info);
- if (IS_ERR(sb) || sb->s_fs_info != info)
- kfree(info);
+ fc->s_fs_info = info;
+ sb = sget_fc(fc, kernfs_test_super, kernfs_set_super);
if (IS_ERR(sb))
- return ERR_CAST(sb);
-
- if (new_sb_created)
- *new_sb_created = !sb->s_root;
+ return PTR_ERR(sb);

if (!sb->s_root) {
struct kernfs_super_info *info = kernfs_info(sb);

- error = kernfs_fill_super(sb, magic);
+ kfc->new_sb_created = true;
+
+ error = kernfs_fill_super(sb, kfc);
if (error) {
deactivate_locked_super(sb);
- return ERR_PTR(error);
+ return error;
}
sb->s_flags |= SB_ACTIVE;

mutex_lock(&kernfs_mutex);
- list_add(&info->node, &root->supers);
+ list_add(&info->node, &info->root->supers);
mutex_unlock(&kernfs_mutex);
}

- return dget(sb->s_root);
+ fc->root = dget(sb->s_root);
+ return 0;
+}
+
+void kernfs_free_fs_context(struct fs_context *fc)
+{
+ /* Note that we don't deal with kfc->ns_tag here. */
+ kfree(fc->s_fs_info);
+ fc->s_fs_info = NULL;
}

/**
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 77302c35b0ff..1e1c0ccc6a36 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -13,6 +13,7 @@
#include <linux/magic.h>
#include <linux/mount.h>
#include <linux/init.h>
+#include <linux/slab.h>
#include <linux/user_namespace.h>

#include "sysfs.h"
@@ -20,27 +21,55 @@
static struct kernfs_root *sysfs_root;
struct kernfs_node *sysfs_root_kn;

-static struct dentry *sysfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data, size_t data_size)
+static int sysfs_get_tree(struct fs_context *fc)
{
- struct dentry *root;
- void *ns;
- bool new_sb = false;
+ struct kernfs_fs_context *kfc = fc->fs_private;
+ int ret;

- if (!(flags & SB_KERNMOUNT)) {
+ ret = kernfs_get_tree(fc);
+ if (ret)
+ return ret;
+
+ if (kfc->new_sb_created)
+ fc->root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
+ return 0;
+}
+
+static void sysfs_fs_context_free(struct fs_context *fc)
+{
+ struct kernfs_fs_context *kfc = fc->fs_private;
+
+ if (kfc->ns_tag)
+ kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
+ kernfs_free_fs_context(fc);
+ kfree(kfc);
+}
+
+static const struct fs_context_operations sysfs_fs_context_ops = {
+ .free = sysfs_fs_context_free,
+ .get_tree = sysfs_get_tree,
+};
+
+static int sysfs_init_fs_context(struct fs_context *fc,
+ struct dentry *reference)
+{
+ struct kernfs_fs_context *kfc;
+
+ if (!(fc->sb_flags & SB_KERNMOUNT)) {
if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
- return ERR_PTR(-EPERM);
+ return -EPERM;
}

- ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
- root = kernfs_mount_ns(fs_type, flags, sysfs_root,
- SYSFS_MAGIC, &new_sb, ns);
- if (!new_sb)
- kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
- else if (!IS_ERR(root))
- root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
+ kfc = kzalloc(sizeof(struct kernfs_fs_context), GFP_KERNEL);
+ if (!kfc)
+ return -ENOMEM;

- return root;
+ kfc->ns_tag = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+ kfc->root = sysfs_root;
+ kfc->magic = SYSFS_MAGIC;
+ fc->fs_private = kfc;
+ fc->ops = &sysfs_fs_context_ops;
+ return 0;
}

static void sysfs_kill_sb(struct super_block *sb)
@@ -52,10 +81,10 @@ static void sysfs_kill_sb(struct super_block *sb)
}

static struct file_system_type sysfs_fs_type = {
- .name = "sysfs",
- .mount = sysfs_mount,
- .kill_sb = sysfs_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "sysfs",
+ .init_fs_context = sysfs_init_fs_context,
+ .kill_sb = sysfs_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

int __init sysfs_init(void)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 32c553556bbd..13b6379648ec 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -859,10 +859,11 @@ copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,

#endif /* !CONFIG_CGROUPS */

-static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+static inline struct cgroup_namespace *get_cgroup_ns(struct cgroup_namespace *ns)
{
if (ns)
refcount_inc(&ns->count);
+ return ns;
}

static inline void put_cgroup_ns(struct cgroup_namespace *ns)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 0f6bb8e1bc83..051709212f55 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -17,6 +17,7 @@
#include <linux/atomic.h>
#include <linux/uidgid.h>
#include <linux/wait.h>
+#include <linux/fs_context.h>

struct file;
struct dentry;
@@ -27,6 +28,7 @@ struct super_block;
struct file_system_type;
struct fs_context;

+struct kernfs_fs_context;
struct kernfs_open_node;
struct kernfs_iattrs;

@@ -168,7 +170,7 @@ struct kernfs_node {
* kernfs_node parameter.
*/
struct kernfs_syscall_ops {
- int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
+ int (*reconfigure)(struct kernfs_root *root, struct fs_context *fc);
int (*show_options)(struct seq_file *sf, struct kernfs_root *root);

int (*mkdir)(struct kernfs_node *parent, const char *name,
@@ -269,6 +271,18 @@ struct kernfs_ops {
#endif
};

+/*
+ * The kernfs superblock creation/mount parameter context.
+ */
+struct kernfs_fs_context {
+ struct kernfs_root *root; /* Root of the hierarchy being mounted */
+ void *ns_tag; /* Namespace tag of the mount (or NULL) */
+ unsigned long magic; /* File system specific magic number */
+
+ /* The following are set/used by kernfs_get_tree() */
+ bool new_sb_created; /* Set to T if we allocated a new sb */
+};
+
#ifdef CONFIG_KERNFS

static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
@@ -354,9 +368,8 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
void kernfs_notify(struct kernfs_node *kn);

const void *kernfs_super_ns(struct super_block *sb);
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns);
+int kernfs_get_tree(struct fs_context *fc);
+void kernfs_free_fs_context(struct fs_context *fc);
void kernfs_kill_sb(struct super_block *sb);
struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
int kernfs_reconfigure(struct fs_context *fc);
@@ -461,11 +474,10 @@ static inline void kernfs_notify(struct kernfs_node *kn) { }
static inline const void *kernfs_super_ns(struct super_block *sb)
{ return NULL; }

-static inline struct dentry *
-kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
-{ return ERR_PTR(-ENOSYS); }
+static inline int kernfs_get_tree(struct fs_context *fc)
+{ return -ENOSYS; }
+
+static inline void kernfs_free_fs_context(struct fs_context *fc) { }

static inline void kernfs_kill_sb(struct super_block *sb) { }

@@ -547,13 +559,4 @@ static inline int kernfs_rename(struct kernfs_node *kn,
return kernfs_rename_ns(kn, new_parent, new_name, NULL);
}

-static inline struct dentry *
-kernfs_mount(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created)
-{
- return kernfs_mount_ns(fs_type, flags, root,
- magic, new_sb_created, NULL);
-}
-
#endif /* __LINUX_KERNFS_H */
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 75568fcf2180..35012d2aca97 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -34,6 +34,33 @@ extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
} \
} while (0)

+/*
+ * The cgroup filesystem superblock creation/mount context.
+ */
+struct cgroup_fs_context {
+ struct kernfs_fs_context kfc;
+ struct cgroup_root *root;
+ struct cgroup_namespace *ns;
+ u8 version; /* cgroups version */
+ unsigned int flags; /* CGRP_ROOT_* flags */
+
+ /* cgroup1 bits */
+ bool cpuset_clone_children;
+ bool none; /* User explicitly requested empty subsystem */
+ bool all_ss; /* Seen 'all' option */
+ bool one_ss; /* Seen a subsystem name option */
+ u16 subsys_mask; /* Selected subsystems */
+ char *name; /* Hierarchy name */
+ char *release_agent; /* Path for release notifications */
+};
+
+static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc)
+{
+ struct kernfs_fs_context *kfc = fc->fs_private;
+
+ return container_of(kfc, struct cgroup_fs_context, kfc);
+}
+
/*
* A cgroup can be associated with multiple css_sets as different tasks may
* belong to different cgroups on different hierarchies. In the other
@@ -115,16 +142,6 @@ struct cgroup_mgctx {
#define DEFINE_CGROUP_MGCTX(name) \
struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)

-struct cgroup_sb_opts {
- u16 subsys_mask;
- unsigned int flags;
- char *release_agent;
- bool cpuset_clone_children;
- char *name;
- /* User explicitly requested empty subsystem */
- bool none;
-};
-
extern struct mutex cgroup_mutex;
extern spinlock_t css_set_lock;
extern struct cgroup_subsys *cgroup_subsys[];
@@ -195,12 +212,10 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
struct cgroup_namespace *ns);

void cgroup_free_root(struct cgroup_root *root);
-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts);
+void init_cgroup_root(struct cgroup_fs_context *ctx);
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
- struct cgroup_root *root, unsigned long magic,
- struct cgroup_namespace *ns);
+int cgroup_do_get_tree(struct fs_context *fc);

int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
@@ -244,14 +259,15 @@ extern const struct proc_ns_operations cgroupns_operations;
*/
extern struct cftype cgroup1_base_files[];
extern struct kernfs_syscall_ops cgroup1_kf_syscall_ops;
+extern const struct fs_parameter_description cgroup1_fs_parameters;

int proc_cgroupstats_show(struct seq_file *m, void *v);
bool cgroup1_ssid_disabled(int ssid);
void cgroup1_pidlist_destroy_all(struct cgroup *cgrp);
void cgroup1_release_agent(struct work_struct *work);
void cgroup1_check_for_release(struct cgroup *cgrp);
-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
- void *data, unsigned long magic,
- struct cgroup_namespace *ns);
+int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param);
+int cgroup1_validate(struct fs_context *fc);
+int cgroup1_get_tree(struct fs_context *fc);

#endif /* __CGROUP_INTERNAL_H */
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 51063e7a93c2..d8b325c3c2eb 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -13,9 +13,12 @@
#include <linux/delayacct.h>
#include <linux/pid_namespace.h>
#include <linux/cgroupstats.h>
+#include <linux/fs_parser.h>

#include <trace/events/cgroup.h>

+#define cg_invalf(fc, fmt, ...) ({ pr_err(fmt, ## __VA_ARGS__); -EINVAL; })
+
/*
* pidlists linger the following amount before being destroyed. The goal
* is avoiding frequent destruction in the middle of consecutive read calls
@@ -903,92 +906,61 @@ static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root *kf_roo
return 0;
}

-static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
-{
- char *token, *o = data;
- bool all_ss = false, one_ss = false;
- u16 mask = U16_MAX;
- struct cgroup_subsys *ss;
- int nr_opts = 0;
- int i;
-
-#ifdef CONFIG_CPUSETS
- mask = ~((u16)1 << cpuset_cgrp_id);
-#endif
+enum cgroup1_param {
+ Opt_all,
+ Opt_clone_children,
+ Opt_cpuset_v2_mode,
+ Opt_name,
+ Opt_none,
+ Opt_noprefix,
+ Opt_release_agent,
+ Opt_xattr,
+ nr__cgroup1_params
+};

- memset(opts, 0, sizeof(*opts));
+static const struct fs_parameter_spec cgroup1_param_specs[nr__cgroup1_params] = {
+ [Opt_all] = { fs_param_is_flag },
+ [Opt_clone_children] = { fs_param_is_flag },
+ [Opt_cpuset_v2_mode] = { fs_param_is_flag },
+ [Opt_name] = { fs_param_is_string },
+ [Opt_none] = { fs_param_is_flag },
+ [Opt_noprefix] = { fs_param_is_flag },
+ [Opt_release_agent] = { fs_param_is_string },
+ [Opt_xattr] = { fs_param_is_flag },
+};

- while ((token = strsep(&o, ",")) != NULL) {
- nr_opts++;
+static const char *const cgroup1_param_keys[nr__cgroup1_params] = {
+ [Opt_all] = "all",
+ [Opt_clone_children] = "clone_children",
+ [Opt_cpuset_v2_mode] = "cpuset_v2_mode",
+ [Opt_name] = "name",
+ [Opt_none] = "none",
+ [Opt_noprefix] = "noprefix",
+ [Opt_release_agent] = "release_agent",
+ [Opt_xattr] = "xattr",
+};

- if (!*token)
- return -EINVAL;
- if (!strcmp(token, "none")) {
- /* Explicitly have no subsystems */
- opts->none = true;
- continue;
- }
- if (!strcmp(token, "all")) {
- /* Mutually exclusive option 'all' + subsystem name */
- if (one_ss)
- return -EINVAL;
- all_ss = true;
- continue;
- }
- if (!strcmp(token, "noprefix")) {
- opts->flags |= CGRP_ROOT_NOPREFIX;
- continue;
- }
- if (!strcmp(token, "clone_children")) {
- opts->cpuset_clone_children = true;
- continue;
- }
- if (!strcmp(token, "cpuset_v2_mode")) {
- opts->flags |= CGRP_ROOT_CPUSET_V2_MODE;
- continue;
- }
- if (!strcmp(token, "xattr")) {
- opts->flags |= CGRP_ROOT_XATTR;
- continue;
- }
- if (!strncmp(token, "release_agent=", 14)) {
- /* Specifying two release agents is forbidden */
- if (opts->release_agent)
- return -EINVAL;
- opts->release_agent =
- kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
- if (!opts->release_agent)
- return -ENOMEM;
- continue;
- }
- if (!strncmp(token, "name=", 5)) {
- const char *name = token + 5;
- /* Can't specify an empty name */
- if (!strlen(name))
- return -EINVAL;
- /* Must match [\w.-]+ */
- for (i = 0; i < strlen(name); i++) {
- char c = name[i];
- if (isalnum(c))
- continue;
- if ((c == '.') || (c == '-') || (c == '_'))
- continue;
- return -EINVAL;
- }
- /* Specifying two names is forbidden */
- if (opts->name)
- return -EINVAL;
- opts->name = kstrndup(name,
- MAX_CGROUP_ROOT_NAMELEN - 1,
- GFP_KERNEL);
- if (!opts->name)
- return -ENOMEM;
+const struct fs_parameter_description cgroup1_fs_parameters = {
+ .name = "cgroup1",
+ .nr_params = nr__cgroup1_params,
+ .keys = cgroup1_param_keys,
+ .specs = cgroup1_param_specs,
+ .no_source = true,
+};

- continue;
- }
+int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+ struct cgroup_subsys *ss;
+ struct fs_parse_result result;
+ int opt, i;

+ opt = fs_parse(fc, &cgroup1_fs_parameters, param, &result);
+ if (opt == -ENOPARAM) {
+ if (strcmp(param->key, "source") == 0)
+ return 0;
for_each_subsys(ss, i) {
- if (strcmp(token, ss->legacy_name))
+ if (strcmp(param->key, ss->legacy_name) != 0)
continue;
if (!cgroup_ssid_enabled(i))
continue;
@@ -996,75 +968,144 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
continue;

/* Mutually exclusive option 'all' + subsystem name */
- if (all_ss)
- return -EINVAL;
- opts->subsys_mask |= (1 << i);
- one_ss = true;
+ if (ctx->all_ss)
+ return cg_invalf(fc, "cgroup1: subsys name conflicts with all");
+ ctx->subsys_mask |= (1 << i);
+ ctx->one_ss = true;
+ return 0;
+ }

- break;
+ return cg_invalf(fc, "cgroup1: Unknown subsys name '%s'", param->key);
+ }
+ if (opt < 0)
+ return opt;
+
+ switch (opt) {
+ case Opt_none:
+ /* Explicitly have no subsystems */
+ ctx->none = true;
+ return 0;
+ case Opt_all:
+ /* Mutually exclusive option 'all' + subsystem name */
+ if (ctx->one_ss)
+ return cg_invalf(fc, "cgroup1: all conflicts with subsys name");
+ ctx->all_ss = true;
+ return 0;
+ case Opt_noprefix:
+ ctx->flags |= CGRP_ROOT_NOPREFIX;
+ return 0;
+ case Opt_clone_children:
+ ctx->cpuset_clone_children = true;
+ return 0;
+ case Opt_cpuset_v2_mode:
+ ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE;
+ return 0;
+ case Opt_xattr:
+ ctx->flags |= CGRP_ROOT_XATTR;
+ return 0;
+ case Opt_release_agent:
+ /* Specifying two release agents is forbidden */
+ if (ctx->release_agent)
+ return cg_invalf(fc, "cgroup1: release_agent respecified");
+ ctx->release_agent = param->string;
+ param->string = NULL;
+ if (!ctx->release_agent)
+ return -ENOMEM;
+ return 0;
+
+ case Opt_name:
+ /* Can't specify an empty name */
+ if (!param->size)
+ return cg_invalf(fc, "cgroup1: Empty name");
+ if (param->size > MAX_CGROUP_ROOT_NAMELEN - 1)
+ return cg_invalf(fc, "cgroup1: Name too long");
+ /* Must match [\w.-]+ */
+ for (i = 0; i < param->size; i++) {
+ char c = param->string[i];
+ if (isalnum(c))
+ continue;
+ if ((c == '.') || (c == '-') || (c == '_'))
+ continue;
+ return cg_invalf(fc, "cgroup1: Invalid name");
}
- if (i == CGROUP_SUBSYS_COUNT)
- return -ENOENT;
+ /* Specifying two names is forbidden */
+ if (ctx->name)
+ return cg_invalf(fc, "cgroup1: name respecified");
+ ctx->name = param->string;
+ param->string = NULL;
+ return 0;
}

+ return 0;
+}
+
+/*
+ * Validate the options that have been parsed.
+ */
+int cgroup1_validate(struct fs_context *fc)
+{
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+ struct cgroup_subsys *ss;
+ u16 mask = U16_MAX;
+ int i;
+
+#ifdef CONFIG_CPUSETS
+ mask = ~((u16)1 << cpuset_cgrp_id);
+#endif
+
/*
* If the 'all' option was specified select all the subsystems,
* otherwise if 'none', 'name=' and a subsystem name options were
* not specified, let's default to 'all'
*/
- if (all_ss || (!one_ss && !opts->none && !opts->name))
+ if (ctx->all_ss || (!ctx->one_ss && !ctx->none && !ctx->name))
for_each_subsys(ss, i)
if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i))
- opts->subsys_mask |= (1 << i);
+ ctx->subsys_mask |= (1 << i);

/*
* We either have to specify by name or by subsystems. (So all
* empty hierarchies must have a name).
*/
- if (!opts->subsys_mask && !opts->name)
- return -EINVAL;
+ if (!ctx->subsys_mask && !ctx->name)
+ return cg_invalf(fc, "cgroup1: Need name or subsystem set");

/*
* Option noprefix was introduced just for backward compatibility
* with the old cpuset, so we allow noprefix only if mounting just
* the cpuset subsystem.
*/
- if ((opts->flags & CGRP_ROOT_NOPREFIX) && (opts->subsys_mask & mask))
- return -EINVAL;
+ if ((ctx->flags & CGRP_ROOT_NOPREFIX) && (ctx->subsys_mask & mask))
+ return cg_invalf(fc, "cgroup1: noprefix used incorrectly");

/* Can't specify "none" and some subsystems */
- if (opts->subsys_mask && opts->none)
- return -EINVAL;
+ if (ctx->subsys_mask && ctx->none)
+ return cg_invalf(fc, "cgroup1: none used incorrectly");

return 0;
}

-static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
+static int cgroup1_reconfigure(struct kernfs_root *kf_root, struct fs_context *fc)
{
- int ret = 0;
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
struct cgroup_root *root = cgroup_root_from_kf(kf_root);
- struct cgroup_sb_opts opts;
u16 added_mask, removed_mask;
+ int ret = 0;

cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);

- /* See what subsystems are wanted */
- ret = parse_cgroupfs_options(data, &opts);
- if (ret)
- goto out_unlock;
-
- if (opts.subsys_mask != root->subsys_mask || opts.release_agent)
+ if (ctx->subsys_mask != root->subsys_mask || ctx->release_agent)
pr_warn("option changes via remount are deprecated (pid=%d comm=%s)\n",
task_tgid_nr(current), current->comm);

- added_mask = opts.subsys_mask & ~root->subsys_mask;
- removed_mask = root->subsys_mask & ~opts.subsys_mask;
+ added_mask = ctx->subsys_mask & ~root->subsys_mask;
+ removed_mask = root->subsys_mask & ~ctx->subsys_mask;

/* Don't allow flags or name to change at remount */
- if ((opts.flags ^ root->flags) ||
- (opts.name && strcmp(opts.name, root->name))) {
- pr_err("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"\n",
- opts.flags, opts.name ?: "", root->flags, root->name);
+ if ((ctx->flags ^ root->flags) ||
+ (ctx->name && strcmp(ctx->name, root->name))) {
+ cg_invalf(fc, "option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"",
+ ctx->flags, ctx->name ?: "", root->flags, root->name);
ret = -EINVAL;
goto out_unlock;
}
@@ -1081,17 +1122,15 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)

WARN_ON(rebind_subsystems(&cgrp_dfl_root, removed_mask));

- if (opts.release_agent) {
+ if (ctx->release_agent) {
spin_lock(&release_agent_path_lock);
- strcpy(root->release_agent_path, opts.release_agent);
+ strcpy(root->release_agent_path, ctx->release_agent);
spin_unlock(&release_agent_path_lock);
}

trace_cgroup_remount(root);

out_unlock:
- kfree(opts.release_agent);
- kfree(opts.name);
mutex_unlock(&cgroup_mutex);
return ret;
}
@@ -1099,31 +1138,26 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
struct kernfs_syscall_ops cgroup1_kf_syscall_ops = {
.rename = cgroup1_rename,
.show_options = cgroup1_show_options,
- .remount_fs = cgroup1_remount,
+ .reconfigure = cgroup1_reconfigure,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.show_path = cgroup_show_path,
};

-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
- void *data, unsigned long magic,
- struct cgroup_namespace *ns)
+/*
+ * Find or create a v1 cgroups superblock.
+ */
+int cgroup1_get_tree(struct fs_context *fc)
{
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
struct super_block *pinned_sb = NULL;
- struct cgroup_sb_opts opts;
struct cgroup_root *root;
struct cgroup_subsys *ss;
- struct dentry *dentry;
int i, ret;
bool new_root = false;

cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);

- /* First find the desired set of subsystems */
- ret = parse_cgroupfs_options(data, &opts);
- if (ret)
- goto out_unlock;
-
/*
* Destruction of cgroup root is asynchronous, so subsystems may
* still be dying after the previous unmount. Let's drain the
@@ -1132,15 +1166,13 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* starting. Testing ref liveliness is good enough.
*/
for_each_subsys(ss, i) {
- if (!(opts.subsys_mask & (1 << i)) ||
+ if (!(ctx->subsys_mask & (1 << i)) ||
ss->root == &cgrp_dfl_root)
continue;

if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
- msleep(10);
- ret = restart_syscall();
- goto out_free;
+ goto err_restart;
}
cgroup_put(&ss->root->cgrp);
}
@@ -1156,8 +1188,8 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* name matches but sybsys_mask doesn't, we should fail.
* Remember whether name matched.
*/
- if (opts.name) {
- if (strcmp(opts.name, root->name))
+ if (ctx->name) {
+ if (strcmp(ctx->name, root->name))
continue;
name_match = true;
}
@@ -1166,15 +1198,15 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* If we asked for subsystems (or explicitly for no
* subsystems) then they must match.
*/
- if ((opts.subsys_mask || opts.none) &&
- (opts.subsys_mask != root->subsys_mask)) {
+ if ((ctx->subsys_mask || ctx->none) &&
+ (ctx->subsys_mask != root->subsys_mask)) {
if (!name_match)
continue;
ret = -EBUSY;
- goto out_unlock;
+ goto err_unlock;
}

- if (root->flags ^ opts.flags)
+ if (root->flags ^ ctx->flags)
pr_warn("new mount options do not match the existing superblock, will be ignored\n");

/*
@@ -1195,11 +1227,10 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
mutex_unlock(&cgroup_mutex);
if (!IS_ERR_OR_NULL(pinned_sb))
deactivate_super(pinned_sb);
- msleep(10);
- ret = restart_syscall();
- goto out_free;
+ goto err_restart;
}

+ ctx->root = root;
ret = 0;
goto out_unlock;
}
@@ -1209,41 +1240,35 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* specification is allowed for already existing hierarchies but we
* can't create new one without subsys specification.
*/
- if (!opts.subsys_mask && !opts.none) {
- ret = -EINVAL;
- goto out_unlock;
+ if (!ctx->subsys_mask && !ctx->none) {
+ ret = cg_invalf(fc, "cgroup1: No subsys list or none specified");
+ goto err_unlock;
}

/* Hierarchies may only be created in the initial cgroup namespace. */
- if (ns != &init_cgroup_ns) {
+ if (ctx->ns != &init_cgroup_ns) {
ret = -EPERM;
- goto out_unlock;
+ goto err_unlock;
}

root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
- goto out_unlock;
+ goto err_unlock;
}
new_root = true;
+ ctx->root = root;

- init_cgroup_root(root, &opts);
+ init_cgroup_root(ctx);

- ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD);
+ ret = cgroup_setup_root(root, ctx->subsys_mask, PERCPU_REF_INIT_DEAD);
if (ret)
- cgroup_free_root(root);
+ goto err_unlock;

out_unlock:
mutex_unlock(&cgroup_mutex);
-out_free:
- kfree(opts.release_agent);
- kfree(opts.name);
-
- if (ret)
- return ERR_PTR(ret);

- dentry = cgroup_do_mount(&cgroup_fs_type, flags, root,
- CGROUP_SUPER_MAGIC, ns);
+ ret = cgroup_do_get_tree(fc);

/*
* There's a race window after we release cgroup_mutex and before
@@ -1256,6 +1281,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
percpu_ref_reinit(&root->cgrp.self.refcnt);
mutex_unlock(&cgroup_mutex);
}
+ cgroup_get(&root->cgrp);

/*
* If @pinned_sb, we're reusing an existing root and holding an
@@ -1264,7 +1290,14 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
if (pinned_sb)
deactivate_super(pinned_sb);

- return dentry;
+ return ret;
+
+err_restart:
+ msleep(10);
+ return restart_syscall();
+err_unlock:
+ mutex_unlock(&cgroup_mutex);
+ return ret;
}

static int __init cgroup1_wq_init(void)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 48dbf249bec5..3c3c40cad257 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -54,6 +54,7 @@
#include <linux/proc_ns.h>
#include <linux/nsproxy.h>
#include <linux/file.h>
+#include <linux/fs_parser.h>
#include <linux/sched/cputime.h>
#include <net/sock.h>

@@ -1737,25 +1738,51 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
return len;
}

-static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
-{
- char *token;
+enum cgroup2_param {
+ Opt_nsdelegate,
+ nr__cgroup2_params
+};

- *root_flags = 0;
+static const struct fs_parameter_spec cgroup2_param_specs[nr__cgroup2_params] = {
+ [Opt_nsdelegate] = { fs_param_is_flag },
+};

- if (!data)
- return 0;
+static const char *const cgroup2_param_keys[nr__cgroup2_params] = {
+ [Opt_nsdelegate] = "nsdelegate",
+};

- while ((token = strsep(&data, ",")) != NULL) {
- if (!strcmp(token, "nsdelegate")) {
- *root_flags |= CGRP_ROOT_NS_DELEGATE;
- continue;
- }
+static const struct fs_parameter_description cgroup2_fs_parameters = {
+ .name = "cgroup2",
+ .nr_params = nr__cgroup2_params,
+ .keys = cgroup2_param_keys,
+ .specs = cgroup2_param_specs,
+ .no_source = true,
+};

- pr_err("cgroup2: unknown option \"%s\"\n", token);
- return -EINVAL;
+static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+ struct fs_parse_result result;
+ int opt;
+
+ opt = fs_parse(fc, &cgroup2_fs_parameters, param, &result);
+ if (opt < 0)
+ return opt;
+
+ switch (opt) {
+ case Opt_nsdelegate:
+ ctx->flags |= CGRP_ROOT_NS_DELEGATE;
+ return 0;
}

+ return -EINVAL;
+}
+
+static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
+{
+ if (current->nsproxy->cgroup_ns == &init_cgroup_ns &&
+ cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
+ seq_puts(seq, ",nsdelegate");
return 0;
}

@@ -1769,23 +1796,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
}
}

-static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
-{
- if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
- seq_puts(seq, ",nsdelegate");
- return 0;
-}
-
-static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data)
+static int cgroup_reconfigure(struct kernfs_root *kf_root, struct fs_context *fc)
{
- unsigned int root_flags;
- int ret;
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);

- ret = parse_cgroup_root_flags(data, &root_flags);
- if (ret)
- return ret;
-
- apply_cgroup_root_flags(root_flags);
+ apply_cgroup_root_flags(ctx->flags);
return 0;
}

@@ -1873,8 +1888,9 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
}

-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
+void init_cgroup_root(struct cgroup_fs_context *ctx)
{
+ struct cgroup_root *root = ctx->root;
struct cgroup *cgrp = &root->cgrp;

INIT_LIST_HEAD(&root->root_list);
@@ -1883,12 +1899,12 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
init_cgroup_housekeeping(cgrp);
idr_init(&root->cgroup_idr);

- root->flags = opts->flags;
- if (opts->release_agent)
- strscpy(root->release_agent_path, opts->release_agent, PATH_MAX);
- if (opts->name)
- strscpy(root->name, opts->name, MAX_CGROUP_ROOT_NAMELEN);
- if (opts->cpuset_clone_children)
+ root->flags = ctx->flags;
+ if (ctx->release_agent)
+ strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
+ if (ctx->name)
+ strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN);
+ if (ctx->cpuset_clone_children)
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}

@@ -1993,57 +2009,53 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
return ret;
}

-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
- struct cgroup_root *root, unsigned long magic,
- struct cgroup_namespace *ns)
+int cgroup_do_get_tree(struct fs_context *fc)
{
- struct dentry *dentry;
- bool new_sb;
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+ int ret;

- dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
+ ctx->kfc.root = ctx->root->kf_root;
+
+ ret = kernfs_get_tree(fc);
+ if (ret < 0)
+ goto out_cgrp;

/*
* In non-init cgroup namespace, instead of root cgroup's dentry,
* we return the dentry corresponding to the cgroupns->root_cgrp.
*/
- if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+ if (ctx->ns != &init_cgroup_ns) {
struct dentry *nsdentry;
struct cgroup *cgrp;

mutex_lock(&cgroup_mutex);
spin_lock_irq(&css_set_lock);

- cgrp = cset_cgroup_from_root(ns->root_cset, root);
+ cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);

spin_unlock_irq(&css_set_lock);
mutex_unlock(&cgroup_mutex);

- nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
- dput(dentry);
- dentry = nsdentry;
+ nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
+ if (IS_ERR(nsdentry))
+ return PTR_ERR(nsdentry);
+ dput(fc->root);
+ fc->root = nsdentry;
}

- if (IS_ERR(dentry) || !new_sb)
- cgroup_put(&root->cgrp);
+ ret = 0;
+ if (ctx->kfc.new_sb_created)
+ goto out_cgrp;
+ apply_cgroup_root_flags(ctx->flags);
+ return 0;

- return dentry;
+out_cgrp:
+ return ret;
}

-static struct dentry *cgroup_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data, size_t data_size)
+static int cgroup_get_tree(struct fs_context *fc)
{
- struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
- struct dentry *dentry;
- int ret;
-
- get_cgroup_ns(ns);
-
- /* Check if the caller has permission to mount. */
- if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
- put_cgroup_ns(ns);
- return ERR_PTR(-EPERM);
- }
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);

/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -2052,29 +2064,96 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
if (!use_task_css_set_links)
cgroup_enable_task_cg_lists();

- if (fs_type == &cgroup2_fs_type) {
- unsigned int root_flags;
-
- ret = parse_cgroup_root_flags(data, &root_flags);
- if (ret) {
- put_cgroup_ns(ns);
- return ERR_PTR(ret);
- }
+ switch (ctx->version) {
+ case 1:
+ return cgroup1_get_tree(fc);

+ case 2:
cgrp_dfl_visible = true;
cgroup_get_live(&cgrp_dfl_root.cgrp);

- dentry = cgroup_do_mount(&cgroup2_fs_type, flags, &cgrp_dfl_root,
- CGROUP2_SUPER_MAGIC, ns);
- if (!IS_ERR(dentry))
- apply_cgroup_root_flags(root_flags);
- } else {
- dentry = cgroup1_mount(&cgroup_fs_type, flags, data,
- CGROUP_SUPER_MAGIC, ns);
+ ctx->root = &cgrp_dfl_root;
+ return cgroup_do_get_tree(fc);
+
+ default:
+ BUG();
+ }
+}
+
+static int cgroup_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+
+ if (ctx->version == 1)
+ return cgroup1_parse_param(fc, param);
+
+ return cgroup2_parse_param(fc, param);
+}
+
+static int cgroup_validate(struct fs_context *fc)
+{
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+
+ if (ctx->version == 1)
+ return cgroup1_validate(fc);
+ return 0;
+}
+
+/*
+ * Destroy a cgroup filesystem context.
+ */
+static void cgroup_fs_context_free(struct fs_context *fc)
+{
+ struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+
+ kfree(ctx->name);
+ kfree(ctx->release_agent);
+ if (ctx->root)
+ cgroup_put(&ctx->root->cgrp);
+ put_cgroup_ns(ctx->ns);
+ kernfs_free_fs_context(fc);
+ kfree(ctx);
+}
+
+static const struct fs_context_operations cgroup_fs_context_ops = {
+ .free = cgroup_fs_context_free,
+ .parse_param = cgroup_parse_param,
+ .validate = cgroup_validate,
+ .get_tree = cgroup_get_tree,
+ .reconfigure = kernfs_reconfigure,
+};
+
+/*
+ * Initialise the cgroup filesystem creation/reconfiguration context. Notably,
+ * we select the namespace we're going to use.
+ */
+static int cgroup_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+ struct cgroup_fs_context *ctx;
+ struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+
+ switch (fc->purpose) {
+ case FS_CONTEXT_FOR_UMOUNT:
+ case FS_CONTEXT_FOR_EMERGENCY_RO:
+ return -EOPNOTSUPP;
+ default:
+ break;
}

- put_cgroup_ns(ns);
- return dentry;
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ctx = kzalloc(sizeof(struct cgroup_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->ns = get_cgroup_ns(ns);
+ ctx->version = (fc->fs_type == &cgroup2_fs_type) ? 2 : 1;
+ ctx->kfc.magic = (ctx->version == 2) ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC;
+ fc->fs_private = &ctx->kfc;
+ fc->ops = &cgroup_fs_context_ops;
+ return 0;
}

static void cgroup_kill_sb(struct super_block *sb)
@@ -2099,17 +2178,19 @@ static void cgroup_kill_sb(struct super_block *sb)
}

struct file_system_type cgroup_fs_type = {
- .name = "cgroup",
- .mount = cgroup_mount,
- .kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "cgroup",
+ .init_fs_context = cgroup_init_fs_context,
+ .parameters = &cgroup1_fs_parameters,
+ .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

static struct file_system_type cgroup2_fs_type = {
- .name = "cgroup2",
- .mount = cgroup_mount,
- .kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "cgroup2",
+ .init_fs_context = cgroup_init_fs_context,
+ .parameters = &cgroup2_fs_parameters,
+ .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
@@ -5179,7 +5260,7 @@ int cgroup_rmdir(struct kernfs_node *kn)

static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
.show_options = cgroup_show_options,
- .remount_fs = cgroup_remount,
+ .reconfigure = cgroup_reconfigure,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.show_path = cgroup_show_path,
@@ -5246,11 +5327,12 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
*/
int __init cgroup_init_early(void)
{
- static struct cgroup_sb_opts __initdata opts;
+ static struct cgroup_fs_context __initdata ctx;
struct cgroup_subsys *ss;
int i;

- init_cgroup_root(&cgrp_dfl_root, &opts);
+ ctx.root = &cgrp_dfl_root;
+ init_cgroup_root(&ctx);
cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;

RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index df78e166028c..b4ad1a52f006 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -324,10 +324,8 @@ static int cpuset_get_tree(struct fs_context *fc)
int ret = -ENODEV;

cgroup_fs = get_fs_type("cgroup");
- if (cgroup_fs) {
- ret = PTR_ERR(cgroup_fs);
+ if (!cgroup_fs)
goto out;
- }

cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->sb_flags,
fc->purpose);
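
For readers following the cgroup1 conversion above, the split between per-parameter parsing (cgroup1_parse_param) and the final cross-option check (cgroup1_validate) can be modelled in userspace. This is only a minimal sketch under stated assumptions: struct model_ctx, model_parse_param(), model_validate() and the three-entry subsystem table are hypothetical names invented for illustration, name character checking is omitted, and a bool stands in for CGRP_ROOT_NOPREFIX. It is not the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the cgroup1 fields of struct cgroup_fs_context. */
struct model_ctx {
	bool none;            /* saw "none" */
	bool all_ss;          /* saw "all" */
	bool one_ss;          /* saw at least one subsystem name */
	bool noprefix;        /* stands in for CGRP_ROOT_NOPREFIX */
	unsigned subsys_mask; /* bit i set means subsystem i is selected */
	const char *name;     /* hierarchy name, or NULL (not owned here) */
};

/* Hypothetical subsystem table; the kernel walks for_each_subsys() instead. */
static const char *const model_subsys[] = { "cpuset", "cpu", "memory" };
#define MODEL_NR_SUBSYS (sizeof(model_subsys) / sizeof(model_subsys[0]))
#define MODEL_CPUSET_ID 0

/* Called once per parameter, like ->parse_param(). Returns 0 or -1. */
static int model_parse_param(struct model_ctx *ctx, const char *key,
			     const char *value)
{
	size_t i;

	if (strcmp(key, "none") == 0) {
		ctx->none = true;
		return 0;
	}
	if (strcmp(key, "all") == 0) {
		if (ctx->one_ss)        /* 'all' conflicts with a subsys name */
			return -1;
		ctx->all_ss = true;
		return 0;
	}
	if (strcmp(key, "noprefix") == 0) {
		ctx->noprefix = true;
		return 0;
	}
	if (strcmp(key, "name") == 0) {
		if (!value || !*value || ctx->name) /* empty or respecified */
			return -1;
		ctx->name = value;
		return 0;
	}
	/* Anything else must be a subsystem name. */
	for (i = 0; i < MODEL_NR_SUBSYS; i++) {
		if (strcmp(key, model_subsys[i]) != 0)
			continue;
		if (ctx->all_ss)        /* subsys name conflicts with 'all' */
			return -1;
		ctx->subsys_mask |= 1u << i;
		ctx->one_ss = true;
		return 0;
	}
	return -1;                      /* unknown subsystem name */
}

/* Run once after all parameters, like ->validate(). Returns 0 or -1. */
static int model_validate(struct model_ctx *ctx)
{
	unsigned not_cpuset = ~(1u << MODEL_CPUSET_ID);
	size_t i;

	/* 'all', or nothing selecting a hierarchy at all, means everything. */
	if (ctx->all_ss || (!ctx->one_ss && !ctx->none && !ctx->name))
		for (i = 0; i < MODEL_NR_SUBSYS; i++)
			ctx->subsys_mask |= 1u << i;

	if (!ctx->subsys_mask && !ctx->name)
		return -1;              /* need a name or a subsystem set */
	if (ctx->noprefix && (ctx->subsys_mask & not_cpuset))
		return -1;              /* noprefix only with cpuset alone */
	if (ctx->subsys_mask && ctx->none)
		return -1;              /* "none" plus subsystems */
	return 0;
}
```

The point of the split is that order-dependent conflicts ('all' against a subsystem name) are caught as each parameter arrives, while checks that need the complete option set (defaulting to all subsystems, noprefix, none) wait until validation.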


2018-09-21 16:34:26

by David Howells

Subject: [PATCH 20/34] cpuset: Use fs_context [ver #12]

Make the cpuset filesystem use the filesystem context. This is potentially
tricky as the cpuset fs is almost an alias for the cgroup filesystem, but
with some special parameters.

This can, however, be handled by setting up an appropriate cgroup
filesystem and returning the root directory of that as the root dir of this
one.

Signed-off-by: David Howells <[email protected]>
cc: Tejun Heo <[email protected]>
---
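
As an illustration of the approach described above: the fixed cpuset option string has to be broken into individual key[=value] parameters before a cgroup context can consume it. The following is a rough userspace model of that splitting step only, not of generic_parse_monolithic() itself. The names model_param_fn, model_parse_monolithic(), struct model_seen and model_record() are hypothetical and exist only for this sketch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type standing in for a ->parse_param() operation. */
typedef int (*model_param_fn)(void *ctx, const char *key, const char *value);

/*
 * Split a comma-separated option string into key[=value] items and pass each
 * one to @fn.  The buffer is modified in place, which is why a constant
 * option string has to be duplicated before it is parsed.
 */
static int model_parse_monolithic(char *opts, model_param_fn fn, void *ctx)
{
	char *key, *next, *value;
	int ret;

	for (key = opts; key && *key; key = next) {
		next = strchr(key, ',');
		if (next)
			*next++ = '\0';
		if (!*key)
			continue;       /* empty item, e.g. "a,,b" */
		value = strchr(key, '=');
		if (value == key)
			continue;       /* "=value" with no key */
		if (value)
			*value++ = '\0';
		ret = fn(ctx, key, value);
		if (ret < 0)
			return ret;
	}
	return 0;
}

/* Hypothetical recorder used to inspect what the splitter produced. */
struct model_seen {
	int count;
	char keys[4][32];
	char values[4][64];
};

static int model_record(void *ctx, const char *key, const char *value)
{
	struct model_seen *seen = ctx;

	if (seen->count >= 4)
		return -1;
	snprintf(seen->keys[seen->count], sizeof(seen->keys[0]), "%s", key);
	snprintf(seen->values[seen->count], sizeof(seen->values[0]), "%s",
		 value ? value : "");
	seen->count++;
	return 0;
}
```

Fed a writable copy of "cpuset,noprefix,release_agent=/sbin/cpuset_release_agent", the recorder sees three parameters: two bare flags and one key with a string value.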

kernel/cgroup/cpuset.c | 67 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6d9f1a709af9..df78e166028c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -38,7 +38,7 @@
#include <linux/mm.h>
#include <linux/memory.h>
#include <linux/export.h>
-#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/namei.h>
#include <linux/pagemap.h>
#include <linux/proc_fs.h>
@@ -315,26 +315,65 @@ static inline bool is_in_v2_mode(void)
* users. If someone tries to mount the "cpuset" filesystem, we
* silently switch it to mount "cgroup" instead
*/
-static struct dentry *cpuset_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data, size_t data_size)
+static int cpuset_get_tree(struct fs_context *fc)
{
- struct file_system_type *cgroup_fs = get_fs_type("cgroup");
- struct dentry *ret = ERR_PTR(-ENODEV);
+ static const char opts[] = "cpuset,noprefix,release_agent=/sbin/cpuset_release_agent";
+ struct file_system_type *cgroup_fs;
+ struct fs_context *cg_fc;
+ char *p;
+ int ret = -ENODEV;
+
+ cgroup_fs = get_fs_type("cgroup");
if (cgroup_fs) {
- char mountopts[] =
- "cpuset,noprefix,"
- "release_agent=/sbin/cpuset_release_agent";
- ret = cgroup_fs->mount(cgroup_fs, flags, unused_dev_name,
- mountopts, data_size);
- put_filesystem(cgroup_fs);
+ ret = PTR_ERR(cgroup_fs);
+ goto out;
+ }
+
+ cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->sb_flags,
+ fc->purpose);
+ put_filesystem(cgroup_fs);
+ if (IS_ERR(cg_fc)) {
+ ret = PTR_ERR(cg_fc);
+ goto out;
}
+
+ ret = -ENOMEM;
+ p = kstrdup(opts, GFP_KERNEL);
+ if (!p)
+ goto out_fc;
+
+ ret = generic_parse_monolithic(fc, p, sizeof(opts) - 1);
+ kfree(p);
+ if (ret < 0)
+ goto out_fc;
+
+ ret = vfs_get_tree(cg_fc);
+ if (ret < 0)
+ goto out_fc;
+
+ fc->root = dget(cg_fc->root);
+ ret = 0;
+
+out_fc:
+ put_fs_context(cg_fc);
+out:
return ret;
}

+static const struct fs_context_operations cpuset_fs_context_ops = {
+ .get_tree = cpuset_get_tree,
+};
+
+static int cpuset_init_fs_context(struct fs_context *fc,
+ struct dentry *reference)
+{
+ fc->ops = &cpuset_fs_context_ops;
+ return 0;
+}
+
static struct file_system_type cpuset_fs_type = {
- .name = "cpuset",
- .mount = cpuset_mount,
+ .name = "cpuset",
+ .init_fs_context = cpuset_init_fs_context,
};

/*


2018-09-21 16:34:27

by David Howells

Subject: [PATCH 22/34] hugetlbfs: Convert to fs_context [ver #12]

Convert hugetlbfs to use the fs_context during mount.

Signed-off-by: David Howells <[email protected]>
---
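For readers following the parsing change, here is a rough userspace sketch
of how the size= and min_size= values are handled in
hugetlbfs_parse_param(): reject a value that does not start with a digit,
parse the number with an optional K/M/G suffix, then check for a trailing
'%'. The names memparse_sketch() and parse_size_opt() are hypothetical
stand-ins for illustration only; the kernel's memparse() accepts more
suffixes than this sketch does.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

enum size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };

/*
 * Hypothetical userspace stand-in for the kernel's memparse(): parse a
 * number and scale it by an optional K/M/G suffix.  *rest is left pointing
 * at the first character that was not consumed.
 */
static unsigned long long memparse_sketch(const char *s, char **rest)
{
	unsigned long long v = strtoull(s, rest, 0);

	switch (**rest) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*rest)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Parse one size-style option value.  Returns 0 and fills *val and *type,
 * or -1 if the value does not begin with a digit (a bare "K" would
 * otherwise be accepted and parsed as zero).
 */
static int parse_size_opt(const char *s, unsigned long long *val,
			  enum size_type *type)
{
	char *rest;

	assert(s && val && type);
	if (!isdigit((unsigned char)s[0]))
		return -1;
	*val = memparse_sketch(s, &rest);
	*type = (*rest == '%') ? SIZE_PERCENT : SIZE_STD;
	return 0;
}
```

In the patch itself the parsed value and its type are stored in the
hugetlbfs_fs_context and only converted to a huge page count later, in
hugetlbfs_validate(), once the page size is known.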

fs/hugetlbfs/inode.c | 391 +++++++++++++++++++++++++++++---------------------
1 file changed, 230 insertions(+), 161 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 4fa2e644fa11..700b009af8e4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -27,7 +27,7 @@
#include <linux/backing-dev.h>
#include <linux/hugetlb.h>
#include <linux/pagevec.h>
-#include <linux/parser.h>
+#include <linux/fs_parser.h>
#include <linux/mman.h>
#include <linux/slab.h>
#include <linux/dnotify.h>
@@ -45,11 +45,17 @@ const struct file_operations hugetlbfs_file_operations;
static const struct inode_operations hugetlbfs_dir_inode_operations;
static const struct inode_operations hugetlbfs_inode_operations;

-struct hugetlbfs_config {
+enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
+
+struct hugetlbfs_fs_context {
struct hstate *hstate;
+ unsigned long long max_size_opt;
+ unsigned long long min_size_opt;
long max_hpages;
long nr_inodes;
long min_hpages;
+ enum hugetlbfs_size_type max_val_type;
+ enum hugetlbfs_size_type min_val_type;
kuid_t uid;
kgid_t gid;
umode_t mode;
@@ -57,22 +63,43 @@ struct hugetlbfs_config {

int sysctl_hugetlb_shm_group;

-enum {
- Opt_size, Opt_nr_inodes,
- Opt_mode, Opt_uid, Opt_gid,
- Opt_pagesize, Opt_min_size,
- Opt_err,
+enum hugetlb_param {
+ Opt_gid,
+ Opt_min_size,
+ Opt_mode,
+ Opt_nr_inodes,
+ Opt_pagesize,
+ Opt_size,
+ Opt_uid,
+ nr__hugetlb_params
+};
+
+static const struct fs_parameter_spec hugetlb_param_specs[nr__hugetlb_params] = {
+ [Opt_gid] = { fs_param_is_u32 },
+ [Opt_min_size] = { fs_param_is_string },
+ [Opt_mode] = { fs_param_is_u32 },
+ [Opt_nr_inodes] = { fs_param_is_string },
+ [Opt_pagesize] = { fs_param_is_string },
+ [Opt_size] = { fs_param_is_string },
+ [Opt_uid] = { fs_param_is_u32 },
+};
+
+static const char *const hugetlb_param_keys[nr__hugetlb_params] = {
+ [Opt_gid] = "gid",
+ [Opt_min_size] = "min_size",
+ [Opt_mode] = "mode",
+ [Opt_nr_inodes] = "nr_inodes",
+ [Opt_pagesize] = "pagesize",
+ [Opt_size] = "size",
+ [Opt_uid] = "uid",
};

-static const match_table_t tokens = {
- {Opt_size, "size=%s"},
- {Opt_nr_inodes, "nr_inodes=%s"},
- {Opt_mode, "mode=%o"},
- {Opt_uid, "uid=%u"},
- {Opt_gid, "gid=%u"},
- {Opt_pagesize, "pagesize=%s"},
- {Opt_min_size, "min_size=%s"},
- {Opt_err, NULL},
+static const struct fs_parameter_description hugetlb_fs_parameters = {
+ .name = "hugetlbfs",
+ .nr_params = nr__hugetlb_params,
+ .keys = hugetlb_param_keys,
+ .specs = hugetlb_param_specs,
+ .no_source = true,
};

#ifdef CONFIG_NUMA
@@ -708,16 +735,16 @@ static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
}

static struct inode *hugetlbfs_get_root(struct super_block *sb,
- struct hugetlbfs_config *config)
+ struct hugetlbfs_fs_context *ctx)
{
struct inode *inode;

inode = new_inode(sb);
if (inode) {
inode->i_ino = get_next_ino();
- inode->i_mode = S_IFDIR | config->mode;
- inode->i_uid = config->uid;
- inode->i_gid = config->gid;
+ inode->i_mode = S_IFDIR | ctx->mode;
+ inode->i_uid = ctx->uid;
+ inode->i_gid = ctx->gid;
inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
inode->i_op = &hugetlbfs_dir_inode_operations;
inode->i_fop = &simple_dir_operations;
@@ -1081,8 +1108,6 @@ static const struct super_operations hugetlbfs_ops = {
.show_options = hugetlbfs_show_options,
};

-enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
-
/*
* Convert size option passed from command line to number of huge pages
* in the pool specified by hstate. Size option could be in bytes
@@ -1105,171 +1130,151 @@ hugetlbfs_size_to_hpages(struct hstate *h, unsigned long long size_opt,
return size_opt;
}

-static int
-hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig)
+/*
+ * Parse one mount parameter.
+ */
+static int hugetlbfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
{
- char *p, *rest;
- substring_t args[MAX_OPT_ARGS];
- int option;
- unsigned long long max_size_opt = 0, min_size_opt = 0;
- enum hugetlbfs_size_type max_val_type = NO_SIZE, min_val_type = NO_SIZE;
-
- if (!options)
+ struct hugetlbfs_fs_context *ctx = fc->fs_private;
+ struct fs_parse_result result;
+ char *rest;
+ unsigned long ps;
+ int opt;
+
+ opt = fs_parse(fc, &hugetlb_fs_parameters, param, &result);
+ if (opt < 0)
+ return opt;
+
+ switch (opt) {
+ case Opt_uid:
+ ctx->uid = make_kuid(current_user_ns(), result.uint_32);
+ if (!uid_valid(ctx->uid))
+ goto bad_val;
return 0;

- while ((p = strsep(&options, ",")) != NULL) {
- int token;
- if (!*p)
- continue;
+ case Opt_gid:
+ ctx->gid = make_kgid(current_user_ns(), result.uint_32);
+ if (!gid_valid(ctx->gid))
+ goto bad_val;
+ return 0;

- token = match_token(p, tokens, args);
- switch (token) {
- case Opt_uid:
- if (match_int(&args[0], &option))
- goto bad_val;
- pconfig->uid = make_kuid(current_user_ns(), option);
- if (!uid_valid(pconfig->uid))
- goto bad_val;
- break;
+ case Opt_mode:
+ ctx->mode = result.uint_32 & 01777U;
+ return 0;

- case Opt_gid:
- if (match_int(&args[0], &option))
- goto bad_val;
- pconfig->gid = make_kgid(current_user_ns(), option);
- if (!gid_valid(pconfig->gid))
- goto bad_val;
- break;
+ case Opt_size:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(param->string[0]))
+ goto bad_val;
+ ctx->max_size_opt = memparse(param->string, &rest);
+ ctx->max_val_type = SIZE_STD;
+ if (*rest == '%')
+ ctx->max_val_type = SIZE_PERCENT;
+ return 0;

- case Opt_mode:
- if (match_octal(&args[0], &option))
- goto bad_val;
- pconfig->mode = option & 01777U;
- break;
+ case Opt_nr_inodes:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(param->string[0]))
+ goto bad_val;
+ ctx->nr_inodes = memparse(param->string, &rest);
+ return 0;

- case Opt_size: {
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- max_size_opt = memparse(args[0].from, &rest);
- max_val_type = SIZE_STD;
- if (*rest == '%')
- max_val_type = SIZE_PERCENT;
- break;
+ case Opt_pagesize:
+ ps = memparse(param->string, &rest);
+ ctx->hstate = size_to_hstate(ps);
+ if (!ctx->hstate) {
+ pr_err("Unsupported page size %lu MB\n", ps >> 20);
+ return -EINVAL;
}
+ return 0;

- case Opt_nr_inodes:
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- pconfig->nr_inodes = memparse(args[0].from, &rest);
- break;
+ case Opt_min_size:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(param->string[0]))
+ goto bad_val;
+ ctx->min_size_opt = memparse(param->string, &rest);
+ ctx->min_val_type = SIZE_STD;
+ if (*rest == '%')
+ ctx->min_val_type = SIZE_PERCENT;
+ return 0;

- case Opt_pagesize: {
- unsigned long ps;
- ps = memparse(args[0].from, &rest);
- pconfig->hstate = size_to_hstate(ps);
- if (!pconfig->hstate) {
- pr_err("Unsupported page size %lu MB\n",
- ps >> 20);
- return -EINVAL;
- }
- break;
- }
+ default:
+ return -EINVAL;
+ }

- case Opt_min_size: {
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- min_size_opt = memparse(args[0].from, &rest);
- min_val_type = SIZE_STD;
- if (*rest == '%')
- min_val_type = SIZE_PERCENT;
- break;
- }
+bad_val:
+ return invalf(fc, "hugetlbfs: Bad value '%s' for mount option '%s'\n",
+ param->string, param->key);
+}

- default:
- pr_err("Bad mount option: \"%s\"\n", p);
- return -EINVAL;
- break;
- }
- }
+/*
+ * Validate the parsed options.
+ */
+static int hugetlbfs_validate(struct fs_context *fc)
+{
+ struct hugetlbfs_fs_context *ctx = fc->fs_private;

/*
* Use huge page pool size (in hstate) to convert the size
* options to number of huge pages. If NO_SIZE, -1 is returned.
*/
- pconfig->max_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
- max_size_opt, max_val_type);
- pconfig->min_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
- min_size_opt, min_val_type);
+ ctx->max_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+ ctx->max_size_opt,
+ ctx->max_val_type);
+ ctx->min_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+ ctx->min_size_opt,
+ ctx->min_val_type);

/*
* If max_size was specified, then min_size must be smaller
*/
- if (max_val_type > NO_SIZE &&
- pconfig->min_hpages > pconfig->max_hpages) {
- pr_err("minimum size can not be greater than maximum size\n");
+ if (ctx->max_val_type > NO_SIZE &&
+ ctx->min_hpages > ctx->max_hpages) {
+ pr_err("Minimum size can not be greater than maximum size\n");
return -EINVAL;
}

return 0;
-
-bad_val:
- pr_err("Bad value '%s' for mount option '%s'\n", args[0].from, p);
- return -EINVAL;
}

static int
-hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
- int silent)
+hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
- int ret;
- struct hugetlbfs_config config;
+ struct hugetlbfs_fs_context *ctx = fc->fs_private;
struct hugetlbfs_sb_info *sbinfo;

- config.max_hpages = -1; /* No limit on size by default */
- config.nr_inodes = -1; /* No limit on number of inodes by default */
- config.uid = current_fsuid();
- config.gid = current_fsgid();
- config.mode = 0755;
- config.hstate = &default_hstate;
- config.min_hpages = -1; /* No default minimum size */
- ret = hugetlbfs_parse_options(data, &config);
- if (ret)
- return ret;
-
sbinfo = kmalloc(sizeof(struct hugetlbfs_sb_info), GFP_KERNEL);
if (!sbinfo)
return -ENOMEM;
sb->s_fs_info = sbinfo;
- sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
- sbinfo->max_inodes = config.nr_inodes;
- sbinfo->free_inodes = config.nr_inodes;
- sbinfo->spool = NULL;
- sbinfo->uid = config.uid;
- sbinfo->gid = config.gid;
- sbinfo->mode = config.mode;
+ sbinfo->hstate = ctx->hstate;
+ sbinfo->max_inodes = ctx->nr_inodes;
+ sbinfo->free_inodes = ctx->nr_inodes;
+ sbinfo->spool = NULL;
+ sbinfo->uid = ctx->uid;
+ sbinfo->gid = ctx->gid;
+ sbinfo->mode = ctx->mode;

/*
* Allocate and initialize subpool if maximum or minimum size is
* specified. Any needed reservations (for minimim size) are taken
* taken when the subpool is created.
*/
- if (config.max_hpages != -1 || config.min_hpages != -1) {
- sbinfo->spool = hugepage_new_subpool(config.hstate,
- config.max_hpages,
- config.min_hpages);
+ if (ctx->max_hpages != -1 || ctx->min_hpages != -1) {
+ sbinfo->spool = hugepage_new_subpool(ctx->hstate,
+ ctx->max_hpages,
+ ctx->min_hpages);
if (!sbinfo->spool)
goto out_free;
}
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = huge_page_size(config.hstate);
- sb->s_blocksize_bits = huge_page_shift(config.hstate);
+ sb->s_blocksize = huge_page_size(ctx->hstate);
+ sb->s_blocksize_bits = huge_page_shift(ctx->hstate);
sb->s_magic = HUGETLBFS_MAGIC;
sb->s_op = &hugetlbfs_ops;
sb->s_time_gran = 1;
- sb->s_root = d_make_root(hugetlbfs_get_root(sb, &config));
+ sb->s_root = d_make_root(hugetlbfs_get_root(sb, ctx));
if (!sb->s_root)
goto out_free;
return 0;
@@ -1279,17 +1284,51 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
return -ENOMEM;
}

-static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data, size_t data_size)
+static int hugetlbfs_get_tree(struct fs_context *fc)
+{
+ return vfs_get_super(fc, vfs_get_independent_super, hugetlbfs_fill_super);
+}
+
+static void hugetlbfs_fs_context_free(struct fs_context *fc)
+{
+ kfree(fc->fs_private);
+}
+
+static const struct fs_context_operations hugetlbfs_fs_context_ops = {
+ .free = hugetlbfs_fs_context_free,
+ .parse_param = hugetlbfs_parse_param,
+ .validate = hugetlbfs_validate,
+ .get_tree = hugetlbfs_get_tree,
+};
+
+static int hugetlbfs_init_fs_context(struct fs_context *fc,
+ struct dentry *reference)
{
- return mount_nodev(fs_type, flags, data, data_size,
- hugetlbfs_fill_super);
+ struct hugetlbfs_fs_context *ctx;
+
+ ctx = kzalloc(sizeof(struct hugetlbfs_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->max_hpages = -1; /* No limit on size by default */
+ ctx->nr_inodes = -1; /* No limit on number of inodes by default */
+ ctx->uid = current_fsuid();
+ ctx->gid = current_fsgid();
+ ctx->mode = 0755;
+ ctx->hstate = &default_hstate;
+ ctx->min_hpages = -1; /* No default minimum size */
+ ctx->max_val_type = NO_SIZE;
+ ctx->min_val_type = NO_SIZE;
+ fc->fs_private = ctx;
+ fc->ops = &hugetlbfs_fs_context_ops;
+ return 0;
}

static struct file_system_type hugetlbfs_fs_type = {
- .name = "hugetlbfs",
- .mount = hugetlbfs_mount,
- .kill_sb = kill_litter_super,
+ .name = "hugetlbfs",
+ .init_fs_context = hugetlbfs_init_fs_context,
+ .parameters = &hugetlb_fs_parameters,
+ .kill_sb = kill_litter_super,
};

static struct vfsmount *hugetlbfs_vfsmount[HUGE_MAX_HSTATE];
@@ -1374,8 +1413,47 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
return file;
}

+static struct vfsmount *__init mount_one_hugetlbfs(struct hstate *h)
+{
+ struct hugetlbfs_fs_context *ctx;
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&hugetlbfs_fs_type, NULL, 0, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc)) {
+ ret = PTR_ERR(fc);
+ goto err;
+ }
+
+ ctx = fc->fs_private;
+ ctx->hstate = h;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc, 0);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto err_fc;
+ }
+
+ put_fs_context(fc);
+ return mnt;
+
+err_fc:
+ put_fs_context(fc);
+err:
+ pr_err("Cannot mount internal hugetlbfs for page size %uK",
+ 1U << (h->order + PAGE_SHIFT - 10));
+ return ERR_PTR(ret);
+}
+
static int __init init_hugetlbfs_fs(void)
{
+ struct vfsmount *mnt;
struct hstate *h;
int error;
int i;
@@ -1398,25 +1476,16 @@ static int __init init_hugetlbfs_fs(void)

i = 0;
for_each_hstate(h) {
- char buf[50];
- unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
- int n;
-
- n = snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
- hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
- buf, n + 1);
-
- if (IS_ERR(hugetlbfs_vfsmount[i])) {
- pr_err("Cannot mount internal hugetlbfs for "
- "page size %uK", ps_kb);
- error = PTR_ERR(hugetlbfs_vfsmount[i]);
- hugetlbfs_vfsmount[i] = NULL;
+ mnt = mount_one_hugetlbfs(h);
+ if (IS_ERR(mnt) && i == 0) {
+ error = PTR_ERR(mnt);
+ goto out;
}
+ hugetlbfs_vfsmount[i] = mnt;
i++;
}
- /* Non default hstates are optional */
- if (!IS_ERR_OR_NULL(hugetlbfs_vfsmount[default_hstate_idx]))
- return 0;
+
+ return 0;

out:
kmem_cache_destroy(hugetlbfs_inode_cachep);


2018-09-21 16:34:45

by David Howells

Subject: [PATCH 24/34] vfs: Provide documentation for new mount API [ver #12]

Provide documentation for the new mount API.

Signed-off-by: David Howells <[email protected]>
---
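To make the description of generic_parse_monolithic() in this document
more concrete, here is a minimal userspace sketch of splitting a
comma-separated "key[=val]" list and handing each item to a callback.
parse_monolithic_sketch(), param_fn, struct seen and count_param() are
hypothetical names used only for illustration; the real function passes
each item through the VFS parameter parsing path, which this sketch does
not attempt to model.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical callback: gets one key and its value (NULL for a bare flag). */
typedef int (*param_fn)(void *ctx, const char *key, const char *value);

/*
 * Split a writable "key[=val],key[=val]" buffer in place and call fn for
 * each non-empty item.  Stops at, and returns, the first negative result.
 */
static int parse_monolithic_sketch(char *data, param_fn fn, void *ctx)
{
	char *item = data;

	assert(data && fn);
	while (item) {
		char *next = strchr(item, ',');
		char *value;
		int ret;

		if (next)
			*next++ = '\0';
		if (*item) {
			value = strchr(item, '=');
			if (value)
				*value++ = '\0';
			ret = fn(ctx, item, value);
			if (ret < 0)
				return ret;
		}
		item = next;
	}
	return 0;
}

/* Example callback state: counts parameters and valueless flags. */
struct seen {
	int nr;
	int nr_flags;
};

static int count_param(void *ctx, const char *key, const char *value)
{
	struct seen *s = ctx;

	(void)key;
	s->nr++;
	if (!value)
		s->nr_flags++;
	return 0;
}
```

Fed the option string used by the cpuset conversion earlier in this
series, the callback is invoked once for each of "cpuset", "noprefix" and
"release_agent", with a value only for the last.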

Documentation/filesystems/mount_api.txt | 741 +++++++++++++++++++++++++++++++
1 file changed, 741 insertions(+)
create mode 100644 Documentation/filesystems/mount_api.txt

diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.txt
new file mode 100644
index 000000000000..04f388567f92
--- /dev/null
+++ b/Documentation/filesystems/mount_api.txt
@@ -0,0 +1,741 @@
+ ====================
+ FILESYSTEM MOUNT API
+ ====================
+
+CONTENTS
+
+ (1) Overview.
+
+ (2) The filesystem context.
+
+ (3) The filesystem context operations.
+
+ (4) Filesystem context security.
+
+ (5) VFS filesystem context operations.
+
+ (6) Parameter description.
+
+ (7) Parameter helper functions.
+
+
+========
+OVERVIEW
+========
+
+The creation of new mounts is now to be done in a multistep process:
+
+ (1) Create a filesystem context.
+
+ (2) Parse the parameters and attach them to the context. Parameters are
+ expected to be passed individually from userspace, though legacy binary
+ parameters can also be handled.
+
+ (3) Validate and pre-process the context.
+
+ (4) Get or create a superblock and mountable root.
+
+ (5) Perform the mount.
+
+ (6) Return an error message attached to the context.
+
+ (7) Destroy the context.
+
+To support this, the file_system_type struct gains a new field:
+
+ int (*init_fs_context)(struct fs_context *fc, struct dentry *reference);
+
+which is invoked to set up the filesystem-specific parts of a filesystem
+context, including the additional space. The reference parameter is used to
+convey a superblock and an automount point or a point to reconfigure from which
+the filesystem may draw extra information (such as namespaces) for submount
+(FS_CONTEXT_FOR_SUBMOUNT) or reconfiguration (FS_CONTEXT_FOR_RECONFIGURE)
+purposes - otherwise it will be NULL.
+
+Note that security initialisation is done *after* the filesystem is called so
+that the namespaces may be adjusted first.
+
+If fc->purpose is FS_CONTEXT_FOR_UMOUNT or FS_CONTEXT_FOR_EMERGENCY_RO, then
+the function can return -EOPNOTSUPP to indicate that the filesystem isn't
+interested in handling that. The error will be ignored.
+
+
+======================
+THE FILESYSTEM CONTEXT
+======================
+
+The creation and reconfiguration of a superblock is governed by a filesystem
+context. This is represented by the fs_context structure:
+
+ struct fs_context {
+ const struct fs_context_operations *ops;
+ struct file_system_type *fs_type;
+ void *fs_private;
+ struct dentry *root;
+ struct user_namespace *user_ns;
+ struct net *net_ns;
+ const struct cred *cred;
+ char *source;
+ char *subtype;
+ void *security;
+ void *s_fs_info;
+ unsigned int sb_flags;
+ unsigned int sb_flags_mask;
+ enum fs_context_purpose purpose:8;
+ bool sloppy:1;
+ bool silent:1;
+ ...
+ };
+
+The fs_context fields are as follows:
+
+ (*) const struct fs_context_operations *ops
+
+ These are operations that can be done on a filesystem context (see
+ below). This must be set by the ->init_fs_context() file_system_type
+ operation.
+
+ (*) struct file_system_type *fs_type
+
+ A pointer to the file_system_type of the filesystem that is being
+ constructed or reconfigured. This retains a reference on the type owner.
+
+ (*) void *fs_private
+
+ A pointer to the file system's private data. This is where the filesystem
+ will need to store any options it parses.
+
+ (*) struct dentry *root
+
+ A pointer to the root of the mountable tree (and indirectly, the
+ superblock thereof). This is filled in by the ->get_tree() op. If this
+ is set, an active reference on root->d_sb must also be held.
+
+ (*) struct user_namespace *user_ns
+ (*) struct net *net_ns
+
+ These are a subset of the namespaces in use by the invoking process. They
+ retain references on each namespace. The subscribed namespaces may be
+ replaced by the filesystem to reflect other sources, such as the parent
+ mount superblock on an automount.
+
+ (*) const struct cred *cred
+
+ The mounter's credentials. This retains a reference on the credentials.
+
+ (*) char *source
+
+ This specifies the source. It may be a block device (e.g. /dev/sda1) or
+ something more exotic, such as the "host:/path" that NFS desires.
+
+ (*) char *subtype
+
+ This is a string to be added to the type displayed in /proc/mounts to
+ qualify it (used by FUSE). This is available for the filesystem to set if
+ desired.
+
+ (*) void *security
+
+ A place for the LSMs to hang their security data for the superblock. The
+ relevant security operations are described below.
+
+ (*) void *s_fs_info
+
+ The proposed s_fs_info for a new superblock, set in the superblock by
+ sget_fc(). This can be used to distinguish superblocks.
+
+ (*) unsigned int sb_flags
+ (*) unsigned int sb_flags_mask
+
+ Which SB_* flag bits are to be set/cleared in super_block::s_flags.
+
+ (*) enum fs_context_purpose
+
+ This indicates the purpose for which the context is intended. The
+ available values are:
+
+ FS_CONTEXT_FOR_USER_MOUNT, -- New superblock for user-specified mount
+ FS_CONTEXT_FOR_KERNEL_MOUNT, -- New superblock for kernel-internal mount
+ FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
+ FS_CONTEXT_FOR_ROOT_MOUNT -- Behind-the-scenes root mount (nfs/btrfs)
+ FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
+ FS_CONTEXT_FOR_UMOUNT -- Reconfigure to R/O for umount()
+ FS_CONTEXT_FOR_EMERGENCY_RO -- Emergency reconfigure to R/O
+
+ In the last two cases, ->init_fs_context() will not have been called.
+
+ (*) bool sloppy
+ (*) bool silent
+
+ These are set if the sloppy or silent mount options are given.
+
+ [NOTE] sloppy is probably unnecessary when userspace passes over one
+ option at a time since the error can just be ignored if userspace deems it
+ to be unimportant.
+
+ [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
+
+The mount context is created by calling vfs_new_fs_context() or
+vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
+structure is not refcounted.
+
+VFS, security and filesystem mount options are set individually with
+vfs_parse_fs_param(). Options provided by the old mount(2) system call as
+a page of data can be parsed with generic_parse_monolithic().
+
+When mounting, the filesystem is allowed to take data from any of the pointers
+and attach it to the superblock (or whatever), provided it clears the pointer
+in the mount context.
+
+The filesystem is also allowed to allocate resources and pin them with the
+mount context. For instance, NFS might pin the appropriate protocol version
+module.
+
+
+=================================
+THE FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+The filesystem context points to a table of operations:
+
+ struct fs_context_operations {
+ void (*free)(struct fs_context *fc);
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ int (*parse_param)(struct fs_context *fc,
+ struct fs_parameter *param);
+ int (*parse_monolithic)(struct fs_context *fc, void *data,
+ size_t data_size);
+ int (*validate)(struct fs_context *fc);
+ int (*get_tree)(struct fs_context *fc);
+ int (*reconfigure)(struct fs_context *fc);
+ };
+
+These operations are invoked by the various stages of the mount procedure to
+manage the filesystem context. They are as follows:
+
+ (*) void (*free)(struct fs_context *fc);
+
+ Called to clean up the filesystem-specific part of the filesystem context
+ when the context is destroyed. It should be aware that parts of the
+ context may have been removed and NULL'd out by ->get_tree().
+
+ (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+
+ Called when a filesystem context has been duplicated to duplicate the
+ filesystem-private data. An error may be returned to indicate failure to
+ do this.
+
+ [!] Note that even if this fails, put_fs_context() will be called
+ immediately thereafter, so ->dup() *must* make the
+ filesystem-private data safe for ->free().
+
+ (*) int (*parse_param)(struct fs_context *fc,
+ struct fs_parameter *param);
+
+ Called when a parameter is being added to the filesystem context. param
+ points to the key name and maybe a value object. VFS-specific options
+ will have been weeded out and fc->sb_flags updated in the context.
+ Security options will also have been weeded out and fc->security updated.
+
+ The parameter can be parsed with fs_parse() and fs_lookup_param(). Note
+ that the source(s) are presented as parameters named "source".
+
+ If successful, 0 should be returned or a negative error code otherwise.
+
+ (*) int (*parse_monolithic)(struct fs_context *fc,
+ void *data, size_t data_size);
+
+ Called when the mount(2) system call is invoked to pass the entire data
+ page in one go. If this is expected to be just a list of "key[=val]"
+ items separated by commas, then this may be set to NULL.
+
+ The return value is as for ->parse_param().
+
+ If the filesystem (e.g. NFS) needs to examine the data first and then
+ finds it's the standard key-val list then it may pass it off to
+ generic_parse_monolithic().
+
+ (*) int (*validate)(struct fs_context *fc);
+
+ Called when all the options have been applied and the mount is about to
+ take place. It should check for inconsistencies from mount options and
+ it is also allowed to do preliminary resource acquisition. For instance,
+ the core NFS module could load the NFS protocol module here.
+
+ Note that if fc->purpose == FS_CONTEXT_FOR_RECONFIGURE, some of the
+ options necessary for a new mount may not be set.
+
+ The return value is as for ->parse_param().
+
+ (*) int (*get_tree)(struct fs_context *fc);
+
+ Called to get or create the mountable root and superblock, using the
+ information stored in the filesystem context (reconfiguration goes via a
+ different vector). It may detach any resources it desires from the
+ filesystem context and transfer them to the superblock it creates.
+
+ On success it should set fc->root to the mountable root and return 0. In
+ the case of an error, it should return a negative error code.
+
+ The phase on a userspace-driven context will be set to only allow this to
+ be called once on any particular context.
+
+ (*) int (*reconfigure)(struct fs_context *fc);
+
+ Called to effect reconfiguration of a superblock using information stored
+ in the filesystem context. It may detach any resources it desires from
+ the filesystem context and transfer them to the superblock. The
+ superblock can be found from fc->root->d_sb.
+
+ On success it should return 0. In the case of an error, it should return
+ a negative error code.
+
+ [NOTE] reconfigure is intended as a replacement for remount_fs.
+
+
+===========================
+FILESYSTEM CONTEXT SECURITY
+===========================
+
+The filesystem context contains a security pointer that the LSMs can use for
+building up a security context for the superblock to be mounted. There are a
+number of operations used by the new mount code for this purpose:
+
+ (*) int security_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. It should return 0 on success or a negative error
+ code on failure.
+
+ reference will be non-NULL if the context is being created for superblock
+ reconfiguration (FS_CONTEXT_FOR_RECONFIGURE) in which case it indicates
+ the root dentry of the superblock to be reconfigured. It will also be
+ non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case
+ it indicates the automount point.
+
+ (*) int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. The original filesystem context is pointed to by
+ src_fc and may be used for reference. It should return 0 on success or a
+ negative error code on failure.
+
+ (*) void security_fs_context_free(struct fs_context *fc);
+
+ Called to clean up anything attached to fc->security. Note that the
+ contents may have been transferred to a superblock and the pointer cleared
+ during get_tree.
+
+ (*) int security_fs_context_parse_param(struct fs_context *fc,
+ struct fs_parameter *param);
+
+ Called for each mount parameter, including the source. The arguments are
+ as for the ->parse_param() method. It should return 0 to indicate that
+ the parameter should be passed on to the filesystem, 1 to indicate that
+ the parameter should be discarded or an error to indicate that the
+ parameter should be rejected.
+
+ The value pointed to by param may be modified (if a string) or stolen
+ (provided the value pointer is NULL'd out). If it is stolen, 1 must be
+ returned to prevent it being passed to the filesystem.
+
+ (*) int security_fs_context_validate(struct fs_context *fc);
+
+ Called after all the options have been parsed to validate the collection
+ as a whole and to do any necessary allocation so that
+ security_sb_get_tree() and security_sb_reconfigure() are less likely to
+ fail. It should return 0 or a negative error code.
+
+ In the case of reconfiguration, the target superblock will be accessible
+ via fc->root.
+
+ (*) int security_sb_get_tree(struct fs_context *fc);
+
+ Called during the mount procedure to verify that the specified superblock
+ is allowed to be mounted and to transfer the security data there. It
+ should return 0 or a negative error code.
+
+ (*) void security_sb_reconfigure(struct fs_context *fc);
+
+ Called to apply any reconfiguration to an LSM's context. It must not
+ fail. Error checking and resource allocation must be done in advance by
+ the parameter parsing and validation hooks.
+
+ (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags);
+
+ Called during the mount procedure to verify that the root dentry attached
+ to the context is permitted to be attached to the specified mountpoint.
+ It should return 0 on success or a negative error code on failure.
+
+
+=================================
+VFS FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+There are two operations for creating a filesystem context and
+one for destroying a context:
+
+ (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct dentry *reference,
+ unsigned int sb_flags,
+ unsigned int sb_flags_mask,
+ enum fs_context_purpose purpose);
+
+ Create a filesystem context for a given filesystem type and purpose. This
+ allocates the filesystem context, sets the superblock flags, initialises
+ the security and calls fs_type->init_fs_context() to initialise the
+ filesystem private data.
+
+ reference can be NULL or it may indicate the root dentry of a superblock
+ that is going to be reconfigured (FS_CONTEXT_FOR_RECONFIGURE,
+ FS_CONTEXT_FOR_UMOUNT or FS_CONTEXT_FOR_EMERGENCY_RO) or the automount
+ point that triggered a submount (FS_CONTEXT_FOR_SUBMOUNT). This is
+ provided as a source of namespace information.
+
+ (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc,
+ enum fs_context_purpose purpose);
+
+ Duplicate a filesystem context, copying any options noted and duplicating
+ or additionally referencing any resources held therein. This is available
+ for use where a filesystem has to get a mount within a mount, such as NFS4
+ does by internally mounting the root of the target server and then doing a
+ private pathwalk to the target directory.
+
+ The purpose in the new context is set from the purpose parameter.
+
+ (*) void put_fs_context(struct fs_context *fc);
+
+ Destroy a filesystem context, releasing any resources it holds. This
+ calls the ->free() operation. This is intended to be called by anyone who
+ created a filesystem context.
+
+ [!] filesystem contexts are not refcounted, so this causes unconditional
+ destruction.
+
+In all the above operations, apart from the put op, the return is a mount
+context pointer or a negative error code.
+
+For the remaining operations, if an error occurs, a negative error code will be
+returned.
+
+ (*) int vfs_get_tree(struct fs_context *fc);
+
+ Get or create the mountable root and superblock, using the parameters in
+ the filesystem context to select/configure the superblock. This invokes
+ the ->validate() op and then the ->get_tree() op.
+
+ [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
+ ->reconfigure().
+
+ (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+
+ Create a mount given the parameters in the specified filesystem context.
+ Note that this does not attach the mount to anything.
+
+ (*) int vfs_parse_fs_param(struct fs_context *fc,
+ struct fs_parameter *param);
+
+ Supply a single mount parameter to the filesystem context. This includes
+ the specification of the source/device which is specified as the "source"
+ parameter (which may be specified multiple times if the filesystem
+ supports that).
+
+ param specifies the parameter key name and the value. The parameter is
+ first checked to see if it corresponds to a standard mount flag (in which
+ case it is used to set an SB_xxx flag and consumed) or a security option
+ (in which case the LSM consumes it) before it is passed on to the
+ filesystem.
+
+ The parameter value is typed and can be one of:
+
+ fs_value_is_flag, Parameter not given a value.
+ fs_value_is_string, Value is a string
+ fs_value_is_blob, Value is a binary blob
+ fs_value_is_filename, Value is a filename* + dirfd
+ fs_value_is_filename_empty, Value is a filename* + dirfd + AT_EMPTY_PATH
+ fs_value_is_file, Value is an open file (file*)
+
+ If there is a value, that value is stored in a union in the struct in one
+ of param->{string,blob,name,file}. Note that the function may steal and
+ clear the pointer, but then becomes responsible for disposing of the
+ object.
+
+ (*) int vfs_parse_fs_string(struct fs_context *fc, const char *key,
+ const char *value, size_t v_size);
+
+ A wrapper around vfs_parse_fs_param() that just passes a constant string.
+
+ (*) int generic_parse_monolithic(struct fs_context *fc,
+ void *data, size_t data_len);
+
+ Parse a sys_mount() data page, assuming the form to be a text list
+ consisting of key[=val] options separated by commas. Each item in the
+ list is passed to vfs_parse_fs_string(). This is the default when the
+ ->parse_monolithic() operation is NULL.
+
+
+=====================
+PARAMETER DESCRIPTION
+=====================
+
+Parameters are described using structures defined in linux/fs_parser.h.
+There's a core description struct that links everything together:
+
+ struct fs_parameter_description {
+ const char name[16];
+ u8 nr_params;
+ u8 nr_alt_keys;
+ u8 nr_enums;
+ bool ignore_unknown;
+ bool no_source;
+ const char *const *keys;
+ const struct constant_table *alt_keys;
+ const struct fs_parameter_spec *specs;
+ const struct fs_parameter_enum *enums;
+ };
+
+For example:
+
+ enum afs_param {
+ Opt_autocell,
+ Opt_bar,
+ Opt_dyn,
+ Opt_foo,
+ Opt_source,
+ nr__afs_params
+ };
+
+ static const struct fs_parameter_description afs_fs_parameters = {
+ .name = "kAFS",
+ .nr_params = nr__afs_params,
+ .nr_alt_keys = ARRAY_SIZE(afs_param_alt_keys),
+ .nr_enums = ARRAY_SIZE(afs_param_enums),
+ .keys = afs_param_keys,
+ .alt_keys = afs_param_alt_keys,
+ .specs = afs_param_specs,
+ .enums = afs_param_enums,
+ };
+
+The members are as follows:
+
+ (1) const char name[16];
+
+ The name to be used in error messages generated by the parse helper
+ functions.
+
+ (2) u8 nr_params;
+
+ The number of discrete parameter identifiers. This indicates the number
+ of elements in the ->specs[] array and also limits the values that may be
+ used in the values that the ->keys[] array maps to.
+
+ It is expected that, for example, two parameters that are related, say
+ "acl" and "noacl" will have the same ID, but will be flagged to indicate
+ that one is the inverse of the other. The value can then be picked out
+ from the parse result.
+
+ (3) const struct fs_parameter_spec *specs;
+
+ Table of parameter specifications, where the entries are of type:
+
+ struct fs_parameter_spec {
+ enum fs_parameter_type type:8;
+ u8 flags;
+ };
+
+ and the parameter identifier is the index to the array. 'type' indicates
+ the desired value type and must be one of:
+
+ TYPE NAME EXPECTED VALUE RESULT IN
+ ======================= ======================= =====================
+ fs_param_is_flag No value n/a
+ fs_param_is_bool Boolean value result->boolean
+ fs_param_is_u32 32-bit unsigned int result->uint_32
+ fs_param_is_u32_octal 32-bit octal int result->uint_32
+ fs_param_is_u32_hex 32-bit hex int result->uint_32
+ fs_param_is_s32 32-bit signed int result->int_32
+ fs_param_is_enum Enum value name result->uint_32
+ fs_param_is_string Arbitrary string param->string
+ fs_param_is_blob Binary blob param->blob
+ fs_param_is_blockdev Blockdev path * Needs lookup
+ fs_param_is_path Path * Needs lookup
+ fs_param_is_fd File descriptor param->file
+
+ And each parameter can be qualified with 'flags':
+
+ fs_param_v_optional The value is optional
+ fs_param_neg_with_no If key name is prefixed with "no", it is false
+ fs_param_neg_with_empty If value is "", it is false
+ fs_param_deprecated The parameter is deprecated.
+
+ For example:
+
+ static const struct fs_parameter_spec afs_param_specs[nr__afs_params] = {
+ [Opt_autocell] = { fs_param_is_flag },
+ [Opt_bar] = { fs_param_is_enum },
+ [Opt_dyn] = { fs_param_is_flag },
+ [Opt_foo] = { fs_param_is_bool, fs_param_neg_with_no },
+ [Opt_source] = { fs_param_is_string },
+ };
+
+ Note that if the value is of fs_param_is_bool type, fs_parse() will try
+ to match any string value against "0", "1", "no", "yes", "false", "true".
+
+ [!] NOTE that the table must be sorted according to primary key name so
+ that ->keys[] is also sorted.
+
+ (4) const char *const *keys;
+
+ Table of primary key names for the parameters. There must be one entry
+ per defined parameter. The table is optional if ->nr_params is 0. The
+ table is just an array of names e.g.:
+
+ static const char *const afs_param_keys[nr__afs_params] = {
+ [Opt_autocell] = "autocell",
+ [Opt_bar] = "bar",
+ [Opt_dyn] = "dyn",
+ [Opt_foo] = "foo",
+ [Opt_source] = "source",
+ };
+
+ [!] NOTE that the table must be sorted such that the table can be searched
+ with bsearch() using strcmp(). This means that the Opt_* values must
+ correspond to the entries in this table.
+
+ (5) const struct constant_table *alt_keys;
+ u8 nr_alt_keys;
+
+ Table of additional key names and their mappings to parameter ID plus the
+ number of elements in the table. This is optional. The table is just an
+ array of { name, integer } pairs, e.g.:
+
+ static const struct constant_table afs_param_alt_keys[] = {
+ { "baz", Opt_bar },
+ { "dynamic", Opt_dyn },
+ };
+
+ [!] NOTE that the table must be sorted such that strcmp() can be used with
+ bsearch() to search the entries.
+
+ The parameter ID can also be fs_param_key_removed to indicate that a
+ deprecated parameter has been removed and that an error will be given.
+ This differs from fs_param_deprecated where the parameter may still have
+ an effect.
+
+ Further, the behaviour of the parameter may differ when an alternate name
+ is used (for instance with NFS, "v3", "v4.2", etc. are alternate names).
+
+ (6) const struct fs_parameter_enum *enums;
+ u8 nr_enums;
+
+ Table of enum value names to integer mappings and the number of elements
+ stored therein. This is of type:
+
+ struct fs_parameter_enum {
+ u8 param_id;
+ char name[14];
+ u8 value;
+ };
+
+ Where the array is an unsorted list of { parameter ID, name }-keyed
+ elements that indicate the value to map to, e.g.:
+
+ static const struct fs_parameter_enum afs_param_enums[] = {
+ { Opt_bar, "x", 1},
+ { Opt_bar, "y", 23},
+ { Opt_bar, "z", 42},
+ };
+
+ If a parameter of type fs_param_is_enum is encountered, fs_parse() will
+ try to look the value up in the enum table and the result will be stored
+ in the parse result.
+
+ (7) bool no_source;
+
+ If this is set, fs_parse() will ignore any "source" parameter and not
+ pass it to the filesystem.
+
+The description should be pointed to by the parameters pointer in the
+file_system_type struct as this will provide validation on registration (if
+CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from
+userspace using the fsinfo() syscall.
+
+
+==========================
+PARAMETER HELPER FUNCTIONS
+==========================
+
+A number of helper functions are provided to help a filesystem or an LSM
+process the parameters it is given.
+
+ (*) int lookup_constant(const struct constant_table tbl[],
+ const char *name, int not_found);
+
+ Look up a constant by name in a table of name -> integer mappings. The
+ table is an array of elements of the following type:
+
+ struct constant_table {
+ const char *name;
+ int value;
+ };
+
+ and it must be sorted such that it can be searched with bsearch() using
+ strcmp(). If a match is found, the corresponding value is returned. If a
+ match isn't found, the not_found value is returned instead.
+
+ (*) bool validate_constant_table(const struct constant_table *tbl,
+ size_t tbl_size,
+ int low, int high, int special);
+
+ Validate a constant table. Checks that all the elements are appropriately
+ ordered, that there are no duplicates and that the values are between low
+ and high inclusive, though provision is made for one allowable special
+ value outside of that range. If no special value is required, special
+ should just be set to lie inside the low-to-high range.
+
+ If all is good, true is returned. If the table is invalid, errors are
+ logged to dmesg, the stack is dumped and false is returned.
+
+ (*) int fs_parse(struct fs_context *fc,
+ const struct fs_param_parser *parser,
+ struct fs_parameter *param,
+ struct fs_param_parse_result *result);
+
+ This is the main interpreter of parameters. It uses the parameter
+ description (parser) to look up the name of the parameter to use and to
+ convert that to a parameter ID (stored in result->key).
+
+ If successful, and if the parameter type indicates the result is a
+ boolean, integer or enum type, the value is converted by this function and
+ the result stored in result->{boolean,int_32,uint_32}.
+
+ If a match isn't initially made, the key is prefixed with "no" and no
+ value is present then an attempt will be made to look up the key with the
+ prefix removed. If this matches a parameter for which the type has flag
+ fs_param_neg_with_no set, then a match will be made and the value will be
+ set to false/0/NULL.
+
+ If the parameter is successfully matched and, optionally, parsed
+ correctly, 1 is returned. If the parameter isn't matched and
+ parser->ignore_unknown is set, then 0 is returned. Otherwise -EINVAL is
+ returned.
+
+ (*) bool fs_validate_description(const struct fs_parameter_description *desc);
+
+ This validates the parameter description. It returns true if the
+ description is good and false if it is not.
+
+ (*) int fs_lookup_param(struct fs_context *fc,
+ struct fs_parameter *value,
+ bool want_bdev,
+ struct path *_path);
+
+ This takes a parameter that carries a string or filename type and attempts
+ to do a path lookup on it. If the parameter expects a blockdev, a check
+ is made that the inode actually represents one.
+
+ Returns 0 if successful and *_path will be set; returns a negative error
+ code if not.

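A userspace sketch of the sorted-table lookup described under PARAMETER
HELPER FUNCTIONS may help when reading the documentation above. It is not
the kernel implementation: lookup_constant_sketch() and
validate_table_sketch() are hypothetical names, both take an explicit
element count (an assumption; the documented lookup_constant() does not),
and the table reuses the alternate-key example from the documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Userspace stand-in for the kernel's struct constant_table. */
struct constant_table {
	const char *name;
	int value;
};

/* Parameter IDs taken from the documentation's kAFS example. */
enum afs_param {
	Opt_autocell, Opt_bar, Opt_dyn, Opt_foo, Opt_source, nr__afs_params
};

/* Must be sorted by name so that bsearch() with strcmp() works. */
static const struct constant_table afs_param_alt_keys[] = {
	{ "baz",     Opt_bar },
	{ "dynamic", Opt_dyn },
};

/* bsearch() comparator: the key is a plain string, the element an entry. */
static int cmp_constant(const void *name, const void *entry)
{
	const struct constant_table *e = entry;

	return strcmp(name, e->name);
}

/* Hypothetical equivalent of lookup_constant() with an explicit size. */
static int lookup_constant_sketch(const struct constant_table *tbl, size_t n,
				  const char *name, int not_found)
{
	const struct constant_table *e;

	assert(name != NULL);
	e = bsearch(name, tbl, n, sizeof(tbl[0]), cmp_constant);
	return e ? e->value : not_found;
}

/* Check strict strcmp() ordering (which also excludes duplicates) and that
 * each value is either the special value or within [low, high].
 */
static bool validate_table_sketch(const struct constant_table *tbl, size_t n,
				  int low, int high, int special)
{
	for (size_t i = 0; i < n; i++) {
		if (i > 0 && strcmp(tbl[i - 1].name, tbl[i].name) >= 0)
			return false;
		if (tbl[i].value != special &&
		    (tbl[i].value < low || tbl[i].value > high))
			return false;
	}
	return true;
}
```

Looking up "baz" in this table yields Opt_bar, and an unknown name yields
whatever not_found value the caller passed in, which mirrors the behaviour
the documentation describes.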

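The key[=val] splitting that generic_parse_monolithic() is documented to
perform can be illustrated in userspace roughly as follows. This is a sketch
under assumptions, not the kernel code: split_monolithic() and struct
parsed_opt are hypothetical, the options are collected into an array rather
than handed to a parser one at a time, and strchr() is used in place of
strsep() purely for portability. As in the patch, empty items and items that
begin with '=' are skipped.

```c
#include <stddef.h>
#include <string.h>

#define MAX_OPTS 16

/* One split-out option; value is NULL when there was no "=val" part. */
struct parsed_opt {
	const char *key;
	const char *value;
};

/* Split "key[=val][,key[=val]]*" in place. The buffer is modified.
 * Returns the number of options stored, or -1 if more than max are found.
 */
static int split_monolithic(char *buf, struct parsed_opt *opts, int max)
{
	int n = 0;
	char *key = buf;

	while (key) {
		char *next = strchr(key, ',');
		char *value;

		if (next)
			*next++ = '\0';		/* terminate this item */

		value = strchr(key, '=');
		if (*key && value != key) {	/* skip "" and "=val" items */
			if (value)
				*value++ = '\0';	/* split key from value */
			if (n == max)
				return -1;
			opts[n].key = key;
			opts[n].value = value;
			n++;
		}
		key = next;
	}
	return n;
}
```

Given the buffer "ro,,source=/dev/sda1,=x,foo=", this stores three options:
"ro" with no value, "source" with value "/dev/sda1", and "foo" with an empty
value.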
2018-09-21 16:34:50

by David Howells

Subject: [PATCH 25/34] Make anon_inodes unconditional [ver #12]

Make the anon_inodes facility unconditional so that it can be used by core
VFS code.

Signed-off-by: David Howells <[email protected]>
---

arch/arm/kvm/Kconfig | 1 -
arch/arm64/kvm/Kconfig | 1 -
arch/mips/kvm/Kconfig | 1 -
arch/powerpc/kvm/Kconfig | 1 -
arch/s390/kvm/Kconfig | 1 -
arch/x86/Kconfig | 1 -
arch/x86/kvm/Kconfig | 1 -
drivers/base/Kconfig | 1 -
drivers/char/tpm/Kconfig | 1 -
drivers/dma-buf/Kconfig | 1 -
drivers/gpio/Kconfig | 1 -
drivers/iio/Kconfig | 1 -
drivers/infiniband/Kconfig | 1 -
drivers/vfio/Kconfig | 1 -
fs/Makefile | 2 +-
fs/notify/fanotify/Kconfig | 1 -
fs/notify/inotify/Kconfig | 1 -
init/Kconfig | 10 ----------
18 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index e2bd35b6780c..c09fcc092a54 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on MMU && OF
select PREEMPT_NOTIFIERS
- select ANON_INODES
select ARM_GIC
select ARM_GIC_V3
select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 47b23bf617c7..86fe9b3e3ff8 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
depends on OF
select MMU_NOTIFIER
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_ARCH_TLB_FLUSH_ALL
select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 76b93a9c8c9b..4d06a29bc13b 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -20,7 +20,6 @@ config KVM
depends on HAVE_KVM
select EXPORT_UASM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 68a0e9d5b440..e058d02ee819 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
config KVM
bool
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index a3dbd459cce9..600e4fd11a67 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
prompt "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1a0be022f91d..d02baf335d98 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -46,7 +46,6 @@ config X86
#
select ACPI_LEGACY_TABLES_LOOKUP if ACPI
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
- select ANON_INODES
select ARCH_CLOCKSOURCE_DATA
select ARCH_DISCARD_MEMBLOCK
select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 1bbec387d289..f3f2e547484b 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
depends on X86_LOCAL_APIC
select PREEMPT_NOTIFIERS
select MMU_NOTIFIER
- select ANON_INODES
select HAVE_KVM_IRQCHIP
select HAVE_KVM_IRQFD
select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 3e63a900b330..ae213ed2a7c8 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
config DMA_SHARED_BUFFER
bool
default n
- select ANON_INODES
select IRQ_WORK
help
This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 18c81cbe4704..4819874b5523 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
config TCG_VTPM_PROXY
tristate "VTPM Proxy Interface"
depends on TCG_TPM
- select ANON_INODES
---help---
This driver proxies for an emulated TPM (vTPM) running in userspace.
A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index ed3b785bae37..b0194c8c251c 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
config SYNC_FILE
bool "Explicit Synchronization Framework"
default n
- select ANON_INODES
select DMA_SHARED_BUFFER
---help---
The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 4f52c3a8ec99..392fd95b3734 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H

menuconfig GPIOLIB
bool "GPIO Support"
- select ANON_INODES
help
This enables GPIO support through the generic GPIO library.
You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@

menuconfig IIO
tristate "Industrial I/O support"
- select ANON_INODES
help
The industrial I/O subsystem provides a unified framework for
drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index abb6660c099c..176b943dfec9 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD

config INFINIBAND_USER_ACCESS
tristate "InfiniBand userspace access (verbs and CM)"
- select ANON_INODES
---help---
Userspace InfiniBand access support. This enables the
kernel side of userspace verbs and the userspace
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index c84333eb5eb5..9aa91e736023 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM_SMMU || ARM_SMMU_V3)
- select ANON_INODES
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 9a0b8003f069..ae681523b4b1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o

obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
-obj-$(CONFIG_ANON_INODES) += anon_inodes.o
+obj-y += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 41355ce74ac0..f5b0b3af32dd 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
config FANOTIFY
bool "Filesystem wide access notification"
select FSNOTIFY
- select ANON_INODES
default n
---help---
Say Y here to enable fanotify support. fanotify is a file access
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
config INOTIFY_USER
bool "Inotify support for userspace"
- select ANON_INODES
select FSNOTIFY
default y
---help---
diff --git a/init/Kconfig b/init/Kconfig
index 1e234e2f1cba..275534995f78 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1114,9 +1114,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
config SYSCTL
bool

-config ANON_INODES
- bool
-
config HAVE_UID16
bool

@@ -1321,14 +1318,12 @@ config HAVE_FUTEX_CMPXCHG
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
- select ANON_INODES
help
Disabling this option will cause the kernel to be built without
support for epoll family of system calls.

config SIGNALFD
bool "Enable signalfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the signalfd() system call that allows to receive signals
@@ -1338,7 +1333,6 @@ config SIGNALFD

config TIMERFD
bool "Enable timerfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the timerfd() system call that allows to receive timer
@@ -1348,7 +1342,6 @@ config TIMERFD

config EVENTFD
bool "Enable eventfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the eventfd() system call that allows to receive both
@@ -1450,7 +1443,6 @@ config KALLSYMS_BASE_RELATIVE
# syscall, maps, verifier
config BPF_SYSCALL
bool "Enable bpf() system call"
- select ANON_INODES
select BPF
select IRQ_WORK
default n
@@ -1467,7 +1459,6 @@ config BPF_JIT_ALWAYS_ON

config USERFAULTFD
bool "Enable userfaultfd() system call"
- select ANON_INODES
depends on MMU
help
Enable the userfaultfd() system call that allows to intercept and
@@ -1534,7 +1525,6 @@ config PERF_EVENTS
bool "Kernel performance events and counters"
default y if PROFILING
depends on HAVE_PERF_EVENTS
- select ANON_INODES
select IRQ_WORK
select SRCU
help


2018-09-21 16:34:51

by David Howells

Subject: [PATCH 23/34] vfs: Remove kern_mount_data() [ver #12]

kern_mount_data() isn't used any more, so remove it.

Signed-off-by: David Howells <[email protected]>
---

fs/namespace.c | 7 -------
include/linux/fs.h | 1 -
2 files changed, 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d841ba5568d9..156261d03c12 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3328,13 +3328,6 @@ struct vfsmount *kern_mount(struct file_system_type *type)
}
EXPORT_SYMBOL_GPL(kern_mount);

-struct vfsmount *kern_mount_data(struct file_system_type *type,
- void *data, size_t data_size)
-{
- return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
/*
* Move a mount from one place to another.
* In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4c3c388646bc..14e08020890f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2232,7 +2232,6 @@ mount_pseudo(struct file_system_type *fs_type, char *name,
extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
extern struct vfsmount *kern_mount(struct file_system_type *);
-extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
extern void kern_unmount(struct vfsmount *mnt);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);


2018-09-21 16:34:52

by David Howells

Subject: [PATCH 15/34] vfs: Implement a filesystem superblock creation/configuration context [ver #12]

Implement a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.

The mounting procedure then becomes:

(1) Allocate a new fs_context.

(2) Configure the context.

(3) Create superblock.

(4) Query the superblock.

(5) Create a mount for the superblock.

(6) Destroy the context.

Rather than calling fs_type->mount(), an fs_context struct is created and
fs_type->init_fs_context() is called to set it up. Pointers exist for the
filesystem and LSM to hang their private data off.

A set of operations has to be set by ->init_fs_context() to provide
freeing, duplication, option parsing, binary data parsing, validation,
mounting and superblock filling.

Legacy filesystems are supported by the provision of a set of legacy
fs_context operations that build up a list of mount options and then invoke
fs_type->mount() from within the fs_context ->get_tree() operation. This
allows all filesystems to be accessed using fs_context.

It should be noted that, whilst this patch adds a lot of lines of code,
there is quite a bit of duplication with existing code that can be
eliminated should all filesystems be converted over.

Signed-off-by: David Howells <[email protected]>
---

fs/Makefile | 2
fs/filesystems.c | 4
fs/fs_context.c | 658 ++++++++++++++++++++++++++++++++++++++++++++
fs/internal.h | 14 +
fs/libfs.c | 20 +
fs/namespace.c | 397 ++++++++++++++++++---------
fs/super.c | 401 ++++++++++++++++++++++++---
include/linux/fs.h | 15 +
include/linux/fs_context.h | 30 ++
include/linux/kernfs.h | 2
include/linux/mount.h | 3
11 files changed, 1379 insertions(+), 167 deletions(-)
create mode 100644 fs/fs_context.c

diff --git a/fs/Makefile b/fs/Makefile
index 07b894227dce..9a0b8003f069 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y := open.o read_write.o file_table.o super.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
- fs_parser.o
+ fs_context.o fs_parser.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/filesystems.c b/fs/filesystems.c
index b03f57b1105b..9135646e41ac 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -16,6 +16,7 @@
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
+#include <linux/fs_parser.h>

/*
* Handling of filesystem drivers list.
@@ -73,6 +74,9 @@ int register_filesystem(struct file_system_type * fs)
int res = 0;
struct file_system_type ** p;

+ if (fs->parameters && !fs_validate_description(fs->parameters))
+ return -EINVAL;
+
BUG_ON(strchr(fs->name, '.'));
if (fs->next)
return -EBUSY;
diff --git a/fs/fs_context.c b/fs/fs_context.c
new file mode 100644
index 000000000000..328fcb764667
--- /dev/null
+++ b/fs/fs_context.c
@@ -0,0 +1,658 @@
+/* Provide a way to create a superblock configuration context within the kernel
+ * that allows a superblock to be set up prior to mounting.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/nsproxy.h>
+#include <linux/slab.h>
+#include <linux/magic.h>
+#include <linux/security.h>
+#include <linux/mnt_namespace.h>
+#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
+#include <linux/bsearch.h>
+#include <net/net_namespace.h>
+#include "mount.h"
+#include "internal.h"
+
+enum legacy_fs_param {
+ LEGACY_FS_UNSET_PARAMS,
+ LEGACY_FS_NO_PARAMS,
+ LEGACY_FS_MONOLITHIC_PARAMS,
+ LEGACY_FS_INDIVIDUAL_PARAMS,
+ LEGACY_FS_MAGIC_PARAMS,
+};
+
+struct legacy_fs_context {
+ char *legacy_data; /* Data page for legacy filesystems */
+ char *secdata;
+ size_t data_size;
+ enum legacy_fs_param param_type;
+};
+
+static const struct constant_table common_set_sb_flag[] = {
+ { "dirsync", SB_DIRSYNC },
+ { "lazytime", SB_LAZYTIME },
+ { "mand", SB_MANDLOCK },
+ { "posixacl", SB_POSIXACL },
+ { "ro", SB_RDONLY },
+ { "sync", SB_SYNCHRONOUS },
+};
+
+static const struct constant_table common_clear_sb_flag[] = {
+ { "async", SB_SYNCHRONOUS },
+ { "nolazytime", SB_LAZYTIME },
+ { "nomand", SB_MANDLOCK },
+ { "rw", SB_RDONLY },
+ { "silent", SB_SILENT },
+};
+
+static const char *const forbidden_sb_flag[] = {
+ "bind",
+ "dev",
+ "exec",
+ "move",
+ "noatime",
+ "nodev",
+ "nodiratime",
+ "noexec",
+ "norelatime",
+ "nostrictatime",
+ "nosuid",
+ "private",
+ "rec",
+ "relatime",
+ "remount",
+ "shared",
+ "slave",
+ "strictatime",
+ "suid",
+ "unbindable",
+};
+
+static int cmp_flag_name(const void *name, const void *entry)
+{
+ const char **e = (const char **)entry;
+ return strcmp(name, *e);
+}
+
+/*
+ * Check for a common mount option that manipulates s_flags.
+ */
+static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
+{
+ unsigned int token;
+
+ if (bsearch(key, forbidden_sb_flag, ARRAY_SIZE(forbidden_sb_flag),
+ sizeof(forbidden_sb_flag[0]), cmp_flag_name))
+ return -EINVAL;
+
+ token = lookup_constant(common_set_sb_flag, key, 0);
+ if (token) {
+ fc->sb_flags |= token;
+ fc->sb_flags_mask |= token;
+ return 0;
+ }
+
+ token = lookup_constant(common_clear_sb_flag, key, 0);
+ if (token) {
+ fc->sb_flags &= ~token;
+ fc->sb_flags_mask |= token;
+ return 0;
+ }
+
+ return -ENOPARAM;
+}
+
+/**
+ * vfs_parse_fs_param - Add a single parameter to a superblock config
+ * @fc: The filesystem context to modify
+ * @param: The parameter
+ *
+ * A single mount option in string form is applied to the filesystem context
+ * being set up. Certain standard options (for example "ro") are translated
+ * into flag bits without going to the filesystem. The active security module
+ * is allowed to observe and poach options. Any other options are passed over
+ * to the filesystem to parse.
+ *
+ * This may be called multiple times for a context.
+ *
+ * Returns 0 on success and a negative error code on failure. In the event of
+ * failure, supplementary error information may have been set.
+ */
+int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ int ret;
+
+ if (!param->key)
+ return invalf(fc, "Unnamed parameter\n");
+
+ ret = vfs_parse_sb_flag(fc, param->key);
+ if (ret != -ENOPARAM)
+ return ret;
+
+ ret = security_fs_context_parse_param(fc, param);
+ if (ret != -ENOPARAM)
+ /* Param belongs to the LSM or is disallowed by the LSM; so
+ * don't pass to the FS.
+ */
+ return ret;
+
+ if (fc->ops->parse_param) {
+ ret = fc->ops->parse_param(fc, param);
+ if (ret != -ENOPARAM)
+ return ret;
+ }
+
+ /* If the filesystem doesn't take any arguments, we need to ignore the
+ * source parameter if given.
+ */
+ if (strcmp(param->key, "source") == 0)
+ return 0;
+
+ return invalf(fc, "%s: Unknown parameter '%s'",
+ fc->fs_type->name, param->key);
+}
+EXPORT_SYMBOL(vfs_parse_fs_param);
+
+/**
+ * vfs_parse_fs_string - Convenience function to just parse a string.
+ */
+int vfs_parse_fs_string(struct fs_context *fc, const char *key,
+ const char *value, size_t v_size)
+{
+ int ret;
+
+ struct fs_parameter param = {
+ .key = key,
+ .type = fs_value_is_string,
+ .size = v_size,
+ };
+
+ if (v_size > 0) {
+ param.string = kmemdup_nul(value, v_size, GFP_KERNEL);
+ if (!param.string)
+ return -ENOMEM;
+ }
+
+ ret = vfs_parse_fs_param(fc, &param);
+ kfree(param.string);
+ return ret;
+}
+EXPORT_SYMBOL(vfs_parse_fs_string);
+
+/**
+ * generic_parse_monolithic - Parse key[=val][,key[=val]]* mount data
+ * @fc: The superblock configuration to fill in.
+ * @data: The data to parse
+ * @data_size: The amount of data
+ *
+ * Parse a blob of data that's in key[=val][,key[=val]]* form. This can be
+ * called from the ->parse_monolithic() fs_context operation.
+ *
+ * Returns 0 on success or the error returned by the ->parse_param() fs_context
+ * operation on failure.
+ */
+int generic_parse_monolithic(struct fs_context *fc, void *data, size_t data_size)
+{
+ char *options = data, *key;
+ int ret = 0;
+
+ if (!options)
+ return 0;
+
+ while ((key = strsep(&options, ",")) != NULL) {
+ if (*key) {
+ size_t v_len = 0;
+ char *value = strchr(key, '=');
+
+ if (value) {
+ if (value == key)
+ continue;
+ *value++ = 0;
+ v_len = strlen(value);
+ }
+ ret = vfs_parse_fs_string(fc, key, value, v_len);
+ if (ret < 0)
+ break;
+ }
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(generic_parse_monolithic);
+
+/**
+ * vfs_new_fs_context - Create a filesystem context.
+ * @fs_type: The filesystem type.
+ * @reference: The dentry from which this one derives (or NULL)
+ * @sb_flags: Filesystem/superblock flags (SB_*)
+ * @sb_flags_mask: Applicable members of @sb_flags
+ * @purpose: The purpose that this configuration shall be used for.
+ *
+ * Open a filesystem and create a mount context. The mount context is
+ * initialised with the supplied flags and, if a submount/automount from
+ * another superblock (referred to by @reference) is supplied, may have
+ * parameters such as namespaces copied across from that superblock.
+ */
+struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct dentry *reference,
+ unsigned int sb_flags,
+ unsigned int sb_flags_mask,
+ enum fs_context_purpose purpose)
+{
+ int (*init_fs_context)(struct fs_context *, struct dentry *);
+ struct fs_context *fc;
+ int ret = -ENOMEM;
+
+ fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL);
+ if (!fc)
+ return ERR_PTR(-ENOMEM);
+
+ fc->purpose = purpose;
+ fc->sb_flags = sb_flags;
+ fc->sb_flags_mask = sb_flags_mask;
+ fc->fs_type = get_filesystem(fs_type);
+ fc->cred = get_current_cred();
+
+ switch (purpose) {
+ case FS_CONTEXT_FOR_KERNEL_MOUNT:
+ fc->sb_flags |= SB_KERNMOUNT;
+ /* Fallthrough */
+ case FS_CONTEXT_FOR_USER_MOUNT:
+ fc->user_ns = get_user_ns(fc->cred->user_ns);
+ fc->net_ns = get_net(current->nsproxy->net_ns);
+ break;
+ case FS_CONTEXT_FOR_SUBMOUNT:
+ case FS_CONTEXT_FOR_ROOT_MOUNT:
+ fc->user_ns = get_user_ns(reference->d_sb->s_user_ns);
+ fc->net_ns = get_net(current->nsproxy->net_ns);
+ break;
+ case FS_CONTEXT_FOR_RECONFIGURE:
+ case FS_CONTEXT_FOR_UMOUNT:
+ case FS_CONTEXT_FOR_EMERGENCY_RO:
+ /* We don't pin any namespaces as the superblock's
+ * subscriptions cannot be changed at this point.
+ */
+ atomic_inc(&reference->d_sb->s_active);
+ fc->root = dget(reference);
+ break;
+ }
+
+ /* TODO: Make all filesystems support this unconditionally */
+ init_fs_context = fc->fs_type->init_fs_context;
+ if (!init_fs_context)
+ init_fs_context = legacy_init_fs_context;
+
+ ret = (*init_fs_context)(fc, reference);
+ if (ret < 0)
+ goto err_fc;
+ fc->need_free = true;
+
+ /* Do the security check last because ->init_fs_context may change the
+ * namespace subscriptions.
+ */
+ ret = security_fs_context_alloc(fc, reference);
+ if (ret < 0)
+ goto err_fc;
+
+ return fc;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_new_fs_context);
+
+/**
+ * vfs_dup_fs_context - Duplicate a filesystem context.
+ * @src_fc: The context to copy.
+ * @purpose: The purpose to set in the new context
+ */
+struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc,
+ enum fs_context_purpose purpose)
+{
+ struct fs_context *fc;
+ int ret;
+
+ if (!src_fc->ops->dup)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ fc = kmemdup(src_fc, sizeof(struct fs_context), GFP_KERNEL);
+ if (!fc)
+ return ERR_PTR(-ENOMEM);
+ fc->purpose = purpose;
+ fc->fs_private = NULL;
+ fc->s_fs_info = NULL;
+ fc->source = NULL;
+ fc->security = NULL;
+ get_filesystem(fc->fs_type);
+ get_net(fc->net_ns);
+ get_user_ns(fc->user_ns);
+ get_cred(fc->cred);
+
+ /* Can't call put until we've called ->dup */
+ ret = fc->ops->dup(fc, src_fc);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = security_fs_context_dup(fc, src_fc);
+ if (ret < 0)
+ goto err_fc;
+ return fc;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_dup_fs_context);
+
+/**
+ * put_fs_context - Dispose of a superblock configuration context.
+ * @fc: The context to dispose of.
+ */
+void put_fs_context(struct fs_context *fc)
+{
+ struct super_block *sb;
+
+ if (fc->root) {
+ sb = fc->root->d_sb;
+ dput(fc->root);
+ fc->root = NULL;
+ deactivate_super(sb);
+ }
+
+ if (fc->need_free && fc->ops && fc->ops->free)
+ fc->ops->free(fc);
+
+ security_fs_context_free(fc);
+ if (fc->net_ns)
+ put_net(fc->net_ns);
+ put_user_ns(fc->user_ns);
+ if (fc->cred)
+ put_cred(fc->cred);
+ kfree(fc->subtype);
+ put_filesystem(fc->fs_type);
+ kfree(fc->source);
+ kfree(fc);
+}
+EXPORT_SYMBOL(put_fs_context);
+
+/*
+ * Free the config for a filesystem that doesn't support fs_context.
+ */
+static void legacy_fs_context_free(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = fc->fs_private;
+
+ if (ctx) {
+ free_secdata(ctx->secdata);
+ switch (ctx->param_type) {
+ case LEGACY_FS_UNSET_PARAMS:
+ case LEGACY_FS_NO_PARAMS:
+ break;
+ case LEGACY_FS_MAGIC_PARAMS:
+ break; /* ctx->legacy_data is a weird pointer */
+ default:
+ kfree(ctx->legacy_data);
+ break;
+ }
+
+ kfree(ctx);
+ }
+}
+
+/*
+ * Duplicate a legacy config.
+ */
+static int legacy_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ struct legacy_fs_context *ctx;
+ struct legacy_fs_context *src_ctx = src_fc->fs_private;
+
+ ctx = kmemdup(src_ctx, sizeof(*src_ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ switch (ctx->param_type) {
+ case LEGACY_FS_MONOLITHIC_PARAMS:
+ case LEGACY_FS_INDIVIDUAL_PARAMS:
+ ctx->legacy_data = kmemdup(src_ctx->legacy_data,
+ src_ctx->data_size, GFP_KERNEL);
+ if (!ctx->legacy_data) {
+ kfree(ctx);
+ return -ENOMEM;
+ }
+ /* Fall through */
+ default:
+ break;
+ }
+
+ fc->fs_private = ctx;
+ return 0;
+}
+
+/*
+ * Add a parameter to a legacy config. We build up a comma-separated list of
+ * options.
+ */
+static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ struct legacy_fs_context *ctx = fc->fs_private;
+ unsigned int size = ctx->data_size;
+ size_t len = 0;
+
+ if (strcmp(param->key, "source") == 0) {
+ if (param->type != fs_value_is_string)
+ return invalf(fc, "VFS: Legacy: Non-string source");
+ if (fc->source)
+ return invalf(fc, "VFS: Legacy: Multiple sources");
+ fc->source = param->string;
+ param->string = NULL;
+ return 0;
+ }
+
+ if ((fc->fs_type->fs_flags & FS_HAS_SUBTYPE) &&
+ strcmp(param->key, "subtype") == 0) {
+ if (param->type != fs_value_is_string)
+ return invalf(fc, "VFS: Legacy: Non-string subtype");
+ if (fc->subtype)
+ return invalf(fc, "VFS: Legacy: Multiple subtypes");
+ fc->subtype = param->string;
+ param->string = NULL;
+ return 0;
+ }
+
+ if (ctx->param_type != LEGACY_FS_UNSET_PARAMS &&
+ ctx->param_type != LEGACY_FS_INDIVIDUAL_PARAMS)
+ return invalf(fc, "VFS: Legacy: Can't mix monolithic and individual options");
+
+ switch (param->type) {
+ case fs_value_is_string:
+ len = 1 + param->size;
+ /* Fall through */
+ case fs_value_is_flag:
+ len += strlen(param->key);
+ break;
+ default:
+ return invalf(fc, "VFS: Legacy: Parameter type for '%s' not supported",
+ param->key);
+ }
+
+ if (len > PAGE_SIZE - 2 - size)
+ return invalf(fc, "VFS: Legacy: Cumulative options too large");
+ if (strchr(param->key, ',') ||
+ (param->type == fs_value_is_string &&
+ memchr(param->string, ',', param->size)))
+ return invalf(fc, "VFS: Legacy: Option '%s' contained comma",
+ param->key);
+ if (!ctx->legacy_data) {
+ ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ }
+
+ ctx->legacy_data[size++] = ',';
+ len = strlen(param->key);
+ memcpy(ctx->legacy_data + size, param->key, len);
+ size += len;
+ if (param->type == fs_value_is_string) {
+ ctx->legacy_data[size++] = '=';
+ memcpy(ctx->legacy_data + size, param->string, param->size);
+ size += param->size;
+ }
+ ctx->legacy_data[size] = '\0';
+ ctx->data_size = size;
+ ctx->param_type = LEGACY_FS_INDIVIDUAL_PARAMS;
+ return 0;
+}
+
+/*
+ * Add monolithic mount data.
+ */
+static int legacy_parse_monolithic(struct fs_context *fc, void *data, size_t data_size)
+{
+ struct legacy_fs_context *ctx = fc->fs_private;
+
+ if (ctx->param_type != LEGACY_FS_UNSET_PARAMS) {
+ pr_warn("VFS: Can't mix monolithic and individual options\n");
+ return -EINVAL;
+ }
+
+ if (!data) {
+ ctx->param_type = LEGACY_FS_NO_PARAMS;
+ return 0;
+ }
+
+ ctx->data_size = data_size;
+ if (data_size > 0) {
+ ctx->legacy_data = kmemdup(data, data_size, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ ctx->param_type = LEGACY_FS_MONOLITHIC_PARAMS;
+ } else {
+ /* Some filesystems pass weird pointers through that we don't
+ * want to copy. They can indicate this by setting data_size
+ * to 0.
+ */
+ ctx->legacy_data = data;
+ ctx->param_type = LEGACY_FS_MAGIC_PARAMS;
+ }
+
+ return 0;
+}
+
+/*
+ * Use the legacy mount validation step to strip out and process security
+ * config options.
+ */
+static int legacy_validate(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = fc->fs_private;
+
+ switch (ctx->param_type) {
+ case LEGACY_FS_UNSET_PARAMS:
+ ctx->param_type = LEGACY_FS_NO_PARAMS;
+ /* Fall through */
+ case LEGACY_FS_NO_PARAMS:
+ case LEGACY_FS_MAGIC_PARAMS:
+ return 0;
+ default:
+ break;
+ }
+
+ if (fc->fs_type->fs_flags & FS_BINARY_MOUNTDATA)
+ return 0;
+
+ ctx->secdata = alloc_secdata();
+ if (!ctx->secdata)
+ return -ENOMEM;
+
+ return security_sb_copy_data(ctx->legacy_data, ctx->data_size,
+ ctx->secdata);
+}
+
+/*
+ * Get a mountable root with the legacy mount command.
+ */
+static int legacy_get_tree(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = fc->fs_private;
+ struct super_block *sb;
+ struct dentry *root;
+
+ root = fc->fs_type->mount(fc->fs_type, fc->sb_flags,
+ fc->source, ctx->legacy_data,
+ ctx->data_size);
+ if (IS_ERR(root))
+ return PTR_ERR(root);
+
+ sb = root->d_sb;
+ BUG_ON(!sb);
+
+ fc->root = root;
+ return 0;
+}
+
+/*
+ * Handle remount.
+ */
+static int legacy_reconfigure(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = fc->fs_private;
+ struct super_block *sb = fc->root->d_sb;
+
+ if (!sb->s_op->remount_fs)
+ return 0;
+
+ return sb->s_op->remount_fs(sb, &fc->sb_flags,
+ ctx ? ctx->legacy_data : NULL,
+ ctx ? ctx->data_size : 0);
+}
+
+const struct fs_context_operations legacy_fs_context_ops = {
+ .free = legacy_fs_context_free,
+ .dup = legacy_fs_context_dup,
+ .parse_param = legacy_parse_param,
+ .parse_monolithic = legacy_parse_monolithic,
+ .validate = legacy_validate,
+ .get_tree = legacy_get_tree,
+ .reconfigure = legacy_reconfigure,
+};
+
+/*
+ * Initialise a legacy context for a filesystem that doesn't support
+ * fs_context.
+ */
+int legacy_init_fs_context(struct fs_context *fc, struct dentry *dentry)
+{
+ switch (fc->purpose) {
+ default:
+ fc->fs_private = kzalloc(sizeof(struct legacy_fs_context),
+ GFP_KERNEL);
+ if (!fc->fs_private)
+ return -ENOMEM;
+ break;
+
+ case FS_CONTEXT_FOR_UMOUNT:
+ case FS_CONTEXT_FOR_EMERGENCY_RO:
+ if (!fc->root->d_sb->s_op->remount_fs)
+ return -EOPNOTSUPP;
+ break;
+ }
+
+ fc->ops = &legacy_fs_context_ops;
+ return 0;
+}
diff --git a/fs/internal.h b/fs/internal.h
index 63b6840de8c1..fc2da60abbcd 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -51,6 +51,17 @@ int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
*/
extern void __init chrdev_init(void);

+/*
+ * fs_context.c
+ */
+extern const struct fs_context_operations legacy_fs_context_ops;
+extern int legacy_init_fs_context(struct fs_context *fc, struct dentry *dentry);
+
+/*
+ * fsopen.c
+ */
+extern void vfs_clean_context(struct fs_context *fc);
+
/*
* namei.c
*/
@@ -74,6 +85,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
*/
extern void *copy_mount_options(const void __user *);
extern char *copy_mount_string(const void __user *);
+extern int parse_monolithic_mount_data(struct fs_context *, void *, size_t);

extern struct vfsmount *lookup_mnt(const struct path *);
extern int finish_automount(struct vfsmount *, struct path *);
@@ -102,7 +114,7 @@ extern struct file *alloc_empty_file_noaccount(int, const struct cred *);
/*
* super.c
*/
-extern int do_remount_sb(struct super_block *, int, void *, size_t, int);
+extern int reconfigure_super(struct fs_context *);
extern bool trylock_super(struct super_block *sb);
extern struct dentry *mount_fs(struct file_system_type *,
int, const char *, void *, size_t);
diff --git a/fs/libfs.c b/fs/libfs.c
index 9f1f4884b7cc..b1744c071ab0 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -9,6 +9,7 @@
#include <linux/slab.h>
#include <linux/cred.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/vfs.h>
#include <linux/quotaops.h>
#include <linux/mutex.h>
@@ -574,13 +575,30 @@ static DEFINE_SPINLOCK(pin_fs_lock);

int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *count)
{
+ struct fs_context *fc;
struct vfsmount *mnt = NULL;
+ int ret;
+
spin_lock(&pin_fs_lock);
if (unlikely(!*mount)) {
spin_unlock(&pin_fs_lock);
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
+
+ fc = vfs_new_fs_context(type, NULL, 0, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0) {
+ put_fs_context(fc);
+ return ret;
+ }
+
+ mnt = vfs_create_mount(fc, 0);
+ put_fs_context(fc);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
+
spin_lock(&pin_fs_lock);
if (!*mount)
*mount = mnt;
diff --git a/fs/namespace.c b/fs/namespace.c
index 059a13e1ae09..d841ba5568d9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -28,6 +28,7 @@
#include <linux/task_work.h>
#include <linux/sched/task.h>
#include <uapi/linux/mount.h>
+#include <linux/fs_context.h>

#include "pnode.h"
#include "internal.h"
@@ -941,56 +942,6 @@ static struct mount *skip_mnt_tree(struct mount *p)
return p;
}

-struct vfsmount *
-vfs_kern_mount(struct file_system_type *type, int flags, const char *name,
- void *data, size_t data_size)
-{
- struct mount *mnt;
- struct dentry *root;
-
- if (!type)
- return ERR_PTR(-ENODEV);
-
- mnt = alloc_vfsmnt(name);
- if (!mnt)
- return ERR_PTR(-ENOMEM);
-
- if (flags & SB_KERNMOUNT)
- mnt->mnt.mnt_flags = MNT_INTERNAL;
-
- root = mount_fs(type, flags, name, data, data_size);
- if (IS_ERR(root)) {
- mnt_free_id(mnt);
- free_vfsmnt(mnt);
- return ERR_CAST(root);
- }
-
- mnt->mnt.mnt_root = root;
- mnt->mnt.mnt_sb = root->d_sb;
- mnt->mnt_mountpoint = mnt->mnt.mnt_root;
- mnt->mnt_parent = mnt;
- lock_mount_hash();
- list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
- unlock_mount_hash();
- return &mnt->mnt;
-}
-EXPORT_SYMBOL_GPL(vfs_kern_mount);
-
-struct vfsmount *
-vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
- const char *name, void *data, size_t data_size)
-{
- /* Until it is worked out how to pass the user namespace
- * through from the parent mount to the submount don't support
- * unprivileged mounts with submounts.
- */
- if (mountpoint->d_sb->s_user_ns != &init_user_ns)
- return ERR_PTR(-EPERM);
-
- return vfs_kern_mount(type, SB_SUBMOUNT, name, data, data_size);
-}
-EXPORT_SYMBOL_GPL(vfs_submount);
-
static struct mount *clone_mnt(struct mount *old, struct dentry *root,
int flag)
{
@@ -1466,6 +1417,40 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)

static void shrink_submounts(struct mount *mnt);

+static int do_umount_root(struct super_block *sb)
+{
+ int ret = 0;
+ struct fs_context fc = {
+ .purpose = FS_CONTEXT_FOR_UMOUNT,
+ .fs_type = sb->s_type,
+ .root = sb->s_root,
+ .sb_flags = SB_RDONLY,
+ .sb_flags_mask = SB_RDONLY,
+ };
+
+ down_write(&sb->s_umount);
+
+ /* The root can't be unmounted; reconfigure it read-only instead. */
+ if (!sb_rdonly(sb)) {
+ if (fc.fs_type->init_fs_context)
+ ret = fc.fs_type->init_fs_context(&fc, NULL);
+ else
+ ret = legacy_init_fs_context(&fc, NULL);
+
+ switch (ret) {
+ case 0:
+ ret = reconfigure_super(&fc);
+ fc.ops->free(&fc);
+ break;
+ case -EOPNOTSUPP:
+ ret = 0;
+ break;
+ }
+ }
+ up_write(&sb->s_umount);
+ return ret;
+}
+
static int do_umount(struct mount *mnt, int flags)
{
struct super_block *sb = mnt->mnt.mnt_sb;
@@ -1531,11 +1516,7 @@ static int do_umount(struct mount *mnt, int flags)
*/
if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;
- down_write(&sb->s_umount);
- if (!sb_rdonly(sb))
- retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0);
- up_write(&sb->s_umount);
- return retval;
+ return do_umount_root(sb);
}

namespace_lock();
@@ -2378,6 +2359,20 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
return ret;
}

+/*
+ * Parse the monolithic page of mount data given to sys_mount().
+ */
+int parse_monolithic_mount_data(struct fs_context *fc, void *data, size_t data_size)
+{
+ int (*monolithic_mount_data)(struct fs_context *, void *, size_t);
+
+ monolithic_mount_data = fc->ops->parse_monolithic;
+ if (!monolithic_mount_data)
+ monolithic_mount_data = generic_parse_monolithic;
+
+ return monolithic_mount_data(fc, data, data_size);
+}
+
/*
* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
@@ -2386,6 +2381,7 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
static int do_remount(struct path *path, int ms_flags, int sb_flags,
int mnt_flags, void *data, size_t data_size)
{
+ struct fs_context *fc = NULL;
int err;
struct super_block *sb = path->mnt->mnt_sb;
struct mount *mnt = real_mount(path->mnt);
@@ -2399,18 +2395,34 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
if (!can_change_locked_flags(mnt, mnt_flags))
return -EPERM;

- err = security_sb_remount(sb, data, data_size);
+ fc = vfs_new_fs_context(path->dentry->d_sb->s_type, path->dentry,
+ sb_flags, MS_RMT_MASK, FS_CONTEXT_FOR_RECONFIGURE);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+ err = parse_monolithic_mount_data(fc, data, data_size);
+ if (err < 0)
+ goto err_fc;
+
+ if (fc->ops->validate) {
+ err = fc->ops->validate(fc);
+ if (err < 0)
+ goto err_fc;
+ }
+
+ err = security_fs_context_validate(fc);
if (err)
- return err;
+ goto err_fc;

down_write(&sb->s_umount);
err = -EPERM;
if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
- err = do_remount_sb(sb, sb_flags, data, data_size, 0);
+ err = reconfigure_super(fc);
if (!err)
set_mount_attributes(mnt, mnt_flags);
}
up_write(&sb->s_umount);
+err_fc:
+ put_fs_context(fc);
return err;
}

@@ -2515,29 +2527,6 @@ static int do_move_mount_old(struct path *path, const char *old_name)
return err;
}

-static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
-{
- int err;
- const char *subtype = strchr(fstype, '.');
- if (subtype) {
- subtype++;
- err = -EINVAL;
- if (!subtype[0])
- goto err;
- } else
- subtype = "";
-
- mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
- err = -ENOMEM;
- if (!mnt->mnt_sb->s_subtype)
- goto err;
- return mnt;
-
- err:
- mntput(mnt);
- return ERR_PTR(err);
-}
-
/*
* add a mount into a namespace's mount tree
*/
@@ -2582,44 +2571,109 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
return err;
}

-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
+static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
+
+/*
+ * Create a new mount using a superblock configuration and request it
+ * be added to the namespace tree.
+ */
+static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ struct vfsmount *mnt;
+ int ret;
+
+ ret = security_sb_mountpoint(fc, mountpoint,
+ mnt_flags & ~MNT_INTERNAL_FLAGS);
+ if (ret < 0)
+ return ret;
+
+ if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
+ pr_warn("VFS: Mount too revealing\n");
+ return -EPERM;
+ }
+
+ mnt = vfs_create_mount(fc, mnt_flags);
+ if (IS_ERR(mnt))
+ return PTR_ERR(mnt);
+
+ ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+ if (ret < 0)
+ goto err_mnt;
+ return ret;
+
+err_mnt:
+ mntput(mnt);
+ return ret;
+}

/*
* create a new mount for userspace and request it to be added into the
* namespace's tree
*/
-static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
- int mnt_flags, const char *name,
+static int do_new_mount(struct path *mountpoint, const char *fstype,
+ int sb_flags, int mnt_flags, const char *name,
void *data, size_t data_size)
{
- struct file_system_type *type;
- struct vfsmount *mnt;
+ struct file_system_type *fs_type;
+ struct fs_context *fc;
+ const char *subtype = NULL;
int err;

if (!fstype)
return -EINVAL;

- type = get_fs_type(fstype);
- if (!type)
- return -ENODEV;
+ err = -ENODEV;
+ fs_type = get_fs_type(fstype);
+ if (!fs_type)
+ goto out;

- mnt = vfs_kern_mount(type, sb_flags, name, data, data_size);
- if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
- !mnt->mnt_sb->s_subtype)
- mnt = fs_set_subtype(mnt, fstype);
+ if (fs_type->fs_flags & FS_HAS_SUBTYPE) {
+ subtype = strchr(fstype, '.');
+ if (subtype) {
+ subtype++;
+ if (!subtype[0]) {
+ put_filesystem(fs_type);
+ return -EINVAL;
+ }
+ } else {
+ subtype = "";
+ }
+ }

- put_filesystem(type);
- if (IS_ERR(mnt))
- return PTR_ERR(mnt);
+ fc = vfs_new_fs_context(fs_type, NULL, sb_flags, sb_flags,
+ FS_CONTEXT_FOR_USER_MOUNT);
+ put_filesystem(fs_type);
+ if (IS_ERR(fc)) {
+ err = PTR_ERR(fc);
+ goto out;
+ }

- if (mount_too_revealing(mnt, &mnt_flags)) {
- mntput(mnt);
- return -EPERM;
+ if (subtype) {
+ err = vfs_parse_fs_string(fc, "subtype",
+ subtype, strlen(subtype));
+ if (err < 0)
+ goto out_fc;
}

- err = do_add_mount(real_mount(mnt), path, mnt_flags);
- if (err)
- mntput(mnt);
+ if (name) {
+ err = vfs_parse_fs_string(fc, "source", name, strlen(name));
+ if (err < 0)
+ goto out_fc;
+ }
+
+ err = parse_monolithic_mount_data(fc, data, data_size);
+ if (err < 0)
+ goto out_fc;
+
+ err = vfs_get_tree(fc);
+ if (err < 0)
+ goto out_fc;
+
+ err = do_new_mount_fc(fc, mountpoint, mnt_flags);
+out_fc:
+ put_fs_context(fc);
+out:
return err;
}

@@ -3169,6 +3223,118 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
return ksys_mount(dev_name, dir_name, type, flags, data);
}

+/**
+ * vfs_create_mount - Create a mount for a configured superblock
+ * @fc: The configuration context with the superblock attached
+ * @mnt_flags: The mount flags to apply
+ *
+ * Create a mount to an already configured superblock. If necessary, the
+ * caller should invoke vfs_get_tree() before calling this.
+ *
+ * Note that this does not attach the mount to anything.
+ */
+struct vfsmount *vfs_create_mount(struct fs_context *fc, unsigned int mnt_flags)
+{
+ struct mount *mnt;
+
+ if (!fc->root)
+ return ERR_PTR(-EINVAL);
+
+ mnt = alloc_vfsmnt(fc->source ?: "none");
+ if (!mnt)
+ return ERR_PTR(-ENOMEM);
+
+ if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+ /* It's a longterm mount, don't release mnt until we unmount
+ * before file sys is unregistered
+ */
+ mnt_flags |= MNT_INTERNAL;
+
+ atomic_inc(&fc->root->d_sb->s_active);
+ mnt->mnt.mnt_flags = mnt_flags;
+ mnt->mnt.mnt_sb = fc->root->d_sb;
+ mnt->mnt.mnt_root = dget(fc->root);
+ mnt->mnt_mountpoint = mnt->mnt.mnt_root;
+ mnt->mnt_parent = mnt;
+
+ lock_mount_hash();
+ list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
+ unlock_mount_hash();
+ return &mnt->mnt;
+}
+EXPORT_SYMBOL(vfs_create_mount);
+
+struct vfsmount *vfs_kern_mount(struct file_system_type *type,
+ int sb_flags, const char *devname,
+ void *data, size_t data_size)
+{
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ if (!type)
+ return ERR_PTR(-EINVAL);
+
+ fc = vfs_new_fs_context(type, NULL, sb_flags, sb_flags,
+ sb_flags & SB_KERNMOUNT ?
+ FS_CONTEXT_FOR_KERNEL_MOUNT :
+ FS_CONTEXT_FOR_USER_MOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ if (devname) {
+ ret = vfs_parse_fs_string(fc, "source",
+ devname, strlen(devname));
+ if (ret < 0)
+ goto err_fc;
+ }
+
+ ret = parse_monolithic_mount_data(fc, data, data_size);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc, 0);
+out:
+ put_fs_context(fc);
+ return mnt;
+err_fc:
+ mnt = ERR_PTR(ret);
+ goto out;
+}
+EXPORT_SYMBOL_GPL(vfs_kern_mount);
+
+struct vfsmount *
+vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
+ const char *name, void *data, size_t data_size)
+{
+ /* Until it is worked out how to pass the user namespace
+ * through from the parent mount to the submount don't support
+ * unprivileged mounts with submounts.
+ */
+ if (mountpoint->d_sb->s_user_ns != &init_user_ns)
+ return ERR_PTR(-EPERM);
+
+ return vfs_kern_mount(type, SB_SUBMOUNT, name, data, data_size);
+}
+EXPORT_SYMBOL_GPL(vfs_submount);
+
+struct vfsmount *kern_mount(struct file_system_type *type)
+{
+ return vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
+}
+EXPORT_SYMBOL_GPL(kern_mount);
+
+struct vfsmount *kern_mount_data(struct file_system_type *type,
+ void *data, size_t data_size)
+{
+ return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
+}
+EXPORT_SYMBOL_GPL(kern_mount_data);
+
/*
* Move a mount from one place to another.
* In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
@@ -3446,22 +3612,6 @@ void put_mnt_ns(struct mnt_namespace *ns)
free_mnt_ns(ns);
}

-struct vfsmount *kern_mount_data(struct file_system_type *type,
- void *data, size_t data_size)
-{
- struct vfsmount *mnt;
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
- if (!IS_ERR(mnt)) {
- /*
- * it is a longterm mount, don't release mnt until
- * we unmount before file sys is unregistered
- */
- real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
- }
- return mnt;
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
void kern_unmount(struct vfsmount *mnt)
{
/* release long term mount so mount point can be released */
@@ -3502,7 +3652,8 @@ bool current_chrooted(void)
return chrooted;
}

-static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
+static bool mnt_already_visible(struct mnt_namespace *ns,
+ const struct super_block *sb,
int *new_mnt_flags)
{
int new_flags = *new_mnt_flags;
@@ -3514,7 +3665,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
struct mount *child;
int mnt_flags;

- if (mnt->mnt.mnt_sb->s_type != new->mnt_sb->s_type)
+ if (mnt->mnt.mnt_sb->s_type != sb->s_type)
continue;

/* This mount is not fully visible if it's root directory
@@ -3565,7 +3716,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
return visible;
}

-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
+static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags)
{
const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
struct mnt_namespace *ns = current->nsproxy->mnt_ns;
@@ -3575,7 +3726,7 @@ static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
return false;

/* Can this filesystem be too revealing? */
- s_iflags = mnt->mnt_sb->s_iflags;
+ s_iflags = sb->s_iflags;
if (!(s_iflags & SB_I_USERNS_VISIBLE))
return false;

@@ -3585,7 +3736,7 @@ static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
return true;
}

- return !mnt_already_visible(ns, mnt, new_mnt_flags);
+ return !mnt_already_visible(ns, sb, new_mnt_flags);
}

bool mnt_may_suid(struct vfsmount *mnt)
diff --git a/fs/super.c b/fs/super.c
index 67f88c055967..df8c4cebd000 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,7 @@
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
#include <uapi/linux/mount.h>
+#include <linux/fs_context.h>
#include "internal.h"

static int thaw_super_locked(struct super_block *sb);
@@ -187,16 +188,13 @@ static void destroy_unused_super(struct super_block *s)
}

/**
- * alloc_super - create new superblock
- * @type: filesystem type superblock should belong to
- * @flags: the mount flags
- * @user_ns: User namespace for the super_block
+ * alloc_super - Create new superblock
+ * @fc: The filesystem configuration context
*
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
-static struct super_block *alloc_super(struct file_system_type *type, int flags,
- struct user_namespace *user_ns)
+static struct super_block *alloc_super(struct fs_context *fc)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static const struct super_operations default_op;
@@ -206,9 +204,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
return NULL;

INIT_LIST_HEAD(&s->s_mounts);
- s->s_user_ns = get_user_ns(user_ns);
+ s->s_user_ns = get_user_ns(fc->user_ns);
init_rwsem(&s->s_umount);
- lockdep_set_class(&s->s_umount, &type->s_umount_key);
+ lockdep_set_class(&s->s_umount, &fc->fs_type->s_umount_key);
/*
* sget() can have s_umount recursion.
*
@@ -232,12 +230,12 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
for (i = 0; i < SB_FREEZE_LEVELS; i++) {
if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
sb_writers_name[i],
- &type->s_writers_key[i]))
+ &fc->fs_type->s_writers_key[i]))
goto fail;
}
init_waitqueue_head(&s->s_writers.wait_unfrozen);
s->s_bdi = &noop_backing_dev_info;
- s->s_flags = flags;
+ s->s_flags = fc->sb_flags;
if (s->s_user_ns != &init_user_ns)
s->s_iflags |= SB_I_NODEV;
INIT_HLIST_NODE(&s->s_instances);
@@ -251,7 +249,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_count = 1;
atomic_set(&s->s_active, 1);
mutex_init(&s->s_vfs_rename_mutex);
- lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
+ lockdep_set_class(&s->s_vfs_rename_mutex, &fc->fs_type->s_vfs_rename_key);
init_rwsem(&s->s_dquot.dqio_sem);
s->s_maxbytes = MAX_NON_LFS;
s->s_op = &default_op;
@@ -475,6 +473,91 @@ void generic_shutdown_super(struct super_block *sb)

EXPORT_SYMBOL(generic_shutdown_super);

+/**
+ * sget_fc - Find or create a superblock
+ * @fc: Filesystem context.
+ * @test: Comparison callback
+ * @set: Setup callback
+ *
+ * Find or create a superblock using the parameters stored in the filesystem
+ * context and the two callback functions.
+ *
+ * If an extant superblock is matched, then that will be returned with an
+ * elevated reference count that the caller must transfer or discard.
+ *
+ * If no match is made, a new superblock will be allocated and basic
+ * initialisation will be performed (s_type, s_fs_info and s_id will be set and
+ * the set() callback will be invoked), the superblock will be published and it
+ * will be returned in a partially constructed state with SB_BORN and SB_ACTIVE
+ * as yet unset.
+ */
+struct super_block *sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *, struct fs_context *),
+ int (*set)(struct super_block *, struct fs_context *))
+{
+ struct super_block *s = NULL;
+ struct super_block *old;
+ int err;
+
+ if (!(fc->sb_flags & SB_KERNMOUNT) &&
+ fc->purpose != FS_CONTEXT_FOR_SUBMOUNT) {
+ /* Don't allow mounting unless the caller has CAP_SYS_ADMIN
+ * over the namespace.
+ */
+ if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT) &&
+ !capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ else if (!ns_capable(fc->user_ns, CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ }
+
+retry:
+ spin_lock(&sb_lock);
+ if (test) {
+ hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
+ if (test(old, fc))
+ goto share_extant_sb;
+ }
+ }
+ if (!s) {
+ spin_unlock(&sb_lock);
+ s = alloc_super(fc);
+ if (!s)
+ return ERR_PTR(-ENOMEM);
+ goto retry;
+ }
+
+ s->s_fs_info = fc->s_fs_info;
+ err = set(s, fc);
+ if (err) {
+ s->s_fs_info = NULL;
+ spin_unlock(&sb_lock);
+ destroy_unused_super(s);
+ return ERR_PTR(err);
+ }
+ fc->s_fs_info = NULL;
+ s->s_type = fc->fs_type;
+ strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
+ list_add_tail(&s->s_list, &super_blocks);
+ hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
+ spin_unlock(&sb_lock);
+ get_filesystem(s->s_type);
+ register_shrinker_prepared(&s->s_shrink);
+ return s;
+
+share_extant_sb:
+ if (fc->user_ns != old->s_user_ns) {
+ spin_unlock(&sb_lock);
+ destroy_unused_super(s);
+ return ERR_PTR(-EBUSY);
+ }
+ if (!grab_super(old))
+ goto retry;
+ destroy_unused_super(s);
+ return old;
+}
+EXPORT_SYMBOL(sget_fc);
+
/**
* sget_userns - find or create a superblock
* @type: filesystem type superblock should belong to
@@ -517,7 +600,14 @@ struct super_block *sget_userns(struct file_system_type *type,
}
if (!s) {
spin_unlock(&sb_lock);
- s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
+ {
+ struct fs_context fc = {
+ .fs_type = type,
+ .sb_flags = flags & ~SB_SUBMOUNT,
+ .user_ns = user_ns,
+ };
+ s = alloc_super(&fc);
+ }
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -835,30 +925,30 @@ struct super_block *user_get_super(dev_t dev)
}

/**
- * do_remount_sb - asks filesystem to change mount options.
- * @sb: superblock in question
- * @sb_flags: revised superblock flags
- * @data: the rest of options
- * @data_size: The size of the data
- * @force: whether or not to force the change
+ * reconfigure_super - asks filesystem to change superblock parameters
+ * @fc: the superblock and configuration
*
- * Alters the mount options of a mounted file system.
+ * Alters the configuration parameters of a live superblock.
*/
-int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
- size_t data_size, int force)
+int reconfigure_super(struct fs_context *fc)
{
+ struct super_block *sb = fc->root->d_sb;
int retval;
- int remount_ro;
+ bool remount_ro = false;

+ if (fc->sb_flags_mask & ~MS_RMT_MASK)
+ return -EINVAL;
if (sb->s_writers.frozen != SB_UNFROZEN)
return -EBUSY;

+ if (fc->sb_flags_mask & SB_RDONLY) {
#ifdef CONFIG_BLOCK
- if (!(sb_flags & SB_RDONLY) && bdev_read_only(sb->s_bdev))
- return -EACCES;
+ if (!(fc->sb_flags & SB_RDONLY) && bdev_read_only(sb->s_bdev))
+ return -EACCES;
#endif

- remount_ro = (sb_flags & SB_RDONLY) && !sb_rdonly(sb);
+ remount_ro = (fc->sb_flags & SB_RDONLY) && !sb_rdonly(sb);
+ }

if (remount_ro) {
if (!hlist_empty(&sb->s_pins)) {
@@ -869,15 +959,16 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
return 0;
if (sb->s_writers.frozen != SB_UNFROZEN)
return -EBUSY;
- remount_ro = (sb_flags & SB_RDONLY) && !sb_rdonly(sb);
+ remount_ro = !sb_rdonly(sb);
}
}
shrink_dcache_sb(sb);

- /* If we are remounting RDONLY and current sb is read/write,
- make sure there are no rw files opened */
+ /* If we are reconfiguring to RDONLY and current sb is read/write,
+ * make sure there are no files open for writing.
+ */
if (remount_ro) {
- if (force) {
+ if (fc->purpose == FS_CONTEXT_FOR_EMERGENCY_RO) {
sb->s_readonly_remount = 1;
smp_wmb();
} else {
@@ -887,17 +978,21 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
}
}

- if (sb->s_op->remount_fs) {
- retval = sb->s_op->remount_fs(sb, &sb_flags, data, data_size);
- if (retval) {
- if (!force)
+ if (fc->ops->reconfigure) {
+ retval = fc->ops->reconfigure(fc);
+ if (retval == 0) {
+ security_sb_reconfigure(fc);
+ } else {
+ if (fc->purpose != FS_CONTEXT_FOR_EMERGENCY_RO)
goto cancel_readonly;
/* If forced remount, go ahead despite any errors */
WARN(1, "forced remount of a %s fs returned %i\n",
sb->s_type->name, retval);
}
}
- sb->s_flags = (sb->s_flags & ~MS_RMT_MASK) | (sb_flags & MS_RMT_MASK);
+
+ WRITE_ONCE(sb->s_flags, ((sb->s_flags & ~fc->sb_flags_mask) |
+ (fc->sb_flags & fc->sb_flags_mask)));
/* Needs to be ordered wrt mnt_is_readonly() */
smp_wmb();
sb->s_readonly_remount = 0;
@@ -921,13 +1016,29 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,

static void do_emergency_remount_callback(struct super_block *sb)
{
+ struct fs_context fc = {
+ .purpose = FS_CONTEXT_FOR_EMERGENCY_RO,
+ .fs_type = sb->s_type,
+ .root = sb->s_root,
+ .sb_flags = SB_RDONLY,
+ .sb_flags_mask = SB_RDONLY,
+ };
+
down_write(&sb->s_umount);
if (sb->s_root && sb->s_bdev && (sb->s_flags & SB_BORN) &&
!sb_rdonly(sb)) {
+ int ret;
+
+ if (fc.fs_type->init_fs_context)
+ ret = fc.fs_type->init_fs_context(&fc, NULL);
+ else
+ ret = legacy_init_fs_context(&fc, NULL);
+
/*
* What lock protects sb->s_flags??
*/
- do_remount_sb(sb, SB_RDONLY, NULL, 0, 1);
+ if (ret == 0)
+ reconfigure_super(&fc);
}
up_write(&sb->s_umount);
}
@@ -1090,6 +1201,89 @@ struct dentry *mount_ns(struct file_system_type *fs_type,

EXPORT_SYMBOL(mount_ns);

+int set_anon_super_fc(struct super_block *sb, struct fs_context *fc)
+{
+ return set_anon_super(sb, NULL);
+}
+EXPORT_SYMBOL(set_anon_super_fc);
+
+static int test_keyed_super(struct super_block *sb, struct fs_context *fc)
+{
+ return sb->s_fs_info == fc->s_fs_info;
+}
+
+static int test_single_super(struct super_block *s, struct fs_context *fc)
+{
+ return 1;
+}
+
+/**
+ * vfs_get_super - Get a superblock with a search key set in s_fs_info.
+ * @fc: The filesystem context holding the parameters
+ * @keying: How to distinguish superblocks
+ * @fill_super: Helper to initialise a new superblock
+ *
+ * Search for a superblock and create a new one if not found. The search
+ * criterion is controlled by @keying. If the search fails, a new superblock
+ * is created and @fill_super() is called to initialise it.
+ *
+ * @keying can take one of a number of values:
+ *
+ * (1) vfs_get_single_super - Only one superblock of this type may exist on the
+ * system. This is typically used for special system filesystems.
+ *
+ * (2) vfs_get_keyed_super - Multiple superblocks may exist, but they must have
+ * distinct keys (where the key is in s_fs_info). Searching for the same
+ * key again will turn up the superblock for that key.
+ *
+ * (3) vfs_get_independent_super - Multiple superblocks may exist and are
+ * unkeyed. Each call will get a new superblock.
+ *
+ * A permissions check is made by sget_fc() unless we're getting a superblock
+ * for a kernel-internal mount or a submount.
+ */
+int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc))
+{
+ int (*test)(struct super_block *, struct fs_context *);
+ struct super_block *sb;
+
+ switch (keying) {
+ case vfs_get_single_super:
+ test = test_single_super;
+ break;
+ case vfs_get_keyed_super:
+ test = test_keyed_super;
+ break;
+ case vfs_get_independent_super:
+ test = NULL;
+ break;
+ default:
+ BUG();
+ }
+
+ sb = sget_fc(fc, test, set_anon_super_fc);
+ if (IS_ERR(sb))
+ return PTR_ERR(sb);
+
+ if (!sb->s_root) {
+ int err = fill_super(sb, fc);
+ if (err) {
+ deactivate_locked_super(sb);
+ return err;
+ }
+
+ sb->s_flags |= SB_ACTIVE;
+ }
+
+ BUG_ON(fc->root);
+ fc->root = dget(sb->s_root);
+ return 0;
+}
+EXPORT_SYMBOL(vfs_get_super);
+
#ifdef CONFIG_BLOCK
static int set_bdev_super(struct super_block *s, void *data)
{
@@ -1215,6 +1409,42 @@ struct dentry *mount_nodev(struct file_system_type *fs_type,
}
EXPORT_SYMBOL(mount_nodev);

+static int reconfigure_single(struct super_block *s,
+ int flags, void *data, size_t data_size)
+{
+ struct fs_context *fc;
+ int ret;
+
+ /* The caller really needs to be passing fc down into mount_single(),
+ * then a chunk of this can be removed. Better yet, reconfiguration
+ * shouldn't happen, but rather the second mount should be rejected if
+ * the parameters are not compatible.
+ */
+ fc = vfs_new_fs_context(s->s_type, s->s_root, flags, MS_RMT_MASK,
+ FS_CONTEXT_FOR_RECONFIGURE);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ret = parse_monolithic_mount_data(fc, data, data_size);
+ if (ret < 0)
+ goto out;
+
+ if (fc->ops->validate) {
+ ret = fc->ops->validate(fc);
+ if (ret < 0)
+ goto out;
+ }
+
+ ret = security_fs_context_validate(fc);
+ if (ret)
+ goto out;
+
+ ret = reconfigure_super(fc);
+out:
+ put_fs_context(fc);
+ return ret;
+}
+
static int compare_single(struct super_block *s, void *p)
{
return 1;
@@ -1232,15 +1462,19 @@ struct dentry *mount_single(struct file_system_type *fs_type,
return ERR_CAST(s);
if (!s->s_root) {
error = fill_super(s, data, data_size, flags & SB_SILENT ? 1 : 0);
- if (error) {
- deactivate_locked_super(s);
- return ERR_PTR(error);
- }
+ if (error)
+ goto error;
s->s_flags |= SB_ACTIVE;
} else {
- do_remount_sb(s, flags, data, data_size, 0);
+ error = reconfigure_single(s, flags, data, data_size);
+ if (error)
+ goto error;
}
return dget(s->s_root);
+
+error:
+ deactivate_locked_super(s);
+ return ERR_PTR(error);
}
EXPORT_SYMBOL(mount_single);

@@ -1585,3 +1819,90 @@ int thaw_super(struct super_block *sb)
return thaw_super_locked(sb);
}
EXPORT_SYMBOL(thaw_super);
+
+/**
+ * vfs_get_tree - Get the mountable root
+ * @fc: The superblock configuration context.
+ *
+ * The filesystem is invoked to get or create a superblock which can then later
+ * be used for mounting. The filesystem places a pointer to the root to be
+ * used for mounting in @fc->root.
+ */
+int vfs_get_tree(struct fs_context *fc)
+{
+ struct super_block *sb;
+ int ret;
+
+ if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source)
+ return -ENOENT;
+
+ if (fc->root)
+ return -EBUSY;
+
+ if (fc->ops->validate) {
+ ret = fc->ops->validate(fc);
+ if (ret < 0)
+ return ret;
+ }
+
+ ret = security_fs_context_validate(fc);
+ if (ret < 0)
+ return ret;
+
+ /* Get the mountable root in fc->root, with a ref on the root and a ref
+ * on the superblock.
+ */
+ ret = fc->ops->get_tree(fc);
+ if (ret < 0)
+ return ret;
+
+ if (!fc->root) {
+ pr_err("Filesystem %s get_tree() didn't set fc->root\n",
+ fc->fs_type->name);
+ /* We don't know what the locking state of the superblock is -
+ * if there is a superblock.
+ */
+ BUG();
+ }
+
+ sb = fc->root->d_sb;
+ WARN_ON(!sb->s_bdi);
+
+ ret = security_sb_get_tree(fc);
+ if (ret < 0)
+ goto err_sb;
+
+ ret = -ENOMEM;
+ if (fc->subtype && !sb->s_subtype) {
+ sb->s_subtype = kstrdup(fc->subtype, GFP_KERNEL);
+ if (!sb->s_subtype)
+ goto err_sb;
+ }
+
+ /* Write barrier is for super_cache_count(). We place it before setting
+ * SB_BORN as the data dependency between the two functions is the
+ * superblock structure contents that we just set up, not the SB_BORN
+ * flag.
+ */
+ smp_wmb();
+ sb->s_flags |= SB_BORN;
+
+ /* Filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
+ * but s_maxbytes was an unsigned long long for many releases. Throw
+ * this warning for a little while to try and catch filesystems that
+ * violate this rule.
+ */
+ WARN(sb->s_maxbytes < 0,
+ "%s set sb->s_maxbytes to negative value (%lld)\n",
+ fc->fs_type->name, sb->s_maxbytes);
+
+ up_write(&sb->s_umount);
+ return 0;
+
+err_sb:
+ dput(fc->root);
+ fc->root = NULL;
+ deactivate_locked_super(sb);
+ return ret;
+}
+EXPORT_SYMBOL(vfs_get_tree);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6dc32507762f..4c3c388646bc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -61,6 +61,8 @@ struct workqueue_struct;
struct iov_iter;
struct fscrypt_info;
struct fscrypt_operations;
+struct fs_context;
+struct fs_parameter_description;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -746,6 +748,11 @@ static inline void inode_unlock(struct inode *inode)
up_write(&inode->i_rwsem);
}

+static inline int inode_lock_killable(struct inode *inode)
+{
+ return down_write_killable(&inode->i_rwsem);
+}
+
static inline void inode_lock_shared(struct inode *inode)
{
down_read(&inode->i_rwsem);
@@ -2118,6 +2125,8 @@ struct file_system_type {
#define FS_HAS_SUBTYPE 4
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
+ int (*init_fs_context)(struct fs_context *, struct dentry *);
+ const struct fs_parameter_description *parameters;
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *, size_t);
void (*kill_sb) (struct super_block *);
@@ -2174,8 +2183,12 @@ void kill_litter_super(struct super_block *sb);
void deactivate_super(struct super_block *sb);
void deactivate_locked_super(struct super_block *sb);
int set_anon_super(struct super_block *s, void *data);
+int set_anon_super_fc(struct super_block *s, struct fs_context *fc);
int get_anon_bdev(dev_t *);
void free_anon_bdev(dev_t);
+struct super_block *sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *, struct fs_context *),
+ int (*set)(struct super_block *, struct fs_context *));
struct super_block *sget_userns(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
@@ -2218,8 +2231,8 @@ mount_pseudo(struct file_system_type *fs_type, char *name,

extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
+extern struct vfsmount *kern_mount(struct file_system_type *);
extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
-#define kern_mount(type) kern_mount_data(type, NULL, 0)
extern void kern_unmount(struct vfsmount *mnt);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 83c40d30868e..0415510f64ed 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -106,6 +106,36 @@ struct fs_context_operations {
int (*reconfigure)(struct fs_context *fc);
};

+/*
+ * fs_context manipulation functions.
+ */
+extern struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct dentry *reference,
+ unsigned int sb_flags,
+ unsigned int sb_flags_mask,
+ enum fs_context_purpose purpose);
+extern struct fs_context *vfs_dup_fs_context(struct fs_context *src,
+ enum fs_context_purpose purpose);
+extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param);
+extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
+ const char *value, size_t v_size);
+extern int generic_parse_monolithic(struct fs_context *fc, void *data, size_t data_size);
+extern int vfs_get_tree(struct fs_context *fc);
+extern void put_fs_context(struct fs_context *fc);
+
+/*
+ * sget() wrapper to be called from the ->get_tree() op.
+ */
+enum vfs_get_super_keying {
+ vfs_get_single_super, /* Only one such superblock may exist */
+ vfs_get_keyed_super, /* Superblocks with different s_fs_info keys may exist */
+ vfs_get_independent_super, /* Multiple independent superblocks may exist */
+};
+extern int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc));
+
#define logfc(FC, FMT, ...) pr_notice(FMT, ## __VA_ARGS__)

/**
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 814643f7ee52..0f6bb8e1bc83 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -25,6 +25,7 @@ struct seq_file;
struct vm_area_struct;
struct super_block;
struct file_system_type;
+struct fs_context;

struct kernfs_open_node;
struct kernfs_iattrs;
@@ -358,6 +359,7 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
bool *new_sb_created, const void *ns);
void kernfs_kill_sb(struct super_block *sb);
struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
+int kernfs_reconfigure(struct fs_context *fc);

void kernfs_init(void);

diff --git a/include/linux/mount.h b/include/linux/mount.h
index c9edd284f0af..41b6b080ffd0 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -21,6 +21,7 @@ struct super_block;
struct vfsmount;
struct dentry;
struct mnt_namespace;
+struct fs_context;

#define MNT_NOSUID 0x01
#define MNT_NODEV 0x02
@@ -88,6 +89,8 @@ struct path;
extern struct vfsmount *clone_private_mount(const struct path *path);

struct file_system_type;
+extern struct vfsmount *vfs_create_mount(struct fs_context *fc,
+ unsigned int mnt_flags);
extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
int flags, const char *name,
void *data, size_t data_size);


2018-09-21 16:35:06

by David Howells

Subject: [PATCH 27/34] vfs: Implement logging through fs_context [ver #12]

Implement the ability for filesystems to log error, warning and
informational messages through the fs_context. These can be extracted by
userspace by reading from an fd created by fsopen().

Error messages are prefixed with "e ", warnings with "w " and informational
messages with "i ".

Inside the kernel, formatted messages are malloc'd but unformatted messages
are not copied if they're either in the core .rodata section or in the
.rodata section of the filesystem module pinned by fs_context::fs_type.
The messages are only good till the fs_type is released.

Note that the logging object is shared between duplicated fs_context
structures. This is so that filesystems such as NFS, which do a mount
within a mount, can get at least some of the errors from the inner mount.

Five logging functions are provided for this:

(1) void logfc(struct fs_context *fc, const char *fmt, ...);

This logs a message into the context. If the buffer is full, the
earliest message is discarded.

(2) void errorf(fc, fmt, ...);

This wraps logfc() to log an error.

(3) int invalf(fc, fmt, ...);

This wraps errorf() and returns -EINVAL for convenience.

(4) void warnf(fc, fmt, ...);

This wraps logfc() to log a warning.

(5) void infof(fc, fmt, ...);

This wraps logfc() to log an informational message.
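
As an illustration of the buffer behaviour described in (1), here is a
minimal userspace sketch, not the kernel code. struct fc_log_model and the
fc_log_model_*() helpers are hypothetical names, and free() stands in for
kfree(). It assumes an 8-slot ring indexed by free-running 8-bit counters,
with a bitmask recording which slots own their string.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FC_LOG_SLOTS 8	/* must be a power of two, as in struct fc_log */

/* Userspace stand-in for struct fc_log (hypothetical name). */
struct fc_log_model {
	uint8_t head;		/* insertion counter, free-running, wraps at 256 */
	uint8_t tail;		/* removal counter, free-running, wraps at 256 */
	uint8_t need_free;	/* bit i set => buffer[i] is heap-allocated */
	char *buffer[FC_LOG_SLOTS];
};

/* Store msg; freeable is 1 if msg came from malloc(), 0 if it is constant. */
static void fc_log_model_push(struct fc_log_model *log, char *msg, int freeable)
{
	uint8_t index = log->head & (FC_LOG_SLOTS - 1);

	assert(freeable == 0 || freeable == 1);
	if ((uint8_t)(log->head - log->tail) == FC_LOG_SLOTS) {
		/* Full: the slot being reused holds the oldest message. */
		if (log->need_free & (1u << index))
			free(log->buffer[index]);
		log->tail++;
	}
	log->buffer[index] = msg;
	log->need_free &= (uint8_t)~(1u << index);
	log->need_free |= (uint8_t)(freeable << index);
	log->head++;
}

/* Remove the oldest message, or return NULL if the log is empty. */
static char *fc_log_model_pop(struct fc_log_model *log, int *freeable)
{
	uint8_t index;
	char *msg;

	if (log->head == log->tail)
		return NULL;
	index = log->tail & (FC_LOG_SLOTS - 1);
	msg = log->buffer[index];
	*freeable = (log->need_free >> index) & 1;
	log->buffer[index] = NULL;
	log->need_free &= (uint8_t)~(1u << index);
	log->tail++;
	return msg;
}
```

Pushing ten constant strings into an empty log leaves only the last eight,
so the first pop returns the third string that was pushed.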

Signed-off-by: David Howells <[email protected]>
---

fs/fs_context.c | 107 ++++++++++++++++++++++++++++++++++++++++++++
fs/fsopen.c | 67 ++++++++++++++++++++++++++++
include/linux/fs_context.h | 24 ++++++++--
include/linux/module.h | 6 ++
4 files changed, 200 insertions(+), 4 deletions(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index 5d93c86c649d..e242bfe12084 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -11,6 +11,7 @@
*/

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
#include <linux/fs_context.h>
#include <linux/fs_parser.h>
#include <linux/fs.h>
@@ -24,6 +25,7 @@
#include <linux/user_namespace.h>
#include <linux/bsearch.h>
#include <net/net_namespace.h>
+#include <asm/sections.h>
#include "mount.h"
#include "internal.h"

@@ -346,6 +348,8 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc,
get_net(fc->net_ns);
get_user_ns(fc->user_ns);
get_cred(fc->cred);
+ if (fc->log)
+ refcount_inc(&fc->log->usage);

/* Can't call put until we've called ->dup */
ret = fc->ops->dup(fc, src_fc);
@@ -363,6 +367,108 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc,
}
EXPORT_SYMBOL(vfs_dup_fs_context);

+/**
+ * logfc - Log a message to a filesystem context
+ * @fc: The filesystem context to log to.
+ * @fmt: The format of the buffer.
+ */
+void logfc(struct fs_context *fc, const char *fmt, ...)
+{
+ static const char store_failure[] = "OOM: Can't store error string";
+ struct fc_log *log = fc ? fc->log : NULL;
+ const char *p;
+ va_list va;
+ char *q;
+ u8 freeable;
+
+ va_start(va, fmt);
+ if (!strchr(fmt, '%')) {
+ p = fmt;
+ goto unformatted_string;
+ }
+ if (strcmp(fmt, "%s") == 0) {
+ p = va_arg(va, const char *);
+ goto unformatted_string;
+ }
+
+ q = kvasprintf(GFP_KERNEL, fmt, va);
+copied_string:
+ if (!q)
+ goto store_failure;
+ freeable = 1;
+ goto store_string;
+
+unformatted_string:
+ if ((unsigned long)p >= (unsigned long)__start_rodata &&
+ (unsigned long)p < (unsigned long)__end_rodata)
+ goto const_string;
+ if (log && within_module_core((unsigned long)p, log->owner))
+ goto const_string;
+ q = kstrdup(p, GFP_KERNEL);
+ goto copied_string;
+
+store_failure:
+ p = store_failure;
+const_string:
+ q = (char *)p;
+ freeable = 0;
+store_string:
+ if (!log) {
+ switch (fmt[0]) {
+ case 'w':
+ printk(KERN_WARNING "%s\n", q + 2);
+ break;
+ case 'e':
+ printk(KERN_ERR "%s\n", q + 2);
+ break;
+ default:
+ printk(KERN_NOTICE "%s\n", q + 2);
+ break;
+ }
+ if (freeable)
+ kfree(q);
+ } else {
+ unsigned int logsize = ARRAY_SIZE(log->buffer);
+ u8 index;
+
+ index = log->head & (logsize - 1);
+ BUILD_BUG_ON(sizeof(log->head) != sizeof(u8) ||
+ sizeof(log->tail) != sizeof(u8));
+ if ((u8)(log->head - log->tail) == logsize) {
+ /* The buffer is full, discard the oldest message */
+ if (log->need_free & (1 << index))
+ kfree(log->buffer[index]);
+ log->tail++;
+ }
+
+ log->buffer[index] = q;
+ log->need_free &= ~(1 << index);
+ log->need_free |= freeable << index;
+ log->head++;
+ }
+ va_end(va);
+}
+EXPORT_SYMBOL(logfc);
+
+/*
+ * Free a logging structure.
+ */
+static void put_fc_log(struct fs_context *fc)
+{
+ struct fc_log *log = fc->log;
+ int i;
+
+ if (log) {
+ if (refcount_dec_and_test(&log->usage)) {
+ fc->log = NULL;
+ for (i = 0; i <= 7; i++)
+ if (log->need_free & (1 << i))
+ kfree(log->buffer[i]);
+ kfree(log);
+ }
+ }
+}
+
/**
* put_fs_context - Dispose of a superblock configuration context.
* @fc: The context to dispose of.
@@ -388,6 +494,7 @@ void put_fs_context(struct fs_context *fc)
if (fc->cred)
put_cred(fc->cred);
kfree(fc->subtype);
+ put_fc_log(fc);
put_filesystem(fc->fs_type);
kfree(fc->source);
kfree(fc);
diff --git a/fs/fsopen.c b/fs/fsopen.c
index a9a36f61d76c..1b966b0c15f7 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -19,6 +19,52 @@
#include <linux/file.h>
#include "mount.h"

+/*
+ * Allow the user to read back any error, warning or informational messages.
+ */
+static ssize_t fscontext_read(struct file *file,
+ char __user *_buf, size_t len, loff_t *pos)
+{
+ struct fs_context *fc = file->private_data;
+ struct fc_log *log = fc->log;
+ unsigned int logsize = ARRAY_SIZE(log->buffer);
+ ssize_t ret;
+ char *p;
+ bool need_free;
+ int index, n;
+
+ ret = mutex_lock_interruptible(&fc->uapi_mutex);
+ if (ret < 0)
+ return ret;
+
+ if (log->head == log->tail) {
+ mutex_unlock(&fc->uapi_mutex);
+ return -ENODATA;
+ }
+
+ index = log->tail & (logsize - 1);
+ p = log->buffer[index];
+ need_free = log->need_free & (1 << index);
+ log->buffer[index] = NULL;
+ log->need_free &= ~(1 << index);
+ log->tail++;
+ mutex_unlock(&fc->uapi_mutex);
+
+ ret = -EMSGSIZE;
+ n = strlen(p);
+ if (n > len)
+ goto err_free;
+ ret = -EFAULT;
+ if (copy_to_user(_buf, p, n) != 0)
+ goto err_free;
+ ret = n;
+
+err_free:
+ if (need_free)
+ kfree(p);
+ return ret;
+}
+
static int fscontext_release(struct inode *inode, struct file *file)
{
struct fs_context *fc = file->private_data;
@@ -31,6 +77,7 @@ static int fscontext_release(struct inode *inode, struct file *file)
}

const struct file_operations fscontext_fops = {
+ .read = fscontext_read,
.release = fscontext_release,
.llseek = no_llseek,
};
@@ -49,6 +96,16 @@ static int fscontext_create_fd(struct fs_context *fc, unsigned int o_flags)
return fd;
}

+static int fscontext_alloc_log(struct fs_context *fc)
+{
+ fc->log = kzalloc(sizeof(*fc->log), GFP_KERNEL);
+ if (!fc->log)
+ return -ENOMEM;
+ refcount_set(&fc->log->usage, 1);
+ fc->log->owner = fc->fs_type->owner;
+ return 0;
+}
+
/*
* Open a filesystem by name so that it can be configured for mounting.
*
@@ -61,6 +118,7 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
struct file_system_type *fs_type;
struct fs_context *fc;
const char *fs_name;
+ int ret;

if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
@@ -83,5 +141,14 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
return PTR_ERR(fc);

fc->phase = FS_CONTEXT_CREATE_PARAMS;
+
+ ret = fscontext_alloc_log(fc);
+ if (ret < 0)
+ goto err_fc;
+
return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);
+
+err_fc:
+ put_fs_context(fc);
+ return ret;
}
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 37f8bafaaec3..bb584db982ff 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -13,6 +13,7 @@
#define _LINUX_FS_CONTEXT_H

#include <linux/kernel.h>
+#include <linux/refcount.h>
#include <linux/errno.h>
#include <linux/mutex.h>

@@ -98,6 +99,7 @@ struct fs_context {
struct user_namespace *user_ns; /* The user namespace for this mount */
struct net *net_ns; /* The network namespace for this mount */
const struct cred *cred; /* The mounter's credentials */
+ struct fc_log *log; /* Logging buffer */
char *source; /* The source name (eg. dev path) */
char *subtype; /* The subtype to set on the superblock */
void *security; /* The LSM context */
@@ -154,7 +156,21 @@ extern int vfs_get_super(struct fs_context *fc,

extern const struct file_operations fscontext_fops;

-#define logfc(FC, FMT, ...) pr_notice(FMT, ## __VA_ARGS__)
+/*
+ * Mount error, warning and informational message logging. This structure is
+ * shareable between a mount and a subordinate mount.
+ */
+struct fc_log {
+ refcount_t usage;
+ u8 head; /* Insertion index in buffer[] */
+ u8 tail; /* Removal index in buffer[] */
+ u8 need_free; /* Mask of kfree'able items in buffer[] */
+ struct module *owner; /* Owner module for strings that don't then need freeing */
+ char *buffer[8];
+};
+
+extern __attribute__((format(printf, 2, 3)))
+void logfc(struct fs_context *fc, const char *fmt, ...);

/**
* infof - Store supplementary informational message
@@ -164,7 +180,7 @@ extern const struct file_operations fscontext_fops;
* Store the supplementary informational message for the process if the process
* has enabled the facility.
*/
-#define infof(fc, fmt, ...) ({ logfc(fc, fmt, ## __VA_ARGS__); })
+#define infof(fc, fmt, ...) ({ logfc(fc, "i "fmt, ## __VA_ARGS__); })

/**
* warnf - Store supplementary warning message
@@ -174,7 +190,7 @@ extern const struct file_operations fscontext_fops;
* Store the supplementary warning message for the process if the process has
* enabled the facility.
*/
-#define warnf(fc, fmt, ...) ({ logfc(fc, fmt, ## __VA_ARGS__); })
+#define warnf(fc, fmt, ...) ({ logfc(fc, "w "fmt, ## __VA_ARGS__); })

/**
* errorf - Store supplementary error message
@@ -184,7 +200,7 @@ extern const struct file_operations fscontext_fops;
* Store the supplementary error message for the process if the process has
* enabled the facility.
*/
-#define errorf(fc, fmt, ...) ({ logfc(fc, fmt, ## __VA_ARGS__); })
+#define errorf(fc, fmt, ...) ({ logfc(fc, "e "fmt, ## __VA_ARGS__); })

/**
* invalf - Store supplementary invalid argument error message
diff --git a/include/linux/module.h b/include/linux/module.h
index f807f15bebbe..0add4e176a06 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -682,6 +682,12 @@ static inline bool is_module_text_address(unsigned long addr)
return false;
}

+static inline bool within_module_core(unsigned long addr,
+ const struct module *mod)
+{
+ return false;
+}
+
/* Get/put a kernel symbol (calls should be symmetric) */
#define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
#define symbol_put(x) do { } while (0)


2018-09-21 16:35:14

by David Howells

Subject: [PATCH 28/34] vfs: Add some logging to the core users of the fs_context log [ver #12]

Add some logging to the core users of the fs_context log so that
information can be extracted from them as to the reason for failure.

Signed-off-by: David Howells <[email protected]>
---

fs/super.c | 4 +++-
kernel/cgroup/cgroup-v1.c | 2 +-
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index de43b140bbb1..021cbdcc0e6e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1771,8 +1771,10 @@ int vfs_get_tree(struct fs_context *fc)
struct super_block *sb;
int ret;

- if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source)
+ if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source) {
+ errorf(fc, "Filesystem requires source device");
return -ENOENT;
+ }

if (fc->root)
return -EBUSY;
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index d8b325c3c2eb..d5ae888b8c57 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -17,7 +17,7 @@

#include <trace/events/cgroup.h>

-#define cg_invalf(fc, fmt, ...) ({ pr_err(fmt, ## __VA_ARGS__); -EINVAL; })
+#define cg_invalf(fc, fmt, ...) invalf(fc, fmt, ## __VA_ARGS__)

/*
* pidlists linger the following amount before being destroyed. The goal


2018-09-21 16:35:14

by David Howells

Subject: [PATCH 26/34] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #12]

Provide an fsopen() system call that starts the process of preparing to
create a superblock that will then be mountable, using an fd as a context
handle. fsopen() is given the name of the filesystem that will be used:

int sfd = fsopen(const char *fsname, unsigned int flags);

where flags can be 0 or FSOPEN_CLOEXEC.

For example:

sfd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fsinfo(sfd, NULL, ...); // query new superblock attributes
mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

sfd = fsopen("afs", 0);
fsconfig(sfd, FSCONFIG_SET_STRING, "source",
"#grand.central.org:root.cell", 0);
fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(sfd, 0, MS_NODEV);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:

"e <subsys>:<problem>"
"e SELinux:Mount on mountpoint not permitted"

Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
even if the fsmount() fails. read() is still possible to retrieve error
information.
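
A rough sketch of how a userspace caller might interpret a message read
back from the context fd. It assumes the single-letter prefix convention
shown above ("e " for errors, with "w " and "i " for warnings and
informational messages as introduced by the logging patch in this series)
and that each read() returns exactly one message with no trailing NUL.
fc_msg_classify() and enum fc_msg_level are hypothetical names, not part of
any proposed API.

```c
#include <stddef.h>

enum fc_msg_level {
	FC_MSG_ERROR,
	FC_MSG_WARNING,
	FC_MSG_INFO,
	FC_MSG_UNKNOWN,
};

/*
 * Classify one message as returned by a single read() on the fsopen() fd.
 * buf need not be NUL-terminated; len is the byte count read() returned.
 * On return, *text and *text_len describe the message body after the
 * prefix, or the whole buffer if no known prefix was found.
 */
static enum fc_msg_level fc_msg_classify(const char *buf, size_t len,
					 const char **text, size_t *text_len)
{
	enum fc_msg_level level = FC_MSG_UNKNOWN;

	if (len >= 2 && buf[1] == ' ') {
		switch (buf[0]) {
		case 'e':
			level = FC_MSG_ERROR;
			break;
		case 'w':
			level = FC_MSG_WARNING;
			break;
		case 'i':
			level = FC_MSG_INFO;
			break;
		default:
			break;
		}
	}

	if (level == FC_MSG_UNKNOWN) {
		*text = buf;
		*text_len = len;
	} else {
		*text = buf + 2;
		*text_len = len - 2;
	}
	return level;
}
```

A caller would typically loop on read() until it fails with ENODATA,
passing each returned buffer and length to the classifier.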

The fsopen() syscall creates a mount context and hangs it off the fd that it
returns.

Netlink is not used because it is optional and would make the core VFS
dependent on the networking layer and also potentially add network
namespace issues.

Note that, for the moment, the caller must have CAP_SYS_ADMIN to use
fsopen().

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/Makefile | 2 -
fs/fs_context.c | 4 +
fs/fsopen.c | 87 ++++++++++++++++++++++++++++++++
include/linux/fs_context.h | 18 +++++++
include/linux/syscalls.h | 1
include/uapi/linux/fs.h | 5 ++
8 files changed, 118 insertions(+), 1 deletion(-)
create mode 100644 fs/fsopen.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 76d092b7d1b0..1647fefd2969 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
386 i386 rseq sys_rseq __ia32_sys_rseq
387 i386 open_tree sys_open_tree __ia32_sys_open_tree
388 i386 move_mount sys_move_mount __ia32_sys_move_mount
+389 i386 fsopen sys_fsopen __ia32_sys_fsopen
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 37ba4e65eee6..235d33dbccb2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
334 common rseq __x64_sys_rseq
335 common open_tree __x64_sys_open_tree
336 common move_mount __x64_sys_move_mount
+337 common fsopen __x64_sys_fsopen

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index ae681523b4b1..e3ea8093b178 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y := open.o read_write.o file_table.o super.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
- fs_context.o fs_parser.o
+ fs_context.o fs_parser.o fsopen.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_context.c b/fs/fs_context.c
index 328fcb764667..5d93c86c649d 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -267,6 +267,8 @@ struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
fc->fs_type = get_filesystem(fs_type);
fc->cred = get_current_cred();

+ mutex_init(&fc->uapi_mutex);
+
switch (purpose) {
case FS_CONTEXT_FOR_KERNEL_MOUNT:
fc->sb_flags |= SB_KERNMOUNT;
@@ -334,6 +336,8 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc,
if (!fc)
return ERR_PTR(-ENOMEM);

+ mutex_init(&fc->uapi_mutex);
+
fc->fs_private = NULL;
fc->s_fs_info = NULL;
fc->source = NULL;
diff --git a/fs/fsopen.c b/fs/fsopen.c
new file mode 100644
index 000000000000..a9a36f61d76c
--- /dev/null
+++ b/fs/fsopen.c
@@ -0,0 +1,87 @@
+/* Filesystem access-by-fd.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs_context.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/security.h>
+#include <linux/anon_inodes.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include "mount.h"
+
+static int fscontext_release(struct inode *inode, struct file *file)
+{
+ struct fs_context *fc = file->private_data;
+
+ if (fc) {
+ file->private_data = NULL;
+ put_fs_context(fc);
+ }
+ return 0;
+}
+
+const struct file_operations fscontext_fops = {
+ .release = fscontext_release,
+ .llseek = no_llseek,
+};
+
+/*
+ * Attach a filesystem context to a file and an fd.
+ */
+static int fscontext_create_fd(struct fs_context *fc, unsigned int o_flags)
+{
+ int fd;
+
+ fd = anon_inode_getfd("fscontext", &fscontext_fops, fc,
+ O_RDWR | o_flags);
+ if (fd < 0)
+ put_fs_context(fc);
+ return fd;
+}
+
+/*
+ * Open a filesystem by name so that it can be configured for mounting.
+ *
+ * We are allowed to specify a container in which the filesystem will be
+ * opened, thereby indicating which namespaces will be used (notably, which
+ * network namespace will be used for network filesystems).
+ */
+SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
+{
+ struct file_system_type *fs_type;
+ struct fs_context *fc;
+ const char *fs_name;
+
+ if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (flags & ~FSOPEN_CLOEXEC)
+ return -EINVAL;
+
+ fs_name = strndup_user(_fs_name, PAGE_SIZE);
+ if (IS_ERR(fs_name))
+ return PTR_ERR(fs_name);
+
+ fs_type = get_fs_type(fs_name);
+ kfree(fs_name);
+ if (!fs_type)
+ return -ENODEV;
+
+ fc = vfs_new_fs_context(fs_type, NULL, 0, 0, FS_CONTEXT_FOR_USER_MOUNT);
+ put_filesystem(fs_type);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ fc->phase = FS_CONTEXT_CREATE_PARAMS;
+ return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);
+}
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 0415510f64ed..37f8bafaaec3 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -14,6 +14,7 @@

#include <linux/kernel.h>
#include <linux/errno.h>
+#include <linux/mutex.h>

struct cred;
struct dentry;
@@ -37,6 +38,19 @@ enum fs_context_purpose {
FS_CONTEXT_FOR_EMERGENCY_RO, /* Emergency reconfiguration to R/O */
};

+/*
+ * Userspace usage phase for fsopen/fspick.
+ */
+enum fs_context_phase {
+ FS_CONTEXT_CREATE_PARAMS, /* Loading params for sb creation */
+ FS_CONTEXT_CREATING, /* A superblock is being created */
+ FS_CONTEXT_AWAITING_MOUNT, /* Superblock created, awaiting fsmount() */
+ FS_CONTEXT_AWAITING_RECONF, /* Awaiting initialisation for reconfiguration */
+ FS_CONTEXT_RECONF_PARAMS, /* Loading params for reconfiguration */
+ FS_CONTEXT_RECONFIGURING, /* Reconfiguring the superblock */
+ FS_CONTEXT_FAILED, /* Failed to correctly transition a context */
+};
+
/*
* Type of parameter value.
*/
@@ -77,6 +91,7 @@ struct fs_parameter {
*/
struct fs_context {
const struct fs_context_operations *ops;
+ struct mutex uapi_mutex; /* Userspace access mutex */
struct file_system_type *fs_type;
void *fs_private; /* The filesystem's context */
struct dentry *root; /* The root and superblock */
@@ -91,6 +106,7 @@ struct fs_context {
unsigned int sb_flags_mask; /* Superblock flags that were changed */
unsigned int lsm_flags; /* Information flags from the fs to the LSM */
enum fs_context_purpose purpose:8;
+ enum fs_context_phase phase:8; /* The phase the context is in */
bool sloppy:1; /* T if unrecognised options are okay */
bool silent:1; /* T if "o silent" specified */
bool need_free:1; /* Need to call ops->free() */
@@ -136,6 +152,8 @@ extern int vfs_get_super(struct fs_context *fc,
int (*fill_super)(struct super_block *sb,
struct fs_context *fc));

+extern const struct file_operations fscontext_fops;
+
#define logfc(FC, FMT, ...) pr_notice(FMT, ## __VA_ARGS__)

/**
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 79042396f7e5..650d99c91987 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -910,6 +910,7 @@ asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
int to_dfd, const char __user *to_path,
unsigned int ms_flags);
+asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 1c982eb44ff4..f8818e6cddd6 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -344,4 +344,9 @@ typedef int __bitwise __kernel_rwf_t;
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND)

+/*
+ * Flags for fsopen() and co.
+ */
+#define FSOPEN_CLOEXEC 0x00000001
+
#endif /* _UAPI_LINUX_FS_H */


2018-09-21 16:35:22

by David Howells

Subject: [PATCH 29/34] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #12]

Add a syscall for configuring a filesystem creation context and triggering
actions upon it, to be used in conjunction with fsopen, fspick and fsmount.

long fsconfig(int fs_fd, unsigned int cmd, const char *key,
const void *value, int aux);

Where fs_fd indicates the context, cmd indicates the action to take, key
indicates the parameter name for parameter-setting actions and, if needed,
value points to a buffer containing the value and aux can give more
information for the value.

The following command IDs are proposed:

(*) FSCONFIG_SET_FLAG: No value is specified. The parameter must be
boolean in nature. The key may be prefixed with "no" to invert the
setting. value must be NULL and aux must be 0.

(*) FSCONFIG_SET_STRING: A string value is specified. The parameter may
expect a boolean, integer, string or path. A conversion to an
appropriate type will be attempted (which may include looking the value
up as a path). value points to a NUL-terminated string and aux must be 0.

(*) FSCONFIG_SET_BINARY: A binary blob is specified. value points to
the blob and aux indicates its size. The parameter must be expecting
a blob.

(*) FSCONFIG_SET_PATH: A non-empty path is specified. The parameter must
be expecting a path object. value points to a NUL-terminated string
that is the path and aux is a file descriptor at which to start a
relative lookup or AT_FDCWD.

(*) FSCONFIG_SET_PATH_EMPTY: As FSCONFIG_SET_PATH, but with AT_EMPTY_PATH
implied.

(*) FSCONFIG_SET_FD: An open file descriptor is specified. value must
be NULL and aux indicates the file descriptor.

(*) FSCONFIG_CMD_CREATE: Trigger superblock creation.

(*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.
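
The per-command argument rules listed above can be summarised as a small
userspace model. This is only a sketch that mirrors the checks made at the
top of the fsconfig() syscall in this patch; check_fsconfig_args() and
FSCONFIG_MAX_BLOB are hypothetical names used for illustration and are not
part of the patch. The 1MiB blob limit and the AT_FDCWD handling are taken
from the syscall code further down.

```c
#include <errno.h>
#include <stddef.h>

/* Fallback so the sketch builds without <fcntl.h>; -100 is the Linux value. */
#ifndef AT_FDCWD
#define AT_FDCWD (-100)
#endif

/* Command IDs as proposed in this patch's include/uapi/linux/fs.h. */
enum fsconfig_command {
	FSCONFIG_SET_FLAG		= 0,
	FSCONFIG_SET_STRING		= 1,
	FSCONFIG_SET_BINARY		= 2,
	FSCONFIG_SET_PATH		= 3,
	FSCONFIG_SET_PATH_EMPTY		= 4,
	FSCONFIG_SET_FD			= 5,
	FSCONFIG_CMD_CREATE		= 6,
	FSCONFIG_CMD_RECONFIGURE	= 7,
};

/* Hypothetical name for the blob size cap used by the syscall. */
#define FSCONFIG_MAX_BLOB (1024 * 1024)

/*
 * Hypothetical helper: return 0 if (key, value, aux) is an acceptable
 * combination for cmd, or a negative errno otherwise.
 */
static int check_fsconfig_args(unsigned int cmd, const char *key,
			       const void *value, int aux)
{
	switch (cmd) {
	case FSCONFIG_SET_FLAG:
		return (key && !value && aux == 0) ? 0 : -EINVAL;
	case FSCONFIG_SET_STRING:
		return (key && value && aux == 0) ? 0 : -EINVAL;
	case FSCONFIG_SET_BINARY:
		return (key && value && aux > 0 &&
			aux <= FSCONFIG_MAX_BLOB) ? 0 : -EINVAL;
	case FSCONFIG_SET_PATH:
	case FSCONFIG_SET_PATH_EMPTY:
		/* aux is a dirfd: either AT_FDCWD or a non-negative fd */
		return (key && value &&
			(aux == AT_FDCWD || aux >= 0)) ? 0 : -EINVAL;
	case FSCONFIG_SET_FD:
		return (key && !value && aux >= 0) ? 0 : -EINVAL;
	case FSCONFIG_CMD_CREATE:
	case FSCONFIG_CMD_RECONFIGURE:
		return (!key && !value && aux == 0) ? 0 : -EINVAL;
	default:
		return -EOPNOTSUPP;
	}
}
```

Note that under these rules FSCONFIG_SET_FD with a non-NULL value, even an
empty string, is rejected with -EINVAL.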

For the "set" command IDs, the idea is that the file_system_type will point
to a list of parameters and the types of value that those parameters expect
to take. The core code can then do the parse and argument conversion and
then give the LSM and FS a cooked option or array of options to use.

Source specification is also done the same way, using the special keys
"source", "source1", "source2", etc.

[!] Note that, for the moment, the key and value are just glued back
together and handed to the filesystem. Every filesystem that uses options
uses match_token() and co. to do this, and this will need to be changed -
but not all at once.
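
To illustrate what "glued back together" could look like, here is a minimal
userspace sketch. It assumes the conventional comma-separated "key=value"
form used for mount(2) option strings; the exact format the legacy wrapper
produces is not shown in this message, and glue_option() is a hypothetical
helper, not a function from the series.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: append "key" or "key=value" to the comma-separated
 * option string in buf (capacity size, must already be NUL-terminated).
 * Returns 0 on success or -1 if the result would not fit, in which case
 * buf is left unchanged.
 */
static int glue_option(char *buf, size_t size, const char *key,
		       const char *value)
{
	size_t used = strlen(buf);
	int n;

	if (used >= size)
		return -1;

	n = snprintf(buf + used, size - used, "%s%s%s%s",
		     used ? "," : "",		/* separator after first option */
		     key,
		     value ? "=" : "",
		     value ? value : "");
	if (n < 0 || (size_t)n >= size - used) {
		buf[used] = '\0';		/* undo the partial append */
		return -1;
	}
	return 0;
}
```

A flag-type parameter is passed with a NULL value and so contributes only
its key; a string parameter contributes "key=value".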

Example usage:

fd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
fsconfig(fd, FSCONFIG_SET_PATH_EMPTY, "journal_path", "", journal_fd);
fsconfig(fd, FSCONFIG_SET_FD, "journal_fd", NULL, journal_fd);
fsconfig(fd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
fsconfig(fd, FSCONFIG_SET_FLAG, "noacl", NULL, 0);
fsconfig(fd, FSCONFIG_SET_STRING, "sb", "1", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "errors", "continue", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "data", "journal", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "context", "unconfined_u:...", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

fd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

fd = fsopen("afs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "source", "#grand.central.org:root.cell", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

fd = fsopen("jffs2", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "source", "mtd0", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/fsopen.c | 355 ++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 2
include/uapi/linux/fs.h | 14 +
5 files changed, 373 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1647fefd2969..f9970310c126 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -401,3 +401,4 @@
387 i386 open_tree sys_open_tree __ia32_sys_open_tree
388 i386 move_mount sys_move_mount __ia32_sys_move_mount
389 i386 fsopen sys_fsopen __ia32_sys_fsopen
+390 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 235d33dbccb2..4185d36e03bb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -346,6 +346,7 @@
335 common open_tree __x64_sys_open_tree
336 common move_mount __x64_sys_move_mount
337 common fsopen __x64_sys_fsopen
+338 common fsconfig __x64_sys_fsconfig

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 1b966b0c15f7..5955a6b65596 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -10,6 +10,7 @@
*/

#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
@@ -17,6 +18,7 @@
#include <linux/anon_inodes.h>
#include <linux/namei.h>
#include <linux/file.h>
+#include "internal.h"
#include "mount.h"

/*
@@ -152,3 +154,356 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
put_fs_context(fc);
return ret;
}
+
+/*
+ * Check the state and apply the configuration. Note that this function is
+ * allowed to 'steal' the value by setting param->xxx to NULL before returning.
+ */
+static int vfs_fsconfig(struct fs_context *fc, struct fs_parameter *param)
+{
+ int ret;
+
+ /* We need to reinitialise the context if we have reconfiguration
+ * pending after creation or a previous reconfiguration.
+ */
+ if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+ if (fc->fs_type->init_fs_context) {
+ ret = fc->fs_type->init_fs_context(fc, fc->root);
+ if (ret < 0) {
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+ }
+ fc->need_free = true;
+ } else {
+ /* Leave legacy context ops in place */
+ }
+
+ /* Do the security check last because ->init_fs_context may
+ * change the namespace subscriptions.
+ */
+ ret = security_fs_context_alloc(fc, fc->root);
+ if (ret < 0) {
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+ }
+
+ fc->phase = FS_CONTEXT_RECONF_PARAMS;
+ }
+
+ if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+ fc->phase != FS_CONTEXT_RECONF_PARAMS)
+ return -EBUSY;
+
+ return vfs_parse_fs_param(fc, param);
+}
+
+/*
+ * Reconfigure a superblock.
+ */
+int vfs_reconfigure_sb(struct fs_context *fc)
+{
+ struct super_block *sb = fc->root->d_sb;
+ int ret;
+
+ if (fc->ops->validate) {
+ ret = fc->ops->validate(fc);
+ if (ret < 0)
+ return ret;
+ }
+
+ ret = security_fs_context_validate(fc);
+ if (ret)
+ return ret;
+
+ if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ down_write(&sb->s_umount);
+ ret = reconfigure_super(fc);
+ up_write(&sb->s_umount);
+ return ret;
+}
+
+/*
+ * Clean up a context after performing an action on it and put it into a state
+ * from where it can be used to reconfigure a superblock.
+ */
+void vfs_clean_context(struct fs_context *fc)
+{
+ if (fc->ops && fc->ops->free)
+ fc->ops->free(fc);
+ fc->need_free = false;
+ fc->fs_private = NULL;
+ fc->s_fs_info = NULL;
+ fc->sb_flags = 0;
+ fc->sloppy = false;
+ fc->silent = false;
+ security_fs_context_free(fc);
+ fc->security = NULL;
+ kfree(fc->subtype);
+ fc->subtype = NULL;
+ kfree(fc->source);
+ fc->source = NULL;
+
+ fc->purpose = FS_CONTEXT_FOR_RECONFIGURE;
+ fc->phase = FS_CONTEXT_AWAITING_RECONF;
+}
+
+/*
+ * Perform an action on a context.
+ */
+static int vfs_fsconfig_action(struct fs_context *fc, enum fsconfig_command cmd)
+{
+ int ret = -EINVAL;
+
+ switch (cmd) {
+ case FSCONFIG_CMD_CREATE:
+ if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
+ return -EBUSY;
+ fc->phase = FS_CONTEXT_CREATING;
+ ret = vfs_get_tree(fc);
+ if (ret == 0)
+ fc->phase = FS_CONTEXT_AWAITING_MOUNT;
+ else
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+
+ case FSCONFIG_CMD_RECONFIGURE:
+ if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+ /* This is probably pointless, since no changes have
+ * been proposed.
+ */
+ if (fc->fs_type->init_fs_context) {
+ ret = fc->fs_type->init_fs_context(fc, fc->root);
+ if (ret < 0) {
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+ }
+ fc->need_free = true;
+ }
+ fc->phase = FS_CONTEXT_RECONF_PARAMS;
+ }
+
+ fc->phase = FS_CONTEXT_RECONFIGURING;
+ ret = vfs_reconfigure_sb(fc);
+ if (ret == 0)
+ vfs_clean_context(fc);
+ else
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+/**
+ * sys_fsconfig - Set parameters and trigger actions on a context
+ * @fd: The filesystem context to act upon
+ * @cmd: The action to take
+ * @_key: Where appropriate, the parameter key to set
+ * @_value: Where appropriate, the parameter value to set
+ * @aux: Additional information for the value
+ *
+ * This system call is used to set parameters on a context, including
+ * superblock settings, data source and security labelling.
+ *
+ * Actions include triggering the creation of a superblock and the
+ * reconfiguration of the superblock attached to the specified context.
+ *
+ * When setting a parameter, @cmd indicates the type of value being proposed
+ * and @_key indicates the parameter to be altered.
+ *
+ * @_value and @aux are used to specify the value, should a value be required:
+ *
+ * (*) FSCONFIG_SET_FLAG: No value is specified. The parameter must be boolean
+ * in nature. The key may be prefixed with "no" to invert the
+ * setting. @_value must be NULL and @aux must be 0.
+ *
+ * (*) FSCONFIG_SET_STRING: A string value is specified. The parameter can be
+ * expecting boolean, integer, string or take a path. A conversion to an
+ * appropriate type will be attempted (which may include looking up as a
+ * path). @_value points to a NUL-terminated string and @aux must be 0.
+ *
+ * (*) FSCONFIG_SET_BINARY: A binary blob is specified. @_value points to the
+ * blob and @aux indicates its size. The parameter must be expecting a
+ * blob.
+ *
+ * (*) FSCONFIG_SET_PATH: A non-empty path is specified. The parameter must be
+ * expecting a path object. @_value points to a NUL-terminated string that
+ * is the path and @aux is a file descriptor at which to start a relative
+ * lookup or AT_FDCWD.
+ *
+ * (*) FSCONFIG_SET_PATH_EMPTY: As FSCONFIG_SET_PATH, but with AT_EMPTY_PATH
+ * implied.
+ *
+ * (*) FSCONFIG_SET_FD: An open file descriptor is specified. @_value must be
+ * NULL and @aux indicates the file descriptor.
+ */
+SYSCALL_DEFINE5(fsconfig,
+ int, fd,
+ unsigned int, cmd,
+ const char __user *, _key,
+ const void __user *, _value,
+ int, aux)
+{
+ struct fs_context *fc;
+ struct fd f;
+ int ret;
+
+ struct fs_parameter param = {
+ .type = fs_value_is_undefined,
+ };
+
+ if (fd < 0)
+ return -EINVAL;
+
+ switch (cmd) {
+ case FSCONFIG_SET_FLAG:
+ if (!_key || _value || aux)
+ return -EINVAL;
+ break;
+ case FSCONFIG_SET_STRING:
+ if (!_key || !_value || aux)
+ return -EINVAL;
+ break;
+ case FSCONFIG_SET_BINARY:
+ if (!_key || !_value || aux <= 0 || aux > 1024 * 1024)
+ return -EINVAL;
+ break;
+ case FSCONFIG_SET_PATH:
+ case FSCONFIG_SET_PATH_EMPTY:
+ if (!_key || !_value || (aux != AT_FDCWD && aux < 0))
+ return -EINVAL;
+ break;
+ case FSCONFIG_SET_FD:
+ if (!_key || _value || aux < 0)
+ return -EINVAL;
+ break;
+ case FSCONFIG_CMD_CREATE:
+ case FSCONFIG_CMD_RECONFIGURE:
+ if (_key || _value || aux)
+ return -EINVAL;
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+
+ f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+ ret = -EINVAL;
+ if (f.file->f_op != &fscontext_fops)
+ goto out_f;
+
+ fc = f.file->private_data;
+ if (fc->ops == &legacy_fs_context_ops) {
+ switch (cmd) {
+ case FSCONFIG_SET_BINARY:
+ case FSCONFIG_SET_PATH:
+ case FSCONFIG_SET_PATH_EMPTY:
+ case FSCONFIG_SET_FD:
+ ret = -EOPNOTSUPP;
+ goto out_f;
+ }
+ }
+
+ if (_key) {
+ param.key = strndup_user(_key, 256);
+ if (IS_ERR(param.key)) {
+ ret = PTR_ERR(param.key);
+ goto out_f;
+ }
+ }
+
+ switch (cmd) {
+ case FSCONFIG_SET_STRING:
+ param.type = fs_value_is_string;
+ param.string = strndup_user(_value, 256);
+ if (IS_ERR(param.string)) {
+ ret = PTR_ERR(param.string);
+ goto out_key;
+ }
+ param.size = strlen(param.string);
+ break;
+ case FSCONFIG_SET_BINARY:
+ param.type = fs_value_is_blob;
+ param.size = aux;
+ param.blob = memdup_user_nul(_value, aux);
+ if (IS_ERR(param.blob)) {
+ ret = PTR_ERR(param.blob);
+ goto out_key;
+ }
+ break;
+ case FSCONFIG_SET_PATH:
+ param.type = fs_value_is_filename;
+ param.name = getname_flags(_value, 0, NULL);
+ if (IS_ERR(param.name)) {
+ ret = PTR_ERR(param.name);
+ goto out_key;
+ }
+ param.dirfd = aux;
+ param.size = strlen(param.name->name);
+ break;
+ case FSCONFIG_SET_PATH_EMPTY:
+ param.type = fs_value_is_filename_empty;
+ param.name = getname_flags(_value, LOOKUP_EMPTY, NULL);
+ if (IS_ERR(param.name)) {
+ ret = PTR_ERR(param.name);
+ goto out_key;
+ }
+ param.dirfd = aux;
+ param.size = strlen(param.name->name);
+ break;
+ case FSCONFIG_SET_FD:
+ param.type = fs_value_is_file;
+ ret = -EBADF;
+ param.file = fget(aux);
+ if (!param.file)
+ goto out_key;
+ break;
+ default:
+ break;
+ }
+
+ ret = mutex_lock_interruptible(&fc->uapi_mutex);
+ if (ret == 0) {
+ switch (cmd) {
+ case FSCONFIG_CMD_CREATE:
+ case FSCONFIG_CMD_RECONFIGURE:
+ ret = vfs_fsconfig_action(fc, cmd);
+ break;
+ default:
+ ret = vfs_fsconfig(fc, &param);
+ break;
+ }
+ mutex_unlock(&fc->uapi_mutex);
+ }
+
+ /* Clean up our record of any value that we obtained from
+ * userspace. Note that the value may have been stolen by the LSM or
+ * filesystem, in which case the value pointer will have been cleared.
+ */
+ switch (cmd) {
+ case FSCONFIG_SET_STRING:
+ case FSCONFIG_SET_BINARY:
+ kfree(param.string);
+ break;
+ case FSCONFIG_SET_PATH:
+ case FSCONFIG_SET_PATH_EMPTY:
+ if (param.name)
+ putname(param.name);
+ break;
+ case FSCONFIG_SET_FD:
+ if (param.file)
+ fput(param.file);
+ break;
+ default:
+ break;
+ }
+out_key:
+ kfree(param.key);
+out_f:
+ fdput(f);
+ return ret;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 650d99c91987..4ab15fdf8aea 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -911,6 +911,8 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
int to_dfd, const char __user *to_path,
unsigned int ms_flags);
asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
+asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
+ const void __user *value, int aux);

/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f8818e6cddd6..fecbae30a30d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -349,4 +349,18 @@ typedef int __bitwise __kernel_rwf_t;
*/
#define FSOPEN_CLOEXEC 0x00000001

+/*
+ * The type of fsconfig() call made.
+ */
+enum fsconfig_command {
+ FSCONFIG_SET_FLAG = 0, /* Set parameter, supplying no value */
+ FSCONFIG_SET_STRING = 1, /* Set parameter, supplying a string value */
+ FSCONFIG_SET_BINARY = 2, /* Set parameter, supplying a binary blob value */
+ FSCONFIG_SET_PATH = 3, /* Set parameter, supplying an object by path */
+ FSCONFIG_SET_PATH_EMPTY = 4, /* Set parameter, supplying an object by (empty) path */
+ FSCONFIG_SET_FD = 5, /* Set parameter, supplying an object by fd */
+ FSCONFIG_CMD_CREATE = 6, /* Invoke superblock creation */
+ FSCONFIG_CMD_RECONFIGURE = 7, /* Invoke superblock reconfiguration */
+};
+
#endif /* _UAPI_LINUX_FS_H */


2018-09-21 16:35:36

by David Howells

Subject: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

Provide an fspick() system call that can be used to pick an existing
mountpoint into an fs_context which can thereafter be used to reconfigure a
superblock (equivalent of the superblock side of -o remount).

This looks like:

int fd = fspick(AT_FDCWD, "/mnt",
FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);

When fspick() returns, the file descriptor refers to a filesystem context
that is in exactly the same state as one created by fsopen() after
fsmount() has been successfully called on it.
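
The context phases that fsconfig() steps through can be modelled in
userspace as below. This is only a sketch based on vfs_fsconfig(),
vfs_fsconfig_action() and vfs_clean_context() from the fsconfig() patch;
the PHASE_* names and model_*() functions are hypothetical. Two
simplifications are assumed: reinitialising the context always succeeds,
and a reconfigure request made in any phase other than AWAITING_RECONF or
RECONF_PARAMS is refused with -EBUSY (the code as posted does not appear to
make that check).

```c
#include <errno.h>

/* Mirrors enum fs_context_phase from the fsopen() patch. */
enum fs_phase {
	PHASE_CREATE_PARAMS,	/* Loading params for sb creation */
	PHASE_CREATING,		/* Transient: superblock being created */
	PHASE_AWAITING_MOUNT,	/* Superblock created, awaiting fsmount() */
	PHASE_AWAITING_RECONF,	/* Awaiting reinitialisation */
	PHASE_RECONF_PARAMS,	/* Loading params for reconfiguration */
	PHASE_RECONFIGURING,	/* Transient: superblock being reconfigured */
	PHASE_FAILED,		/* A transition failed */
};

/* Model of vfs_fsconfig(): may a parameter be set in this phase? */
static int model_set_param(enum fs_phase *phase)
{
	/* Assumption: reinitialisation of the context always succeeds. */
	if (*phase == PHASE_AWAITING_RECONF)
		*phase = PHASE_RECONF_PARAMS;

	if (*phase != PHASE_CREATE_PARAMS && *phase != PHASE_RECONF_PARAMS)
		return -EBUSY;
	return 0;
}

/* Model of FSCONFIG_CMD_CREATE; get_tree_ret stands in for vfs_get_tree(). */
static int model_create(enum fs_phase *phase, int get_tree_ret)
{
	if (*phase != PHASE_CREATE_PARAMS)
		return -EBUSY;
	*phase = get_tree_ret == 0 ? PHASE_AWAITING_MOUNT : PHASE_FAILED;
	return get_tree_ret;
}

/* Model of FSCONFIG_CMD_RECONFIGURE; reconf_ret stands in for the result. */
static int model_reconfigure(enum fs_phase *phase, int reconf_ret)
{
	/* Assumption: other phases are refused (not checked in the patch). */
	if (*phase != PHASE_AWAITING_RECONF && *phase != PHASE_RECONF_PARAMS)
		return -EBUSY;

	/* On success vfs_clean_context() returns to AWAITING_RECONF. */
	*phase = reconf_ret == 0 ? PHASE_AWAITING_RECONF : PHASE_FAILED;
	return reconf_ret;
}
```

In this model a context fresh from fspick() starts in PHASE_RECONF_PARAMS,
so parameters can be set and a reconfiguration triggered straight away, and
the same fd can be reused for a further round afterwards.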

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/fsopen.c | 54 ++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 1 +
include/uapi/linux/fs.h | 5 +++
5 files changed, 62 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c78b68256f8a..d1eb6c815790 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -403,3 +403,4 @@
389 i386 fsopen sys_fsopen __ia32_sys_fsopen
390 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
391 i386 fsmount sys_fsmount __ia32_sys_fsmount
+392 i386 fspick sys_fspick __ia32_sys_fspick
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d44ead5d4368..d3ab703c02bb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -348,6 +348,7 @@
337 common fsopen __x64_sys_fsopen
338 common fsconfig __x64_sys_fsconfig
339 common fsmount __x64_sys_fsmount
+340 common fspick __x64_sys_fspick

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 5955a6b65596..9ead9220e2cb 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -155,6 +155,60 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
return ret;
}

+/*
+ * Pick a superblock into a context for reconfiguration.
+ */
+SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
+{
+ struct fs_context *fc;
+ struct path target;
+ unsigned int lookup_flags;
+ int ret;
+
+ if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if ((flags & ~(FSPICK_CLOEXEC |
+ FSPICK_SYMLINK_NOFOLLOW |
+ FSPICK_NO_AUTOMOUNT |
+ FSPICK_EMPTY_PATH)) != 0)
+ return -EINVAL;
+
+ lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+ if (flags & FSPICK_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (flags & FSPICK_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (flags & FSPICK_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+ ret = user_path_at(dfd, path, lookup_flags, &target);
+ if (ret < 0)
+ goto err;
+
+ fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
+ 0, 0, FS_CONTEXT_FOR_RECONFIGURE);
+ if (IS_ERR(fc)) {
+ ret = PTR_ERR(fc);
+ goto err_path;
+ }
+
+ fc->phase = FS_CONTEXT_RECONF_PARAMS;
+
+ ret = fscontext_alloc_log(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ path_put(&target);
+ return fscontext_create_fd(fc, flags & FSPICK_CLOEXEC ? O_CLOEXEC : 0);
+
+err_fc:
+ put_fs_context(fc);
+err_path:
+ path_put(&target);
+err:
+ return ret;
+}
+
/*
* Check the state and apply the configuration. Note that this function is
* allowed to 'steal' the value by setting param->xxx to NULL before returning.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4697fad47789..eb8d62f4ee24 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -914,6 +914,7 @@ asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
const void __user *value, int aux);
asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
+asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 10281d582e28..7f01503a9e9b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -351,6 +351,11 @@ typedef int __bitwise __kernel_rwf_t;

#define FSMOUNT_CLOEXEC 0x00000001

+#define FSPICK_CLOEXEC 0x00000001
+#define FSPICK_SYMLINK_NOFOLLOW 0x00000002
+#define FSPICK_NO_AUTOMOUNT 0x00000004
+#define FSPICK_EMPTY_PATH 0x00000008
+
/*
* The type of fsconfig() call made.
*/


2018-09-21 16:35:41

by David Howells

Subject: [PATCH 32/34] afs: Add fs_context support [ver #12]

Add fs_context support to the AFS filesystem, converting the parameter
parsing to store options there.

This will form the basis for namespace propagation over mountpoints within
the AFS model, thereby allowing AFS to be used in containers more easily.

Signed-off-by: David Howells <[email protected]>
---

fs/afs/internal.h | 8 -
fs/afs/super.c | 503 +++++++++++++++++++++++++++++++----------------------
fs/afs/volume.c | 4
3 files changed, 297 insertions(+), 218 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 871a228d7f37..22d428052500 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -35,15 +35,15 @@
struct pagevec;
struct afs_call;

-struct afs_mount_params {
+struct afs_fs_context {
bool rwpath; /* T if the parent should be considered R/W */
bool force; /* T to force cell type */
bool autocell; /* T if set auto mount operation */
bool dyn_root; /* T if dynamic root */
+ bool no_cell; /* T if the source is "none" (for dynroot) */
afs_voltype_t type; /* type of volume requested */
- int volnamesz; /* size of volume name */
+ unsigned int volnamesz; /* size of volume name */
const char *volname; /* name of volume to mount */
- struct net *net_ns; /* Network namespace in effect */
struct afs_net *net; /* the AFS net namespace stuff */
struct afs_cell *cell; /* cell in which to find volume */
struct afs_volume *volume; /* volume record */
@@ -1056,7 +1056,7 @@ static inline struct afs_volume *__afs_get_volume(struct afs_volume *volume)
return volume;
}

-extern struct afs_volume *afs_create_volume(struct afs_mount_params *);
+extern struct afs_volume *afs_create_volume(struct afs_fs_context *);
extern void afs_activate_volume(struct afs_volume *);
extern void afs_deactivate_volume(struct afs_volume *);
extern void afs_put_volume(struct afs_cell *, struct afs_volume *);
diff --git a/fs/afs/super.c b/fs/afs/super.c
index b85f5e993539..8d969fd4af40 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -1,6 +1,6 @@
/* AFS superblock handling
*
- * Copyright (c) 2002, 2007 Red Hat, Inc. All rights reserved.
+ * Copyright (c) 2002, 2007, 2018 Red Hat, Inc. All rights reserved.
*
* This software may be freely redistributed under the terms of the
* GNU General Public License.
@@ -21,7 +21,7 @@
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
-#include <linux/parser.h>
+#include <linux/fs_parser.h>
#include <linux/statfs.h>
#include <linux/sched.h>
#include <linux/nsproxy.h>
@@ -30,22 +30,22 @@
#include "internal.h"

static void afs_i_init_once(void *foo);
-static struct dentry *afs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, size_t data_size);
static void afs_kill_super(struct super_block *sb);
static struct inode *afs_alloc_inode(struct super_block *sb);
static void afs_destroy_inode(struct inode *inode);
static int afs_statfs(struct dentry *dentry, struct kstatfs *buf);
static int afs_show_devname(struct seq_file *m, struct dentry *root);
static int afs_show_options(struct seq_file *m, struct dentry *root);
+static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference);
+static const struct fs_parameter_description afs_fs_parameters;

struct file_system_type afs_fs_type = {
- .owner = THIS_MODULE,
- .name = "afs",
- .mount = afs_mount,
- .kill_sb = afs_kill_super,
- .fs_flags = 0,
+ .owner = THIS_MODULE,
+ .name = "afs",
+ .init_fs_context = afs_init_fs_context,
+ .parameters = &afs_fs_parameters,
+ .kill_sb = afs_kill_super,
+ .fs_flags = 0,
};
MODULE_ALIAS_FS("afs");

@@ -64,22 +64,40 @@ static const struct super_operations afs_super_ops = {
static struct kmem_cache *afs_inode_cachep;
static atomic_t afs_count_active_inodes;

-enum {
- afs_no_opt,
- afs_opt_cell,
- afs_opt_dyn,
- afs_opt_rwpath,
- afs_opt_vol,
- afs_opt_autocell,
+enum afs_param {
+ Opt_autocell,
+ Opt_cell,
+ Opt_dyn,
+ Opt_rwpath,
+ Opt_source,
+ Opt_vol,
+ nr__afs_params
};

-static const match_table_t afs_options_list = {
- { afs_opt_cell, "cell=%s" },
- { afs_opt_dyn, "dyn" },
- { afs_opt_rwpath, "rwpath" },
- { afs_opt_vol, "vol=%s" },
- { afs_opt_autocell, "autocell" },
- { afs_no_opt, NULL },
+static const struct fs_parameter_spec afs_param_specs[nr__afs_params] = {
+ [Opt_autocell] = { fs_param_is_flag },
+ [Opt_cell] = { fs_param_is_string },
+ [Opt_dyn] = { fs_param_is_flag },
+ [Opt_rwpath] = { fs_param_is_flag },
+ [Opt_source] = { fs_param_is_string },
+ [Opt_vol] = { fs_param_is_string },
+};
+
+static const char *const afs_param_keys[nr__afs_params] = {
+ [Opt_autocell] = "autocell",
+ [Opt_cell] = "cell",
+ [Opt_dyn] = "dyn",
+ [Opt_rwpath] = "rwpath",
+ [Opt_source] = "source",
+ [Opt_vol] = "vol",
+};
+
+static const struct fs_parameter_description afs_fs_parameters = {
+ .name = "kAFS",
+ .nr_params = nr__afs_params,
+ .source_param = Opt_source,
+ .keys = afs_param_keys,
+ .specs = afs_param_specs,
};

/*
@@ -191,71 +209,10 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
}

/*
- * parse the mount options
- * - this function has been shamelessly adapted from the ext3 fs which
- * shamelessly adapted it from the msdos fs
- */
-static int afs_parse_options(struct afs_mount_params *params,
- char *options, const char **devname)
-{
- struct afs_cell *cell;
- substring_t args[MAX_OPT_ARGS];
- char *p;
- int token;
-
- _enter("%s", options);
-
- options[PAGE_SIZE - 1] = 0;
-
- while ((p = strsep(&options, ","))) {
- if (!*p)
- continue;
-
- token = match_token(p, afs_options_list, args);
- switch (token) {
- case afs_opt_cell:
- rcu_read_lock();
- cell = afs_lookup_cell_rcu(params->net,
- args[0].from,
- args[0].to - args[0].from);
- rcu_read_unlock();
- if (IS_ERR(cell))
- return PTR_ERR(cell);
- afs_put_cell(params->net, params->cell);
- params->cell = cell;
- break;
-
- case afs_opt_rwpath:
- params->rwpath = true;
- break;
-
- case afs_opt_vol:
- *devname = args[0].from;
- break;
-
- case afs_opt_autocell:
- params->autocell = true;
- break;
-
- case afs_opt_dyn:
- params->dyn_root = true;
- break;
-
- default:
- printk(KERN_ERR "kAFS:"
- " Unknown or invalid mount option: '%s'\n", p);
- return -EINVAL;
- }
- }
-
- _leave(" = 0");
- return 0;
-}
-
-/*
- * parse a device name to get cell name, volume name, volume type and R/W
- * selector
- * - this can be one of the following:
+ * Parse the source name to get cell name, volume name, volume type and R/W
+ * selector.
+ *
+ * This can be one of the following:
* "%[cell:]volume[.]" R/W volume
* "#[cell:]volume[.]" R/O or R/W volume (rwpath=0),
* or R/W (rwpath=1) volume
@@ -264,11 +221,11 @@ static int afs_parse_options(struct afs_mount_params *params,
* "%[cell:]volume.backup" Backup volume
* "#[cell:]volume.backup" Backup volume
*/
-static int afs_parse_device_name(struct afs_mount_params *params,
- const char *name)
+static int afs_parse_source(struct fs_context *fc, struct fs_parameter *param)
{
+ struct afs_fs_context *ctx = fc->fs_private;
struct afs_cell *cell;
- const char *cellname, *suffix;
+ const char *cellname, *suffix, *name = param->string;
int cellnamesz;

_enter(",%s", name);
@@ -279,69 +236,174 @@ static int afs_parse_device_name(struct afs_mount_params *params,
}

if ((name[0] != '%' && name[0] != '#') || !name[1]) {
+ /* To use dynroot, we don't want to have to provide a source */
+ if (strcmp(name, "none") == 0) {
+ ctx->no_cell = true;
+ return 0;
+ }
printk(KERN_ERR "kAFS: unparsable volume name\n");
return -EINVAL;
}

/* determine the type of volume we're looking for */
- params->type = AFSVL_ROVOL;
- params->force = false;
- if (params->rwpath || name[0] == '%') {
- params->type = AFSVL_RWVOL;
- params->force = true;
+ ctx->type = AFSVL_ROVOL;
+ ctx->force = false;
+ if (ctx->rwpath || name[0] == '%') {
+ ctx->type = AFSVL_RWVOL;
+ ctx->force = true;
}
name++;

/* split the cell name out if there is one */
- params->volname = strchr(name, ':');
- if (params->volname) {
+ ctx->volname = strchr(name, ':');
+ if (ctx->volname) {
cellname = name;
- cellnamesz = params->volname - name;
- params->volname++;
+ cellnamesz = ctx->volname - name;
+ ctx->volname++;
} else {
- params->volname = name;
+ ctx->volname = name;
cellname = NULL;
cellnamesz = 0;
}

/* the volume type is further affected by a possible suffix */
- suffix = strrchr(params->volname, '.');
+ suffix = strrchr(ctx->volname, '.');
if (suffix) {
if (strcmp(suffix, ".readonly") == 0) {
- params->type = AFSVL_ROVOL;
- params->force = true;
+ ctx->type = AFSVL_ROVOL;
+ ctx->force = true;
} else if (strcmp(suffix, ".backup") == 0) {
- params->type = AFSVL_BACKVOL;
- params->force = true;
+ ctx->type = AFSVL_BACKVOL;
+ ctx->force = true;
} else if (suffix[1] == 0) {
} else {
suffix = NULL;
}
}

- params->volnamesz = suffix ?
- suffix - params->volname : strlen(params->volname);
+ ctx->volnamesz = suffix ?
+ suffix - ctx->volname : strlen(ctx->volname);

_debug("cell %*.*s [%p]",
- cellnamesz, cellnamesz, cellname ?: "", params->cell);
+ cellnamesz, cellnamesz, cellname ?: "", ctx->cell);

/* lookup the cell record */
- if (cellname || !params->cell) {
- cell = afs_lookup_cell(params->net, cellname, cellnamesz,
+ if (cellname) {
+ cell = afs_lookup_cell(ctx->net, cellname, cellnamesz,
NULL, false);
if (IS_ERR(cell)) {
- printk(KERN_ERR "kAFS: unable to lookup cell '%*.*s'\n",
+ pr_err("kAFS: unable to lookup cell '%*.*s'\n",
cellnamesz, cellnamesz, cellname ?: "");
return PTR_ERR(cell);
}
- afs_put_cell(params->net, params->cell);
- params->cell = cell;
+ afs_put_cell(ctx->net, ctx->cell);
+ ctx->cell = cell;
}

_debug("CELL:%s [%p] VOLUME:%*.*s SUFFIX:%s TYPE:%d%s",
- params->cell->name, params->cell,
- params->volnamesz, params->volnamesz, params->volname,
- suffix ?: "-", params->type, params->force ? " FORCE" : "");
+ ctx->cell->name, ctx->cell,
+ ctx->volnamesz, ctx->volnamesz, ctx->volname,
+ suffix ?: "-", ctx->type, ctx->force ? " FORCE" : "");
+
+ fc->source = param->string;
+ param->string = NULL;
+ return 0;
+}
+
+/*
+ * Parse a single mount parameter.
+ */
+static int afs_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+ struct fs_parse_result result;
+ struct afs_fs_context *ctx = fc->fs_private;
+ struct afs_cell *cell;
+ int opt;
+
+ opt = fs_parse(fc, &afs_fs_parameters, param, &result);
+ if (opt < 0)
+ return opt;
+
+ switch (opt) {
+ case Opt_cell:
+ if (param->size <= 0)
+ return -EINVAL;
+ if (param->size > AFS_MAXCELLNAME)
+ return -ENAMETOOLONG;
+
+ rcu_read_lock();
+ cell = afs_lookup_cell_rcu(ctx->net, param->string, param->size);
+ rcu_read_unlock();
+ if (IS_ERR(cell))
+ return PTR_ERR(cell);
+ afs_put_cell(ctx->net, ctx->cell);
+ ctx->cell = cell;
+ break;
+
+ case Opt_source:
+ return afs_parse_source(fc, param);
+
+ case Opt_autocell:
+ ctx->autocell = true;
+ break;
+
+ case Opt_dyn:
+ ctx->dyn_root = true;
+ break;
+
+ case Opt_rwpath:
+ ctx->rwpath = true;
+ break;
+
+ case Opt_vol:
+ return invalf(fc, "'vol' param is obsolete");
+
+ default:
+ return -EINVAL;
+ }
+
+ _leave(" = 0");
+ return 0;
+}
+
+/*
+ * Validate the options, get the cell key and look up the volume.
+ */
+static int afs_validate_fc(struct fs_context *fc)
+{
+ struct afs_fs_context *ctx = fc->fs_private;
+ struct afs_volume *volume;
+ struct key *key;
+
+ if (!ctx->dyn_root) {
+ if (ctx->no_cell) {
+ pr_warn("kAFS: Can only specify source 'none' with -o dyn\n");
+ return -EINVAL;
+ }
+
+ if (!ctx->cell) {
+ pr_warn("kAFS: No cell specified\n");
+ return -EDESTADDRREQ;
+ }
+
+ /* We try to do the mount securely. */
+ key = afs_request_key(ctx->cell);
+ if (IS_ERR(key))
+ return PTR_ERR(key);
+
+ ctx->key = key;
+
+ if (ctx->volume) {
+ afs_put_volume(ctx->cell, ctx->volume);
+ ctx->volume = NULL;
+ }
+
+ volume = afs_create_volume(ctx);
+ if (IS_ERR(volume))
+ return PTR_ERR(volume);
+
+ ctx->volume = volume;
+ }

return 0;
}
@@ -349,39 +411,34 @@ static int afs_parse_device_name(struct afs_mount_params *params,
/*
* check a superblock to see if it's the one we're looking for
*/
-static int afs_test_super(struct super_block *sb, void *data)
+static int afs_test_super(struct super_block *sb, struct fs_context *fc)
{
- struct afs_super_info *as1 = data;
+ struct afs_fs_context *ctx = fc->fs_private;
struct afs_super_info *as = AFS_FS_S(sb);

- return (as->net_ns == as1->net_ns &&
+ return (as->net_ns == fc->net_ns &&
as->volume &&
- as->volume->vid == as1->volume->vid &&
+ as->volume->vid == ctx->volume->vid &&
!as->dyn_root);
}

-static int afs_dynroot_test_super(struct super_block *sb, void *data)
+static int afs_dynroot_test_super(struct super_block *sb, struct fs_context *fc)
{
- struct afs_super_info *as1 = data;
struct afs_super_info *as = AFS_FS_S(sb);

- return (as->net_ns == as1->net_ns &&
+ return (as->net_ns == fc->net_ns &&
as->dyn_root);
}

-static int afs_set_super(struct super_block *sb, void *data)
+static int afs_set_super(struct super_block *sb, struct fs_context *fc)
{
- struct afs_super_info *as = data;
-
- sb->s_fs_info = as;
return set_anon_super(sb, NULL);
}

/*
* fill in the superblock
*/
-static int afs_fill_super(struct super_block *sb,
- struct afs_mount_params *params)
+static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
{
struct afs_super_info *as = AFS_FS_S(sb);
struct afs_fid fid;
@@ -412,13 +469,13 @@ static int afs_fill_super(struct super_block *sb,
fid.vid = as->volume->vid;
fid.vnode = 1;
fid.unique = 1;
- inode = afs_iget(sb, params->key, &fid, NULL, NULL, NULL);
+ inode = afs_iget(sb, ctx->key, &fid, NULL, NULL, NULL);
}

if (IS_ERR(inode))
return PTR_ERR(inode);

- if (params->autocell || params->dyn_root)
+ if (ctx->autocell || as->dyn_root)
set_bit(AFS_VNODE_AUTOCELL, &AFS_FS_I(inode)->flags);

ret = -ENOMEM;
@@ -443,17 +500,20 @@ static int afs_fill_super(struct super_block *sb,
return ret;
}

-static struct afs_super_info *afs_alloc_sbi(struct afs_mount_params *params)
+static struct afs_super_info *afs_alloc_sbi(struct fs_context *fc)
{
+ struct afs_fs_context *ctx = fc->fs_private;
struct afs_super_info *as;

as = kzalloc(sizeof(struct afs_super_info), GFP_KERNEL);
if (as) {
- as->net_ns = get_net(params->net_ns);
- if (params->dyn_root)
+ as->net_ns = get_net(fc->net_ns);
+ if (ctx->dyn_root) {
as->dyn_root = true;
- else
- as->cell = afs_get_cell(params->cell);
+ } else {
+ as->cell = afs_get_cell(ctx->cell);
+ as->volume = __afs_get_volume(ctx->volume);
+ }
}
return as;
}
@@ -475,7 +535,7 @@ static void afs_kill_super(struct super_block *sb)

if (as->dyn_root)
afs_dynroot_depopulate(sb);
-
+
/* Clear the callback interests (which will do ilookup5) before
* deactivating the superblock.
*/
@@ -488,112 +548,131 @@ static void afs_kill_super(struct super_block *sb)
}

/*
- * get an AFS superblock
+ * Get an AFS superblock and root directory.
*/
-static struct dentry *afs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *options, size_t data_size)
+static int afs_get_tree(struct fs_context *fc)
{
- struct afs_mount_params params;
+ struct afs_fs_context *ctx = fc->fs_private;
struct super_block *sb;
- struct afs_volume *candidate;
- struct key *key;
struct afs_super_info *as;
int ret;

- _enter(",,%s,%p", dev_name, options);
-
- memset(&params, 0, sizeof(params));
-
- ret = -EINVAL;
- if (current->nsproxy->net_ns != &init_net)
- goto error;
- params.net_ns = current->nsproxy->net_ns;
- params.net = afs_net(params.net_ns);
-
- /* parse the options and device name */
- if (options) {
- ret = afs_parse_options(&params, options, &dev_name);
- if (ret < 0)
- goto error;
- }
-
- if (!params.dyn_root) {
- ret = afs_parse_device_name(&params, dev_name);
- if (ret < 0)
- goto error;
-
- /* try and do the mount securely */
- key = afs_request_key(params.cell);
- if (IS_ERR(key)) {
- _leave(" = %ld [key]", PTR_ERR(key));
- ret = PTR_ERR(key);
- goto error;
- }
- params.key = key;
- }
+ _enter("");

/* allocate a superblock info record */
ret = -ENOMEM;
- as = afs_alloc_sbi(&params);
+ as = afs_alloc_sbi(fc);
if (!as)
- goto error_key;
-
- if (!params.dyn_root) {
- /* Assume we're going to need a volume record; at the very
- * least we can use it to update the volume record if we have
- * one already. This checks that the volume exists within the
- * cell.
- */
- candidate = afs_create_volume(&params);
- if (IS_ERR(candidate)) {
- ret = PTR_ERR(candidate);
- goto error_as;
- }
-
- as->volume = candidate;
- }
+ goto error;
+ fc->s_fs_info = as;

/* allocate a deviceless superblock */
- sb = sget(fs_type,
- as->dyn_root ? afs_dynroot_test_super : afs_test_super,
- afs_set_super, flags, as);
+ sb = sget_fc(fc,
+ as->dyn_root ? afs_dynroot_test_super : afs_test_super,
+ afs_set_super);
if (IS_ERR(sb)) {
ret = PTR_ERR(sb);
- goto error_as;
+ goto error;
}

if (!sb->s_root) {
/* initial superblock/root creation */
_debug("create");
- ret = afs_fill_super(sb, &params);
+ ret = afs_fill_super(sb, ctx);
if (ret < 0)
goto error_sb;
- as = NULL;
sb->s_flags |= SB_ACTIVE;
} else {
_debug("reuse");
ASSERTCMP(sb->s_flags, &, SB_ACTIVE);
- afs_destroy_sbi(as);
- as = NULL;
}

- afs_put_cell(params.net, params.cell);
- key_put(params.key);
+ fc->root = dget(sb->s_root);
_leave(" = 0 [%p]", sb);
- return dget(sb->s_root);
+ return 0;

error_sb:
deactivate_locked_super(sb);
- goto error_key;
-error_as:
- afs_destroy_sbi(as);
-error_key:
- key_put(params.key);
error:
- afs_put_cell(params.net, params.cell);
_leave(" = %d", ret);
- return ERR_PTR(ret);
+ return ret;
+}
+
+static void afs_free_fc(struct fs_context *fc)
+{
+ struct afs_fs_context *ctx = fc->fs_private;
+
+ afs_destroy_sbi(fc->s_fs_info);
+ afs_put_volume(ctx->cell, ctx->volume);
+ afs_put_cell(ctx->net, ctx->cell);
+ key_put(ctx->key);
+ kfree(ctx);
+}
+
+static const struct fs_context_operations afs_context_ops = {
+ .free = afs_free_fc,
+ .parse_param = afs_parse_param,
+ .validate = afs_validate_fc,
+ .get_tree = afs_get_tree,
+};
+
+/*
+ * Set up the filesystem mount context.
+ */
+static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+ struct afs_fs_context *ctx;
+ struct afs_super_info *src_as;
+ struct afs_cell *cell;
+
+ if (current->nsproxy->net_ns != &init_net)
+ return -EINVAL;
+
+ ctx = kzalloc(sizeof(struct afs_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->type = AFSVL_ROVOL;
+
+ switch (fc->purpose) {
+ case FS_CONTEXT_FOR_USER_MOUNT:
+ case FS_CONTEXT_FOR_KERNEL_MOUNT:
+ ctx->net = afs_net(fc->net_ns);
+
+ /* Default to the workstation cell. */
+ rcu_read_lock();
+ cell = afs_lookup_cell_rcu(ctx->net, NULL, 0);
+ rcu_read_unlock();
+ if (IS_ERR(cell))
+ cell = NULL;
+ ctx->cell = cell;
+ break;
+
+ case FS_CONTEXT_FOR_SUBMOUNT:
+ if (!reference) {
+ kfree(ctx);
+ return -EINVAL;
+ }
+
+ src_as = AFS_FS_S(reference->d_sb);
+ ASSERT(src_as);
+
+ ctx->net = afs_net(fc->net_ns);
+ if (src_as->cell)
+ ctx->cell = afs_get_cell(src_as->cell);
+ if (src_as->volume && src_as->volume->type == AFSVL_RWVOL) {
+ ctx->type = AFSVL_RWVOL;
+ ctx->force = true;
+ }
+ break;
+
+ default:
+ break;
+ }
+
+ fc->fs_private = ctx;
+ fc->ops = &afs_context_ops;
+ return 0;
}

/*
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index 3037bd01f617..7adcddf02e66 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -21,7 +21,7 @@ static const char *const afs_voltypes[] = { "R/W", "R/O", "BAK" };
/*
* Allocate a volume record and load it up from a vldb record.
*/
-static struct afs_volume *afs_alloc_volume(struct afs_mount_params *params,
+static struct afs_volume *afs_alloc_volume(struct afs_fs_context *params,
struct afs_vldb_entry *vldb,
unsigned long type_mask)
{
@@ -149,7 +149,7 @@ static struct afs_vldb_entry *afs_vl_lookup_vldb(struct afs_cell *cell,
* - Rule 3: If parent volume is R/W, then only mount R/W volume unless
* explicitly told otherwise
*/
-struct afs_volume *afs_create_volume(struct afs_mount_params *params)
+struct afs_volume *afs_create_volume(struct afs_fs_context *params)
{
struct afs_vldb_entry *vldb;
struct afs_volume *volume;


2018-09-21 16:35:56

by David Howells

Subject: [PATCH 33/34] afs: Use fs_context to pass parameters over automount [ver #12]

Alter the AFS automounting code to create and modify an fs_context struct
when parameterising a new mount triggered by an AFS mountpoint rather than
constructing device name and option strings.

Also remove the cell=, vol= and rwpath options as they are then redundant.
They existed because the 'device name' may be derived literally from a
mountpoint object in the filesystem, so the default cell and parent-type
information had to be passed in from the automount routines by some other
method. The vol= option was never actually used.

Signed-off-by: David Howells <[email protected]>
cc: Eric W. Biederman <[email protected]>
---
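For reviewers who want to poke at the source-name grammar outside the
kernel, here is a minimal userspace sketch of the rules documented above
afs_parse_source(). It is not part of the patch. The names parsed_source,
vol_type and parse_afs_source() are hypothetical, and the parent_rw
argument is an assumption standing in for the type/force defaults that
afs_init_fs_context() copies from an R/W parent on a submount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins for AFSVL_RWVOL/ROVOL/BACKVOL. */
enum vol_type { VOL_RW, VOL_RO, VOL_BACKUP };

struct parsed_source {
	const char *cell;	/* NULL if no "cell:" prefix was given */
	size_t cell_len;
	const char *volume;	/* points into the caller's string */
	size_t volume_len;	/* length without any recognised suffix */
	enum vol_type type;
	bool force;		/* true if the type must not be overridden */
};

/*
 * Parse "%[cell:]volume[.suffix]" or "#[cell:]volume[.suffix]".
 * parent_rw models (as an assumption) the defaults inherited from an
 * R/W parent volume.  Returns 0 on success, -1 if unparsable.
 */
static int parse_afs_source(const char *name, bool parent_rw,
			    struct parsed_source *out)
{
	const char *colon, *suffix;

	assert(out != NULL);
	if (!name || (name[0] != '%' && name[0] != '#') || !name[1])
		return -1;

	memset(out, 0, sizeof(*out));
	out->type = parent_rw ? VOL_RW : VOL_RO;
	out->force = parent_rw;
	if (name[0] == '%') {
		out->type = VOL_RW;
		out->force = true;
	}
	name++;

	/* Split the cell name out if there is one. */
	colon = strchr(name, ':');
	if (colon) {
		out->cell = name;
		out->cell_len = (size_t)(colon - name);
		out->volume = colon + 1;
	} else {
		out->volume = name;
	}

	/* A recognised suffix overrides the volume type. */
	suffix = strrchr(out->volume, '.');
	if (suffix) {
		if (strcmp(suffix, ".readonly") == 0) {
			out->type = VOL_RO;
			out->force = true;
		} else if (strcmp(suffix, ".backup") == 0) {
			out->type = VOL_BACKUP;
			out->force = true;
		} else if (suffix[1] != '\0') {
			suffix = NULL;	/* the dot belongs to the volume name */
		}
	}

	out->volume_len = suffix ? (size_t)(suffix - out->volume)
				 : strlen(out->volume);
	return 0;
}
```

Calling parse_afs_source("#grand.central.org:root.cell.", false, &ps)
should split out the cell "grand.central.org" and the volume "root.cell"
and leave the type as R/O without forcing it, which is the case the sample
program in the next patch exercises.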

fs/afs/internal.h | 1
fs/afs/mntpt.c | 148 +++++++++++++++++++++++++++--------------------------
fs/afs/super.c | 43 +--------------
3 files changed, 78 insertions(+), 114 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 22d428052500..7a603398b69e 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -36,7 +36,6 @@ struct pagevec;
struct afs_call;

struct afs_fs_context {
- bool rwpath; /* T if the parent should be considered R/W */
bool force; /* T to force cell type */
bool autocell; /* T if set auto mount operation */
bool dyn_root; /* T if dynamic root */
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index c45aa1776591..16ee515b51c9 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -47,6 +47,8 @@ static DECLARE_DELAYED_WORK(afs_mntpt_expiry_timer, afs_mntpt_expiry_timed_out);

static unsigned long afs_mntpt_expiry_timeout = 10 * 60;

+static const char afs_root_volume[] = "root.cell";
+
/*
* no valid lookup procedure on this sort of dir
*/
@@ -68,107 +70,107 @@ static int afs_mntpt_open(struct inode *inode, struct file *file)
}

/*
- * create a vfsmount to be automounted
+ * Set the parameters for the proposed superblock.
*/
-static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
+static int afs_mntpt_set_params(struct fs_context *fc, struct dentry *mntpt)
{
- struct afs_super_info *as;
- struct vfsmount *mnt;
- struct afs_vnode *vnode;
- struct page *page;
- char *devname, *options;
- bool rwpath = false;
+ struct afs_fs_context *ctx = fc->fs_private;
+ struct afs_vnode *vnode = AFS_FS_I(d_inode(mntpt));
+ struct afs_cell *cell;
+ const char *p;
int ret;

- _enter("{%pd}", mntpt);
-
- BUG_ON(!d_inode(mntpt));
-
- ret = -ENOMEM;
- devname = (char *) get_zeroed_page(GFP_KERNEL);
- if (!devname)
- goto error_no_devname;
-
- options = (char *) get_zeroed_page(GFP_KERNEL);
- if (!options)
- goto error_no_options;
-
- vnode = AFS_FS_I(d_inode(mntpt));
if (test_bit(AFS_VNODE_PSEUDODIR, &vnode->flags)) {
/* if the directory is a pseudo directory, use the d_name */
- static const char afs_root_cell[] = ":root.cell.";
unsigned size = mntpt->d_name.len;

- ret = -ENOENT;
- if (size < 2 || size > AFS_MAXCELLNAME)
- goto error_no_page;
+ if (size < 2)
+ return -ENOENT;

+ p = mntpt->d_name.name;
if (mntpt->d_name.name[0] == '.') {
- devname[0] = '%';
- memcpy(devname + 1, mntpt->d_name.name + 1, size - 1);
- memcpy(devname + size, afs_root_cell,
- sizeof(afs_root_cell));
- rwpath = true;
- } else {
- devname[0] = '#';
- memcpy(devname + 1, mntpt->d_name.name, size);
- memcpy(devname + size + 1, afs_root_cell,
- sizeof(afs_root_cell));
+ size--;
+ p++;
+ ctx->type = AFSVL_RWVOL;
+ ctx->force = true;
+ }
+ if (size > AFS_MAXCELLNAME)
+ return -ENAMETOOLONG;
+
+ cell = afs_lookup_cell(ctx->net, p, size, NULL, false);
+ if (IS_ERR(cell)) {
+ pr_err("kAFS: unable to lookup cell '%pd'\n", mntpt);
+ return PTR_ERR(cell);
}
+ afs_put_cell(ctx->net, ctx->cell);
+ ctx->cell = cell;
+
+ ctx->volname = afs_root_volume;
+ ctx->volnamesz = sizeof(afs_root_volume) - 1;
} else {
/* read the contents of the AFS special symlink */
+ struct page *page;
loff_t size = i_size_read(d_inode(mntpt));
char *buf;

- ret = -EINVAL;
if (size > PAGE_SIZE - 1)
- goto error_no_page;
+ return -EINVAL;

page = read_mapping_page(d_inode(mntpt)->i_mapping, 0, NULL);
- if (IS_ERR(page)) {
- ret = PTR_ERR(page);
- goto error_no_page;
- }
+ if (IS_ERR(page))
+ return PTR_ERR(page);

- ret = -EIO;
- if (PageError(page))
- goto error;
+ if (PageError(page)) {
+ put_page(page);
+ return -EIO;
+ }

- buf = kmap_atomic(page);
- memcpy(devname, buf, size);
- kunmap_atomic(buf);
+ buf = kmap(page);
+ ret = vfs_parse_fs_string(fc, "source", buf, size);
+ kunmap(page);
put_page(page);
- page = NULL;
+ if (ret < 0)
+ return ret;
}

- /* work out what options we want */
- as = AFS_FS_S(mntpt->d_sb);
- if (as->cell) {
- memcpy(options, "cell=", 5);
- strcpy(options + 5, as->cell->name);
- if ((as->volume && as->volume->type == AFSVL_RWVOL) || rwpath)
- strcat(options, ",rwpath");
- }
+ return 0;
+}

- /* try and do the mount */
- _debug("--- attempting mount %s -o %s ---", devname, options);
- mnt = vfs_submount(mntpt, &afs_fs_type, devname,
- options, strlen(options) + 1);
- _debug("--- mount result %p ---", mnt);
+/*
+ * create a vfsmount to be automounted
+ */
+static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
+{
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ BUG_ON(!d_inode(mntpt));
+
+ fc = vfs_new_fs_context(&afs_fs_type, mntpt, 0, 0,
+ FS_CONTEXT_FOR_SUBMOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ ret = afs_mntpt_set_params(fc, mntpt);
+ if (ret < 0)
+ goto error_fc;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto error_fc;
+
+ mnt = vfs_create_mount(fc, 0);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto error_fc;
+ }

- free_page((unsigned long) devname);
- free_page((unsigned long) options);
- _leave(" = %p", mnt);
+ put_fs_context(fc);
return mnt;

-error:
- put_page(page);
-error_no_page:
- free_page((unsigned long) options);
-error_no_options:
- free_page((unsigned long) devname);
-error_no_devname:
- _leave(" = %d", ret);
+error_fc:
+ put_fs_context(fc);
return ERR_PTR(ret);
}

diff --git a/fs/afs/super.c b/fs/afs/super.c
index 8d969fd4af40..944985e0a3c8 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -66,30 +66,21 @@ static atomic_t afs_count_active_inodes;

enum afs_param {
Opt_autocell,
- Opt_cell,
Opt_dyn,
- Opt_rwpath,
Opt_source,
- Opt_vol,
nr__afs_params
};

static const struct fs_parameter_spec afs_param_specs[nr__afs_params] = {
[Opt_autocell] = { fs_param_is_flag },
- [Opt_cell] = { fs_param_is_string },
[Opt_dyn] = { fs_param_is_flag },
- [Opt_rwpath] = { fs_param_is_flag },
[Opt_source] = { fs_param_is_string },
- [Opt_vol] = { fs_param_is_string },
};

static const char *const afs_param_keys[nr__afs_params] = {
[Opt_autocell] = "autocell",
- [Opt_cell] = "cell",
[Opt_dyn] = "dyn",
- [Opt_rwpath] = "rwpath",
[Opt_source] = "source",
- [Opt_vol] = "vol",
};

static const struct fs_parameter_description afs_fs_parameters = {
@@ -214,8 +205,8 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
*
* This can be one of the following:
* "%[cell:]volume[.]" R/W volume
- * "#[cell:]volume[.]" R/O or R/W volume (rwpath=0),
- * or R/W (rwpath=1) volume
+ * "#[cell:]volume[.]" R/O or R/W volume (R/O parent),
+ * or R/W (R/W parent) volume
* "%[cell:]volume.readonly" R/O volume
* "#[cell:]volume.readonly" R/O volume
* "%[cell:]volume.backup" Backup volume
@@ -246,9 +237,7 @@ static int afs_parse_source(struct fs_context *fc, struct fs_parameter *param)
}

/* determine the type of volume we're looking for */
- ctx->type = AFSVL_ROVOL;
- ctx->force = false;
- if (ctx->rwpath || name[0] == '%') {
+ if (name[0] == '%') {
ctx->type = AFSVL_RWVOL;
ctx->force = true;
}
@@ -317,7 +306,6 @@ static int afs_parse_param(struct fs_context *fc, struct fs_parameter *param)
{
struct fs_parse_result result;
struct afs_fs_context *ctx = fc->fs_private;
- struct afs_cell *cell;
int opt;

opt = fs_parse(fc, &afs_fs_parameters, param, &result);
@@ -325,21 +313,6 @@ static int afs_parse_param(struct fs_context *fc, struct fs_parameter *param)
return opt;

switch (opt) {
- case Opt_cell:
- if (param->size <= 0)
- return -EINVAL;
- if (param->size > AFS_MAXCELLNAME)
- return -ENAMETOOLONG;
-
- rcu_read_lock();
- cell = afs_lookup_cell_rcu(ctx->net, param->string, param->size);
- rcu_read_unlock();
- if (IS_ERR(cell))
- return PTR_ERR(cell);
- afs_put_cell(ctx->net, ctx->cell);
- ctx->cell = cell;
- break;
-
case Opt_source:
return afs_parse_source(fc, param);

@@ -351,13 +324,6 @@ static int afs_parse_param(struct fs_context *fc, struct fs_parameter *param)
ctx->dyn_root = true;
break;

- case Opt_rwpath:
- ctx->rwpath = true;
- break;
-
- case Opt_vol:
- return invalf(fc, "'vol' param is obsolete");
-
default:
return -EINVAL;
}
@@ -625,9 +591,6 @@ static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference)
struct afs_super_info *src_as;
struct afs_cell *cell;

- if (current->nsproxy->net_ns != &init_net)
- return -EINVAL;
-
ctx = kzalloc(sizeof(struct afs_fs_context), GFP_KERNEL);
if (!ctx)
return -ENOMEM;


2018-09-21 16:36:06

by David Howells

Subject: [PATCH 34/34] vfs: Add a sample program for the new mount API [ver #12]

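Add a sample program that demonstrates mounting a filesystem with fsopen(),
fsconfig(), fsmount() and move_mount().

Also move the statx sample into samples/vfs/ alongside it and replace
CONFIG_SAMPLE_STATX with CONFIG_SAMPLE_VFS.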

Signed-off-by: David Howells <[email protected]>
---
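The sample's check_messages() decodes records read back from the fsopen()
file descriptor. A minimal standalone sketch of that decoding follows. It
is not part of the patch. The helper names fs_msg_label() and
fs_msg_format() are hypothetical, and the record layout (a type character,
one separator byte, then the text) is only inferred from the sample code
below rather than taken from any specification.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Map the leading type character of a message record to a label.
 * Returns NULL for an unknown type or a record too short to carry text.
 */
static const char *fs_msg_label(const char *buf, ssize_t n)
{
	if (n < 2)
		return NULL;

	switch (buf[0]) {
	case 'e': return "Error";
	case 'w': return "Warning";
	case 'i': return "Info";
	default:  return NULL;
	}
}

/*
 * Format one record as "<Label>: <text>" into out.
 * Returns the snprintf() result, or -1 if the record is not recognised.
 */
static int fs_msg_format(const char *buf, ssize_t n, char *out, size_t outsz)
{
	const char *label = fs_msg_label(buf, n);

	if (!label)
		return -1;

	/* The text starts after the type character and the separator. */
	return snprintf(out, outsz, "%s: %.*s", label, (int)(n - 2), buf + 2);
}
```

Unlike the loop in the sample, this sketch rejects records shorter than two
bytes before subtracting the prefix length, so the precision passed to the
format string can never go negative.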

samples/Kconfig | 10 +-
samples/Makefile | 2
samples/statx/Makefile | 7 -
samples/statx/test-statx.c | 258 --------------------------------------------
samples/vfs/Makefile | 10 ++
samples/vfs/test-fsmount.c | 118 ++++++++++++++++++++
samples/vfs/test-statx.c | 258 ++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 393 insertions(+), 270 deletions(-)
delete mode 100644 samples/statx/Makefile
delete mode 100644 samples/statx/test-statx.c
create mode 100644 samples/vfs/Makefile
create mode 100644 samples/vfs/test-fsmount.c
create mode 100644 samples/vfs/test-statx.c

diff --git a/samples/Kconfig b/samples/Kconfig
index bd133efc1a56..8df1c012820f 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -146,10 +146,12 @@ config SAMPLE_VFIO_MDEV_MBOCHS
Specifically it does *not* include any legacy vga stuff.
Device looks a lot like "qemu -device secondary-vga".

-config SAMPLE_STATX
- bool "Build example extended-stat using code"
- depends on BROKEN
+config SAMPLE_VFS
+ bool "Build example programs that use new VFS system calls"
+ depends on X86
help
- Build example userspace program to use the new extended-stat syscall.
+ Build example userspace programs that use new VFS system calls such
+ as the mount API and statx(). Note that this is restricted to the x86
+ arch whilst it accesses system calls that aren't yet in all arches.

endif # SAMPLES
diff --git a/samples/Makefile b/samples/Makefile
index bd601c038b86..c5a6175c2d3f 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -3,4 +3,4 @@
obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ trace_events/ livepatch/ \
hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
configfs/ connector/ v4l/ trace_printk/ \
- vfio-mdev/ statx/ qmi/
+ vfio-mdev/ vfs/ qmi/
diff --git a/samples/statx/Makefile b/samples/statx/Makefile
deleted file mode 100644
index 59df7c25a9d1..000000000000
--- a/samples/statx/Makefile
+++ /dev/null
@@ -1,7 +0,0 @@
-# List of programs to build
-hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
-
-HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
diff --git a/samples/statx/test-statx.c b/samples/statx/test-statx.c
deleted file mode 100644
index d4d77b09412c..000000000000
--- a/samples/statx/test-statx.c
+++ /dev/null
@@ -1,258 +0,0 @@
-/* Test the statx() system call.
- *
- * Note that the output of this program is intended to look like the output of
- * /bin/stat where possible.
- *
- * Copyright (C) 2015 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells ([email protected])
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public Licence
- * as published by the Free Software Foundation; either version
- * 2 of the Licence, or (at your option) any later version.
- */
-
-#define _GNU_SOURCE
-#define _ATFILE_SOURCE
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <unistd.h>
-#include <ctype.h>
-#include <errno.h>
-#include <time.h>
-#include <sys/syscall.h>
-#include <sys/types.h>
-#include <linux/stat.h>
-#include <linux/fcntl.h>
-#include <sys/stat.h>
-
-#define AT_STATX_SYNC_TYPE 0x6000
-#define AT_STATX_SYNC_AS_STAT 0x0000
-#define AT_STATX_FORCE_SYNC 0x2000
-#define AT_STATX_DONT_SYNC 0x4000
-
-static __attribute__((unused))
-ssize_t statx(int dfd, const char *filename, unsigned flags,
- unsigned int mask, struct statx *buffer)
-{
- return syscall(__NR_statx, dfd, filename, flags, mask, buffer);
-}
-
-static void print_time(const char *field, struct statx_timestamp *ts)
-{
- struct tm tm;
- time_t tim;
- char buffer[100];
- int len;
-
- tim = ts->tv_sec;
- if (!localtime_r(&tim, &tm)) {
- perror("localtime_r");
- exit(1);
- }
- len = strftime(buffer, 100, "%F %T", &tm);
- if (len == 0) {
- perror("strftime");
- exit(1);
- }
- printf("%s", field);
- fwrite(buffer, 1, len, stdout);
- printf(".%09u", ts->tv_nsec);
- len = strftime(buffer, 100, "%z", &tm);
- if (len == 0) {
- perror("strftime2");
- exit(1);
- }
- fwrite(buffer, 1, len, stdout);
- printf("\n");
-}
-
-static void dump_statx(struct statx *stx)
-{
- char buffer[256], ft = '?';
-
- printf("results=%x\n", stx->stx_mask);
-
- printf(" ");
- if (stx->stx_mask & STATX_SIZE)
- printf(" Size: %-15llu", (unsigned long long)stx->stx_size);
- if (stx->stx_mask & STATX_BLOCKS)
- printf(" Blocks: %-10llu", (unsigned long long)stx->stx_blocks);
- printf(" IO Block: %-6llu", (unsigned long long)stx->stx_blksize);
- if (stx->stx_mask & STATX_TYPE) {
- switch (stx->stx_mode & S_IFMT) {
- case S_IFIFO: printf(" FIFO\n"); ft = 'p'; break;
- case S_IFCHR: printf(" character special file\n"); ft = 'c'; break;
- case S_IFDIR: printf(" directory\n"); ft = 'd'; break;
- case S_IFBLK: printf(" block special file\n"); ft = 'b'; break;
- case S_IFREG: printf(" regular file\n"); ft = '-'; break;
- case S_IFLNK: printf(" symbolic link\n"); ft = 'l'; break;
- case S_IFSOCK: printf(" socket\n"); ft = 's'; break;
- default:
- printf(" unknown type (%o)\n", stx->stx_mode & S_IFMT);
- break;
- }
- } else {
- printf(" no type\n");
- }
-
- sprintf(buffer, "%02x:%02x", stx->stx_dev_major, stx->stx_dev_minor);
- printf("Device: %-15s", buffer);
- if (stx->stx_mask & STATX_INO)
- printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
- if (stx->stx_mask & STATX_NLINK)
- printf(" Links: %-5u", stx->stx_nlink);
- if (stx->stx_mask & STATX_TYPE) {
- switch (stx->stx_mode & S_IFMT) {
- case S_IFBLK:
- case S_IFCHR:
- printf(" Device type: %u,%u",
- stx->stx_rdev_major, stx->stx_rdev_minor);
- break;
- }
- }
- printf("\n");
-
- if (stx->stx_mask & STATX_MODE)
- printf("Access: (%04o/%c%c%c%c%c%c%c%c%c%c) ",
- stx->stx_mode & 07777,
- ft,
- stx->stx_mode & S_IRUSR ? 'r' : '-',
- stx->stx_mode & S_IWUSR ? 'w' : '-',
- stx->stx_mode & S_IXUSR ? 'x' : '-',
- stx->stx_mode & S_IRGRP ? 'r' : '-',
- stx->stx_mode & S_IWGRP ? 'w' : '-',
- stx->stx_mode & S_IXGRP ? 'x' : '-',
- stx->stx_mode & S_IROTH ? 'r' : '-',
- stx->stx_mode & S_IWOTH ? 'w' : '-',
- stx->stx_mode & S_IXOTH ? 'x' : '-');
- if (stx->stx_mask & STATX_UID)
- printf("Uid: %5d ", stx->stx_uid);
- if (stx->stx_mask & STATX_GID)
- printf("Gid: %5d\n", stx->stx_gid);
-
- if (stx->stx_mask & STATX_ATIME)
- print_time("Access: ", &stx->stx_atime);
- if (stx->stx_mask & STATX_MTIME)
- print_time("Modify: ", &stx->stx_mtime);
- if (stx->stx_mask & STATX_CTIME)
- print_time("Change: ", &stx->stx_ctime);
- if (stx->stx_mask & STATX_BTIME)
- print_time(" Birth: ", &stx->stx_btime);
-
- if (stx->stx_attributes_mask) {
- unsigned char bits, mbits;
- int loop, byte;
-
- static char attr_representation[64 + 1] =
- /* STATX_ATTR_ flags: */
- "????????" /* 63-56 */
- "????????" /* 55-48 */
- "????????" /* 47-40 */
- "????????" /* 39-32 */
- "????????" /* 31-24 0x00000000-ff000000 */
- "????????" /* 23-16 0x00000000-00ff0000 */
- "???me???" /* 15- 8 0x00000000-0000ff00 */
- "?dai?c??" /* 7- 0 0x00000000-000000ff */
- ;
-
- printf("Attributes: %016llx (", stx->stx_attributes);
- for (byte = 64 - 8; byte >= 0; byte -= 8) {
- bits = stx->stx_attributes >> byte;
- mbits = stx->stx_attributes_mask >> byte;
- for (loop = 7; loop >= 0; loop--) {
- int bit = byte + loop;
-
- if (!(mbits & 0x80))
- putchar('.'); /* Not supported */
- else if (bits & 0x80)
- putchar(attr_representation[63 - bit]);
- else
- putchar('-'); /* Not set */
- bits <<= 1;
- mbits <<= 1;
- }
- if (byte)
- putchar(' ');
- }
- printf(")\n");
- }
-}
-
-static void dump_hex(unsigned long long *data, int from, int to)
-{
- unsigned offset, print_offset = 1, col = 0;
-
- from /= 8;
- to = (to + 7) / 8;
-
- for (offset = from; offset < to; offset++) {
- if (print_offset) {
- printf("%04x: ", offset * 8);
- print_offset = 0;
- }
- printf("%016llx", data[offset]);
- col++;
- if ((col & 3) == 0) {
- printf("\n");
- print_offset = 1;
- } else {
- printf(" ");
- }
- }
-
- if (!print_offset)
- printf("\n");
-}
-
-int main(int argc, char **argv)
-{
- struct statx stx;
- int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
-
- unsigned int mask = STATX_ALL;
-
- for (argv++; *argv; argv++) {
- if (strcmp(*argv, "-F") == 0) {
- atflag &= ~AT_STATX_SYNC_TYPE;
- atflag |= AT_STATX_FORCE_SYNC;
- continue;
- }
- if (strcmp(*argv, "-D") == 0) {
- atflag &= ~AT_STATX_SYNC_TYPE;
- atflag |= AT_STATX_DONT_SYNC;
- continue;
- }
- if (strcmp(*argv, "-L") == 0) {
- atflag &= ~AT_SYMLINK_NOFOLLOW;
- continue;
- }
- if (strcmp(*argv, "-O") == 0) {
- mask &= ~STATX_BASIC_STATS;
- continue;
- }
- if (strcmp(*argv, "-A") == 0) {
- atflag |= AT_NO_AUTOMOUNT;
- continue;
- }
- if (strcmp(*argv, "-R") == 0) {
- raw = 1;
- continue;
- }
-
- memset(&stx, 0xbf, sizeof(stx));
- ret = statx(AT_FDCWD, *argv, atflag, mask, &stx);
- printf("statx(%s) = %d\n", *argv, ret);
- if (ret < 0) {
- perror(*argv);
- exit(1);
- }
-
- if (raw)
- dump_hex((unsigned long long *)&stx, 0, sizeof(stx));
-
- dump_statx(&stx);
- }
- return 0;
-}
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
new file mode 100644
index 000000000000..4ac9690fb3c4
--- /dev/null
+++ b/samples/vfs/Makefile
@@ -0,0 +1,10 @@
+# List of programs to build
+hostprogs-$(CONFIG_SAMPLE_VFS) := \
+ test-fsmount \
+ test-statx
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
+HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
diff --git a/samples/vfs/test-fsmount.c b/samples/vfs/test-fsmount.c
new file mode 100644
index 000000000000..74124025ade0
--- /dev/null
+++ b/samples/vfs/test-fsmount.c
@@ -0,0 +1,118 @@
+/* fd-based mount test.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <linux/fs.h>
+#include <linux/unistd.h>
+
+#define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
+
+static void check_messages(int fd)
+{
+ char buf[4096];
+ int err, n;
+
+ err = errno;
+
+ for (;;) {
+ n = read(fd, buf, sizeof(buf));
+ if (n < 0)
+ break;
+ n -= 2;
+
+ switch (buf[0]) {
+ case 'e':
+ fprintf(stderr, "Error: %*.*s\n", n, n, buf + 2);
+ break;
+ case 'w':
+ fprintf(stderr, "Warning: %*.*s\n", n, n, buf + 2);
+ break;
+ case 'i':
+ fprintf(stderr, "Info: %*.*s\n", n, n, buf + 2);
+ break;
+ }
+ }
+
+ errno = err;
+}
+
+static __attribute__((noreturn))
+void mount_error(int fd, const char *s)
+{
+ check_messages(fd);
+ fprintf(stderr, "%s: %m\n", s);
+ exit(1);
+}
+
+static inline int fsopen(const char *fs_name, unsigned int flags)
+{
+ return syscall(__NR_fsopen, fs_name, flags);
+}
+
+static inline int fsmount(int fsfd, unsigned int flags, unsigned int ms_flags)
+{
+ return syscall(__NR_fsmount, fsfd, flags, ms_flags);
+}
+
+static inline int fsconfig(int fsfd, unsigned int cmd,
+ const char *key, const void *val, int aux)
+{
+ return syscall(__NR_fsconfig, fsfd, cmd, key, val, aux);
+}
+
+static inline int move_mount(int from_dfd, const char *from_pathname,
+ int to_dfd, const char *to_pathname,
+ unsigned int flags)
+{
+ return syscall(__NR_move_mount,
+ from_dfd, from_pathname,
+ to_dfd, to_pathname, flags);
+}
+
+#define E_fsconfig(fd, cmd, key, val, aux) \
+ do { \
+ if (fsconfig(fd, cmd, key, val, aux) == -1) \
+ mount_error(fd, key ?: "create"); \
+ } while (0)
+
+int main(int argc, char *argv[])
+{
+ int fsfd, mfd;
+
+ /* Mount a publically available AFS filesystem */
+ fsfd = fsopen("afs", 0);
+ if (fsfd == -1) {
+ perror("fsopen");
+ exit(1);
+ }
+
+ E_fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "#grand.central.org:root.cell.", 0);
+ E_fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+
+ mfd = fsmount(fsfd, 0, MS_RDONLY);
+ if (mfd < 0)
+ mount_error(fsfd, "fsmount");
+ E(close(fsfd));
+
+ if (move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH) < 0) {
+ perror("move_mount");
+ exit(1);
+ }
+
+ E(close(mfd));
+ exit(0);
+}
diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
new file mode 100644
index 000000000000..d4d77b09412c
--- /dev/null
+++ b/samples/vfs/test-statx.c
@@ -0,0 +1,258 @@
+/* Test the statx() system call.
+ *
+ * Note that the output of this program is intended to look like the output of
+ * /bin/stat where possible.
+ *
+ * Copyright (C) 2015 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <linux/stat.h>
+#include <linux/fcntl.h>
+#include <sys/stat.h>
+
+#define AT_STATX_SYNC_TYPE 0x6000
+#define AT_STATX_SYNC_AS_STAT 0x0000
+#define AT_STATX_FORCE_SYNC 0x2000
+#define AT_STATX_DONT_SYNC 0x4000
+
+static __attribute__((unused))
+ssize_t statx(int dfd, const char *filename, unsigned flags,
+ unsigned int mask, struct statx *buffer)
+{
+ return syscall(__NR_statx, dfd, filename, flags, mask, buffer);
+}
+
+static void print_time(const char *field, struct statx_timestamp *ts)
+{
+ struct tm tm;
+ time_t tim;
+ char buffer[100];
+ int len;
+
+ tim = ts->tv_sec;
+ if (!localtime_r(&tim, &tm)) {
+ perror("localtime_r");
+ exit(1);
+ }
+ len = strftime(buffer, 100, "%F %T", &tm);
+ if (len == 0) {
+ perror("strftime");
+ exit(1);
+ }
+ printf("%s", field);
+ fwrite(buffer, 1, len, stdout);
+ printf(".%09u", ts->tv_nsec);
+ len = strftime(buffer, 100, "%z", &tm);
+ if (len == 0) {
+ perror("strftime2");
+ exit(1);
+ }
+ fwrite(buffer, 1, len, stdout);
+ printf("\n");
+}
+
+static void dump_statx(struct statx *stx)
+{
+ char buffer[256], ft = '?';
+
+ printf("results=%x\n", stx->stx_mask);
+
+ printf(" ");
+ if (stx->stx_mask & STATX_SIZE)
+ printf(" Size: %-15llu", (unsigned long long)stx->stx_size);
+ if (stx->stx_mask & STATX_BLOCKS)
+ printf(" Blocks: %-10llu", (unsigned long long)stx->stx_blocks);
+ printf(" IO Block: %-6llu", (unsigned long long)stx->stx_blksize);
+ if (stx->stx_mask & STATX_TYPE) {
+ switch (stx->stx_mode & S_IFMT) {
+ case S_IFIFO: printf(" FIFO\n"); ft = 'p'; break;
+ case S_IFCHR: printf(" character special file\n"); ft = 'c'; break;
+ case S_IFDIR: printf(" directory\n"); ft = 'd'; break;
+ case S_IFBLK: printf(" block special file\n"); ft = 'b'; break;
+ case S_IFREG: printf(" regular file\n"); ft = '-'; break;
+ case S_IFLNK: printf(" symbolic link\n"); ft = 'l'; break;
+ case S_IFSOCK: printf(" socket\n"); ft = 's'; break;
+ default:
+ printf(" unknown type (%o)\n", stx->stx_mode & S_IFMT);
+ break;
+ }
+ } else {
+ printf(" no type\n");
+ }
+
+ sprintf(buffer, "%02x:%02x", stx->stx_dev_major, stx->stx_dev_minor);
+ printf("Device: %-15s", buffer);
+ if (stx->stx_mask & STATX_INO)
+ printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
+ if (stx->stx_mask & STATX_NLINK)
+ printf(" Links: %-5u", stx->stx_nlink);
+ if (stx->stx_mask & STATX_TYPE) {
+ switch (stx->stx_mode & S_IFMT) {
+ case S_IFBLK:
+ case S_IFCHR:
+ printf(" Device type: %u,%u",
+ stx->stx_rdev_major, stx->stx_rdev_minor);
+ break;
+ }
+ }
+ printf("\n");
+
+ if (stx->stx_mask & STATX_MODE)
+ printf("Access: (%04o/%c%c%c%c%c%c%c%c%c%c) ",
+ stx->stx_mode & 07777,
+ ft,
+ stx->stx_mode & S_IRUSR ? 'r' : '-',
+ stx->stx_mode & S_IWUSR ? 'w' : '-',
+ stx->stx_mode & S_IXUSR ? 'x' : '-',
+ stx->stx_mode & S_IRGRP ? 'r' : '-',
+ stx->stx_mode & S_IWGRP ? 'w' : '-',
+ stx->stx_mode & S_IXGRP ? 'x' : '-',
+ stx->stx_mode & S_IROTH ? 'r' : '-',
+ stx->stx_mode & S_IWOTH ? 'w' : '-',
+ stx->stx_mode & S_IXOTH ? 'x' : '-');
+ if (stx->stx_mask & STATX_UID)
+ printf("Uid: %5d ", stx->stx_uid);
+ if (stx->stx_mask & STATX_GID)
+ printf("Gid: %5d\n", stx->stx_gid);
+
+ if (stx->stx_mask & STATX_ATIME)
+ print_time("Access: ", &stx->stx_atime);
+ if (stx->stx_mask & STATX_MTIME)
+ print_time("Modify: ", &stx->stx_mtime);
+ if (stx->stx_mask & STATX_CTIME)
+ print_time("Change: ", &stx->stx_ctime);
+ if (stx->stx_mask & STATX_BTIME)
+ print_time(" Birth: ", &stx->stx_btime);
+
+ if (stx->stx_attributes_mask) {
+ unsigned char bits, mbits;
+ int loop, byte;
+
+ static char attr_representation[64 + 1] =
+ /* STATX_ATTR_ flags: */
+ "????????" /* 63-56 */
+ "????????" /* 55-48 */
+ "????????" /* 47-40 */
+ "????????" /* 39-32 */
+ "????????" /* 31-24 0x00000000-ff000000 */
+ "????????" /* 23-16 0x00000000-00ff0000 */
+ "???me???" /* 15- 8 0x00000000-0000ff00 */
+ "?dai?c??" /* 7- 0 0x00000000-000000ff */
+ ;
+
+ printf("Attributes: %016llx (", stx->stx_attributes);
+ for (byte = 64 - 8; byte >= 0; byte -= 8) {
+ bits = stx->stx_attributes >> byte;
+ mbits = stx->stx_attributes_mask >> byte;
+ for (loop = 7; loop >= 0; loop--) {
+ int bit = byte + loop;
+
+ if (!(mbits & 0x80))
+ putchar('.'); /* Not supported */
+ else if (bits & 0x80)
+ putchar(attr_representation[63 - bit]);
+ else
+ putchar('-'); /* Not set */
+ bits <<= 1;
+ mbits <<= 1;
+ }
+ if (byte)
+ putchar(' ');
+ }
+ printf(")\n");
+ }
+}
+
+static void dump_hex(unsigned long long *data, int from, int to)
+{
+ unsigned offset, print_offset = 1, col = 0;
+
+ from /= 8;
+ to = (to + 7) / 8;
+
+ for (offset = from; offset < to; offset++) {
+ if (print_offset) {
+ printf("%04x: ", offset * 8);
+ print_offset = 0;
+ }
+ printf("%016llx", data[offset]);
+ col++;
+ if ((col & 3) == 0) {
+ printf("\n");
+ print_offset = 1;
+ } else {
+ printf(" ");
+ }
+ }
+
+ if (!print_offset)
+ printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+ struct statx stx;
+ int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
+
+ unsigned int mask = STATX_ALL;
+
+ for (argv++; *argv; argv++) {
+ if (strcmp(*argv, "-F") == 0) {
+ atflag &= ~AT_STATX_SYNC_TYPE;
+ atflag |= AT_STATX_FORCE_SYNC;
+ continue;
+ }
+ if (strcmp(*argv, "-D") == 0) {
+ atflag &= ~AT_STATX_SYNC_TYPE;
+ atflag |= AT_STATX_DONT_SYNC;
+ continue;
+ }
+ if (strcmp(*argv, "-L") == 0) {
+ atflag &= ~AT_SYMLINK_NOFOLLOW;
+ continue;
+ }
+ if (strcmp(*argv, "-O") == 0) {
+ mask &= ~STATX_BASIC_STATS;
+ continue;
+ }
+ if (strcmp(*argv, "-A") == 0) {
+ atflag |= AT_NO_AUTOMOUNT;
+ continue;
+ }
+ if (strcmp(*argv, "-R") == 0) {
+ raw = 1;
+ continue;
+ }
+
+ memset(&stx, 0xbf, sizeof(stx));
+ ret = statx(AT_FDCWD, *argv, atflag, mask, &stx);
+ printf("statx(%s) = %d\n", *argv, ret);
+ if (ret < 0) {
+ perror(*argv);
+ exit(1);
+ }
+
+ if (raw)
+ dump_hex((unsigned long long *)&stx, 0, sizeof(stx));
+
+ dump_statx(&stx);
+ }
+ return 0;
+}


2018-09-21 16:36:25

by David Howells

Subject: [PATCH 30/34] vfs: syscall: Add fsmount() to create a mount for a superblock [ver #12]

Provide a system call by which a filesystem opened with fsopen() and
configured by a series of fsconfig() calls can have a detached mount object
created for it. This mount object can then be attached to the VFS mount
hierarchy using move_mount() by passing the returned file descriptor as the
from directory fd.

The system call looks like:

int mfd = fsmount(int fsfd, unsigned int flags,
unsigned int ms_flags);

where fsfd is the file descriptor returned by fsopen(). flags can be 0 or
FSMOUNT_CLOEXEC. ms_flags is a bitwise-OR of the following flags:

MS_RDONLY
MS_NOSUID
MS_NODEV
MS_NOEXEC
MS_NOATIME
MS_NODIRATIME
MS_RELATIME
MS_STRICTATIME

MS_UNBINDABLE
MS_PRIVATE
MS_SLAVE
MS_SHARED
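
As a rough illustration of how ms_flags is interpreted, the following user-space sketch mirrors the validation and translation steps in the fs/namespace.c hunk of this patch. Note that the hunk as posted only accepts the first eight flags listed above and returns -EINVAL for anything else, including the propagation flags. The helper name sk_ms_to_mnt_flags() and the SK_* constants are hypothetical local copies: the MS_* values are taken from the userspace mount headers and the MNT_* values are assumed from the kernel-internal include/linux/mount.h.

```c
#include <errno.h>

/* Illustrative local copies; not the kernel's own definitions. */
#define SK_MS_RDONLY      (1u << 0)
#define SK_MS_NOSUID      (1u << 1)
#define SK_MS_NODEV       (1u << 2)
#define SK_MS_NOEXEC      (1u << 3)
#define SK_MS_NOATIME     (1u << 10)
#define SK_MS_NODIRATIME  (1u << 11)
#define SK_MS_RELATIME    (1u << 21)
#define SK_MS_STRICTATIME (1u << 24)

/* Assumed values of the kernel-internal MNT_* flags. */
#define SK_MNT_NOSUID     0x01u
#define SK_MNT_NODEV      0x02u
#define SK_MNT_NOEXEC     0x04u
#define SK_MNT_NOATIME    0x08u
#define SK_MNT_NODIRATIME 0x10u
#define SK_MNT_RELATIME   0x20u
#define SK_MNT_READONLY   0x40u

/*
 * Hypothetical helper: returns 0 and stores the translated flags in
 * *mnt_flags, or returns -EINVAL without touching *mnt_flags.
 */
static int sk_ms_to_mnt_flags(unsigned int ms_flags, unsigned int *mnt_flags)
{
	unsigned int out = 0;

	/* Reject any bit outside the set accepted by the posted hunk. */
	if (ms_flags & ~(SK_MS_RDONLY | SK_MS_NOSUID | SK_MS_NODEV |
			 SK_MS_NOEXEC | SK_MS_NOATIME | SK_MS_NODIRATIME |
			 SK_MS_RELATIME | SK_MS_STRICTATIME))
		return -EINVAL;

	if (ms_flags & SK_MS_RDONLY)
		out |= SK_MNT_READONLY;
	if (ms_flags & SK_MS_NOSUID)
		out |= SK_MNT_NOSUID;
	if (ms_flags & SK_MS_NODEV)
		out |= SK_MNT_NODEV;
	if (ms_flags & SK_MS_NOEXEC)
		out |= SK_MNT_NOEXEC;
	if (ms_flags & SK_MS_NODIRATIME)
		out |= SK_MNT_NODIRATIME;

	/* strictatime and noatime are mutually exclusive; relatime is the default. */
	if (ms_flags & SK_MS_STRICTATIME) {
		if (ms_flags & SK_MS_NOATIME)
			return -EINVAL;
	} else if (ms_flags & SK_MS_NOATIME) {
		out |= SK_MNT_NOATIME;
	} else {
		out |= SK_MNT_RELATIME;
	}

	*mnt_flags = out;
	return 0;
}
```

Under these assumptions, passing 0 yields just the relatime bit (0x20), which matches the mnt_flags=20 value that shows up in the debug output later in this thread.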

In the event that fsmount() fails, it may be possible to get an error
message by calling read() on fsfd. If no message is available, ENODATA
will be reported.
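
For illustration, a message read back from fsfd can be split into a severity label and its text roughly as below. The layout (a one-character type of 'e', 'w' or 'i', one separator character, then the text) is inferred from check_messages() in samples/vfs/test-fsmount.c in this series, not from a documented ABI, and the helper name sk_classify_fs_message() is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: classify one message of 'len' bytes that was
 * already read from the fs context fd.  Returns a label and points
 * *text / *text_len at the message body, or returns NULL if the buffer
 * is too short or the type character is not recognised.
 */
static const char *sk_classify_fs_message(const char *buf, size_t len,
					  const char **text, size_t *text_len)
{
	const char *label;

	/* Need at least the type character and the separator. */
	if (len < 2)
		return NULL;

	switch (buf[0]) {
	case 'e':
		label = "Error";
		break;
	case 'w':
		label = "Warning";
		break;
	case 'i':
		label = "Info";
		break;
	default:
		return NULL;
	}

	/* The text is not NUL-terminated; callers must honour *text_len. */
	*text = buf + 2;
	*text_len = len - 2;
	return label;
}
```

A caller would loop on read(fsfd, ...) until it fails (ENODATA once the queue is empty, per the description above) and print each body with a bounded format such as "%.*s".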

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/namespace.c | 125 +++++++++++++++++++++++++++++++-
include/linux/syscalls.h | 1
include/uapi/linux/fs.h | 2 +
5 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f9970310c126..c78b68256f8a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -402,3 +402,4 @@
388 i386 move_mount sys_move_mount __ia32_sys_move_mount
389 i386 fsopen sys_fsopen __ia32_sys_fsopen
390 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
+391 i386 fsmount sys_fsmount __ia32_sys_fsmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 4185d36e03bb..d44ead5d4368 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,7 @@
336 common move_mount __x64_sys_move_mount
337 common fsopen __x64_sys_fsopen
338 common fsconfig __x64_sys_fsconfig
+339 common fsmount __x64_sys_fsmount

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index 156261d03c12..4dfe7e23b7ee 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2463,7 +2463,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)

attached = mnt_has_parent(old);
/*
- * We need to allow open_tree(OPEN_TREE_CLONE) followed by
+ * We need to allow open_tree(OPEN_TREE_CLONE) or fsmount() followed by
* move_mount(), but mustn't allow "/" to be moved.
*/
if (old->mnt_ns && !attached)
@@ -3329,9 +3329,126 @@ struct vfsmount *kern_mount(struct file_system_type *type)
EXPORT_SYMBOL_GPL(kern_mount);

/*
- * Move a mount from one place to another.
- * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
- * used to copy a mount subtree.
+ * Create a kernel mount representation for a new, prepared superblock
+ * (specified by fs_fd) and attach to an open_tree-like file descriptor.
+ */
+SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags)
+{
+ struct fs_context *fc;
+ struct file *file;
+ struct path newmount;
+ struct fd f;
+ unsigned int mnt_flags = 0;
+ long ret;
+
+ if (!may_mount())
+ return -EPERM;
+
+ if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
+ return -EINVAL;
+
+ if (ms_flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
+ MS_STRICTATIME))
+ return -EINVAL;
+
+ if (ms_flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
+ if (ms_flags & MS_NOSUID)
+ mnt_flags |= MNT_NOSUID;
+ if (ms_flags & MS_NODEV)
+ mnt_flags |= MNT_NODEV;
+ if (ms_flags & MS_NOEXEC)
+ mnt_flags |= MNT_NOEXEC;
+ if (ms_flags & MS_NODIRATIME)
+ mnt_flags |= MNT_NODIRATIME;
+
+ if (ms_flags & MS_STRICTATIME) {
+ if (ms_flags & MS_NOATIME)
+ return -EINVAL;
+ } else if (ms_flags & MS_NOATIME) {
+ mnt_flags |= MNT_NOATIME;
+ } else {
+ mnt_flags |= MNT_RELATIME;
+ }
+
+ f = fdget(fs_fd);
+ if (!f.file)
+ return -EBADF;
+
+ ret = -EINVAL;
+ if (f.file->f_op != &fscontext_fops)
+ goto err_fsfd;
+
+ fc = f.file->private_data;
+
+ /* There must be a valid superblock or we can't mount it */
+ ret = -EINVAL;
+ if (!fc->root)
+ goto err_fsfd;
+
+ ret = -EPERM;
+ if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
+ pr_warn("VFS: Mount too revealing\n");
+ goto err_fsfd;
+ }
+
+ ret = mutex_lock_interruptible(&fc->uapi_mutex);
+ if (ret < 0)
+ goto err_fsfd;
+
+ ret = -EBUSY;
+ if (fc->phase != FS_CONTEXT_AWAITING_MOUNT)
+ goto err_unlock;
+
+ ret = -EPERM;
+ if ((fc->sb_flags & SB_MANDLOCK) && !may_mandlock())
+ goto err_unlock;
+
+ newmount.mnt = vfs_create_mount(fc, mnt_flags);
+ if (IS_ERR(newmount.mnt)) {
+ ret = PTR_ERR(newmount.mnt);
+ goto err_unlock;
+ }
+ newmount.dentry = dget(fc->root);
+
+ /* We've done the mount bit - now move the file context into more or
+ * less the same state as if we'd done an fspick(). We don't want to
+ * do any memory allocation or anything like that at this point as we
+ * don't want to have to handle any errors incurred.
+ */
+ vfs_clean_context(fc);
+
+ /* Attach to an apparent O_PATH fd with a note that we need to unmount
+ * it, not just simply put it.
+ */
+ file = dentry_open(&newmount, O_PATH, fc->cred);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto err_path;
+ }
+ file->f_mode |= FMODE_NEED_UNMOUNT;
+
+ ret = get_unused_fd_flags((flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0);
+ if (ret >= 0)
+ fd_install(ret, file);
+ else
+ fput(file);
+
+err_path:
+ path_put(&newmount);
+err_unlock:
+ mutex_unlock(&fc->uapi_mutex);
+err_fsfd:
+ fdput(f);
+ return ret;
+}
+
+/*
+ * Move a mount from one place to another. In combination with
+ * fsopen()/fsmount() this is used to install a new mount and in combination
+ * with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be used to copy
+ * a mount subtree.
*
* Note the flags value is a combination of MOVE_MOUNT_* flags.
*/
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4ab15fdf8aea..4697fad47789 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -913,6 +913,7 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
const void __user *value, int aux);
+asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index fecbae30a30d..10281d582e28 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -349,6 +349,8 @@ typedef int __bitwise __kernel_rwf_t;
*/
#define FSOPEN_CLOEXEC 0x00000001

+#define FSMOUNT_CLOEXEC 0x00000001
+
/*
* The type of fsconfig() call made.
*/


2018-10-04 18:37:57

by Eric W. Biederman

Subject: Re: [PATCH 00/34] VFS: Introduce filesystem context [ver #12]


David,

I have been going through these and it is a wonderful proof of concept
patchset. There are a couple significant problems with it however.

- Many patches do more than one thing that could benefit from being
broken up into more patches so that there is only one logical change
per patch. I have attempted a little of that and have found several
significant bugs.

- There are many unnecessary changes in this patchset that just add
noise and make it difficult to review.

- There are many typos and thinkos in this patchset that, while not hard
to correct, keep this from being anywhere close to ready for
prime time.

- Some of the bugs I have encountered.
* proc that isn't pid_ns_prepare_proc does not set fc->user_ns to
match the pid namespace.
* mqueue does not set fc->user_ns to match the ipc namespace.
* The cpuset filesystem always fails to mount
* Non-converted filesystems don't have the old security hooks
and only have a bit blob so don't call into the new security
hooks either.
* The changes to implement the new security hooks at least for
selinux are riddled with typos, and thinkos.

I was hoping to get into the semantic questions but I can't get
there until I get a good solid baseline patch to work with.

I have been able to hoist the permission check out of sget_fc for
converted filesystems. So progress is being made. That absolutely
requires fc->user_ns to be set properly before vfs_get_tree. Something
that still needs to be fixed.

I have also observed that by not allowing unconverted filesystems
to mount using the new API, the compatibility code can be
significantly simplified, and the whole data_size problem goes away.

I am going to be travelling for the next couple of days so I
don't expect I will be able to answer questions in a timely manner.
In the hopes that it might help below is my work in progress git
tree where I have cleaned up some of these issues.

https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git new-mount-api-testing

Eric



2018-10-05 18:25:10

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 21/09/2018 17:30, David Howells wrote:
> From: Al Viro <[email protected]>
>
> Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
> attached by move_mount(2).
>
> If by the time of final fput() of OPEN_TREE_CLONE-opened file its tree is
> not detached anymore, it won't be dissolved. move_mount(2) is adjusted
> to handle detached source.
>
> That gives us equivalents of mount --bind and mount --rbind.
>
> Signed-off-by: Al Viro <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> fs/namespace.c | 26 ++++++++++++++++++++------
> 1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index dd38141b1723..caf5c55ef555 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1785,8 +1785,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
> {
> namespace_lock();
> lock_mount_hash();
> - mntget(mnt);
> - umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + if (!real_mount(mnt)->mnt_ns) {
> + mntget(mnt);
> + umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + }
> unlock_mount_hash();
> namespace_unlock();
> }
> @@ -2393,6 +2395,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> struct mount *old;
> struct mountpoint *mp;
> int err;
> + bool attached;
>
> mp = lock_mount(new_path);
> err = PTR_ERR(mp);
> @@ -2403,10 +2406,19 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> p = real_mount(new_path->mnt);
>
> err = -EINVAL;
> - if (!check_mnt(p) || !check_mnt(old))
> + /* The mountpoint must be in our namespace. */
> + if (!check_mnt(p))
> + goto out1;
> + /* The thing moved should be either ours or completely unattached. */
> + if (old->mnt_ns && !check_mnt(old))
> goto out1;
>
> - if (!mnt_has_parent(old))
> + attached = mnt_has_parent(old);
> + /*
> + * We need to allow open_tree(OPEN_TREE_CLONE) followed by
> + * move_mount(), but mustn't allow "/" to be moved.
> + */
> + if (old->mnt_ns && !attached)
> goto out1;
>
> if (old->mnt.mnt_flags & MNT_LOCKED)

Hi

I replied last time to wonder about the MNT_UMOUNT mnt_flag. So I've
tested it now :-), on David's current tree (commit 5581f4935add).

The modified do_move_mount() allows re-attaching something that was
lazy-unmounted. But the lazy unmount sets MNT_UMOUNT. And this flag is
not cleared when the mount is re-attached.
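
To make the point concrete, here is a toy user-space model of the checks in the quoted do_move_mount() hunk. It is only a sketch: struct sk_mount, sk_may_move() and the SK_MNT_* values are hypothetical stand-ins (the flag values are assumed from include/linux/mount.h), check_mnt() is modelled as a plain pointer comparison against the caller's namespace, and the real function performs further checks that are not modelled here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MNT_LOCKED 0x800000u		/* assumed kernel value */
#define SK_MNT_UMOUNT 0x8000000u	/* assumed kernel value */

/* Toy stand-in for the fields of struct mount that the hunk looks at. */
struct sk_mount {
	const void *mnt_ns;	/* NULL when not in any namespace */
	bool has_parent;	/* models mnt_has_parent() */
	unsigned int flags;	/* models mnt.mnt_flags */
};

/* Returns 0 if the modelled checks would let the move proceed. */
static int sk_may_move(const struct sk_mount *target,
		       const struct sk_mount *old, const void *current_ns)
{
	/* The mountpoint must be in our namespace. */
	if (target->mnt_ns != current_ns)
		return -EINVAL;
	/* The thing moved should be either ours or completely unattached. */
	if (old->mnt_ns && old->mnt_ns != current_ns)
		return -EINVAL;
	/* An attached-namespace mount without a parent is "/": refuse. */
	if (old->mnt_ns && !old->has_parent)
		return -EINVAL;
	if (old->flags & SK_MNT_LOCKED)
		return -EINVAL;
	/* Nothing above looks at SK_MNT_UMOUNT. */
	return 0;
}
```

In this model a lazily unmounted mount (mnt_ns == NULL, no parent, MNT_UMOUNT set) passes every check, while the namespace root is still rejected.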

I wasn't sure what effect this would have. Luckily it showed up straight
away, when I tried to unmount again. It causes a soft lockup.

Debug printk:

diff --git a/fs/namespace.c b/fs/namespace.c
index 4dfe7e23b7ee..ac8de9191cfe 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2472,6 +2472,10 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
if (old->mnt.mnt_flags & MNT_LOCKED)
goto out1;

+ pr_info("mnt_flags=%x umount=%x\n",
+ (unsigned) old->mnt.mnt_flags,
+ (unsigned) !!(old->mnt.mnt_flags & MNT_UMOUNT));
+
if (old_path->dentry != old_path->mnt->mnt_root)
goto out1;

Testing:

# mount -ttmpfs tmp /mnt
# cd /mnt
# umount .
umount: /mnt: target is busy.
# umount -l .
# mount --move . /mnt
[ 577.773804] mnt_flags=8000020 umount=1

Double-check the flags after the mount is re-attached:

# mount --move . /mnt
[ 610.891311] mnt_flags=8000020 umount=1
mount: /mnt: mount(2) system call failed: Too many levels of symbolic links.
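
For readers decoding the mnt_flags value printed above by hand, a small helper along these lines may be useful. The bit values in the table are assumptions taken from include/linux/mount.h in kernels of this era, the table is deliberately incomplete, and sk_decode_mnt_flags() is a hypothetical name.

```c
#include <stddef.h>
#include <stdio.h>

/* Partial table of MNT_* bits; values assumed, not authoritative. */
static const struct {
	unsigned int bit;
	const char *name;
} sk_flag_names[] = {
	{ 0x01u,      "MNT_NOSUID" },
	{ 0x02u,      "MNT_NODEV" },
	{ 0x04u,      "MNT_NOEXEC" },
	{ 0x08u,      "MNT_NOATIME" },
	{ 0x10u,      "MNT_NODIRATIME" },
	{ 0x20u,      "MNT_RELATIME" },
	{ 0x40u,      "MNT_READONLY" },
	{ 0x800000u,  "MNT_LOCKED" },
	{ 0x8000000u, "MNT_UMOUNT" },
};

/* Write the names of the known bits set in 'flags' into 'out', '|'-separated. */
static void sk_decode_mnt_flags(unsigned int flags, char *out, size_t size)
{
	size_t used = 0;
	size_t i;

	if (size == 0)
		return;
	out[0] = '\0';

	for (i = 0; i < sizeof(sk_flag_names) / sizeof(sk_flag_names[0]); i++) {
		int n;

		if (!(flags & sk_flag_names[i].bit))
			continue;
		n = snprintf(out + used, size - used, "%s%s",
			     used ? "|" : "", sk_flag_names[i].name);
		if (n < 0 || (size_t)n >= size - used)
			break;	/* output truncated; stop here */
		used += (size_t)n;
	}
}
```

With this table, 0x8000020 decodes to MNT_RELATIME plus MNT_UMOUNT, which is consistent with the umount=1 column of the printk.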

The bug:

# cd
# umount /mnt
[ 656.229099] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [umount:1457]
[ 656.230231] Modules linked in: xt_CHECKSUM(E) ipt_MASQUERADE(E) tun(E) bridge(E) stp(E) llc(E) ip6t_rpfilter(E) ip6t_REJECT(E) nf_reject_ipv6(E) xt_conntrack(E) devlink(E) ip6table_nat(E) nf_nat_ipv6(E) ip6table_mangle(E) ip6table_raw(E) ip6table_security(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) nf_defrag_ipv4(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ip6table_filter(E) ip6_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hwdep(E) snd_hda_core(E) snd_seq(E) snd_seq_device(E) snd_pcm(E) joydev(E) crc32_pclmul(E) snd_timer(E) snd(E) ghash_clmulni_intel(E) crct10dif_pclmul(E) virtio_balloon(E) soundcore(E) serio_raw(E) crc32c_intel(E) qxl(E) virtio_console(E) virtio_net(E) net_failover(E) failover(E) drm_kms_helper(E)
[ 656.242150] ttm(E) drm(E) qemu_fw_cfg(E) pata_acpi(E) ata_generic(E)
[ 656.243333] CPU: 0 PID: 1457 Comm: umount Tainted: G E 4.19.0-rc3+ #7
[ 656.244767] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[ 656.247038] RIP: 0010:pin_kill+0x128/0x140
[ 656.247789] Code: f2 5a 00 48 8b 44 24 20 48 39 c5 0f 84 6f ff ff ff 48 89 df e8 e9 4a 5b 00 8b 43 18 85 c0 7e b3 c6 03 00 fb 66 0f 1f 44 00 00 <e9> 51 ff ff ff e8 be 11 dd ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00
[ 656.250738] RSP: 0018:ffffa58040f93e30 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 656.251984] RAX: 0000000000000000 RBX: ffff971a6b16dc30 RCX: dead000000000200
[ 656.253183] RDX: 0000000000000001 RSI: ffffa58040f93dd0 RDI: ffff971a6b16dc30
[ 656.254484] RBP: ffffa58040f93e50 R08: 000000000000067d R09: 000000000000067d
[ 656.255838] R10: 0000000000000000 R11: 0000000000000000 R12: ffff971a6b2b1800
[ 656.257181] R13: ffff971a6b16db88 R14: 0000000000000000 R15: ffff971a6b16db50
[ 656.258530] FS: 00007fc7bac88fc0(0000) GS:ffff971ad9600000(0000) knlGS:0000000000000000
[ 656.260079] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 656.261165] CR2: 00007fc7ba8704c7 CR3: 000000002d22c001 CR4: 00000000003606f0
[ 656.262506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 656.263690] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 656.265329] Call Trace:
[ 656.267958] ? finish_wait+0x80/0x80
[ 656.269083] group_pin_kill+0x1a/0x30
[ 656.269989] namespace_unlock+0x6f/0x80
[ 656.270652] ksys_umount+0x220/0x420
[ 656.271393] __x64_sys_umount+0x12/0x20
[ 656.272249] do_syscall_64+0x5b/0x160
[ 656.272988] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 656.273942] RIP: 0033:0x7fc7b9cd9117
[ 656.274630] Code: ed 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 ed 2b 00 f7 d8 64 89 01 48
[ 656.278886] RSP: 002b:00007ffe0a557498 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 656.281518] RAX: ffffffffffffffda RBX: 0000556bab8bd420 RCX: 00007fc7b9cd9117
[ 656.283138] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556bab8bd600
[ 656.284757] RBP: 0000000000000000 R08: 0000556bab8bd620 R09: 00007ffe0a555d00
[ 656.286367] R10: 0000000000000000 R11: 0000000000000246 R12: 0000556bab8bd600
[ 656.288408] R13: 00007fc7baa7f1a4 R14: 0000000000000000 R15: 00007ffe0a557708


2018-10-07 10:49:04

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 05/10/2018 19:24, Alan Jenkins wrote:
> On 21/09/2018 17:30, David Howells wrote:
>> From: Al Viro <[email protected]>
>>
>> Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
>> attached by move_mount(2).
>>
>> If by the time of final fput() of OPEN_TREE_CLONE-opened file its
>> tree is
>> not detached anymore, it won't be dissolved.  move_mount(2) is adjusted
>> to handle detached source.
>>
>> That gives us equivalents of mount --bind and mount --rbind.
>>
>> Signed-off-by: Al Viro <[email protected]>
>> Signed-off-by: David Howells <[email protected]>
>> ---
>>
>>   fs/namespace.c |   26 ++++++++++++++++++++------
>>   1 file changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index dd38141b1723..caf5c55ef555 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -1785,8 +1785,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
>>   {
>>       namespace_lock();
>>       lock_mount_hash();
>> -    mntget(mnt);
>> -    umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
>> +    if (!real_mount(mnt)->mnt_ns) {
>> +        mntget(mnt);
>> +        umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
>> +    }
>>       unlock_mount_hash();
>>       namespace_unlock();
>>   }
>> @@ -2393,6 +2395,7 @@ static int do_move_mount(struct path *old_path,
>> struct path *new_path)
>>       struct mount *old;
>>       struct mountpoint *mp;
>>       int err;
>> +    bool attached;
>>         mp = lock_mount(new_path);
>>       err = PTR_ERR(mp);
>> @@ -2403,10 +2406,19 @@ static int do_move_mount(struct path
>> *old_path, struct path *new_path)
>>       p = real_mount(new_path->mnt);
>>         err = -EINVAL;
>> -    if (!check_mnt(p) || !check_mnt(old))
>> +    /* The mountpoint must be in our namespace. */
>> +    if (!check_mnt(p))
>> +        goto out1;
>> +    /* The thing moved should be either ours or completely
>> unattached. */
>> +    if (old->mnt_ns && !check_mnt(old))
>>           goto out1;
>>   -    if (!mnt_has_parent(old))
>> +    attached = mnt_has_parent(old);
>> +    /*
>> +     * We need to allow open_tree(OPEN_TREE_CLONE) followed by
>> +     * move_mount(), but mustn't allow "/" to be moved.
>> +     */
>> +    if (old->mnt_ns && !attached)
>>           goto out1;
>>         if (old->mnt.mnt_flags & MNT_LOCKED)
>
> Hi
>
> I replied last time to wonder about the MNT_UMOUNT mnt_flag. So I've
> tested it now :-), on David's current tree (commit 5581f4935add).
>
> The modified do_move_mount() allows re-attaching something that was
> lazy-unmounted. But the lazy unmount sets MNT_UMOUNT. And this flag is
> not cleared when the mount is re-attached.
>
> I wasn't sure what effect this would have. Luckily it showed up
> straight away, when I tried to unmount again. It causes a soft lockup.
>
> Debug printk:
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4dfe7e23b7ee..ac8de9191cfe 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2472,6 +2472,10 @@ static int do_move_mount(struct path *old_path,
> struct path *new_path)
>      if (old->mnt.mnt_flags & MNT_LOCKED)
>          goto out1;
>
> +    pr_info("mnt_flags=%x umount=%x\n",
> +            (unsigned) old->mnt.mnt_flags,
> +            (unsigned) !!(old->mnt.mnt_flags & MNT_UMOUNT));
> +
>      if (old_path->dentry != old_path->mnt->mnt_root)
>          goto out1;

The lockup seems to be a general problem with the cleanup code. Even if
I use this as advertised, i.e. for a simple bind mount.

(I was suspicious that being able to pass around detached trees as an
FD, and re-attach them in any namespace, allows leaking memory by
creating a namespace loop.  I.e. maybe it gives you enough rope to skip
the test in mnt_ns_loop().  But I didn't get that far).

I converted test-fsmount.c for my own purposes:

diff --git a/samples/vfs/test-fsmount.c b/samples/vfs/test-fsmount.c
index 74124025ade0..da6e3fbf0513 100644
--- a/samples/vfs/test-fsmount.c
+++ b/samples/vfs/test-fsmount.c
@@ -83,6 +83,11 @@ static inline int move_mount(int from_dfd, const char *from_pathname,
to_dfd, to_pathname, flags);
}

+static inline int open_tree(int dfd, const char *pathname, unsigned flags)
+{
+ return syscall(__NR_open_tree, dfd, pathname, flags);
+}
+
#define E_fsconfig(fd, cmd, key, val, aux) \
do { \
if (fsconfig(fd, cmd, key, val, aux) == -1) \
@@ -93,6 +98,7 @@ int main(int argc, char *argv[])
{
int fsfd, mfd;

+#if 0
/* Mount a publically available AFS filesystem */
fsfd = fsopen("afs", 0);
if (fsfd == -1) {
@@ -115,4 +121,9 @@ int main(int argc, char *argv[])

E(close(mfd));
exit(0);
+#endif
+
+ E( mfd = open_tree(-1, "/mnt", OPEN_TREE_CLONE) );
+ E( fchdir(mfd) );
+ E( execl("/bin/bash", "/bin/bash", NULL) );
}

If I close() the mount FD "mfd", and then do "mount --move . /mnt", my
printk() shows MNT_UMOUNT has been set. (I guess fchdir() works more
like openat(..., O_PATH) than dup().) Then unmounting /mnt hangs, as I
would expect from my previous test.

If I instead do the mount+unmount first, and close the FD as a second
step, I think there's a lockup in the close().  The lockup happens in
the same place as the unmount lockup from before. (Except there's a line
"Code: Bad RIP value", I don't know why that happens).

# unshare --mount
# test-fsmount
# mount --move . /mnt
[ 270.859542] umount=0 mnt_flags=20

Check the flags are still the same:

# mount --move /mnt /mnt
mount: /mnt: mount(2) system call failed: Too many levels of symbolic links.
[ 313.737030] umount=0 mnt_flags=20

Clean up the bind mount, and then the inherited mount FD.

# cd
# umount /mnt
# exit

[ 351.898629] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [bash:1483]
[ 351.899841] Modules linked in: xt_CHECKSUM(E) ipt_MASQUERADE(E) tun(E) bridge(E) stp(E) llc(E) ip6t_rpfilter(E) ip6t_REJECT(E) nf_reject_ipv6(E) xt_conntrack(E) ip6table_nat(E) nf_nat_ipv6(E) devlink(E) ip6table_mangle(E) ip6table_raw(E) ip6table_security(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) nf_defrag_ipv4(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ip6table_filter(E) ip6_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hwdep(E) snd_hda_core(E) snd_seq(E) snd_seq_device(E) snd_pcm(E) joydev(E) crc32_pclmul(E) snd_timer(E) ghash_clmulni_intel(E) snd(E) crct10dif_pclmul(E) virtio_balloon(E) serio_raw(E) soundcore(E) crc32c_intel(E) qxl(E) drm_kms_helper(E) virtio_console(E) ttm(E) virtio_net(E) net_failover(E)
[ 351.912077] failover(E) drm(E) qemu_fw_cfg(E) pata_acpi(E) ata_generic(E)
[ 351.912888] CPU: 0 PID: 1483 Comm: bash Tainted: G E 4.19.0-rc3+ #7
[ 351.914221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[ 351.916582] RIP: 0010:pin_kill+0x128/0x140
[ 351.917369] Code: f2 5a 00 48 8b 44 24 20 48 39 c5 0f 84 6f ff ff ff 48 89 df e8 e9 4a 5b 00 8b 43 18 85 c0 7e b3 c6 03 00 fb 66 0f 1f 44 00 00 <e9> 51 ff ff ff e8 be 11 dd ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00
[ 351.920729] RSP: 0018:ffffa1b381be3d88 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 351.921801] RAX: 0000000000000000 RBX: ffff909cf2ea68b0 RCX: dead000000000200
[ 351.922807] RDX: 0000000000000001 RSI: ffffa1b381be3d28 RDI: ffff909cf2ea68b0
[ 351.923811] RBP: ffffa1b381be3da8 R08: ffff909d59621760 R09: 0000000000000000
[ 351.924813] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010000000
[ 351.925818] R13: ffff909cf5db9a38 R14: ffff909cf2ea67a0 R15: ffff909cedc07300
[ 351.926824] FS: 00007f1eb90ac740(0000) GS:ffff909d59600000(0000) knlGS:0000000000000000
[ 351.927957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 351.928772] CR2: 00007f1eabedb180 CR3: 000000000f20a003 CR4: 00000000003606f0
[ 351.929779] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 351.930785] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 351.931791] Call Trace:
[ 351.932160] ? finish_wait+0x80/0x80
[ 351.932684] group_pin_kill+0x1a/0x30
[ 351.933207] namespace_unlock+0x6f/0x80
[ 351.933766] __fput+0x239/0x240
[ 351.934217] task_work_run+0x84/0xa0
[ 351.934743] do_exit+0x2d3/0xae0
[ 351.935206] ? __do_page_fault+0x263/0x4e0
[ 351.935799] do_group_exit+0x3a/0xa0
[ 351.936307] __x64_sys_exit_group+0x14/0x20
[ 351.936911] do_syscall_64+0x5b/0x160
[ 351.937436] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 351.938164] RIP: 0033:0x7f1eb877adb6
[ 351.938688] Code: Bad RIP value.
[ 351.939149] RSP: 002b:00007ffd56e019d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 351.940216] RAX: ffffffffffffffda RBX: 00007f1eb8a69740 RCX: 00007f1eb877adb6
[ 351.941222] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[ 351.942229] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff80
[ 351.943236] R10: 00007ffd56e0188a R11: 0000000000000246 R12: 00007f1eb8a69740
[ 351.944242] R13: 0000000000000001 R14: 00007f1eb8a72708 R15: 0000000000000000



2018-10-07 19:22:14

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 07/10/2018 11:48, Alan Jenkins wrote:
> On 05/10/2018 19:24, Alan Jenkins wrote:
>> On 21/09/2018 17:30, David Howells wrote:
>>> From: Al Viro <[email protected]>
>>>
>>> Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
>>> attached by move_mount(2).
>>>
>>> If by the time of final fput() of OPEN_TREE_CLONE-opened file its
>>> tree is
>>> not detached anymore, it won't be dissolved.  move_mount(2) is adjusted
>>> to handle detached source.
>>>
>>> That gives us equivalents of mount --bind and mount --rbind.
>>>
>>> Signed-off-by: Al Viro <[email protected]>
>>> Signed-off-by: David Howells <[email protected]>
>>> ---
>>>
>>>   fs/namespace.c |   26 ++++++++++++++++++++------
>>>   1 file changed, 20 insertions(+), 6 deletions(-)

>>> The lockup seems to be a general problem with the cleanup code. Even
>>> if I use this as advertised, i.e. for a simple bind mount.

Ah, I see.  The problem is you were expecting me to use the FD from
open_tree() directly.  But I did fchdir() into the FD, and then "mount
--move . /mnt" :-).

If I use the FD directly, it avoids the hang.  I used two separate C
programs (attached, to avoid my MUA damage)...

> (I was suspicious that being able to pass around detached trees as an
> FD, and re-attach them in any namespace, allows leaking memory by
> creating a namespace loop.  I.e. maybe it gives you enough rope to
> skip the test in mnt_ns_loop().

...so here's the memory leak.

# open_tree --help
usage: open_tree 3</source/path FD_NUMBER COMMAND...
# move_mount --help
usage: move_mount 3</from/path 4</to/path

Create a child namespace:

# mount --make-shared /tmp
# cd /tmp
# mkdir private_mnt
# mount -t tmpfs tmp private_mnt
# mount --make-private private_mnt
# touch private_mnt/child_ns
# unshare --mount=private_mnt/child_ns --propagation=shared ls -l /proc/self/ns/mnt
lrwxrwxrwx. 1 root root 0 Oct 7 19:23 /proc/self/ns/mnt -> 'mnt:[4026532334]'
# findmnt | grep /tmp
├─/tmp                          tmpfs                  tmpfs rw,nosuid,nodev,seclabel,size=1247640k,nr_inodes=311910
│ └─/tmp/private_mnt            tmp                    tmpfs rw,relatime,seclabel,uid=1000,gid=1000
│   └─/tmp/private_mnt/child_ns nsfs[mnt:[4026532334]] nsfs  rw,seclabel


Create a reference cycle:

# ~/test-open_tree 3</tmp/private_mnt 3 \
nsenter --mount=/tmp/private_mnt/child_ns \
sh -c '~/test-move_mount 4</mnt'

Attach 10MB of memory to the cycle:

# grep Shmem: /proc/meminfo
Shmem: 1464 kB
# dd if=/dev/zero of=/tmp/private_mnt/bigfile bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.00976358 s, 1.1 GB/s
# grep Shmem: /proc/meminfo
Shmem: 11704 kB

Detach the cycle, and leak all the memory:

# umount -l /tmp/private_mnt/
# grep Shmem: /proc/meminfo
Shmem: 11704 kB
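
The before/after comparison of Shmem above can be automated. The following is a minimal sketch, not part of the attached sample programs: parse_shmem_kb() is a hypothetical helper that pulls the Shmem value out of a /proc/meminfo-style buffer, so that a test can check whether the value drops back after the lazy unmount.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the Shmem value in kB from a /proc/meminfo-style text buffer,
 * or -1 if there is no "Shmem:" line.  Hypothetical helper for
 * illustration only.
 */
long parse_shmem_kb(const char *meminfo)
{
	const char *line = meminfo;
	long kb;

	assert(meminfo != NULL);
	while (line && *line) {
		/* Only matches at the start of a line, so "ShmemHugePages:" is skipped. */
		if (sscanf(line, "Shmem: %ld kB", &kb) == 1)
			return kb;
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

In the run above, the value is 11704 kB both before and after "umount -l", which is what indicates the leak.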


Attachments:
vfs_samples.diff (3.63 kB)

2018-10-10 11:57:11

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> I replied last time to wonder about the MNT_UMOUNT mnt_flag. So I've tested it
> now :-), on David's current tree (commit 5581f4935add).
>
> The modified do_move_mount() allows re-attaching something that was
> lazy-unmounted. But the lazy unmount sets MNT_UMOUNT. And this flag is not
> cleared when the mount is re-attached.

Sorry, yes. I'm not sure what the best way to deal with this is. Should it
just return -EPERM or -ESTALE if MNT_UMOUNT is set?

David

2018-10-10 12:32:38

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> # mount --move . /mnt

Is this calling move_mount(2) on your system?

David

2018-10-10 12:36:57

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])

Do you want to update that and I can take them into my patchset?

David

2018-10-10 12:40:22

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 10/10/2018 13:31, David Howells wrote:
> Alan Jenkins <[email protected]> wrote:
>
>> # mount --move . /mnt
> is this calling move_mount(2) on your system?
>
> David

No. That was an unpatched mount program, from util-linux.

Alan


2018-10-10 12:50:47

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> + pr_info("mnt_flags=%x umount=%x\n",
> + (unsigned) old->mnt.mnt_flags,
> + (unsigned) !!(old->mnt.mnt_flags & MNT_UMOUNT);
> +

Note that this doesn't actually compile, for want of a bracket.

David

2018-10-10 13:03:34

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

The attached change seems to fix the lazy-umount problem.

David
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 5adeeea2a4d9..d43f0fa152e9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2472,7 +2472,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	if (old->mnt_ns && !attached)
 		goto out1;

-	if (old->mnt.mnt_flags & MNT_LOCKED)
+	if (old->mnt.mnt_flags & (MNT_LOCKED | MNT_UMOUNT))
 		goto out1;

 	if (old_path->dentry != old_path->mnt->mnt_root)
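
The effect of this one-line change can be shown with a small userspace sketch. The function names below are hypothetical, and the flag values are illustrative only; the authoritative definitions live in include/linux/mount.h. The point is just that a mount carrying MNT_UMOUNT passed the old test and is refused by the new one.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; see include/linux/mount.h for the real ones. */
#define MNT_LOCKED 0x0800000u
#define MNT_UMOUNT 0x8000000u

/* Old test: only locked mounts are refused. */
bool may_move_old(unsigned int mnt_flags)
{
	return !(mnt_flags & MNT_LOCKED);
}

/* New test: mounts that have been lazily unmounted are refused as well. */
bool may_move_new(unsigned int mnt_flags)
{
	return !(mnt_flags & (MNT_LOCKED | MNT_UMOUNT));
}
```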

2018-10-10 13:08:25

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 10/10/2018 14:02, David Howells wrote:
> The attached change seems to fix the lazy-umount problem.
>
> David
> ---
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 5adeeea2a4d9..d43f0fa152e9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2472,7 +2472,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> if (old->mnt_ns && !attached)
> goto out1;
>
> - if (old->mnt.mnt_flags & MNT_LOCKED)
> + if (old->mnt.mnt_flags & (MNT_LOCKED | MNT_UMOUNT))
> goto out1;
>
> if (old_path->dentry != old_path->mnt->mnt_root)


I can't test any more at the moment, as my laptop died today :). But I
have no objection to this.

It would be more fun if there was a way to support it :), but I don't
have a genuine reason to want it.  And you couldn't use it for fully
general purposes anyway, because umount2( , MNT_DETACH) is defined as
separating all the child mounts.

P.S. Regarding the issue with the namespace loop.  My strawman solution
would be for graft_tree() to silently detach any NS file mounts that
have a sequence number less than or equal to the namespace they are
being mounted into.

2018-10-11 10:27:47

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> # unshare --mount=private_mnt/child_ns --propagation=shared ls -l /proc/self/ns/mnt

I think the problem is that the mount of the nsfs object done by unshare here
pins the new mount namespace - but doesn't add the namespace's contents into
the mount tree, so the mount struct cycle-detection code is bypassed.

I think it's fine for all other namespaces, just not the mount namespace.

It looks like this bug might theoretically exist upstream also, though I don't
think there's any way to actually effect it given that mount() doesn't take a
dirfd argument.

The reason that you can do this with open_tree()/move_mount() is that it
allows you to create a mount tree (OPEN_TREE_CLONE) that has no namespace
assignment, pass it through the namespace switch and then attach it inside the
child namespace. The cross-namespace checks in do_move_mount() are bypassed
because the root of the newly-cloned mount tree doesn't have one.

Unfortunately, just searching the newly-cloned mount tree for a conflicting
nsfs mount doesn't help because the potential loop could be hidden several
levels deep.

I think the simplest solution is to either reject a request for
open_tree(OPEN_TREE_CLONE) if there are any nsfs objects in the source tree,
or to just not copy said objects.

David
---

Test script:

mount -t tmpfs none /a
mount --make-shared /a
cd /a
mkdir private_mnt
mount -t tmpfs xxx private_mnt
mount --make-private private_mnt
touch private_mnt/child_ns
unshare --mount=private_mnt/child_ns --propagation=shared \
ls -l /proc/self/ns/mnt
findmnt

~/open_tree 3</a/private_mnt 3 \
nsenter --mount=/a/private_mnt/child_ns \
sh -c '~/move_mount 4</mnt'

grep Shmem: /proc/meminfo
dd if=/dev/zero of=/a/private_mnt/bigfile bs=1M count=10

umount -l /a/private_mnt/
grep Shmem: /proc/meminfo

2018-10-11 12:23:14

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

David Howells <[email protected]> wrote:

> The reason that you can do this with open_tree()/move_mount() is that it
> allows you to create a mount tree (OPEN_TREE_CLONE) that has no namespace
> assignment, pass it through the namespace switch and then attach it inside the
> child namespace. The cross-namespace checks in do_move_mount() are bypassed
> because the root of the newly-cloned mount tree doesn't have one.

It's worse than that. The apparently disconnected tree given you by
open_tree(OPEN_TREE_CLONE) is still subject to modification by outside
forces. All it takes is one shared object within that tree.

So I do wonder if it's possible to form a ring, even in an upstream kernel, by
using the propagation mechanism to push through an nsfs mount into itself,
possibly with a layer of indirection (ie. having two mutually-referential
namespaces).

David

2018-10-11 12:25:50

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 11/10/2018 13:14, David Howells wrote:
> David Howells <[email protected]> wrote:
>
>> The reason that you can do this with open_tree()/move_mount() is that it
>> allows you to create a mount tree (OPEN_TREE_CLONE) that has no namespace
>> assignment, pass it through the namespace switch and then attach it inside the
>> child namespace. The cross-namespace checks in do_move_mount() are bypassed
>> because the root of the newly-cloned mount tree doesn't have one.
> It's worse than that. The apparently disconnected tree given you by
> open_tree(OPEN_TREE_CLONE) is still subject to modification by outside
> forces. All it takes is one shared object within that tree.
>
> So I do wonder if it's possible to form a ring, even in an upstream kernel, by
> using the propagation mechanism to push through an nsfs mount into itself,
> possibly with a layer of indirection (ie. having two mutually-referential
> namespaces).
>
> David

Upstream does cover the mount propagation case, by simply never
propagating mounts of mount NS files.  See commit 4ce5d2b1a8fd "vfs:
Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces" /
https://unix.stackexchange.com/questions/473717/what-code-prevents-mount-namespace-loops-in-a-more-complex-case-involving-mount-propagation




2018-10-11 12:44:07

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 11/10/2018 10:17, David Howells wrote:
> Alan Jenkins <[email protected]> wrote:
>
>> # unshare --mount=private_mnt/child_ns --propagation=shared ls -l /proc/self/ns/mnt
> I think the problem is that the mount of the nsfs object done by unshare here
> pins the new mount namespace - but doesn't add the namespace's contents into
> the mount tree, so the mount struct cycle-detection code is bypassed.
>
> I think it's fine for all other namespaces, just not the mount namespace.
>
> It looks like this bug might theoretically exist upstream also, though I don't
> think there's any way to actually effect it given that mount() doesn't take a
> dirfd argument.
>
> The reason that you can do this with open_tree()/move_mount() is that it
> allows you to create a mount tree (OPEN_TREE_CLONE) that has no namespace
> assignment, pass it through the namespace switch and then attach it inside the
> child namespace. The cross-namespace checks in do_move_mount() are bypassed
> because the root of the newly-cloned mount tree doesn't have one.
>
> Unfortunately, just searching the newly-cloned mount tree for a conflicting
> nsfs mount doesn't help because the potential loop could be hidden several
> levels deep.
>
> I think the simplest solution is to either reject a request for
> open_tree(OPEN_TREE_CLONE) if there are any nsfs objects in the source tree,
> or to just not copy said objects.
>
> David

Very clearly written, thank you.  Hum, your solution would mean
open_tree(OPEN_TREE_CLONE) + move_mount() is not equivalent to the
current `mount --rbind` :-(.  That does not fit the current patch
description.

It sounds like you're under-estimating how we can use mnt_ns->seq (as is
currently used in mnt_ns_loop()).  Or maybe I am over-estimating it :).

In principle, it should suffice for attach_recursive_mount() to check
the NS sequence numbers of the NS files which are mounted. You can't
hide the loop at a deeper level inside the NS, because of the existing
mnt_ns_loop() check.

I think mnt_ns_loop() works 100% correctly upstream, and there is no
memory leak bug there.  You can pass a mount NS fd between processes in
arbitrary namespaces, and you can mount it with "mount --no-canonicalize
--bind /proc/self/fd/3 /other_ns".  But mnt_ns_loop() will only allow
the mount when the other NS is newer than your own mount namespace.
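
As a rough illustration of the sequence-number rule just described, here is a toy userspace model. It assumes only what is stated above: each mount namespace has a sequence number assigned at creation, and a mount-NS file may be mounted only if it refers to a strictly newer namespace than the caller's. The struct and function names are hypothetical stand-ins, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a mount namespace; only the sequence number matters. */
struct toy_mnt_ns {
	uint64_t seq;	/* assigned from a global counter at creation time */
};

/*
 * Would mounting a file that refers to @target inside @current risk a
 * reference loop?  Only strictly newer namespaces are allowed, so
 * references can only point from older to newer namespaces and can
 * never close a cycle.
 */
bool toy_mnt_ns_loop(const struct toy_mnt_ns *current,
		     const struct toy_mnt_ns *target)
{
	assert(current && target);
	return current->seq >= target->seq;
}
```

With a parent namespace of seq 1 and a child of seq 2, mounting the child's NS file in the parent is allowed, while mounting the parent's (or the child's own) NS file in the child is refused.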

Upstream also covers mount propagation (and CLONE_NEWNS), by simply not
propagating mounts of mount NS files.  ( See commit 4ce5d2b1a8fd "vfs:
Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces" /
https://unix.stackexchange.com/questions/473717/what-code-prevents-mount-namespace-loops-in-a-more-complex-case-involving-mount-propagation
)

I think it is more a question of taste :-).  Would it be acceptable to
prune the tree (or fail?) in move_mount() (and also `mount --move`, if
you [ab]use it like I did) ?

I suspect we should prefer your solution.  It is clearly simpler, and I
don't know that anyone really uses `mount --rbind` to clone trees of
mount NS files.

Either way, I suggest we take care to say whether `mount --rbind` and
`mount --bind` can be implemented using open_tree() + move_mount(), or
whether we think it might be undesirable.  (E.g. because someone might
read the current commit message, and desire to implement `mount
--bind,ro` atomically, if/when we also have mount_setattr() ).

Regards

Alan


> ---
>
> Test script:
>
> mount -t tmpfs none /a
> mount --make-shared /a
> cd /a
> mkdir private_mnt
> mount -t tmpfs xxx private_mnt
> mount --make-private private_mnt
> touch private_mnt/child_ns
> unshare --mount=private_mnt/child_ns --propagation=shared \
> ls -l /proc/self/ns/mnt
> findmnt
>
> ~/open_tree 3</a/private_mnt 3 \
> nsenter --mount=/a/private_mnt/child_ns \
> sh -c '~/move_mount 4</mnt'
>
> grep Shmem: /proc/meminfo
> dd if=/dev/zero of=/a/private_mnt/bigfile bs=1M count=10
>
> umount -l /a/private_mnt/
> grep Shmem: /proc/meminfo

2018-10-11 13:11:31

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> It sounds like you're under-estimating how we can use mnt_ns->seq (as is
> currently used in mnt_ns_loop()).  Or maybe I am over-estimating it :).

I don't see how it helps. The duplication and attachment of the nsfs object
is already done by open_tree(), but as it's a detached tree, there are no
namespace assignments on the objects therein. move_mount() is attaching the
subtree as a whole.

I modified my example to put everything under /a, setting up initially on /a/x
and then moving to /a/y within the namespace. Then I made it print the mount
tree in more places. So after setup, I see:

[root@andromeda x]# findmnt -R /a
TARGET                            SOURCE
/a                                none
\_/a/x                            none
  \_/a/x/private_mnt              xxx
    \_/a/x/private_mnt/child_ns   nsfs[mnt:[4026532272]]

this looks fine. Then I do:

~/open_tree 3</a/x/private_mnt 3 \
nsenter --mount=/a/x/private_mnt/child_ns \
sh -c 'findmnt -R /a; ~/move_mount 4</a/y; findmnt -R /a'

and I see:

TARGET                  SOURCE
/a                      none
\_/a/x                  none
  \_/a/x/private_mnt    xxx
TARGET                  SOURCE
/a                      none
|_/a/x                  none
| \_/a/x/private_mnt    xxx
\_/a/y                  xxx
  \_/a/y/child_ns       nsfs[mnt:[4026532272]]

in which /a/x/private_mnt got cloned and the clone mounted on "/a/y".

So, you're right, it's nothing to do with propagation. But I'm not sure how I
check this. Reject it in move_mount() if there's an nsfs? I'm not sure if
the seq number is actually useful here.

David

2018-10-11 17:47:46

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Okay, this appears to fix the cycle-creation problem.

It could probably be improved by comparing sequence numbers as Alan suggests,
but I need to work out how to get at that.

David
---
commit 069c3376f7849044117c866aeafbb1a525f84926
Author: David Howells <[email protected]>
Date: Thu Oct 4 23:18:59 2018 +0100

fixes

diff --git a/fs/internal.h b/fs/internal.h
index 17029b30e196..47a6c80c3c51 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -172,6 +172,7 @@ extern void mnt_pin_kill(struct mount *m);
  * fs/nsfs.c
  */
 extern const struct dentry_operations ns_dentry_operations;
+extern struct file_system_type nsfs;

 /*
  * fs/ioctl.c
diff --git a/fs/namespace.c b/fs/namespace.c
index e969ded7d54b..25ecd8b3c76b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2388,6 +2388,27 @@ static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }

+/*
+ * Object if there are any nsfs mounts in the specified subtree. These can act
+ * as pins for mount namespaces that aren't checked by the mount-cycle checking
+ * code, thereby allowing cycles to be made.
+ */
+static bool check_for_nsfs_mounts(struct mount *subtree)
+{
+	struct mount *p;
+	bool ret = false;
+
+	lock_mount_hash();
+	for (p = subtree; p; p = next_mnt(p, subtree))
+		if (p->mnt.mnt_sb->s_type == &nsfs)
+			goto out;
+
+	ret = true;
+out:
+	unlock_mount_hash();
+	return ret;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct path parent_path = {.mnt = NULL, .dentry = NULL};
@@ -2442,6 +2463,8 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	if (IS_MNT_SHARED(p) && tree_contains_unbindable(old))
 		goto out1;
 	err = -ELOOP;
+	if (!check_for_nsfs_mounts(old))
+		goto out1;
 	for (; mnt_has_parent(p); p = p->mnt_parent)
 		if (p == old)
 			goto out1;
diff --git a/fs/nsfs.c b/fs/nsfs.c
index f069eb6495b0..d3abcd5c2a23 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -269,7 +269,7 @@ static struct dentry *nsfs_mount(struct file_system_type *fs_type,
 	return mount_pseudo(fs_type, "nsfs:", &nsfs_ops,
 			&ns_dentry_operations, NSFS_MAGIC);
 }
-static struct file_system_type nsfs = {
+struct file_system_type nsfs = {
 	.name = "nsfs",
 	.mount = nsfs_mount,
 	.kill_sb = kill_anon_super,

2018-10-11 19:38:31

by Eric W. Biederman

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

David Howells <[email protected]> writes:

> Okay, this appears to fix the cycle-creation problem.
>
> It could probably be improved by comparing sequence numbers as Alan suggests,
> but I need to work out how to get at that.

It should just be a matter of replacing the test
"if (p->mnt.mnt_sb->s_type == &nsfs)" with "if (mnt_ns_loop(p->mnt.mnt_root))".

That would allow reusing 100% of the existing logic, and remove the need
to export file_system_type nsfs;

As it stands, your test below rejects a lot more than mount namespace
file descriptors: it rejects file descriptors for every other
namespace as well.
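
To make the difference concrete, here is a toy userspace model of the two checks. Everything in it is a hypothetical stand-in (a flat array instead of a mount tree, an enum instead of real superblock types); it only sketches why a filesystem-type test is stricter than a sequence-number test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_kind { TOY_REGULAR, TOY_NSFS_MNT, TOY_NSFS_OTHER };

struct toy_mount {
	enum toy_kind kind;
	uint64_t ns_seq;	/* only meaningful for TOY_NSFS_MNT */
};

/* Type-based variant: refuse the subtree if it holds any nsfs mount at all. */
bool toy_allow_no_nsfs(const struct toy_mount *tree, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tree[i].kind != TOY_REGULAR)
			return false;
	return true;
}

/*
 * Sequence-number variant: refuse only mount-namespace files that are
 * not newer than the caller's namespace (@cur_seq).
 */
bool toy_allow_by_seq(const struct toy_mount *tree, size_t n, uint64_t cur_seq)
{
	for (size_t i = 0; i < n; i++)
		if (tree[i].kind == TOY_NSFS_MNT && cur_seq >= tree[i].ns_seq)
			return false;
	return true;
}
```

A subtree containing, say, a bind-mounted network namespace file is refused by the first variant but accepted by the second; a subtree containing a file for the caller's own mount namespace is refused by both.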

Eric

> ---
> commit 069c3376f7849044117c866aeafbb1a525f84926
> Author: David Howells <[email protected]>
> Date: Thu Oct 4 23:18:59 2018 +0100
>
> fixes
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 17029b30e196..47a6c80c3c51 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -172,6 +172,7 @@ extern void mnt_pin_kill(struct mount *m);
> * fs/nsfs.c
> */
> extern const struct dentry_operations ns_dentry_operations;
> +extern struct file_system_type nsfs;
>
> /*
> * fs/ioctl.c
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e969ded7d54b..25ecd8b3c76b 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2388,6 +2388,27 @@ static inline int tree_contains_unbindable(struct mount *mnt)
> return 0;
> }
>
> +/*
> + * Object if there are any nsfs mounts in the specified subtree. These can act
> + * as pins for mount namespaces that aren't checked by the mount-cycle checking
> + * code, thereby allowing cycles to be made.
> + */
> +static bool check_for_nsfs_mounts(struct mount *subtree)
> +{
> + struct mount *p;
> + bool ret = false;
> +
> + lock_mount_hash();
> + for (p = subtree; p; p = next_mnt(p, subtree))
> + if (p->mnt.mnt_sb->s_type == &nsfs)
> + goto out;
> +
> + ret = true;
> +out:
> + unlock_mount_hash();
> + return ret;
> +}
> +
> static int do_move_mount(struct path *old_path, struct path *new_path)
> {
> struct path parent_path = {.mnt = NULL, .dentry = NULL};
> @@ -2442,6 +2463,8 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> if (IS_MNT_SHARED(p) && tree_contains_unbindable(old))
> goto out1;
> err = -ELOOP;
> + if (!check_for_nsfs_mounts(old))
> + goto out1;
> for (; mnt_has_parent(p); p = p->mnt_parent)
> if (p == old)
> goto out1;
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index f069eb6495b0..d3abcd5c2a23 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -269,7 +269,7 @@ static struct dentry *nsfs_mount(struct file_system_type *fs_type,
> return mount_pseudo(fs_type, "nsfs:", &nsfs_ops,
> &ns_dentry_operations, NSFS_MAGIC);
> }
> -static struct file_system_type nsfs = {
> +struct file_system_type nsfs = {
> .name = "nsfs",
> .mount = nsfs_mount,
> .kill_sb = kill_anon_super,

2018-10-11 20:18:37

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Eric W. Biederman <[email protected]> wrote:

> It should just be a matter of replacing the test
> "if (p->mnt.mnt_sb->s_type == &nsfs)" with "if (mnt_ns_loop(p->mnt.mnt_root))".

Okay, the attached seems to work.

Thanks,
David
---
diff --git a/fs/namespace.c b/fs/namespace.c
index e969ded7d54b..5548fb9b7de2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2388,6 +2388,27 @@ static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }

+/*
+ * Object if there are any nsfs mounts in the specified subtree. These can act
+ * as pins for mount namespaces that aren't checked by the mount-cycle checking
+ * code, thereby allowing cycles to be made.
+ */
+static bool check_for_nsfs_mounts(struct mount *subtree)
+{
+	struct mount *p;
+	bool ret = false;
+
+	lock_mount_hash();
+	for (p = subtree; p; p = next_mnt(p, subtree))
+		if (mnt_ns_loop(p->mnt.mnt_root))
+			goto out;
+
+	ret = true;
+out:
+	unlock_mount_hash();
+	return ret;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct path parent_path = {.mnt = NULL, .dentry = NULL};
@@ -2442,6 +2463,8 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	if (IS_MNT_SHARED(p) && tree_contains_unbindable(old))
 		goto out1;
 	err = -ELOOP;
+	if (!check_for_nsfs_mounts(old))
+		goto out1;
 	for (; mnt_has_parent(p); p = p->mnt_parent)
 		if (p == old)
 			goto out1;

2018-10-12 14:22:48

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 10/10/2018 13:36, David Howells wrote:
> Alan Jenkins <[email protected]> wrote:
>
>> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
>> + * Written by David Howells ([email protected])
> Do you want to update that and I can take them into my patchset?
>
> David


Sure :).  I've attached a slightly updated version.

Thanks

Alan


Attachments:
0001-vfs-tiny-sample-programs-for-open_tree-and-move_moun.patch (5.00 kB)

2018-10-12 14:51:51

by Alan Jenkins

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

On 21/09/2018 17:34, David Howells wrote:
> Provide an fspick() system call that can be used to pick an existing
> mountpoint into an fs_context which can thereafter be used to reconfigure a
> superblock (equivalent of the superblock side of -o remount).
>
> This looks like:
>
> int fd = fspick(AT_FDCWD, "/mnt",
> FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
> fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
> fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
> fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
>
> At the point of fspick being called, the file descriptor referring to the
> filesystem context is in exactly the same state as the one that was created
> by fsopen() after fsmount() has been successfully called.
>
> Signed-off-by: David Howells <[email protected]>
> cc: [email protected]
> ---
>
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> fs/fsopen.c | 54 ++++++++++++++++++++++++++++++++
> include/linux/syscalls.h | 1 +
> include/uapi/linux/fs.h | 5 +++
> 5 files changed, 62 insertions(+)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index c78b68256f8a..d1eb6c815790 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -403,3 +403,4 @@
> 389 i386 fsopen sys_fsopen __ia32_sys_fsopen
> 390 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
> 391 i386 fsmount sys_fsmount __ia32_sys_fsmount
> +392 i386 fspick sys_fspick __ia32_sys_fspick
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index d44ead5d4368..d3ab703c02bb 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -348,6 +348,7 @@
> 337 common fsopen __x64_sys_fsopen
> 338 common fsconfig __x64_sys_fsconfig
> 339 common fsmount __x64_sys_fsmount
> +340 common fspick __x64_sys_fspick
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/fsopen.c b/fs/fsopen.c
> index 5955a6b65596..9ead9220e2cb 100644
> --- a/fs/fsopen.c
> +++ b/fs/fsopen.c
> @@ -155,6 +155,60 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
> return ret;
> }
>
> +/*
> + * Pick a superblock into a context for reconfiguration.
> + */
> +SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
> +{
> + struct fs_context *fc;
> + struct path target;
> + unsigned int lookup_flags;
> + int ret;
> +
> + if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
> + return -EPERM;


This seems to accept basically any mount.  Specifically: are you sure
it's OK to return a handle to a SB_NOUSER superblock?

# strace -f -v -e trace=154 \
./fspick 3</proc/self/ns/mnt 3 \
stat -f /dev/fd/3

syscall_0x154(0x3, 0x4009a1, 0x8, ...) = 0x4
File: "/dev/fd/3"
ID: 0 Namelen: 255 Type: anon-inode FS
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 0 Free: 0 Available: 0
Inodes: Total: 0 Free: 0
+++ exited with 0 +++



2018-10-12 14:56:22

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> + open_tree_clone \
> + move_mount \

I'll rename them to test-XXX if you're okay with that.

David

2018-10-12 14:58:07

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 12/10/2018 15:54, David Howells wrote:
> Alan Jenkins <[email protected]> wrote:
>
>> + open_tree_clone \
>> + move_mount \
> I'll rename them to test-XXX if you're okay with that.
>
> David


Yes, that's fine.

Feel free to make adaptations you like.  I don't have anything planned
for them myself, outside of testing the patch series.

Alan


2018-10-13 06:07:28

by Al Viro

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On Thu, Oct 11, 2018 at 09:17:54PM +0100, David Howells wrote:
> +/*
> + * Object if there are any nsfs mounts in the specified subtree. These can act
> + * as pins for mount namespaces that aren't checked by the mount-cycle checking
> + * code, thereby allowing cycles to be made.
> + */
> +static bool check_for_nsfs_mounts(struct mount *subtree)
> +{
> + struct mount *p;
> + bool ret = false;
> +
> + lock_mount_hash();
> + for (p = subtree; p; p = next_mnt(p, subtree))
> + if (mnt_ns_loop(p->mnt.mnt_root))
> + goto out;
> +
> + ret = true;
> +out:
> + unlock_mount_hash();
> + return ret;
> +}

Umm... The comment doesn't match the behaviour - you are
accepting references to later namespaces. Behaviour is
not a problem, the comment is.

2018-10-13 06:12:19

by Al Viro

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

On Fri, Oct 12, 2018 at 03:49:50PM +0100, Alan Jenkins wrote:
> > +SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
> > +{
> > + struct fs_context *fc;
> > + struct path target;
> > + unsigned int lookup_flags;
> > + int ret;
> > +
> > + if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
> > + return -EPERM;
>
>
> > This seems to accept basically any mount.  Specifically: are you sure it's
> > OK to return a handle to a SB_NOUSER superblock?

Umm... As long as we don't try to do pathname resolution from its ->s_root,
shouldn't be a problem and I don't see anything that would do that. I might've
missed something, but...

2018-10-13 09:45:47

by Alan Jenkins

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

On 13/10/2018 07:11, Al Viro wrote:
> On Fri, Oct 12, 2018 at 03:49:50PM +0100, Alan Jenkins wrote:
>>> +SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
>>> +{
>>> + struct fs_context *fc;
>>> + struct path target;
>>> + unsigned int lookup_flags;
>>> + int ret;
>>> +
>>> + if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
>>> + return -EPERM;
>>
>> This seems to accept basically any mount.  Specifically: are you sure it's
>> OK to return a handle to a SB_NOUSER superblock?
> Umm... As long as we don't try to do pathname resolution from its ->s_root,
> shouldn't be a problem and I don't see anything that would do that. I might've
> missed something, but...

Sorry, I guess SB_NOUSER was the wrong word.  I was trying to find if
anything stopped things like

int memfd = memfd_create("foo", 0);
int fsfd = fspick(memfd, "", FSPICK_EMPTY_PATH);

fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
fsconfig(fsfd, FSCONFIG_SET_STRING, "size", "100M", 0);
fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);

So far I'm getting -EBUSY if I try to apply the "ro", -EINVAL if I try
to apply the "size=100M".  But if I don't apply either, then
FSCONFIG_CMD_RECONFIGURE succeeds.

It seems worrying that it might let me set options on shm_mnt. Or at
least letting me get as far as the -EBUSY check for the "ro" superblock
flag.

I'm not sure why I'm getting the -EINVAL setting the "size" option.  But
it would be much more reassuring if I was getting -EPERM :-).

Alan


2018-10-13 23:05:07

by Andy Lutomirski

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

On Sat, Oct 13, 2018 at 2:45 AM Alan Jenkins
<[email protected]> wrote:
>
> On 13/10/2018 07:11, Al Viro wrote:
> > On Fri, Oct 12, 2018 at 03:49:50PM +0100, Alan Jenkins wrote:
> >>> +SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
> >>> +{
> >>> + struct fs_context *fc;
> >>> + struct path target;
> >>> + unsigned int lookup_flags;
> >>> + int ret;
> >>> +
> >>> + if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
> >>> + return -EPERM;
> >>
> >> This seems to accept basically any mount. Specifically: are you sure it's
> >> OK to return a handle to a SB_NO_USER superblock?
> > Umm... As long as we don't try to do pathname resolution from its ->s_root,
> > shouldn't be a problem and I don't see anything that would do that. I might've
> > missed something, but...
>
> Sorry, I guess SB_NOUSER was the wrong word. I was trying to find if
> anything stopped things like
>
> int memfd = memfd_create("foo", 0);
> int fsfd = fspick(memfd, "", FSPICK_EMPTY_PATH);
>
> fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> fsconfig(fsfd, FSCONFIG_SET_STRING, "size", "100M", 0);
> fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
>
> So far I'm getting -EBUSY if I try to apply the "ro", -EINVAL if I try
> to apply the "size=100M". But if I don't apply either, then
> FSCONFIG_CMD_RECONFIGURE succeeds.
>
> It seems worrying that it might let me set options on shm_mnt. Or at
> least letting me get as far as the -EBUSY check for the "ro" superblock
> flag.
>
> I'm not sure why I'm getting the -EINVAL setting the "size" option. But
> it would be much more reassuring if I was getting -EPERM :-).
>

I would argue that the filesystem associated with a memfd, and even
the fact that there *is* a filesystem, is none of user code's
business. So that fspick() call should return -EINVAL or similar.

2018-10-17 13:15:53

by David Howells

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

Alan Jenkins <[email protected]> wrote:

> Sorry, I guess SB_NOUSER was the wrong word.  I was trying to find if anything
> stopped things like
>
> int memfd = memfd_create("foo", 0);
> int fsfd = fspick(memfd, "", FSPICK_EMPTY_PATH);
>
> fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> fsconfig(fsfd, FSCONFIG_SET_STRING, "size", "100M", 0);
> fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
>
> So far I'm getting -EBUSY if I try to apply the "ro", -EINVAL if I try to
> apply the "size=100M".  But if I don't apply either, then
> FSCONFIG_CMD_RECONFIGURE succeeds.

I should probably check that the picked point is actually a mountpoint.

David

2018-10-17 13:21:09

by David Howells

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

David Howells <[email protected]> wrote:

> I should probably check that the picked point is actually a mountpoint.

The root of the mount object at the path specified, that is, perhaps with
something like the attached.

David
---
diff --git a/fs/fsopen.c b/fs/fsopen.c
index f673e93ac456..aaaaa17a233c 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
if (ret < 0)
goto err;

+ ret = -EINVAL;
+ if (target.mnt->mnt_root != target.dentry)
+ goto err_path;
+
fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
0, 0, FS_CONTEXT_FOR_RECONFIGURE);
if (IS_ERR(fc)) {


2018-10-17 14:32:15

by Alan Jenkins

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

On 17/10/2018 14:20, David Howells wrote:
> David Howells <[email protected]> wrote:
>
>> I should probably check that the picked point is actually a mountpoint.
> The root of the mount object at the path specified, that is, perhaps with
> something like the attached.
>
> David


I agree.  I'm happy to see this is using the same check as do_remount().


* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
* on it - tough luck.
*/


Thanks

Alan


> ---
> diff --git a/fs/fsopen.c b/fs/fsopen.c
> index f673e93ac456..aaaaa17a233c 100644
> --- a/fs/fsopen.c
> +++ b/fs/fsopen.c
> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
> if (ret < 0)
> goto err;
>
> + ret = -EINVAL;
> + if (target.mnt->mnt_root != target.dentry)
> + goto err_path;
> +
> fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
> 0, 0, FS_CONTEXT_FOR_RECONFIGURE);
> if (IS_ERR(fc)) {
>

2018-10-17 14:37:00

by Eric W. Biederman

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

Alan Jenkins <[email protected]> writes:

> On 17/10/2018 14:20, David Howells wrote:
>> David Howells <[email protected]> wrote:
>>
>>> I should probably check that the picked point is actually a mountpoint.
>> The root of the mount object at the path specified, that is, perhaps with
>> something like the attached.
>>
>> David
>
>
> I agree.  I'm happy to see this is using the same check as do_remount().
>
>
> * change filesystem flags. dir should be a physical root of filesystem.
> * If you've mounted a non-root directory somewhere and want to do remount
> * on it - tough luck.
> */

David's check will work for bind mounts as well. It just won't work
for files or subdirectories of some mountpoint.

Eric

>> ---
>> diff --git a/fs/fsopen.c b/fs/fsopen.c
>> index f673e93ac456..aaaaa17a233c 100644
>> --- a/fs/fsopen.c
>> +++ b/fs/fsopen.c
>> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
>> if (ret < 0)
>> goto err;
>> + ret = -EINVAL;
>> + if (target.mnt->mnt_root != target.dentry)
>> + goto err_path;
>> +
>> fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
>> 0, 0, FS_CONTEXT_FOR_RECONFIGURE);
>> if (IS_ERR(fc)) {
>>

2018-10-17 14:55:54

by Alan Jenkins

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

On 17/10/2018 15:35, Eric W. Biederman wrote:
> Alan Jenkins <[email protected]> writes:
>
>> On 17/10/2018 14:20, David Howells wrote:
>>> David Howells <[email protected]> wrote:
>>>
>>>> I should probably check that the picked point is actually a mountpoint.
>>> The root of the mount object at the path specified, that is, perhaps with
>>> something like the attached.
>>>
>>> David
>>
>> I agree.  I'm happy to see this is using the same check as do_remount().
>>
>>
>> * change filesystem flags. dir should be a physical root of filesystem.
>> * If you've mounted a non-root directory somewhere and want to do remount
>> * on it - tough luck.
>> */
> David's check will work for bind mounts as well. It just won't work
> for files or subdirectories of some mountpoint.
>
> Eric


I see.  Then I am still happy to see the fspick() check match a check in
do_remount() (and it still solves the problem I was worried about).

I cannot blame David for the do_remount() comment being out of date :-).

# uname -r
4.18.10-200.fc.28.x86_64
# mount --bind /mnt /mnt
# mount -oremount,debug /mnt
# findmnt /mnt; findmnt /
[findmnt shows / has been remounted, adding the ext4 "debug" mount option]


>
>>> ---
>>> diff --git a/fs/fsopen.c b/fs/fsopen.c
>>> index f673e93ac456..aaaaa17a233c 100644
>>> --- a/fs/fsopen.c
>>> +++ b/fs/fsopen.c
>>> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
>>> if (ret < 0)
>>> goto err;
>>> + ret = -EINVAL;
>>> + if (target.mnt->mnt_root != target.dentry)
>>> + goto err_path;
>>> +
>>> fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
>>> 0, 0, FS_CONTEXT_FOR_RECONFIGURE);
>>> if (IS_ERR(fc)) {
>>>

2018-10-17 15:26:48

by David Howells

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

Eric W. Biederman <[email protected]> wrote:

> David's check will work for bind mounts as well. It just won't work
> for files or subdirectories of some mountpoint.

Bind mounts have to be done with open_tree() and move_mount(). You can't now
do fsmount() on something fspick()'d.

David

2018-10-17 15:39:12

by Eric W. Biederman

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

David Howells <[email protected]> writes:

> Eric W. Biederman <[email protected]> wrote:
>
>> David's check will work for bind mounts as well. It just won't work
>> for files or subdirectories of some mountpoint.
>
> Bind mounts have to be done with open_tree() and move_mount(). You can't now
> do fsmount() on something fspick()'d.

But a bind mount will have mnt_root set to the dentry that was
bound.

Therefore fspick(), with the change you are proposing, will work for the
root of bind mounts as well as the root of regular mounts. My apologies
for not being clear.

Eric


2018-10-17 15:48:03

by David Howells

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

Alan Jenkins <[email protected]> wrote:

> I agree. I'm happy to see this is using the same check as do_remount().
>
>
> * change filesystem flags. dir should be a physical root of filesystem.
> * If you've mounted a non-root directory somewhere and want to do remount
> * on it - tough luck.
> */

Are you suggesting that it should only work at the ultimate root of a
superblock and not a bind mount somewhere within?

That's tricky to make work for NFS because s_root is a dummy dentry.

David

2018-10-17 17:42:16

by Alan Jenkins

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

On 17/10/2018 16:45, David Howells wrote:
> Alan Jenkins <[email protected]> wrote:
>
>> I agree. I'm happy to see this is using the same check as do_remount().
>>
>>
>> * change filesystem flags. dir should be a physical root of filesystem.
>> * If you've mounted a non-root directory somewhere and want to do remount
>> * on it - tough luck.
>> */
> Are you suggesting that it should only work at the ultimate root of a
> superblock and not a bind mount somewhere within?
>
> That's tricky to make work for NFS because s_root is a dummy dentry.
>
> David


Retro-actively: I do not suggest that.

I tried to answer this question in my reply to Eric correcting me.  Eric
was right to correct me.  I now understand the comment above
do_remount() is incorrect.  I re-reviewed your diff in light of that.  I
re-endorse your diff as a way to solve the problem I raised.

(I think it would be useful to remove the misleading comment above
do_remount(), to avoid future confusion.)

> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
> if (ret < 0)
> goto err;
>
> + ret = -EINVAL;
> + if (target.mnt->mnt_root != target.dentry)
> + goto err_path;
> +

( the "if" statement it adds to fspick() is equivalent to the second
"if" statement in do_remount(): )

static int do_remount <https://elixir.bootlin.com/linux/v4.18/ident/do_remount>(struct path <https://elixir.bootlin.com/linux/v4.18/ident/path> *path <https://elixir.bootlin.com/linux/v4.18/ident/path>, int ms_flags, int sb_flags,
int mnt_flags, void *data)
{
int err;
struct super_block <https://elixir.bootlin.com/linux/v4.18/ident/super_block> *sb = path <https://elixir.bootlin.com/linux/v4.18/ident/path>->mnt
<https://elixir.bootlin.com/linux/v4.18/ident/mnt>->mnt_sb;
struct mount <https://elixir.bootlin.com/linux/v4.18/ident/mount> *mnt <https://elixir.bootlin.com/linux/v4.18/ident/mnt> = real_mount
<https://elixir.bootlin.com/linux/v4.18/ident/real_mount>(path
<https://elixir.bootlin.com/linux/v4.18/ident/path>->mnt
<https://elixir.bootlin.com/linux/v4.18/ident/mnt>);

if (!check_mnt <https://elixir.bootlin.com/linux/v4.18/ident/check_mnt>(mnt
<https://elixir.bootlin.com/linux/v4.18/ident/mnt>))
return -EINVAL <https://elixir.bootlin.com/linux/v4.18/ident/EINVAL>;

if (path <https://elixir.bootlin.com/linux/v4.18/ident/path>->dentry != path <https://elixir.bootlin.com/linux/v4.18/ident/path>->mnt
<https://elixir.bootlin.com/linux/v4.18/ident/mnt>->mnt_root)
return -EINVAL <https://elixir.bootlin.com/linux/v4.18/ident/EINVAL>;

Thanks

Alan


2018-10-17 17:48:10

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Hi David.  I think there's an outstanding point below, have you been
thinking about it?

On 07/10/2018 11:48, Alan Jenkins wrote:
> On 05/10/2018 19:24, Alan Jenkins wrote:
>> On 21/09/2018 17:30, David Howells wrote:
>>> From: Al Viro <[email protected]>
>>>
>>> Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
>>> attached by move_mount(2).
>>>
>>> If by the time of final fput() of OPEN_TREE_CLONE-opened file its
>>> tree is
>>> not detached anymore, it won't be dissolved.  move_mount(2) is adjusted
>>> to handle detached source.
>>>
>>> That gives us equivalents of mount --bind and mount --rbind.
>>>
>>> Signed-off-by: Al Viro <[email protected]>
>>> Signed-off-by: David Howells <[email protected]>
>>> ---
>>>
>>>   fs/namespace.c |   26 ++++++++++++++++++++------
>>>   1 file changed, 20 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>> index dd38141b1723..caf5c55ef555 100644
>>> --- a/fs/namespace.c
>>> +++ b/fs/namespace.c
>>> @@ -1785,8 +1785,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
>>>   {
>>>       namespace_lock();
>>>       lock_mount_hash();
>>> -    mntget(mnt);
>>> -    umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
>>> +    if (!real_mount(mnt)->mnt_ns) {
>>> +        mntget(mnt);
>>> +        umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
>>> +    }
>>>       unlock_mount_hash();
>>>       namespace_unlock();
>>>   }
>>> @@ -2393,6 +2395,7 @@ static int do_move_mount(struct path
>>> *old_path, struct path *new_path)
>>>       struct mount *old;
>>>       struct mountpoint *mp;
>>>       int err;
>>> +    bool attached;
>>>         mp = lock_mount(new_path);
>>>       err = PTR_ERR(mp);
>>> @@ -2403,10 +2406,19 @@ static int do_move_mount(struct path
>>> *old_path, struct path *new_path)
>>>       p = real_mount(new_path->mnt);
>>>         err = -EINVAL;
>>> -    if (!check_mnt(p) || !check_mnt(old))
>>> +    /* The mountpoint must be in our namespace. */
>>> +    if (!check_mnt(p))
>>> +        goto out1;
>>> +    /* The thing moved should be either ours or completely
>>> unattached. */
>>> +    if (old->mnt_ns && !check_mnt(old))
>>>           goto out1;
>>>   -    if (!mnt_has_parent(old))
>>> +    attached = mnt_has_parent(old);
>>> +    /*
>>> +     * We need to allow open_tree(OPEN_TREE_CLONE) followed by
>>> +     * move_mount(), but mustn't allow "/" to be moved.
>>> +     */
>>> +    if (old->mnt_ns && !attached)
>>>           goto out1;
>>>         if (old->mnt.mnt_flags & MNT_LOCKED)
>>
>> Hi
>>
>> I replied last time to wonder about the MNT_UMOUNT mnt_flag. So I've
>> tested it now :-), on David's current tree (commit 5581f4935add).
>>
>> The modified do_move_mount() allows re-attaching something that was
>> lazy-unmounted. But the lazy unmount sets MNT_UMOUNT. And this flag
>> is not cleared when the mount is re-attached.
>>
>> I wasn't sure what effect this would have. Luckily it showed up
>> straight away, when I tried to unmount again. It causes a soft lockup.
>>
>> Debug printk:
>>
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index 4dfe7e23b7ee..ac8de9191cfe 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -2472,6 +2472,10 @@ static int do_move_mount(struct path
>> *old_path, struct path *new_path)
>>      if (old->mnt.mnt_flags & MNT_LOCKED)
>>          goto out1;
>>
>> +    pr_info("mnt_flags=%x umount=%x\n",
>> +            (unsigned) old->mnt.mnt_flags,
>> +            (unsigned) !!(old->mnt.mnt_flags & MNT_UMOUNT));
>> +
>>      if (old_path->dentry != old_path->mnt->mnt_root)
>>          goto out1;
>
> The lockup seems to be a general problem with the cleanup code. Even
> if I use this as advertised, i.e. for a simple bind mount.
>
> (I was suspicious that being able to pass around detached trees as an
> FD, and re-attach them in any namespace, allows leaking memory by
> creating a namespace loop.  I.e. maybe it gives you enough rope to
> skip the test in mnt_ns_loop().  But I didn't get that far).
>
> I converted test-fsmount.c for my own purposes:
>
> diff --git a/samples/vfs/test-fsmount.c b/samples/vfs/test-fsmount.c
> index 74124025ade0..da6e3fbf0513 100644
> --- a/samples/vfs/test-fsmount.c
> +++ b/samples/vfs/test-fsmount.c
> @@ -83,6 +83,11 @@ static inline int move_mount(int from_dfd, const
> char *from_pathname,
>                 to_dfd, to_pathname, flags);
>  }
>
> +static inline int open_tree(int dfd, const char *pathname, unsigned
> flags)
> +{
> +    return syscall(__NR_open_tree, dfd, pathname, flags);
> +}
> +
>  #define E_fsconfig(fd, cmd, key, val, aux)                \
>      do {                                \
>          if (fsconfig(fd, cmd, key, val, aux) == -1)        \
> @@ -93,6 +98,7 @@ int main(int argc, char *argv[])
>  {
>      int fsfd, mfd;
>
> +#if 0
>      /* Mount a publically available AFS filesystem */
>      fsfd = fsopen("afs", 0);
>      if (fsfd == -1) {
> @@ -115,4 +121,9 @@ int main(int argc, char *argv[])
>
>      E(close(mfd));
>      exit(0);
> +#endif
> +
> +    E( mfd = open_tree(-1, "/mnt", OPEN_TREE_CLONE) );
> +    E( fchdir(mfd) );
> +    E( execl("/bin/bash", "/bin/bash", NULL) );
>  }
>
> If I close() the mount FD "mfd", and then do "mount --move . /mnt", my
> printk() shows MNT_UMOUNT has been set. ( I guess fchdir() works more
> like openat(... , O_PATH) than dup() ). Then unmounting /mnt hangs, as
> I would expect from my previous test.


^ You posted a diff that would solve this problem


>
>
> If I instead do the mount+unmount first, and close the FD as a second
> step, I think there's a lockup in the close().  The lockup happens in
> the same place as the unmount lockup from before.


^ but I don't think you have addressed this problem in your replies so far.

Thanks

Alan


> (Except there's a line "Code: Bad RIP value", I don't know why that
> happens).
>
> # unshare --mount
> # test-fsmount
> # mount --move . /mnt
> [  270.859542] umount=0 mnt_flags=20
>
> Check the flags are still the same:
>
> # mount --move /mnt /mnt
> [  305./mnt: mount(2) system call failed: Too many levels of symbolic
> links.
> [  313.737030] umount=0 mnt_flags=20
>
> Clean up the bind mount, and then the inherited mount FD.
>
> # cd
> # umount /mnt
> # exit
>
> [  351.898629] watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
> [bash:1483]
> [  351.899841] Modules linked in: xt_CHECKSUM(E) ipt_MASQUERADE(E)
> tun(E) bridge(E) stp(E) llc(E) ip6t_rpfilter(E) ip6t_REJECT(E)
> nf_reject_ipv6(E) xt_conntrack(E) ip6table_nat(E) nf_nat_ipv6(E)
> devlink(E) ip6table_mangle(E) ip6table_raw(E) ip6table_security(E)
> iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E)
> nf_defrag_ipv6(E) libcrc32c(E) nf_defrag_ipv4(E) iptable_mangle(E)
> iptable_raw(E) iptable_security(E) ip6table_filter(E) ip6_tables(E)
> snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E)
> snd_hwdep(E) snd_hda_core(E) snd_seq(E) snd_seq_device(E) snd_pcm(E)
> joydev(E) crc32_pclmul(E) snd_timer(E) ghash_clmulni_intel(E) snd(E)
> crct10dif_pclmul(E) virtio_balloon(E) serio_raw(E) soundcore(E)
> crc32c_intel(E) qxl(E) drm_kms_helper(E) virtio_console(E) ttm(E)
> virtio_net(E) net_failover(E)
> [  351.912077]  failover(E) drm(E) qemu_fw_cfg(E) pata_acpi(E)
> ata_generic(E)
> [  351.912888] CPU: 0 PID: 1483 Comm: bash Tainted: G E    
> 4.19.0-rc3+ #7
> [  351.914221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28
> 04/01/2014
> [  351.916582] RIP: 0010:pin_kill+0x128/0x140
> [  351.917369] Code: f2 5a 00 48 8b 44 24 20 48 39 c5 0f 84 6f ff ff
> ff 48 89 df e8 e9 4a 5b 00 8b 43 18 85 c0 7e b3 c6 03 00 fb 66 0f 1f
> 44 00 00 <e9> 51 ff ff ff e8 be 11 dd ff 0f 1f 40 00 66 2e 0f 1f 84 00
> 00 00
> [  351.920729] RSP: 0018:ffffa1b381be3d88 EFLAGS: 00000202 ORIG_RAX:
> ffffffffffffff13
> [  351.921801] RAX: 0000000000000000 RBX: ffff909cf2ea68b0 RCX:
> dead000000000200
> [  351.922807] RDX: 0000000000000001 RSI: ffffa1b381be3d28 RDI:
> ffff909cf2ea68b0
> [  351.923811] RBP: ffffa1b381be3da8 R08: ffff909d59621760 R09:
> 0000000000000000
> [  351.924813] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000010000000
> [  351.925818] R13: ffff909cf5db9a38 R14: ffff909cf2ea67a0 R15:
> ffff909cedc07300
> [  351.926824] FS:  00007f1eb90ac740(0000) GS:ffff909d59600000(0000)
> knlGS:0000000000000000
> [  351.927957] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  351.928772] CR2: 00007f1eabedb180 CR3: 000000000f20a003 CR4:
> 00000000003606f0
> [  351.929779] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  351.930785] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  351.931791] Call Trace:
> [  351.932160]  ? finish_wait+0x80/0x80
> [  351.932684]  group_pin_kill+0x1a/0x30
> [  351.933207]  namespace_unlock+0x6f/0x80
> [  351.933766]  __fput+0x239/0x240
> [  351.934217]  task_work_run+0x84/0xa0
> [  351.934743]  do_exit+0x2d3/0xae0
> [  351.935206]  ? __do_page_fault+0x263/0x4e0
> [  351.935799]  do_group_exit+0x3a/0xa0
> [  351.936307]  __x64_sys_exit_group+0x14/0x20
> [  351.936911]  do_syscall_64+0x5b/0x160
> [  351.937436]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  351.938164] RIP: 0033:0x7f1eb877adb6
> [  351.938688] Code: Bad RIP value.
> [  351.939149] RSP: 002b:00007ffd56e019d8 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000e7
> [  351.940216] RAX: ffffffffffffffda RBX: 00007f1eb8a69740 RCX:
> 00007f1eb877adb6
> [  351.941222] RDX: 0000000000000000 RSI: 000000000000003c RDI:
> 0000000000000000
> [  351.942229] RBP: 0000000000000000 R08: 00000000000000e7 R09:
> ffffffffffffff80
> [  351.943236] R10: 00007ffd56e0188a R11: 0000000000000246 R12:
> 00007f1eb8a69740
> [  351.944242] R13: 0000000000000001 R14: 00007f1eb8a72708 R15:
> 0000000000000000
>
>

2018-10-17 21:21:51

by David Howells

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

Alan Jenkins <[email protected]> wrote:

> static int do_remount <https://elixir.bootlin.com/linux/v4.18/ident/do_remount>(struct path <https://elixir.bootlin.com/linux/v4.18/ident/path> *path <https://elixir.bootlin.com/linux/v4.18/ident/path>, int ms_flags, int sb_flags,
> int mnt_flags, void *data)

What happened there? You seem to have had a load of URLs substituted in.

David

2018-10-17 22:15:01

by Alan Jenkins

Subject: Re: [PATCH 31/34] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #12]

[resent, hopefully with slightly less formatting damage]

On 17/10/2018 16:45, David Howells wrote:

> Alan Jenkins <[email protected]> wrote:
>
>> I agree. I'm happy to see this is using the same check as do_remount().
>>
>>
>> * change filesystem flags. dir should be a physical root of filesystem.
>> * If you've mounted a non-root directory somewhere and want to do remount
>> * on it - tough luck.
>> */
> Are you suggesting that it should only work at the ultimate root of a
> superblock and not a bind mount somewhere within?
>
> That's tricky to make work for NFS because s_root is a dummy dentry.
>
> David


Retro-actively: I do not suggest that.

I tried to answer this question in my reply to Eric correcting me. Eric
was right to correct me.  I now understand the comment above
do_remount() is incorrect.  I re-reviewed your diff in light of that.  I
re-endorse your diff as a way to solve the problem I raised.

(I think it would be useful to remove the misleading comment above
do_remount(), to avoid future confusion.)


> @@ -186,6 +186,10 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
> if (ret < 0)
> goto err;
>
> + ret = -EINVAL;
> + if (target.mnt->mnt_root != target.dentry)
> + goto err_path;
> +
> fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
> 0, 0, FS_CONTEXT_FOR_RECONFIGURE);
> if (IS_ERR(fc)) {


( the "if" statement it adds to fspick() is equivalent to the second
"if" statement in do_remount(): )

static int do_remount(struct path *path, int ms_flags, int sb_flags,
int mnt_flags, void *data)
{
int err;
struct super_block *sb = path->mnt->mnt_sb;
struct mount *mnt = real_mount(path->mnt);

if (!check_mnt(mnt))
return -EINVAL;

if (path->dentry != path->mnt->mnt_root)
return -EINVAL;

Thanks

Alan

2018-10-18 20:11:57

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> If I instead do the mount+unmount first, and close the FD as a second step, I
> think there's a lockup in the close().  The lockup happens in the same place
> as the unmount lockup from before. (Except there's a line "Code: Bad RIP
> value", I don't know why that happens).

Sorry, which FD are we talking about?

I presume you're talking about a command sequence like this:

# unshare --mount
# test-fsmount
# mount --move . /mnt
# mount --move /mnt /mnt
# cd
# umount /mnt
# exit

but this fails on your modified test-fsmount with:

shell-init: error retrieving current directory: getcwd: cannot access
parent directories: No such file or directory

David

2018-10-18 21:10:59

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

David Howells <[email protected]> wrote:

> but this fails on your modified test-fsmount with:
>
> shell-init: error retrieving current directory: getcwd: cannot access
> parent directories: No such file or directory

Actually, it doesn't fail at this point, and I do see a splat later in
fsnotify_first_mark().

static struct fsnotify_mark *fsnotify_first_mark(struct fsnotify_mark_connector **connp)
{
struct fsnotify_mark_connector *conn;
struct hlist_node *node = NULL;

conn = srcu_dereference(*connp, &fsnotify_mark_srcu);

conn here is 6b6b6b6b6b6b6b6b.

RIP: 0010:fsnotify_first_mark+0x5f/0xbb

Call Trace:
fsnotify+0x115/0x344
? __fput+0xac/0x1c1
__fput+0xac/0x1c1
task_work_run+0x78/0x9f
do_exit+0x525/0xa05
do_group_exit+0xb2/0xb2
__x64_sys_exit_group+0x14/0x14
do_syscall_64+0x7d/0x1a0
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The line in fsnotify is:

fsnotify_first_mark(&mnt->mnt_fsnotify_marks);

and fsnotify() is called from fsnotify_close().

David

2018-10-19 11:57:47

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Okay, I put in a tracepoint (patch attached) and got a trace from the life of
the offending mount object. I've cropped non-useful information out of the
lines, inserted a blank line every time the usage count goes down to 2 and
dropped most of the lines generated by fsnotify.

Most of the lines are irrelevant. You can see system calls being issued and
commands being run that make no difference on balance.

What matters are the first four lines, the two mounts and the umount. You can
see it go splat on the last line when the usage count has become poisoned.

bash-3597 M=93785f8a u=1 ALC sp=clone_mnt+0x31/0x30a
bash-3597 M=93785f8a u=2 GET sp=do_dentry_open+0x24/0x2d3
bash-3597 M=93785f8a u=1 PUT sp=__se_sys_open_tree+0x195/0x1da

^--- These three lines look like the open_tree() syscall done by test-fsmount.

bash-3597 M=93785f8a u=2 GET sp=set_fs_pwd+0x37/0xdb

^--- And this the fchdir() syscall from test-fsmount. At this point u=2 would
seem correct as the subtree isn't actually mounted anywhere (1 for pwd, 1
for fd).

v--- test-fsmount then called execl() on bash, which did some stat'ing to find
the executable...

bash-3597 M=93785f8a u=3 GET sp=legitimize_path.isra.7+0x16/0x50
bash-3597 M=93785f8a u=2 PUT sp=vfs_statx+0x95/0xcc

bash-3597 M=93785f8a u=3 GET sp=legitimize_path.isra.7+0x16/0x50
bash-3597 M=93785f8a u=2 PUT sp=vfs_statx+0x95/0xcc

v--- and then exec'd it.

bash-3597 M=93785f8a u=3 GET sp=legitimize_path.isra.7+0x16/0x50
bash-3597 M=93785f8a u=4 GET sp=do_dentry_open+0x24/0x2d3
bash-3597 M=93785f8a u=3 PUT sp=terminate_walk+0x20/0xda
bash-3597 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
bash-3597 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
bash-3597 M=93785f8a u=2 PUT sp=__fput+0x180/0x1c1

v--- bash then did stuff:

bash-3597 M=93785f8a u=3 GET sp=legitimize_path.isra.7+0x16/0x50
bash-3597 M=93785f8a u=2 PUT sp=vfs_statx+0x95/0xcc

bash-3597 M=93785f8a u=3 GET sp=copy_fs_struct+0xcc/0xde
grepconf.sh-3598 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3598 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3598 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3598 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3598 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3598 M=93785f8a u=5 GET sp=do_dentry_open+0x24/0x2d3
grepconf.sh-3598 M=93785f8a u=4 PUT sp=terminate_walk+0x20/0xda
grepconf.sh-3598 M=93785f8a u=5 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3598 M=93785f8a u=4 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3598 M=93785f8a u=3 PUT sp=__fput+0x180/0x1c1
grepconf.sh-3598 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3598 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3598 M=93785f8a u=4 GET sp=copy_fs_struct+0xcc/0xde
grep-3599 M=93785f8a u=3 PUT sp=free_fs_struct+0x1e/0x2e
grepconf.sh-3598 M=93785f8a u=2 PUT sp=free_fs_struct+0x1e/0x2e

bash-3597 M=93785f8a u=3 GET sp=copy_fs_struct+0xcc/0xde
bash-3600 M=93785f8a u=4 GET sp=copy_fs_struct+0xcc/0xde
tty-3601 M=93785f8a u=3 PUT sp=free_fs_struct+0x1e/0x2e
bash-3600 M=93785f8a u=4 GET sp=copy_fs_struct+0xcc/0xde
tput-3602 M=93785f8a u=3 PUT sp=free_fs_struct+0x1e/0x2e
bash-3600 M=93785f8a u=2 PUT sp=free_fs_struct+0x1e/0x2e

bash-3597 M=93785f8a u=3 GET sp=copy_fs_struct+0xcc/0xde
bash-3603 M=93785f8a u=4 GET sp=copy_fs_struct+0xcc/0xde
dircolors-3604 M=93785f8a u=3 PUT sp=free_fs_struct+0x1e/0x2e
bash-3603 M=93785f8a u=2 PUT sp=free_fs_struct+0x1e/0x2e

bash-3597 M=93785f8a u=3 GET sp=copy_fs_struct+0xcc/0xde
grep-3605 M=93785f8a u=2 PUT sp=free_fs_struct+0x1e/0x2e

bash-3597 M=93785f8a u=3 GET sp=copy_fs_struct+0xcc/0xde
grepconf.sh-3606 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3606 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3606 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3606 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3606 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3606 M=93785f8a u=5 GET sp=do_dentry_open+0x24/0x2d3
grepconf.sh-3606 M=93785f8a u=4 PUT sp=terminate_walk+0x20/0xda
grepconf.sh-3606 M=93785f8a u=5 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3606 M=93785f8a u=4 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3606 M=93785f8a u=3 PUT sp=__fput+0x180/0x1c1
grepconf.sh-3606 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3606 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3606 M=93785f8a u=4 GET sp=copy_fs_struct+0xcc/0xde
grep-3607 M=93785f8a u=3 PUT sp=free_fs_struct+0x1e/0x2e
grepconf.sh-3606 M=93785f8a u=2 PUT sp=free_fs_struct+0x1e/0x2e

bash-3597 M=93785f8a u=3 GET sp=copy_fs_struct+0xcc/0xde
grepconf.sh-3608 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3608 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3608 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3608 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3608 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3608 M=93785f8a u=5 GET sp=do_dentry_open+0x24/0x2d3
grepconf.sh-3608 M=93785f8a u=4 PUT sp=terminate_walk+0x20/0xda
grepconf.sh-3608 M=93785f8a u=5 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3608 M=93785f8a u=4 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3608 M=93785f8a u=3 PUT sp=__fput+0x180/0x1c1
grepconf.sh-3608 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
grepconf.sh-3608 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
grepconf.sh-3608 M=93785f8a u=4 GET sp=copy_fs_struct+0xcc/0xde
grep-3609 M=93785f8a u=3 PUT sp=free_fs_struct+0x1e/0x2e
grepconf.sh-3608 M=93785f8a u=2 PUT sp=free_fs_struct+0x1e/0x2e

bash-3597 M=93785f8a u=3 GET sp=legitimize_path.isra.7+0x16/0x50
bash-3597 M=93785f8a u=2 PUT sp=vfs_statx+0x95/0xcc

I ran "mount --move . /mnt":

bash-3597 M=93785f8a u=3 GET sp=copy_fs_struct+0xcc/0xde
mount-3610 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3610 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
mount-3610 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3610 M=93785f8a u=5 GET sp=do_dentry_open+0x24/0x2d3
mount-3610 M=93785f8a u=4 PUT sp=terminate_walk+0x20/0xda
mount-3610 M=93785f8a u=5 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3610 M=93785f8a u=4 PUT sp=vfs_statx+0x95/0xcc
mount-3610 M=93785f8a u=3 PUT sp=__fput+0x180/0x1c1
mount-3610 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3610 M=93785f8a u=3 PUT sp=do_mount+0x715/0x929
mount-3610 M=93785f8a u=2 PUT sp=free_fs_struct+0x1e/0x2e

which worked. Herein lieth the problem. The usage count should be 3 now (1
for pwd, 1 for fd, 1 for mount) - but how does VFS know that this mount object
isn't mounted yet?
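
To make the arithmetic explicit, here is a small user-space sketch (not kernel code; replay_refcount() and move_trace are names made up for illustration) that replays the GET/PUT sequence from the "mount --move . /mnt" trace above, starting from the count of 2 that the previous command left behind:

```c
#include <assert.h>
#include <stddef.h>

/* Replay a sequence of GET (+1) / PUT (-1) deltas from a starting count. */
static int replay_refcount(int start, const int *deltas, size_t n)
{
	int u = start;

	for (size_t i = 0; i < n; i++) {
		assert(deltas[i] == 1 || deltas[i] == -1);
		u += deltas[i];
		assert(u >= 0);	/* a negative count would mean an unbalanced PUT */
	}
	return u;
}

/* GET/PUT order copied from the "mount --move . /mnt" trace above. */
static const int move_trace[] = {
	+1, +1, -1, +1, +1, -1, +1, -1, -1, +1, -1, -1,
};
```

Replaying move_trace from 2 ends at 2 again: six GETs and six PUTs, so the move itself contributed no reference, which is one short of the 3 holders (pwd, fd, mount) listed above.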

I ran "mount --move /mnt /mnt":

bash-3597 M=93785f8a u=3 GET sp=copy_fs_struct+0xcc/0xde
mount-3611 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3611 M=93785f8a u=3 PUT sp=vfs_statx+0x95/0xcc
mount-3611 M=93785f8a u=4 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3611 M=93785f8a u=5 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3611 M=93785f8a u=4 PUT sp=do_mount+0x715/0x929
mount-3611 M=93785f8a u=3 PUT sp=do_mount+0x1cf/0x929
mount-3611 M=93785f8a u=2 PUT sp=free_fs_struct+0x1e/0x2e

which failed with ELOOP.

I ran "cd":

bash-3597 M=93785f8a u=1 PUT sp=set_fs_pwd+0xb8/0xdb

I ran "umount /mnt":

umount-3612 M=93785f8a u=2 GET sp=legitimize_path.isra.7+0x16/0x50
umount-3612 M=93785f8a u=1 PUT sp=vfs_statx+0x95/0xcc
umount-3612 M=93785f8a u=2 GET sp=legitimize_path.isra.7+0x16/0x50
umount-3612 M=93785f8a u=1 PUT sp=vfs_statx+0x95/0xcc
umount-3612 M=93785f8a u=2 GET sp=legitimize_path.isra.7+0x16/0x50
umount-3612 M=93785f8a u=1 PUT sp=user_statfs+0x61/0x98
umount-3612 M=93785f8a u=2 GET sp=legitimize_mnt+0x12/0x108
umount-3612 M=93785f8a u=1 PUT sp=pin_kill+0x11c/0x325
umount-3612 M=93785f8a u=0 PUT sp=ksys_umount+0x3e8/0x40e
umount-3612 M=93785f8a u=0 FRE sp=cleanup_mnt+0x4d/0x5e

And finally, I exited the shell, which then tried to release the fd inherited
from open_tree():

bash-3597 M=93785f8a u=1802201963 NFY sp=__fput+0xac/0x1c1

Note that the subtree attached to the fd has not at this point been directly
mounted by move_mount(), but implicitly mounted by fchdir() into it and then
using mount(MS_MOVE) of "." to "/mnt".

Note also that if I run the commands all in one go rather than pasting them
individually, a crash occurs in pin_kill() instead.

So, I'm not sure how to proceed from here. There's no easy way to remove the
FMODE_NEED_UNMOUNT flag left by open_tree().

David
---
commit e7c8e6590aa0dd3bf2e10450b8992d496c44be8b
Author: David Howells <[email protected]>
Date: Fri Oct 19 10:38:35 2018 +0100

mnt_count trace

diff --git a/fs/mount.h b/fs/mount.h
index f39bc9da4d73..124a6dd73936 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -20,7 +20,7 @@ struct mnt_namespace {
} __randomize_layout;

struct mnt_pcp {
- int mnt_count;
+ int mnt_countxxx;
int mnt_writers;
};

@@ -46,6 +46,7 @@ struct mount {
int mnt_count;
int mnt_writers;
#endif
+ atomic_t mnt_count;
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
struct list_head mnt_instance; /* mount instance on sb->s_mounts */
diff --git a/fs/namei.c b/fs/namei.c
index fb913148d4d1..da1489f6925c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -460,32 +460,6 @@ int inode_permission(struct inode *inode, int mask)
}
EXPORT_SYMBOL(inode_permission);

-/**
- * path_get - get a reference to a path
- * @path: path to get the reference to
- *
- * Given a path increment the reference count to the dentry and the vfsmount.
- */
-void path_get(const struct path *path)
-{
- mntget(path->mnt);
- dget(path->dentry);
-}
-EXPORT_SYMBOL(path_get);
-
-/**
- * path_put - put a reference to a path
- * @path: path to put the reference to
- *
- * Given a path decrement the reference count to the dentry and the vfsmount.
- */
-void path_put(const struct path *path)
-{
- dput(path->dentry);
- mntput(path->mnt);
-}
-EXPORT_SYMBOL(path_put);
-
#define EMBEDDED_LEVELS 2
struct nameidata {
struct path path;
diff --git a/fs/namespace.c b/fs/namespace.c
index 6370bfabec99..389e806e1a65 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -29,6 +29,8 @@
#include <linux/sched/task.h>
#include <uapi/linux/mount.h>
#include <linux/fs_context.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/mnt.h>

#include "pnode.h"
#include "internal.h"
@@ -109,8 +111,10 @@ static int mnt_alloc_id(struct mount *mnt)
return 0;
}

-static void mnt_free_id(struct mount *mnt)
+static noinline void mnt_free_id(struct mount *mnt)
{
+ trace_mnt_count(mnt, mnt->mnt_id, atomic_read(&mnt->mnt_count), 99,
+ __builtin_return_address(0));
ida_free(&mnt_id_ida, mnt->mnt_id);
}

@@ -141,6 +145,9 @@ void mnt_release_group_id(struct mount *mnt)
*/
static inline void mnt_add_count(struct mount *mnt, int n)
{
+ int u;
+
+#if 0
#ifdef CONFIG_SMP
this_cpu_add(mnt->mnt_pcp->mnt_count, n);
#else
@@ -148,6 +155,9 @@ static inline void mnt_add_count(struct mount *mnt, int n)
mnt->mnt_count += n;
preempt_enable();
#endif
+#endif
+ u = atomic_add_return(n, &mnt->mnt_count);
+ trace_mnt_count(mnt, mnt->mnt_id, u, n, __builtin_return_address(0));
}

/*
@@ -155,6 +165,7 @@ static inline void mnt_add_count(struct mount *mnt, int n)
*/
unsigned int mnt_get_count(struct mount *mnt)
{
+#if 0
#ifdef CONFIG_SMP
unsigned int count = 0;
int cpu;
@@ -167,6 +178,8 @@ unsigned int mnt_get_count(struct mount *mnt)
#else
return mnt->mnt_count;
#endif
+#endif
+ return atomic_read(&mnt->mnt_count);
}

static void drop_mountpoint(struct fs_pin *p)
@@ -198,11 +211,15 @@ static struct mount *alloc_vfsmnt(const char *name)
if (!mnt->mnt_pcp)
goto out_free_devname;

+#if 0
this_cpu_add(mnt->mnt_pcp->mnt_count, 1);
+#endif
#else
mnt->mnt_count = 1;
mnt->mnt_writers = 0;
#endif
+ atomic_set(&mnt->mnt_count, 1);
+ trace_mnt_count(mnt, mnt->mnt_id, 1, 0, __builtin_return_address(0));

INIT_HLIST_NODE(&mnt->mnt_hash);
INIT_LIST_HEAD(&mnt->mnt_child);
@@ -1128,7 +1145,7 @@ static void mntput_no_expire(struct mount *mnt)
cleanup_mnt(mnt);
}

-void mntput(struct vfsmount *mnt)
+inline void mntput(struct vfsmount *mnt)
{
if (mnt) {
struct mount *m = real_mount(mnt);
@@ -1140,7 +1157,7 @@ void mntput(struct vfsmount *mnt)
}
EXPORT_SYMBOL(mntput);

-struct vfsmount *mntget(struct vfsmount *mnt)
+inline struct vfsmount *mntget(struct vfsmount *mnt)
{
if (mnt)
mnt_add_count(real_mount(mnt), 1);
@@ -3970,3 +3987,29 @@ const struct proc_ns_operations mntns_operations = {
.install = mntns_install,
.owner = mntns_owner,
};
+
+/**
+ * path_get - get a reference to a path
+ * @path: path to get the reference to
+ *
+ * Given a path increment the reference count to the dentry and the vfsmount.
+ */
+void path_get(const struct path *path)
+{
+ mntget(path->mnt);
+ dget(path->dentry);
+}
+EXPORT_SYMBOL(path_get);
+
+/**
+ * path_put - put a reference to a path
+ * @path: path to put the reference to
+ *
+ * Given a path decrement the reference count to the dentry and the vfsmount.
+ */
+void path_put(const struct path *path)
+{
+ dput(path->dentry);
+ mntput(path->mnt);
+}
+EXPORT_SYMBOL(path_put);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index ababdbfab537..aaef44d6204c 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -23,6 +23,7 @@
#include <linux/module.h>
#include <linux/mount.h>
#include <linux/srcu.h>
+#include <trace/events/mnt.h>

#include <linux/fsnotify_backend.h>
#include "fsnotify.h"
@@ -324,10 +325,13 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
/* global tests shouldn't care about events on child only the specific event */
__u32 test_mask = (mask & ~FS_EVENT_ON_CHILD);

- if (data_is == FSNOTIFY_EVENT_PATH)
+ if (data_is == FSNOTIFY_EVENT_PATH) {
mnt = real_mount(((const struct path *)data)->mnt);
- else
+ trace_mnt_count(mnt, mnt->mnt_id, atomic_read(&mnt->mnt_count), 88,
+ __builtin_return_address(0));
+ } else {
mnt = NULL;
+ }

/*
* Optimization: srcu_read_lock() has a memory barrier which can
diff --git a/include/trace/events/mnt.h b/include/trace/events/mnt.h
new file mode 100644
index 000000000000..da1a981f1a61
--- /dev/null
+++ b/include/trace/events/mnt.h
@@ -0,0 +1,57 @@
+/* Mount tracepoints
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mnt
+
+#if !defined(_TRACE_MNT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MNT_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mnt_count,
+ TP_PROTO(const void *mnt, int mnt_id, int mnt_count,
+ int delta, const void *where),
+
+ TP_ARGS(mnt, mnt_id, mnt_count, delta, where),
+
+ TP_STRUCT__entry(
+ __field(int, mnt_id )
+ __field(int, mnt_count )
+ __field(int, delta )
+ __field(const void *, mnt )
+ __field(const void *, where )
+ ),
+
+ TP_fast_assign(
+ __entry->mnt_id = mnt_id;
+ __entry->mnt_count = mnt_count;
+ __entry->delta = delta;
+ __entry->mnt = mnt;
+ __entry->where = where;
+ ),
+
+ TP_printk("M=%p m=%08x u=%d %s sp=%pSR",
+ __entry->mnt,
+ __entry->mnt_id,
+ __entry->mnt_count,
+ __print_symbolic(__entry->delta,
+ { 0, "ALC" },
+ { 1, "GET" },
+ { -1, "PUT" },
+ { 88, "NFY" },
+ { 99, "FRE" }),
+ __entry->where)
+ );
+
+#endif /* _TRACE_MNT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

2018-10-19 13:38:43

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> If I close() the mount FD "mfd", and then do "mount --move . /mnt", my
> printk() shows MNT_UMOUNT has been set. ( I guess fchdir() works more like
> openat(... , O_PATH) than dup() ). Then unmounting /mnt hangs, as I would
> expect from my previous test.

Okay, I think the attached should fix it.

The issue being that do_move_mount() calls attach_recursive_mnt() with a NULL
parent_path, which means that the moved-mount doesn't get its refcount
incremented.

David
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 6370bfabec99..ce9fff980549 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1935,7 +1935,8 @@ int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
static int attach_recursive_mnt(struct mount *source_mnt,
struct mount *dest_mnt,
struct mountpoint *dest_mp,
- struct path *parent_path)
+ struct path *parent_path,
+ bool moving)
{
HLIST_HEAD(tree_list);
struct mnt_namespace *ns = dest_mnt->mnt_ns;
@@ -1976,6 +1977,8 @@ static int attach_recursive_mnt(struct mount *source_mnt,
attach_mnt(source_mnt, dest_mnt, dest_mp);
touch_mnt_namespace(source_mnt->mnt_ns);
} else {
+ if (moving)
+ mnt_add_count(source_mnt, 1);
mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
commit_tree(source_mnt);
}
@@ -2062,7 +2065,7 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
d_is_dir(mnt->mnt.mnt_root))
return -ENOTDIR;

- return attach_recursive_mnt(mnt, p, mp, NULL);
+ return attach_recursive_mnt(mnt, p, mp, NULL, false);
}

/*
@@ -2522,7 +2525,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
goto out1;

err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
- attached ? &parent_path : NULL);
+ attached ? &parent_path : NULL, true);
if (err)
goto out1;


2018-10-19 17:36:44

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 19/10/2018 14:37, David Howells wrote:
> Alan Jenkins <[email protected]> wrote:
>
>> If I close() the mount FD "mfd", and then do "mount --move . /mnt", my
>> printk() shows MNT_UMOUNT has been set. ( I guess fchdir() works more like
>> openat(... , O_PATH) than dup() ). Then unmounting /mnt hangs, as I would
>> expect from my previous test.
> Okay, I think the attached should fix it.
>
> The issue being that do_move_mount() calls attach_recursive_mnt() with a NULL
> parent_path, which means that the moved-mount doesn't get its refcount
> incremented.
>
> David
> ---
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 6370bfabec99..ce9fff980549 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1935,7 +1935,8 @@ int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
> static int attach_recursive_mnt(struct mount *source_mnt,
> struct mount *dest_mnt,
> struct mountpoint *dest_mp,
> - struct path *parent_path)
> + struct path *parent_path,
> + bool moving)
> {
> HLIST_HEAD(tree_list);
> struct mnt_namespace *ns = dest_mnt->mnt_ns;
> @@ -1976,6 +1977,8 @@ static int attach_recursive_mnt(struct mount *source_mnt,
> attach_mnt(source_mnt, dest_mnt, dest_mp);
> touch_mnt_namespace(source_mnt->mnt_ns);
> } else {
> + if (moving)
> + mnt_add_count(source_mnt, 1);
> mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
> commit_tree(source_mnt);
> }
> @@ -2062,7 +2065,7 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
> d_is_dir(mnt->mnt.mnt_root))
> return -ENOTDIR;
>
> - return attach_recursive_mnt(mnt, p, mp, NULL);
> + return attach_recursive_mnt(mnt, p, mp, NULL, false);
> }
>
> /*
> @@ -2522,7 +2525,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> goto out1;
>
> err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
> - attached ? &parent_path : NULL);
> + attached ? &parent_path : NULL, true);
> if (err)
> goto out1;
>

I guess this tries to fix the second of the two sequences I mentioned -
mount+unmount, then close the FD.  It doesn't seem to work.

# open_tree_clone 3</mnt 3 sh
# cd /proc/self/fd/3
# mount --move . /mnt
[ 41.747831] mnt_flags=1020 umount=0
# cd /
# umount /mnt
umount: /mnt: target is busy

^ a newly introduced bug? I do not remember having this problem before.

# umount -l /mnt
# exec 3<&- # close FD 3
[ 95.984094] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [sh:1423]
...
[ 96.000032] RIP: 0010:pin_kill+0x128/0x140

And the first sequence I mentioned - close the FD, then mount+unmount -
seems to be unchanged.

# open_tree_clone 3</mnt 3 sh
# cd /proc/self/fd/3
# exec 3<&- # close FD 3
# mount --move . /mnt
[ 76.175127] mnt_flags=8000020 umount=1
# cd /
# umount /mnt
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [umount:1472]
...
RIP: 0010:pin_kill+0x128/0x140

The close-then-mount test seemed to be solved by the diff you suggested
earlier.

diff --git a/fs/namespace.c b/fs/namespace.c
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2469,7 +2469,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
if (old->mnt_ns && !attached)
goto out1;

- if (old->mnt.mnt_flags & MNT_LOCKED)
+ if (old->mnt.mnt_flags & (MNT_LOCKED | MNT_UMOUNT))
goto out1;

if (old_path->dentry != old_path->mnt->mnt_root)

If we can do that, then is it possible to solve mount-unmount-close the
same way?

@@ -1763,7 +1763,7 @@ void dissolve_on_fput(struct vfsmount *mnt)
{
namespace_lock();
lock_mount_hash();
- if (!real_mount(mnt)->mnt_ns) {
+ if (!real_mount(mnt)->mnt_ns && !(mnt->mnt_flags & MNT_UMOUNT)) {
mntget(mnt);
umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
}

Regards

Alan


2018-10-19 21:35:49

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> And the first sequence I mentioned - close the FD, then mount+unmount -
> seems to be unchanged.

Unchanged in what sense? Still breaks? I thought I'd fixed that - or are we
talking about a different first sequence?

Sorry, I'm losing track of how many different ways of breaking open_tree() and
move_mount() you've posted. I don't suppose you could post a checklist?

> I guess this tries to fix the second of the two sequences I mentioned -
> mount+unmount, then close the FD.  It doesn't seem to work.
>
> # open_tree_clone 3</mnt 3 sh
> # cd /proc/self/fd/3
> # mount --move . /mnt
> [ 41.747831] mnt_flags=1020 umount=0
> # cd /
> # umount /mnt
> umount: /mnt: target is busy
>
> ^ a newly introduced bug? I do not remember having this problem before.
>
> # umount -l /mnt

Sigh, so I see. I have the attached trace from this sequence.

David
----
Command "open_tree_clone 3</mnt 3 sh"

sh-3614 M=421a9872 u=1 ALC sp=clone_mnt+0x31/0x30a
sh-3614 M=421a9872 u=2 GET sp=do_dentry_open+0x24/0x2d3
sh-3614 M=421a9872 u=1 PUT sp=__se_sys_open_tree+0x195/0x1da
sh-3614 M=421a9872 u=2 GET sp=proc_fd_link+0x106/0x124
sh-3614 M=421a9872 u=1 PUT sp=vfs_statx+0x95/0xcc

Command "cd /proc/self/fd/3":

sh-3614 M=421a9872 u=2 GET sp=proc_fd_link+0x106/0x124
sh-3614 M=421a9872 u=3 GET sp=set_fs_pwd+0x37/0xdb
sh-3614 M=421a9872 u=2 PUT sp=ksys_chdir+0x88/0xbd

sh-3614 M=421a9872 u=3 GET sp=legitimize_path.isra.7+0x16/0x50
sh-3614 M=421a9872 u=2 PUT sp=vfs_statx+0x95/0xcc

Command "mount --move . /mnt":

sh-3614 M=421a9872 u=3 GET sp=copy_fs_struct+0xcc/0xde
mount-3615 M=421a9872 u=4 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3615 M=421a9872 u=3 PUT sp=vfs_statx+0x95/0xcc
mount-3615 M=421a9872 u=4 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3615 M=421a9872 u=5 GET sp=do_dentry_open+0x24/0x2d3
mount-3615 M=421a9872 u=4 PUT sp=terminate_walk+0x20/0xda
mount-3615 M=421a9872 u=5 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3615 M=421a9872 u=4 PUT sp=vfs_statx+0x95/0xcc
mount-3615 M=421a9872 u=3 PUT sp=__fput+0x180/0x1c1
mount-3615 M=421a9872 u=4 GET sp=legitimize_path.isra.7+0x16/0x50
mount-3615 M=421a9872 u=4 0x4e sp= (null)
mount-3615 M=421a9872 u=5 GET sp=do_move_mount+0x216/0x298
mount-3615 M=421a9872 u=4 PUT sp=do_mount+0x715/0x929
mount-3615 M=421a9872 u=3 PUT sp=free_fs_struct+0x1e/0x2e

Command "cd /":

sh-3614 M=421a9872 u=2 PUT sp=set_fs_pwd+0xb8/0xdb

Command "umount /mnt":

umount-3616 M=421a9872 u=3 GET sp=legitimize_path.isra.7+0x16/0x50
umount-3616 M=421a9872 u=2 PUT sp=vfs_statx+0x95/0xcc
umount-3616 M=421a9872 u=3 GET sp=legitimize_path.isra.7+0x16/0x50
umount-3616 M=421a9872 u=2 PUT sp=vfs_statx+0x95/0xcc
umount-3616 M=421a9872 u=3 GET sp=legitimize_path.isra.7+0x16/0x50
umount-3616 M=421a9872 u=2 PUT sp=user_statfs+0x61/0x98
umount-3616 M=421a9872 u=3 GET sp=legitimize_mnt+0x12/0x108
umount-3616 M=421a9872 u=2 PUT sp=ksys_umount+0x3e8/0x40e

(Fails, -EBUSY).

Command "umount -l /mnt":

umount-3617 M=421a9872 u=3 GET sp=legitimize_mnt+0x12/0x108
umount-3617 M=421a9872 u=2 PUT sp=pin_kill+0x11c/0x325
umount-3617 M=421a9872 u=1 PUT sp=ksys_umount+0x3e8/0x40e

Command "exec 3<&-":

(Goes weird: bash still responds, but trying to run a command locks up that
shell; can still log in with ssh, but can't then run commands).

David

2018-10-19 21:40:44

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> I guess this tries to fix the second of the two sequences I mentioned -
> mount+unmount, then close the FD.  It doesn't seem to work.

It fixes this:

unshare --mount
/root/test-fsmount
mount --move . /mnt
mount --move /mnt /mnt
cd
umount /mnt
exit

Which usually gets a GPF in fsnotify.

David

2018-10-19 22:36:59

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> # open_tree_clone 3</mnt 3 sh
> # cd /proc/self/fd/3
> # mount --move . /mnt
> [ 41.747831] mnt_flags=1020 umount=0
> # cd /
> # umount /mnt
> umount: /mnt: target is busy
>
> ^ a newly introduced bug? I do not remember having this problem before.

The reason EBUSY is returned is because propagate_mount_busy() is called by
do_umount() with refcnt == 2, but mnt_count == 3:

umount-3577 M=f8898a34 u=3 0x555 sp=__x64_sys_umount+0x12/0x15

the trace line being added here:

if (!propagate_mount_busy(mnt, 2)) {
if (!list_empty(&mnt->mnt_list))
umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
retval = 0;
} else {
trace_mnt_count(mnt, mnt->mnt_id,
atomic_read(&mnt->mnt_count),
0x555, __builtin_return_address(0));
}

The busy evaluation is a result of this check:

if (!list_empty(&mnt->mnt_mounts) || do_refcount_check(mnt, refcnt))

in propagate_mount_busy().


The problem apparently being that mnt_count counts both refs from mountings
and refs from other sources, such as file descriptors or pathwalk.
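
As a rough user-space illustration of that check (a simplified model, not the kernel code; struct mount_model and mount_is_busy() are hypothetical names, and the "count greater than expected" comparison is my reading of do_refcount_check()):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mount; only what the check reads. */
struct mount_model {
	int mnt_count;		/* all references: mountings, fds, pwd, pathwalk */
	int nr_children;	/* stands in for !list_empty(&mnt->mnt_mounts) */
};

/* Busy if there are child mounts or more references than the caller expects. */
static bool mount_is_busy(const struct mount_model *m, int expected_refs)
{
	assert(expected_refs >= 0);
	return m->nr_children > 0 || m->mnt_count > expected_refs;
}
```

With mnt_count == 3 and an expected count of 2, as in the trace line above, the model reports busy; the extra reference is enough, wherever it came from.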

David

2018-10-20 05:25:48

by Al Viro

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On Fri, Oct 19, 2018 at 11:36:19PM +0100, David Howells wrote:
> Alan Jenkins <[email protected]> wrote:
>
> > # open_tree_clone 3</mnt 3 sh
> > # cd /proc/self/fd/3
> > # mount --move . /mnt
> > [ 41.747831] mnt_flags=1020 umount=0
> > # cd /
> > # umount /mnt
> > umount: /mnt: target is busy
> >
> > ^ a newly introduced bug? I do not remember having this problem before.
>
> The reason EBUSY is returned is because propagate_mount_busy() is called by
> do_umount() with refcnt == 2, but mnt_count == 3:
>
> umount-3577 M=f8898a34 u=3 0x555 sp=__x64_sys_umount+0x12/0x15
>
> the trace line being added here:
>
> if (!propagate_mount_busy(mnt, 2)) {
> if (!list_empty(&mnt->mnt_list))
> umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
> retval = 0;
> } else {
> trace_mnt_count(mnt, mnt->mnt_id,
> atomic_read(&mnt->mnt_count),
> 0x555, __builtin_return_address(0));
> }
>
> The busy evaluation is a result of this check:
>
> if (!list_empty(&mnt->mnt_mounts) || do_refcount_check(mnt, refcnt))
>
> in propagate_mount_busy().
>
>
> The problem apparently being that mnt_count counts both refs from mountings
> and refs from other sources, such as file descriptors or pathwalk.

As it bloody well should. Once the tree has been attached, that open_tree()
descriptor is referring to the vfsmount of /mnt (what else could it be?)

IOW, it *is* genuinely busy. The livelock on umount -l you've mentioned is
a different story - that's definitely a bug, but this -EBUSY is correct.

2018-10-20 11:08:35

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 19/10/2018 23:36, David Howells wrote:
> Alan Jenkins <[email protected]> wrote:
>
>> # open_tree_clone 3</mnt 3 sh
>> # cd /proc/self/fd/3
>> # mount --move . /mnt
>> [ 41.747831] mnt_flags=1020 umount=0
>> # cd /
>> # umount /mnt
>> umount: /mnt: target is busy
>>
>> ^ a newly introduced bug? I do not remember having this problem before.
> The reason EBUSY is returned is because propagate_mount_busy() is called by
> do_umount() with refcnt == 2, but mnt_count == 3:
>
> umount-3577 M=f8898a34 u=3 0x555 sp=__x64_sys_umount+0x12/0x15
>
> the trace line being added here:
>
> if (!propagate_mount_busy(mnt, 2)) {
> if (!list_empty(&mnt->mnt_list))
> umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
> retval = 0;
> } else {
> trace_mnt_count(mnt, mnt->mnt_id,
> atomic_read(&mnt->mnt_count),
> 0x555, __builtin_return_address(0));
> }
>
> The busy evaluation is a result of this check:
>
> if (!list_empty(&mnt->mnt_mounts) || do_refcount_check(mnt, refcnt))
>
> in propagate_mount_busy().
>
>
> The problem apparently being that mnt_count counts both refs from mountings
> and refs from other sources, such as file descriptors or pathwalk.
>
> David

Sorry for wasting your time on the EBUSY.  The EBUSY error is not new,
it is correct, and I was doing the wrong thing.  I cannot "umount /mnt"
if I still have an FD which points inside /mnt.

I was trying to provide a nice clearer overview, but it was still too
sloppy to understand.  I've written a second attempt to rephrase it (and
remove my mistake about EBUSY).  This all seems consistent with what Al
just said, so if you got the picture from Al's message, you can ignore
this one :-).

~

The patch series [ver #12] has a problem.  OPEN_TREE_CLONE creates an
open file, marked with FMODE_NEED_UNMOUNT for cleanup. Users are
expected to move_mount() directly from that file.

However, it is also possible to use openat() on the open file, which
gives you a second open file.  This raises questions about the cleanup
handling.  The second open file is *not* marked FMODE_NEED_UNMOUNT. 
What happens if we clean up the first open file and then move_mount()
from the second one?  And what happens if you consume the second open
file using move_mount(), and then clean up the first open file?

When I test the patch series [ver #12], it seems I can trigger the same
bug for either case.  The two reproducers use the same commands, but in
a different order.

"close-then-mount"

# open_tree_clone 3</mnt 3 sh
# cd /proc/self/fd/3
# exec 3<&-  # close FD 3
# mount --move . /mnt && cd /
# umount -l /mnt
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [umount:1472]
...
RIP: 0010:pin_kill+0x128/0x140
...
Call Trace:
pin_kill+0x5a/0x140
? finish_wait+0x80/0x80
group_pin_kill+0x1a/0x30
namespace_unlock+0x6f/0x80
ksys_umount+0x220/0x420
__x64_sys_umount+0x12/0x20
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9


"mount-then-close"

# open_tree_clone 3</mnt 3 sh
# cd /proc/self/fd/3
# mount --move . /mnt && cd /
# umount -l /mnt
# exec 3<&-  # close FD 3
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [sh:1423]
...
RIP: 0010:pin_kill+0x128/0x140
...
Call Trace:
? finish_wait+0x80/0x80
group_pin_kill+0x1a/0x30
namespace_unlock+0x6f/0x80
__fput+0x239/0x240
task_work_run+0x84/0xa0
exit_to_usermode_loop+0xb4/0xc0
do_syscall_64+0x14d/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9

When I debug the kernel and reproduce "close-then-mount", I can see
something is wrong even before the last command.  The mount command
attaches a mount into the mount namespace which is still marked as
MNT_UMOUNT.  This contradicts a comment in the predicate function,
disconnect_mount():

/* Because the reference counting rules change when mounts are
* unmounted and connected, umounted mounts may not be
* connected to mounted mounts.
*/
if (!(mnt->mnt_parent->mnt.mnt_flags & MNT_UMOUNT))
return true;

We could ask if there is a procedure to safely clear MNT_UMOUNT on a
detached tree, but we don't have a specific reason to. You suggested a
one-line diff, to deny the problematic mount command in "close-then-mount".

@@ -2469,7 +2469,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
if (old->mnt_ns && !attached)
goto out1;

- if (old->mnt.mnt_flags & MNT_LOCKED)
+ if (old->mnt.mnt_flags & (MNT_LOCKED | MNT_UMOUNT))
goto out1;

if (old_path->dentry != old_path->mnt->mnt_root)

It sounds plausible, and it worked as suggested.  But it feels
incomplete.  If my two reproducer sequences are really symmetric, we
need to fix the code path in move_mount() *and* the code path in
close().  I asked if we can add this on top:

@@ -1763,7 +1763,7 @@ void dissolve_on_fput(struct vfsmount *mnt)
{
namespace_lock();
lock_mount_hash();
- if (!real_mount(mnt)->mnt_ns) {
+ if (!real_mount(mnt)->mnt_ns && !(mnt->mnt_flags & MNT_UMOUNT)) {
mntget(mnt);
umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
}

(To apply without whitespace damage, see the attachment).  I tested now
and this seems to allow "mount-then-close"; there is no immediate
softlockup or error message.
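
For reference, the two guards can be modelled in user space like this. It is only a sketch: the *_MODEL flag values are what I believe include/linux/mount.h defines for MNT_LOCKED and MNT_UMOUNT, the helper names are made up, and I am assuming my printk shows mnt_flags in hex.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match MNT_LOCKED and MNT_UMOUNT in include/linux/mount.h. */
#define MNT_LOCKED_MODEL 0x800000u
#define MNT_UMOUNT_MODEL 0x8000000u

static_assert((MNT_LOCKED_MODEL & MNT_UMOUNT_MODEL) == 0,
	      "flags must be distinct bits");

/* Guard proposed for do_move_mount(): refuse locked or unmounted mounts. */
static bool move_allowed(unsigned int mnt_flags)
{
	return !(mnt_flags & (MNT_LOCKED_MODEL | MNT_UMOUNT_MODEL));
}

/* Guard proposed for dissolve_on_fput(): only dissolve a detached tree
 * that has not already been through umount_tree(). */
static bool should_dissolve(bool has_ns, unsigned int mnt_flags)
{
	return !has_ns && !(mnt_flags & MNT_UMOUNT_MODEL);
}
```

Under those assumptions, the mnt_flags=1020 case from "mount-then-close" passes the move check, while mnt_flags=8000020 from "close-then-mount" is refused, which matches umount=0 and umount=1 in my printk output.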

You mentioned when you tested, you can get a GPF in fsnotify instead,
depending on the timing of the commands.  I have been entering my
commands one at a time, and I have not seen the GPF so far.

You posted an analysis of a GPF, where you showed the reference count
was clearly one less than it should have been.  You narrowed this down
to a step where you connected an unmounted mount (MNT_UMOUNT) to a
mounted mount.  So your analysis is consistent with the comment in
disconnect_mount(), which says 1) you're not allowed to do that, 2) the
reason is because of different reference-counting rules.  AFAICT, the
GPF you analyzed would be prevented by the fix in do_move_mount(),
checking for MNT_UMOUNT.

I have been trying to understand MNT_UMOUNT by reading the patch series
that added it.  Now I'm getting the impression the different
ref-counting rules pre-date MNT_UMOUNT.  I *think* the added check in
dissolve_on_fput() makes things right, but I don't understand enough to
be sure.

Alan


Attachments:
MNT_UMOUNT.diff (733.00 B)

2018-10-20 11:49:07

by Al Viro

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On Sat, Oct 20, 2018 at 12:06:32PM +0100, Alan Jenkins wrote:

> You posted an analysis of a GPF, where you showed the reference count was
> clearly one less than it should have been. You narrowed this down to a step
> where you connected an unmounted mount (MNT_UMOUNT) to a mounted mount. So
> your analysis is consistent with the comment in disconnect_mount(), which
> says 1) you're not allowed to do that, 2) the reason is because of different
> reference-counting rules. AFAICT, the GPF you analyzed would be prevented
> by the fix in do_move_mount(), checking for MNT_UMOUNT.

Not just refcounting; it's that fs_pin is really intended to have ->kill()
triggered only once. If you look at the pin_kill() (which is where the
livelock happened) you'll see what's going on - anyone hitting it between
the first call and freeing of the object will be sleeping until ->kill()
from the first call gets through pin_remove(), at which point they bugger
off (being very careful with accessing the sucker to avoid use-after-free).

MNT_UMOUNT means that there's no way back.

> pre-date MNT_UMOUNT. I *think* the added check in dissolve_on_fput() makes
> things right, but I don't understand enough to be sure.

That, plus making sure that do_move_mount() grabs a reference in case
of successfully attaching a tree. I hate passing bool argument, BTW -
better just do mnt_add_count() either before attach_recursive_mnt()
and decrement on failure, or, better yet, just do it on success. Note
that namespace_sem is held, so the damn thing *can't* disappear under
us - nobody will be able to detach it until we drop namespace_lock.
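The "just do it on success" variant can be sketched in userspace as follows. This is only a toy model: toy_mount, toy_attach() and toy_move_mount() are hypothetical names standing in for struct mount, attach_recursive_mnt() and do_move_mount(), and the plain integer increment stands in for mnt_add_count(). It shows the ordering only (attach first, take the extra reference afterwards), which is safe in the kernel case because namespace_sem is held throughout.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for struct mount: just a refcount and an "attached" bit. */
struct toy_mount {
	int refcount;
	bool attached;
};

/* Hypothetical stand-in for attach_recursive_mnt(); fails on request. */
int toy_attach(struct toy_mount *m, bool should_fail)
{
	if (should_fail)
		return -EINVAL;
	m->attached = true;
	return 0;
}

/* Take the extra reference only once the attach has succeeded, so the
 * failure path has nothing to undo and no bool needs to be passed down. */
int toy_move_mount(struct toy_mount *m, bool should_fail)
{
	int err = toy_attach(m, should_fail);

	if (err)
		return err;
	m->refcount++;	/* stands in for mnt_add_count() */
	return 0;
}
```

A failed attach leaves the count untouched; a successful one leaves it one higher, matching the reference the newly attached tree now needs.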

> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4dfe7e23b7ee..e8d61d5f581d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1763,7 +1763,7 @@ void dissolve_on_fput(struct vfsmount *mnt)
> {
> namespace_lock();
> lock_mount_hash();
> - if (!real_mount(mnt)->mnt_ns) {
> + if (!real_mount(mnt)->mnt_ns && !(mnt->mnt_flags & MNT_UMOUNT)) {
> mntget(mnt);
> umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> }
> @@ -2469,7 +2469,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> if (old->mnt_ns && !attached)
> goto out1;
>
> - if (old->mnt.mnt_flags & MNT_LOCKED)
> + if (old->mnt.mnt_flags & (MNT_LOCKED | MNT_UMOUNT))
> goto out1;
>
> if (old_path->dentry != old_path->mnt->mnt_root)


2018-10-20 12:28:37

by Al Viro

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On Sat, Oct 20, 2018 at 12:48:26PM +0100, Al Viro wrote:

> Not just refcounting; it's that fs_pin is really intended to have ->kill()
> triggered only once. If you look at the pin_kill() (which is where the
> livelock happened)

More specifically, it's group_pin_kill() assuming that by the time pin_kill()
returns it either will have called pin_remove() or will have waited for
one to complete. Either way, the object will be gone from the list, so we
do get progress. Livelock comes since the object has already been through
pin_remove() once and then got reinserted into the list. Now pin_kill()
returns immediately and we keep spinning on the element that doesn't go
away.
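The livelock described above can be reproduced in a small userspace model. This is a sketch under stated assumptions, not the kernel code: toy_pin, toy_pin_kill() and toy_group_pin_kill() are hypothetical names, the "list" is reduced to a single on_list flag, and the max_iter bound exists only so the demonstration terminates (the kernel loop has no such bound).

```c
#include <stdbool.h>

/* Toy stand-in for struct fs_pin: only the state the argument needs. */
struct toy_pin {
	bool done;	/* ->kill() has already gone through pin_remove() */
	bool on_list;	/* still linked into the group list */
};

/* Models pin_kill(): the first caller unlinks the pin; any later caller
 * sees "done" and returns immediately without touching the list. */
void toy_pin_kill(struct toy_pin *p)
{
	if (p->done)
		return;
	p->on_list = false;
	p->done = true;
}

/* Models group_pin_kill(): keep killing the head until it leaves the list.
 * Returns the number of iterations, or -1 if max_iter was reached. */
int toy_group_pin_kill(struct toy_pin *head, int max_iter)
{
	int n = 0;

	while (head->on_list) {
		if (n == max_iter)
			return -1;	/* no progress: the livelock case */
		toy_pin_kill(head);
		n++;
	}
	return n;
}
```

With a fresh pin the loop finishes after one iteration. With a pin that has already been through the kill path and was then put back on the list (done and on_list both true), toy_pin_kill() never unlinks it and the loop only stops because of the artificial bound.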

2018-10-21 00:42:12

by David Howells

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

Alan Jenkins <[email protected]> wrote:

> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4dfe7e23b7ee..e8d61d5f581d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1763,7 +1763,7 @@ void dissolve_on_fput(struct vfsmount *mnt)
> {
> namespace_lock();
> lock_mount_hash();
> - if (!real_mount(mnt)->mnt_ns) {
> + if (!real_mount(mnt)->mnt_ns && !(mnt->mnt_flags & MNT_UMOUNT)) {
> mntget(mnt);
> umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> }
> @@ -2469,7 +2469,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> if (old->mnt_ns && !attached)
> goto out1;
>
> - if (old->mnt.mnt_flags & MNT_LOCKED)
> + if (old->mnt.mnt_flags & (MNT_LOCKED | MNT_UMOUNT))
> goto out1;
>
> if (old_path->dentry != old_path->mnt->mnt_root)

I've already got one of these; I'll fold in the other also.

David

2018-10-21 18:12:52

by Eric W. Biederman

Subject: Re: [PATCH 01/34] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #12]

David Howells <[email protected]> writes:

> From: Al Viro <[email protected]>
>
> open_tree(dfd, pathname, flags)
>
> Returns an O_PATH-opened file descriptor or an error.
> dfd and pathname specify the location to open, in usual
> fashion (see e.g. fstatat(2)). flags should be an OR of
> some of the following:
> * AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
> same meanings as usual
> * OPEN_TREE_CLOEXEC - make the resulting descriptor
> close-on-exec
> * OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
> instead of opening the location in question, create a detached
> mount tree matching the subtree rooted at location specified by
> dfd/pathname. With AT_RECURSIVE the entire subtree is cloned,
> without it - only the part within the mount containing the
> location in question. In other words, the same as mount --rbind
> or mount --bind would've taken. The detached tree will be
> dissolved on the final close of obtained file. Creation of such
> detached trees requires the same capabilities as doing mount --bind.


What happens when mounts propagate to such a detached mount tree?

It looks to me like the test in propagate_one for setting
CL_UNPRIVILEGED will trigger a NULL pointer dereference:

AKA:
/* Notice when we are propagating across user namespaces */
if (m->mnt_ns->user_ns != user_ns)
type |= CL_UNPRIVILEGED;

Since we don't know which mount namespace, if any, this subtree is going
into, the test should become:

if (!m->mnt_ns || (m->mnt_ns->user_ns != user_ns))
type |= CL_UNPRIVILEGED;

If the tree is never attached anywhere it won't hurt as we don't allow
mounts or umounts on the detached subtree.  We don't have enough
information about the namespace we copied from to know whether it would
cause CL_UNPRIVILEGED to be set on any given propagation.  Similarly we
don't have any information at all about the mount namespace this tree may
possibly be copied to.
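A userspace sketch of the guarded test, for illustration only. The toy_* types and TOY_CL_UNPRIVILEGED (with an arbitrary value) are hypothetical stand-ins for struct mount, struct mnt_namespace, struct user_namespace and CL_UNPRIVILEGED; only the shape of the condition is taken from the snippet above.

```c
#include <stddef.h>

#define TOY_CL_UNPRIVILEGED 0x40	/* arbitrary value for the sketch */

struct toy_user_ns { int id; };
struct toy_mnt_ns { struct toy_user_ns *user_ns; };
struct toy_mount { struct toy_mnt_ns *mnt_ns; };	/* NULL when detached */

/* A detached tree has no mount namespace, so treat it conservatively as
 * crossing a user namespace boundary instead of dereferencing NULL. */
int toy_propagation_type(const struct toy_mount *m,
			 const struct toy_user_ns *user_ns, int type)
{
	if (!m->mnt_ns || m->mnt_ns->user_ns != user_ns)
		type |= TOY_CL_UNPRIVILEGED;
	return type;
}
```

The short-circuit evaluation of || is what makes the detached case safe: m->mnt_ns->user_ns is only read when m->mnt_ns is non-NULL.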

Eric


> Signed-off-by: Al Viro <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> cc: [email protected]
> ---
>
> arch/x86/entry/syscalls/syscall_32.tbl | 1
> arch/x86/entry/syscalls/syscall_64.tbl | 1
> fs/file_table.c | 9 +-
> fs/internal.h | 1
> fs/namespace.c | 132 +++++++++++++++++++++++++++-----
> include/linux/fs.h | 7 +-
> include/linux/syscalls.h | 1
> include/uapi/linux/fcntl.h | 2
> include/uapi/linux/mount.h | 10 ++
> 9 files changed, 137 insertions(+), 27 deletions(-)
> create mode 100644 include/uapi/linux/mount.h
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..ea1b413afd47 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
> 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
> 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
> 386 i386 rseq sys_rseq __ia32_sys_rseq
> +387 i386 open_tree sys_open_tree __ia32_sys_open_tree
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..0545bed581dc 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
> 332 common statx __x64_sys_statx
> 333 common io_pgetevents __x64_sys_io_pgetevents
> 334 common rseq __x64_sys_rseq
> +335 common open_tree __x64_sys_open_tree
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/file_table.c b/fs/file_table.c
> index e49af4caf15d..e03c8d121c6c 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -255,6 +255,7 @@ static void __fput(struct file *file)
> struct dentry *dentry = file->f_path.dentry;
> struct vfsmount *mnt = file->f_path.mnt;
> struct inode *inode = file->f_inode;
> + fmode_t mode = file->f_mode;
>
> if (unlikely(!(file->f_mode & FMODE_OPENED)))
> goto out;
> @@ -277,18 +278,20 @@ static void __fput(struct file *file)
> if (file->f_op->release)
> file->f_op->release(inode, file);
> if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
> - !(file->f_mode & FMODE_PATH))) {
> + !(mode & FMODE_PATH))) {
> cdev_put(inode->i_cdev);
> }
> fops_put(file->f_op);
> put_pid(file->f_owner.pid);
> - if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
> + if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
> i_readcount_dec(inode);
> - if (file->f_mode & FMODE_WRITER) {
> + if (mode & FMODE_WRITER) {
> put_write_access(inode);
> __mnt_drop_write(mnt);
> }
> dput(dentry);
> + if (unlikely(mode & FMODE_NEED_UNMOUNT))
> + dissolve_on_fput(mnt);
> mntput(mnt);
> out:
> file_free(file);
> diff --git a/fs/internal.h b/fs/internal.h
> index 364c20b5ea2d..17029b30e196 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -85,6 +85,7 @@ extern int __mnt_want_write_file(struct file *);
> extern void __mnt_drop_write(struct vfsmount *);
> extern void __mnt_drop_write_file(struct file *);
>
> +extern void dissolve_on_fput(struct vfsmount *);
> /*
> * fs_struct.c
> */
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 8a7e1a7d1d06..ded1a970ec40 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -20,12 +20,14 @@
> #include <linux/init.h> /* init_rootfs */
> #include <linux/fs_struct.h> /* get_fs_root et.al. */
> #include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */
> +#include <linux/file.h>
> #include <linux/uaccess.h>
> #include <linux/proc_ns.h>
> #include <linux/magic.h>
> #include <linux/bootmem.h>
> #include <linux/task_work.h>
> #include <linux/sched/task.h>
> +#include <uapi/linux/mount.h>
>
> #include "pnode.h"
> #include "internal.h"
> @@ -1779,6 +1781,16 @@ struct vfsmount *collect_mounts(const struct path *path)
> return &tree->mnt;
> }
>
> +void dissolve_on_fput(struct vfsmount *mnt)
> +{
> + namespace_lock();
> + lock_mount_hash();
> + mntget(mnt);
> + umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + unlock_mount_hash();
> + namespace_unlock();
> +}
> +
> void drop_collected_mounts(struct vfsmount *mnt)
> {
> namespace_lock();
> @@ -2138,6 +2150,30 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
> return false;
> }
>
> +static struct mount *__do_loopback(struct path *old_path, int recurse)
> +{
> + struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
> +
> + if (IS_MNT_UNBINDABLE(old))
> + return mnt;
> +
> + if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
> + return mnt;
> +
> + if (!recurse && has_locked_children(old, old_path->dentry))
> + return mnt;
> +
> + if (recurse)
> + mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
> + else
> + mnt = clone_mnt(old, old_path->dentry, 0);
> +
> + if (!IS_ERR(mnt))
> + mnt->mnt.mnt_flags &= ~MNT_LOCKED;
> +
> + return mnt;
> +}
> +
> /*
> * do loopback mount.
> */
> @@ -2145,7 +2181,7 @@ static int do_loopback(struct path *path, const char *old_name,
> int recurse)
> {
> struct path old_path;
> - struct mount *mnt = NULL, *old, *parent;
> + struct mount *mnt = NULL, *parent;
> struct mountpoint *mp;
> int err;
> if (!old_name || !*old_name)
> @@ -2159,38 +2195,21 @@ static int do_loopback(struct path *path, const char *old_name,
> goto out;
>
> mp = lock_mount(path);
> - err = PTR_ERR(mp);
> - if (IS_ERR(mp))
> + if (IS_ERR(mp)) {
> + err = PTR_ERR(mp);
> goto out;
> + }
>
> - old = real_mount(old_path.mnt);
> parent = real_mount(path->mnt);
> -
> - err = -EINVAL;
> - if (IS_MNT_UNBINDABLE(old))
> - goto out2;
> -
> if (!check_mnt(parent))
> goto out2;
>
> - if (!check_mnt(old) && old_path.dentry->d_op != &ns_dentry_operations)
> - goto out2;
> -
> - if (!recurse && has_locked_children(old, old_path.dentry))
> - goto out2;
> -
> - if (recurse)
> - mnt = copy_tree(old, old_path.dentry, CL_COPY_MNT_NS_FILE);
> - else
> - mnt = clone_mnt(old, old_path.dentry, 0);
> -
> + mnt = __do_loopback(&old_path, recurse);
> if (IS_ERR(mnt)) {
> err = PTR_ERR(mnt);
> goto out2;
> }
>
> - mnt->mnt.mnt_flags &= ~MNT_LOCKED;
> -
> err = graft_tree(mnt, parent, mp);
> if (err) {
> lock_mount_hash();
> @@ -2204,6 +2223,75 @@ static int do_loopback(struct path *path, const char *old_name,
> return err;
> }
>
> +SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
> +{
> + struct file *file;
> + struct path path;
> + int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
> + bool detached = flags & OPEN_TREE_CLONE;
> + int error;
> + int fd;
> +
> + BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
> +
> + if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
> + AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
> + OPEN_TREE_CLOEXEC))
> + return -EINVAL;
> +
> + if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
> + return -EINVAL;
> +
> + if (flags & AT_NO_AUTOMOUNT)
> + lookup_flags &= ~LOOKUP_AUTOMOUNT;
> + if (flags & AT_SYMLINK_NOFOLLOW)
> + lookup_flags &= ~LOOKUP_FOLLOW;
> + if (flags & AT_EMPTY_PATH)
> + lookup_flags |= LOOKUP_EMPTY;
> +
> + if (detached && !may_mount())
> + return -EPERM;
> +
> + fd = get_unused_fd_flags(flags & O_CLOEXEC);
> + if (fd < 0)
> + return fd;
> +
> + error = user_path_at(dfd, filename, lookup_flags, &path);
> + if (error)
> + goto out;
> +
> + if (detached) {
> + struct mount *mnt = __do_loopback(&path, flags & AT_RECURSIVE);
> + if (IS_ERR(mnt)) {
> + error = PTR_ERR(mnt);
> + goto out2;
> + }
> + mntput(path.mnt);
> + path.mnt = &mnt->mnt;
> + }
> +
> + file = dentry_open(&path, O_PATH, current_cred());
> + if (IS_ERR(file)) {
> + error = PTR_ERR(file);
> + goto out3;
> + }
> +
> + if (detached)
> + file->f_mode |= FMODE_NEED_UNMOUNT;
> + path_put(&path);
> + fd_install(fd, file);
> + return fd;
> +
> +out3:
> + if (detached)
> + dissolve_on_fput(path.mnt);
> +out2:
> + path_put(&path);
> +out:
> + put_unused_fd(fd);
> + return error;
> +}
> +
> static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
> {
> int error = 0;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4323b8fe353d..6dc32507762f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -157,10 +157,13 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> #define FMODE_NONOTIFY ((__force fmode_t)0x4000000)
>
> /* File is capable of returning -EAGAIN if I/O will block */
> -#define FMODE_NOWAIT ((__force fmode_t)0x8000000)
> +#define FMODE_NOWAIT ((__force fmode_t)0x8000000)
> +
> +/* File represents mount that needs unmounting */
> +#define FMODE_NEED_UNMOUNT ((__force fmode_t)0x10000000)
>
> /* File does not contribute to nr_files count */
> -#define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)
> +#define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)
>
> /*
> * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 2ff814c92f7f..6978f3c76d41 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -906,6 +906,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
> unsigned mask, struct statx __user *buffer);
> asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
> int flags, uint32_t sig);
> +asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
>
> /*
> * Architecture-specific system calls
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 6448cdd9a350..594b85f7cb86 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -90,5 +90,7 @@
> #define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */
> #define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */
>
> +#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
> +
>
> #endif /* _UAPI_LINUX_FCNTL_H */
> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> new file mode 100644
> index 000000000000..e8db2911adca
> --- /dev/null
> +++ b/include/uapi/linux/mount.h
> @@ -0,0 +1,10 @@
> +#ifndef _UAPI_LINUX_MOUNT_H
> +#define _UAPI_LINUX_MOUNT_H
> +
> +/*
> + * open_tree() flags.
> + */
> +#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
> +#define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
> +
> +#endif /* _UAPI_LINUX_MOUNT_H */

2018-10-21 18:15:35

by Eric W. Biederman

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

David Howells <[email protected]> writes:

> From: Al Viro <[email protected]>
>
> Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
> attached by move_mount(2).
>
> If by the time of final fput() of OPEN_TREE_CLONE-opened file its tree is
> not detached anymore, it won't be dissolved. move_mount(2) is adjusted
> to handle detached source.
>
> That gives us equivalents of mount --bind and mount --rbind.

In light of recent conversations about double umount_tree, do we want to
simply limit ourselves to attaching file->f_path of a FMODE_NEED_UNMOUNT
file descriptor and clearing FMODE_NEED_UNMOUNT when it is attached?

Either that, or perhaps move the logic for when to perform the
umount_tree into mntput?

Otherwise I believe we start running into surprising situations:

This works:
ofd = open_tree(...);
dup_fd = openat(ofd, "", O_PATH);
move_mount(dup_fd, ...);
close(ofd);
close(dup_fd);

This should fail:
ofd = open_tree(...);
dup_fd = openat(ofd, "", O_PATH);
close(ofd);
move_mount(dup_fd, ...);
close(dup_fd);

Allowing any file descriptor that points to mnt_ns == NULL without
MNT_UMOUNT set seems like it is adding flexibility for no reason.
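The stricter rule proposed above (attach only through the descriptor that owns the detached tree, and clear the flag on attach) can be modelled in userspace roughly like this. Everything here is a hypothetical toy: toy_tree, toy_file and the toy_* functions are not kernel interfaces, need_unmount stands in for FMODE_NEED_UNMOUNT, and toy_reopen() stands in for openat(ofd, "", O_PATH).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_tree { bool attached; bool dissolved; };
struct toy_file { struct toy_tree *tree; bool need_unmount; };

/* Models open_tree(OPEN_TREE_CLONE): the file owns a detached tree. */
struct toy_file toy_open_tree_clone(struct toy_tree *t)
{
	struct toy_file f = { .tree = t, .need_unmount = true };
	return f;
}

/* Models reopening the descriptor: same tree, but no ownership flag. */
struct toy_file toy_reopen(const struct toy_file *f)
{
	struct toy_file dup = { .tree = f->tree, .need_unmount = false };
	return dup;
}

/* Proposed rule: only the owning file may attach, and doing so clears the
 * flag so that the final close no longer dissolves the tree. */
int toy_move_mount(struct toy_file *f)
{
	if (!f->need_unmount || f->tree->dissolved)
		return -EINVAL;
	f->tree->attached = true;
	f->need_unmount = false;
	return 0;
}

/* Models the final fput(): dissolve only if the flag is still set. */
void toy_close(struct toy_file *f)
{
	if (f->need_unmount)
		f->tree->dissolved = true;
	f->tree = NULL;
}
```

Under this model neither sequence above can attach through dup_fd, so the outcome no longer depends on whether close(ofd) happens before or after the move.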


> Signed-off-by: Al Viro <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> fs/namespace.c | 26 ++++++++++++++++++++------
> 1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index dd38141b1723..caf5c55ef555 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1785,8 +1785,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
> {
> namespace_lock();
> lock_mount_hash();
> - mntget(mnt);
> - umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + if (!real_mount(mnt)->mnt_ns) {
> + mntget(mnt);
> + umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + }

^^^^^^ This change should be unnecessary.

> unlock_mount_hash();
> namespace_unlock();
> }
> @@ -2393,6 +2395,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> struct mount *old;
> struct mountpoint *mp;
> int err;
> + bool attached;
>
> mp = lock_mount(new_path);
> err = PTR_ERR(mp);
> @@ -2403,10 +2406,19 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> p = real_mount(new_path->mnt);
>
> err = -EINVAL;
> - if (!check_mnt(p) || !check_mnt(old))
> + /* The mountpoint must be in our namespace. */
> + if (!check_mnt(p))
> + goto out1;
> + /* The thing moved should be either ours or completely unattached. */
> + if (old->mnt_ns && !check_mnt(old))
> goto out1;

^^^^^^^^^^^^^^^^^^^^^^^

!old->mnt_ns should only be allowed when it comes from a file
descriptor with FMODE_NEED_UNMOUNT.


> - if (!mnt_has_parent(old))
> + attached = mnt_has_parent(old);
> + /*
> + * We need to allow open_tree(OPEN_TREE_CLONE) followed by
> + * move_mount(), but mustn't allow "/" to be moved.
> + */
> + if (old->mnt_ns && !attached)
> goto out1;
>
> if (old->mnt.mnt_flags & MNT_LOCKED)
> @@ -2421,7 +2433,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> /*
> * Don't move a mount residing in a shared parent.
> */
> - if (IS_MNT_SHARED(old->mnt_parent))
> + if (attached && IS_MNT_SHARED(old->mnt_parent))
> goto out1;
> /*
> * Don't move a mount tree containing unbindable mounts to a destination
> @@ -2435,7 +2447,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> goto out1;
>
> err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
> - &parent_path);
> + attached ? &parent_path : NULL);
> if (err)
> goto out1;

^^^^^^^^^^^^^^^^^^^
Somewhere around here we should have code that clears FMODE_NEED_UNMOUNT.


> @@ -3121,6 +3133,8 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
>
> /*
> * Move a mount from one place to another.
> + * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
> + * used to copy a mount subtree.
> *
> * Note the flags value is a combination of MOVE_MOUNT_* flags.
> */

2018-10-23 11:20:51

by Alan Jenkins

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On 21/09/2018 17:30, David Howells wrote:
> From: Al Viro <[email protected]>
>
> Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
> attached by move_mount(2).
>
> If by the time of final fput() of OPEN_TREE_CLONE-opened file its tree is
> not detached anymore, it won't be dissolved. move_mount(2) is adjusted
> to handle detached source.
>
> That gives us equivalents of mount --bind and mount --rbind.
>
> Signed-off-by: Al Viro <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> fs/namespace.c | 26 ++++++++++++++++++++------
> 1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index dd38141b1723..caf5c55ef555 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1785,8 +1785,10 @@ void dissolve_on_fput(struct vfsmount *mnt)
> {
> namespace_lock();
> lock_mount_hash();
> - mntget(mnt);
> - umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + if (!real_mount(mnt)->mnt_ns) {
> + mntget(mnt);
> + umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
> + }
> unlock_mount_hash();
> namespace_unlock();
> }
> @@ -2393,6 +2395,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> struct mount *old;
> struct mountpoint *mp;
> int err;
> + bool attached;
>
> mp = lock_mount(new_path);
> err = PTR_ERR(mp);
> @@ -2403,10 +2406,19 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> p = real_mount(new_path->mnt);
>
> err = -EINVAL;
> - if (!check_mnt(p) || !check_mnt(old))
> + /* The mountpoint must be in our namespace. */
> + if (!check_mnt(p))
> + goto out1;
> + /* The thing moved should be either ours or completely unattached. */
> + if (old->mnt_ns && !check_mnt(old))
> goto out1;
>
> - if (!mnt_has_parent(old))
> + attached = mnt_has_parent(old);
> + /*
> + * We need to allow open_tree(OPEN_TREE_CLONE) followed by
> + * move_mount(), but mustn't allow "/" to be moved.
> + */
> + if (old->mnt_ns && !attached)
> goto out1;
>
> if (old->mnt.mnt_flags & MNT_LOCKED)
> @@ -2421,7 +2433,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> /*
> * Don't move a mount residing in a shared parent.
> */
> - if (IS_MNT_SHARED(old->mnt_parent))
> + if (attached && IS_MNT_SHARED(old->mnt_parent))
> goto out1;
> /*
> * Don't move a mount tree containing unbindable mounts to a destination
> @@ -2435,7 +2447,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> goto out1;
>
> err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
> - &parent_path);
> + attached ? &parent_path : NULL);
> if (err)
> goto out1;
>

I think there's another small hole. It is possible to move a sub-mount
from a detached tree (instead of moving the root of the tree). Then
do_move_mount() calls attach_recursive_mnt() with a non-NULL parent_path.

This causes it to skip count_mounts(). So, it doesn't count the number
of attached mounts, and it allows you to exceed sysctl_mount_max.

Regards
Alan

(I've tested to confirm the code lets you move a sub-mount.  I didn't
test whether it allows exceeding sysctl_mount_max.)

# unshare -m --propagation private
# mkdir -p /tmp/mnt
# mount --bind /tmp/mnt /tmp/mnt
# open_tree_clone 3</tmp 3 sh
# cd /proc/self/fd/3
# mount --move mnt /mnt
# exit
exit
# exit
logout
#

2018-10-23 16:23:10

by Al Viro

Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]

On Tue, Oct 23, 2018 at 12:19:35PM +0100, Alan Jenkins wrote:

> I think there's another small hole. It is possible to move a sub-mount from
> a detached tree (instead of moving the root of the tree). Then
> do_move_mount() calls attach_recursive_mnt() with a non-NULL parent_path.
>
> This causes it to skip count_mounts(). So, it doesn't count the number of
> attached mounts, and it allows you to exceed sysctl_mount_max.

That's trivial to repair, fortunately - we just need to check source_mnt->mnt_ns
instead of parent_path.
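Roughly, in a userspace toy model (hypothetical names throughout; toy_attach() is not attach_recursive_mnt(), the counters stand in for the namespace mount count and sysctl_mount_max, and -ENOSPC is assumed here as the over-limit error), keying the decision on whether the source already belongs to a namespace looks like this:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mnt_ns {
	unsigned int mounts;	/* mounts currently counted in the namespace */
	unsigned int max;	/* stands in for sysctl_mount_max */
};

struct toy_mount {
	struct toy_mnt_ns *mnt_ns;	/* NULL for anything in a detached tree */
	unsigned int subtree_size;	/* mounts in the subtree being attached */
};

/* Decide whether to charge the subtree by looking at the source's own
 * namespace rather than at whether it has a parent: a sub-mount of a
 * detached tree has a parent but has never been counted anywhere. */
int toy_attach(struct toy_mount *src, struct toy_mnt_ns *dest_ns)
{
	bool moving = src->mnt_ns != NULL;

	if (!moving) {
		if (dest_ns->mounts + src->subtree_size > dest_ns->max)
			return -ENOSPC;
		dest_ns->mounts += src->subtree_size;
	}
	src->mnt_ns = dest_ns;
	return 0;
}
```

A mount that is already in the namespace is moved without being charged again, while anything coming out of a detached tree, root or sub-mount, is counted against the limit.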

2018-11-19 04:25:32

by Andrei Vagin

Subject: Re: [PATCH 21/34] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #12]

On Fri, Sep 21, 2018 at 05:33:01PM +0100, David Howells wrote:
> Make kernfs support superblock creation/mount/remount with fs_context.
>
> This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
> be made to support fs_context also.
>
> Notes:
>
> (1) A kernfs_fs_context struct is created to wrap fs_context and the
> kernfs mount parameters are moved in here (or are in fs_context).
>
> (2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
> namespace tag parameter is passed in the context if desired.
>
> (3) kernfs_free_fs_context() is provided as a destructor for the
> kernfs_fs_context struct, but for the moment it does nothing except
> get called in the right places.
>
> (4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
> pass, but possibly this should be done anyway in case someone wants to
> add a parameter in future.
>
> (5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
> the cgroup v1 and v2 mount parameters are all moved there.
>
> (6) cgroup1 parameter parsing error messages are now handled by invalf(),
> which allows userspace to collect them directly.
>
> (7) cgroup1 parameter cleanup is now done in the context destructor rather
> than in the mount/get_tree and remount functions.
>
> Weirdies:
>
> (*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
> but then uses the resulting pointer after dropping the locks. I'm
> told this is okay and needs commenting.
>
> (*) The cgroup refcount web. This really needs documenting.
>
> (*) cgroup2 only has one root?
>
> Add a suggestion from Thomas Gleixner in which the RDT enablement code is
> placed into its own function.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Greg Kroah-Hartman <[email protected]>
> cc: Tejun Heo <[email protected]>
> cc: Li Zefan <[email protected]>
> cc: Johannes Weiner <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
>
> arch/x86/kernel/cpu/intel_rdt.h | 15 +
> arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 183 ++++++++++------
> fs/kernfs/mount.c | 88 ++++----
> fs/sysfs/mount.c | 67 ++++--
> include/linux/cgroup.h | 3
> include/linux/kernfs.h | 39 ++-
> kernel/cgroup/cgroup-internal.h | 50 +++-
> kernel/cgroup/cgroup-v1.c | 345 ++++++++++++++++--------------
> kernel/cgroup/cgroup.c | 264 +++++++++++++++--------
> kernel/cgroup/cpuset.c | 4
> 10 files changed, 640 insertions(+), 418 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
> index 4e588f36228f..1461adc2c5e8 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.h
> +++ b/arch/x86/kernel/cpu/intel_rdt.h
> @@ -33,6 +33,21 @@
> #define RMID_VAL_ERROR BIT_ULL(63)
> #define RMID_VAL_UNAVAIL BIT_ULL(62)
>
> +
> +struct rdt_fs_context {
> + struct kernfs_fs_context kfc;
> + bool enable_cdpl2;
> + bool enable_cdpl3;
> + bool enable_mba_mbps;
> +};
> +
> +static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> +{
> + struct kernfs_fs_context *kfc = fc->fs_private;
> +
> + return container_of(kfc, struct rdt_fs_context, kfc);
> +}
> +
> DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
>
> /**
> diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> index d6cb04c3a28b..34733a221669 100644
> --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> @@ -24,6 +24,7 @@
> #include <linux/cpu.h>
> #include <linux/debugfs.h>
> #include <linux/fs.h>
> +#include <linux/fs_parser.h>
> #include <linux/sysfs.h>
> #include <linux/kernfs.h>
> #include <linux/seq_buf.h>
> @@ -1707,43 +1708,6 @@ static void cdp_disable_all(void)
> cdpl2_disable();
> }
>
> -static int parse_rdtgroupfs_options(char *data)
> -{
> - char *token, *o = data;
> - int ret = 0;
> -
> - while ((token = strsep(&o, ",")) != NULL) {
> - if (!*token) {
> - ret = -EINVAL;
> - goto out;
> - }
> -
> - if (!strcmp(token, "cdp")) {
> - ret = cdpl3_enable();
> - if (ret)
> - goto out;
> - } else if (!strcmp(token, "cdpl2")) {
> - ret = cdpl2_enable();
> - if (ret)
> - goto out;
> - } else if (!strcmp(token, "mba_MBps")) {
> - ret = set_mba_sc(true);
> - if (ret)
> - goto out;
> - } else {
> - ret = -EINVAL;
> - goto out;
> - }
> - }
> -
> - return 0;
> -
> -out:
> - pr_err("Invalid mount option \"%s\"\n", token);
> -
> - return ret;
> -}
> -
> /*
> * We don't allow rdtgroup directories to be created anywhere
> * except the root directory. Thus when looking for the rdtgroup
> @@ -1815,13 +1779,27 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
> struct rdtgroup *prgrp,
> struct kernfs_node **mon_data_kn);
>
> -static struct dentry *rdt_mount(struct file_system_type *fs_type,
> - int flags, const char *unused_dev_name,
> - void *data, size_t data_size)
> +static int rdt_enable_ctx(struct rdt_fs_context *ctx)
> +{
> + int ret = 0;
> +
> + if (ctx->enable_cdpl2)
> + ret = cdpl2_enable();
> +
> + if (!ret && ctx->enable_cdpl3)
> + ret = cdpl3_enable();
> +
> + if (!ret && ctx->enable_mba_mbps)
> + ret = set_mba_sc(true);
> +
> + return ret;
> +}
> +
> +static int rdt_get_tree(struct fs_context *fc)
> {
> + struct rdt_fs_context *ctx = rdt_fc2context(fc);
> struct rdt_domain *dom;
> struct rdt_resource *r;
> - struct dentry *dentry;
> int ret;
>
> cpus_read_lock();
> @@ -1830,53 +1808,42 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
> * resctrl file system can only be mounted once.
> */
> if (static_branch_unlikely(&rdt_enable_key)) {
> - dentry = ERR_PTR(-EBUSY);
> + ret = -EBUSY;
> goto out;
> }
>
> - ret = parse_rdtgroupfs_options(data);
> - if (ret) {
> - dentry = ERR_PTR(ret);
> + ret = rdt_enable_ctx(ctx);
> + if (ret < 0)
> goto out_cdp;
> - }
>
> closid_init();
>
> ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
> - if (ret) {
> - dentry = ERR_PTR(ret);
> - goto out_cdp;
> - }
> + if (ret < 0)
> + goto out_mba;
>
> if (rdt_mon_capable) {
> ret = mongroup_create_dir(rdtgroup_default.kn,
> NULL, "mon_groups",
> &kn_mongrp);
> - if (ret) {
> - dentry = ERR_PTR(ret);
> + if (ret < 0)
> goto out_info;
> - }
> kernfs_get(kn_mongrp);
>
> ret = mkdir_mondata_all(rdtgroup_default.kn,
> &rdtgroup_default, &kn_mondata);
> - if (ret) {
> - dentry = ERR_PTR(ret);
> + if (ret < 0)
> goto out_mongrp;
> - }
> kernfs_get(kn_mondata);
> rdtgroup_default.mon.mon_data_kn = kn_mondata;
> }
>
> ret = rdt_pseudo_lock_init();
> - if (ret) {
> - dentry = ERR_PTR(ret);
> + if (ret)
> goto out_mondata;
> - }
>
> - dentry = kernfs_mount(fs_type, flags, rdt_root,
> - RDTGROUP_SUPER_MAGIC, NULL);
> - if (IS_ERR(dentry))
> + ret = kernfs_get_tree(fc);
> + if (ret < 0)
> goto out_psl;
>
> if (rdt_alloc_capable)
> @@ -1905,14 +1872,97 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
> kernfs_remove(kn_mongrp);
> out_info:
> kernfs_remove(kn_info);
> +out_mba:
> + if (ctx->enable_mba_mbps)
> + set_mba_sc(false);
> out_cdp:
> cdp_disable_all();
> out:
> rdt_last_cmd_clear();
> mutex_unlock(&rdtgroup_mutex);
> cpus_read_unlock();
> + return ret;
> +}
> +
> +enum rdt_param {
> + Opt_cdp,
> + Opt_cdpl2,
> + Opt_mba_mpbs,
> + nr__rdt_params
> +};
> +
> +static const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
> + [Opt_cdp] = { fs_param_is_flag },
> + [Opt_cdpl2] = { fs_param_is_flag },
> + [Opt_mba_mpbs] = { fs_param_is_flag },
> +};
> +
> +static const char *const rdt_param_keys[nr__rdt_params] = {
> + [Opt_cdp] = "cdp",
> + [Opt_cdpl2] = "cdpl2",
> + [Opt_mba_mpbs] = "mba_mbps",
> +};
> +
> +static const struct fs_parameter_description rdt_fs_parameters = {
> + .name = "rdt",
> + .nr_params = nr__rdt_params,
> + .keys = rdt_param_keys,
> + .specs = rdt_param_specs,
> + .no_source = true,
> +};
> +
> +static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> +{
> + struct rdt_fs_context *ctx = rdt_fc2context(fc);
> + struct fs_parse_result result;
> + int opt;
>
> - return dentry;
> + opt = fs_parse(fc, &rdt_fs_parameters, param, &result);
> + if (opt < 0)
> + return opt;
> +
> + switch (opt) {
> + case Opt_cdp:
> + ctx->enable_cdpl3 = true;
> + return 0;
> + case Opt_cdpl2:
> + ctx->enable_cdpl2 = true;
> + return 0;
> + case Opt_mba_mpbs:
> + ctx->enable_mba_mbps = true;
> + return 0;
> + }
> +
> + return -EINVAL;
> +}
> +
> +static void rdt_fs_context_free(struct fs_context *fc)
> +{
> + struct rdt_fs_context *ctx = rdt_fc2context(fc);
> +
> + kernfs_free_fs_context(fc);
> + kfree(ctx);
> +}
> +
> +static const struct fs_context_operations rdt_fs_context_ops = {
> + .free = rdt_fs_context_free,
> + .parse_param = rdt_parse_param,
> + .get_tree = rdt_get_tree,
> +};
> +
> +static int rdt_init_fs_context(struct fs_context *fc, struct dentry *reference)
> +{
> + struct rdt_fs_context *ctx;
> +
> + ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL);
> + if (!ctx)
> + return -ENOMEM;
> +
> + ctx->kfc.root = rdt_root;
> + ctx->kfc.magic = RDTGROUP_SUPER_MAGIC;
> + fc->fs_private = &ctx->kfc;
> + fc->ops = &rdt_fs_context_ops;
> + return 0;
> }
>
> static int reset_all_ctrls(struct rdt_resource *r)
> @@ -2085,9 +2135,10 @@ static void rdt_kill_sb(struct super_block *sb)
> }
>
> static struct file_system_type rdt_fs_type = {
> - .name = "resctrl",
> - .mount = rdt_mount,
> - .kill_sb = rdt_kill_sb,
> + .name = "resctrl",
> + .init_fs_context = rdt_init_fs_context,
> + .parameters = &rdt_fs_parameters,
> + .kill_sb = rdt_kill_sb,
> };
>
> static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f70e0b69e714..56742632956c 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -22,14 +22,13 @@
>
> struct kmem_cache *kernfs_node_cache;
>
> -static int kernfs_sop_remount_fs(struct super_block *sb, int *flags,
> - char *data, size_t data_size)
> +int kernfs_reconfigure(struct fs_context *fc)
> {
> - struct kernfs_root *root = kernfs_info(sb)->root;
> + struct kernfs_root *root = kernfs_info(fc->root->d_sb)->root;
> struct kernfs_syscall_ops *scops = root->syscall_ops;
>
> - if (scops && scops->remount_fs)
> - return scops->remount_fs(root, flags, data);
> + if (scops && scops->reconfigure)
> + return scops->reconfigure(root, fc);
> return 0;
> }
>
> @@ -61,7 +60,6 @@ const struct super_operations kernfs_sops = {
> .drop_inode = generic_delete_inode,
> .evict_inode = kernfs_evict_inode,
>
> - .remount_fs = kernfs_sop_remount_fs,
> .show_options = kernfs_sop_show_options,
> .show_path = kernfs_sop_show_path,
> };
> @@ -219,7 +217,7 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
> } while (true);
> }
>
> -static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
> +static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
> {
> struct kernfs_super_info *info = kernfs_info(sb);
> struct inode *inode;
> @@ -230,7 +228,7 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
> sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
> sb->s_blocksize = PAGE_SIZE;
> sb->s_blocksize_bits = PAGE_SHIFT;
> - sb->s_magic = magic;
> + sb->s_magic = kfc->magic;
> sb->s_op = &kernfs_sops;
> sb->s_xattr = kernfs_xattr_handlers;
> if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP)
> @@ -257,21 +255,20 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
> return 0;
> }
>
> -static int kernfs_test_super(struct super_block *sb, void *data)
> +static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
> {
> struct kernfs_super_info *sb_info = kernfs_info(sb);
> - struct kernfs_super_info *info = data;
> + struct kernfs_super_info *info = fc->s_fs_info;
>
> return sb_info->root == info->root && sb_info->ns == info->ns;
> }
>
> -static int kernfs_set_super(struct super_block *sb, void *data)
> +static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
> {
> - int error;
> - error = set_anon_super(sb, data);
> - if (!error)
> - sb->s_fs_info = data;
> - return error;
> + struct kernfs_fs_context *kfc = fc->fs_private;
> +
> + kfc->ns_tag = NULL;
> + return set_anon_super_fc(sb, fc);
> }
>
> /**
> @@ -288,63 +285,60 @@ const void *kernfs_super_ns(struct super_block *sb)
> }
>
> /**
> - * kernfs_mount_ns - kernfs mount helper
> - * @fs_type: file_system_type of the fs being mounted
> - * @flags: mount flags specified for the mount
> - * @root: kernfs_root of the hierarchy being mounted
> - * @magic: file system specific magic number
> - * @new_sb_created: tell the caller if we allocated a new superblock
> - * @ns: optional namespace tag of the mount
> - *
> - * This is to be called from each kernfs user's file_system_type->mount()
> - * implementation, which should pass through the specified @fs_type and
> - * @flags, and specify the hierarchy and namespace tag to mount via @root
> - * and @ns, respectively.
> + * kernfs_get_tree - kernfs filesystem access/retrieval helper
> + * @fc: The filesystem context.
> *
> - * The return value can be passed to the vfs layer verbatim.
> + * This is to be called from each kernfs user's fs_context->ops->get_tree()
> + * implementation. The filesystem type and flags are taken from @fc; the
> + * hierarchy and namespace tag to mount are taken from ->root and ->ns_tag of
> + * the kernfs_fs_context that @fc->fs_private points to.
> */
> -struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
> - struct kernfs_root *root, unsigned long magic,
> - bool *new_sb_created, const void *ns)
> +int kernfs_get_tree(struct fs_context *fc)
> {
> + struct kernfs_fs_context *kfc = fc->fs_private;
> struct super_block *sb;
> struct kernfs_super_info *info;
> int error;
>
> info = kzalloc(sizeof(*info), GFP_KERNEL);
> if (!info)
> - return ERR_PTR(-ENOMEM);
> + return -ENOMEM;
>
> - info->root = root;
> - info->ns = ns;
> + info->root = kfc->root;
> + info->ns = kfc->ns_tag;
> INIT_LIST_HEAD(&info->node);
>
> - sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags,
> - &init_user_ns, info);
> - if (IS_ERR(sb) || sb->s_fs_info != info)
> - kfree(info);
> + fc->s_fs_info = info;
> + sb = sget_fc(fc, kernfs_test_super, kernfs_set_super);
> if (IS_ERR(sb))
> - return ERR_CAST(sb);
> -
> - if (new_sb_created)
> - *new_sb_created = !sb->s_root;
> + return PTR_ERR(sb);
>
> if (!sb->s_root) {
> struct kernfs_super_info *info = kernfs_info(sb);
>
> - error = kernfs_fill_super(sb, magic);
> + kfc->new_sb_created = true;
> +
> + error = kernfs_fill_super(sb, kfc);
> if (error) {
> deactivate_locked_super(sb);
> - return ERR_PTR(error);
> + return error;
> }
> sb->s_flags |= SB_ACTIVE;
>
> mutex_lock(&kernfs_mutex);
> - list_add(&info->node, &root->supers);
> + list_add(&info->node, &info->root->supers);
> mutex_unlock(&kernfs_mutex);
> }
>
> - return dget(sb->s_root);
> + fc->root = dget(sb->s_root);
> + return 0;
> +}
> +
> +void kernfs_free_fs_context(struct fs_context *fc)
> +{
> + /* Note that we don't deal with kfc->ns_tag here. */
> + kfree(fc->s_fs_info);
> + fc->s_fs_info = NULL;
> }
>
> /**
> diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
> index 77302c35b0ff..1e1c0ccc6a36 100644
> --- a/fs/sysfs/mount.c
> +++ b/fs/sysfs/mount.c
> @@ -13,6 +13,7 @@
> #include <linux/magic.h>
> #include <linux/mount.h>
> #include <linux/init.h>
> +#include <linux/slab.h>
> #include <linux/user_namespace.h>
>
> #include "sysfs.h"
> @@ -20,27 +21,55 @@
> static struct kernfs_root *sysfs_root;
> struct kernfs_node *sysfs_root_kn;
>
> -static struct dentry *sysfs_mount(struct file_system_type *fs_type,
> - int flags, const char *dev_name, void *data, size_t data_size)
> +static int sysfs_get_tree(struct fs_context *fc)
> {
> - struct dentry *root;
> - void *ns;
> - bool new_sb = false;
> + struct kernfs_fs_context *kfc = fc->fs_private;
> + int ret;
>
> - if (!(flags & SB_KERNMOUNT)) {
> + ret = kernfs_get_tree(fc);
> + if (ret)
> + return ret;
> +
> + if (kfc->new_sb_created)
> + fc->root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
> + return 0;
> +}
> +
> +static void sysfs_fs_context_free(struct fs_context *fc)
> +{
> + struct kernfs_fs_context *kfc = fc->fs_private;
> +
> + if (kfc->ns_tag)
> + kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
> + kernfs_free_fs_context(fc);
> + kfree(kfc);
> +}
> +
> +static const struct fs_context_operations sysfs_fs_context_ops = {
> + .free = sysfs_fs_context_free,
> + .get_tree = sysfs_get_tree,
> +};
> +
> +static int sysfs_init_fs_context(struct fs_context *fc,
> + struct dentry *reference)
> +{
> + struct kernfs_fs_context *kfc;
> +
> + if (!(fc->sb_flags & SB_KERNMOUNT)) {
> if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
> - return ERR_PTR(-EPERM);
> + return -EPERM;
> }
>
> - ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
> - root = kernfs_mount_ns(fs_type, flags, sysfs_root,
> - SYSFS_MAGIC, &new_sb, ns);
> - if (!new_sb)
> - kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
> - else if (!IS_ERR(root))
> - root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
> + kfc = kzalloc(sizeof(struct kernfs_fs_context), GFP_KERNEL);
> + if (!kfc)
> + return -ENOMEM;
>
> - return root;
> + kfc->ns_tag = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
> + kfc->root = sysfs_root;
> + kfc->magic = SYSFS_MAGIC;
> + fc->fs_private = kfc;
> + fc->ops = &sysfs_fs_context_ops;
> + return 0;
> }
>
> static void sysfs_kill_sb(struct super_block *sb)
> @@ -52,10 +81,10 @@ static void sysfs_kill_sb(struct super_block *sb)
> }
>
> static struct file_system_type sysfs_fs_type = {
> - .name = "sysfs",
> - .mount = sysfs_mount,
> - .kill_sb = sysfs_kill_sb,
> - .fs_flags = FS_USERNS_MOUNT,
> + .name = "sysfs",
> + .init_fs_context = sysfs_init_fs_context,
> + .kill_sb = sysfs_kill_sb,
> + .fs_flags = FS_USERNS_MOUNT,
> };
>
> int __init sysfs_init(void)
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 32c553556bbd..13b6379648ec 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -859,10 +859,11 @@ copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
>
> #endif /* !CONFIG_CGROUPS */
>
> -static inline void get_cgroup_ns(struct cgroup_namespace *ns)
> +static inline struct cgroup_namespace *get_cgroup_ns(struct cgroup_namespace *ns)
> {
> if (ns)
> refcount_inc(&ns->count);
> + return ns;
> }
>
> static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 0f6bb8e1bc83..051709212f55 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -17,6 +17,7 @@
> #include <linux/atomic.h>
> #include <linux/uidgid.h>
> #include <linux/wait.h>
> +#include <linux/fs_context.h>
>
> struct file;
> struct dentry;
> @@ -27,6 +28,7 @@ struct super_block;
> struct file_system_type;
> struct fs_context;
>
> +struct kernfs_fs_context;
> struct kernfs_open_node;
> struct kernfs_iattrs;
>
> @@ -168,7 +170,7 @@ struct kernfs_node {
> * kernfs_node parameter.
> */
> struct kernfs_syscall_ops {
> - int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
> + int (*reconfigure)(struct kernfs_root *root, struct fs_context *fc);
> int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
>
> int (*mkdir)(struct kernfs_node *parent, const char *name,
> @@ -269,6 +271,18 @@ struct kernfs_ops {
> #endif
> };
>
> +/*
> + * The kernfs superblock creation/mount parameter context.
> + */
> +struct kernfs_fs_context {
> + struct kernfs_root *root; /* Root of the hierarchy being mounted */
> + void *ns_tag; /* Namespace tag of the mount (or NULL) */
> + unsigned long magic; /* File system specific magic number */
> +
> + /* The following are set/used by kernfs_get_tree() */
> + bool new_sb_created; /* Set to true if we allocated a new sb */
> +};
> +
> #ifdef CONFIG_KERNFS
>
> static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
> @@ -354,9 +368,8 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
> void kernfs_notify(struct kernfs_node *kn);
>
> const void *kernfs_super_ns(struct super_block *sb);
> -struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
> - struct kernfs_root *root, unsigned long magic,
> - bool *new_sb_created, const void *ns);
> +int kernfs_get_tree(struct fs_context *fc);
> +void kernfs_free_fs_context(struct fs_context *fc);
> void kernfs_kill_sb(struct super_block *sb);
> struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
> int kernfs_reconfigure(struct fs_context *fc);
> @@ -461,11 +474,10 @@ static inline void kernfs_notify(struct kernfs_node *kn) { }
> static inline const void *kernfs_super_ns(struct super_block *sb)
> { return NULL; }
>
> -static inline struct dentry *
> -kernfs_mount_ns(struct file_system_type *fs_type, int flags,
> - struct kernfs_root *root, unsigned long magic,
> - bool *new_sb_created, const void *ns)
> -{ return ERR_PTR(-ENOSYS); }
> +static inline int kernfs_get_tree(struct fs_context *fc)
> +{ return -ENOSYS; }
> +
> +static inline void kernfs_free_fs_context(struct fs_context *fc) { }
>
> static inline void kernfs_kill_sb(struct super_block *sb) { }
>
> @@ -547,13 +559,4 @@ static inline int kernfs_rename(struct kernfs_node *kn,
> return kernfs_rename_ns(kn, new_parent, new_name, NULL);
> }
>
> -static inline struct dentry *
> -kernfs_mount(struct file_system_type *fs_type, int flags,
> - struct kernfs_root *root, unsigned long magic,
> - bool *new_sb_created)
> -{
> - return kernfs_mount_ns(fs_type, flags, root,
> - magic, new_sb_created, NULL);
> -}
> -
> #endif /* __LINUX_KERNFS_H */
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index 75568fcf2180..35012d2aca97 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -34,6 +34,33 @@ extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
> } \
> } while (0)
>
> +/*
> + * The cgroup filesystem superblock creation/mount context.
> + */
> +struct cgroup_fs_context {
> + struct kernfs_fs_context kfc;
> + struct cgroup_root *root;
> + struct cgroup_namespace *ns;
> + u8 version; /* cgroups version */
> + unsigned int flags; /* CGRP_ROOT_* flags */
> +
> + /* cgroup1 bits */
> + bool cpuset_clone_children;
> + bool none; /* User explicitly requested empty subsystem */
> + bool all_ss; /* Seen 'all' option */
> + bool one_ss; /* Seen a subsystem name option */
> + u16 subsys_mask; /* Selected subsystems */
> + char *name; /* Hierarchy name */
> + char *release_agent; /* Path for release notifications */
> +};
> +
> +static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc)
> +{
> + struct kernfs_fs_context *kfc = fc->fs_private;
> +
> + return container_of(kfc, struct cgroup_fs_context, kfc);
> +}
> +
> /*
> * A cgroup can be associated with multiple css_sets as different tasks may
> * belong to different cgroups on different hierarchies. In the other
> @@ -115,16 +142,6 @@ struct cgroup_mgctx {
> #define DEFINE_CGROUP_MGCTX(name) \
> struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
>
> -struct cgroup_sb_opts {
> - u16 subsys_mask;
> - unsigned int flags;
> - char *release_agent;
> - bool cpuset_clone_children;
> - char *name;
> - /* User explicitly requested empty subsystem */
> - bool none;
> -};
> -
> extern struct mutex cgroup_mutex;
> extern spinlock_t css_set_lock;
> extern struct cgroup_subsys *cgroup_subsys[];
> @@ -195,12 +212,10 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
> struct cgroup_namespace *ns);
>
> void cgroup_free_root(struct cgroup_root *root);
> -void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts);
> +void init_cgroup_root(struct cgroup_fs_context *ctx);
> int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
> int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
> -struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
> - struct cgroup_root *root, unsigned long magic,
> - struct cgroup_namespace *ns);
> +int cgroup_do_get_tree(struct fs_context *fc);
>
> int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
> void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
> @@ -244,14 +259,15 @@ extern const struct proc_ns_operations cgroupns_operations;
> */
> extern struct cftype cgroup1_base_files[];
> extern struct kernfs_syscall_ops cgroup1_kf_syscall_ops;
> +extern const struct fs_parameter_description cgroup1_fs_parameters;
>
> int proc_cgroupstats_show(struct seq_file *m, void *v);
> bool cgroup1_ssid_disabled(int ssid);
> void cgroup1_pidlist_destroy_all(struct cgroup *cgrp);
> void cgroup1_release_agent(struct work_struct *work);
> void cgroup1_check_for_release(struct cgroup *cgrp);
> -struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> - void *data, unsigned long magic,
> - struct cgroup_namespace *ns);
> +int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param);
> +int cgroup1_validate(struct fs_context *fc);
> +int cgroup1_get_tree(struct fs_context *fc);
>
> #endif /* __CGROUP_INTERNAL_H */
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index 51063e7a93c2..d8b325c3c2eb 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -13,9 +13,12 @@
> #include <linux/delayacct.h>
> #include <linux/pid_namespace.h>
> #include <linux/cgroupstats.h>
> +#include <linux/fs_parser.h>
>
> #include <trace/events/cgroup.h>
>
> +#define cg_invalf(fc, fmt, ...) ({ pr_err(fmt, ## __VA_ARGS__); -EINVAL; })
> +
> /*
> * pidlists linger the following amount before being destroyed. The goal
> * is avoiding frequent destruction in the middle of consecutive read calls
> @@ -903,92 +906,61 @@ static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root *kf_roo
> return 0;
> }
>
> -static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
> -{
> - char *token, *o = data;
> - bool all_ss = false, one_ss = false;
> - u16 mask = U16_MAX;
> - struct cgroup_subsys *ss;
> - int nr_opts = 0;
> - int i;
> -
> -#ifdef CONFIG_CPUSETS
> - mask = ~((u16)1 << cpuset_cgrp_id);
> -#endif
> +enum cgroup1_param {
> + Opt_all,
> + Opt_clone_children,
> + Opt_cpuset_v2_mode,
> + Opt_name,
> + Opt_none,
> + Opt_noprefix,
> + Opt_release_agent,
> + Opt_xattr,
> + nr__cgroup1_params
> +};
>
> - memset(opts, 0, sizeof(*opts));
> +static const struct fs_parameter_spec cgroup1_param_specs[nr__cgroup1_params] = {
> + [Opt_all] = { fs_param_is_flag },
> + [Opt_clone_children] = { fs_param_is_flag },
> + [Opt_cpuset_v2_mode] = { fs_param_is_flag },
> + [Opt_name] = { fs_param_is_string },
> + [Opt_none] = { fs_param_is_flag },
> + [Opt_noprefix] = { fs_param_is_flag },
> + [Opt_release_agent] = { fs_param_is_string },
> + [Opt_xattr] = { fs_param_is_flag },
> +};
>
> - while ((token = strsep(&o, ",")) != NULL) {
> - nr_opts++;
> +static const char *const cgroup1_param_keys[nr__cgroup1_params] = {
> + [Opt_all] = "all",
> + [Opt_clone_children] = "clone_children",
> + [Opt_cpuset_v2_mode] = "cpuset_v2_mode",
> + [Opt_name] = "name",
> + [Opt_none] = "none",
> + [Opt_noprefix] = "noprefix",
> + [Opt_release_agent] = "release_agent",
> + [Opt_xattr] = "xattr",
> +};
>
> - if (!*token)
> - return -EINVAL;
> - if (!strcmp(token, "none")) {
> - /* Explicitly have no subsystems */
> - opts->none = true;
> - continue;
> - }
> - if (!strcmp(token, "all")) {
> - /* Mutually exclusive option 'all' + subsystem name */
> - if (one_ss)
> - return -EINVAL;
> - all_ss = true;
> - continue;
> - }
> - if (!strcmp(token, "noprefix")) {
> - opts->flags |= CGRP_ROOT_NOPREFIX;
> - continue;
> - }
> - if (!strcmp(token, "clone_children")) {
> - opts->cpuset_clone_children = true;
> - continue;
> - }
> - if (!strcmp(token, "cpuset_v2_mode")) {
> - opts->flags |= CGRP_ROOT_CPUSET_V2_MODE;
> - continue;
> - }
> - if (!strcmp(token, "xattr")) {
> - opts->flags |= CGRP_ROOT_XATTR;
> - continue;
> - }
> - if (!strncmp(token, "release_agent=", 14)) {
> - /* Specifying two release agents is forbidden */
> - if (opts->release_agent)
> - return -EINVAL;
> - opts->release_agent =
> - kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
> - if (!opts->release_agent)
> - return -ENOMEM;
> - continue;
> - }
> - if (!strncmp(token, "name=", 5)) {
> - const char *name = token + 5;
> - /* Can't specify an empty name */
> - if (!strlen(name))
> - return -EINVAL;
> - /* Must match [\w.-]+ */
> - for (i = 0; i < strlen(name); i++) {
> - char c = name[i];
> - if (isalnum(c))
> - continue;
> - if ((c == '.') || (c == '-') || (c == '_'))
> - continue;
> - return -EINVAL;
> - }
> - /* Specifying two names is forbidden */
> - if (opts->name)
> - return -EINVAL;
> - opts->name = kstrndup(name,
> - MAX_CGROUP_ROOT_NAMELEN - 1,
> - GFP_KERNEL);
> - if (!opts->name)
> - return -ENOMEM;
> +const struct fs_parameter_description cgroup1_fs_parameters = {
> + .name = "cgroup1",
> + .nr_params = nr__cgroup1_params,
> + .keys = cgroup1_param_keys,
> + .specs = cgroup1_param_specs,
> + .no_source = true,
> +};
>
> - continue;
> - }
> +int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
> +{
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> + struct cgroup_subsys *ss;
> + struct fs_parse_result result;
> + int opt, i;
>
> + opt = fs_parse(fc, &cgroup1_fs_parameters, param, &result);
> + if (opt == -ENOPARAM) {
> + if (strcmp(param->key, "source") == 0)
> + return 0;
> for_each_subsys(ss, i) {
> - if (strcmp(token, ss->legacy_name))
> + if (strcmp(param->key, ss->legacy_name) != 0)
> continue;
> if (!cgroup_ssid_enabled(i))
> continue;
> @@ -996,75 +968,144 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
> continue;
>
> /* Mutually exclusive option 'all' + subsystem name */
> - if (all_ss)
> - return -EINVAL;
> - opts->subsys_mask |= (1 << i);
> - one_ss = true;
> + if (ctx->all_ss)
> + return cg_invalf(fc, "cgroup1: subsys name conflicts with all");
> + ctx->subsys_mask |= (1 << i);
> + ctx->one_ss = true;
> + return 0;
> + }
>
> - break;
> + return cg_invalf(fc, "cgroup1: Unknown subsys name '%s'", param->key);
> + }
> + if (opt < 0)
> + return opt;
> +
> + switch (opt) {
> + case Opt_none:
> + /* Explicitly have no subsystems */
> + ctx->none = true;
> + return 0;
> + case Opt_all:
> + /* Mutually exclusive option 'all' + subsystem name */
> + if (ctx->one_ss)
> + return cg_invalf(fc, "cgroup1: all conflicts with subsys name");
> + ctx->all_ss = true;
> + return 0;
> + case Opt_noprefix:
> + ctx->flags |= CGRP_ROOT_NOPREFIX;
> + return 0;
> + case Opt_clone_children:
> + ctx->cpuset_clone_children = true;
> + return 0;
> + case Opt_cpuset_v2_mode:
> + ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE;
> + return 0;
> + case Opt_xattr:
> + ctx->flags |= CGRP_ROOT_XATTR;
> + return 0;
> + case Opt_release_agent:
> + /* Specifying two release agents is forbidden */
> + if (ctx->release_agent)
> + return cg_invalf(fc, "cgroup1: release_agent respecified");
> + ctx->release_agent = param->string;
> + param->string = NULL;
> + if (!ctx->release_agent)
> + return -ENOMEM;
> + return 0;
> +
> + case Opt_name:
> + /* Can't specify an empty name */
> + if (!param->size)
> + return cg_invalf(fc, "cgroup1: Empty name");
> + if (param->size > MAX_CGROUP_ROOT_NAMELEN - 1)
> + return cg_invalf(fc, "cgroup1: Name too long");
> + /* Must match [\w.-]+ */
> + for (i = 0; i < param->size; i++) {
> + char c = param->string[i];
> + if (isalnum(c))
> + continue;
> + if ((c == '.') || (c == '-') || (c == '_'))
> + continue;
> + return cg_invalf(fc, "cgroup1: Invalid name");
> }
> - if (i == CGROUP_SUBSYS_COUNT)
> - return -ENOENT;
> + /* Specifying two names is forbidden */
> + if (ctx->name)
> + return cg_invalf(fc, "cgroup1: name respecified");
> + ctx->name = param->string;
> + param->string = NULL;
> + return 0;
> }
>
> + return 0;
> +}
> +
> +/*
> + * Validate the options that have been parsed.
> + */
> +int cgroup1_validate(struct fs_context *fc)
> +{
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> + struct cgroup_subsys *ss;
> + u16 mask = U16_MAX;
> + int i;
> +
> +#ifdef CONFIG_CPUSETS
> + mask = ~((u16)1 << cpuset_cgrp_id);
> +#endif
> +
> /*
> * If the 'all' option was specified select all the subsystems,
> * otherwise if 'none', 'name=' and a subsystem name options were
> * not specified, let's default to 'all'
> */
> - if (all_ss || (!one_ss && !opts->none && !opts->name))
> + if (ctx->all_ss || (!ctx->one_ss && !ctx->none && !ctx->name))
> for_each_subsys(ss, i)
> if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i))
> - opts->subsys_mask |= (1 << i);
> + ctx->subsys_mask |= (1 << i);
>
> /*
> * We either have to specify by name or by subsystems. (So all
> * empty hierarchies must have a name).
> */
> - if (!opts->subsys_mask && !opts->name)
> - return -EINVAL;
> + if (!ctx->subsys_mask && !ctx->name)
> + return cg_invalf(fc, "cgroup1: Need name or subsystem set");
>
> /*
> * Option noprefix was introduced just for backward compatibility
> * with the old cpuset, so we allow noprefix only if mounting just
> * the cpuset subsystem.
> */
> - if ((opts->flags & CGRP_ROOT_NOPREFIX) && (opts->subsys_mask & mask))
> - return -EINVAL;
> + if ((ctx->flags & CGRP_ROOT_NOPREFIX) && (ctx->subsys_mask & mask))
> + return cg_invalf(fc, "cgroup1: noprefix used incorrectly");
>
> /* Can't specify "none" and some subsystems */
> - if (opts->subsys_mask && opts->none)
> - return -EINVAL;
> + if (ctx->subsys_mask && ctx->none)
> + return cg_invalf(fc, "cgroup1: none used incorrectly");
>
> return 0;
> }
>
> -static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
> +static int cgroup1_reconfigure(struct kernfs_root *kf_root, struct fs_context *fc)
> {
> - int ret = 0;
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> struct cgroup_root *root = cgroup_root_from_kf(kf_root);
> - struct cgroup_sb_opts opts;
> u16 added_mask, removed_mask;
> + int ret = 0;
>
> cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
>
> - /* See what subsystems are wanted */
> - ret = parse_cgroupfs_options(data, &opts);
> - if (ret)
> - goto out_unlock;
> -
> - if (opts.subsys_mask != root->subsys_mask || opts.release_agent)
> + if (ctx->subsys_mask != root->subsys_mask || ctx->release_agent)
> pr_warn("option changes via remount are deprecated (pid=%d comm=%s)\n",
> task_tgid_nr(current), current->comm);
>
> - added_mask = opts.subsys_mask & ~root->subsys_mask;
> - removed_mask = root->subsys_mask & ~opts.subsys_mask;
> + added_mask = ctx->subsys_mask & ~root->subsys_mask;
> + removed_mask = root->subsys_mask & ~ctx->subsys_mask;
>
> /* Don't allow flags or name to change at remount */
> - if ((opts.flags ^ root->flags) ||
> - (opts.name && strcmp(opts.name, root->name))) {
> - pr_err("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"\n",
> - opts.flags, opts.name ?: "", root->flags, root->name);
> + if ((ctx->flags ^ root->flags) ||
> + (ctx->name && strcmp(ctx->name, root->name))) {
> + cg_invalf(fc, "option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"",
> + ctx->flags, ctx->name ?: "", root->flags, root->name);
> ret = -EINVAL;
> goto out_unlock;
> }
> @@ -1081,17 +1122,15 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
>
> WARN_ON(rebind_subsystems(&cgrp_dfl_root, removed_mask));
>
> - if (opts.release_agent) {
> + if (ctx->release_agent) {
> spin_lock(&release_agent_path_lock);
> - strcpy(root->release_agent_path, opts.release_agent);
> + strcpy(root->release_agent_path, ctx->release_agent);
> spin_unlock(&release_agent_path_lock);
> }
>
> trace_cgroup_remount(root);
>
> out_unlock:
> - kfree(opts.release_agent);
> - kfree(opts.name);
> mutex_unlock(&cgroup_mutex);
> return ret;
> }
> @@ -1099,31 +1138,26 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
> struct kernfs_syscall_ops cgroup1_kf_syscall_ops = {
> .rename = cgroup1_rename,
> .show_options = cgroup1_show_options,
> - .remount_fs = cgroup1_remount,
> + .reconfigure = cgroup1_reconfigure,
> .mkdir = cgroup_mkdir,
> .rmdir = cgroup_rmdir,
> .show_path = cgroup_show_path,
> };
>
> -struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> - void *data, unsigned long magic,
> - struct cgroup_namespace *ns)
> +/*
> + * Find or create a v1 cgroups superblock.
> + */
> +int cgroup1_get_tree(struct fs_context *fc)
> {
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> struct super_block *pinned_sb = NULL;
> - struct cgroup_sb_opts opts;
> struct cgroup_root *root;
> struct cgroup_subsys *ss;
> - struct dentry *dentry;
> int i, ret;
> bool new_root = false;
>
> cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
>
> - /* First find the desired set of subsystems */
> - ret = parse_cgroupfs_options(data, &opts);
> - if (ret)
> - goto out_unlock;
> -
> /*
> * Destruction of cgroup root is asynchronous, so subsystems may
> * still be dying after the previous unmount. Let's drain the
> @@ -1132,15 +1166,13 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> * starting. Testing ref liveliness is good enough.
> */
> for_each_subsys(ss, i) {
> - if (!(opts.subsys_mask & (1 << i)) ||
> + if (!(ctx->subsys_mask & (1 << i)) ||
> ss->root == &cgrp_dfl_root)
> continue;
>
> if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
> mutex_unlock(&cgroup_mutex);
> - msleep(10);
> - ret = restart_syscall();
> - goto out_free;
> + goto err_restart;
> }
> cgroup_put(&ss->root->cgrp);
> }
> @@ -1156,8 +1188,8 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> * name matches but sybsys_mask doesn't, we should fail.
> * Remember whether name matched.
> */
> - if (opts.name) {
> - if (strcmp(opts.name, root->name))
> + if (ctx->name) {
> + if (strcmp(ctx->name, root->name))
> continue;
> name_match = true;
> }
> @@ -1166,15 +1198,15 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> * If we asked for subsystems (or explicitly for no
> * subsystems) then they must match.
> */
> - if ((opts.subsys_mask || opts.none) &&
> - (opts.subsys_mask != root->subsys_mask)) {
> + if ((ctx->subsys_mask || ctx->none) &&
> + (ctx->subsys_mask != root->subsys_mask)) {
> if (!name_match)
> continue;
> ret = -EBUSY;
> - goto out_unlock;
> + goto err_unlock;
> }
>
> - if (root->flags ^ opts.flags)
> + if (root->flags ^ ctx->flags)
> pr_warn("new mount options do not match the existing superblock, will be ignored\n");
>
> /*
> @@ -1195,11 +1227,10 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> mutex_unlock(&cgroup_mutex);
> if (!IS_ERR_OR_NULL(pinned_sb))
> deactivate_super(pinned_sb);
> - msleep(10);
> - ret = restart_syscall();
> - goto out_free;
> + goto err_restart;
> }
>
> + ctx->root = root;
> ret = 0;
> goto out_unlock;
> }
> @@ -1209,41 +1240,35 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> * specification is allowed for already existing hierarchies but we
> * can't create new one without subsys specification.
> */
> - if (!opts.subsys_mask && !opts.none) {
> - ret = -EINVAL;
> - goto out_unlock;
> + if (!ctx->subsys_mask && !ctx->none) {
> + ret = cg_invalf(fc, "cgroup1: No subsys list or none specified");
> + goto err_unlock;
> }
>
> /* Hierarchies may only be created in the initial cgroup namespace. */
> - if (ns != &init_cgroup_ns) {
> + if (ctx->ns != &init_cgroup_ns) {
> ret = -EPERM;
> - goto out_unlock;
> + goto err_unlock;
> }
>
> root = kzalloc(sizeof(*root), GFP_KERNEL);
> if (!root) {
> ret = -ENOMEM;
> - goto out_unlock;
> + goto err_unlock;
> }
> new_root = true;
> + ctx->root = root;
>
> - init_cgroup_root(root, &opts);
> + init_cgroup_root(ctx);
>
> - ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD);
> + ret = cgroup_setup_root(root, ctx->subsys_mask, PERCPU_REF_INIT_DEAD);
> if (ret)
> - cgroup_free_root(root);
> + goto err_unlock;
>
> out_unlock:
> mutex_unlock(&cgroup_mutex);
> -out_free:
> - kfree(opts.release_agent);
> - kfree(opts.name);
> -
> - if (ret)
> - return ERR_PTR(ret);
>
> - dentry = cgroup_do_mount(&cgroup_fs_type, flags, root,
> - CGROUP_SUPER_MAGIC, ns);
> + ret = cgroup_do_get_tree(fc);
>
> /*
> * There's a race window after we release cgroup_mutex and before
> @@ -1256,6 +1281,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> percpu_ref_reinit(&root->cgrp.self.refcnt);
> mutex_unlock(&cgroup_mutex);
> }
> + cgroup_get(&root->cgrp);
>
> /*
> * If @pinned_sb, we're reusing an existing root and holding an
> @@ -1264,7 +1290,14 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> if (pinned_sb)
> deactivate_super(pinned_sb);
>
> - return dentry;
> + return ret;
> +
> +err_restart:
> + msleep(10);
> + return restart_syscall();
> +err_unlock:
> + mutex_unlock(&cgroup_mutex);
> + return ret;
> }
>
> static int __init cgroup1_wq_init(void)
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 48dbf249bec5..3c3c40cad257 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -54,6 +54,7 @@
> #include <linux/proc_ns.h>
> #include <linux/nsproxy.h>
> #include <linux/file.h>
> +#include <linux/fs_parser.h>
> #include <linux/sched/cputime.h>
> #include <net/sock.h>
>
> @@ -1737,25 +1738,51 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
> return len;
> }
>
> -static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
> -{
> - char *token;
> +enum cgroup2_param {
> + Opt_nsdelegate,
> + nr__cgroup2_params
> +};
>
> - *root_flags = 0;
> +static const struct fs_parameter_spec cgroup2_param_specs[nr__cgroup2_params] = {
> + [Opt_nsdelegate] = { fs_param_is_flag },
> +};
>
> - if (!data)
> - return 0;
> +static const char *const cgroup2_param_keys[nr__cgroup2_params] = {
> + [Opt_nsdelegate] = "nsdelegate",
> +};
>
> - while ((token = strsep(&data, ",")) != NULL) {
> - if (!strcmp(token, "nsdelegate")) {
> - *root_flags |= CGRP_ROOT_NS_DELEGATE;
> - continue;
> - }
> +static const struct fs_parameter_description cgroup2_fs_parameters = {
> + .name = "cgroup2",
> + .nr_params = nr__cgroup2_params,
> + .keys = cgroup2_param_keys,
> + .specs = cgroup2_param_specs,
> + .no_source = true,
> +};
>
> - pr_err("cgroup2: unknown option \"%s\"\n", token);
> - return -EINVAL;
> +static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param)
> +{
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> + struct fs_parse_result result;
> + int opt;
> +
> + opt = fs_parse(fc, &cgroup2_fs_parameters, param, &result);
> + if (opt < 0)
> + return opt;
> +
> + switch (opt) {
> + case Opt_nsdelegate:
> + ctx->flags |= CGRP_ROOT_NS_DELEGATE;
> + return 0;
> }
>
> + return -EINVAL;
> +}
> +
> +static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
> +{
> + if (current->nsproxy->cgroup_ns == &init_cgroup_ns &&
> + cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
> + seq_puts(seq, ",nsdelegate");
> return 0;
> }
>
> @@ -1769,23 +1796,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
> }
> }
>
> -static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
> -{
> - if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
> - seq_puts(seq, ",nsdelegate");
> - return 0;
> -}
> -
> -static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data)
> +static int cgroup_reconfigure(struct kernfs_root *kf_root, struct fs_context *fc)
> {
> - unsigned int root_flags;
> - int ret;
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
>
> - ret = parse_cgroup_root_flags(data, &root_flags);
> - if (ret)
> - return ret;
> -
> - apply_cgroup_root_flags(root_flags);
> + apply_cgroup_root_flags(ctx->flags);
> return 0;
> }
>
> @@ -1873,8 +1888,9 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
> INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
> }
>
> -void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
> +void init_cgroup_root(struct cgroup_fs_context *ctx)
> {
> + struct cgroup_root *root = ctx->root;
> struct cgroup *cgrp = &root->cgrp;
>
> INIT_LIST_HEAD(&root->root_list);
> @@ -1883,12 +1899,12 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
> init_cgroup_housekeeping(cgrp);
> idr_init(&root->cgroup_idr);
>
> - root->flags = opts->flags;
> - if (opts->release_agent)
> - strscpy(root->release_agent_path, opts->release_agent, PATH_MAX);
> - if (opts->name)
> - strscpy(root->name, opts->name, MAX_CGROUP_ROOT_NAMELEN);
> - if (opts->cpuset_clone_children)
> + root->flags = ctx->flags;
> + if (ctx->release_agent)
> + strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
> + if (ctx->name)
> + strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN);
> + if (ctx->cpuset_clone_children)
> set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
> }
>
> @@ -1993,57 +2009,53 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
> return ret;
> }
>
> -struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
> - struct cgroup_root *root, unsigned long magic,
> - struct cgroup_namespace *ns)
> +int cgroup_do_get_tree(struct fs_context *fc)
> {
> - struct dentry *dentry;
> - bool new_sb;
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> + int ret;
>
> - dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
> + ctx->kfc.root = ctx->root->kf_root;
> +
> + ret = kernfs_get_tree(fc);
> + if (ret < 0)
> + goto out_cgrp;
>
> /*
> * In non-init cgroup namespace, instead of root cgroup's dentry,
> * we return the dentry corresponding to the cgroupns->root_cgrp.
> */
> - if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
> + if (ctx->ns != &init_cgroup_ns) {
> struct dentry *nsdentry;
> struct cgroup *cgrp;
>
> mutex_lock(&cgroup_mutex);
> spin_lock_irq(&css_set_lock);
>
> - cgrp = cset_cgroup_from_root(ns->root_cset, root);
> + cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);
>
> spin_unlock_irq(&css_set_lock);
> mutex_unlock(&cgroup_mutex);
>
> - nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
> - dput(dentry);
> - dentry = nsdentry;
> + nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
> + if (IS_ERR(nsdentry))
> + return PTR_ERR(nsdentry);
> + dput(fc->root);
> + fc->root = nsdentry;
> }
>
> - if (IS_ERR(dentry) || !new_sb)
> - cgroup_put(&root->cgrp);

I don't see where this cgroup_put() has been moved.

With this patch, the following script works only once; on the second
attempt it hangs while mounting a cgroup filesystem.

This is the only suspicious place in this patch that I have found.

[root@fc24 ~]# cat fs-vs-cg
d=$(mktemp -d /tmp/cg.XXXXXX)
mkdir $d/a
mkdir $d/b
mount -t cgroup -o none,name=xxxx xxx $d/a
mount -t cgroup -o none,name=xxxx xxx $d/b
umount $d/a
umount $d/b

[root@fc24 ~]# unshare -m --propagation private bash -x fs-vs-cg
++ mktemp -d /tmp/cg.XXXXXX
+ d=/tmp/cg.yUfagS
+ mkdir /tmp/cg.yUfagS/a
+ mkdir /tmp/cg.yUfagS/b
+ mount -t cgroup -o none,name=xxxx xxx /tmp/cg.yUfagS/a
+ mount -t cgroup -o none,name=xxxx xxx /tmp/cg.yUfagS/b
+ umount /tmp/cg.yUfagS/a
+ umount /tmp/cg.yUfagS/b
[root@fc24 ~]# unshare -m --propagation private bash -x fs-vs-cg
++ mktemp -d /tmp/cg.XXXXXX
+ d=/tmp/cg.ippWUn
+ mkdir /tmp/cg.ippWUn/a
+ mkdir /tmp/cg.ippWUn/b
+ mount -t cgroup -o none,name=xxxx xxx /tmp/cg.ippWUn/a
^Z
[1]+ Stopped unshare -m --propagation private bash -x fs-vs-cg

[root@fc24 ~]# ps
PID TTY TIME CMD
556 pts/0 00:00:00 bash
591 pts/0 00:00:00 bash
595 pts/0 00:00:00 mount
596 pts/0 00:00:00 ps

[root@fc24 ~]# bg
[1]+ unshare -m --propagation private bash -x fs-vs-cg &

[root@fc24 ~]# cat /proc/595/stack
[<0>] msleep+0x38/0x40
[<0>] cgroup1_get_tree+0x4e1/0x72c
[<0>] vfs_get_tree+0x5e/0x140
[<0>] do_mount+0x326/0xc70
[<0>] ksys_mount+0xba/0xd0
[<0>] __x64_sys_mount+0x21/0x30
[<0>] do_syscall_64+0x60/0x210
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<0>] 0xffffffffffffffff
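
To make the concern concrete, here is a toy userspace model of what the
removed lines appear to have done. It is only a sketch under stated
assumptions, not the kernel's actual accounting: it assumes each mount
attempt takes one reference on the root, that the superblock owns exactly
one reference which is dropped on the last umount, and that the removed
cgroup_put() gave back the extra reference when an existing superblock was
reused. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a cgroup root; only the counters are modelled. */
struct toy_root {
	int refs;	/* references currently held on the root */
	int mounts;	/* active mounts sharing the one superblock */
};

/*
 * Model of one mount attempt.  The caller always takes a reference.
 * If a superblock already exists, the old code dropped that reference
 * again (the "!new_sb" half of the removed check).
 */
static void toy_mount(struct toy_root *r, bool put_when_sb_reused)
{
	bool new_sb = (r->mounts == 0);

	r->refs++;
	r->mounts++;
	if (!new_sb && put_when_sb_reused)
		r->refs--;
}

/* Model of umount: the last umount kills the sb and drops its reference. */
static void toy_umount(struct toy_root *r)
{
	assert(r->mounts > 0);
	if (--r->mounts == 0)
		r->refs--;
}

/* Replays the script: mount a, mount b, umount a, umount b. */
int toy_run_script(bool put_when_sb_reused)
{
	struct toy_root r = { 0, 0 };

	toy_mount(&r, put_when_sb_reused);
	toy_mount(&r, put_when_sb_reused);
	toy_umount(&r);
	toy_umount(&r);
	return r.refs;	/* anything above zero is a leaked reference */
}
```

In this model, toy_run_script(true) ends with no references left, while
toy_run_script(false) leaves one behind. If the real code leaks a
reference in the same way, the old root could never be released, which
would be consistent with the second run looping in msleep() as in the
stack trace above, though I have not confirmed that this is the actual
cause.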

> + ret = 0;
> + if (ctx->kfc.new_sb_created)
> + goto out_cgrp;
> + apply_cgroup_root_flags(ctx->flags);
> + return 0;
>
> - return dentry;
> +out_cgrp:
> + return ret;
> }
>
> -static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> - int flags, const char *unused_dev_name,
> - void *data, size_t data_size)
> +static int cgroup_get_tree(struct fs_context *fc)
> {
> - struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
> - struct dentry *dentry;
> - int ret;
> -
> - get_cgroup_ns(ns);
> -
> - /* Check if the caller has permission to mount. */
> - if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> - put_cgroup_ns(ns);
> - return ERR_PTR(-EPERM);
> - }
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
>
> /*
> * The first time anyone tries to mount a cgroup, enable the list
> @@ -2052,29 +2064,96 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> if (!use_task_css_set_links)
> cgroup_enable_task_cg_lists();
>
> - if (fs_type == &cgroup2_fs_type) {
> - unsigned int root_flags;
> -
> - ret = parse_cgroup_root_flags(data, &root_flags);
> - if (ret) {
> - put_cgroup_ns(ns);
> - return ERR_PTR(ret);
> - }
> + switch (ctx->version) {
> + case 1:
> + return cgroup1_get_tree(fc);
>
> + case 2:
> cgrp_dfl_visible = true;
> cgroup_get_live(&cgrp_dfl_root.cgrp);
>
> - dentry = cgroup_do_mount(&cgroup2_fs_type, flags, &cgrp_dfl_root,
> - CGROUP2_SUPER_MAGIC, ns);
> - if (!IS_ERR(dentry))
> - apply_cgroup_root_flags(root_flags);
> - } else {
> - dentry = cgroup1_mount(&cgroup_fs_type, flags, data,
> - CGROUP_SUPER_MAGIC, ns);
> + ctx->root = &cgrp_dfl_root;
> + return cgroup_do_get_tree(fc);
> +
> + default:
> + BUG();
> + }
> +}
> +
> +static int cgroup_parse_param(struct fs_context *fc, struct fs_parameter *param)
> +{
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> +
> + if (ctx->version == 1)
> + return cgroup1_parse_param(fc, param);
> +
> + return cgroup2_parse_param(fc, param);
> +}
> +
> +static int cgroup_validate(struct fs_context *fc)
> +{
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> +
> + if (ctx->version == 1)
> + return cgroup1_validate(fc);
> + return 0;
> +}
> +
> +/*
> + * Destroy a cgroup filesystem context.
> + */
> +static void cgroup_fs_context_free(struct fs_context *fc)
> +{
> + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> +
> + kfree(ctx->name);
> + kfree(ctx->release_agent);
> + if (ctx->root)
> + cgroup_put(&ctx->root->cgrp);
> + put_cgroup_ns(ctx->ns);
> + kernfs_free_fs_context(fc);
> + kfree(ctx);
> +}
> +
> +static const struct fs_context_operations cgroup_fs_context_ops = {
> + .free = cgroup_fs_context_free,
> + .parse_param = cgroup_parse_param,
> + .validate = cgroup_validate,
> + .get_tree = cgroup_get_tree,
> + .reconfigure = kernfs_reconfigure,
> +};
> +
> +/*
> + * Initialise the cgroup filesystem creation/reconfiguration context. Notably,
> + * we select the namespace we're going to use.
> + */
> +static int cgroup_init_fs_context(struct fs_context *fc, struct dentry *reference)
> +{
> + struct cgroup_fs_context *ctx;
> + struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
> +
> + switch (fc->purpose) {
> + case FS_CONTEXT_FOR_UMOUNT:
> + case FS_CONTEXT_FOR_EMERGENCY_RO:
> + return -EOPNOTSUPP;
> + default:
> + break;
> }
>
> - put_cgroup_ns(ns);
> - return dentry;
> + /* Check if the caller has permission to mount. */
> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + ctx = kzalloc(sizeof(struct cgroup_fs_context), GFP_KERNEL);
> + if (!ctx)
> + return -ENOMEM;
> +
> + ctx->ns = get_cgroup_ns(ns);
> + ctx->version = (fc->fs_type == &cgroup2_fs_type) ? 2 : 1;
> + ctx->kfc.magic = (ctx->version == 2) ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC;
> + fc->fs_private = &ctx->kfc;
> + fc->ops = &cgroup_fs_context_ops;
> + return 0;
> }
>
> static void cgroup_kill_sb(struct super_block *sb)
> @@ -2099,17 +2178,19 @@ static void cgroup_kill_sb(struct super_block *sb)
> }
>
> struct file_system_type cgroup_fs_type = {
> - .name = "cgroup",
> - .mount = cgroup_mount,
> - .kill_sb = cgroup_kill_sb,
> - .fs_flags = FS_USERNS_MOUNT,
> + .name = "cgroup",
> + .init_fs_context = cgroup_init_fs_context,
> + .parameters = &cgroup1_fs_parameters,
> + .kill_sb = cgroup_kill_sb,
> + .fs_flags = FS_USERNS_MOUNT,
> };
>
> static struct file_system_type cgroup2_fs_type = {
> - .name = "cgroup2",
> - .mount = cgroup_mount,
> - .kill_sb = cgroup_kill_sb,
> - .fs_flags = FS_USERNS_MOUNT,
> + .name = "cgroup2",
> + .init_fs_context = cgroup_init_fs_context,
> + .parameters = &cgroup2_fs_parameters,
> + .kill_sb = cgroup_kill_sb,
> + .fs_flags = FS_USERNS_MOUNT,
> };
>
> int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
> @@ -5179,7 +5260,7 @@ int cgroup_rmdir(struct kernfs_node *kn)
>
> static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
> .show_options = cgroup_show_options,
> - .remount_fs = cgroup_remount,
> + .reconfigure = cgroup_reconfigure,
> .mkdir = cgroup_mkdir,
> .rmdir = cgroup_rmdir,
> .show_path = cgroup_show_path,
> @@ -5246,11 +5327,12 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
> */
> int __init cgroup_init_early(void)
> {
> - static struct cgroup_sb_opts __initdata opts;
> + static struct cgroup_fs_context __initdata ctx;
> struct cgroup_subsys *ss;
> int i;
>
> - init_cgroup_root(&cgrp_dfl_root, &opts);
> + ctx.root = &cgrp_dfl_root;
> + init_cgroup_root(&ctx);
> cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;
>
> RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index df78e166028c..b4ad1a52f006 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -324,10 +324,8 @@ static int cpuset_get_tree(struct fs_context *fc)
> int ret = -ENODEV;
>
> cgroup_fs = get_fs_type("cgroup");
> - if (cgroup_fs) {
> - ret = PTR_ERR(cgroup_fs);
> + if (!cgroup_fs)
> goto out;
> - }
>
> cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->sb_flags,
> fc->purpose);

2018-12-06 17:12:48

by Andrei Vagin

Subject: Re: [PATCH 21/34] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #12]

On Sun, Nov 18, 2018 at 08:23:42PM -0800, Andrei Vagin wrote:
> On Fri, Sep 21, 2018 at 05:33:01PM +0100, David Howells wrote:
> > @@ -1993,57 +2009,53 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
> > return ret;
> > }
> >
> > -struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
> > - struct cgroup_root *root, unsigned long magic,
> > - struct cgroup_namespace *ns)
> > +int cgroup_do_get_tree(struct fs_context *fc)
> > {
> > - struct dentry *dentry;
> > - bool new_sb;
> > + struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> > + int ret;
> >
> > - dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
> > + ctx->kfc.root = ctx->root->kf_root;
> > +
> > + ret = kernfs_get_tree(fc);
> > + if (ret < 0)
> > + goto out_cgrp;
> >
> > /*
> > * In non-init cgroup namespace, instead of root cgroup's dentry,
> > * we return the dentry corresponding to the cgroupns->root_cgrp.
> > */
> > - if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
> > + if (ctx->ns != &init_cgroup_ns) {
> > struct dentry *nsdentry;
> > struct cgroup *cgrp;
> >
> > mutex_lock(&cgroup_mutex);
> > spin_lock_irq(&css_set_lock);
> >
> > - cgrp = cset_cgroup_from_root(ns->root_cset, root);
> > + cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);
> >
> > spin_unlock_irq(&css_set_lock);
> > mutex_unlock(&cgroup_mutex);
> >
> > - nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
> > - dput(dentry);
> > - dentry = nsdentry;
> > + nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
> > + if (IS_ERR(nsdentry))
> > + return PTR_ERR(nsdentry);
> > + dput(fc->root);
> > + fc->root = nsdentry;
> > }
> >
> > - if (IS_ERR(dentry) || !new_sb)
> > - cgroup_put(&root->cgrp);
>
> I don't see where this cgroup_put() has been moved.

David, have you looked at this problem? It isn't fixed in linux-next
yet.

https://travis-ci.org/avagin/linux/jobs/463960763

Thanks,
Andrei

>
> With this patch, the next script works only once, on the second attempt
> it hangs up on mounting a cgroup file system.
>
> This is the only suspicious place in this patch what I have found.
>
> [root@fc24 ~]# cat fs-vs-cg
> d=$(mktemp -d /tmp/cg.XXXXXX)
> mkdir $d/a
> mkdir $d/b
> mount -t cgroup -o none,name=xxxx xxx $d/a
> mount -t cgroup -o none,name=xxxx xxx $d/b
> umount $d/a
> umount $d/b
>
> [root@fc24 ~]# unshare -m --propagation private bash -x fs-vs-cg
> ++ mktemp -d /tmp/cg.XXXXXX
> + d=/tmp/cg.yUfagS
> + mkdir /tmp/cg.yUfagS/a
> + mkdir /tmp/cg.yUfagS/b
> + mount -t cgroup -o none,name=xxxx xxx /tmp/cg.yUfagS/a
> + mount -t cgroup -o none,name=xxxx xxx /tmp/cg.yUfagS/b
> + umount /tmp/cg.yUfagS/a
> + umount /tmp/cg.yUfagS/b
> [root@fc24 ~]# unshare -m --propagation private bash -x fs-vs-cg
> ++ mktemp -d /tmp/cg.XXXXXX
> + d=/tmp/cg.ippWUn
> + mkdir /tmp/cg.ippWUn/a
> + mkdir /tmp/cg.ippWUn/b
> + mount -t cgroup -o none,name=xxxx xxx /tmp/cg.ippWUn/a
> ^Z
> [1]+ Stopped unshare -m --propagation private bash -x fs-vs-cg
>
> [root@fc24 ~]# ps
> PID TTY TIME CMD
> 556 pts/0 00:00:00 bash
> 591 pts/0 00:00:00 bash
> 595 pts/0 00:00:00 mount
> 596 pts/0 00:00:00 ps
>
> [root@fc24 ~]# bg
> [1]+ unshare -m --propagation private bash -x fs-vs-cg &
>
> [root@fc24 ~]# cat /proc/595/stack
> [<0>] msleep+0x38/0x40
> [<0>] cgroup1_get_tree+0x4e1/0x72c
> [<0>] vfs_get_tree+0x5e/0x140
> [<0>] do_mount+0x326/0xc70
> [<0>] ksys_mount+0xba/0xd0
> [<0>] __x64_sys_mount+0x21/0x30
> [<0>] do_syscall_64+0x60/0x210
> [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [<0>] 0xffffffffffffffff
>

2018-12-17 14:26:28

by Anders Roxell

Subject: Re: [PATCH 34/34] vfs: Add a sample program for the new mount API [ver #12]

On Fri, 21 Sep 2018 at 18:34, David Howells <[email protected]> wrote:
>
> Add a sample program to demonstrate fsopen/fsmount/move_mount to mount
> something.
>
> Signed-off-by: David Howells <[email protected]>

I was trying to build today's linux-next tag, next-20181217, and ran into
the build error below when building an allmodconfig kernel. I think this
patch triggers the build error.

In file included from /usr/include/x86_64-linux-gnu/sys/stat.h:446,
from ../samples/vfs/test-statx.c:28:
/usr/include/x86_64-linux-gnu/bits/statx.h:25:8: error: redefinition of ‘struct statx_timestamp’
struct statx_timestamp
^~~~~~~~~~~~~~~
In file included from ../samples/vfs/test-statx.c:26:
./usr/include/linux/stat.h:56:8: note: originally defined here
struct statx_timestamp {
^~~~~~~~~~~~~~~
In file included from /usr/include/x86_64-linux-gnu/sys/stat.h:446,
from ../samples/vfs/test-statx.c:28:
/usr/include/x86_64-linux-gnu/bits/statx.h:36:8: error: redefinition of ‘struct statx’
struct statx
^~~~~
In file included from ../samples/vfs/test-statx.c:26:
./usr/include/linux/stat.h:99:8: note: originally defined here
struct statx {
^~~~~
../samples/vfs/test-statx.c:40:9: error: conflicting types for ‘statx’
ssize_t statx(int dfd, const char *filename, unsigned flags,
^~~~~
In file included from /usr/include/x86_64-linux-gnu/sys/stat.h:446,
from ../samples/vfs/test-statx.c:28:
/usr/include/x86_64-linux-gnu/bits/statx.h:87:5: note: previous declaration of ‘statx’ was here
int statx (int __dirfd, const char *__restrict __path, int __flags,
^~~~~
make[3]: *** [scripts/Makefile.host:90: samples/vfs/test-statx] Error 1
make[3]: Target '__build' not remade because of errors.
make[2]: *** [../scripts/Makefile.build:492: samples/vfs] Error 2
make[2]: Target '__build' not remade because of errors.
make[1]: *** [/srv/src/kernel/next/Makefile:1065: samples] Error 2
make[1]: Target 'bzImage' not remade because of errors.
make: *** [Makefile:152: sub-make] Error 2
make: Target 'bzImage' not remade because of errors.

My libc version:
$ dpkg -l libc6
ii libc6:amd64 2.28-2 amd64 GNU C Library: Shared libraries

Any idea what I'm doing wrong?
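
For reference, the errors above look like a clash between the kernel's
<linux/stat.h> and the statx declarations that glibc 2.28 pulls in through
<sys/stat.h>. The following is a minimal sketch of one way a test program
could sidestep the clash, assuming it is acceptable to skip <sys/stat.h>
altogether and to give the wrapper a different name. raw_statx() and
path_is_dir() are hypothetical names used only for illustration, and this
is not necessarily how the sample should be fixed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>		/* AT_FDCWD, AT_SYMLINK_NOFOLLOW, S_IFMT */
#include <string.h>
#include <sys/syscall.h>	/* SYS_statx */
#include <unistd.h>		/* syscall() */
#include <linux/stat.h>		/* struct statx, STATX_* from the kernel UAPI */

/*
 * Hypothetical wrapper name, chosen so that it cannot collide with the
 * statx() that newer glibc declares in <sys/stat.h>.
 */
static long raw_statx(int dfd, const char *path, unsigned int flags,
		      unsigned int mask, struct statx *buf)
{
	return syscall(SYS_statx, dfd, path, flags, mask, buf);
}

/*
 * Returns 1 if path is a directory, 0 if it is something else, and -1 if
 * the call failed or the kernel did not report the file type.
 */
int path_is_dir(const char *path)
{
	struct statx stx;

	assert(path != NULL);
	memset(&stx, 0, sizeof(stx));
	if (raw_statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW, STATX_TYPE, &stx) < 0)
		return -1;
	if (!(stx.stx_mask & STATX_TYPE))
		return -1;
	return (stx.stx_mode & S_IFMT) == S_IFDIR;
}
```

The sketch needs kernel headers that already define struct statx and
SYS_statx. Calling path_is_dir("/") exercises the syscall without ever
including <sys/stat.h>, so only one definition of struct statx is seen.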

Cheers,
Anders

> ---
>
> samples/Kconfig | 10 +-
> samples/Makefile | 2
> samples/statx/Makefile | 7 -
> samples/statx/test-statx.c | 258 --------------------------------------------
> samples/vfs/Makefile | 10 ++
> samples/vfs/test-fsmount.c | 118 ++++++++++++++++++++
> samples/vfs/test-statx.c | 258 ++++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 393 insertions(+), 270 deletions(-)
> delete mode 100644 samples/statx/Makefile
> delete mode 100644 samples/statx/test-statx.c
> create mode 100644 samples/vfs/Makefile
> create mode 100644 samples/vfs/test-fsmount.c
> create mode 100644 samples/vfs/test-statx.c
>
> diff --git a/samples/Kconfig b/samples/Kconfig
> index bd133efc1a56..8df1c012820f 100644
> --- a/samples/Kconfig
> +++ b/samples/Kconfig
> @@ -146,10 +146,12 @@ config SAMPLE_VFIO_MDEV_MBOCHS
> Specifically it does *not* include any legacy vga stuff.
> Device looks a lot like "qemu -device secondary-vga".
>
> -config SAMPLE_STATX
> - bool "Build example extended-stat using code"
> - depends on BROKEN
> +config SAMPLE_VFS
> + bool "Build example programs that use new VFS system calls"
> + depends on X86
> help
> - Build example userspace program to use the new extended-stat syscall.
> + Build example userspace programs that use new VFS system calls such
> + as mount API and statx(). Note that this is restricted to the x86
> + arch whilst it accesses system calls that aren't yet in all arches.
>
> endif # SAMPLES
> diff --git a/samples/Makefile b/samples/Makefile
> index bd601c038b86..c5a6175c2d3f 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -3,4 +3,4 @@
> obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ trace_events/ livepatch/ \
> hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
> configfs/ connector/ v4l/ trace_printk/ \
> - vfio-mdev/ statx/ qmi/
> + vfio-mdev/ vfs/ qmi/
> diff --git a/samples/statx/Makefile b/samples/statx/Makefile
> deleted file mode 100644
> index 59df7c25a9d1..000000000000
> --- a/samples/statx/Makefile
> +++ /dev/null
> @@ -1,7 +0,0 @@
> -# List of programs to build
> -hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx
> -
> -# Tell kbuild to always build the programs
> -always := $(hostprogs-y)
> -
> -HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
> diff --git a/samples/statx/test-statx.c b/samples/statx/test-statx.c
> deleted file mode 100644
> index d4d77b09412c..000000000000
> --- a/samples/statx/test-statx.c
> +++ /dev/null
> @@ -1,258 +0,0 @@
> -/* Test the statx() system call.
> - *
> - * Note that the output of this program is intended to look like the output of
> - * /bin/stat where possible.
> - *
> - * Copyright (C) 2015 Red Hat, Inc. All Rights Reserved.
> - * Written by David Howells ([email protected])
> - *
> - * This program is free software; you can redistribute it and/or
> - * modify it under the terms of the GNU General Public Licence
> - * as published by the Free Software Foundation; either version
> - * 2 of the Licence, or (at your option) any later version.
> - */
> -
> -#define _GNU_SOURCE
> -#define _ATFILE_SOURCE
> -#include <stdio.h>
> -#include <stdlib.h>
> -#include <string.h>
> -#include <unistd.h>
> -#include <ctype.h>
> -#include <errno.h>
> -#include <time.h>
> -#include <sys/syscall.h>
> -#include <sys/types.h>
> -#include <linux/stat.h>
> -#include <linux/fcntl.h>
> -#include <sys/stat.h>
> -
> -#define AT_STATX_SYNC_TYPE 0x6000
> -#define AT_STATX_SYNC_AS_STAT 0x0000
> -#define AT_STATX_FORCE_SYNC 0x2000
> -#define AT_STATX_DONT_SYNC 0x4000
> -
> -static __attribute__((unused))
> -ssize_t statx(int dfd, const char *filename, unsigned flags,
> - unsigned int mask, struct statx *buffer)
> -{
> - return syscall(__NR_statx, dfd, filename, flags, mask, buffer);
> -}
> -
> -static void print_time(const char *field, struct statx_timestamp *ts)
> -{
> - struct tm tm;
> - time_t tim;
> - char buffer[100];
> - int len;
> -
> - tim = ts->tv_sec;
> - if (!localtime_r(&tim, &tm)) {
> - perror("localtime_r");
> - exit(1);
> - }
> - len = strftime(buffer, 100, "%F %T", &tm);
> - if (len == 0) {
> - perror("strftime");
> - exit(1);
> - }
> - printf("%s", field);
> - fwrite(buffer, 1, len, stdout);
> - printf(".%09u", ts->tv_nsec);
> - len = strftime(buffer, 100, "%z", &tm);
> - if (len == 0) {
> - perror("strftime2");
> - exit(1);
> - }
> - fwrite(buffer, 1, len, stdout);
> - printf("\n");
> -}
> -
> -static void dump_statx(struct statx *stx)
> -{
> - char buffer[256], ft = '?';
> -
> - printf("results=%x\n", stx->stx_mask);
> -
> - printf(" ");
> - if (stx->stx_mask & STATX_SIZE)
> - printf(" Size: %-15llu", (unsigned long long)stx->stx_size);
> - if (stx->stx_mask & STATX_BLOCKS)
> - printf(" Blocks: %-10llu", (unsigned long long)stx->stx_blocks);
> - printf(" IO Block: %-6llu", (unsigned long long)stx->stx_blksize);
> - if (stx->stx_mask & STATX_TYPE) {
> - switch (stx->stx_mode & S_IFMT) {
> - case S_IFIFO: printf(" FIFO\n"); ft = 'p'; break;
> - case S_IFCHR: printf(" character special file\n"); ft = 'c'; break;
> - case S_IFDIR: printf(" directory\n"); ft = 'd'; break;
> - case S_IFBLK: printf(" block special file\n"); ft = 'b'; break;
> - case S_IFREG: printf(" regular file\n"); ft = '-'; break;
> - case S_IFLNK: printf(" symbolic link\n"); ft = 'l'; break;
> - case S_IFSOCK: printf(" socket\n"); ft = 's'; break;
> - default:
> - printf(" unknown type (%o)\n", stx->stx_mode & S_IFMT);
> - break;
> - }
> - } else {
> - printf(" no type\n");
> - }
> -
> - sprintf(buffer, "%02x:%02x", stx->stx_dev_major, stx->stx_dev_minor);
> - printf("Device: %-15s", buffer);
> - if (stx->stx_mask & STATX_INO)
> - printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
> - if (stx->stx_mask & STATX_NLINK)
> - printf(" Links: %-5u", stx->stx_nlink);
> - if (stx->stx_mask & STATX_TYPE) {
> - switch (stx->stx_mode & S_IFMT) {
> - case S_IFBLK:
> - case S_IFCHR:
> - printf(" Device type: %u,%u",
> - stx->stx_rdev_major, stx->stx_rdev_minor);
> - break;
> - }
> - }
> - printf("\n");
> -
> - if (stx->stx_mask & STATX_MODE)
> - printf("Access: (%04o/%c%c%c%c%c%c%c%c%c%c) ",
> - stx->stx_mode & 07777,
> - ft,
> - stx->stx_mode & S_IRUSR ? 'r' : '-',
> - stx->stx_mode & S_IWUSR ? 'w' : '-',
> - stx->stx_mode & S_IXUSR ? 'x' : '-',
> - stx->stx_mode & S_IRGRP ? 'r' : '-',
> - stx->stx_mode & S_IWGRP ? 'w' : '-',
> - stx->stx_mode & S_IXGRP ? 'x' : '-',
> - stx->stx_mode & S_IROTH ? 'r' : '-',
> - stx->stx_mode & S_IWOTH ? 'w' : '-',
> - stx->stx_mode & S_IXOTH ? 'x' : '-');
> - if (stx->stx_mask & STATX_UID)
> - printf("Uid: %5d ", stx->stx_uid);
> - if (stx->stx_mask & STATX_GID)
> - printf("Gid: %5d\n", stx->stx_gid);
> -
> - if (stx->stx_mask & STATX_ATIME)
> - print_time("Access: ", &stx->stx_atime);
> - if (stx->stx_mask & STATX_MTIME)
> - print_time("Modify: ", &stx->stx_mtime);
> - if (stx->stx_mask & STATX_CTIME)
> - print_time("Change: ", &stx->stx_ctime);
> - if (stx->stx_mask & STATX_BTIME)
> - print_time(" Birth: ", &stx->stx_btime);
> -
> - if (stx->stx_attributes_mask) {
> - unsigned char bits, mbits;
> - int loop, byte;
> -
> - static char attr_representation[64 + 1] =
> - /* STATX_ATTR_ flags: */
> - "????????" /* 63-56 */
> - "????????" /* 55-48 */
> - "????????" /* 47-40 */
> - "????????" /* 39-32 */
> - "????????" /* 31-24 0x00000000-ff000000 */
> - "????????" /* 23-16 0x00000000-00ff0000 */
> - "???me???" /* 15- 8 0x00000000-0000ff00 */
> - "?dai?c??" /* 7- 0 0x00000000-000000ff */
> - ;
> -
> - printf("Attributes: %016llx (", stx->stx_attributes);
> - for (byte = 64 - 8; byte >= 0; byte -= 8) {
> - bits = stx->stx_attributes >> byte;
> - mbits = stx->stx_attributes_mask >> byte;
> - for (loop = 7; loop >= 0; loop--) {
> - int bit = byte + loop;
> -
> - if (!(mbits & 0x80))
> - putchar('.'); /* Not supported */
> - else if (bits & 0x80)
> - putchar(attr_representation[63 - bit]);
> - else
> - putchar('-'); /* Not set */
> - bits <<= 1;
> - mbits <<= 1;
> - }
> - if (byte)
> - putchar(' ');
> - }
> - printf(")\n");
> - }
> -}
> -
> -static void dump_hex(unsigned long long *data, int from, int to)
> -{
> - unsigned offset, print_offset = 1, col = 0;
> -
> - from /= 8;
> - to = (to + 7) / 8;
> -
> - for (offset = from; offset < to; offset++) {
> - if (print_offset) {
> - printf("%04x: ", offset * 8);
> - print_offset = 0;
> - }
> - printf("%016llx", data[offset]);
> - col++;
> - if ((col & 3) == 0) {
> - printf("\n");
> - print_offset = 1;
> - } else {
> - printf(" ");
> - }
> - }
> -
> - if (!print_offset)
> - printf("\n");
> -}
> -
> -int main(int argc, char **argv)
> -{
> - struct statx stx;
> - int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
> -
> - unsigned int mask = STATX_ALL;
> -
> - for (argv++; *argv; argv++) {
> - if (strcmp(*argv, "-F") == 0) {
> - atflag &= ~AT_STATX_SYNC_TYPE;
> - atflag |= AT_STATX_FORCE_SYNC;
> - continue;
> - }
> - if (strcmp(*argv, "-D") == 0) {
> - atflag &= ~AT_STATX_SYNC_TYPE;
> - atflag |= AT_STATX_DONT_SYNC;
> - continue;
> - }
> - if (strcmp(*argv, "-L") == 0) {
> - atflag &= ~AT_SYMLINK_NOFOLLOW;
> - continue;
> - }
> - if (strcmp(*argv, "-O") == 0) {
> - mask &= ~STATX_BASIC_STATS;
> - continue;
> - }
> - if (strcmp(*argv, "-A") == 0) {
> - atflag |= AT_NO_AUTOMOUNT;
> - continue;
> - }
> - if (strcmp(*argv, "-R") == 0) {
> - raw = 1;
> - continue;
> - }
> -
> - memset(&stx, 0xbf, sizeof(stx));
> - ret = statx(AT_FDCWD, *argv, atflag, mask, &stx);
> - printf("statx(%s) = %d\n", *argv, ret);
> - if (ret < 0) {
> - perror(*argv);
> - exit(1);
> - }
> -
> - if (raw)
> - dump_hex((unsigned long long *)&stx, 0, sizeof(stx));
> -
> - dump_statx(&stx);
> - }
> - return 0;
> -}
> diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
> new file mode 100644
> index 000000000000..4ac9690fb3c4
> --- /dev/null
> +++ b/samples/vfs/Makefile
> @@ -0,0 +1,10 @@
> +# List of programs to build
> +hostprogs-$(CONFIG_SAMPLE_VFS) := \
> + test-fsmount \
> + test-statx
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
> +HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
> diff --git a/samples/vfs/test-fsmount.c b/samples/vfs/test-fsmount.c
> new file mode 100644
> index 000000000000..74124025ade0
> --- /dev/null
> +++ b/samples/vfs/test-fsmount.c
> @@ -0,0 +1,118 @@
> +/* fd-based mount test.
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <sys/prctl.h>
> +#include <sys/wait.h>
> +#include <linux/fs.h>
> +#include <linux/unistd.h>
> +
> +#define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
> +
> +static void check_messages(int fd)
> +{
> + char buf[4096];
> + int err, n;
> +
> + err = errno;
> +
> + for (;;) {
> + n = read(fd, buf, sizeof(buf));
> + if (n < 0)
> + break;
> + n -= 2;
> +
> + switch (buf[0]) {
> + case 'e':
> + fprintf(stderr, "Error: %*.*s\n", n, n, buf + 2);
> + break;
> + case 'w':
> + fprintf(stderr, "Warning: %*.*s\n", n, n, buf + 2);
> + break;
> + case 'i':
> + fprintf(stderr, "Info: %*.*s\n", n, n, buf + 2);
> + break;
> + }
> + }
> +
> + errno = err;
> +}
> +
> +static __attribute__((noreturn))
> +void mount_error(int fd, const char *s)
> +{
> + check_messages(fd);
> + fprintf(stderr, "%s: %m\n", s);
> + exit(1);
> +}
> +
> +static inline int fsopen(const char *fs_name, unsigned int flags)
> +{
> + return syscall(__NR_fsopen, fs_name, flags);
> +}
> +
> +static inline int fsmount(int fsfd, unsigned int flags, unsigned int ms_flags)
> +{
> + return syscall(__NR_fsmount, fsfd, flags, ms_flags);
> +}
> +
> +static inline int fsconfig(int fsfd, unsigned int cmd,
> + const char *key, const void *val, int aux)
> +{
> + return syscall(__NR_fsconfig, fsfd, cmd, key, val, aux);
> +}
> +
> +static inline int move_mount(int from_dfd, const char *from_pathname,
> + int to_dfd, const char *to_pathname,
> + unsigned int flags)
> +{
> + return syscall(__NR_move_mount,
> + from_dfd, from_pathname,
> + to_dfd, to_pathname, flags);
> +}
> +
> +#define E_fsconfig(fd, cmd, key, val, aux) \
> + do { \
> + if (fsconfig(fd, cmd, key, val, aux) == -1) \
> + mount_error(fd, key ?: "create"); \
> + } while (0)
> +
> +int main(int argc, char *argv[])
> +{
> + int fsfd, mfd;
> +
> + /* Mount a publicly available AFS filesystem */
> + fsfd = fsopen("afs", 0);
> + if (fsfd == -1) {
> + perror("fsopen");
> + exit(1);
> + }
> +
> + E_fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "#grand.central.org:root.cell.", 0);
> + E_fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +
> + mfd = fsmount(fsfd, 0, MS_RDONLY);
> + if (mfd < 0)
> + mount_error(fsfd, "fsmount");
> + E(close(fsfd));
> +
> + if (move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH) < 0) {
> + perror("move_mount");
> + exit(1);
> + }
> +
> + E(close(mfd));
> + exit(0);
> +}
> diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
> new file mode 100644
> index 000000000000..d4d77b09412c
> --- /dev/null
> +++ b/samples/vfs/test-statx.c
> @@ -0,0 +1,258 @@
> +/* Test the statx() system call.
> + *
> + * Note that the output of this program is intended to look like the output of
> + * /bin/stat where possible.
> + *
> + * Copyright (C) 2015 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define _GNU_SOURCE
> +#define _ATFILE_SOURCE
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <ctype.h>
> +#include <errno.h>
> +#include <time.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <linux/stat.h>
> +#include <linux/fcntl.h>
> +#include <sys/stat.h>
> +
> +#define AT_STATX_SYNC_TYPE 0x6000
> +#define AT_STATX_SYNC_AS_STAT 0x0000
> +#define AT_STATX_FORCE_SYNC 0x2000
> +#define AT_STATX_DONT_SYNC 0x4000
> +
> +static __attribute__((unused))
> +ssize_t statx(int dfd, const char *filename, unsigned flags,
> + unsigned int mask, struct statx *buffer)
> +{
> + return syscall(__NR_statx, dfd, filename, flags, mask, buffer);
> +}
> +
> +static void print_time(const char *field, struct statx_timestamp *ts)
> +{
> + struct tm tm;
> + time_t tim;
> + char buffer[100];
> + int len;
> +
> + tim = ts->tv_sec;
> + if (!localtime_r(&tim, &tm)) {
> + perror("localtime_r");
> + exit(1);
> + }
> + len = strftime(buffer, 100, "%F %T", &tm);
> + if (len == 0) {
> + perror("strftime");
> + exit(1);
> + }
> + printf("%s", field);
> + fwrite(buffer, 1, len, stdout);
> + printf(".%09u", ts->tv_nsec);
> + len = strftime(buffer, 100, "%z", &tm);
> + if (len == 0) {
> + perror("strftime2");
> + exit(1);
> + }
> + fwrite(buffer, 1, len, stdout);
> + printf("\n");
> +}
> +
> +static void dump_statx(struct statx *stx)
> +{
> + char buffer[256], ft = '?';
> +
> + printf("results=%x\n", stx->stx_mask);
> +
> + printf(" ");
> + if (stx->stx_mask & STATX_SIZE)
> + printf(" Size: %-15llu", (unsigned long long)stx->stx_size);
> + if (stx->stx_mask & STATX_BLOCKS)
> + printf(" Blocks: %-10llu", (unsigned long long)stx->stx_blocks);
> + printf(" IO Block: %-6llu", (unsigned long long)stx->stx_blksize);
> + if (stx->stx_mask & STATX_TYPE) {
> + switch (stx->stx_mode & S_IFMT) {
> + case S_IFIFO: printf(" FIFO\n"); ft = 'p'; break;
> + case S_IFCHR: printf(" character special file\n"); ft = 'c'; break;
> + case S_IFDIR: printf(" directory\n"); ft = 'd'; break;
> + case S_IFBLK: printf(" block special file\n"); ft = 'b'; break;
> + case S_IFREG: printf(" regular file\n"); ft = '-'; break;
> + case S_IFLNK: printf(" symbolic link\n"); ft = 'l'; break;
> + case S_IFSOCK: printf(" socket\n"); ft = 's'; break;
> + default:
> + printf(" unknown type (%o)\n", stx->stx_mode & S_IFMT);
> + break;
> + }
> + } else {
> + printf(" no type\n");
> + }
> +
> + sprintf(buffer, "%02x:%02x", stx->stx_dev_major, stx->stx_dev_minor);
> + printf("Device: %-15s", buffer);
> + if (stx->stx_mask & STATX_INO)
> + printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
> + if (stx->stx_mask & STATX_NLINK)
> + printf(" Links: %-5u", stx->stx_nlink);
> + if (stx->stx_mask & STATX_TYPE) {
> + switch (stx->stx_mode & S_IFMT) {
> + case S_IFBLK:
> + case S_IFCHR:
> + printf(" Device type: %u,%u",
> + stx->stx_rdev_major, stx->stx_rdev_minor);
> + break;
> + }
> + }
> + printf("\n");
> +
> + if (stx->stx_mask & STATX_MODE)
> + printf("Access: (%04o/%c%c%c%c%c%c%c%c%c%c) ",
> + stx->stx_mode & 07777,
> + ft,
> + stx->stx_mode & S_IRUSR ? 'r' : '-',
> + stx->stx_mode & S_IWUSR ? 'w' : '-',
> + stx->stx_mode & S_IXUSR ? 'x' : '-',
> + stx->stx_mode & S_IRGRP ? 'r' : '-',
> + stx->stx_mode & S_IWGRP ? 'w' : '-',
> + stx->stx_mode & S_IXGRP ? 'x' : '-',
> + stx->stx_mode & S_IROTH ? 'r' : '-',
> + stx->stx_mode & S_IWOTH ? 'w' : '-',
> + stx->stx_mode & S_IXOTH ? 'x' : '-');
> + if (stx->stx_mask & STATX_UID)
> + printf("Uid: %5u ", stx->stx_uid);
> + if (stx->stx_mask & STATX_GID)
> + printf("Gid: %5u\n", stx->stx_gid);
> +
> + if (stx->stx_mask & STATX_ATIME)
> + print_time("Access: ", &stx->stx_atime);
> + if (stx->stx_mask & STATX_MTIME)
> + print_time("Modify: ", &stx->stx_mtime);
> + if (stx->stx_mask & STATX_CTIME)
> + print_time("Change: ", &stx->stx_ctime);
> + if (stx->stx_mask & STATX_BTIME)
> + print_time(" Birth: ", &stx->stx_btime);
> +
> + if (stx->stx_attributes_mask) {
> + unsigned char bits, mbits;
> + int loop, byte;
> +
> + static char attr_representation[64 + 1] =
> + /* STATX_ATTR_ flags: */
> + "????????" /* 63-56 */
> + "????????" /* 55-48 */
> + "????????" /* 47-40 */
> + "????????" /* 39-32 */
> + "????????" /* 31-24 0x00000000-ff000000 */
> + "????????" /* 23-16 0x00000000-00ff0000 */
> + "???me???" /* 15- 8 0x00000000-0000ff00 */
> + "?dai?c??" /* 7- 0 0x00000000-000000ff */
> + ;
> +
> + printf("Attributes: %016llx (", stx->stx_attributes);
> + for (byte = 64 - 8; byte >= 0; byte -= 8) {
> + bits = stx->stx_attributes >> byte;
> + mbits = stx->stx_attributes_mask >> byte;
> + for (loop = 7; loop >= 0; loop--) {
> + int bit = byte + loop;
> +
> + if (!(mbits & 0x80))
> + putchar('.'); /* Not supported */
> + else if (bits & 0x80)
> + putchar(attr_representation[63 - bit]);
> + else
> + putchar('-'); /* Not set */
> + bits <<= 1;
> + mbits <<= 1;
> + }
> + if (byte)
> + putchar(' ');
> + }
> + printf(")\n");
> + }
> +}
> +
> +static void dump_hex(unsigned long long *data, int from, int to)
> +{
> + unsigned offset, print_offset = 1, col = 0;
> +
> + from /= 8;
> + to = (to + 7) / 8;
> +
> + for (offset = from; offset < to; offset++) {
> + if (print_offset) {
> + printf("%04x: ", offset * 8);
> + print_offset = 0;
> + }
> + printf("%016llx", data[offset]);
> + col++;
> + if ((col & 3) == 0) {
> + printf("\n");
> + print_offset = 1;
> + } else {
> + printf(" ");
> + }
> + }
> +
> + if (!print_offset)
> + printf("\n");
> +}
> +
> +int main(int argc, char **argv)
> +{
> + struct statx stx;
> + int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
> +
> + unsigned int mask = STATX_ALL;
> +
> + for (argv++; *argv; argv++) {
> + if (strcmp(*argv, "-F") == 0) {
> + atflag &= ~AT_STATX_SYNC_TYPE;
> + atflag |= AT_STATX_FORCE_SYNC;
> + continue;
> + }
> + if (strcmp(*argv, "-D") == 0) {
> + atflag &= ~AT_STATX_SYNC_TYPE;
> + atflag |= AT_STATX_DONT_SYNC;
> + continue;
> + }
> + if (strcmp(*argv, "-L") == 0) {
> + atflag &= ~AT_SYMLINK_NOFOLLOW;
> + continue;
> + }
> + if (strcmp(*argv, "-O") == 0) {
> + mask &= ~STATX_BASIC_STATS;
> + continue;
> + }
> + if (strcmp(*argv, "-A") == 0) {
> + atflag |= AT_NO_AUTOMOUNT;
> + continue;
> + }
> + if (strcmp(*argv, "-R") == 0) {
> + raw = 1;
> + continue;
> + }
> +
> + memset(&stx, 0xbf, sizeof(stx));
> + ret = statx(AT_FDCWD, *argv, atflag, mask, &stx);
> + printf("statx(%s) = %d\n", *argv, ret);
> + if (ret < 0) {
> + perror(*argv);
> + exit(1);
> + }
> +
> + if (raw)
> + dump_hex((unsigned long long *)&stx, 0, sizeof(stx));
> +
> + dump_statx(&stx);
> + }
> + return 0;
> +}
>

2018-12-20 17:03:24

by David Howells

Subject: Re: [PATCH 34/34] vfs: Add a sample program for the new mount API [ver #12]


Anders Roxell <[email protected]> wrote:

> In file included from /usr/include/x86_64-linux-gnu/sys/stat.h:446,
> from ../samples/vfs/test-statx.c:28:
> /usr/include/x86_64-linux-gnu/bits/statx.h:25:8: error: redefinition of ‘struct statx_timestamp’

Yeah - the problem is that statx has now made it into glibc, but the sample
program doesn't deal with it turning up in the system headers. I've passed a
patch to Al already.

David
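
For anyone hitting the same build error: one way to avoid the clash, assuming glibc 2.28 or later (which ships its own struct statx and a statx() wrapper), is to let the C library supply both instead of including <linux/stat.h> and defining a private wrapper. This is only a sketch of that approach and is not necessarily the patch referred to above; is_directory() is a hypothetical helper used purely for illustration.

```c
#define _GNU_SOURCE		/* exposes statx() and struct statx in glibc >= 2.28 */
#include <fcntl.h>		/* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
#include <sys/stat.h>		/* statx(), struct statx, STATX_*, S_IFMT */

/*
 * Hypothetical helper: returns 1 if path is a directory, 0 if it is
 * something else, and -1 if statx() failed or did not report the type.
 */
static int is_directory(const char *path)
{
	struct statx stx;

	/* Only the file type is requested, so only STATX_TYPE is checked. */
	if (statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW, STATX_TYPE, &stx) < 0)
		return -1;
	if (!(stx.stx_mask & STATX_TYPE))
		return -1;
	return (stx.stx_mode & S_IFMT) == S_IFDIR;
}
```

Because nothing here includes <linux/stat.h> directly, there is only one definition of struct statx_timestamp in play. The trade-off is that the program no longer builds against a C library that lacks statx().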

2019-03-14 07:47:55

by Geert Uytterhoeven

Subject: Re: [PATCH 07/34] vfs: Add configuration parser helpers [ver #12]

Hi David,

On Fri, Sep 21, 2018 at 6:33 PM David Howells <[email protected]> wrote:
> Because the new API passes in key,value parameters, match_token() cannot be
> used with it. Instead, provide three new helpers to aid with parsing:

> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -8,6 +8,13 @@ menu "File systems"
> config DCACHE_WORD_ACCESS
> bool
>
> +config VALIDATE_FS_PARSER
> + bool "Validate filesystem parameter description"
> + default y
> + help
> + Enable this to perform validation of the parameter description for a
> + filesystem when it is registered.
> +

When would I want to disable this?

It seems this option was introduced in "ver #10" of your patch series,
without being mentioned in the changelog for that version.

Thanks!

Gr{oetje,eeting}s,

Geert


--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2019-03-14 10:30:03

by David Howells

Subject: Re: [PATCH 07/34] vfs: Add configuration parser helpers [ver #12]

Geert Uytterhoeven <[email protected]> wrote:

> When would I want to disable this?
>
> It seems this option was introduced in "ver #10" of your patch series,
> without being mentioned in the changelog for that version.

Sorry, yes - it's a debugging tool to check that the parser tables are vaguely
sane. I set it to default to 'Y' for the moment to catch errors in upcoming
fs conversion development. You probably want to disable it if you're not
doing fs conversion.

David
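
To make "check that the parser tables are vaguely sane" concrete, here is a minimal user-space sketch of the kind of check such a validator might run once, when a filesystem registers its parameter description. It is not the kernel's implementation: struct param_spec, validate_param_table() and the specific rules (no empty names, no duplicate names, option index in range) are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a filesystem parameter description. */
struct param_spec {
	const char *name;	/* parameter key, e.g. "source" */
	int opt;		/* index the filesystem switches on */
};

/*
 * Check a table of nr_params entries and return the number of problems
 * found: a missing or empty name, an opt value outside [0, nr_opts), or a
 * name that duplicates an earlier entry.
 */
static int validate_param_table(const char *fs_name,
				const struct param_spec *params,
				size_t nr_params, int nr_opts)
{
	int errors = 0;
	size_t i, j;

	for (i = 0; i < nr_params; i++) {
		const char *name = params[i].name;

		if (!name || !name[0]) {
			fprintf(stderr, "%s: param[%zu] has no name\n",
				fs_name, i);
			errors++;
			continue;
		}
		if (params[i].opt < 0 || params[i].opt >= nr_opts) {
			fprintf(stderr, "%s: param '%s' has bad opt %d\n",
				fs_name, name, params[i].opt);
			errors++;
		}
		/* Compare against earlier entries only, so each clash counts once. */
		for (j = 0; j < i; j++) {
			if (params[j].name &&
			    strcmp(params[j].name, name) == 0) {
				fprintf(stderr, "%s: duplicate param '%s'\n",
					fs_name, name);
				errors++;
				break;
			}
		}
	}
	return errors;
}
```

A check like this only catches mistakes made while writing or converting a table, which fits the advice above that it can be switched off when no fs conversion work is going on.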

2019-03-14 10:50:18

by Geert Uytterhoeven

Subject: Re: [PATCH 07/34] vfs: Add configuration parser helpers [ver #12]

Hi David,

On Thu, Mar 14, 2019 at 11:27 AM David Howells <[email protected]> wrote:
> Geert Uytterhoeven <[email protected]> wrote:
> > When would I want to disable this?
> >
> > It seems this option was introduced in "ver #10" of your patch series,
> > without being mentioned in the changelog for that version.
>
> Sorry, yes - it's a debugging tool to check that the parser tables are vaguely
> sane. I set it to default to 'Y' for the moment to catch errors in upcoming
> fs conversion development. You probably want to disable it if you're not
> doing fs conversion.

OK thanks. Makes perfect sense, now I see the output of

pr_notice("*** VALIDATE %s ***\n", name);

in my boot test.

Gr{oetje,eeting}s,

Geert
