2018-05-25 02:47:42

by David Howells

Subject: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]


Hi Al,

Can you take a look at this please, in particular the last 6 patches?

Here is a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount. This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

fd = fsopen("nfs");
write(fd, "option=val", ...);
mfd = fsmount(fd, MS_NODEV);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).
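
As a rough illustration of the ordering described above (create a context,
populate it with options, create the superblock, then effect the mount), here
is a small userspace model. It is only a sketch: every demo_* name and every
error code in it is hypothetical and does not come from the patches, and it
does not call the real syscalls.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the patches. */
enum demo_phase {
	DEMO_CONFIGURING,	/* context exists, options are being supplied */
	DEMO_HAS_SUPERBLOCK,	/* superblock has been created from the options */
};

struct demo_context {
	enum demo_phase phase;
	unsigned int nr_options;
	unsigned int nr_mounts;
};

/* Step 1: create the context before any mount exists. */
void demo_open(struct demo_context *ctx)
{
	assert(ctx != NULL);
	ctx->phase = DEMO_CONFIGURING;
	ctx->nr_options = 0;
	ctx->nr_mounts = 0;
}

/* Step 2: options are accepted only while the context is being configured. */
int demo_add_option(struct demo_context *ctx, const char *opt)
{
	if (ctx->phase != DEMO_CONFIGURING || opt == NULL || *opt == '\0')
		return -EINVAL;
	ctx->nr_options++;
	return 0;
}

/* Step 3: create the superblock; a context gets a single attempt. */
int demo_create_sb(struct demo_context *ctx)
{
	if (ctx->phase != DEMO_CONFIGURING)
		return -EBUSY;
	ctx->phase = DEMO_HAS_SUPERBLOCK;
	return 0;
}

/* Step 4: attach the created superblock somewhere in the namespace. */
int demo_attach(struct demo_context *ctx)
{
	if (ctx->phase != DEMO_HAS_SUPERBLOCK)
		return -EINVAL;
	ctx->nr_mounts++;
	return 0;
}
```

The point of the model is only that configuration, superblock creation and
attachment are distinct steps, so each can be rejected or repeated
independently.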

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by the legacy filesystem wrapper.

This post is mostly about the internal filesystem context and the special
kernel interface filesystems. I've included the fsopen() and fsmount()
syscall implementations for reference, but I expect these to undergo some
reconsideration during LSF. The last five patches relate to the AFS
conversion and are included as an example.

Significant changes:

ver #8:

(*) Changed the way fsmount() mounts into the namespace according to some
of Al's ideas.

(*) Put better typing on the fd cookie obtained from __fdget() & co.

(*) Stored the fd cookie in struct nameidata rather than the dfd number.

(*) Changed sys_fsmount() to return an O_PATH-style fd rather than
actually mounting into the mount namespace.

(*) Separated internal FMODE_* handling from O_* handling to free up
certain O_* flag numbers.

(*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

(*) Added a new syscall, sys_move_mount(), to move a mount from a
dfd+path source to a dfd+path destination.

(*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
vfsmount attached to file->f_path needs 'unmounting' if set.

(*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

[!] This doesn't work quite right.

(*) Added a new syscall, fsinfo(), to query information about a
filesystem. The idea is that this will, in future, work with the
fd from fsopen() too and permit querying of the parameters and
metadata before fsmount() is called.

ver #7:

(*) Undo an incorrect MS_* -> SB_* conversion.

(*) Pass the mount data buffer size to all the mount-related functions that
take the data pointer. This fixes a problem where someone (say SELinux)
tries to copy the mount data, assuming it to be a page in size, and
overruns the buffer - thereby incurring an oops by hitting a guard page.

(*) Made the AFS filesystem use fs_context as an example conversion. This is
much easier to deal with than NFS or Ext4 as there are very few mount options.
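
To illustrate the buffer-size item above, here is a minimal userspace sketch
of copying mount data with an explicit, caller-supplied size rather than
assuming the buffer is a page long. demo_copy_mount_data() is a hypothetical
helper written for this note; it is not a function from the series.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: duplicate a mount data buffer using the size the
 * caller passes in, so the copy never reads past the end of the source.
 */
void *demo_copy_mount_data(const void *data, size_t data_size)
{
	void *copy;

	/* A NULL buffer with a non-zero size is a caller bug. */
	assert(data != NULL || data_size == 0);
	if (data_size == 0)
		return NULL;

	copy = malloc(data_size);
	if (!copy)
		return NULL;

	memcpy(copy, data, data_size);	/* reads exactly data_size bytes */
	return copy;
}
```

A copier written this way cannot overrun a short buffer, which is the failure
mode the changelog entry describes.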

ver #6:

(*) Dropped the supplementary error string facility for the moment.

(*) Dropped the NFS patches for the moment.

(*) Dropped the reserved file descriptor argument from fsopen() and
replaced it with three reserved pointers that must be NULL.

ver #5:

(*) Renamed sb_config -> fs_context and adjusted variable names.

(*) Differentiated the flags in sb->s_flags (now named SB_*) from those
passed to mount(2) (named MS_*).

(*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
caller always provide a struct file_system_type pointer and the
parameters required.

(*) Got rid of vfs_submount_fc() in favour of passing
FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context(). The purpose value is
now used more widely.

(*) Call ->validate() on the remount path.

(*) Got rid of the inode locking in sys_fsmount().

(*) Call security_sb_mountpoint() in the mount(2) path.

ver #4:

(*) Split the sb_config patch up somewhat.

(*) Made the supplementary error string facility something attached to the
task_struct rather than the sb_config so that error messages can be
obtained from NFS doing a mount-root-and-pathwalk inside the
nfs_get_tree() operation.

Further, made this managed and read by prctl rather than through the
mount fd so that it's more generally available.

ver #3:

(*) Rebased on 4.12-rc1.

(*) Split the NFS patch up somewhat.

ver #2:

(*) Removed the ->fill_super() from sb_config_operations and passed it in
directly to functions that want to call it. NFS now calls
nfs_fill_super() directly rather than jumping through a pointer to it
since there's only the one option at the moment.

(*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
proc_sb_config.

(*) Renamed create_super -> get_tree.

(*) Renamed struct mount_context to struct sb_config and amended various
variable names.

(*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
arguments.

ver #1:

(*) Split the sb_config stuff out into its own header.

(*) Support non-context-aware filesystems through a special set of
sb_config operations.

(*) Stored the created superblock and root dentry into the sb_config after
creation rather than directly into a vfsmount. This allows some
arguments to be removed to various NFS functions.

(*) Added an explicit superblock-creation step. This allows a created
superblock to then be mounted multiple times.

(*) Added a flag to say that the sb_config is degraded and cannot be used
for another attempt at superblock creation, and got rid of the flag
that says it's already mounted.

Possible further developments:

(*) Implement sb reconfiguration (for now it returns ENOANO).

(*) Implement mount context support in more filesystems, ext4 being next
on my list.

(*) Move the walk-from-root stuff that nfs has to generic code so that you
can do something akin to:

mount /dev/sda1:/foo/bar /mnt

See nfs_follow_remote_path() and mount_subtree(). This is slightly
tricky in NFS as we have to prevent referral loops.

(*) Work out how to get at the error message incurred by submounts
encountered during nfs_follow_remote_path().

Should the error message be moved to task_struct and made more
general, perhaps retrieved with a prctl() function?

(*) Clean up/consolidate the security functions. Possibly add a
validation hook to be called at the same time as the mount context
validate op.

The patches can be found here also:

http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=mount-context

David
---
David Howells (32):
VFS: Suppress MS_* flag defs within the kernel unless explicitly enabled
vfs: Provide documentation for new mount API
VFS: Introduce the basic header for the new mount API's filesystem context
VFS: Add LSM hooks for the new mount API
selinux: Implement the new mount API LSM hooks
smack: Implement filesystem context security hooks
apparmor: Implement security hooks for the new mount API
tomoyo: Implement security hooks for the new mount API
VFS: Require specification of size of mount data for internal mounts
VFS: Implement a filesystem superblock creation/configuration context
VFS: Remove unused code after filesystem context changes
procfs: Move proc_fill_super() to fs/proc/root.c
proc: Add fs_context support to procfs
ipc: Convert mqueue fs to fs_context
cpuset: Use fs_context
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
hugetlbfs: Convert to fs_context
VFS: Remove kern_mount_data()
VFS: Implement fsopen() to prepare for a mount
vfs: Make close() unmount the attached mount if so flagged
VFS: Implement fsmount() to effect a pre-configured mount
vfs: Provide an fspick() system call
VFS: Implement logging through fs_context
vfs: Add some logging to the core users of the fs_context log
afs: Add fs_context support
afs: Use fs_context to pass parameters over automount
vfs: Use a 'struct fd_cookie *' type for light fd handling
vfs: Store the fd_cookie in nameidata, not the dfd int
vfs: Don't mix FMODE_* flags with O_* flags
vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT)
[RFC] fs: Add a move_mount() system call
[RFC] fsinfo: Add a system call to allow querying of filesystem information


Documentation/filesystems/mounting.txt | 458 +++++++++++++
arch/arc/kernel/setup.c | 1
arch/arm/kernel/atags_parse.c | 1
arch/ia64/kernel/perfmon.c | 3
arch/powerpc/platforms/cell/spufs/inode.c | 6
arch/s390/hypfs/inode.c | 7
arch/sh/kernel/setup.c | 1
arch/sparc/kernel/setup_32.c | 1
arch/sparc/kernel/setup_64.c | 1
arch/x86/entry/syscalls/syscall_32.tbl | 5
arch/x86/entry/syscalls/syscall_64.tbl | 5
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 129 ++--
arch/x86/kernel/setup.c | 1
drivers/base/devtmpfs.c | 7
drivers/dax/super.c | 2
drivers/dma-buf/dma-buf.c | 2
drivers/dma-buf/sync_file.c | 2
drivers/gpu/drm/drm_drv.c | 3
drivers/gpu/drm/drm_syncobj.c | 2
drivers/gpu/drm/i915/i915_gemfs.c | 2
drivers/infiniband/hw/qib/qib_fs.c | 7
drivers/misc/ibmasm/ibmasmfs.c | 11
drivers/mtd/mtdsuper.c | 26 -
drivers/oprofile/oprofilefs.c | 8
.../staging/lustre/lustre/llite/llite_internal.h | 2
drivers/staging/lustre/lustre/llite/llite_lib.c | 3
drivers/staging/lustre/lustre/mdc/mdc_lib.c | 2
drivers/staging/lustre/lustre/mdc/mdc_locks.c | 2
drivers/staging/lustre/lustre/obdclass/obd_mount.c | 7
drivers/staging/ncpfs/inode.c | 10
drivers/tty/pty.c | 4
drivers/usb/gadget/function/f_fs.c | 7
drivers/usb/gadget/legacy/inode.c | 7
drivers/virtio/virtio_balloon.c | 2
drivers/xen/xenfs/super.c | 7
fs/9p/vfs_super.c | 2
fs/Makefile | 3
fs/adfs/super.c | 9
fs/affs/super.c | 13
fs/afs/internal.h | 9
fs/afs/mntpt.c | 147 ++--
fs/afs/super.c | 423 ++++++------
fs/afs/volume.c | 4
fs/aio.c | 3
fs/anon_inodes.c | 23 -
fs/autofs4/autofs_i.h | 2
fs/autofs4/dev-ioctl.c | 2
fs/autofs4/init.c | 4
fs/autofs4/inode.c | 3
fs/befs/linuxvfs.c | 11
fs/bfs/inode.c | 8
fs/binfmt_misc.c | 7
fs/block_dev.c | 2
fs/btrfs/super.c | 30 +
fs/btrfs/tests/btrfs-tests.c | 2
fs/cachefiles/rdwr.c | 2
fs/ceph/super.c | 3
fs/cifs/cifs_dfs_ref.c | 3
fs/cifs/cifsfs.c | 5
fs/coda/inode.c | 11
fs/configfs/mount.c | 7
fs/cramfs/inode.c | 17
fs/debugfs/inode.c | 14
fs/devpts/inode.c | 10
fs/ecryptfs/main.c | 2
fs/efivarfs/super.c | 9
fs/efs/super.c | 14
fs/eventfd.c | 2
fs/eventpoll.c | 2
fs/exec.c | 6
fs/exofs/super.c | 7
fs/exportfs/expfs.c | 2
fs/ext2/super.c | 14
fs/ext4/super.c | 16
fs/f2fs/super.c | 13
fs/fat/inode.c | 3
fs/fat/namei_msdos.c | 8
fs/fat/namei_vfat.c | 8
fs/fcntl.c | 6
fs/file.c | 20 -
fs/file_table.c | 4
fs/freevxfs/vxfs_super.c | 12
fs/fs_context.c | 689 ++++++++++++++++++++
fs/fsopen.c | 477 ++++++++++++++
fs/fuse/control.c | 9
fs/fuse/inode.c | 16
fs/gfs2/ops_fstype.c | 6
fs/gfs2/super.c | 4
fs/hfs/super.c | 12
fs/hfsplus/super.c | 12
fs/hostfs/hostfs_kern.c | 7
fs/hpfs/super.c | 11
fs/hugetlbfs/inode.c | 337 ++++++----
fs/internal.h | 10
fs/isofs/inode.c | 11
fs/jffs2/super.c | 10
fs/jfs/super.c | 11
fs/kernfs/mount.c | 88 +--
fs/libfs.c | 17
fs/minix/inode.c | 14
fs/namei.c | 105 ++-
fs/namespace.c | 650 +++++++++++++++----
fs/nfs/dir.c | 15
fs/nfs/internal.h | 4
fs/nfs/namespace.c | 3
fs/nfs/nfs4namespace.c | 3
fs/nfs/nfs4proc.c | 9
fs/nfs/nfs4super.c | 27 -
fs/nfs/super.c | 22 -
fs/nfsd/nfsctl.c | 8
fs/nilfs2/super.c | 10
fs/notify/fanotify/fanotify_user.c | 10
fs/notify/inotify/inotify_user.c | 2
fs/nsfs.c | 5
fs/ntfs/super.c | 13
fs/ocfs2/dlmfs/dlmfs.c | 5
fs/ocfs2/super.c | 14
fs/omfs/inode.c | 9
fs/open.c | 31 +
fs/openpromfs/inode.c | 11
fs/orangefs/orangefs-kernel.h | 2
fs/orangefs/super.c | 5
fs/overlayfs/super.c | 11
fs/pipe.c | 3
fs/pnode.c | 1
fs/proc/inode.c | 50 -
fs/proc/internal.h | 6
fs/proc/root.c | 212 +++++-
fs/pstore/inode.c | 10
fs/qnx4/inode.c | 14
fs/qnx6/inode.c | 14
fs/ramfs/inode.c | 6
fs/reiserfs/super.c | 14
fs/romfs/super.c | 13
fs/signalfd.c | 3
fs/squashfs/super.c | 12
fs/statfs.c | 431 +++++++++++++
fs/super.c | 402 +++++++++---
fs/sysfs/mount.c | 64 +-
fs/sysv/inode.c | 3
fs/sysv/super.c | 16
fs/timerfd.c | 2
fs/tracefs/inode.c | 10
fs/ubifs/super.c | 5
fs/udf/super.c | 16
fs/ufs/super.c | 11
fs/xfs/xfs_ioctl.c | 2
fs/xfs/xfs_super.c | 10
include/linux/anon_inodes.h | 6
include/linux/cgroup.h | 3
include/linux/debugfs.h | 8
include/linux/fcntl.h | 3
include/linux/file.h | 31 +
include/linux/fs.h | 73 +-
include/linux/fs_context.h | 178 +++++
include/linux/fsinfo.h | 25 +
include/linux/fsnotify.h | 8
include/linux/kernfs.h | 36 +
include/linux/lsm_hooks.h | 89 ++-
include/linux/mount.h | 8
include/linux/mtd/super.h | 4
include/linux/nfs_fs.h | 3
include/linux/ramfs.h | 4
include/linux/security.h | 74 ++
include/linux/shmem_fs.h | 3
include/linux/syscalls.h | 11
include/uapi/asm-generic/fcntl.h | 9
include/uapi/linux/fs.h | 68 --
include/uapi/linux/fsinfo.h | 231 +++++++
include/uapi/linux/magic.h | 1
include/uapi/linux/mount.h | 69 ++
init/do_mounts.c | 5
init/do_mounts_initrd.c | 1
ipc/mqueue.c | 126 +++-
kernel/bpf/inode.c | 7
kernel/bpf/syscall.c | 6
kernel/cgroup/cgroup-internal.h | 42 +
kernel/cgroup/cgroup-v1.c | 296 ++++-----
kernel/cgroup/cgroup.c | 224 ++++---
kernel/cgroup/cpuset.c | 65 ++
kernel/events/core.c | 2
kernel/sys_ni.c | 6
kernel/trace/trace.c | 7
mm/shmem.c | 10
mm/zsmalloc.c | 3
net/socket.c | 3
net/sunrpc/rpc_pipe.c | 7
net/unix/af_unix.c | 2
samples/statx/Makefile | 5
samples/statx/test-fsinfo.c | 179 +++++
security/apparmor/apparmorfs.c | 8
security/apparmor/file.c | 2
security/apparmor/include/mount.h | 11
security/apparmor/lsm.c | 84 ++
security/apparmor/mount.c | 47 +
security/inode.c | 7
security/keys/big_key.c | 2
security/security.c | 70 ++
security/selinux/hooks.c | 294 ++++++++-
security/selinux/selinuxfs.c | 8
security/smack/smack_lsm.c | 344 +++++++++-
security/smack/smackfs.c | 9
security/tomoyo/common.h | 3
security/tomoyo/mount.c | 46 +
security/tomoyo/tomoyo.c | 19 +
205 files changed, 6775 insertions(+), 1836 deletions(-)
create mode 100644 Documentation/filesystems/mounting.txt
create mode 100644 fs/fs_context.c
create mode 100644 fs/fsopen.c
create mode 100644 include/linux/fs_context.h
create mode 100644 include/linux/fsinfo.h
create mode 100644 include/uapi/linux/fsinfo.h
create mode 100644 include/uapi/linux/mount.h
create mode 100644 samples/statx/test-fsinfo.c



2018-05-25 02:47:47

by David Howells

Subject: [PATCH 01/32] VFS: Suppress MS_* flag defs within the kernel unless explicitly enabled [ver #8]

Only the mount namespace code that implements mount(2) should be using the
MS_* flags. Suppress them inside the kernel unless uapi/linux/mount.h is
included.
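
One way to picture the split is a single translation point in the core
mount(2) code, after which filesystems only see superblock flags. The sketch
below is a userspace illustration, not code from this patch: the DEMO_MS_*
values are copied from the header moved by this patch, while the DEMO_SB_*
names and the assumption that they share the same bit values are stand-ins.

```c
#include <assert.h>

/* Values copied from the MS_* definitions in uapi/linux/mount.h. */
#define DEMO_MS_RDONLY		1u
#define DEMO_MS_NOSUID		2u
#define DEMO_MS_SYNCHRONOUS	16u
#define DEMO_MS_REMOUNT		32u
#define DEMO_MS_MANDLOCK	64u
#define DEMO_MS_BIND		4096u

/* Stand-ins for superblock flags; assumed to reuse the same bit values. */
#define DEMO_SB_RDONLY		1u
#define DEMO_SB_SYNCHRONOUS	16u
#define DEMO_SB_MANDLOCK	64u
#define DEMO_SB_MASK \
	(DEMO_SB_RDONLY | DEMO_SB_SYNCHRONOUS | DEMO_SB_MANDLOCK)

/* The mask-based translation below relies on the values lining up. */
static_assert(DEMO_SB_RDONLY == DEMO_MS_RDONLY, "RDONLY bits differ");
static_assert(DEMO_SB_SYNCHRONOUS == DEMO_MS_SYNCHRONOUS, "SYNC bits differ");
static_assert(DEMO_SB_MANDLOCK == DEMO_MS_MANDLOCK, "MANDLOCK bits differ");

/*
 * Keep only the bits that describe the superblock.  Operation selectors
 * (REMOUNT, BIND) and per-mount bits (NOSUID) are dropped here and would
 * be handled separately by the core mount code.
 */
unsigned int demo_ms_to_sb(unsigned long ms_flags)
{
	return (unsigned int)(ms_flags & DEMO_SB_MASK);
}
```

With a boundary like this, code outside the core never needs to test an MS_*
bit, which is what hiding the definitions enforces.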

Signed-off-by: David Howells <[email protected]>
---

arch/arc/kernel/setup.c | 1 +
arch/arm/kernel/atags_parse.c | 1 +
arch/sh/kernel/setup.c | 1 +
arch/sparc/kernel/setup_32.c | 1 +
arch/sparc/kernel/setup_64.c | 1 +
arch/x86/kernel/setup.c | 1 +
drivers/base/devtmpfs.c | 1 +
fs/f2fs/super.c | 2 +
fs/namespace.c | 1 +
fs/pnode.c | 1 +
fs/super.c | 1 +
include/uapi/linux/fs.h | 56 ++++------------------------------------
include/uapi/linux/mount.h | 58 +++++++++++++++++++++++++++++++++++++++++
init/do_mounts.c | 1 +
init/do_mounts_initrd.c | 1 +
security/apparmor/lsm.c | 1 +
security/apparmor/mount.c | 1 +
security/selinux/hooks.c | 1 +
security/tomoyo/mount.c | 1 +
19 files changed, 80 insertions(+), 52 deletions(-)
create mode 100644 include/uapi/linux/mount.h

diff --git a/arch/arc/kernel/setup.c b/arch/arc/kernel/setup.c
index b2cae79a25d7..714dc5c2baf1 100644
--- a/arch/arc/kernel/setup.c
+++ b/arch/arc/kernel/setup.c
@@ -19,6 +19,7 @@
#include <linux/of_fdt.h>
#include <linux/of.h>
#include <linux/cache.h>
+#include <uapi/linux/mount.h>
#include <asm/sections.h>
#include <asm/arcregs.h>
#include <asm/tlb.h>
diff --git a/arch/arm/kernel/atags_parse.c b/arch/arm/kernel/atags_parse.c
index c10a3e8ee998..a8a4333929f5 100644
--- a/arch/arm/kernel/atags_parse.c
+++ b/arch/arm/kernel/atags_parse.c
@@ -24,6 +24,7 @@
#include <linux/root_dev.h>
#include <linux/screen_info.h>
#include <linux/memblock.h>
+#include <uapi/linux/mount.h>

#include <asm/setup.h>
#include <asm/system_info.h>
diff --git a/arch/sh/kernel/setup.c b/arch/sh/kernel/setup.c
index c286cf5da6e7..2c0e0f37a318 100644
--- a/arch/sh/kernel/setup.c
+++ b/arch/sh/kernel/setup.c
@@ -32,6 +32,7 @@
#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/uaccess.h>
+#include <uapi/linux/mount.h>
#include <asm/io.h>
#include <asm/page.h>
#include <asm/elf.h>
diff --git a/arch/sparc/kernel/setup_32.c b/arch/sparc/kernel/setup_32.c
index 13664c377196..7df3d704284c 100644
--- a/arch/sparc/kernel/setup_32.c
+++ b/arch/sparc/kernel/setup_32.c
@@ -34,6 +34,7 @@
#include <linux/kdebug.h>
#include <linux/export.h>
#include <linux/start_kernel.h>
+#include <uapi/linux/mount.h>

#include <asm/io.h>
#include <asm/processor.h>
diff --git a/arch/sparc/kernel/setup_64.c b/arch/sparc/kernel/setup_64.c
index 7944b3ca216a..206bf81eedaf 100644
--- a/arch/sparc/kernel/setup_64.c
+++ b/arch/sparc/kernel/setup_64.c
@@ -33,6 +33,7 @@
#include <linux/module.h>
#include <linux/start_kernel.h>
#include <linux/bootmem.h>
+#include <uapi/linux/mount.h>

#include <asm/io.h>
#include <asm/processor.h>
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 5c623dfe39d1..879b33c7cbd0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -51,6 +51,7 @@
#include <linux/kvm_para.h>
#include <linux/dma-contiguous.h>
#include <xen/xen.h>
+#include <uapi/linux/mount.h>

#include <linux/errno.h>
#include <linux/kernel.h>
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index f7768077e817..79a235184fb5 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -25,6 +25,7 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/kthread.h>
+#include <uapi/linux/mount.h>
#include "base.h"

static struct task_struct *thread;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 42d564c5ccd0..a31cc49b7295 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1450,7 +1450,7 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
err = dquot_suspend(sb, -1);
if (err < 0)
goto restore_opts;
- } else if (f2fs_readonly(sb) && !(*flags & MS_RDONLY)) {
+ } else if (f2fs_readonly(sb) && !(*flags & SB_RDONLY)) {
/* dquot_resume needs RW */
sb->s_flags &= ~SB_RDONLY;
if (sb_any_quota_suspended(sb)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index 5f75969adff1..1c41ab9332ee 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -26,6 +26,7 @@
#include <linux/bootmem.h>
#include <linux/task_work.h>
#include <linux/sched/task.h>
+#include <uapi/linux/mount.h>

#include "pnode.h"
#include "internal.h"
diff --git a/fs/pnode.c b/fs/pnode.c
index 53d411a371ce..1100e810d855 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -10,6 +10,7 @@
#include <linux/mount.h>
#include <linux/fs.h>
#include <linux/nsproxy.h>
+#include <uapi/linux/mount.h>
#include "internal.h"
#include "pnode.h"

diff --git a/fs/super.c b/fs/super.c
index 50728d9c1a05..5132a32e5ebc 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -35,6 +35,7 @@
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
+#include <uapi/linux/mount.h>
#include "internal.h"

static int thaw_super_locked(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index d2a8313fabd7..5da6c2d96af5 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -14,6 +14,11 @@
#include <linux/ioctl.h>
#include <linux/types.h>

+/* Use of MS_* flags within the kernel is restricted to core mount(2) code. */
+#if !defined(__KERNEL__)
+#include <linux/mount.h>
+#endif
+
/*
* It's silly to have NR_OPEN bigger than NR_FILE, but you can change
* the file limit at runtime and only root can increase the per-process
@@ -101,57 +106,6 @@ struct inodes_stat_t {

#define NR_FILE 8192 /* this can well be larger on a larger system */

-
-/*
- * These are the fs-independent mount-flags: up to 32 flags are supported
- */
-#define MS_RDONLY 1 /* Mount read-only */
-#define MS_NOSUID 2 /* Ignore suid and sgid bits */
-#define MS_NODEV 4 /* Disallow access to device special files */
-#define MS_NOEXEC 8 /* Disallow program execution */
-#define MS_SYNCHRONOUS 16 /* Writes are synced at once */
-#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
-#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
-#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
-#define MS_NOATIME 1024 /* Do not update access times. */
-#define MS_NODIRATIME 2048 /* Do not update directory access times */
-#define MS_BIND 4096
-#define MS_MOVE 8192
-#define MS_REC 16384
-#define MS_VERBOSE 32768 /* War is peace. Verbosity is silence.
- MS_VERBOSE is deprecated. */
-#define MS_SILENT 32768
-#define MS_POSIXACL (1<<16) /* VFS does not apply the umask */
-#define MS_UNBINDABLE (1<<17) /* change to unbindable */
-#define MS_PRIVATE (1<<18) /* change to private */
-#define MS_SLAVE (1<<19) /* change to slave */
-#define MS_SHARED (1<<20) /* change to shared */
-#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
-#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
-#define MS_I_VERSION (1<<23) /* Update inode I_version field */
-#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
-#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
-
-/* These sb flags are internal to the kernel */
-#define MS_SUBMOUNT (1<<26)
-#define MS_NOREMOTELOCK (1<<27)
-#define MS_NOSEC (1<<28)
-#define MS_BORN (1<<29)
-#define MS_ACTIVE (1<<30)
-#define MS_NOUSER (1<<31)
-
-/*
- * Superblock flags that can be altered by MS_REMOUNT
- */
-#define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
- MS_LAZYTIME)
-
-/*
- * Old magic mount flag and mask
- */
-#define MS_MGC_VAL 0xC0ED0000
-#define MS_MGC_MSK 0xffff0000
-
/*
* Structure for FS_IOC_FSGETXATTR[A] and FS_IOC_FSSETXATTR.
*/
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
new file mode 100644
index 000000000000..3f9ec42510b0
--- /dev/null
+++ b/include/uapi/linux/mount.h
@@ -0,0 +1,58 @@
+#ifndef _UAPI_LINUX_MOUNT_H
+#define _UAPI_LINUX_MOUNT_H
+
+/*
+ * These are the fs-independent mount-flags: up to 32 flags are supported
+ *
+ * Usage of these is restricted within the kernel to core mount(2) code and
+ * callers of sys_mount() only. Filesystems should be using the SB_*
+ * equivalent instead.
+ */
+#define MS_RDONLY 1 /* Mount read-only */
+#define MS_NOSUID 2 /* Ignore suid and sgid bits */
+#define MS_NODEV 4 /* Disallow access to device special files */
+#define MS_NOEXEC 8 /* Disallow program execution */
+#define MS_SYNCHRONOUS 16 /* Writes are synced at once */
+#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
+#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
+#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_NOATIME 1024 /* Do not update access times. */
+#define MS_NODIRATIME 2048 /* Do not update directory access times */
+#define MS_BIND 4096
+#define MS_MOVE 8192
+#define MS_REC 16384
+#define MS_VERBOSE 32768 /* War is peace. Verbosity is silence.
+ MS_VERBOSE is deprecated. */
+#define MS_SILENT 32768
+#define MS_POSIXACL (1<<16) /* VFS does not apply the umask */
+#define MS_UNBINDABLE (1<<17) /* change to unbindable */
+#define MS_PRIVATE (1<<18) /* change to private */
+#define MS_SLAVE (1<<19) /* change to slave */
+#define MS_SHARED (1<<20) /* change to shared */
+#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
+#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
+#define MS_I_VERSION (1<<23) /* Update inode I_version field */
+#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
+
+/* These sb flags are internal to the kernel */
+#define MS_SUBMOUNT (1<<26)
+#define MS_NOREMOTELOCK (1<<27)
+#define MS_NOSEC (1<<28)
+#define MS_BORN (1<<29)
+#define MS_ACTIVE (1<<30)
+#define MS_NOUSER (1<<31)
+
+/*
+ * Superblock flags that can be altered by MS_REMOUNT
+ */
+#define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
+ MS_LAZYTIME)
+
+/*
+ * Old magic mount flag and mask
+ */
+#define MS_MGC_VAL 0xC0ED0000
+#define MS_MGC_MSK 0xffff0000
+
+#endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 2c71dabe5626..ea6f21bb9440 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -32,6 +32,7 @@
#include <linux/nfs_fs.h>
#include <linux/nfs_fs_sb.h>
#include <linux/nfs_mount.h>
+#include <uapi/linux/mount.h>

#include "do_mounts.h"

diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index 5a91aefa7305..65de0412f80f 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -18,6 +18,7 @@
#include <linux/sched.h>
#include <linux/freezer.h>
#include <linux/kmod.h>
+#include <uapi/linux/mount.h>

#include "do_mounts.h"

diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index ce2b89e9ad94..9ebc9e9c3854 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -24,6 +24,7 @@
#include <linux/audit.h>
#include <linux/user_namespace.h>
#include <net/sock.h>
+#include <uapi/linux/mount.h>

#include "include/apparmor.h"
#include "include/apparmorfs.h"
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index 6e8c7ac0b33d..45bb769d6cd7 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -15,6 +15,7 @@
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <uapi/linux/mount.h>

#include "include/apparmor.h"
#include "include/audit.h"
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 65cba637be10..54ecb1c18ca1 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -88,6 +88,7 @@
#include <linux/msg.h>
#include <linux/shm.h>
#include <linux/bpf.h>
+#include <uapi/linux/mount.h>

#include "avc.h"
#include "objsec.h"
diff --git a/security/tomoyo/mount.c b/security/tomoyo/mount.c
index 807fd91dbb54..7dc7f59b7dde 100644
--- a/security/tomoyo/mount.c
+++ b/security/tomoyo/mount.c
@@ -6,6 +6,7 @@
*/

#include <linux/slab.h>
+#include <uapi/linux/mount.h>
#include "common.h"

/* String table for special mount operations. */


2018-05-25 02:47:53

by David Howells

Subject: [PATCH 03/32] VFS: Introduce the basic header for the new mount API's filesystem context [ver #8]

Introduce a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount. This is
allocated at the beginning of the mount procedure and into it is placed:

(1) Filesystem type.

(2) Namespaces.

(3) Source/Device names (there may be multiple).

(4) Superblock flags (SB_*).

(5) Security details.

(6) Filesystem-specific data, as set by the mount options.

Signed-off-by: David Howells <[email protected]>
---

include/linux/fs_context.h | 75 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)
create mode 100644 include/linux/fs_context.h

diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
new file mode 100644
index 000000000000..04783814632c
--- /dev/null
+++ b/include/linux/fs_context.h
@@ -0,0 +1,75 @@
+/* Filesystem superblock creation and reconfiguration context.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FS_CONTEXT_H
+#define _LINUX_FS_CONTEXT_H
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+
+struct cred;
+struct dentry;
+struct file_operations;
+struct file_system_type;
+struct mnt_namespace;
+struct net;
+struct pid_namespace;
+struct super_block;
+struct user_namespace;
+struct vfsmount;
+
+enum fs_context_purpose {
+ FS_CONTEXT_FOR_USER_MOUNT, /* New superblock for user-specified mount */
+ FS_CONTEXT_FOR_KERNEL_MOUNT, /* New superblock for kernel-internal mount */
+ FS_CONTEXT_FOR_SUBMOUNT, /* New superblock for automatic submount */
+ FS_CONTEXT_FOR_RECONFIGURE, /* Superblock reconfiguration (remount) */
+};
+
+/*
+ * Filesystem context for holding the parameters used in the creation or
+ * reconfiguration of a superblock.
+ *
+ * Superblock creation fills in ->root whereas reconfiguration begins with this
+ * already set.
+ *
+ * See Documentation/filesystems/mounting.txt
+ */
+struct fs_context {
+ const struct fs_context_operations *ops;
+ struct file_system_type *fs_type;
+ void *fs_private; /* The filesystem's context */
+ struct dentry *root; /* The root and superblock */
+ struct user_namespace *user_ns; /* The user namespace for this mount */
+ struct net *net_ns; /* The network namespace for this mount */
+ const struct cred *cred; /* The mounter's credentials */
+ char *source; /* The source name (eg. dev path) */
+ char *subtype; /* The subtype to set on the superblock */
+ void *security; /* The LSM context */
+ void *s_fs_info; /* Proposed s_fs_info */
+ unsigned int sb_flags; /* Proposed superblock flags (SB_*) */
+ bool sloppy:1; /* T if unrecognised options are okay */
+ bool silent:1; /* T if "o silent" specified */
+ bool drop_sb:1; /* T if need to drop an SB reference */
+ bool source_is_dev:1; /* T if source is local device/file */
+ enum fs_context_purpose purpose : 8;
+};
+
+struct fs_context_operations {
+ void (*free)(struct fs_context *fc);
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ int (*parse_source)(struct fs_context *fc, char *source);
+ int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
+ int (*parse_monolithic)(struct fs_context *fc, void *data);
+ int (*validate)(struct fs_context *fc);
+ int (*get_tree)(struct fs_context *fc);
+};
+
+#endif /* _LINUX_FS_CONTEXT_H */
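
For readers wondering how a filesystem might plug into the operations table
above, here is a userspace sketch with a cut-down mirror of the structures.
Everything prefixed demo_ or demofs_ is hypothetical, as are the "ro" and
"label=" options; only the ->parse_option()/->validate() shape and the sloppy
flag are taken from the header.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SB_RDONLY	1u	/* stand-in for SB_RDONLY */

struct demo_fs_context_operations;

/* Cut-down userspace mirror of the struct fs_context fields used here. */
struct demo_fs_context {
	const struct demo_fs_context_operations *ops;
	void *fs_private;	/* the filesystem's own context */
	unsigned int sb_flags;	/* proposed superblock flags */
	bool sloppy;		/* true if unrecognised options are okay */
};

struct demo_fs_context_operations {
	int (*parse_option)(struct demo_fs_context *fc, char *opt, size_t len);
	int (*validate)(struct demo_fs_context *fc);
};

/* Private state of a hypothetical "demofs". */
struct demofs_private {
	bool have_label;
	char label[32];
};

/* Called once per option; stores the result in the context. */
int demofs_parse_option(struct demo_fs_context *fc, char *opt, size_t len)
{
	struct demofs_private *priv = fc->fs_private;

	assert(priv != NULL);
	if (len == 2 && memcmp(opt, "ro", 2) == 0) {
		fc->sb_flags |= DEMO_SB_RDONLY;
		return 0;
	}
	if (len > 6 && memcmp(opt, "label=", 6) == 0) {
		size_t vlen = len - 6;

		if (vlen >= sizeof(priv->label))
			return -EINVAL;
		memcpy(priv->label, opt + 6, vlen);
		priv->label[vlen] = '\0';
		priv->have_label = true;
		return 0;
	}
	/* Unrecognised option: tolerated only in sloppy mode. */
	return fc->sloppy ? 0 : -EINVAL;
}

/* Called after all options are parsed; checks the set as a whole. */
int demofs_validate(struct demo_fs_context *fc)
{
	struct demofs_private *priv = fc->fs_private;

	return priv->have_label ? 0 : -EINVAL;
}

const struct demo_fs_context_operations demofs_context_ops = {
	.parse_option	= demofs_parse_option,
	.validate	= demofs_validate,
};
```

Splitting per-option parsing from whole-set validation means a missing
mandatory option is reported once, after parsing, rather than on every call.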


2018-05-25 02:47:54

by David Howells

Subject: [PATCH 07/32] apparmor: Implement security hooks for the new mount API [ver #8]

Implement hooks to check the creation of new mountpoints for AppArmor.

Unfortunately, the DFA evaluation puts the option data in last, after the
details of the mountpoint, so we have to cache the mount options in the
fs_context using those hooks till we get to the new mountpoint hook.
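
The caching idea can be shown in isolation with a userspace sketch that
appends each option to a space-separated string for later use. The names
below (demo_saved_options, demo_save_option) are hypothetical and the code
uses realloc() rather than the kernel allocator; it is an illustration of the
technique, not the hook implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct demo_saved_options {
	char *buf;	/* NUL-terminated, space-separated options */
	size_t len;	/* length of buf, excluding the terminator */
};

/* Append one option, inserting a separator before all but the first. */
int demo_save_option(struct demo_saved_options *so, const char *opt, size_t len)
{
	size_t sep;
	char *p, *q;

	assert(so != NULL && opt != NULL);
	sep = so->len > 0 ? 1 : 0;

	p = realloc(so->buf, so->len + sep + len + 1);
	if (!p)
		return -ENOMEM;

	q = p + so->len;	/* points at the current terminator */
	if (sep)
		*q++ = ' ';
	memcpy(q, opt, len);
	q[len] = '\0';

	so->buf = p;
	so->len += sep + len;	/* count the separator only if one was written */
	return 0;
}
```

Keeping the stored length equal to the string length means each new option
overwrites the previous terminator, so the buffer stays a single string.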

Signed-off-by: David Howells <[email protected]>
Acked-by: John Johansen <[email protected]>
cc: [email protected]
cc: [email protected]
---

security/apparmor/include/mount.h | 11 +++++
security/apparmor/lsm.c | 80 +++++++++++++++++++++++++++++++++++++
security/apparmor/mount.c | 46 +++++++++++++++++++++
3 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/security/apparmor/include/mount.h b/security/apparmor/include/mount.h
index 25d6067fa6ef..0441bfae30fa 100644
--- a/security/apparmor/include/mount.h
+++ b/security/apparmor/include/mount.h
@@ -16,6 +16,7 @@

#include <linux/fs.h>
#include <linux/path.h>
+#include <linux/fs_context.h>

#include "domain.h"
#include "policy.h"
@@ -27,7 +28,13 @@
#define AA_AUDIT_DATA 0x40
#define AA_MNT_CONT_MATCH 0x40

-#define AA_MS_IGNORE_MASK (MS_KERNMOUNT | MS_NOSEC | MS_ACTIVE | MS_BORN)
+#define AA_SB_IGNORE_MASK (SB_KERNMOUNT | SB_NOSEC | SB_ACTIVE | SB_BORN)
+
+struct apparmor_fs_context {
+ struct fs_context fc;
+ char *saved_options;
+ size_t saved_size;
+};

int aa_remount(struct aa_label *label, const struct path *path,
unsigned long flags, void *data);
@@ -45,6 +52,8 @@ int aa_move_mount(struct aa_label *label, const struct path *path,
int aa_new_mount(struct aa_label *label, const char *dev_name,
const struct path *path, const char *type, unsigned long flags,
void *data);
+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+ const struct path *mountpoint);

int aa_umount(struct aa_label *label, struct vfsmount *mnt, int flags);

diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 9ebc9e9c3854..bf2401ade80e 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -518,6 +518,78 @@ static int apparmor_file_mprotect(struct vm_area_struct *vma,
!(vma->vm_flags & VM_SHARED) ? MAP_PRIVATE : 0);
}

+static int apparmor_fs_context_alloc(struct fs_context *fc, struct dentry *reference)
+{
+ struct apparmor_fs_context *afc;
+
+ afc = kzalloc(sizeof(*afc), GFP_KERNEL);
+ if (!afc)
+ return -ENOMEM;
+
+ fc->security = afc;
+ return 0;
+}
+
+static int apparmor_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ fc->security = NULL;
+ return 0;
+}
+
+static void apparmor_fs_context_free(struct fs_context *fc)
+{
+ struct apparmor_fs_context *afc = fc->security;
+
+ if (afc) {
+ kfree(afc->saved_options);
+ kfree(afc);
+ }
+}
+
+/*
+ * As a temporary hack, we buffer all the options. The problem is that we need
+ * to pass them to the DFA evaluator *after* mount point parameters, which
+ * means deferring the entire check to the sb_mountpoint hook.
+ */
+static int apparmor_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct apparmor_fs_context *afc = fc->security;
+ size_t space = 0;
+ char *p, *q;
+
+ if (afc->saved_size > 0)
+ space = 1;
+
+ p = krealloc(afc->saved_options, afc->saved_size + space + len + 1, GFP_KERNEL);
+ if (!p)
+ return -ENOMEM;
+
+ q = p + afc->saved_size;
+ if (q != p)
+ *q++ = ' ';
+ memcpy(q, opt, len);
+ q += len;
+ *q = 0;
+
+ afc->saved_options = p;
+ afc->saved_size += space + len;
+ return 0;
+}
+
+static int apparmor_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ struct aa_label *label;
+ int error = 0;
+
+ label = __begin_current_label_crit_section();
+ if (!unconfined(label))
+ error = aa_new_mount_fc(label, fc, mountpoint);
+ __end_current_label_crit_section(label);
+
+ return error;
+}
+
static int apparmor_sb_mount(const char *dev_name, const struct path *path,
const char *type, unsigned long flags, void *data)
{
@@ -528,7 +600,7 @@ static int apparmor_sb_mount(const char *dev_name, const struct path *path,
if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
flags &= ~MS_MGC_MSK;

- flags &= ~AA_MS_IGNORE_MASK;
+ flags &= ~AA_SB_IGNORE_MASK;

label = __begin_current_label_crit_section();
if (!unconfined(label)) {
@@ -1124,6 +1196,12 @@ static struct security_hook_list apparmor_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(capget, apparmor_capget),
LSM_HOOK_INIT(capable, apparmor_capable),

+ LSM_HOOK_INIT(fs_context_alloc, apparmor_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, apparmor_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, apparmor_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_option, apparmor_fs_context_parse_option),
+ LSM_HOOK_INIT(sb_mountpoint, apparmor_sb_mountpoint),
+
LSM_HOOK_INIT(sb_mount, apparmor_sb_mount),
LSM_HOOK_INIT(sb_umount, apparmor_sb_umount),
LSM_HOOK_INIT(sb_pivotroot, apparmor_sb_pivotroot),
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index 45bb769d6cd7..de791134e352 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -554,6 +554,52 @@ int aa_new_mount(struct aa_label *label, const char *dev_name,
return error;
}

+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+ const struct path *mountpoint)
+{
+ struct apparmor_fs_context *afc = fc->security;
+ struct aa_profile *profile;
+ char *buffer = NULL, *dev_buffer = NULL;
+ bool binary;
+ int error;
+ struct path tmp_path, *dev_path = NULL;
+
+ AA_BUG(!label);
+ AA_BUG(!mountpoint);
+
+ binary = fc->fs_type->fs_flags & FS_BINARY_MOUNTDATA;
+
+ if (fc->fs_type->fs_flags & FS_REQUIRES_DEV) {
+ if (!fc->source)
+ return -ENOENT;
+
+ error = kern_path(fc->source, LOOKUP_FOLLOW, &tmp_path);
+ if (error)
+ return error;
+ dev_path = &tmp_path;
+ }
+
+ get_buffers(buffer, dev_buffer);
+ if (dev_path) {
+ error = fn_for_each_confined(label, profile,
+ match_mnt(profile, mountpoint, buffer, dev_path, dev_buffer,
+ fc->fs_type->name,
+ fc->sb_flags & ~AA_SB_IGNORE_MASK,
+ afc->saved_options, binary));
+ } else {
+ error = fn_for_each_confined(label, profile,
+ match_mnt_path_str(profile, mountpoint, buffer,
+ fc->source, fc->fs_type->name,
+ fc->sb_flags & ~AA_SB_IGNORE_MASK,
+ afc->saved_options, binary, NULL));
+ }
+ put_buffers(buffer, dev_buffer);
+ if (dev_path)
+ path_put(dev_path);
+
+ return error;
+}
+
static int profile_umount(struct aa_profile *profile, struct path *path,
char *buffer)
{


2018-05-25 02:47:56

by David Howells

Subject: [PATCH 08/32] tomoyo: Implement security hooks for the new mount API [ver #8]

Implement the security hook to check the creation of a new mountpoint for
Tomoyo.

As far as I can tell, Tomoyo doesn't make use of the mount data or parse
any mount options, so I haven't implemented any of the fs_context hooks for
it.

Signed-off-by: David Howells <[email protected]>
cc: Tetsuo Handa <[email protected]>
cc: [email protected]
cc: [email protected]
---

security/tomoyo/common.h | 3 +++
security/tomoyo/mount.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
security/tomoyo/tomoyo.c | 15 +++++++++++++++
3 files changed, 63 insertions(+)

diff --git a/security/tomoyo/common.h b/security/tomoyo/common.h
index 539bcdd30bb8..e637ce73f7f9 100644
--- a/security/tomoyo/common.h
+++ b/security/tomoyo/common.h
@@ -971,6 +971,9 @@ int tomoyo_init_request_info(struct tomoyo_request_info *r,
const u8 index);
int tomoyo_mkdev_perm(const u8 operation, const struct path *path,
const unsigned int mode, unsigned int dev);
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+ const struct path *mountpoint,
+ unsigned int mnt_flags);
int tomoyo_mount_permission(const char *dev_name, const struct path *path,
const char *type, unsigned long flags,
void *data_page);
diff --git a/security/tomoyo/mount.c b/security/tomoyo/mount.c
index 7dc7f59b7dde..9ec84ab6f5e1 100644
--- a/security/tomoyo/mount.c
+++ b/security/tomoyo/mount.c
@@ -6,6 +6,7 @@
*/

#include <linux/slab.h>
+#include <linux/fs_context.h>
#include <uapi/linux/mount.h>
#include "common.h"

@@ -236,3 +237,47 @@ int tomoyo_mount_permission(const char *dev_name, const struct path *path,
tomoyo_read_unlock(idx);
return error;
}
+
+/**
+ * tomoyo_mount_permission_fc - Check permission to create a new mount.
+ * @fc: Context describing the object to be mounted.
+ * @mountpoint: The target object to mount on.
+ * @mnt_flags: The MNT_* flags to be set on the mountpoint.
+ *
+ * Check the permission to create a mount of the object described in @fc. Note
+ * that the source object may be a newly created superblock or may be an
+ * existing one picked from the filesystem (bind mount).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+ const struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ struct tomoyo_request_info r;
+ unsigned int ms_flags = 0;
+ int error;
+ int idx;
+
+ if (tomoyo_init_request_info(&r, NULL, TOMOYO_MAC_FILE_MOUNT) ==
+ TOMOYO_CONFIG_DISABLED)
+ return 0;
+
+ /* Convert MNT_* flags to MS_* equivalents. */
+ if (mnt_flags & MNT_NOSUID) ms_flags |= MS_NOSUID;
+ if (mnt_flags & MNT_NODEV) ms_flags |= MS_NODEV;
+ if (mnt_flags & MNT_NOEXEC) ms_flags |= MS_NOEXEC;
+ if (mnt_flags & MNT_NOATIME) ms_flags |= MS_NOATIME;
+ if (mnt_flags & MNT_NODIRATIME) ms_flags |= MS_NODIRATIME;
+ if (mnt_flags & MNT_RELATIME) ms_flags |= MS_RELATIME;
+ if (mnt_flags & MNT_READONLY) ms_flags |= MS_RDONLY;
+
+ idx = tomoyo_read_lock();
+ /* TODO: There may be multiple sources; for the moment, just pick the
+ * first if there is one.
+ */
+ error = tomoyo_mount_acl(&r, fc->source, mountpoint, fc->fs_type->name,
+ ms_flags);
+ tomoyo_read_unlock(idx);
+ return error;
+}
diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index 213b8c593668..31fd6bd4f657 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -391,6 +391,20 @@ static int tomoyo_path_chroot(const struct path *path)
return tomoyo_path_perm(TOMOYO_TYPE_CHROOT, path, NULL);
}

+/**
+ * tomoyo_sb_mountpoint - Target for security_sb_mountpoint().
+ * @fc: Context describing the object to be mounted.
+ * @mountpoint: The target object to mount on.
+ * @mnt_flags: Mountpoint specific options (as MNT_* flags).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+static int tomoyo_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ return tomoyo_mount_permission_fc(fc, mountpoint, mnt_flags);
+}
+
/**
* tomoyo_sb_mount - Target for security_sb_mount().
*
@@ -519,6 +533,7 @@ static struct security_hook_list tomoyo_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(path_chmod, tomoyo_path_chmod),
LSM_HOOK_INIT(path_chown, tomoyo_path_chown),
LSM_HOOK_INIT(path_chroot, tomoyo_path_chroot),
+ LSM_HOOK_INIT(sb_mountpoint, tomoyo_sb_mountpoint),
LSM_HOOK_INIT(sb_mount, tomoyo_sb_mount),
LSM_HOOK_INIT(sb_umount, tomoyo_sb_umount),
LSM_HOOK_INIT(sb_pivotroot, tomoyo_sb_pivotroot),
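
The MNT_* to MS_* conversion in tomoyo_mount_permission_fc() could equally be written as a lookup table. The sketch below is a self-contained userspace illustration under stated assumptions: the SK_-prefixed constants are local copies of the values as I understand them from the kernel headers of the time and should be treated as illustrative, and mnt_flags_to_ms() is a hypothetical helper that is not part of the patch.

```c
#include <stddef.h>

/* Illustrative local copies of the kernel constants; not authoritative. */
#define SK_MNT_NOSUID		0x01
#define SK_MNT_NODEV		0x02
#define SK_MNT_NOEXEC		0x04
#define SK_MNT_NOATIME		0x08
#define SK_MNT_NODIRATIME	0x10
#define SK_MNT_RELATIME		0x20
#define SK_MNT_READONLY		0x40

#define SK_MS_RDONLY		1UL
#define SK_MS_NOSUID		2UL
#define SK_MS_NODEV		4UL
#define SK_MS_NOEXEC		8UL
#define SK_MS_NOATIME		1024UL
#define SK_MS_NODIRATIME	2048UL
#define SK_MS_RELATIME		(1UL << 21)

struct flag_map {
	unsigned int mnt;	/* per-mountpoint flag */
	unsigned long ms;	/* mount(2)-style equivalent */
};

static const struct flag_map mnt_to_ms[] = {
	{ SK_MNT_NOSUID,	SK_MS_NOSUID },
	{ SK_MNT_NODEV,		SK_MS_NODEV },
	{ SK_MNT_NOEXEC,	SK_MS_NOEXEC },
	{ SK_MNT_NOATIME,	SK_MS_NOATIME },
	{ SK_MNT_NODIRATIME,	SK_MS_NODIRATIME },
	{ SK_MNT_RELATIME,	SK_MS_RELATIME },
	{ SK_MNT_READONLY,	SK_MS_RDONLY },
};

/* Translate the MNT_* bits we know about; unknown bits are dropped. */
static unsigned long mnt_flags_to_ms(unsigned int mnt_flags)
{
	unsigned long ms_flags = 0;
	size_t i;

	for (i = 0; i < sizeof(mnt_to_ms) / sizeof(mnt_to_ms[0]); i++)
		if (mnt_flags & mnt_to_ms[i].mnt)
			ms_flags |= mnt_to_ms[i].ms;
	return ms_flags;
}
```

A table keeps each pairing on one line and makes it harder to mismatch a flag, at the cost of a loop where the patch uses a flat run of if statements.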


2018-05-25 02:48:03

by David Howells

Subject: [PATCH 14/32] ipc: Convert mqueue fs to fs_context [ver #8]

Convert the mqueue filesystem to use the filesystem context stuff.

Notes:

(1) The relevant ipc namespace is selected when the context is
initialised (and it defaults to the current task's ipc namespace).
The caller can override this before calling vfs_get_tree().

(2) Rather than simply calling kern_mount_data(), mq_init_ns() and
mq_internal_mount() create a context, adjust it and then do the rest
of the mount procedure.

(3) The lazy mqueue mounting on creation of a new namespace is retained
from a previous patch, but the avoidance of sget() if no superblock
yet exists is reverted and the superblock is again keyed on the
namespace pointer.

Yes, there was a performance gain in not searching the superblock
hash, but it's only paid once per ipc namespace - and only if someone
uses mqueue within that namespace, so I'm not sure it's worth it,
especially as calling sget() allows avoidance of recursion.

Signed-off-by: David Howells <[email protected]>
---

ipc/mqueue.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 99 insertions(+), 22 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 910c3c7532e6..934ccdc48a1d 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -18,6 +18,7 @@
#include <linux/pagemap.h>
#include <linux/file.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/namei.h>
#include <linux/sysctl.h>
#include <linux/poll.h>
@@ -42,6 +43,10 @@
#include <net/sock.h>
#include "util.h"

+struct mqueue_fs_context {
+ struct ipc_namespace *ipc_ns;
+};
+
#define MQUEUE_MAGIC 0x19800202
#define DIRENT_SIZE 20
#define FILENT_SIZE 80
@@ -87,9 +92,11 @@ struct mqueue_inode_info {
unsigned long qsize; /* size of queue in memory (sum of all msgs) */
};

+static struct file_system_type mqueue_fs_type;
static const struct inode_operations mqueue_dir_inode_operations;
static const struct file_operations mqueue_file_operations;
static const struct super_operations mqueue_super_ops;
+static const struct fs_context_operations mqueue_fs_context_ops;
static void remove_notification(struct mqueue_inode_info *info);

static struct kmem_cache *mqueue_inode_cachep;
@@ -322,7 +329,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
return ERR_PTR(ret);
}

-static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
+static int mqueue_fill_super(struct super_block *sb, struct fs_context *fc)
{
struct inode *inode;
struct ipc_namespace *ns = sb->s_fs_info;
@@ -343,19 +350,84 @@ static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_siz
return 0;
}

-static struct dentry *mqueue_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, size_t data_size)
+static int mqueue_get_tree(struct fs_context *fc)
{
- struct ipc_namespace *ns;
- if (flags & SB_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = current->nsproxy->ipc_ns;
+ struct mqueue_fs_context *ctx = fc->fs_private;
+
+ /* As a shortcut, if the namespace already has a superblock created,
+ * use the root from that directly rather than invoking sget() again.
+ */
+ spin_lock(&mq_lock);
+ if (ctx->ipc_ns->mq_mnt) {
+ fc->root = dget(ctx->ipc_ns->mq_mnt->mnt_sb->s_root);
+ atomic_inc(&fc->root->d_sb->s_active);
}
- return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
- mqueue_fill_super);
+ spin_unlock(&mq_lock);
+ if (fc->root) {
+ down_write(&fc->root->d_sb->s_umount);
+ return 0;
+ }
+
+ fc->s_fs_info = ctx->ipc_ns;
+ return vfs_get_super(fc, vfs_get_keyed_super, mqueue_fill_super);
+}
+
+static void mqueue_fs_context_free(struct fs_context *fc)
+{
+ struct mqueue_fs_context *ctx = fc->fs_private;
+
+ if (ctx->ipc_ns)
+ put_ipc_ns(ctx->ipc_ns);
+ kfree(ctx);
+}
+
+static int mqueue_init_fs_context(struct fs_context *fc,
+ struct dentry *reference)
+{
+ struct mqueue_fs_context *ctx;
+
+ ctx = kzalloc(sizeof(struct mqueue_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->ipc_ns = get_ipc_ns(current->nsproxy->ipc_ns);
+ fc->fs_private = ctx;
+ fc->ops = &mqueue_fs_context_ops;
+ return 0;
+}
+
+static struct vfsmount *mq_create_mount(struct ipc_namespace *ns)
+{
+ struct mqueue_fs_context *ctx;
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&mqueue_fs_type, NULL, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ ctx = fc->fs_private;
+ put_ipc_ns(ctx->ipc_ns);
+ ctx->ipc_ns = get_ipc_ns(ns);
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc, 0);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto err_fc;
+ }
+
+ put_fs_context(fc);
+ return mnt;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
}

static void init_once(void *foo)
@@ -1521,15 +1593,22 @@ static const struct super_operations mqueue_super_ops = {
.statfs = simple_statfs,
};

+static const struct fs_context_operations mqueue_fs_context_ops = {
+ .free = mqueue_fs_context_free,
+ .get_tree = mqueue_get_tree,
+};
+
static struct file_system_type mqueue_fs_type = {
- .name = "mqueue",
- .mount = mqueue_mount,
- .kill_sb = kill_litter_super,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "mqueue",
+ .init_fs_context = mqueue_init_fs_context,
+ .kill_sb = kill_litter_super,
+ .fs_flags = FS_USERNS_MOUNT,
};

int mq_init_ns(struct ipc_namespace *ns)
{
+ struct vfsmount *m;
+
ns->mq_queues_count = 0;
ns->mq_queues_max = DFLT_QUEUESMAX;
ns->mq_msg_max = DFLT_MSGMAX;
@@ -1537,12 +1616,10 @@ int mq_init_ns(struct ipc_namespace *ns)
ns->mq_msg_default = DFLT_MSG;
ns->mq_msgsize_default = DFLT_MSGSIZE;

- ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns, 0);
- if (IS_ERR(ns->mq_mnt)) {
- int err = PTR_ERR(ns->mq_mnt);
- ns->mq_mnt = NULL;
- return err;
- }
+ m = mq_create_mount(ns);
+ if (IS_ERR(m))
+ return PTR_ERR(m);
+ ns->mq_mnt = m;
return 0;
}
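
The namespace override described in note (1) comes down to swapping a counted reference held by the context. The following userspace sketch uses hypothetical names throughout (struct ns, struct ctx, ns_get(), ns_put(), ctx_init(), ctx_set_ns(), ctx_free()) with a plain int in place of the kernel's refcounting; it only illustrates the ownership rule that the context holds exactly one reference until it is freed.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct ipc_namespace with a bare refcount. */
struct ns {
	int refs;
};

static struct ns *ns_get(struct ns *n)
{
	n->refs++;
	return n;
}

static void ns_put(struct ns *n)
{
	n->refs--;
}

/* Hypothetical stand-in for struct mqueue_fs_context. */
struct ctx {
	struct ns *ns;
};

/* Default to the caller's namespace, as the init_fs_context op does. */
static struct ctx *ctx_init(struct ns *current_ns)
{
	struct ctx *c = calloc(1, sizeof(*c));

	if (c)
		c->ns = ns_get(current_ns);
	return c;
}

/* Override the namespace before the superblock is looked up or created. */
static void ctx_set_ns(struct ctx *c, struct ns *target)
{
	struct ns *old = c->ns;

	c->ns = ns_get(target);	/* take the new ref before dropping the old */
	ns_put(old);
}

/* Drop the single reference the context owns. */
static void ctx_free(struct ctx *c)
{
	if (c->ns)
		ns_put(c->ns);
	free(c);
}
```

Taking the new reference before dropping the old one is a choice made in this sketch so that overriding with the same namespace never lets the count touch zero; the patch itself puts first and then gets.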



2018-05-25 02:48:07

by David Howells

Subject: [PATCH 16/32] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #8]

Make kernfs support superblock creation/mount/remount with fs_context.

This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.

Notes:

(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).

(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired.

(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.

(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.

(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.

(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.

(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.

Weirdies:

(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.

(*) The cgroup refcount web. This really needs documenting.

(*) cgroup2 only has one root?

Signed-off-by: David Howells <[email protected]>
cc: Greg Kroah-Hartman <[email protected]>
cc: Tejun Heo <[email protected]>
cc: Li Zefan <[email protected]>
cc: Johannes Weiner <[email protected]>
cc: [email protected]
cc: [email protected]
---

arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 129 +++++++------
fs/kernfs/mount.c | 89 +++++----
fs/sysfs/mount.c | 64 +++++-
include/linux/cgroup.h | 3
include/linux/kernfs.h | 36 ++--
kernel/cgroup/cgroup-internal.h | 42 +++-
kernel/cgroup/cgroup-v1.c | 296 ++++++++++++++----------------
kernel/cgroup/cgroup.c | 224 +++++++++++++----------
8 files changed, 483 insertions(+), 400 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 3584ef8de1fd..0c2cb4f08dd3 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -36,6 +36,12 @@
#include <asm/intel_rdt_sched.h>
#include "intel_rdt.h"

+struct rdt_fs_context {
+ struct kernfs_fs_context kfc;
+ bool enable_cdpl2;
+ bool enable_cdpl3;
+};
+
DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
@@ -1104,39 +1110,6 @@ static void cdp_disable_all(void)
cdpl2_disable();
}

-static int parse_rdtgroupfs_options(char *data)
-{
- char *token, *o = data;
- int ret = 0;
-
- while ((token = strsep(&o, ",")) != NULL) {
- if (!*token) {
- ret = -EINVAL;
- goto out;
- }
-
- if (!strcmp(token, "cdp")) {
- ret = cdpl3_enable();
- if (ret)
- goto out;
- } else if (!strcmp(token, "cdpl2")) {
- ret = cdpl2_enable();
- if (ret)
- goto out;
- } else {
- ret = -EINVAL;
- goto out;
- }
- }
-
- return 0;
-
-out:
- pr_err("Invalid mount option \"%s\"\n", token);
-
- return ret;
-}
-
/*
* We don't allow rdtgroup directories to be created anywhere
* except the root directory. Thus when looking for the rdtgroup
@@ -1205,13 +1178,11 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
struct rdtgroup *prgrp,
struct kernfs_node **mon_data_kn);

-static struct dentry *rdt_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data, size_t data_size)
+static int rdt_get_tree(struct fs_context *fc)
{
+ struct rdt_fs_context *ctx = fc->fs_private;
struct rdt_domain *dom;
struct rdt_resource *r;
- struct dentry *dentry;
int ret;

cpus_read_lock();
@@ -1220,47 +1191,46 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
* resctrl file system can only be mounted once.
*/
if (static_branch_unlikely(&rdt_enable_key)) {
- dentry = ERR_PTR(-EBUSY);
+ ret = -EBUSY;
goto out;
}

- ret = parse_rdtgroupfs_options(data);
- if (ret) {
- dentry = ERR_PTR(ret);
- goto out_cdp;
+ if (ctx->enable_cdpl2) {
+ ret = cdpl2_enable();
+ if (ret < 0)
+ goto out_cdp;
+ }
+
+ if (ctx->enable_cdpl3) {
+ ret = cdpl3_enable();
+ if (ret < 0)
+ goto out_cdp;
}

closid_init();

ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_cdp;
- }

if (rdt_mon_capable) {
ret = mongroup_create_dir(rdtgroup_default.kn,
NULL, "mon_groups",
&kn_mongrp);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_info;
- }
kernfs_get(kn_mongrp);

ret = mkdir_mondata_all(rdtgroup_default.kn,
&rdtgroup_default, &kn_mondata);
- if (ret) {
- dentry = ERR_PTR(ret);
+ if (ret < 0)
goto out_mongrp;
- }
kernfs_get(kn_mondata);
rdtgroup_default.mon.mon_data_kn = kn_mondata;
}

- dentry = kernfs_mount(fs_type, flags, rdt_root,
- RDTGROUP_SUPER_MAGIC, NULL);
- if (IS_ERR(dentry))
+ ret = kernfs_get_tree(fc);
+ if (ret < 0)
goto out_mondata;

if (rdt_alloc_capable)
@@ -1293,8 +1263,51 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
rdt_last_cmd_clear();
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
+ return ret;
+}
+
+static int rdt_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct rdt_fs_context *ctx = fc->fs_private;
+
+ if (strcmp(opt, "cdp") == 0) {
+ ctx->enable_cdpl3 = true;
+ return 0;
+ }
+ if (strcmp(opt, "cdpl2") == 0) {
+ ctx->enable_cdpl2 = true;
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+static void rdt_fs_context_free(struct fs_context *fc)
+{
+ struct rdt_fs_context *ctx = fc->fs_private;

- return dentry;
+ kernfs_free_fs_context(&ctx->kfc);
+}
+
+static const struct fs_context_operations rdt_fs_context_ops = {
+ .free = rdt_fs_context_free,
+ .parse_option = rdt_parse_option,
+ .get_tree = rdt_get_tree,
+};
+
+static int rdt_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+ struct rdt_fs_context *ctx;
+
+ ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->kfc.root = rdt_root;
+ ctx->kfc.magic = RDTGROUP_SUPER_MAGIC;
+ fc->fs_private = ctx;
+ fc->ops = &rdt_fs_context_ops;
+ return 0;
}

static int reset_all_ctrls(struct rdt_resource *r)
@@ -1459,9 +1472,9 @@ static void rdt_kill_sb(struct super_block *sb)
}

static struct file_system_type rdt_fs_type = {
- .name = "resctrl",
- .mount = rdt_mount,
- .kill_sb = rdt_kill_sb,
+ .name = "resctrl",
+ .init_fs_context = rdt_init_fs_context,
+ .kill_sb = rdt_kill_sb,
};

static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f70e0b69e714..64cd97a6fe14 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -22,14 +22,14 @@

struct kmem_cache *kernfs_node_cache;

-static int kernfs_sop_remount_fs(struct super_block *sb, int *flags,
- char *data, size_t data_size)
+static int kernfs_sop_reconfigure(struct super_block *sb, struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = fc->fs_private;
struct kernfs_root *root = kernfs_info(sb)->root;
struct kernfs_syscall_ops *scops = root->syscall_ops;

- if (scops && scops->remount_fs)
- return scops->remount_fs(root, flags, data);
+ if (scops && scops->reconfigure)
+ return scops->reconfigure(root, kfc);
return 0;
}

@@ -61,7 +61,7 @@ const struct super_operations kernfs_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = kernfs_evict_inode,

- .remount_fs = kernfs_sop_remount_fs,
+ .reconfigure = kernfs_sop_reconfigure,
.show_options = kernfs_sop_show_options,
.show_path = kernfs_sop_show_path,
};
@@ -219,7 +219,7 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
} while (true);
}

-static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
+static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
{
struct kernfs_super_info *info = kernfs_info(sb);
struct inode *inode;
@@ -230,7 +230,7 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
- sb->s_magic = magic;
+ sb->s_magic = kfc->magic;
sb->s_op = &kernfs_sops;
sb->s_xattr = kernfs_xattr_handlers;
if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP)
@@ -257,20 +257,25 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
return 0;
}

-static int kernfs_test_super(struct super_block *sb, void *data)
+static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = fc->fs_private;
struct kernfs_super_info *sb_info = kernfs_info(sb);
- struct kernfs_super_info *info = data;
+ struct kernfs_super_info *info = kfc->info;

return sb_info->root == info->root && sb_info->ns == info->ns;
}

-static int kernfs_set_super(struct super_block *sb, void *data)
+static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = fc->fs_private;
int error;
- error = set_anon_super(sb, data);
- if (!error)
- sb->s_fs_info = data;
+
+ error = set_anon_super(sb, kfc->info);
+ if (!error) {
+ sb->s_fs_info = kfc->info;
+ kfc->info = NULL;
+ }
return error;
}

@@ -288,63 +293,59 @@ const void *kernfs_super_ns(struct super_block *sb)
}

/**
- * kernfs_mount_ns - kernfs mount helper
- * @fs_type: file_system_type of the fs being mounted
- * @flags: mount flags specified for the mount
- * @root: kernfs_root of the hierarchy being mounted
- * @magic: file system specific magic number
- * @new_sb_created: tell the caller if we allocated a new superblock
- * @ns: optional namespace tag of the mount
+ * kernfs_get_tree - kernfs filesystem access/retrieval helper
+ * @fc: The filesystem context.
*
- * This is to be called from each kernfs user's file_system_type->mount()
- * implementation, which should pass through the specified @fs_type and
- * @flags, and specify the hierarchy and namespace tag to mount via @root
- * and @ns, respectively.
- *
- * The return value can be passed to the vfs layer verbatim.
+ * This is to be called from each kernfs user's fs_context->ops->get_tree()
+ * implementation, which should have set up the kernfs_fs_context in
+ * fc->fs_private, specifying the hierarchy and namespace tag to mount via
+ * ->root and ->ns_tag, respectively.
*/
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
+int kernfs_get_tree(struct fs_context *fc)
{
+ struct kernfs_fs_context *kfc = fc->fs_private;
struct super_block *sb;
struct kernfs_super_info *info;
int error;

info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
- return ERR_PTR(-ENOMEM);
+ return -ENOMEM;

- info->root = root;
- info->ns = ns;
+ info->root = kfc->root;
+ info->ns = kfc->ns_tag;
INIT_LIST_HEAD(&info->node);

- sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags,
- &init_user_ns, info);
- if (IS_ERR(sb) || sb->s_fs_info != info)
- kfree(info);
+ kfc->info = info;
+ sb = sget_fc(fc, kernfs_test_super, kernfs_set_super);
+ if (kfc->info) {
+ kfree(kfc->info);
+ kfc->info = NULL;
+ } else {
+ kfc->ns_tag = NULL;
+ }
if (IS_ERR(sb))
- return ERR_CAST(sb);
-
- if (new_sb_created)
- *new_sb_created = !sb->s_root;
+ return PTR_ERR(sb);

if (!sb->s_root) {
struct kernfs_super_info *info = kernfs_info(sb);

- error = kernfs_fill_super(sb, magic);
+ kfc->new_sb_created = true;
+
+ error = kernfs_fill_super(sb, kfc);
if (error) {
deactivate_locked_super(sb);
- return ERR_PTR(error);
+ return error;
}
sb->s_flags |= SB_ACTIVE;

mutex_lock(&kernfs_mutex);
- list_add(&info->node, &root->supers);
+ list_add(&info->node, &info->root->supers);
mutex_unlock(&kernfs_mutex);
}

- return dget(sb->s_root);
+ fc->root = dget(sb->s_root);
+ return 0;
}

/**
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 77302c35b0ff..c1cc4d9dc189 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -13,6 +13,7 @@
#include <linux/magic.h>
#include <linux/mount.h>
#include <linux/init.h>
+#include <linux/slab.h>
#include <linux/user_namespace.h>

#include "sysfs.h"
@@ -20,27 +21,52 @@
static struct kernfs_root *sysfs_root;
struct kernfs_node *sysfs_root_kn;

-static struct dentry *sysfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data, size_t data_size)
+static int sysfs_get_tree(struct fs_context *fc)
{
- struct dentry *root;
- void *ns;
- bool new_sb = false;
+ struct kernfs_fs_context *kfc = fc->fs_private;
+ int ret;

- if (!(flags & SB_KERNMOUNT)) {
+ ret = kernfs_get_tree(fc);
+ if (ret == 0 && kfc->new_sb_created)
+ fc->root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
+ return ret;
+}
+
+static void sysfs_fs_context_free(struct fs_context *fc)
+{
+ struct kernfs_fs_context *kfc = fc->fs_private;
+
+ if (kfc->ns_tag)
+ kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
+ kernfs_free_fs_context(kfc);
+ kfree(kfc);
+}
+
+static const struct fs_context_operations sysfs_fs_context_ops = {
+ .free = sysfs_fs_context_free,
+ .get_tree = sysfs_get_tree,
+};
+
+static int sysfs_init_fs_context(struct fs_context *fc,
+ struct dentry *reference)
+{
+ struct kernfs_fs_context *kfc;
+
+ if (!(fc->sb_flags & SB_KERNMOUNT)) {
if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
- return ERR_PTR(-EPERM);
+ return -EPERM;
}

- ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
- root = kernfs_mount_ns(fs_type, flags, sysfs_root,
- SYSFS_MAGIC, &new_sb, ns);
- if (!new_sb)
- kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
- else if (!IS_ERR(root))
- root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
+ kfc = kzalloc(sizeof(struct kernfs_fs_context), GFP_KERNEL);
+ if (!kfc)
+ return -ENOMEM;

- return root;
+ kfc->ns_tag = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+ kfc->root = sysfs_root;
+ kfc->magic = SYSFS_MAGIC;
+ fc->fs_private = kfc;
+ fc->ops = &sysfs_fs_context_ops;
+ return 0;
}

static void sysfs_kill_sb(struct super_block *sb)
@@ -52,10 +78,10 @@ static void sysfs_kill_sb(struct super_block *sb)
}

static struct file_system_type sysfs_fs_type = {
- .name = "sysfs",
- .mount = sysfs_mount,
- .kill_sb = sysfs_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "sysfs",
+ .init_fs_context = sysfs_init_fs_context,
+ .kill_sb = sysfs_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

int __init sysfs_init(void)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 473e0c0abb86..50771a7d0be9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -821,10 +821,11 @@ copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,

#endif /* !CONFIG_CGROUPS */

-static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+static inline struct cgroup_namespace *get_cgroup_ns(struct cgroup_namespace *ns)
{
if (ns)
refcount_inc(&ns->count);
+ return ns;
}

static inline void put_cgroup_ns(struct cgroup_namespace *ns)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index ab25c8b6d9e3..9d89d5ea9b39 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -16,6 +16,7 @@
#include <linux/rbtree.h>
#include <linux/atomic.h>
#include <linux/wait.h>
+#include <linux/fs_context.h>

struct file;
struct dentry;
@@ -25,6 +26,7 @@ struct vm_area_struct;
struct super_block;
struct file_system_type;

+struct kernfs_fs_context;
struct kernfs_open_node;
struct kernfs_iattrs;

@@ -166,7 +168,7 @@ struct kernfs_node {
* kernfs_node parameter.
*/
struct kernfs_syscall_ops {
- int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
+ int (*reconfigure)(struct kernfs_root *root, struct kernfs_fs_context *kfc);
int (*show_options)(struct seq_file *sf, struct kernfs_root *root);

int (*mkdir)(struct kernfs_node *parent, const char *name,
@@ -267,6 +269,19 @@ struct kernfs_ops {
#endif
};

+/*
+ * The kernfs superblock creation/mount parameter context.
+ */
+struct kernfs_fs_context {
+ struct kernfs_root *root; /* Root of the hierarchy being mounted */
+ void *ns_tag; /* Namespace tag of the mount (or NULL) */
+ unsigned long magic; /* File system specific magic number */
+
+ /* The following are set/used by kernfs_get_tree() */
+ struct kernfs_super_info *info; /* The new superblock info */
+ bool new_sb_created; /* Set to T if we allocated a new sb */
+};
+
#ifdef CONFIG_KERNFS

static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
@@ -350,9 +365,7 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
void kernfs_notify(struct kernfs_node *kn);

const void *kernfs_super_ns(struct super_block *sb);
-struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns);
+int kernfs_get_tree(struct fs_context *fc);
void kernfs_kill_sb(struct super_block *sb);
struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);

@@ -454,11 +467,8 @@ static inline void kernfs_notify(struct kernfs_node *kn) { }
static inline const void *kernfs_super_ns(struct super_block *sb)
{ return NULL; }

-static inline struct dentry *
-kernfs_mount_ns(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
-{ return ERR_PTR(-ENOSYS); }
+static inline int kernfs_get_tree(struct fs_context *fc)
+{ return -ENOSYS; }

static inline void kernfs_kill_sb(struct super_block *sb) { }

@@ -535,13 +545,9 @@ static inline int kernfs_rename(struct kernfs_node *kn,
return kernfs_rename_ns(kn, new_parent, new_name, NULL);
}

-static inline struct dentry *
-kernfs_mount(struct file_system_type *fs_type, int flags,
- struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created)
+static inline void kernfs_free_fs_context(struct kernfs_fs_context *kfc)
{
- return kernfs_mount_ns(fs_type, flags, root,
- magic, new_sb_created, NULL);
+ /* Note that we don't deal with kfc->ns_tag here. */
}

#endif /* __LINUX_KERNFS_H */
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 0808a33d16d3..4fb4a820824a 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -8,6 +8,26 @@
#include <linux/list.h>
#include <linux/refcount.h>

+/*
+ * The cgroup filesystem superblock creation/mount context.
+ */
+struct cgroup_fs_context {
+ struct kernfs_fs_context kfc;
+ struct cgroup_root *root;
+ struct cgroup_namespace *ns;
+ u8 version; /* cgroups version */
+ unsigned int flags; /* CGRP_ROOT_* flags */
+
+ /* cgroup1 bits */
+ bool cpuset_clone_children;
+ bool none; /* User explicitly requested empty subsystem */
+ bool all_ss; /* Seen 'all' option */
+ bool one_ss; /* Seen a subsystem name option */
+ u16 subsys_mask; /* Selected subsystems */
+ char *name; /* Hierarchy name */
+ char *release_agent; /* Path for release notifications */
+};
+
/*
* A cgroup can be associated with multiple css_sets as different tasks may
* belong to different cgroups on different hierarchies. In the other
@@ -89,16 +109,6 @@ struct cgroup_mgctx {
#define DEFINE_CGROUP_MGCTX(name) \
struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)

-struct cgroup_sb_opts {
- u16 subsys_mask;
- unsigned int flags;
- char *release_agent;
- bool cpuset_clone_children;
- char *name;
- /* User explicitly requested empty subsystem */
- bool none;
-};
-
extern struct mutex cgroup_mutex;
extern spinlock_t css_set_lock;
extern struct cgroup_subsys *cgroup_subsys[];
@@ -169,12 +179,10 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
struct cgroup_namespace *ns);

void cgroup_free_root(struct cgroup_root *root);
-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts);
+void init_cgroup_root(struct cgroup_fs_context *ctx);
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
- struct cgroup_root *root, unsigned long magic,
- struct cgroup_namespace *ns);
+int cgroup_do_get_tree(struct fs_context *fc);

int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
@@ -225,8 +233,8 @@ bool cgroup1_ssid_disabled(int ssid);
void cgroup1_pidlist_destroy_all(struct cgroup *cgrp);
void cgroup1_release_agent(struct work_struct *work);
void cgroup1_check_for_release(struct cgroup *cgrp);
-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
- void *data, unsigned long magic,
- struct cgroup_namespace *ns);
+int cgroup1_parse_option(struct cgroup_fs_context *ctx, char *p);
+int cgroup1_validate(struct cgroup_fs_context *ctx);
+int cgroup1_get_tree(struct fs_context *fc);

#endif /* __CGROUP_INTERNAL_H */
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index e06c97f3ed1a..48accc74292a 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -16,6 +16,8 @@

#include <trace/events/cgroup.h>

+#define cg_invalf(fmt, ...) ({ pr_err(fmt, ## __VA_ARGS__); -EINVAL; })
+
/*
* pidlists linger the following amount before being destroyed. The goal
* is avoiding frequent destruction in the middle of consecutive read calls
@@ -903,168 +905,166 @@ static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root *kf_roo
return 0;
}

-static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
+int cgroup1_parse_option(struct cgroup_fs_context *ctx, char *token)
{
- char *token, *o = data;
- bool all_ss = false, one_ss = false;
- u16 mask = U16_MAX;
struct cgroup_subsys *ss;
- int nr_opts = 0;
int i;

-#ifdef CONFIG_CPUSETS
- mask = ~((u16)1 << cpuset_cgrp_id);
-#endif
-
- memset(opts, 0, sizeof(*opts));
-
- while ((token = strsep(&o, ",")) != NULL) {
- nr_opts++;
+ if (!strcmp(token, "none")) {
+ /* Explicitly have no subsystems */
+ ctx->none = true;
+ return 0;
+ }
+ if (!strcmp(token, "all")) {
+ /* Mutually exclusive option 'all' + subsystem name */
+ if (ctx->one_ss)
+ return cg_invalf("cgroup1: all conflicts with subsys name");
+ ctx->all_ss = true;
+ return 0;
+ }
+ if (!strcmp(token, "noprefix")) {
+ ctx->flags |= CGRP_ROOT_NOPREFIX;
+ return 0;
+ }
+ if (!strcmp(token, "clone_children")) {
+ ctx->cpuset_clone_children = true;
+ return 0;
+ }
+ if (!strcmp(token, "xattr")) {
+ ctx->flags |= CGRP_ROOT_XATTR;
+ return 0;
+ }
+ if (!strncmp(token, "release_agent=", 14)) {
+ /* Specifying two release agents is forbidden */
+ if (ctx->release_agent)
+ return cg_invalf("cgroup1: release_agent respecified");
+ ctx->release_agent =
+ kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
+ if (!ctx->release_agent)
+ return -ENOMEM;
+ return 0;
+ }

- if (!*token)
- return -EINVAL;
- if (!strcmp(token, "none")) {
- /* Explicitly have no subsystems */
- opts->none = true;
- continue;
- }
- if (!strcmp(token, "all")) {
- /* Mutually exclusive option 'all' + subsystem name */
- if (one_ss)
- return -EINVAL;
- all_ss = true;
- continue;
- }
- if (!strcmp(token, "noprefix")) {
- opts->flags |= CGRP_ROOT_NOPREFIX;
- continue;
+ if (!strncmp(token, "name=", 5)) {
+ const char *name = token + 5;
+ /* Can't specify an empty name */
+ if (!strlen(name))
+ return cg_invalf("cgroup1: Empty name");
+ /* Must match [\w.-]+ */
+ for (i = 0; i < strlen(name); i++) {
+ char c = name[i];
+ if (isalnum(c))
+ continue;
+ if ((c == '.') || (c == '-') || (c == '_'))
+ continue;
+ return cg_invalf("cgroup1: Invalid name");
}
- if (!strcmp(token, "clone_children")) {
- opts->cpuset_clone_children = true;
+ /* Specifying two names is forbidden */
+ if (ctx->name)
+ return cg_invalf("cgroup1: name respecified");
+ ctx->name = kstrndup(name,
+ MAX_CGROUP_ROOT_NAMELEN - 1,
+ GFP_KERNEL);
+ if (!ctx->name)
+ return -ENOMEM;
+
+ return 0;
+ }
+
+ for_each_subsys(ss, i) {
+ if (strcmp(token, ss->legacy_name))
continue;
- }
if (!strcmp(token, "cpuset_v2_mode")) {
- opts->flags |= CGRP_ROOT_CPUSET_V2_MODE;
+ ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE;
continue;
}
if (!strcmp(token, "xattr")) {
- opts->flags |= CGRP_ROOT_XATTR;
+ ctx->flags |= CGRP_ROOT_XATTR;
continue;
}
- if (!strncmp(token, "release_agent=", 14)) {
- /* Specifying two release agents is forbidden */
- if (opts->release_agent)
- return -EINVAL;
- opts->release_agent =
- kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
- if (!opts->release_agent)
- return -ENOMEM;
+ if (cgroup1_ssid_disabled(i))
continue;
- }
- if (!strncmp(token, "name=", 5)) {
- const char *name = token + 5;
- /* Can't specify an empty name */
- if (!strlen(name))
- return -EINVAL;
- /* Must match [\w.-]+ */
- for (i = 0; i < strlen(name); i++) {
- char c = name[i];
- if (isalnum(c))
- continue;
- if ((c == '.') || (c == '-') || (c == '_'))
- continue;
- return -EINVAL;
- }
- /* Specifying two names is forbidden */
- if (opts->name)
- return -EINVAL;
- opts->name = kstrndup(name,
- MAX_CGROUP_ROOT_NAMELEN - 1,
- GFP_KERNEL);
- if (!opts->name)
- return -ENOMEM;

- continue;
- }
+ /* Mutually exclusive option 'all' + subsystem name */
+ if (ctx->all_ss)
+ return cg_invalf("cgroup1: subsys name conflicts with all");
+ ctx->subsys_mask |= (1 << i);
+ ctx->one_ss = true;
+ return 0;
+ }

- for_each_subsys(ss, i) {
- if (strcmp(token, ss->legacy_name))
- continue;
- if (!cgroup_ssid_enabled(i))
- continue;
- if (cgroup1_ssid_disabled(i))
- continue;
+ if (i == CGROUP_SUBSYS_COUNT)
+ return -ENOENT;
+
+ return 0;
+}

- /* Mutually exclusive option 'all' + subsystem name */
- if (all_ss)
- return -EINVAL;
- opts->subsys_mask |= (1 << i);
- one_ss = true;
+/*
+ * Validate the options that have been parsed.
+ */
+int cgroup1_validate(struct cgroup_fs_context *ctx)
+{
+ struct cgroup_subsys *ss;
+ u16 mask = U16_MAX;
+ int i;

- break;
- }
- if (i == CGROUP_SUBSYS_COUNT)
- return -ENOENT;
- }
+#ifdef CONFIG_CPUSETS
+ mask = ~((u16)1 << cpuset_cgrp_id);
+#endif

/*
* If the 'all' option was specified select all the subsystems,
* otherwise if 'none', 'name=' and a subsystem name options were
* not specified, let's default to 'all'
*/
- if (all_ss || (!one_ss && !opts->none && !opts->name))
+ if (ctx->all_ss || (!ctx->one_ss && !ctx->none && !ctx->name))
for_each_subsys(ss, i)
if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i))
- opts->subsys_mask |= (1 << i);
+ ctx->subsys_mask |= (1 << i);

/*
* We either have to specify by name or by subsystems. (So all
* empty hierarchies must have a name).
*/
- if (!opts->subsys_mask && !opts->name)
- return -EINVAL;
+ if (!ctx->subsys_mask && !ctx->name)
+ return cg_invalf("cgroup1: Need name or subsystem set");

/*
* Option noprefix was introduced just for backward compatibility
* with the old cpuset, so we allow noprefix only if mounting just
* the cpuset subsystem.
*/
- if ((opts->flags & CGRP_ROOT_NOPREFIX) && (opts->subsys_mask & mask))
- return -EINVAL;
+ if ((ctx->flags & CGRP_ROOT_NOPREFIX) && (ctx->subsys_mask & mask))
+ return cg_invalf("cgroup1: noprefix used incorrectly");

/* Can't specify "none" and some subsystems */
- if (opts->subsys_mask && opts->none)
- return -EINVAL;
+ if (ctx->subsys_mask && ctx->none)
+ return cg_invalf("cgroup1: none used incorrectly");

return 0;
}

-static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
+static int cgroup1_reconfigure(struct kernfs_root *kf_root, struct kernfs_fs_context *kfc)
{
- int ret = 0;
+ struct cgroup_fs_context *ctx = container_of(kfc, struct cgroup_fs_context, kfc);
struct cgroup_root *root = cgroup_root_from_kf(kf_root);
- struct cgroup_sb_opts opts;
u16 added_mask, removed_mask;
+ int ret = 0;

cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);

- /* See what subsystems are wanted */
- ret = parse_cgroupfs_options(data, &opts);
- if (ret)
- goto out_unlock;
-
- if (opts.subsys_mask != root->subsys_mask || opts.release_agent)
+ if (ctx->subsys_mask != root->subsys_mask || ctx->release_agent)
pr_warn("option changes via remount are deprecated (pid=%d comm=%s)\n",
task_tgid_nr(current), current->comm);

- added_mask = opts.subsys_mask & ~root->subsys_mask;
- removed_mask = root->subsys_mask & ~opts.subsys_mask;
+ added_mask = ctx->subsys_mask & ~root->subsys_mask;
+ removed_mask = root->subsys_mask & ~ctx->subsys_mask;

/* Don't allow flags or name to change at remount */
- if ((opts.flags ^ root->flags) ||
- (opts.name && strcmp(opts.name, root->name))) {
- pr_err("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"\n",
- opts.flags, opts.name ?: "", root->flags, root->name);
+ if ((ctx->flags ^ root->flags) ||
+ (ctx->name && strcmp(ctx->name, root->name))) {
+ cg_invalf("option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"",
+ ctx->flags, ctx->name ?: "", root->flags, root->name);
ret = -EINVAL;
goto out_unlock;
}
@@ -1081,17 +1081,15 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)

WARN_ON(rebind_subsystems(&cgrp_dfl_root, removed_mask));

- if (opts.release_agent) {
+ if (ctx->release_agent) {
spin_lock(&release_agent_path_lock);
- strcpy(root->release_agent_path, opts.release_agent);
+ strcpy(root->release_agent_path, ctx->release_agent);
spin_unlock(&release_agent_path_lock);
}

trace_cgroup_remount(root);

out_unlock:
- kfree(opts.release_agent);
- kfree(opts.name);
mutex_unlock(&cgroup_mutex);
return ret;
}
@@ -1099,31 +1097,26 @@ static int cgroup1_remount(struct kernfs_root *kf_root, int *flags, char *data)
struct kernfs_syscall_ops cgroup1_kf_syscall_ops = {
.rename = cgroup1_rename,
.show_options = cgroup1_show_options,
- .remount_fs = cgroup1_remount,
+ .reconfigure = cgroup1_reconfigure,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.show_path = cgroup_show_path,
};

-struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
- void *data, unsigned long magic,
- struct cgroup_namespace *ns)
+/*
+ * Find or create a v1 cgroups superblock.
+ */
+int cgroup1_get_tree(struct fs_context *fc)
{
+ struct cgroup_fs_context *ctx = fc->fs_private;
struct super_block *pinned_sb = NULL;
- struct cgroup_sb_opts opts;
struct cgroup_root *root;
struct cgroup_subsys *ss;
- struct dentry *dentry;
int i, ret;
bool new_root = false;

cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);

- /* First find the desired set of subsystems */
- ret = parse_cgroupfs_options(data, &opts);
- if (ret)
- goto out_unlock;
-
/*
* Destruction of cgroup root is asynchronous, so subsystems may
* still be dying after the previous unmount. Let's drain the
@@ -1132,15 +1125,13 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* starting. Testing ref liveliness is good enough.
*/
for_each_subsys(ss, i) {
- if (!(opts.subsys_mask & (1 << i)) ||
+ if (!(ctx->subsys_mask & (1 << i)) ||
ss->root == &cgrp_dfl_root)
continue;

if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
- msleep(10);
- ret = restart_syscall();
- goto out_free;
+ goto err_restart;
}
cgroup_put(&ss->root->cgrp);
}
@@ -1156,8 +1147,8 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* name matches but sybsys_mask doesn't, we should fail.
* Remember whether name matched.
*/
- if (opts.name) {
- if (strcmp(opts.name, root->name))
+ if (ctx->name) {
+ if (strcmp(ctx->name, root->name))
continue;
name_match = true;
}
@@ -1166,15 +1157,15 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* If we asked for subsystems (or explicitly for no
* subsystems) then they must match.
*/
- if ((opts.subsys_mask || opts.none) &&
- (opts.subsys_mask != root->subsys_mask)) {
+ if ((ctx->subsys_mask || ctx->none) &&
+ (ctx->subsys_mask != root->subsys_mask)) {
if (!name_match)
continue;
ret = -EBUSY;
- goto out_unlock;
+ goto err_unlock;
}

- if (root->flags ^ opts.flags)
+ if (root->flags ^ ctx->flags)
pr_warn("new mount options do not match the existing superblock, will be ignored\n");

/*
@@ -1195,9 +1186,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
mutex_unlock(&cgroup_mutex);
if (!IS_ERR_OR_NULL(pinned_sb))
deactivate_super(pinned_sb);
- msleep(10);
- ret = restart_syscall();
- goto out_free;
+ goto err_restart;
}

ret = 0;
@@ -1209,41 +1198,35 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* specification is allowed for already existing hierarchies but we
* can't create new one without subsys specification.
*/
- if (!opts.subsys_mask && !opts.none) {
- ret = -EINVAL;
- goto out_unlock;
+ if (!ctx->subsys_mask && !ctx->none) {
+ ret = cg_invalf("cgroup1: No subsys list or none specified");
+ goto err_unlock;
}

/* Hierarchies may only be created in the initial cgroup namespace. */
- if (ns != &init_cgroup_ns) {
+ if (ctx->ns != &init_cgroup_ns) {
ret = -EPERM;
- goto out_unlock;
+ goto err_unlock;
}

root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
- goto out_unlock;
+ goto err_unlock;
}
new_root = true;
+ ctx->root = root;

- init_cgroup_root(root, &opts);
+ init_cgroup_root(ctx);

- ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD);
+ ret = cgroup_setup_root(root, ctx->subsys_mask, PERCPU_REF_INIT_DEAD);
if (ret)
cgroup_free_root(root);

out_unlock:
mutex_unlock(&cgroup_mutex);
-out_free:
- kfree(opts.release_agent);
- kfree(opts.name);
-
- if (ret)
- return ERR_PTR(ret);

- dentry = cgroup_do_mount(&cgroup_fs_type, flags, root,
- CGROUP_SUPER_MAGIC, ns);
+ ret = cgroup_do_get_tree(fc);

/*
* There's a race window after we release cgroup_mutex and before
@@ -1264,7 +1247,14 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
if (pinned_sb)
deactivate_super(pinned_sb);

- return dentry;
+ return ret;
+
+err_restart:
+ msleep(10);
+ return restart_syscall();
+err_unlock:
+ mutex_unlock(&cgroup_mutex);
+ return ret;
}

static int __init cgroup1_wq_init(void)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index af2baf9985bd..87f2a0f68d31 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1718,25 +1718,21 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
return len;
}

-static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
+static int cgroup2_parse_option(struct cgroup_fs_context *ctx, char *token)
{
- char *token;
-
- *root_flags = 0;
-
- if (!data)
+ if (!strcmp(token, "nsdelegate")) {
+ ctx->flags |= CGRP_ROOT_NS_DELEGATE;
return 0;
-
- while ((token = strsep(&data, ",")) != NULL) {
- if (!strcmp(token, "nsdelegate")) {
- *root_flags |= CGRP_ROOT_NS_DELEGATE;
- continue;
- }
-
- pr_err("cgroup2: unknown option \"%s\"\n", token);
- return -EINVAL;
}

+ return -EINVAL;
+}
+
+static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
+{
+ if (current->nsproxy->cgroup_ns == &init_cgroup_ns &&
+ cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
+ seq_puts(seq, ",nsdelegate");
return 0;
}

@@ -1750,23 +1746,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
}
}

-static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
-{
- if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
- seq_puts(seq, ",nsdelegate");
- return 0;
-}
-
-static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data)
+static int cgroup_reconfigure(struct kernfs_root *kf_root, struct kernfs_fs_context *kfc)
{
- unsigned int root_flags;
- int ret;
-
- ret = parse_cgroup_root_flags(data, &root_flags);
- if (ret)
- return ret;
+ struct cgroup_fs_context *ctx = container_of(kfc, struct cgroup_fs_context, kfc);

- apply_cgroup_root_flags(root_flags);
+ apply_cgroup_root_flags(ctx->flags);
return 0;
}

@@ -1852,8 +1836,9 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
}

-void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
+void init_cgroup_root(struct cgroup_fs_context *ctx)
{
+ struct cgroup_root *root = ctx->root;
struct cgroup *cgrp = &root->cgrp;

INIT_LIST_HEAD(&root->root_list);
@@ -1862,12 +1847,12 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
init_cgroup_housekeeping(cgrp);
idr_init(&root->cgroup_idr);

- root->flags = opts->flags;
- if (opts->release_agent)
- strscpy(root->release_agent_path, opts->release_agent, PATH_MAX);
- if (opts->name)
- strscpy(root->name, opts->name, MAX_CGROUP_ROOT_NAMELEN);
- if (opts->cpuset_clone_children)
+ root->flags = ctx->flags;
+ if (ctx->release_agent)
+ strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
+ if (ctx->name)
+ strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN);
+ if (ctx->cpuset_clone_children)
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}

@@ -1972,57 +1957,51 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
return ret;
}

-struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
- struct cgroup_root *root, unsigned long magic,
- struct cgroup_namespace *ns)
+int cgroup_do_get_tree(struct fs_context *fc)
{
- struct dentry *dentry;
- bool new_sb;
+ struct cgroup_fs_context *ctx = fc->fs_private;
+ int ret;
+
+ ctx->kfc.root = ctx->root->kf_root;

- dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
+ ret = kernfs_get_tree(fc);
+ if (ret < 0)
+ goto out_cgrp;

/*
* In non-init cgroup namespace, instead of root cgroup's dentry,
* we return the dentry corresponding to the cgroupns->root_cgrp.
*/
- if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+ if (ctx->ns != &init_cgroup_ns) {
struct dentry *nsdentry;
struct cgroup *cgrp;

mutex_lock(&cgroup_mutex);
spin_lock_irq(&css_set_lock);

- cgrp = cset_cgroup_from_root(ns->root_cset, root);
+ cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);

spin_unlock_irq(&css_set_lock);
mutex_unlock(&cgroup_mutex);

- nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
- dput(dentry);
- dentry = nsdentry;
+ nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
+ dput(fc->root);
+ fc->root = nsdentry;
}

- if (IS_ERR(dentry) || !new_sb)
- cgroup_put(&root->cgrp);
+ ret = 0;
+ if (ctx->kfc.new_sb_created)
+ goto out_cgrp;
+ apply_cgroup_root_flags(ctx->flags);
+ return 0;

- return dentry;
+out_cgrp:
+ return ret;
}

-static struct dentry *cgroup_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data, size_t data_size)
+static int cgroup_get_tree(struct fs_context *fc)
{
- struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
- struct dentry *dentry;
- int ret;
-
- get_cgroup_ns(ns);
-
- /* Check if the caller has permission to mount. */
- if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
- put_cgroup_ns(ns);
- return ERR_PTR(-EPERM);
- }
+ struct cgroup_fs_context *ctx = fc->fs_private;

/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -2031,29 +2010,87 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
if (!use_task_css_set_links)
cgroup_enable_task_cg_lists();

- if (fs_type == &cgroup2_fs_type) {
- unsigned int root_flags;
-
- ret = parse_cgroup_root_flags(data, &root_flags);
- if (ret) {
- put_cgroup_ns(ns);
- return ERR_PTR(ret);
- }
+ switch (ctx->version) {
+ case 1:
+ return cgroup1_get_tree(fc);

+ case 2:
cgrp_dfl_visible = true;
cgroup_get_live(&cgrp_dfl_root.cgrp);

- dentry = cgroup_do_mount(&cgroup2_fs_type, flags, &cgrp_dfl_root,
- CGROUP2_SUPER_MAGIC, ns);
- if (!IS_ERR(dentry))
- apply_cgroup_root_flags(root_flags);
- } else {
- dentry = cgroup1_mount(&cgroup_fs_type, flags, data,
- CGROUP_SUPER_MAGIC, ns);
+ ctx->root = &cgrp_dfl_root;
+ return cgroup_do_get_tree(fc);
+
+ default:
+ BUG();
}
+}
+
+static int cgroup_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct cgroup_fs_context *ctx = fc->fs_private;
+
+ if (ctx->version == 1)
+ return cgroup1_parse_option(ctx, opt);
+
+ return cgroup2_parse_option(ctx, opt);
+}
+
+static int cgroup_validate(struct fs_context *fc)
+{
+ struct cgroup_fs_context *ctx = fc->fs_private;

- put_cgroup_ns(ns);
- return dentry;
+ if (ctx->version == 1)
+ return cgroup1_validate(ctx);
+ return 0;
+}
+
+/*
+ * Destroy a cgroup filesystem context.
+ */
+static void cgroup_fs_context_free(struct fs_context *fc)
+{
+ struct cgroup_fs_context *ctx = fc->fs_private;
+
+ kfree(ctx->name);
+ kfree(ctx->release_agent);
+ if (ctx->root)
+ cgroup_put(&ctx->root->cgrp);
+ put_cgroup_ns(ctx->ns);
+ kernfs_free_fs_context(&ctx->kfc);
+ kfree(ctx);
+}
+
+static const struct fs_context_operations cgroup_fs_context_ops = {
+ .free = cgroup_fs_context_free,
+ .parse_option = cgroup_parse_option,
+ .validate = cgroup_validate,
+ .get_tree = cgroup_get_tree,
+};
+
+/*
+ * Initialise the cgroup filesystem creation/reconfiguration context. Notably,
+ * we select the namespace we're going to use.
+ */
+static int cgroup_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+ struct cgroup_fs_context *ctx;
+ struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ctx = kzalloc(sizeof(struct cgroup_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->ns = get_cgroup_ns(ns);
+ ctx->version = (fc->fs_type == &cgroup2_fs_type) ? 2 : 1;
+ ctx->kfc.magic = (ctx->version == 2) ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC;
+ fc->fs_private = ctx;
+ fc->ops = &cgroup_fs_context_ops;
+ return 0;
}

static void cgroup_kill_sb(struct super_block *sb)
@@ -2078,17 +2115,17 @@ static void cgroup_kill_sb(struct super_block *sb)
}

struct file_system_type cgroup_fs_type = {
- .name = "cgroup",
- .mount = cgroup_mount,
- .kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "cgroup",
+ .init_fs_context = cgroup_init_fs_context,
+ .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

static struct file_system_type cgroup2_fs_type = {
- .name = "cgroup2",
- .mount = cgroup_mount,
- .kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "cgroup2",
+ .init_fs_context = cgroup_init_fs_context,
+ .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
@@ -5132,7 +5169,7 @@ int cgroup_rmdir(struct kernfs_node *kn)

static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
.show_options = cgroup_show_options,
- .remount_fs = cgroup_remount,
+ .reconfigure = cgroup_reconfigure,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.show_path = cgroup_show_path,
@@ -5199,11 +5236,12 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
*/
int __init cgroup_init_early(void)
{
- static struct cgroup_sb_opts __initdata opts;
+ static struct cgroup_fs_context __initdata ctx;
struct cgroup_subsys *ss;
int i;

- init_cgroup_root(&cgrp_dfl_root, &opts);
+ ctx.root = &cgrp_dfl_root;
+ init_cgroup_root(&ctx);
cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;

RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
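
The "name=" handling in cgroup1_parse_option() above rejects an empty name and any character outside [\w.-]. A minimal userspace sketch of that check follows; ex_valid_hierarchy_name() and ex_name_selfcheck() are hypothetical names used only for illustration and are not part of the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if @name is non-empty and consists
 * only of alphanumerics, '.', '-' and '_', mirroring the loop in the
 * patch's "name=" branch.
 */
bool ex_valid_hierarchy_name(const char *name)
{
	size_t i;

	/* Can't specify an empty name */
	if (name[0] == '\0')
		return false;

	/* Must match [\w.-]+ */
	for (i = 0; name[i] != '\0'; i++) {
		unsigned char c = (unsigned char)name[i];

		if (isalnum(c) || c == '.' || c == '-' || c == '_')
			continue;
		return false;
	}
	return true;
}

/* Small usage example. */
void ex_name_selfcheck(void)
{
	assert(ex_valid_hierarchy_name("systemd"));
	assert(!ex_valid_hierarchy_name("a,b"));
}
```

The sketch walks the string once rather than calling strlen() on every iteration as the kernel loop does; the accepted character set is the same.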


2018-05-25 02:48:11

by David Howells

[permalink] [raw]
Subject: [PATCH 18/32] VFS: Remove kern_mount_data() [ver #8]

kern_mount_data() is no longer used, so remove it.

Signed-off-by: David Howells <[email protected]>
---

fs/namespace.c | 7 -------
include/linux/fs.h | 1 -
2 files changed, 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 14be35d02050..ead49e822418 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3198,13 +3198,6 @@ struct vfsmount *kern_mount(struct file_system_type *type)
}
EXPORT_SYMBOL_GPL(kern_mount);

-struct vfsmount *kern_mount_data(struct file_system_type *type,
- void *data, size_t data_size)
-{
- return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
/*
* Return true if path is reachable from root
*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 19bbed58829d..e771803cc8dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2186,7 +2186,6 @@ mount_pseudo(struct file_system_type *fs_type, char *name,
extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
extern struct vfsmount *kern_mount(struct file_system_type *);
-extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
extern void kern_unmount(struct vfsmount *mnt);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);


2018-05-25 02:48:15

by David Howells

[permalink] [raw]
Subject: [PATCH 20/32] vfs: Make close() unmount the attached mount if so flagged [ver #8]


---

fs/file_table.c | 4 ++++
include/linux/fs.h | 4 +++-
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 7ec0b3e5f05d..dbbcc563748a 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -30,6 +30,7 @@
#include <linux/atomic.h>

#include "internal.h"
+#include "mount.h"

/* sysctl tunables... */
struct files_stat_struct files_stat = {
@@ -200,6 +201,9 @@ static void __fput(struct file *file)
eventpoll_release(file);
locks_remove_file(file);

+ if (unlikely(file->f_mode & FMODE_NEED_UNMOUNT))
+ __detach_mounts(dentry);
+
ima_file_free(file);
if (unlikely(file->f_flags & FASYNC)) {
if (file->f_op->fasync)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e771803cc8dc..ba571c18e236 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -152,7 +152,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define FMODE_NONOTIFY ((__force fmode_t)0x4000000)

/* File is capable of returning -EAGAIN if I/O will block */
-#define FMODE_NOWAIT ((__force fmode_t)0x8000000)
+#define FMODE_NOWAIT ((__force fmode_t)0x8000000)
+/* File represents mount that needs unmounting */
+#define FMODE_NEED_UNMOUNT ((__force fmode_t)0x10000000)

/*
* Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector


2018-05-25 02:48:16

by David Howells

[permalink] [raw]
Subject: [PATCH 21/32] VFS: Implement fsmount() to effect a pre-configured mount [ver #8]

Provide a system call by which a filesystem opened with fsopen() and
configured by a series of writes can be mounted:

int mfd = fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags,
void *spare_4, void *spare_5);

where fs_fd is the fd returned by fsopen(), flags may contain
FSMOUNT_CLOEXEC and ms_flags are the applicable MS_* flags. The two spare
arguments are reserved and must be NULL. On success, the new mount is
attached to an O_PATH-class fd rather than to the mount namespace.

In the event that fsmount() fails, it may be possible to get an error
message by calling read(). If no message is available, ENODATA will be
reported.
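
The MS_* to MNT_* translation performed at the top of the syscall in the diff below can be sketched in userspace as follows. This is only an illustration: the EX_MS_* and EX_MNT_* constants are local copies whose values are assumed to match the kernel's MS_* and MNT_* definitions, and ex_ms_to_mnt_flags() and ex_flags_selfcheck() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/* Assumed to mirror the kernel's MS_* values (illustrative copies). */
#define EX_MS_RDONLY		1u
#define EX_MS_NOSUID		2u
#define EX_MS_NODEV		4u
#define EX_MS_NOEXEC		8u
#define EX_MS_NOATIME		1024u
#define EX_MS_NODIRATIME	2048u
#define EX_MS_RELATIME		(1u << 21)
#define EX_MS_STRICTATIME	(1u << 24)

/* Assumed to mirror the kernel's MNT_* values (illustrative copies). */
#define EX_MNT_NOSUID		0x01u
#define EX_MNT_NODEV		0x02u
#define EX_MNT_NOEXEC		0x04u
#define EX_MNT_NOATIME		0x08u
#define EX_MNT_NODIRATIME	0x10u
#define EX_MNT_RELATIME		0x20u
#define EX_MNT_READONLY		0x40u

#define EX_MS_ALLOWED (EX_MS_RDONLY | EX_MS_NOSUID | EX_MS_NODEV | \
		       EX_MS_NOEXEC | EX_MS_NOATIME | EX_MS_NODIRATIME | \
		       EX_MS_RELATIME | EX_MS_STRICTATIME)

/*
 * Hypothetical helper: translate MS_* flags into MNT_* flags the way the
 * syscall body does.  Returns 0 and fills *mnt_flags on success, or
 * -EINVAL for unknown flags or for strictatime combined with noatime.
 */
int ex_ms_to_mnt_flags(unsigned int ms_flags, unsigned int *mnt_flags)
{
	unsigned int out = 0;

	if (ms_flags & ~EX_MS_ALLOWED)
		return -EINVAL;

	if (ms_flags & EX_MS_RDONLY)
		out |= EX_MNT_READONLY;
	if (ms_flags & EX_MS_NOSUID)
		out |= EX_MNT_NOSUID;
	if (ms_flags & EX_MS_NODEV)
		out |= EX_MNT_NODEV;
	if (ms_flags & EX_MS_NOEXEC)
		out |= EX_MNT_NOEXEC;
	if (ms_flags & EX_MS_NODIRATIME)
		out |= EX_MNT_NODIRATIME;

	/* relatime is the default unless strictatime or noatime is given */
	if (ms_flags & EX_MS_STRICTATIME) {
		if (ms_flags & EX_MS_NOATIME)
			return -EINVAL;
	} else if (ms_flags & EX_MS_NOATIME) {
		out |= EX_MNT_NOATIME;
	} else {
		out |= EX_MNT_RELATIME;
	}

	*mnt_flags = out;
	return 0;
}

/* Small usage example. */
void ex_flags_selfcheck(void)
{
	unsigned int mnt = 0;

	assert(ex_ms_to_mnt_flags(EX_MS_NODEV, &mnt) == 0);
	assert(mnt == (EX_MNT_NODEV | EX_MNT_RELATIME));
}
```

Note that, as in the patch, passing neither MS_STRICTATIME nor MS_NOATIME yields relatime behaviour whether or not MS_RELATIME was given explicitly.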

Signed-off-by: David Howells <[email protected]>
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/namespace.c | 133 ++++++++++++++++++++++++++++++++
include/linux/fs_context.h | 2
include/linux/syscalls.h | 2
include/uapi/linux/fs.h | 7 ++
kernel/sys_ni.c | 1
7 files changed, 147 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 0e084cc11638..bdcb0c4a0491 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
386 i386 fsopen sys_fsopen __ia32_sys_fsopen
+387 i386 fsmount sys_fsmount __ia32_sys_fsmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7200d5bb65ca..7d932d3897fa 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
332 common statx __x64_sys_statx
333 common io_pgetevents __x64_sys_io_pgetevents
334 common fsopen __x64_sys_fsopen
+335 common fsmount __x64_sys_fsmount

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index ead49e822418..03ade803b948 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3198,6 +3198,139 @@ struct vfsmount *kern_mount(struct file_system_type *type)
}
EXPORT_SYMBOL_GPL(kern_mount);

+/*
+ * Create a kernel mount representation for a new, prepared superblock
+ * (specified by fs_fd) and attach to an O_PATH-class file descriptor.
+ */
+SYSCALL_DEFINE5(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags,
+ void *, spare_4, void *, spare_5)
+{
+ struct fs_context *fc;
+ struct inode *inode;
+ struct file *file;
+ struct path newmount;
+ struct fd f;
+ unsigned int mnt_flags = 0;
+ long ret;
+
+ if ((flags & ~(FSMOUNT_CLOEXEC)) != 0 || spare_4 || spare_5)
+ return -EINVAL;
+
+ if (ms_flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
+ MS_STRICTATIME))
+ return -EINVAL;
+
+ if (ms_flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
+ if (ms_flags & MS_NOSUID)
+ mnt_flags |= MNT_NOSUID;
+ if (ms_flags & MS_NODEV)
+ mnt_flags |= MNT_NODEV;
+ if (ms_flags & MS_NOEXEC)
+ mnt_flags |= MNT_NOEXEC;
+ if (ms_flags & MS_NODIRATIME)
+ mnt_flags |= MNT_NODIRATIME;
+
+ if (ms_flags & MS_STRICTATIME) {
+ if (ms_flags & MS_NOATIME)
+ return -EINVAL;
+ } else if (ms_flags & MS_NOATIME) {
+ mnt_flags |= MNT_NOATIME;
+ } else {
+ mnt_flags |= MNT_RELATIME;
+ }
+
+ f = fdget(fs_fd);
+ if (!f.file)
+ return -EBADF;
+
+ ret = -EINVAL;
+ if (f.file->f_op != &fscontext_fs_fops)
+ goto err_fsfd;
+
+ fc = f.file->private_data;
+
+ ret = -EPERM;
+ if (!may_mount() ||
+ ((fc->sb_flags & SB_MANDLOCK) && !may_mandlock()))
+ goto err_fsfd;
+
+ /* There must be a valid superblock or we can't mount it */
+ ret = -EINVAL;
+ if (!fc->root)
+ goto err_fsfd;
+
+ ret = -EPERM;
+ if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
+ pr_warn("VFS: Mount too revealing\n");
+ goto err_fsfd;
+ }
+
+ inode = file_inode(f.file);
+ ret = inode_lock_killable(inode);
+ if (ret < 0)
+ goto err_fsfd;
+
+ ret = -EBUSY;
+ if (fc->phase != FS_CONTEXT_AWAITING_MOUNT)
+ goto err_unlock;
+
+ newmount.mnt = vfs_create_mount(fc, mnt_flags);
+ if (IS_ERR(newmount.mnt)) {
+ ret = PTR_ERR(newmount.mnt);
+ goto err_unlock;
+ }
+ newmount.dentry = dget(fc->root);
+
+ /* We've done the mount bit - now move the file context into more or
+ * less the same state as if we'd done an fspick(). We don't want to
+ * do any memory allocation or anything like that at this point as we
+ * don't want to have to handle any errors incurred.
+ */
+ if (fc->ops && fc->ops->free)
+ fc->ops->free(fc);
+ fc->fs_private = NULL;
+ fc->s_fs_info = NULL;
+ fc->sb_flags = 0;
+ fc->sloppy = false;
+ fc->silent = false;
+ fc->source_is_dev = false;
+ security_fs_context_free(fc);
+ fc->security = NULL;
+ kfree(fc->subtype);
+ fc->subtype = NULL;
+ kfree(fc->source);
+ fc->source = NULL;
+
+ fc->purpose = FS_CONTEXT_FOR_RECONFIGURE;
+ fc->phase = FS_CONTEXT_AWAITING_RECONF;
+
+ /* Attach to an apparent O_PATH fd with a note that we need to unmount
+ * it, not just simply put it.
+ */
+ file = dentry_open(&newmount, O_PATH, fc->cred);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto err_path;
+ }
+ file->f_mode |= FMODE_NEED_UNMOUNT;
+
+ /* fd_install() consumes the file ref, so only fput() on failure. */
+ ret = get_unused_fd_flags((flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0);
+ if (ret >= 0)
+ fd_install(ret, file);
+ else
+ fput(file);
+err_path:
+ path_put(&newmount);
+err_unlock:
+ inode_unlock(inode);
+err_fsfd:
+ fdput(f);
+ return ret;
+}
+
/*
* Return true if path is reachable from root
*
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 368fe5bb1efd..bec4022e3f4b 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -115,4 +115,6 @@ extern int vfs_get_super(struct fs_context *fc,
int (*fill_super)(struct super_block *sb,
struct fs_context *fc));

+extern const struct file_operations fscontext_fs_fops;
+
#endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e0f19406af92..178370cad1dd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -898,6 +898,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_fsopen(const char *fs_name, unsigned int flags,
void *reserved3, void *reserved4, void *reserved5);
+asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags,
+ void *spare_4, void *spare_5);


/*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 5da6c2d96af5..edb1983a9990 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -338,4 +338,11 @@ typedef int __bitwise __kernel_rwf_t;
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND)

+/*
+ * Flags for fsopen() and co.
+ */
+#define FSOPEN_CLOEXEC 0x00000001
+
+#define FSMOUNT_CLOEXEC 0x00000001
+
#endif /* _UAPI_LINUX_FS_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6bb0e1bb3eae..632a937ca09c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -435,3 +435,4 @@ COND_SYSCALL(setuid16);

/* fd-based mount */
COND_SYSCALL(sys_fsopen);
+COND_SYSCALL(sys_fsmount);
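
The MS_* to MNT_* translation at the top of sys_fsmount() above can be tried out in isolation. What follows is a minimal userspace sketch, not the kernel code: the SK_* names and the function sk_ms_to_mnt_flags() are hypothetical, the SK_MS_* values are assumed to match <sys/mount.h>, and the SK_MNT_* values are assumed to mirror the kernel-internal MNT_* flags and should be treated as placeholders.

```c
#include <assert.h>
#include <errno.h>

/* Userspace mount flags; values assumed to match <sys/mount.h>. */
#define SK_MS_RDONLY		0x00000001u
#define SK_MS_NOSUID		0x00000002u
#define SK_MS_NODEV		0x00000004u
#define SK_MS_NOEXEC		0x00000008u
#define SK_MS_NOATIME		0x00000400u
#define SK_MS_NODIRATIME	0x00000800u
#define SK_MS_RELATIME		0x00200000u
#define SK_MS_STRICTATIME	0x01000000u

/* Placeholders for the kernel-internal MNT_* flags (assumed values). */
#define SK_MNT_NOSUID		0x01u
#define SK_MNT_NODEV		0x02u
#define SK_MNT_NOEXEC		0x04u
#define SK_MNT_NOATIME		0x08u
#define SK_MNT_NODIRATIME	0x10u
#define SK_MNT_RELATIME		0x20u
#define SK_MNT_READONLY		0x40u

/*
 * Hypothetical helper following the same steps as the patch: reject unknown
 * bits, map the simple flags one to one, then resolve the atime policy.
 * Returns 0 and fills *mnt_flags, or -EINVAL on a bad combination.
 */
int sk_ms_to_mnt_flags(unsigned int ms_flags, unsigned int *mnt_flags)
{
	unsigned int out = 0;

	if (ms_flags & ~(SK_MS_RDONLY | SK_MS_NOSUID | SK_MS_NODEV |
			 SK_MS_NOEXEC | SK_MS_NOATIME | SK_MS_NODIRATIME |
			 SK_MS_RELATIME | SK_MS_STRICTATIME))
		return -EINVAL;

	if (ms_flags & SK_MS_RDONLY)
		out |= SK_MNT_READONLY;
	if (ms_flags & SK_MS_NOSUID)
		out |= SK_MNT_NOSUID;
	if (ms_flags & SK_MS_NODEV)
		out |= SK_MNT_NODEV;
	if (ms_flags & SK_MS_NOEXEC)
		out |= SK_MNT_NOEXEC;
	if (ms_flags & SK_MS_NODIRATIME)
		out |= SK_MNT_NODIRATIME;

	/* strictatime and noatime are mutually exclusive; relatime is the default. */
	if (ms_flags & SK_MS_STRICTATIME) {
		if (ms_flags & SK_MS_NOATIME)
			return -EINVAL;
	} else if (ms_flags & SK_MS_NOATIME) {
		out |= SK_MNT_NOATIME;
	} else {
		out |= SK_MNT_RELATIME;
	}

	*mnt_flags = out;
	return 0;
}
```

Note that, as in the patch, passing no atime flag at all selects relatime, and an explicit MS_NOATIME wins over MS_RELATIME when both are given.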


2018-05-25 02:48:17

by David Howells

Subject: [PATCH 22/32] vfs: Provide an fspick() system call [ver #8]

Provide an fspick() system call that can be used to pick an existing
mountpoint into an fs_context which can thereafter be used to reconfigure a
superblock (equivalent of the superblock side of -o remount).

This looks like:

int fd = fspick(AT_FDCWD, "/mnt",
FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
write(fd, "o intr", 6);
write(fd, "o noac", 6);
write(fd, "x reconfigure", 13);

Once fspick() returns, the file descriptor referring to the filesystem
context is in exactly the same state as one created by fsopen() after
fsmount() has been called successfully on it.
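
Each write() in the example above carries one command record: a one-letter command, a space, then the argument. A minimal userspace sketch of building such records follows. fsctx_format_cmd() is a hypothetical helper that is not part of this series, and the record layout is inferred only from the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): build one command record of
 * the form "<cmd> <arg>", e.g. "o intr" or "x reconfigure".
 *
 * Returns the record length to hand to write(), or -1 if the buffer is too
 * small to hold the record plus its terminating NUL.
 */
int fsctx_format_cmd(char *buf, size_t buflen, char cmd, const char *arg)
{
	int n = snprintf(buf, buflen, "%c %s", cmd, arg);

	if (n < 0 || (size_t)n >= buflen)
		return -1;
	return n;
}
```

The returned length would then be passed as the third argument, along the lines of write(fd, buf, n), so that the byte count never drifts from the string.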

Signed-off-by: David Howells <[email protected]>
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/fsopen.c | 94 +++++++++++++++++++++++++-------
include/linux/syscalls.h | 1
include/uapi/linux/fs.h | 5 ++
kernel/sys_ni.c | 1
6 files changed, 83 insertions(+), 20 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index bdcb0c4a0491..b7e2adda092c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -399,3 +399,4 @@
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
386 i386 fsopen sys_fsopen __ia32_sys_fsopen
387 i386 fsmount sys_fsmount __ia32_sys_fsmount
+388 i386 fspick sys_fspick __ia32_sys_fspick
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7d932d3897fa..fd322986974b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -344,6 +344,7 @@
333 common io_pgetevents __x64_sys_io_pgetevents
334 common fsopen __x64_sys_fsopen
335 common fsmount __x64_sys_fsmount
+336 common fspick __x64_sys_fspick

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 26565ddd7c9e..d69155b9303e 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -17,6 +17,7 @@
#include <linux/magic.h>
#include <linux/syscalls.h>
#include <linux/security.h>
+#include <linux/namei.h>
#include "mount.h"

static struct vfsmount *fscontext_fs_mnt __read_mostly;
@@ -286,6 +287,36 @@ static int __init init_fscontext_fs(void)

fs_initcall(init_fscontext_fs);

+/*
+ * Attach a filesystem context to a file and an fd.
+ */
+static int fsopen_create_fd(struct fs_context *fc, bool cloexec)
+{
+ struct file *file;
+ int ret;
+
+ file = create_fscontext_file(fc);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto err_fc;
+ }
+
+ ret = get_unused_fd_flags(cloexec ? O_CLOEXEC : 0);
+ if (ret < 0)
+ goto err_file;
+
+ fd_install(ret, file);
+ return ret;
+
+err_fc:
+ put_fs_context(fc);
+ goto err;
+err_file:
+ fput(file);
+err:
+ return ret;
+}
+
/*
* Open a filesystem by name so that it can be configured for mounting.
*
@@ -298,9 +329,7 @@ SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,
{
struct file_system_type *fs_type;
struct fs_context *fc;
- struct file *file;
const char *fs_name;
- int fd, ret;

if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
@@ -324,29 +353,54 @@ SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,

fc->phase = FS_CONTEXT_CREATE_PARAMS;

- ret = -EOPNOTSUPP;
- if (!fc->ops)
- goto err_fc;
+ return fsopen_create_fd(fc, flags & FSOPEN_CLOEXEC);
+}

- file = create_fscontext_file(fc);
- if (IS_ERR(file)) {
- ret = PTR_ERR(file);
- goto err_fc;
- }
+/*
+ * Pick a superblock into a context for reconfiguration.
+ */
+SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
+{
+ struct fs_context *fc;
+ struct path target;
+ unsigned int lookup_flags;
+ int ret;
+
+ if ((flags & ~(FSPICK_CLOEXEC |
+ FSPICK_SYMLINK_NOFOLLOW |
+ FSPICK_NO_AUTOMOUNT |
+ FSPICK_EMPTY_PATH)) != 0)
+ return -EINVAL;

- ret = get_unused_fd_flags(flags & O_CLOEXEC);
+ lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+ if (flags & FSPICK_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (flags & FSPICK_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (flags & FSPICK_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+ ret = user_path_at(dfd, path, lookup_flags, &target);
if (ret < 0)
- goto err_file;
+ goto err;
+
+ ret = -EOPNOTSUPP;
+ if (!target.dentry->d_sb->s_op->reconfigure)
+ goto err_path;
+
+ fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
+ 0, FS_CONTEXT_FOR_RECONFIGURE);
+ if (IS_ERR(fc)) {
+ ret = PTR_ERR(fc);
+ goto err_path;
+ }

- fd = ret;
- fd_install(fd, file);
- return fd;
+ fc->phase = FS_CONTEXT_RECONF_PARAMS;

-err_file:
- fput(file);
- return ret;
+ path_put(&target);
+ return fsopen_create_fd(fc, flags & FSPICK_CLOEXEC);

-err_fc:
- put_fs_context(fc);
+err_path:
+ path_put(&target);
+err:
return ret;
}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 178370cad1dd..5130fd687a85 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -900,6 +900,7 @@ asmlinkage long sys_fsopen(const char *fs_name, unsigned int flags,
void *reserved3, void *reserved4, void *reserved5);
asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags,
void *spare_4, void *spare_5);
+asmlinkage long sys_fspick(int dfd, const char *path, unsigned int at_flags);


/*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index edb1983a9990..f3875a84349d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -345,4 +345,9 @@ typedef int __bitwise __kernel_rwf_t;

#define FSMOUNT_CLOEXEC 0x00000001

+#define FSPICK_CLOEXEC 0x00000001
+#define FSPICK_SYMLINK_NOFOLLOW 0x00000002
+#define FSPICK_NO_AUTOMOUNT 0x00000004
+#define FSPICK_EMPTY_PATH 0x00000008
+
#endif /* _UAPI_LINUX_FS_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 632a937ca09c..152fdc95d426 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -436,3 +436,4 @@ COND_SYSCALL(setuid16);
/* fd-based mount */
COND_SYSCALL(sys_fsopen);
COND_SYSCALL(sys_fsmount);
+COND_SYSCALL(sys_fspick);
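
The way sys_fspick() above derives its path-lookup flags from the FSPICK_* flags can be sketched in userspace. The FSPICK_* values are taken from the uapi hunk in this patch. The SK_LOOKUP_* constants and sk_fspick_lookup_flags() are hypothetical stand-ins, since the real LOOKUP_* values are kernel-internal.

```c
#include <assert.h>
#include <errno.h>

/* Values from the include/uapi/linux/fs.h hunk in this patch. */
#define FSPICK_CLOEXEC		0x00000001
#define FSPICK_SYMLINK_NOFOLLOW	0x00000002
#define FSPICK_NO_AUTOMOUNT	0x00000004
#define FSPICK_EMPTY_PATH	0x00000008

/* Placeholder values; the real LOOKUP_* flags live inside the kernel. */
#define SK_LOOKUP_FOLLOW	0x1u
#define SK_LOOKUP_AUTOMOUNT	0x2u
#define SK_LOOKUP_EMPTY		0x4u

/*
 * Hypothetical helper: start from "follow symlinks and trigger automounts",
 * then let each FSPICK_* flag switch one behaviour off or on.
 * Returns 0 and fills *lookup_flags, or -EINVAL for unknown flag bits.
 */
int sk_fspick_lookup_flags(unsigned int flags, unsigned int *lookup_flags)
{
	unsigned int lf = SK_LOOKUP_FOLLOW | SK_LOOKUP_AUTOMOUNT;

	if (flags & ~(FSPICK_CLOEXEC | FSPICK_SYMLINK_NOFOLLOW |
		      FSPICK_NO_AUTOMOUNT | FSPICK_EMPTY_PATH))
		return -EINVAL;

	if (flags & FSPICK_SYMLINK_NOFOLLOW)
		lf &= ~SK_LOOKUP_FOLLOW;
	if (flags & FSPICK_NO_AUTOMOUNT)
		lf &= ~SK_LOOKUP_AUTOMOUNT;
	if (flags & FSPICK_EMPTY_PATH)
		lf |= SK_LOOKUP_EMPTY;

	*lookup_flags = lf;
	return 0;
}
```

FSPICK_CLOEXEC is accepted by the validity check but does not influence the lookup; it only affects how the resulting file descriptor is created.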


2018-05-25 02:48:21

by David Howells

Subject: [PATCH 25/32] afs: Add fs_context support [ver #8]

Add fs_context support to the AFS filesystem, converting the parameter
parsing to store options there.

This will form the basis for namespace propagation over mountpoints within
the AFS model, thereby allowing AFS to be used in containers more easily.

Signed-off-by: David Howells <[email protected]>
---

fs/afs/internal.h | 8 -
fs/afs/super.c | 444 ++++++++++++++++++++++++++++++-----------------------
fs/afs/volume.c | 4
3 files changed, 256 insertions(+), 200 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index e6cef5702ae2..eb6e75e00181 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -34,15 +34,15 @@
struct pagevec;
struct afs_call;

-struct afs_mount_params {
+struct afs_fs_context {
bool rwpath; /* T if the parent should be considered R/W */
bool force; /* T to force cell type */
bool autocell; /* T if set auto mount operation */
bool dyn_root; /* T if dynamic root */
+ bool no_cell; /* T if the source is "none" (for dynroot) */
afs_voltype_t type; /* type of volume requested */
- int volnamesz; /* size of volume name */
+ unsigned int volnamesz; /* size of volume name */
const char *volname; /* name of volume to mount */
- struct net *net_ns; /* Network namespace in effect */
struct afs_net *net; /* the AFS net namespace stuff */
struct afs_cell *cell; /* cell in which to find volume */
struct afs_volume *volume; /* volume record */
@@ -1012,7 +1012,7 @@ static inline struct afs_volume *__afs_get_volume(struct afs_volume *volume)
return volume;
}

-extern struct afs_volume *afs_create_volume(struct afs_mount_params *);
+extern struct afs_volume *afs_create_volume(struct afs_fs_context *);
extern void afs_activate_volume(struct afs_volume *);
extern void afs_deactivate_volume(struct afs_volume *);
extern void afs_put_volume(struct afs_cell *, struct afs_volume *);
diff --git a/fs/afs/super.c b/fs/afs/super.c
index a562b90ad660..f0494f36d548 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -1,6 +1,6 @@
/* AFS superblock handling
*
- * Copyright (c) 2002, 2007 Red Hat, Inc. All rights reserved.
+ * Copyright (c) 2002, 2007, 2018 Red Hat, Inc. All rights reserved.
*
* This software may be freely redistributed under the terms of the
* GNU General Public License.
@@ -30,22 +30,20 @@
#include "internal.h"

static void afs_i_init_once(void *foo);
-static struct dentry *afs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, size_t data_size);
static void afs_kill_super(struct super_block *sb);
static struct inode *afs_alloc_inode(struct super_block *sb);
static void afs_destroy_inode(struct inode *inode);
static int afs_statfs(struct dentry *dentry, struct kstatfs *buf);
static int afs_show_devname(struct seq_file *m, struct dentry *root);
static int afs_show_options(struct seq_file *m, struct dentry *root);
+static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference);

struct file_system_type afs_fs_type = {
- .owner = THIS_MODULE,
- .name = "afs",
- .mount = afs_mount,
- .kill_sb = afs_kill_super,
- .fs_flags = 0,
+ .owner = THIS_MODULE,
+ .name = "afs",
+ .init_fs_context = afs_init_fs_context,
+ .kill_sb = afs_kill_super,
+ .fs_flags = 0,
};
MODULE_ALIAS_FS("afs");

@@ -191,61 +189,53 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
}

/*
- * parse the mount options
- * - this function has been shamelessly adapted from the ext3 fs which
- * shamelessly adapted it from the msdos fs
+ * Parse a single mount option.
*/
-static int afs_parse_options(struct afs_mount_params *params,
- char *options, const char **devname)
+static int afs_parse_option(struct fs_context *fc, char *opt, size_t len)
{
+ struct afs_fs_context *ctx = fc->fs_private;
struct afs_cell *cell;
substring_t args[MAX_OPT_ARGS];
- char *p;
- int token;
-
- _enter("%s", options);
-
- options[PAGE_SIZE - 1] = 0;
-
- while ((p = strsep(&options, ","))) {
- if (!*p)
- continue;
-
- token = match_token(p, afs_options_list, args);
- switch (token) {
- case afs_opt_cell:
- rcu_read_lock();
- cell = afs_lookup_cell_rcu(params->net,
- args[0].from,
- args[0].to - args[0].from);
- rcu_read_unlock();
- if (IS_ERR(cell))
- return PTR_ERR(cell);
- afs_put_cell(params->net, params->cell);
- params->cell = cell;
- break;
-
- case afs_opt_rwpath:
- params->rwpath = true;
- break;
-
- case afs_opt_vol:
- *devname = args[0].from;
- break;
-
- case afs_opt_autocell:
- params->autocell = true;
- break;
-
- case afs_opt_dyn:
- params->dyn_root = true;
- break;
-
- default:
- printk(KERN_ERR "kAFS:"
- " Unknown or invalid mount option: '%s'\n", p);
+ int token, size;
+
+ _enter("%s", opt);
+
+ token = match_token(opt, afs_options_list, args);
+ switch (token) {
+ case afs_opt_cell:
+ size = args[0].to - args[0].from;
+ if (size <= 0)
return -EINVAL;
- }
+ if (size > AFS_MAXCELLNAME)
+ return -ENAMETOOLONG;
+
+ rcu_read_lock();
+ cell = afs_lookup_cell_rcu(ctx->net, args[0].from, size);
+ rcu_read_unlock();
+ if (IS_ERR(cell))
+ return PTR_ERR(cell);
+ afs_put_cell(ctx->net, ctx->cell);
+ ctx->cell = cell;
+ break;
+
+ case afs_opt_rwpath:
+ ctx->rwpath = true;
+ break;
+
+ case afs_opt_vol:
+ return -EINVAL; /* Not required for automount */
+
+ case afs_opt_autocell:
+ ctx->autocell = true;
+ break;
+
+ case afs_opt_dyn:
+ ctx->dyn_root = true;
+ break;
+
+ default:
+ printk(KERN_ERR "kAFS: Unknown or invalid mount option: '%s'\n", opt);
+ return -EINVAL;
}

_leave(" = 0");
@@ -253,9 +243,10 @@ static int afs_parse_options(struct afs_mount_params *params,
}

/*
- * parse a device name to get cell name, volume name, volume type and R/W
- * selector
- * - this can be one of the following:
+ * Parse the source name to get cell name, volume name, volume type and R/W
+ * selector.
+ *
+ * This can be one of the following:
* "%[cell:]volume[.]" R/W volume
* "#[cell:]volume[.]" R/O or R/W volume (rwpath=0),
* or R/W (rwpath=1) volume
@@ -264,9 +255,9 @@ static int afs_parse_options(struct afs_mount_params *params,
* "%[cell:]volume.backup" Backup volume
* "#[cell:]volume.backup" Backup volume
*/
-static int afs_parse_device_name(struct afs_mount_params *params,
- const char *name)
+static int afs_parse_source(struct fs_context *fc, char *name)
{
+ struct afs_fs_context *ctx = fc->fs_private;
struct afs_cell *cell;
const char *cellname, *suffix;
int cellnamesz;
@@ -279,69 +270,116 @@ static int afs_parse_device_name(struct afs_mount_params *params,
}

if ((name[0] != '%' && name[0] != '#') || !name[1]) {
+ /* To use dynroot, we don't want to have to provide a source */
+ if (strcmp(name, "none") == 0) {
+ ctx->no_cell = true;
+ return 0;
+ }
printk(KERN_ERR "kAFS: unparsable volume name\n");
return -EINVAL;
}

/* determine the type of volume we're looking for */
- params->type = AFSVL_ROVOL;
- params->force = false;
- if (params->rwpath || name[0] == '%') {
- params->type = AFSVL_RWVOL;
- params->force = true;
+ ctx->type = AFSVL_ROVOL;
+ ctx->force = false;
+ if (ctx->rwpath || name[0] == '%') {
+ ctx->type = AFSVL_RWVOL;
+ ctx->force = true;
}
name++;

/* split the cell name out if there is one */
- params->volname = strchr(name, ':');
- if (params->volname) {
+ ctx->volname = strchr(name, ':');
+ if (ctx->volname) {
cellname = name;
- cellnamesz = params->volname - name;
- params->volname++;
+ cellnamesz = ctx->volname - name;
+ ctx->volname++;
} else {
- params->volname = name;
+ ctx->volname = name;
cellname = NULL;
cellnamesz = 0;
}

/* the volume type is further affected by a possible suffix */
- suffix = strrchr(params->volname, '.');
+ suffix = strrchr(ctx->volname, '.');
if (suffix) {
if (strcmp(suffix, ".readonly") == 0) {
- params->type = AFSVL_ROVOL;
- params->force = true;
+ ctx->type = AFSVL_ROVOL;
+ ctx->force = true;
} else if (strcmp(suffix, ".backup") == 0) {
- params->type = AFSVL_BACKVOL;
- params->force = true;
+ ctx->type = AFSVL_BACKVOL;
+ ctx->force = true;
} else if (suffix[1] == 0) {
} else {
suffix = NULL;
}
}

- params->volnamesz = suffix ?
- suffix - params->volname : strlen(params->volname);
+ ctx->volnamesz = suffix ?
+ suffix - ctx->volname : strlen(ctx->volname);

_debug("cell %*.*s [%p]",
- cellnamesz, cellnamesz, cellname ?: "", params->cell);
+ cellnamesz, cellnamesz, cellname ?: "", ctx->cell);

/* lookup the cell record */
- if (cellname || !params->cell) {
- cell = afs_lookup_cell(params->net, cellname, cellnamesz,
+ if (cellname) {
+ cell = afs_lookup_cell(ctx->net, cellname, cellnamesz,
NULL, false);
if (IS_ERR(cell)) {
- printk(KERN_ERR "kAFS: unable to lookup cell '%*.*s'\n",
+ pr_err("kAFS: unable to lookup cell '%*.*s'\n",
cellnamesz, cellnamesz, cellname ?: "");
return PTR_ERR(cell);
}
- afs_put_cell(params->net, params->cell);
- params->cell = cell;
+ afs_put_cell(ctx->net, ctx->cell);
+ ctx->cell = cell;
}

_debug("CELL:%s [%p] VOLUME:%*.*s SUFFIX:%s TYPE:%d%s",
- params->cell->name, params->cell,
- params->volnamesz, params->volnamesz, params->volname,
- suffix ?: "-", params->type, params->force ? " FORCE" : "");
+ ctx->cell->name, ctx->cell,
+ ctx->volnamesz, ctx->volnamesz, ctx->volname,
+ suffix ?: "-", ctx->type, ctx->force ? " FORCE" : "");
+
+ return 0;
+}
+
+/*
+ * Validate the options, get the cell key and look up the volume.
+ */
+static int afs_validate_fc(struct fs_context *fc)
+{
+ struct afs_fs_context *ctx = fc->fs_private;
+ struct afs_volume *volume;
+ struct key *key;
+
+ if (!ctx->dyn_root) {
+ if (ctx->no_cell) {
+ pr_warn("kAFS: Can only specify source 'none' with -o dyn\n");
+ return -EINVAL;
+ }
+
+ if (!ctx->cell) {
+ pr_warn("kAFS: No cell specified\n");
+ return -EDESTADDRREQ;
+ }
+
+ /* We try to do the mount securely. */
+ key = afs_request_key(ctx->cell);
+ if (IS_ERR(key))
+ return PTR_ERR(key);
+
+ ctx->key = key;
+
+ if (ctx->volume) {
+ afs_put_volume(ctx->cell, ctx->volume);
+ ctx->volume = NULL;
+ }
+
+ volume = afs_create_volume(ctx);
+ if (IS_ERR(volume))
+ return PTR_ERR(volume);
+
+ ctx->volume = volume;
+ }

return 0;
}
@@ -349,34 +387,30 @@ static int afs_parse_device_name(struct afs_mount_params *params,
/*
* check a superblock to see if it's the one we're looking for
*/
-static int afs_test_super(struct super_block *sb, void *data)
+static int afs_test_super(struct super_block *sb, struct fs_context *fc)
{
- struct afs_super_info *as1 = data;
+ struct afs_fs_context *ctx = fc->fs_private;
struct afs_super_info *as = AFS_FS_S(sb);

- return (as->net_ns == as1->net_ns &&
+ return (as->net_ns == fc->net_ns &&
as->volume &&
- as->volume->vid == as1->volume->vid);
+ as->volume->vid == ctx->volume->vid);
}

-static int afs_dynroot_test_super(struct super_block *sb, void *data)
+static int afs_dynroot_test_super(struct super_block *sb, struct fs_context *fc)
{
return false;
}

-static int afs_set_super(struct super_block *sb, void *data)
+static int afs_set_super(struct super_block *sb, struct fs_context *fc)
{
- struct afs_super_info *as = data;
-
- sb->s_fs_info = as;
return set_anon_super(sb, NULL);
}

/*
* fill in the superblock
*/
-static int afs_fill_super(struct super_block *sb,
- struct afs_mount_params *params)
+static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
{
struct afs_super_info *as = AFS_FS_S(sb);
struct afs_fid fid;
@@ -407,13 +441,13 @@ static int afs_fill_super(struct super_block *sb,
fid.vid = as->volume->vid;
fid.vnode = 1;
fid.unique = 1;
- inode = afs_iget(sb, params->key, &fid, NULL, NULL, NULL);
+ inode = afs_iget(sb, ctx->key, &fid, NULL, NULL, NULL);
}

if (IS_ERR(inode))
return PTR_ERR(inode);

- if (params->autocell || params->dyn_root)
+ if (ctx->autocell || as->dyn_root)
set_bit(AFS_VNODE_AUTOCELL, &AFS_FS_I(inode)->flags);

ret = -ENOMEM;
@@ -421,7 +455,7 @@ static int afs_fill_super(struct super_block *sb,
if (!sb->s_root)
goto error;

- if (params->dyn_root)
+ if (as->dyn_root)
sb->s_d_op = &afs_dynroot_dentry_operations;
else
sb->s_d_op = &afs_fs_dentry_operations;
@@ -434,17 +468,20 @@ static int afs_fill_super(struct super_block *sb,
return ret;
}

-static struct afs_super_info *afs_alloc_sbi(struct afs_mount_params *params)
+static struct afs_super_info *afs_alloc_sbi(struct fs_context *fc)
{
+ struct afs_fs_context *ctx = fc->fs_private;
struct afs_super_info *as;

as = kzalloc(sizeof(struct afs_super_info), GFP_KERNEL);
if (as) {
- as->net_ns = get_net(params->net_ns);
- if (params->dyn_root)
+ as->net_ns = get_net(fc->net_ns);
+ if (ctx->dyn_root) {
as->dyn_root = true;
- else
- as->cell = afs_get_cell(params->cell);
+ } else {
+ as->cell = afs_get_cell(ctx->cell);
+ as->volume = __afs_get_volume(ctx->volume);
+ }
}
return as;
}
@@ -459,129 +496,148 @@ static void afs_destroy_sbi(struct afs_super_info *as)
}
}

+static void afs_kill_super(struct super_block *sb)
+{
+ struct afs_super_info *as = AFS_FS_S(sb);
+ struct afs_net *net = afs_net(as->net_ns);
+
+ /* Clear the callback interests (which will do ilookup5) before
+ * deactivating the superblock.
+ */
+ if (as->volume)
+ afs_clear_callback_interests(net, as->volume->servers);
+ kill_anon_super(sb);
+ if (as->volume)
+ afs_deactivate_volume(as->volume);
+ afs_destroy_sbi(as);
+}
+
/*
- * get an AFS superblock
+ * Get an AFS superblock and root directory.
*/
-static struct dentry *afs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *options, size_t data_size)
+static int afs_get_tree(struct fs_context *fc)
{
- struct afs_mount_params params;
+ struct afs_fs_context *ctx = fc->fs_private;
struct super_block *sb;
- struct afs_volume *candidate;
- struct key *key;
struct afs_super_info *as;
int ret;

- _enter(",,%s,%p", dev_name, options);
-
- memset(&params, 0, sizeof(params));
-
- ret = -EINVAL;
- if (current->nsproxy->net_ns != &init_net)
- goto error;
- params.net_ns = current->nsproxy->net_ns;
- params.net = afs_net(params.net_ns);
-
- /* parse the options and device name */
- if (options) {
- ret = afs_parse_options(&params, options, &dev_name);
- if (ret < 0)
- goto error;
- }
-
- if (!params.dyn_root) {
- ret = afs_parse_device_name(&params, dev_name);
- if (ret < 0)
- goto error;
-
- /* try and do the mount securely */
- key = afs_request_key(params.cell);
- if (IS_ERR(key)) {
- _leave(" = %ld [key]", PTR_ERR(key));
- ret = PTR_ERR(key);
- goto error;
- }
- params.key = key;
- }
+ _enter("%s", fc->source);

/* allocate a superblock info record */
ret = -ENOMEM;
- as = afs_alloc_sbi(&params);
+ as = afs_alloc_sbi(fc);
if (!as)
- goto error_key;
-
- if (!params.dyn_root) {
- /* Assume we're going to need a volume record; at the very
- * least we can use it to update the volume record if we have
- * one already. This checks that the volume exists within the
- * cell.
- */
- candidate = afs_create_volume(&params);
- if (IS_ERR(candidate)) {
- ret = PTR_ERR(candidate);
- goto error_as;
- }
-
- as->volume = candidate;
- }
+ goto error;
+ fc->s_fs_info = as;

/* allocate a deviceless superblock */
- sb = sget(fs_type,
- as->dyn_root ? afs_dynroot_test_super : afs_test_super,
- afs_set_super, flags, as);
+ sb = sget_fc(fc,
+ as->dyn_root ? afs_dynroot_test_super : afs_test_super,
+ afs_set_super);
if (IS_ERR(sb)) {
ret = PTR_ERR(sb);
- goto error_as;
+ goto error;
}

if (!sb->s_root) {
/* initial superblock/root creation */
_debug("create");
- ret = afs_fill_super(sb, &params);
+ ret = afs_fill_super(sb, ctx);
if (ret < 0)
goto error_sb;
- as = NULL;
sb->s_flags |= SB_ACTIVE;
} else {
_debug("reuse");
ASSERTCMP(sb->s_flags, &, SB_ACTIVE);
- afs_destroy_sbi(as);
- as = NULL;
}

- afs_put_cell(params.net, params.cell);
- key_put(params.key);
+ fc->root = dget(sb->s_root);
+ fc->drop_sb = true;
_leave(" = 0 [%p]", sb);
- return dget(sb->s_root);
+ return 0;

error_sb:
deactivate_locked_super(sb);
- goto error_key;
-error_as:
- afs_destroy_sbi(as);
-error_key:
- key_put(params.key);
error:
- afs_put_cell(params.net, params.cell);
_leave(" = %d", ret);
- return ERR_PTR(ret);
+ return ret;
}

-static void afs_kill_super(struct super_block *sb)
+static void afs_free_fc(struct fs_context *fc)
{
- struct afs_super_info *as = AFS_FS_S(sb);
+ struct afs_fs_context *ctx = fc->fs_private;

- /* Clear the callback interests (which will do ilookup5) before
- * deactivating the superblock.
- */
- if (as->volume)
- afs_clear_callback_interests(afs_net(as->net_ns),
- as->volume->servers);
- kill_anon_super(sb);
- if (as->volume)
- afs_deactivate_volume(as->volume);
- afs_destroy_sbi(as);
+ afs_destroy_sbi(fc->s_fs_info);
+ afs_put_volume(ctx->cell, ctx->volume);
+ afs_put_cell(ctx->net, ctx->cell);
+ key_put(ctx->key);
+ kfree(ctx);
+}
+
+static const struct fs_context_operations afs_context_ops = {
+ .free = afs_free_fc,
+ .parse_source = afs_parse_source,
+ .parse_option = afs_parse_option,
+ .validate = afs_validate_fc,
+ .get_tree = afs_get_tree,
+};
+
+/*
+ * Set up the filesystem mount context.
+ */
+static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+ struct afs_fs_context *ctx;
+ struct afs_super_info *src_as;
+ struct afs_cell *cell;
+
+ if (current->nsproxy->net_ns != &init_net)
+ return -EINVAL;
+
+ ctx = kzalloc(sizeof(struct afs_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->type = AFSVL_ROVOL;
+
+ switch (fc->purpose) {
+ case FS_CONTEXT_FOR_USER_MOUNT:
+ case FS_CONTEXT_FOR_KERNEL_MOUNT:
+ ctx->net = afs_net(fc->net_ns);
+
+ /* Default to the workstation cell. */
+ rcu_read_lock();
+ cell = afs_lookup_cell_rcu(ctx->net, NULL, 0);
+ rcu_read_unlock();
+ if (IS_ERR(cell))
+ cell = NULL;
+ ctx->cell = cell;
+ break;
+
+ case FS_CONTEXT_FOR_SUBMOUNT:
+ if (!reference)
+ return -EINVAL;
+
+ src_as = AFS_FS_S(reference->d_sb);
+ ASSERT(src_as);
+
+ ctx->net = afs_net(fc->net_ns);
+ if (src_as->cell)
+ ctx->cell = afs_get_cell(src_as->cell);
+ if (src_as->volume && src_as->volume->type == AFSVL_RWVOL) {
+ ctx->type = AFSVL_RWVOL;
+ ctx->force = true;
+ }
+ break;
+
+ case FS_CONTEXT_FOR_RECONFIGURE:
+ break;
+ }
+
+ fc->fs_private = ctx;
+ fc->ops = &afs_context_ops;
+ return 0;
}

/*
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index 3037bd01f617..7adcddf02e66 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -21,7 +21,7 @@ static const char *const afs_voltypes[] = { "R/W", "R/O", "BAK" };
/*
* Allocate a volume record and load it up from a vldb record.
*/
-static struct afs_volume *afs_alloc_volume(struct afs_mount_params *params,
+static struct afs_volume *afs_alloc_volume(struct afs_fs_context *params,
struct afs_vldb_entry *vldb,
unsigned long type_mask)
{
@@ -149,7 +149,7 @@ static struct afs_vldb_entry *afs_vl_lookup_vldb(struct afs_cell *cell,
* - Rule 3: If parent volume is R/W, then only mount R/W volume unless
* explicitly told otherwise
*/
-struct afs_volume *afs_create_volume(struct afs_mount_params *params)
+struct afs_volume *afs_create_volume(struct afs_fs_context *params)
{
struct afs_vldb_entry *vldb;
struct afs_volume *volume;
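
The source-name grammar handled by afs_parse_source() above ("%[cell:]volume[.suffix]" and "#[cell:]volume[.suffix]") can be illustrated with a small userspace sketch. This is a simplification under stated assumptions: every sk_* name is hypothetical, the rwpath option and the special "none" source are left out, and no cell lookup is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sk_voltype { SK_RWVOL, SK_ROVOL, SK_BACKVOL };

/* Hypothetical result record; pointers refer into the caller's string. */
struct sk_source {
	const char *cell;	/* NULL if no "cell:" prefix was given */
	size_t cellsz;
	const char *volname;
	size_t volnamesz;	/* length without a recognised suffix */
	enum sk_voltype type;
	int force;		/* non-zero if the type was given explicitly */
};

/* Returns 0 on success, -1 if the name lacks a '%' or '#' prefix. */
int sk_parse_source(const char *name, struct sk_source *out)
{
	const char *colon, *suffix;

	if ((name[0] != '%' && name[0] != '#') || !name[1])
		return -1;

	/* '%' forces a R/W volume; '#' defaults to R/O without forcing. */
	out->type = (name[0] == '%') ? SK_RWVOL : SK_ROVOL;
	out->force = (name[0] == '%');
	name++;

	/* Split off the optional cell name. */
	colon = strchr(name, ':');
	if (colon) {
		out->cell = name;
		out->cellsz = (size_t)(colon - name);
		out->volname = colon + 1;
	} else {
		out->cell = NULL;
		out->cellsz = 0;
		out->volname = name;
	}

	/* A recognised suffix overrides the type; a bare trailing dot is dropped. */
	suffix = strrchr(out->volname, '.');
	if (suffix) {
		if (strcmp(suffix, ".readonly") == 0) {
			out->type = SK_ROVOL;
			out->force = 1;
		} else if (strcmp(suffix, ".backup") == 0) {
			out->type = SK_BACKVOL;
			out->force = 1;
		} else if (suffix[1] != '\0') {
			suffix = NULL;	/* dot is part of the name, e.g. "root.cell" */
		}
	}

	out->volnamesz = suffix ? (size_t)(suffix - out->volname)
				: strlen(out->volname);
	return 0;
}
```

For instance, "%example.org:root.cell." would come out as cell "example.org", volume "root.cell" and a forced R/W type, while "#root.cell" keeps the dot as part of the volume name and stays an unforced R/O request.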


2018-05-25 02:48:24

by David Howells

Subject: [PATCH 26/32] afs: Use fs_context to pass parameters over automount [ver #8]

Alter the AFS automounting code to create and modify an fs_context struct
when parameterising a new mount triggered by an AFS mountpoint rather than
constructing device name and option strings.

Also remove the cell=, vol= and rwpath options as they are then redundant.
They existed because the 'device name' may be derived literally from a
mountpoint object in the filesystem, so default cell and parent-type
information had to be passed in by some other method from the automount
routines. The vol= option didn't end up being used.

Signed-off-by: David Howells <[email protected]>
cc: Eric W. Biederman <[email protected]>
---

fs/afs/internal.h | 1
fs/afs/mntpt.c | 148 +++++++++++++++++++++++++++--------------------------
fs/afs/super.c | 43 +--------------
3 files changed, 79 insertions(+), 113 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index eb6e75e00181..90af5001f8c8 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -35,7 +35,6 @@ struct pagevec;
struct afs_call;

struct afs_fs_context {
- bool rwpath; /* T if the parent should be considered R/W */
bool force; /* T to force cell type */
bool autocell; /* T if set auto mount operation */
bool dyn_root; /* T if dynamic root */
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index c45aa1776591..fc383d727552 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -47,6 +47,8 @@ static DECLARE_DELAYED_WORK(afs_mntpt_expiry_timer, afs_mntpt_expiry_timed_out);

static unsigned long afs_mntpt_expiry_timeout = 10 * 60;

+static const char afs_root_volume[] = "root.cell";
+
/*
* no valid lookup procedure on this sort of dir
*/
@@ -68,107 +70,107 @@ static int afs_mntpt_open(struct inode *inode, struct file *file)
}

/*
- * create a vfsmount to be automounted
+ * Set the parameters for the proposed superblock.
*/
-static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
+static int afs_mntpt_set_params(struct fs_context *fc, struct dentry *mntpt)
{
- struct afs_super_info *as;
- struct vfsmount *mnt;
- struct afs_vnode *vnode;
- struct page *page;
- char *devname, *options;
- bool rwpath = false;
+ struct afs_fs_context *ctx = fc->fs_private;
+ struct afs_vnode *vnode = AFS_FS_I(d_inode(mntpt));
+ struct afs_cell *cell;
+ const char *p;
int ret;

- _enter("{%pd}", mntpt);
-
- BUG_ON(!d_inode(mntpt));
-
- ret = -ENOMEM;
- devname = (char *) get_zeroed_page(GFP_KERNEL);
- if (!devname)
- goto error_no_devname;
-
- options = (char *) get_zeroed_page(GFP_KERNEL);
- if (!options)
- goto error_no_options;
-
- vnode = AFS_FS_I(d_inode(mntpt));
if (test_bit(AFS_VNODE_PSEUDODIR, &vnode->flags)) {
/* if the directory is a pseudo directory, use the d_name */
- static const char afs_root_cell[] = ":root.cell.";
unsigned size = mntpt->d_name.len;

- ret = -ENOENT;
- if (size < 2 || size > AFS_MAXCELLNAME)
- goto error_no_page;
+ if (size < 2)
+ return -ENOENT;

+ p = mntpt->d_name.name;
if (mntpt->d_name.name[0] == '.') {
- devname[0] = '%';
- memcpy(devname + 1, mntpt->d_name.name + 1, size - 1);
- memcpy(devname + size, afs_root_cell,
- sizeof(afs_root_cell));
- rwpath = true;
- } else {
- devname[0] = '#';
- memcpy(devname + 1, mntpt->d_name.name, size);
- memcpy(devname + size + 1, afs_root_cell,
- sizeof(afs_root_cell));
+ size--;
+ p++;
+ ctx->type = AFSVL_RWVOL;
+ ctx->force = true;
+ }
+ if (size > AFS_MAXCELLNAME)
+ return -ENAMETOOLONG;
+
+ cell = afs_lookup_cell(ctx->net, p, size, NULL, false);
+ if (IS_ERR(cell)) {
+ pr_err("kAFS: unable to lookup cell '%pd'\n", mntpt);
+ return PTR_ERR(cell);
}
+ afs_put_cell(ctx->net, ctx->cell);
+ ctx->cell = cell;
+
+ ctx->volname = afs_root_volume;
+ ctx->volnamesz = sizeof(afs_root_volume) - 1;
} else {
/* read the contents of the AFS special symlink */
+ struct page *page;
loff_t size = i_size_read(d_inode(mntpt));
char *buf;

- ret = -EINVAL;
if (size > PAGE_SIZE - 1)
- goto error_no_page;
+ return -EINVAL;

page = read_mapping_page(d_inode(mntpt)->i_mapping, 0, NULL);
- if (IS_ERR(page)) {
- ret = PTR_ERR(page);
- goto error_no_page;
- }
+ if (IS_ERR(page))
+ return PTR_ERR(page);

- ret = -EIO;
- if (PageError(page))
- goto error;
+ if (PageError(page)) {
+ put_page(page);
+ return -EIO;
+ }

- buf = kmap_atomic(page);
- memcpy(devname, buf, size);
- kunmap_atomic(buf);
+ buf = kmap(page);
+ ret = vfs_set_fs_source(fc, buf, size);
+ kunmap(page);
put_page(page);
- page = NULL;
+ if (ret < 0)
+ return ret;
}

- /* work out what options we want */
- as = AFS_FS_S(mntpt->d_sb);
- if (as->cell) {
- memcpy(options, "cell=", 5);
- strcpy(options + 5, as->cell->name);
- if ((as->volume && as->volume->type == AFSVL_RWVOL) || rwpath)
- strcat(options, ",rwpath");
- }
+ return 0;
+}

- /* try and do the mount */
- _debug("--- attempting mount %s -o %s ---", devname, options);
- mnt = vfs_submount(mntpt, &afs_fs_type, devname,
- options, strlen(options) + 1);
- _debug("--- mount result %p ---", mnt);
+/*
+ * create a vfsmount to be automounted
+ */
+static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
+{
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ BUG_ON(!d_inode(mntpt));
+
+ fc = vfs_new_fs_context(&afs_fs_type, mntpt, 0,
+ FS_CONTEXT_FOR_SUBMOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ ret = afs_mntpt_set_params(fc, mntpt);
+ if (ret < 0)
+ goto error_fc;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto error_fc;
+
+ mnt = vfs_create_mount(fc, 0);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto error_fc;
+ }

- free_page((unsigned long) devname);
- free_page((unsigned long) options);
- _leave(" = %p", mnt);
+ put_fs_context(fc);
return mnt;

-error:
- put_page(page);
-error_no_page:
- free_page((unsigned long) options);
-error_no_options:
- free_page((unsigned long) devname);
-error_no_devname:
- _leave(" = %d", ret);
+error_fc:
+ put_fs_context(fc);
return ERR_PTR(ret);
}

diff --git a/fs/afs/super.c b/fs/afs/super.c
index f0494f36d548..ac97d771f27c 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -64,18 +64,12 @@ static atomic_t afs_count_active_inodes;

enum {
afs_no_opt,
- afs_opt_cell,
afs_opt_dyn,
- afs_opt_rwpath,
- afs_opt_vol,
afs_opt_autocell,
};

static const match_table_t afs_options_list = {
- { afs_opt_cell, "cell=%s" },
{ afs_opt_dyn, "dyn" },
- { afs_opt_rwpath, "rwpath" },
- { afs_opt_vol, "vol=%s" },
{ afs_opt_autocell, "autocell" },
{ afs_no_opt, NULL },
};
@@ -194,37 +188,13 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
static int afs_parse_option(struct fs_context *fc, char *opt, size_t len)
{
struct afs_fs_context *ctx = fc->fs_private;
- struct afs_cell *cell;
substring_t args[MAX_OPT_ARGS];
- int token, size;
+ int token;

_enter("%s", opt);

token = match_token(opt, afs_options_list, args);
switch (token) {
- case afs_opt_cell:
- size = args[0].to - args[0].from;
- if (size <= 0)
- return -EINVAL;
- if (size > AFS_MAXCELLNAME)
- return -ENAMETOOLONG;
-
- rcu_read_lock();
- cell = afs_lookup_cell_rcu(ctx->net, args[0].from, size);
- rcu_read_unlock();
- if (IS_ERR(cell))
- return PTR_ERR(cell);
- afs_put_cell(ctx->net, ctx->cell);
- ctx->cell = cell;
- break;
-
- case afs_opt_rwpath:
- ctx->rwpath = true;
- break;
-
- case afs_opt_vol:
- return -EINVAL; /* Not required for automount */
-
case afs_opt_autocell:
ctx->autocell = true;
break;
@@ -248,8 +218,8 @@ static int afs_parse_option(struct fs_context *fc, char *opt, size_t len)
*
* This can be one of the following:
* "%[cell:]volume[.]" R/W volume
- * "#[cell:]volume[.]" R/O or R/W volume (rwpath=0),
- * or R/W (rwpath=1) volume
+ * "#[cell:]volume[.]" R/O or R/W volume (R/O parent),
+ * or R/W (R/W parent) volume
* "%[cell:]volume.readonly" R/O volume
* "#[cell:]volume.readonly" R/O volume
* "%[cell:]volume.backup" Backup volume
@@ -280,9 +250,7 @@ static int afs_parse_source(struct fs_context *fc, char *name)
}

/* determine the type of volume we're looking for */
- ctx->type = AFSVL_ROVOL;
- ctx->force = false;
- if (ctx->rwpath || name[0] == '%') {
+ if (name[0] == '%') {
ctx->type = AFSVL_RWVOL;
ctx->force = true;
}
@@ -592,9 +560,6 @@ static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference)
struct afs_super_info *src_as;
struct afs_cell *cell;

- if (current->nsproxy->net_ns != &init_net)
- return -EINVAL;
-
ctx = kzalloc(sizeof(struct afs_fs_context), GFP_KERNEL);
if (!ctx)
return -ENOMEM;


2018-05-25 02:48:35

by David Howells

Subject: [PATCH 31/32] [RFC] fs: Add a move_mount() system call [ver #8]

[!] NOTE: This patch doesn't quite work to move an O_CLONE_MOUNT-produced
vfsmount as move_mount() checks that the source vfsmount mnt_ns matches
the calling process's mnt_ns - but the vfsmount's mnt_ns isn't set
until one attempts to actually mount it into the namespace.

Add a move_mount() system call that will move a mount from one place to
another and change the flags, where one or both of those places may be
selected by an O_PATH open.

To this end, two additional open()/openat() flags are defined that can be
used with O_PATH:

(*) O_CLONE_MOUNT - Clone a mount (subtree) and attach it to the file
descriptor. This can be used to turn a move_mount() into a copy
operation.

(*) O_NON_RECURSIVE - Clone only the targeted mount, and not the entire
subtree.

Unfortunately, the other extant open flags cannot be reused. When O_PATH
was added, no check was provided that would give an error if any flag other
than O_TMPFILE, O_DIRECTORY, O_NOFOLLOW and O_CLOEXEC was also given -
rather, the extra flags are just masked off - so there's no guarantee that
userspace isn't already passing them somewhere. Further, O_CREAT has an
effect before the O_PATH handling clears it - though this may be later
ignored.

The new system call looks like the following:

int move_mount(int from_dfd, const char *from_path,
int to_dfd, const char *to_path,
unsigned int ms_flags);

As from_dfd and to_dfd can both be obtained from openat(O_PATH), there is
no need to have two sets of AT_NO_FOLLOW-style flags here also. Further,
either fd can be obtained from the new fsmount() syscall.

New mounts are a case of:

sbfd = fsopen();
...
mfd = fsmount(, MS_RDONLY);
move_mount(mfd, NULL, AT_FDCWD, "/mnt", MS_RDONLY);
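
A minimal userspace sketch of how the flags word is treated by the syscall
body in this patch: unknown bits are rejected against MOVE_MOUNT__MASK and
the F_ (from) and T_ (to) bits are split into one set of lookup flags per
path. The MOVE_MOUNT_* values are copied from the uapi header added below.
The SK_LOOKUP_* values and the function name are placeholders invented for
this sketch and are not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>

/* Flag values as added to include/uapi/linux/mount.h by this patch. */
#define MOVE_MOUNT_F_SYMLINKS	0x00000001
#define MOVE_MOUNT_F_AUTOMOUNTS	0x00000002
#define MOVE_MOUNT_F_EMPTY_PATH	0x00000004
#define MOVE_MOUNT_T_SYMLINKS	0x00000010
#define MOVE_MOUNT_T_AUTOMOUNTS	0x00000020
#define MOVE_MOUNT_T_EMPTY_PATH	0x00000040
#define MOVE_MOUNT__MASK	0x00000077

/* Placeholder lookup bits for this sketch only (not the kernel's LOOKUP_*). */
#define SK_LOOKUP_FOLLOW	0x1u
#define SK_LOOKUP_AUTOMOUNT	0x2u
#define SK_LOOKUP_EMPTY		0x4u

/*
 * Split a move_mount() flags word into per-path lookup flags.
 * Returns 0 on success or -EINVAL if any bit outside the mask is set,
 * in which case the output words are left untouched.
 */
int sk_split_move_mount_flags(unsigned int flags,
			      unsigned int *from_lflags,
			      unsigned int *to_lflags)
{
	unsigned int f = 0, t = 0;

	assert(from_lflags && to_lflags);

	if (flags & ~MOVE_MOUNT__MASK)
		return -EINVAL;

	if (flags & MOVE_MOUNT_F_SYMLINKS)
		f |= SK_LOOKUP_FOLLOW;
	if (flags & MOVE_MOUNT_F_AUTOMOUNTS)
		f |= SK_LOOKUP_AUTOMOUNT;
	if (flags & MOVE_MOUNT_F_EMPTY_PATH)
		f |= SK_LOOKUP_EMPTY;

	if (flags & MOVE_MOUNT_T_SYMLINKS)
		t |= SK_LOOKUP_FOLLOW;
	if (flags & MOVE_MOUNT_T_AUTOMOUNTS)
		t |= SK_LOOKUP_AUTOMOUNT;
	if (flags & MOVE_MOUNT_T_EMPTY_PATH)
		t |= SK_LOOKUP_EMPTY;

	*from_lflags = f;
	*to_lflags = t;
	return 0;
}
```

Keeping the two groups in separate nibbles means each path can follow
symlinks or automounts independently of the other.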

Signed-off-by: David Howells <[email protected]>
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/internal.h | 3 +
fs/namei.c | 40 ++++++++++
fs/namespace.c | 125 ++++++++++++++++++++++++++++----
include/linux/lsm_hooks.h | 6 ++
include/linux/security.h | 7 ++
include/linux/syscalls.h | 3 +
include/uapi/linux/mount.h | 11 +++
kernel/sys_ni.c | 1
security/security.c | 5 +
11 files changed, 186 insertions(+), 17 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index b7e2adda092c..76c95f35a599 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
386 i386 fsopen sys_fsopen __ia32_sys_fsopen
387 i386 fsmount sys_fsmount __ia32_sys_fsmount
388 i386 fspick sys_fspick __ia32_sys_fspick
+389 i386 move_mount sys_move_mount __ia32_sys_move_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index fd322986974b..b53080b756e8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
334 common fsopen __x64_sys_fsopen
335 common fsmount __x64_sys_fsmount
336 common fspick __x64_sys_fspick
+337 common move_mount __x64_sys_move_mount

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/internal.h b/fs/internal.h
index e3460a2e6b59..a52cfef7b47b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -17,6 +17,7 @@ struct linux_binprm;
struct path;
struct mount;
struct shrink_control;
+struct fd_cookie;

/*
* block_dev.c
@@ -55,6 +56,8 @@ extern void __init chrdev_init(void);
extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
const char *, unsigned int, struct path *);
+extern int move_mount_lookup(int, const char __user *, unsigned,
+ struct path *, struct fd_cookie **);
long do_mknodat(int dfd, const char __user *filename, umode_t mode,
unsigned int dev);
long do_mkdirat(int dfd, const char __user *pathname, umode_t mode);
diff --git a/fs/namei.c b/fs/namei.c
index acb8e27d4288..c4063170fb20 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2333,6 +2333,46 @@ static int filename_lookup(int dfd, struct filename *name, unsigned flags,
return retval;
}

+/*
+ * Look up the source path for move_mount(). This is a bit tricky as
+ * move_mount() needs to clear FMODE_NEED_UNMOUNT on the file struct pointed to
+ * by dfd if the pathname is empty and the move completed successfully, so we
+ * need to pass back the fd information to the caller.
+ */
+int move_mount_lookup(int dfd, const char __user *from_name, unsigned flags,
+ struct path *path, struct fd_cookie **_dfd_f)
+{
+ struct nameidata nd;
+ struct filename *name;
+ struct file *file;
+ int retval;
+
+ name = getname_flags(from_name, flags, NULL);
+ if (IS_ERR(name))
+ return PTR_ERR(name);
+ set_nameidata(&nd, dfd, name);
+ retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
+ if (unlikely(retval == -ECHILD))
+ retval = path_lookupat(&nd, flags, path);
+ if (unlikely(retval == -ESTALE))
+ retval = path_lookupat(&nd, flags | LOOKUP_REVAL, path);
+
+ if (likely(!retval)) {
+ audit_inode(name, path->dentry, flags & LOOKUP_PARENT);
+ file = __fdfile(nd.dfd);
+ if (file &&
+ file->f_path.mnt == path->mnt &&
+ file->f_path.dentry == path->dentry) {
+ *_dfd_f = nd.dfd;
+ nd.dfd = NULL;
+ }
+ }
+
+ restore_nameidata();
+ putname(name);
+ return retval;
+}
+
/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
static int path_parentat(struct nameidata *nd, unsigned flags,
struct path *parent)
diff --git a/fs/namespace.c b/fs/namespace.c
index e73cfcdfb3d1..5cd9b5be149f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2395,26 +2395,22 @@ static inline int tree_contains_unbindable(struct mount *mnt)
return 0;
}

-static int do_move_mount(struct path *path, const char *old_name)
+static int do_move_mount(struct path *old_path, struct path *new_path,
+ const struct file *dfd_ref)
{
- struct path old_path, parent_path;
+ struct path parent_path;
struct mount *p;
struct mount *old;
struct mountpoint *mp;
int err;
- if (!old_name || !*old_name)
- return -EINVAL;
- err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
- if (err)
- return err;

- mp = lock_mount(path);
+ mp = lock_mount(new_path);
err = PTR_ERR(mp);
if (IS_ERR(mp))
goto out;

- old = real_mount(old_path.mnt);
- p = real_mount(path->mnt);
+ old = real_mount(old_path->mnt);
+ p = real_mount(new_path->mnt);

err = -EINVAL;
if (!check_mnt(p) || !check_mnt(old))
@@ -2424,14 +2420,19 @@ static int do_move_mount(struct path *path, const char *old_name)
goto out1;

err = -EINVAL;
- if (old_path.dentry != old_path.mnt->mnt_root)
+ if (old_path->dentry != old_path->mnt->mnt_root)
goto out1;

- if (!mnt_has_parent(old))
- goto out1;
+ if (!mnt_has_parent(old)) {
+ /* We need to allow open(O_PATH|O_CLONE_MOUNT) or fsmount()
+ * followed by move_mount(), but mustn't allow "/" to be moved.
+ */
+ if (!dfd_ref || !(dfd_ref->f_mode & FMODE_NEED_UNMOUNT))
+ goto out1;
+ }

- if (d_is_dir(path->dentry) !=
- d_is_dir(old_path.dentry))
+ if (d_is_dir(new_path->dentry) !=
+ d_is_dir(old_path->dentry))
goto out1;
/*
* Don't move a mount residing in a shared parent.
@@ -2449,7 +2450,8 @@ static int do_move_mount(struct path *path, const char *old_name)
if (p == old)
goto out1;

- err = attach_recursive_mnt(old, real_mount(path->mnt), mp, &parent_path);
+ err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
+ &parent_path);
if (err)
goto out1;

@@ -2461,6 +2463,22 @@ static int do_move_mount(struct path *path, const char *old_name)
out:
if (!err)
path_put(&parent_path);
+ return err;
+}
+
+static int do_move_mount_old(struct path *path, const char *old_name)
+{
+ struct path old_path;
+ int err;
+
+ if (!old_name || !*old_name)
+ return -EINVAL;
+
+ err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
+ if (err)
+ return err;
+
+ err = do_move_mount(&old_path, path, NULL);
path_put(&old_path);
return err;
}
@@ -2903,7 +2921,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
retval = do_change_type(&path, flags);
else if (flags & MS_MOVE)
- retval = do_move_mount(&path, dev_name);
+ retval = do_move_mount_old(&path, dev_name);
else
retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
dev_name, data_page, data_size);
@@ -3375,6 +3393,79 @@ SYSCALL_DEFINE5(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags
return ret;
}

+/*
+ * Move a mount from one place to another. In combination with
+ * fsopen()/fsmount() this is used to install a new mount and in combination
+ * with open(O_PATH|O_CLONE_MOUNT[|O_NON_RECURSIVE]) it can be used to copy a
+ * mount subtree.
+ *
+ * Note the flags value is a combination of MOVE_MOUNT_* flags.
+ */
+SYSCALL_DEFINE5(move_mount,
+ int, from_dfd, const char *, from_pathname,
+ int, to_dfd, const char *, to_pathname,
+ unsigned int, flags)
+{
+ struct path from_path, to_path;
+ struct fd_cookie *from_f = NULL;
+ unsigned int lflags;
+ int ret = 0;
+
+ if (!may_mount())
+ return -EPERM;
+
+ if (flags & ~MOVE_MOUNT__MASK)
+ return -EINVAL;
+
+ /* If someone gives a pathname, they aren't permitted to move
+ * from an fd that requires unmount as we can't get at the flag
+ * to clear it afterwards.
+ */
+ lflags = 0;
+ if (flags & MOVE_MOUNT_F_SYMLINKS) lflags |= LOOKUP_FOLLOW;
+ if (flags & MOVE_MOUNT_F_AUTOMOUNTS) lflags |= LOOKUP_AUTOMOUNT;
+ if (flags & MOVE_MOUNT_F_EMPTY_PATH) lflags |= LOOKUP_EMPTY;
+
+ ret = move_mount_lookup(from_dfd, from_pathname, lflags, &from_path,
+ &from_f);
+ if (ret < 0)
+ return ret;
+
+ lflags = 0;
+ if (flags & MOVE_MOUNT_T_SYMLINKS) lflags |= LOOKUP_FOLLOW;
+ if (flags & MOVE_MOUNT_T_AUTOMOUNTS) lflags |= LOOKUP_AUTOMOUNT;
+ if (flags & MOVE_MOUNT_T_EMPTY_PATH) lflags |= LOOKUP_EMPTY;
+
+ ret = user_path_at(to_dfd, to_pathname, lflags, &to_path);
+ if (ret < 0)
+ goto out_from;
+
+ ret = security_move_mount(&from_path, &to_path);
+ if (ret < 0)
+ goto out_to;
+
+ ret = do_move_mount(&from_path, &to_path, __fdfile(from_f));
+
+out_to:
+ path_put(&to_path);
+out_from:
+ path_put(&from_path);
+ if (from_f) {
+ if (ret == 0) {
+ struct file *file = __fdfile(from_f);
+
+ /* If successful, move_mount() should always clear the
+ * unmount-on-close flag, but it may race with another
+ * move_mount() when doing so.
+ */
+ WRITE_ONCE(file->f_mode,
+ READ_ONCE(file->f_mode) & ~FMODE_NEED_UNMOUNT);
+ }
+ __fdput(from_f);
+ }
+ return ret;
+}
+
/*
* Return true if path is reachable from root
*
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 5d8f8bd39b52..85fea328dbac 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -198,6 +198,10 @@
* Parse a string of security data filling in the opts structure
* @options string containing all mount options known by the LSM
* @opts binary data structure usable by the LSM
+ * @move_mount:
+ * Check permission before a mount is moved.
+ * @from_path indicates the mount that is going to be moved.
+ * @to_path indicates the mountpoint that will be mounted upon.
* @dentry_init_security:
* Compute a context for a dentry as the inode is not yet available
* since NFSv4 has no label backed by an EA anyway.
@@ -1535,6 +1539,7 @@ union security_list_options {
unsigned long kern_flags,
unsigned long *set_kern_flags);
int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
+ int (*move_mount)(const struct path *from_path, const struct path *to_path);
int (*dentry_init_security)(struct dentry *dentry, int mode,
const struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -1873,6 +1878,7 @@ struct security_hook_heads {
struct hlist_head sb_set_mnt_opts;
struct hlist_head sb_clone_mnt_opts;
struct hlist_head sb_parse_opts_str;
+ struct hlist_head move_mount;
struct hlist_head dentry_init_security;
struct hlist_head dentry_create_files_as;
#ifdef CONFIG_SECURITY_PATH
diff --git a/include/linux/security.h b/include/linux/security.h
index 5040455a747d..fcc6f5d04006 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -261,6 +261,7 @@ int security_sb_clone_mnt_opts(const struct super_block *oldsb,
unsigned long kern_flags,
unsigned long *set_kern_flags);
int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
+int security_move_mount(const struct path *from_path, const struct path *to_path);
int security_dentry_init_security(struct dentry *dentry, int mode,
const struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -655,6 +656,12 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
return 0;
}

+static inline int security_move_mount(const struct path *from_path,
+ const struct path *to_path)
+{
+ return 0;
+}
+
static inline int security_inode_alloc(struct inode *inode)
{
return 0;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 5130fd687a85..bf89f57046dc 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -901,6 +901,9 @@ asmlinkage long sys_fsopen(const char *fs_name, unsigned int flags,
asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
unsigned int flags);
asmlinkage long sys_fspick(int dfd, const char *path, unsigned int at_flags);
+asmlinkage long sys_move_mount(int from_dfd, const char *from_path,
+ int to_dfd, const char *to_path,
+ unsigned int ms_flags);


/*
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 3f9ec42510b0..2084596eb1d9 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -55,4 +55,15 @@
#define MS_MGC_VAL 0xC0ED0000
#define MS_MGC_MSK 0xffff0000

+/*
+ * move_mount() flags.
+ */
+#define MOVE_MOUNT_F_SYMLINKS 0x00000001 /* Follow symlinks on from path */
+#define MOVE_MOUNT_F_AUTOMOUNTS 0x00000002 /* Follow automounts on from path */
+#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
+#define MOVE_MOUNT_T_SYMLINKS 0x00000010 /* Follow symlinks on to path */
+#define MOVE_MOUNT_T_AUTOMOUNTS 0x00000020 /* Follow automounts on to path */
+#define MOVE_MOUNT_T_EMPTY_PATH 0x00000040 /* Empty to path permitted */
+#define MOVE_MOUNT__MASK 0x00000077
+
#endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 152fdc95d426..e65b5d587251 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -437,3 +437,4 @@ COND_SYSCALL(setuid16);
COND_SYSCALL(sys_fsopen);
COND_SYSCALL(sys_fsmount);
COND_SYSCALL(sys_fspick);
+COND_SYSCALL(sys_move_mount);
diff --git a/security/security.c b/security/security.c
index 3b155f7ee3ba..f7af4093706a 100644
--- a/security/security.c
+++ b/security/security.c
@@ -480,6 +480,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
}
EXPORT_SYMBOL(security_sb_parse_opts_str);

+int security_move_mount(const struct path *from_path, const struct path *to_path)
+{
+ return call_int_hook(move_mount, 0, from_path, to_path);
+}
+
int security_inode_alloc(struct inode *inode)
{
inode->i_security = NULL;


2018-05-25 02:48:37

by David Howells

Subject: [PATCH 32/32] [RFC] fsinfo: Add a system call to allow querying of filesystem information [ver #8]

Add a system call to allow filesystem information to be queried. This is
implemented as a function switch in which the caller nominates the desired
attribute value or values.


===============
NEW SYSTEM CALL
===============

The new system call looks like:

int ret = fsinfo(int dfd,
const char *filename,
const struct fsinfo_params *params,
void *buffer,
size_t buf_size);

The params parameter optionally points to a block of parameters:

struct fsinfo_params {
enum fsinfo_attribute request;
__u32 Nth;
__u32 at_flags;
__u32 __spare[6];
};

If params is NULL, it is assumed params->request should be
fsinfo_attr_statfs, params->Nth should be 0 and params->at_flags should be
0.

If params is given, all of params->__spare[] must be 0.
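
A minimal sketch of the defaulting and validation rules just described,
written as plain userspace C. The struct mirrors the layout shown above, but
the sk_ names, the numeric enum values and the choice of -EINVAL for a
non-zero spare word are assumptions made for illustration. They are not
taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Only two attribute names are needed here; the values are placeholders. */
enum sk_fsinfo_attribute {
	sk_fsinfo_attr_statfs,
	sk_fsinfo_attr_fsinfo,
};

struct sk_fsinfo_params {
	enum sk_fsinfo_attribute request;
	uint32_t Nth;
	uint32_t at_flags;
	uint32_t spare[6];
};

/*
 * Produce the effective parameters in *out.
 * A NULL input selects statfs with Nth == 0 and at_flags == 0.
 * A non-NULL input must have every spare word set to zero.
 */
int sk_fsinfo_resolve_params(const struct sk_fsinfo_params *in,
			     struct sk_fsinfo_params *out)
{
	size_t i;

	assert(out != NULL);

	if (!in) {
		memset(out, 0, sizeof(*out));
		out->request = sk_fsinfo_attr_statfs;
		return 0;
	}

	for (i = 0; i < sizeof(in->spare) / sizeof(in->spare[0]); i++)
		if (in->spare[i] != 0)
			return -EINVAL;	/* assumed error code */

	*out = *in;
	return 0;
}
```

Rejecting non-zero spare words up front is what lets those words be given a
meaning later without changing the behaviour seen by existing callers.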

dfd, filename and params->at_flags indicate the file to query. There is no
equivalent of lstat() as that can be emulated with fsinfo() by setting
AT_SYMLINK_NOFOLLOW in params->at_flags. There is also no equivalent of
fstat() as that can be emulated by passing a NULL filename to fsinfo() with
the fd of interest in dfd. AT_NO_AUTOMOUNT can also be used to allow an
automount point to be queried without triggering it.

AT_FORCE_ATTR_SYNC can be set in params->at_flags. This will require a
network filesystem to synchronise its attributes with the server.

AT_NO_ATTR_SYNC can be set in params->at_flags. This will suppress
synchronisation with the server in a network filesystem. The resulting
values should be considered approximate.

params->request indicates the attribute/attributes to be queried. This can
be one of:

fsinfo_attr_statfs - statfs-style info
fsinfo_attr_fsinfo - Information about fsinfo()
fsinfo_attr_limits - Filesystem limits
fsinfo_attr_capabilities - Filesystem capabilities
fsinfo_attr_timestamp_info - Inode timestamp info
fsinfo_attr_volume_id - Volume ID (var length)
fsinfo_attr_volume_uuid - Volume UUID
fsinfo_attr_volume_name - Volume name (string)
fsinfo_attr_cell_name - Cell name (string)
fsinfo_attr_domain_name - Domain name (string)
fsinfo_attr_realm_name - Realm name (string)
fsinfo_attr_server_name - Name of the Nth server (string)
fsinfo_attr_server_addresses - Addresses of the Nth server
fsinfo_attr_error_state - Error state
fsinfo_attr_parameter - Nth mount parameter (string)
fsinfo_attr_source - Nth mount source name (string)
fsinfo_attr_name_encoding - Filename encoding (string)
fsinfo_attr_name_codepage - Filename codepage (string)
fsinfo_attr_io_size - Optimal I/O sizes

Some attributes (such as the servers backing a network filesystem) can have
multiple values. These can be enumerated by setting params->Nth to 0, 1,
... until ENODATA is returned.
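
The enumeration pattern can be sketched as below. To keep the example
self-contained it does not call fsinfo() itself: sk_fake_server_name() is a
made-up stand-in for one query of a multi-valued string attribute, and it
uses the kernel-style convention of returning a negative errno directly. A
real userspace caller would more likely see -1 with errno set to ENODATA.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical query hook: returns the value length, or a negative errno. */
typedef int (*sk_query_fn)(unsigned int nth, char *buf, size_t buf_size);

/* Made-up backing data standing in for a filesystem's server list. */
static const char *const sk_fake_servers[] = { "alpha", "beta", "gamma" };

int sk_fake_server_name(unsigned int nth, char *buf, size_t buf_size)
{
	size_t len;

	if (nth >= sizeof(sk_fake_servers) / sizeof(sk_fake_servers[0]))
		return -ENODATA;	/* ran off the end of the list */

	len = strlen(sk_fake_servers[nth]);
	if (buf && buf_size) {
		size_t n = len < buf_size ? len : buf_size;

		memcpy(buf, sk_fake_servers[nth], n);
	}
	return (int)len;
}

/*
 * Count the values of an attribute by stepping Nth from 0 upwards until
 * the query reports -ENODATA. Any other error is passed back unchanged.
 */
int sk_count_values(sk_query_fn query)
{
	unsigned int nth;

	assert(query != NULL);

	for (nth = 0; ; nth++) {
		int ret = query(nth, NULL, 0);

		if (ret == -ENODATA)
			return (int)nth;
		if (ret < 0)
			return ret;
	}
}
```

Because the end of the list is signalled by an error rather than by a count,
a caller does not need to know in advance how many values exist.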

buffer and buf_size point to the reply buffer. The buffer is filled up to the
specified size, even if this means truncating the reply. The full size of the
reply is returned. In future versions, this will allow extra fields to be
tacked on to the end of the reply, but anyone not expecting them will only get
the subset they're expecting. If either buffer or buf_size is 0, no copy
will take place and the data size will be returned.
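
The reply-copy rule described above can be sketched in a few lines of
userspace C. The helper name sk_copy_reply() is hypothetical. It only models
the stated behaviour (copy at most buf_size bytes, always report the full
reply size) and is not the code used by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy as much of the reply as fits into the caller's buffer and return
 * the full size of the reply, whether or not it was truncated. With a
 * NULL buffer or a zero buf_size nothing is copied, so the call acts as
 * a pure size query.
 */
size_t sk_copy_reply(void *buffer, size_t buf_size,
		     const void *reply, size_t reply_size)
{
	assert(reply != NULL);

	if (buffer && buf_size) {
		size_t n = reply_size < buf_size ? reply_size : buf_size;

		memcpy(buffer, reply, n);
	}
	return reply_size;
}
```

A caller can compare the returned size with buf_size to detect truncation,
or make a first call with no buffer to learn how much space to allocate.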

At the moment, this will only work on x86_64 and i386 as it requires the system
call to be wired up.

Signed-off-by: David Howells <[email protected]>
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/statfs.c | 431 ++++++++++++++++++++++++++++++++
include/linux/fs.h | 4
include/linux/fsinfo.h | 25 ++
include/linux/syscalls.h | 3
include/uapi/linux/fsinfo.h | 231 +++++++++++++++++
samples/statx/Makefile | 5
samples/statx/test-fsinfo.c | 179 +++++++++++++
9 files changed, 879 insertions(+), 1 deletion(-)
create mode 100644 include/linux/fsinfo.h
create mode 100644 include/uapi/linux/fsinfo.h
create mode 100644 samples/statx/test-fsinfo.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 76c95f35a599..d447f9fe28b4 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -401,3 +401,4 @@
387 i386 fsmount sys_fsmount __ia32_sys_fsmount
388 i386 fspick sys_fspick __ia32_sys_fspick
389 i386 move_mount sys_move_mount __ia32_sys_move_mount
+390 i386 fsinfo sys_fsinfo __ia32_sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b53080b756e8..dce56e0507ef 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -346,6 +346,7 @@
335 common fsmount __x64_sys_fsmount
336 common fspick __x64_sys_fspick
337 common move_mount __x64_sys_move_mount
+338 common fsinfo __x64_sys_fsinfo

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/statfs.c b/fs/statfs.c
index 5b2a24f0f263..07437780a30c 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -9,6 +9,7 @@
#include <linux/security.h>
#include <linux/uaccess.h>
#include <linux/compat.h>
+#include <linux/fsinfo.h>
#include "internal.h"

static int flags_by_mnt(int mnt_flags)
@@ -384,3 +385,433 @@ COMPAT_SYSCALL_DEFINE2(ustat, unsigned, dev, struct compat_ustat __user *, u)
return 0;
}
#endif
+
+/*
+ * Get basic filesystem stats from statfs.
+ */
+static int fsinfo_generic_statfs(struct dentry *dentry,
+ struct fsinfo_statfs *p)
+{
+ struct super_block *sb;
+ struct kstatfs buf;
+ int ret;
+
+ ret = statfs_by_dentry(dentry, &buf);
+ if (ret < 0)
+ return ret;
+
+ sb = dentry->d_sb;
+ p->f_fstype = sb->s_magic;
+ p->f_dev_major = MAJOR(sb->s_dev);
+ p->f_dev_minor = MINOR(sb->s_dev);
+ p->f_blocks = buf.f_blocks;
+ p->f_bfree = buf.f_bfree;
+ p->f_bavail = buf.f_bavail;
+ p->f_files = buf.f_files;
+ p->f_ffree = buf.f_ffree;
+ p->f_favail = buf.f_ffree;
+ p->f_bsize = buf.f_bsize;
+ p->f_frsize = buf.f_frsize;
+ p->f_flags = ST_VALID | flags_by_sb(sb->s_flags);
+
+ memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
+ strcpy(p->f_fs_name, dentry->d_sb->s_type->name);
+ return sizeof(*p);
+}
+
+static int fsinfo_generic_limits(struct dentry *dentry,
+ struct fsinfo_limits *lim)
+{
+ struct super_block *sb = dentry->d_sb;
+
+ lim->max_file_size = sb->s_maxbytes;
+ lim->max_hard_links = sb->s_max_links;
+ lim->max_uid = UINT_MAX;
+ lim->max_gid = UINT_MAX;
+ lim->max_projid = UINT_MAX;
+ lim->max_filename_len = NAME_MAX;
+ lim->max_symlink_len = PAGE_SIZE;
+ lim->max_xattr_name_len = XATTR_NAME_MAX;
+ lim->max_xattr_body_len = XATTR_SIZE_MAX;
+ lim->max_dev_major = 0xffffff;
+ lim->max_dev_minor = 0xff;
+ return sizeof(*lim);
+}
+
+static inline void set_cap(struct fsinfo_capabilities *c,
+ enum fsinfo_capability cap)
+{
+ c->capabilities[cap / 8] |= 1 << (cap % 8);
+}
+
+static int fsinfo_generic_capabilities(struct dentry *dentry,
+ struct fsinfo_capabilities *c)
+{
+ struct super_block *sb = dentry->d_sb;
+
+ c->supported_stx_mask = STATX_BASIC_STATS;
+ c->supported_stx_attributes = 0;
+
+ if (sb->s_mtd)
+ set_cap(c, fsinfo_cap_is_flash_fs);
+ else if (sb->s_bdev)
+ set_cap(c, fsinfo_cap_is_block_fs);
+
+ if (sb->s_quota_types & QTYPE_MASK_USR)
+ set_cap(c, fsinfo_cap_user_quotas);
+ if (sb->s_quota_types & QTYPE_MASK_GRP)
+ set_cap(c, fsinfo_cap_group_quotas);
+ if (sb->s_quota_types & QTYPE_MASK_PRJ)
+ set_cap(c, fsinfo_cap_project_quotas);
+ if (sb->s_xattr)
+ set_cap(c, fsinfo_cap_xattrs);
+ if (sb->s_d_op && sb->s_d_op->d_automount)
+ set_cap(c, fsinfo_cap_automounts);
+ if (sb->s_id[0])
+ set_cap(c, fsinfo_cap_volume_name);
+ set_cap(c, fsinfo_cap_no_unix_mode);
+ return sizeof(*c);
+}
+
+static int fsinfo_generic_timestamp_info(struct dentry *dentry,
+ struct fsinfo_timestamp_info *ts)
+{
+ struct super_block *sb = dentry->d_sb;
+
+ /* If unset, assume 1s granularity */
+ u16 mantissa = 1;
+ s8 exponent = 0;
+
+ ts->minimum_timestamp = S64_MIN;
+ ts->maximum_timestamp = S64_MAX;
+ if (sb->s_time_gran < 1000000000) {
+ if (sb->s_time_gran < 1000)
+ exponent = -9;
+ else if (sb->s_time_gran < 1000000)
+ exponent = -6;
+ else
+ exponent = -3;
+ }
+#define set_gran(x) \
+ do { \
+ ts->x##_mantissa = mantissa; \
+ ts->x##_exponent = exponent; \
+ } while (0)
+ set_gran(atime_gran);
+ set_gran(btime_gran);
+ set_gran(ctime_gran);
+ set_gran(mtime_gran);
+ return sizeof(*ts);
+}
+
+static int fsinfo_generic_volume_uuid(struct dentry *dentry,
+ struct fsinfo_volume_uuid *vu)
+{
+ struct super_block *sb = dentry->d_sb;
+
+ memcpy(vu, &sb->s_uuid, sizeof(*vu));
+ return sizeof(*vu);
+}
+
+static int fsinfo_generic_volume_name(struct dentry *dentry, char *buf)
+{
+ struct super_block *sb = dentry->d_sb;
+ size_t len = strlen(sb->s_id);
+
+ if (buf)
+ memcpy(buf, sb->s_id, len + 1);
+ return len;
+}
+
+static int fsinfo_generic_name_encoding(struct dentry *dentry, char *buf)
+{
+ static const char encoding[] = "utf8";
+
+ if (buf)
+ memcpy(buf, encoding, sizeof(encoding) - 1);
+ return sizeof(encoding) - 1;
+}
+
+static int fsinfo_generic_io_size(struct dentry *dentry,
+ struct fsinfo_io_size *c)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct kstatfs buf;
+ int ret;
+
+ if (sb->s_op->statfs == simple_statfs) {
+ c->block_size = PAGE_SIZE;
+ c->max_single_read_size = 0;
+ c->max_single_write_size = 0;
+ c->best_read_size = PAGE_SIZE;
+ c->best_write_size = PAGE_SIZE;
+ } else {
+ ret = statfs_by_dentry(dentry, &buf);
+ if (ret < 0)
+ return ret;
+ c->block_size = buf.f_bsize;
+ c->max_single_read_size = buf.f_bsize;
+ c->max_single_write_size = buf.f_bsize;
+ c->best_read_size = PAGE_SIZE;
+ c->best_write_size = PAGE_SIZE;
+ }
+ return sizeof(*c);
+}
+
+/*
+ * Implement some queries generically from stuff in the superblock.
+ */
+int generic_fsinfo(struct dentry *dentry, struct fsinfo_kparams *params)
+{
+#define _gen(X) fsinfo_attr_##X: return fsinfo_generic_##X(dentry, params->buffer)
+
+ switch (params->request) {
+ case _gen(statfs);
+ case _gen(limits);
+ case _gen(capabilities);
+ case _gen(timestamp_info);
+ case _gen(volume_uuid);
+ case _gen(volume_name);
+ case _gen(name_encoding);
+ case _gen(io_size);
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+/*
+ * Retrieve the filesystem info. We make some stuff up if the operation is not
+ * supported.
+ */
+int vfs_fsinfo(const struct path *path, struct fsinfo_kparams *params)
+{
+ struct dentry *dentry = path->dentry;
+ int (*get_fsinfo)(struct dentry *, struct fsinfo_kparams *);
+ int ret;
+
+ if (params->request == fsinfo_attr_fsinfo) {
+ struct fsinfo_fsinfo *info = params->buffer;
+
+ info->max_attr = fsinfo_attr__nr;
+ info->max_cap = fsinfo_cap__nr;
+ return sizeof(*info);
+ }
+
+ get_fsinfo = dentry->d_sb->s_op->get_fsinfo;
+ if (!get_fsinfo) {
+ if (!dentry->d_sb->s_op->statfs)
+ return -EOPNOTSUPP;
+ get_fsinfo = generic_fsinfo;
+ }
+
+ ret = security_sb_statfs(dentry);
+ if (ret)
+ return ret;
+
+ ret = get_fsinfo(dentry, params);
+ if (ret < 0)
+ return ret;
+
+ if (params->request == fsinfo_attr_statfs &&
+ params->buffer) {
+ struct fsinfo_statfs *p = params->buffer;
+
+ p->f_flags |= flags_by_mnt(path->mnt->mnt_flags);
+ }
+ return 0;
+}
+
+static int vfs_fsinfo_path(int dfd, const char __user *filename,
+ struct fsinfo_kparams *params)
+{
+ struct path path;
+ unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+ int ret = -EINVAL;
+
+ if ((params->at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
+ AT_EMPTY_PATH)) != 0)
+ return -EINVAL;
+
+ if (params->at_flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (params->at_flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (params->at_flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+
+retry:
+ ret = user_path_at(dfd, filename, lookup_flags, &path);
+ if (ret)
+ goto out;
+
+ ret = vfs_fsinfo(&path, params);
+ path_put(&path);
+ if (retry_estale(ret, lookup_flags)) {
+ lookup_flags |= LOOKUP_REVAL;
+ goto retry;
+ }
+out:
+ return ret;
+}
+
+static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_kparams *params)
+{
+ struct fd f = fdget_raw(fd);
+ int ret = -EBADF;
+
+ if (f.file) {
+ ret = vfs_fsinfo(&f.file->f_path, params);
+ fdput(f);
+ }
+ return ret;
+}
+
+/*
+ * Return buffer information by requestable attribute.
+ *
+ * STRUCT indicates a fixed-size structure with only one instance.
+ * STRUCT_N indicates a fixed-size structure that may have multiple instances.
+ * STRING indicates a string with only one instance.
+ * STRING_N indicates a string that may have multiple instances.
+ * STRUCT_ARRAY indicates an array of fixed-size structs with only one instance.
+ * STRUCT_ARRAY_N as above that may have multiple instances.
+ *
+ * If an entry is marked STRUCT or STRUCT_N then if no buffer is supplied to
+ * sys_fsinfo(), sys_fsinfo() will handle returning the buffer size without
+ * calling vfs_fsinfo() and the filesystem.
+ *
+ * No struct may have more than 252 bytes (ie. 0x3f * 4)
+ */
+#define FSINFO_STRUCT(N) [fsinfo_attr_##N] = sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRING(N) [fsinfo_attr_##N] = 0x80
+#define FSINFO_STRUCT_ARRAY(N) [fsinfo_attr_##N] = 0x80 | sizeof(struct fsinfo_##N)
+#define FSINFO_STRUCT_N(N) [fsinfo_attr_##N] = 0x40 | sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRING_N(N) [fsinfo_attr_##N] = 0xc0
+#define FSINFO_STRUCT_ARRAY_N(N) [fsinfo_attr_##N] = 0xc0 | sizeof(struct fsinfo_##N)
+static const u8 fsinfo_buffer_sizes[fsinfo_attr__nr] = {
+ FSINFO_STRUCT(statfs),
+ FSINFO_STRUCT(fsinfo),
+ FSINFO_STRUCT(limits),
+ FSINFO_STRUCT_ARRAY(capabilities),
+ FSINFO_STRUCT(timestamp_info),
+ FSINFO_STRING(volume_id),
+ FSINFO_STRUCT(volume_uuid),
+ FSINFO_STRING(volume_name),
+ FSINFO_STRING(cell_name),
+ FSINFO_STRING(domain_name),
+ FSINFO_STRING(realm_name),
+ FSINFO_STRING_N(server_name),
+ FSINFO_STRUCT_ARRAY_N(server_addresses),
+ FSINFO_STRUCT(error_state),
+ FSINFO_STRING_N(parameter),
+ FSINFO_STRING_N(source),
+ FSINFO_STRING(name_encoding),
+ FSINFO_STRING(name_codepage),
+ FSINFO_STRUCT(io_size),
+};
+
+/**
+ * sys_fsinfo - System call to get filesystem information
+ * @dfd: Base directory to pathwalk from or fd referring to filesystem.
+ * @filename: Filesystem to query or NULL.
+ * @_params: Parameters to define request (or NULL for enhanced statfs).
+ * @_buffer: Result buffer.
+ * @buf_size: Size of result buffer.
+ *
+ * Get information on a filesystem. The filesystem attribute to be queried is
+ * indicated by @_params->request, and some of the attributes can have multiple
+ * values, indexed by @_params->Nth. If @_params is NULL, then the 0th
+ * fsinfo_attr_statfs attribute is queried. If an attribute does not exist,
+ * EOPNOTSUPP is returned; if the Nth value does not exist, ENODATA is
+ * returned.
+ *
+ * On success, the size of the attribute's value is returned. If @buf_size is
+ * 0 or @_buffer is NULL, only the size is returned. If the size of the value
+ * is larger than @buf_size, it will be truncated by the copy. The full size
+ * of the value will be returned.
+ */
+SYSCALL_DEFINE5(fsinfo,
+ int, dfd, const char __user *, filename,
+ struct fsinfo_params __user *, _params,
+ void __user *, _buffer, size_t, buf_size)
+{
+ struct fsinfo_params user_params;
+ struct fsinfo_kparams params;
+ size_t size;
+ int ret;
+
+ if (!access_ok(VERIFY_WRITE, _buffer, buf_size))
+ return -EFAULT;
+
+ if (_params) {
+ if (copy_from_user(&user_params, _params, sizeof(user_params)))
+ return -EFAULT;
+ if (user_params.__spare[0] ||
+ user_params.__spare[1] ||
+ user_params.__spare[2] ||
+ user_params.__spare[3] ||
+ user_params.__spare[4] ||
+ user_params.__spare[5])
+ return -EINVAL;
+ if (user_params.request >= fsinfo_attr__nr)
+ return -EOPNOTSUPP;
+ params.request = user_params.request;
+ params.Nth = user_params.Nth;
+ params.at_flags = user_params.at_flags;
+ } else {
+ params.request = fsinfo_attr_statfs;
+ params.Nth = 0;
+ params.at_flags = 0;
+ }
+
+ if (!_buffer)
+ buf_size = 0;
+
+ /* Allocate an appropriately-sized buffer. We will truncate the
+ * contents when we write the contents back to userspace.
+ */
+ size = fsinfo_buffer_sizes[params.request];
+ if (!(size & 0x40) && params.Nth != 0)
+ return -ENODATA;
+ size &= ~0x40;
+ if (size == 0)
+ return -ENOBUFS;
+ if (size & 0x80) {
+ size = 4096;
+ } else {
+ size *= sizeof(__u32);
+ if (buf_size == 0)
+ return size; /* We know how big the buffer should be */
+ }
+
+ if (buf_size > 0) {
+ params.buf_size = size;
+ params.buffer = kzalloc(size, GFP_KERNEL);
+ if (!params.buffer)
+ return -ENOMEM;
+ } else {
+ params.buf_size = 0;
+ params.buffer = NULL;
+ }
+
+ if (filename)
+ ret = vfs_fsinfo_path(dfd, filename, &params);
+ else
+ ret = vfs_fsinfo_fd(dfd, &params);
+ if (ret < 0)
+ goto error;
+
+ if (ret == 0) {
+ ret = -ENODATA;
+ goto error;
+ }
+
+ if (buf_size > ret)
+ buf_size = ret;
+
+ if (copy_to_user(_buffer, params.buffer, buf_size))
+ ret = -EFAULT;
+error:
+ kfree(params.buffer);
+ return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 40890e3359f0..a339c5560506 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -61,6 +61,8 @@ struct iov_iter;
struct fscrypt_info;
struct fscrypt_operations;
struct fs_context;
+struct fsinfo_kparams;
+enum fsinfo_attribute;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -1835,6 +1837,7 @@ struct super_operations {
int (*thaw_super) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
+ int (*get_fsinfo) (struct dentry *, struct fsinfo_kparams *);
int (*remount_fs) (struct super_block *, int *, char *, size_t);
int (*reconfigure) (struct super_block *, struct fs_context *);
void (*umount_begin) (struct super_block *);
@@ -2200,6 +2203,7 @@ extern int iterate_mounts(int (*)(struct vfsmount *, void *), void *,
extern int vfs_statfs(const struct path *, struct kstatfs *);
extern int user_statfs(const char __user *, struct kstatfs *);
extern int fd_statfs(int, struct kstatfs *);
+extern int vfs_fsinfo(const struct path *, struct fsinfo_kparams *);
extern int freeze_super(struct super_block *super);
extern int thaw_super(struct super_block *super);
extern bool our_mnt(struct vfsmount *mnt);
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
new file mode 100644
index 000000000000..4832064653ab
--- /dev/null
+++ b/include/linux/fsinfo.h
@@ -0,0 +1,25 @@
+/* Filesystem information query
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FSINFO_H
+#define _LINUX_FSINFO_H
+
+#include <uapi/linux/fsinfo.h>
+
+struct fsinfo_kparams {
+ enum fsinfo_attribute request; /* What is being asked for */
+ __u32 Nth; /* Instance of it (some may have multiple) */
+ __u32 at_flags; /* AT_SYMLINK_NOFOLLOW and similar */
+ void *buffer; /* Where to place the reply */
+ size_t buf_size; /* Size of the buffer */
+};
+
+#endif /* _LINUX_FSINFO_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bf89f57046dc..14be5dc15a13 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -49,6 +49,7 @@ struct stat64;
struct statfs;
struct statfs64;
struct statx;
+struct fsinfo_params;
struct __sysctl_args;
struct sysinfo;
struct timespec;
@@ -904,6 +905,8 @@ asmlinkage long sys_fspick(int dfd, const char *path, unsigned int at_flags);
asmlinkage long sys_move_mount(int from_dfd, const char *from_path,
int to_dfd, const char *to_path,
unsigned int ms_flags);
+asmlinkage long sys_fsinfo(int dfd, const char *path, struct fsinfo_params *params,
+ void *buffer, size_t buf_size);


/*
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
new file mode 100644
index 000000000000..972feebaf2ed
--- /dev/null
+++ b/include/uapi/linux/fsinfo.h
@@ -0,0 +1,231 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* fsinfo() definitions.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+#ifndef _UAPI_LINUX_FSINFO_H
+#define _UAPI_LINUX_FSINFO_H
+
+/*
+ * The filesystem attributes that can be requested. Note that some attributes
+ * may have multiple instances which can be switched in the parameter block.
+ */
+enum fsinfo_attribute {
+ fsinfo_attr_statfs = 0, /* Extended filesystem information */
+ fsinfo_attr_fsinfo = 1, /* Information about fsinfo() */
+ fsinfo_attr_limits = 2, /* Filesystem limits */
+ fsinfo_attr_capabilities = 3, /* Filesystem capabilities (bits) */
+ fsinfo_attr_timestamp_info = 4, /* Inode timestamp info */
+ fsinfo_attr_volume_id = 5, /* Volume ID (var length) */
+ fsinfo_attr_volume_uuid = 6, /* Volume UUID (LE uuid) */
+ fsinfo_attr_volume_name = 7, /* Volume name (string) */
+ fsinfo_attr_cell_name = 8, /* Cell name (string) */
+ fsinfo_attr_domain_name = 9, /* Domain name (string) */
+ fsinfo_attr_realm_name = 10, /* Realm name (string) */
+ fsinfo_attr_server_name = 11, /* Name of the Nth server */
+ fsinfo_attr_server_addresses = 12, /* Addresses of the Nth server */
+ fsinfo_attr_error_state = 13, /* Error state */
+ fsinfo_attr_parameter = 14, /* Nth mount parameter (string) */
+ fsinfo_attr_source = 15, /* Nth mount source name (string) */
+ fsinfo_attr_name_encoding = 16, /* Filename encoding (string) */
+ fsinfo_attr_name_codepage = 17, /* Filename codepage (string) */
+ fsinfo_attr_io_size = 18, /* Optimal I/O sizes */
+ fsinfo_attr__nr
+};
+
+/*
+ * Optional fsinfo() parameter structure.
+ *
+ * If this is not given, it is assumed that fsinfo_attr_statfs instance 0 is
+ * desired.
+ */
+struct fsinfo_params {
+ enum fsinfo_attribute request; /* What is being asked for */
+ __u32 Nth; /* Instance of it (some may have multiple) */
+ __u32 at_flags; /* AT_SYMLINK_NOFOLLOW and similar flags */
+ __u32 __spare[6]; /* Spare params; all must be 0 */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_statfs).
+ * - This gives extended filesystem information.
+ */
+struct fsinfo_statfs {
+ /* 0x00 - General info */
+ __u32 f_fstype; /* Filesystem type from linux/magic.h [uncond] */
+ __u32 f_dev_major; /* As st_dev_* from struct statx [uncond] */
+ __u32 f_dev_minor;
+ __u32 __spare0c[1];
+
+ /* 0x10 - statfs information */
+ __u64 f_blocks; /* Total number of blocks in fs */
+ __u64 f_bfree; /* Total number of free blocks */
+ __u64 f_bavail; /* Number of free blocks available to ordinary user */
+ __u64 f_files; /* Total number of file nodes in fs */
+ __u64 f_ffree; /* Number of free file nodes */
+ __u64 f_favail; /* Number of free file nodes available to ordinary user */
+ /* 0x40 */
+ __u32 f_bsize; /* Optimal block size */
+ __u32 f_frsize; /* Fragment size */
+ __u64 f_flags; /* Filesystem mount flags (MS_*) */
+ /* 0x50 */
+ __u64 f_fsid; /* Short 64-bit Filesystem ID (as statfs) */
+ __u64 f_sb_id; /* Internal superblock ID for sbnotify()/mntnotify() */
+ /* 0x60 - Filesystem type name */
+ char f_fs_name[15 + 1];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_id).
+ * - This gives filesystem identifiers.
+ */
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_limits).
+ *
+ * List of supported filesystem limits.
+ */
+struct fsinfo_limits {
+ __u64 max_file_size; /* Maximum file size */
+ __u64 max_uid; /* Maximum UID supported */
+ __u64 max_gid; /* Maximum GID supported */
+ __u64 max_projid; /* Maximum project ID supported */
+ __u32 max_dev_major; /* Maximum device major representable */
+ __u32 max_dev_minor; /* Maximum device minor representable */
+ __u32 max_hard_links; /* Maximum number of hard links on a file */
+ __u32 max_xattr_body_len; /* Maximum xattr content length */
+ __u16 max_xattr_name_len; /* Maximum xattr name length */
+ __u16 max_filename_len; /* Maximum filename length */
+ __u16 max_symlink_len; /* Maximum symlink content length */
+ __u16 __spare;
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_capabilities).
+ *
+ * Bitmask indicating filesystem capabilities where renderable as single bits.
+ */
+enum fsinfo_capability {
+ fsinfo_cap_is_kernel_fs = 0, /* fs is kernel-special filesystem */
+ fsinfo_cap_is_block_fs = 1, /* fs is block-based filesystem */
+ fsinfo_cap_is_flash_fs = 2, /* fs is flash filesystem */
+ fsinfo_cap_is_network_fs = 3, /* fs is network filesystem */
+ fsinfo_cap_is_automounter_fs = 4, /* fs is automounter special filesystem */
+ fsinfo_cap_automounts = 5, /* fs supports automounts */
+ fsinfo_cap_adv_locks = 6, /* fs supports advisory file locking */
+ fsinfo_cap_mand_locks = 7, /* fs supports mandatory file locking */
+ fsinfo_cap_leases = 8, /* fs supports file leases */
+ fsinfo_cap_uids = 9, /* fs supports numeric uids */
+ fsinfo_cap_gids = 10, /* fs supports numeric gids */
+ fsinfo_cap_projids = 11, /* fs supports numeric project ids */
+ fsinfo_cap_id_names = 12, /* fs supports user names */
+ fsinfo_cap_id_guids = 13, /* fs supports user guids */
+ fsinfo_cap_windows_attrs = 14, /* fs has windows attributes */
+ fsinfo_cap_user_quotas = 15, /* fs has per-user quotas */
+ fsinfo_cap_group_quotas = 16, /* fs has per-group quotas */
+ fsinfo_cap_project_quotas = 17, /* fs has per-project quotas */
+ fsinfo_cap_xattrs = 18, /* fs has xattrs */
+ fsinfo_cap_journal = 19, /* fs has a journal */
+ fsinfo_cap_data_is_journalled = 20, /* fs is using data journalling */
+ fsinfo_cap_o_sync = 21, /* fs supports O_SYNC */
+ fsinfo_cap_o_direct = 22, /* fs supports O_DIRECT */
+ fsinfo_cap_volume_id = 23, /* fs has a volume ID */
+ fsinfo_cap_volume_uuid = 24, /* fs has a volume UUID */
+ fsinfo_cap_volume_name = 25, /* fs has a volume name */
+ fsinfo_cap_volume_fsid = 26, /* fs has a volume FSID */
+ fsinfo_cap_cell_name = 27, /* fs has a cell name */
+ fsinfo_cap_domain_name = 28, /* fs has a domain name */
+ fsinfo_cap_realm_name = 29, /* fs has a realm name */
+ fsinfo_cap_iver_all_change = 30, /* i_version represents data + meta changes */
+ fsinfo_cap_iver_data_change = 31, /* i_version represents data changes only */
+ fsinfo_cap_iver_mono_incr = 32, /* i_version incremented monotonically */
+ fsinfo_cap_symlinks = 33, /* fs supports symlinks */
+ fsinfo_cap_hard_links = 34, /* fs supports hard links */
+ fsinfo_cap_hard_links_1dir = 35, /* fs supports hard links in same dir only */
+ fsinfo_cap_device_files = 36, /* fs supports bdev, cdev */
+ fsinfo_cap_unix_specials = 37, /* fs supports pipe, fifo, socket */
+ fsinfo_cap_resource_forks = 38, /* fs supports resource forks/streams */
+ fsinfo_cap_name_case_indep = 39, /* Filename case independence is mandatory */
+ fsinfo_cap_name_non_utf8 = 40, /* fs has non-utf8 names */
+ fsinfo_cap_name_has_codepage = 41, /* fs has a filename codepage */
+ fsinfo_cap_sparse = 42, /* fs supports sparse files */
+ fsinfo_cap_not_persistent = 43, /* fs is not persistent */
+ fsinfo_cap_no_unix_mode = 44, /* fs does not support unix mode bits */
+ fsinfo_cap__nr
+};
+
+struct fsinfo_capabilities {
+ __u64 supported_stx_attributes; /* What statx::stx_attributes are supported */
+ __u32 supported_stx_mask; /* What statx::stx_mask bits are supported */
+ __u32 supported_ioc_flags; /* What FS_IOC_* flags are supported */
+ __u8 capabilities[(fsinfo_cap__nr + 7) / 8];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_timestamp_info).
+ */
+struct fsinfo_timestamp_info {
+ __s64 minimum_timestamp; /* Minimum timestamp value in seconds */
+ __s64 maximum_timestamp; /* Maximum timestamp value in seconds */
+ __u16 atime_gran_mantissa; /* Granularity(secs) = mant * 10^exp */
+ __u16 btime_gran_mantissa;
+ __u16 ctime_gran_mantissa;
+ __u16 mtime_gran_mantissa;
+ __s8 atime_gran_exponent;
+ __s8 btime_gran_exponent;
+ __s8 ctime_gran_exponent;
+ __s8 mtime_gran_exponent;
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_volume_uuid).
+ */
+struct fsinfo_volume_uuid {
+ __u8 uuid[16];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_server_addresses).
+ *
+ * Find the addresses of the Nth server for a network mount.
+ */
+struct fsinfo_server_addresses {
+ struct __kernel_sockaddr_storage address[0];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_error_state).
+ *
+ * Retrieve the error state for a filesystem.
+ */
+struct fsinfo_error_state {
+ __u32 io_error; /* General I/O error counter */
+ __u32 wb_error; /* Writeback error counter */
+ __u32 bdev_error; /* Blockdev error counter */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_io_size).
+ *
+ * Retrieve the optimal I/O size for a filesystem.
+ */
+struct fsinfo_io_size {
+ __u32 block_size; /* Minimum block granularity for O_DIRECT */
+ __u32 max_single_read_size; /* Maximum size of a single unbuffered read */
+ __u32 max_single_write_size; /* Maximum size of a single unbuffered write */
+ __u32 best_read_size; /* Optimal read size */
+ __u32 best_write_size; /* Optimal write size */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_fsinfo).
+ *
+ * This gives information about fsinfo() itself.
+ */
+struct fsinfo_fsinfo {
+ enum fsinfo_attribute max_attr; /* Number of supported attributes */
+ enum fsinfo_capability max_cap; /* Number of supported capabilities */
+};
+
+#endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/statx/Makefile b/samples/statx/Makefile
index 59df7c25a9d1..9cb9a88e3a10 100644
--- a/samples/statx/Makefile
+++ b/samples/statx/Makefile
@@ -1,7 +1,10 @@
# List of programs to build
-hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx
+hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx test-fsinfo

# Tell kbuild to always build the programs
always := $(hostprogs-y)

HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
+
+HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
+HOSTLOADLIBES_test-fsinfo += -lm
diff --git a/samples/statx/test-fsinfo.c b/samples/statx/test-fsinfo.c
new file mode 100644
index 000000000000..7724390b0aa4
--- /dev/null
+++ b/samples/statx/test-fsinfo.c
@@ -0,0 +1,179 @@
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2015 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <sys/syscall.h>
+#include <linux/stat.h>
+#include <linux/fcntl.h>
+#include <sys/stat.h>
+
+#define __NR_fsinfo 326
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename, unsigned flags,
+ unsigned request, void *buffer)
+{
+ return syscall(__NR_fsinfo, dfd, filename, flags, request, buffer);
+}
+
+static void dump_fsinfo(struct fsinfo *f)
+{
+ printf("mask : %x\n", f->f_mask);
+ printf("dev : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
+ printf("fs : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
+ printf("ioc : %llx\n", (unsigned long long)f->f_supported_ioc_flags);
+ printf("nameln: %u\n", f->f_namelen);
+ printf("flags : %llx\n", (unsigned long long)f->f_flags);
+ printf("times : range=%llx-%llx\n",
+ (unsigned long long)f->f_min_time,
+ (unsigned long long)f->f_max_time);
+
+#define print_time(G) \
+ printf(#G"time : gran=%gs\n", \
+ (f->f_##G##time_gran_mantissa * \
+ pow(10., f->f_##G##time_gran_exponent)))
+ print_time(a);
+ print_time(b);
+ print_time(c);
+ print_time(m);
+
+
+ if (f->f_mask & FSINFO_BLOCKS_INFO)
+ printf("blocks: n=%llu fr=%llu av=%llu\n",
+ (unsigned long long)f->f_blocks,
+ (unsigned long long)f->f_bfree,
+ (unsigned long long)f->f_bavail);
+
+ if (f->f_mask & FSINFO_FILES_INFO)
+ printf("files : n=%llu fr=%llu av=%llu\n",
+ (unsigned long long)f->f_files,
+ (unsigned long long)f->f_ffree,
+ (unsigned long long)f->f_favail);
+
+ if (f->f_mask & FSINFO_BSIZE)
+ printf("bsize : %u\n", f->f_bsize);
+
+ if (f->f_mask & FSINFO_FRSIZE)
+ printf("frsize: %u\n", f->f_frsize);
+
+ if (f->f_mask & FSINFO_FSID)
+ printf("fsid : %llx\n", (unsigned long long)f->f_fsid);
+
+ if (f->f_mask & FSINFO_VOLUME_ID) {
+ int printable = 1, loop;
+ printf("volid : ");
+ for (loop = 0; loop < sizeof(f->f_volume_id); loop++)
+ if (!isprint(f->f_volume_id[loop]))
+ printable = 0;
+ if (printable) {
+ printf("'%.*s'", 16, f->f_volume_id);
+ } else {
+ for (loop = 0; loop < sizeof(f->f_volume_id); loop++) {
+ if (loop % 4 == 0 && loop != 0)
+ printf(" ");
+ printf("%02x", f->f_volume_id[loop]);
+ }
+ }
+ printf("\n");
+ }
+
+ if (f->f_mask & FSINFO_VOLUME_UUID)
+ printf("uuid : "
+ "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
+ "-%02x%02x%02x%02x%02x%02x\n",
+ f->f_volume_uuid[ 0], f->f_volume_uuid[ 1],
+ f->f_volume_uuid[ 2], f->f_volume_uuid[ 3],
+ f->f_volume_uuid[ 4], f->f_volume_uuid[ 5],
+ f->f_volume_uuid[ 6], f->f_volume_uuid[ 7],
+ f->f_volume_uuid[ 8], f->f_volume_uuid[ 9],
+ f->f_volume_uuid[10], f->f_volume_uuid[11],
+ f->f_volume_uuid[12], f->f_volume_uuid[13],
+ f->f_volume_uuid[14], f->f_volume_uuid[15]);
+ if (f->f_mask & FSINFO_VOLUME_NAME)
+ printf("volume: '%s'\n", f->f_volume_name);
+ if (f->f_mask & FSINFO_DOMAIN_NAME)
+ printf("domain: '%s'\n", f->f_domain_name);
+}
+
+static void dump_hex(unsigned long long *data, int from, int to)
+{
+ unsigned offset, print_offset = 1, col = 0;
+
+ from /= 8;
+ to = (to + 7) / 8;
+
+ for (offset = from; offset < to; offset++) {
+ if (print_offset) {
+ printf("%04x: ", offset * 8);
+ print_offset = 0;
+ }
+ printf("%016llx", data[offset]);
+ col++;
+ if ((col & 3) == 0) {
+ printf("\n");
+ print_offset = 1;
+ } else {
+ printf(" ");
+ }
+ }
+
+ if (!print_offset)
+ printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+ struct fsinfo f;
+ int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
+
+ for (argv++; *argv; argv++) {
+ if (strcmp(*argv, "-F") == 0) {
+ atflag |= AT_FORCE_ATTR_SYNC;
+ continue;
+ }
+ if (strcmp(*argv, "-L") == 0) {
+ atflag &= ~AT_SYMLINK_NOFOLLOW;
+ continue;
+ }
+ if (strcmp(*argv, "-A") == 0) {
+ atflag |= AT_NO_AUTOMOUNT;
+ continue;
+ }
+ if (strcmp(*argv, "-R") == 0) {
+ raw = 1;
+ continue;
+ }
+
+ memset(&f, 0xbd, sizeof(f));
+ ret = fsinfo(AT_FDCWD, *argv, atflag, 0, &f);
+ printf("fsinfo(%s) = %d\n", *argv, ret);
+ if (ret < 0) {
+ perror(*argv);
+ exit(1);
+ }
+
+ if (raw)
+ dump_hex((unsigned long long *)&f, 0, sizeof(f));
+
+ dump_fsinfo(&f);
+ }
+ return 0;
+}
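
For reference, the buffer-size handling in sys_fsinfo() above can be
modelled in plain userspace C. This is only a sketch of the size-table
encoding and of the "truncate the copy but report the full size" rule as
this patch describes them. plan_buffer(), copy_truncated() and the two
SIZE_* names are made up for illustration and do not appear in the patch,
which uses the bare constants 0x80 and 0x40.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative names for the two flag bits in a size-table byte. */
#define SIZE_VARIABLE 0x80 /* string or struct array: size not known up front */
#define SIZE_MULTI    0x40 /* attribute may have several instances (Nth > 0) */

/*
 * Model of the size decision made before the filesystem is called.
 * Returns the kernel-side buffer size, or a negative errno.
 * *size_only is set to 1 when the value has a fixed size and the caller
 * supplied no buffer, so the size can be returned straight away.
 */
static long plan_buffer(uint8_t code, uint32_t nth, size_t buf_size,
			int *size_only)
{
	size_t size = code;

	*size_only = 0;
	if (!(size & SIZE_MULTI) && nth != 0)
		return -ENODATA;	/* single-instance attribute, Nth != 0 */
	size &= ~(size_t)SIZE_MULTI;
	if (size == 0)
		return -ENOBUFS;	/* no table entry for this attribute */
	if (size & SIZE_VARIABLE)
		return 4096;		/* variable-length value: one page */
	size *= sizeof(uint32_t);	/* fixed structs are counted in 32-bit words */
	if (buf_size == 0)
		*size_only = 1;
	return (long)size;
}

/* Copy at most dst_size bytes, but report the full length of the value. */
static size_t copy_truncated(void *dst, size_t dst_size,
			     const void *src, size_t src_len)
{
	size_t n = dst_size < src_len ? dst_size : src_len;

	if (n)
		memcpy(dst, src, n);
	return src_len;
}
```

A caller that sees a return value larger than the buffer it passed knows
the value was truncated and can retry with a bigger buffer.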


2018-05-25 02:48:41

by David Howells

Subject: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

Make it possible to clone a mount tree with a new pair of open flags that
are used in conjunction with O_PATH:

(1) O_CLONE_MOUNT - Clone the mount or mount tree at the path.

(2) O_NON_RECURSIVE - Don't clone recursively.

Note that it's not a good idea to reuse other flags (such as O_CREAT)
because the open routine for O_PATH does not give an error if any other
flags are used in conjunction with O_PATH, but rather just masks off any it
doesn't use.

The resultant file struct is marked FMODE_NEED_UNMOUNT as it pins an
extra reference for the mount. This will be cleared by the upcoming
move_mount() syscall when it successfully moves a cloned mount into the
filesystem tree.

Note that care needs to be taken with the error handling in do_o_path() in
the case that vfs_open() fails as the path may or may not have been
attached to the file struct and FMODE_NEED_UNMOUNT may or may not be set.
Note that O_DIRECT | O_PATH could be a problem with error handling too.

Signed-off-by: David Howells <[email protected]>
---
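
The flag handling described in the changelog can be sketched as a small
userspace model of the build_open_flags() change in this patch. It is an
illustration only: filter_open_flags() is a made-up name, the two new flag
values are the ones this patch proposes for asm-generic (they are not in
released headers), and the O_PATH fallback value is an assumption taken
from the asm-generic definition.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Fallback in case the libc headers did not expose O_PATH (assumed value). */
#ifndef O_PATH
#define O_PATH 010000000
#endif

/* Values proposed by this patch; not part of any released uapi header. */
#ifndef O_CLONE_MOUNT
#define O_CLONE_MOUNT 040000000
#endif
#ifndef O_NON_RECURSIVE
#define O_NON_RECURSIVE 0100000000
#endif

/*
 * With O_PATH, unknown flags are silently masked off rather than rejected,
 * which is why the clone request needs flag bits of its own.  Without
 * O_PATH, the two new flags are rejected outright.
 * Returns 0 and stores the effective flags in *out, or returns -EINVAL.
 */
static int filter_open_flags(int flags, int *out)
{
	if (flags & O_PATH) {
		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH |
			 O_CLONE_MOUNT | O_NON_RECURSIVE;
	} else if (flags & (O_CLONE_MOUNT | O_NON_RECURSIVE)) {
		return -EINVAL;
	}
	*out = flags;
	return 0;
}
```

For example, O_PATH | O_CLONE_MOUNT | O_CREAT is accepted with O_CREAT
dropped, whereas O_RDONLY | O_CLONE_MOUNT fails with -EINVAL.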

fs/fcntl.c | 2 +-
fs/internal.h | 1 +
fs/namei.c | 26 ++++++++++++++++++----
fs/namespace.c | 44 ++++++++++++++++++++++++++++++++++++++
fs/open.c | 7 +++++-
include/linux/fcntl.h | 3 ++-
include/uapi/asm-generic/fcntl.h | 8 +++++++
7 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 60bc5bf2f4cf..42a53cf03737 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1028,7 +1028,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(19 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)));

fasync_cache = kmem_cache_create("fasync_cache",
diff --git a/fs/internal.h b/fs/internal.h
index c29552e0522f..e3460a2e6b59 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -75,6 +75,7 @@ extern struct vfsmount *lookup_mnt(const struct path *);
extern int finish_automount(struct vfsmount *, struct path *);

extern int sb_prepare_remount_readonly(struct super_block *);
+extern int copy_mount_for_o_path(struct path *, struct path *, bool);

extern void __init mnt_init(void);

diff --git a/fs/namei.c b/fs/namei.c
index 5cbd980b4031..acb8e27d4288 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3458,13 +3458,29 @@ static int do_tmpfile(struct nameidata *nd, unsigned flags,

static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
{
- struct path path;
- int error = path_lookupat(nd, flags, &path);
- if (!error) {
- audit_inode(nd->name, path.dentry, 0);
- error = vfs_open(&path, file, current_cred());
+ struct path path, tmp;
+ int error;
+
+ error = path_lookupat(nd, flags, &path);
+ if (error)
+ return error;
+
+ if (file->f_flags & O_CLONE_MOUNT) {
+ error = copy_mount_for_o_path(
+ &path, &tmp, !(file->f_flags & O_NON_RECURSIVE));
path_put(&path);
+ if (error < 0)
+ return error;
+ path = tmp;
}
+
+ audit_inode(nd->name, path.dentry, 0);
+ error = vfs_open(&path, file, current_cred());
+ if (error < 0 &&
+ (file->f_flags & O_CLONE_MOUNT) &&
+ !(file->f_mode & FMODE_NEED_UNMOUNT))
+ __detach_mounts(path.dentry);
+ path_put(&path);
return error;
}

diff --git a/fs/namespace.c b/fs/namespace.c
index dba680aa1ea4..e73cfcdfb3d1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2218,6 +2218,50 @@ static int do_loopback(struct path *path, const char *old_name,
return err;
}

+/*
+ * Copy the mount or mount subtree at the specified path for
+ * open(O_PATH|O_CLONE_MOUNT).
+ */
+int copy_mount_for_o_path(struct path *from, struct path *to, bool recurse)
+{
+ struct mountpoint *mp;
+ struct mount *mnt = NULL, *f = real_mount(from->mnt);
+ int ret;
+
+ mp = lock_mount(from);
+ if (IS_ERR(mp))
+ return PTR_ERR(mp);
+
+ ret = -EINVAL;
+ if (IS_MNT_UNBINDABLE(f))
+ goto out_unlock;
+
+ if (!check_mnt(f) && from->dentry->d_op != &ns_dentry_operations)
+ goto out_unlock;
+
+ if (!recurse && has_locked_children(f, from->dentry))
+ goto out_unlock;
+
+ if (recurse)
+ mnt = copy_tree(f, from->dentry, CL_COPY_MNT_NS_FILE);
+ else
+ mnt = clone_mnt(f, from->dentry, 0);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto out_unlock;
+ }
+
+ mnt->mnt.mnt_flags &= ~MNT_LOCKED;
+
+ to->mnt = &mnt->mnt;
+ to->dentry = dget(from->dentry);
+ ret = 0;
+
+out_unlock:
+ unlock_mount(mp);
+ return ret;
+}
+
static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
{
int error = 0;
diff --git a/fs/open.c b/fs/open.c
index 79a8a1bd740d..27ce9c60345a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -748,6 +748,8 @@ static int do_dentry_open(struct file *f,

if (unlikely(f->f_flags & O_PATH)) {
f->f_mode |= FMODE_PATH;
+ if (f->f_flags & O_CLONE_MOUNT)
+ f->f_mode |= FMODE_NEED_UNMOUNT;
f->f_op = &empty_fops;
goto done;
}
@@ -977,8 +979,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
* If we have O_PATH in the open flag. Then we
* cannot have anything other than the below set of flags
*/
- flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+ flags &= (O_DIRECTORY | O_NOFOLLOW | O_PATH |
+ O_CLONE_MOUNT | O_NON_RECURSIVE);
acc_mode = 0;
+ } else if (flags & (O_CLONE_MOUNT | O_NON_RECURSIVE)) {
+ return -EINVAL;
}

op->open_flag = flags;
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 27dc7a60693e..8f60e2244740 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,8 @@
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
- O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | \
+ O_CLONE_MOUNT | O_NON_RECURSIVE)

#ifndef force_o_largefile
#define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 0b1c7e35090c..f533e35ea19b 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,14 @@
#define __O_TMPFILE 020000000
#endif

+#ifndef O_CLONE_MOUNT
+#define O_CLONE_MOUNT 040000000 /* Used with O_PATH to clone the mount subtree at path */
+#endif
+
+#ifndef O_NON_RECURSIVE
+#define O_NON_RECURSIVE 0100000000 /* Used with O_CLONE_MOUNT to only clone one mount */
+#endif
+
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)


2018-05-25 02:48:43

by David Howells

Subject: [PATCH 27/32] vfs: Use a 'struct fd_cookie *' type for light fd handling [ver #8]

Use a 'struct fd_cookie *' type for light fd handling rather than an
unsigned long so that confusion doesn't arise with integer fd numbers.

I have a use case where I want to store this in struct nameidata but, to
save space, don't want to expand it to a struct fd.

Signed-off-by: David Howells <[email protected]>
---
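
The idea behind the cookie type can be sketched in userspace as follows.
This is a minimal model, not the kernel code: struct file_stub stands in
for struct file, and cookie_make(), cookie_file() and cookie_flags() are
hypothetical names (the patch itself adds __fdfile(), __fdflags() and
__fdput()). The point is that a pointer to an undefined struct cannot be
mixed up with an integer fd without a cast, while the two low bits still
carry the FDPUT_* flags.

```c
#include <stdint.h>

struct file_stub { long refcount; };	/* stand-in for struct file */
struct fd_cookie;			/* deliberately never defined */

#define FDPUT_FPUT		1UL
#define FDPUT_POS_UNLOCK	2UL
#define FDPUT__MASK		3UL

/* Pack a suitably aligned file pointer and up to two flag bits. */
static struct fd_cookie *cookie_make(struct file_stub *f, unsigned long flags)
{
	return (struct fd_cookie *)((uintptr_t)f | (flags & FDPUT__MASK));
}

/* Recover the flag bits stored in the low two bits of the cookie. */
static unsigned long cookie_flags(struct fd_cookie *c)
{
	return (unsigned long)((uintptr_t)c & FDPUT__MASK);
}

/* Recover the file pointer by masking the flag bits off again. */
static struct file_stub *cookie_file(struct fd_cookie *c)
{
	return (struct file_stub *)((uintptr_t)c & ~(uintptr_t)FDPUT__MASK);
}
```

This relies on the pointed-to object being at least 4-byte aligned, so that
the low two bits of a valid pointer are always zero.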

fs/file.c | 20 +++++++++++---------
include/linux/file.h | 31 ++++++++++++++++++++++++-------
2 files changed, 35 insertions(+), 16 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 7ffd6e9d103d..8b0012ddadad 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -727,7 +727,7 @@ EXPORT_SYMBOL(fget_raw);
* The fput_needed flag returned by fget_light should be passed to the
* corresponding fput_light.
*/
-static unsigned long __fget_light(unsigned int fd, fmode_t mask)
+static struct fd_cookie *__fget_light(unsigned int fd, fmode_t mask)
{
struct files_struct *files = current->files;
struct file *file;
@@ -736,33 +736,35 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
file = __fcheck_files(files, fd);
if (!file || unlikely(file->f_mode & mask))
return 0;
- return (unsigned long)file;
+ return (struct fd_cookie *)file;
} else {
file = __fget(fd, mask);
if (!file)
return 0;
- return FDPUT_FPUT | (unsigned long)file;
+ return (struct fd_cookie *)(FDPUT_FPUT | (unsigned long)file);
}
}
-unsigned long __fdget(unsigned int fd)
+
+struct fd_cookie *__fdget(unsigned int fd)
{
return __fget_light(fd, FMODE_PATH);
}
EXPORT_SYMBOL(__fdget);

-unsigned long __fdget_raw(unsigned int fd)
+struct fd_cookie *__fdget_raw(unsigned int fd)
{
return __fget_light(fd, 0);
}

-unsigned long __fdget_pos(unsigned int fd)
+struct fd_cookie *__fdget_pos(unsigned int fd)
{
- unsigned long v = __fdget(fd);
- struct file *file = (struct file *)(v & ~3);
+ struct fd_cookie *v = __fdget(fd);
+ struct file *file = __fdfile(v);

if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
if (file_count(file) > 1) {
- v |= FDPUT_POS_UNLOCK;
+ v = (struct fd_cookie *)
+ ((unsigned long)v | FDPUT_POS_UNLOCK);
mutex_lock(&file->f_pos_lock);
}
}
diff --git a/include/linux/file.h b/include/linux/file.h
index 279720db984a..3fce1c92b576 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -11,6 +11,7 @@
#include <linux/posix_types.h>

struct file;
+struct fd_cookie; /* Deliberately undefined structure */

extern void fput(struct file *);

@@ -31,8 +32,24 @@ struct fd {
struct file *file;
unsigned int flags;
};
-#define FDPUT_FPUT 1
-#define FDPUT_POS_UNLOCK 2
+#define FDPUT_FPUT 1
+#define FDPUT_POS_UNLOCK 2
+#define FDPUT__MASK 3
+
+static inline unsigned long __fdflags(struct fd_cookie *f)
+{
+ return (unsigned long)f & FDPUT__MASK;
+}
+
+static inline struct file *__fdfile(struct fd_cookie *f)
+{
+ return (struct file *)((unsigned long)f & ~FDPUT__MASK);
+}
+
+static inline void __fdput(struct fd_cookie *f)
+{
+ fput_light(__fdfile(f), __fdflags(f) & FDPUT_FPUT);
+}

static inline void fdput(struct fd fd)
{
@@ -42,14 +59,14 @@ static inline void fdput(struct fd fd)

extern struct file *fget(unsigned int fd);
extern struct file *fget_raw(unsigned int fd);
-extern unsigned long __fdget(unsigned int fd);
-extern unsigned long __fdget_raw(unsigned int fd);
-extern unsigned long __fdget_pos(unsigned int fd);
+extern struct fd_cookie *__fdget(unsigned int fd);
+extern struct fd_cookie *__fdget_raw(unsigned int fd);
+extern struct fd_cookie *__fdget_pos(unsigned int fd);
extern void __f_unlock_pos(struct file *);

-static inline struct fd __to_fd(unsigned long v)
+static inline struct fd __to_fd(struct fd_cookie *v)
{
- return (struct fd){(struct file *)(v & ~3),v & 3};
+ return (struct fd){__fdfile(v), __fdflags(v)};
}

static inline struct fd fdget(unsigned int fd)


2018-05-25 02:48:51

by David Howells

Subject: [PATCH 24/32] vfs: Add some logging to the core users of the fs_context log [ver #8]

Add some logging to the core users of the fs_context log so that
information can be extracted from them as to the reason for failure.

Signed-off-by: David Howells <[email protected]>
---

fs/super.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 06a665628939..447476d4371c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1743,8 +1743,10 @@ int vfs_get_tree(struct fs_context *fc)
struct super_block *sb;
int ret;

- if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source)
+ if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source) {
+ errorf(fc, "Filesystem requires source device");
return -ENOENT;
+ }

if (fc->root)
return -EBUSY;


2018-05-25 02:48:55

by David Howells

Subject: [PATCH 19/32] VFS: Implement fsopen() to prepare for a mount [ver #8]

Provide an fsopen() system call that starts the process of preparing to
mount, using an fd as a context handle. fsopen() is given the name of the
filesystem that will be used:

int mfd = fsopen(const char *fsname, int open_flags,
void *reserved3, void *reserved4,
void *reserved5);

where open_flags can be 0 or O_CLOEXEC and reserved* should all be NULL for
the moment.

For example:

mfd = fsopen("ext4", O_CLOEXEC, NULL, NULL, NULL);
write(mfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
write(mfd, "o noatime");
write(mfd, "o acl");
write(mfd, "o user_attr");
write(mfd, "o iversion");
write(mfd, "o ");
write(mfd, "r /my/container"); // root inside the fs
write(mfd, "x create"); // create the superblock
fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW);

mfd = fsopen("afs", 0, NULL, NULL, NULL);
write(mfd, "s %grand.central.org:root.cell");
write(mfd, "o cell=grand.central.org");
write(mfd, "r /");
write(mfd, "x create");
fsmount(mfd, AT_FDCWD, "/mnt", 0);
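
The write() calls above leave out the length argument. A userspace helper
along the following lines could fill it in. This is only a sketch: fs_cmd() is
a hypothetical name, and the 3..4095 byte limit is taken from the check in
fscontext_fs_write() in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one "<type> <argument>" command in a single
 * write(), supplying the length that the examples leave out.
 * Returns 0 on success or -1 with errno set.
 */
static int fs_cmd(int fd, char type, const char *arg)
{
	char buf[4096];
	ssize_t written;
	int n;

	assert(arg != NULL);
	n = snprintf(buf, sizeof(buf), "%c %s", type, arg);
	/* The patch rejects writes shorter than 3 or longer than 4095 bytes. */
	if (n < 3 || n > 4095) {
		errno = EINVAL;
		return -1;
	}
	written = write(fd, buf, (size_t)n);
	if (written < 0)
		return -1;
	if (written != n) {
		errno = EIO;	/* a short write would split the command */
		return -1;
	}
	return 0;
}
```

With such a helper the first example would read fs_cmd(mfd, 's', "/dev/sdb1"),
fs_cmd(mfd, 'o', "noatime") and so on, each command going out as exactly one
write().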

If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:

"e <subsys>:<problem>"
"e SELinux:Mount on mountpoint not permitted"

Once fsmount() has been called, further write() calls will incur EBUSY,
even if the fsmount() fails. read() is still possible to retrieve error
information.
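
A message read back in the "e <subsys>:<problem>" form could be split up in
userspace roughly as follows. This is a sketch under the assumption that the
message has already been read into a NUL-terminated buffer; parse_fs_error()
is a hypothetical name, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split a message of the form "e <subsys>:<problem>" in place.
 * On success, *subsys and *problem point into msg and 0 is returned;
 * -1 is returned if the message does not have that shape.
 */
static int parse_fs_error(char *msg, char **subsys, char **problem)
{
	char *colon;

	assert(msg && subsys && problem);
	if (strncmp(msg, "e ", 2) != 0)
		return -1;
	colon = strchr(msg + 2, ':');
	if (!colon || colon == msg + 2)
		return -1;	/* no separator or empty subsystem name */
	*colon = '\0';
	*subsys = msg + 2;
	*problem = colon + 1;
	return 0;
}
```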

The fsopen() syscall creates a mount context and hangs it off the fd that it
returns.
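
The EBUSY behaviour described above comes from the phase recorded in the
context. A rough userspace model of the creation path is sketched below. It is
an assumption-laden simplification: the reconfiguration phases are left out,
model_write() and struct ctx_model are hypothetical names, and -EINVAL merely
stands in for whatever error superblock creation would return.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum phase {		/* mirrors a subset of the FS_CONTEXT_* phases */
	CREATE_PARAMS,
	AWAITING_MOUNT,
	FAILED,
};

struct ctx_model {
	enum phase phase;
};

/*
 * Model of the phase check applied to one written command.
 * get_tree_ok stands in for the result of creating the superblock.
 * Returns 0 if the command is accepted or a negative errno value.
 */
static int model_write(struct ctx_model *c, char type, const char *arg,
		       int get_tree_ok)
{
	assert(c && arg);
	switch (type) {
	case 's':
	case 'o':
		/* Parameters are only accepted before creation. */
		return c->phase == CREATE_PARAMS ? 0 : -EBUSY;
	case 'x':
		if (strcmp(arg, "create") != 0)
			return -EOPNOTSUPP;
		if (c->phase != CREATE_PARAMS)
			return -EBUSY;
		c->phase = get_tree_ok ? AWAITING_MOUNT : FAILED;
		return get_tree_ok ? 0 : -EINVAL;	/* placeholder error */
	default:
		return -EINVAL;
	}
}
```

In this model, once "x create" has succeeded or failed, every later parameter
or create command is refused with -EBUSY.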

Netlink is not used because it is optional.

Note that, for the moment, the caller must have CAP_SYS_ADMIN to use
fsopen().

Signed-off-by: David Howells <[email protected]>
---

arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/Makefile | 2
fs/fsopen.c | 352 ++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 2
include/uapi/linux/magic.h | 1
kernel/sys_ni.c | 3
7 files changed, 361 insertions(+), 1 deletion(-)
create mode 100644 fs/fsopen.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 14a2f996e543..0e084cc11638 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -397,3 +397,4 @@
383 i386 statx sys_statx __ia32_sys_statx
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
+386 i386 fsopen sys_fsopen __ia32_sys_fsopen
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index cd36232ab62f..7200d5bb65ca 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -342,6 +342,7 @@
331 common pkey_free __x64_sys_pkey_free
332 common statx __x64_sys_statx
333 common io_pgetevents __x64_sys_io_pgetevents
+334 common fsopen __x64_sys_fsopen

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 6f2dae3c32da..ee3c8b31cc58 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y := open.o read_write.o file_table.o super.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
- fs_context.o
+ fs_context.o fsopen.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fsopen.c b/fs/fsopen.c
new file mode 100644
index 000000000000..26565ddd7c9e
--- /dev/null
+++ b/fs/fsopen.c
@@ -0,0 +1,352 @@
+/* Filesystem access-by-fd.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs_context.h>
+#include <linux/mount.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/syscalls.h>
+#include <linux/security.h>
+#include "mount.h"
+
+static struct vfsmount *fscontext_fs_mnt __read_mostly;
+
+static int fscontext_fs_release(struct inode *inode, struct file *file)
+{
+ struct fs_context *fc = file->private_data;
+
+ file->private_data = NULL;
+
+ put_fs_context(fc);
+ return 0;
+}
+
+/*
+ * Userspace writes configuration data and commands to the fd and we parse it
+ * here. For the moment, we assume a single option or command per write. Each
+ * line written is of the form
+ *
+ * <option_type><space><stuff...>
+ *
+ * s /dev/sda1 -- Source device name
+ * o noatime -- Option without value
+ * o cell=grand.central.org -- Option with value
+ * r / -- Dir within device to mount
+ * x create -- Create a superblock
+ * x reconfigure -- Reconfigure a superblock
+ */
+static ssize_t fscontext_fs_write(struct file *file,
+ const char __user *_buf, size_t len, loff_t *pos)
+{
+ struct fs_context *fc = file->private_data;
+ struct inode *inode = file_inode(file);
+ char opt[2], *data;
+ ssize_t ret;
+
+ if (len < 3 || len > 4095)
+ return -EINVAL;
+
+ if (copy_from_user(opt, _buf, 2) != 0)
+ return -EFAULT;
+ switch (opt[0]) {
+ case 's':
+ case 'o':
+ case 'x':
+ break;
+ default:
+ goto err_bad_cmd;
+ }
+ if (opt[1] != ' ')
+ goto err_bad_cmd;
+
+ data = memdup_user_nul(_buf + 2, len - 2);
+ if (IS_ERR(data))
+ return PTR_ERR(data);
+
+ /* From this point onwards we need to lock the fd against someone
+ * trying to mount it.
+ */
+ ret = inode_lock_killable(inode);
+ if (ret < 0)
+ goto err_free;
+
+ if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+ if (fc->fs_type->init_fs_context) {
+ ret = fc->fs_type->init_fs_context(fc, fc->root);
+ if (ret < 0) {
+ fc->phase = FS_CONTEXT_FAILED;
+ goto err_unlock;
+ }
+ } else {
+ /* Leave legacy context ops in place */
+ }
+
+ /* Do the security check last because ->init_fs_context may
+ * change the namespace subscriptions.
+ */
+ ret = security_fs_context_alloc(fc, fc->root);
+ if (ret < 0) {
+ fc->phase = FS_CONTEXT_FAILED;
+ goto err_unlock;
+ }
+
+ fc->phase = FS_CONTEXT_RECONF_PARAMS;
+ }
+
+ ret = -EINVAL;
+ switch (opt[0]) {
+ case 's':
+ if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+ fc->phase != FS_CONTEXT_RECONF_PARAMS)
+ goto wrong_phase;
+ ret = vfs_set_fs_source(fc, data, len - 2);
+ if (ret < 0)
+ goto err_unlock;
+ data = NULL;
+ break;
+
+ case 'o':
+ if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+ fc->phase != FS_CONTEXT_RECONF_PARAMS)
+ goto wrong_phase;
+ ret = vfs_parse_fs_option(fc, data, len - 2);
+ if (ret < 0)
+ goto err_unlock;
+ break;
+
+ case 'x':
+ if (strcmp(data, "create") == 0) {
+ if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
+ goto wrong_phase;
+ fc->phase = FS_CONTEXT_CREATING;
+ ret = vfs_get_tree(fc);
+ if (ret == 0)
+ fc->phase = FS_CONTEXT_AWAITING_MOUNT;
+ else
+ fc->phase = FS_CONTEXT_FAILED;
+ } else {
+ ret = -EOPNOTSUPP;
+ }
+ if (ret < 0)
+ goto err_unlock;
+ break;
+
+ default:
+ goto err_unlock;
+ }
+
+ ret = len;
+err_unlock:
+ inode_unlock(inode);
+err_free:
+ kfree(data);
+ return ret;
+err_bad_cmd:
+ return -EINVAL;
+wrong_phase:
+ ret = -EBUSY;
+ goto err_unlock;
+}
+
+const struct file_operations fscontext_fs_fops = {
+ .write = fscontext_fs_write,
+ .release = fscontext_fs_release,
+ .llseek = no_llseek,
+};
+
+/*
+ * Indicate the name we want to display the filesystem file as.
+ */
+static char *fscontext_fs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+ return dynamic_dname(dentry, buffer, buflen, "fs:[%lu]",
+ d_inode(dentry)->i_ino);
+}
+
+static const struct dentry_operations fscontext_fs_dentry_operations = {
+ .d_dname = fscontext_fs_dname,
+};
+
+/*
+ * Create a file that can be used to configure a new mount.
+ */
+static struct file *create_fscontext_file(struct fs_context *fc)
+{
+ struct inode *inode;
+ struct file *f;
+ struct path path;
+ int ret;
+
+ inode = alloc_anon_inode(fscontext_fs_mnt->mnt_sb);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+ inode->i_fop = &fscontext_fs_fops;
+
+ fc->phase = FS_CONTEXT_CREATE_PARAMS;
+
+ ret = -ENOMEM;
+ path.dentry = d_alloc_pseudo(fscontext_fs_mnt->mnt_sb, &empty_name);
+ if (!path.dentry)
+ goto err_inode;
+ path.mnt = mntget(fscontext_fs_mnt);
+
+ d_instantiate(path.dentry, inode);
+
+ f = alloc_file(&path, FMODE_READ | FMODE_WRITE, &fscontext_fs_fops);
+ if (IS_ERR(f)) {
+ ret = PTR_ERR(f);
+ goto err_file;
+ }
+
+ f->private_data = fc;
+ return f;
+
+err_file:
+ path_put(&path);
+ return ERR_PTR(ret);
+
+err_inode:
+ iput(inode);
+ return ERR_PTR(ret);
+}
+
+static const struct super_operations fscontext_fs_super_ops = {
+ .drop_inode = generic_delete_inode,
+ .destroy_inode = free_inode_nonrcu,
+ .statfs = simple_statfs,
+};
+
+/*
+ * Finish filling in the superblock and allocate the root dentry.
+ */
+static int fscontext_fs_fill_super(struct super_block *sb,
+ struct fs_context *fc)
+{
+ struct dentry *root;
+ struct inode *inode;
+
+ sb->s_op = &fscontext_fs_super_ops;
+ inode = alloc_anon_inode(sb);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+ inode->i_fop = &fscontext_fs_fops;
+
+ root = d_make_root(inode);
+ if (!root)
+ return -ENOMEM; /* inode is put by d_make_root() */
+ sb->s_root = root;
+ return 0;
+}
+
+static int fscontext_fs_get_tree(struct fs_context *fc)
+{
+ return vfs_get_super(fc, vfs_get_single_super, fscontext_fs_fill_super);
+}
+
+static const struct fs_context_operations fscontext_fs_context_ops = {
+ .get_tree = fscontext_fs_get_tree,
+};
+
+static int fs_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+ fc->ops = &fscontext_fs_context_ops;
+ return 0;
+}
+
+static struct file_system_type fscontext_fs_type = {
+ .name = "fscontext",
+ .init_fs_context = fs_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static int __init init_fscontext_fs(void)
+{
+ int ret;
+
+ ret = register_filesystem(&fscontext_fs_type);
+ if (ret < 0)
+ panic("Cannot register fscontext_fs\n");
+
+ fscontext_fs_mnt = kern_mount(&fscontext_fs_type);
+ if (IS_ERR(fscontext_fs_mnt))
+ panic("Cannot mount fscontext_fs: %ld\n",
+ PTR_ERR(fscontext_fs_mnt));
+ return 0;
+}
+
+fs_initcall(init_fscontext_fs);
+
+/*
+ * Open a filesystem by name so that it can be configured for mounting.
+ *
+ * We are allowed to specify a container in which the filesystem will be
+ * opened, thereby indicating which namespaces will be used (notably, which
+ * network namespace will be used for network filesystems).
+ */
+SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,
+ void *, reserved3, void *, reserved4, void *, reserved5)
+{
+ struct file_system_type *fs_type;
+ struct fs_context *fc;
+ struct file *file;
+ const char *fs_name;
+ int fd, ret;
+
+ if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (flags & ~O_CLOEXEC || reserved3 || reserved4 || reserved5)
+ return -EINVAL;
+
+ fs_name = strndup_user(_fs_name, PAGE_SIZE);
+ if (IS_ERR(fs_name))
+ return PTR_ERR(fs_name);
+
+ fs_type = get_fs_type(fs_name);
+ kfree(fs_name);
+ if (!fs_type)
+ return -ENODEV;
+
+ fc = vfs_new_fs_context(fs_type, NULL, 0, FS_CONTEXT_FOR_USER_MOUNT);
+ put_filesystem(fs_type);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ fc->phase = FS_CONTEXT_CREATE_PARAMS;
+
+ ret = -EOPNOTSUPP;
+ if (!fc->ops)
+ goto err_fc;
+
+ file = create_fscontext_file(fc);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto err_fc;
+ }
+
+ ret = get_unused_fd_flags(flags & O_CLOEXEC);
+ if (ret < 0)
+ goto err_file;
+
+ fd = ret;
+ fd_install(fd, file);
+ return fd;
+
+err_file:
+ fput(file);
+ return ret;
+
+err_fc:
+ put_fs_context(fc);
+ return ret;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 811172fcb916..e0f19406af92 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -896,6 +896,8 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
asmlinkage long sys_pkey_free(int pkey);
asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
+asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags,
+ void *reserved3, void *reserved4, void *reserved5);


/*
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 1a6fee974116..2fe02277fb32 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -89,5 +89,6 @@
#define UDF_SUPER_MAGIC 0x15013346
#define BALLOON_KVM_MAGIC 0x13661366
#define ZSMALLOC_MAGIC 0x58295829
+#define FSCONTEXT_FS_MAGIC 0x66736673

#endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 183169c2a75b..6bb0e1bb3eae 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -432,3 +432,6 @@ COND_SYSCALL(setresgid16);
COND_SYSCALL(setresuid16);
COND_SYSCALL(setreuid16);
COND_SYSCALL(setuid16);
+
+/* fd-based mount */
+COND_SYSCALL(fsopen);


2018-05-25 02:48:59

by David Howells

Subject: [PATCH 04/32] VFS: Add LSM hooks for the new mount API [ver #8]

Add LSM hooks for use by the new mount API and filesystem context code.
This includes:

(1) Hooks to handle allocation, duplication and freeing of the security
record attached to a filesystem context.

(2) A hook to snoop source specifications. There may be multiple of these
if the filesystem supports it. They will be local files/devices if
fs_context::source_is_dev is true and will be something else, possibly
remote server specifications, if false.

(3) A hook to snoop superblock configuration options in key[=val] form.
If the LSM decides it wants to handle it, it can suppress the option
being passed to the filesystem. Note that 'val' may include commas
and binary data with the fsopen patch.

(4) A hook to perform validation and allocation after the configuration
has been done but before the superblock is allocated and set up.

(5) A hook to transfer the security from the context to a newly created
superblock.

(6) A hook to rule on whether a path point can be used as a mountpoint.
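
The option hook in (3) uses a three-way return convention, documented in the
hook comments of this patch: a negative error rejects the option, 1 means the
LSM consumed it, and 0 passes it on to the filesystem. A small userspace
sketch of that dispatch follows. All demo_* names and the "demolabel=" option
are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical LSM hook: <0 rejects, 1 consumes the option, 0 passes it on. */
static int demo_lsm_parse_option(const char *opt)
{
	if (strncmp(opt, "demolabel=", 10) == 0)
		return opt[10] ? 1 : -EINVAL;	/* ours; empty value is invalid */
	return 0;				/* not ours */
}

/* Hypothetical filesystem parser: counts the options it receives. */
static int demo_fs_parse_option(const char *opt, int *seen)
{
	(void)opt;
	(*seen)++;
	return 0;
}

/* Offer the option to the LSM first; forward it only if not consumed. */
static int demo_parse_option(const char *opt, int *fs_seen)
{
	int ret;

	assert(opt && fs_seen);
	ret = demo_lsm_parse_option(opt);
	if (ret < 0)
		return ret;	/* rejected by the LSM */
	if (ret == 1)
		return 0;	/* handled by the LSM, hidden from the fs */
	return demo_fs_parse_option(opt, fs_seen);
}
```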

These are intended to replace:

security_sb_copy_data
security_sb_kern_mount
security_sb_mount
security_sb_set_mnt_opts
security_sb_clone_mnt_opts
security_sb_parse_opts_str

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---

include/linux/lsm_hooks.h | 71 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/security.h | 49 +++++++++++++++++++++++++++++++
security/security.c | 46 +++++++++++++++++++++++++++++
3 files changed, 166 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 9d0b286f3dba..25e5f760a590 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -76,6 +76,57 @@
* changes on the process such as clearing out non-inheritable signal
* state. This is called immediately after commit_creds().
*
+ * Security hooks for mount using fs_context.
+ * [See also Documentation/filesystems/mounting.txt]
+ *
+ * @fs_context_alloc:
+ * Allocate and attach a security structure to fc->security. This pointer
+ * is initialised to NULL by the caller.
+ * @fc indicates the new filesystem context.
+ * @reference indicates the source dentry of a submount or start of reconfig.
+ * @fs_context_dup:
+ * Allocate and attach a security structure to fc->security. This pointer
+ * is initialised to NULL by the caller.
+ * @fc indicates the new filesystem context.
+ * @src_fc indicates the original filesystem context.
+ * @fs_context_free:
+ * Clean up a filesystem context.
+ * @fc indicates the filesystem context.
+ * @fs_context_parse_source:
+ * Check a source for the superblock (multiple sources may be provided).
+ * The LSM may reject it with an error; otherwise it should return 0.
+ * @fc indicates the filesystem context.
+ * @src indicates the source name. It is NUL-terminated.
+ * @fc->source_is_dev is true if the source should be a local file or dev.
+ * @fs_context_parse_option:
+ * Userspace provided an option to configure a superblock. The LSM may
+ * reject it with an error and may use it for itself, in which case it
+ * should return 1; otherwise it should return 0 to pass it on to the
+ * filesystem.
+ * @fc indicates the filesystem context.
+ * @opt indicates the option in "key[=val]" form. It is NUL-terminated,
+ * but val may be binary data.
+ * @len indicates the size of the option.
+ * @fs_context_validate:
+ * Validate the filesystem context preparatory to applying it. This is
+ * done after all the options have been parsed.
+ * @fc indicates the filesystem context.
+ * @sb_get_tree:
+ * Assign the security to a newly created superblock.
+ * @fc indicates the filesystem context.
+ * @fc->root indicates the root that will be mounted.
+ * @fc->root->d_sb points to the superblock.
+ * @sb_reconfigure:
+ * Apply reconfiguration to the security on a superblock.
+ * @fc indicates the filesystem context.
+ * @fc->root indicates a dentry in the mount.
+ * @fc->root->d_sb points to the superblock.
+ * @sb_mountpoint:
+ * Equivalent of sb_mount, but with an fs_context.
+ * @fc indicates the filesystem context.
+ * @mountpoint indicates the path on which the mount will take place.
+ * @mnt_flags indicates the MNT_* flags specified.
+ *
* Security hooks for filesystem operations.
*
* @sb_alloc_security:
@@ -1450,6 +1501,17 @@ union security_list_options {
void (*bprm_committing_creds)(struct linux_binprm *bprm);
void (*bprm_committed_creds)(struct linux_binprm *bprm);

+ int (*fs_context_alloc)(struct fs_context *fc, struct dentry *reference);
+ int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_sc);
+ void (*fs_context_free)(struct fs_context *fc);
+ int (*fs_context_parse_source)(struct fs_context *fc, char *src);
+ int (*fs_context_parse_option)(struct fs_context *fc, char *opt, size_t len);
+ int (*fs_context_validate)(struct fs_context *fc);
+ int (*sb_get_tree)(struct fs_context *fc);
+ void (*sb_reconfigure)(struct fs_context *fc);
+ int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags);
+
int (*sb_alloc_security)(struct super_block *sb);
void (*sb_free_security)(struct super_block *sb);
int (*sb_copy_data)(char *orig, char *copy);
@@ -1787,6 +1849,15 @@ struct security_hook_heads {
struct hlist_head bprm_check_security;
struct hlist_head bprm_committing_creds;
struct hlist_head bprm_committed_creds;
+ struct hlist_head fs_context_alloc;
+ struct hlist_head fs_context_dup;
+ struct hlist_head fs_context_free;
+ struct hlist_head fs_context_parse_source;
+ struct hlist_head fs_context_parse_option;
+ struct hlist_head fs_context_validate;
+ struct hlist_head sb_get_tree;
+ struct hlist_head sb_reconfigure;
+ struct hlist_head sb_mountpoint;
struct hlist_head sb_alloc_security;
struct hlist_head sb_free_security;
struct hlist_head sb_copy_data;
diff --git a/include/linux/security.h b/include/linux/security.h
index 200920f521a1..857dc7574b4a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -53,6 +53,7 @@ struct msg_msg;
struct xattr;
struct xfrm_sec_ctx;
struct mm_struct;
+struct fs_context;

/* If capable should audit the security request */
#define SECURITY_CAP_NOAUDIT 0
@@ -231,6 +232,16 @@ int security_bprm_set_creds(struct linux_binprm *bprm);
int security_bprm_check(struct linux_binprm *bprm);
void security_bprm_committing_creds(struct linux_binprm *bprm);
void security_bprm_committed_creds(struct linux_binprm *bprm);
+int security_fs_context_alloc(struct fs_context *fc, struct dentry *reference);
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc);
+void security_fs_context_free(struct fs_context *fc);
+int security_fs_context_parse_source(struct fs_context *fc, char *src);
+int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len);
+int security_fs_context_validate(struct fs_context *fc);
+int security_sb_get_tree(struct fs_context *fc);
+void security_sb_reconfigure(struct fs_context *fc);
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags);
int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
int security_sb_copy_data(char *orig, char *copy);
@@ -539,6 +550,44 @@ static inline void security_bprm_committed_creds(struct linux_binprm *bprm)
{
}

+static inline int security_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference)
+{
+ return 0;
+}
+static inline int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ return 0;
+}
+static inline void security_fs_context_free(struct fs_context *fc)
+{
+}
+static inline int security_fs_context_parse_source(struct fs_context *fc, char *src)
+{
+ return 0;
+}
+static inline int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ return 0;
+}
+static inline int security_fs_context_validate(struct fs_context *fc)
+{
+ return 0;
+}
+static inline int security_sb_get_tree(struct fs_context *fc)
+{
+ return 0;
+}
+static inline void security_sb_reconfigure(struct fs_context *fc)
+{
+}
+static inline int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ return 0;
+}
+
static inline int security_sb_alloc(struct super_block *sb)
{
return 0;
diff --git a/security/security.c b/security/security.c
index 7bc2fde023a7..0aca5a03c070 100644
--- a/security/security.c
+++ b/security/security.c
@@ -358,6 +358,52 @@ void security_bprm_committed_creds(struct linux_binprm *bprm)
call_void_hook(bprm_committed_creds, bprm);
}

+int security_fs_context_alloc(struct fs_context *fc, struct dentry *reference)
+{
+ return call_int_hook(fs_context_alloc, 0, fc, reference);
+}
+
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ return call_int_hook(fs_context_dup, 0, fc, src_fc);
+}
+
+void security_fs_context_free(struct fs_context *fc)
+{
+ call_void_hook(fs_context_free, fc);
+}
+
+int security_fs_context_parse_source(struct fs_context *fc, char *src)
+{
+ return call_int_hook(fs_context_parse_source, 0, fc, src);
+}
+
+int security_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ return call_int_hook(fs_context_parse_option, 0, fc, opt, len);
+}
+
+int security_fs_context_validate(struct fs_context *fc)
+{
+ return call_int_hook(fs_context_validate, 0, fc);
+}
+
+int security_sb_get_tree(struct fs_context *fc)
+{
+ return call_int_hook(sb_get_tree, 0, fc);
+}
+
+void security_sb_reconfigure(struct fs_context *fc)
+{
+ call_void_hook(sb_reconfigure, fc);
+}
+
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ return call_int_hook(sb_mountpoint, 0, fc, mountpoint, mnt_flags);
+}
+
int security_sb_alloc(struct super_block *sb)
{
return call_int_hook(sb_alloc_security, 0, sb);


2018-05-25 02:49:06

by David Howells

Subject: [PATCH 06/32] smack: Implement filesystem context security hooks [ver #8]

Implement filesystem context security hooks for the smack LSM.

Question: Should the ->fs_context_parse_source() hook be implemented to
check the labels on any source devices specified?

Signed-off-by: David Howells <[email protected]>
cc: Casey Schaufler <[email protected]>
cc: [email protected]
---

security/smack/smack_lsm.c | 309 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 309 insertions(+)

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 0b414836bebd..3c4dd21d511d 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -42,6 +42,7 @@
#include <linux/shm.h>
#include <linux/binfmts.h>
#include <linux/parser.h>
+#include <linux/fs_context.h>
#include "smack.h"

#define TRANS_TRUE "TRUE"
@@ -521,6 +522,307 @@ static int smack_syslog(int typefrom_file)
return rc;
}

+/*
+ * Mount context operations
+ */
+
+struct smack_fs_context {
+ union {
+ struct {
+ char *fsdefault;
+ char *fsfloor;
+ char *fshat;
+ char *fsroot;
+ char *fstransmute;
+ };
+ char *ptrs[5];
+
+ };
+ struct superblock_smack *sbsp;
+ struct inode_smack *isp;
+ bool transmute;
+};
+
+/**
+ * smack_fs_context_free - Free the security data from a filesystem context
+ * @fc: The filesystem context to be cleaned up.
+ */
+static void smack_fs_context_free(struct fs_context *fc)
+{
+ struct smack_fs_context *ctx = fc->security;
+ int i;
+
+ if (ctx) {
+ for (i = 0; i < ARRAY_SIZE(ctx->ptrs); i++)
+ kfree(ctx->ptrs[i]);
+ kfree(ctx->isp);
+ kfree(ctx->sbsp);
+ kfree(ctx);
+ fc->security = NULL;
+ }
+}
+
+/**
+ * smack_fs_context_alloc - Allocate security data for a filesystem context
+ * @fc: The filesystem context.
+ * @reference: Reference dentry (automount/reconfigure) or NULL
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference)
+{
+ struct smack_fs_context *ctx;
+ struct superblock_smack *sbsp;
+ struct inode_smack *isp;
+ struct smack_known *skp;
+
+ ctx = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+ if (!ctx)
+ goto nomem;
+ fc->security = ctx;
+
+ sbsp = kzalloc(sizeof(struct superblock_smack), GFP_KERNEL);
+ if (!sbsp)
+ goto nomem_free;
+ ctx->sbsp = sbsp;
+
+ isp = new_inode_smack(NULL);
+ if (!isp)
+ goto nomem_free;
+ ctx->isp = isp;
+
+ if (reference) {
+ if (reference->d_sb->s_security)
+ memcpy(sbsp, reference->d_sb->s_security, sizeof(*sbsp));
+ } else if (!smack_privileged(CAP_MAC_ADMIN)) {
+ /* Unprivileged mounts get root and default from the caller. */
+ skp = smk_of_current();
+ sbsp->smk_root = skp;
+ sbsp->smk_default = skp;
+ } else {
+ sbsp->smk_root = &smack_known_floor;
+ sbsp->smk_default = &smack_known_floor;
+ sbsp->smk_floor = &smack_known_floor;
+ sbsp->smk_hat = &smack_known_hat;
+ /* SMK_SB_INITIALIZED will be zero from kzalloc. */
+ }
+
+ return 0;
+
+nomem_free:
+ smack_fs_context_free(fc);
+nomem:
+ return -ENOMEM;
+}
+
+/**
+ * smack_fs_context_dup - Duplicate the security data on fs_context duplication
+ * @fc: The new filesystem context.
+ * @src_fc: The source filesystem context being duplicated.
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ struct smack_fs_context *dst, *src = src_fc->security;
+ int i;
+
+ dst = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+ if (!dst)
+ goto nomem;
+ fc->security = dst;
+
+ dst->sbsp = kmemdup(src->sbsp, sizeof(struct superblock_smack),
+ GFP_KERNEL);
+ if (!dst->sbsp)
+ goto nomem_free;
+
+ for (i = 0; i < ARRAY_SIZE(dst->ptrs); i++) {
+ if (src->ptrs[i]) {
+ dst->ptrs[i] = kstrdup(src->ptrs[i], GFP_KERNEL);
+ if (!dst->ptrs[i])
+ goto nomem_free;
+ }
+ }
+
+ return 0;
+
+nomem_free:
+ smack_fs_context_free(fc);
+nomem:
+ return -ENOMEM;
+}
+
+/**
+ * smack_fs_context_parse_option - Parse a single mount option
+ * @fc: The new filesystem context being constructed.
+ * @p: The option text buffer.
+ * @len: The length of the text.
+ *
+ * Returns 0 on success or -EPERM, -EINVAL or -ENOMEM on error.
+ */
+static int smack_fs_context_parse_option(struct fs_context *fc, char *p, size_t len)
+{
+ struct smack_fs_context *ctx = fc->security;
+ substring_t args[MAX_OPT_ARGS];
+ int rc = -ENOMEM;
+ int token;
+
+ /* Unprivileged mounts don't get to specify Smack values. */
+ if (!smack_privileged(CAP_MAC_ADMIN))
+ return -EPERM;
+
+ token = match_token(p, smk_mount_tokens, args);
+ switch (token) {
+ case Opt_fsdefault:
+ if (ctx->fsdefault)
+ goto error_dup;
+ ctx->fsdefault = match_strdup(&args[0]);
+ if (!ctx->fsdefault)
+ goto error;
+ break;
+ case Opt_fsfloor:
+ if (ctx->fsfloor)
+ goto error_dup;
+ ctx->fsfloor = match_strdup(&args[0]);
+ if (!ctx->fsfloor)
+ goto error;
+ break;
+ case Opt_fshat:
+ if (ctx->fshat)
+ goto error_dup;
+ ctx->fshat = match_strdup(&args[0]);
+ if (!ctx->fshat)
+ goto error;
+ break;
+ case Opt_fsroot:
+ if (ctx->fsroot)
+ goto error_dup;
+ ctx->fsroot = match_strdup(&args[0]);
+ if (!ctx->fsroot)
+ goto error;
+ break;
+ case Opt_fstransmute:
+ if (ctx->fstransmute)
+ goto error_dup;
+ ctx->fstransmute = match_strdup(&args[0]);
+ if (!ctx->fstransmute)
+ goto error;
+ break;
+ default:
+ pr_warn("Smack: unknown mount option\n");
+ goto error_inval;
+ }
+
+ return 0;
+
+error_dup:
+ pr_warn("Smack: duplicate mount option\n");
+error_inval:
+ rc = -EINVAL;
+error:
+ return rc;
+}
+
+/**
+ * smack_fs_context_validate - Validate the filesystem context security data
+ * @fc: The filesystem context.
+ *
+ * Returns 0 on success or a negative error code from label import on error.
+ */
+static int smack_fs_context_validate(struct fs_context *fc)
+{
+ struct smack_fs_context *ctx = fc->security;
+ struct superblock_smack *sbsp = ctx->sbsp;
+ struct inode_smack *isp = ctx->isp;
+ struct smack_known *skp;
+
+ if (ctx->fsdefault) {
+ skp = smk_import_entry(ctx->fsdefault, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_default = skp;
+ }
+
+ if (ctx->fsfloor) {
+ skp = smk_import_entry(ctx->fsfloor, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_floor = skp;
+ }
+
+ if (ctx->fshat) {
+ skp = smk_import_entry(ctx->fshat, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_hat = skp;
+ }
+
+ if (ctx->fsroot || ctx->fstransmute) {
+ skp = smk_import_entry(ctx->fstransmute ?: ctx->fsroot, 0);
+ if (IS_ERR(skp))
+ return PTR_ERR(skp);
+ sbsp->smk_root = skp;
+ ctx->transmute = !!ctx->fstransmute;
+ }
+
+ isp->smk_inode = sbsp->smk_root;
+ return 0;
+}
+
+/**
+ * smack_sb_get_tree - Assign the context to a newly created superblock
+ * @fc: The new filesystem context.
+ *
+ * Returns 0; there is currently no failure path.
+ */
+static int smack_sb_get_tree(struct fs_context *fc)
+{
+ struct smack_fs_context *ctx = fc->security;
+ struct superblock_smack *sbsp = ctx->sbsp;
+ struct dentry *root = fc->root;
+ struct inode *inode = d_backing_inode(root);
+ struct super_block *sb = root->d_sb;
+ struct inode_smack *isp;
+ bool transmute = ctx->transmute;
+
+ if (sb->s_security)
+ return 0;
+
+ if (!smack_privileged(CAP_MAC_ADMIN)) {
+ /*
+ * For a handful of fs types with no user-controlled
+ * backing store it's okay to trust security labels
+ * in the filesystem. The rest are untrusted.
+ */
+ if (fc->user_ns != &init_user_ns &&
+ sb->s_magic != SYSFS_MAGIC && sb->s_magic != TMPFS_MAGIC &&
+ sb->s_magic != RAMFS_MAGIC) {
+ transmute = true;
+ sbsp->smk_flags |= SMK_SB_UNTRUSTED;
+ }
+ }
+
+ sbsp->smk_flags |= SMK_SB_INITIALIZED;
+ sb->s_security = sbsp;
+ ctx->sbsp = NULL;
+
+ /* Initialize the root inode. */
+ isp = inode->i_security;
+ if (isp == NULL) {
+ isp = ctx->isp;
+ ctx->isp = NULL;
+ inode->i_security = isp;
+ } else
+ isp->smk_inode = sbsp->smk_root;
+
+ if (transmute)
+ isp->smk_flags |= SMK_INODE_TRANSMUTE;
+
+ return 0;
+}

/*
* Superblock Hooks.
@@ -4628,6 +4930,13 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(ptrace_traceme, smack_ptrace_traceme),
LSM_HOOK_INIT(syslog, smack_syslog),

+ LSM_HOOK_INIT(fs_context_alloc, smack_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, smack_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, smack_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_option, smack_fs_context_parse_option),
+ LSM_HOOK_INIT(fs_context_validate, smack_fs_context_validate),
+ LSM_HOOK_INIT(sb_get_tree, smack_sb_get_tree),
+
LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),


2018-05-25 02:49:07

by David Howells

Subject: [PATCH 12/32] procfs: Move proc_fill_super() to fs/proc/root.c [ver #8]

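Move proc_fill_super() to fs/proc/root.c, where the rest of the procfs
superblock code lives, and make it static. As a consequence,
proc_parse_options() also becomes static and proc_sops is exported from
fs/proc/inode.c.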

Signed-off-by: David Howells <[email protected]>
---

fs/proc/inode.c | 49 +------------------------------------------------
fs/proc/internal.h | 4 +---
fs/proc/root.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 49 insertions(+), 52 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index df65431c00be..0b13cf6eb6d7 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -24,7 +24,6 @@
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/mount.h>
-#include <linux/magic.h>

#include <linux/uaccess.h>

@@ -123,7 +122,7 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
return 0;
}

-static const struct super_operations proc_sops = {
+const struct super_operations proc_sops = {
.alloc_inode = proc_alloc_inode,
.destroy_inode = proc_destroy_inode,
.drop_inode = generic_delete_inode,
@@ -489,49 +488,3 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
pde_put(de);
return inode;
}
-
-int proc_fill_super(struct super_block *s, void *data, size_t data_size,
- int silent)
-{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
- struct inode *root_inode;
- int ret;
-
- if (!proc_parse_options(data, ns))
- return -EINVAL;
-
- /* User space would break if executables or devices appear on proc */
- s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
- s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
- s->s_blocksize = 1024;
- s->s_blocksize_bits = 10;
- s->s_magic = PROC_SUPER_MAGIC;
- s->s_op = &proc_sops;
- s->s_time_gran = 1;
-
- /*
- * procfs isn't actually a stacking filesystem; however, there is
- * too much magic going on inside it to permit stacking things on
- * top of it
- */
- s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
- pde_get(&proc_root);
- root_inode = proc_get_inode(s, &proc_root);
- if (!root_inode) {
- pr_err("proc_fill_super: get root inode failed\n");
- return -ENOMEM;
- }
-
- s->s_root = d_make_root(root_inode);
- if (!s->s_root) {
- pr_err("proc_fill_super: allocate dentry failed\n");
- return -ENOMEM;
- }
-
- ret = proc_setup_self(s);
- if (ret) {
- return ret;
- }
- return proc_setup_thread_self(s);
-}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c0af86a18abe..c918ec4cc0d9 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -200,13 +200,12 @@ struct pde_opener {
struct completion *c;
} __randomize_layout;
extern const struct inode_operations proc_link_inode_operations;
-
extern const struct inode_operations proc_pid_link_inode_operations;
+extern const struct super_operations proc_sops;

void proc_init_kmemcache(void);
void set_proc_pid_nlink(void);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
-extern int proc_fill_super(struct super_block *, void *, size_t, int);
extern void proc_entry_rundown(struct proc_dir_entry *);

/*
@@ -264,7 +263,6 @@ static inline void proc_tty_init(void) {}
* root.c
*/
extern struct proc_dir_entry proc_root;
-extern int proc_parse_options(char *options, struct pid_namespace *pid);

extern void proc_self_init(void);
extern int proc_remount(struct super_block *, int *, char *, size_t);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 99ce06c4e1a2..2fbc177f37a8 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,6 +23,7 @@
#include <linux/pid_namespace.h>
#include <linux/parser.h>
#include <linux/cred.h>
+#include <linux/magic.h>

#include "internal.h"

@@ -36,7 +37,7 @@ static const match_table_t tokens = {
{Opt_err, NULL},
};

-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_options(char *options, struct pid_namespace *pid)
{
char *p;
substring_t args[MAX_OPT_ARGS];
@@ -78,6 +79,51 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
return 1;
}

+static int proc_fill_super(struct super_block *s, void *data, size_t data_size, int silent)
+{
+ struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct inode *root_inode;
+ int ret;
+
+ if (!proc_parse_options(data, ns))
+ return -EINVAL;
+
+ /* User space would break if executables or devices appear on proc */
+ s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
+ s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
+ s->s_blocksize = 1024;
+ s->s_blocksize_bits = 10;
+ s->s_magic = PROC_SUPER_MAGIC;
+ s->s_op = &proc_sops;
+ s->s_time_gran = 1;
+
+ /*
+ * procfs isn't actually a stacking filesystem; however, there is
+ * too much magic going on inside it to permit stacking things on
+ * top of it
+ */
+ s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
+
+ pde_get(&proc_root);
+ root_inode = proc_get_inode(s, &proc_root);
+ if (!root_inode) {
+ pr_err("proc_fill_super: get root inode failed\n");
+ return -ENOMEM;
+ }
+
+ s->s_root = d_make_root(root_inode);
+ if (!s->s_root) {
+ pr_err("proc_fill_super: allocate dentry failed\n");
+ return -ENOMEM;
+ }
+
+ ret = proc_setup_self(s);
+ if (ret) {
+ return ret;
+ }
+ return proc_setup_thread_self(s);
+}
+
int proc_remount(struct super_block *sb, int *flags,
char *data, size_t data_size)
{


2018-05-25 02:49:08

by David Howells

Subject: [PATCH 05/32] selinux: Implement the new mount API LSM hooks [ver #8]

Implement the new mount API LSM hooks for SELinux. At some point the old
hooks will need to be removed.

Question: Should the ->fs_context_parse_source() hook be implemented to
check the labels on any source devices specified?

Signed-off-by: David Howells <[email protected]>
cc: Paul Moore <[email protected]>
cc: Stephen Smalley <[email protected]>
cc: [email protected]
cc: [email protected]
---

security/selinux/hooks.c | 262 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 262 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 54ecb1c18ca1..1ab74c5ae789 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -48,6 +48,7 @@
#include <linux/fdtable.h>
#include <linux/namei.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/netfilter_ipv4.h>
#include <linux/netfilter_ipv6.h>
#include <linux/tty.h>
@@ -2971,6 +2972,259 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
FILESYSTEM__UNMOUNT, NULL);
}

+/* fsopen mount context operations */
+
+static int selinux_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference)
+{
+ struct security_mnt_opts *opts;
+
+ opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+
+ fc->security = opts;
+ return 0;
+}
+
+static int selinux_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc)
+{
+ const struct security_mnt_opts *src = src_fc->security;
+ struct security_mnt_opts *opts;
+ int i, n;
+
+ opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+ fc->security = opts;
+
+ if (!src || !src->num_mnt_opts)
+ return 0;
+ n = opts->num_mnt_opts = src->num_mnt_opts;
+
+ if (src->mnt_opts) {
+ opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
+ if (!opts->mnt_opts)
+ return -ENOMEM;
+
+ for (i = 0; i < n; i++) {
+ if (src->mnt_opts[i]) {
+ opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
+ GFP_KERNEL);
+ if (!opts->mnt_opts[i])
+ return -ENOMEM;
+ }
+ }
+ }
+
+ if (src->mnt_opts_flags) {
+ opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
+ n * sizeof(int), GFP_KERNEL);
+ if (!opts->mnt_opts_flags)
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static void selinux_fs_context_free(struct fs_context *fc)
+{
+ struct security_mnt_opts *opts = fc->security;
+
+ security_free_mnt_opts(opts);
+ fc->security = NULL;
+}
+
+static int selinux_fs_context_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct security_mnt_opts *opts = fc->security;
+ substring_t args[MAX_OPT_ARGS];
+ unsigned int have;
+ char *c, **oo;
+ int token, ctx, i, *of;
+
+ token = match_token(opt, tokens, args);
+ if (token == Opt_error)
+ return 0; /* Doesn't belong to us. */
+
+ have = 0;
+ for (i = 0; i < opts->num_mnt_opts; i++)
+ have |= 1 << opts->mnt_opts_flags[i];
+ if (have & (1 << token))
+ return -EINVAL;
+
+ switch (token) {
+ case Opt_context:
+ if (have & (1 << Opt_defcontext))
+ goto incompatible;
+ ctx = CONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_fscontext:
+ ctx = FSCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_rootcontext:
+ ctx = ROOTCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_defcontext:
+ if (have & (1 << Opt_context))
+ goto incompatible;
+ ctx = DEFCONTEXT_MNT;
+ goto copy_context_string;
+
+ case Opt_labelsupport:
+ return 1;
+
+ default:
+ return -EINVAL;
+ }
+
+copy_context_string:
+ if (opts->num_mnt_opts > 3)
+ return -EINVAL;
+
+ of = krealloc(opts->mnt_opts_flags,
+ (opts->num_mnt_opts + 1) * sizeof(int), GFP_KERNEL);
+ if (!of)
+ return -ENOMEM;
+ of[opts->num_mnt_opts] = 0;
+ opts->mnt_opts_flags = of;
+
+ oo = krealloc(opts->mnt_opts,
+ (opts->num_mnt_opts + 1) * sizeof(char *), GFP_KERNEL);
+ if (!oo)
+ return -ENOMEM;
+ oo[opts->num_mnt_opts] = NULL;
+ opts->mnt_opts = oo;
+
+ c = match_strdup(&args[0]);
+ if (!c)
+ return -ENOMEM;
+ opts->mnt_opts[opts->num_mnt_opts] = c;
+ opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
+ opts->num_mnt_opts++;
+ return 1;
+
+incompatible:
+ return -EINVAL;
+}
+
+/*
+ * Validate the security parameters supplied for a reconfiguration/remount
+ * event.
+ */
+static int selinux_validate_for_sb_reconfigure(struct fs_context *fc)
+{
+ struct super_block *sb = fc->root->d_sb;
+ struct superblock_security_struct *sbsec = sb->s_security;
+ struct security_mnt_opts *opts = fc->security;
+ int rc, i, *flags;
+ char **mount_options;
+
+ if (!(sbsec->flags & SE_SBINITIALIZED))
+ return 0;
+
+ mount_options = opts->mnt_opts;
+ flags = opts->mnt_opts_flags;
+
+ for (i = 0; i < opts->num_mnt_opts; i++) {
+ u32 sid;
+
+ if (flags[i] == SBLABEL_MNT)
+ continue;
+
+ rc = security_context_str_to_sid(&selinux_state, mount_options[i],
+ &sid, GFP_KERNEL);
+ if (rc) {
+ pr_warn("SELinux: security_context_str_to_sid"
+ "(%s) failed for (dev %s, type %s) errno=%d\n",
+ mount_options[i], sb->s_id, sb->s_type->name, rc);
+ goto inval;
+ }
+
+ switch (flags[i]) {
+ case FSCONTEXT_MNT:
+ if (bad_option(sbsec, FSCONTEXT_MNT, sbsec->sid, sid))
+ goto bad_option;
+ break;
+ case CONTEXT_MNT:
+ if (bad_option(sbsec, CONTEXT_MNT, sbsec->mntpoint_sid, sid))
+ goto bad_option;
+ break;
+ case ROOTCONTEXT_MNT: {
+ struct inode_security_struct *root_isec;
+ root_isec = backing_inode_security(sb->s_root);
+
+ if (bad_option(sbsec, ROOTCONTEXT_MNT, root_isec->sid, sid))
+ goto bad_option;
+ break;
+ }
+ case DEFCONTEXT_MNT:
+ if (bad_option(sbsec, DEFCONTEXT_MNT, sbsec->def_sid, sid))
+ goto bad_option;
+ break;
+ default:
+ goto inval;
+ }
+ }
+
+ rc = 0;
+out:
+ return rc;
+
+bad_option:
+ pr_warn("SELinux: unable to change security options "
+ "during remount (dev %s, type=%s)\n",
+ sb->s_id, sb->s_type->name);
+inval:
+ rc = -EINVAL;
+ goto out;
+}
+
+/*
+ * Validate the security context assembled from the option data supplied to
+ * mount.
+ */
+static int selinux_fs_context_validate(struct fs_context *fc)
+{
+ if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)
+ return selinux_validate_for_sb_reconfigure(fc);
+ return 0;
+}
+
+/*
+ * Set the security context on a superblock.
+ */
+static int selinux_sb_get_tree(struct fs_context *fc)
+{
+ const struct cred *cred = current_cred();
+ struct common_audit_data ad;
+ int rc;
+
+ rc = selinux_set_mnt_opts(fc->root->d_sb, fc->security, 0, NULL);
+ if (rc)
+ return rc;
+
+ /* Allow all mounts performed by the kernel */
+ if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+ return 0;
+
+ ad.type = LSM_AUDIT_DATA_DENTRY;
+ ad.u.dentry = fc->root;
+ return superblock_has_perm(cred, fc->root->d_sb, FILESYSTEM__MOUNT, &ad);
+}
+
+static int selinux_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ const struct cred *cred = current_cred();
+
+ return path_has_perm(cred, mountpoint, FILE__MOUNTON);
+}
+
/* inode security operations */

static int selinux_inode_alloc_security(struct inode *inode)
@@ -6882,6 +7136,14 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),

+ LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
+ LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
+ LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
+ LSM_HOOK_INIT(fs_context_parse_option, selinux_fs_context_parse_option),
+ LSM_HOOK_INIT(fs_context_validate, selinux_fs_context_validate),
+ LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
+ LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
+
LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),


2018-05-25 02:49:09

by David Howells

Subject: [PATCH 15/32] cpuset: Use fs_context [ver #8]

Make the cpuset filesystem use the filesystem context. This is potentially
tricky because the cpuset filesystem is almost an alias for the cgroup
filesystem, differing only in a few fixed parameters.

This can, however, be handled by creating a context for the cgroup
filesystem, giving it those parameters, and returning the root directory of
the resulting cgroup filesystem as the root directory of this one.

Signed-off-by: David Howells <[email protected]>
cc: Tejun Heo <[email protected]>
---

kernel/cgroup/cpuset.c | 66 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 52 insertions(+), 14 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3c8ef37879f0..f570d13bc688 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -38,7 +38,7 @@
#include <linux/mm.h>
#include <linux/memory.h>
#include <linux/export.h>
-#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/namei.h>
#include <linux/pagemap.h>
#include <linux/proc_fs.h>
@@ -315,26 +315,64 @@ static inline bool is_in_v2_mode(void)
* users. If someone tries to mount the "cpuset" filesystem, we
* silently switch it to mount "cgroup" instead
*/
-static struct dentry *cpuset_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name,
- void *data, size_t data_size)
+static int cpuset_get_tree(struct fs_context *fc)
{
- struct file_system_type *cgroup_fs = get_fs_type("cgroup");
- struct dentry *ret = ERR_PTR(-ENODEV);
+ static const char opts[] = "cpuset,noprefix,release_agent=/sbin/cpuset_release_agent";
+ struct file_system_type *cgroup_fs;
+ struct fs_context *cg_fc;
+ char *p;
+ int ret = -ENODEV;
+
+ cgroup_fs = get_fs_type("cgroup");
- if (cgroup_fs) {
- char mountopts[] =
- "cpuset,noprefix,"
- "release_agent=/sbin/cpuset_release_agent";
- ret = cgroup_fs->mount(cgroup_fs, flags, unused_dev_name,
- mountopts, data_size);
- put_filesystem(cgroup_fs);
+ if (!cgroup_fs) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->purpose);
+ put_filesystem(cgroup_fs);
+ if (IS_ERR(cg_fc)) {
+ ret = PTR_ERR(cg_fc);
+ goto out;
}
+
+ ret = -ENOMEM;
+ p = kstrdup(opts, GFP_KERNEL);
+ if (!p)
+ goto out_fc;
+
+ ret = generic_parse_monolithic(cg_fc, p, sizeof(opts) - 1);
+ kfree(p);
+ if (ret < 0)
+ goto out_fc;
+
+ ret = vfs_get_tree(cg_fc);
+ if (ret < 0)
+ goto out_fc;
+
+ fc->root = dget(cg_fc->root);
+ ret = 0;
+
+out_fc:
+ put_fs_context(cg_fc);
+out:
return ret;
}

+static const struct fs_context_operations cpuset_fs_context_ops = {
+ .get_tree = cpuset_get_tree,
+};
+
+static int cpuset_init_fs_context(struct fs_context *fc,
+ struct dentry *reference)
+{
+ fc->ops = &cpuset_fs_context_ops;
+ return 0;
+}
+
static struct file_system_type cpuset_fs_type = {
- .name = "cpuset",
- .mount = cpuset_mount,
+ .name = "cpuset",
+ .init_fs_context = cpuset_init_fs_context,
};

/*


2018-05-25 02:49:07

by David Howells

Subject: [PATCH 09/32] VFS: Require specification of size of mount data for internal mounts [ver #8]

Require internal mounts to specify the size of the mount data they pass to
the VFS mounting functions. The legacy handling in the upcoming
mount-context patches has to copy that data, and without a size it must
copy an entire page because that is how big the buffer is defined to be.
Some internal callers, however, pass a short buffer on the stack, so the
copy may run into the stack guard page.

Signed-off-by: David Howells <[email protected]>
---

arch/ia64/kernel/perfmon.c | 3 +
arch/powerpc/platforms/cell/spufs/inode.c | 6 +--
arch/s390/hypfs/inode.c | 7 ++-
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 2 -
drivers/base/devtmpfs.c | 6 +--
drivers/dax/super.c | 2 -
drivers/gpu/drm/drm_drv.c | 3 +
drivers/gpu/drm/i915/i915_gemfs.c | 2 -
drivers/infiniband/hw/qib/qib_fs.c | 7 ++-
drivers/misc/ibmasm/ibmasmfs.c | 11 +++--
drivers/mtd/mtdsuper.c | 26 +++++++-----
drivers/oprofile/oprofilefs.c | 8 ++--
.../staging/lustre/lustre/llite/llite_internal.h | 2 -
drivers/staging/lustre/lustre/llite/llite_lib.c | 3 +
drivers/staging/lustre/lustre/obdclass/obd_mount.c | 7 ++-
drivers/staging/ncpfs/inode.c | 10 +++--
drivers/usb/gadget/function/f_fs.c | 7 ++-
drivers/usb/gadget/legacy/inode.c | 7 ++-
drivers/virtio/virtio_balloon.c | 2 -
drivers/xen/xenfs/super.c | 7 ++-
fs/9p/vfs_super.c | 2 -
fs/adfs/super.c | 9 ++--
fs/affs/super.c | 13 ++++--
fs/afs/mntpt.c | 3 +
fs/afs/super.c | 6 ++-
fs/aio.c | 3 +
fs/anon_inodes.c | 3 +
fs/autofs4/autofs_i.h | 2 -
fs/autofs4/init.c | 4 +-
fs/autofs4/inode.c | 3 +
fs/befs/linuxvfs.c | 11 +++--
fs/bfs/inode.c | 8 ++--
fs/binfmt_misc.c | 7 ++-
fs/block_dev.c | 2 -
fs/btrfs/super.c | 30 ++++++++------
fs/btrfs/tests/btrfs-tests.c | 2 -
fs/ceph/super.c | 3 +
fs/cifs/cifs_dfs_ref.c | 3 +
fs/cifs/cifsfs.c | 5 +-
fs/coda/inode.c | 11 +++--
fs/configfs/mount.c | 7 ++-
fs/cramfs/inode.c | 17 +++++---
fs/debugfs/inode.c | 14 ++++--
fs/devpts/inode.c | 10 +++--
fs/ecryptfs/main.c | 2 -
fs/efivarfs/super.c | 9 +++-
fs/efs/super.c | 14 ++++--
fs/exofs/super.c | 7 ++-
fs/ext2/super.c | 14 ++++--
fs/ext4/super.c | 16 +++++--
fs/f2fs/super.c | 11 +++--
fs/fat/inode.c | 3 +
fs/fat/namei_msdos.c | 8 ++--
fs/fat/namei_vfat.c | 8 ++--
fs/freevxfs/vxfs_super.c | 12 ++++-
fs/fuse/control.c | 9 +++-
fs/fuse/inode.c | 16 +++++--
fs/gfs2/ops_fstype.c | 6 ++-
fs/gfs2/super.c | 4 +-
fs/hfs/super.c | 12 ++++-
fs/hfsplus/super.c | 12 ++++-
fs/hostfs/hostfs_kern.c | 7 ++-
fs/hpfs/super.c | 11 +++--
fs/hugetlbfs/inode.c | 13 ++++--
fs/internal.h | 4 +-
fs/isofs/inode.c | 11 +++--
fs/jffs2/super.c | 10 +++--
fs/jfs/super.c | 11 +++--
fs/kernfs/mount.c | 3 +
fs/libfs.c | 2 -
fs/minix/inode.c | 14 ++++--
fs/namespace.c | 38 ++++++++++-------
fs/nfs/internal.h | 4 +-
fs/nfs/namespace.c | 3 +
fs/nfs/nfs4namespace.c | 3 +
fs/nfs/nfs4super.c | 27 +++++++-----
fs/nfs/super.c | 22 +++++-----
fs/nfsd/nfsctl.c | 8 ++--
fs/nilfs2/super.c | 10 +++--
fs/nsfs.c | 3 +
fs/ntfs/super.c | 13 ++++--
fs/ocfs2/dlmfs/dlmfs.c | 5 +-
fs/ocfs2/super.c | 14 ++++--
fs/omfs/inode.c | 9 +++-
fs/openpromfs/inode.c | 11 +++--
fs/orangefs/orangefs-kernel.h | 2 -
fs/orangefs/super.c | 5 +-
fs/overlayfs/super.c | 11 +++--
fs/pipe.c | 3 +
fs/proc/inode.c | 3 +
fs/proc/internal.h | 4 +-
fs/proc/root.c | 11 +++--
fs/pstore/inode.c | 10 +++--
fs/qnx4/inode.c | 14 ++++--
fs/qnx6/inode.c | 14 ++++--
fs/ramfs/inode.c | 6 +--
fs/reiserfs/super.c | 14 ++++--
fs/romfs/super.c | 13 ++++--
fs/squashfs/super.c | 12 ++++-
fs/super.c | 44 +++++++++++---------
fs/sysfs/mount.c | 2 -
fs/sysv/inode.c | 3 +
fs/sysv/super.c | 16 +++++--
fs/tracefs/inode.c | 10 +++--
fs/ubifs/super.c | 5 +-
fs/udf/super.c | 16 +++++--
fs/ufs/super.c | 11 +++--
fs/xfs/xfs_super.c | 10 +++--
include/linux/debugfs.h | 8 ++--
include/linux/fs.h | 29 +++++++------
include/linux/lsm_hooks.h | 13 ++++--
include/linux/mount.h | 5 +-
include/linux/mtd/super.h | 4 +-
include/linux/ramfs.h | 4 +-
include/linux/security.h | 17 ++++----
include/linux/shmem_fs.h | 3 +
init/do_mounts.c | 4 +-
ipc/mqueue.c | 9 ++--
kernel/bpf/inode.c | 7 ++-
kernel/cgroup/cgroup.c | 2 -
kernel/cgroup/cpuset.c | 7 ++-
kernel/trace/trace.c | 7 ++-
mm/shmem.c | 10 +++--
mm/zsmalloc.c | 3 +
net/socket.c | 3 +
net/sunrpc/rpc_pipe.c | 7 ++-
security/apparmor/apparmorfs.c | 8 ++--
security/apparmor/lsm.c | 3 +
security/inode.c | 7 ++-
security/security.c | 18 +++++---
security/selinux/hooks.c | 11 +++--
security/selinux/selinuxfs.c | 8 ++--
security/smack/smack_lsm.c | 6 ++-
security/smack/smackfs.c | 9 +++-
security/tomoyo/tomoyo.c | 4 +-
135 files changed, 710 insertions(+), 470 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 3b38c717008a..ae9a3ae2ba45 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -611,7 +611,8 @@ pfm_unprotect_ctx_ctxsw(pfm_context_t *x, unsigned long f)
static const struct dentry_operations pfmfs_dentry_operations;

static struct dentry *
-pfmfs_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data)
+pfmfs_mount(struct file_system_type *fs_type, int flags, const char *dev_name,
+ void *data, size_t data_size)
{
return mount_pseudo(fs_type, "pfm:", NULL, &pfmfs_dentry_operations,
PFMFS_MAGIC);
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index db329d4bf1c3..90d55b47c471 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -734,7 +734,7 @@ spufs_create_root(struct super_block *sb, void *data)
}

static int
-spufs_fill_super(struct super_block *sb, void *data, int silent)
+spufs_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
{
struct spufs_sb_info *info;
static const struct super_operations s_ops = {
@@ -761,9 +761,9 @@ spufs_fill_super(struct super_block *sb, void *data, int silent)

static struct dentry *
spufs_mount(struct file_system_type *fstype, int flags,
- const char *name, void *data)
+ const char *name, void *data, size_t data_size)
{
- return mount_single(fstype, flags, data, spufs_fill_super);
+ return mount_single(fstype, flags, data, data_size, spufs_fill_super);
}

static struct file_system_type spufs_type = {
diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
index 06b513d192b9..7aa4227d59d4 100644
--- a/arch/s390/hypfs/inode.c
+++ b/arch/s390/hypfs/inode.c
@@ -266,7 +266,8 @@ static int hypfs_show_options(struct seq_file *s, struct dentry *root)
return 0;
}

-static int hypfs_fill_super(struct super_block *sb, void *data, int silent)
+static int hypfs_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct inode *root_inode;
struct dentry *root_dentry;
@@ -309,9 +310,9 @@ static int hypfs_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *hypfs_mount(struct file_system_type *fst, int flags,
- const char *devname, void *data)
+ const char *devname, void *data, size_t data_size)
{
- return mount_single(fst, flags, data, hypfs_fill_super);
+ return mount_single(fst, flags, data, data_size, hypfs_fill_super);
}

static void hypfs_kill_super(struct super_block *sb)
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index fca759d272a1..3584ef8de1fd 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1207,7 +1207,7 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,

static struct dentry *rdt_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
- void *data)
+ void *data, size_t data_size)
{
struct rdt_domain *dom;
struct rdt_resource *r;
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 79a235184fb5..1b87a1e03b45 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -57,12 +57,12 @@ static int __init mount_param(char *str)
__setup("devtmpfs.mount=", mount_param);

static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name, void *data, size_t data_size)
{
#ifdef CONFIG_TMPFS
- return mount_single(fs_type, flags, data, shmem_fill_super);
+ return mount_single(fs_type, flags, data, data_size, shmem_fill_super);
#else
- return mount_single(fs_type, flags, data, ramfs_fill_super);
+ return mount_single(fs_type, flags, data, data_size, ramfs_fill_super);
#endif
}

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..cda4ab7b1dd4 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -396,7 +396,7 @@ static const struct super_operations dax_sops = {
};

static struct dentry *dax_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
return mount_pseudo(fs_type, "dax:", &dax_sops, NULL, DAXFS_MAGIC);
}
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index a1b9338736e3..fc652bb90b78 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -375,7 +375,8 @@ static const struct super_operations drm_fs_sops = {
};

static struct dentry *drm_fs_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name,
+ void *data, size_t data_size)
{
return mount_pseudo(fs_type,
"drm:",
diff --git a/drivers/gpu/drm/i915/i915_gemfs.c b/drivers/gpu/drm/i915/i915_gemfs.c
index 888b7d3f04c3..bf0a355e8f46 100644
--- a/drivers/gpu/drm/i915/i915_gemfs.c
+++ b/drivers/gpu/drm/i915/i915_gemfs.c
@@ -57,7 +57,7 @@ int i915_gemfs_init(struct drm_i915_private *i915)
int flags = 0;
int err;

- err = sb->s_op->remount_fs(sb, &flags, options);
+ err = sb->s_op->remount_fs(sb, &flags, options, sizeof(options));
if (err) {
kern_unmount(gemfs);
return err;
diff --git a/drivers/infiniband/hw/qib/qib_fs.c b/drivers/infiniband/hw/qib/qib_fs.c
index 1d940a2885c9..28648ef1f4cc 100644
--- a/drivers/infiniband/hw/qib/qib_fs.c
+++ b/drivers/infiniband/hw/qib/qib_fs.c
@@ -506,7 +506,8 @@ static int remove_device_files(struct super_block *sb,
* after device init. The direct add_cntr_files() call handles adding
* them from the init code, when the fs is already mounted.
*/
-static int qibfs_fill_super(struct super_block *sb, void *data, int silent)
+static int qibfs_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct qib_devdata *dd, *tmp;
unsigned long flags;
@@ -541,11 +542,11 @@ static int qibfs_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *qibfs_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name, void *data, size_t data_size)
{
struct dentry *ret;

- ret = mount_single(fs_type, flags, data, qibfs_fill_super);
+ ret = mount_single(fs_type, flags, data, data_size, qibfs_fill_super);
if (!IS_ERR(ret))
qib_super = ret->d_sb;
return ret;
diff --git a/drivers/misc/ibmasm/ibmasmfs.c b/drivers/misc/ibmasm/ibmasmfs.c
index e05c3245930a..d0378eec6bca 100644
--- a/drivers/misc/ibmasm/ibmasmfs.c
+++ b/drivers/misc/ibmasm/ibmasmfs.c
@@ -88,13 +88,15 @@ static LIST_HEAD(service_processors);

static struct inode *ibmasmfs_make_inode(struct super_block *sb, int mode);
static void ibmasmfs_create_files (struct super_block *sb);
-static int ibmasmfs_fill_super (struct super_block *sb, void *data, int silent);
+static int ibmasmfs_fill_super (struct super_block *sb, void *data, size_t data_size,
+ int silent);


static struct dentry *ibmasmfs_mount(struct file_system_type *fst,
- int flags, const char *name, void *data)
+ int flags, const char *name,
+ void *data, size_t data_size)
{
- return mount_single(fst, flags, data, ibmasmfs_fill_super);
+ return mount_single(fst, flags, data, data_size, ibmasmfs_fill_super);
}

static const struct super_operations ibmasmfs_s_ops = {
@@ -112,7 +114,8 @@ static struct file_system_type ibmasmfs_type = {
};
MODULE_ALIAS_FS("ibmasmfs");

-static int ibmasmfs_fill_super (struct super_block *sb, void *data, int silent)
+static int ibmasmfs_fill_super (struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct inode *root;

diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index d58a61c09304..13706ea5cf50 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -61,9 +61,9 @@ static int get_sb_mtd_set(struct super_block *sb, void *_mtd)
* get a superblock on an MTD-backed filesystem
*/
static struct dentry *mount_mtd_aux(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data,
+ const char *dev_name, void *data, size_t data_size,
struct mtd_info *mtd,
- int (*fill_super)(struct super_block *, void *, int))
+ int (*fill_super)(struct super_block *, void *, size_t, int))
{
struct super_block *sb;
int ret;
@@ -79,7 +79,7 @@ static struct dentry *mount_mtd_aux(struct file_system_type *fs_type, int flags,
pr_debug("MTDSB: New superblock for device %d (\"%s\")\n",
mtd->index, mtd->name);

- ret = fill_super(sb, data, flags & SB_SILENT ? 1 : 0);
+ ret = fill_super(sb, data, data_size, flags & SB_SILENT ? 1 : 0);
if (ret < 0) {
deactivate_locked_super(sb);
return ERR_PTR(ret);
@@ -105,8 +105,10 @@ static struct dentry *mount_mtd_aux(struct file_system_type *fs_type, int flags,
* get a superblock on an MTD-backed filesystem by MTD device number
*/
static struct dentry *mount_mtd_nr(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data, int mtdnr,
- int (*fill_super)(struct super_block *, void *, int))
+ const char *dev_name,
+ void *data, size_t data_size, int mtdnr,
+ int (*fill_super)(struct super_block *, void *,
+ size_t, int))
{
struct mtd_info *mtd;

@@ -116,15 +118,16 @@ static struct dentry *mount_mtd_nr(struct file_system_type *fs_type, int flags,
return ERR_CAST(mtd);
}

- return mount_mtd_aux(fs_type, flags, dev_name, data, mtd, fill_super);
+ return mount_mtd_aux(fs_type, flags, dev_name, data, data_size, mtd,
+ fill_super);
}

/*
* set up an MTD-based superblock
*/
struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data,
- int (*fill_super)(struct super_block *, void *, int))
+ const char *dev_name, void *data, size_t data_size,
+ int (*fill_super)(struct super_block *, void *, size_t, int))
{
#ifdef CONFIG_BLOCK
struct block_device *bdev;
@@ -153,7 +156,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
if (!IS_ERR(mtd))
return mount_mtd_aux(
fs_type, flags,
- dev_name, data, mtd,
+ dev_name, data, data_size, mtd,
fill_super);

printk(KERN_NOTICE "MTD:"
@@ -170,7 +173,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
pr_debug("MTDSB: mtd%%d, mtdnr %d\n",
mtdnr);
return mount_mtd_nr(fs_type, flags,
- dev_name, data,
+ dev_name, data, data_size,
mtdnr, fill_super);
}
}
@@ -197,7 +200,8 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
if (major != MTD_BLOCK_MAJOR)
goto not_an_MTD_device;

- return mount_mtd_nr(fs_type, flags, dev_name, data, mtdnr, fill_super);
+ return mount_mtd_nr(fs_type, flags, dev_name, data, data_size, mtdnr,
+ fill_super);

not_an_MTD_device:
#endif /* CONFIG_BLOCK */
diff --git a/drivers/oprofile/oprofilefs.c b/drivers/oprofile/oprofilefs.c
index 4ea08979312c..c721d7fd7c7e 100644
--- a/drivers/oprofile/oprofilefs.c
+++ b/drivers/oprofile/oprofilefs.c
@@ -238,7 +238,8 @@ struct dentry *oprofilefs_mkdir(struct dentry *parent, char const *name)
}


-static int oprofilefs_fill_super(struct super_block *sb, void *data, int silent)
+static int oprofilefs_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct inode *root_inode;

@@ -265,9 +266,10 @@ static int oprofilefs_fill_super(struct super_block *sb, void *data, int silent)


static struct dentry *oprofilefs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, oprofilefs_fill_super);
+ return mount_single(fs_type, flags, data, data_size,
+ oprofilefs_fill_super);
}


diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index d46bcf71b273..48b218ecdbd6 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -810,7 +810,7 @@ int ll_iocontrol(struct inode *inode, struct file *file,
unsigned int cmd, unsigned long arg);
int ll_flush_ctx(struct inode *inode);
void ll_umount_begin(struct super_block *sb);
-int ll_remount_fs(struct super_block *sb, int *flags, char *data);
+int ll_remount_fs(struct super_block *sb, int *flags, char *data, size_t data_size);
int ll_show_options(struct seq_file *seq, struct dentry *dentry);
void ll_dirty_page_discard_warn(struct page *page, int ioret);
int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index e7500c53fafc..d8bb57ff3797 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -2039,7 +2039,8 @@ void ll_umount_begin(struct super_block *sb)
schedule();
}

-int ll_remount_fs(struct super_block *sb, int *flags, char *data)
+int ll_remount_fs(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct ll_sb_info *sbi = ll_s2sbi(sb);
char *profilenm = get_profile_name(sb);
diff --git a/drivers/staging/lustre/lustre/obdclass/obd_mount.c b/drivers/staging/lustre/lustre/obdclass/obd_mount.c
index f5e8214ac37b..0fc2a5604a10 100644
--- a/drivers/staging/lustre/lustre/obdclass/obd_mount.c
+++ b/drivers/staging/lustre/lustre/obdclass/obd_mount.c
@@ -1112,7 +1112,8 @@ static int lmd_parse(char *options, struct lustre_mount_data *lmd)
* and this is where we start setting things up.
* @param data Mount options (e.g. -o flock,abort_recov)
*/
-static int lustre_fill_super(struct super_block *sb, void *lmd2_data, int silent)
+static int lustre_fill_super(struct super_block *sb,
+ void *lmd2_data, size_t data_size, int silent)
{
struct lustre_mount_data *lmd;
struct lustre_sb_info *lsi;
@@ -1207,9 +1208,9 @@ EXPORT_SYMBOL(lustre_register_super_ops);

/***************** FS registration ******************/
static struct dentry *lustre_mount(struct file_system_type *fs_type, int flags,
- const char *devname, void *data)
+ const char *devname, void *data, size_t data_size)
{
- return mount_nodev(fs_type, flags, data, lustre_fill_super);
+ return mount_nodev(fs_type, flags, data, data_size, lustre_fill_super);
}

static void lustre_kill_super(struct super_block *sb)
diff --git a/drivers/staging/ncpfs/inode.c b/drivers/staging/ncpfs/inode.c
index bb411610a071..c26606ed0a0c 100644
--- a/drivers/staging/ncpfs/inode.c
+++ b/drivers/staging/ncpfs/inode.c
@@ -101,7 +101,8 @@ static void destroy_inodecache(void)
kmem_cache_destroy(ncp_inode_cachep);
}

-static int ncp_remount(struct super_block *sb, int *flags, char* data)
+static int ncp_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_NODIRATIME;
@@ -466,7 +467,8 @@ static int ncp_parse_options(struct ncp_mount_data_kernel *data, char *options)
return ret;
}

-static int ncp_fill_super(struct super_block *sb, void *raw_data, int silent)
+static int ncp_fill_super(struct super_block *sb,
+ void *raw_data, size_t data_size, int silent)
{
struct ncp_mount_data_kernel data;
struct ncp_server *server;
@@ -1023,9 +1025,9 @@ int ncp_notify_change(struct dentry *dentry, struct iattr *attr)
}

static struct dentry *ncp_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_nodev(fs_type, flags, data, ncp_fill_super);
+ return mount_nodev(fs_type, flags, data, data_size, ncp_fill_super);
}

static struct file_system_type ncp_fs_type = {
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 0294e4f18873..694d59e613f4 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -1355,7 +1355,8 @@ struct ffs_sb_fill_data {
struct ffs_data *ffs_data;
};

-static int ffs_sb_fill(struct super_block *sb, void *_data, int silent)
+static int ffs_sb_fill(struct super_block *sb, void *_data, size_t data_size,
+ int silent)
{
struct ffs_sb_fill_data *data = _data;
struct inode *inode;
@@ -1483,7 +1484,7 @@ static int ffs_fs_parse_opts(struct ffs_sb_fill_data *data, char *opts)

static struct dentry *
ffs_fs_mount(struct file_system_type *t, int flags,
- const char *dev_name, void *opts)
+ const char *dev_name, void *opts, size_t data_size)
{
struct ffs_sb_fill_data data = {
.perms = {
@@ -1525,7 +1526,7 @@ ffs_fs_mount(struct file_system_type *t, int flags,
ffs->private_data = ffs_dev;
data.ffs_data = ffs;

- rv = mount_nodev(t, flags, &data, ffs_sb_fill);
+ rv = mount_nodev(t, flags, &data, sizeof(data), ffs_sb_fill);
if (IS_ERR(rv) && data.ffs_data) {
ffs_release_dev(data.ffs_data);
ffs_data_put(data.ffs_data);
diff --git a/drivers/usb/gadget/legacy/inode.c b/drivers/usb/gadget/legacy/inode.c
index 37ca0e669bd8..286a982b43a3 100644
--- a/drivers/usb/gadget/legacy/inode.c
+++ b/drivers/usb/gadget/legacy/inode.c
@@ -1990,7 +1990,8 @@ static const struct super_operations gadget_fs_operations = {
};

static int
-gadgetfs_fill_super (struct super_block *sb, void *opts, int silent)
+gadgetfs_fill_super (struct super_block *sb, void *opts, size_t data_size,
+ int silent)
{
struct inode *inode;
struct dev_data *dev;
@@ -2046,9 +2047,9 @@ gadgetfs_fill_super (struct super_block *sb, void *opts, int silent)
/* "mount -t gadgetfs path /dev/gadget" ends up here */
static struct dentry *
gadgetfs_mount (struct file_system_type *t, int flags,
- const char *path, void *opts)
+ const char *path, void *opts, size_t data_size)
{
- return mount_single (t, flags, opts, gadgetfs_fill_super);
+ return mount_single (t, flags, opts, data_size, gadgetfs_fill_super);
}

static void
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3f4983..49f4a03ec162 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -526,7 +526,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
}

static struct dentry *balloon_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
static const struct dentry_operations ops = {
.d_dname = simple_dname,
diff --git a/drivers/xen/xenfs/super.c b/drivers/xen/xenfs/super.c
index 71ddfb4cf61c..fc4e6e43b66f 100644
--- a/drivers/xen/xenfs/super.c
+++ b/drivers/xen/xenfs/super.c
@@ -42,7 +42,8 @@ static const struct file_operations capabilities_file_ops = {
.llseek = default_llseek,
};

-static int xenfs_fill_super(struct super_block *sb, void *data, int silent)
+static int xenfs_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
static const struct tree_descr xenfs_files[] = {
[2] = { "xenbus", &xen_xenbus_fops, S_IRUSR|S_IWUSR },
@@ -69,9 +70,9 @@ static int xenfs_fill_super(struct super_block *sb, void *data, int silent)

static struct dentry *xenfs_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, xenfs_fill_super);
+ return mount_single(fs_type, flags, data, data_size, xenfs_fill_super);
}

static struct file_system_type xenfs_type = {
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index 48ce50484e80..7def28abd3a5 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -116,7 +116,7 @@ v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses,
*/

static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name, void *data, size_t data_size)
{
struct super_block *sb = NULL;
struct inode *inode = NULL;
diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index cfda2c7caedc..8a7b2d263afd 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -210,7 +210,7 @@ static int parse_options(struct super_block *sb, char *options)
return 0;
}

-static int adfs_remount(struct super_block *sb, int *flags, char *data)
+static int adfs_remount(struct super_block *sb, int *flags, char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_NODIRATIME;
@@ -362,7 +362,8 @@ static inline unsigned long adfs_discsize(struct adfs_discrecord *dr, int block_
return discsize;
}

-static int adfs_fill_super(struct super_block *sb, void *data, int silent)
+static int adfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct adfs_discrecord *dr;
struct buffer_head *bh;
@@ -522,9 +523,9 @@ static int adfs_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *adfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, adfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size, adfs_fill_super);
}

static struct file_system_type adfs_fs_type = {
diff --git a/fs/affs/super.c b/fs/affs/super.c
index e602619aed9d..b406ffca2066 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -26,7 +26,8 @@

static int affs_statfs(struct dentry *dentry, struct kstatfs *buf);
static int affs_show_options(struct seq_file *m, struct dentry *root);
-static int affs_remount (struct super_block *sb, int *flags, char *data);
+static int affs_remount (struct super_block *sb, int *flags,
+ char *data, size_t data_size);

static void
affs_commit_super(struct super_block *sb, int wait)
@@ -334,7 +335,8 @@ static int affs_show_options(struct seq_file *m, struct dentry *root)
* hopefully have the guts to do so. Until then: sorry for the mess.
*/

-static int affs_fill_super(struct super_block *sb, void *data, int silent)
+static int affs_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct affs_sb_info *sbi;
struct buffer_head *root_bh = NULL;
@@ -549,7 +551,7 @@ static int affs_fill_super(struct super_block *sb, void *data, int silent)
}

static int
-affs_remount(struct super_block *sb, int *flags, char *data)
+affs_remount(struct super_block *sb, int *flags, char *data, size_t data_size)
{
struct affs_sb_info *sbi = AFFS_SB(sb);
int blocksize;
@@ -632,9 +634,10 @@ affs_statfs(struct dentry *dentry, struct kstatfs *buf)
}

static struct dentry *affs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, affs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ affs_fill_super);
}

static void affs_kill_sb(struct super_block *sb)
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index 99fd13500a97..c45aa1776591 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -152,7 +152,8 @@ static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)

/* try and do the mount */
_debug("--- attempting mount %s -o %s ---", devname, options);
- mnt = vfs_submount(mntpt, &afs_fs_type, devname, options);
+ mnt = vfs_submount(mntpt, &afs_fs_type, devname,
+ options, strlen(options) + 1);
_debug("--- mount result %p ---", mnt);

free_page((unsigned long) devname);
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 593820372848..a562b90ad660 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -31,7 +31,8 @@

static void afs_i_init_once(void *foo);
static struct dentry *afs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data);
+ int flags, const char *dev_name,
+ void *data, size_t data_size);
static void afs_kill_super(struct super_block *sb);
static struct inode *afs_alloc_inode(struct super_block *sb);
static void afs_destroy_inode(struct inode *inode);
@@ -462,7 +463,8 @@ static void afs_destroy_sbi(struct afs_super_info *as)
* get an AFS superblock
*/
static struct dentry *afs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *options)
+ int flags, const char *dev_name,
+ void *options, size_t data_size)
{
struct afs_mount_params params;
struct super_block *sb;
diff --git a/fs/aio.c b/fs/aio.c
index 755d3f57bcc8..b780fd6eb9d0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -243,7 +243,8 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
}

static struct dentry *aio_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
static const struct dentry_operations ops = {
.d_dname = simple_dname,
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3168ee4e77f4..13c06a7e0b85 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -39,7 +39,8 @@ static const struct dentry_operations anon_inodefs_dentry_operations = {
};

static struct dentry *anon_inodefs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
return mount_pseudo(fs_type, "anon_inode:", NULL,
&anon_inodefs_dentry_operations, ANON_INODE_FS_MAGIC);
diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 4737615f0eaa..06a975ee5724 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -201,7 +201,7 @@ static inline void managed_dentry_clear_managed(struct dentry *dentry)

/* Initializing function */

-int autofs4_fill_super(struct super_block *, void *, int);
+int autofs4_fill_super(struct super_block *, void *, size_t, int);
struct autofs_info *autofs4_new_ino(struct autofs_sb_info *);
void autofs4_clean_ino(struct autofs_info *);

diff --git a/fs/autofs4/init.c b/fs/autofs4/init.c
index 8cf0e63389ae..3335cfdb9403 100644
--- a/fs/autofs4/init.c
+++ b/fs/autofs4/init.c
@@ -11,9 +11,9 @@
#include "autofs_i.h"

static struct dentry *autofs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_nodev(fs_type, flags, data, autofs4_fill_super);
+ return mount_nodev(fs_type, flags, data, data_size, autofs4_fill_super);
}

static struct file_system_type autofs_fs_type = {
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 09e7d68dff02..49389477ba36 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -206,7 +206,8 @@ static int parse_options(char *options, int *pipefd, kuid_t *uid, kgid_t *gid,
return (*pipefd < 0);
}

-int autofs4_fill_super(struct super_block *s, void *data, int silent)
+int autofs4_fill_super(struct super_block *s, void *data, size_t data_size,
+ int silent)
{
struct inode *root_inode;
struct dentry *root;
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 4700b4534439..31f760ea2494 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -52,7 +52,7 @@ static int befs_utf2nls(struct super_block *sb, const char *in, int in_len,
static int befs_nls2utf(struct super_block *sb, const char *in, int in_len,
char **out, int *out_len);
static void befs_put_super(struct super_block *);
-static int befs_remount(struct super_block *, int *, char *);
+static int befs_remount(struct super_block *, int *, char *, size_t);
static int befs_statfs(struct dentry *, struct kstatfs *);
static int befs_show_options(struct seq_file *, struct dentry *);
static int parse_options(char *, struct befs_mount_options *);
@@ -810,7 +810,7 @@ befs_put_super(struct super_block *sb)
* Load a set of NLS translations if needed.
*/
static int
-befs_fill_super(struct super_block *sb, void *data, int silent)
+befs_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
{
struct buffer_head *bh;
struct befs_sb_info *befs_sb;
@@ -942,7 +942,7 @@ befs_fill_super(struct super_block *sb, void *data, int silent)
}

static int
-befs_remount(struct super_block *sb, int *flags, char *data)
+befs_remount(struct super_block *sb, int *flags, char *data, size_t data_size)
{
sync_filesystem(sb);
if (!(*flags & SB_RDONLY))
@@ -976,9 +976,10 @@ befs_statfs(struct dentry *dentry, struct kstatfs *buf)

static struct dentry *
befs_mount(struct file_system_type *fs_type, int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, befs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ befs_fill_super);
}

static struct file_system_type befs_fs_type = {
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index 9a69392f1fb3..6e76e4e762e8 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -317,7 +317,8 @@ void bfs_dump_imap(const char *prefix, struct super_block *s)
#endif
}

-static int bfs_fill_super(struct super_block *s, void *data, int silent)
+static int bfs_fill_super(struct super_block *s, void *data, size_t data_size,
+ int silent)
{
struct buffer_head *bh, *sbh;
struct bfs_super_block *bfs_sb;
@@ -460,9 +461,10 @@ static int bfs_fill_super(struct super_block *s, void *data, int silent)
}

static struct dentry *bfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, bfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ bfs_fill_super);
}

static struct file_system_type bfs_fs_type = {
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index a41b48f82a70..274de8bfc004 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -814,7 +814,8 @@ static const struct super_operations s_ops = {
.evict_inode = bm_evict_inode,
};

-static int bm_fill_super(struct super_block *sb, void *data, int silent)
+static int bm_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
int err;
static const struct tree_descr bm_files[] = {
@@ -830,9 +831,9 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *bm_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, bm_fill_super);
+ return mount_single(fs_type, flags, data, data_size, bm_fill_super);
}

static struct linux_binfmt misc_format = {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7ec920e27065..313e57b06425 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -786,7 +786,7 @@ static const struct super_operations bdev_sops = {
};

static struct dentry *bd_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
struct dentry *dent;
dent = mount_pseudo(fs_type, "bdev:", &bdev_sops, NULL, BDEVFS_MAGIC);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 0628092b0b1b..8d8fcd2f0403 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -64,7 +64,8 @@ static const struct super_operations btrfs_super_ops;
static struct file_system_type btrfs_fs_type;
static struct file_system_type btrfs_root_fs_type;

-static int btrfs_remount(struct super_block *sb, int *flags, char *data);
+static int btrfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size);

const char *btrfs_decode_error(int errno)
{
@@ -1453,7 +1454,7 @@ static struct dentry *mount_subvol(const char *subvol_name, u64 subvol_objectid,
return root;
}

-static int parse_security_options(char *orig_opts,
+static int parse_security_options(char *orig_opts, size_t data_size,
struct security_mnt_opts *sec_opts)
{
char *secdata = NULL;
@@ -1462,7 +1463,7 @@ static int parse_security_options(char *orig_opts,
secdata = alloc_secdata();
if (!secdata)
return -ENOMEM;
- ret = security_sb_copy_data(orig_opts, secdata);
+ ret = security_sb_copy_data(orig_opts, data_size, secdata);
if (ret) {
free_secdata(secdata);
return ret;
@@ -1510,7 +1511,8 @@ static int setup_security_options(struct btrfs_fs_info *fs_info,
* for multiple device setup. Make sure to keep it in sync.
*/
static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
- int flags, const char *device_name, void *data)
+ int flags, const char *device_name,
+ void *data, size_t data_size)
{
struct block_device *bdev = NULL;
struct super_block *s;
@@ -1531,7 +1533,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,

security_init_mnt_opts(&new_sec_opts);
if (data) {
- error = parse_security_options(data, &new_sec_opts);
+ error = parse_security_options(data, data_size, &new_sec_opts);
if (error)
return ERR_PTR(error);
}
@@ -1635,7 +1637,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
* "btrfs subvolume set-default", mount_subvol() is called always.
*/
static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
- const char *device_name, void *data)
+ const char *device_name, void *data, size_t data_size)
{
struct vfsmount *mnt_root;
struct dentry *root;
@@ -1655,21 +1657,24 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
}

/* mount device's root (/) */
- mnt_root = vfs_kern_mount(&btrfs_root_fs_type, flags, device_name, data);
+ mnt_root = vfs_kern_mount(&btrfs_root_fs_type, flags, device_name,
+ data, data_size);
if (PTR_ERR_OR_ZERO(mnt_root) == -EBUSY) {
if (flags & SB_RDONLY) {
mnt_root = vfs_kern_mount(&btrfs_root_fs_type,
- flags & ~SB_RDONLY, device_name, data);
+ flags & ~SB_RDONLY, device_name,
+ data, data_size);
} else {
mnt_root = vfs_kern_mount(&btrfs_root_fs_type,
- flags | SB_RDONLY, device_name, data);
+ flags | SB_RDONLY, device_name,
+ data, data_size);
if (IS_ERR(mnt_root)) {
root = ERR_CAST(mnt_root);
goto out;
}

down_write(&mnt_root->mnt_sb->s_umount);
- error = btrfs_remount(mnt_root->mnt_sb, &flags, NULL);
+ error = btrfs_remount(mnt_root->mnt_sb, &flags, NULL, 0);
up_write(&mnt_root->mnt_sb->s_umount);
if (error < 0) {
root = ERR_PTR(error);
@@ -1751,7 +1756,8 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
clear_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state);
}

-static int btrfs_remount(struct super_block *sb, int *flags, char *data)
+static int btrfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct btrfs_fs_info *fs_info = btrfs_sb(sb);
struct btrfs_root *root = fs_info->tree_root;
@@ -1770,7 +1776,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
struct security_mnt_opts new_sec_opts;

security_init_mnt_opts(&new_sec_opts);
- ret = parse_security_options(data, &new_sec_opts);
+ ret = parse_security_options(data, data_size, &new_sec_opts);
if (ret)
goto restore;
ret = setup_security_options(fs_info, sb,
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index 30ed438da2a9..d646cf7b04e5 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -24,7 +24,7 @@ static const struct super_operations btrfs_test_super_ops = {

static struct dentry *btrfs_test_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
return mount_pseudo(fs_type, "btrfs_test:", &btrfs_test_super_ops,
NULL, BTRFS_TEST_MAGIC);
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index b33082e6878f..81d035a7a204 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1002,7 +1002,8 @@ static int ceph_setup_bdi(struct super_block *sb, struct ceph_fs_client *fsc)
}

static struct dentry *ceph_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
struct super_block *sb;
struct ceph_fs_client *fsc;
diff --git a/fs/cifs/cifs_dfs_ref.c b/fs/cifs/cifs_dfs_ref.c
index 6b61df117fd4..461d052a5d73 100644
--- a/fs/cifs/cifs_dfs_ref.c
+++ b/fs/cifs/cifs_dfs_ref.c
@@ -260,7 +260,8 @@ static struct vfsmount *cifs_dfs_do_refmount(struct dentry *mntpt,
if (IS_ERR(mountdata))
return (struct vfsmount *)mountdata;

- mnt = vfs_submount(mntpt, &cifs_fs_type, devname, mountdata);
+ mnt = vfs_submount(mntpt, &cifs_fs_type, devname,
+ mountdata, strlen(mountdata) + 1);
kfree(mountdata);
kfree(devname);
return mnt;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 5a5a0158cc8f..8e56aeb0d2ff 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -572,7 +572,8 @@ static int cifs_show_stats(struct seq_file *s, struct dentry *root)
}
#endif

-static int cifs_remount(struct super_block *sb, int *flags, char *data)
+static int cifs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_NODIRATIME;
@@ -675,7 +676,7 @@ static int cifs_set_super(struct super_block *sb, void *data)

static struct dentry *
cifs_do_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
int rc;
struct super_block *sb;
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 97424cf206c0..dd819c150f70 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -93,7 +93,8 @@ void coda_destroy_inodecache(void)
kmem_cache_destroy(coda_inode_cachep);
}

-static int coda_remount(struct super_block *sb, int *flags, char *data)
+static int coda_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_NOATIME;
@@ -150,7 +151,8 @@ static int get_device_index(struct coda_mount_data *data)
return -1;
}

-static int coda_fill_super(struct super_block *sb, void *data, int silent)
+static int coda_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct inode *root = NULL;
struct venus_comm *vc;
@@ -316,9 +318,10 @@ static int coda_statfs(struct dentry *dentry, struct kstatfs *buf)
/* init_coda: used by filesystems.c to register coda */

static struct dentry *coda_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
- return mount_nodev(fs_type, flags, data, coda_fill_super);
+ return mount_nodev(fs_type, flags, data, data_size, coda_fill_super);
}

struct file_system_type coda_fs_type = {
diff --git a/fs/configfs/mount.c b/fs/configfs/mount.c
index cfd91320e869..c9c7c14eb9db 100644
--- a/fs/configfs/mount.c
+++ b/fs/configfs/mount.c
@@ -66,7 +66,8 @@ static struct configfs_dirent configfs_root = {
.s_iattr = NULL,
};

-static int configfs_fill_super(struct super_block *sb, void *data, int silent)
+static int configfs_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct inode *inode;
struct dentry *root;
@@ -103,9 +104,9 @@ static int configfs_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *configfs_do_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, configfs_fill_super);
+ return mount_single(fs_type, flags, data, data_size, configfs_fill_super);
}

static struct file_system_type configfs_fs_type = {
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 124b093d14e5..3fcc5b7e346b 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -502,7 +502,8 @@ static void cramfs_kill_sb(struct super_block *sb)
kfree(sbi);
}

-static int cramfs_remount(struct super_block *sb, int *flags, char *data)
+static int cramfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_RDONLY;
@@ -603,7 +604,8 @@ static int cramfs_finalize_super(struct super_block *sb,
return 0;
}

-static int cramfs_blkdev_fill_super(struct super_block *sb, void *data,
+static int cramfs_blkdev_fill_super(struct super_block *sb,
+ void *data, size_t data_size,
int silent)
{
struct cramfs_sb_info *sbi;
@@ -625,8 +627,8 @@ static int cramfs_blkdev_fill_super(struct super_block *sb, void *data,
return cramfs_finalize_super(sb, &super.root);
}

-static int cramfs_mtd_fill_super(struct super_block *sb, void *data,
- int silent)
+static int cramfs_mtd_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct cramfs_sb_info *sbi;
struct cramfs_super super;
@@ -951,18 +953,19 @@ static const struct super_operations cramfs_ops = {
};

static struct dentry *cramfs_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name,
+ void *data, size_t data_size)
{
struct dentry *ret = ERR_PTR(-ENOPROTOOPT);

if (IS_ENABLED(CONFIG_CRAMFS_MTD)) {
- ret = mount_mtd(fs_type, flags, dev_name, data,
+ ret = mount_mtd(fs_type, flags, dev_name, data, data_size,
cramfs_mtd_fill_super);
if (!IS_ERR(ret))
return ret;
}
if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV)) {
- ret = mount_bdev(fs_type, flags, dev_name, data,
+ ret = mount_bdev(fs_type, flags, dev_name, data, data_size,
cramfs_blkdev_fill_super);
}
return ret;
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 13b01351dd1c..57ba6d891c85 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -130,7 +130,8 @@ static int debugfs_apply_options(struct super_block *sb)
return 0;
}

-static int debugfs_remount(struct super_block *sb, int *flags, char *data)
+static int debugfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
int err;
struct debugfs_fs_info *fsi = sb->s_fs_info;
@@ -190,7 +191,7 @@ static struct vfsmount *debugfs_automount(struct path *path)
{
debugfs_automount_t f;
f = (debugfs_automount_t)path->dentry->d_fsdata;
- return f(path->dentry, d_inode(path->dentry)->i_private);
+ return f(path->dentry, d_inode(path->dentry)->i_private, 0);
}

static const struct dentry_operations debugfs_dops = {
@@ -199,7 +200,8 @@ static const struct dentry_operations debugfs_dops = {
.d_automount = debugfs_automount,
};

-static int debug_fill_super(struct super_block *sb, void *data, int silent)
+static int debug_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
static const struct tree_descr debug_files[] = {{""}};
struct debugfs_fs_info *fsi;
@@ -235,9 +237,9 @@ static int debug_fill_super(struct super_block *sb, void *data, int silent)

static struct dentry *debug_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, debug_fill_super);
+ return mount_single(fs_type, flags, data, data_size, debug_fill_super);
}

static struct file_system_type debug_fs_type = {
@@ -539,7 +541,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_dir);
struct dentry *debugfs_create_automount(const char *name,
struct dentry *parent,
debugfs_automount_t f,
- void *data)
+ void *data, size_t data_size)
{
struct dentry *dentry = start_creating(name, parent);
struct inode *inode;
diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index e072e955ce33..2dee3d0c8554 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -386,7 +386,8 @@ static void update_ptmx_mode(struct pts_fs_info *fsi)
}
}

-static int devpts_remount(struct super_block *sb, int *flags, char *data)
+static int devpts_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
int err;
struct pts_fs_info *fsi = DEVPTS_SB(sb);
@@ -447,7 +448,8 @@ static void *new_pts_fs_info(struct super_block *sb)
}

static int
-devpts_fill_super(struct super_block *s, void *data, int silent)
+devpts_fill_super(struct super_block *s, void *data, size_t data_size,
+ int silent)
{
struct inode *inode;
int error;
@@ -504,9 +506,9 @@ devpts_fill_super(struct super_block *s, void *data, int silent)
* instance are independent of the PTYs in other devpts instances.
*/
static struct dentry *devpts_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_nodev(fs_type, flags, data, devpts_fill_super);
+ return mount_nodev(fs_type, flags, data, data_size, devpts_fill_super);
}

static void devpts_kill_sb(struct super_block *sb)
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 025d66a705db..5d029b7e069a 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -488,7 +488,7 @@ static struct file_system_type ecryptfs_fs_type;
* @raw_data: The options passed into the kernel
*/
static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *raw_data)
+ const char *dev_name, void *raw_data, size_t data_size)
{
struct super_block *s;
struct ecryptfs_sb_info *sbi;
diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 5b68e4294faa..db0e417f1c7e 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -191,7 +191,8 @@ static int efivarfs_destroy(struct efivar_entry *entry, void *data)
return 0;
}

-static int efivarfs_fill_super(struct super_block *sb, void *data, int silent)
+static int efivarfs_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct inode *inode = NULL;
struct dentry *root;
@@ -227,9 +228,11 @@ static int efivarfs_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *efivarfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, efivarfs_fill_super);
+ return mount_single(fs_type, flags, data, data_size,
+ efivarfs_fill_super);
}

static void efivarfs_kill_sb(struct super_block *sb)
diff --git a/fs/efs/super.c b/fs/efs/super.c
index 6ffb7ba1547a..ce85f22651f3 100644
--- a/fs/efs/super.c
+++ b/fs/efs/super.c
@@ -19,12 +19,14 @@
#include <linux/efs_fs_sb.h>

static int efs_statfs(struct dentry *dentry, struct kstatfs *buf);
-static int efs_fill_super(struct super_block *s, void *d, int silent);
+static int efs_fill_super(struct super_block *s, void *d, size_t data_size,
+ int silent);

static struct dentry *efs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, efs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ efs_fill_super);
}

static void efs_kill_sb(struct super_block *s)
@@ -113,7 +115,8 @@ static void destroy_inodecache(void)
kmem_cache_destroy(efs_inode_cachep);
}

-static int efs_remount(struct super_block *sb, int *flags, char *data)
+static int efs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_RDONLY;
@@ -253,7 +256,8 @@ static int efs_validate_super(struct efs_sb_info *sb, struct efs_super *super) {
return 0;
}

-static int efs_fill_super(struct super_block *s, void *d, int silent)
+static int efs_fill_super(struct super_block *s, void *d, size_t data_size,
+ int silent)
{
struct efs_sb_info *sb;
struct buffer_head *bh;
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 179cd5c2f52a..21886644b6f3 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -705,7 +705,8 @@ static int exofs_read_lookup_dev_table(struct exofs_sb_info *sbi,
/*
* Read the superblock from the OSD and fill in the fields
*/
-static int exofs_fill_super(struct super_block *sb, void *data, int silent)
+static int exofs_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct inode *root;
struct exofs_mountopt *opts = data;
@@ -861,7 +862,7 @@ static int exofs_fill_super(struct super_block *sb, void *data, int silent)
*/
static struct dentry *exofs_mount(struct file_system_type *type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
struct exofs_mountopt opts;
int ret;
@@ -872,7 +873,7 @@ static struct dentry *exofs_mount(struct file_system_type *type,

if (!opts.dev_name)
opts.dev_name = dev_name;
- return mount_nodev(type, flags, &opts, exofs_fill_super);
+ return mount_nodev(type, flags, &opts, 0, exofs_fill_super);
}

/*
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index de1694512f1f..ca2cd53959b3 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -39,7 +39,8 @@
#include "acl.h"

static void ext2_write_super(struct super_block *sb);
-static int ext2_remount (struct super_block * sb, int * flags, char * data);
+static int ext2_remount (struct super_block * sb, int * flags,
+ char * data, size_t data_size);
static int ext2_statfs (struct dentry * dentry, struct kstatfs * buf);
static int ext2_sync_fs(struct super_block *sb, int wait);
static int ext2_freeze(struct super_block *sb);
@@ -815,7 +816,8 @@ static unsigned long descriptor_loc(struct super_block *sb,
return ext2_group_first_block_no(sb, bg) + has_super;
}

-static int ext2_fill_super(struct super_block *sb, void *data, int silent)
+static int ext2_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
struct buffer_head * bh;
@@ -1319,7 +1321,8 @@ static void ext2_write_super(struct super_block *sb)
ext2_sync_fs(sb, 1);
}

-static int ext2_remount (struct super_block * sb, int * flags, char * data)
+static int ext2_remount (struct super_block * sb, int * flags,
+ char *data, size_t data_size)
{
struct ext2_sb_info * sbi = EXT2_SB(sb);
struct ext2_super_block * es;
@@ -1473,9 +1476,10 @@ static int ext2_statfs (struct dentry * dentry, struct kstatfs * buf)
}

static struct dentry *ext2_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, ext2_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ ext2_fill_super);
}

#ifdef CONFIG_QUOTA
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb104e8476f0..e532d1d7739e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -70,12 +70,13 @@ static void ext4_mark_recovery_complete(struct super_block *sb,
static void ext4_clear_journal_err(struct super_block *sb,
struct ext4_super_block *es);
static int ext4_sync_fs(struct super_block *sb, int wait);
-static int ext4_remount(struct super_block *sb, int *flags, char *data);
+static int ext4_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size);
static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf);
static int ext4_unfreeze(struct super_block *sb);
static int ext4_freeze(struct super_block *sb);
static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data);
+ const char *dev_name, void *data, size_t data_size);
static inline int ext2_feature_set_ok(struct super_block *sb);
static inline int ext3_feature_set_ok(struct super_block *sb);
static int ext4_feature_set_ok(struct super_block *sb, int readonly);
@@ -3405,7 +3406,8 @@ static void ext4_set_resv_clusters(struct super_block *sb)
atomic64_set(&sbi->s_resv_clusters, resv_clusters);
}

-static int ext4_fill_super(struct super_block *sb, void *data, int silent)
+static int ext4_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
char *orig_data = kstrdup(data, GFP_KERNEL);
@@ -4976,7 +4978,8 @@ struct ext4_mount_options {
#endif
};

-static int ext4_remount(struct super_block *sb, int *flags, char *data)
+static int ext4_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct ext4_super_block *es;
struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -5735,9 +5738,10 @@ static int ext4_get_next_id(struct super_block *sb, struct kqid *qid)
#endif

static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, ext4_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ ext4_fill_super);
}

#if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index a31cc49b7295..96db72d025d8 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1384,7 +1384,8 @@ static void default_options(struct f2fs_sb_info *sbi)
#ifdef CONFIG_QUOTA
static int f2fs_enable_quotas(struct super_block *sb);
#endif
-static int f2fs_remount(struct super_block *sb, int *flags, char *data)
+static int f2fs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct f2fs_sb_info *sbi = F2FS_SB(sb);
struct f2fs_mount_info org_mount_opt;
@@ -2589,7 +2590,8 @@ static void f2fs_tuning_parameters(struct f2fs_sb_info *sbi)
}
}

-static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
+static int f2fs_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct f2fs_sb_info *sbi;
struct f2fs_super_block *raw_super;
@@ -3015,9 +3017,10 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *f2fs_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, f2fs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ f2fs_fill_super);
}

static void kill_f2fs_super(struct super_block *sb)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index ffbbf0520d9e..53765ffd96c0 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -778,7 +778,8 @@ static void __exit fat_destroy_inodecache(void)
kmem_cache_destroy(fat_inode_cachep);
}

-static int fat_remount(struct super_block *sb, int *flags, char *data)
+static int fat_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
bool new_rdonly;
struct msdos_sb_info *sbi = MSDOS_SB(sb);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 484ce674e0cd..17134b516eea 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -646,16 +646,18 @@ static void setup(struct super_block *sb)
sb->s_flags |= SB_NOATIME;
}

-static int msdos_fill_super(struct super_block *sb, void *data, int silent)
+static int msdos_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
return fat_fill_super(sb, data, silent, 0, setup);
}

static struct dentry *msdos_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, msdos_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ msdos_fill_super);
}

static struct file_system_type msdos_fs_type = {
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 4f4362d5a04c..465202def06c 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -1043,16 +1043,18 @@ static void setup(struct super_block *sb)
sb->s_d_op = &vfat_dentry_ops;
}

-static int vfat_fill_super(struct super_block *sb, void *data, int silent)
+static int vfat_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
return fat_fill_super(sb, data, silent, 1, setup);
}

static struct dentry *vfat_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, vfat_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ vfat_fill_super);
}

static struct file_system_type vfat_fs_type = {
diff --git a/fs/freevxfs/vxfs_super.c b/fs/freevxfs/vxfs_super.c
index 48b24bb50d02..1c6cf91f6de9 100644
--- a/fs/freevxfs/vxfs_super.c
+++ b/fs/freevxfs/vxfs_super.c
@@ -113,7 +113,8 @@ vxfs_statfs(struct dentry *dentry, struct kstatfs *bufp)
return 0;
}

-static int vxfs_remount(struct super_block *sb, int *flags, char *data)
+static int vxfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_RDONLY;
@@ -199,6 +200,7 @@ static int vxfs_try_sb_magic(struct super_block *sbp, int silent,
* vxfs_read_super - read superblock into memory and initialize filesystem
* @sbp: VFS superblock (to fill)
* @dp: fs private mount data
+ * @data_size: size of mount data
* @silent: do not complain loudly when sth is wrong
*
* Description:
@@ -211,7 +213,8 @@ static int vxfs_try_sb_magic(struct super_block *sbp, int silent,
* Locking:
* We are under @sbp->s_lock.
*/
-static int vxfs_fill_super(struct super_block *sbp, void *dp, int silent)
+static int vxfs_fill_super(struct super_block *sbp, void *dp, size_t data_size,
+ int silent)
{
struct vxfs_sb_info *infp;
struct vxfs_sb *rsbp;
@@ -312,9 +315,10 @@ static int vxfs_fill_super(struct super_block *sbp, void *dp, int silent)
* The usual module blurb.
*/
static struct dentry *vxfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, vxfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ vxfs_fill_super);
}

static struct file_system_type vxfs_fs_type = {
diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index b9ea99c5b5b3..5d0abd02aa83 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -290,7 +290,8 @@ void fuse_ctl_remove_conn(struct fuse_conn *fc)
drop_nlink(d_inode(fuse_control_sb->s_root));
}

-static int fuse_ctl_fill_super(struct super_block *sb, void *data, int silent)
+static int fuse_ctl_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
static const struct tree_descr empty_descr = {""};
struct fuse_conn *fc;
@@ -317,9 +318,11 @@ static int fuse_ctl_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *fuse_ctl_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *raw_data)
+ int flags, const char *dev_name,
+ void *raw_data, size_t data_size)
{
- return mount_single(fs_type, flags, raw_data, fuse_ctl_fill_super);
+ return mount_single(fs_type, flags, raw_data, data_size,
+ fuse_ctl_fill_super);
}

static void fuse_ctl_kill_sb(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index ef309958e060..e150d078419f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -138,7 +138,8 @@ static void fuse_evict_inode(struct inode *inode)
}
}

-static int fuse_remount_fs(struct super_block *sb, int *flags, char *data)
+static int fuse_remount_fs(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
if (*flags & SB_MANDLOCK)
@@ -1043,7 +1044,8 @@ void fuse_dev_free(struct fuse_dev *fud)
}
EXPORT_SYMBOL_GPL(fuse_dev_free);

-static int fuse_fill_super(struct super_block *sb, void *data, int silent)
+static int fuse_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct fuse_dev *fud;
struct fuse_conn *fc;
@@ -1187,9 +1189,10 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)

static struct dentry *fuse_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *raw_data)
+ void *raw_data, size_t data_size)
{
- return mount_nodev(fs_type, flags, raw_data, fuse_fill_super);
+ return mount_nodev(fs_type, flags, raw_data, data_size,
+ fuse_fill_super);
}

static void fuse_kill_sb_anon(struct super_block *sb)
@@ -1217,9 +1220,10 @@ MODULE_ALIAS_FS("fuse");
#ifdef CONFIG_BLOCK
static struct dentry *fuse_mount_blk(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *raw_data)
+ void *raw_data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, raw_data, fuse_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, raw_data, data_size,
+ fuse_fill_super);
}

static void fuse_kill_sb_blk(struct super_block *sb)
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 3ba3f167641c..a8a664eed01e 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1239,6 +1239,7 @@ static int test_gfs2_super(struct super_block *s, void *ptr)
* @flags: Mount flags
* @dev_name: The name of the device
* @data: The mount arguments
+ * @data_size: The size of the mount arguments
*
* Q. Why not use get_sb_bdev() ?
* A. We need to select one of two root directories to mount, independent
@@ -1248,7 +1249,7 @@ static int test_gfs2_super(struct super_block *s, void *ptr)
*/

static struct dentry *gfs2_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name, void *data, size_t data_size)
{
struct block_device *bdev;
struct super_block *s;
@@ -1345,7 +1346,8 @@ static int set_meta_super(struct super_block *s, void *ptr)
}

static struct dentry *gfs2_mount_meta(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
struct super_block *s;
struct gfs2_sbd *sdp;
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index cf5c7f3080d2..a2add54e63e3 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1228,11 +1228,13 @@ static int gfs2_statfs(struct dentry *dentry, struct kstatfs *buf)
* @sb: the filesystem
* @flags: the remount flags
* @data: extra data passed in (not used right now)
+ * @data_size: size of the extra data
*
* Returns: errno
*/

-static int gfs2_remount_fs(struct super_block *sb, int *flags, char *data)
+static int gfs2_remount_fs(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct gfs2_sbd *sdp = sb->s_fs_info;
struct gfs2_args args = sdp->sd_args; /* Default to current settings */
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index 173876782f73..e739b381b041 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -111,7 +111,8 @@ static int hfs_statfs(struct dentry *dentry, struct kstatfs *buf)
return 0;
}

-static int hfs_remount(struct super_block *sb, int *flags, char *data)
+static int hfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_NODIRATIME;
@@ -382,7 +383,8 @@ static int parse_options(char *options, struct hfs_sb_info *hsb)
* hfs_btree_init() to get the necessary data about the extents and
* catalog B-trees and, finally, reading the root inode into memory.
*/
-static int hfs_fill_super(struct super_block *sb, void *data, int silent)
+static int hfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct hfs_sb_info *sbi;
struct hfs_find_data fd;
@@ -458,9 +460,11 @@ static int hfs_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *hfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, hfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ hfs_fill_super);
}

static struct file_system_type hfs_fs_type = {
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 513c357c734b..758e2315be60 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -326,7 +326,8 @@ static int hfsplus_statfs(struct dentry *dentry, struct kstatfs *buf)
return 0;
}

-static int hfsplus_remount(struct super_block *sb, int *flags, char *data)
+static int hfsplus_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
if ((bool)(*flags & SB_RDONLY) == sb_rdonly(sb))
@@ -371,7 +372,8 @@ static const struct super_operations hfsplus_sops = {
.show_options = hfsplus_show_options,
};

-static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
+static int hfsplus_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct hfsplus_vh *vhdr;
struct hfsplus_sb_info *sbi;
@@ -640,9 +642,11 @@ static void hfsplus_destroy_inode(struct inode *inode)
#define HFSPLUS_INODE_SIZE sizeof(struct hfsplus_inode_info)

static struct dentry *hfsplus_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, hfsplus_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ hfsplus_fill_super);
}

static struct file_system_type hfsplus_fs_type = {
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 3cd85eb5bbb1..09047b506764 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -923,7 +923,8 @@ static const struct inode_operations hostfs_link_iops = {
.get_link = hostfs_get_link,
};

-static int hostfs_fill_sb_common(struct super_block *sb, void *d, int silent)
+static int hostfs_fill_sb_common(struct super_block *sb,
+ void *d, size_t data_size, int silent)
{
struct inode *root_inode;
char *host_root_path, *req_root = d;
@@ -983,9 +984,9 @@ static int hostfs_fill_sb_common(struct super_block *sb, void *d, int silent)

static struct dentry *hostfs_read_sb(struct file_system_type *type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_nodev(type, flags, data, hostfs_fill_sb_common);
+ return mount_nodev(type, flags, data, data_size, hostfs_fill_sb_common);
}

static void hostfs_kill_sb(struct super_block *s)
diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c
index f2c3ebcd309c..53e585b27c05 100644
--- a/fs/hpfs/super.c
+++ b/fs/hpfs/super.c
@@ -445,7 +445,8 @@ HPFS filesystem options:\n\
\n");
}

-static int hpfs_remount_fs(struct super_block *s, int *flags, char *data)
+static int hpfs_remount_fs(struct super_block *s, int *flags,
+ char *data, size_t data_size)
{
kuid_t uid;
kgid_t gid;
@@ -540,7 +541,8 @@ static const struct super_operations hpfs_sops =
.show_options = hpfs_show_options,
};

-static int hpfs_fill_super(struct super_block *s, void *options, int silent)
+static int hpfs_fill_super(struct super_block *s,
+ void *options, size_t data_size, int silent)
{
struct buffer_head *bh0, *bh1, *bh2;
struct hpfs_boot_block *bootblock;
@@ -757,9 +759,10 @@ bail2: brelse(bh0);
}

static struct dentry *hpfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, hpfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ hpfs_fill_super);
}

static struct file_system_type hpfs_fs_type = {
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d508c7844681..76fb8eb2bea8 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1220,7 +1220,8 @@ hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig)
}

static int
-hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
+hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
int ret;
struct hugetlbfs_config config;
@@ -1279,9 +1280,10 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_nodev(fs_type, flags, data, hugetlbfs_fill_super);
+ return mount_nodev(fs_type, flags, data, data_size,
+ hugetlbfs_fill_super);
}

static struct file_system_type hugetlbfs_fs_type = {
@@ -1420,10 +1422,11 @@ static int __init init_hugetlbfs_fs(void)
for_each_hstate(h) {
char buf[50];
unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
+ int n;

- snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
+ n = snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
- buf);
+ buf, n + 1);

if (IS_ERR(hugetlbfs_vfsmount[i])) {
pr_err("Cannot mount internal hugetlbfs for "
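
A note on the hugetlbfs hunk above: it passes `n + 1` as the data size because snprintf() returns the string length without the trailing NUL, so the extra byte accounts for the terminator. Below is a minimal userspace sketch of that calculation. The helper name format_pagesize_option() and its truncation handling are assumptions for illustration only; they are not part of this series, and the kernel code does not check for truncation because the 50-byte buffer is always large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace helper (not part of the patch): format the
 * "pagesize=<N>K" option string and return the number of bytes the
 * option occupies including the trailing NUL, which is the value the
 * hunk above passes as the data size.  Returns 0 on error or truncation.
 */
static size_t format_pagesize_option(char *buf, size_t buflen,
				     unsigned int ps_kb)
{
	int n;

	assert(buf != NULL && buflen > 0);

	/* snprintf() returns the length it wanted to write, excluding the NUL. */
	n = snprintf(buf, buflen, "pagesize=%uK", ps_kb);
	if (n < 0 || (size_t)n >= buflen)
		return 0;	/* encoding error or output truncated */

	return (size_t)n + 1;	/* string length plus terminator */
}
```

With a 50-byte buffer and ps_kb = 2048, the helper writes "pagesize=2048K" and returns the string length plus one, matching what the hunk hands to kern_mount_data().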
diff --git a/fs/internal.h b/fs/internal.h
index e08972db0303..1afa522c5f30 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -98,10 +98,10 @@ extern struct file *get_empty_filp(void);
/*
* super.c
*/
-extern int do_remount_sb(struct super_block *, int, void *, int);
+extern int do_remount_sb(struct super_block *, int, void *, size_t, int);
extern bool trylock_super(struct super_block *sb);
extern struct dentry *mount_fs(struct file_system_type *,
- int, const char *, void *);
+ int, const char *, void *, size_t);
extern struct super_block *user_get_super(dev_t);

/*
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index ec3fba7d492f..71138cbed995 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -111,7 +111,8 @@ static void destroy_inodecache(void)
kmem_cache_destroy(isofs_inode_cachep);
}

-static int isofs_remount(struct super_block *sb, int *flags, char *data)
+static int isofs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
if (!(*flags & SB_RDONLY))
@@ -619,7 +620,8 @@ static bool rootdir_empty(struct super_block *sb, unsigned long block)
* Note: a check_disk_change() has been done immediately prior
* to this call, so we don't need to check again.
*/
-static int isofs_fill_super(struct super_block *s, void *data, int silent)
+static int isofs_fill_super(struct super_block *s, void *data, size_t data_size,
+ int silent)
{
struct buffer_head *bh = NULL, *pri_bh = NULL;
struct hs_primary_descriptor *h_pri = NULL;
@@ -1558,9 +1560,10 @@ struct inode *__isofs_iget(struct super_block *sb,
}

static struct dentry *isofs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, isofs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ isofs_fill_super);
}

static struct file_system_type iso9660_fs_type = {
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 87bdf0f4cba1..c4f220f1a531 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -238,7 +238,8 @@ static int jffs2_parse_options(struct jffs2_sb_info *c, char *data)
return 0;
}

-static int jffs2_remount_fs(struct super_block *sb, int *flags, char *data)
+static int jffs2_remount_fs(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct jffs2_sb_info *c = JFFS2_SB_INFO(sb);
int err;
@@ -267,7 +268,8 @@ static const struct super_operations jffs2_super_operations =
/*
* fill in the superblock
*/
-static int jffs2_fill_super(struct super_block *sb, void *data, int silent)
+static int jffs2_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct jffs2_sb_info *c;
int ret;
@@ -312,9 +314,9 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)

static struct dentry *jffs2_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_mtd(fs_type, flags, dev_name, data, jffs2_fill_super);
+ return mount_mtd(fs_type, flags, dev_name, data, data_size, jffs2_fill_super);
}

static void jffs2_put_super (struct super_block *sb)
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 1b9264fd54b6..88f30ff12564 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -456,7 +456,8 @@ static int parse_options(char *options, struct super_block *sb, s64 *newLVSize,
return 0;
}

-static int jfs_remount(struct super_block *sb, int *flags, char *data)
+static int jfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
s64 newLVSize = 0;
int rc = 0;
@@ -516,7 +517,8 @@ static int jfs_remount(struct super_block *sb, int *flags, char *data)
return 0;
}

-static int jfs_fill_super(struct super_block *sb, void *data, int silent)
+static int jfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct jfs_sb_info *sbi;
struct inode *inode;
@@ -698,9 +700,10 @@ static int jfs_unfreeze(struct super_block *sb)
}

static struct dentry *jfs_do_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, jfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ jfs_fill_super);
}

static int jfs_sync_fs(struct super_block *sb, int wait)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index ff2716f9322e..f70e0b69e714 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -22,7 +22,8 @@

struct kmem_cache *kernfs_node_cache;

-static int kernfs_sop_remount_fs(struct super_block *sb, int *flags, char *data)
+static int kernfs_sop_remount_fs(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct kernfs_root *root = kernfs_info(sb)->root;
struct kernfs_syscall_ops *scops = root->syscall_ops;
diff --git a/fs/libfs.c b/fs/libfs.c
index 0fb590d79f30..9f1f4884b7cc 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -578,7 +578,7 @@ int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *c
spin_lock(&pin_fs_lock);
if (unlikely(!*mount)) {
spin_unlock(&pin_fs_lock);
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL);
+ mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
spin_lock(&pin_fs_lock);
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index 72e308c3e66b..3d91d9096b24 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -22,7 +22,8 @@
static int minix_write_inode(struct inode *inode,
struct writeback_control *wbc);
static int minix_statfs(struct dentry *dentry, struct kstatfs *buf);
-static int minix_remount (struct super_block * sb, int * flags, char * data);
+static int minix_remount (struct super_block * sb, int * flags,
+ char * data, size_t data_size);

static void minix_evict_inode(struct inode *inode)
{
@@ -118,7 +119,8 @@ static const struct super_operations minix_sops = {
.remount_fs = minix_remount,
};

-static int minix_remount (struct super_block * sb, int * flags, char * data)
+static int minix_remount (struct super_block * sb, int * flags,
+ char * data, size_t data_size)
{
struct minix_sb_info * sbi = minix_sb(sb);
struct minix_super_block * ms;
@@ -155,7 +157,8 @@ static int minix_remount (struct super_block * sb, int * flags, char * data)
return 0;
}

-static int minix_fill_super(struct super_block *s, void *data, int silent)
+static int minix_fill_super(struct super_block *s, void *data, size_t data_size,
+ int silent)
{
struct buffer_head *bh;
struct buffer_head **map;
@@ -651,9 +654,10 @@ void minix_truncate(struct inode * inode)
}

static struct dentry *minix_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, minix_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ minix_fill_super);
}

static struct file_system_type minix_fs_type = {
diff --git a/fs/namespace.c b/fs/namespace.c
index 1c41ab9332ee..a6ab1137f8d2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1020,7 +1020,8 @@ static struct mount *skip_mnt_tree(struct mount *p)
}

struct vfsmount *
-vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
+vfs_kern_mount(struct file_system_type *type, int flags, const char *name,
+ void *data, size_t data_size)
{
struct mount *mnt;
struct dentry *root;
@@ -1035,7 +1036,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
if (flags & SB_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;

- root = mount_fs(type, flags, name, data);
+ root = mount_fs(type, flags, name, data, data_size);
if (IS_ERR(root)) {
mnt_free_id(mnt);
free_vfsmnt(mnt);
@@ -1055,7 +1056,7 @@ EXPORT_SYMBOL_GPL(vfs_kern_mount);

struct vfsmount *
vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
- const char *name, void *data)
+ const char *name, void *data, size_t data_size)
{
/* Until it is worked out how to pass the user namespace
* through from the parent mount to the submount don't support
@@ -1064,7 +1065,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
if (mountpoint->d_sb->s_user_ns != &init_user_ns)
return ERR_PTR(-EPERM);

- return vfs_kern_mount(type, SB_SUBMOUNT, name, data);
+ return vfs_kern_mount(type, SB_SUBMOUNT, name, data, data_size);
}
EXPORT_SYMBOL_GPL(vfs_submount);

@@ -1595,7 +1596,7 @@ static int do_umount(struct mount *mnt, int flags)
return -EPERM;
down_write(&sb->s_umount);
if (!sb_rdonly(sb))
- retval = do_remount_sb(sb, SB_RDONLY, NULL, 0);
+ retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0);
up_write(&sb->s_umount);
return retval;
}
@@ -2288,7 +2289,7 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
* on it - tough luck.
*/
static int do_remount(struct path *path, int ms_flags, int sb_flags,
- int mnt_flags, void *data)
+ int mnt_flags, void *data, size_t data_size)
{
int err;
struct super_block *sb = path->mnt->mnt_sb;
@@ -2327,7 +2328,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
return -EPERM;
}

- err = security_sb_remount(sb, data);
+ err = security_sb_remount(sb, data, data_size);
if (err)
return err;

@@ -2337,7 +2338,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
else if (!capable(CAP_SYS_ADMIN))
err = -EPERM;
else
- err = do_remount_sb(sb, sb_flags, data, 0);
+ err = do_remount_sb(sb, sb_flags, data, data_size, 0);
if (!err) {
lock_mount_hash();
mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
@@ -2503,7 +2504,8 @@ static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
* namespace's tree
*/
static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
- int mnt_flags, const char *name, void *data)
+ int mnt_flags, const char *name,
+ void *data, size_t data_size)
{
struct file_system_type *type;
struct vfsmount *mnt;
@@ -2516,7 +2518,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
if (!type)
return -ENODEV;

- mnt = vfs_kern_mount(type, sb_flags, name, data);
+ mnt = vfs_kern_mount(type, sb_flags, name, data, data_size);
if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
!mnt->mnt_sb->s_subtype)
mnt = fs_set_subtype(mnt, fstype);
@@ -2772,6 +2774,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
{
struct path path;
unsigned int mnt_flags = 0, sb_flags;
+ size_t data_size = data_page ? PAGE_SIZE : 0;
int retval = 0;

/* Discard magic */
@@ -2790,8 +2793,8 @@ long do_mount(const char *dev_name, const char __user *dir_name,
if (retval)
return retval;

- retval = security_sb_mount(dev_name, &path,
- type_page, flags, data_page);
+ retval = security_sb_mount(dev_name, &path, type_page, flags,
+ data_page, data_size);
if (!retval && !may_mount())
retval = -EPERM;
if (!retval && (flags & SB_MANDLOCK) && !may_mandlock())
@@ -2838,7 +2841,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,

if (flags & MS_REMOUNT)
retval = do_remount(&path, flags, sb_flags, mnt_flags,
- data_page);
+ data_page, data_size);
else if (flags & MS_BIND)
retval = do_loopback(&path, dev_name, flags & MS_REC);
else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
@@ -2847,7 +2850,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
retval = do_move_mount(&path, dev_name);
else
retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
- dev_name, data_page);
+ dev_name, data_page, data_size);
dput_out:
path_put(&path);
return retval;
@@ -3237,7 +3240,7 @@ static void __init init_mount_tree(void)
type = get_fs_type("rootfs");
if (!type)
panic("Can't find rootfs type");
- mnt = vfs_kern_mount(type, 0, "rootfs", NULL);
+ mnt = vfs_kern_mount(type, 0, "rootfs", NULL, 0);
put_filesystem(type);
if (IS_ERR(mnt))
panic("Can't create rootfs");
@@ -3299,10 +3302,11 @@ void put_mnt_ns(struct mnt_namespace *ns)
free_mnt_ns(ns);
}

-struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
+struct vfsmount *kern_mount_data(struct file_system_type *type,
+ void *data, size_t data_size)
{
struct vfsmount *mnt;
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data);
+ mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
if (!IS_ERR(mnt)) {
/*
* it is a longterm mount, don't release mnt until
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 8357ff69962f..db0f3ca3a35c 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -405,7 +405,7 @@ int nfs_set_sb_security(struct super_block *, struct dentry *, struct nfs_mount_
int nfs_clone_sb_security(struct super_block *, struct dentry *, struct nfs_mount_info *);
struct dentry *nfs_fs_mount_common(struct nfs_server *, int, const char *,
struct nfs_mount_info *, struct nfs_subversion *);
-struct dentry *nfs_fs_mount(struct file_system_type *, int, const char *, void *);
+struct dentry *nfs_fs_mount(struct file_system_type *, int, const char *, void *, size_t);
struct dentry * nfs_xdev_mount_common(struct file_system_type *, int,
const char *, struct nfs_mount_info *);
void nfs_kill_super(struct super_block *);
@@ -466,7 +466,7 @@ int nfs_show_options(struct seq_file *, struct dentry *);
int nfs_show_devname(struct seq_file *, struct dentry *);
int nfs_show_path(struct seq_file *, struct dentry *);
int nfs_show_stats(struct seq_file *, struct dentry *);
-int nfs_remount(struct super_block *sb, int *flags, char *raw_data);
+int nfs_remount(struct super_block *sb, int *flags, char *raw_data, size_t data_size);

/* write.c */
extern void nfs_pageio_init_write(struct nfs_pageio_descriptor *pgio,
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index e5686be67be8..df9e87331558 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -216,7 +216,8 @@ static struct vfsmount *nfs_do_clone_mount(struct nfs_server *server,
const char *devname,
struct nfs_clone_mount *mountdata)
{
- return vfs_submount(mountdata->dentry, &nfs_xdev_fs_type, devname, mountdata);
+ return vfs_submount(mountdata->dentry, &nfs_xdev_fs_type, devname,
+ mountdata, 0);
}

/**
diff --git a/fs/nfs/nfs4namespace.c b/fs/nfs/nfs4namespace.c
index 24f06dcc2b08..191cb4202056 100644
--- a/fs/nfs/nfs4namespace.c
+++ b/fs/nfs/nfs4namespace.c
@@ -278,7 +278,8 @@ static struct vfsmount *try_location(struct nfs_clone_mount *mountdata,
mountdata->hostname,
mountdata->mnt_path);

- mnt = vfs_submount(mountdata->dentry, &nfs4_referral_fs_type, page, mountdata);
+ mnt = vfs_submount(mountdata->dentry, &nfs4_referral_fs_type, page,
+ mountdata, 0);
if (!IS_ERR(mnt))
break;
}
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 6fb7cb6b3f4b..e72e5dbdfcd0 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -18,11 +18,11 @@
static int nfs4_write_inode(struct inode *inode, struct writeback_control *wbc);
static void nfs4_evict_inode(struct inode *inode);
static struct dentry *nfs4_remote_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *raw_data);
+ int flags, const char *dev_name, void *raw_data, size_t data_size);
static struct dentry *nfs4_referral_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *raw_data);
+ int flags, const char *dev_name, void *raw_data, size_t data_size);
static struct dentry *nfs4_remote_referral_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *raw_data);
+ int flags, const char *dev_name, void *raw_data, size_t data_size);

static struct file_system_type nfs4_remote_fs_type = {
.owner = THIS_MODULE,
@@ -105,7 +105,7 @@ static void nfs4_evict_inode(struct inode *inode)
*/
static struct dentry *
nfs4_remote_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *info)
+ const char *dev_name, void *info, size_t data_size)
{
struct nfs_mount_info *mount_info = info;
struct nfs_server *server;
@@ -127,7 +127,7 @@ nfs4_remote_mount(struct file_system_type *fs_type, int flags,
}

static struct vfsmount *nfs_do_root_mount(struct file_system_type *fs_type,
- int flags, void *data, const char *hostname)
+ int flags, void *data, size_t data_size, const char *hostname)
{
struct vfsmount *root_mnt;
char *root_devname;
@@ -142,7 +142,8 @@ static struct vfsmount *nfs_do_root_mount(struct file_system_type *fs_type,
snprintf(root_devname, len, "[%s]:/", hostname);
else
snprintf(root_devname, len, "%s:/", hostname);
- root_mnt = vfs_kern_mount(fs_type, flags, root_devname, data);
+ root_mnt = vfs_kern_mount(fs_type, flags, root_devname,
+ data, data_size);
kfree(root_devname);
return root_mnt;
}
@@ -247,8 +248,8 @@ struct dentry *nfs4_try_mount(int flags, const char *dev_name,

export_path = data->nfs_server.export_path;
data->nfs_server.export_path = "/";
- root_mnt = nfs_do_root_mount(&nfs4_remote_fs_type, flags, mount_info,
- data->nfs_server.hostname);
+ root_mnt = nfs_do_root_mount(&nfs4_remote_fs_type, flags, mount_info, 0,
+ data->nfs_server.hostname);
data->nfs_server.export_path = export_path;

res = nfs_follow_remote_path(root_mnt, export_path);
@@ -261,7 +262,8 @@ struct dentry *nfs4_try_mount(int flags, const char *dev_name,

static struct dentry *
nfs4_remote_referral_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *raw_data)
+ const char *dev_name,
+ void *raw_data, size_t data_size)
{
struct nfs_mount_info mount_info = {
.fill_super = nfs_fill_super,
@@ -294,7 +296,8 @@ nfs4_remote_referral_mount(struct file_system_type *fs_type, int flags,
* Create an NFS4 server record on referral traversal
*/
static struct dentry *nfs4_referral_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *raw_data)
+ int flags, const char *dev_name,
+ void *raw_data, size_t data_size)
{
struct nfs_clone_mount *data = raw_data;
char *export_path;
@@ -306,8 +309,8 @@ static struct dentry *nfs4_referral_mount(struct file_system_type *fs_type,
export_path = data->mnt_path;
data->mnt_path = "/";

- root_mnt = nfs_do_root_mount(&nfs4_remote_referral_fs_type,
- flags, data, data->hostname);
+ root_mnt = nfs_do_root_mount(&nfs4_remote_referral_fs_type, flags,
+ data, 0, data->hostname);
data->mnt_path = export_path;

res = nfs_follow_remote_path(root_mnt, export_path);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 5e470e233c83..b5f27d6999e5 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -287,7 +287,8 @@ static match_table_t nfs_vers_tokens = {
};

static struct dentry *nfs_xdev_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *raw_data);
+ int flags, const char *dev_name,
+ void *raw_data, size_t data_size);

struct file_system_type nfs_fs_type = {
.owner = THIS_MODULE,
@@ -1203,7 +1204,7 @@ static int nfs_get_option_ul_bound(substring_t args[], unsigned long *option,
* skipped as they are encountered. If there were no errors, return 1;
* otherwise return 0 (zero).
*/
-static int nfs_parse_mount_options(char *raw,
+static int nfs_parse_mount_options(char *raw, size_t raw_size,
struct nfs_parsed_mount_data *mnt)
{
char *p, *string, *secdata;
@@ -1221,7 +1222,7 @@ static int nfs_parse_mount_options(char *raw,
if (!secdata)
goto out_nomem;

- rc = security_sb_copy_data(raw, secdata);
+ rc = security_sb_copy_data(raw, raw_size, secdata);
if (rc)
goto out_security_failure;

@@ -2151,7 +2152,7 @@ static int nfs_validate_mount_data(struct file_system_type *fs_type,
}
#endif

-static int nfs_validate_text_mount_data(void *options,
+static int nfs_validate_text_mount_data(void *options, size_t data_size,
struct nfs_parsed_mount_data *args,
const char *dev_name)
{
@@ -2160,7 +2161,7 @@ static int nfs_validate_text_mount_data(void *options,
int max_pathlen = NFS_MAXPATHLEN;
struct sockaddr *sap = (struct sockaddr *)&args->nfs_server.address;

- if (nfs_parse_mount_options((char *)options, args) == 0)
+ if (nfs_parse_mount_options((char *)options, data_size, args) == 0)
return -EINVAL;

if (!nfs_verify_server_address(sap))
@@ -2243,7 +2244,7 @@ nfs_compare_remount_data(struct nfs_server *nfss,
}

int
-nfs_remount(struct super_block *sb, int *flags, char *raw_data)
+nfs_remount(struct super_block *sb, int *flags, char *raw_data, size_t data_size)
{
int error;
struct nfs_server *nfss = sb->s_fs_info;
@@ -2290,7 +2291,7 @@ nfs_remount(struct super_block *sb, int *flags, char *raw_data)

/* overwrite those values with any that were specified */
error = -EINVAL;
- if (!nfs_parse_mount_options((char *)options, data))
+ if (!nfs_parse_mount_options((char *)options, data_size, data))
goto out;

/*
@@ -2662,7 +2663,7 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
EXPORT_SYMBOL_GPL(nfs_fs_mount_common);

struct dentry *nfs_fs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *raw_data)
+ int flags, const char *dev_name, void *raw_data, size_t data_size)
{
struct nfs_mount_info mount_info = {
.fill_super = nfs_fill_super,
@@ -2680,7 +2681,8 @@ struct dentry *nfs_fs_mount(struct file_system_type *fs_type,
/* Validate the mount data */
error = nfs_validate_mount_data(fs_type, raw_data, mount_info.parsed, mount_info.mntfh, dev_name);
if (error == NFS_TEXT_DATA)
- error = nfs_validate_text_mount_data(raw_data, mount_info.parsed, dev_name);
+ error = nfs_validate_text_mount_data(raw_data, data_size,
+ mount_info.parsed, dev_name);
if (error < 0) {
mntroot = ERR_PTR(error);
goto out;
@@ -2724,7 +2726,7 @@ EXPORT_SYMBOL_GPL(nfs_kill_super);
*/
static struct dentry *
nfs_xdev_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *raw_data)
+ const char *dev_name, void *raw_data, size_t data_size)
{
struct nfs_clone_mount *data = raw_data;
struct nfs_mount_info mount_info = {
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index d107b4426f7e..661296305123 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1144,7 +1144,8 @@ static ssize_t write_v4_end_grace(struct file *file, char *buf, size_t size)
* populating the filesystem.
*/

-static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
+static int nfsd_fill_super(struct super_block * sb,
+ void * data, size_t data_size, int silent)
{
static const struct tree_descr nfsd_files[] = {
[NFSD_List] = {"exports", &exports_nfsd_operations, S_IRUGO},
@@ -1179,10 +1180,11 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
}

static struct dentry *nfsd_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
struct net *net = current->nsproxy->net_ns;
- return mount_ns(fs_type, flags, data, net, net->user_ns, nfsd_fill_super);
+ return mount_ns(fs_type, flags, data, data_size,
+ net, net->user_ns, nfsd_fill_super);
}

static void nfsd_umount(struct super_block *sb)
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 6ffeca84d7c3..3a21a1ab141f 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -69,7 +69,8 @@ struct kmem_cache *nilfs_segbuf_cachep;
struct kmem_cache *nilfs_btree_path_cache;

static int nilfs_setup_super(struct super_block *sb, int is_mount);
-static int nilfs_remount(struct super_block *sb, int *flags, char *data);
+static int nilfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size);

void __nilfs_msg(struct super_block *sb, const char *level, const char *fmt,
...)
@@ -1118,7 +1119,8 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
return err;
}

-static int nilfs_remount(struct super_block *sb, int *flags, char *data)
+static int nilfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct the_nilfs *nilfs = sb->s_fs_info;
unsigned long old_sb_flags;
@@ -1278,7 +1280,7 @@ static int nilfs_test_bdev_super(struct super_block *s, void *data)

static struct dentry *
nilfs_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name, void *data, size_t data_size)
{
struct nilfs_super_data sd;
struct super_block *s;
@@ -1346,7 +1348,7 @@ nilfs_mount(struct file_system_type *fs_type, int flags,
* Try remount to setup mount states if the current
* tree is not mounted and only snapshots use this sb.
*/
- err = nilfs_remount(s, &flags, data);
+ err = nilfs_remount(s, &flags, data, data_size);
if (err)
goto failed_super;
}
diff --git a/fs/nsfs.c b/fs/nsfs.c
index 60702d677bd4..f069eb6495b0 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -263,7 +263,8 @@ static const struct super_operations nsfs_ops = {
.show_path = nsfs_show_path,
};
static struct dentry *nsfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
return mount_pseudo(fs_type, "nsfs:", &nsfs_ops,
&ns_dentry_operations, NSFS_MAGIC);
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index bb7159f697f2..8501bbcceb5a 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -456,6 +456,7 @@ static inline int ntfs_clear_volume_flags(ntfs_volume *vol, VOLUME_FLAGS flags)
* @sb: superblock of mounted ntfs filesystem
* @flags: remount flags
* @opt: remount options string
+ * @data_size: size of the options string
*
* Change the mount options of an already mounted ntfs filesystem.
*
@@ -463,7 +464,8 @@ static inline int ntfs_clear_volume_flags(ntfs_volume *vol, VOLUME_FLAGS flags)
* ntfs_remount() returns successfully (i.e. returns 0). Otherwise,
* @sb->s_flags are not changed.
*/
-static int ntfs_remount(struct super_block *sb, int *flags, char *opt)
+static int ntfs_remount(struct super_block *sb, int *flags,
+ char *opt, size_t data_size)
{
ntfs_volume *vol = NTFS_SB(sb);

@@ -2694,6 +2696,7 @@ static const struct super_operations ntfs_sops = {
* ntfs_fill_super - mount an ntfs filesystem
* @sb: super block of ntfs filesystem to mount
* @opt: string containing the mount options
+ * @data_size: size of the mount options string
* @silent: silence error output
*
* ntfs_fill_super() is called by the VFS to mount the device described by @sb
@@ -2708,7 +2711,8 @@ static const struct super_operations ntfs_sops = {
*
* NOTE: @sb->s_flags contains the mount options flags.
*/
-static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
+static int ntfs_fill_super(struct super_block *sb, void *opt, size_t data_size,
+ const int silent)
{
ntfs_volume *vol;
struct buffer_head *bh;
@@ -3060,9 +3064,10 @@ struct kmem_cache *ntfs_index_ctx_cache;
DEFINE_MUTEX(ntfs_lock);

static struct dentry *ntfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, ntfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ ntfs_fill_super);
}

static struct file_system_type ntfs_fs_type = {
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 602c71f32740..642e471a6472 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -568,6 +568,7 @@ static int dlmfs_unlink(struct inode *dir,

static int dlmfs_fill_super(struct super_block * sb,
void * data,
+ size_t data_size,
int silent)
{
sb->s_maxbytes = MAX_LFS_FILESIZE;
@@ -617,9 +618,9 @@ static const struct inode_operations dlmfs_file_inode_operations = {
};

static struct dentry *dlmfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_nodev(fs_type, flags, data, dlmfs_fill_super);
+ return mount_nodev(fs_type, flags, data, data_size, dlmfs_fill_super);
}

static struct file_system_type dlmfs_fs_type = {
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 3415e0b09398..62237837a098 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -107,7 +107,8 @@ static int ocfs2_check_set_options(struct super_block *sb,
static int ocfs2_show_options(struct seq_file *s, struct dentry *root);
static void ocfs2_put_super(struct super_block *sb);
static int ocfs2_mount_volume(struct super_block *sb);
-static int ocfs2_remount(struct super_block *sb, int *flags, char *data);
+static int ocfs2_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size);
static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err);
static int ocfs2_initialize_mem_caches(void);
static void ocfs2_free_mem_caches(void);
@@ -633,7 +634,8 @@ static unsigned long long ocfs2_max_file_offset(unsigned int bbits,
return (((unsigned long long)bytes) << bitshift) - trim;
}

-static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
+static int ocfs2_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
int incompat_features;
int ret = 0;
@@ -999,7 +1001,8 @@ static void ocfs2_disable_quotas(struct ocfs2_super *osb)
}
}

-static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
+static int ocfs2_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct dentry *root;
int status, sector_size;
@@ -1236,9 +1239,10 @@ static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
static struct dentry *ocfs2_mount(struct file_system_type *fs_type,
int flags,
const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, ocfs2_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ ocfs2_fill_super);
}

static struct file_system_type ocfs2_fs_type = {
diff --git a/fs/omfs/inode.c b/fs/omfs/inode.c
index ee14af9e26f2..e5258fefcd2b 100644
--- a/fs/omfs/inode.c
+++ b/fs/omfs/inode.c
@@ -454,7 +454,8 @@ static int parse_options(char *options, struct omfs_sb_info *sbi)
return 1;
}

-static int omfs_fill_super(struct super_block *sb, void *data, int silent)
+static int omfs_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct buffer_head *bh, *bh2;
struct omfs_super_block *omfs_sb;
@@ -596,9 +597,11 @@ static int omfs_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *omfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, omfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ omfs_fill_super);
}

static struct file_system_type omfs_fs_type = {
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index 2200662a9bf1..6171138c76c7 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -366,7 +366,8 @@ static struct inode *openprom_iget(struct super_block *sb, ino_t ino)
return inode;
}

-static int openprom_remount(struct super_block *sb, int *flags, char *data)
+static int openprom_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_NOATIME;
@@ -380,7 +381,8 @@ static const struct super_operations openprom_sops = {
.remount_fs = openprom_remount,
};

-static int openprom_fill_super(struct super_block *s, void *data, int silent)
+static int openprom_fill_super(struct super_block *s,
+ void *data, size_t data_size, int silent)
{
struct inode *root_inode;
struct op_inode_info *oi;
@@ -415,9 +417,10 @@ static int openprom_fill_super(struct super_block *s, void *data, int silent)
}

static struct dentry *openprom_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, openprom_fill_super);
+ return mount_single(fs_type, flags, data, data_size,
+ openprom_fill_super);
}

static struct file_system_type openprom_fs_type = {
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index c29bb0ebc6bb..411725292d37 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -319,7 +319,7 @@ extern uint64_t orangefs_features;
struct dentry *orangefs_mount(struct file_system_type *fst,
int flags,
const char *devname,
- void *data);
+ void *data, size_t data_size);

void orangefs_kill_sb(struct super_block *sb);
int orangefs_remount(struct orangefs_sb_info_s *);
diff --git a/fs/orangefs/super.c b/fs/orangefs/super.c
index 10796d3fe27d..6862a4769a12 100644
--- a/fs/orangefs/super.c
+++ b/fs/orangefs/super.c
@@ -206,7 +206,8 @@ static int orangefs_statfs(struct dentry *dentry, struct kstatfs *buf)
* Remount as initiated by VFS layer. We just need to reparse the mount
* options, no need to signal pvfs2-client-core about it.
*/
-static int orangefs_remount_fs(struct super_block *sb, int *flags, char *data)
+static int orangefs_remount_fs(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
gossip_debug(GOSSIP_SUPER_DEBUG, "orangefs_remount_fs: called\n");
return parse_mount_options(sb, data, 1);
@@ -456,7 +457,7 @@ static int orangefs_fill_sb(struct super_block *sb,
struct dentry *orangefs_mount(struct file_system_type *fst,
int flags,
const char *devname,
- void *data)
+ void *data, size_t data_size)
{
int ret = -EINVAL;
struct super_block *sb = ERR_PTR(-EINVAL);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index e8551c97de51..b5548c5a19be 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -379,7 +379,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
return 0;
}

-static int ovl_remount(struct super_block *sb, int *flags, char *data)
+static int ovl_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct ovl_fs *ofs = sb->s_fs_info;

@@ -1355,7 +1356,8 @@ static struct ovl_entry *ovl_get_lowerstack(struct super_block *sb,
goto out;
}

-static int ovl_fill_super(struct super_block *sb, void *data, int silent)
+static int ovl_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct path upperpath = { };
struct dentry *root_dentry;
@@ -1493,9 +1495,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *ovl_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *raw_data)
+ const char *dev_name,
+ void *raw_data, size_t data_size)
{
- return mount_nodev(fs_type, flags, raw_data, ovl_fill_super);
+ return mount_nodev(fs_type, flags, raw_data, data_size, ovl_fill_super);
}

static struct file_system_type ovl_fs_type = {
diff --git a/fs/pipe.c b/fs/pipe.c
index 39d6f431da83..915032a76bfd 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1179,7 +1179,8 @@ static const struct super_operations pipefs_ops = {
* d_name - pipe: will go nicely and kill the special-casing in procfs.
*/
static struct dentry *pipefs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
return mount_pseudo(fs_type, "pipe:", &pipefs_ops,
&pipefs_dentry_operations, PIPEFS_MAGIC);
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cf3b74391ca..df65431c00be 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -490,7 +490,8 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
return inode;
}

-int proc_fill_super(struct super_block *s, void *data, int silent)
+int proc_fill_super(struct super_block *s, void *data, size_t data_size,
+ int silent)
{
struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
struct inode *root_inode;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 916ccc39073d..c0af86a18abe 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -206,7 +206,7 @@ extern const struct inode_operations proc_pid_link_inode_operations;
void proc_init_kmemcache(void);
void set_proc_pid_nlink(void);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
-extern int proc_fill_super(struct super_block *, void *data, int flags);
+extern int proc_fill_super(struct super_block *, void *, size_t, int);
extern void proc_entry_rundown(struct proc_dir_entry *);

/*
@@ -267,7 +267,7 @@ extern struct proc_dir_entry proc_root;
extern int proc_parse_options(char *options, struct pid_namespace *pid);

extern void proc_self_init(void);
-extern int proc_remount(struct super_block *, int *, char *);
+extern int proc_remount(struct super_block *, int *, char *, size_t);

/*
* task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 61b7340b357a..99ce06c4e1a2 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -78,7 +78,8 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
return 1;
}

-int proc_remount(struct super_block *sb, int *flags, char *data)
+int proc_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct pid_namespace *pid = sb->s_fs_info;

@@ -87,7 +88,8 @@ int proc_remount(struct super_block *sb, int *flags, char *data)
}

static struct dentry *proc_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
struct pid_namespace *ns;

@@ -98,7 +100,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
ns = task_active_pid_ns(current);
}

- return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+ return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
+ proc_fill_super);
}

static void proc_kill_sb(struct super_block *sb)
@@ -212,7 +215,7 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
{
struct vfsmount *mnt;

- mnt = kern_mount_data(&proc_fs_type, ns);
+ mnt = kern_mount_data(&proc_fs_type, ns, 0);
if (IS_ERR(mnt))
return PTR_ERR(mnt);

diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index 5fcb845b9fec..793258231096 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -271,7 +271,8 @@ static int pstore_show_options(struct seq_file *m, struct dentry *root)
return 0;
}

-static int pstore_remount(struct super_block *sb, int *flags, char *data)
+static int pstore_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
parse_options(data);
@@ -432,7 +433,8 @@ void pstore_get_records(int quiet)
inode_unlock(d_inode(root));
}

-static int pstore_fill_super(struct super_block *sb, void *data, int silent)
+static int pstore_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct inode *inode;

@@ -464,9 +466,9 @@ static int pstore_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *pstore_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, pstore_fill_super);
+ return mount_single(fs_type, flags, data, data_size, pstore_fill_super);
}

static void pstore_kill_sb(struct super_block *sb)
diff --git a/fs/qnx4/inode.c b/fs/qnx4/inode.c
index 3d46fe302fcb..be35529c8052 100644
--- a/fs/qnx4/inode.c
+++ b/fs/qnx4/inode.c
@@ -29,7 +29,8 @@ static const struct super_operations qnx4_sops;

static struct inode *qnx4_alloc_inode(struct super_block *sb);
static void qnx4_destroy_inode(struct inode *inode);
-static int qnx4_remount(struct super_block *sb, int *flags, char *data);
+static int qnx4_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size);
static int qnx4_statfs(struct dentry *, struct kstatfs *);

static const struct super_operations qnx4_sops =
@@ -40,7 +41,8 @@ static const struct super_operations qnx4_sops =
.remount_fs = qnx4_remount,
};

-static int qnx4_remount(struct super_block *sb, int *flags, char *data)
+static int qnx4_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct qnx4_sb_info *qs;

@@ -183,7 +185,8 @@ static const char *qnx4_checkroot(struct super_block *sb,
return "bitmap file not found.";
}

-static int qnx4_fill_super(struct super_block *s, void *data, int silent)
+static int qnx4_fill_super(struct super_block *s, void *data, size_t data_size,
+ int silent)
{
struct buffer_head *bh;
struct inode *root;
@@ -383,9 +386,10 @@ static void destroy_inodecache(void)
}

static struct dentry *qnx4_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, qnx4_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ qnx4_fill_super);
}

static struct file_system_type qnx4_fs_type = {
diff --git a/fs/qnx6/inode.c b/fs/qnx6/inode.c
index 4aeb26bcb4d0..a415c1b5f936 100644
--- a/fs/qnx6/inode.c
+++ b/fs/qnx6/inode.c
@@ -30,7 +30,8 @@ static const struct super_operations qnx6_sops;
static void qnx6_put_super(struct super_block *sb);
static struct inode *qnx6_alloc_inode(struct super_block *sb);
static void qnx6_destroy_inode(struct inode *inode);
-static int qnx6_remount(struct super_block *sb, int *flags, char *data);
+static int qnx6_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size);
static int qnx6_statfs(struct dentry *dentry, struct kstatfs *buf);
static int qnx6_show_options(struct seq_file *seq, struct dentry *root);

@@ -53,7 +54,8 @@ static int qnx6_show_options(struct seq_file *seq, struct dentry *root)
return 0;
}

-static int qnx6_remount(struct super_block *sb, int *flags, char *data)
+static int qnx6_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_RDONLY;
@@ -294,7 +296,8 @@ static struct buffer_head *qnx6_check_first_superblock(struct super_block *s,
static struct inode *qnx6_private_inode(struct super_block *s,
struct qnx6_root_node *p);

-static int qnx6_fill_super(struct super_block *s, void *data, int silent)
+static int qnx6_fill_super(struct super_block *s, void *data, size_t data_size,
+ int silent)
{
struct buffer_head *bh1 = NULL, *bh2 = NULL;
struct qnx6_super_block *sb1 = NULL, *sb2 = NULL;
@@ -643,9 +646,10 @@ static void destroy_inodecache(void)
}

static struct dentry *qnx6_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, qnx6_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ qnx6_fill_super);
}

static struct file_system_type qnx6_fs_type = {
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 11201b2d06b9..2e9b23b4a98b 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -217,7 +217,7 @@ static int ramfs_parse_options(char *data, struct ramfs_mount_opts *opts)
return 0;
}

-int ramfs_fill_super(struct super_block *sb, void *data, int silent)
+int ramfs_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
{
struct ramfs_fs_info *fsi;
struct inode *inode;
@@ -248,9 +248,9 @@ int ramfs_fill_super(struct super_block *sb, void *data, int silent)
}

struct dentry *ramfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_nodev(fs_type, flags, data, ramfs_fill_super);
+ return mount_nodev(fs_type, flags, data, data_size, ramfs_fill_super);
}

static void ramfs_kill_sb(struct super_block *sb)
diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 1fc934d24459..d8631cb38485 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -61,7 +61,8 @@ static int is_any_reiserfs_magic_string(struct reiserfs_super_block *rs)
is_reiserfs_jr(rs));
}

-static int reiserfs_remount(struct super_block *s, int *flags, char *data);
+static int reiserfs_remount(struct super_block *s, int *flags,
+ char *data, size_t data_size);
static int reiserfs_statfs(struct dentry *dentry, struct kstatfs *buf);

static int reiserfs_sync_fs(struct super_block *s, int wait)
@@ -1433,7 +1434,8 @@ static void handle_quota_files(struct super_block *s, char **qf_names,
}
#endif

-static int reiserfs_remount(struct super_block *s, int *mount_flags, char *arg)
+static int reiserfs_remount(struct super_block *s, int *mount_flags,
+ char *arg, size_t data_size)
{
struct reiserfs_super_block *rs;
struct reiserfs_transaction_handle th;
@@ -1898,7 +1900,8 @@ static int function2code(hashf_t func)
if (!(silent)) \
reiserfs_warning(s, id, __VA_ARGS__)

-static int reiserfs_fill_super(struct super_block *s, void *data, int silent)
+static int reiserfs_fill_super(struct super_block *s, void *data,
+ size_t data_size, int silent)
{
struct inode *root_inode;
struct reiserfs_transaction_handle th;
@@ -2600,9 +2603,10 @@ static ssize_t reiserfs_quota_write(struct super_block *sb, int type,

static struct dentry *get_super_block(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, reiserfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ reiserfs_fill_super);
}

static int __init init_reiserfs_fs(void)
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 8f06fd1f3d69..1c5b16ba3da7 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -448,7 +448,8 @@ static int romfs_statfs(struct dentry *dentry, struct kstatfs *buf)
/*
* remounting must involve read-only
*/
-static int romfs_remount(struct super_block *sb, int *flags, char *data)
+static int romfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_RDONLY;
@@ -482,7 +483,8 @@ static __u32 romfs_checksum(const void *data, int size)
/*
* fill in the superblock
*/
-static int romfs_fill_super(struct super_block *sb, void *data, int silent)
+static int romfs_fill_super(struct super_block *sb, void *data,
+ size_t data_size, int silent)
{
struct romfs_super_block *rsb;
struct inode *root;
@@ -575,16 +577,17 @@ static int romfs_fill_super(struct super_block *sb, void *data, int silent)
*/
static struct dentry *romfs_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
struct dentry *ret = ERR_PTR(-EINVAL);

#ifdef CONFIG_ROMFS_ON_MTD
- ret = mount_mtd(fs_type, flags, dev_name, data, romfs_fill_super);
+ ret = mount_mtd(fs_type, flags, dev_name, data, data_size,
+ romfs_fill_super);
#endif
#ifdef CONFIG_ROMFS_ON_BLOCK
if (ret == ERR_PTR(-EINVAL))
- ret = mount_bdev(fs_type, flags, dev_name, data,
+ ret = mount_bdev(fs_type, flags, dev_name, data, data_size,
romfs_fill_super);
#endif
return ret;
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index 8a73b97217c8..ed6881d97b3c 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -76,7 +76,8 @@ static const struct squashfs_decompressor *supported_squashfs_filesystem(short
}


-static int squashfs_fill_super(struct super_block *sb, void *data, int silent)
+static int squashfs_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
struct squashfs_sb_info *msblk;
struct squashfs_super_block *sblk = NULL;
@@ -370,7 +371,8 @@ static int squashfs_statfs(struct dentry *dentry, struct kstatfs *buf)
}


-static int squashfs_remount(struct super_block *sb, int *flags, char *data)
+static int squashfs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
sync_filesystem(sb);
*flags |= SB_RDONLY;
@@ -398,9 +400,11 @@ static void squashfs_put_super(struct super_block *sb)


static struct dentry *squashfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, squashfs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ squashfs_fill_super);
}


diff --git a/fs/super.c b/fs/super.c
index 5132a32e5ebc..c9d208b7999e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -836,11 +836,13 @@ struct super_block *user_get_super(dev_t dev)
* @sb: superblock in question
* @sb_flags: revised superblock flags
* @data: the rest of options
+ * @data_size: The size of the data
* @force: whether or not to force the change
*
* Alters the mount options of a mounted file system.
*/
-int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force)
+int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
+ size_t data_size, int force)
{
int retval;
int remount_ro;
@@ -883,7 +885,7 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force)
}

if (sb->s_op->remount_fs) {
- retval = sb->s_op->remount_fs(sb, &sb_flags, data);
+ retval = sb->s_op->remount_fs(sb, &sb_flags, data, data_size);
if (retval) {
if (!force)
goto cancel_readonly;
@@ -922,7 +924,7 @@ static void do_emergency_remount_callback(struct super_block *sb)
/*
* What lock protects sb->s_flags??
*/
- do_remount_sb(sb, SB_RDONLY, NULL, 1);
+ do_remount_sb(sb, SB_RDONLY, NULL, 0, 1);
}
up_write(&sb->s_umount);
}
@@ -1071,8 +1073,9 @@ static int ns_set_super(struct super_block *sb, void *data)
}

struct dentry *mount_ns(struct file_system_type *fs_type,
- int flags, void *data, void *ns, struct user_namespace *user_ns,
- int (*fill_super)(struct super_block *, void *, int))
+ int flags, void *data, size_t data_size,
+ void *ns, struct user_namespace *user_ns,
+ int (*fill_super)(struct super_block *, void *, size_t, int))
{
struct super_block *sb;

@@ -1089,7 +1092,7 @@ struct dentry *mount_ns(struct file_system_type *fs_type,

if (!sb->s_root) {
int err;
- err = fill_super(sb, data, flags & SB_SILENT ? 1 : 0);
+ err = fill_super(sb, data, data_size, flags & SB_SILENT ? 1 : 0);
if (err) {
deactivate_locked_super(sb);
return ERR_PTR(err);
@@ -1119,8 +1122,8 @@ static int test_bdev_super(struct super_block *s, void *data)
}

struct dentry *mount_bdev(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data,
- int (*fill_super)(struct super_block *, void *, int))
+ int flags, const char *dev_name, void *data, size_t data_size,
+ int (*fill_super)(struct super_block *, void *, size_t, int))
{
struct block_device *bdev;
struct super_block *s;
@@ -1172,7 +1175,7 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
s->s_mode = mode;
snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
sb_set_blocksize(s, block_size(bdev));
- error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
+ error = fill_super(s, data, data_size, flags & SB_SILENT ? 1 : 0);
if (error) {
deactivate_locked_super(s);
goto error;
@@ -1209,8 +1212,8 @@ EXPORT_SYMBOL(kill_block_super);
#endif

struct dentry *mount_nodev(struct file_system_type *fs_type,
- int flags, void *data,
- int (*fill_super)(struct super_block *, void *, int))
+ int flags, void *data, size_t data_size,
+ int (*fill_super)(struct super_block *, void *, size_t, int))
{
int error;
struct super_block *s = sget(fs_type, NULL, set_anon_super, flags, NULL);
@@ -1218,7 +1221,7 @@ struct dentry *mount_nodev(struct file_system_type *fs_type,
if (IS_ERR(s))
return ERR_CAST(s);

- error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
+ error = fill_super(s, data, data_size, flags & SB_SILENT ? 1 : 0);
if (error) {
deactivate_locked_super(s);
return ERR_PTR(error);
@@ -1234,8 +1237,8 @@ static int compare_single(struct super_block *s, void *p)
}

struct dentry *mount_single(struct file_system_type *fs_type,
- int flags, void *data,
- int (*fill_super)(struct super_block *, void *, int))
+ int flags, void *data, size_t data_size,
+ int (*fill_super)(struct super_block *, void *, size_t, int))
{
struct super_block *s;
int error;
@@ -1244,21 +1247,22 @@ struct dentry *mount_single(struct file_system_type *fs_type,
if (IS_ERR(s))
return ERR_CAST(s);
if (!s->s_root) {
- error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
+ error = fill_super(s, data, data_size, flags & SB_SILENT ? 1 : 0);
if (error) {
deactivate_locked_super(s);
return ERR_PTR(error);
}
s->s_flags |= SB_ACTIVE;
} else {
- do_remount_sb(s, flags, data, 0);
+ do_remount_sb(s, flags, data, data_size, 0);
}
return dget(s->s_root);
}
EXPORT_SYMBOL(mount_single);

struct dentry *
-mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
+mount_fs(struct file_system_type *type, int flags, const char *name,
+ void *data, size_t data_size)
{
struct dentry *root;
struct super_block *sb;
@@ -1270,12 +1274,12 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
if (!secdata)
goto out;

- error = security_sb_copy_data(data, secdata);
+ error = security_sb_copy_data(data, data_size, secdata);
if (error)
goto out_free_secdata;
}

- root = type->mount(type, flags, name, data);
+ root = type->mount(type, flags, name, data, data_size);
if (IS_ERR(root)) {
error = PTR_ERR(root);
goto out_free_secdata;
@@ -1293,7 +1297,7 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
smp_wmb();
sb->s_flags |= SB_BORN;

- error = security_sb_kern_mount(sb, flags, secdata);
+ error = security_sb_kern_mount(sb, flags, secdata, data_size);
if (error)
goto out_sb;

diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 92682fcc41f6..77302c35b0ff 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -21,7 +21,7 @@ static struct kernfs_root *sysfs_root;
struct kernfs_node *sysfs_root_kn;

static struct dentry *sysfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
struct dentry *root;
void *ns;
diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
index bec9f79adb25..47f66bbc4578 100644
--- a/fs/sysv/inode.c
+++ b/fs/sysv/inode.c
@@ -57,7 +57,8 @@ static int sysv_sync_fs(struct super_block *sb, int wait)
return 0;
}

-static int sysv_remount(struct super_block *sb, int *flags, char *data)
+static int sysv_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct sysv_sb_info *sbi = SYSV_SB(sb);

diff --git a/fs/sysv/super.c b/fs/sysv/super.c
index 89765ddfb738..275c7038eecd 100644
--- a/fs/sysv/super.c
+++ b/fs/sysv/super.c
@@ -349,7 +349,8 @@ static int complete_read_super(struct super_block *sb, int silent, int size)
return 1;
}

-static int sysv_fill_super(struct super_block *sb, void *data, int silent)
+static int sysv_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct buffer_head *bh1, *bh = NULL;
struct sysv_sb_info *sbi;
@@ -470,7 +471,8 @@ static int v7_sanity_check(struct super_block *sb, struct buffer_head *bh)
return 1;
}

-static int v7_fill_super(struct super_block *sb, void *data, int silent)
+static int v7_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct sysv_sb_info *sbi;
struct buffer_head *bh;
@@ -528,15 +530,17 @@ static int v7_fill_super(struct super_block *sb, void *data, int silent)
/* Every kernel module contains stuff like this. */

static struct dentry *sysv_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, sysv_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ sysv_fill_super);
}

static struct dentry *v7_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, v7_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ v7_fill_super);
}

static struct file_system_type sysv_fs_type = {
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index bea8ad876bf9..85b3f230e202 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -225,7 +225,8 @@ static int tracefs_apply_options(struct super_block *sb)
return 0;
}

-static int tracefs_remount(struct super_block *sb, int *flags, char *data)
+static int tracefs_remount(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
int err;
struct tracefs_fs_info *fsi = sb->s_fs_info;
@@ -264,7 +265,8 @@ static const struct super_operations tracefs_super_operations = {
.show_options = tracefs_show_options,
};

-static int trace_fill_super(struct super_block *sb, void *data, int silent)
+static int trace_fill_super(struct super_block *sb,
+ void *data, size_t data_size, int silent)
{
static const struct tree_descr trace_files[] = {{""}};
struct tracefs_fs_info *fsi;
@@ -299,9 +301,9 @@ static int trace_fill_super(struct super_block *sb, void *data, int silent)

static struct dentry *trace_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, trace_fill_super);
+ return mount_single(fs_type, flags, data, data_size, trace_fill_super);
}

static struct file_system_type trace_fs_type = {
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 6c397a389105..90144fecbf27 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1842,7 +1842,8 @@ static void ubifs_put_super(struct super_block *sb)
mutex_unlock(&c->umount_mutex);
}

-static int ubifs_remount_fs(struct super_block *sb, int *flags, char *data)
+static int ubifs_remount_fs(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
int err;
struct ubifs_info *c = sb->s_fs_info;
@@ -2105,7 +2106,7 @@ static int sb_set(struct super_block *sb, void *data)
}

static struct dentry *ubifs_mount(struct file_system_type *fs_type, int flags,
- const char *name, void *data)
+ const char *name, void *data, size_t data_size)
{
struct ubi_volume_desc *ubi;
struct ubifs_info *c;
diff --git a/fs/udf/super.c b/fs/udf/super.c
index 7949c338efa5..91212c33c8d7 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -87,10 +87,10 @@ enum {
enum { UDF_MAX_LINKS = 0xffff };

/* These are the "meat" - everything else is stuffing */
-static int udf_fill_super(struct super_block *, void *, int);
+static int udf_fill_super(struct super_block *, void *, size_t, int);
static void udf_put_super(struct super_block *);
static int udf_sync_fs(struct super_block *, int);
-static int udf_remount_fs(struct super_block *, int *, char *);
+static int udf_remount_fs(struct super_block *, int *, char *, size_t);
static void udf_load_logicalvolint(struct super_block *, struct kernel_extent_ad);
static int udf_find_fileset(struct super_block *, struct kernel_lb_addr *,
struct kernel_lb_addr *);
@@ -126,9 +126,11 @@ struct logicalVolIntegrityDescImpUse *udf_sb_lvidiu(struct super_block *sb)

/* UDF filesystem type */
static struct dentry *udf_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, udf_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ udf_fill_super);
}

static struct file_system_type udf_fstype = {
@@ -610,7 +612,8 @@ static int udf_parse_options(char *options, struct udf_options *uopt,
return 1;
}

-static int udf_remount_fs(struct super_block *sb, int *flags, char *options)
+static int udf_remount_fs(struct super_block *sb, int *flags,
+ char *options, size_t data_size)
{
struct udf_options uopt;
struct udf_sb_info *sbi = UDF_SB(sb);
@@ -2083,7 +2086,8 @@ u64 lvid_get_unique_id(struct super_block *sb)
return ret;
}

-static int udf_fill_super(struct super_block *sb, void *options, int silent)
+static int udf_fill_super(struct super_block *sb,
+ void *options, size_t data_size, int silent)
{
int ret = -EINVAL;
struct inode *inode = NULL;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 8254b8b3690f..b52917639e30 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -772,7 +772,8 @@ static u64 ufs_max_bytes(struct super_block *sb)
return res << uspi->s_bshift;
}

-static int ufs_fill_super(struct super_block *sb, void *data, int silent)
+static int ufs_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct ufs_sb_info * sbi;
struct ufs_sb_private_info * uspi;
@@ -1295,7 +1296,8 @@ static int ufs_fill_super(struct super_block *sb, void *data, int silent)
return -ENOMEM;
}

-static int ufs_remount (struct super_block *sb, int *mount_flags, char *data)
+static int ufs_remount (struct super_block *sb, int *mount_flags,
+ char *data, size_t data_size)
{
struct ufs_sb_private_info * uspi;
struct ufs_super_block_first * usb1;
@@ -1503,9 +1505,10 @@ static const struct super_operations ufs_super_ops = {
};

static struct dentry *ufs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, ufs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ ufs_fill_super);
}

static struct file_system_type ufs_fs_type = {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d71424052917..0d91c924b8f5 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1268,7 +1268,8 @@ STATIC int
xfs_fs_remount(
struct super_block *sb,
int *flags,
- char *options)
+ char *options,
+ size_t data_size)
{
struct xfs_mount *mp = XFS_M(sb);
xfs_sb_t *sbp = &mp->m_sb;
@@ -1607,6 +1608,7 @@ STATIC int
xfs_fs_fill_super(
struct super_block *sb,
void *data,
+ size_t data_size,
int silent)
{
struct inode *root;
@@ -1796,9 +1798,11 @@ xfs_fs_mount(
struct file_system_type *fs_type,
int flags,
const char *dev_name,
- void *data)
+ void *data,
+ size_t data_size)
{
- return mount_bdev(fs_type, flags, dev_name, data, xfs_fs_fill_super);
+ return mount_bdev(fs_type, flags, dev_name, data, data_size,
+ xfs_fs_fill_super);
}

static long
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index 3b0ba54cc4d5..a02de1b397ca 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -75,11 +75,11 @@ struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
struct dentry *debugfs_create_symlink(const char *name, struct dentry *parent,
const char *dest);

-typedef struct vfsmount *(*debugfs_automount_t)(struct dentry *, void *);
+typedef struct vfsmount *(*debugfs_automount_t)(struct dentry *, void *, size_t);
struct dentry *debugfs_create_automount(const char *name,
struct dentry *parent,
debugfs_automount_t f,
- void *data);
+ void *data, size_t data_size);

void debugfs_remove(struct dentry *dentry);
void debugfs_remove_recursive(struct dentry *dentry);
@@ -204,8 +204,8 @@ static inline struct dentry *debugfs_create_symlink(const char *name,

static inline struct dentry *debugfs_create_automount(const char *name,
struct dentry *parent,
- struct vfsmount *(*f)(void *),
- void *data)
+ struct vfsmount *(*f)(struct dentry *, void *, size_t),
+ void *data, size_t data_size)
{
return ERR_PTR(-ENODEV);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7f07977bdfd7..f7bb71b8e3df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1827,7 +1827,7 @@ struct super_operations {
int (*thaw_super) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
- int (*remount_fs) (struct super_block *, int *, char *);
+ int (*remount_fs) (struct super_block *, int *, char *, size_t);
void (*umount_begin) (struct super_block *);

int (*show_options)(struct seq_file *, struct dentry *);
@@ -2075,7 +2075,7 @@ struct file_system_type {
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
- const char *, void *);
+ const char *, void *, size_t);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
@@ -2094,26 +2094,27 @@ struct file_system_type {
#define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)

extern struct dentry *mount_ns(struct file_system_type *fs_type,
- int flags, void *data, void *ns, struct user_namespace *user_ns,
- int (*fill_super)(struct super_block *, void *, int));
+ int flags, void *data, size_t data_size,
+ void *ns, struct user_namespace *user_ns,
+ int (*fill_super)(struct super_block *, void *, size_t, int));
#ifdef CONFIG_BLOCK
extern struct dentry *mount_bdev(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data,
- int (*fill_super)(struct super_block *, void *, int));
+ int flags, const char *dev_name, void *data, size_t data_size,
+ int (*fill_super)(struct super_block *, void *, size_t, int));
#else
static inline struct dentry *mount_bdev(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data,
- int (*fill_super)(struct super_block *, void *, int))
+ int flags, const char *dev_name, void *data, size_t data_size,
+ int (*fill_super)(struct super_block *, void *, size_t, int))
{
return ERR_PTR(-ENODEV);
}
#endif
extern struct dentry *mount_single(struct file_system_type *fs_type,
- int flags, void *data,
- int (*fill_super)(struct super_block *, void *, int));
+ int flags, void *data, size_t data_size,
+ int (*fill_super)(struct super_block *, void *, size_t, int));
extern struct dentry *mount_nodev(struct file_system_type *fs_type,
- int flags, void *data,
- int (*fill_super)(struct super_block *, void *, int));
+ int flags, void *data, size_t data_size,
+ int (*fill_super)(struct super_block *, void *, size_t, int));
extern struct dentry *mount_subtree(struct vfsmount *mnt, const char *path);
void generic_shutdown_super(struct super_block *sb);
#ifdef CONFIG_BLOCK
@@ -2173,8 +2174,8 @@ mount_pseudo(struct file_system_type *fs_type, char *name,

extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
-extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
-#define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
+#define kern_mount(type) kern_mount_data(type, NULL, 0)
extern void kern_unmount(struct vfsmount *mnt);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 25e5f760a590..408357495d1e 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -155,6 +155,7 @@
* @type contains the filesystem type.
* @flags contains the mount flags.
* @data contains the filesystem-specific data.
+ * @data_size contains the size of the data.
* Return 0 if permission is granted.
* @sb_copy_data:
* Allow mount option data to be copied prior to parsing by the filesystem,
@@ -164,6 +165,7 @@
* specific options to avoid having to make filesystems aware of them.
* @type the type of filesystem being mounted.
* @orig the original mount data copied from userspace.
+ * @orig_size is the size of the original data.
* @copy copied data which will be passed to the security module.
* Returns 0 if the copy was successful.
* @sb_remount:
@@ -171,6 +173,7 @@
* are being made to those options.
* @sb superblock being remounted
* @data contains the filesystem-specific data.
+ * @data_size contains the size of the data.
* Return 0 if permission is granted.
* @sb_umount:
* Check permission before the @mnt file system is unmounted.
@@ -1514,13 +1517,15 @@ union security_list_options {

int (*sb_alloc_security)(struct super_block *sb);
void (*sb_free_security)(struct super_block *sb);
- int (*sb_copy_data)(char *orig, char *copy);
- int (*sb_remount)(struct super_block *sb, void *data);
- int (*sb_kern_mount)(struct super_block *sb, int flags, void *data);
+ int (*sb_copy_data)(char *orig, size_t orig_size, char *copy);
+ int (*sb_remount)(struct super_block *sb, void *data, size_t data_size);
+ int (*sb_kern_mount)(struct super_block *sb, int flags,
+ void *data, size_t data_size);
int (*sb_show_options)(struct seq_file *m, struct super_block *sb);
int (*sb_statfs)(struct dentry *dentry);
int (*sb_mount)(const char *dev_name, const struct path *path,
- const char *type, unsigned long flags, void *data);
+ const char *type, unsigned long flags,
+ void *data, size_t data_size);
int (*sb_umount)(struct vfsmount *mnt, int flags);
int (*sb_pivotroot)(const struct path *old_path, const struct path *new_path);
int (*sb_set_mnt_opts)(struct super_block *sb,
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 45b1f56c6c2f..8a1031a511c9 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -90,10 +90,11 @@ extern struct vfsmount *clone_private_mount(const struct path *path);
struct file_system_type;
extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
int flags, const char *name,
- void *data);
+ void *data, size_t data_size);
extern struct vfsmount *vfs_submount(const struct dentry *mountpoint,
struct file_system_type *type,
- const char *name, void *data);
+ const char *name,
+ void *data, size_t data_size);

extern void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list);
extern void mark_mounts_for_expiry(struct list_head *mounts);
diff --git a/include/linux/mtd/super.h b/include/linux/mtd/super.h
index f456230f9330..3f37c7cd711c 100644
--- a/include/linux/mtd/super.h
+++ b/include/linux/mtd/super.h
@@ -19,8 +19,8 @@
#include <linux/mount.h>

extern struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data,
- int (*fill_super)(struct super_block *, void *, int));
+ const char *dev_name, void *data, size_t data_size,
+ int (*fill_super)(struct super_block *, void *, size_t, int));
extern void kill_mtd_super(struct super_block *sb);


diff --git a/include/linux/ramfs.h b/include/linux/ramfs.h
index 5ef7d54caac2..6d64e6be9928 100644
--- a/include/linux/ramfs.h
+++ b/include/linux/ramfs.h
@@ -5,7 +5,7 @@
struct inode *ramfs_get_inode(struct super_block *sb, const struct inode *dir,
umode_t mode, dev_t dev);
extern struct dentry *ramfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data);
+ int flags, const char *dev_name, void *data, size_t data_size);

#ifdef CONFIG_MMU
static inline int
@@ -21,6 +21,6 @@ extern const struct file_operations ramfs_file_operations;
extern const struct vm_operations_struct generic_file_vm_ops;
extern int __init init_ramfs_fs(void);

-int ramfs_fill_super(struct super_block *sb, void *data, int silent);
+int ramfs_fill_super(struct super_block *sb, void *data, size_t data_size, int silent);

#endif
diff --git a/include/linux/security.h b/include/linux/security.h
index 857dc7574b4a..64cc080b9352 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -244,13 +244,13 @@ int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
unsigned int mnt_flags);
int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
-int security_sb_copy_data(char *orig, char *copy);
-int security_sb_remount(struct super_block *sb, void *data);
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data);
+int security_sb_copy_data(char *orig, size_t orig_size, char *copy);
+int security_sb_remount(struct super_block *sb, void *data, size_t data_size);
+int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size);
int security_sb_show_options(struct seq_file *m, struct super_block *sb);
int security_sb_statfs(struct dentry *dentry);
int security_sb_mount(const char *dev_name, const struct path *path,
- const char *type, unsigned long flags, void *data);
+ const char *type, unsigned long flags, void *data, size_t data_size);
int security_sb_umount(struct vfsmount *mnt, int flags);
int security_sb_pivotroot(const struct path *old_path, const struct path *new_path);
int security_sb_set_mnt_opts(struct super_block *sb,
@@ -596,17 +596,18 @@ static inline int security_sb_alloc(struct super_block *sb)
static inline void security_sb_free(struct super_block *sb)
{ }

-static inline int security_sb_copy_data(char *orig, char *copy)
+static inline int security_sb_copy_data(char *orig, size_t orig_size, char *copy)
{
return 0;
}

-static inline int security_sb_remount(struct super_block *sb, void *data)
+static inline int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
{
return 0;
}

-static inline int security_sb_kern_mount(struct super_block *sb, int flags, void *data)
+static inline int security_sb_kern_mount(struct super_block *sb, int flags,
+ void *data, size_t data_size)
{
return 0;
}
@@ -624,7 +625,7 @@ static inline int security_sb_statfs(struct dentry *dentry)

static inline int security_sb_mount(const char *dev_name, const struct path *path,
const char *type, unsigned long flags,
- void *data)
+ void *data, size_t data_size)
{
return 0;
}
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 73b5e655a76e..f170fc673047 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -49,7 +49,8 @@ static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
* Functions in mm/shmem.c called directly from elsewhere:
*/
extern int shmem_init(void);
-extern int shmem_fill_super(struct super_block *sb, void *data, int silent);
+extern int shmem_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent);
extern struct file *shmem_file_setup(const char *name,
loff_t size, unsigned long flags);
extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
diff --git a/init/do_mounts.c b/init/do_mounts.c
index ea6f21bb9440..d4fc2a5afdb6 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -606,7 +606,7 @@ void __init prepare_namespace(void)

static bool is_tmpfs;
static struct dentry *rootfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
static unsigned long once;
void *fill = ramfs_fill_super;
@@ -617,7 +617,7 @@ static struct dentry *rootfs_mount(struct file_system_type *fs_type,
if (IS_ENABLED(CONFIG_TMPFS) && is_tmpfs)
fill = shmem_fill_super;

- return mount_nodev(fs_type, flags, data, fill);
+ return mount_nodev(fs_type, flags, data, data_size, fill);
}

static struct file_system_type rootfs_fs_type = {
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index a808f29d4c5a..910c3c7532e6 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -322,7 +322,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
return ERR_PTR(ret);
}

-static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
+static int mqueue_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
{
struct inode *inode;
struct ipc_namespace *ns = sb->s_fs_info;
@@ -345,7 +345,7 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)

static struct dentry *mqueue_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
struct ipc_namespace *ns;
if (flags & SB_KERNMOUNT) {
@@ -354,7 +354,8 @@ static struct dentry *mqueue_mount(struct file_system_type *fs_type,
} else {
ns = current->nsproxy->ipc_ns;
}
- return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
+ return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
+ mqueue_fill_super);
}

static void init_once(void *foo)
@@ -1536,7 +1537,7 @@ int mq_init_ns(struct ipc_namespace *ns)
ns->mq_msg_default = DFLT_MSG;
ns->mq_msgsize_default = DFLT_MSGSIZE;

- ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
+ ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns, 0);
if (IS_ERR(ns->mq_mnt)) {
int err = PTR_ERR(ns->mq_mnt);
ns->mq_mnt = NULL;
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index bf6da59ae0d0..d663a1efcfcc 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -480,7 +480,7 @@ static int bpf_parse_options(char *data, struct bpf_mount_opts *opts)
return 0;
}

-static int bpf_fill_super(struct super_block *sb, void *data, int silent)
+static int bpf_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
{
static const struct tree_descr bpf_rfiles[] = { { "" } };
struct bpf_mount_opts opts;
@@ -506,9 +506,10 @@ static int bpf_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *bpf_mount(struct file_system_type *type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name, void *data,
+ size_t data_size)
{
- return mount_nodev(type, flags, data, bpf_fill_super);
+ return mount_nodev(type, flags, data, data_size, bpf_fill_super);
}

static struct file_system_type bpf_fs_type = {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 12883656e63e..af2baf9985bd 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2010,7 +2010,7 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,

static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
- void *data)
+ void *data, size_t data_size)
{
struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
struct dentry *dentry;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b42037e6e81d..3c8ef37879f0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -316,7 +316,8 @@ static inline bool is_in_v2_mode(void)
* silently switch it to mount "cgroup" instead
*/
static struct dentry *cpuset_mount(struct file_system_type *fs_type,
- int flags, const char *unused_dev_name, void *data)
+ int flags, const char *unused_dev_name,
+ void *data, size_t data_size)
{
struct file_system_type *cgroup_fs = get_fs_type("cgroup");
struct dentry *ret = ERR_PTR(-ENODEV);
@@ -324,8 +325,8 @@ static struct dentry *cpuset_mount(struct file_system_type *fs_type,
char mountopts[] =
"cpuset,noprefix,"
"release_agent=/sbin/cpuset_release_agent";
- ret = cgroup_fs->mount(cgroup_fs, flags,
- unused_dev_name, mountopts);
+ ret = cgroup_fs->mount(cgroup_fs, flags, unused_dev_name,
+ mountopts, data_size);
put_filesystem(cgroup_fs);
}
return ret;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 414d7210b2ec..0dd9078a0efb 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7961,7 +7961,8 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
ftrace_init_tracefs(tr, d_tracer);
}

-static struct vfsmount *trace_automount(struct dentry *mntpt, void *ingore)
+static struct vfsmount *trace_automount(struct dentry *mntpt,
+ void *data, size_t data_size)
{
struct vfsmount *mnt;
struct file_system_type *type;
@@ -7974,7 +7975,7 @@ static struct vfsmount *trace_automount(struct dentry *mntpt, void *ingore)
type = get_fs_type("tracefs");
if (!type)
return NULL;
- mnt = vfs_submount(mntpt, type, "tracefs", NULL);
+ mnt = vfs_submount(mntpt, type, "tracefs", NULL, 0);
put_filesystem(type);
if (IS_ERR(mnt))
return NULL;
@@ -8010,7 +8011,7 @@ struct dentry *tracing_init_dentry(void)
* work with the newer kerenl.
*/
tr->dir = debugfs_create_automount("tracing", NULL,
- trace_automount, NULL);
+ trace_automount, NULL, 0);
if (!tr->dir) {
pr_warn_once("Could not create debugfs directory 'tracing'\n");
return ERR_PTR(-ENOMEM);
diff --git a/mm/shmem.c b/mm/shmem.c
index 9d6c7e595415..76838f26822f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3602,7 +3602,8 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,

}

-static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
+static int shmem_remount_fs(struct super_block *sb, int *flags,
+ char *data, size_t data_size)
{
struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
struct shmem_sb_info config = *sbinfo;
@@ -3772,7 +3773,8 @@ static void shmem_put_super(struct super_block *sb)
sb->s_fs_info = NULL;
}

-int shmem_fill_super(struct super_block *sb, void *data, int silent)
+int shmem_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct inode *inode;
struct shmem_sb_info *sbinfo;
@@ -3986,9 +3988,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
};

static struct dentry *shmem_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
- return mount_nodev(fs_type, flags, data, shmem_fill_super);
+ return mount_nodev(fs_type, flags, data, data_size, shmem_fill_super);
}

static struct file_system_type shmem_fs_type = {
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 61cb05dc950c..dc60f6d89f31 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1814,7 +1814,8 @@ static enum fullness_group putback_zspage(struct size_class *class,

#ifdef CONFIG_COMPACTION
static struct dentry *zs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
static const struct dentry_operations ops = {
.d_dname = simple_dname,
diff --git a/net/socket.c b/net/socket.c
index f10f1d947c78..34d3dd0f8ba3 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -353,7 +353,8 @@ static const struct xattr_handler *sockfs_xattr_handlers[] = {
};

static struct dentry *sockfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
return mount_pseudo_xattr(fs_type, "socket:", &sockfs_ops,
sockfs_xattr_handlers,
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 4fda18d47e2c..023c2a6389e7 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -1367,7 +1367,7 @@ rpc_gssd_dummy_depopulate(struct dentry *pipe_dentry)
}

static int
-rpc_fill_super(struct super_block *sb, void *data, int silent)
+rpc_fill_super(struct super_block *sb, void *data, size_t data_size, int silent)
{
struct inode *inode;
struct dentry *root, *gssd_dentry;
@@ -1430,10 +1430,11 @@ EXPORT_SYMBOL_GPL(gssd_running);

static struct dentry *
rpc_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data, size_t data_size)
{
struct net *net = current->nsproxy->net_ns;
- return mount_ns(fs_type, flags, data, net, net->user_ns, rpc_fill_super);
+ return mount_ns(fs_type, flags, data, data_size,
+ net, net->user_ns, rpc_fill_super);
}

static void rpc_kill_sb(struct super_block *sb)
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 949dd8a48164..04548c8102f3 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -137,7 +137,8 @@ static const struct super_operations aafs_super_ops = {
.show_path = aafs_show_path,
};

-static int fill_super(struct super_block *sb, void *data, int silent)
+static int fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
static struct tree_descr files[] = { {""} };
int error;
@@ -151,9 +152,10 @@ static int fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *aafs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name, void *data,
+ size_t data_size)
{
- return mount_single(fs_type, flags, data, fill_super);
+ return mount_single(fs_type, flags, data, data_size, fill_super);
}

static struct file_system_type aafs_ops = {
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index bf2401ade80e..f3a3f1906f49 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -591,7 +591,8 @@ static int apparmor_sb_mountpoint(struct fs_context *fc, struct path *mountpoint
}

static int apparmor_sb_mount(const char *dev_name, const struct path *path,
- const char *type, unsigned long flags, void *data)
+ const char *type, unsigned long flags,
+ void *data, size_t data_size)
{
struct aa_label *label;
int error = 0;
diff --git a/security/inode.c b/security/inode.c
index 8dd9ca8848e4..a89a00714f33 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -39,7 +39,8 @@ static const struct super_operations securityfs_super_operations = {
.evict_inode = securityfs_evict_inode,
};

-static int fill_super(struct super_block *sb, void *data, int silent)
+static int fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
static const struct tree_descr files[] = {{""}};
int error;
@@ -55,9 +56,9 @@ static int fill_super(struct super_block *sb, void *data, int silent)

static struct dentry *get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name,
- void *data)
+ void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, fill_super);
+ return mount_single(fs_type, flags, data, data_size, fill_super);
}

static struct file_system_type fs_type = {
diff --git a/security/security.c b/security/security.c
index 0aca5a03c070..294c2fce1770 100644
--- a/security/security.c
+++ b/security/security.c
@@ -414,20 +414,20 @@ void security_sb_free(struct super_block *sb)
call_void_hook(sb_free_security, sb);
}

-int security_sb_copy_data(char *orig, char *copy)
+int security_sb_copy_data(char *orig, size_t data_size, char *copy)
{
- return call_int_hook(sb_copy_data, 0, orig, copy);
+ return call_int_hook(sb_copy_data, 0, orig, data_size, copy);
}
EXPORT_SYMBOL(security_sb_copy_data);

-int security_sb_remount(struct super_block *sb, void *data)
+int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
{
- return call_int_hook(sb_remount, 0, sb, data);
+ return call_int_hook(sb_remount, 0, sb, data, data_size);
}

-int security_sb_kern_mount(struct super_block *sb, int flags, void *data)
+int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
{
- return call_int_hook(sb_kern_mount, 0, sb, flags, data);
+ return call_int_hook(sb_kern_mount, 0, sb, flags, data, data_size);
}

int security_sb_show_options(struct seq_file *m, struct super_block *sb)
@@ -441,9 +441,11 @@ int security_sb_statfs(struct dentry *dentry)
}

int security_sb_mount(const char *dev_name, const struct path *path,
- const char *type, unsigned long flags, void *data)
+ const char *type, unsigned long flags,
+ void *data, size_t data_size)
{
- return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
+ return call_int_hook(sb_mount, 0, dev_name, path, type, flags,
+ data, data_size);
}

int security_sb_umount(struct vfsmount *mnt, int flags)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 1ab74c5ae789..3952aab4ff99 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2793,7 +2793,7 @@ static inline void take_selinux_option(char **to, char *from, int *first,
}
}

-static int selinux_sb_copy_data(char *orig, char *copy)
+static int selinux_sb_copy_data(char *orig, size_t data_size, char *copy)
{
int fnosec, fsec, rc = 0;
char *in_save, *in_curr, *in_end;
@@ -2835,7 +2835,7 @@ static int selinux_sb_copy_data(char *orig, char *copy)
return rc;
}

-static int selinux_sb_remount(struct super_block *sb, void *data)
+static int selinux_sb_remount(struct super_block *sb, void *data, size_t data_size)
{
int rc, i, *flags;
struct security_mnt_opts opts;
@@ -2855,7 +2855,7 @@ static int selinux_sb_remount(struct super_block *sb, void *data)
secdata = alloc_secdata();
if (!secdata)
return -ENOMEM;
- rc = selinux_sb_copy_data(data, secdata);
+ rc = selinux_sb_copy_data(data, data_size, secdata);
if (rc)
goto out_free_secdata;

@@ -2920,7 +2920,7 @@ static int selinux_sb_remount(struct super_block *sb, void *data)
goto out_free_opts;
}

-static int selinux_sb_kern_mount(struct super_block *sb, int flags, void *data)
+static int selinux_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
{
const struct cred *cred = current_cred();
struct common_audit_data ad;
@@ -2953,7 +2953,8 @@ static int selinux_mount(const char *dev_name,
const struct path *path,
const char *type,
unsigned long flags,
- void *data)
+ void *data,
+ size_t data_size)
{
const struct cred *cred = current_cred();

diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 245160373dab..87c07ff2ae7e 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1884,7 +1884,8 @@ static struct dentry *sel_make_dir(struct dentry *dir, const char *name,

#define NULL_FILE_NAME "null"

-static int sel_fill_super(struct super_block *sb, void *data, int silent)
+static int sel_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
struct selinux_fs_info *fsi;
int ret;
@@ -1999,9 +2000,10 @@ static int sel_fill_super(struct super_block *sb, void *data, int silent)
}

static struct dentry *sel_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, sel_fill_super);
+ return mount_single(fs_type, flags, data, data_size, sel_fill_super);
}

static void sel_kill_sb(struct super_block *sb)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 3c4dd21d511d..d3c4a72d1640 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -869,6 +869,7 @@ static void smack_sb_free_security(struct super_block *sb)
/**
* smack_sb_copy_data - copy mount options data for processing
* @orig: where to start
+ * @orig_size: Size of orig buffer
* @smackopts: mount options string
*
* Returns 0 on success or -ENOMEM on error.
@@ -876,7 +877,7 @@ static void smack_sb_free_security(struct super_block *sb)
* Copy the Smack specific mount options out of the mount
* options list.
*/
-static int smack_sb_copy_data(char *orig, char *smackopts)
+static int smack_sb_copy_data(char *orig, size_t orig_size, char *smackopts)
{
char *cp, *commap, *otheropts, *dp;

@@ -1157,7 +1158,8 @@ static int smack_set_mnt_opts(struct super_block *sb,
*
* Returns 0 on success, an error code on failure
*/
-static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
+static int smack_sb_kern_mount(struct super_block *sb, int flags,
+ void *data, size_t data_size)
{
int rc = 0;
char *options = data;
diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c
index f6482e53d55a..f4e91c5d6c2c 100644
--- a/security/smack/smackfs.c
+++ b/security/smack/smackfs.c
@@ -2844,13 +2844,15 @@ static const struct file_operations smk_ptrace_ops = {
* smk_fill_super - fill the smackfs superblock
* @sb: the empty superblock
* @data: unused
+ * @data_size: size of data buffer
* @silent: unused
*
* Fill in the well known entries for the smack filesystem
*
* Returns 0 on success, an error code on failure
*/
-static int smk_fill_super(struct super_block *sb, void *data, int silent)
+static int smk_fill_super(struct super_block *sb, void *data, size_t data_size,
+ int silent)
{
int rc;
struct inode *root_inode;
@@ -2934,9 +2936,10 @@ static int smk_fill_super(struct super_block *sb, void *data, int silent)
* Returns what the lower level code does.
*/
static struct dentry *smk_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+ int flags, const char *dev_name,
+ void *data, size_t data_size)
{
- return mount_single(fs_type, flags, data, smk_fill_super);
+ return mount_single(fs_type, flags, data, data_size, smk_fill_super);
}

static struct file_system_type smk_fs_type = {
diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index 31fd6bd4f657..c3a0ae4fa7ce 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -413,11 +413,13 @@ static int tomoyo_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
* @type: Name of filesystem type. Maybe NULL.
* @flags: Mount options.
* @data: Optional data. Maybe NULL.
+ * @data_size: Size of data.
*
* Returns 0 on success, negative value otherwise.
*/
static int tomoyo_sb_mount(const char *dev_name, const struct path *path,
- const char *type, unsigned long flags, void *data)
+ const char *type, unsigned long flags,
+ void *data, size_t data_size)
{
return tomoyo_mount_permission(dev_name, path, type, flags, data);
}


2018-05-25 02:49:12

by David Howells

Subject: [PATCH 02/32] vfs: Provide documentation for new mount API [ver #8]

Provide documentation for the new mount API.

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/mounting.txt | 458 ++++++++++++++++++++++++++++++++
1 file changed, 458 insertions(+)
create mode 100644 Documentation/filesystems/mounting.txt

diff --git a/Documentation/filesystems/mounting.txt b/Documentation/filesystems/mounting.txt
new file mode 100644
index 000000000000..5230a9711b97
--- /dev/null
+++ b/Documentation/filesystems/mounting.txt
@@ -0,0 +1,458 @@
+ ===================
+ FILESYSTEM MOUNTING
+ ===================
+
+CONTENTS
+
+ (1) Overview.
+
+ (2) The filesystem context.
+
+ (3) The filesystem context operations.
+
+ (4) Filesystem context security.
+
+ (5) VFS filesystem context operations.
+
+
+========
+OVERVIEW
+========
+
+The creation of new mounts is now to be done in a multistep process:
+
+ (1) Create a filesystem context.
+
+ (2) Parse the options and attach them to the context. Options may be passed
+ individually from userspace.
+
+ (3) Validate and pre-process the context.
+
+ (4) Get or create a superblock and mountable root.
+
+ (5) Perform the mount.
+
+ (6) Return an error message attached to the context.
+
+ (7) Destroy the context.
+
+To support this, the file_system_type struct gains two new fields:
+
+ unsigned short fs_context_size;
+
+which indicates the total amount of space that should be allocated for context
+data (see the Filesystem Context section), and:
+
+ int (*init_fs_context)(struct fs_context *fc, struct super_block *src_sb);
+
+which is invoked to set up the filesystem-specific parts of a filesystem
+context, including the additional space. The src_sb parameter is used to
+convey the superblock from which the filesystem may draw extra information
+(such as namespaces) for submount (FS_CONTEXT_FOR_SUBMOUNT) or reconfiguration
+(FS_CONTEXT_FOR_RECONFIGURE) purposes - otherwise it will be NULL.
+
+Note that security initialisation is done *after* the filesystem is called so
+that the namespaces may be adjusted first.
+
+And the super_operations struct gains one field:
+
+ int (*reconfigure) (struct super_block *, struct fs_context *);
+
+This shadows the ->remount_fs() operation and takes a prepared filesystem
+context instead of the mount flags and data page. It may modify the sb_flags
+in the context for the caller to pick up.
+
+[NOTE] reconfigure is intended as a replacement for remount_fs.
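The multistep process outlined in the overview can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel implementation: struct demo_fs_context, every demo_*() function, the fixed-size buffers and the rule that a "source=" option must be present are all hypothetical, chosen to make the seven steps visible.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's struct fs_context. */
struct demo_fs_context {
	char options[128];	/* step 2: accumulated "key[=val]" options */
	int have_root;		/* step 4: set once a root has been obtained */
	char error[64];		/* step 6: message handed back to the caller */
};

/* Step 2: attach one option to the context. */
static int demo_add_option(struct demo_fs_context *fc, const char *opt)
{
	size_t used = strlen(fc->options);

	/* Room is needed for a comma, the option and the trailing NUL. */
	if (used + strlen(opt) + 2 > sizeof(fc->options)) {
		snprintf(fc->error, sizeof(fc->error), "option list too long");
		return -E2BIG;
	}
	if (used)
		strcat(fc->options, ",");
	strcat(fc->options, opt);
	return 0;
}

/* Step 3: validate the accumulated options (assumed rule: need a source). */
static int demo_validate(struct demo_fs_context *fc)
{
	if (strncmp(fc->options, "source=", 7) != 0 &&
	    !strstr(fc->options, ",source=")) {
		snprintf(fc->error, sizeof(fc->error), "no source specified");
		return -EINVAL;
	}
	return 0;
}

/* Step 4: get or create the superblock and root (only recorded here). */
static int demo_get_tree(struct demo_fs_context *fc)
{
	fc->have_root = 1;
	return 0;
}

/* Step 5: perform the mount; it needs the root from step 4. */
static int demo_do_mount(struct demo_fs_context *fc)
{
	if (!fc->have_root) {
		snprintf(fc->error, sizeof(fc->error), "no root to mount");
		return -ENOENT;
	}
	return 0;
}

/* Run steps 1-7; on failure, copy the context's message into err. */
int demo_mount(const char *const *opts, size_t nr_opts,
	       char *err, size_t err_size)
{
	struct demo_fs_context *fc = calloc(1, sizeof(*fc));	/* step 1 */
	int ret = 0;
	size_t i;

	assert(err != NULL && err_size > 0);
	if (!fc)
		return -ENOMEM;
	for (i = 0; i < nr_opts && ret == 0; i++)
		ret = demo_add_option(fc, opts[i]);
	if (ret == 0)
		ret = demo_validate(fc);
	if (ret == 0)
		ret = demo_get_tree(fc);
	if (ret == 0)
		ret = demo_do_mount(fc);
	if (ret < 0)
		snprintf(err, err_size, "%s", fc->error);	/* step 6 */
	free(fc);						/* step 7 */
	return ret;
}
```

The point of the sketch is the ordering: the error message lives in the context, so it has to be copied out before the context is destroyed.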
+
+
+======================
+THE FILESYSTEM CONTEXT
+======================
+
+The creation and reconfiguration of a superblock is governed by a filesystem
+context. This is represented by the fs_context structure:
+
+ struct fs_context {
+ const struct fs_context_operations *ops;
+ struct file_system_type *fs;
+ struct dentry *root;
+ struct user_namespace *user_ns;
+ struct net *net_ns;
+ const struct cred *cred;
+ char *source;
+ char *subtype;
+ void *security;
+ void *s_fs_info;
+ unsigned int sb_flags;
+ bool sloppy;
+ bool silent;
+ bool degraded;
+ bool drop_sb;
+ enum fs_context_purpose purpose : 8;
+ };
+
+When the VFS creates this, it allocates ->fs_context_size bytes (as specified
+by the file_system_type object) to hold both the fs_context struct and any
+extra data required by the filesystem. The fs_context struct is placed at the
+beginning of this space. Any extra space beyond that is for use by the
+filesystem. The filesystem should wrap the struct in one of its own, e.g.:
+
+ struct nfs_fs_context {
+ struct fs_context fc;
+ ...
+ };
+
+placing the fs_context struct first. container_of() can then be used. The
+file_system_type would be initialised thus:
+
+ struct file_system_type nfs = {
+ ...
+ .fs_context_size = sizeof(struct nfs_fs_context),
+ .init_fs_context = nfs_init_fs_context,
+ ...
+ };
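The wrapping pattern described above can be tried out in userspace. The following is a minimal sketch, assuming cut-down versions of the structures: the src_sb argument of ->init_fs_context() is omitted, and struct demo_fs_context, its timeout field, demo_fs_type and demo_new_fs_context() are hypothetical names that do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fs_context;

/* Cut-down, hypothetical versions of the structures in the text. */
struct file_system_type {
	const char *name;
	unsigned short fs_context_size;
	int (*init_fs_context)(struct fs_context *fc);
};

struct fs_context {
	struct file_system_type *fs;
	unsigned int sb_flags;
};

/* Filesystem-private wrapper with the generic part placed first. */
struct demo_fs_context {
	struct fs_context fc;
	unsigned int timeout;	/* hypothetical private option */
};

/* Set up the filesystem-specific part of a freshly allocated context. */
static int demo_init_fs_context(struct fs_context *fc)
{
	struct demo_fs_context *ctx =
		container_of(fc, struct demo_fs_context, fc);

	ctx->timeout = 30;	/* default for the private option */
	return 0;
}

struct file_system_type demo_fs_type = {
	.name		 = "demo",
	.fs_context_size = sizeof(struct demo_fs_context),
	.init_fs_context = demo_init_fs_context,
};

/* Stand-in for the VFS allocating and initialising a context. */
struct fs_context *demo_new_fs_context(struct file_system_type *fs)
{
	struct fs_context *fc;

	/* The allocation must at least hold the generic structure. */
	assert(fs->fs_context_size >= sizeof(struct fs_context));

	fc = calloc(1, fs->fs_context_size);
	if (!fc)
		return NULL;
	fc->fs = fs;
	if (fs->init_fs_context(fc) < 0) {
		free(fc);
		return NULL;
	}
	return fc;
}
```

Because the generic structure sits at offset zero, the pointer handed out by the allocator and the pointer to the wrapper are the same address, which is what allows a single allocation of fs_context_size bytes to serve both.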
+
+The fs_context fields are as follows:
+
+ (*) const struct fs_context_operations *ops
+
+ These are operations that can be done on a filesystem context (see
+ below). This must be set by the ->init_fs_context() file_system_type
+ operation.
+
+ (*) struct file_system_type *fs
+
+ A pointer to the file_system_type of the filesystem that is being
+ constructed or reconfigured. This retains a reference on the type owner.
+
+ (*) struct dentry *root
+
+ A pointer to the root of the mountable tree (and indirectly, the
+ superblock thereof). This is filled in by the ->get_tree() op.
+
+ (*) struct user_namespace *user_ns
+ (*) struct net *net_ns
+
+ These are a subset of the namespaces in use by the invoking process. They
+ retain references on each namespace. The subscribed namespaces may be
+ replaced by the filesystem to reflect other sources, such as the parent
+ mount superblock on an automount.
+
+ (*) const struct cred *cred
+
+ The mounter's credentials. This retains a reference on the credentials.
+
+ (*) char *source
+
+ This specifies the source. It may be a block device (e.g. /dev/sda1) or
+ something more exotic, such as the "host:/path" that NFS desires.
+
+ (*) char *subtype
+
+ This is a string to be added to the type displayed in /proc/mounts to
+ qualify it (used by FUSE). This is available for the filesystem to set if
+ desired.
+
+ (*) void *security
+
+ A place for the LSMs to hang their security data for the superblock. The
+ relevant security operations are described below.
+
+ (*) void *s_fs_info
+
+ The proposed s_fs_info for a new superblock, set in the superblock by
+ sget_fc(). This can be used to distinguish superblocks.
+
+ (*) unsigned int sb_flags
+
+ This holds the SB_* flags to be set in super_block::s_flags.
+
+ (*) bool sloppy
+ (*) bool silent
+
+ These are set if the sloppy or silent mount options are given.
+
+ [NOTE] sloppy is probably unnecessary when userspace passes over one
+ option at a time since the error can just be ignored if userspace deems it
+ to be unimportant.
+
+ [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
+
+ (*) bool degraded
+
+ This is set if any preallocated resources in the context have been used
+ up, thereby rendering it unreusable for the ->get_tree() op.
+
+ (*) bool drop_sb
+
+ This is set if a superblock reference needs to be deactivated when the
+ context is put.
+
+ (*) enum fs_context_purpose purpose
+
+ This indicates the purpose for which the context is intended. The
+ available values are:
+
+ FS_CONTEXT_FOR_USER_MOUNT   -- New superblock for user-specified mount
+ FS_CONTEXT_FOR_KERNEL_MOUNT -- New superblock for kernel-internal mount
+ FS_CONTEXT_FOR_SUBMOUNT     -- New automatic submount of extant mount
+ FS_CONTEXT_FOR_RECONFIGURE  -- Change an existing mount
+
+The mount context is created by calling vfs_new_fs_context(), vfs_sb_reconfig()
+or vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
+structure is not refcounted.
+
+VFS, security and filesystem mount options are set individually with
+vfs_parse_mount_option(). Options provided by the old mount(2) system call as
+a page of data can be parsed with generic_parse_monolithic().
+
+When mounting, the filesystem is allowed to take data from any of the pointers
+and attach it to the superblock (or whatever), provided it clears the pointer
+in the mount context.
+
+The filesystem is also allowed to allocate resources and pin them with the
+mount context. For instance, NFS might pin the appropriate protocol version
+module.
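The hand-over rule for pointers held in the context can be illustrated with a small userspace sketch. All names here (struct demo_fs_context, struct demo_super_block and the demo_*() helpers) are hypothetical, and only the source string is modelled; the intent is just to show why the pointer must be cleared once ownership moves to the superblock.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct demo_fs_context {
	char *source;		/* owned by the context until handed over */
};

struct demo_super_block {
	char *source;		/* owned by the superblock once attached */
};

/* Give the context its own copy of the source string. */
int demo_set_source(struct demo_fs_context *fc, const char *source)
{
	size_t len = strlen(source) + 1;
	char *copy = malloc(len);

	if (!copy)
		return -1;
	memcpy(copy, source, len);
	free(fc->source);	/* drop any earlier source */
	fc->source = copy;
	return 0;
}

/* Take the source string from the context and give it to the superblock. */
void demo_attach_source(struct demo_super_block *sb,
			struct demo_fs_context *fc)
{
	assert(sb->source == NULL);
	sb->source = fc->source;
	fc->source = NULL;	/* cleared, so the context will not free it */
}

/* Context teardown: must cope with pointers that were handed over. */
void demo_put_fs_context(struct demo_fs_context *fc)
{
	free(fc->source);	/* free(NULL) is a no-op */
	fc->source = NULL;
}

/* Superblock teardown: releases what the superblock took over. */
void demo_kill_sb(struct demo_super_block *sb)
{
	free(sb->source);
	sb->source = NULL;
}
```

If demo_attach_source() did not clear fc->source, tearing down the context would free memory that the superblock still refers to.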
+
+
+=================================
+THE FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+The filesystem context points to a table of operations:
+
+ struct fs_context_operations {
+ void (*free)(struct fs_context *fc);
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ int (*parse_source)(struct fs_context *fc, char *source);
+ int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
+ int (*parse_monolithic)(struct fs_context *fc, void *data);
+ int (*validate)(struct fs_context *fc);
+ int (*get_tree)(struct fs_context *fc);
+ };
+
+These operations are invoked by the various stages of the mount procedure to
+manage the filesystem context. They are as follows:
+
+ (*) void (*free)(struct fs_context *fc);
+
+ Called to clean up the filesystem-specific part of the filesystem context
+ when the context is destroyed. It should be aware that parts of the
+ context may have been removed and NULL'd out by ->get_tree().
+
+ (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+
+ Called when a filesystem context has been duplicated to get any refs or
+ copy any non-referenced resources held in the filesystem-specific part of
+ the filesystem context. An error may be returned to indicate failure to
+ do this.
+
+ [!] Note that even if this fails, put_fs_context() will be called
+ immediately thereafter, so ->dup() *must* make the
+ filesystem-specific part safe for ->free().
+
+ (*) int (*parse_source)(struct fs_context *fc, char *source);
+
+ Called when a source or device is specified for a filesystem context.
+ This may be called multiple times if the filesystem supports it. If
+ successful, 0 should be returned or a negative error code otherwise.
+
+ (*) int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
+
+ Called when an option is to be added to the filesystem context. opt points
+ to the option string, likely in "key[=val]" format. VFS-specific options
+ will have been weeded out and fc->sb_flags updated in the context.
+ Security options will also have been weeded out and fc->security updated.
+
+ If successful, 0 should be returned or a negative error code otherwise.
+
+ (*) int (*parse_monolithic)(struct fs_context *fc, void *data);
+
+ Called when the mount(2) system call is invoked to pass the entire data
+ page in one go. If this is expected to be just a list of "key[=val]"
+ items separated by commas, then this may be set to NULL.
+
+ The return value is as for ->parse_option().
+
+ If the filesystem (e.g. NFS) needs to examine the data first and then
+ finds it's the standard key-val list then it may pass it off to
+ generic_parse_monolithic().
+
+ (*) int (*validate)(struct fs_context *fc);
+
+ Called when all the options have been applied and the mount is about to
+ take place. It should check for inconsistencies from mount options and
+ it is also allowed to do preliminary resource acquisition. For instance,
+ the core NFS module could load the NFS protocol module here.
+
+ Note that if fc->purpose == FS_CONTEXT_FOR_RECONFIGURE, some of the
+ options necessary for a new mount may not be set.
+
+ The return value is as for ->parse_option().
+
+ (*) int (*get_tree)(struct fs_context *fc);
+
+ Called to get or create the mountable root and superblock, using the
+ information stored in the filesystem context (reconfiguration goes via a
+ different vector). It may detach any resources it desires from the
+ filesystem context and transfer them to the superblock it creates.
+
+ On success it should set fc->root to the mountable root and return 0. In
+ the case of an error, it should return a negative error code.
+
+
+===========================
+FILESYSTEM CONTEXT SECURITY
+===========================
+
+The filesystem context contains a security pointer that the LSMs can use for
+building up a security context for the superblock to be mounted. There are a
+number of operations used by the new mount code for this purpose:
+
+ (*) int security_fs_context_alloc(struct fs_context *fc,
+ struct super_block *src_sb);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. It should return 0 on success or a negative error
+ code on failure.
+
+ src_sb will be non-NULL if the context is being created for superblock
+ reconfiguration (FS_CONTEXT_FOR_RECONFIGURE) in which case it indicates
+ the superblock to be reconfigured. It will also be non-NULL in the case
+ of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case it indicates the
+ parent superblock.
+
+ (*) int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. The original filesystem context is pointed to by
+ src_fc and may be used for reference. It should return 0 on success or a
+ negative error code on failure.
+
+ (*) void security_fs_context_free(struct fs_context *fc);
+
+ Called to clean up anything attached to fc->security. Note that the
+ contents may have been transferred to a superblock and the pointer cleared
+ during get_tree.
+
+ (*) int security_fs_context_parse_source(struct fs_context *fc, char *src);
+
+ Called for each source (there may be more than one if the filesystem
+ supports it). The arguments are as for the ->parse_source() method. It
+ should return 0 on success or a negative error code on failure.
+
+ (*) int security_fs_context_parse_option(struct fs_context *fc, char *opt);
+
+ Called for each mount option. The arguments are as for the
+ ->parse_option() method. It should return 0 to indicate that the option
+ should be passed on to the filesystem, 1 to indicate that the option
+ should be discarded or an error to indicate that the option should be
+ rejected.
+
+ The buffer pointed to by opt may be modified.
+
+ (*) int security_fs_context_validate(struct fs_context *fc);
+
+ Called after all the options have been parsed to validate the collection
+ as a whole and to do any necessary allocation so that
+ security_sb_get_tree() is less likely to fail. It should return 0 or a
+ negative error code.
+
+ (*) int security_sb_get_tree(struct fs_context *fc);
+
+ Called during the mount procedure to verify that the specified superblock
+ is allowed to be mounted and to transfer the security data there. It
+ should return 0 or a negative error code.
+
+ (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint);
+
+ Called during the mount procedure to verify that the root dentry attached
+ to the context is permitted to be attached to the specified mountpoint.
+ It should return 0 on success or a negative error code on failure.
+
+
+=================================
+VFS FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+There are three operations for creating a filesystem context and
+one for destroying a context:
+
+ (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct super_block *src_sb,
+ unsigned int sb_flags);
+
+ Create a filesystem context for a given filesystem type. This allocates
+ the filesystem context, sets the flags, initialises the security and calls
+ fs_type->init_fs_context() to initialise the filesystem context.
+
+ src_sb can be NULL or it may indicate a superblock that is going to be
+ reconfigured (FS_CONTEXT_FOR_RECONFIGURE) or a superblock that is the
+ parent of a submount (FS_CONTEXT_FOR_SUBMOUNT). This superblock is
+ provided as a source of namespace information.
+
+ (*) struct fs_context *vfs_sb_reconfigure(struct vfsmount *mnt,
+ unsigned int sb_flags);
+
+ Create a filesystem context from the same filesystem as an extant mount
+ and initialise the mount parameters from the superblock underlying that
+ mount. This is for use by superblock parameter reconfiguration.
+
+ (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
+
+ Duplicate a filesystem context, copying any options noted and duplicating
+ or additionally referencing any resources held therein. This is available
+ for use where a filesystem has to get a mount within a mount, such as NFS4
+ does by internally mounting the root of the target server and then doing a
+ private pathwalk to the target directory.
+
+ (*) void put_fs_context(struct fs_context *fc);
+
+ Destroy a filesystem context, releasing any resources it holds. This
+ calls the ->free() operation. This is intended to be called by anyone who
+ created a filesystem context.
+
+ [!] filesystem contexts are not refcounted, so this causes unconditional
+ destruction.
+
+In all the above operations, apart from the put op, the return is a filesystem
+context pointer or a negative error code.
+
+For the remaining operations, if an error occurs, a negative error code will be
+returned.
+
+ (*) int vfs_get_tree(struct fs_context *fc);
+
+ Get or create the mountable root and superblock, using the parameters in
+ the filesystem context to select/configure the superblock. This invokes
+ the ->validate() op and then the ->get_tree() op.
+
+ [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
+ ->reconfigure().
+
+ (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+
+ Create a mount given the parameters in the specified filesystem context.
+ Note that this does not attach the mount to anything.
+
+ (*) int vfs_set_fs_source(struct fs_context *fc, char *source);
+
+ Supply one or more source names or device names for the mount. This may
+ cause the filesystem to access the source. Multiple sources may be
+ specified if the filesystem supports it.
+
+ (*) int vfs_parse_fs_option(struct fs_context *fc, char *data);
+
+ Supply a single mount option to the filesystem context. The mount option
+ should likely be in a "key[=val]" string form. The option is first
+ checked to see if it corresponds to a standard mount flag (in which case
+ it is used to set an SB_xxx flag and consumed) or a security option (in
+ which case the LSM consumes it) before it is passed on to the filesystem.
+
+ (*) int generic_parse_monolithic(struct fs_context *fc, void *data);
+
+ Parse a sys_mount() data page, assuming the form to be a text list
+ consisting of key[=val] options separated by commas. Each item in the
+ list is passed to vfs_parse_fs_option(). This is the default when the
+ ->parse_monolithic() operation is NULL.
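
The option flow described in the documentation above (standard flag first, then the LSM, then the filesystem, with a comma-separated data page fed through one item at a time) can be pictured with a small userspace model. This is only a sketch: every model_* name, the "ro" and "context=" matches and the "size=" key are assumptions made for illustration, not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Userspace model only; none of these names exist in the kernel. */
struct model_ctx {
	unsigned int sb_flags;	/* stands in for fc->sb_flags */
	int lsm_opts;		/* options consumed by the model LSM */
	int fs_opts;		/* options that reached the model filesystem */
};

#define MODEL_SB_RDONLY 0x1u	/* illustrative value */

/* Model LSM hook: 0 = pass on, 1 = consumed, negative = rejected. */
static int model_security_parse_option(struct model_ctx *ctx, char *opt)
{
	if (strncmp(opt, "context=", 8) == 0) {
		ctx->lsm_opts++;
		return 1;
	}
	return 0;
}

/* Model filesystem ->parse_option(): only a "size=" key is known. */
static int model_fs_parse_option(struct model_ctx *ctx, char *opt)
{
	if (strncmp(opt, "size=", 5) != 0)
		return -EINVAL;
	ctx->fs_opts++;
	return 0;
}

/* One option: standard flag, then LSM, then filesystem. */
static int model_parse_fs_option(struct model_ctx *ctx, char *opt)
{
	int ret;

	if (strcmp(opt, "ro") == 0) {
		ctx->sb_flags |= MODEL_SB_RDONLY;
		return 0;
	}
	ret = model_security_parse_option(ctx, opt);
	if (ret < 0)
		return ret;	/* rejected by the LSM */
	if (ret == 1)
		return 0;	/* consumed by the LSM */
	return model_fs_parse_option(ctx, opt);
}

/* Monolithic data: split on commas in place, skipping empty items. */
static int model_parse_monolithic(struct model_ctx *ctx, char *data)
{
	assert(ctx != NULL);
	while (data) {
		char *next = strchr(data, ',');
		int ret;

		if (next)
			*next++ = '\0';
		if (*data) {
			ret = model_parse_fs_option(ctx, data);
			if (ret < 0)
				return ret;
		}
		data = next;
	}
	return 0;
}
```

Calling model_parse_monolithic() on a writable buffer holding "ro,context=foo,,size=1M" sets the read-only flag, lets the model LSM take one option and hands one option to the model filesystem; an unknown key stops the walk with -EINVAL.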


2018-05-25 02:49:15

by David Howells

Subject: [PATCH 17/32] hugetlbfs: Convert to fs_context [ver #8]

Convert hugetlbfs to use the fs_context during mount.

Signed-off-by: David Howells <[email protected]>
---

fs/hugetlbfs/inode.c | 340 +++++++++++++++++++++++++++++---------------------
1 file changed, 194 insertions(+), 146 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 76fb8eb2bea8..1d0825ed0fb6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -45,11 +45,17 @@ const struct file_operations hugetlbfs_file_operations;
static const struct inode_operations hugetlbfs_dir_inode_operations;
static const struct inode_operations hugetlbfs_inode_operations;

-struct hugetlbfs_config {
+enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
+
+struct hugetlbfs_fs_context {
struct hstate *hstate;
+ unsigned long long max_size_opt;
+ unsigned long long min_size_opt;
long max_hpages;
long nr_inodes;
long min_hpages;
+ enum hugetlbfs_size_type max_val_type;
+ enum hugetlbfs_size_type min_val_type;
kuid_t uid;
kgid_t gid;
umode_t mode;
@@ -708,16 +714,16 @@ static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
}

static struct inode *hugetlbfs_get_root(struct super_block *sb,
- struct hugetlbfs_config *config)
+ struct hugetlbfs_fs_context *ctx)
{
struct inode *inode;

inode = new_inode(sb);
if (inode) {
inode->i_ino = get_next_ino();
- inode->i_mode = S_IFDIR | config->mode;
- inode->i_uid = config->uid;
- inode->i_gid = config->gid;
+ inode->i_mode = S_IFDIR | ctx->mode;
+ inode->i_uid = ctx->uid;
+ inode->i_gid = ctx->gid;
inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
inode->i_op = &hugetlbfs_dir_inode_operations;
inode->i_fop = &simple_dir_operations;
@@ -1081,8 +1087,6 @@ static const struct super_operations hugetlbfs_ops = {
.show_options = hugetlbfs_show_options,
};

-enum hugetlbfs_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };
-
/*
* Convert size option passed from command line to number of huge pages
* in the pool specified by hstate. Size option could be in bytes
@@ -1105,171 +1109,156 @@ hugetlbfs_size_to_hpages(struct hstate *h, unsigned long long size_opt,
return size_opt;
}

-static int
-hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig)
+/*
+ * Parse one mount option.
+ */
+static int hugetlbfs_parse_option(struct fs_context *fc, char *opt, size_t len)
{
- char *p, *rest;
+ struct hugetlbfs_fs_context *ctx = fc->fs_private;
+ char *rest;
+ unsigned long ps;
substring_t args[MAX_OPT_ARGS];
- int option;
- unsigned long long max_size_opt = 0, min_size_opt = 0;
- enum hugetlbfs_size_type max_val_type = NO_SIZE, min_val_type = NO_SIZE;
-
- if (!options)
+ int token, option;
+
+ token = match_token(opt, tokens, args);
+ switch (token) {
+ case Opt_uid:
+ if (match_int(&args[0], &option))
+ goto bad_val;
+ ctx->uid = make_kuid(current_user_ns(), option);
+ if (!uid_valid(ctx->uid))
+ goto bad_val;
return 0;

- while ((p = strsep(&options, ",")) != NULL) {
- int token;
- if (!*p)
- continue;
+ case Opt_gid:
+ if (match_int(&args[0], &option))
+ goto bad_val;
+ ctx->gid = make_kgid(current_user_ns(), option);
+ if (!gid_valid(ctx->gid))
+ goto bad_val;
+ return 0;

- token = match_token(p, tokens, args);
- switch (token) {
- case Opt_uid:
- if (match_int(&args[0], &option))
- goto bad_val;
- pconfig->uid = make_kuid(current_user_ns(), option);
- if (!uid_valid(pconfig->uid))
- goto bad_val;
- break;
+ case Opt_mode:
+ if (match_octal(&args[0], &option))
+ goto bad_val;
+ ctx->mode = option & 01777U;
+ return 0;

- case Opt_gid:
- if (match_int(&args[0], &option))
- goto bad_val;
- pconfig->gid = make_kgid(current_user_ns(), option);
- if (!gid_valid(pconfig->gid))
- goto bad_val;
- break;
+ case Opt_size:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(*args[0].from))
+ goto bad_val;
+ ctx->max_size_opt = memparse(args[0].from, &rest);
+ ctx->max_val_type = SIZE_STD;
+ if (*rest == '%')
+ ctx->max_val_type = SIZE_PERCENT;
+ return 0;

- case Opt_mode:
- if (match_octal(&args[0], &option))
- goto bad_val;
- pconfig->mode = option & 01777U;
- break;
+ case Opt_nr_inodes:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(*args[0].from))
+ goto bad_val;
+ ctx->nr_inodes = memparse(args[0].from, &rest);
+ return 0;

- case Opt_size: {
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- max_size_opt = memparse(args[0].from, &rest);
- max_val_type = SIZE_STD;
- if (*rest == '%')
- max_val_type = SIZE_PERCENT;
- break;
+ case Opt_pagesize:
+ ps = memparse(args[0].from, &rest);
+ ctx->hstate = size_to_hstate(ps);
+ if (!ctx->hstate) {
+ pr_err("Unsupported page size %lu MB\n", ps >> 20);
+ return -EINVAL;
}
+ return 0;

- case Opt_nr_inodes:
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- pconfig->nr_inodes = memparse(args[0].from, &rest);
- break;
+ case Opt_min_size:
+ /* memparse() will accept a K/M/G without a digit */
+ if (!isdigit(*args[0].from))
+ goto bad_val;
+ ctx->min_size_opt = memparse(args[0].from, &rest);
+ ctx->min_val_type = SIZE_STD;
+ if (*rest == '%')
+ ctx->min_val_type = SIZE_PERCENT;
+ return 0;

- case Opt_pagesize: {
- unsigned long ps;
- ps = memparse(args[0].from, &rest);
- pconfig->hstate = size_to_hstate(ps);
- if (!pconfig->hstate) {
- pr_err("Unsupported page size %lu MB\n",
- ps >> 20);
- return -EINVAL;
- }
- break;
- }
+ default:
+ pr_err("Bad mount option: \"%s\"\n", opt);
+ return -EINVAL;
+ }

- case Opt_min_size: {
- /* memparse() will accept a K/M/G without a digit */
- if (!isdigit(*args[0].from))
- goto bad_val;
- min_size_opt = memparse(args[0].from, &rest);
- min_val_type = SIZE_STD;
- if (*rest == '%')
- min_val_type = SIZE_PERCENT;
- break;
- }
+bad_val:
+ pr_err("Bad value '%s' for mount option '%s'\n", args[0].from, opt);
+ return -EINVAL;
+}

- default:
- pr_err("Bad mount option: \"%s\"\n", p);
- return -EINVAL;
- break;
- }
- }
+/*
+ * Validate the parsed options.
+ */
+static int hugetlbfs_validate(struct fs_context *fc)
+{
+ struct hugetlbfs_fs_context *ctx = fc->fs_private;

/*
* Use huge page pool size (in hstate) to convert the size
* options to number of huge pages. If NO_SIZE, -1 is returned.
*/
- pconfig->max_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
- max_size_opt, max_val_type);
- pconfig->min_hpages = hugetlbfs_size_to_hpages(pconfig->hstate,
- min_size_opt, min_val_type);
+ ctx->max_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+ ctx->max_size_opt,
+ ctx->max_val_type);
+ ctx->min_hpages = hugetlbfs_size_to_hpages(ctx->hstate,
+ ctx->min_size_opt,
+ ctx->min_val_type);

/*
* If max_size was specified, then min_size must be smaller
*/
- if (max_val_type > NO_SIZE &&
- pconfig->min_hpages > pconfig->max_hpages) {
- pr_err("minimum size can not be greater than maximum size\n");
+ if (ctx->max_val_type > NO_SIZE &&
+ ctx->min_hpages > ctx->max_hpages) {
+ pr_err("Minimum size can not be greater than maximum size\n");
return -EINVAL;
}

return 0;
-
-bad_val:
- pr_err("Bad value '%s' for mount option '%s'\n", args[0].from, p);
- return -EINVAL;
}

static int
-hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
- int silent)
+hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
- int ret;
- struct hugetlbfs_config config;
+ struct hugetlbfs_fs_context *ctx =
+ fc->fs_private;
struct hugetlbfs_sb_info *sbinfo;

- config.max_hpages = -1; /* No limit on size by default */
- config.nr_inodes = -1; /* No limit on number of inodes by default */
- config.uid = current_fsuid();
- config.gid = current_fsgid();
- config.mode = 0755;
- config.hstate = &default_hstate;
- config.min_hpages = -1; /* No default minimum size */
- ret = hugetlbfs_parse_options(data, &config);
- if (ret)
- return ret;
-
sbinfo = kmalloc(sizeof(struct hugetlbfs_sb_info), GFP_KERNEL);
if (!sbinfo)
return -ENOMEM;
sb->s_fs_info = sbinfo;
- sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
- sbinfo->max_inodes = config.nr_inodes;
- sbinfo->free_inodes = config.nr_inodes;
- sbinfo->spool = NULL;
- sbinfo->uid = config.uid;
- sbinfo->gid = config.gid;
- sbinfo->mode = config.mode;
+ sbinfo->hstate = ctx->hstate;
+ sbinfo->max_inodes = ctx->nr_inodes;
+ sbinfo->free_inodes = ctx->nr_inodes;
+ sbinfo->spool = NULL;
+ sbinfo->uid = ctx->uid;
+ sbinfo->gid = ctx->gid;
+ sbinfo->mode = ctx->mode;

/*
* Allocate and initialize subpool if maximum or minimum size is
* specified. Any needed reservations (for minimim size) are taken
* taken when the subpool is created.
*/
- if (config.max_hpages != -1 || config.min_hpages != -1) {
- sbinfo->spool = hugepage_new_subpool(config.hstate,
- config.max_hpages,
- config.min_hpages);
+ if (ctx->max_hpages != -1 || ctx->min_hpages != -1) {
+ sbinfo->spool = hugepage_new_subpool(ctx->hstate,
+ ctx->max_hpages,
+ ctx->min_hpages);
if (!sbinfo->spool)
goto out_free;
}
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = huge_page_size(config.hstate);
- sb->s_blocksize_bits = huge_page_shift(config.hstate);
+ sb->s_blocksize = huge_page_size(ctx->hstate);
+ sb->s_blocksize_bits = huge_page_shift(ctx->hstate);
sb->s_magic = HUGETLBFS_MAGIC;
sb->s_op = &hugetlbfs_ops;
sb->s_time_gran = 1;
- sb->s_root = d_make_root(hugetlbfs_get_root(sb, &config));
+ sb->s_root = d_make_root(hugetlbfs_get_root(sb, ctx));
if (!sb->s_root)
goto out_free;
return 0;
@@ -1279,17 +1268,50 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, size_t data_size,
return -ENOMEM;
}

-static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data, size_t data_size)
+static int hugetlbfs_get_tree(struct fs_context *fc)
+{
+ return vfs_get_super(fc, vfs_get_independent_super, hugetlbfs_fill_super);
+}
+
+static void hugetlbfs_fs_context_free(struct fs_context *fc)
{
- return mount_nodev(fs_type, flags, data, data_size,
- hugetlbfs_fill_super);
+ kfree(fc->fs_private);
+}
+
+static const struct fs_context_operations hugetlbfs_fs_context_ops = {
+ .free = hugetlbfs_fs_context_free,
+ .parse_option = hugetlbfs_parse_option,
+ .validate = hugetlbfs_validate,
+ .get_tree = hugetlbfs_get_tree,
+};
+
+static int hugetlbfs_init_fs_context(struct fs_context *fc,
+ struct dentry *reference)
+{
+ struct hugetlbfs_fs_context *ctx;
+
+ ctx = kzalloc(sizeof(struct hugetlbfs_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->max_hpages = -1; /* No limit on size by default */
+ ctx->nr_inodes = -1; /* No limit on number of inodes by default */
+ ctx->uid = current_fsuid();
+ ctx->gid = current_fsgid();
+ ctx->mode = 0755;
+ ctx->hstate = &default_hstate;
+ ctx->min_hpages = -1; /* No default minimum size */
+ ctx->max_val_type = NO_SIZE;
+ ctx->min_val_type = NO_SIZE;
+ fc->fs_private = ctx;
+ fc->ops = &hugetlbfs_fs_context_ops;
+ return 0;
}

static struct file_system_type hugetlbfs_fs_type = {
- .name = "hugetlbfs",
- .mount = hugetlbfs_mount,
- .kill_sb = kill_litter_super,
+ .name = "hugetlbfs",
+ .init_fs_context = hugetlbfs_init_fs_context,
+ .kill_sb = kill_litter_super,
};

static struct vfsmount *hugetlbfs_vfsmount[HUGE_MAX_HSTATE];
@@ -1396,8 +1418,47 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
return file;
}

+static struct vfsmount *__init mount_one_hugetlbfs(struct hstate *h)
+{
+ struct hugetlbfs_fs_context *ctx;
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&hugetlbfs_fs_type, NULL, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc)) {
+ ret = PTR_ERR(fc);
+ goto err;
+ }
+
+ ctx = fc->fs_private;
+ ctx->hstate = h;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc, 0);
+ if (IS_ERR(mnt)) {
+ ret = PTR_ERR(mnt);
+ goto err_fc;
+ }
+
+ put_fs_context(fc);
+ return mnt;
+
+err_fc:
+ put_fs_context(fc);
+err:
+ pr_err("Cannot mount internal hugetlbfs for page size %uK",
+ 1U << (h->order + PAGE_SHIFT - 10));
+ return ERR_PTR(ret);
+}
+
static int __init init_hugetlbfs_fs(void)
{
+ struct vfsmount *mnt;
struct hstate *h;
int error;
int i;
@@ -1420,25 +1481,12 @@ static int __init init_hugetlbfs_fs(void)

i = 0;
for_each_hstate(h) {
- char buf[50];
- unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
- int n;
-
- n = snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
- hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
- buf, n + 1);
-
- if (IS_ERR(hugetlbfs_vfsmount[i])) {
- pr_err("Cannot mount internal hugetlbfs for "
- "page size %uK", ps_kb);
- error = PTR_ERR(hugetlbfs_vfsmount[i]);
- hugetlbfs_vfsmount[i] = NULL;
- }
+ mnt = mount_one_hugetlbfs(h);
+ if (IS_ERR(mnt) && i == 0)
+ goto out;
+ hugetlbfs_vfsmount[i] = mnt;
i++;
}
- /* Non default hstates are optional */
- if (!IS_ERR_OR_NULL(hugetlbfs_vfsmount[default_hstate_idx]))
- return 0;

out:
kmem_cache_destroy(hugetlbfs_inode_cachep);
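
The conversion above stores the raw size options in the context and only turns them into huge page counts in the validate step. A userspace sketch of that split follows. All model_* names are hypothetical, model_memparse() is a simplified stand-in for the kernel's memparse(), and the percent-to-pages rule is an assumption because the body of hugetlbfs_size_to_hpages() is not shown in this patch.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum model_size_type { NO_SIZE, SIZE_STD, SIZE_PERCENT };

/* Stand-in for a huge page pool; the values used are illustrative. */
struct model_hstate {
	unsigned int shift;		/* log2 of the huge page size */
	unsigned long max_huge_pages;	/* pages in the pool */
};

struct model_ctx {
	const struct model_hstate *hstate;
	unsigned long long max_size_opt, min_size_opt;
	enum model_size_type max_val_type, min_val_type;
	long max_hpages, min_hpages;
};

/* Defaults, as an init_fs_context-style hook would set them. */
static void model_init(struct model_ctx *ctx, const struct model_hstate *h)
{
	assert(ctx != NULL && h != NULL);
	memset(ctx, 0, sizeof(*ctx));	/* both types become NO_SIZE */
	ctx->hstate = h;
	ctx->max_hpages = -1;
	ctx->min_hpages = -1;
}

/* Simplified memparse(): decimal digits plus an optional K/M/G suffix. */
static unsigned long long model_memparse(const char *s, char **rest)
{
	unsigned long long v = strtoull(s, rest, 10);

	switch (**rest) {
	case 'G': case 'g': v <<= 10;	/* fall through */
	case 'M': case 'm': v <<= 10;	/* fall through */
	case 'K': case 'k': v <<= 10; (*rest)++; break;
	default: break;
	}
	return v;
}

/* Parse one option; only record the raw value and its type. */
static int model_parse_option(struct model_ctx *ctx, const char *opt)
{
	unsigned long long *size;
	enum model_size_type *type;
	const char *val;
	char *rest;

	if (strncmp(opt, "size=", 5) == 0) {
		val = opt + 5;
		size = &ctx->max_size_opt;
		type = &ctx->max_val_type;
	} else if (strncmp(opt, "min_size=", 9) == 0) {
		val = opt + 9;
		size = &ctx->min_size_opt;
		type = &ctx->min_val_type;
	} else {
		return -EINVAL;
	}
	if (!isdigit((unsigned char)*val))	/* reject a bare K/M/G */
		return -EINVAL;
	*size = model_memparse(val, &rest);
	*type = (*rest == '%') ? SIZE_PERCENT : SIZE_STD;
	return 0;
}

/* Assumed rule: percent of the pool, or bytes divided by the page size. */
static long model_size_to_hpages(const struct model_hstate *h,
				 unsigned long long size,
				 enum model_size_type type)
{
	if (type == NO_SIZE)
		return -1;
	if (type == SIZE_PERCENT)
		return (long)(size * h->max_huge_pages / 100);
	return (long)(size >> h->shift);
}

/* Convert once every option is known, then cross-check min against max. */
static int model_validate(struct model_ctx *ctx)
{
	ctx->max_hpages = model_size_to_hpages(ctx->hstate, ctx->max_size_opt,
					       ctx->max_val_type);
	ctx->min_hpages = model_size_to_hpages(ctx->hstate, ctx->min_size_opt,
					       ctx->min_val_type);
	if (ctx->max_val_type > NO_SIZE && ctx->min_hpages > ctx->max_hpages)
		return -EINVAL;
	return 0;
}
```

Deferring the conversion looks like a natural fit for per-option parsing: options arrive one at a time, so a page size option could still change the hstate after a size option has been seen, and the min/max comparison only makes sense once both are known.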


2018-05-25 02:49:33

by David Howells

Subject: [PATCH 28/32] vfs: Store the fd_cookie in nameidata, not the dfd int [ver #8]

Look up dfd in set_nameidata() if it is not AT_FDCWD and store the resultant
fd_cookie in struct nameidata. nd->have_dfd is left false if AT_FDCWD was
supplied. The fd_cookie is released in restore_nameidata().

This means that the file to which the fd points in a construct like the following:

set_nameidata(&nd, dfd, name);
retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
if (unlikely(retval == -ECHILD))
retval = path_lookupat(&nd, flags, path);
if (unlikely(retval == -ESTALE))
retval = path_lookupat(&nd, flags | LOOKUP_REVAL, path);

doesn't change between the three calls to path_lookupat() or similar.

It also allows us to fish the fd_cookie out for the upcoming move_mount()
syscall which needs to clear a file flag if successful.

Signed-off-by: David Howells <[email protected]>
---

fs/namei.c | 38 +++++++++++++++++++++-----------------
1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 6f0dc40f88c5..819d6ee71b46 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -53,8 +53,8 @@
* The new code replaces the old recursive symlink resolution with
* an iterative one (in case of non-nested symlink chains). It does
* this with calls to <fs>_follow_link().
- * As a side effect, dir_namei(), _namei() and follow_link() are now
- * replaced with a single function lookup_dentry() that can handle all
+ * As a side effect, dir_namei(), _namei() and follow_link() are now
+ * replaced with a single function lookup_dentry() that can handle all
* the special cases of the former code.
*
* With the new dcache, the pathname is stored at each inode, at least as
@@ -506,25 +506,34 @@ struct nameidata {
struct filename *name;
struct nameidata *saved;
struct inode *link_inode;
+ struct fd_cookie *dfd;
+ bool have_dfd;
unsigned root_seq;
- int dfd;
} __randomize_layout;

static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
{
struct nameidata *old = current->nameidata;
p->stack = p->internal;
- p->dfd = dfd;
p->name = name;
p->total_link_count = old ? old->total_link_count : 0;
p->saved = old;
current->nameidata = p;
+
+ if (likely(dfd == AT_FDCWD)) {
+ p->dfd = NULL;
+ p->have_dfd = false;
+ } else {
+ p->dfd = __fdget_raw(dfd); /* Errors are dealt with later */
+ p->have_dfd = true;
+ }
}

static void restore_nameidata(void)
{
struct nameidata *now = current->nameidata, *old = now->saved;

+ __fdput(now->dfd);
current->nameidata = old;
if (old)
old->total_link_count = now->total_link_count;
@@ -2165,7 +2174,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->root.mnt = NULL;
rcu_read_unlock();
return ERR_PTR(-ECHILD);
- } else if (nd->dfd == AT_FDCWD) {
+ } else if (!nd->have_dfd) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;
@@ -2185,22 +2194,18 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
return s;
} else {
/* Caller must check execute permissions on the starting path component */
- struct fd f = fdget_raw(nd->dfd);
struct dentry *dentry;
+ struct file *file = __fdfile(nd->dfd);

- if (!f.file)
+ if (!nd->dfd)
return ERR_PTR(-EBADF);

- dentry = f.file->f_path.dentry;
+ dentry = file->f_path.dentry;

- if (*s) {
- if (!d_can_lookup(dentry)) {
- fdput(f);
- return ERR_PTR(-ENOTDIR);
- }
- }
+ if (*s && !d_can_lookup(dentry))
+ return ERR_PTR(-ENOTDIR);

- nd->path = f.file->f_path;
+ nd->path = file->f_path;
if (flags & LOOKUP_RCU) {
rcu_read_lock();
nd->inode = nd->path.dentry->d_inode;
@@ -2209,7 +2214,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
path_get(&nd->path);
nd->inode = nd->path.dentry->d_inode;
}
- fdput(f);
return s;
}
}
@@ -3557,7 +3561,7 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
if (IS_ERR(filename))
return ERR_CAST(filename);

- set_nameidata(&nd, -1, filename);
+ set_nameidata(&nd, AT_FDCWD, filename);
file = path_openat(&nd, op, flags | LOOKUP_RCU);
if (unlikely(file == ERR_PTR(-ECHILD)))
file = path_openat(&nd, op, flags);
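
The effect of resolving the directory fd once, rather than in every walk attempt, can be shown with a userspace model. This is a sketch under stated assumptions: the model_* names, the toy fd table and the fixed -ECHILD/-ESTALE sequence are invented for illustration, and reference counting (the real cookie is dropped in restore_nameidata()) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace model; names are hypothetical and do not match fs/namei.c. */
struct model_file { int id; };

#define MODEL_AT_FDCWD (-100)
#define MODEL_NFILES 4

static struct model_file *model_fdtable[MODEL_NFILES];
static int model_lookups;	/* counts fd-table lookups */

static struct model_file *model_fdget(int fd)
{
	model_lookups++;
	if (fd < 0 || fd >= MODEL_NFILES)
		return NULL;
	return model_fdtable[fd];
}

struct model_nameidata {
	struct model_file *dfd;	/* resolved once, reused by every attempt */
	int have_dfd;
};

static void model_set_nameidata(struct model_nameidata *nd, int dfd)
{
	assert(nd != NULL);
	if (dfd == MODEL_AT_FDCWD) {
		nd->dfd = NULL;
		nd->have_dfd = 0;
	} else {
		nd->dfd = model_fdget(dfd);	/* errors are seen by the walk */
		nd->have_dfd = 1;
	}
}

/* One walk attempt; fails with -ECHILD, then -ESTALE, then succeeds. */
static int model_walk(struct model_nameidata *nd, int attempt, int *start_id)
{
	if (nd->have_dfd && !nd->dfd)
		return -EBADF;
	*start_id = nd->have_dfd ? nd->dfd->id : 0;	/* 0 models the cwd */
	if (attempt == 0)
		return -ECHILD;
	if (attempt == 1)
		return -ESTALE;
	return 0;
}

/* Up to three attempts, all starting from the same resolved file. */
static int model_lookup(int dfd, struct model_file *racer, int *start_id)
{
	struct model_nameidata nd;
	int ret = 0, attempt;

	model_set_nameidata(&nd, dfd);
	for (attempt = 0; attempt < 3; attempt++) {
		ret = model_walk(&nd, attempt, start_id);
		if (ret != -ECHILD && ret != -ESTALE)
			break;
		/* Simulate another thread replacing the fd between attempts. */
		if (racer && dfd >= 0 && dfd < MODEL_NFILES)
			model_fdtable[dfd] = racer;
	}
	return ret;
}
```

In this model the fd table is consulted once per lookup, and the final attempt still starts from the file that was installed when the lookup began, even though the table slot was replaced in the meantime.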


2018-05-25 02:49:36

by David Howells

Subject: [PATCH 29/32] vfs: Don't mix FMODE_* flags with O_* flags [ver #8]

build_open_flags() has a weird bit in it:

/* Must never be set by userspace */
flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC;

This didn't use to have the O_CLOEXEC removal in it, but just used to be:

/* Must never be set by userspace */
flags &= ~FMODE_NONOTIFY;

but this flag should only come from file->f_mode and should have nothing to
do with the O_* flags.

Further, this check is redundant with:

flags &= VALID_OPEN_FLAGS;

a few lines above.

Fix this by splitting the f_mode flags (FMODE_*) from the f_flags flags
(O_*) internally.
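
As a rough userspace picture of the split, the sketch below keeps the O_* flags and the internal mode bits in separate parameters, in the way anon_inode_getfile() does after this patch. The model_* names and the MODEL_FMODE_* values are assumptions for illustration only, and the access-mode derivation relies on the Linux values O_RDONLY == 0, O_WRONLY == 1 and O_RDWR == 2.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical stand-ins; the kernel's fmode_t and FMODE_* bits differ. */
typedef unsigned int model_fmode_t;

#define MODEL_FMODE_READ	0x1u
#define MODEL_FMODE_WRITE	0x2u
#define MODEL_FMODE_NONOTIFY	0x100u	/* illustrative internal-only bit */

struct model_file {
	unsigned int f_flags;	/* O_* namespace, visible to userspace */
	model_fmode_t f_mode;	/* internal namespace */
};

/* Derive the read/write mode bits from the O_ACCMODE part of the flags. */
static model_fmode_t model_open_fmode(unsigned int f_flags)
{
	/* Assumes the Linux encoding of the access modes. */
	assert(O_RDONLY == 0 && O_WRONLY == 1 && O_RDWR == 2);
	return (model_fmode_t)((f_flags + 1) & O_ACCMODE);
}

/*
 * Build a file from two separate arguments: internal bits can only arrive
 * through f_mode, so nothing in f_flags has to be masked out for safety.
 */
static struct model_file model_getfile(unsigned int f_flags,
				       model_fmode_t f_mode)
{
	struct model_file f;

	f.f_mode = f_mode | model_open_fmode(f_flags);
	f.f_flags = f_flags & (O_ACCMODE | O_NONBLOCK);
	return f;
}
```

With the two namespaces kept apart, a caller that wants an internal bit passes it explicitly as the mode argument, and a caller that has none passes 0, which is the shape of the call-site changes in the diff below.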

Fixes: ecf081d1a73b ("vfs: introduce FMODE_NONOTIFY")
Signed-off-by: David Howells <[email protected]>
cc: Eric Paris <[email protected]>
---

drivers/dma-buf/dma-buf.c | 2 +-
drivers/dma-buf/sync_file.c | 2 +-
drivers/gpu/drm/drm_syncobj.c | 2 +-
drivers/staging/lustre/lustre/mdc/mdc_lib.c | 2 +-
drivers/staging/lustre/lustre/mdc/mdc_locks.c | 2 +-
drivers/tty/pty.c | 4 ++--
fs/anon_inodes.c | 20 +++++++++++--------
fs/autofs4/dev-ioctl.c | 2 +-
fs/cachefiles/rdwr.c | 2 +-
fs/eventfd.c | 2 +-
fs/eventpoll.c | 2 +-
fs/exec.c | 6 ++++--
fs/exportfs/expfs.c | 2 +-
fs/fcntl.c | 6 ++----
fs/internal.h | 1 +
fs/namei.c | 1 +
fs/namespace.c | 2 +-
fs/nfs/dir.c | 15 ++++++++------
fs/nfs/nfs4proc.c | 9 +++------
fs/notify/fanotify/fanotify_user.c | 10 ++++++----
fs/notify/inotify/inotify_user.c | 2 +-
fs/nsfs.c | 2 +-
fs/open.c | 24 +++++++++++++++--------
fs/signalfd.c | 3 ++-
fs/timerfd.c | 2 +-
fs/xfs/xfs_ioctl.c | 2 +-
include/linux/anon_inodes.h | 6 +++---
include/linux/fs.h | 26 ++++++++++++++++++-------
include/linux/fsnotify.h | 8 ++++----
include/linux/nfs_fs.h | 3 ++-
include/uapi/asm-generic/fcntl.h | 1 -
ipc/mqueue.c | 6 ++----
kernel/bpf/syscall.c | 6 +++---
kernel/events/core.c | 2 +-
net/unix/af_unix.c | 2 +-
security/apparmor/file.c | 2 +-
security/keys/big_key.c | 2 +-
security/selinux/hooks.c | 2 +-
38 files changed, 109 insertions(+), 86 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d78d5fc173dc..93445178d5c1 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -436,7 +436,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
dmabuf->resv = resv;

file = anon_inode_getfile("dmabuf", &dma_buf_fops, dmabuf,
- exp_info->flags);
+ exp_info->flags, 0);
if (IS_ERR(file)) {
ret = PTR_ERR(file);
goto err_dmabuf;
diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c
index 35dd06479867..b92125e9be40 100644
--- a/drivers/dma-buf/sync_file.c
+++ b/drivers/dma-buf/sync_file.c
@@ -37,7 +37,7 @@ static struct sync_file *sync_file_alloc(void)
return NULL;

sync_file->file = anon_inode_getfile("sync_file", &sync_file_fops,
- sync_file, 0);
+ sync_file, 0, 0);
if (IS_ERR(sync_file->file))
goto err;

diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
index d4f4ce484529..10eb9b6d7d6a 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -419,7 +419,7 @@ int drm_syncobj_get_fd(struct drm_syncobj *syncobj, int *p_fd)

file = anon_inode_getfile("syncobj_file",
&drm_syncobj_file_fops,
- syncobj, 0);
+ syncobj, 0, 0);
if (IS_ERR(file)) {
put_unused_fd(fd);
return PTR_ERR(file);
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_lib.c b/drivers/staging/lustre/lustre/mdc/mdc_lib.c
index 46eefdc09e3a..092d5be903cc 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_lib.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_lib.c
@@ -176,7 +176,7 @@ static inline __u64 mds_pack_open_flags(__u64 flags)
cr_flags |= MDS_OPEN_SYNC;
if (flags & O_DIRECTORY)
cr_flags |= MDS_OPEN_DIRECTORY;
- if (flags & __FMODE_EXEC)
+ if (flags & FMODE_EXEC)
cr_flags |= MDS_FMODE_EXEC;
if (cl_is_lov_delay_create(flags))
cr_flags |= MDS_OPEN_DELAY_CREATE;
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_locks.c b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
index 695ef44532cf..7520e9dafffc 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_locks.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
@@ -257,7 +257,7 @@ mdc_intent_open_pack(struct obd_export *exp, struct lookup_intent *it,
} else {
if (it->it_flags & (FMODE_WRITE | MDS_OPEN_TRUNC))
mode = LCK_CW;
- else if (it->it_flags & __FMODE_EXEC)
+ else if (it->it_flags & FMODE_EXEC)
mode = LCK_PR;
else
mode = LCK_CR;
diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 6c7151edd715..91a0df8dc4a7 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -636,7 +636,7 @@ int ptm_open_peer(struct file *master, struct tty_struct *tty, int flags)
}
path.dentry = tty->link->driver_data;

- filp = dentry_open(&path, flags, current_cred());
+ filp = dentry_open(&path, flags, 0, current_cred());
mntput(path.mnt);
if (IS_ERR(filp)) {
retval = PTR_ERR(filp);
@@ -806,7 +806,7 @@ static int ptmx_open(struct inode *inode, struct file *filp)
nonseekable_open(inode, filp);

/* We refuse fsnotify events on ptmx, since it's a shared resource */
- filp->f_mode |= FMODE_NONOTIFY;
+ filp->f_mode |= FMODE_DONT_NOTIFY;

retval = tty_alloc_file(filp);
if (retval)
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 13c06a7e0b85..2b50a274f885 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -60,7 +60,8 @@ static struct file_system_type anon_inode_fs_type = {
* @name: [in] name of the "class" of the new file
* @fops: [in] file operations for the new file
* @priv: [in] private data for the new file (will be file's private_data)
- * @flags: [in] flags
+ * @f_flags: [in] O_* flags
+ * @f_mode: [in] FMODE_* flags
*
* Creates a new file by hooking it on a single inode. This is useful for files
* that do not need to have a full-fledged inode in order to operate correctly.
@@ -69,8 +70,8 @@ static struct file_system_type anon_inode_fs_type = {
* setup. Returns the newly created file* or an error pointer.
*/
struct file *anon_inode_getfile(const char *name,
- const struct file_operations *fops,
- void *priv, int flags)
+ const struct file_operations *fops, void *priv,
+ unsigned int f_flags, fmode_t f_mode)
{
struct qstr this;
struct path path;
@@ -103,12 +104,12 @@ struct file *anon_inode_getfile(const char *name,

d_instantiate(path.dentry, anon_inode_inode);

- file = alloc_file(&path, OPEN_FMODE(flags), fops);
+ file = alloc_file(&path, f_mode | OPEN_FMODE(f_flags), fops);
if (IS_ERR(file))
goto err_dput;
file->f_mapping = anon_inode_inode->i_mapping;

- file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
+ file->f_flags = f_flags & (O_ACCMODE | O_NONBLOCK);
file->private_data = priv;

return file;
@@ -129,7 +130,8 @@ EXPORT_SYMBOL_GPL(anon_inode_getfile);
* @name: [in] name of the "class" of the new file
* @fops: [in] file operations for the new file
* @priv: [in] private data for the new file (will be file's private_data)
- * @flags: [in] flags
+ * @f_flags: [in] O_* flags
+ * @f_mode: [in] FMODE_* flags
*
* Creates a new file by hooking it on a single inode. This is useful for files
* that do not need to have a full-fledged inode in order to operate correctly.
@@ -138,17 +140,17 @@ EXPORT_SYMBOL_GPL(anon_inode_getfile);
* setup. Returns new descriptor or an error code.
*/
int anon_inode_getfd(const char *name, const struct file_operations *fops,
- void *priv, int flags)
+ void *priv, unsigned int f_flags, fmode_t f_mode)
{
int error, fd;
struct file *file;

- error = get_unused_fd_flags(flags);
+ error = get_unused_fd_flags(f_flags);
if (error < 0)
return error;
fd = error;

- file = anon_inode_getfile(name, fops, priv, flags);
+ file = anon_inode_getfile(name, fops, priv, f_flags, f_mode);
if (IS_ERR(file)) {
error = PTR_ERR(file);
goto err_put_unused_fd;
diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index 26f6b4f41ce6..8e93d4e07aac 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -258,7 +258,7 @@ static int autofs_dev_ioctl_open_mountpoint(const char *name, dev_t devid)
if (err)
goto out;

- filp = dentry_open(&path, O_RDONLY, current_cred());
+ filp = dentry_open(&path, O_RDONLY, 0, current_cred());
path_put(&path);
if (IS_ERR(filp)) {
err = PTR_ERR(filp);
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 5082c8a49686..0d20db389405 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -910,7 +910,7 @@ int cachefiles_write_page(struct fscache_storage *op, struct page *page)
* own time */
path.mnt = cache->mnt;
path.dentry = object->backer;
- file = dentry_open(&path, O_RDWR | O_LARGEFILE, cache->cache_cred);
+ file = dentry_open(&path, O_RDWR | O_LARGEFILE, 0, cache->cache_cred);
if (IS_ERR(file)) {
ret = PTR_ERR(file);
goto error_2;
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 08d3bd602f73..fb4c5912a982 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -402,7 +402,7 @@ static int do_eventfd(unsigned int count, int flags)
ctx->flags = flags;

fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
- O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
+ O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS), 0);
if (fd < 0)
eventfd_free_ctx(ctx);

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 602ca4285b2e..e8052eb1e0dd 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1963,7 +1963,7 @@ static int do_epoll_create(int flags)
goto out_free_ep;
}
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
- O_RDWR | (flags & O_CLOEXEC));
+ O_RDWR | (flags & O_CLOEXEC), 0);
if (IS_ERR(file)) {
error = PTR_ERR(file);
goto out_free_fd;
diff --git a/fs/exec.c b/fs/exec.c
index 183059c427b9..7ca13cc0b7f9 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -124,7 +124,8 @@ SYSCALL_DEFINE1(uselib, const char __user *, library)
struct filename *tmp = getname(library);
int error = PTR_ERR(tmp);
static const struct open_flags uselib_flags = {
- .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+ .open_flag = O_LARGEFILE | O_RDONLY,
+ .f_mode = FMODE_EXEC,
.acc_mode = MAY_READ | MAY_EXEC,
.intent = LOOKUP_OPEN,
.lookup_flags = LOOKUP_FOLLOW,
@@ -838,7 +839,8 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
struct file *file;
int err;
struct open_flags open_exec_flags = {
- .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+ .open_flag = O_LARGEFILE | O_RDONLY,
+ .f_mode = FMODE_EXEC,
.acc_mode = MAY_EXEC,
.intent = LOOKUP_OPEN,
.lookup_flags = LOOKUP_FOLLOW,
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 645158dc33f1..ec8f68277ce7 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -308,7 +308,7 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
/*
* Open the directory ...
*/
- file = dentry_open(path, O_RDONLY, cred);
+ file = dentry_open(path, O_RDONLY, 0, cred);
error = PTR_ERR(file);
if (IS_ERR(file))
goto out;
diff --git a/fs/fcntl.c b/fs/fcntl.c
index d737ff082472..60bc5bf2f4cf 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1028,10 +1028,8 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
- HWEIGHT32(
- (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
- __FMODE_EXEC | __FMODE_NONOTIFY));
+ BUILD_BUG_ON(19 - 1 /* for O_RDONLY being 0 */ !=
+ HWEIGHT32(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)));

fasync_cache = kmem_cache_create("fasync_cache",
sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
diff --git a/fs/internal.h b/fs/internal.h
index f47ede6ace5a..c29552e0522f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,6 +110,7 @@ struct open_flags {
int open_flag;
umode_t mode;
int acc_mode;
+ fmode_t f_mode;
int intent;
int lookup_flags;
};
diff --git a/fs/namei.c b/fs/namei.c
index 819d6ee71b46..5cbd980b4031 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3481,6 +3481,7 @@ static struct file *path_openat(struct nameidata *nd,
return file;

file->f_flags = op->open_flag;
+ file->f_mode = op->f_mode;

if (unlikely(file->f_flags & __O_TMPFILE)) {
error = do_tmpfile(nd, flags, op, file, &opened);
diff --git a/fs/namespace.c b/fs/namespace.c
index 03ade803b948..dba680aa1ea4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3309,7 +3309,7 @@ SYSCALL_DEFINE5(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags
/* Attach to an apparent O_PATH fd with a note that we need to unmount
* it, not just simply put it.
*/
- file = dentry_open(&newmount, O_PATH, fc->cred);
+ file = dentry_open(&newmount, O_PATH, 0, fc->cred);
if (IS_ERR(file))
goto err_path;
file->f_mode |= FMODE_NEED_UNMOUNT;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 73f8b43d988c..f8eeea255651 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1395,9 +1395,9 @@ const struct dentry_operations nfs4_dentry_operations = {
};
EXPORT_SYMBOL_GPL(nfs4_dentry_operations);

-static fmode_t flags_to_mode(int flags)
+static fmode_t flags_to_mode(int flags, fmode_t f_mode)
{
- fmode_t res = (__force fmode_t)flags & FMODE_EXEC;
+ fmode_t res = f_mode & FMODE_EXEC;
if ((flags & O_ACCMODE) != O_WRONLY)
res |= FMODE_READ;
if ((flags & O_ACCMODE) != O_RDONLY)
@@ -1407,7 +1407,7 @@ static fmode_t flags_to_mode(int flags)

static struct nfs_open_context *create_nfs_open_context(struct dentry *dentry, int open_flags, struct file *filp)
{
- return alloc_nfs_open_context(dentry, flags_to_mode(open_flags), filp);
+ return alloc_nfs_open_context(dentry, flags_to_mode(open_flags, filp->f_mode), filp);
}

static int do_open(struct inode *inode, struct file *filp)
@@ -2441,11 +2441,11 @@ static int nfs_do_access(struct inode *inode, struct rpc_cred *cred, int mask)
return status;
}

-static int nfs_open_permission_mask(int openflags)
+static int nfs_open_permission_mask(fmode_t f_mode, int openflags)
{
int mask = 0;

- if (openflags & __FMODE_EXEC) {
+ if (f_mode & FMODE_EXEC) {
/* ONLY check exec rights */
mask = MAY_EXEC;
} else {
@@ -2458,9 +2458,10 @@ static int nfs_open_permission_mask(int openflags)
return mask;
}

-int nfs_may_open(struct inode *inode, struct rpc_cred *cred, int openflags)
+int nfs_may_open(struct inode *inode, struct rpc_cred *cred, fmode_t f_mode,
+ int openflags)
{
- return nfs_do_access(inode, cred, nfs_open_permission_mask(openflags));
+ return nfs_do_access(inode, cred, nfs_open_permission_mask(f_mode, openflags));
}
EXPORT_SYMBOL_GPL(nfs_may_open);

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index b71757e85066..6b30118c0507 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1712,7 +1712,8 @@ static struct nfs4_state *nfs4_try_open_cached(struct nfs4_opendata *opendata)
rcu_read_unlock();
nfs_release_seqid(opendata->o_arg.seqid);
if (!opendata->is_recover) {
- ret = nfs_may_open(state->inode, state->owner->so_cred, open_mode);
+ ret = nfs_may_open(state->inode, state->owner->so_cred,
+ fmode, open_mode);
if (ret != 0)
goto out;
}
@@ -2414,11 +2415,7 @@ static int nfs4_opendata_access(struct rpc_cred *cred,
return 0;

mask = 0;
- /*
- * Use openflags to check for exec, because fmode won't
- * always have FMODE_EXEC set when file open for exec.
- */
- if (openflags & __FMODE_EXEC) {
+ if (fmode & FMODE_EXEC) {
/* ONLY check for exec rights */
if (S_ISDIR(state->inode->i_mode))
mask = NFS4_ACCESS_LOOKUP;
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index ec4d8c59d0e3..a84fb5390e85 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -32,7 +32,7 @@
*
* Internal and external open flags are stored together in field f_flags of
* struct file. Only external open flags shall be allowed in event_f_flags.
- * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
+ * Internal flags like FMODE_DONT_NOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
* excluded.
*/
#define FANOTIFY_INIT_ALL_EVENT_F_BITS ( \
@@ -92,7 +92,8 @@ static int create_fd(struct fsnotify_group *group,
* are NULL; That's fine, just don't call dentry open */
if (event->path.dentry && event->path.mnt)
new_file = dentry_open(&event->path,
- group->fanotify_data.f_flags | FMODE_NONOTIFY,
+ group->fanotify_data.f_flags,
+ FMODE_DONT_NOTIFY,
current_cred());
else
new_file = ERR_PTR(-EOVERFLOW);
@@ -741,7 +742,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
return -EMFILE;
}

- f_flags = O_RDWR | FMODE_NONOTIFY;
+ f_flags = O_RDWR;
if (flags & FAN_CLOEXEC)
f_flags |= O_CLOEXEC;
if (flags & FAN_NONBLOCK)
@@ -809,7 +810,8 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
group->fanotify_data.audit = true;
}

- fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
+ fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags,
+ FMODE_DONT_NOTIFY);
if (fd < 0)
goto out_destroy_group;

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index ef32f3657958..b8fd9ade776e 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -667,7 +667,7 @@ static int do_inotify_init(int flags)
return PTR_ERR(group);

ret = anon_inode_getfd("inotify", &inotify_fops, group,
- O_RDONLY | flags);
+ O_RDONLY | flags, 0);
if (ret < 0)
fsnotify_destroy_group(group);

diff --git a/fs/nsfs.c b/fs/nsfs.c
index f069eb6495b0..93886ec2540c 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -174,7 +174,7 @@ int open_related_ns(struct ns_common *ns,
return PTR_ERR(err);
}

- f = dentry_open(&path, O_RDONLY, current_cred());
+ f = dentry_open(&path, O_RDONLY, 0, current_cred());
path_put(&path);
if (IS_ERR(f)) {
put_unused_fd(fd);
diff --git a/fs/open.c b/fs/open.c
index c5ee7cd60424..79a8a1bd740d 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -34,6 +34,13 @@

#include "internal.h"

+const u8 acc_mode[O_ACCMODE + 1] = {
+ [O_RDONLY] = MAY_READ,
+ [O_WRONLY] = MAY_WRITE,
+ [O_RDWR] = MAY_READ | MAY_WRITE,
+ [3] = MAY_READ | MAY_WRITE,
+};
+
int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
struct file *filp)
{
@@ -732,9 +739,6 @@ static int do_dentry_open(struct file *f,
static const struct file_operations empty_fops = {};
int error;

- f->f_mode = OPEN_FMODE(f->f_flags) | FMODE_LSEEK |
- FMODE_PREAD | FMODE_PWRITE;
-
path_get(&f->f_path);
f->f_inode = inode;
f->f_mapping = inode->i_mapping;
@@ -743,11 +747,14 @@ static int do_dentry_open(struct file *f,
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

if (unlikely(f->f_flags & O_PATH)) {
- f->f_mode = FMODE_PATH;
+ f->f_mode |= FMODE_PATH;
f->f_op = &empty_fops;
goto done;
}

+ f->f_mode |= OPEN_FMODE(f->f_flags) | FMODE_LSEEK |
+ FMODE_PREAD | FMODE_PWRITE;
+
if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) {
error = get_write_access(inode);
if (unlikely(error))
@@ -906,8 +913,8 @@ int vfs_open(const struct path *path, struct file *file,
return do_dentry_open(file, d_backing_inode(dentry), NULL, cred);
}

-struct file *dentry_open(const struct path *path, int flags,
- const struct cred *cred)
+struct file *dentry_open(const struct path *path, unsigned int flags,
+ fmode_t f_mode, const struct cred *cred)
{
int error;
struct file *f;
@@ -941,14 +948,15 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
* them in fcntl(F_GETFD) or similar interfaces.
*/
flags &= VALID_OPEN_FLAGS;
+ op->f_mode = 0;

if (flags & (O_CREAT | __O_TMPFILE))
op->mode = (mode & S_IALLUGO) | S_IFREG;
else
op->mode = 0;

- /* Must never be set by userspace */
- flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC;
+ /* Don't leak O_CLOEXEC into ->f_flags */
+ flags &= ~O_CLOEXEC;

/*
* O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only
diff --git a/fs/signalfd.c b/fs/signalfd.c
index d2187a813376..885122c786de 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -287,7 +287,8 @@ static int do_signalfd4(int ufd, sigset_t __user *user_mask, size_t sizemask,
* anon_inode_getfd() will install the fd.
*/
ufd = anon_inode_getfd("[signalfd]", &signalfd_fops, ctx,
- O_RDWR | (flags & (O_CLOEXEC | O_NONBLOCK)));
+ O_RDWR | (flags & (O_CLOEXEC | O_NONBLOCK)),
+ 0);
if (ufd < 0)
kfree(ctx);
} else {
diff --git a/fs/timerfd.c b/fs/timerfd.c
index cdad49da3ff7..6de8ea9737d7 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -425,7 +425,7 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
ctx->moffs = ktime_mono_to_real(0);

ufd = anon_inode_getfd("[timerfd]", &timerfd_fops, ctx,
- O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS));
+ O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS), 0);
if (ufd < 0)
kfree(ctx);

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89fb1eb80aae..6c3c7ff271df 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -255,7 +255,7 @@ xfs_open_by_handle(

path.mnt = parfilp->f_path.mnt;
path.dentry = dentry;
- filp = dentry_open(&path, hreq->oflags, cred);
+ filp = dentry_open(&path, hreq->oflags, 0, cred);
dput(dentry);
if (IS_ERR(filp)) {
put_unused_fd(fd);
diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h
index d0d7d96261ad..a1a190beb068 100644
--- a/include/linux/anon_inodes.h
+++ b/include/linux/anon_inodes.h
@@ -12,10 +12,10 @@
struct file_operations;

struct file *anon_inode_getfile(const char *name,
- const struct file_operations *fops,
- void *priv, int flags);
+ const struct file_operations *fops, void *priv,
+ unsigned int f_flags, fmode_t f_mode);
int anon_inode_getfd(const char *name, const struct file_operations *fops,
- void *priv, int flags);
+ void *priv, unsigned int f_flags, fmode_t f_mode);

#endif /* _LINUX_ANON_INODES_H */

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ba571c18e236..40890e3359f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -149,7 +149,7 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define FMODE_CAN_WRITE ((__force fmode_t)0x40000)

/* File was opened by fanotify and shouldn't generate fanotify events */
-#define FMODE_NONOTIFY ((__force fmode_t)0x4000000)
+#define FMODE_DONT_NOTIFY ((__force fmode_t)0x4000000)

/* File is capable of returning -EAGAIN if I/O will block */
#define FMODE_NOWAIT ((__force fmode_t)0x8000000)
@@ -2413,7 +2413,8 @@ extern struct file *file_open_name(struct filename *, int, umode_t);
extern struct file *filp_open(const char *, int, umode_t);
extern struct file *file_open_root(struct dentry *, struct vfsmount *,
const char *, int, umode_t);
-extern struct file * dentry_open(const struct path *, int, const struct cred *);
+extern struct file * dentry_open(const struct path *, unsigned int, fmode_t,
+ const struct cred *);
extern int filp_close(struct file *, fl_owner_t id);

extern struct filename *getname_flags(const char __user *, int, int *);
@@ -3349,12 +3350,23 @@ int proc_nr_inodes(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
int __init get_filesystem_list(char *buf);

-#define __FMODE_EXEC ((__force int) FMODE_EXEC)
-#define __FMODE_NONOTIFY ((__force int) FMODE_NONOTIFY)
+extern const u8 acc_mode[O_ACCMODE + 1];

-#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
-#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
- (flag & __FMODE_NONOTIFY)))
+/*
+ * Turn { O_RDONLY, O_WRONLY, O_RDWR, 3 } into MAY_READ and/or MAY_WRITE
+ */
+static inline unsigned int ACC_MODE(int x)
+{
+ return acc_mode[(x) & O_ACCMODE];
+}
+
+/*
+ * Turn { O_RDONLY, O_WRONLY, O_RDWR, 3 } into FMODE_READ and/or FMODE_WRITE
+ */
+static inline fmode_t OPEN_FMODE(unsigned int O_flags)
+{
+ return (__force fmode_t)((O_flags + 1) & O_ACCMODE);
+}

static inline bool is_sxid(umode_t mode)
{
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index bdaf22582f6e..67c3f9e3f371 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -38,7 +38,7 @@ static inline int fsnotify_perm(struct file *file, int mask)
__u32 fsnotify_mask = 0;
int ret;

- if (file->f_mode & FMODE_NONOTIFY)
+ if (file->f_mode & FMODE_DONT_NOTIFY)
return 0;
if (!(mask & (MAY_READ | MAY_OPEN)))
return 0;
@@ -184,7 +184,7 @@ static inline void fsnotify_access(struct file *file)
if (S_ISDIR(inode->i_mode))
mask |= FS_ISDIR;

- if (!(file->f_mode & FMODE_NONOTIFY)) {
+ if (!(file->f_mode & FMODE_DONT_NOTIFY)) {
fsnotify_parent(path, NULL, mask);
fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
}
@@ -202,7 +202,7 @@ static inline void fsnotify_modify(struct file *file)
if (S_ISDIR(inode->i_mode))
mask |= FS_ISDIR;

- if (!(file->f_mode & FMODE_NONOTIFY)) {
+ if (!(file->f_mode & FMODE_DONT_NOTIFY)) {
fsnotify_parent(path, NULL, mask);
fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
}
@@ -237,7 +237,7 @@ static inline void fsnotify_close(struct file *file)
if (S_ISDIR(inode->i_mode))
mask |= FS_ISDIR;

- if (!(file->f_mode & FMODE_NONOTIFY)) {
+ if (!(file->f_mode & FMODE_DONT_NOTIFY)) {
fsnotify_parent(path, NULL, mask);
fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
}
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 2f129bbfaae8..a13508c0fe88 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -477,7 +477,8 @@ extern const struct dentry_operations nfs_dentry_operations;
extern void nfs_force_lookup_revalidate(struct inode *dir);
extern int nfs_instantiate(struct dentry *dentry, struct nfs_fh *fh,
struct nfs_fattr *fattr, struct nfs4_label *label);
-extern int nfs_may_open(struct inode *inode, struct rpc_cred *cred, int openflags);
+extern int nfs_may_open(struct inode *inode, struct rpc_cred *cred,
+ fmode_t mode, int openflags);
extern void nfs_access_zap_cache(struct inode *inode);

/*
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0b1c7e35090c 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -6,7 +6,6 @@

/*
* FMODE_EXEC is 0x20
- * FMODE_NONOTIFY is 0x4000000
* These cannot be used by userspace O_* until internal and external open
* flags are split.
* -Eric Paris
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 934ccdc48a1d..de7548442c94 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -791,8 +791,6 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
umode_t mode, struct filename *name,
struct mq_attr *attr)
{
- static const int oflag2acc[O_ACCMODE] = { MAY_READ, MAY_WRITE,
- MAY_READ | MAY_WRITE };
int acc;

if (d_really_is_negative(dentry)) {
@@ -810,7 +808,7 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
return -EEXIST;
if ((oflag & O_ACCMODE) == (O_RDWR | O_WRONLY))
return -EINVAL;
- acc = oflag2acc[oflag & O_ACCMODE];
+ acc = ACC_MODE(oflag);
return inode_permission(d_inode(dentry), acc);
}

@@ -843,7 +841,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
path.mnt = mntget(mnt);
error = prepare_open(path.dentry, oflag, ro, mode, name, attr);
if (!error) {
- struct file *file = dentry_open(&path, oflag, current_cred());
+ struct file *file = dentry_open(&path, oflag, 0, current_cred());
if (!IS_ERR(file))
fd_install(fd, file);
else
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 016ef9025827..5018d399eed9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -373,7 +373,7 @@ int bpf_map_new_fd(struct bpf_map *map, int flags)
return ret;

return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
- flags | O_CLOEXEC);
+ flags | O_CLOEXEC, 0);
}

int bpf_get_file_flag(int flags)
@@ -1068,7 +1068,7 @@ int bpf_prog_new_fd(struct bpf_prog *prog)
return ret;

return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
- O_RDWR | O_CLOEXEC);
+ O_RDWR | O_CLOEXEC, 0);
}

static struct bpf_prog *____bpf_prog_get(struct fd f)
@@ -1445,7 +1445,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)

raw_tp->prog = prog;
tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
- O_CLOEXEC);
+ O_CLOEXEC, 0);
if (tp_fd < 0) {
bpf_probe_unregister(raw_tp->btp, prog);
err = tp_fd;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 67612ce359ad..0e0fcb1f946f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10612,7 +10612,7 @@ SYSCALL_DEFINE5(perf_event_open,
}

event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
- f_flags);
+ f_flags, 0);
if (IS_ERR(event_file)) {
err = PTR_ERR(event_file);
event_file = NULL;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e5473c03d667..6813b51d1baf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2588,7 +2588,7 @@ static int unix_open_file(struct sock *sk)
if (fd < 0)
goto out;

- f = dentry_open(&path, O_PATH, current_cred());
+ f = dentry_open(&path, O_PATH, 0, current_cred());
if (IS_ERR(f)) {
put_unused_fd(fd);
fd = PTR_ERR(f);
diff --git a/security/apparmor/file.c b/security/apparmor/file.c
index 224b2fef93ca..392398f6254e 100644
--- a/security/apparmor/file.c
+++ b/security/apparmor/file.c
@@ -692,7 +692,7 @@ void aa_inherit_files(const struct cred *cred, struct files_struct *files)
if (!n) /* none found? */
goto out;

- devnull = dentry_open(&aa_null, O_RDWR, cred);
+ devnull = dentry_open(&aa_null, O_RDWR, 0, cred);
if (IS_ERR(devnull))
devnull = NULL;
/* replace all the matching ones with this */
diff --git a/security/keys/big_key.c b/security/keys/big_key.c
index 933623784ccd..d29381f22cde 100644
--- a/security/keys/big_key.c
+++ b/security/keys/big_key.c
@@ -374,7 +374,7 @@ long big_key_read(const struct key *key, char __user *buffer, size_t buflen)
if (!buf)
return -ENOMEM;

- file = dentry_open(path, O_RDONLY, current_cred());
+ file = dentry_open(path, O_RDONLY, 0, current_cred());
if (IS_ERR(file)) {
ret = PTR_ERR(file);
goto error;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 9c5d60308136..098a541b76e0 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2625,7 +2625,7 @@ static inline void flush_unauthorized_files(const struct cred *cred,
if (!n) /* none found? */
return;

- devnull = dentry_open(&selinux_null, O_RDWR, cred);
+ devnull = dentry_open(&selinux_null, O_RDWR, 0, cred);
if (IS_ERR(devnull))
devnull = NULL;
/* replace all the matching ones with this */


2018-05-25 02:49:50

by David Howells

Subject: [PATCH 23/32] VFS: Implement logging through fs_context [ver #8]

Implement the ability for filesystems to log error, warning and
informational messages through the fs_context. These can be extracted by
userspace by reading from an fd created by fsopen().

Error messages are prefixed with "e ", warnings with "w " and informational
messages with "i ".
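As a rough userspace illustration (not part of this patch), a reader of the
fsopen() fd could split each message into a severity and its text as sketched
below. The enum and fc_msg_classify() are hypothetical names. The sketch
assumes the caller has already NUL-terminated the bytes returned by read(),
since the read handler in this patch copies strlen() bytes without a
terminator. Note that the kernel's out-of-memory fallback string carries no
prefix, so it lands in the unknown case.

```c
#include <stddef.h>

/* Severity of a message read back from the context fd (hypothetical names). */
enum fc_msg_level {
	FC_MSG_ERROR,
	FC_MSG_WARNING,
	FC_MSG_INFO,
	FC_MSG_UNKNOWN,
};

/*
 * Classify a NUL-terminated message by its two-character prefix ("e ", "w "
 * or "i ").  If text is non-NULL, *text is pointed at the message body after
 * the prefix, or at the whole message if no known prefix is present.
 */
static enum fc_msg_level fc_msg_classify(const char *msg, const char **text)
{
	enum fc_msg_level level = FC_MSG_UNKNOWN;

	if (msg[0] != '\0' && msg[1] == ' ') {
		switch (msg[0]) {
		case 'e':
			level = FC_MSG_ERROR;
			break;
		case 'w':
			level = FC_MSG_WARNING;
			break;
		case 'i':
			level = FC_MSG_INFO;
			break;
		}
	}

	if (text)
		*text = (level == FC_MSG_UNKNOWN) ? msg : msg + 2;
	return level;
}
```

A caller would typically read() one message at a time until the read fails
with ENODATA, terminate the buffer, and pass it to the classifier.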

Inside the kernel, formatted messages are allocated with kvasprintf(), but
unformatted messages are not copied if they're either in the core .rodata
section or in the .rodata section of the filesystem module pinned by
fs_context::fs_type; any other unformatted string is duplicated with
kstrdup(). Uncopied messages are only good until the fs_type is released.
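The ownership rule above amounts to one "needs freeing" bit per stored
message. A minimal userspace sketch of that bookkeeping follows, using malloc()
and free() in place of the kernel allocators. struct msg_slots and the slot_*()
helpers are hypothetical names chosen for illustration, and the setters assume
the slot has been cleared beforehand.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS 8

struct msg_slots {
	uint8_t need_free;	/* bit n set: text[n] is heap-allocated */
	char *text[SLOTS];
};

/* Store a constant string by reference; it must outlive the slot. */
static void slot_set_const(struct msg_slots *s, unsigned int n, const char *msg)
{
	s->text[n] = (char *)msg;
	s->need_free &= ~(1u << n);
}

/* Store a private heap copy of msg.  Returns 0 on success, -1 on OOM. */
static int slot_set_copy(struct msg_slots *s, unsigned int n, const char *msg)
{
	size_t len = strlen(msg) + 1;
	char *copy = malloc(len);

	if (!copy)
		return -1;
	memcpy(copy, msg, len);
	s->text[n] = copy;
	s->need_free |= 1u << n;
	return 0;
}

/* Empty a slot, freeing the string only if the slot owns it. */
static void slot_clear(struct msg_slots *s, unsigned int n)
{
	if (s->need_free & (1u << n))
		free(s->text[n]);
	s->text[n] = NULL;
	s->need_free &= ~(1u << n);
}
```

Keeping the flag in a separate bitmask rather than alongside each pointer lets
eight slots be tracked in a single byte, which matches the u8 need_free field
in the patch.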

Note that the logging object is shared between duplicated fs_context
structures. This is so that filesystems such as NFS, which do a mount within
a mount, can get at least some of the errors from the inner mount.

Five logging functions are provided for this:

(1) void logfc(struct fs_context *fc, const char *fmt, ...);

This logs a message into the context. If the buffer is full, the
earliest message is discarded.

(2) void errorf(fc, fmt, ...);

This wraps logfc() to log an error.

(3) int invalf(fc, fmt, ...);

This wraps errorf() and returns -EINVAL for convenience.

(4) void warnf(fc, fmt, ...);

This wraps logfc() to log a warning.

(5) void infof(fc, fmt, ...);

This wraps logfc() to log an informational message.
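The discard-oldest behaviour described for logfc() in (1) can be modelled in
userspace as follows. This is a simplified sketch, not the kernel code:
struct msg_ring, ring_push() and ring_pop() are hypothetical names, the slots
hold borrowed pointers only (no freeing), and fullness is computed with an
unsigned 8-bit subtraction so that it stays correct when the counters wrap
past 255.

```c
#include <stdint.h>
#include <stddef.h>

#define LOG_SLOTS 8	/* must be a power of two */

struct msg_ring {
	uint8_t head;			/* insertion counter, wraps at 256 */
	uint8_t tail;			/* removal counter, wraps at 256 */
	const char *slot[LOG_SLOTS];
};

/* Append a message, dropping the oldest one if all slots are in use. */
static void ring_push(struct msg_ring *r, const char *msg)
{
	unsigned int index = r->head & (LOG_SLOTS - 1);

	/* When full, head and tail map to the same slot: overwrite it. */
	if ((uint8_t)(r->head - r->tail) == LOG_SLOTS)
		r->tail++;

	r->slot[index] = msg;
	r->head++;
}

/* Remove and return the oldest message, or NULL if the ring is empty. */
static const char *ring_pop(struct msg_ring *r)
{
	unsigned int index = r->tail & (LOG_SLOTS - 1);
	const char *msg;

	if (r->head == r->tail)
		return NULL;

	msg = r->slot[index];
	r->slot[index] = NULL;
	r->tail++;
	return msg;
}
```

With eight slots, pushing ten messages leaves the third through tenth in the
ring, and they are popped in insertion order.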

Signed-off-by: David Howells <[email protected]>
---

fs/fs_context.c | 92 ++++++++++++++++++++++++++++++++++++++++++++
fs/fsopen.c | 71 ++++++++++++++++++++++++++++++++++
include/linux/fs_context.h | 58 ++++++++++++++++++++++++++++
3 files changed, 220 insertions(+), 1 deletion(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index bef68a12ddb5..326a334b8860 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -11,6 +11,7 @@
*/

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
#include <linux/fs_context.h>
#include <linux/fs.h>
#include <linux/mount.h>
@@ -23,6 +24,7 @@
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
#include <net/net_namespace.h>
+#include <asm/sections.h>
#include "mount.h"

enum legacy_fs_param {
@@ -327,7 +329,7 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
int ret;

if (!src_fc->ops->dup)
- return ERR_PTR(-ENOTSUPP);
+ return ERR_PTR(-EOPNOTSUPP);

fc = kmemdup(src_fc, sizeof(struct legacy_fs_context), GFP_KERNEL);
if (!fc)
@@ -340,6 +342,8 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
get_net(fc->net_ns);
get_user_ns(fc->user_ns);
get_cred(fc->cred);
+ if (fc->log)
+ refcount_inc(&fc->log->usage);

/* Can't call put until we've called ->dup */
ret = fc->ops->dup(fc, src_fc);
@@ -357,6 +361,91 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
}
EXPORT_SYMBOL(vfs_dup_fs_context);

+/**
+ * logfc - Log a message to a filesystem context
+ * @fc: The filesystem context to log to.
+ * @fmt: The printf-style format string of the message.
+ */
+void logfc(struct fs_context *fc, const char *fmt, ...)
+{
+ static const char store_failure[] = "OOM: Can't store error string";
+ struct fc_log *log = fc->log;
+ unsigned int logsize = ARRAY_SIZE(log->buffer);
+ const char *p;
+ va_list va;
+ char *q;
+ u8 freeable, index;
+
+ if (!log)
+ return;
+
+ va_start(va, fmt);
+ if (!strchr(fmt, '%')) {
+ p = fmt;
+ goto unformatted_string;
+ }
+ if (strcmp(fmt, "%s") == 0) {
+ p = va_arg(va, const char *);
+ goto unformatted_string;
+ }
+
+ q = kvasprintf(GFP_KERNEL, fmt, va);
+copied_string:
+ if (!q)
+ goto store_failure;
+ freeable = 1;
+ goto store_string;
+
+unformatted_string:
+ if ((unsigned long)p >= (unsigned long)__start_rodata &&
+ (unsigned long)p < (unsigned long)__end_rodata)
+ goto const_string;
+ if (within_module_core((unsigned long)p, log->owner))
+ goto const_string;
+ q = kstrdup(p, GFP_KERNEL);
+ goto copied_string;
+
+store_failure:
+ p = store_failure;
+const_string:
+ q = (char *)p;
+ freeable = 0;
+store_string:
+ index = log->head & (logsize - 1);
+ if ((u8)(log->head - log->tail) == logsize) {
+ /* The buffer is full, discard the oldest message */
+ if (log->need_free & (1 << index))
+ kfree(log->buffer[index]);
+ log->tail++;
+ }
+
+ log->buffer[index] = q;
+ log->need_free &= ~(1 << index);
+ log->need_free |= freeable << index;
+ log->head++;
+ va_end(va);
+}
+EXPORT_SYMBOL(logfc);
+
+/*
+ * Free a logging structure.
+ */
+static void put_fc_log(struct fs_context *fc)
+{
+ struct fc_log *log = fc->log;
+ int i;
+
+ if (log) {
+ if (refcount_dec_and_test(&log->usage)) {
+ fc->log = NULL;
+ for (i = 0; i <= 7; i++)
+ if (log->need_free & (1 << i))
+ kfree(log->buffer[i]);
+ kfree(log);
+ }
+ }
+}
+
/**
* put_fs_context - Dispose of a superblock configuration context.
* @fc: The context to dispose of.
@@ -385,6 +474,7 @@ void put_fs_context(struct fs_context *fc)
if (fc->cred)
put_cred(fc->cred);
kfree(fc->subtype);
+ put_fc_log(fc);
put_filesystem(fc->fs_type);
kfree(fc->source);
kfree(fc);
diff --git a/fs/fsopen.c b/fs/fsopen.c
index d69155b9303e..df3f603001a3 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -159,7 +159,57 @@ static ssize_t fscontext_fs_write(struct file *file,
goto err_unlock;
}

+/*
+ * Allow the user to read back any error, warning or informational messages.
+ */
+static ssize_t fscontext_fs_read(struct file *file,
+ char __user *_buf, size_t len, loff_t *pos)
+{
+ struct fs_context *fc = file->private_data;
+ struct fc_log *log = fc->log;
+ struct inode *inode = file_inode(file);
+ unsigned int logsize = ARRAY_SIZE(log->buffer);
+ ssize_t ret;
+ char *p;
+ bool need_free;
+ int index, n;
+
+ ret = inode_lock_killable(inode);
+ if (ret < 0)
+ return ret;
+
+ ret = -ENODATA;
+ if (log->head != log->tail) {
+ index = log->tail & (logsize - 1);
+ p = log->buffer[index];
+ need_free = log->need_free & (1 << index);
+ log->buffer[index] = NULL;
+ log->need_free &= ~(1 << index);
+ log->tail++;
+ ret = 0;
+ }
+
+ inode_unlock(inode);
+ if (ret < 0)
+ return ret;
+
+ ret = -EMSGSIZE;
+ n = strlen(p);
+ if (n > len)
+ goto err_free;
+ ret = -EFAULT;
+ if (copy_to_user(_buf, p, n) != 0)
+ goto err_free;
+ ret = n;
+
+err_free:
+ if (need_free)
+ kfree(p);
+ return ret;
+}
+
const struct file_operations fscontext_fs_fops = {
+ .read = fscontext_fs_read,
.write = fscontext_fs_write,
.release = fscontext_fs_release,
.llseek = no_llseek,
@@ -330,6 +380,7 @@ SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,
struct file_system_type *fs_type;
struct fs_context *fc;
const char *fs_name;
+ int ret;

if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
@@ -353,7 +404,18 @@ SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,

fc->phase = FS_CONTEXT_CREATE_PARAMS;

+ ret = -ENOMEM;
+ fc->log = kzalloc(sizeof(*fc->log), GFP_KERNEL);
+ if (!fc->log)
+ goto err_fc;
+ refcount_set(&fc->log->usage, 1);
+ fc->log->owner = fs_type->owner;
+
return fsopen_create_fd(fc, flags & FSOPEN_CLOEXEC);
+
+err_fc:
+ put_fs_context(fc);
+ return ret;
}

/*
@@ -396,9 +458,18 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char *, path, unsigned int, flags)

fc->phase = FS_CONTEXT_RECONF_PARAMS;

+ ret = -ENOMEM;
+ fc->log = kzalloc(sizeof(*fc->log), GFP_KERNEL);
+ if (!fc->log)
+ goto err_fc;
+ refcount_set(&fc->log->usage, 1);
+ fc->log->owner = fc->fs_type->owner;
+
path_put(&target);
return fsopen_create_fd(fc, flags & FSPICK_CLOEXEC);

+err_fc:
+ put_fs_context(fc);
err_path:
path_put(&target);
err:
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index bec4022e3f4b..c6c4c403b3f9 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -13,6 +13,7 @@
#define _LINUX_FS_CONTEXT_H

#include <linux/kernel.h>
+#include <linux/refcount.h>
#include <linux/errno.h>

struct cred;
@@ -64,6 +65,7 @@ struct fs_context {
struct user_namespace *user_ns; /* The user namespace for this mount */
struct net *net_ns; /* The network namespace for this mount */
const struct cred *cred; /* The mounter's credentials */
+ struct fc_log *log; /* Logging buffer */
char *source; /* The source name (eg. dev path) */
char *subtype; /* The subtype to set on the superblock */
void *security; /* The LSM context */
@@ -117,4 +119,60 @@ extern int vfs_get_super(struct fs_context *fc,

extern const struct file_operations fscontext_fs_fops;

+/*
+ * Mount error, warning and informational message logging. This structure is
+ * shareable between a mount and a subordinate mount.
+ */
+struct fc_log {
+ refcount_t usage;
+ u8 head; /* Insertion index in buffer[] */
+ u8 tail; /* Removal index in buffer[] */
+ u8 need_free; /* Mask of kfree'able items in buffer[] */
+ struct module *owner; /* Owner module for strings that don't then need freeing */
+ char *buffer[8];
+};
+
+extern __attribute__((format(printf, 2, 3)))
+void logfc(struct fs_context *fc, const char *fmt, ...);
+
+/**
+ * infof - Store supplementary informational message
+ * @fc: The context in which to log the informational message
+ * @fmt: The format string
+ *
+ * Store the supplementary informational message for the process if the process
+ * has enabled the facility.
+ */
+#define infof(fc, fmt, ...) ({ logfc(fc, "i "fmt, ## __VA_ARGS__); })
+
+/**
+ * warnf - Store supplementary warning message
+ * @fc: The context in which to log the warning message
+ * @fmt: The format string
+ *
+ * Store the supplementary warning message for the process if the process has
+ * enabled the facility.
+ */
+#define warnf(fc, fmt, ...) ({ logfc(fc, "w "fmt, ## __VA_ARGS__); })
+
+/**
+ * errorf - Store supplementary error message
+ * @fc: The context in which to log the error message
+ * @fmt: The format string
+ *
+ * Store the supplementary error message for the process if the process has
+ * enabled the facility.
+ */
+#define errorf(fc, fmt, ...) ({ logfc(fc, "e "fmt, ## __VA_ARGS__); })
+
+/**
+ * invalf - Store supplementary invalid argument error message
+ * @fc: The context in which to log the error message
+ * @fmt: The format string
+ *
+ * Store the supplementary error message for the process if the process has
+ * enabled the facility and return -EINVAL.
+ */
+#define invalf(fc, fmt, ...) ({ errorf(fc, fmt, ## __VA_ARGS__); -EINVAL; })
+
#endif /* _LINUX_FS_CONTEXT_H */


2018-05-25 02:49:58

by David Howells

Subject: [PATCH 13/32] proc: Add fs_context support to procfs [ver #8]

Add fs_context support to procfs.
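
The conversion below splits option handling into two phases: each option is
validated and staged in a private context, and the staged values are applied
to the pid namespace only when the superblock is filled or reconfigured. A
minimal userspace sketch of that pattern follows. All demo_* names are
hypothetical, and sscanf() stands in for the kernel's match_token() and
match_int(), so it is looser about trailing characters than the real parser.

```c
#include <assert.h>
#include <stdio.h>

enum { DEMO_OPT_GID, DEMO_OPT_HIDEPID };

struct demo_proc_ctx {
	unsigned long mask;	/* one bit per option that was supplied */
	int gid;
	int hidepid;
};

struct demo_state {		/* stands in for the live pid namespace settings */
	int gid;
	int hidepid;
};

/* Phase 1: validate one "key=value" option and stage it in the context. */
static int demo_parse_option(struct demo_proc_ctx *ctx, const char *opt)
{
	int val;

	assert(ctx && opt);
	if (sscanf(opt, "gid=%d", &val) == 1) {
		ctx->gid = val;
		ctx->mask |= 1UL << DEMO_OPT_GID;
		return 0;
	}
	if (sscanf(opt, "hidepid=%d", &val) == 1) {
		if (val < 0 || val > 2)
			return -1;	/* out of range: nothing is staged */
		ctx->hidepid = val;
		ctx->mask |= 1UL << DEMO_OPT_HIDEPID;
		return 0;
	}
	return -1;			/* unrecognised option */
}

/* Phase 2: copy across only the options whose mask bit is set. */
static void demo_apply(const struct demo_proc_ctx *ctx, struct demo_state *st)
{
	if (ctx->mask & (1UL << DEMO_OPT_GID))
		st->gid = ctx->gid;
	if (ctx->mask & (1UL << DEMO_OPT_HIDEPID))
		st->hidepid = ctx->hidepid;
}
```

Because the live state is touched only in the second phase, a rejected option
leaves it unchanged, and options that were never given keep their previous
values.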

Signed-off-by: David Howells <[email protected]>
---

fs/proc/inode.c | 2 -
fs/proc/internal.h | 2 -
fs/proc/root.c | 179 ++++++++++++++++++++++++++++++++++------------------
3 files changed, 120 insertions(+), 63 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 0b13cf6eb6d7..7aa86dd65ba8 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -128,7 +128,7 @@ const struct super_operations proc_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = proc_evict_inode,
.statfs = simple_statfs,
- .remount_fs = proc_remount,
+ .reconfigure = proc_reconfigure,
.show_options = proc_show_options,
};

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c918ec4cc0d9..77254851327c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -265,7 +265,7 @@ static inline void proc_tty_init(void) {}
extern struct proc_dir_entry proc_root;

extern void proc_self_init(void);
-extern int proc_remount(struct super_block *, int *, char *, size_t);
+extern int proc_reconfigure(struct super_block *, struct fs_context *);

/*
* task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 2fbc177f37a8..a379edccd880 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -19,14 +19,23 @@
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/user_namespace.h>
+#include <linux/fs_context.h>
#include <linux/mount.h>
#include <linux/pid_namespace.h>
#include <linux/parser.h>
#include <linux/cred.h>
#include <linux/magic.h>
+#include <linux/slab.h>

#include "internal.h"

+struct proc_fs_context {
+ struct pid_namespace *pid_ns;
+ unsigned long mask;
+ int hidepid;
+ int gid;
+};
+
enum {
Opt_gid, Opt_hidepid, Opt_err,
};
@@ -37,56 +46,60 @@ static const match_table_t tokens = {
{Opt_err, NULL},
};

-static int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_option(struct fs_context *fc, char *opt, size_t len)
{
- char *p;
+ struct proc_fs_context *ctx = fc->fs_private;
substring_t args[MAX_OPT_ARGS];
- int option;
-
- if (!options)
- return 1;
-
- while ((p = strsep(&options, ",")) != NULL) {
- int token;
- if (!*p)
- continue;
-
- args[0].to = args[0].from = NULL;
- token = match_token(p, tokens, args);
- switch (token) {
- case Opt_gid:
- if (match_int(&args[0], &option))
- return 0;
- pid->pid_gid = make_kgid(current_user_ns(), option);
- break;
- case Opt_hidepid:
- if (match_int(&args[0], &option))
- return 0;
- if (option < HIDEPID_OFF ||
- option > HIDEPID_INVISIBLE) {
- pr_err("proc: hidepid value must be between 0 and 2.\n");
- return 0;
- }
- pid->hide_pid = option;
- break;
- default:
- pr_err("proc: unrecognized mount option \"%s\" "
- "or missing value\n", p);
- return 0;
+ int token;
+
+ args[0].to = args[0].from = NULL;
+ token = match_token(opt, tokens, args);
+ switch (token) {
+ case Opt_gid:
+ if (match_int(&args[0], &ctx->gid))
+ return -EINVAL;
+ break;
+
+ case Opt_hidepid:
+ if (match_int(&args[0], &ctx->hidepid))
+ return -EINVAL;
+ if (ctx->hidepid < HIDEPID_OFF ||
+ ctx->hidepid > HIDEPID_INVISIBLE) {
+ pr_err("proc: hidepid value must be between 0 and 2.\n");
+ return -EINVAL;
}
+ break;
+
+ default:
+ pr_err("proc: unrecognized mount option \"%s\" or missing value\n",
+ opt);
+ return -EINVAL;
}

- return 1;
+ ctx->mask |= 1 << token;
+ return 0;
+}
+
+static void proc_set_options(struct super_block *s,
+ struct fs_context *fc,
+ struct pid_namespace *pid_ns,
+ struct user_namespace *user_ns)
+{
+ struct proc_fs_context *ctx = fc->fs_private;
+
+ if (ctx->mask & (1 << Opt_gid))
+ pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
+ if (ctx->mask & (1 << Opt_hidepid))
+ pid_ns->hide_pid = ctx->hidepid;
}

-static int proc_fill_super(struct super_block *s, void *data, size_t data_size, int silent)
+static int proc_fill_super(struct super_block *s, struct fs_context *fc)
{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
struct inode *root_inode;
int ret;

- if (!proc_parse_options(data, ns))
- return -EINVAL;
+ proc_set_options(s, fc, pid_ns, current_user_ns());

/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -103,7 +116,7 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
* top of it
*/
s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
+
pde_get(&proc_root);
root_inode = proc_get_inode(s, &proc_root);
if (!root_inode) {
@@ -124,30 +137,52 @@ static int proc_fill_super(struct super_block *s, void *data, size_t data_size,
return proc_setup_thread_self(s);
}

-int proc_remount(struct super_block *sb, int *flags,
- char *data, size_t data_size)
+int proc_reconfigure(struct super_block *sb, struct fs_context *fc)
{
struct pid_namespace *pid = sb->s_fs_info;

sync_filesystem(sb);
- return !proc_parse_options(data, pid);
+
+ if (fc)
+ proc_set_options(sb, fc, pid, current_user_ns());
+ return 0;
}

-static struct dentry *proc_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, size_t data_size)
+static int proc_get_tree(struct fs_context *fc)
{
- struct pid_namespace *ns;
+ struct proc_fs_context *ctx = fc->fs_private;

- if (flags & SB_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = task_active_pid_ns(current);
- }
+ fc->s_fs_info = ctx->pid_ns;
+ return vfs_get_super(fc, vfs_get_keyed_super, proc_fill_super);
+}
+
+static void proc_fs_context_free(struct fs_context *fc)
+{
+ struct proc_fs_context *ctx = fc->fs_private;
+
+ if (ctx->pid_ns)
+ put_pid_ns(ctx->pid_ns);
+ kfree(ctx);
+}
+
+static const struct fs_context_operations proc_fs_context_ops = {
+ .free = proc_fs_context_free,
+ .parse_option = proc_parse_option,
+ .get_tree = proc_get_tree,
+};

- return mount_ns(fs_type, flags, data, data_size, ns, ns->user_ns,
- proc_fill_super);
+static int proc_init_fs_context(struct fs_context *fc, struct dentry *reference)
+{
+ struct proc_fs_context *ctx;
+
+ ctx = kzalloc(sizeof(struct proc_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ fc->fs_private = ctx;
+ fc->ops = &proc_fs_context_ops;
+ return 0;
}

static void proc_kill_sb(struct super_block *sb)
@@ -164,10 +199,10 @@ static void proc_kill_sb(struct super_block *sb)
}

static struct file_system_type proc_fs_type = {
- .name = "proc",
- .mount = proc_mount,
- .kill_sb = proc_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .name = "proc",
+ .init_fs_context = proc_init_fs_context,
+ .kill_sb = proc_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

void __init proc_root_init(void)
@@ -205,7 +240,7 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr
{
if (!proc_pid_lookup(dir, dentry, flags))
return NULL;
-
+
return proc_lookup(dir, dentry, flags);
}

@@ -259,9 +294,31 @@ struct proc_dir_entry proc_root = {

int pid_ns_prepare_proc(struct pid_namespace *ns)
{
+ struct proc_fs_context *ctx;
+ struct fs_context *fc;
struct vfsmount *mnt;
+ int ret;
+
+ fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ctx = fc->fs_private;
+ if (ctx->pid_ns != ns) {
+ put_pid_ns(ctx->pid_ns);
+ get_pid_ns(ns);
+ ctx->pid_ns = ns;
+ }
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0) {
+ put_fs_context(fc);
+ return ret;
+ }

- mnt = kern_mount_data(&proc_fs_type, ns, 0);
+ mnt = vfs_create_mount(fc, 0);
+ put_fs_context(fc);
if (IS_ERR(mnt))
return PTR_ERR(mnt);



2018-05-25 02:50:00

by Joe Perches

Subject: Re: [PATCH 23/32] VFS: Implement logging through fs_context [ver #8]

On Fri, 2018-05-25 at 01:07 +0100, David Howells wrote:
> Implement the ability for filesystems to log error, warning and
> informational messages through the fs_context. These can be extracted by
> userspace by reading from an fd created by fsopen().
>
> Error messages are prefixed with "e ", warnings with "w " and informational
> messages with "i ".
[]
> diff --git a/fs/fs_context.c b/fs/fs_context.c
[]
> +/**
> + * logfc - Log a message to a filesystem context
> + * @fc: The filesystem context to log to.
> + * @fmt: The format of the buffer.
> + */
> +void logfc(struct fs_context *fc, const char *fmt, ...)
> +{
> + static const char store_failure[] = "OOM: Can't store error string";
> + struct fc_log *log = fc->log;
> + unsigned int logsize = ARRAY_SIZE(log->buffer);
> + const char *p;
> + va_list va;
> + char *q;
> + u8 freeable, index;
> +
> + if (!log)
> + return;
> +
> + va_start(va, fmt);
> + if (!strchr(fmt, '%')) {
> + p = fmt;
> + goto unformatted_string;
> + }
> + if (strcmp(fmt, "%s") == 0) {
> + p = va_arg(va, const char *);
> + goto unformatted_string;
> + }
> +
> + q = kvasprintf(GFP_KERNEL, fmt, va);
> +copied_string:
> + if (!q)
> + goto store_failure;
> + freeable = 1;
> + goto store_string;
> +
> +unformatted_string:
> + if ((unsigned long)p >= (unsigned long)__start_rodata &&
> + (unsigned long)p < (unsigned long)__end_rodata)
> + goto const_string;
> + if (within_module_core((unsigned long)p, log->owner))
> + goto const_string;
> + q = kstrdup(p, GFP_KERNEL);
> + goto copied_string;
> +
> +store_failure:
> + p = store_failure;
> +const_string:
> + q = (char *)p;
> + freeable = 0;
> +store_string:
> + index = log->head & (logsize - 1);
> + if ((int)log->head - (int)log->tail == 8) {
> + /* The buffer is full, discard the oldest message */
> + if (log->need_free & (1 << index))
> + kfree(log->buffer[index]);
> + log->tail++;
> + }
> +
> + log->buffer[index] = q;
> + log->need_free &= ~(1 << index);
> + log->need_free |= freeable << index;
> + log->head++;
> + va_end(va);
> +}
> +EXPORT_SYMBOL(logfc);

Perhaps this could be renamed to something more obviously
associated with fs_context.

[]
> diff --git a/fs/fsopen.c b/fs/fsopen.c
[]
> @@ -159,7 +159,57 @@ static ssize_t fscontext_fs_write(struct file *file,
> goto err_unlock;
> }
>
> +/*
> + * Allow the user to read back any error, warning or informational messages.
> + */
> +static ssize_t fscontext_fs_read(struct file *file,
> + char __user *_buf, size_t len, loff_t *pos)
> +{
> + struct fs_context *fc = file->private_data;
> + struct fc_log *log = fc->log;
> + struct inode *inode = file_inode(file);
> + unsigned int logsize = ARRAY_SIZE(log->buffer);

logsize isn't modified, could be removed and ARRAY_SIZE used in-place.

> + ssize_t ret;
> + char *p;
> + bool need_free;

It _looks_ like need_free isn't set by all codepaths, but
it doesn't need to be given the goto/return.

It also seems that the logic here isn't straightforward and
could be rewritten to be more obvious.
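
For illustration, the store and fetch sides of the 8-slot log can be modelled
in userspace roughly as below. This is only a sketch: the demo_* names are
hypothetical, locking is omitted, and the fullness test compares the counter
difference as an 8-bit value so that head/tail wraparound is harmless. That
last choice is mine and is not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LOG_SLOTS 8	/* power of two, like buffer[8] in struct fc_log */

struct demo_log {
	uint8_t head;		/* insertion counter */
	uint8_t tail;		/* removal counter */
	uint8_t need_free;	/* bit n set: slot n holds heap memory */
	char *slot[LOG_SLOTS];
};

/* Heap-copy a string; returns NULL on allocation failure. */
static char *demo_strdup(const char *s)
{
	size_t n = strlen(s) + 1;
	char *q = malloc(n);

	if (q)
		memcpy(q, s, n);
	return q;
}

/* Store msg. If dup is nonzero a heap copy is kept, else the pointer itself. */
static void demo_log_put(struct demo_log *log, const char *msg, int dup)
{
	unsigned int index, freeable = 0;
	char *q = (char *)msg;

	assert(log && msg);
	if (dup) {
		q = demo_strdup(msg);
		if (q)
			freeable = 1;
		else
			q = (char *)"e OOM: can't store message";
	}

	index = log->head & (LOG_SLOTS - 1);
	if ((uint8_t)(log->head - log->tail) == LOG_SLOTS) {
		/* Full: the oldest entry lives in this same slot, so drop it. */
		if (log->need_free & (1u << index))
			free(log->slot[index]);
		log->tail++;
	}

	log->slot[index] = q;
	log->need_free &= ~(1u << index);
	log->need_free |= freeable << index;
	log->head++;
}

/* Remove the oldest message, or return NULL if the log is empty.
 * *freeable tells the caller whether it must free() the result. */
static char *demo_log_get(struct demo_log *log, int *freeable)
{
	unsigned int index;
	char *p;

	assert(log && freeable);
	if (log->head == log->tail)
		return NULL;

	index = log->tail & (LOG_SLOTS - 1);
	p = log->slot[index];
	*freeable = (log->need_free >> index) & 1;
	log->slot[index] = NULL;
	log->need_free &= ~(1u << index);
	log->tail++;
	return p;
}
```

Having the fetch helper hand back the freeable flag together with the string
means the caller never sees an unset flag on the empty path.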

> + int index, n;
> +
> + ret = inode_lock_killable(inode);
> + if (ret < 0)
> + return ret;
> +
> + ret = -ENODATA;
> + if (log->head != log->tail) {
> + index = log->tail & (logsize - 1);
> + p = log->buffer[index];
> + need_free = log->need_free & (1 << index);
> + log->buffer[index] = NULL;
> + log->need_free &= ~(1 << index);
> + log->tail++;
> + ret = 0;
> + }
> +
> + inode_unlock(inode);
> + if (ret < 0)
> + return ret;
> +
> + ret = -EMSGSIZE;
> + n = strlen(p);
> + if (n > len)
> + goto err_free;
> + ret = -EFAULT;
> + if (copy_to_user(_buf, p, n) != 0)
> + goto err_free;
> + ret = n;
> +
> +err_free:
> + if (need_free)
> + kfree(p);
> + return ret;
> +}
> +
> const struct file_operations fscontext_fs_fops = {
> + .read = fscontext_fs_read,
> .write = fscontext_fs_write,
> .release = fscontext_fs_release,
> .llseek = no_llseek,
> @@ -330,6 +380,7 @@ SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,
> struct file_system_type *fs_type;
> struct fs_context *fc;
> const char *fs_name;
> + int ret;
>
> if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
> return -EPERM;
> @@ -353,7 +404,18 @@ SYSCALL_DEFINE5(fsopen, const char __user *, _fs_name, unsigned int, flags,
>
> fc->phase = FS_CONTEXT_CREATE_PARAMS;
>
> + ret = -ENOMEM;
> + fc->log = kzalloc(sizeof(*fc->log), GFP_KERNEL);
> + if (!fc->log)
> + goto err_fc;
> + refcount_set(&fc->log->usage, 1);
> + fc->log->owner = fs_type->owner;
> +
> return fsopen_create_fd(fc, flags & FSOPEN_CLOEXEC);
> +
> +err_fc:
> + put_fs_context(fc);
> + return ret;
> }

[]

> diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
>
> @@ -117,4 +119,60 @@ extern int vfs_get_super(struct fs_context *fc,
>
> extern const struct file_operations fscontext_fs_fops;
>
> +/*
> + * Mount error, warning and informational message logging. This structure is
> + * shareable between a mount and a subordinate mount.
> + */
> +struct fc_log {
> + refcount_t usage;
> + u8 head; /* Insertion index in buffer[] */
> + u8 tail; /* Removal index in buffer[] */
> + u8 need_free; /* Mask of kfree'able items in buffer[] */
> + struct module *owner; /* Owner module for strings that don't then need freeing */
> + char *buffer[8];
> +};
> +
> +extern __attribute__((format(printf, 2, 3)))

__printf(2, 3)

> +void logfc(struct fs_context *fc, const char *fmt, ...);
> +
> +/**
> + * infof - Store supplementary informational message
> + * @fc: The context in which to log the informational message
> + * @fmt: The format string
> + *
> + * Store the supplementary informational message for the process if the process
> + * has enabled the facility.
> + */
> +#define infof(fc, fmt, ...) ({ logfc(fc, "i "fmt, ## __VA_ARGS__); })

Why a statement expression macro and not just

#define infof(fc, fmt, ...) logfc(fc, "i " fmt, ##__VA_ARGS__)

etc...
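
To make the trade-off concrete, here is a small userspace sketch in GNU C.
The demo_* names and the single-message sink are hypothetical stand-ins for
logfc() and fc->log. A plain expression macro is enough for the helpers that
only forward to the logger; a statement expression is only needed where the
macro must also yield a value, as invalf() does with -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static char demo_last[128];	/* hypothetical sink: keeps the last message */

static void demo_logf(const char *fmt, ...)
	__attribute__((format(printf, 1, 2)));

static void demo_logf(const char *fmt, ...)
{
	va_list va;

	assert(fmt != NULL);
	va_start(va, fmt);
	vsnprintf(demo_last, sizeof(demo_last), fmt, va);
	va_end(va);
}

/* Plain expression macros: the "e "/"w " prefix is glued on by string
 * literal concatenation, so fmt must itself be a string literal. */
#define demo_errorf(fmt, ...)	demo_logf("e " fmt, ##__VA_ARGS__)
#define demo_warnf(fmt, ...)	demo_logf("w " fmt, ##__VA_ARGS__)

/* Statement expression (GNU extension): log, then evaluate to -EINVAL. */
#define demo_invalf(fmt, ...) \
	({ demo_errorf(fmt, ##__VA_ARGS__); -EINVAL; })

/* True if the last stored message carries the error prefix. */
static int demo_last_is_error(void)
{
	return strncmp(demo_last, "e ", 2) == 0;
}
```

A caller can then write "return demo_invalf("bad value %d", v);" in an option
parser. The same effect is available without the extension through the comma
operator, (demo_errorf(...), -EINVAL), at some cost in readability.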


2018-05-25 02:50:00

by David Howells

Subject: [PATCH 11/32] VFS: Remove unused code after filesystem context changes [ver #8]

Remove code that is now unused after the filesystem context changes.

Signed-off-by: David Howells <[email protected]>
---

fs/internal.h | 2 -
fs/super.c | 62 --------------------------------------------
include/linux/lsm_hooks.h | 3 --
include/linux/security.h | 7 -----
security/security.c | 5 ----
security/selinux/hooks.c | 20 --------------
security/smack/smack_lsm.c | 33 -----------------------
7 files changed, 132 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 91a990234488..f47ede6ace5a 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -101,8 +101,6 @@ extern struct file *get_empty_filp(void);
extern int do_remount_sb(struct super_block *, int, void *, size_t, int,
struct fs_context *);
extern bool trylock_super(struct super_block *sb);
-extern struct dentry *mount_fs(struct file_system_type *,
- int, const char *, void *, size_t);
extern struct super_block *user_get_super(dev_t);

/*
diff --git a/fs/super.c b/fs/super.c
index b9d386d728c6..06a665628939 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1450,68 +1450,6 @@ struct dentry *mount_single(struct file_system_type *fs_type,
}
EXPORT_SYMBOL(mount_single);

-struct dentry *
-mount_fs(struct file_system_type *type, int flags, const char *name,
- void *data, size_t data_size)
-{
- struct dentry *root;
- struct super_block *sb;
- char *secdata = NULL;
- int error = -ENOMEM;
-
- if (data && !(type->fs_flags & FS_BINARY_MOUNTDATA)) {
- secdata = alloc_secdata();
- if (!secdata)
- goto out;
-
- error = security_sb_copy_data(data, data_size, secdata);
- if (error)
- goto out_free_secdata;
- }
-
- root = type->mount(type, flags, name, data, data_size);
- if (IS_ERR(root)) {
- error = PTR_ERR(root);
- goto out_free_secdata;
- }
- sb = root->d_sb;
- BUG_ON(!sb);
- WARN_ON(!sb->s_bdi);
-
- /*
- * Write barrier is for super_cache_count(). We place it before setting
- * SB_BORN as the data dependency between the two functions is the
- * superblock structure contents that we just set up, not the SB_BORN
- * flag.
- */
- smp_wmb();
- sb->s_flags |= SB_BORN;
-
- error = security_sb_kern_mount(sb, flags, secdata, data_size);
- if (error)
- goto out_sb;
-
- /*
- * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
- * but s_maxbytes was an unsigned long long for many releases. Throw
- * this warning for a little while to try and catch filesystems that
- * violate this rule.
- */
- WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
- "negative value (%lld)\n", type->name, sb->s_maxbytes);
-
- up_write(&sb->s_umount);
- free_secdata(secdata);
- return root;
-out_sb:
- dput(root);
- deactivate_locked_super(sb);
-out_free_secdata:
- free_secdata(secdata);
-out:
- return ERR_PTR(error);
-}
-
/*
* Setup private BDI for given superblock. It gets automatically cleaned up
* in generic_shutdown_super().
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 408357495d1e..5d8f8bd39b52 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1519,8 +1519,6 @@ union security_list_options {
void (*sb_free_security)(struct super_block *sb);
int (*sb_copy_data)(char *orig, size_t orig_size, char *copy);
int (*sb_remount)(struct super_block *sb, void *data, size_t data_size);
- int (*sb_kern_mount)(struct super_block *sb, int flags,
- void *data, size_t data_size);
int (*sb_show_options)(struct seq_file *m, struct super_block *sb);
int (*sb_statfs)(struct dentry *dentry);
int (*sb_mount)(const char *dev_name, const struct path *path,
@@ -1867,7 +1865,6 @@ struct security_hook_heads {
struct hlist_head sb_free_security;
struct hlist_head sb_copy_data;
struct hlist_head sb_remount;
- struct hlist_head sb_kern_mount;
struct hlist_head sb_show_options;
struct hlist_head sb_statfs;
struct hlist_head sb_mount;
diff --git a/include/linux/security.h b/include/linux/security.h
index 64cc080b9352..5040455a747d 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -246,7 +246,6 @@ int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
int security_sb_copy_data(char *orig, size_t orig_size, char *copy);
int security_sb_remount(struct super_block *sb, void *data, size_t data_size);
-int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size);
int security_sb_show_options(struct seq_file *m, struct super_block *sb);
int security_sb_statfs(struct dentry *dentry);
int security_sb_mount(const char *dev_name, const struct path *path,
@@ -606,12 +605,6 @@ static inline int security_sb_remount(struct super_block *sb, void *data, size_t
return 0;
}

-static inline int security_sb_kern_mount(struct super_block *sb, int flags,
- void *data, size_t data_size)
-{
- return 0;
-}
-
static inline int security_sb_show_options(struct seq_file *m,
struct super_block *sb)
{
diff --git a/security/security.c b/security/security.c
index 294c2fce1770..3b155f7ee3ba 100644
--- a/security/security.c
+++ b/security/security.c
@@ -425,11 +425,6 @@ int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
return call_int_hook(sb_remount, 0, sb, data, data_size);
}

-int security_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
-{
- return call_int_hook(sb_kern_mount, 0, sb, flags, data, data_size);
-}
-
int security_sb_show_options(struct seq_file *m, struct super_block *sb)
{
return call_int_hook(sb_show_options, 0, m, sb);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 3952aab4ff99..9c5d60308136 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2920,25 +2920,6 @@ static int selinux_sb_remount(struct super_block *sb, void *data, size_t data_si
goto out_free_opts;
}

-static int selinux_sb_kern_mount(struct super_block *sb, int flags, void *data, size_t data_size)
-{
- const struct cred *cred = current_cred();
- struct common_audit_data ad;
- int rc;
-
- rc = superblock_doinit(sb, data);
- if (rc)
- return rc;
-
- /* Allow all mounts performed by the kernel */
- if (flags & MS_KERNMOUNT)
- return 0;
-
- ad.type = LSM_AUDIT_DATA_DENTRY;
- ad.u.dentry = sb->s_root;
- return superblock_has_perm(cred, sb, FILESYSTEM__MOUNT, &ad);
-}
-
static int selinux_sb_statfs(struct dentry *dentry)
{
const struct cred *cred = current_cred();
@@ -7149,7 +7130,6 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
LSM_HOOK_INIT(sb_remount, selinux_sb_remount),
- LSM_HOOK_INIT(sb_kern_mount, selinux_sb_kern_mount),
LSM_HOOK_INIT(sb_show_options, selinux_sb_show_options),
LSM_HOOK_INIT(sb_statfs, selinux_sb_statfs),
LSM_HOOK_INIT(sb_mount, selinux_mount),
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index d3c4a72d1640..d1970f4a9cdc 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -1150,38 +1150,6 @@ static int smack_set_mnt_opts(struct super_block *sb,
return 0;
}

-/**
- * smack_sb_kern_mount - Smack specific mount processing
- * @sb: the file system superblock
- * @flags: the mount flags
- * @data: the smack mount options
- *
- * Returns 0 on success, an error code on failure
- */
-static int smack_sb_kern_mount(struct super_block *sb, int flags,
- void *data, size_t data_size)
-{
- int rc = 0;
- char *options = data;
- struct security_mnt_opts opts;
-
- security_init_mnt_opts(&opts);
-
- if (!options)
- goto out;
-
- rc = smack_parse_opts_str(options, &opts);
- if (rc)
- goto out_err;
-
-out:
- rc = smack_set_mnt_opts(sb, &opts, 0, NULL);
-
-out_err:
- security_free_mnt_opts(&opts);
- return rc;
-}
-
/**
* smack_sb_statfs - Smack check on statfs
* @dentry: identifies the file system in question
@@ -4942,7 +4910,6 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),
- LSM_HOOK_INIT(sb_kern_mount, smack_sb_kern_mount),
LSM_HOOK_INIT(sb_statfs, smack_sb_statfs),
LSM_HOOK_INIT(sb_set_mnt_opts, smack_set_mnt_opts),
LSM_HOOK_INIT(sb_parse_opts_str, smack_parse_opts_str),


2018-05-25 02:50:03

by David Howells

Subject: [PATCH 10/32] VFS: Implement a filesystem superblock creation/configuration context [ver #8]

Implement a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.

The mounting procedure then becomes:

(1) Allocate a new fs_context.

(2) Configure the context.

(3) Create superblock.

(4) Mount the superblock any number of times.

(5) Destroy the context.

Rather than calling fs_type->mount(), an fs_context struct is created and
fs_type->init_fs_context() is called to set it up.
fs_type->fs_context_size says how much space should be allocated for the
config context. The fs_context struct is placed at the beginning and any
extra space is for the filesystem's use.

A set of operations has to be set by ->init_fs_context() to provide
freeing, duplication, option parsing, binary data parsing, validation,
mounting and superblock filling.

Legacy filesystems are supported by the provision of a set of legacy
fs_context operations that build up a list of mount options and then invoke
fs_type->mount() from within the fs_context ->get_tree() operation. This
allows all filesystems to be accessed using fs_context.
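
The fallback can be pictured with the following userspace sketch. It is a
simplified model, not the kernel code: every demo_* and legacy_* name here is
hypothetical, and a fixed 256-byte buffer stands in for the legacy data page.
A filesystem type that supplies an init hook installs its own operations;
otherwise the context gets a legacy operations table that accumulates options
into one comma-separated string and passes it to the old mount entry point.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct demo_ctx;

struct demo_ctx_ops {
	int (*parse_option)(struct demo_ctx *ctx, const char *opt);
	int (*get_tree)(struct demo_ctx *ctx);
};

struct demo_fs_type {
	const char *name;
	int (*init_ctx)(struct demo_ctx *ctx);	/* NULL if unconverted */
	int (*mount)(const char *options);	/* old-style entry point */
};

struct demo_ctx {
	const struct demo_fs_type *type;
	const struct demo_ctx_ops *ops;
	char legacy_data[256];	/* options accumulated for ->mount() */
};

/* Legacy wrapper: append the option to the comma-separated string. */
static int legacy_parse_option(struct demo_ctx *ctx, const char *opt)
{
	size_t used = strlen(ctx->legacy_data);
	size_t need = strlen(opt) + (used ? 1 : 0);

	if (used + need >= sizeof(ctx->legacy_data))
		return -1;	/* would overflow the data buffer */
	if (used)
		ctx->legacy_data[used++] = ',';
	memcpy(ctx->legacy_data + used, opt, strlen(opt) + 1);
	return 0;
}

/* Legacy wrapper: hand the accumulated string to the old ->mount(). */
static int legacy_get_tree(struct demo_ctx *ctx)
{
	return ctx->type->mount(ctx->legacy_data);
}

static const struct demo_ctx_ops legacy_ops = {
	.parse_option	= legacy_parse_option,
	.get_tree	= legacy_get_tree,
};

/* Create a context, preferring the filesystem's own init hook. */
static struct demo_ctx *demo_new_ctx(const struct demo_fs_type *type)
{
	struct demo_ctx *ctx;

	assert(type != NULL);
	ctx = calloc(1, sizeof(*ctx));
	if (!ctx)
		return NULL;
	ctx->type = type;
	if (type->init_ctx) {
		if (type->init_ctx(ctx) < 0) {
			free(ctx);
			return NULL;
		}
	} else {
		ctx->ops = &legacy_ops;
	}
	return ctx;
}

/* Hypothetical unconverted filesystem: mount() just records its options. */
static char demo_seen[256];

static int demo_old_mount(const char *options)
{
	snprintf(demo_seen, sizeof(demo_seen), "%s", options);
	return 0;
}

static const struct demo_fs_type demo_old_fs = {
	.name	= "oldfs",
	.mount	= demo_old_mount,
};
```

Callers only ever go through ctx->ops, so converted and unconverted
filesystems look the same from the outside.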

It should be noted that, whilst this patch adds a lot of lines of code,
there is quite a bit of duplication with existing code that can be
eliminated should all filesystems be converted over.

Signed-off-by: David Howells <[email protected]>
---

fs/Makefile | 3
fs/fs_context.c | 599 ++++++++++++++++++++++++++++++++++++++++++++
fs/internal.h | 3
fs/libfs.c | 17 +
fs/namespace.c | 350 +++++++++++++++++---------
fs/super.c | 311 ++++++++++++++++++++++-
include/linux/fs.h | 13 +
include/linux/fs_context.h | 45 +++
include/linux/mount.h | 3
9 files changed, 1201 insertions(+), 143 deletions(-)
create mode 100644 fs/fs_context.c

diff --git a/fs/Makefile b/fs/Makefile
index c9375fd2c8c4..6f2dae3c32da 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -12,7 +12,8 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
- stack.o fs_struct.o statfs.o fs_pin.o nsfs.o
+ stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
+ fs_context.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_context.c b/fs/fs_context.c
new file mode 100644
index 000000000000..bef68a12ddb5
--- /dev/null
+++ b/fs/fs_context.c
@@ -0,0 +1,599 @@
+/* Provide a way to create a superblock configuration context within the kernel
+ * that allows a superblock to be set up prior to mounting.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/fs_context.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/nsproxy.h>
+#include <linux/slab.h>
+#include <linux/magic.h>
+#include <linux/security.h>
+#include <linux/parser.h>
+#include <linux/mnt_namespace.h>
+#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
+#include <net/net_namespace.h>
+#include "mount.h"
+
+enum legacy_fs_param {
+ LEGACY_FS_UNSET_PARAMS,
+ LEGACY_FS_NO_PARAMS,
+ LEGACY_FS_MONOLITHIC_PARAMS,
+ LEGACY_FS_INDIVIDUAL_PARAMS,
+ LEGACY_FS_MAGIC_PARAMS,
+};
+
+struct legacy_fs_context {
+ struct fs_context fc;
+ char *legacy_data; /* Data page for legacy filesystems */
+ char *secdata;
+ size_t data_size;
+ enum legacy_fs_param param_type;
+};
+
+static const struct fs_context_operations legacy_fs_context_ops;
+
+static const match_table_t common_set_sb_flag = {
+ { SB_DIRSYNC, "dirsync" },
+ { SB_LAZYTIME, "lazytime" },
+ { SB_MANDLOCK, "mand" },
+ { SB_POSIXACL, "posixacl" },
+ { SB_RDONLY, "ro" },
+ { SB_SYNCHRONOUS, "sync" },
+ { },
+};
+
+static const match_table_t common_clear_sb_flag = {
+ { SB_LAZYTIME, "nolazytime" },
+ { SB_MANDLOCK, "nomand" },
+ { SB_RDONLY, "rw" },
+ { SB_SILENT, "silent" },
+ { SB_SYNCHRONOUS, "async" },
+ { },
+};
+
+static const match_table_t forbidden_sb_flag = {
+ { 0, "bind" },
+ { 0, "move" },
+ { 0, "private" },
+ { 0, "remount" },
+ { 0, "shared" },
+ { 0, "slave" },
+ { 0, "unbindable" },
+ { 0, "rec" },
+ { 0, "noatime" },
+ { 0, "relatime" },
+ { 0, "norelatime" },
+ { 0, "strictatime" },
+ { 0, "nostrictatime" },
+ { 0, "nodiratime" },
+ { 0, "dev" },
+ { 0, "nodev" },
+ { 0, "exec" },
+ { 0, "noexec" },
+ { 0, "suid" },
+ { 0, "nosuid" },
+ { },
+};
+
+/*
+ * Check for a common mount option that manipulates s_flags.
+ */
+static int vfs_parse_sb_flag_option(struct fs_context *fc, char *data)
+{
+ substring_t args[MAX_OPT_ARGS];
+ unsigned int token;
+
+ token = match_token(data, common_set_sb_flag, args);
+ if (token) {
+ fc->sb_flags |= token;
+ return 1;
+ }
+
+ token = match_token(data, common_clear_sb_flag, args);
+ if (token) {
+ fc->sb_flags &= ~token;
+ return 1;
+ }
+
+ token = match_token(data, forbidden_sb_flag, args);
+ if (token)
+ return -EINVAL;
+
+ return 0;
+}
+
+/**
+ * vfs_parse_fs_option - Add a single mount option to a superblock config
+ * @fc: The filesystem context to modify
+ * @opt: The option to apply.
+ * @len: The length of the option.
+ *
+ * A single mount option in string form is applied to the filesystem context
+ * being set up. Certain standard options (for example "ro") are translated
+ * into flag bits without going to the filesystem. The active security module
+ * is allowed to observe and poach options. Any other options are passed over
+ * to the filesystem to parse.
+ *
+ * This may be called multiple times for a context.
+ *
+ * Returns 0 on success and a negative error code on failure. In the event of
+ * failure, supplementary error information may have been set.
+ */
+int vfs_parse_fs_option(struct fs_context *fc, char *opt, size_t len)
+{
+ int ret;
+
+ ret = vfs_parse_sb_flag_option(fc, opt);
+ if (ret < 0)
+ return ret;
+ if (ret == 1)
+ return 0;
+
+ ret = security_fs_context_parse_option(fc, opt, len);
+ if (ret < 0)
+ return ret;
+ if (ret == 1)
+ return 0;
+
+ if (fc->ops->parse_option)
+ return fc->ops->parse_option(fc, opt, len);
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL(vfs_parse_fs_option);
+
+/**
+ * vfs_set_fs_source - Set the source/device name in a filesystem context
+ * @fc: The filesystem context to alter
+ * @source: The name of the source
+ * @slen: Length of @source string
+ */
+int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t slen)
+{
+ char *src;
+ int ret;
+
+ if (fc->source)
+ return -EINVAL;
+ src = kmemdup_nul(source, slen, GFP_KERNEL);
+ if (!src)
+ return -ENOMEM;
+
+ ret = security_fs_context_parse_source(fc, src);
+ if (ret < 0)
+ goto error;
+
+ if (fc->ops->parse_source) {
+ ret = fc->ops->parse_source(fc, src);
+ if (ret < 0)
+ goto error;
+ }
+
+ fc->source = src;
+ return 0;
+
+error:
+ kfree(src);
+ return ret;
+}
+EXPORT_SYMBOL(vfs_set_fs_source);
+
+/**
+ * generic_parse_monolithic - Parse key[=val][,key[=val]]* mount data
+ * @fc: The superblock configuration to fill in.
+ * @data: The data to parse
+ * @data_size: The amount of data
+ *
+ * Parse a blob of data that's in key[=val][,key[=val]]* form. This can be
+ * called from the ->monolithic_mount_data() fs_context operation.
+ *
+ * Returns 0 on success or the error returned by the ->parse_option() fs_context
+ * operation on failure.
+ */
+int generic_parse_monolithic(struct fs_context *fc, void *data, size_t data_size)
+{
+ char *options = data, *opt;
+ int ret;
+
+ if (!options)
+ return 0;
+
+ while ((opt = strsep(&options, ",")) != NULL) {
+ if (*opt) {
+ ret = vfs_parse_fs_option(fc, opt, strlen(opt));
+ if (ret < 0)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(generic_parse_monolithic);
+
+/**
+ * vfs_new_fs_context - Create a filesystem context.
+ * @fs_type: The filesystem type.
+ * @reference: The dentry from which this one derives (or NULL)
+ * @sb_flags: Filesystem/superblock flags (SB_*)
+ * @purpose: The purpose that this configuration shall be used for.
+ *
+ * Open a filesystem and create a mount context. The mount context is
+ * initialised with the supplied flags and, if a submount/automount from
+ * another superblock (referred to by @reference) is supplied, may have
+ * parameters such as namespaces copied across from that superblock.
+ */
+struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct dentry *reference,
+ unsigned int sb_flags,
+ enum fs_context_purpose purpose)
+{
+ struct fs_context *fc;
+ int ret;
+
+ fc = kzalloc(sizeof(struct legacy_fs_context), GFP_KERNEL);
+ if (!fc)
+ return ERR_PTR(-ENOMEM);
+
+ fc->purpose = purpose;
+ fc->sb_flags = sb_flags;
+ fc->fs_type = get_filesystem(fs_type);
+ fc->cred = get_current_cred();
+
+ switch (purpose) {
+ case FS_CONTEXT_FOR_KERNEL_MOUNT:
+ fc->sb_flags |= SB_KERNMOUNT;
+ /* Fallthrough */
+ case FS_CONTEXT_FOR_USER_MOUNT:
+ fc->user_ns = get_user_ns(fc->cred->user_ns);
+ fc->net_ns = get_net(current->nsproxy->net_ns);
+ break;
+ case FS_CONTEXT_FOR_SUBMOUNT:
+ fc->user_ns = get_user_ns(reference->d_sb->s_user_ns);
+ fc->net_ns = get_net(current->nsproxy->net_ns);
+ break;
+ case FS_CONTEXT_FOR_RECONFIGURE:
+ /* We don't pin any namespaces as the superblock's
+ * subscriptions cannot be changed at this point.
+ */
+ fc->root = dget(reference);
+ break;
+ }
+
+
+ /* TODO: Make all filesystems support this unconditionally */
+ if (fc->fs_type->init_fs_context) {
+ ret = fc->fs_type->init_fs_context(fc, reference);
+ if (ret < 0)
+ goto err_fc;
+ } else {
+ fc->ops = &legacy_fs_context_ops;
+ }
+
+ /* Do the security check last because ->init_fs_context may change the
+ * namespace subscriptions.
+ */
+ ret = security_fs_context_alloc(fc, reference);
+ if (ret < 0)
+ goto err_fc;
+
+ return fc;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_new_fs_context);
+
+/**
+ * vfs_sb_reconfig - Create a filesystem context for remount/reconfiguration
+ * @mountpoint: The mountpoint to open
+ * @sb_flags: Filesystem/superblock flags (SB_*)
+ *
+ * Open a mounted filesystem and create a filesystem context such that a
+ * remount can be effected.
+ */
+struct fs_context *vfs_sb_reconfig(struct path *mountpoint,
+ unsigned int sb_flags)
+{
+ struct fs_context *fc;
+
+ fc = vfs_new_fs_context(mountpoint->dentry->d_sb->s_type,
+ mountpoint->dentry,
+ sb_flags, FS_CONTEXT_FOR_RECONFIGURE);
+ if (IS_ERR(fc))
+ return fc;
+
+ return fc;
+}
+
+/**
+ * vfs_dup_fs_context - Duplicate a filesystem context.
+ * @src_fc: The context to copy.
+ */
+struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
+{
+ struct fs_context *fc;
+ int ret;
+
+ if (!src_fc->ops->dup)
+ return ERR_PTR(-ENOTSUPP);
+
+ fc = kmemdup(src_fc, sizeof(struct legacy_fs_context), GFP_KERNEL);
+ if (!fc)
+ return ERR_PTR(-ENOMEM);
+
+ fc->fs_private = NULL;
+ fc->source = NULL;
+ fc->security = NULL;
+ get_filesystem(fc->fs_type);
+ get_net(fc->net_ns);
+ get_user_ns(fc->user_ns);
+ get_cred(fc->cred);
+
+ /* Can't call put until we've called ->dup */
+ ret = fc->ops->dup(fc, src_fc);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = security_fs_context_dup(fc, src_fc);
+ if (ret < 0)
+ goto err_fc;
+ return fc;
+
+err_fc:
+ put_fs_context(fc);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_dup_fs_context);
+
+/**
+ * put_fs_context - Dispose of a superblock configuration context.
+ * @fc: The context to dispose of.
+ */
+void put_fs_context(struct fs_context *fc)
+{
+ struct super_block *sb;
+
+ if (fc->root) {
+ sb = fc->root->d_sb;
+ dput(fc->root);
+ fc->root = NULL;
+ if (fc->drop_sb) {
+ deactivate_super(sb);
+ fc->drop_sb = false;
+ }
+ }
+
+ if (fc->ops && fc->ops->free)
+ fc->ops->free(fc);
+
+ security_fs_context_free(fc);
+ if (fc->net_ns)
+ put_net(fc->net_ns);
+ put_user_ns(fc->user_ns);
+ if (fc->cred)
+ put_cred(fc->cred);
+ kfree(fc->subtype);
+ put_filesystem(fc->fs_type);
+ kfree(fc->source);
+ kfree(fc);
+}
+EXPORT_SYMBOL(put_fs_context);
+
+/*
+ * Free the config for a filesystem that doesn't support fs_context.
+ */
+static void legacy_fs_context_free(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+
+ free_secdata(ctx->secdata);
+ switch (ctx->param_type) {
+ case LEGACY_FS_UNSET_PARAMS:
+ case LEGACY_FS_NO_PARAMS:
+ break;
+ case LEGACY_FS_MAGIC_PARAMS:
+ break; /* ctx->legacy_data is a weird pointer */
+ default:
+ kfree(ctx->legacy_data);
+ break;
+ }
+}
+
+/*
+ * Duplicate a legacy config.
+ */
+static int legacy_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+ struct legacy_fs_context *src_ctx = container_of(src_fc, struct legacy_fs_context, fc);
+
+ switch (ctx->param_type) {
+ case LEGACY_FS_MONOLITHIC_PARAMS:
+ case LEGACY_FS_INDIVIDUAL_PARAMS:
+ ctx->legacy_data = kmemdup(src_ctx->legacy_data,
+ src_ctx->data_size, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ /* Fall through */
+ default:
+ break;
+ }
+ return 0;
+}
+
+/*
+ * Add an option to a legacy config. We build up a comma-separated list of
+ * options.
+ */
+static int legacy_parse_option(struct fs_context *fc, char *opt, size_t len)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+ unsigned int size = ctx->data_size;
+
+ if (ctx->param_type != LEGACY_FS_UNSET_PARAMS &&
+ ctx->param_type != LEGACY_FS_INDIVIDUAL_PARAMS) {
+ pr_warn("VFS: Can't mix monolithic and individual options\n");
+ return -EINVAL;
+ }
+
+ if (len > PAGE_SIZE - 2 - size)
+ return -EINVAL;
+ if (memchr(opt, ',', len) != NULL)
+ return -EINVAL;
+ if (!ctx->legacy_data) {
+ ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ }
+
+ ctx->legacy_data[size++] = ',';
+ memcpy(ctx->legacy_data + size, opt, len);
+ size += len;
+ ctx->legacy_data[size] = '\0';
+ ctx->data_size = size;
+ ctx->param_type = LEGACY_FS_INDIVIDUAL_PARAMS;
+ return 0;
+}
+
+/*
+ * Add monolithic mount data.
+ */
+static int legacy_parse_monolithic(struct fs_context *fc, void *data, size_t data_size)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+
+ if (ctx->param_type != LEGACY_FS_UNSET_PARAMS) {
+ pr_warn("VFS: Can't mix monolithic and individual options\n");
+ return -EINVAL;
+ }
+
+ if (!data) {
+ ctx->param_type = LEGACY_FS_NO_PARAMS;
+ return 0;
+ }
+
+ ctx->data_size = data_size;
+ if (data_size > 0) {
+ ctx->legacy_data = kmemdup(data, data_size, GFP_KERNEL);
+ if (!ctx->legacy_data)
+ return -ENOMEM;
+ ctx->param_type = LEGACY_FS_MONOLITHIC_PARAMS;
+ } else {
+ /* Some filesystems pass weird pointers through that we don't
+ * want to copy. They can indicate this by setting data_size
+ * to 0.
+ */
+ ctx->legacy_data = data;
+ ctx->param_type = LEGACY_FS_MAGIC_PARAMS;
+ }
+
+ return 0;
+}
+
+/*
+ * Use the legacy mount validation step to strip out and process security
+ * config options.
+ */
+static int legacy_validate(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+
+ switch (ctx->param_type) {
+ case LEGACY_FS_UNSET_PARAMS:
+ ctx->param_type = LEGACY_FS_NO_PARAMS;
+ /* Fall through */
+ case LEGACY_FS_NO_PARAMS:
+ case LEGACY_FS_MAGIC_PARAMS:
+ return 0;
+ default:
+ break;
+ }
+
+ if (ctx->fc.fs_type->fs_flags & FS_BINARY_MOUNTDATA)
+ return 0;
+
+ ctx->secdata = alloc_secdata();
+ if (!ctx->secdata)
+ return -ENOMEM;
+
+ return security_sb_copy_data(ctx->legacy_data, ctx->data_size,
+ ctx->secdata);
+}
+
+/*
+ * Determine the superblock subtype.
+ */
+static int legacy_set_subtype(struct fs_context *fc)
+{
+ const char *subtype = strchr(fc->fs_type->name, '.');
+
+ if (subtype) {
+ subtype++;
+ if (!subtype[0])
+ return -EINVAL;
+ } else {
+ subtype = "";
+ }
+
+ fc->subtype = kstrdup(subtype, GFP_KERNEL);
+ if (!fc->subtype)
+ return -ENOMEM;
+ return 0;
+}
+
+/*
+ * Get a mountable root with the legacy mount command.
+ */
+static int legacy_get_tree(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = container_of(fc, struct legacy_fs_context, fc);
+ struct super_block *sb;
+ struct dentry *root;
+ int ret;
+
+ root = ctx->fc.fs_type->mount(ctx->fc.fs_type, ctx->fc.sb_flags,
+ ctx->fc.source, ctx->legacy_data,
+ ctx->data_size);
+ if (IS_ERR(root))
+ return PTR_ERR(root);
+
+ sb = root->d_sb;
+ BUG_ON(!sb);
+
+ if ((ctx->fc.fs_type->fs_flags & FS_HAS_SUBTYPE) &&
+ !fc->subtype) {
+ ret = legacy_set_subtype(fc);
+ if (ret < 0)
+ goto err_sb;
+ }
+
+ ctx->fc.root = root;
+ ctx->fc.drop_sb = true;
+ return 0;
+
+err_sb:
+ dput(root);
+ deactivate_locked_super(sb);
+ return ret;
+}
+
+static const struct fs_context_operations legacy_fs_context_ops = {
+ .free = legacy_fs_context_free,
+ .dup = legacy_fs_context_dup,
+ .parse_option = legacy_parse_option,
+ .parse_monolithic = legacy_parse_monolithic,
+ .validate = legacy_validate,
+ .get_tree = legacy_get_tree,
+};
diff --git a/fs/internal.h b/fs/internal.h
index 1afa522c5f30..91a990234488 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -98,7 +98,8 @@ extern struct file *get_empty_filp(void);
/*
* super.c
*/
-extern int do_remount_sb(struct super_block *, int, void *, size_t, int);
+extern int do_remount_sb(struct super_block *, int, void *, size_t, int,
+ struct fs_context *);
extern bool trylock_super(struct super_block *sb);
extern struct dentry *mount_fs(struct file_system_type *,
int, const char *, void *, size_t);
diff --git a/fs/libfs.c b/fs/libfs.c
index 9f1f4884b7cc..823f0510e43d 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -9,6 +9,7 @@
#include <linux/slab.h>
#include <linux/cred.h>
#include <linux/mount.h>
+#include <linux/fs_context.h>
#include <linux/vfs.h>
#include <linux/quotaops.h>
#include <linux/mutex.h>
@@ -574,13 +575,27 @@ static DEFINE_SPINLOCK(pin_fs_lock);

int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *count)
{
+ struct fs_context *fc;
struct vfsmount *mnt = NULL;
+ int ret;
+
spin_lock(&pin_fs_lock);
if (unlikely(!*mount)) {
spin_unlock(&pin_fs_lock);
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
+
+ fc = vfs_new_fs_context(type, NULL, 0, FS_CONTEXT_FOR_KERNEL_MOUNT);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ mnt = ERR_PTR(ret);
+ else
+ mnt = vfs_create_mount(fc, 0);
+ put_fs_context(fc);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
+
spin_lock(&pin_fs_lock);
if (!*mount)
*mount = mnt;
diff --git a/fs/namespace.c b/fs/namespace.c
index a6ab1137f8d2..14be35d02050 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -25,8 +25,10 @@
#include <linux/magic.h>
#include <linux/bootmem.h>
#include <linux/task_work.h>
+#include <linux/file.h>
#include <linux/sched/task.h>
#include <uapi/linux/mount.h>
+#include <linux/fs_context.h>

#include "pnode.h"
#include "internal.h"
@@ -1019,56 +1021,6 @@ static struct mount *skip_mnt_tree(struct mount *p)
return p;
}

-struct vfsmount *
-vfs_kern_mount(struct file_system_type *type, int flags, const char *name,
- void *data, size_t data_size)
-{
- struct mount *mnt;
- struct dentry *root;
-
- if (!type)
- return ERR_PTR(-ENODEV);
-
- mnt = alloc_vfsmnt(name);
- if (!mnt)
- return ERR_PTR(-ENOMEM);
-
- if (flags & SB_KERNMOUNT)
- mnt->mnt.mnt_flags = MNT_INTERNAL;
-
- root = mount_fs(type, flags, name, data, data_size);
- if (IS_ERR(root)) {
- mnt_free_id(mnt);
- free_vfsmnt(mnt);
- return ERR_CAST(root);
- }
-
- mnt->mnt.mnt_root = root;
- mnt->mnt.mnt_sb = root->d_sb;
- mnt->mnt_mountpoint = mnt->mnt.mnt_root;
- mnt->mnt_parent = mnt;
- lock_mount_hash();
- list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
- unlock_mount_hash();
- return &mnt->mnt;
-}
-EXPORT_SYMBOL_GPL(vfs_kern_mount);
-
-struct vfsmount *
-vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
- const char *name, void *data, size_t data_size)
-{
- /* Until it is worked out how to pass the user namespace
- * through from the parent mount to the submount don't support
- * unprivileged mounts with submounts.
- */
- if (mountpoint->d_sb->s_user_ns != &init_user_ns)
- return ERR_PTR(-EPERM);
-
- return vfs_kern_mount(type, SB_SUBMOUNT, name, data, data_size);
-}
-EXPORT_SYMBOL_GPL(vfs_submount);
-
static struct mount *clone_mnt(struct mount *old, struct dentry *root,
int flag)
{
@@ -1596,7 +1548,7 @@ static int do_umount(struct mount *mnt, int flags)
return -EPERM;
down_write(&sb->s_umount);
if (!sb_rdonly(sb))
- retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0);
+ retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0, NULL);
up_write(&sb->s_umount);
return retval;
}
@@ -2283,6 +2235,20 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
return error;
}

+/*
+ * Parse the monolithic page of mount data given to sys_mount().
+ */
+static int parse_monolithic_mount_data(struct fs_context *fc, void *data, size_t data_size)
+{
+ int (*monolithic_mount_data)(struct fs_context *, void *, size_t);
+
+ monolithic_mount_data = fc->ops->parse_monolithic;
+ if (!monolithic_mount_data)
+ monolithic_mount_data = generic_parse_monolithic;
+
+ return monolithic_mount_data(fc, data, data_size);
+}
+
/*
* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
@@ -2291,9 +2257,11 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
static int do_remount(struct path *path, int ms_flags, int sb_flags,
int mnt_flags, void *data, size_t data_size)
{
+ struct fs_context *fc = NULL;
int err;
struct super_block *sb = path->mnt->mnt_sb;
struct mount *mnt = real_mount(path->mnt);
+ struct file_system_type *type = sb->s_type;

if (!check_mnt(mnt))
return -EINVAL;
@@ -2328,9 +2296,29 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
return -EPERM;
}

- err = security_sb_remount(sb, data, data_size);
- if (err)
- return err;
+ if (type->init_fs_context) {
+ fc = vfs_sb_reconfig(path, sb_flags);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ err = parse_monolithic_mount_data(fc, data, data_size);
+ if (err < 0)
+ goto err_fc;
+
+ if (fc->ops->validate) {
+ err = fc->ops->validate(fc);
+ if (err < 0)
+ goto err_fc;
+ }
+
+ err = security_fs_context_validate(fc);
+ if (err)
+ goto err_fc;
+ } else {
+ err = security_sb_remount(sb, data, data_size);
+ if (err)
+ return err;
+ }

down_write(&sb->s_umount);
if (ms_flags & MS_BIND)
@@ -2338,7 +2326,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
else if (!capable(CAP_SYS_ADMIN))
err = -EPERM;
else
- err = do_remount_sb(sb, sb_flags, data, data_size, 0);
+ err = do_remount_sb(sb, sb_flags, data, data_size, 0, fc);
if (!err) {
lock_mount_hash();
mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
@@ -2347,6 +2335,9 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
unlock_mount_hash();
}
up_write(&sb->s_umount);
+err_fc:
+ if (fc)
+ put_fs_context(fc);
return err;
}

@@ -2430,29 +2421,6 @@ static int do_move_mount(struct path *path, const char *old_name)
return err;
}

-static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
-{
- int err;
- const char *subtype = strchr(fstype, '.');
- if (subtype) {
- subtype++;
- err = -EINVAL;
- if (!subtype[0])
- goto err;
- } else
- subtype = "";
-
- mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
- err = -ENOMEM;
- if (!mnt->mnt_sb->s_subtype)
- goto err;
- return mnt;
-
- err:
- mntput(mnt);
- return ERR_PTR(err);
-}
-
/*
* add a mount into a namespace's mount tree
*/
@@ -2497,44 +2465,88 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
return err;
}

-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
+static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
+
+/*
+ * Create a new mount using a superblock configuration and request it
+ * be added to the namespace tree.
+ */
+static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags)
+{
+ struct vfsmount *mnt;
+ int ret;
+
+ ret = security_sb_mountpoint(fc, mountpoint,
+ mnt_flags & ~MNT_INTERNAL_FLAGS);
+ if (ret < 0)
+ return ret;
+
+ if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
+ pr_warn("VFS: Mount too revealing\n");
+ return -EPERM;
+ }
+
+ mnt = vfs_create_mount(fc, mnt_flags);
+ if (IS_ERR(mnt))
+ return PTR_ERR(mnt);
+
+ ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+ if (ret < 0)
+ goto err_mnt;
+ return ret;
+
+err_mnt:
+ mntput(mnt);
+ return ret;
+}

/*
* create a new mount for userspace and request it to be added into the
* namespace's tree
*/
-static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
- int mnt_flags, const char *name,
+static int do_new_mount(struct path *mountpoint, const char *fstype,
+ int sb_flags, int mnt_flags, const char *name,
void *data, size_t data_size)
{
- struct file_system_type *type;
- struct vfsmount *mnt;
+ struct file_system_type *fs_type;
+ struct fs_context *fc;
int err;

if (!fstype)
return -EINVAL;

- type = get_fs_type(fstype);
- if (!type)
- return -ENODEV;
-
- mnt = vfs_kern_mount(type, sb_flags, name, data, data_size);
- if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
- !mnt->mnt_sb->s_subtype)
- mnt = fs_set_subtype(mnt, fstype);
+ err = -ENODEV;
+ fs_type = get_fs_type(fstype);
+ if (!fs_type)
+ goto out;

- put_filesystem(type);
- if (IS_ERR(mnt))
- return PTR_ERR(mnt);
+ fc = vfs_new_fs_context(fs_type, NULL, sb_flags,
+ FS_CONTEXT_FOR_USER_MOUNT);
+ put_filesystem(fs_type);
+ if (IS_ERR(fc)) {
+ err = PTR_ERR(fc);
+ goto out;
+ }

- if (mount_too_revealing(mnt, &mnt_flags)) {
- mntput(mnt);
- return -EPERM;
+ if (name) {
+ err = vfs_set_fs_source(fc, name, strlen(name));
+ if (err < 0)
+ goto out_fc;
}

- err = do_add_mount(real_mount(mnt), path, mnt_flags);
- if (err)
- mntput(mnt);
+ err = parse_monolithic_mount_data(fc, data, data_size);
+ if (err < 0)
+ goto out_fc;
+
+ err = vfs_get_tree(fc);
+ if (err < 0)
+ goto out_fc;
+
+ err = do_new_mount_fc(fc, mountpoint, mnt_flags);
+out_fc:
+ put_fs_context(fc);
+out:
return err;
}

@@ -3082,6 +3094,117 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
return ksys_mount(dev_name, dir_name, type, flags, data);
}

+/**
+ * vfs_create_mount - Create a mount for a configured superblock
+ * @fc: The configuration context with the superblock attached
+ * @mnt_flags: The mount flags to apply
+ *
+ * Create a mount to an already configured superblock. If necessary, the
+ * caller should invoke vfs_get_tree() before calling this.
+ *
+ * Note that this does not attach the mount to anything.
+ */
+struct vfsmount *vfs_create_mount(struct fs_context *fc, unsigned int mnt_flags)
+{
+ struct mount *mnt;
+
+ if (!fc->root)
+ return ERR_PTR(-EINVAL);
+
+ mnt = alloc_vfsmnt(fc->source ?: "none");
+ if (!mnt)
+ return ERR_PTR(-ENOMEM);
+
+ if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+ /* It's a long-term mount; don't release mnt until we unmount it,
+ * before the filesystem is unregistered.
+ */
+ mnt_flags |= MNT_INTERNAL;
+
+ atomic_inc(&fc->root->d_sb->s_active);
+ mnt->mnt.mnt_flags = mnt_flags;
+ mnt->mnt.mnt_sb = fc->root->d_sb;
+ mnt->mnt.mnt_root = dget(fc->root);
+ mnt->mnt_mountpoint = mnt->mnt.mnt_root;
+ mnt->mnt_parent = mnt;
+
+ lock_mount_hash();
+ list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
+ unlock_mount_hash();
+ return &mnt->mnt;
+}
+EXPORT_SYMBOL(vfs_create_mount);
+
+struct vfsmount *vfs_kern_mount(struct file_system_type *type,
+ int sb_flags, const char *devname,
+ void *data, size_t data_size)
+{
+ struct fs_context *fc;
+ struct vfsmount *mnt;
+ int ret;
+
+ if (!type)
+ return ERR_PTR(-EINVAL);
+
+ fc = vfs_new_fs_context(type, NULL, sb_flags,
+ sb_flags & SB_KERNMOUNT ?
+ FS_CONTEXT_FOR_KERNEL_MOUNT :
+ FS_CONTEXT_FOR_USER_MOUNT);
+ if (IS_ERR(fc))
+ return ERR_CAST(fc);
+
+ if (devname) {
+ ret = vfs_set_fs_source(fc, devname, strlen(devname));
+ if (ret < 0)
+ goto err_fc;
+ }
+
+ ret = parse_monolithic_mount_data(fc, data, data_size);
+ if (ret < 0)
+ goto err_fc;
+
+ ret = vfs_get_tree(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ mnt = vfs_create_mount(fc, 0);
+out:
+ put_fs_context(fc);
+ return mnt;
+err_fc:
+ mnt = ERR_PTR(ret);
+ goto out;
+}
+EXPORT_SYMBOL_GPL(vfs_kern_mount);
+
+struct vfsmount *
+vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
+ const char *name, void *data, size_t data_size)
+{
+ /* Until it is worked out how to pass the user namespace
+ * through from the parent mount to the submount don't support
+ * unprivileged mounts with submounts.
+ */
+ if (mountpoint->d_sb->s_user_ns != &init_user_ns)
+ return ERR_PTR(-EPERM);
+
+ return vfs_kern_mount(type, SB_SUBMOUNT, name, data, data_size);
+}
+EXPORT_SYMBOL_GPL(vfs_submount);
+
+struct vfsmount *kern_mount(struct file_system_type *type)
+{
+ return vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL, 0);
+}
+EXPORT_SYMBOL_GPL(kern_mount);
+
+struct vfsmount *kern_mount_data(struct file_system_type *type,
+ void *data, size_t data_size)
+{
+ return vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
+}
+EXPORT_SYMBOL_GPL(kern_mount_data);
+
/*
* Return true if path is reachable from root
*
@@ -3302,22 +3425,6 @@ void put_mnt_ns(struct mnt_namespace *ns)
free_mnt_ns(ns);
}

-struct vfsmount *kern_mount_data(struct file_system_type *type,
- void *data, size_t data_size)
-{
- struct vfsmount *mnt;
- mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, data, data_size);
- if (!IS_ERR(mnt)) {
- /*
- * it is a longterm mount, don't release mnt until
- * we unmount before file sys is unregistered
- */
- real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
- }
- return mnt;
-}
-EXPORT_SYMBOL_GPL(kern_mount_data);
-
void kern_unmount(struct vfsmount *mnt)
{
/* release long term mount so mount point can be released */
@@ -3358,7 +3465,8 @@ bool current_chrooted(void)
return chrooted;
}

-static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
+static bool mnt_already_visible(struct mnt_namespace *ns,
+ const struct super_block *sb,
int *new_mnt_flags)
{
int new_flags = *new_mnt_flags;
@@ -3370,7 +3478,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
struct mount *child;
int mnt_flags;

- if (mnt->mnt.mnt_sb->s_type != new->mnt_sb->s_type)
+ if (mnt->mnt.mnt_sb->s_type != sb->s_type)
continue;

/* This mount is not fully visible if it's root directory
@@ -3421,7 +3529,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
return visible;
}

-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
+static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags)
{
const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
struct mnt_namespace *ns = current->nsproxy->mnt_ns;
@@ -3431,7 +3539,7 @@ static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
return false;

/* Can this filesystem be too revealing? */
- s_iflags = mnt->mnt_sb->s_iflags;
+ s_iflags = sb->s_iflags;
if (!(s_iflags & SB_I_USERNS_VISIBLE))
return false;

@@ -3441,7 +3549,7 @@ static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
return true;
}

- return !mnt_already_visible(ns, mnt, new_mnt_flags);
+ return !mnt_already_visible(ns, sb, new_mnt_flags);
}

bool mnt_may_suid(struct vfsmount *mnt)
diff --git a/fs/super.c b/fs/super.c
index c9d208b7999e..b9d386d728c6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,7 @@
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
#include <uapi/linux/mount.h>
+#include <linux/fs_context.h>
#include "internal.h"

static int thaw_super_locked(struct super_block *sb);
@@ -184,16 +185,13 @@ static void destroy_unused_super(struct super_block *s)
}

/**
- * alloc_super - create new superblock
- * @type: filesystem type superblock should belong to
- * @flags: the mount flags
- * @user_ns: User namespace for the super_block
+ * alloc_super - Create new superblock
+ * @fc: The filesystem configuration context
*
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
-static struct super_block *alloc_super(struct file_system_type *type, int flags,
- struct user_namespace *user_ns)
+static struct super_block *alloc_super(struct fs_context *fc)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static const struct super_operations default_op;
@@ -203,9 +201,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
return NULL;

INIT_LIST_HEAD(&s->s_mounts);
- s->s_user_ns = get_user_ns(user_ns);
+ s->s_user_ns = get_user_ns(fc->user_ns);
init_rwsem(&s->s_umount);
- lockdep_set_class(&s->s_umount, &type->s_umount_key);
+ lockdep_set_class(&s->s_umount, &fc->fs_type->s_umount_key);
/*
* sget() can have s_umount recursion.
*
@@ -229,12 +227,12 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
for (i = 0; i < SB_FREEZE_LEVELS; i++) {
if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
sb_writers_name[i],
- &type->s_writers_key[i]))
+ &fc->fs_type->s_writers_key[i]))
goto fail;
}
init_waitqueue_head(&s->s_writers.wait_unfrozen);
s->s_bdi = &noop_backing_dev_info;
- s->s_flags = flags;
+ s->s_flags = fc->sb_flags;
if (s->s_user_ns != &init_user_ns)
s->s_iflags |= SB_I_NODEV;
INIT_HLIST_NODE(&s->s_instances);
@@ -252,7 +250,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_count = 1;
atomic_set(&s->s_active, 1);
mutex_init(&s->s_vfs_rename_mutex);
- lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
+ lockdep_set_class(&s->s_vfs_rename_mutex, &fc->fs_type->s_vfs_rename_key);
init_rwsem(&s->s_dquot.dqio_sem);
s->s_maxbytes = MAX_NON_LFS;
s->s_op = &default_op;
@@ -472,6 +470,97 @@ void generic_shutdown_super(struct super_block *sb)

EXPORT_SYMBOL(generic_shutdown_super);

+/**
+ * sget_fc - Find or create a superblock
+ * @fc: Filesystem context.
+ * @test: Comparison callback
+ * @set: Setup callback
+ *
+ * Find or create a superblock using the parameters stored in the filesystem
+ * context and the two callback functions.
+ *
+ * If an extant superblock is matched, then that will be returned with an
+ * elevated reference count that the caller must transfer or discard.
+ *
+ * If no match is made, a new superblock will be allocated and basic
+ * initialisation will be performed (s_type, s_fs_info and s_id will be set and
+ * the set() callback will be invoked), the superblock will be published and it
+ * will be returned in a partially constructed state with SB_BORN and SB_ACTIVE
+ * as yet unset.
+ */
+struct super_block *sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *, struct fs_context *),
+ int (*set)(struct super_block *, struct fs_context *))
+{
+ struct super_block *s = NULL;
+ struct super_block *old;
+ int err;
+
+ if (!(fc->sb_flags & SB_KERNMOUNT) &&
+ fc->purpose != FS_CONTEXT_FOR_SUBMOUNT) {
+ /* Don't allow mounting unless the caller has CAP_SYS_ADMIN
+ * over the namespace.
+ */
+ if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT) &&
+ !capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ else if (!ns_capable(fc->user_ns, CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ }
+
+retry:
+ spin_lock(&sb_lock);
+ if (test) {
+ hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
+ if (!test(old, fc))
+ continue;
+ if (fc->user_ns != old->s_user_ns) {
+ spin_unlock(&sb_lock);
+ if (s) {
+ up_write(&s->s_umount);
+ destroy_unused_super(s);
+ }
+ return ERR_PTR(-EBUSY);
+ }
+ if (!grab_super(old))
+ goto retry;
+ if (s) {
+ up_write(&s->s_umount);
+ destroy_unused_super(s);
+ s = NULL;
+ }
+ return old;
+ }
+ }
+ if (!s) {
+ spin_unlock(&sb_lock);
+ s = alloc_super(fc);
+ if (!s)
+ return ERR_PTR(-ENOMEM);
+ goto retry;
+ }
+
+ s->s_fs_info = fc->s_fs_info;
+ err = set(s, fc);
+ if (err) {
+ s->s_fs_info = NULL;
+ spin_unlock(&sb_lock);
+ up_write(&s->s_umount);
+ destroy_unused_super(s);
+ return ERR_PTR(err);
+ }
+ fc->s_fs_info = NULL;
+ s->s_type = fc->fs_type;
+ strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
+ list_add_tail(&s->s_list, &super_blocks);
+ hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
+ spin_unlock(&sb_lock);
+ get_filesystem(s->s_type);
+ register_shrinker(&s->s_shrink);
+ return s;
+}
+EXPORT_SYMBOL(sget_fc);
+
/**
* sget_userns - find or create a superblock
* @type: filesystem type superblock should belong to
@@ -514,7 +603,14 @@ struct super_block *sget_userns(struct file_system_type *type,
}
if (!s) {
spin_unlock(&sb_lock);
- s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
+ {
+ struct fs_context fc = {
+ .fs_type = type,
+ .sb_flags = flags & ~SB_SUBMOUNT,
+ .user_ns = user_ns,
+ };
+ s = alloc_super(&fc);
+ }
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -838,11 +934,13 @@ struct super_block *user_get_super(dev_t dev)
* @data: the rest of options
* @data_size: The size of the data
* @force: whether or not to force the change
+ * @fc: the superblock config for filesystems that support it
+ * (NULL if called from emergency or umount)
*
* Alters the mount options of a mounted file system.
*/
int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
- size_t data_size, int force)
+ size_t data_size, int force, struct fs_context *fc)
{
int retval;
int remount_ro;
@@ -884,8 +982,17 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
}
}

- if (sb->s_op->remount_fs) {
- retval = sb->s_op->remount_fs(sb, &sb_flags, data, data_size);
+ if (sb->s_op->reconfigure ||
+ sb->s_op->remount_fs) {
+ if (sb->s_op->reconfigure) {
+ retval = sb->s_op->reconfigure(sb, fc);
+ sb_flags = fc->sb_flags;
+ if (retval == 0)
+ security_sb_reconfigure(fc);
+ } else {
+ retval = sb->s_op->remount_fs(sb, &sb_flags,
+ data, data_size);
+ }
if (retval) {
if (!force)
goto cancel_readonly;
@@ -924,7 +1031,7 @@ static void do_emergency_remount_callback(struct super_block *sb)
/*
* What lock protects sb->s_flags??
*/
- do_remount_sb(sb, SB_RDONLY, NULL, 0, 1);
+ do_remount_sb(sb, SB_RDONLY, NULL, 0, 1, NULL);
}
up_write(&sb->s_umount);
}
@@ -1106,6 +1213,89 @@ struct dentry *mount_ns(struct file_system_type *fs_type,

EXPORT_SYMBOL(mount_ns);

+static int set_anon_super_fc(struct super_block *sb, struct fs_context *fc)
+{
+ return set_anon_super(sb, NULL);
+}
+
+static int test_keyed_super(struct super_block *sb, struct fs_context *fc)
+{
+ return sb->s_fs_info == fc->s_fs_info;
+}
+
+static int test_single_super(struct super_block *s, struct fs_context *fc)
+{
+ return 1;
+}
+
+/**
+ * vfs_get_super - Get a superblock with a search key set in s_fs_info.
+ * @fc: The filesystem context holding the parameters
+ * @keying: How to distinguish superblocks
+ * @fill_super: Helper to initialise a new superblock
+ *
+ * Search for a superblock and create a new one if not found. The search
+ * criterion is controlled by @keying. If the search fails, a new superblock
+ * is created and @fill_super() is called to initialise it.
+ *
+ * @keying can take one of a number of values:
+ *
+ * (1) vfs_get_single_super - Only one superblock of this type may exist on the
+ * system. This is typically used for special system filesystems.
+ *
+ * (2) vfs_get_keyed_super - Multiple superblocks may exist, but they must have
+ * distinct keys (where the key is in s_fs_info). Searching for the same
+ * key again will turn up the superblock for that key.
+ *
+ * (3) vfs_get_independent_super - Multiple superblocks may exist and are
+ * unkeyed. Each call will get a new superblock.
+ *
+ * A permissions check is made by sget_fc() unless we're getting a superblock
+ * for a kernel-internal mount or a submount.
+ */
+int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc))
+{
+ int (*test)(struct super_block *, struct fs_context *);
+ struct super_block *sb;
+
+ switch (keying) {
+ case vfs_get_single_super:
+ test = test_single_super;
+ break;
+ case vfs_get_keyed_super:
+ test = test_keyed_super;
+ break;
+ case vfs_get_independent_super:
+ test = NULL;
+ break;
+ default:
+ BUG();
+ }
+
+ sb = sget_fc(fc, test, set_anon_super_fc);
+ if (IS_ERR(sb))
+ return PTR_ERR(sb);
+
+ if (!sb->s_root) {
+ int err = fill_super(sb, fc);
+ if (err) {
+ deactivate_locked_super(sb);
+ return err;
+ }
+
+ sb->s_flags |= SB_ACTIVE;
+ }
+
+ BUG_ON(fc->root);
+ fc->root = dget(sb->s_root);
+ fc->drop_sb = true;
+ return 0;
+}
+EXPORT_SYMBOL(vfs_get_super);
+
#ifdef CONFIG_BLOCK
static int set_bdev_super(struct super_block *s, void *data)
{
@@ -1254,7 +1444,7 @@ struct dentry *mount_single(struct file_system_type *fs_type,
}
s->s_flags |= SB_ACTIVE;
} else {
- do_remount_sb(s, flags, data, data_size, 0);
+ do_remount_sb(s, flags, data, data_size, 0, NULL);
}
return dget(s->s_root);
}
@@ -1601,3 +1791,90 @@ int thaw_super(struct super_block *sb)
return thaw_super_locked(sb);
}
EXPORT_SYMBOL(thaw_super);
+
+/**
+ * vfs_get_tree - Get the mountable root
+ * @fc: The superblock configuration context.
+ *
+ * The filesystem is invoked to get or create a superblock which can then later
+ * be used for mounting. The filesystem places a pointer to the root to be
+ * used for mounting in @fc->root.
+ */
+int vfs_get_tree(struct fs_context *fc)
+{
+ struct super_block *sb;
+ int ret;
+
+ if (fc->fs_type->fs_flags & FS_REQUIRES_DEV && !fc->source)
+ return -ENOENT;
+
+ if (fc->root)
+ return -EBUSY;
+
+ if (fc->ops->validate) {
+ ret = fc->ops->validate(fc);
+ if (ret < 0)
+ return ret;
+ }
+
+ ret = security_fs_context_validate(fc);
+ if (ret < 0)
+ return ret;
+
+ /* Get the mountable root in fc->root, with a ref on the root and a ref
+ * on the superblock.
+ */
+ ret = fc->ops->get_tree(fc);
+ if (ret < 0)
+ return ret;
+
+ if (!fc->root) {
+ pr_err("Filesystem %s get_tree() didn't set fc->root\n",
+ fc->fs_type->name);
+ /* We don't know what the locking state of the superblock is -
+ * if there is a superblock.
+ */
+ BUG();
+ }
+
+ sb = fc->root->d_sb;
+ WARN_ON(!sb->s_bdi);
+
+ ret = security_sb_get_tree(fc);
+ if (ret < 0)
+ goto err_sb;
+
+ ret = -ENOMEM;
+ if (fc->subtype && !sb->s_subtype) {
+ sb->s_subtype = kstrdup(fc->subtype, GFP_KERNEL);
+ if (!sb->s_subtype)
+ goto err_sb;
+ }
+
+ /* Write barrier is for super_cache_count(). We place it before setting
+ * SB_BORN as the data dependency between the two functions is the
+ * superblock structure contents that we just set up, not the SB_BORN
+ * flag.
+ */
+ smp_wmb();
+ sb->s_flags |= SB_BORN;
+
+ /* Filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
+ * but s_maxbytes was an unsigned long long for many releases. Throw
+ * this warning for a little while to try and catch filesystems that
+ * violate this rule.
+ */
+ WARN(sb->s_maxbytes < 0,
+ "%s set sb->s_maxbytes to negative value (%lld)\n",
+ fc->fs_type->name, sb->s_maxbytes);
+
+ up_write(&sb->s_umount);
+ return 0;
+
+err_sb:
+ dput(fc->root);
+ fc->root = NULL;
+ deactivate_locked_super(sb);
+ return ret;
+}
+EXPORT_SYMBOL(vfs_get_tree);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7bb71b8e3df..19bbed58829d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -60,6 +60,7 @@ struct workqueue_struct;
struct iov_iter;
struct fscrypt_info;
struct fscrypt_operations;
+struct fs_context;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -718,6 +719,11 @@ static inline void inode_unlock(struct inode *inode)
up_write(&inode->i_rwsem);
}

+static inline int inode_lock_killable(struct inode *inode)
+{
+ return down_write_killable(&inode->i_rwsem);
+}
+
static inline void inode_lock_shared(struct inode *inode)
{
down_read(&inode->i_rwsem);
@@ -1828,6 +1834,7 @@ struct super_operations {
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *, size_t);
+ int (*reconfigure) (struct super_block *, struct fs_context *);
void (*umount_begin) (struct super_block *);

int (*show_options)(struct seq_file *, struct dentry *);
@@ -2074,6 +2081,7 @@ struct file_system_type {
#define FS_HAS_SUBTYPE 4
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
+ int (*init_fs_context)(struct fs_context *, struct dentry *);
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *, size_t);
void (*kill_sb) (struct super_block *);
@@ -2132,6 +2140,9 @@ void deactivate_locked_super(struct super_block *sb);
int set_anon_super(struct super_block *s, void *data);
int get_anon_bdev(dev_t *);
void free_anon_bdev(dev_t);
+struct super_block *sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *, struct fs_context *),
+ int (*set)(struct super_block *, struct fs_context *));
struct super_block *sget_userns(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
@@ -2174,8 +2185,8 @@ mount_pseudo(struct file_system_type *fs_type, char *name,

extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
+extern struct vfsmount *kern_mount(struct file_system_type *);
extern struct vfsmount *kern_mount_data(struct file_system_type *, void *, size_t);
-#define kern_mount(type) kern_mount_data(type, NULL, 0)
extern void kern_unmount(struct vfsmount *mnt);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 04783814632c..368fe5bb1efd 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -25,6 +25,7 @@ struct pid_namespace;
struct super_block;
struct user_namespace;
struct vfsmount;
+struct path;

enum fs_context_purpose {
FS_CONTEXT_FOR_USER_MOUNT, /* New superblock for user-specified mount */
@@ -33,6 +34,19 @@ enum fs_context_purpose {
FS_CONTEXT_FOR_RECONFIGURE, /* Superblock reconfiguration (remount) */
};

+/*
+ * Userspace usage phase for fsopen/fspick.
+ */
+enum fs_context_phase {
+ FS_CONTEXT_CREATE_PARAMS, /* Loading params for sb creation */
+ FS_CONTEXT_CREATING, /* A superblock is being created */
+ FS_CONTEXT_AWAITING_MOUNT, /* Superblock created, awaiting fsmount() */
+ FS_CONTEXT_AWAITING_RECONF, /* Awaiting initialisation for reconfiguration */
+ FS_CONTEXT_RECONF_PARAMS, /* Loading params for reconfiguration */
+ FS_CONTEXT_RECONFIGURING, /* Reconfiguring the superblock */
+ FS_CONTEXT_FAILED, /* Failed to correctly transition a context */
+};
+
/*
* Filesystem context for holding the parameters used in the creation or
* reconfiguration of a superblock.
@@ -60,6 +74,7 @@ struct fs_context {
bool drop_sb:1; /* T if need to drop an SB reference */
bool source_is_dev:1; /* T if source is local device/file */
enum fs_context_purpose purpose : 8;
+ enum fs_context_phase phase:8; /* The phase the context is in */
};

struct fs_context_operations {
@@ -67,9 +82,37 @@ struct fs_context_operations {
int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
int (*parse_source)(struct fs_context *fc, char *source);
int (*parse_option)(struct fs_context *fc, char *opt, size_t len);
- int (*parse_monolithic)(struct fs_context *fc, void *data);
+ int (*parse_monolithic)(struct fs_context *fc, void *data, size_t data_size);
int (*validate)(struct fs_context *fc);
int (*get_tree)(struct fs_context *fc);
};

+/*
+ * fs_context manipulation functions.
+ */
+extern struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct dentry *reference,
+ unsigned int ms_flags,
+ enum fs_context_purpose purpose);
+extern struct fs_context *vfs_sb_reconfig(struct path *path, unsigned int ms_flags);
+extern struct fs_context *vfs_dup_fs_context(struct fs_context *src);
+extern int vfs_set_fs_source(struct fs_context *fc, const char *source, size_t slen);
+extern int vfs_parse_fs_option(struct fs_context *fc, char *data, size_t opt);
+extern int generic_parse_monolithic(struct fs_context *fc, void *data, size_t data_size);
+extern int vfs_get_tree(struct fs_context *fc);
+extern void put_fs_context(struct fs_context *fc);
+
+/*
+ * sget() wrapper to be called from the ->get_tree() op.
+ */
+enum vfs_get_super_keying {
+ vfs_get_single_super, /* Only one such superblock may exist */
+ vfs_get_keyed_super, /* Superblocks with different s_fs_info keys may exist */
+ vfs_get_independent_super, /* Multiple independent superblocks may exist */
+};
+extern int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc));
+
#endif /* _LINUX_FS_CONTEXT_H */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 8a1031a511c9..ee5af77afc06 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -21,6 +21,7 @@ struct super_block;
struct vfsmount;
struct dentry;
struct mnt_namespace;
+struct fs_context;

#define MNT_NOSUID 0x01
#define MNT_NODEV 0x02
@@ -88,6 +89,8 @@ struct path;
extern struct vfsmount *clone_private_mount(const struct path *path);

struct file_system_type;
+extern struct vfsmount *vfs_create_mount(struct fs_context *fc,
+ unsigned int mnt_flags);
extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
int flags, const char *name,
void *data, size_t data_size);
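
The three keying modes described in the vfs_get_super() kerneldoc above can be modelled in userspace. This is only a sketch under stated assumptions: struct sb, sb_list, get_super() and the GET_* names are hypothetical stand-ins, not kernel interfaces, and locking, permission checks and fill_super() are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are not the kernel's structures. */
enum keying { GET_SINGLE, GET_KEYED, GET_INDEPENDENT };

struct sb {
	const void *key;	/* plays the role of s_fs_info */
	struct sb *next;
};

static struct sb *sb_list;	/* every superblock of one filesystem type */

/* Reuse a superblock when the keying mode finds a match, else create one. */
static struct sb *get_super(enum keying keying, const void *key)
{
	struct sb *sb;

	/* Mirrors the BUG() in the default case of the real switch. */
	assert(keying == GET_SINGLE || keying == GET_KEYED ||
	       keying == GET_INDEPENDENT);

	if (keying != GET_INDEPENDENT) {
		for (sb = sb_list; sb; sb = sb->next) {
			/* Single: anything matches.  Keyed: compare the keys. */
			if (keying == GET_SINGLE || sb->key == key)
				return sb;
		}
	}

	/* No match, or independent mode: always make a new superblock. */
	sb = calloc(1, sizeof(*sb));
	if (!sb)
		return NULL;
	sb->key = key;
	sb->next = sb_list;
	sb_list = sb;
	return sb;
}
```

In this model, asking twice for the same key in keyed mode yields the same object, while independent mode never searches the list at all, which corresponds to passing a NULL test function to sget_fc().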


2018-05-31 19:20:41

by Al Viro

Subject: Re: [PATCH 20/32] vfs: Make close() unmount the attached mount if so flagged [ver #8]

On Fri, May 25, 2018 at 01:07:34AM +0100, David Howells wrote:
> + if (unlikely(file->f_mode & FMODE_NEED_UNMOUNT))
> + __detach_mounts(dentry);
> +

This is completely wrong. First of all, you want to dissolve the mount tree
on file->f_path.mount, not every tree rooted at dentry equal to file->f_path.dentry.
This is easily done - it would be a simple call of drop_collected_mounts(mnt)
if not for one detail. You want it to happen only if the sucker isn't attached
anywhere by that point. IOW,
namespace_lock();
lock_mount_hash();
if (!real_mount(mnt)->mnt_ns)
umount_tree(real_mount(mnt), UMOUNT_SYNC);
unlock_mount_hash();
namespace_unlock();
and that's it. You don't need that magical mystery turd in move_mount() later
in the series and all the infrastructure you grow for it.

FWIW, I would've suggested this
void drop_collected_mounts(struct vfsmount *mnt)
{
namespace_lock();
lock_mount_hash();
+ if (!real_mount(mnt)->mnt_ns)
+ umount_tree(real_mount(mnt), UMOUNT_SYNC);
- umount_tree(real_mount(mnt), UMOUNT_SYNC);
unlock_mount_hash();
namespace_unlock();
}

and in __fput()
if (unlikely(file->f_mode & FMODE_NEED_UNMOUNT))
drop_collected_mounts(mnt);

All there is to it, AFAICS...
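
The rule proposed above (dissolve the tree on close only if it was never attached to a namespace) can be sketched as a small userspace model. The struct and function names here are hypothetical, and the namespace and mount-hash locking that the real code takes is deliberately omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct mnt_ns_model { int unused; };

struct mount_model {
	struct mnt_ns_model *mnt_ns;	/* NULL while not attached anywhere */
	bool dissolved;			/* set once the tree has been torn down */
};

/* Tear the tree down only if nobody attached it in the meantime. */
static void drop_if_detached(struct mount_model *m)
{
	assert(m != NULL);
	if (!m->mnt_ns)
		m->dissolved = true;
}
```

A mount that move_mount() has already attached keeps a non-NULL mnt_ns and so survives the final close, which is why no flag needs to be cleared on success.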

2018-05-31 19:27:13

by Al Viro

Subject: Re: [PATCH 20/32] vfs: Make close() unmount the attached mount if so flagged [ver #8]

On Thu, May 31, 2018 at 08:19:55PM +0100, Al Viro wrote:
> On Fri, May 25, 2018 at 01:07:34AM +0100, David Howells wrote:
> > + if (unlikely(file->f_mode & FMODE_NEED_UNMOUNT))
> > + __detach_mounts(dentry);
> > +
>
> This is completely wrong. First of all, you want to dissolve the mount tree
> on file->f_path.mount, not every tree rooted at dentry equal to file->f_path.dentry.
> This is easily done - it would be a simple call of drop_collected_mounts(mnt)
> if not for one detail. You want it to happen only if the sucker isn't attached
> anywhere by that point. IOW,
> namespace_lock();
> lock_mount_hash();
> if (!real_mount(mnt)->mnt_ns)
> umount_tree(real_mount(mnt), UMOUNT_SYNC);
> unlock_mount_hash();
> namespace_unlock();
> and that's it. You don't need that magical mystery turd in move_mount() later
> in the series and all the infrastructure you grow for it.
>
> FWIW, I would've suggested this
> void drop_collected_mounts(struct vfsmount *mnt)
> {
> namespace_lock();
> lock_mount_hash();
> + if (!real_mount(mnt)->mnt_ns)
> + umount_tree(real_mount(mnt), UMOUNT_SYNC);
> - umount_tree(real_mount(mnt), UMOUNT_SYNC);
> unlock_mount_hash();
> namespace_unlock();
> }
>
> and in __fput()
> if (unlikely(file->f_mode & FMODE_NEED_UNMOUNT))
> drop_collected_mounts(mnt);
>
> All there is to it, AFAICS...

... and that eliminates #27 and #28 entirely, with #31 becoming simpler -
no move_mount_lookup(), no dfd_ref, the check in do_move_mount() becomes
+ if (!mnt_has_parent(old) && old->mnt_ns) {
+ /* We need to allow open(O_PATH|O_CLONE_MOUNT) or fsmount()
+ * followed by move_mount(), but mustn't allow "/" to be moved.
+ */
+ goto out1;
+ }
and I wouldn't be surprised if move_mount_old()/move_mount() split turns out
to be not needed at all, seeing that the whole "clear FMODE_NEED_UNMOUNT
on success" part goes away.
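
The simplified check quoted above refuses only a mount that has no parent but is attached to a namespace, i.e. "/". A userspace sketch of that predicate, with hypothetical names and with a NULL parent pointer standing in for !mnt_has_parent(), might look like this; any other checks do_move_mount() performs are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; a NULL parent stands in for !mnt_has_parent(). */
struct mnt_model {
	struct mnt_model *parent;	/* NULL at the top of a tree */
	const void *mnt_ns;		/* NULL while detached from any namespace */
};

/* True unless the mount is a parentless tree that sits in a namespace. */
static bool may_move(const struct mnt_model *old)
{
	assert(old != NULL);
	return old->parent != NULL || old->mnt_ns == NULL;
}
```

A detached tree obtained from open(O_PATH|O_CLONE_MOUNT) or fsmount() has neither a parent nor a namespace, so it passes, while the namespace root does not.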

2018-05-31 20:57:27

by David Howells

Subject: Test program for move_mount()

/* move_mount test.
*
* Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
* Written by David Howells ([email protected])
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public Licence
* as published by the Free Software Foundation; either version
* 2 of the Licence, or (at your option) any later version.
*/

#define _GNU_SOURCE
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/prctl.h>
#include <sys/wait.h>

#define O_CLONE_MOUNT 040000000 /* Used with O_PATH to clone the mount subtree at path */
#define O_NON_RECURSIVE 0100000000 /* Used with O_CLONE_MOUNT to only clone one mount */

#define MOVE_MOUNT_F_SYMLINKS 0x00000001 /* Follow symlinks on from path */
#define MOVE_MOUNT_F_AUTOMOUNTS 0x00000002 /* Follow automounts on from path */
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
#define MOVE_MOUNT_T_SYMLINKS 0x00000010 /* Follow symlinks on to path */
#define MOVE_MOUNT_T_AUTOMOUNTS 0x00000020 /* Follow automounts on to path */
#define MOVE_MOUNT_T_EMPTY_PATH 0x00000040 /* Empty to path permitted */
#define MOVE_MOUNT__MASK 0x00000077

static inline int move_mount(int from_dfd, const char *from_pathname,
int to_dfd, const char *to_pathname,
unsigned int flags)
{
return syscall(337, from_dfd, from_pathname, to_dfd, to_pathname, flags);
}

void format(void)
{
printf("Format: move_mount [-a] [-c] [-r] <from> <to>\n");
exit(2);
}

int main(int argc, char *argv[])
{
bool preopen = false;
int ret, fd;
int o_flags = O_PATH;

if (argc < 3)
format();

for (; argc > 3; argc--, argv++) {
if (strcmp(argv[1], "-a") == 0)
preopen = true;
else if (strcmp(argv[1], "-c") == 0)
o_flags |= O_CLONE_MOUNT;
else if (strcmp(argv[1], "-r") == 0)
o_flags |= O_NON_RECURSIVE;
else
format();
}

if (preopen) {
fd = open(argv[1], o_flags);
if (fd < 0) {
fprintf(stderr, "open(%s, O_PATH|...): %m\n", argv[1]);
exit(1);
}

ret = move_mount(fd, "", AT_FDCWD, argv[2],
MOVE_MOUNT_F_EMPTY_PATH);
if (ret != 0) {
fprintf(stderr, "move_mount([%s],%s) = %d: %m\n",
argv[1], argv[2], ret);
exit(1);
}
} else {
ret = move_mount(AT_FDCWD, argv[1], AT_FDCWD, argv[2],
MOVE_MOUNT_F_EMPTY_PATH);
if (ret != 0) {
fprintf(stderr, "move_mount(%s,%s) = %d: %m\n",
argv[1], argv[2], ret);
exit(1);
}
}

exit(0);
}
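
For reference, MOVE_MOUNT__MASK in the program above is simply the union of the six flag bits. The following sketch shows how a caller could check a flags word against it before making the call. The helper move_mount_flags_valid() is hypothetical, and it is an assumption here that the kernel rejects bits outside the mask.

```c
#include <assert.h>
#include <stdbool.h>

#define MOVE_MOUNT_F_SYMLINKS	0x00000001
#define MOVE_MOUNT_F_AUTOMOUNTS	0x00000002
#define MOVE_MOUNT_F_EMPTY_PATH	0x00000004
#define MOVE_MOUNT_T_SYMLINKS	0x00000010
#define MOVE_MOUNT_T_AUTOMOUNTS	0x00000020
#define MOVE_MOUNT_T_EMPTY_PATH	0x00000040

/* The mask is the OR of every defined flag. */
#define MOVE_MOUNT__MASK \
	(MOVE_MOUNT_F_SYMLINKS | MOVE_MOUNT_F_AUTOMOUNTS | \
	 MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_SYMLINKS | \
	 MOVE_MOUNT_T_AUTOMOUNTS | MOVE_MOUNT_T_EMPTY_PATH)

static_assert(MOVE_MOUNT__MASK == 0x00000077,
	      "mask must cover exactly the defined flags");

/* Hypothetical helper: true if no bit outside the mask is set. */
static bool move_mount_flags_valid(unsigned int flags)
{
	return (flags & ~(unsigned int)MOVE_MOUNT__MASK) == 0;
}
```

Bits 0x08 and 0x80 fall in the gaps between the "from" and "to" groups, so a flags word containing either fails the check.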

2018-05-31 20:59:40

by David Howells

Subject: fsinfo test program

/* Test the fsinfo() system call
*
* Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
* Written by David Howells ([email protected])
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public Licence
* as published by the Free Software Foundation; either version
* 2 of the Licence, or (at your option) any later version.
*/

#define _GNU_SOURCE
#define _ATFILE_SOURCE
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
#include <errno.h>
#include <time.h>
#include <math.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <linux/stat.h>
#include <linux/socket.h>
#include <sys/stat.h>

#define __NR_fsinfo 338
enum fsinfo_attribute {
fsinfo_attr_statfs = 0, /* statfs()-style state */
fsinfo_attr_fsinfo = 1, /* Information about fsinfo() */
fsinfo_attr_ids = 2, /* Filesystem IDs */
fsinfo_attr_limits = 3, /* Filesystem limits */
fsinfo_attr_supports = 4, /* What's supported in statx, iocflags, ... */
fsinfo_attr_capabilities = 5, /* Filesystem capabilities (bits) */
fsinfo_attr_timestamp_info = 6, /* Inode timestamp info */
fsinfo_attr_volume_id = 7, /* Volume ID (string) */
fsinfo_attr_volume_uuid = 8, /* Volume UUID (LE uuid) */
fsinfo_attr_volume_name = 9, /* Volume name (string) */
fsinfo_attr_cell_name = 10, /* Cell name (string) */
fsinfo_attr_domain_name = 11, /* Domain name (string) */
fsinfo_attr_realm_name = 12, /* Realm name (string) */
fsinfo_attr_server_name = 13, /* Name of the Nth server */
fsinfo_attr_server_address = 14, /* Mth address of the Nth server */
fsinfo_attr_error_state = 15, /* Error state */
fsinfo_attr_parameter = 16, /* Nth mount parameter (string) */
fsinfo_attr_source = 17, /* Nth mount source name (string) */
fsinfo_attr_name_encoding = 18, /* Filename encoding (string) */
fsinfo_attr_name_codepage = 19, /* Filename codepage (string) */
fsinfo_attr_io_size = 20, /* Optimal I/O sizes */
fsinfo_attr__nr
};

struct fsinfo_params {
enum fsinfo_attribute request; /* What is being asked for */
__u32 Nth; /* Instance of it (some may have multiple) */
__u32 Mth; /* Subinstance of Nth instance */
__u32 at_flags; /* AT_SYMLINK_NOFOLLOW and similar flags */
__u32 __reserved[6]; /* Reserved params; all must be 0 */
};

struct fsinfo_statfs {
__u64 f_blocks; /* Total number of blocks in fs */
__u64 f_bfree; /* Total number of free blocks */
__u64 f_bavail; /* Number of free blocks available to ordinary user */
__u64 f_files; /* Total number of file nodes in fs */
__u64 f_ffree; /* Number of free file nodes */
__u64 f_favail; /* Number of free file nodes available to ordinary user */
__u32 f_bsize; /* Optimal block size */
__u32 f_frsize; /* Fragment size */
};

struct fsinfo_ids {
char f_fs_name[15 + 1];
__u64 f_flags; /* Filesystem mount flags (MS_*) */
__u64 f_fsid; /* Short 64-bit Filesystem ID (as statfs) */
__u64 f_sb_id; /* Internal superblock ID for sbnotify()/mntnotify() */
__u32 f_fstype; /* Filesystem type from linux/magic.h [uncond] */
__u32 f_dev_major; /* As st_dev_* from struct statx [uncond] */
__u32 f_dev_minor;
};

struct fsinfo_limits {
__u64 max_file_size; /* Maximum file size */
__u64 max_uid; /* Maximum UID supported */
__u64 max_gid; /* Maximum GID supported */
__u64 max_projid; /* Maximum project ID supported */
__u32 max_dev_major; /* Maximum device major representable */
__u32 max_dev_minor; /* Maximum device minor representable */
__u32 max_hard_links; /* Maximum number of hard links on a file */
__u32 max_xattr_body_len; /* Maximum xattr content length */
__u16 max_xattr_name_len; /* Maximum xattr name length */
__u16 max_filename_len; /* Maximum filename length */
__u16 max_symlink_len; /* Maximum symlink content length */
__u16 __spare;
};

struct fsinfo_supports {
__u64 supported_stx_attributes; /* What statx::stx_attributes are supported */
__u32 supported_stx_mask; /* What statx::stx_mask bits are supported */
__u32 supported_ioc_flags; /* What FS_IOC_* flags are supported */
};

enum fsinfo_capability {
fsinfo_cap_is_kernel_fs = 0, /* fs is kernel-special filesystem */
fsinfo_cap_is_block_fs = 1, /* fs is block-based filesystem */
fsinfo_cap_is_flash_fs = 2, /* fs is flash filesystem */
fsinfo_cap_is_network_fs = 3, /* fs is network filesystem */
fsinfo_cap_is_automounter_fs = 4, /* fs is automounter special filesystem */
fsinfo_cap_automounts = 5, /* fs supports automounts */
fsinfo_cap_adv_locks = 6, /* fs supports advisory file locking */
fsinfo_cap_mand_locks = 7, /* fs supports mandatory file locking */
fsinfo_cap_leases = 8, /* fs supports file leases */
fsinfo_cap_uids = 9, /* fs supports numeric uids */
fsinfo_cap_gids = 10, /* fs supports numeric gids */
fsinfo_cap_projids = 11, /* fs supports numeric project ids */
fsinfo_cap_id_names = 12, /* fs supports user names */
fsinfo_cap_id_guids = 13, /* fs supports user guids */
fsinfo_cap_windows_attrs = 14, /* fs has windows attributes */
fsinfo_cap_user_quotas = 15, /* fs has per-user quotas */
fsinfo_cap_group_quotas = 16, /* fs has per-group quotas */
fsinfo_cap_project_quotas = 17, /* fs has per-project quotas */
fsinfo_cap_xattrs = 18, /* fs has xattrs */
fsinfo_cap_journal = 19, /* fs has a journal */
fsinfo_cap_data_is_journalled = 20, /* fs is using data journalling */
fsinfo_cap_o_sync = 21, /* fs supports O_SYNC */
fsinfo_cap_o_direct = 22, /* fs supports O_DIRECT */
fsinfo_cap_volume_id = 23, /* fs has a volume ID */
fsinfo_cap_volume_uuid = 24, /* fs has a volume UUID */
fsinfo_cap_volume_name = 25, /* fs has a volume name */
fsinfo_cap_volume_fsid = 26, /* fs has a volume FSID */
fsinfo_cap_cell_name = 27, /* fs has a cell name */
fsinfo_cap_domain_name = 28, /* fs has a domain name */
fsinfo_cap_realm_name = 29, /* fs has a realm name */
fsinfo_cap_iver_all_change = 30, /* i_version represents data + meta changes */
fsinfo_cap_iver_data_change = 31, /* i_version represents data changes only */
fsinfo_cap_iver_mono_incr = 32, /* i_version incremented monotonically */
fsinfo_cap_symlinks = 33, /* fs supports symlinks */
fsinfo_cap_hard_links = 34, /* fs supports hard links */
fsinfo_cap_hard_links_1dir = 35, /* fs supports hard links in same dir only */
fsinfo_cap_device_files = 36, /* fs supports bdev, cdev */
fsinfo_cap_unix_specials = 37, /* fs supports pipe, fifo, socket */
fsinfo_cap_resource_forks = 38, /* fs supports resource forks/streams */
fsinfo_cap_name_case_indep = 39, /* Filename case independence is mandatory */
fsinfo_cap_name_non_utf8 = 40, /* fs has non-utf8 names */
fsinfo_cap_name_has_codepage = 41, /* fs has a filename codepage */
fsinfo_cap_sparse = 42, /* fs supports sparse files */
fsinfo_cap_not_persistent = 43, /* fs is not persistent */
fsinfo_cap_no_unix_mode = 44, /* fs does not support unix mode bits */
fsinfo_cap_has_atime = 45, /* fs supports access time */
fsinfo_cap_has_btime = 46, /* fs supports birth/creation time */
fsinfo_cap_has_ctime = 47, /* fs supports change time */
fsinfo_cap_has_mtime = 48, /* fs supports modification time */
fsinfo_cap__nr
};

struct fsinfo_capabilities {
__u8 capabilities[(fsinfo_cap__nr + 7) / 8];
};

struct fsinfo_timestamp_info {
__s64 minimum_timestamp; /* Minimum timestamp value in seconds */
__s64 maximum_timestamp; /* Maximum timestamp value in seconds */
__u16 atime_gran_mantissa; /* Granularity(secs) = mant * 10^exp */
__u16 btime_gran_mantissa;
__u16 ctime_gran_mantissa;
__u16 mtime_gran_mantissa;
__s8 atime_gran_exponent;
__s8 btime_gran_exponent;
__s8 ctime_gran_exponent;
__s8 mtime_gran_exponent;
};

struct fsinfo_volume_uuid {
__u8 uuid[16];
};

struct fsinfo_server_address {
struct __kernel_sockaddr_storage address;
};

struct fsinfo_error_state {
__u32 io_error; /* General I/O error counter */
__u32 wb_error; /* Writeback error counter */
__u32 bdev_error; /* Blockdev error counter */
};

struct fsinfo_io_size {
__u32 block_size; /* Minimum block granularity for O_DIRECT */
__u32 max_single_read_size; /* Maximum size of a single unbuffered read */
__u32 max_single_write_size; /* Maximum size of a single unbuffered write */
__u32 best_read_size; /* Optimal read size */
__u32 best_write_size; /* Optimal write size */
};

struct fsinfo_fsinfo {
enum fsinfo_attribute max_attr; /* Number of supported attributes */
enum fsinfo_capability max_cap; /* Number of supported capabilities */
};

/*****************************************************************************/
/*****************************************************************************/
/*****************************************************************************/
static __attribute__((unused))
ssize_t fsinfo(int dfd, const char *filename, struct fsinfo_params *params,
void *buffer, size_t buf_size)
{
return syscall(__NR_fsinfo, dfd, filename, params, buffer, buf_size);
}

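/*
* Each table entry describes the reply for one attribute. The low six bits
* give the struct size in __u32 words (0 for a string), 0x40 marks an
* attribute indexed by Nth and 0x80 one indexed by both Nth and Mth.
*/
#define FSINFO_STRING(N) [fsinfo_attr_##N] = 0x00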
#define FSINFO_STRUCT(N) [fsinfo_attr_##N] = sizeof(struct fsinfo_##N)/sizeof(__u32)
#define FSINFO_STRING_N(N) [fsinfo_attr_##N] = 0x40
#define FSINFO_STRUCT_N(N) [fsinfo_attr_##N] = 0x40 | sizeof(struct fsinfo_##N)/sizeof(__u32)
#define FSINFO_STRUCT_NM(N) [fsinfo_attr_##N] = 0x80 | sizeof(struct fsinfo_##N)/sizeof(__u32)
static const __u8 fsinfo_buffer_sizes[fsinfo_attr__nr] = {
FSINFO_STRUCT (statfs),
FSINFO_STRUCT (fsinfo),
FSINFO_STRUCT (ids),
FSINFO_STRUCT (limits),
FSINFO_STRUCT (supports),
FSINFO_STRUCT (capabilities),
FSINFO_STRUCT (timestamp_info),
FSINFO_STRING (volume_id),
FSINFO_STRUCT (volume_uuid),
FSINFO_STRING (volume_name),
FSINFO_STRING (cell_name),
FSINFO_STRING (domain_name),
FSINFO_STRING (realm_name),
FSINFO_STRING_N (server_name),
FSINFO_STRUCT_NM (server_address),
FSINFO_STRUCT (error_state),
FSINFO_STRING_N (parameter),
FSINFO_STRING_N (source),
FSINFO_STRING (name_encoding),
FSINFO_STRING (name_codepage),
FSINFO_STRUCT (io_size),
};

#define FSINFO_NAME(N) [fsinfo_attr_##N] = #N
static const char *fsinfo_attr_names[fsinfo_attr__nr] = {
FSINFO_NAME(statfs),
FSINFO_NAME(fsinfo),
FSINFO_NAME(ids),
FSINFO_NAME(limits),
FSINFO_NAME(supports),
FSINFO_NAME(capabilities),
FSINFO_NAME(timestamp_info),
FSINFO_NAME(volume_id),
FSINFO_NAME(volume_uuid),
FSINFO_NAME(volume_name),
FSINFO_NAME(cell_name),
FSINFO_NAME(domain_name),
FSINFO_NAME(realm_name),
FSINFO_NAME(server_name),
FSINFO_NAME(server_address),
FSINFO_NAME(error_state),
FSINFO_NAME(parameter),
FSINFO_NAME(source),
FSINFO_NAME(name_encoding),
FSINFO_NAME(name_codepage),
FSINFO_NAME(io_size),
};

union reply {
char buffer[4096];
struct fsinfo_statfs statfs;
struct fsinfo_fsinfo fsinfo;
struct fsinfo_ids ids;
struct fsinfo_limits limits;
struct fsinfo_supports supports;
struct fsinfo_capabilities caps;
struct fsinfo_timestamp_info timestamps;
struct fsinfo_volume_uuid uuid;
struct fsinfo_server_address srv_addr;
struct fsinfo_error_state errors;
struct fsinfo_io_size io_size;
};

/*
* Dump as hex.
*/
static void dump_hex(unsigned int *data, int from, int to)
{
unsigned offset, print_offset = 1, col = 0;

from /= 4;
to = (to + 3) / 4;

for (offset = from; offset < to; offset++) {
if (print_offset) {
printf("%04x: ", offset * 4); /* byte offset: data[] holds 4-byte words */
print_offset = 0;
}
printf("%08x", data[offset]);
col++;
if ((col & 3) == 0) {
printf("\n");
print_offset = 1;
} else {
printf(" ");
}
}

if (!print_offset)
printf("\n");
}

#if 0
static void dump_fsinfo(struct fsinfo *f)
{
printf("ioc : %llx\n", (unsigned long long)f->f_supported_ioc_flags);

if (f->f_mask & FSINFO_VOLUME_ID) {
int printable = 1, loop;
printf("volid : ");
for (loop = 0; loop < sizeof(f->f_volume_id); loop++)
if (!isprint(f->f_volume_id[loop]))
printable = 0;
if (printable) {
printf("'%.*s'", 16, f->f_volume_id);
} else {
for (loop = 0; loop < sizeof(f->f_volume_id); loop++) {
if (loop % 4 == 0 && loop != 0)
printf(" ");
printf("%02x", f->f_volume_id[loop]);
}
}
printf("\n");
}
}
#endif

static void dump_attr_statfs(union reply *r, int size)
{
struct fsinfo_statfs *f = &r->statfs;

printf("\tblocks: n=%llu fr=%llu av=%llu\n",
(unsigned long long)f->f_blocks,
(unsigned long long)f->f_bfree,
(unsigned long long)f->f_bavail);

printf("\tfiles : n=%llu fr=%llu av=%llu\n",
(unsigned long long)f->f_files,
(unsigned long long)f->f_ffree,
(unsigned long long)f->f_favail);
printf("\tbsize : %u\n", f->f_bsize);
printf("\tfrsize: %u\n", f->f_frsize);
}

static void dump_attr_fsinfo(union reply *r, int size)
{
struct fsinfo_fsinfo *f = &r->fsinfo;

printf("max_attr=%u max_cap=%u\n", f->max_attr, f->max_cap);
}

static void dump_attr_ids(union reply *r, int size)
{
struct fsinfo_ids *f = &r->ids;

printf("dev : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
printf("\tfs : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
printf("\tflags : %llx\n", (unsigned long long)f->f_flags);
printf("\tfsid : %llx\n", (unsigned long long)f->f_fsid);
}

static void dump_attr_limits(union reply *r, int size)
{
struct fsinfo_limits *f = &r->limits;

printf("max file size: %llx\n", (unsigned long long)f->max_file_size);
}

static void dump_attr_supports(union reply *r, int size)
{
struct fsinfo_supports *f = &r->supports;

printf("stx_attr=%llx\n", (unsigned long long)f->supported_stx_attributes);
}

#define FSINFO_CAP_NAME(C) [fsinfo_cap_##C] = #C
static const char *fsinfo_cap_names[fsinfo_cap__nr] = {
FSINFO_CAP_NAME(is_kernel_fs),
FSINFO_CAP_NAME(is_block_fs),
FSINFO_CAP_NAME(is_flash_fs),
FSINFO_CAP_NAME(is_network_fs),
FSINFO_CAP_NAME(is_automounter_fs),
FSINFO_CAP_NAME(automounts),
FSINFO_CAP_NAME(adv_locks),
FSINFO_CAP_NAME(mand_locks),
FSINFO_CAP_NAME(leases),
FSINFO_CAP_NAME(uids),
FSINFO_CAP_NAME(gids),
FSINFO_CAP_NAME(projids),
FSINFO_CAP_NAME(id_names),
FSINFO_CAP_NAME(id_guids),
FSINFO_CAP_NAME(windows_attrs),
FSINFO_CAP_NAME(user_quotas),
FSINFO_CAP_NAME(group_quotas),
FSINFO_CAP_NAME(project_quotas),
FSINFO_CAP_NAME(xattrs),
FSINFO_CAP_NAME(journal),
FSINFO_CAP_NAME(data_is_journalled),
FSINFO_CAP_NAME(o_sync),
FSINFO_CAP_NAME(o_direct),
FSINFO_CAP_NAME(volume_id),
FSINFO_CAP_NAME(volume_uuid),
FSINFO_CAP_NAME(volume_name),
FSINFO_CAP_NAME(volume_fsid),
FSINFO_CAP_NAME(cell_name),
FSINFO_CAP_NAME(domain_name),
FSINFO_CAP_NAME(realm_name),
FSINFO_CAP_NAME(iver_all_change),
FSINFO_CAP_NAME(iver_data_change),
FSINFO_CAP_NAME(iver_mono_incr),
FSINFO_CAP_NAME(symlinks),
FSINFO_CAP_NAME(hard_links),
FSINFO_CAP_NAME(hard_links_1dir),
FSINFO_CAP_NAME(device_files),
FSINFO_CAP_NAME(unix_specials),
FSINFO_CAP_NAME(resource_forks),
FSINFO_CAP_NAME(name_case_indep),
FSINFO_CAP_NAME(name_non_utf8),
FSINFO_CAP_NAME(name_has_codepage),
FSINFO_CAP_NAME(sparse),
FSINFO_CAP_NAME(not_persistent),
FSINFO_CAP_NAME(no_unix_mode),
FSINFO_CAP_NAME(has_atime),
FSINFO_CAP_NAME(has_btime),
FSINFO_CAP_NAME(has_ctime),
FSINFO_CAP_NAME(has_mtime),
};

static void dump_attr_capabilities(union reply *r, int size)
{
struct fsinfo_capabilities *f = &r->caps;
int i;

for (i = 0; i < sizeof(f->capabilities); i++)
printf("%02x", f->capabilities[i]);
printf("\n");
for (i = 0; i < fsinfo_cap__nr; i++)
if (f->capabilities[i / 8] & (1 << (i % 8)))
printf("\t- %s\n", fsinfo_cap_names[i]);
}

static void dump_attr_timestamp_info(union reply *r, int size)
{
struct fsinfo_timestamp_info *f = &r->timestamps;

printf("range=%llx-%llx\n",
(unsigned long long)f->minimum_timestamp,
(unsigned long long)f->maximum_timestamp);

#define print_time(G) \
printf("\t"#G"time : gran=%gs\n", \
(f->G##time_gran_mantissa * \
pow(10., f->G##time_gran_exponent)))
print_time(a);
print_time(b);
print_time(c);
print_time(m);
}

static void dump_attr_volume_uuid(union reply *r, int size)
{
struct fsinfo_volume_uuid *f = &r->uuid;

printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
"-%02x%02x%02x%02x%02x%02x\n",
f->uuid[ 0], f->uuid[ 1],
f->uuid[ 2], f->uuid[ 3],
f->uuid[ 4], f->uuid[ 5],
f->uuid[ 6], f->uuid[ 7],
f->uuid[ 8], f->uuid[ 9],
f->uuid[10], f->uuid[11],
f->uuid[12], f->uuid[13],
f->uuid[14], f->uuid[15]);
}

static void dump_attr_server_address(union reply *r, int size)
{
struct fsinfo_server_address *f = &r->srv_addr;

printf("family=%u\n", f->address.ss_family);
}

static void dump_attr_error_state(union reply *r, int size)
{
struct fsinfo_error_state *f = &r->errors;

printf("io=%u wb=%u bdev=%u\n", f->io_error, f->wb_error, f->bdev_error);
}

static void dump_attr_io_size(union reply *r, int size)
{
struct fsinfo_io_size *f = &r->io_size;

printf("bs=%u\n", f->block_size);
}

/*
* Table of per-attribute dump functions, indexed by attribute number.
*/
typedef void (*dumper_t)(union reply *r, int size);

#define FSINFO_DUMPER(N) [fsinfo_attr_##N] = dump_attr_##N
static const dumper_t fsinfo_attr_dumper[fsinfo_attr__nr] = {
FSINFO_DUMPER(statfs),
FSINFO_DUMPER(fsinfo),
FSINFO_DUMPER(ids),
FSINFO_DUMPER(limits),
FSINFO_DUMPER(supports),
FSINFO_DUMPER(capabilities),
FSINFO_DUMPER(timestamp_info),
FSINFO_DUMPER(volume_uuid),
FSINFO_DUMPER(server_address),
FSINFO_DUMPER(error_state),
FSINFO_DUMPER(io_size),
};

static void dump_fsinfo(enum fsinfo_attribute attr, __u8 about,
union reply *r, int size)
{
dumper_t dumper = fsinfo_attr_dumper[attr];
unsigned int len;

if (!dumper) {
printf("<no dumper>\n");
return;
}

len = (about & 0x3f) * sizeof(__u32);
if (size < len) {
printf("<short data %u/%u>\n", size, len);
return;
}

dumper(r, size);
}

/*
* Try one subinstance of an attribute.
*/
static int try_one(const char *file, struct fsinfo_params *params, bool raw)
{
union reply r;
char *p;
int ret;
__u8 about;

memset(&r.buffer, 0xbd, sizeof(r.buffer));

errno = 0;
ret = fsinfo(AT_FDCWD, file, params, r.buffer, sizeof(r.buffer));
if (params->request >= fsinfo_attr__nr) {
if (ret == -1 && errno == EOPNOTSUPP)
exit(0);
fprintf(stderr, "Unexpected error for too-large command %u: %m\n",
params->request);
exit(1);
}

//printf("fsinfo(%s,%s,%u,%u) = %d: %m\n",
// file, fsinfo_attr_names[params->request],
// params->Nth, params->Mth, ret);

about = fsinfo_buffer_sizes[params->request];
if (ret == -1) {
if (errno == ENODATA) {
switch (about & 0xc0) {
case 0x00:
if (params->Nth == 0 && params->Mth == 0) {
fprintf(stderr,
"Unexpected ENODATA1 (%u[%u][%u])\n",
params->request, params->Nth, params->Mth);
exit(1);
}
break;
case 0x40:
if (params->Nth == 0 && params->Mth == 0) {
fprintf(stderr,
"Unexpected ENODATA2 (%u[%u][%u])\n",
params->request, params->Nth, params->Mth);
exit(1);
}
break;
}
return (params->Mth == 0) ? 2 : 1;
}
if (errno == EOPNOTSUPP) {
if (params->Nth > 0 || params->Mth > 0) {
fprintf(stderr,
"Should return -ENODATA (%u[%u][%u])\n",
params->request, params->Nth, params->Mth);
exit(1);
}
//printf("\e[33m%s\e[m: <not supported>\n",
// fsinfo_attr_names[attr]);
return 2;
}
perror(file);
exit(1);
}

if (raw) {
if (ret > 4096)
ret = 4096;
dump_hex((unsigned int *)&r.buffer, 0, ret);
return 0;
}

switch (about & 0xc0) {
case 0x00:
printf("\e[33m%s\e[m: ", fsinfo_attr_names[params->request]);
break;
case 0x40:
printf("\e[33m%s[%u]\e[m: ",
fsinfo_attr_names[params->request],
params->Nth);
break;
case 0x80:
printf("\e[33m%s[%u][%u]\e[m: ",
fsinfo_attr_names[params->request],
params->Nth, params->Mth);
break;
}

switch (about) {
/* Struct */
case 0x01 ... 0x3f:
case 0x41 ... 0x7f:
case 0x81 ... 0xbf:
dump_fsinfo(params->request, about, &r, ret);
return 0;

/* String */
case 0x00:
case 0x40:
case 0x80:
if (ret >= 4096) {
ret = 4096;
r.buffer[4092] = '.';
r.buffer[4093] = '.';
r.buffer[4094] = '.';
r.buffer[4095] = 0;
} else {
r.buffer[ret] = 0;
}
for (p = r.buffer; *p; p++) {
/* Report an unprintable string once and skip printing it. */
if (!isprint((unsigned char)*p)) {
printf("<non-printable>\n");
return 0;
}
}
printf("%s\n", r.buffer);
return 0;

default:
fprintf(stderr, "Fishy about %u %02x\n", params->request, about);
exit(1);
}
}

/*
*
*/
int main(int argc, char **argv)
{
struct fsinfo_params params = {
.at_flags = AT_SYMLINK_NOFOLLOW,
};
unsigned int attr;
int raw = 0, opt, Nth, Mth;

while ((opt = getopt(argc, argv, "alr")) != -1) {
switch (opt) {
case 'a':
params.at_flags |= AT_NO_AUTOMOUNT;
continue;
case 'l':
params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
continue;
case 'r':
raw = 1;
continue;
}
break;
}

argc -= optind;
argv += optind;

if (argc != 1) {
printf("Format: test-fsinfo [-alr] <file>\n");
exit(2);
}

for (attr = 0; attr <= fsinfo_attr__nr; attr++) {
Nth = 0;
do {
Mth = 0;
do {
params.request = attr;
params.Nth = Nth;
params.Mth = Mth;

switch (try_one(argv[0], &params, raw)) {
case 0:
continue;
case 1:
goto done_M;
case 2:
goto done_N;
}
} while (++Mth < 100);

done_M:
if (Mth >= 100) {
fprintf(stderr, "Fishy: Mth == %u\n", Mth);
break;
}

} while (++Nth < 100);

done_N:
if (Nth >= 100) {
fprintf(stderr, "Fishy: Nth == %u\n", Nth);
break;
}
}

return 0;
}
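
For readers following try_one(): the per-attribute "about" byte appears to pack two things, judging only from the switch statements and the length check in the sample above. The top two bits select how the attribute is indexed and the low six bits give the struct length in 32-bit words, with zero meaning a string. A minimal sketch of that reading follows; the names decode_about(), struct about_info and enum about_kind are invented for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index kinds, as read off the switch (about & 0xc0) in try_one(). */
enum about_kind {
	ABOUT_SINGLE = 0x00,	/* attr */
	ABOUT_N      = 0x40,	/* attr[N] */
	ABOUT_NM     = 0x80,	/* attr[N][M] */
	ABOUT_BAD    = 0xc0,	/* try_one() reports this as "Fishy" */
};

/* Hypothetical decoded form of the byte; not a structure from the patch. */
struct about_info {
	enum about_kind kind;
	int is_string;		/* low six bits are zero */
	size_t struct_len;	/* expected reply length in bytes */
};

static_assert(sizeof(uint32_t) == 4, "length unit is a 32-bit word");

struct about_info decode_about(uint8_t about)
{
	struct about_info info;

	info.kind = (enum about_kind)(about & 0xc0);
	info.struct_len = (size_t)(about & 0x3f) * sizeof(uint32_t);
	info.is_string = (about & 0x3f) == 0;
	return info;
}
```

Under this reading, 0x41 would describe a one-word struct indexed by Nth, and 0x80 a string indexed by both Nth and Mth.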

2018-05-31 21:21:40

by Al Viro

Subject: Re: [PATCH 31/32] [RFC] fs: Add a move_mount() system call [ver #8]

On Fri, May 25, 2018 at 01:08:44AM +0100, David Howells wrote:
> [!] NOTE: This patch doesn't quite work to move an O_CLONE_MOUNT-produced
> vfsmount as move_mount() checks that the source vfsmount mnt_ns matches
> the calling process's mnt_ns - but the vfsmount's mnt_ns isn't set
> until one attempts to actually mount it into the namespace.

*shrug*

Turn those checks into
error = -EINVAL;
/* mountpoint should be ours */
if (!check_mnt(p))
goto out1;
/* and the thing moved should be either ours or completely unattached */
if (old->mnt_ns && !check_mnt(old))
goto out1;
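
The two checks above can be modelled outside the kernel to make the accepted cases explicit. This is only a sketch: struct mount and struct mnt_namespace here are stand-ins with a single field each, check_mnt() takes the caller's namespace as an explicit argument (the kernel helper reads it from the current task), and may_move() is an invented name.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel structures; only the field we need. */
struct mnt_namespace { int id; };
struct mount { struct mnt_namespace *mnt_ns; };

/* Model of check_mnt(): is this mount in the caller's namespace? */
bool check_mnt(const struct mount *m, const struct mnt_namespace *ours)
{
	return m->mnt_ns == ours;
}

/* Hypothetical wrapper around the two checks from the mail above. */
int may_move(const struct mount *old, const struct mount *p,
	     const struct mnt_namespace *ours)
{
	assert(old && p && ours);

	/* mountpoint should be ours */
	if (!check_mnt(p, ours))
		return -EINVAL;
	/* the thing moved should be either ours or completely unattached */
	if (old->mnt_ns && !check_mnt(old, ours))
		return -EINVAL;
	return 0;
}
```

A mount whose mnt_ns is still NULL, such as a freshly cloned one, passes the second check; a mount belonging to some other namespace does not.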

2018-05-31 21:26:41

by Al Viro

Subject: Re: [PATCH 19/32] VFS: Implement fsopen() to prepare for a mount [ver #8]

On Fri, May 25, 2018 at 01:07:27AM +0100, David Howells wrote:
> + inode = alloc_anon_inode(fscontext_fs_mnt->mnt_sb);
> + if (IS_ERR(inode))
> + return ERR_CAST(inode);
> + inode->i_fop = &fscontext_fs_fops;

That's almost certainly wrong - you need it only if you want it possible to
reopen via /proc/*/fd/*

> + fc->phase = FS_CONTEXT_CREATE_PARAMS;
> +
> + ret = -ENOMEM;
> + path.dentry = d_alloc_pseudo(fscontext_fs_mnt->mnt_sb, &empty_name);
> + if (!path.dentry)
> + goto err_inode;
> + path.mnt = mntget(fscontext_fs_mnt);
> +
> + d_instantiate(path.dentry, inode);
> +
> + f = alloc_file(&path, FMODE_READ | FMODE_WRITE, &fscontext_fs_fops);

Re your question on IRC - we might want that fs in the longer run, but for now
just go with anon_inode_getfile() here. It's easier that way and we can always
switch later.

2018-05-31 23:12:41

by Al Viro

Subject: Re: [PATCH 03/32] VFS: Introduce the basic header for the new mount API's filesystem context [ver #8]

On Fri, May 25, 2018 at 01:05:43AM +0100, David Howells wrote:
> + bool drop_sb:1; /* T if need to drop an SB reference */

IMO that should be simply fc->root != NULL - if you keep a dentry, you'd better
make sure that its superblock has an active reference, so deactivate_super()
is needed anyway.

2018-05-31 23:14:35

by Al Viro

Subject: Re: [PATCH 03/32] VFS: Introduce the basic header for the new mount API's filesystem context [ver #8]

On Fri, May 25, 2018 at 01:05:43AM +0100, David Howells wrote:
> + void *fs_private; /* The filesystem's context */
...
> + void *s_fs_info; /* Proposed s_fs_info */

While we are at it, do we really need both in generic interface?

2018-06-01 01:53:42

by Al Viro

Subject: Re: [PATCH 20/32] vfs: Make close() unmount the attached mount if so flagged [ver #8]

On Thu, May 31, 2018 at 08:19:55PM +0100, Al Viro wrote:
> On Fri, May 25, 2018 at 01:07:34AM +0100, David Howells wrote:
> > + if (unlikely(file->f_mode & FMODE_NEED_UNMOUNT))
> > + __detach_mounts(dentry);
> > +
>
> This is completely wrong. First of all, you want to dissolve the mount tree
> on file->f_path.mnt, not every tree rooted at dentry equal to file->f_path.dentry.
> This is easily done - it would be a simple call of drop_collected_mounts(mnt)
> if not for one detail. You want it to happen only if the sucker isn't attached
> anywhere by that point. IOW,
> namespace_lock();
> lock_mount_hash();
> if (!real_mount(mnt)->mnt_ns)
> umount_tree(real_mount(mnt), UMOUNT_SYNC);
> unlock_mount_hash();
> namespace_unlock();
> and that's it. You don't need that magical mystery turd in move_mount() later
> in the series and all the infrastructure you grow for it.
>
> FWIW, I would've suggested this
> void drop_collected_mounts(struct vfsmount *mnt)
> {
> namespace_lock();
> lock_mount_hash();
> + if (!real_mount(mnt)->mnt_ns)
> + umount_tree(real_mount(mnt), UMOUNT_SYNC);
> - umount_tree(real_mount(mnt), UMOUNT_SYNC);
> unlock_mount_hash();
> namespace_unlock();
> }
>
> and in __fput()
> if (unlikely(file->f_mode & FMODE_NEED_UNMOUNT))
> drop_collected_mounts(mnt);
>
> All there is to it, AFAICS...

... except that it should be a separate primitive - drop_collected_mounts() is
used by put_mnt_ns(), where the root definitely has non-NULL ->mnt_ns.

Another thing: the same issue (misuse of __detach_mounts()) exists in cleanup
path of do_o_path(). What's more, doing it there is pointless - if
do_dentry_open() has set FMODE_NEED_UNMOUNT, it either succeeds or calls fput()
itself. Either way, the caller should *not* do the cleanups done by fput().

Another thing: copy_mount_for_o_path() is bogus. Horrible calling conventions
aside, what the hell is that lock_mount() for? In do_loopback() we lock the
*mountpoint*; here the source gets locked, for no visible reason. What we
should do is something like this:

1) common helper -

static struct mount *__do_loopback(struct path *from, bool recurse)
{
struct mount *mnt = ERR_PTR(-EINVAL), *f = real_mount(from->mnt);

if (IS_MNT_UNBINDABLE(f))
return mnt;

if (!check_mnt(f) && from->dentry->d_op != &ns_dentry_operations)
return mnt;

if (!recurse && has_locked_children(f, from->dentry))
return mnt;

if (recurse)
mnt = copy_tree(f, from->dentry, CL_COPY_MNT_NS_FILE);
else
mnt = clone_mnt(f, from->dentry, 0);
if (!IS_ERR(mnt))
mnt->mnt.mnt_flags &= ~MNT_LOCKED;
return mnt;
}

2) in do_loopback() we are left with

static int do_loopback(struct path *path, const char *old_name,
int recurse)
{
struct path old_path;
struct mount *mnt, *parent;
struct mountpoint *mp;
int err;
if (!old_name || !*old_name)
return -EINVAL;
err = kern_path(old_name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &old_path);
if (err)
return err;

err = -EINVAL;
if (mnt_ns_loop(old_path.dentry))
goto out;

mp = lock_mount(path);
if (IS_ERR(mp)) {
err = PTR_ERR(mp);
goto out;
}

parent = real_mount(path->mnt);
if (!check_mnt(parent))
goto out2;

mnt = __do_loopback(&old_path, recurse);
if (IS_ERR(mnt)) {
err = PTR_ERR(mnt);
goto out2;
}

err = graft_tree(mnt, parent, mp);
if (err) {
lock_mount_hash();
umount_tree(mnt, UMOUNT_SYNC);
unlock_mount_hash();
}
out2:
unlock_mount(mp);
out:
path_put(&old_path);
return err;
}

3) copy_mount_for_o_path() with saner calling conventions:

int copy_mount_for_o_path(struct path *path, bool recurse)
{
struct mount *mnt = __do_loopback(path, recurse);
if (IS_ERR(mnt)) {
path_put(path);
return PTR_ERR(mnt);
}
mntput(path->mnt);
path->mnt = &mnt->mnt;
return 0;
}

4) in do_o_path():
static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
{
struct path path;
int error = path_lookupat(nd, flags, &path);
if (error)
return error;

if (file->f_flags & O_CLONE_MOUNT) {
error = copy_mount_for_o_path(&path,
!(file->f_flags & O_NON_RECURSIVE));
if (error < 0)
return error;
}

audit_inode(nd->name, path.dentry, 0);
error = vfs_open(&path, file, current_cred());
path_put(&path);
return error;
}

2018-06-01 05:18:39

by Al Viro

Subject: Re: [PATCH 20/32] vfs: Make close() unmount the attached mount if so flagged [ver #8]

On Fri, Jun 01, 2018 at 04:18:29AM +0100, Al Viro wrote:

> +void umount_on_fput(struct vfsmount *mnt)
> {

... which needs to do mntput() in case it's attached.

2018-06-01 06:28:33

by Christoph Hellwig

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

On Fri, May 25, 2018 at 01:08:38AM +0100, David Howells wrote:
> Make it possible to clone a mount tree with a new pair of open flags that
> are used in conjunction with O_PATH:
>
> (1) O_CLONE_MOUNT - Clone the mount or mount tree at the path.
>
> (2) O_NON_RECURSIVE - Don't clone recursively.

Err. I don't think we should use up two O_* flags for something
only useful for your new mount API. Don't we have a better place
for these flags?

Instead of overloading this on open, having a specific syscall just
seems like a much saner idea.

2018-06-01 06:44:25

by Al Viro

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

On Thu, May 31, 2018 at 11:26:54PM -0700, Christoph Hellwig wrote:
> On Fri, May 25, 2018 at 01:08:38AM +0100, David Howells wrote:
> > Make it possible to clone a mount tree with a new pair of open flags that
> > are used in conjunction with O_PATH:
> >
> > (1) O_CLONE_MOUNT - Clone the mount or mount tree at the path.
> >
> > (2) O_NON_RECURSIVE - Don't clone recursively.
>
> Err. I don't think we should use up two O_* flags for something
> only useful for your new mount API. Don't we have a better place
> for these flags?
>
> Instead of overloading this on open, having a specific syscall just
> seems like a much saner idea.

It's not just mount API; these can be used independently of that.
Think of the uses where you pass those to ...at() and you'll see
a bunch of applications of that thing.

2018-06-01 08:04:32

by Amir Goldstein

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

[added linux-api]

On Fri, May 25, 2018 at 3:08 AM, David Howells <[email protected]> wrote:
> Make it possible to clone a mount tree with a new pair of open flags that
> are used in conjunction with O_PATH:
>
> (1) O_CLONE_MOUNT - Clone the mount or mount tree at the path.
>
> (2) O_NON_RECURSIVE - Don't clone recursively.
>
> Note that it's not a good idea to reuse other flags (such as O_CREAT)
> because the open routine for O_PATH does not give an error if any other
> flags are used in conjunction with O_PATH, but rather just masks off any it
> doesn't use.
>
> The resultant file struct is marked FMODE_NEED_UNMOUNT as it pins an
> extra reference for the mount. This will be cleared by the upcoming
> move_mount() syscall when it successfully moves a cloned mount into the
> filesystem tree.
>
> Note that care needs to be taken with the error handling in do_o_path() in
> the case that vfs_open() fails as the path may or may not have been
> attached to the file struct and FMODE_NEED_UNMOUNT may or may not be set.
> Note that O_DIRECT | O_PATH could be a problem with error handling too.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
[...]

> @@ -977,8 +979,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
> * If we have O_PATH in the open flag. Then we
> * cannot have anything other than the below set of flags
> */
> - flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
> + flags &= (O_DIRECTORY | O_NOFOLLOW | O_PATH |
> + O_CLONE_MOUNT | O_NON_RECURSIVE);
> acc_mode = 0;
> + } else if (flags & (O_CLONE_MOUNT | O_NON_RECURSIVE)) {
> + return -EINVAL;

Reject O_NON_RECURSIVE without O_CLONE_MOUNT?
That would free at least one flag combination for future use.

Doesn't it make more sense for user API to opt-into
O_RECURSIVE_CLONE, rather than opt-out of it?


> }
>
> op->open_flag = flags;
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index 27dc7a60693e..8f60e2244740 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -9,7 +9,8 @@
> (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
> O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | \
> + O_CLONE_MOUNT | O_NON_RECURSIVE)
>
> #ifndef force_o_largefile
> #define force_o_largefile() (BITS_PER_LONG != 32)
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 0b1c7e35090c..f533e35ea19b 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -88,6 +88,14 @@
> #define __O_TMPFILE 020000000
> #endif
>
> +#ifndef O_CLONE_MOUNT
> +#define O_CLONE_MOUNT 040000000 /* Used with O_PATH to clone the mount subtree at path */
> +#endif
> +
> +#ifndef O_NON_RECURSIVE
> +#define O_NON_RECURSIVE 0100000000 /* Used with O_CLONE_MOUNT to only clone one mount */
> +#endif
> +
> /* a horrid kludge trying to make sure that this will fail on old kernels */
> #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
> #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
>

I am not sure what the consequences are of opening with O_PATH on an old
kernel and getting an open file; I can't think of anything bad.
Can the same be claimed for O_PATH|O_CLONE_MOUNT?

Wouldn't it be better to apply the O_TMPFILE kludge to the new
open flag, so that apps can check if O_CLONE_MOUNT feature is supported
by kernel?

Thanks,
Amir.

2018-06-01 08:29:34

by David Howells

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

Al Viro <[email protected]> wrote:

> > Instead of overloading this on open, having a specific syscall just
> > seems like a much saner idea.
>
> It's not just mount API; these can be used independently of that.
> Think of the uses where you pass those to ...at() and you'll see
> a bunch of applications of that thing.

I kind of agree with Christoph on this point. Yes, you can use the resultant
fd for other things, but that doesn't mean it has to be obtained initially
through open() or openat() rather than, say, a new pick_mount() syscall.

Further, having more parameters available gives us the opportunity to change
the settings on any mounts we create at the point of creation.

David

2018-06-01 08:43:25

by David Howells

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

Amir Goldstein <[email protected]> wrote:

> Reject O_NON_RECURSIVE without O_CLONE_MOUNT?

Yes, I should add that.
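
A rough userspace model of the flag check with that extra rejection folded in might look like the following. The two new flag values are the ones from the quoted hunk and O_PATH is the asm-generic value; the MODEL_ prefix and the function name check_clone_flags() are made up here, and the real build_open_flags() does considerably more than this.

```c
#include <assert.h>
#include <errno.h>

/* Values as in the quoted patch; MODEL_ names are local to this sketch. */
#define MODEL_O_PATH		010000000
#define MODEL_O_CLONE_MOUNT	040000000
#define MODEL_O_NON_RECURSIVE	0100000000

static_assert((MODEL_O_PATH &
	       (MODEL_O_CLONE_MOUNT | MODEL_O_NON_RECURSIVE)) == 0,
	      "the new flags must not overlap O_PATH");

/* Returns 0 if the combination is acceptable, -EINVAL otherwise. */
int check_clone_flags(unsigned int flags)
{
	if (!(flags & MODEL_O_PATH)) {
		/* As in the posted hunk: both flags need O_PATH. */
		if (flags & (MODEL_O_CLONE_MOUNT | MODEL_O_NON_RECURSIVE))
			return -EINVAL;
		return 0;
	}

	/* The additional check suggested in review. */
	if ((flags & MODEL_O_NON_RECURSIVE) && !(flags & MODEL_O_CLONE_MOUNT))
		return -EINVAL;
	return 0;
}
```

With this, O_PATH | O_NON_RECURSIVE on its own is refused, which keeps that combination free for later use.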

> I am not sure what the consequences are of opening with O_PATH on an old
> kernel and getting an open file; I can't think of anything bad.
> Can the same be claimed for O_PATH|O_CLONE_MOUNT?

Yes, actually, there can be consequences. Some files have side effects.
Think open("/dev/foobar", O_PATH).

> Wouldn't it be better to apply the O_TMPFILE kludge to the new
> open flag, so that apps can check if O_CLONE_MOUNT feature is supported
> by kernel?

Ugh. The problem is that the O_TMPFILE kludge can't be done because O_PATH
currently just masks off any bits it's not interested in rather than giving an
error.

Even the O_TMPFILE kludge doesn't protect you against someone having set
random unassigned bits when testing on a kernel that didn't support it.

And this bit:

/*
* Clear out all open flags we don't know about so that we don't report
* them in fcntl(F_GETFD) or similar interfaces.
*/
flags &= VALID_OPEN_FLAGS;

is just plain wrong. Effectively, it allows userspace to set random reserved
bits without consequences. It should give an error instead.
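
To make the difference concrete, here is a small sketch contrasting the two policies on a made-up four-bit flag space. MODEL_VALID_FLAGS, mask_unknown() and reject_unknown() are illustrative names only; nothing here is taken from fs/open.c beyond the idea of the mask.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Pretend only the low four bits are assigned. */
#define MODEL_VALID_FLAGS 0x0fu

/* Current behaviour: unknown bits silently vanish. */
unsigned int mask_unknown(unsigned int flags)
{
	return flags & MODEL_VALID_FLAGS;
}

/* Stricter behaviour: unknown bits are an error the caller can see. */
int reject_unknown(unsigned int flags, unsigned int *out)
{
	assert(out != NULL);

	if (flags & ~MODEL_VALID_FLAGS)
		return -EINVAL;
	*out = flags;
	return 0;
}
```

Only the second form lets userspace probe whether a newly assigned bit is understood by the running kernel, since the first succeeds either way.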

Probably we should really replace open() and openat() both before we can
allocate any further open flags.

</grumble>

David

2018-06-02 03:10:11

by Al Viro

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

On Fri, Jun 01, 2018 at 09:27:43AM +0100, David Howells wrote:
> Al Viro <[email protected]> wrote:
>
> > > Instead of overloading this on open, having a specific syscall just
> > > seems like a much saner idea.
> >
> > It's not just mount API; these can be used independently of that.
> > Think of the uses where you pass those to ...at() and you'll see
> > a bunch of applications of that thing.
>
> I kind of agree with Christoph on this point. Yes, you can use the resultant
> fd for other things, but that doesn't mean it has to be obtained initially
> through open() or openat() rather than, say, a new pick_mount() syscall.
>
> Further, having more parameters available gives us the opportunity to change
> the settings on any mounts we create at the point of creation.

open_subtree(int dirfd, const char *pathname, int flags), then? How would
flags be interpreted? What I see mapping at that thing is
* equivalent of O_PATH open
* clone subtree, O_PATH open root
* clone one mount, O_PATH open root
and apparently you want to add (orthogonal to that)
* make shared/slave/private/unbindable
* ditto with recursion?
* same for nodev/nosuid/noexec/noatime/nodiratime/relatime/ro/?
as well as usual AT_... flags (empty path, follow)

Choose the encoding...

2018-06-02 03:43:40

by Al Viro

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

On Sat, Jun 02, 2018 at 04:09:14AM +0100, Al Viro wrote:
> On Fri, Jun 01, 2018 at 09:27:43AM +0100, David Howells wrote:
> > Al Viro <[email protected]> wrote:
> >
> > > > Instead of overloading this on open, having a specific syscall just
> > > > seems like a much saner idea.
> > >
> > > It's not just mount API; these can be used independently of that.
> > > Think of the uses where you pass those to ...at() and you'll see
> > > a bunch of applications of that thing.
> >
> > I kind of agree with Christoph on this point. Yes, you can use the resultant
> > fd for other things, but that doesn't mean it has to be obtained initially
> > through open() or openat() rather than, say, a new pick_mount() syscall.
> >
> > Further, having more parameters available gives us the opportunity to change
> > the settings on any mounts we create at the point of creation.
>
> open_subtree(int dirfd, const char *pathname, int flags), then? How would
> flags be interpreted? What I see mapping at that thing is
> * equivalent of O_PATH open
> * clone subtree, O_PATH open root
> * clone one mount, O_PATH open root
> and apparently you want to add (orthogonal to that)
> * make shared/slave/private/unbindable
> * ditto with recursion?
> * same for nodev/nosuid/noexec/noatime/nodiratime/relatime/ro/?
> as well as usual AT_... flags (empty path, follow)
>
> Choose the encoding...

_If_ I'm interpreting that correctly, that should be something like a bitmap
of attributes to modify + values to set for each. Let's see -
propagation 1 + 2 bits
nodev 1 + 1
noexec 1 + 1
nosuid 1 + 1
ro 1 + 1
atime 1 + 3
That's 15 bits. On top of that, we have 1 bit for "clone or original"
and 1 bit for "recursive or single-mount". As well as AT_EMPTY_PATH,
and AT_NO_AUTOMOUNT (inconvenient, since these are fixed bits). In
principle, that does fit into int, with some space to spare...

Is that what you have in mind?
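
The bit count above can be checked mechanically. The table below simply restates the list from this mail (one "modify" bit plus the stated number of value bits per attribute); struct attr_field and total_bits() are names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* One row per attribute: a "modify" bit plus the bits for the new value. */
struct attr_field {
	const char *name;
	unsigned int change_bits;
	unsigned int value_bits;
};

const struct attr_field attr_fields[] = {
	{ "propagation", 1, 2 },
	{ "nodev",       1, 1 },
	{ "noexec",      1, 1 },
	{ "nosuid",      1, 1 },
	{ "ro",          1, 1 },
	{ "atime",       1, 3 },
};

static_assert(sizeof(attr_fields) / sizeof(attr_fields[0]) == 6,
	      "six attributes are listed in the mail");

/* Sum of all modify and value bits across the table. */
unsigned int total_bits(void)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < sizeof(attr_fields) / sizeof(attr_fields[0]); i++)
		sum += attr_fields[i].change_bits + attr_fields[i].value_bits;
	return sum;
}
```

Adding the "clone or original" and "recursive or single-mount" bits brings the total to 17, before the fixed AT_* bits are accounted for.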

2018-06-02 04:11:07

by Al Viro

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

On Sat, Jun 02, 2018 at 04:42:56AM +0100, Al Viro wrote:
> _If_ I'm interpreting that correctly, that should be something like a bitmap
> of attributes to modify + values to set for each. Let's see -
> propagation 1 + 2 bits
> nodev 1 + 1
> noexec 1 + 1
> nosuid 1 + 1
> ro 1 + 1
> atime 1 + 3
> That's 15 bits. On top of that, we have 1 bit for "clone or original"
> and 1 bit for "recursive or single-mount". As well as AT_EMPTY_PATH,
> and AT_NO_AUTOMOUNT (inconvenient, since these are fixed bits). In
> principle, that does fit into int, with some space to spare...
>
> Is that what you have in mind?

TBH, I would probably prefer separate mount_setattr(2) for that kind
of work, with something like
int mount_setattr(int dirfd, const char *path, int flags, int attr)
*not* opening any files.
flags:
AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE
attr:
MOUNT_SETATTR_DEV (1<<0)
MOUNT_SETATTR_NODEV (1<<0)|(1<<1)
MOUNT_SETATTR_EXEC (1<<2)
MOUNT_SETATTR_NOEXEC (1<<2)|(1<<3)
MOUNT_SETATTR_SUID (1<<4)
MOUNT_SETATTR_NOSUID (1<<4)|(1<<5)
MOUNT_SETATTR_RW (1<<6)
MOUNT_SETATTR_RO (1<<6)|(1<<7)
MOUNT_SETATTR_RELATIME (1<<8)
MOUNT_SETATTR_NOATIME (1<<8)|(1<<9)
MOUNT_SETATTR_NODIRATIME (1<<8)|(2<<9)
MOUNT_SETATTR_STRICTATIME (1<<8)|(3<<9)
MOUNT_SETATTR_PRIVATE (1<<11)
MOUNT_SETATTR_SHARED (1<<11)|(1<<12)
MOUNT_SETATTR_SLAVE (1<<11)|(2<<12)
MOUNT_SETATTR_UNBINDABLE (1<<11)|(3<<12)

With either openat() used as in this series, or explicit
int open_tree(int dirfd, const char *path, int flags)
returning a descriptor, with flags being
AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE, AT_CLONE
with AT_RECURSIVE without AT_CLONE being an error. Hell, might
even add AT_UMOUNT for "open root and detach, to be dissolved on
close", incompatible with AT_CLONE.

Comments?
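
As a sketch of the open_tree() flag rules just described: AT_RECURSIVE requires AT_CLONE, and the possible AT_UMOUNT excludes AT_CLONE. The bit values below are placeholders chosen for this example, not proposed ABI, and rejecting unassigned bits is an assumption added here rather than something stated above.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder bit assignments, for illustration only. */
#define MODEL_AT_EMPTY_PATH	0x01u
#define MODEL_AT_NO_AUTOMOUNT	0x02u
#define MODEL_AT_RECURSIVE	0x04u
#define MODEL_AT_CLONE		0x08u
#define MODEL_AT_UMOUNT		0x10u

#define MODEL_AT_ALL	(MODEL_AT_EMPTY_PATH | MODEL_AT_NO_AUTOMOUNT | \
			 MODEL_AT_RECURSIVE | MODEL_AT_CLONE | MODEL_AT_UMOUNT)

static_assert(MODEL_AT_ALL == 0x1fu, "placeholder bits are contiguous");

/* Returns 0 for an acceptable combination, -EINVAL otherwise. */
int open_tree_check_flags(unsigned int flags)
{
	/* Assumption of this sketch: unknown bits are refused. */
	if (flags & ~MODEL_AT_ALL)
		return -EINVAL;
	/* AT_RECURSIVE without AT_CLONE is an error. */
	if ((flags & MODEL_AT_RECURSIVE) && !(flags & MODEL_AT_CLONE))
		return -EINVAL;
	/* AT_UMOUNT is incompatible with AT_CLONE. */
	if ((flags & MODEL_AT_UMOUNT) && (flags & MODEL_AT_CLONE))
		return -EINVAL;
	return 0;
}
```

Note that under these two rules AT_UMOUNT | AT_RECURSIVE is also refused, since AT_RECURSIVE then has no AT_CLONE to go with.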

2018-06-02 15:46:06

by David Howells

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

Al Viro <[email protected]> wrote:

> TBH, I would probably prefer separate mount_setattr(2) for that kind
> of work, with something like
> int mount_setattr(int dirfd, const char *path, int flags, int attr)
> *not* opening any files.
> flags:
> AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE

I would call these MOUNT_SETATTR_* rather than AT_*.

> attr:
> MOUNT_SETATTR_DEV (1<<0)
> MOUNT_SETATTR_NODEV (1<<0)|(1<<1)
> MOUNT_SETATTR_EXEC (1<<2)
> MOUNT_SETATTR_NOEXEC (1<<2)|(1<<3)
> MOUNT_SETATTR_SUID (1<<4)
> MOUNT_SETATTR_NOSUID (1<<4)|(1<<5)
> MOUNT_SETATTR_RW (1<<6)
> MOUNT_SETATTR_RO (1<<6)|(1<<7)
> MOUNT_SETATTR_RELATIME (1<<8)
> MOUNT_SETATTR_NOATIME (1<<8)|(1<<9)
> MOUNT_SETATTR_NODIRATIME (1<<8)|(2<<9)
> MOUNT_SETATTR_STRICTATIME (1<<8)|(3<<9)
> MOUNT_SETATTR_PRIVATE (1<<11)
> MOUNT_SETATTR_SHARED (1<<11)|(1<<12)
> MOUNT_SETATTR_SLAVE (1<<11)|(2<<12)
> MOUNT_SETATTR_UNBINDABLE (1<<11)|(3<<12)

So, I like this generally, some notes though:

I wonder if this should be two separate parameters, a mask and the settings?
I'm not sure that's worth it since some of the mask bits would cover multiple
settings.

Also, should NODIRATIME be separate from the other *ATIME flags? I do also
like treating some of these settings as enumerations rather than a set of
bits.

I would make the prototype:

int mount_setattr(int dirfd, const char *path,
unsigned int flags, unsigned int attr,
void *reserved5);

Further, do we want to say you can either change the propagation type *or*
reconfigure the mountpoint restrictions, but not both at the same time?

> With either openat() used as in this series, or explicit
> int open_tree(int dirfd, const char *path, int flags)

Maybe open_mount(), grab_mount() or pick_mount()?

I wonder if fsopen()/fspick() should be create_fs()/open_fs()...

> returning a descriptor, with flags being
> AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE, AT_CLONE
> with AT_RECURSIVE without AT_CLONE being an error.

You also need an O_CLOEXEC equivalent.

I would make it:

OPEN_TREE_CLOEXEC 0x00000001
OPEN_TREE_EMPTY_PATH 0x00000002
OPEN_TREE_FOLLOW_SYMLINK 0x00000004
OPEN_TREE_NO_AUTOMOUNT 0x00000008
OPEN_TREE_CLONE 0x00000010
OPEN_TREE_RECURSIVE 0x00000020

adding the follow-symlinks so that you don't grab a symlink target by
accident. (Can you actually mount on top of a symlink?)

> Hell, might even add AT_UMOUNT for "open root and detach, to be dissolved on
> close", incompatible with AT_CLONE.

Cute. Guess you could do:

fd = open_mount(..., OPEN_TREE_DETACH);
mount_setattr(fd, "",
MOUNT_SETATTR_EMPTY_PATH,
MOUNT_SETATTR_NOSUID, NULL);
move_mount(fd, "", ...);

David

2018-06-02 17:50:45

by Al Viro

Subject: Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

On Sat, Jun 02, 2018 at 04:45:21PM +0100, David Howells wrote:
> Al Viro <[email protected]> wrote:
>
> > TBH, I would probably prefer separate mount_setattr(2) for that kind
> > of work, with something like
> > int mount_setattr(int dirfd, const char *path, int flags, int attr)
> > *not* opening any files.
> > flags:
> > AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE
>
> I would call these MOUNT_SETATTR_* rather than AT_*.

Why? AT_EMPTY_PATH/AT_NO_AUTOMOUNT are common with other ...at()
syscalls; AT_RECURSIVE - maybe, but it's still more like AT_...
namespace fodder, IMO.

> > attr:
> > MOUNT_SETATTR_DEV (1<<0)
> > MOUNT_SETATTR_NODEV (1<<0)|(1<<1)
> > MOUNT_SETATTR_EXEC (1<<2)
> > MOUNT_SETATTR_NOEXEC (1<<2)|(1<<3)
> > MOUNT_SETATTR_SUID (1<<4)
> > MOUNT_SETATTR_NOSUID (1<<4)|(1<<5)
> > MOUNT_SETATTR_RW (1<<6)
> > MOUNT_SETATTR_RO (1<<6)|(1<<7)
> > MOUNT_SETATTR_RELATIME (1<<8)
> > MOUNT_SETATTR_NOATIME (1<<8)|(1<<9)
> > MOUNT_SETATTR_NODIRATIME (1<<8)|(2<<9)
> > MOUNT_SETATTR_STRICTATIME (1<<8)|(3<<9)
> > MOUNT_SETATTR_PRIVATE (1<<11)
> > MOUNT_SETATTR_SHARED (1<<11)|(1<<12)
> > MOUNT_SETATTR_SLAVE (1<<11)|(2<<12)
> > MOUNT_SETATTR_UNBINDABLE (1<<11)|(3<<12)
>
> So, I like this generally, some notes though:
>
> I wonder if this should be two separate parameters, a mask and the settings?
> I'm not sure that's worth it since some of the mask bits would cover multiple
> settings.

Nah, better put those bits in the same word, as in above. Here bits 0, 2, 4, 6,
8 and 11 tell which attributes are to be modified, with values to set living
in bits 1, 3, 5, 7, 9--10 and 12--13. Look at the constants above..

> Also, should NODIRATIME be separate from the other *ATIME flags? I do also
> like treating some of these settings as enumerations rather than a set of
> bits.

Huh? That's precisely what I'm doing there: bit 8 is "want to change atime
settings", bits 9 and 10 hold a 4-element enumeration (rel/no/nodir/strict).
Similar for propagation settings (bit 11 indicates that we want to set those,
bits 12 and 13 - 4-element enum)...

> I would make the prototype:
>
> int mount_setattr(int dirfd, const char *path,
> unsigned int flags, unsigned int attr,
> void *reserved5);
>
> Further, do we want to say you can either change the propagation type *or*
> reconfigure the mountpoint restrictions, but not both at the same time?

Why? MOUNT_SETATTR_PRIVATE | MOUNT_SETATTR_NOATIME | MOUNT_SETATTR_SUID, i.e.
00101100010000, i.e. 0xb10 for "turn nosuid off, switch atime policy to noatime,
change propagation to private, leave everything else as-is"...
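
That example can be unpacked field by field. The constants below are copied from the proposal quoted earlier in this mail; wants_change() and field_value() are helper names invented for this sketch, and the shifts follow the layout described above (modify bits 0, 2, 4, 6, 8, 11; values in bits 1, 3, 5, 7, 9-10, 12-13).

```c
#include <assert.h>

/* Constants as proposed in the quoted text (subset used here). */
#define MOUNT_SETATTR_SUID	(1 << 4)
#define MOUNT_SETATTR_NOSUID	((1 << 4) | (1 << 5))
#define MOUNT_SETATTR_NOATIME	((1 << 8) | (1 << 9))
#define MOUNT_SETATTR_PRIVATE	(1 << 11)

static_assert((MOUNT_SETATTR_PRIVATE | MOUNT_SETATTR_NOATIME |
	       MOUNT_SETATTR_SUID) == 0xb10,
	      "matches the value worked out in the mail");

/* Is the "modify this attribute" bit set? */
int wants_change(unsigned int attr, unsigned int change_bit)
{
	return (attr >> change_bit) & 1;
}

/* Extract a value field of the given width starting at value_shift. */
unsigned int field_value(unsigned int attr, unsigned int value_shift,
			 unsigned int width)
{
	return (attr >> value_shift) & ((1u << width) - 1);
}
```

For 0xb10 this gives: suid requested with value 0 (nosuid off), atime requested with enum value 1 (noatime), propagation requested with enum value 0 (private), and the dev, exec and ro modify bits all clear.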

And for fsck sake, what's that "void *reserved5" for?

> > With either openat() used as in this series, or explicit
> > int open_tree(int dirfd, const char *path, int flags)
>
> Maybe open_mount(), grab_mount() or pick_mount()?
>
> I wonder if fsopen()/fspick() should be create_fs()/open_fs()...
>
> > returning a descriptor, with flags being
> > AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE, AT_CLONE
> > with AT_RECURSIVE without AT_CLONE being an error.
>
> You also need an O_CLOEXEC equivalent.

Point.

> I would make it:
>
> OPEN_TREE_CLOEXEC 0x00000001

Why not O_CLOEXEC, as with epoll_create()/timerfd_create()/etc.?

> OPEN_TREE_EMPTY_PATH 0x00000002
> OPEN_TREE_FOLLOW_SYMLINK 0x00000004
> OPEN_TREE_NO_AUTOMOUNT 0x00000008

Why? How are those different from normal AT_EMPTY_PATH/AT_NO_AUTOMOUNT?

> OPEN_TREE_CLONE 0x00000010
> OPEN_TREE_RECURSIVE 0x00000020
>
> adding the follow-symlinks so that you don't grab a symlink target by
> accident. (Can you actually mount on top of a symlink?)

You can't - mount(2) uses LOOKUP_FOLLOW for mountpoint (well, user_path(),
actually).

> > Hell, might even add AT_UMOUNT for "open root and detach, to be dissolved on
> > close", incompatible with AT_CLONE.
>
> Cute. Guess you could do:
>
> fd = open_mount(..., OPEN_TREE_DETACH);
> mount_setattr(fd, "",
> MOUNT_SETATTR_EMPTY_PATH,
> MOUNT_SETATTR_NOSUID, NULL);
> move_mount(fd, "", ...);

2018-06-03 00:56:51

by Al Viro

Subject: [PATCH][RFC] open_tree(2) (was Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8])

On Sat, Jun 02, 2018 at 06:49:58PM +0100, Al Viro wrote:

> > > Hell, might even add AT_UMOUNT for "open root and detach, to be dissolved on
> > > close", incompatible with AT_CLONE.
> >
> > Cute. Guess you could do:
> >
> > fd = open_mount(..., OPEN_TREE_DETACH);
> > mount_setattr(fd, "",
> > MOUNT_SETATTR_EMPTY_PATH,
> > MOUNT_SETATTR_NOSUID, NULL);
> > move_mount(fd, "", ...);

Hadn't added that yet, but as for the rest of open_tree() - see
vfs.git#mount-open_tree. open() and its flags are not touched at all.
Other changes compared to the previous:
* may_mount() is required for OPEN_TREE_CLONE
* sys_ni.c cruft is dropped - those make no sense until and unless
those syscalls become conditional.

Appears to work; combined with move_mount() it yields equivalents of
mount --{move,bind,rbind}. Combined with mount_setattr(2) (when that
gets added) we'll get mount -o remount,bind,nodev et al.
(including the currently absent whole-subtree versions) and
mount --make-{r,}{shared,slave,private,unbindable}.

It also can be used to get an isolated subtree usable for ....at()
stuff.
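
For illustration, a rough userspace sketch of the "open_tree() + move_mount() gives you mount --bind/--rbind" combination described above. The helper names (clone_tree, attach_tree, bind_mount) are made up for this example. OPEN_TREE_CLONE, OPEN_TREE_CLOEXEC and AT_RECURSIVE take the values from the patch in this mail; the MOVE_MOUNT_F_EMPTY_PATH value is an assumption taken from the move_mount() that was later merged, and the syscall numbers are whatever the installed headers provide (the numbers in this RFC are not final).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values as in the patch; only defined here if the headers lack them. */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif
/* Assumption: value from the move_mount() that was eventually merged. */
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif

/* Get an fd referring to a detached copy of the mount (tree) at src. */
static int clone_tree(const char *src, int recursive)
{
#ifdef __NR_open_tree
	unsigned int flags = OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC;

	if (recursive)
		flags |= AT_RECURSIVE;	/* whole subtree, as mount --rbind */
	return (int)syscall(__NR_open_tree, AT_FDCWD, src, flags);
#else
	(void)src;
	(void)recursive;
	errno = ENOSYS;		/* headers predate the syscall */
	return -1;
#endif
}

/* Attach the detached tree referred to by tree_fd at dst. */
static int attach_tree(int tree_fd, const char *dst)
{
#ifdef __NR_move_mount
	return (int)syscall(__NR_move_mount, tree_fd, "", AT_FDCWD, dst,
			    MOVE_MOUNT_F_EMPTY_PATH);
#else
	(void)tree_fd;
	(void)dst;
	errno = ENOSYS;
	return -1;
#endif
}

/* Rough equivalent of mount --bind (recursive == 0) or --rbind. */
static int bind_mount(const char *src, const char *dst, int recursive)
{
	int fd = clone_tree(src, recursive);
	int ret, saved_errno;

	if (fd < 0)
		return -1;
	ret = attach_tree(fd, dst);
	saved_errno = errno;
	/* If the attach failed, this final close dissolves the detached copy. */
	close(fd);
	errno = saved_errno;
	return ret;
}
```

A caller would use bind_mount("/src", "/dst", 1) for the --rbind case. It needs the same privileges as mount --bind, so an unprivileged caller should expect -1 with errno set to EPERM.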

The addition of the syscall itself is done by the following, and I'd really
like linux-abi folks to comment on that puppy:

commit 6cfba4dd99b10278c2156c8d4fced2eddedf167f
Author: Al Viro <[email protected]>
Date: Sat Jun 2 19:42:22 2018 -0400

new syscall: open_tree(2)

open_tree(dfd, pathname, flags)

Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)). flags should be an OR of
some of the following:
* AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname. With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within the mount containing the
location in question. In other words, the same as mount --rbind
or mount --bind would've taken. The detached tree will be
dissolved on the final close of the obtained file. Creation of such
detached trees requires the same capabilities as doing mount --bind.

Signed-off-by: Al Viro <[email protected]>

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 14a2f996e543..b2b44ecd2b17 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -397,3 +397,4 @@
383 i386 statx sys_statx __ia32_sys_statx
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
+391 i386 open_tree sys_open_tree __ia32_sys_open_tree
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index cd36232ab62f..d6f4949378e7 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -342,6 +342,7 @@
331 common pkey_free __x64_sys_pkey_free
332 common statx __x64_sys_statx
333 common io_pgetevents __x64_sys_io_pgetevents
+339 common open_tree __x64_sys_open_tree

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/file_table.c b/fs/file_table.c
index 7ec0b3e5f05d..7480271a0d21 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -189,6 +189,7 @@ static void __fput(struct file *file)
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = file->f_inode;
+ fmode_t mode = file->f_mode;

might_sleep();

@@ -209,14 +210,14 @@ static void __fput(struct file *file)
file->f_op->release(inode, file);
security_file_free(file);
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
- !(file->f_mode & FMODE_PATH))) {
+ !(mode & FMODE_PATH))) {
cdev_put(inode->i_cdev);
}
fops_put(file->f_op);
put_pid(file->f_owner.pid);
- if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
+ if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_dec(inode);
- if (file->f_mode & FMODE_WRITER) {
+ if (mode & FMODE_WRITER) {
put_write_access(inode);
__mnt_drop_write(mnt);
}
@@ -224,6 +225,8 @@ static void __fput(struct file *file)
file->f_path.mnt = NULL;
file->f_inode = NULL;
file_free(file);
+ if (unlikely(mode & FMODE_NEED_UNMOUNT))
+ dissolve_on_fput(mnt);
dput(dentry);
mntput(mnt);
}
diff --git a/fs/internal.h b/fs/internal.h
index 980d005b21b4..b55575b9b55c 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -85,6 +85,7 @@ extern void __mnt_drop_write(struct vfsmount *);
extern void __mnt_drop_write_file(struct file *);
extern void mnt_drop_write_file_path(struct file *);

+extern void dissolve_on_fput(struct vfsmount *);
/*
* fs_struct.c
*/
diff --git a/fs/namespace.c b/fs/namespace.c
index 5f75969adff1..3281fea98cf0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -20,12 +20,14 @@
#include <linux/init.h> /* init_rootfs */
#include <linux/fs_struct.h> /* get_fs_root et.al. */
#include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */
+#include <linux/file.h>
#include <linux/uaccess.h>
#include <linux/proc_ns.h>
#include <linux/magic.h>
#include <linux/bootmem.h>
#include <linux/task_work.h>
#include <linux/sched/task.h>
+#include <uapi/linux/mount.h>

#include "pnode.h"
#include "internal.h"
@@ -1839,6 +1841,16 @@ struct vfsmount *collect_mounts(const struct path *path)
return &tree->mnt;
}

+void dissolve_on_fput(struct vfsmount *mnt)
+{
+ namespace_lock();
+ lock_mount_hash();
+ mntget(mnt);
+ umount_tree(real_mount(mnt), UMOUNT_SYNC);
+ unlock_mount_hash();
+ namespace_unlock();
+}
+
void drop_collected_mounts(struct vfsmount *mnt)
{
namespace_lock();
@@ -2198,6 +2210,30 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
return false;
}

+static struct mount *__do_loopback(struct path *old_path, int recurse)
+{
+ struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
+
+ if (IS_MNT_UNBINDABLE(old))
+ return mnt;
+
+ if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
+ return mnt;
+
+ if (!recurse && has_locked_children(old, old_path->dentry))
+ return mnt;
+
+ if (recurse)
+ mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
+ else
+ mnt = clone_mnt(old, old_path->dentry, 0);
+
+ if (!IS_ERR(mnt))
+ mnt->mnt.mnt_flags &= ~MNT_LOCKED;
+
+ return mnt;
+}
+
/*
* do loopback mount.
*/
@@ -2205,7 +2241,7 @@ static int do_loopback(struct path *path, const char *old_name,
int recurse)
{
struct path old_path;
- struct mount *mnt = NULL, *old, *parent;
+ struct mount *mnt = NULL, *parent;
struct mountpoint *mp;
int err;
if (!old_name || !*old_name)
@@ -2219,38 +2255,21 @@ static int do_loopback(struct path *path, const char *old_name,
goto out;

mp = lock_mount(path);
- err = PTR_ERR(mp);
- if (IS_ERR(mp))
+ if (IS_ERR(mp)) {
+ err = PTR_ERR(mp);
goto out;
+ }

- old = real_mount(old_path.mnt);
parent = real_mount(path->mnt);
-
- err = -EINVAL;
- if (IS_MNT_UNBINDABLE(old))
- goto out2;
-
if (!check_mnt(parent))
goto out2;

- if (!check_mnt(old) && old_path.dentry->d_op != &ns_dentry_operations)
- goto out2;
-
- if (!recurse && has_locked_children(old, old_path.dentry))
- goto out2;
-
- if (recurse)
- mnt = copy_tree(old, old_path.dentry, CL_COPY_MNT_NS_FILE);
- else
- mnt = clone_mnt(old, old_path.dentry, 0);
-
+ mnt = __do_loopback(&old_path, recurse);
if (IS_ERR(mnt)) {
err = PTR_ERR(mnt);
goto out2;
}

- mnt->mnt.mnt_flags &= ~MNT_LOCKED;
-
err = graft_tree(mnt, parent, mp);
if (err) {
lock_mount_hash();
@@ -2264,6 +2283,75 @@ static int do_loopback(struct path *path, const char *old_name,
return err;
}

+SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
+{
+ struct file *file;
+ struct path path;
+ int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+ bool detached = flags & OPEN_TREE_CLONE;
+ int error;
+ int fd;
+
+ BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
+
+ if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
+ AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
+ OPEN_TREE_CLOEXEC))
+ return -EINVAL;
+
+ if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
+ return -EINVAL;
+
+ if (flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+
+ if (detached && !may_mount())
+ return -EPERM;
+
+ fd = get_unused_fd_flags(flags & O_CLOEXEC);
+ if (fd < 0)
+ return fd;
+
+ error = user_path_at(dfd, filename, lookup_flags, &path);
+ if (error)
+ goto out;
+
+ if (detached) {
+ struct mount *mnt = __do_loopback(&path, flags & AT_RECURSIVE);
+ if (IS_ERR(mnt)) {
+ error = PTR_ERR(mnt);
+ goto out2;
+ }
+ mntput(path.mnt);
+ path.mnt = &mnt->mnt;
+ }
+
+ file = dentry_open(&path, O_PATH, current_cred());
+ if (IS_ERR(file)) {
+ error = PTR_ERR(file);
+ goto out3;
+ }
+
+ if (detached)
+ file->f_mode |= FMODE_NEED_UNMOUNT;
+ path_put(&path);
+ fd_install(fd, file);
+ return fd;
+
+out3:
+ if (detached)
+ dissolve_on_fput(path.mnt);
+out2:
+ path_put(&path);
+out:
+ put_unused_fd(fd);
+ return error;
+}
+
static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
{
int error = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 482563fe549c..706b4605bc26 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -154,6 +154,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File is capable of returning -EAGAIN if I/O will block */
#define FMODE_NOWAIT ((__force fmode_t)0x8000000)

+/* File represents mount that needs unmounting */
+#define FMODE_NEED_UNMOUNT ((__force fmode_t)0x10000000)
+
/*
* Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
* that indicates that they should check the contents of the iovec are
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 811172fcb916..925483aba03a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -896,7 +896,7 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
asmlinkage long sys_pkey_free(int pkey);
asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
-
+asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..594b85f7cb86 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -90,5 +90,7 @@
#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */
#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */

+#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+

#endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
new file mode 100644
index 000000000000..b9c3b46210db
--- /dev/null
+++ b/include/uapi/linux/mount.h
@@ -0,0 +1,7 @@
+#ifndef _UAPI_LINUX_MOUNT_H
+#define _UAPI_LINUX_MOUNT_H
+
+#define OPEN_TREE_CLONE 1
+#define OPEN_TREE_CLOEXEC O_CLOEXEC
+
+#endif /* _UAPI_LINUX_MOUNT_H */
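
As a reading aid, the flag checks at the top of sys_open_tree() in the patch above can be restated as a small userspace function. This is only a sketch: open_tree_check_flags() is a made-up name, the fallback AT_* values are the usual Linux ones, and the logic simply mirrors the two -EINVAL tests in the patch (unknown bits, and AT_RECURSIVE without OPEN_TREE_CLONE).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* From the patch. */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif
/* Fallbacks for older libc headers (usual Linux values). */
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif
#ifndef AT_NO_AUTOMOUNT
#define AT_NO_AUTOMOUNT 0x800
#endif

/* Returns 0 if the flag word would pass the checks, -EINVAL otherwise. */
static int open_tree_check_flags(unsigned int flags)
{
	/* Any bit outside the accepted set is rejected. */
	if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
		      AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
		      OPEN_TREE_CLOEXEC))
		return -EINVAL;

	/* AT_RECURSIVE only makes sense when cloning. */
	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
		return -EINVAL;

	return 0;
}
```

Note that flags == 0 passes: without OPEN_TREE_CLONE the call just hands back an O_PATH descriptor for the location.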

2018-06-04 10:35:47

by Miklos Szeredi

Subject: Re: [PATCH][RFC] open_tree(2) (was Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8])

On Sun, Jun 3, 2018 at 2:55 AM, Al Viro <[email protected]> wrote:
> On Sat, Jun 02, 2018 at 06:49:58PM +0100, Al Viro wrote:
>
>> > > Hell, might even add AT_UMOUNT for "open root and detach, to be dissolved on
>> > > close", incompatible with AT_CLONE.
>> >
>> > Cute. Guess you could do:
>> >
>> > fd = open_mount(..., OPEN_TREE_DETACH);
>> > mount_setattr(fd, "",
>> > MOUNT_SETATTR_EMPTY_PATH,
>> > MOUNT_SETATTR_NOSUID, NULL);
>> > move_mount(fd, "", ...);
>
> Hadn't added that yet, but as for the rest of open_tree() - see
> vfs.git#mount-open_tree. open() and its flags are not touched at all.
> Other changes compared to the previous:
> * may_mount() is required for OPEN_TREE_CLONE
> * sys_ni.c cruft is dropped - those make no sense until and unless
> those syscalls become conditional.
>
> Appears to work; combined with move_mount() it yields equivalents of
> mount --{move,bind,rbind}. Combined with mount_setattr(2) (when that
> gets added) we'll get mount -o remount,bind,nodev et al.

fsopen = create fsfd
fsmount = fsfd -> mountfd & set attr on mountfd & attach mountfd
fspick = path -> fsfd
move_mount = attach mountfd or move existing
fsinfo = info from path
open_tree = new mountfd from path or clone
mount_setattr = set attr on mountfd

Notice that fsmount() encompasses mount_setattr() + move_mount()
functionality. Split those out and leave fsmount() to actually do
the "fsfd -> mountfd" translation?

The fsinfo() name suggests it's in the same class as
fsopen/fsmount/fspick, operating on an fsfd object, but it's not, and I
think that's slightly confusing.

Rename move_mount() -> mount_move()?

Also does it make sense to make the cloning behavior of open_tree()
optional? Without cloning it's just a plain open(O_PATH). That way
it could be renamed mount_clone().

Thanks,
Miklos

2018-06-04 13:11:02

by Arnd Bergmann

Subject: Re: [PATCH 32/32] [RFC] fsinfo: Add a system call to allow querying of filesystem information [ver #8]

On Fri, May 25, 2018 at 2:08 AM, David Howells <[email protected]> wrote:

> +
> +static int fsinfo_generic_timestamp_info(struct dentry *dentry,
> + struct fsinfo_timestamp_info *ts)
> +{
> + struct super_block *sb = dentry->d_sb;
> +
> + /* If unset, assume 1s granularity */
> + u16 mantissa = 1;
> + s8 exponent = 0;
> +
> + ts->minimum_timestamp = S64_MIN;
> + ts->maximum_timestamp = S64_MAX;
> + if (sb->s_time_gran < 1000000000) {
> + if (sb->s_time_gran < 1000)
> + exponent = -9;
> + else if (sb->s_time_gran < 1000000)
> + exponent = -6;
> + else
> + exponent = -3;
> + }

ntfs has sb->s_time_gran=100, and vfat should really have
sb->s_time_gran=2000000000 but that doesn't seem to be set right
at the moment.
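
To spell out why those two values matter: the quoted code never changes the mantissa from 1 and only picks exponents of -9, -6, -3 or 0, so a 100ns granularity would be reported as 1ns and a 2s granularity as 1s. A minimal sketch of an exact conversion from a nanosecond granularity to the mant * 10^exp form follows. struct gran and ns_to_gran() are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Granularity in seconds = mantissa * 10^exponent. */
struct gran {
	uint16_t mantissa;
	int8_t exponent;
};

/*
 * Convert a granularity given in nanoseconds.  Returns 0 on success,
 * -1 if ns is 0 or the reduced mantissa does not fit in 16 bits.
 */
static int ns_to_gran(uint64_t ns, struct gran *out)
{
	int exponent = -9;	/* start out counting in units of 1ns */

	if (ns == 0)
		return -1;

	/* Move trailing decimal zeros from the mantissa into the exponent. */
	while (ns % 10 == 0) {
		ns /= 10;
		exponent++;
	}

	if (ns > UINT16_MAX)
		return -1;

	out->mantissa = (uint16_t)ns;
	out->exponent = (int8_t)exponent;
	return 0;
}
```

With this, 100ns comes out as 1 * 10^-7 and 2000000000ns as 2 * 10^0.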

> +/*
> + * Optional fsinfo() parameter structure.
> + *
> + * If this is not given, it is assumed that fsinfo_attr_statfs instance 0 is
> + * desired.
> + */
> +struct fsinfo_params {
> + enum fsinfo_attribute request; /* What is being asked for */
> + __u32 Nth; /* Instance of it (some may have multiple) */
> + __u32 at_flags; /* AT_SYMLINK_NOFOLLOW and similar flags */
> + __u32 __spare[6]; /* Spare params; all must be 0 */
> +};

I fear the 'enum' in the uapi structure may have a different size depending
on the architecture. Maybe turn that into a __u32 as well?

> +struct fsinfo_capabilities {
> + __u64 supported_stx_attributes; /* What statx::stx_attributes are supported */
> + __u32 supported_stx_mask; /* What statx::stx_mask bits are supported */
> + __u32 supported_ioc_flags; /* What FS_IOC_* flags are supported */
> + __u8 capabilities[(fsinfo_cap__nr + 7) / 8];
> +};

This looks a bit odd: with the 44 capabilities, you end up having a six-byte
array followed by two bytes of implicit padding. If the number of
capabilities grows beyond 64, you have a nine-byte array with more padding
to the next alignof(__u64). Is that intentional?

How about making it a fixed size with either 64 or 128 capability bits?
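
The arithmetic can be checked with a small userspace sketch. The struct names (caps_44, caps_65, caps_fixed) and the CAP_BYTES/TAIL_PAD macros are invented for the example, and stdint types stand in for the __u64/__u32/__u8 of the quoted struct. The fixed-size variant assumes 128 capability bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same rounding as capabilities[(fsinfo_cap__nr + 7) / 8] in the patch. */
#define CAP_BYTES(nr) (((nr) + 7) / 8)

/* Layout as posted, with 44 capabilities. */
struct caps_44 {
	uint64_t supported_stx_attributes;
	uint32_t supported_stx_mask;
	uint32_t supported_ioc_flags;
	uint8_t capabilities[CAP_BYTES(44)];
};

/* Same layout once the capability count passes 64. */
struct caps_65 {
	uint64_t supported_stx_attributes;
	uint32_t supported_stx_mask;
	uint32_t supported_ioc_flags;
	uint8_t capabilities[CAP_BYTES(65)];
};

/* Fixed-size alternative: always 128 capability bits. */
struct caps_fixed {
	uint64_t supported_stx_attributes;
	uint32_t supported_stx_mask;
	uint32_t supported_ioc_flags;
	uint8_t capabilities[128 / 8];
};

/* Bytes of implicit padding after the capabilities array. */
#define TAIL_PAD(type) \
	(sizeof(type) - offsetof(type, capabilities) - \
	 sizeof(((type *)0)->capabilities))
```

TAIL_PAD(struct caps_44) is 2, TAIL_PAD(struct caps_65) depends on the ABI's 64-bit alignment, and TAIL_PAD(struct caps_fixed) is 0, so only the fixed-size layout is free of implicit padding.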

> +/*
> + * Information struct for fsinfo(fsinfo_attr_timestamp_info).
> + */
> +struct fsinfo_timestamp_info {
> + __s64 minimum_timestamp; /* Minimum timestamp value in seconds */
> + __s64 maximum_timestamp; /* Maximum timestamp value in seconds */
> + __u16 atime_gran_mantissa; /* Granularity(secs) = mant * 10^exp */
> + __u16 btime_gran_mantissa;
> + __u16 ctime_gran_mantissa;
> + __u16 mtime_gran_mantissa;
> + __s8 atime_gran_exponent;
> + __s8 btime_gran_exponent;
> + __s8 ctime_gran_exponent;
> + __s8 mtime_gran_exponent;
> +};

This structure has a slightly inconsistent amount of padding at the end:
on x86-32 it has no padding, everywhere else it has 32 bits of padding
to make it 64-bit aligned. Maybe add a __u32 reserved field?
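
A quick sketch of the difference, using stdint types in place of the __s64/__u16/__s8 of the quoted struct. The names ts_info and ts_info_padded and the reserved field are illustrative only. The members add up to 28 bytes, so the posted layout ends up as 28 or 32 bytes depending on how the ABI aligns 64-bit integers, while an explicit 32-bit reserved field pins it at 32 everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as posted: 28 bytes of members, tail padding is ABI-dependent. */
struct ts_info {
	int64_t minimum_timestamp;
	int64_t maximum_timestamp;
	uint16_t atime_gran_mantissa;
	uint16_t btime_gran_mantissa;
	uint16_t ctime_gran_mantissa;
	uint16_t mtime_gran_mantissa;
	int8_t atime_gran_exponent;
	int8_t btime_gran_exponent;
	int8_t ctime_gran_exponent;
	int8_t mtime_gran_exponent;
};

/* Same layout with the padding made explicit. */
struct ts_info_padded {
	int64_t minimum_timestamp;
	int64_t maximum_timestamp;
	uint16_t atime_gran_mantissa;
	uint16_t btime_gran_mantissa;
	uint16_t ctime_gran_mantissa;
	uint16_t mtime_gran_mantissa;
	int8_t atime_gran_exponent;
	int8_t btime_gran_exponent;
	int8_t ctime_gran_exponent;
	int8_t mtime_gran_exponent;
	uint32_t reserved;	/* hypothetical; would have to be zero */
};

/* The explicit field lands exactly where the implicit padding would be. */
_Static_assert(offsetof(struct ts_info_padded, reserved) == 28,
	       "reserved field should start at byte 28");
_Static_assert(sizeof(struct ts_info_padded) == 32,
	       "padded struct should be 32 bytes on every ABI");
```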

> +
> +#define __NR_fsinfo 326

Hardcoding the syscall number in the example makes it architecture specific.
Could you include <asm/unistd.h> to get the real number?

Arnd

2018-06-04 15:06:00

by Arnd Bergmann

Subject: Re: [PATCH 21/32] VFS: Implement fsmount() to effect a pre-configured mount [ver #8]

On Fri, May 25, 2018 at 2:07 AM, David Howells <[email protected]> wrote:
> Provide a system call by which a filesystem opened with fsopen() and
> configured by a series of writes can be mounted:
>
> int ret = fsmount(int fsfd, int dfd, const char *path,
> unsigned int at_flags, unsigned int flags);
>

> +/*
> + * Create a kernel mount representation for a new, prepared superblock
> + * (specified by fs_fd) and attach to an O_PATH-class file descriptor.
> + */
> +SYSCALL_DEFINE5(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags,
> + void *, spare_4, void *, spare_5)

> +++ b/include/linux/syscalls.h
> @@ -898,6 +898,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
> unsigned mask, struct statx __user *buffer);
> asmlinkage long sys_fsopen(const char *fs_name, unsigned int flags,
> void *reserved3, void *reserved4, void *reserved5);
> +asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
> + unsigned int flags);
>

The prototype in the header doesn't match the one in the implementation,
which should cause a compile-time error, at least if syscalls.h is included
in namespace.c

Do you have a particular use case in mind for the spare_4/spare_5 arguments?
If not, we can probably skip them. If we end up needing them after all, we can
always add a new syscall entry point, or use one of the flag bits to decide
whether the additional arguments are valid or not.

> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -435,3 +435,4 @@ COND_SYSCALL(setuid16);
>
> /* fd-based mount */
> COND_SYSCALL(sys_fsopen);
> +COND_SYSCALL(sys_fsmount);

This should only be needed if the syscall is optional, which it doesn't
seem to be (same for the other ones here).

Arnd

2018-06-04 15:07:44

by David Howells

Subject: Re: [PATCH 32/32] [RFC] fsinfo: Add a system call to allow querying of filesystem information [ver #8]

Arnd Bergmann <[email protected]> wrote:

> ntfs has sb->s_time_gran=100, and vfat should really have
> sb->s_time_gran=2000000000 but that doesn't seem to be set right
> at the moment.

(V)FAT actually has a different granularity on each timestamp.

> I fear the 'enum' in the uapi structure may have a different size depending
> on the architecture. Maybe turn that into a __u32 as well?

Fair enough. Done.

> > +struct fsinfo_capabilities {
> > + __u64 supported_stx_attributes; /* What statx::stx_attributes are supported */
> > + __u32 supported_stx_mask; /* What statx::stx_mask bits are supported */
> > + __u32 supported_ioc_flags; /* What FS_IOC_* flags are supported */
> > + __u8 capabilities[(fsinfo_cap__nr + 7) / 8];
> > +};
>
> This looks a bit odd: with the 44 capabilities, you end up having a six-byte
> array followed by two bytes of implicit padding. If the number of
> capabilities grows beyond 64, you have a nine byte array with more padding
> to the next alignof(__u64). Is that intentional?

I've split the capabilities out into their own thing. I've attached the
revised patch below.

> > +struct fsinfo_timestamp_info {
> > + __s64 minimum_timestamp; /* Minimum timestamp value in seconds */
> > + __s64 maximum_timestamp; /* Maximum timestamp value in seconds */
> > + __u16 atime_gran_mantissa; /* Granularity(secs) = mant * 10^exp */
> > + __u16 btime_gran_mantissa;
> > + __u16 ctime_gran_mantissa;
> > + __u16 mtime_gran_mantissa;
> > + __s8 atime_gran_exponent;
> > + __s8 btime_gran_exponent;
> > + __s8 ctime_gran_exponent;
> > + __s8 mtime_gran_exponent;
> > +};
>
> This structure has a slightly inconsistent amount of padding at the end:
> on x86-32 it has no padding, everywhere else it has 32 bits of padding
> to make it 64-bit aligned. Maybe add a __u32 reserved field?

It occurs to me that the min and max may be different for each timestamp.
Maybe I should have:

struct fsinfo_timestamp_info {
char name[7 + 1];
__s64 minimum_timestamp;
__s64 maximum_timestamp;
__u16 granularity_mantissa;
__s8 granularity_exponent;
__u8 __reserved[5];
};

and then you iterate through them by setting Nth. I could just put a:

__u64 granularity;

field, expressed in ns, rather than mantissa and exponent, but doing it this
way allows me to express granularities less than 1ns very simply (something
Dave Chinner was talking about).

It might also be worth putting minimum_timestamp and maximum_timestamp in
terms of granularity rather than ns.

> > +#define __NR_fsinfo 326
>
> Hardcoding the syscall number in the example makes it architecture specific.
> Could you include <asm/unistd.h> to get the real number?

Yeah, I've fixed that already.

David
---
commit 61ad926e92c6985acd7093e7349c8480e3b1f827
Author: David Howells <[email protected]>
Date: Thu May 31 22:53:51 2018 +0100

fsinfo: Add a system call to allow querying of filesystem information

Add a system call to allow filesystem information to be queried. This is
implemented as a function switch where the desired attribute value or
values is nominated.

===============
NEW SYSTEM CALL
===============

The new system call looks like:

int ret = fsinfo(int dfd,
const char *filename,
const struct fsinfo_params *params,
void *buffer,
size_t buf_size);

The params parameter optionally points to a block of parameters:

struct fsinfo_params {
enum fsinfo_attribute request;
__u32 Nth;
__u32 Mth;
__u32 at_flags;
__u32 __reserved[6];
};

If params is NULL, it is assumed params->request should be
fsinfo_attr_statfs, params->Nth should be 0, params->Mth should be 0 and
params->at_flags should be 0.

If params is given, all of params->__reserved[] must be 0.

dfd, filename and params->at_flags indicate the file to query. There is no
equivalent of lstat() as that can be emulated with fsinfo() by setting
AT_SYMLINK_NOFOLLOW in params->at_flags. There is also no equivalent of
fstat() as that can be emulated by passing a NULL filename to fsinfo() with
the fd of interest in dfd. AT_NO_AUTOMOUNT can also be used to allow an
automount point to be queried without triggering it.

AT_FORCE_ATTR_SYNC can be set in params->at_flags. This will require a
network filesystem to synchronise its attributes with the server.

AT_NO_ATTR_SYNC can be set in params->at_flags. This will suppress
synchronisation with the server in a network filesystem. The resulting
values should be considered approximate.

params->request indicates the attribute/attributes to be queried. This can
be one of:

fsinfo_attr_statfs - statfs-style info
fsinfo_attr_fsinfo - Information about fsinfo()
fsinfo_attr_ids - Filesystem IDs
fsinfo_attr_limits - Filesystem limits
fsinfo_attr_supports - What's supported in statx(), IOC flags
fsinfo_attr_capabilities - Filesystem capabilities
fsinfo_attr_timestamp_info - Inode timestamp info
fsinfo_attr_volume_id - Volume ID (string)
fsinfo_attr_volume_uuid - Volume UUID
fsinfo_attr_volume_name - Volume name (string)
fsinfo_attr_cell_name - Cell name (string)
fsinfo_attr_domain_name - Domain name (string)
fsinfo_attr_realm_name - Realm name (string)
fsinfo_attr_server_name - Name of the Nth server (string)
fsinfo_attr_server_address - Mth address of the Nth server
fsinfo_attr_parameter - Nth mount parameter (string)
fsinfo_attr_source - Nth mount source name (string)
fsinfo_attr_name_encoding - Filename encoding (string)
fsinfo_attr_name_codepage - Filename codepage (string)
fsinfo_attr_io_size - Optimal I/O sizes

Some attributes (such as the servers backing a network filesystem) can have
multiple values. These can be enumerated by setting params->Nth and
params->Mth to 0, 1, ... until ENODATA is returned.
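
The enumeration pattern described above might look roughly like this from userspace. Since the syscall is not wired up anywhere outside this series, the sketch uses a stand-in, fake_fsinfo_server_name(), that only imitates the described behaviour for fsinfo_attr_server_name (return the length, or fail with ENODATA once Nth runs past the last value). Both function names and the server list are made up.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

static const char *const fake_servers[] = { "srv-a", "srv-b", "srv-c" };

/* Stand-in for fsinfo() asking for the Nth server name. */
static long fake_fsinfo_server_name(unsigned int nth, char *buf, size_t size)
{
	size_t len;

	if (nth >= sizeof(fake_servers) / sizeof(fake_servers[0])) {
		errno = ENODATA;	/* no such instance */
		return -1;
	}
	len = strlen(fake_servers[nth]);
	if (buf && size)
		memcpy(buf, fake_servers[nth], len < size ? len : size);
	return (long)len;
}

/* Walk Nth = 0, 1, ... until ENODATA; returns the count or -1 on error. */
static int count_servers(void)
{
	char name[64];
	unsigned int nth;

	for (nth = 0; ; nth++) {
		long n = fake_fsinfo_server_name(nth, name, sizeof(name));

		if (n < 0)
			return errno == ENODATA ? (int)nth : -1;
	}
}
```

For a two-level attribute such as fsinfo_attr_server_address, the same loop would presumably be nested: an outer loop over Nth and an inner one over Mth, each ending on ENODATA.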

buffer and buf_size point to the reply buffer. The buffer is filled up to the
specified size, even if this means truncating the reply. The full size of the
reply is returned. In future versions, this will allow extra fields to be
tacked on to the end of the reply, but anyone not expecting them will only get
the subset they're expecting. If either buffer or buf_size is 0, no copy
will take place and the data size will be returned.
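
A sketch of how a caller might use that contract to fetch a variable-length reply: probe for the size with no buffer, then allocate and fetch. fake_fsinfo_string() is a stand-in that only imitates the behaviour described above (copy at most buf_size bytes, return the full size), and fetch_string() is a made-up helper. Adding the NUL terminator in the caller is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static const char fake_volume_name[] = "a-rather-long-volume-name";

/* Stand-in for fsinfo() on a string attribute. */
static long fake_fsinfo_string(char *buf, size_t buf_size)
{
	size_t full = sizeof(fake_volume_name) - 1;

	/* Fill up to buf_size, truncating if need be. */
	if (buf && buf_size)
		memcpy(buf, fake_volume_name, full < buf_size ? full : buf_size);

	/* Always report the full size of the reply. */
	return (long)full;
}

/* Probe the size, then fetch; caller frees the result. */
static char *fetch_string(void)
{
	long full = fake_fsinfo_string(NULL, 0);
	char *buf;

	if (full < 0)
		return NULL;
	buf = malloc((size_t)full + 1);
	if (!buf)
		return NULL;
	if (fake_fsinfo_string(buf, (size_t)full) != full) {
		free(buf);	/* reply changed size between the calls */
		return NULL;
	}
	buf[full] = '\0';
	return buf;
}
```

A real caller would probably loop when the second call reports a different size rather than giving up, since the reply may change between the two calls.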

At the moment, this will only work on x86_64 and i386 as it requires the system
call to be wired up.

Signed-off-by: David Howells <[email protected]>

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 136c8ce75e3a..a78893af3941 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -401,4 +401,5 @@
387 i386 fsmount sys_fsmount __ia32_sys_fsmount
388 i386 fspick sys_fspick __ia32_sys_fspick
389 i386 move_mount sys_move_mount __ia32_sys_move_mount
+390 i386 fsinfo sys_fsinfo __ia32_sys_fsinfo
391 i386 open_tree sys_open_tree __ia32_sys_open_tree
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c3769ca6d6c5..103e443dbaca 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -346,6 +346,7 @@
335 common fsmount __x64_sys_fsmount
336 common fspick __x64_sys_fspick
337 common move_mount __x64_sys_move_mount
+338 common fsinfo __x64_sys_fsinfo
339 common open_tree __x64_sys_open_tree

#
diff --git a/fs/statfs.c b/fs/statfs.c
index 5b2a24f0f263..f996ab6af44f 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -9,6 +9,7 @@
#include <linux/security.h>
#include <linux/uaccess.h>
#include <linux/compat.h>
+#include <linux/fsinfo.h>
#include "internal.h"

static int flags_by_mnt(int mnt_flags)
@@ -384,3 +385,469 @@ COMPAT_SYSCALL_DEFINE2(ustat, unsigned, dev, struct compat_ustat __user *, u)
return 0;
}
#endif
+
+/*
+ * Get basic filesystem stats from statfs.
+ */
+static int fsinfo_generic_statfs(struct dentry *dentry,
+ struct fsinfo_statfs *p)
+{
+ struct super_block *sb;
+ struct kstatfs buf;
+ int ret;
+
+ ret = statfs_by_dentry(dentry, &buf);
+ if (ret < 0)
+ return ret;
+
+ sb = dentry->d_sb;
+ p->f_blocks = buf.f_blocks;
+ p->f_bfree = buf.f_bfree;
+ p->f_bavail = buf.f_bavail;
+ p->f_files = buf.f_files;
+ p->f_ffree = buf.f_ffree;
+ p->f_favail = buf.f_ffree;
+ p->f_bsize = buf.f_bsize;
+ p->f_frsize = buf.f_frsize;
+ return sizeof(*p);
+}
+
+static int fsinfo_generic_ids(struct dentry *dentry,
+ struct fsinfo_ids *p)
+{
+ struct super_block *sb;
+ struct kstatfs buf;
+ int ret;
+
+ ret = statfs_by_dentry(dentry, &buf);
+ if (ret < 0)
+ return ret;
+
+ sb = dentry->d_sb;
+ p->f_fstype = sb->s_magic;
+ p->f_dev_major = MAJOR(sb->s_dev);
+ p->f_dev_minor = MINOR(sb->s_dev);
+ p->f_flags = ST_VALID | flags_by_sb(sb->s_flags);
+
+ memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
+ strcpy(p->f_fs_name, dentry->d_sb->s_type->name);
+ return sizeof(*p);
+}
+
+static int fsinfo_generic_limits(struct dentry *dentry,
+ struct fsinfo_limits *lim)
+{
+ struct super_block *sb = dentry->d_sb;
+
+ lim->max_file_size = sb->s_maxbytes;
+ lim->max_hard_links = sb->s_max_links;
+ lim->max_uid = UINT_MAX;
+ lim->max_gid = UINT_MAX;
+ lim->max_projid = UINT_MAX;
+ lim->max_filename_len = NAME_MAX;
+ lim->max_symlink_len = PAGE_SIZE;
+ lim->max_xattr_name_len = XATTR_NAME_MAX;
+ lim->max_xattr_body_len = XATTR_SIZE_MAX;
+ lim->max_dev_major = 0xffffff;
+ lim->max_dev_minor = 0xff;
+ return sizeof(*lim);
+}
+
+static int fsinfo_generic_supports(struct dentry *dentry,
+ struct fsinfo_supports *c)
+{
+ struct super_block *sb = dentry->d_sb;
+
+ c->supported_stx_mask = STATX_BASIC_STATS;
+ if (sb->s_d_op && sb->s_d_op->d_automount)
+ c->supported_stx_attributes |= STATX_ATTR_AUTOMOUNT;
+ return sizeof(*c);
+}
+
+static int fsinfo_generic_capabilities(struct dentry *dentry,
+ struct fsinfo_capabilities *c)
+{
+ struct super_block *sb = dentry->d_sb;
+
+ if (sb->s_mtd)
+ fsinfo_set_cap(c, fsinfo_cap_is_flash_fs);
+ else if (sb->s_bdev)
+ fsinfo_set_cap(c, fsinfo_cap_is_block_fs);
+
+ if (sb->s_quota_types & QTYPE_MASK_USR)
+ fsinfo_set_cap(c, fsinfo_cap_user_quotas);
+ if (sb->s_quota_types & QTYPE_MASK_GRP)
+ fsinfo_set_cap(c, fsinfo_cap_group_quotas);
+ if (sb->s_quota_types & QTYPE_MASK_PRJ)
+ fsinfo_set_cap(c, fsinfo_cap_project_quotas);
+ if (sb->s_d_op && sb->s_d_op->d_automount)
+ fsinfo_set_cap(c, fsinfo_cap_automounts);
+ if (sb->s_id[0])
+ fsinfo_set_cap(c, fsinfo_cap_volume_id);
+
+ fsinfo_set_cap(c, fsinfo_cap_has_atime);
+ fsinfo_set_cap(c, fsinfo_cap_has_ctime);
+ fsinfo_set_cap(c, fsinfo_cap_has_mtime);
+ return sizeof(*c);
+}
+
+static int fsinfo_generic_timestamp_info(struct dentry *dentry,
+ struct fsinfo_timestamp_info *ts)
+{
+ struct super_block *sb = dentry->d_sb;
+
+ /* If unset, assume 1s granularity */
+ u16 mantissa = 1;
+ s8 exponent = 0;
+
+ ts->minimum_timestamp = S64_MIN;
+ ts->maximum_timestamp = S64_MAX;
+ if (sb->s_time_gran < 1000000000) {
+ if (sb->s_time_gran < 1000)
+ exponent = -9;
+ else if (sb->s_time_gran < 1000000)
+ exponent = -6;
+ else
+ exponent = -3;
+ }
+#define set_gran(x) \
+ do { \
+ ts->x##_mantissa = mantissa; \
+ ts->x##_exponent = exponent; \
+ } while (0)
+ set_gran(atime_gran);
+ set_gran(btime_gran);
+ set_gran(ctime_gran);
+ set_gran(mtime_gran);
+ return sizeof(*ts);
+}
+
+static int fsinfo_generic_volume_uuid(struct dentry *dentry,
+ struct fsinfo_volume_uuid *vu)
+{
+ struct super_block *sb = dentry->d_sb;
+
+ memcpy(vu, &sb->s_uuid, sizeof(*vu));
+ return sizeof(*vu);
+}
+
+static int fsinfo_generic_volume_id(struct dentry *dentry, char *buf)
+{
+ struct super_block *sb = dentry->d_sb;
+ size_t len = strlen(sb->s_id);
+
+ if (buf)
+ memcpy(buf, sb->s_id, len + 1);
+ return len;
+}
+
+static int fsinfo_generic_name_encoding(struct dentry *dentry, char *buf)
+{
+ static const char encoding[] = "utf8";
+
+ if (buf)
+ memcpy(buf, encoding, sizeof(encoding) - 1);
+ return sizeof(encoding) - 1;
+}
+
+static int fsinfo_generic_io_size(struct dentry *dentry,
+ struct fsinfo_io_size *c)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct kstatfs buf;
+ int ret;
+
+ if (sb->s_op->statfs == simple_statfs) {
+ c->block_size = PAGE_SIZE;
+ c->max_single_read_size = 0;
+ c->max_single_write_size = 0;
+ c->best_read_size = PAGE_SIZE;
+ c->best_write_size = PAGE_SIZE;
+ } else {
+ ret = statfs_by_dentry(dentry, &buf);
+ if (ret < 0)
+ return ret;
+ c->block_size = buf.f_bsize;
+ c->max_single_read_size = buf.f_bsize;
+ c->max_single_write_size = buf.f_bsize;
+ c->best_read_size = PAGE_SIZE;
+ c->best_write_size = PAGE_SIZE;
+ }
+ return sizeof(*c);
+}
+
+/*
+ * Implement some queries generically from stuff in the superblock.
+ */
+int generic_fsinfo(struct dentry *dentry, struct fsinfo_kparams *params)
+{
+#define _gen(X) fsinfo_attr_##X: return fsinfo_generic_##X(dentry, params->buffer)
+
+ switch (params->request) {
+ case _gen(statfs);
+ case _gen(ids);
+ case _gen(limits);
+ case _gen(supports);
+ case _gen(capabilities);
+ case _gen(timestamp_info);
+ case _gen(volume_uuid);
+ case _gen(volume_id);
+ case _gen(name_encoding);
+ case _gen(io_size);
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+EXPORT_SYMBOL(generic_fsinfo);
+
+/*
+ * Retrieve the filesystem info. We make some stuff up if the operation is not
+ * supported.
+ */
+int vfs_fsinfo(const struct path *path, struct fsinfo_kparams *params)
+{
+ struct dentry *dentry = path->dentry;
+ int (*get_fsinfo)(struct dentry *, struct fsinfo_kparams *);
+ int ret;
+
+ if (params->request == fsinfo_attr_fsinfo) {
+ struct fsinfo_fsinfo *info = params->buffer;
+
+ info->max_attr = fsinfo_attr__nr;
+ info->max_cap = fsinfo_cap__nr;
+ return sizeof(*info);
+ }
+
+ get_fsinfo = dentry->d_sb->s_op->get_fsinfo;
+ if (!get_fsinfo) {
+ if (!dentry->d_sb->s_op->statfs)
+ return -EOPNOTSUPP;
+ get_fsinfo = generic_fsinfo;
+ }
+
+ ret = security_sb_statfs(dentry);
+ if (ret)
+ return ret;
+
+ ret = get_fsinfo(dentry, params);
+ if (ret < 0)
+ return ret;
+
+ if (params->request == fsinfo_attr_ids &&
+ params->buffer) {
+ struct fsinfo_ids *p = params->buffer;
+
+ p->f_flags |= flags_by_mnt(path->mnt->mnt_flags);
+ }
+ return ret;
+}
+
+static int vfs_fsinfo_path(int dfd, const char __user *filename,
+ struct fsinfo_kparams *params)
+{
+ struct path path;
+ unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+ int ret = -EINVAL;
+
+ if ((params->at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
+ AT_EMPTY_PATH)) != 0)
+ return -EINVAL;
+
+ if (params->at_flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (params->at_flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (params->at_flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+
+retry:
+ ret = user_path_at(dfd, filename, lookup_flags, &path);
+ if (ret)
+ goto out;
+
+ ret = vfs_fsinfo(&path, params);
+ path_put(&path);
+ if (retry_estale(ret, lookup_flags)) {
+ lookup_flags |= LOOKUP_REVAL;
+ goto retry;
+ }
+out:
+ return ret;
+}
+
+static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_kparams *params)
+{
+ struct fd f = fdget_raw(fd);
+ int ret = -EBADF;
+
+ if (f.file) {
+ ret = vfs_fsinfo(&f.file->f_path, params);
+ fdput(f);
+ }
+ return ret;
+}
+
+/*
+ * Return buffer information by requestable attribute.
+ *
+ * STRUCT indicates a fixed-size structure with only one instance.
+ * STRUCT_N indicates a fixed-size structure that may have multiple instances.
+ * STRING indicates a string with only one instance.
+ * STRING_N indicates a string that may have multiple instances.
+ * STRUCT_NM indicates a fixed-size structure that may have multiple instances,
+ * each of which may have multiple subinstances.
+ *
+ * If an entry is marked STRUCT, STRUCT_N or STRUCT_NM then if no buffer is
+ * supplied to sys_fsinfo(), sys_fsinfo() will handle returning the buffer size
+ * without calling vfs_fsinfo() and the filesystem.
+ *
+ * No struct may have more than 252 bytes (ie. 0x3f * 4)
+ */
+#define FSINFO_STRING(N) [fsinfo_attr_##N] = 0x00
+#define FSINFO_STRUCT(N) [fsinfo_attr_##N] = sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRING_N(N) [fsinfo_attr_##N] = 0x40
+#define FSINFO_STRUCT_N(N) [fsinfo_attr_##N] = 0x40 | sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRUCT_NM(N) [fsinfo_attr_##N] = 0x80 | sizeof(struct fsinfo_##N)/sizeof(__u32)
+static const u8 fsinfo_buffer_sizes[fsinfo_attr__nr] = {
+ FSINFO_STRUCT (statfs),
+ FSINFO_STRUCT (fsinfo),
+ FSINFO_STRUCT (ids),
+ FSINFO_STRUCT (limits),
+ FSINFO_STRUCT (capabilities),
+ FSINFO_STRUCT (supports),
+ FSINFO_STRUCT (timestamp_info),
+ FSINFO_STRING (volume_id),
+ FSINFO_STRUCT (volume_uuid),
+ FSINFO_STRING (volume_name),
+ FSINFO_STRING (cell_name),
+ FSINFO_STRING (domain_name),
+ FSINFO_STRING (realm_name),
+ FSINFO_STRING_N (server_name),
+ FSINFO_STRUCT_NM (server_address),
+ FSINFO_STRING_N (parameter),
+ FSINFO_STRING_N (source),
+ FSINFO_STRING (name_encoding),
+ FSINFO_STRING (name_codepage),
+ FSINFO_STRUCT (io_size),
+};
+
+/**
+ * sys_fsinfo - System call to get filesystem information
+ * @dfd: Base directory to pathwalk from or fd referring to filesystem.
+ * @filename: Filesystem to query or NULL.
+ * @_params: Parameters to define request (or NULL for enhanced statfs).
+ * @_buffer: Result buffer.
+ * @buf_size: Size of result buffer.
+ *
+ * Get information on a filesystem. The filesystem attribute to be queried is
+ * indicated by @_params->request, and some of the attributes can have multiple
+ * values, indexed by @_params->Nth and @_params->Mth. If @_params is NULL,
+ * then the 0th fsinfo_attr_statfs attribute is queried. If an attribute does
+ * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
+ * ENODATA is returned.
+ *
+ * On success, the size of the attribute's value is returned. If @buf_size is
+ * 0 or @_buffer is NULL, only the size is returned. If the size of the value
+ * is larger than @buf_size, it will be truncated by the copy. The full size
+ * of the value will be returned.
+ */
+SYSCALL_DEFINE5(fsinfo,
+ int, dfd, const char __user *, filename,
+ struct fsinfo_params __user *, _params,
+ void __user *, _buffer, size_t, buf_size)
+{
+ struct fsinfo_params user_params;
+ struct fsinfo_kparams params;
+ size_t size;
+ int ret;
+
+ if (!access_ok(VERIFY_WRITE, _buffer, buf_size))
+ return -EFAULT;
+
+ if (_params) {
+ if (copy_from_user(&user_params, _params, sizeof(user_params)))
+ return -EFAULT;
+ if (user_params.__reserved[0] ||
+ user_params.__reserved[1] ||
+ user_params.__reserved[2] ||
+ user_params.__reserved[3] ||
+ user_params.__reserved[4] ||
+ user_params.__reserved[5])
+ return -EINVAL;
+ if (user_params.request >= fsinfo_attr__nr)
+ return -EOPNOTSUPP;
+ params.request = user_params.request;
+ params.Nth = user_params.Nth;
+ params.Mth = user_params.Mth;
+ params.at_flags = user_params.at_flags;
+ } else {
+ params.request = fsinfo_attr_statfs;
+ params.Nth = 0;
+ params.Mth = 0;
+ params.at_flags = 0; /* Default lookup follows symlinks */
+ }
+
+ if (!_buffer || !buf_size) {
+ buf_size = 0;
+ _buffer = NULL;
+ }
+
+ /* Allocate an appropriately-sized buffer. We will truncate the
+ * contents when we write the contents back to userspace.
+ */
+ size = fsinfo_buffer_sizes[params.request];
+ switch (size & 0xc0) {
+ case 0x00:
+ if (params.Nth != 0)
+ return -ENODATA;
+ /* Fall through */
+ case 0x40:
+ if (params.Mth != 0)
+ return -ENODATA;
+ /* Fall through */
+ case 0x80:
+ break;
+ case 0xc0:
+ return -ENOBUFS;
+ }
+
+ size &= ~0xc0;
+ if (size == 0x00) {
+ size = 4096; /* String */
+ } else {
+ size *= sizeof(__u32);
+ if (buf_size == 0)
+ return size; /* We know how big the buffer should be */
+ }
+
+ if (buf_size > 0) {
+ params.buf_size = size;
+ params.buffer = kzalloc(size, GFP_KERNEL);
+ if (!params.buffer)
+ return -ENOMEM;
+ } else {
+ params.buf_size = 0;
+ params.buffer = NULL;
+ }
+
+ if (filename)
+ ret = vfs_fsinfo_path(dfd, filename, &params);
+ else
+ ret = vfs_fsinfo_fd(dfd, &params);
+ if (ret < 0)
+ goto error;
+
+ if (ret == 0) {
+ ret = -ENODATA;
+ goto error;
+ }
+
+ if (buf_size > ret)
+ buf_size = ret;
+
+ if (copy_to_user(_buffer, params.buffer, buf_size))
+ ret = -EFAULT;
+error:
+ kfree(params.buffer);
+ return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6d9c84de1ddf..79f98ed39a18 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -62,6 +62,8 @@ struct iov_iter;
struct fscrypt_info;
struct fscrypt_operations;
struct fs_context;
+struct fsinfo_kparams;
+enum fsinfo_attribute;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -1840,6 +1842,7 @@ struct super_operations {
int (*thaw_super) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
+ int (*get_fsinfo) (struct dentry *, struct fsinfo_kparams *);
int (*remount_fs) (struct super_block *, int *, char *, size_t);
int (*reconfigure) (struct super_block *, struct fs_context *);
void (*umount_begin) (struct super_block *);
@@ -2216,6 +2219,7 @@ extern int iterate_mounts(int (*)(struct vfsmount *, void *), void *,
extern int vfs_statfs(const struct path *, struct kstatfs *);
extern int user_statfs(const char __user *, struct kstatfs *);
extern int fd_statfs(int, struct kstatfs *);
+extern int vfs_fsinfo(const struct path *, struct fsinfo_kparams *);
extern int freeze_super(struct super_block *super);
extern int thaw_super(struct super_block *super);
extern bool our_mnt(struct vfsmount *mnt);
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
new file mode 100644
index 000000000000..2faa7043c5e7
--- /dev/null
+++ b/include/linux/fsinfo.h
@@ -0,0 +1,40 @@
+/* Filesystem information query
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FSINFO_H
+#define _LINUX_FSINFO_H
+
+#include <uapi/linux/fsinfo.h>
+
+struct fsinfo_kparams {
+ enum fsinfo_attribute request; /* What is being asked for */
+ __u32 Nth; /* Instance of it (some may have multiple) */
+ __u32 Mth; /* Subinstance */
+ __u32 at_flags; /* AT_SYMLINK_NOFOLLOW and similar */
+ void *buffer; /* Where to place the reply */
+ size_t buf_size; /* Size of the buffer */
+};
+
+extern int generic_fsinfo(struct dentry *, struct fsinfo_kparams *);
+
+static inline void fsinfo_set_cap(struct fsinfo_capabilities *c,
+ enum fsinfo_capability cap)
+{
+ c->capabilities[cap / 8] |= 1 << (cap % 8);
+}
+
+static inline void fsinfo_clear_cap(struct fsinfo_capabilities *c,
+ enum fsinfo_capability cap)
+{
+ c->capabilities[cap / 8] &= ~(1 << (cap % 8));
+}
+
+#endif /* _LINUX_FSINFO_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4972ee696142..0d3105865208 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -49,6 +49,7 @@ struct stat64;
struct statfs;
struct statfs64;
struct statx;
+struct fsinfo_params;
struct __sysctl_args;
struct sysinfo;
struct timespec;
@@ -905,6 +906,9 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int at_fla
asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
int to_dfd, const char __user *to_path,
unsigned int ms_flags);
+asmlinkage long sys_fsinfo(int dfd, const char __user *path,
+ struct fsinfo_params __user *params,
+ void __user *buffer, size_t buf_size);

/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
new file mode 100644
index 000000000000..a6758e71f0c7
--- /dev/null
+++ b/include/uapi/linux/fsinfo.h
@@ -0,0 +1,235 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* fsinfo() definitions.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+#ifndef _UAPI_LINUX_FSINFO_H
+#define _UAPI_LINUX_FSINFO_H
+
+#include <linux/types.h>
+#include <linux/socket.h>
+
+/*
+ * The filesystem attributes that can be requested. Note that some attributes
+ * may have multiple instances which can be switched in the parameter block.
+ */
+enum fsinfo_attribute {
+ fsinfo_attr_statfs = 0, /* statfs()-style state */
+ fsinfo_attr_fsinfo = 1, /* Information about fsinfo() */
+ fsinfo_attr_ids = 2, /* Filesystem IDs */
+ fsinfo_attr_limits = 3, /* Filesystem limits */
+ fsinfo_attr_supports = 4, /* What's supported in statx, iocflags, ... */
+ fsinfo_attr_capabilities = 5, /* Filesystem capabilities (bits) */
+ fsinfo_attr_timestamp_info = 6, /* Inode timestamp info */
+ fsinfo_attr_volume_id = 7, /* Volume ID (string) */
+ fsinfo_attr_volume_uuid = 8, /* Volume UUID (LE uuid) */
+ fsinfo_attr_volume_name = 9, /* Volume name (string) */
+ fsinfo_attr_cell_name = 10, /* Cell name (string) */
+ fsinfo_attr_domain_name = 11, /* Domain name (string) */
+ fsinfo_attr_realm_name = 12, /* Realm name (string) */
+ fsinfo_attr_server_name = 13, /* Name of the Nth server */
+ fsinfo_attr_server_address = 14, /* Mth address of the Nth server */
+ fsinfo_attr_parameter = 15, /* Nth mount parameter (string) */
+ fsinfo_attr_source = 16, /* Nth mount source name (string) */
+ fsinfo_attr_name_encoding = 17, /* Filename encoding (string) */
+ fsinfo_attr_name_codepage = 18, /* Filename codepage (string) */
+ fsinfo_attr_io_size = 19, /* Optimal I/O sizes */
+ fsinfo_attr__nr
+};
+
+/*
+ * Optional fsinfo() parameter structure.
+ *
+ * If this is not given, it is assumed that fsinfo_attr_statfs instance 0,0 is
+ * desired.
+ */
+struct fsinfo_params {
+ __u32 request; /* What is being asked for (enum fsinfo_attribute) */
+ __u32 Nth; /* Instance of it (some may have multiple) */
+ __u32 Mth; /* Subinstance of Nth instance */
+ __u32 at_flags; /* AT_SYMLINK_NOFOLLOW and similar flags */
+ __u32 __reserved[6]; /* Reserved params; all must be 0 */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_statfs).
+ * - This gives extended filesystem information.
+ */
+struct fsinfo_statfs {
+ __u64 f_blocks; /* Total number of blocks in fs */
+ __u64 f_bfree; /* Total number of free blocks */
+ __u64 f_bavail; /* Number of free blocks available to ordinary user */
+ __u64 f_files; /* Total number of file nodes in fs */
+ __u64 f_ffree; /* Number of free file nodes */
+ __u64 f_favail; /* Number of free file nodes available to ordinary user */
+ __u32 f_bsize; /* Optimal block size */
+ __u32 f_frsize; /* Fragment size */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_ids).
+ *
+ * List of basic identifiers as is normally found in statfs().
+ */
+struct fsinfo_ids {
+ char f_fs_name[15 + 1];
+ __u64 f_flags; /* Filesystem mount flags (MS_*) */
+ __u64 f_fsid; /* Short 64-bit Filesystem ID (as statfs) */
+ __u64 f_sb_id; /* Internal superblock ID for sbnotify()/mntnotify() */
+ __u32 f_fstype; /* Filesystem type from linux/magic.h [uncond] */
+ __u32 f_dev_major; /* As st_dev_* from struct statx [uncond] */
+ __u32 f_dev_minor;
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_limits).
+ *
+ * List of supported filesystem limits.
+ */
+struct fsinfo_limits {
+ __u64 max_file_size; /* Maximum file size */
+ __u64 max_uid; /* Maximum UID supported */
+ __u64 max_gid; /* Maximum GID supported */
+ __u64 max_projid; /* Maximum project ID supported */
+ __u32 max_dev_major; /* Maximum device major representable */
+ __u32 max_dev_minor; /* Maximum device minor representable */
+ __u32 max_hard_links; /* Maximum number of hard links on a file */
+ __u32 max_xattr_body_len; /* Maximum xattr content length */
+ __u16 max_xattr_name_len; /* Maximum xattr name length */
+ __u16 max_filename_len; /* Maximum filename length */
+ __u16 max_symlink_len; /* Maximum symlink content length */
+ __u16 __spare;
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_supports).
+ *
+ * What's supported in various masks, such as statx() attribute and mask bits
+ * and IOC flags.
+ */
+struct fsinfo_supports {
+ __u64 supported_stx_attributes; /* What statx::stx_attributes are supported */
+ __u32 supported_stx_mask; /* What statx::stx_mask bits are supported */
+ __u32 supported_ioc_flags; /* What FS_IOC_* flags are supported */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_capabilities).
+ *
+ * Bitmask indicating filesystem capabilities where renderable as single bits.
+ */
+enum fsinfo_capability {
+ fsinfo_cap_is_kernel_fs = 0, /* fs is kernel-special filesystem */
+ fsinfo_cap_is_block_fs = 1, /* fs is block-based filesystem */
+ fsinfo_cap_is_flash_fs = 2, /* fs is flash filesystem */
+ fsinfo_cap_is_network_fs = 3, /* fs is network filesystem */
+ fsinfo_cap_is_automounter_fs = 4, /* fs is automounter special filesystem */
+ fsinfo_cap_automounts = 5, /* fs supports automounts */
+ fsinfo_cap_adv_locks = 6, /* fs supports advisory file locking */
+ fsinfo_cap_mand_locks = 7, /* fs supports mandatory file locking */
+ fsinfo_cap_leases = 8, /* fs supports file leases */
+ fsinfo_cap_uids = 9, /* fs supports numeric uids */
+ fsinfo_cap_gids = 10, /* fs supports numeric gids */
+ fsinfo_cap_projids = 11, /* fs supports numeric project ids */
+ fsinfo_cap_id_names = 12, /* fs supports user names */
+ fsinfo_cap_id_guids = 13, /* fs supports user guids */
+ fsinfo_cap_windows_attrs = 14, /* fs has windows attributes */
+ fsinfo_cap_user_quotas = 15, /* fs has per-user quotas */
+ fsinfo_cap_group_quotas = 16, /* fs has per-group quotas */
+ fsinfo_cap_project_quotas = 17, /* fs has per-project quotas */
+ fsinfo_cap_xattrs = 18, /* fs has xattrs */
+ fsinfo_cap_journal = 19, /* fs has a journal */
+ fsinfo_cap_data_is_journalled = 20, /* fs is using data journalling */
+ fsinfo_cap_o_sync = 21, /* fs supports O_SYNC */
+ fsinfo_cap_o_direct = 22, /* fs supports O_DIRECT */
+ fsinfo_cap_volume_id = 23, /* fs has a volume ID */
+ fsinfo_cap_volume_uuid = 24, /* fs has a volume UUID */
+ fsinfo_cap_volume_name = 25, /* fs has a volume name */
+ fsinfo_cap_volume_fsid = 26, /* fs has a volume FSID */
+ fsinfo_cap_cell_name = 27, /* fs has a cell name */
+ fsinfo_cap_domain_name = 28, /* fs has a domain name */
+ fsinfo_cap_realm_name = 29, /* fs has a realm name */
+ fsinfo_cap_iver_all_change = 30, /* i_version represents data + meta changes */
+ fsinfo_cap_iver_data_change = 31, /* i_version represents data changes only */
+ fsinfo_cap_iver_mono_incr = 32, /* i_version incremented monotonically */
+ fsinfo_cap_symlinks = 33, /* fs supports symlinks */
+ fsinfo_cap_hard_links = 34, /* fs supports hard links */
+ fsinfo_cap_hard_links_1dir = 35, /* fs supports hard links in same dir only */
+ fsinfo_cap_device_files = 36, /* fs supports bdev, cdev */
+ fsinfo_cap_unix_specials = 37, /* fs supports pipe, fifo, socket */
+ fsinfo_cap_resource_forks = 38, /* fs supports resource forks/streams */
+ fsinfo_cap_name_case_indep = 39, /* Filename case independence is mandatory */
+ fsinfo_cap_name_non_utf8 = 40, /* fs has non-utf8 names */
+ fsinfo_cap_name_has_codepage = 41, /* fs has a filename codepage */
+ fsinfo_cap_sparse = 42, /* fs supports sparse files */
+ fsinfo_cap_not_persistent = 43, /* fs is not persistent */
+ fsinfo_cap_no_unix_mode = 44, /* fs does not support unix mode bits */
+ fsinfo_cap_has_atime = 45, /* fs supports access time */
+ fsinfo_cap_has_btime = 46, /* fs supports birth/creation time */
+ fsinfo_cap_has_ctime = 47, /* fs supports change time */
+ fsinfo_cap_has_mtime = 48, /* fs supports modification time */
+ fsinfo_cap__nr
+};
+
+struct fsinfo_capabilities {
+ __u8 capabilities[(fsinfo_cap__nr + 7) / 8];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_timestamp_info).
+ */
+struct fsinfo_timestamp_info {
+ __s64 minimum_timestamp; /* Minimum timestamp value in seconds */
+ __s64 maximum_timestamp; /* Maximum timestamp value in seconds */
+ __u16 atime_gran_mantissa; /* Granularity(secs) = mant * 10^exp */
+ __u16 btime_gran_mantissa;
+ __u16 ctime_gran_mantissa;
+ __u16 mtime_gran_mantissa;
+ __s8 atime_gran_exponent;
+ __s8 btime_gran_exponent;
+ __s8 ctime_gran_exponent;
+ __s8 mtime_gran_exponent;
+ __u32 __reserved[1];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_volume_uuid).
+ */
+struct fsinfo_volume_uuid {
+ __u8 uuid[16];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_server_address).
+ *
+ * Find the Mth address of the Nth server for a network mount.
+ */
+struct fsinfo_server_address {
+ struct __kernel_sockaddr_storage address;
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_io_size).
+ *
+ * Retrieve the optimal I/O size for a filesystem.
+ */
+struct fsinfo_io_size {
+ __u32 block_size; /* Minimum block granularity for O_DIRECT */
+ __u32 max_single_read_size; /* Maximum size of a single unbuffered read */
+ __u32 max_single_write_size; /* Maximum size of a single unbuffered write */
+ __u32 best_read_size; /* Optimal read size */
+ __u32 best_write_size; /* Optimal write size */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_fsinfo).
+ *
+ * This gives information about fsinfo() itself.
+ */
+struct fsinfo_fsinfo {
+ __u32 max_attr; /* Number of supported attributes (fsinfo_attr__nr) */
+ __u32 max_cap; /* Number of supported capabilities (fsinfo_cap__nr) */
+};
+
+#endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/statx/Makefile b/samples/statx/Makefile
index 59df7c25a9d1..9cb9a88e3a10 100644
--- a/samples/statx/Makefile
+++ b/samples/statx/Makefile
@@ -1,7 +1,10 @@
# List of programs to build
-hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx
+hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx test-fsinfo

# Tell kbuild to always build the programs
always := $(hostprogs-y)

HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
+
+HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
+HOSTLOADLIBES_test-fsinfo += -lm
diff --git a/samples/statx/test-fsinfo.c b/samples/statx/test-fsinfo.c
new file mode 100644
index 000000000000..9d70c422da11
--- /dev/null
+++ b/samples/statx/test-fsinfo.c
@@ -0,0 +1,538 @@
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <linux/fsinfo.h>
+#include <linux/socket.h>
+#include <sys/stat.h>
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename, struct fsinfo_params *params,
+ void *buffer, size_t buf_size)
+{
+ return syscall(__NR_fsinfo, dfd, filename, params, buffer, buf_size);
+}
+
+#define FSINFO_STRING(N) [fsinfo_attr_##N] = 0x00
+#define FSINFO_STRUCT(N) [fsinfo_attr_##N] = sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRING_N(N) [fsinfo_attr_##N] = 0x40
+#define FSINFO_STRUCT_N(N) [fsinfo_attr_##N] = 0x40 | sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRUCT_NM(N) [fsinfo_attr_##N] = 0x80 | sizeof(struct fsinfo_##N)/sizeof(__u32)
+static const __u8 fsinfo_buffer_sizes[fsinfo_attr__nr] = {
+ FSINFO_STRUCT (statfs),
+ FSINFO_STRUCT (fsinfo),
+ FSINFO_STRUCT (ids),
+ FSINFO_STRUCT (limits),
+ FSINFO_STRUCT (supports),
+ FSINFO_STRUCT (capabilities),
+ FSINFO_STRUCT (timestamp_info),
+ FSINFO_STRING (volume_id),
+ FSINFO_STRUCT (volume_uuid),
+ FSINFO_STRING (volume_name),
+ FSINFO_STRING (cell_name),
+ FSINFO_STRING (domain_name),
+ FSINFO_STRING (realm_name),
+ FSINFO_STRING_N (server_name),
+ FSINFO_STRUCT_NM (server_address),
+ FSINFO_STRING_N (parameter),
+ FSINFO_STRING_N (source),
+ FSINFO_STRING (name_encoding),
+ FSINFO_STRING (name_codepage),
+ FSINFO_STRUCT (io_size),
+};
+
+#define FSINFO_NAME(N) [fsinfo_attr_##N] = #N
+static const char *fsinfo_attr_names[fsinfo_attr__nr] = {
+ FSINFO_NAME(statfs),
+ FSINFO_NAME(fsinfo),
+ FSINFO_NAME(ids),
+ FSINFO_NAME(limits),
+ FSINFO_NAME(supports),
+ FSINFO_NAME(capabilities),
+ FSINFO_NAME(timestamp_info),
+ FSINFO_NAME(volume_id),
+ FSINFO_NAME(volume_uuid),
+ FSINFO_NAME(volume_name),
+ FSINFO_NAME(cell_name),
+ FSINFO_NAME(domain_name),
+ FSINFO_NAME(realm_name),
+ FSINFO_NAME(server_name),
+ FSINFO_NAME(server_address),
+ FSINFO_NAME(parameter),
+ FSINFO_NAME(source),
+ FSINFO_NAME(name_encoding),
+ FSINFO_NAME(name_codepage),
+ FSINFO_NAME(io_size),
+};
+
+union reply {
+ char buffer[4096];
+ struct fsinfo_statfs statfs;
+ struct fsinfo_fsinfo fsinfo;
+ struct fsinfo_ids ids;
+ struct fsinfo_limits limits;
+ struct fsinfo_supports supports;
+ struct fsinfo_capabilities caps;
+ struct fsinfo_timestamp_info timestamps;
+ struct fsinfo_volume_uuid uuid;
+ struct fsinfo_server_address srv_addr;
+ struct fsinfo_io_size io_size;
+};
+
+static void dump_hex(unsigned int *data, int from, int to)
+{
+ unsigned offset, print_offset = 1, col = 0;
+
+ from /= 4;
+ to = (to + 3) / 4;
+
+ for (offset = from; offset < to; offset++) {
+ if (print_offset) {
+ printf("%04x: ", offset * 4);
+ print_offset = 0;
+ }
+ printf("%08x", data[offset]);
+ col++;
+ if ((col & 3) == 0) {
+ printf("\n");
+ print_offset = 1;
+ } else {
+ printf(" ");
+ }
+ }
+
+ if (!print_offset)
+ printf("\n");
+}
+
+static void dump_attr_statfs(union reply *r, int size)
+{
+ struct fsinfo_statfs *f = &r->statfs;
+
+ printf("\n");
+ printf("\tblocks: n=%llu fr=%llu av=%llu\n",
+ (unsigned long long)f->f_blocks,
+ (unsigned long long)f->f_bfree,
+ (unsigned long long)f->f_bavail);
+
+ printf("\tfiles : n=%llu fr=%llu av=%llu\n",
+ (unsigned long long)f->f_files,
+ (unsigned long long)f->f_ffree,
+ (unsigned long long)f->f_favail);
+ printf("\tbsize : %u\n", f->f_bsize);
+ printf("\tfrsize: %u\n", f->f_frsize);
+}
+
+static void dump_attr_fsinfo(union reply *r, int size)
+{
+ struct fsinfo_fsinfo *f = &r->fsinfo;
+
+ printf("max_attr=%u max_cap=%u\n", f->max_attr, f->max_cap);
+}
+
+static void dump_attr_ids(union reply *r, int size)
+{
+ struct fsinfo_ids *f = &r->ids;
+
+ printf("\n");
+ printf("\tdev : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
+ printf("\tfs : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
+ printf("\tflags : %llx\n", (unsigned long long)f->f_flags);
+ printf("\tfsid : %llx\n", (unsigned long long)f->f_fsid);
+}
+
+static void dump_attr_limits(union reply *r, int size)
+{
+ struct fsinfo_limits *f = &r->limits;
+
+ printf("\n");
+ printf("\tmax file size: %llx\n", (unsigned long long)f->max_file_size);
+ printf("\tmax ids : u=%llx g=%llx p=%llx\n", (unsigned long long)f->max_uid,
+ (unsigned long long)f->max_gid, (unsigned long long)f->max_projid);
+ printf("\tmax dev : maj=%x min=%x\n",
+ f->max_dev_major, f->max_dev_minor);
+ printf("\tmax links : %x\n", f->max_hard_links);
+ printf("\tmax xattr : n=%x b=%x\n",
+ f->max_xattr_name_len, f->max_xattr_body_len);
+ printf("\tmax len : file=%x sym=%x\n",
+ f->max_filename_len, f->max_symlink_len);
+}
+
+static void dump_attr_supports(union reply *r, int size)
+{
+ struct fsinfo_supports *f = &r->supports;
+
+ printf("\n");
+ printf("\tstx_attr=%llx\n", (unsigned long long)f->supported_stx_attributes);
+ printf("\tstx_mask=%x\n", f->supported_stx_mask);
+ printf("\tioc_flags=%x\n", f->supported_ioc_flags);
+}
+
+#define FSINFO_CAP_NAME(C) [fsinfo_cap_##C] = #C
+static const char *fsinfo_cap_names[fsinfo_cap__nr] = {
+ FSINFO_CAP_NAME(is_kernel_fs),
+ FSINFO_CAP_NAME(is_block_fs),
+ FSINFO_CAP_NAME(is_flash_fs),
+ FSINFO_CAP_NAME(is_network_fs),
+ FSINFO_CAP_NAME(is_automounter_fs),
+ FSINFO_CAP_NAME(automounts),
+ FSINFO_CAP_NAME(adv_locks),
+ FSINFO_CAP_NAME(mand_locks),
+ FSINFO_CAP_NAME(leases),
+ FSINFO_CAP_NAME(uids),
+ FSINFO_CAP_NAME(gids),
+ FSINFO_CAP_NAME(projids),
+ FSINFO_CAP_NAME(id_names),
+ FSINFO_CAP_NAME(id_guids),
+ FSINFO_CAP_NAME(windows_attrs),
+ FSINFO_CAP_NAME(user_quotas),
+ FSINFO_CAP_NAME(group_quotas),
+ FSINFO_CAP_NAME(project_quotas),
+ FSINFO_CAP_NAME(xattrs),
+ FSINFO_CAP_NAME(journal),
+ FSINFO_CAP_NAME(data_is_journalled),
+ FSINFO_CAP_NAME(o_sync),
+ FSINFO_CAP_NAME(o_direct),
+ FSINFO_CAP_NAME(volume_id),
+ FSINFO_CAP_NAME(volume_uuid),
+ FSINFO_CAP_NAME(volume_name),
+ FSINFO_CAP_NAME(volume_fsid),
+ FSINFO_CAP_NAME(cell_name),
+ FSINFO_CAP_NAME(domain_name),
+ FSINFO_CAP_NAME(realm_name),
+ FSINFO_CAP_NAME(iver_all_change),
+ FSINFO_CAP_NAME(iver_data_change),
+ FSINFO_CAP_NAME(iver_mono_incr),
+ FSINFO_CAP_NAME(symlinks),
+ FSINFO_CAP_NAME(hard_links),
+ FSINFO_CAP_NAME(hard_links_1dir),
+ FSINFO_CAP_NAME(device_files),
+ FSINFO_CAP_NAME(unix_specials),
+ FSINFO_CAP_NAME(resource_forks),
+ FSINFO_CAP_NAME(name_case_indep),
+ FSINFO_CAP_NAME(name_non_utf8),
+ FSINFO_CAP_NAME(name_has_codepage),
+ FSINFO_CAP_NAME(sparse),
+ FSINFO_CAP_NAME(not_persistent),
+ FSINFO_CAP_NAME(no_unix_mode),
+ FSINFO_CAP_NAME(has_atime),
+ FSINFO_CAP_NAME(has_btime),
+ FSINFO_CAP_NAME(has_ctime),
+ FSINFO_CAP_NAME(has_mtime),
+};
+
+static void dump_attr_capabilities(union reply *r, int size)
+{
+ struct fsinfo_capabilities *f = &r->caps;
+ int i;
+
+ for (i = 0; i < sizeof(f->capabilities); i++)
+ printf("%02x", f->capabilities[i]);
+ printf("\n");
+ for (i = 0; i < fsinfo_cap__nr; i++)
+ if (f->capabilities[i / 8] & (1 << (i % 8)))
+ printf("\t- %s\n", fsinfo_cap_names[i]);
+}
+
+static void dump_attr_timestamp_info(union reply *r, int size)
+{
+ struct fsinfo_timestamp_info *f = &r->timestamps;
+
+ printf("range=%llx-%llx\n",
+ (unsigned long long)f->minimum_timestamp,
+ (unsigned long long)f->maximum_timestamp);
+
+#define print_time(G) \
+ printf("\t"#G"time : gran=%gs\n", \
+ (f->G##time_gran_mantissa * \
+ pow(10., f->G##time_gran_exponent)))
+ print_time(a);
+ print_time(b);
+ print_time(c);
+ print_time(m);
+}
+
+static void dump_attr_volume_uuid(union reply *r, int size)
+{
+ struct fsinfo_volume_uuid *f = &r->uuid;
+
+ printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
+ "-%02x%02x%02x%02x%02x%02x\n",
+ f->uuid[ 0], f->uuid[ 1],
+ f->uuid[ 2], f->uuid[ 3],
+ f->uuid[ 4], f->uuid[ 5],
+ f->uuid[ 6], f->uuid[ 7],
+ f->uuid[ 8], f->uuid[ 9],
+ f->uuid[10], f->uuid[11],
+ f->uuid[12], f->uuid[13],
+ f->uuid[14], f->uuid[15]);
+}
+
+static void dump_attr_server_address(union reply *r, int size)
+{
+ struct fsinfo_server_address *f = &r->srv_addr;
+
+ printf("family=%u\n", f->address.ss_family);
+}
+
+static void dump_attr_io_size(union reply *r, int size)
+{
+ struct fsinfo_io_size *f = &r->io_size;
+
+ printf("bs=%u\n", f->block_size);
+}
+
+/*
+ *
+ */
+typedef void (*dumper_t)(union reply *r, int size);
+
+#define FSINFO_DUMPER(N) [fsinfo_attr_##N] = dump_attr_##N
+static const dumper_t fsinfo_attr_dumper[fsinfo_attr__nr] = {
+ FSINFO_DUMPER(statfs),
+ FSINFO_DUMPER(fsinfo),
+ FSINFO_DUMPER(ids),
+ FSINFO_DUMPER(limits),
+ FSINFO_DUMPER(supports),
+ FSINFO_DUMPER(capabilities),
+ FSINFO_DUMPER(timestamp_info),
+ FSINFO_DUMPER(volume_uuid),
+ FSINFO_DUMPER(server_address),
+ FSINFO_DUMPER(io_size),
+};
+
+static void dump_fsinfo(enum fsinfo_attribute attr, __u8 about,
+ union reply *r, int size)
+{
+ dumper_t dumper = fsinfo_attr_dumper[attr];
+ unsigned int len;
+
+ if (!dumper) {
+ printf("<no dumper>\n");
+ return;
+ }
+
+ len = (about & 0x3f) * sizeof(__u32);
+ if (size < len) {
+ printf("<short data %u/%u>\n", size, len);
+ return;
+ }
+
+ dumper(r, size);
+}
+
+/*
+ * Try one subinstance of an attribute.
+ */
+static int try_one(const char *file, struct fsinfo_params *params, bool raw)
+{
+ union reply r;
+ char *p;
+ int ret;
+ __u8 about;
+
+ memset(&r.buffer, 0xbd, sizeof(r.buffer));
+
+ errno = 0;
+ ret = fsinfo(AT_FDCWD, file, params, r.buffer, sizeof(r.buffer));
+ if (params->request >= fsinfo_attr__nr) {
+ if (ret == -1 && errno == EOPNOTSUPP)
+ exit(0);
+ fprintf(stderr, "Unexpected error for too-large command %u: %m\n",
+ params->request);
+ exit(1);
+ }
+
+ //printf("fsinfo(%s,%s,%u,%u) = %d: %m\n",
+ // file, fsinfo_attr_names[params->request],
+ // params->Nth, params->Mth, ret);
+
+ about = fsinfo_buffer_sizes[params->request];
+ if (ret == -1) {
+ if (errno == ENODATA) {
+ switch (about & 0xc0) {
+ case 0x00:
+ if (params->Nth == 0 && params->Mth == 0) {
+ fprintf(stderr,
+ "Unexpected ENODATA1 (%u[%u][%u])\n",
+ params->request, params->Nth, params->Mth);
+ exit(1);
+ }
+ break;
+ case 0x40:
+ if (params->Nth == 0 && params->Mth == 0) {
+ fprintf(stderr,
+ "Unexpected ENODATA2 (%u[%u][%u])\n",
+ params->request, params->Nth, params->Mth);
+ exit(1);
+ }
+ break;
+ }
+ return (params->Mth == 0) ? 2 : 1;
+ }
+ if (errno == EOPNOTSUPP) {
+ if (params->Nth > 0 || params->Mth > 0) {
+ fprintf(stderr,
+ "Should return -ENODATA (%u[%u][%u])\n",
+ params->request, params->Nth, params->Mth);
+ exit(1);
+ }
+ //printf("\e[33m%s\e[m: <not supported>\n",
+ // fsinfo_attr_names[attr]);
+ return 2;
+ }
+ perror(file);
+ exit(1);
+ }
+
+ if (raw) {
+ if (ret > 4096)
+ ret = 4096;
+ dump_hex((unsigned int *)&r.buffer, 0, ret);
+ return 0;
+ }
+
+ switch (about & 0xc0) {
+ case 0x00:
+ printf("\e[33m%s\e[m: ",
+ fsinfo_attr_names[params->request]);
+ break;
+ case 0x40:
+ printf("\e[33m%s[%u]\e[m: ",
+ fsinfo_attr_names[params->request],
+ params->Nth);
+ break;
+ case 0x80:
+ printf("\e[33m%s[%u][%u]\e[m: ",
+ fsinfo_attr_names[params->request],
+ params->Nth, params->Mth);
+ break;
+ }
+
+ switch (about) {
+ /* Struct */
+ case 0x01 ... 0x3f:
+ case 0x41 ... 0x7f:
+ case 0x81 ... 0xbf:
+ dump_fsinfo(params->request, about, &r, ret);
+ return 0;
+
+ /* String */
+ case 0x00:
+ case 0x40:
+ case 0x80:
+ if (ret >= 4096) {
+ ret = 4096;
+ r.buffer[4092] = '.';
+ r.buffer[4093] = '.';
+ r.buffer[4094] = '.';
+ r.buffer[4095] = 0;
+ } else {
+ r.buffer[ret] = 0;
+ }
+ for (p = r.buffer; *p; p++) {
+ if (!isprint((unsigned char)*p)) {
+ printf("<non-printable>\n");
+ return 0;
+ }
+ }
+ printf("%s\n", r.buffer);
+ return 0;
+
+ default:
+ fprintf(stderr, "Fishy about %u %02x\n", params->request, about);
+ exit(1);
+ }
+}
+
+/*
+ *
+ */
+int main(int argc, char **argv)
+{
+ struct fsinfo_params params = {
+ .at_flags = AT_SYMLINK_NOFOLLOW,
+ };
+ unsigned int attr;
+ int raw = 0, opt, Nth, Mth;
+
+ while ((opt = getopt(argc, argv, "alr")) != -1) {
+ switch (opt) {
+ case 'a':
+ params.at_flags |= AT_NO_AUTOMOUNT;
+ continue;
+ case 'l':
+ params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
+ continue;
+ case 'r':
+ raw = 1;
+ continue;
+ }
+ break;
+ }
+
+ argc -= optind;
+ argv += optind;
+
+ if (argc != 1) {
+ printf("Format: test-fsinfo [-alr] <file>\n");
+ exit(2);
+ }
+
+ for (attr = 0; attr <= fsinfo_attr__nr; attr++) {
+ Nth = 0;
+ do {
+ Mth = 0;
+ do {
+ params.request = attr;
+ params.Nth = Nth;
+ params.Mth = Mth;
+
+ switch (try_one(argv[0], &params, raw)) {
+ case 0:
+ continue;
+ case 1:
+ goto done_M;
+ case 2:
+ goto done_N;
+ }
+ } while (++Mth < 100);
+
+ done_M:
+ if (Mth >= 100) {
+ fprintf(stderr, "Fishy: Mth == %u\n", Mth);
+ break;
+ }
+
+ } while (++Nth < 100);
+
+ done_N:
+ if (Nth >= 100) {
+ fprintf(stderr, "Fishy: Nth == %u\n", Nth);
+ break;
+ }
+ }
+
+ return 0;
+}

2018-06-04 15:25:20

by David Howells

Subject: Re: [PATCH 21/32] VFS: Implement fsmount() to effect a pre-configured mount [ver #8]

Arnd Bergmann <[email protected]> wrote:

> The prototype in the header doesn't match the one in the implementation,
> which should cause a compile-time error, at least if syscalls.h is included
> in namespace.c

I've fixed that sort of thing up from kbuild warnings.

> Do you have a particular use case in mind for the spare_4/spare_5 arguments?
> If not, we can probably skip them. If we end up needing them after all, we
> can always add a new syscall entry point, or use one of the flag bits to
> decide whether the additional arguments are valid or not.

Whilst that is true, these aren't really (or probably shouldn't be) hot path
syscalls, so I would contend that just clearing the extra arguments shouldn't
be much of a performance loss. On the other hand, syscall numbers are to some
extent precious. If we hit ~512 syscalls we start to have an issue as we
start to get overlaps.

And, yes, I do have ideas for them involving ID mapping on mounts (ie. killing
off shiftfs).

> > COND_SYSCALL(sys_fsopen);
> > +COND_SYSCALL(sys_fsmount);
>
> This should only be needed if the syscall is optional, which it doesn't
> seem to be (same for the other ones here).

Al removed them.

David

2018-06-04 15:27:54

by David Howells

Subject: Re: [PATCH][RFC] open_tree(2) (was Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8])

Miklos Szeredi <[email protected]> wrote:

> fsinfo = info from path

Actually, I was thinking of making fsinfo() detect if it's been given an fsfd
and go through an fs_context operation instead in that case.

David

2018-06-04 15:53:53

by Al Viro

Subject: Re: [PATCH][RFC] open_tree(2) (was Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8])

On Mon, Jun 04, 2018 at 12:34:44PM +0200, Miklos Szeredi wrote:

> fsopen = create fsfd
> fsmount = fsfd -> mountfd & set attr on mountfd & attach mountfd
> fspick = path -> fsfd
> move_mount = attach mountfd or move existing
> fsinfo = info from path
> open_tree = new mountfd from path or clone
> mount_setattr = set attr on mountfd
>
> Notice that fsmount() encompasses mount_setattr() + move_mount()
> functionality. Split those out and leave fsmount() to actually do
> the "fsfd ->mountfd" translation?

Might make sense.

> fsinfo() name suggests it's in the same class as
> fsopen/fsmount/fspick, operating on fsfd object, but it's not and I
> think that's slightly confusing.
>
> Rename move_mount() -> mount_move()?

mount_move_bikeshed_bikeshed_bikeshed(), surely?

> Also does it make sense to make the cloning behavior of open_tree()
> optional? Without cloning it's just a plain open(O_PATH). That way
> it could be renamed mount_clone().

Umm... I'm not sure about that one. If nothing else, OPEN_TREE_DETACH
might be a good idea, in which case cloning is not the primary effect;
hell knows.

2018-06-04 16:01:34

by Al Viro

Subject: Re: [PATCH][RFC] open_tree(2) (was Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8])

On Mon, Jun 04, 2018 at 04:52:05PM +0100, Al Viro wrote:
> On Mon, Jun 04, 2018 at 12:34:44PM +0200, Miklos Szeredi wrote:
>
> > fsopen = create fsfd
> > fsmount = fsfd -> mountfd & set attr on mountfd & attach mountfd
> > fspick = path -> fsfd
> > move_mount = attach mountfd or move existing
> > fsinfo = info from path
> > open_tree = new mountfd from path or clone
> > mount_setattr = set attr on mountfd
> >
> > Notice that fsmount() encompasses mount_setattr() + move_mount()
> > functionality. Split those out and leave fsmount() to actually do
> > the "fsfd ->mountfd" translation?
>
> Might make sense.

FWIW, to make it clear: fsmount(2) in this series actually does *NOT*
attach it to the tree. Commit message definitely needs updating - as it
is, it's

+SYSCALL_DEFINE5(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags,
+ void *, reserved4, void *, reserved5)

PS: IMO these reserved... arguments are in bad taste; if anyone has good reasons
for that practice in ABI design, I'd like to hear those.

2018-06-04 16:01:35

by Arnd Bergmann

Subject: Re: [PATCH 32/32] [RFC] fsinfo: Add a system call to allow querying of filesystem information [ver #8]

On Mon, Jun 4, 2018 at 5:01 PM, David Howells <[email protected]> wrote:
> Arnd Bergmann <[email protected]> wrote:
>
>> ntfs has sb->s_time_gran=100, and vfat should really have
>> sb->s_time_gran=2000000000 but that doesn't seem to be set right
>> at the moment.
>
> (V)FAT actually has a different granularity on each timestamp.

Ah, right. I guess I missed the fact that the file system can override it,
so FAT doesn't actually have to do it this way.

>> > +struct fsinfo_capabilities {
>> > + __u64 supported_stx_attributes; /* What statx::stx_attributes are supported */
>> > + __u32 supported_stx_mask; /* What statx::stx_mask bits are supported */
>> > + __u32 supported_ioc_flags; /* What FS_IOC_* flags are supported */
>> > + __u8 capabilities[(fsinfo_cap__nr + 7) / 8];
>> > +};
>>
>> This looks a bit odd: with the 44 capabilities, you end up having a six-byte
>> array followed by two bytes of implicit padding. If the number of
>> capabilities grows beyond 64, you have a nine byte array with more padding
>> to the next alignof(__u64). Is that intentional?
>
> I've split the capabilities out into their own thing. I've attached the
> revised patch below.

I'm still not completely clear on how variable-length structures are supposed
to be handled by the fsinfo syscall. It seems like a possible source of bugs to
return a structure from the kernel that has a different size in kernel and user
space depending on the fsinfo_cap__nr value at compile time. How does one e.g.
guarantee there is no out of bounds access when you run new user space on an
older kernel that has a smaller structure?

>> > +struct fsinfo_timestamp_info {
>> > + __s64 minimum_timestamp; /* Minimum timestamp value in seconds */
>> > + __s64 maximum_timestamp; /* Maximum timestamp value in seconds */
>> > + __u16 atime_gran_mantissa; /* Granularity(secs) = mant * 10^exp */
>> > + __u16 btime_gran_mantissa;
>> > + __u16 ctime_gran_mantissa;
>> > + __u16 mtime_gran_mantissa;
>> > + __s8 atime_gran_exponent;
>> > + __s8 btime_gran_exponent;
>> > + __s8 ctime_gran_exponent;
>> > + __s8 mtime_gran_exponent;
>> > +};
>>
>> This structure has a slightly inconsistent amount of padding at the end:
>> on x86-32 it has no padding, everywhere else it has 32 bits of padding
>> to make it 64-bit aligned. Maybe add a __u32 reserved field?
>
> It occurs to me that the min and max may be different for each timestamp.
> Maybe I should have:
>
> struct fsinfo_timestamp_info {
> char name[7 + 1];
> __s64 minimum_timestamp;
> __s64 maximum_timestamp;
> __u16 granularity_mantissa;
> __s8 granularity_exponent;
> __u8 __reserved[5];
> };
>

I don't particularly like having a string in there, that seems to add
unnecessary complexity compared to using an integer. Having four
min/max values would make it more generic but I don't think we have
a need for that at the moment with any of the file systems we support.

In any case, it would be nice to have a trivial way to query which of
the four timestamp types are supported at all, and returning
them separately would be one way of doing that.

> and then you iterate through them by setting Nth. I could just put a:
>
> __u64 granularity;
>
> field, expressed in nS, rather than mantissa and exponent, but doing it this
> way allows me to express granularities less than 1nS very simply (something
> Dave Chinner was talking about).
>
> It might also be worth putting minimum_timestamp and maximum_timestamp in
> terms of granularity rather than nS.

Expressing the granularity in nanoseconds (or something smaller) would seem
more natural to me, but I don't really care much either way.

Arnd

2018-06-04 17:22:25

by Matthew Wilcox

Subject: Re: [PATCH][RFC] open_tree(2) (was Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8])

On Sun, Jun 03, 2018 at 01:55:37AM +0100, Al Viro wrote:
> +SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
> +{
> + struct file *file;
> + struct path path;
> + int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
> + bool detached = flags & OPEN_TREE_CLONE;
> + int error;
> + int fd;
> +
> + BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);

Why do we need OPEN_TREE_CLOEXEC? Wouldn't we be better off just making
the fd returned by open_tree implicitly close-on-exec? I can think of
no good reason for these file descriptors to be inherited across exec()
and if someone comes up with such a reason, fcntl(F_SETFD) is not an
expensive call to make.

2018-06-04 17:36:56

by Al Viro

Subject: Re: [PATCH][RFC] open_tree(2) (was Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8])

On Mon, Jun 04, 2018 at 10:16:30AM -0700, Matthew Wilcox wrote:
> On Sun, Jun 03, 2018 at 01:55:37AM +0100, Al Viro wrote:
> > +SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
> > +{
> > + struct file *file;
> > + struct path path;
> > + int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
> > + bool detached = flags & OPEN_TREE_CLONE;
> > + int error;
> > + int fd;
> > +
> > + BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
>
> Why do we need OPEN_TREE_CLOEXEC? Wouldn't we be better off just making
> the fd returned by open_tree implicitly close-on-exec? I can think of
> no good reason for these file descriptors to be inherited across exec()

How are they different from any file descriptor? It's not as if it was
something usable only for mounting stuff - again, you can use them
with any ...at() syscalls.

> and if someone comes up with such a reason, fcntl(F_SETFD) is not an
> expensive call to make.

2018-06-04 19:07:35

by David Howells

Subject: Re: [PATCH 32/32] [RFC] fsinfo: Add a system call to allow querying of filesystem information [ver #8]

Arnd Bergmann <[email protected]> wrote:

> > I've split the capabilities out into their own thing. I've attached the
> > revised patch below.
>
> I'm still not completely clear on how variable-length structures are
> supposed to be handled by the fsinfo syscall. It seems like a possible
> source of bugs to return a structure from the kernel that has a different
> size in kernel and user space depending on the fsinfo_cap__nr value at
> compile time. How does one e.g. guarantee there is no out of bounds access
> when you run new user space on an older kernel that has a smaller structure?

There's a buffer size parameter:

int ret = fsinfo(int dfd,
const char *filename,
const struct fsinfo_params *params,
void *buffer,
size_t buf_size);

For a fixed-size buffer request (as opposed to a string), the fsinfo syscall
allocates an internal buffer of the size that the internal kernel code expects
for that request, and *not* the size the user asked for:

/* Allocate an appropriately-sized buffer. We will truncate the
* contents when we write the contents back to userspace.
*/
size = fsinfo_buffer_sizes[params.request];
...
if (buf_size > 0) {
params.buf_size = size;
params.buffer = kzalloc(size, GFP_KERNEL);
if (!params.buffer)
return -ENOMEM;
}

so that the filesystems don't have to concern themselves with anything other
than the kernel's idea of the size.

The fsinfo() syscall truncates the reply buffer to the size the user requested
if the user requested a smaller amount. Take the fsinfo_supports struct for
example:

struct fsinfo_supports {
__u64 supported_stx_attributes;
__u32 supported_stx_mask;
__u32 supported_ioc_flags;
};

Now imagine that in future we want to add another field, say the mask of the
windows file attributes a filesystem supports. We can extend the struct like
so:

struct fsinfo_supports_v2 {
__u64 supported_stx_attributes;
__u32 supported_stx_mask;
__u32 supported_ioc_flags;
__u32 supported_win_file_atts;
__u32 __reserved[1];
};

Note that the start of the new struct *must* correspond in layout to the
original struct. An application that doesn't know about v2 would just ask for
v1:

struct fsinfo_supports foo;
fsinfo(.... &foo, sizeof(foo));

and would only ever get those bits - though it would be told that there is
more data available. An application that does know about v2 might do:

struct fsinfo_supports_v2 foo2;
fsinfo(.... &foo2, sizeof(foo2));

If the kernel supports all of v2, all fields will be filled in and the return
value will be sizeof(foo2). If the kernel only supports v1, the return value
will be sizeof(foo). If a v3 were added, a kernel that supports it would
return sizeof(v3), and so on.

I can improve this such that if you asked for a fixed-length option and the
kernel doesn't have enough data to fill the user buffer provided, then it
clears the remainder of the buffer. That way at least any unsupported fields
will be initialised to 0.
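
The copy-back rule described above, including the proposed zeroing of the tail, can be modelled in user space roughly as follows. This is a sketch only, not the kernel code: model_copy_reply() and its parameter names are invented for the illustration, with kbuf/ksize standing for the kernel's idea of the struct and ubuf/usize for the caller's buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space model of the reply copy-back step.
 * Copies at most usize bytes, zeroes whatever the kernel could not fill,
 * and always returns the amount of data the kernel had available.
 */
static size_t model_copy_reply(void *ubuf, size_t usize,
			       const void *kbuf, size_t ksize)
{
	size_t n = usize < ksize ? usize : ksize;

	assert(ubuf != NULL && kbuf != NULL);

	memcpy(ubuf, kbuf, n);		/* truncate to the caller's size */
	if (usize > ksize)		/* proposed improvement: clear the tail */
		memset((char *)ubuf + ksize, 0, usize - ksize);

	return ksize;			/* report what was available */
}
```

In this model a v2-aware caller on a v1 kernel passes usize == 24 and gets 16 back with the last 8 bytes cleared, while a v1 caller on a v2 kernel passes 16 and gets 24 back, which is the hint that more data exists.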


For the capabilities bitmask, it's not really any different conceptually. If
you want to test capability bit 47, you need to ask for 6 bytes of data. If
the kernel doesn't support that many bits, it won't necessarily give you that
many bytes. If it has, say, 13 bytes-worth of caps available, it will only
give you the first 6 bytes-worth if that's all you ask for. You presumably
weren't interested or didn't know about any more than that.
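
A caller-side check along these lines might look like the sketch below. The helper name cap_test() and the bit layout (bit N stored in byte N / 8 at position N % 8) are assumptions made for the illustration; the point is only that a bit beyond the bytes actually returned is treated as "not set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: test capability bit "cap" in a byte array of which
 * only "nbytes" bytes were returned.  The bit layout is an assumption.
 */
static bool cap_test(const unsigned char *caps, size_t nbytes, unsigned int cap)
{
	assert(caps != NULL);

	if (cap / 8 >= nbytes)
		return false;	/* kernel returned fewer bytes: not supported */
	return caps[cap / 8] & (1u << (cap % 8));
}
```

Under that layout bit 47 lives in byte 5, so six bytes are the minimum request that can answer the question.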


As for strings, they're completely variable length anyway, so I don't think
there's a problem there.

> In any case, it would be nice to have a trivial way to query which of
> the four timestamp types are supported at all, and returning
> them separately would be one way of doing that.

fsinfo_cap_has_atime = 45, /* fs supports access time */
fsinfo_cap_has_btime = 46, /* fs supports birth/creation time */
fsinfo_cap_has_ctime = 47, /* fs supports change time */
fsinfo_cap_has_mtime = 48, /* fs supports modification time */

David

2018-06-04 19:29:18

by Miklos Szeredi

Subject: Re: [PATCH][RFC] open_tree(2) (was Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8])

On Mon, Jun 4, 2018 at 5:52 PM, Al Viro <[email protected]> wrote:
> On Mon, Jun 04, 2018 at 12:34:44PM +0200, Miklos Szeredi wrote:
>
>> fsopen = create fsfd
>> fsmount = fsfd -> mountfd & set attr on mountfd & attach mountfd
>> fspick = path -> fsfd
>> move_mount = attach mountfd or move existing
>> fsinfo = info from path
>> open_tree = new mountfd from path or clone
>> mount_setattr = set attr on mountfd
>>
>> Notice that fsmount() encompasses mount_setattr() + move_mount()
>> functionality. Split those out and leave fsmount() to actually do
>> the "fsfd ->mountfd" translation?
>
> Might make sense.

> FWIW, to make it clear: fsmount(2) in this series actually does *NOT*
> attach it to the tree.

Ah, that leaves the mount_setattr() functionality to split out. I'd
be happier to rid this new API of all the old MS_* crap and
have a new set of attributes that just apply to mounts. It will
also need two args: a bitmap of new attributes and a mask to tell us
which attributes to change.
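
A minimal sketch of the two-argument update, assuming made-up attribute bits (the MODEL_ATTR_* names and model_apply_attrs() are placeholders, not a proposed ABI):

```c
#include <assert.h>

/* Hypothetical attribute bits, for illustration only. */
#define MODEL_ATTR_RDONLY	0x1u
#define MODEL_ATTR_NOSUID	0x2u
#define MODEL_ATTR_NODEV	0x4u

/*
 * Apply "attr" to "cur", touching only the bits selected by "mask".
 * Bits outside the mask keep their current value.
 */
static unsigned int model_apply_attrs(unsigned int cur, unsigned int attr,
				      unsigned int mask)
{
	/* In this model, setting a bit outside the mask is a caller error. */
	assert((attr & ~mask) == 0);

	return (cur & ~mask) | attr;
}
```

The mask is what lets a caller clear one attribute without first having to read back and resend all the others.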

> Commit message definitely needs updating - as it
> is, it's
>
> +SYSCALL_DEFINE5(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags,
> + void *, reserved4, void *, reserved5)
>
> PS: IMO these reserved... arguments are in bad taste; if anyone has good reasons
> for that practice in ABI design, I'd like to hear those.

Agreed. A flags argument is often wise to add even if currently
unused (and should be checked for undefined flags), but adding a
random number of pointers doesn't seem to make a lot of sense.
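
The check for undefined flags is the usual pattern sketched below. The MODEL_* flag names are placeholders invented for the example; only the shape of the test matters.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits, for illustration only. */
#define MODEL_FLAG_CLOEXEC	0x1u
#define MODEL_FLAG_RECURSIVE	0x2u
#define MODEL_VALID_FLAGS	(MODEL_FLAG_CLOEXEC | MODEL_FLAG_RECURSIVE)

static_assert((MODEL_FLAG_CLOEXEC & MODEL_FLAG_RECURSIVE) == 0,
	      "flag bits must not overlap");

/*
 * Reject any bit that is not currently defined, so that those bits can be
 * given a meaning later without breaking callers that passed garbage.
 */
static int model_check_flags(unsigned int flags)
{
	if (flags & ~MODEL_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits from day one is what makes a spare flags argument usable for extension later.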

>
>> fsinfo() name suggests it's in the same class as
>> fsopen/fsmount/fspick, operating on fsfd object, but it's not and I
>> think that's slightly confusing.
>>
>> Rename move_mount() -> mount_move()?
>
> mount_move_bikeshed_bikeshed_bikeshed(), surely?

Consistent naming for related functions... not unheard of in API
design. The above set definitely does not qualify.

>> Also does it make sense to make the cloning behavior of open_tree()
>> optional? Without cloning it's just a plain open(O_PATH). That way
>> it could be renamed mount_clone().
>
> Umm... I'm not sure about that one. If nothing else, OPEN_TREE_DETACH
> might be a good idea, in which case cloning is not the primary effect;
> hell knows.

So conceptually we have the following distinct mount tree operations:

treefd = clone(path);
treefd = detach(path);
attach(treefd, path);
move(path1, path2);

The detach/move/attach trio are more related in functionality, while
clone and detach have the same signature. I'm not sure either.

Thanks,
Miklos

2018-06-04 19:38:51

by Miklos Szeredi

Subject: Re: [PATCH][RFC] open_tree(2) (was Re: [PATCH 30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8])

On Mon, Jun 4, 2018 at 7:35 PM, Al Viro <[email protected]> wrote:
> On Mon, Jun 04, 2018 at 10:16:30AM -0700, Matthew Wilcox wrote:
>> On Sun, Jun 03, 2018 at 01:55:37AM +0100, Al Viro wrote:
>> > +SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
>> > +{
>> > + struct file *file;
>> > + struct path path;
>> > + int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
>> > + bool detached = flags & OPEN_TREE_CLONE;
>> > + int error;
>> > + int fd;
>> > +
>> > + BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
>>
>> Why do we need OPEN_TREE_CLOEXEC? Wouldn't we be better off just making
>> the fd returned by open_tree implicitly close-on-exec? I can think of
>> no good reason for these file descriptors to be inherited across exec()
>
> How are they different from any file descriptor? It's not as if it was
> something usable only for mounting stuff - again, you can use them
> with any ...at() syscalls.

Defaulting to close on exec helps keep out clutter from the API.

Is there a disadvantage to needing an explicit fcntl(F_SETFD) call to
disable close on exec?

Thanks,
Miklos

2018-06-04 20:50:53

by Arnd Bergmann

Subject: Re: [PATCH 32/32] [RFC] fsinfo: Add a system call to allow querying of filesystem information [ver #8]

On Mon, Jun 4, 2018 at 9:03 PM, David Howells <[email protected]> wrote:
> Arnd Bergmann <[email protected]> wrote:
> The fsinfo() syscall truncates the reply buffer to the size the user requested
> if the user requested a smaller amount. Take the fsinfo_supports struct for
> example:
>
> struct fsinfo_supports {
> __u64 supported_stx_attributes;
> __u32 supported_stx_mask;
> __u32 supported_ioc_flags;
> };
>
> Now imagine that in future we want to add another field, say the mask of the
> windows file attributes a filesystem supports. We can extend the struct like
> so:
>
> struct fsinfo_supports_v2 {
> __u64 supported_stx_attributes;
> __u32 supported_stx_mask;
> __u32 supported_ioc_flags;
> __u32 supported_win_file_atts;
> __u32 __reserved[1];
> };

Looking at the code again, I realized my misunderstanding: I somehow
expected the system call to return multiple structures at once, which
would get really messy with groups of arrays of variable-sized
structures. Since we only really get back a single structure per call,
that is not an issue.

There is also no need to be concerned about the system call overhead,
right? Even if we query all data from all mounted file systems, I
suppose the total number of syscall roundtrips will be small enough
that there is no need for complicating the interface to make it
slightly faster.

> I can improve this such that if you asked for a fixed-length option and the
> kernel doesn't have enough data to fill the user buffer provided, then it
> clears the remainder of the buffer. That way at least any unsupported fields
> will be initialised to 0.

Yes, I think that would make sense here. It's not quite a read() based
interface since the return value for a short read is still the size of the
whole buffer that could have been accessed. By zeroing the extra
data, the kernel always writes the amount of data that the user asks
for, and the return value always shows how much data would have
been available.

It might be necessary to limit the size of the buffer though, to prevent
bad things from happening when the user asks for e.g. -1ull bytes
of data.
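
One simple way to bound the request, sketched here with an assumed cap of 4096 bytes (both MODEL_MAX_BUF and model_limit_buf_size() are invented for the example, and a real implementation could equally well reject oversized requests instead of clamping):

```c
#include <assert.h>
#include <stddef.h>

/* Assumed upper bound for this sketch; not a value taken from the patch. */
#define MODEL_MAX_BUF	4096u

/* Clamp a user-supplied buffer size before it is used for any allocation. */
static size_t model_limit_buf_size(size_t buf_size)
{
	return buf_size > MODEL_MAX_BUF ? MODEL_MAX_BUF : buf_size;
}
```

With this in place a request for (size_t)-1 bytes is treated as a request for the cap, so the zero-fill of the tail stays bounded.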

>> In any case, it would be nice to have a trivial way to query which of
>> the four timestamp types are supported at all, and returning
>> them separately would be one way of doing that.
>
> fsinfo_cap_has_atime = 45, /* fs supports access time */
> fsinfo_cap_has_btime = 46, /* fs supports birth/creation time */
> fsinfo_cap_has_ctime = 47, /* fs supports change time */
> fsinfo_cap_has_mtime = 48, /* fs supports modification time */

Ok.

Arnd

2018-06-07 02:41:19

by Goldwyn Rodrigues

Subject: Re: [PATCH 26/32] afs: Use fs_context to pass parameters over automount [ver #8]



On 05/24/2018 07:08 PM, David Howells wrote:
> Alter the AFS automounting code to create and modify an fs_context struct
> when parameterising a new mount triggered by an AFS mountpoint rather than
> constructing device name and option strings.
>
> Also remove the cell=, vol= and rwpath options as they are then redundant.
> The reason they existed is because the 'device name' may be derived
> literally from a mountpoint object in the filesystem, so default cell and
> parent-type information needed to be passed in by some other method from
> the automount routines. The vol= option didn't end up being used.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Eric W. Biederman <[email protected]>
> ---
>
> fs/afs/internal.h | 1
> fs/afs/mntpt.c | 148 +++++++++++++++++++++++++++--------------------------
> fs/afs/super.c | 43 +--------------
> 3 files changed, 79 insertions(+), 113 deletions(-)
>
> diff --git a/fs/afs/internal.h b/fs/afs/internal.h
> index eb6e75e00181..90af5001f8c8 100644
> --- a/fs/afs/internal.h
> +++ b/fs/afs/internal.h
> @@ -35,7 +35,6 @@ struct pagevec;
> struct afs_call;
>
> struct afs_fs_context {
> - bool rwpath; /* T if the parent should be considered R/W */
> bool force; /* T to force cell type */
> bool autocell; /* T if set auto mount operation */
> bool dyn_root; /* T if dynamic root */
> diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
> index c45aa1776591..fc383d727552 100644
> --- a/fs/afs/mntpt.c
> +++ b/fs/afs/mntpt.c
> @@ -47,6 +47,8 @@ static DECLARE_DELAYED_WORK(afs_mntpt_expiry_timer, afs_mntpt_expiry_timed_out);
>
> static unsigned long afs_mntpt_expiry_timeout = 10 * 60;
>
> +static const char afs_root_volume[] = "root.cell";
> +
> /*
> * no valid lookup procedure on this sort of dir
> */
> @@ -68,107 +70,107 @@ static int afs_mntpt_open(struct inode *inode, struct file *file)
> }
>
> /*
> - * create a vfsmount to be automounted
> + * Set the parameters for the proposed superblock.
> */
> -static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
> +static int afs_mntpt_set_params(struct fs_context *fc, struct dentry *mntpt)
> {
> - struct afs_super_info *as;
> - struct vfsmount *mnt;
> - struct afs_vnode *vnode;
> - struct page *page;
> - char *devname, *options;
> - bool rwpath = false;
> + struct afs_fs_context *ctx = fc->fs_private;
> + struct afs_vnode *vnode = AFS_FS_I(d_inode(mntpt));
> + struct afs_cell *cell;
> + const char *p;
> int ret;
>
> - _enter("{%pd}", mntpt);
> -
> - BUG_ON(!d_inode(mntpt));
> -
> - ret = -ENOMEM;
> - devname = (char *) get_zeroed_page(GFP_KERNEL);
> - if (!devname)
> - goto error_no_devname;
> -
> - options = (char *) get_zeroed_page(GFP_KERNEL);
> - if (!options)
> - goto error_no_options;
> -
> - vnode = AFS_FS_I(d_inode(mntpt));
> if (test_bit(AFS_VNODE_PSEUDODIR, &vnode->flags)) {
> /* if the directory is a pseudo directory, use the d_name */
> - static const char afs_root_cell[] = ":root.cell.";
> unsigned size = mntpt->d_name.len;
>
> - ret = -ENOENT;
> - if (size < 2 || size > AFS_MAXCELLNAME)
> - goto error_no_page;
> + if (size < 2)
> + return -ENOENT;
>
> + p = mntpt->d_name.name;
> if (mntpt->d_name.name[0] == '.') {
> - devname[0] = '%';
> - memcpy(devname + 1, mntpt->d_name.name + 1, size - 1);
> - memcpy(devname + size, afs_root_cell,
> - sizeof(afs_root_cell));
> - rwpath = true;
> - } else {
> - devname[0] = '#';
> - memcpy(devname + 1, mntpt->d_name.name, size);
> - memcpy(devname + size + 1, afs_root_cell,
> - sizeof(afs_root_cell));
> + size--;
> + p++;
> + ctx->type = AFSVL_RWVOL;
> + ctx->force = true;
> + }
> + if (size > AFS_MAXCELLNAME)
> + return -ENAMETOOLONG;
> +
> + cell = afs_lookup_cell(ctx->net, p, size, NULL, false);
> + if (IS_ERR(cell)) {
> + pr_err("kAFS: unable to lookup cell '%pd'\n", mntpt);
> + return PTR_ERR(cell);
> }
> + afs_put_cell(ctx->net, ctx->cell);
> + ctx->cell = cell;
> +
> + ctx->volname = afs_root_volume;
> + ctx->volnamesz = sizeof(afs_root_volume) - 1;
> } else {
> /* read the contents of the AFS special symlink */
> + struct page *page;
> loff_t size = i_size_read(d_inode(mntpt));
> char *buf;
>
> - ret = -EINVAL;
> if (size > PAGE_SIZE - 1)
> - goto error_no_page;
> + return -EINVAL;
>
> page = read_mapping_page(d_inode(mntpt)->i_mapping, 0, NULL);
> - if (IS_ERR(page)) {
> - ret = PTR_ERR(page);
> - goto error_no_page;
> - }
> + if (IS_ERR(page))
> + return PTR_ERR(page);
>
> - ret = -EIO;
> - if (PageError(page))
> - goto error;
> + if (PageError(page)) {
> + put_page(page);
> + return -EIO;
> + }
>
> - buf = kmap_atomic(page);
> - memcpy(devname, buf, size);
> - kunmap_atomic(buf);
> + buf = kmap(page);
> + ret = vfs_set_fs_source(fc, buf, size);
> + kunmap(page);
> put_page(page);
> - page = NULL;
> + if (ret < 0)
> + return ret;
> }
>
> - /* work out what options we want */
> - as = AFS_FS_S(mntpt->d_sb);
> - if (as->cell) {
> - memcpy(options, "cell=", 5);
> - strcpy(options + 5, as->cell->name);
> - if ((as->volume && as->volume->type == AFSVL_RWVOL) || rwpath)
> - strcat(options, ",rwpath");
> - }
> + return 0;
> +}
>
> - /* try and do the mount */
> - _debug("--- attempting mount %s -o %s ---", devname, options);
> - mnt = vfs_submount(mntpt, &afs_fs_type, devname,
> - options, strlen(options) + 1);
> - _debug("--- mount result %p ---", mnt);
> +/*
> + * create a vfsmount to be automounted
> + */
> +static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
> +{
> + struct fs_context *fc;
> + struct vfsmount *mnt;
> + int ret;
> +
> + BUG_ON(!d_inode(mntpt));
> +
> + fc = vfs_new_fs_context(&afs_fs_type, mntpt, 0,
> + FS_CONTEXT_FOR_SUBMOUNT);
> + if (IS_ERR(fc))
> + return ERR_CAST(fc);
> +
> + ret = afs_mntpt_set_params(fc, mntpt);
> + if (ret < 0)
> + goto error_fc;
> +
> + ret = vfs_get_tree(fc);
> + if (ret < 0)
> + goto error_fc;
> +
> + mnt = vfs_create_mount(fc, 0);
> + if (IS_ERR(mnt)) {
> + ret = PTR_ERR(mnt);
> + goto error_fc;
> + }
>
> - free_page((unsigned long) devname);
> - free_page((unsigned long) options);
> - _leave(" = %p", mnt);
> + put_fs_context(fc);
> return mnt;
>

Why are you calling put_fs_context(fc) on the success path? Do we no
longer need a reference to fc at that point?


> -error:
> - put_page(page);
> -error_no_page:
> - free_page((unsigned long) options);
> -error_no_options:
> - free_page((unsigned long) devname);
> -error_no_devname:
> - _leave(" = %d", ret);
> +error_fc:
> + put_fs_context(fc);
> return ERR_PTR(ret);
> }
>
--
Goldwyn

2018-06-07 19:55:32

by Miklos Szeredi

Subject: Re: [PATCH 10/32] VFS: Implement a filesystem superblock creation/configuration context [ver #8]

On Fri, May 25, 2018 at 2:06 AM, David Howells <[email protected]> wrote:
> Implement a filesystem context concept to be used during superblock
> creation for mount and superblock reconfiguration for remount.
>
> The mounting procedure then becomes:
>
> (1) Allocate new fs_context context.
>
> (2) Configure the context.
>
> (3) Create superblock.
>
> (4) Mount the superblock any number of times.
>
> (5) Destroy the context.
>
> Rather than calling fs_type->mount(), an fs_context struct is created and
> fs_type->init_fs_context() is called to set it up.
> fs_type->fs_context_size says how much space should be allocated for the
> config context. The fs_context struct is placed at the beginning and any
> extra space is for the filesystem's use.
>
> A set of operations has to be set by ->init_fs_context() to provide
> freeing, duplication, option parsing, binary data parsing, validation,
> mounting and superblock filling.
>
> Legacy filesystems are supported by the provision of a set of legacy
> fs_context operations that build up a list of mount options and then invoke
> fs_type->mount() from within the fs_context ->get_tree() operation. This
> allows all filesystems to be accessed using fs_context.
>
> It should be noted that, whilst this patch adds a lot of lines of code,
> there is quite a bit of duplication with existing code that can be
> eliminated should all filesystems be converted over.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> fs/Makefile | 3
> fs/fs_context.c | 599 ++++++++++++++++++++++++++++++++++++++++++++
> fs/internal.h | 3
> fs/libfs.c | 17 +
> fs/namespace.c | 350 +++++++++++++++++---------
> fs/super.c | 311 ++++++++++++++++++++++-
> include/linux/fs.h | 13 +
> include/linux/fs_context.h | 45 +++
> include/linux/mount.h | 3
> 9 files changed, 1201 insertions(+), 143 deletions(-)
> create mode 100644 fs/fs_context.c
>
> diff --git a/fs/Makefile b/fs/Makefile
> index c9375fd2c8c4..6f2dae3c32da 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -12,7 +12,8 @@ obj-y := open.o read_write.o file_table.o super.o \
> attr.o bad_inode.o file.o filesystems.o namespace.o \
> seq_file.o xattr.o libfs.o fs-writeback.o \
> pnode.o splice.o sync.o utimes.o d_path.o \
> - stack.o fs_struct.o statfs.o fs_pin.o nsfs.o
> + stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
> + fs_context.o
>
> ifeq ($(CONFIG_BLOCK),y)
> obj-y += buffer.o block_dev.o direct-io.o mpage.o
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> new file mode 100644
> index 000000000000..bef68a12ddb5
> --- /dev/null
> +++ b/fs/fs_context.c
> @@ -0,0 +1,599 @@
> +/* Provide a way to create a superblock configuration context within the kernel
> + * that allows a superblock to be set up prior to mounting.
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/fs_context.h>
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/nsproxy.h>
> +#include <linux/slab.h>
> +#include <linux/magic.h>
> +#include <linux/security.h>
> +#include <linux/parser.h>
> +#include <linux/mnt_namespace.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/user_namespace.h>
> +#include <net/net_namespace.h>
> +#include "mount.h"
> +
> +enum legacy_fs_param {
> + LEGACY_FS_UNSET_PARAMS,
> + LEGACY_FS_NO_PARAMS,
> + LEGACY_FS_MONOLITHIC_PARAMS,
> + LEGACY_FS_INDIVIDUAL_PARAMS,
> + LEGACY_FS_MAGIC_PARAMS,
> +};
> +
> +struct legacy_fs_context {
> + struct fs_context fc;
> + char *legacy_data; /* Data page for legacy filesystems */
> + char *secdata;
> + size_t data_size;
> + enum legacy_fs_param param_type;
> +};
> +
> +static const struct fs_context_operations legacy_fs_context_ops;
> +
> +static const match_table_t common_set_sb_flag = {
> + { SB_DIRSYNC, "dirsync" },
> + { SB_LAZYTIME, "lazytime" },
> + { SB_MANDLOCK, "mand" },
> + { SB_POSIXACL, "posixacl" },
> + { SB_RDONLY, "ro" },
> + { SB_SYNCHRONOUS, "sync" },
> + { },
> +};
> +
> +static const match_table_t common_clear_sb_flag = {
> + { SB_LAZYTIME, "nolazytime" },
> + { SB_MANDLOCK, "nomand" },
> + { SB_RDONLY, "rw" },
> + { SB_SILENT, "silent" },
> + { SB_SYNCHRONOUS, "async" },
> + { },
> +};
> +
> +static const match_table_t forbidden_sb_flag = {
> + { 0, "bind" },
> + { 0, "move" },
> + { 0, "private" },
> + { 0, "remount" },
> + { 0, "shared" },
> + { 0, "slave" },
> + { 0, "unbindable" },
> + { 0, "rec" },
> + { 0, "noatime" },
> + { 0, "relatime" },
> + { 0, "norelatime" },
> + { 0, "strictatime" },
> + { 0, "nostrictatime" },
> + { 0, "nodiratime" },
> + { 0, "dev" },
> + { 0, "nodev" },
> + { 0, "exec" },
> + { 0, "noexec" },
> + { 0, "suid" },
> + { 0, "nosuid" },
> + { },
> +};
> +
> +/*
> + * Check for a common mount option that manipulates s_flags.
> + */
> +static int vfs_parse_sb_flag_option(struct fs_context *fc, char *data)
> +{
> + substring_t args[MAX_OPT_ARGS];
> + unsigned int token;
> +
> + token = match_token(data, common_set_sb_flag, args);
> + if (token) {
> + fc->sb_flags |= token;
> + return 1;
> + }
> +
> + token = match_token(data, common_clear_sb_flag, args);
> + if (token) {
> + fc->sb_flags &= ~token;
> + return 1;
> + }
> +
> + token = match_token(data, forbidden_sb_flag, args);
> + if (token)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +/**
> + * vfs_parse_fs_option - Add a single mount option to a superblock config
> + * @fc: The filesystem context to modify
> + * @opt: The option to apply.
> + * @len: The length of the option.
> + *
> + * A single mount option in string form is applied to the filesystem context
> + * being set up. Certain standard options (for example "ro") are translated
> + * into flag bits without going to the filesystem. The active security module
> + * is allowed to observe and poach options. Any other options are passed over
> + * to the filesystem to parse.
> + *
> + * This may be called multiple times for a context.
> + *
> + * Returns 0 on success and a negative error code on failure. In the event of
> + * failure, supplementary error information may have been set.
> + */
> +int vfs_parse_fs_option(struct fs_context *fc, char *opt, size_t len)
> +{
> + int ret;
> +
> + ret = vfs_parse_sb_flag_option(fc, opt);
> + if (ret < 0)
> + return ret;
> + if (ret == 1)
> + return 0;

Why is vfs_parse_sb_flag_option() not called from ->parse_option()?

That way, a filesystem can reject unsupported generic options. We don't
have that in the current API, but that doesn't mean the new API
shouldn't handle that case. Yeah, need to worry about backward
compat, so need a flag to say whether this comes from monolithic
option block or fsfd write.

Also thinking: if we are giving this brand new API to fs developers,
why not also give some helpers, so option parsing becomes easier, more
consistent, etc... I'm thinking along the lines of module_param_*().
I.e. we give the parser a structure pointer and an array of {option
name, structure member name, type} or {option name, get/set ops} and
the helpers take care of the rest (parse, show). That isn't going to
cover everything, but it might be good enough for most.
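
A minimal userspace sketch of the table-driven helper idea above. Every name in it (`opt_spec`, `parse_one_option`, `demo_params`, `demo_specs`) is hypothetical and is not part of the posted patches or of any kernel API. It shows only the {option name, type, member offset} shape on the parse side; a real helper would also need the "show" side and more types.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical option types; none of these names exist in the kernel. */
enum opt_type { OPT_BOOL, OPT_UINT };

struct opt_spec {
	const char *name;	/* option name as written by the user */
	enum opt_type type;	/* how to interpret the value */
	size_t offset;		/* offsetof() the member in the target struct */
};

/* Example per-filesystem parameter block (illustrative only). */
struct demo_params {
	unsigned int timeout;
	int verbose;
};

const struct opt_spec demo_specs[] = {
	{ "timeout", OPT_UINT, offsetof(struct demo_params, timeout) },
	{ "verbose", OPT_BOOL, offsetof(struct demo_params, verbose) },
	{ NULL, OPT_BOOL, 0 },
};

/*
 * Parse one "name" or "name=value" option into @target using @specs.
 * Returns 0 on success, -1 for an unknown option or a malformed value.
 */
int parse_one_option(const struct opt_spec *specs, void *target,
		     const char *opt)
{
	const char *eq = strchr(opt, '=');
	size_t name_len = eq ? (size_t)(eq - opt) : strlen(opt);
	const char *value = eq ? eq + 1 : NULL;

	for (; specs->name; specs++) {
		char *member = (char *)target + specs->offset;
		char *end;
		unsigned long v;

		if (strlen(specs->name) != name_len ||
		    strncmp(specs->name, opt, name_len) != 0)
			continue;

		switch (specs->type) {
		case OPT_BOOL:
			if (value)
				return -1;	/* flags take no value */
			*(int *)member = 1;
			return 0;
		case OPT_UINT:
			if (!value || !*value)
				return -1;
			v = strtoul(value, &end, 10);
			if (*end != '\0')
				return -1;	/* trailing junk */
			*(unsigned int *)member = (unsigned int)v;
			return 0;
		}
	}
	return -1;	/* no table entry matched */
}
```

With a table like this, a filesystem would only supply `demo_specs` and its parameter struct; anything the table cannot express would still need a hand-written fallback.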

Of course, that can come later, while doing the conversion of
filesystems to the new API.

Thanks,
Miklos

2018-06-07 20:46:15

by David Howells

Subject: Re: [PATCH 26/32] afs: Use fs_context to pass parameters over automount [ver #8]

Goldwyn Rodrigues <[email protected]> wrote:

> > +static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
> > +{
> > + struct fs_context *fc;
> > + struct vfsmount *mnt;
> > + int ret;
> > +
> > + BUG_ON(!d_inode(mntpt));
> > +
> > + fc = vfs_new_fs_context(&afs_fs_type, mntpt, 0,
> > + FS_CONTEXT_FOR_SUBMOUNT);
> > + if (IS_ERR(fc))
> > + return ERR_CAST(fc);
> > +
> > + ret = afs_mntpt_set_params(fc, mntpt);
> > + if (ret < 0)
> > + goto error_fc;
> > +
> > + ret = vfs_get_tree(fc);
> > + if (ret < 0)
> > + goto error_fc;
> > +
> > + mnt = vfs_create_mount(fc, 0);
> > + if (IS_ERR(mnt)) {
> > + ret = PTR_ERR(mnt);
> > + goto error_fc;
> > + }
> >
> > - free_page((unsigned long) devname);
> > - free_page((unsigned long) options);
> > - _leave(" = %p", mnt);
> > + put_fs_context(fc);
> > return mnt;
> >
> Why are you performing a put_fs_context(fc) in the success code path? Do
> we not need a reference of fc anymore?

No. This is the ->d_automount() hook. The context is created at the top of
the function, configuration is done, the superblock is obtained, a mount
object is created (for which we get a ref returned). After that point, we
don't need the context any more and no one else is going to clean it up for
us.

David

2018-06-15 04:20:08

by Eric W. Biederman

Subject: Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]

David Howells <[email protected]> writes:

> Here are a set of patches to create a filesystem context prior to setting
> up a new mount, populating it with the parsed options/binary data, creating
> the superblock and then effecting the mount. This is also used for remount
> since much of the parsing stuff is common in many filesystems.

Dave,
I have read through these patches and I noticed a significant issue.

Today in mount_bdev we do something that looks like:

mount_bdev(...)
{
s = sget(..., bdev);
if (s->s_root) {
/* Noop */
} else {
err = fill_super(s, ...);
if (err) {
deactivate_locked_super(s);
return ERR_PTR(err);
}
s->s_flags |= SB_ACTIVE;
bdev->bd_super = s;
}
return dget(s->s_root);
}

The key point is that we don't process the mount options at all if
a super block already exists in the kernel. Similar to what
your fscontext changes are doing (after parsing the options).

Your fscontext changes do not improve upon this area of the mount api at
all and that concerns me. This is an area where people can and already
do shoot themselves in their feet.

The real-world security issue we had with this involved devpts. The
devpts filesystem requires the mode and gid parameters for new ttys to
be specified to be POSIX compliant. People were setting up chroot
environments and mounting devpts with the wrong arguments. As these two
devpts mounts shared a super block, a change of arguments on one was a
change of arguments on the other. This meant the chroots were
periodically breaking the primary devpts and causing new terminals to be
opened with essentially unusable permissions. Fun when you are trying
to ssh in to a box.

Creating a new mount and finding an old mount are the same operation in
the kernel today. This is fundamentally confusing. In the new api
could we please separate these two operations?

Perhaps something like:
x create
x find

With the "x create" case failing if the filesystem already exists,
still allowing "x find"? And with the "x find" case failing if
the superblock is not already created in the kernel.

That should make it clear to a userspace program what is going on
and give it a chance to mount a filesystem anyway.



In a similar vein, could we please clarify what the rules for changing
mount options for an existing superblock are in the new api?

Today mount assumes that it has to provide all of the existing options
to reconfigure a mount. What people want to do and what most
filesystems support is just specifying the options that need to be
changed. Can we please make this the rule of how things are expected
to work for fscontext? That is, only the mount options being changed
need to be specified before: "x reconfigure"

Eric



2018-06-18 20:31:47

by David Howells

Subject: Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]

Eric W. Biederman <[email protected]> wrote:

> I have read through these patches and I noticed a significant issue.
>
> Today in mount_bdev we do something that looks like:
>
> mount_bdev(...)
> {
> s = sget(..., bdev);
> if (s->s_root) {
> /* Noop */
> } else {
> err = fill_super(s, ...);
> if (err) {
> deactivate_locked_super(s);
> return ERR_PTR(err);
> }
> s->s_flags |= SB_ACTIVE;
> bdev->bd_super = s;
> }
> return dget(s->s_root);
> }
>
> The key point is that we don't process the mount options at all if
> a super block already exists in the kernel. Similar to what
> your fscontext changes are doing (after parsing the options).

Actually, no, that's not the case.

The fscontext code *requires* you to parse the parameters *before* any attempt
to access the superblock is made. Note that this will actually be a problem
for, say, ext4 which passes a text string stored in the superblock through the
parser *before* parsing the mount syscall data. Fun.

I'm intending to deal with that particular case by having ext4 create multiple
private contexts, one filled in from the user data, and then a second one
filled in from the superblock string. These can then be validated one against
the other before the super_block struct is published.

And if the super_block struct already exists, the user's specified parameters
can be validated against that.

> Your fscontext changes do not improve upon this area of the mount api at
> all and that concerns me. This is an area where people can and already
> do shoot themselves in their feet.

This *will* be dealt with, but I wanted to get the core changes upstream
before tackling all the filesystems. The legacy wrapper is just that and
should be got rid of when all the filesystems have been converted.

> ...
>
> Creating a new mount and finding an old mount are the same operation in
> the kernel today. This is fundamentally confusing. In the new api
> could we please separate these two operations?
>
> Perhaps something like:
> x create
> x find
>
> With the "x create" case failing if the filesystem already exists,
> still allowing "x find"? And with the "x find" case failing if
> the superblock is not already created in the kernel.

No. What you're suggesting introduces a userspace-userspace and a
userspace-kernel race - unless you're willing to let userspace lock against
superblock creation by other parties.

Further, some filesystems *have* to be parameterised before you can do the
check for the superblock. Network filesystems, for example, where you have to
set the network parameters upfront and the key to the superblock might not be
known until you've queried the server.

> That should make it clear to a userspace program what is going on
> and give it a chance to mount a filesystem anyway.

That said, I'm willing to provide a "fail if already extant" option if we
think that's actually likely to be of use. However, you'd still have to
provide parameters before the check can be made.

> In a similar vein could we please clarify the rules for changing mount
> options for an existing superblock are in the new api?

You mean remount/reconfigure? Note that we have to provide backward
compatibility with every single filesystem.

> Today mount assumes that it has to provide all of the existing options to
> reconfigure a mount. What people want to do and what most filesystems
> support is just specifying the options that need to be changed. Can we
> please make this the rule of how things are expected to work for fscontext?
> That only changing mount options need to be specified before: "x
> reconfigure"

Fine by me - but it must *also* support every option being specified if that
is what mount currently does.

I don't really want to supply extra parsers if I can avoid it. Miklós, for
example wanted a different, incompatible interface, so you'd do:

write(fd, "o +foo");
write(fd, "o -bar");
write(fd, "x reconfig");

sort of thing to enable or disable options... but this assumes that options
are binary and requires a separate parser to the one that does the initial
configuration - and you still have to support the old remount data parse.

I'm okay with specifying that you should just specify the options you want to
change and that the normal way to 'disable' something is to prefix it with
"no".

I guess I could pass a flag through to indicate that this came from
sys_mount(MS_REMOUNT) rather than the new method.

David

2018-06-18 21:36:21

by Eric W. Biederman

Subject: Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]

David Howells <[email protected]> writes:

> Eric W. Biederman <[email protected]> wrote:
>
>> I have read through these patches and I noticed a significant issue.
>>
>> Today in mount_bdev we do something that looks like:
>>
>> mount_bdev(...)
>> {
>> s = sget(..., bdev);
>> if (s->s_root) {
>> /* Noop */
>> } else {
>> err = fill_super(s, ...);
>> if (err) {
>> deactivate_locked_super(s);
>> return ERR_PTR(err);
>> }
>> s->s_flags |= SB_ACTIVE;
>> bdev->bd_super = s;
>> }
>> return dget(s->s_root);
>> }
>>
>> The key point is that we don't process the mount options at all if
>> a super block already exists in the kernel. Similar to what
>> your fscontext changes are doing (after parsing the options).
>
> Actually, no, that's not the case.
>
> The fscontext code *requires* you to parse the parameters *before* any attempt
> to access the superblock is made. Note that this will actually be a problem
> for, say, ext4 which passes a text string stored in the superblock through the
> parser *before* parsing the mount syscall data. Fun.
>
> I'm intending to deal with that particular case by having ext4 create multiple
> private contexts, one filled in from the user data, and then a second one
> filled in from the superblock string. These can then be validated one against
> the other before the super_block struct is published.
>
> And if the super_block struct already exists, the user's specified parameters
> can be validated against that.

I did not say parse. I said process. My meaning is that today,
on a second mount of a filesystem like ext2 (but really most of them),
we ignore those options.

>> Your fscontext changes do not improve upon this area of the mount api at
>> all and that concerns me. This is an area where people can and already
>> do shoot themselves in their feet.
>
> This *will* be dealt with, but I wanted to get the core changes upstream
> before tackling all the filesystems. The legacy wrapper is just that and
> should be got rid of when all the filesystems have been converted.

This is an area where you are making things explicit, where before
there was no way to talk about this. For filesystems that are in
legacy mode I realize we might miss this corner case, but we should
still think about it and handle it. The proc filesystem has the same
behavior and it is one you are converting.

>> ...
>>
>> Creating a new mount and finding an old mount are the same operation in
>> the kernel today. This is fundamentally confusing. In the new api
>> could we please separate these two operations?
>>
>> Perhaps something like:
>> x create
>> x find
>>
>> With the "x create" case failing if the filesystem already exists,
>> still allowing "x find"? And with the "x find" case failing if
>> the superblock is not already created in the kernel.
>
> No. What you're suggesting introduces a userspace-userspace and a
> userspace-kernel race - unless you're willing to let userspace lock against
> superblock creation by other parties.
>
> Further, some filesystems *have* to be parameterised before you can do the
> check for the superblock. Network filesystems, for example, where you have to
> set the network parameters upfront and the key to the superblock might not be
> known until you've queried the server.

I am not talking about skipping the parameterization. I am talking
about actually acting on those options. Parsing and validating them
ahead of time is not my concern. When we make the super block
honor those options is my concern.

Further I am not suggesting something that has a meaningful race.
I am suggesting something that is the equivalent of the O_EXCL logic.
I am proposing that "x create" fail if the superblock already exists
in the kernel. I am proposing that "x find" will fail if the
superblock does not already exist.

In the worst case you have to iterate a time or two when another
user is racing with you to create the super block. But this
gives you very valuable information: knowledge of whether the superblock
is honoring all of your specified mount options or not.
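
A toy userspace model of the O_EXCL-style split described above, offered only as a sketch. The names (`toy_sb`, `toy_create`, `toy_find`, `toy_create_or_find`) are invented for illustration and do not correspond to any interface in the patch series. There is no locking here; in a real implementation the existence check and the insertion would have to be atomic.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_SB 8

/* Toy stand-in for a superblock: a lookup key plus the options in force. */
struct toy_sb {
	char key[32];
	unsigned int options;
	int in_use;
};

static struct toy_sb toy_table[TOY_MAX_SB];

static struct toy_sb *toy_lookup(const char *key)
{
	for (int i = 0; i < TOY_MAX_SB; i++)
		if (toy_table[i].in_use && strcmp(toy_table[i].key, key) == 0)
			return &toy_table[i];
	return NULL;
}

/* "x create": like O_EXCL, fail with -EEXIST if the key is already present. */
int toy_create(const char *key, unsigned int options, struct toy_sb **out)
{
	if (strlen(key) >= sizeof(toy_table[0].key))
		return -ENAMETOOLONG;
	if (toy_lookup(key))
		return -EEXIST;
	for (int i = 0; i < TOY_MAX_SB; i++) {
		if (toy_table[i].in_use)
			continue;
		strcpy(toy_table[i].key, key);
		toy_table[i].options = options;
		toy_table[i].in_use = 1;
		*out = &toy_table[i];
		return 0;
	}
	return -ENOSPC;
}

/* "x find": fail with -ENOENT if nothing matching exists yet. */
int toy_find(const char *key, struct toy_sb **out)
{
	struct toy_sb *sb = toy_lookup(key);

	if (!sb)
		return -ENOENT;
	*out = sb;
	return 0;
}

/*
 * What a caller that wants the filesystem "anyway" might do: try create,
 * fall back to find, and retry a bounded number of times in case the
 * existing superblock went away in between.  *created reports which path
 * succeeded, so the caller knows whether its options were honoured.
 */
int toy_create_or_find(const char *key, unsigned int options,
		       struct toy_sb **out, int *created)
{
	for (int tries = 0; tries < 3; tries++) {
		int ret = toy_create(key, options, out);

		if (ret == 0) {
			*created = 1;
			return 0;
		}
		if (ret != -EEXIST)
			return ret;

		ret = toy_find(key, out);
		if (ret == 0) {
			*created = 0;
			return 0;
		}
		if (ret != -ENOENT)
			return ret;
	}
	return -EAGAIN;
}
```

The point of the `created` result is the one made above: when it is 0, the options passed in were not applied, and changing them has to go through the reconfigure path.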

It removes an existing nasty race today where people think they mount a
filesystem like "proc" with one set of options and those options are
ignored because an internal kernel mount already exists.

This is at the level of the fscontext API.

I don't care what filesystems that have not been updated to fscontext
do. I just want to avoid the nasty nasty confusion that is possible
with the existing API.

My motivation is I am in the middle of closing a regression in option
parsing in proc that caused a security option to get ignored.

I would be happy even with a result value of "x create" that
reported whether the superblock was "created" or "found", instead of
having two different options.

But I want to be able to say to userspace very clearly: if this
superblock already exists, you need to go through the
remount/reconfigure code path to change its options.


>> That should make it clear to a userspace program what is going on
>> and give it a chance to mount a filesystem anyway.
>
> That said, I'm willing to provide a "fail if already extant" option if we
> think that's actually likely to be of use. However, you'd still have to
> provide parameters before the check can be made.

Yes that is what I am asking for, though very much not optionally. It
is a rare case and it succeeds in the wrong way today, letting people
think they have set a mount option when they have not.

>> In a similar vein could we please clarify the rules for changing mount
>> options for an existing superblock are in the new api?
>
> You mean remount/reconfigure? Note that we have to provide backward
> compatibility with every single filesystem.

Yes. I am thinking explicitly of reconfigure. The old remount
can do whatever is backwards compatible.

>> Today mount assumes that it has to provide all of the existing options to
>> reconfigure a mount. What people want to do and what most filesystems
>> support is just specifying the options that need to be changed. Can we
>> please make this the rule of how things are expected to work for fscontext?
>> That only changing mount options need to be specified before: "x
>> reconfigure"
>
> Fine by me - but it must *also* support every option being specified if that
> is what mount currently does.

Yes. That is what ext4 and xfs support today.

I just want it clear that at least if you use the new reconfigure
interface that you can expect options you have not specified to remain
the same instead of reverting to a default state.
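
A small userspace sketch of that reconfigure rule, assuming boolean options only and the "no" prefix convention discussed in this thread. The flag names and `apply_reconfigure_option()` are hypothetical, not taken from the patches. Options that are not mentioned keep their current state.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical boolean options for illustration. */
struct flag_name {
	const char *name;
	unsigned int bit;
};

const struct flag_name demo_flags[] = {
	{ "foo", 1u << 0 },
	{ "bar", 1u << 1 },
	{ "baz", 1u << 2 },
	{ NULL, 0 },
};

/*
 * Apply one reconfigure option to *flags.  "name" sets the bit, "noname"
 * clears it, and every other bit is left exactly as it was.
 * Returns 0 on success, -1 if the option is not recognised.
 */
int apply_reconfigure_option(unsigned int *flags, const char *opt)
{
	const struct flag_name *f;

	/* Exact matches first, so a name that itself starts with "no" works. */
	for (f = demo_flags; f->name; f++) {
		if (strcmp(opt, f->name) == 0) {
			*flags |= f->bit;
			return 0;
		}
	}

	if (strncmp(opt, "no", 2) == 0) {
		for (f = demo_flags; f->name; f++) {
			if (strcmp(opt + 2, f->name) == 0) {
				*flags &= ~f->bit;
				return 0;
			}
		}
	}
	return -1;
}
```

Specifying every option, as mount does today, still works with this model: repeating an option that is already set is a no-op.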

> I don't really want to supply extra parsers if I can avoid it. Miklós, for
> example wanted a different, incompatible interface, so you'd do:
>
> write(fd, "o +foo");
> write(fd, "o -bar");
> write(fd, "x reconfig");
>
> sort of thing to enable or disable options... but this assumes that options
> are binary and requires a separate parser to the one that does the initial
> configuration - and you still have to support the old remount data parse.
>
> I'm okay with specifying that you should just specify the options you want to
> change and that the normal way to 'disable' something is to prefix it with
> "no".

Yes. That is what I am asking for. Just so we don't need to specify
the options that are staying the same.

> I guess I could pass a flag through to indicate that this came from
> sys_mount(MS_REMOUNT) rather than the new method.

I don't think that would be needed. As some of the most common
filesystems implement this semantic already for sys_mount(MS_REMOUNT).

What I think we might need is to cache the option string during the
transition for unconverted filesystems so that we can give them a full
set of options. That way, if we have an unconverted filesystem like
devpts that assumes a missing option should be set to its default value,
we won't break it.

Eric


2018-06-18 23:34:41

by Theodore Ts'o

Subject: Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]

On Mon, Jun 18, 2018 at 09:30:50PM +0100, David Howells wrote:
>
> The fscontext code *requires* you to parse the parameters *before* any attempt
> to access the superblock is made. Note that this will actually be a problem
> for, say, ext4 which passes a text string stored in the superblock through the
> parser *before* parsing the mount syscall data. Fun.
>
> I'm intending to deal with that particular case by having ext4 create multiple
> private contexts, one filled in from the user data, and then a second one
> filled in from the superblock string. These can then be validated one against
> the other before the super_block struct is published.

Yeah, what we're trying to do is let the options in the superblock act
as defaults which then can be overridden by what the user specifies on
the command line.

So when you parse the user-supplied data, will there be a way to
determine what was specified explicitly, versus what was implied by
the defaults? I'll need that in order to be able to merge the two
contexts together.
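
One way such a merge could work, purely as an illustrative assumption and not something the posted patches provide: each context records a bitmask of which parameters were set explicitly, and the merge prefers the user's explicit values over the superblock defaults. All names here (`demo_ctx`, `demo_set`, `demo_merge`, the `PARAM_*` constants) are hypothetical.

```c
/* Hypothetical parameter indices for illustration. */
enum { PARAM_COMMIT, PARAM_ERRORS, PARAM_COUNT };

struct demo_ctx {
	unsigned int value[PARAM_COUNT];
	unsigned int specified;	/* bit n set => value[n] was set explicitly */
};

/* Record a value and remember that it was given explicitly. */
void demo_set(struct demo_ctx *ctx, int param, unsigned int v)
{
	ctx->value[param] = v;
	ctx->specified |= 1u << param;
}

/*
 * Merge two contexts: anything the user specified wins; everything else
 * falls back to the defaults context (for example, one parsed from the
 * on-disk superblock string).
 */
void demo_merge(struct demo_ctx *result, const struct demo_ctx *defaults,
		const struct demo_ctx *user)
{
	for (int i = 0; i < PARAM_COUNT; i++) {
		const struct demo_ctx *src =
			(user->specified & (1u << i)) ? user : defaults;

		result->value[i] = src->value[i];
	}
	result->specified = defaults->specified | user->specified;
}
```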

- Ted

2018-06-21 18:49:10

by Andrei Vagin

Subject: Re: [16/32] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #8]

On Fri, May 25, 2018 at 01:07:08AM +0100, David Howells wrote:

...

> @@ -1972,57 +1957,51 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
> return ret;
> }
>
> -struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
> - struct cgroup_root *root, unsigned long magic,
> - struct cgroup_namespace *ns)
> +int cgroup_do_get_tree(struct fs_context *fc)
> {
> - struct dentry *dentry;
> - bool new_sb;
> + struct cgroup_fs_context *ctx = fc->fs_private;
> + int ret;
> +
> + ctx->kfc.root = ctx->root->kf_root;

[root@fc24 ~]# mount -t cgroup -o none,name=zdtmtst xxx /mnt/test
[root@fc24 ~]# mkdir /mnt/test/holder
[root@fc24 ~]# umount /mnt/test
[root@fc24 ~]# mount -t cgroup -o none,name=zdtmtst xxx /mnt/test
Killed

ctx->root can be NULL here

[ 93.719897] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 93.720097] PGD 80000002115f5067 P4D 80000002115f5067 PUD 1ef421067 PMD 0
[ 93.720179] Oops: 0000 [#1] SMP PTI
[ 93.720257] CPU: 1 PID: 13843 Comm: cgroup04 Not tainted 4.18.0-rc1-next-20180621+ #1
[ 93.720342] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 93.720432] RIP: 0010:cgroup_do_get_tree+0x1b/0xf0
[ 93.720515] Code: 00 00 02 5b 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 49 89 fc 53 48 83 ec 08 48 8b 9f 90 00 00 00 48 8b 43 20 <48> 8b 00 48 89 03 e8 8a cc 1e 00 85 c0 0f 88 97 00 00 00 48 81 7b
[ 93.720655] RSP: 0018:ffffb07941b03df8 EFLAGS: 00010292
[ 93.720740] RAX: 0000000000000000 RBX: ffff9ba3527da300 RCX: 0000000000000000
[ 93.720819] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9ba34d47b400
[ 93.720897] RBP: ffffb07941b03e58 R08: 0000000000000000 R09: 0000000000000002
[ 93.720975] R10: 0000000000000000 R11: 4aee8a3cb0beb9ec R12: ffff9ba34d47b400
[ 93.721053] R13: ffff9ba351518000 R14: ffffffff961705d4 R15: ffff9ba35143f000
[ 93.721131] FS: 000015418d893740(0000) GS:ffff9ba35fd00000(0000) knlGS:0000000000000000
[ 93.721233] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 93.721336] CR2: 0000000000000000 CR3: 00000001c4658004 CR4: 00000000001606e0
[ 93.721421] Call Trace:
[ 93.721508] cgroup1_get_tree+0x57c/0x640
[ 93.721587] vfs_get_tree+0x6e/0x180
[ 93.721665] do_mount+0x76b/0xa80
[ 93.721753] ksys_mount+0x80/0xd0
[ 93.721831] __x64_sys_mount+0x21/0x30
[ 93.721908] do_syscall_64+0x60/0x1b0
[ 93.721987] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 93.722065] RIP: 0033:0x15418d3bc85a

I think we need something like this:

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index e12c0a91b8a4..b1340bd5f5fc 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1192,6 +1192,7 @@ int cgroup1_get_tree(struct fs_context *fc)
}

ret = 0;
+ ctx->root = root;
goto out_unlock;
}

@@ -1241,6 +1242,7 @@ int cgroup1_get_tree(struct fs_context *fc)
percpu_ref_reinit(&root->cgrp.self.refcnt);
mutex_unlock(&cgroup_mutex);
}
+ cgroup_get(&root->cgrp);

/*
* If @pinned_sb, we're reusing an existing root and holding an

https://travis-ci.org/avagin/linux/jobs/394887987

>
> - dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
> + ret = kernfs_get_tree(fc);
> + if (ret < 0)
> + goto out_cgrp;
>
> /*
> * In non-init cgroup namespace, instead of root cgroup's dentry,
> * we return the dentry corresponding to the cgroupns->root_cgrp.
> */
> - if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {

2018-06-22 12:53:26

by David Howells

Subject: Re: [16/32] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #8]

Andrei Vagin <[email protected]> wrote:

> ret = 0;
> + ctx->root = root;
> goto out_unlock;

Okay, I can see that.

> percpu_ref_reinit(&root->cgrp.self.refcnt);
> mutex_unlock(&cgroup_mutex);
> }
> + cgroup_get(&root->cgrp);

This probably needs to be conditional on ret == 0.

Which version are you testing btw? The patches in git have been fixed a
little from what was last posted.

David

2018-06-22 15:34:09

by Andrei Vagin

Subject: Re: [16/32] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #8]

On Fri, Jun 22, 2018 at 01:52:16PM +0100, David Howells wrote:
> Andrei Vagin <[email protected]> wrote:
>
> > ret = 0;
> > + ctx->root = root;
> > goto out_unlock;
>
> Okay, I can see that.
>
> > percpu_ref_reinit(&root->cgrp.self.refcnt);
> > mutex_unlock(&cgroup_mutex);
> > }
> > + cgroup_get(&root->cgrp);
>
> This probably needs to be conditional on ret == 0.

yes, you are right

>
> Which version are you testing btw? The patches in git have been fixed a
> little from what was last posted.

I'm testing linux-next-20180621

commit 8439c34f07a3f58245e933ca2703239417288363 (tag: next-20180621,
linux-next/master)
Author: Stephen Rothwell <[email protected]>
Date: Thu Jun 21 14:09:41 2018 +1000

Add linux-next specific files for 20180621

Signed-off-by: Stephen Rothwell <[email protected]>

>
> David

2018-06-22 17:00:25

by Andrei Vagin

Subject: Re: [16/32] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #8]

On Fri, Jun 22, 2018 at 08:30:29AM -0700, Andrei Vagin wrote:
> On Fri, Jun 22, 2018 at 01:52:16PM +0100, David Howells wrote:
> > Andrei Vagin <[email protected]> wrote:
> >
> > > ret = 0;
> > > + ctx->root = root;
> > > goto out_unlock;
> >
> > Okay, I can see that.
> >
> > > percpu_ref_reinit(&root->cgrp.self.refcnt);
> > > mutex_unlock(&cgroup_mutex);
> > > }
> > > + cgroup_get(&root->cgrp);
> >
> > This probably needs to be conditional on ret == 0.
>
> yes, you are right


I've read the code and I think it isn't obvious. A reference will be
released in cgroup_fs_context_free() even if ret isn't zero here.

I look at do_new_mount()

vfs_new_fs_context()
...
if (vfs_get_tree())
goto out_fc;
....
out_fc:
put_fs_context(fc);
fc->ops->free(fc);
cgroup_fs_context_free()
cgroup_put(&ctx->root->cgrp);

>
> >
> > Which version are you testing btw? The patches in git have been fixed a
> > little from what was last posted.
>
> I'm testing linux-next-20180621
>
> commit 8439c34f07a3f58245e933ca2703239417288363 (tag: next-20180621,
> linux-next/master)
> Author: Stephen Rothwell <[email protected]>
> Date: Thu Jun 21 14:09:41 2018 +1000
>
> Add linux-next specific files for 20180621
>
> Signed-off-by: Stephen Rothwell <[email protected]>
>
> >
> > David

2018-06-23 23:36:06

by David Howells

Subject: Re: [16/32] kernfs, sysfs, cgroup, intel_rdt: Support fs_context [ver #8]

Andrei Vagin <[email protected]> wrote:

> > > > percpu_ref_reinit(&root->cgrp.self.refcnt);
> > > > mutex_unlock(&cgroup_mutex);
> > > > }
> > > > + cgroup_get(&root->cgrp);
> > >
> > > This probably needs to be conditional on ret == 0.
> >
> > yes, you are right
>
>
> I've read the code and I think it isn't obvious. A reference will be
> released in cgroup_fs_context_free() even if ret isn't zero here.
>
> I look at do_new_mount()
>
> vfs_new_fs_context()
> ...
> if (vfs_get_tree())
> goto out_fc;
> ....
> out_fc:
> put_fs_context(fc);
> fc->ops->free(fc);
> cgroup_fs_context_free()
> cgroup_put(&ctx->root->cgrp);

Yeah, you're right: ctx->root is set above, so the put will trigger anyway.
I'll fold both of these changes in.

David

2018-07-03 18:34:24

by Eric Biggers

Subject: Re: [PATCH 10/32] VFS: Implement a filesystem superblock creation/configuration context [ver #8]

On Fri, May 25, 2018 at 01:06:29AM +0100, David Howells wrote:
> +/**
> + * sget_fc - Find or create a superblock
> + * @fc: Filesystem context.
> + * @test: Comparison callback
> + * @set: Setup callback
> + *
> + * Find or create a superblock using the parameters stored in the filesystem
> + * context and the two callback functions.
> + *
> + * If an extant superblock is matched, then that will be returned with an
> + * elevated reference count that the caller must transfer or discard.
> + *
> + * If no match is made, a new superblock will be allocated and basic
> + * initialisation will be performed (s_type, s_fs_info and s_id will be set and
> + * the set() callback will be invoked), the superblock will be published and it
> + * will be returned in a partially constructed state with SB_BORN and SB_ACTIVE
> + * as yet unset.
> + */
> +struct super_block *sget_fc(struct fs_context *fc,
> + int (*test)(struct super_block *, struct fs_context *),
> + int (*set)(struct super_block *, struct fs_context *))
> +{
> + struct super_block *s = NULL;
> + struct super_block *old;
> + int err;
> +
> + if (!(fc->sb_flags & SB_KERNMOUNT) &&
> + fc->purpose != FS_CONTEXT_FOR_SUBMOUNT) {
> + /* Don't allow mounting unless the caller has CAP_SYS_ADMIN
> + * over the namespace.
> + */
> + if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT) &&
> + !capable(CAP_SYS_ADMIN))
> + return ERR_PTR(-EPERM);
> + else if (!ns_capable(fc->user_ns, CAP_SYS_ADMIN))
> + return ERR_PTR(-EPERM);
> + }
> +
> +retry:
> + spin_lock(&sb_lock);
> + if (test) {
> + hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
> + if (!test(old, fc))
> + continue;
> + if (fc->user_ns != old->s_user_ns) {
> + spin_unlock(&sb_lock);
> + if (s) {
> + up_write(&s->s_umount);
> + destroy_unused_super(s);
> + }

->s_umount is released once here and again in destroy_unused_super().

> + return ERR_PTR(-EBUSY);
> + }
> + if (!grab_super(old))
> + goto retry;
> + if (s) {
> + up_write(&s->s_umount);
> + destroy_unused_super(s);

Same bug here.

> + up_write(&s->s_umount);
> + destroy_unused_super(s);

And here.

- Eric

2018-07-03 21:54:30

by David Howells

Subject: Re: [PATCH 10/32] VFS: Implement a filesystem superblock creation/configuration context [ver #8]

Eric Biggers <[email protected]> wrote:

> ->s_umount is released once here and again in destroy_unused_super().

Good catch, thanks. The interface has changed over the lifetime of the
patches. How about the attached patch?

David
---
commit b3899e214a6a0e0551f6dc707b28d61b11e718a5
Author: David Howells <[email protected]>
Date: Tue Jul 3 22:35:28 2018 +0100

vfs: Locking fix for sget_fc()

In sget_fc(), don't drop the s_umount lock before calling
destroy_unused_super(), as that function drops the lock itself.

Fixes: 8a2e54b8af88 ("vfs: Implement a filesystem superblock creation/configuration context")
Reported-by: Eric Biggers <[email protected]>
Signed-off-by: David Howells <[email protected]>

diff --git a/fs/super.c b/fs/super.c
index 43400f5fa33a..b014cd48a451 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -516,19 +516,14 @@ struct super_block *sget_fc(struct fs_context *fc,
continue;
if (fc->user_ns != old->s_user_ns) {
spin_unlock(&sb_lock);
- if (s) {
- up_write(&s->s_umount);
+ if (s)
destroy_unused_super(s);
- }
return ERR_PTR(-EBUSY);
}
if (!grab_super(old))
goto retry;
- if (s) {
- up_write(&s->s_umount);
+ if (s)
destroy_unused_super(s);
- s = NULL;
- }
return old;
}
}
@@ -545,7 +540,6 @@ struct super_block *sget_fc(struct fs_context *fc,
if (err) {
s->s_fs_info = NULL;
spin_unlock(&sb_lock);
- up_write(&s->s_umount);
destroy_unused_super(s);
return ERR_PTR(err);
}

2018-07-03 21:59:24

by Al Viro

Subject: Re: [PATCH 10/32] VFS: Implement a filesystem superblock creation/configuration context [ver #8]

On Tue, Jul 03, 2018 at 10:53:22PM +0100, David Howells wrote:
> - if (s) {
> - up_write(&s->s_umount);
> + if (s)
> destroy_unused_super(s);

static void destroy_unused_super(struct super_block *s)
{
if (!s)
return;
...
}

IOW, all of those should be unconditional.

2018-07-03 22:08:40

by David Howells

Subject: Re: [PATCH 10/32] VFS: Implement a filesystem superblock creation/configuration context [ver #8]

Al Viro <[email protected]> wrote:

> IOW, all of those should be unconditional.

Fair point. How about the attached, then?

David
---
commit 1aa76514c426150af429d111cec256e81729fa6f
Author: David Howells <[email protected]>
Date: Tue Jul 3 22:35:28 2018 +0100

vfs: Locking fix for sget_fc()

In sget_fc(), don't drop the s_umount lock before calling
destroy_unused_super(), as that function drops the lock itself.

Fixes: 8a2e54b8af88 ("vfs: Implement a filesystem superblock creation/configuration context")
Reported-by: Eric Biggers <[email protected]>
Signed-off-by: David Howells <[email protected]>

diff --git a/fs/super.c b/fs/super.c
index 43400f5fa33a..dccd397751b1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -516,19 +516,12 @@ struct super_block *sget_fc(struct fs_context *fc,
continue;
if (fc->user_ns != old->s_user_ns) {
spin_unlock(&sb_lock);
- if (s) {
- up_write(&s->s_umount);
- destroy_unused_super(s);
- }
+ destroy_unused_super(s);
return ERR_PTR(-EBUSY);
}
if (!grab_super(old))
goto retry;
- if (s) {
- up_write(&s->s_umount);
- destroy_unused_super(s);
- s = NULL;
- }
+ destroy_unused_super(s);
return old;
}
}
@@ -545,7 +538,6 @@ struct super_block *sget_fc(struct fs_context *fc,
if (err) {
s->s_fs_info = NULL;
spin_unlock(&sb_lock);
- up_write(&s->s_umount);
destroy_unused_super(s);
return ERR_PTR(err);
}
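
For readers following the thread, the locking rule behind this fix can be modelled in user space. This is only a rough sketch, not kernel code: struct mock_sb, mock_alloc_super(), mock_up_write(), mock_destroy_unused_super(), cleanup_buggy() and cleanup_fixed() are all hypothetical names invented for illustration. It assumes, as the thread states, that destroy_unused_super() tolerates NULL and releases s_umount itself, and additionally assumes that a freshly allocated superblock is handed back with s_umount held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct super_block; only models s_umount. */
struct mock_sb {
	bool umount_held;	/* true while "s_umount" is held for write */
};

/* Counts releases of a lock that was not held (the bug being modelled). */
static int bad_unlocks;

/* Assumption: a new superblock is returned with the lock already held. */
static struct mock_sb *mock_alloc_super(void)
{
	struct mock_sb *s = malloc(sizeof(*s));

	if (s)
		s->umount_held = true;
	return s;
}

/* Models up_write(): releasing an unheld lock is recorded as an error. */
static void mock_up_write(struct mock_sb *s)
{
	if (!s->umount_held)
		bad_unlocks++;
	s->umount_held = false;
}

/* Models destroy_unused_super(): NULL-safe, and drops the lock itself. */
static void mock_destroy_unused_super(struct mock_sb *s)
{
	if (!s)
		return;
	mock_up_write(s);
	free(s);
}

/* Pattern from the original patch: unlock, then destroy (double release). */
static void cleanup_buggy(struct mock_sb *s)
{
	if (s) {
		mock_up_write(s);
		mock_destroy_unused_super(s);
	}
}

/* Pattern after the fix: one unconditional call, no NULL check needed. */
static void cleanup_fixed(struct mock_sb *s)
{
	mock_destroy_unused_super(s);
}
```

Calling cleanup_fixed() with either a fresh object or NULL leaves bad_unlocks untouched, while cleanup_buggy() on a fresh object records one release of a lock that is no longer held, which corresponds to the three spots Eric pointed out.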