Here's a set of patches that adds a system call, fsinfo(), that allows
information about the VFS, mount topology, superblock and files to be
retrieved.
The patchset is based on top of the notifications patchset and allows event
counters implemented in the latter to be retrieved to allow overruns to be
efficiently managed.
Included are a couple of sample programs plus limited example code for NFS
and Ext4. The example code is not intended to go upstream as-is.
=======
THE WHY
=======
Why do we want this?
Using /proc/mounts (or similar) has problems:
(1) Reading from it holds a global lock (namespace_sem) that prevents
mounting and unmounting. Lots of data is encoded and mangled into
text whilst the lock is held, including superblock option strings and
mount point paths. This causes performance problems when there are a
lot of mount objects in a system.
(2) Even though namespace_sem is held during a read, reading the whole
file isn't necessarily atomic with respect to mount-type operations.
If a read isn't satisfied in one go, then it may return to userspace
briefly and then continue reading some way into the file. But changes
can occur in the interval that may then go unseen.
(3) Determining what has changed means parsing and comparing consecutive
outputs of /proc/mounts.
(4) Querying a specific mount or superblock means searching through
/proc/mounts and searching by path or mount ID - but we might have an
fd we want to query.
(5) Mount topology is not explicit. One must derive it manually by
comparing entries.
(6) Whilst you can poll() it for events, it only tells you that something
changed in the namespace, not what or whether you can even see the
change.
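To illustrate the text-mangling cost in point (1): fields in /proc/mounts have characters such as spaces escaped as three-digit octal sequences, so every consumer has to undo that before it can use a path. A minimal sketch of the decoding step follows; the function name demo_unmangle() is made up for illustration and is not part of any patch here.

```c
#include <stddef.h>

/*
 * Decode "\ooo" octal escapes (e.g. "\040" for a space) in place.
 * demo_unmangle() is a hypothetical helper name used only for this sketch.
 */
static void demo_unmangle(char *s)
{
	char *out = s;

	while (*s) {
		if (s[0] == '\\' &&
		    s[1] >= '0' && s[1] <= '3' &&
		    s[2] >= '0' && s[2] <= '7' &&
		    s[3] >= '0' && s[3] <= '7') {
			/* Three octal digits encode one byte. */
			*out++ = (char)(((s[1] - '0') << 6) |
					((s[2] - '0') << 3) |
					(s[3] - '0'));
			s += 4;
		} else {
			*out++ = *s++;
		}
	}
	*out = '\0';
}
```

Given the field "/mnt/my\040disk", this leaves "/mnt/my disk" in the buffer. A binary interface avoids both the escaping in the kernel and the unescaping in userspace.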
To fix the notification issues, the preceding notifications patchset added
mount watch notifications whereby you can watch for notifications in a
specific mount subtree. The notification messages include the ID(s) of the
affected mounts.
To support notifications, however, we need to be able to handle overruns in
the notification queue. I added a number of event counters to struct
super_block and struct mount to allow you to pin down the changes, but
there needs to be a way to retrieve them. Exposing them through /proc
would require adding yet another /proc/mounts-type file. We could add
per-mount directories full of attributes in sysfs, but that has issues also
(see below).
Adding an extensible system call interface for retrieving filesystem
information also allows other things to be exposed:
(1) Jeff Layton's error handling changes need a way to allow error event
information to be retrieved.
(2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
actually 3-state { Set, Unset, Not supported }. It could be useful to
provide a way to expose information like this[*].
(3) Limits of the numerical metadata values in a filesystem[*].
(4) Filesystem capability information[*]. Filesystems don't all have the
same capabilities, and even different instances may have different
capabilities, particularly with network filesystems where the set of
capabilities may be server-dependent. Capabilities might even vary at file
granularity - though possibly such information should be conveyed
through statx() instead.
(5) ID mapping/shifting tables in use for a superblock.
(6) Filesystem-specific information. I need something for AFS so that I
can do pioctl()-emulation, thereby allowing me to implement certain of
the AFS command line utilities that query state of a particular file.
This could also have application for other filesystems, such as NFS,
CIFS and ext4.
[*] In a lot of cases these are probably fixed and can be memcpy'd from
static data.
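The 3-state idea in point (2) can be sketched with two masks: one saying which bits the filesystem supports and one carrying the values, much as statx() pairs stx_attributes with stx_attributes_mask. The enum and function names below are assumptions made for this sketch only.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum demo_tristate {
	DEMO_NOT_SUPPORTED,
	DEMO_UNSET,
	DEMO_SET,
};

/*
 * Classify one flag bit given the value mask and the mask of bits that
 * the filesystem actually supports.
 */
static enum demo_tristate demo_flag_state(uint64_t values,
					  uint64_t supported,
					  uint64_t bit)
{
	if (!(supported & bit))
		return DEMO_NOT_SUPPORTED;
	return (values & bit) ? DEMO_SET : DEMO_UNSET;
}
```

Without the second mask, a clear bit is ambiguous: it could mean "unset" or "this filesystem cannot report it".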
There's a further consideration: I want to make it possible to have
fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
such that the manager can supervise a mount attempted inside the container.
The manager would be given an fd pointing to the fs_context struct and
would then need some way to query it (fsinfo()) and modify it (fsconfig()).
This could also be used to arbitrate user-requested mounts when containers
are not in play.
============================
WHY NOT USE PROCFS OR SYSFS?
============================
Why is it better to go with a new system call rather than adding more magic
stuff to /proc or /sys for each superblock object and each mount object?
(1) It can be targeted. It makes it easy to query directly by path or
fd, but can also query by mount ID or fscontext fd. procfs and sysfs
cannot do three of these things easily.
(2) Easier to provide LSM oversight. Is the accessing process allowed to
query information pertinent to a particular file?
(3) It's more efficient as we can return specific binary data rather than
making huge text dumps. Granted, sysfs and procfs could present the
same data, though as lots of little files which have to be
individually opened, read, closed and parsed.
(4) We wouldn't have the overhead of open and close (even adding a
self-contained readfile() syscall has to do that internally).
(5) Opening a file in procfs or sysfs has a pathwalk overhead for each
file accessed. We can use an integer attribute ID instead (yes, this
is similar to ioctl) - but could also use a string ID if that is
preferred.
(6) Can query cross-namespace if, say, a container manager process is
given an fs_context that hasn't yet been mounted into a namespace - or
hasn't even been fully created yet.
(7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
mount happens or is removed - and since systemd makes much use of
mount namespaces and mount propagation, this will create a lot of
nodes.
================
DESIGN DECISIONS
================
(1) Information is partitioned into sets of attributes.
(2) Attribute IDs are integers as they're fast to compare.
(3) Attribute values are typed (struct, list of structs, string, opaque
blob). The type is fixed for a particular attribute.
(4) For structure types, the length is also a version. New fields can be
tacked onto the end.
(5) When copying a versioned struct to userspace, the core handles a
version mismatch by truncating or zero-padding the data as necessary.
None of this is seen by the filesystem.
(6) The core handles all the buffering and buffer resizing.
(7) The filesystem never gets any access to the userspace parameter buffer
or result buffer.
(8) "Meta" attributes can describe other attributes.
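A rough sketch of how decisions (2) and (3) fit together: a table keyed by integer ID in which each entry carries a fixed type and a getter, scanned by a small lookup helper. All of the demo_* names are hypothetical and simplified; the real definitions are in the patches (struct fsinfo_attribute and the FSINFO_STRING()/FSINFO_VSTRUCT() macros).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the real attribute definitions. */
enum demo_attr_type { DEMO_STRUCT, DEMO_STRING, DEMO_LIST, DEMO_OPAQUE };

struct demo_attribute {
	unsigned int		id;	/* integer attribute ID */
	enum demo_attr_type	type;	/* fixed for a given attribute */
	size_t			size;	/* struct/element size; 0 for strings/blobs */
	int (*get)(void *buffer, size_t buf_size);
};

/* Example getter: returns the string length, or -1 if the buffer is too small. */
static int demo_get_label(void *buffer, size_t buf_size)
{
	static const char label[] = "scratch";

	if (buf_size < sizeof(label))
		return -1;
	memcpy(buffer, label, sizeof(label));
	return (int)(sizeof(label) - 1);
}

/* The table ends with an entry whose ->get is NULL. */
static const struct demo_attribute demo_table[] = {
	{ 0x01, DEMO_STRING, 0, demo_get_label },
	{ 0 }
};

/* Scan a table for an ID; integer comparison keeps this cheap. */
static const struct demo_attribute *
demo_find_attribute(const struct demo_attribute *table, unsigned int id)
{
	for (; table->get; table++)
		if (table->id == id)
			return table;
	return NULL;
}
```

A filesystem with several such tables would simply try each in turn until one of them claims the ID.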
========
OVERVIEW
========
fsinfo() is a system call that allows information about the filesystem at a
particular path point to be queried as a set of attributes.
Attribute values are of four basic types:
(1) Structure with version-dependent length (the length is the version).
(2) Variable-length string.
(3) List of structures (all the same length).
(4) Opaque blob.
Attributes can have multiple values either as a sequence of values or a
sequence-of-sequences of values and all the values of a particular
attribute must be of the same type. Values can be up to INT_MAX size,
subject to memory availability.
Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but the values still have to be of the type for that attribute.
I've tried to make the interface as light as possible: attribute IDs are
integers rather than strings, and the core does all the buffer allocation and
expansion and all the extensibility support work rather than leaving that
to the filesystems. This means that userspace pointers are not exposed to
the filesystem.
fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:
(1) General superblock attributes:
- Filesystem identifiers (UUID, volume label, device numbers, ...)
- The limits on a filesystem's capabilities
- Information on supported statx fields and attributes and IOC flags.
- A variety of single-bit flags indicating supported capabilities.
- Timestamp resolution and range.
- The amount of space/free space in a filesystem (as statfs()).
- Superblock notification counter.
(2) Filesystem-specific superblock attributes:
- Superblock-level timestamps.
- Cell name, workgroup or other netfs grouping concept.
- Server names and addresses.
(3) VFS information:
- Mount topology information.
- Mount attributes.
- Mount notification counter.
- Mount point path.
(4) Information about what the fsinfo() syscall itself supports, including
the type and struct size of attributes.
The system is extensible:
(1) New attributes can be added. There is no requirement that a
filesystem implement every attribute. A helper function is provided
to scan a list of attributes and a filesystem can have multiple such
lists.
(2) Structure attributes with version-dependent length can be made larger and
have additional information tacked on the end, provided it keeps the
layout of the existing fields. If an older process asks for a shorter
structure, it will only be given the bits it asks for. If a newer
process asks for a longer structure on an older kernel, the extra
space will be set to 0. In all cases, the size of the data actually
available is returned.
In essence, the size of a structure is that structure's version: a
smaller size is an earlier version and a later version includes
everything that the earlier version did.
(3) New single-bit capability flags can be added. This is a structure-typed
attribute and, as such, (2) applies. Any bits you wanted but the kernel
doesn't support are automatically set to 0.
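The truncate-or-zero-pad rule in (2) can be sketched as a single copy helper. This is a userspace illustration under the assumption that the caller passes its own structure size and the provider knows the size it can fill; demo_copy_versioned() is a made-up name, not the kernel's function.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy 'avail' bytes of provider data into a caller buffer of 'want' bytes.
 * A shorter request is truncated; a longer one has its tail zeroed.
 * Returns the size actually available, whatever the caller asked for.
 */
static size_t demo_copy_versioned(void *dst, size_t want,
				  const void *src, size_t avail)
{
	size_t n = want < avail ? want : avail;

	memcpy(dst, src, n);
	if (want > n)
		memset((char *)dst + n, 0, want - n);
	return avail;
}
```

Because the return value is the available size, a newer program can tell that an older kernel filled in only the first part of the structure.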
fsinfo() may be called like the following, for example:
	struct fsinfo_params params = {
		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &address, sizeof(address));
The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:
	struct fsinfo_params params = {
		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_NFS_SERVER_NAME,
	};
	char server_name[256];
	len = fsinfo(AT_FDCWD, "/home/dhowells/", &params,
		     &server_name, sizeof(server_name));
would retrieve the name of the NFS server as a string.
In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:
	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_CONFIGURATION,
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));
even if that context doesn't currently have a superblock attached.
The patches can be found here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
on branch:
fsinfo-core
===================
SIGNIFICANT CHANGES
===================
ver #18:
(*) Moved the mount and superblock notification patches into a different
branch.
(*) Made superblock configuration (->show_opts), bindmount path
(->show_path) and filesystem statistics (->show_stats) available as
the CONFIGURATION, MOUNT_PATH and FS_STATISTICS attributes.
(*) Made mountpoint device name available, filtered through the superblock
(->show_devname), as the SOURCE attribute.
(*) Made the mountpoint available as a full path as well as a relative
one.
(*) Added more event counters to MOUNT_INFO, including a subtree
notification counter, to make it easier to clean up after a
notification overrun.
(*) Made the event counter value returned by MOUNT_CHILDREN the sum of the
five event counters.
(*) Added a mount uniquifier and added that to the MOUNT_CHILDREN entries
also so that mount ID reuse can be detected.
(*) Merged the SB_NOTIFICATION attribute into the MOUNT_INFO attribute to
avoid duplicate information.
(*) Switched to using the RESOLVE_* flags rather than AT_* flags for
pathwalk control. Added more RESOLVE_* flags.
(*) Used a lock instead of RCU to enumerate children for the
MOUNT_CHILDREN attribute for safety. This is probably worth
revisiting at a later date, however.
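As a sketch of how the uniquifier and the event counters might be used together after a notification overrun: take a fresh snapshot of each mount and compare it with the cached one. The structure and field names below are hypothetical and do not match the layout of MOUNT_INFO or MOUNT_CHILDREN; both records are assumed to carry the same mount ID.

```c
#include <stdint.h>

/* Hypothetical cached record for one mount; not the real UAPI layout. */
struct demo_mount_snap {
	uint32_t mnt_id;	/* mount ID, which may be reused */
	uint64_t unique_id;	/* uniquifier, assumed never reused */
	uint64_t counter;	/* sum of the mount's event counters */
};

enum demo_change { DEMO_SAME, DEMO_CHANGED, DEMO_REPLACED };

/* Compare a cached record with a fresh one for the same mount ID. */
static enum demo_change demo_compare(const struct demo_mount_snap *old,
				     const struct demo_mount_snap *cur)
{
	if (old->unique_id != cur->unique_id)
		return DEMO_REPLACED;	/* ID was reused by a different mount */
	if (old->counter != cur->counter)
		return DEMO_CHANGED;	/* same mount, events were missed */
	return DEMO_SAME;
}
```

Only mounts reported as changed or replaced need to be re-queried, so recovery cost scales with the number of changes rather than the size of the mount tree.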
ver #17:
(*) Applied comments from Jann Horn, Darrick Wong and Christian Brauner.
(*) Rearranged the order in which fsinfo() does things so that the
superblock operations table can have a function pointer rather than a
table pointer. The ->fsinfo() op is now called at least twice, once
to determine the size of buffer needed and then to retrieve the data.
If the retrieval step indicates yet more space is needed, the buffer
will be expanded and that step repeated.
(*) Merge the element size into the size in the fsinfo_attribute def and
don't set size for strings or opaques. Let a helper work that out.
This means that strings can actually get larger than 4K.
(*) A helper is provided to scan a list of attributes and call the
appropriate get function. This can be called from a filesystem's
->fsinfo() method multiple times. It also handles attribute
enumeration and info querying.
(*) Rearranged the patches to put all the notification patches first.
This allowed some of the bits to be squashed together. At some point,
I'll move the notification patches into a different branch.
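The probe-then-retrieve sequence described above for the ->fsinfo() op can be sketched in userspace terms as follows. This assumes a getter that always returns the size it needs and writes only when the buffer is large enough; demo_fetch(), demo_get_fn and demo_get_name() are illustrative names, not the kernel's.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical getter: returns the size needed, writes only if it fits. */
typedef size_t (*demo_get_fn)(void *buf, size_t buf_size);

static size_t demo_get_name(void *buf, size_t buf_size)
{
	static const char name[] = "server.example";

	if (buf && buf_size >= sizeof(name))
		memcpy(buf, name, sizeof(name));
	return sizeof(name);
}

/* Probe for the size, then retrieve, growing the buffer if needed. */
static void *demo_fetch(demo_get_fn get, size_t *len)
{
	size_t size = get(NULL, 0);	/* first call: size probe only */
	void *buf = NULL;

	for (;;) {
		void *tmp = realloc(buf, size ? size : 1);
		size_t need;

		if (!tmp) {
			free(buf);
			return NULL;
		}
		buf = tmp;
		need = get(buf, size);	/* second call: actual retrieval */
		if (need <= size) {
			*len = need;
			return buf;
		}
		size = need;		/* value grew meanwhile; expand and retry */
	}
}
```

Keeping this loop in one place is what lets individual getters stay free of any allocation or resizing logic.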
ver #16:
(*) Split the features bits out of the fsinfo() core into their own patch
and got rid of the name encoding attributes.
(*) Renamed the 'array' type to 'list' and made AFS use it for returning
server address lists.
(*) Changed the ->fsinfo() method into an ->fsinfo_attributes[] table,
where each attribute has a ->get() method to deal with it. These
tables can then be returned with an fsinfo meta attribute.
(*) Dropped the fscontext query and parameter/description retrieval
attributes for now.
(*) Picked the mount topology attributes into this branch.
(*) Picked the mount notifications into this branch and rebased on top of
notifications-pipe-core.
(*) Picked the superblock notifications into this branch.
(*) Add sample code for Ext4 and NFS.
David
---
David Howells (14):
VFS: Add additional RESOLVE_* flags
fsinfo: Add fsinfo() syscall to query filesystem information
fsinfo: Provide a bitmap of supported features
fsinfo: Allow retrieval of superblock devname, options and stats
fsinfo: Allow fsinfo() to look up a mount object by ID
fsinfo: Add a uniquifier ID to struct mount
fsinfo: Allow mount information to be queried
fsinfo: Allow the mount topology propogation flags to be retrieved
fsinfo: Provide notification overrun handling support
fsinfo: sample: Mount listing program
fsinfo: Add API documentation
fsinfo: Add support for AFS
fsinfo: Example support for Ext4
fsinfo: Example support for NFS
Documentation/filesystems/fsinfo.rst | 564 +++++++++++++++++
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/include/asm/unistd.h | 2
arch/ia64/kernel/syscalls/syscall.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
arch/xtensa/kernel/syscalls/syscall.tbl | 1
fs/Kconfig | 7
fs/Makefile | 1
fs/afs/internal.h | 1
fs/afs/super.c | 218 +++++++
fs/d_path.c | 2
fs/ext4/Makefile | 1
fs/ext4/ext4.h | 6
fs/ext4/fsinfo.c | 45 +
fs/ext4/super.c | 3
fs/fsinfo.c | 720 ++++++++++++++++++++++
fs/internal.h | 13
fs/mount.h | 3
fs/namespace.c | 362 +++++++++++
fs/nfs/Makefile | 1
fs/nfs/fsinfo.c | 230 +++++++
fs/nfs/internal.h | 6
fs/nfs/nfs4super.c | 3
fs/nfs/super.c | 3
fs/open.c | 8
include/linux/fcntl.h | 3
include/linux/fs.h | 4
include/linux/fsinfo.h | 111 +++
include/linux/syscalls.h | 4
include/uapi/asm-generic/unistd.h | 4
include/uapi/linux/fsinfo.h | 360 +++++++++++
include/uapi/linux/mount.h | 10
include/uapi/linux/openat2.h | 8
include/uapi/linux/windows.h | 35 +
kernel/sys_ni.c | 1
samples/vfs/Makefile | 7
samples/vfs/test-fsinfo.c | 880 +++++++++++++++++++++++++++
samples/vfs/test-mntinfo.c | 277 ++++++++
50 files changed, 3905 insertions(+), 14 deletions(-)
create mode 100644 Documentation/filesystems/fsinfo.rst
create mode 100644 fs/ext4/fsinfo.c
create mode 100644 fs/fsinfo.c
create mode 100644 fs/nfs/fsinfo.c
create mode 100644 include/linux/fsinfo.h
create mode 100644 include/uapi/linux/fsinfo.h
create mode 100644 include/uapi/linux/windows.h
create mode 100644 samples/vfs/test-fsinfo.c
create mode 100644 samples/vfs/test-mntinfo.c
Add the ability to list some Ext4 volume timestamps as an example.
Is this useful for ext4? Is there anything else that could be useful?
Signed-off-by: David Howells <[email protected]>
cc: "Theodore Ts'o" <[email protected]>
cc: Andreas Dilger <[email protected]>
cc: [email protected]
---
fs/ext4/Makefile | 1 +
fs/ext4/ext4.h | 6 ++++++
fs/ext4/fsinfo.c | 45 +++++++++++++++++++++++++++++++++++++++++++
fs/ext4/super.c | 3 +++
include/uapi/linux/fsinfo.h | 16 +++++++++++++++
samples/vfs/test-fsinfo.c | 35 +++++++++++++++++++++++++++++++++
6 files changed, 106 insertions(+)
create mode 100644 fs/ext4/fsinfo.c
diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 4ccb3c9189d8..71d5b460c7c7 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -16,3 +16,4 @@ ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
ext4-inode-test-objs += inode-test.o
obj-$(CONFIG_EXT4_KUNIT_TESTS) += ext4-inode-test.o
ext4-$(CONFIG_FS_VERITY) += verity.o
+ext4-$(CONFIG_FSINFO) += fsinfo.o
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9a2ee2428ecc..461968a87cd6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -42,6 +42,7 @@
#include <linux/fscrypt.h>
#include <linux/fsverity.h>
+#include <linux/fsinfo.h>
#include <linux/compiler.h>
@@ -3166,6 +3167,11 @@ extern const struct inode_operations ext4_file_inode_operations;
extern const struct file_operations ext4_file_operations;
extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
+/* fsinfo.c */
+#ifdef CONFIG_FSINFO
+extern int ext4_fsinfo(struct path *path, struct fsinfo_context *ctx);
+#endif
+
/* inline.c */
extern int ext4_get_max_inline_size(struct inode *inode);
extern int ext4_find_inline_data_nolock(struct inode *inode);
diff --git a/fs/ext4/fsinfo.c b/fs/ext4/fsinfo.c
new file mode 100644
index 000000000000..785f82a74dc9
--- /dev/null
+++ b/fs/ext4/fsinfo.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information for ext4
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/mount.h>
+#include "ext4.h"
+
+static int ext4_fsinfo_get_volume_name(struct path *path, struct fsinfo_context *ctx)
+{
+ const struct ext4_sb_info *sbi = EXT4_SB(path->mnt->mnt_sb);
+ const struct ext4_super_block *es = sbi->s_es;
+
+ memcpy(ctx->buffer, es->s_volume_name, sizeof(es->s_volume_name));
+ return strlen(ctx->buffer);
+}
+
+static int ext4_fsinfo_get_timestamps(struct path *path, struct fsinfo_context *ctx)
+{
+ const struct ext4_sb_info *sbi = EXT4_SB(path->mnt->mnt_sb);
+ const struct ext4_super_block *es = sbi->s_es;
+ struct fsinfo_ext4_timestamps *ts = ctx->buffer;
+
+#define Z(R,S) R = S | (((u64)S##_hi) << 32)
+ Z(ts->mkfs_time, es->s_mkfs_time);
+ Z(ts->mount_time, es->s_mtime);
+ Z(ts->write_time, es->s_wtime);
+ Z(ts->last_check_time, es->s_lastcheck);
+ Z(ts->first_error_time, es->s_first_error_time);
+ Z(ts->last_error_time, es->s_last_error_time);
+ return sizeof(*ts);
+}
+
+static const struct fsinfo_attribute ext4_fsinfo_attributes[] = {
+ FSINFO_STRING (FSINFO_ATTR_VOLUME_NAME, ext4_fsinfo_get_volume_name),
+ FSINFO_VSTRUCT (FSINFO_ATTR_EXT4_TIMESTAMPS, ext4_fsinfo_get_timestamps),
+ {}
+};
+
+int ext4_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+ return fsinfo_get_attribute(path, ctx, ext4_fsinfo_attributes);
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8434217549b3..02b4df073c4b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1477,6 +1477,9 @@ static const struct super_operations ext4_sops = {
.freeze_fs = ext4_freeze,
.unfreeze_fs = ext4_unfreeze,
.statfs = ext4_statfs,
+#ifdef CONFIG_FSINFO
+ .fsinfo = ext4_fsinfo,
+#endif
.remount_fs = ext4_remount,
.show_options = ext4_show_options,
#ifdef CONFIG_QUOTA
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 154c13a55819..d8d05f0f1473 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -41,6 +41,8 @@
#define FSINFO_ATTR_AFS_SERVER_NAME 0x301 /* Name of the Nth server (string) */
#define FSINFO_ATTR_AFS_SERVER_ADDRESSES 0x302 /* List of addresses of the Nth server */
+#define FSINFO_ATTR_EXT4_TIMESTAMPS 0x400 /* Ext4 superblock timestamps */
+
/*
* Optional fsinfo() parameter structure.
*
@@ -312,4 +314,18 @@ struct fsinfo_afs_server_address {
#define FSINFO_ATTR_AFS_SERVER_ADDRESSES__STRUCT struct fsinfo_afs_server_address
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_EXT4_TIMESTAMPS).
+ */
+struct fsinfo_ext4_timestamps {
+ __u64 mkfs_time;
+ __u64 mount_time;
+ __u64 write_time;
+ __u64 last_check_time;
+ __u64 first_error_time;
+ __u64 last_error_time;
+};
+
+#define FSINFO_ATTR_EXT4_TIMESTAMPS__STRUCT struct fsinfo_ext4_timestamps
+
#endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 82944f09e0c9..829297e9d1b6 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -374,6 +374,40 @@ static void dump_afs_fsinfo_server_address(void *reply, unsigned int size)
printf("family=%u\n", ss->ss_family);
}
+static char *dump_ext4_time(char *buffer, time_t tim)
+{
+ struct tm tm;
+ int len;
+
+ if (tim == 0)
+ return "-";
+
+ if (!localtime_r(&tim, &tm)) {
+ perror("localtime_r");
+ exit(1);
+ }
+ len = strftime(buffer, 100, "%F %T", &tm);
+ if (len == 0) {
+ perror("strftime");
+ exit(1);
+ }
+ return buffer;
+}
+
+static void dump_ext4_fsinfo_timestamps(void *reply, unsigned int size)
+{
+ struct fsinfo_ext4_timestamps *r = reply;
+ char buffer[100];
+
+ printf("\n");
+ printf("\tmkfs : %s\n", dump_ext4_time(buffer, r->mkfs_time));
+ printf("\tmount : %s\n", dump_ext4_time(buffer, r->mount_time));
+ printf("\twrite : %s\n", dump_ext4_time(buffer, r->write_time));
+ printf("\tfsck : %s\n", dump_ext4_time(buffer, r->last_check_time));
+ printf("\t1st-err : %s\n", dump_ext4_time(buffer, r->first_error_time));
+ printf("\tlast-err: %s\n", dump_ext4_time(buffer, r->last_error_time));
+}
+
static void dump_string(void *reply, unsigned int size)
{
char *s = reply, *p;
@@ -460,6 +494,7 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
FSINFO_STRING (FSINFO_ATTR_AFS_CELL_NAME, string),
FSINFO_STRING (FSINFO_ATTR_AFS_SERVER_NAME, string),
FSINFO_LIST_N (FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_server_address),
+ FSINFO_VSTRUCT (FSINFO_ATTR_EXT4_TIMESTAMPS, ext4_fsinfo_timestamps),
{}
};
Miklos Szeredi <[email protected]> wrote:
> > (1) It can be targetted. It makes it easy to query directly by path or
> > fd, but can also query by mount ID or fscontext fd. procfs and sysfs
> > cannot do three of these things easily.
>
> See above: with the addition of open(path, O_PATH) it can do all of these.
That's a horrible interface. To query a file by path, you have to do:
fd = open(path, O_PATH);
sprintf(procpath, "/proc/self/fdmount/%u/<attr>", fd);
fd2 = open(procpath, O_RDONLY);
read(fd2, ...);
close(fd2);
close(fd);
See point (3) about efficiency also. You're having to open *two* files.
> > (2) Easier to provide LSM oversight. Is the accessing process allowed to
> > query information pertinent to a particular file?
>
> Not quite sure why this would be easier for a new ad-hoc interface than for
> the well established filesystem API.
You're right. That's why fsinfo() uses standard pathwalk where possible,
e.g.:
fsinfo(AT_FDCWD, "/path/to/file", ...);
or a fairly standard fd-querying interface:
fsinfo(fd, "", { resolve_flags = RESOLVE_EMPTY_PATH }, ...);
to query an open file descriptor. These are well-established filesystem APIs.
Where I vary from this is allowing direct specification of a mount ID also,
with a special flag to say that's what I'm doing:
fsinfo(AT_FDCWD, "23", { flags = FSINFO_FLAGS_QUERY_MOUNT }, ...);
> > (7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
> > mount happens or is removed - and since systemd makes much use of
> > mount namespaces and mount propagation, this will create a lot of
> > nodes.
>
> This patch creates a single struct mountfs_entry per mount, which is 48bytes.
fsinfo() doesn't create any. Furthermore, it seems that mounts get multiplied
8-10 times by systemd - though, as you say, it's not necessarily a great deal
of memory.
> Now onto the advantages of a filesystem based API:
>
> - immediately usable from all programming languages, including scripts
This is not true. You can't open O_PATH from shell scripts, so you can't
query things by path that you can't or shouldn't open (dev file paths, for
example; symlinks).
I imagine you're thinking of something like:
{
id=`cat /proc/self/fdmount/5/parent_mount`
} 5</my/path/to/my/file
but what if /my/path/to/my/file is actually /dev/foobar?
I've had a grep through the bash sources, but can't seem to find anywhere that
uses O_PATH.
> - same goes for future extensions: no need to update libc, utils, language
> bindings, strace, etc...
Applications and libraries using these attributes would have to change anyway
to make use of additional information.
But it's not a good argument since you now have to have text parsers that
change over time.
David
Hi,
On 2020-03-09 18:49:31 -0400, Jeff Layton wrote:
> On Mon, 2020-03-09 at 12:22 -0700, Andres Freund wrote:
> > On 2020-03-09 13:50:59 -0400, Jeff Layton wrote:
> > > I sent a patch a few weeks ago to make syncfs() return errors when there
> > > have been writeback errors on the superblock. It's not merged yet, but
> > > once we have something like that in place, we could expose info from the
> > > errseq_t to userland using this interface.
> >
> > I'm still a bit worried about the details of errseq_t being exposed to
> > userland. Partially because it seems to restrict further evolution of
> > errseq_t, and partially because it will likely up with userland trying
> > to understand it (it's e.g. just too attractive to report a count of
> > errors etc).
>
> Trying to interpret the counter field won't really tell you anything.
> The counter is not incremented unless someone has queried the value
> since it was last checked. A single increment could represent a single
> writeback error or 10000 identical ones.
Oh, right. A zero errseq would still indicate something, but that's
probably fine.
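The counter behaviour Jeff describes can be shown with a simplified, single-threaded model of the errseq_t idea: an error code in the low bits, a "seen" flag above it, and a counter above that which is bumped only if the previous value had been observed. This is a sketch with made-up demo_* names, positive error codes and no atomics; it is not the kernel implementation.

```c
#include <stdint.h>

typedef uint32_t demo_errseq_t;

#define DEMO_MAX_ERRNO	4095u			/* low 12 bits hold the error code */
#define DEMO_SEEN	(DEMO_MAX_ERRNO + 1)	/* set once someone has sampled it */
#define DEMO_CTR_INC	(DEMO_SEEN << 1)	/* counter lives above the flag */

/* Record an error; 'err' is a positive code no larger than DEMO_MAX_ERRNO. */
static void demo_errseq_set(demo_errseq_t *eseq, unsigned int err)
{
	demo_errseq_t old = *eseq;
	demo_errseq_t cur = (old & ~(DEMO_MAX_ERRNO | DEMO_SEEN)) | err;

	if (old & DEMO_SEEN)
		cur += DEMO_CTR_INC;	/* bump only if the old value was observed */
	*eseq = cur;
}

/* Model of an observer sampling the value. */
static void demo_errseq_mark_seen(demo_errseq_t *eseq)
{
	*eseq |= DEMO_SEEN;
}

static unsigned int demo_errseq_counter(demo_errseq_t eseq)
{
	return eseq / DEMO_CTR_INC;
}
```

In this model, recording the same error 10000 times with no observer in between leaves the value exactly as it was after the first, which is why the counter field cannot be read as an error count.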
> > Is there a reason to not instead report a 64bit counter instead of the
> > cookie? In contrast to the struct file case we'd only have the space
> > overhead once per superblock, rather than once per #files * #fd. And it
> > seems that the maintenance of that counter could be done without
> > widespread changes, e.g. instead/in addition to your change:
> What problem would moving to a 64-bit counter solve? I get the concern
> about people trying to get a counter out of the cookie field, but giving
> people an explicit 64-bit counter seems even more open to
> misinterpretation.
Well, you could get an actual error count out of it? I was thinking that
that value would get incremented every time mapping_set_error() is
called, which should make it a meaningful count?
Greetings,
Andres Freund
On Mon, Mar 9, 2020 at 11:53 PM David Howells <[email protected]> wrote:
>
> Miklos Szeredi <[email protected]> wrote:
>
> > > (1) It can be targetted. It makes it easy to query directly by path or
> > > fd, but can also query by mount ID or fscontext fd. procfs and sysfs
> > > cannot do three of these things easily.
> >
> > See above: with the addition of open(path, O_PATH) it can do all of these.
>
> That's a horrible interface. To query a file by path, you have to do:
>
> fd = open(path, O_PATH);
> sprintf(procpath, "/proc/self/fdmount/%u/<attr>");
> fd2 = open(procpath, O_RDONLY);
> read(fd2, ...);
> close(fd2);
> close(fd);
>
> See point (3) about efficiency also. You're having to open *two* files.
I completely agree, opening two files is surely going to kill
performance of application needing to retrieve a billion mount
attributes per second.</sarcasm>
> > > (2) Easier to provide LSM oversight. Is the accessing process allowed to
> > > query information pertinent to a particular file?
> >
> > Not quite sure why this would be easier for a new ad-hoc interface than for
> > the well established filesystem API.
>
> You're right. That's why fsinfo() uses standard pathwalk where possible,
> e.g.:
>
> fsinfo(AT_FDCWD, "/path/to/file", ...);
>
> or a fairly standard fd-querying interface:
>
> fsinfo(fd, "", { resolve_flags = RESOLVE_EMPTY_PATH }, ...);
>
> to query an open file descriptor. These are well-established filesystem APIs.
Yes. The problem is with the "..." part where you pass random
structures to a function. That's useful sometimes, but at the very
least it breaks type safety, and not what I would call a "clean" API.
> > Now onto the advantages of a filesystem based API:
> >
> > - immediately usable from all programming languages, including scripts
>
> This is not true. You can't open O_PATH from shell scripts, so you can't
> query things by path that you can't or shouldn't open (dev file paths, for
> example; symlinks).
Yes. However, you just wrote the core of a utility that could do this
(in 6 lines, no less). Now try that feat with fsinfo(2)!
Thanks,
Miklos