2020-07-04 14:03:46

by Greg KH

Subject: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

Here is a tiny new syscall, readfile, that makes it simpler to read
small/medium sized files all in one shot: no need to do open/read/close.
This is especially helpful for tools that poke around in procfs or
sysfs, putting a little less load on the system than before, especially
as syscall overheads go up over time due to the mitigations for various
CPU bugs.

There are 4 patches in this series: the first 3 are against the kernel
tree, adding the syscall logic, wiring up the syscall, and adding some
tests for it.

The last patch is against the man-pages project, adding a tiny man page
to try to describe the new syscall.
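
For anyone who wants to poke at it from userspace, a minimal caller
looks roughly like the below (illustrative only and untested; there is
no libc wrapper yet, so it calls syscall(2) directly with the number
this series assigns):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_readfile
#define __NR_readfile 440       /* number used by this series */
#endif

/* thin wrapper until libc grows one */
static ssize_t readfile(int dfd, const char *path, void *buf,
                        size_t bufsize, int flags)
{
        return syscall(__NR_readfile, dfd, path, buf, bufsize, flags);
}

int main(void)
{
        char buf[4096];
        /* one syscall instead of open()/read()/close() */
        ssize_t n = readfile(AT_FDCWD, "/proc/uptime", buf, sizeof(buf), 0);

        if (n > 0)
                fwrite(buf, 1, n, stdout);
        return n > 0 ? 0 : 1;
}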

Greg Kroah-Hartman (3):
readfile: implement readfile syscall
arch: wire up the readfile syscall
selftests: add readfile(2) selftests

arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/open.c | 50 +++
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/readfile/.gitignore | 3 +
tools/testing/selftests/readfile/Makefile | 7 +
tools/testing/selftests/readfile/readfile.c | 285 +++++++++++++++++
.../selftests/readfile/readfile_speed.c | 301 ++++++++++++++++++
26 files changed, 671 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/readfile/.gitignore
create mode 100644 tools/testing/selftests/readfile/Makefile
create mode 100644 tools/testing/selftests/readfile/readfile.c
create mode 100644 tools/testing/selftests/readfile/readfile_speed.c

--
2.27.0


2020-07-04 14:03:46

by Greg KH

Subject: [PATCH 1/3] readfile: implement readfile syscall

It's a tiny syscall, meant to allow a user to do a single "open this
file, read into this buffer, and close the file" all in a single shot.

Should be good for reading "tiny" files in sysfs, procfs, and other
"small" files.

There is no restarting the syscall; this is a "simple" syscall, an
attempt to make reading "simple" files easier with less syscall
overhead.

Cc: Alexander Viro <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
fs/open.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)

diff --git a/fs/open.c b/fs/open.c
index 6cd48a61cda3..4469faa9379c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1370,3 +1370,53 @@ int stream_open(struct inode *inode, struct file *filp)
}

EXPORT_SYMBOL(stream_open);
+
+static struct file *readfile_open(int dfd, const char __user *filename,
+ struct open_flags *op)
+{
+ struct filename *tmp;
+ struct file *f;
+
+ tmp = getname(filename);
+ if (IS_ERR(tmp))
+ return (struct file *)tmp;
+
+ f = do_filp_open(dfd, tmp, op);
+ if (!IS_ERR(f))
+ fsnotify_open(f);
+
+ putname(tmp);
+ return f;
+}
+
+SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
+ char __user *, buffer, size_t, bufsize, int, flags)
+{
+ struct open_flags op;
+ struct open_how how;
+ struct file *file;
+ loff_t pos = 0;
+ int retval;
+
+ /* only accept a small subset of O_ flags that make sense */
+ if ((flags & (O_NOFOLLOW | O_NOATIME)) != flags)
+ return -EINVAL;
+
+ /* add some needed flags to be able to open the file properly */
+ flags |= O_RDONLY | O_LARGEFILE;
+
+ how = build_open_how(flags, 0000);
+ retval = build_open_flags(&how, &op);
+ if (retval)
+ return retval;
+
+ file = readfile_open(dfd, filename, &op);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ retval = vfs_read(file, buffer, bufsize, &pos);
+
+ filp_close(file, NULL);
+
+ return retval;
+}
--
2.27.0

2020-07-04 14:03:48

by Greg KH

Subject: [PATCH 2/3] arch: wire up the readfile syscall

This wires up the readfile syscall for all architectures.

Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 ++
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 2 ++
include/uapi/asm-generic/unistd.h | 4 +++-
20 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 5ddd128d4b7a..4132380e997f 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -478,3 +478,4 @@
547 common openat2 sys_openat2
548 common pidfd_getfd sys_pidfd_getfd
549 common faccessat2 sys_faccessat2
+550 common readfile sys_readfile
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index d5cae5ffede0..454873892ba3 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -452,3 +452,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 3b859596840d..b3b2019f8d16 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)

-#define __NR_compat_syscalls 440
+#define __NR_compat_syscalls 441
#endif

#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 6d95d0c8bf2f..524d19779612 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -885,6 +885,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_readfile 440
+__SYSCALL(__NR_readfile, sys_readfile)

/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 49e325b604b3..b188f03736bb 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -359,3 +359,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index f71b1bbcc198..ab24bcb91344 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index edacc4561f2b..46c06f800e8e 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -444,3 +444,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index f777141f5256..552ba4dafbef 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -377,3 +377,4 @@
437 n32 openat2 sys_openat2
438 n32 pidfd_getfd sys_pidfd_getfd
439 n32 faccessat2 sys_faccessat2
+440 n32 readfile sys_readfile
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index da8c76394e17..e12581bf900b 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -353,3 +353,4 @@
437 n64 openat2 sys_openat2
438 n64 pidfd_getfd sys_pidfd_getfd
439 n64 faccessat2 sys_faccessat2
+440 n64 readfile sys_readfile
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 13280625d312..67cb8f8fbdb2 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -426,3 +426,4 @@
437 o32 openat2 sys_openat2
438 o32 pidfd_getfd sys_pidfd_getfd
439 o32 faccessat2 sys_faccessat2
+440 o32 readfile sys_readfile
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 5a758fa6ec52..775e5228ab51 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index f833a3190822..d452db708635 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -528,3 +528,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index bfdcb7633957..7ab529813a42 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
437 common openat2 sys_openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2 sys_faccessat2
+440 common readfile sys_readfile sys_readfile
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index acc35daa1b79..ce8862cdb707 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8004a276cb74..d89e7224bb0f 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -484,3 +484,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d8f8a1a69ed1..6f8d0b0acb6a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -443,3 +443,4 @@
437 i386 openat2 sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd
439 i386 faccessat2 sys_faccessat2
+440 i386 readfile sys_readfile
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 78847b32e137..9c54081b7c14 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -360,6 +360,7 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 69d0d73876b3..7b1f2ea76621 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -409,3 +409,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common readfile sys_readfile
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b951a87da987..9d338ef7802d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1005,6 +1005,8 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
siginfo_t __user *info,
unsigned int flags);
asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_readfile(int dfd, const char __user *filename,
+ char __user *buffer, size_t bufsize, int flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index f4a01305d9a6..81b677c01266 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -857,9 +857,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_readfile 440
+__SYSCALL(__NR_readfile, sys_readfile)

#undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441

/*
* 32 bit systems traditionally used different
--
2.27.0

2020-07-04 14:03:51

by Greg KH

Subject: [PATCH 3/3] selftests: add readfile(2) selftests

Test the functionality of readfile(2) in various ways.

Also provide a simple speed test program to benchmark using readfile()
instead of using open()/read()/close().
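
For reference, the benchmark takes a loop count, a test type, and a
full path to a regular file; a typical run looks like this (timings
elided, output format as printed by readfile_speed.c):

  $ ./readfile_speed -l 10000 -t both -f /proc/version
  Running readfile test on file /proc/version for 10000 loops...
  Took ... ns
  Running open/read/close test on file /proc/version for 10000 loops...
  Took ... ns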

Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/readfile/.gitignore | 3 +
tools/testing/selftests/readfile/Makefile | 7 +
tools/testing/selftests/readfile/readfile.c | 285 +++++++++++++++++
.../selftests/readfile/readfile_speed.c | 301 ++++++++++++++++++
5 files changed, 597 insertions(+)
create mode 100644 tools/testing/selftests/readfile/.gitignore
create mode 100644 tools/testing/selftests/readfile/Makefile
create mode 100644 tools/testing/selftests/readfile/readfile.c
create mode 100644 tools/testing/selftests/readfile/readfile_speed.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1195bd85af38..82359233b945 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -46,6 +46,7 @@ TARGETS += ptrace
TARGETS += openat2
TARGETS += rseq
TARGETS += rtc
+TARGETS += readfile
TARGETS += seccomp
TARGETS += sigaltstack
TARGETS += size
diff --git a/tools/testing/selftests/readfile/.gitignore b/tools/testing/selftests/readfile/.gitignore
new file mode 100644
index 000000000000..f0e758d437e4
--- /dev/null
+++ b/tools/testing/selftests/readfile/.gitignore
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+readfile
+readfile_speed
diff --git a/tools/testing/selftests/readfile/Makefile b/tools/testing/selftests/readfile/Makefile
new file mode 100644
index 000000000000..1bf1bdec40f8
--- /dev/null
+++ b/tools/testing/selftests/readfile/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -g -I../../../../usr/include/
+CFLAGS += -O2 -Wl,-no-as-needed -Wall
+
+TEST_GEN_PROGS := readfile readfile_speed
+
+include ../lib.mk
diff --git a/tools/testing/selftests/readfile/readfile.c b/tools/testing/selftests/readfile/readfile.c
new file mode 100644
index 000000000000..f0736c6dfa69
--- /dev/null
+++ b/tools/testing/selftests/readfile/readfile.c
@@ -0,0 +1,285 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020 Greg Kroah-Hartman <[email protected]>
+ * Copyright (c) 2020 The Linux Foundation
+ *
+ * Test the readfile() syscall in various ways.
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <string.h>
+#include <syscall.h>
+
+#include "../kselftest.h"
+
+//#ifndef __NR_readfile
+//#define __NR_readfile -1
+//#endif
+
+#define __NR_readfile 440
+
+#define TEST_FILE1 "/sys/devices/system/cpu/vulnerabilities/meltdown"
+#define TEST_FILE2 "/sys/devices/system/cpu/vulnerabilities/spectre_v1"
+#define TEST_FILE4 "/sys/kernel/debug/usb/devices"
+
+static int sys_readfile(int fd, const char *filename, unsigned char *buffer,
+ size_t bufsize, int flags)
+{
+ return syscall(__NR_readfile, fd, filename, buffer, bufsize, flags);
+}
+
+/*
+ * Test whether readfile() is present in the running kernel.
+ */
+static void test_readfile_supported(void)
+{
+ const char *proc_map = "/proc/self/maps";
+ unsigned char buffer[10];
+ int retval;
+
+ if (__NR_readfile < 0)
+ ksft_exit_skip("readfile() syscall is not defined for the kernel this test was built against\n");
+
+ /*
+ * Do a simple test to see if the syscall really is present in the
+ * running kernel
+ */
+ retval = sys_readfile(0, proc_map, &buffer[0], sizeof(buffer), 0);
+ if (retval == -1)
+ ksft_exit_skip("readfile() syscall not present on running kernel\n");
+
+ ksft_test_result_pass("readfile() syscall present\n");
+}
+
+/*
+ * Open all files in a specific sysfs directory and read from them
+ *
+ * This tests the "openat" type functionality of opening all files relative to a
+ * directory. We don't care at the moment about the contents.
+ */
+static void test_sysfs_files(void)
+{
+ static unsigned char buffer[8000];
+ const char *sysfs_dir = "/sys/devices/system/cpu/vulnerabilities/";
+ struct dirent *dirent;
+ DIR *vuln_sysfs_dir;
+ int sysfs_fd;
+ int retval;
+
+ sysfs_fd = open(sysfs_dir, O_PATH | O_DIRECTORY);
+ if (sysfs_fd == -1) {
+ ksft_test_result_skip("unable to open %s directory\n",
+ sysfs_dir);
+ return;
+ }
+
+ vuln_sysfs_dir = opendir(sysfs_dir);
+ if (!vuln_sysfs_dir) {
+ ksft_test_result_skip("%s unable to be opened, skipping test\n", sysfs_dir);
+ return;
+ }
+
+ ksft_print_msg("readfile: testing relative path functionality by reading files in %s\n",
+ sysfs_dir);
+ /* open all sysfs file in this directory and read the whole thing */
+ while ((dirent = readdir(vuln_sysfs_dir))) {
+ /* ignore . and .. */
+ if (strcmp(dirent->d_name, ".") == 0 ||
+ strcmp(dirent->d_name, "..") == 0)
+ continue;
+
+ retval = sys_readfile(sysfs_fd, dirent->d_name, &buffer[0],
+ sizeof(buffer), 0);
+
+ if (retval <= 0) {
+ ksft_test_result_fail("readfile(%s) failed with %d\n",
+ dirent->d_name, retval);
+ goto exit;
+ }
+
+ /* cut off trailing \n character */
+ buffer[retval - 1] = 0x00;
+ ksft_print_msg(" '%s' contains \"%s\"\n", dirent->d_name,
+ buffer);
+ }
+
+ ksft_test_result_pass("readfile() relative path functionality passed\n");
+
+exit:
+ closedir(vuln_sysfs_dir);
+ close(sysfs_fd);
+}
+
+/* Temporary directory variables */
+static int root_fd; /* test root directory file handle */
+static char tmpdir[PATH_MAX];
+
+static void setup_tmpdir(void)
+{
+ char *tmpdir_root;
+
+ tmpdir_root = getenv("TMPDIR");
+ if (!tmpdir_root)
+ tmpdir_root = "/tmp";
+
+ snprintf(tmpdir, PATH_MAX, "%s/readfile.XXXXXX", tmpdir_root);
+ if (!mkdtemp(tmpdir)) {
+ ksft_test_result_fail("mkdtemp(%s) failed\n", tmpdir);
+ ksft_exit_fail();
+ }
+
+ root_fd = open(tmpdir, O_PATH | O_DIRECTORY);
+ if (root_fd == -1) {
+ ksft_exit_fail_msg("%s unable to be opened, error = %d\n",
+ tmpdir, root_fd);
+ ksft_exit_fail();
+ }
+
+ ksft_print_msg("%s created to use for testing\n", tmpdir);
+}
+
+static void teardown_tmpdir(void)
+{
+ int retval;
+
+ close(root_fd);
+
+ retval = rmdir(tmpdir);
+ if (retval) {
+ ksft_exit_fail_msg("%s removed with return value %d\n",
+ tmpdir, retval);
+ ksft_exit_fail();
+ }
+ ksft_print_msg("%s cleaned up and removed\n", tmpdir);
+
+}
+
+static void test_filesize(size_t size)
+{
+ char filename[PATH_MAX];
+ unsigned char *write_data;
+ unsigned char *read_data;
+ int fd;
+ int retval;
+ size_t i;
+
+ snprintf(filename, PATH_MAX, "size-%ld", size);
+
+ read_data = malloc(size);
+ write_data = malloc(size);
+ if (!read_data || !write_data)
+ ksft_exit_fail_msg("Unable to allocate %ld bytes\n", size);
+
+ fd = openat(root_fd, filename, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0)
+ ksft_exit_fail_msg("Unable to create file %s\n", filename);
+
+ ksft_print_msg("%s created\n", filename);
+
+ for (i = 0; i < size; ++i)
+ write_data[i] = (unsigned char)(0xff & i);
+
+ write(fd, write_data, size);
+ close(fd);
+
+ retval = sys_readfile(root_fd, filename, read_data, size, 0);
+
+ if (retval != size) {
+ ksft_test_result_fail("Read %d bytes but wanted to read %ld bytes.\n",
+ retval, size);
+ goto exit;
+ }
+
+ if (memcmp(read_data, write_data, size) != 0) {
+ ksft_test_result_fail("Read data of buffer size %d did not match written data\n",
+ size);
+ goto exit;
+ }
+
+ ksft_test_result_pass("readfile() of size %ld succeeded.\n", size);
+
+exit:
+ unlinkat(root_fd, filename, 0);
+ free(write_data);
+ free(read_data);
+}
+
+
+/*
+ * Create a bunch of differently sized files, and verify we read the correct
+ * amount of data from them.
+ */
+static void test_filesizes(void)
+{
+ setup_tmpdir();
+
+ test_filesize(0x10);
+ test_filesize(0x100);
+ test_filesize(0x1000);
+ test_filesize(0x10000);
+ test_filesize(0x100000);
+ test_filesize(0x1000000);
+
+ teardown_tmpdir();
+
+}
+
+static void readfile(const char *filename)
+{
+// int root_fd;
+ unsigned char buffer[16000];
+ int retval;
+
+ memset(buffer, 0x00, sizeof(buffer));
+
+// root_fd = open("/", O_DIRECTORY);
+// if (root_fd == -1)
+// ksft_exit_fail_msg("error with root_fd\n");
+
+ retval = sys_readfile(root_fd, filename, &buffer[0], sizeof(buffer), 0);
+
+// close(root_fd);
+
+ if (retval <= 0)
+ ksft_test_result_fail("readfile() test of filename=%s failed with retval %d\n",
+ filename, retval);
+ else
+ ksft_test_result_pass("readfile() test of filename=%s succeeded with retval=%d\n",
+ filename, retval);
+// buffer='%s'\n",
+// filename, retval, &buffer[0]);
+
+}
+
+
+int main(int argc, char *argv[])
+{
+ ksft_print_header();
+ ksft_set_plan(10);
+
+ test_readfile_supported(); // 1 test
+
+ test_sysfs_files(); // 1 test
+
+ test_filesizes(); // 6 tests
+
+ setup_tmpdir();
+
+ readfile(TEST_FILE1);
+ readfile(TEST_FILE2);
+// readfile(TEST_FILE4);
+
+ teardown_tmpdir();
+
+ if (ksft_get_fail_cnt())
+ return ksft_exit_fail();
+
+ return ksft_exit_pass();
+}
+
diff --git a/tools/testing/selftests/readfile/readfile_speed.c b/tools/testing/selftests/readfile/readfile_speed.c
new file mode 100644
index 000000000000..11ca79163131
--- /dev/null
+++ b/tools/testing/selftests/readfile/readfile_speed.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020 Greg Kroah-Hartman <[email protected]>
+ * Copyright (c) 2020 The Linux Foundation
+ *
+ * Tiny test program to try to benchmark the speed of the readfile syscall vs.
+ * the open/read/close sequence it can replace.
+ */
+#define _GNU_SOURCE
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <syscall.h>
+#include <time.h>
+#include <unistd.h>
+
+/* Default test file if no one wants to pick something else */
+#define DEFAULT_TEST_FILE "/sys/devices/system/cpu/vulnerabilities/meltdown"
+
+#define DEFAULT_TEST_LOOPS 1000
+
+#define DEFAULT_TEST_TYPE "both"
+
+/* Max number of bytes that will be read from the file */
+#define TEST_BUFFER_SIZE 10000
+static unsigned char test_buffer[TEST_BUFFER_SIZE];
+
+enum test_type {
+ TEST_READFILE,
+ TEST_OPENREADCLOSE,
+ TEST_BOTH,
+};
+
+/* Find the readfile syscall number */
+//#ifndef __NR_readfile
+//#define __NR_readfile -1
+//#endif
+#define __NR_readfile 440
+
+static int sys_readfile(int fd, const char *filename, unsigned char *buffer,
+ size_t bufsize, int flags)
+{
+ return syscall(__NR_readfile, fd, filename, buffer, bufsize, flags);
+}
+
+/* Test whether readfile() is present in the running kernel. */
+static void test_readfile_supported(void)
+{
+ const char *proc_map = "/proc/self/maps";
+ unsigned char buffer[10];
+ int retval;
+
+ if (__NR_readfile < 0) {
+ fprintf(stderr,
+ "readfile() syscall is not defined for the kernel this test was built against.\n");
+ exit(1);
+ }
+
+ /*
+ * Do a simple test to see if the syscall really is present in the
+ * running kernel
+ */
+ retval = sys_readfile(0, proc_map, &buffer[0], sizeof(buffer), 0);
+ if (retval == -1) {
+ fprintf(stderr,
+ "readfile() syscall not present on running kernel.\n");
+ exit(1);
+ }
+}
+
+static inline long long get_time_ns(void)
+{
+ struct timespec t;
+
+ clock_gettime(CLOCK_MONOTONIC, &t);
+
+ return (long long)t.tv_sec * 1000000000 + t.tv_nsec;
+}
+
+/* taken from all-io.h from util-linux repo */
+static inline ssize_t read_all(int fd, unsigned char *buf, size_t count)
+{
+ ssize_t ret;
+ ssize_t c = 0;
+ int tries = 0;
+
+ while (count > 0) {
+ ret = read(fd, buf, count);
+ if (ret <= 0) {
+ if (ret < 0 && (errno == EAGAIN || errno == EINTR) &&
+ (tries++ < 5)) {
+ usleep(250000);
+ continue;
+ }
+ return c ? c : -1;
+ }
+ tries = 0;
+ count -= ret;
+ buf += ret;
+ c += ret;
+ }
+ return c;
+}
+
+static int openreadclose(const char *filename, unsigned char *buffer,
+ size_t bufsize)
+{
+ ssize_t count;
+ int fd;
+
+ fd = openat(0, filename, O_RDONLY);
+ if (fd < 0) {
+ printf("error opening %s\n", filename);
+ return fd;
+ }
+
+ count = read_all(fd, buffer, bufsize);
+ if (count < 0) {
+ printf("Error %ld reading from %s\n", count, filename);
+ }
+
+ close(fd);
+ return count;
+}
+
+static int run_test(enum test_type test_type, const char *filename)
+{
+ switch (test_type) {
+ case TEST_READFILE:
+ return sys_readfile(0, filename, &test_buffer[0],
+ TEST_BUFFER_SIZE, O_RDONLY);
+
+ case TEST_OPENREADCLOSE:
+ return openreadclose(filename, &test_buffer[0],
+ TEST_BUFFER_SIZE);
+ default:
+ return -EINVAL;
+ }
+}
+
+static const char * const test_names[] = {
+ [TEST_READFILE] = "readfile",
+ [TEST_OPENREADCLOSE] = "open/read/close",
+};
+
+static int run_test_loop(int loops, enum test_type test_type,
+ const char *filename)
+{
+ long long time_start;
+ long long time_end;
+ long long time_elapsed;
+ int retval = 0;
+ int i;
+
+ fprintf(stdout,
+ "Running %s test on file %s for %d loops...\n",
+ test_names[test_type], filename, loops);
+
+ /* Fill the cache with one run of the read first */
+ retval = run_test(test_type, filename);
+ if (retval < 0) {
+ fprintf(stderr,
+ "test %s was unable to run with error %d\n",
+ test_names[test_type], retval);
+ return retval;
+ }
+
+ time_start = get_time_ns();
+
+ for (i = 0; i < loops; ++i) {
+ retval = run_test(test_type, filename);
+
+ if (retval < 0) {
+ fprintf(stderr,
+ "test failed on loop %d with error %d\n",
+ i, retval);
+ break;
+ }
+ }
+ time_end = get_time_ns();
+
+ time_elapsed = time_end - time_start;
+
+ fprintf(stdout, "Took %lld ns\n", time_elapsed);
+
+ return retval;
+}
+
+static int do_read_file_test(int loops, enum test_type test_type,
+ const char *filename)
+{
+ int retval;
+
+ if (test_type == TEST_BOTH) {
+ retval = do_read_file_test(loops, TEST_READFILE, filename);
+ retval = do_read_file_test(loops, TEST_OPENREADCLOSE, filename);
+ return retval;
+ }
+ return run_test_loop(loops, test_type, filename);
+}
+
+static int check_file_present(const char *filename)
+{
+ struct stat sb;
+ int retval;
+
+ retval = stat(filename, &sb);
+ if (retval == -1) {
+ fprintf(stderr,
+ "filename %s is not present\n", filename);
+ return retval;
+ }
+
+ if ((sb.st_mode & S_IFMT) != S_IFREG) {
+ fprintf(stderr,
+ "filename %s must be a real file, not anything else.\n",
+ filename);
+ return -1;
+ }
+ return 0;
+}
+
+static void usage(char *progname)
+{
+ fprintf(stderr,
+ "usage: %s [options]\n"
+ " -l loops Number of loops to run the test for.\n"
+ " default is %d\n"
+ " -t testtype Test type to run.\n"
+ " types are: readfile, openreadclose, both\n"
+ " default is %s\n"
+ " -f filename Filename to read from, full path, not relative.\n"
+ " default is %s\n",
+ progname,
+ DEFAULT_TEST_LOOPS, DEFAULT_TEST_TYPE, DEFAULT_TEST_FILE);
+}
+
+int main(int argc, char *argv[])
+{
+ char *progname;
+ char *testtype = DEFAULT_TEST_TYPE;
+ char *filename = DEFAULT_TEST_FILE;
+ int loops = DEFAULT_TEST_LOOPS;
+ enum test_type test_type;
+ int retval;
+ char c;
+
+ progname = strrchr(argv[0], '/');
+ progname = progname ? 1+progname : argv[0];
+
+ while (EOF != (c = getopt(argc, argv, "t:l:f:h"))) {
+ switch (c) {
+ case 'l':
+ loops = atoi(optarg);
+ break;
+
+ case 't':
+ testtype = optarg;
+ break;
+
+ case 'f':
+ filename = optarg;
+ break;
+
+ case 'h':
+ usage(progname);
+ return 0;
+
+ default:
+ usage(progname);
+ return -1;
+ }
+ }
+
+ if (strcmp(testtype, "readfile") == 0)
+ test_type = TEST_READFILE;
+ else if (strcmp(testtype, "openreadclose") == 0)
+ test_type = TEST_OPENREADCLOSE;
+ else if (strcmp(testtype, "both") == 0)
+ test_type = TEST_BOTH;
+ else {
+ usage(progname);
+ return -1;
+ }
+
+ test_readfile_supported();
+
+ retval = check_file_present(filename);
+ if (retval)
+ return retval;
+
+ return do_read_file_test(loops, test_type, filename);
+}
--
2.27.0

2020-07-04 14:03:53

by Greg KH

Subject: [PATCH] readfile.2: new page describing readfile(2)

readfile(2) is a new syscall to remove the need to do the
open/read/close dance for small virtual files in places like procfs or
sysfs.

Signed-off-by: Greg Kroah-Hartman <[email protected]>
---

This patch is for the man-pages project, not the kernel source tree
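
To make the dirfd semantics described below concrete, here is a small,
untested example program (not part of the patch) that opens a directory
once and then reads a few files relative to it, calling the syscall
directly since there is no glibc wrapper yet; the sysfs file names are
just the ones present on a typical x86 box:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_readfile
#define __NR_readfile 440       /* number proposed in the kernel series */
#endif

int main(void)
{
        const char *dir = "/sys/devices/system/cpu/vulnerabilities";
        const char *names[] = { "meltdown", "spectre_v1", "spectre_v2" };
        char buf[4096];
        int dfd, i;

        /* open the directory once, then resolve names relative to it */
        dfd = open(dir, O_PATH | O_DIRECTORY);
        if (dfd < 0)
                return 1;

        for (i = 0; i < 3; i++) {
                ssize_t n = syscall(__NR_readfile, dfd, names[i],
                                    buf, sizeof(buf) - 1, 0);
                if (n < 0)
                        continue;
                buf[n] = '\0';
                printf("%s: %s", names[i], buf);
        }
        close(dfd);
        return 0;
}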

man2/readfile.2 | 159 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 159 insertions(+)
create mode 100644 man2/readfile.2

diff --git a/man2/readfile.2 b/man2/readfile.2
new file mode 100644
index 000000000000..449e722c3442
--- /dev/null
+++ b/man2/readfile.2
@@ -0,0 +1,159 @@
+.\" This manpage is Copyright (C) 2020 Greg Kroah-Hartman;
+.\" and Copyright (C) 2020 The Linux Foundation
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH READFILE 2 2020-07-04 "Linux" "Linux Programmer's Manual"
+.SH NAME
+readfile \- read a file into a buffer
+.SH SYNOPSIS
+.nf
+.B #include <unistd.h>
+.PP
+.BI "ssize_t readfile(int " dirfd ", const char *" pathname ", void *" buf \
+", size_t " count ", int " flags );
+.fi
+.SH DESCRIPTION
+.BR readfile ()
+attempts to open the file specified by
+.IR pathname
+and to read up to
+.I count
+bytes from the file into the buffer starting at
+.IR buf .
+It is intended to be a shortcut for doing the sequence of
+.BR open ()
+and then
+.BR read ()
+and then
+.BR close ()
+for small files that are read frequently, such as those in
+.B procfs
+or
+.BR sysfs .
+.PP
+If the size of file is smaller than the value provided in
+.I count
+then the whole file will be copied into
+.IR buf .
+.PP
+If the file is larger than the value provided in
+.I count
+then only
+.I count
+number of bytes will be copied into
+.IR buf .
+.PP
+The argument
+.I flags
+may contain zero or more of the following
+.IR flags :
+.BR O_NOFOLLOW " and " O_NOATIME .
+.PP
+If the pathname given in
+.I pathname
+is relative, then it is interpreted relative to the directory
+referred to by the file descriptor
+.IR dirfd .
+.PP
+If
+.I pathname
+is relative and
+.I dirfd
+is the special value
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the current working
+directory of the calling process (like
+.BR openat ()).
+.PP
+If
+.I pathname
+is absolute, then
+.I dirfd
+is ignored.
+.SH RETURN VALUE
+On success, the number of bytes read is returned.
+It is not an error if this number is smaller than the number of bytes
+requested; this can happen if the file is smaller than the number of
+bytes requested.
+.PP
+On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EFAULT
+.I buf
+is outside your accessible address space.
+.TP
+.B EINTR
+The call was interrupted by a signal before any data was read; see
+.BR signal (7).
+.TP
+.B EINVAL
+.I flags
+was set to a value that is not allowed.
+.TP
+.B EIO
+I/O error.
+This will happen for example when the process is in a
+background process group, tries to read from its controlling terminal,
+and either it is ignoring or blocking
+.B SIGTTIN
+or its process group
+is orphaned.
+It may also occur when there is a low-level I/O error
+while reading from a disk or tape.
+A further possible cause of
+.B EIO
+on networked filesystems is when an advisory lock had been taken
+out on the file descriptor and this lock has been lost.
+See the
+.I "Lost locks"
+section of
+.BR fcntl (2)
+for further details.
+.SH CONFORMING TO
+None; this is a Linux-specific system call at this point in time.
+.SH NOTES
+The type
+.I size_t
+is an unsigned integer data type specified by POSIX.1.
+.PP
+On Linux,
+.BR read ()
+(and similar system calls) will transfer at most
+0x7ffff000 (2,147,479,552) bytes,
+returning the number of bytes actually transferred.
+.\" commit e28cc71572da38a5a12c1cfe4d7032017adccf69
+(This is true on both 32-bit and 64-bit systems.)
+.SH BUGS
+None yet!
+.SH SEE ALSO
+.BR close (2),
+.BR open (2),
+.BR openat (2),
+.BR read (2),
+.BR fread (3)
--
2.27.0

2020-07-04 18:39:10

by Geert Uytterhoeven

Subject: Re: [PATCH 3/3] selftests: add readfile(2) selftests

Hi Greg,

On Sat, Jul 4, 2020 at 4:05 PM Greg Kroah-Hartman
<[email protected]> wrote:
> Test the functionality of readfile(2) in various ways.
>
> Also provide a simple speed test program to benchmark using readfile()
> instead of using open()/read()/close().
>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>

Any benchmark results to share?

> --- /dev/null
> +++ b/tools/testing/selftests/readfile/readfile.c

> +static void readfile(const char *filename)
> +{
> +// int root_fd;

???

> + unsigned char buffer[16000];
> + int retval;
> +
> + memset(buffer, 0x00, sizeof(buffer));
> +
> +// root_fd = open("/", O_DIRECTORY);
> +// if (root_fd == -1)
> +// ksft_exit_fail_msg("error with root_fd\n");

???

> +
> + retval = sys_readfile(root_fd, filename, &buffer[0], sizeof(buffer), 0);
> +
> +// close(root_fd);
> +
> + if (retval <= 0)
> + ksft_test_result_fail("readfile() test of filename=%s failed with retval %d\n",
> + filename, retval);
> + else
> + ksft_test_result_pass("readfile() test of filename=%s succeeded with retval=%d\n",
> + filename, retval);
> +// buffer='%s'\n",
> +// filename, retval, &buffer[0]);
> +
> +}
> +
> +
> +int main(int argc, char *argv[])
> +{
> + ksft_print_header();
> + ksft_set_plan(10);
> +
> + test_readfile_supported(); // 1 test
> +
> + test_sysfs_files(); // 1 test
> +
> + test_filesizes(); // 6 tests
> +
> + setup_tmpdir();
> +
> + readfile(TEST_FILE1);
> + readfile(TEST_FILE2);
> +// readfile(TEST_FILE4);

???

> +
> + teardown_tmpdir();
> +
> + if (ksft_get_fail_cnt())
> + return ksft_exit_fail();
> +
> + return ksft_exit_pass();
> +}

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-07-04 18:39:13

by Geert Uytterhoeven

Subject: Re: [PATCH 2/3] arch: wire up the readfile syscall

On Sat, Jul 4, 2020 at 4:05 PM Greg Kroah-Hartman
<[email protected]> wrote:
> This wires up the readfile syscall for all architectures.
>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>

> arch/m68k/kernel/syscalls/syscall.tbl | 1 +

Acked-by: Geert Uytterhoeven <[email protected]>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-07-04 18:40:25

by Geert Uytterhoeven

Subject: Re: [PATCH 1/3] readfile: implement readfile syscall

Hi Greg,

On Sat, Jul 4, 2020 at 4:05 PM Greg Kroah-Hartman
<[email protected]> wrote:
> It's a tiny syscall, meant to allow a user to do a single "open this
> file, read into this buffer, and close the file" all in a single shot.
>
> Should be good for reading "tiny" files in sysfs, procfs, and other
> "small" files.
>
> There is no restarting the syscall; this is a "simple" syscall, an
> attempt to make reading "simple" files easier with less syscall
> overhead.
>
> Cc: Alexander Viro <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>

Thanks for your patch!

> --- a/fs/open.c
> +++ b/fs/open.c

> +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> + char __user *, buffer, size_t, bufsize, int, flags)
> +{
> + struct open_flags op;
> + struct open_how how;
> + struct file *file;
> + loff_t pos = 0;
> + int retval;
> +
> + /* only accept a small subset of O_ flags that make sense */
> + if ((flags & (O_NOFOLLOW | O_NOATIME)) != flags)
> + return -EINVAL;
> +
> + /* add some needed flags to be able to open the file properly */
> + flags |= O_RDONLY | O_LARGEFILE;
> +
> + how = build_open_how(flags, 0000);
> + retval = build_open_flags(&how, &op);
> + if (retval)
> + return retval;
> +
> + file = readfile_open(dfd, filename, &op);
> + if (IS_ERR(file))
> + return PTR_ERR(file);
> +
> + retval = vfs_read(file, buffer, bufsize, &pos);

Should there be a way for the user to be informed that the file doesn't
fit in the provided buffer (e.g. -EFBIG)?

> +
> + filp_close(file, NULL);
> +
> + return retval;
> +}

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-07-04 19:15:32

by Matthew Wilcox

Subject: Re: [PATCH 1/3] readfile: implement readfile syscall

On Sat, Jul 04, 2020 at 04:02:47PM +0200, Greg Kroah-Hartman wrote:
> + /* only accept a small subset of O_ flags that make sense */
> + if ((flags & (O_NOFOLLOW | O_NOATIME)) != flags)
> + return -EINVAL;
> +
> + /* add some needed flags to be able to open the file properly */
> + flags |= O_RDONLY | O_LARGEFILE;

Should we also add O_NOCTTY to prevent any problems if this is called on
a tty?

2020-07-04 19:31:12

by Al Viro

Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sat, Jul 04, 2020 at 04:02:46PM +0200, Greg Kroah-Hartman wrote:
> Here is a tiny new syscall, readfile, that makes it simpler to read
> small/medium sized files all in one shot: no need to do open/read/close.
> This is especially helpful for tools that poke around in procfs or
> sysfs, putting a little less load on the system than before, especially
> as syscall overheads go up over time due to the mitigations for various
> CPU bugs.

Nice series, but you are 3 months late with it... Next AFD, perhaps?

Seriously, the rationale is bollocks. If the overhead of 2 extra
syscalls is anywhere near the costs of the real work being done by
that thing, we have already lost and the best thing to do is to
throw the system away and start with saner hardware.

2020-07-04 19:42:59

by Miklos Szeredi

Subject: Re: [PATCH 1/3] readfile: implement readfile syscall

On Sat, Jul 4, 2020 at 4:03 PM Greg Kroah-Hartman
<[email protected]> wrote:
>
> It's a tiny syscall, meant to allow a user to do a single "open this
> file, read into this buffer, and close the file" all in a single shot.
>
> Should be good for reading "tiny" files in sysfs, procfs, and other
> "small" files.
>
> There is no restarting the syscall; this is a "simple" syscall, an
> attempt to make reading "simple" files easier with less syscall
> overhead.
>
> Cc: Alexander Viro <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
> ---
> fs/open.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 50 insertions(+)
>
> diff --git a/fs/open.c b/fs/open.c
> index 6cd48a61cda3..4469faa9379c 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1370,3 +1370,53 @@ int stream_open(struct inode *inode, struct file *filp)
> }
>
> EXPORT_SYMBOL(stream_open);
> +
> +static struct file *readfile_open(int dfd, const char __user *filename,
> + struct open_flags *op)
> +{
> + struct filename *tmp;
> + struct file *f;
> +
> + tmp = getname(filename);
> + if (IS_ERR(tmp))
> + return (struct file *)tmp;
> +
> + f = do_filp_open(dfd, tmp, op);
> + if (!IS_ERR(f))
> + fsnotify_open(f);
> +
> + putname(tmp);
> + return f;
> +}
> +
> +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> + char __user *, buffer, size_t, bufsize, int, flags)
> +{
> + struct open_flags op;
> + struct open_how how;
> + struct file *file;
> + loff_t pos = 0;
> + int retval;
> +
> + /* only accept a small subset of O_ flags that make sense */
> + if ((flags & (O_NOFOLLOW | O_NOATIME)) != flags)
> + return -EINVAL;
> +
> + /* add some needed flags to be able to open the file properly */
> + flags |= O_RDONLY | O_LARGEFILE;
> +
> + how = build_open_how(flags, 0000);
> + retval = build_open_flags(&how, &op);
> + if (retval)
> + return retval;
> +
> + file = readfile_open(dfd, filename, &op);
> + if (IS_ERR(file))
> + return PTR_ERR(file);
> +
> + retval = vfs_read(file, buffer, bufsize, &pos);
> +
> + filp_close(file, NULL);
> +
> + return retval;

Manpage says: "doing the sequence of open() and then read() and then
close()", which is exactly what it does.

But then it goes on to say: "If the file is larger than the value
provided in count then only count number of bytes will be copied into
buf", which is only half true, it should be: "If the file is larger
than the value provided in count then at most count number of bytes
will be copied into buf", which is not a lot of information.

And "If the size of file is smaller than the value provided in count
then the whole file will be copied into buf", which is simply a lie;
for example seq_file will happily return a smaller-than-PAGE_SIZE
chunk if at least one record fits in there. You'll have a very hard
time explaining that in the man page. So I think there are two
possible ways forward:

1) just leave the first explanation (it's an open + read + close
equivalent) and leave out the rest

2) add a loop around the vfs_read() in the code.

I'd strongly prefer #2 because with the non-looping read it's
impossible to detect whether the file was completely read or not, and
that's just going to lead to surprises and bugs in userspace code.
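
Completely untested, but roughly what I mean by #2 (error handling
details aside):

        retval = 0;
        while (bufsize > 0) {
                ssize_t ret = vfs_read(file, buffer, bufsize, &pos);

                if (ret <= 0) {
                        /* report an error only if nothing was read yet */
                        if (retval == 0)
                                retval = ret;
                        break;
                }
                buffer += ret;
                bufsize -= ret;
                retval += ret;
        }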

Thanks,
Miklos

2020-07-04 20:12:56

by Al Viro

Subject: Re: [PATCH 1/3] readfile: implement readfile syscall

On Sat, Jul 04, 2020 at 09:41:09PM +0200, Miklos Szeredi wrote:
> And "If the size of file is smaller than the value provided in count
> then the whole file will be copied into buf", which is simply a lie;
> for example seq_file will happily return a smaller-than-PAGE_SIZE
> chunk if at least one record fits in there. You'll have a very hard
> time explaining that in the man page. So I think there are two
> possible ways forward:
>
> 1) just leave the first explanation (it's an open + read + close
> equivalent) and leave out the rest
>
> 2) add a loop around the vfs_read() in the code.

3) don't bother with the entire thing, until somebody manages to demonstrate
a setup where it does make a real difference (compared to the obvious
sequence of syscalls, that is). At which point we'll need to figure out
what's going on and deal with the underlying problem of that setup.

2020-07-04 20:17:56

by Al Viro

Subject: Re: [PATCH 1/3] readfile: implement readfile syscall

On Sat, Jul 04, 2020 at 09:12:06PM +0100, Al Viro wrote:
> On Sat, Jul 04, 2020 at 09:41:09PM +0200, Miklos Szeredi wrote:
> > And "If the size of file is smaller than the value provided in count
> > then the whole file will be copied into buf", which is simply a lie;
> > for example seq_file will happily return a smaller-than-PAGE_SIZE
> > chunk if at least one record fits in there. You'll have a very hard
> > time explaining that in the man page. So I think there are two
> > possible ways forward:
> >
> > 1) just leave the first explanation (it's an open + read + close
> > equivalent) and leave out the rest
> >
> > 2) add a loop around the vfs_read() in the code.
>
> 3) don't bother with the entire thing, until somebody manages to demonstrate
> a setup where it does make a real difference (compared to the obvious
> sequence of syscalls, that is). At which point we'll need to figure out
> what's going on and deal with the underlying problem of that setup.

Incidentally, if that's intended for use on _sysfs_, I would like to see the
effects of that sucker being called by many processes in parallel, seeing that
sysfs has, er, certain scalability problems in its lookups. And I would be
very surprised if they were not heavier than said overhead of two extra syscalls.

2020-07-04 20:53:22

by Alexey Dobriyan

Subject: Re: [PATCH 1/3] readfile: implement readfile syscall

Al wrote:

> > On Sat, Jul 04, 2020 at 09:41:09PM +0200, Miklos Szeredi wrote:
> > 1) just leave the first explanation (it's an open + read + close
> > equivalent) and leave out the rest
> >
> > 2) add a loop around the vfs_read() in the code.
>
> 3) don't bother with the entire thing, until somebody manages to demonstrate
> a setup where it does make a real difference (compared to the obvious
> sequence of syscalls, that is).

Ehh? System call overhead is trivially measurable.
https://lwn.net/Articles/814175/

> At which point we'll need to figure out
> what's going on and deal with the underlying problem of that setup.

Run top?

Teach userspace to read 1 page minimum?

192803 read(4, "cpu 3718263 4417 342808 7127674"..., 1024) = 1024
192803 read(4, " 0 21217 21617 21954 10201 15425"..., 1024) = 1024
192803 read(4, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 1024) = 1024
192803 read(4, "", 1024)

Teach userspace to pread?

192803 openat(AT_FDCWD, "/proc/uptime", O_RDONLY) = 5
192803 lseek(5, 0, SEEK_SET) = 0
192803 read(5, "47198.56 713699.82\n", 8191) = 19
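
IOW keep one fd open and pread() from offset 0 every time, something
like (illustrative only, untested):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char buf[256];
        int i, fd = open("/proc/uptime", O_RDONLY);

        if (fd < 0)
                return 1;

        for (i = 0; i < 10; i++) {
                /* no reopen, no lseek: procfs regenerates on each read */
                ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0);

                if (n <= 0)
                        break;
                buf[n] = '\0';
                fputs(buf, stdout);
                sleep(1);
        }
        close(fd);
        return 0;
}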

Rhetorical question: what is harder: ditch the main source of overhead
(VFS, seq_file, text) or teach userspace how to read files?

Here is open+read /proc/cpuinfo in python2 and python3.
Python2 case is terrifying.

BTW is there something broken with seqfiles and record keeping?
Why does it return only 2 records per read?

Python 3:

openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
ioctl(3, TCGETS, 0x7ffe6f1f0850) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
ioctl(3, TCGETS, 0x7ffe6f1f0710) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "processor\t: 0\nvendor_id\t: Genuin"..., 8192) = 3038
read(3, "processor\t: 2\nvendor_id\t: Genuin"..., 5154) = 3038
read(3, "processor\t: 4\nvendor_id\t: Genuin"..., 2116) = 2116
read(3, "clmulqdq dtes64 monitor ds_cpl v"..., 8448) = 3966
read(3, "processor\t: 8\nvendor_id\t: Genuin"..., 4482) = 3038
read(3, "processor\t: 10\nvendor_id\t: Genui"..., 1444) = 1444
read(3, "t\t: 64\naddress sizes\t: 46 bits p"..., 16896) = 3116
read(3, "processor\t: 13\nvendor_id\t: Genui"..., 13780) = 3044
read(3, "processor\t: 15\nvendor_id\t: Genui"..., 10736) = 1522
read(3, "", 9214) = 0


Python 2

openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
lseek(3, 0, SEEK_CUR) = 0
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "processor\t: 0\nvendor_id\t: Genuin"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 1024
lseek(3, 0, SEEK_CUR) = 1024
read(3, " cqm_occup_llc cqm_mbm_total cqm"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 2048
lseek(3, 0, SEEK_CUR) = 2048
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 2048
lseek(3, 0, SEEK_CUR) = 2048
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 2048
lseek(3, 0, SEEK_CUR) = 2048
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 2048
lseek(3, 0, SEEK_CUR) = 2048
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 2048
lseek(3, 0, SEEK_CUR) = 2048
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 2048
lseek(3, 0, SEEK_CUR) = 2048
read(3, "ebs bts rep_good nopl xtopology "..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 3072
lseek(3, 0, SEEK_CUR) = 3072
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 3072
lseek(3, 0, SEEK_CUR) = 3072
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 3072
lseek(3, 0, SEEK_CUR) = 3072
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 3072
lseek(3, 0, SEEK_CUR) = 3072
read(3, "ntel\ncpu family\t: 6\nmodel\t\t: 79\n"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 4096
lseek(3, 0, SEEK_CUR) = 4096
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 4096
lseek(3, 0, SEEK_CUR) = 4096
read(3, "bm_local dtherm ida arat pln pts"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 5120
lseek(3, 0, SEEK_CUR) = 5120
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 5120
lseek(3, 0, SEEK_CUR) = 5120
read(3, "nstop_tsc cpuid aperfmperf pni p"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 6144
lseek(3, 0, SEEK_CUR) = 6144
read(3, "del name\t: Intel(R) Xeon(R) CPU "..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 7168
lseek(3, 0, SEEK_CUR) = 7168
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 7168
lseek(3, 0, SEEK_CUR) = 7168
read(3, "d_clear flush_l1d\nvmx flags\t: vn"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 8192
lseek(3, 0, SEEK_CUR) = 8192
read(3, "clmulqdq dtes64 monitor ds_cpl v"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 9216
lseek(3, 0, SEEK_CUR) = 9216
read(3, "E5-2620 v4 @ 2.10GHz\nstepping\t: "..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 10240
lseek(3, 0, SEEK_CUR) = 10240
read(3, "vnmi preemption_timer posted_int"..., 1024) = 1024
read(3, " vmx smx est tm2 ssse3 sdbg fma "..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 12288
lseek(3, 0, SEEK_CUR) = 12288
read(3, ": 1\nmicrocode\t: 0xb000038\ncpu MH"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 13312
lseek(3, 0, SEEK_CUR) = 13312
read(3, "r invvpid ept_x_only ept_ad ept_"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 14336
lseek(3, 0, SEEK_CUR) = 14336
read(3, "16 xtpr pdcm pcid dca sse4_1 sse"..., 1024) = 1024
read(3, "\t\t: 1326.352\ncache size\t: 20480 "..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 16384
lseek(3, 0, SEEK_CUR) = 16384
read(3, "gb flexpriority apicv tsc_offset"..., 1024) = 1024
read(3, "4_2 x2apic movbe popcnt tsc_dead"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 18432
lseek(3, 0, SEEK_CUR) = 18432
read(3, "KB\nphysical id\t: 0\nsiblings\t: 16"..., 1024) = 1024
read(3, " vtpr mtf vapic ept vpid unrestr"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 20480
lseek(3, 0, SEEK_CUR) = 20480
read(3, "adline_timer aes xsave avx f16c "..., 2048) = 2048
read(3, "estricted_guest vapic_reg vid pl"..., 1024) = 1024
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 23552
lseek(3, 0, SEEK_CUR) = 23552
read(3, "16c rdrand lahf_lm abm 3dnowpref"..., 2048) = 770
read(3, "", 1024) = 0
close(3) = 0

2020-07-04 21:16:57

by Miklos Szeredi

Subject: Re: [PATCH 1/3] readfile: implement readfile syscall

On Sat, Jul 4, 2020 at 10:50 PM Alexey Dobriyan <[email protected]> wrote:
>
> Al wrote:
>
> > > On Sat, Jul 04, 2020 at 09:41:09PM +0200, Miklos Szeredi wrote:
> > > 1) just leave the first explanation (it's an open + read + close
> > > equivalent) and leave out the rest
> > >
> > > 2) add a loop around the vfs_read() in the code.
> >
> > 3) don't bother with the entire thing, until somebody manages to demonstrate
> > a setup where it does make a real difference (compared to the obvious
> > sequence of syscalls, that is).
>
> Ehh? System call overhead is trivially measurable.
> https://lwn.net/Articles/814175/
>
> > At which point we'll need to figure out
> > what's going on and deal with the underlying problem of that setup.
>
> Run top?
>
> Teach userspace to read 1 page minimum?
>
> 192803 read(4, "cpu 3718263 4417 342808 7127674"..., 1024) = 1024
> 192803 read(4, " 0 21217 21617 21954 10201 15425"..., 1024) = 1024
> 192803 read(4, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 1024) = 1024
> 192803 read(4, "", 1024)
>
> Teach userspace to pread?
>
> 192803 openat(AT_FDCWD, "/proc/uptime", O_RDONLY) = 5
> 192803 lseek(5, 0, SEEK_SET) = 0
> 192803 read(5, "47198.56 713699.82\n", 8191) = 19
>
> Rhetorical question: what is harder: ditch the main source of overhead
> (VFS, seq_file, text) or teach userspace how to read files?
>
> Here is open+read /proc/cpuinfo in python2 and python3.
> Python2 case is terrifying.
>
> BTW is there something broken with seqfiles and record keeping?
> Why does it return only 2 records per read?

seqfile is weird. It uses an internal buffer (m->buf) that starts off
with a PAGE_SIZE size (m->size). The internal buffer is filled with
whole records; if a single record is larger than the buffer size, the
buffer is expanded until the record fits. This internal buffer is then
used to fill the user buffer supplied by read. If the length of the
internal buffer (m->count) overflows the user buffer, the rest is set
aside for the next read. In that case the next read starts with the
tail (m->from) of the internal buffer; then, if there's still space in
the user buffer, it resets the internal buffer (m->from = 0), again
fills it with at least one whole record, and copies as much of that to
the user buffer as it can.

Note how this can lead to unfilled bytes at the end of the user
buffer. Not sure what the rationale is; it could just as well loop
back to filling a new buffer if there's space left in the user buffer.
Maybe it was: better to return whole records whenever possible. Anyway,
that can't be changed now, since it's bound to break something out
there.
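
A rough model of the copy-out logic described above, in plain C (this
is not the actual fs/seq_file.c code; the field names mirror struct
seq_file, and fill_with_whole_records() is a hypothetical stand-in for
the step that grows the buffer and calls ->show()):

        #include <stddef.h>
        #include <string.h>

        struct seq_model {
                char *buf;      /* m->buf   */
                size_t size;    /* m->size  */
                size_t count;   /* bytes currently held in buf   */
                size_t from;    /* offset of the unconsumed tail */
        };

        /* one read(fd, ubuf, usize) against the model above */
        static size_t seq_read_model(struct seq_model *m, char *ubuf,
                                     size_t usize)
        {
                size_t copied = 0, n;

                /* 1) hand out the tail left over from the previous read */
                if (m->count) {
                        n = m->count < usize ? m->count : usize;
                        memcpy(ubuf, m->buf + m->from, n);
                        m->from += n;
                        m->count -= n;
                        copied += n;
                }

                /* 2) if there is still room, reset the buffer, refill it
                 *    with at least one whole record (growing it until the
                 *    record fits), and copy as much of that as possible */
                if (copied < usize) {
                        m->from = 0;
                        fill_with_whole_records(m);     /* hypothetical */
                        n = m->count < usize - copied ? m->count
                                                      : usize - copied;
                        memcpy(ubuf + copied, m->buf, n);
                        m->from = n;
                        m->count -= n;
                        copied += n;
                        /* no loop back to step 2, which is why the user
                         * buffer can come back only partially filled */
                }
                return copied;
        }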

Thanks,
Miklos

>
> Python 3:
>
> openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> ioctl(3, TCGETS, 0x7ffe6f1f0850) = -1 ENOTTY (Inappropriate ioctl for device)
> lseek(3, 0, SEEK_CUR) = 0
> ioctl(3, TCGETS, 0x7ffe6f1f0710) = -1 ENOTTY (Inappropriate ioctl for device)
> lseek(3, 0, SEEK_CUR) = 0
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> read(3, "processor\t: 0\nvendor_id\t: Genuin"..., 8192) = 3038
> read(3, "processor\t: 2\nvendor_id\t: Genuin"..., 5154) = 3038
> read(3, "processor\t: 4\nvendor_id\t: Genuin"..., 2116) = 2116
> read(3, "clmulqdq dtes64 monitor ds_cpl v"..., 8448) = 3966
> read(3, "processor\t: 8\nvendor_id\t: Genuin"..., 4482) = 3038
> read(3, "processor\t: 10\nvendor_id\t: Genui"..., 1444) = 1444
> read(3, "t\t: 64\naddress sizes\t: 46 bits p"..., 16896) = 3116
> read(3, "processor\t: 13\nvendor_id\t: Genui"..., 13780) = 3044
> read(3, "processor\t: 15\nvendor_id\t: Genui"..., 10736) = 1522
> read(3, "", 9214) = 0
>
>
> Python 2
>
> openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY) = 3
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 0
> lseek(3, 0, SEEK_CUR) = 0
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> read(3, "processor\t: 0\nvendor_id\t: Genuin"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 1024
> lseek(3, 0, SEEK_CUR) = 1024
> read(3, " cqm_occup_llc cqm_mbm_total cqm"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 2048
> lseek(3, 0, SEEK_CUR) = 2048
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 2048
> lseek(3, 0, SEEK_CUR) = 2048
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 2048
> lseek(3, 0, SEEK_CUR) = 2048
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 2048
> lseek(3, 0, SEEK_CUR) = 2048
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 2048
> lseek(3, 0, SEEK_CUR) = 2048
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 2048
> lseek(3, 0, SEEK_CUR) = 2048
> read(3, "ebs bts rep_good nopl xtopology "..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 3072
> lseek(3, 0, SEEK_CUR) = 3072
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 3072
> lseek(3, 0, SEEK_CUR) = 3072
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 3072
> lseek(3, 0, SEEK_CUR) = 3072
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 3072
> lseek(3, 0, SEEK_CUR) = 3072
> read(3, "ntel\ncpu family\t: 6\nmodel\t\t: 79\n"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 4096
> lseek(3, 0, SEEK_CUR) = 4096
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 4096
> lseek(3, 0, SEEK_CUR) = 4096
> read(3, "bm_local dtherm ida arat pln pts"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 5120
> lseek(3, 0, SEEK_CUR) = 5120
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 5120
> lseek(3, 0, SEEK_CUR) = 5120
> read(3, "nstop_tsc cpuid aperfmperf pni p"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 6144
> lseek(3, 0, SEEK_CUR) = 6144
> read(3, "del name\t: Intel(R) Xeon(R) CPU "..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 7168
> lseek(3, 0, SEEK_CUR) = 7168
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 7168
> lseek(3, 0, SEEK_CUR) = 7168
> read(3, "d_clear flush_l1d\nvmx flags\t: vn"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 8192
> lseek(3, 0, SEEK_CUR) = 8192
> read(3, "clmulqdq dtes64 monitor ds_cpl v"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 9216
> lseek(3, 0, SEEK_CUR) = 9216
> read(3, "E5-2620 v4 @ 2.10GHz\nstepping\t: "..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 10240
> lseek(3, 0, SEEK_CUR) = 10240
> read(3, "vnmi preemption_timer posted_int"..., 1024) = 1024
> read(3, " vmx smx est tm2 ssse3 sdbg fma "..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 12288
> lseek(3, 0, SEEK_CUR) = 12288
> read(3, ": 1\nmicrocode\t: 0xb000038\ncpu MH"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 13312
> lseek(3, 0, SEEK_CUR) = 13312
> read(3, "r invvpid ept_x_only ept_ad ept_"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 14336
> lseek(3, 0, SEEK_CUR) = 14336
> read(3, "16 xtpr pdcm pcid dca sse4_1 sse"..., 1024) = 1024
> read(3, "\t\t: 1326.352\ncache size\t: 20480 "..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 16384
> lseek(3, 0, SEEK_CUR) = 16384
> read(3, "gb flexpriority apicv tsc_offset"..., 1024) = 1024
> read(3, "4_2 x2apic movbe popcnt tsc_dead"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 18432
> lseek(3, 0, SEEK_CUR) = 18432
> read(3, "KB\nphysical id\t: 0\nsiblings\t: 16"..., 1024) = 1024
> read(3, " vtpr mtf vapic ept vpid unrestr"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 20480
> lseek(3, 0, SEEK_CUR) = 20480
> read(3, "adline_timer aes xsave avx f16c "..., 2048) = 2048
> read(3, "estricted_guest vapic_reg vid pl"..., 1024) = 1024
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_CUR) = 23552
> lseek(3, 0, SEEK_CUR) = 23552
> read(3, "16c rdrand lahf_lm abm 3dnowpref"..., 2048) = 770
> read(3, "", 1024) = 0
> close(3) = 0

2020-07-05 01:42:50

by Heinrich Schuchardt

[permalink] [raw]
Subject: Re: [PATCH 3/3] selftests: add readfile(2) selftests

On 7/4/20 4:02 PM, Greg Kroah-Hartman wrote:
> Test the functionality of readfile(2) in various ways.

Hello Greg,

I expect readfile() to generate fanotify events FAN_OPEN_PERM, FAN_OPEN,
FAN_ACCESS_PERM, FAN_ACCESS, FAN_CLOSE_NOWRITE in this sequence.

Looking at patch 1/3 you took care of notifications. Would this deserve
testing here?

Best regards

Heinrich

>
> Also provide a simple speed test program to benchmark using readfile()
> instead of using open()/read()/close().
>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
> ---
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/readfile/.gitignore | 3 +
> tools/testing/selftests/readfile/Makefile | 7 +
> tools/testing/selftests/readfile/readfile.c | 285 +++++++++++++++++
> .../selftests/readfile/readfile_speed.c | 301 ++++++++++++++++++
> 5 files changed, 597 insertions(+)
> create mode 100644 tools/testing/selftests/readfile/.gitignore
> create mode 100644 tools/testing/selftests/readfile/Makefile
> create mode 100644 tools/testing/selftests/readfile/readfile.c
> create mode 100644 tools/testing/selftests/readfile/readfile_speed.c
>
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index 1195bd85af38..82359233b945 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -46,6 +46,7 @@ TARGETS += ptrace
> TARGETS += openat2
> TARGETS += rseq
> TARGETS += rtc
> +TARGETS += readfile
> TARGETS += seccomp
> TARGETS += sigaltstack
> TARGETS += size
> diff --git a/tools/testing/selftests/readfile/.gitignore b/tools/testing/selftests/readfile/.gitignore
> new file mode 100644
> index 000000000000..f0e758d437e4
> --- /dev/null
> +++ b/tools/testing/selftests/readfile/.gitignore
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0
> +readfile
> +readfile_speed
> diff --git a/tools/testing/selftests/readfile/Makefile b/tools/testing/selftests/readfile/Makefile
> new file mode 100644
> index 000000000000..1bf1bdec40f8
> --- /dev/null
> +++ b/tools/testing/selftests/readfile/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +CFLAGS += -g -I../../../../usr/include/
> +CFLAGS += -O2 -Wl,-no-as-needed -Wall
> +
> +TEST_GEN_PROGS := readfile readfile_speed
> +
> +include ../lib.mk
> diff --git a/tools/testing/selftests/readfile/readfile.c b/tools/testing/selftests/readfile/readfile.c
> new file mode 100644
> index 000000000000..f0736c6dfa69
> --- /dev/null
> +++ b/tools/testing/selftests/readfile/readfile.c
> @@ -0,0 +1,285 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2020 Greg Kroah-Hartman <[email protected]>
> + * Copyright (c) 2020 The Linux Foundation
> + *
> + * Test the readfile() syscall in various ways.
> + */
> +#define _GNU_SOURCE
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <dirent.h>
> +#include <fcntl.h>
> +#include <limits.h>
> +#include <string.h>
> +#include <syscall.h>
> +
> +#include "../kselftest.h"
> +
> +//#ifndef __NR_readfile
> +//#define __NR_readfile -1
> +//#endif
> +
> +#define __NR_readfile 440
> +
> +#define TEST_FILE1 "/sys/devices/system/cpu/vulnerabilities/meltdown"
> +#define TEST_FILE2 "/sys/devices/system/cpu/vulnerabilities/spectre_v1"
> +#define TEST_FILE4 "/sys/kernel/debug/usb/devices"
> +
> +static int sys_readfile(int fd, const char *filename, unsigned char *buffer,
> + size_t bufsize, int flags)
> +{
> + return syscall(__NR_readfile, fd, filename, buffer, bufsize, flags);
> +}
> +
> +/*
> + * Test that readfile() is even in the running kernel or not.
> + */
> +static void test_readfile_supported(void)
> +{
> + const char *proc_map = "/proc/self/maps";
> + unsigned char buffer[10];
> + int retval;
> +
> + if (__NR_readfile < 0)
> + ksft_exit_skip("readfile() syscall is not defined for the kernel this test was built against\n");
> +
> + /*
> + * Do a simple test to see if the syscall really is present in the
> + * running kernel
> + */
> + retval = sys_readfile(0, proc_map, &buffer[0], sizeof(buffer), 0);
> + if (retval == -1)
> + ksft_exit_skip("readfile() syscall not present on running kernel\n");
> +
> + ksft_test_result_pass("readfile() syscall present\n");
> +}
> +
> +/*
> + * Open all files in a specific sysfs directory and read from them
> + *
> + * This tests the "openat" type functionality of opening all files relative to a
> + * directory. We don't care at the moment about the contents.
> + */
> +static void test_sysfs_files(void)
> +{
> + static unsigned char buffer[8000];
> + const char *sysfs_dir = "/sys/devices/system/cpu/vulnerabilities/";
> + struct dirent *dirent;
> + DIR *vuln_sysfs_dir;
> + int sysfs_fd;
> + int retval;
> +
> + sysfs_fd = open(sysfs_dir, O_PATH | O_DIRECTORY);
> + if (sysfs_fd == -1) {
> + ksft_test_result_skip("unable to open %s directory\n",
> + sysfs_dir);
> + return;
> + }
> +
> + vuln_sysfs_dir = opendir(sysfs_dir);
> + if (!vuln_sysfs_dir) {
> + ksft_test_result_skip("%s unable to be opened, skipping test\n");
> + return;
> + }
> +
> + ksft_print_msg("readfile: testing relative path functionality by reading files in %s\n",
> + sysfs_dir);
> + /* open all sysfs file in this directory and read the whole thing */
> + while ((dirent = readdir(vuln_sysfs_dir))) {
> + /* ignore . and .. */
> + if (strcmp(dirent->d_name, ".") == 0 ||
> + strcmp(dirent->d_name, "..") == 0)
> + continue;
> +
> + retval = sys_readfile(sysfs_fd, dirent->d_name, &buffer[0],
> + sizeof(buffer), 0);
> +
> + if (retval <= 0) {
> + ksft_test_result_fail("readfile(%s) failed with %d\n",
> + dirent->d_name, retval);
> + goto exit;
> + }
> +
> + /* cut off trailing \n character */
> + buffer[retval - 1] = 0x00;
> + ksft_print_msg(" '%s' contains \"%s\"\n", dirent->d_name,
> + buffer);
> + }
> +
> + ksft_test_result_pass("readfile() relative path functionality passed\n");
> +
> +exit:
> + closedir(vuln_sysfs_dir);
> + close(sysfs_fd);
> +}
> +
> +/* Temporary directory variables */
> +static int root_fd; /* test root directory file handle */
> +static char tmpdir[PATH_MAX];
> +
> +static void setup_tmpdir(void)
> +{
> + char *tmpdir_root;
> +
> + tmpdir_root = getenv("TMPDIR");
> + if (!tmpdir_root)
> + tmpdir_root = "/tmp";
> +
> + snprintf(tmpdir, PATH_MAX, "%s/readfile.XXXXXX", tmpdir_root);
> + if (!mkdtemp(tmpdir)) {
> + ksft_test_result_fail("mkdtemp(%s) failed\n", tmpdir);
> + ksft_exit_fail();
> + }
> +
> + root_fd = open(tmpdir, O_PATH | O_DIRECTORY);
> + if (root_fd == -1) {
> + ksft_exit_fail_msg("%s unable to be opened, error = %d\n",
> + tmpdir, root_fd);
> + ksft_exit_fail();
> + }
> +
> + ksft_print_msg("%s created to use for testing\n", tmpdir);
> +}
> +
> +static void teardown_tmpdir(void)
> +{
> + int retval;
> +
> + close(root_fd);
> +
> + retval = rmdir(tmpdir);
> + if (retval) {
> + ksft_exit_fail_msg("%s removed with return value %d\n",
> + tmpdir, retval);
> + ksft_exit_fail();
> + }
> + ksft_print_msg("%s cleaned up and removed\n", tmpdir);
> +
> +}
> +
> +static void test_filesize(size_t size)
> +{
> + char filename[PATH_MAX];
> + unsigned char *write_data;
> + unsigned char *read_data;
> + int fd;
> + int retval;
> + size_t i;
> +
> + snprintf(filename, PATH_MAX, "size-%ld", size);
> +
> + read_data = malloc(size);
> + write_data = malloc(size);
> + if (!read_data || !write_data)
> + ksft_exit_fail_msg("Unable to allocate %ld bytes\n", size);
> +
> + fd = openat(root_fd, filename, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
> + if (fd < 0)
> + ksft_exit_fail_msg("Unable to create file %s\n", filename);
> +
> + ksft_print_msg("%s created\n", filename);
> +
> + for (i = 0; i < size; ++i)
> + write_data[i] = (unsigned char)(0xff & i);
> +
> + write(fd, write_data, size);
> + close(fd);
> +
> + retval = sys_readfile(root_fd, filename, read_data, size, 0);
> +
> + if (retval != size) {
> + ksft_test_result_fail("Read %d bytes but wanted to read %ld bytes.\n",
> + retval, size);
> + goto exit;
> + }
> +
> + if (memcmp(read_data, write_data, size) != 0) {
> + ksft_test_result_fail("Read data of buffer size %d did not match written data\n",
> + size);
> + goto exit;
> + }
> +
> + ksft_test_result_pass("readfile() of size %ld succeeded.\n", size);
> +
> +exit:
> + unlinkat(root_fd, filename, 0);
> + free(write_data);
> + free(read_data);
> +}
> +
> +
> +/*
> + * Create a bunch of differently sized files, and verify we read the correct
> + * amount of data from them.
> + */
> +static void test_filesizes(void)
> +{
> + setup_tmpdir();
> +
> + test_filesize(0x10);
> + test_filesize(0x100);
> + test_filesize(0x1000);
> + test_filesize(0x10000);
> + test_filesize(0x100000);
> + test_filesize(0x1000000);
> +
> + teardown_tmpdir();
> +
> +}
> +
> +static void readfile(const char *filename)
> +{
> +// int root_fd;
> + unsigned char buffer[16000];
> + int retval;
> +
> + memset(buffer, 0x00, sizeof(buffer));
> +
> +// root_fd = open("/", O_DIRECTORY);
> +// if (root_fd == -1)
> +// ksft_exit_fail_msg("error with root_fd\n");
> +
> + retval = sys_readfile(root_fd, filename, &buffer[0], sizeof(buffer), 0);
> +
> +// close(root_fd);
> +
> + if (retval <= 0)
> + ksft_test_result_fail("readfile() test of filename=%s failed with retval %d\n",
> + filename, retval);
> + else
> + ksft_test_result_pass("readfile() test of filename=%s succeeded with retval=%d\n",
> + filename, retval);
> +// buffer='%s'\n",
> +// filename, retval, &buffer[0]);
> +
> +}
> +
> +
> +int main(int argc, char *argv[])
> +{
> + ksft_print_header();
> + ksft_set_plan(10);
> +
> + test_readfile_supported(); // 1 test
> +
> + test_sysfs_files(); // 1 test
> +
> + test_filesizes(); // 6 tests
> +
> + setup_tmpdir();
> +
> + readfile(TEST_FILE1);
> + readfile(TEST_FILE2);
> +// readfile(TEST_FILE4);
> +
> + teardown_tmpdir();
> +
> + if (ksft_get_fail_cnt())
> + return ksft_exit_fail();
> +
> + return ksft_exit_pass();
> +}
> +
> diff --git a/tools/testing/selftests/readfile/readfile_speed.c b/tools/testing/selftests/readfile/readfile_speed.c
> new file mode 100644
> index 000000000000..11ca79163131
> --- /dev/null
> +++ b/tools/testing/selftests/readfile/readfile_speed.c
> @@ -0,0 +1,301 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2020 Greg Kroah-Hartman <[email protected]>
> + * Copyright (c) 2020 The Linux Foundation
> + *
> + * Tiny test program to try to benchmark the speed of the readfile syscall vs.
> + * the open/read/close sequence it can replace.
> + */
> +#define _GNU_SOURCE
> +#include <dirent.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <limits.h>
> +#include <stdarg.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/stat.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <syscall.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +/* Default test file if no one wants to pick something else */
> +#define DEFAULT_TEST_FILE "/sys/devices/system/cpu/vulnerabilities/meltdown"
> +
> +#define DEFAULT_TEST_LOOPS 1000
> +
> +#define DEFAULT_TEST_TYPE "both"
> +
> +/* Max number of bytes that will be read from the file */
> +#define TEST_BUFFER_SIZE 10000
> +static unsigned char test_buffer[TEST_BUFFER_SIZE];
> +
> +enum test_type {
> + TEST_READFILE,
> + TEST_OPENREADCLOSE,
> + TEST_BOTH,
> +};
> +
> +/* Find the readfile syscall number */
> +//#ifndef __NR_readfile
> +//#define __NR_readfile -1
> +//#endif
> +#define __NR_readfile 440
> +
> +static int sys_readfile(int fd, const char *filename, unsigned char *buffer,
> + size_t bufsize, int flags)
> +{
> + return syscall(__NR_readfile, fd, filename, buffer, bufsize, flags);
> +}
> +
> +/* Test that readfile() is even in the running kernel or not. */
> +static void test_readfile_supported(void)
> +{
> + const char *proc_map = "/proc/self/maps";
> + unsigned char buffer[10];
> + int retval;
> +
> + if (__NR_readfile < 0) {
> + fprintf(stderr,
> + "readfile() syscall is not defined for the kernel this test was built against.\n");
> + exit(1);
> + }
> +
> + /*
> + * Do a simple test to see if the syscall really is present in the
> + * running kernel
> + */
> + retval = sys_readfile(0, proc_map, &buffer[0], sizeof(buffer), 0);
> + if (retval == -1) {
> + fprintf(stderr,
> + "readfile() syscall not present on running kernel.\n");
> + exit(1);
> + }
> +}
> +
> +static inline long long get_time_ns(void)
> +{
> + struct timespec t;
> +
> + clock_gettime(CLOCK_MONOTONIC, &t);
> +
> + return (long long)t.tv_sec * 1000000000 + t.tv_nsec;
> +}
> +
> +/* taken from all-io.h from util-linux repo */
> +static inline ssize_t read_all(int fd, unsigned char *buf, size_t count)
> +{
> + ssize_t ret;
> + ssize_t c = 0;
> + int tries = 0;
> +
> + while (count > 0) {
> + ret = read(fd, buf, count);
> + if (ret <= 0) {
> + if (ret < 0 && (errno == EAGAIN || errno == EINTR) &&
> + (tries++ < 5)) {
> + usleep(250000);
> + continue;
> + }
> + return c ? c : -1;
> + }
> + tries = 0;
> + count -= ret;
> + buf += ret;
> + c += ret;
> + }
> + return c;
> +}
> +
> +static int openreadclose(const char *filename, unsigned char *buffer,
> + size_t bufsize)
> +{
> + size_t count;
> + int fd;
> +
> + fd = openat(0, filename, O_RDONLY);
> + if (fd < 0) {
> + printf("error opening %s\n", filename);
> + return fd;
> + }
> +
> + count = read_all(fd, buffer, bufsize);
> + if (count < 0) {
> + printf("Error %ld reading from %s\n", count, filename);
> + }
> +
> + close(fd);
> + return count;
> +}
> +
> +static int run_test(enum test_type test_type, const char *filename)
> +{
> + switch (test_type) {
> + case TEST_READFILE:
> + return sys_readfile(0, filename, &test_buffer[0],
> + TEST_BUFFER_SIZE, O_RDONLY);
> +
> + case TEST_OPENREADCLOSE:
> + return openreadclose(filename, &test_buffer[0],
> + TEST_BUFFER_SIZE);
> + default:
> + return -EINVAL;
> + }
> +}
> +
> +static const char * const test_names[] = {
> + [TEST_READFILE] = "readfile",
> + [TEST_OPENREADCLOSE] = "open/read/close",
> +};
> +
> +static int run_test_loop(int loops, enum test_type test_type,
> + const char *filename)
> +{
> + long long time_start;
> + long long time_end;
> + long long time_elapsed;
> + int retval = 0;
> + int i;
> +
> + fprintf(stdout,
> + "Running %s test on file %s for %d loops...\n",
> + test_names[test_type], filename, loops);
> +
> + /* Fill the cache with one run of the read first */
> + retval = run_test(test_type, filename);
> + if (retval < 0) {
> + fprintf(stderr,
> + "test %s was unable to run with error %d\n",
> + test_names[test_type], retval);
> + return retval;
> + }
> +
> + time_start = get_time_ns();
> +
> + for (i = 0; i < loops; ++i) {
> + retval = run_test(test_type, filename);
> +
> + if (retval < 0) {
> + fprintf(stderr,
> + "test failed on loop %d with error %d\n",
> + i, retval);
> + break;
> + }
> + }
> + time_end = get_time_ns();
> +
> + time_elapsed = time_end - time_start;
> +
> + fprintf(stdout, "Took %lld ns\n", time_elapsed);
> +
> + return retval;
> +}
> +
> +static int do_read_file_test(int loops, enum test_type test_type,
> + const char *filename)
> +{
> + int retval;
> +
> + if (test_type == TEST_BOTH) {
> + retval = do_read_file_test(loops, TEST_READFILE, filename);
> + retval = do_read_file_test(loops, TEST_OPENREADCLOSE, filename);
> + return retval;
> + }
> + return run_test_loop(loops, test_type, filename);
> +}
> +
> +static int check_file_present(const char *filename)
> +{
> + struct stat sb;
> + int retval;
> +
> + retval = stat(filename, &sb);
> + if (retval == -1) {
> + fprintf(stderr,
> + "filename %s is not present\n", filename);
> + return retval;
> + }
> +
> + if ((sb.st_mode & S_IFMT) != S_IFREG) {
> + fprintf(stderr,
> + "filename %s must be a real file, not anything else.\n",
> + filename);
> + return -1;
> + }
> + return 0;
> +}
> +
> +static void usage(char *progname)
> +{
> + fprintf(stderr,
> + "usage: %s [options]\n"
> + " -l loops Number of loops to run the test for.\n"
> + " default is %d\n"
> + " -t testtype Test type to run.\n"
> + " types are: readfile, openreadclose, both\n"
> + " default is %s\n"
> + " -f filename Filename to read from, full path, not relative.\n"
> + " default is %s\n",
> + progname,
> + DEFAULT_TEST_LOOPS, DEFAULT_TEST_TYPE, DEFAULT_TEST_FILE);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + char *progname;
> + char *testtype = DEFAULT_TEST_TYPE;
> + char *filename = DEFAULT_TEST_FILE;
> + int loops = DEFAULT_TEST_LOOPS;
> + enum test_type test_type;
> + int retval;
> + char c;
> +
> + progname = strrchr(argv[0], '/');
> + progname = progname ? 1+progname : argv[0];
> +
> + while (EOF != (c = getopt(argc, argv, "t:l:f:h"))) {
> + switch (c) {
> + case 'l':
> + loops = atoi(optarg);
> + break;
> +
> + case 't':
> + testtype = optarg;
> + break;
> +
> + case 'f':
> + filename = optarg;
> + break;
> +
> + case 'h':
> + usage(progname);
> + return 0;
> +
> + default:
> + usage(progname);
> + return -1;
> + }
> + }
> +
> + if (strcmp(testtype, "readfile") == 0)
> + test_type = TEST_READFILE;
> + else if (strcmp(testtype, "openreadclose") == 0)
> + test_type = TEST_OPENREADCLOSE;
> + else if (strcmp(testtype, "both") == 0)
> + test_type = TEST_BOTH;
> + else {
> + usage(progname);
> + return -1;
> + }
> +
> + test_readfile_supported();
> +
> + retval = check_file_present(filename);
> + if (retval)
> + return retval;
> +
> + return do_read_file_test(loops, test_type, filename);
> +}
>

2020-07-05 02:10:50

by Jan Ziak

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

Hello

At first, I thought that the proposed system call is capable of
reading *multiple* small files using a single system call - which
would help increase HDD/SSD queue utilization and increase IOPS (I/O
operations per second) - but that isn't the case and the proposed
system call can read just a single file.

Without the ability to read multiple small files using a single system
call, it is impossible to increase IOPS (unless an application is
using multiple reader threads or somehow instructs the kernel to
prefetch multiple files into memory).

While you are at it, why not also add a readfiles system call to read
multiple, presumably small, files? The initial unoptimized
implementation of the readfiles syscall could simply call readfile
sequentially.

Sincerely
Jan (atomsymbol)

2020-07-05 02:17:07

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 05, 2020 at 04:06:22AM +0200, Jan Ziak wrote:
> Hello
>
> At first, I thought that the proposed system call is capable of
> reading *multiple* small files using a single system call - which
> would help increase HDD/SSD queue utilization and increase IOPS (I/O
> operations per second) - but that isn't the case and the proposed
> system call can read just a single file.
>
> Without the ability to read multiple small files using a single system
> call, it is impossible to increase IOPS (unless an application is
> using multiple reader threads or somehow instructs the kernel to
> prefetch multiple files into memory).

What API would you use for this?

ssize_t readfiles(int dfd, char **files, void **bufs, size_t *lens);

I pretty much hate this interface, so I hope you have something better
in mind.

2020-07-05 02:47:14

by Jan Ziak

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 5, 2020 at 4:16 AM Matthew Wilcox <[email protected]> wrote:
>
> On Sun, Jul 05, 2020 at 04:06:22AM +0200, Jan Ziak wrote:
> > Hello
> >
> > At first, I thought that the proposed system call is capable of
> > reading *multiple* small files using a single system call - which
> > would help increase HDD/SSD queue utilization and increase IOPS (I/O
> > operations per second) - but that isn't the case and the proposed
> > system call can read just a single file.
> >
> > Without the ability to read multiple small files using a single system
> > call, it is impossible to increase IOPS (unless an application is
> > using multiple reader threads or somehow instructs the kernel to
> > prefetch multiple files into memory).
>
> What API would you use for this?
>
> ssize_t readfiles(int dfd, char **files, void **bufs, size_t *lens);
>
> I pretty much hate this interface, so I hope you have something better
> in mind.

I am proposing the following:

struct readfile_t {
        int dirfd;
        const char *pathname;
        void *buf;
        size_t count;
        int flags;
        ssize_t retval; // set by kernel
        int reserved;   // not used by kernel
};

int readfiles(struct readfile_t *requests, size_t count);

Returns zero if all requests succeeded; otherwise the returned value
is non-zero (glibc wrapper: -1) and user-space is expected to check
which requests have succeeded and which have failed. retval in
readfile_t is set to what the single-file readfile syscall would
return if it were called with the contents of the corresponding
readfile_t struct.

The glibc library wrapper of this system call is expected to store the
errno in the "reserved" field. Thus, a programmer using glibc sees:

struct readfile_t {
        int dirfd;
        const char *pathname;
        void *buf;
        size_t count;
        int flags;
        ssize_t retval; // set by glibc (-1 on error)
        int errno;      // set by glibc if retval is -1
};

retval and errno in glibc's readfile_t are set to what the single-file
glibc readfile would return (retval) and set (errno) if it was called
with the contents of the corresponding readfile_t struct. In case of
an error, glibc will pick one readfile_t which failed (such as: the
1st failed one) and use it to set glibc's errno.
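
A minimal usage sketch of the proposed interface (hypothetical, since
neither the syscall nor a glibc wrapper exists yet; needs <fcntl.h> for
AT_FDCWD and <stdio.h>, and error handling is kept deliberately simple):

        char load[256], up[256];
        struct readfile_t reqs[] = {
                { .dirfd = AT_FDCWD, .pathname = "/proc/loadavg",
                  .buf = load, .count = sizeof(load), .flags = 0 },
                { .dirfd = AT_FDCWD, .pathname = "/proc/uptime",
                  .buf = up,   .count = sizeof(up),   .flags = 0 },
        };

        if (readfiles(reqs, 2) != 0) {
                /* at least one request failed; check each retval */
                for (size_t i = 0; i < 2; i++)
                        if (reqs[i].retval < 0)
                                fprintf(stderr, "%s failed\n",
                                        reqs[i].pathname);
        }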

2020-07-05 02:56:00

by Heinrich Schuchardt

[permalink] [raw]
Subject: Re: [PATCH] readfile.2: new page describing readfile(2)

On 7/4/20 4:02 PM, Greg Kroah-Hartman wrote:
> readfile(2) is a new syscall to remove the need to do the
> open/read/close dance for small virtual files in places like procfs or
> sysfs.
>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
> ---
>
> This patch is for the man-pages project, not the kernel source tree
>
> man2/readfile.2 | 159 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 159 insertions(+)
> create mode 100644 man2/readfile.2
>
> diff --git a/man2/readfile.2 b/man2/readfile.2
> new file mode 100644
> index 000000000000..449e722c3442
> --- /dev/null
> +++ b/man2/readfile.2
> @@ -0,0 +1,159 @@
> +.\" This manpage is Copyright (C) 2020 Greg Kroah-Hartman;
> +.\" and Copyright (C) 2020 The Linux Foundation
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH READFILE 2 2020-07-04 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +readfile \- read a file into a buffer
> +.SH SYNOPSIS
> +.nf
> +.B #include <unistd.h>
> +.PP
> +.BI "ssize_t readfile(int " dirfd ", const char *" pathname ", void *" buf \
> +", size_t " count ", int " flags );
> +.fi
> +.SH DESCRIPTION
> +.BR readfile ()
> +attempts to open the file specified by
> +.IR pathname
> +and to read up to
> +.I count
> +bytes from the file into the buffer starting at
> +.IR buf .
> +It is to be a shortcut of doing the sequence of

Just my personal preference for concise language:

It replaces the sequence of

> +.BR open ()
> +and then

%s/and then /, /

> +.BR read ()
> +and then

%s/and then/, and/

> +.BR close ()
> +for small files that are read frequently, such as those in

". It reduces system call overheads especially for small files, like
those in"

readfile() makes sense even if each individual file is only read once,
not frequently.

Below you describe that files up to 2 GiB can be read. So readfile()
seems to be a shortcut for larger files too.

> +.B procfs
> +or
> +.BR sysfs .

Executing readfile() generates the same file notification events as said
individual calls (cf. fanotify.7, inotify.7).

> +.PP
> +If the size of file is smaller than the value provided in
> +.I count
> +then the whole file will be copied into
> +.IR buf .
> +.PP
> +If the file is larger than the value provided in
> +.I count
> +then only
> +.I count
> +number of bytes will be copied into
> +.IR buf .
> +.PP
> +The argument
> +.I flags
> +may contain one of the following
> +.IR "access modes" :
> +.BR O_NOFOLLOW ", or " O_NOATIME .
> +.PP
> +If the pathname given in
> +.I pathname
> +is relative, then it is interpreted relative to the directory
> +referred to by the file descriptor
> +.IR dirfd .
> +.PP
> +If
> +.I pathname
> +is relative and
> +.I dirfd
> +is the special value
> +.BR AT_FDCWD ,
> +then
> +.I pathname
> +is interpreted relative to the current working
> +directory of the calling process (like
> +.BR openat ()).
> +.PP
> +If
> +.I pathname
> +is absolute, then
> +.I dirfd
> +is ignored.

readfile() blocks until either the whole file has been read, the buffer
is completely filled, or the system specific limit (see below) has been
reached.

> +.SH RETURN VALUE
> +On success, the number of bytes read is returned.
> +It is not an error if this number is smaller than the number of bytes
> +requested; this can happen if the file is smaller than the number of
> +bytes requested.

"It is not an error ..." is very vague. Are there any other cases where
a file is only partially read and the number of bytes returned is less
than the minimum of buffer size and file size? How would I discover
truncation?

Or can I rely on the call returning either an error or said minimum
number of bytes? In the latter case:

"When reading from a block device this always equals the minimum of the
buffer size specified by 'count', the file size, and the system specific
limit for read(2) calls (see below)."

> +.PP
> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EFAULT
> +.I buf
> +is outside your accessible address space.
> +.TP
> +.B EINTR
> +The call was interrupted by a signal before any data was read; see
> +.BR signal (7).
> +.TP
> +.B EINVAL
> +.I flags
> +was set to a value that is not allowed.
> +.TP
> +.B EIO
> +I/O error.
> +This will happen for example when the process is in a
> +background process group, tries to read from its controlling terminal,
> +and either it is ignoring or blocking

Can we copy the description from read.2 which gives more information or
refer to it?

> +.B SIGTTIN
> +or its process group
> +is orphaned.
> +It may also occur when there is a low-level I/O error
> +while reading from a disk or tape.
> +A further possible cause of
> +.B EIO
> +on networked filesystems is when an advisory lock had been taken
> +out on the file descriptor and this lock has been lost.
> +See the
> +.I "Lost locks"
> +section of
> +.BR fcntl (2)
> +for further details.

EPERM is missing in this section. Cf. fanotify.7.

Best regards

Heinrich

> +.SH CONFORMING TO
> +None, this is a Linux-specific system call at this point in time.
> +.SH NOTES
> +The type
> +.I size_t
> +is an unsigned integer data type specified by POSIX.1.
> +.PP
> +On Linux,
> +.BR read ()
> +(and similar system calls) will transfer at most
> +0x7ffff000 (2,147,479,552) bytes,
> +returning the number of bytes actually transferred.
> +.\" commit e28cc71572da38a5a12c1cfe4d7032017adccf69
> +(This is true on both 32-bit and 64-bit systems.)
> +.SH BUGS
> +None yet!
> +.SH SEE ALSO
> +.BR close (2),
> +.BR open (2),
> +.BR openat (2),
> +.BR read (2),
> +.BR fread (3)
>

2020-07-05 03:13:34

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 05, 2020 at 04:46:04AM +0200, Jan Ziak wrote:
> On Sun, Jul 5, 2020 at 4:16 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Sun, Jul 05, 2020 at 04:06:22AM +0200, Jan Ziak wrote:
> > > Hello
> > >
> > > At first, I thought that the proposed system call is capable of
> > > reading *multiple* small files using a single system call - which
> > > would help increase HDD/SSD queue utilization and increase IOPS (I/O
> > > operations per second) - but that isn't the case and the proposed
> > > system call can read just a single file.
> > >
> > > Without the ability to read multiple small files using a single system
> > > call, it is impossible to increase IOPS (unless an application is
> > > using multiple reader threads or somehow instructs the kernel to
> > > prefetch multiple files into memory).
> >
> > What API would you use for this?
> >
> > ssize_t readfiles(int dfd, char **files, void **bufs, size_t *lens);
> >
> > I pretty much hate this interface, so I hope you have something better
> > in mind.
>
> I am proposing the following:
>
> struct readfile_t {
> int dirfd;
> const char *pathname;
> void *buf;
> size_t count;
> int flags;
> ssize_t retval; // set by kernel
> int reserved; // not used by kernel
> };
>
> int readfiles(struct readfile_t *requests, size_t count);
>
> Returns zero if all requests succeeded, otherwise the returned value
> is non-zero (glibc wrapper: -1) and user-space is expected to check
> which requests have succeeded and which have failed. retval in
> readfile_t is set to what the single-file readfile syscall would
> return if it was called with the contents of the corresponding
> readfile_t struct.

You should probably take a look at io_uring. That has the level of
complexity of this proposal and supports open/read/close along with many
other opcodes.
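
For reference, a rough liburing sketch of the same open/read/close
sequence (each step is submitted and completed separately here to keep
it short; with IOSQE_IO_LINK and registered files the three requests
can also be chained into a single submission):

        #include <fcntl.h>
        #include <liburing.h>

        static int read_file_uring(const char *path, char *buf, unsigned len)
        {
                struct io_uring ring;
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;
                int fd, ret;

                if (io_uring_queue_init(8, &ring, 0) < 0)
                        return -1;

                /* IORING_OP_OPENAT */
                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
                io_uring_submit(&ring);
                io_uring_wait_cqe(&ring, &cqe);
                fd = cqe->res;
                io_uring_cqe_seen(&ring, cqe);
                if (fd < 0) {
                        io_uring_queue_exit(&ring);
                        return fd;
                }

                /* IORING_OP_READ */
                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_read(sqe, fd, buf, len, 0);
                io_uring_submit(&ring);
                io_uring_wait_cqe(&ring, &cqe);
                ret = cqe->res;
                io_uring_cqe_seen(&ring, cqe);

                /* IORING_OP_CLOSE */
                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_close(sqe, fd);
                io_uring_submit(&ring);
                io_uring_wait_cqe(&ring, &cqe);
                io_uring_cqe_seen(&ring, cqe);

                io_uring_queue_exit(&ring);
                return ret;
        }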

2020-07-05 03:20:06

by Jan Ziak

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <[email protected]> wrote:
>
> You should probably take a look at io_uring. That has the level of
> complexity of this proposal and supports open/read/close along with many
> other opcodes.

Then glibc can implement readfile using io_uring and there is no need
for a new single-file readfile syscall.

2020-07-05 03:28:05

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
> On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <[email protected]> wrote:
> >
> > You should probably take a look at io_uring. That has the level of
> > complexity of this proposal and supports open/read/close along with many
> > other opcodes.
>
> Then glibc can implement readfile using io_uring and there is no need
> for a new single-file readfile syscall.

It could, sure. But there's also a value in having a simple interface
to accomplish a simple task. Your proposed API added a very complex
interface to satisfy needs that clearly aren't part of the problem space
that Greg is looking to address.

2020-07-05 04:11:10

by Jan Ziak

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 5, 2020 at 5:27 AM Matthew Wilcox <[email protected]> wrote:
>
> On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
> > On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > You should probably take a look at io_uring. That has the level of
> > > complexity of this proposal and supports open/read/close along with many
> > > other opcodes.
> >
> > Then glibc can implement readfile using io_uring and there is no need
> > for a new single-file readfile syscall.
>
> It could, sure. But there's also a value in having a simple interface
> to accomplish a simple task. Your proposed API added a very complex
> interface to satisfy needs that clearly aren't part of the problem space
> that Greg is looking to address.

I believe that we should look at the single-file readfile syscall from
a performance viewpoint. If an application is expecting to read a
couple of small/medium-size files per second, then neither readfile
nor readfiles makes sense in terms of improving performance. The
benefits start to show up only when an application expects to read at
least a hundred files per second. The "per second" part is important;
it cannot be left out. Because readfile only improves
performance for many-file reads, the syscall that applications
performing many-file reads actually want is the multi-file version,
not the single-file version.

I am not sure I understand why you think that a pointer to an array of
readfile_t structures is very complex. If it was very complex then it
would be a deep tree or a large graph.

2020-07-05 06:33:10

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Jul 4, 2020, at 8:46 PM, Jan Ziak <[email protected]> wrote:
>
> On Sun, Jul 5, 2020 at 4:16 AM Matthew Wilcox <[email protected]> wrote:
>>
>> On Sun, Jul 05, 2020 at 04:06:22AM +0200, Jan Ziak wrote:
>>> Hello
>>>
>>> At first, I thought that the proposed system call is capable of
>>> reading *multiple* small files using a single system call - which
>>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
>>> operations per second) - but that isn't the case and the proposed
>>> system call can read just a single file.
>>>
>>> Without the ability to read multiple small files using a single system
>>> call, it is impossible to increase IOPS (unless an application is
>>> using multiple reader threads or somehow instructs the kernel to
>>> prefetch multiple files into memory).
>>
>> What API would you use for this?
>>
>> ssize_t readfiles(int dfd, char **files, void **bufs, size_t *lens);
>>
>> I pretty much hate this interface, so I hope you have something better
>> in mind.
>
> I am proposing the following:
>
> struct readfile_t {
> int dirfd;
> const char *pathname;
> void *buf;
> size_t count;
> int flags;
> ssize_t retval; // set by kernel
> int reserved; // not used by kernel
> };

If you are going to pass a struct from userspace to the kernel, it
should not mix int and pointer types (which may be 64-bit values),
so that there are no structure packing issues, like:

struct readfile {
        int dirfd;
        int flags;
        const char *pathname;
        void *buf;
        size_t count;
        ssize_t retval;
};

It would be better if "retval" were returned in "count", so that
the structure fits nicely into 32 bytes on a 64-bit system, instead
of being 40 bytes per entry, which adds up over many entries, like:

struct readfile {
        int dirfd;
        int flags;
        const char *pathname;
        void *buf;
        ssize_t count;  /* input: bytes requested, output: bytes read or -errno */
};


However, there is still an issue with passing pointers from userspace,
since they may be 32-bit userspace pointers on a 64-bit kernel.
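
A quick way to check the sizes being discussed on an LP64 system (the
struct names here are made up; the layouts mirror the original proposal
and the two reorderings above):

        #include <stdio.h>
        #include <sys/types.h>

        struct readfile_orig {          /* ints interleaved with pointers */
                int dirfd;
                const char *pathname;
                void *buf;
                size_t count;
                int flags;
                ssize_t retval;
                int reserved;
        };

        struct readfile_reordered {     /* ints grouped up front */
                int dirfd;
                int flags;
                const char *pathname;
                void *buf;
                size_t count;
                ssize_t retval;
        };

        struct readfile_folded {        /* retval folded into count */
                int dirfd;
                int flags;
                const char *pathname;
                void *buf;
                ssize_t count;
        };

        int main(void)
        {
                printf("orig:      %zu\n", sizeof(struct readfile_orig));      /* 56 */
                printf("reordered: %zu\n", sizeof(struct readfile_reordered)); /* 40 */
                printf("folded:    %zu\n", sizeof(struct readfile_folded));    /* 32 */
                return 0;
        }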

> int readfiles(struct readfile_t *requests, size_t count);

It's not clear why count is a "size_t" since it is not a size.
An unsigned int is fine here, since it should never be negative.

> Returns zero if all requests succeeded, otherwise the returned value
> is non-zero (glibc wrapper: -1) and user-space is expected to check
> which requests have succeeded and which have failed. retval in
> readfile_t is set to what the single-file readfile syscall would
> return if it was called with the contents of the corresponding
> readfile_t struct.
>
> The glibc library wrapper of this system call is expected to store the
> errno in the "reserved" field. Thus, a programmer using glibc sees:
>
> struct readfile_t {
> int dirfd;
> const char *pathname;
> void *buf;
> size_t count;
> int flags;
> ssize_t retval; // set by glibc (-1 on error)
> int errno; // set by glibc if retval is -1
> };

Why not just return the errno directly in "retval", or in "count" as
proposed? That avoids further bloating the structure by another field.

> retval and errno in glibc's readfile_t are set to what the single-file
> glibc readfile would return (retval) and set (errno) if it was called
> with the contents of the corresponding readfile_t struct. In case of
> an error, glibc will pick one readfile_t which failed (such as: the
> 1st failed one) and use it to set glibc's errno.


Cheers, Andreas







2020-07-05 06:55:48

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 3/3] selftests: add readfile(2) selftests

On Sat, Jul 04, 2020 at 08:38:26PM +0200, Geert Uytterhoeven wrote:
> Hi Greg,
>
> On Sat, Jul 4, 2020 at 4:05 PM Greg Kroah-Hartman
> <[email protected]> wrote:
> > Test the functionality of readfile(2) in various ways.
> >
> > Also provide a simple speed test program to benchmark using readfile()
> > instead of using open()/read()/close().
> >
> > Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> Any benchmark results to share?

Yes, the readfile_speed.c file will show you that on your machine; I'll
post the results for two of my machines when I send the next version of
this patch series.

>
> > --- /dev/null
> > +++ b/tools/testing/selftests/readfile/readfile.c
>
> > +static void readfile(const char *filename)
> > +{
> > +// int root_fd;
>
> ???

Ugh, sorry about that, I obviously didn't clean up my last tests from
this file, thanks for catching that.

I should add more tests to validate the flag handling as well, will do
all of that for the next version, thanks for noticing this.

greg k-h

2020-07-05 07:26:47

by Jan Ziak

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 5, 2020 at 8:32 AM Andreas Dilger <[email protected]> wrote:
>
> On Jul 4, 2020, at 8:46 PM, Jan Ziak <[email protected]> wrote:
> >
> > On Sun, Jul 5, 2020 at 4:16 AM Matthew Wilcox <[email protected]> wrote:
> >>
> >> On Sun, Jul 05, 2020 at 04:06:22AM +0200, Jan Ziak wrote:
> >>> Hello
> >>>
> >>> At first, I thought that the proposed system call is capable of
> >>> reading *multiple* small files using a single system call - which
> >>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
> >>> operations per second) - but that isn't the case and the proposed
> >>> system call can read just a single file.
> >>>
> >>> Without the ability to read multiple small files using a single system
> >>> call, it is impossible to increase IOPS (unless an application is
> >>> using multiple reader threads or somehow instructs the kernel to
> >>> prefetch multiple files into memory).
> >>
> >> What API would you use for this?
> >>
> >> ssize_t readfiles(int dfd, char **files, void **bufs, size_t *lens);
> >>
> >> I pretty much hate this interface, so I hope you have something better
> >> in mind.
> >
> > I am proposing the following:
> >
> > struct readfile_t {
> > int dirfd;
> > const char *pathname;
> > void *buf;
> > size_t count;
> > int flags;
> > ssize_t retval; // set by kernel
> > int reserved; // not used by kernel
> > };
>
> If you are going to pass a struct from userspace to the kernel, it
> should not mix int and pointer types (which may be 64-bit values,
> so that there are not structure packing issues, like:
>
> struct readfile {
> int dirfd;
> int flags;
> const char *pathname;
> void *buf;
> size_t count;
> ssize_t retval;
> };
>
> It would be better if "retval" was returned in "count", so that
> the structure fits nicely into 32 bytes on a 64-bit system, instead
> of being 40 bytes per entry, which adds up over many entries, like.

I know what you mean and it is a valid point, but in my opinion it
shouldn't (in most cases) be left to the programmer to decide what the
binary layout of a data structure is - instead it should be left to an
optimizing compiler to decide it. Just like code optimization,
determining the physical layout of data structures can be subject to
automatic optimizations as well. It is kind of unfortunate that in
C/C++, and in many other statically compiled languages (even recent
ones), the physical layout of all data structures is determined by the
programmer rather than the compiler. Also, tagging fields as "input",
"output", or both (the default) would be helpful in obtaining smaller
sizes:

struct readfile_t {
        input  int dirfd;
        input  const char *pathname;
        input  void *buf;
        input  size_t count;
        input  int flags;
        output ssize_t retval;  // set by kernel
        output int reserved;    // not used by kernel
};

int readfiles(struct readfile_t *requests, size_t count);

struct readfile_t r[10];
// Write r[i] inputs
int status = readfiles(r, nelem(r));
// Read r[i] outputs

A data-layout optimizing compiler should be able to determine that the
optimal layout of readfile_t is UNION(INPUT: 2*int+2*pointer+1*size_t,
OUTPUT: 1*ssize_t+1*int).

In the unfortunate case of C, which does not optimize data layout, and
given that optimizing readfile_t is just a micro-optimization, it is
better to leave the data structure in a form that programmers can read
efficiently rather than to micro-optimize it and make it confusing.

> struct readfile {
> int dirfd;
> int flags;
> const char *pathname;
> void *buf;
> ssize_t count; /* input: bytes requested, output: bytes read or -errno */
> };
>
>
> However, there is still an issue with passing pointers from userspace,
> since they may be 32-bit userspace pointers on a 64-bit kernel.
>
> > int readfiles(struct readfile_t *requests, size_t count);
>
> It's not clear why count is a "size_t" since it is not a size.
> An unsigned int is fine here, since it should never be negative.

Generally speaking, size_t reflects the size of the address space
while unsigned int doesn't, so it is easier for an unsigned int to
overflow on very large data sets.

> > Returns zero if all requests succeeded, otherwise the returned value
> > is non-zero (glibc wrapper: -1) and user-space is expected to check
> > which requests have succeeded and which have failed. retval in
> > readfile_t is set to what the single-file readfile syscall would
> > return if it was called with the contents of the corresponding
> > readfile_t struct.
> >
> > The glibc library wrapper of this system call is expected to store the
> > errno in the "reserved" field. Thus, a programmer using glibc sees:
> >
> > struct readfile_t {
> > int dirfd;
> > const char *pathname;
> > void *buf;
> > size_t count;
> > int flags;
> > ssize_t retval; // set by glibc (-1 on error)
> > int errno; // set by glibc if retval is -1
> > };
>
> Why not just return the errno directly in "retval", or in "count" as
> proposed? That avoids further bloating the structure by another field.
>
> > retval and errno in glibc's readfile_t are set to what the single-file
> > glibc readfile would return (retval) and set (errno) if it was called
> > with the contents of the corresponding readfile_t struct. In case of
> > an error, glibc will pick one readfile_t which failed (such as: the
> > 1st failed one) and use it to set glibc's errno.
>
>
> Cheers, Andreas

2020-07-05 07:34:59

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 3/3] selftests: add readfile(2) selftests

On Sun, Jul 05, 2020 at 03:41:48AM +0200, Heinrich Schuchardt wrote:
> On 7/4/20 4:02 PM, Greg Kroah-Hartman wrote:
> > Test the functionality of readfile(2) in various ways.
>
> Hello Greg,
>
> I expect readfile() to generate fanotify events FAN_OPEN_PERM, FAN_OPEN,
> FAN_ACCESS_PERM, FAN_ACCESS, FAN_CLOSE_NOWRITE in this sequence.

Yes, it should; I don't think I do anything unique here when it comes to
vfs accesses that would go around those events.

> Looking at patch 1/3 you took care of notifications. Would this deserve
> testing here?

Possibly; do we have other in-tree tests of syscalls that validate that
those events are properly created?

thanks,

greg k-h

2020-07-05 08:16:20

by Vito Caputo

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 05, 2020 at 04:27:32AM +0100, Matthew Wilcox wrote:
> On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
> > On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > You should probably take a look at io_uring. That has the level of
> > > complexity of this proposal and supports open/read/close along with many
> > > other opcodes.
> >
> > Then glibc can implement readfile using io_uring and there is no need
> > for a new single-file readfile syscall.
>
> It could, sure. But there's also a value in having a simple interface
> to accomplish a simple task. Your proposed API added a very complex
> interface to satisfy needs that clearly aren't part of the problem space
> that Greg is looking to address.

I disagree re: "aren't part of the problem space".

Reading small files from procfs was specifically called out in the
rationale for the syscall.

In my experience you're rarely monitoring a single proc file in any
situation where you care about the syscall overhead. You're
monitoring many of them, and any serious effort to do this efficiently
in a repeatedly sampled situation has cached the open fds and already
uses pread() to simply restart from 0 on every sample and not
repeatedly pay for the name lookup.

Basically anything optimally using the existing interfaces for
sampling proc files needs a way to read multiple open file descriptors
in a single syscall to move the needle.

This syscall doesn't provide that. It doesn't really give any
advantage over what we can achieve already. It seems basically
pointless to me, from a monitoring proc files perspective.

Regards,
Vito Caputo
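
For reference, a minimal sketch (not taken from any particular monitoring
tool) of the cached-fd pattern described above: each proc file is opened once,
then re-read from offset 0 with pread() on every sample, so the path lookup is
paid only once. The file list and sample rate are made up for the example.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *files[] = { "/proc/self/stat", "/proc/self/statm" };
        int fds[2];
        char buf[4096];

        /* pay the name lookup once, up front */
        for (int i = 0; i < 2; i++)
                fds[i] = open(files[i], O_RDONLY);

        for (int sample = 0; sample < 10; sample++) {
                for (int i = 0; i < 2; i++) {
                        /* restart from offset 0 each sample; no lseek() needed */
                        ssize_t n = pread(fds[i], buf, sizeof(buf) - 1, 0);
                        if (n > 0) {
                                buf[n] = '\0';
                                printf("%s: %zd bytes\n", files[i], n);
                        }
                }
                usleep(100000);         /* ~10 Hz, just for the example */
        }

        for (int i = 0; i < 2; i++)
                close(fds[i]);
        return 0;
}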

2020-07-05 09:47:49

by Heinrich Schuchardt

[permalink] [raw]
Subject: Re: [PATCH 3/3] selftests: add readfile(2) selftests

On 7/5/20 9:34 AM, Greg Kroah-Hartman wrote:
> On Sun, Jul 05, 2020 at 03:41:48AM +0200, Heinrich Schuchardt wrote:
>> On 7/4/20 4:02 PM, Greg Kroah-Hartman wrote:
>>> Test the functionality of readfile(2) in various ways.
>>
>> Hello Greg,
>>
>> I expect readfile() to generate fanotify events FAN_OPEN_PERM, FAN_OPEN,
>> FAN_ACCESS_PERM, FAN_ACCESS, FAN_CLOSE_NOWRITE in this sequence.
>
> Yes, it should, I don't think I do anything unique here when it comes to
> vfs accesses that would go around those events.
>
>> Looking at patch 1/3 you took care of notifications. Would this deserve
>> testing here?
>
> Possibly, do we have other in-tree tests of syscalls that validate those
> events properly being created?

There is an inotify test in
tools/testing/selftests/cgroup/test_freezer.c

There is no fanotify test in the tree.

A fanotify test would require running with CAP_SYS_ADMIN. The kselftest
documentation does not say that tests should be run as root, so it may
be preferable to test that the inotify events IN_OPEN, IN_ACCESS, and
IN_CLOSE_NOWRITE are generated for readfile().

Example code is included in the inotify.7 and fanotify.7 manpages.

Best regards

Heinrich
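
A minimal sketch of such an inotify-based check, watching a procfs file purely
as an example: readfile() is reached through syscall(2) with a placeholder
__NR_readfile (the real number would come from the series), and error handling
is trimmed.

#include <fcntl.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_readfile
#define __NR_readfile 440       /* placeholder; nothing is assigned in this thread */
#endif

int main(void)
{
        const char *path = "/proc/self/cmdline";
        char data[4096];
        char events[4096] __attribute__((aligned(__alignof__(struct inotify_event))));

        int in = inotify_init1(IN_NONBLOCK);
        inotify_add_watch(in, path, IN_OPEN | IN_ACCESS | IN_CLOSE_NOWRITE);

        /* one readfile() call should queue open, access and close events */
        syscall(__NR_readfile, AT_FDCWD, path, data, sizeof(data), 0);

        ssize_t len = read(in, events, sizeof(events));
        for (char *p = events; len > 0 && p < events + len; ) {
                struct inotify_event *ev = (struct inotify_event *)p;
                printf("inotify mask 0x%x\n", ev->mask);
                p += sizeof(*ev) + ev->len;
        }

        close(in);
        return 0;
}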

2020-07-05 11:27:15

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 3/3] selftests: add readfile(2) selftests

Hi Greg,

On Sun, Jul 5, 2020 at 8:55 AM Greg Kroah-Hartman
<[email protected]> wrote:
> On Sat, Jul 04, 2020 at 08:38:26PM +0200, Geert Uytterhoeven wrote:
> > On Sat, Jul 4, 2020 at 4:05 PM Greg Kroah-Hartman
> > <[email protected]> wrote:
> > > Test the functionality of readfile(2) in various ways.
> > >
> > > Also provide a simple speed test program to benchmark using readfile()
> > > instead of using open()/read()/close().
> > >
> > > Signed-off-by: Greg Kroah-Hartman <[email protected]>

> > > --- /dev/null
> > > +++ b/tools/testing/selftests/readfile/readfile.c
> >
> > > +static void readfile(const char *filename)
> > > +{
> > > +// int root_fd;
> >
> > ???
>
> Ugh, sorry about that, I obviously didn't clean up my last tests from
> this file, thanks for catching that.

Reading about seq_file behavior, did the commented-out test for
"/sys/kernel/debug/usb/devices" work?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-07-05 11:37:32

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 3/3] selftests: add readfile(2) selftests

On Sun, Jul 05, 2020 at 01:24:07PM +0200, Geert Uytterhoeven wrote:
> Hi Greg,
>
> On Sun, Jul 5, 2020 at 8:55 AM Greg Kroah-Hartman
> <[email protected]> wrote:
> > On Sat, Jul 04, 2020 at 08:38:26PM +0200, Geert Uytterhoeven wrote:
> > > On Sat, Jul 4, 2020 at 4:05 PM Greg Kroah-Hartman
> > > <[email protected]> wrote:
> > > > Test the functionality of readfile(2) in various ways.
> > > >
> > > > Also provide a simple speed test program to benchmark using readfile()
> > > > instead of using open()/read()/close().
> > > >
> > > > Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/readfile/readfile.c
> > >
> > > > +static void readfile(const char *filename)
> > > > +{
> > > > +// int root_fd;
> > >
> > > ???
> >
> > Ugh, sorry about that, I obviously didn't clean up my last tests from
> > this file, thanks for catching that.
>
> Reading about seq_file behavior, did the commented-out test for
> "/sys/kernel/debug/usb/devices" work?

Yes, it did, which means I need to go dig up a "problem" seq_file
procfs file to try to debug that behavior...

thanks,

greg k-h

2020-07-05 11:45:22

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 05, 2020 at 01:07:14AM -0700, Vito Caputo wrote:
> On Sun, Jul 05, 2020 at 04:27:32AM +0100, Matthew Wilcox wrote:
> > On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
> > > On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > You should probably take a look at io_uring. That has the level of
> > > > complexity of this proposal and supports open/read/close along with many
> > > > other opcodes.
> > >
> > > Then glibc can implement readfile using io_uring and there is no need
> > > for a new single-file readfile syscall.
> >
> > It could, sure. But there's also a value in having a simple interface
> > to accomplish a simple task. Your proposed API added a very complex
> > interface to satisfy needs that clearly aren't part of the problem space
> > that Greg is looking to address.
>
> I disagree re: "aren't part of the problem space".
>
> Reading small files from procfs was specifically called out in the
> rationale for the syscall.
>
> In my experience you're rarely monitoring a single proc file in any
> situation where you care about the syscall overhead. You're
> monitoring many of them, and any serious effort to do this efficiently
> in a repeatedly sampled situation has cached the open fds and already
> uses pread() to simply restart from 0 on every sample and not
> repeatedly pay for the name lookup.

That's your use case, but many other use cases are just "read a bunch of
sysfs files in one shot". Examples of that are tools that monitor
uevents and lots of hardware-information gathering tools.

Also, not all tools seem to be as smart as you think they are; look at
util-linux for loads of examples of the "open/read/close lots of files"
pattern. I had a half-baked patch to convert it to use readfile, which I
need to polish off and post with the next series to show how this can be
used to both make userspace simpler and use less CPU time.

> Basically anything optimally using the existing interfaces for
> sampling proc files needs a way to read multiple open file descriptors
> in a single syscall to move the needle.

Is psutils using this type of interface, or do they constantly open
different files?

What about fun tools like bashtop:
https://github.com/aristocratos/bashtop.git
which thankfully now relies on python's psutil package to parse proc in
semi-sane ways, but that package does loads of constant open/read/close
of proc files all the time from what I can tell.

And lots of people rely on python's psutil, right?

> This syscall doesn't provide that. It doesn't really give any
> advantage over what we can achieve already. It seems basically
> pointless to me, from a monitoring proc files perspective.

What "good" monitoring programs do you suggest follow the pattern you
recommend?

thanks,

greg k-h

2020-07-05 11:47:32

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sat, Jul 04, 2020 at 08:30:40PM +0100, Al Viro wrote:
> On Sat, Jul 04, 2020 at 04:02:46PM +0200, Greg Kroah-Hartman wrote:
> > Here is a tiny new syscall, readfile, that makes it simpler to read
> > small/medium sized files all in one shot, no need to do open/read/close.
> > This is especially helpful for tools that poke around in procfs or
> > sysfs, making a little bit of a less system load than before, especially
> > as syscall overheads go up over time due to various CPU bugs being
> > addressed.
>
> Nice series, but you are 3 months late with it... Next AFD, perhaps?

Perhaps :)

> Seriously, the rationale is bollocks. If the overhead of 2 extra
> syscalls is anywhere near the costs of the real work being done by
> that thing, we have already lost and the best thing to do is to
> throw the system away and start with saner hardware.

The real work the kernel does is almost negligible compared to the
open/close overhead of the syscalls on some platforms today. I'll post
benchmarks with the next version of this patch series to hopefully show
that. If not, then yeah, this isn't worth it, but it was fun to write.

thanks,

greg k-h

2020-07-05 11:52:43

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 05, 2020 at 04:06:22AM +0200, Jan Ziak wrote:
> Hello
>
> At first, I thought that the proposed system call is capable of
> reading *multiple* small files using a single system call - which
> would help increase HDD/SSD queue utilization and increase IOPS (I/O
> operations per second) - but that isn't the case and the proposed
> system call can read just a single file.

If you want to do this for multiple files, use io_uring; that's what it
was designed for. I think Jens was going to be adding support for the
open/read/close pattern to it as well, after some other more pressing
features/fixes were finished.

> Without the ability to read multiple small files using a single system
> call, it is impossible to increase IOPS (unless an application is
> using multiple reader threads or somehow instructs the kernel to
> prefetch multiple files into memory).

There's not much need (though it is measurable) to prefetch virtual
files into memory first, and virtual files (procfs, sysfs, securityfs,
etc.) are primarily what this syscall is for. If you are dealing with
real disks, then yes, the overhead of the syscall might be in the noise
compared to the I/O path of the data.

> While you are at it, why not also add a readfiles system call to read
> multiple, presumably small, files? The initial unoptimized
> implementation of readfiles syscall can simply call readfile
> sequentially.

Again, that's what io_uring is for.

thanks,

greg k-h

2020-07-05 11:59:20

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 05, 2020 at 06:09:03AM +0200, Jan Ziak wrote:
> On Sun, Jul 5, 2020 at 5:27 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
> > > On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > You should probably take a look at io_uring. That has the level of
> > > > complexity of this proposal and supports open/read/close along with many
> > > > other opcodes.
> > >
> > > Then glibc can implement readfile using io_uring and there is no need
> > > for a new single-file readfile syscall.
> >
> > It could, sure. But there's also a value in having a simple interface
> > to accomplish a simple task. Your proposed API added a very complex
> > interface to satisfy needs that clearly aren't part of the problem space
> > that Greg is looking to address.
>
> I believe that we should look at the single-file readfile syscall from
> a performance viewpoint. If an application is expecting to read a
> couple of small/medium-size files per second, then neither readfile
> nor readfiles makes sense in terms of improving performance. The
> benefits start to show up only in case an application is expecting to
> read at least a hundred of files per second. The "per second" part is
> important, it cannot be left out. Because readfile only improves
> performance for many-file reads, the syscall that applications
> performing many-file reads actually want is the multi-file version,
> not the single-file version.

It is also a measurable improvement even when reading just a single file.
Here's my really, really fast AMD system doing just one call to readfile
vs. one open/read/close call sequence:

$ ./readfile_speed -l 1
Running readfile test on file /sys/devices/system/cpu/vulnerabilities/meltdown for 1 loops...
Took 3410 ns
Running open/read/close test on file /sys/devices/system/cpu/vulnerabilities/meltdown for 1 loops...
Took 3780 ns

370ns isn't all that much, yes, but it is 370ns that could have been
used for something else :)

Look at the overhead of a syscall these days, using something like perf,
to see just how bad things have gotten on Intel-based systems (the
numbers above are from AMD, which doesn't suffer all of the syscall
slowdowns, only some).

I'm going to have to now dig up my old rpi to get the stats on that
thing, as well as some Intel boxes to show the problem I'm trying to
help out with here. I'll post that for the next round of this patch
series.

> I am not sure I understand why you think that a pointer to an array of
> readfile_t structures is very complex. If it was very complex then it
> would be a deep tree or a large graph.

Of course you can make it more complex if you want, but look at the
existing tools that currently do many open/read/close sequences. The
APIs there don't lend themselves very well to knowing the larger list of
files ahead of time. But I could be looking at the wrong thing; what
userspace programs are you thinking of that could easily be converted
to use something like this?

thanks,

greg k-h
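
For anyone wanting to reproduce numbers like the above, here is the rough
shape of the comparison; this is not the readfile_speed.c selftest from the
series, just a sketch, with __NR_readfile as a placeholder and no error
handling. The sysfs path is the one from the output above.

#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef __NR_readfile
#define __NR_readfile 440       /* placeholder; nothing is assigned in this thread */
#endif

static long long now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
        const char *path = "/sys/devices/system/cpu/vulnerabilities/meltdown";
        char buf[4096];
        long long start;

        start = now_ns();
        syscall(__NR_readfile, AT_FDCWD, path, buf, sizeof(buf), 0);
        printf("readfile:        %lld ns\n", now_ns() - start);

        start = now_ns();
        int fd = open(path, O_RDONLY);
        read(fd, buf, sizeof(buf));
        close(fd);
        printf("open/read/close: %lld ns\n", now_ns() - start);

        return 0;
}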

2020-07-05 12:01:06

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 05, 2020 at 09:25:39AM +0200, Jan Ziak wrote:
> On Sun, Jul 5, 2020 at 8:32 AM Andreas Dilger <[email protected]> wrote:
> >
> > On Jul 4, 2020, at 8:46 PM, Jan Ziak <[email protected]> wrote:
> > >
> > > On Sun, Jul 5, 2020 at 4:16 AM Matthew Wilcox <[email protected]> wrote:
> > >>
> > >> On Sun, Jul 05, 2020 at 04:06:22AM +0200, Jan Ziak wrote:
> > >>> Hello
> > >>>
> > >>> At first, I thought that the proposed system call is capable of
> > >>> reading *multiple* small files using a single system call - which
> > >>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
> > >>> operations per second) - but that isn't the case and the proposed
> > >>> system call can read just a single file.
> > >>>
> > >>> Without the ability to read multiple small files using a single system
> > >>> call, it is impossible to increase IOPS (unless an application is
> > >>> using multiple reader threads or somehow instructs the kernel to
> > >>> prefetch multiple files into memory).
> > >>
> > >> What API would you use for this?
> > >>
> > >> ssize_t readfiles(int dfd, char **files, void **bufs, size_t *lens);
> > >>
> > >> I pretty much hate this interface, so I hope you have something better
> > >> in mind.
> > >
> > > I am proposing the following:
> > >
> > > struct readfile_t {
> > > int dirfd;
> > > const char *pathname;
> > > void *buf;
> > > size_t count;
> > > int flags;
> > > ssize_t retval; // set by kernel
> > > int reserved; // not used by kernel
> > > };
> >
> > If you are going to pass a struct from userspace to the kernel, it
> > should not mix int and pointer types (which may be 64-bit values,
> > so that there are not structure packing issues, like:
> >
> > struct readfile {
> > int dirfd;
> > int flags;
> > const char *pathname;
> > void *buf;
> > size_t count;
> > ssize_t retval;
> > };
> >
> > It would be better if "retval" was returned in "count", so that
> > the structure fits nicely into 32 bytes on a 64-bit system, instead
> > of being 40 bytes per entry, which adds up over many entries, like.
>
> I know what you mean and it is a valid point, but in my opinion it
> shouldn't (in most cases) be left to the programmer to decide what the
> binary layout of a data structure is - instead it should be left to an
> optimizing compiler to decide it.

We don't get that luxury when creating user/kernel APIs in C, sorry.

I suggest using the pahole tool if you are interested in seeing the
"best" way a structure can be laid out; it can perform that
optimization for you so that you know how to fix your code.

thanks,

greg k-h
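
As an aside, the size/padding point above is easy to check mechanically; a
minimal sketch follows, assuming a typical 64-bit LP64 ABI (pahole on the
compiled object shows the same holes). The three layouts are the ones proposed
earlier in this thread, not anything merged.

#include <assert.h>
#include <sys/types.h>

/* original proposal: int and pointer fields interleaved */
struct readfile_orig {
        int dirfd;
        const char *pathname;
        void *buf;
        size_t count;
        int flags;
        ssize_t retval;
        int reserved;
};

/* reordered as suggested: the two ints first, pointers together */
struct readfile_reordered {
        int dirfd;
        int flags;
        const char *pathname;
        void *buf;
        size_t count;
        ssize_t retval;
};

/* retval folded into count, as also suggested */
struct readfile_compact {
        int dirfd;
        int flags;
        const char *pathname;
        void *buf;
        ssize_t count;  /* in: bytes requested, out: bytes read or -errno */
};

static_assert(sizeof(struct readfile_orig) == 56, "padding holes");
static_assert(sizeof(struct readfile_reordered) == 40, "no holes, 40 bytes");
static_assert(sizeof(struct readfile_compact) == 32, "fits in 32 bytes");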

2020-07-05 20:36:00

by Vito Caputo

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 05, 2020 at 01:44:54PM +0200, Greg KH wrote:
> On Sun, Jul 05, 2020 at 01:07:14AM -0700, Vito Caputo wrote:
> > On Sun, Jul 05, 2020 at 04:27:32AM +0100, Matthew Wilcox wrote:
> > > On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
> > > > On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <[email protected]> wrote:
> > > > >
> > > > > You should probably take a look at io_uring. That has the level of
> > > > > complexity of this proposal and supports open/read/close along with many
> > > > > other opcodes.
> > > >
> > > > Then glibc can implement readfile using io_uring and there is no need
> > > > for a new single-file readfile syscall.
> > >
> > > It could, sure. But there's also a value in having a simple interface
> > > to accomplish a simple task. Your proposed API added a very complex
> > > interface to satisfy needs that clearly aren't part of the problem space
> > > that Greg is looking to address.
> >
> > I disagree re: "aren't part of the problem space".
> >
> > Reading small files from procfs was specifically called out in the
> > rationale for the syscall.
> >
> > In my experience you're rarely monitoring a single proc file in any
> > situation where you care about the syscall overhead. You're
> > monitoring many of them, and any serious effort to do this efficiently
> > in a repeatedly sampled situation has cached the open fds and already
> > uses pread() to simply restart from 0 on every sample and not
> > repeatedly pay for the name lookup.
>
> That's your use case, but many other use cases are just "read a bunch of
> sysfs files in one shot". Examples of that are tools that monitor
> uevents and lots of hardware-information gathering tools.
>
> Also not all tools sem to be as smart as you think they are, look at
> util-linux for loads of the "open/read/close" lots of files pattern. I
> had a half-baked patch to convert it to use readfile which I need to
> polish off and post with the next series to show how this can be used to
> both make userspace simpler as well as use less cpu time.
>
> > Basically anything optimally using the existing interfaces for
> > sampling proc files needs a way to read multiple open file descriptors
> > in a single syscall to move the needle.
>
> Is psutils using this type of interface, or do they constantly open
> different files?
>

When I last checked, psutils was not an optimal example, nor did I
suggest it was.

> What about fun tools like bashtop:
> https://github.com/aristocratos/bashtop.git
> which thankfully now relies on python's psutil package to parse proc in
> semi-sane ways, but that package does loads of constant open/read/close
> of proc files all the time from what I can tell.
>
> And lots of people rely on python's psutil, right?

If python's psutil is constantly reopening the same files in /proc,
this is an argument to go improve python's psutil, especially if it's
popular.

Your proposed syscall doesn't magically make everything suboptimally
sampling proc more efficient. It still requires going out and
modifying everything to use the new syscall.

In order to actually realize a gain comparable to what can be done
using the existing interfaces, code that wasn't already reusing the
open fd would still require a refactor to do so with your syscall, to
eliminate the directory lookup on every sample.

At the end of the day, if you did all this work, you'd have code that
only works on kernels with the new syscall, didn't enjoy a significant
performance gain over what could have been achieved using the existing
interfaces, and still required basically the same amount of work as
optimizing for the existing interfaces would have. For what gain?

>
> > This syscall doesn't provide that. It doesn't really give any
> > advantage over what we can achieve already. It seems basically
> > pointless to me, from a monitoring proc files perspective.
>
> What "good" monitoring programs do you suggest follow the pattern you
> recommend?
>

"Good" is not generally a word I'd use to describe software, surely
that's not me you're quoting... but I assume you mean "optimal".

I'm sure sysprof is at least reusing open files when sampling proc,
because we discussed the issue when Christian took over maintenance.

It appears he's currently using the lseek()->read() sequence:

https://gitlab.gnome.org/GNOME/sysprof/-/blob/master/src/libsysprof/sysprof-netdev-source.c#L223
https://gitlab.gnome.org/GNOME/sysprof/-/blob/master/src/libsysprof/sysprof-memory-source.c#L210
https://gitlab.gnome.org/GNOME/sysprof/-/blob/master/src/libsysprof/sysprof-diskstat-source.c#L185

It'd be more efficient to just use pread() and lose the lseek(), at
which point it'd be just a single pread() call per sample per proc
file. Nothing your proposed syscall would improve upon, not that it'd
be eligible for software that wants to work on existing kernels from
distros like Debian and Centos/RHEL anyways.

If this were a conversation about providing something like a better
scatter-gather interface akin to p{read,write}v but with the fd in the
iovec, then we'd be talking about something very lucrative for proc
sampling. But like you've said elsewhere in this thread, io_uring()
may suffice as an alternative solution in that vein.

My personal interest in this topic stems from an experimental window
manager I made, and still use, which monitors every descendant process
for the X session at frequencies up to 60HZ. The code opens a bunch
of proc files for every process, and keeps them open until the process
goes away or falls out of scope. See the attachment for some idea of
what /proc/$(pidof wm)/fd looks like. All those proc files are read
at up to 60HZ continuously.

All top-like tools are really no different, and already shouldn't be
reopening things on every sample. They should be fixed if not - with
or without your syscall, it's equal effort, but the existing
interfaces... exist.

Regards,
Vito Caputo


Attachments:
vwm-fds.txt (18.52 kB)

2020-07-06 06:09:34

by Jan Ziak

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sun, Jul 5, 2020 at 1:58 PM Greg KH <[email protected]> wrote:
>
> On Sun, Jul 05, 2020 at 06:09:03AM +0200, Jan Ziak wrote:
> > On Sun, Jul 5, 2020 at 5:27 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
> > > > On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <[email protected]> wrote:
> > > > >
> > > > > You should probably take a look at io_uring. That has the level of
> > > > > complexity of this proposal and supports open/read/close along with many
> > > > > other opcodes.
> > > >
> > > > Then glibc can implement readfile using io_uring and there is no need
> > > > for a new single-file readfile syscall.
> > >
> > > It could, sure. But there's also a value in having a simple interface
> > > to accomplish a simple task. Your proposed API added a very complex
> > > interface to satisfy needs that clearly aren't part of the problem space
> > > that Greg is looking to address.
> >
> > I believe that we should look at the single-file readfile syscall from
> > a performance viewpoint. If an application is expecting to read a
> > couple of small/medium-size files per second, then neither readfile
> > nor readfiles makes sense in terms of improving performance. The
> > benefits start to show up only in case an application is expecting to
> > read at least a hundred of files per second. The "per second" part is
> > important, it cannot be left out. Because readfile only improves
> > performance for many-file reads, the syscall that applications
> > performing many-file reads actually want is the multi-file version,
> > not the single-file version.
>
> It also is a measurable increase over reading just a single file.
> Here's my really really fast AMD system doing just one call to readfile
> vs. one call sequence to open/read/close:
>
> $ ./readfile_speed -l 1
> Running readfile test on file /sys/devices/system/cpu/vulnerabilities/meltdown for 1 loops...
> Took 3410 ns
> Running open/read/close test on file /sys/devices/system/cpu/vulnerabilities/meltdown for 1 loops...
> Took 3780 ns
>
> 370ns isn't all that much, yes, but it is 370ns that could have been
> used for something else :)

I am curious as to how you amortized or accounted for the fact that
readfile() first needs to open the dirfd and then close it later.

From a performance viewpoint, only code where readfile() is called
multiple times from within a loop makes sense:

dirfd = open();
for(...) {
        readfile(dirfd, ...);
}
close(dirfd);

> Look at the overhead these days of a syscall using something like perf
> to see just how bad things have gotten on Intel-based systems (above was
> AMD which doesn't suffer all the syscall slowdowns, only some).
>
> I'm going to have to now dig up my old rpi to get the stats on that
> thing, as well as some Intel boxes to show the problem I'm trying to
> help out with here. I'll post that for the next round of this patch
> series.
>
> > I am not sure I understand why you think that a pointer to an array of
> > readfile_t structures is very complex. If it was very complex then it
> > would be a deep tree or a large graph.
>
> Of course you can make it more complex if you want, but look at the
> existing tools that currently do many open/read/close sequences. The
> apis there don't lend themselves very well to knowing the larger list of
> files ahead of time. But I could be looking at the wrong thing, what
> userspace programs are you thinking of that could be easily converted
> into using something like this?

Perhaps passing multiple filenames to tools via the command line is a
valid and quite general use case where it is known ahead of time that
multiple files are going to be read, such as "gcc *.o", which is
commonly used to link shared libraries and executables. Although, in
the case of "gcc *.o", some of the object files are likely to be cached
in memory and thus unlikely to need to be fetched from HDD/SSD, so the
valid use case where we could see a speedup (if gcc were to use the
multi-file readfiles() syscall) is when the programmer/Makefile invokes
"gcc *.o" after rebuilding a small subset of the object files while the
object files which did not have to be rebuilt are still stored on
HDD/SSD; basically this means the first use of a project's Makefile on
a particular day.

2020-07-06 11:11:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Mon, Jul 06, 2020 at 08:07:46AM +0200, Jan Ziak wrote:
> On Sun, Jul 5, 2020 at 1:58 PM Greg KH <[email protected]> wrote:
> > It also is a measurable increase over reading just a single file.
> > Here's my really really fast AMD system doing just one call to readfile
> > vs. one call sequence to open/read/close:
> >
> > $ ./readfile_speed -l 1
> > Running readfile test on file /sys/devices/system/cpu/vulnerabilities/meltdown for 1 loops...
> > Took 3410 ns
> > Running open/read/close test on file /sys/devices/system/cpu/vulnerabilities/meltdown for 1 loops...
> > Took 3780 ns
> >
> > 370ns isn't all that much, yes, but it is 370ns that could have been
> > used for something else :)
>
> I am curious as to how you amortized or accounted for the fact that
> readfile() first needs to open the dirfd and then close it later.
>
> >From performance viewpoint, only codes where readfile() is called
> multiple times from within a loop make sense:
>
> dirfd = open();
> for(...) {
> readfile(dirfd, ...);
> }
> close(dirfd);

dirfd can be AT_FDCWD or if the path is absolute, dirfd will be ignored,
so one does not have to open anything. It would be an optimisation
if one wanted to read several files relating to the same process:

char dir[50];
sprintf(dir, "/proc/%d", pid);
int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
readfile(dirfd, "maps", ...);
readfile(dirfd, "stack", ...);
readfile(dirfd, "comm", ...);
readfile(dirfd, "environ", ...);
close(dirfd);

but one would not have to do that.
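
A self-contained version of both variants, for reference: the readfile()
argument order (dirfd, pathname, buf, count, flags) is the one discussed in
this thread, reached through syscall(2) with a placeholder number, and error
handling is trimmed.

#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_readfile
#define __NR_readfile 440       /* placeholder; nothing is assigned in this thread */
#endif

static ssize_t readfile(int dirfd, const char *path, void *buf,
                        size_t count, int flags)
{
        return syscall(__NR_readfile, dirfd, path, buf, count, flags);
}

int main(void)
{
        char buf[4096];
        char dir[50];

        /* absolute path: dirfd is ignored, so AT_FDCWD is fine */
        ssize_t n = readfile(AT_FDCWD, "/proc/version", buf, sizeof(buf), 0);
        printf("version: %zd bytes\n", n);

        /* several files relative to one process directory */
        sprintf(dir, "/proc/%d", (int)getpid());
        int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
        n = readfile(dirfd, "comm", buf, sizeof(buf), 0);
        printf("comm: %zd bytes\n", n);
        close(dirfd);

        return 0;
}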

2020-07-06 11:19:32

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Mon, Jul 06, 2020 at 08:07:46AM +0200, Jan Ziak wrote:
> On Sun, Jul 5, 2020 at 1:58 PM Greg KH <[email protected]> wrote:
> >
> > On Sun, Jul 05, 2020 at 06:09:03AM +0200, Jan Ziak wrote:
> > > On Sun, Jul 5, 2020 at 5:27 AM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
> > > > > On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <[email protected]> wrote:
> > > > > >
> > > > > > You should probably take a look at io_uring. That has the level of
> > > > > > complexity of this proposal and supports open/read/close along with many
> > > > > > other opcodes.
> > > > >
> > > > > Then glibc can implement readfile using io_uring and there is no need
> > > > > for a new single-file readfile syscall.
> > > >
> > > > It could, sure. But there's also a value in having a simple interface
> > > > to accomplish a simple task. Your proposed API added a very complex
> > > > interface to satisfy needs that clearly aren't part of the problem space
> > > > that Greg is looking to address.
> > >
> > > I believe that we should look at the single-file readfile syscall from
> > > a performance viewpoint. If an application is expecting to read a
> > > couple of small/medium-size files per second, then neither readfile
> > > nor readfiles makes sense in terms of improving performance. The
> > > benefits start to show up only in case an application is expecting to
> > > read at least a hundred of files per second. The "per second" part is
> > > important, it cannot be left out. Because readfile only improves
> > > performance for many-file reads, the syscall that applications
> > > performing many-file reads actually want is the multi-file version,
> > > not the single-file version.
> >
> > It also is a measurable increase over reading just a single file.
> > Here's my really really fast AMD system doing just one call to readfile
> > vs. one call sequence to open/read/close:
> >
> > $ ./readfile_speed -l 1
> > Running readfile test on file /sys/devices/system/cpu/vulnerabilities/meltdown for 1 loops...
> > Took 3410 ns
> > Running open/read/close test on file /sys/devices/system/cpu/vulnerabilities/meltdown for 1 loops...
> > Took 3780 ns
> >
> > 370ns isn't all that much, yes, but it is 370ns that could have been
> > used for something else :)
>
> I am curious as to how you amortized or accounted for the fact that
> readfile() first needs to open the dirfd and then close it later.

I do not open a dirfd; look at the benchmark code in the patch, it's all
right there.

I can make it simpler, and will do that for the next round, as I want to
make it really obvious for people to test on their hardware.

> >From performance viewpoint, only codes where readfile() is called
> multiple times from within a loop make sense:
>
> dirfd = open();
> for(...) {
> readfile(dirfd, ...);
> }
> close(dirfd);

No need to open a dirfd at all; my benchmarks did not do that. Just pass
in an absolute path if you don't want to. But if you do want to, because
you want to read a bunch of files, you can, and it will be faster than
reading a number of individual files without it :)

thanks,

greg k-h

2020-07-06 17:26:17

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Sat, Jul 04, 2020 at 04:02:46PM +0200, Greg Kroah-Hartman wrote:
> Here is a tiny new syscall, readfile, that makes it simpler to read
> small/medium sized files all in one shot, no need to do open/read/close.
> This is especially helpful for tools that poke around in procfs or
> sysfs, making a little bit of a less system load than before, especially
> as syscall overheads go up over time due to various CPU bugs being
> addressed.
>
> There are 4 patches in this series, the first 3 are against the kernel
> tree, adding the syscall logic, wiring up the syscall, and adding some
> tests for it.
>
> The last patch is agains the man-pages project, adding a tiny man page
> to try to describe the new syscall.

General question, using this series as an illustration only:


At the risk of starting a flamewar, why is this needed? Is there a
realistic use case that would get significant benefit from this?

A lot of syscalls seem to get added that combine or refactor the
functionality of existing syscalls without justifying why this is
needed (or even wise). This case feels like a solution, not a
primitive, so I wonder if the long-term ABI fragmentation is worth the
benefit.

I ask because I'd like to get an idea of the policy on what is and is
not considered a frivolous ABI extension.

(I'm sure a use case must be in mind, but it isn't mentioned here.
Certainly the time it takes top to dump the contents of /proc leaves
something to be desired.)

Cheers
---Dave

2020-07-14 06:52:15

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

Hi!

> > At first, I thought that the proposed system call is capable of
> > reading *multiple* small files using a single system call - which
> > would help increase HDD/SSD queue utilization and increase IOPS (I/O
> > operations per second) - but that isn't the case and the proposed
> > system call can read just a single file.
>
> If you want to do this for multple files, use io_ring, that's what it
> was designed for. I think Jens was going to be adding support for the
> open/read/close pattern to it as well, after some other more pressing
> features/fixes were finished.

What about... just using io_uring for a single file, too? I'm pretty
sure it can be wrapped in a library that is simple to use, avoiding the
need for a new syscall.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
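
For what it's worth, a rough sketch of what such a wrapper could look like
today with liburing: openat, read and close are submitted one at a time and
each completion is waited for (a single linked submission that passes the
opened fd straight to the read is what the open/read/close work mentioned
above would enable). Kernel 5.6 or newer is assumed for the OPENAT/READ/CLOSE
opcodes; error handling is trimmed.

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <sys/types.h>

static ssize_t uring_readfile(struct io_uring *ring, const char *path,
                              void *buf, unsigned int len)
{
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        ssize_t n;
        int fd;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        fd = cqe->res;                          /* fd, or -errno */
        io_uring_cqe_seen(ring, cqe);
        if (fd < 0)
                return fd;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, fd, buf, len, 0);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        n = cqe->res;                           /* bytes read, or -errno */
        io_uring_cqe_seen(ring, cqe);

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_close(sqe, fd);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);

        return n;
}

int main(void)
{
        struct io_uring ring;
        char buf[4096];

        io_uring_queue_init(8, &ring, 0);
        printf("read %zd bytes\n",
               uring_readfile(&ring, "/proc/version", buf, sizeof(buf)));
        io_uring_queue_exit(&ring);
        return 0;
}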



2020-07-14 08:08:13

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Tue, Jul 14, 2020 at 8:51 AM Pavel Machek <[email protected]> wrote:
>
> Hi!
>
> > > At first, I thought that the proposed system call is capable of
> > > reading *multiple* small files using a single system call - which
> > > would help increase HDD/SSD queue utilization and increase IOPS (I/O
> > > operations per second) - but that isn't the case and the proposed
> > > system call can read just a single file.
> >
> > If you want to do this for multple files, use io_ring, that's what it
> > was designed for. I think Jens was going to be adding support for the
> > open/read/close pattern to it as well, after some other more pressing
> > features/fixes were finished.
>
> What about... just using io_uring for single file, too? I'm pretty
> sure it can be wrapped in a library that is simple to use, avoiding
> need for new syscall.

Just wondering: is there a plan to add strace support to io_uring?
And I don't just mean the syscalls associated with io_uring, but
tracing the ring itself.

I think that's quite important as io_uring becomes mainstream.

Thanks,
Miklos

2020-07-14 11:36:57

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On 14/07/2020 11:07, Miklos Szeredi wrote:
> On Tue, Jul 14, 2020 at 8:51 AM Pavel Machek <[email protected]> wrote:
>>
>> Hi!
>>
>>>> At first, I thought that the proposed system call is capable of
>>>> reading *multiple* small files using a single system call - which
>>>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
>>>> operations per second) - but that isn't the case and the proposed
>>>> system call can read just a single file.
>>>
>>> If you want to do this for multple files, use io_ring, that's what it
>>> was designed for. I think Jens was going to be adding support for the
>>> open/read/close pattern to it as well, after some other more pressing
>>> features/fixes were finished.
>>
>> What about... just using io_uring for single file, too? I'm pretty
>> sure it can be wrapped in a library that is simple to use, avoiding
>> need for new syscall.
>
> Just wondering: is there a plan to add strace support to io_uring?
> And I don't just mean the syscalls associated with io_uring, but
> tracing the ring itself.

What kind of support do you mean? io_uring is asynchronous in nature,
with all the tracing/debugging/etc. problems intrinsic to such APIs.
And there are a lot of handy trace points; are those not enough?

Though, this can be an interesting project to rethink how async
APIs are worked with.

>
> I think that's quite important as io_uring becomes mainstream.

--
Pavel Begunkov

2020-07-14 11:58:49

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Tue, Jul 14, 2020 at 1:36 PM Pavel Begunkov <[email protected]> wrote:
>
> On 14/07/2020 11:07, Miklos Szeredi wrote:
> > On Tue, Jul 14, 2020 at 8:51 AM Pavel Machek <[email protected]> wrote:
> >>
> >> Hi!
> >>
> >>>> At first, I thought that the proposed system call is capable of
> >>>> reading *multiple* small files using a single system call - which
> >>>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
> >>>> operations per second) - but that isn't the case and the proposed
> >>>> system call can read just a single file.
> >>>
> >>> If you want to do this for multple files, use io_ring, that's what it
> >>> was designed for. I think Jens was going to be adding support for the
> >>> open/read/close pattern to it as well, after some other more pressing
> >>> features/fixes were finished.
> >>
> >> What about... just using io_uring for single file, too? I'm pretty
> >> sure it can be wrapped in a library that is simple to use, avoiding
> >> need for new syscall.
> >
> > Just wondering: is there a plan to add strace support to io_uring?
> > And I don't just mean the syscalls associated with io_uring, but
> > tracing the ring itself.
>
> What kind of support do you mean? io_uring is asynchronous in nature
> with all intrinsic tracing/debugging/etc. problems of such APIs.
> And there are a lot of handy trace points, are those not enough?
>
> Though, this can be an interesting project to rethink how async
> APIs are worked with.

Yeah, it's an interesting problem. The uring has the same events, as
far as I understand, that are recorded in a multithreaded strace
output (syscall entry, syscall exit); nothing more is needed.

I do think this needs to be integrated into strace(1), otherwise the
usefulness of that tool (which I think is *very* high) would go down
drastically as io_uring usage goes up.

Thanks,
Miklos

2020-07-15 09:07:18

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On 14/07/2020 14:55, Miklos Szeredi wrote:
> On Tue, Jul 14, 2020 at 1:36 PM Pavel Begunkov <[email protected]> wrote:
>>
>> On 14/07/2020 11:07, Miklos Szeredi wrote:
>>> On Tue, Jul 14, 2020 at 8:51 AM Pavel Machek <[email protected]> wrote:
>>>>
>>>> Hi!
>>>>
>>>>>> At first, I thought that the proposed system call is capable of
>>>>>> reading *multiple* small files using a single system call - which
>>>>>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
>>>>>> operations per second) - but that isn't the case and the proposed
>>>>>> system call can read just a single file.
>>>>>
>>>>> If you want to do this for multple files, use io_ring, that's what it
>>>>> was designed for. I think Jens was going to be adding support for the
>>>>> open/read/close pattern to it as well, after some other more pressing
>>>>> features/fixes were finished.
>>>>
>>>> What about... just using io_uring for single file, too? I'm pretty
>>>> sure it can be wrapped in a library that is simple to use, avoiding
>>>> need for new syscall.
>>>
>>> Just wondering: is there a plan to add strace support to io_uring?
>>> And I don't just mean the syscalls associated with io_uring, but
>>> tracing the ring itself.
>>
>> What kind of support do you mean? io_uring is asynchronous in nature
>> with all intrinsic tracing/debugging/etc. problems of such APIs.
>> And there are a lot of handy trace points, are those not enough?
>>
>> Though, this can be an interesting project to rethink how async
>> APIs are worked with.
>
> Yeah, it's an interesting problem. The uring has the same events, as
> far as I understand, that are recorded in a multithreaded strace
> output (syscall entry, syscall exit); nothing more is needed.
>
> I do think this needs to be integrated into strace(1), otherwise the
> usefulness of that tool (which I think is *very* high) would go down
> drastically as io_uring usage goes up.

Not touching the topic of the usefulness of strace + io_uring, but I'd
rather have a tool that solves a problem than a problem created and
honed for a tool.

--
Pavel Begunkov

2020-07-15 09:13:04

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Wed, Jul 15, 2020 at 10:33 AM Pavel Begunkov <[email protected]> wrote:
>
> On 14/07/2020 14:55, Miklos Szeredi wrote:
> > On Tue, Jul 14, 2020 at 1:36 PM Pavel Begunkov <[email protected]> wrote:
> >>
> >> On 14/07/2020 11:07, Miklos Szeredi wrote:
> >>> On Tue, Jul 14, 2020 at 8:51 AM Pavel Machek <[email protected]> wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>>>> At first, I thought that the proposed system call is capable of
> >>>>>> reading *multiple* small files using a single system call - which
> >>>>>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
> >>>>>> operations per second) - but that isn't the case and the proposed
> >>>>>> system call can read just a single file.
> >>>>>
> >>>>> If you want to do this for multple files, use io_ring, that's what it
> >>>>> was designed for. I think Jens was going to be adding support for the
> >>>>> open/read/close pattern to it as well, after some other more pressing
> >>>>> features/fixes were finished.
> >>>>
> >>>> What about... just using io_uring for single file, too? I'm pretty
> >>>> sure it can be wrapped in a library that is simple to use, avoiding
> >>>> need for new syscall.
> >>>
> >>> Just wondering: is there a plan to add strace support to io_uring?
> >>> And I don't just mean the syscalls associated with io_uring, but
> >>> tracing the ring itself.
> >>
> >> What kind of support do you mean? io_uring is asynchronous in nature
> >> with all intrinsic tracing/debugging/etc. problems of such APIs.
> >> And there are a lot of handy trace points, are those not enough?
> >>
> >> Though, this can be an interesting project to rethink how async
> >> APIs are worked with.
> >
> > Yeah, it's an interesting problem. The uring has the same events, as
> > far as I understand, that are recorded in a multithreaded strace
> > output (syscall entry, syscall exit); nothing more is needed.
> >
> > I do think this needs to be integrated into strace(1), otherwise the
> > usefulness of that tool (which I think is *very* high) would go down
> > drastically as io_uring usage goes up.
>
> Not touching the topic of usefulness of strace + io_uring, but I'd rather
> have a tool that solves a problem, than a problem that created and honed
> for a tool.

Sorry, I'm not getting the metaphor. Can you please elaborate?

Thanks,
Miklos

2020-07-15 09:17:55

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On 15/07/2020 11:41, Miklos Szeredi wrote:
> On Wed, Jul 15, 2020 at 10:33 AM Pavel Begunkov <[email protected]> wrote:
>>
>> On 14/07/2020 14:55, Miklos Szeredi wrote:
>>> On Tue, Jul 14, 2020 at 1:36 PM Pavel Begunkov <[email protected]> wrote:
>>>>
>>>> On 14/07/2020 11:07, Miklos Szeredi wrote:
>>>>> On Tue, Jul 14, 2020 at 8:51 AM Pavel Machek <[email protected]> wrote:
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>>>> At first, I thought that the proposed system call is capable of
>>>>>>>> reading *multiple* small files using a single system call - which
>>>>>>>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
>>>>>>>> operations per second) - but that isn't the case and the proposed
>>>>>>>> system call can read just a single file.
>>>>>>>
>>>>>>> If you want to do this for multple files, use io_ring, that's what it
>>>>>>> was designed for. I think Jens was going to be adding support for the
>>>>>>> open/read/close pattern to it as well, after some other more pressing
>>>>>>> features/fixes were finished.
>>>>>>
>>>>>> What about... just using io_uring for single file, too? I'm pretty
>>>>>> sure it can be wrapped in a library that is simple to use, avoiding
>>>>>> need for new syscall.
>>>>>
>>>>> Just wondering: is there a plan to add strace support to io_uring?
>>>>> And I don't just mean the syscalls associated with io_uring, but
>>>>> tracing the ring itself.
>>>>
>>>> What kind of support do you mean? io_uring is asynchronous in nature
>>>> with all intrinsic tracing/debugging/etc. problems of such APIs.
>>>> And there are a lot of handy trace points, are those not enough?
>>>>
>>>> Though, this can be an interesting project to rethink how async
>>>> APIs are worked with.
>>>
>>> Yeah, it's an interesting problem. The uring has the same events, as
>>> far as I understand, that are recorded in a multithreaded strace
>>> output (syscall entry, syscall exit); nothing more is needed.
>>>
>>> I do think this needs to be integrated into strace(1), otherwise the
>>> usefulness of that tool (which I think is *very* high) would go down
>>> drastically as io_uring usage goes up.
>>
>> Not touching the topic of usefulness of strace + io_uring, but I'd rather
>> have a tool that solves a problem, than a problem that created and honed
>> for a tool.
>
> Sorry, I'm not getting the metaphor. Can you please elaborate?

Sure, I mean _if_ there are tools that conceptually suit better, I'd
prefer to work with them, rather than trying to shove a new and possibly
alien infrastructure into strace.

But my knowledge of strace is very limited, so I can't tell whether
that's the case. E.g., can it utilise static trace points?

--
Pavel Begunkov

2020-07-15 09:25:24

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On 15/07/2020 11:49, Pavel Begunkov wrote:
> On 15/07/2020 11:41, Miklos Szeredi wrote:
>> On Wed, Jul 15, 2020 at 10:33 AM Pavel Begunkov <[email protected]> wrote:
>>>
>>> On 14/07/2020 14:55, Miklos Szeredi wrote:
>>>> On Tue, Jul 14, 2020 at 1:36 PM Pavel Begunkov <[email protected]> wrote:
>>>>>
>>>>> On 14/07/2020 11:07, Miklos Szeredi wrote:
>>>>>> On Tue, Jul 14, 2020 at 8:51 AM Pavel Machek <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>>>> At first, I thought that the proposed system call is capable of
>>>>>>>>> reading *multiple* small files using a single system call - which
>>>>>>>>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
>>>>>>>>> operations per second) - but that isn't the case and the proposed
>>>>>>>>> system call can read just a single file.
>>>>>>>>
>>>>>>>> If you want to do this for multple files, use io_ring, that's what it
>>>>>>>> was designed for. I think Jens was going to be adding support for the
>>>>>>>> open/read/close pattern to it as well, after some other more pressing
>>>>>>>> features/fixes were finished.
>>>>>>>
>>>>>>> What about... just using io_uring for single file, too? I'm pretty
>>>>>>> sure it can be wrapped in a library that is simple to use, avoiding
>>>>>>> need for new syscall.
>>>>>>
>>>>>> Just wondering: is there a plan to add strace support to io_uring?
>>>>>> And I don't just mean the syscalls associated with io_uring, but
>>>>>> tracing the ring itself.
>>>>>
>>>>> What kind of support do you mean? io_uring is asynchronous in nature
>>>>> with all intrinsic tracing/debugging/etc. problems of such APIs.
>>>>> And there are a lot of handy trace points, are those not enough?
>>>>>
>>>>> Though, this can be an interesting project to rethink how async
>>>>> APIs are worked with.
>>>>
>>>> Yeah, it's an interesting problem. The uring has the same events, as
>>>> far as I understand, that are recorded in a multithreaded strace
>>>> output (syscall entry, syscall exit); nothing more is needed.
>>>>
>>>> I do think this needs to be integrated into strace(1), otherwise the
>>>> usefulness of that tool (which I think is *very* high) would go down
>>>> drastically as io_uring usage goes up.
>>>
>>> Not touching the topic of usefulness of strace + io_uring, but I'd rather
>>> have a tool that solves a problem, than a problem that created and honed
>>> for a tool.
>>
>> Sorry, I'm not getting the metaphor. Can you please elaborate?
>
> Sure, I mean _if_ there are tools that conceptually suit better, I'd
> prefer to work with them, then trying to shove a new and possibly alien
> infrastructure into strace.
>
> But my knowledge of strace is very limited, so can't tell whether that's
> the case. E.g. can it utilise static trace points?

I think, if you're going to push this idea, we should start a new thread
CC'ing strace devs.

--
Pavel Begunkov

2020-07-15 11:57:04

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

On Wed, Jul 15, 2020 at 11:02 AM Pavel Begunkov <[email protected]> wrote:

> I think, if you're going to push this idea, we should start a new thread
> CC'ing strace devs.

Makes sense. I've pruned the Cc list, so here's the link for reference:

https://lore.kernel.org/linux-fsdevel/CAJfpegu3EwbBFTSJiPhm7eMyTK2MzijLUp1gcboOo3meMF_+Qg@mail.gmail.com/T/#u

Thanks,
Miklos